Intro to the fediverse Wow, it turns out to be 10 years since I wrote this beginners' guide to Twitter. Things have moved on a loooooong way since then. Far from being the interesting, disruptive technology it was back then, Twitter has become part of the mainstream, the establishment. Almost everyone and everything is on Twitter now, which has both pros and cons. So what's the problem? It's now possible to follow all sorts of useful information feeds, from live updates on transport delays to your favourite sports team's play-by-play performance to an almost infinite number of cat pictures. In my professional life it's almost guaranteed that anyone I meet will be on Twitter, meaning that I can contact them to follow up at a later date without having to exchange contact details (and they have options to block me if they don't like that). On the other hand, a medium where everyone's opinion is equally valid regardless of knowledge or life experience has turned some parts of the internet into a toxic swamp of hatred and vitriol. It's easier than ever to forget that we have more common ground with any random stranger than we have differences, and that's led to some truly awful acts and a poisonous political arena. Part of the problem here is that each of the social media platforms is controlled by a single entity with almost no accountability to anyone other than shareholders. Technological change has been so rapid that the regulatory regime has no idea how to handle them, leaving them largely free to operate how they want. This has led to a whole heap of nasty consequences that many other people have done a much better job of documenting than I could (Shoshana Zuboff's book The Age of Surveillance Capitalism is a good example). What I'm going to focus on instead are some possible alternatives. If you accept the above argument, one obvious solution is to break up the effective monopoly enjoyed by Facebook, Twitter et al. We need to be able to retain the wonderful affordances of social media but democratise control of it, so that it can never be dominated by a small number of overly powerful players. What's the solution? There's actually a thing that already exists, that almost everyone is familiar with and that already works like this. It's email. There are a hundred thousand email servers, but my email can always find your inbox if I know your address, because that address identifies both you and the email service you use, and they communicate using the same protocol, Simple Mail Transfer Protocol (SMTP)1. I can't send a message to your Twitter from my Facebook though, because they're completely incompatible, like oil and water. Facebook has no idea how to talk to Twitter and vice versa (and the companies that control them have zero interest in such interoperability anyway). Just like email, a federated social media service like Mastodon allows you to use any compatible server, or even run your own, and follow accounts on your home server or anywhere else, even servers running different software, as long as they use the same ActivityPub protocol. There's no lock-in because you can move to another server any time you like, and interact with all the same people from your new home, just like changing your email address. Smaller servers mean that no one server ends up with enough power to take over and control everything, as the social media giants do with their own platforms.
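To make the email analogy a bit more concrete, here's a minimal sketch (in Python, using the requests library) of the WebFinger lookup that fediverse servers use to turn an address like @user@example.social into something they can actually talk to. The account and server names here are invented for illustration; the lookup mechanism itself is the standard one.

```python
import requests

def resolve_fediverse_account(handle: str) -> str:
    """Look up a fediverse handle (user@host) via WebFinger and
    return the URL of its ActivityPub actor document."""
    user, host = handle.lstrip("@").split("@", 1)
    resp = requests.get(
        f"https://{host}/.well-known/webfinger",
        params={"resource": f"acct:{user}@{host}"},
        timeout=10,
    )
    resp.raise_for_status()
    # The WebFinger response lists links; the ActivityPub actor is the
    # one with rel="self" and an activity+json media type.
    for link in resp.json().get("links", []):
        if link.get("rel") == "self" and "activity+json" in link.get("type", ""):
            return link["href"]
    raise ValueError(f"No ActivityPub actor found for {handle}")

# Hypothetical example:
# print(resolve_fediverse_account("@alice@example.social"))
```

Any server can do this for any other server, which is exactly the property email has and the walled gardens don't.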
But at the same time, a small server with a small moderator team can enforce local policy much more easily and block accounts or whole servers that host trolls, nazis or other poisonous people. How do I try it? I have no problem with anyone choosing to continue to use what we're already calling "traditional" social media; frankly, Facebook and Twitter are still useful for me to keep in touch with a lot of my friends. However, I do think it's useful to know some of the alternatives, if only to make a more informed decision to stick with your current choices. Most of these services only ask for an email address when you sign up, and use of your real name vs a pseudonym is entirely optional, so there's not really any risk in signing up and giving one a try. That said, make sure you take sensible precautions like not reusing a password from another account.

| Instead of… | Try… |
| --- | --- |
| Twitter, Facebook | Mastodon, Pleroma, Misskey |
| Slack, Discord, IRC | Matrix |
| WhatsApp, FB Messenger, Telegram | Also Matrix |
| Instagram, Flickr | PixelFed |
| YouTube | PeerTube |
| The web | Interplanetary File System (IPFS) |

1. Which, if you can believe it, was formalised nearly 40 years ago in 1982 and has only had fairly minor changes since then! ↩︎

Collaborations Workshop 2021: collaborative ideas & hackday My last post covered the more "traditional" lectures-and-panel-sessions approach of the first half of the SSI Collaborations Workshop. The rest of the workshop was much more interactive, consisting of a discussion session, a Collaborative Ideas session, and a whole-day hackathon! The discussion session on day one had us choose a topic (from a list of topics proposed leading up to the workshop) and join a breakout room for that topic with the aim of producing a "speed blog" by the end of 90 minutes. Those speed blogs will be published on the SSI blog over the coming weeks, so I won't go into that in more detail. The Collaborative Ideas session is a way of generating hackday ideas, by putting people together at random into small groups to each raise a topic of interest to them before discussing and coming up with a combined idea for a hackday project. Because of the serendipitous nature of the groupings, it's a really good way of generating new ideas from unexpected combinations of individual interests. After that, all the ideas from the session, along with a few others proposed by various participants, were pitched as ideas for the hackday and people started to form teams. Not every idea pitched gets worked on during the hackday, but in the end 9 teams of roughly equal size formed to spend the third day working together. My team's project: "AHA! An Arts & Humanities Adventure" There's a lot of FOMO around choosing which team to join for an event like this: there were so many good ideas and I wanted to work on several of them! In the end I settled on a team developing an escape room concept to help Arts & Humanities scholars understand the benefits of working with research software engineers for their research. Five of us rapidly mapped out an example storyline for an escape room, got a website set up with GitHub and populated it with the first few stages of the game. We decided to focus on a story that would help the reader get to grips with what an API is, and I'm amazed how much we managed to get done in less than a day's work!
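Since the whole game is built around explaining what an API is, it's worth showing the kind of thing we mean: a program asking another system a question over the web and getting structured data back. The sketch below is purely illustrative and not taken from the game itself; the URL and field names are invented.

```python
import requests

# A made-up catalogue API endpoint, for illustration only.
API_URL = "https://api.example-library.org/v1/items"

# Ask the API for records matching a search term...
response = requests.get(API_URL, params={"q": "illuminated manuscripts", "limit": 3})
response.raise_for_status()

# ...and get back structured JSON that a program (or a humanities
# scholar with a little code) can work with directly.
for item in response.json().get("results", []):
    print(item.get("title"), "-", item.get("date"))
```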
You can try playing through the escape room (so far) yourself on the web, or take a look at the GitHub repository, which contains the source of the website along with a list of outstanding tasks to work on if you're interested in contributing. I'm not sure yet whether this project has enough momentum to keep going, but it was a really valuable way both of getting to know and building trust with some new people, and of demonstrating that the concept is worth more work. Other projects Here's a brief rundown of the other projects worked on by teams on the day. Coding Confessions Everyone starts somewhere and everyone cuts corners from time to time. Real developers copy and paste! Fight imposter syndrome by looking through some of these confessions or contributing your own. https://coding-confessions.github.io/ CarpenPI A template to set up a Raspberry Pi with everything you need to run a Carpentries (https://carpentries.org/) data science/software engineering workshop in a remote location without internet access. https://github.com/CarpenPi/docs/wiki Research Dugnads A guide to running an event that is a coming together of a research group or team to share knowledge, pass on skills, tidy and review code, among other software and working best practices (based on the Norwegian concept of a dugnad, a form of "voluntary work done together with other people") https://research-dugnads.github.io/dugnads-hq/ Collaborations Workshop ideas A meta-project to collect together pitches and ideas from previous Collaborations Workshop conferences and hackdays, to analyse patterns and revisit ideas whose time might now have come. https://github.com/robintw/CW-ideas howDescribedIs Improve the machine-readable metadata attached to open research projects by integrating existing tools like SOMEF, codemeta.json and HowFAIRIs (https://howfairis.readthedocs.io/en/latest/index.html). Complete with CI and badges! https://github.com/KnowledgeCaptureAndDiscovery/somef-github-action Software end-of-project plans Develop a template to plan and communicate what will happen when the fixed-term project funding for your research software ends. Will maintenance continue? When will the project sunset? Who owns the IP? https://github.com/elichad/software-twilight Habeas Corpus A corpus of machine-readable data about software used in COVID-19 related research, based on the CORD-19 dataset. https://github.com/softwaresaved/habeas-corpus Credit-all Extend the all-contributors GitHub bot (https://allcontributors.org/) to include rich information about research project contributions, such as the CASRAI Contributor Roles Taxonomy (https://casrai.org/credit/) https://github.com/dokempf/credit-all I'm excited to see so many metadata-related projects! I plan to take a closer look at what the Habeas Corpus, Credit-all and howDescribedIs teams did when I get time. I also really want to try running a dugnad with my team or for the GLAM Data Science network. Collaborations Workshop 2021: talks & panel session I've just finished attending (online) the three days of this year's SSI Collaborations Workshop (CW for short), and once again it's been a brilliant experience, as well as mentally exhausting, so I thought I'd better get a summary down while it's still fresh in my mind.
Collaborations Workshop is, as the name suggests, much more focused on facilitating collaborations than a typical conference, and has settled into a structure that starts off with longer keynotes and lectures, and progressively gets more interactive, culminating with a hack day on the third day. That's a lot to write about, so for this post I'll focus on the talks and panel session, and follow up with another post about the collaborative bits. I'll also probably need to come back and add in more links to bits and pieces once slides and the "official" summary of the event become available. Updates 2021-04-07 Added links to recordings of keynotes and panel sessions Provocations The first day began with two keynotes on this year's main themes: FAIR Research Software and Diversity & Inclusion, and day 2 had a great panel session focused on disability. All three were streamed live and the recordings remain available on Youtube: View the keynotes recording; Google-free alternative link View the panel session recording; Google-free alternative link FAIR Research Software Dr Michelle Barker, Director of the Research Software Alliance, spoke on the challenges to recognition of software as part of the scholarly record: software is not often cited. The FAIR4RS working group has been set up to investigate and create guidance on how the FAIR Principles for data can be adapted to research software as well; as they stand, the Principles are not ideally suited to software. This work will only be the beginning though, as we will also need metrics, training, career paths and much more. ReSA itself has 3 focus areas: people, policy and infrastructure. If you're interested in getting more involved in this, you can join the ReSA email list. Equality, Diversity & Inclusion: how to go about it Dr Chonnettia Jones, Vice President of Research at the Michael Smith Foundation for Health Research, spoke extensively and persuasively on the need for Equality, Diversity & Inclusion (EDI) initiatives within research, as there is abundant, robust evidence that they improve all research outcomes. She highlighted the difficulties current approaches to EDI have in effecting structural change, and in changing not just individual behaviours but the cultures & practices that perpetuate inequity. Where initiatives are often constructed around making up for individual deficits, a better framing is to start from an understanding of individuals having equal stature but different lived experiences. Commenting on the current focus on "research excellence", she pointed out that the hyper-competition this promotes is deeply unhealthy, suggesting instead that true excellence requires diversity, and that we should focus on an inclusive excellence driven by inclusive leadership. Equality, Diversity & Inclusion: disability issues Day 2's EDI panel session brought together five disabled academics to discuss the problems of disability in research:

- Dr Becca Wilson, UKRI Innovation Fellow, Institute of Population Health Science, University of Liverpool (Chair)
- Phoenix C S Andrews (PhD Student, Information Studies, University of Sheffield and Freelance Writer)
- Dr Ella Gale (Research Associate and Machine Learning Subject Specialist, School of Chemistry, University of Bristol)
- Prof Robert Stevens (Professor and Head of Department of Computer Science, University of Manchester)
- Dr Robin Wilson (Freelance Data Scientist and SSI Fellow)

NB.
The discussion flowed quite freely, so the following summary mixes up input from all the panel members. Researchers are often assumed to be single-minded in following their research calling, and aptness for jobs is often partly judged on "time served", which disadvantages any disabled person who has been forced to take a career break. On top of this, disabled people are often time-poor because of the extra time needed to manage their condition, leaving them with less "output" to show for their time served on many common metrics. This can particularly affect early-career researchers, since resources for these are often restricted on a "years-since-PhD" criterion. Time poverty also makes funding with short deadlines that much harder to apply for. Employers add more demands right from the start: new starters are typically expected to complete a health and safety form, generally a brief affair that will suddenly become an 80-page bureaucratic nightmare if you tick the box declaring a disability. Many employers claim to be inclusive yet utterly fail to understand the needs of their disabled staff. Wheelchairs are liberating for those who use them (despite the awful but common phrase "wheelchair-bound") and yet employers will refuse to insure a wheelchair while travelling for work, classifying it as a "high value personal item" that the owner would take the same responsibility for as an expensive camera. Computers open up the world for blind people in a way that was never possible without them, but it's not unusual for mandatory training to be inaccessible to screen readers. Some of these barriers can be overcome, but doing so takes yet more time that could and should be spent on more important work. What can we do about it? Academia works on patronage whether we like it or not, so be the person who supports people who are different to you, rather than focusing on the one you "recognise yourself in" to mentor. As a manager, it's important to ask each individual what they need and believe them: they are the expert in their own condition and their lived experience of it. Don't assume that because someone else in your organisation with the same disability needs one set of accommodations, it's invalid for your staff member to require something totally different. And remember: disability is unusual as a protected characteristic in that anyone can acquire it at any time without warning! Lightning talks Lightning talk sessions are always tricky to summarise, and while this doesn't do them justice, here are a few highlights from my notes. Data & metadata Malin Sandstrom talked about a much-needed refinement of contributor role taxonomies for scientific computing. Stephan Druskat showcased a project to crowdsource a corpus of research software for further analysis. Learning & teaching/community Matthew Bluteau introduced the concept of the "coding dojo" as a way to enhance a community of practice: a group of coders get together to practice & learn by working together to solve a problem and explaining their work as they go. He described 2 models: a code jam, where people work in small groups, and the Randori method, where 2 people do pair programming while the rest observe. I'm excited to try this out!
Steve Crouch talked about intermediate skills and helping people take the next step, which I’m also very interested in with the GLAM Data Science network Esther Plomp recounted experience of running multiple Carpentry workshops online, while Diego Alonso Alvarez discussed planned workshops on making research software more usable with GUIs Shoaib Sufi showcased the SSI’s new event organising guide Caroline Jay reported on a diary study into autonomy & agency in RSE during COVID Lopez, T., Jay, C., Wermelinger, M., & Sharp, H. (2021). How has the covid-19 pandemic affected working conditions for research software engineers? Unpublished manuscript. Wrapping up That’s not everything! But this post is getting pretty long so I’ll wrap up for now. I’ll try to follow up soon with a summary of the “collaborative” part of Collaborations Workshop: the idea-generating sessions and hackday! Time for a new look... I’ve decided to try switching this website back to using Hugo to manage the content and generate the static HTML pages. I’ve been on the Python-based Nikola for a few years now, but recently I’ve been finding it quite slow, and very confusing to understand how to do certain things. I used Hugo recently for the GLAM Data Science Network website and found it had come on a lot since the last time I was using it, so I thought I’d give it another go, and redesign this site to be a bit more minimal at the same time. The theme is still a work in progress so it’ll probably look a bit rough around the edges for a while, but I think I’m happy enough to publish it now. When I get round to it I might publish some more detailed thoughts on the design. Ideas for Accessible Communications The Disability Support Network at work recently ran a survey on “accessible communications”, to develop guidance on how to make communications (especially internal staff comms) more accessible to everyone. I grabbed a copy of my submission because I thought it would be useful to share more widely, so here it is. Please note that these are based on my own experiences only. I am in no way suggesting that these are the only things you would need to do to ensure your communications are fully accessible. They’re just some things to keep in mind. Policies/procedures/guidance can be stressful to use if anything is vague or inconsistent, or if it looks like there might be more information implied than is explicitly given (a common cause of this is use of jargon in e.g. HR policies). Emails relating to these policies have similar problems, made worse because they tend to be very brief. Online meetings can be very helpful, but can also be exhausting, especially if there are too many people, or not enough structure. Larger meetings & webinars without agendas (or where the agenda is ignored, or timings are allowed to drift without acknowledgement) are very stressful, as are those where there is not enough structure to ensure fair opportunities to contribute. 
Written reference documents and communications should:

- Be carefully checked for consistency and clarity
- Have all key points explicitly stated
- Explicitly acknowledge the need for flexibility where it is necessary, rather than implying or hinting at it
- Clearly define jargon & acronyms where they are necessary to the point being made, and avoid them otherwise
- Include links to longer, more explicit versions where space is tight
- Provide clear bullet-point summaries with links to the details

Online meetings should:

- Include sufficient break time (at least 10 minutes out of every hour) and not allow this to be compromised just because a speaker has misjudged the length of their talk
- Include initial "settling-in" time in agendas to avoid timing getting messed up from the start
- Ensure the agenda is stuck to, or that divergence from the agenda is acknowledged explicitly by the chair and updated timing briefly discussed to ensure everyone is clear
- Establish a norm for participation at the start of the meeting and stick to it, e.g. ask people to raise hands when they have a point to make, or have specific time for round-robin contributions
- Ensure quiet/introverted people have space to contribute, but don't force them to do so if they have nothing to add at the time
- Offer a text-based alternative to contributing verbally
- If appropriate, at the start of the meeting assign specific roles of: Gatekeeper (ensures everyone has a chance to contribute); Timekeeper (ensures the meeting runs to time); Scribe (ensures a consistent record of the meeting)
- Be chaired by someone with the confidence to enforce the above: offer training to all staff on chairing meetings to ensure everyone has the skills to run a meeting effectively

Matrix self-hosting I started running my own Matrix server a little while ago. Matrix is something rather cool, a chat system similar to IRC or Slack, but open and federated. Open in that the standard is available for anyone to view, but also the reference implementations of server and client are open source, along with many other clients and a couple of nascent alternative servers. Federated in that, like email, it doesn't matter what server you sign up with, you can talk to users on your own or any other server. I decided to host my own for three reasons. Firstly, to see if I could and to learn from it. Secondly, to try and rationalise the Cambrian explosion of Slack teams I was being added to in 2019. Thirdly, to take some control of the loss of access to historical messages in some communities that rely on Slack (especially the Carpentries and RSE communities). Since then, I've also added a fourth goal: taking advantage of various bridges to bring other messaging networks I use (such as Signal and Telegram) into a consistent UI. I've also found that my use of Matrix-only rooms has grown as more individuals & communities have adopted the platform. So, I really like Matrix and I use it daily. My problem now is whether to keep self-hosting. Synapse, the only full server implementation at the moment, is really heavy on memory, so I've ended up running it on a much bigger server than I thought I'd need, which seems overkill for a single-user instance. So now I have to make a decision about whether it's worth keeping going, or shutting it down and going back to matrix.org, or setting up on one of the other servers that have sprung up in the last couple of years. There are a couple of other considerations here.
Firstly, Synapse resource usage is entirely down to the size of the rooms joined by users of the homeserver, not directly the number of users. So if users have mostly overlapping interests, and thus keep to the same rooms, you can support quite a large community without significant extra resource usage. Secondly, there are a couple of alternative server implementations in development specifically addressing this issue for small servers: Dendrite and Conduit. Neither is quite ready for what I want yet, but both are getting close, and when they are it will be possible to run small homeservers with much more sensible resource usage. So I could start opening up for other users, and at least justify the size of the server that way. I wouldn't ever want to make it a paid-for service, but perhaps people might be willing to make occasional donations towards running costs. That still leaves me with the question of whether I'm comfortable running a service that others may come to rely on, or being responsible for the safety of their information. I could also hold out for Dendrite or Conduit to mature enough that I'm ready to try them, which might not be more than a few months off. Hmm, seems like I've convinced myself to stick with it for now, and we'll see how it goes. In the meantime, if you know me and you want to try it out let me know and I might risk setting you up with an account! What do you miss least about pre-lockdown life? @JanetHughes on Twitter: What do you miss the least from pre-lockdown life? I absolutely do not miss wandering around the office looking for a meeting room for a confidential call or if I hadn't managed to book a room in advance. Let's never return to that joyless frustration, hey? 10:27 AM · Feb 3, 2021 After seeing Terence Eden taking Janet Hughes' tweet from earlier this month as a writing prompt, I thought I might do the same. The first thing that leaps to my mind is commuting. At various points in my life I've spent between one and three hours a day travelling to and from work and I've never more than tolerated it at best. It steals time from your day, and societal norms dictate that it's your leisure & self-care time that must be sacrificed. Longer commutes allow more time to get into a book or podcast, especially if not driving, but I'd rather have that time at home than spend it trying to be comfortable in a train seat designed for some mythical average man shaped nothing like me! The other thing I don't miss is the colds and flu! Before the pandemic, British culture encouraged working even when ill, which meant constantly coming into contact with people carrying low-grade viruses. I'm not immunocompromised, but some allergies and the residue of being asthmatic as a child meant that I would get sick 2-3 times a year. A pleasant side-effect of the COVID precautions we're all taking is that I haven't been sick for over 12 months now, which is amazing! Finally, I don't miss having so little control over my environment. One of the things that working from home has made clear is that there are certain unavoidable aspects of working in my shared office that cause me sensory stress, and that are completely unrelated to my work. Working (or trying to work) next to a noisy automatic scanner; trying to find a light level that works for 6 different people doing different tasks; lacking somewhere quiet and still to eat lunch and recover from a morning of meetings or the constant vaguely-distracting bustle of a large shared office. It all takes energy.
Although it's partly been replaced by the new stress of living through a global pandemic, that old stress was a constant drain on my productivity and mood that had been growing throughout my career as I moved (ironically, given the common assumption that seniority leads to more privacy) into larger and larger open plan offices. Remarkable blogging And the handwritten blog saga continues, as I've just received my new reMarkable 2 tablet, which is designed for reading, writing and nothing else. It uses a super-responsive e-ink display and writing on it with a stylus is a dream. It has a slightly rough texture with just a bit of friction that makes my writing come out a lot more legibly than on a slippery glass touchscreen. If that was all there was to it, I might not have wasted my money, but it turns out that it runs on Linux and the makers have wisely decided not to lock it down but to give you full root access. Yes, you read that right: root access. It presents as an ethernet device over USB, so you can SSH in with a password found in the settings and have full control over your own device. What a novel concept. This fact alone has meant it's built a small yet devoted community of users who have come up with some clever ways of extending its functionality. In fact, many of these are listed on this GitHub repository. Finally, from what I've seen so far, the handwriting recognition is impressive to say the least. This post was written on it and needed only a little editing. I think this is a device that will get a lot of use! GLAM Data Science Network fellow travellers Updates 2021-02-04 Thanks to Gene (@dzshuniper@ausglam.space) for suggesting ADHO and a better attribution for the opening quote; see comments & webmentions below for details. "If you want to go fast, go alone. If you want to go far, go together." — African proverb, probably popularised in English by Kenyan church leader Rev. Samuel Kobia (original) This quote is a popular one in the Carpentries community, and I interpret it in this context to mean that a group of people working together is more sustainable than individuals pursuing the same goal independently. That's something that speaks to me, and that I want to make sure is reflected in nurturing this new community for data science in galleries, archives, libraries & museums (GLAM). To succeed, this work needs to be complementary and collaborative, rather than competitive, so I want to acknowledge a range of other networks & organisations whose activities complement this. The rest of this article is an unavoidably incomplete list of other relevant organisations whose efforts should be acknowledged and potentially built on. And it should go without saying, but just in case: if the work I'm planning fits right into an existing initiative, then I'm happy to direct my resources there rather than duplicate effort. Inspirations & collaborators Groups with similar goals or undertaking similar activities, but focused on a different sector, geographic area or topic. I think we should make as much use of and contribution to these existing communities as possible since there will be significant overlap. code4lib Probably the closest existing community to what I want to build, but primarily based in the US, so timezones (and physical distance for in-person events) make it difficult to participate fully. This is a well-established community though, with regular events including an annual conference so there's a lot to learn here.
newCardigan Similar to code4lib but with an Australian focus, so the timezone problem is even bigger! GLAM Labs Focused on supporting the people experimenting with and developing the infrastructure to enable scholars to access GLAM materials in new ways. In some ways, a GLAM data science network would be complementary to their work, by providing people not directly involved with building GLAM Labs with the skills to make best use of GLAM Labs infrastructure. UK Government data science community Another existing community with very similar intentions, but focused on the UK Government sector. Clearly the British Library and a few national & regional museums & archives fall into this, but much of the rest of the GLAM sector does not. Artificial Intelligence for Libraries, Archives & Museums (AI4LAM) A multinational collaboration between several large libraries, archives and museums with a specific focus on the Artificial Intelligence (AI) subset of data science. UK Reproducibility Network A network of researchers, primarily in HEIs, with an interest in improving the transparency and reliability of academic research. Mostly science-focused but with some overlap of goals around ethical and robust use of data. Museums Computer Group I'm less familiar with this than the others, but it seems to have a wider focus on technology generally, within the slightly narrower scope of museums specifically. Again, a lot of potential for collaboration. Training Several organisations and looser groups exist specifically to develop and deliver training that will be relevant to members of this network. The network also presents an opportunity for those who have done a workshop with one of these and want to know what the "next steps" are to continue their data science journey.

- The Carpentries, aka: Library Carpentry, Data Carpentry & Software Carpentry
- Data Science Training for Librarians (DST4L)
- The Programming Historian
- CDH Cultural Heritage Data School

Supporters These mission-driven organisations have goals that align well with what I imagine for the GLAM DSN, but operate at a more strategic level. They work by providing expert guidance and policy advice, lobbying and supporting specific projects with funding and/or effort. In particular, the SSI runs a fellowship programme which is currently providing a small amount of funding to this project.

- Digital Preservation Coalition (DPC)
- Software Sustainability Institute (SSI)
- Research Data Alliance (RDA)
- Alliance of Digital Humanities Organizations (ADHO) … and its Libraries and Digital Humanities Special Interest Group (Lib&DH SIG)

Professional bodies These organisations exist to promote the interests of professionals in particular fields, including supporting professional development. I hope they will provide communication channels to their various members at the least, and may be interested in supporting more directly, depending on their mission and goals.

- Society of Research Software Engineering
- Chartered Institute of Library and Information Professionals
- Archives & Records Association
- Museums Association

Conclusion As I mentioned at the top of the page, this list cannot possibly be complete. This is a growing area and I'm not the only or first person to have this idea. If you can think of anything glaring that I've missed and you think should be on this list, leave a comment or tweet/toot at me! A new font for the blog I've updated my blog theme to use the quasi-proportional fonts Iosevka Aile and Iosevka Etoile.
I really like the aesthetic, as they look like fixed-width console fonts (I use the true fixed-width version of Iosevka in my terminal and text editor) but they’re actually proportional which makes them easier to read. https://typeof.net/Iosevka/ Training a model to recognise my own handwriting If I’m going to train an algorithm to read my weird & awful writing, I’m going to need a decent-sized training set to work with. And since one of the main things I want to do with it is to blog “by hand” it makes sense to focus on that type of material for training. In other words, I need to write out a bunch of blog posts on paper, scan them and transcribe them as ground truth. The added bonus of this plan is that after transcribing, I also end up with some digital text I can use as an actual post — multitasking! So, by the time you read this, I will have already run it through a manual transcription process using Transkribus to add it to my training set, and copy-pasted it into emacs for posting. This is a fun little project because it means I can: Write more by hand with one of my several nice fountain pens, which I enjoy Learn more about the operational process some of my colleagues go through when digitising manuscripts Learn more about the underlying technology & maths, and how to tune the process Produce more lovely content! For you to read! Yay! Write in a way that forces me to put off editing until after a first draft is done and focus more on getting the whole of what I want to say down. That’s it for now — I’ll keep you posted as the project unfolds. Addendum Tee hee! I’m actually just enjoying the process of writing stuff by hand in long-form prose. It’ll be interesting to see how the accuracy turns out and if I need to be more careful about neatness. Will it be better or worse than the big but generic models used by Samsung Notes or OneNote. Maybe I should include some stylus-written text for comparison. Blogging by hand I wrote the following text on my tablet with a stylus, which was an interesting experience: So, thinking about ways to make writing fun again, what if I were to write some of them by hand? I mean I have a tablet with a pretty nice stylus, so maybe handwriting recognition could work. One major problem, of course, is that my handwriting is AWFUL! I guess I’ll just have to see whether the OCR is good enough to cope… It’s something I’ve been thinking about recently anyway: I enjoy writing with a proper fountain pen, so is there a way that I can have a smooth workflow to digitise handwritten text without just typing it back in by hand? That would probably be preferable to this, which actually seems to work quite well but does lead to my hand tensing up to properly control the stylus on the almost-frictionless glass screen. I’m surprised how well it worked! Here’s a sample of the original text: And here’s the result of converting that to text with the built-in handwriting recognition in Samsung Notes: Writing blog posts by hand So, thinking about ways to make writing fun again, what if I were to write some of chum by hand? I mean, I have a toldest winds a pretty nice stylus, so maybe handwriting recognition could work. One major problems, ofcourse, is that my , is AWFUL! Iguess I’ll just have to see whattime the Ocu is good enough to cope… It’s something I’ve hun tthinking about recently anyway: I enjoy wilting with a proper fountain pion, soischeme a way that I can have a smooch workflow to digitise handwritten text without just typing it back in by hand? 
That wouldprobally be preferableto this, which actually scams to work quito wall but doers load to my hand tensing up to properly couldthe stylus once almost-frictionlessg lass scream. It’s pretty good! It did require a fair bit of editing though, and I reckon we can do better with a model that’s properly trained on a large enough sample of my own handwriting. What I want from a GLAM/Cultural Heritage Data Science Network Introduction As I mentioned last year, I was awarded a Software Sustainability Institute Fellowship to pursue the project of setting up a Cultural Heritage/GLAM data science network. Obviously, the global pandemic has forced a re-think of many plans and this is no exception, so I’m coming back to reflect on it and make sure I’m clear about the core goals so that everything else still moves in the right direction. One of the main reasons I have for setting up a GLAM data science network is because it’s something I want. The advice to “scratch your own itch” is often given to people looking for an open project to start or contribute to, and the lack of a community of people with whom to learn & share ideas and practice is something that itches for me very much. The “motivation” section in my original draft project brief for this work said: Cultural heritage work, like all knowledge work, is increasingly data-based, or at least gives opportunities to make use of data day-to-day. The proper skills to use this data enable more effective working. Knowledge and experience thus gained improves understanding of and empathy with users also using such skills. But of course, I have my own reasons for wanting to do this too. In particular, I want to: Advocate for the value of ethical, sustainable data science across a wide range of roles within the British Library and the wider sector Advance the sector to make the best use of data and digital sources in the most ethical and sustainable way possible Understand how and why people use data from the British Library, and plan/deliver better services to support that Keep up to date with relevant developments in data science Learn from others' skills and experiences, and share my own in turn Those initial goals imply some further supporting goals: Build up the confidence of colleagues who might benefit from data science skills but don’t feel they are “technical” or “computer literate” enough Further to that, build up a base of colleagues with the confidence to share their skills & knowledge with others, whether through teaching, giving talks, writing or other channels Identify common awareness gaps (skills/knowledge that people don’t know they’re missing) and address them Develop a communal space (primarily online) in which people feel safe to ask questions Develop a body of professional practice and help colleagues to learn and contribute to the evolution of this, including practices of data ethics, software engineering, statistics, high performance computing, … Break down language barriers between data scientists and others I’ll expand on this separately as my planning develops, but here are a few specific activities that I’d like to be able to do to support this: Organise less-formal learning and sharing events to complement the more formal training already available within organisations and the wider sector, including “show and tell” sessions, panel discussions, code cafés, masterclasses, guest speakers, reading/study groups, co-working sessions, … Organise training to cover intermediate skills and knowledge currently missing from the 
available options, including the awareness gaps and professional practice mentioned above Collect together links to other relevant resources to support self-led learning Decisions to be made There are all sorts of open questions in my head about this right now, but here are some of the key ones. Is it GLAM or Cultural Heritage? When I first started planning this whole thing, I went with “Cultural Heritage”, since I was pretty transparently targeting my own organisation. The British Library is fairly unequivocally a CH organisation. But as I’ve gone along I’ve found myself gravitating more towards the term “GLAM” (which stands for Galleries, Libraries, Archives, Museums) as it covers a similar range of work but is clearer (when you spell out the acronym) about what kinds of work are included. What skills are relevant? This turns out to be surprisingly important, at least in terms of how the community is described, as they define the boundaries of the community and can be the difference between someone feeling welcome or excluded. For example, I think that some introductory statistics training would be immensely valuable for anyone working with data to understand what options are open to them and what limitations those options have, but is the word “statistics” offputting per se to those who’ve chosen a career in arts & humanities? I don’t know because I don’t have that background and perspective. Keep it internal to the BL, or open up early on? I originally planned to focus primarily on my own organisation to start with, feeling that it would be easier to organise events and build a network within a single organisation. However, the pandemic has changed my thinking significantly. Firstly, it’s now impossible to organise in-person events and that will continue for quite some time to come, so there is less need to focus on the logistics of getting people into the same room. Secondly, people within the sector are much more used to attending remote events, which can easily be opened up to multiple organisations in many countries, timezones allowing. It now makes more sense to focus primarily on online activities, which opens up the possibility of building a critical mass of active participants much more quickly by opening up to the wider sector. Conclusion This is the type of post that I could let run and run without ever actually publishing, but since it’s something I need feedback and opinions on from other people, I’d better ship it! I really want to know what you think about this, whether you feel it’s relevant to you and what would make it useful. Comments are open below, or you can contact me via Mastodon or Twitter. Writing About Not Writing Under Construction Grunge Sign by Nicolas Raymond — CC BY 2.0 Every year, around this time of year, I start doing two things. First, I start thinking I could really start to understand monads and write more than toy programs in Haskell. This is unlikely to ever actually happen unless and until I get a day job where I can justify writing useful programs in Haskell, but Advent of Code always gets me thinking otherwise. Second, I start mentally writing this same post. You know, the one about how the blogger in question hasn’t had much time to write but will be back soon? “Sorry I haven’t written much lately…” It’s about as cliché as a Geocities site with a permanent “Under construction” GIF. 
At some point, not long after the dawn of ~time~ the internet, most people realised that every website was permanently under construction and publishing something not ready to be published was just pointless. So I figured this year I’d actually finish writing it and publish it. After all, what’s the worst that could happen? If we’re getting all reflective about this, I could probably suggest some reasons why I’m not writing much: For a start, there’s a lot going on in both my world and The World right now, which doesn’t leave a lot of spare energy after getting up, eating, housework, working and a few other necessary activities. As a result, I’m easily distracted and I tend to let myself get dragged off in other directions before I even get to writing much of anything. If I do manage to focus on this blog in general, I’ll often end up working on some minor tweak to the theme or functionality. I mean, right now I’m wondering if I can do something clever in my text-editor (Emacs, since you’re asking) to streamline my writing & editing process so it’s more elegant, efficient, ergonomic and slightly closer to perfect in every way. It also makes me much more likely to self-censor, and to indulge my perfectionist tendencies to try and tweak the writing until it’s absolutely perfect, which of course never happens. I’ve got a whole heap of partly-written posts that are juuuust waiting for the right motivation for me to just finish them off. The only real solution is to accept that: I’m not going to write much and that’s probably OK What I do write won’t always be the work of carefully-researched, finely crafted genius that I want it to be, and that’s probably OK too Also to remember why I started writing and publishing stuff in the first place: to reflect and get my thoughts out onto a (virtual) page so that I can see them, figure out whether I agree with myself and learn; and to stimulate discussion and get other views on my (possibly uninformed, incorrect or half-formed) thoughts, also to learn. In other words, a thing I do for me. It’s easy to forget that and worry too much about whether anyone else wants to read my s—t. Will you notice any changes? Maybe? Maybe not? Who knows. But it’s a new year and that’s as good a time for a change as any. When is a persistent identifier not persistent? Or an identifier? I wrote a post on the problems with ISBNs as persistent identifiers (PIDS) for work, so check it out if that sounds interesting. IDCC20 reflections I’m just back from IDCC20, so here are a few reflections on this year’s conference. You can find all the available slides and links to shared notes on the conference programme. There’s also a list of all the posters and an overview of the Unconference Skills for curation of diverse datasets Here in the UK and elsewhere, you’re unlikely to find many institutions claiming to apply a deep level of curation to every dataset/software package/etc deposited with them. There are so many different kinds of data and so few people in any one institution doing “curation” that it’s impossible to do this for everything. Absent the knowledge and skills required to fully evaluate an object the best that can be done is usually to make a sense check on the metadata and flag up with the depositor potential for high-level issues such as accidental disclosure of sensitive personal information. The Data Curation Network in the United States is aiming to address this issue by pooling expertise across multiple organisations. 
The pilot has been highly successful and they’re now looking to obtain funding to continue this work. The Swedish National Data Service is experimenting with a similar model, also with a lot of success. As well as sharing individual expertise, the DCN collaboration has also produced some excellent online quick-reference guides for curating common types of data. We had some further discussion as part of the Unconference on the final day about what it would look like to introduce this model in the UK. There was general agreement that this was a good idea and a way to make optimal use of sparse resources. There were also very valid concerns that it would be difficult in the current financial climate for anyone to justify doing work for another organisation, apparently for free. In my mind there are two ways around this, which are not mutually exclusive by any stretch of the imagination. First is to Just Do It: form an informal network of curators around something simple like a mailing list, and give it a try. Second is for one or more trusted organisations to provide some coordination and structure. There are several candidates for this including DCC, Jisc, DPC and the British Library; we all have complementary strengths in this area so it’s my hope that we’ll be able to collaborate around it. In the meantime, I hope the discussion continues. Artificial intelligence, machine learning et al As you might expect at any tech-oriented conference there was a strong theme of AI running through many presentations, starting from the very first keynote from Francine Berman. Her talk, The Internet of Things: Utopia or Dystopia? used self-driving cars as a case study to unpack some of the ethical and privacy implications of AI. For example, driverless cars can potentially increase efficiency, both through route-planning and driving technique, but also by allowing fewer vehicles to be shared by more people. However, a shared vehicle is not a private space in the way your own car is: anything you say or do while in that space is potentially open to surveillance. Aside from this, there are some interesting ideas being discussed, particularly around the possibility of using machine learning to automate increasingly complex actions and workflows such as data curation and metadata enhancement. I didn’t get the impression anyone is doing this in the real world yet, but I’ve previously seen theoretical concepts discussed at IDCC make it into practice so watch this space! Playing games! Training is always a major IDCC theme, and this year two of the most popular conference submissions described games used to help teach digital curation concepts and skills. Mary Donaldson and Matt Mahon of the University of Glasgow presented their use of Lego to teach the concept of sufficient metadata. Participants build simple models before documenting the process and breaking them down again. Then everyone had to use someone else’s documentation to try and recreate the models, learning important lessons about assumptions and including sufficient detail. Kirsty Merrett and Zosia Beckles from the University of Bristol brought along their card game “Researchers, Impact and Publications (RIP)”, based on the popular “Cards Against Humanity”. RIP encourages players to examine some of the reasons for and against data sharing with plenty of humour thrown in. Both games were trialled by many of the attendees during Thursday’s Unconference. 
Summary I realised in Dublin that it's 8 years since I attended my first IDCC, held at the University of Bristol in December 2011 while I was still working at the nearby University of Bath. While I haven't been every year, I've been to every one held in Europe since then and it's interesting to see what has and hasn't changed. We're no longer discussing data management plans, data scientists or various other things as abstract concepts that we'd like to encourage, but dealing with the real-world consequences of them. The conference has also grown over the years: this year was the biggest yet, boasting over 300 attendees. There has been especially big growth in attendees from North America, Australasia, Africa and the Middle East. That's great for the diversity of the conference as it brings in more voices and viewpoints than ever. With more people around to interact with I have to work harder to manage my energy levels, but I think that's a small price to pay. Iosevka: a nice fixed-width font Iosevka is a nice, slender monospace font with a lot of configurable variations. Check it out: https://typeof.net/Iosevka/ Replacing comments with webmentions Just a quickie to say that I've replaced the comment section at the bottom of each post with webmentions, which allows you to comment by posting on your own site and linking here. It's a fundamental part of the IndieWeb, which I'm slowly getting to grips with having been a halfway member of it for years by virtue of having my own site on my own domain. I'd already got rid of Google Analytics to stop forcing that tracking on my visitors, and I wanted to get rid of Disqus too because I'm pretty sure the only way it's free for me is if they're selling my data and yours to third parties. Webmention is a nice alternative because it relies only on open standards, has no tracking and allows people to control their own comments. While I'm currently using a third-party service to help, I can switch to self-hosted at any point in the future, completely transparently. Thanks to webmention.io, which handles incoming webmentions for me, and webmention.js, which displays them on the site, I can keep it all static and not have to implement any of this myself, which is nice. It's a bit harder to comment because you have to be able to host your own content somewhere, but then almost no-one ever commented anyway, so it's not like I'll lose anything! Plus, if I get Bridgy set up right, you should be able to comment just by replying on Mastodon, Twitter or a few other places. A spot of web searching shows that I'm not the first to make the Disqus -> webmentions switch (yes, I'm putting these links in blatantly to test outgoing webmentions with Telegraph…): So long Disqus, hello webmention — Nicholas Hoizey Bye Disqus, hello Webmention! — Evert Pot Implementing Webmention on a static site — Deluvi Let's see how this goes! Bridging Carpentries Slack channels to Matrix It looks like I've accidentally taken charge of bridging a bunch of The Carpentries Slack channels over to Matrix. Given this, it seems like a good idea to explain what that sentence means and reflect a little on my reasoning. I'm more than happy to discuss the pros and cons of this approach. If you just want to try chatting in Matrix, jump to the getting started section. What are Slack and Matrix? Slack (see also on Wikipedia), for those not familiar with it, is an online text chat platform with the feel of IRC (Internet Relay Chat) but a modern look and feel, and both web and smartphone interfaces.
By providing a free tier that meets many people's needs on its own, Slack has become the communication platform of choice for thousands of online communities, private projects and more. One of the major disadvantages of using Slack's free tier, as many community organisations do, is that as an incentive to upgrade to a paid service, your chat history is limited to the most recent 10,000 messages across all channels. For a busy community like The Carpentries, this means that messages older than about 6-7 weeks are already inaccessible, rendering some of the quieter channels apparently empty. As Slack is at pains to point out, that history isn't gone, just archived and hidden from view unless you pay the low, low price of $1/user/month. That doesn't seem too pricy, unless you're a non-profit organisation with a lot of projects you want to fund and an active membership of several hundred worldwide, at which point it soon adds up. Slack does offer to waive the cost for registered non-profit organisations, but only for one community. The Carpentries is not an independent organisation, but one fiscally sponsored by Community Initiatives, which has already used its free quota of one elsewhere, rendering the Carpentries ineligible. Other umbrella organisations such as NumFocus (and, I expect, Mozilla) also run into this problem with Slack. So, we have a community which is slowly and inexorably losing its own history behind a paywall. For some people this is simply annoying, but from my perspective as a facilitator of the preservation of digital things the community is haemorrhaging an important record of its early history. Enter Matrix. Matrix is a chat platform similar to IRC, Slack or Discord. It's divided into separate channels, and users can join one or more of these to take part in the conversation happening in those channels. What sets it apart from older technology like IRC and walled gardens like Slack & Discord is that it's federated. Federation means simply that users on any server can communicate with users and channels on any other server. Usernames and channel addresses specify both the individual identifier and the server it calls home, just as your email address contains all the information needed for my email server to route messages to it. While users are currently tied to their home server, channels can be mirrored and synchronised across multiple servers, making the overall system much more resilient. Can't connect to your favourite channel on server X? No problem: just connect via its alias on server Y and when X comes back online it will be resynchronised. The technology used is much more modern and secure than the aging IRC protocol, and there's no vendor lock-in like there is with closed platforms like Slack and Discord. On top of that, Matrix channels can easily be "bridged" to channels/rooms on other platforms, including, yes, Slack, so that you can join on Matrix and transparently talk to people connected to the bridged room, or vice versa.
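One nice consequence of Matrix being an open protocol rather than a walled garden is that "talking to it" is just plain HTTP: bots, bridges and clients all use the same client-server API. As a flavour of that, here's a minimal sketch (in Python) of posting a message to a room; the homeserver URL, access token and room ID are placeholders you would substitute with your own.

```python
import uuid
import requests

# Placeholders: use your own homeserver, access token and room ID.
HOMESERVER = "https://matrix.example.org"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ROOM_ID = "!someRoomId:example.org"

def send_message(body: str) -> str:
    """Send a plain-text message to a room via the Matrix client-server API."""
    txn_id = uuid.uuid4().hex  # per-request transaction ID, makes retries safe
    url = (f"{HOMESERVER}/_matrix/client/r0/rooms/{ROOM_ID}"
           f"/send/m.room.message/{txn_id}")
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        json={"msgtype": "m.text", "body": body},
    )
    resp.raise_for_status()
    return resp.json()["event_id"]

# send_message("Hello from the bridged side!")
```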
So, to summarise: The current Carpentries Slack channels could be bridged to Matrix at no cost and with no disruption to existing users The history of those channels from that point on would be retained on matrix.org and accessible even when it’s no longer available on Slack If at some point in the future The Carpentries chose to invest in its own Matrix server, it could adopt and become the main Matrix home of these channels without disruption to users of either Matrix or (if it’s still in use at that point) Slack Matrix is an open protocol, with a reference server implementation and wide range of clients all available as free software, which aligns with the values of the Carpentries community On top of this: I’m fed up of having so many different Slack teams to switch between to see the channels in all of them, and prefer having all the channels I regularly visit in a single unified interface; I wanted to see how easy this would be and whether others would also be interested. Given all this, I thought I’d go ahead and give it a try to see if it made things more manageable for me and to see what the reaction would be from the community. How can I get started? !!! reminder Please remember that, like any other Carpentries space, the Code of Conduct applies in all of these channels. First, sign up for a Matrix account. The quickest way to do this is on the Matrix “Try now” page, which will take you to the Riot Web client which for many is synonymous with Matrix. Other clients are also available for the adventurous. Second, join one of the channels. The links below will take you to a page that will let you connect via your preferred client. You’ll need to log in as they are set not to allow guest access, but, unlike Slack, you won’t need an invitation to be able to join. #general — the main open channel to discuss all things Carpentries #random — anything that would be considered offtopic elsewhere #welcome — join in and introduce yourself! That’s all there is to getting started with Matrix. To find all the bridged channels there’s a Matrix “community” that I’ve added them all to: Carpentries Matrix community. There’s a lot more, including how to bridge your favourite channels from Slack to Matrix, but this is all I’ve got time and space for here! If you want to know more, leave a comment below, or send me a message on Slack (jezcope) or maybe Matrix (@petrichor:matrix.org)! I’ve also made a separate channel for Matrix-Slack discussions: #matrix on Slack and Carpentries Matrix Discussion on Matrix MozFest19 first reflections Discussions of neurodiversity at #mozfest Photo by Jennifer Riggins The other weekend I had my first experience of Mozilla Festival, aka #mozfest. It was pretty awesome. I met quite a few people in real life that I’ve previously only known (/stalked) on Twitter, and caught up with others that I haven’t seen for a while. I had the honour of co-facilitating a workshop session on imposter syndrome and how to deal with it with the wonderful Yo Yehudi and Emmy Tsang. We all learned a lot and hope our participants did too; we’ll be putting together a summary blog post as soon as we can get our act together! I also attended a great session, led by Kiran Oliver (psst, they’re looking for a new challenge), on how to encourage and support a neurodiverse workforce. I was only there for the one day, and I really wish that I’d taken the plunge and committed to the whole weekend. There’s always next year though! 
To be honest, I’m just disappointed that I never had the courage to go sooner. Music for working Today1 the office conversation turned to blocking out background noise. (No, the irony is not lost on me.) Like many people I work in a large, open-plan office, and I’m not alone amongst my colleagues in sometimes needing to find a way to boost concentration by blocking out distractions. Not everyone is like this, but I find music does the trick for me. I also find that different types of music are better for different types of work, and I use this to try and manage my energy better. There are more distractions than auditory noise, and at times I really struggle with visual noise. Rather than have this post turn into a rant about the evils of open-plan offices, I’ll just mention that the scientific evidence doesn’t paint them in a good light2, or at least suggests that the benefits are more limited in scope than is commonly thought3, and move on to what I actually wanted to share: good music for working to. There are a number of genres that I find useful for working. Generally, these have in common a consistent tempo, a lack of lyrics, and enough variation to prevent boredom without distracting. Familiarity helps my concentration too, so I’ll often listen to a restricted set of albums for a while, gradually moving on by dropping one out and bringing in another. In my case this includes: Traditional dance music, generally from northern and western European traditions for me. This music has to be rhythmically consistent to allow social dancing, and while the melodies are typically simple repeated phrases, skilled musicians improvise around that to make something beautiful. I tend to go through phases of listening to particular traditions; I’m currently listening to a lot of French, Belgian and Scandinavian. Computer game soundtracks, which are specifically designed to enhance gameplay without distracting, making them perfect for other activities requiring a similar level of concentration. Chiptunes and other music incorporating them; partly overlapping with the previous category, chiptune is music made by hacking the audio chips from (usually) old computers and games machines to turn them into instruments for new music. Because of the nature of the instrument, this will have millisecond-perfect rhythm and again makes for undistracting noise blocking with an extra helping of nostalgia! Purists would disagree with me, but I like artists that combine chiptunes with other instruments and effects to make something more complete-sounding. Retrowave/synthwave/outrun, synth-driven music that’s instantly familiar as the soundtrack to many 90s sci-fi and thriller movies. Atmospheric, almost dreamy, but rhythmic with a driving beat, it’s another genre that fits into the “pleasing but not too surprising” category for me. So where to find this stuff? One of the best resources I’ve found is Music for Programming, which provides carefully curated playlists of mostly electronic music designed to energise without distracting. They’re so well done that the tracks move seamlessly, one to the next, without ever getting boring. Spotify is an obvious option, and I do use it quite a lot. However, I’ve started trying to find ways to support artists more directly, and Bandcamp seems to be a good way of doing that. It’s really easy to browse by genre, or discover artists similar to what you’re currently hearing.
You can listen for free as long as you don’t mind occasional nags to buy the music you’re hearing, but you can also buy tracks or albums. Music you’ve paid for is downloadable in several open, DRM-free formats for you to keep, and you know that a decent chunk of that cash is going directly to that artist. I also love noise generators; not exactly music, but a variety of pleasant background noises, some of which nicely obscure typical office noise. I particularly like mynoise.net, which has a cornucopia of different natural and synthetic noises. Each generator comes with a range of sliders allowing you to tweak the composition and frequency range, and will even animate them randomly for you to create a gently shifting soundscape. A much simpler, but still great, option is Noisli with its nice clean interface. Both offer apps for iOS and Android. For bonus points, you can always try combining one or more of the above. Adding in a noise generator allows me to listen to quieter music while still getting good environmental isolation when I need concentration. Another favourite combo is to open both the cafe and rainfall generators from myNoise, made easier by the ability to pop out a mini-player and then open up a second generator. I must be missing stuff though. What other musical genres should I try? What background sounds are nice to work to? Well, you know. The other day. Whatever. ↩︎ See e.g.: Lee, So Young, and Jay L. Brand. ‘Effects of Control over Office Workspace on Perceptions of the Work Environment and Work Outcomes’. Journal of Environmental Psychology 25, no. 3 (1 September 2005): 323–33. https://doi.org/10.1016/j.jenvp.2005.08.001. ↩︎ Open plan offices can actually work under certain conditions, The Conversation ↩︎ Working at the British Library: 6 months in It barely seems like it, but I’ve been at the British Library now for nearly 6 months. It always takes a long time to adjust and from experience I know it’ll be another year before I feel fully settled, but my team, department and other colleagues have really made me feel welcome and like I belong. One thing that hasn’t got old yet is the occasional thrill of remembering that I work at my national library now. Every now and then I’ll catch a glimpse of the collections at Boston Spa or step into one of the reading rooms and think “wow, I actually work here!” I also like having a national and international role to play, which means I get to travel a bit more than I used to. Budgets are still tight so there are limits, and I still prefer to be home more often than not, but there is more scope in this job than I’ve had previously for travelling to conferences, giving talks that change the way people think, and learning in different contexts. I’m learning a lot too, especially how to work with and manage people split across multiple sites, and the care and feeding of budgets. As well as missing my old team at Sheffield, I do also miss some of the direct contact I had with researchers in HE. I especially miss the teaching work, but also the higher-level influencing of more senior academics to change practices on a wider scale. Still, I get to use those influencing skills in different ways now, and I’m still involved with the Carpentries, which should let me keep my hand in with teaching. I still deal with my general tendency to try and do All The Things, and as before I’m slowly learning to recognise it, tame it and very occasionally turn it to my advantage.
That also leads to feelings of imposterism that are only magnified by the knowledge that I now work at a national institution! It’s a constant struggle some days to believe that I’ve actually earned my place here through hard work. Even if I don’t always feel that I have, my colleagues here certainly have, so I should have more faith in their opinion of me. Finally, I couldn’t write this type of thing without mentioning the commute. I’ve gone from 90 minutes each way on a good day (up to twice that if the trains were disrupted) to 35 minutes each way along fairly open roads. I have less time to read, but much more time at home. On top of that, the library has implemented flexitime across all pay grades, with even senior managers strongly encouraged to make full use of it. Not only is this an important enabler of equality across the organisation, it also relieves, for me personally, the pressure to work over my contracted hours and the guilt I’ve always felt at leaving work even 10 minutes early. If I work late, it’s now a choice I’m making based on business needs instead of guilt, and in full knowledge that I’ll get that time back later. So that’s where I am right now. I’m really enjoying the work and the culture, and I look forward to what the next 6 months will bring! RDA Plenary 13 reflection Photo by me I sit here writing this in the departure lounge at Philadelphia International Airport, waiting for my Aer Lingus flight back after a week at the 13th Research Data Alliance (RDA) Plenary (although I’m actually publishing this a week or so later at home). I’m pretty exhausted, partly because of the jet lag, and partly because it’s been a very full week with so much to take in. It’s my first time at an RDA Plenary, and it’s been quite a new experience for me! First off, it’s my first time outside Europe, and thus my first time crossing quite so many timezones. I’ve been waking at 5am and ready to drop by 8pm, but I’ve struggled on through! Secondly, it’s the biggest conference I’ve been to for a long time, both in number of attendees and number of parallel sessions. There’s been a lot of sustained input so I’ve been very glad to have a room in the conference hotel and be able to escape for a few minutes when I needed to recharge. Thirdly, it’s not really like any other conference I’ve been to: rather than having large numbers of presentations submitted by attendees, each session comprises lots of parallel meetings of RDA interest groups and working groups. It’s more community-oriented: an opportunity for groups to get together face to face and make plans or show off results. I found it pretty intense and struggled to take it all in, but incredibly valuable nonetheless. Lots of information to process (I took a lot of notes) and a few contacts to follow up on too, so overall I loved it! Using Pipfile in Binder Photo by Sear Greyson on Unsplash I recently attended a workshop, organised by the excellent team of the Turing Way project, on a tool called BinderHub. BinderHub, along with the public hosting platform MyBinder, allows you to publish computational notebooks online as “binders” such that they’re not static but fully interactive. It’s able to do this by using a tool called repo2docker to capture the full computational environment and dependencies required to run the notebook. !!!
aside “What is the Turing Way?” The Turing Way is, in its own words, “a lightly opinionated guide to reproducible data science.” The team is building an open textbook and running a number of workshops for scientists and research software engineers, and you should check out the project on GitHub. You could even contribute! The Binder process goes roughly like this: Do some work in a Jupyter Notebook or similar Put it into a public git repository Add some extra metadata describing the packages and versions your code relies on Go to mybinder.org and tell it where to find your repository Open the URL it generates for you Profit Other than step 5, which can take some time to build the binder, this is a remarkably quick process. It supports a number of different languages too, including built-in support for R, Python and Julia and the ability to configure pretty much any other language that will run on Linux. However, the Python support currently requires you to have either a requirements.txt or Conda-style environment.yml file to specify dependencies, and I commonly use a Pipfile for this instead. Pipfile allows you to specify a loose range of compatible versions for maximal convenience, but then locks in specific versions for maximal reproducibility. You can upgrade packages any time you want, but you’re fully in control of when that happens, and the locked versions are checked into version control so that everyone working on a project gets consistency. Since Pipfile is emerging as something of a standard, I thought I’d see if I could use it in a binder, and it turns out to be remarkably simple. The reference implementation of Pipfile is a tool called pipenv by the prolific Kenneth Reitz. All you need to use this in your binder is two files of one line each. requirements.txt tells repo2docker to build a Python-based binder, and contains a single line to install the pipenv package: pipenv Then postBuild is used by repo2docker to install all other dependencies using pipenv: pipenv install --system The --system flag tells pipenv to install packages globally (its default behaviour is to create a Python virtualenv). With these two files, the binder builds and runs as expected. You can see a complete example that I put together during the workshop here on GitLab. What do you think I should write about? I’ve found it increasingly difficult to make time to blog, and it’s not so much not having the time — I’m pretty privileged in that regard — but finding the motivation. Thinking about what used to motivate me, one of the big things was writing things that other people wanted to read. Rather than try to guess, I thought I’d ask! Those who know what I'm about, what would you read about, if it was written by me? I'm trying to break through the blog-writers block and would love to know what other people would like to see my ill-considered opinions on.— Jez Cope (@jezcope) March 7, 2019 I’m still looking for ideas, so please tweet me or leave me a comment below. Below are a few thoughts that I’m planning to do something with. Something taking one of the more techy aspects of Open Research, breaking it down and explaining the benefits for non-techy folks?— Dr Beth 🏳️‍🌈 🐺 (@PhdGeek) March 7, 2019 Skills (both techy and non techy) that people need to most effectively support RDM— Kate O'Neill (@KateFONeill) March 7, 2019 Sometimes I forget that my background makes me well-qualified to take some of these technical aspects of the job and break them down for different audiences.
There might be a whole series in this… Carrying on our conversation last week I'd love to hear more about how you've found moving from an HE lib to a national library and how you see the BL's role in RDM. Appreciate this might be a bit niche/me looking for more interesting things to cite :)— Rosie Higman (@RosieHLib) March 7, 2019 This is interesting, and something I’d like to reflect on; moving from one job to another always has lessons and it’s easy to miss them if you’re not paying attention. Another one for the pile. Life without admin rights to your computer— Mike Croucher (@walkingrandomly) March 7, 2019 This is so frustrating as an end user, but at the same time I get that endpoint security is difficult and there are massive risks associated with letting end users have admin rights. This is particularly important at the BL: as custodians of a nation’s cultural heritage, the risk for us is bigger than for many, and for this reason we are now Cyber Essentials Plus certified. At some point I’d like to do some research and have a conversation with someone who knows a lot more about InfoSec to work out what the proper approach to this might be, maybe involving VMs and a demilitarized zone on the network. I’m always looking for more inspiration, so please leave a comment if you’ve got anything you’d like to read my thoughts on. If you’re not familiar with my writing, please take a minute or two to explore the blog; the tags page is probably a good place to get an overview. Ultimate Hacking Keyboard: first thoughts Following on from the excitement of having built a functioning keyboard myself, I got a parcel on Monday. Inside was something that I’ve been waiting for since September: an Ultimate Hacking Keyboard! Where the custom-built Laplace is small and quiet for travelling, the UHK is to be my main workhorse in the study at home. Here are my first impressions: Key switches I went with Kailh blue switches from the available options. In stark contrast to the quiet blacks on the Laplace, blues are NOISY! They have an extra piece of plastic inside the switch that causes an audible and tactile click when the switch activates. This makes them very satisfying to type on and should help as I train my fingers not to bottom out while typing, but does make them unsuitable for use in a shared office! Here are some animations showing how the main types of key switch vary. Layout This keyboard has what’s known as a 60% layout: no number pad, arrows or function keys. As with the more spartan Laplace, these “missing” keys are made up for with programmable layers. For example, the arrow keys are on the Mod layer on the I/J/K/L keys, so I can access them without moving from the home row. I actually find this preferable to having to move my hand to the right to reach them, and I really never used the number pad in any case. Split This is a split keyboard, which means that the left and right halves can be separated to place the hands further apart, which eases strain across the shoulders. The UHK has a neat coiled cable joining the two which doesn’t get in the way. A cool design feature is that the two halves can be slotted back together and function perfectly well as a non-split keyboard too, held together by magnets. There are even electrical contacts so that when the two are joined you don’t need the linking cable. Programming The board is fully programmable, and this is achieved via a custom (open source) GUI tool which talks to the (open source) firmware on the board.
You can have multiple keymaps, each of which has a separate Base, Mod, Fn and Mouse layer, and there’s an LED display that shows a short mnemonic for the currently active map. I already have a customised Dvorak layout for day-to-day use, plus a standard QWERTY for not-me to use and an alternative QWERTY which will be slowly tweaked for games that don’t work well with Dvorak. Mouse keys One cool feature that the designers have included in the firmware is the ability to emulate a mouse. There’s a separate layer that allows me to move the cursor, scroll and click without moving my hands from the keyboard. Palm rests Not much to say about the palm rests, other than they are solid wood, and chunky, and really add a little something. I have to say, I really like it so far! Overall it feels really well designed, with every little detail carefully thought out, excellent build quality and a really solid feel. Custom-built keyboard I’m typing this post on a keyboard I made myself, and I’m rather excited about it! Why make my own keyboard? I wanted to learn a little bit about practical electronics, and I like to learn by doing I wanted to have the feeling of making something useful with my own hands I actually need a small keyboard with good-quality switches now that I travel a fair bit for work, and this lets me completely customise it to my needs Just because! While it is possible to make a keyboard completely from scratch, it makes much more sense to put together some premade parts. The parts you need are: PCB (printed circuit board): the backbone of the keyboard, to which all the other electrical components attach; this defines the possible physical locations for each key Switches: one for each key to complete a circuit whenever you press it Keycaps: switches are pretty ugly and pretty uncomfortable to press, so each one gets a cap; these are what you probably think of as the “keys” on your keyboard, come in an almost limitless variety of designs (within the obvious size limitation) and are the easiest bit of personalisation Controller: the clever bit, which detects open and closed switches on the PCB and tells your computer what keys you pressed via a USB cable Firmware: the program that runs on the controller; it starts off as source code like any other program, and altering it can make the keyboard behave in loads of different ways, from different layouts to multiple layers accessed by holding a particular key, to macros and even emulating a mouse! In my case, I’ve gone for the following: PCB Laplace from keeb.io, a very compact 47-key (“40%”) board, with no number pad, function keys or number row, but a lot of flexibility for key placement on the bottom row. One of my key design goals was small size so I can just pop it in my bag and have it on my lap on the train. Controller Elite-C, designed specifically for keyboard builds to be physically compatible with the cheaper Pro Micro, with a more robust USB port (the Pro Micro’s has a tendency to snap off), and made easier to program with a built-in reset button and better bootloader. Switches Gateron Black: Gateron is one of a number of manufacturers of mechanical switches compatible with the popular Cherry range. The black switch is linear (no click or bump at the activation point) and slightly heavier sprung than the more common red. Cherry also make a black switch, but the Gateron version is slightly lighter and, having tested a few, I found them smoother too.
My key goal here was to reduce noise, as the stronger spring will help me type accurately without hitting the bottom of the keystroke with an audible sound. Keycaps Blank grey PBT in DSA profile: this keyboard layout has a lot of non-standard sized keys, so blank keycaps meant that I wouldn’t be putting lots of keys out of their usual position; they’re also relatively cheap, fairly classy IMHO and a good placeholder until I end up getting some really cool caps on a group buy or something; oh, and it minimises the chance of someone else trying the keyboard and getting freaked out by the layout… Firmware QMK (Quantum Mechanical Keyboard), with a work-in-progress layout based on the Dvorak Simplified Keyboard. QMK has a lot of features and allows you to fully program each and every key, with multiple layers accessed through several different routes. Because there are so few keys on this board, I’ll need to make good use of layers to make all the keys on a usual keyboard available. I’m grateful to the folks of the Leeds Hack Space, especially Nav & Mark, who patiently coached me in various soldering techniques and good practice, but also everyone else, who were so friendly and welcoming and interested in my project. I’m really pleased with the result, which is small, light and fully customisable. Playing with QMK firmware features will keep me occupied for quite a while! This isn’t the end though, as I’ll need a case to keep the dust out. I’m hoping to be able to 3D print this or mill it from wood with a CNC mill, for which I’ll need to head back to the Hack Space! Less, but better “Weniger, aber besser” — Dieter Rams I can barely believe it’s a full year since I published my intentions for 2018. A lot has happened since then. Principally: in November I started a new job as Data Services Lead at The British Library. One thing that hasn’t changed is my tendency to try to do too much, so this year I’m going to try and focus on a single intention, a translation of designer Dieter Rams’ famous quote above: Less, but better. This chimes with a couple of other things I was toying with over the Christmas break, as they’re essentially other ways of saying the same thing: Take it steady One thing at a time I’m also going to keep in mind those touchstones from last year: What difference is this making? Am I looking after myself? Do I have evidence for this? I mainly forget to think about them, so I’ll be sticking up post-its everywhere to help me remember! How to extend Python with Rust: part 1 Python is great, but I find it useful to have an alternative language under my belt for occasions when no amount of Pythonic cleverness will make some bit of code run fast enough. One of my main reasons for wanting to learn Rust was to have something better than C for that. Not only does Rust have all sorts of advantages that make it a good choice for code that needs to run fast and correctly, it’s also got a couple of rather nice crates (libraries) that make interfacing with Python a lot nicer. Here’s a little tutorial to show you how easy it is to call a simple Rust function from Python. If you want to try it yourself, you’ll find the code on GitHub. !!! prerequisites I’m assuming for this tutorial that you’re already familiar with writing Python scripts and importing & using packages, and that you’re comfortable using the command line. You’ll also need to have installed Rust. The Rust bit The quickest way to get compiled code into Python is to use the built-in ctypes package.
This is Python’s “Foreign Function Interface” or FFI: a means of calling functions outside the language you’re using to make the call. ctypes allows us to call arbitrary functions in a shared library1, as long as those functions conform to certain standard C language calling conventions. Thankfully, Rust tries hard to make it easy for us to build such a shared library. The first thing to do is to create a new project with cargo, the Rust build tool: $ cargo new rustfrompy Created library `rustfrompy` project $ tree . ├── Cargo.toml └── src └── lib.rs 1 directory, 2 files !!! aside I use the fairly common convention that text set in fixed-width font is either example code or commands to type in. For the latter, a $ precedes the command that you type (omit the $), and lines that don’t start with a $ are output from the previous command. I assume a basic familiarity with Unix-style command line, but I should probably put in some links to resources if you need to learn more! We need to edit the Cargo.toml file and add a [lib] section: [package] name = "rustfrompy" version = "0.1.0" authors = ["Jez Cope <j.cope@erambler.co.uk>"] [dependencies] [lib] name = "rustfrompy" crate-type = ["cdylib"] This tells cargo that we want to make a C-compatible dynamic library (crate-type = ["cdylib"]) and what to call it, plus some standard metadata. We can then put our code in src/lib.rs. We’ll just use a simple toy function that adds two numbers together: #[no_mangle] pub fn add(a: i64, b: i64) -> i64 { a + b } Notice the pub keyword, which instructs the compiler to make this function accessible to other modules, and the #[no_mangle] annotation, which tells it to use the standard C naming conventions for functions. If we don’t do this, then Rust will generate a new name for the function for its own nefarious purposes, and as a side effect we won’t know what to call it when we want to use it from Python. Being good developers, let’s also add a test: #[cfg(test)] mod test { use ::*; #[test] fn test_add() { assert_eq!(4, add(2, 2)); } } We can now run cargo test which will compile that code and run the test: $ cargo test Compiling rustfrompy v0.1.0 (file:///home/jez/Personal/Projects/rustfrompy) Finished dev [unoptimized + debuginfo] target(s) in 1.2 secs Running target/debug/deps/rustfrompy-3033caaa9f5f17aa running 1 test test test::test_add ... ok test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out Everything worked! Now just to build that shared library and we can try calling it from Python: $ cargo build Compiling rustfrompy v0.1.0 (file:///home/jez/Personal/Projects/rustfrompy) Finished dev [unoptimized + debuginfo] target(s) in 0.30 secs Notice that the build is unoptimized and includes debugging information: this is useful in development, but once we’re ready to use our code it will run much faster if we compile it with optimisations. Cargo makes this easy: $ cargo build --release Compiling rustfrompy v0.1.0 (file:///home/jez/Personal/Projects/rustfrompy) Finished release [optimized] target(s) in 0.30 secs The Python bit After all that, the Python bit is pretty short. First we import the ctypes package (which is included in all recent Python versions): from ctypes import cdll Cargo has tidied our shared library away into a folder, so we need to tell Python where to load it from. 
On Linux, it will be called lib<something>.so where the “something” is the crate name from Cargo.toml, “rustfrompy”: lib = cdll.LoadLibrary('target/release/librustfrompy.so') Finally we can call the function anywhere we want. Here it is in a pytest-style test: def test_rust_add(): assert lib.add(27, 15) == 42 If you have pytest installed (and you should!) you can run the whole test like this: $ pytest --verbose test.py ====================================== test session starts ====================================== platform linux -- Python 3.6.4, pytest-3.1.1, py-1.4.33, pluggy-0.4.0 -- /home/jez/.virtualenvs/datasci/bin/python cachedir: .cache rootdir: /home/jez/Personal/Projects/rustfrompy, inifile: collected 1 items test.py::test_rust_add PASSED It worked! I’ve put both the Rust and Python code on GitHub if you want to try it for yourself. Shortcomings Ok, so that was a pretty simple example, and I glossed over a lot of things. For example, what would happen if we did lib.add(2.0, 2)? This causes Python to throw an error because our Rust function only accepts integers (64-bit signed integers, i64, to be precise), and we gave it a floating point number. ctypes can’t guess what type(s) a given function will work with, but it can at least tell us when we get it wrong. To fix this properly, we need to do some extra work, telling the ctypes library what the argument and return types for each function are. For a more complex library, there will probably be more housekeeping to do, such as translating return codes from functions into more Pythonic-style errors. For a small example like this there isn’t much of a problem, but the bigger your compiled library the more extra boilerplate is required on the Python side just to use all the functions. When you’re working with an existing library you don’t have much choice about this, but if you’re building it from scratch specifically to interface with Python, there’s a better way using the Python C API. You can call this directly in Rust, but there are a couple of Rust crates that make life much easier, and I’ll be taking a look at those in a future blog post. .so on Linux, .dylib on Mac and .dll on Windows ↩︎ New Year’s irresolution Photo by Andrew Hughes on Unsplash I’ve chosen not to make any specific resolutions this year; I’ve found that they just don’t work for me. Like many people, all I get is a sense of guilt when I inevitably fail to live up to the expectations I set myself at the start of the year. However, I have set a few of what I’m referring to as “themes” for the year: touchstones that I’ll aim to refer to when setting priorities or just feeling a bit overwhelmed or lacking in direction. They are: Contribution Self-care Measurement I may do some blog posts expanding on these, but in the meantime, I’ve put together a handful of questions to help me think about priorities and get perspective when I’m doing (or avoiding doing) something. What difference is this making? I feel more motivated when I can figure out how I’m contributing to something bigger than myself. In society? In my organisation? To my friends & family? Am I looking after myself? I focus a lot on the expectations others have (or at least that I think they have) of me, but I can’t do anything well unless I’m generally happy and healthy. Is this making me happier and healthier? Is this building my capacity to look after myself, my family & friends and do my job? Is this worth the amount of time and energy I’m putting in? Do I have evidence for this?
I don’t have to base decisions purely on feelings/opinions: I have the skills to obtain, analyse and interpret data. Is this fact or opinion? What are the facts? Am I overthinking this? Can I put a confidence interval for this? Build documents from code and data with Saga !!! tldr “TL;DR” I’ve made Saga, a thing for compiling documents by combining code and data with templates. What is it? Saga is a very simple command-line tool that reads in one or more data files, runs one or more scripts, then passes the results into a template to produce a final output document. It enables you to maintain a clean separation between data, logic and presentation and produce data-based documents that can easily be updated. That allows the flow of data through the document to be easily understood, a cornerstone of reproducible analysis. You run it like this: saga build -d data.yaml -d other_data.yaml \ -s analysis.py -t report.md.tmpl \ -O report.md Any scripts specified with -s will have access to the data in local variables, and any changes to local variables in a script will be retained when everything is passed to the template for rendering. For debugging, you can also do: saga dump -d data.yaml -d other_data.yaml -s analysis.py which will print out the full environment that would be passed to your template with saga build. Features Right now this is a really early version. It does the job but I have lots of ideas for features to add if I ever have time. At present it does the following: Reads data from one or more YAML files Transforms data with one or more Python scripts Renders a template in Mako format Works with any plain-text output format, including Markdown, LaTeX and HTML Use cases Write reproducible reports & papers based on machine-readable data Separate presentation from content in any document, e.g. your CV (example coming soon) Yours here? Get it! I haven’t released this on PyPI yet, but all the code is available on GitHub to try out. If you have pipenv installed (and if you use Python you should!), you can try it out in an isolated virtual environment by doing: git clone https://github.com/jezcope/sagadoc.git cd sagadoc pipenv install pipenv run saga or you can set up for development and run some tests: pipenv install --dev pipenv run pytest Why? Like a lot of people, I have to produce reports for work, often containing statistics computed from data. Although these generally aren’t academic research papers, I see no reason not to aim for a similar level of reproducibility: after all, if I’m telling other people to do it, I’d better take my own advice! A couple of times now I’ve done this by writing a template that holds the text of the report and placeholders for values, along with a Python script that reads in the data, calculates the statistics I want and completes the template. This is valuable for two main reasons: If anyone wants to know how I processed the data and calculated those statistics, it’s all there: no need to try and remember and reproduce a series of button clicks in Excel; If the data or calculations change, I just need to update the relevant part and run it again, and all the relevant parts of the document will be updated. This is particularly important if changing a single data value requires recalculation of dozens of tables, charts, etc. It also gives me the potential to factor out and reuse bits of code in the future, add tests and version control everything. 
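To make that workflow concrete, here is a minimal sketch of the ad-hoc pattern described above in plain Python. It is not Saga's own code: it deliberately uses only the standard library (JSON input and string.Template) rather than the YAML and Mako that Saga works with, and the file names and fields are made up for the example.
# make_report.py — minimal sketch: read data, compute statistics, fill a template.
# The file names and fields here are hypothetical; this is not part of Saga.
import json
import statistics
from string import Template

# 1. Read the machine-readable data.
with open("survey.json") as f:
    data = json.load(f)
scores = data["satisfaction_scores"]

# 2. Calculate the statistics the report needs.
context = {
    "n": len(scores),
    "mean": round(statistics.mean(scores), 1),
    "median": statistics.median(scores),
}

# 3. Fill the placeholders in a plain-text template and write the report.
template = Template(
    "We received $n responses; the mean satisfaction score was $mean (median $median).\n"
)
with open("report.txt", "w") as out:
    out.write(template.substitute(context))
When the data changes, re-running this one script refreshes every derived number in the report, which is exactly the property that makes it worth packaging up into something reusable.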
Now that I’ve done this more than once (and it seems likely I’ll do it again), it makes sense to package that script up in a more portable form so I don’t have to write it over and over again (or, shock horror, copy & paste it!). It saves time, and gives others the possibility to make use of it. Prior art I’m not the first person to think of this, but I couldn’t find anything that did exactly what I needed. Several tools will let you interweave code and prose, including the results of evaluating each code snippet in the document: chief among these are Jupyter and Rmarkdown. There are also tools that let you write code in the order that makes most sense to read and then rearrange it into the right order to execute, so-called literate programming. The original tool for this is the venerable noweb. Sadly there are very few tools that combine both of these and allow you to insert the results of various calculations at arbitrary points in a document, independent of the order of either presenting or executing the code. The only two that I’m aware of are Dexy and org-mode. Unfortunately, Dexy currently only works on Legacy Python (/Python 2) and org-mode requires emacs (which is fine but not exactly portable). Rmarkdown comes close and supports a range of languages, but the full feature set is only available with R. Actually, my ideal solution is org-mode without the emacs dependency, because that’s the most flexible solution; maybe one day I’ll have both the time and skill to implement that. It’s also possible I might be able to figure out Dexy’s internals to add what I want to it, but until then Saga does the job! Future work There are lots of features that I’d still like to add when I have time: Some actual documentation! And examples! More data formats (e.g. CSV, JSON, TOML) More languages (e.g. R, Julia) Fetching remote data over HTTP Caching of intermediate results to speed up rebuilds For now, though, I’d love for you to try it out and let me know what you think! As ever, comment here, tweet me or start an issue on GitHub. Why try Rust for scientific computing? When you’re writing analysis code, Python (or R, or JavaScript, or …) is usually the right choice. These high-level languages are set up to make you as productive as possible, and common tasks like array manipulation have been well optimised. However, sometimes you just can’t get enough speed and need to turn to a lower-level compiled language. Often that will be C, C++ or Fortran, but I thought I’d do a short post on why I think you should consider Rust. One of my goals for 2017’s Advent of Code was to learn a modern, memory-safe, statically-typed language. I now know that there are quite a lot of options in this space, but two seem to stand out: Go & Rust. I gave both of them a try, and although I’ll probably go back to give Go a more thorough test at some point, I found I got quite hooked on Rust. Both languages, though young, are definitely production-ready. Servo, the core of the new Firefox browser, is entirely written in Rust. In fact, Mozilla had been trying to rewrite the rendering core in C++ for nearly a decade, and switching to Rust let them get it done in just a couple of years. !!!
tldr “TL;DR” - It’s fast: competitive with idiomatic C/C++, and no garbage-collection overhead - It’s harder to write buggy code, and compiler errors are actually helpful - It’s C-compatible: you can call into Rust code anywhere you’d call into C, call C/C++ from Rust, and incrementally replace C/C++ code with Rust - It has sensible modern syntax that makes your code clearer and more concise - Support for scientific computing is getting better all the time (matrix algebra libraries, built-in SIMD, safe concurrency) - It has a really friendly and active community - It’s production-ready: Servo, the new rendering core in Firefox, is built entirely in Rust Performance To start with, as a compiled language Rust executes much faster than a (pseudo-)interpreted language like Python or R; the price you pay for this is time spent compiling during development. However, having a compile step also allows the language to enforce certain guarantees, such as type-correctness and memory safety, which between them prevent whole classes of bugs from even being possible. Unlike Go (which, like many higher-level languages, uses a garbage collector), Rust handles memory safety at compile time through the concepts of ownership and borrowing. These can take some getting used to and were a big source of frustration when I was first figuring out the language, but ultimately they contribute to Rust’s reliably fast performance. Performance can be unpredictable in a garbage-collected language because you can’t be sure when the GC is going to run, and you need to understand it really well to stand a chance of optimising it if it becomes a problem. In Rust, on the other hand, code that has the potential to be unsafe will result in compilation errors. There are a number of benchmarks (example) that show Rust’s performance on a par with idiomatic C & C++ code, something that very few languages can boast. Helpful error messages Because beginner Rust programmers often get compile errors, it’s really important that those errors are easy to interpret and fix, and Rust is great at this. Not only does it tell you what went wrong, but wherever possible it prints out your code annotated with arrows to show exactly where the error is, and makes specific suggestions for how to fix the error, which usually turn out to be correct. It also has a nice suite of warnings (things that don’t cause compilation to fail but may indicate bugs) that are just as informative, and this can be extended even further by using the clippy linting tool to further analyse your code. warning: unused variable: `y` --> hello.rs:3:9 | 3 | let y = x; | ^ | = note: #[warn(unused_variables)] on by default = note: to avoid this warning, consider using `_y` instead Easy to integrate with other languages If you’re like me, you’ll probably only use a low-level language for performance-critical code that you can call from a high-level language, and this is an area where Rust shines. Most programmers will turn to C, C++ or Fortran for this because they have a well-established ABI (Application Binary Interface) which can be understood by languages like Python and R1. In Rust, it’s trivial to make a C-compatible shared library, and the standard library includes extra features for working with C types. That also means that existing C code can be incrementally ported to Rust: see remacs for an example.
On top of this, there are projects like rust-cpython and PyO3 which provide macros and structures that wrap the Python C API to let you build Python modules in Rust with minimal glue code; rustr does a similar job for R. Nice language features Rust has some really nice features, which let you write efficient, concise and correct code. Several feel particularly comfortable as they remind me of similar things available in Haskell, including: Enums, a super-powered combination of C enums and unions (similar to Haskell’s algebraic data types) that enable some really nice code with no runtime cost Generics and traits that let you get more done with less code Pattern matching, a kind of case statement that lets you extract parts of structs, tuples & enums and do all sorts of other clever things Lazy computation based on an iterator pattern, for efficient processing of lists of things: you can do for item in list { ... } instead of the C-style use of an index2, or you can use higher-order functions like map and filter Functions/closures as first-class citizens Scientific computing Although it’s a general-purpose language and not designed specifically for scientific computing, Rust’s support is improving all the time. There are some interesting matrix algebra libraries available, and built-in SIMD is incoming. The memory safety features also work to ensure thread safety, so it’s harder to write concurrency bugs. You should be able to use your favourite MPI implementation too, and there’s at least one attempt to portably wrap MPI in a more Rust-like way. Active development and friendly community One of the things you notice straight away is how active and friendly the Rust community is. There are several IRC channels on irc.mozilla.org including #rust-beginners, which is a great place to get help. The compiler is under constant but carefully-managed development, so that new features are landing all the time but without breaking existing code. And the fabulous Cargo build tool and crates.io are enabling the rapid growth of a healthy ecosystem of open source libraries that you can use to write less code yourself. Summary So, next time you need a compiled language to speed up hotspots in your code, try Rust. I promise you won’t regret it! Julia actually allows you to call C and Fortran functions as a first-class language feature ↩︎ Actually, since C++11 there’s for (auto item : list) { ... } but still… ↩︎ Reflections on #aoc2017 Trees reflected in a lake Joshua Reddekopp on Unsplash It seems like ages ago, but way back in November I committed to completing Advent of Code. I managed it all, and it was fun! All of my code is available on GitHub if you’re interested in seeing what I did, and I managed to get out a blog post for every one with a bit more commentary, which you can see in the series list above. How did I approach it? I’ve not really done any serious programming challenges before. I don’t get to write a lot of code at the moment, so all I wanted from AoC was an excuse to do some proper problem-solving. I never really intended to take a polyglot approach, though I did think that I might use mainly Python with a bit of Haskell. In the end, though, I used: Python (×12); Haskell (×7); Rust (×4); Go; C++; Ruby; Julia; and Coconut. For the most part, my priorities were getting the right answer, followed by writing readable code. I didn’t specifically focus on performance but did try to avoid falling into traps that I knew about. What did I learn? 
I found Python the easiest to get on with: it’s the language I know best and although I can’t always remember exact method names and parameters I know what’s available and where to look to remind myself, as well as most of the common idioms and some performance traps to avoid. Python was therefore the language that let me focus most on solving the problem itself. C++ and Ruby were more challenging, and it was harder to write good idiomatic code but I can still remember quite a lot. Haskell I haven’t used since university, and just like back then I really enjoyed working out how to solve problems in a functional style while still being readable and efficient (not always something I achieved…). I learned a lot about core Haskell concepts like monads & functors, and I’m really amazed by the way the Haskell community and ecosystem has grown up in the last decade. I also wanted to learn at least one modern, memory-safe compiled language, so I tried both Go and Rust. Both seem like useful languages, but Rust really intrigued me with its conceptual similarities to both Haskell and C++ and its promise of memory safety without a garbage collector. I struggled a lot initially with the “borrow checker” (the component that enforces memory safety at compile time) but eventually started thinking in terms of ownership and lifetimes after which things became easier. The Rust community seems really vibrant and friendly too. What next? I really want to keep this up, so I’m going to look out some more programming challenges (Project Euler looks interesting). It turns out there’s a regular Code Dojo meetup in Leeds, so hopefully I’ll try that out too. I’d like to do more realistic data-science stuff, so I’ll be taking a closer look at stuff like Kaggle too, and figuring out how to do a bit more analysis at work. I’m also feeling motivated to find an open source project to contribute to and/or release a project of my own, so we’ll see if that goes anywhere! I’ve always found the advice to “scratch your own itch” difficult to follow because everything I think of myself has already been done better. Most of the projects I use enough to want to contribute to tend to be pretty well developed with big communities and any bugs that might be accessible to me will be picked off and fixed before I have a chance to get started. Maybe it’s time to get over myself and just reimplement something that already exists, just for the fun of it! The Halting Problem — Python — #adventofcode Day 25 Today’s challenge, takes us back to a bit of computing history: a good old-fashioned Turing Machine. → Full code on GitHub !!! commentary Today’s challenge was a nice bit of nostalgia, taking me back to my university days learning about the theory of computing. Turing Machines are a classic bit of computing theory, and are provably able to compute any value that is possible to compute: a value is computable if and only if a Turing Machine can be written that computes it (though in practice anything non-trivial is mind-bendingly hard to write as a TM). A bit of a library-fest today, compared to other days! from collections import deque, namedtuple from collections.abc import Iterator from tqdm import tqdm import re import fileinput as fi These regular expressions are used to parse the input that defines the transition table for the machine. 
RE_ISTATE = re.compile(r'Begin in state (?P<state>\w+)\.') RE_RUNTIME = re.compile( r'Perform a diagnostic checksum after (?P<steps>\d+) steps.') RE_STATETRANS = re.compile( r"In state (?P<state>\w+):\n" r" If the current value is (?P<read0>\d+):\n" r" - Write the value (?P<write0>\d+)\.\n" r" - Move one slot to the (?P<move0>left|right).\n" r" - Continue with state (?P<next0>\w+).\n" r" If the current value is (?P<read1>\d+):\n" r" - Write the value (?P<write1>\d+)\.\n" r" - Move one slot to the (?P<move1>left|right).\n" r" - Continue with state (?P<next1>\w+).") MOVE = {'left': -1, 'right': 1} A namedtuple to provide some sugar when using a transition rule. Rule = namedtuple('Rule', 'write move next_state') The TuringMachine class does all the work. class TuringMachine: def __init__(self, program=None): self.tape = deque() self.transition_table = {} self.state = None self.runtime = 0 self.steps = 0 self.pos = 0 self.offset = 0 if program is not None: self.load(program) def __str__(self): return f"Current: {self.state}; steps: {self.steps} of {self.runtime}" Some jiggery-pokery to allow us to use self[pos] to reference an infinite tape. def __getitem__(self, i): i += self.offset if i < 0 or i >= len(self.tape): return 0 else: return self.tape[i] def __setitem__(self, i, x): i += self.offset if i >= 0 and i < len(self.tape): self.tape[i] = x elif i == -1: self.tape.appendleft(x) self.offset += 1 elif i == len(self.tape): self.tape.append(x) else: raise IndexError('Tried to set position off end of tape') Parse the program and set up the transtion table. def load(self, program): if isinstance(program, Iterator): program = ''.join(program) match = RE_ISTATE.search(program) self.state = match['state'] match = RE_RUNTIME.search(program) self.runtime = int(match['steps']) for match in RE_STATETRANS.finditer(program): self.transition_table[match['state']] = { int(match['read0']): Rule(write=int(match['write0']), move=MOVE[match['move0']], next_state=match['next0']), int(match['read1']): Rule(write=int(match['write1']), move=MOVE[match['move1']], next_state=match['next1']), } Run the program for the required number of steps (given by self.runtime). tqdm isn’t in the standard library but it should be: it shows a lovely text-mode progress bar as we go. def run(self): for _ in tqdm(range(self.runtime), desc="Running", unit="steps", unit_scale=True): read = self[self.pos] rule = self.transition_table[self.state][read] self[self.pos] = rule.write self.pos += rule.move self.state = rule.next_state Calculate the “diagnostic checksum” required for the answer. @property def checksum(self): return sum(self.tape) Aaand GO! machine = TuringMachine(fi.input()) machine.run() print("Checksum:", machine.checksum) Electromagnetic Moat — Rust — #adventofcode Day 24 Today’s challenge, the penultimate, requires us to build a bridge capable of reaching across to the CPU, our final destination. → Full code on GitHub !!! commentary We have a finite number of components that fit together in a restricted way from which to build a bridge, and we have to work out both the strongest and the longest bridge we can build. The most obvious way to do this is to recursively build every possible bridge and select the best, but that’s an O(n!) algorithm that could blow up quickly, so might as well go with a nice fast language! Might have to try this in Haskell too, because it’s the type of algorithm that lends itself naturally to a pure functional approach. 
I feel like I've applied some of the things I've learned in previous challenges I used Rust for, and spent less time mucking about with ownership, and made better use of various language features, including structs and iterators. I'm rather pleased with how my learning of this language is progressing. I'm definitely overusing `Option.unwrap` at the moment though: this is a lazy way to deal with `Option` results and will panic if the result is not what's expected. I'm not sure whether I need to be cloning the components `Vector` either, or whether I could just be passing iterators around. First, we import some bits of standard library and define some data types. The BridgeResult struct lets us use the same algorithm for both parts of the challenge and simply change the value used to calculate the maximum. use std::io; use std::fmt; use std::io::BufRead; #[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)] struct Component(u8, u8); #[derive(Debug, Copy, Clone, Default)] struct BridgeResult { strength: u16, length: u16, } impl Component { fn from_str(s: &str) -> Component { let parts: Vec<&str> = s.split('/').collect(); assert!(parts.len() == 2); Component(parts[0].parse().unwrap(), parts[1].parse().unwrap()) } fn fits(self, port: u8) -> bool { self.0 == port || self.1 == port } fn other_end(self, port: u8) -> u8 { if self.0 == port { return self.1; } else if self.1 == port { return self.0; } else { panic!("{} doesn't fit port {}", self, port); } } fn strength(self) -> u16 { self.0 as u16 + self.1 as u16 } } impl fmt::Display for BridgeResult { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { write!(f, "(S: {}, L: {})", self.strength, self.length) } } best_bridge calculates the length and strength of the “best” bridge that can be built from the remaining components and fits the required port. Whether this is based on strength or length is given by the key parameter, which is passed to Iter.max_by_key. fn best_bridge<F>(port: u8, key: &F, components: &Vec<Component>) -> Option<BridgeResult> where F: Fn(&BridgeResult) -> u16 { if components.len() == 0 { return None; } components.iter() .filter(|c| c.fits(port)) .map(|c| { let b = best_bridge(c.other_end(port), key, &components.clone().into_iter() .filter(|x| x != c).collect()) .unwrap_or_default(); BridgeResult{strength: c.strength() + b.strength, length: 1 + b.length} }) .max_by_key(key) } Now all that remains is to read the input and calculate the result. I was rather pleasantly surprised to find that in spite of my pessimistic predictions about efficiency, when compiled with optimisations turned on this terminates in less than 1s on my laptop. fn main() { let stdin = io::stdin(); let components: Vec<_> = stdin.lock() .lines() .map(|l| Component::from_str(&l.unwrap())) .collect(); match best_bridge(0, &|b: &BridgeResult| b.strength, &components) { Some(b) => println!("Strongest bridge is {}", b), None => println!("No strongest bridge found") }; match best_bridge(0, &|b: &BridgeResult| b.length, &components) { Some(b) => println!("Longest bridge is {}", b), None => println!("No longest bridge found") }; } Coprocessor Conflagration — Haskell — #adventofcode Day 23 Today’s challenge requires us to understand why a coprocessor is working so hard to perform an apparently simple calculation. → Full code on GitHub !!! commentary Today’s problem is based on an assembly-like language very similar to day 18, so I went back and adapted my code from that, which works well for the first part. 
I’ve also incorporated some advice from /r/haskell, and cleaned up all warnings shown by the -Wall compiler flag and the hlint tool. Part 2 requires the algorithm to run with much larger inputs, and since some analysis shows that it's an `O(n^3)` algorithm it gets intractible pretty fast. There are several approaches to this. First up, if you have a fast enough processor and an efficient enough implementation I suspect that the simulation would probably terminate eventually, but that would likely still take hours: not good enough. I also thought about doing some peephole optimisations on the instructions, but the last time I did compiler optimisation was my degree so I wasn't really sure where to start. What I ended up doing was actually analysing the input code by hand to figure out what it was doing, and then just doing that calculation in a sensible way. I'd like to say I managed this on my own (and I ike to think I would have) but I did get some tips on [/r/adventofcode](https://reddit.com/r/adventofcode). The majority of this code is simply a cleaned-up version of day 18, with some tweaks to accommodate the different instruction set: module Main where import qualified Data.Vector as V import qualified Data.Map.Strict as M import Control.Monad.State.Strict import Text.ParserCombinators.Parsec hiding (State) type Register = Char type Value = Int type Argument = Either Value Register data Instruction = Set Register Argument | Sub Register Argument | Mul Register Argument | Jnz Argument Argument deriving Show type Program = V.Vector Instruction data Result = Cont | Halt deriving (Eq, Show) type Registers = M.Map Char Int data Machine = Machine { dRegisters :: Registers , dPtr :: !Int , dMulCount :: !Int , dProgram :: Program } instance Show Machine where show d = show (dRegisters d) ++ " @" ++ show (dPtr d) ++ " ×" ++ show (dMulCount d) defaultMachine :: Machine defaultMachine = Machine M.empty 0 0 V.empty type MachineState = State Machine program :: GenParser Char st Program program = do instructions <- endBy instruction eol return $ V.fromList instructions where instruction = try (regOp "set" Set) <|> regOp "sub" Sub <|> regOp "mul" Mul <|> jump "jnz" Jnz regOp n c = do string n >> spaces val1 <- oneOf "abcdefgh" secondArg c val1 jump n c = do string n >> spaces val1 <- regOrVal secondArg c val1 secondArg c val1 = do spaces val2 <- regOrVal return $ c val1 val2 regOrVal = register <|> value register = do name <- lower return $ Right name value = do val <- many $ oneOf "-0123456789" return $ Left $ read val eol = char '\n' parseProgram :: String -> Either ParseError Program parseProgram = parse program "" getReg :: Char -> MachineState Int getReg r = do st <- get return $ M.findWithDefault 0 r (dRegisters st) putReg :: Char -> Int -> MachineState () putReg r v = do st <- get let current = dRegisters st new = M.insert r v current put $ st { dRegisters = new } modReg :: (Int -> Int -> Int) -> Char -> Argument -> MachineState () modReg op r v = do u <- getReg r v' <- getRegOrVal v putReg r (u `op` v') incPtr getRegOrVal :: Argument -> MachineState Int getRegOrVal = either return getReg addPtr :: Int -> MachineState () addPtr n = do st <- get put $ st { dPtr = n + dPtr st } incPtr :: MachineState () incPtr = addPtr 1 execInst :: Instruction -> MachineState () execInst (Set reg val) = do newVal <- getRegOrVal val putReg reg newVal incPtr execInst (Mul reg val) = do result <- modReg (*) reg val st <- get put $ st { dMulCount = 1 + dMulCount st } return result execInst (Sub reg val) = modReg 
(-) reg val execInst (Jnz val1 val2) = do test <- getRegOrVal val1 jump <- if test /= 0 then getRegOrVal val2 else return 1 addPtr jump execNext :: MachineState Result execNext = do st <- get let prog = dProgram st p = dPtr st if p >= length prog then return Halt else do execInst (prog V.! p) return Cont runUntilTerm :: MachineState () runUntilTerm = do result <- execNext unless (result == Halt) runUntilTerm This implements the actual calculation: the number of non-primes between (for my input) 107900 and 124900: optimisedCalc :: Int -> Int -> Int -> Int optimisedCalc a b k = sum $ map (const 1) $ filter notPrime [a,a+k..b] where notPrime n = elem 0 $ map (mod n) [2..(floor $ sqrt (fromIntegral n :: Double))] main :: IO () main = do input <- getContents case parseProgram input of Right prog -> do let c = defaultMachine { dProgram = prog } (_, c') = runState runUntilTerm c putStrLn $ show (dMulCount c') ++ " multiplications made" putStrLn $ "Calculation result: " ++ show (optimisedCalc 107900 124900 17) Left e -> print e Sporifica Virus — Rust — #adventofcode Day 22 Today’s challenge has us helping to clean up (or spread, I can’t really tell) an infection of the “sporifica” virus. → Full code on GitHub !!! commentary I thought I’d have another play with Rust, as its Haskell-like features resonate with me at the moment. I struggled quite a lot with the Rust concepts of ownership and borrowing, and this is a cleaned-up version of the code based on some good advice from the folks on /r/rust. use std::io; use std::env; use std::io::BufRead; use std::collections::HashMap; #[derive(PartialEq, Clone, Copy, Debug)] enum Direction {Up, Right, Down, Left} #[derive(PartialEq, Clone, Copy, Debug)] enum Infection {Clean, Weakened, Infected, Flagged} use self::Direction::*; use self::Infection::*; type Grid = HashMap<(isize, isize), Infection>; fn turn_left(d: Direction) -> Direction { match d {Up => Left, Right => Up, Down => Right, Left => Down} } fn turn_right(d: Direction) -> Direction { match d {Up => Right, Right => Down, Down => Left, Left => Up} } fn turn_around(d: Direction) -> Direction { match d {Up => Down, Right => Left, Down => Up, Left => Right} } fn make_move(d: Direction, x: isize, y: isize) -> (isize, isize) { match d { Up => (x-1, y), Right => (x, y+1), Down => (x+1, y), Left => (x, y-1), } } fn basic_step(grid: &mut Grid, x: &mut isize, y: &mut isize, d: &mut Direction) -> usize { let mut infect = 0; let current = match grid.get(&(*x, *y)) { Some(v) => *v, None => Clean, }; if current == Infected { *d = turn_right(*d); } else { *d = turn_left(*d); infect = 1; }; grid.insert((*x, *y), match current { Clean => Infected, Infected => Clean, x => panic!("Unexpected infection state {:?}", x), }); let new_pos = make_move(*d, *x, *y); *x = new_pos.0; *y = new_pos.1; infect } fn nasty_step(grid: &mut Grid, x: &mut isize, y: &mut isize, d: &mut Direction) -> usize { let mut infect = 0; let new_state: Infection; let current = match grid.get(&(*x, *y)) { Some(v) => *v, None => Infection::Clean, }; match current { Clean => { *d = turn_left(*d); new_state = Weakened; }, Weakened => { new_state = Infected; infect = 1; }, Infected => { *d = turn_right(*d); new_state = Flagged; }, Flagged => { *d = turn_around(*d); new_state = Clean; } }; grid.insert((*x, *y), new_state); let new_pos = make_move(*d, *x, *y); *x = new_pos.0; *y = new_pos.1; infect } fn virus_infect<F>(mut grid: Grid, mut step: F, mut x: isize, mut y: isize, mut d: Direction, n: usize) -> usize where F: FnMut(&mut Grid, &mut isize, &mut 
isize, &mut Direction) -> usize, { (0..n).map(|_| step(&mut grid, &mut x, &mut y, &mut d)) .sum() } fn main() { let args: Vec<String> = env::args().collect(); let n_basic: usize = args[1].parse().unwrap(); let n_nasty: usize = args[2].parse().unwrap(); let stdin = io::stdin(); let lines: Vec<String> = stdin.lock() .lines() .map(|x| x.unwrap()) .collect(); let mut grid: Grid = HashMap::new(); let x0 = (lines.len() / 2) as isize; let y0 = (lines[0].len() / 2) as isize; for (i, line) in lines.iter().enumerate() { for (j, c) in line.chars().enumerate() { grid.insert((i as isize, j as isize), match c {'#' => Infected, _ => Clean}); } } let basic_steps = virus_infect(grid.clone(), basic_step, x0, y0, Up, n_basic); println!("Basic: infected {} times", basic_steps); let nasty_steps = virus_infect(grid, nasty_step, x0, y0, Up, n_nasty); println!("Nasty: infected {} times", nasty_steps); } Fractal Art — Python — #adventofcode Day 21 Today’s challenge asks us to assist an artist building fractal patterns from a rulebook. → Full code on GitHub !!! commentary Another fairly straightforward algorithm: the really tricky part was breaking the pattern up into chunks and rejoining it again. I could probably have done that more efficiently, and would have needed to if I had to go for a few more iterations and the grid grows with every iteration and gets big fast. Still behind on the blog posts… import fileinput as fi from math import sqrt from functools import reduce, partial import operator INITIAL_PATTERN = ((0, 1, 0), (0, 0, 1), (1, 1, 1)) DECODE = ['.', '#'] ENCODE = {'.': 0, '#': 1} concat = partial(reduce, operator.concat) def rotate(p): size = len(p) return tuple(tuple(p[i][j] for i in range(size)) for j in range(size - 1, -1, -1)) def flip(p): return tuple(p[i] for i in range(len(p) - 1, -1, -1)) def permutations(p): yield p yield flip(p) for _ in range(3): p = rotate(p) yield p yield flip(p) def print_pattern(p): print('-' * len(p)) for row in p: print(' '.join(DECODE[x] for x in row)) print('-' * len(p)) def build_pattern(s): return tuple(tuple(ENCODE[c] for c in row) for row in s.split('/')) def build_pattern_book(lines): book = {} for line in lines: source, target = line.strip().split(' => ') for rotation in permutations(build_pattern(source)): book[rotation] = build_pattern(target) return book def subdivide(pattern): size = 2 if len(pattern) % 2 == 0 else 3 n = len(pattern) // size return (tuple(tuple(pattern[i][j] for j in range(y * size, (y + 1) * size)) for i in range(x * size, (x + 1) * size)) for x in range(n) for y in range(n)) def rejoin(parts): n = int(sqrt(len(parts))) size = len(parts[0]) return tuple(concat(parts[i + k][j] for i in range(n)) for k in range(0, len(parts), n) for j in range(size)) def enhance_once(p, book): return rejoin(tuple(book[part] for part in subdivide(p))) def enhance(p, book, n, progress=None): for _ in range(n): p = enhance_once(p, book) return p book = build_pattern_book(fi.input()) intermediate_pattern = enhance(INITIAL_PATTERN, book, 5) print("After 5 iterations:", sum(sum(row) for row in intermediate_pattern)) final_pattern = enhance(intermediate_pattern, book, 13) print("After 18 iterations:", sum(sum(row) for row in final_pattern)) Particle Swarm — Python — #adventofcode Day 20 Today’s challenge finds us simulating the movements of particles in space. → Full code on GitHub !!! commentary Back to Python for this one, another relatively straightforward simulation, although it’s easier to calculate the answer to part 1 than to simulate. 
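To see why part 1 needs no simulation at all: with the puzzle's update order (v += a, then x += v), the position after t steps is x(t) = x0 + v0*t + a*t*(t+1)/2, so for large t the acceleration term dominates and the particle with the smallest total |a| stays closest to the origin (with ties broken by starting velocity, as noted below). A quick Python check of that formula, with made-up one-dimensional values:

def position(x0, v0, a, t):
    # Discrete kinematics for the update order "v += a; x += v".
    return x0 + v0 * t + a * t * (t + 1) // 2

x, v = 3, 2
for _ in range(10):   # simulate ten steps with a = -1...
    v += -1
    x += v
assert x == position(3, 2, -1, 10)   # ...and the closed form agrees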
import fileinput as fi import numpy as np import re First we parse the input into 3 2D arrays: using numpy enables us to do efficient arithmetic across the whole set of particles in one go. PARTICLE_RE = re.compile(r'p=<(-?\d+),(-?\d+),(-?\d+)>, ' r'v=<(-?\d+),(-?\d+),(-?\d+)>, ' r'a=<(-?\d+),(-?\d+),(-?\d+)>') def parse_input(lines): x = [] v = [] a = [] for l in lines: m = PARTICLE_RE.match(l) x.append([int(x) for x in m.group(1, 2, 3)]) v.append([int(x) for x in m.group(4, 5, 6)]) a.append([int(x) for x in m.group(7, 8, 9)]) return (np.arange(len(x)), np.array(x), np.array(v), np.array(a)) i, x, v, a = parse_input(fi.input()) Now we can calculate which particle will be closest to the origin in the long-term: this is simply the particle with the smallest acceleration. It turns out that several have the same acceleration, so of these, the one we want is the one with the lowest starting velocity. This is only complicated slightly by the need to get the number of the particle rather than its other information, hence the need to use numpy.argmin. a_abs = np.sum(np.abs(a), axis=1) a_min = np.min(a_abs) a_i = np.squeeze(np.argwhere(a_abs == a_min)) closest = i[a_i[np.argmin(np.sum(np.abs(v[a_i]), axis=1))]] print("Closest: ", closest) Now we define functions to simulate collisions between particles. We have to use the return_index and return_counts options to numpy.unique to be able to get rid of all the duplicate positions (the standard usage is to keep one of each duplicate). def resolve_collisions(x, v, a): (_, i, c) = np.unique(x, return_index=True, return_counts=True, axis=0) i = i[c == 1] return x[i], v[i], a[i] The termination criterion for this loop is an interesting aspect: the most robust to my mind seems to be that eventually the particles will end up sorted in order of their initial acceleration in terms of distance from the origin, so you could check for this but that’s pretty computationally expensive. In the end, all that was needed was a bit of trial and error: terminating arbitrarily after 1,000 iterations seems to work! In fact, all the collisions are over after about 40 iterations for my input but there was always the possibility that two particles with very slightly different accelerations would eventually intersect much later. def simulate_collisions(x, v, a, iterations=1000): for _ in range(iterations): v += a x += v x, v, a = resolve_collisions(x, v, a) return len(x) print("Remaining particles: ", simulate_collisions(x, v, a)) A Series of Tubes — Rust — #adventofcode Day 19 Today’s challenge asks us to help a network packet find its way. → Full code on GitHub !!! commentary Today’s challenge was fairly straightforward, following an ASCII art path, so I thought I’d give Rust another try. I’m a bit behind on the blog posts, so I’m presenting the code below without any further commentary. I’m not really convinced this is good idiomatic Rust, and it was interesting turning a set of strings into a 2D array of characters because there are both u8 (byte) and char types to deal with. use std::io; use std::io::BufRead; const ALPHA: &'static str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; fn change_direction(dia: &Vec<Vec<u8>>, x: usize, y: usize, dx: &mut i32, dy: &mut i32) { assert_eq!(dia[x][y], b'+'); if dx.abs() == 1 { *dx = 0; if y + 1 < dia[x].len() && (dia[x][y + 1] == b'-' || ALPHA.contains(dia[x][y + 1] as char)) { *dy = 1; } else if dia[x][y - 1] == b'-' || ALPHA.contains(dia[x][y - 1] as char) { *dy = -1; } else { panic!("Huh? 
{} {}", dia[x][y+1] as char, dia[x][y-1] as char); } } else { *dy = 0; if x + 1 < dia.len() && (dia[x + 1][y] == b'|' || ALPHA.contains(dia[x + 1][y] as char)) { *dx = 1; } else if dia[x - 1][y] == b'|' || ALPHA.contains(dia[x - 1][y] as char) { *dx = -1; } else { panic!("Huh?"); } } } fn follow_route(dia: Vec<Vec<u8>>) -> (String, i32) { let mut x: i32 = 0; let mut y: i32; let mut dx: i32 = 1; let mut dy: i32 = 0; let mut result = String::new(); let mut steps = 1; match dia[0].iter().position(|x| *x == b'|') { Some(i) => y = i as i32, None => panic!("Could not find '|' in first row"), } loop { x += dx; y += dy; match dia[x as usize][y as usize] { b'A'...b'Z' => result.push(dia[x as usize][y as usize] as char), b'+' => change_direction(&dia, x as usize, y as usize, &mut dx, &mut dy), b' ' => return (result, steps), _ => (), } steps += 1; } } fn main() { let stdin = io::stdin(); let lines: Vec<Vec<u8>> = stdin.lock().lines() .map(|l| l.unwrap().into_bytes()) .collect(); let result = follow_route(lines); println!("Route: {}", result.0); println!("Steps: {}", result.1); } Duet — Haskell — #adventofcode Day 18 Today’s challenge introduces a type of simplified assembly language that includes instructions for message-passing. First we have to simulate a single program (after humorously misinterpreting the snd and rcv instructions as “sound” and “recover”), but then we have to simulate two concurrent processes and the message passing between them. → Full code on GitHub !!! commentary Well, I really learned a lot from this one! I wanted to get to grips with more complex stuff in Haskell and this challenge seemed like an excellent opportunity to figure out a) parsing with the parsec library and b) using the State monad to keep the state of the simulator. As it turned out, that wasn't all I'd learned: I also ran into an interesting situation whereby lazy evaluation was creating an infinite loop where there shouldn't be one, so I also had to learn how to selectively force strict evaluation of values. I'm pretty sure this isn't the best Haskell in the world, but I'm proud of it. First we have to import a bunch of stuff to use later, but also notice the pragma on the first line which instructs the compiler to enable the BangPatterns language extension, which will be important later. {-# LANGUAGE BangPatterns #-} module Main where import qualified Data.Vector as V import qualified Data.Map.Strict as M import Data.List import Data.Either import Data.Maybe import Control.Monad.State.Strict import Control.Monad.Loops import Text.ParserCombinators.Parsec hiding (State) First up we define the types that will represent the program code itself. data DuetVal = Reg Char | Val Int deriving Show type DuetQueue = [Int] data DuetInstruction = Snd DuetVal | Rcv DuetVal | Jgz DuetVal DuetVal | Set DuetVal DuetVal | Add DuetVal DuetVal | Mul DuetVal DuetVal | Mod DuetVal DuetVal deriving Show type DuetProgram = V.Vector DuetInstruction Next we define the types to hold the machine state, which includes: registers, instruction pointer, send & receive buffers and the program code, plus a counter of the number of sends made (to provide the solution). 
type DuetRegisters = M.Map Char Int data Duet = Duet { dRegisters :: DuetRegisters , dPtr :: Int , dSendCount :: Int , dRcvBuf :: DuetQueue , dSndBuf :: DuetQueue , dProgram :: DuetProgram } instance Show Duet where show d = show (dRegisters d) ++ " @" ++ show (dPtr d) ++ " S" ++ show (dSndBuf d) ++ " R" ++ show (dRcvBuf d) defaultDuet = Duet M.empty 0 0 [] [] V.empty type DuetState = State Duet program is a parser built on the cool parsec library to turn the program text into a Haskell format that we can work with, a Vector of instructions. Yes, using a full-blown parser is overkill here (it would be much simpler just to split each line on whitespace, but I wanted to see how Parsec works. I’m using Vector here because we need random access to the instruction list, which is much more efficient with Vector: O(1) compared with the O(n) of the built in Haskell list ([]) type. parseProgram applies the parser to a string and returns the result. program :: GenParser Char st DuetProgram program = do instructions <- endBy instruction eol return $ V.fromList instructions where instruction = try (oneArg "snd" Snd) <|> oneArg "rcv" Rcv <|> twoArg "set" Set <|> twoArg "add" Add <|> try (twoArg "mul" Mul) <|> twoArg "mod" Mod <|> twoArg "jgz" Jgz oneArg n c = do string n >> spaces val <- regOrVal return $ c val twoArg n c = do string n >> spaces val1 <- regOrVal spaces val2 <- regOrVal return $ c val1 val2 regOrVal = register <|> value register = do name <- lower return $ Reg name value = do val <- many $ oneOf "-0123456789" return $ Val $ read val eol = char '\n' parseProgram :: String -> Either ParseError DuetProgram parseProgram = parse program "" Next up we have some utility functions that sit in the DuetState monad we defined above and perform common manipulations on the state: getting/setting/updating registers, updating the instruction pointer and sending/receiving messages via the relevant queues. getReg :: Char -> DuetState Int getReg r = do st <- get return $ M.findWithDefault 0 r (dRegisters st) putReg :: Char -> Int -> DuetState () putReg r v = do st <- get let current = dRegisters st new = M.insert r v current put $ st { dRegisters = new } modReg :: (Int -> Int -> Int) -> Char -> DuetVal -> DuetState Bool modReg op r v = do u <- getReg r v' <- getRegOrVal v putReg r (u `op` v') incPtr return False getRegOrVal :: DuetVal -> DuetState Int getRegOrVal (Reg r) = getReg r getRegOrVal (Val v) = return v addPtr :: Int -> DuetState () addPtr n = do st <- get put $ st { dPtr = n + dPtr st } incPtr = addPtr 1 send :: Int -> DuetState () send v = do st <- get put $ st { dSndBuf = (dSndBuf st ++ [v]), dSendCount = dSendCount st + 1 } recv :: DuetState (Maybe Int) recv = do st <- get case dRcvBuf st of (x:xs) -> do put $ st { dRcvBuf = xs } return $ Just x [] -> return Nothing execInst implements the logic for each instruction. It returns False as long as the program can continue, but True if the program tries to receive from an empty buffer. 
execInst :: DuetInstruction -> DuetState Bool execInst (Set (Reg reg) val) = do newVal <- getRegOrVal val putReg reg newVal incPtr return False execInst (Mul (Reg reg) val) = modReg (*) reg val execInst (Add (Reg reg) val) = modReg (+) reg val execInst (Mod (Reg reg) val) = modReg mod reg val execInst (Jgz val1 val2) = do st <- get test <- getRegOrVal val1 jump <- if test > 0 then getRegOrVal val2 else return 1 addPtr jump return False execInst (Snd val) = do v <- getRegOrVal val send v st <- get incPtr return False execInst (Rcv (Reg r)) = do st <- get v <- recv handle v where handle :: Maybe Int -> DuetState Bool handle (Just x) = putReg r x >> incPtr >> return False handle Nothing = return True execInst x = error $ "execInst not implemented yet for " ++ show x execNext looks up the next instruction and executes it. runUntilWait runs the program until execNext returns True to signal the wait state has been reached. execNext :: DuetState Bool execNext = do st <- get let prog = dProgram st p = dPtr st if p >= length prog then return True else execInst (prog V.! p) runUntilWait :: DuetState () runUntilWait = do waiting <- execNext unless waiting runUntilWait runTwoPrograms handles the concurrent running of two programs, by running first one and then the other to a wait state, then swapping each program’s send buffer to the other’s receive buffer before repeating. If you look carefully, you’ll see a “bang” (!) before the two arguments of the function: runTwoPrograms !d0 !d1. Haskell is a lazy language and usually doesn’t evaluate a computation until you ask for a result, instead carrying around a “thunk” or plan for how to carry out the computation. Sometimes that can be a problem because the amount of memory your program is using can explode unnecessarily as a long computation turns into a large thunk which isn’t evaluated until the very end. That’s not the problem here though. What happens here without the bangs is another side-effect of laziness. The exit condition of this recursive function is that a deadlock has been reached: both programs are waiting to receive, but neither has sent anything, so neither can ever continue. The check for this is (null $ dSndBuf d0') && (null $ dSndBuf d1'). As long as the first program has something in its send buffer, the test fails without ever evaluating the second part, which means the result d1' of running the second program is never needed. The function immediately goes to the recursive case and tries to continue the first program again, which immediately returns because it’s still waiting to receive. The same thing happens again, and the result is that instead of running the second program to obtain something for the first to receive, we get into an infinite loop trying and failing to continue the first program. The bang forces both d0 and d1 to be evaluated at the point we recurse, which forces the rest of the computation: running the second program and swapping the send/receive buffers. With that, the evaluation proceeds correctly and we terminate with a result instead of getting into an infinite loop! 
runTwoPrograms :: Duet -> Duet -> (Int, Int) runTwoPrograms !d0 !d1 | (null $ dSndBuf d0') && (null $ dSndBuf d1') = (dSendCount d0', dSendCount d1') | otherwise = runTwoPrograms d0'' d1'' where (_, d0') = runState runUntilWait d0 (_, d1') = runState runUntilWait d1 d0'' = d0' { dSndBuf = [], dRcvBuf = dSndBuf d1' } d1'' = d1' { dSndBuf = [], dRcvBuf = dSndBuf d0' } All that remains to be done now is to run the programs and see how many messages were sent before the deadlock. main = do prog <- fmap (fromRight V.empty . parseProgram) getContents let d0 = defaultDuet { dProgram = prog, dRegisters = M.fromList [('p', 0)] } d1 = defaultDuet { dProgram = prog, dRegisters = M.fromList [('p', 1)] } (send0, send1) = runTwoPrograms d0 d1 putStrLn $ "Program 0 sent " ++ show send0 ++ " messages" putStrLn $ "Program 1 sent " ++ show send1 ++ " messages" Spinlock — Rust/Python — #adventofcode Day 17 In today’s challenge we deal with a monstrous whirlwind of a program, eating up CPU and memory in equal measure. → Full code on GitHub (and Python driver script) !!! commentary One of the things I wanted from AoC was an opportunity to try out some popular languages that I don’t currently know, including the memory-safe, strongly-typed compiled languages Go and Rust. Realistically though, I’m likely to continue doing most of my programming in Python, and use one of these other languages when it has better tools or I need the extra speed. In which case, what I really want to know is how I can call functions written in Go or Rust from Python. I thought I'd try Rust first, as it seems to be designed to be C-compatible and that makes it easy to call from Python using [`ctypes`](https://docs.python.org/3.6/library/ctypes.html). Part 1 was another straightforward simulation: translate what the "spinlock" monster is doing into code and run it. It was pretty obvious from the story of this challenge and experience of the last few days that this was going to be another one where the simulation is too computationally expensive for part two, which turns out to be correct. So, first thing to do is to implement the meat of the solution in Rust. spinlock solves the first part of the problem by doing exactly what the monster does. Since we only have to go up to 2017 iterations, this is very tractable. The last number we insert is 2017, so we just return the number immediately after that. #[no_mangle] pub extern fn spinlock(n: usize, skip: usize) -> i32 { let mut buffer: Vec<i32> = Vec::with_capacity(n+1); buffer.push(0); buffer.push(1); let mut pos = 1; for i in 2..n+1 { pos = (pos + skip + 1) % buffer.len(); buffer.insert(pos, i as i32); } pos = (pos + 1) % buffer.len(); return buffer[pos]; } For the second part, we have to do 50 million iterations instead, which is a lot. Given that every time you insert an item in the list it has to move up all the elements after that position, I’m pretty sure the algorithm is O(n^2), so it’s going to take a lot longer than 10,000ish times the first part. Thankfully, we don’t need to build the whole list, just keep track of where 0 is and what number is immediately after it. There may be a closed-form solution to simply calculate the result, but I couldn’t think of it and this is good enough. 
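One way to express that observation in Python, as a sanity check on the Rust version that follows (a sketch, not code from the repository): because new values are always inserted after the current position, nothing ever lands before 0, so 0 stays at index 0 and we only need to remember the last value inserted at index 1.

def after_zero(n, skip):
    pos, after_0 = 0, None
    for i in range(1, n + 1):
        pos = (pos + skip) % i + 1   # index where value i is inserted
        if pos == 1:                 # i.e. immediately after the 0 at index 0
            after_0 = i
    return after_0

# after_zero(50_000_000, skip) needs no list at all, just 50 million loop iterations.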
#[no_mangle] pub extern fn spinlock0(n: usize, skip: usize) -> i32 { let mut pos = 1; let mut pos_0 = 0; let mut after_0 = 1; for i in 2..n+1 { pos = (pos + skip + 1) % i; if pos == pos_0 + 1 { after_0 = i; } if pos <= pos_0 { pos_0 += 1; } } return after_0 as i32; } Now it’s time to call this code from Python. Notice the #[no_mangle] pragmas and pub extern declarations for each function above, which are required to make sure the functions are exported in a C-compatible way. We can build this into a shared library like this: rustc --crate-type=cdylib -o spinlock.so 17-spinlock.rs The Python script is as simple as loading this library, reading the puzzle input from the command line and calling the functions. The ctypes module does a lot of magic so that we don’t have to worry about converting from Python types to native types and back again. import ctypes import sys lib = ctypes.cdll.LoadLibrary("./spinlock.so") skip = int(sys.argv[1]) print("Part 1:", lib.spinlock(2017, skip)) print("Part 2:", lib.spinlock0(50_000_000, skip)) This is a toy example as far as calling Rust from Python is concerned, but it’s worth noting that already we can play with the parameters to the two Rust functions without having to recompile. For more serious work, I’d probably be looking at something like PyO3 to make a proper Python module. Looks like there’s also a very early Rust numpy integration for numerical work. You can also do the same thing from Julia, which has a ccall function built in: ccall((:spinlock, "./spinlock.so"), Int32, (UInt64, UInt64), 2017, 377) My next thing to try might be Haskell → Python though… Permutation Promenade — Julia — #adventofcode Day 16 Today’s challenge rather appeals to me as a folk dancer, because it describes a set of instructions for a dance and asks us to work out the positions of the dancing programs after each run through the dance. → Full code on GitHub !!! commentary So, part 1 is pretty straightforward: parse the set of instructions, interpret them and keep track of the dancer positions as you go. One time through the dance. However, part 2 asks for the positions after 1 billion (yes, that’s 1,000,000,000) times through the dance. In hindsight I should have immediately become suspicious, but I thought I’d at least try the brute force approach first because it was simpler to code. So I give it a try, and after waiting for a while, having a cup of tea etc., it still hasn't terminated. I try reducing the number of iterations to 1,000. Now it terminates, but takes about 6 seconds. A spot of arithmetic suggests that running the full version will take a little over 190 years. There must be a better way than that! I'm a little embarrassed that I didn't spot the solution immediately (blaming Julia) and tried again in Python to see if I could get it to terminate quicker. When that didn't work I had to think again. A little further investigation with a while loop shows that in fact the dance position repeats (in the case of my input) every 48 times. After that it becomes much quicker! Oh, and it was time for a new language, so I wasted some extra time working out the quirks of Julia. First, a function to evaluate a single move — for neatness, this dispatches to a dedicated function depending on the type of move, although this isn’t really necessary to solve the challenge. Ending a function name with a bang (!) is a Julia convention to indicate that it has side-effects.
function eval_move!(move, dancers) move_type = move[1] params = move[2:end] if move_type == 's' # spin eval_spin!(params, dancers) elseif move_type == 'x' # exchange eval_exchange!(params, dancers) elseif move_type == 'p' # partner swap eval_partner!(params, dancers) end end These take care of the individual moves. Parsing the parameters from a string every single time probably isn’t ideal, but as it turns out, that optimisation isn’t really necessary. Note the + 1 in eval_exchange!, which is necessary because Julia is one of those crazy languages where indexes start from 1 instead of 0. These actions are pretty nice to implement, because Julia has circshift as a builtin to rotate a list, and allows you to assign to list slices and swap values in place with a single statement. function eval_spin!(params, dancers) shift = parse(Int, params) dancers[1:end] = circshift(dancers, shift) end function eval_exchange!(params, dancers) i, j = map(x -> parse(Int, x) + 1, split(params, "/")) dancers[i], dancers[j] = dancers[j], dancers[i] end function eval_partner!(params, dancers) a, b = split(params, "/") ia = findfirst([x == a for x in dancers]) ib = findfirst([x == b for x in dancers]) dancers[ia], dancers[ib] = b, a end dance! takes a list of moves and runs the dancers once through the whole dance. function dance!(moves, dancers) for m in moves eval_move!(m, dancers) end end To solve part 1, we simply need to read the moves in, set up the initial positions of the dancers and run the dance through once. join is necessary to a) turn characters into length-1 strings, and b) convert the list of strings back into a single string to print out. moves = split(readchomp(STDIN), ",") dancers = collect(join(c) for c in 'a':'p') orig_dancers = copy(dancers) dance!(moves, dancers) println(join(dancers)) Part 2 requires a little more work. We run the dance through again and again until we get back to the initial position, saving the intermediate positions in a list. The list now contains every possible position available from that starting point, so we can find position 1 billion by taking 1,000,000,000 modulo the list length (plus 1 because 1-based indexing) and use that to index into the list to get the final position. dance_cycle = [orig_dancers] while dancers != orig_dancers push!(dance_cycle, copy(dancers)) dance!(moves, dancers) end println(join(dance_cycle[1_000_000_000 % length(dance_cycle) + 1])) This terminates on my laptop in about 1.6s: Brute force 0; Careful thought 1! Dueling Generators — Rust — #adventofcode Day 15 Today’s challenge introduces two pseudo-random number generators which are trying to agree on a series of numbers. We play the part of the “judge”, counting the number of times their numbers agree in the lowest 16 bits. → Full code on GitHub Ever since I used Go to solve day 3, I’ve had a hankering to try the other new kid on the memory-safe compiled language block, Rust. I found it a bit intimidating at first because the syntax wasn’t as close to the C/C++ I’m familiar with and there are quite a few concepts unique to Rust, like the use of traits. But I figured it out, so I can tick another language off my to-try list. I also implemented a version in Python for comparison: the Python version is more concise and easier to read but the Rust version runs about 10× faster. First we include the std::env “crate” which will let us get access to command-line arguments, and define some useful constants for later.
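That Python comparison isn't reproduced in the post, but the core of the duel looks roughly like this in Python (a hedged sketch using the constants from the Rust code below, not the actual comparison script):

M = 2147483647
MASK = 0xFFFF

def generator(factor, value, multiple=1):
    # multiple=1 gives the part 1 generators; 4 and 8 give the "picky" ones.
    while True:
        value = (value * factor) % M
        if value % multiple == 0:
            yield value

def judge(gen_a, gen_b, rounds):
    # Count how often the lowest 16 bits agree.
    return sum((a & MASK) == (b & MASK) for a, b, _ in zip(gen_a, gen_b, range(rounds)))

# e.g. judge(generator(16807, start_a), generator(48271, start_b), 40_000_000)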
use std::env; const M: i64 = 2147483647; const MASK: i64 = 0b1111111111111111; const FACTOR_A: i64 = 16807; const FACTOR_B: i64 = 48271; gen_next generates the next number for a given generator’s sequence. gen_next_picky does the same, but for the “picky” generators, only returning values that meet their criteria. fn gen_next(factor: i64, current: i64) -> i64 { return (current * factor) % M; } fn gen_next_picky(factor: i64, current: i64, mult: i64) -> i64 { let mut next = gen_next(factor, current); while next % mult != 0 { next = gen_next(factor, next); } return next; } duel runs a single duel, and returns the number of times the generators agreed in the lowest 16 bits (found by doing a binary & with the mask defined above). Rust allows functions to be passed as parameters, so we use this to be able to run both versions of the duel using only this one function. fn duel<F, G>(n: i64, next_a: F, mut value_a: i64, next_b: G, mut value_b: i64) -> i64 where F: Fn(i64) -> i64, G: Fn(i64) -> i64, { let mut count = 0; for _ in 0..n { value_a = next_a(value_a); value_b = next_b(value_b); if (value_a & MASK) == (value_b & MASK) { count += 1; } } return count; } Finally, we read the start values from the command line and run the two duels. The expressions that begin |n| are closures (anonymous functions, often called lambdas in other languages) that we use to specify the generator functions for each duel. fn main() { let args: Vec<String> = env::args().collect(); let start_a: i64 = args[1].parse().unwrap(); let start_b: i64 = args[2].parse().unwrap(); println!( "Duel 1: {}", duel( 40000000, |n| gen_next(FACTOR_A, n), start_a, |n| gen_next(FACTOR_B, n), start_b, ) ); println!( "Duel 2: {}", duel( 5000000, |n| gen_next_picky(FACTOR_A, n, 4), start_a, |n| gen_next_picky(FACTOR_B, n, 8), start_b, ) ); } Disk Defragmentation — Haskell — #adventofcode Day 14 Today’s challenge has us helping a disk defragmentation program by identifying contiguous regions of used sectors on a 2D disk. → Full code on GitHub !!! commentary Wow, today’s challenge had a pretty steep learning curve. Day 14 was the first to directly reuse code from a previous day: the “knot hash” from day 10. I solved day 10 in Haskell, so I thought it would be easier to stick with Haskell for today as well. The first part was straightforward, but the second was pretty mind-bending in a pure functional language! I ended up solving it by implementing a [flood fill algorithm][flood]. It's recursive, which is right in Haskell's wheelhouse, but I ended up using `Data.Sequence` instead of the standard list type as its API for indexing is better. I haven't tried it, but I think it will also be a little faster than a naive list-based version. It took a looong time to figure everything out, but I had a day off work to be able to concentrate on it! A lot more imports for this solution, as we’re exercising a lot more of the standard library. module Main where import Prelude hiding (length, filter, take) import Data.Char (ord) import Data.Sequence import Data.Foldable hiding (length) import Data.Ix (inRange) import Data.Function ((&)) import Data.Maybe (fromJust, mapMaybe, isJust) import qualified Data.Set as Set import Text.Printf (printf) import System.Environment (getArgs) Also we’ll extract the key bits from day 10 into a module and import that. import KnotHash Now we define a few data types to make the code a bit more readable. 
Sector represent the state of a particular disk sector, either free, used (but unmarked) or used and marked as belonging to a given integer-labelled group. Grid is a 2D matrix of Sector, as a sequence of sequences. data Sector = Free | Used | Mark Int deriving (Eq) instance Show Sector where show Free = " ." show Used = " #" show (Mark i) = printf "%4d" i type GridRow = Seq Sector type Grid = Seq (GridRow) Some utility functions to make it easier to view the grids (which can be quite large): used for debugging but not in the finished solution. subGrid :: Int -> Grid -> Grid subGrid n = fmap (take n) . take n printRow :: GridRow -> IO () printRow row = do mapM_ (putStr . show) row putStr "\n" printGrid :: Grid -> IO () printGrid = mapM_ printRow makeKey generates the hash key for a given row. makeKey :: String -> Int -> String makeKey input n = input ++ "-" ++ show n stringToGridRow converts a binary string of ‘1’ and ‘0’ characters to a sequence of Sector values. stringToGridRow :: String -> GridRow stringToGridRow = fromList . map convert where convert x | x == '1' = Used | x == '0' = Free makeRow and makeGrid build up the grid to use based on the provided input string. makeRow :: String -> Int -> GridRow makeRow input n = stringToGridRow $ concatMap (printf "%08b") $ dense $ fullKnotHash 256 $ map ord $ makeKey input n makeGrid :: String -> Grid makeGrid input = fromList $ map (makeRow input) [0..127] Utility functions to count the number of used and free sectors, to give the solution to part 1. countEqual :: Sector -> Grid -> Int countEqual x = sum . fmap (length . filter (==x)) countUsed = countEqual Used countFree = countEqual Free Now the real meat begins! fundUnmarked finds the location of the next used sector that we haven’t yet marked. It returns a Maybe value, which is Just (x, y) if there is still an unmarked block or Nothing if there’s nothing left to mark. findUnmarked :: Grid -> Maybe (Int, Int) findUnmarked g | y == Nothing = Nothing | otherwise = Just (fromJust x, fromJust y) where hasUnmarked row = isJust $ elemIndexL Used row x = findIndexL hasUnmarked g y = case x of Nothing -> Nothing Just x' -> elemIndexL Used $ index g x' floodFill implements a very simple recursive flood fill. It takes a target and replacement value and a starting location, and fills in the replacement value for every connected location that currently has the target value. We use it below to replace a connected used region with a marked region. floodFill :: Sector -> Sector -> (Int, Int) -> Grid -> Grid floodFill t r (x, y) g | inRange (0, length g - 1) x && inRange (0, length g - 1) y && elem == t = let newRow = update y r row newGrid = update x newRow g in newGrid & floodFill t r (x+1, y) & floodFill t r (x-1, y) & floodFill t r (x, y+1) & floodFill t r (x, y-1) | otherwise = g where row = g `index` x elem = row `index` y markNextGroup looks for an unmarked group and marks it if found. If no more groups are found it returns Nothing. markAllGroups then repeatedly applies markNextGroup until Nothing is returned. markNextGroup :: Int -> Grid -> Maybe Grid markNextGroup i g = case findUnmarked g of Nothing -> Nothing Just loc -> Just $ floodFill Used (Mark i) loc g markAllGroups :: Grid -> Grid markAllGroups g = markAllGroups' 1 g where markAllGroups' i g = case markNextGroup i g of Nothing -> g Just g' -> markAllGroups' (i+1) g' onlyMarks filters a grid row and returns a list of (possibly duplicated) group numbers in the row. onlyMarks :: GridRow -> [Int] onlyMarks = mapMaybe getMark . 
toList where getMark Free = Nothing getMark Used = Nothing getMark (Mark i) = Just i Finally, countGroups puts all the group numbers into a set to get rid of duplicates and returns the size of the set, i.e. the total number of separate groups. countGroups :: Grid -> Int countGroups g = Set.size groupSet where groupSet = foldl' Set.union Set.empty $ fmap rowToSet g rowToSet = Set.fromList . toList . onlyMarks As always, every Haskell program needs a main function to drive the I/O and produce the actual result. main = do input <- fmap head getArgs let grid = makeGrid input used = countUsed grid marked = countGroups $ markAllGroups grid putStrLn $ "Used sectors: " ++ show used putStrLn $ "Groups: " ++ show marked Packet Scanners — Haskell — #adventofcode Day 13 Today’s challenge requires us to sneak past a firewall made up of a series of scanners. → Full code on GitHub !!! commentary I wasn’t really thinking straight when I solved this challenge. I got a solution without too much trouble, but I ended up simulating the step-by-step movement of the scanners. I finally realised that I could calculate whether or not a given scanner was safe at a given time directly with modular arithmetic, and it bugged me so much that I reimplemented the solution. Both are given below, the faster one first. First we introduce some standard library stuff and define some useful utilities. module Main where import qualified Data.Text as T import Data.Maybe (mapMaybe) strip :: String -> String strip = T.unpack . T.strip . T.pack splitOn :: String -> String -> [String] splitOn sep = map T.unpack . T.splitOn (T.pack sep) . T.pack parseScanner :: String -> (Int, Int) parseScanner s = (d, r) where [d, r] = map read $ splitOn ": " s traverseFW does all the hard work: it checks for each scanner whether or not it’s safe as we pass through, and returns a list of the severities of each time we’re caught. mapMaybe is like the standard map in many languages, but operates on a list of Haskell Maybe values, like a combined map and filter. If the value is Just x, x gets included in the returned list; if the value is Nothing, then it gets thrown away. traverseFW :: Int -> [(Int, Int)] -> [Int] traverseFW delay = mapMaybe caught where caught (d, r) = if (d + delay) `mod` (2*(r-1)) == 0 then Just (d * r) else Nothing Then the total severity of our passage through the firewall is simply the sum of each individual severity. severity :: [(Int, Int)] -> Int severity = sum . traverseFW 0 But we don’t want to know how badly we got caught, we want to know how long to wait before setting off to get through safely. findDelay tries traversing the firewall with increasing delay, and returns the delay for the first pass where we predict not getting caught. findDelay :: [(Int, Int)] -> Int findDelay scanners = head $ filter (null . flip traverseFW scanners) [0..] And finally, we put it all together and calculate and print the result. main = do scanners <- fmap (map parseScanner . 
lines) getContents putStrLn $ "Severity: " ++ (show $ severity scanners) putStrLn $ "Delay: " ++ (show $ findDelay scanners) I’m not generally bothered about performance for these challenges, but here I’ll note that my second attempt runs in a little under 2 seconds on my laptop: $ time ./13-packet-scanners-redux < 13-input.txt Severity: 1900 Delay: 3966414 ./13-packet-scanners-redux < 13-input.txt 1.73s user 0.02s system 99% cpu 1.754 total Compare that with the first, simulation-based one, which takes nearly a full minute: $ time ./13-packet-scanners < 13-input.txt Severity: 1900 Delay: 3966414 ./13-packet-scanners < 13-input.txt 57.63s user 0.27s system 100% cpu 57.902 total And for good measure, here’s the code. Notice the tick and tickOne functions, which together simulate moving all the scanners by one step; for this to work we have to track the full current state of each scanner, which is easier to read with a Haskell record-based custom data type. traverseFW is more complicated because it has to drive the simulation, but the rest of the code is mostly the same. module Main where import qualified Data.Text as T import Control.Monad (forM_) data Scanner = Scanner { depth :: Int , range :: Int , pos :: Int , dir :: Int } instance Show Scanner where show (Scanner d r p dir) = show d ++ "/" ++ show r ++ "/" ++ show p ++ "/" ++ show dir strip :: String -> String strip = T.unpack . T.strip . T.pack splitOn :: String -> String -> [String] splitOn sep str = map T.unpack $ T.splitOn (T.pack sep) $ T.pack str parseScanner :: String -> Scanner parseScanner s = Scanner d r 0 1 where [d, r] = map read $ splitOn ": " s tickOne :: Scanner -> Scanner tickOne (Scanner depth range pos dir) | pos <= 0 = Scanner depth range (pos+1) 1 | pos >= range - 1 = Scanner depth range (pos-1) (-1) | otherwise = Scanner depth range (pos+dir) dir tick :: [Scanner] -> [Scanner] tick = map tickOne traverseFW :: [Scanner] -> [(Int, Int)] traverseFW = traverseFW' 0 where traverseFW' _ [] = [] traverseFW' layer scanners@((Scanner depth range pos _):rest) -- | layer == depth && pos == 0 = (depth*range) + (traverseFW' (layer+1) $ tick rest) | layer == depth && pos == 0 = (depth,range) : (traverseFW' (layer+1) $ tick rest) | layer == depth && pos /= 0 = traverseFW' (layer+1) $ tick rest | otherwise = traverseFW' (layer+1) $ tick scanners severity :: [Scanner] -> Int severity = sum . map (uncurry (*)) . traverseFW empty :: [a] -> Bool empty [] = True empty _ = False findDelay :: [Scanner] -> Int findDelay scanners = delay where (delay, _) = head $ filter (empty . traverseFW . snd) $ zip [0..] $ iterate tick scanners main = do scanners <- fmap (map parseScanner . lines) getContents putStrLn $ "Severity: " ++ (show $ severity scanners) putStrLn $ "Delay: " ++ (show $ findDelay scanners) Digital Plumber — Python — #adventofcode Day 12 Today’s challenge has us helping a village of programs who are unable to communicate. We have a list of the the communication channels between their houses, and need to sort them out into groups such that we know that each program can communicate with others in its own group but not any others. Then we have to calculate the size of the group containing program 0 and the total number of groups. → Full code on GitHub !!! commentary This is one of those problems where I’m pretty sure that my algorithm isn’t close to being the most efficient, but it definitely works! For the sake of solving the challenge that’s all that matters, but it still bugs me. 
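For the record, the standard way to make this kind of grouping efficient is a disjoint-set (union-find) structure, which merges as it reads rather than repeatedly re-scanning the group list. A minimal sketch, not the approach taken below:

def find(parent, x):
    # Walk up to the root of x's set, halving paths as we go.
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)

# parent = {}
# for line in lines:                      # lines like "2 <-> 0, 3, 4"
#     head, rest = line.split(' <-> ')
#     for other in rest.split(', '):
#         union(parent, int(head), int(other))
# roots = {find(parent, p) for p in parent}   # one root per group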
By now I’ve become used to using fileinput to transparently read data either from files given on the command-line or standard input if no arguments are given. import fileinput as fi First we make an initial pass through the input data, creating a group for each line representing the programs on that line (which can communicate with each other). We store this as a Python set. groups = [] for line in fi.input(): head, rest = line.split(' <-> ') group = set([int(head)]) group.update([int(x) for x in rest.split(', ')]) groups.append(group) Now we iterate through the groups, starting with the first, and merging any we find that overlap with our current group. i = 0 while i < len(groups): current = groups[i] Each pass through the groups brings more programs into the current group, so we have to go through and check their connections too. We make several merge passes, until we detect that no more merges took place. num_groups = len(groups) + 1 while num_groups > len(groups): j = i+1 num_groups = len(groups) This inner loop does the actual merging, and deletes each group as it’s merged in. while j < len(groups): if len(current & groups[j]) > 0: current.update(groups[j]) del groups[j] else: j += 1 i += 1 All that’s left to do now is to display the results. print("Number in group 0:", len([g for g in groups if 0 in g][0])) print("Number of groups:", len(groups)) Hex Ed — Python — #adventofcode Day 11 Today’s challenge is to help a program find its child process, which has become lost on a hexagonal grid. We need to follow the path taken by the child (given as input) and calculate the distance it is from home along with the furthest distance it has been at any point along the path. → Full code on GitHub !!! commentary I found this one quite interesting in that it was very quick to solve. In fact, I got lucky and my first quick implementation (max(abs(l)) below) gave the correct answer in spite of missing an obvious not-so-edge case. Thinking about it, there’s only a ⅓ chance that the first incorrect implementation would give the wrong answer! The code is shorter, so you get more words today. ☺ There are a number of different co-ordinate systems on a hexagonal grid (I discovered while reading up after solving it…). I intuitively went for the system known as ‘axial’ coordinates, where you pick two directions aligned to the grid as your x and y axes: note that these won’t be perpendicular. I chose ne/sw as the x axis and se/nw as y, but there are three other possible choices. That leads to the following definition for the directions, encoded as numpy arrays because that makes some of the code below neater. import numpy as np STEPS = {d: np.array(v) for d, v in [('ne', (1, 0)), ('se', (0, -1)), ('s', (-1, -1)), ('sw', (-1, 0)), ('nw', (0, 1)), ('n', (1, 1))]} hex_grid_dist, given a location l calculates the number of steps needed to reach that location from the centre at (0, 0). Notice that we can’t simply use the Manhattan distance here because, for example, one step north takes us to (1, 1), which would give a Manhattan distance of 2. 
Instead, we can see that moving in the n/s direction allows us to increment or decrement both coordinates at the same time. If the coordinates have the same sign, move n/s until one of them is zero, then move along the relevant ne or se axis back to the origin; in this case the number of steps is the greater of the absolute values of the two coordinates. If the coordinates have opposite signs, move independently along the ne and se axes to reduce each to 0; this time the number of steps is the sum of the absolute values of the two coordinates. def hex_grid_distance(l): if sum(np.sign(l)) == 0: # i.e. opposite signs return sum(abs(l)) else: return max(abs(l)) Now we can read in the path followed by the child and follow it ourselves, tracking the maximum distance from home along the way. path = input().strip().split(',') location = np.array((0, 0)) max_distance = 0 for step in map(STEPS.get, path): location += step max_distance = max(max_distance, hex_grid_distance(location)) distance = hex_grid_distance(location) print("Child process is at", location, "which is", distance, "steps away") print("Greatest distance was", max_distance) Knot Hash — Haskell — #adventofcode Day 10 Today’s challenge asks us to help a group of programs implement a (highly questionable) hashing algorithm that involves repeatedly reversing parts of a list of numbers. → Full code on GitHub !!! commentary I went with Haskell again today, because it’s the weekend so I have a bit more time, and I really enjoyed yesterday’s Haskell implementation. Today gave me the opportunity to explore the standard library a bit more, as well as lending itself nicely to being decomposed into smaller parts to be combined using higher-order functions. You know the drill by now: import stuff we’ll use later. module Main where import Data.Char (ord) import Data.Bits (xor) import Data.Function ((&)) import Data.List (unfoldr) import Text.Printf (printf) import qualified Data.Text as T The worked example uses a concept of the “current position” as a pointer to a location in a static list. In Haskell it makes more sense to instead use the front of the list as the current position, and rotate the whole list as we progress to bring the right element to the front. rotate :: Int -> [Int] -> [Int] rotate 0 xs = xs rotate n xs = drop n' xs ++ take n' xs where n' = n `mod` length xs The simple version of the hash requires working through the input list, modifying the working list as we go, and incrementing a “skip” counter with each step. Converting this to a functional style, we simply zip up the input with an infinite list [0, 1, 2, 3, ...] to give the counter values. Notice that we also have to calculate how far to rotate the working list to get back to its original position. foldl lets us specify a function that returns a modified version of the working list and feeds in the input list one element at a time. simpleKnotHash :: Int -> [Int] -> [Int] simpleKnotHash size input = foldl step [0..size-1] input' & rotate (negate finalPos) where input' = zip input [0..] finalPos = sum $ zipWith (+) input [0..] reversePart xs n = (reverse $ take n xs) ++ drop n xs step xs (n, skip) = reversePart xs n & rotate (n+skip) The full version of the hash (part 2 of the challenge) starts the same way as the simple version, except making 64 passes instead of one: we can do this by using replicate to make a list of 64 copies, then collapse that into a single list with concat.
fullKnotHash :: Int -> [Int] -> [Int] fullKnotHash size input = simpleKnotHash size input' where input' = concat $ replicate 64 input The next step in calculating the full hash collapses the full 256-element “sparse” hash down into 16 elements by XORing groups of 16 together. unfoldr is a nice efficient way of doing this. dense :: [Int] -> [Int] dense = unfoldr dense' where dense' [] = Nothing dense' xs = Just (foldl1 xor $ take 16 xs, drop 16 xs) The final hash step is to convert the list of integers into a hexadecimal string. hexify :: [Int] -> String hexify = concatMap (printf "%02x") These two utility functions put together building blocks from the Data.Text module to parse the input string. Note that no arguments are given: the functions are defined purely by composing other functions using the . operator. In Haskell this is referred to as “point-free” style. strip :: String -> String strip = T.unpack . T.strip . T.pack parseInput :: String -> [Int] parseInput = map (read . T.unpack) . T.splitOn (T.singleton ',') . T.pack Now we can put it all together, including building the weird input for the “full” hash. main = do input <- fmap strip getContents let simpleInput = parseInput input asciiInput = map ord input ++ [17, 31, 73, 47, 23] (a:b:_) = simpleKnotHash 256 simpleInput print $ (a*b) putStrLn $ fullKnotHash 256 asciiInput & dense & hexify Stream Processing — Haskell — #adventofcode Day 9 In today’s challenge we come across a stream that we need to cross. But of course, because we’re stuck inside a computer, it’s not water but data flowing past. The stream is too dangerous to cross until we’ve removed all the garbage, and to prove we can do that we have to calculate a score for the valid data “groups” and the number of garbage characters to remove. → Full code on GitHub !!! commentary One of my goals for this process was to knock the rust of my functional programming skills in Haskell, and I haven’t done that for the whole of the first week. Processing strings character by character and acting according to which character shows up seems like a good choice for pattern-matching though, so here we go. I also wanted to take a bash at test-driven development in Haskell, so I also loaded up the Test.Hspec module to give it a try. I did find keeping track of all the state in arguments a bit mind boggling, and I think it could have been improved through use of a data type using record syntax and the `State` monad, so that's something to look at for a future challenge. First import the extra bits we’ll need. module Main where import Test.Hspec import Data.Function ((&)) countGroups solves the first part of the problem, counting up the “score” of the valid data in the stream. countGroups' is an auxiliary function that holds some state in its arguments. We use pattern matching for the base case: [] represents the empty list in Haskell, which indicates we’ve finished the whole stream. Otherwise, we split the remaining stream into its first character and remainder, and use guards to decide how to interpret it. If skip is true, discard the character and carry on with skip set back to false. If we find a “!”, that tells us to skip the next. Other characters mark groups or sets of garbage: groups increase the score when they close and garbage is discarded. We continue to progress the list by recursing with the remainder of the stream and any updated state. 
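The same state machine is perhaps easier to follow in a few lines of Python before diving into the Haskell below (an illustrative sketch only, which also happens to return both counts at once, the combined form suggested further down):

def score_stream(stream):
    # Returns (total group score, garbage character count), following the rules
    # just described: '!' skips the next character, '<'..'>' is garbage,
    # '{'/'}' open and close groups, and each group scores its nesting depth.
    score = depth = garbage_count = 0
    in_garbage = skip = False
    for c in stream:
        if skip:
            skip = False
        elif c == '!':
            skip = True
        elif in_garbage:
            if c == '>':
                in_garbage = False
            else:
                garbage_count += 1
        elif c == '<':
            in_garbage = True
        elif c == '{':
            depth += 1
        elif c == '}':
            score += depth
            depth -= 1
    return score, garbage_count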
countGroups :: String -> Int countGroups = countGroups' 0 0 False False where countGroups' score _ _ _ [] = score countGroups' score level garbage skip (c:rest) | skip = countGroups' score level garbage False rest | c == '!' = countGroups' score level garbage True rest | garbage = case c of '>' -> countGroups' score level False False rest _ -> countGroups' score level True False rest | otherwise = case c of '{' -> countGroups' score (level+1) False False rest '}' -> countGroups' (score+level) (level-1) False False rest ',' -> countGroups' score level False False rest '<' -> countGroups' score level True False rest c -> error $ "Garbage character found outside garbage: " ++ show c countGarbage works almost identically to countGroups, except it ignores groups and counts garbage. They are structured so similarly that it would probably make more sense to combine them into a single function that returns both counts. countGarbage :: String -> Int countGarbage = countGarbage' 0 False False where countGarbage' count _ _ [] = count countGarbage' count garbage skip (c:rest) | skip = countGarbage' count garbage False rest | c == '!' = countGarbage' count garbage True rest | garbage = case c of '>' -> countGarbage' count False False rest _ -> countGarbage' (count+1) True False rest | otherwise = case c of '<' -> countGarbage' count True False rest _ -> countGarbage' count False False rest Hspec gives us a domain-specific language heavily inspired by the rspec library for Ruby: the tests read almost like natural language. I built up these tests one-by-one, gradually implementing the appropriate bits of the functions above, a process known as test-driven development. runTests = hspec $ do describe "countGroups" $ do it "counts valid groups" $ do countGroups "{}" `shouldBe` 1 countGroups "{{{}}}" `shouldBe` 6 countGroups "{{{},{},{{}}}}" `shouldBe` 16 countGroups "{{},{}}" `shouldBe` 5 it "ignores garbage" $ do countGroups "{<a>,<a>,<a>,<a>}" `shouldBe` 1 countGroups "{{<ab>},{<ab>},{<ab>},{<ab>}}" `shouldBe` 9 it "skips marked characters" $ do countGroups "{{<!!>},{<!!>},{<!!>},{<!!>}}" `shouldBe` 9 countGroups "{{<a!>},{<a!>},{<a!>},{<ab>}}" `shouldBe` 3 describe "countGarbage" $ do it "counts garbage characters" $ do countGarbage "<>" `shouldBe` 0 countGarbage "<random characters>" `shouldBe` 17 countGarbage "<<<<>" `shouldBe` 3 it "ignores non-garbage" $ do countGarbage "{{},{}}" `shouldBe` 0 countGarbage "{{<ab>},{<ab>},{<ab>},{<ab>}}" `shouldBe` 8 it "skips marked characters" $ do countGarbage "<{!>}>" `shouldBe` 2 countGarbage "<!!>" `shouldBe` 0 countGarbage "<!!!>" `shouldBe` 0 countGarbage "<{o\"i!a,<{i<a>" `shouldBe` 10 Finally, the main function reads in the challenge input and calculates the answers, printing them on standard output. main = do runTests repeat '=' & take 78 & putStrLn input <- getContents & fmap (filter (/='\n')) putStrLn $ "Found " ++ show (countGroups input) ++ " groups" putStrLn $ "Found " ++ show (countGarbage input) ++ " characters garbage" I Heard You Like Registers — Python — #adventofcode Day 8 Today’s challenge describes a simple instruction set for a CPU, incrementing and decrementing values in registers according to simple conditions. We have to interpret a stream of these instructions, and to prove that we’ve done so, give the highest value of any register, both at the end of the program and throughout the whole program. → Full code on GitHub !!!
commentary This turned out to be a nice straightforward one to implement, as the instruction format was easily parsed by regular expression, and Python provides the eval function which made evaluating the conditions a doddle. Import various standard library bits that we’ll use later. import re import fileinput as fi from math import inf from collections import defaultdict We could just parse the instructions by splitting the string, but using a regular expression is a little bit more robust because it won’t match at all if given an invalid instruction. INSTRUCTION_RE = re.compile(r'(\w+) (inc|dec) (-?\d+) if (.+)\s*') def parse_instruction(instruction): match = INSTRUCTION_RE.match(instruction) return match.group(1, 2, 3, 4) Executing an instruction simply checks the condition and if it evaluates to True updates the relevant register. def exec_instruction(registers, instruction): name, op, value, cond = instruction value = int(value) if op == 'dec': value = -value if eval(cond, globals(), registers): registers[name] += value highest_value returns the maximum value found in any register. def highest_value(registers): return sorted(registers.items(), key=lambda x: x[1], reverse=True)[0][1] Finally, loop through all the instructions and carry them out, updating global_max as we go. We need to be able to deal with registers that haven’t been accessed before. Keeping the registers in a dictionary means that we can evaluate the conditions directly using eval above, passing it as the locals argument. The standard dict will raise an exception if we try to access a key that doesn’t exist, so instead we use collections.defaultdict, which allows us to specify what the default value for a non-existent key will be. New registers start at 0, so we use a simple lambda to define a function that always returns 0. global_max = -inf registers = defaultdict(lambda: 0) for i in map(parse_instruction, fi.input()): exec_instruction(registers, i) global_max = max(global_max, highest_value(registers)) print('Max value:', highest_value(registers)) print('All-time max:', global_max) Recursive Circus — Ruby — #adventofcode Day 7 Today’s challenge introduces a set of processes balancing precariously on top of each other. We find them stuck and unable to get down because one of the processes is the wrong size, unbalancing the whole circus. Our job is to figure out the root from the input and then find the correct weight for the single incorrect process. → Full code on GitHub !!! commentary So I didn’t really intend to take a full polyglot approach to Advent of Code, but it turns out to have been quite fun, so I made a shortlist of languages to try. Building a tree is a classic application for object-orientation using a class to represent tree nodes, and I’ve always liked the feel of Ruby’s class syntax, so I gave it a go. First make sure we have access to Set, which we’ll use later. require 'set' Now to define the CircusNode class, which represents nodes in the tree. attr :s automatically creates a function s that returns the value of the instance attribute @s class CircusNode attr :name, :weight def initialize(name, weight, children=nil) @name = name @weight = weight @children = children || [] end Add a << operator (the same syntax for adding items to a list) that adds a child to this node. def <<(c) @children << c @total_weight = nil end total_weight recursively calculates the weight of this node and everything above it. The @total_weight ||= blah idiom caches the value so we only calculate it once. 
def total_weight @total_weight ||= @weight + @children.map {|c| c.total_weight}.sum end balance_weight does the hard work of figuring out the proper weight for the incorrect node by recursively searching through the tree. def balance_weight(target=nil) by_weight = Hash.new{|h, k| h[k] = []} @children.each{|c| by_weight[c.total_weight] << c} if by_weight.size == 1 then if target return @weight - (total_weight - target) else raise ArgumentError, 'This tree seems balanced!' end else odd_one_out = by_weight.select {|k, v| v.length == 1}.first[1][0] child_target = by_weight.select {|k, v| v.length > 1}.first[0] return odd_one_out.balance_weight child_target end end A couple of utility functions for displaying trees finish off the class. def to_s "#{@name} (#{@weight})" end def print_tree(n=0) puts "#{' '*n}#{self} -> #{self.total_weight}" @children.each do |child| child.print_tree n+1 end end end build_circus takes input as a list of lists [name, weight, children]. We make two passes over this list, first creating all the nodes, then building the tree by adding children to parents. def build_circus(data) all_nodes = {} all_children = Set.new data.each do |name, weight, children| all_nodes[name] = CircusNode.new name, weight end data.each do |name, weight, children| children.each {|child| all_nodes[name] << all_nodes[child]} all_children.merge children end root_name = (all_nodes.keys.to_set - all_children).first return all_nodes[root_name] end Finally, build the tree and solve the problem! Note that we use String.to_sym to convert the node names to symbols (written in Ruby as :symbol), because they’re faster to work with in Hashes and Sets as we do above. data = readlines.map do |line| match = /(?<parent>\w+) \((?<weight>\d+)\)(?: -> (?<children>.*))?/.match line [match['parent'].to_sym, match['weight'].to_i, match['children'] ? match['children'].split(', ').map {|x| x.to_sym} : []] end root = build_circus data puts "Root node: #{root}" puts root.balance_weight Memory Reallocation — Python — #adventofcode Day 6 Today’s challenge asks us to follow a recipe for redistributing objects in memory that bears a striking resemblance to the rules of the African game Mancala. → Full code on GitHub !!! commentary When I was doing my MSci, one of our programming exercises was to write (in Haskell, IIRC) a program to play a Mancala variant called Oware, so this had a nice ring of nostalgia. Back to Python today: it's already become clear that it's by far my most fluent language, which makes sense as it's the only one I've used consistently since my schooldays. I'm a bit behind on the blog posts, so you get this one without any explanation, for now at least! import math def reallocate(mem): max_val = -math.inf size = len(mem) for i, x in enumerate(mem): if x > max_val: max_val = x max_index = i i = max_index mem[i] = 0 remaining = max_val while remaining > 0: i = (i + 1) % size mem[i] += 1 remaining -= 1 return mem def detect_cycle(mem): mem = list(mem) steps = 0 prev_states = {} while tuple(mem) not in prev_states: prev_states[tuple(mem)] = steps steps += 1 mem = reallocate(mem) return (steps, steps - prev_states[tuple(mem)]) initial_state = map(int, input().split()) print("Initial state is ", initial_state) steps, cycle = detect_cycle(initial_state) print("Steps to cycle: ", steps) print("Steps in cycle: ", cycle) A Maze of Twisty Trampolines — C++ — #adventofcode Day 5 Today’s challenge has us attempting to help the CPU escape from a maze of instructions. 
It’s not quite a Turing Machine, but it has that feeling of moving a read/write head up and down a tape acting on and changing the data found there. → Full code on GitHub !!! commentary I haven’t written anything in C++ for over a decade. It sounds like there have been lots of interesting developments in the language since then, with C++11, C++14 and the freshly finalised C++17 standards (built-in parallelism in the STL!). I won’t use any of those, but I thought I’d dust off my C++ and see what happened. Thankfully the Standard Template Library classes still did what I expected! As usual, we first include the parts of the standard library we’re going to use: iostream for input & output; vector for the container. We also declare that we’re using the std namespace, so that we don’t have to prepend vector and the other classes with std::. #include <iostream> #include <vector> using namespace std; steps_to_escape_part1 implements part 1 of the challenge: we read a location, move forward/backward by the number of steps given in that location, then add one to the value at that location before repeating. The result is the number of steps we take before jumping outside the list. int steps_to_escape_part1(vector<int>& instructions) { int pos = 0, iterations = 0, new_pos; while (pos < instructions.size()) { new_pos = pos + instructions[pos]; instructions[pos]++; pos = new_pos; iterations++; } return iterations; } steps_to_escape_part2 solves part 2, which is very similar, except that an offset of three or more is decremented instead of incremented before moving on. int steps_to_escape_part2(vector<int>& instructions) { int pos = 0, iterations = 0, new_pos, offset; while (pos < instructions.size()) { offset = instructions[pos]; new_pos = pos + offset; instructions[pos] += offset >=3 ? -1 : 1; pos = new_pos; iterations++; } return iterations; } Finally we pull it all together and link it up to the input. int main() { vector<int> instructions1, instructions2; int n; The cin stream lets us read data from standard input, which we then add to a vector of ints to give our list of instructions. while (true) { cin >> n; if (cin.eof()) break; instructions1.push_back(n); } Solving the problem modifies the input, so we need to take a copy to solve part 2 as well. Thankfully the STL makes this easy with iterators. instructions2.insert(instructions2.begin(), instructions1.begin(), instructions1.end()); Finally, compute the result and print it on standard output. cout << steps_to_escape_part1(instructions1) << endl; cout << steps_to_escape_part2(instructions2) << endl; return 0; } High Entropy Passphrases — Python — #adventofcode Day 4 Today’s challenge describes some simple rules supposedly intended to enforce the use of secure passwords. All we have to do is test a list of passphrases and identify which ones meet the rules. → Full code on GitHub !!! commentary Fearing that today might be as time-consuming as yesterday, I returned to Python and its hugely powerful “batteries-included” standard library. Thankfully this challenge was more straightforward, and I actually finished this before finishing day 3. First, let’s import two useful utilities. from fileinput import input from collections import Counter Part 1 requires simply that a passphrase contains no repeated words. No problem: we split the passphrase into words and count them, and check if any was present more than once. Counter is an amazingly useful class to have in a language’s standard library.
All it does is count things: you add objects to it, and then it will tell you how many of a given object you have. We’re going to use it to count those potentially duplicated words. def is_valid(passphrase): counter = Counter(passphrase.split()) return counter.most_common(1)[0][1] == 1 Part 2 requires that no word in the passphrase be an anagram of any other word. Since we don’t need to do anything else with the words afterwards, we can check for anagrams by sorting the letters in each word: “leaf” and “flea” both become “aefl” and can be compared directly. Then we count as before. def is_valid_ana(passphrase): counter = Counter(''.join(sorted(word)) for word in passphrase.split()) return counter.most_common(1)[0][1] == 1 Finally we pull everything together. sum(map(boolean_func, list)) is a common idiom in Python for counting the number of times a condition (checked by boolean_func) is true. In Python, True and False can be treated as the numbers 1 and 0 respectively, so that summing a list of Boolean values gives you the number of True values in the list. lines = list(input()) print(sum(map(is_valid, lines))) print(sum(map(is_valid_ana, lines))) Spiral Memory — Go — #adventofcode Day 3 Today’s challenge requires us to perform some calculations on an “experimental memory layout”, with cells moving outwards from the centre of a square spiral (squiral?). → Full code on GitHub !!! commentary I’ve been wanting to try my hand at Go, the memory-safe, statically typed compiled language from Google for a while. Today’s challenge seemed a bit more mathematical in nature, meaning that I wouldn’t need too many advanced language features or knowledge of a standard library, so I thought I’d give it a “go”. It might have been my imagination, but it was impressive how quickly the compiled program chomped through 60 different input values while I was debugging. I actually spent far too long on this problem because my brain led me down a blind alley trying to do the wrong calculation, but I got there in the end! The solution is a bit difficult to explain without diagrams, which I don't really have time to draw right now, but fear not because several other people have. First take a look at [the challenge itself which explains the spiral memory concept](http://adventofcode.com/2017/day/3). Then look at the [nice diagrams that Phil Tooley made with Python](http://acceleratedscience.co.uk/blog/adventofcode-day-3-spiral-memory/) and hopefully you'll be able to see what's going on! It's interesting to note that this challenge also admits of an algorithmic solution instead of the mathematical one: you can model the memory as an infinite grid using a suitable data structure and literally move around it in a spiral. In hindsight this is a much better way of solving the challenge quickly because it's easier and less error-prone to code. I'm quite pleased with my maths-ing though, and it's much quicker than the algorithmic version! First some Go boilerplate: we have to define the package we’re in (main, because it’s an executable we’re producing) and import the libraries we’ll use. package main import ( "fmt" "math" "os" ) Weirdly, Go doesn’t seem to have these basic mathematics functions for integers in its standard library (please someone correct me if I’m wrong!) so I’ll define them instead of mucking about with data types. Go doesn’t do any implicit type conversion, even between numeric types, and the math builtin package only operates on float64 values. 
func abs(n int) int { if n < 0 { return -n } return n } func min(x, y int) int { if x < y { return x } return y } func max(x, y int) int { if x > y { return x } return y } This does the heavy lifting for part one: converting from a position on the spiral to a column and row in the grid. (0, 0) is the centre of the spiral. This actually does a bit more than is necessary to calculate the distance as required for part 1, but we’ll use it again for part 2. func spiral_to_xy(n int) (int, int) { if n == 1 { return 0, 0 } r := int(math.Floor((math.Sqrt(float64(n-1)) + 1) / 2)) n_r := n - (2*r-1)*(2*r-1) o := ((n_r - 1) % (2 * r)) - r + 1 sector := (n_r - 1) / (2 * r) switch sector { case 0: return r, o case 1: return -o, r case 2: return -r, -o case 3: return o, -r } return 0, 0 } Now we can use spiral_to_xy to calculate the Manhattan distance over which the value at location n in the spiral memory must be carried to reach the “access port” at (0, 0). func distance(n int) int { x, y := spiral_to_xy(n) return abs(x) + abs(y) } This function does the opposite of spiral_to_xy, translating a grid position back to its position on the spiral. This is the one that took me far too long to figure out because I had a brain bug and tried to calculate the value s (which sector or quarter of the spiral we’re looking at) in a way that was never going to work! Fortunately I came to my senses. func xy_to_spiral(x, y int) int { if x == 0 && y == 0 { return 1 } r := max(abs(x), abs(y)) var s, o, n int if x+y > 0 && x-y >= 0 { s = 0 } else if x-y < 0 && x+y >= 0 { s = 1 } else if x+y < 0 && x-y <= 0 { s = 2 } else { s = 3 } switch s { case 0: o = y case 1: o = -x case 2: o = -y case 3: o = x } n = o + r*(2*s+1) + (2*r-1)*(2*r-1) return n } This is a utility function that uses xy_to_spiral to fetch the value at a given (x, y) location, and returns zero if we haven’t filled that location yet. func get_spiral(mem []int, x, y int) int { n := xy_to_spiral(x, y) - 1 if n < len(mem) { return mem[n] } return 0 } Finally we solve part 2 of the problem, which involves going round the spiral writing values into it that are the sum of some values already written. The result is the first of these sums that is greater than or equal to the given input value. func stress_test(input int) int { mem := make([]int, 1) n := 0 mem[0] = 1 for mem[n] < input { n++ x, y := spiral_to_xy(n + 1) mem = append(mem, get_spiral(mem, x+1, y)+ get_spiral(mem, x+1, y+1)+ get_spiral(mem, x, y+1)+ get_spiral(mem, x-1, y+1)+ get_spiral(mem, x-1, y)+ get_spiral(mem, x-1, y-1)+ get_spiral(mem, x, y-1)+ get_spiral(mem, x+1, y-1)) } return mem[n] } Now the last part of the program puts it all together, reading the input value from a command-line argument and printing the results of the two parts of the challenge: func main() { var n int fmt.Sscanf(os.Args[1], "%d", &n) fmt.Printf("Input is %d\n", n) fmt.Printf("Distance is %d\n", distance(n)) fmt.Printf("Stress test result is %d\n", stress_test(n)) } Corruption Checksum — Python — #adventofcode Day 2 Today’s challenge is to calculate a rather contrived “checksum” over a grid of numbers. → Full code on GitHub !!! commentary Today I went back to plain Python, and I didn’t do formal tests because only one test case was given for each part of the problem. I just got stuck in. I did write part 2 out as nested `for` loops as an intermediate step to working out the generator expression. I think that expanded version may have been more readable.
Having got that far, I couldn't then work out how to finally eliminate the need for an auxiliary function entirely without either sorting the same elements multiple times or sorting each row as it's read. First we read in the input, split it and convert it to numbers. fileinput.input() returns an iterator over the lines in all the files passed as command-line arguments, or over standard input if no files are given. from fileinput import input sheet = [[int(x) for x in l.split()] for l in input()] Part 1 of the challenge calls for finding the difference between the largest and smallest number in each row, and then summing those differences: print(sum(max(x) - min(x) for x in sheet)) Part 2 is a bit more involved: for each row we have to find the unique pair of elements that divide into each other without remainder, then sum the result of those divisions. We can make it a little easier by sorting each row; then we can take each number in turn and compare it only with the numbers after it (which are guaranteed to be larger). Doing this ensures we only make each comparison once. def rowsum_div(row): row = sorted(row) return sum(y // x for i, x in enumerate(row) for y in row[i+1:] if y % x == 0) print(sum(map(rowsum_div, sheet))) We can make this code shorter (if not easier to read) by sorting each row as it’s read: sheet = [sorted(int(x) for x in l.split()) for l in input()] Then we can just use the first and last elements in each row for part 1, as we know those are the smallest and largest respectively in the sorted row: print(sum(x[-1] - x[0] for x in sheet)) Part 2 then becomes a sum over a single generator expression: print(sum(y // x for row in sheet for i, x in enumerate(row) for y in row[i+1:] if y % x == 0)) Very satisfying! Inverse Captcha — Coconut — #adventofcode Day 1 Well, December’s here at last, and with it Day 1 of Advent of Code. … It goes on to explain that you may only leave by solving a captcha to prove you’re not a human. Apparently, you only get one millisecond to solve the captcha: too fast for a normal human, but it feels like hours to you. … As well as posting solutions here when I can, I’ll be putting them all on https://github.com/jezcope/aoc2017 too. !!! commentary After doing some challenges from last year in Haskell for a warm up, I felt inspired to try out the functional-ish Python dialect, Coconut. Now that I’ve done it, it feels a bit of an odd language, neither fish nor fowl. It’ll look familiar to any Pythonista, but is loaded with features normally associated with functional languages, like pattern matching, destructuring assignment, partial application and function composition. That makes it quite fun to work with, as it works similarly to Haskell, but because it's restricted by the basic rules of Python syntax everything feels a bit more like hard work than it should. The accumulator approach feels clunky, but it's necessary to allow [tail call elimination](https://en.wikipedia.org/wiki/Tail_call), which Coconut will do and I wanted to see in action. Lo and behold, if you take a look at the [compiled Python version](https://github.com/jezcope/aoc2017/blob/86c8100824bda1b35e5db6e02d4b80890be7a022/01-inverse-captcha.py#L675) you'll see that my recursive implementation has been turned into a non-recursive `while` loop. Then again, maybe I'm just jealous of Phil Tooley's [one-liner solution in Python](https://github.com/ptooley/aocGolf/blob/1380d78194f1258748ccfc18880cfd575baf5d37/2017.py#L8). 
import sys def inverse_captcha_(s, acc=0): case reiterable(s): match (|d, d|) :: rest: return inverse_captcha_((|d|) :: rest, acc + int(d)) match (|d0, d1|) :: rest: return inverse_captcha_((|d1|) :: rest, acc) return acc def inverse_captcha(s) = inverse_captcha_(s :: s[0]) def inverse_captcha_1_(s0, s1, acc=0): case (reiterable(s0), reiterable(s1)): match ((|d0|) :: rest0, (|d0|) :: rest1): return inverse_captcha_1_(rest0, rest1, acc + int(d0)) match ((|d0|) :: rest0, (|d1|) :: rest1): return inverse_captcha_1_(rest0, rest1, acc) return acc def inverse_captcha_1(s) = inverse_captcha_1_(s, s$[len(s)//2:] :: s) def test_inverse_captcha(): assert "1111" |> inverse_captcha == 4 assert "1122" |> inverse_captcha == 3 assert "1234" |> inverse_captcha == 0 assert "91212129" |> inverse_captcha == 9 def test_inverse_captcha_1(): assert "1212" |> inverse_captcha_1 == 6 assert "1221" |> inverse_captcha_1 == 0 assert "123425" |> inverse_captcha_1 == 4 assert "123123" |> inverse_captcha_1 == 12 assert "12131415" |> inverse_captcha_1 == 4 if __name__ == "__main__": sys.argv[1] |> inverse_captcha |> print sys.argv[1] |> inverse_captcha_1 |> print Advent of Code 2017: introduction It’s a common lament of mine that I don’t get to write a lot of code in my day-to-day job. I like the feeling of making something from nothing, and I often look for excuses to write bits of code, both at work and outside it. Advent of Code is a daily series of programming challenges for the month of December, and is about to start its third annual incarnation. I discovered it too late to take part in any serious way last year, but I’m going to give it a try this year. There are no restrictions on programming language (so of course some people delight in using esoteric languages like Brainf**k), but I think I’ll probably stick with Python for the most part. That said, I miss my Haskell days and I’m intrigued by new kids on the block Go and Rust, so I might end up throwing in a few of those on some of the simpler challenges. I’d like to focus a bit more on how I solve the puzzles. They generally come in two parts, with the second part only being revealed after successful completion of the first part. With that in mind, test-driven development makes a lot of sense, because I can verify that I haven’t broken the solution to the first part in modifying to solve the second. I may also take a literate programming approach with org-mode or Jupyter notebooks to document my solutions a bit more, and of course that will make it easier to publish solutions here so I’ll do that as much as I can make time for. On that note, here are some solutions for 2016 that I’ve done recently as a warmup. 
Day 1: Python Day 1 instructions import numpy as np import pytest as t import sys TURN = { 'L': np.array([[0, 1], [-1, 0]]), 'R': np.array([[0, -1], [1, 0]]) } ORIGIN = np.array([0, 0]) NORTH = np.array([0, 1]) class Santa: def __init__(self, location, heading): self.location = np.array(location) self.heading = np.array(heading) self.visited = [(0,0)] def execute_one(self, instruction): start_loc = self.location.copy() self.heading = self.heading @ TURN[instruction[0]] self.location += self.heading * int(instruction[1:]) self.mark(start_loc, self.location) def execute_many(self, instructions): for i in instructions.split(','): self.execute_one(i.strip()) def distance_from_start(self): return sum(abs(self.location)) def mark(self, start, end): for x in range(min(start[0], end[0]), max(start[0], end[0])+1): for y in range(min(start[1], end[1]), max(start[1], end[1])+1): if any((x, y) != start): self.visited.append((x, y)) def find_first_crossing(self): for i in range(1, len(self.visited)): for j in range(i): if self.visited[i] == self.visited[j]: return self.visited[i] def distance_to_first_crossing(self): crossing = self.find_first_crossing() if crossing is not None: return abs(crossing[0]) + abs(crossing[1]) def __str__(self): return f'Santa @ {self.location}, heading {self.heading}' def test_execute_one(): s = Santa(ORIGIN, NORTH) s.execute_one('L1') assert all(s.location == np.array([-1, 0])) assert all(s.heading == np.array([-1, 0])) s.execute_one('L3') assert all(s.location == np.array([-1, -3])) assert all(s.heading == np.array([0, -1])) s.execute_one('R3') assert all(s.location == np.array([-4, -3])) assert all(s.heading == np.array([-1, 0])) s.execute_one('R100') assert all(s.location == np.array([-4, 97])) assert all(s.heading == np.array([0, 1])) def test_execute_many(): s = Santa(ORIGIN, NORTH) s.execute_many('L1, L3, R3') assert all(s.location == np.array([-4, -3])) assert all(s.heading == np.array([-1, 0])) def test_distance(): assert Santa(ORIGIN, NORTH).distance_from_start() == 0 assert Santa((10, 10), NORTH).distance_from_start() == 20 assert Santa((-17, 10), NORTH).distance_from_start() == 27 def test_turn_left(): east = NORTH @ TURN['L'] south = east @ TURN['L'] west = south @ TURN['L'] assert all(east == np.array([-1, 0])) assert all(south == np.array([0, -1])) assert all(west == np.array([1, 0])) def test_turn_right(): west = NORTH @ TURN['R'] south = west @ TURN['R'] east = south @ TURN['R'] assert all(east == np.array([-1, 0])) assert all(south == np.array([0, -1])) assert all(west == np.array([1, 0])) if __name__ == '__main__': instructions = sys.stdin.read() santa = Santa(ORIGIN, NORTH) santa.execute_many(instructions) print(santa) print('Distance from start:', santa.distance_from_start()) print('Distance to target: ', santa.distance_to_first_crossing()) Day 2: Haskell Day 2 instructions module Main where data Pos = Pos Int Int deriving (Show) -- Magrittr-style pipe operator (|>) :: a -> (a -> b) -> b x |> f = f x swapPos :: Pos -> Pos swapPos (Pos x y) = Pos y x clamp :: Int -> Int -> Int -> Int clamp lower upper x | x < lower = lower | x > upper = upper | otherwise = x clampH :: Pos -> Pos clampH (Pos x y) = Pos x' y' where y' = clamp 0 4 y r = abs (2 - y') x' = clamp r (4-r) x clampV :: Pos -> Pos clampV = swapPos . clampH . swapPos buttonForPos :: Pos -> String buttonForPos (Pos x y) = [buttons !! y !! 
x] where buttons = [" D ", " ABC ", "56789", " 234 ", " 1 "] decodeChar :: Pos -> Char -> Pos decodeChar (Pos x y) 'R' = clampH $ Pos (x+1) y decodeChar (Pos x y) 'L' = clampH $ Pos (x-1) y decodeChar (Pos x y) 'U' = clampV $ Pos x (y+1) decodeChar (Pos x y) 'D' = clampV $ Pos x (y-1) decodeLine :: Pos -> String -> Pos decodeLine p "" = p decodeLine p (c:cs) = decodeLine (decodeChar p c) cs makeCode :: String -> String makeCode instructions = lines instructions -- split into lines |> scanl decodeLine (Pos 1 1) -- decode to positions |> tail -- drop start position |> concatMap buttonForPos -- convert to buttons main = do input <- getContents putStrLn $ makeCode input Research Data Management Forum 18, Manchester !!! intro "" Monday 20 and Tuesday 21 November 2017 I’m at the Research Data Management Forum in Manchester. I thought I’d use this as an opportunity to try liveblogging, so during the event some notes should appear in the box below (you may have to manually refresh your browser tab periodically to get the latest version). I've not done this before, so if the blog stops updating then it's probably because I've stopped updating it to focus on the conference instead! This was made possible using GitHub's cool [Gist](https://gist.github.com) tool. Draft content policy I thought it was about time I had some sort of content policy on here so this is a first draft. It will eventually wind up as a separate page. Feedback welcome! !!! aside “Content policy” This blog’s primary purpose is as a reflective learning tool for my own development; my aim in writing any given post is mainly to expose and develop my own thinking on a topic. My reasons for making a public blog rather than a private journal are: 1. If I'm lucky, someone smarter than me will provide feedback that will help me and my readers to learn more 2. If I'm extra lucky, someone else might learn from the material as well Each post, therefore, represents the state of my thinking at the time I wrote it, or perhaps a deliberate provocation or exaggeration; either way, if you don't know me personally please don't judge me based entirely on my past words. This is a request though, not an attempt to excuse bad behaviour on my part. I accept full responsibility for any consequences of my words, whether intended or not. I will not remove comments or ban individuals for disagreeing with me, only for behaving offensively or disrespectfully. I will do my best to be fair and balanced and explain decisions that I take, but I reserve the right to take those decisions without making any explanation at all if it seems likely to further inflame a situation. If I end up responding to anything simply with a link to this policy, that's probably all the explanation you're going to get. It should go without saying, but the opinions presented in this blog are my own and not those of my employer or anyone else I might at times represent. Learning to live with anxiety !!! intro "" This is a post that I’ve been writing for months, and writing in my head for years. For some it will explain aspects of my personality that you might have wondered about. For some it will just be another person banging on self-indulgently about so-called “mental health issues”. Hopefully, for some it will demystify some stuff and show that you’re not alone and things do get better. For as long as I can remember I’ve been a worrier. I’ve also suffered from bouts of what I now recognise as depression, on and off since my school days. 
It’s only relatively recently that I’ve come to the realisation that these two might be connected and that my ‘worrying’ might in fact be outside the normal range of healthy human behaviour and might more accurately be described as chronic anxiety. You probably won’t have noticed it, but it’s been there. More recently I’ve begun feeling like I’m getting on top of it and feeling “normal” for the first time in my life. Things I’ve found that help include: getting out of the house more and socialising with friends; and getting a range of exercise, outdoors and away from the city (rock climbing is mentally and physically engaging and open water swimming is indescribably joyful). But mostly it’s the cognitive behavioural therapy (CBT) and the antidepressants. Before I go any further, a word about drugs (“don’t do drugs, kids”): I’m on the lowest available dose of a common antidepressant. This isn’t because it stops me being sad all the time (I’m not) or because it makes all my problems go away (it really doesn’t). It’s because the scientific evidence points to a combination of CBT and antidepressants as being the single most effective treatment for generalised anxiety disorder. The reason for this is simple: CBT isn’t easy, because it asks you to challenge habits and beliefs you’ve held your whole life. In the short term there is going to be more anxiety, and some antidepressants are also effective at blunting the effect of this additional anxiety. In short, CBT is what makes you better, and the drugs just make it a little bit more effective. A lot of people have misconceptions about what it means to be ‘in therapy’. I suspect a lot of these are derived from the psychoanalysis we often see portrayed in (primarily US) film and TV. The problem with that type of navel-gazing therapy is that you can spend years doing it, finally reach some sort of breakthrough insight, and still have no idea what the supposed insight means for your actual life. CBT is different in that rather than addressing feelings directly it focuses on habits in your thoughts (cognitive) and actions (behavioural) with feeling better as an outcome (therapy). CBT and related forms of therapy now have decades of clinical evidence showing that they really work. CBT uses a wide range of techniques to identify, challenge and reduce various common unhelpful thoughts and behaviours. By choosing and practising these, you can break bad mental habits that you’ve been carrying around, often for decades. For me this means giving fair weight to my successes as well as my failings, allowing flexibility into the rigid rules that I have always, subconsciously, lived by, and being a bit kinder to myself when I make mistakes. It’s not been easy and I have to remind myself to practise this every day, but it’s really helped. !!! aside “More info” If you live in the UK, you might not be aware that you can get CBT and other psychological therapies on the NHS through a scheme called IAPT (improving access to psychological therapies). You can self-refer so you don’t need to see a doctor first, but you might want to anyway if you think medication might help. They also have a progression of treatments, so you might be offered a course of “guided self-help” and then progressed to CBT or another talking therapy if need be. This is what happened to me, and it did help a bit but it was CBT that helped me the most. Becoming a librarian What is a librarian? Is it someone who has a master’s degree in librarianship and information science?
Is it someone who looks after information for other people? Is it simply someone who works in a library? I’ve been grappling with this question a lot lately because I’ve worked in academic libraries for about 3 years now and I never really thought that’s something that might happen. People keep referring to me as “a librarian” but there’s some imposter feelings here because all the librarians around me have much more experience, have skills in areas like cataloguing and collection management and, generally, have a librarian masters degree. So I’ve been thinking about what it actually means to me to be a librarian or not. NB. some of these may be tongue-in-cheek Ways in which I am a librarian: I work in a library I help people to access and organise information I have a cat I like gin Ways in which I am not a librarian: I don’t have a librarianship qualification I don’t work with books 😉 I don’t knit (though I can probably remember how if pressed) I don’t shush people or wear my hair in a bun (I can confirm that this is also true of every librarian I know) Ways in which I am a shambrarian: I like beer I have more IT experience and qualification than librarianship At the end of the day, I still don’t know how I feel about this or, for that matter, how important it is. I’m probably going to accept whatever title people around me choose to bestow, though any label will chafe at times! Lean Libraries: applying agile practices to library services Kanban board Jeff Lasovski (via Wikimedia Commons) I’ve been working with our IT services at work quite closely for the last year as product owner for our new research data portal, ORDA. That’s been a fascinating process for me as I’ve been able to see first-hand some of the agile techniques that I’ve been reading about from time-to-time on the web over the last few years. They’re in the process of adopting a specific set of practices going under the name “Scrum”, which is fun because it uses some novel terminology that sounds pretty weird to non-IT folks, like “scrum master”, “sprint” and “product backlog”. On my small project we’ve had great success with the short cycle times and been able to build trust with our stakeholders by showing concrete progress on a regular basis. Modern librarianship is increasingly fluid, particularly in research services, and I think that to handle that fluidity it’s absolutely vital that we are able to work in a more agile way. I’m excited about the possibilities of some of these ideas. However, Scrum as implemented by our IT services doesn’t seem something that transfers directly to the work that we do: it’s too specialised for software development to adapt directly. What I intend to try is to steal some of the individual practices on an experimental basis and simply see what works and what doesn’t. The Lean concepts currently popular in IT were originally developed in manufacturing: if they can be translated from the production of physical goods to IT, I don’t see why we can’t make the ostensibly smaller step of translating them to a different type of knowledge work. I’ve therefore started reading around this subject to try and get as many ideas as possible. I’m generally pretty rubbish at taking notes from books, so I’m going to try and record and reflect on any insights I make on this blog. The framework for trying some of these out is clearly a Plan-Do-Check-Act continuous improvement cycle, so I’ll aim to reflect on that process too. 
I’m sure there will have been people implementing Lean in libraries already, so I’m hoping to be able to discover and learn from them instead of starting from scratch. Wish me luck! Mozilla Global Sprint 2017 Photo by Lena Bell on Unsplash Every year, the Mozilla Foundation runs a two-day Global Sprint, giving people around the world 50 hours to work on projects supporting and promoting open culture and tech. Though much of the work during the sprint is, of course, technical software development work, there are always tasks suited to a wide range of different skill sets and experience levels. The participants include writers, designers, teachers, information professionals and many others. This year, for the first time, the University of Sheffield hosted a site, providing a space for local researchers, developers and others to get out of their offices, work on #mozsprint and link up with others around the world. The Sheffield site was organised by the Research Software Engineering group in collaboration with the University Library. Our site was only small compared to others, but we still had people working on several different projects. My reason for taking part in the sprint was to contribute to the international effort on the Library Carpentry project. A team spread across four continents worked throughout the whole sprint to review and develop our lesson material. As there were no other Library Carpentry volunteers at the Sheffield site, I chose to tackle some urgent work around improving the presentation of our workshops and lessons on the web and related workflows. It was a really nice subproject to work on, requiring not only cleaning up and normalising the metadata we hold on workshops and lessons, but also digesting and formalising our current ad hoc process of lesson development. The largest group were solar physicists from the School of Maths and Statistics, working on the SunPy project, an open source environment for solar data analysis. They pushed loads of bug fixes and documentation improvements, and also mentored a new contributor through their first additions to the project. Anna Krystalli from Research Software Engineering worked on the EchoBurst project, which is building a web browser extension to help people break out of their online echo chambers. It does this by using natural language processing techniques to highlight well-written, logically sound articles that disagree with the reader’s stated views on particular topics of interest. Anna was part of an effort to begin extending this technology to online videos. We had a couple of individuals simply taking the opportunity to break out of their normal work environments to work or learn, including a couple of members of library staff who showed up for a couple of hours to learn how to use git on a new project! IDCC 2017 reflection For most of the last few years I've been lucky enough to attend the International Digital Curation Conference (IDCC). One of the main audiences attending is people who, like me, work on research data management at universities around the world and it's begun to feel like a sort of "home" conference to me. This year, IDCC was held at the Royal College of Surgeons in the beautiful city of Edinburgh.
For the last couple of years, my overall impression has been that, as a community, we're moving away from the "first-order" problem of trying to convince people (from PhD students to senior academics) to take RDM seriously and into a rich set of "second-order" problems around how to do things better and widen support to more people. This year has been no exception. Here are a few of my observations and takeaway points. Everyone has a repository now Only last year, the most common question you'd get asked by strangers in the coffee break would be "Do you have a data repository?" Now the question is more likely to be "What are you using for your data repository?", along with more subtle questions about specific components of systems and how they interact. Integrating active storage and archival systems Now that more institutions have data worth preserving, there is more interest in (and in many cases experience of) setting up more seamless integrations between active and archival storage. There are lessons here we can learn. Freezing in amber vs actively maintaining assets There seemed to be an interesting debate going on throughout the conference around the aim of preservation: should we be faithfully preserving the bits and bytes provided without trying to interpret them, or should we take a more active approach by, for example, migrating obsolete formats to newer alternatives. If the former, should we attempt to preserve the software required to access the data as well? If the latter, how much effort do we invest and how do we ensure nothing is lost or altered in the migration? Demonstrating Data Science instead of debating what it is The phrase "Data Science" was once again one of the most commonly uttered of the conference. However, there is now less abstract discussion about what, exactly, is meant by this "data science" thing; this has been replaced more by concrete demonstrations. This change was exemplified perfectly by the keynote by data scientist Alice Daish, who spent a riveting 40 minutes or so enthusing about all the cool stuff she does with data at the British Museum. Recognition of software as an issue Even as recently as last year, I've struggled to drum up much interest in discussing software sustainability and preservation at events like this; the interest was there, but there were higher priorities. So I was completely taken by surprise when we ended up with 30+ people in the Software Preservation Birds of a Feather (BoF) session, and when very little input was needed from me as chair to keep a productive discussion going for a full 90 minutes. Unashamed promotion of openness As a community we seem to have nearly overthrown our collective embarrassment about the phrase "open data" (although maybe this is just me). We've always known it was a good thing, but I know I've been a bit of an apologist in the past, feeling that I had to "soften the blow" when asking researchers to be more open. Now I feel more confident in leading with the benefits of openness, and it felt like that's a change reflected in the community more widely. Becoming more involved in the conference This year, I took a decision to try and do more to contribute to the conference itself, and I felt like this was pretty successful both in making that contribution and building up my own profile a bit. I presented a paper on one of my current passions, Library Carpentry; it felt really good to be able to share my enthusiasm. 
I presented a poster on our work integrating our data repository and digital preservation platform; this gave me more of a structure for networking during breaks, as I was able to stand by the poster and start discussions with anyone who seemed interested. I chaired a parallel session; a first for me, and a different challenge from presenting or simply attending the talks. And finally, I proposed and chaired the Software Preservation BoF session (blog post forthcoming). Renewed excitement It's weird, and possibly all in my imagination, but there seemed to be more energy at this conference than at the previous couple I've been to. More people seemed to be excited about the work we're all doing, recent achievements and the possibilities for the future. Introducing PyRefine: OpenRefine meets Python I’m knocking the rust off my programming skills by attempting to write a pure-Python interpreter for OpenRefine “scripts”. OpenRefine is a great tool for exploring and cleaning datasets prior to analysing them. It also records an undo history of all actions that you can export as a sort of script in JSON format. One thing that bugs me though is that, having spent some time interactively cleaning up your dataset, you then need to fire up OpenRefine again and do some interactive mouse-clicky stuff to apply that cleaning routine to another dataset. You can at least re-import the JSON undo history to make that as quick as possible, but there’s no getting around the fact that there’s no quick way to do it from a cold start. There is a project, BatchRefine, that extends the OpenRefine server to accept batch requests over a HTTP API, but that isn’t useful when you can’t or don’t want to keep a full Java stack running in the background the whole time. My concept is this: you use OR to explore the data interactively and design a cleaning process, but then export the process to JSON and integrate it into your analysis in Python. That way it can be repeated ad nauseam without having to fire up a full Java stack. I’m taking some inspiration from the great talk “So you want to be a wizard?" by Julia Evans (@b0rk), who recommends trying experiments as a way to learn. She gives these Rules of Programming Experiments: “it doesn’t have to be good it doesn’t have to work you have to learn something” In that spirit, my main priorities are: to see if this can be done; to see how far I can get implementing it; and to learn something. If it also turns out to be a useful thing, well, that’s a bonus. Some of the interesting possible challenges here: Implement all core operations; there are quite a lot of these, some of which will be fun (i.e. non-trivial) to implement Implement (a subset of?) GREL, the General Refine Expression Language; I guess my undergrad course on implementing parsers and compilers will come in handy after all! Generate clean, sane Python code from the JSON rather than merely executing it; more than anything, this would be a nice educational tool for users of OpenRefine who want to see how to do equivalent things in Python Selectively optimise key parts of the process; this will involve profiling the code to identify bottlenecks as well as tweaking the actual code to go faster Potentially handle contributions to the code from other people; I’d be really happy if this happened but I’m realistic… If you’re interested, the project is called PyRefine and it’s on github. Constructive criticism, issues & pull requests all welcome! 
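To give a flavour of what I have in mind, here's a very rough sketch (not the actual PyRefine code) of how an exported OpenRefine JSON history might be replayed against a pandas DataFrame. The operation names and parameter keys shown are illustrative only and would need checking against OpenRefine's real export format.

# Hypothetical sketch only: replay a couple of OpenRefine-style operations
# from an exported JSON history against a pandas DataFrame.
import json
import pandas as pd

def apply_history(df, history_path):
    with open(history_path) as f:
        operations = json.load(f)  # assumed: a list of operation dicts
    for op in operations:
        if op["op"] == "core/column-rename":      # illustrative operation name
            df = df.rename(columns={op["oldColumnName"]: op["newColumnName"]})
        elif op["op"] == "core/column-removal":   # illustrative operation name
            df = df.drop(columns=[op["columnName"]])
        else:
            raise NotImplementedError("Operation not implemented yet: " + op["op"])
    return df

# Usage (file names are made up):
# cleaned = apply_history(pd.read_csv("dirty.csv"), "refine-history.json")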
Implementing Yesterbox in emacs with mu4e I’ve been meaning to give Yesterbox a try for a while. The general idea is that each day you only deal with email that arrived yesterday or earlier. This forms your inbox for the day, hence “yesterbox”. Once you’ve emptied your yesterbox, or at least got through some minimum number (10 is recommended), then you can look at emails from today. Even then you only really want to be dealing with things that are absolutely urgent. Anything else can wait until tomorrow. The motivation for doing this is to get away from the feeling that we are King Canute, trying to hold back the tide. I find that when I’m processing my inbox toward zero there’s always a temptation to keep skipping to the new stuff that’s just come in. Hiding away the new email until I’ve dealt with the old is a very interesting idea. I use mu4e in emacs for reading my email, and handily the mu search syntax is very flexible so you’d think it would be easy to create a yesterbox filter: maildir:"/INBOX" date:..1d Unfortunately, 1d is interpreted as “24 hours ago from right now” so this filter misses everything that was sent yesterday but less than 24 hours ago. There was a feature request raised on the mu github repository to implement an additional date filter syntax but it seems to have died a death for now. In the meantime, the answer to this is to remember that my workplace observes fairly standard office hours, so that anything sent more than 9 hours ago is unlikely to have been sent today. The following does the trick: maildir:"/INBOX" date:..9h In my mu4e bookmarks list, that looks like this: (setq mu4e-bookmarks '(("flag:unread AND NOT flag:trashed" "Unread messages" ?u) ("flag:flagged maildir:/archive" "Starred messages" ?s) ("date:today..now" "Today's messages" ?t) ("date:7d..now" "Last 7 days" ?w) ("maildir:\"/Mailing lists.*\" (flag:unread OR flag:flagged)" "Unread in mailing lists" ?M) ("maildir:\"/INBOX\" date:..9h" "Yesterbox" ?y))) ;; <- this is the new one Rewarding good practice in research From opensource.com on Flickr Whenever I’m involved in a discussion about how to encourage researchers to adopt new practices, eventually someone will come out with some variant of the following phrase: “That’s all very well, but researchers will never do XYZ until it’s made a criterion in hiring and promotion decisions.” With all the discussion of carrots and sticks I can see where this attitude comes from, and strongly empathise with it, but it raises two main problems: It’s unfair and more than a little insulting to anyone to be lumped into one homogeneous group; and Taking all the different possible XYZs into account, that’s an awful lot of hoops to expect anyone to jump through. Firstly, “researchers” are as diverse as the rest of us in terms of what gets them out of bed in the morning. Some of us want prestige; some want to contribute to a greater good; some want to create new things; some just enjoy the work. One thing I’d argue we all have in common is this: nothing is more off-putting than feeling like you’re being strong-armed into something you don’t want to do. If we rely on simplistic metrics, people will focus on those and miss the point. At best people will disengage and at worst they will actively game the system. I’ve got to do these ten things to get my next pay rise, and still retain my sanity? OK, what’s the least I can get away with and still tick them off? You see it with students taking poorly-designed assessments, and grown-ups are no different.
We do need to wield carrots as well as sticks, but the whole point is that these practices are beneficial in and of themselves. The carrots are already there if we articulate them properly and clear the roadblocks (don’t you enjoy mixed metaphors?). Creating artificial benefits will just dilute the value of the real ones. Secondly, I’ve heard a similar argument made for all of the following practices and more: Research data management Open Access publishing Public engagement New media (e.g. blogging) Software management and sharing Some researchers devote every waking hour to their work, whether it’s in the lab, writing grant applications, attending conferences, authoring papers, teaching, and so on and so on. It’s hard to see how someone with all this in their schedule can find time to exercise any of these new skills, let alone learn them in the first place. And what about the people who sensibly restrict the hours taken by work to spend more time doing things they enjoy? Yes, all of the above practices are valuable, both for the individual and the community, but they’re all new (to most) and hence require more effort up front to learn. We have to accept that it’s inevitably going to take time for all of them to become “business as usual”. I think if the hiring/promotion/tenure process has any role in this, it’s in asking whether the researcher can build a coherent narrative as to why they’ve chosen to focus their efforts in this area or that. You’re not on Twitter but your data is being used by 200 research groups across the world? Great! You didn’t have time to tidy up your source code for GitHub but your work is directly impacting government policy? Brilliant! We still need to convince more people to do more of these beneficial things, so how? Call me naïve, but maybe we should stick to making rational arguments, calming fears and providing low-risk opportunities to learn new skills. Acting (compassionately) like a stuck record can help. And maybe we’ll need to scale back our expectations in other areas (journal impact factors, anyone?) to make space for the new stuff. Software Carpentry: SC Test; does your software do what you meant? “The single most important rule of testing is to do it.” — Brian Kernighan and Rob Pike, The Practice of Programming (quote taken from the SC Test page) One of the trickiest aspects of developing software is making sure that it actually does what it’s supposed to. Sometimes failures are obvious: you get completely unreasonable output or even (shock!) a comprehensible error message. But failures are often more subtle. Would you notice if your result was out by a few percent, or consistently ignored the first row of your input data? The solution to this is testing: take some simple example input with a known output, run the code and compare the actual output with the expected one. Implement a new feature, test and repeat. Sounds easy, doesn’t it? But then you implement a new bit of code. You test it and everything seems to work fine, except that your new feature required changes to existing code and those changes broke something else. So in fact you need to test everything, and do it every time you make a change. Further than that, you probably want to test that all your separate bits of code work together properly (integration testing) as well as testing the individual bits separately (unit testing). In fact, splitting your tests up like that is a good way of holding on to your sanity.
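To make that concrete, here's a minimal sketch of what a unit test can look like in Python with pytest; the mean function and its expected values are invented purely for illustration, not taken from any particular project.

# test_stats.py -- a tiny, hypothetical example of a unit test.
# Run with: pytest test_stats.py

def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

def test_mean_known_input():
    # Simple example input with a known expected output.
    assert mean([1, 2, 3, 4]) == 2.5

def test_mean_single_value():
    assert mean([7]) == 7.0

Every time mean (or anything it depends on) changes, re-running pytest checks that these known answers still come out right.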
This is actually a lot less scary than it sounds, because there are plenty of tools now to automate that testing: you just type a simple test command and everything is verified. There are even tools that enable you to have tests run automatically when you check the code into version control, and even automatically deploy code that passes the tests, a process known as continuous integration or CI. The big problems with testing are that it’s tedious, your code seems to work without it and no-one tells you off for not doing it. At the time when the Software Carpentry competition was being run, the idea of testing wasn’t new, but the tools to help were in their infancy. “Existing tools are obscure, hard to use, expensive, don’t actually provide much help, or all three.” The SC Test category asked entrants “to design a tool, or set of tools, which will help programmers construct and maintain black box and glass box tests of software components at all levels, including functions, modules, and classes, and whole programs.” The SC Test category is interesting in that the competition administrators clearly found it difficult to specify what they wanted to see in an entry. In fact, the whole category was reopened with a refined set of rules and expectations. Ultimately, it’s difficult to tell whether this category made a significant difference. Where the tools for writing tests used to be sparse and difficult to use, there are now many of them, with several options for most programming languages. With this proliferation, several tried-and-tested methodologies have emerged which are consistent across many different tools, so while things still aren’t perfect they are much better. In recent years there has been a culture shift in the wider software development community towards both testing in general and test-first development, where the tests for a new feature are written first, and then the implementation is coded incrementally until all tests pass. The current challenge is to transfer this culture shift to the academic research community!

Tools for collaborative markdown editing

Photo by Alan Cleaver

I really love Markdown1. I love its simplicity; its readability; its plain-text nature. I love that it can be written and read with nothing more complicated than a text-editor. I love how nicely it plays with version control systems. I love how easy it is to convert to different formats with Pandoc and how it’s become effectively the native text format for a wide range of blogging platforms. One frustration I’ve had recently, then, is that it’s surprisingly difficult to collaborate on a Markdown document. There are various solutions that almost work but at best feel somehow inelegant, especially when compared with rock solid products like Google Docs. Finally, though, we’re starting to see some real possibilities. Here are some of the things I’ve tried, but I’d be keen to hear about other options.

1. Just suck it up

To be honest, Google Docs isn’t that bad. In fact it works really well, and has almost no learning curve for anyone who’s ever used Word (i.e. practically anyone who’s used a computer since the 90s). When I’m working with non-technical colleagues there’s nothing I’d rather use. It still feels a bit uncomfortable though, especially the vendor lock-in. You can export a Google Doc to Word, ODT or PDF, but you need to use Google Docs to do that. Plus as soon as I start working in a word processor I get tempted to muck around with formatting.
2. Git(hub)

The obvious solution to most techies is to set up a GitHub repo, commit the document and go from there. This works very well for bigger documents written over a longer time, but seems a bit heavyweight for a simple one-page proposal, especially over short timescales. Who wants to muck around with pull requests and merging changes for a document that’s going to take 2 days to write tops? This type of project doesn’t need a bug tracker or a wiki or a public homepage anyway. Even without GitHub in the equation, using git for such a trivial use case seems clunky.

3. Markdown in Etherpad/Google Docs

Etherpad is a great tool for collaborative editing, but suffers from two key problems: no syntax highlighting or preview for markdown (it’s just treated as simple text); and you need to find a server to host it or do it yourself. However, there’s nothing to stop you editing markdown with it. You can do the same thing in Google Docs, in fact, and I have. Editing a fundamentally plain-text format in a word processor just feels weird though.

4. Overleaf/Authorea

Overleaf and Authorea are two products developed to support academic editing. Authorea has built-in markdown support but lacks proper simultaneous editing. Overleaf has great simultaneous editing but only supports markdown by wrapping a bunch of LaTeX boilerplate around it. Both OK but unsatisfactory.

5. StackEdit

Now we’re starting to get somewhere. StackEdit has both Markdown syntax highlighting and near-realtime preview, as well as integrating with Google Drive and Dropbox for file synchronisation.

6. HackMD

HackMD is one that I only came across recently, but it looks like it does exactly what I’m after: a simple markdown-aware editor with live preview that also permits simultaneous editing. I’m a little circumspect simply because I know simultaneous editing is difficult to get right, but it certainly shows promise.

7. Classeur

I discovered Classeur literally today: it’s developed by the same team as StackEdit (which is now apparently no longer in development), and is currently in beta, but it looks to offer two killer features: real-time collaboration, including commenting, and pandoc-powered export to loads of different formats.

Anything else?

Those are the options I’ve come up with so far, but they can’t be the only ones. Is there anything I’ve missed? Other plain-text formats are available. I’m also a big fan of org-mode. ↩︎

Software Carpentry: SC Track; hunt those bugs!

This competition will be an opportunity for the next wave of developers to show their skills to the world — and to companies like ours. — Dick Hardt, ActiveState (quote taken from SC Track page)

All code contains bugs, and all projects have features that users would like but which aren’t yet implemented. Open source projects tend to get more of these as their user communities grow and start requesting improvements to the product. As your open source project grows, it becomes harder and harder to keep track of and prioritise all of these potential chunks of work. What do you do? The answer, as ever, is to make a to-do list. Different projects have used different solutions, including mailing lists, forums and wikis, but fairly quickly a whole separate class of software evolved: the bug tracker, which includes such well-known examples as Bugzilla, Redmine and the mighty JIRA. Bug trackers are built entirely around such requests for improvement, and typically track them through workflow stages (planning, in progress, fixed, etc.)
with scope for the community to discuss and add various bits of metadata. In this way, it becomes easier both to prioritise problems against each other and to use the hive mind to find solutions. Unfortunately most bug trackers are big, complicated beasts, more suited to large projects with dozens of developers and hundreds or thousands of users. Clearly a project of this size is more difficult to manage and requires a certain feature set, but the result is that the average bug tracker is non-trivial to set up for a small single-developer project. The SC Track category asked entrants to propose a better bug tracking system. In particular, the judges were looking for something easy to set up and configure without compromising on functionality. The winning entry was a bug-tracker called Roundup, proposed by Ka-Ping Yee. Here we have another tool which is still in active use and development today. Given that there is now a huge range of options available in this area, including the mighty github, this is no small achievement. These days, of course, github has become something of a de facto standard for open source project management. Although it is ostensibly a version control hosting platform, each github repository also comes with a built-in issue tracker, which is well-integrated with the “pull request” workflow system that allows contributors to submit bug fixes and features themselves. Github’s competitors, such as GitLab and Bitbucket, also include similar features. Not everyone wants to work in this way though, so it’s good to see that there is still a healthy ecosystem of open source bug trackers, and that Software Carpentry is still having an impact.

Software Carpentry: SC Config; write once, compile anywhere

Nine years ago, when I first released Python to the world, I distributed it with a Makefile for BSD Unix. The most frequent questions and suggestions I received in response to these early distributions were about building it on different Unix platforms. Someone pointed me to autoconf, which allowed me to create a configure script that figured out platform idiosyncrasies. Unfortunately, autoconf is painful to use – its grouping, quoting and commenting conventions don’t match those of the target language, which makes scripts hard to write and even harder to debug. I hope that this competition comes up with a better solution — it would make porting Python to new platforms a lot easier! — Guido van Rossum, Technical Director, Python Consortium (quote taken from SC Config page)

On to the next Software Carpentry competition category, then. One of the challenges of writing open source software is that you have to make it run on a wide range of systems over which you have no control. You don’t know what operating system any given user might be using or what libraries they have installed, or even what versions of those libraries. This means that whatever build system you use, you can’t just send the Makefile (or whatever) to someone else and expect everything to go off without a hitch. For a very long time, it’s been common practice for source packages to include a configure script that, when executed, runs a bunch of tests to see what it has to work with and sets up the Makefile accordingly. Writing these scripts by hand is a nightmare, so tools like autoconf and automake evolved to make things a little easier. They did, and if the tests you want to use are already implemented they work very well indeed.
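To give a flavour of what one of those configure-time tests actually does, here’s a rough sketch of my own, written in Python purely for illustration: it isn’t how autoconf itself works, and it assumes a C compiler called cc is on your path. It tries to compile a trivial C program to find out whether a header is available, then records the answer for the build to use.

# Illustrative sketch of a single autoconf-style feature check: compile a
# one-line C program to see whether <zlib.h> exists on this system, then
# write the result into a config header. Assumes a compiler named "cc".
import os
import subprocess
import tempfile
import textwrap

def have_header(header, compiler="cc"):
    source = textwrap.dedent(f"""\
        #include <{header}>
        int main(void) {{ return 0; }}
    """)
    with tempfile.TemporaryDirectory() as tmp:
        c_file = os.path.join(tmp, "check.c")
        with open(c_file, "w") as f:
            f.write(source)
        result = subprocess.run(
            [compiler, "-c", c_file, "-o", os.path.join(tmp, "check.o")],
            capture_output=True,
        )
        return result.returncode == 0

with open("config.h", "w") as f:
    f.write(f"#define HAVE_ZLIB_H {1 if have_header('zlib.h') else 0}\n")

A real configure script strings hundreds of checks like this together, which is exactly why hand-writing them is such a nightmare.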
Unfortunately they’re built on an unholy combination of shell scripting and the archaic Gnu M4 macro language. That means if you want to write new tests you need to understand both of these as well as the architecture of the tools themselves — not an easy task for the average self-taught research programmer. SC Conf, then, called for a re-engineering of the autoconf concept, to make it easier for researchers to make their code available in a portable, platform-independent format. The second round configuration tool winner was SapCat, “a tool to help make software portable”. Unfortunately, this one seems not to have gone anywhere, and I could only find the original proposal on the Internet Archive. There were a lot of good ideas in this category about making catalogues and databases of system quirks to avoid having to rerun the same expensive tests again the way a standard ./configure script does. I think one reason none of these ideas survived is that they were overly ambitious, imagining a grand architecture where their tool would provide some overarching source of truth. This is in stark contrast to the way most Unix-like systems work, where each tool does one very specific job well and tools are easy to combine in various ways. In the end though, I think Moore’s Law won out here, making it easier to do the brute-force checks each time than to try anything clever to save time — a good example of avoiding unnecessary optimisation. Add to that the evolution of the generic pkg-config tool from earlier package-specific tools like gtk-config, and it’s now much easier to check for particular versions and features of common packages. On top of that, much of the day-to-day coding of a modern researcher happens in interpreted languages like Python and R, which give you a fully-functioning pre-configured environment with a lot less compiling to do. As a side note, Tom Tromey, another of the shortlisted entrants in this category, is still a major contributor to the open source world. He still seems to be involved in the automake project, contributes a lot of code to the emacs community too and blogs sporadically at The Cliffs of Inanity.

Semantic linefeeds: one clause per line

I’ve started using “semantic linefeeds” when writing content, a concept I discovered on Brandon Rhodes' blog and one that he describes far better than I could. It turns out this is a very old idea, promoted way back in the day by Brian W Kernighan, contributor to the original Unix system, co-creator of the AWK and AMPL programming languages and co-author of a lot of seminal programming textbooks including “The C Programming Language”. The basic idea is that you break lines at natural gaps between clauses and phrases, rather than simply after the last word before you hit 80 characters. Keeping line lengths strictly to 80 characters isn’t really necessary in these days of wide aspect ratios for screens. Breaking lines at points that make semantic sense in the sentence is really helpful for editing, especially in the context of version control, because it isolates changes to the clause in which they occur rather than just the nearest 80-character block. I also like it because it makes my crappy prose feel just a little bit more like poetry. ☺

Software Carpentry: SC Build; or making a better make

Software tools often grow incrementally from small beginnings into elaborate artefacts. Each increment makes sense, but the final edifice is a mess.
make is an excellent example: a simple tool that has grown into a complex domain-specific programming language. I look forward to seeing the improvements we will get from designing the tool afresh, as a whole… — Simon Peyton-Jones, Microsoft Research (quote taken from SC Build page)

Most people who have had to compile an existing software tool will have come across the venerable make tool (which usually these days means GNU Make). It allows the developer to write a declarative set of rules specifying how the final software should be built from its component parts, mostly source code, allowing the build itself to be carried out by simply typing make at the command line and hitting Enter. Given a set of rules, make will work out all the dependencies between components and ensure everything is built in the right order and nothing that is up-to-date is rebuilt. Great in principle but make is notoriously difficult for beginners to learn, as much of the logic for how builds are actually carried out is hidden beneath the surface. This also makes it difficult to debug problems when building large projects. For these reasons, the SC Build category called for a replacement build tool engineered from the ground up to solve these problems. The second round winner, ScCons, is a Python-based make-like build tool written by Steven Knight. While I could find no evidence of any of the other shortlisted entries, this project (now renamed SCons) continues in active use and development to this day. I actually use this one myself from time to time and to be honest I prefer it in many cases to trendy new tools like rake or grunt and the behemoth that is Apache Ant. Its Python-based SConstruct file syntax is remarkably intuitive and scales nicely from very simple builds up to big and complicated projects, with good dependency tracking to avoid unnecessary recompiling. It has a lot of built-in rules for performing common build & compile tasks, but it’s trivial to add your own, either by combining existing building blocks or by writing a new builder with the full power of Python. A minimal SConstruct file looks like this:

Program('hello.c')

Couldn’t be simpler! And you have the full power of Python syntax to keep your build file simple and readable. It’s interesting that all the entries in this category apart from one chose to use a Python-derived syntax for describing build steps. Python was clearly already a language of choice for flexible multi-purpose computing. The exception is the entry that chose to use XML instead, which I think is a horrible idea (oh how I used to love XML!) but has been used to great effect in the Java world by tools like Ant and Maven.

What happened to the original Software Carpentry?

“Software Carpentry was originally a competition to design new software tools, not a training course. The fact that you didn’t know that tells you how well it worked.” When I read this in a recent post on Greg Wilson’s blog, I took it as a challenge. I actually do remember the competition, although looking at the dates it was long over by the time I found it. I believe it did have impact; in fact, I still occasionally use one of the tools it produced, so Greg’s comment got me thinking: what happened to the other competition entries? Working out what happened will need a bit of digging, as most of the relevant information is now only available on the Internet Archive. It certainly seems that by November 2008 the domain name had been allowed to lapse and had been replaced with a holding page by the registrar.
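If you fancy doing some digging of your own, the Internet Archive’s Wayback Machine has a simple availability API; the little sketch below (the timestamp is just an example) is roughly how you’d find the closest archived snapshot of a page.

# Sketch: ask the Wayback Machine's availability API for the closest archived
# snapshot of a page around a given date. Uses only the Python standard library.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp="20010601"):
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

print(closest_snapshot("software-carpentry.org"))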
There were four categories in the competition, each representing a category of tool that the organisers thought could be improved:

SC Build: a build tool to replace make
SC Conf: a configuration management tool to replace autoconf and automake
SC Track: a bug tracking tool
SC Test: an easy to use testing framework

I’m hoping to be able to show that this work had a lot more impact than Greg is admitting here. I’ll keep you posted on what I find!

Changing static site generators: Nanoc → Hugo

I’ve decided to move the site over to a different static site generator, Hugo. I’ve been using Nanoc for a long time and it’s worked very well, but lately it’s been taking longer and longer to compile the site and throwing weird errors that I can’t get to the bottom of. At the time I started using Nanoc, static site generators were in their infancy. There weren’t the huge number of feature-loaded options that there are now, so I chose one and I built a whole load of blogging-related functionality myself. I did it in ways that made sense at the time but no longer work well with Nanoc’s latest versions. So it’s time to move to something that has blogging baked in from the beginning, and I’m taking the opportunity to overhaul the look and feel too. Again, when I started there weren’t many pre-existing themes so I built the whole thing myself, and though I’m happy with the work I did on it, it never quite felt polished enough. Now I’ve got the opportunity to adapt one of the many well-designed themes already out there, so I’ve taken one from the Hugo themes gallery and tweaked the colours to my satisfaction. Hugo also has various features that I’ve wanted to implement in Nanoc but never quite got round to. The nicest one is proper handling of draft posts and future dates, but I keep finding others. There’s a lot of old content that isn’t quite compatible with the way Hugo does things, so I’ve taken the old Nanoc-compiled content and frozen it to make sure that old links still work. I could probably fiddle with it for years without doing much, so it’s probably time to go ahead and publish it. I’m still not completely happy with my choice of theme but one of the joys of Hugo is that I can change that whenever I want. Let me know what you think!

License

Except where otherwise stated, all content on eRambler by Jez Cope is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

RDM Resources

I occasionally get asked for resources to help someone learn more about research data management (RDM) as a discipline (i.e. for those providing RDM support rather than simply wanting to manage their own data). I’ve therefore collected a few resources together on this page. If you’re lucky I might even update it from time to time! First, a caveat: this is very focussed on UK Higher Education, though much of it will still be relevant for people outside that narrow demographic. My general recommendation would be to start with the Digital Curation Centre (DCC) website and follow links out from there. I also have a slowly growing list of RDM links on Diigo, and there’s an RDM section in my list of blogs and feeds too.
Mailing lists

Jiscmail is a popular list server run for the benefit of further and higher education in the UK; the following lists are particularly relevant:

RESEARCH-DATAMAN
DATA-PUBLICATION
DIGITAL-PRESERVATION
LIS-RESEARCHSUPPORT

The Research Data Alliance have a number of Interest Groups and Working Groups that discuss issues by email.

Events

International Digital Curation Conference — major annual conference
Research Data Management Forum — roughly every six months, places are limited!
RDA Plenary — also every 6 months, but only about 1 in every 3 in Europe

Books

In no particular order:

Martin, Victoria. Demystifying eResearch: A Primer for Librarians. Libraries Unlimited, 2014.
Borgman, Christine L. Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, Massachusetts: The MIT Press, 2015.
Corti, Louise, Veerle Van den Eynden, and Libby Bishop. Managing and Sharing Research Data. Thousand Oaks, CA: SAGE Publications Ltd, 2014.
Pryor, Graham, ed. Managing Research Data. Facet Publishing, 2012.
Pryor, Graham, Sarah Jones, and Angus Whyte, eds. Delivering Research Data Management Services: Fundamentals of Good Practice. Facet Publishing, 2013.
Ray, Joyce M., ed. Research Data Management: Practical Strategies for Information Professionals. West Lafayette, Indiana: Purdue University Press, 2014.

Reports

‘Ten Recommendations for Libraries to Get Started with Research Data Management’. LIBER, 24 August 2012. http://libereurope.eu/news/ten-recommendations-for-libraries-to-get-started-with-research-data-management/.
‘Science as an Open Enterprise’. Royal Society, 2 June 2012. https://royalsociety.org/policy/projects/science-public-enterprise/Report/.
Auckland, Mary. ‘Re-Skilling for Research’. RLUK, January 2012. http://www.rluk.ac.uk/wp-content/uploads/2014/02/RLUK-Re-skilling.pdf.

Journals

International Journal of Digital Curation (IJDC)
Journal of eScience Librarianship (JeSLib)

Fairphone 2: initial thoughts on the original ethical smartphone

I’ve had my eye on the Fairphone 2 for a while now, and when my current phone, an aging Samsung Galaxy S4, started playing up I decided it was time to take the plunge. A few people have asked for my thoughts on the Fairphone so here are a few notes.

Why I bought it

The thing that sparked my interest, and the main reason for buying the phone really, was the ethical stance of the manufacturer. The small Dutch company have gone to great lengths to ensure that both labour and materials are sourced as responsibly as possible. They regularly inspect the factories where the parts are made and assembled to ensure fair treatment of the workers and they source all the raw materials carefully to minimise the environmental impact and the use of conflict minerals. Another side to this ethical stance is a focus on longevity of the phone itself. This is not a product with an intentionally limited lifespan. Instead, it’s designed to be modular and as repairable as possible, by the owner themselves. Spares are available for all of the parts that commonly fail in phones (including screen and camera), and at the time of writing the Fairphone 2 is the only phone to receive 10/10 for repairability from iFixit. There are plans to allow hardware upgrades, including an expansion port on the back so that NFC or wireless charging could be added with a new case, for example.

What I like

So far, the killer feature for me is the dual SIM card slots.
I have both a personal and a work phone, and the latter was always getting left at home or in the office or running out of charge. Now I have both SIMs in the one phone: I can receive calls on either number, turn them on and off independently and choose which account to use when sending a text or making a call. The OS is very close to “standard” Android, which is nice, and I really don’t miss all the extra bloatware that came with the Galaxy S4. It also has twice the storage of that phone, which is hardly unique but is still nice to have. Overall, it seems like a solid, reliable phone, though it’s not going to outperform anything else at the same price point. It certainly feels nice and snappy for everything I want to use it for. I’m no mobile gamer, but there is that distant promise of upgradability on the horizon if you are.

What I don’t like

I only have two bugbears so far. Once or twice it’s locked up and become unresponsive, requiring a “manual reset” (removing and replacing the battery) to get going again. It also lacks NFC, which isn’t really a deal breaker, but I was just starting to make occasional use of it on the S4 (mostly experimenting with my Yubikey NEO) and it would have been nice to try out Android Pay when it finally arrives in the UK.

Overall

It’s definitely a serious contender if you’re looking for a new smartphone and aren’t bothered about serious mobile gaming. You do pay a premium for the ethical sourcing and modularity, but I feel that’s worth it for me. I’m looking forward to seeing how it works out as a phone.

Wiring my web

I’m a nut for automating repetitive tasks, so I was dead pleased a few years ago when I discovered that IFTTT let me plug different bits of the web together. I now use it for tasks such as:

Syndicating blog posts to social media
Creating scheduled/repeating todo items from a Google Calendar
Making a note to revisit an article I’ve starred in Feedly

I’d probably only be half-joking if I said that I spend more time automating things than I save not having to do said things manually. Thankfully it’s also a great opportunity to learn, and recently I’ve been thinking about reimplementing some of my IFTTT workflows myself to get to grips with how it all works. There are some interesting open source projects designed to offer a lot of this functionality, such as Huginn, but I decided to go for a simpler option for two reasons: I want to spend my time learning about the APIs of the services I use and how to wire them together, rather than learning how to use another big framework; and I only have a small Amazon EC2 server to play with and a heavy Ruby on Rails app like Huginn (plus web server) needs more memory than I have. Instead I’ve gone old-school with a little collection of individual scripts to do particular jobs. I’m using the built-in scheduling functionality of systemd, which is already part of a modern Linux operating system, to get them to run periodically. It also means I can vary the language I use to write each one depending on the needs of the job at hand and what I want to learn/feel like at the time. Currently it’s all done in Python, but I want to have a go at Lisp sometime, and there are some interesting new languages like Go and Julia that I’d like to get my teeth into as well. You can see my code on github as it develops: https://github.com/jezcope/web-plumbing. Comments and contributions are welcome (if not expected) and let me know if you find any of the code useful.
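To give a flavour of the sort of thing that’s in there, here’s a stripped-down sketch of one of these jobs; the feed URL and file paths are made up for illustration rather than lifted from the actual repository. It polls a feed, appends anything it hasn’t seen before to a plain-text to-do list, and leaves the scheduling to a systemd timer (or cron).

# Sketch of a "wire two services together" job: poll an Atom feed and append
# any new entries to a plain-text to-do list. A systemd timer or cron job runs
# it periodically; the script itself just does one pass. Paths and URL are
# illustrative only.
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

FEED_URL = "https://example.org/starred.atom"          # hypothetical feed
SEEN_FILE = Path.home() / ".cache" / "feed-todo-seen.txt"
TODO_FILE = Path.home() / "todo.txt"

def fetch_entries(url):
    """Yield (id, title) pairs for each entry in an Atom feed."""
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in tree.findall("atom:entry", ns):
        entry_id = entry.findtext("atom:id", default="", namespaces=ns)
        title = entry.findtext("atom:title", default="(untitled)", namespaces=ns)
        yield entry_id, title

def main():
    SEEN_FILE.parent.mkdir(parents=True, exist_ok=True)
    seen = set(SEEN_FILE.read_text().splitlines()) if SEEN_FILE.exists() else set()
    with TODO_FILE.open("a") as todo, SEEN_FILE.open("a") as seen_log:
        for entry_id, title in fetch_entries(FEED_URL):
            if entry_id and entry_id not in seen:
                todo.write(f"Revisit: {title}\n")
                seen_log.write(entry_id + "\n")

if __name__ == "__main__":
    main()

Run once from a timer, that’s the whole IFTTT “recipe” replaced by thirty-odd lines of standard-library Python.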
Image credit: xkcd #1319, Automation

Data is like water, and language is like clothing

I admit it: I’m a grammar nerd. I know the difference between ‘who’ and ‘whom’, and I’m proud. I used to be pretty militant, but these days I’m more relaxed. I still take joy in the mechanics of the language, but I also believe that English is defined by its usage, not by a set of arbitrary rules. I’m just as happy to abuse it as to use it, although I still think it’s important to know what rules you’re breaking and why. My approach now boils down to this: language is like clothing. You (probably) wouldn’t show up to a job interview in your pyjamas1, but neither are you going to wear a tuxedo or ballgown to the pub. Getting commas and semicolons in the right place is like getting your shirt buttons done up right. Getting it wrong doesn’t mean you’re an idiot. Everyone will know what you meant. It will affect how you’re perceived, though, and that will affect how your message is perceived. And there are former rules2 that some still enforce that are nonetheless dropping out of regular usage. There was a time when everyone in an office job wore formal clothing. Then it became acceptable just to have a blouse, or a shirt and tie. Then the tie became optional and now there are many professions where perfectly well-respected and competent people are expected to show up wearing nothing smarter than jeans and a t-shirt. One such rule IMHO is that ‘data’ is a plural and should take pronouns like ‘they’ and ‘these’. The origin of the word ‘data’ is in the Latin plural of ‘datum’, and that idea has clung on for a considerable period. But we don’t speak Latin and the English language continues to evolve: ‘agenda’ also began life as a Latin plural, but we don’t use the word ‘agendum’ any more. It’s common everyday usage to refer to data with singular pronouns like ‘it’ and ‘this’, and it’s very rare to see someone referring to a single datum (as opposed to ‘data point’ or something). If you want to get technical, I tend to think of data as a mass noun, like ‘water’ or ‘information’. It’s uncountable: talking about ‘a water’ or ‘an information’ doesn’t make much sense, but it uses singular pronouns, as in ‘this information’. If you’re interested, the Oxford English Dictionary also takes this position, while Chambers leaves the choice of singular or plural noun up to you. There is absolutely nothing wrong, in my book, with referring to data in the plural as many people still do. But it’s no longer a rule and for me it’s weakened further from guideline to preference. It’s like wearing a bow-tie to work. There’s nothing wrong with it and some people really make it work, but it’s increasingly outdated and even a little eccentric. or maybe you’d totally rock it. ↩︎ Like not starting a sentence with a conjunction… ↩︎

#IDCC16 day 2: new ideas

Well, I did a great job of blogging the conference for a couple of days, but then I was hit by the bug that’s been going round and didn’t have a lot of energy for anything other than paying attention and making notes during the day! I’ve now got round to reviewing my notes so here are a few reflections on day 2. Day 2 was the day of many parallel talks! So many great and inspiring ideas to take in! Here are a few of my take-home points.

Big science and the long tail

The first parallel session had examples of practical data management in the real world.
Jian Qin & Brian Dobreski (School of Information Studies, Syracuse University) worked on reproducibility with one of the research groups involved with the recent gravitational wave discovery. “Reproducibility” for this work (as with much of physics) mostly equates to computational reproducibility: tracking the provenance of the code and its input and output is key. They also found that in practice the scientists' focus was on making the big discovery, and ensuring reproducibility was seen as secondary. This goes some way to explaining why current workflows and tools don’t really capture enough metadata. Milena Golshan & Ashley Sands (Center for Knowledge Infrastructures, UCLA) investigated the use of Software-as-a-Service (SaaS, such as Google Drive, Dropbox or more specialised tools) as a way of meeting the needs of long-tail science research such as ocean science. This research is characterised by small teams, diverse data, dynamic local development of tools, local practices and difficulty disseminating data. This results in a need for researchers to be generalists, as opposed to “big science” research areas, where they can afford to specialise much more deeply. Such generalists tend to develop their own isolated workflows, which can differ greatly even within a single lab. Long-tail research also often struggles from a lack of dedicated IT support. They found that use of SaaS could help to meet these challenges, but with a high cost required to cover the needed guarantees of security and stability. Education & training This session focussed on the professional development of library staff. Eleanor Mattern (University of Pittsburgh) described the immersive training introduced to improve librarians' understanding of the data needs of their subject areas in delivering their RDM service delivery model. The participants each conducted a “disciplinary deep dive”, shadowing researchers and then reporting back to the group on their discoveries with a presentation and discussion. Liz Lyon (also University of Pittsburgh, formerly UKOLN/DCC) gave a systematic breakdown of the skills, knowledge and experience required in different data-related roles, obtained from an analysis of job adverts. She identified distinct roles of data analyst, data engineer and data journalist, and as well as each role’s distinctive skills, pinpointed common requirements of all three: Python, R, SQL and Excel. This work follows on from an earlier phase which identified an allied set of roles: data archivist, data librarian and data steward. Data sharing and reuse This session gave an overview of several specific workflow tools designed for researchers. Marisa Strong (University of California Curation Centre/California Digital Libraries) presented Dash, a highly modular tool for manual data curation and deposit by researchers. It’s built on their flexible backend, Stash, and though it’s currently optimised to deposit in their Merritt data repository it could easily be hooked up to other repositories. It captures DataCite metadata and a few other fields, and is integrated with ORCID to uniquely identify people. In a different vein, Eleni Castro (Institute for Quantitative Social Science, Harvard University) discussed some of the ways that Harvard’s Dataverse repository is streamlining deposit by enabling automation. It provides a number of standardised endpoints such as OAI-PMH for metadata harvest and SWORD for deposit, as well as custom APIs for discovery and deposit. 
Interesting use cases include:

An addon for the Open Science Framework to deposit in Dataverse via SWORD
An R package to enable automatic deposit of simulation and analysis results
Integration with publisher workflows (Open Journal Systems)
A growing set of visualisations for deposited data

In the future they’re also looking to integrate with DMPtool to capture data management plans and with Archivematica for digital preservation. Andrew Treloar (Australian National Data Service) gave us some reflections on the ANDS “applications programme”, a series of 25 small funded projects intended to address the fourth of their strategic transformations, single use → reusable. He observed that essentially these projects worked because they were able to throw money at a problem until they found a solution: not very sustainable. Some of them stuck to a traditional “waterfall” approach to project management, resulting in “the right solution 2 years late”. Every researcher’s needs are “special” and communities are still constrained by old ways of working. The conclusions from this programme were that:

“Good enough” is fine most of the time
Adopt/Adapt/Augment is better than Build
Existing toolkits let you focus on the 10% functionality that’s missing
Successful projects involved research champions who can: 1) articulate their community’s requirements; and 2) promote project outcomes

Summary

All in all, it was a really exciting conference, and I’ve come home with loads of new ideas and plans to develop our services at Sheffield. I noticed a continuation of some of the trends I spotted at last year’s IDCC, especially an increasing focus on “second-order” problems: we’re no longer spending most of our energy just convincing researchers to take data management seriously and are able to spend more time helping them to do it better and get value out of it. There’s also a shift in emphasis (identified by closing speaker Cliff Lynch) from sharing to reuse, and making sure that data is not just available but valuable.

#IDCC16 Day 1: Open Data

The main conference opened today with an inspiring keynote by Barend Mons, Professor in Biosemantics, Leiden University Medical Center. The talk had plenty of great stuff, but two points stood out for me. First, Prof Mons described a newly discovered link between Huntington’s Disease and a previously unconsidered gene. No-one had previously recognised this link, but on mining the literature, an indirect link was identified in more than 10% of the roughly 1 million scientific claims analysed. This is knowledge for which we already had more than enough evidence, but which could never have been discovered without such a wide-ranging computational study. Second, he described a number of behaviours which should be considered “malpractice” in science:

Relying on supplementary data in articles for data sharing: the majority of this is trash (paywalled, embedded in bitmap images, missing)
Using the Journal Impact Factor to evaluate science and ignoring altmetrics
Not writing data stewardship plans for projects (he prefers this term to “data management plan”)
Obstructing tenure for data experts by assuming that all highly-skilled scientists must have a long publication record

A second plenary talk from Andrew Sallons of the Center for Open Science introduced a number of interesting-looking bits and bobs, including the Transparency & Openness Promotion (TOP) Guidelines which set out a pathway to help funders, publishers and institutions move towards more open science.
The rest of the day was taken up with a panel on open data, a poster session, some demos and a birds-of-a-feather session on sharing sensitive/confidential data. There was a great range of posters, but a few that stood out to me were:

Lessons learned about ISO 16363 (“Audit and certification of trustworthy digital repositories”) certification from the British Library
Two separate posters (from the Universities of Toronto and Colorado) about disciplinary RDM information & training for liaison librarians
A template for sharing psychology data developed by a psychologist-turned-information researcher from Carnegie Mellon University

More to follow, but for now it’s time for the conference dinner!

#IDCC16 Day 0: business models for research data management

I’m at the International Digital Curation Conference 2016 (#IDCC16) in Amsterdam this week. It’s always a good opportunity to pick up some new ideas and catch up with colleagues from around the world, and I always come back full of new possibilities. I’ll try and do some more reflective posts after the conference but I thought I’d do some quick reactions while everything is still fresh. Monday and Thursday are pre- and post-conference workshop days, and today I attended Developing Research Data Management Services. Joy Davidson and Jonathan Rans from the Digital Curation Centre (DCC) introduced us to the Business Model Canvas, a template for designing a business model on a single sheet of paper. The model prompts you to think about all of the key facets of a sustainable, profitable business, and can easily be adapted to the task of building a service model within a larger institution. The DCC used it as part of the Collaboration to Clarify Curation Costs (4C) project, whose output the Curation Costs Exchange is also worth a look. It was a really useful exercise to be able to work through the whole process for an aspect of research data management (my table focused on training & guidance provision), both because of the ideas that came up and also the experience of putting the framework into practice. It seems like a really valuable tool and I look forward to seeing how it might help us with our RDM service development. Tomorrow the conference proper begins, with a range of keynotes, panel sessions and birds-of-a-feather meetings so hopefully more then!

About me

I help people in Higher Education communicate and collaborate more effectively using technology. I currently work at the University of Sheffield focusing on research data management policy, practice, training and advocacy. In my free time, I like to: run; play the accordion; morris dance; climb; cook; read (fiction and non-fiction); write.

Better Science Through Better Data #scidata17

Better Science through Better Doughnuts (image: Jez Cope)

Update: fixed the link to the slides so it works now! Last week I had the honour of giving my first ever keynote talk, at an event entitled Better Science Through Better Data hosted jointly by Springer Nature and the Wellcome Trust. It was nerve-wracking but exciting and seemed to go down fairly well. I even got accidentally awarded a PhD in the programme — if only it was that easy! The slides for the talk, “Supporting Open Research: The role of an academic library”, are available online (doi:10.15131/shef.data.5537269), and the whole event was video’d for posterity and viewable online. I got some good questions too, mainly from the clever online question system.
I didn’t get to answer all of them, so I’m thinking of doing a blog post or two to address a few more. There were loads of other great presentations as well, both keynotes and 7-minute lightning talks, so I’d encourage you to take a look at at least some of it. I’ll pick out a few of my highlights.

Dr Aled Edwards (University of Toronto)

There’s a major problem with science funding that I hadn’t really thought about before. The available funding pool for research is divided up into pots by country, and often by funding body within a country. Each of these pots has robust processes to award funding to the most important problems and most capable researchers. The problem comes because there is no coordination between these pots, so researchers all over the world end up getting funded to research the most popular problems, leading to a lot of duplication of effort. Industry funding suffers from a similar problem, particularly the pharmaceutical industry. Because there is no sharing of data or negative results, multiple companies spend billions researching the same dead ends chasing after the same drugs. This is where the astronomical costs of drug development come from. Dr Edwards presented one alternative, modelled by a company called M4K Pharma. The idea is to use existing IP laws to try and give academic researchers a reasonable, morally-justifiable and sustainable profit on drugs they develop, in contrast to the current model where basic research is funded by governments while large corporations hoover up as much profit as they possibly can. This new model would develop drugs all the way to human trial within academia, then license the resulting drugs to companies to manufacture with a price cap to keep the medicines affordable to all who need them. Core to this effort is openness with data, materials and methodology, and Dr Edwards presented several examples of how this approach benefited academic researchers, industry and patients compared with a closed, competitive focus.

Dr Kirstie Whitaker (Alan Turing Institute)

This was a brilliant presentation, presenting a practical how-to guide to doing reproducible research, from one researcher to another. I suggest you take a look at her slides yourself: Showing your working: a how-to guide to reproducible research. Dr Whitaker briefly addressed a number of common barriers to reproducible research:

Is not considered for promotion: so it should be!
Held to higher standards than others: reviewers should be discouraged from nitpicking just because the data/code/whatever is available (true unbiased peer review of these would be great though)
Publication bias towards novel findings: it is morally wrong to not publish reproductions, replications etc. so we need to address the common taboo on doing so
Plead the 5th: if you share, people may find flaws, but if you don’t they can’t — if you’re worried about this you should ask yourself why!
Support additional users: some (much?) of the burden should reasonably fall on the reuser, not the sharer
Takes time: this is only true if you hack it together after the fact; if you do it from the start, the whole process will be quicker!
Requires additional skills: important to provide training, but also to judge PhD students on their ability to do this, not just on their thesis & papers

The rest of the presentation, the “how-to” guide of the title, was a well-chosen and passionately delivered set of recommendations, but the thing that really stuck out for me is how good Dr Whitaker is at making the point that you only have to do one of these things to improve the quality of your research. It’s easy to get the impression at the moment that you have to be fully, perfectly open or not at all, but it’s actually OK to get there one step at a time, or even not to go all the way at all! Anyway, I think this is a slide deck that speaks for itself, so I won’t say any more!

Lightning talk highlights

There was plenty of good stuff in the lightning talks, which were constrained to 7 minutes each, but a few of the things that stood out for me were, in no particular order:

Code Ocean — share and run code in the cloud
dat project — peer to peer data synchronisation tool: can automate metadata creation, data syncing and versioning, and set up a secure data sharing network that keeps the data in sync but off the cloud
Berlin Institute of Health — open science course for students, with a pre-print paper and course materials available
InterMine — taking the pain out of data cleaning & analysis
Nix/NixOS as a component of a reproducible paper
BoneJ (ImageJ plugin for bone analysis) — developed by a scientist, used a lot, now has a Wellcome-funded RSE to develop the next version
ESASky — amazing live, online archive of masses of astronomical data

Coda

I really enjoyed the event (and the food was excellent too). My thanks go out to:

The programme committee for asking me to come and give my take — I hope I did it justice!
The organising team who did a brilliant job of keeping everything running smoothly before and during the event
The University of Sheffield for letting me get away with doing things like this!

Blog platform switch

I’ve just switched my blog over to the Nikola static site generator. Hopefully you won’t notice a thing, but there might be a few weird spectres around til I get all the kinks ironed out. I’ve made the switch for a couple of main reasons:

Nikola supports Jupyter notebooks as a source format for blog posts, which will be useful for including code snippets
It’s written in Python, a language which I actually know, so I’m more likely to be able to fix things that break, customise it and potentially contribute to the open source project (by contrast, Hugo is written in Go, which I’m not really familiar with)

Chat rooms vs Twitter: how I communicate now

CC0, Pixabay

This time last year, Brad Colbow published a comic in his “The Brads” series entitled “The long slow death of Twitter”. It really encapsulates the way I’ve been feeling about Twitter for a while now. Go ahead and take a look. I’ll still be here when you come back. According to my Twitter profile, I joined in February 2009 as user #20,049,102. It was nearing its 3rd birthday and, though there were clearly a lot of people already signed up at that point, it was still relatively quiet, especially in the UK. I was a lonely PhD student just starting to get interested in educational technology, and one thing that Twitter had in great supply was (and still is) people pushing back the boundaries of what tech can do in different contexts.
Somewhere along the way Twitter got really noisy, partly because more people (especially commercial companies) are using it more to talk about stuff that doesn’t interest me, and partly because I now follow 1,200+ people and find I get several tweets a second at peak times, which no-one could be expected to handle. More recently I’ve found my attention drawn to more focussed communities instead of that big old shouting match. I find I’m much more comfortable discussing things and asking questions in small focussed communities because I know who might be interested in what. If I come across an article about a cool new Python library, I’ll geek out about it with my research software engineer friends; if I want advice on an aspect of my emacs setup, I’ll ask a bunch of emacs users. I feel like I’m talking to people who want to hear what I’m saying. Next to that experience, Twitter just feels like standing on a street corner shouting. IRC channels (mostly on Freenode), and similar things like Slack and gitter, form the bulk of this for me, along with a growing number of WhatsApp group chats. Although online chat is theoretically a synchronous medium, I find that I can treat it more as “semi-synchronous”: I can have real-time conversations as they arise, but I can also close them and tune back in later to catch up if I want. Now I come to think about it, this is how I used to treat Twitter before the 1,200 follows happened. I also find I visit a handful of forums regularly, mostly of the Reddit link-sharing or StackExchange Q&A type. /r/buildapc was invaluable when I was building my latest box, and /r/EarthPorn (very much not NSFW) is just beautiful. I suppose the risk of all this is that I end up reinforcing my own echo chamber. I’m not sure how to deal with that, but I certainly can’t deal with it while also suffering from information overload.

Not just certifiable…

A couple of months ago, I went to Oxford for an intensive, 2-day course run by Software Carpentry and Data Carpentry for prospective new instructors. I’ve now had confirmation that I’ve completed the checkout procedure so it’s official: I’m now a certified Data Carpentry instructor! As far as I’m aware, the certification process is now combined, so I’m also approved to teach Software Carpentry material. And of course there’s Library Carpentry too…

SSI Fellowship 2020

I’m honoured and excited to be named one of this year’s Software Sustainability Institute Fellows. There’s not much to write about yet because it’s only just started, but I’m looking forward to sharing more with you. In the meantime, you can take a look at the 2020 fellowship announcement and get an idea of my plans from my application video.

Talks

Here is a selection of talks that I’ve given.