25 for 2020

An obvious follow-on from my last post is to see what my top 25 albums of the year are. In the past I've tried to mentally travel over the releases of the past year to try to cook up a list. But this year I thought it would be fun to use the LastFM API to look at my music listening history for 2020, and let the data do the talking, as it were.

The first problem is that while LastFM is a good source for my listening history, its metadata for albums seems quite sparse. The LastFM album.getInfo API call doesn't seem to return the year the album was published. The LastFM docs indicate that a releasedate property is available, but I couldn't seem to find it in either the XML or JSON responses. Maybe it was there once and now is gone? Maybe there's some trick I was overlooking with the API? Who knows.

So to get around this I used LastFM to get my listening history, and then the Discogs API to fetch metadata for each album using their search endpoint. LastFM includes MusicBrainz identifiers for tracks and most artists and albums, so I could have used those to look up the albums with the MusicBrainz API. But I wasn't sure I would find good release dates there either, as their focus seems to be on recognizing tracks and linking them to albums and artists.

Discogs is a superb human-curated database, like a Wikipedia for music aficionados. Their API returns a good amount of information for each album, for example:

```json
{
  "country": "US",
  "year": "1983",
  "format": ["Vinyl", "LP", "Album"],
  "label": [
    "I.R.S. Records",
    "I.R.S. Records",
    "I.R.S. Records",
    "A&M Records, Inc.",
    "A&M Records, Inc.",
    "I.R.S., Inc.",
    "I.R.S., Inc.",
    "Electrosound Group Midwest, Inc.",
    "Night Garden Music",
    "Unichappell Music, Inc.",
    "Reflection Sound Studios",
    "Sterling Sound"
  ],
  "type": "master",
  "genre": ["Rock"],
  "style": ["Indie Rock"],
  "id": 14515,
  "barcode": [
    "SP-070604-A",
    "SP-070604-B",
    "SP0 70604 A ES1 EMW",
    "SP0 70604-B-ES1 EMW",
    "SP0 70604-B-ES2 EMW",
    "STERLING",
    "(B)",
    "BMI"
  ],
  "user_data": {"in_wantlist": false, "in_collection": false},
  "master_id": 14515,
  "master_url": "https://api.discogs.com/masters/14515",
  "uri": "/REM-Murmur/master/14515",
  "catno": "SP 70604",
  "title": "R.E.M. - Murmur",
  "thumb": "https://discogs-images.imgix.net/R-414122-1459975774-1411.jpeg?auto=compress&blur=0&fit=max&fm=jpg&h=150&q=40&w=150&s=52b867c541b102b5c8bcf5accae025e0",
  "cover_image": "https://discogs-images.imgix.net/R-414122-1459975774-1411.jpeg?auto=compress&blur=0&fit=max&fm=jpg&h=600&q=90&w=600&s=0e227f30b3981fd2b0fb20fb4362df92",
  "resource_url": "https://api.discogs.com/masters/14515",
  "community": {"want": 17287, "have": 26133}
}
```

So I created a small function that looks up an artist/album combination using the Discogs search API, and applied it to the Pandas DataFrame of my listening history, grouped by artist and album.
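The function in my notebook is a bit more involved, but a minimal sketch of that kind of lookup might look like the following. The token variable and the DataFrame column names are hypothetical; the search endpoint, its parameters and the User-Agent requirement are from the Discogs API documentation.

```python
import requests

DISCOGS_TOKEN = "..."  # hypothetical: a personal access token from your Discogs developer settings

def album_year(artist, album):
    """Look up an artist/album pair with the Discogs search API and return its year, if found."""
    resp = requests.get(
        "https://api.discogs.com/database/search",
        params={
            "artist": artist,
            "release_title": album,
            "type": "master",
            "token": DISCOGS_TOKEN,
        },
        # Discogs asks clients to identify themselves with a User-Agent
        headers={"User-Agent": "top-albums-notebook/0.1"},
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    # take the year from the first match, if there is one
    return results[0].get("year") if results else None

# applied to a DataFrame with (hypothetical) artist and album columns:
# df["year"] = df.apply(lambda row: album_year(row["artist"], row["album"]), axis=1)
```

Discogs rate limits API requests, so in practice it also helps to throttle the calls and cache the results.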
When I ran this across the 1,312 distinct albums I listened to in 2020 I ran into a handful of albums (86) that didn't turn up at Discogs. I had listened to some of these albums quite often, and wanted to see if they were from 2020. I figured they were probably obscure things I picked up on Bandcamp. Knowing the provenance of data is important.

Bandcamp is another wonderful site for music lovers. It has an API too, but you have to write to them to request a key because it's mostly designed for publishers that need to integrate their music catalogs with Bandcamp. I figured this little experiment wouldn't qualify, so I wrote a quick little scraping function that does a search, finds a match, and extracts the release date from the album's page on the Bandcamp website. This left just four things, which I had listened to only a handful of times, and which have since disappeared from Bandcamp (I think).

What I thought would be an easy little exercise with the LastFM API actually turned out to require talking to the Discogs API and then scraping the Bandcamp website. So it goes with data analysis, I suppose. If you want to see the details they are in this Jupyter notebook. And so, without further ado, here are my top 25 albums of 2020.

25 Perfume Genius / Set My Heart On Fire Immediately
24 Roger Eno / Mixing Colours
23 Blochemy / nebe
22 Idra / Lone Voyagers, Lovers and Lands
21 Rutger Zuydervelt and Bill Seaman / Rutger Zuydervelt and Bill Seaman - Movements of Dust
20 Purl / Renovatio
19 mute forest / Riderstorm
18 Michael Grigoni & Stephen Vitiello / Slow Machines
17 Seabuckthorn / Other Other
16 Windy & Carl / Unreleased Home Recordings 1992-1995
15 Mathieu Karsenti / Bygones
14 Rafael Anton Irisarri / Peripeteia
13 Mikael Lind / Give Shape to Space
12 Taylor Swift / folklore (deluxe version)
11 koji itoyama / I Know
10 Andrew Weathers / Dreams and Visions from the Llano Estacado
9 Jim Guthrie / Below OST - Volume III
8 Norken & Nyquist / Synchronized Minds
7 Jim Guthrie / Below OST - Volume II
6 Halftribe / Archipelago
5 Hazel English / Wake Up!
4 R Beny / natural fiction
3 Warmth / Life
2 David Newlyn / Apparitions I and II
1 Seabuckthorn / Through A Vulnerable Occur

Diss Music

I recently defended my dissertation, and am planning to write a short post here with a synopsis of what I studied. But before that I wanted to do a bit of navel gazing and examine the music of my dissertation.

To be clear, my dissertation has no music. It's one part discourse analysis, two parts ethnographic field study, and is composed entirely of text and images bundled into a PDF. But over the last 5 years, as I took classes, wrote papers, conducted research and did the final write up of my results, I was almost always listening to music. I spent a lot of time on weekends in the tranquil workspaces of the Silver Spring Public Library. After the Coronavirus hit earlier this year I spent more time surrounded by piles of books in my impromptu office in the basement of my house. But wherever I found myself working, music was almost always on.

I leaned heavily on Bandcamp over this time period, listening to and then purchasing music I enjoyed. Bandcamp is a truly remarkable platform for learning about new music from people whose tastes align with yours. My listening habits definitely trended over this time towards instrumental, experimental, found sound and ambient, partly because lyrics can distract me if I'm writing or reading.

I'm also a long time LastFM user, so all the music I listened to over this period was logged (or "scrobbled"). LastFM has an API, so I thought it would be fun to create a little report of the top albums I listened to in each month of my dissertation. So this is the music of my dissertation, or the hidden soundtrack of my research, between August 2015 and November 2020. You can see how I obtained the information from the API in this Jupyter notebook.
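The gist of those monthly lookups is something like the sketch below. It is not the notebook's code: the username and API key are placeholders, and the response parsing is from memory, but user.getweeklyalbumchart is a real LastFM API method that accepts arbitrary from/to Unix timestamps.

```python
import datetime
import requests

LASTFM_KEY = "..."   # placeholder API key
LASTFM_USER = "..."  # placeholder username

def top_album(year, month):
    """Return the most played (artist, album) for a given month, using the LastFM API."""
    start = datetime.datetime(year, month, 1)
    end = datetime.datetime(year + (month == 12), month % 12 + 1, 1)
    resp = requests.get(
        "https://ws.audioscrobbler.com/2.0/",
        params={
            "method": "user.getweeklyalbumchart",
            "user": LASTFM_USER,
            "from": int(start.timestamp()),
            "to": int(end.timestamp()),
            "api_key": LASTFM_KEY,
            "format": "json",
        },
    )
    resp.raise_for_status()
    albums = resp.json()["weeklyalbumchart"]["album"]
    if not albums:
        return None
    # playcount comes back as a string, so convert before comparing
    best = max(albums, key=lambda a: int(a["playcount"]))
    return best["artist"]["#text"], best["name"]
```

Looping something like that over each month between August 2015 and November 2020 produces the list that follows.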
The results are below.

2015-08 White Rainbow / Thru.u
2015-09 Deradoorian / The Expanding Flower Planet
2015-10 James Elkington and Nathan Salsburg / Ambsace
2015-11 Moderat / II
2015-12 Deerhunter / Fading Frontier
2016-01 David Bowie / Blackstar
2016-02 Library Tapes / Escapism
2016-03 Twincities / …plays the brown mountain lights
2016-04 Moderat / III
2016-05 Radiohead / A Moon Shaped Pool
2016-06 Tigue / Peaks
2016-07 A Winged Victory for the Sullen / A Winged Victory for the Sullen
2016-08 Oneohtrix Point Never / Garden of Delete
2016-09 Oneohtrix Point Never / Drawn and Quartered
2016-10 Chihei Hatakeyama / Saunter
2016-11 Biosphere / Departed Glories
2016-12 Sarah Davachi / The Untuning of the Sky
2017-01 OFFTHESKY / The Beautiful Nowhere
2017-02 Clark / The Last Panthers
2017-03 Tim Hecker / Harmony In Ultraviolet
2017-04 Goldmund / Sometimes
2017-05 Deerhunter / Halcyon Digest
2017-06 Radiohead / OK Computer OKNOTOK 1997 2017
2017-07 Arcade Fire / Everything Now
2017-08 oh sees / Orc
2017-09 Lusine / Sensorimotor
2017-10 Four Tet / New Energy
2017-11 James Murray / Eyes to the Height
2017-12 Jlin / Black Origami
2018-01 Colleen / Captain of None (Bonus Track Version)
2018-02 Gersey / What You Kill
2018-03 Rhucle / Yellow Beach
2018-04 Christina Vantzou / No. 3
2018-05 Hotel Neon / Context
2018-06 Brendon Anderegg / June
2018-07 A Winged Victory for the Sullen / Atomos
2018-08 Ezekiel Honig / A Passage of Concrete
2018-09 Paperbark / Last Night
2018-10 Flying Lotus / You're Dead! (Deluxe Edition)
2018-11 Porya Hatami / Kaziwa
2018-12 Sven Laux / You'll Be Fine.
2019-01 Max Richter / Mary Queen Of Scots (Original Motion Picture Soundtrack)
2019-02 Ian Nyquist / Cuan
2019-03 Jens Pauly / Vihne
2019-04 Ciro Berenguer / El Mar De Junio
2019-05 Rival Consoles / Persona
2019-06 Caught In The Wake Forever / Waypoints
2019-07 Spheruleus / Light Through Open Blinds
2019-08 Valotihkuu / By The River
2019-09 Moss Covered Technology / Slow Walking
2019-10 Tsone / pagan oceans I, II, III
2019-11 Big Thief / Two Hands
2019-12 A Winged Victory for the Sullen / The Undivided Five
2020-01 Hirotaka Shirotsubaki / fragment 2011-2017
2020-02 Luis Miehlich / Timecuts
2020-03 Federico Durand / Jardín de invierno
2020-04 R.E.M. / Document - 25th Anniversary Edition
2020-05 Chicano Batman / Invisible People
2020-06 Hazel English / Wake Up!
2020-07 Josh Alexander / Hiraeth
2020-08 The Beatles / The Beatles (Remastered)
2020-09 Radiohead / OK Computer OKNOTOK 1997 2017
2020-10 Mathieu Karsenti / Bygones
2020-11 R.E.M. / Murmur - Deluxe Edition

25 Years of robots.txt

After just over 25 years of use, the Robots Exclusion Standard, otherwise known as robots.txt, is being standardized at the IETF. This isn't really news, as the group at Google that is working on it announced the work over a year ago. The effort continues apace, with the latest draft having been submitted back in the middle of pandemic summer. But it is notable, I think, because of the length of time this particular standard spent in gestation. It made me briefly think about what it would be like if standards always worked this way: by documenting established practices, desire lines if you will, rather than being quiet ways to shape markets (Russell, 2014). But then again maybe that hands-off approach is fraught in other ways.
Standardization processes offer the opportunity for consensus, and a framework for gathering input from multiple parties. It seems like a good time to write down some tricks of the robots.txt trade (e.g. the stop-reading-after-500kb rule, which I didn't know about). What would Google look like today if it weren't for some of the early conventions that developed around web crawling? Would early search engines have existed at all if a convention for telling them what to crawl and what not to crawl hadn't come into existence? Even though it has been in use for 25 years, it will be important to watch the diffs against the existing de facto standard, to see what new functionality gets added and what (if anything) is removed.

I also wonder if this might be an opportunity for the digital preservation community to grapple with documenting some of its own practices around robots.txt. Much web archive crawling software has options for observing robots.txt, or for explicitly ignoring it. There are clearly legitimate reasons for a crawler to ignore robots.txt, as in cases where CSS files or images are accidentally blocked by a robots.txt, which prevents the rendering of an otherwise unblocked page. I think ethical arguments can also be made for ignoring an exclusion. But ethics are best decided by people, not machines, even though some think the behavior of crawling bots can be measured and evaluated (Giles, Sun, & Councill, 2010; Thelwall & Stuart, 2006).

Web archives use robots.txt in another significant way too. Ever since the Oakland Archive Policy the web archiving community has used robots.txt in the playback of archived data. Software like the Wayback Machine has basically become the reading room of the archived web. The Oakland Archive Policy made it possible for website owners to tell web archives about content on their site that they would like the web archive not to "play back", even if the archive had the content. Here is what they said back then:

Archivists should provide a 'self-service' approach site owners can use to remove their materials based on the use of the robots.txt standard. Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site. This allows archivists to ensure that material will no longer be gathered or made available. These requests will not be made public; however, archivists should retain copies of all removal requests.

This convention allows web publishers to use their robots.txt to tell the Internet Archive (and potentially other web archives) not to provide access to archived content from their website. This isn't really news either. The Internet Archive's Mark Graham wrote in 2017 about how robots.txt hasn't really been working out for them lately, and how they now ignore it when playing back the .gov and .mil domains. There was a popular article about this use of robots.txt by David Bixenspan at Gizmodo, When the Internet Archive Forgets, and a follow-up from David Rosenthal, Selective Amnesia.

Perhaps the collective wisdom now is that the use of robots.txt to control playback in web archives is fundamentally flawed and shouldn't be written down in a standard. But lacking a better way to request that something be removed from the Internet Archive, I'm not sure that is feasible. Some, like Rosenthal, suggest that it's too easy for these takedown notices to be issued. Consent on the web is difficult once you are operating at the scale that the Internet Archive does in its crawls.
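For what it's worth, the convention at issue fits in a few lines of plain text. A hypothetical robots.txt that leaves other crawlers alone, but asks the Internet Archive's crawler (which has historically identified itself as ia_archiver) to stay away, and under the Oakland policy not to play back the site, might look like this:

```
# hypothetical example
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
```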
But if there were a time to write it down in a standard, I guess that time would be now.

References

Giles, C. L., Sun, Y., & Councill, I. G. (2010). Measuring the web crawler ethics. In WWW 2010. Retrieved from https://clgiles.ist.psu.edu/pubs/WWW2010-web-crawler-ethics.pdf

Russell, A. L. (2014). Open standards and the digital age. Cambridge University Press.

Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy, and denial of service. Journal of the American Society for Information Science and Technology. https://doi.org/10.1002/asi.20388

Curation Communities

As I indicated in the last post, I've been teaching digital curation this semester at UMD. I ended up structuring the class around the idea of abstraction, where we started at a fairly low level looking at file systems and slowly zoomed out to file formats and standards, types of metadata, platforms, and finally community. It was a zooming out process, like changing the magnification on a microscope, or maybe more like the zooming out that happens as you pop between levels in the pages of Istvan Banyai's beautiful little children's book Zoom (pun intended). I'm curious to hear how well this worked from my students' perspective, but it definitely helped me organize my own thoughts about a topic that can branch off in many directions.

This is especially the case because I wanted the class to include discussion of digital curation concepts while also providing an opportunity to get some hands-on experience using digital curation techniques and tools in the context of Jupyter notebooks. In addition to zooming out, it was a dialectical approach, flipping between reading and writing prose and reading and writing code, with the goal of reaching a kind of synthesis: an understanding that digital curation practice is about both concepts and computation. Hopefully it didn't just make everyone super dizzy :)

This final module concerned community. In our reading and discussion we looked at the FAIR Principles, talked about what types of practices they encourage, and evaluated some data sources in terms of findability, accessibility, interoperability and reusability. For the notebook exercise I decided to have students experiment with the Lumen Database (formerly Chilling Effects), which is a clearinghouse for cease-and-desist notices received by web platforms like Google, Twitter and Wikipedia. The database was created by Wendy Seltzer and a team of legal researchers who wanted to be able to study how copyright law and other legal instruments shaped what was, and was not, on the web.

Examining Lumen helped us explore digital curation communities for two reasons. The first is that it provides an unprecedented look at how web platforms curate their content in partnership with their users. There really is nothing else like it, unless you consider individual efforts like GitHub's DMCA repository, which is an interesting approach too. The second reason is that Lumen itself is an example of community digital curation practice and of principles like FAIR. FAIR began in the scientific community, and certainly has that air about it. But Lumen embodies principles around findability and accessibility: this is information that would be difficult if not impossible to access otherwise. Lumen also shows how some data cannot be readily available: there is redacted content, and some notices lack information like infringing URLs.
Working with Lumen helps students see that not all data can be open, and that the FAIR principles are a starting place for ethical conversations and designs, not a rulebook to be followed. The Lumen API requires that you get a key for doing any meaningful work (the folks at Berkman Klein were kind enough to supply me with a temporary one for the semester). At any rate, if you are interested in taking a look, the notebook (without the Lumen key) is available on GitHub. I've noticed that sometimes the GitHub JavaScript viewer for notebooks can time out, so if you want you can also take a look at it over in Colab, which is the environment we've been using over the semester.

The notebook explores the basics of interacting with the API using the Python requests library, while explaining the core data model behind the API, which relates the principal, the sender, the recipient and the submitter of a claim. It provides just a taste of the highly expressive search options that allow searching, ordering and filtering of results along many dimensions. It also provides an opportunity to show students the value of building functional abstractions to reduce copy and paste, and to develop reusable and testable curation functions.

The goal had been to do a module about infrastructure after talking about community, but unfortunately we ran out of time due to the pace of classes during the pandemic. I felt that a lot was being asked of students in the all-online environment, and I've really tried over the semester to keep things simple. This last module on community was actually completely optional, but I was surprised that half the class continued to do the work even though it was not officially part of their final grade. The final goal of using Lumen this week was to introduce them to a resource that they could write about (essay) or use in a notebook or application that will be their final project. I've spent the semester stressing the need to be able to write both prose and code about digital curation practices, and the final project is an opportunity for them to choose to inflect one of those modes more than the other.

Mystery File!

We started the semester in my Digital Curation class by engaging in a little exercise I called Mystery File. The exercise was ungraded and was designed simply to get the students thinking about some of the issues we would be exploring over the semester, such as files, file formats, metadata, description, communities of practice and infrastructure. The exercise also gave me an opportunity to introduce them to some of the tools and skills we would be using, such as developing and documenting our work in Jupyter notebooks. The students had a lot of fun with it, and it was really helpful for me to see the variety of knowledge and skills they brought to the problem.

The mystery file turned out to be a bundle of genetic data and metadata from the public National Center for Biotechnology Information, a few minutes' drive from UMD at the National Institutes of Health in Bethesda. If the students noticed that this file was a tar file, they could expand it and explore the directories and subdirectories. They could see that some files were compressed, and examine some of them to find that they contained metadata and a genetic sequence. Once they had submitted their answers I shared a video with them (the class is asynchronous except for in-person office hours) where I answered these questions myself in a Jupyter notebook running in Google Colab.
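A stripped-down version of that kind of poking around might look like this. The file and manifest names here are hypothetical, and the real bundle has its own layout, but the point is how little code it takes in a notebook to list a tar file's contents and check hashes against a manifest.

```python
import hashlib
import tarfile

# hypothetical name for the mystery file
with tarfile.open("mystery.tar") as tar:
    tar.list()                 # what is inside?
    tar.extractall("mystery")  # unpack into a working directory

def md5(path):
    """Compute an MD5 checksum, reading the file in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# fixity check against a (hypothetical) manifest of "checksum filename" lines
for line in open("mystery/md5checksums.txt"):
    expected, name = line.split()
    status = "ok" if md5(f"mystery/{name}") == expected else "MISMATCH"
    print(status, name)
```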
I shared the completed notebook with them to try on their own. It was a good opportunity to reacquaint students with notebooks, since they had been introduced to them in an Introduction to Programming class that is a prerequisite. But I wanted to show how notebooks are useful for documenting their work, and especially useful in digital curation activities, which are often ad hoc but include some repeatable steps. The bundle of data includes a manifest with hashes for fixity checking, to ensure a bit hasn't flipped, which anticipated our discussion of technical metadata later in the semester. I thought it was a good example of how a particular community is making data available, and how the NCBI and its services form a piece of critical infrastructure for the medical community. I also wanted to highlight how the data came from a Chinese team, despite the efforts of the Chinese government to suppress the information. This was science, the scientific community, and information infrastructures working despite (or in spite of) various types of social and political breakdowns.

But I actually didn't start this post wanting to write about all that. I wanted to comment on a recent story I read about the origins of this data. It gave me so much hope, and reason to celebrate data curation practices, to read Zeynep Tufekci's The Pandemic Heroes Who Gave Us the Gift of Time and Gift of Information this afternoon. She describes how brave Yong-Zhen Zhang and his team in China were in doing their science, and in releasing the information to the world in a timely way. If you look closely you can see Zhang's name highlighted in the pictured metadata record above. It is simply astonishing to read how Zhang set in motion the scientific machinery that created a vaccine all the way back in January, just days after the virus was discovered and sequenced. Sending my students this piece from Zeynep here at the end of the semester gives me such pleasure, and is the perfect way to round out the semester as we talk about communities and infrastructure.

(P.S. I'm planning on bundling up the discussion and notebook exercises once the semester is finished, in case they are useful for others to adapt.)

Kettle

kettle boiling kitchen table two windows daylight reaching leaves kettle boiling again

Static-Dynamic

At work we've been moving lots of previously dynamic websites over to being static sites. Many of these sites have been WordPress, Omeka or custom sites for projects that are no longer active, but which retain value as a record of the research. This effort has been the work of many hands, and has largely been driven (at least from my perspective) by an urge to make the websites less dependent on resources like databases and custom code to render content. But we are certainly not alone in seeing the value of static site technology, especially in humanities scholarship.

The underlying idea here is that a static site makes the content more resilient, or less prone to failure over time, because there are fewer moving pieces involved in keeping it both on and of the web. The pieces that are moving are tried-and-true software like web browsers and web servers, instead of pieces of software that have had fewer eyes on them (my code). The web has changed a lot over its lifetime, but the standards and software of the web have stayed remarkably backwards compatible (Rosenthal, 2010). There is long term value in mainstreaming your web publishing practices and investing in these foundational technologies and standards.
Since they have fewer moving pieces, static site architectures are also (in theory) more efficient. There is a hope that static sites lower the computational resources needed to keep your content on the web, which is good for the pocketbook and, more importantly, good for the environment. While there have been interesting experiments like this static site driven by solar energy, it would be good to get some measurements on how significant these changes are. As I will discuss in this post, I think static site technology can often push dynamic processing out of the picture frame, where it is less open to measurement.

Migrating completed projects to static sites has been fairly uncontroversial when the project is no longer being added to or altered. Site owners have been amenable to the transformation, especially when we tell them that this will help ensure that their content stays online in a meaningful and useful way. It's often important to explain that "static" doesn't mean their website will be rendered as an image or a PDF, or that the links won't be navigable. Sometimes we've needed to negotiate the disabling of a search box. But it has usually sufficed to show how much the search actually gets used (server logs are handy for this), and to point out that most people search for things elsewhere (e.g. Google) and arrive at their website from there. So it has been a price worth paying. But I think this may say more about the nature of the projects we were curating than it does about web content in general. YMMV. Cue the "There are no silver bullets" soundtrack. Hmm, what would that soundtrack sound like? Continuing on…

For ongoing projects where the content is changing, we've also been trying to use static site technology. Understandably there is a bit of an impedance mismatch between static site technology and dynamic curation of content. If the curators are comfortable with things like editing Markdown and doing Git commits/pushes, things can go well. But if people aren't invested in learning those things, there is a danger that making a site static can, well, make it too static, which can have significant consequences for an active project that uses the web as a canvas.

One way of adjusting for this is to make the static site more dynamic. It's ok if you lol here. This tacking back can be achieved in a multitude of ways, but they mostly boil down to:

1. Curating content in another dynamic platform and having an export of that data integrated statically into the website at build time (not at runtime).
2. Curating content in another dynamic platform and including it at runtime in your web application using the platform's API. For example, think of how Google Maps is embedded in a webpage.

The advantage to #1 is that the curation platform can go away and you still have your site, and your data, and are able to build it. But the disadvantage is that curatorial changes to the data do not get reflected in the website until the next build. The advantage to #2 is that you don't need to manage the data assets, and changes made in the platform are immediately available in the web application. But the downside is that your website is totally dependent on this external platform. The platform could choose to change their API, put the service behind a paywall, shut down the service, or go out of business entirely. Actually, it is a certainty that one of these will eventually happen. So, depending on the specifics, this kind of static site is arguably more vulnerable than the previous dynamic web application.
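As a concrete illustration of approach #1, a build-time fetch from a platform like Airtable might look roughly like the following. This is a minimal sketch rather than our actual tooling: the base ID, table name, output path and environment variable are placeholders, though the REST endpoint, Authorization header and offset-based paging follow Airtable's documented API.

```python
import json
import os
import requests

AIRTABLE_KEY = os.environ["AIRTABLE_API_KEY"]  # placeholder environment variable
BASE_ID = "appXXXXXXXXXXXXXX"                  # placeholder Airtable base id

def fetch_table(name):
    """Page through an Airtable table and return all of its records."""
    records, params = [], {}
    while True:
        resp = requests.get(
            f"https://api.airtable.com/v0/{BASE_ID}/{name}",
            headers={"Authorization": f"Bearer {AIRTABLE_KEY}"},
            params=params,
        )
        resp.raise_for_status()
        data = resp.json()
        records.extend(data["records"])
        if "offset" not in data:
            return records
        params["offset"] = data["offset"]

# persist the data so the static site build can use it without talking to Airtable
with open("data/items.json", "w") as fh:
    json.dump(fetch_table("Items"), fh, indent=2)
```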
We've mostly been focused on scenario #1 in our recent work creating static sites for dynamic projects. For example, we've been getting some good use out of Airtable as a content curation platform in conjunction with Gatsby and its gatsby-source-airtable plugin, which makes the Airtable data available via a GraphQL query that gets run at build time. Once built, the pages have the data cooked into them, so Airtable can go away and our site will be fine. But the build is still quite dependent on Airtable.

In the (soon to be released) Unlocking the Airwaves project we are also using Airtable, but we didn't use the GraphQL plugin, mostly because I didn't know about it at the time. On that project there is a fetch-data utility that will talk to our Airtable base and persist the data into the Git repository as JSON files. These JSON files are then queried as part of the build process. This means the build is insulated from changes in Airtable, and that you can choose when to pull the latest Airtable data down. But the big downside to this approach is that when curators make changes in Airtable they want to see them reflected in the website quickly. They understandably don't want to have to ask someone else for a build, or figure out how to build and push the site themselves. They just want their work to be available to others.

One stopgap between 1 and 2 that we've developed recently is to create a special table in Airtable called Releases. A curator can add a row with their name, a description, and a tagged version of the site to use, which will cause a build of the website to be deployed. We have a program that runs from cron every few minutes, looks at this table, and then decides whether a build of the site is needed. It is still early days, so I'm not sure how well this approach will work in practice. It took a little bit of head scratching to come up with an initial solution where the release program uses this Airtable row as a lock, to prevent multiple deploy processes from starting at the same time. Experience will show if we got this right. Let me know if you are interested in the details.

But stepping back a moment, I think it's interesting how static site technology simultaneously creates a site of static content while also pushing dynamic processing outwards into other places, like this little Airtable Releases table, or one of the many so-called Jamstack service providers. This is where the legibility of the processing gets lost, and it becomes even more difficult to ascertain the environmental impact of the architectural shift. What is the environmental impact of our use of NetlifyCMS or Airtable? It's almost impossible to say, whereas before, when we were running all the pieces ourselves, it was conceivable that we could audit them. These platforms are also sites of extraction, where it's difficult to even know what economies our data are participating in. This is all to say that the motivations for going static are complicated, and more work could be done to think strategically through some of these design decisions in a repeatable and generative way.

References

Rosenthal, D. (2010). The half-life of digital formats. Retrieved from https://blog.dshr.org/2010/11/half-life-of-digital-formats.html

Dark Reading

I just made a donation to the Dark Reader project over on Open Collective. If you haven't seen it before, Dark Reader is a browser plugin (Firefox, Chrome, Edge, Safari) that lets you enable dark mode on websites that don't support it.
It has lots of useful configuration settings, and allows you to easily turn it on and off for particular websites. For most of my life I've actually preferred light backgrounds for my text editor. But for the past year or so, as my eyes have gotten worse, I've enabled dark mode in Vim and gradually in any application or website that will let me. I'm not an eye doctor, but it seems that the additional light from the screen reflects off of whatever material constitutes the cataracts in the lenses of my eyes, which causes everything to fuzz out and text to become basically illegible. Being able to turn on dark mode has meant I've been able to continue reading online, although it can still be difficult. Dark Reader lets me turn on dark mode for websites that don't support it themselves, which has been a real lifesaver. So it was nice to be able to say thank you.

Just as an aside, I've been using Open Collective for a few years now, to donate regularly to the social.coop project, which is how I'm doing social media these days. Realizing Dark Reader was on Open Collective too made me think that I should really look for more open source projects to support there. It also made me think that perhaps Open Collective could be a useful platform for the Documenting the Now project to look at for supporting some of the tools it has developed, as the project draws down on its grant funding and moves into sustaining some of the things it has started. Perhaps it would be useful for other projects like Webrecorder too?

PS. It's enabled here on my blog for people who have their browser/OS set to dark mode.

Seeing Software

Today I presented a paper at the International Conference on the History of Records and Archives (ICHORA) that shared some of the results of my field study at the National Software Reference Library. The basic gist of the talk is that appraisal in web archives (at least in the case of the NSRL) is intrinsically tied to use, but not use in the typical sense, where use is conceived only in terms of research. I use Sara Ahmed's book What's the Use, and specifically her technique of examining the "use of use" and the "strange temporalities of use", as an analytical frame for examining the valuation of archival records. The varieties of use, misuse and disuse provide a window into how this web archive participates in a network of actors that discipline digital preservation, law enforcement, national security and defense.

Since the conference is remote I decided to record my talk, in order to make sure I timed things properly, since I have a tendency to wander. Unfortunately recording the talk on Linux and playing it back through Zoom turned out to be problematic. The slides didn't transition neatly from one to the other, but left graphic artifacts behind. Luckily the audio worked ok, and the display in PeerTube is just fine.

This talk actually forms the core of my dissertation defense next week, assuming the world still exists then. I was a bit surprised that nobody balked at my claim that appraisal is tied to use, and that records in web archives have no intrinsic value (e.g. as evidence) unless you consider the use of use. I choose to interpret this as the happy result of me making a really good case ;-)

Curating Corpora

Here's a really nice talk by Everest Pipkin about the need to curate datasets for generative text algorithms, especially when they are being used in creative work.
Everest considers creative work broadly, as any work where care for the experience is important. Curating your own corpora gives you hyper-specific control over the tone, vibe, content, ethics, language and poetics of that space. Since I'm teaching a data curation class in an information studies department this semester, and I also work in a digital humanities center, this approach to understanding the algorithmic impacts of data curation is super interesting to me. I particularly liked how Everest extended this idea of curation to the rules that sit on top of the models, which select and deselect particular interactions that are useful in a particular context.

It feels like this creative practice that Everest provides a view into could actually be more prevalent than the popular press might have it. There is so much focus on obtaining larger and larger datasets, and on neural networks with more and more synapses to approach the complexity found in the brain. Rather than the choice of an algorithm, the tuning of its hyperparameters, and the vast infrastructure for training being the secret sauce, perhaps the data that is selected, and how the model's interactions are interpreted, are equally if not more important, especially in particular contexts. I guess this also highlights why data assets are so closely guarded, and why projects like Wikipedia and Common Crawl are so important for tools like GPT-3 and spaCy.

It would be pretty cool to be able to select a model based on some subset, or subsets, of a dataset like Common Crawl. For example, if you wanted to generate text based on a fan fiction site like AO3, or even an author or set of authors within AO3. Maybe something like that already exists?
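Just to make that concrete, Common Crawl's index server already makes the domain-level slicing part of this imaginable. A hypothetical sketch (the crawl identifier here is a placeholder, and the query parameters follow the CDX-style index API):

```python
import json
import requests

# ask a Common Crawl index for captures from a single domain
# (the crawl id is a placeholder; current ones are listed at index.commoncrawl.org)
resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2020-50-index",
    params={"url": "archiveofourown.org/*", "output": "json"},
    stream=True,
)
resp.raise_for_status()

for line in resp.iter_lines():
    capture = json.loads(line)
    # each capture points at a WARC file, offset and length in the public dataset,
    # which is what you would fetch to start building a corpus
    print(capture["url"], capture["filename"], capture["offset"], capture["length"])
```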