twarc2

This post was originally published on Medium but I spent time writing it so I wanted to have it here too.

TL;DR: twarc has been redesigned from the ground up to work with the new Twitter v2 API and their Academic Research track. Many thanks for the code and design contributions of Betsy Alpert, Igor Brigadir, Sam Hames, Jeff Sauer, and Daniel Verdeer that have made twarc2 possible, as well as early feedback from Dan Kerchner, Shane Lin, Miles McCain, 李荣蓬, David Thiel, Melanie Walsh and Laura Wrubel. Extra special thanks to the Institute for Future Environments at Queensland University of Technology for supporting Betsy and Sam in their work, and for the continued support of the Mellon Foundation.

Back in August of last year Twitter announced early access to their new v2 API, and their plans to sunset the v1.1 API that has been active for almost the last 10 years. Over the lifetime of the v1.1 API Twitter has become deeply embedded in the media landscape. As magazines, newspapers and television have moved onto the web they have increasingly adopted tweets as a mechanism for citing politicians, celebrities and organizations, while also using them to document current events, generate leads and gather feedback for evolving stories. As a result Twitter has also become a popular object of study for humanities and social science researchers looking to understand the world as reflected, refracted and distorted by/in social media.

On the surface the v2 API update seems pretty insignificant, since the shape of a tweet, its parts, properties and affordances, aren’t changing at all. Tweets with 280 characters of text, images and video will continue to be posted, retweeted and quoted. However behind the scenes the representation of a tweet as data, and the quotas that control the rates at which this data can flow between apps and other third party services, will be greatly transformed.

Needless to say, v2 represents a big change for the Documenting the Now project. Along with community members we’ve developed and maintained open source tools like twarc that talk directly to the Twitter API to help users search for and collect live tweets that match criteria like hashtags, names and geographic locations. Today we’re excited to announce the release of twarc v2, which has been designed from the ground up to work with the v2 API and Twitter’s new Academic Research track.

Clearly it’s extremely problematic having a multi-national corporation act as a gatekeeper for who counts as an academic researcher, and what constitutes academic research. We need look no further than the recent experiences of Timnit Gebru and Margaret Mitchell at Google for an example of what happens when research questions run up against the business objectives of capital. We only know their stories because Gebru and Mitchell bravely took a principled approach, where many researchers would have knowingly or unknowingly shaped their research to better fit the needs of the company. So it is important for us that twarc still be usable by people with and without access to the Academic Research Track. But we have heard from many users that the Academic Research Track presents new opportunities for Twitter data collection that are essential for researchers interested in the observability of social media platforms. Twitter is making a good faith effort to work with the academic research community, and we thought twarc should support it, even if big challenges lie ahead.
So why are people interested in the Academic Research Track? Once your application has been approved you are able to collect data from the full history of tweets, at no cost. This is a massive improvement over v1.1 access, which was limited to a one week window unless researchers paid for access. Access to the full archive means it’s now possible to study events that have happened in the past, back to the beginning of Twitter in 2006. If you do create any historical datasets we’d love for you to share the tweet identifier datasets in The Catalog.

However this opening up of access comes with a simultaneous contraction in how much data can be collected at one time. The remainder of this post describes some of the details and the design decisions we have made with twarc2 to address them. If you would prefer to watch a quick introduction to using twarc v2 please check out this short video.

Installation

If you are familiar with installing twarc nothing has changed. You still install (or upgrade) with pip as you did before:

```
$ pip install --upgrade twarc
```

In fact you will still have full access to the v1.1 API just as you did before. So the old commands will continue to work as they did:[^1]

```
$ twarc search blacklivesmatter > tweets.jsonl
```

twarc2 was designed to let you continue to use Twitter’s v1.1 API undisturbed until it is finally turned off by Twitter, at which point the functionality will be removed from twarc. All the support for the v2 API is mediated by a new command line utility, twarc2. For example, to search for blacklivesmatter tweets and write them to a file tweets.jsonl:

```
$ twarc2 search blacklivesmatter > tweets.jsonl
```

All the usual twarc functionality, such as searching for tweets, collecting live tweets from the streaming API endpoint, and requesting user timelines and user metadata, is still there; `twarc2 --help` gives you the details. But while the interface looks the same there’s quite a bit different going on behind the scenes.

Representation

Truth be told, there is no shortage of open source libraries and tools for interacting with the Twitter API. In the past twarc has made a bit of a name for itself by catering to a niche group of users who want a reliable, programmable way to collect the canonical JSON representation of a tweet. JavaScript Object Notation (JSON) is the language of Web APIs, and Twitter has kept its JSON representation of a tweet relatively stable over the years. Rather than making lots of decisions about the many ways you might want to collect, model and analyze tweets, twarc has tried to do one thing and do it well (data collection) and get out of the way so that you can use (or create) the tools for putting this data to use.

But the JSON representation of a tweet in the Twitter v2 API is completely burst apart. The v2 base representation of a tweet is extremely lean and minimal, and just includes the text of the tweet, its identifier and a handful of other things. All the details about the user who created the tweet, embedded media, and more are not included. Fortunately this information is still available, but users need to craft their API request using a set of expansions that tell the Twitter API what additional entities to include. In addition, for each expansion there is a set of field options that control which parts of these expansions are returned.
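To make this concrete, here is a minimal sketch of what a v2 recent search request with expansions and field options can look like when talking to the API directly with the requests library. This is just an illustration of the API parameters, not how twarc2 is implemented, and it assumes a bearer token sitting in a BEARER_TOKEN environment variable:

```python
import os
import requests

# ask the v2 recent search endpoint for tweets, expanding the author and any
# attached media, and requesting extra fields for each expanded object
resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"},
    params={
        "query": "blacklivesmatter",
        "tweet.fields": "created_at,public_metrics,entities",
        "expansions": "author_id,attachments.media_keys",
        "user.fields": "username,description,created_at",
        "media.fields": "type,url",
        "max_results": 100,
    },
)
response = resp.json()

# the tweets are in "data"; the expanded users and media live in "includes"
print(len(response.get("data", [])), "tweets")
print(len(response.get("includes", {}).get("users", [])), "expanded users")
```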
So rather than there being a single JSON representation of a tweet, API users now have the ability to shape the data based on what they need, much like how GraphQL APIs work. This kind of makes you wonder why Twitter didn’t make their GraphQL API available. For specific use cases this customizability is very useful, but the mutability of the representation of a tweet presents challenges when collecting data for future use. If you didn’t request the right expansions or fields when collecting the data then you won’t be able to analyze that data later when doing your research.

To address this, twarc2 has been designed to collect the richest possible representation of a tweet, by requesting all possible expansions and field combinations for tweets. See the expansions module for the details if you are interested. This takes a significant burden off of users to digest the API documentation and craft the correct API requests themselves. In addition the twarc community will be monitoring the Twitter API documentation going forward to incorporate new expansions and fields as they are inevitably added in the future.

Flattening

This is diving into the weeds a little bit, but it’s worth noting here that Twitter’s introduction of expansions allows data that was once duplicated across multiple tweets (such as user information, media, retweets, etc) to be included once per response from the API. This means that instead of seeing information about the user who created a tweet in the context of their tweet, the user will be referenced using an identifier, and this identifier will map to user metadata in the outer envelope of the response.

It makes sense why Twitter have introduced expansions, since it means that in a set of 100 tweets from a given user the user information will just be included once rather than repeated 100 times, which means less data, less network traffic and less money. It’s even more significant when you consider the large number of possible expansions. However this pass-by-reference rather than pass-by-value approach presents some challenges for stream based processing, which expects each tweet to be self-contained. For this reason we’ve introduced the idea of flattening the response data when persisting the JSON to disk. This means that tools and data pipelines that expect to operate on a stream of tweets can continue to do so.

Since the representation of a tweet is so dependent on how data is requested we’ve taken the opportunity to introduce a small stanza of twarc specific metadata using the `__twarc` prefix. This metadata records what API endpoint the data was requested from, and when. This information is critically important when interpreting the data, because some information about a tweet, like its retweet and quote counts, is constantly changing.
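To illustrate the idea, here is a rough sketch of what flattening a v2 response involves, folding the expanded user objects from the includes section back into each tweet so that every tweet is self-contained. This is a simplification for illustration, not twarc2’s actual implementation, which also handles media, polls, places and referenced tweets:

```python
def flatten(response):
    """Fold expanded user objects from a v2 response back into each tweet."""
    # the includes section maps ids to the expanded objects
    users = {
        user["id"]: user
        for user in response.get("includes", {}).get("users", [])
    }

    tweets = []
    for tweet in response.get("data", []):
        tweet = dict(tweet)
        # replace the author_id reference with the full user object
        tweet["author"] = users.get(tweet.get("author_id"))
        tweets.append(tweet)

    return tweets
```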
Data Flows

As mentioned above you can still collect tweets from the search and streaming API endpoints in a way that seems quite similar to the v1.1 API. The big changes however are the quotas associated with these endpoints, which govern how much can be collected. These quotas control how many requests can be sent to Twitter in 15 minute intervals. In fact these quotas are not much changed, but what’s new are app wide quotas that constrain how many tweets a given application (app) can collect every month. An app in this context is a piece of software (e.g. your twarc software) identified by unique API keys set up in the Twitter Developer Portal. The standard API access sets a 500,000 tweet per month limit. This is a huge change considering there were no monthly app limits before. If you get approved for the Academic Research track your app quota is increased to 10 million per month. This is markedly better, but the achievable data volume is still nothing like the v1.1 API, as these graphs attempt to illustrate.

twarc2 will still observe the same rate limits, but once you’ve collected your portion for the month there’s not much that can be done, for that app at least.

Apart from the quotas, Twitter’s streaming endpoint in v2 is substantially changed, which impacts how users interact with twarc. Previously twarc users were able to create up to two connections to the filter stream API. This could be done by simply:

```
twarc filter obama > obama.jsonl
```

However in the Twitter v2 API only apps can connect to the filter stream, and they can only connect once. At first this seems like a major limitation, but rather than creating a connection per query the v2 API allows you to build a set of rules for tweets to match, which in turn controls what tweets are included in the stream. This means you can collect for multiple types of queries at the same time, and the tweets will come back with a piece of metadata indicating which rule caused their inclusion. This translates into a markedly different set of interactions at the command line for collecting from the stream, where you first need to set your stream rules and then open a connection to fetch them:

```
twarc2 stream-rules add blacklivesmatter
twarc2 stream > tweets.jsonl
```

One useful side effect of this is that you can update the stream (add and remove rules) while the stream is in motion:

```
twarc2 stream-rules add blm
```

While you are limited by the API quota in terms of how many tweets you can collect, tweets are not "dropped on the floor" when the volume gets too high. Once upon a time the v1.1 filter stream was rumored to be rate limited when your stream exceeded 1% of the total volume of new tweets.

Plugins

In addition to twarc helping you collect tweets, the GitHub repository has also been a place to collect a set of utilities for working with the data. For example there are scripts for extracting and unshortening urls, identifying suspended/deleted content, extracting videos, building wordclouds, putting tweets on maps, displaying network graph visualizations, counting hashtags, and more. These utilities all work like Unix filters where the input is a stream of tweets and the output varies depending on what the utility is doing, e.g. a Gephi file for a network visualization, or a folder of mp4 files for video extraction.

While this has worked well in general, the kitchen sink approach has been difficult to manage from a configuration management perspective. Users have to download these scripts manually from GitHub or by cloning the repository. For some users this is fine, but it’s a bit of a barrier to entry for users who have just installed twarc with pip. Furthermore these plugins often have their own dependencies which twarc itself does not. This lets twarc stay pretty lean, and things like youtube_dl, NetworkX or Pandas can be installed by people that want to use utilities that need them. But since there is no way to install the utilities there isn’t a way to ensure that the dependencies are installed, which can lead to users needing to diagnose missing libraries themselves. Finally, the plugins have typically lacked their own tests.
twarc’s test suite has really helped us track changes to the Twitter API and make sure that twarc continues to operate properly as new functionality has been added. But nothing like this has existed for the utilities. We’ve noticed that over time some of them need updating. Their command line arguments have also drifted over time, which can lead to some inconsistencies in how they are used.

So with twarc2 we’ve introduced the idea of plugins, which extend the functionality of the twarc2 command, are distributed on PyPI separately from twarc, and exist in their own GitHub repositories where they can be developed and tested independently of twarc itself. This is all achieved through twarc2’s use of the click library and specifically click-plugins.

So now if you would like to convert your collected tweets to CSV you can install twarc-csv:

```
$ pip install twarc-csv
$ twarc2 search covid19 > covid19.jsonl
$ twarc2 csv covid19.jsonl > covid19.csv
```

Or if you want to extract embedded and referenced videos from tweets you can install twarc-videos, which will write all the videos to a directory:

```
$ pip install twarc-videos
$ twarc2 videos covid19.jsonl --download-dir covid19-videos
```

You can write these plugins yourself and release them as needed. Check out the plugin reference implementation tweet-ids for a simple example to adapt. We’re still in the process of porting some of the most useful utilities over and would love to see ideas for new plugins. Check out the current list of twarc2 plugins and use the twarc issue tracker on GitHub to join the discussion.

You may notice from the list of plugins that twarc now (finally) has documentation on ReadTheDocs, separate from the documentation that was previously only available on GitHub. We got by with GitHub’s rendering of Markdown documents for a while, but GitHub’s boilerplate designed for developers can prove to be quite confusing for users who aren’t used to selectively ignoring it. ReadTheDocs allows us to manage the command line and API documentation for twarc, and to showcase the work that has gone into the Spanish, Japanese, Portuguese, Swedish, Swahili and Chinese translations.

Feedback

Thanks for reading this far! We hope you will give twarc2 a try. Let us know what you think either in comments here, in the DocNow Slack or over on GitHub.

✨ ✨ Happy twarcing! ✨ ✨

[^1]: Windows users will want to indicate the output file using a second argument rather than redirecting output with >. See this page for details.

$ j

You may have noticed that I try to use this static website as a journal. But, you know, not everything I want to write down is really ready (or appropriate) to put here. Some of these things end up in actual physical notebooks–there’s no beating the tactile experience of writing on paper for some kinds of thinking. But I also spend a lot of time on my laptop, and at the command line in some form or another. So I have a directory of time stamped Markdown files stored on Dropbox, for example:

```
...
/home/ed/Dropbox/Journal/2019-08-25.md
/home/ed/Dropbox/Journal/2020-01-27.md
/home/ed/Dropbox/Journal/2020-05-24.md
/home/ed/Dropbox/Journal/2020-05-25.md
/home/ed/Dropbox/Journal/2020-05-31.md
...
```

Sometimes these notes migrate into a blog post or some other writing I’m doing. I used this technique quite a bit when writing my dissertation, when I wanted to jot down things on my phone as an idea arrived. I’ve tried a few different apps for editing Markdown on my phone, but mostly settled on iA Writer, which just gets out of the way.
But when editing on my laptop I tend to use my favorite text editor Vim with the vim-pencil plugin for making Markdown fun and easy. If Vim isn’t your thing and you use another text editor, keep reading, since this will work for you too.

The only trick to this method of journaling is that I just need to open the right file. With command completion on the command line this isn’t so much of a chore. But it does take a moment to remember the date and craft the right path. Today, while reflecting on how nice it is to still be using Unix, it occurred to me that I could create a little shell script to open my journal for that day (or a previous day). So I put this little file j in my PATH:

```
#!/bin/zsh

journal_dir="/home/ed/Dropbox/Journal"

if [ "$1" ]; then
    date=$1
else
    date=`date +%Y-%m-%d`
fi

vim "$journal_dir/$date.md"
```

So now when I’m in the middle of something else and want to jot a note in my journal I just type j. Unix, still crazy after all these years.

Strengths and Weaknesses

Quoting Macey (2019), quoting Foucault, quoting Nietzsche:

> One thing is needful. – To ‘give style’ to one’s character – a great and rare art! It is practised by those who survey all the strengths and weaknesses that their nature has to offer and then fit them into an artistic plan until each appears as art and reason and even weaknesses delight the eye.
>
> Nietzsche, Williams, Nauckhoff, & Del Caro (2001), p. 290

This is a generous and lively image of what art does when it is working. Art is not perfection.

References

Macey, D. (2019). The lives of Michel Foucault: A biography. Verso.

Nietzsche, F. W., Williams, B., Nauckhoff, J., & Del Caro, A. (2001). The gay science: With a prelude in German rhymes and an appendix of songs. Cambridge, U.K.; New York: Cambridge University Press.

Data Speculation

I’ve taken the ill-advised approach of using the Coronavirus as a topic to frame the exercises in my computer programming class this semester. I say "ill-advised" because given the impact that COVID has been having on students I’ve been thinking they probably need a way to escape news of the virus by way of writing code, rather than diving into it more. It’s late in the semester to modulate things, but I think we will shift gears to look at programming through another lens after spring break.

That being said, one of the interesting things we’ve been doing is looking at vaccination data that is being released by the Maryland Department of Health through their ESRI ArcGIS Hub. Note: this dataset has since been removed from the web because it has been superseded by a new dataset that includes single dose vaccinations. I guess it’s good that students get a feel for how ephemeral data on the web is, even when it is published by the government.

We noticed that this dataset recorded a small number of vaccinations as happening as early as the 1930s, up until December 11, 2020, when vaccines were approved for use. I asked students to apply what we have been learning about Python (files, strings, loops, and sets) to identify the Maryland counties that were responsible for generating this anomalous data. I thought this exercise provided a good demonstration, using real, live data, that critical thinking about the provenance of data is always important, because there is no such thing as raw data (Gitelman, 2013).
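To give a sense of the exercise, here is one possible sketch of a solution using files, loops and a set. It assumes the CSV uses the column headings shown in the tables below ("Vaccination Date" in YYYY/MM/DD form and "County"); the file name and exact column names are illustrative, since the real export may have differed:

```python
import csv
from datetime import date

# vaccines were approved for use on December 11, 2020, so anything
# earlier than this is anomalous
APPROVAL_DATE = date(2020, 12, 11)

counties = set()
with open("md_vaccinations.csv") as f:
    for row in csv.DictReader(f):
        year, month, day = row["Vaccination Date"].split("/")
        vaccinated = date(int(year), int(month), int(day))
        if vaccinated < APPROVAL_DATE:
            counties.add(row["County"])

print(sorted(counties))
```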
While we were working with the data to count the number of anomalous vaccinations per county, one of my sharp-eyed students noticed that the results we were seeing with my version of the dataset (downloaded on February 28) were different from what we saw with his (downloaded on March 4). We expected to see new rows in the later one because new vaccination data seem to be reported daily–which is cool in itself. But we were surprised to find new vaccination records for dates earlier than December 11, 2020. Why would new vaccinations for these erroneous older dates still be entering the system?

For example the second dataset, downloaded March 4, acquired 6 new rows:

| Object ID | Vaccination Date | County | Daily First Dose | Cumulative First Dose | Daily Second Dose | Cumulative Second Dose |
|---|---|---|---|---|---|---|
| 4 | 1972/10/13 | Allegany | 1 | 1 | 0 | 0 |
| 5 | 1972/12/16 | Baltimore | 1 | 1 | 0 | 0 |
| 6 | 2012/02/03 | Baltimore | 1 | 2 | 0 | 0 |
| 28 | 2020/02/24 | Baltimore City | 1 | 2 | 0 | 0 |
| 34 | 2020/08/24 | Baltimore | 1 | 4 | 0 | 0 |
| 64 | 2020/12/10 | Prince George’s | 1 | 3 | 0 | 0 |

And these rows, present in the February 28 version, were deleted in the March 4 version:

| Object ID | Vaccination Date | County | Daily First Dose | Cumulative First Dose | Daily Second Dose | Cumulative Second Dose |
|---|---|---|---|---|---|---|
| 4 | 2019/12/26 | Frederick | 1 | 1 | 0 | 0 |
| 15 | 2020/01/25 | Talbot | 1 | 1 | 0 | 0 |
| 19 | 2020/01/28 | Baltimore | 1 | 1 | 0 | 0 |
| 20 | 2020/01/30 | Caroline | 1 | 1 | 0 | 0 |
| 28 | 2020/02/12 | Prince George’s | 1 | 1 | 0 | 0 |
| 30 | 2020/02/20 | Anne Arundel | 1 | 6 | 0 | 0 |
| 56 | 2020/10/16 | Frederick | 1 | 7 | 0 | 4 |
| 59 | 2020/11/01 | Wicomico | 1 | 1 | 0 | 0 |
| 60 | 2020/11/04 | Frederick | 1 | 8 | 0 | 4 |

I found these additions perplexing at first, because I assumed these outliers were part of an initial load. But it appears that the anomalies are still being generated? The deletions suggest that perhaps the anomalous data is being identified and scrubbed in a live system that is then dumping out the data? Or maybe the code that is being used to update the dataset in ArcGIS Hub itself is malfunctioning in some way?

If you are interested in toying around with the code and data it is up on GitHub. I was interested to learn about `pandas.DataFrame.merge`, which is useful for diffing tables when you use `indicator=True`.
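Here is roughly how that kind of diff works, as a sketch that assumes the two snapshots have been saved to illustratively named CSV files:

```python
import pandas as pd

# two snapshots of the same dataset (file names are illustrative)
feb28 = pd.read_csv("md_vaccinations_2021-02-28.csv")
mar04 = pd.read_csv("md_vaccinations_2021-03-04.csv")

# an outer merge with indicator=True adds a _merge column saying whether a
# row appears in both frames, only the left one, or only the right one
diff = feb28.merge(mar04, how="outer", indicator=True)

added = diff[diff["_merge"] == "right_only"]   # rows new in the March 4 file
deleted = diff[diff["_merge"] == "left_only"]  # rows gone since February 28

print(len(added), "rows added")
print(len(deleted), "rows deleted")
```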
At any rate, having students notice, measure and document anomalies like this seems pretty useful. I also asked them to speculate about what kinds of activities could generate these errors. I meant speculate in the speculative fiction sense of imagining a specific scenario that caused it. I think this made some students scratch their head a bit, because I wasn’t asking them for the cause, but to invent a possible cause.

Based on the results so far I’d like to incorporate more of these speculative exercises concerned with the functioning of code and data representations into my teaching. I want to encourage students to think creatively about data processing as they learn about the nuts and bolts of how code operates. For example, the treatments in How to Run a City Like Amazon, and Other Fables use sci-fi to test ideas about how information technologies are deployed in society. Another model is the Speculative Ethics Book Club, which also uses sci-fi to explore the ethical and social consequences of technology. I feel like I need to read up on speculative research more generally before doing this though (Michael & Wilkie, 2020). I’d also like to focus the speculation down at the level of the code or data processing, rather than at the macro super-system level. But that has its place too. Another difference is that I was asking students to engage in speculation about the past rather than the future. How did the data end up this way?

Perhaps this is more of a genealogical approach, of winding things backwards, and tracing what is known. Maybe it’s more Mystery than Sci-Fi. The speculative element is important because (in this case) operations at the MD Dept of Health, and their ArcGIS Hub setup, are mostly opaque to us. But even when access isn’t a problem these systems can feel opaque, because rather than there being a dearth of information you are drowning in it. Speculation is a useful abductive approach to hypothesis generation and, hopefully, understanding.

Update 2021-03-17: Over in the fediverse David Benque recommended I take a look at Matthew Stanley’s chapter in (Gitelman, 2013), "Where Is That Moon, Anyway? The Problem of Interpreting Historical Solar Eclipse Observations", for the connection to Mystery. For the connection to Peirce and abduction he also pointed to Luciana Parisi’s chapter "Speculation: A method for the unattainable" in Lury & Wakeford (2012). Definitely things to follow up on!

References

Gitelman, L. (Ed.). (2013). "Raw data" is an oxymoron. MIT Press.

Lury, C., & Wakeford, N. (2012). Inventive methods: The happening of the social. Routledge.

Michael, M., & Wilkie, A. (2020). Speculative research. In The Palgrave encyclopedia of the possible (pp. 1–8). Cham: Springer International Publishing. Retrieved from https://doi.org/10.1007/978-3-319-98390-5_118-1

Recovering Foucault

I’ve been enjoying reading David Macey’s biography of Michel Foucault, which was republished in 2019 by Verso. Macey himself is an interesting figure, both a scholar and an activist who took leave from academia to do translation work and to write this biography and others of Lacan and Fanon.

One thing that struck me as I’m nearing the end of Macey’s book is the relationship between Foucault and archives. I think Foucault has become emblematic of a certain brand of literary analysis of "the archive" that is far removed from the research literature of archival studies, while using "the archive" as a metaphor (Caswell, 2016). I’ve spent much of my life working in libraries and digital preservation, and now studying and teaching about them from the perspective of practice, so I am very sympathetic to this critique. It is perhaps ironic that the disconnect between these two bodies of research is a difference in discourse, which Foucault himself brought attention to.

At any rate, the thing that has struck me while reading this biography is how much time Foucault himself spent working in libraries and archives. Here’s Foucault in his own words talking about his thesis:

> In Histoire de la folie à l’âge classique I wished to determine what could be known about mental illness in a given epoch … An object took shape for me: the knowledge invested in complex systems of institutions. And a method became imperative: rather than perusing … only the library of scientific books, it was necessary to consult a body of archives comprising decrees, rules, hospital and prison registers, and acts of jurisprudence. It was in the Arsenal or the Archives Nationales that I undertook the analysis of a knowledge whose visible body is neither scientific nor theoretical discourse, nor literature, but a daily and regulated practice. (Macey, 2019, p. 94)

Foucault didn’t simply use archives for his research: understanding the processes and practices of archives was integral to his method.
Even though the theory and practice of libraries and archives are quite different, given their different functions and materials, they are often lumped together as a convenience in the same buildings. Macey blurs them a little bit, in sections like this where he talks about how important libraries were to Foucault’s work:

> Foucault required access to Paris for a variety of reasons, not least because he was also teaching part-time at ENS. The putative thesis he had begun at the Fondation Thiers – and which he now described to Polin as being on the philosophy of psychology – meant that he had to work at the Bibliothèque Nationale and he had already become one of its habitués. For the next thirty years, Henri Labrouste’s great building in the rue de Richelieu, with its elegant pillars and arches of cast iron, would be his primary place of work. His favourite seat was in the hemicycle, the small, raised section directly opposite the entrance, sheltered from the main reading room, where a central aisle separates rows of long tables subdivided into individual reading desks. The hemicycle affords slightly more quiet and privacy. For thirty years, Foucault pursued his research here almost daily, with occasional forays to the manuscript department and to other libraries, and contended with the Byzantine cataloguing system: two incomplete and dated printed catalogues supplemented by cabinets containing countless index cards, many of them inscribed with copperplate handwriting. Libraries were to become Foucault’s natural habitat: ‘those greenish institutions where books accumulate and where there grows the dense vegetation of their knowledge’

There’s a metaphor for you: libraries as vegetation :) It kind of reminds me of some recent work looking at decentralized web technologies in terms of mushrooms. But I digress.

I really just wanted to note here that the erasure of archival studies from humanities research about "the archive" shouldn’t really be attributed to Foucault, whose own practice centered the work of libraries and archives. Foucault wasn’t just writing about an abstract archive, he was practically living out of them. As someone who has worked in libraries and archives I can appreciate how power users (pun intended) often knew aspects of the holdings and intricacies of their management better than I did. Archives, when they are working, are always collaborative endeavours, and the important thing is to recognize and attribute the various sides of that collaboration.

PS. Writing this blog post led me to dig up a few things I want to read (Eliassen, 2010; Radford, Radford, & Lingel, 2015).

References

Caswell, M. (2016). The archive is not an archives: On acknowledging the intellectual contributions of archival studies. Reconstruction, 16(1). Retrieved from http://reconstruction.eserver.org/Issues/161/Caswell.shtml

Eliassen, K. (2010). Archives of Michel Foucault. In E. Røssaak (Ed.), The archive in motion: New conceptions of the archive in contemporary thought and new media practices. Novus Press.

Macey, D. (2019). The lives of Michel Foucault: A biography. Verso.

Radford, G. P., Radford, M. L., & Lingel, J. (2015). The library as heterotopia: Michel Foucault and the experience of library space. Journal of Documentation, 71(4), 733–751.

Teaching OOP in the Time of COVID

I’ve been teaching a section of the Introduction to Object Oriented Programming at the UMD College for Information Studies this semester.
It’s difficult for me, and for the students, because we are remote due to the Coronavirus pandemic. The class is largely asynchronous, but every week I’ve been holding two synchronous live coding sessions in Zoom to discuss the material and the exercises. These have been fun because the students are sharp, and haven’t been shy about sharing their screen and their VSCode session to work on the details. But students need quite a bit of self-discipline to move through the material, and probably only about 1/4 of the students take advantage of these live sessions.

I’m quite lucky because I’m working with a set of lectures, slides and exercises that have been developed over the past couple of years by other instructors: Josh Westgard, Aric Bills and Gabriel Cruz. You can see some of the public facing materials here. Having this backdrop of content, combined with Severance’s excellent (and free) Python for Everybody, has allowed me to focus more on my live sessions, on responsive grading, and to also spend some time crafting additional exercises that are geared to this particular moment.

This class is in the College for Information Studies and not in the Computer Science Department, so it’s important for the students to not only learn how to use a programming language, but to understand programming as a social activity, with real political and material effects in the world. Being able to read, understand, critique and talk about code and its documentation is just as important as being able to write it. In practice, out in the "real world" of open source software, I think these aspects are arguably more important.

One way I’ve been trying to do this in the first few weeks of class is to craft a sequence of exercises that form a narrative around Coronavirus testing and data collection to help remind the students of the basics of programming: variables, expressions, conditionals, loops, functions, files.

In the first exercise we imagined a very simple data entry program that needed to record results of real-time polymerase chain reaction (RT-PCR) tests. I gave them the program and described how it was supposed to work, and asked them to describe (in English) any problems that they noticed and to submit a version of the program with the problems fixed. I also asked them to reflect on a request from their boss about adding the collection of race, gender and income information. The goal here was to test their ability to read the program and write English about it while also demonstrating a facility for modifying the program. Most importantly I wanted them to think about how inputs such as race or gender have questions about categories and standards behind them, and aren’t simply a matter of syntax.

The second exercise builds on the first by asking them to adjust the revised program to be able to save the data in a very particular format; in the first exercise the data was simply stored in memory and printed to the screen in aggregate at the end. The scenario here is that the Department of Health and Human Services has assumed the responsibility for COVID test data collection from the Centers for Disease Control. Of course this really happened, but the data format I chose was completely made up (maybe we will be working with some real data at the end of the semester if I continue with this theme). The goal in this exercise was to demonstrate their ability to read another program and fit a function into it. The students were given a working program that had a `save_results()` function stubbed out.
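The exercise program itself isn’t included here, but purely as an illustration, a filled-in stub might look something like this. The `save_results()` name comes from the exercise, while the record structure, field separator and file name are all invented for this sketch:

```python
def save_results(results, filename):
    """Write collected test results out in the (made up) reporting format."""
    # results is assumed to be a list of dictionaries with "patient_id"
    # and "outcome" keys collected earlier by the data entry loop
    with open(filename, "w") as f:
        for result in results:
            f.write(f"{result['patient_id']}|{result['outcome']}\n")


# the surrounding (hypothetical) program would collect results in memory
# and then call something like:
# save_results(results, "covid_results.txt")
```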
In addition to submitting their revised code I asked them to reflect on some limitations of the data format chosen, and the data processing pipeline that it was a part of. And in the third exercise I asked them to imagine that the lab they were working in had a scientist who discovered a problem with some of the thresholds for acceptable testing, which required an update to the program from Exercise 2, and also a test suite to make sure the program was behaving properly. In addition to writing the tests I asked them to reflect on what functionality was not being tested that probably should be.

This alternation between writing code and writing prose is something I started doing as part of a Digital Curation class. I don’t know if this dialogical, or perhaps dialectical, approach is something others have tried. I should probably do some research to see. In my last class I alternated week by week: one week reading and writing code, the next week reading and writing prose. This semester I’ve stayed focused on code, but required the reading and writing of code as well as prose about code in the same week.

I hope to write more about how this goes, and these exercises, as I go. I’m not sure if I will continue with the Coronavirus data examples. One thing I’m sensitive to is that my students themselves are experiencing the effects of the Coronavirus, and may want to escape it just for a bit in their school work. Just writing in the open about it here, in addition to the weekly meetings I’ve had with Aric, Josh and Gabriel, has been very useful.

Speaking of those meetings, I learned today from Aric that tomorrow (February 20th, 2021) is the 30th anniversary of Python’s first public release! You can see this reflected in this timeline. This v0.9.1 release was the first release Guido van Rossum made outside of CWI and was made on the Usenet newsgroup alt.sources, where it is split out into chunks that need to be reassembled. Back in 2009 Andrew Dalke located and repackaged these sources in Google Groups, which acquired alt.sources as part of DejaNews in 2001. But if you look at the time stamp on the first part of the release you can see that it was made February 19, 1991 (not February 20). So I’m not sure if the birthday is actually today rather than tomorrow.

I sent this little note out to my students with this wonderful two part oral history that the Computer History Museum did with Guido van Rossum a couple years ago. It turns out both of his parents were atheists and pacifists. His dad went to jail because he refused to be conscripted into the military. That and many more details of his background and thoughts about the evolution of Python can be found in these delightful interviews. Happy Birthday Python!

GPT-3 Jam

One of the joys of pandemic academic life has been a true feast of online events to attend, on a wide variety of topics, some of which are delightfully narrow and esoteric. Case in point was today’s Reflecting on Power and AI: The Case of GPT-3, which lived up to its title. I’ll try to keep an eye out for when the video posts, and update here.

The workshop was largely organized around an exploration of whether GPT-3, the largest known machine learning language model, changes anything for media studies theory, or if it amounts to just more of the same. So the discussion wasn’t focused so much on what games could be played with GPT-3, but rather on whether GPT-3 changes the rules of the game for media theory, at all.
I’m not sure there was a conclusive answer at the end, but it sounded like the consensus was that current theorization around media is adequate for understanding GPT-3, but it matters greatly what theory or theories are deployed. The online discussion after the presentations indicated that attendees didn’t see this as merely a theoretical issue, but one that has direct social and political impacts on our lives.

James Steinhoff looked at GPT-3 from a Marxist media theory perspective, where he told the story of GPT-3 as a project of OpenAI and as a project of capital. OpenAI started with much fanfare in 2015 as a non-profit initiative where the technology, algorithms and models developed would be kept openly licensed and freely available so that the world could understand the benefits and risks of AI technology. Steinhoff described how in 2019 the project’s needs for capital (compute power and staff) transitioned it from a non-profit into a capped-profit company, which is now owned, or at least controlled, by Microsoft. The code for generating the model as well as the model itself are gated behind a token-driven Web API run by Microsoft. You can get on a waiting list to use it, but apparently a lot of people have been waiting a while, so … Being a Microsoft employee probably helps. I grabbed a screenshot of the pricing page that Steinhoff shared during his presentation.

I’d be interested to hear more about how these tokens operate. Are they per-request, or are they measured according to something else? I googled around a bit during the presentation to try to find some documentation for the Web API, and came up empty handed. I did find Shreya Shankar’s gpt3-sandbox project for interacting with the API in your browser (mostly for iteratively crafting text input in order to generate desired output). It depends on the openai Python package created by OpenAI themselves. The docs for openai then point at a page on the openai.com website which is behind a login. You can create an account, but you need to be pre-approved (made it through the waitlist) to be able to see the docs. There’s probably some sense that can be made from examining the Python client though.

All of the presentations in some form or another touched on the 175 billion parameters that were used to generate the model. But the API to the model doesn’t have that many parameters. It allows you to enter text and get text back. Still, the API surface that the GPT-3 service provides could be interesting to examine a bit more closely, especially to track how it changes over time. In terms of how this model mediates knowledge and understanding it’ll be important to watch.

Steinhoff’s message seemed to be that, despite the best of intentions, GPT-3 functions in the service of very large corporations with very particular interests. One dimension that he didn’t explore, perhaps because of time, is how the GPT-3 model itself is fed massive amounts of content from the web, or the commons. Indeed, 60% of the data came from the CommonCrawl project. GPT-3 is an example of an extraction project that has been underway at large Internet companies for some time. I think the critique of these corporations has often been confined to seeing them in terms of surveillance capitalism rather than in terms of raw resource extraction, or the primitive accumulation of capital.
The behavioral indicators of who clicked on what are certainly valuable, but GPT-3 and sister projects like CommonCrawl show that just the accumulation of data with modest amounts of metadata can be extremely valuable. This discussion really hit home for me since I’ve been working with Jess Ogden and Shawn Walker using CommonCrawl as a dataset for talking about the use of web archives, while also reflecting on the use of web archives as data. CommonCrawl provides a unique glimpse into some of the data operations that are at work in the accumulation of web archives. I worry that the window is closing and CommonCrawl itself will be absorbed into Microsoft.

Following Steinhoff, Olya Kudina and Bas de Boer jointly presented some compelling thoughts about how it’s important to understand GPT-3 in terms of sociotechnical theory, using ideas drawn from Foucault and Arendt. I actually want to watch their presentation again because it followed a very specific path that I can’t do justice to here. But their main argument seemed to be that GPT-3 is an expression of power, and that where there is power there is always resistance to power. GPT-3 can and will be subverted and used to achieve particular political ends of our own choosing.

Because of my own dissertation research I’m partial to Foucault’s idea of governmentality, especially as it relates to ideas of legibility (Scott, 1998)–the who, what and why of legibility projects, aka archives. GPT-3 presents some interesting challenges in terms of legibility because the model is so complex that the results it generates defy deductive logic and auditing. In some ways GPT-3 obscures more than it makes a population legible, as Foucault moved from disciplinary analysis of the subject to the ways in which populations are described and governed through the practices of pastoral power, of open datasets. Again the significance of CommonCrawl as an archival project, as a web legibility project, jumps to the fore.

I’m not as up on Arendt as I should be, so one outcome of their presentation is that I’m going to read her The Human Condition, which they had in a slide. I’m long overdue.

References

Scott, J. C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed. Yale University Press.

mimetypes

Today I learned that Python has a mimetypes module, and has ever since Guido van Rossum added it in 1997. Honestly I’m just a bit sheepish to admit this discovery, as someone who has been using Python for digital preservation work for about 15 years. But maybe there’s a good reason for that.

Since the entire version history for Python is available on GitHub (which is a beautiful thing in itself) you can see that the mimetypes module started as a `guess_type()` function built around a pretty simple hard coded mapping of file extensions to mimetypes. The module also includes a little bit of code to look for, and parse, mimetype registries that might be available on the host operating system. The initial mimetype registries used included one from the venerable Apache httpd web server, and one from the Netscape web browser, which was about three years old at the time. It makes sense why this function to look up a mimetype for a filename would be useful at that time, since Python was being used to serve up files on the nascent web and for sending email, and whatnot.

Today the module looks much the same, but it has a few new functions and about twice as many mimetypes in its internal list.
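For example (with made-up filenames), you can guess a mimetype from a filename, and go the other way from a mimetype to a file extension:

```python
import mimetypes

# map a filename to a (mimetype, encoding) pair
print(mimetypes.guess_type("interview.mp4"))         # ('video/mp4', None)
print(mimetypes.guess_type("finding-aid.pdf"))       # ('application/pdf', None)

# and go the other way, from a mimetype to a file extension
print(mimetypes.guess_extension("image/jpeg"))       # '.jpg' (or '.jpe' on some versions)
print(mimetypes.guess_extension("application/json")) # '.json'
```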
Some of the new mimetypes include text/csv, audio/mpeg, application/vnd.ms-powerpoint, application/x-shockwave-flash, application/xml, and application/json. Comparing the first commit to the latest provides a thumbnail sketch of 25 years of web format evolution.

I’ll admit, this is a bit of an esoteric thing to be writing a blog post about. So I should explain. At work I’ve been helping out on a community archiving project which has accumulated a significant amount of photographs, scans, documents of various kinds, audio files and videos. Some of these files are embedded in web applications like Omeka, some are in cloud storage like Google Drive, or on the office network attached storage, and others are on scattered storage devices in people’s desk drawers and closets. We’ve also created new files during community digitization events and oral history interviews.

As part of this work we’ve wanted to start building a place on the web where all these materials live. This has required not only describing the files, but also putting all the files in one place so that access can be provided. In principle this sounds simple. But it turns out that collecting the files from all these diverse locations poses significant challenges, because their context matters. The filenames, and the directories they are found in, are sometimes the only descriptive metadata that exists for this data. In short, the original order matters. But putting this content on the web means that the files need to be brought together and connected with their metadata programmatically. This is how I stumbled across the mimetypes module.

I’ve been writing some throwaway code to collect the files together into the same directory structure while preserving their original filenames and locations in an Airtable database. I’ve been using the magic module to identify the format of the file, which is used to copy the file into a Dropbox storage location. The extension is important because we are expecting this to be a static site serving up the content, and we want the files to also be browsable using the Dropbox drive. It turns out that `mimetypes.guess_extension()` is pretty useful for turning a mediatype into a file extension.

I’m kind of surprised that it took me this long to discover mimetypes, but I’m glad I did. As an aside, I think this highlights for me how important Git can be as an archive and research method for software studies work.

Northwest Branch Cairn

Here is a short recording and a couple photos from my morning walk along the Northwest Branch trail with Penny. I can’t go every day, but at 7 months old she has tons of energy, so it’s generally a good idea for all concerned to go at least every other morning. And it’s a good thing, because the walk is surprisingly peaceful, and it’s such a joy to see her run through the woods.

After walking about 30 minutes there is this little cairn that is a reminder for me to turn around. After seeing it grow in size I was sad to see it knocked down one day. But, ever so slowly, it is getting built back up again.