Chapter 3

Humanities and Social Science Reading through Machine Learning

Marisa Plumb
San Jose State University

Introduction

The purposes of computational literary studies have evolved and diversified a great deal over the last half century. Within this dynamic and often contentious space, a set of fundamental questions deserves our collective attention: does the computation and digitization of language recast the ways we read, value, and receive words? In what ways can research and scholarship on literature become a more meaningful part of the future development of computer systems? As the theory and practice of computational literary studies evolve, their potential to play a direct role in revising historical narratives and framing new research questions carries cross-disciplinary implications.

It’s worthwhile to anchor these questions in the origin stories that today’s digital humanists tell, from the work of Josephine Miles at Berkeley in the 1930s (Buurma and Heffernan 2018) to Roberto Busa’s work in the 1940s to work that links Structuralism and Russian Formalism at the turn of the twentieth century (Algee-Hewitt 2015) to today’s systemized explorations of texts. The sciences and humanities have a shared history in their desire to uncover the patterns and systems that make language functional and impactful, and there have long been linguistic and computational tools that help advance this work. What’s more challenging to unravel and articulate from these origin stories are the mathematical concepts behind the tools that humanists wield. Ideally one would navigate this historical landscape when assessing the fitness of any given computational technique for addressing a specific humanities research question, but often researchers choose tools because they are powerful and popular, without a robust understanding of the conceptual assumptions they embody, which are defined by the mathematical and statistical principles they are based on. This can make it difficult to generate reproducible results that contribute to a tool’s methodological development.

This is related to a set of issues that drive debates among computationally-minded scholars, which regularly appear in digital humanities forums. In 2019, for instance, Nan Da issued a harsh critique of humanists’ implementation of statistical methods in their research.[1] Her claim is that computational methods are not a good match for literary research, and she systematically shows how the results from several computational humanities studies are not only difficult to reproduce, but can be easily skewed with minor changes to how an algorithm is implemented. Although this debate about digital methods points to a necessary evolution in the field (in which researchers become more accountable to the computational laws that they are utilizing), her essay’s broader mission is to question the appropriateness of using computational tools to investigate literary objects and ideas.

[1] Da’s critique of statistical model usage in computational humanities work sparked a forum of responses in Critical Inquiry.

Refutations to this claim were swift and abundant (Critical Inquiry 2019), and highlight a number of concepts central to my concern here with future intersections of machine learning and literary research. Respondents such as Mark Algee-Hewitt pointed out that literary scholars employ computational statistical models in order to reveal something about texts that human readers could not.
In doing so, literary scholars are at liberty to note where computation reaches its useful limit[2] and take up more traditional forms of literary analysis (Algee-Hewitt 2019). Katherine Bode explores the promise and pitfalls of this hybrid “close and distant reading” approach in her 2020 article on the intersection of topic modeling and bias. Imperfect as the hybrid method is, stressing the value of familiar interpretive methods remains important, politically and practically, when bringing computation into humanities departments.

[2] This limit typically exists for a combination of three reasons: computer programs can only generate models based on the data we give them, a tool isn’t fully understood and so not robustly explored, and many algorithms and tools are being used in experimental ways.

This essay extends the argument that computational tools do more than turn big data into novel close reading opportunities. Machine learning, and word embedding algorithms in particular, may have a unique ability to shift this conversation into new territory, where scholars begin to ask how historical research can contribute more sophisticated approaches to treating words as data. With historically-minded approaches to dataset creation for machine learning, issues emerge that engender new theoretical frameworks for evaluating the ability of statistical models of information to reveal cultural and artistic dimensions of language. I will first contextualize what word embeddings do, and then show a few of the mathematical concepts that have driven their development.

Of the many available machine learning algorithms, word embedding algorithms have shown particular promise in capturing contextual meanings (of words or other units of textual data) more accurately than previous techniques in natural language processing. Word embeddings encompass a set of language modeling techniques where words or phrases from a large set of texts (i.e., a “corpus”) are analyzed through the use of a neural network architecture. For each vocabulary term in the corpus, the neural network algorithm uses the term’s proximity to other words to assign it values that become a vector of real numbers—one high-dimensional vector is generated for each word. (The term “embedding” refers to the mathematics that turns a space with many dimensions per word into a continuous vector space with a much lower dimension.)[3]

[3] See Koehrsen 2018 for a fuller explanation of the process.
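To make the training step concrete, here is a minimal sketch using the open-source gensim library; the toy corpus, parameter values, and probe word are illustrative stand-ins rather than the configuration of any study discussed in this essay.

```python
# A minimal word2vec sketch with gensim (assumed installed: pip install gensim).
# The toy corpus below is an illustrative stand-in for a real, pre-tokenized corpus.
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "poet", "praised", "the", "sublime", "mountain"],
    ["the", "critic", "praised", "the", "sublime", "ode"],
    ["the", "poet", "composed", "an", "ode", "to", "the", "mountain"],
]

model = Word2Vec(
    sentences=toy_corpus,
    vector_size=50,   # dimensions of each learned word vector
    window=3,         # how many neighboring words count as "context"
    min_count=1,      # keep every vocabulary term in this tiny example
    sg=1,             # skip-gram: predict context words from the target word
    epochs=200,       # many passes, since the corpus is tiny
    seed=0,
)

print(model.wv["ode"][:5])            # first 5 values of one learned vector
print(model.wv.most_similar("ode"))   # nearest neighbors by cosine similarity
```

With a corpus this small the neighbor lists are meaningless; the point is only that each vocabulary term receives a dense vector derived from its contexts.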
Word embeddings raise three questions critical to this essay: How do word embeddings reflect the contexts of words in order to capture their relative meanings? If word embeddings approximate word meanings, do they also reflect culture? How can literary history and cultural studies inform how scholars use them?

Word embeddings are powerful because they calculate semantic similarities between words based on their distributional properties in large samples of language data. As computational linguist Jussi Karlgren puts it:

Language is a general-purpose representation of human knowledge, and models to process it vary in the degree they are bound to some task or some specific usage. Currently, the trend is to learn regularities and representations with as little explicit knowledge-based linguistic processing as possible, and recent advances in such general models for end-to-end learning to address linguistics tasks have been quite successful. Most of those approaches make little use of information beyond the occurrence or co-occurrence of words in the linguistic signal and take the single word to be the atomary unit.

This is notable because it highlights the power of word embeddings to assign values to words in order to represent their relative meanings, simply based on unstructured language data, without a system of linguistic rules or a labelling system. It also highlights the fact that a word embedding model’s success is based on the parameters of the task it is designed to address. So while the accuracy and power of word vector algorithms might be recognizable in general-purpose applications that improve with larger training corpora (for instance Google News and Wikipedia), they can be equally powerful representation learning systems for specific historical research tasks that use different benchmarks for success. Humanists using these machine learning methods are learning to think differently about corpora size, corpora content, and the utility of a successfully-trained model for analysis and interpretation.

Whatever the application, the success of machine learning is predicated on creating good datasets. As a recent paper in IEEE Transactions on Knowledge and Data Engineering notes, “the majority of the time for running machine learning end-to-end is spent on preparing the data, which includes collecting, cleaning, analyzing, visualizing, and feature engineering” (Roh et al. 2019, 1). Acknowledging this helps contextualize machine learning algorithms for text analysis tasks in the humanities, but also highlights data curation challenges that can be taken up in new ways by humanists. This naturally raises questions about how machine learning algorithms like word embeddings are implemented for text analysis, and how they should be modified for historical research—they require different computational priorities and frameworks.
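As a small illustration of the “cleaning” step, the sketch below performs the kind of minimal normalization (lowercasing, stripping punctuation, tokenizing) that typically precedes embedding training; real digitized texts, especially OCR output from rare materials, demand far more careful handling.

```python
# Minimal text normalization before embedding training (standard library only).
import re

def tokenize(raw_text):
    """Lowercase, keep only letters and apostrophes, split on whitespace."""
    cleaned = re.sub(r"[^a-z']+", " ", raw_text.lower())
    return cleaned.split()

raw = "The Poet's ode, printed in 1830, praised the sublime mountain."
print(tokenize(raw))
# ['the', "poet's", 'ode', 'printed', 'in', 'praised', 'the', 'sublime', 'mountain']
```

Even this toy function embodies interpretive choices (digits are discarded, apostrophes kept), which is exactly the sense in which preparation of data is never neutral.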
In parallel to the corpora considerations that computational humanities scholars ponder, there is an abundance of work, across disciplines such as cognitive science and psychology (Griffiths et al. 2007), that attempts to delineate the problems and limits of using large collections of text for training embeddings. These large collections tend to reflect the biases that exist in society and history, and in turn, systems based on these datasets can make troubling inferences, now well documented as algorithmic bias.[4] Computer science researchers need to evaluate the social dimensions of their applications in diverse societies and find ways to fairly represent all populations.

[4] As investigated, for instance, in Noble 2018.

Digital humanities practices can implicitly help address these issues. Literary studies, as it evolves towards multivocality and canon expansion, makes explicit a link between methods of literary analysis and digital practices that are deliberately inclusive, less-biased, and diachronic (rather than ahistorical). Emerging literary scholarship uses computational methods to question hegemonic practices in the history of the field, through the now-familiar practice of data curation (Poole 2013). But this work can also help combat algorithmic bias more broadly, and expand beyond corpus development into algorithmic design. As digital literary scholarship continues to deepen its exchanges with Sociology, History, and Information Science, stronger methodologies for using fair and representative data will become pervasive throughout these disciplines, as well as in commercial applications. Interdisciplinary methodologies are foundational to future computational literary research that can make sophisticated contributions to text analysis.

The Bengal Annual: A Case Study

Complex relationships between words cannot be fully assessed with one flat application of a powerful tool to a set of texts. But this does not mean that the usefulness of machine learning for literature is limited: rather, scholars can wield it to control how machines learn sets of relationships between concepts. Choosing which texts to include in a corpus is coupled to decisions about whether and how to label its contents, and how to tune the parameters of the algorithms. For the purposes of literary analysis, these should be embraced as interpretive, biased acts—ones that deepen understanding of commonly-employed computational methods—and folded into emerging methodologies. Because humanities scholars are not generating models to serve applications with thousands of end-users who primarily expect accuracy, they can exploit the fallacies of machine learning in order to improve how dataset management and feature engineering are conducted. Working with big data in order to generate models isn’t valuable because it reveals history’s “true” cultural patterns, but because it demonstrates how machines already circulate those “truths.” A scholar’s deep knowledge of the historical content and formalities of language can determine how corpora are compared, how we experiment with known biases, and how we move towards a future landscape of literary analysis that is inclusive of marginalized texts and the latest cultural theory.

Roopika Risam, for instance, advocates for both a theoretical and practice-based decolonization of the digital humanities, noting ways that postcolonial digital archives can intervene in knowledge production in society (2018, 79). Corpora created from periods of revolution, then, might reveal especially useful vector relationships and lead to better understanding of semantic changes during those times. Those word embeddings might be useful for teaching computers racialized language over timelines, so that machine learning applications do not simply “read” history as a flat set of relationships and inevitably reflect the worst of its biases.

To begin to unpack this process, I will present a case study on the 1830 Bengal Annual and a corpus of similarly-situated texts. Our team, made up of students in Katherine D. Harris’s graduate seminar on decolonizing Romantic Literature at San Jose State University, asked: can we operationalize questions that arise from close readings of texts to turn problematic quantitative evaluations of words into more complex methods of interpretation? A computer cannot interpret complex cultural concepts, but it can be instructed to weigh time period, narrative perspective, and publication venue, much as a literary scholar would.

With the explosion of print culture in England in the first half of the nineteenth century, publishers began introducing new forms of serialized print materials, including publications known as literary annuals (Harris 2015).
These multi-author texts were commonly produced as high-quality volumes that could be purchased as gifts in the months leading up to the holiday season. As a genre, the annual included poetry, prose, and engravings, among other varieties of content, very often from well-known authors. Literary annuals represent a significant shift in the economics surrounding the production of print materials for mass consumption—for instance, contributors were typically paid. And annuals, though a luxury item, were more affordable than books sold before the mechanization of the printing press (Harris 2015, 1-29).

Literary annuals and other periodicals are interesting sites of literary study because they can be read as reinforcing or resisting the British Empire. London-based periodicals were eventually distributed to all of Britain’s colonial holdings, including India (Harris 2019). As The Bengal Annual was written in India and contains a small representation of Indian authors, our project investigates it as a variation on British-centric reading materials of the time, one that perhaps offered a provisional voice to a wider community of writers (though not without claims of superiority over the colonized territory it exploits). Some of the contents invoke themes that are affiliated with major Romantic writers such as William Wordsworth and Samuel T. Coleridge, but editor D.L. Richardson included short stories and fiction, which were not held in the same regard as poetry. He also employed local native Indian engravers and writers.

To explore the thesis that the concepts and genres typically associated with British Romantic Literature are represented differently in a text that was written and produced in a different space with a set of contributors who were not exclusively British natives, we experimented with word embeddings on semantic similarity tasks, comparing the annual to texts like Lyrical Ballads. Such a task is within the scope of traditional literary analysis, but my agenda was to probe the idea that we need large-scale representations of marginalized voices in order to show real differences from the ideas of the dominant race, class, and gender.[5]

[5] Such textual repositories are important outside of literature departments, too. We need data to represent all voices in training machines to represent any social arena.

The project team first used statistical tools to find out if the Annual’s poetry, non-fiction, and fiction contained interesting relationships between vocabularies about body parts, social class, and gender. We gathered information about terms that might reveal how different parts of the body were referenced depending on sex. These differences were validated by traditional close-reading knowledge about British Romantic Literature and its historical contexts,[6] and signaled the need to read and analyze the Annual’s passages about body parts, especially ones by writers of different genders and social backgrounds. These simple methods allowed us to take a streamlined approach to confirming that an author’s perspective indeed altered their word choices and other aspects of their references to male and female bodies.

[6] Some of these findings are illustrated in the project’s Scalar site: http://scalar.usc.edu/works/the-bengal-annual/bodies-in-the-annual.

Collecting and mapping those references, however, was not enough to build a larger argument about how discourse on bodies might be different in non-canonical British Romantic Literature. Based on the potential for word embeddings to model semantic spaces for different corpora and compare the distribution of terms, the next step was to build a corpus of non-canonical texts of similar scope to a corpus of canonical works, so that models for each could be legitimately compared.
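The comparison we were working toward can be sketched as follows: train one embedding model per corpus and put the same probe word to each. The two toy corpora and the probe word below are illustrative placeholders for the canonical and non-canonical collections described above.

```python
# Sketch: one embedding model per corpus, queried with the same probe word.
# The toy corpora stand in for real canonical / non-canonical collections,
# each of which would be a large list of tokenized sentences.
from gensim.models import Word2Vec

corpora = {
    "canonical": [
        ["her", "gentle", "heart", "was", "moved"],
        ["his", "noble", "heart", "beat", "with", "courage"],
    ],
    "non_canonical": [
        ["the", "heart", "of", "the", "city", "burned"],
        ["a", "weary", "heart", "laboured", "in", "the", "heat"],
    ],
}

models = {
    name: Word2Vec(sentences=texts, vector_size=25, window=3,
                   min_count=1, sg=1, epochs=200, seed=0)
    for name, texts in corpora.items()
}

for name, model in models.items():
    # With real corpora, diverging neighbor lists would be the starting
    # point for interpretation, not the conclusion.
    print(name, model.wv.most_similar("heart", topn=3))
```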
The corpus-building work, currently in progress, faces challenges that are becoming more familiar to digital historians: the digitization of rare texts, the review of digitization processes for accuracy, and the cleaning of data. The primary challenge is to find the correct works to include: this requires historical expertise, but also raises the question of how to uncover unknown authors. Manu Chander’s Brown Romantics calls for a global assessment of Romantic Literature’s impact by “calling attention to its genuinely unacknowledged legislators” (Chander 2017, 11). But he contends that even the authors he was able to study were already individuals who aspired to assimilate with British culture and ideologies in some ways, and perhaps don’t represent political resistance or views entirely antithetical to the British Empire.

Guided by Chander’s questions about how to locate dissent in contexts of colonization, we documented instances in the text that highlight the dynamics of colonialism, race, and nationalism, and compared them to a set of statistical explorations of the text’s vocabulary (particularly terms related to national identity, gender, and bodies). Chander’s call for a more globally-comprehensive study of Romanticism speaks to the politics of corpora curation discussed above, but also suggests that corpus comparison can benefit from formal methodological guidelines. Puzzling out how best to combine traditional close readings with quantitative inquiries, and then map that work to a machine-learning research framework, revealed several shortcomings in methodological standardization. It also revealed several opportunities for rethinking the way algorithms could be implemented, by adopting and systematizing familiar comparative research practices. Ideas about such methodologies are emerging in many disciplines, which I highlight later in this essay.

Disciplinary directions for word vector research

The potential of word embedding techniques for projects such as our Bengal Annual analysis can be seen in the new computational research directions that have emerged in humanities research.[7] Vector-space representations are based on high-dimensional vectors[8] of real numbers.[9] Those vectors’ values are assigned using a word’s relationship to the words near it in a text, based on the likelihood that a word will appear in proximity to other words it is told to “look” at. For example, the visualization in figure 3.1 demonstrates an embedding space for a historical corpus (1640-1699) using the values assigned to word vectors.

[7] See Kirschenbaum 2007 and Argamon and Olsen 2009.
[8] A word vector may have hundreds or even thousands of dimensions.
[9] Word embedding algorithms are modelled on the linguistic concept that context is a primary way that word meanings are produced. Their usefulness is dependent on the breadth and domain-relevance of the corpus they are trained on, meaning that a corpus of medical research vs. a corpus of 1980s television guides vs. a corpus of family law proceedings would generate models that show different relationships between words like “family,” “health,” “heart,” etc.

[Figure 3.1: A visualized space with reduced dimensions of a neighborhood around wit (Gavin et al. 2019, Figure 21.2).]

In a visualized space (with reduced dimensions) such as the one in figure 3.1, distances among vectors can be assessed, for example, to articulate the forty words most similar to wit. This particular model (trained using the word2vec algorithm), published in the 2019 Debates in the Digital Humanities,[10] allowed the authors to visualize the term wit with synonyms on the left side, and terms related to argumentation on the right, such as indeed, argues, and consequently.

[10] See Goldstone 2019.
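The kind of query behind figure 3.1 can be sketched in a few lines. Note that this sketch loads a modern general-purpose model through gensim’s downloader (which fetches data over the network) rather than Gavin et al.’s seventeenth-century model, so its neighborhood of wit will differ from theirs; the dimension-reduction step only loosely mirrors their visualization.

```python
# Sketch: query a neighborhood around "wit" and flatten it to two dimensions.
# Uses a modern pretrained model via gensim's downloader (network access
# required), NOT the historical 1640-1699 model behind figure 3.1.
# Assumes gensim and scikit-learn are installed.
import gensim.downloader as api
from sklearn.decomposition import PCA

vectors = api.load("glove-wiki-gigaword-50")   # small pretrained word vectors

neighborhood = [w for w, _ in vectors.most_similar("wit", topn=40)] + ["wit"]
coords = PCA(n_components=2).fit_transform([vectors[w] for w in neighborhood])

for word, (x, y) in zip(neighborhood, coords):
    print(f"{word:>15} {x:8.3f} {y:8.3f}")    # plot these to eyeball clusters
```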
This initial exploration prompted Gavin and his co-authors to look at a vector space model for a single author (John Dryden), in order to both validate the model against their subject matter expertise and explore the model’s results. Although word vectors are often employed for machine translation tasks[11] or to project analogical relationships between concepts,[12] they can also be used to question concepts that are traditionally associated with particular literary periods and evaluate those associations with new kinds of evidence.

[11] Software used to translate text or speech from one language to a target language. Machine translation is a subfield of computational linguistics that can now allow for domain-based (i.e. specialized subject matter) customizations of translations, making translated word choices more context-specific.
[12] Although word embeddings aren’t explicitly trained to learn analogies, the vectors exhibit seemingly linear behavior (such as “woman is to queen as man is to king”), which approximately describes a parallelogram. This phenomenon is explored in Allen and Hospedales 2019.

What this type of study suggests is that we can look at cultural concepts like wit in new ways. These results can also facilitate a comparison of historical models of wit to contemporary ones—to show how its meaning may have shifted, using its changing relationship to other words as evidence. This is a growing area of research in the social sciences, computational linguistics, and other disciplines (Kutuzov et al. 2018). In a survey paper on current work in diachronic word embeddings and semantic shifts, Kutuzov et al. note that the surge of interest points to its importance for natural language processing, but that it currently lacks “cohesion, common terminology and shared practices.”

Some of this cohesion might be generated by putting the usefulness of word vectors in context of the history of information retrieval and the history of distributed representation. Word embeddings emerged in the 1960s, with data modeled as a matrix, and a user’s query of a database represented as a vector. Simple vector operations could be used to locate relevant data or documents. Gerard Salton is generally credited as one of the first to do this, based on the idea that he could represent a document as a vector of keywords and use measures like cosine similarity and dimensionality reduction to compare documents.[13]

[13] Algorithms like word2vec take as input the linguistic context of words in a given corpus of text, and output an N-dimensional space of those words—each word is represented as a vector of dimension N in that Euclidean space. Word vectors with thousands of values are transformed to lower-dimensional spaces in which the directionality of two vectors can be measured using cosine similarity—words that exist in similar contexts would be expected to have a similar cosine measurement and map to like clusters in the distributed space.
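The cosine measurement mentioned in note [13] is a short computation. A minimal sketch, with made-up three-dimensional vectors standing in for real word vectors of much higher dimension:

```python
# Cosine similarity: the cosine of the angle between two vectors,
# near 1.0 for similar directions, near 0.0 for orthogonal ones.
# The three-dimensional vectors below are made up for illustration.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

wit = np.array([0.8, 0.1, 0.3])
humour = np.array([0.7, 0.2, 0.4])
argument = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(wit, humour))    # ~0.98: similar contexts
print(cosine_similarity(wit, argument))  # ~0.29: little shared context
```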
Since the 1990s, vector space models have been used in distributional semantics. In a paper on the history of vector space models, which examines the trajectory of Gerard Salton’s work, David Dubin notes that these mathematical models can be defined as “a consistent mathematical structure designed to correspond to some physical, biological, social, psychological, or conceptual entity” (2004). In the case of word vectors, word context and collocations give us quantifiable information about a word’s meaning.

But research in cognitive science has long questioned spatial representations of linguistic similarity because they don’t align with important aspects of human semantic processing (Tversky 1977). Tversky shows, for example, that people’s interpretation of semantic similarity does not always obey the triangle inequality, i.e., the words w1 and w3 are not necessarily similar when both pairs of (w1, w2) and (w2, w3) are similar. While “asteroid” is very similar to “belt” and “belt” is very similar to “buckle”, “asteroid” and “buckle” are not similar (Griffiths et al. 2007). One reason this violation arises is because a word is represented as a single vector even when it has multiple meanings. This has led to research that attempts new methods to capture different senses of words in embedding applications. In a paper surveying techniques for differentiating words at the “sense” level, Jose Camacho-Collados and Mohammad Taher Pilehvar show that these efforts fall in two camps: “Unsupervised models directly learn word senses from text corpora, while knowledge-based techniques exploit the sense inventories of lexical resources as their main source for representing meanings” (2018, 744).

The first method, an unsupervised model, induces different meanings of a word—it is trained to analyze and represent each word sense based on statistical knowledge derived from the contexts within a corpus. The second method for disambiguation relies on information contained in other databases or sources. WordNet, for instance, associates multiple words with concepts, providing a sense inventory for terms. It is made up of synsets, which represent unique concepts that can be expressed through nouns, verbs, adjectives or adverbs. The synset of a concept such as “a business where patrons can purchase coffee and use WiFi” might be “cafe, coffeeshop, internet cafe,” etc. Camacho-Collados and Pilehvar review different ways to process word embedding results using WordNet and similar resources, which essentially provide synonyms that share a common meaning.
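WordNet’s sense inventory can be inspected directly; the sketch below uses NLTK’s WordNet interface, with an illustrative lookup term (the first run downloads the WordNet data):

```python
# Sketch: inspect WordNet's sense inventory for one term via NLTK.
# Assumes nltk is installed; "cafe" is an illustrative lookup word.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for synset in wn.synsets("cafe"):
    # Each synset is one concept, with the lemmas that can express it.
    print(synset.name(), synset.lemma_names(), "-", synset.definition())
```

Post-processing approaches of the kind Camacho-Collados and Pilehvar survey use inventories like this to split a single word vector into per-sense representations.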
There exists a relationship between work that addresses word disambiguation and work that addresses the biases that word vector algorithms produce. Just as researchers can modify general word embedding models to capture a word’s multiple meanings, they can also modify them according to a word’s usage over time. These evolving methods begin to account for the social, historical, and psychological dimensions of language. If one can show that applying word embedding algorithms to diachronic corpora or corpora of different domains produces different biases, this would suggest that nuanced shifts in vocabulary and word usage can inform data curation practices that seek to isolate and remove historical bias from other word embedding models. Biases, one might say, persist despite contextual changes. Or, one might say that the shortcomings of word embeddings don’t account for changes in bias that are present in context.

This is where the domain expertise of literary scholars also becomes essential. Historians’ domain expertise and natural interest in comparative corpora (from different time periods or containing different types of documents) positions them to curate datasets that attend to both data ethics and computational innovation. Such work could have impact beyond historical research, and result in data-level corrections to biases that emerge in more general-purpose embedding applications. This could be more effective and reproducible than correcting them superficially (Gonen and Goldberg 2019). For instance, if novel cultural biases can be traced to an origin period, texts from that period could constitute a sub-corpus. Embedding models specific to that corpus might be subtracted from the vectors generated from a broader dataset.
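What a “superficial” correction looks like in practice is easy to sketch: the projection-style debiasing that Gonen and Goldberg critique removes a single bias direction from each vector. The toy vectors below are made up, and the point is the arithmetic rather than the adequacy of the method; the sub-corpus subtraction proposed above would additionally require aligning two independently trained models before any such vector arithmetic is meaningful.

```python
# Toy sketch of projection-style debiasing (the kind of correction Gonen
# and Goldberg argue is superficial). All vectors here are made up.
import numpy as np

vectors = {
    "he":    np.array([0.9, 0.1, 0.2]),
    "she":   np.array([0.1, 0.9, 0.2]),
    "nurse": np.array([0.2, 0.8, 0.5]),
}

# A crude "gender direction": the normalized difference of two anchor vectors.
direction = vectors["he"] - vectors["she"]
direction = direction / np.linalg.norm(direction)

def remove_component(v, d):
    """Subtract v's projection onto direction d, leaving v orthogonal to d."""
    return v - np.dot(v, d) * d

debiased_nurse = remove_component(vectors["nurse"], direction)
print(np.dot(debiased_nurse, direction))  # ~0.0: component along d removed
```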
Examining a methodology’s history is an essential way in which scholars can strengthen the validity of computationally-driven research and its integration into literary departments—this type of scholarship reconstitutes literary insights after the risky move of flattening literary texts with the rigor of machines. But as Lauren Klein (2019) and others reveal, scholars have begun to apply interpretation and imagination in both the computational and the “close reading” aspects of their research. This reinforces that computational shifts in the study of literature are more than just the adoption of useful tools for the sake of locating a novel pattern in data. An increasingly important branch of digital literary research demonstrates the efficacy of engaging the interdisciplinary complexity of computational tools in relation to the complexity of literary analysis.

New ideas for close readings and analysis can serve as windows into defining secondary computational research questions that emerge from an initial statistical exploration. As in the work reviewed by Camacho-Collados and Pilehvar, outside knowledge of word senses can be used for post-processing word embeddings in ways that address theoretical issues. Implementing this type of process for humanities research, one might begin with the question: can I generate word vector models that attend to both author gender and word context if I train them in innovative ways? Does this require a corpus of male authors and one of female authors? Or would this be better accomplished with an outside lexical source that has already associated word senses with genders?

Multi-disciplinary scholars are experimenting with a variety of methods to use word vector algorithms to track semantic complexities, and humanities researchers need an awareness of the technical innovations across a range of these disciplines because they are in a position to bring important domain knowledge to these efforts. Ideally, the questions that unite these disciplinary efforts might be: how do we make word contexts and distributional semantics more useful for both historians, who need reproducible results that lead to new interpretation, and technologists, who need historical interpretation to play a larger role in language generalization? Modeling language histories depends on how deeply humanists can understand word embedding models, so that they can compensate for the models’ inherent shortcomings. Cross-disciplinary collaborations help scholars return to fundamental issues that arise when we treat words as data, and help bring more cohesive methodological standards to language modeling.

New directions in cross-disciplinary machine learning frameworks

Literary scholars set up computational inquiries with attention to cultural complexity, and seek out instances of language that convey historical context. So while they aren’t likely to lead the charge in correcting fundamental shortcomings of language representation algorithms, they can increasingly impact social assessments of those algorithms, provide methodologies for those algorithms to locate anomalies in language usage, and assess whether those algorithms embody socially just practices (D’Ignazio and Klein 2020). Some literary scholars also critique the non-neutral ideologies that are in place in both computing and the humanities (Rhody 2017, 660). These efforts not only make the field of literary studies (and its history) more relevant to a digitally and computationally-driven future, but also help literary scholars create meaningful intersections between their computational tools and theoretical training. That training includes frameworks for reading and analysis that computers cannot yet perform, but should aspire to—from close reading, Semiotic Criticism, and Formalism to Post-structuralism, Cultural Studies, and Feminist Theory. The varied systems literary scholars have developed for thinking about signs, words, and symbols should not be seen as irreconcilable with computational tools for text analysis. Instead, they should become the foundation for new methodologies that tackle the shortcomings of machine learning algorithms and project future directions for text analysis.

Linguists and scientists interested in natural language processing have often looked to the humanities for methods that assign rules to the production of meaning. Such methods exist within the history of literary criticism, some of which are being newly explored as concepts for language modeling algorithms. For instance, data curation takes inspiration from cultural studies, which empowers literary scholars to correct for bias and underrepresentation in literature by expanding the canon. Subsequent literary findings from that research need not only be literary ones: they have the potential to serve as models for best practices for computational tools and datasets more broadly. While the rift between society’s most progressive ideas and its technological advancement is not unique to the rise of machine learning, practical opportunities exist to repair the rift with a blend of literary criticism and computational skills, and there are many recent examples[14] of the growing importance of combining rich technical explanations, interdisciplinary theories, and original computational work in corpus linguistics and beyond.

[14] See Whitt 2018 for a state-of-the-art overview of the intersecting fields of corpus linguistics, historical linguistics, and genre-based studies of language usage.
A desire to wield social and computational concerns simultaneously is evident also in recent work in Linguistics,[15] Sociology,[16] and History.[17]

[15] A special issue in the journal Language from the Linguistic Society of America published responses to a call to reconcile the unproductive rift between generative linguistics and neural network models. Christopher Potts’s response (2019) argues that integrating deep learning with traditional linguistic semantics is imperative.
[16] Sociologist Laura K. Nelson (2020) calls for a three-step methodological framework called computational grounded theory.
[17] Another special issue, this one from Isis, a journal from the History of Science Society, suggests that “the history of knowledge can act as a bridge between the world of the humanities, with its tradition of close reading and detailed understanding of individual cases, and the world of big data and computational analysis” (Laubichler, Maienschein, and Renn 2019, 502).

Studies in computational Sociology by Laura K. Nelson, Austin C. Kozlowski, Matt Taddy, James A. Evans, Peter McMahan, and Kenneth Benoit contain important parallels for machine learning-driven text analysis. Nelson, for instance, calls for a new three-step methodology for computational sociology, one that “combines expert human knowledge and hermeneutic skills with the processing power and pattern recognition of computers, producing a more methodologically rigorous but interpretive approach to content analysis” (2020, 1). She describes a framework that can aid in reproducibility, which was noted as a problem by Da. Kozlowski, Taddy, and Evans, who study relationships between attention and knowledge, use a vector space model to analyze a century of books in a September 2019 paper on the “geometry of culture.” They show “that the markers of class continuously shifted amidst the economic transformations of the twentieth century, yet the basic cultural dimensions of class remained remarkably stable. The notable exception is education, which became tightly linked to affluence independent of its association with cultivated taste” (1). This implies that disciplinary expertise can be used to isolate sub-corpora for use in secondary word embedding research problems. Resulting word similarity findings could aid in both validating the initial research finding and defining domain-specific datasets that are reusable for future research.

The idea of using humanities methodologies to inform model architectures for machine learning is part of a wider history of computational scientists drawing inspiration from other fields to make AI systems better. Designing humanities research with novel word embedding models stands to widen the territory where machine learning engineers look for conceptual frameworks to inspire strategies for improving the performance of artificial language understanding. Many computer scientists are investigating the figurative (Gagliano et al. 2019) and the metaphorical (Mao et al. 2018) in language. As machines get better at reading and interpreting texts, literary studies and theories will become more applicable to how those machines are programmed to look at multiple layers and dimensions of language.
Ted Underwood, Andrew Piper, Katherine Bode, James Dobson, and others make connections between computational literary research and social dimensions of the history of vector space model research. Since vector models are based on the 1950s linguistic notion of similarity (Firth 1957), researchers working to show superior algorithmic performance focus on different aspects of why similarity is important than do researchers seeking cultural insights within their data. But Underwood points out that a word vector can also be seen as a way to quantitatively account for more aspects of meaning (2019). Already, cross-disciplinary scholarship draws on computational linguistics,[18] information science,[19] and semantic linguistics, and the imperative to understand concepts from all of these fields is growing. As better methods are developed for using word embeddings to better understand texts from different domains and time periods, more sophisticated tools and paradigms emerge that echo the complexity of traditional literary and historical interpretation.

[18] Linguistics scholars are also adopting computational models to make progress with theories related to semantic similarity. For instance, see Potts 2019.
[19] See Lin 1998, for example.

Systematic data curation, combined with word embedding algorithms, represents a new interpretive system for literary scholars. The potential of machine learning methods for text analysis goes beyond historical literary text analysis, and the methods for literary text analysis using machine learning also go beyond literature departments. The corpora scholars model and the way they frame their research questions reframe the potential to use systems like word vectors to understand aspects of historical language, and could have broader ramifications for how other applications model word meanings. Because such literary research generates novel frameworks for using machine learning to represent language, it’s imperative to explore the question: Are there ways that humanities methodologies and research goals can exert greater influence in the computational sciences, make the history of literary studies more relevant in the evolution of machine learning techniques, and better serve our shared social values?

References

Algee-Hewitt, Mark. 2015. “The Order of Poetry: Information, Aesthetics and Jakobson’s Theory of Literary Communication.” Presented at the Russian Formalism & the Digital Humanities Conference, April 13, Stanford University, Palo Alto, CA. https://digitalhumanities.stanford.edu/russian-formalism-digital-humanities.

Algee-Hewitt, Mark. 2019. “Criticism, Augmented.” In the Moment (blog). April 1, 2019. https://critinq.wordpress.com/2019/04/01/computational-literary-studies-participant-forum-responses/.

Allen, Carl, and Timothy Hospedales. 2019. “Analogies Explained: Towards Understanding Word Embeddings.” In International Conference on Machine Learning, 223–31. PMLR. http://proceedings.mlr.press/v97/allen19a.html.

Argamon, Shlomo, and Mark Olsen. 2009.
“Words, Patterns and Documents: Experiments in Machine Learning and Text Analysis.” Digital Humanities Quarterly 3 (2). http://www.digitalhumanities.org/dhq/vol/3/2/000041/000041.html.

Bode, Katherine. 2020. “Why You Can’t Model Away Bias.” Modern Language Quarterly 81 (1): 95–124. https://doi.org/10.1215/00267929-7933102.

Buurma, Rachel Sagner, and Laura Heffernan. 2018. “Search and Replace: Josephine Miles and the Origins of Distant Reading.” Modernism/Modernity Print+ 3, Cycle 1 (April). https://modernismmodernity.org/forums/posts/search-and-replace.

Camacho-Collados, Jose, and Mohammad Taher Pilehvar. 2018. “From Word To Sense Embeddings: A Survey on Vector Representations of Meaning.” Journal of Artificial Intelligence Research 63 (December): 743–88. https://doi.org/10.1613/jair.1.11259.

Chander, Manu Samriti. 2017. Brown Romantics: Poetry and Nationalism in the Global Nineteenth Century. Lewisburg, PA: Bucknell University Press.

Critical Inquiry. 2019. “Computational Literary Studies: A Critical Inquiry Online Forum.” In the Moment (blog). March 31, 2019. https://critinq.wordpress.com/2019/03/31/computational-literary-studies-a-critical-inquiry-online-forum/.

Da, Nan Z. 2019. “The Computational Case against Computational Literary Studies.” Critical Inquiry 45 (3): 601–39. https://doi.org/10.1086/702594.

D’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Cambridge: MIT Press.

Douglas, Samantha, Dan Dirilo, Taylor-Dawn Francis, Keith Giles, and Marisa Plumb. n.d. “The Bengal Annual: A Digital Exploration of Non-Canonical British Romantic Literature.” https://scalar.usc.edu/works/the-bengal-annual/index.

Dubin, David. 2004. “The Most Influential Paper Gerard Salton Never Wrote.” Library Trends 52 (4): 748–64. https://www.ideals.illinois.edu/bitstream/handle/2142/1697/Dubin748764.pdf?sequence=2.

Firth, J.R. 1957. “A Synopsis of Linguistic Theory.” In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.

Gagliano, Andrea, Emily Paul, Kyle Booten, and Marti A. Hearst. 2019. “Intersecting Word Vectors to Take Figurative Language to New Heights.” In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, 20–31. San Diego, California: Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-0203.

Gavin, Michael, Collin Jennings, Lauren Kersey, and Brad Pasanek. 2019. “Spaces of Meaning: Conceptual History, Vector Semantics, and Close Reading.” In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, 243–267. Minneapolis: University of Minnesota Press.

Goldstone, Andrew. 2019. “Teaching Quantitative Methods: What Makes It Hard (in Literary Studies).” In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press. https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/620caf9f-08a8-485e-a496-51400296ebcd#ch19.

Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them.” ArXiv:1903.03862, September. https://arxiv.org/abs/1903.03862.

Griffiths, Thomas L., Mark Steyvers, and Joshua B. Tenenbaum. 2007. “Topics in Semantic Representation.” Psychological Review 114 (2): 211–44. https://doi.org/10.1037/0033-295X.114.2.211.
Harris, Katherine D. 2015. Forget Me Not: The Rise of the British Literary Annual, 1823–1835. Athens: Ohio University Press.

Harris, Katherine D. 2019. “The Bengal Annual and #bigger6.” Keats-Shelley Journal 68: 117–18. https://muse.jhu.edu/article/771132.

Kirschenbaum, Matthew. 2007. “The Remaking of Reading: Data Mining and the Digital Humanities.” Presented at the National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation, Baltimore, MD, October 11. https://www.csee.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf.

Klein, Lauren F. 2019. “What the New Computational Rigor Should Be.” In the Moment (blog). April 1, 2019. https://critinq.wordpress.com/2019/04/01/computational-literary-studies-participant-forum-responses-5/.

Koehrsen, Will. 2018. “Neural Network Embeddings Explained.” Towards Data Science, October 2, 2018. https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526.

Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84 (5): 905–949. https://doi.org/10.1177/0003122419877135.

Kutuzov, Andrey, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. “Diachronic word embeddings and semantic shifts: a survey.” In Proceedings of the 27th International Conference on Computational Linguistics, 1384–1397. Santa Fe, New Mexico: Association for Computational Linguistics. https://www.aclweb.org/anthology/C18-1117.

Laubichler, Manfred D., Jane Maienschein, and Jürgen Renn. 2019. “Computational History of Knowledge: Challenges and Opportunities.” Isis 110 (3): 502–512.

Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity.” In Proceedings of the Fifteenth International Conference on Machine Learning, 296–304. San Francisco, California: Morgan Kaufmann Publishers Inc.

Mao, Rui, Chenghua Lin, and Frank Guerin. 2018.
“Word Embedding and WordNet Based Metaphor Identification and Interpretation.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1222–31. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1113.

Nelson, Laura K. 2020. “Computational Grounded Theory: A Methodological Framework.” Sociological Methods & Research 49 (1): 3–42. https://doi.org/10.1177/0049124117729703.

Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: New York University Press.

Poole, Alex H. 2013. “Now Is the Future Now? The Urgency of Digital Curation in the Digital Humanities.” Digital Humanities Quarterly 7 (2). http://www.digitalhumanities.org/dhq/vol/7/2/000163/000163.html.

Potts, Christopher. 2019. “A Case for Deep Learning in Semantics: Response to Pater.” Language 95 (1): e115–24. https://doi.org/10.1353/lan.2019.0019.

Rhody, Lisa. 2017. “Beyond Darwinian Distance: Situating Distant Reading in a Feminist Ut Pictura Poesis Tradition.” PMLA 132 (3): 659–667.

Risam, Roopika. 2018. “Decolonizing the Digital Humanities in Theory and Practice.” In The Routledge Companion to Media Studies and Digital Humanities, edited by Jentery Sayers, 78–86. New York: Routledge.

Roh, Yuji, Geon Heo, and Steven Euijong Whang. 2019. “A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective.” IEEE Transactions on Knowledge and Data Engineering, Early Access: 1–20. https://doi.org/10.1109/TKDE.2019.2946162.

Tversky, Amos. 1977. “Features of Similarity.” Psychological Review 84 (4): 327–52. https://doi.org/10.1037/0033-295X.84.4.327.

Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.

Whitt, Richard J., ed. 2018. Diachronic Corpora, Genre, and Language Change. John Benjamins Publishing Company.