Book Reviews and the Consolidation of Genre

Kent Chang, Yuerong Hu, Wenyi Shang, Aniruddha Sharma, Shubhangi Singhal, Ted Underwood, Jessica Witte, Peizhen Wu

A paper presented at the virtual panel "Cultural Analytics and the Book Review: Models, Methods and Corpora," ADHO 2020, July 22, 2020.

Introduction

Book reviews clearly cast new light on reception: on literary judgment, for instance, and prestige. But reviews may also give us an opportunity to test claims about the significance of patterns in the reviewed books themselves. For instance, literary scholars have recently claimed that predictive models can measure the strength of the boundaries that separate different cultural categories—different genres of fiction, say, or market segments.[1] But the evidence supporting this argument comes purely from the texts themselves. The works in a particular literary genre may be relatively easy (or hard) to distinguish from others, because they possess (or lack) a distinctive diction. Interpreting this textual boundary as evidence about the strength of a cultural distinction has seemed questionable to many readers.[2] One can imagine cultural categories that would be salient and distinctive for human readers even though they don't leave the kind of traces that can be captured in a model of word frequency.

[1] Dan Sinykin, "How Capitalism Changed American Literature," Public Books, July 17, 2019, https://www.publicbooks.org/how-capitalism-changed-american-literature/. Richard Jean So and Edwin Roland, "Race and Distant Reading," PMLA 135.1 (Jan 2020): 59-73.
[2] This is, for instance, one of the questions raised by Nan Z. Da, "The Computational Case against Computational Literary Studies," Critical Inquiry 45 (Spring 2019): 601-39.

So do textual models really tell us anything about the boundaries between cultural categories? It is hard to resolve this question with a single experiment, because it isn't immediately clear what counts as ground truth about the strength of a cultural boundary. José Calvo Tello has compared predictive accuracy to the level of human consensus about different genres (expressed, for instance, in bibliographies).[3] Ted Underwood has compared predictive accuracy to the degree of overlap or separation between genres. (That is, we might expect pairs of genre labels that are often assigned to the same works to be closer to each other than those that rarely overlap.)[4] Both studies suggest that the accuracy of a textual model does correlate with the behavior of human observers. But both studies are still open to the objection that they rely purely on explicit labels. This could produce a subtle kind of false confirmation. Perhaps the conscious labeling behavior of bibliographers and catalogers is governed by categories overtly signaled in the diction of a literary work—but ordinary readers care more, in practice, about other categories, less clearly registered in diction?

[3] José Calvo Tello, "Genre Classification in Spanish Novels: A Hard Task for Humans and Machines?" European Association for Digital Humanities 2018, https://eadh2018.exordo.com/programme/presentation/82.
[4] Ted Underwood, "The Historical Significance of Textual Distances," Proceedings of the Second Joint Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Santa Fe, 2018, https://www.aclweb.org/anthology/W18-4507/.

Book reviews give us a way to address this remaining source of doubt. Reviewers may or may not explicitly assign books to a genre: in the nineteenth- and early-twentieth-century period we will discuss here, explicit genre categorization is unusual. But reviews do presumably reflect the tacit concepts and categories that organize the landscape of fiction for a particular reader.
It seems likely that books with similar reviews were perceived in similar ways. So if the textual boundaries between groups of literary works really do correlate with the responses of ordinary readers, reviews of those texts ought to reveal the same groupings and distinctions. That is the primary hypothesis we set out to test in this paper. Are the subject or genre categories most strongly marked in fiction also the categories most strongly marked in reviews of fiction?

Data

To construct a corpus of paired literary texts and book reviews, we aligned extracted features from the HathiTrust Research Center with book reviews from ProQuest's British Periodicals collection, matching on both the author and the title of the original work.[5] We also used predictive modeling to filter the book reviews for reviews of fiction. Review metadata is imperfect, and title matches are often ambiguous, so without this filtering step it would have been difficult to have confidence that we were really pairing books of fiction with their reviews.

[5] Boris Capitanu, Ted Underwood, Peter Organisciak, Timothy Cole, Maria Janina Sarol, and J. Stephen Downie, The HathiTrust Research Center Extracted Feature Dataset (1.0) [Dataset], HathiTrust Research Center, 2016, http://dx.doi.org/10.13012/J8X63JT3.

When the filtering process was complete, we had 9,137 pairs of books and reviews. The books' dates of first publication extend from 1535 to 1950, but the vast majority (more than 9,000) were published after 1800, and about 8,000 after 1850. The reviews date from 1800 to 1950. Most were published very shortly after the book in question, although we do have a few nineteenth-century reviews of Don Quixote. When a book had multiple volumes, we aggregated the texts; when we had multiple reviews of the same book, we also aggregated the reviews to produce a single composite review-text. Word counts for the books and reviews are available through a GitHub repository documenting the project.[6]

[6] Metadata for the books used in this experiment is available at our GitHub repository, https://github.com/tedunderwood/reviews/tree/master/bpo/corexperiment. Metadata for the reviews (and word counts for both the books and reviews) is available at the supporting Open Science Framework site: https://osf.io/a3749/.

Methods

Our overall hypothesis was (generally) that similar books will have similar reviews and (more specifically) that categories of books with closely knit textual similarity will also have reviews that resemble each other closely. We preregistered an initial plan to test this hypothesis in two ways: using supervised predictive models (which have worked well for this problem in the past, but require relatively large groups of works), and using Word Mover's Distance.[7] Distance metrics are easy to apply to small groups of texts, and we hoped a distance-measuring approach would allow us to explore this question across a wider range of genres, including genres with few examples. While other distance metrics are more familiar, we thought Word Mover's Distance might be preferable for short texts like reviews, since it uses word embeddings rather than one-hot encoding and thereby produces a less sparse feature space.

[7] Yuerong Hu, Wenyi Shang, and William E. Underwood, "Book Reviews and the Consolidation of Genre: First Registration," Open Science Framework, October 9, 2019, https://osf.io/j2ycz. Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger, "From Word Embeddings to Document Distances," Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015.

We did find that our preregistered hypotheses were confirmed using Word Mover's Distance. For instance, to take the simplest example, we measured WMD between 500 random pairs of books and the corresponding pairs of reviews. We found a statistically significant relationship between the two measures (r = .13, p < .01).
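To make that pairwise measurement concrete, the following is a minimal sketch, not the project's actual code. It assumes (beyond anything stated above) that books and reviews are parallel lists of tokenized documents, aligned so that reviews[i] is the composite review of books[i], and it loads an off-the-shelf embedding model purely as a placeholder; gensim's wmdistance also requires the POT package.

```python
# Minimal sketch of the Word Mover's Distance comparison described above.
# Assumes `books` and `reviews` are parallel lists of tokenized documents
# (lists of lowercase word tokens), aligned so reviews[i] reviews books[i].
import random

import gensim.downloader as api
from scipy.stats import pearsonr


def wmd_pair_correlation(books, reviews, n_pairs=500, seed=0):
    """Correlate WMD between random pairs of books with WMD between
    the corresponding pairs of reviews."""
    vectors = api.load("glove-wiki-gigaword-100")  # placeholder embeddings
    rng = random.Random(seed)
    book_d, review_d = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(books)), 2)
        book_d.append(vectors.wmdistance(books[i], books[j]))
        review_d.append(vectors.wmdistance(reviews[i], reviews[j]))
    return pearsonr(book_d, review_d)  # (r, p) across the sampled pairs
```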
But regular cosine distance on the frequencies of the 2,500 most frequent words showed an even stronger relationship in the same sample (r = .22, p < .00001). In subsequent experiments, we also found that measuring category strength with cosine distance produced results that echoed predictive models more closely than WMD did. So we reverted to cosine distance. Since this is already the dominant distance measure in text analysis, we didn't feel there was a great risk of tailoring our methods to a particular sample or problem.[8]

[8] Instead of using tf-idf, we scale features by converting each column of the matrix to a z-score, which is equivalent to using Burrows's Delta. For empirical evidence that this distance measure works well for many problems in text analysis, see Stefan Evert et al., "Understanding and Explaining Delta Measures for Authorship Attribution," Digital Scholarship in the Humanities 32.2 (December 2017): 4-16, https://doi.org/10.1093/llc/fqx023.

As part of our preregistered hypothesis, we used metadata in contemporary libraries to define twenty-four genre or subject categories. These categories could be viewed as a source of anachronism (since they were mostly defined by librarians half a century or more after the publication of the original works). But the anachronism in question is helpfully orthogonal to the question explored here. In other words, we're not positing that these twenty-four categories are the best categories for English fiction 1800-1950, or that they precisely align with real divisions in literary culture. Instead we are asking whether the relative clarity of different categories (in the literary texts themselves) correlates with the relative clarity of the same categories (in reviews of the texts). To fully test this hypothesis, it might actually be good if some of our categories were anachronistic, and did fail to align clearly with real boundaries between literary practices.

We then tested our central hypothesis about genre in several different ways. First, we trained classifiers to distinguish the literary works in each category (and their reviews) from other works and reviews in our corpus. (We used the scikit-learn implementation of regularized logistic regression.)[9] We found that the accuracy of the book classifiers correlated with the accuracy of the review classifiers, r = .867, p < .001. In this part of the experiment, we could only use fourteen large categories, because predictive models become unstable with small training sets. So some of the categories in figure 1 are all-encompassing (e.g. "random"), or very general (works labeled "novel" or "romance" in their titles), or defined through subject headings rather than genres (e.g. works about "Britain" or "North America").

[9] Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research 12 (2011): 2825-2830.

Figure 1. Correlation between predictive models of books and of reviews.
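The details of the classification setup (negative-class sampling, regularization strength, cross-validation scheme) are not spelled out above, so the following is only a schematic sketch of one plausible implementation. It assumes that X_books and X_reviews are dense arrays of word frequencies with aligned rows, and that labels maps each of the fourteen category names to a boolean membership vector over those rows.

```python
# Sketch of the classification test: for each category, train one
# regularized logistic regression on book texts and another on the
# corresponding reviews, then correlate the two lists of accuracies.
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def category_accuracy(X, y):
    """Mean cross-validated accuracy for one category versus the rest."""
    model = make_pipeline(
        StandardScaler(),                       # z-score each word column
        LogisticRegression(C=1.0, max_iter=1000),
    )
    return cross_val_score(model, X, y, cv=5).mean()


def accuracy_correlation(X_books, X_reviews, labels):
    """Pearson correlation between book-model and review-model accuracies."""
    book_acc = [category_accuracy(X_books, y) for y in labels.values()]
    review_acc = [category_accuracy(X_reviews, y) for y in labels.values()]
    return pearsonr(book_acc, review_acc)
```

The one-vs-rest cross-validation here is only a stand-in; the original experiment may have constructed its contrast sets differently.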
But we also observed the same pattern across a larger set of categories that are closer to the groups ordinarily called "genres," using distance measurements between pairs of texts. We selected random pairs of works in the same genre and measured both the in-genre distance (between Mystery A and Mystery B) and the out-of-genre distance (e.g. from Mystery A to a randomly selected work published in the same year as Mystery B). Distances were measured as cosine distances on the 2,500 most common words, scaled using Burrows's Delta (which is in effect the StandardScaler in scikit-learn). By subtracting the in-genre distance from the out-of-genre distance for each pair, we obtained a measurement of how much closer works in each genre are to each other than to the rest of the corpus. Again, we found that closely knit genres produce closely knit groups of reviews, r = .806, p < .001.

Figure 2. Correlation between distance-differences for books and reviews. In each case a pair of books in the same category is cross-compared to a pair of books outside the category, but published in the same years.

Conclusions and future work

We conclude that the similarities and differences between texts (measured, for instance, by cosine similarity) do correlate with similarities and differences in reception—or, at any rate, in book reviews. When we look at individual pairs of books, the relationship may not be very strong; perhaps r ≈ .22. But if we back up and gather books into categories, the aggregate relationship is stronger. Closely knit genres also produce clusters of closely related reviews.

This could be to some extent a verbal accident, if book reviews were distinguished by exactly the same unusual words overrepresented in the book texts. For instance, we might imagine that references to "valor" and "surrender" would characterize war stories (as well as reviews written about them). But inspection of the most strongly marked categories in our corpus does not lead us to credit this explanation. For instance, books are likely to be folklore when they mention "fairy," "witches," or "invisible"; those are some of the strongest features in our predictive model. But reviews are likely to be about folklore when they mention "traditions," "collected," and "popular." In both contexts, folklore is marked by a distinctive diction—but it is a different diction in each context. So we suspect the coherence of these categories is not a purely verbal accident, but reflects an underlying social distinction. Books of folklore are genuinely unlike other kinds of fiction; that social distinctiveness is reflected (in different ways) both in their texts and in their reviews.

We also tested this hypothesis in several other ways. For instance, we found that the correlation between review-similarity and book-similarity holds (more weakly, r = .418) even if we use two different subsets of works in each genre: one to test the similarity of book-texts, and a different, disjoint set to test the similarity of review-texts.
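For concreteness, here is a minimal sketch of the in-genre/out-of-genre distance-difference measure described above. The inputs are assumptions rather than the project's actual data structures: X is a dense matrix of relative frequencies for the 2,500 most frequent words, already z-scored by column (Delta scaling, i.e. StandardScaler), with parallel lists of genre labels and publication years.

```python
# Sketch of the distance-difference measurement: how much closer is a work
# to another work in the same genre than to a random work from the same year?
# Assumes X is a dense, column-z-scored word-frequency matrix (one row per
# work), with parallel lists `genres` and `years`.
import random

import numpy as np
from scipy.spatial.distance import cosine


def distance_difference(X, genres, years, genre, n_pairs=100, seed=0):
    """Mean (out-of-genre distance minus in-genre distance) for one genre."""
    rng = random.Random(seed)
    members = [i for i, g in enumerate(genres) if g == genre]
    diffs = []
    for _ in range(n_pairs):
        a, b = rng.sample(members, 2)
        # a randomly chosen work outside the genre, published in b's year
        candidates = [i for i, (g, y) in enumerate(zip(genres, years))
                      if g != genre and y == years[b]]
        if not candidates:
            continue
        c = rng.choice(candidates)
        diffs.append(cosine(X[a], X[c]) - cosine(X[a], X[b]))
    return np.mean(diffs)
```

Running the same routine on the review matrix and correlating the per-genre results across genres corresponds to the comparison plotted in figure 2.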
Having validated this measurement of generic distinctiveness, we then used it, experimentally, to measure a broad structural change in fiction between 1860 and 1920. We measured the difference between in-genre and out-of-genre comparisons, and dated each pair of books to the midpoint of the two publication dates (since out-of-genre comparisons were matched precisely to the dates of the in-genre pair, both comparisons shared the same midpoint). We found that genres became more closely knit across this period: that is, works of fiction became increasingly more similar to other works in the same genre than to randomly selected works from the same publication year.

Figure 3. The consolidation of genre?

This pattern is open to different interpretations. It could reveal a process of differentiation (if we focus on the growing differences between genres)—or consolidation (if we focus on the strengthening of in-genre similarity). But since the categories we are using are drawn from late twentieth-century librarians' judgments, it could also be that works of fiction simply fit those judgments better as we move closer to the late twentieth century. To decide between these interpretations, we have subsequently repeated the experiment with categories inferred from a topic model of twentieth-century book reviews. But that's a different project and a separate paper.
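As a final illustration, here is a sketch of how the trend behind figure 3 might be computed from the pairwise measurements described above. The input format is an assumption: one (year_a, year_b, distance_difference) record per sampled comparison, and a simple linear regression on the midpoint year stands in for whatever binning or smoothing underlies the published figure.

```python
# Sketch of the trend measurement behind Figure 3: each matched comparison
# yields one distance-difference, dated to the midpoint of the two
# publication years; regress the differences on those midpoints.
from scipy.stats import linregress


def genre_consolidation_trend(pair_records, start=1860, end=1920):
    """Regress distance-difference on midpoint year to test whether
    in-genre similarity strengthens over time."""
    midpoints, diffs = [], []
    for year_a, year_b, diff in pair_records:
        mid = (year_a + year_b) / 2
        if start <= mid <= end:
            midpoints.append(mid)
            diffs.append(diff)
    # A positive slope would indicate genres growing more closely knit.
    return linregress(midpoints, diffs)
```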