key: cord-0329367-ezzajdxk
authors: Davidson, Natalie R.; Greene, Casey S.
title: Analysis of scientific journalism in Nature reveals gender and regional disparities in coverage
date: 2021-06-22
journal: bioRxiv
DOI: 10.1101/2021.06.21.449261
sha: 09fd7db87f151842d956e04bc807810d25553dec
doc_id: 329367
cord_uid: ezzajdxk

Scientific journalism is a critical way in which the public can remain informed and benefit from new scientific findings. Such journalism also shapes the public’s view of the current state of scientific findings and legitimizes experts. Those covering science can only cite and quote a limited number of sources. Sources may be identified by the journalist’s research or by recommendations by other scientists. In both cases, biases may influence who is identified and ultimately included as an expert. We analyzed 22,001 non-research articles published by Nature to quantify possible disparities. Our analysis considered three possible sources of disparity: gender, name origin, and country affiliation. To explore these sources of disparity, we extracted cited authors’ names and affiliations, as well as extracted names of quoted speakers. While citations and quotations within a piece do not reflect the entire information-gathering process, they can provide insight into the demographics of visible sources. We then used the extracted names to predict gender and name origin of the cited authors and speakers. In order to appropriately quantify the level of difference, we must identify a suitable reference set for comparison. We chose first and last authors within primary research articles in Nature and a subset of Springer Nature articles in the same time period as our comparator. In our analysis, we found a skew towards male quotation in Nature journalism-related articles, but quotation is trending toward equal representation at a faster rate than first and last authorship in academic publishing. Interestingly, we found that the gender disparity in quotes was column-dependent, with the “Career Features” column reaching gender parity. Our name origin analysis found a significant over-representation of names with predicted Celtic/English origin and under-representation of names with a predicted East Asian origin. 
This finding was observed both in extracted quotes and journal citations, but dampened in citations. Finally, we performed an analysis to identify how countries vary in the way that they’re described in scientific journalism. We focused on two groups of countries: countries that are often mentioned in articles, but do not often have affiliated authors cited, and countries that have affiliated authors that are often cited, but the country is not typically mentioned. We found that the articles in which the less cited countries occur tend to have more agricultural, extraction-related, and political terms, whereas articles including highly cited countries have broader scientific terms. This discrepancy indicates a possible lack of regional diversity in the reporting of scientific output.

Science journalism is an indispensable part of scientific communication and provides an accessible way for everyone from researchers to the public to learn about new scientific findings and to consider their implications. However, it is important to identify the ways in which its coverage may skew towards particular demographics. Coverage of science shapes who is considered a scientist and field expert by both peers and the public. This indication of legitimacy can either help recognize people who are typically overlooked due to systemic biases or intensify biases. Journalistic biases in general-interest, online and printed news have been observed by journalists themselves [1, 2, 3, 4], as well as by independent researchers [5, 6, 7, 8, 9, 10]. Researchers found a gap between male and female subjects or sources, with independent studies finding that between 17-40% of total subjects were female across multiple general-interest printed news outlets between 1985 and 2015 [5, 6, 10]. One study found 27-35% of total subjects in international science and health related news were female between 1995 and 2015, and 46% in print, radio, and television in the United States in 2015 [10]. It should be noted that scientific news coverage is confounded by the existing differences in gender and racial demographics within the scientific field [11, 12]. However, we are interested in quantifying disparities with respect to observed demographic differences in the scientific field, using academic authorship as an estimate for the existing demographics. This is similar to other studies that have quantified gender or racial disparities in science as observed in citation [13, 14] and funding rates [15, 16, 17, 18, 19].

In researching a story, a journalist will typically interview multiple sources for their opinion, potentially asking for additional sources, thus allowing individual unconscious biases at any point along the interview chain to skew scientific coverage broadly. In addition, the repeated selection of a small set of field experts or the approach a journalist takes in establishing a new source may intensify existing biases [3, 4, 20]. While disparities in representation may go unnoticed in a single article, analyzing a large corpus of articles can identify and quantify these disparities and help guide institutional and individual self-reflection. In the same vein as previous media studies [5, 6, 7, 8, 9, 10], we sought to quantify gender and regional differences of journalism beyond the existing demographic differences in the scientific field. Our study focused solely on scientific journalism, specifically content published by Nature. Since Nature also publishes primary research articles, we used these data to determine the demographics of the expected set of possible sources. For clarity, throughout this manuscript we will refer to journalistic articles as news and academic, primary research articles as papers. Furthermore, when we refer to "authors" we mean authors of academic papers, not journalists; this work did not scrape any journalists' names, nor derive any insights about individual journalists. In our analysis, we identified quoted and cited people by analyzing the content and citations within all news articles from 2005 to 2020, and compared this demographic to the academic publishing demographic by analyzing first and last authorship statistics across all Nature papers during the same time period.

Through our analysis of 22,001 news articles, we were able to identify >100,000 quotes and >8,000 citations with sufficient speaker or author information. We also identified first and last authors of >13,000 Nature papers. We then identified possible gender or regional differences using the extracted names. The extracted names were used to generate three data-types: quoted, mentioned, and cited people. We used computational methods to predict gender and identified a trend towards quotes from people predicted male in news articles when compared to both the general population and predicted male authorship in papers. Within the period that we examined, the proportion of predicted male attributed quotes in news articles went from initially higher to currently lower than the proportion of male first and last authors in Nature papers. Furthermore, we found that the quote difference was dependent on article type; the "Career Feature" column achieved gender parity in quoted speakers. We also used computational methods to predict name origins and found a significant over-representation of names with predicted Celtic/English origin and under-representation of names with a predicted East Asian origin in both quotes and citations.

While we focused on news from Nature, our software can be repurposed to analyze other text. We hope that publishers will welcome systems to identify disparities and use them to improve representation in journalism. Furthermore, our approach is limited by the features we were able to extract, which only reflects a portion of the journalistic process. Journalists could additionally track all sources they contact to self-audit. However, auditing is only part of the solution; journalists and source recommenders must also change their source gathering patterns. To help change these patterns, there exist guides [20], databases [21], and affinity groups [20] that can help us all expand our vision of who can be a field expert.

We scraped all text and metadata using the web-crawling framework Scrapy [23] (version 2.4.1). We created three independent Scrapy web spiders to process the news text, news citations, and paper metadata. News articles were defined as all articles from 2005 to 2020 that were designated as "News", "News Feature", "Career Feature", "Technology Feature", or "Toolbox". Using the spider "target_year_crawl.py", we scraped the title and main text from all news articles. We character-normalized the main text by mapping visually identical Unicode codepoints to a single Unicode codepoint and stripping all non-Unicode characters. Using an additional spider defined in "doi_crawl.py", we scraped all citations within news articles. For simplicity, this spider only considered citations with a DOI included in either the text or a hyperlink. Other possible forms of citation, e.g., titles, were not included. The DOIs were then queried using the Springer Nature API. The spider "article_author_crawl.py" scraped all articles designated "Article" or "Letters" from 2005 to 2020. We only scraped author names, author positions, and associated affiliations from research articles, which we refer to as papers. It should be noted that "News" article designations changed over time.
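
The character normalization step could look roughly like the sketch below. It is only an illustration: the function name `normalize_text` is hypothetical, and the use of NFKC compatibility normalization is an assumption, since the exact mapping applied to the scraped text is not specified here.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold visually equivalent codepoints together and drop stray characters.

    Assumption: NFKC stands in for the codepoint mapping described in the
    text; the actual pipeline may have used a different mapping.
    """
    # NFKC replaces compatibility characters (ligatures, full-width
    # letters, and similar) with their plain equivalents.
    folded = unicodedata.normalize("NFKC", text)
    # Remove control, format, and unassigned characters (Unicode general
    # categories starting with "C"), but keep newlines and tabs.
    return "".join(
        ch for ch in folded
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
```

Note that NFKC does not touch typographic quotation marks, so quote detection downstream would still see the original quote characters.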

After the news articles were scraped, the text was processed using the coreNLP pipeline [24] (version 4.2.0). The main purpose of using coreNLP was to identify named entities related to countries and quoted speakers. The full set of annotators was: tokenize, ssplit, pos, lemma, ner, parse, coref, quote. We used the "statistical" algorithm to perform coreference resolution. All results were output in JSON format for downstream processing.

Springer Nature was chosen over other publishers for multiple reasons: 1) it is a large publisher, second only to Elsevier; 2) it covers multiple subjects, in contrast to PubMed; 3) its API has a large daily query limit (5000/day); and 4) it provided more author affiliation information than found in Elsevier. We generated a comparative background set for supplemental analysis with the Springer Nature API by obtaining author information for papers cited in news articles. We selected a random set of papers to generate the Springer Nature background set. These papers were the first 200 English language "Journal" papers returned by the Springer Nature API for each month, resulting in 2400 papers per year for 2005 through 2020. To obtain the author information for the cited papers, we queried the Springer Nature API using the scraped DOI. For both API query types, the author names, positions, and affiliations for each publication were stored and are available in "all_author_country.tsv" and "all_author_fullname.tsv".

To identify the gender of a quoted or mentioned person, we first attempt to identify the person's full name. Even though genderizeR only uses the first name to make the gender prediction, identifying the full name gives us greater confidence that we are using the first name. To identify the full name, we take the speaker predicted by coreNLP and match it to the longest matching name within the same article. We match names by finding the longest mentioned name in the article with minimal edit (Levenshtein) distance. The name with the smallest edit distance, where character deletions have zero cost, is defined as the matching name. Character deletion was assigned a zero cost because we would like exact substring matches. For example, the distance between John and John Steinberg is 10 when character deletions are charged and 0 when they are free. By comparison, the distance between John and Jane Doe is 7 when character deletions are charged and 2 when they are free. If we are still unable to find a full name, or if coreNLP cannot identify a speaker at all, we also determine whether or not coreNLP linked a gendered pronoun to the quote. If so, we predict that the gender of the speaker is the gender of the pronoun. We ignore all quotes with no name or partial names and no associated pronouns. A summary of processed gender predictions of quotes at each point of processing is provided in Table 1.
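
The name-matching rule above can be illustrated with a small edit-distance sketch. The function names `edit_distance` and `best_full_name` are hypothetical, and the tie-breaking rule (prefer the longer candidate) is an assumption based on the description of choosing the longest matching name. Deletions are counted on the candidate full name, so a candidate that contains the attributed speaker name as a substring costs nothing.

```python
def edit_distance(candidate: str, speaker: str, deletion_cost: int = 0) -> int:
    """Cost of turning `candidate` (a name mentioned in the article) into
    `speaker` (the name attributed to the quote).

    Insertions and substitutions cost 1; deleting a character from the
    candidate costs `deletion_cost` (0 reproduces the free-deletion rule).
    """
    rows, cols = len(candidate) + 1, len(speaker) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * deletion_cost  # delete every candidate character
    for j in range(1, cols):
        dp[0][j] = j  # insert every speaker character
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dp[i - 1][j - 1] + (candidate[i - 1] != speaker[j - 1])
            deletion = dp[i - 1][j] + deletion_cost
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[-1][-1]


def best_full_name(speaker: str, candidates: list[str]) -> str:
    """Pick the candidate with the smallest free-deletion distance.

    Assumption: ties are broken in favor of the longer candidate.
    """
    return min(candidates, key=lambda name: (edit_distance(name, speaker), -len(name)))
```

With these definitions, `edit_distance("John Steinberg", "John", deletion_cost=1)` and `edit_distance("Jane Doe", "John", deletion_cost=1)` give the charged-deletion distances from the example, and the default `deletion_cost=0` gives the free-deletion ones.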

Because we separate first and last authors, we only considered papers with more than one author. As for quotes, we needed to extract the first name of the authors. We cast names to lowercase and processed them using the R package humaniformat [25]. humaniformat identifies whether names are reversed (Lastname, Firstname) and also identifies middle names. This processing was not required for quote prediction because names written in news articles did not appear to be reversed or abbreviated. Since many last or first authorships may be non-names, we additionally filtered out any identified names if they partially or fully matched any of the following terms: "consortium", "group", "initiative", "team", "collab", "committee", "center", "program", "author", or "institute". Furthermore, since many papers only contain first name initials (for example, "N. Davidson"), we removed any names shorter than four characters (length includes punctuation) that contain a "." or "-", then stripped all periods from the first name. This ensures that hyphenated names are not changed, e.g., Julia-Louise remains unchanged, but removes hyphenated initials, e.g., J-L. Finally, we only considered remaining first names of more than two characters. This eliminates names in which the first and middle initials are written jointly. For example, "NR Davidson" would be reduced to "Davidson" and then eliminated due to the lack of a first name. A summary of processed author gender predictions at each point of processing is provided in Tables 2-4.
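
The filtering rules above might be sketched as follows. This is an approximation under stated assumptions: the function name `clean_first_name` is hypothetical, the first name is assumed to have already been separated from the rest of the name (the step performed with humaniformat), and the non-name terms are checked against the first name only.

```python
NON_NAME_TERMS = (
    "consortium", "group", "initiative", "team", "collab",
    "committee", "center", "program", "author", "institute",
)


def clean_first_name(first_name: str):
    """Return a lowercase first name usable for gender prediction, or None.

    Assumption: `first_name` is already split off from the full name.
    """
    name = first_name.lower().strip()
    # Drop entries that partially or fully match a non-name term.
    if any(term in name for term in NON_NAME_TERMS):
        return None
    # Drop short entries containing "." or "-", which are likely initials.
    if len(name) < 4 and ("." in name or "-" in name):
        return None
    # Strip periods, leaving hyphenated names untouched.
    name = name.replace(".", "")
    # Keep only names longer than two characters (removes joint initials).
    if len(name) <= 2:
        return None
    return name
```

For instance, `clean_first_name("Julia-Louise")` keeps the hyphenated name, while `"N."`, `"J-L"`, and `"NR"` are all rejected.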

In contrast to the gender prediction, we require the entire name in all steps of name origin prediction. For names identified in the Nature news articles, we use the same process as described for the gender prediction; we again try to identify the full name. For author names, we process the names as previously described for the gender prediction of authors. For all names, we only consider them in our analyses if they consist of two distinct parts separated by a space. Additionally, if a full name is less than three characters, we were unable to consider it as the prediction model that we apply uses 3-mers. A summary of processed name origin predictions of quotes and citations at each point of processing is provided in Tables 1-4.

The quote extraction and attribution annotator from the coreNLP pipeline was employed to identify quotes and their associated speakers in the article text. In some cases, coreNLP could not identify an associated speaker's name but instead assigned a gendered pronoun. In these instances, we used the gender of the pronoun for the analysis. The R package genderizeR [26], a wrapper for the genderize.io API [27], predicted the gender of authors and speakers. We predicted a name as male using the first name with a minimum cutoff of 50%. To reduce the number of queries made to genderize.io, a previously cached gender prediction from [28] was also used and can be found in the file "genderize.tsv". All first name predictions from this analysis are in the file "genderize_update.tsv". To estimate the gender gap for the quote gender analyses, we used the proportion of total quotes, not quoted speakers. We used the proportion of quotes to measure speaker participation instead of only the diversity of speakers. The specific formulas for a single year are shown in equations 1 and 2. We did not consider any names where no prediction could be made or quotes where neither speaker nor gendered pronoun was associated.
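
As a rough illustration of the yearly quote proportion described above, the sketch below counts quotes rather than unique speakers. The function name, the input layout (one predicted male probability per quote, `None` when no prediction exists), and the choice to treat a probability exactly at the cutoff as male are all assumptions.

```python
def predicted_male_quote_fraction(male_probabilities, cutoff=0.5):
    """Fraction of quotes in one year attributed to predicted-male speakers.

    `male_probabilities` holds one entry per quote, so a speaker quoted
    three times contributes three entries. Quotes without a prediction
    (None) are excluded from both numerator and denominator.
    """
    usable = [p for p in male_probabilities if p is not None]
    if not usable:
        return None  # no quotes with a gender prediction this year
    male_quotes = sum(1 for p in usable if p >= cutoff)
    return male_quotes / len(usable)
```

The corresponding proportion for predicted-female quotes is one minus this value, since only binary predictions are available.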

We used the same quoted speakers as described in the previous section for the name origin analysis. In addition, we also consider all authors cited in a Nature news article. In contrast to the gender prediction, we need to use the full name to predict name origin. We submitted all extracted full names to Wiki-2019LSTM [28] to predict one of ten possible name origins: African, Celtic/English, East Asian, European, Greek, Hispanic, Hebrew, Arabic/Turkish/Persian, Nordic, and South Asian. While a full description of Wiki-2019LSTM is outside the scope of this paper, we describe it here briefly. Wiki-2019LSTM is trained on name and nationality pairs, using 3-mers of the characters in a name to predict a nationality. To ensure robust predictions, nationalities were grouped together as described in NamePrism [29]. NamePrism excluded the United States, Australia, and Canada from its country groupings, so these countries were also excluded during training of Wiki-2019LSTM. NamePrism justified this choice by stating that these countries had a high level of immigration. The treemap of country groupings defined by NamePrism is found in figure 5 of that publication [29].

After running the pre-trained Wiki-2019LSTM model, we select the highest probability origin for each name as the resultant assignment. Similar to the gender analyses, quote proportions were again directly compared against publication rates. For citations, quotes, and mentions, we calculated the proportion for a given year for each name origin. This is shown in equation 3, which calculates, as an example, the citation rate for last authors with a Greek name origin for a single year.
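
The assignment and yearly proportion steps can be sketched as below. The function names and the dictionary layout of the model output are assumptions made for illustration; they are not the interface of Wiki-2019LSTM.

```python
from collections import Counter


def assign_origin(probabilities: dict) -> str:
    """Return the name origin with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)


def origin_proportions(assigned_origins: list) -> dict:
    """Proportion of each assigned origin among the names for one year."""
    counts = Counter(assigned_origins)
    total = sum(counts.values())
    return {origin: n / total for origin, n in counts.items()}
```

For example, applying `origin_proportions` to the assigned origins of all cited last authors in a single year and reading off the `"Greek"` entry would give the kind of rate described above.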

We estimated the prevalence of a country's mentions by including all identified organizations, countries, states, or provinces from coreNLP's named entity annotator. We queried the resultant terms using OpenStreetMap [30] to identify the country associated with each term. All terms that were identified in the text 25 or more times were visually inspected for correctness. Hand-edited entries are denoted in the OpenStreetMap cache file "osm_cache.tsv" by the column "hand_edited"; they account for less than 5% of the total entries. Furthermore, country-associated terms identified by coreNLP may be ambiguous, causing OpenStreetMap to return incorrect locations. Therefore, we count country mentions only if we find at least two unique country-associated terms in an article. We calculate the mentioned rate as the number of country-specific mentions divided by the total number of articles for a particular year, as exemplified in equation 6, which calculates the mentioned rate for Mexico for a single year.
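
A minimal sketch of the two-term rule and the yearly mentioned rate follows. The function names and the per-article input layout (a mapping from each unique extracted term to its resolved country) are assumptions; the OpenStreetMap lookup itself is not shown.

```python
def mentioned_countries(term_to_country: dict) -> set:
    """Countries supported by at least two unique terms in one article."""
    terms_by_country = {}
    for term, country in term_to_country.items():
        terms_by_country.setdefault(country, set()).add(term)
    return {c for c, terms in terms_by_country.items() if len(terms) >= 2}


def mention_rate(articles: list, country: str) -> float:
    """Share of a year's articles in which `country` counts as mentioned."""
    if not articles:
        return 0.0
    hits = sum(1 for article in articles if country in mentioned_countries(article))
    return hits / len(articles)
```

Requiring two distinct terms trades some recall for precision: an article with a single ambiguous place name does not contribute a mention.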

To identify the citation rate of a particular country, we processed all authors' affiliations for a specific article. Since the affiliations could be in multiple formats, we again used OpenStreetMap to identify the country affiliation. Additionally, we considered all affiliations for a single author. We calculated a country's citation rate as the number of citations for that country divided by either the number of Nature papers (equation 7) or the total number of papers cited by news articles for that year (equation 8). The example calculations are for Colombia for a single year.

After calculating the citation and mention proportion for each country, we identified countries that were outliers in their comparative citation or mention rate. Outlier detection was done by taking the difference between the citation and mention rates, then identifying which countries were in the top or bottom 5% for each year. We only considered countries identified as either high citation (Set C) or high mention (Set M) across all years. We did not consider any country that was in the top 5% in one year and the bottom 5% in another. Additionally, we only considered a country if it was cited or mentioned at least five times in a single year. Once we identified the set C and set M countries, we analyzed the word frequencies in all news articles where a set C or set M country was mentioned but not cited. We expected this to provide insight into content differences between set C and set M countries. Text from articles in 2020 was not considered due to an excess of SARS-CoV-2 related terms.
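
One way to read the outlier rule is sketched below. Several details are assumptions: the function names are hypothetical, the difference is taken as citation rate minus mention rate, a country enters a set if it is an outlier in at least one year, and at least one country is selected per tail even when 5% of the countries rounds down to zero. The minimum-count filter of five citations or mentions is omitted for brevity.

```python
def yearly_outliers(diff_by_country: dict, fraction: float = 0.05):
    """Return (high-citation, high-mention) countries for one year.

    `diff_by_country` maps country -> citation rate minus mention rate.
    """
    ranked = sorted(diff_by_country, key=diff_by_country.get)
    k = max(1, int(len(ranked) * fraction))  # assumption: at least one per tail
    return set(ranked[-k:]), set(ranked[:k])


def country_sets(diffs_by_year: dict, fraction: float = 0.05):
    """Build Set C and Set M, dropping countries that switch tails."""
    set_c, set_m = set(), set()
    for diffs in diffs_by_year.values():
        high_citation, high_mention = yearly_outliers(diffs, fraction)
        set_c |= high_citation
        set_m |= high_mention
    switched = set_c & set_m  # top tail in one year, bottom tail in another
    return set_c - switched, set_m - switched
```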

Using the R package tidytext [31], we extracted tokens, removed stop words, and calculated the token frequencies across all articles. We only consider tokens in set C or M articles if the token has been observed at least 100 times across all articles. We then identify tokens that have the most significant ratio of usage between the two sets. Since there are differences in the number of articles per country within each set, we calculated a token frequency within a set as the median frequency within each country's associated articles. We calculated the resultant token ratio as the country-normalized citation frequency divided by the country-normalized mention frequency. To avoid divide-by-zero errors, a pseudocount of 1 is added to both the numerator and denominator. We require that the term be observed at least once in each set.
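
The token ratio can be sketched as follows. The country normalization is interpreted here as taking, for each set, the median of a token's per-country frequencies; that reading, the function names, and the nested-dictionary input layout are assumptions.

```python
from statistics import median


def set_frequency(token: str, freq_by_country: dict) -> float:
    """Median of a token's frequency across the countries in one set.

    `freq_by_country` maps country -> {token: frequency in that country's
    associated articles}; a missing token counts as frequency 0.
    """
    return median(freqs.get(token, 0) for freqs in freq_by_country.values())


def token_ratio(token: str, set_c_freqs: dict, set_m_freqs: dict, pseudocount: int = 1) -> float:
    """Ratio of Set C to Set M usage, with a pseudocount to avoid dividing by zero."""
    c = set_frequency(token, set_c_freqs)
    m = set_frequency(token, set_m_freqs)
    return (c + pseudocount) / (m + pseudocount)
```

Ratios above 1 indicate tokens used more in articles mentioning set C countries, and ratios below 1 indicate tokens used more in articles mentioning set M countries.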

For all analyses related to equations 1-8, we independently selected 5000 bootstrap samples for each year. Each sample was drawn with replacement and had size equal to the cardinality of the complete set of interest. Bootstrap estimates for equations 1-8 were performed by sampling the denominator set. The mean and the 5th and 95th percentiles across the estimates are reported as the estimated mean, lower bound, and upper bound. For the divergent word analysis, due to computational constraints, we only took 1000 bootstrap samples. The bootstrap estimates were taken by resampling the news articles with replacement, each time recalculating the country-normalized token frequencies within each country set (C and M). After the normalized frequencies within each country set were calculated, we calculated the ratio between country sets for each sample with a pseudocount of 1 in the numerator and denominator, (C+1)/(M+1). Again, the mean and the 5th and 95th percentiles across the estimates are reported as the estimated mean, lower bound, and upper bound.
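
The bootstrap procedure for a yearly rate could be sketched as below. The function name `bootstrap_rate`, the binary-indicator input (1 if an element of the denominator set counts toward the numerator, else 0), and the fixed seed are assumptions made to keep the example small and reproducible.

```python
import random
from statistics import mean, quantiles


def bootstrap_rate(indicators, n_boot=5000, seed=0):
    """Bootstrap a proportion; return (mean, 5th percentile, 95th percentile).

    `indicators` has one 0/1 entry per element of the denominator set,
    for example one entry per quote in a given year.
    """
    rng = random.Random(seed)
    n = len(indicators)
    estimates = []
    for _ in range(n_boot):
        # Resample the denominator set with replacement, keeping its size.
        sample = rng.choices(indicators, k=n)
        estimates.append(sum(sample) / n)
    # n=20 yields cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(estimates, n=20)
    return mean(estimates), cuts[0], cuts[-1]
```

The same pattern would apply to the divergent word analysis by resampling articles and recomputing the token ratio in place of the proportion.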

We have analyzed the text and citations of 22,001 news-related articles hosted on "www.nature.com" that span 15 years from 2005 to 2020. Our primary focus is on 16,080 articles written by journalists, which include the following five article types: "Career Feature", "News", "News Feature", "Technology Feature", and "Toolbox". "Career Feature" generally focuses on the career-related aspects of being a scientist. "News" and "News Feature" focus on current events related to science as well as new scientific findings. It should be noted that the types of articles contained in "News" changed over time, which may induce content shifts in a subset of the articles within our corpus. "Technology Feature" also covers current events and scientific findings, but additionally focuses on how science intersects with technology, such as apps, methodologies, tools, and practices. Lastly, "Toolbox" is similar to "Technology Feature", but is more centered on technology, especially the tools used to perform science. We also include one analysis of the scientist-written news articles, "Career Column" and "News and Views", as an additional set of 5,921 articles. "Career Column" is similar to "Career Feature", except that it is written not by journalists but by individuals in the scientific field. "News and Views" is similar to a review article, where a field expert writes an article relating to a recently published article within Nature.

The text and citations were then uniformly processed as depicted in Figure 1a to identify: 1) mentioned locations or organizations (light orange box), 2) quotes and quoted speakers (blue box), and 3) cited authors (green box). The extracted names from the text were used to generate three data types for downstream processing: quoted, mentioned, and cited people. A summary of frequencies for each data type at each point of processing is provided in Tables 1-4. We scraped the text using the web-crawling framework Scrapy [23], processed it, and ran it through the coreNLP pipeline (Methods). To identify country mentions, we used the following named entities as possible mentions: "organizations", "countries", "states or provinces". We then mapped the named entity to a country prediction using OpenStreetMap [30]. To identify quotes and speakers, we used the coreNLP quote extraction and attribution annotator. We performed multiple name formatting processes (Methods) to identify the speaker's full name for gender and name origin prediction. We scraped the citations using a scraper independent of the text scraper. All identified DOIs were queried using the Springer Nature API to obtain all authors' names, positions, and affiliations; however, last authors were used as the primary comparator.

Next, we determined if the quoted speakers, mentioned countries, and cited authors in news articles have a similar demographic makeup as the scientists who publish their primary research in Nature. To make this determination, we used all authors' names, positions, and affiliations of papers published by Nature over the same time period (Figure 1a, dark orange box). Again, last authors were used as the primary comparator. We obtained author metadata for 13,414 Nature papers from 2005 to 2020. To more broadly represent overall science authorship, we also separately analyzed 36,000 randomly selected Springer Nature-published papers from English-language journals over the same time. It should be noted that extracted quotes may come from multiple types of people, such as academic scientists, clinicians, the broader scientific community, politicians, and more. However, through anecdotal observation we believe that most sources come from either academic scientists or those actively involved in science. The extracted author affiliations from both data sources were mapped to a country using OpenStreetMap. Similarly, author names were uniformly processed and then used to predict both gender and name origin.

The three most frequent article types are "Research" (including "Letters" and "Articles"), "News", and "News Feature". Since Nature merged "Letters" and "Research" papers in 2019, we combined them in our analysis. We observed substantial variability in the number of Nature news articles by type between 2005 and 2020 (Figure 1b). The changing classification of article types may explain temporal changes in news articles. Over time, the frequency of "News" articles decreased; however, more specific news-related article types increased, including the introduction of the new categories "Career Feature", "Toolbox", and "Career Column".

To quantify and compare the gender demographic of quoted people and authors, we analyzed their names. While we could have analyzed the proportion of unique male speakers, we were interested in measuring the overall participation rates by gender and therefore analyzed the proportion of total quotes; for example, a single speaker may have more than one quote in an article. Furthermore, we assume that a majority of quoted speakers are typically involved in scientific research and therefore that primary research authors are a comparable demographic. Figure 2 shows an overview of the process and example input data for this analysis: 1) quotes and quoted speakers (blue box), 2) first and last authors' names of papers published by Nature (dark orange box). These analyses relied upon accurate gender prediction of both authors and speakers. To predict the gender of the speaker or author, we used genderizeR [26], an R package wrapper for the genderize.io API [27], to get binary gender predictions for each identified first name. We unfortunately cannot identify non-binary gender expression with the tools we used. Performance of binary prediction was evaluated on a benchmark data set of thirty randomly selected news articles, ten from each of the following years: 2005, 2010, 2015 (Figure Supplemental 1a).

We first examined the number of quotes identified within each type of news-related article (Figure 2b), totaling 119,998 quotes with 109,723 of them containing a gender prediction for the speaker. Quote frequencies vary by article type. We compared the number of quotes from predicted male people to the number of predicted male first and last authors published in Nature. The total number of authors with a gender prediction was 10,454 first authors and 10,488 last authors. As denoted by the red line, we found that the predicted genders of authors and source-quotes were far from gender parity (Figure 2c). Additionally, we observed a difference in the predicted genders between first and last authors, with the last authors more frequently predicted to be male.

To extend our analysis to primary research authors more broadly, we also examined a random selection of authors from English language journals published by Springer Nature (Figure Supplemental 2a). The predicted gender gap between first and last authors was larger in our selection of Springer Nature papers; however, both first and last authors were predicted to be closer to parity than for Nature authors. Overall, predicted male people were more frequently quoted than predicted female people in Nature news articles and first and last authors in Nature and Springer Nature papers over the same time period.

The gender proportions of authorship were relatively stable over time for both Nature and Springer Nature papers. In contrast, we found that the rate of quotes predicted to be from male people noticeably decreased over time. In 2005, the fraction of quotes predicted to be from male people was 86.87% (5,552/6,391), whereas in 2020 it was 68.5% (3,494/5,098). Indeed, the fraction of quotes from predicted male people was initially higher than the fraction of predicted male last authors, then slowly decreased until it was below the predicted male first and last authorship rates in 2020. We explored the possible reasons for this decrease. First, we looked at the authorship position of speakers who were quoted about their published paper (Figure 2d). We identified 8,064 quotes with an associated citation (3,382 first author and 4,682 last author quotes). We found that quotes trend slightly towards last authors from 2005 to 2020, but because the fraction of predicted male last authors remained stable over time for both Nature and the selection of Springer Nature papers, this shift likely does not explain the downward trend. We then analyzed the breakdown of gender-predicted quotes by article type. Interestingly, one article type, "Career Feature", achieved gender parity in its quotes (Figure 2e and Figure Supplemental 2b). In this article type, we identified a total of 1,454 quotes (759 predicted female and 695 predicted male quotes), which substantially pulled the overall quote gender ratio closer to parity from 2018 onward.

To identify possible disparities with respect to name origin, we again used the extracted names of quoted speakers and last authors published in Nature. In addition, we identified the last authors of all papers cited by a news article. All processed names were then input into Wiki2019-LSTM and assigned one of ten possible name origins (Methods). Figure 3a shows an overview of the process and example input data for this analysis: 1) quotes and quoted speakers (blue box), 2) names of cited last authors in news articles (green), and 3) last authors' names of papers published by Nature (dark orange box). We divided our analysis into three parts: first, quantifying the proportions of predicted name origins of last authors cited in Nature news articles; second, calculating the proportion of quotes from speakers with each predicted name origin; and third, calculating the proportion of unique names mentioned within an article with each predicted name origin. As a comparator set, we again used the last author names in Nature papers for all three analyses. Additionally, in our supplemental analyses, we compared against the last authorship in a random selection of Springer Nature papers. We found that the number of quotes and unique names mentioned dramatically outnumbered the number of cited authors in Nature news articles, as well as last authors within Nature papers (Figure Supplemental 3a). Still, since we have more than one hundred observations per time point for each data type, we believe this is sufficient for our analysis. The minimum and median counts per data type over all years were: Nature papers, (565, 679); Springer Nature papers, (1298, 1684); quotes, (4577, 6194); mentions, (3634, 5002); citations in journalist-written articles, (139, 267); citations in scientist-written articles, (503, 660).

In comparing the citation rate of last author name origins in news articles, we decided to additionally analyze scientist-written articles. Though fewer in number, scientist-written news articles have many citations, making the set sufficient for analysis and providing an opportunity to measure differences in citation patterns between journalists and scientists. In both journalist- and scientist-written articles, we found that the most cited name origins were predicted Celtic/English or European, both with a bootstrapped estimated citation rate between 24.8% and 43.0% (Figure Supplemental 3b,c). East Asian predicted name origins account for the third highest proportion of cited names, with a bootstrapped estimated citation rate between 5.7% and 24.8%. All other predicted name origins individually account for less than 9% of total cited authors.
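To make the idea of a bootstrapped citation rate concrete, the sketch below resamples a list of predicted name origins with replacement and reports a percentile interval for the share of one origin. This is only an illustration: the percentile method, the number of resamples, and the names `bootstrap_rate_interval`, `n_boot`, and `alpha` are assumptions for this example, and the procedure described in Methods may differ.

```python
import random


def bootstrap_rate_interval(labels, target, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the share of `target` among `labels`.

    Assumption: a simple percentile bootstrap; not necessarily the estimator
    used in the paper.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    rng = random.Random(seed)
    n = len(labels)
    rates = []
    for _ in range(n_boot):
        # Resample the cited authors with replacement and recompute the rate.
        sample = rng.choices(labels, k=n)
        rates.append(sum(1 for origin in sample if origin == target) / n)
    rates.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return rates[lo_idx], rates[hi_idx]


# Toy data (not from the paper): predicted origins of 100 cited last authors.
toy_origins = (["CelticEnglish"] * 30 + ["European"] * 30
               + ["EastAsian"] * 15 + ["Other"] * 25)
low, high = bootstrap_rate_interval(toy_origins, "EastAsian")
print(f"East Asian citation rate interval: {low:.3f} to {high:.3f}")
```

Fixing the seed keeps the interval reproducible between runs, which is convenient when the same resampling is repeated for each year and each name origin.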

We determined how these distributions compare to the composition of the last authors in Nature by examining the top three most frequent predicted name origins (Figure 3b,c). We found a slight over-enrichment for predicted Celtic/English name origins and a small under-enrichment for predicted East Asian name origins in scientist-written and journalist-written news articles when compared to the composition of last authors in Nature (Figure 3b,c). Interestingly, the under-enrichment for predicted East Asian name origins in journalist-written articles was only observed from 2005 to 2009. Furthermore, we found no substantial difference for European or other predicted name origins (Figure Supplemental 4a). However, we did observe that papers in which the last author had European predicted name origins were more highly cited in news articles written by scientists than in those written by journalists (Figure Supplemental 44b,c). We also observed the predicted Celtic/English over-enrichment and East Asian under-enrichment when considering our subset of Springer Nature papers (Figure Supplemental 4b) for both journalist- and scientist-written news articles. In contrast to Nature, in the Springer Nature set we see a difference in predicted European name origins, with a growing over-enrichment. Additionally, we see a difference in the frequency of predicted Arabic/Turkish/Persian name origins between cited authors and Springer Nature authors; however, the absolute difference is lower than that observed for Celtic/English and East Asian predicted name origins.

We then sought to determine whether or not the quoted speaker demographic replicated the cited authors' over- and under-enrichment patterns. We found a much stronger Celtic/English over-enrichment, with quotes from those with Celtic/English name origins at a much higher frequency than quotes from those with European name origins (Figure Supplemental 3d). Additionally, we found a much stronger depletion of quotes from people with predicted East Asian name origins (Figure Supplemental 3b), who never accounted for more than 7.9% of quotes even though they constitute between 5.7% and 24.8% of last authors cited in either journalist- or scientist-written news articles (Figure 3b,c). When we again compare against last authorship in Nature, we observe patterns consistent with the citation analysis, with all predicted name origins except East Asian and Celtic/English closely matching the predicted name origin rate of last authors in Nature (Figure 3d).

Similarly, we find the same patterns in quoted speakers with East Asian, Celtic/English, and Arabic/Turkish/Persian predicted name origins when comparing against the Springer Nature set of last authors as we did in the previous citation analysis (Figure Supplemental 4d ). In addition, we find an under-enrichment of predicted Hispanic, South Asian, and Hebrew name origins when comparing against the predicted name origin rate of last authors in our Springer Nature set.

Since many journalists use additional sources that are not directly quoted, we also analyzed likely paraphrased speakers, e.g., a case in which the person was a source and mentioned in the story but not directly quoted. To do this, we identified all unique names that appeared in an article, which we term mentions. We found the same pattern of over-enrichment for predicted Celtic/English name origins and under-enrichment for East Asian name origins when comparing against both Nature and Springer Nature last authorships (Figure 3e).

After finding name origin differences between cited and quoted people in comparison to last authorship rates, we wanted to determine if news articles 1) represent countries at different rates, or 2) vary in the language used to describe scientific content related to each country. To perform this analysis, we used three sources of information: 1) country-related entities mentioned in the news article text (light orange), 2) country affiliations of cited authors in news articles (green), and 3) country affiliations of authors in Nature and Springer Nature (dark orange). Figure 4a shows example input data and a schematic of the analysis. We provide further processing details in Methods.

First, we interrogated the country affiliations of cited authors. We assigned a country affiliation to a paper if any author, not only the first or last, had an affiliation with that country. Therefore, a single paper may have multiple country affiliations. It was not possible to identify country affiliations for only a specific author position due to limitations in the Springer Nature API. Affiliation query results from the Springer Nature API return all country affiliations for a specific paper and are not linked to one particular author.
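The paper-level counting rule described above can be sketched in a few lines of Python. The input format (one list of affiliation countries per cited paper) and the function name `country_citation_counts` are assumptions for illustration; they do not reflect the structure of the Springer Nature API response.

```python
from collections import Counter


def country_citation_counts(papers):
    """Count cited papers per country, counting each paper once per country.

    `papers` is an iterable of lists of affiliation countries, one list per
    cited paper, with no link to individual author positions.
    """
    counts = Counter()
    for affiliations in papers:
        # A paper with several authors from one country still counts once.
        counts.update(set(affiliations))
    return counts


# Toy input (not from the paper): three cited papers.
toy_papers = [
    ["United States", "United States", "Germany"],
    ["United Kingdom"],
    ["United States", "France"],
]
print(country_citation_counts(toy_papers).most_common())
```

Because a paper can contribute to several countries, the per-country counts can sum to more than the number of cited papers.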

After post-processing, we analyzed a total of 1,989 papers with a citation accessible through the Springer Nature API. We considered all authors, not only first or last, within the article and their affiliations for this analysis. We found that most cited papers have at least one author with an affiliation within the United States, followed by the United Kingdom, Germany, and France (Figure 4b). Interestingly, we found a strong citation over-enrichment of many top-cited countries, but we found no evidence of under-enrichment of countries included in NamePrism's grouping of countries with East Asian name origins (Figure 4c, Figure Supplemental 5a).

Next, we examined content differences between countries or groups of countries. For example, we wanted to determine the extent to which a country was the subject (i.e., their scientific policies, environment, pollution) or the research being performed within that country was the subject. To do this, we needed to identify in an article when a country is mentioned and an affiliated author from that country is cited. Our assumption is that if a country is not cited, but it is talked about, then the topic of the article is related to something happening within that country. Similarly, if a country is not mentioned, but has an affiliated author that is cited, then the science output from that country is likely to be the subject of the article. We quantified this by counting all the journalist-written news articles in which a country, region within a country, or organization affiliated with a country was mentioned, which we term a country's "mention rate". To identify if a country was mentioned in an article, we started with all organizations, countries, states, or provinces identified by coreNLP's named entity tagger. We then linked all the identified region-related named entities to countries with OpenStreetMap. Since there may be errors in both coreNLP and OpenStreetMap, we only assumed a country was mentioned when at least two unique entities mapped to the same country in a single article. On a benchmark set, we found that 4 country identifications from a total of 59 country predictions were incorrect (Figure Supplemental 1b,c). When aggregating over articles, we find that 4/30 articles contain exactly one incorrect country mention (Figure Supplemental 1b).

Once we calculated the mention rate and used the previously described citation rate, we identified countries with a consistent skew towards either a higher or lower mention to citation rate (Figure 4d and Figure Supplemental 4b). We define these as countries where the difference between citation and mention rates is in the top or bottom 5% per year. This outlier description allowed us to identify two sets of countries based on their citation and mention rates. Those with a high relative citation to mention rate were: Germany, Spain, Netherlands, Denmark, Sweden, Austria, Switzerland, Israel, and Belgium. Those with a low relative citation to mention rate were: Russia, Brazil, Colombia, the United States, and India. We removed all countries that were in both the top and bottom 5% in different years, which excluded Australia, Canada, the United Kingdom, China, France, and Japan from consideration.
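The two rules described above, the two-entity threshold for calling a country mentioned and the per-year top or bottom 5% cut on the citation minus mention difference, can be sketched as follows. This is a hedged illustration rather than the analysis code: the input structures, the function names `mentioned_countries` and `skewed_countries`, the choice to keep at least one country per tail, and the reading of "consistent" as "never in the opposite tail" are all assumptions.

```python
from collections import defaultdict


def mentioned_countries(entity_country_pairs, min_entities=2):
    """Countries supported by at least `min_entities` unique entities in one article."""
    entities_per_country = defaultdict(set)
    for entity, country in entity_country_pairs:
        entities_per_country[country].add(entity)
    return {c for c, ents in entities_per_country.items() if len(ents) >= min_entities}


def skewed_countries(rate_diff_by_year, tail=0.05):
    """Split countries into high and low citation-minus-mention sets.

    `rate_diff_by_year` maps year -> {country: citation_rate - mention_rate}.
    Countries that fall in both tails in different years are dropped.
    """
    high, low = set(), set()
    for diffs in rate_diff_by_year.values():
        ranked = sorted(diffs, key=diffs.get)
        k = max(1, int(len(ranked) * tail))  # assumption: at least one per tail
        low.update(ranked[:k])
        high.update(ranked[-k:])
    ambiguous = high & low
    return high - ambiguous, low - ambiguous


# Toy values (not from the paper).
pairs = [("Rio de Janeiro", "Brazil"), ("Amazonas", "Brazil"),
         ("Berlin", "Germany"), ("Berlin", "Germany")]
print(mentioned_countries(pairs))

toy_diffs = {
    2019: {"Germany": 0.10, "Brazil": -0.08, "Japan": 0.12},
    2020: {"Germany": 0.07, "Brazil": -0.05, "Japan": -0.06},
}
print(skewed_countries(toy_diffs))
```

In the toy data, the repeated "Berlin" entity counts once, so Germany does not pass the two-entity threshold, and Japan is dropped because it lands in opposite tails in the two years.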

We then identified content differences between these two sets of countries by analyzing all of the main text from articles that mentioned and did not cite an author affiliated with each of the specified countries. After properly identifying high-frequency words across the entire corpus, we identified the top 15 most discriminative terms of each country type (Methods). Interestingly, we identified that the words most linked with mentioned countries were mostly related to environmental, extractive or political topics. The top five terms were "dams", "rio", "shuttle", "hydropower", and "weapons" (Figure 4e, Figure Supplemental 4c,d). In contrast, we find that the words most related to countries with a higher citation than mention rate were science or research-related ones. The top five terms were "classical", "quantum", "yeast", "neurons", and "cells".

Scientific journalism is the critical conduit between the academic and public spheres, and consequently shapes the public's view of science and scientists. However, as observed in other forms of recognition in science, biases may shift coverage away from the known demographics within science [28] . Ideally, scientific journalism is representative of academic papers. Though it would be best for news coverage to promote equitable representation, at a minimum quotes and citations would ideally match the regional and gender demographics of scientific academia. To examine this last point, we analyzed over 22,000 news articles published by Nature to identify quoted, mentioned, and cited people. We then compared this to the authorship statistics from Nature's papers and a subset of Springer Nature's English language papers.

We first looked at possible gender differences in quotes and found a large, but decreasing, gender gap when compared to the broader population in all but one article type. We found that the decreasing trend was largely driven by the recent introduction of a single column, "Career Feature". This column has an equal number of quotes from both genders, showing that gender parity is possible in science journalism. However, we do recognize that different journalistic columns have different purposes or may represent different demographics, so parity may be inherently more difficult to reach in some of them. In order to draw these conclusions, we analyzed the proportion of all identified quotes that were from a speaker predicted to be male compared to the proportion of first and last authors in Nature predicted to be male, which similarly is a measure of scientific participation. Using computational methods, we performed quote association and gender prediction. We observed a strong skew towards predicted male participation across both quotes within news articles and authorship within Nature and Springer Nature papers. We also identified a gender difference between first and last authors, as previously shown [32, 33, 34].

To further our analysis of possible coverage disparities, we looked at differences in predicted name origins of quoted and cited last authors across all the processed news articles. Our findings provide additional support for previous studies that identified under-citation [35] and under-recognition [28] of East Asian people. Interestingly, we found under-citation of people with predicted East Asian name origins to be much less pronounced than under-quotation. We do not believe that the under-quotation is driven by paraphrasing sources, which may occur more frequently with non-native English speakers. This is because our finding of under-enrichment of predicted East Asian name origins was recapitulated when we additionally looked at unique names mentioned within news articles. Furthermore, we find that scientist-written news articles tend to under-cite people with predicted East Asian name origins more than journalist-written articles.

Overall, we find that most quotes, mentions, and citations are from people with predicted Celtic/English or European name origins, followed by East Asian, with the remaining origins individually making up less than 10% of both citations and quotes. Except for Celtic/English (over-representation) and East Asian (under-representation), all predicted name origins roughly match the expected background rate estimated by Nature last authorship. We also found this same pattern in our Springer Nature data set.

Figure Supplemental 1

Quotes with a full name or pronoun associated: 110,035
Quotes with a gender prediction: 109,723
Quotes with a full name: 100,529
Quotes with a name origin prediction: 100,528

The enduring whiteness of the American media | Howard French the Guardian

I Analyzed a Year of My Reporting for Gender Bias and This Is What I Found Adrienne LaFrance Medium

I Analyzed a Year of My Reporting for Gender Bias (Again)

I Spent Two Years Trying to Fix the Gender Imbalance in My Stories Ed Yong The Atlantic

Time Trends in Printed News Coverage of Female Subjects Steven Skiena Journalism Studies

Women Are Seen More than Heard in Online Newspapers Sen Jia

Lack of female sources in NY Times front-page stories highlights need for change Poynter

Who Makes the News | GMMP

Why we need to increase diversity in the immunology research community Akiko Iwasaki Nature Immunology

Men Set Their Own Cites High: Gender and Self-citation across Fields and over Time Molly M

Bibliometrics: Global gender disparities in science Vincent Larivière

Fund Black scientists

NIH peer review: Criterion scores completely account for racial disparities in overall impact scores Lee Science Advances

SOCIOLOGY: The Gender Gap in NIH Grant Applications

Science Stories Christina Selby The Open Notebook

Fast and Powerful Scraping and Web Crawling Framework

The Stanford CoreNLP Natural Language Processing Toolkit Christopher Manning

A Parser for Human Names

Gender Prediction Methods Based on First Names with genderizeR Kamil Wais The R

Analysis of ISCB honorees and keynotes reveals disparities

Text Mining using "dplyr

Time's up for journal gender bias

Does academic authorship reflect gender bias in pediatric surgery? An analysis of the Journal of Pediatric Surgery

Authorship in Psychiatry Journals From

After observing name origin differences, we determined if there was a difference in the frequency or content of coverage across countries. We first looked at possible citation disparities for cited authors with specific country affiliations, and found that most papers cited by Nature news articles have at least one author affiliated with the United States, United Kingdom, or Germany. In contrast to the name origin results, the citation rate of Chinese-affiliated authors was not significantly depleted. Interestingly, we find that the number of cited papers with authors having affiliations in China is increasing at the same rate as Springer Nature and Nature authorships. Furthermore, the increase in citation and last authorship rates is more pronounced for Chinese-affiliated authors than for any other country within the top ten most cited.

We then focused on identifying whether the news content about a country focused on the scientific output from that country or the country itself as the scientific subject. We postulated that a difference in citation and mention rates could indicate the difference in a news article's subject matter. To achieve this, we identified two sets of countries with a large and consistent difference in their citation and mention rates. The top "Citation" countries were Germany, Spain, and the Netherlands. The top "Mention" countries were India, the United States, and Colombia. We then found that these two sets of countries were discussed differently. The resultant words for "Mention" countries were most related to extraction, agriculture and politics, suggesting that the country was likely the article's subject. In contrast, the representative words for "Citation" countries were more diverse in topic, relating to biological, medical, and physics terms.

We hypothesize that the difference in discriminative terms between the two country sets is evidence that, for some countries, the news content may focus more on the country as a subject of research than on the science that comes out of it. This hypothesis assumes that no country has a specialization in a scientific topic, which is likely not true. This does, however, give us an indication that countries differ in how they are covered in scientific journalism.

We would like to thank Jeffrey Perkel for asking thoughtful questions that spurred this line of research, and providing feedback and insight into the news-gathering process during the course of this project.

Through our comprehensive analysis, we were able to identify how news coverage varies by country, name origin, and gender, and compare it to scientific publishing background rates. While we found a significant gender disparity, the rate of female representation in scientific news is increasing and outpacing first and last authorships on scientific papers. Furthermore, we identified a significant depletion of quotes from scientists with a predicted East Asian name origin when compared to paper authorship, and a significant but smaller depletion of cited authors with a predicted East Asian name origin in news content. Finally, we showed that coverage of specific countries differs in content: for some countries the focus is on the country's scientific output, whereas for others it is on environmental aspects of the country.

Previous anecdotal studies from journalists have shown that awareness of their bias can help them to reduce it [2, 3, 4]. Once a bias is identified, an individual can seek resources to help them find and retain diverse sources, such as international expert databases like gage [21] and SheSource [22]. Additional tips for journalists to achieve and maintain a diverse source pool are described by Christina Selby in the Open Notebook [20].

It should also be mentioned that we were only able to analyze the data provided through scraping "www.nature.com". This is a major limitation, because the only measures that we have of the demographics of sources are people who have their name mentioned or research cited within the article. Journalists do not quote or mention all of the sources that they interviewed or cite all of the papers that they read when researching an article. For example, a person may not be mentioned or quoted in the article because of length limitations, because they do not want to be named, or because they provide information that is not directly quotable but that still shapes the content of the article. A more accurate reflection of journalists' sources would be a self-maintained record of the people they interview. Our work examines disparities with respect to recognition within articles, which can be measured by mentions, quotes, or citations of people.

Furthermore, many journalists are limited by who responds to their requests for an interview or by recommendations from prominent scientists. Scientists fielding reporter inquiries can also audit themselves to examine the extent to which there are disparities in the sets of experts they recommend. Journalists and the scientists they interview have a unique opportunity to shape the public's and their peers' perspectives on who is a scientific expert. Their choice of coverage topics and interviewees could help to reduce disparities in the outputs of science-related journalism.

This manuscript was written using Manubot [36] and is available on github: manuscript repository link. All code and metadata are also available on github, full analysis repository link, under a BSD 3-Clause License. The code to generate all main and supplemental figures is available as R markdown documents within our main analysis github, in the following subfolder: notebooks. Due to copyright, we are unable to provide the scraped data used in this analysis. However, scraping code is available on our main analysis github, in the following subfolder: scraper. To ensure reproducibility without violating copyright, we provide the word frequencies for each news article and the coreNLP output. Furthermore, we provide a docker image that can re-run the analysis pipeline using intermediate, pre-processed data and produce all the main and supplemental figures. To re-run the entire pipeline (including scraping), the docker image contains all necessary packages and code. The shell scripts to re-run the entire analysis are provided in the README file in the github repository.