Auto-CORPus: Automated and Consistent Outputs from Research Publications


Auto-CORPus: Automated and Consistent
Outputs from Research Publications

Yan Hu1,a, Shujian Sun1,a, Thomas Rowlands2, Tim Beck2,3,b, and Joram M. Posma1,3,b

1 Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, SW7 2AZ, United Kingdom
2 Department of Genetics and Genome Biology, University of Leicester, LE1 7RH, United Kingdom

3 Health Data Research (HDR) UK, United Kingdom
a These authors contributed equally.

b These authors contributed equally. �

Abstract
Motivation: The availability of improved natural lan-
guage processing (NLP) algorithms and models enable
researchers to analyse larger corpora using open source
tools. Text mining of biomedical literature is one area
for which NLP has been used in recent years with large
untapped potential. However, in order to generate cor-
pora that can be analyzed using machine learning NLP
algorithms, these need to be standardized. Summarizing
data from literature to be stored into databases typically
requires manual curation, especially for extracting data
from result tables.
Results: We present here an automated pipeline that
cleans HTML files from biomedical literature. The
output is a single JSON file that contains the text for
each section, table data in machine-readable format
and lists of phenotypes and abbreviations found in the
article. We analyzed a total of 2,441 Open Access articles
from PubMed Central, from both Genome-Wide and
Metabolome-Wide Association Studies, and developed a
model to standardize the section headers based on the
Information Artifact Ontology. Extraction of table data
was developed on PubMed articles and fine-tuned using
the equivalent publisher versions.
Availability: The Auto-CORPus package is freely
available with detailed instructions from Github at
https://github.com/jmp111/AutoCORPus/.

information artefact ontology | natural language processing | text standard-
ization

Correspondence: timbeck [at] leicester.ac.uk and jmp111 [at] ic.ac.uk

Introduction

Natural language processing (NLP) is a branch of artificial
intelligence that uses computers to process, understand and
use human language. NLP is applied in many different fields
including language modelling, speech recognition, text min-
ing and translation systems. In the biomedical realm, NLP
has been applied to extract for example medication data from
electronic health records and patient clinical history from
clinical notes, to significantly speed up processes that would
otherwise be extracted manually by experts (1). Biomedical
publications, unlike structured electronic health records, are
semi-structured and this makes it difficult to extract and inte-

grate the relevant information (2). The format of research ar-
ticles differs between publishers and sections describing the
same entity, for example statistical methods, can be found
in different locations in the document in different publica-
tions. Both unstructured text and semi-structured document
elements, such as headings, main texts and tables, can con-
tain important information that can be extracted using text
mining (3).
The development of the genome-wide association study
(GWAS) has been led to by the on-going revolution in high-
throughput genomic screening and a deeper understanding of
the relationship between genetic variations and diseases/traits
(4). In a typical GWAS, researchers collect data from study
participants, use single nucleotide polymorphism (SNP) ar-
rays to detect the common variants among participants, and
conduct statistical tests to determine if the association be-
tween the variants and traits is significant. The results are
mostly represented in publication tables, but can also be
found in the main text, and there are multiple community ef-
forts to store these reported associations in queryable, on-
line databases (5, 6). These efforts involve time-intensive
and costly manual data curation to transcribe results from the
publications, and supplementary information, into databases.
Summary-level GWAS results are generally reported in the
literature according to community norms (e.g. a SNP asso-
ciated to a phenotype with a probability value), hence NLP
algorithms can be trained to recognize the formats in which
data are reported to facilitate faster and scalable information
extraction that is less prone to human error.
Development of effective automatic text mining algorithms
for GWAS literature can also potentially benefit other fields
in biomedical research as the body of biomedical literature
grows every day. Yet previous attempts of mining scientific
literature focused mainly on information extraction from ab-
stracts and some on the main text, while for the most part
ignoring tables. To facilitate the process of preparing a cor-
pus for NLP tasks such as named-entity recognition (NER),
text classification or relationship extraction, we have devel-
oped an Automated pipeline for Consistent Outputs from
Research Publications (Auto-CORPus) as a Python package.
The main aims of Auto-CORPus are:

• To provide clean text outputs for each publication sec-
tion with standardized section names

Hu and Sun, et al. | bioRχiv | January 8, 2021 | 1–10

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://github.com/jmp111/AutoCORPus/
timbeck@leicester.ac.uk
jmp111@ic.ac.uk
https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


• To represent each publication’s tables in a JavaScript
Object Notation (JSON) format to facilitate data im-
port into databases

• To use the text outputs to find abbreviations used in the
text

We exemplify the package on a corpus of 1,200 Open Access
GWAS publications whose data have been manually added
to the GWAS Central database to list phenotypes, SNPs and
P-values found in the cleaned text (Figure 1). In addition, we
also include data on 1,200+ Metabolome-Wide Association
Studies (MWAS) to ensure the methods are not biased
towards one domain. MWAS focus on small molecules,
some of which are end-products of cellular regulatory
processes, that are the response of the human body to genetic
or environmental variations (7).

Materials and Methods

Data. Hypertext Markup Language (HTML) files for 1,200
Open Access GWAS publications whose data exists in the
GWAS Central database (5) were downloaded from PubMed
Central (PMC) in March 2020. A further 1,241 Open Access
publications of MWAS on cancer, gastrointestinal diseases,
metabolic syndrome, sepsis and neurodegenerative, psychi-
atric, and brain illnesses were also downloaded in the same
format. Publisher versions of ca. 10% of these publications
were downloaded in July 2020 to test the algorithms on pub-
lications with different HTML formats. The GWAS dataset
was randomly divided into 700 training publications to de-
velop algorithms, and a test set of the remaining 500 publica-
tions.

Processing. HTML files were loaded using the Beautiful-
soup4 HTML parser package (v4.9.0). Beautifulsoup4 was
used to convert HTML files to tree-like structures with each
branch representing a HTML section and each leaf a HTML
element. After HTML files were loaded, all superscripts,
subscripts, and italics were converted to plain text. Auto-
CORPus extracts h1, h2 and h3 tags for titles and headings,
and p tags for paragraph texts using the default configura-
tion. The headings and paragraphs are saved in a structured
JavaScript Object Notation (JSON) file for each HTML file.
Tables are extracted from the document using a different set
of configuration files (separate configurations for different ta-
ble structures can be defined and used) and saved in a new
JSON model that ensures tables of all formats and origin, not
only restricted to GWAS publications, can be described in
the same structured model, so that these can be used as in-
put to rule-based or deep learning algorithms for data extrac-
tion. The data cells are stored in the “result” key, and their
corresponding section name and header names are stored in
“section_name” and “columns” keys respectively. Therefore,
extracting relationships between cells only requires simple
rules.

Fig. 1. Workflow of the Auto-CORPus package.

2 | bioRχiv Hu and Sun, et al. | Auto-CORPus

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Ontologies for entity recognition. The Information Arti-
fact Ontology (IAO) was created to serve as a domain-neutral
resource for the representation of types of information con-
tent entities such as documents, databases, and digital im-
ages (8). We used the v2020-06-10 model (9) in which 37
different terms exist that describe headers typically found in
biomedical literature. The extracted headers in the JSON file
were first mapped to the IAO terms using the Lexical OWL
Ontology Matcher (10). We use fuzzy matching using the
fuzzywuzzy package (v0.17.0) to map headers to the pre-
ferred section header terms and synonyms, with a similarity
threshold of 0.8. This threshold was evaluated by confirming
all matches were accurate by two independent researchers.
After the direct IAO mapping and fuzzy matching, unmapped
headers still exist. To map these headings, we developed a
new method using a directed graph (digraph) for representa-
tion since headers are not repeated within a document, are se-
quential and have a set order that can be exploited. Digraphs
consist of nodes (entities, headers) and edges (links between
nodes) and the weight of the nodes and edges is propor-
tional to the number of publications in which these are found.
While digraphs from individual publications are acyclic, the
combined graph can contain cycles hence digraphs opposed
to directed acyclic graphs are used. Unmapped headers are
assigned a section based on the digraph and the headers in
the publication that could be mapped (anchor points). For
example, at this point in this article the main headers are ‘ab-
stract’ followed by ‘introduction’ and ‘materials and meth-
ods’ that could make up a digraph. Another article with head-
ers ‘abstract’, ‘background’ and ‘materials and methods’ has
two anchor points that match the digraph, and the unmapped
header (‘background’) can be inferred from appearing in be-
tween the anchor points in the digraph (‘abstract’, ‘materials
and methods’): ‘introduction’. We use this process to eval-
uate new potential synonyms for existing terms and identify
new potential terms for sections found in biomedical litera-
ture.
We used the Human Phenotype Ontology (HPO) to identify
disease traits in the full texts. The HPO was developed with
the goal to cover all common phenotypic abnormalities in hu-
man monogenic diseases (11).

Use cases: regular expression algorithms. Abbrevia-
tions in the full text are found using an adaptation of a previ-
ously published methodology (12) based on regular expres-
sions using the abbreviations package (v0.2.5). The brief
principle of it is to find all brackets within a corpus. If the
number of words in a bracket is <3 it considers if it could be
an abbreviation. It searches the characters within the brackets
in the text on either side of the brackets one by one. The first
character of one of these words must contain the first charac-
ter within that bracket. And the other characters within that
bracket must be contained by other words followed by the
previous word whose first character is the same as the first
character in that bracket. We combine the output of the pack-
age with abbreviations defined in the abbreviations section (if
found) from the IAO/digraph model.
For phenotype entity recognition, first any abbreviations in

paragraphs extracted from the full text are replaced by their
definition. This text is then tokenized using the spacy pack-
age (v2.3) (model en_core_web_sm) and compared against
phenotypes and their synonyms defined by HPO for disease
traits matching.
P-values and SNPs were identified in the full text and tables
based on regular expressions as they have a standard form.
Pairs of P-value-SNP associations are found in the text using
dependency parse trees (13).

Use cases: deep learning-based named-entity recog-
nition. The first example of a use case is to recognize the
assay with which the data was acquired, however no ex-
isting models exist for this purpose. We fine-tuned a pre-
existing model trained for biomedical NER, the biomedi-
cal Bidirectional Encoder Representations from Transform-
ers (bioBERT) (14), using part of our corpus where only
MWAS assays were tagged. We applied our fine-tuned model
only on the paragraphs in the materials and methods sec-
tions to recognize the assays used. A second bioBERT-based
model was fine-tuned on phenotypes, which already exist
in the data, and enriched in phenotypes associated with the
MWAS publications. This model was applied on only the
abstract and paragraphs from the results section. The third
example was applied only on paragraphs from the results
and discussion sections using an existing model specifically
trained to recognize chemical entities, ChemListem (v0.1.0)
(15).

Use cases: paragraph classification. It is possible un-
mapped headers are mapped to multiple sections if the an-
chor points are far apart. In order to test the applicability of
a machine learning model to classify paragraphs we trained a
random forest classifier on a dataset consisting of 1,242 ab-
stract paragraphs and 936 non-abstract paragraphs. 80% of
the data was used for training and the remainder as the test
set.

Results
The order of sections in biomedical literature. A total
of 21,849 headers were extracted from the 2,441 publica-
tions, mapped to IAO (v2020-06-10) terms and visualized by
means of a digraph with 372 unique nodes and 806 directed
edges (Figure 2A). The major unmapped node is ‘associated
data’, which is a header specific for PMC articles that ap-
pears at the beginning of each article before the abstract. The
main structure of biomedical articles that were analyzed is:
abstract → introduction → materials → results → discus-
sion → conclusion → acknowledgements → footnotes sec-
tion → references. IAO has separate definitions for ‘mate-
rials’ (IAO:0000633), ‘methods’ (IAO:0000317) and ‘statis-
tical methods’ (IAO:0000644) sections, hence they are sepa-
rate nodes in the graph and introduction is also often followed
by headers to reflect the methods section (and synonyms).
There is also a major directed edge from introduction directly
to results, with materials and methods placed after the discus-
sion and/or conclusion sections.

Hu and Sun, et al. | Auto-CORPus bioRχiv | 3

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


All unmapped headers were investigated and evaluated
whether some could be used as synonym for existing cate-
gories. The digraph was also inspected by means of visual-
izing individual ego-networks which show the edges around
a specific node mapped to an existing IAO term. Figure 2B
shows the ego-network for abstract, and four main categories
and one potential new synonym (precis, in red) were iden-
tified. The majority of unmapped headers (in purple), that
follow the abstract, relate to a document that is written as
one coherent whole, with specific headers for each section
or a general header for the full/main text. An additional
four unmapped headers relate to ‘materials and methods’ in
their broader sense and these are data, data description, par-
ticipants and sample. The remaining two categories of un-
mapped headers to/from abstract can be classified as new
sections ‘graphical abstract’ and ‘highlights’. These head-
ers were found alongside, and appear to be distinct from, the
(textual) abstract.
Based on the digraph, we then assigned data and data descrip-
tion to be synonyms of the materials section, and participants
and sample as a new category termed ‘participants’ which is
related to, but deemed distinct from, the existing patients sec-
tion (IAO:0000635). The same process was applied to ego-
networks from other nodes linked to existing IAO terms to
add additional synonyms to simplify the digraph. Figure 2C
shows the resulting digraph with only existing and newly pro-
posed section terms.

New proposed elements for the IAO. Each existing IAO
term contains one or more synonyms and extracted head-
ers were first mapped directly to these terms. Any headers
that could not be mapped directly are mapped in the second
step using fuzzy matching (e.g. the typographical error ‘ex-
peremintal section’ in PMC4286171 is correctly mapped to
the methods section). The last step involves mapping remain-
ing unmapped headers to existing terms based on the digraph
and using the structure (anchor headers) of the publication.
Headers that can be mapped to existing terms in the second
and third steps, are included as synonyms in the model. The
existing categories for which new potential synonyms were
identified are listed in Table 1a and 1b with their existing
synonyms and newly identified synonyms.
From the analysis of ego-networks four new potential cate-
gories were identified: disclosure, graphical abstract, high-
lights and participants. Table 2 details the proposed defini-
tion and synonyms for these categories. In the digraph in
Figure 2C this section is located towards the end of a pub-
lication and in some instances is followed by the conflict of
interest section.

Table data extraction with different configurations.
PMC articles are standardized which makes data extraction
more straightforward, however some publications are not
deposited into PMC or other repositories and can only be
found via publisher websites. While the package has been
developed using a large set of PMC articles, we compared
the Auto-CORPus output for PMC articles with the output
for the equivalent articles made available by the publishers.

We found no differences in how headers were extracted and
paragraphs were classified based on the digraph. However,
the representation of tables does differ substantially between
publishers, hence a model developed on PMC articles alone
will fail to extract the data. We circumvent this issue by defin-
ing configuration files for different table formats and we com-
pare the accuracy of the data represented in the JSON format
(Figure 3) between PMC and publisher versions of the same
papers.
Using the default (PMC) configuration on non-PMC arti-
cles none of the 302 tables are represented accurately in the
JSON. Auto-CORPus allows to use a variety of configura-
tion files (a single file, or all as batch) to be used to extract
data from tables. One configuration file, different to the de-
fault, correctly represented the data in JSON format of 93%
(280) of tables. The remaining 22 tables could be repre-
sented correctly using 8 different configuration files. When
the right configuration file is used for non-PMC articles, all
tables (100%) are represented identically to the JSON output
from the matching PMC version.

Use cases. The extracted paragraphs were classified as one
(or more) categories based on the digraph. This is the purpose
of the Auto-CORPus package, to prepare a corpus for analy-
sis so that different sections can be used for specific purposes.
We detail how these standardized texts can be used for entity
recognition.

Paragraph classification. While many headers can be
mapped using fuzzy matching plus the digraph structure,
some headers remain unmapped (e.g. the headers in purple
in Figure 2B: full text, main text, etc.) while others can be
assigned to multiple (possible) sections. The choice of as-
signing multiple categories to unmapped headers based on
the digraph is deliberate as it is to ensure the algorithm does
not wrongly assign it to only one (e.g. ‘materials’ over ‘meth-
ods’). The next step is to perform the paragraph classification
using NLP algorithms to learn from the word usage and con-
text. We show that random forests can be used to this end
by training it to distinguish between abstracts and other para-
graphs. 435 paragraphs from the test set were predicted us-
ing a random forest trained on 1,743 paragraphs. For the test
set, we obtained an F1-score of 0.90 for classifying abstracts
(precision = 0.91, recall = 0.90) and 0.88 for classifying non-
abstracts (precision = 0.87, recall = 0.88).

Abbreviation identification. The abbreviation detection algo-
rithm searches through each paragraph using a rule-based ap-
proach to find all abbreviations used. Auto-CORPus then
investigates whether a paragraph is mapped to the abbrevia-
tions category and, if found, it combines these two lists of ab-
breviations found in the publication. For example, when ap-
plied on an MWAS publication (16) which contains a header
titled “ABBREVIATIONS” the algorithm combines the 9 ab-
breviations listed by the authors and with a further 7 identi-
fied from the text (Figure 4), including an abbreviation used
with two spellings in the text.

4 | bioRχiv Hu and Sun, et al. | Auto-CORPus

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Fig. 2. Digraph generated from analyzing section headers from 2,441 Open Access publications from PubMed Central. (A) digraph of the v2020-06-10 IAO model consists of
372 unique nodes, of which 24 could be directly mapped to section terms (in orange) and the remainder are unmapped headers (in grey), and 806 directed edges. Relative
node sizes and edge widths are directly proportional to the number of publications with these (subsequent) headers. Blue edges indicate the edge with the highest weight
from the source node, edges that exist in fewer than 1% of publications are shown in light grey and the remainder in black. (B) Unmapped nodes connected to ‘abstract’
as ego node, excluding corpus specific nodes, grouped into different categories. Unlabeled nodes are titles of paragraphs in the main text. (C) Final digraph model used
in Auto-CORPus to classify paragraphs after fuzzy matching. This model includes new (proposed) section terms and each section contains new synonyms identified in this
analysis. ‘Associated Data’ is included as this is a PMC-specific header found before abstracts and can be used to indicate the start of most articles.

Rule-based extraction of GWAS summary-level data. GWAS
Central relies on curated data extracted manually from pub-
lications or other databases. We investigated whether a
rule-based approach to recognize phenotypes, SNPs and P-
values can correctly identify data from publications con-
tained within the database. A rule-based approach by ap-
plying the HPO on the 500 GWAS publications from the test
set, identified a total of 9,599 unique disease traits (major and
minor) in these publications. 949 traits are recorded for these
publications in GWAS Central and the rule-based approach
found 449 with a perfect match. For 65% of the publica-
tions all traits were correctly identified. SNPs have standard-
ized formats, hence rule-based approaches are well suited for
their identification. Likewise, P-values in GWAS publica-
tions are typically represented using scientific notation and
can also be identified using rule-based methods. A total of
26,031 SNP/P-value pairs were found across the main text
and tables of the 500 publications. For 62.4% of publications
all associations recorded in the GWAS Central database are
also found using this approach. While 57.6% of these pub-
lications present results (SNP/P-value pairs) only in tables,
and 94.3% of pairs are found in tables, 276 associations were
identified from the main text that are not represented in ta-
bles. 2,673 pairs match those recorded in the database (total
of 6,969 pairs for these publications), however many associ-

ations in the database are not represented in main text/tables
but in supplementary materials. Auto-CORPus includes a
separate function to convert csv/tsv data to table JSON for-
mat (Figure 3), as summary-level results are often saved in
these file formats as part of the supplementary information.

Named-entity recognition. Three different deep learning
models were used for NER on specific paragraphs of publica-
tions. A pre-trained biomedical entity recognition algorithm
(14) was fine-tuned using the results from the rule-based
approach applied on GWAS data. Example sentences that
contain HPO terms were used to fine-tune the transformer
model and then applied on 928 MWAS publications from
four broad and distinct phenotypes (cancer, gastrointestinal
diseases, metabolic syndrome, and neurodegenerative, psy-
chiatric and brain illnesses). The fine-tuned deep learning
algorithm obtained accuracies between 0.76 and 0.97, aver-
aging around 82.3% (Table 3).
We then fine-tuned the same base model for recognizing as-
says in text by training on sentences identified from the text
that contain assays routinely used in MWAS. The first pass
consisted of a rule-based approach, with fuzzy matching, to
find sentences with terms and these were then used to fine-
tune the deep learning model. Figure 5 shows the result-
ing output in JSON format for one MWAS publication (16).

Hu and Sun, et al. | Auto-CORPus bioRχiv | 5

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Category (IAO
identifier)

Existing synonyms (IAO
v2020-06-10)

New synonyms identified a

abstract
(IAO:0000315)

abstract precis

acknowledgements
(IAO:0000324)

acknowledgements,
acknowledgments

acknowledgement, acknowledgment, acknowledgments and
disclaimer

author contributions
(IAO:0000323)

author contributions, contributions by
the authors

authors’ contribution, authors’ contributions, authors’ roles,
contributorship, main authors by consortium and author

contributions

discussion
(IAO:0000319)

discussion, discussion section discussions

footnote
(IAO:0000325)

endnote, footnote footnotes

introduction
(IAO:0000316)

background, introduction introductory paragraph

methods
(IAO:0000317)

experimental, experimental
procedures, experimental section,
materials and methods, methods

analytical methods, concise methods, experimental methods,
method, method validation, methodology, methods and design,
methods and procedures, methods and tools, methods/design,
online methods, star methods, study design, study design and

methods

references
(IAO:0000320)

bibliography, literature cited,
references

literature cited, reference, references, reference list, selected
references, web site references

supplementary
material

(IAO:0000326)

additional information, appendix,
supplemental information,

supplementary material, supporting
information

additional file, additional files, additional information and
declarations, additional points, electronic supplementary

material, electronic supplementary materials, online content,
supplemental data, supplemental material, supplementary

data, supplementary figures and tables, supplementary files,
supplementary information, supplementary materials,

supplementary materials figures, supplementary materials
figures and tables, supplementary materials table,

supplementary materials tables

Table 1a. Newly identified synonyms for existing IAO terms (00003xx) from the digraph mapping of 2,441 publications. Elements in italics have previously been submitted by
us for inclusion into IAO and added in the latest release (v2020-12-09).

Lastly, we applied a domain specific algorithm for recogniz-
ing chemical entities in the text and tables (15) to identify
metabolites in the same publication (Figure 5).

Discussion

The analysis of our corpus of 2,441 Open Access publica-
tions has resulted in identifying well over 100 new synonyms
for existing terms used in biomedical literature to indicate
what a paragraph is about. In addition, we identified four
new potential categories not previously included in the IAO.
We previously submitted a subset of synonyms reported here
and one of the new categories for inclusion in the IAO. These
have been accepted by the IAO and are included in the lat-
est release (v2020-12-09), hence we presented our analyses
using the previous version of IAO that does not include part

of our work. In the latest release, the ‘graphical abstract’
section has been added (IAO:0000707) based on our contri-
bution. Also, a new ‘research participants’ (IAO:0000703)
section has been added as contribution by others in the same
release; therefore synonyms found here for the new category
‘participants’ section will be proposed in future as synonyms
for the ‘research participants’ section. While the disclosure
section appears to be distinct from the conflict of interest sec-
tion due to a directed edge in the digraph, its synonyms could
also be proposed to be part of the existing conflict of interest
section in IAO.

Standardization of text for NLP is an important step in
preparing a corpus. Auto-CORPus outputs a JSON file of
cleaned text, with standardized headers as well as all data
presented in tables in JSON format. Standardizing headers
is important because some sections are more important than

6 | bioRχiv Hu and Sun, et al. | Auto-CORPus

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Category (IAO
identifier)

Existing synonyms (IAO
v2020-06-10)

New synonyms identified a

abbreviations
(IAO:0000606)

abbreviations, abbreviations list,
abbreviations used, list of

abbreviations, list of abbreviations
used

abbreviation and acronyms, abbreviation list, abbreviations
and acronyms, abbreviations used in this paper, definitions for

abbreviations, glossary, key abbreviations, non-standard
abbreviations, nonstandard abbreviations, nonstandard

abbreviations and acronyms

author information
(IAO:0000607)

author information, authors’
information

biographies, contributor information

availability
(IAO:0000611)

availability, availability and
requirements

availability of data, availability of data and materials, data
archiving, data availability, data availability statement, data

sharing statement

conclusion
(IAO:0000615)

concluding remarks, conclusion,
conclusions, findings, summary

conclusion and perspectives, summary and conclusion

conflict of interest
(IAO:0000616)

competing interests, conflict of
interest, conflict of interest statement,

declaration of competing interests,
disclosure of potential conflicts of

interest

authors’ disclosures of potential conflicts of interest,
competing financial interests, conflict of interests, conflicts of

interest, declaration of competing interest, declaration of
interest, declaration of interests, disclosure of conflict of

interest, duality of interest, statement of interest

consent
(IAO:0000618)

consent informed consent

ethical approval
(IAO:0000620)

ethical approval
ethics approval and consent to participate, ethical

requirements, ethics, ethics statement

funding source
declaration

(IAO:0000623)

funding, funding information,
funding sources, funding statement,
funding/support, source of funding,

sources of funding

financial support, grants, role of the funding source, study
funding

future directions
(IAO:0000625)

future challenges, future
considerations, future developments,

future directions, future outlook,
future perspectives, future plans,
future prospects, future research,
future research directions, future

studies, future work

outlook

materials
(IAO:0000633)

materials data, data description

statistical analysis
(IAO:0000644)

statistical analysis statistical methods, statistical methods and analysis, statistics

study limitations
(IAO:0000631)

limitations, study limitations strengths and limitations, study strengths and limitations

Table 1b. Newly identified synonyms for existing IAO terms (00006xx) from the digraph mapping of 2,441 publications. Elements in italics have previously been submitted by
us for inclusion into IAO and added in the latest release (v2020-12-09).

Hu and Sun, et al. | Auto-CORPus bioRχiv | 7

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Proposed
category

Proposed definition Proposed synonyms

disclosure
“A part of a document used to disclose any associations by authors
that might be perceived as to potentially interfere with or prevent

them from reporting research with complete objectivity.”

author disclosure statement,
declarations, disclosure, disclosure

statement, disclosures

graphical abstract
“An abstract that is a pictorial summary of the main findings

described in a document.”
central illustration, graphical

abstract, TOC image, visual abstract

highlights

“A short collection of key messages that describe the core findings
and essence of the article in concise form. It is distinct and
separate from the abstract and only conveys the results and

concept of a study. It is devoid of jargon, acronyms and
abbreviations and targeted at a broader, non-technical audience.”

author summary, editors’ summary,
highlights, key points, overview,
research in context, significance,

TOC

participants
“A section describing the recruitment of subjects into a research

study. This section is distinct from the ‘patients’ section and
mostly focusses on healthy volunteers.”

participants, sample

Table 2. Newly proposed categories of entities found in 2,441 publications in the biomedical literature that could not be mapped to existing terms in IAO. Elements in italics
have previously been submitted by us for inclusion into IAO and added in the latest release (v2020-12-09).

Known phenotype Papers Accuracy

cancer 492 0.84
gastrointestinal diseases 37 0.97

metabolic syndrome 286 0.80
neurodegenerative, psychiatric, brain illnesses 113 0.76

Table 3. Summary of results for named-entity recognition (NER) of phenotypes in MWAS papers.

others for specific tasks. For example, no new findings can be
found in an introduction however it is well suited to discover
the main phenotypes under study, only in materials/methods
can details be found on how these phenotypes are studied
and using what technologies, and findings can only be found
in results (and discussion) sections. Hence it is important
to classify these paragraphs and Auto-CORPus does this by
using the structure of the publication and the digraph. We
showed that we can further improve the assignment by train-
ing machine learning models with good accuracy to distin-
guish between different types of texts in cases where there
may be ambiguity - this can be further improved by using
a multi-class classifier and using all paragraphs. These data
are then available for use in downstream analyses using ded-
icated algorithms for entity recognition or other methods.
Auto-CORPus is able to process all HTML formatted tables
from both GWAS and MWAS corpora, as opposed to pre-
vious methods which could only operate on 86% of 3,573
tables (17). It takes Auto-CORPus on average 0.77 seconds
to process all tables within a publication compared to several
minutes if this is done manually. Moreover, Auto-CORPus
also supports parallel computing, thereby further reducing
the time needed to process publications as these can be run in
batch.

The structured JSON output is machine readable and can be
used to support data import into database. Here we used

the JSON output of Auto-CORPus in several examples to
demonstrate some potential use cases. We demonstrated that
existing algorithms trained on biomedical data can be fine-
tuned to recognize new entities such as assays and pheno-
types, which also opens up the possibility of using these data
to train new deep learning algorithms for recognizing new
entities such as metabolites (opposed to chemical entities),
SNPs and P-values, as well as identifying the relationships
between them from text. NER algorithms have difficulty with
recognizing terms that are abbreviated, therefore the list of
abbreviations found by Auto-CORPus can be used to replace
all abbreviations in the text to their definitions.

Conclusion

The Auto-CORPus package is freely available and can be de-
ployed on local machines as well as using high-performance
computing to process publications in batch. A step-by-step
guide to detail how to use Auto-CORPus is supplied with the
package. The key features of Auto-CORPus are that it:

1. outputs all text and table data in a standardized JSON
format,

2. classifies each paragraph into separate categories of
text, and

8 | bioRχiv Hu and Sun, et al. | Auto-CORPus

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Fig. 3. Example of JSON format for table data from this work (shown for Table 3).
The Auto-CORPus output for tables consists of ‘status’, ‘error message’ and ‘tables’
as top level fields, ‘tables’ has fields ‘identifier’, ‘title’, ‘columns’, ‘section’ and ‘footer’,
and ‘section’ contains ‘section name’ and ‘results’.

Fig. 4. Example of JSON output of abbreviation detection using a rule-based ap-
proach on an MWAS publication (16).

Fig. 5. Example of JSON output of named-entity recognition (NER) on an MWAS
publication (16) using a fine-tuned transformer-based deep learning model for as-
says and bidirectional long-short term memory network for chemical entity recogni-
tion.

3. is implemented in pure Python code and does not have
non-Python dependencies.

ACKNOWLEDGEMENTS
We thank Mohamed Ibrahim (University of Leicester) for identifying different configu-
rations of tables for different HTML formats, and Joy Li and Filip Makraduli (Imperial
College London) for testing the package and providing feedback.

AUTHOR CONTRIBUTIONS
TB and JMP designed and supervised the research. SS and YH developed the
pipeline and analyzed data. SS developed the initial table extraction algorithm
and implemented the phenotype recognition algorithm. YH developed the section
header standardization algorithm and implemented the abbreviation recognition al-
gorithm. SS fine-tuned the table extraction algorithm for use on non-PMC texts. TR
refined standardization of full texts and contributed algorithms for UTF-8 and UTF-
16 conversions of non-ASCII characters to Unicode. SS, YH, TB and JMP wrote
the manuscript.

FUNDING
This work has been supported by Health Data Research (HDR) UK and the Medical
Research Council via an UKRI Innovation Fellowship to TB (MR/S003703/1) and a
Rutherford Fund Fellowship to JMP (MR/S004033/1).

FOOTNOTE
ORCID: 0000-0002-4971-9003 (JMP).

Bibliography
1. Seyedmostafa Sheikhalishahi, Riccardo Miotto, Joel T Dudley, Alberto Lavelli, Fabio Rinaldi,

and Venet Osmani. Natural language processing of clinical notes on chronic diseases:
Systematic review. JMIR Med Inform, 7(2):e12239, 4 2019. ISSN 2291-9694. doi: 10.2196/
12239.

2. Ramón A-A. Erhardt, Reinhard Schneider, and Christian Blaschke. Status of text-mining
techniques applied to biomedical text. Drug Discovery Today, 11(7):315–325, 2006. ISSN
1359-6446. doi: https://doi.org/10.1016/j.drudis.2006.02.011.

3. Nikola Milosevic, Cassie Gregson, Robert Hernandez, and Goran Nenadic. A frame-
work for information extraction from tables in biomedical literature. International Jour-
nal on Document Analysis and Recognition (IJDAR), 22(1):55–78, 2 2019. doi: 10.1007/
s10032- 019- 00317- 0.

4. Peter M. Visscher, Naomi R. Wray, Qian Zhang, Pamela Sklar, Mark I. McCarthy, Matthew A.
Brown, and Jian Yang. 10 years of gwas discovery: Biology, function, and translation.
The American Journal of Human Genetics, 101(1):5 – 22, 2017. ISSN 0002-9297. doi:
https://doi.org/10.1016/j.ajhg.2017.06.005.

5. Tim Beck, Tom Shorter, and Anthony J Brookes. Gwas central: a comprehensive resource
for the discovery and comparison of genotype and phenotype data from genome-wide as-
sociation studies. Nucleic Acids Research, 48(D1):D933–D940, 10 2019. ISSN 0305-1048.
doi: 10.1093/nar/gkz895.

6. Annalisa Buniello, Jacqueline A L MacArthur, Maria Cerezo, Laura W Harris, James Hay-
hurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sol-
lis, Daniel Suveges, Olga Vrousgou, Patricia L Whetzel, Ridwan Amode, Jose A Guillen,
Harpreet S Riat, Stephen J Trevanion, Peggy Hall, Heather Junkins, Paul Flicek, Tony Bur-
dett, Lucia A Hindorff, Fiona Cunningham, and Helen Parkinson. The NHGRI-EBI GWAS

Hu and Sun, et al. | Auto-CORPus bioRχiv | 9

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://gtr.ukri.org/projects?ref=MR/S003703/1
https://gtr.ukri.org/projects?ref=MR/S004033/1
https://orcid.org/0000-0002-4971-9003
https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/


Catalog of published genome-wide association studies, targeted arrays and summary statis-
tics 2019. Nucleic Acids Research, 47(D1):D1005–D1012, 11 2018. ISSN 0305-1048. doi:
10.1093/nar/gky1120.

7. Jeremy K. Nicholson, Elaine Holmes, and Paul Elliott. The metabolome-wide association
study: A new look at human disease risk factors. Journal of Proteome Research, 7(9):
3637–3638, 2008. doi: 10.1021/pr8005099. PMID: 18707153.

8. Werner Ceusters. An information artifact ontology perspective on data collections and asso-
ciated representational artifacts. Studies in health technology and informatics, 180:68–72,
2012. ISSN 0926-9630.

9. Alan Ruttenberg, Adam Goldstein, Albert Goldfain, Barry Smith, Bjoern Peters, Carlo Tor-
niai, Chris Mungall, Chris Stoeckert, Christian A. Boelling, Darren Natale, David Osumi-
Sutherland, Gwen Frishkoff, Holger Stenzhorn, James A. Overton, James Malone, Jen-
nifer Fostel, Jie Zheng, Jonathan Rees, Larisa Soldatova, Lawrence Hunter, Mathias
Brochhausen, Matt Brush, Melanie Courtot, Michel Dumontier, Paolo Ciccarese, Pat
Hayes, Philippe Rocca-Serra, Randy Dipert, Ron Rudnicki, Satya Sahoo, Sivaram Ara-
bandi, Werner Ceusters, William Duncan, William Hogan, and Yongqun (Oliver) He. Infor-
mation artefact ontology (v2020-06-10). https://raw.githubusercontent.com/
information-artifact-ontology/IAO/v2020-06-10/iao.owl, 2020. Ac-
cessed: 2020-06-21.

10. A. Ghazvinian, N. F. Noy, and M. A. Musen. Creating mappings for ontologies in
biomedicine: simple methods work. AMIA Annu Symp Proc, 2009:198–202, 11 2009.

11. Peter N. Robinson, Sebastian Köhler, Sebastian Bauer, Dominik Seelow, Denise Horn, and
Stefan Mundlos. The human phenotype ontology: A tool for annotating and analyzing hu-
man hereditary disease. The American Journal of Human Genetics, 83(5):610–615, 2008.

ISSN 0002-9297. doi: https://doi.org/10.1016/j.ajhg.2008.09.017.
12. Ariel Schwartz and Marti Hearst. A simple algorithm for identifying abbreviation definitions in

biomedical text. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing,
4:451–62, 02 2003. doi: 10.1142/9789812776303_0042.

13. Katrin Fundel, Robert Küffner, and Ralf Zimmer. RelEx—Relation extraction using de-
pendency parse trees. Bioinformatics, 23(3):365–371, 12 2006. ISSN 1367-4803. doi:
10.1093/bioinformatics/btl616.

14. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So,
and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for
biomedical text mining. Bioinformatics, 36(4):1234–1240, 09 2019. ISSN 1367-4803. doi:
10.1093/bioinformatics/btz682.

15. Peter Corbett and John Boyle. Chemlistem: chemical named entity recognition using
recurrent neural networks. Journal of Cheminformatics, 10(1), 12 2018. doi: 10.1186/
s13321- 018- 0313- 8.

16. Charles R. Evans, Alla Karnovsky, Melissa A. Kovach, Theodore J. Standiford, Charles F.
Burant, and Kathleen A. Stringer. Untargeted LC–MS metabolomics of bronchoalveolar
lavage fluid differentiates acute respiratory distress syndrome from health. Journal of Pro-
teome Research, 13(2):640–649, 12 2013. doi: 10.1021/pr4007624.

17. Nikola Milosevic, Cassie Gregson, Robert Hernandez, and Goran Nenadic. Disentangling
the structure of tables in scientific literature. In Elisabeth Métais, Farid Meziane, Mohamad
Saraee, Vijayan Sugumaran, and Sunil Vadera, editors, Natural Language Processing and
Information Systems, pages 162–174. Springer International Publishing, 2016. ISBN 978-
3-319-41754-7. doi: https://doi.org/10.1007/978- 3- 319- 41754- 7_14.

10 | bioRχiv Hu and Sun, et al. | Auto-CORPus

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425887doi: bioRxiv preprint 

https://raw.githubusercontent.com/information-artifact-ontology/IAO/v2020-06-10/iao.owl
https://raw.githubusercontent.com/information-artifact-ontology/IAO/v2020-06-10/iao.owl
https://doi.org/10.1101/2021.01.08.425887
http://creativecommons.org/licenses/by-nc-nd/4.0/