Submitted 17 May 2018
Accepted 29 August 2018
Published 17 September 2018

Corresponding author
Anne E. Thessen,
annethessen@gmail.com

Academic editor
Alessandro Frigeri

Additional Information and
Declarations can be found on
page 12

DOI 10.7717/peerj-cs.164

Copyright
2018 Thessen et al.

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

20 GB in 10 minutes: a case for linking
major biodiversity databases using an
open socio-technical infrastructure
and a pragmatic, cross-institutional
collaboration
Anne E. Thessen1,5, Jorrit H. Poelen2, Matthew Collins3 and Jen Hammock4

1 Ronin Institute for Independent Scholarship, Montclair, NJ, USA
2 Independent consultant, Oakland, CA, USA
3 University of Florida, Gainesville, FL, USA
4 National Museum of Natural History, Washington, DC, USA
5 Oregon State University, Corvallis, OR, USA

ABSTRACT
Biodiversity information is made available through numerous databases that each
have their own data models, web services, and data types. Combining data across
databases leads to new insights, but is not easy because each database uses its own
system of identifiers. In the absence of stable and interoperable identifiers, databases
are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy
process relies on accessible versions of nomenclatural authorities and fuzzy-matching
algorithms. To approach the challenge of linking diverse data, more than technology
is needed. New social collaborations are needed to make concrete progress on finding
relationships between biodiversity datasets. One example is the Global Unified Open Data
Architecture (GUODA), which combines the skills of computer engineers from iDigBio,
server resources from the Advanced Computing and Information Systems (ACIS) Lab,
global-scale data presentation from EOL, and the work of independent developers and
researchers. This paper will discuss a technical solution developed
by the GUODA collaboration for faster linking across databases with a use case
linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA
infrastructure is a 12-node, high performance computing cluster made up of about
192 threads with 12 TB of storage and 288 GB memory. Using GUODA, 20 GB
of compressed JSON from Wikidata was processed and linked to GloBI in about
10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata
and GloBI were linked by comparing graphs of biodiversity identifiers external to each
system. This method added 119,957 Wikidata links to GloBI, an increase
of 13.7% in GloBI’s outgoing name links. Wikidata and GloBI were compared
to Open Tree of Life Reference Taxonomy to examine consistency and coverage. The
process of parsing Wikidata, Open Tree of Life Reference Taxonomy and GloBI archives
and calculating consistency metrics was done in minutes on the GUODA platform. As a
model collaboration, GUODA has the potential to revolutionize biodiversity science by
bringing diverse technically minded people together with high performance computing
resources that are accessible from a laptop or desktop. However, participating in such
a collaboration still requires basic programming skills.

How to cite this article Thessen et al. (2018), 20 GB in 10 minutes: a case for linking major biodiversity databases using an open
socio-technical infrastructure and a pragmatic, cross-institutional collaboration. PeerJ Comput. Sci. 4:e164; DOI 10.7717/peerj-cs.164

Subjects Bioinformatics, Databases
Keywords Biodiversity, Collaboration, Identifiers, Wikidata, Graph, Linking

INTRODUCTION
Biodiversity databases provide global access to information about species via the Web.
These databases contain information as varied as observation records, text descriptions,
images, maps, genetic sequences, phylogenetic trees, and trait data (Table 1). All of these
data become much more useful if they can be linked. Many biodiversity databases share
information with each other (Bingham et al., 2017), but creating the links can be very
difficult for several reasons including the size of the databases, the heterogeneous nature
of the data, and the heterogeneous nature of the identifiers used by the different resources
(Page, 2008).

The more popular methods for linking biodiversity databases include taxonomic names,
LSID (Life Sciences Identifier), and DOI (Digital Object Identifier). The Encyclopedia of
Life uses taxonomic names to automatically aggregate data from hundreds of providers
(Parr et al., 2014). BioNames links data using LSID, DOI, handles, bibliographic citations,
and taxonomic names (Page, 2013). The iPhylo LinkOut service mapped identifiers used
by the NCBI taxonomy database (which provides the taxonomic backbone for GenBank)
to Wikipedia pages using taxonomic names, including synonyms (Page, 2011). TBMap
provides links from TreeBase across several taxonomic databases, such as ITIS and NCBI
(Page, 2007). This mapping was also achieved using taxonomic names, but in some cases
GenBank accession numbers and museum specimen codes were available as a supplement.
The use of taxonomic names to aggregate data can lead to errors and requires significant a
priori knowledge, either in the form of curators or an authoritative nomenclature.

Many databases expose their own internal identifiers, such as the WoRMS Aphia ID, so
others can link their data to those resources within their own systems, often by providing a
URL. Databases like WoRMS provide web services that allow users to look up an identifier
for a taxon in question, one at a time. While this makes linking easier, it is still difficult to
scale across all databases. For example, a list of all the taxon identifiers in EOL is 300 MB
compressed. No system of identifiers is universal across biodiversity databases and none of
them are easy to implement at scale.

While the data would be much more useful if linked, there is a lack of tools for linking
data across databases at scale. Most mappings are done at great expense and then are
made available as a separate file or incorporated into the resources themselves. LinkOut,
BioNames, GBIF, and EOL take more than a day to link across their entire body of
aggregated content. This paper discusses links made between GloBI and Wikidata (WD)
in 10 min using GUODA, a high performance computing system available for analysis of
large biodiversity data sets.

Thessen et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.164 2/15


Table 1 Selected biodiversity databases and their size.

Database                         Data Quantity (Jan 2018)             Size (compressed)

GBIF                             964,547,793 occurrence records       139 GB
Catalogue of Life/ITIS           1.7 million taxa                     2.9 GB
GloBI                            3,363,528 interactions               206 MB
iDigBio                          106,922,498 specimen records         35.5 GB
GenBank                          206,293,625 sequences                3 TB
Biodiversity Heritage Library    53,739,062 pages                     2.7 GB
WoRMS                            243,323 marine species               71 MB
OpenTree                         2,722,024 taxa and 6,810 trees       189 MB
EOL TraitBank                    Over 11 million records              46 GB uncompressed
EOL                              7,705,748 data objects (May 2017)    10 TB uncompressed
Wikidata                         42,648,426 data items                20 GB

METHODS
Description of Resources
GUODA
Following an iDigBio hack-a-thon in June 2015, GUODA was created as a pragmatic way to
compute over multiple large biodiversity databases in a mutually beneficial collaboration
between iDigBio, EOL, Kew Garden, and independent developers. Catalyzed by various
presentations at conferences, hardware provided by ACIS, 20+ meetings, and several
prototypes (e.g., http://effechecka.org, https://gimmefreshdata.github.io), a general access
biodiversity data integration and analysis environment was created. This environment, with
the aggregated experience and perspectives of all the collaborators, was used to produce
the results of this paper.

Housed at the ACIS Lab at the University of Florida, the GUODA infrastructure consists
of 12 IBM HS22 blades, each with 8 cores, 24 GB of memory, and 1 TB of storage.
This makes a total of 192 threads, 288 GB of memory and 12 TB of disk space available for
processing jobs using Apache Spark (Fig. 1; Zaharia et al., 2016). The cluster is managed
under Apache Mesos (Hindman et al., 2011), which is a distributed scheduling system for
periodic jobs. For long running processes, such as web APIs or databases, the Marathon
(https://github.com/mesosphere/marathon) framework is run within Mesos. Marathon
facilitates running always-up services with monitoring, automatic deployment of code,
re-scaling to multiple nodes, and other management features. Mesos is responsible for
accepting requests to start Spark frameworks (the processes that do the actual computation
and may span multiple servers) and for allocating the resources requested by each framework.

Hadoop HDFS (Shvachko et al., 2010) is installed outside of Mesos directly on all 12
nodes of the cluster and provides redundant parallel shared storage to all nodes as well as
to the Jupyter notebook (Kluyver et al., 2016) server that provides a programming interface
to end users. Each node has 1 TB of local disk storage, which yields a total of about 3.5 TB
of usable storage space for data files in Apache Parquet format. Spark is aware of the
placement of data on an HDFS cluster and will divide processing among nodes in a way that
prefers to read and write data that is local to the node to minimize network traffic.
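
As a quick check, the cluster totals quoted above follow from the per-node specification. The sketch below is illustrative only. It assumes two hardware threads per core, a factor that the text implies (8 cores per blade, 192 threads overall) but does not state, and all variable names are hypothetical.

```python
# Per-node figures quoted in the text.
NODES = 12
CORES_PER_NODE = 8
MEMORY_GB_PER_NODE = 24
DISK_TB_PER_NODE = 1

# Assumption: two hardware threads per core, inferred from
# 192 threads / (12 nodes * 8 cores); the paper does not state this.
THREADS_PER_CORE = 2

total_threads = NODES * CORES_PER_NODE * THREADS_PER_CORE
total_memory_gb = NODES * MEMORY_GB_PER_NODE
total_disk_tb = NODES * DISK_TB_PER_NODE

# Share of the raw disk space that the text reports as usable (about 3.5 TB).
usable_fraction = 3.5 / total_disk_tb

print(total_threads, total_memory_gb, total_disk_tb, round(usable_fraction, 2))
```

Under these assumptions the totals match the 192 threads, 288 GB of memory, and 12 TB of disk space given in the text, and a little under a third of the raw disk space is usable for data files.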


Figure 1 GUODA Infrastructure. Data from biodiversity databases are loaded into GUODA as Parquet
files (Storage). When a user working in a Jupyter Notebook (Front-end Server) triggers a job interactively
or via GitHub and Jenkins, the data are analyzed using Apache Spark (Compute Cluster). This infrastructure
allows a user working from a laptop or desktop to compute over multiple biodiversity databases at
once. All logos are provided by the organizations they represent and are used with permission.

Full-size DOI: 10.7717/peerjcs.164/fig-1

Wikidata
Wikidata (WD) is a free and open knowledge base that provides structured data for
Wikimedia projects (http://www.wikidata.org; Vrandečić & Krötzsch, 2014). Similar to
Wikipedia, anyone can read or edit the resource. Information, including links to other
resources, can be added to Wikidata using bots and batch imports through their Data
Import Hub (https://www.wikidata.org/wiki/Wikidata:Data_Import_Hub). Wikidata
information about taxa can be conceptualized as a graph linking related taxa to each other
and identifiers from other databases to the taxa they represent (Fig. 2). Every taxon in
Wikidata is issued a Wikidata identifier. While a public Wikidata SPARQL endpoint and
associated tools (Voß, 2016) exist, these APIs are not suitable for batch processing. For
example, when attempting to retrieve all taxa using the public SPARQL endpoint, a query
timeout error was reported. In addition, the APIs are expected to return different results
over time, so reproducing results is difficult if not impossible. This is why we used a JSON
archive to access Wikidata (Wikidata, 2018).
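
To illustrate what reading taxa from such a JSON archive can look like, the following minimal sketch parses one line of a dump at a time and emits (external identifier, Wikidata identifier) rows. It is not the code used on GUODA. It assumes that the archive holds one entity per line inside a JSON array, that taxon items are those whose "instance of" (P31) value is Q16521, and that the listed property IDs carry the external identifiers. The property IDs, identifier prefixes, and function names are assumptions that should be verified against Wikidata before use.

```python
import json

# Assumptions (not stated in the paper): taxa are items with
# "instance of" (P31) = taxon (Q16521), and these properties hold the
# external identifiers. The prefixes are illustrative labels only.
TAXON_CLASS = "Q16521"
EXTERNAL_ID_PROPS = {
    "P685": "NCBI",
    "P815": "ITIS",
    "P846": "GBIF",
    "P830": "EOL",
    "P1391": "IF",
    "P938": "FISHBASE",
    "P850": "WORMS",
}


def claim_values(entity, prop):
    """Return the plain values of all value-bearing claims for one property."""
    values = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" and "somevalue" statements
        value = snak["datavalue"]["value"]
        # Item-valued claims are objects with an "id"; external IDs are strings.
        values.append(value.get("id") if isinstance(value, dict) else value)
    return values


def taxon_links(line):
    """Parse one dump line into (external identifier, Wikidata identifier) rows."""
    line = line.strip().rstrip(",")
    if line in ("", "[", "]"):
        return []  # brackets that open and close the surrounding JSON array
    entity = json.loads(line)
    if TAXON_CLASS not in claim_values(entity, "P31"):
        return []
    return [
        (f"{prefix}:{value}", f"WD:{entity['id']}")
        for prop, prefix in EXTERNAL_ID_PROPS.items()
        for value in claim_values(entity, prop)
    ]


def _claim(value):
    """Build a minimal claim structure for the hand-made sample below."""
    return {"mainsnak": {"snaktype": "value", "datavalue": {"value": value}}}


# Hand-built stand-in for one dump line (not copied from the real archive).
sample = json.dumps({
    "id": "Q140",
    "claims": {"P31": [_claim({"id": "Q16521"})], "P815": [_claim("183803")]},
}) + ","

print(taxon_links(sample))
```

Because each line is parsed independently, the same function could in principle be mapped over the lines of the archive in parallel, which is the property that makes a line-oriented dump convenient for batch processing.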

GloBI
GloBI is a database of biotic interactions recorded as Organism_1:has_relationship:
Organism_2 (Poelen, Simons & Mungall, 2014) per individual interaction observation or
claim. GloBI uses a combination of web APIs, taxon archives, and name correction/parsing
methods in an attempt to link names from species interaction datasets to existing sources.
Spatial, temporal, and taxonomic coverage in GloBI is sparse and unevenly distributed (see
Eltonian shortfall, Hortal et al., 2015), with spatial concentrations in Europe and North
America and taxonomic concentrations in Arthropods, Fungi, and Plants. Only 8% of
taxa in ITIS are also in GloBI. A detailed technical description of the GloBI data model and
services has been published elsewhere (Poelen, Simons & Mungall, 2014). GloBI maintains
a graph of related taxa and their identifiers from different databases (Poelen, Simons &
Mungall, 2014). GloBI does not introduce its own taxon IDs. Instead, it records how names
were mapped from a source name into an external taxonomic database using a taxon graph
(see https://globalbioticinteractions.org/references). We used GloBI Taxon Graph v0.4.2
(Poelen, 2018b). This taxon graph links names and identifiers hierarchically and across
resources.

Figure 2 Frequency of Wikidata taxa linked to biodiversity databases. This graph shows the
proportion of the approximately 2.3 million Wikidata taxa with zero, one, two, etc. links to
external biodiversity databases (NCBI, ITIS, GBIF, EOL, FishBase, Index Fungorum and
iNaturalist). The majority of Wikidata taxa had at least two links. A little more than 15% of
Wikidata taxa had no links to external biodiversity databases.

Full-size DOI: 10.7717/peerjcs.164/fig-2

Open Tree of Life Reference Taxonomy
To assess taxonomic ID coverage, the taxa in Wikidata and GloBI were compared to Open
Tree of Life Reference Taxonomy (OTT 3.0; http://files.opentreeoflife.org/ott/ott3.0/ott3.0.tgz;
Rees & Cranston, 2017). OTT was built using an automated algorithm with informed
choices to aggregate and link existing naming authorities into a reasonably comprehensive,
artificial taxonomy. OTT contains 4,385,000 external links for 3,594,550 taxa aggregated
and linked over five authorities (i.e., GBIF, IF, SILVA, WoRMS, NCBI).

Linking Wikidata And GloBI
Both Wikidata and GloBI have taxon graphs that map to identifiers from external databases
(e.g., NCBI, ITIS, GBIF, EOL, Index Fungorum (IF), FishBase and WoRMS). A Wikidata
dump was loaded into GUODA and processed to extract taxon items (about 2.3 million)
and their links to NCBI, ITIS, GBIF, EOL, IF, FishBase and WoRMS. This was the Wikidata
taxon graph. This taxon graph was loaded into a lookup table where each row contained
an NCBI, ITIS, GBIF, EOL, IF, FishBase or WoRMS identifier and the corresponding
Wikidata identifier. The GloBI taxon graph was already in a similarly formatted lookup
table. The taxon graphs in GloBI and Wikidata were mapped to each other with a join of
the NCBI, ITIS, GBIF, EOL, IF, FishBase or WoRMS identifiers of the respective lookup
tables (Fig. 3). So, for each external identifier that occurred in both Wikidata and GloBI,
the corresponding Wikidata identifier was inserted in the GloBI lookup table. For instance,
Wikidata taxon item Q140 (https://www.wikidata.org/wiki/Q140 accessed on
30 March 2018; Panthera leo) points to ITIS:183803. With the matching algorithm used,
GloBI now considers WD:Q140 to be linked to all taxon entries that are considered the
same as, or synonymous to, ITIS:183803.

Figure 3 Mapping taxon graphs across resources. Both GloBI and Wikidata contain hierarchical taxon
graphs with each taxon having a “star” of external identifiers. The taxa are mapped across these resources
by comparing the portion of the graph with the external identifiers between nodes. In this example, the
names and identifiers match perfectly, so a relationship between Panthera leo in GloBI and Panthera leo in
Wikidata is inferred.

Full-size DOI: 10.7717/peerjcs.164/fig-3
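
The identifier join described in this section can be sketched in a few lines. The example below is a small-scale stand-in written in plain Python rather than the Spark job that ran on GUODA. The two-column table layouts, the function name, and every row except the Q140 to ITIS:183803 mapping (taken from the text) are simplified, hypothetical placeholders.

```python
from collections import defaultdict

# Wikidata lookup table: (external identifier, Wikidata identifier).
wikidata_rows = [
    ("ITIS:183803", "WD:Q140"),       # mapping quoted in the text
    ("EXAMPLE:1", "WD:Q0"),           # made-up row with no GloBI counterpart
]

# Simplified GloBI lookup table: (GloBI-side taxon key, external identifier).
globi_rows = [
    ("Panthera leo", "ITIS:183803"),
    ("Panthera leo", "EXAMPLE:2"),    # made-up identifier unknown to Wikidata
    ("Unmatched taxon", "EXAMPLE:3"),  # made-up taxon without any match
]


def link_by_external_id(globi_rows, wikidata_rows):
    """Inner-join the two lookup tables on their shared external identifier."""
    wd_by_external = defaultdict(set)
    for external_id, wd_id in wikidata_rows:
        wd_by_external[external_id].add(wd_id)

    links = set()
    for globi_key, external_id in globi_rows:
        # Only identifiers present in both tables produce a link.
        for wd_id in wd_by_external.get(external_id, ()):
            links.add((globi_key, wd_id))
    return sorted(links)


print(link_by_external_id(globi_rows, wikidata_rows))
```

No name strings are compared at any point. A taxon is linked only because both tables already point to the same external identifier, which is why the quality of the result depends on the mappings that each resource maintains.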

This final joined graph was saved into HDFS as a Parquet file and linked entries were
appended to GloBI Taxon Graph from v0.3.0 onward (Poelen, 2018c). In addition, the GloBI
ingestion engine was updated to automatically perform the taxon graph matching for future
updates. This linkage enabled lookups of diet items of lions by Wikidata identifier via
https://www.globalbioticinteractions.org/?interactionType=eats&sourceTaxon=WD%3AQ140
and facilitates future integration of species interaction data with Wikidata.
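
A lookup URL of the kind shown above can be assembled programmatically. The sketch below uses only the two query parameters that appear in the URL quoted in the text. The function name is hypothetical, and no other GloBI parameters are assumed.

```python
from urllib.parse import urlencode

GLOBI_BASE_URL = "https://www.globalbioticinteractions.org/"


def interaction_query_url(source_taxon, interaction_type="eats"):
    """Build a GloBI lookup URL for a prefixed taxon identifier."""
    # urlencode percent-encodes the colon in identifiers such as "WD:Q140".
    query = urlencode({
        "interactionType": interaction_type,
        "sourceTaxon": source_taxon,
    })
    return f"{GLOBI_BASE_URL}?{query}"


print(interaction_query_url("WD:Q140"))
```

For the lion example this reproduces the URL given in the text, with the colon of WD:Q140 encoded as %3A.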


Taxon graph overlap and consistency
OTT, Wikidata, and GloBI taxon graphs maintain links to GBIF, IF, NCBI and WoRMS
identifiers (referred to as external identifiers). The taxon graphs are considered to (partially)
overlap if individual taxon IDs from different graphs have at least one external identifier in
common. In addition, a taxon graph is inconsistent if a taxon ID links to multiple external
identifiers from the same identifier scheme. Similarly, overlapping taxon IDs are said to be
inconsistent if they link to multiple external identifiers from the same identifier scheme.
Where overlap is a measure for taxon graph similarity, consistency can be seen as a way to
measure the relative quality of (overlapping) taxon graphs.

For instance, let’s say that OTT:1087695 is linked to NCBI:191633, WoRMS:156905, and
GBIF:1449280. In addition, WD:Q7247420 (https://www.wikidata.org/wiki/Q7247420)
points to WoRMS:156905, GBIF:1449280, and NCBI:191633. This would mean that the links
of these OTT and WD IDs overlap and are consistent, because they do not point to different
names in the same naming scheme (Fig. 4). However, when considering the GloBI taxon “ID”
“GLOBI:null@Procladius sp1 M_PL_014”, multiple links to external IDs were found
(e.g., NCBI:1981571, NCBI:1981569, NCBI:1981572, NCBI:1981573, NCBI:1981574,
NCBI:1981570). In this case, the GloBI taxon ID is inconsistent. The high number of
external NCBI identifiers is due to the NCBI taxonomy containing many “provisional”
taxa derived from environmental samples.
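
The overlap and consistency definitions can be expressed as two small predicates. The sketch below is one possible reading of those definitions, not the scripts used for the analysis. The function names are hypothetical, and scheme prefixes are compared case-insensitively as a simplifying assumption. The identifiers are the ones quoted in the examples above.

```python
from collections import Counter


def normalize(identifier):
    """Upper-case the scheme prefix so WoRMS:1 and WORMS:1 compare equal."""
    prefix, value = identifier.split(":", 1)
    return f"{prefix.upper()}:{value}"


def overlaps(ids_a, ids_b):
    """Two taxon IDs overlap if they share at least one external identifier."""
    return bool({normalize(i) for i in ids_a} & {normalize(i) for i in ids_b})


def is_consistent(*id_sets):
    """Consistent means at most one distinct identifier per identifier scheme."""
    combined = {normalize(i) for ids in id_sets for i in ids}
    per_scheme = Counter(i.split(":", 1)[0] for i in combined)
    return all(count == 1 for count in per_scheme.values())


ott_1087695 = ["NCBI:191633", "WoRMS:156905", "GBIF:1449280"]
wd_q7247420 = ["WORMS:156905", "GBIF:1449280", "NCBI:191633"]
globi_procladius = [
    "NCBI:1981571", "NCBI:1981569", "NCBI:1981572",
    "NCBI:1981573", "NCBI:1981574", "NCBI:1981570",
]

print(overlaps(ott_1087695, wd_q7247420))       # shared identifiers
print(is_consistent(ott_1087695, wd_q7247420))  # one identifier per scheme
print(is_consistent(globi_procladius))          # six NCBI identifiers
```

Passing a single identifier set to `is_consistent` checks one taxon ID on its own, while passing two sets checks a pair of overlapping taxon IDs, which mirrors the two uses of "inconsistent" in the definitions.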

Data access
All of the input data sets can be found at: https://doi.org/10.5281/zenodo.755513 (GloBI
Taxon Graph), http://files.opentreeoflife.org/ott/ott3.0/ott3.0.tgz (Open Tree of Life
Taxonomy), and http://doi.org/10.5281/zenodo.1211767 (Wikidata).

A selection of intermediary and result datasets is available online (Poelen, 2018d; Poelen,
2018a).

All of the scripts used to make the statements in the results can be found at
https://github.com/bio-guoda/guoda-datasets/tree/master/wikidata, with instructions
on how to reproduce the analysis.

RESULTS
After 10 min of processing, GloBI was linked to Wikidata using pre-existing identifier
mappings. The Wikidata dump was 20 GB of compressed JSON with 40–50 million data
items. It took about 10 min for GUODA to extract taxa (about 2.3 million) and their
links in Wikidata and then less than one minute to map the Wikidata taxon graph to
the GloBI taxon graph. The 119,957 Wikidata links that were added to GloBI increased
its outgoing name links by 13.7% (Poelen, 2018d). Eighty-seven percent (86.7%) of the
external identifiers in Wikidata overlap with the external identifiers in OTT (Fig. 5).
Eighty-six percent (86.1%) of the external identifiers in GloBI overlap with the external
identifiers in OTT (Fig. 5). Wikidata provided mappings for 65.2% of the external identifiers
in GloBI (Fig. 5). Out of the 77,000 external identifiers that occurred only in OTT and
GloBI, only 56 were inconsistent
(https://github.com/bio-guoda/guoda-datasets/blob/master/wikidata/inconsistentNameIdsGloBI_OTT.tsv).
These 56 links pointed to seven OTT “taxa”. No inconsistent links were found between WD
and GloBI. Out of the 38,000 links only found in GloBI, 9,000 were inconsistent
(https://github.com/bio-guoda/guoda-datasets/blob/master/wikidata/inconsistentNameIdsGloBIOnly.tsv).
The OTT, Wikidata, and GloBI identifier graphs related to this coverage analysis are
available as a 74 MB compressed tab-separated-values file consisting of about 12 million
identifier mapping records (see https://zenodo.org/record/1213477/files/links-globi-wd-ott.tsv.gz).
The resulting Wikidata taxon objects were merged into GloBI’s Taxon Graph (Poelen, 2018d).

Figure 4 Inconsistent graph matching. When overlapping taxon graphs include multiple name strings,
the graph is inconsistent. In this example the Procladius genus is present in Wikidata (red), Open Tree
(textured fill), and GloBI (blue). The Wikidata, OTT, and GloBI taxon graphs overlap on the NCBI and
the GBIF identifiers (purple and textured fill). The WoRMS identifier overlaps the OTT and Wikidata
taxon graphs (red and textured fill). The Procladius graph in GloBI includes NCBI identifiers with a
different name string, Procladius (Holotanypus), which indicates inconsistent usage.

Full-size DOI: 10.7717/peerjcs.164/fig-4


Figure 5 Identifier overlap between Wikidata (WD), OTT, and GloBI. This Venn Diagram shows the
number of overlapping external identifiers that can be found in one of three databases. Only 207,958
external IDs can be found in all three. These consisted of 22,637 WoRMS links, 71,980 NCBI links, 103,300
GBIF links and 10,040 IF links. Over two million IDs are only known to one of the three databases. OTT
contains more than half of the external IDs in Wikidata and in GloBI, but neither contains half of the
external IDs in OTT. Mapping Wikidata to GloBI matched 65.2% of the external IDs in GloBI.

Full-size DOI: 10.7717/peerjcs.164/fig-5

In order for a mapping to be considered consistent, there can only be one identifier per
resource included in each local graph. Thus, after removing the inconsistent identifiers,
the external ID overlap can be interpreted as an estimate of the number of shared taxon
names between two databases (Table 2). This cannot be interpreted as total taxa in each
resource.

DISCUSSION
GUODA is a high performance computing resource for biodiversity science that provides
scalable solutions for working with large data sets in a collaborative, online environment.
The 10 min processing time for 20 GB of compressed JSON is far faster than any current
mapping method used in biodiversity; however, it does benefit from the mapping
already completed inside Wikidata. For example, the Wikidata entry for Panthera leo
(https://www.wikidata.org/wiki/Q140) has 25 links to external databases, not all of them
biodiversity-related. This linking may be based on matching name strings. Other efforts
using name-string matching to link biodiversity databases take much longer to map
resources together. For instance, EOL takes more than a day to map the content it receives
from providers to a unified classification (J Rice, pers. comm., 2018). Similarly, the taxon
matching in BioNames and LinkOut took days to complete (R Page, pers. comm., 2018).
Projects like OTT, Wikidata, and GloBI that keep identifier-based taxonomic graphs make
it easier to link databases at scale.

Table 2 Absolute and relative link counts from OTT, WD, and GloBI compared to WoRMS, GBIF,
Index Fungorum (IF), and NCBI.

Resource   WoRMS             GBIF               IF               NCBI               Combined

OTT        327,929 (100%)*   2,451,566 (100%)   276,262 (100%)   1,355,207 (100%)   4,410,964 (100%)
WD         288,110 (88%)     1,779,789 (73%)    76,497 (28%)     410,092 (30%)      2,554,488 (58%)
GloBI      68,565 (21%)      315,173 (13%)      33,400 (12%)     704,361 (52%)      1,121,499 (25%)

Notes.
*Overlap between each resource and OTT is set at 100%. The other percentages give a relative estimate of size and scale and
should not be interpreted as overlapping IDs.
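
The relative percentages in Table 2 can be recomputed from the absolute link counts by dividing each count by the corresponding OTT count. The sketch below does this with the published numbers. The function names are hypothetical, and rounding to whole percentages is assumed to match the table.

```python
# Absolute link counts from Table 2.
counts = {
    "OTT": {"WoRMS": 327_929, "GBIF": 2_451_566, "IF": 276_262, "NCBI": 1_355_207},
    "WD": {"WoRMS": 288_110, "GBIF": 1_779_789, "IF": 76_497, "NCBI": 410_092},
    "GloBI": {"WoRMS": 68_565, "GBIF": 315_173, "IF": 33_400, "NCBI": 704_361},
}


def with_combined(row):
    """Add the Combined column as the sum over the four identifier schemes."""
    return {**row, "Combined": sum(row.values())}


def relative_to_ott(counts):
    """Express every count as a whole-number percentage of the OTT count."""
    totals = {name: with_combined(row) for name, row in counts.items()}
    baseline = totals["OTT"]
    return {
        name: {col: round(100 * value / baseline[col]) for col, value in row.items()}
        for name, row in totals.items()
    }


for resource, row in relative_to_ott(counts).items():
    print(resource, row)
```

As the table note states, these ratios compare the size of each resource's link set with that of OTT. They do not measure how many identifiers the resources actually share.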

Despite the notoriously poor nature of taxon names as identifiers, they are still commonly
used to link biodiversity data. A much-discussed solution has been the use of universal,
unique, persistent, resolvable identifiers across the biodiversity data landscape, but the
social barrier to a universal identifier system has, thus far, proven insurmountable (Nimis,
2001; Hardisty, Roberts & The Biodiversity Informatics Community, 2013). Rather than rely
on name strings or a universal identifier system, this method uses the graph of identifiers to
map taxa across two databases. This identifier-based method has the potential to be faster
and easier than name-string matching without some of the social difficulties of a single
identifier system.

Most biodiversity databases and nomenclatural authorities expose their data in
idiosyncratic ways that are not suitable for batch processing. If data sources published
their taxon identifier graph as a lookup table (as described in this paper), integrating across
databases would be much easier (Fig. 6). Currently, users have to learn a unique format for
every data source. Lookup tables have the advantage of being easy to version and
integrate.
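
To show how little code a published lookup table would require of its users, the sketch below reads a tab-separated table with the four column names given in the Figure 6 caption. The rows are invented placeholders, the function name is hypothetical, and the assumption that such a table is distributed as a TSV file with a header row is made for illustration only.

```python
import csv
import io

# Column names follow the Figure 6 caption; the rows are invented placeholders.
SAMPLE_TSV = (
    "providedTaxonId\tprovidedTaxonName\tresolvedTaxonId\tresolvedTaxonName\n"
    "SRC:1\tExample name A\tEXT:100\tExample name A\n"
    "SRC:2\tExample name B\tEXT:200\tExample name B\n"
    "SRC:2\tExample name B\tOTHER:7\tExample name B\n"
)


def read_lookup_table(handle):
    """Index resolved identifiers by the identifier that the source provided."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        table.setdefault(row["providedTaxonId"], []).append(row["resolvedTaxonId"])
    return table


lookup = read_lookup_table(io.StringIO(SAMPLE_TSV))
print(lookup["SRC:2"])
```

Because the format is a flat table, the same few lines would work for any data source that adopts it, and successive versions of a table can be compared with ordinary text tools.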

In addition to fast linking of biodiversity databases, comparison of identifier graphs
may be a scalable way to find inconsistencies, especially when multiple biodiversity
databases/identifiers are included. By linking GloBI to OTT and WD, inconsistent names
or false positive name matches were detected by considering the (lack of) overlap of GloBI
names with OTT and WD external identifier schemes. These inconsistencies might be
introduced by a dataset or a name resolution method that produces ambiguous results. In
addition, inconsistencies can indicate a disputed/outdated name like
“GLOBI:null@Senecio pectinatus” which maps to GBIF:8317096 and GBIF:8414746. This would be
considered an inconsistent mapping and suggests that Senecio pectinatus is an outdated name.
A related method using a variation of the PageRank algorithm (Page et al., 1999; Brin & Page, 1998)
to identify the most legitimate taxonomic name to apply to a fossilized specimen (Huber
& Klump, 2009) gives further legitimacy to this concept. Given this speed and the promise
of scalability, a near-real-time name consistency check can be implemented to
detect inconsistencies across various systems in the biodiversity data ecosystem introduced
by integration bugs, taxonomy updates or differences of interpretation.

Figure 6 Example lookup table. This figure is an excerpt from the GloBI lookup table. The
providedTaxonId and the providedTaxonName come from the taxon graph external to GloBI. The
resolvedTaxonId and the resolvedTaxonName are the names and identifiers that are already mapped within GloBI.
Each row represents a mapping from a taxon in an external source (Pluvialis obscura) to an identifier from
a source already in GloBI, which does not mint its own identifiers.

Full-size DOI: 10.7717/peerjcs.164/fig-6

GUODA has been available since 2015 and contains data dumps from GBIF, EOL
TraitBank, iNaturalist, iDigBio, and BHL which are all accessible via a Jupyter notebook,
web services, or Apache Spark shell on the command line. Despite its computing power
and successful demonstrations at major conferences, GUODA has not been used to its full
potential. Learning new programming and computing paradigms, as well as
developing an understanding of large dataset workflows, seems to be a barrier to many in
the biodiversity community. Despite this, GUODA is being used in several capacities. The
Effechecka application generates taxonomic checklists using a web interface that allows a
user to draw a polygon on a map and returns a deduplicated list of taxa aggregated from
observation data held in GBIF, iNaturalist, etc. The EOL Freshdata project uses it to
detect new occurrence records given geospatial, taxonomic, and data source
constraints and notifies interested users via email. Several workshops have used it to teach
Spark programming skills to students at the University of Florida.

Future work on the GUODA infrastructure includes training and evaluating neural
network models on image data, containerization of the GUODA components to allow the
system to be run in additional data centers, and refinement of the end-user interface to
integrate programming, source code, and publication to make research more reproducible.
GUODA’s most impactful contribution has likely been the availability of readily formatted
biodiversity data, and new data sets will continue to be added to the collaboration platform,
enabling domain experts and technical experts to answer new questions in the future.

The bottlenecks in processing for the Hadoop Distributed File System (HDFS) and Apache
Spark are the number of CPUs, amount of memory, and available storage space allocated to
the computer cluster. Both HDFS and Spark are designed to scale horizontally by adding
commodity servers (aka nodes) to increase the processing power, working memory, and
storage space. Thus, this
problem is immediately solvable. Internet bandwidth to transfer the data archives from
Open Tree of Life, Wikidata, and GloBI does not scale and is not something that can be
addressed solely within our research group. At the moment, it takes longer to download
the Wikidata resource than it does to run the linking process discussed in this manuscript.
Socio-technical bottlenecks include resource-limitation and user education. Increased
usage and operational support is expected to positively impact processing performance
by encouraging proactive bug fixing and infrastructure maintenance. In addition, while
the technical complexity of operating and using a compute cluster has been dramatically
reduced since the introduction of Hadoop in 2006, some re-education may be needed to
effectively use these powerful data tools (e.g., Jupyter notebooks, HDFS, Scala).
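
To put the download bottleneck in perspective, the reported processing time can be converted into an equivalent data rate. The back-of-the-envelope sketch below uses the 20 GB archive size and the 10 min processing time from the text. It assumes decimal units (1 GB = 1000 MB) and a constant rate, and the variable names are illustrative.

```python
ARCHIVE_GB = 20          # compressed Wikidata JSON archive, from the text
PROCESSING_MINUTES = 10  # reported processing time, from the text

processing_seconds = PROCESSING_MINUTES * 60

# Average rate at which the cluster consumed the compressed archive.
throughput_mb_per_s = ARCHIVE_GB * 1000 / processing_seconds

# Network bandwidth at which downloading would take exactly as long.
break_even_mbit_per_s = throughput_mb_per_s * 8

print(round(throughput_mb_per_s, 1), round(break_even_mbit_per_s))
```

Under these assumptions, a download would have to sustain roughly 267 Mbit/s from the archive host to finish in the time the linking process takes, which is consistent with the observation that transferring the archive is currently the slower step.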

GUODA, and hosted data analytics infrastructure in general, has the potential to
drastically improve biodiversity science by making multiple biodiversity databases
accessible to scientists for analysis on their laptop or desktop. Users still need to have some
programming skills, which have now become essential in biodiversity science.

CONCLUSIONS
Sharing information between biodiversity databases can be difficult because of the amount
and heterogeneity of the data and the identifiers. Most mappings are done using taxonomic
name strings at great expense. We were able to map Wikidata to GloBI in 10 min using
identifier graphs and GUODA, a high-performance computing infrastructure developed
through collaboration between diverse players. The mapping increased GloBI’s outgoing
name links by 13.7%. This method of mapping across databases using identifier graphs
is faster than comparing name strings and can help find inconsistencies that point to a
disputed or outdated name. GUODA, and systems like it, have the potential to revolutionize
biodiversity science by bringing diverse, technically minded people together with
high-performance computing resources that are accessible from a laptop or desktop.
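
As a minimal illustration of linking by identifier graph rather than by name string, the sketch below groups identifiers that are asserted to refer to the same taxon, using a union-find structure to compute connected components. It is not the Spark implementation used in this study. The function names `connected_identifiers` and `linked_to` are hypothetical, and the identifiers in `example_links` are made-up placeholders rather than values from Wikidata or GloBI.

```python
from collections import defaultdict


def connected_identifiers(links):
    """Group identifiers into clusters connected by 'same taxon' links."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for identifier in list(parent):
        clusters[find(identifier)].add(identifier)
    return list(clusters.values())


def linked_to(identifier, clusters):
    """Return every other identifier in the same cluster as `identifier`."""
    for cluster in clusters:
        if identifier in cluster:
            return cluster - {identifier}
    return set()


# Hypothetical links: each pair asserts that two identifiers denote the same taxon.
example_links = [
    ("WD:Q0001", "NCBI:0001"),
    ("NCBI:0001", "GBIF:0001"),
    ("WD:Q0002", "ITIS:0002"),
]

clusters = connected_identifiers(example_links)
print(sorted(linked_to("GBIF:0001", clusters)))
```

In this toy example, a record that only carries the placeholder `GBIF:0001` is connected to `WD:Q0001` through the shared `NCBI:0001` identifier, without any comparison of name strings. Clusters that pull together identifiers attached to different names are the kind of inconsistency that may point to a disputed or outdated name.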

ACKNOWLEDGEMENTS
The authors acknowledge support and resources provided by the ACIS Lab, and thank
José A.B. Fortes for providing infrastructure and creating room for collaboration.
The authors also thank the two reviewers for
their insightful comments that greatly improved the manuscript. The Encyclopedia of Life
and iDigBio helped establish an informal yet pragmatic cross-institutional collaboration.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
Funding was provided by David Rubenstein and the Encyclopedia of Life and by iDigBio,
NSF award 1547229. The funders had no role in study design, data collection and analysis,
decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
NSF award: 1547229.

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Anne E. Thessen analyzed the data, prepared figures and/or tables, authored or reviewed
drafts of the paper, approved the final draft.

• Jorrit H. Poelen conceived and designed the experiments, performed the experiments,
analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or
tables, performed the computation work, authored or reviewed drafts of the paper.

• Matthew Collins contributed reagents/materials/analysis tools, performed the
computation work, authored or reviewed drafts of the paper.

• Jen Hammock authored or reviewed drafts of the paper, provided collaborative space
and leadership.

Data Availability
The following information was supplied regarding data availability:

Poelen, Jorrit H. (2018). Global Biotic Interactions: Taxon Graph (Version 0.3.5) [Data
set]. Zenodo. http://doi.org/10.5281/zenodo.1313243.

Wikidata. (2018). Wikidata dump 2017-12-27 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1211767.

Poelen, Jorrit. (2018). 20 GB in 10 min: Data linking across major biodiversity
databases: Data supplements (Version 0.1) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1213477.

REFERENCES
Bingham HC, Doudin M, Weatherdon LV, Despot-Belmonte K, Wetzel FT, Groom

Q, Lewis E, Regan E, Appeltans W, Güntsch A, Mergen P, Agosti D, Penev L,
Hoffmann A, Saarenmaa H, Geller G, Kim K, Kim H, Archambeau AS, Häuser
C, Schmeller DS, Geijzendorffer I, García Camacho A, Guerra C, Robertson T,
Runnel V, Valland N, Martin CS. 2017. The biodiversity informatics landscape:
elements, connections and opportunities. RIO 3:e14059 DOI 10.3897/rio.3.e14059.

Brin S, Page L. 1998. The anatomy of a large-scale hypertextual web search engine.
Computer Networks and ISDN Systems 30(1–7):107–117
DOI 10.1016/S0169-7552(98)00110-X.

Hardisty A, Roberts D, The Biodiversity Informatics Community. 2013. A decadal
view of biodiversity informatics: challenges and priorities. BMC Ecology 13:16
DOI 10.1186/1472-6785-13-16.

Hindman B, Konwinski A, Zaharia M, Ghodsi A, Josepyh AD, Katz R, Shenker S,
Stoica I. 2011. Mesos: a platform for fine-grained resource sharing in the data

Thessen et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.164 13/15

https://peerj.com
http://doi.org/10.5281/zenodo.1313243
http://doi.org/10.5281/zenodo.1211767
http://doi.org/10.5281/zenodo.1211767
http://doi.org/10.5281/zenodo.1213477
http://doi.org/10.5281/zenodo.1213477
http://dx.doi.org/10.3897/rio.3.e14059
http://dx.doi.org/10.1016/S0169-7552(98)00110-X
http://dx.doi.org/10.1186/1472-6785-13-16
http://dx.doi.org/10.7717/peerj-cs.164


center. In: Proceedings of the 8th USENIX conference on networked systems design and
implementation. Berkeley: USENIX Association, 295–308.

Hortal J, De Bello F, Diniz-Filho JAF, Lewinsohn TM, Lobo JM, Ladle RJ. 2015. Seven
shortfalls that beset large-scale knowledge of biodiversity. Annual Review of Ecology,
Evolution, and Systematics 46(1):523–549
DOI 10.1146/annurev-ecolsys-112414-054400.

Huber R, Klump J. 2009. Charting taxonomic knowledge through ontologies and rank-
ing algorithms. Computers & Geosciences 35:862–868
DOI 10.1016/j.cageo.2008.02.016.

Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, Kelley K,
Hamrick J, Grout J, Corlay S, Ivanov P, Avila D, Abdalla S, Willing C, Jupyter De-
velopment Team. 2016. Jupyter notebooks—a publishing format for reproducible
computational workflows. In: Loizides F, Schmidt B, eds. Positioning and power in
academic publishing: players, agents and agendas. Amsterdam: IOS Press, 87–90.

Nimis PL. 2001. A tale from Bioutopia: could a change of nomenclature bring peace to
biology’s warring tribes? Nature 413(6851):21 DOI 10.1038/35092637.

Page L, Brin S, Motwani R, Winograd T. 1999. The pagerank citation ranking: bringing
order to the web. Technical report. Palo Alto: Stanford Digital Library Technologies
Project.

Page RDM. 2007. Tbmap: a taxonomic perspective on the phylogenetic database
treebase. BMC Bioinformatics 8(1):158 DOI 10.1186/1471-2105-8-158.

Page RDM. 2008. Biodiversity informatics: the challenge of linking data and the role of
shared identifiers. Briefings in Bioinformatics 9(5):345–354 DOI 10.1093/bib/bbn022.

Page RDM. 2011. Linking NCBI to Wikipedia: a wiki-based approach. PLOS Currents
3:RRN1228 DOI 10.1371/currents.RRN1228.

Page RDM. 2013. BioNames: linking taxonomy, texts, and trees. PeerJ 1:e190
DOI 10.7717/peerj.190.

Parr CS, Wilson N, Leary P, Schulz KS, Lans K, Walley L, Hammock JA, Goddard A,
Rice J, Studer M, Holmes JTG, Corrigan Jr RJ. 2014. The encyclopedia of life v2:
providing global access to knowledge about life on earth. Biodiversity Data Journal
2:e1079 DOI 10.3897/BDJ.2.e1079.

Poelen J. 2018a. 20 GB in 10 min: data linking across major biodiversity databases: data
supplements. Version 0.1. Zenodo DOI 10.5281/zenodo.1213477.

Poelen J. 2018b. Global biotic interactions: taxon graph. Version 0.4.2. Zenodo
DOI 10.5281/zenodo.1210315.

Poelen J. 2018c. Global biotic interactions: taxon graph. Version 0.3.0. Zenodo
DOI 10.5281/zenodo.1210308.

Poelen J. 2018d. Global biotic interactions: taxon graph. Version 0.3.1. Zenodo
DOI 10.5281/zenodo.1213465.

Poelen JH, Simons JD, Mungall CJ. 2014. Global biotic interactions: an open infras-
tructure to share and analyze species-interaction datasets. Ecological Informatics
24:148–159 DOI 10.1016/j.ecoinf.2014.08.005.

Thessen et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.164 14/15

https://peerj.com
http://dx.doi.org/10.1146/annurev-ecolsys-112414-054400
http://dx.doi.org/10.1016/j.cageo.2008.02.016
http://dx.doi.org/10.1038/35092637
http://dx.doi.org/10.1186/1471-2105-8-158
http://dx.doi.org/10.1093/bib/bbn022
http://dx.doi.org/10.1371/currents.RRN1228
http://dx.doi.org/10.7717/peerj.190
http://dx.doi.org/10.3897/BDJ.2.e1079
http://dx.doi.org/10.5281/zenodo.1213477
http://dx.doi.org/10.5281/zenodo.1210315
http://dx.doi.org/10.5281/zenodo.1210308
http://dx.doi.org/10.5281/zenodo.1213465
http://dx.doi.org/10.1016/j.ecoinf.2014.08.005
http://dx.doi.org/10.7717/peerj-cs.164


Rees JA, Cranston K. 2017. Automated assembly of a reference taxonomy for phyloge-
netic data synthesis. Biodiversity Data Journal 5:e12581 DOI 10.3897/BDJ.5.e12581.

Shvachko K, Kuang H, Radia S, Chansler R. 2010. The Hadoop distributed file system.
In: MSST’10 proceedings of the 2010 IEEE 26th symposium on mass storage systems and
technologie. Piscataway: IEEE Computer Society, 1–10.

Voß J. 2016. wikidata-taxonomy 0.2.7. Version 0.2.7. Zenodo DOI 10.5281/zenodo.60708.
Vrandečić D, Krötzsch M. 2014. Wikidata: a free collaborative knowledgebase. Commu-

nications of the ACM 57(10):78–85 DOI 10.1145/2629489.
Wikidata. 2018. Wikidata dump 2017-12-27. Zenodo. DOI 10.5281/zenodo.1211767.
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J,

Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I. 2016.
Apache Spark: a unified engine for big data processing. Communications of the ACM
59(11):56–65 DOI 10.1145/2934664.

Thessen et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.164 15/15

https://peerj.com
http://dx.doi.org/10.3897/BDJ.5.e12581
http://dx.doi.org/10.5281/zenodo.60708
http://dx.doi.org/10.1145/2629489
http://dx.doi.org/10.5281/zenodo.1211767
http://dx.doi.org/10.1145/2934664
http://dx.doi.org/10.7717/peerj-cs.164