title: Eliciting Disease Data from Wikipedia Articles
authors: Geoffrey Fairchild; Lalindra De Silva; Sara Y. Del Valle; Alberto M. Segre
affiliations: Los Alamos National Laboratory, Los Alamos, NM, USA; The University of Utah, Salt Lake City, UT, USA; The University of Iowa, Iowa City, IA, USA
date: 2015-04-02

Traditional disease surveillance systems suffer from several disadvantages, including reporting lags and antiquated technology, that have caused a movement towards internet-based disease surveillance systems. Internet systems are particularly attractive for disease outbreaks because they can provide data in near real-time and can be verified by individuals around the globe. However, most existing systems have focused on disease monitoring and do not provide a data repository for policy makers or researchers. In order to fill this gap, we analyzed Wikipedia article content. We demonstrate how a named-entity recognizer can be trained to tag case counts, death counts, and hospitalization counts in the article narrative that achieves an F1 score of 0.753. We also show, using the 2014 West African Ebola virus disease epidemic article as a case study, that there are detailed time series data that are consistently updated and closely align with ground truth data. We argue that Wikipedia can be used to create the first community-driven open-source emerging disease detection, monitoring, and repository system.

Most traditional disease surveillance systems rely on data from patient visits or lab records (Losos 1996; Burkhead and Maylahn 2000; Adams et al. 2013). These systems, while generally recognized to contain accurate information, rely on a hierarchy of public health systems that causes reporting lags of up to 1-2 weeks in many cases (Burkhead and Maylahn 2000). Additionally, many regions of the world lack the infrastructure necessary for these systems to produce reliable and trustworthy data. Recently, in an effort to overcome these issues, timely global approaches to disease surveillance have been devised using internet-based data. Data sources such as search engine queries (e.g., (Polgreen et al. 2008; Ginsberg et al. 2009)), Twitter (e.g., (Culotta 2010; Aramaki, Maskawa, and Morita 2011; Paul and Dredze 2011; Signorini, Segre, and Polgreen 2011)), and Wikipedia access logs (e.g., (McIver and Brownstein 2014; Generous et al. 2014)) have been shown to be effective in this arena. A notably different internet-based disease surveillance tool is HealthMap (Freifeld et al. 2008). HealthMap analyzes, in real-time, data from a variety of sources (e.g., ProMED-mail (Madoff 2004), Google News, the World Health Organization) in order to allow simple querying, filtering, and visualization of outbreaks past and present. During emerging outbreaks, HealthMap is often used to understand the current state (e.g., incidence and death counts, outbreak locations). For example, HealthMap was able to detect the 2014 Ebola epidemic nine days before the World Health Organization (WHO) officially announced it (Greenemeier 2014). While HealthMap has certainly been influential in the digital disease detection sphere, it has some drawbacks. First and foremost, it runs on source code that is not open and relies on certain data sources that are not freely available in their entirety (e.g., Moreover Newsdesk 1).
Some argue that there is a genuine need for open source code and open data in order to validate, replicate, and improve existing systems (Generous et al. 2014). They argue that while certain closed source services, such as HealthMap and Google Flu Trends (Ginsberg et al. 2009), are popular and useful to the public, there is no way for the public to contribute to the service or continue the service, should the owners decide to shut it down. For example, Google offers a companion site to Google Flu Trends, Google Dengue Trends 2. However, since Google's source code and data are closed, it is not possible for anyone outside of Google to create similar systems for other diseases, e.g., Google Ebola Trends. Additionally, it is not possible for anyone outside of the HealthMap development team to add new features or data sources to HealthMap. For these reasons, Generous et al. argue for the use of Wikipedia access logs coupled with open source code for digital disease surveillance.

Much richer Wikipedia data are available, however, than just access logs. The entire Wikipedia article content and edit histories are available, complete with edit history metadata (e.g., timestamps of edits and IP addresses of anonymous editors). A plethora of open media (audio, images, and video) are also available. Wikipedia has a history of being edited and used, in many cases, in near real-time during unfolding news events. Keegan et al. have been particularly instrumental in understanding Wikipedia's dynamics during unfolding breaking news events, such as natural disasters and political conflicts and scandals (Keegan, Gergle, and Contractor 2011; Keegan, Gergle, and Contractor 2013; Keegan 2013). They have provided insight into editor networks as well as editing activity during news events.

Recognizing that Wikipedia might offer useful disease data during unfolding epidemiological events, this study presents a novel use of Wikipedia article content and edit history in which disease data (i.e., case, death, and hospitalization counts) are elicited in a timely fashion. We study two different aspects of Wikipedia content as it relates to unfolding disease events: 1. Using standard natural language processing (NLP) techniques, we demonstrate how to capture case counts, death counts, and hospitalization counts from the article text. 2. Using the 2014 West African Ebola virus epidemic article as a case study, we show that there are valuable time series data present in the tables found in certain articles. We argue that Wikipedia data can not only be used for disease surveillance but also as a centralized repository system for collecting disease-related data in near real-time.

Disease-related information can be found in a number of places on Wikipedia. We demonstrate how two aspects of Wikipedia article content (historical changes to article text and tabular content) can be harvested for disease surveillance purposes. We first show how a named-entity recognizer can be trained to elicit "important" phrases from outbreak articles, and we then study the accuracy of tabular time series data found in certain articles using the 2014 West African Ebola epidemic as a case study.

Wikipedia is an open collaborative encyclopedia consisting of approximately 30 million articles across 287 languages (Wikimedia Foundation 2014f; Wikimedia Foundation 2014g).
The English edition of Wikipedia is by far the largest and most active edition; it alone contains approximately 4.7 million articles, while the next largest Wikipedia edition (Swedish) contains only 1.9 million articles (Wikimedia Foundation 2014g). The textual content of the current revision of each English Wikipedia article totals approximately 10 gigabytes (Wikimedia Foundation 2014d). One of Wikipedia's primary attractions to researchers is its openness. All of the historical article content, dating back to Wikipedia's inception in 2001, is available to anyone free of charge. Wikipedia content can be acquired through two means: a) Wikipedia's official web API 3 or b) downloadable database dumps 4. Although the analysis in this study could have been done offline using the downloadable database dumps, this option is in practice difficult, as the database dumps containing all historical English article revisions are very large (multiple terabytes when uncompressed) (Wikimedia Foundation 2014h). We therefore decided to use Wikipedia's web API, caching content when appropriate.

Wikipedia contains many articles on specific disease outbreaks and epidemics (e.g., the 2014 West Africa Ebola epidemic 5 and the 2012 Middle East Respiratory Syndrome Coronavirus (MERS-CoV) outbreak 6). We identified two key aspects of Wikipedia disease outbreak articles that can aid disease surveillance efforts: a) key phrases in the article text and b) tabular content. Most outbreak articles we surveyed contained dates, locations, case counts, death counts, case fatality rates, demographics, and hospitalization counts in the text. These data are, in general, swiftly updated as new data become available. Perhaps most importantly, sources are often provided so that external review can occur. The following two excerpts came from the articles on the 2012 MERS-CoV outbreak and the 2014 Ebola epidemic, respectively:

On 16 April 2014, Malaysia reported its first MERS-COV related death. [34] The person was a 54 year-old man who had traveled to Jeddah, Saudi Arabia, together with pilgrimage group composed of 18 people, from 15-28 March 2014. He became ill by 4 April, and sought remedy at a clinic in Johor on 7 April. He was hospitalized by 9 April and died on 13 April. [35] (Wikimedia Foundation 2014a)

On 31 March, the U.S. Centers for Disease Control and Prevention (CDC) sent a five-person team to assist Guinea's Ministry of Health and the WHO to lead an international response to the Ebola outbreak. On that date, the WHO reported 112 suspected and confirmed cases including 70 deaths. Two cases were reported from Liberia of people who had recently traveled to Guinea, and suspected cases in Liberia and Sierra Leone were being investigated.

Figure 1: Table containing updated worldwide Ebola case counts and death counts. This is a screenshot taken directly from the 2014 Ebola epidemic Wikipedia article (Wikimedia Foundation 2014b). Time granularity is irregular but is in general every 2-5 days. References are also provided for all data points.

In the tabular content, time granularity is irregular, but updated counts are consistently provided every 2-5 days. While there are certainly other aspects of Wikipedia article content that can be leveraged for disease surveillance purposes, these are the two we focus on in this study. The following sections detail the data extraction methods we use.

In order to recognize certain key phrases in the Wikipedia article narrative, we trained a named-entity recognizer (NER).
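Both analyses start from article revision histories pulled through the web API described above. The following is a minimal sketch of that retrieval step, not the authors' actual code; the article title, revision limit, and use of the requests library are illustrative assumptions.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # MediaWiki web API endpoint

def fetch_revisions(title, limit=50):
    """Fetch revision metadata and wikitext for one Wikipedia article.

    Returns a list of revision dicts (newest first), each with 'revid',
    'timestamp', and the wikitext under the legacy '*' key.
    """
    revisions = []
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|content",
        "rvlimit": min(limit, 50),  # the API caps content-bearing requests at 50 revisions
        "format": "json",
    }
    while len(revisions) < limit:
        data = requests.get(API_URL, params=params, timeout=30).json()
        page = next(iter(data["query"]["pages"].values()))
        revisions.extend(page.get("revisions", []))
        if "continue" not in data:
            break  # no more revisions to page through
        params.update(data["continue"])  # carry the rvcontinue token forward
    return revisions[:limit]

if __name__ == "__main__":
    # Article title is illustrative; any outbreak article title works.
    for rev in fetch_revisions("Western African Ebola virus epidemic", limit=5):
        print(rev["revid"], rev["timestamp"], len(rev.get("*", "")))
```

Caching the JSON responses to disk, as the authors describe, avoids re-downloading revisions when the pipeline is rerun.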
Named-entity recognition is a task commonly used in natural language processing (NLP) to identify and categorize certain key phrases in text (e.g., names, locations, dates, organizations). NERs are sequence labelers; that is, they label sequences of words. Consider the following example (Wikimedia Foundation 2014e): Jim bought 300 shares of Acme Corp. in 2006. Entities in this example could be named as follows: "Jim" is a PERSON, "Acme Corp." is an ORGANIZATION, and "2006" is a TIME.

This study specifically uses Stanford's NER (Finkel, Grenager, and Manning 2005), available at http://nlp.stanford.edu/software/CRF-NER.shtml. The Stanford NER is an implementation of a conditional random field (CRF) model (Sutton 2011). CRFs are probabilistic statistical models that are the discriminative analog of hidden Markov models (HMMs). Generative models, such as HMMs, learn the joint probability p(x, y), while discriminative models, such as CRFs, learn the conditional probability p(y | x). In practice, this means that generative models like HMMs classify by modeling the actual distribution of each class, while discriminative models like CRFs classify by modeling the boundaries between classes. In most cases, discriminative models outperform generative models (Ng and Jordan 2002).

While Stanford's NER includes models capable of recognizing common named entities, such as PERSON, ORGANIZATION, and LOCATION, it also provides the capability for us to train our own model so that we can capture new types of named entities we are interested in. For this specific task, we were interested in automatically identifying three entity types: a) DEATHS, b) INFECTIONS, and c) HOSPITALIZATIONS. Our trained model should therefore be able to automatically tag phrases that correspond to these three entities in the text documents it receives as input. NERs possess the ability to learn and generalize in order to identify unseen phrase patterns. Since the classifier is dependent on the features we provide to it (e.g., words, part-of-speech tags), it should generalize well to unseen instances. A more simplistic pattern-matching approach, such as regular expressions, is not practical due to the inherent variation in the text. For example, the phrases in our dataset that describe a given entity type vary considerably: some spell out the number, while others provide the numeral. A simple regular expression cannot capture the variability found in our dataset; we would need to define dozens of regular expressions for each entity type, and the rigidity of regular expressions would limit the likelihood that we would be able to identify entities in new, unseen patterns.

A number of steps were required to prepare the data for annotation so that the NER could be trained: 1. We first queried Wikipedia's API in order to get the complete revision history for the articles used in our training set. 2. We cleaned each revision by stripping all MediaWiki markup from the text, as well as removing tables. 3. We computed the diff (i.e., textual changes) between successive pairs of articles. This provided lines deleted and added between the two article revisions. We retained a list of all the line additions across all article revisions. 4. Many lines in this resulting list were similar to one another (e.g., "There are 45 new cases." → "There are 56 new cases."). For the purposes of training the NER, it is not necessary to retain highly similar or identical lines.
We therefore split each line into sentences and removed similar sentences by computing the Jaccard similarity between each pair of sentences, using trigrams as the constituent parts in the Jaccard equation. The Jaccard similarity between two sets A and B, defined as J(A, B) = |A ∩ B| / |A ∪ B|, is commonly used for near-duplicate detection (Manning, Raghavan, and Schütze 2009). We only kept sentences for which the similarity with all of the distinct sentences retained so far was no greater than 0.75 (a sketch of this filtering step appears below). 5. We split each line into tokens in order to create a tab-separated value file that is compatible with Stanford's NER. 6. Finally, we used Stanford's part-of-speech (POS) tagger (Toutanova et al. 2003) 8 to add a POS feature to each token.

In order to train the NER, we annotated a dataset derived from 14 Wikipedia disease outbreak and epidemic articles, generated according to the above methodology. The IOB (inside-outside-beginning) scheme was used to tag each token; the IOB scheme offers the ability to tie together sequences of tokens that make up an entity. The annotation task was split between two annotators (the first and second authors). In order to improve inter-annotator agreement, the annotators each annotated three sets of 5,000 tokens. After each set of annotations, differences were identified, and clarifications to the annotation rules were made. The third set resulted in a Cohen's kappa coefficient of 0.937, indicating high agreement between the annotators.

To understand the viability of tabular data in Wikipedia, we concentrate on the Ebola virus epidemic in West Africa article 23. We chose this article for two reasons. First, the epidemic is still unfolding, which makes it a concern for epidemiologists worldwide. Second, the epidemiological community has consistently updated the article as new developments are publicized. Ideally, we would analyze all disease articles that contain tabular data, but the technical challenges surrounding parsing the constantly changing data leave this as future work.

Ebola is a rare but deadly virus that first appeared in 1976 in two simultaneous outbreaks in remote African villages. Outbreaks of Ebola virus disease (EVD), previously known as Ebola hemorrhagic fever (EHF), are sporadic and generally short-lived. The average case fatality rate is 50%, but it has varied between 25% and 90% in previous outbreaks. EVD is transmitted to humans from animals (most commonly bats, apes, and monkeys) and also from other humans through direct contact with blood and body fluids. Signs and symptoms appear within 2-21 days of exposure (average 8-10 days) and include fever, severe headache, muscle pain, weakness, diarrhea, vomiting, abdominal pain, and unexplained bleeding or bruising. Although there is currently no known cure, treatment in the form of aggressive rehydration seems to improve survival rates (World Health Organization 2014a; Centers for Disease Control and Prevention 2014).

The West African EVD epidemic was officially announced by the WHO on March 25, 2014 (World Health Organization 2014b). The disease spread rapidly and has proven difficult to contain in several regions of Africa. At the time of this writing, it has spread to 7 different countries (including two outside of Africa): Guinea, Liberia, Sierra Leone, Nigeria, Senegal, the United States, and Spain. The Wikipedia article was created on March 29, 2014, four days after the WHO announced the epidemic (Wikimedia Foundation 2014c).
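As a concrete illustration of the near-duplicate filtering step described above, the sketch below computes trigram-based Jaccard similarity and keeps only sufficiently dissimilar sentences. It is a simplified reconstruction, not the authors' code; the use of word (rather than character) trigrams and the whitespace tokenization are assumptions, while the 0.75 threshold comes from the text.

```python
def trigrams(sentence):
    """Return the set of word trigrams in a sentence (assumes whitespace tokenization)."""
    tokens = sentence.split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def jaccard(a, b):
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_near_duplicates(sentences, threshold=0.75):
    """Keep a sentence only if its similarity to every sentence kept so far is <= threshold."""
    kept, kept_grams = [], []
    for sentence in sentences:
        grams = trigrams(sentence)
        if all(jaccard(grams, g) <= threshold for g in kept_grams):
            kept.append(sentence)
            kept_grams.append(grams)
    return kept

# Exact repeats (Jaccard similarity 1.0) are dropped; distinct sentences are retained.
print(filter_near_duplicates([
    "On 7 April, two new deaths were reported.",
    "On 7 April, two new deaths were reported.",
    "Thirteen patients were hospitalized in Conakry.",
]))
```

Note that with word trigrams, short sentences must be nearly identical to exceed the 0.75 threshold; character n-grams would flag looser paraphrases as duplicates.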
As seen in Figure 1, this article contains detailed tables of case counts and death counts by country. The article is regularly updated by the Wikipedia community (see Figure 2); over the 165-day period analyzed, the article averaged approximately 31 revisions per day. We parsed the Ebola article's tables in several steps: 1. We queried Wikipedia's API in order to retrieve every revision of the article. 2. We extracted the case count and death count tables from the HTML of each revision using the Beautiful Soup parser (http://www.crummy.com/software/BeautifulSoup/). 3. As Figure 1 shows, there are non-regular gaps in the Wikipedia time series; these gaps range from 2-5 days. We used linear interpolation to fill in missing data points where necessary so that we have daily time series. Daily time series data simplify comparisons with ground truth data (described later). 4. Recognizing that the tables will not necessarily change between article revisions (i.e., an article revision might contain edits to only the text of the article, not to a table in the article), we then removed identical time series. This final dataset contained 39 time series.

To test the classifier's performance, we averaged precision, recall, and F1 score results from 10-fold cross-validation. Table 1 demonstrates a typical confusion matrix used to bin cross-validation results, which are then used to compute precision, recall, and the F1 score. Precision asks, "Out of all the examples the classifier labeled, what fraction were correct?" and is computed as TP / (TP + FP). Recall asks, "Out of all labeled examples, what fraction did the classifier recognize?" and is computed as TP / (TP + FN). The F1 score is the harmonic mean of precision and recall: 2 · (precision · recall) / (precision + recall). All three scores range from 0 to 1, where 0 is the worst score possible and 1 is the best score possible.

Table 2 shows these results as we varied the maxNGramLeng option (Stanford's default value is 6). The maxNGramLeng option determines the maximum sequence length used when training. We were somewhat surprised to discover that larger maxNGramLeng values did not improve the performance of the classifier, indicating that more training data are likely necessary to further improve the classifier. Furthermore, roughly maximal performance is achieved with maxNGramLeng = 4; there is no tangible benefit to larger sequences (despite this, we concentrate on the maxNGramLeng = 6 case since it is the default). Our 14-article training set achieved precision of 0.812 and recall of 0.710, giving us an F1 score of 0.753 for maxNGramLeng = 6.

For maxNGramLeng = 6, Table 3 shows the average precision, recall, and F1 scores for each of the named entities we annotated (DEATHS, INFECTIONS, and HOSPITALIZATIONS). There were a total of 264 DEATHS, 633 INFECTIONS, and 16 HOSPITALIZATIONS entities annotated across the entire training dataset. Recall that we used the IOB scheme for annotating sequences; this is reflected in Table 3, with B-* indicating the beginning of a sequence and I-* indicating the inside of a sequence. It is generally the case that identifying the beginning of a sequence is easier than identifying all of the inside words of a sequence; the only exception to this is HOSPITALIZATIONS, but we speculate that the identical beginning and inside results for this entity are due to the relatively small sample size.

To compute the accuracy of the Wikipedia West African EVD epidemic time series, we used Caitlin Rivers' crowdsourced Ebola data (https://github.com/cmrivers/ebola). Her country-level data come from official WHO data and reports.
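To make the alignment concrete, here is a minimal pandas sketch of how an irregularly updated cumulative count series can be resampled to a daily grid with linear interpolation and compared against a ground truth series via RMSE. The dates, values, and series names are illustrative, not data taken from the article, and this is a simplified stand-in for the authors' pipeline rather than their actual code.

```python
import pandas as pd

# Hypothetical cumulative case counts parsed from one revision of the Wikipedia table.
wiki = pd.Series(
    [86, 103, 127],
    index=pd.to_datetime(["2014-03-25", "2014-03-28", "2014-04-01"]),
    name="cases",
)

# Hypothetical ground truth counts reported on a different, irregular schedule.
truth = pd.Series(
    [86, 112, 130],
    index=pd.to_datetime(["2014-03-25", "2014-03-31", "2014-04-01"]),
    name="cases",
)

# Resample both series onto a daily grid and fill the gaps by linear interpolation
# so that they can be compared point-by-point.
wiki_daily = wiki.resample("D").mean().interpolate(method="linear")
truth_daily = truth.resample("D").mean().interpolate(method="linear")

# RMSE has the same unit as the data (cases here), which keeps it interpretable.
rmse = ((wiki_daily - truth_daily) ** 2).mean() ** 0.5
print(round(rmse, 1))
```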
As with the Wikipedia time series, we used linear interpolation to fill in missing data where necessary so that the ground truth data are specified daily; this ensured that the Wikipedia and ground truth time series were specified at the same granularity. Note that the time granularity of the WHO-based ground truth dataset is generally finer than that of the Wikipedia data; the gaps in the ground truth time series were not the same as those in the Wikipedia time series. In many cases, the ground truth data were updated every 1-2 days.

We compared the 39 Wikipedia epidemic time series to the ground truth data by computing the root-mean-square error (RMSE). We use the RMSE rather than the mean-square error (MSE) because the RMSE is expressed in the same unit as the data (cases or deaths), which makes it easily interpretable. The RMSE, computed as sqrt((1/n) · Σᵢ (Ŷᵢ − Yᵢ)²), measures the average difference, in cases or deaths, between a Wikipedia epidemic time series (Ŷ) and the ground truth time series (Y).

Figure 3 shows how the case time series and death time series RMSE changes with each table revision for each country.

Figure 3: RMSE between each country's Wikipedia case and death time series and the ground truth over time. The spikes on July 8, 2014 (Liberia and Sierra Leone) and August 20, 2014 (Liberia) in 3a were due to Wikipedia contributor errors and were fixed shortly after they were made. Most RMSE spikes are quickly followed by a decrease; this is due to updated WHO data or contributor error detection.

Of particular interest is the large spike in Figure 3a on July 8, 2014 in Liberia and Sierra Leone. The spike was caused by a 6:27pm edit that placed numbers in the wrong country columns; an edit from a different user at 8:16pm the same day, with the edit summary "correct numbers in wrong country columns", corrected the error. The average RMSE values for each country's time series are listed in Table 4. Even in the worst case, the average deviation between the Wikipedia time series and the ground truth is approximately 19 cases and 12 deaths. Considering the magnitude of the number of cases (e.g., approximately 1,500 in Liberia and 3,500 in Sierra Leone during the time period considered) and deaths (e.g., approximately 850 in Liberia and 1,200 in Sierra Leone), the Wikipedia time series are generally within 1-2% of the ground truth data.

Internet data are becoming increasingly important for disease surveillance because they address some of the existing challenges, such as the reporting lags inherent in traditional disease surveillance data, and they can also be used to detect and monitor emerging diseases. Additionally, internet data can simplify global disease data collection. Collecting disease data is a formidable task that often requires browsing websites written in an unfamiliar language, and data are specified in a number of formats ranging from well-formed spreadsheets to unparseable PDF files containing low-resolution images of tables. Several popular internet-based systems, most notably HealthMap (Freifeld et al. 2008), exist to help overcome some of these traditional disease surveillance system weaknesses.

This study explores a new facet of Wikipedia: the content of disease-related articles. We present methods to elicit data that can potentially be used for near-real-time disease surveillance purposes. We argue that in some instances, Wikipedia may be viewed as a centralized crowdsourced data repository. First, we demonstrate using a named-entity recognizer (NER) how case counts, death counts, and hospitalization counts can be tagged in the article narrative.
Our NER, trained on a dataset derived from 14 Wikipedia articles on disease outbreaks and epidemics, achieved an F1 score of 0.753, evidence that this method is capable of recognizing these entities in text. Second, we analyzed the quality of tabular data available in the 2014 West Africa Ebola virus disease article. By computing the root-mean-square error (RMSE), we show that the Wikipedia time series align very closely with WHO-based ground truth data.

There are many future directions for this work. First and foremost, more training data are necessary for an operational system in order to improve precision and recall. There are many more disease- and outbreak-related Wikipedia articles that can be annotated. Additionally, other open data sources, such as ProMED-mail, might be used to enhance the model. Second, a thorough analysis of the quality and correctness of the entities tagged by the NER is needed. This study presents the methods by which disease-related named entities can be recognized, but we have not thoroughly studied the correctness and timeliness of the data. Third, our analysis of tabular data consisted of a single article. A more rigorous study looking at the quality of tabular data in more articles is necessary. Finally, the work presented here considers only the English Wikipedia. NERs are capable of tagging entities in a variety of other languages; more work is needed to understand the quality of data available in the 286 non-English Wikipedias.

There are several limitations to this work. First, the ground truth time series we used to compute RMSEs is static, while the Wikipedia time series vary over time. Because the relatively recent static ground truth time series may contain corrections for reporting errors made earlier in the epidemic, the RMSE values may be artificially inflated in some instances. Second, we are ignoring the user-provided edit summary. This edit summary provides information about why the edit was made. The edit summary identifies article vandalism (and subsequent vandalism reversion) as well as content corrections and updates. Taking these edit summaries into account could further improve model performance (e.g., processing edit summaries would allow us to disregard the erroneous edit that caused the July 8, 2014 spike in Figure 3a).

Ultimately, we envision this work being incorporated into a community-driven open-source emerging disease detection and monitoring system. Wikipedia access log time series gauge public interest and, in many cases, correlate very well with disease incidence. A community-driven effort to improve global disease surveillance data is within reach, and Wikipedia can play a crucial role in realizing it.
Aramaki, E.; Maskawa, S.; and Morita, M. 2011. Twitter catches the flu: detecting influenza epidemics using Twitter.
Finkel, J. R.; Grenager, T.; and Manning, C. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling.
Freifeld, C. C.; Mandl, K. D.; Reis, B. Y.; and Brownstein, J. S. 2008. HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports.
Greenemeier, L. 2014. Smart Machines Join Humans in Tracking Africa Ebola Outbreak.
Keegan, B.; Gergle, D.; and Contractor, N. 2011. Hot off the wiki: dynamics, practices, and structures in Wikipedia's coverage of the Tōhoku catastrophes.
Keegan, B.; Gergle, D.; and Contractor, N. 2013. Hot Off the Wiki: Structures and Dynamics of Wikipedia's Coverage of Breaking News Events.
Losos, J. 1996. Routine and sentinel surveillance methods.
Madoff, L. C. 2004. ProMED-mail: An Early Warning System for Emerging Diseases.
McIver, D. J., and Brownstein, J. S. 2014. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time.
Paul, M. J., and Dredze, M. 2011. You are what you Tweet: Analyzing Twitter for public health.
Signorini, A.; Segre, A. M.; and Polgreen, P. M. 2011. The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic.
Sutton, C., and McCallum, A. 2011. An Introduction to Conditional Random Fields.
Tjong Kim Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition.
Wikimedia Foundation. 2014a-2014h. Wikipedia article and statistics pages. Accessed: 2014-10-07 through 2014-10-11.
World Health Organization. 2014a. Accessed: 2014-10-27.
World Health Organization. 2014b. Ebola virus disease in Guinea.

This work is supported in part by NIH/NIGMS/MIDAS under grant U01-GM097658-01 and the DTRA Joint Science and Technology Office for Chemical and Biological Defense under project numbers CB3656 and CB10007. LANL is operated by Los Alamos National Security, LLC for the Department of Energy under contract DE-AC52-06NA25396.