key: cord-0711292-ko03vn8l
title: Developing an automated mechanism to identify medical articles from Wikipedia for knowledge extraction
authors: Yu, Lishan; Yu, Sheng
date: 2020-07-13
journal: Int J Med Inform
DOI: 10.1016/j.ijmedinf.2020.104234
sha: 944bf3d8ebdedb2d36816fca405f729a814fdc66
doc_id: 711292
cord_uid: ko03vn8l

Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of medical articles in Wikipedia has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply making the data a manageable size to work with. However, due to the extremely low prevalence of medical articles in the entire Wikipedia, articles identified by generic text classifiers would be bloated by irrelevant pages. To control the false discovery rate while maintaining a high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in the Infoboxes and Wikidata items associated with the identified medical articles was also extracted. This automated mechanism is intended to run periodically, updating the results and sharing them with the informatics community.

Wikipedia contains rich biomedical information and has been widely used for medical informatics research [1]. In addition to basic text mining [2][3][4], Wikipedia articles can also be used for formal knowledge extraction. For example, the article titles, text written in bold, and redirections are usually medical concepts or named entities. The Infobox (the information box at the top right corner of each article), the tables in the main text, and the Wikidata item associated with each Wikipedia article provide concept relations [5][6][7]. These medical concepts and relations can also be discovered from the free text, which are important research topics in natural language processing (NLP) [8][9][10]. These concepts and relations can be used to develop medical knowledge graphs that provide high-level support to healthcare artificial intelligence [11,12], such as language understanding and decision support. In addition, the medical articles, as a corpus, can be used for training word/concept representations [13,14] and language models [15,16] to improve modeling performance in various machine learning tasks. Therefore, although there are controversies about the scientific rigor and quality of some of the articles on Wikipedia [17][18][19][20][21], the size and richness of Wikipedia still make it one of the most useful data sources for medical informatics studies.

However, the size of Wikipedia also creates problems. Wikipedia is freely editable by internet users around the world, on any conceivable subject. As a result, medical articles represent only a tiny fraction of the entire Wikipedia. For example, the 2020-05-01 dump of Wikipedia contains over 20 million pages, including 14 million redirect pages and 6 million non-redirect articles. Among these, as our results indicate, only about 90 thousand articles (1.5% of non-redirect pages, 0.5% of all pages) are related to medicine. With such a tiny representation, using the entire Wikipedia for medical research can have negative effects.
For example, language models trained on general text are less accurate in healthcare NLP than those trained with medical corpora [13,15,16], and medical term discovery and relation extraction models can produce many false discoveries when applied to articles unrelated to medicine. In addition, the 2020-05-01 dump of Wikipedia is 65 GB in volume, and the 2020-06-01 dump of Wikidata is 1.1 TB when uncompressed, which creates unnecessary computational difficulties for researchers who only need the medical parts of them.

The goal of our work is to develop an automated mechanism to identify the medical article subset of Wikipedia, which can be used to facilitate further medical informatics studies. Currently, we look for articles in 7 categories of medical subjects: Anatomy (ANAT), Chemicals & Drugs (CHEM), Devices (DEVI), Disorders (DISO), Living Beings (LIVB), Physiology (PHYS), and Procedures (PROC). The exact scope of these categories follows the semantic group definitions of the Unified Medical Language System (UMLS) [22], with certain exclusions, as detailed in Supplementary Materials S1. For instance, for LIVB we only included the semantic types Bacterium, Fungus, Virus, and Eukaryote, which are more related to diseases than other living beings. Since multiple ontologies for genetics already exist [23,24], we decided to exclude genetic concepts from our current search scope.

Semantic web projects and efforts associated with Wikipedia can be used to identify some of these categories [2,6,8,25]. For example, DBpedia [26] provides class labels that can help identify articles of certain categories, such as diseases and living beings, but it does not cover all target categories; moreover, DBpedia is not fully accurate, and it is updated slowly. Similarly, WikiProject Medicine provides tags for several, but not all, of the semantic groups of interest [27,28]. Therefore, instead of relying on existing semantic resources, we developed machine learning algorithms to identify medical articles and classify them into the aforementioned 7 semantic groups. As a side product, we also extract structured assertional knowledge from the Infoboxes and Wikidata items of these articles. As Wikipedia is constantly updated by its users, the automated mechanism allows us to periodically rerun the process, update our results, and share them with the medical informatics community (https://github.com/yusir521/WikiMedSubset).

Identifying medical articles from Wikipedia and classifying them by semantic group pose a few uncommon challenges. The first challenge is the extremely low prevalence of each class. Generic text classification techniques have progressed rapidly in recent years, with the latest deep learning models exhibiting near-human accuracy [29][30][31]. Techniques have also been proposed to alleviate the sample imbalance issue [32][33][34]. However, Wikipedia articles are not plain text; they have very rich elements and structures. To exploit these features to improve classification accuracy and efficiency, we devised a crawling classification strategy that only needs to classify a portion of Wikipedia articles, which raises the prevalence and controls the false discovery rate. We also incorporate various elements of Wikipedia pages into our models through feature engineering.
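To see why low prevalence is the central difficulty, consider a back-of-the-envelope calculation with hypothetical classifier characteristics (the numbers below are illustrative, not measurements from this study):

```python
# Illustrative false-discovery-rate calculation at Wikipedia's medical
# prevalence. Sensitivity/specificity are hypothetical but generous.
total_articles = 6_000_000   # non-redirect pages
medical = 90_000             # ~1.5% prevalence
sensitivity = 0.95           # recall of a hypothetical generic classifier
specificity = 0.99

true_pos = medical * sensitivity                             # 85,500
false_pos = (total_articles - medical) * (1 - specificity)   # ~59,100
fdr = false_pos / (false_pos + true_pos)
print(f"FDR = {fdr:.1%}")    # ~40.9%
```

Even with 99% specificity, roughly two of every five predicted medical articles would be false discoveries; screening out most non-medical pages before they are ever classified, as the crawling strategy does, is what keeps this rate manageable.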
Another challenge to our work is acquiring annotated samples for training and validation. With the extremely low prevalence of each semantic category, manual annotation of a random sample is infeasible. For example, estimating retrospectively from our results, acquiring 50 sample articles on medical devices would require annotating 200 thousand non-redirect pages on average. To acquire sufficient annotated data, we employed the weak/distant supervision technique [35][36][37] and used the UMLS for automatic annotation. We also conducted limited manual validation on the model-predicted medical articles.

The remainder of the paper is structured as follows. Section 2 gives an overall summary of the Wikipedia data and explains the preparation of the training data. Section 3 introduces the crawling classification strategy and models. Section 4 introduces the baseline models for comparison and the evaluation metrics. Section 5 shows the statistics of the identified articles and comparisons of model accuracy. Section 6 discusses various aspects of the results and compares the identified articles and extracted relations with possible alternative approaches. The last section summarizes the work and its limitations.

Wikipedia is a website that is constantly being updated. Its contents are also available as dumps, which are backups of the website's database; the dumps are created every few months and are available for download. For this paper, we used the 2020-05-01 dump of Wikipedia. This dump contains 20,208,017 Wikipedia pages, among which 6,069,466 are non-redirect, i.e., actual articles.

We used the UMLS to create automatic annotations for training and validation. Titles of non-redirect, non-disambiguation articles were matched against the UMLS for concept recognition. To avoid ambiguities, we only used full-string matches with UMLS preferred terms, and terms that matched multiple concepts were discarded. Eventually, 40,856 article titles were identifiable as UMLS concepts. Among them, 11,843 articles/concepts were not in the chosen 7 semantic groups and were labeled as NULL. The composition of the matched articles is shown in Table 1. In the crawling classification, articles of the 7 target semantic groups are treated as positive samples, and the NULL class as negative samples. In addition, given the extremely low prevalence of medical articles in the entire Wikipedia, we used a random sample of 17,000 Wikipedia articles whose titles could not be matched to the UMLS as additional negative samples; these samples represent articles further removed from medicine. In total, the automatically annotated data comprised 29,013 positive and 28,843 negative samples, of which 80% were used for training and 20% for testing.
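A minimal sketch of this weak-labeling step, assuming a preprocessed lookup table umls_terms that maps a preferred term string to the set of (CUI, semantic group) pairs it may refer to (the table itself, and how titles are normalized, are assumptions not detailed in the paper):

```python
TARGET_GROUPS = {"ANAT", "CHEM", "DEVI", "DISO", "LIVB", "PHYS", "PROC"}

def label_article(title, umls_terms):
    """Return a weak label for an article title, or None if unmatched.

    umls_terms: assumed dict mapping a UMLS preferred term to the set of
    (CUI, semantic_group) pairs it can refer to.
    """
    matches = umls_terms.get(title)
    if not matches:
        return None            # title not recognizable: excluded from the data
    if len(matches) > 1:
        return None            # ambiguous term: discarded, per Section 2
    (_cui, group), = matches   # unpack the single match
    return group if group in TARGET_GROUPS else "NULL"
```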
The crawling classification strategy, introduced in the next section, applies a breadth-first search to the Wikipedia articles. The breadth-first search requires at least one medical article (a seed) in the search queue as a starting point. Since the medical articles on Wikipedia are not guaranteed to all be connected (i.e., reachable through a sequence of links from any given medical article), it is necessary to use many seed articles to minimize the possibility of isolated articles being missed. To find a large number of seeds, we used Wikipedia's category hierarchy. A Wikipedia article is usually tagged with categories that are displayed at the bottom of the page (Figure 1). The categories form a hierarchy: under each category, there can be subcategories as well as articles tagged with that category. We used articles within 5 steps down from the Medicine and Anatomy categories to populate the search queue. These articles were likely to be within the defined scope of medical articles, and they were classified in the same way as other articles during the search. Additionally, the UMLS-recognizable articles in the training set were added to the seed list, which eventually contained 225,239 articles.

Our mechanism uses a two-step workflow, illustrated in Figure 2: the first step identifies the medical subset of Wikipedia, and the second step classifies the identified articles (which are generally about medical concepts) by semantic group. To raise the prevalence of medical articles, the first step uses a crawling strategy. The crawler starts with a search queue filled with the seed articles introduced in Section 2. At each step, the crawler takes an article from the queue and uses a support vector machine (SVM) binary classifier to decide whether the article is about medicine. If it is, links from the article to other Wikipedia articles are extracted, and the linked articles are appended to the queue to be classified, following the breadth-first search strategy; otherwise, the article is discarded and none of its links are followed. This crawling strategy leverages the fact that articles linked from a medical article are likely about medicine as well, so the process blocks the majority of non-medical articles from ever being classified and keeps the positive rate high.

The SVM classifier uses the Gaussian kernel with three kinds of features: (1) Naïve Bayes probabilities. We fit 4 Naïve Bayes classifiers using word tokens from the main body, the section titles, the Infobox, and the categories, respectively (see Figure 1 for an illustration). Words from these fields usually exhibit clear patterns that help distinguish articles of different topics. The predicted probabilities from these classifiers that the article is about medicine are used as features, denoted by p ∈ ℝ⁴. Since each probability can be used for classification by itself, these 4 features are all strong predictors. (2) Article embedding. We use the Skip-gram model [38] to obtain 300-dimensional vector representations of stemmed words, trained on the entire Wikipedia corpus.
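The crawling step can be summarized with a short sketch; here classify_is_medical and get_links are hypothetical stand-ins for the trained SVM classifier and Wikipedia link extraction, which are assumed rather than shown:

```python
from collections import deque

def crawl_medical_subset(seeds, classify_is_medical, get_links):
    """Breadth-first crawl: only pages judged medical expand the frontier."""
    queue = deque(seeds)
    visited = set(seeds)
    medical = set()
    while queue:
        title = queue.popleft()
        if not classify_is_medical(title):
            continue                      # non-medical: do not follow its links
        medical.add(title)
        for linked in get_links(title):   # medical pages add neighbors to queue
            if linked not in visited:
                visited.add(linked)
                queue.append(linked)
    return medical
```

Because non-medical pages never enqueue their neighbors, most of Wikipedia is never presented to the classifier at all, which is what keeps the effective prevalence high.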
We compared our mechanism with three off-the-shelf text classifiers that usually attain excellent performance in semantic classification: Naïve Bayes (NaïveB) [41], logistic regression with TF-IDF features (RM-TF-IDF) [42], and TextCNN [43]. All three models were trained on the automatically labeled training data, including the randomly sampled Wikipedia pages labeled as NULL. TextCNN used kernel sizes 3, 4, and 5, with 100 channels, and an embedding dimension of 128 for each word. The baseline classifiers were applied to the entire Wikipedia for 8-way classification.

Two approaches are used to evaluate the results of the proposed mechanism and the baseline models. The first uses the 20% of automatically annotated samples reserved for testing, containing 11,571 samples; recall, precision, and F score are calculated for each category. The second randomly samples 100 articles predicted as medical from the result of each model and manually labels their categories (the 7 medical semantic groups + NULL); accuracy and false discovery rate (the rate of NULL among articles predicted as medical) are calculated for each model.

We also considered and compared with alternative ways of identifying Wikipedia medical articles that do not use machine learning. One such way is via Wikidata. The Wikidata item associated with a Wikipedia medical article may contain concept IDs from notable medical ontologies, so querying Wikidata for items with such IDs can identify medical articles in Wikipedia. We queried the 2020-06-01 dump of Wikidata for items that contained a concept ID from UMLS, RxNorm, NDF-RT, ICD-9, ICD-10, or LOINC. We also compared our result with the 2020-06-01 version of DISNET [25], which is based on DBpedia and focuses on diseases.
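As a rough sketch of such an ID-based query (not code from the paper), one can stream the compressed Wikidata JSON dump, which stores one entity per line inside a top-level JSON array, and keep items carrying any of the relevant ontology properties. The property IDs below (e.g., P2892 for UMLS CUI, P494 for ICD-10) are our best-effort assumptions and should be verified against Wikidata:

```python
import bz2
import json

# Assumed Wikidata property IDs: UMLS CUI, ICD-9, ICD-10, RxNorm,
# NDF-RT, LOINC. Verify these against the live Wikidata schema.
MEDICAL_PROPS = {"P2892", "P493", "P494", "P3345", "P2115", "P4338"}

def scan_wikidata_dump(path):
    """Yield QIDs of items that carry any medical-ontology concept ID."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip().rstrip(",")   # entries end with a comma
            if not line or line in ("[", "]"):
                continue                        # skip the array brackets
            item = json.loads(line)
            if MEDICAL_PROPS & item.get("claims", {}).keys():
                yield item["id"]
```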
Table 2 shows the breakdown of the identified articles by semantic group. The baseline models predicted many more medical articles than the proposed method, which is undesirable because their predictions were inaccurate and included many false discoveries. Table 3 shows the precision, recall, and F score evaluated using the reserved articles with automatic labels. The proposed mechanism achieved the best performance on almost every metric. Its worst performance was on NULL precision; that is, some medical articles were incorrectly classified as non-medical. Among the 5,876 positive samples in the test set, 766 were misclassified as NULL, of which 607 were misclassified by the crawling classifier. Interestingly, about half of the 607 were LIVB, most of which were never reached by the crawler, which is also reflected in the low recall of that category. This suggests that many UMLS-recognizable LIVB articles (viruses, bacteria, fungi, etc.) may not be linked from medical pages. This may not be as much of a drawback as it appears, since many microbes are not directly related to human health and thus are not wanted in the medical subset. On the other hand, the proposed mechanism achieved the highest recall on the NULL category, which means that its results contain the fewest false discoveries. A high recall on NULL is an important property because most Wikipedia articles are non-medical, so even a small drop in NULL recall results in many false discoveries, as shown in Table 2.

Table 4 further confirms this point. Based on manual review of the identified articles, the proposed mechanism has far higher positive sample accuracy than the baseline models, and it has the fewest false discoveries (namely "Richard Shope", "Isturgia", "Epichlorops", and "List of virus species" classified as LIVB, "Chlamys" and "Kiss curl" classified as DISO, "Hair-cutting shears" classified as ANAT, and "Pentamerida" classified as CHEM). Indeed, the false discovery rates of the baseline models are so high that their results are hardly usable, even though they are excellent text classifiers in general.

Combining our automatic search mechanism and the medical ontology code queries, 110,850 Wikidata items in total can be found, as Figure 3 shows. Among them, 91,513 can be identified by our mechanism, and 79,714 are identified exclusively from Wikipedia, showing that our work is not replaceable by simple queries. This also partially suggests that our search has a high recall.

An automatic mechanism that periodically identifies the medical articles in Wikipedia and extracts their structured knowledge is important for keeping our medical informatics infrastructures up to date. For instance, "Coronavirus disease 2019" is already in our identified medical subset (from the 2020-05-01 dump of Wikipedia), while it is not in DISNET (2020-06-01), which is DBpedia-based and updated on a long cycle.

As discussed at the beginning, the major difficulty in developing a text classifier for the automatic mechanism is the extremely low prevalence of medical articles in Wikipedia. A high proportion of negative samples means a high false discovery rate for machine learning algorithms, which can render the results useless. Therefore, the main goal of our design decisions was to achieve a low false discovery rate while maintaining a high recall for medical articles. Instead of seeking more sophisticated deep learning text classification models, we decided to leverage the rich page elements and the connected nature of Wikipedia. The results in Table 4 show that our search mechanism did not sacrifice recall (compared with RM-TF-IDF and TextCNN, the two better baseline models), and its number of false discoveries is 1-2 orders of magnitude lower than the baselines'. In semantic group classification, as shown in Table 3 (evaluated using the automatically annotated samples), our method still shines in most categories. The low recall of LIVB in Table 3 arose because many pages about microbes (especially those not related to human health) were not connected with medical articles and were never reached by the crawler. We do not consider this an issue for now, until we can find better labels to differentiate microbes related to human health from those that are not. To avoid missing medical articles, we used over 225 thousand seed articles in the breadth-first search, and the crawler eventually covered 20% of Wikipedia articles, which we consider sufficiently large; raising the coverage further would risk more false discoveries. We reviewed incorrect classifications by our method and found that many errors were due to articles being very short.

Another difficulty, and a major limitation of our study, is the lack of gold-standard labels. As explained in Section 1, unbiased manual annotation is infeasible given the rareness of medical articles. Therefore, we used the UMLS for automatic labeling. The benefit of using the UMLS is that the generated sample size is very large. On the other hand, the UMLS can introduce a biased sample distribution into both training and validation, and the labels are imperfect. For example, we only want the LIVB and CHEM concepts that are related to human health, but the UMLS cannot give us that information. Additionally, we do not know the true ratio of non-medical articles in Wikipedia, so randomly sampling 17,000 negative samples for training is also biased. For an unbiased validation, we manually reviewed samples that were classified as medical (that is, into one of the 7 semantic groups); the results show that the proposed mechanism is far superior to the baseline text classifiers and is the only one with an acceptable false discovery rate (Table 4). The manual review cannot evaluate recall. If we use DISNET as a reference for diseases, the recall is at least 95%. Inference from this kind of positive vs. unlabeled data is an open question and an active research area [44].

One of the end goals of identifying the medical subset of Wikipedia is to extract structured assertional knowledge to support the development of medical knowledge graphs. We extracted such knowledge from the Infoboxes and Wikidata items of the identified articles. Figure 4 compares the number of concepts that these relations covered, grouped by whether they were unique to the UMLS, unique to Infobox/Wikidata, or common to both. Table 5 gives the relation name mapping used for counting the number of concepts covered. Note that the mapped relations might not be equivalent in broadness; for example, "has causative agent" in the UMLS is a narrower relation than Caused by.
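As a rough illustration of the Infobox side of this extraction (not the paper's exact pipeline), the fields of an article's Infobox can be read from its wikitext with the third-party mwparserfromhell parser; the template and parameter names involved, such as "Infobox medical condition" with a "causes" field, vary by article and would need verification against live pages:

```python
import mwparserfromhell  # pip install mwparserfromhell

def extract_infobox_fields(wikitext):
    """Return a parameter -> plain-text value map for the first Infobox
    template found in an article's wikitext (empty dict if none)."""
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        if str(template.name).strip().lower().startswith("infobox"):
            return {
                str(p.name).strip(): p.value.strip_code().strip()
                for p in template.params
            }
    return {}
```

Each extracted field name (e.g., "causes", "symptoms") can then be mapped to a relation name, in the spirit of the mapping in Table 5.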
For example, "has causative agent" in the UMLS is a narrower relation than Caused by. From Figure 4 , one can see that Infobox and Wikidata can provide a significant supplement to the UMLS in 4 of the 5 relations. Examining closer about which diseases are covered further shows that a large proportion of those covered by Infobox/Wikidata but not by the UMLS are common diseases, such as type 2 diabetes and influenza. This could be due to that researches of some common diseases were not as heavily funded as diseases like cancer and do not have dedicated ontologies. Therefore, from the perspective of primary healthcare decision support, the value of the added relations can be more substantial than what Figure 4 can show. Wikipedia can provide very rich structured and unstructured information to support medical informatics. However, the subset of medical articles in Wikipedia had not been identified and the whole Wikipedia can be difficult to work with. The automatic mechanism that we developed can identify the medical articles in Wikipedia with high accuracy. In particular, the crawling classification strategy and the utilization of Wikipedia's rich structures allow it to achieve far superior performance than generic text classifiers in false discovery control. Due to the extremely low prevalence of medical articles in Wikipedia, our study is limited in the evaluation of overall recall by manually reviewed gold-standards. Our future research aims to simplify the classification process and to develop adaptive classifiers to improve the accuracy on the very short articles. To facilitate healthcare modeling and NLP, more semantic groups may be included in subsequent iterations. Additionally, automatic article quality assessment can also be added to avoid extracting knowledge from uninformative articles [45, 46] . Situating Wikipedia as a health information resource in various contexts: A scoping review Analysis of Wikipedia pageviews to identify popular chemicals Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources Mining and ranking biomedical synonym candidates from Wikipedia Wikipedia Disease Articles: An Analysis of their Content and Evolution Evaluating Wikipedia as a Source of Information for Disease Understanding Extracting semantic predications from medline citations for pharmacogenomics relationship extraction using Wikipedia to improve recall Extracting Semantic Concept Relations from Wikipedia Long distance entity relation extraction with article structure embedding and applied to mining medical knowledge An ontology-based agent for information retrieval in medicine Automatic Generation of a Qualified Medical Knowledge Graph and Its Usage for Retrieving Patient Cohorts from Electronic Medical Records Clinical Concept Embeddings Learned from Massive Sources of A survey of word embeddings for clinical text BioBERT: a pre-trained biomedical language representation model for biomedical text mining Wikipedia Articles on Nutrition: Are they Accurate and Complete? Current Nutrition & Food More than 2 billion pairs of eyeballs: Why aren't you sharing medical knowledge on Wikipedia? 
[19] Why Medical Schools Should Embrace Wikipedia: Final-Year Medical Student Contributions to Wikipedia Articles for Academic Credit at One School
[20] Quality of information sources about mental disorders: a comparison of Wikipedia with centrally controlled web and printed sources
[21] Empirical studies assessing the quality of health information for consumers on the world wide web: a systematic review
[22] The UMLS project: making the conceptual connection between users and the information they need
[23] Gene Ontology: tool for the unification of biology
[24] The Human Phenotype Ontology: A Tool for Annotating and Analyzing Human Hereditary Disease
[25] DISNET: a framework for extracting phenotypic disease information from public sources
[26] DBpedia - A crystallization point for the Web of Data
[27] Wikipedia:WikiProject Medicine. https://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_Medicine&oldid=911651909
[28] Wikipedia and Medicine: Quantifying Readership, Editors, and the Significance of Natural Language
[29] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[30] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[31] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[32] Safe-Level-SMOTE: Safe-Level Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem
[33] Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
[34] Borderline over-sampling for imbalanced data classification
[35] Distant Supervision for Relation Extraction Without Labeled Data
[36] Combining distant and partial supervision for relation extraction
[37] Snorkel: Rapid Training Data Creation with Weak Supervision
[38] Distributed Representations of Words and Phrases and their Compositionality
[39] Narrative Information Linear Extraction (NILE) Software. CELEHS. /packages/nile
[40] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
[41] A comparison of event models for naive Bayes text classification
[42] A comparative study of TF*IDF, LSI and multi-words for text classification
[43] Convolutional Neural Networks for Sentence Classification
[44] Learning from positive and unlabeled data: a survey
[45] Automatically assessing the quality of Wikipedia contents
[46] Automatically Assessing Wikipedia Article Quality by Exploiting Article-Editor Networks

COMPETING INTERESTS: None. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The authors thank the following people for their help in data collection and preliminary analyses.