title: Neural language models for text classification in evidence-based medicine
authors: Carvallo, Andres; Parra, Denis; Rada, Gabriel; Perez, Daniel; Vasquez, Juan Ignacio; Vergara, Camilo
date: 2020-12-01

Abstract. The COVID-19 pandemic has brought a significant challenge to the whole of humanity, with a special burden on the medical community. Clinicians must stay continuously up to date on symptoms, diagnoses, and the effectiveness of emerging treatments amid a never-ending flood of scientific literature. In this context, the role of evidence-based medicine (EBM) in curating the strongest evidence to support public health and clinical practice becomes essential, but it is being challenged as never before by the high volume of research articles published and pre-prints posted daily. Artificial intelligence can play a crucial role in this situation. In this article, we report the results of an applied research project to classify scientific articles in support of Epistemonikos, one of the most active foundations worldwide conducting EBM. We test several methods, and the best one, based on the XLNet neural language model, improves the current approach by 93% in average F1-score, saving valuable time for the physicians who volunteer to curate COVID-19 research articles manually.

Introduction. Evidence-based medicine (EBM) is a medical practice that aims to find all the available evidence to support medical decisions. This evidence is nowadays obtained from biomedical journals, usually accessible through online databases such as PubMed [5] and EMBASE [4], which provide free access to article abstracts and, in some cases, to full articles. In the context of the COVID-19 pandemic, EBM is critical for decision-making at both the individual and public health levels, since research articles address topics such as treatments, adverse events, and the effects of public health policies. The EBM foundation Epistemonikos has made essential contributions by curating and publishing updated guides on which treatments are and are not working against COVID-19¹. Epistemonikos approaches EBM through a combination of software tools for data collection, storage, filtering [2, 1], and retrieval, as well as the vital labor of volunteer physicians who curate and label research articles by quality (whether to include them in the database), type (systematic review, randomized trial, among others), and PICO labels (patient, intervention, comparison, outcome). However, during 2020 this workflow has been challenged by the rapid growth and fast-evolving evidence of COVID-19 articles published in recent months. Moreover, to ensure the rapid collection of the latest published evidence, pre-print repositories such as medRxiv and bioRxiv have been added to the traditional online databases. To support Epistemonikos' effort to filter and curate the flood of COVID-19-related articles, we present the results of an applied AI project in which we implement and evaluate a text classification system to filter and categorize these research articles. The current production model, based on random forests, performs acceptably when classifying systematic reviews (SR) but fails on the other document categories. In this article, we show that BioBERT yields only marginal improvements, while XLNet achieves the best performance by a significant margin.
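To give a sense of the kind of pipeline currently in production, the following is a minimal sketch of a bag-of-words baseline assuming scikit-learn. The generic TF-IDF vectorizer stands in for Epistemonikos' customized tokenizer (which is not public), and the label names are illustrative placeholders rather than the foundation's actual schema.

```python
# Hypothetical baseline: TF-IDF features + random forest classifier.
# This approximates, but is not, the Epistemonikos production pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Illustrative label names for the five document categories.
LABELS = ["systematic-review", "primary-study-rct", "primary-study-non-rct",
          "broad-synthesis", "excluded"]

baseline = Pipeline([
    # Generic vectorizer standing in for the custom Epistemonikos tokenizer.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english",
                              ngram_range=(1, 2), max_features=50_000)),
    ("clf", RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                   n_jobs=-1, random_state=0)),
])

# train_texts: title + abstract strings; train_labels: one entry of LABELS each.
# baseline.fit(train_texts, train_labels)
# predictions = baseline.predict(test_texts)
```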
The improvements we report save volunteer physicians a considerable amount of time by pre-filtering the articles worth manual curation and labeling for EBM. On average, a physician takes two minutes to review one article, while the system we present in this article can review up to 32,000 articles within one hour.

Methods and data. We compare document classification results among (i) a random forest with a customized tokenizer built by Epistemonikos, (ii) an XLNet [8] language model that represents documents and uses a linear layer as a classifier, and (iii) the same setting with a BioBERT [3] language model (an illustrative sketch of the transformer-based setup is shown after the discussion below). A document can be classified as a systematic review, a primary study using a randomized controlled trial (RCT), a non-randomized primary study, a broad synthesis, or an excluded document. The distribution of documents can be observed in the second column of Table 1. Notice that the type of document partially explains the classification models' mistakes: broad syntheses and systematic reviews are both kinds of surveys, while primary studies (RCT and non-RCT) deal with specific treatments and populations. Excluded documents can belong to any of the other four classes, but they are not included in the official Epistemonikos dataset due to their low quality.

Results. Table 1 shows the performance of each model in terms of precision (Prec.), recall (Rec.), and F1-score (F-1) for every type of document. In general terms, we observe that XLNet obtains the top F-1 score in every document category, in some cases by a small margin, as for Systematic review (F-1=.97), and in other cases by a large margin, as for Broad synthesis (F-1=.61) and Excluded (F-1=.78). The results indicate that the random forest and BioBERT with a linear layer are biased towards the most dominant class, Systematic review, reporting slightly better recall (Rec.=.99 and Rec.=1.0) than XLNet (Rec.=.98) for this particular type of document. However, XLNet is better than the other two models in terms of precision across all classes, with the single exception of Broad synthesis, where the random forest (Prec.=.75) outperforms XLNet (Prec.=.67); even there, XLNet's recall (Rec.=.56) far exceeds that of the random forest (Rec.=.15). It is important to note that the random forest implemented for Epistemonikos requires building a new tokenizer for each set of document categories. XLNet is more versatile, since it suffices to fine-tune the embeddings and classify them regardless of the document category. BioBERT, which operates similarly, does not yield consistent performance on the minority classes Broad synthesis and Excluded.

Discussion. In this study, we have compared three methods, one of which, the random forest, is currently in production at the Epistemonikos foundation. The other two are transformer-based language models: BioBERT, which, although based on the transformer architecture, does not achieve the results shown by XLNet, and XLNet, which performs best. Having such reliable results has a large impact in times of the COVID-19 pandemic, when the available literature grows exponentially. In future work, we will incorporate explanations obtained from transformer attention mechanisms, compare them against other explanation methods such as LIME [7] or SHAP [6], and conduct a user study to assess whether this feature facilitates physicians' work. This work seeks to decrease manual effort in the practice of evidence-based medicine, allowing physicians to more easily identify the relevant documents for their clinical questions.
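As a concrete illustration of setting (ii), the following is a minimal sketch, not the authors' production code, of fine-tuning XLNet with a linear classification head and reporting per-class precision, recall, and F1 as in Table 1. It assumes the Hugging Face transformers library, PyTorch, and scikit-learn; the checkpoint name, label names, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: XLNet + linear classification head for the five
# document categories, evaluated with per-class precision/recall/F1.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from sklearn.metrics import classification_report

# Illustrative label names (not the foundation's actual schema).
LABELS = ["systematic-review", "primary-study-rct", "primary-study-non-rct",
          "broad-synthesis", "excluded"]

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
# num_labels adds a randomly initialized linear layer on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=len(LABELS))

class AbstractDataset(Dataset):
    """Wraps title+abstract strings and integer label ids for the Trainer."""
    def __init__(self, texts, label_ids):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = label_ids
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# train_texts/train_y and test_texts/test_y must be supplied by the caller.
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="xlnet-epistemonikos",
#                            num_train_epochs=3, per_device_train_batch_size=8),
#     train_dataset=AbstractDataset(train_texts, train_y),
# )
# trainer.train()
# preds = trainer.predict(AbstractDataset(test_texts, test_y)).predictions.argmax(-1)
# print(classification_report(test_y, preds, target_names=LABELS, digits=2))
```

Replacing the XLNet checkpoint with a publicly released BioBERT checkpoint would reproduce setting (iii) with no other code changes, which illustrates the versatility advantage of the transformer-based models over a baseline that needs a new tokenizer per category set.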
Limitations. Deploying the best-performing method in our offline evaluation (XLNet) in production might imply an increased cost in terms of GPU requirements for Epistemonikos, which are not covered by their current infrastructure. Adding more documents might also require additional fine-tuning of the model, incurring further costs. Another aspect not addressed in this research is fairness: does the current model perform better when classifying articles about certain treated populations (e.g., white males) than about others (e.g., black females)? We should address this aspect actively to prevent our model from learning the undesired biases already observed in several applications.

References
[1] Automatic document screening of medical literature using word and text embeddings in an active learning setting.
[2] An interactive relevance feedback interface for evidence-based health care.
[3] BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
[4] Enhancing access to reports of randomized trials published world-wide: the contribution of EMBASE records to the Cochrane Central Register of Controlled Trials (CENTRAL) in the Cochrane Library.
[5] PubMed searches: overview and strategies for clinicians.
[6] A unified approach to interpreting model predictions.
[7] "Why should I trust you?": Explaining the predictions of any classifier.
[8] XLNet: Generalized autoregressive pretraining for language understanding.