key: cord-0527136-rkk1f6tt
authors: Aizawa, Akiko; Bergeron, Frederic; Chen, Junjie; Cheng, Fei; Hayashi, Katsuhiko; Inui, Kentaro; Ito, Hiroyoshi; Kawahara, Daisuke; Kitsuregawa, Masaru; Kiyomaru, Hirokazu; Kobayashi, Masaki; Kodama, Takashi; Kurohashi, Sadao; Liu, Qianying; Matsubara, Masaki; Miyao, Yusuke; Morishima, Atsuyuki; Murawaki, Yugo; Omura, Kazumasa; Song, Haiyue; Sumita, Eiichiro; Suzuki, Shinji; Tanaka, Ribeka; Tanaka, Yu; Toyoda, Masashi; Ueda, Nobuhiro; Ueoka, Honai; Utiyama, Masao; Zhong, Ying
title: A System for Worldwide COVID-19 Information Aggregation
date: 2020-07-28
journal: nan
DOI: nan
sha: 051f6fc6ee4dea1c81ee988bf3df0b8c9876e5db
doc_id: 527136
cord_uid: rkk1f6tt

The global COVID-19 pandemic has made the public pay close attention to related news covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 situation differs greatly among countries (e.g., in policies and in the development of the epidemic), so citizens would be interested in news from foreign countries. We build a system for worldwide COVID-19 information aggregation (http://lotus.kuee.kyoto-u.ac.jp/NLPforCOVID-19) containing reliable articles from 10 regions in 7 languages, sorted by topic, for Japanese citizens. Our dataset of reliable COVID-19-related websites, collected through crowdsourcing, ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese. A BERT-based topic classifier trained on an article-topic pair dataset helps users find information of interest efficiently by sorting articles into categories.

Due to the global COVID-19 epidemic and its rapid changes, citizens are highly interested in the latest news, which covers various domains, including directly related topics such as treatment and sanitation policies as well as side effects on education, the economy, and so on.
Meanwhile, citizens now pay extra attention to news from around the world, not only because the pandemic has brought the planet together, but also because news from other countries offers first-hand information they can learn from. For example, the epidemic outbreak in Korea occurred one month earlier than in Japan. Japanese citizens could have prepared better for the epidemic if they had obtained more information from Korea. Citizens could learn from Asian countries about the efficiency of masks before local official guidance is issued. Universities can learn how to arrange virtual courses from the experience of other countries. Thus, a citizen-friendly international news system with topic detection would be helpful.

(The authors are listed in alphabetical order.)

Compared with systems focusing on one language and one topic (Dong et al., 2020; Thorlund et al., 2020), building such a system poses three challenges:
• The reliability of news sources.
• Translation quality to the local language.
• Topic classification for efficient searching.

The interface and the construction process of the worldwide COVID-19 information aggregation system are shown in Figure 1. We first collect reliable websites in multiple languages via crowdsourcing with native workers. We then crawl news articles from these websites and filter out the irrelevant ones. A high-quality machine translation system translates the articles into the local language (i.e., Japanese). The translated news articles are grouped into their corresponding topics by a BERT-based topic classifier. Our classifier achieves an F-score of 0.84 when classifying whether an article is about COVID-19 and substantially outperforms the keyword-based model. Finally, all translated and topic-labeled news is presented through a user-friendly web interface.
We present the pipeline for building the worldwide COVID-19 information aggregation system, focusing on our solutions to the three challenges.

To avoid rumors and obtain high-quality, reliable information, it is essential to limit the information sources. Since we aim to create a multilingual system, the first challenge is to obtain a list of reliable information providers from different countries and in different languages. Crowdsourcing is known to be efficient in creating high-quality datasets (Behnke et al., 2018). To collect the list of reliable websites of a specific country, we use multiple crowdsourcing services (e.g., Crowd4U, Amazon Mechanical Turk, Yahoo! Crowdsourcing, Tencent wenjuan) and restrict the workers' nationality, because we assume that local citizens of each country know the reliable websites in their country. The workers not only suggest websites they think are reliable but must also justify their choices and give a list of related topics that each website addresses, similar to constructing support for rumor detection (Gorrell et al., 2019; Derczynski et al., 2017). We target eight countries of interest: India, the United States, Italy, Japan, Spain, France, Germany, and Brazil. For other countries or regions such as China and Korea, reliable websites are provided by international students from these areas.

We treat official news from governments as primary information sources and reliable newspapers as secondary information sources. We counted how many times each website was mentioned by the crowdworkers and found that the primary information sources tend to be ranked in the top three in each country. We therefore mainly crawl articles from primary sources. Table 1 shows examples of the crowdsourcing results. For each website they provide, the workers indicate whether it is a primary or a secondary source, why they chose it, and which topics it addresses.
These topics are selected from a list of eight topics (e.g., Infection status, Economics and welfare, School and online classes).

We crawl articles from the 35 most reliable websites every day by accessing the entry page and recursively following the URLs inside it. The number of crawled web pages is too large and exceeds the translation capacity. We therefore keep only the most relevant pages by filtering with keywords such as COVID, which lets us focus on pages with a higher probability of being COVID-19 related.

We use the neural machine translation model TexTra (see footnote 6), which uses a self-attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017). The translation system provides high-quality translation of news articles from multiple languages into Japanese. The translation capacity is approximately 1,000 articles per day.

To perform topic classification, we first collect a dataset via crowdsourcing, in which topic labels are annotated for a subset of the articles. We then train a topic-classification model to label further articles automatically.

Since all articles are in Japanese after the translation stage, we apply crowdsourcing annotation to label the articles with topics. As shown in Figure 2, the crowdsourcing workers first check the content of the page and give four labels to the article: whether it is related to COVID-19, whether it is helpful, whether the translated Japanese is fluent, and the topics of the article. Each article is assigned to 10 crowdworkers from Yahoo Crowdsourcing, and we set a threshold of 50% for each binary question, i.e., if more than 5 workers think the article is related to COVID-19, we label the article as related. We post this crowdsourcing task twice a week and can obtain 20K article-topic pairs each time.

The pretrained language model BERT (Devlin et al., 2019) shows reliable performance on many NLP tasks with limited annotated data, including document classification (Adhikari et al., 2019; Sun et al., 2019).
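The 50% threshold on the binary crowdsourcing questions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `aggregate_binary`, the question keys, and the example votes are all hypothetical, and the multi-label topic question is left out because the text does not say how topic votes are combined.

```python
def aggregate_binary(votes, threshold=0.5):
    """Return True if strictly more than `threshold` of the workers voted yes.

    `votes` is a list of 0/1 answers, one per crowdworker. With 10 workers
    and threshold 0.5, this is True only when more than 5 workers said yes.
    """
    return sum(votes) > threshold * len(votes)


# Hypothetical answers from 10 crowdworkers for a single article.
article_votes = {
    "related_to_covid19": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # 6 yes votes
    "helpful":            [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # 5 yes votes
    "fluent_japanese":    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # 8 yes votes
}

# One aggregated binary label per question.
labels = {question: aggregate_binary(v) for question, v in article_votes.items()}
print(labels)
```

Note that a 5-to-5 tie does not pass the threshold under this reading of "more than 5 workers", so the hypothetical `helpful` question above is labeled negative.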
We use a pretrained BERT model in a feature-based manner, with the encoder weights kept frozen, and train a classifier on the articles labeled by crowdsourcing. The BERT-based topic classifier can then label other pages. We also compare it with a keyword-based baseline in which we set keywords for each topic and look for exact matches.

Footnote 6: https://mt-auto-minhon-mlt.ucri.jgn-x.jp/

[Figure 2 residue: crowdsourcing annotation form. Recoverable questions: "Is the Japanese in this page fluent?" (Yes / No) and "What are the topics in this page? Choose the correct ones from the options below."]

In this section, we present the topic classification results and statistics of the interface. As shown in Table 2, we received 908 questionnaire results in total from 8 countries, covering 550 websites in total. Since rumors are rampant in this era, the reliable website dataset can help people protect themselves from COVID-19 and avoid trusting rumors about it.

We compared the BERT-based model with the keyword-based baseline model on the topic classification task. For the keyword-based method, we selected 76 keywords in total across the topics, such as COVID, Remote work, and Social distance. For the BERT-based method, we use the pretrained BERT-LARGE model with Whole Word Masking (WWM). We add one linear layer after the BERT encoder without fine-tuning the encoder. For every article, we take the hidden state of the ending symbol of each sentence as the sentence embedding and perform mean and max pooling over all sentence embeddings. The input of the linear layer is the concatenation of the mean-pooled and max-pooled embeddings, and the output is a binary label. From the crowdsourced labeled data shown in Table 3, we randomly select 90% as the training set and use the remaining 10% as the test set.

As shown in Table 4, the BERT-based model outperforms the baseline model in almost all tasks. Our system can reliably classify which articles are related to COVID-19, and our interface can therefore show related news to our users.
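The feature-based classifier described above (frozen encoder, mean and max pooling over sentence embeddings, one linear layer with a binary output) can be sketched as follows. This is only an illustration under stated assumptions: the sentence embeddings are tiny hand-written stand-ins rather than real BERT hidden states, the function names `mean_max_pool` and `linear_binary` and the weights are hypothetical, and turning the linear output into a label by testing whether it is positive is an assumption, since the text does not specify the decision rule or the training procedure.

```python
def mean_max_pool(sentence_embeddings):
    """Concatenate the mean-pooled and max-pooled sentence embeddings.

    `sentence_embeddings` is a list of equal-length vectors, one per sentence.
    The result has twice the embedding dimension.
    """
    columns = list(zip(*sentence_embeddings))  # one tuple per dimension
    mean_part = [sum(col) / len(col) for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part


def linear_binary(features, weights, bias):
    """Apply one linear layer and return a binary label (assumed rule: logit > 0)."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return logit > 0


# Stand-in embeddings for an article with two sentences (dimension 2 for
# readability; a real encoder would produce much larger vectors).
sentence_embeddings = [
    [1.0, -2.0],
    [3.0, 0.0],
]
features = mean_max_pool(sentence_embeddings)

# Hypothetical weights of the linear layer; in practice these would be learned
# from the crowdsourced labels while the encoder stays frozen.
weights = [0.5, 1.0, -0.5, 2.0]
bias = 0.1
label = linear_binary(features, weights, bias)
print(features, label)
```

Because the encoder is frozen, the pooled features for each article can be computed once and cached, so only the small linear layer needs to be trained.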
Meanwhile, for some topics such as Arts & Sports and Education, the performance of the current system is still limited and could be improved in future work.

The details of the system database are shown in Table 5. There are 1.05M web pages in total, of which 110K have been translated into Japanese and 18K have topic labels. The dataset is still growing by approximately 11K pages per day.

We built a system for worldwide COVID-19 information aggregation by combining crowdsourcing, crawling, machine translation, and a BERT-based topic classifier. The system provides reliable, comprehensive, and up-to-date information from around the world.

Figure 3: The interface of the worldwide COVID-19 information aggregation system.

Table 1: Examples of the crowdsourcing results (cells lost in extraction are marked as not recovered).
Website | Country | Primary source | Reason | Topics
(not recovered) | (not recovered) | (not recovered) | This is the official page of the ministry of health and family welfare of government of India and is therefore reliable. | infection status; prevention and emergency declaration; symptoms, medical treatment and tests; economics and welfare; school and online classes; entertainment and sports
www.thehindu.com | India | False | This is one of the trusted News paper | infection status; prevention and emergency declaration; economics and welfare; school and online classes

References:
Derczynski et al., 2017. SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours.
Devlin et al., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
Dong et al., 2020. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases.
Gorrell et al., 2019. SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours.
Lee et al., 2019. What would Elsa do? Freezing layers during transformer fine-tuning.
Sun et al., 2019. How to fine-tune BERT for text classification?
Thorlund et al., 2020. A real-time dashboard of clinical trials for COVID-19. The Lancet Digital Health.
Vaswani et al., 2017. Attention is all you need.