key: cord-0617965-h5pygfyl authors: Zhao, Jie; Rodriguez, Maria A.; Buyya, Rajkumar title: High-Performance Mining of COVID-19 Open Research Datasets for Text Classification and Insights in Cloud Computing Environments date: 2020-09-16 journal: nan DOI: nan sha: 17e04c9ac9982584f54080bd02e2a631efa1e86d doc_id: 617965 cord_uid: h5pygfyl

The COVID-19 global pandemic is an unprecedented health crisis. Since the outbreak, many researchers around the world have produced an extensive body of literature. For the research community and the general public to digest it, it is crucial to analyse the text and provide insights in a timely manner, which requires a considerable amount of computational power. Cloud computing has been widely adopted in academia and industry in recent years. In particular, hybrid cloud is gaining popularity because of its two-fold benefits: utilising existing resources to save cost and using additional cloud service providers to gain access to extra computing resources on demand. In this paper, we develop a system utilising the Aneka PaaS middleware, with parallel processing and multi-cloud capability, to accelerate the ETL and article categorisation process using machine learning technology on a hybrid cloud. The result is then persisted for further referencing, searching and visualising. Our performance evaluation shows that the system helps reduce processing time and achieve linear scalability. Beyond COVID-19, the application might be used directly in broader scholarly article indexing and analysis.

COVID-19 is a global-scale health crisis. Since the outbreak, a massive amount of research effort has been poured into many aspects of this highly infectious disease. To help the research community, in March 2020, the White House and the Allen Institute for AI teamed up with many researchers and released the COVID-19 Open Research Dataset (CORD-19) [1].
As of 27 July 2020, CORD-19 contains over 199,000 research papers, and nearly half of them are open-access with full text available [2]. The rapid growth in the number of articles makes it challenging for the research community to keep up to date. The general public is also interested in many aspects of the disease, especially findings related to day-to-day life. Hence, the CORD-19 dataset is shared on Kaggle, and community action is required to help develop tools that facilitate the understanding of the virus [2] using machine learning (ML) technology.

Analysing the data and providing insights promptly is a challenge, since ETL and ML require considerable computing power. In recent years, both industry and academia have adopted the cloud computing paradigm, especially in the form of the Hybrid Cloud Environment (HCE) [3]. Adopting HCE enables users to utilise their existing computing infrastructure and to acquire additional resources from external cloud service providers (CSPs) on demand whenever the need arises. In the context of processing voluminous data such as the CORD-19 dataset, HCE saves cost by using an existing on-premise cloud while retaining the capability to gain extra computing, storage or networking resources from other CSPs. However, building an application on HCE is also a demanding task; it requires detailed knowledge of cloud computing techniques.

In this context, we propose a system design and implementation to accelerate the ETL process and text classification with ML technology in a hybrid cloud environment. The architecture design aims to address the following characteristics: scalability, availability, stability, high performance and portability. To achieve these goals and for ease of development, the proposed application is deployed on top of the Aneka Platform-as-a-Service (PaaS) platform.
Aneka PaaS [4] is a range of tools providing high-level Application Programming Interfaces (APIs) and Software Development Kits (SDKs) that simplify the implementation of a scalable application. It allows developers to focus on their program logic without spending too much time on deployment and scalability. When additional resources are required, one can easily acquire them from additional CSPs via the Aneka dynamic provisioning mechanism.

In this paper, we make the following key contributions:
• We present a system architecture design that achieves the following characteristics: scalability, availability, stability, high performance and portability.
• The system is implemented and tested using the real-world CORD-19 dataset in a real hybrid cloud environment built from an on-premise private cloud and the Melbourne Research Cloud (MRC).
• The architectural design can be easily generalized and quickly adopted in similar scenarios. The system is applicable not only to the CORD-19 dataset but also to broader scholarly article indexing and analysis.

The rest of the paper is organized as follows: Section II outlines the background information about Aneka PaaS and other technologies used in the system. The system architecture is detailed and discussed in Section III. Section IV explains the system design and implementation using UML diagrams and workflows; in addition, some example queries and visualizations are given. Afterwards, Section V describes the testbed built on a hybrid cloud comprising an on-premise private cloud and MRC. Our performance evaluation shows that linear scalability can be achieved by utilizing Aneka PaaS with little overhead. Finally, the last section concludes the paper with a summary and future work.

Since the CORD-19 dataset was made available, it has been downloaded more than 200,000 times and many applications have been created [2].
For instance, Amazon Web Services (AWS) provides a search engine over the CORD-19 dataset and a question answering system powered by AWS Kendra [5]. Azure [6], TekStack [7], and COVID-Miner [8] developed full-text search engines. VIDAR-19, which extracts and visualizes risk factors from articles, was presented by F. Wolinski [9]. COVID Seer [10] and COVID Explorer [11] were developed by the Pennsylvania State University. COVID Seer is a multi-facet search engine powered by ElasticSearch, and COVID Explorer has visualization and advanced filtering features utilizing automatic unsupervised ML.

One drawback of existing systems is that they are not kept up to date with the growing dataset. Some of them still use the initial version of CORD-19, even though the current dataset has grown threefold since the initial release. We address this issue by introducing additional workflows that keep up to date with the current version of the CORD-19 dataset and ingest newly published articles. These workflows give users an up-to-date view of state-of-the-art COVID-19 research outputs.

To implement the system quickly, the best approach is to select technologies that are proven in both industry and academia. After evaluating many different technologies, we decided to go with the following: (a) Aneka PaaS [4] for core processing and deployment; (b) Microsoft .NET Framework and ML.NET [12] for development; (c) Grobid [13] for ML-based scientific paper parsing; (d) Minio [14] for scalable S3-compatible shared storage; (e) Elastic Stack [15], including ElasticSearch for full-text indexing and Kibana for data visualization.

Aneka PaaS provides a platform for users and developers to develop and deploy distributed applications with ease. An overview of Aneka is shown in Fig. 1. It comprises three major layers with a rich collection of components.

Microsoft .NET Framework/.NET Core is a free software development framework developed by Microsoft.
It is a complete platform that supports various languages, such as C#, VB and F#. With the strategic transition to .NET Core, it also provides cross-platform support including Windows, Linux and macOS. Since Aneka itself is developed with the .NET Framework, it is a natural choice for our application. In addition, Microsoft provides ML.NET [12], an extensive machine learning framework for .NET developers. It implements many traditional and proven ML algorithms; users who do not have detailed knowledge of ML techniques can still easily use ML technology in their applications. This significantly lowers the barrier for developers to utilize ML technology.

Elastic Stack (ES) [15] is widely used in industry; it is also known as the ELK Stack (ElasticSearch, Logstash, Kibana). ElasticSearch is a distributed full-text indexing and search engine based on the Apache Lucene library; Logstash provides a data collection and log-parsing engine through various agents called Beats; Kibana is a data visualization platform using the searching, filtering and aggregation functions provided by ElasticSearch. Since we implement customized data collection and processing logic with Aneka, we only use ElasticSearch and Kibana in our system.

In addition to the technologies above, we also use an open-source program called Grobid [13] in our update workflow. Grobid is a machine learning library designed explicitly for extracting text data from technical or scientific documents. It is capable of converting PDF files to TEI/XML while maintaining the section and structure format. It has been actively developed since 2008 and was open-sourced in 2011.

Fig. 1. Aneka Framework Overview [16]

This section gives an overview of the system architecture visualized in Fig. 2. The system utilizes a hybrid cloud environment: an on-premise private cloud for storing and processing data, and the Melbourne Research Cloud (MRC) for hosting and serving public-facing ES traffic.
When designing a data-centric processing system, the first thing to consider is how to store and distribute the data efficiently. Although Aneka PaaS supports data distribution via task payload and FTP, this is not efficient and scalable in our use case. Therefore, we decided to use a centralized storage cluster powered by Minio [14], an open-source object storage software. Minio provides an S3-compatible API and a shared-nothing architecture for scalability and availability. It acts as shared storage that can be accessed by all Aneka workers, the Aneka master and the application server. When the need arises, data can be easily replicated to an external CSP over a VPN connection, creating a multi-cloud storage cluster.

The second component is the computing service provided by Aneka PaaS as described in Section II. Aneka makes it easy to perform parallel processing by encapsulating task scheduling and execution.

The third component is an ES cluster replicated to an MRC public instance over the Internet/VPN. The on-premise primary ES cluster is for data persistence. The secondary instance is configured for read-only access and serves public-facing traffic. Due to resource constraints, we run a single node on each side with regular automated snapshots/backups.

The architecture design is quite straightforward, but it addresses some common system design principles as follows:
1) Scalability: All major components are horizontally and vertically scalable. When additional capacity is required, one can easily scale up or out by provisioning more powerful nodes or adding more nodes.
2) Availability: Minio and ElasticSearch use a shared-nothing architecture, which is designed for highly available systems. Aneka also has a robust mechanism to handle task/node failure automatically.
3) Stability: This can be achieved using fail-fast and idempotent processing. If a task fails for any reason, it can be rescheduled either periodically or automatically without affecting the whole system.
4) Performance: High performance is achieved by leveraging Aneka's distributed scheduling and execution capability.
5) Portability: An application developed with the Aneka SDK/API is portable. It can execute in any environment that supports the .NET Framework, regardless of cloud service provider. Portability can be easily achieved by moving the application to another CSP such as AWS, Azure or GCP.

The main system logic is implemented in the application server component. The application server works as a coordinator performing some housekeeping tasks. It periodically checks for dataset updates, downloads PDF files from various sources and submits tasks to the Aneka master. It also monitors the ML model bucket. When a new training set becomes available that can improve the ML model, the application server submits a task to the Aneka master for training. The actual data processing tasks are performed by worker nodes.

Fig. 3.

Firstly, we describe the four storage buckets and their purposes. Afterwards, the application workflows are explained. There are four types of data stored in four buckets:
1) RAW: This bucket contains PDF files that need to be extracted.
2) Staging: This bucket contains TEI/XML, metadata and JSON files that have been extracted from PDF files or downloaded from the public dataset.
3) Completed: Processed files are moved from the staging area to this bucket for archiving/referencing purposes. This also avoids duplicate processing.
4) ML Models: Datasets for ML model training and pre-trained models are stored in this bucket.

The first workflow checks for data source updates. The application checks for the latest version of the CORD-19 dataset and other sources. If new data is available in the CORD-19 dataset, it ingests the incremental data into the staging bucket. Since a SHA-1 checksum is computed over the binary content of every article and all processed articles are stored in a separate bucket, an incremental update is possible by comparing checksums.
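The checksum comparison behind the incremental update can be sketched as follows. This is a minimal illustration in Python, not the system's .NET implementation: an in-memory set stands in for the object keys of the Completed bucket, and the names `sha1_key` and `select_new_articles` are hypothetical helpers introduced only for this example.

```python
import hashlib


def sha1_key(content: bytes) -> str:
    """Return the hex SHA-1 digest of an article's raw bytes, used as its object key."""
    return hashlib.sha1(content).hexdigest()


def select_new_articles(incoming, completed_keys):
    """Return {key: content} for incoming articles that have not been processed yet.

    `incoming` is an iterable of raw article bytes. `completed_keys` is a set
    standing in for the object keys of the Completed bucket (an assumption of
    this sketch; a real deployment would query the object store instead).
    """
    staged = {}
    for content in incoming:
        key = sha1_key(content)
        # Membership tests on a set or dict are average-case constant time,
        # which mirrors looking up an object by its key in the bucket.
        if key not in completed_keys and key not in staged:
            staged[key] = content
    return staged


if __name__ == "__main__":
    completed = {sha1_key(b"article A")}
    batch = [b"article A", b"article B", b"article B", b"article C"]
    staged = select_new_articles(batch, completed)
    print(f"{len(staged)} new article(s) to stage")
```

Keying on the content hash also deduplicates articles that arrive more than once in the same batch, so re-running the check on the same input stages nothing new.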
Because checksum values are used as object keys in S3/Minio, the complexity of each check is O(1). If new PDF files are available from various sources, they are stored in the RAW bucket first. Periodically, RAW PDF files are submitted as Aneka tasks for text extraction using Grobid on worker nodes.

Secondly, the ML model bucket is checked periodically. We use the SdcaMaximumEntropy multi-class trainer provided by ML.NET to train our model and save it to the bucket. Since a high-quality training dataset is crucial for achieving higher accuracy, if better-quality labelled data is available, the system trains a newer model and compares its accuracy with the old one, keeping the better model for later inference.

The main data processing workflow is described in Algorithm 1. The UML class diagram in Fig. 4 shows the classes that make up the application. To define an Aneka task, developers only need to make the class serializable and implement the Aneka.Tasks.ITask interface, which has a single method for programming the execution logic.

Moving on to the application demonstration, Fig. 5 and Fig. 6 illustrate two example queries of general interest that we searched, aggregated and visualized with ElasticSearch and Kibana. Since the data is ever changing, these were exported at the time of querying and are for illustration purposes only.

Question 1: What are the hottest research areas? The result shows that population (virus spread patterns), vaccine, effectiveness of PPE and risk factors are among the most studied areas.

Question 2: Which country is contributing the most efforts

In this section, we introduce our testing environment, present benchmark results and share our observations based on the experimental findings. Table I lists the specifications of our test environment. The Aneka master was set up to run in a converged mode, which means it also doubles as a worker node. In our performance benchmark, the single-node execution was performed on the master node.
In our testbed, the nodes are virtualized instances running on Linux KVM on top of three physical hosts. Each host has 2x1 GbE links configured with LACP (802.3ad) and layer 2+ hashing, and is connected to the same gigabit managed switch. The experiments were done with smaller sets of articles, N = {10000, 20000, 30000, 40000, 50000}, randomly taken from the CORD-19 dataset and run with M = {1, 2, 3, 4} nodes to demonstrate performance and scalability.

Fig. 7 and Fig. 8 show the execution time for a single node and for multiple nodes. As expected, for a single node, the execution time increases linearly with the size of the input. For multiple nodes, the processing time can be significantly reduced; we are able to
1) The processing time of each article is near constant; the fluctuation is relatively small, from 18.2 ms to 18.7 ms. It is largely caused by I/O and network latency, since each article is read from the storage server before processing and persisted to the ES cluster afterwards.
2) Using two nodes theoretically halves the processing time compared to a single node. The actual speedup is less than 2x because of communication and task scheduling overhead.
3) When the dataset is not large enough, after reaching the point of diminishing returns, further increasing the number of nodes does not reduce the processing time as much as theoretically expected, due to overheads. For example, for 10000-20000 articles, using more than two nodes is less rewarding, meaning that two nodes is the best choice for this problem size. For larger inputs, using more nodes is still beneficial.

In this paper, we proposed a system architecture for indexing and analysing scholarly articles, in particular the CORD-19 dataset. We also presented an application design and implementation. By using the Aneka PaaS solution, a parallel data processing application can be developed with little effort. It significantly reduces the entry barrier for a developer to build such a distributed system.
For future work, the ML model can always be improved with higher-quality labelled datasets. A significant contribution is the CODA-19 dataset [17]. The authors used 248 human workers provisioned by AWS Mechanical Turk and created a human-annotated dataset. We may utilize this dataset to improve our model, but it requires some additional work. Also, we are planning to test the system to the limit with a more extensive collection. S2ORC [18] is a general-purpose corpus containing 136M+ paper nodes with 12.7M+ full-text papers (as of 27 July 2020) from many different sources. The future direction will be testing/benchmarking the system

The public ElasticSearch and Kibana instance is hosted on the Melbourne Research Cloud (MRC). We thank MRC for providing computing resources and bandwidth.

CORD-19: The Covid-19 Open Research Dataset
COVID-19 Open Research Dataset Challenge (CORD-19) - Kaggle
The Aneka platform and QoS-driven resource provisioning for elastic applications on hybrid Clouds
Aneka: A Software Platform for .NET Based Cloud Computing
CORD-19 Search
TEKStack Health - COVID-19 Research Portal
Visualization of Diseases at Risk in the COVID-19
COVIDSeer
ML.NET: The Machine Learning Framework for .NET Developers
MinIO - High Performance, Kubernetes Native Object Storage
Elastic Stack: Elasticsearch, Kibana, Beats & Logstash - Elastic
Resource provisioning for data-intensive applications with deadline constraints on hybrid clouds using Aneka
CODA-19: Reliably Annotating Research Aspects on 10,000+ CORD-19 Abstracts Using a Non-Expert Crowd
S2ORC: The Semantic Scholar Open Research Corpus