Leveraging Data Preparation, HBase NoSQL Storage, and HiveQL Querying for COVID-19 Big Data Analytics Projects

Karim Baina

2020-04-01

Abstract. Epidemiologists, scientists, statisticians, historians, data engineers, and data scientists are working on finding descriptive models and theories to explain the COVID-19 expansion phenomenon, or on building predictive analytics models for learning the apex of the COVID-19 confirmed cases, recovered cases, and deaths evolution curves. In the CRISP-DM life cycle, 75% of the time is consumed by the data preparation phase alone, putting a lot of pressure and stress on the scientists and data scientists building machine learning models. This paper aims to help reduce data preparation efforts by presenting detailed schema designs and data preparation technical scripts for formatting and storing Johns Hopkins University COVID-19 daily data in the HBase NoSQL data store, and for enabling HiveQL querying of COVID-19 data in a relational, SQL-like style.

The ingest_and_clean.sh data formatting shell script removes '"' and ' * ' characters, replaces each non-separator ',' by a '-' character (e.g. in "Korea, South"), formats column dates into the "%m/%d/%Y" format (e.g. 3/2/20 becomes 03/02/2020) to enable date operations, keeps only the non-null values of the sparse matrix, and merges the first two columns to form a composite key separated by a '~' character.

Listing 1.3: Data Formatting Shell Script (ingest_and_clean.sh)

#!/bin/sh
specific=$1

# $1 script parameter may be 'confirmed' or 'deaths' or 'recovered'

# protect "Country, Name" commas, strip quotes and asterisks,
# zero-pad months/days and expand years to the %m/%d/%Y format,
# blank out zero values to keep the matrix sparse,
# and merge the first two columns into a 'province~country' composite key
sed "s/, /-/" ./COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_${specific}_global.csv |
sed "s/\"//g" |
sed "s/\ * //" |
sed -E "s/\,(.)\//,0\1\//g" |
sed -E "s/\/(.)\//\/0\1\//g" |
sed -E "s/\/20([^/])/\/2020\1/g" |
sed -E "s/\/20$/\/2020/g" |
sed -E "s/,($)/,0\1/g" |
sed "s/,0,/,,/g" |
sed -E "s/([,]+)0,/\1,/g" |
sed "s/,0$/,/" |
sed "s/^,/~/" |
sed -E "s/([a-z A-Z]+),([a-z A-Z]+)/\1~\2/" \
> time_series_covid19_${specific}_global-sparse-with-formatted-column-names.csv

This section presents the NoSQL and relational schema design and the detailed technical scripts for storing JHU COVID-19 daily confirmed cases and deaths data. For a more conceptual background on NoSQL databases and NoSQL design methodologies, see the related author papers [7, 8].

Confirmed cases and deaths data will be stored respectively in the HBase 'confirmed_covid19_cases' and 'deaths_covid19_cases' tables. These tables are compliant with the JHU CSSE file structure, with the first two columns aggregated to satisfy the database unique key property. Each COVID-19 row in HBase, whether for confirmed cases or for deaths, stores one country's data under a composite string primary key (rowid) built from its (possibly empty) province/state concatenated with its country/region name, the two parts being separated by a '~' character. The row then stores all column values under the same column family 'a': 'a:lt' represents latitude and 'a:lg' longitude, while the remaining dynamic daily dated columns are named by convention 'a:d122' (value at January 22nd), 'a:d327' (value at March 27th: confirmed cases in the confirmed_covid19_cases table, respectively number of deaths in the deaths_covid19_cases table), and so on.

Confirmed cases and deaths data will also be represented by two external tables in Hive, 'confirmed_covid19_cases' and 'deaths_covid19_cases'. Those tables are relational abstractions mapped (a kind of shortcut pointing) to their equivalent NoSQL tables in HBase (i.e. non-managed tables, stored physically only in HBase).
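As an illustration of this design, the two HBase tables can be created with the single column family 'a' from the HBase shell. The commands below are a minimal sketch assuming default table options; the put lines use hypothetical values purely to show the key and column naming conventions:

create 'confirmed_covid19_cases', 'a'
create 'deaths_covid19_cases', 'a'

# hypothetical example row illustrating the '~' composite key
# and the 'a:lt' / 'a:dMDD' column conventions
put 'confirmed_covid19_cases', '~Morocco', 'a:lt', '31.7917'
put 'confirmed_covid19_cases', '~Morocco', 'a:d122', '0'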
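On the Hive side, such relational abstractions can be declared as external tables backed by the HBase storage handler. The following is a minimal sketch rather than the paper's exact listing: it assumes the HBase row key is exposed as a Hive column named rowid, and it shows only a few of the dated columns (the real tables carry one mapped column per day, following the a:dMDD convention):

CREATE EXTERNAL TABLE confirmed_covid19_cases (
  rowid STRING,   -- HBase composite row key: province/state~country/region
  lt STRING,      -- latitude (a:lt)
  lg STRING,      -- longitude (a:lg)
  d122 STRING,    -- value at January 22nd (a:d122)
  d331 STRING     -- value at March 31st (a:d331)
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,a:lt,a:lg,a:d122,a:d331')
TBLPROPERTIES ('hbase.table.name' = 'confirmed_covid19_cases');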
Loading the prepared COVID-19 data into the HBase data store is achieved by (i) copying the time_series_covid19_confirmed_global-sparse.csv and time_series_covid19_deaths_global-sparse.csv files generated by the ingest_and_clean.sh script invocations into the HDFS file system, and (ii) then performing bulk loading into the previously created HBase schema.

Instead of suffering from spreadsheet limitations when exploiting JHU COVID-19 data (regarding the number of columns for sorting, the integration of additional tables, or the versioning of different hard-coded sheets and workbooks for business users), and instead of coding complex reporting scripts for simple queries (for data engineers and data scientists), one may express simple queries through both the HBase and Hive command line interfaces or through their APIs.

Listing 1.12: Visualise all confirmed cases and deaths directly from HBase

scan 'confirmed_covid19_cases'
scan 'deaths_covid19_cases'

The following HBase commands retrieve the number of confirmed COVID-19 cases and of deaths at March 31st for Morocco (the part of the key before '~' is empty for whole countries) and for British Columbia, Canada (the part before '~' is not empty for states and provinces), from the 'confirmed_covid19_cases' and 'deaths_covid19_cases' HBase tables.

Listing 1.13: Retrieve confirmed cases and deaths at March 31st directly from HBase

get 'confirmed_covid19_cases', '~Morocco', 'a:d331'
get 'deaths_covid19_cases', '~Morocco', 'a:d331'

get 'confirmed_covid19_cases', '~Spain', 'a:d331'
get 'deaths_covid19_cases', '~Spain', 'a:d331'

get 'confirmed_covid19_cases', '~France', 'a:d331'
get 'deaths_covid19_cases', '~France', 'a:d331'

get 'confirmed_covid19_cases', '~Germany', 'a:d331'
get 'deaths_covid19_cases', '~Germany', 'a:d331'

get 'confirmed_covid19_cases', 'British Columbia~Canada', 'a:d331'
get 'deaths_covid19_cases', 'British Columbia~Canada', 'a:d331'

Listing 1.14: Hive query retrieving all confirmed cases data concerning Morocco
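The body of Listing 1.14 expresses in HiveQL what would otherwise require key-based gets in the HBase shell. The following is a minimal sketch under the same mapping assumption as above (the HBase row key exposed as a rowid column):

-- all stored data for Morocco, selected through the composite key
SELECT *
FROM confirmed_covid19_cases
WHERE rowid = '~Morocco';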
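The relational abstraction also pays off for queries spanning both tables, which are awkward in the raw HBase shell. As a hedged example under the same assumed column mapping, confirmed cases and deaths at March 31st can be read side by side with a join on the composite key:

-- confirmed cases and deaths at March 31st, joined on the composite key
SELECT c.rowid, c.d331 AS confirmed, d.d331 AS deaths
FROM confirmed_covid19_cases c
JOIN deaths_covid19_cases d ON c.rowid = d.rowid
WHERE c.rowid IN ('~Morocco', '~Spain', '~France', '~Germany');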
6 Conclusion

This paper presents detailed schema designs and data preparation technical HBase, Hive, shell, and HDFS scripts for formatting and storing Johns Hopkins University COVID-19 daily data in the HBase NoSQL data store, and for enabling HiveQL querying of COVID-19 data in a relational, SQL-like style. It aims to help scientists and data scientists shorten the data preparation phase, which is time consuming according to CRISP-DM life cycle specialists. This work is to be taken as a leveraging bootstrap for the data preparation phase of COVID-19 analytics Big Data projects aiming, for instance, to integrate COVID-19 evolution time series with medical/biology best practices, COVID-19 mutations, or scientific paper results; to study correlations between COVID-19 curves and humidity data, or people's telco mobility during country lockdown phases; to analyse recurrent COVID-19 contamination causality; to study similarities with the evolution data of other historical pandemics such as SARS-CoV and MERS-CoV; or to compare the evolution with information spreading on social networks. The more integration you do on the schema with other data sets (e.g. continents, median age, population, testing numbers, virus contamination rates, etc.), the more features you will have and the more this work will leverage your COVID-19 data experience. Hurry up, and share your experience with the world's scientists.

7 Appendix: How to download the scripts of this paper?

To download the continuously updated data engineering models and scripts discussed in this paper, you can access and clone the author's GitLab repository at [11].

Acknowledgements

Acknowledgement must go to the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) for keeping up-to-date worldwide COVID-19 data available at a daily frequency. Acknowledgement must also go to the Ministry of National Education, Higher Education, Staff Training, and Scientific Research, Morocco, for accepting and supporting my sabbatical leave to do research and return to ENSIAS refreshed. I also acknowledge my colleagues at ENSIAS for maintaining the superb teaching, learning, and e-learning culture in the school in my absence, especially during the COVID-19 crisis.

References

[1] Novel coronavirus (COVID-19) cases, provided by JHU CSSE.
[2] Apache HBase.
[3] Apache Hive.
[4] What is minimum viable (data) product?
[5] Apache Ambari.
[7] NoSQL databases: seek for a design methodology.
[8] NoSQL databases: yearning for disambiguation.
[9] Apache Impala.
[10] IBM DB2 Big SQL.
[11] Novel coronavirus (COVID-19) data engineering.