key: cord-0214227-6z7b04id
authors: Collins, Chris; Cuevas, Roxana; Hernandez, Edward; Hernandez, Reece; Le, Breanna; Woo, Jongwook
title: Scalable Analysis for Covid-19 and Vaccine Data
date: 2021-08-06
journal: nan
DOI: nan
sha: 457e82c37d9d133444c93d195fd7f5e1a2ecf572
doc_id: 214227
cord_uid: 6z7b04id
This paper explains the scalable methods used to extract and analyze COVID-19 vaccine data. Using Big Data tools such as Hadoop and Hive, we collect and analyze a massive data set covering confirmed cases, fatalities, and vaccinations for COVID-19. The data size is about 3.2 GB. We show that it is possible to store and process massive data with Big Data. The paper carries out a spatio-temporal analysis and presents the results as maps, bar charts, and pie charts. We illustrate that as more people are vaccinated, the number of confirmed cases falls.
COVID-19 has impacted our lives for almost two years. Now that cases are declining, we chose these data sets because they help visualize how much vaccination contributes to the decrease in COVID-19 cases. They also support a better analysis and visualization of COVID-19 cases overall and of the vaccines distributed across the states. This topic is essential because it has been at the forefront of our lives and conversations for the past two years and may remain so for the foreseeable future. We can use this data to demonstrate the positive impact of our findings and to analyze the effects we observe, which may help avoid a similar problem in the future. We want our observations, drawn from publicly accessible data, to keep reinforcing for our peers that vaccines do contribute to the decline of COVID-19 cases. We look at different vaccine data sets, including those for Pfizer, Moderna, and Janssen, as well as distribution data by state. The collected data set is about 3.2 GB.
A Big Data platform handles this data set more efficiently because it is a distributed, parallel computing system built to store and process massive data sets. We therefore adopt a Big Data solution using Hadoop and Hive. It also scales in both storage and computation: as the data set grows, we can add more servers to the Big Data cluster [13].
COVID-19 analysis is a common topic in big data and data analysis. The Big Data AI Center (BigDAI) at California State University Los Angeles has completed a study of COVID-19 data [1]. However, its main focus was to highlight disparities in COVID-19 cases among people of different races, ethnic backgrounds, and age groups. BigDAI's work focused solely on confirmed and fatality cases worldwide. Although that is an exciting topic that assists local government decision-making, we expand the analysis to give a broader picture of how COVID-19 cases and the vaccine rollout affect the entire US population. We, too, were able to highlight disparities in positive COVID-19 cases, but more importantly, we could do more with the vaccination data.
Global Banking News also reported on a study by the World Health Organization (WHO) that found fewer COVID-19 cases with more vaccination. Our study likewise shows that as more vaccines were rolled out, reported COVID-19 cases declined. With our Big Data analysis, we see an evident decline in the number of cases and deaths compared with 2020, and a correlation between an increase in vaccinations and a decrease in confirmed and fatality cases. The WHO study reports the same relationship, which reinforces our results.
An article by The Washington Post covers the effects of COVID-19 on men and women. The article is very informative about possible reasons why the coronavirus could affect men more severely than women. One reason it gives is that women have a stronger immune system than men.
Another reason highlighted in the article is the different social norms for men and women, alluding to the fact that men tend to take the disease less seriously than women. In this paper, we show the number of cases for men and women in each state.
The collected data set covers the different vaccine types and the states to which they were distributed. It also includes data about first and second doses. We took the data from the CDC website because it was the most accurate and up-to-date data set and is open to the public. The data set is 3.21 GB in size and covers January 1, 2020 to April 30, 2021. Table 1 shows the data set files and their sizes.
4. Workflow
Figure 1 shows the workflow chart with the steps to obtain and analyze the data. First, we download the raw COVID-19 data from the CDC website and save it to HDFS (Hadoop Distributed File System) on the Big Data cluster. We create several tables whose columns match the data sets, and we transform the data to fit the tables. The transformation includes converting the state columns from abbreviations to consistently spelled-out state names. We applied a similar transformation to data sets whose date format differed from the one we needed, so that the collected fields are standardized. We also join some tables for the analysis. We created a directory and individual files in which each data set is stored. This kept the data organized and let us query each table accurately: when several data sets shared one folder, a query read only the first available file and returned erroneous data. After creating the tables, we wrote Hive queries to extract information such as the total vaccines administered, the number of fatality cases, the total vaccines distributed, and breakdowns of these figures by state. Lastly, we visualize the query outputs as bar graphs, pie charts, and geo maps.
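The state-name and date standardization in this workflow can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the procedure used in the study: the column names `state` and `date`, the three-entry state mapping, and the list of candidate input date formats are all hypothetical, and a real mapping would need every state and territory.

```python
from datetime import datetime

# Illustrative subset only (hypothetical); a full mapping would list every state.
STATE_NAMES = {"CA": "California", "NY": "New York", "TX": "Texas"}

# Candidate input formats are assumptions about what raw files might contain.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")


def normalize_state(value: str) -> str:
    """Return the spelled-out state name; leave unknown values unchanged."""
    text = value.strip()
    return STATE_NAMES.get(text.upper(), text)


def normalize_date(value: str) -> str:
    """Convert a date in any known input format to MM/DD/YYYY."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"unrecognized date format: {value!r}")


def normalize_row(row: dict) -> dict:
    """Return a copy of the row with standardized 'state' and 'date' fields."""
    cleaned = dict(row)
    cleaned["state"] = normalize_state(row["state"])
    cleaned["date"] = normalize_date(row["date"])
    return cleaned
```

Standardizing both fields before loading means the join keys match exactly across tables; rows with an unrecognized date raise an error rather than passing through silently.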
We used Excel for some data cleaning and uploaded the finalized data files into HDFS. The cleaning puts the vaccine data for Moderna, Pfizer, and Janssen into a standard format, so that we can analyze total vaccinations by state alongside confirmed cases. Using Excel, we changed the state columns to full state names, which makes joins across all tables accurate. For example, we change CA to California, NY to New York, and TX to Texas. Additionally, the raw data sets contain date formats that had to be adjusted to match across the Hive tables. We transform the dates to the MM/DD/YYYY format.
Here are the analyses and visualizations that we put together with Tableau, Power BI, and Excel. We used different charts to show the relationship between cases and vaccines. Our visualizations cover the confirmed cases in California and by state in the US, the Pfizer vaccine distribution in the US, the death count, and the impact of vaccination on confirmed cases.
California for 2020 and 2021
Figure 2 compares the confirmed cases in California for each month of 2020 with the same month in 2021. There is a significant year-to-year difference in January: early 2020 had poor reporting and only a few cases in California. Confirmed cases are higher in 2021 than in 2020 for most months, with April as the exception. The April data is significant because California had started its lockdown by April 2020, and in April 2021 vaccines became available to all persons over 18. The chart for that month suggests that vaccines have helped decrease the confirmed cases of COVID-19.
Figure 3 displays a map of the United States showing the confirmed cases as red circles. The larger the circle, the more confirmed cases in that state. The states with the highest number of cases are, in order, California, New York, and Florida.
The total number of vaccines administered per month is visualized using Power BI.
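The per-month totals behind such a chart amount to grouping records by month and summing the doses. The following is a minimal Python sketch under stated assumptions: the input is a sequence of (date, doses) pairs with dates already in MM/DD/YYYY, and the function name `monthly_totals` and the sample numbers are hypothetical. The study itself produced these totals with Hive queries and Power BI.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(records):
    """Sum dose counts per (year, month).

    records: iterable of (date_text, doses) pairs, where date_text is
    MM/DD/YYYY (assumed already standardized) and doses is an integer.
    Returns a dict keyed by (year, month), in chronological order.
    """
    totals = defaultdict(int)
    for date_text, doses in records:
        day = datetime.strptime(date_text, "%m/%d/%Y")
        totals[(day.year, day.month)] += doses
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    # Made-up sample values for illustration only, not study data.
    sample = [("03/01/2021", 10), ("03/15/2021", 5), ("04/02/2021", 7)]
    for (year, month), doses in monthly_totals(sample).items():
        print(f"{month:02d}/{year}: {doses}")
```

Keying on (year, month) rather than month alone keeps December 2020 and any later December separate if the date range is extended.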
We can observe the most extensive vaccine distribution during March and April of 2021, in which over 232 million doses were administered.
Most states show a slightly higher number of positive COVID-19 cases for females than for males. Texas is the exception: there, males tested positive for COVID-19 more frequently than females.
Figure 8 shows the relationship between confirmed cases and total vaccinations in a line graph. As total vaccinations increase, the number of confirmed cases decreases. Because vaccinations only started in December 2020, cases decline slowly but steadily through April 2021 as more people get vaccinated.
Based on the experimental results as of April 2021, we conclude the following. We analyzed 3.2 GB of COVID-19 data, from the beginning of the pandemic to April 2021, on a scalable Big Data platform. Understandably, many organizations today rely on this type of information for future operations. Yet Big Data should be more efficient for large data sets at country and worldwide scale.
COVID-19 Case Surveillance Public Use Data with Geography: Retrieved
COVID-19 Vaccine Distribution Allocations by Jurisdiction
COVID-19 Vaccine Distribution Allocations by Jurisdiction - Moderna: Retrieved
COVID-19 Vaccine Distribution Allocations by Jurisdiction - Pfizer: Retrieved
US: Daily COVID-19 vaccine doses administered: Retrieved
Related Work Source
Scalable Predictive Time-Series Analysis of COVID-19: Cases and Fatalities
COVID-19 Data Statistics and Forecasting
Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing