key: cord-0214756-js7uvbau authors: Long, Jiawei title: COVID-19 Real-Time Tracker and Analytical Report date: 2020-06-04 journal: nan DOI: nan sha: 38f94a9aedbf50892c1f28377c089d6fc38968d4 doc_id: 214756 cord_uid: js7uvbau

While the COVID-19 outbreak was first reported in Wuhan, China, it was declared a Public Health Emergency of International Concern (PHEIC) by the WHO on 30 January 2020 and had spread to over 180 countries by the time this paper was composed. As the disease has spread around the globe, it has evolved into a worldwide pandemic, endangering the state of global public health and becoming a serious threat to the global community. To combat and prevent the spread of the disease, all individuals should be well informed of the rapidly changing state of COVID-19. Toward this objective, a COVID-19 real-time analytical tracker has been built to provide the latest status of the disease and relevant analytical insights. The real-time tracker is designed to cater to a general audience without advanced statistical aptitude. It aims to communicate insights through straightforward and concise data visualizations that are supported by sound statistical foundations and reliable data sources. This paper discusses the major methodologies used to generate the insights displayed on the real-time tracker, including real-time data retrieval, normalization techniques, ARIMA time-series forecasting, and logistic regression models. In addition to introducing the details and motivations of these methodologies, the paper features some key discoveries about COVID-19 that were derived with them.

The COVID-19 real-time tracker primarily includes odometers of the latest status of COVID-19 cases, trend analysis and prediction of COVID-19 cases in 185 different countries, informative visualizations of the most common symptoms and risk factors, as well as patient demographic distributions. Subsequent sections provide a brief description of each major feature, discuss the relevant methodologies behind it, and highlight selected findings. Link to the COVID-19 real-time tracker: https://peterljw.shinyapps.io/covid_dashboard/

This section of the COVID-19 real-time tracker contains two pages that separately highlight the most current state of COVID-19 in the states within the U.S. and in countries around the globe (see figure 1 and figure 2). The two pages share the same features and elements. The top of the page has three odometer boxes that display the total confirmed cases, total deaths, and total recovered cases along with their respective daily new counts. The bottom half of the page contains a user-interactive control panel and a display window. The users are able to apply population normalization or log transformation to the visualizations in the display window through the widgets in the control panel. The purpose of this feature is to provide the audience with an aggregated view of the severity of COVID-19 in different locations and inform the audience of the latest status of the disease at first glance. The options of applying log transformation and population normalization allow the audience to observe the state of COVID-19 from different perspectives, while the interactive table allows the audience to explore specific metrics of their interest.
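To make the page structure concrete, the following is a minimal shinydashboard layout sketch of the overview page described above; the widget ids, labels, and output names are illustrative assumptions rather than the tracker's actual code.

```r
# Minimal shinydashboard layout sketch of the overview page (UI only).
# Widget ids, labels, and output names are assumptions for illustration.
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "COVID-19 Tracker"),
  dashboardSidebar(sidebarMenu(
    menuItem("U.S. Overview", tabName = "us"),
    menuItem("World Overview", tabName = "world")
  )),
  dashboardBody(
    # Odometer boxes: totals and daily new counts are rendered by the server
    fluidRow(
      valueBoxOutput("confirmed_box"),
      valueBoxOutput("deaths_box"),
      valueBoxOutput("recovered_box")
    ),
    # Control panel next to the display window (heat map, line plot, data table)
    fluidRow(
      box(width = 3,
          checkboxInput("log_scale", "Log scale (base 10)", value = FALSE),
          checkboxInput("per_million", "Cases per million", value = FALSE)),
      box(width = 9,
          plotOutput("heat_map"),
          plotOutput("trend_lines"),
          DT::dataTableOutput("metrics_table"))
    )
  )
)
```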
To ensure the accuracy and reliability of the tracker's content, the website's server retrieves the newest data from the COVID-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University whenever a user loads the web page. The data repository is maintained by the Johns Hopkins University Center for Systems Science and Engineering and supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab. According to the documentation of the repository, the data source is updated numerous times throughout the day, and the validity of the data is verified by researchers at Johns Hopkins University. The content displayed in the overview feature is therefore derived from a real-time and reliable data source.

To achieve real-time data retrieval, the web server contains a protocol to download and ingest the data source from the CSSE data repository when a user's request to access the web page reaches the server. When the server receives such a request, it attempts to download the data by getting the current date and accessing the data source with an updated URL. Once the data file is downloaded successfully, it is ingested and stored temporarily on the server. Subsequently, a pre-specified R script reads the data and preprocesses it into different data frames to support the insights derived on the web page. If the download is unsuccessful due to unforeseen circumstances, the web server loads the most recent data file that it has previously ingested to support the content on the web page. The server also logs such errors so that they can be handled to improve the robustness of the tracker. Figure 6 provides a visual summary of the real-time data retrieval process.

The control panel on the page allows the users to apply log transformation and population normalization (i.e. cases per million) to the data, which interacts with the corresponding visualizations of the heat map and the time-series line plot. When the user turns the log-scale switch on, the logarithmic function with a base of 10 is applied as a deterministic mathematical function to each point in the data set. That is, for every data point $x_i$, its value is replaced by $y_i = \log_{10}(x_i)$. Such a transformation significantly improves the interpretability and the appearance of the visualizations. The choice of the logarithmic function is based on the exponential growth associated with a pandemic and the relatively large differences in the raw counts of cases across different locations in the later stages of a pandemic. The effect of the log transformation is demonstrated in figure 7. On the other hand, while population normalization does not necessarily improve the appearance of the visualizations, it alters the interpretation of the visualization by accounting for the population of each region. Such a perspective is beneficial because each country or state can vary significantly in its population. Assessing the number of cases per million provides a more robust estimation of the severity of COVID-19 in each region than solely observing the total counts. To achieve population normalization, global country-level population data and state-level population data of the U.S. are preprocessed and stored on the server, and they are joined to the retrieved data to produce the corresponding visualizations.
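As an illustration of the retrieval-with-fallback protocol described above, a minimal sketch in R is shown below. The URL follows the daily-report naming convention of the JHU CSSE GitHub repository, and the local cache path is an assumed location rather than the tracker's actual setup.

```r
# Sketch of the retrieval-with-fallback protocol. The URL follows the
# daily-report naming convention of the JHU CSSE GitHub repository; the local
# cache path is an assumed location.
base_url <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/",
                   "master/csse_covid_19_data/csse_covid_19_daily_reports/")
cache_file <- "data/latest_daily_report.csv"   # assumed cache location

fetch_latest <- function() {
  # Daily reports are named MM-DD-YYYY.csv; yesterday's file is the most
  # recent one guaranteed to be published.
  file_name <- paste0(format(Sys.Date() - 1, "%m-%d-%Y"), ".csv")
  tryCatch({
    df <- read.csv(paste0(base_url, file_name), stringsAsFactors = FALSE)
    write.csv(df, cache_file, row.names = FALSE)  # refresh the local cache
    df
  }, error = function(e) {
    # Fall back to the most recently ingested file and log the failure
    message("Download failed, using cached data: ", conditionMessage(e))
    read.csv(cache_file, stringsAsFactors = FALSE)
  })
}

latest_daily <- fetch_latest()
```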
Precisely, the normalization is applied in the following manner: for every data point $x_i$ in a region with population $P$, its value is replaced by $y_i = \frac{x_i}{P} \times 10^{6}$. In addition, the time-series line plots have built-in timescale standardization. Rather than comparing the time-series data with respect to date, the plot compares them with respect to the number of days after the spread of the disease reaches a certain magnitude. Since the time frames of outbreaks are different in every region, it is difficult to compare the severity of the disease in each region on separate time frames. Hence, the application of timescale standardization helps to standardize the time-series data onto a universal time scale. In conjunction with population normalization, the audience is able to identify the regions with the fastest spread of COVID-19.

There are a number of noticeable findings which surface from the visualizations and data presented in the overview feature. In spite of the fact that the U.S. has the most confirmed cases of COVID-19 around the globe and accounts for approximately 30% of the global total confirmed cases, it no longer tops the chart once population normalization is applied. Among countries with populations over one million, the most severely affected countries are Qatar, Singapore, Bahrain, Spain, and Ireland. Similarly, if we sort the world data table by confirmed cases per million, we can see that the U.S. is not in the top 10 countries. As shown, there are numerous smaller countries that are having tougher struggles with COVID-19, as they have higher confirmed cases per million and tend to have less advanced medical supplies to combat the disease. Such countries receive much less exposure and discussion in mainstream news coverage due to their limited presence in the global economy, but they may be suffering from a much greater severity of COVID-19. Similarly, within the United States, the severity of COVID-19 has been quietly climbing in some smaller states. For example, Rhode Island, Connecticut, Delaware, Louisiana, and Nebraska are among the top 10 states with the highest confirmed cases per million. While California has the fourth-highest number of confirmed cases in the United States and has received a relatively high amount of media coverage, it stands only at the 32nd position among all the states when measuring confirmed cases per million. While it is true that regions with higher populations are more vulnerable to larger outbreaks, regions with small populations cannot be overlooked and also deserve attention.
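Before turning to the trend analysis, the following sketch illustrates the per-million normalization and the timescale standardization described above; the column names, the toy data, and the 100-case threshold for the start of the outbreak clock are illustrative assumptions.

```r
# Sketch of per-million normalization and timescale standardization.
# Column names, the toy data, and the 100-case threshold are assumptions.
library(dplyr)

cases_df <- tibble::tribble(
  ~country, ~date,                  ~confirmed,
  "A",      as.Date("2020-03-01"),  80,
  "A",      as.Date("2020-03-02"),  120,
  "A",      as.Date("2020-03-03"),  200,
  "B",      as.Date("2020-03-01"),  500,
  "B",      as.Date("2020-03-02"),  900
)
population_df <- tibble::tibble(country = c("A", "B"), population = c(5e6, 6e7))

normalized <- cases_df %>%
  left_join(population_df, by = "country") %>%   # join stored population data
  mutate(cases_per_million = confirmed / population * 1e6,
         log10_confirmed   = log10(confirmed))   # optional log-scale view

standardized <- normalized %>%
  group_by(country) %>%
  arrange(date, .by_group = TRUE) %>%
  filter(confirmed >= 100) %>%                   # assumed start of the outbreak clock
  mutate(days_since_100th_case = as.integer(date - min(date))) %>%
  ungroup()
```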
This section of the COVID-19 real-time tracker contains a user-interactive control panel and a display window, where the display window shows a bar plot of the total cases, a line plot of daily new cases over time along with its moving average, and a 5-day interval forecast of daily new cases (Figure 9). The visualization is presented in a dual-axis plot. The gray line in the back represents the number of new cases while the dotted black line represents the moving average of the number of new cases, and they correspond to the left vertical axis. The orange bar plot represents the accumulated total cases, and it corresponds to the right vertical axis. The purpose of this feature is to provide the audience with insights into the trend of the spread of the disease in each individual country. In other words, the plot aims to answer the question of whether the curve has flattened.

Since the number of daily new cases fluctuates substantially, applying a moving-average aggregation helps reveal the underlying direction of the curve of daily new cases. In addition, the fluctuations in the number of daily new cases display some degree of short-term patterns that can serve as the basis for time-series forecasting, which provides the audience with an estimate of the trend in the near future. The control panel on the page allows the users to specify the window of days that the moving-average aggregation uses to draw the moving-average curve. A moving average is an aggregating calculation that analyzes data points by creating a series of averages of different subsets of the full data set. In this case, the simple moving average is used to compute the values of the moving averages. To calculate a simple moving average, let $X_t$ be the number of new cases at time $t$; then the simple moving average over the most recent $m$ days at time $t$ is computed as

$\mathrm{SMA}_t = \frac{1}{m}\sum_{i=0}^{m-1} X_{t-i}.$

By computing a series of simple moving averages, we are able to smooth out short-term fluctuations in the number of daily new cases and highlight longer-term trends or cycles. This is especially useful in tracking the constantly changing state of the COVID-19 outbreak in a particular region.

An auto-ARIMA model (i.e. the auto.arima() function in the forecast R package) is used to implement a 5-day time-series prediction of the number of daily new cases. In summary, for every given time series, the script automatically fits the best ARIMA model to the data by using AIC, AICc, or BIC scores as the basis of judgment. The following discussion briefly introduces the notion of the ARIMA model and provides the necessary background information for the subsequent discussion of the prediction mechanism. Suppose that $X_t$ is a time series, where the $X_t$ are real numbers and $t$ is an integer index. An ARMA(p', q) model can be written as

$X_t - \alpha_1 X_{t-1} - \dots - \alpha_{p'} X_{t-p'} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}.$

It can also be written as

$\left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right) X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t,$

where $L$ represents the lag operator, $\varepsilon_t$ represents the error terms, the $\theta_i$ represent the moving-average parameters, and the $\alpha_i$ represent the autoregressive parameters. The general assumptions about the error terms are that they are: i. sampled from a normal distribution with zero mean; ii. identically distributed; iii. independent. Assuming that the autoregressive polynomial $\left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right)$ has a unit root (i.e. a factor $(1 - L)$) of multiplicity $d$, it can be written as follows:

$\left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right) = \left(1 - \sum_{i=1}^{p' - d} \varphi_i L^i\right)(1 - L)^d.$

The above polynomial factorization property is expressed by an ARIMA(p, d, q) process with $p = p' - d$. It is given by the following:

$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t.$

Looking at the above, one can think of it as a particular case of an ARMA(p + d, q) process whose autoregressive polynomial has $d$ unit roots, which also indicates that no ARIMA model with $d > 0$ is wide-sense stationary. Generalizing the above with a drift term $\delta$, the definition of the ARIMA(p, d, q) process becomes

$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d X_t = \delta + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t,$

and the drift of the process is $\frac{\delta}{1 - \sum_i \varphi_i}$. There are three major parameters $(p, d, q)$ of the ARIMA model: $p$ represents the lag order, or the number of lag observations included in the model; $d$ represents the degree of differencing, or the number of times that the raw observations are differenced; and $q$ represents the order of the moving average, or the size of the moving-average window. In addition, the ARIMA model assumes that the input time-series data is univariate and stationary. Stationarity implies that the time series' properties are independent of the time when they were captured; in other words, the data has a constant mean and variance. If not, the data needs to be transformed before one can use the ARIMA model. The auto.arima() function automatically determines the appropriate order of differencing using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
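To make these steps concrete, here is a minimal sketch of the moving-average smoothing and the 5-day auto.arima() forecast described above; the toy new_cases series and the 7-day window are illustrative assumptions. The model-selection criteria used by auto.arima() are discussed next.

```r
# Sketch of the moving-average smoothing and the 5-day auto.arima() forecast.
# The toy new_cases series and the 7-day window are assumptions.
library(zoo)        # rollmean() for the simple moving average
library(forecast)   # auto.arima() and forecast()

new_cases <- c(12, 18, 25, 40, 38, 55, 70, 66, 90, 105, 98, 120,
               131, 150, 142, 160, 171, 168, 180, 176)   # toy daily new cases

# Simple moving average over the most recent m = 7 days
sma_7 <- rollmean(new_cases, k = 7, align = "right", fill = NA)

# Fit the best ARIMA model by information criteria and forecast 5 days ahead
fit <- auto.arima(ts(new_cases))
fc  <- forecast(fit, h = 5, level = 95)

fc$mean                    # 5-day point forecast
cbind(fc$lower, fc$upper)  # 95% bounds used for the shaded ribbon on the tracker
```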
As mentioned earlier, auto.arima() performs model selection based on the AIC, AICc, or BIC scores. AIC is an abbreviation of the Akaike Information Criterion, and AICc refers to the corrected AIC. BIC, also known as the Schwarz information criterion, stands for the Bayesian Information Criterion. All of them are very useful model evaluation metrics. AIC provides a means for model selection because it estimates the quality of each model, from a collection of candidate models, relative to each of the other models. In other words, it serves as an estimator of out-of-sample prediction error, thus estimating the statistical models' relative quality for a given data set. Given a particular statistical model with $k$ estimated parameters and maximized likelihood value $\hat{L}$, the model's AIC value is given by

$\mathrm{AIC} = 2k - 2\ln(\hat{L}).$

Thus, as assessed by the likelihood function, AIC rewards goodness of fit. On the other hand, BIC is closely related to AIC because it is also partly based on the likelihood function. Equally, BIC is a criterion for model selection among a finite set of models, and the most preferred model is the one with the lowest BIC. The formal definition of BIC is as follows:

$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}).$

In both settings, $k$ is the number of parameters estimated by the model, $n$ is the sample size (the number of observations or data points), and $\hat{L}$ is the maximized value of the model's likelihood function. For ARIMA models, the evaluation metric can be computed as

$\mathrm{AIC} = -2\ln(\hat{L}) + 2(p + q + k + 1),$

where $k = 1$ if the ARIMA model includes an intercept (drift) term and $k = 0$ otherwise, $q$ is the order of the moving-average part, $p$ is the order of the autoregressive part, and $\hat{L}$ is the maximized value of the model's likelihood function. For every given time series, auto.arima() chooses the parameters which give the lowest AIC, AICc, or BIC, and forecasts the values for the next 5 days. As part of the output from the auto.arima() function, the 95% prediction intervals are taken to plot the transparent orange ribbon around the mean prediction. The 95% intervals for ARIMA forecasts are computed as $\hat{x}_{T+h|T} \pm 1.96\,\hat{\sigma}_h$, where $\hat{\sigma}_h^2$ is the estimated forecast variance at horizon $h$.

From this feature, we can observe various trends and patterns, as countries around the globe have reacted differently to mitigate the spread of COVID-19. For example, European countries such as Spain, Italy, and Germany have implemented relatively strict country-level lockdown policies, and this is reflected in their trend plots (Figure 9). Iran also implemented similar lockdown policies, and the country began to reopen throughout April after seeing a significant drop in its number of new cases. However, the country was hit by a new surge of COVID-19 cases in May (Figure 10). In contrast, the U.S.'s reaction at the federal level has been relatively slow-moving and incoherent. Although the number of new cases has been gradually decreasing, the rate of decrease is relatively small compared to countries that implemented strict lockdown policies (Figure 11). We may expect a similar new surge of cases in the U.S. if the country were to reopen without caution in the near future.

This section of the COVID-19 real-time tracker contains an interactive visual summary of the most common symptoms associated with the disease (Figure 12).
Due to data quality issues and the uncertain nature of the disease, it is difficult to estimate the true prevalence of the symptoms among infected patients. Hence, their prevalence measure is standardized to a 0-to-10 scale, represented by the horizontal axis of the plot. Since the symptom variable in the patient-level data contains descriptive sentences of a patient's symptoms (e.g. "Moderate fever 38.5°C, cough, strong headache"), we have to apply natural language processing techniques such as n-gram tokenization to transform and preprocess the data. The goal is to convert the descriptive sentences into a set of binary indicator variables, as shown by the simple example in figure 13. The process of word tokenization refers to splitting a sample of text into words or phrases, and n-gram tokenization refers to tokenization that splits the text into phrases which contain n words. For example, unigram tokenization will turn the sentence "he has shortness of breath" into [he, has, shortness, of, breath], while trigram tokenization will turn the sentence into [he has shortness, has shortness of, shortness of breath]. As an attempt to collect all of the recorded symptoms in the dataset, we can apply n-gram tokenization to every descriptive sentence and compute the frequency of each token for n = {1, 2, 3, 4}. As anticipated, we can obtain a list of the most common symptoms from the symptom records by looking through the processed output of the n-gram tokenization (Figure 14). After obtaining a comprehensive list of symptoms, we can then create a dictionary of phrases for each symptom and loop through all descriptive sentences to see if they contain any phrase in any dictionary. For example, the dictionary for cough is [cough, coughing], and any sentence that contains "cough" or "coughing" will take the value of 1 for cough's binary indicator variable and 0 otherwise. By the end of the loop, the descriptive sentences have been converted into a set of binary indicator variables in the format shown in figure 13. After the application of n-gram tokenization to create all the necessary binary indicator variables, we can then obtain the aggregated count of patients for every symptom by calculating the columnar sums of the binary indicator variables. To better communicate the level of prevalence of each symptom, we can apply min-max normalization to the columnar sums to standardize each data point onto a scale of 0 to 10. For any symptom's columnar sum $S_i$, its scaled value is computed as

$\tilde{S}_i = 10 \times \frac{S_i - \min(S)}{\max(S) - \min(S)}.$

The scaled value is an abstract representation of the symptom's prevalence relative to other symptoms, and it does not reflect the true prevalence of the symptom among patients who have been infected with COVID-19.
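A minimal sketch of this tokenization-and-dictionary pipeline is shown below; the symptom_records data frame, its column names, and the dictionaries are illustrative assumptions rather than the tracker's actual definitions.

```r
# Sketch of the n-gram tokenization, dictionary matching, and min-max scaling.
# The symptom_records data frame and the dictionaries are assumptions.
library(dplyr)
library(tidytext)   # unnest_tokens()

symptom_records <- tibble::tibble(
  id = 1:3,
  symptom_text = c("moderate fever 38.5C, cough, strong headache",
                   "he has shortness of breath",
                   "fever, cough")
)

# 1. n-gram tokenization (n = 1..4) to surface candidate symptom phrases
ngram_counts <- lapply(1:4, function(n) {
  symptom_records %>% unnest_tokens(ngram, symptom_text, token = "ngrams", n = n)
}) %>%
  bind_rows() %>%
  count(ngram, sort = TRUE)   # frequency table used to compile the symptom list

# 2. Dictionary matching to build binary indicator variables
dictionaries <- list(
  fever               = c("fever"),
  cough               = c("cough", "coughing"),
  headache            = c("headache"),
  shortness_of_breath = c("shortness of breath")
)
for (sym in names(dictionaries)) {
  pattern <- paste(dictionaries[[sym]], collapse = "|")
  symptom_records[[sym]] <- as.integer(grepl(pattern, tolower(symptom_records$symptom_text)))
}

# 3. Columnar sums and min-max scaling onto a 0-to-10 prevalence score
sums   <- colSums(symptom_records[names(dictionaries)])
scaled <- 10 * (sums - min(sums)) / (max(sums) - min(sums))
```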
A logistic regression model is built to identify risk factors that could potentially increase a patient's likelihood of dying from COVID-19. Once we have formed all the binary indicator variables for symptoms, we can use them along with other variables as predictors to build a logistic regression model on a patient's outcome, which is either active/recovered or death. The following discussion briefly introduces the notion of the logistic regression model and provides the necessary background information for the subsequent discussion of hypothesis testing of the model's coefficients.

Let $y$ be a binary output variable taking on values in $\{0, 1\}$, analogous to a patient's outcome, and suppose we would like to model the output $y$ as a function of the input variables $x = (x_1, \dots, x_p)$. To represent $E(y \mid x)$ so that its value lies in $(0, 1)$, we can apply the sigmoid function as follows:

$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}.$

We can invert the transformation above to obtain the logit function:

$\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p.$

To find the maximum likelihood estimators of $\beta$, we take the gradient of the log-likelihood, set it equal to 0, and solve. Because of the nonlinear nature of the parameters, there is no closed-form solution to these equations, and they must be solved iteratively using numerical methods such as the Newton-Raphson method. The details of the method will not be discussed in this report, as we will focus on the problem of testing hypotheses about the coefficients. Suppose we have successfully estimated all the coefficients $\hat{\beta}$ using numerical methods; we can then use hypothesis testing to evaluate whether the predictors have statistically significant associations with the output variable. Based on the large-sample distribution of the maximum likelihood estimator, we can apply the Wald test for this problem. For any coefficient $\beta_j$, we have the following hypothesis testing setup: $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$. Concerning the significance of the coefficient, we calculate the ratio of the estimate to its standard error,

$z = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)},$

where the variance of $\hat{\beta}_j$ is obtained from the corresponding diagonal element of the inverse of the estimated information matrix. Going back to our case and applying logistic regression to the COVID-19 patient dataset, we model the logit of a patient's probability of death as a linear function of the symptom indicator variables together with demographic predictors such as age and gender.

Before we interpret the model, we need to first ensure its quality. To evaluate the quality and accuracy of the logistic regression model, we can use k-fold cross-validation. The goal of cross-validation is to evaluate the model's ability to generalize and predict unseen data, in order to flag potential problems such as overfitting or selection bias. It provides insights into the model's level of robustness and generalization on new data that is not a part of its training data. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (i.e. the training set), and validating the analysis on the other subset (i.e. the validation set or testing set). To reduce variability, we can repeat this procedure k times by initially partitioning the data into k subsets. Figure 15 provides a visual summary of the process when k = 5. Using the caret and glmnet packages in R, we are able to perform a 5-fold cross-validation to compute the overall accuracy and the ROC curve of the logistic regression model. After repeating the same procedure for logistic regression with L1 and L2 regularization, it was found that the regular logistic regression had the best performance. According to the results, the model has an overall accuracy of 0.900 with a standard deviation of 0.030. The ROC curve is shown in figure 16. After confirming the quality of the model, we can apply the same model to the whole dataset and interpret the coefficient table from the output. From this section, we can derive that fever and cough are the two major symptoms associated with COVID-19. Moreover, pneumonia and shortness of breath are found to be significant risk factors, as they potentially increase one's likelihood of dying from the disease, controlling for other factors.
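A minimal sketch of the 5-fold cross-validation described above is shown below, using the caret package; the patients data frame is synthetic and its variable names are assumptions made for illustration only.

```r
# Sketch of the 5-fold cross-validated logistic regression using caret.
# The patients data frame is synthetic; variable names are assumptions.
library(caret)

set.seed(1)
n <- 300
patients <- data.frame(
  age                 = rnorm(n, 55, 18),
  gender              = factor(sample(c("male", "female"), n, replace = TRUE)),
  fever               = rbinom(n, 1, 0.6),
  cough               = rbinom(n, 1, 0.5),
  pneumonia           = rbinom(n, 1, 0.2),
  shortness_of_breath = rbinom(n, 1, 0.2)
)
# Synthetic outcome, only to make the sketch executable
p_death <- plogis(-6 + 0.06 * patients$age + 1.5 * patients$pneumonia)
patients$outcome <- factor(ifelse(rbinom(n, 1, p_death) == 1,
                                  "death", "active_recovered"))

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Plain logistic regression; L1/L2-regularized variants can be compared by
# swapping method = "glmnet"
fit <- train(outcome ~ ., data = patients, method = "glm", family = binomial,
             trControl = ctrl, metric = "ROC")

fit$results               # cross-validated ROC, sensitivity, specificity
summary(fit$finalModel)   # coefficient table with Wald z-statistics and p-values
```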
This section of the COVID-19 real-time tracker shows a summary visualization of the distributions of demographic characteristics of selected patients (Figure 17). To determine whether there is a statistically significant difference in the age of the two patient groups, active/recovered and deceased, we can conduct a two-sample t-test. The two-sample t-test is a hypothesis testing method for comparing two continuous-data distributions; more precisely, it tests whether the means of two continuous-data distributions are equal. There are a number of assumptions that need to be satisfied in order to use the two-sample t-test properly, and they are listed as follows: i. The data are continuous (not discrete). ii. The data follow the normal probability distribution. iii. The variances of the two populations are equal; if not, un-pooled variances are used to calculate the test statistic. iv. The two samples are independent; there is no relationship between the individuals in one sample as compared to the other. v. Both samples are simple random samples from their respective populations; each individual in the population has an equal probability of being selected in the sample.

Assumption (i) is satisfied as the value of age is continuous. However, assumptions (iv) and (v) may not be valid due to potential data quality issues such as missing data. We will presume they are satisfied and proceed with caution. For assumption (ii), we can validate the data's normality using QQ plots; the data points appear reasonably consistent with the quantiles of a normal distribution. For assumption (iii), we can apply the F-test of equality of variances with test statistic

$F = \frac{S_X^2}{S_Y^2},$

where $S_X$ denotes the sample standard deviation of the age of active/recovered patients, $S_Y$ denotes the sample standard deviation of the age of deceased patients, and $n$ and $m$ denote the sample sizes of the two groups; under the null hypothesis, $F$ follows an F distribution with $n - 1$ and $m - 1$ degrees of freedom. After conducting the hypothesis test, we obtained a p-value of 0.00097. Thus, we have sufficient evidence to reject the null hypothesis and conclude that the variances of the two groups are unequal at the alpha level of 0.05. Consequently, we proceed to conduct a two-sample t-test with un-pooled variances, using the test statistic

$t = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2 / n + S_Y^2 / m}}$

with the Welch-Satterthwaite approximation for the degrees of freedom. As a result, we obtained a p-value that is approximately 0, which allows us to reject the null hypothesis at the alpha level of 0.05. Hence, we have sufficient evidence to conclude that the average age of patients who are active or recovered is different from the average age of patients who have died from COVID-19.

To determine whether there is a statistically significant association between a patient's gender and a patient's outcome, we can conduct a Chi-Square test as a test of association on a 2x2 contingency table. There are a number of assumptions that need to be satisfied in order to use the Chi-Square test properly, and they are listed as follows: i. The data in the cells should be frequencies, or counts of cases. ii. The levels (or categories) of the variables are mutually exclusive. iii. Each observation is independent of all the others. iv. The expected values should be 5 or more in at least 80% of the cells, and no cell should have an expected count of less than one. Assumptions (i) and (ii) are met since we are observing counts of patients who are either male or female, and either active/recovered or deceased. In addition, assumption (iv) is satisfied as shown by the 2x2 contingency table. We will presume assumption (iii) to hold true and proceed.
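The following sketch pulls together the tests used in this section, reusing the assumed patients data frame from the earlier sketch (columns age, gender, and outcome): the QQ-plot check, the F-test of equal variances, the Welch two-sample t-test, and the Chi-Square test of association on the gender-by-outcome contingency table.

```r
# Sketch of the demographic tests, reusing the assumed patients data frame
# from the previous sketch (columns: age, gender, outcome).
died     <- patients$outcome == "death"
age_dead <- patients$age[died]
age_act  <- patients$age[!died]

# Normality check for assumption (ii)
qqnorm(age_act);  qqline(age_act)
qqnorm(age_dead); qqline(age_dead)

# F-test of equality of variances: F = S_X^2 / S_Y^2
var.test(age_act, age_dead)

# Two-sample t-test with un-pooled (Welch) variances
t.test(age_act, age_dead, var.equal = FALSE)

# Chi-Square test of association on the 2x2 gender-by-outcome table
tab <- table(patients$gender, patients$outcome)
tab           # check that expected counts are large enough
chisq.test(tab)
```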
After filtering the data to create a subset of patient data with recorded genders and outcomes, we can form a 2x2 contingency table of gender by outcome and apply the Chi-Square test of association. As a result, we obtained a p-value of 0.1025, which does not provide sufficient evidence for rejecting the null hypothesis. Thus, we fail to reject the null hypothesis at the alpha level of 0.05 and conclude that a patient's gender has no statistically significant association with the patient's outcome. From this section, we can derive that the average age of patients who are active or recovered is different from the average age of patients who have died from COVID-19, and that older populations are more vulnerable to a negative outcome of the disease. In contrast, gender does not appear to have a strong association with the outcome. Both findings are consistent with the results of the logistic regression model from the previous section, as both age and gender are incorporated as a part of that model.

This research presented the latest trends of COVID-19 across different regions and insights into COVID-19's symptoms and patient demographics as visualized in the real-time COVID-19 tracker. In addition, the research dives into the deeper details of the methodologies behind the real-time COVID-19 tracker, which include real-time data retrieval, data transformation and normalization, time-series forecasting with the ARIMA model, text mining techniques, and the logistic regression model. However, we need to be cautious about accepting the conclusions, as there are potential data quality issues: the patient-level data has a substantial amount of missing data and erroneous entries. To verify the findings in this research, we can try to reproduce the derived insights when we have access to an updated dataset toward the end of the pandemic. During a global pandemic such as COVID-19, it is paramount for the public to have access to the latest status of the outbreak and to be well informed of relevant insights into the disease. A platform such as a real-time COVID-19 tracker helps disseminate accurate and reliable insights into the spread of COVID-19 to the public community. The research and effort behind the tracker are motivated by the social responsibility to spread awareness among the general public by providing science-based data analysis, predictions, and relevant findings. This paper is part of an ongoing research project, as many more investigations regarding COVID-19 can be carried out. It serves as an initial step toward unraveling the many uncertainties that revolve around this global pandemic.