key: cord-0835707-nrrdww9a authors: Hussein, R. title: A HYBRID KNOWLEDGE-BASED AND MODIFIED REGRESSION ANALYSIS APPROACH FOR COVID-19 TRACKING IN USA date: 2020-07-29 journal: nan DOI: 10.1101/2020.07.26.20162347 sha: 729995f981164dace9e66e2c77f18ee2a1aef41c doc_id: 835707 cord_uid: nrrdww9a Since its appearance in 2019, the covid-19 virus deluged the world with unprecedented data in short time. Despite the countless worldwide pertinent studies and advanced technologies, the spread is neither contained nor defeated. In fact, there is a record surge in the number of confirmed new cases since July 2020. This article presents a new predictive Knowledge-based (KB) toolkit named CORVITT (Corona Virus Tracking Toolkit) and modified linear regression model. This hybrid approach uses the confirmed new cases and demographic data, implemented. CORVITT is not an epidemiological model, in the sense that it does not model disease transmission, nor does it use underlying epidemiological parameters like the reproductive rate. It forecasts the spread in order to assist the official to make proactive intervention. these parameters explain the simulation behavior. The agent-based techniques are stochastic, enabling the variability of human behavior to be incorporated to help understand the likely effectiveness of proposed protective measures. The discrete event technique is also stochastic to model operations over time where entities flow through number of activities. The hybrid simulation combines two or more techniques is used for complex behavior. These techniques focus mainly on diseases' transmission unfolding phases such as quarantine, lock down, testing, and health care services. Some of these approaches are rooted in the literature since 1777, complex, and cumbersome to implement. Without the adequate specialists in advanced and complex mathematical theories and/or computers, the logical question is thus: how could the authorities ascertain the covid-19 spread in order to make a proactive intervention decisions; e.g. to prepare hospitals and intensive care units, to mitigate the adverse impacts of what may happen in the near future? In search for accurate answer and based on the popular utilization of covid-19 relationship between the number of cases [9] and population per land area, the idea of a new index was conceptualized in this study. It represents the number of reported confirmed new cases per population in the specific region the data was recorded. This new concept harnesses the number of cases and the regional crowdedness of people, which varies in the US from a single digit to multi thousand [2] . The index increases with more cases and with more dense populations (shorter social distancing). In this study, a combined linear regression analysis and data-fitting model is used. To deal with data fluctuation, this model used a short time span of one month maximum for forecasting, [10, 11, 13, and 14] . The data is obtained from the New York Times Journal database [12] . The journal publishes the daily cases of covid-19 by state and county in the US. The data from eleven states was used: New York State (NYS), Florida (FL), California (CA), Colorado (CO), Illinois(IL), Texas (Tx), Louisiana (LA), Washington (WA), Georgia (GA), New Jersey (NJ), and Michigan (MI). Table 1 the data for selected counties in New York State. We first constructed the slope for the confirmed cases and population in each of the states from March 27 to May 11, 2020 and then used polynomials fitting and linear regression analysis for forecasting. Linear Regression is a direct way to deal with the connection between variables. Table 2 shows a typical output from one of the many regression analysis' runs followed in this study. To accurately and proactively capture the big picture of covid-19 spread, this study transfers the expertise of problem-solving from humans into a KB toolkit that takes in the same data and yields the same conclusion but faster. This new KB-statistic hybrid approach effectively assists humans in dealing with covid-19 massive daily data in addition to save times which is an essential requirement in dealing with the virus illusiveness. The. study introduces for the first time in this field, to our knowledge, a novel KB toolkit to visualize the data and make it easier to understand and use without either mathematical or computer expertise. The CORVITT is a promising incubator for covid-19 future forecasting platforms. Its architecture blueprint emerges from an open-end modular adaptable structure encompassing a graphical-interface client allowing the users to operate it with no professional skills. This KB technology has been proven in other applications and thus applied in this study for covid-19 [15, 16, and 17] . To the author's knowledge, the concept of CORVITT has not been attempted to date for covid-19. Figure 1 shows the dashboard of CORVITT and Fig. 2 shows the data used in Fig. 1 . Although the amount of collected data is massive, the use of the dashboard is intuitive and user friendly. Again, there is no need for medical ,mathematical, or computer skills to use the dashboard and benefit from its applications. This is one of the benefits of bringing an artificial brain to help human brains in dealing with complex challenge at hand such as the covid-19. In what follows, we examined the feasibility of the ascribed model for covid-19 in two ways: firstly, by analyzing its forecasted outcomes in eleven US states, and secondly by comparing the forecasted results with actual onsite data. Firstly, a database was created at micro-level or counties, for the first time to our knowledge, for the new cases and population per land area on covid-19 from March 27 to May 1, 2020 in the eleven US states: NYS, FL, CA, TX, LA, WA, CO, GA, IL, NJ, and MI. These sates had a steady high number of confirmed cases according to the New York Times Journal. Table 2 shows some of the collected data. Figure 2 shows CORVITT' gauging of the virus county-wise distributions in terms of the new index and population. Figure 1(d) shows that in NYS, FL, CA, CO, and IL the inhabitants are infected in areas with large social distances because most of the data is concentrated at low population, i.e. large spaces between the inhabitants. On the contrary, in TX, LA, WA, GA, NJ, and MI the virus was spread though the social distance was small because the data spreads over a wide range of population, i.e. large space between the inhabitants. Unlike the common general approach used for all US states at all times since 2019, which is common in the news media from Tabloid to New York Times journals, this discovery unveiled new facets. Figure 1 (d) shows that on March 27, the indexes were 0.93 and 0.10 in NYS and NJ (large distance), respectively, 0.08 and 0.07 for TX and GA (small social distance), respectively, though the spread of data appeared similar on the dashboard in each group. In addition, NYS, LA, WA, IL, and MI have high indexes by comparison to FL, CA, TX, CO, GA, and NJ For example, On April 21, 2020, the indexes were 5.0 and 0.40 for NYS and CA although the closeness of data in both states was similar. Furthermore, the increase of the index within each state was nonuniform. For example, the index increased from 2.8 to 5 between April 9 and 21 in NYS, and from 0.05 to 1.0 in TX over the same period. From a different angel of view, Figure 1 (a) shows that the rate of spreading of the virus differ from one state to another. For example, the spreading rate in IL is very high compared to that in CA. This indicates that the virus can affect more people in IL (57920 sq. mi and 13 million people) than NJ (8730 sq. mi and 9 million people). Taken as a whole, CORVITT' outcomes suggest that NYS is a good region for the virus to spread whereas CA is not as good from March 27 to May 1, 2020. Such information allows the authority to prioritize the resources giving NYS the highest priority. Secondly, Figures 3a and 3b compares the forecasted and actual on-site data and shows a close agreement in different US states. Figure 3a shows that whereas the new cases in NYS has reached the peak in the first week of May, the pandemic was worsening in other states as Illinois, but in California, Georgia, and Colorado reached a plateau. Figure 3c shows the severity of the pandemic as indicated by the skewness; positive (to the right) for LA and negative for NYS, NJ, and MI of the growth and deterioration of the distributions [18] . The positive skewness means longer deterioration (decline in cases) time. The shallow deterioration rate at the trailing end of the curve in Fig. 1(b) is a sign of a plateau. Figure 1describes the peak, weakness, and steadiness statuses by which the virus trajectory disperses through different stages in various regions. This new discovery is useful to understand the building up and collapse of the virus impacts thus make proactive preparations. As the enormity of the covid-19 threat has become clear, the characteristics of existing covid-19 complex analytic methodologies and the all-encompassing approach place serious limitations on their usefulness for practical use. The computer technologies have reached what no one could imagined, and the KB systems have proven very beneficial in many fields. The logic question is what hasn't the KB been applied for covid-19 in harness the analytical complexity and easy to apply simulations? This article brought machine smartness to assist humans' intelligence to capture the big picture of the virus illusiveness thus take proactive steps to mitigate safely its inevitable adverse effects. This article introduced a hybrid KB-regression analysis model for covid-19 forecasting. It used data collected from eleven US states at macro-level level to foresee the short-term spread trajectory. The outputs unveiled new discoveries and shed light on various facets of the covid-19 in each state. The accuracy of the hybrid approach was gauged by comparing forecasted and actual data and satisfactory agreements were found. It should be noted that this study is a step forward, but additional development is in progress for improvement. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 29, 2020. . https://doi.org/10.1101/2020.07.26.20162347 doi: medRxiv preprint All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 29, 2020. . https://doi.org/10.1101/2020.07.26.20162347 doi: medRxiv preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 29, 2020. . (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 29, 2020. . https://doi.org/10.1101/2020.07.26.20162347 doi: medRxiv preprint Committee for the Coordination of Statistical Activities Situation Summary Global research on coronavirus disease The Chinese Medical Association Publishing House The European Centre for Disease Prevention and Control The coronavirus is washing over the U.S. These factors will determine how bad it gets in each community How simulation modelling can help reduce the impact of covid-19 Statistical analysis of forecasting covid-19 for upcoming month in Pakistan A machine learning forecasting model for covid-19 pandemic in India The Coronavirus in the US Modeling and Forecasting for the number of cases of the covid-19 pandemic with the Curve Estimation Models, the Box-Jenkins and Exponential Smoothing Methods Case Fatality Rate estimation of covid-19 for European Countries: Turkey's Current Scenario Amidst a Global Pandemic; Comparison of Outbreaks with European Countries Knowledge-based & expert system technologies for built & natural environments Artificial Intelligence for the Built and Natural Environments and Climate Knowledge-based Tools for Monitoring and Management of the Engineered Infrastructure Construction Systems Principles of Epidemiology in Public Heath Practice