key: cord-0698012-uz75pk0v authors: Huang, Yi; Chattopadhyay, Ishanu title: Universal risk phenotype of US counties for flu-like transmission to improve county-specific COVID-19 incidence forecasts date: 2021-10-14 journal: PLoS Comput Biol DOI: 10.1371/journal.pcbi.1009363 sha: f03fd7c88e20fa383179ab771f0587f8ea38d00d doc_id: 698012 cord_uid: uz75pk0v

The spread of a communicable disease is a complex spatio-temporal process shaped by the specific transmission mechanism and diverse factors, including the behavioral, socio-economic and demographic properties of the host population. While the key factors shaping transmission of influenza and COVID-19 are beginning to be broadly understood, making precise forecasts on case count and mortality is still difficult. In this study we introduce the concept of a universal geospatial risk phenotype of individual US counties facilitating flu-like transmission mechanisms. We call this the Universal Influenza-like Transmission (UnIT) score, which is computed as an information-theoretic divergence of the local incidence time series from a high-risk process of epidemic initiation, inferred from almost a decade of flu season incidence data gleaned from the diagnostic history of nearly a third of the US population. Despite being computed from past seasonal flu incidence records, the UnIT score emerges as the dominant factor explaining incidence trends for the COVID-19 pandemic over putative demographic and socio-economic factors. The predictive ability of the UnIT score is further demonstrated via county-specific weekly case count forecasts which consistently outperform state-of-the-art models throughout the time-line of the COVID-19 pandemic. This study demonstrates that knowledge of past epidemics may be used to chart the course of future ones, if transmission mechanisms are broadly similar, despite distinct disease processes and causative pathogens.

We are in the midst of a global pandemic caused by the novel coronavirus SARS-CoV-2, and reliable prediction of the future local and national case count is crucial for crafting effective intervention policies. Thus, the need for tools that chart the likely course of an epidemic in the human population is now felt more than ever. The spread of a transmissible virus is shaped by diverse interacting factors that are hard to model and respond to [1], including the specific transmission mechanism, the survivability of the pathogen outside the host under harsh environmental conditions, and the ease of access to susceptible hosts, determined in part by the density of the local population, its travel habits [1], and compliance with common-sense social distancing policies. Additionally, the prevalence of pre-existing medical conditions in the local population, and its demographic makeup, might modulate the susceptibility of specific hosts to the virus, slowing or accelerating the spread of the disease [2, 3]. While a broad set of putative factors shaping the spread of communicable viruses such as seasonal Influenza and COVID-19 is becoming increasingly clear [4-15], making precise, granular, actionable forecasts of the case counts over time is still difficult. At present, faced with the challenge of forecasting COVID-19 incidence over time, a diversity of modeling approaches has emerged [16-22]. However, a single best model is yet to coalesce.
In this study we introduce the concept of a universal geospatial risk of person-to-person transmission of influenza-like illnesses in the US; universal in the sense that it is pathogen-agnostic, provided the transmission mechanism is broadly similar to that of seasonal Influenza. We call this the Universal Influenza-like Transmission (UnIT) score. Transmission dynamics in the general population is known to be modulated by diverse factors, only a few of which have been investigated and are now beginning to be characterized. In all likelihood many unmodeled factors remain, along with the impact of non-trivial interactions between such known and unknown covariates that are hard to disentangle and account for. The UnIT score allows us to account for the impact of these unmodeled effects by automatically leveraging subtle emergent geospatial patterns underlying the seasonal flu epidemics of the past. In particular, we reduce the need for human modelers to manually identify every putative covariate that impacts the process. Importantly, the UnIT score, once computed, has applicability beyond seasonal influenza. Validating our claim that the estimated UnIT score indeed quantifies a risk phenotype of individual counties for a disease with a flu-like transmission mechanism, we significantly improve incidence forecasts for COVID-19 over currently proposed state-of-the-art models. We show that the UnIT score emerges as the most important factor "explaining" observed county-specific incidence trends for COVID-19 in the US, with coefficients in normalized multi-variate regression dominating those for typical covariates.
Thus, our key insight is that incidence patterns from a past epidemic caused by a different pathogen can substantially inform current projections, under mild assumptions on the similarity of the transmission mechanisms. We operationalize this insight by crafting a general information-theoretic principle to transfer this past knowledge to the new context of COVID-19. This is accomplished via a new computable measure of intrinsic similarity between stochastic sample paths generated by the hidden processes driving incidence. Our overall scheme is summarized in Fig 1. We leverage the county-specific incidence patterns observed for past Influenza epidemics to compute the UnIT score, which is then used as a new fixed effect to infer a general linear model (GLM) for county-specific COVID-19 weekly count totals, alongside other putative covariates. The coefficients computed for the GLM model (stage 1, updated weekly) are then used to "correct" the COVID-19 count, replacing the observed count vector with the weighted linear combination of the socio-economic, demographic and UnIT risk covariates. Intuitively, one may visualize this step as analogous to replacing a somewhat diffused set of observed points with a fitted line in linear regression.

Fig 1. We use a national insurance claims database with more than 150 million people tracked over a decade (Truven Claims database) to curate geospatial incidence records for past Influenza epidemics over nearly a decade, which informs our new UnIT score. This score is then used as an additional fixed effect, along with other putative socio-economic and demographic covariates obtained from the US Census, to infer a General Linear Model (GLM) explaining the weekly county-specific COVID-19 case count. Using this inferred GLM model we "correct" the observed weekly case count, and use it as the only feature in an ensemble regressor to forecast county-specific count totals. The GLM model and the regressor are recomputed weekly, while the UnIT score remains invariant, representing a geospatial phenotype modulating transmission.

Finally, in stage 2, this corrected incidence vector is used to train ensemble regressors (updated weekly) that predict the next week's county-specific count totals. Importantly, the GLM model in the first stage and the regressors in the second stage are updated weekly, while the UnIT score remains invariant. The key ingredient that makes the simple ensemble regressors in the final step perform better than more involved tools reported in the literature is the information-rich UnIT score, which potentially informs about complex transmission patterns modulating Influenza-like incidence native to each US county. Importantly, for each week we compute one GLM model and a fixed set of ensemble regressors, which are then used to predict case counts for individual counties, i.e., we do not have separate models for each county, and the county-specific effects are captured by the spatial variation of the putative covariates. Our ability to leverage Influenza infection patterns to inform COVID-19 modeling is not surprising. COVID-19 and Influenza are both respiratory disorders, which present as a wide range of illnesses, from asymptomatic or mild through to severe disease and possible death. Both viruses are transmitted by contact, droplets and fomites [23]. Current efforts to curb the spread of COVID-19 worldwide have also reduced Influenza cases [24-26].
However, to the best of our knowledge, the current paradigms have not capitalized on this similarity between the transmission mechanisms of the two viruses. This is not simply an oversight: an effective approach to leverage flu patterns in COVID-19 modeling is non-trivial. Despite the similarities outlined above, there are important empirically observed differences between the two diseases precluding a "drop-in" replacement, e.g., COVID-19 possibly has a higher reproduction number [27-29], can be spread widely by asymptomatic carriers (more so than Influenza [30, 31]), is estimated to have a potentially higher mortality rate [32], is novel, i.e., is infecting a host population with almost non-existent immunity, and the COVID-19 pandemic has induced a global trend of social distancing policies alien to seasonal flu dynamics. Despite these challenges, the UnIT score has significant predictive value, more than the manual combinations of putative factors investigated so far. In our results on COVID-19 modeling, in addition to the UnIT risk, we also use a scaled version of the UnIT risk, which we call the urban-UnIT risk (See Fig 2G). The urban-UnIT risk is the product of the estimated UnIT risk and the percentage of urban population in each county (See Materials and methods for details). To demonstrate the role of the urban-UnIT risk as a meaningful risk phenotype of US counties, we first investigate its influence as a covariate driving the weekly total new case count for COVID-19. Diverse putative driving factors have been investigated to explain/model the epidemiological data emerging over the course of the current pandemic. Suspected factors include weather and pollution covariates [34], population density, socio-economic factors such as poverty, median household income, various measures of income inequality, and the fraction of the population without medical insurance, and demographic variables such as the percentage of African-American, Hispanic and other minorities in the local population, the percentage of the population aged over 65 years, and gender [34-39]. A common approach here is the use of Poisson regression [40] to establish the statistical significance and relative magnitude of the influence of the various individual factors, and their suspected interactions. We identified the variables that have been repeatedly cited as the most important driving factors, and investigated the effect of adding the urban-UnIT and UnIT scores to multi-variate Poisson regression models, with the weekly new case count total as the endogenous (response) variable. Our first key result is that, in our models, the urban-UnIT score significantly dominates typical putative factors. The UnIT score emerges as the second most important covariate (See Table 1).
Fig 2. Panel A. Our approach begins with collecting weekly county-wise new case counts of the seasonal flu epidemic spanning Jan. 2003 to Dec. 2012 from a large national database of insurance claims records (Truven MarketScan). We identify weekly Influenza diagnoses using ICD codes related to influenza infection (See Materials and methods), and end up with county-specific integer-valued time series for each US county for each flu season. Panel B. These 471-week-long integer-valued time series are used to compute pairwise similarity between the counties using our new approach of computing intrinsic similarity between stochastic sample paths (See (5)). This similarity matrix induces county clusters C0, C1, C2 and C3, inferred via standard spectral clustering. Panel C. The flu incidence time series allow us to identify counties which register cases in the first couple of weeks of each flu season. Averaged over all the seasons, this gives us a measure of average epidemic initiation risk. Panel D. Using the incidence series for the county cluster with maximal average initiation risk, we compute a specialized HMM model (PFSA, see Materials and methods) G*. Panel E. Then, we compute the UnIT risk phenotype of each county as the sequence likelihood divergence (SLD, See (8)) between the observed incidence sequence and the inferred PFSA model G*. Panels F and G. Finally, the urban-UnIT risk is computed by scaling up the UnIT risk with the fraction of urban population in each county, as obtained from the US census (Panel F). We show that this risk phenotype is highly predictive of the weekly case count of COVID-19, while only dependent on Influenza epidemic history. https://doi.org/10.1371/journal.pcbi.1009363.g002

Since we standardize covariates to zero mean and unit variance, the magnitude of the inferred coefficients potentially reflects their relative impact in the models. This is illustrated in Table 1, where we show the inferred coefficients in a Poisson regression model with the typical covariates along with the urban-UnIT and UnIT risks. In Table 1 we consider the county-specific COVID-19 case counts available on 2021-05-30, and note that the magnitude of the coefficient for the urban-UnIT risk is close to being an order of magnitude larger than that for the next most influential covariate (0.849 for urban-UnIT risk vs. −0.156 for % of population in poverty, and −0.127 for median household income). Note that all inferred coefficients are strongly significant with p < 0.01. Next, to demonstrate the dominance of the UnIT risks throughout the current pandemic, we carry out the regression modeling at each week of the current pandemic. We find that the urban-UnIT risk remains dominant over the entire pandemic time-line (See Fig 3A(ii)), by comparing 1) a baseline model with the covariates outlined in Table 1, with the exception of the two UnIT risk variables, vs. 2) the full UnIT-augmented model with all the enumerated covariates. The comparative results are shown in panels A(i) and A(ii) of Fig 3. Comparing the explained variance of the weekly confirmed case counts via the standard R² measure (See Fig 3B and 3C), we note that the UnIT-augmented model has a substantial advantage over the baseline model, explaining nearly 60% of the variance in the observed weekly COVID-19 case count totals (median R² ≈ 0.651 for the augmented model, and ≈ 0.448 for the baseline model) for most of the pandemic time-line. Weekly inference of coefficients for the weeks between 2020-09-05 and 2021-01-23 is shown in Table 2. We note that simply comparing the magnitude of the regression coefficients between the baseline and the UnIT-augmented models might not justify the claim that our new covariates improve the model. To establish this, we compute the Akaike Information Criterion (AIC, the lower the better) [41], and the model log-likelihood (the higher the better), over time, as shown in Fig 4A and 4B. We note that the augmented model clearly dominates the baseline for both measures.
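To make the weekly regression comparison concrete, the following sketch fits a baseline and a UnIT-augmented Poisson GLM on a single week's county-level snapshot and reports standardized coefficients, AIC and log-likelihood. It is a minimal illustration under assumed names: the input file and all column names (including unit_risk and urban_unit_risk) are hypothetical placeholders, not the study's actual data layout.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical weekly snapshot: one row per county, with the covariates
# discussed in the text plus the two UnIT-derived risks (names assumed).
df = pd.read_csv("county_week_snapshot.csv")

baseline_cols = ["total_population", "pct_over_65", "pct_hispanic",
                 "pct_black", "pct_minority", "pct_poverty",
                 "median_household_income", "pct_urban"]
augmented_cols = baseline_cols + ["unit_risk", "urban_unit_risk"]

def fit_poisson(cols):
    # Standardize covariates to zero mean / unit variance so that the
    # magnitudes of the inferred coefficients are directly comparable.
    X = (df[cols] - df[cols].mean()) / df[cols].std()
    X = sm.add_constant(X)
    y = df["weekly_new_cases"]
    # Poisson GLM with the canonical logarithmic link.
    return sm.GLM(y, X, family=sm.families.Poisson()).fit()

baseline = fit_poisson(baseline_cols)
augmented = fit_poisson(augmented_cols)

print(augmented.params.sort_values(key=np.abs, ascending=False))
print("AIC  baseline / augmented:", baseline.aic, augmented.aic)
print("LLF  baseline / augmented:", baseline.llf, augmented.llf)
```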
A second issue is the justification for assuming that our response variable follows a Poisson distribution. Poisson regression makes strong assumptions about the dispersion characteristics of the data, in particular that the mean and the variance of the response variable are identical. If our data is significantly overdispersed, then negative binomial (NB) regression, which lacks this particular constraint, is generally suggested to be a better choice. We find that our data is indeed somewhat overdispersed (as determined by calculating the ratio of the deviance to the residual degrees of freedom [42], which turns out to be > 1). We compare the effect of replacing the Poisson regression with NB regression in the first stage of our approach, which results in similar, but somewhat worse, predictive performance for the case counts at the end of our predictive pipeline. We see that the NB model has on average lower AIC and higher log-likelihood compared to the Poisson model (See Fig 4A and 4B), but nevertheless the final performance is slightly worse with the NB model (See Fig 4G and S1 Fig). Hence, we decided to use the Poisson regression over NB in the first stage of our approach. It is important to note that we are only using the Poisson regression in the first stage to generate features that get used by the ensemble regressor in the second stage. We are not using the first-stage model for prediction directly, and we are not using any results that require residual normality or the estimation of confidence bounds, and hence are not significantly affected by errors resulting from overdispersion. A third issue lies with the claim that the regression coefficient for the urban-UnIT risk dominates the other covariates in the augmented model. Since the covariate for the % of urban population dominates in the baseline model, one might argue that the dominant behavior of the urban-UnIT risk arises purely from its definition: the product of the % of urban population with the UnIT risk. Also, comparing Fig 3A and 3B, it appears that the urban-UnIT risk is anti-correlated with the explained variance R², which might undermine the claim that this new covariate improves our model. To investigate these concerns, we fit a separate model in which we only included the UnIT risk (and not the urban-UnIT risk), and compared the coefficients for the UnIT risk and the % of urban population. The results are shown in Fig 4D, 4E and 4F. Panel D shows that the UnIT risk dominates the % of urban population on average, and over the timeline of the pandemic, except within some limited periods. Additionally, a locally weighted scatterplot smoothing (LOWESS) fit [43] in panels E and F shows that while the explained variance increases (and saturates) with increasing values of the coefficient for the UnIT risk, it rapidly falls with increasing values of the coefficient for the % of urban population. This establishes that the UnIT risk does have more explanatory information, and the behavior of the urban-UnIT risk may be attributed to the decreasing predictability of the response variable as the effect of the % of urban population becomes more important.
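For reference, a minimal sketch of the LOWESS smoothing used in panels E and F; the weekly series here are synthetic placeholders mimicking the qualitative trend described above, and the smoothing fraction is an arbitrary choice, not a value taken from the paper.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic placeholder weekly series: coefficient of the UnIT risk and the
# adjusted R^2 of the corresponding weekly regression.
rng = np.random.default_rng(0)
coef_unit = rng.uniform(0.1, 1.0, size=60)
adj_r2 = 0.4 + 0.2 * coef_unit + rng.normal(0.0, 0.05, size=60)

# LOWESS returns the smoothed curve as (x, y_hat) pairs sorted by x;
# frac controls the span of each local regression.
smoothed = lowess(adj_r2, coef_unit, frac=0.6)
x_grid, r2_smoothed = smoothed[:, 0], smoothed[:, 1]
```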
As for the comparison between the augmented and the baseline models with respect to the dominance of the % of urban population, it is expected that urban population (cities) matters, with infection spreading more easily among people living in close contact, explaining why this covariate is dominant in the baseline model. However, combined with the UnIT risk, we get a more predictive covariate (the urban-UnIT risk), which then dominates in the augmented model. In addition to the above considerations, we demonstrate the robustness of the UnIT score via multiple modes of perturbation, namely by 1) deleting the top 10% of counties ranked by the highest number of COVID-19 cases per capita, and 2) randomly selecting only 75% of the counties to include in the analysis. Under all such perturbations, the UnIT score retains its position as the dominant explanatory factor (See S3 Fig). Finally, we investigate our ability to forecast weekly COVID-19 case count totals across US counties. Using a simple forecast model (See Eq (3a) in Materials and methods) that incorporates the UnIT risk, we outperform the state-of-the-art models from the US COVID-19 modeling community (https://covid19forecasthub.org/community), achieving the least mean absolute error in 1-week-ahead county-specific incidence forecasts (See Fig 3D and 3E) over the entire pandemic time-line. The predicted and confirmed case counts for New York and California are shown at selected weeks over the pandemic, where our 1-week forecasts match up well with the observed counts (See Fig 5) in these two US states hit hard by the COVID-19 pandemic.

Fig 4. Panels A and B compare the Poisson and negative binomial (NB) regressions, considering the AIC (lower is better) and the log-likelihood of the models over time (higher is better). The Poisson approach (red) ultimately makes slightly better predictions from the two-stage modeling, as shown in the bottom row of the figure. Panel C illustrates that influenza is a good choice for a COVID-19-similar disease, producing the largest coefficients for the risk variable when compared with bacterial infections such as Staphylococcus aureus (which is worse than Influenza) or chronic infections such as HIV (which is still worse). Panel D shows the temporal variation of the regression coefficients for the UnIT risk and the % of urban population. Here we used Poisson regression leaving out the urban-UnIT risk covariate in the augmented model to highlight the role of the UnIT risk vs. the % of urban population: except in the shaded periods, the coefficient for the UnIT risk dominates. Panels E and F show the variation of the coefficients for the UnIT risk and the % of urban population with the adjusted R². We note that the LOWESS fit shows that R² increases and saturates as the coefficient for the UnIT risk increases, whereas it drops rapidly with increasing values of the coefficient for the % of urban population. This suggests that when the covariate for the % of urban population is more important, our explained variance is low. Panel G illustrates the mean absolute forecast errors at different points in the pandemic, highlighting the results obtained with Poisson and NB regressions (See also S1 Fig). https://doi.org/10.1371/journal.pcbi.1009363.g004

The global modeling community responded to the COVID-19 pandemic with diverse tools [17, 44-47] to predict case counts, COVID-19-related hospitalizations and deaths (See Table A in S1 Text for an incomplete list). The proposed approaches range from county-level metapopulation estimates, to stochastic compartmental models, to fitting Gaussian processes to raw
data, to survival-convolution models, to growth-rate dynamics, to models that explicitly take into account human mobility and social distancing policies. In the US, predictions from individual contributing groups are being used to inform an ensemble forecast [48], which is currently live at a web-based visualization portal at https://viz.covid19forecasthub.org/ (the COVID-19 forecasthub). As a contribution to this community, we report a precise yet simple model for forecasting case counts; one that operates without explicit social distancing and other hard-to-measure parameters, yet outperforms the operating models at the COVID-19 forecasthub, including the ensemble forecast. Our current 1-week forecast may be viewed at the COVID-19 forecasthub webpage (team: UChicagoCHATTOPADHYAY-UnIT), and complete software with usage instructions (See Software Usage in S1 Text) is publicly available at https://github.com/zeroknowledgediscovery/unitcov.

Fig 5. We compare the weekly forecasts with the observed count for the state of California. We note that in both states, for the weeks included in this limited snapshot, the predicted count matches up well with what is ultimately observed. The cartography in this figure is generated from scratch using open-source shape files available at https://www.sciencebase.gov/catalog/item/581d051de4b08da350d523cc using GeoPandas [33]. https://doi.org/10.1371/journal.pcbi.1009363.g005

In addition to the development of forecasting tools, general epidemiological modeling of COVID-19 has progressed in two broad categories: 1) deep theoretical approaches to understand disease propagation in epidemics, extending classical compartmental models or their variations [17, 44-47]; these investigations aim to estimate the theoretical reproduction number of COVID-19, and other epidemiological quantities associated with the virus; and 2) studies focused on identifying putative factors driving the differential severity and case counts across regions, demographic strata and age groups [34-39, 49-51]. The first category of studies may be seen as theoretical epidemic modeling, and the second as inferential analyses [52], i.e., inferring how nature associates responses with input variables, aiming to work out the differential impact of putative factors. The current study improves results in the second category by presenting the UnIT score as a highly explanatory covariate, and then demonstrates its ability to make precise incidence forecasts. The UnIT risk exposure of a US county is conceived of as the intrinsic similarity of its time series of weekly totals of new flu cases to that observed in counties at high risk of epidemic initiation. Thus, central to our approach is the notion of intrinsic similarity between stochastic processes, particularly when the structure of the underlying processes is unknown. Such (dis)similarity is quantified by the notion of sequence likelihood divergence (SLD), which lies at the heart of our computation (See Eq (8) in Materials and methods). SLD is a generalization of the notion of divergence between probability distributions (the KL divergence [53]) to potentially non-iid stochastic processes. Similar to how we quantify the deviation of a probability distribution p from q by their KL divergence D(p ‖ q), SLD measures the divergence of a stochastic process P from a process Q as D(P ‖ Q). The actual computations are distinct despite the identical notation used (See Intuitive Example in Materials and methods).
Additionally, the log-likelihood of a sample path x being generated by a process G, denoted L(x, G), converges in probability:

L(x, G) → H(X) + D(X ‖ G)   (1)

with increasing length of x, where X is the true generator of the sample path x, and H(·) is the entropy rate [53] function (See Materials and methods, Theorem 1). Importantly, if the processes are modeled by a special class of Hidden Markov Models known as Probabilistic Finite State Automata (PFSA) [54], then the estimation of the LHS of Eq (1) becomes tractable (Algorithm A in S1 Text). Using SLD we can efficiently compute the dissimilarity between two observed sample paths, estimated as the deviation between the underlying generators. Thus, the UnIT risk (denoted as ν) of a county is defined as the SLD between the underlying process driving its incidence counts and a high-risk process initiating the epidemic. Since these processes are hidden, and only sample paths are observable, we formulate an estimator for the UnIT risk as follows: we begin with weekly county-wise confirmed case counts of the seasonal flu epidemic spanning nearly a decade (nine flu seasons between 2003 and 2012, See Fig 2A). These are obtained by looking for Influenza-related diagnostic codes reported in each week in each county in the Truven MarketScan insurance claims database [55]. This database covers over 150 million patients, i.e. almost a third of the US population, and despite limitations (under-reporting of non-severe influenza cases, and reporting/coding uncertainties), provides a detailed record of flu season incidence dynamics. These relatively short integer-valued time series (each spanning 471 weeks) are used to compute pairwise similarity between the counties (using the SLD-based approach, see Materials and methods), which then induces a partition of the 3094 US counties into a pre-specified number of clusters, obtained by using standard clustering techniques, e.g. spectral clustering [56] (See Fig 2B). We note here that the number of clusters (four) is chosen via standard heuristic considerations [57], and increasing this number somewhat does not significantly impact our results. With these county clusters in hand, we next inspect the initial weeks of the nine flu seasons to estimate the empirical probability of a specific county reporting cases within the first couple of weeks of a flu season; these counties are empirically at high initiation risk (See Fig 2C). We find that one specific cluster accounts for almost all of the counties at high risk of flu season initiation. Focusing on the set of counties in this high-risk cluster, we infer [54] a PFSA G*, assuming that the incidence series at each of these counties is a sample path from the same underlying stochastic process (See Fig 2D). This is a simplification, aimed at obtaining an average model driving the incidence dynamics at initiation, ignoring the variation in the structure and parameters of the underlying processes among the high-risk counties themselves. Finally, we estimate the UnIT risk exposure of each county with count sequence x as:

ν̂(x) = L(x, G*) − H(X)

where the convergence of this estimate to the divergence D(X ‖ G*) between the local process X and the inferred high-risk process G* occurs in probability as the length of x increases. To carry out this computation, we need a consistent estimate [58] of the entropy rate of the process X from x. This is non-trivial [59, 60] if X is not an iid process. We may either: 1) estimate the entropy rate from the observed sample path [61], or 2) compute an upper bound of the entropy rate, assuming X is iid for the purpose of computing H(X) only. The second approach is computationally simpler, but only allows us to estimate a lower bound of the UnIT risk. For simplicity we present results with only the second approach (See Fig 2E), i.e. using a lower bound to the UnIT risk, which is nevertheless demonstrated to have significant predictive value, especially when scaled up with the percent of the county-specific urban population.
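The sketch below illustrates this second option under stated simplifications: the entropy of the empirical symbol distribution of the quantized county sequence upper-bounds the entropy rate H(X) of a stationary source, and subtracting it from the cross-entropy L(x, G*) yields a lower bound on the UnIT risk. The callable neg_log_likelihood_rate, standing in for Algorithm A applied to the inferred high-risk PFSA G*, is a hypothetical placeholder; all quantities are taken in bits per symbol.

```python
import numpy as np
from collections import Counter

def iid_entropy_upper_bound(seq):
    """Entropy (bits/symbol) of the empirical symbol distribution of seq.
    For a stationary process this iid estimate upper-bounds the entropy rate."""
    counts = np.array(list(Counter(seq).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def unit_risk_lower_bound(seq, neg_log_likelihood_rate):
    """Lower bound on the UnIT risk of one county.

    seq                     -- quantized weekly incidence sequence of the county
    neg_log_likelihood_rate -- callable returning L(seq, G*), the per-symbol
                               negative log-likelihood (bits) of seq under the
                               inferred high-risk PFSA G* (placeholder for
                               Algorithm A in S1 Text)
    """
    return neg_log_likelihood_rate(seq) - iid_entropy_upper_bound(seq)
```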
The estimated urban-UnIT risk, obtained by scaling the UnIT risk with the fraction of urban population, is then used to verify its dominant explanatory role amongst suspected covariates, as discussed before. Finally, this risk phenotype is used to make weekly case count forecasts, one week ahead of time, on a per-county basis. The forecast model (See Eq (3a) in Materials and methods) is simple; essentially an ensemble regressor with the urban-UnIT risk as an input feature, along with the previous week's county-wise count totals, which, as shown in Fig 3, outperforms more complex state-of-the-art approaches. It is important to consider how the performance of our algorithm varies along the pandemic timeline, particularly how it performs in the early days of the pandemic, and in the aftermath of peak infection. We plot the percentage forecast errors in panel B in S1 Fig, which shows that we underestimate the counts somewhat at the beginning of the pandemic, and overestimate somewhat post peak infection. While these trends might indicate that the effectiveness of the UnIT risk itself varies across the pandemic timeline, reporting inaccuracies might also be a contributing factor. Indeed, when we applied our approach to forecasting case counts using estimates of the true total infection as computed and posted at https://covid19-projections.com/infections/summary-counties/ [62, 63], these trends disappeared (See S2B and S2C Fig), suggesting the possibility that the official county-specific case count reports might have been underestimated in the early days, and are being overestimated somewhat after the infection peak in the United States. Limited access to COVID-19 tests in the early days of the pandemic supports these observations. Additionally, comparing our forecasts against the reported case count (Fig 3E), it appears that the forecasts were worse around the peak infection point. However, the % forecast errors in S1 Fig confirm that there is no systematic uptick around the peak itself, and the larger differences visible in Fig 3E are due to the scaling up of a similar error fraction as the case count exploded in the United States. With the UnIT risk appearing to have usable information that helps us predict the counts better, it is important to ask whether we could have used the incidence patterns of other diseases, either in conjunction with the UnIT risk or by themselves, to improve the forecasts. We investigated the applicability of the observed incidence dynamics of other infectious agents, such as common bacteria, e.g. Staphylococcus aureus (Staph), or viruses causing chronic infections, such as the Human Immunodeficiency Virus (HIV). As shown in Fig 4C, the Poisson regression coefficients for influenza dominate the others over time, with HIV significantly worse compared to Staph. This is not surprising, since bacterial infections as well as HIV are spread by mechanisms very different from that of COVID-19.
More detailed investigations in this direction will be taken up in the future, with a principled mechanism to characterize the epidemiological "similarity" of different infections from observed incidence patterns. A second important question is the generalizability of our proposed approach to other infections, epidemics, and other geographical regions. As a demonstration, we apply our approach to the prediction of HIV incidence in the year 2011, using influenza to construct the risk covariate as before (with data restricted up to 2010). The results are shown in S2 Fig. The predictions are not very useful, as expected. In addition to influenza not being "similar" to HIV, the latter is a chronic infection, generally taking longer to seroconvert (≈ 2 months [64]), making weekly predictions not particularly appropriate. This suggests that to apply our approach to diseases other than COVID-19, we would need to identify a pathogen that transmits via mechanisms similar to those of the target organism, and also one for which we have geospatial incidence data at our disposal. Application of this approach beyond the United States would require detailed incidence data on influenza and COVID-19. Our access to electronic administrative records is currently limited to the United States, which restricts our validation to the US. Future work will attempt to test applicability in other regions of the world.

A source of uncertainty in our approach is the use of diagnostic codes from insurance claims to infer seasonal flu incidence. Influenza is in general hard to track, since less severe cases are seldom reported. Additionally, electronic health records are inherently noisy, and suffer from potential coding errors by physicians, and other artifacts. Similarly, the number of confirmed COVID-19 cases is also a function of how many tests are actually administered, and of the fraction of the infected population who are asymptomatic. Thus, we are forecasting the number of detected cases as opposed to true disease incidence. Additionally, our database of diagnostic codes needs to have geospatial metadata, i.e., we need to know the county in which each patient was located at the time they contracted influenza. This information has been redacted recently from the Truven database due to privacy concerns, implying that we could only use data up to 2011 to construct the UnIT risk. Leaving out random years one at a time from this 9-year period did not affect the results significantly, suggesting that the key patterns we are leveraging do not change too fast. The effect of considering newer data from other sources will be investigated in the future. An important limitation in our modeling is the set of current assumptions on the distribution of the response variable. While we achieved good prediction results, future work in this direction might consider the application of recently reported strategies that deal with non-Poisson or non-Gaussian distributions [65] to further refine our models. Importantly, our results do not imply that Influenza and COVID-19 are similar in their clinical progression. Indeed, a limitation of our approach is its reduced ability to predict COVID-19-related deaths (S4 Fig and Tables B and C in S1 Text). Our death count forecasts are worse than those of the top few contributors [66] to the COVID-19 forecasthub.
We hypothesize that this reduced effectiveness is attributable to differences in the clinical progression of Influenza and COVID-19: COVID-19 is a more serious disease, and while historical flu patterns may be leveraged to predict the number of cases, performance suffers when we attempt to extend the same strategy to predict mortality. In this study we demonstrate that knowledge of the incidence fluctuations in one epidemic can be leveraged to inform another with a broadly similar transmission mechanism, despite differences in the epidemiological parameters and the disease processes. The COVID-19 pandemic has highlighted the need for tools to forecast case counts early in the course of future pandemics, when only sparse data is available to train upon, by leveraging incidence pertaining to different epidemics of the past.

We begin by describing the forecast model, followed by the mathematical details underlying the risk measure itself. The UnIT score (ν) is a spatially varying, time-invariant measure. Thus, to forecast temporal changes in weekly incidence we consider the past week's case count as a feature in training regressors as follows (where X_t is the observed case count at time t, and X̂_t is the forecast made for t at time t − 1):

UnIT risk correction: train GLM g_t with target X_t and exogenous variables (ν, v_1, …, v_m)   (3a)
X*_t = g_t(ν, v_1, …, v_m)   (3b)
Train regressor h_t: X̂_t = h_t(X_{t−1}, X*_{t−2}, X*_{t−1})   (3c)

Here g_t is the generalized multivariate regression model (GLM) which carries out the Poisson regression, fitted with X_t as the target variable and ν, v_1, …, v_m as exogenous variables, with a logarithmic link function (See (11)). ν is the urban-UnIT risk, and the rest of the variables v_1, …, v_m (as described in Table 1) are total population, fraction of the population over 65 years, fraction of minorities in the population, fraction of Hispanics, fraction of the population reported as African-American or black, fraction of the population designated to be poor, and the median household income. Including the fraction of population living in urban environments as a separate variable does not change results significantly. In Eq (3b), X*_t is the estimate of X_t obtained using the inferred coefficients in g_t, and may be viewed as the noise-corrected version of the current case count. Finally, we train a standard regressor between the corrected case count and the count observed in the next time step, and use it for forecasting one-week futures (Eq (3c)). The choice of the specific regressor (random forest, gradient boosting, feedforward neural networks or more complex variants) does not significantly alter our performance. This is an exceedingly simple model compared to the approaches described in the literature, and is essentially an ensemble regressor with the UnIT-corrected case count as one of the features/inputs. Nevertheless, we outperform the top state-of-the-art models put forward by the COVID-19 modeling community (https://covid19forecasthub.org/community) in mean absolute error in county-specific incidence count estimates (See Fig 3D). As examples, we illustrate the county-wise predicted and confirmed case counts for New York and California at selected weeks over the pandemic, which shows that our 1-week forecasts match up well with the counts ultimately observed (See Fig 5).
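A condensed sketch of the two-stage scheme in Eqs (3a)-(3c), under assumed inputs: cases is a (weeks × counties) array of observed counts, and Z is a (counties × covariates) array of standardized fixed effects with the urban-UnIT risk as one column. A gradient-boosting regressor stands in for the ensemble regressor h_t (the specific choice is stated above not to matter much), and for brevity h_t is trained here on only the most recent observed transition rather than the full history.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

def one_week_ahead_forecast(cases, Z, t):
    """Forecast county-level counts for week t+1 from data up to week t.

    cases -- array of shape (n_weeks, n_counties), observed weekly counts X_t
    Z     -- array of shape (n_counties, n_covariates); standardized covariates
             (urban-UnIT risk, UnIT risk, demographic and socio-economic terms)
    """
    X = sm.add_constant(Z)

    def corrected(week):
        # Stage 1 (Eqs 3a-3b): Poisson GLM with log link regresses the observed
        # counts on the fixed covariates; the fitted values serve as the
        # "UnIT-corrected" counts X*_week.
        glm = sm.GLM(cases[week], X, family=sm.families.Poisson()).fit()
        return glm.fittedvalues

    x_star_prev2 = corrected(t - 2)
    x_star_prev1 = corrected(t - 1)

    # Stage 2 (Eq 3c): learn the map from (last observed count, two most recent
    # corrected counts) to the next week's count, then apply it at week t.
    features_train = np.column_stack([cases[t - 1], x_star_prev2, x_star_prev1])
    h_t = GradientBoostingRegressor().fit(features_train, cases[t])

    x_star_t = corrected(t)
    features_now = np.column_stack([cases[t], x_star_prev1, x_star_t])
    return h_t.predict(features_now)
```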
Efficiently contrasting and comparing stochastic processes is the key to analyzing time-dependency in epidemiological patterns, particularly where randomness cannot be ignored. For such learning to occur, we need to define either a measure of deviation or, more generally, a measure of similarity to compare stochastic time series. Examples of such similarity measures from the literature include the classical l_p distances and l_p distances with dimensionality reduction [67], the short time series distance (STS) [68], which takes into account irregularity in sampling rates, the edit-based distances [69] with generalizations to continuous sequences [70], and dynamic time warping (DTW) [71], which is used extensively in the speech recognition community. A key challenge for the existing techniques is differentiating complex stochastic processes with subtle variations in their generative structures and parameters. When presented with finite sample paths from non-trivial stochastic processes, the state-of-the-art techniques often focus on their point-wise distance, instead of intrinsic differences in their (potentially hidden) generating processes. Our approach addresses this issue and demonstrably differentiates data streams indistinguishable by state-of-the-art algorithms. Our intuition follows from a basic result in information theory: if we know the true distribution p of a random variable, we can construct a code [53] with average description length h(p), where h(·) is the entropy of a distribution. If we instead encode the variable using a code constructed for a different distribution q, we need h(p) + D(p ‖ q) bits on average to describe the random variable. Thus, deviations in the distributions show up as an additional contribution from the KL divergence term D(· ‖ ·). Generalizing the notion of KL divergence to processes, we can therefore quantify deviations in dynamics via an increase in the entropy rate by the corresponding divergence. As a more concrete example, consider sequences of length n generated by two iid processes P_1 = B(.5) and P_2 = B(.8), where B(p) is the Bernoulli process with parameter p [72]. Our objective is to estimate deviations in the binary sample paths generated by these processes. Here we choose iid processes for simplicity, which is not a restriction in general for our approach. Let us generate sequences of length n and use E_ij to denote the expected Hamming distance [73] between sequences generated by P_i and P_j. It is easy to show that E_11 = E_12 = E_21 = 0.5n, which implies that two sequences both generated by B(.5) are not more alike than two sequences where one is generated by B(.5) and the other by B(.8). If instead we evaluate, for a sample path x, the vector v_x of sequence log-likelihoods against the two candidate generators, the result clearly disambiguates the two processes indistinguishable by their expected Hamming distance, and the correct generator may be identified readily as the one corresponding to the index of the smaller entry in v_x. Our approach generalizes this idea to more complex processes, where we cannot make the iid assumption a priori, thus necessitating the generalization of KL divergence from distributions to stochastic processes. In the example above, the generating models are used to evaluate log-likelihoods, which are not directly accessible in our target application. The computation of the log-likelihood L(x, G) of a sequence x generated by a process G is simple (See Algorithm A in S1 Text) if we restrict our stochastic processes to those generated by Probabilistic Finite State Automata (PFSA) [75-78].
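A numerical rendering of this example, assuming nothing beyond the iid Bernoulli setup above: the expected per-symbol Hamming distance cannot separate the two candidate generators, while the per-symbol negative log-likelihood vector v_x identifies the true one.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
p1, p2 = 0.5, 0.8                       # candidate generators B(.5) and B(.8)

x = (rng.random(n) < p1).astype(int)    # sample path actually generated by B(.5)

# Expected Hamming distance per symbol between x and a fresh sample from B(p):
# P(disagree) = x_bar*(1-p) + (1-x_bar)*p, which is ~0.5 for both candidates
# when x comes from B(.5), so the candidates are indistinguishable this way.
x_bar = x.mean()
hamming = [x_bar * (1 - p) + (1 - x_bar) * p for p in (p1, p2)]

# Per-symbol negative log-likelihood L(x, B(p)) = -(1/n) sum log2 P(x_i | p):
# the smaller entry of v_x identifies the correct generator.
def neg_loglik_rate(x, p):
    return -np.mean(x * np.log2(p) + (1 - x) * np.log2(1 - p))

v_x = [neg_loglik_rate(x, p) for p in (p1, p2)]
print("expected Hamming / symbol:", hamming)   # ~[0.5, 0.5]
print("v_x (bits / symbol):      ", v_x)       # ~[1.0, 1.32]: B(.5) wins
```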
PFSA are semantically succinct, can model discrete-valued stochastic processes of any finite Markov order, and can approximate arbitrary Hidden Markov Models (HMMs) [76]. Importantly, PFSA model finite-valued processes taking values in a finite, pre-specified alphabet. Thus, continuous or integer-valued inputs must be quantized, in a manner described later. In the context of the above discussion, we define the dissimilarity Θ between observed sequences x, y in terms of their log-likelihood profiles against a set G = {G_1, …, G_n} of pre-specified PFSA generators on the same alphabet:

Θ(x, y) = ‖ (L(x, G_1), …, L(x, G_n)) − (L(y, G_1), …, L(y, G_n)) ‖   (5)

Using PFSAs for our base models implies that this measure is easily computable via multiple applications of Algorithm A (see S1 Text for pseudocode). In our approach, we use the set of four PFSA models shown in S5 Fig as G. Using a different set of models, which generate processes that are sufficiently pairwise distinct, does not significantly alter our results. These particular "base" models are chosen randomly from all possible PFSAs (See next section) with a maximum of four states. For a finite number of base models, Eq (5) does not technically yield a metric. However, one can approach a metric by increasing the number of models included in the base set. S5 Fig illustrates a comparison of this approach of comparing time series with the state-of-the-art dynamic time warping (DTW) algorithm. In particular, our approach is significantly faster, yet produces a higher separation ratio (the ratio of the mean distance between clusters computed by the two algorithms) on the University of California Riverside (UCR) time-series classification archive [79].

Definition 1 (PFSA). A probabilistic finite-state automaton G is a quadruple (Q, Σ, δ, π̃), where Q is a finite set of states, Σ is a finite alphabet, δ: Q × Σ → Q is called the transition map, and π̃: Q × Σ → [0, 1] specifies the observation probabilities, with ∑_{σ∈Σ} π̃(q, σ) = 1 for all q ∈ Q. We use lower-case Greek letters (e.g. σ or τ) for symbols in Σ and lower-case Latin letters (e.g. x or y) to denote sequences of symbols, with the empty sequence denoted by λ. The length of a sequence x is denoted by |x|. The set of sequences of length d is denoted by Σ^d. The directed graph (not necessarily simple, with possible loops and multi-edges) with vertices in Q and edges specified by δ is called the graph of the PFSA and, unless stated otherwise, is assumed to be strongly connected [80]. Definition 2 (Observation and Transition Matrices). Given a PFSA (Q, Σ, δ, π̃), the observation matrix Π̃_G is the |Q| × |Σ| matrix with (q, σ)-entry given by π̃(q, σ), and the transition matrix Π_G is the |Q| × |Q| matrix with (q, q′)-entry, written π(q, q′), given by π(q, q′) = ∑_{σ: δ(q,σ)=q′} π̃(q, σ). Both Π_G and Π̃_G are stochastic, i.e. non-negative with rows summing to 1. Since the graph of a PFSA is strongly connected, there is a unique probability vector p_G that satisfies p_G^T Π_G = p_G^T [81], called the stationary distribution of G. Definition 3 (Symbol-specific Transition Matrices). For σ ∈ Σ, Γ_σ is the |Q| × |Q| matrix with (q, q′)-entry equal to π̃(q, σ) if δ(q, σ) = q′, and 0 otherwise. We extend the definition of Γ to Σ* by Γ_x = ∏_{i=1}^{n} Γ_{σ_i} for x = σ_1…σ_n, with Γ_λ = I, the identity matrix. Definition 4 (Sequence-Induced Distributions). For a PFSA G = (Q, Σ, δ, π̃), the distribution on Q induced by a sequence x is given by p_G(x)^T = ⟦p_G^T Γ_x⟧, where ⟦v⟧ = v / ‖v‖_1. Definition 5 (Stochastic Process Generated by a PFSA). Let G = (Q, Σ, δ, π̃) be a PFSA; the Σ-valued stochastic process {X_t} generated by G satisfies that X_1 follows the distribution p_G^T Π̃_G, and X_{t+1} follows the distribution p_G(X_1 ⋯ X_t)^T Π̃_G for t ∈ ℕ.
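The following is a minimal, independent re-implementation of PFSA log-likelihood evaluation from Definitions 1-5 (it is not the authors' Algorithm A, only a sketch in its spirit): the state distribution is initialized at the stationary distribution, propagated with the symbol-specific matrices Γ_σ, and the per-symbol emission probabilities are accumulated.

```python
import numpy as np

class PFSA:
    """Probabilistic finite-state automaton (Q, Sigma, delta, pi_tilde)."""

    def __init__(self, delta, pi_tilde):
        # delta[q][s]    -> next state reached from state q on symbol s
        # pi_tilde[q][s] -> probability of emitting symbol s from state q
        self.delta = np.asarray(delta)              # shape (|Q|, |Sigma|)
        self.pi_tilde = np.asarray(pi_tilde)        # rows sum to 1
        self.nq, self.ns = self.pi_tilde.shape

    def gamma(self, s):
        """Symbol-specific matrix: Gamma_s[q, q'] = pi_tilde(q, s) if delta(q, s) = q'."""
        g = np.zeros((self.nq, self.nq))
        g[np.arange(self.nq), self.delta[:, s]] = self.pi_tilde[:, s]
        return g

    def stationary(self):
        """Stationary distribution p_G of the transition matrix Pi_G."""
        Pi = sum(self.gamma(s) for s in range(self.ns))
        vals, vecs = np.linalg.eig(Pi.T)
        v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
        return v / v.sum()

    def neg_loglik_rate(self, x):
        """L(x, G) = -(1/|x|) log2 p_G(x), via propagation of the state belief."""
        belief = self.stationary()
        total = 0.0
        for s in x:
            p_sym = float(belief @ self.pi_tilde[:, s])   # P(next symbol = s)
            total += np.log2(p_sym)
            belief = belief @ self.gamma(s)
            belief /= belief.sum()                        # renormalize, as in Def. 4
        return -total / len(x)
```

For instance, PFSA(delta=[[0, 1], [1, 0]], pi_tilde=[[0.9, 0.1], [0.2, 0.8]]) defines a two-state, binary-alphabet machine whose neg_loglik_rate can score any quantized sequence.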
We denote the probability of a PFSA G producing a sequence x by p_G(x). A key step in our approach is the abductive inference of a PFSA [82] from the quantized incidence time series. Importantly, we do not specify the number of states or the transition structure of the model; both the transition map and the observation probabilities are inferred from the observed data streams. A single PFSA modeling the incidence dynamics in high-risk counties is obtained in step 5) of the procedure outlined in the section "Calculation of UnIT Risk" below. Importantly, we have a data stream for each county inferred to have a high initiation risk (Step 3). Thus, we infer a single PFSA from multiple data streams, where each data stream is assumed to be a sample path generated by a similar underlying process. The inference algorithm cited above is designed to take advantage of multiple data-stream inputs to identify a common model.

Definition 6 (Entropy Rate and KL Divergence). The entropy rate of a PFSA G is the entropy rate of the stochastic process G generates [83]. Similarly, the KL divergence of a PFSA G′ from the PFSA G is the KL divergence of the process generated by G′ from that generated by G. More precisely, we have the entropy rate H(G) = lim_{d→∞} (1/d) ∑_{x∈Σ^d} −p_G(x) log p_G(x), and the KL divergence D(G ‖ G′) = lim_{d→∞} (1/d) ∑_{x∈Σ^d} p_G(x) log ( p_G(x) / p_{G′}(x) ), whenever the limits exist. We also refer to the KL divergence between stochastic processes as the Sequence Likelihood Divergence (SLD). Definition 7 (Log-likelihood). The log-likelihood [83] of a PFSA G generating x ∈ Σ^d is given by L(x, G) = −(1/d) log p_G(x). Algorithm A in S1 Text outlines the steps in computing L(x, G). The time complexity of log-likelihood evaluation is O(|x| × |Q|), with |x| the input length and |Q| the size of the PFSA state set. Theorem 1 (Convergence of Log-likelihood). Let G and G′ be two irreducible PFSA, and let x ∈ Σ^d be a sequence generated by G. Then we have L(x, G′) → H(G) + D(G ‖ G′) in probability as d → ∞. Proof. See proof of convergence in S1 Text.

Let D be the pair-wise distance matrix with d_ij = Θ(s_i, s_j), where s_i is the flu time series of county c_i. Then the affinity matrix A for spectral clustering is chosen as a_ij = exp(−d_ij²/2). We include five demographic independent variables: 1) total population, 2) percent of the total population aged 65+, 3) percent of Hispanics in the total population, 4) percent of black/African-American in the total population, 5) percent of minority groups in the total population. For socioeconomic factors, we consider: 1) percent of the total population in poverty and 2) median household income, which are also obtained from the US Census Bureau, based on the 2010 US decennial census. This data is publicly available. The source of incidence counts for the seasonal flu epidemic is the Truven MarketScan database [55]. This US national database, collating data contributed by over 150 insurance carriers and large, self-insuring companies, contains over 4.6 billion inpatient and outpatient service claims, with over six billion diagnostic codes. We processed the Truven database to obtain the reported weekly number of influenza cases over a period of 471 weeks spanning from January 2003 to December 2011, at the spatial resolution of US counties. Standard ICD9 diagnostic codes corresponding to Influenza infection are used to determine the county-specific incidence time series, which are: 1) 487 Influenza, 2) 487.0 Influenza with pneumonia, 3) 487.1 Influenza with other respiratory manifestations, and 4) 487.8 Influenza with other manifestations. The Truven dataset is a third-party dataset, which the authors are not authorized to distribute publicly; it can be procured by interested researchers, under license, from https://www.ibm.com/watsonhealth/about/truven-health-analytics. Generated models are publicly available at https://github.com/zeroknowledgediscovery/unitcov, which includes the complete forecast software (DOI: 10.5281/zenodo.5361628).
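To make the clustering step concrete, the sketch below turns a precomputed pairwise distance matrix D (with entries d_ij = Θ(s_i, s_j)) into the Gaussian affinity a_ij = exp(−d_ij²/2) defined above and runs spectral clustering with four clusters; the function and variable names are placeholders, and the distance matrix itself is assumed to be available from the dissimilarity computation of Eq (5).

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_counties(D, n_clusters=4, seed=0):
    """Cluster counties from a pairwise SLD-based distance matrix D (n x n)."""
    A = np.exp(-np.square(D) / 2.0)              # affinity a_ij = exp(-d_ij^2 / 2)
    sc = SpectralClustering(n_clusters=n_clusters,
                            affinity="precomputed",
                            assign_labels="kmeans",
                            random_state=seed)
    return sc.fit_predict(A)                     # one cluster label per county
```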
Integer-valued incidence input is quantized to produce data streams over a finite alphabet, by choosing k − 1 cut-off points p_1 < p_2 < ⋯ < p_{k−1} and replacing each value with the index of the interval, among the k intervals defined by these cut-off points, into which it falls.
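A minimal sketch of this quantization step; choosing the cut-off points as empirical quantiles of the series is an assumption made here for illustration, not a prescription taken from the text.

```python
import numpy as np

def quantize(series, k=4):
    """Map an integer-valued incidence series onto a finite alphabet {0, ..., k-1}.

    The k-1 cut-off points p_1 < ... < p_{k-1} are taken here as empirical
    quantiles of the series; each value is replaced by the index of the
    interval into which it falls.
    """
    series = np.asarray(series, dtype=float)
    cuts = np.quantile(series, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(series, cuts)             # symbols in {0, ..., k-1}

# Example: a toy weekly count series quantized to a 4-letter alphabet.
symbols = quantize([0, 0, 1, 3, 7, 12, 5, 2, 0, 0], k=4)
```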