key: cord-1048483-v78octh6
authors: Zhao, Shi
title: To avoid the noncausal association between environmental factor and COVID-19 when using aggregated data: Simulation-based counterexamples for demonstration
date: 2020-08-09
journal: Sci Total Environ
DOI: 10.1016/j.scitotenv.2020.141590
sha: 37d2ba651a5fd3452ef181459a48fa1fe0c1e4ca
doc_id: 1048483
cord_uid: v78octh6

Abstract
In infectious disease epidemiology, an association between an independent factor and disease incidence (or death) counts may fail to infer the association with disease transmission (or mortality risk). To explore the underlying role of environmental factors in the course of the COVID-19 epidemic, this study highlights the importance of following the definitions of epidemiological metrics and systematic analytical procedures. Caution is needed when interpreting associations based on aggregated data, and overinterpretation should be avoided.

Identifying environmental factors, e.g., meteorological factors and air pollutants, that affect the transmission and mortality risk of COVID-19 is important for understanding the features of the ongoing COVID-19 pandemic (1, 2). Pani et al. found associations between several meteorological factors and COVID-19 incidence under a time series study design, and they implied similar associations with COVID-19 transmission (3). Frontera et al. found positive associations between the levels of several air pollutants and COVID-19 cases and deaths under a cross-region ecological study design (4). They concluded that "air pollution may have a strong impact on the high rate of infection and mortality". Similar analytical approaches and main findings appear in other recent studies (5-7).
In infectious disease epidemiology, the number of incident cases is driven by the disease's transmission process, which is strongly determined by (i) the transmissibility and (ii) the number of seed cases in the latest few days (8-16). Thus, autocorrelation is highly likely to undermine statistical inference about the role of independent factors when the aggregated time series of COVID-19 incidence is used directly. Time series and ecological studies are two commonly used study designs for exploring the relationship between independent factors and the course of a disease, e.g., (3, 4). Both of these classic study designs use aggregated (rather than individual-level) data. Hence, caution is needed when interpreting associations based on aggregated data, and overinterpretation should be avoided. In this study, simple simulation-based counterexamples demonstrate that the association between an independent factor and disease incidence (or death) counts fails to infer the association with disease transmission (or mortality risk). Existing analytical approaches that address this inferential failure are also discussed.

To mimic the real-world growth of the COVID-19 epidemic, the simple modelling framework in (17) is adopted with the mean serial interval, τ, set at 5 days, following previous studies (18-22), and the population size, N, set at 10 million. Then, for the t-th day since the first COVID-19 case, the daily number of new cases is $c_t = c_{t-1} \cdot r_t^{1/\tau}$, where $c_0 = 1$ is the first seed case at the start of the outbreak. Here, $r_t$ denotes the effective reproduction number, and thus $r_t$ [...]. $R_t$ is the reproduction number, a well-accepted metric quantifying the instantaneous transmissibility of an infectious disease (15). $C_t$ is the cumulative number of cases on the t-th day, and obviously $C_t \ge C_{t-1} > 0$.
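As a minimal sketch of the growth recursion described above, the following Python function iterates $c_t = c_{t-1} \cdot r_t^{1/\tau}$ from the seed case $c_0 = 1$. The expression linking $r_t$ to $R_t$ is not recoverable from the text here, so the sketch assumes the common susceptible-depletion form $r_t = R_t (1 - C_{t-1}/N)$. That form, the function name `simulate_cases`, and the argument name `R_series` are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def simulate_cases(R_series, tau=5.0, N=10_000_000, c0=1.0):
    """Iterate c_t = c_{t-1} * r_t**(1/tau), starting from the seed case c_0.

    ASSUMPTION (not stated in the text): r_t = R_t * (1 - C_{t-1} / N),
    i.e. R_t scaled by the fraction of the population not yet infected.
    """
    R_series = np.asarray(R_series, dtype=float)
    daily = np.empty(len(R_series) + 1)        # c_0 .. c_T
    cumulative = np.empty(len(R_series) + 1)   # C_0 .. C_T
    daily[0] = cumulative[0] = c0
    for t in range(1, len(daily)):
        r_t = R_series[t - 1] * (1.0 - cumulative[t - 1] / N)
        r_t = max(r_t, 0.0)                    # guard against C_{t-1} > N
        daily[t] = daily[t - 1] * r_t ** (1.0 / tau)
        cumulative[t] = cumulative[t - 1] + daily[t]
    return daily, cumulative

# Example: a constant R_t = 2 over 30 days with tau = 5 days.
daily, cumulative = simulate_cases(np.full(30, 2.0))
```

With a constant $R_t = 2$ and τ = 5, the daily count roughly doubles every five days while cumulative cases remain far below N, which is the regime the counterexamples operate in.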
The values of the $R_t$ time series are directly assumed in our counterexamples.

In this counterexample, the time series study design as in (3) is considered. An independent factor X is considered as a determinant of the reproduction number (R) of COVID-19. Three predefined scenarios are considered:
- scenario (I): the correlation between X and R is negative;
- scenario (II): the correlation between X and R is positive; and
- scenario (III): the correlation between X and R is zero.
In Fig 1A, the factor X time series is generated as an arithmetic sequence with additive random noise terms. For COVID-19, R largely ranges from 1.5 to 3 (11, 13, 14, 17, 19, 21, 23, 24), and the range [...]
- the association between X and R, and
- the association between X and the number of COVID-19 cases, i.e., daily or cumulative.
The inconsistency of these two associations indicates an inferential failure: the association between factor X and the number of COVID-19 cases fails to infer the underlying true relationship between X and COVID-19 transmissibility.

In this counterexample, the same cross-region ecological study design as in (4) is considered [...]
- the predefined negative association between X and CFR, and
- the association between X and the cumulative number of COVID-19 deaths.
In the same fashion, the inconsistency of these two associations indicates an inferential failure: the association between factor X and the number of COVID-19 deaths fails to infer the underlying true relationship between X and COVID-19 mortality risk.

For counterexample #1, Fig 1E shows the relationship between factor X and the three types of R series, which matches scenarios (I)-(III) introduced above. However, in Fig 1F and 1G, factor X is positively associated with the daily or cumulative number of COVID-19 cases in all scenarios. Therefore, the proposed inferential failure is clearly demonstrated.
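The time-series counterexample can be reproduced in outline with a few lines of Python. The sketch below is not the author's code. It assumes a 40-day horizon, a linearly increasing X with Gaussian noise, and R values confined to [1.5, 3] that fall with X, rise with X, or are drawn independently of X. It also ignores susceptible depletion, which seems reasonable while cumulative cases stay far below N. The variable names, noise scale, and random seed are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2020)  # arbitrary seed, for reproducibility only
tau, days = 5.0, 40                # tau from the text; the horizon is assumed

# Factor X: arithmetic sequence plus additive noise (assumed scale).
x = np.linspace(10.0, 30.0, days) + rng.normal(0.0, 1.0, days)

# Map X onto R within [1.5, 3] for the three predefined scenarios.
x_scaled = (x - x.min()) / (x.max() - x.min())
scenarios = {
    "I (negative)": 3.0 - 1.5 * x_scaled,
    "II (positive)": 1.5 + 1.5 * x_scaled,
    "III (zero)": rng.uniform(1.5, 3.0, days),  # drawn independently of X
}

def daily_cases(R, tau):
    # c_t = c_{t-1} * R_t**(1/tau) with c_0 = 1; depletion ignored (assumption)
    return np.cumprod(R ** (1.0 / tau))

results = {}
for name, R in scenarios.items():
    c = daily_cases(R, tau)
    results[name] = (
        np.corrcoef(x, R)[0, 1],             # X versus R
        np.corrcoef(x, c)[0, 1],             # X versus daily cases
        np.corrcoef(x, np.cumsum(c))[0, 1],  # X versus cumulative cases
    )
    print(name, [round(v, 2) for v in results[name]])
```

Because R stays above 1 in every scenario, case counts keep growing over time, and any factor that trends upward over the same period correlates positively with them regardless of its relationship with R.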
Furthermore, as a remark, even though this counterexample presents only scenarios with R > 1 for demonstration, the conclusions may still hold when R < 1 as well as in more complex contexts.

For counterexample #2, the predefined negative correlation between factor X and CFR is shown in Fig 1C. However, the association between X and the cumulative number of deaths is clearly positive (see Fig 1D), which is opposite to the association in Fig 1C. Thus, the positive association between factor X and the cumulative number of deaths fails to infer the underlying true relationship between X and the mortality risk of COVID-19 (in terms of CFR). A similar inferential failure also occurs when using the daily number of deaths (data not shown). Furthermore, as a remark, even though this counterexample presents the situation in which CFR is negatively correlated with R for demonstration, similar kinds of inferential failure may still occur in alternative situations and in even more complex contexts.

The epidemic curve of an infectious disease is driven by its transmission process, which can be measured by several metrics including R. The daily number of cases is strongly determined by the strength of transmissibility and by the number of seed cases in the latest few days, with the relevant time window determined by the serial interval. Hence, autocorrelation is highly likely to undermine statistical inference about transmission-driven factors when the case time series is used directly, i.e., without first quantifying the disease transmissibility. Caution is needed to avoid this kind of noncausal association between independent factors and COVID-19 incidence. It is important to transform the incidence data to transmissibility using plausible analytical approaches (9, 28-30), and then to check the association between the transmission rate and external factors, e.g., (2, 31).
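To illustrate the transformation step, the sketch below inverts the growth recursion used earlier, $\hat{r}_t = (c_t / c_{t-1})^{\tau}$, to back out the effective reproduction number from a daily case series, and only then correlates it with X. This is a deliberately simple stand-in for the estimation frameworks cited above, not a substitute for them. The function name `estimate_r`, the synthetic data, and the assumption of a known, fixed serial interval τ are all illustrative.

```python
import numpy as np

def estimate_r(daily_cases, tau=5.0):
    """Back out r_t = (c_t / c_{t-1})**tau from consecutive daily counts.

    Assumes strictly positive counts and a known, fixed serial interval tau.
    """
    c = np.asarray(daily_cases, dtype=float)
    if np.any(c <= 0):
        raise ValueError("daily counts must be positive")
    return (c[1:] / c[:-1]) ** tau

# Synthetic check: X rises over time while R falls with X (as in scenario I).
rng = np.random.default_rng(1)     # arbitrary seed
days, tau = 40, 5.0
x = np.linspace(10.0, 30.0, days) + rng.normal(0.0, 1.0, days)
R_true = 3.0 - 1.5 * (x - x.min()) / (x.max() - x.min())
cases = np.concatenate(([1.0], np.cumprod(R_true ** (1.0 / tau))))  # c_0 = 1

r_hat = estimate_r(cases, tau)
corr_cases = np.corrcoef(x, cases[1:])[0, 1]  # driven by epidemic growth
corr_r = np.corrcoef(x, r_hat)[0, 1]          # reflects the X-R relationship
print(round(corr_cases, 2), round(corr_r, 2))
```

On this noise-free synthetic series the ratio estimator recovers R exactly. Real surveillance counts are noisy and subject to reporting delays, which is why the cited frameworks are preferable in practice.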
The mortality risk of COVID-19 is reflected by both the number of disease-induced deaths and the number of COVID-19 cases. Thus, the COVID-19 CFR is important for properly quantifying the mortality risk. It is important to transform the case and death data to CFR using the analytical frameworks in previous studies (25-28, 32, 33). One may then examine the association between CFR and external factors directly, e.g., (34, 35).

Recently, an increasing number of studies present 'associations' between environmental factors and COVID-19 transmission, which is mainly due to two reasons. They include [...]

The role of the environment and its pollution in the prevalence of COVID-19
The Ambient Ozone and COVID-19 Transmissibility in China: A Data-Driven Ecological Study of 154 Cities
Association of COVID-19 pandemic with meteorological parameters over Singapore
Severe air pollution links to higher mortality in COVID-19 patients: the "double-hit" hypothesis
Impact of weather on COVID-19 pandemic in Turkey
Demystifying a Possible Relationship between COVID-19, Air Quality and Meteorological Factors: Evidence from Kuala Lumpur
Asymmetric nexus between temperature and COVID-19 in the top ten affected provinces of China: A current application of quantile-on-quantile approach
Dynamically modeling SARS and other newly emerging respiratory illnesses: past, present, and future
A new framework and software to estimate time-varying reproduction numbers during epidemics
A conceptual model for the coronavirus disease 2019 (COVID-19) outbreak in Wuhan, China with individual reaction and governmental action
Pattern of early human-to-human transmission of Wuhan
How generation intervals shape the relationship between growth rates and reproductive numbers
Estimating the serial interval of the novel coronavirus disease (COVID-19) based on the public surveillance data in Shenzhen
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet
Modelling the effective reproduction number of vector-borne diseases: the yellow fever outbreak in Luanda
Imitation dynamics in the mitigation of the novel coronavirus disease (COVID-19) outbreak in Wuhan
Epidemic Growth, and Reproduction Numbers for the 2019 Novel Coronavirus (2019-nCoV) Epidemic. Annals of Internal Medicine
Serial interval in determining the estimation of reproduction number of the novel coronavirus disease (COVID-19) during the early outbreak
Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data
Temporal dynamics in viral shedding and transmissibility of COVID-19
Estimating the time interval between transmission generations when negative values occur in the serial interval data: using COVID-19 as an example
Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
Estimating the Unreported Number of Novel Coronavirus (2019-nCoV) Cases in China in the First Half of January 2020: A Data-Driven Modelling Analysis of the Early Outbreak
Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet infectious diseases
Estimating the infection and case fatality ratio for coronavirus disease (COVID-19) using age-adjusted data from the outbreak on the Diamond Princess cruise ship
Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China
First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment. The Lancet
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
Ambient ozone and influenza transmissibility in Hong Kong
No Association of COVID-19 transmission with temperature or UV radiation in Chinese cities
Early estimation of the case fatality rate of COVID-19 in mainland China: a data-driven analysis
Real-Time Estimation of the Risk of Death from Novel Coronavirus (COVID-19) Infection: Inference Using Exported Cases
Temporal Association between Particulate Matter Pollution and Case Fatality Rate of COVID-19 in Wuhan
Association of particulate matter pollution and case fatality rate of COVID-19 in 49 Chinese cities
Environmental factors affecting the transmission of respiratory viruses
Transmission routes of respiratory viruses among humans
A climatologic investigation of the SARS-CoV outbreak in Beijing, China
The Effects of Temperature and Relative Humidity on the Viability of the SARS Coronavirus
Stability of Middle East respiratory syndrome coronavirus (MERS-CoV) under different environmental conditions
Urbanization and humidity shape the intensity of influenza epidemics in U

The authors declare no competing interests.