key: cord-302056-wvf6cpib authors: Benatia, D.; Godefroy, R.; Lewis, J. title: Estimating COVID-19 Prevalence in the United States: A Sample Selection Model Approach date: 2020-04-30 journal: nan DOI: 10.1101/2020.04.20.20072942 sha: doc_id: 302056 cord_uid: wvf6cpib

Background: Public health efforts to determine population infection rates from coronavirus disease 2019 (COVID-19) have been hampered by limitations in testing capabilities and the large shares of mild and asymptomatic cases. We developed a methodology that corrects observed positive test rates for non-random sampling to estimate population infection rates across U.S. states from March 31 to April 7.

Methods: We adapted a sample selection model that corrects for non-random testing to estimate population infection rates. The methodology compares how the observed positive case rate varies with changes in the size of the tested population, and applies this gradient to infer total population infection rates. Model identification requires that variation in testing rates be uncorrelated with changes in underlying disease prevalence. To this end, we relied on data on day-to-day changes in completed tests across U.S. states for the period March 31 to April 7, which were primarily influenced by immediate supply-side constraints. We used this methodology to construct predicted infection rates for each state over the sample period. We also assessed the sensitivity of the results to controls for state-specific daily trends in infection rates.

Results: The median population infection rate over the period March 31 to April 7 was 0.9% (IQR 0.64-1.77). The three states with the highest prevalence over the sample period were New York (8.5%), New Jersey (7.6%), and Louisiana (6.7%). Estimates from models that control for state-specific daily trends in infection rates were virtually identical to the baseline findings. The estimates imply a nationwide average of 12 population infections per diagnosed case.
We found a negative bivariate relationship (corr. = -0.51) between total per capita state testing and the ratio of population infections per diagnosed case.

Interpretation: The effectiveness of the public health response to the coronavirus pandemic will depend on timely information on infection rates across different regions. With increasingly available high frequency data on COVID-19 testing, our methodology could be used to estimate population infection rates for a range of countries and subnational districts. In the United States, we found widespread undiagnosed COVID-19 infection. Expansion of rapid diagnostic and serological testing will be critical in preventing recurrent unobserved community transmission and identifying the large numbers of individuals who may have some level of viral immunity.

Funding: Social Sciences and Humanities Research Council.

This version was posted April 30, 2020 (medRxiv preprint, https://doi.org/10.1101/2020.04.20.20072942). The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

In December 2019, several clusters of pneumonia cases were reported in the Chinese city of Wuhan. By early January, Chinese scientists had isolated a novel coronavirus (SARS-CoV-2), the cause of the illness later named coronavirus disease 2019 (COVID-19), for which a laboratory test was quickly developed. Despite efforts at containment through travel restrictions, the virus spread rapidly beyond mainland China. By April 7, more than 1.4 million cases had been reported in 182 countries and regions.

Our understanding of the progression and severity of the outbreak has been limited by constraints on testing capabilities. In most countries, testing has been limited to a small fraction of the population. As a result, the number of confirmed positive cases may grossly understate the population infection rate, given the large numbers of mild and asymptomatic cases that may go untested [1-5]. Moreover, testing has often been targeted to specific subgroups, such as individuals who were symptomatic or who were previously exposed to the virus, whose infection probability differs from that in the overall population [6, 7] (footnote 1). Given this sample selection bias, it is impossible to infer overall disease prevalence from the share of positive cases among the tested individuals.

A further challenge to our understanding of the spread of the outbreak has been the wide variation in per capita testing across jurisdictions due to different protocols and testing capabilities. For example, as of April 7, South Korea had conducted three times more tests than the United States on a per capita basis [8, 9]. Large differences in testing rates also exist at the subnational level. For example, per capita testing in the state of New York was nearly two times higher than in neighboring New Jersey [8]. Because the severity of sample selection bias depends on the extent of testing, these disparities create large uncertainty regarding the relative disease prevalence across jurisdictions, and may contribute to the wide differences in estimated case fatality rates [10, 11].

In this study, we implemented a procedure that corrects observed infection rates among tested individuals for non-random sampling to calculate population disease prevalence.
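The selection problem described above can be made concrete with a small numerical sketch. The population size, the 1% infection rate, and the two testing probabilities below are hypothetical values chosen for illustration, not figures from this study, and the sketch assumes a perfectly accurate test.

```python
def prevalence(sick: int, healthy: int) -> float:
    """Share of the whole population that is infected."""
    return sick / (sick + healthy)


def positive_share(sick: int, healthy: int, p_sick: float, p_healthy: float) -> float:
    """Expected share of positive tests when a sick person is tested with
    probability p_sick and a healthy person with probability p_healthy."""
    positives = p_sick * sick          # expected number of positive tests
    negatives = p_healthy * healthy    # expected number of negative tests
    return positives / (positives + negatives)


# Hypothetical population with 1% of individuals infected.
SICK, HEALTHY = 10_000, 990_000

# Random testing: sick and healthy individuals are equally likely to be tested.
random_testing = positive_share(SICK, HEALTHY, p_sick=0.01, p_healthy=0.01)

# Targeted testing: sick individuals are far more likely to be tested.
targeted_testing = positive_share(SICK, HEALTHY, p_sick=0.10, p_healthy=0.005)

print(f"true prevalence:          {prevalence(SICK, HEALTHY):.3f}")
print(f"positive share, random:   {random_testing:.3f}")
print(f"positive share, targeted: {targeted_testing:.3f}")
```

Under random testing the positive share matches the 1% prevalence, whereas under the targeted scheme assumed here it is roughly 17%, which is why the raw positive test rate cannot be read as a population infection rate.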
A large body of empirical work in economics has been devoted to the problem of sample selection, and researchers have developed estimation procedures to correct for non-random sampling [12-17]. Our methodology builds on these insights to correct observed infection rates for non-random selection into COVID-19 testing. Our procedure compares how the observed infection rate varied as a larger share of the population was tested, and uses this gradient to infer disease prevalence in the overall population. Because investments in testing capacity may respond endogenously to local disease conditions, however, model identification requires that we find a source of variation in COVID-19 testing rates that is unrelated to the underlying population prevalence. To this end, we relied on high frequency day-to-day changes in completed tests across U.S. states, which were primarily driven by immediate supply-side limitations rather than the more gradual evolution of local disease prevalence. We used this procedure to correct for selection bias in observed infection rates to calculate population disease prevalence across U.S. states from March 31 to April 7.

Footnote 1: Notable exceptions include the universal testing of passengers on the Diamond Princess cruise ship, and an ongoing population-based test project in Iceland.

To evaluate population disease prevalence, we developed a simple selection model for COVID-19 testing and used the framework to link observed rates of positive tests to population disease prevalence. We considered a stable population, denoting $A$ and $B$ as the numbers of sick and healthy individuals, respectively. Let $p_n$ denote the probability that a sick person is tested and $q_n$ the probability that a healthy person is tested, given a total number of tests, $n$. Thus, we have:

$$n = p_n A + q_n B,$$

and the number of positive tests is:

$$p_n A.$$

This simple framework highlights how non-random testing will bias estimates of the population disease prevalence. Using Bayes' rule, we can write the relative probability of testing as the following:

$$\frac{q_n}{p_n} = \frac{\Pr(\text{sick} \mid n) / \Pr(\text{healthy} \mid n)}{\Pr(\text{sick} \mid \text{tested}, n) / \Pr(\text{healthy} \mid \text{tested}, n)},$$

which is equal to one if tests are randomly allocated, $\Pr(\text{sick} \mid \text{tested}, n) = \Pr(\text{sick} \mid n)$. When testing is targeted to individuals who are more likely to be sick, we have $\Pr(\text{sick} \mid \text{tested}, n) > \Pr(\text{sick} \mid n)$ and $\Pr(\text{healthy} \mid \text{tested}, n) < \Pr(\text{healthy} \mid n)$, so the ratio will fall between zero and one. In this scenario, the ratio of sick to healthy people in the sample, $p_n A / (q_n B)$, will exceed the ratio in the overall population, $A / B$.

We specified the following functional form for the relative probability of testing:

[equation not preserved in the source text]

which is in $[0, 1]$ for $-a - bn \le 0$. The term $e^{-a-bn} > 0$ reflects the fact that testing has been targeted towards higher risk populations, with the intercept, $-a$, capturing the severity of selection bias when testing is limited. Meanwhile, the coefficient $b > 0$ identifies how selection bias decreases with $n$ as the ratio $q_n / p_n$ approaches one. Intuitively, as testing expands, the sample will become more representative of the overall population, and the selection bias will diminish.

Combining both equations, we have:

[equation not preserved in the source text]

We used the fact that the ratio of negative to positive tests is much larger than one to make the following approximation:

[equation (2), not preserved in the source text]

Given a change in the number of tests conducted in a particular population, $n_1$ to $n_2$, equation (2) implies the following change in the share of positive tests:

[equation (3), not preserved in the source text]

Our empirical model was derived from equation (3). We used information on testing across states $i$ on day $t$ to estimate the following equation:

[equation (4), not preserved in the source text]

where $n_{i,t}$ is the number of tests on day $t$, $s_{i,t}$ is the share of positive tests, and $pop_i$ is the state population. The term $u_{i,t}$ is an error term, which we assumed to follow a Gaussian distribution with mean zero and unknown variance. We restricted the model to a cubic approximation of the function in equation (4), since higher order terms were found to be statistically insignificant. This approximation is supported by graphical evidence depicted below. We estimated equation (4) by maximum likelihood.

For model identification, we required that day-to-day changes in the number of tests be uncorrelated with the error term, $u_{i,t}$. In practice, this assumption implies that daily changes in underlying population disease prevalence cannot be systematically related to day-to-day changes in testing. Our identification assumption is supported by at least three pieces of evidence. First, severe constraints on state testing capacity have caused a significant backlog in cases, so that changes in the number of daily tests primarily reflect changes in local capacity rather than changes in demand for testing. Second, because our analysis focuses on high frequency day-to-day changes in outcomes, there is limited scope for large evolution in underlying disease prevalence. Finally, in robustness exercises, we augmented the basic model to include state fixed effects, thereby allowing for state-specific exponential growth in underlying disease prevalence from one day to the next.
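The estimation step described above (a cubic approximation, Gaussian errors, maximum likelihood) can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the authors' estimation code: the regressor `x`, the coefficient values, and the noise level are hypothetical placeholders, and the regression does not reproduce equation (4). It only shows that, with mean-zero Gaussian errors, the maximum likelihood coefficients of a model that is linear in its coefficients coincide with the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a hypothetical cubic relationship (placeholder values,
# not the paper's equation (4) or its estimates).
true_coefs = np.array([0.5, -1.0, 0.3, 0.2])  # intercept, x, x^2, x^3
x = rng.uniform(-1.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
y = X @ true_coefs + rng.normal(scale=0.05, size=x.size)


def gaussian_loglik(coefs: np.ndarray, sigma2: float) -> float:
    """Log-likelihood of y given X under i.i.d. N(0, sigma2) errors."""
    resid = y - X @ coefs
    n_obs = y.size
    return float(-0.5 * n_obs * np.log(2 * np.pi * sigma2)
                 - 0.5 * np.sum(resid**2) / sigma2)


# For any fixed variance, the log-likelihood is maximised by the coefficients
# that minimise the sum of squared residuals, i.e. the least-squares solution.
coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# ML estimate of the error variance: mean squared residual (divides by N).
sigma2_hat = float(np.mean((y - X @ coef_hat) ** 2))

print("estimated coefficients:", np.round(coef_hat, 3))
print("estimated error variance:", round(sigma2_hat, 5))
print("log-likelihood at the estimate:", round(gaussian_loglik(coef_hat, sigma2_hat), 2))
```

In a linear specification like this one, state fixed effects of the kind used in the robustness exercises would enter as one indicator column per state in `X`, in place of the common intercept.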
These additional controls did not alter the main empirical findings.

To recover estimates of population infection rates, $\hat{P}_{i,t}$, in state $i$ at date $t$, we combined the estimates from equation (4) and set $n = pop_i$ according to the following equation:

[equation not preserved in the source text]

We then used the Delta-method to estimate the confidence interval for $\hat{P}_{i,t}$.

The analysis was based on daily information on total test results (positive plus negative) and total positive test results across U.S. states for the period March 31 to April 7. These data were obtained from the COVID Tracking Project, a site that was launched by journalists from The Atlantic to publish high-quality data on the outbreak in the United States [8]. The data were originally compiled primarily from state public health authorities, occasionally supplemented by information from news reporting, official press conferences, or messages from officials released on Facebook or Twitter. We focused on the recent period to limit errors associated with previous changes in state reporting practices. We supplemented this information with data on total state population from the census [18].

[...] (4), estimated across states for the period March 31 to April 7. Because $\hat{\beta}$ is negative, the upward sloping pattern implies a negative relationship between daily changes in testing and the share of positive tests. A symptom of selection bias is that variables that have no structural relationship with the dependent variable may appear to be significant [13]. Thus, these patterns strongly suggest non-random testing, since daily changes in testing should be unrelated to population disease prevalence except through a selection channel.

Table 2 reports the results that adjust observed COVID-19 case rates for non-random testing based on the procedure described in Section 2. For reference, column (1) reports the observed positive test rate on April 7, 2020. Columns (2) and (3) [...] The average estimates are similar to the April 7 estimates, albeit generally smaller in magnitude, suggesting continued spread of the disease in many states.

In Table 3, we examined the robustness of the main estimates. To begin, we estimated modified versions of equation (4) that include state fixed effects. These models allow for an exponential trend in infection rates, thereby addressing concerns that underlying disease prevalence may evolve from one day to the next. We allowed each state to have its own specific intercept to capture the fact that the trends may differ depending on the local conditions. The results (reported in cols. 2 and 7) are virtually identical to the baseline estimates. Moreover, the augmented model tends to produce more precise confidence intervals.

We explored the sensitivity of the results to excluding days in which a large fraction of tests were positive. This specification addresses concerns that the functional form of the estimating equation may differ in settings in which the share of positive tests was large, due to the approximation in equation (2). We restricted the sample to observations in which fewer than 50% of tests were positive, and re-estimated equation (4). Table 3, cols. 5, 6, and 9 report the results.
Although the sample size is reduced, the predicted infection rates are similar in magnitude to the baseline estimates and have similar confidence intervals.

In Table 4, we explored the relationship between the number of diagnosed cases and total population COVID-19 infections implied by our estimation procedure. We compared the average population infection rates from March 31 to April 7 to the total number of diagnosed cases by April 12. Because many individuals may not seek testing until the onset of symptoms, the latter date was chosen to capture the virus's typical five-day incubation period [19, 20]. Column (1) reports the total diagnosed cases by April 12; column (2) reports the total number of COVID-19 cases implied by the estimates reported in Table 2 (col. 4); and column (3) presents the ratio of total cases to diagnosed cases.

The results reveal widespread undetected population infection. Nationwide, we found that for every identified case there were 12 total infections in the population. There were significant cross-state differences in these ratios. In New York, where more than two percent of the population had been tested, the ratio of total cases to positive diagnoses was 8.7, the lowest in the nation. Meanwhile, Oklahoma had the highest ratio in the country (19.4), and had tested less than 0.6 percent of its population. [...] response to geographic differences in pandemic severity. Instead, the patterns suggest that states that expanded testing capacity more broadly were better able to track population prevalence.

[...] and population prevalence. The similarity between these two series is notable, given that our estimates were derived from an entirely different source of variation from the cumulative case counts.
Nevertheless, observed case counts do not perfectly predict overall population prevalence. For example, despite similar rates of reported positive tests, Michigan had roughly twice as many per capita infections as Rhode Island. These differences can partly be explained by the fact that nearly two percent of the population in Rhode Island had been tested by April 12, whereas fewer than one percent had been tested in Michigan. Together, these findings suggest that differences in state-level policies towards COVID-19 testing may mask important differences in underlying disease prevalence.

The high proportion of asymptomatic and mild cases coupled with limitations in laboratory testing capacity has created large uncertainty regarding the extent of the COVID-19 outbreak among the general population. As a result, key elements of the virus's clinical and epidemiological characteristics remain poorly understood. This uncertainty has also created significant challenges for policymakers, who must trade off the potential benefits from non-pharmaceutical interventions aimed at curbing local transmission against their substantial economic and social costs.

A number of recent studies have sought to estimate COVID-19 disease prevalence and mortality in the United States and internationally [21-26]. One approach has [...] Other research has relied on Bayesian modelling to infer past disease prevalence from observed COVID-19 deaths, and applied SIR models to forecast current infection rates. This approach requires fewer assumptions regarding the underlying parameter values.
Nevertheless, because these models 'scale up' observed deaths to estimate population infections, small differences in the assumed case fatality rate will have substantial effects on the results. This poses a challenge for estimation, given that there is considerable uncertainty regarding the case fatality rate, which may vary widely across regions due to local demographics and environmental conditions [27-31]. Moreover, to the extent that there is significant undercounting in the number of COVID-19 related deaths [32, 33], these estimates may fail to capture the full extent of population infection.

In this paper, we developed a new methodology to estimate population disease prevalence when testing is non-random. Our approach builds on a standard econometric technique that has been used to address sample selection bias in a variety of different settings. Our estimation strategy offers several advantages over existing methods. First, the analysis has minimal data requirements. The three variables used for estimation (daily infections, daily number of tests, and total population) are widely reported across a large number of countries and subnational districts. Second, the model identification is transparent and depends only on a simple exclusion restriction assumption that daily changes in the number of conducted tests must be uncorrelated with underlying changes in population disease prevalence. This assumption is likely to hold in many jurisdictions where constraints on capacity are a primary determinant of [...]

We used this framework to estimate disease prevalence across U.S. states. We [...]

Our findings are comparable to previous studies on U.S. population prevalence that find ratios of population infection to positive tests ranging from 5 to 10 by mid-March [22, 25]. Despite a dramatic expansion in testing capacity in the intervening weeks, the vast majority of COVID-19 cases remain undetected. Our results are comparable to recent estimates of population prevalence in a number of European countries [21]. We found a nationwide 1.9 percent infection rate in early April, which is similar to the estimated prevalence in Austria (1.1%), Denmark (1.1%), and the United Kingdom (2.7%) as of March 28. Meanwhile, Germany's 0.7% infection rate would rank in the lowest tercile of prevalence among U.S. states. The highest rates of infection in New York (8.5%), New Jersey (7.6%), and Louisiana (6.7%) are still lower than the estimated rates in Italy (9.8%) and Spain (15%). Given the rapidly expanding availability of high frequency testing data at both the national and subnational level, in future research we plan to apply this methodology to compare infection rates across a broader spectrum of countries.

There are several limitations to our study, which should be taken into account when interpreting the main findings. First, the estimation results depend on several functional form assumptions, including a constant exponential growth rate in new infections and the specific functions governing how the number of available tests affects individual testing probability. As more data on testing become available, the increased sample sizes will allow future studies to impose weaker functional form assumptions through either semi- or non-parametric approaches. Second, our analysis required an assumption that the underlying sample selection process was similar across observations. To the extent that decisions regarding whom to test, conditional on the number of available tests, diverged across states or changed within states over the sample period, our model may be misspecified. Finally, our analysis depends on the quality of diagnostic testing, and systematic false negative test results may affect the population disease prevalence estimates [34-36] (footnote 4).

As countries continue to struggle against the ongoing coronavirus pandemic, informed policymaking will depend crucially on timely information on infection rates across different regions. Randomized population-based testing can provide this information; however, given the constraints on supplies, this approach has largely been eschewed in favor of targeted testing towards high risk groups. In this paper, we developed a new approach to estimate population disease prevalence when testing is non-random. The estimation procedure is straightforward, has few data requirements, and can be used to estimate disease prevalence at various jurisdictional levels.

Contributions: DB, RG, and JL conceptualized the study, analyzed the data, and drafted and finalized the manuscript. All authors approved of the final version of the manuscript.

We declare no competing interests.

This study was supported by funding from the Social Sciences and Humanities Research Council (Grant: SSHRC 430-2017-00307).

Footnote 4: Provided that the rates of misdiagnosis were unrelated to the number of tests, these errors will not bias the coefficient estimates, but may reduce precision through classical measurement error [37].
Notes: Columns (1) to (6) report the estimates and heteroskedasticity robust 95% confidence intervals for population prevalence of COVID-19 on April 7 based on the methodology described in Section 2. Columns (7) to (9) report the average estimates for population prevalence of COVID-19 from March 31 to April 7. Columns (3), (4), and (8) report results based on models that include state fixed effects. Columns (5), (6), and (9) report results based on models that restrict the sample to observations for which the share of positive cases was less than 0.5. In cases of incomplete testing data on April 7, population prevalence is reported for the closest day: * indicates prevalence on April 6, ** indicates prevalence on April 5, and *** indicates prevalence on March 31.

References:
Epidemiological Characteristics of 2143 Pediatric Patients with 2019 Coronavirus Disease in China
SARS-CoV-2 Infection in Children
Evidence of SARS-CoV-2 Infection in Returning Travelers from Wuhan, China
Asymptomatic Cases in a Family Cluster with SARS-CoV-2 Infection. The Lancet Infectious Diseases
Presumed Asymptomatic Carrier Transmission of COVID-19
Therapeutic and Triage Strategies for 2019 Novel Coronavirus Disease in Fever Clinics
Centers for Disease Control and Prevention: Coronavirus (COVID-19
The COVID Tracking Project
The Many Estimates of the COVID-19 Case Fatality Rate. The Lancet Infectious Diseases
Coronavirus COVID-19 Global Cases
The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models
The Economics and Econometrics of Active Labor Market Programs
Evaluation Methods for Non-experimental Data
Nonparametric Estimation of Sample Selection Models. The Review of Economic Studies
Two-Step Series Estimation of Sample Selection Models
Annual Estimates of the Resident Population for the United States, Regions, States
Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
The Incubation Period of Coronavirus Disease 2019 (COVID-19) from Publicly Reported Confirmed Cases: Estimation and Application
Impacts of Non-pharmaceutical Interventions to Reduce COVID-19 Mortality and Healthcare Demand. London: Imperial College COVID-19 Response Team
Estimating Unobserved SARS-CoV-2 Infections in the United States. medRxiv Working Paper
Substantial Undocumented Infection Facilitates the Rapid Dissemination of Novel Coronavirus (SARS-Cov2)
Adjusting Age-Specific Case Fatality Rates during the COVID-19 Epidemic in Hubei, China
Estimating SARS-CoV-2 Positive Americans using Deaths-only Data
Estimation of SARS-CoV-2 Mortality during the Early Stages of an Epidemic: A Modelling Study in Hubei, China and Northern Italy
The Effects of Outdoor Air Pollution Concentrations and Lockdowns on COVID-19 Infections in Wuhan and Other Provincial Capitals in China
Exposure to Air Pollution and COVID-19 Mortality in the United States
Pollution, Infectious Disease, and Mortality: Evidence from the 1918 Spanish Influenza Pandemic
What Explains Cross-City Variation in Mortality during the 1918 Influenza Pandemic? Evidence from 440 U
Deaths in New York City are More than Double the Usual Total
Doctors and Nurses Say More People are Dying of COVID-19 in the US than We Know
Chest CT for Typical 2019-nCoV Pneumonia: Relationship to Negative RT-PCR Testing
Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report on 1014 Cases
Evaluating the Accuracy of Different Respiratory Specimens in the Laboratory Diagnosis and Monitoring the Viral Shedding of 2019-nCoV Infections. medRxiv Working Paper
Econometric Analysis of Cross Section and Panel Data