key: cord-0925635-7ck3drcp authors: García-García, David; Morales, Enrique; de la Fuente-Nunez, Cesar; Vigo, Isabel; Fonfría, Eva S.; Bordehore, Cesar title: Identification of the first COVID-19 infections in the US using a retrospective analysis (REMEDID) date: 2022-05-10 journal: Spat Spatiotemporal Epidemiol DOI: 10.1016/j.sste.2022.100517 sha: 23a69deea90758c8acf7df7c096384d12231c14d doc_id: 925635 cord_uid: 7ck3drcp

Accurate detection of early COVID-19 cases is crucial to reduce infections and deaths; however, it remains a challenge. Here, we used the results from a seroprevalence study in 50 US states to apply our Retrospective Methodology to Estimate Daily Infections from Deaths (REMEDID) with the aim of analyzing the initial spread of SARS-CoV-2 infections across the US. Our analysis revealed that the virus likely entered the country through California on December 28, 2019, which is 16 days before the officially recognized entry date established by the Centers for Disease Control and Prevention. Furthermore, the REMEDID algorithm provides evidence that SARS-CoV-2 entered each US state, on average, a month earlier than previously reflected in official data. Collectively, our mathematical modeling provides more accurate estimates of the initial COVID-19 cases in the US, and it can be extrapolated to other countries and used to retrospectively track the progress of the pandemic. The use of approaches such as REMEDID is highly recommended to better understand the early stages of an outbreak, which will enable health authorities to improve mitigation and preventive measures in the future.

analysis of respiratory samples of an individual hospitalized on December 27, 2019, was positive for SARS-CoV-2, which is around a month before the first case had been reported (Deslandes et al., 2020).
In Italy, the retrospective analysis of wastewater samples found that the virus was already circulating on 18 December 2019 in Milan and Turin (LaRosa et al., 2021). In addition, a retrospective computational analysis suggested that the first infection in Italy occurred in late November 2019 (Fochesato et al., 2021). In the US, retrospective analysis of blood samples identified virus introduction earlier than reported in Illinois, Massachusetts, Wisconsin, Pennsylvania, and Mississippi (Althoff et al., 2021), and even as early as December 13-16, 2019, in California, Oregon, and Washington (Basavaraju et al., 2021).

The objective of this study is to provide insights into the early stages of the COVID-19 outbreak in the US from a likelihood-based estimation procedure. We report the results from an independent retrospective data analysis to reconstruct the daily infections time series at the beginning of the pandemic. These new time series reconcile reported deaths, clinical information of the illness, and the results of a seroprevalence study (Bajema et al., 2020), unlike official records, which generally underestimate cases and therefore overestimate the Case Fatality Ratio (CFR). In addition, official data usually refer to the diagnosis date and not to the date of infection, a distinction that matters for modeling purposes. Finally, the first infection of each reconstructed time series is identified for each state, indicating where and when the virus was introduced in the US and providing valuable information about its early spread dynamics.

Overall, COVID-19 deaths have been more thoroughly documented than infections, and we would like to transfer that thoroughness from the death records to the infection records. The reason is that infection counts at the beginning of the pandemic were generally of poor quality, either because no one was looking for infections yet or because there were not enough diagnostic tests.
Therefore, it can be useful to apply our algorithm, Retrospective Methodology to Estimate Daily Infections from Deaths (REMEDID) (García-García et al., 2021), to reconstruct the time series of new infections, as was done for Spain. To do so, some information about COVID-19 is needed. Given that an individual died due to COVID-19, the question of when they got infected remains. The period from infection to death is the sum of the incubation period (IP) and the illness onset to death (IOD) period. Thus, provided that IP and IOD are known, the date of infection can be inferred by subtracting IP+IOD from the date of death. However, IP and IOD are not fixed values, but random variables that can be approximated by probability distributions. The convolution of their probability density functions (PDF) defines the PDF of the period from infection to death. Let f(t) be this PDF, where t represents time since infection. As data are usually given daily, let F(n) be a discrete approximation to f(t) representing the probability of death n days after infection.

Then, given a COVID-19 death on a certain day n, the probability of having contracted the disease 1 day before is F(1); 2 days before is F(2), and so on. If more than one death occurred on day n, say x(n) deaths, the associated infections can be dated as follows: x(n)⋅F(1) infections occurred 1 day before; x(n)⋅F(2) occurred 2 days before, and so on. If the CFR is known, the total infections can be inferred from deaths. Following the previous reasoning from the opposite point of view, the infections on day n that ended in death, y(n), can be inferred as the sum of deaths on day n+1 that were infected on day n, x(n+1)⋅F(1); deaths on day n+2 that were infected on day n, x(n+2)⋅F(2); and so on. Then,

$$y(n) = \sum_{k \geq 1} x(n+k)\,F(k).$$

For each infection that ends in death, it can therefore be assumed that there were 100/CFR infections.
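The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation. The function names, the lognormal form of the IP and IOD distributions, and the parameter values in the usage lines are hypothetical placeholders chosen for illustration; they are not the values estimated by Linton et al. The sketch builds F(n) as the discrete convolution of daily IP and IOD probabilities, computes y(n) as the sum of x(n+k)⋅F(k), scales by 100/CFR, rounds to the nearest integer, and locates the first non-null day.

```python
import numpy as np


def lognormal_daily_pmf(mu, sigma, n_days):
    """Discretize a lognormal PDF onto days 1..n_days, normalized to sum to 1.

    mu and sigma are illustrative placeholders, not fitted values.
    """
    days = np.arange(1, n_days + 1, dtype=float)
    pdf = np.exp(-(np.log(days) - mu) ** 2 / (2 * sigma ** 2))
    pdf /= days * sigma * np.sqrt(2 * np.pi)
    return pdf / pdf.sum()


def infection_to_death_pmf(ip_pmf, iod_pmf):
    """Return F with F[n-1] = P(death n days after infection).

    ip_pmf[i] is P(IP = i+1 days) and iod_pmf[j] is P(IOD = j+1 days), so
    np.convolve(...)[m] is P(IP + IOD = m+2 days). One leading zero aligns
    the result so that index n-1 corresponds to n days.
    """
    return np.concatenate(([0.0], np.convolve(ip_pmf, iod_pmf)))


def remedid_infections(deaths, F, cfr_percent):
    """Infer daily infections from daily deaths.

    Returns (infections, offset). infections[i] refers to day (i - offset)
    counted from the first day of the deaths series, so infections that
    precede the first recorded day can still be dated.
    """
    K = len(F)
    # Pad K empty days before the series starts.
    x = np.concatenate((np.zeros(K), np.asarray(deaths, dtype=float)))
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(1, K + 1):
            if n + k < len(x):
                # Deaths on day n+k that were infected on day n.
                y[n] += x[n + k] * F[k - 1]
    # 100/CFR infections per infection that ends in death, then round.
    infections = np.rint(y * 100.0 / cfr_percent).astype(int)
    return infections, K


def first_infection_day(infections, offset):
    """Day of the first non-null element, relative to the first deaths day."""
    nonzero = np.flatnonzero(infections)
    return None if nonzero.size == 0 else int(nonzero[0]) - offset


# Toy example: death occurs 2 or 3 days after infection with equal probability.
F_toy = np.array([0.0, 0.5, 0.5])
inf_toy, off_toy = remedid_infections([0, 0, 0, 10], F_toy, cfr_percent=1.0)
print(inf_toy)                                 # [  0   0   0 500 500   0   0]
print(first_infection_day(inf_toy, off_toy))   # 0

# Hypothetical lognormal IP and IOD (placeholder parameters).
F_demo = infection_to_death_pmf(
    lognormal_daily_pmf(1.6, 0.4, 30), lognormal_daily_pmf(2.8, 0.4, 60)
)
```

Note that the last days of the reconstructed series are underestimated in this sketch, because the deaths that those infections would produce fall beyond the end of the input series.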
So, given a time series of deaths produced by the illness, x(n), the infections on day n can be inferred as

$$\frac{100}{\mathrm{CFR}} \sum_{k \geq 1} x(n+k)\,F(k).$$

For the inferred infections to be meaningful, they have been rounded to the nearest (non-negative) integer. Then, the first non-null element defines the date of the first infection. All the computations in this study were implemented in Matlab R2019b, while graphics were made in R with the packages usmap_0.5.2, viridis_0.6.2, and ggplot2_3.3.4. The data are public and anonymous; hence, no ethical approval was required for this study. We used the IP and IOD distributions estimated by Linton et al. 2020

The daily infections time series have been estimated by applying the REMEDID algorithm in each state, and they will be referred to as IR. Similarly, daily infections from official records will be referred to as IO. As an example, Figure 1 infections on April 3 (Figure 1c), that is less than 15% of the total infections and with a week and a half delay. Another major result is that the first official cases in IO are considerably delayed with respect to those in IR. We will focus on the dates of the first infections. Although the dates of the first infections have been estimated in studies based on sample repositories (Deslandes et al., 2020; Althoff et al., 2021; Basavaraju et al., 2021; LaRosa et al., 2021; Valenti et al., 2021; WHO, 2021), they have not been reported in other studies based on retrospective models (Barber et al., 2021; Irons and Raftery, 2021; Noh and Danuser, 2021)

(Rippinger et al., 2021). For example, the Spanish seroprevalence study reported that a third of infections during the first wave were completely asymptomatic (Pollán et al., 2020); and in Italy, another seroprevalence study in a random sample of blood donors revealed many more infections in Milan than were initially detected at the beginning of the pandemic in February 2020 (Valenti et al., 2021).

cases, which produces remarkable differences compared to official records.
The application of the REMEDID algorithm assumes that the proportion of mild and asymptomatic cases was similar at the beginning of the epidemic and during the period covered by the seroprevalence study. This scenario is plausible since no new virus variant became dominant until the alpha variant (B.1.1.7 lineage), which was first detected in England in September 2020 (PHE, 2020).

Differences between IO and IR regarding the early spread are significant. For example, Illinois dropped from the second to the 13th position using our REMEDID infection score. The first IR cases are dated around a month earlier than the IO ones, revealing that: (i) SARS-CoV-2 most likely spread to US states, on average, a month earlier than previously reported in official records; (ii) there was a generalized underdetection of cases at the beginning of the pandemic. Only Arizona and Illinois showed earlier first cases in documented infections than in our REMEDID analysis. Finally, West Virginia was the last state to report a COVID-19 infection (on March 17, 2020), whereas our REMEDID analysis identified Wyoming as the last state in its ranking (on February 28, 2020).

The REMEDID algorithm provides information about the early stage of the pandemic, when official records are expected to be of lower quality, although it has pros and cons. For example, it presents some advantages with respect to other retrospective analyses that rely on sample repositories (Deslandes et al., 2020; Althoff et al., 2021; Basavaraju et al., 2021; LaRosa et al., 2021; Valenti et al., 2021; WHO, 2021), which may or may not exist. If such repositories do not exist when the health crisis breaks out, these retrospective studies are not feasible. By contrast, the REMEDID algorithm is based on seroprevalence studies, which can be planned and carried out after the outbreak has taken place.
In fact, it is highly recommended to apply the algorithm to all regions with available seroprevalence studies to estimate daily infections and infer their first infections. However, this dependence also limits the algorithm's application, since it can only be applied to regions where seroprevalence studies are available. A second advantage is that the REMEDID reconstruction of daily infections makes it possible to infer the first infection date, which is not the case in other approaches that also reconstruct infections from deaths (Irons and Raftery, 2021). Another limitation comes from the IP and IOD distributions, which were estimated in China and may differ for the US. The IP is known to show geographical differences (Cheng et al., 2021), and some differences are expected for the IOD insofar as it partially depends on the health system of each country. However, the study can easily be redone as soon as IP and IOD distributions become available for the US. Finally, results depend on the quality of the daily deaths time series. Iuliano et al. (2021)

Antibodies to Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in All of Us Research Program Participants
Seroprevalence in the US as of
Estimating global, regional, and national daily and cumulative infections with SARS-CoV-2 through Nov 14, 2021: a statistical analysis
Serologic Testing of US Blood Donations to Identify Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)-Reactive Antibodies
Second travel-related Case of 2019 Novel Coronavirus Detected in United States. Press release
The incubation period of COVID-19: a global meta-analysis of 53 studies and a Chinese observation study of 11 545 patients
SARS-CoV-2 was already spreading in France in late
Cluster of pneumonia cases caused by a novel coronavirus
A retrospective analysis of the COVID-19 pandemic evolution in Italy
Retrospective methodology to estimate daily infections from deaths (REMEDID) in COVID-19: the Spain case study
Seroprevalence of antibodies to SARS-CoV-2 in 10 sites in the United States
Novel Coronavirus in the United States
Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys
using an excess mortality modelling approach. The Lancet Regional Health
Clinical and virologic characteristics of the first 12 patients with coronavirus disease 2019 (COVID-19) in the United States
SARS-CoV-2 has been circulating in northern Italy since December 2019: Evidence from environmental monitoring
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2)
Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data
Estimation of the fraction of COVID-19 infected people in U.S. states and countries worldwide
Timing the SARS-CoV-2 index case in Hubei province
Investigation of novel SARS-COV-2 variant. Variant of Concern 202012/01
Prevalence of SARS-CoV-2 in Spain (ENE-COVID): a nationwide, population-based seroepidemiological study
Evaluation of undetected cases during the COVID-19 epidemic in Austria
SARS-CoV-2 seroprevalence trends in healthy blood donors during the COVID-19 outbreak in Milan
WHO-convened global study of origins of SARS-COV-2: China part
Dissecting the early COVID-19 cases in Wuhan