key: cord-0777651-d8m6vpzh authors: Vallejo, Juan A.; Trigo, Noelia; Rumbo-Feal, Soraya; Conde-Pérez, Kelly; Lopez-Oriona, Ángel; Barbeito, Inés; Vaamonde, Manuel; Tarrío-Saavedra, Javier; Reif, Rubén; Ladra, Susana; Rodiño-Janeiro, Bruno K.; Nasser, Mohammed; Cid, Ángeles; Veiga, María; Acevedo, Antón; Lamora, Carlos; Bou, Germán; Cao, Ricardo; Poza, Margarita title: Modeling the number of people infected with SARS-COV-2 from wastewater viral load in Northwest Spain date: 2021-12-16 journal: Sci Total Environ DOI: 10.1016/j.scitotenv.2021.152334 sha: fadaec2f120302c2f04dca52e6b83a3a6ea8a662 doc_id: 777651 cord_uid: d8m6vpzh
The quantification of the SARS-CoV-2 RNA load in wastewater has emerged as a useful tool to monitor COVID-19 outbreaks in the community. This approach was implemented in the metropolitan area of A Coruña (NW Spain), where wastewater from a treatment plant was analyzed to track the epidemic dynamics in a population of 369,098 inhabitants. The viral load detected in the wastewater and the epidemiological data from the A Coruña health system served as the main sources for developing statistical models. The regression models described here allowed us to estimate the number of infected people (R² = 0.9), including symptomatic and asymptomatic individuals. These models have helped to understand the real magnitude of the epidemic in a population at any given time and have been used as an effective early warning tool for predicting outbreaks in the A Coruña municipality. The methodology of the present work could be used to develop a similar wastewater-based epidemiological model to track the evolution of the COVID-19 epidemic anywhere in the world where centralized water-based sanitation systems exist.
During the last decade, Wastewater-Based Epidemiology (WBE) has emerged as a highly relevant discipline with the potential to provide objective information by combining the use of cutting-edge analytical methodologies with the development of ad hoc modeling approaches.
WBE has been extensively used in recent years to predict with high accuracy the consumption patterns of numerous substances (EMCDDA, 2020). Several examples from the literature show different approaches and strategies to tackle the uncertainty associated with WBE studies. For example, Goulding and Hickman assumed three main sources of uncertainty and, using Bayesian statistics, fitted the data to hierarchical linear regression models (Goulding et al., 2020). Other modeling approaches (Croft et al., 2020) considered Monte Carlo simulations to deal with uncertainties such as wastewater inflow variability or the stability of the substances in wastewater and their pharmacokinetics. In general, WBE studies showed that despite the wide number of the present global COVID-19 pandemic, processes to monitor SARS-CoV-2 in wastewater were first developed in the Netherlands (Medema et al., 2020), followed by other countries (Nemudryi et al., 2020; Wurtzer et al., 2020; Ahmed et al., 2020; La Rosa et al., 2020; Randazzo et al., 2020b; Peccia et al., 2020; Balboa et al., 2021; Weidhaas et al., 2021; Pillay et al., 2021; Kumar et al., 2021; Hart and Halden, 2020). Between one third and four fifths of patients infected with SARS-CoV-2 are asymptomatic (Bi et al., 2020; Day, 2020; Pollán et al., 2020), a condition that depends on many factors, such as the mean age of the population, and that promotes the undetected spread of COVID-19. A systematic literature review found that a substantial proportion of persons infected with COVID-19, including symptomatic and asymptomatic people, who were tested for fecal viral RNA were positive from the early stages of infection (Gupta et al., 2020) and persistently tested positive on rectal swabs even after nasopharyngeal testing was negative (Chen et al., 2020; Xing et al., 2020; Xu et al., 2020; Cevik et al., 2021; Miura et al., 2021).
For example, it has been reported that excretion of viral RNA in the stool of infected people continues for a prolonged period, with a mean of 27.9 days, after the person has tested negative in respiratory samples (Wu et al., 2020b). Therefore, genetic material of SARS-CoV-2 can be found in wastewater (Lodder and de Roda Husman, 2020), which has made monitoring of the viral RNA load in sewage an excellent tool for the epidemiological tracking of the current pandemic as well as an extremely efficient early warning tool for outbreak detection (Randazzo et al., 2020a; Ahmed et 2020), reaching a peak of 1,667 active cases. The cases were distributed in an area that covers 37 municipalities in the health area of A Coruña-Cee, as shown by the data provided by SERGAS (Galician Health Service) in https://www.datawrapper.de/_/QrkrZ (SERGAS, 2021). In this context, the main objective of the present work was to develop parametric and nonparametric statistical models useful to determine the entire SARS-CoV-2 infected population, including symptomatic and asymptomatic people, by tracking the viral load present in the wastewater of the Bens wastewater treatment plant, which serves the metropolitan area of A Coruña with nearly 370,000 residents, without the need for health system data or the number of positive people reported, and obtaining information from population-based seroepidemiological surveys carried out in Spain (this represents a contribution with respect to other models). The pursuit of this objective has a public service motivation. In this regard, it is important to note that this research, in the framework of the COVIDBENS project, had a high social impact and was one of the precursors of this type of monitoring in Spain for the surveillance of SARS-CoV-2 in wastewater. COVIDBENS currently provides a public service through weekly reports to municipalities, public health and regional administrations (SERGAS, Xunta de Galicia) for surveillance and early warning tasks.
RNA was extracted from 100 µL of the concentrated samples using the QIAamp Viral RNA Mini Kit (Qiagen, Germany) according to the manufacturer's instructions. Briefly, the sample was lysed under highly denaturing conditions to inactivate RNases and to ensure isolation of intact viral RNA. Then, the sample was loaded in the QIAamp Mini spin column, where RNA was retained in the QIAamp membrane. Samples were washed twice using washing buffers. Finally, RNA was eluted in 70 µL of RNase-free water. The quality and quantity of the RNA were checked using a Nanodrop Instrument and an Agilent Bioanalyzer. Samples were kept at -80 °C until use.
RT-qPCR assays were done in a CFX 96 System (BioRad, USA) using the qCOVID-19 kit (GENOMICA, Spain) through N gene (coding for nucleocapsid protein N) amplification. Hard-Shell 96-well PCR plates (BioRad, USA) were used and sealed with Microseal 'B' PCR Plate Sealing Film (BioRad, USA). The recommendations given by Ahmed et al. (2022) were followed for minimizing errors in RT-PCR detection. RT-qPCR reactions were done following the manufacturer's instructions (GENOMICA, Spain). This kit provided two reaction mixes: Mix A contained DNA polymerase, nucleotides, polymerase buffer and an internal control, and Mix B contained primers and probe for the N gene. The final reaction contained 5 µL of Mix A, 1 µL of Mix B, 0.2 µL of reverse transcriptase, 5 µL of template and 8.8 µL of water. The internal control allowed the presence of inhibitors to be ruled out. Nuclease-free water was used as negative-control template. The cycling parameters were 50 °C for 20 minutes for the retrotranscription step, followed by a PCR program consisting of a preheating cycle of 95 °C for 2 min, 50 cycles of amplification at 95 °C for 5 s and finally one cycle of 60 °C for 30 s. RT-qPCR assays were done in sextuplicate.
For RNA quantification, a reference pattern was standardized using the Human 2019-nCoV RNA standard from European Virus Archive Global (EVAg) (Figure S1). To build the calibration curve, the decimal logarithm of SARS-CoV-2 RNA copies per µL of control material, ranging from 5 to 500, was plotted against Cq (quantification cycle) values. Calibration was done amplifying the N gene. Viral load was defined as the number of copies of SARS-CoV-2 RNA per L. The analytical efficiency of the RT-qPCR was calculated following the criteria recommended by the MIQE guidelines (Bustin et al., 2009). The limit of detection (LOD) of the RT-qPCR was analyzed using serial dilutions of the Human 2019-nCoV RNA standard (EVAg). Detection rates are shown in Table S1 (in the Supplementary Material section), where the LOD was established at 25 copies per reaction and the limit of quantification (LOQ) was between 25 and 15 copies per reaction. The slope of the standard curve was -3.56, the R² was 0.9984 and the amplification efficiency was above 90%.
In the present work a COVID-19 case was defined as a person with COVID-19 virus infection laboratory-confirmed by RT-PCR of SARS-CoV-2, regardless of clinical signs and symptoms. Active cases are living persons confirmed with COVID-19 whose symptom onset date is less than or equal to 14 days from the date of the current report. (CHUAC).
Since flow may be an important variable when determining the viral load in the wastewater, an exploratory data analysis of the volume of water pumped at the WWTP Bens during the lockdown period was performed using flow data. Dataset S4 includes two-minute flow measurements (m³·s⁻¹) at WWTP Bens for the period January 1st to May 14th. An additional flow study is described in the Supplementary Material section. Preliminary statistical methods have been devised to backcast the number of COVID-19 active cases based on reported official cases.
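As an illustration of the calibration step, the following minimal Python sketch shows how the amplification efficiency is commonly derived from the slope of a standard curve, using the usual relation E = 10^(-1/slope) - 1. The slope is the value reported in the text; the intercept and the example Cq are hypothetical placeholders, since the text does not report them, and the function names are ours.

```python
import math


def amplification_efficiency(slope: float) -> float:
    """Return the amplification efficiency as a fraction (1.0 means 100%).

    Uses the standard-curve relation E = 10^(-1/slope) - 1, where the slope
    is expressed in Cq units per log10(copies).
    """
    return 10 ** (-1.0 / slope) - 1.0


def copies_from_cq(cq: float, slope: float, intercept: float) -> float:
    """Invert the calibration line Cq = slope * log10(copies) + intercept."""
    return 10 ** ((cq - intercept) / slope)


SLOPE = -3.56      # slope of the standard curve reported in the text
INTERCEPT = 40.0   # hypothetical intercept, for illustration only

efficiency = amplification_efficiency(SLOPE)
print(f"Amplification efficiency: {efficiency:.1%}")

# Hypothetical Cq reading converted to copies with the assumed intercept
example_copies = copies_from_cq(33.0, SLOPE, INTERCEPT)
print(f"Copies for Cq = 33 (assumed intercept): {example_copies:.1f}")
```

With the reported slope, the computed efficiency falls just above 90%, in line with the figure stated in the text. Any copy numbers obtained from the second function depend entirely on the assumed intercept.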
Follow-up times (available only until May 7th) for anonymized individual reported COVID-19 cases in Galicia (NW Spain, where the WWTP Bens is located, Figure 1) have been used to count the number of cases by municipality based on patient zip codes. Since the epidemiological discharge date is missing, the number of active cases in the metropolitan area of A Coruña could not be obtained, but the cumulative number of cases was computed. On the other hand, the main epidemiological series for COVID-19 were publicly available daily in Galicia at the level of health areas. However, the definition of one of the series changed from cumulative cases to active cases on April 29th. Thus, the epidemiological series for COVID-19 in the health area of A Coruña-Cee (population 551,937) was used to estimate the epidemiological series for COVID-19 for the metropolitan area of A Coruña (population 369,098). To do this, a linear regression model was used to relate the relative cumulative and active cases (cases per million) of COVID-19 for the health area of A Coruña-Cee. Predicting the rate of active cases and considering the population size in the metropolitan area gives the estimated total number of official active cases in the five municipalities. This approach is only possible until May 7th, the date of our last database update. To estimate the number of official active cases from May 8th onwards, another linear regression model has been used to relate the number of active cases in the health area of A Coruña-Cee and in the metropolitan area of A Coruña. Since the number of active cases in the health area has been reported until June 5th, the series of estimated official active cases could be backcasted from May 8th until June 5th. Finally, to transform the official number of COVID-19 cases into the real number, the ratio between the mean of real cases and the mean of official cases was estimated using the official figures of cumulative cases.
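The rate-based step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the two short series are hypothetical values (not data from the study), the regression direction (active rate as a function of cumulative rate) is our reading of the text, and the function name is ours. Only the metropolitan population comes from the text.

```python
import numpy as np

METRO_POPULATION = 369_098  # metropolitan area of A Coruña (from the text)

# Hypothetical health-area series in cases per million; NOT data from the study
cumulative_rate = np.array([500.0, 900.0, 1400.0, 1800.0, 2100.0])
active_rate = np.array([450.0, 800.0, 1200.0, 1500.0, 1700.0])

# Simple linear regression: active_rate ~ intercept + slope * cumulative_rate
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(cumulative_rate, active_rate, deg=1)


def estimate_metro_active_cases(cum_rate_per_million: float) -> float:
    """Predict the active-case rate, then scale it to the metropolitan population."""
    predicted_rate = intercept + slope * cum_rate_per_million
    return predicted_rate * METRO_POPULATION / 1_000_000


estimate = estimate_metro_active_cases(1000.0)
print(f"Estimated official active cases: {estimate:.0f}")
```

The scaling by population over one million is the only part fixed by the description; the regression itself would need to be refitted on the actual health-area series.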
The Spanish seroepidemiological survey carried out by the National Center of Epidemiology in Spain, ENECOVID (Pollán et al., 2020), included representative samples within the metropolitan area of A Coruña during the period April-June 2020. These data have been used to quantify the real proportion of the population infected during the period April-June in the area corresponding to WWTP Bens. Accordingly, the number of actual active cases in Galicia was 56,713 for April 27th to May 11th (prevalence 2.1%) and 59,414 for May 18th to June 1st (prevalence 2.2%). Comparing these numbers with the official numbers on May 11th (10,669) and June 1st (11,308) gives estimated ratios of 5.316 and 5.254 in these two periods, with an average of around 5.29. This conversion factor was used to backcast the series of real active cases based on the estimated daily official COVID-19 cases in the metropolitan area of A Coruña. Some of these series, including the backcasted series of real active cases, are included in Dataset S3.
To properly interpret the results of the models fitted to understand the evolution of the viral load, a brief description of them follows, namely of the Generalized Additive Models (GAM) and the Locally Estimated Scatterplot Smoothing (LOESS). In fact, GAM using a basis of cubic regression splines (Hastie and Tibshirani, 1990) and LOESS (Cleveland, 1979) nonparametric regression models have been used to fit the viral load along the day on May 5th, 6th, 11th and 12th, and as a function of time at CHUAC from April 22nd to May 12th and at WWTP Bens from April 16th to June 3rd. Several outliers showing a very small viral load (some of them under the level of detection) have been removed from the data. They corresponded to unexpected and intensive pipeline cleaning episodes (8-hour cleaning with water at 70 °C during Thursday-Friday nights) carried out on April 23rd-24th, April 30th-May 1st and May 7th-8th.
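The conversion factor between real and official cases quoted above can be reproduced with a few lines of Python. The case counts are the figures given in the text; the dictionary layout and the helper function are our own illustrative choices.

```python
# Figures quoted in the text for Galicia: survey-based (real) vs. official counts
survey_cases = {"May 11": 56_713, "June 1": 59_414}
official_cases = {"May 11": 10_669, "June 1": 11_308}

# Ratio of real to official cases for each reference date
ratios = {date: survey_cases[date] / official_cases[date] for date in survey_cases}

# Average of the two ratios, used as the conversion factor
conversion_factor = sum(ratios.values()) / len(ratios)


def backcast_real_cases(official_active: float, factor: float = conversion_factor) -> float:
    """Scale an official active-case count to an estimated real count."""
    return official_active * factor


for date, ratio in ratios.items():
    print(f"{date}: ratio = {ratio:.3f}")
print(f"Mean ratio = {conversion_factor:.2f}")
```

The two ratios agree with the values reported in the text to three decimals, and their mean is close to the stated factor of around 5.29.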
Nonparametric methods do not assume any parametric form (Maity, 2017), allowing us to model complex nonlinear functions (Shokrzadeh et al., 2014). There is a wide range of nonparametric regression models, such as local polynomial regression (Fan and Gijbels, 2018; Breidt and Opsomer, 2000), kernel smoothing (Wand and Jones, 1994), regression splines (Wood, 2017), and LOESS (Cleveland, 1979), among others. There is another group of models, called semiparametric regression models, that allow the inclusion of linear parametric and flexible nonparametric effects of the predictors on the response, such as the GAM models. In this work, GAM models are applied to define the relation between the viral load and the date, by using penalized regression splines (Wood, 2017) to estimate the smooth effect of date on the viral load. This type of model can, however, include linear effects in addition to smooth effects. For instance, a GAM consisting of one linear predictor, $X_1$, and two smooth predictors, $T_1$ and $T_2$, can be defined by
$$Y = \beta_0 + \beta_1 X_1 + f_1(T_1) + f_2(T_2) + \varepsilon.$$
Thus, the smooth effect of $T_j$ can be expressed by means of a spline basis,
$$f_j(T_j) = \sum_{k} \beta_{jk}\, b_{jk}(T_j),$$
where $b_{jk}$ are the spline functions and the number of basis functions is determined by $L$, the number of knots. In this work, splines of degree 3 (cubic regression splines) were chosen.
As in the previous section, in order to be able to correctly interpret the results of the application of the models, a brief introduction from a statistical point of view is provided beforehand. Parametric models such as linear regression, the simplest one, have a transfer function that defines the dependence relation between two or more variables involved in a specific process. This function depends on the value of some parameters that can be interpreted from a physical or chemical point of view, among others. We can define a simple parametric model by
$$Y = m(X) + \varepsilon,$$
where $Y$ is the response variable to be modeled, and $m$ is the parametric function, depending on $X$, the independent variable.
In the present case, $Y$ is the number of real infected people, $X$ accounts for the values of the logarithm of the viral load in wastewater, and $\varepsilon$ denotes the random errors of the model. The regression function is assumed to be an element of a set of parametric functions $\mathcal{M}_\Theta = \{ m_\theta : \theta \in \Theta \}$, where $\theta$ is the vector of parameters that defines the regression model and $\Theta$ is a subset of $\mathbb{R}^k$. On the other hand, the semiparametric models used are flexible ones that allow the introduction of linear and smooth effects of the predictors on the response. All these models have been successfully used to predict the number of COVID-19 active cases based on the measured viral load (number of SARS-CoV-2 RNA copies/L) at WWTP Bens, the daily flow in the sewage network, and other environmental variables such as rainfall, temperature and humidity. Diagnostic tools (Q-Q plots, residuals versus fitted values plots and Cook's distance) were used for outlier detection, which improved the fit of the models.
The R statistical software was used to perform the statistical analyses (R Core Team, 2021). Namely, the mgcv library (Wood, 2017) was applied to fit GAM models, and ggplot2 and GGally (Schloerke et al., 2020; Ginestet, 2011) to perform correlation analysis, obtain graphical output and fit LOESS models, respectively. The caret R package was used to fit and evaluate regression models (Kuhn et al., 2008). Although some RT-qPCR replicates could not be measured when the viral load was scarce, due to the limitation of the detection technique (errors randomly occur when the number of copies/L is under 10,000), 74% of the assays led to three or more measured replications. Conditional mean imputation (Enders, 2010) was used for unmeasured replications. In those cases, the missing values among the replications were imputed with the sample mean of the non-missing values of the replications.
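The replicate imputation rule can be sketched in Python as shown below. The sketch covers the rule described here together with the fallback the text gives for assays in which every replicate is missing (imputation with the lowest value observed in the whole dataset). The function name, the use of `None` to mark unmeasured replicates, and the example values are our own assumptions for illustration.

```python
from statistics import mean
from typing import List, Optional


def impute_replicates(assays: List[List[Optional[float]]]) -> List[List[float]]:
    """Fill unmeasured replicates (None) assay by assay.

    - If an assay has at least one measured replicate, missing replicates
      take the mean of the measured replicates of that assay.
    - If all replicates of an assay are missing, they take the lowest value
      observed anywhere in the dataset.
    """
    observed = [v for assay in assays for v in assay if v is not None]
    if not observed:
        raise ValueError("No measured replicates available for imputation")
    global_min = min(observed)

    imputed = []
    for assay in assays:
        present = [v for v in assay if v is not None]
        fill = mean(present) if present else global_min
        imputed.append([fill if v is None else v for v in assay])
    return imputed


# Hypothetical replicate readings (arbitrary units), for illustration only
example = [[10.0, None, 14.0], [None, None, None], [5.0, 7.0, 6.0]]
result = impute_replicates(example)
print(result)
```

Measured values are left untouched, so the imputation only affects the replicates that could not be read.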
When all the replications were missing, they were imputed with the lowest value of all the observed measurements in the whole dataset.
With the aim of raising awareness in the population and helping in the management of the pandemic, the data obtained were disseminated at first through private reports to the political and health authorities. Later, data were disseminated through social networks, press, radio and TV, and finally a public web page was created in order to inform society weekly about the COVID-19
To model the viral load, the number of COVID-19 positive cases needs to be estimated (Figure 2). The iceberg in Figure 2 represents the global health of the population of the metropolitan area of A Coruña based on our study, where visible cases are separated from the invisible ones by a black line. The real (and unknown) iceberg is represented in blue, which includes real cases with or without symptoms, visible and invisible. The iceberg represented in green corresponds to the number of infected people estimated using the statistical models presented here. The green iceberg has been drawn with a slightly different shape and size, but very similar to the blue one, taking into account the 10% margin of error of our models. Visible cases estimated in this study (in yellow) against the real (and unknown) ones (in orange) are also represented in Figure 2. Finally, the asymptomatic cases known thanks to the national seroprevalence study (Pollán et al., 2020) are represented in purple. However Figure S2B) depending on the hour of the day, during four different days. Figure S2A shows the hourly trend at CHUAC, with a maximum around 08:00, whereas the viral load curves at WWTP Bens (Figure S2B) attained a minimum around 05:00 and a maximum between 14:00 and 15:00. As a consequence of these trends, 24-hour integrated samples are fully justified.
As expected, the mean viral load decreased with time when measured at CHUAC (Figure S3A of
Since the nonparametric estimation of the viral load effect had a logarithmic shape, a multiple linear model was fitted using the logarithmic transformation of the viral load, daily flow, rainfall, temperature, and humidity. The best option to explain the number of COVID-19 cases is a simple linear model as a function of the natural logarithm of the viral load. In fact, the R² hardly changes when adding the Flow, Rainfall, Humidity and Temperature variables (see Figure S4 in the Supplementary Material, which shows the most explanatory models for different numbers of predictors using the R² maximization criterion). Indeed, when a multivariate linear model depending on three predictors (viral load, daily flow, and rainfall) was fitted, the data showed that the only significant explanatory variable was the viral load (p-value = 1.32 × 10⁻⁸). Table 1 shows that the effects of the other two predictors, daily flow (p-value = 0.186525) and rainfall (p-value = 0.099239), were not clearly significant. Figure 7A). After removing three outliers, the fit improved slightly (R² = 0.894), as shown in Figure 7B. The final fitted linear model had the form
$$Y = \beta_0 + \beta_1 \ln(X),$$
where $Y$ denotes the real number of active COVID-19 cases, $X$ is the viral load (number of RNA copies per L) and $\ln$ stands for the natural logarithm. For instance, a viral load of X = 150,000 copies per liter would lead to an estimated number of Y = 5,543 active cases.
The prediction ability of this fitted linear model, the GAM, and the linear and quadratic LOESS models has been evaluated using a 6-fold cross-validation procedure, to prevent overfitting. In all cases, the response variable was the estimated number of real COVID-19 active cases in the metropolitan area (Figure 2), and the explanatory variable was the natural logarithm of the viral load.
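The cross-validation procedure described above can be sketched in Python for the simple linear model. This is a minimal sketch on synthetic data, not a reproduction of the analysis (which was carried out in R): the sample size, the coefficients used to simulate the data, the noise level and the random seeds are all hypothetical, and the prediction R² is computed with one common definition (one minus the ratio of the prediction sum of squares to the total sum of squares), which may differ from the one used by the authors' software.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the coefficients below are hypothetical, not the study's fitted values
n = 36
log_load = rng.uniform(9.0, 13.0, size=n)  # natural log of viral load
cases = -15_000.0 + 2_000.0 * log_load + rng.normal(0.0, 300.0, size=n)


def kfold_predictions(x: np.ndarray, y: np.ndarray, k: int = 6, seed: int = 0) -> np.ndarray:
    """Return out-of-fold predictions from the linear model y = b0 + b1 * x."""
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, k)
    preds = np.empty(len(x))
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds only (polyfit returns slope first)
        b1, b0 = np.polyfit(x[train_mask], y[train_mask], deg=1)
        preds[test_idx] = b0 + b1 * x[test_idx]
    return preds


preds = kfold_predictions(log_load, cases, k=6)

# Root mean squared prediction error and prediction R^2 from held-out predictions
rmspe = float(np.sqrt(np.mean((cases - preds) ** 2)))
prediction_r2 = float(
    1.0 - np.sum((cases - preds) ** 2) / np.sum((cases - cases.mean()) ** 2)
)
# Error expressed as a percentage of the response range
rmspe_pct_of_range = 100.0 * rmspe / float(cases.max() - cases.min())

print(f"RMSPE = {rmspe:.1f}, prediction R^2 = {prediction_r2:.3f}")
print(f"RMSPE as % of response range = {rmspe_pct_of_range:.1f}%")
```

Because every prediction comes from a model that never saw the corresponding observation, both metrics reflect out-of-sample performance. The same loop could be reused with a GAM or LOESS fit in place of the linear fit.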
Table 2 shows the corresponding prediction R² for each of the four models, along with the root mean squared prediction error (RMSPE). The smaller this error, the better the predictive ability of the model. All the models provided quite accurate predictions for the real number of COVID-19 active cases using the viral load, with an error of around 10% of the response range. The model with the lowest prediction error, 9.5%, was the quadratic LOESS model. Flexible models, such as LOESS and GAM, slightly improved the predictive performance when compared with the linear model, which has a prediction error of around 11.4% of the response range. The quadratic LOESS model was also the one with the largest value of R². Therefore, it provided the best predictive results. A scatterplot of the estimated number of COVID-19 active cases in the metropolitan area versus the natural logarithm of the viral load, along with the quadratic LOESS fitted curve, shows the linear relationship between these two variables (see Figure 8A). The actual and predicted values of the real number of COVID-19 active cases are very close to the diagonal line, which represents perfect model prediction (see Figure 8B).
In the pandemic context described above, 24-h composite samples from WWTP Bens were continuously analyzed from April 19th until early June, although surveillance has continued until now. The wastewater data obtained from April 19th to June 1st confirmed the decrease in COVID-19 incidence. We showed that time course quantitative detection of SARS-CoV-2 in wastewater from WWTP Bens correlated with COVID-19 confirmed cases, which backs up the plausibility of our approach. The previous seroprevalence studies carried out by the Spanish Centre for Epidemiology showed that cases in A Coruña represented about 1.8% of the local population (Pollán et al., 2020).
This means that, for a population of about 369,098 inhabitants, the number of people infected with SARS-CoV-2 contributing their sewage to the WWTP Bens would be around 6,644, which includes people with symptoms and those who are asymptomatic. Considering that the ratio between people with symptoms (reported by the health service) and the total infected population (including asymptomatic people) was estimated to be 1:5, we estimated that reported cases contributing their wastewater to WWTP Bens would be around 1,661, which is close to the maximum number of cases reported in the A Coruña-Cee area (1,667 cases on April 28th); these data therefore support our study. Of course, as reflected above, this ratio is subject to uncertainty and can be discussed. It should be noted that the criteria used by the authorities to report cases varied over time, which may explain the gap between the graphs reported in the media throughout the epidemic and our Figure 4, where a decrease in both the viral load and the estimated COVID-19 cases can be observed from mid-April to early June. However, the level of the curve at WWTP Bens at the beginning of May was much higher than that corresponding to May 11th. This is due to the effectiveness of the lockdown measures applied in Spain. The May 12th daily curve at CHUAC showed a higher viral load than the one corresponding to May 11th at WWTP Bens, showing that the viral load measured at the hospital tends to be higher than at WWTP Bens due to a lower dilution effect, as expected. One of the key points of this work is the use of statistical regression models for the estimation of real cases of COVID-19 from the information provided by wastewater, in the specific framework of the metropolitan area of A Coruña.
In fact, in the present work, nonparametric and even simple parametric regression models have been shown to be useful tools to estimate the real number of COVID-19 active cases as a function of the viral load. This is a pioneering approach in the context of the SARS-CoV-2 pandemic, specifically in the framework of Spain. To our knowledge, the WBE studies available before our proposal was made (Vallejo et al., 2020) tended to be limited to reporting the occurrence of SARS-CoV-2 RNA in WWTPs and sewer networks, in order to establish a direct comparison with declared COVID-19 cases (Randazzo et al., 2020b; Medema et al., 2020; Nemudryi et al., 2020; La Rosa et al., 2020; Randazzo et al., 2020a; Polo et al., 2020). However, a number of studies from similar dates proposed mathematically based models for estimating the number of COVID-19 cases. Along these lines, Hart and Halden (2020) is one of the preceding works that combines computational analysis and modeling with a theoretical approach in order to identify useful variables and confirm the feasibility and cost-effectiveness of WBE as a prediction tool. We must also mention the studies of Ahmed et al. (2020) and Curtis et al. (2020), which proposed parametric transfer functions depending on WBE variables to estimate the number of persons infected. In addition, recent works have focused on providing and comparing regression and time series models for the prediction of active cases (Cao and Francis, 2021; Li et al., 2021). The present methodology is characterized by focusing on a particular region and carefully studying all the steps, from sample collection to the estimation of the actual number of infected persons. This significantly differentiates our method, and the resulting predictive model, from other valid and useful alternatives.
In addition, it is important to note that we focus on estimating the actual number of infected people rather than the number of cases reported by health authorities, as other methodologies do. For this task, the present approach uses the seroprevalence studies carried out by the Spanish Centre for Epidemiology. Moreover, the statistical regression models included in this methodology range from parametric to nonparametric approaches such as GAM or LOESS, which allow much more flexible data fitting, not subject to a fixed parametric expression relating the response variable and the predictors. Other
Although this is a bit counterintuitive (dilution should affect the viral load measured), it is important to point out that rainfall fluctuated little during the data collection period from mid-April to early June 2020: its median was 0, its mean was 2.88 L/m² and its standard deviation was 6.59 L/m². Obviously, flow is an important factor to consider in WBE. In our case, before starting to model, a correlation study between several variables was carried out, of course including flow. Since this work was done at a time of drought, it was found that, contrary to expectations, flow was not a variable that entered the model. The strongest correlation was established between the log of the viral load and the number of active cases (R² = 0.93). Of course, our models can be affected in the rainy season. However, as soon as the rainy season came, large distortions in our models were detected caused by the appearance of the Alpha and Delta variants (data not shown). Also, the water temperature remains stable in our geographical area and never rises enough to cause excessive degradation of viral RNA, so this variable does not enter the models either, as expected. It is important to highlight that, as a consequence of the results of the GAM fit, a ), the best among all the considered models.
However, similar results were found for the linear model, which also brings the advantage of simplicity. Therefore, both models, linear and quadratic LOESS, could be successfully used to estimate the number of infected people in a given region based on viral load data obtained from wastewater. The proposed models, as described, are only applicable to the metropolitan area of A Coruña, the region for which they have been developed, although they can be adapted to any other location. This area has Atlantic weather and it may rain substantially in autumn and winter, which could lead to explanatory variables such as rainfall and/or mean flow becoming significant for those seasons and needing to enter the prediction model. Thus, when applying these models to the same location but in seasons with different climatic behavior, they might need to be reformulated. In addition, the methodology used to build these statistical models could be used at other locations for epidemiological COVID-19 outbreak detection, or even for other epidemic outbreaks caused by other microorganisms. Of course, in that case a detailed data analysis would have to be carried out as well, since specific features of the sewage network or the climate may affect the model itself. Overall, the results showed that the estimates of the real total number of infected persons are significantly higher than the number of reported infected persons. The two curves are very different in terms of scale. However, the shapes of both trends are very similar, providing relevant information on the evolution of the pandemic in Spain. Moreover, it is important to note that the curve of estimates is slightly out of phase with respect to the number of active cases reported by health authorities.
In fact, it is slightly ahead (by around two weeks), contributing early information about the beginning of each outbreak, such as those corresponding to August 2020, November 2020 or January 2021 (data not shown). This information has provided support for decision making to the public authorities, involving decisions such as confinement, mobility and time
The first serious limitation in starting this work was the beginning of the total confinement of the first outbreak, decreed by the national government, at the time when the COVIDBENS research team was created specifically for this work. This limited, of course, the interaction between the researchers. We had to ask for special permits to be able to carry out the laboratory work in the hospital and do the sample collection. Then came the shortage of reagents for RNA purification and for PCR, of suitable devices, and of personal protective material such as masks and gloves. The lack of reagents did not allow us to make replicates at the very beginning; these could only be done later. The control of recovery efficiency was also impossible due to the unavailability of marker viruses at that moment. Also, the lack of reliable data on COVID-19 cases was an important limitation; therefore, statisticians had to create a statistical model to be able to estimate the real number of cases, since the information at that first moment of the pandemic was very biased and incomplete. Moreover, some periodic cleaning events carried out with hot water in the wastewater collection system of the hospital produced confusing data during the first weeks of the study, since hot water practically eliminates viral RNA. Due to the lack of several materials and reagents, a great effort in sample collection and processing had to be made, and finally the number of samples was enough to develop the statistical models as soon as possible.
Variability in the PCR data when the viral load was very low was another important limitation, which was partially solved by running six replicates of each RT-qPCR. Also, the ratios fixed for the calculation of symptomatic cases and total cases are another source of uncertainty that should be taken into account. The mean ratio between real COVID-19 cases and official COVID-19 cases was estimated using the Spanish sero-epidemiological survey, which gave a mean ratio estimate of 5.29. The mean ratio was rather stable over time in our study, ranging between 5.316 (May 11th) and 5.254 (June 1st). However, this mean ratio may be quite different in later periods of the pandemic. The statistical models described in the present study were built with data collected only during the first epidemic outbreak. This is a limitation; nevertheless, it made early statistical models available that, since July 2020, have been able to estimate the number of infected people in the population and to support the public health system. Another important limitation is that the models described here did not depend on rainfall, probably because the data were obtained in spring, a season of little rain in NW Spain. Therefore, the models must be adapted before they are used in rainy seasons. Adaptation is also required to use this type of model in other locations. However, the authors believe they could help to adapt these models to any other place in the world where there is a wastewater sanitation system. Finally, despite the serious limitations, we decided to exploit the statistical models as soon as possible to help in the fight against the pandemic.
This work aimed to evaluate the viral load measured in wastewater as a predictor of the real number of COVID-19 active cases.
Different regression models were fitted to predict the real number of COVID-19 active cases based on the viral load, flow and the most relevant atmospheric variables. These statistical models have been able to provide estimates, from April 2020 up to now, of the real magnitude of the epidemic in the metropolitan area of A Coruña. These estimates have been reported weekly to the local public authorities. A variable selection method based on the maximization of the adjusted $R^2$ was used. The only predictor with a significant effect on the real number of infected persons was the viral load, both when GAM and multivariate linear models were fitted. In the GAM model, the effect of the viral load on the real number of active cases had a logarithmic shape. Thus, the viral load was introduced in the multivariate linear model using its logarithmic transformation. As a result, highly explanatory GAM ($R^2 = 0.86$) and linear ($R^2 = 0.851$) models were fitted. Their goodness of fit increased when outliers were removed ($R^2 = 0.894$ [...] this specific case of the metropolitan area of A Coruña. In other locations, with different weather, or in other seasons, the role of the other potential explanatory variables could be more important. In addition, the reliability of the model predictions could change over time due to different causes, such as the change of SARS-CoV-2 variants. Moreover, the prediction ability of the linear, the GAM, and the linear and quadratic LOESS models has been evaluated using 6-fold cross-validation through the prediction $R^2$ and the RMSPE index. The prediction errors of all the models were relatively small, supporting their use for prediction tasks. The lowest prediction error corresponded to the quadratic LOESS model. Finally, we have used the linear model to predict the number of real active cases using viral load.
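The linear fit on the logarithm of the viral load and its 6-fold cross-validation can be illustrated with a minimal sketch. It is not the authors' code (the study was carried out in R); the data are synthetic, the function names are hypothetical, and the definitions of the prediction $R^2$ and of RMSPE (root mean squared prediction error over pooled held-out predictions) are assumptions, since this section does not spell them out.

```python
import math
import random


def fit_log_linear(loads, cases):
    """Ordinary least squares for cases = intercept + slope * log10(load)."""
    xs = [math.log10(v) for v in loads]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cases) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cases))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, load):
    """Predicted number of active cases for a given viral load."""
    intercept, slope = model
    return intercept + slope * math.log10(load)


def k_fold_cv(loads, cases, k=6, seed=0):
    """Return (prediction R^2, RMSPE) from pooled out-of-fold predictions."""
    idx = list(range(len(loads)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    preds = {}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_log_linear([loads[i] for i in train],
                               [cases[i] for i in train])
        for i in held_out:
            preds[i] = predict(model, loads[i])
    mean_y = sum(cases) / len(cases)
    ss_res = sum((cases[i] - preds[i]) ** 2 for i in idx)
    ss_tot = sum((y - mean_y) ** 2 for y in cases)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(cases))


# Synthetic, purely illustrative data: 36 samples whose case counts grow
# linearly with log10(viral load) plus Gaussian noise. These values are
# assumptions and are unrelated to the A Coruña measurements.
rng = random.Random(42)
loads = [10 ** rng.uniform(9.0, 12.0) for _ in range(36)]
cases = [-8000.0 + 1000.0 * math.log10(v) + rng.gauss(0.0, 150.0)
         for v in loads]

model = fit_log_linear(loads, cases)
r2_pred, rmspe = k_fold_cv(loads, cases, k=6)
print(f"intercept={model[0]:.1f} slope={model[1]:.1f}")
print(f"prediction R^2={r2_pred:.3f} RMSPE={rmspe:.1f}")
```

Pooling the held-out predictions before computing the two indices is one common convention; averaging per-fold values is an equally plausible alternative and would give slightly different numbers.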
The linear model was chosen for its low prediction error, its simplicity, and the fact that it provides a parametric function that defines the relationship between the variables. The present methodology, including the fitting of parametric linear or nonparametric statistical models (GAM, LOESS), can be extended to estimate the real number of infected people at other locations. In addition, it is important to note that the cost-effectiveness and speed of the sampling can help to alert health authorities early about potential new outbreaks, thereby helping to protect the local population.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
The authors declare that all data supporting the findings of this study are available within the article and Supplementary Information files, and are also available from the corresponding authors on reasonable request.
AC: Ángeles Cid. MP, RC, CL, JAV, JT-S and RR conceived and designed the study. MP, JAV, NT, SR-F, MN and KC-P performed wastewater processing and viral analysis; RC, AL-O, IB, MV and JT-S built the statistical models and performed data analysis; SL managed and analyzed data; BKR-J assisted in the study design and analysis; AA assessed in data collection; AC supervised the wastewater analysis; MCV assessed in wastewater sampling; GB and MP supervised the microbiology team; MP, JAV, RC, RR and JT-S wrote the manuscript. MP, JT-S and RC supervised the team and coordinated all tasks.