key: cord-0685417-awmut0og
authors: Unwin, H. J. T.; Cori, A.; Imai, N.; Gaythorpe, K. A. M.; Bhatia, S.; Cattarino, L.; Donnelly, C. A.; Ferguson, N. M.; Baguelin, M.
title: Using next generation matrices to estimate the proportion of cases that are not detected in an outbreak
date: 2021-02-26
journal: nan
DOI: 10.1101/2021.02.24.21252339
sha: c5246eb8232d3338c3b45c3c8ed8224ffdaea230
doc_id: 685417
cord_uid: awmut0og

Contact tracing, where exposed individuals are followed up to break ongoing transmission chains, is a key pillar of outbreak response for many infectious disease outbreaks, such as Ebola and SARS-CoV-2. Unfortunately, these systems are not fully effective, and cases can still go undetected as people may not know or remember all of their contacts or contacts may not be able to be traced. A large proportion of undetected cases suggests poor contact tracing and surveillance systems, which could be a potential area of improvement for a disease response. In this paper, we present a novel method for estimating the proportion of cases that are not detected during an outbreak. Our method uses next generation matrices that are parameterized by linked contact tracing and case line-lists. We use this method to investigate the proportion of undetected cases in two case studies: the SARS-CoV-2 outbreak in New Zealand during 2020 and the West African Ebola outbreak in Guinea during 2014. We estimate that only 6% of SARS-CoV-2 cases were not detected in New Zealand (95% credible interval: 1.31 - 16.7%), but over 60% of Ebola cases were not detected in Guinea (95% credible interval: 15 - 90%).

make accurate transmission predictions. Over time, attempts have been made to account for under-reporting in models. Some models assume perfect reporting (15, 16); however, this can lead to an underestimation of the infection rate (6). Other methods assume a constant under-reporting rate (17) or use data augmentation techniques (6).
More recently, many models have switched to using death data, which were believed to be more reliable than case data because they are more likely to be consistent over time and between countries (13). This is especially important for methods which are robust to constant under-reporting. We propose using a quasi-Bayesian next generation matrix (NGM) approach in this paper to estimate the proportion of cases that are not detected in an outbreak. This method is not disease specific and is simple to implement from contact tracing and surveillance data. The calculation can also be repeated throughout the outbreak to provide time-varying estimates. We present two applications of our method: the SARS-CoV-2 outbreak in New Zealand in 2020 and the 2014 Ebola epidemic in Guinea.

This preprint (which was not certified by peer review) was posted on medRxiv on February 26, 2021 (https://doi.org/10.1101/2021.02.24.21252339) and is made available under a CC-BY-NC-ND 4.0 International license. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

NGMs are often used to calculate the basic reproduction number, $R_0$ (the average number of secondary infections generated by a primary infection in a large fully susceptible population), from a finite number of discrete categories that are based on epidemiologically relevant traits in the population, such as infected individuals at different stages of infection (e.g. exposed and infectious) or with different characteristics (e.g. age). The NGM is a matrix which quantifies the number of secondary infections generated in each category by an infected individual in a given category. $R_0$ is defined as the dominant eigenvalue of this matrix (18, 19). Here, we stratify infected individuals using information about their contact tracing status and whether they were being followed up at the time of symptom onset to assign infection pathways and construct our NGM. We identify three types of cases: (i) cases that are not detected (ND), (ii) cases that are detected but not under active surveillance (NAS), and (iii) cases that are detected and under active surveillance (AS).

Contact follow-up or surveillance takes different forms for different diseases; for Ebola, a contact under active surveillance would be undergoing in-person follow-up for 21 days after their last interaction with the case (20), whereas for COVID-19, a contact under active surveillance would have been notified by contact tracers, or through a mobile phone application, and asked to self-isolate for up to 10 days (21, 22). For contact tracing to be fully effective, the parent (or primary) case needs to be diagnosed and, if positive, all their contacts placed under active surveillance. The parent case therefore needs to know and remember everyone they have been in close contact with whilst they have been infectious, and these contacts then need to be reached. Even when a contact is recalled and reported, they may not be under active surveillance if they cannot be identified because of missing or incorrect contact details, or if they evade contact tracers.

We assume in our model that: (i) cases that are not detected and cases that are detected but not under active surveillance have the same effective reproduction number ($R$) and therefore, on average, infect the same number of secondary cases; and (ii) contacts under surveillance who become cases have a lower effective reproduction number (scaled by $\alpha$) because they are rapidly isolated after the onset of symptoms. We define $\phi$ as the proportion of contacts recalled, $\gamma$ as the proportion of contacts actively under surveillance, and $\pi$ as the proportion of cases detected or "re-captured" by community surveillance. We identify 12 pathways through which individuals can become infected by the three different types of cases (Figure 1).
These pathways are described as follows:

1. A case that was detected (with probability $\pi$), who was infected by a case that was not detected and was therefore not under active surveillance.
2. A case that was not detected (with probability $1-\pi$), who was infected by a case that was not detected and was therefore not under active surveillance.
3. A case that was detected, who was infected by a case that was detected but not under surveillance, was correctly recalled as a contact (with probability $\phi$) and was under active surveillance (with probability $\gamma$).
4. A case that was detected (with probability $\pi$), who was infected by a case that was detected but not under surveillance, was correctly recalled as a contact (with probability $\phi$) but was not under surveillance (with probability $1-\gamma$).
5. A case that was not detected (with probability $1-\pi$), who was infected by a case that was detected but not under surveillance, was correctly recalled (with probability $\phi$) but was not under surveillance (with probability $1-\gamma$).
6. A case that was detected (with probability $\pi$), who was infected by a case that was detected but not under surveillance, and was not recalled (with probability $1-\phi$).
7. A case that was not detected (with probability $1-\pi$), who was infected by a case that was detected but not under surveillance, and was not recalled (with probability $1-\phi$).
8. A case that was detected, who was infected by a case that was detected and under surveillance, was correctly recalled (with probability $\phi$) and was under surveillance (with probability $\gamma$).
9. A case that was detected (with probability $\pi$), who was infected by a case that was detected and under surveillance, was correctly recalled (with probability $\phi$) but was not under surveillance (with probability $1-\gamma$).
10. A case that was not detected (with probability $1-\pi$), who was infected by a case that was detected and under surveillance, was correctly recalled (with probability $\phi$) but was not under surveillance (with probability $1-\gamma$).
11. A case that was detected (with probability $\pi$), who was infected by a case that was detected and under surveillance, and was not recalled (with probability $1-\phi$).
12. A case that was not detected (with probability $1-\pi$), who was infected by a case that was detected and under surveillance, and was not recalled (with probability $1-\phi$).

Seven of our twelve pathways result in detected cases. The cases from pathways 3, 4, 8, and 9 are individuals on contact lists who are detected as cases, whereas the cases from pathways 1, 6, and 11 are de novo cases that are not on any contact tracing list but are detected via other routes, such as attending a health care unit. The cases from pathways 3 and 8 are contacts who were under surveillance at the time of symptom onset, while those from pathways 4 and 9 were not under surveillance at onset. Cases under active surveillance are assumed always to be detected. The cases resulting from pathways 2, 5, 7, 10 and 12 are not detected by the surveillance system. We use the notation $F_X$ to denote the probability of a case stemming from pathway X; for example, $F_1 = \pi$.
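The twelve pathway probabilities can be tabulated in a few lines. The Python sketch below is a reconstruction, not the authors' code. It assumes the symbols $\pi$ (detection by community surveillance), $\phi$ (contact recalled) and $\gamma$ (recalled contact under active surveillance), and it assumes, following the Figure 1 caption, that cases under active surveillance are always detected, so pathways 3 and 8 carry no factor of $\pi$. The function name `pathway_probabilities` and the example parameter values are illustrative only.

```python
def pathway_probabilities(pi, phi, gamma):
    """Return {pathway number: probability} for the 12 pathways.

    pi    -- probability a case is detected by community surveillance
    phi   -- probability a contact is recalled by a detected case
    gamma -- probability a recalled contact is under active surveillance
    Assumption: cases under active surveillance are always detected.
    """
    f = {
        1: pi,       # parent not detected -> offspring detected (de novo)
        2: 1 - pi,   # parent not detected -> offspring not detected
    }
    # Pathways 3-7 (NAS parent) and 8-12 (AS parent) share the same probabilities.
    from_detected_parent = [
        phi * gamma,                    # recalled, under surveillance (detected)
        pi * phi * (1 - gamma),         # recalled, not under surveillance, detected
        (1 - pi) * phi * (1 - gamma),   # recalled, not under surveillance, not detected
        pi * (1 - phi),                 # not recalled, detected (de novo)
        (1 - pi) * (1 - phi),           # not recalled, not detected
    ]
    for offset, prob in enumerate(from_detected_parent):
        f[3 + offset] = prob
        f[8 + offset] = prob
    return f


# Pathways that end in a detected case, as listed in the text.
DETECTED = {1, 3, 4, 6, 8, 9, 11}

F = pathway_probabilities(pi=0.83, phi=0.95, gamma=0.77)
for parent, ids in {"ND": (1, 2), "NAS": range(3, 8), "AS": range(8, 13)}.items():
    # The probabilities leaving each parent type should sum to one.
    print(parent, round(sum(F[i] for i in ids), 10))
```

Under these assumptions the probabilities leaving each parent type sum to one, which is a quick consistency check on the pathway list.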
If $X_k = [ND_k, NAS_k, AS_k]^T$ is a vector of the number of each type of case for generation $k$, the dynamics of the model are given by:

$$X_{k+1} = A X_k$$

where $A$ is our NGM, which represents the potential transitions from one generation of cases to the next. From the eigenvalues of this NGM, we can calculate the proportion of each of the three types of cases (ND, NAS and AS); see Supplementary Information (SI) A. In the limit as $k$ goes to infinity, an equilibrium is reached and the proportion of cases that are not detected, $P_{ND}$, is given by the corresponding entry of the normalised eigenvector associated with the largest eigenvalue of the NGM.

Linking our model to contact tracing and surveillance system data

Cases are often recorded in line-lists during disease outbreaks, where dates of testing, symptom onset and hospitalization are recorded alongside information about the age and sex of the patient. When case lists are linked to contact lists, we can derive two ratios with which we parameterize our NGM. We define $r_1$ as the ratio of cases who were contacts but not under surveillance versus the cases who were contacts and under surveillance, and $r_2$ as the ratio of de novo cases (cases that were not known contacts) versus detected cases that were contacts and under surveillance. Following the pathways in Figure 1, and because all cases under active surveillance are assumed to be detected, we expand $r_1$ as

$$r_1 = \frac{\pi \phi (1-\gamma)}{\phi \gamma} = \frac{\pi (1-\gamma)}{\gamma} \qquad (4)$$

We re-write this as

$$\gamma = \frac{\pi}{\pi + r_1} \qquad (5)$$

We also expand $r_2$ (the ratio of de novo cases versus detected cases that were contacts and under surveillance) along the same pathways (equations 6 and 7).
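The generation-to-generation dynamics and the approach to equilibrium can be illustrated numerically. The following Python sketch is a reconstruction under stated assumptions rather than the authors' implementation: the arrangement of the 3x3 matrix (rows are offspring type, columns are parent type, in the order ND, NAS, AS) is inferred from the pathway descriptions, cases under active surveillance are assumed always to be detected, and the value $R = 1.5$ is arbitrary. The function names are hypothetical.

```python
def next_generation_matrix(R, alpha, pi, phi, gamma):
    """3x3 NGM; rows = offspring type, columns = parent type (ND, NAS, AS)."""
    c = phi * gamma            # offspring recalled and under surveillance (AS)
    nas = pi * (1 - c)         # offspring detected but not under surveillance
    nd = (1 - pi) * (1 - c)    # offspring not detected
    return [
        [R * (1 - pi), R * nd,  alpha * R * nd],
        [R * pi,       R * nas, alpha * R * nas],
        [0.0,          R * c,   alpha * R * c],
    ]


def step(A, x):
    """One generation: matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def proportions(x):
    """Normalise case counts to proportions of each type."""
    total = sum(x)
    return [xi / total for xi in x]


A = next_generation_matrix(R=1.5, alpha=0.30, pi=0.83, phi=0.95, gamma=0.77)
x = [1.0, 0.0, 0.0]            # start from a single undetected case
history = []
for k in range(30):
    x = step(A, x)
    history.append(proportions(x))

p_nd, p_nas, p_as = history[-1]
print(f"Proportion not detected after 30 generations: {p_nd:.3f}")
```

With these illustrative values the proportions settle within a few generations, and the not-detected proportion is roughly 0.06. The scaling by $R$ changes the case counts but not the equilibrium proportions.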
Figure 1: Potential pathways for a three-state model of Ebola surveillance (ND, AS, NAS). $R$ is the effective reproduction number, $\alpha$ is the scaling of the reproduction number due to active surveillance (rapid isolation upon symptom onset), $\phi$ is the proportion of contacts recalled and reported by a case, $\gamma$ is the proportion of contacts actively under surveillance, and $\pi$ is the proportion of cases detected or "re-captured" by community surveillance. We assume that all cases under active surveillance are detected. The colouring and shape of the end points of the paths are as follows: red circle: any case that was not detected (so cannot be under active surveillance); purple circle: an eventually detected case that was not under active surveillance at the time of symptom onset (e.g. a contact of an earlier case lost to follow-up or who refused follow-up); purple square: a detected case that was under active surveillance at the time of symptom onset (e.g. a contact of a previously detected case, correctly recalled and reported, and under surveillance).

In addition to equations (5) and (7), we also have three more relationships that we can use

$\gamma$ is the proportion of contacts actively under surveillance; $\phi$ is the proportion of contacts recalled by a case and $\alpha$ is the scaling of the reproduction number due to active surveillance (rapid isolation upon symptom onset)). The green terminal nodes are the potentially observable data ($r_1$ is the ratio of cases who were contacts but not under surveillance versus the cases who were contacts and under surveillance; and $r_2$ is the ratio of de novo cases versus detected cases that were contacts and under surveillance). The white nodes are our calculated terms ($P_{ND}$ is the proportion of cases that are not detected; and a second term relates the proportion of not detected cases to the other two types of cases).
The arrows show the direction of the dependence.

until 14th December 2020 that had an epidemiological link to a previous case and 90 cases without an epidemiological link (26). We assume that 80% of contacts were under active surveillance, since this was determined as the minimum requirement for the NZ system (22). Therefore, we estimate 456 cases were under active surveillance and 114 cases were not. This makes $r_1 = 0.25$ and $r_2 = 0.20$.

whereas 167419 were (28). Since we know the total number of cases on the contact tracing list, 45, and assume $r_1 = 0.2$, we estimate the number of contacts under active surveillance to be 38 (denominator of $r_2$). The number of people not on the contact list for the two regions was 107 (numerator of $r_2$). Therefore, $r_2$ is equal to 2.85.

2) We assume $r_1$ is equal to 0.5 (twice as many contacts under active surveillance as not under active surveillance, i.e. two thirds of contacts are under active surveillance) to illustrate the impact of a slightly better surveillance system. Since we know the total number of cases on the contact tracing list, 45, and assume $r_1 = 0.5$, we estimate the number of contacts under active surveillance to be 30 (denominator of $r_2$). Therefore, $r_2$ is equal to 3.57.

We estimated the proportion of cases that are not detected using a quasi-Bayesian framework for both case studies. We sampled 100,000 values from $[0,1]^2$ uniformly for ($\pi$, $\phi$), which is comparable to assuming a uniform prior distribution, and computed the other parameters ($\gamma$, $\alpha$, $P_{ND}$) if a solution was viable. We note that there is no solution for some values of ($\pi$, $\phi$) (see SI B). Since we found our viable parameter space to be non-convex,
the mean parameter values calculated from our sampling may be a poor central estimate. Therefore, we define our central estimate as the solution for the point in our viable parameter space ($\pi$, $\phi$) that is furthest from the boundary of this region. We calculate this using the polylabelr R package (29). Our credible intervals reflect the values between which 95% of our viable samples lie. All code necessary to implement the analysis is included open source in the "MissingCases" R package on GitHub (30).

In Figure 3, we find that the region of feasible parameter space for SARS-CoV-2 was only 2.4% of the total space, which suggests high certainty in our parameter estimates. It is also located in the top right corner of the parameter space, where both the proportion of cases detected in the community ($\pi$) and the proportion of contacts recalled ($\phi$) are high. This suggests a well-functioning and rigorous contact tracing and surveillance system in NZ, which is consistent with our estimate that only 6.14% (95% CrI: 1.31 - 16.7%) of cases are not detected. All parameter estimates for this model are given in Table 1.

[Figure axes: proportion of cases detected in the community ($\pi$); proportion of contacts recalled ($\phi$).]

Table 2. The only parameter that differs significantly between our scenarios is the
proportion of contacts under active surveillance, which is directly impacted by the ratio of contacts not under active surveillance versus contacts under active surveillance.

Contact tracing is an important control mechanism for infectious disease outbreaks. However, its efficiency depends on detecting as many cases as possible. We show in this paper that NGMs can be easily used to estimate the proportion of cases that were not detected for two different disease outbreaks. Our method requires much less data (only 5 parameters) than other methods, such as capture re-capture (10), which is an alternative method suggested for estimating under-reporting and is highly data intensive. This means that it is feasible to repeat this analysis in near real time as the epidemic unfolds. During the West African Ebola epidemic, the WHO acknowledged that their reported case and death figures "vastly underestimate(d)" the true magnitude of the epidemic (31). We find that our estimates for the proportion of cases not detected in Guinea (64.5% (95% CrI:

A benefit of this method is that we do not just estimate the proportion of cases that were not detected but also other useful quantities that are important for managing a response, such as the proportion of cases that are detected in the community. Our central estimate for routine surveillance of 82.5% (95% CrI: 60.6 - 95.9%) suggests that NZ was very effective at detecting cases in the community during the SARS-CoV-2 outbreak.
In contrast, our estimates of 29.3% (95% CrI: 8.26 - 80.4%) and 31.7% (95% CrI: 14.1 - 81.0%) suggest that routine surveillance could have been an area for improvement during the West African Ebola outbreak in Guinea. We also find that a higher proportion of contacts were recalled, or remembered, in the NZ SARS-CoV-2 outbreak (95.1% (95% CrI: 87.9 - 99.8%)) compared to the Ebola epidemic in Guinea (45.8% (95% CrI: 31.4 - 95.3%) and 54.5% (95% CrI: 32.4 - 97.1%)). This could reflect more trust in the NZ health service. The wide credible intervals come from the uniform sample of ($\pi$, $\phi$). This is a limitation of the method but could be improved with more information about the numbers of contacts people had ($\phi$) and a better understanding of the performance of the routine surveillance ($\pi$) to narrow the region.

We believe this method highlights important lessons for responding to the ongoing SARS-CoV-2 pandemic and the unfortunate inevitability of future infectious disease outbreaks. By simply linking the case line-lists and contact tracing lists, we can use the very general method from our "MissingCases" package (30) to assess under-reporting throughout an epidemic. This would help outbreak responses, especially during the early and late phases, to target resources and quantify how effective their surveillance systems were. In addition, these estimates can be used to improve the accuracy of other models, such as those for the time-varying reproduction number, which are themselves key tools for the outbreak response.

isolation. Swiss Med Wkly [Internet]. 2020 Mar 19 [cited 2020 Dec 10];150 (11) (12).

where if we assume = , Assuming the NGM is diagonalisable, we can rewrite equation S2 as: where $\lambda_i$ denote the eigenvalues of the NGM, of which $\lambda_1$ is the biggest, and
where the vector of the three case-type proportions is the eigenvector associated with the biggest eigenvalue of the NGM. The eigenvalues can be found by setting the determinant of $A - \lambda I$ to zero, where $A$ is the NGM and $I$ is the 3x3 identity matrix. The largest eigenvalue of B is equal to and corresponds to the eigenvector

We see that, as predicted by the analytic derivation, despite the unpredictable course of the epidemics (Figure S1A), the proportion of missing cases quickly reaches an equilibrium (Figure S1B). This proportion at equilibrium can be calculated using the formula derived above. We note that the convergence is exponential, and thus fast, provided that $\lambda_1 \gg \lambda_2$ and $\lambda_1 \gg \lambda_3$.

We assume that $\pi = 0.83$ and $\phi = 0.95$, and get $r_1 = 0.25$ and $r_2 = 0.20$ from the NZ data. Using Equation 5, we estimate $\gamma = 0.77$. Solving the system gives the following solution: $\alpha = 0.30$ and a not-detected proportion of 0.06, meaning that in this configuration, 6% of the cases are not detected and missing from the records. A graphical representation is given in Figure S2, where the solution of the system can be seen at the intersection of the two curves. No solution is found if the two curves do not intersect.
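The worked example can be approximately reproduced with a short calculation. The Python sketch below is a reconstruction under stated assumptions, not the "MissingCases" implementation. It assumes that cases under active surveillance are always detected, so that $\gamma = \pi / (\pi + r_1)$, and that $r_2 = [\pi x_{ND} + \pi (1-\phi) S] / (\phi \gamma S)$ with $S = x_{NAS} + \alpha x_{AS}$, where $x$ is the dominant eigenvector of the NGM scaled so that $x_{NAS} = 1$. The function name `solve_system` and the viability checks are hypothetical.

```python
def solve_system(pi, phi, r1, r2):
    """Solve for gamma, alpha and the not-detected proportion.

    Returns a dict, or None when no viable solution exists.
    Assumption: cases under active surveillance are always detected.
    """
    if not (0.0 < pi < 1.0 and 0.0 < phi <= 1.0):
        return None
    gamma = pi / (pi + r1)              # rearranged expression for r1
    c = phi * gamma                     # contact recalled and under surveillance
    denom = r2 * c - pi * (1.0 - phi)
    if denom <= 0.0:
        return None                     # r2 cannot be matched for this (pi, phi)
    s = (1.0 - pi) / denom              # s = x_NAS + alpha * x_AS, with x_NAS = 1
    lam = (1.0 - pi) + pi * (1.0 - c) * s   # dominant eigenvalue divided by R
    x_as = c * s / lam                  # AS component of the eigenvector
    alpha = (s - 1.0) / x_as
    if not (0.0 <= alpha <= 1.0):
        return None                     # scaling must be a proportion
    x_nd = (1.0 - pi) / pi              # ND component of the eigenvector
    p_nd = x_nd / (x_nd + 1.0 + x_as)
    return {"gamma": gamma, "alpha": alpha, "p_nd": p_nd}


solution = solve_system(pi=0.83, phi=0.95, r1=0.25, r2=0.20)
print(solution)

# An example (pi, phi) pair for which the ratios cannot be matched.
no_solution = solve_system(pi=0.10, phi=0.10, r1=0.25, r2=0.20)
print(no_solution)
```

With the rounded inputs above, this sketch gives values close to those reported ($\gamma$ about 0.77, $\alpha$ a little below 0.30, and a not-detected proportion of about 0.06). The same function could be applied to each uniformly sampled ($\pi$, $\phi$) pair to identify the viable region.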
Role of contact tracing in containing the 2014 Ebola outbreak: a review
First secondary case of Ebola outside Africa: epidemiological characteristics and contact monitoring
Addressing needs of contacts of Ebola patients during an investigation of an Ebola cluster in the United States. Morbidity and mortality weekly report
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus
Modelling under-reporting in epidemics
Priorities for the US Health Community Responding to COVID-19
Unreported cases in the 2014-2016 Ebola epidemic: Spatiotemporal variation, and implications for estimating transmission
How many Ebola cases are there really? Sci Now
Using "outbreak science" to strengthen the use of models during epidemics
Mathematics of life and death: How disease models shape national shutdowns and other pandemic policies
Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe
Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
Modeling the transmission dynamics of Ebola virus disease in Liberia
The concept of R0 in epidemic theory
Estimating the Future Number of Cases in the Ebola Epidemic - Liberia and Sierra Leone. MMWR. Morbidity and mortality weekly report
On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations
The construction of next-generation matrices for compartmental epidemic models
EMERGENCY GUIDELINE Implementation and management of contact tracing for Ebola virus disease
If you're told to self-isolate by NHS Test and Trace - NHS
Rapid Audit of Contact Tracing for Covid-19 in New Zealand
New Zealand's elimination strategy for the COVID-19 pandemic and what is required to make it work
COVID-19 in New Zealand and the impact of the national response: a descriptive epidemiological study
Successful contact tracing systems for COVID-19 rely on effective quarantine and isolation
New Zealand Ministry of Health. COVID-19: Source of cases
Contact Tracing Activities during the Ebola Virus Disease Epidemic in Kindia and Faranah, Guinea
Contact tracing performance during the Ebola epidemic in Liberia
Find the pole of inaccessibility
No early end to the Ebola outbreak
Updating the Estimates of the Future Number of Cases in the Ebola Epidemic - Liberia
Use of Capture-Recapture to Estimate underreporting of Ebola Virus disease. Emerging Infectious Diseases

and the proportion of missing cases