key: cord-0644874-t4qx73xr authors: Herbrich, Ralf; Rastogi, Rajeev; Vollgraf, Roland title: CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data date: 2020-06-09 journal: nan DOI: nan sha: 92ea0ee554abc07965ed9fc114a441d9e8bae73c doc_id: 644874 cord_uid: t4qx73xr We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop a Monte Carlo EM algorithm to infer contact-channel specific infection transmission probabilities. Our algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor $R_0$ and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at https://github.com/zalandoresearch/CRISP. The COVID-19 pandemic has spread rapidly around the world, with the number of infections and deaths steadily growing. Most governments around the world have been completely unprepared to deal with the COVID-19 outbreak, which UN Secretary-General Antonio Guterres has referred to as humanity's worst crisis since World War II. While governments around the world had plans in place in the event of a pandemic, the peculiarities of COVID-19 (e.g., delayed onset of symptoms, asymptomatic transmission) have challenged these preparations. Governments have reacted by implementing measures such as nationwide lock-downs, that require people to stay inside their homes, enforcing social distancing and therefore breaking the COVID-19 infection chain. However, a blunt mechanism such as a lock-down (over an extended period) can cause severe damage to the economy, and so, there is a need to find alternative measures to slow down or stop the spread without incremental effects in other areas of society. These alternatives have to be built in a solid foundation such as widespread testing and the isolation of infected (or potentially infected) individuals via contact-tracing. Contact tracing technologies [20, 19] have shown promise in tracking the spread of the disease across the population. These mobile apps capture social contact information between users such as contact duration, distance, etc. using Bluetooth signals on devices. The fine-grained contact data of individuals collected by the apps can enable: • Individual risk score prediction. The contact data, combined with information about COVID-19 positive test cases, can be used to predict the likelihood of infection for each individual. The individual risk scores can be leveraged by governments and organizations to prioritize testing as well as to identify individuals that need to enter isolation/quarantine. • Hotspot detection. Tracing technologies can help authorities identify areas with a high density of contacts and/or individuals with high infection risk. This can allow policymakers to make more effective decisions, for example, by imposing highly restrictive measures such as lock-downs, shelter-at-home, or school closures only in COVID-19 hotspots while allowing activities to remain closer to normal in unaffected areas. • Insights about infection spread. Contact tracing can provide insights into the relative importance of different modalities of disease transmission (e.g., through intermediate surfaces vs individual contact), risk of infection transmission based on contact characteristics such as duration and distance, most likely locations (e.g., schools, work, malls) for the spread of disease, and "super spreaders" who come in close proximity with a large number of individuals and so must be frequently tested for infection. To achieve the above-mentioned benefits, we need to devise new models and inference algorithms for analyzing contact tracing data. This is because existing epidemics models [3, 15, 14, 6, 5 ] focus on estimating population-level statistics such as percentage of the population infected, number of days for the epidemic to peak, etc. as opposed to the infection state of each individual in the population. Other models [16, 17] that use ML-based inference techniques assume complete knowledge of the infection state of each individual at each time instant. However, in the COVID-19 scenario, (1) the infection status of individuals is not known until they are tested, and (2) infectious time of individuals are unknown since individuals may infect others while asymptomatic. Finally, governments are using contact tracing data [20, 19] to identify and test individuals who have come in direct contact with COVID-19 positive test cases. However, the fact that asymptomatic individuals may have infected a large number of people prior to displaying symptoms and being tested, delays the detection of these newly infected individuals by only using contact tracing apps. Our main contributions can be summarized as follows: • We propose CRISP (COVID-19 RIsk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through diverse contacts channels between individuals. Our model uses latent variables to represent the epidemiological states of individuals based on the SEIR model [12] at different points in time, and captures both the transitions between states as well as test outcomes. We classify related work into four broad categories: epidemic models, Machine Learning (ML) based inference of model parameters, influence maximization in social networks, and contact tracing apps. In recent years, there has been research on modeling individual dynamics of epidemics [3, 15, 14] . However, this work typically resorts to mean-field theory to model virus spread over a network, and thus does not characterize the dynamic infectious state of each individual over time. Ferguson et al. [6, 5] use a compartmental transmission model to simulate the spread of influenza across a population, and analyze interventions such as antiviral prophylaxis and social distancing to halt a pandemic. The authors use a stochastic model of individuals co-located in households that are randomly distributed across a geographical region, and infection risk from 3 sources -household, place and random contacts in the community. The infection transmission rates for the 3 sources and recovery rates are based on analysis of historical data. In contrast, we leverage real individual contact tracing data and outcomes of tests on individuals to infer the infection transmission rate for each contact and the likelihood of infection for each individual. Lorch et al. [10] propose a spatiotemporal epidemic model that uses marked temporal processes to represent the epidemiological condition of each individual (based on a variation of the SEIR compartment models), individual mobility patterns, test outcomes, and testing and contact tracing strategies. The authors design an efficient sampling algorithm for the model using Monte Carlo roll-outs that is able to predict the spread of COVID-19 under different testing & tracing strategies, social distancing measures, and business restrictions, given contact histories of individuals. They use Bayesian optimization techniques to infer model parameters (e.g. infection transmission rate) that minimize the difference between the real positive COVID-19 cases and those in the Monte-Carlo simulations. In addition, they demonstrate the efficacy of their model using real COVID-19 data and mobility patterns of Tübingen, Germany. Our Monte Carlo EM inference algorithm for model parameters is computationally much more efficient than the Bayesian optimization techniques employed in [10] . In [16] , the authors consider the problem of inferring latent social networks based on network diffusion or disease propagation data. Given the times when nodes become infected, but not who infected them, the authors identify the optimal network that best explains the observed data. The authors present a maximum likelihood approach based on convex optimization with a L 1 -like penalty term (that encourages sparsity) to estimate the conditional probability of infection transmission between every node pair. A key difference from our work is that [16] assumes complete knowledge of infected nodes and infection times. In contrast, in the COVID-19 scenario, (1) the infection status of nodes is not known until they are tested, and (2) infection times of nodes are unknown since nodes may not show symptoms even though they are infected (and infecting others). Warriyar et al. [17] introduce a novel R statistical software package EpiILM for simulating infectious disease spread, and carrying out Bayesian MCMC-based statistical inference for spatial and/or (contact) network-based models in the Deardon et al. [4] individual-level modelling framework. In individual-level models (ILMs), the epidemiological state of each individual (e.g., susceptible or infected) is assumed to be perfectly known at each time instant, which makes it relatively straightforward to estimate model parameters such as infection transmission probabilities (as a function of covariates) using maximum likelihood estimation or Bayesian inference using Metropolis-Hastings MCMC. However, in the COVID-19 scenario, epidemiological states of individuals are hidden until they are tested, and this complicates Bayesian inference in our probabilistic model setting. Note that this model has no cycles as we assume the infection status z u,t only depends on variables z v,t before time step t, t < t. However, due to the "memory" that the state z u,t = E and z u,t = I have, we require edges into the entire past of an infection trace. The Influence Maximization problem aims to select k users in a social network that maximize influence spread, and was first modeled as an algorithmic problem by Kempe et al. [8] . [9] presents a comprehensive survey of different diffusion models that capture the information diffusion process and approximation algorithms to maximize influence. The papers assume that diffusion model parameters such as influence probabilities are given and focus on selecting k users to maximize influence spread. In contrast, our paper focuses on the problem of estimating model parameters related to infection transmission probabilities for each contact, given social contact information between users and COVID-19 test results for users. [11] addresses the problem of finding the "backbone" of an influence network. It employs network sparsification to preserve only the links that play an important role in the propagation of information. [7] considers the problem of estimating influence probabilities between users in a social graph. Given a social graph and a log of actions by users, the Maximum Likelihood Estimator (MLE) of influence probability of node u on node v is simply the fraction of actions performed by u that are also performed by v. Unlike [7] , in our setting, the infection status and times of nodes are latent, and need to be inferred by our algorithms. To combat the spread of COVID-19, governments have launched contact tracing apps [20, 19] that use Bluetooth signals on mobile phones to track contacts between users. Users who have come in direct contact with COVID-19 positive test cases are considered to be at high risk of infection, and subject to tests and quarantine actions. However, a key problem with this approach is that COVID-19 infected users are typically tested only after they show symptoms, and typically, infected users show symptoms 5-6 days post infection. These asymptomatic users may have infected a large number of users over multiple hops prior to displaying symptoms and being tested. This delays detection of infected users using contact tracing apps, and limits their effectiveness to proactively test and isolate infected users to contain the spread of COVID-19. In contrast, our probabilistic modeling algorithm CRISP predicts the likelihood of a user getting infected with COVID-19 through a chain of social contacts involving asymptomatic users, and identifies infected users early, even though they may be multiple hops from a user who has tested positive for COVID-19 and even before they begin to show symptoms. Our inference algorithm also learns infection transmission probabilities for each contact channel. Our CRISP model is an SEIR model at the level of every individual (see [12] for an introduction). Note that we consider discrete time steps t implicitly assumed to be at the level of single days. We assume that we are given the following two datasets for a given set S of individuals: Here we assume that the feature vector x i describes the overall contact between u i and v i via the number x i,j of mutual contacts over channel j (e.g., Bluetooth encounters, queuing together, sharing public transportation). We assume D contact to be symmetric so that We model the T discrete time steps of infection status for each individual. Our model has |S| × T many latent variables Z := {z u,t } u∈S,t=1,...,T ∈ {S, E, I, R} |S|×T that represent the four stages of infection 2 : • z u,t = S: individual u has not been infected and is susceptible, • z u,t = E: individual u is infected but not contagious, • z u,t = I: individual u is infected and is contagious, • z u,t = R: individual u has recovered and is not contagious. Let us use the notation Z u,t := {z u,1 , . . . , z u,t } to denote the set of latent states z u,t of individual u up to and including time t and Z t := u Z u,t . In addition to the latent infection status of all individuals at each time step, we also model Then, our graphical model G := (V, E) between the variables V := Z O has the following edges: 1. E time = u E u and E u := {(z u,t , z u,t ) t 60. • Suppression After 60 Days. For t ≤ 60, we generate C(2.5, p 1 ) connections for every individual at every time step. Afterwards, we assume that lock-down measures are taken to suppress the pandemic which reduce the reproduction rate to 0.5. • Suppression After 60 Days and Release of Lock-down after 120 Days. This scenario is similar to the previous scenario but we assume that due to very low infection numbers, the lock-down is released after 60 days. Thus, we generate C(2.5, p 1 ) random connections for every individual for t > 120. In Figure 2 , we show the plot of u P(z u,t = z) over t = 1, . . . , 274 days for z ∈ {E, I, R} (orange = E, red = I, blue = R) from 100 forward samples of the CRISP model for these scenarios. As one can see, with no mitigation there is a high peak around day t * = 180 and eventually herd-immunity is achieved at 85% of infected population. Even though the number of unique contacts in each time step is the same, "social bubbles" flatten the curve, thus slowing down the infection but growth rates of infected people are still super linear until large parts of the population had been in contact with the disease. Note that a similar mitigation policy is currently used in Belgium. In case of mitigation to R 0 = 1.0, growth rates are pushed to sub linear but the pandemic is still continuously going on after 9 months. Not surprisingly, suppression is most effective at bringing the infections back to nearly 0% after 120 days. However, if the lock-down is lifted after 120 days, a second wave of infections will cause an exponential increase in infectiousness after only two weeks (dashed lines). Note that all these effects were computable by simply forward sampling our individual-level CRISP model. In order to assess the test and quarantine efficacy of the CRISP model, we consider a population of |S| = 1, 000 individuals for 150 days (5 months) with a uniformly random contact pattern of C(2.5, 0.025) = 5.03 contacts on average per individual and day. We simulate the actual infection spread by applying the following sequence in each time step (i.e., day): At the beginning of each time (bottom-right) Lock-down at day 60 and reduction of R 0 to 0.5 (solid lines). In dashed lines we show the effect of a subsequent re-opening of a subsequent contact rate increase to R 0 of 2.5 starting at day 120. step, we query the testing-and-quarantining policy for a list of individuals which need to be tested and need to be in quarantine during this step (this will only be done after t * = 30 to simulate an undetected initial outbreak). Each policy is constrained to select no more than 10 test candidates per day (1% of the total population). Given the quarantined individuals on that day, we remove contacts from and to the quarantined individuals for that day and then use the CRISP forward model (4) and CRISP test outcome model (9) to draw one sample of the next simulated infection state of every individual as well as the actual test outcomes of the requested test candidates. If the infection state of an individual changes from E to I in this sampling step, we assume that with 50% probability, the individual generates symptoms. Finally, at the end of the time step, the testing-and-quarantining strategy is revealed the test outcomes as well as the list of symptomatic individuals (again, provided t ≥ t * ). We single out an individual u for whom we set p 0 = 1 so that she will get infected with probability 100% at t = 1 ("patient 0"); for all other people we assume a p 0 = 10 −4 to model a small chance of infection spread from exogenous sources. 1. Symptom-Based Policy. For every time step t ≥ t * , we will request testing for up to 10 symptomatic individuals from the previous time step. For all individuals with a positive test outcome on the previous day, we will institute a quarantine for ρ time steps where ρ ranges from 2 to 21 days in our evaluation. 2. Contact-Tracing Policy. For every time step t ≥ t * , we will request testing for up to 10 symptomatic individuals from the previous time step. If there are less than 10 symptomatic individuals, then we will request the remaining tests for individuals in quarantine sorted in descending order of the number of contacts they have had in the past 7 days with people who have tested positive. For every individual with a positive test outcome, we will not only quarantine her but also all the contacts she had in the past 7 days for ρ time steps where ρ ranges from 2 to 21 days in our evaluation; for every individual with a negative test outcome, we will remove her from quarantine. 3. CRISP Model-Based Policy. For every time step t ≥ t * , we will use block-Gibbs sampling of 100 infection traces z u to estimate P(z u,t ) for every individual u at the current time step t based on the contacts and test outcomes prior to time step t. We will request testing for up to 10 symptomatic individuals from the previous time step. If there are less than 10 symptomatic individuals, then we will request the remaining tests for individuals (who have not tested positive before) in descending order ofP(z u,t = I). We will quarantine any individual who is not yet quarantined but whose estimated probabilityP(z u,t ∈ {E, I}) exceeds a given policy threshold τ EI ; we will release an individual from quarantine once their estimated probabilityP(z u,t ∈ {S, R}) exceeds a given policy threshold τ SR . Note that we increase p 0 in the block-Gibbs sampling by a factor of 10 to account for "patient 0". In order to gauge the efficacy of each policy, we measure two quantities at the end of the simulation (t = 150): (1) Percentage of population that got infected during the 150 days, and (2) total number of days that individuals were quarantined (e.g., if a policy locks down for the entire 150 days, this would result in 150,000 quarantine days). Varying the policy parameters ρ, τ EI and τ SR results in curves on the two dimensions of infection percentage and quarantine days. The closer a curve is to the origin, the more effective is the policy in terms of "health" (infection) and "economic" (quarantining) cost. In Figure 4 , we plot curves for the three policies with ρ ∈ {2, 7, 14, 21}, τ EI ∈ {0.2, 0.3, 0.4, 0.5}, τ SR = 0.9. For comparison, we also show the two extreme points corresponding to "no mitigation" (i.e., zero quarantine days but the largest infection percentage of 90%) and "full lock-down" (i.e., largest quarantine days of 120,000 and near-zero infection percentage). All three curves exhibit a negative slope where a higher percentage of quarantine days corresponds to a more effective mitigation of infection spread. Of the three policies, our CRISP-based policy achieves the best performance in terms of the smallest number of quarantine days for a given infection percentage (i.e., Pareto frontier). This is because our CRISP model is able to accurately identify infectious users (even though they may be asymptomatic) and test/quarantine them proactively-this helps to prevent infection from spreading across the population while at the same time quarantining fewer individuals with a high likelihood of getting infected. In contrast, the symptom-based policy only tests individuals with symptoms and then quarantines the individuals who have tested positive. As a result, since 50% of the infected individuals are asymptomatic, they never get tested and quarantined, thus resulting in a spread of infection to 60% of the population. Similarly, the contact-tracing policy, by isolating all contacts of positive tested individuals (many of whom may have low likelihoods of getting infected), is able to achieve the absolute smallest infection percentage but at the cost of massive quarantining (30% of the population). Figure 5 shows a visualization of these effects in one of the simulation runs for ρ = 14, τ EI = 0.3, and τ SR = 0.9. In this paper, we proposed a probabilistic graphical model for COVID-19 infection spread through individual contacts that captures the epidemiological state of each individual based on the SEIR model. We developed a computationally efficient block-Gibbs sampling-based algorithm to infer In the bottom plots, we show a stacked bar chart of quarantined individuals per day grouped by actual infection status. While the number of quarantined individuals for the symptom-based policy is small, the infection spread is not contained and the quarantining keeps growing exponentially. In contrast, the contact-tracing policy effectively suppresses infection spread while regularly quarantining more than 25% of the population. The CRISP model-based policy is initially picking a large number of individuals for quarantining but is then able to keep it at a low-level, in particular of susceptible individuals. the COVID-19 infection risk score of all individuals at any time, given test outcome and mutual contact information between individuals. An efficient C++-based Python implementation of our inference algorithm is available at https://github.com/zalandoresearch/CRISP. Through experiments with simulated data, we showed that the CRISP model is able to model macro-level characteristics of the COVID-19 infection at county level (≈ 10, 000 individuals) and effectively mitigate COVID-19 spread by pro-actively quarantining and testing individuals with high risk of infections. As part of future work, we would like to further accelerate our inference procedure using other approximation techniques such as Variational Bayes and Expectation Propagation [2] . Our inference algorithm can also be speeded up by exploiting the parallelism inherent in our block-Gibbs Sampling algorithm. For example, it is possible to concurrently sample infection traces of two individuals with no contacts in common. It is also known that the hyper-parameters of the SEIR model vary with demographic attributes such as age, socio-economic status, or location (see, for example [10] who present a location-varying infection spread model). We would like to extend our model with group-level hyper-parameters to account for this variation. We would also like to explore the causal impact of mitigation or suppression policy measures (e.g., school closures, shop closures, small group gatherings) on COVID-19 infection spread when using contact-level data. Finally, we would like to consider more sophisticated models of COVID-19 transmission through different modalities, and contacts with varying duration and distance characteristics. Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China Pattern Recognition and Machine Learning Epidemic thresholds in real networks Inference for individual-level models of infectious diseases in large populations Strategies for mitigating an influence pandemic Strategies for containing an emerging influenza pandemic in southeast asia Learning influence probabilities in social networks Maximizing the spread of influence through a social network Influence maximization on social graphs A spatiotemporal epidemic model to quantify the effects of contact tracing, testing, and containment Aristides Gionis, and Antti Ukkonen. Sparsification of influence networks Infectious diseases of humans: dynamics and control Communication-efficient learning of deep networks from decentralized data -homogeneous virus spread in networks Virus spread in networks On the convexity of latent social network inference Individual-level modeling of infectious disease data: Epiilm Clinical presentation and virological assessment of hospitalized cases of coronavirus disease 2019 in a travel-associated transmission cluster Acknowledgments We would like to thank Sebastian Munoz, Christopher Gandrud, Jasvinder Kandola and Peter Herbrich for all their valuable input and feedback on earlier drafts of this paper. We are also indebted to Christoph Thöns for his support with Python and C++ programming.