key: cord-0657996-wic9ezs5 authors: Briers, Mark; Charalambides, Marcos; Holmes, Chris title: Risk scoring calculation for the current NHSx contact tracing app date: 2020-05-22 journal: nan DOI: nan sha: e485a9a9e43741b22dd0a74b376c88c3d55e32f4 doc_id: 657996 cord_uid: wic9ezs5

We consider how the NHS COVID-19 application will initially calculate a risk score for an individual based on their recent contact with people who report that they have coronavirus symptoms.

The NHS COVID-19 app uses Bluetooth to estimate the distance over time between people who have downloaded and are running the app. If a person reports coronavirus symptoms, their recent history of interactions is uploaded to a database and the risk scoring algorithm is used to update the risk score for every app user they have come into contact with. The NHSx technical report [4] gives a high-level overview of the risk score calculation suitable for a wide audience. The report [2] presents the algorithmic elements of the risk score calculation and notification process. In this note, we describe the technical aspects of the risk scoring algorithm and consider its statistical basis.

Assume that individual $i$ notifies the app that they are symptomatic. The time of the reported onset of their symptoms is denoted as $t^s_i$, and this is distinct from the time of notification, denoted here by $t^r_i$. The superscripts $s$ and $r$ refer to the onset of symptoms and reporting time respectively. Currently, $t^s_i$ is always marked to noon of the day of symptom onset. All times are measured in minutes, unless otherwise specified. Assume that the app for individual $i$ has stored $N_i$ contact events (somewhat ambiguously referred to as "pings" in [2]).
We will denote the $m$-th contact event as $E^i_m$, and the absolute time of the start of the contact event will be denoted by $t_{i,m}$. We will refer to $i$ as the source and the individuals associated with recorded contact events as recipients.

Each contact event $E^i_m$ has an associated risk of transmission which can be written as a product as follows:

$$r(E^i_m) = \alpha^t_i \, c_{i,m} \, D_{i,m} \, I_{i,m} \, \delta t_{i,m}, \tag{1}$$

where:
• $\alpha^t_i$ is a weighting associated with attributes of the source individual $i$ at time $t$ (such as severity of symptoms, age, etc.); currently $\alpha^t_i \equiv 1$;
• $c_{i,m}$ is a risk context adjusting factor, for example, taking into account factors such as whether the contact is made indoors;
• $D_{i,m}$ is a distance-related risk factor;
• $I_{i,m}$ is an infectiousness risk factor;
• $\delta t_{i,m}$ is the duration of the contact event.

The distance-related factor is given by

$$D_{i,m} = \min\left(1, \frac{d_{\min}^2}{d_{i,m}^2}\right), \tag{2}$$

where $d_{i,m}$ denotes the distance between source and recipient for the contact event and $d_{\min}$ is a parameter controlling the point where the distance-related factor is maximised; currently $d_{\min} = 1$. The distance $d_{i,m}$ is a function of the Bluetooth signal, which may be estimated using the RSSI (Received Signal Strength Indicator) value; see, for example, Equation 1 in [5].

The infectiousness factor is given by

$$I_{i,m} = \exp\left(-\frac{\left([t_{i,m} - t^s_i]_{\text{days}} - \mu_0\right)^2}{2\sigma_0^2}\right), \tag{3}$$

where $[t_{i,m} - t^s_i]_{\text{days}}$ denotes the time difference in days between the start of the contact event and midday of the day of symptom onset of source $i$, and the parameters $\mu_0$ and $\sigma_0$ control the shape of the Gaussian; currently $\mu_0 = -0.3$, $\sigma_0 = 2.75$ (see Section 2.1 for a discussion of these parameter values). See Appendix A for a visualisation of the risk score $r(E^i_m)$.

The total risk of transmission, $r_{i,j}$, from source $i$ to a recipient $j$ is then obtained by aggregating the risk of transmission of all relevant contact events:

$$r_{i,j} = \sum_{m=1}^{N_i} r(E^i_m) \, \mathbb{1}\{\mathrm{id}(E^i_m) = j\} \, \mathbb{1}\{t^s_i - t_{i,m} \le \Delta t_{\max}\}, \tag{4}$$

where:
• $\mathrm{id}(E^i_m)$ is the recipient identifier of the $m$-th contact event for source $i$;
• $\Delta t_{\max}$ corresponds to the maximum amount of time a contact event is stored before the onset of symptoms of the source; currently this is 7 days, i.e.
$\Delta t_{\max} = 10080$ minutes.

Note that, once a source $i$ notifies the app, they do not continue to upload contact events. This implicitly assumes that the source $i$ self-isolates after the time of notification $t^r_i$, so that $i$ is not involved in any contact events after time $t^r_i$.

Finally, the risk score, $r_j$, associated with individual $j$ is obtained by aggregating the risk of transmission from all source individuals:

$$r_j = \sum_i r_{i,j}. \tag{5}$$

The generation period between a source infector and a recipient is defined as the interval between the source becoming infected and the recipient becoming infected. The incubation period is defined as the interval between the time an individual becomes infected and the time the individual shows symptoms.

The infectiousness factor adjusts the risk score so that contact events on the day the source develops symptoms have higher weight. It accounts for the non-uniformity in the distribution of the time between a source infector showing symptoms and a contact event which causes a recipient to become infected. The authors of [1] model the incubation and generation periods of the virus, and we can use these models to estimate this distribution. The distribution ends up being numerically close to the normal distribution $\mathcal{N}(-0.3, 2.75^2)$. We perform this analysis below. The Gaussian infectiousness factor in Equation 3 is equal to the density of this normal distribution, scaled so that the maximum value is 1.

In [1], the incubation period is modelled by a log-normal distribution (following [3]) and the generation period is modelled by a Weibull distribution. When the source $i$ uploads their data, we do not observe their time of infection but only the onset time of their symptoms, $t^s_i$. In practice, this time has uncertainty associated with it, but it is assumed to be negligible for the calculation of the risk score.
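As an illustration of the definitions above, the following is a minimal Python sketch of the per-event risk score and its aggregation per recipient. It is not the app's implementation. It uses the parameter values quoted in the text and assumes that the distance factor takes the form $\min(1, d_{\min}^2/d^2)$ and that the infectiousness factor is the Gaussian kernel scaled to a maximum of 1. The function names, the event record layout and the example values are hypothetical.

```python
import math

# Parameter values quoted in the text.
D_MIN = 1.0          # distance at which the distance factor reaches its maximum
MU_0 = -0.3          # centre of the Gaussian infectiousness factor (days)
SIGMA_0 = 2.75       # width of the Gaussian infectiousness factor (days)
DT_MAX_DAYS = 7.0    # storage window before symptom onset (10080 minutes)


def event_risk(distance, days_from_onset, duration_min, alpha=1.0, context=1.0):
    """Risk of one contact event: alpha * c * D * I * duration (minutes)."""
    # Assumed form of the distance factor: min(1, d_min^2 / d^2).
    d_factor = 1.0 if distance <= D_MIN else (D_MIN / distance) ** 2
    # Gaussian kernel in the contact time relative to symptom onset, maximum 1.
    i_factor = math.exp(-((days_from_onset - MU_0) ** 2) / (2 * SIGMA_0 ** 2))
    return alpha * context * d_factor * i_factor * duration_min


def recipient_scores(events):
    """Sum event risks per recipient over all sources (hypothetical record layout)."""
    scores = {}
    for e in events:
        # Skip events that started more than DT_MAX_DAYS before symptom onset.
        if -e["days_from_onset"] > DT_MAX_DAYS:
            continue
        r = event_risk(e["distance"], e["days_from_onset"], e["duration_min"])
        scores[e["recipient"]] = scores.get(e["recipient"], 0.0) + r
    return scores


# Hypothetical events: the second one falls outside the storage window.
events = [
    {"source": "i1", "recipient": "j", "distance": 2.0,
     "days_from_onset": 3.0, "duration_min": 15.0},
    {"source": "i2", "recipient": "j", "distance": 1.0,
     "days_from_onset": -9.0, "duration_min": 30.0},
]
print(recipient_scores(events))  # score for "j" is about 1.83
```

In this sketch the contact time is passed directly as a signed number of days relative to symptom onset, so the marking of onset to midday is left to the caller.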
The infectiousness factor corresponds to modelling the time between the source's onset of symptoms and the potential contact time when the recipient becomes infected. This is precisely the distribution of the difference between the generation period and the incubation period. In Figure 1, we use the parameter estimates from [1] for the generation and incubation periods and generate samples for the difference. This is compared to the Gaussian $\mathcal{N}(\cdot; \mu_0, \sigma_0^2)$. The sample distribution is not symmetric and has a heavier left tail than the Gaussian. However, the Gaussian approximation is numerically close.

An individual $j$ is notified if their risk score is greater than or equal to a minimal risk score, $r_j \ge r_{\min}$. A notified individual is advised to follow additional restrictions for a fixed period of 14 days. Currently, $r_{\min} = 1.83$, which is chosen to correspond to the PHE guidelines: a contact event of 15 minutes at 2 metres, marked to 3 days from when the source develops symptoms.

If a source individual $i$ returns a negative test result, then their proximity risk at times before the test can be assumed to be equal to 0 (assuming the tests have a low false negative rate). A simple approach is therefore to set the source-to-recipient risk scores $r_{i,j}$ to 0. For each previously notified contact $j$, the de-cascading process should compute the revised risk score $r_j$ and, if it falls below the threshold, notify them that they no longer need to follow the additional restrictions. That is, one needs to ensure that the de-cascading process does not release recipients whose risk score remains at or above $r_{\min}$ after removing the risk component $r_{i,j}$.

While the risk score definition allows for a scalar valuation of infection risk to be computed, the fact that there is no underlying probabilistic model presents several challenges.
The meaning of the risk score values themselves lacks clarity, and it becomes difficult to compare events directly. Moreover, there are a number of parameters in the model and there is no natural loss function which can be utilised in updating these parameters as data is received. Another, more practical, problem is that once a recipient is notified, they follow advice for 14 days regardless of whether the actual probability of them having been infected has significantly decreased during that period of time.

In this section, we give one possible interpretation which ties the risk score to the probability of infection. This allows us to present an approach which can address the aforementioned challenges.

For an individual $j$, let $(E_n)_{n=1}^{N_j}$ be the sequence of contact events from symptomatic source individuals which have $j$ as recipient and are within the $\Delta t_{\max}$ (i.e. 7 day) cutoff. Write $I_E$ for the event that $j$ gets infected as a result of contact event $E$. Write

$$I_j = \bigcup_{n=1}^{N_j} I_{E_n} \tag{6}$$

for the event that the recipient $j$ gets infected as a result of any of the contact events $(E_n)_n$.

We seek to relate the risk score $r_j$ to the probability that $j$ gets infected, $P(I_j)$, so that a higher risk score directly corresponds to a higher infection probability. The formulation should respect the fact that $r_j$ is the sum of the individual risk scores $(r(E_n))_n$ associated with the contact events.

We assume that the infection events $(I_{E_n})_n$ are independent and interpret the risk score $r(E)$ of the contact event $E$ as proportional to the negative of the logarithm of the probability that $j$ does not get infected as a result of $E$. More precisely, let $\nu \in (0, 1)$ be a parameter. Define $\rho(E)$ by

$$P(\bar{I}_E) = \nu^{\rho(E)}. \tag{7}$$

Here, $\bar{I}_E$ denotes the complement of the event $I_E$. Define $\rho_j$ by

$$\rho_j = \sum_{n=1}^{N_j} \rho(E_n). \tag{8}$$

Then,

$$P(I_j) = 1 - P\Big(\bigcap_{n=1}^{N_j} \bar{I}_{E_n}\Big) = 1 - \prod_{n=1}^{N_j} \nu^{\rho(E_n)} = 1 - \nu^{\rho_j}. \tag{9}$$

We can view the risk score $r(E)$ (from equation 1) as an estimator for $\rho(E)$ in equation 7. The risk score $r_j$ (from equation 5) then becomes an estimator for $\rho_j$ and has a clear probabilistic interpretation.
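The mapping from an additive risk score to an infection probability can be illustrated with a short Python sketch. It treats the per-event risk scores as estimates of $\rho(E_n)$ and checks numerically that summing the scores is equivalent to multiplying the per-event probabilities of not being infected, as the independence assumption implies. The value of $\nu$ and the scores below are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
import math


def infection_probability(risk_scores, nu):
    """P(I_j) = 1 - nu ** rho_j, with rho_j estimated by the summed risk scores."""
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie strictly between 0 and 1")
    return 1.0 - nu ** sum(risk_scores)


nu = 0.9                    # hypothetical value of the parameter nu
scores = [0.5, 1.2, 0.8]    # hypothetical per-event risk scores r(E_n)

# Probability from the summed score.
p_from_sum = infection_probability(scores, nu)

# Same probability from the product of per-event "not infected" probabilities.
p_from_product = 1.0 - math.prod(nu ** r for r in scores)

print(math.isclose(p_from_sum, p_from_product))  # True
```

Because $\nu < 1$, the probability increases monotonically with the summed score, which is the property the interpretation is designed to preserve.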
With this formulation, the notification process can be expressed in probabilistic terms. Decide on a probability threshold $p_{\min}$. An individual $j$ is notified if the probability that they have been infected given their contact events is equal to or above this threshold. That is, if

$$P(I_j) = 1 - \nu^{\rho_j} \ge p_{\min}. \tag{10}$$

In particular, the probability of a false positive is bounded above by $1 - p_{\min}$.

Consider, as above, the sequence of contact events $(E_n)_{n=1}^{N_j}$ for a recipient $j$. We will denote the time of the contact event $E$ by $t_E$. We will also denote by $S(t)$ the event that $j$ develops symptoms by time $t$ due to the contact events, and by $\bar{S}(t)$ its complement.

Recall that the generation period is the time between the source becoming infected and a recipient becoming infected, and the incubation period is the time between an individual becoming infected and developing symptoms. If the individual becomes infected as a result of contact event $E$, they will develop symptoms precisely after the generation period and the incubation period of the virus elapse from the time, $t_E$, that the contact event $E$ occurs. This means that the probability the recipient shows symptoms by time $t$ is equal to the probability that the sum of the generation period and the incubation period of the virus is at most the time elapsed since the contact event occurred, $t - t_E$. Therefore, we can write

$$P(S(t) \mid I_E) = G(t - t_E), \tag{11}$$

where $G$ is the cumulative distribution function of the distribution of the sum of the generation period and the incubation period.

Currently, a notified individual must follow additional advice for 14 days. However, we may amend this algorithm so that a notified individual is subsequently informed that they no longer need to follow the additional restrictions when the probability $P(I_j \mid \bar{S}(t))$ that they are infected, given that they have not experienced symptoms by time $t$, drops below a certain threshold. This probability can be expressed in terms of $\rho$, if we make the following assumption.
Then, the probability that individual $j$ is infected without showing symptoms by time $t$ may be expressed as

$$P(I_j \mid \bar{S}(t)) = \frac{1 - \prod_n \Big(1 - \big(1 - G(t - t_{E_n})\big)\big(1 - \nu^{\rho(E_n)}\big)\Big)}{1 - \sum_n G(t - t_{E_n})\big(1 - \nu^{\rho(E_n)}\big)}. \tag{12}$$

See Appendix B for a derivation.

The incubation period model in [1] (and [3]) is log-normal. As such, it assumes that there is 0 probability of being asymptomatic. With the models in [1] for incubation and generation periods, as $t \to \infty$ the cumulative distribution function $G(t) \to 1$. Therefore, the probability of being infected but never showing symptoms is zero, $P(I_j \mid \bar{S}(t)) \to 0$.

In the case when there is just a single contact event $E$, the expression for the probability reduces to

$$P(I_j \mid \bar{S}(t)) = \frac{\big(1 - G(t - t_E)\big)\big(1 - \nu^{\rho(E)}\big)}{1 - G(t - t_E)\big(1 - \nu^{\rho(E)}\big)}. \tag{13}$$

The probabilistic interpretation gives an approach to updating parameters. As an example, we consider how to estimate the parameter $\nu$. We can estimate $\rho_{j_m}$ by $r_{j_m}$ and apply MCMC, for example, to approximate the posterior distribution $p(\nu \mid \mathcal{D})$ given the collected data $\mathcal{D}$. Moreover, we can update this distribution as new data is received.

There are several directions for further work relating to the risk score calculation as data is collected:
• Improve estimation of the Bluetooth RSSI to distance mapping to increase the accuracy of the derived distance. This should enable the quantification of the uncertainty in the derived distance and how this uncertainty propagates through any calculations.
• Analyse the rate of false positives the application generates when notifying users at different risk score levels.
• Account for the false negative rate of the COVID-19 test (that is, a user tests negative despite having the disease).
• Account for the uncertainty in the parameter estimates of the models for generation and incubation period in the infectiousness factor.
• Understand the risk associated with not accounting for users who are asymptomatic.
• Compare appropriate decay models for the risk level after a recipient is notified but does not show symptoms (as in equation 12).

Additionally, there is a need to reformulate the risk scoring process in probabilistic terms.
While Section 4 does give a probabilistic interpretation of the current risk scoring algorithm, it is preferable to start with an underlying, well-motivated, fully probabilistic model with clear assumptions. Competing probabilistic models can be consistently evaluated as data is collected.

The risk score for a single contact event factorises as a product of five terms as defined in equation 1. In this section, for visualisation purposes, the risk context adjusting factor $c_{i,m}$ is assumed to be 1. In the figures below, we plot visualisations of $r(E^i_m)$ (risk_score) as we vary:
• the distance $d_{i,m}$ (distance),

Figure 8: Risk score isosurfaces

We decompose the probability $P(I_j \mid \bar{S}(t))$:

$$
\begin{aligned}
P(I_j \mid \bar{S}(t)) &= \frac{P(I_j \cap \bar{S}(t))}{P(\bar{S}(t))} \\
&= \frac{P\big(\bigcup_n I_{E_n} \cap \bar{S}(t)\big)}{1 - \sum_n P(S(t) \mid I_{E_n}) P(I_{E_n})} \\
&= \frac{1 - \prod_n \big(1 - P(I_{E_n} \cap \bar{S}(t))\big)}{1 - \sum_n P(S(t) \mid I_{E_n}) P(I_{E_n})} \\
&= \frac{1 - \prod_n \big(1 - P(\bar{S}(t) \mid I_{E_n}) P(I_{E_n})\big)}{1 - \sum_n P(S(t) \mid I_{E_n}) P(I_{E_n})} \\
&= \frac{1 - \prod_n \Big(1 - \big(1 - G(t - t_{E_n})\big)\big(1 - \nu^{\rho(E_n)}\big)\Big)}{1 - \sum_n G(t - t_{E_n})\big(1 - \nu^{\rho(E_n)}\big)}.
\end{aligned}
\tag{15}
$$

[1] Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
[2] Defining an epidemiologically meaningful contact from phone proximity events: uses for digital contact tracing. Version 2
[3] The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application
[4] Risk-scoring Algorithm (Interim): Technical Information
[5] Bayesian filtering for a bluetooth positioning system