key: cord-0573821-fhot59js authors: Ali, Junade; Dyo, Vladimir title: Cross Hashing: Anonymizing encounters in Decentralised Contact Tracing Protocols date: 2020-05-26 journal: nan DOI: nan sha: 087751a44759d4a2d6dfe4db89b3b91d6ccf2336 doc_id: 573821 cord_uid: fhot59js During the COVID-19 (SARS-CoV-2) epidemic, Contact Tracing emerged as an essential tool for managing the epidemic. App-based solutions have emerged for Contact Tracing, including a protocol designed by Apple and Google (influenced by an open-source protocol known as DP3T). This protocol contains two well-documented de-anonymisation attacks. Firstly that when someone is marked as having tested positive and their keys are made public, they can be tracked over a large geographic area for 24 hours at a time. Secondly, whilst the app requires a minimum exposure duration to register a contact, there is no cryptographic guarantee for this property. This means an adversary can scan Bluetooth networks and retrospectively find who is infected. We propose a novel"cross hashing"approach to cryptographically guarantee minimum exposure durations. We further mitigate the 24-hour data exposure of infected individuals and reduce computational time for identifying if a user has been exposed using k-Anonymity buckets of hashes and Private Set Intersection. Contact tracing systems are based on sharing of individual's presence through short-range Bluetooth beacons and enable individuals to determine if and when they have been in contact with an infected person [1] . If used by enough people, epidemic control can be achieved without resorting to countrywide "lockdown" strategies which are harmful to society and the economy [2] . At the same time pervasive tracking of people locations raised serious privacy concerns globally including potential risk to human rights with claims that excessively compromising privacy is a gateway to undermining other rights [3] . While a number of privacy-aware protocols have been proposed, many do not protect the privacy of the person once he or she becomes infected and allow sharing his or her location information. The area is rapidly evolving with numerous privacy aware protocols developed and deployed up to date [4] [5] . In particular, Apple and Google, who control the overwhelming market share of mobile phone operating systems, have created a BLE-based protocol for contact tracing [5] , based on DP3T, an open-source contact tracing protocol using hash-based cryptography to achieve anonymity. The protocols protect the users' privacy by encrypting their beacon IDs with temporal daily keys, which are created by the individual and can be disseminated in the event a user tested positive. However, once an infected user shares his daily temporal key, it becomes possible to track his or her entire daily trajectory. In fact, a formal analysis of DP3T has revealed that it is possible for "malicious users to re-identify infected users at scale" [4] . As the locations of infected individuals tend to attract extensive news coverage and re-identification leading to unwanted privacy invasion and public distain, this represents a serious issue [6] . Another concern is that the location information of an infected person becomes available to anyone who recorded at least one beacon, i.e. passersbys, various Bluetooth trackers installed in various locations, i.e. to users who have not been in a meaningful contact with the infected person and therefore are not supposed to have access to the infected person's location information. In this paper, we propose a privacy aware contact tracing approach that ensures the privacy of the person even after they become infected. The novelty of the approach is that we choose to protect the proximity information not at the individual beacon level but on the encounter level of a certain biologically meaningful duration. The approach ensures that location information is shared only with individuals who have spent a minimum sufficient duration in the proximity of the infected individual reducing the risk of re-identification attacks. Furthermore, by cryptographically signing encounters the approach ensures that once the infected individual shares his or her location information, it is not possible for a malicious user to re-identify the person or reconstruct his trajectory. Re-identification attacks of anonymised datasets have been explored before in literature, a description of the problem can be found in [7] . During the 2020 COVID-19 outbreak, notes a number of privacy controversies in South Korea through the implementation of Information Technology contact tracing systems; with the locations of infected individuals attracting extensive news coverage and re-idenitification allegedly leading to unwanted privacy invasion and public distain. The [5] protocol by Apple and Google works by generating a Temporary Exposure Keys (otherwise known as a Daily Tracing Key, a random number which is SHA-256 hashed and truncated to 16 bytes), this daily key is then in turn used to generate Rolling Proximity Identifiers for each 10 minute interval in the 24 hour period. The Rolling Proximity Identifiers is created by concatenating the Temporary Exposure Key with a timestamp value, computing a SHA-256 hash and truncating it to 16 bytes. In the event an infection is detected, the Temporary Exposure Keys are distributed and users can generate each Rolling Proximity Identifiers to check if they encountered an infected individual. The [5] protocol shares the same flaw as DP3T in that the Temporary Exposure Keys are widely circulated, allowing for re-identification of infected users. Finally, the protocol derives new Rolling Proximity Identifiers at a fixed time interval regardless of other characteristics (such as distance travelled), raising the possibility of tracing journeys travelled by users. The key issue to solve here is allowing for users to compute the Rolling Proximity Identifiers of users they may have come into contact with without leaking the information of all affected users. Developers of contact tracing apps [8] have sought guidance on implementing my original k-Anonymity communication protocol [9] to mitigate this effect (a protocol originally devised for compromised credential checking). It is noted in [10] that this risk is not sufficiently mitigated in existing protocols; the paper discusses that an attacker can passively collect proximity identifiers and then target that person if their diagnosis key is later disclosed and note that k-Anonymity may offer a solution to this problem. This risk is further discussed in [4] and notes that this is a substantial problem to address, noting "the only way to mitigate this attack is to deny the same privileges as the apps to individuals" and suggesting it could require a challenging deployment of hardware encryption (TPM chips) to address. Our prior work in [11] provides an empirical comparison of compromised credential checking (C3) protocols and defines novel protocols for minimising information loss. Whilst pure Private Set Intersection has a heavy computational and communication overhead, [12] has combined k-Anonymous protocols with Private Set Intersection to reduce this burden. Most recently support for response padding has been introduced into the original k-Anonymity protocol to prevent inference on the wire of encrypted contents [13] . Wireless scanner equipment using MAC address tracking has been deployed for monitoring pedestrian and cyclist journey time, monitoring road conditions in real-time and passenger movements throughout train stations. Such continuous tracking over a large geographical scale has raised serious privacy concerns amongst governments and the general public. In [14] , we provided a novel practical hash-based approach to anonymising such MAC Addresses at the point of collection. This approach was designed for use-cases like Journey Time Monitoring Systems which do not rely on client-side software to allow for the anonymisation of data, nevertheless the paper provides a formal model for anonymising wireless unique identifiers, which is of relevance to this work. The Contact Tracing protocol described in [5] works by a user randomly generating a Temporary Exposure Keys (formerly known as a Daily Tracing Key) which is 16 bytes long and generated from a cryptographic random number generator (CRNG). The device stores one Temporary Exposure Key per day up to a maximum of 14 keys (covering a 14 day period). A Rolling Proximity Identifier (RPI) is derived by hashing the Temporary Exposure Key together with an integer count of the 10 minute interval associated to that 14 day period. This RPI is broadcast and collected by other users. In the event the user is found to have tested positive for the virus, they may disclose their Temporary Exposure Keys to a server. Users may then download a list of Temporary Exposure Keys and compute the RPIs for each 10 minute interval to see if they were exposed to the virus. An adversary may passively scan for RPIs broadcast over a geographic area. Using lists of Temporary Exposure Keys they are able to identify and trace infected individuals throughout a large geographic area for that 24 hour period using wireless sensors installed on public transportation networks, motorway Bluetooth Journey Time Monitoring Systems or even rubbish bins [14] . Accordingly, the threat is not merely that an adversary may identify an infected individual, but that they may also trace them over a large geographic scale for a 24 hour period. The [5] protocol requires public health authorities set a minimum threshold for time spent together for an encounter to the counted, this value must be greater than 5 minutes. A user must be in contact for at least 5 minutes for a match to be registered and no further data is logged beyond 30 minutes of exposure. It is important to note that an adversary can passively collect RPIs without any need to have due regard for the minimum exposure threshold. We seek to limit information leakage by the server containing the Daily Tracing Keys of users whilst allowing users to validate if they have been exposed. We propose two modifications to the protocol to enhance privacy; the first provides a cryptographic guarantee that a user must be contact with another for a minimum duration before a contact is established and the second provides efficient lookups of RPIs without the need to disclose the Temporary Exposure Keys for that 24 hour period. In the current protocol, contact duration thresholds are set through software configuration rather than part of the cryptographic specification. By providing a cryptographically guaranteed contact duration threshold, we are able to limit information exposure before that threshold is met. We propose that the Rolling Proximity Identifiers (RPIs) are rotated at the same frequency as the minimum contact threshold. Contact events are registered by computing a hash of the current RPI and one of the previous RPI value, we refer to this as a Consistent Contact Identifier (CCI). To identify exposures later, a user can search for these CCIs from a remote server. Where RPIs generated from a device during the course of a day are represented as RP Is = {i 1 , i 2 , ..., i n }, a CCI for a given period can be computed as CCI = HKDF (i n , i n−1 ), where HKDF refers to the HKDF hash function used by the [5] protocol (adopted from IETF RFC 5869). As an alternative configuration; in the event that the minimum contact threshold is higher than the interval used to rotate the Rolling Proximity Identifiers (RPIs), a CCI should instead be computed as CCI = HKDF (i n , i n−k ), k ≥ 1 where k is the number of RPI steps required to achieve the minimum contact threshold. In lieu of a user downloading a list of Temporary Exposure Keys from users who are exposed to the virus, we propose users instead obtain a list of CCIs of exposed users. This prevents a user being identified as exposed unless the interaction was for the minimum contact duration threshold and constrains any tracking to the period of time around the immediate contact. By cross hashing RPIs together, we are able to achieve a minimum time using standard cryptographic primitives. The advantage of supplying and downloading the daily Temporary Exposure Keys to users is that there is a lower communication overhead than downloading all RPIs or our proposed CCIs. With RPIs rotated every 10 minutes, there are 144 times the number of keys to download. If the rotation duration is decreased to every 5 minutes, there are 288 times the number of keys to download over simply providing Temporary Exposure Keys. Alongside the higher communication overhead created through the creation of CCIs, knowledge of Temporary Exposure Keys allows a user to be tracked for a 24 hour period if they test positive. We seek to allow a user to query CCIs whilst minimising the privacy loss both on the server and for the client. Fortunately, this is a well explored problem in [9] , [11] , [12] . When a user's Temporary Exposure Key is uploaded to a server, the CCIs are computed for the period of exposure. They are then stored in k-Anonymous buckets described in [9] . The client simply needs to truncate the CCI hash to a desired number of bits such to grant anonymity (we provide both formal definitions and empirical analysis for applying k-Anonymity to hash based data sets in [14] ). As such buckets are queried on the basis of CCI hashes instead of the RPIs or Temporary Exposure Keys, no additional information is leaked (for more information on data leakage risks see the section on "Identifier-Based Bucketization" in [11] ). As a further safeguard against excessive data leakages, look-ups within those buckets are performed using Private Set Intersection (PSI). An approach for implementing Private Set Intersection in k-Anonymous buckets is described in further detail in [12] using a Diffie-Hellman based approach. Private Set Intersection allows two parties to determine the intersection of two sets of data without disclosing either set to the other party. During a k-Anonymous search protocol, this means the client discloses a partial hash as the search partition and then both parties are able to determine if the content of the clients search is in that partition without needing to disclosure any further information. Unlike the approach detailed in [9] , this provides further cryptographic shielding of the hashes held by the server. As a final security measure, to prevent an adversary determining the number of daily contacts through passive analysis (as analysed in [13] ), provision should be made for the number of requests to be padded with random data. Our proposed amendments to the protocol mitigates passive de-anonymisation attacks in two ways; the first is by requiring a contact to last for longer than the minimum duration for a user to know if another user is infected and the second prevents whole-day tracking in instances where the user is infected. The cross hashing approach used to mitigate such attacks necessitates more hashes per infected user (protecting information at an encounter level) and accordingly a higher communication overhead. We mitigate this by allowing the client to search in k-Anonymous buckets which reduces the communication overhead and reduces data leakage by the server through the use of Private Set Intersection. Nevertheless, this results in some data leakage. There is a trade-off to this approach; to ensure both communication overhead remains low and prevent re-identification attacks, the user must disclose the first few bits of two hashed Rolling Proximity Identifiers. The risk associated to such a protocol is well documented in [11] and formal analysis of the properties of anonymised hashes can be found in [14] . Such protocols have been used widely in compromised credential checking systems, a privacy sensitive context. Nevertheless, even without preventing such deanonymisation attacks; communication overhead may be a future problem in the existing protocol as it grows linearly to the number of positive tests (regardless of user interactions). Once a user is infected, a Temporary Exposure Key of 16 bytes must be stored for 14 days. This means the communication overhead of downloading Temporary Exposure Keys exceeds 100 MB after 446429 users have tested positive. Similarly, depending on the number of infected users and communication limitations, cross hashing may be deployed as a standalone measure meaning no data leakage is necessitated. This approach mitigates, but does not eliminate, deanonymisation attacks. An active attacker who approaches a users smartphone for longer than the Contact Duration Threshold can still determine if the user later tested positive. We do however eliminate the risk of tracking over a 24 hour period. In this paper we have provided a practical approach for mitigating passive de-anonymisation attacks on the Google/Apple Contact Tracing Protocol [5] . We provide a novel cross hashing approach to provide a cryptographic guarantee that a contact is only registered when two devices have been in contact for a minimum duration. We further reduce the communication overhead of this protocol using k-Anonymity buckets of hashes and Private Set Intersection. The work in this paper provides a tangible proposal rectifying weaknesses in the anonymity scheme used in the existing protocol. We hope to expand this work with formal and empirical analysis of the protocols with this configuration. Contact tracing mobile apps for covid-19: Privacy considerations and related trade-offs Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing Covid-19 apps pose serious human rights risks Analysis of dp3t Contact tracing -cryptography specification Cliniclevel variations in contact tracing outcomes: protocol for a crosssectional study of variation and case-mix adjustment in london, uk Broken promises of privacy: Responding to the surprising failure of anonymization Robert's desired property is not clearly stated, proposing: "k-anonymity Mechanism for the prevention of password reuse through anonymized hashes Security analysis of the covid-19 contact tracing specifications by apple inc. and google inc Protocols for checking compromised credentials Protecting accounts from credential stuffing with password breach alerting Pwned passwords padding (ft. lava lamps and workers) Practical hash-based anonymity for mac addresses