key: cord-0561337-qv8mmse7
authors: Hsiao, Hsu-Chun; Huang, Chun-Ying; Cheng, Shin-Ming; Hong, Bing-Kai; Hu, Hsin-Yuan; Wu, Chia-Chien; Lee, Jian-Sin; Wang, Shih-Hong; Jeng, Wei
title: An Empirical Evaluation of Bluetooth-based Decentralized Contact Tracing in Crowds
date: 2020-11-09
journal: nan
DOI: nan
sha: 802b98e29180ed041a21870cd9beb4f72b5aba28
doc_id: 561337
cord_uid: qv8mmse7

Digital contact tracing is being used by many countries to help contain COVID-19's spread in a post-lockdown world. Among the various available techniques, decentralized contact tracing that uses Bluetooth received signal strength indication (RSSI) to detect proximity is considered less of a privacy risk than approaches that rely on collecting absolute locations via GPS, cellular-tower history, or QR-code scanning. As of October 2020, there have been millions of downloads of such Bluetooth-based contact-tracing apps, as more and more countries officially adopt them. However, the effectiveness of these apps in the real world remains unclear due to a lack of empirical research that includes realistic crowd sizes and densities. This study aims to fill that gap by empirically investigating the effectiveness of Bluetooth-based contact tracing in crowd environments with a total of 80 participants, emulating classrooms, moving lines, and other types of real-world gatherings. The results confirm that Bluetooth RSSI is unreliable for detecting proximity, and that this inaccuracy worsens in environments that are especially crowded. In other words, this technique may be least useful when it is most needed, and it is fragile when confronted by low-cost jamming. Moreover, technical problems such as high energy consumption and phone overheating caused by the contact-tracing app were found to negatively influence users' willingness to adopt it. On the bright side, however, Bluetooth RSSI may still be useful for detecting coarse-grained contact events, for example, proximity of up to 20m lasting for an hour. Based on our findings, we recommend that existing contact-tracing apps be re-purposed to focus on coarse-grained proximity detection, and that future ones calibrate distance estimates and adjust broadcast frequencies based on auxiliary information.

Contact tracing has been known to be an effective method for controlling the spread of infectious diseases. In the traditional contact-tracing model, trained personnel use evaluations of Bluetooth-based indoor positioning [14] are not directly applicable to contact tracing due to different settings and assumptions. For example, indoor positioning often assumes using trilateration or known floor maps and considers only one or a few sending devices within the communication range.

Accordingly, this study empirically investigates the effectiveness of BDCT apps in two phases. In Phase 1, we examine whether Bluetooth RSSI can reliably predict distance in a controlled experimental setting. In Phase 2, based on estimated distances between pairs of participants' phones over time, we compare detected proximity and contact events in a semi-controlled event: a real-world academic gathering, the ground truth of which was carefully recorded. We recruited a total of 80 participants to use one Bluetooth-based contact-tracing app, which we modified from Covid-Watch-TCN [8], in our controlled and semi-controlled settings. Our modifications allowed us to collect ground-truth data and the phones' usage logs. After both experimental phases were completed, we also conducted a follow-up survey with participants to enrich the data we had already obtained. This paper addresses the following questions:

• RQ1. Can the app reliably estimate the distance between its users based on Bluetooth RSSI under different crowd parameters, e.g., standing still vs. walking, with or without physical barriers, varying interpersonal distances, and the presence of jamming?

• RQ2. How accurately does the app detect proximity and contact events in realistic crowd environments, as compared with the ground truth of such events?

• RQ3. What are users' perceptions and experiences of using these apps?

Our experimental results suggest that Bluetooth RSSI is unreliable for detecting proximity, and reveal that such inaccuracy worsens in crowded environments. This implies that this technique may be least useful when it is most needed, and fragile when confronted by low-cost jamming. Specifically, the app failed to capture the majority of proximity events: only 16 out of 67 (24%) proximity events were detected by the app when setting a 2-meter distance threshold; only 19 out of 67 (28%) were detected even when no distance threshold was set. In terms of user experience and perceptions, technical problems such as high energy consumption and phone overheating caused by the app were found to negatively influence users' willingness to adopt it. On the bright side, the app captured 63% of contacts lasting one hour in a room containing 50 participants and more than 150 other people. Divided by operating system, 80% of Android devices were able to be discovered by both nearby Android and iOS devices in about an hour. This implies that this technique may still be useful for detecting coarse-grained contact events; for example, contacts within 20m that last for at least an hour. The sampled phone users said they were more willing to use similar apps 1) when in crowded environments and 2) while contact tracers from health departments were also using them.

Although our study is based on a specific implementation, many of our instruments and findings are applicable to BDCT applications [15] with similar designs. We discuss limitations in the Practical Implications section.

We summarize several representative privacy-preserving contact-tracing methods and previous studies that review them.

Privacy-preserving contact tracing for COVID-19 There are two major approaches to privacy-preserving contact tracing. The first adopts a decentralized design intended to minimize the amount of data needed to be sent to a centralized server. The other applies cryptographic algorithms to protect sensitive user data.

Decentralized design This approach is exemplified by Safe Path [16], DP3T [17], and Covid-Watch [8]. Safe Path logs a user's movement routes in his/her mobile device and only exports that data to health authorities if that user is diagnosed with the virus. When this happens, the exported dataset is first redacted to ensure privacy, and then is broadcast to other users to allow them to self-determine their likelihood of having been exposed. Rather than capturing and storing route information, DP3T and Covid-Watch continuously broadcast anonymized tokens, which will only be published in the event of a positive diagnosis. This enhances privacy because everyone can locally infer exposures based on whether any received tokens belong to infected people.

Cryptographic algorithms This approach has been adopted by Private Set Intersection (PSI) [18] and TraceSecure [19]. PSI enables two parties to compute the intersection of their data in a privacy-preserving way, with only the common data values being revealed. The data used for computation contains only hashed location points, and location privacy can therefore be guaranteed. TraceSecure, on the other hand, incorporates a public-key-based security protocol for message exchange and storage. An optional homomorphic encryption scheme can be used to further enhance data protection. While this cryptography-based approach guarantees better privacy than the decentralized one, it is less deployable on consumer devices and existing infrastructure.

Our study therefore focuses on decentralized contact tracing, since it is more feasible to deploy on the massive scale required for this purpose.

Implementation Among the decentralized contact-tracing apps, a number of them use Bluetooth RSSI as a proxy for close proximity between devices, and therefore, between the owners of those devices. Early implementations such as the DP3T project and Covid-Watch had to confront Bluetooth interoperability issues between the two major mobile platforms, iOS and Android. In April 2020, the two major smartphone OS providers, Apple and Google, announced that they had formed a coalition to release APIs, known as the Google/Apple Exposure Notification (GAEN) APIs, that would help contact-tracing apps work across iOS and Android devices, with the first such APIs appearing the following month [13]. Android version 6.0 and iOS 13.7 and higher have supported the fundamental functions of Bluetooth contact tracing, including broadcasting and listening to tokens, at the OS level. To prevent abuse of these APIs, the companies restricted each country to one official app. An increasing number of national projects have adopted the GAEN APIs, including Switzerland's SwissCOVID, Italy's Immuni, and Germany's Corona-Warn-App. Since our use of it in our experiments, Covid-Watch has also switched to using the GAEN APIs. We plan to use our proposed methodology to evaluate GAEN-API-based apps once the GAEN APIs are enabled in Taiwan.

Possible DoS attacks Studies have pointed out that current contact tracing apps, based on ephemeral IDs, are vulnerable to DoS attacks. Chen and Hu [20] presented BlindSignedIDs, which are verifiable ephemeral identifiers, using blind signatures and TESLA authenticators to verify EphIDs in-place. BlindSignedIDs reduce storage requirements by more than 90% and are demonstrated effective in mitigating gigabyte-level DoS attacks.

Privacy concerns Tracing contacts by accessing devices' relative locations (e.g., using Bluetooth signal reception) is considered more protective of user privacy than capturing absolute locations (e.g., via GPS) [21, 22]. Several prior studies have investigated privacy threats such as replay attacks and de-anonymization attacks by state-level or resourceful adversaries [23, 24], or have proposed advanced cryptographic solutions to enhance privacy [25]. While detailed privacy concerns are beyond the scope of our empirical study, we feel it should be noted that privacy enhancement beyond a certain level is likely to degrade the detection accuracy and other aspects of the performance of contact-tracing apps. Our methodology and protocols used in this study will be useful in assessing whether the negative performance impacts of future privacy-enhancement efforts outweigh their benefits.

Broadcasting of anonymized tokens to nearby devices is fundamental to the implementation of privacy-preserving contact tracing. In contemporary smartphones, the most widely deployed techniques that support such broadcasting are Bluetooth and Wi-Fi. Of the two, however, developers prefer to use Bluetooth because it was originally designed to function within ad hoc networks, and because it has also been widely used for distance measurement and indoor positioning [26, 27] . By reading an RSSI reported by a receiver, an application can estimate the distance between the receiving and sending devices; and indoor positions can be calculated based on three or more RSSIs from fixed-location broadcasters or beacons.

RSSI-based distance estimation Bluetooth technology was developed to replace cables connected to peripherals. Its lightweight design and wide deployment enable many IoT and logistic applications, such as warehouse management and traffic monitoring [28]. Bluetooth's RSSI can also help distance measurement and indoor positioning [14]. A receiving device can estimate its distance from a sending device based on the perceived RSSI. Researchers have also attempted to use alternative radio-frequency-based techniques such as Zigbee, Ultra-Wideband (UWB), and WiFi for positioning and contact tracing [29, 30]. However, Bluetooth remains the mainstream choice because of its cost, efficiency, and availability. The discussion of other alternatives is therefore outside the scope of our study.

A major drawback to RSSI-based distance estimation is the variation in its measurement results, caused by various environmental factors such as interference, emission power, and receiver sensitivity, all of which introduce noise. Indoor positioning applications have improved estimation accuracy through trilateration (using multiple referenced sending devices at known positions), incorporating floor maps, or training position-dependent signal attenuation models. Some have proposed augmenting Bluetooth with other sensors to improve accuracy [27] . However, these improvements may be difficult to apply to BDCT because they will require national-scale referenced device deployment, indoor mapping, or modeling. Moreover, BDCT and other Bluetooth-based applications consider different settings. For example, in BDCT, every user's phone is both sending and receiving, thus more likely to saturate wireless channels than indoor positioning (which requires only one or a few sending devices present in a room). Therefore, a thorough evaluation is required to understand BDCT's limitations and possibilities.

Nevertheless, we felt that Bluetooth-based distance estimation has the potential to provide helpful information to pandemic investigators, and tested its performance for this purpose with human subjects in controlled and semi-controlled settings, as explained in the Research Design section.

Empirical evaluation of contact tracing Although many BDCT apps have been deployed in the field, their effectiveness remains unknown due to the lack of ground-truth information. Prior to our present effort to help fill that gap, an empirical study [31] was conducted in April 2020 among a group of 48 soldiers in Germany. They were divided into five scenarios with different moving patterns, with at most 10 people in any one scenario. A follow-up report documented the experimental protocol and provided some preliminary analysis, but drew no clear conclusions and made no recommendations.

Studies [32, 33] have shown that distances derived from RSSIs without calibration can be quite diverse, even in controlled-experiment scenarios with no human participants and with the same settings on all devices. Leith and Farrell conducted a series of studies [33] [34] [35] to empirically measure RSSI between mobile phones indoors and outdoors, as well as on a bus and a tram, and considered factors that could affect such signal strength, including distance, phone orientation, and absorption and/or reflection by surroundings such as building walls or even human bodies. Their follow-up measurement study recruited five participants on a commuter bus to investigate the relationship between Bluetooth attenuation and distance in an environment prone to signal reflection. They made several recommendations for improving BDCT, such as leaving phones on tables instead of keeping them in bags or pockets.

Our study considers scenarios involving much larger groups of participants and jamming devices, which allows us to simulate crowded scenarios and observe issues that might occur only or mostly in large groups, e.g., rapid battery depletion, interference, and interoperability problems between different phone models. In addition, we also collected users' feedback after using a Bluetooth-based contact-tracing app, particularly their perceptions and concerns regarding privacy and usability.

Since our focus is on Bluetooth-related issues, the following studies were also important to our thinking, despite being beyond the scope of our own research.

Existing exposure-notification apps often feature fixed thresholds for identifying contact events and calculating exposure risks. For example, if a user has been in close proximity with a confirmed patient for a sufficiently long time (e.g., 15 minutes), that user will be warned of potential exposure by the app. Wilson et al. [36] proposed a calibrated measure of infection risk based on empirical measurements, and devised a risk-scoring system that aims to provide better quarantine recommendations.
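A fixed-threshold rule of this kind can be sketched in a few lines. The following is a minimal illustration, not the logic of any particular app: it assumes the input is a time-sorted list of (timestamp in seconds, estimated distance in meters) samples for one pair of devices, and the function name `contact_events` and the parameters `max_distance_m`, `min_duration_s`, and `max_gap_s` are hypothetical.

```python
from typing import List, Tuple


def contact_events(
    samples: List[Tuple[float, float]],
    max_distance_m: float = 2.0,      # assumed proximity threshold
    min_duration_s: float = 15 * 60,  # assumed exposure duration (15 minutes)
    max_gap_s: float = 60.0,          # assumed tolerance for missing samples
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals in which the pair stayed within
    max_distance_m for at least min_duration_s."""
    events = []
    start = last = None
    for t, dist in samples:
        close = dist <= max_distance_m
        if close and start is not None and t - last <= max_gap_s:
            last = t  # extend the current run of close samples
            continue
        # The current run (if any) ends here: keep it if it is long enough.
        if start is not None and last - start >= min_duration_s:
            events.append((start, last))
        # Start a new run only if this sample is itself close.
        start, last = (t, t) if close else (None, None)
    if start is not None and last - start >= min_duration_s:
        events.append((start, last))
    return events
```

Because both thresholds are fixed, any bias in the distance estimate translates directly into missed or spurious events, which is why calibrated risk scores such as the one proposed in [36] are of interest.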

Some scholars have evaluated the effectiveness of contact tracing via mathematical modeling and simulation, and compared it against other countermeasures such as social distancing or lockdowns [37, 38]. Our findings can provide more realistic parameters for contact tracing that can assist the refinement of such models and simulations.

Despite the security and privacy issues involved in the adoption of digital contact tracing, as of October 2020 (i.e., approximately half a year after WHO declared COVID-19 a pandemic), a considerable number of national and local governments have either already introduced such measures in the fight against COVID-19, or are planning to do so [39].

On the flipside, a relatively small number of governments have launched new human-based tracing services or announced improvements to existing ones [40], and remain on the fence regarding the adoption of digital contact tracing, repeatedly citing concerns about uptake rates and false positives/negatives [41, 42]. The UK, for instance, planned to launch a coronavirus app that enables users to report symptoms and book tests, but does not allow contact tracing [43]. In Canada, Manitoba's chief public health officer declared that the contact-tracing app "will not replace public health's ability to contact trace", although the other four provinces have adopted such an app [44].

Several large-scale questionnaire surveys have been conducted to capture phone users' general perceptions toward contact-tracing mobile apps. A multi-country survey of Europe and North America has shown high user acceptance of downloading such apps (74.8%) [45]. However, the results of another survey, conducted in the U.S., suggest that support for the policy of encouraging use of these apps is relatively weak (42%), as compared to traditional measures; but also that the implementation of decentralized data storage helps increase acceptance [46]. According to a UK survey, 67.2% of those who are unwilling to participate in app-based contact tracing considered privacy concerns the main reason [47]. A team in Jordan, meanwhile, reported that 71.6% of their respondents accepted the use of contact-tracing technology, but only 37.8% actually used it [48]. Additionally, a German team found that factors such as age, gender, education, and income could influence the download and use of contact-tracing apps [49]. It is noteworthy, therefore, that all of the respondents to our post-study survey had actually experienced using such an app, and were thus able to provide meaningful, app-specific answers about their usage experience and privacy perceptions.

Our research design comprises four elements, as shown in Table 1. These are: 1) modification to the Covid-Watch-TCN mobile application; 2) a controlled experiment with 30 participants; 3) a semi-controlled experiment with 50 participants in the field; and 4) a follow-up user survey administered to the participants from both experiments. The research was approved by the REC Office at NTU (#202006HS001).

To evaluate the effectiveness of Bluetooth-based decentralized contact tracing, our experiments collected information that would help us reconstruct the ground truth required for conducting comparative analysis: e.g., the sender of each token, and the distance between each pair of participants. Some of this information was collected by the mobile app and reported to a backend server; some was pre-assigned based on our protocol; and some was derived from direct observation. This subsection describes how we customized and configured an open-source app, and the following two subsections present our protocol and observational approaches, respectively.

We modified the source code of the Covid-Watch-TCN Exposure Notification App [8], whose Android and iOS versions are both open-source and can be found on GitHub. We modified the version based on the TCN Coalition's implementation. Since our experiments, Covid-Watch has switched to using the GAEN APIs, which were not enabled in Taiwan while we conducted this study. Covid-Watch-TCN derives tokens (also called temporary contact numbers, TCNs) from a seed using cryptographic keys and hash functions. Keys are renewed periodically to balance storage overhead and privacy. While the detailed token generation algorithms, implementations, and parameter selection may vary, all of these apps rely on token reception and RSSI strength to estimate distance and exposure duration, and this is the main focus of our study.
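To make the idea of deriving tokens from a seed concrete, the sketch below shows a generic hash ratchet. It is an illustrative simplification and should not be read as the Covid-Watch-TCN algorithm: the domain-separation labels, the 16-byte token length, the use of SHA-256, and the names `next_key`, `token_for`, and `derive_tokens` are all assumptions made for this example.

```python
import hashlib
from typing import List


def next_key(key: bytes) -> bytes:
    """Advance the per-period key with a one-way hash (assumed construction)."""
    return hashlib.sha256(b"key" + key).digest()


def token_for(index: int, key: bytes) -> bytes:
    """Derive the broadcast token for one period from the current key."""
    digest = hashlib.sha256(b"token" + index.to_bytes(2, "little") + key).digest()
    return digest[:16]  # assumed token length of 16 bytes


def derive_tokens(seed: bytes, count: int) -> List[bytes]:
    """Derive `count` successive tokens from a single secret seed."""
    tokens = []
    key = next_key(seed)
    for i in range(1, count + 1):
        tokens.append(token_for(i, key))
        key = next_key(key)  # ratchet forward so later keys do not reveal earlier ones
    return tokens


# One token per 15-minute period gives four tokens per hour.
hourly_tokens = derive_tokens(b"example-seed", 4)
```

In a design of this shape, publishing the seed (or an intermediate key) after a positive diagnosis lets other devices recompute the tokens locally and compare them against those they received, which is why only a small amount of data needs to be uploaded.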

To minimize interference with the app's main functionality, we did not modify its token generation algorithm, under which tokens are sent every 100 ms and changed every 15 minutes. Our changes only logged data locally and sent the logs back to our server at the end of each task. We manually inspected the app's logic to identify the Android or iOS system API calls that created or sent tokens, and inserted our logging code before such calls.

On starting up, our modified app prompted each participant to enter the unique ID that was assigned to him/her at the beginning of the experiment, and to log device information including the running operating system and phone model.

While running, the app logged all the sent and received tokens, along with RSSI values, timestamps, and phone battery status. Specifically, when a device transmitted a token, the app logged the sent token with the unique ID of the device, the current battery status, and a timestamp. When a token was received from another participant's device, on the other hand, the app logged that received token with measured RSSI and calculated distance; the unique ID of the device; and a timestamp. At the end of each task, the participants were asked to click on a "Submit" button to upload the logged information to our backend server. How the logged information was processed for further analysis will be explained in Section .
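The two kinds of log entries described above can be pictured as simple records. The following sketch is only an illustration of the fields named in the text; the class names, field names, units, and the JSON upload format are assumptions, not the app's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Union


@dataclass
class SentRecord:
    device_id: str      # unique ID entered by the participant
    token: str          # token that was broadcast
    battery_pct: float  # battery status at send time (assumed to be a percentage)
    timestamp: float    # seconds since the epoch (assumed unit)


@dataclass
class ReceivedRecord:
    device_id: str         # unique ID of the receiving device
    token: str             # token that was received
    rssi_dbm: int          # measured RSSI
    est_distance_m: float  # distance calculated by the app
    timestamp: float


def serialize(records: List[Union[SentRecord, ReceivedRecord]]) -> str:
    """Bundle the logged records into one JSON payload for upload."""
    return json.dumps([{"type": type(r).__name__, **asdict(r)} for r in records])


payload = serialize([
    SentRecord("P01", "a1b2", 87.0, 1594000000.0),
    ReceivedRecord("P02", "a1b2", -63, 1.8, 1594000000.1),
])
```

Keeping the sender's ID only in the sent log and the receiver's ID only in the received log means the two logs must later be joined on the token value to recover who heard whom.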

In Phase 1, each participant was assigned a position and movement pattern (Figure 2), such that the actual and Bluetooth-RSSI-estimated distances between each pair of participants can be calculated and compared at multiple time-points.

Research site and participants The controlled experiment, conducted in July 2020, comprised two scenarios, indoor and outdoor. The indoor scenario utilized an empty classroom 131 m² in size, and the outdoor scenario took place on a covered patio measuring 503 m². Both areas are 3.03m in height. We recruited 30 college students from our institutions, all of whom brought their personal mobile devices. Among these 30 devices, 14 were Android and the remainder were iOS. In both scenarios, we physically labeled each participant with a unique ID and marked the floor with tape. Before starting the experiment, we provided instructions to all participants explaining the research purpose and the overall experiment flow, and collected informed consent from all of them. Figure 1 provides an overview of the experiment settings and process.

Protocol The indoor scenario was broken down into five sessions, and the outdoor one into three, as shown in Table 2. Both scenarios included three sessions (sessions 1, 3, and 4 in the indoor scenario, and sessions 6, 7, and 8 in the outdoor scenario) that required the participants to hold their devices in their hands and i) stand still, ii) walk equidistantly in a given area, or iii) gradually move closer to each other. The other two sessions in the indoor scenario required the participants to stand still in a jammed environment characterized by continuous sending of useless data by Bluetooth beacons (Session 2), and near a wall that divided the participants into two groups (Session 5). To set up the jammed environment, we placed six Raspberry Pi 3 Model B devices in the same room. Each of these devices emitted unique, useless data via Bluetooth every 20 ms, i.e., five times faster than a normal device. This emulates two cases: 1) the existence of a malicious user jamming the wireless channels by sending tokens at a higher frequency, and 2) a very crowded place containing an additional 30 (= 6 × 5) phones.
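The equivalence between six jammers and 30 additional phones follows from the broadcast rates. The short calculation below reproduces it, taking the normal broadcast interval to be the 100 ms used by the app; the constant and function names are chosen for this example only.

```python
NORMAL_INTERVAL_MS = 100  # a regular phone broadcasts a token every 100 ms
JAMMER_INTERVAL_MS = 20   # each jamming device broadcasts every 20 ms
NUM_JAMMERS = 6


def broadcasts_per_second(interval_ms: float) -> float:
    """Number of broadcasts a device emits per second at a given interval."""
    return 1000.0 / interval_ms


def equivalent_phones(num_jammers: int, jammer_interval_ms: float,
                      normal_interval_ms: float) -> float:
    """Number of regular phones that would generate the same broadcast load."""
    jammer_rate = broadcasts_per_second(jammer_interval_ms)
    normal_rate = broadcasts_per_second(normal_interval_ms)
    return num_jammers * jammer_rate / normal_rate


extra_phones = equivalent_phones(NUM_JAMMERS, JAMMER_INTERVAL_MS, NORMAL_INTERVAL_MS)
```

This counts only broadcast volume; it does not model the scanning that real phones also perform, so it is a rough proxy for crowding rather than an exact emulation.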

The eight sessions comprised 28 one-minute tasks. The multiple tasks within a session were set up to test how Bluetooth signal propagation varied across devices 1) running different operating systems (i.e., Android, iOS, and mixed) and 2) held at different distances from one another (i.e., 0.5d, 1d, 1.5d, where d = 1.5 meters).

• Operating systems. In sessions 1 and 6, where the participants stood still in both the indoor and outdoor scenarios, they were first grouped by their devices' operating systems. After completing the tasks (i.e., tasks 1-6 and 18-23), the participants were taken out of these operating-system-based groups, and the experiment continued, any further groupings being randomized.

• Distances. To simulate a real-world setting in which people may or may not maintain social distancing, we asked the participants in sessions 1, 2, 3, and 6 to keep a distance of 0.5d, 1.0d, or 1.5d (d=1.5 m) from each other. This yielded data that subsequently allowed us to compare the estimated distances generated by the app against the ground-truth distances, and thus to evaluate the accuracy of the app's proximity-detection techniques. Due to time limitations, sessions 4, 5, 7, and 8 were conducted with the participants attempting to maintain a single fixed distance of 0.5d, 1.0d, 1.0d, and 0.5d, respectively.
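Comparing estimated and ground-truth distances amounts to computing pairwise distances from the assigned positions and measuring the error of the app's estimates. The sketch below assumes, purely for illustration, that participants stand on a rectangular grid whose spacing is a multiple of d = 1.5 m; the actual layouts are those shown in Figure 2, and the helper names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

D_METERS = 1.5  # base distance d used in the protocol


def grid_positions(rows: int, cols: int, factor: float) -> Dict[int, Tuple[float, float]]:
    """Place participants on a grid with spacing factor * d (assumed layout)."""
    spacing = factor * D_METERS
    return {r * cols + c: (c * spacing, r * spacing)
            for r in range(rows) for c in range(cols)}


def pairwise_distances(positions: Dict[int, Tuple[float, float]]) -> Dict[Tuple[int, int], float]:
    """Ground-truth Euclidean distance for every unordered pair of participants."""
    return {(a, b): math.dist(positions[a], positions[b])
            for a, b in combinations(sorted(positions), 2)}


def mean_abs_error(truth: Dict[Tuple[int, int], float],
                   estimates: Dict[Tuple[int, int], float]) -> float:
    """Mean absolute difference between estimated and true distances."""
    errors = [abs(estimates[p] - truth[p]) for p in estimates if p in truth]
    return sum(errors) / len(errors) if errors else float("nan")


truth = pairwise_distances(grid_positions(2, 2, 0.5))  # spacing 0.5d = 0.75 m
```

With the three spacings used in the protocol, adjacent participants are 0.75 m, 1.5 m, or 2.25 m apart, so the 0.5d and 1.0d settings fall within a 2-meter proximity threshold while the 1.5d setting does not.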

At the beginning of each task, the participants were prompted to turn on the app, as well as their devices' Bluetooth and GPS. We required all users to turn on GPS during the experiment for consistency, because for Android version 6.0 and above, location services (e.g., GPS) need to be enabled when performing Bluetooth scanning [50]. During each task, a series of slides would show the participants their assigned positions and/or movement paths (Fig. 2). Once a task ended, the participants were asked to manually upload the logged data to our backend server using the app.

Unlike Phase 1, participants were allowed to move freely within a large auditorium during Phase 2. Proximity events and contact events were reconstructed based on video footage and direct on-site observation.

Research sites and participants The semi-controlled experiment was conducted at a summer school event in a campus auditorium measuring roughly 396 m² with a seating capacity of 250.

Out of the 216 attendees, we recruited 50 participants: 26 were using Android devices and the other 24, iOS ones. The participants could be distinguished from the other attendees by differently-colored lanyards.

Protocol On the first day, we asked participants to turn on our research app, Bluetooth, and GPS for at least 90 minutes; on the second day, this was increased to 150 minutes. At the points when the participants were asked to turn on the app, they might be sitting still and listening to a speech, divided into groups and taking part in discussions, or having a tea break outside the auditorium.

One conference staff member was secretly assigned to be "the source of the virus" on the first day. This individual turned on the app, randomly passed by the participants, and recorded these actions with a GoPro camera so that we could reconstruct his close contacts after the experiment. Additionally, four researchers observed and manually documented the ground truth of proximity events in the auditorium, including the IDs of the participants involved and when they occurred.

After the experiments, we administered a questionnaire. Its three sections covered 1) technical problems the participants had encountered during the experiments, 2) their attitudes toward the use of the contact-tracing app, and 3) their attitudes toward personal privacy.

Section 1: Technical Problems In this section, the respondents could choose to agree with any or all of the following seven statements: Phone overheating, Seriously increased energy consumption, App crash, Unstable receiving token, Phone performance negatively affected, Couldn't log in, and Other. We also asked a yes/no question regarding whether the respondents had encountered upload failure during the experiments.

Section 2: Willingness to Use the Contact-tracing App Section 2 aimed to capture how different technical factors and situations influenced the participants' willingness to use the contact-tracing app. Its questions were divided into two groups.

In the first, each of the seven answer options from Section 1 regarding technical problems was repeated, along with the question, Will this technical problem affect your willingness to use the app in the next six months? The respondent was then asked to select how much each problem s/he had selected would affect such willingness, on a three-point Likert-scale ranging from -2=gradually decrease to 0=not influenced, though an answer of N/A could also be given in place of a scaled response.

The second group of items in Section 2 asked the respondents to select how various non-technical conditions would affect their willingness to use the app over the following six months. These conditions were Regulation by law or my school; Social influence from my family or colleagues; Planning a trip domestically or abroad; Entering a crowd of more than 100 people; and Current use of such apps by epidemic investigators. These were rated on a five-point Likert-scale ranging from -2=gradually decrease to +2=gradually increase, plus an N/A option.

In the final section of the questionnaire, the respondents were first asked about when and why they usually turned on their devices' Bluetooth and GPS functions. Then, they rated the statement My data are secure and my privacy is protected while using the app on a five-point Likert scale ranging from -2=strongly disagree to +2=strongly agree, again with an N/A option. If a person's response revealed a negative attitude toward the app's privacy and security, i.e., was lower than 0, s/he would further be asked to select from among the following list of five data-security and privacy concerns: Data being tampered with; The app developer or associates may take advantage of security weaknesses; The app developer or associates may use my data for other purposes; and My identity, past contacts, or past locations may be recognized.

We sent out the questionnaire to all 80 participants, but in fact this represented only 78 individuals, as two had participated in both experiments. Of these 78, 24 completed the questionnaire: a response rate of 30.8%.

This study was reviewed and approved in July 2020 by National Taiwan University's Research Ethics Office (equivalent to an Institutional Review Board in North America), and meets all criteria for minimal-risk research (#202006HS001).

We will open the participant instructions shortly after publication, so that other research teams can reuse our protocols or reproduce our research.

After removing incomplete data caused by technical glitches (data from 5 Android devices in the controlled experiment, and from 2 Android and 7 iOS devices in the semi-controlled one, were corrupted), the final dataset includes 66 devices.

Among the valid devices (33 Android and 33 iOS), Apple (n=33), Samsung (n=9), Google (n=5), and Xiaomi (n=5) were the most common models. Approximately half of the devices (n=30, 46.9%) were existing models released within two years, and the remaining models were released two to five years before, as of July 2020. Forty-four devices (72.1%) have been updated to Android 10.0, iOS 13.0, or newer versions released after September 2019. Thirteen (21.3%) were Android 9.0 or iOS 12.1 onwards.

Based on the participants' device logs, we reconstructed a directed multigraph, on which a vertex represents a participant and an edge from A to B represents a token sent by A and received by B. Each edge is labeled with a unique tuple (token, RSSI, timestamp) representing the corresponding token's RSSI value and timestamp. Because tokens are sent every 100ms and changed every 15 minutes, the same token may be seen multiple times and have different RSSI values and timestamps.
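The reconstruction can be sketched as a join between the sent and received logs on the token value. This is a minimal illustration under assumed input shapes: `sent_log` holds (sender ID, token) pairs and `received_log` holds (receiver ID, token, RSSI, timestamp) tuples; the function name and these shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, int, float]  # (token, RSSI, timestamp)


def build_multigraph(
    sent_log: Iterable[Tuple[str, str]],
    received_log: Iterable[Tuple[str, str, int, float]],
) -> Dict[Tuple[str, str], List[Edge]]:
    """Map each (sender, receiver) pair to the list of edges between them."""
    token_owner = {token: sender for sender, token in sent_log}
    edges: Dict[Tuple[str, str], List[Edge]] = defaultdict(list)
    for receiver, token, rssi, timestamp in received_log:
        sender = token_owner.get(token)
        if sender is None:
            continue  # the token cannot be attributed to a known sender
        # Parallel edges are expected: the same token is heard many times.
        edges[(sender, receiver)].append((token, rssi, timestamp))
    return dict(edges)


graph = build_multigraph(
    sent_log=[("A", "t1"), ("B", "t2")],
    received_log=[("B", "t1", -60, 0.0), ("B", "t1", -64, 0.1),
                  ("A", "t2", -71, 0.2), ("A", "t9", -80, 0.3)],
)
```

With one broadcast every 100 ms and one token change every 15 minutes, a single token can be broadcast up to 9,000 times, which is why edges must be keyed by the full (token, RSSI, timestamp) tuple rather than by token alone.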

Tokens missing either sender or receiver information were removed. Missing sender information could have been caused by technical glitches (e.g., device malfunctioning, phone overheating, and network congestion), while missing receiver information could have been caused by any of the same factors, or simply by no device having received them. Tokens with non-negative RSSI values or unrecognized sender/receiver IDs were also removed. In all, around 524,000 tokens, representing 86% of the total received, were removed.
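The three removal rules can be expressed as a single predicate. The sketch below assumes each record is a dictionary with hypothetical `sender`, `receiver`, and `rssi` fields, and that the set of recognized participant IDs is known; none of these names come from the app itself.

```python
def is_valid_record(record, known_ids):
    """Return True if a token record survives the cleaning rules.

    A record is dropped when the sender or receiver is missing, when either
    ID is not a recognized participant, or when the RSSI is non-negative.
    Field names are assumptions made for this illustration.
    """
    sender = record.get("sender")
    receiver = record.get("receiver")
    rssi = record.get("rssi")
    if sender is None or receiver is None or rssi is None:
        return False
    if sender not in known_ids or receiver not in known_ids:
        return False
    return rssi < 0


known_ids = {"A", "B", "C"}
records = [
    {"sender": "A", "receiver": "B", "rssi": -71},   # kept
    {"sender": None, "receiver": "B", "rssi": -65},  # missing sender
    {"sender": "A", "receiver": "Z", "rssi": -80},   # unrecognized receiver
    {"sender": "C", "receiver": "A", "rssi": 0},     # non-negative RSSI
]
cleaned = [r for r in records if is_valid_record(r, known_ids)]
print(len(cleaned), "of", len(records), "records kept")
```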

Because the participants all used their own devices, it was not possible for us to determine the root causes of all technical glitches; nor can we be certain that they were not specific to our experiments. However, app-store reviews and news reports reveal that many similar apps have struggled to resolve similar glitches in real-world settings. Thus, it seems relatively unlikely that our settings and/or app modifications caused them.

The remaining 25 valid smartphones included nine Android and 16 iOS devices, and over the whole course of the experiment transmitted 700 unique tokens and received around 85,000 unique tuples of (token, RSSI, timestamp), all of which were included in the analysis described below.

Estimated distances between pairs of devices were calculated directly by the Covid-Watch-TCN app based on Bluetooth RSSI data.

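The relation between the measured RSSI and the estimated distance, d, can be expressed as

$$\mathrm{RSSI} = A - 10\,n\,\log_{10}(d), \qquad \text{or equivalently} \qquad d = 10^{(A - \mathrm{RSSI})/(10n)},$$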

where n is the environment factor, and A is the reference signal strength at 1m. In both the Android and iOS Covid-Watch-TCN apps, n is set to 2. The A value is determined based on the sender's transmission power level, which is encoded in Bluetooth tokens. When receiving a token, the app extracts the transmission power level, and determines the value of A according to the range that level falls within. Then, the app estimates d from the measured RSSI and A, using the equation shown above.
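As an illustration of this estimate, the sketch below assumes the standard log-distance path-loss form, RSSI = A - 10 n log10(d), with n = 2 as stated above. The function name is hypothetical, and the mapping from transmission power level to A is not reproduced here; A is simply passed in.

```python
def estimate_distance(rssi_dbm, a_dbm, n=2.0):
    """Estimate distance in metres from a measured RSSI.

    Assumes the log-distance path-loss model RSSI = A - 10*n*log10(d),
    where A is the reference signal strength at 1 m and n is the
    environment factor (2 in the app, per the text).
    """
    return 10 ** ((a_dbm - rssi_dbm) / (10 * n))


# With A = -67 dBm, a reading equal to A maps to 1 m, and every further
# 20 dB of attenuation multiplies the estimate by 10 when n = 2.
for rssi in (-67, -73, -87):
    print(rssi, "dBm ->", round(estimate_distance(rssi, -67), 2), "m")
```

Because the estimate is exponential in the RSSI, a deviation of a few dB changes the estimated distance by a large factor, which is why coarse reference values matter.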

The Influence of Operating Systems

Figure 3 presents the transmission and reception statistics for all the devices used during the controlled experiment, classified by operating system. It shows that there was a significant difference between Android and iOS devices' token-transmission capabilities.

Most of the Android devices were able to transmit tokens to both Android and iOS devices directly, while all but one of the iOS devices relied on nearby Android devices to broadcast tokens to others. However, two out of nine Android devices and four out of 16 iOS devices failed to receive any packets at all.

According to the Covid-Watch-TCN app's specifications, iOS versions 13.4 and older do not support discoverability between third-party iOS apps in the suspended or background state. Therefore, when running in the background, iOS devices rely on Android devices as relays to broadcast their Bluetooth packets. On the other hand, iOS devices running the app in the foreground exchange Bluetooth packets with one another directly. Because of the iOS Bluetooth platform's reliance on relays from other devices, its RSSI and estimated-distance information cannot represent actual values. Therefore, we chose to focus only on directly transmitted tokens, i.e., Android-to-Android or Android-to-iOS, in our further analysis.

Figure 4 illustrates the relationship between the estimated and true distances for each sender-receiver pair of Android devices in Session 1. In that session, the participants stood still in an indoor environment, and within each task were 0.5d, 1d, or 1.5d apart. The standard deviation of the estimated distance increased as the true distance increased, suggesting that Bluetooth signals attenuate during transmission and become more easily influenced by radio noise. The correlation coefficient between true distance and estimated distance is 0.68. Figure 5 also indicates that the app tended to underestimate the distance between devices when the receiver was an Android device, potentially leading to high numbers of false-positive results. When the receiver was an iOS device, in contrast, the app tended to overestimate the distance, potentially leading to high numbers of false negatives.

Next, we tested if and how background radio noise or jamming affects token reception and RSSI.

Our results indicate that Bluetooth is susceptible to packet dropping due to jamming: Figure 6 shows that the devices received fewer tokens when jamming raised the level of background radio noise. The app was also less accurate when the participants were in a jammed environment. Applying a Wilcoxon Rank Test with α = 0.05, we confirmed that the estimation errors in the jammed and unjammed settings were drawn from two distinguishable distributions, implying that jamming did affect the app's ability to estimate distance.

Even the app itself became a source of noise when a large number of app users were gathered in the same place. Tasks 7-9 can be seen as more "noisy" conditions than Tasks 1-3, i.e., with 16 iOS devices placed between each pair of Android devices. As shown in Figure 5, as compared to the previous three tasks' results, estimates of distance in "noisy" environments became more inaccurate in general, even at distances of less than 2m. For these tasks, the correlation coefficient between true distance and estimated distance is just 0.26.
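The text does not specify which variant of the Wilcoxon test was used. As one plausible reading, the sketch below implements the unpaired Wilcoxon rank-sum (Mann-Whitney) test with a normal approximation and tie correction, using only the standard library. The sample values are invented for illustration and are not the study's data.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no
    continuity correction). Returns (U statistic for x, p-value).

    Assumes the two samples are unpaired and not all values are identical.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    r1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


# Hypothetical absolute estimation errors (metres) without and with jamming.
errors_clean = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors_jammed = [2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8]
u, p = rank_sum_test(errors_clean, errors_jammed)
print("U =", u, "reject at alpha=0.05:", p < 0.05)
```

A p-value below α means the hypothesis that both error samples come from the same distribution is rejected. For small samples an exact test, such as the one provided by a statistics library, would be preferable to this approximation.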

Additionally, in Tasks 1-3, the app recorded transmission events for 137 out of 170 device pairs, a rate of 81%. However, when more participants joined the experiment in Tasks 7-9, the number of recorded pairs dropped to 111, or 65%; i.e., about one-third of the sender-receiver pairs recorded no successful token transmission, which we attribute to the "noise" caused by iOS devices in the immediate vicinity.

On average, the phone battery dropped by 11.3% per hour in the controlled experiment. We also observed a greater battery drop in larger crowds: the per-hour drops for small and large groups were 10.4% and 29.6%, respectively.

Data collected in the semi-controlled experiment were also analyzed to evaluate the effectiveness of the Covid-Watch-TCN app in a spacious indoor environment. Collectively, over the two days of this second experiment, the participants' devices transmitted a total of 39,000 tokens and received 1.8 million.

A proximity event was deemed to have occurred if 1) two devices were detected by the app as having exchanged tokens at below a particular estimated-distance threshold, and 2) the time at which this exchange was recorded as occurring by the app was within 15 minutes before or after the time at which the same event was recorded by the researchers observing the conference and/or the GoPro videos.

A contact event between two devices was defined as a continuous proximity event lasting for a particular period, for example, 15 minutes. We further defined a strict contact event as one meeting the additional condition that every minute during the exposure period includes at least one proximity event. There were 67 proximity events documented by the four researchers during the experiment and extracted from the GoPro videos. If we set the distance threshold as 2m (commonly recommended as a social-distancing measure), only 16 of these proximity events were detected, implying a proximity detection rate of 24%. However, even when no distance threshold was set (i.e., no lower bound was placed on the RSSI value), the number of detected events only increased to 19, i.e., 28% of the total known to have occurred.

Under an exposure-duration rule of 15 minutes, meanwhile, the app could only detect 7.5% of the relevant contact events. Decreasing the exposure duration to 5 minutes and 1 minute resulted in only slight increases in the contact-detection rate: to 9.0% and 10.4%, respectively. Additionally, when the strict contact rule was applied, the app failed to detect any contact events at all. The proximity and contact detection rates at various RSSI and contact-duration thresholds are shown in Figure 7 and Figure 8, in which a measured RSSI of -80 dBm equates roughly to a 2m separation.
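The contact rules can be made concrete with a short sketch. The text does not state how much of a gap a non-strict "continuous" proximity event tolerates, so the `max_gap_min` parameter below is an assumption: with its default of 1 the function implements the strict rule (every minute of the exposure period contains at least one proximity event), and larger values give a relaxed rule.

```python
def has_contact(timestamps_s, duration_min, max_gap_min=1):
    """Check whether proximity events form a contact of the given duration.

    timestamps_s: times (in seconds) of proximity events for one device pair.
    duration_min: required exposure duration in minutes.
    max_gap_min:  largest allowed difference between consecutive occupied
                  minute buckets; 1 means no empty minute (strict rule).
                  This tolerance is a hypothetical parameter.
    """
    minutes = sorted({int(t // 60) for t in timestamps_s})
    if not minutes:
        return False
    start = prev = minutes[0]
    best = 1
    for m in minutes[1:]:
        if m - prev > max_gap_min:
            start = m  # gap too large: begin a new run
        prev = m
        best = max(best, m - start + 1)
    return best >= duration_min


# One proximity event in each of 15 consecutive minutes.
full = [m * 60 + 5 for m in range(15)]
# The same sequence with the event in minute 7 lost (e.g., a dropped token).
gappy = [m * 60 + 5 for m in range(15) if m != 7]

print(has_contact(full, 15))                   # strict 15-minute contact
print(has_contact(gappy, 15))                  # strict rule fails on one lost minute
print(has_contact(gappy, 15, max_gap_min=2))   # relaxed rule tolerates it
```

The example shows why the strict rule is fragile under packet loss: a single minute without a received token breaks an otherwise continuous 15-minute exposure.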

We also evaluated the proximity and contact detection rates for "the source of the virus". A total of 11 proximity events with this individual, each lasting at least 5 minutes according to our observations, were recorded: two via direct observation and nine via review of the GoPro videos. However, only four of these 11 proximity events were recorded by the app, even though none was fleeting. Moreover, among the 11 documented 5-minute contact events involving the "virus", none was detected.

During the span of the semi-controlled experiment, each device sent at least one token to every other device, and received at least one, with an average of around 1,000 tokens per device being sent, and about 38,000 per device being received. Figure 9 illustrates trends in the number of unique recorded sender-receiver pairs over a 90-min period on the first day of the experiment. This number steadily increased over time and converged to an upper bound of about 1,000, with 63% of all possible pairs represented. This indicates that, if one of its users remains in an indoor environment long enough, the app will be able to discover most nearby devices.
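The discovery curve described above can be computed by accumulating unique sender-receiver pairs in time order. The sketch below treats pairs as directed, so N devices yield N(N-1) possible pairs; that choice, the record layout, and the function name are assumptions made for illustration.

```python
def pair_coverage(records, n_devices):
    """Cumulative count of unique directed sender-receiver pairs over time.

    records: iterable of (timestamp, sender, receiver).
    Returns a list of (timestamp, pairs_seen, fraction_of_possible_pairs),
    with one entry each time a new pair is first observed.
    """
    total_pairs = n_devices * (n_devices - 1)
    seen = set()
    curve = []
    for timestamp, sender, receiver in sorted(records):
        pair = (sender, receiver)
        if sender != receiver and pair not in seen:
            seen.add(pair)
            curve.append((timestamp, len(seen), len(seen) / total_pairs))
    return curve


# Three devices give 6 possible directed pairs; 3 of them are observed.
records = [
    (0, "A", "B"),
    (1, "A", "B"),  # repeat of a known pair, does not extend the curve
    (2, "B", "A"),
    (3, "C", "A"),
]
curve = pair_coverage(records, n_devices=3)
print(curve[-1])
```

Plotting the second element of each entry against the first would give a curve of the kind described for Figure 9, which flattens once few new pairs remain to be discovered.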

We further classified the recorded device pairs according to their operating systems, as shown in Figure 10. Most (> 80%) of the Android devices were discovered by both nearby Android and iOS devices. As for iOS devices, 55% were discovered by iOS devices, with only 17% discovered by Android devices. The results are consistent with our findings in the controlled experiment that limitations of the iOS Bluetooth platform could significantly influence the transmission capability of iOS devices.

Among the 24 questionnaire respondents, about half indicated that they had encountered at least one technical problem, including the app crashing (n=7), unstable token reception (n=6), phone overheating (n=4), unexpectedly high energy consumption (n=3), login issues (n=3), and difficulty uploading their devices' token data (n=15). Nine of the 15 participants who had difficulty uploading tried again and eventually uploaded their data successfully.

Among these, the technical problem with the strongest negative influence on the respondents' intentions to use the app was high energy consumption, followed by phone overheating. We also found that, although marginally more respondents mentioned experiencing crash problems (n=13) than inefficient phone performance (n=12), the latter problem had a greater negative impact on their willingness to use the app.

The external conditions that the respondents selected most often as likely to affect their willingness to use the contact-tracing app were entering crowds of 100 people or more (n=19); current use of the app by epidemic investigators (n=19); regulations (n=16); and domestic-trip planning (n=16). The top two of these conditions were also the most positively influential on the respondents' willingness to use the contact-tracing app. Some of the participants even rated the influence of regulations on their willingness to use the app as negative.

Turning now to privacy issues, half of our respondents stated that their habits regarding Bluetooth and GPS functions would not change in the wake of our experiments, whereas half said that they would. However, since our questionnaire did not ask how or why Bluetooth and GPS usage affected the respondents' privacy concerns, we cannot draw any conclusions about this split in attitudes.

Surprisingly, only a small minority of our respondents expressed a belief that, due to using the focal contact-tracing app, their data (n=4) or privacy (n=5) might be unsafe, with the others either deeming them to be safe, or expressing no opinion on this matter. Among the minority, the top two data-security concerns cited were that the app developer might take advantage of security weaknesses (n=4), and that the app developer did not build in sufficient protections (n=3). Their top privacy concerns included their past routes being recognized (n=4), the developer's associates using the data for other purposes (n=4), and the developer itself using the data for other purposes (n=3).

Our empirical findings have four important implications for BDCT, discussed in turn below.

RSSI alone does not produce reliable estimates of physical distance, which aligns with the findings of previous studies in indoor positioning [26] and contact tracing [33]. In our experiments, the RSSI readings often deviated by −11.3 to 11.7 dB, resulting in distance estimates of 0.27 to 3.85 times the ground truth. Unreliable distance estimates lead to inaccurate proximity or contact detection.
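The link between the dB range and the distance ratios follows from the log-distance model with n = 2: assuming RSSI = A - 10 n log10(d), a deviation of δ dB in the exponent's numerator (A - RSSI) scales the estimated distance by 10^(δ/(10n)). The short check below, with a hypothetical function name, reproduces the two ratios quoted above.

```python
def distance_ratio(delta_db, n=2.0):
    """Factor by which the distance estimate changes for a deviation of
    delta_db decibels in (A - RSSI), under RSSI = A - 10*n*log10(d)."""
    return 10 ** (delta_db / (10 * n))


low = distance_ratio(-11.3)   # estimate shrinks relative to ground truth
high = distance_ratio(11.7)   # estimate grows relative to ground truth
print(round(low, 2), round(high, 2))
```

With n = 2, every 6 dB of deviation roughly doubles or halves the estimate, so the observed spread of about ±11.5 dB translates into nearly a fourfold error in either direction.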

While increasing the number of samples might reduce this variance, we observed systematic bias caused by contextual factors such as phone models and crowd size, in addition to those investigated in previous work, including wall geometry, phone orientation, and whether users were indoors or outdoors. These systematic biases would be difficult to eliminate in the absence of extensive, detailed prior knowledge of the context in which contact tracing would need to occur.

Although the app was unreliable in estimating the distance between app users, information about whether they are in the same indoor location or not could still be useful to the broader contact-tracing process.

Another observation was that variance increased with the density of the crowd: it was lower when the participants were farther away from each other. This could be because human bodies act as obstacles, and because wireless channels become congested when all devices in the room are transmitting signals simultaneously. In addition, the six Raspberry Pi devices that emitted tokens at a high rate in one session of our experiment, which we added to investigate possible jamming effects, had an impact similar to that of more crowded conditions. All else being equal, the variance was higher in the more "noisy" environment that resulted from the inclusion of these extra devices.

Reducing the token broadcast frequency (e.g., lengthening the broadcast interval from 100ms to 1s) in dense areas may alleviate packet loss due to interference, but its effect on proximity and contact detection remains to be investigated.

To be effective, apps need to be interoperable and produce consistent results regardless of what OSs, phone models, app configurations, and implementations are involved. Our experiments used the existing Android and iOS versions of the same app, and about half of our participants used Android, and the rest used iOS. To emulate realistic scenarios, we did not limit the phone models involved, apart from a requirement that all must support Bluetooth.

We found asymmetric results across phone models and versions. The differences we observed among phone models might be due to differences in Bluetooth chips, transceiver modules, and signal-processing methods, among other factors. This complicates interoperability by implying that each receiving phone may need to know the model and version of each sending phone if the app's detection accuracy is to be improved.

We observed that iOS devices tended to overestimate distances, while Android ones tended to underestimate them. The overestimation by iOS devices may have been caused by the calibration of the default reference RSSI value (i.e., A) at 1m across both versions of the Covid-Watch-TCN app. For the Android version, there were only three reference values; the iOS version had the same possible values as the Android version, except that its default value was greater by 10 dB, i.e., -67 dBm for Android and -57 dBm for iOS. These coarse ranges of transmission power levels could lead to inaccuracy in distance estimates.

The GAEN system [7] recommends that Android devices be calibrated to a typical iPhone according to their model designations. However, even with improved calibration to compensate for biases due to inter-device differences, inaccuracies caused by environmental factors may be difficult to eliminate in the absence of prior contextual knowledge.

In all our experimental tasks, participants' phone batteries drained quickly regardless of brand. This excessive consumption means that our research app would not be usable in real-world scenarios, even if people were willing to try. Some also complained that their phone overheated while running the app.

This extreme power consumption may be attributable to how our app handles Bluetooth. The early version of Covid-Watch, along with many other apps implemented before the release of the GAEN API, did not have native Bluetooth access and had to use hacks to bypass low-level restrictions. For example, iOS versions below 13.4 can only send tokens when either the sender or the receiver is running in the foreground. These hacks likely consume unnecessary resources, including energy. Although we were unable to test GAEN API-based apps, we anticipate that they will have better power efficiency.

Some participants also experienced app crashes or hangs, and thus could not broadcast or submit tokens. Although this was likely caused by our rapid development cycle and lack of testing on a variety of phone models and OSs, it is worth noting that similar issues have been reported by users of other contact-tracing apps.

Our survey results may indicate that people's willingness to use contact-tracing apps is rooted in self-protection concerns and/or a public-spirited desire to aid epidemic investigation work, but that the enforced use of such apps might nevertheless provoke opposition.

However, it should be borne in mind that most of our participants in the second experiment were students with information-engineering backgrounds and an interest in cybersecurity, who may have been less likely to worry about data-security issues than an equivalent-sized sample of the general public.

Additional studies are needed to address the following limitations: 1) Our evaluation was restricted to a specific implementation. Using the GAEN APIs might alleviate battery and interoperability issues. 2) Our logging code may have introduced additional overhead. 3) Our experiments were of short duration and not representative of the full range of real situations. 4) Non-contacts falsely identified as contacts (false positives) were not analyzed in our semi-controlled setting.

Benefits of Mobile Contact Tracing on COVID-19: Tracing Capacity Perspectives

Ethical Framework for Assessing Manual and Digital Contact Tracing for COVID-19

COVID-19 digital contact tracing applications and techniques: A review post initial deployments

Why many countries failed at COVID contact-tracing - but some got it right

Attitudes and Perceptions Toward COVID-19 Digital Surveillance: Survey of Young Adults in the United States

Ethical perspectives in sharing digital data for public health surveillance before and shortly after the onset of the Covid-19 pandemic

Exposure Notifications: Using technology to help public health authorities fight COVID-19

CovidWatch: Together, we have the power to stop COVID-19

Global Pandemic App Watch (GPAW): COVID-19 Exposure Notification and Contact Tracing Apps

Here are the countries using Google and Apple's COVID-19 Contact Tracing API

EU plans international coronavirus tracing network

Coronavirus: EU interoperability gateway goes live, first contact tracing and warning apps linked to the system

Exposure Notification - Bluetooth Specification

A Survey of Indoor Localization Systems and Technologies

Survey of Decentralized Solutions with Mobile Devices for User Location Tracking, Proximity Detection, and Contact Tracing in the COVID-19 Era

Apps Gone Rogue: Maintaining Personal Privacy in an Epidemic

Decentralized Privacy-Preserving Proximity Tracing

Assessing Disease Exposure Risk with Location Data: A Proposal for Cryptographic Preservation of Privacy

Mitigating Denial-of-Service Attacks on Digital Contact Tracing

A Survey of COVID-19 Contact Tracing Apps

Privacy Preservation of User Identity in Contact Tracing for COVID-19-Like Pandemics Using Edge Computing

Contact tracing mobile apps for COVID-19: Privacy considerations and related trade-offs

Mind the GAP: Security & privacy risks of contact tracing apps

Lightweight contact tracing with strong privacy

Indoor distance estimated from Bluetooth Low Energy signal strength: Comparison of regression models

A Hybrid Positioning System for Location-Based Services: Design and Implementation

Applications of the Internet of Things (IoT) in Smart Logistics: A Comprehensive Survey

A wearable magnetic field based proximity sensing system for monitoring COVID-19 social distancing

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies

Proximity Tracing App: Report from the Measurement Campaign 2020-04-09

OpenTrace Calibration

Coronavirus Contact Tracing: Evaluating the Potential of Using Bluetooth Received Signal Strength for Proximity Detection

Measurement-based evaluation of Google/Apple Exposure Notification API for proximity detection in a light-rail tram

Measurement-based evaluation of google/apple exposure notification API for proximity detection in a commuter bus

Quantifying SARS-CoV-2 Infection Risk Within the Google/Apple Exposure Notification Framework to Inform Quarantine Recommendations

Effectiveness of isolation, testing, contact tracing, and physical distancing on reducing transmission of SARS-CoV-2 in different settings: a mathematical modelling study. The Lancet Infectious Diseases

Modeling the effect of exposure notification and non-pharmaceutical interventions on COVID-19 transmission in Washington state

States Are Rolling Out COVID-19 Contact Tracing Apps: Months of Evidence From Europe Shows They're No Silver Bullet

Local COVID-19 contact tracing system launched in Slough

Contact-tracing app for England and Wales 'hampered by loss of public trust'; 2020

The COVID-19 pandemic and contact tracing technologies, between upholding the right to health and personal data protection

The UK's coronavirus app will launch without contact tracing

Manitoba still working on getting COVID-19 contact tracing app

Acceptability of App-Based Contact Tracing for COVID-19: Cross-Country Survey Study

Americans' perceptions of privacy and surveillance in the COVID-19 pandemic

Belief of having had unconfirmed Covid-19 infection reduces willingness to participate in app-based contact tracing

COVID-19 Contact-Tracing Technology: Acceptability and Ethical Issues of Use. Patient Preference and Adherence

Sociodemographic characteristics determine download and use of a Corona contact tracing app in Germany - Results of the COSMO surveys

This work was financially supported by the Ministry of Science and Technology (MOST) in Taiwan