title: Proximity Sensing for Contact Tracing
authors: Shankar, Sheshank; Chopra, Ayush; Kanaparti, Rishank; Kang, Myungsun; Singh, Abhishek; Raskar, Ramesh
date: 2020-09-04

The TC4TL (Too Close For Too Long) challenge is aimed at designing an effective proximity sensing algorithm that can accurately provide exposure notifications. In this paper, we describe our approach to modeling sensor and other device-level data to estimate the distance between two phones. We also present our research and data analysis on the TC4TL challenge and discuss the limitations associated with the task and with the dataset used for this purpose.

As economies open up, digital contact tracing is emerging as an important tool to help contain the spread of COVID-19 by providing exposure notifications to susceptible individuals who came in close proximity to infected individuals. There have been several proposals spanning different modalities, but Bluetooth is the most widely accepted technology for digital contact tracing: it is available on most phones currently in use and is actively supported by Apple and Google through their operating systems and exposure notification APIs. However, Bluetooth has certain limitations, as described in [7], [8]. We focus on the task of proximity sensing; predicting proximity accurately is the key to facilitating efficient contact tracing. Proximity sensing is concerned with predicting whether two individuals have been in close contact for a long duration, which may open the possibility of COVID-19 transmission. We specifically focus on the TC4TL challenge recently introduced by NIST.

Current approaches to automated exposure notification rely on Bluetooth Low Energy (BLE) signals advertising the presence of a device (or chirps) emanating from smartphones to detect whether a person has been too close for too long (TC4TL) to an infected individual. However, the received signal strength indicator (RSSI) of Bluetooth chirps sent between phones is a very noisy estimator of the actual distance between the phones and can be dramatically affected by real-world conditions [5]. In this challenge, we try to predict the distance using phone sensor data, particularly the RSSI values from the BLE signal logs, along with other factors that can logically be expected to affect these RSSI values.

We train and test our models using the datasets provided by NIST and MITRE. We train and test multiple networks with varying architectures, with the single goal of finding the right model to capture the subtle nuances in a phone's sensor data (and other factors) and predict the distance between phones accurately. We train deep-learning-based networks (LSTM, ConvGRU, etc.), as well as models based on Support Vector Machines and Decision Trees. All models are tested on the NIST test data (a subset of the full NIST dataset). The temporal Conv1D network gives us the most favourable results. We also run additional analyses (ablation studies, data analysis, and a study of the training and dev set distribution discrepancy) to assess the practicality of this problem and its future implications. We discuss our observations and hypotheses in Section VI.
The task in the NIST TC4TL Challenge is to estimate the distance and time between two phones given a series of RSSI values along with other phone sensor data. Current approaches to automated exposure notification rely on Bluetooth Low Energy (BLE) signals (or chirps) from smartphones to detect whether a person has been too close for too long (TC4TL) to an infected individual. Some of the identified factors affecting distance estimation from RSSI values are (1) the number and time spread of observed chirps, (2) the carriage position of the phones (i.e., hand, front pocket, back pocket, etc.), (3) bodies and barriers between phones, and (4) multi-path signals from surfaces (e.g., indoor vs. outdoor). To better characterize the effectiveness of range and time estimation using the BLE signal, the dataset collects Bluetooth chirp data as well as other phone sensor data (e.g., accelerometer and gyroscope) between various types of phones with simulated real-world variability. The dataset is divided into chunks of 4-second device interactions (sender/receiver) with corresponding readings for each sensor. For ease of analysis, the current version of the challenge restricts its focus to estimating the range (distance) and not the time duration. The initial experiments were conducted on a subset of the PACT dataset for training the model. However, we were subsequently asked to train the model on the MITRE dataset to prevent overfitting or underfitting.

Bluetooth and other mobile sensor data tend to be quite noisy. Hence, our main strategy is to exploit the temporal characteristics of the dataset more effectively and make sense of the data. We choose 150 as the number of time-steps within a 4-second interval, since this value, calculated as the mean number of time-steps over the entire dataset, is where we observe the largest number of samples. Every time-step is represented by a normalized fixed-length feature vector containing the most recent values obtained from each sensor. In addition, the metadata of the experiment (TXDevice, RXDevice, TXPower, Device Carriage & Activity) is one-hot encoded and concatenated to each time-step's vector. For the models that do not use a time-series input, these readings are concatenated into a single feature vector for each 4-second interval. A sketch of this preprocessing is given below.
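To make the preprocessing concrete, the following is a minimal sketch of how each 4-second interaction could be turned into a fixed-length sequence of 150 time-step vectors. The field names, the choice of padding short interactions with their last observed time-step, and the use of precomputed training-set statistics for normalization are assumptions made for illustration, not part of the official data format.

```python
import numpy as np

T_STEPS = 150  # fixed number of time-steps per 4-second interval

def build_interval_features(sensor_steps, metadata_onehot, mean, std):
    """Convert one 4-second interaction into a (150, D) array.

    sensor_steps:    list of per-time-step dicts holding the most recent
                     reading from each sensor (keys below are hypothetical)
    metadata_onehot: one-hot encoding of TXDevice, RXDevice, TXPower,
                     carriage and activity, repeated for every time-step
    mean, std:       per-feature normalization statistics, assumed to be
                     precomputed on the training set
    """
    rows = []
    for step in sensor_steps[:T_STEPS]:
        raw = np.array([step["rssi"],
                        *step["accelerometer"],   # (x, y, z)
                        *step["gyroscope"],       # (x, y, z)
                        *step["magnetometer"]])   # (x, y, z)
        sensor_vec = (raw - mean) / (std + 1e-8)
        rows.append(np.concatenate([sensor_vec, metadata_onehot]))
    x = np.stack(rows)
    if len(x) < T_STEPS:                          # pad short interactions
        pad = np.repeat(x[-1:], T_STEPS - len(x), axis=0)
        x = np.concatenate([x, pad])
    return x                                      # shape (150, D)

# For the non-temporal models (SVMs, decision trees, feed-forward nets),
# the same interval is flattened into a single vector: x.reshape(-1).
```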
In order to increase the size of the training data, we use mixup data augmentation [12]. However, it does not provide a significant improvement in overall performance. We subsequently create an optimal subset of the MITRE training set using k-NN (k-nearest neighbors), by selecting the k nearest training-set neighbors of each point in the NIST development set [4].

We experiment with Deep Learning based models (LSTM, CNN, etc.), Support Vector Machines, and Decision Trees.

1) Deep Learning based Models: We implement the following deep learning models:
• GRU and LSTM: We test GRUs and LSTMs because their architectures are suited to our time-series input format. Both LSTMs and GRUs can keep memory/state from previous activations, allowing them to remember features over long spans and allowing backpropagation through multiple bounded nonlinearities, which reduces the likelihood of vanishing gradients.
• ConvGRU: ConvGRU [10] is a GRU with Conv1D reset, update, and output gates. We find that the ConvGRU works better than a regular GRU, possibly because: a) the GRU is less complex, which helps mitigate overfitting on sparse input; b) short-term spatial dependencies in the input state vector are better exploited by the convolutional form of the GRU cell than by the fully connected layers of a standard GRU cell. Using the time-series input format, we implement a ConvGRU network and experiment with hyperparameters such as the number of layers, the number of units, and the kernel size; we also experiment with adding fully connected layers after the ConvGRU.
• Conv1D: We note the similarity of our learning task to the one tackled by Google's WaveNet [11], which leverages a 1D convolutional neural network to predict a sequential audio signal. Inspired by this approach, we implement three distinct variations of a 1D convolutional neural network, differing in how the network is regularized: 1D CNN + Dropout, 1D CNN + Dropout + Maxpool, and 1D CNN + Dropout + Dilation. In addition, we experiment with hyperparameters such as the number of epochs, batch size, weight decay, and learning rate for each of these variations.
• Feed Forward: Using the concatenated time-step input format, we implement a feed-forward neural network. We experiment with hyperparameters such as the number of layers, the number of neurons, different dropout percentages, and contrasting activation functions.

2) Support Vector Machines: Using the concatenated time-step input format, we implement variations of the support vector machine [3], specifically Nu-Support Vector Classification and C-Support Vector Classification.

3) Decision Tree based Models: Using the concatenated time-step input format, we implement XGBoost [2] and Random Forest classification [1].

All experiments were run on an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz server with 528 GB RAM and 48 cores, using a single Nvidia 1080 Ti GPU. All of the deep learning architectures are trained on the complete MITRE dataset. The XGBoost and SVM based models are trained on partial data, as they take too long to train on the full training set. All experiments are optimized using the Adam optimizer [6]. The temporal networks are built using PyTorch, whereas the Support Vector Machine [3] and Decision Tree based models are implemented using scikit-learn [9]. A sketch of the ConvGRU cell, as we interpret it, is given below.
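The text above does not pin down the exact ConvGRU architecture, so the following PyTorch sketch shows one plausible reading of it: a GRU cell whose reset, update, and candidate (output) gates are Conv1d layers applied along the feature axis of each time-step vector, in the spirit of [10]. The layer sizes, the zero-initialized hidden state, and the linear classification head over four distance classes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell with Conv1d gates instead of fully connected gates."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.reset = nn.Conv1d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        self.update = nn.Conv1d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        self.cand = nn.Conv1d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)

    def forward(self, x, h):
        # x: (batch, in_ch, feat_dim), h: (batch, hid_ch, feat_dim)
        xh = torch.cat([x, h], dim=1)
        r = torch.sigmoid(self.reset(xh))
        z = torch.sigmoid(self.update(xh))
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * n + z * h               # standard GRU state update

class ConvGRUClassifier(nn.Module):
    def __init__(self, feat_dim, hid_ch=16, n_classes=4):
        super().__init__()
        self.cell = ConvGRUCell(in_ch=1, hid_ch=hid_ch)
        self.head = nn.Linear(hid_ch * feat_dim, n_classes)

    def forward(self, seq):
        # seq: (batch, T=150, feat_dim), one row per time-step
        b, t, d = seq.shape
        h = seq.new_zeros(b, self.cell.reset.out_channels, d)
        for step in range(t):
            # each time-step vector is treated as a 1-channel "signal" over
            # the feature axis, so the Conv1d gates mix nearby features
            h = self.cell(seq[:, step].unsqueeze(1), h)
        return self.head(h.flatten(1))           # logits over distance bins
```

The convolutional gates keep the parameter count well below that of fully connected gates while still mixing neighbouring feature dimensions, which is consistent with the overfitting argument made for the ConvGRU above.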
In the results presented, we do not use mixup data augmentation [12] or the k-nearest-neighbors subset selection [4]; these are discussed in the ablation studies. For the Conv1D models, we notice that dilation does not appear to be effective (for the hyperparameters we tried). This result is unusual, since leveraging dilation was WaveNet's main point. Also, the architecture with the Maxpool layer is the only one that clearly showed learning (its validation loss decreased during training). We also notice that the ConvGRU achieves excellent results on the NIST dataset but loses significant performance on the MITRE dataset. We hypothesize that this is because:
• In the NIST dataset, the distributions of the dev (training) set and the test set are similar (even though the dev set is notably smaller than the test set), so modeling the dev set data distribution helped generalization.
• The MITRE training set has a very different distribution from the test set (as also indicated by our plots and k-NN analysis).
• Consequently, the time-series hypothesis does not generalize here, since the training set is effectively noise with respect to the test set. More fundamental physics-based modeling of the data is needed for this to work.

Among the other deep learning models, apart from the Conv1D and ConvGRU, the GRU and LSTM models have almost identical results; however, both have a very high nDCF. The Feed Forward model is better than the GRU and LSTM, but not as good as the Conv1D. We experiment with hyperparameters such as the number of hidden layers and neurons. We obtain our best results on the simplest Conv1D, using 1 Conv layer, 2 Linear layers, a hidden size of 64, and a kernel size of 3 (learning rate = 1.00E-05). The best results for the RNN-style models are obtained using 2 layers and a hidden size of 200 (learning rate = 3.00E-04), with a kernel size of 3 for the ConvGRU. The best results for the feed-forward network are obtained using 2 layers and a learning rate of 1.00E-04. The partially trained SVM and Decision Tree models tend to completely overfit (over 99% accuracy) on the training data.

In this work we also analyze the feasibility of the problem by investigating the data distribution and the different modalities present in the data, and how they affect the overall results.

a) Ablation studies: For the ablation studies, we train the model while varying the input data streams, with each data stream referring to a subset of the input to a given model. In this way, we try to estimate the role of different types of data and which sensors could be useful for the TC4TL task described previously. At the time of writing, our ablation study is not completely exhaustive, hence we share only a limited number of findings. Using the training scheme described in Section III, we exclude some of the sensors from the input pipeline and train on the rest of the data. This requires only a minimal adjustment in the first layer of the neural network to accommodate the varying input feature-vector size. We perform the first set of experiments by excluding device-level information such as TXDevice, RXDevice, TXPower, RXPower, device carriage, and activity. We do not observe any significant performance difference when including or excluding the device-level data, with performance remaining around 35% on the dev set. However, we find that training is less stable when this device-level information is included and more susceptible to overfitting on two classes instead of training uniformly across all four classes. In the next set of experiments, we train our model on different combinations of sensors and evaluate its performance on the dev set. As there is a large number of possible combinations, we only try those that make sense from a physics-based modeling view of the dataset. In the first experiment, we train the model with just the Bluetooth readings in the dataset, excluding all other sensor data. The performance does not drop significantly with Bluetooth data alone, but the divergence between the final training and testing accuracy increases, indicating higher susceptibility to overfitting. In the next experiment, we add sensors that make intuitive sense for forward modeling of the Bluetooth data: the gyroscope (capturing the orientation of the Bluetooth antenna), the accelerometer (capturing relative linear motion), and the magnetometer (capturing the variance of magnetic aberrations in the environment over time). Across all of the ablation studies, we obtain the best performance on both the training and test sets with this particular combination. We also experiment with including and excluding other sensor inputs such as altitude, attitude, and heading. A sketch of this sensor-masking scheme follows.
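As a concrete illustration of the sensor ablations, the sketch below drops the feature columns of excluded sensors and resizes the first layer accordingly. The column indices, the simple flattened-input feed-forward head, and the exact sensor grouping are hypothetical; they stand in for whatever layout the real feature vectors use.

```python
import torch.nn as nn

# Hypothetical layout of the per-time-step feature vector.
SENSOR_COLS = {
    "bluetooth":     [0],        # RSSI
    "accelerometer": [1, 2, 3],
    "gyroscope":     [4, 5, 6],
    "magnetometer":  [7, 8, 9],
    "altitude":      [10],
}

def select_sensors(batch, sensors):
    """Keep only the feature columns of the chosen sensors.
    batch: tensor of shape (B, T, D)."""
    cols = sorted(c for s in sensors for c in SENSOR_COLS[s])
    return batch[:, :, cols]

def make_model(sensors, hidden=64, n_classes=4):
    """Rebuild the network with a first layer sized to the reduced input,
    i.e. the 'minimal adjustment' mentioned above (flattened-input variant)."""
    in_dim = 150 * sum(len(SENSOR_COLS[s]) for s in sensors)
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, n_classes),
    )

# The combination that performed best in our ablations:
combo = ["bluetooth", "gyroscope", "accelerometer", "magnetometer"]
model = make_model(combo)
# x_sub = select_sensors(x, combo); logits = model(x_sub)
```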
b) Data analysis: We perform a few low-dimensional projections of the dataset to see whether there are underlying clusters in the feature vectors, or in their components, that capture the decision boundary with respect to the target classes. Figure 2 and Figure 3 show the principal component analysis of the two distributions, training and dev respectively. The training data clusters are tightly packed, with no clear decision boundary between any two labels. However, there could be a higher-dimensional embedding with separating hyperplanes across the classes.

c) Training and dev set distribution discrepancy: While the initial training and dev distributions were provided by the NIST team, the finalized training dataset was provided by the MITRE group and differs in several ways from the dev and test sets provided by NIST, such as different carriage positions, device models, and sensor scanning time periods. The dev set uses roughly five additional iPhone models compared to the training set, which also contains various iPhones but only a limited number of unique devices. This prevents any data-driven model from capturing invariances in the signal associated with the missing devices. Another significant challenge we encounter across all of our training runs and experiments is the lack of generalization of models trained on the training dataset. Training on the dev set and evaluating on the test set results in near-perfect accuracy, indicating a certain amount of overfitting. At the same time, training on the training dataset and evaluating on the dev set yields results only slightly better than random guessing over the classes. This clearly indicates that the training distribution does not lead to sufficient generalization. Therefore, to assess the efficacy of the dev set, we also try training on the dev set and evaluating on the training dataset, which likewise does not yield any significant improvement in generalization.

Informed by these results, we try to estimate the gap between the two distributions and how skewed they are from each other. One clear inconsistency between the two distributions concerns the distances, measured by the L2 norm, between pairs of feature vectors. We find the nearest training-set neighbor for every point in the dev set. For a significant number of dev points, the closest training point has a different class label. We then restrict the nearest-neighbor search to points with the same label and train on the resulting training subset, consisting of the two closest training points for each point in the dev set. The performance on this new training subset is not significantly different from that on the full training dataset. Furthermore, we measure the inter-class distances between the two distributions and find that the nearest neighbors across the two datasets have an average L2 distance of 24 when the neighboring points' classes differ; however, if we look for the closest pairs with the same class between the training and dev sets, the average L2 distance is around 200. This supports our earlier argument that the data discrepancy leaves the two distributions insufficiently similar to capture relevant, generalizable information. A sketch of this nearest-neighbor analysis is given below.
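The nearest-neighbor comparison described above can be reproduced with a short script along the following lines. This is a sketch, not the exact code behind our numbers: it assumes the intervals have already been flattened into feature matrices X_train / X_dev with coarse distance labels y_train / y_dev as NumPy arrays.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_distance_report(X_train, y_train, X_dev, y_dev):
    # Nearest training point (L2) for every dev point, regardless of class.
    nn_all = NearestNeighbors(n_neighbors=1).fit(X_train)
    dist, idx = nn_all.kneighbors(X_dev)                  # dist: (N_dev, 1)
    label_differs = y_train[idx[:, 0]] != y_dev
    print("dev points whose nearest train point has a different label: "
          f"{label_differs.mean():.2%}")
    print(f"avg L2 distance when labels differ: {dist[label_differs].mean():.1f}")

    # Nearest same-class training point for every dev point.
    same_class_dists = []
    for c in np.unique(y_dev):
        nn_c = NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c])
        d_c, _ = nn_c.kneighbors(X_dev[y_dev == c])
        same_class_dists.append(d_c[:, 0])
    print("avg L2 distance to nearest same-class train point: "
          f"{np.concatenate(same_class_dists).mean():.1f}")
```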
Nonetheless, a more thorough statistical analysis would be needed to confirm this argument, given other possibilities such as the existence of highly non-linear manifolds that could fit both distributions sufficiently well.

In this paper we present our approach to the TC4TL challenge and the corresponding results. We also report our findings and analysis of both the task and the dataset. The task was marked by several challenges due to the noise in the data distribution and the poor transferability from the training data to the validation data. We therefore believe that a proper physics-based model capturing the appropriate invariances would be a good step towards solving the task. We also consider interpretable modeling and a more extensive breakdown of the different sensor data as part of future work.

We would like to thank Parth Patwa for helping to draft this paper.

[1] Random forests.
[2] XGBoost: A scalable tree boosting system.
[3] Support-vector networks.
[4] Nearest neighbor pattern classification.
[5] Using Bluetooth Low Energy (BLE) signal strength estimation to facilitate contact tracing for COVID-19.
[6] Adam: A method for stochastic optimization.
[7] Coronavirus contact tracing: Evaluating the potential of using Bluetooth received signal strength for proximity detection.
[8] Measurement-based evaluation of Google/Apple Exposure Notification API for proximity detection in a commuter bus.
[9] Scikit-learn: Machine learning in Python.
[10] Convolutional gated recurrent networks for video segmentation.
[11] WaveNet: A generative model for raw audio.
[12] mixup: Beyond empirical risk minimization.