key: cord-0148031-lqdxbh8l authors: Ali, Mohammed Eunus; Eusuf, Shadman Saqib; Islam, Kazi Ashik title: An Efficient Index for Contact Tracing Query in a Large Spatio-Temporal Database date: 2020-06-23 journal: nan DOI: nan sha: 1f45ac5dee5b24bdae242ce65e3b4791ee627cf1 doc_id: 148031 cord_uid: lqdxbh8l In this paper, we study a novel contact tracing query (CTQ) that finds users who have been in $direct$ $contact$ with the query user or $in$ $contact$ $with$ $the$ $already$ $contacted$ $users$ in subsequent timestamps from a large spatio-temporal database. The CTQ is of paramount importance in the era of new COVID-19 pandemic world for finding possible list of potential COVID-19 exposed patients. A straightforward way to answer the CTQ is using traditional spatio-temporal indexes. However, these indexes cannot serve the purpose as each user covers a large area within the time-span of potential disease spreading and thus they can hardly use efficient pruning techniques. We propose a multi-level index, namely QR-tree, that consider both space coverage and the co-visiting patterns to group users so that users who are likely to meet the query user are grouped together. More specifically, we use a quadtree to partition user movement traces w.r.t. space and time, and then exploit these space-time mapping of user traces to group users using an R-tree. The QR-tree facilitates efficient pruning and enables accessing only potential sets of user who can be the candidate answers for the CTQ. Experiments with real datasets show the effectiveness of our approach. In this demonstration, we present a web-based system to illustrate the contact tracing query (CTQ) in a spatio-temporal database. Consider the following scenario. Let D be the historical mobility traces (or equivalently trajectories) of users for the last T days, obtained from GPS-enabled phones or mobile signals through triangulations. Thus, each user u ∈ D is represented as a sequence of time stamped locations {(l 1 , t 1 ), (l 2 , t 2 )....(l n , t n )} denoting her visited places l 1 , l 2 , ..., l n in different times t 1 , t 2 , ..., t n , respectively. Let q be the mobility traces (or the trajectory) of a newly identified COVID-19 infected user, which is the query in our system. The objective of the CTQ is to identify a set of users U ⊂ D who have been in direct contact with q at any point of time, and subsequently find users who came into contact with the already contacted users. To process variants of trajectory related queries such as range, join, nearest-neighbor, etc., a large body of trajectory indexing techniques have been proposed in the literature [5, 1, 7, 3] . These indexes are variants of traditional spatio-temporal indexes such as R-tree [2] or quad-tree [6] . These indexes are tailored for answering different types of queries. Though it may seem that the CTQ can be solved by using existing indexes designed for range queries, running repetitive range queries for different points of the query trajectory in the CTQ will make it extremely in-efficient. This is due to the following two reasons: (i) the mobility traces or historical trajectories of a user are usually a set of time-stamped dispersed point locations covering a large area, which is different than the normal point data such as POIs (Point of Interest) or trajectory data such as taxi trips, and thus very hard to prune using traditional indexes, (ii) if a user's travel history matches with the query at any instance then the user will be a candidate answer, and we need to run the process recursively as this user may have subsequently infected others. To answer the CTQ efficiently, we propose a two-level index structure, namely QR-tree, that exploits the strengths of both quad-tree and R-tree. In the first level, we use a quad-tree to partition the points of historical trajectories, where the location of a trajectory point is specified by the space − id of the smallest quad-tree block that contains the point. Similarly, the timestamp of each location of a trajectory is mapped to a time − id that corresponds to a time bucket containing the timestamp of the trajectory point. After that, we transform each trajectory as a sequence of (space − id, time − id) tuples. We consider this mapping as a transformation to a new coordinate system for the trajectory points. Next, we apply an R-tree on the trajectory points, represented by the new coordinate system, for grouping and saving them in disk. Finally, we present an efficient divide-and-conquer approach to answer CTQ, where a query is recursively divided and run through different levels of the index to find the users who match in both space and time. We develop an interactive web-based system that demonstrates all the above steps of processing CTQ. We use PostgreSQL, an open source database, to host our huge trajectory database at the back-end. Then we build our index and query processing techniques in the middleware. Finally, as a front-end, we build a web-based interactive system that uses a map-based visualization for taking inputs from a user and displaying the output results of CTQ. The rest of the paper is organized as follows. We formally define the CTQ in Section 2. We discuss our proposed index in detail in Section 3. Then we discuss the query processing algorithm and evaluation in Section 4. We provide the details of our demonstration system in Section ?? and conclude the paper in Section ??. Let D be the trajectory dataset, where each user u ∈ D is represented as a sequence of time stamped locations {(l 1 , t 1 ), (l 2 , t 2 ), ...(l n , t n )} denoting her visited places l 1 , l 2 , ..., l n in different times t 1 , t 2 , ..., t n , respectively. Let q be the query user who is infected with the Coronavirus. Condition of u meets v: } be the two sequence of time-stamped locations of u and v, respectively. Let spatialDist and temporalDist be the spatial distance and temporal distance measuring functions between two locations and two time-stamps, respectively. Now, for any i ∈ [1, n], j ∈ [1, m], if spatialDist(u.l i , v.l j ) ≤ ψ and temporalDist(u.t i , v.t j ) ≤ τ , then we say the trajectory u meets the trajectory v. Here, ψ and τ , are spatial (euclidean) and temporal distance thresholds, respectively. The objective of the CTQ is to identify the set of users U ⊂ D where each user u ∈ U has potentially been exposed to the Coronavirus. We define the set U as follows. 1. Let U 0 be the set of users where each user u ∈ U 0 met with q at any timestamp u.t in the last T days. We say that u was exposed at time u.t exposed (= u.t). Here, the subscript zero (0) in U 0 denotes that, the number of intermediate carriers (of the virus) between user q and u is zero (i.e u was infected by the query user q). 2. Let U 1 be the set of users where each user u ∈ U 1 met with any user v ∈ U 0 at timestamp u.t > v.t exposed . We define u.t exposed (= u.t) as the time of exposure for user u. 3. In a similar manner, we can define the set of users U i recursively where each user u ∈ U i was exposed to some user v ∈ U i−1 . Then, we define the set U as: Here, L is an integer that denotes the maximum allowed depth for the recursion and is passed as a parameter for CTQ. Based on the above definitions, we formally define our contact tracing query as follows. Definition. CTQ. Given a set D of user trajectories, a COVID-19 infected user trajectory q, a spatial proximity threshold ψ, a temporal proximity threshold τ , and an integer L, a CTQ query finds a set of users U ⊂ D such that U = L−1 i=0 U i ; where U i is the set of users who were exposed to the query user through 'i' number of intermediate carrier users. The trajectories in our datasets can be very long (e.g., last 14 days of mobility traces of each user) and may cover large areas. Using an R-tree to index the two spatial and one temporal dimension of these trajectories might not be useful as each trajectory's MBR (Minimum Bounding Rectangle) will most likely overlap with too many other trajectorys' MBRs, making the pruning scheme of the R-tree ineffective. On the other hand, if we use a quadtree to index all points of the trajectories, the points of a single trajectory may end up in many quadrant of the quadtree blocks, and thereby making it hard to decide which of those trajectories should be stored together in a disk block to facilitate faster retrieval of candidate users. The key intuition of our proposed index is, the trajectories whose points are co-located at the overlapping time-instant are likely to match with the same query. Based on this observation, we present a two-level index, the Quad R (QR) tree, that combines the strengths of both quadtree and R-tree. First, the spatial data space is recursively partitioned using quadtree, where each leaf quadtree block does not contain more than θ points. Then we use a space filling curve, specifically a z-curve (Morton order), to number these leaf quadtree blocks. We call such a number, the spatial − id of the block. Thus, in the spatial domain each trajectory is represented as a list of spatial − ids. Similarly, each timestamp of a trajectory is mapped to a time bucket and assigned a number temporal − id. Thereby, each trajectory can now be represented as sequence of (spatial − id, temporal − id) tuples. This new mapping of trajectories can be seen as a transformation to a new coordinate system, where x-axis represents spatial dimension and y-axis represent temporal dimension, and each trajectory is represented as a set of points in that space. In the next step, we use an R-tree to group trajectories based on their sets of points in the transformed space. Essentially, each set of points in this new space is represented as an MBR, and the R-tree groups close-by MBRs in a leaf node. Each leaf node of the constructed R-tree is stored in a disk-page. We maintain this disk-page id in all the corresponding leaf-blocks of the first level quadtree that contain a point of the trajectories stored in this disk-page. In the quadtree block, we also maintain associated temporal ids denoting the time range of trajectory points stored in the corresponding disk-page. Note that, we do not keep the hierarchical structure of R-tree for query processing, rather we only use the R-tree for grouping of similar trajectories in the transformed space. Figure 1 and Figure 2 show the construction process of the QR-tree. Figure 1 (a) shows an example with four user trajectories, {u 1 , . . . , u 4 }, where θ = 2. The space is first divided into four quadrants Q 1 , . . . , Q 4 . As Q 2 contains more than 2 trajectory points, this block is further divided into Q 2.1 , . . . , Q 2.4 . We then apply zordering to number these quadtree blocks as b 1 , b 2 , ..., b 7 . After that each time-stamp of points in the trajectories is assigned time-bucket number between t 1 and t 6 . After that Figure 1(b) shows the new representation of trajectories u 1 − u 4 as a sequence of (b i , t j ) tuples. These points are then mapped into a new co-ordinate system in a two-dimensional space (Figure 2(a) ), where we can see points of four trajectories in four different colors, and each set of points of a single trajectory is represented as an MBR. These MBRs are grouped together to form an R-tree. Each leaf level node, R 1 , R 2 corresponds to a disk-page. Finally, we maintain these disk-page references in different level quadtree blocks of the QR-tree, as shown in (Figure 2(b) ). For example, with Q 2.4 (b 5 ), disk-page id R 2 is assigned along with time-bucket ids t 5 and t 8 . We make further improvement on the proposed QR-tree index, where we augment the index by adding another top-level quadtree. The intuition behind adding this top level quadtree is two fold: (i) it partitions the entire sets of trajectories into different groups based on their extents, thus the index will have better pruning capability, (ii) since a longer trajectory will most likely contain more points than a shorter trajectory, maintaining different length trajecotory in a single R-tree is challenge as trajectories of different length may occupy different storage spaces in the disk. In our proposed Q 2 R-tree, a trajectory is stored under any non-leaf or leaf nodes based on their extents. In this case, we recursively partition the space, and a trajectory is stored in a quadtree block that fully contains it. Thus, long trajectories are stored in the upper level quadtree blocks than shorter trajectories are stored in lower level trajectory blocks. For all the trajectories in a single quadtree block, we apply QR-tree strategy to organize them in disk. Since we use two quadtrees and one R-tree in the index, we refer this index as Q 2 R-tree. Input: A quadtree node N of QR-tree, a COVID-19 positive user trajectory q Output: A set U of user trajectories suspected to be exposed by q 1.6 N children ← children(N ) 1.7 q children ← extendedIntersection(N children , q) for Nc ∈ N children , qc ∈ q children do 1.8 U ← U matchCT(Nc, qc) 1.9 return U In this section, we present an algorithm for processing CTQ using our proposed QRtree. Intuitively, the quadtree of QR-tree supports faster range query around query trajectory points, while R-tree grouping ensures the lower I/O overhead. We apply a spatial pruning followed by a temporal pruning using QR-tree, where the irrelevant quadtree nodes are pruned first, and then the time buckets are used to further prune the R-tree blocks to be retrieved. For simplicity, we present the first level contact tracing, where the task is to find users who were directly exposed to q. Algorithm 1 describes the pseudocode for a divide-and-conquer algorithm for the CTQ. A user u can be infected by q, if a point of u is within a threshold distance ψ and within a threshold time τ from any point of q. Thus, we consider an extended minimum bounding rectangle (EMBR) (in terms of space) of every points of q to include the infectious region of q. Initially, the function matchCT(·) is called with the root node N of the QR-tree and q. It finds the relevant child nodes of N that intersect with q (or EMBRs of q) in the function extendedIntersection(·) (Line 1.7). Thus, quadtree nodes that are within ψ spatial distance threshold from q are considered. If a child node does not intersect with the EMBR, it can be safely pruned. Otherwise, each unpruned child node N c of N and the corresponding components of q, are passed to matchCT function according to Algorithm 1 (Line 1.8). The recursive method has two base conditions: (i) when q is empty (there is no point left in that subspace for repeated division (Line 1.2)); and (ii) when N is a leaf node. In case of a leaf node, the time buckets corresponding to the points in q are calculated with the function extendedTimeWindows(·). Then the exposure of the trajectories stored in the disk blocks mapped to the node N and entries of temporal bucket are computed with evaluateContacts(·). The function evaluateContacts is used to determine which trajectories meet with q. To compute it, first we need to retrieve trajectories which have transformed coordinates (N, t), for each entry t ∈ t b list . So we look up our in memory QR-tree index and obtain a list of relevant R−tree nodes (i.e. disk block ids). We the fetch the trajectories stored in those disk blocks. For each user trajectory t i ∈ T r , we compute whether the user meets with q. The trajectory t i is included in the exposed set U if it meets with q. The above algorithm supports contact tracing by passing the depth level L in matchCT(·) as a recursion depth parameter. Initially the parameter is set to 0. Then in the aforementioned second base condition, we can call matchCT(·) recursively for each of the computed exposed trajectories with depth parameter incremented by 1, until it has reached L. In this section, we compare the QR-tree with a baseline (BL) approach, where we use a 3D R-tree (for location and timestamp) for indexing. We use it as our baseline because, in contrast to other methods, a trajectory is saved in an Rtree leaf as a single object. This is ideal for retrieving the whole trajectory during the processing of CTQ. We use the mobility traces from the CDR data collected by Grameenphone Ltd between June 19, 2012 and July 18, 2012 [4] , hereafter referred to as BD Cellphone, as our default dataset. Besides, we use Foursquare check-in dataset [8] of New York city, hereafter referred to as NYF, to evaluate performance of CTQ in a different spatio-temporal domain. We use JDK 1.8 for implementing our algorithms, which were run in Intel core i5-3570K processor (3.4 GHz) and 8 GB of RAM. Performance Evaluation and Parameterization. The parameters we varied, their ranges and default values (in bold) are shown in Table 1 . We have varied a single parameter in each experiment while the others are assigned their default values. We measure the impact of the parameters on runtime and I/O cost i.e. # of disk blocks accessed in CTQ processing. We configure the quadtree nodes to hold upto 128 points, and R-tree blocks to hold upto 4 trajectories. For each set of experiments, we run 100 CTQ and present the average result. (Figure 3c ). As the number of points in query trajectory increases, more user trajectories at different blocks are expected to be processed. So runtime as well as I/O should increase in both the approaches, as reflected in the graph (Figure 3a, 3c) . The performance of the QR-tree is significantly important when we consider multiple levels of CTQ, that is, when we consider exposure from already exposed users upto a certain level, instead of confirmed patients only. The QR-tree outperforms baseline approach by 2 − 3 orders of magnitude in terms of runtime ( Figure 5a ) and by around 1 − 2 orders of magnitude in terms of I/O cost ( Figure 5b ) when we vary maximum recursion depth level, L. More importantly, note that, the QR-tree can provide results in tens of seconds in case of upto three levels of exposure while the baseline would require thousands of seconds to do that. The benefits of I/O may seem misleading for higher depth levels (Figure 5b ) because the CTQ processing gets saturated in terms of disk block access, i.e. it accesses almost all the blocks in both approaches (baseline being marginally higher) to retrieve potential candidate trajectories. For this reason, the baseline approach has a somewhat flat tail for already accessing all the disk blocks. But running the experiment with higher number of trajectories to demonstrate this I/O gain is not feasible because of the intractable runtime of the baseline method. Besides, instead of default temporal distance threshold (τ ) value of 15 minutes, we have used τ = 1 minute for running this experiment to keep the results demonstrable. The NYF dataset is significantly smaller than the BD Cellphone dataset. We report only the impacts of varying spatio-temporal distance thresholds in the experiments with this dataset since varying the other parameters would be of little value for the dataset size. (iii) Spatio-temporal thresholds (ψ, τ ): The QR-tree works better than the baseline by at around 2 orders of magnitude in terms of runtime (Figure 6a, 6b) and by around 1 − 2 orders of magnitude (Figure 6c, 6d) in terms of I/O cost. So the performance of QR-tree is comparatively even better for NYF dataset. The I/O graph of baseline looks flat because all the disk blocks have been accessed by it whatever the parameter values are. This is because the dataset spans over a longer temporal domain than that of BD Cellphone. In a real system, the performance gain by our proposed index will largely be attributed to the lower I/O cost. This is not simulated in the runtime experiments. So the merits The further enhancement we have proposed on the QR-tree, namely the Q 2 R-tree works slightly better in terms of I/O cost specially when we deal with larger number of trajectories, as demonstrated in Figure 7 . Note that, this experiment is run by indexing 50k, 100k, 150k and 200k trajectories respectively. The Q 2 R-tree achieves 2 − 5% reduction in the number of disk blocks accessed, specially for higher number of trajectories. So, though the CTQ processing using the Q 2 R-tree needs multiple levels of tree traversal, marginally lower I/O overhead can eventually result in better runtime performance as well, which is subject to further experiments in real systems. We have proposed a novel CTQ in the context of spatio-temporal databases and developed a multi-level index, namely QR-tree, to efficiently process the CTQ. Experimental results show that the QR-tree based approach outperform the baseline by 1-2 orders of magnitude both in terms of processing time and I/O. In future, we plan to develop a system based on the proposed index and make it available for the community. The maximum trajectory coverage query in spatial databases The r*-tree: An efficient and robust access method for points and rectangles Indexing large trajectory data sets with seti Estimating travel time of dhaka city from mobile phone call detail records Indexing Historical Spatio-Temporal Data The quadtree and related hierarchical data structures Trajectory similarity join in spatial networks Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns