key: cord-0544585-xmn1xyul authors: Zhang, Pan; Calderon, Wilfredo Torres; Lee, Bokyung; Tessier, Alex; Bibliowicz, Jacky; Calin, Liviu; Lee, Michael title: Contact Area Detector using Cross View Projection Consistency for COVID-19 Projects date: 2020-08-18 journal: nan DOI: nan sha: 029409bcd00b3649b74a5a4cd3078517cedbdbaf doc_id: 544585 cord_uid: xmn1xyul

The ability to determine what parts of objects and surfaces people touch as they go about their daily lives would be useful in understanding how the COVID-19 virus spreads. Determining whether a person has touched an object or surface from visual data such as images or videos is a hard problem. Computer vision 3D reconstruction approaches project objects and the human body from the 2D image domain to 3D and perform the 3D space intersection directly. However, this solution would not meet the accuracy requirements of applications due to projection error. Another standard approach is to train a neural network to infer touch actions from the collected visual data. This strategy would require significant amounts of training data to generalize over scale and viewpoint variations. A different approach is to ask whether a person has touched a defined object. In this work, we show that the solution to this problem can be straightforward. Specifically, we show that contact between an object and a static surface can be identified by projecting the object onto the static surface through two different viewpoints and analyzing their 2D intersection. The object contacts the surface when the projected points are close to each other; we call this cross view projection consistency. Instead of performing 3D scene reconstruction or transfer learning from deep networks, the only requirement is a mapping from the surface in the two camera views to the surface space. For a planar surface, this mapping is a Homography transformation. This simple method can be easily adapted to real-life applications. In this paper, we apply our method to office occupancy detection, using the contact information from an office desk in a meeting room to study COVID-19 transmission patterns.

COVID-19 virus transmission occurs, in part, from touching contaminated objects or surfaces. The ability to determine when and where a person touches an object in space is therefore essential in understanding the spreading path of the virus and helping reduce the spread (for example, helping facilities staff prioritize their cleaning). A fast and accurate method for contact tracking is urgently needed to curb the spread of COVID-19. Computer vision techniques are ideal tools for automating this process and generating insightful statistics: they are efficient, accurate, and adaptable, and video sensors can be placed safely and unobtrusively in the space. The task of answering whether a person has touched something is challenging, but it has been solved before: the solution has been actively applied in automated grocery store products such as Amazon Go. However, implementing such a system remains out of reach for everyday use cases without the support of a team of engineers and researchers, since many sensing technologies need to work together with customized algorithms. This paper focuses on solving this problem using only two commodity cameras, simple algorithms, and little calibration. Our method gives a solution that can be made ubiquitous and implemented inexpensively. Two straightforward approaches might solve this problem.
The first is to reconstruct the 3D human joints and the object from 2D camera captures, and then compute the touched areas from the 3D intersections between the human end joints and the object. Even with a correctly calibrated camera system, errors can still be introduced during the 3D reconstruction steps. Furthermore, the reconstruction requires multi-view object identification, which is still an open research area, and this adds an extra challenge to the task. The errors introduced by these issues give inaccurate 3D reconstructions, which in turn produce inaccurate intersection results. The second approach is to train a neural network to detect human touch actions. The network could be trained to perform a regression task in both the spatial and temporal domains. If done correctly, a satisfactory result could be achieved with a single-camera setup. However, challenges such as creating a large-scale human-to-object contact dataset and building methods that generalize make this deep learning approach very hard to implement in the short term to address urgent application needs.

Our key observation is that the question can be reframed as whether a person has touched a defined object, and we show a straightforward method that solves this contact detection problem. In particular, for a surface that can be viewed from two different viewpoints, we analyze the projections of the two viewpoint images onto the surface plane. Assuming there is only one object in the scene, if the object's projections from the two camera views intersect on the surface plane, the object is considered to be in contact with the surface. We call this cross view projection consistency, as shown in Figure 1. We applied this method to determine the contact area of a desk in an office meeting room. Since the desk surface is planar, we can use a Homography mapping between the two views of the surface. Furthermore, we extend our method to find the mapping for non-planar surfaces. Because of the uniqueness of the mapping, one can create the mapping from a recorded video of an object moving on the surface until its track covers the whole surface. The object center detected at the same time in the images from the two cameras then gives the correspondence of that surface point between the two camera views; we call this surface mapping calibration. We use the contact information gathered from the method in a case study to understand the COVID-19 spread pattern. In particular, we generate contact heat-maps of the desk surfaces in the office meeting rooms to show the usage frequency of the desks, as shown in Fig. 5, and an occupancy detector, as shown in Fig. 6. In summary, our contributions are: i) a simple, accessible method that finds the contact area using commodity hardware and an easy setup; ii) an easy mapping calibration method for non-planar surfaces; iii) an application of our method that might be used for the study of COVID-19 transmission patterns in the office.

Human object contact area detection: Human-computer interaction (HCI) has played a significant role in developing interactive means to improve and smooth communication between computers and humans. The evolution from keyboards to interactive projection surfaces for obtaining human input is an example that highlights this progress. However, such systems rely on more sophisticated software and hardware (e.g., stereo [1, 16] and RGB-D [19, 13] cameras) to acquire the data.
Agarwal [1] proposed machine learning and geometric models to capture finger-surface contact using an overhead stereo camera. Matsubara et al. [16] reasoned about the shadows produced by the fingers when hovering over or touching a surface; they presented a machine learning algorithm using the input from an infrared (IR) camera and two IR lights to identify finger-surface contact. The strategy proposed in this project relies on a simple but effective way to identify contact by leveraging the availability of standard monocular cameras typically used in office settings.

Action recognition based approach: Vision-based action recognition focuses on computing robust action features for the correct identification of human actions. Different from action classification [4, 21], action detection or action segmentation [11, 7, 5] carries an additional level of difficulty by trying to identify the start and end time of an action in the temporal dimension. Carreira [4] presented the advantages of using a pre-trained model for action classification. Yan [21] explored the temporal component using body pose key-points as input through a Spatial-Temporal Graph Convolutional Network (ST-GCN) for the same task. The approach for the temporal localization of actions introduced by Farha [7] implemented a Multi-Stage Temporal Convolutional Network (MS-TCN) that refines its prediction at every stage to classify and find the start and end time of each action. The work of Chen [5] extended this idea by incorporating self-supervised domain adaptation (DA) into the MS-TCN network. The objective of these approaches differs from ours in that action recognition does not require identifying the location of the surfaces touched by a human, but rather understanding the sequence of movements of a person's appearance or limbs associated with a labeled activity.

3D pose estimation based approach: 3D human pose estimation of single and multiple persons in individual frames and videos continues to attract significant attention [20, 2, 17, 10]. Large-scale benchmark datasets and deep learning models have driven substantial progress [9]. However, the large variety of human poses, occlusions, and fast motions still represent significant challenges for this research area. State-of-the-art (SOTA) methods use 2D key-point locations (e.g., wrists and ankles) as input, paying less attention to the position of the palm and fingers, which are central for human-to-surface contact analysis. Even with successful identification of the 3D human pose, complete contact detection requires 3D scene understanding. Whether using a precomputed 3D reconstruction or a 3D CAD model, both would need to be continuously updated when movable objects (e.g., furniture) change location. These limitations and the intrinsic ambiguities of 3D human reconstruction would affect the accurate identification of surface-human contact.

In this section, we first introduce cross view projection consistency, which is the theory behind our method. We then give an informal proof to illustrate its correctness and show that, in theory, its assumption can be relaxed for our use case. In section 3.1, we describe the algorithm we use for desk surface contact recovery. In section 3.2, we introduce the surface mapping calibration method for finding mappings for non-planar surfaces.

Cross view projection consistency: Given a surface that can be observed from two cameras at different locations, assume a small ball exists in the space.
When the ball overlaps with the surface in both camera views, we can project the ball's location onto the actual (3D) surface to compute the intersection points. The intersections will be closest to each other if and only if the ball is in contact with the surface; we call this projection consistency.

Informal proof for illustration purposes: Since a general surface can be decomposed into small planar patches, showing that the property holds for a planar patch implies correctness for all surfaces. For a planar patch or a planar surface, simple observations prove cross view projection consistency, as shown in Fig. 2. Given a planar surface, a spherical object such as a basketball, and two spotlight sources that illuminate the area from different directions that are not perpendicular to the surface normal, the shadows of the ball are closest to each other only when it touches the ground. The shadows move away from each other as the ball rises into the air. If we replace the spotlights with cameras, the shadows on the surface are the maps of the intersections.

Single object assumption: Cross view projection consistency holds when there is only a single object in the scene. Having more than one object in the scene causes ambiguities, as shown in Fig. 3, unless we have a way to identify the same objects across different camera views, which is still ongoing research. Without camera calibration and 3D reconstruction, or machine learning-based texture mapping, this is challenging to solve. However, the assumption might be relaxed under the condition that the camera is far away relative to the object, which is equivalent to saying that the object appears small in the scene. We found that under this condition, the chance of ambiguities being created by more than one object is minor. Here, we use a toy example to demonstrate that the probability of ambiguity is very low; a formal proof would require radiometry theory beyond the scope of this paper, so we use loose terms for demonstration purposes only. Assume two freely moving objects in a scene viewed by two cameras; then, for a given small patch on the surface, there is only one cluster of rays from each camera through which that patch is visible in that camera's view. Ambiguity happens only when the two objects block both clusters of rays simultaneously, making the patch invisible from either camera view. The probability of this occurring is roughly the volume of the two ray clusters over the volume of the space, which is close to zero when the objects are small in the view. We claim that for detecting hand-object contact this condition generally holds, as people's hands are small in the image and can be modeled as points or small patches.

Using cross view projection consistency, this section presents the algorithm for finding contact areas on the desk. Since the desk surface is planar, we use a Homography to map surface points between the two views. Assume a two-camera setup, a plane α, and an object m at point p in 3D space, at a time t when m can be viewed by both cameras. This setting can be translated into the formulation

    p_x = F_d(p_1, H(p_2)),

where p_1 is the center of the object in the camera 1 view, p_2 is the center of the object in the camera 2 view, H is the Homography that maps the plane α from the camera 2 view to the camera 1 view, and F_d is the consistency checking function that returns p_1 if the distance between p_1 and H(p_2) is smaller than d and returns empty otherwise. Here p_x is the location on the plane α being contacted by object m at time t, expressed in the camera 1 view.
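As a concrete illustration, below is a minimal sketch in Python of the Homography mapping and the consistency checking function F_d. This is our own illustrative code, not the authors' implementation: the function names, the four-point homography estimation, and the use of the Manhattan distance (which matches the distance used later to calibrate d) are assumptions.

import numpy as np
import cv2

def estimate_homography(pts_cam2, pts_cam1):
    # Estimate H, the homography that maps points on the desk plane from the
    # camera 2 view to the camera 1 view, from four or more point
    # correspondences (e.g., the desk corners marked in both views).
    H, _ = cv2.findHomography(np.float32(pts_cam2), np.float32(pts_cam1))
    return H

def map_to_cam1(p2, H):
    # Apply H to a single 2D point detected in the camera 2 view.
    q = cv2.perspectiveTransform(np.float32([[p2]]), H)
    return q[0, 0]

def consistency_check(p1, p2, H, d):
    # F_d: return p1 (the contact point in the camera 1 view) when p1 and
    # H(p2) agree to within the threshold d, and None otherwise.
    if np.abs(np.float32(p1) - map_to_cam1(p2, H)).sum() < d:
        return np.float32(p1)
    return None

A single call such as consistency_check(wrist_cam1, wrist_cam2, H, d) then decides whether a detected wrist is in contact with the desk in that frame.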
Given the formulation above, we use Algorithm 1 to detect the contact area on the surface in our implementation.

Algorithm 1: Contact area detection.
Result: a set Q of mappings from timestamps to contact points.
Q = ∅;
for t in all timestamps do
    create an empty point set P = ∅;
    for each object center p_1 detected from camera 1 do
        for each object center p_2 detected from camera 2 do
            if F_d(p_1, H(p_2)) is not empty then add p_1 to P;
        end
    end
    if P ≠ ∅ then add the mapping t → P to Q;
end

Noise from the detection: The implementation of the contact area detection algorithm requires objects to be detected from both camera views. The detection algorithm can be chosen from object joint detection [6, 3], instance segmentation [8], optical flow detection [15, 18], or simply computing the intensity variance pattern on the surface. We model the detected object center p_x as p_x ∼ N(p_µ, σ²), where p_µ is the true object center and σ² is the variance. Fig. 4 plots the smallest distance between p_1 and H(p_2) over time for different values of the standard deviation, indicating how the noise level can change the result.

Fig. 4. Smallest distance between p_1 and H(p_2) of the wrist detected by adding different noise models to the detection; the unit on the y axis is pixels, the unit on the x axis is 0.04 second.

Set distance threshold: Every object has a volume, so even for a perfect contact on the surface, with no error introduced when computing the object centers from the two cameras and no error introduced in the Homography mapping, the two centers p_1 and H(p_2) will not be the same. We therefore need to set the parameter d for the consistency checking function F_d. The parameter d needs to be tweaked according to the scene setup, but once set it does not need to be changed. One simple way to do this is by observation: select a video clip in which contact happens and compute d as the Manhattan distance between points in the camera 1 view and the corresponding points in the Homographically transformed camera 2 view. A more accurate way is to perform a calibration when someone has access to the scene: go into the scene, keep one hand visible, contact the desk, and swipe across the entire desk surface. From this video, a map from surface patches to the d threshold at contact is calculated. This follows a very similar idea to the surface mapping calibration introduced below. Explicitly, in our study of office desk contact area detection shown in the applications section, we compute a d map for every point of the desk surface in the camera 1 view over the entire desk area. In this setup, we rasterize the desk surface into small patches. Then, we select videos with only one person in the scene (the videos are selected using human tracking results). From these videos, for each wrist joint that falls into a patch in the 2D image, we collect the distance between the joint point in the camera 1 view and the joint point from the camera 2 view Homographically transformed into the camera 1 view. Repeating this over time yields a bag of distance samples for each patch, from which we select the d value. From observation, we find that people spend most of their time resting their arms on the desk, so most of the collected values should correspond to moments of contact. We can therefore bin the values into histogram intervals and use the interval that receives the most samples to set d for that patch. For patches with no samples or too few samples, the d value can be interpolated from neighboring patches.
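To make the per-patch threshold selection concrete, here is a minimal sketch, again our own illustrative code under the assumption that the distance samples have already been collected per patch as described above; the bin width, the minimum sample count, and the use of the histogram mode are illustrative choices.

import numpy as np

def calibrate_patch_thresholds(samples_per_patch, bin_width=1.0, min_samples=10):
    # samples_per_patch: dict mapping a patch id to the list of distances
    # between p_1 and H(p_2) collected for wrist detections in that patch.
    # Returns a dict mapping patch id to its d threshold; patches with too few
    # samples get None and are later interpolated from neighboring patches.
    d_map = {}
    for patch, dists in samples_per_patch.items():
        if len(dists) < min_samples:
            d_map[patch] = None
            continue
        edges = np.arange(0.0, max(dists) + 2 * bin_width, bin_width)
        counts, edges = np.histogram(dists, bins=edges)
        k = int(np.argmax(counts))
        # Resting arms dominate the samples, so the most populated bin is taken
        # as the distance observed at the moment of contact.
        d_map[patch] = 0.5 * (edges[k] + edges[k + 1])
    return d_map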
Surface mapping calibration: The method also supports finding the contact area on a non-planar surface. For a non-planar surface, the challenge is to find the mapping of surface points between the two camera views. For surfaces with rich texture patterns, feature matching algorithms such as SIFT [14] can be used. However, for surfaces with no texture or low-quality texture patterns, we can create the mapping by detecting objects while they are contacting the surface. The intuition comes from structured-light systems for 3D reconstruction: the light source projects patterns onto the surface, and the camera observes where each pattern appears in the image. An object contacting the surface can likewise be thought of as a pattern being projected onto the surface.

In this section, we show some ways to use our results that might help in understanding the COVID-19 spreading pattern. In section 4.1, we introduce our dataset; in section 4.2, we show contact heat-maps of an office desk with data collected over a month; in section 4.3, we introduce desk occupancy detection, from which we believe an estimate of virus transmission might be computed.

We use the office dataset from [12]. The people in the videos are not performers; they are office users captured by two cameras, as shown in Fig. 1. The cameras capture activities in the office during working hours for about 45 workdays. Humans in the videos are skeletonized for privacy protection.

Heat-maps of selected days are generated by counting the number of contacts made at each point in the image. As shown in Figure 5, this provides an overview of desk usage. As we do not assign ids to the contacts, n contacts made in an area by one person are still counted as n contacts. In the figures, we transformed the contact points onto a top-down view of the desk model.

Most tracking algorithms [22, 23, 24] do not deal with the long-term disappearance of an object, which happens often in the office when people's body parts are occluded by desks or chairs. Also, track-by-detection algorithms fail because the centers of the tracking boxes vary with the occlusions. As a result, when a single person sits by a desk for an hour, hundreds of ids will be generated for that person, and this makes tracking algorithms very unstable for the desk space occupancy detection task. Instead, we monitor the moments when the desk regions are being contacted, as shown in Fig. 6.

Fig. 6. Region occupancy detection: the figure on the left shows the regions being monitored, and the plots on the right are the occupancy indicators shown as time series data; each temporal unit is 0.04 second.

We assume that the same desk region will not be reoccupied within a minute, so if there is a gap in usage of less than a minute, we fill in the gap.
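As a small illustration of this gap-filling rule, below is a minimal sketch, our own code under the assumption that occupancy has already been reduced to a boolean signal per frame at the dataset's 0.04 second frame interval.

def fill_short_gaps(occupied, frame_period_s=0.04, max_gap_s=60.0):
    # occupied: sequence of booleans, one per frame, True whenever the
    # monitored desk region is being contacted.
    # Gaps shorter than max_gap_s between two occupied frames are filled,
    # following the assumption that the desk is not reoccupied within a minute.
    out = list(occupied)
    max_gap = int(max_gap_s / frame_period_s)
    last_true = None
    for i, v in enumerate(out):
        if v:
            if last_true is not None and 0 < i - last_true - 1 <= max_gap:
                for j in range(last_true + 1, i):
                    out[j] = True
            last_true = i
    return out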
In this paper, we have presented a way to find the contacted area on the surface of a static object to help people better understand the COVID-19 spread path. We aim to show how complicated tasks in this application might be converted into very simple computer vision tasks. One interesting extension is to apply this technique to movable objects; the projection consistency still holds if more dynamic mappings can be established. Since the method we propose can be used to label data, another possible extension is to use the contact data as a training set for a deep learning model that detects, from only a single camera view, when and where contact is being made by a human.

References
[1] High precision multi-touch sensing on surfaces using overhead cameras
[2] Human pose estimation via convolutional part heatmap regression
[3] OpenPose: realtime multi-person 2D pose estimation using part affinity fields
[4] Quo vadis, action recognition? A new model and the Kinetics dataset
[5] Action segmentation with joint self-supervised temporal domain adaptation
[6] RMPE: Regional multi-person pose estimation
[7] MS-TCN: Multi-stage temporal convolutional network for action segmentation
[8] Mask R-CNN
[9] Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments
[10] Learning to reconstruct 3D human pose and shape via model-fitting in the loop
[11] Temporal convolutional networks for action segmentation and detection
[12] An empirical study of how socio-spatial formations are influenced by interior elements and displays in an office context
[13] Dynamic hand gesture recognition using RGB-D data for natural human-computer interaction
[14] Distinctive image features from scale-invariant keypoints
[15] An iterative image registration technique with an application to stereo vision
[16] Touch detection method for non-display surface using multiple shadows of finger
[17] 3D human pose estimation in video with temporal convolutions and semi-supervised training
[18] Kanade-Lucas-Tomasi (KLT) feature tracker. Computer Vision (EEE6503)
[19] Explore efficient local features from RGB-D data for one-shot learning gesture recognition
[20] Convolutional pose machines
[21] Spatial temporal graph convolutional networks for skeleton-based action recognition
[22] A simple baseline for multi-object tracking
[23] Tracking objects as points
[24] Objects as points