key: cord-0947910-fbsyoz2p authors: Canil, Marco; Pegoraro, Jacopo; Rossi, Michele title: MilliTRACE-IR: Contact Tracing and Temperature Screening via mm-Wave and Infrared Sensing date: 2021-10-08 journal: IEEE Journal on Selected Topics in Signal Processing DOI: 10.1109/jstsp.2021.3138632 sha: d23c5ea71cdb451925a6596adc70d890d075e5d3 doc_id: 947910 cord_uid: fbsyoz2p

Abstract—Social distancing and temperature screening have been widely employed to counteract the COVID-19 pandemic, sparking great interest from academia, industry and public administrations worldwide. While most solutions have dealt with these aspects separately, their combination would greatly benefit the continuous monitoring of public spaces and help trigger effective countermeasures. This work presents milliTRACE-IR, a joint mmWave radar and infrared imaging sensing system performing unobtrusive and privacy-preserving human body temperature screening and contact tracing in indoor spaces. milliTRACE-IR combines, via a robust sensor fusion approach, mmWave radars and infrared thermal cameras. It achieves fully automated measurement of distancing and body temperature, by jointly tracking the subjects' faces in the thermal camera image plane and the human motion in the radar reference system. Moreover, milliTRACE-IR performs contact tracing: a person with high body temperature is reliably detected by the thermal camera sensor and subsequently traced across a large indoor area in a non-invasive way by the radars. When entering a new room, a subject is re-identified among several other individuals by computing gait-related features from the radar reflections through a deep neural network and using a weighted extreme learning machine as the final re-identification tool. Experimental results, obtained from a real implementation of milliTRACE-IR, demonstrate decimeter-level accuracy in distance/trajectory estimation, inter-personal distance estimation (effective for subjects getting as close as 0.2 m), and accurate temperature monitoring (maximum errors of 0.5°C). Furthermore, milliTRACE-IR provides contact tracing through highly accurate (95%) person re-identification, in less than 20 seconds.
Index Terms—Extreme learning machines, indoor human sensing, mmWave radars, person re-identification, temperature screening, thermal camera.

I. INTRODUCTION

This work tackles the problem of designing a real-time, integrated radio and infrared sensing system to jointly perform unobtrusive elevated skin temperature screening and privacy-preserving contact tracing in indoor environments. Lately, social distancing has become a primary strategy to counteract the COVID-19 infection. Many research works [1], [2] have shown that it is an effective non-pharmacological approach and an important inhibitor for limiting the transmission of many contagious diseases such as H1N1, SARS, and COVID-19. Along with social distancing, elevated skin temperature detection and contact tracing have proven to be key to effectively contain the pandemic [3]. However, available methods to enforce these countermeasures often rely on RGB cameras and/or apps that need to be installed and continuously run on people's smartphones, often raising privacy concerns [4]. Moreover, currently adopted methods to screen people's temperature require individuals to stand in front of a thermal sensor, which may be impractical in heavily frequented public places.

Here, milliTRACE-IR, a joint mmWave radar and infrared imaging sensing system, is designed and validated. milliTRACE-IR performs unobtrusive and privacy-preserving human body temperature screening and contact tracing in indoor spaces (see Fig. 1). Next, its main components are discussed, emphasizing their novel aspects and the joint processing of the acquired sensor data.

mmWave radar: The radar analyzes the reflections of a transmitted mmWave signal off the individuals that move in the monitored environment, returning sparse point-clouds that carry information about the subjects' locations and the velocity of their body parts. A novel point-cloud clustering method is designed, combining Gaussian mixtures [5] and the density-based DBSCAN [6] algorithm, to distinguish the mmWave radio reflections from the subjects, as they move as close as 0.2 m to one another. The point-cloud clusters obtained in this way are used to track the subjects' positions in the physical space by means of a Kalman filter (KF) [7], and to obtain their gait-related features through a deep-learning based feature extractor. Finally, a novel person re-identification algorithm is proposed by exploiting weighted extreme learning machines (WELM).

Thermal camera: The infrared imaging system, or thermal camera (TC), returns images whose pixels contain information on the temperature of the objects in the TC field of view (FoV). To measure the subjects' temperature, at first, YOLOv3 [8] is used to perform face detection in the TC images, by bounding those areas containing a human face.
Hence, the obtained bounding boxes are tracked through an extended Kalman filter (EKF) [9] and the subjects' temperature is estimated by accumulating readings for each EKF track, according to a dedicated estimation and correction procedure. Through the EKF, the subject's distance from the TC is also estimated from the size of the corresponding bounding box, by considering the non-linear part of the EKF, which is approximated by fitting a function over a set of experimental data points.

Radar and thermal camera data fusion: Tracks in the radar reference system are associated with those in the TC image plane via an original algorithm that finds optimal matches for the readings taken by the two sensors, through their joint analysis. This makes it possible to take temperature measurements from a subject and reliably associate them with the highly precise tracking of his/her movement performed by the radar. In addition, the joint analysis of radar and TC data allows refining the temperature estimated through the TC: to mitigate the influence of the distance on the temperature readings [10], a regression function that provides temperature correction coefficients is fit from training data. The final temperatures are obtained using such function with the accurate distances retrieved from the radar. Hence, once a subject's temperature is measured, it is associated with the corresponding radar track, and the subject's movements and contacts inside the building are accurately monitored, by re-identifying the subject as he/she moves across the FoV of different radar devices.

To the best of the authors' knowledge, milliTRACE-IR is the first system that achieves temperature screening and human tracking through the joint analysis of radar and TC signals. Furthermore, it concurrently performs body temperature screening and contact tracing, while these aspects have been previously dealt with separately. A sensible usage model for the system is as follows: the TCs should be deployed in strategic locations to allow an effective temperature screening, such as facing the building/room entrance, to ensure that people's faces are seen frontally for a reasonable amount of time, and that their TC images are only taken when they enter or leave the building/room. On the other hand, the radar can be utilized to track the subjects while moving inside the monitored indoor space. This ensures higher privacy with respect to RGB cameras.

The main contributions of the present work are:
1) milliTRACE-IR, a joint mmWave radar and infrared imaging sensing system that performs unobtrusive and privacy-preserving human body temperature screening and contact tracing in indoor spaces, is designed and validated through an extensive experimental campaign.
2) A novel data association method is put forward to robustly associate tracks obtained from the mmWave radar and from the TC, where the radar returns the people coordinates in the physical space and the TC identifies people's faces in the thermal image space. The achieved precision and recall in the associations are as high as 97%.
3) An original clustering algorithm for mmWave point-clouds is devised, making it possible to resolve the radar reflections from subjects as close as 0.2 m.
4) A new WELM based person re-identification procedure is presented. The WELM is trained at runtime on previously unseen subjects, achieving an accuracy of 95% over six subjects with only 3 minutes of training data.
5) A novel method is designed to perform elevated skin temperature screening as people move freely within the FoV of the TC, without requiring them to stop and stand in front of the thermal sensor. For this, a dedicated approach is presented to mitigate the distortion in the TC temperature readings as a function of the distance, by also leveraging the accurate distance measures from the radar. Through this method, worst-case errors of 0.5°C are obtained.

The paper is organized as follows. In Section II, the related work is discussed. Section III introduces some basic concepts about mmWave radars and thermal imaging systems, while in Section IV the proposed approach is thoroughly presented. Section V-A describes the implementation of milliTRACE-IR, while the remainder of Section V contains an in-depth evaluation of milliTRACE-IR on a real experimental setup. Concluding remarks are provided in Section VI.

II. RELATED WORK

In the literature, almost no work has focused on a joint approach to social distancing and people's body temperature monitoring that preserves the privacy of the users. Here, several prior works in related areas are discussed, highlighting the differences with respect to the proposed system.

Social distancing monitoring: Social distancing has been one of the most widely employed countermeasures to contagious disease outbreaks [1]. Real-time monitoring of the distance between people in workplaces or public buildings is key for risk assessment and to prevent the formation of crowds. Existing approaches use either wireless technology like Bluetooth or WiFi [11], [12], which require the users to carry a mobile device, or camera-based systems [13], which are privacy invasive. Other approaches use the received signal strength indication (RSSI) from cellular communication protocols [1] or wearables [14], although these are often inaccurate, especially when used in crowded places [1]. A lot of effort has been put into designing person detection and tracking algorithms for crowd monitoring and people counting [15] by using fixed surveillance cameras and mobile robots [16]. The main drawbacks of these methods are the intrinsic difficulty in estimating the distance between people from images or videos, along with the fact that the users have to be continuously filmed during their daily lives, which raises privacy concerns. Concurrently, a large body of work has focused on ultra-wideband (UWB) transmission for people tracking [17], [18], e.g., using mmWave radars, as these naturally allow measuring distances with decimeter-level accuracy. However, none of these works has tackled the problem of estimating inter-personal distances when people are very close to one another for extended periods of time; this is especially difficult with radio signals, as the separation of the reflections from different subjects becomes challenging.

Passive temperature screening: Infrared thermography is widely adopted for non-contact temperature screening of people in public places [19]. Due to the COVID-19 pandemic, there has been a growing interest in developing screening methods to measure the temperature of multiple subjects simultaneously, without requiring them to collaborate and/or to carry dedicated devices [20]. Approaches that involve the use of RGB cameras, e.g., [21], share the aforementioned privacy-related limitations. The authors of [10] developed a Bayesian framework to measure the body temperature of multiple users using low-cost passive infrared sensors.
The distance from the sensors and the number of subjects is also obtained. However, the working range of this system is very short (around 1.5 m for precise temperature estimation), making it unsuitable for monitoring a large indoor area.

Radar-thermal imaging association and fusion: Sensor fusion between radars and RGB cameras has been extensively investigated, see, e.g., [22], [23], while the joint processing of mmWave radar data and infrared thermal images has only been marginally treated [24]. In addition, the latter work only deals with the detection of humans using thermal imaging and does not address body temperature screening. The present work is focused on the data association between a thermal camera and a mmWave radar over short periods of time, using the accurate radar distance estimates to refine the temperature reading. This makes it possible to consider scenarios where the thermal camera only covers a small portion of the environment (e.g., the entrance) so as to preserve the subjects' privacy, while a mmWave radar network can effectively monitor the whole indoor space.

mmWave radar person re-identification (Re-Id): Radio-frequency (RF) based person Re-Id is a recent research topic. So far, many works have focused on person identification [25], [26], where the subjects to identify have been previously seen by the system, typically via a preliminary training phase. Re-Id is more challenging, as it addresses the recognition of unseen subjects, for which only a few radio samples are collected during system operation. Differently from camera image based Re-Id methods [27], RF approaches need to profile the users across time intervals of a few seconds, to extract robust person-specific features [28]. To the best of the authors' knowledge, only two works have proposed solutions to this problem [28], [29]. In both cases, a deep learning method trained on a large set of users is used to extract features from the human gait. At test time, the features obtained from the subjects to be re-identified are compared against those of a set of known individuals using distance-based similarity scores. This approach entirely depends on the feature extraction process, and the classifier does not learn to refine its decisions at runtime, as new samples become available. This is a weakness, as the gait features extracted from mmWave radars are known to be variable, e.g., across different days [30]. Conversely, milliTRACE-IR combines deep feature extraction with fast classifiers that are continuously trained and refined as new data is collected; this improves the robustness of the identification task.

In this section the main working principles of the sensing technologies used in this work are summarized, namely, frequency-modulated continuous wave (FMCW) mmWave radars and infrared thermal cameras.

A MIMO FMCW radar allows the joint estimation of the distance, the radial velocity and the angular position of the targets with respect to the radar device [31]. It works by transmitting sequences of chirp signals, linearly sweeping a bandwidth B, and analyzing their copies, which are reflected back from the environment. A full chirp sequence, termed radar frame, is repeated with period ∆ seconds.

1) Distance, velocity and angle estimation: By computing the frequency shift induced by the delay of each reflection, the radar allows obtaining the distance and velocity of the targets with high accuracy.
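To make these relations concrete, the following minimal sketch evaluates the standard FMCW resolution formulas. It is not part of the milliTRACE-IR codebase; the carrier frequency and the chirp repetition time below are assumptions, back-computed from the hardware figures reported later in Section V.

```python
# Illustrative sketch of the standard FMCW relations (not the authors' code).
c = 3e8           # speed of light [m/s]
B = 3.07e9        # swept chirp bandwidth [Hz] (value from Section V)
f_c = 77e9        # carrier frequency [Hz] (assumed; the radar uses the 77-81 GHz band)
n_chirps = 64     # chirps per frame (value from Section V)

range_res = c / (2 * B)              # range resolution: ~0.0488 m (4.88 cm)
lam = c / f_c                        # wavelength: ~3.9 mm
T_c = lam / (4 * 4.77)               # chirp time, back-computed from v_max = 4.77 m/s
v_max = lam / (4 * T_c)              # maximum unambiguous velocity [m/s]
v_res = lam / (2 * n_chirps * T_c)   # velocity resolution: ~0.149 m/s (14.92 cm/s)

print(f"range res: {100 * range_res:.2f} cm, "
      f"v_max: {v_max:.2f} m/s, v_res: {100 * v_res:.2f} cm/s")
```

Evaluating this sketch reproduces the figures quoted for the experimental setup (4.88 cm range resolution, 4.77 m/s maximum velocity, 14.92 cm/s velocity resolution).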
The use of multiple receiving antennas, organized in a planar array, allows obtaining the angle-of-arrival (AoA) of the reflections along the azimuth and the elevation dimensions, leveraging the different frequency shifts measured by the different antenna elements. This enables the localization of the targets in the physical space.

2) Radar detection: The raw output of the radar is typically high dimensional for mmWave devices, due to the high resolution. To sparsify the signal and perform a detection of the main reflecting points, a typical approach is the constant false alarm rate (CFAR) algorithm [32], which consists of applying a dynamic threshold on the power spectrum of the output signal. A further processing step is required to remove the reflections from static objects, i.e., the clutter. This operation is performed using a moving target indication (MTI) high-pass filter that removes the reflections with Doppler frequency values close to zero [32].

3) Radar point-clouds: After the detection phase, a human presence in the environment typically generates a large number of detected points. This set of points, usually termed radar point-cloud, can be transformed into the 3-dimensional Cartesian space (x−y−z) using the distance, azimuth and elevation angle information of the multiple body parts. In addition, the velocity of each point is also retrieved, along with the strength of the corresponding signal reflection. In the following, the point-cloud outputted by the radar at frame k is referred to as P_k, containing a variable number of reflecting points. Each point, p ∈ P_k, is described by the vector p = [x, y, z, v, P_RX]^T, including its coordinates x, y, z, its velocity v and the reflected power P_RX.

Infrared thermal imaging deals with detecting radiation in the long-infrared range of the electromagnetic spectrum (∼8−15 µm) and producing images of that radiation, called thermograms. According to Planck's law, infrared radiation is emitted by all objects with temperature T > 0 K [33]. Since the radiation energy emitted by an object is positively correlated with its temperature, from the analysis of the received radiation it is possible to measure the object's temperature. A thermographic camera, or thermal camera, is a device that is capable of creating images of the detected infrared radiation. The operating principle is quite similar to that of a standard camera, and the same relations described by the so-called pinhole camera model hold [34]. Within this approximation, the coordinates of a point a = [a_x, a_y, a_z]^T in the three-dimensional space are projected onto the image plane of an ideal pinhole camera through a very small aperture. Mathematically, this operation is described as a_proj = Ψa, where a_proj is the projected point and Ψ is the intrinsic matrix of the camera, which contains information about its focal lengths, pixel dimensions and the position of the image plane. However, when dealing with a real thermal camera, this approximation may be insufficient, and the radial and/or tangential distortions introduced by the use of a lens and by inaccuracies in the manufacturing process may additionally have to be accounted for. On the image plane, an array of infrared detectors is responsible for measuring the received radiation, which is sampled and quantized to produce digital information. The pixels of the final image that is returned by a thermal camera contain information about the temperature of the corresponding body/object part, encoded into the pixel intensity.
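As a small illustration of the pinhole relation a_proj = Ψa, the sketch below projects a 3D point onto the image plane. The intrinsic parameters are hypothetical placeholders, not the calibrated FLIR A65 values (which the paper obtains via Zhang's method), and lens distortion is ignored.

```python
import numpy as np

# Pinhole projection sketch: a_proj = Psi @ a (no lens distortion modeled).
fx, fy = 500.0, 500.0   # focal lengths in pixels (hypothetical)
cx, cy = 320.0, 256.0   # principal point for a 640x512 image (hypothetical)

Psi = np.array([[fx, 0.0, cx],
                [0.0, fy, cy],
                [0.0, 0.0, 1.0]])   # camera intrinsic matrix

a = np.array([0.5, -0.2, 3.0])      # 3D point in camera coordinates [m]
a_proj = Psi @ a                    # homogeneous image coordinates
u, v = a_proj[:2] / a_proj[2]       # normalize by depth to get pixel coordinates
print(f"pixel coordinates: ({u:.1f}, {v:.1f})")
```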
This work considers the problem of monitoring an indoor environment covered by multiple mmWave radar sensors, which span different rooms and corridors. A few infrared thermal cameras are placed at strategic locations to perform accurate temperature screening of the people in the indoor space without compromising their privacy, e.g., at the building's entrance. From a high-level perspective, milliTRACE-IR performs the following operations.

(1) Person detection and temperature measurement: When people enter the monitored indoor space, the system concurrently performs face detection from the infrared images captured by the thermal camera and person detection using the mmWave radar point-clouds. 1) From the thermal camera (TC) images, a face detector is used to obtain bounding boxes enclosing the faces of the detected subjects (Section IV-B). A measure of their body temperature is obtained from the intensity of the thermal image pixels in the bounding box, see Section IV-C. While milliTRACE-IR works independently of the specific face detector architecture used, in the implementation YOLOv3 is used [8]. 2) Concurrently, radar signal processing is used to detect and group the point-clouds from different subjects and estimate their positions (Section IV-D). A novel clustering algorithm based on DBSCAN and Gaussian mixture models is put forward to separate the contributions of close-by subjects (Section IV-E).

(2) Radar-TC person tracking: Kalman filtering (KF) is independently applied to the TC images and to the radar point-clouds to respectively track the subjects' movements within the thermal images and in the indoor Cartesian space. Standard KF-based tracking in the thermal image plane is here modified to achieve a coarse estimation of the distance of the subjects, based on the dimension of their face bounding box (Section IV-B). In this phase, each subject track is associated with a unique numerical identifier.

(3) Radar-TC track association: As a subject exits the FoV of the TC, his/her body temperature is associated with the corresponding trajectory from the mmWave radar, by performing a track-to-track association between TC tracks and radar tracks. This association algorithm is based on the subjects' distances from the TC, and on the radar-estimated positions of the subjects, projected onto the thermal image plane (Section IV-F). After the association, the temperature measurement is corrected accounting for the distance of each person from the TC, using the more precise distance estimates provided by the radar (Section IV-C).

(4) Radar-based person re-identification: During the radar tracking process, the point-cloud sequences generated by each subject are collected and fed to a deep neural network that performs gait feature extraction (Section IV-H). The resulting gait features are organized into a labeled training set, where labels are obtained from the track identifiers. When a subject exits the FoV of a radar and enters that of another radar placed in a different room or corridor, a weighted extreme learning machine (WELM) based classifier [35] is trained on-the-fly and used to re-identify the subject at runtime (Section IV-J). This robust and lightweight person Re-Id process, based on the gait features extracted from the radar point-clouds, enables contact tracing across large indoor environments.

The system operates at discrete time-steps, k = 1, 2, . . ., each with a fixed duration of ∆ seconds, also referred to as a frame in the following.
Boldface capital letters refer to matrices, e.g., X, with elements X_ij, whereas boldface lowercase letters refer to vectors, e.g., x. X^{-1} denotes the inverse of matrix X, and x^T denotes the transpose of vector x. x_k refers to vector x at time k, x_j refers to element j of x, and (x_k)_j is element j of x_k. N(µ, σ²) indicates a Gaussian random variable with mean µ and variance σ². The notation ||x||_2 indicates the Euclidean norm of vector x, while ||x||_Γ = √(x^T Γ x) denotes the norm induced by matrix Γ. The diagonal matrix with elements x_1, x_2, . . . , x_n is denoted by diag[x_1, x_2, . . . , x_n]. |X| indicates the cardinality of set X, while log(·) denotes the natural logarithm.

The detection of the subjects in the thermal camera images is performed by means of a face detector that computes rectangular bounding boxes delimiting the faces of the people within the FoV. The bounding boxes are used to track the positions of the subjects in the subsequent instants and to identify a region of interest (ROI) from which the temperature of the targets is obtained. milliTRACE-IR is independent of the particular face detector used, provided that it outputs bounding boxes enclosing the faces of the subjects. In the implementation, YOLOv3 [8] is used due to its excellent performance in terms of accuracy and speed.

To track the faces of the subjects in the image plane, an extended Kalman filter (EKF) is employed [9]. Define the state vector of a target subject at time k as x_k = [x^c_k, y^c_k, ẋ^c_k, ẏ^c_k, h_k, d_k, ḋ_k]^T, where x^c_k, y^c_k are the true coordinates of the center of his/her face in the thermal image, ẋ^c_k, ẏ^c_k its velocities along the horizontal and vertical directions, h_k is the true height of the bounding box enclosing the subject's face, d_k the distance of the target from the camera in the physical space, and ḋ_k its time derivative (rate of variation). The observation vector obtained from the YOLOv3 face detector, denoted by z_k = [x̃^c_k, ỹ^c_k, h̃_k]^T, contains noisy measurements of the face position and height (represented by the height of the bounding box), which are distinguished from their true values by the superscript "˜". Denote the observation noise by vector r_k. The EKF state transition model is defined as x_{k+1} = f(x_k, u_k), where f(·) is the transition function, connecting the system state at time k, x_k, to that at time k+1, x_{k+1}, and vector u_k ∼ N(0, Q) represents the process noise. In the model used in this work, the process noise includes 4 independent components, representing two random accelerations of the bounding-box center coordinates, u^x_k, u^y_k, a random noise term for the bounding-box dimension, u^h_k, and a random acceleration for the subject's distance, u^d_k. Therefore, it can be written u_k = [u^x_k, u^y_k, u^h_k, u^d_k]^T. Assuming that the target moves according to a constant velocity (CV) model, from the state definition it follows that all state components evolve linearly, except for the bounding-box height, which evolves as h_{k+1} = g(d_{k+1}) + u^h_k; the only non-linear term is function g(·), which relates the subject's distance extracted by the thermal camera to the height h_k of the bounding box enclosing his/her face. The proposed approach consists in (i) obtaining an estimate for g(·) in an offline fashion using training data, and (ii) using such estimate in the EKF model. These two steps are detailed next.
1) Estimation of function g(·): Function g(·) maps the distance of the target from the thermal camera, d_k, at time k, onto the corresponding height of the bounding box, h_k, i.e., h_k = g(d_k). Given a training set of pairs {(d_i, h_i)}, containing the true distances of the target, d_i, and the measured bounding box heights, h_i, g(·) is obtained by solving an offline non-linear least-squares (LS) problem of the form

ĝ = arg min_g Σ_i (h_i − g(d_i))².

From the equations of the pinhole camera model [34], g(·) is restricted to the family of hyperbolic functions with shape g(d) = b_0 + b_1/(d + b_2), reducing the problem to that of estimating the parameters b_0, b_1, and b_2, i.e.,

(b̂_0, b̂_1, b̂_2) = arg min_{b_0, b_1, b_2} Σ_i (h_i − b_0 − b_1/(d_i + b_2))².

This optimization problem is solved using the Levenberg-Marquardt algorithm [36] for non-linear LS fitting, on data collected with the experimental setup used in this paper. Note that the process noise acts on the bounding-box dimension in two ways: inside the function g(·), modeling the uncertainty in the subject's distance due to the random acceleration, and through the additive term u^h_k, modeling the imperfect estimation of g(·) itself. The variance of u^h_k can be estimated from the residuals, after fitting the training measurements with function g(·).

2) Using g(·) in the EKF: Due to the non-linear dependence of the state x_k on the process noise u_k, in the EKF operations the following transformed process noise covariance matrix is used [37]:

Q̃_k = L_k Q L_k^T,

where matrix L_k is the Jacobian of function f(·) with respect to the process noise vector, evaluated at the current state estimate. Using the above system model, the system state estimate at time k, x̂_k, is recursively obtained along with the corresponding error covariance matrix, P_k. By definition of the EKF state, this allows us to get a coarse estimate of the distance of the subjects from the TC, which is exploited in the radar-TC data association step, see Section IV-F.

The body temperature is obtained from the thermal camera readings in the bounding boxes contained in x̂_k, for each subject, and for all the time steps in which he/she is tracked by the EKF. At any given time k, a single (noisy) temperature measurement, T̃_k, is extracted by taking the maximum value across all the pixels in the current bounding box. Denoting by B_k the 2-D region of the image enclosed by the bounding box, and by B_ki the intensity of its pixel i, it holds T̃_k = max_i B_ki.

The common approach to people tracking from mmWave radar point-clouds [18], [26], [30] includes (i) detection: using density-based clustering to separate the points generated by the subjects from clutter and noise; and (ii) tracking: applying Kalman filtering techniques [7] to each cluster centroid to track the movement trajectory of each subject in space. Detection is typically performed using DBSCAN [6], an unsupervised density-based clustering algorithm that takes two input parameters, ε and m_pts, respectively representing a radius around each point and the minimum number of other points that must be inside such radius to satisfy a certain density condition. Given the description of the radar measurements from Section III-A, the coordinates of the points in the horizontal plane (x−y) are used as input to DBSCAN, which outputs a list of detected clusters and a set of points classified as noise. Typically, the centroid of each cluster is used as an observation of the subject's position, feeding a subsequent KF tracking algorithm [18]. DBSCAN has proven robust and accurate as long as the subjects do not come too close to one another [18], [26], [38], see also Fig. 3a. When this occurs (Fig. 3b),
the algorithm often fails to distinguish between adjacent subjects, merging their contributions into a single cluster [39].

In the KF tracker, the state of each subject at time k is defined as s_k = [x_k, y_k, ẋ_k, ẏ_k]^T, containing the subject's x−y coordinates and the corresponding velocities. The state evolution is assumed to obey s_k = As_{k−1}, where the transition matrix A represents a constant-velocity (CV) model [38]. The KF computes an estimate of the state for a target subject at time k, denoted by ŝ_k, by sequentially updating the predictions from the CV model with the new observations. The association between the new observations (time k) and the previous states (time k−1) exploits the nearest-neighbors joint probabilistic data association algorithm (NN-JPDA) [38], [40].

milliTRACE-IR also uses the just-described DBSCAN and KF based signal processing pipeline, but improves it significantly with a novel clustering procedure to better resolve the point-clouds of subjects that are close to one another. The designed solution to enhance the tracking accuracy in such cases is a major contribution of the present work and is detailed next.

As a possible solution to DBSCAN's drawbacks, one may adjust the parameters ε and m_pts so as to correctly resolve the clustering ambiguity, even for closely spaced targets. However, ε and m_pts interact in a complex and often unpredictable way, making the design of such an adaptation rule difficult. milliTRACE-IR adopts a different approach, which combines (i) the standard DBSCAN algorithm with fixed ε and m_pts, (ii) the spatial locations of the subjects, available from the tracking procedure, and (iii) the Gaussian mixture (GM) clustering algorithm [5]. The designed algorithm, reported in Alg. 1 and exemplified in Fig. 3, proceeds as follows. At first, the DBSCAN algorithm is applied to obtain an estimate of the clusters and a reasonable separation between the noise points and those belonging to actual subjects, using ε = 0.4 m and m_pts = 10. DBSCAN outputs a cluster label for each point p ∈ P_k, denoted by ℓ_p. Clusters are denoted by C_n, and their centroids by c̄_n, with n = 1, . . . , n_k.

The next step is to identify which of the tracked subjects get closer than a critical distance d_th to one another. The clusters provided by standard DBSCAN for these subjects are expected to be incorrect, as the point-cloud data from these would be merged into a single cluster. To pinpoint these subjects, their KF state is leveraged, which corresponds to a filtered representation of their trajectories. Consider track t at time k: its coordinates are predicted as ŝ^t_k = Aŝ^t_{k−1} (see lines 2−3 in Alg. 1). For any two subjects with associated tracks t and t', milliTRACE-IR checks whether ||ŝ^t_k − ŝ^{t'}_k||_2 < d_th. If this occurs, as shown in the example of Fig. 3b for tracks t = 0 and t' = 1, t and t' are termed nearby subjects. Hence, define G as a set of subjects that are mutually within a radius of d_th from one another. A group G can be constructed starting from any subject and recursively adding all the subjects who are closer than d_th to any of the set members; if a subject has no other subjects within distance d_th, he/she is the only member of his/her group. Collecting all the disjoint groups constructed from the maintained tracks at time k, the set G_k(d_th) is obtained, containing all the nearby-subject groups, as sketched in the example below.
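The group construction just described amounts to finding the connected components of a proximity graph over the predicted track positions. The following is a minimal sketch of this step (an illustrative helper, not the authors' implementation), using the d_th = 1.2 m value adopted in the implementation:

```python
import numpy as np

def nearby_groups(positions, d_th=1.2):
    """positions: dict track_id -> np.array([x, y]) of predicted KF positions.
    Returns the disjoint nearby-subject groups (singletons included)."""
    unvisited = set(positions)
    groups = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = {seed}, [seed]
        while frontier:                      # breadth-first expansion of the group
            t = frontier.pop()
            for u in list(unvisited):
                if np.linalg.norm(positions[t] - positions[u]) < d_th:
                    unvisited.remove(u)
                    group.add(u)
                    frontier.append(u)
        groups.append(group)
    return groups

# Example: tracks 0 and 1 are nearby, track 2 is isolated.
pos = {0: np.array([1.0, 2.0]), 1: np.array([1.5, 2.4]), 2: np.array([5.0, 1.0])}
print(nearby_groups(pos))   # [{0, 1}, {2}] (group order may vary)
```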
Once the nearby groups are identified, the ambiguities inside each group G containing more than one member are resolved by recomputing the clustering labels as follows. Consider a single group G. To delimit the region where the clustering has to be refined, the following additional regions are defined. The sample covariance matrix of the last cluster associated with track t is denoted by Σ^t_n, and contains information about the shape of the subject's cluster. The region of the plane containing the points that are within a radius of d_th from ŝ^t_k can be written as

R_s(t) = {p : ||p − ŝ^t_k||_2 ≤ d_th},   (6)

and the region of points with a squared Mahalanobis distance from ŝ^t_k smaller than γ is

R_c(t) = {p : (p − ŝ^t_k)^T (Σ^t_n)^{-1} (p − ŝ^t_k) ≤ γ}.   (7)

[Fig. 3: Illustration of the proposed clustering method. In (a) the point-clouds belonging to 2 subjects are well separated and DBSCAN outputs the correct clustering. In the next time-step, (b), DBSCAN fails and merges the two clusters into one. The proposed method selects the points to re-cluster using the track positions together with Eq. (6) and Eq. (7), as shown in (c), and outputs the correct result using GM on the selected points with n_G = 2, see (d).]

[Alg. 1 (excerpt): Input: states of the targets at time k−1, observed point-cloud at time k, P_k; clusters q with π_q < π_thr are discarded (lines 11−13).]

In the implementation, d_th = 1.2 m and γ = 9.21 were used.¹ Then, the labels assigned by DBSCAN to all the points belonging to a cluster whose centroid falls inside the region R(G) = ∪_{t∈G} (R_c(t) ∩ R_s(t)) are discarded (lines 7−9 in Alg. 1).² This set of points is denoted by S. Then, the GM algorithm is applied to the points belonging to set S to refine the clusters within this region, see the green points in Fig. 3c. As GM requires the number of clusters to be specified in advance, it is set equal to the number of subjects in the group, i.e., n_G = |G|. The GM algorithm outputs the labels ℓ_p for each point p ∈ S and the weight of the Gaussian component associated with each GM cluster, π_q ∈ [0, 1], q = 1, . . . , n_G, with Σ_q π_q = 1. The new labels are used to replace the ones previously found by DBSCAN (Fig. 3d), unless the GM clusters have very small weights, i.e., the new clusters having π_q < π_thr are discarded and treated as noise points. The threshold value used in the implementation is π_thr = 0.1/n_G.

The proposed method effectively solves the problem faced by DBSCAN in resolving subjects close to one another. The cost of this improvement is that an additional GM algorithm has to be applied to a subset of the point-cloud; however, at each time k, the number of points in this subset is typically much smaller than that in the full point-cloud P_k.

¹The value of γ corresponds to a probability of 99% of falling inside the region, assuming that the points in the cluster are distributed on the plane according to a Gaussian distribution around ŝ^t_k.

²Discarding a label corresponds to setting it equal to the label used by DBSCAN to represent noise points.

Upon tracking the subjects in the TC image plane and in the physical space, respectively using the measurements from the TC and from the mmWave radar sensor, a track-to-track association method is applied to link the movement trajectory of each person to his/her body temperature. Assume that, at time k, the system has access to N^rad_k tracks from the radar sensor and N^tc_k tracks from the thermal camera, indexed by i and j, respectively.
The data association strategy used in milliTRACE-IR consists in (i) computing a cost for each association (i ↔ j), and (ii) solving the resulting combinatorial cost minimization problem to associate the best matching track pairs. The main challenge in the association of radar and thermal camera tracks is the design of a cost function that grants robustness in the presence of multiple targets, which may enter the monitored area in unpredictable ways. The key point is to gauge the similarity of the tracks by comparing them in terms of common quantities, which can be estimated from both devices. Assume also that the two sensors are located in the same position and with the same orientation (co-located). In this setup, (i) the distance between the subjects and the sensors is the same, so its estimate should match for tracks representing the same subjects, and (ii) the radar KF states containing the coordinates of the subjects' positions can be projected onto the TC image plane; after this operation, the horizontal component of the radar projections and the horizontal component of the TC bounding box positions should match for correctly associated tracks. To reliably associate radar and TC tracks, milliTRACE-IR uses a cost function consisting of the following components.

Estimated distance cost. Considering the K subsequent time steps where radar track i and TC track j are both available, the estimated distance cost is defined as

A_d(i, j) = (1/K) Σ_k (d̂^i_k − d̂^j_k)² / (σ²_{d^i_k} + σ²_{d^j_k}),   (8)

where σ²_{d^i_k} and σ²_{d^j_k} are the variances of the two distance estimates.

Estimated horizontal position cost. Similarly, denoting by x̂^i_k the horizontal coordinate of the radar track projected onto the TC image plane and by x̂^j_k the horizontal coordinate of the TC bounding box center, the horizontal position cost is

A_x(i, j) = (1/K) Σ_k (x̂^i_k − x̂^j_k)² / (σ²_{x^i_k} + σ²_{x^j_k}),   (9)

where σ²_{x^i_k} and σ²_{x^j_k} are the variances of the two estimates. An illustrative example is shown in Fig. 4b.

Track length coefficient. Recalling that ∆ is the (constant) sampling interval, the proposed cost function accounts for the length K of the tracks that are to be associated, favoring longer tracks. To this aim, the following coefficient is defined:

ρ(K) = 1/log(K∆).   (10)

Note that ρ(K) is a weight factor for a cost (see the later Eq. (11)), which decreases with the track length K. This means that a smaller cost is implied when the associated tracks i and j are longer. Also, in the implementation, it holds K > 1/∆, so ρ(K) is always positive.

Association cost function for radar and TC tracks. The association cost A(i, j) for the track pair (i, j) (i refers to a radar track and j to a TC track) is obtained by summing Eq. (8) and Eq. (9), to gauge how well the two tracks match in terms of their estimated distance across time, and their estimated position on the horizontal projected axis of the TC image plane, respectively. The sum is then weighted by the coefficient of Eq. (10). Formally, A(i, j) is given by

A(i, j) = ρ(K) [A_d(i, j) + A_x(i, j)].   (11)

Costs A(i, j), i = 1, . . . , N^rad_k, j = 1, . . . , N^tc_k, are arranged into an N^rad_k × N^tc_k matrix, and the optimal association of tracks is obtained by minimizing the overall cost, computed through the Hungarian algorithm [41]. The Hungarian algorithm takes the cost matrix as input and solves the problem of pairing each radar track with a single TC track (by minimizing the total cost), with an overall complexity of O((N^rad_k N^tc_k)³); a usage sketch is given at the end of this subsection.

In general, the radar and the TC would be deployed at different spatial locations. However, knowing their relative position and orientation, a roto-translation matrix Φ can be obtained to geometrically transform the data into a new coordinate system where the TC and the radar sensors are co-located, as described above. In this work, the TC position and orientation are selected as the reference coordinate system, and the positions estimated from the radar sensor are transformed into it.
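The final minimization step can be carried out with any Hungarian-algorithm implementation, for instance with SciPy (a library the system already uses). The cost matrix below is made up for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Given the association cost matrix A (radar tracks x TC tracks), the
# Hungarian algorithm returns the minimum-cost one-to-one pairing.
A = np.array([[0.2, 1.7, 2.5],
              [1.9, 0.3, 1.4],
              [2.8, 1.2, 0.6]])   # toy values, not experimental costs

radar_idx, tc_idx = linear_sum_assignment(A)   # optimal assignment
for i, j in zip(radar_idx, tc_idx):
    print(f"radar track {i} <-> TC track {j} (cost {A[i, j]:.2f})")
```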
In line with [10], the direct reading of each subject's temperature, T̃_k, is subject to a scaling factor, α(d_k), with respect to the true temperature T, where α(d_k) depends on the distance from the TC, i.e.,

T = α(d_k) T̃_k.   (12)

For an accurate temperature screening, the scaling factor α(d_k) is estimated from the training data, considering a linear model of the form

α(d_k) = a_0 + a_1 d_k.   (13)

Given a training set of distance-temperature pairs (d_i, T̃_i), collected from subjects with known true temperature T, the fitting coefficients a_0, a_1 are obtained by solving

arg min_{a_0, a_1} Σ_i (T − (a_0 + a_1 d_i) T̃_i)².   (14)

From the above optimization problem, in this work the parameters are set to a_0 = 1.116, a_1 = 0.013. At system operation, the final temperature estimate is obtained by rescaling and averaging the readings collected over the J time-steps in which the subject is tracked,

T̂ = (1/J) Σ_{j=1}^{J} α(d̂_j) T̃_j,   (15)

where α(·) is defined in Eq. (13), using the parameters obtained from Eq. (14), while d̂_j is an estimate of the distance obtained by the system at time-step j. To improve the temperature estimates, milliTRACE-IR performs sensor fusion by exploiting the association between the TC face tracks and the mmWave radar tracks (see Section IV-F). In Eq. (15), the coefficients α(d̂_j) are computed using the distances estimated by the mmWave radar device, as these are much more accurate than those obtained from the TC. The impact of combining the temperature information from the TC and the accurate distance estimation capabilities of the radar is investigated in Section V-C. The block diagram for the temperature correction step is shown in Fig. 5.

To extract the gait features of the subjects, the neural network (NN) proposed in [30], originally developed for person identification, is here adapted. The network uses a point-cloud feature extraction block inspired by PointNet [42], followed by temporal dilated convolutions [43] to capture features related to the movement evolution in time. The proposed NN takes as input a radar point-cloud sequence, denoted by Z, and outputs the corresponding feature vector v = F(Z). Fig. 6 shows the block diagram of the NN. First, the network is expanded with respect to [30], using augmented point-cloud feature extraction blocks composed of 3 shared multi-layer perceptrons (MLPs) of size 98 and 2 MLPs of size 196, yielding point-cloud features of size 196×1. Then, 2 temporal convolution blocks are used, each containing 3 convolutional layers with 3 × 3 kernels, with (32, 64, 128) and (256, 128, 32) filters for the two blocks, respectively, and dilation rates of 1, 2, 4 for the 3 layers in each block. Then, after applying the same global average pooling operation of [30], a fully connected layer [44] is introduced before the classification output, which produces a vector ṽ of dimension 32. The final feature vector is obtained using L_2-normalization on ṽ, i.e., v = ṽ/||ṽ||_2. A summary of the NN layers and their parameters is provided in Tab. 1.

1) Training: The NN is trained to produce representative feature vectors, v, containing information on the way of walking of the subjects. This requires that the network generalizes well to subjects not seen at training time, as the performance of the re-identification mechanism strongly depends on the quality of the extracted features. To this end, in this work the NN is trained using a weighted combination of the cross-entropy loss [44], denoted by L_ce, the center loss [45], L_cnt, and the triplet loss [46], L_tri. The cross-entropy is the most widely used loss for classification purposes in deep learning, and here it is used to train the network to distinguish among the different subjects [44]. However, just training the NN on a classification problem does not lead to sufficiently discriminative features for the re-identification mechanism.
The center loss is adopted to additionally force the feature representations belonging to the same class to be close in the feature space, in terms of Euclidean distance. Specifically, denoting by c_l the centroid of the feature vectors belonging to class l, the center loss is

L_cnt = Σ_i ||v_i − c_{l_i}||²_2,   (16)

where l_i is the class of the i-th training sample, and the centroids are learned as part of the training process via the back-propagation algorithm [45]. The triplet loss is used to push apart the feature representations of inputs belonging to different classes. For this, triplets of input samples are selected from the training set, two of them from the same class, leading to feature vectors v_a and v_b, and one belonging to a different class, leading to a third feature vector v_c. For further details on the triplet selection process, see Section 3.2 of [46]. The triplet loss is written as

L_tri = max(||v_a − v_b||²_2 − ||v_a − v_c||²_2 + µ, 0),   (17)

where µ is a margin hyperparameter, set to 1. Hence, the feature extractor is trained with the following total loss function

L = L_ce + ω L_cnt + L_tri,   (18)

where the parameter ω = 0.5 weighs the relative importance of the center loss. In the implementation, a training dataset containing mmWave radar point-clouds from 16 subjects is used. It was collected in different indoor environments to increase the generalization capabilities of the NN. The optimization is carried out using Adam [44] with learning rate 10^{-4} and an L_2 regularization rate of 8 × 10^{-5} for 250 epochs, as summarized in Tab. 1. Hyperparameter tuning was carried out using a greedy search procedure, optimizing the value of the loss L on a validation set containing a randomly selected subset (20%) of the training data.

2) Feature extraction: At inference time, i.e., during the system operation, the NN is used to compute feature vectors that are representative of the subjects' gait. Specifically, 45-step (3 seconds) long sequences of radar point-clouds are collected for each tracked subject. The point-cloud sequences are denoted by Z in the following. The inner representation v = F(Z), after L_2-normalization, is used as the feature vector for the following re-identification mechanism.

The weighted extreme learning machine (WELM) [35] is a particular kind of single-layer feedforward neural network in which the weights of the hidden nodes are chosen randomly, while the parameters of the output layer are computed analytically. Consider an n_cls-class classification problem and a training set V = ∪_{n=1}^{n_cls} V_n of input feature vectors v (see Section IV-H), each with an associated one-hot encoded label y ∈ {0, 1}^{n_cls}, where V_n is the set containing the vectors from class n = 1, . . . , n_cls. For any v ∈ V, the WELM computes the matrix of hidden feature vectors H ∈ R^{|V|×L}, with rows h(v), where L is the number of WELM hidden units and h(·) is a non-linear activation function. milliTRACE-IR uses h(v) = ReLU(Wv + b), where ReLU is the rectified linear unit [44] (ReLU(x) = max(x, 0)) and W, b are the weights and biases of the ELM hidden layer, respectively. The elements of W and b are here generated from N(0, 0.1). The WELM learning process amounts to computing, for each class n, the optimal values of an output weight vector β_n that minimizes the weighted, L_2-regularized LS cost function ||Hβ_n − y_n||²_Ω + λ||β_n||²_2, where y_n collects the labels for class n, λ is a regularization parameter, and Ω is a diagonal weighting matrix used to boost the importance of those samples belonging to under-represented classes. This compensates for the tendency of the standard ELM to favor over-represented classes at inference time [35]. A minimal numerical sketch of this training procedure is given below.
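The sketch is an illustrative NumPy reimplementation, not the authors' code; it uses inverse-class-frequency sample weights (as formalized in the next paragraph) and the L×L closed form appropriate when |V| > L.

```python
import numpy as np

rng = np.random.default_rng(0)

def welm_train(V, Y, L=1024, lam=0.1):
    """V: (N, d) feature vectors, Y: (N, n_cls) one-hot labels."""
    d = V.shape[1]
    W = rng.normal(0.0, 0.1, size=(L, d))        # random hidden weights
    b = rng.normal(0.0, 0.1, size=L)             # random hidden biases
    H = np.maximum(V @ W.T + b, 0.0)             # ReLU hidden features, (N, L)
    w = 1.0 / Y.sum(axis=0)                      # inverse class frequencies
    omega = Y @ w                                # per-sample weight 1/|V_n|, (N,)
    HtO = H.T * omega                            # H^T Omega without forming Omega
    # Closed form for |V| > L: B = (H^T Omega H + lam I)^{-1} H^T Omega Y
    B = np.linalg.solve(HtO @ H + lam * np.eye(L), HtO @ Y)
    return W, b, B

def welm_scores(v, W, b, B):
    return np.maximum(v @ W.T + b, 0.0) @ B      # class scores h(v)^T B

# Toy usage: 200 samples of 32-dim features (the NN output size), 3 classes.
V = rng.normal(size=(200, 32))
labels = rng.integers(0, 3, 200)
Y = np.eye(3)[labels]
W, b, B = welm_train(V, Y, L=64)                 # small L for the toy example
pred = welm_scores(V, W, b, B).argmax(axis=1)    # arg max over the score vector
```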
In the analyzed scenario, the individuals move freely in the environment across different rooms, so the number of feature vectors collected from each of them is not only unknown in advance, but highly variable. Hence, the training set usually contains unbalanced classes, and milliTRACE-IR uses the weights

Ω_ii = 1/|V_{n_i}|,   (19)

where n_i = arg max_n (y_i)_n denotes the class of the i-th vector. Stacking all the β_n into a single matrix B ∈ R^{L×n_cls} and the labels into matrix Y ∈ {0, 1}^{|V|×n_cls}, the WELM output weights B can be computed in closed form using one of the following equivalent expressions:

B = H^T (Ω H H^T + λI)^{-1} Ω Y,   (20)
B = (H^T Ω H + λI)^{-1} H^T Ω Y.   (21)

Due to the dimension of the matrix to be inverted, if |V| > L it is more convenient to use Eq. (21), while if |V| ≤ L, Eq. (20) is to be preferred. The output classification for a vector v is n̂ = arg max_n (h(v)^T B)_n, where h(v)^T B is a vector of WELM scores, one per class.

To enable person re-identification based on the feature vectors v extracted by the NN, milliTRACE-IR uses the WELM multiclass classifier of Section IV-I, which is trained at runtime only when the system has to re-identify a previously seen subject. This is done by sequentially collecting feature vectors from all the subjects seen by the system at operation time, and storing them into the training set V. Note that, although an online sequential version of the ELM training process has been proposed in [47], here the WELM is trained every time a person has to be re-identified, using a batch implementation and including in the training set V all the subjects seen up to the current time-step k. This is because in the online training procedure of [47] the number of classes has to be fixed in advance, while in the considered setup the number of subjects seen by the system may change in time, and the Re-Id procedure must be flexible to the addition of new individuals to the training set V. The WELM training and re-identification phases are detailed next and in Alg. 2.

1) Training: The training process is performed at runtime as explained in Section IV-I, using L = 1024 and λ = 0.1. During normal system operation, the feature vectors obtained from each track are continuously added to the set V, storing the corresponding one-hot encoded vectors containing the subjects' identities into matrix Y. To reduce the computational burden, the feature extraction step is executed every 5 time-steps. This is reasonable, as the input sequences to the NN contain 45 time-steps overall, and extracting the features at every time-step would lead to highly correlated, and therefore less informative, feature vectors, in addition to entailing a higher computation cost. At time-step k, if a subject has to be re-identified, the training procedure of Section IV-I is executed (lines 1−4): the WELM feature vectors H are computed by applying the activation function h(·) to each training vector, and the weight matrix Ω is obtained from Eq. (19) (lines 1−3). The WELM output matrix B is computed using Eq. (20) or Eq. (21), depending on |V| (line 4).

2) Re-identification: The Re-Id procedure is used to recognize subjects that have been seen by the system and associate them with their temperature measurement and their past movement history in the monitored area. Denoting by t_id the track to be re-identified, the trained WELM processes the NN features of this user, v_id, as follows: h(v_id)^T B. Due to the high variability of human movement, rather than considering a single feature vector, milliTRACE-IR computes the cumulative average WELM scores over a time window of length W, as summarized in Alg. 2.
[Alg. 2: Re-identification mechanism at time k. Input: training set V, track to be re-identified t_id. Output: Re-Id label of t_id.]

The cumulative average score at time j = 1, . . . , W is ξ_j = (1/j) Σ_{i=1}^{j} h(v_i)^T B (lines 6−9). The identity label corresponds to the index of the largest element of ξ_W (line 10).

In this section, the experimental results obtained by testing the system in different indoor environments are presented.

Hardware. milliTRACE-IR has been implemented on an NVIDIA Jetson TX2 edge computing device, with 8 GB of RAM and an NVIDIA Pascal GPU. The Jetson TX2 has been connected via USB to a Texas Instruments IWR1843BOOST mmWave radar, operating in the 77−81 GHz band, and via Ethernet to a FLIR A65 thermal camera, as shown in Fig. 7. The experiments have been performed in real-time at a frame rate of 1/∆ = 15 Hz. The radar device operates in FMCW mode, using a chirp bandwidth B = 3.07 GHz, which leads to a range resolution of c/2B = 4.88 cm, and 64 chirps per sequence, obtaining a maximum measurable velocity of 4.77 m/s and a velocity resolution of 14.92 cm/s. The thermal camera has a 640 × 512 focal plane array (FPA), a spectral range of [7.5, 13] µm, a temperature range of [−25, 135] °C, a measurement uncertainty of ±5 °C, and a noise equivalent temperature difference (NETD) of 50 mK.

Software. The system has been developed in Python, using the NumPy, SciPy and OpenCV libraries for the implementation of the tracking phases (for radar and thermal camera) and the proposed data association (Section IV-F), clustering (Section IV-E) and re-identification (Section IV-J) algorithms. The Tensorflow and Keras libraries have been used to implement the feature extraction NN (Section IV-H). The pre-trained face detector for the thermal images (Section IV-B) has been taken from the open-source YOLOFace implementation.

[Tab. 2 caption (excerpt): ... whether A_x, A_d, or the sum of the two were used in the evaluation. Label "With ρ(K)" indicates that the corrective term, ρ(K), was used, while label "Without ρ(K)" means ρ(K) = 1.]

To assess the performance of the radar-TC track association method, experimental tests were conducted in a 7 × 4 m research laboratory. A motion tracking system including 10 cameras was used to gather ground-truth (GT) data about the locations of the subjects, by placing markers atop their heads. This camera-based tracking system provides 3D localization with millimeter-level precision, for all markers, at a rate of 100 Hz. The radar and the TC were placed as shown in Fig. 7. 5 measurement sequences with 2 subjects and 9 sequences with 3 subjects, all freely entering the room, were collected. The roto-translation matrix Φ was estimated using a set of markers applied to the devices, while the TC intrinsic matrix Ψ (see Section IV-F) and the radial distortion coefficients were obtained through Zhang's method [48], using a sun-heated checkerboard pattern.

An association is defined as a specific pairing i ↔ j of a track i from the radar with a track j from the TC, and a correct association as an association for which the two tracks correspond to the same subject. Given a set of tracks, the set of all the correct associations performed by the algorithm is denoted by A_TP (true positives), the set of all the associations performed by the algorithm by A_P (positives), and the set of all the associations that the algorithm should have performed, based on the GT, by A_R (relevant). To quantify the association performance of the system, define the precision, Pr = |A_TP|/|A_P|, and the recall, Rec = |A_TP|/|A_R|.
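As a tiny illustration of these metrics (with toy association sets, not experimental data):

```python
# Associations are (radar_track, tc_track) pairs; values are made up.
A_P = {(0, 0), (1, 2), (2, 1)}        # associations performed by the algorithm
A_R = {(0, 0), (1, 1), (2, 2)}        # relevant (ground-truth) associations
A_TP = A_P & A_R                      # correct associations (true positives)

precision = len(A_TP) / len(A_P)      # Pr = |A_TP| / |A_P|
recall = len(A_TP) / len(A_R)         # Rec = |A_TP| / |A_R|
print(f"Pr = {precision:.2f}, Rec = {recall:.2f}")
```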
Using these metrics, the proposed track association method is evaluated by assessing the contribution of each cost component in A(i, j) (see Eq. (11)). The results are reported in Tab. 2, where the row labels A_x, A_d, and A_x + A_d indicate the cost function used. The table also shows the impact of adding the correction coefficient ρ(K) (see Eq. (10)): for the case "Without ρ(K)", ρ(K) is set to 1. As shown, the proposed track association method reliably associates the radar and TC tracks, reaching precision and recall both higher than 97%. The joint use of A_x, A_d and ρ(K) leads to improvements of up to 11% and 8% in the precision and recall metrics, respectively.

Remarkably, the proposed temperature screening method does not require people to stand in front of the TC sensor, but estimates their temperature as they move within the FoV of the TC. In order for the method to return accurate temperature measurements, the subject's frontal face should be captured by the TC for a minimum time duration. For this reason, it is advisable to place the TC near a point of passage, e.g., in proximity of an entrance. The temperature screening method was tested on 4−7 sequences of ∼10 s each, collected from 4 different individuals moving within 3.5 m of the TC. Each subject was tested at a different time of the day, to gauge the effects of the changing (thermal) environmental conditions, and of a possible concept drift (e.g., heating) of the TC after a long period of operation. Furthermore, as explained in Section IV-C, a linear function α(·) was fit to compensate for the influence of the distance on the measures. To evaluate the benefit brought by the correction based on the targets' distance, Fig. 8a compares the proposed approach against a baseline that assumes a fixed distance of 2 m, multiplying each measured temperature by α(2) = a_0 + 2a_1. The full method (Corr. temp.), instead, uses the rescaled average estimate, as per Eq. (15). The boxplot shows that the range of the corrected temperatures is significantly reduced (for these experiments, the true temperature is constant), demonstrating the efficacy of the proposed correction plus averaging approach. As an illustrative example, Fig. 9 shows the impact of the distance-based correction on data measurements from a subject moving in front of the TC.

Fig. 8b compares the temperature estimates from milliTRACE-IR and the true temperatures measured with a contact thermometer. The numerical results are reported in Tab. 3, where the worst cases are shown in bold. Mean temperatures are estimated with a maximum standard deviation from the mean smaller than 0.5°C and a maximum absolute error, with respect to the true temperature, of about 0.5°C. Note that only one of the subjects in Fig. 8b exhibits this maximum error (subject 3), while the absolute error for the others remains within 0.1°C. These errors stem from the fact that the environmental conditions and the heating of the thermal camera affect the measurements in an unpredictable way, modifying the bias of the fitting function. Nevertheless, the thermal screening capability of milliTRACE-IR is significantly better than that of existing approaches, see Section V-G. Also, some improvements are possible by, e.g., applying a correction based on an external reference, such as a piece of material instrumented with a contact thermometer and located within the field of view of the TC, or monitoring the statistics of the people's temperature (mean µ and standard deviation σ) to detect anomalous samples within such empirical distribution.
To evaluate the performance of the radar tracking system in estimating the position of the targets and the inter-subject distance, tests were conducted in the 7 × 4 m research laboratory described in Section V-B. A total of 7 sequences of duration 10-15 s were collected, each with 3 subjects moving freely in the room, along with their GT locations obtained from the motion tracking system. The root mean squared error (RMSE) between the mmWave radar estimated locations and the GT is used as a performance metric. Moreover, the inter-subject distances were measured, considering all the possible combinations of the three subjects and leading to a total of 21 inter-subject distances across all the recorded sequences. The cumulative distribution functions (CDF) of the absolute error between the ground truth and the estimated subject's position/inter-subject distance, as measured by the radar tracking system, are shown in Fig. 10, along with the corresponding mean values. The numerical results are provided in Tab. 4. The radar system achieves an absolute positioning error within 0.3 m in 80% of the cases. For the inter-subject distance, the error remains within 0.25 m in 80% of the cases.

To evaluate the improvement brought by the proposed clustering method over the standard DBSCAN, both algorithms were tested on specific measurement sequences with subjects moving within 1 m from one another. To quantify the clustering performance, the correct clustering ratio, r_cl, is used. This metric represents the fraction of frames in which the clusters belonging to the different subjects are correctly separated. The results of this evaluation are summarized in Tab. 5 [caption: ratio r_cl between the number of frames in which the different subjects are correctly separated and the total number of frames, for the proposed method and DBSCAN; symbols "✓" and "×" denote success and failure of the tracking step, respectively]. The evaluation is conducted on sequences with 2 and 3 individuals (i) walking along parallel paths with the same velocity and at a distance between 0.5 m and 0.8 m (parallel), (ii) walking along crossing paths, with subjects coming as close as 0.2 m from one another (crossing), and (iii) staying still and moving their arms at an inter-subject distance of approximately 0.8 m (close). The proposed clustering algorithm led to a large improvement (up to 44%) in terms of the r_cl metric with respect to DBSCAN. In addition, for 3 of the 5 test sequences, DBSCAN led to failures in the tracking process, either merging the tracks of different subjects or failing to detect some of them, while milliTRACE-IR correctly tracked all the subjects in all cases.

The proposed WELM-based Re-Id algorithm was evaluated on a set of mmWave radar measurements from 6 individuals who were not included among the 16 subjects used to train the feature extraction NN. The tests were conducted in a 12 × 3 m research lab, with furniture that made the evaluation challenging. The training data contains 4 minutes of measurements per subject (3,600 radar frames), while over 1 minute of measurements per subject (1,000 frames) was used as test data. In both the training and the test data, the individuals walked freely in the room. The radar position was changed for each test to gauge the impact of varying the radar point-of-view.
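Before discussing the results, the score-accumulation rule of Alg. 2 (lines 6-10) can be sketched as follows. This is a reconstruction based on the description given at the beginning of this section: per-frame WELM score vectors are averaged over a window of W frames, and the label is taken as the arg max of ξ_W; array shapes and names are assumptions.

    import numpy as np

    def reid_label(welm_scores):
        # welm_scores: (W, M) array of per-frame WELM output scores for one
        # track, with M candidate identities.
        W = welm_scores.shape[0]
        # Running average of the scores: xi_j for j = 1, ..., W.
        xi = np.cumsum(welm_scores, axis=0) / np.arange(1, W + 1)[:, None]
        # Identity label: index of the largest element of xi_W.
        return int(np.argmax(xi[-1]))

    scores = np.array([[0.2, 0.7, 0.1],
                       [0.5, 0.4, 0.1],
                       [0.1, 0.8, 0.1]])  # toy scores, W = 3, M = 3
    print(reid_label(scores))  # -> 1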
Re-Id accuracy. The Re-Id accuracy as a function of W (see Alg. 2) is shown in Fig. 11a and Fig. 11b. The curves in these plots are obtained by averaging the results over 20 different WELM initializations and over all the possible combinations of the considered number of subjects (from 2 to 6) out of the 6 total individuals. As expected, the Re-Id performance increases with an increasing inference time (larger W) and with the length of the training sequences: the accuracy gain is about 10% when going from 1-minute-long (Fig. 11a) to 3-minute-long (Fig. 11b) training sequences. Also, milliTRACE-IR reaches high Re-Id accuracy using W ≥ 15 s, and the detrimental effect of an increasing number of subjects to be classified is greatly reduced using larger values of W, as accumulating the WELM scores over longer time windows increases the robustness of the WELM decision. Overall, the accuracy of the proposed method is higher than 95% in all cases, using only 3 minutes of training data per subject and W = 20 s, which are reasonable values in practice. The worst-case (3 minutes of training data for 6 subjects) WELM training time, on the ARM Cortex-A57 processor of the Jetson TX2 device, was 2.98 ± 0.015 s.

Impact of imbalanced training data. As shown in Fig. 11c, the effect of imbalanced training data is successfully mitigated by the sample weighting strategy of Eq. (19). In this evaluation, the WELM was trained with 1 minute of data for a randomly selected subset containing half of the subjects and 4 minutes for the remaining half.

Improvement over a baseline. Tab. 6 compares the WELM to a baseline classification method widely used in camera-based person Re-Id [27] that, unlike milliTRACE-IR, does not learn a similarity score based on the actual distribution of the feature vectors at operation time. The baseline algorithm collects the training feature vectors along with the corresponding labels and computes the centroid of each class m in the NN feature space, denoted by c_m. To re-identify a subject, the cosine similarity (CS) between his/her feature vectors, v, and the centroid of each class m is computed, obtaining a similarity score s_m = c_m^T v / (||c_m||_2 ||v||_2), and the classification is performed by taking arg max_m s_m. The WELM outperforms the baseline scheme in all the tests, see Tab. 6. The performance gap is significant for little training data (up to 16% improvement), small windows, and imbalanced training sets.
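A minimal sketch of this centroid-plus-cosine-similarity baseline is given below; it follows the description above, but the variable names and the toy data are illustrative.

    import numpy as np

    def fit_centroids(features, labels):
        # One centroid per class in the NN feature space.
        classes = np.unique(labels)
        return classes, np.stack([features[labels == m].mean(axis=0)
                                  for m in classes])

    def cs_reid(v, classes, centroids):
        # Cosine similarity s_m between v and each class centroid c_m,
        # then classification by arg max over m.
        s = centroids @ v / (np.linalg.norm(centroids, axis=1)
                             * np.linalg.norm(v))
        return classes[np.argmax(s)]

    # Toy example with 2D feature vectors and two identities.
    X = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]])
    y = np.array([0, 0, 1, 1])
    classes, C = fit_centroids(X, y)
    print(cs_reid(np.array([0.8, 0.2]), classes, C))  # -> 0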
In this section, a comparison between milliTRACE-IR and available methods from the literature is provided. To the best of the authors' knowledge, only two works exploit both mmWave radars and thermal cameras to perform human sensing and/or temperature screening, namely, the works from Ülrich et al. [24] and Savazzi et al. [10]. Since neither of the two tackles all the points that milliTRACE-IR addresses, they are considered separately here, to compare different aspects. The data association strategy is compared with that proposed in [24], while [10] is used to compare the positioning, distance monitoring, and temperature screening parts. In Tab. 7, symbols "×" and "n.a." denote, respectively, that the task is not tackled or that no specific result is provided in the corresponding work.

Data association. In [24] (Ülrich et al.), people are detected in thermal images by applying the Viola-Jones algorithm to detect the upper bodies of the subjects in the environment. The distance between the TC and each subject is roughly retrieved from the dimensions of the bounding box enclosing the upper body of each person, similarly to what milliTRACE-IR does with faces. The TC detections are then associated, on a frame basis, with range measurements obtained with a mmWave radar by minimizing a Gaussian-shaped association cost. This cost provides an estimate of the probability that the corresponding association is correct, based on the difference between the distance estimates of the TC and of the mmWave radar. This data association method has been implemented and tested on the dataset of Section V-B, comparing it to the data association strategy of milliTRACE-IR. For a fair comparison, the YOLOv3 detector has been used in place of the Viola-Jones algorithm, as, besides providing superior performance, it is the same detector used by milliTRACE-IR. This guarantees that any difference in the results is due only to the data association strategy. At every time frame, each bounding box has been associated with the radar detection yielding the highest association probability, which corresponds to the smallest difference between the two distance estimates. The main differences between the approach in [24] and that of milliTRACE-IR are that, in [24]: (i) the association is per-frame and not per-track, (ii) the distance estimated from the TC is the only feature considered for the association, and (iii) the Hungarian algorithm is not used, so different bounding boxes can be erroneously associated with the same radar detection. Numerical results for the precision ("Pr") and recall ("Rec") metrics are presented in Tab. 7. Since the association technique of [24] performs a per-frame association, the table shows the per-frame performance of milliTRACE-IR, computed by counting the number of frames that are correctly classified using milliTRACE-IR's per-track association algorithm. From these results, it can be seen that milliTRACE-IR performs notably better in associating mmWave radar with TC human detections. The largest improvement is brought by the combination of milliTRACE-IR's per-track association paradigm with the Hungarian algorithm, which effectively filters out the ghost tracks and spurious detections that often occur in real-world scenarios, significantly boosting the robustness of the scheme.
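A minimal sketch of this per-frame baseline, as reimplemented for the comparison, is reported below; the Gaussian spread σ and all names are illustrative assumptions.

    import numpy as np

    def per_frame_association(d_tc, d_radar, sigma=0.5):
        # Associate each TC bounding box with the radar detection of highest
        # Gaussian association probability, based on the difference between
        # the TC and radar distance estimates. No Hungarian step: several
        # boxes may pick the same radar detection.
        d_tc = np.asarray(d_tc)[:, None]        # (n_boxes, 1)
        d_radar = np.asarray(d_radar)[None, :]  # (1, n_detections)
        prob = np.exp(-((d_tc - d_radar) ** 2) / (2 * sigma ** 2))
        return np.argmax(prob, axis=1)          # radar index per box

    print(per_frame_association([2.1, 3.9], [4.0, 2.0]))  # -> [1 0]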
Positioning, distancing, and temperature screening. In [10] (Savazzi et al.), people localization, interpersonal distance monitoring, and temperature screening are addressed using thermopiles and mmWave radars. Since the data association strategy of [10] is not disclosed, a comparison is provided only for the aforementioned tasks. In that paper, positioning performance is evaluated in terms of range (radial distance) and angular RMSEs. Numerical values for these metrics are given in Tab. 7, considering the dataset of Section V-D for milliTRACE-IR and the (average) values from Tab. II of [10] for their algorithm. In the same work, interpersonal distance monitoring is obtained by dividing the monitored area into a regular grid whose cells have a side length of 0.5 m. The system is able to distinguish subjects occupying adjacent cells, which are considered to be violating the minimum interpersonal distance of 1 m, thus raising an alarm. For this reason, the resolution of the method of [10] is 0.5 m in the best case (a lower bound for the interpersonal distance estimation error). In Tab. 7, this value is reported alongside the RMSE of milliTRACE-IR in measuring interpersonal distances, marking the former with a "*" symbol to highlight that it is not an RMSE. Thermal screening performance comparisons are also presented in Tab. 7, where "Thermal screening range [m]" refers to the maximum distance at which the tests were carried out. milliTRACE-IR performs better than [10] in all the considered tasks, showing a larger monitoring range and more accurate body temperature estimates. In addition, milliTRACE-IR combines these monitoring capabilities with a robust data association strategy and with the capability to re-identify subjects as they move through different areas.

This work presents the design and implementation of milliTRACE-IR, the first system combining high-resolution mmWave radar devices and infrared cameras to perform non-invasive joint temperature screening and contact tracing in indoor spaces. The system uses thermal cameras to infer the temperature of the subjects, achieving measurement errors within 0.5°C, and mmWave radars to infer their spatial coordinates, successfully locating and tracking subjects that are as close as 0.2 m apart. This is possible thanks to improvements along several lines, such as the association of the thermal camera and radar tracks belonging to the same subject, along with a novel clustering algorithm combining density-based and Gaussian mixture methods to separate the radar reflections coming from different subjects as they move close to one another. Moreover, milliTRACE-IR performs contact tracing: a person with high body temperature is reliably detected by the thermal camera sensor and subsequently traced across a large indoor area in a non-invasive way by the radars. When entering a new room, this subject is re-identified among several other individuals with high accuracy (95%), by computing gait-related features from the radar reflections through a deep neural network and using a weighted extreme learning machine as the final re-identification tool.

Future research includes improvements of the re-identification mechanism for contact tracing. milliTRACE-IR uses an offline pre-training phase for the neural-network-based feature extraction block, which is performed on a dataset including several subjects. This step could possibly be removed by exploiting recent developments in self-supervised learning, thus greatly enhancing the generality and usability of the system. Self-supervised learning could allow the automatic training of the gait feature extractor in a fully online fashion, by exploiting the human movement traces of opportunity that are gathered during system operation. Moreover, a large-scale implementation of milliTRACE-IR, featuring tens of mmWave radars distributed across a large and crowded indoor environment, is another interesting research direction.
In such a setup, it is key to develop data fusion and collaborative sensing algorithms for radars with overlapping fields of view, so as to provide improved resilience to occlusions and better human tracking performance.

REFERENCES
[1] Enabling and emerging technologies for social distancing: A comprehensive survey.
[2] Effect of the social distancing measures on the spread of COVID-19 in 10 highly infected countries.
[3] Lessons learned from the investigation of a COVID-19 cluster in Creil, France: Effectiveness of targeting symptomatic cases and conducting contact tracing around them.
[4] A study on contact tracing apps for COVID-19: Privacy and security perspective.
[5] Pattern recognition and machine learning.
[6] A density-based algorithm for discovering clusters in large spatial databases with noise.
[7] A new approach to linear filtering and prediction problems.
[8] YOLOv3: An incremental improvement.
[9] Kalman and extended Kalman filters: Concept, derivation and properties.
[10] Processing of body-induced thermal signatures for physical distancing and temperature screening.
[11] Location fingerprinting with Bluetooth Low Energy beacons.
[12] WiFi Bluetooth based combined positioning algorithm.
[13] The visual social distancing problem.
[14] A wearable magnetic field based proximity sensing system for monitoring COVID-19 social distancing.
[15] DeepSOCIAL: Social distancing monitoring and infection risk assessment in COVID-19 pandemic.
[16] COVID-Robot: Monitoring social distancing constraints in crowded scenarios.
[17] Indoor tracking of multiple persons with a 77 GHz MIMO FMCW radar.
[18] mID: Tracking and identifying people with millimeter wave radar.
[19] Noncontact body temperature measurement: Uncertainty evaluation and screening decision rule to prevent the spread of COVID-19.
[20] Inner eye canthus localization for human body temperature screening.
[21] AI thermometer for temperature screening: Demo abstract.
[22] Extending reliability of mmWave radar tracking and detection via fusion with camera.
[23] A deep learning-based radar and camera sensor fusion architecture for object detection.
[24] Person recognition based on micro-Doppler and thermal infrared camera fusion for firefighting.
[25] Indoor person identification using a low-power FMCW radar.
[26] Gait recognition for co-existing multiple people using millimeter wave sensing.
[27] Deep learning for person re-identification: A survey and outlook.
[28] Learning long-term representations for person re-identification using radio signals.
[29] Person re-identification based on automotive radar point clouds.
[30] Real-time people tracking and identification from sparse mm-wave radar point-clouds.
[31] Automotive radars: A review of signal processing techniques.
[32] Principles of modern radar.
[33] How to find the right thermal imaging camera.
[34] Computer vision: Models, learning, and inference.
[35] Weighted extreme learning machine for imbalance learning.
[36] Numerical optimization.
[37] Optimal state estimation: Kalman, H-infinity, and nonlinear approaches.
[38] Radar signal processing for jointly estimating tracks and micro-Doppler signatures.
[39] Evaluating mmWave sensing ability of recognizing multi-people under practical scenarios.
[40] The probabilistic data association filter.
[41] The Hungarian method for the assignment problem.
[42] PointNet: Deep learning on point sets for 3D classification and segmentation.
[43] WaveNet: A generative model for raw audio.
[44] Deep learning.
[45] A discriminative feature learning approach for deep face recognition.
[46] FaceNet: A unified embedding for face recognition and clustering.
[47] Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks.
[48] A flexible new technique for camera calibration.
Italy. He is currently pursuing a Ph.D. degree in Information Engineering with the SIGNET research group of the Department of Information Engineering (DEI) of the same university. His research interests include machine learning, signal processing, sensor fusion, and remote sensing with mmWaves.

His research interests include signal processing, sensor fusion, and machine learning, with applications to mmWave sensing and integrated sensing and communication solutions.

Over the years, he has been involved in several EU projects on wireless sensing and IoT and has collaborated with major companies such as Ericsson, DOCOMO, Samsung, and INTEL. His research is currently supported by the European Commission through the H2020 projects MINTS (no. 861222) on "mmWave networking and sensing" and GREENEDGE (no. 953775) on "green edge computing for mobile networks".