key: cord-0273702-hhfj6dn3
authors: Ren, Yili; Yang, Jie
title: 3D Human Pose Estimation for Free-from and Moving Activities Using WiFi
date: 2022-04-16
journal: nan
DOI: 10.1145/3485730.3492871
sha: ec09ac71782358a88ec4de64fb4629eb1fcd4f83
doc_id: 273702
cord_uid: hhfj6dn3

This paper presents GoPose, a 3D skeleton-based human pose estimation system that uses WiFi devices at home. Our system leverages the WiFi signals reflected off the human body for 3D pose estimation. In contrast to prior systems that need specialized hardware or dedicated sensors, our system does not require a user to wear or carry any sensors and can reuse the WiFi devices that already exist in a home environment for mass adoption. To realize such a system, we leverage the 2D AoA spectrum of the signals reflected from the human body and the deep learning techniques. In particular, the 2D AoA spectrum is proposed to locate different parts of the human body as well as to enable environment-independent pose estimation. Deep learning is incorporated to model the complex relationship between the 2D AoA spectrums and the 3D skeletons of the human body for pose tracking. Our evaluation results show GoPose achieves around 4.7cm of accuracy under various scenarios including tracking unseen activities and under NLoS scenarios.

Estimating the human pose is gaining increasing attention as the human body offers a high degree of freedom for human-computer interactions (HCI). It is a crucial building block to support a variety of emerging applications in smart homes, such as virtual/augmented reality [7, 36], interactive exergaming [43, 49], well-being [24, 64], and exercise monitoring [44, 46]. Traditional human pose estimation systems mainly rely on either computer vision techniques that require the installation of specialized cameras (e.g., RGB or infrared cameras) [10, 61], or wearable approaches where users wear or carry dedicated sensors (e.g., IMU sensors, RFID) [19, 50]. However, the vision-based systems cannot work in non-line-of-sight (NLoS) and poor lighting scenarios, for example when the person is behind a folding screen or in a dark environment, whereas the wearable systems could be inconvenient as they require explicit user involvement. Moreover, the necessity for dedicated devices plus the cost of the hardware in these systems dampens the likelihood of mass adoption. More recently, Radio Frequency (RF) based sensing has become an appealing alternative for human pose estimation. It analyzes the RF signals reflected off the human body for activity and human pose tracking, and thus does not require a user to wear or carry any sensors. Unlike the vision-based approach, it also works under NLoS scenarios as the RF signals can penetrate folding screens and walls. Existing work in RF-based human sensing uses either specialized hardware (e.g., USRP, FMCW RADAR) [31, 63] or commodity WiFi devices [25, 42, 53]. As the specialized hardware-based systems are less attractive for consumer-oriented use due to their high hardware cost, we focus on the commodity WiFi-based approach as it could be enabled with a simple software update by reusing existing WiFi devices in smart home environments [41].

Figure caption (partial): The computer provides elderly monitoring by estimating the elders' 3D poses even when they walk behind a screen or wall (i.e., NLoS) and sends alerts when a fall or injury occurs.

In this work, we propose GoPose, a 3D skeleton-based human pose estimation system that reuses WiFi devices in a home environment. Unlike prior WiFi-based 3D human pose estimation systems that only work for a set of predefined activities [13] or for activities performed at a fixed position [37], our system works for unseen activities even when the user is moving around, offering on-the-go pose tracking. This is because our system can extract the two-dimensional (2D) angle of arrival (AoA) of the incident signals, which can represent the spatial information of different body parts or joints regardless of activities or user positions. A deep learning model is then used to identify the joints and transfer the 2D AoA spatial information to the joint locations in physical space. In contrast, WiPose [13] extracts features that cannot directly represent the spatial information of the body joints, and relies on a deep learning model to map these features, which are less related to the spatial information, to the joint locations. Such a mapping relationship could be weak when the user is moving around or performing unseen activities. As shown in Figure 1, GoPose could be utilized to digitize a user's 3D full-body motions into a set of body joints to enable new interactive experiences beyond the traditional computer vision and touch-based human-computer input. It works when people are mobile and across occlusions, where the line of sight is blocked by a folding screen or walls. As GoPose could reuse WiFi devices, it does not incur an additional cost, and thus is promising for mass adoption by end-users in smart homes.

Estimating 3D human pose solely from the WiFi signals bounced off the human body faces unique challenges. First, unlike the USRP or FMCW RADAR that offers accurate spatial information (e.g., the location and shape of objects) [62, 63] , the Channel State Information (CSI) data directly exported from the off-the-shelf WiFi devices does not provide any spatial information of the human body. To tackle this challenge, we leverage the two-dimensional angle of arrival of the incident signals derived from the non-linearly spaced antennas to provide spatial information for pinpointing the human body. Moreover, we propose to combine both the spatial diversity at the transmitter and the frequency diversity of WiFi OFDM subcarriers to increase the spatial resolution of 2D AoA for differentiating signals reflected from different parts of the human body.

Another challenge is that the received WiFi signal is dominated by the signals reflected from the indoor environment, such as those from walls and furniture. How can the human pose estimation system be made independent of the environment it operates in? That is, once the system is configured in one environment, it should work well across different environments, for example after moving to a different house. To handle this challenge, we leverage the spatial characteristic of the 2D AoA spectrum to separate the signals bounced off the human body from the ones reflected from the static environment. In particular, we subtract the 2D AoA spectrum of the static environment from the ones we extract when one or more users are performing activities. In addition, we propose to combine the 2D AoA spectrums of multiple packets at multiple receivers to resolve the issue of specularity of the human body, i.e., one received WiFi packet only captures a subset of body motions in a particular direction.

The next challenge is to model the complex relationship between the 2D AoA spectrums and the 3D skeletons of the human body. The complexity is very high as the human body has a high degree of freedom and the user could be moving around at different locations with different orientations. Instead of using analytic kinematic models, we leverage the deep learning models of the convolutional neural network (CNN) and the Long Short-Term Memory (LSTM) to abstract the 3D human pose from the 2D AoA spectrums. In particular, the CNN is a useful technique to extract spatial dynamics (e.g., the locations of multiple limbs and the torso), whereas the LSTM is a special kind of recurrent neural network (RNN) that models and estimates temporal dynamics of human poses (e.g., trajectories of limbs and torso).

We experimentally evaluate GoPose in different home environments with various activities performed by different users. We conduct experiments under non-line-of-sight scenarios, different distances between WiFi devices, and multi-person scenarios. Results show that our system is highly accurate in constructing 3D human poses for moving users, even for unseen activities. The main contributions of our work can be summarized as follows:

• We propose a 3D human pose estimation system for moving users and unseen activities by leveraging WiFi devices. The proposed system does not require any dedicated or specialized sensors and can work under NLoS scenarios.
• We estimate the human pose based on the 2D AoA spectrums derived from the non-linearly spaced antennas. The 2D AoA spectrum offers unique advantages including providing spatial information of the human body and enabling environment-independent human pose estimation.
• We leverage deep learning (i.e., CNN and LSTM) to model the 2D AoA spectrums and the human body for inferring 3D skeletons of human pose. Experimental results show that GoPose achieves around 4.5 cm of accuracy under various scenarios including tracking unseen activities and under the NLoS scenarios.

Existing work of human pose estimation can be divided into three categories: computer vision-based, wearable sensor-based, and RF signal-based.

Computer vision-based. There exist many systems for 2D or 3D body skeleton and pose estimation that utilize images or videos. For example, recent conventional RGB camera-based 2D body pose estimation systems [4, 10, 29] have made great progress by leveraging deep learning models and annotated pose databases. Meanwhile, 3D pose estimation has also attracted growing interest from researchers. For instance, Sigal et al. [40] presented a baseline algorithm for 3D human pose estimation by leveraging an optimized, relatively standard Bayesian framework, whereas Pavllo et al. [30] proposed a fully convolutional model with semi-supervised training on video data over 2D keypoints. VNect [26] achieves full global 3D human pose tracking with only one RGB camera, while Kanazawa et al. [14] reconstruct a full 3D mesh human body model from a single RGB image. Moreover, LiSense [21] utilizes visible light communication to achieve concurrent 3D human skeleton reconstruction and real-time data communication, whereas Ahuja et al. [1] proposed to leverage multiple cameras and sensors on the smartphone to approximate the full-body pose while on the go. Similarly, Simo [2] uses the cameras of a smartphone to track body movements, whereas another system [3] installs cameras below the floor to track human poses. Additionally, commercial depth/infrared cameras are used to track 3D human pose, for example, in Microsoft Kinect [27] and Leap Motion [11, 48]. These computer vision-based approaches, however, cannot work in non-line-of-sight (NLoS) and poor lighting scenarios. In addition, the vision-based approaches often involve user privacy concerns and incur a non-negligible cost.

Wearable sensor-based. The wearable sensor-based approaches require the user to carry or wear one or multiple dedicated sensors. For example, existing systems [6, 15] can track the movement of arms by attaching various sensors on the upper limbs of the user. Moreover, 3D hand pose could be estimated by using a wrist-worn camera [55] or attachable electromagnets [5] , whereas the full-body motions could be reconstructed by attaching the IMU sensors to various positions of the body [12, 47] . In addition, there are several systems that utilize a single wearable sensor for body motion tracking [33, 34] . For instance, RecoFit [28] can provide real-time feedback and post-workout analysis for strength-training exercises by leveraging a single sensor on the arm, while ArmTrack [39] utilized a single smartwatch to rebuild the arm motions. However, these wearable sensor-based systems all rely on dedicated sensors that need to be attached to or worn by the user, which could be inconvenient and cumbersome under daily life scenarios and incur non-negligible installation overhead.

RF signal-based. Many research efforts have been dedicated to human pose tracking leveraging RF signals [41] . For instance, Zhao et al. proposed RF-Pose [62] that leverages a teacher-student network to estimate 2D human pose through the walls. They further proposed RF-Pose3D [63] , which could infer 3D human body skeletons based on a convolutional neural network model. However, these systems rely on the specialized hardware that emits FMCW signals across a large bandwidth, which is dozens of times wider than that of the WiFi bandwidth. Moreover, it requires a carefully designed and synchronized T-shape antenna array to obtain accurate spatial information, which contains four antennas and sixteen antennas for vertical array and horizontal array, respectively. Moreover, Zhang et al. presented Wall++ [60] , a sensing approach that can estimate the body pose of users but requires patterning large electrodes onto a wall using conductive paint. However, these systems leveraging specialized hardware are less scalable for mass adoption due to the cost of hardware and the overhead of installation.

To enable potential mass adoption, many systems utilize WiFi devices that can be found in a home environment to enable a wide range of applications including large-scale activity recognition [32, 53] (e.g., daily activities, dancing movements), indoor localization [23, 56, 57, 65], small-scale motion sensing [25, 42] (e.g., vital signs, finger gestures), and object sensing [35, 45] (e.g., fruit ripeness, liquid level). For example, E-eyes [53] and WiFinger [42] are among the first works to leverage commodity WiFi to classify different daily activities and finger gestures, respectively, whereas Liu et al. [25] and Tan et al. [45] are among the first to perform vital sign sensing and object sensing (i.e., FruitSense [45]), respectively. For human pose estimation, Person-in-WiFi [51] rebuilds the 2D skeleton for pose estimation. The system heavily relies on features such as Part Affinity Fields [4] and Segmentation Masks [10], which are only suitable for 2D scenarios. WiPose [13] is a more recent work using commercial off-the-shelf WiFi devices to track the 3D human pose. However, it only works well for a set of predefined activities. Winect [37] is the most recent system that works for free-form human activity tracking but is limited to activities performed at a fixed location. In our work, we propose a 3D skeleton-based human pose estimation system that offers on-the-go pose tracking even for unseen activities.

2D AoA. It is worth noting that there are many research efforts related to 2D AoA. For example, Lee et al. proposed a low-complexity estimation algorithm [18] for the estimation of 2D AoA using a pair of uniform circular arrays. In addition, Li et al. proposed a 2D AoA estimation model [20] based on the motion of a 1D nested array. Wei et al. presented a method [54] for pair-matching of elevation and azimuth angles in 2D AoA estimation with the L-shaped array. Wang et al. proposed a framework [52] that occupies a larger and provided better estimation of 2D AoA performance for multiple-input multiple-output (MIMO) radar. mmEye [59] is a super-resolution imaging system toward a mmWave on commodity 60GHz WiFi devices. It developed a super-resolution imaging algorithm based on Multiple Signal Classification (MUSIC) and 2D AoA. However, this system achieves 2D human imaging instead of 3D human pose estimation.

To the best of our knowledge, none of them use 2D AoA to estimate the 3D human pose with WiFi signals. In our application, we further leverage the spatial diversity at transmitting antennas and the frequency diversity of OFDM subcarriers of WiFi to improve the resolution of the 2D AoA. We note that some of these 2D AoA techniques could be incorporated into our system.

In this section, we present the preliminaries of WiFi sensing, and discuss 1D and 2D AoA estimation of the incident signals and their limitations.

WiFi has been evolving from providing laptop connectivity to connecting smart home devices such as tablets, smartphones, smart speakers, refrigerators, and smart TVs to home networks and the Internet. It has resulted in a large number of WiFi devices, which provides the opportunity to extend WiFi's capabilities beyond communication, particularly in sensing the physical environment. As the WiFi signals travel through space, they interact with the human body, and any human activities, either small scale or large scale, affect the signal propagation. With measurable changes in the received WiFi signals, human activities in the physical environment thus could be inferred.

Moreover, with the advanced WiFi technology, WiFi radio offers channel state information (CSI) (i.e., the sampled version of the channel frequency response) to estimate the channel condition for fast and reliable communication. The CSI thus could be directly exported from the network interface card (NIC) to measure the changes in the CSI amplitude and phase for inferring human activities. In particular, current 802.11a/g/n/ac employs OFDM technology, which partitions the relatively wideband WiFi channel into 52 subcarriers and provides detailed CSI for each subcarrier.

The CSI directly exported from commodity WiFi devices only provides information on how the wireless channel was disturbed by human activities. However, it offers no spatial information regarding human activities, such as the location and the shape of the human body. We thus turn to examine the AoA of the incident signals at the antenna array of WiFi devices to derive spatial information for human pose estimation.

The one-dimensional (1D) angle of arrival (AoA) of the incident signals (e.g., the LoS signals and the reflected signals from the human and environment) could be derived if the WiFi receiver is equipped with a linear antenna array. However, the 1D AoA provides only limited spatial information, which is insufficient for tracking multiple human limbs. For instance, it cannot provide the position of a person in a 2D space, let alone pinpoint the spatial locations of multiple human limbs.

In this section, we discuss the system design, present the methods to improve the resolution of the 2D AoA spectrum and remove the environment effects, and describe the design of the deep learning models.

The basic idea of our system is to leverage the spatial information of the 2D AoA spectrum and deep learning to model the complex 3D skeletons of the human body for 3D pose estimation. As illustrated in Figure 4, a WiFi transmitter sends out signals to multiple WiFi receivers to probe human activities. The system takes as input time-series CSI measurements, which can be exported from the NICs of commodity WiFi devices. The CSI measurements are exported for 30 subcarriers on each WiFi link. The system can use the CSI measurements from existing traffic across these links, or it can generate periodic traffic for sensing purposes. This data is then preprocessed to remove noise by using a linear fit method proposed in the literature. The core of our system, GoPose, is the 2D AoA extraction and the 3D pose construction. The 2D AoA extraction encompasses three different components to address the issues of spatial resolution, environment independence, and the specularity of the human body. The system first combines both the spatial diversity and the frequency diversity to increase the resolution of the 2D AoA for differentiating signals reflected from different parts of the human body. It then goes through static environment removal to filter out the signals reflected from the indoor environment. After that, the system combines the 2D AoA spectrums of multiple packets at multiple receivers to resolve the issue of specularity of the human body (i.e., one packet can only capture a subset of motions of the human body). Next, our system leverages the deep learning models of CNN and LSTM to construct the 3D pose of the human body based on the 2D AoA spectrum. The CNN is used to capture the spatial features of the human body parts, while the LSTM is utilized to estimate the temporal features of the motions. GoPose offers on-the-go pose tracking for unseen activities. As it relies on the WiFi signal reflections for human pose tracking, it does not require the user to wear or carry any devices and works under NLoS scenarios. It could also be enabled with a simple software update for mass adoption as it can reuse existing WiFi devices in a home environment.
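The text does not spell out which linear fit method is used for preprocessing. As a minimal sketch, assuming a SpotFi-style sanitization in which one common linear trend across subcarriers is removed from the unwrapped CSI phase, the step could look as follows. The function name `sanitize_phase` and the toy input are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sanitize_phase(csi):
    """Remove a common linear phase trend across subcarriers.

    csi: complex array of shape (n_antennas, n_subcarriers).
    ASSUMPTION: a single line is fitted to the antenna-averaged phase, so the
    phase differences between antennas (which carry the AoA) are preserved.
    """
    phase = np.unwrap(np.angle(csi), axis=1)
    k = np.arange(csi.shape[1])
    slope, intercept = np.polyfit(k, phase.mean(axis=0), 1)
    cleaned = phase - (slope * k + intercept)
    return np.abs(csi) * np.exp(1j * cleaned)

# Toy CSI: 3 antennas, 30 subcarriers, with an artificial linear phase slope.
k = np.arange(30)
antenna_offsets = np.exp(1j * np.array([[0.0], [0.5], [1.0]]))
csi = antenna_offsets * np.exp(-1j * 0.2 * k)
clean = sanitize_phase(csi)
```

Fitting one line for all antennas, rather than one per antenna, is a design choice in this sketch: a per-antenna fit would also erase the inter-antenna phase differences that AoA estimation depends on.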

As the limited number of antennas on commodity WiFi receivers (e.g., the Intel 5300 card has only up to three antennas) provides insufficient 2D AoA resolution for 3D human pose estimation, we seek help from both the spatial diversity at the transmitter and the frequency diversity of the WiFi OFDM subcarriers. In particular, existing approaches [17, 22] only utilize the frequency diversity of OFDM subcarriers to calculate the time of flight (ToF) to improve AoA estimation. In our work, we improve the 2D AoA estimation by leveraging both the spatial diversity of three transmitting antennas and the frequency diversity of thirty OFDM subcarriers. The spatial diversity of the three transmitting antennas introduces phase shifts due to the angle of departure (AoD), while the frequency diversity of OFDM subcarriers results in phase shifts with respect to the time of flight (ToF). Thus, we can jointly estimate 2D AoA, AoD, and ToF by leveraging both the spatial and frequency diversities to dramatically improve the resolution of the 2D AoA spectrum.

Specifically, we utilize CSI measurements of the WiFi signals across all OFDM subcarriers transmitted from multiple transmitting antennas and received at multiple receiving antennas to generate a large number of virtual sensing elements. In our implementation, for each subarray (i.e., on the X_A-axis or Y_A-axis in the X_A-Y_A-Z_A coordinates), we have two receiving antennas, three transmitting antennas, and thirty subcarriers. Thus, there are in total 180 sensing elements for each axis. This provides three times better resolution than that achieved in existing work [17, 22], and ninety times better resolution than 1D AoA estimation. The spatial and frequency diversities thus provide sufficient information for our system to jointly estimate high-resolution 2D AoA (azimuth and elevation), AoD, and ToF. This information can be combined to improve the resolution of 2D AoA estimation. Figure 5 illustrates that we can separate multipath signals in 2D space and capture different parts of the human body of the moving subject with the improved 2D AoA spectrum. Figure 5 (a1) shows multipath signals including the LoS signal, the signal reflected by the wall, and the signals reflected from different parts of the human body. The resulting 2D AoA spectrum is shown in Figure 5 (a2). We can observe that the improved 2D AoA spectrum can be used to differentiate the multipath signals, such as the signals that come from the LoS, environment, and human body reflections. We can also observe that the signals reflected from different parts of the human body (e.g., arms, legs, and torso) are located at different spatial locations, as shown in Figure 5 (c2).
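The element count above can be checked with a short sketch. It also builds a joint steering vector for one axis as a Kronecker product of the receive, transmit, and subcarrier phase terms, which is one common way to model joint AoA/AoD/ToF estimation. The phase conventions (angles measured from the array axis), the half-wavelength spacing, and the 1.25 MHz spacing between reported subcarriers are assumptions for illustration; the source does not give these details.

```python
import numpy as np

C = 3e8                        # speed of light (m/s)
F_C = 5.32e9                   # carrier frequency from the experimental setup (Hz)
DELTA_F = 1.25e6               # ASSUMED spacing between reported subcarriers (Hz)
N_RX, N_TX, N_SUB = 2, 3, 30   # per-axis counts stated in the text
N_VIRTUAL = N_RX * N_TX * N_SUB

def steering_vector(aoa_deg, aod_deg, tof_s, d_rx, d_tx):
    """Joint steering vector for one axis (hypothetical helper).

    Each factor is the phase progression caused by one parameter:
    AoA across receive antennas, AoD across transmit antennas,
    and ToF across subcarriers.
    """
    lam = C / F_C
    rx = np.exp(-2j * np.pi * d_rx * np.arange(N_RX) * np.cos(np.deg2rad(aoa_deg)) / lam)
    tx = np.exp(-2j * np.pi * d_tx * np.arange(N_TX) * np.cos(np.deg2rad(aod_deg)) / lam)
    sub = np.exp(-2j * np.pi * DELTA_F * np.arange(N_SUB) * tof_s)
    return np.kron(rx, np.kron(tx, sub))

half_wavelength = C / F_C / 2  # ASSUMED antenna spacing
v = steering_vector(60.0, 30.0, 20e-9, half_wavelength, half_wavelength)

gain_vs_freq_only = N_VIRTUAL // (N_RX * N_SUB)  # vs. receive antennas x subcarriers
gain_vs_1d_aoa = N_VIRTUAL // N_RX               # vs. receive antennas only
```

Under these assumptions, the stated factors of three and ninety are consistent with comparing 180 elements against 60 (two receive antennas times thirty subcarriers) and against 2 (receive antennas alone).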

As the 2D AoA spectrum provides spatial information of the multipath signals, we can leverage such information to remove the LoS signal and the signals reflected from the static environment for environment-independent 3D pose estimation. In particular, we propose to subtract the 2D AoA spectrum of the static environment from the ones we extract with human activities. The resulting 2D AoA spectrum mainly reflects the signals bounced off the human body and thus is independent of the signals reflected from the static environment. More specifically, we first calculate the 2D AoA spectrum of the static environment from multiple CSI packets. For example, Figure 5 (b1) shows the signals of the static environment, which include the LoS signal and the signal reflected from static objects (e.g., a wall). The corresponding spectrum is shown in Figure 5 (b2), in which we can distinguish the 2D locations of the LoS signal and the signals reflected from the wall. Note that the spectrum of the static environment should be periodically updated after detecting significant changes in the environment (e.g., the furnishings have been significantly altered) [58]. The 2D AoA spectrum under human activities is also generated. For example, in Figure 5 (a1), the person is walking towards the receiver while waving his hands. We can observe from the corresponding 2D AoA spectrum, i.e., Figure 5 (a2), that although the signal reflected from the human body is weaker than the LoS signal, we can still see the signals reflected from different parts of the human body. Next, we subtract the static spectrum from the spectrum under human activities to obtain the 2D AoA spectrum that only reflects the signals bounced off the human body. As shown in Figure 5 (c2), we can clearly observe the signals reflected from the person's limbs and torso, which are independent of the environment.

It is worth noting that the signals reflected from the human body may bounce off the walls again, resulting in secondary reflections to the receivers. For example, when a person shows up, there might be a signal that propagates from the transmitter to the person, is then reflected from the person to the wall, and is eventually received by the receiver after the wall reflection. Although such a signal cannot be removed by spectrum subtraction, it has little effect on human pose estimation as it is too weak after the second reflection.
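The spectrum subtraction described above can be illustrated with a toy example. This is a sketch under stated assumptions: the static spectrum is taken as the mean over several static packets, negative residuals are clipped to zero, and the Gaussian peaks are synthetic stand-ins for real spectra. The helper names `gaussian_peak` and `remove_static` are hypothetical.

```python
import numpy as np

def gaussian_peak(center, amplitude, size=180, width=4.0):
    """Synthetic spectrum peak at (azimuth, elevation) = center."""
    az, el = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    dist2 = (az - center[0]) ** 2 + (el - center[1]) ** 2
    return amplitude * np.exp(-dist2 / (2 * width ** 2))

def remove_static(activity_spectrum, static_packets):
    """Subtract the static-environment spectrum from an activity spectrum."""
    static_spectrum = np.mean(static_packets, axis=0)
    # ASSUMPTION: clip negatives, since a power spectrum cannot be negative.
    return np.clip(activity_spectrum - static_spectrum, 0.0, None)

# Static environment: a strong LoS peak and a weaker wall reflection.
static = gaussian_peak((90, 90), 1.0) + gaussian_peak((40, 120), 0.6)
static_packets = np.stack([static, static, static])

# Activity: the same environment plus a weak body reflection.
activity = static + gaussian_peak((60, 70), 0.2)

body_only = remove_static(activity, static_packets)
peak = np.unravel_index(np.argmax(body_only), body_only.shape)
```

In this toy case the strongest peak of the raw activity spectrum is the LoS path, while after subtraction the strongest remaining peak is the body reflection.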

The human body is specular with respect to WiFi signals, which means the human body acts as a reflector (i.e., a mirror) instead of a scatterer [62]. This is because the wavelength of the WiFi signal is much larger than the roughness of the surface of the human body. In contrast, the human body acts as a scatterer with respect to visible light as its wavelength is much smaller than the roughness of the surface of the human body. Depending on the orientation of the human body, some WiFi signals may be reflected towards the receiver, while some may be reflected away from the receiver. As a result, the 2D AoA spectrum derived from a single WiFi packet can only capture a small subset of body motions and may miss the majority of the motions. To resolve this issue, we combine multiple 2D AoA spectrums derived from multiple CSI packets to capture the motions of different parts of the human body. As shown in Figure 6, the first spectrum only captures the upper part of the body (e.g., two arms) and other parts are missing. The second spectrum only captures the middle part of the body (e.g., the torso), whereas the third one only has information about the legs. This inspired us to take advantage of multiple WiFi packets to describe the full-body movements, as shown in the last spectrum in Figure 6. It is worth noting that the last spectrum is an intuitive example obtained by superimposing the previous three spectrums. In our system, we let the deep learning networks learn such information from multiple spectrums derived from multiple packets. In particular, we take a sequence of packets (i.e., 100 packets) as input to estimate one human pose.

To obtain the 3D information from the 2D AoA spectrum, we leverage multiple receivers located at different positions, as shown in Figure 7. Each receiver is equipped with an L-shaped antenna array and can be used to extract one 2D AoA spectrum. By combining the 2D spectrums from multiple receivers, we are able to recover 3D information of the human pose. As the complexity of the 3D human pose is very high, we leverage deep learning models instead of analytic kinematic models to infer the 3D human pose based on the 2D AoA spectrums.

The deep learning framework of GoPose is illustrated in Figure 8. After 2D AoA estimation, we obtain a spectrum with dimensions of 180 × 180, as we set the range of azimuth and elevation to [0, 180] degrees with a resolution of one degree. Our system also utilizes multiple receivers (e.g., 4 receivers) to capture the motions of the user from different angles. We concatenate the spectrums from the four receivers and derive a tensor with dimensions of 180 × 180 × 4. Moreover, we need to combine multiple spectrums to capture the full-body movements. We thus concatenate 100 packets of each receiver to form a 180 × 180 × 400 matrix, and we denote such a matrix as x_t, which means the input data at time t. Then, the whole sequence of input data can be denoted as [x_1, x_2, ..., x_t]. Each input illustrates the spectrum distribution of one snapshot of the moving user in the physical space, and the continuous input data stream describes how the spectrum varies with human activities. To fully understand the input data stream, we extract both the spatial features from each input x_t and the temporal dependencies between x_i and x_j.
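The input assembly described above can be sketched as a reshape of per-receiver, per-packet spectrums into one 180 × 180 × 400 tensor. The ordering of the 400 channels (receiver-major, then packet) and the helper name `build_input` are assumptions; the text only fixes the final dimensions.

```python
import numpy as np

N_AZ, N_EL = 180, 180   # azimuth and elevation bins
N_RX, N_PKT = 4, 100    # receivers and packets per pose

def build_input(spectra):
    """Stack spectrums into one input x_t.

    spectra: array of shape (N_RX, N_PKT, N_AZ, N_EL).
    Returns an array of shape (N_AZ, N_EL, N_RX * N_PKT), where channel
    r * N_PKT + p holds packet p of receiver r (ASSUMED ordering).
    """
    stacked = spectra.reshape(N_RX * N_PKT, N_AZ, N_EL)
    return np.moveaxis(stacked, 0, -1)

spectra = np.zeros((N_RX, N_PKT, N_AZ, N_EL), dtype=np.float32)
spectra[2, 5, 10, 20] = 1.0   # marker: receiver 2, packet 5, bin (10, 20)
x_t = build_input(spectra)
```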

In particular, we first adopt CNN in our deep learning model to extract spatial features from x_t . In our system, the spatial features could be the positions of different parts of the body in the 2D AoA spectrum. More specifically, we utilize stacked six-layer CNNs and use 3D filters for each CNN layer. After the CNN layer, we adopt a batch normalization layer to standardize the inputs to a network and speed up the training. Then, we add a rectified linear unit (ReLU) to add non-linearity. After that, max pooling is applied to down-sample the features. Also, a dropout layer is added to prevent overfitting.

Besides the spatial features, the input data also contains temporal features as the 3D skeleton is dynamic and the movement is consecutive. Recurrent neural networks (RNN) are useful as they can model complex temporal dynamics of sequences. Compared with original RNNs, Long Short-Term Memory (LSTM) is more capable of learning long-term dependencies. After the CNNs, we obtain a sequence of the feature vector, which is then fed into the LSTM. In particular, we utilize a two-layer LSTM for temporal modeling.
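To make the temporal modeling concrete, the following is a minimal NumPy sketch of a two-layer LSTM forward pass over a sequence of feature vectors, followed by a linear layer that regresses 3D coordinates for 14 joints. It is not the authors' Keras implementation: the weights are random, the feature size of 64 and the linear output layer are assumptions, and the hidden size of 256 and the 14 joints are taken from values reported elsewhere in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(x_seq, W, U, b):
    """Run one LSTM layer over a sequence.

    x_seq: (T, D) inputs; W: (D, 4H); U: (H, 4H); b: (4H,).
    Returns the hidden state at every time step, shape (T, H).
    """
    T, H = x_seq.shape[0], U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.empty((T, H))
    for t in range(T):
        z = x_seq[t] @ W + h @ U + b
        i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def init_lstm(rng, d_in, hidden, scale=0.1):
    """Random parameters for illustration only (no training here)."""
    return (rng.normal(0, scale, (d_in, 4 * hidden)),
            rng.normal(0, scale, (hidden, 4 * hidden)),
            np.zeros(4 * hidden))

rng = np.random.default_rng(0)
T, FEAT, HIDDEN, JOINTS = 5, 64, 256, 14     # FEAT is an assumed feature size

features = rng.normal(size=(T, FEAT))        # stand-in for the CNN outputs
h1 = lstm_layer(features, *init_lstm(rng, FEAT, HIDDEN))
h2 = lstm_layer(h1, *init_lstm(rng, HIDDEN, HIDDEN))

W_out = rng.normal(0, 0.1, (HIDDEN, JOINTS * 3))   # ASSUMED linear output layer
poses = (h2 @ W_out).reshape(T, JOINTS, 3)         # one 3D skeleton per time step
```

Because the hidden state at step t depends on the cell state carried from step t-1, each predicted skeleton is conditioned on the preceding frames, which is the property the text relies on for consecutive movements.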

Note that we use the proposed deep learning framework to estimate the locations of 14 key points/joints of the human body. These key points/joints include head, spine, left/right shoulder, left/right elbow, left/right wrist, left/right hip, left/right knee, and left/right ankle. We use a 3D skeleton composed of those key points/joints to represent the 3D human pose.

Fig. 9. The L-shaped antenna array at each receiver.


Devices. We conduct the experiments with five laptops (one transmitter and four receivers). Each laptop runs Ubuntu 14.04 and is equipped with an Intel 5300 wireless NIC connected to three antennas. The transmitter is equipped with three linearly-spaced antennas, and each receiver has an L-shaped antenna array as shown in Figure 9. Linux 802.11 CSI tools [8] are used to extract CSI measurements from 30 subcarriers for each packet. The frequency band of the WiFi channel is 5.32 GHz with 40 MHz bandwidth. The default packet rate of our system is set at 1000 packets per second. We utilize a Microsoft Kinect 2.0 [27] to record the ground truth of the 3D human pose, and the sampling rate of the Kinect 2.0 is set at 10 Hz. We use the network time protocol (NTP) to ensure synchronization among all the receivers and the Kinect 2.0.

Environments. We evaluate our system in three real-world environments including a living room (4m × 4m), a dining room (3.6m × 3.6m), and a bedroom (4m × 3.8m). As shown in Figure 10, each room includes different furniture such as a TV, sofa, table, bed, stove, etc. Figure 10 also shows the detailed deployments of all the devices in each environment. Note that the location of the furniture in each environment may change during the long-term experiments. If not specified, the default distance between two adjacent receivers is 2.5m. When the user is performing activities, she/he is also walking around freely in the area formed by the receivers. We place a Microsoft Kinect 2.0 to capture the participants and use the frontal view data of the Kinect without occlusion as the ground truth. This means that the Kinect can always capture the person in the line of sight.
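The packet rate and the Kinect sampling rate reported under Devices imply how many CSI packets accompany each ground-truth frame. The sketch below works through that arithmetic and checks it against the sample counts reported in the data collection. The non-overlapping, index-aligned windowing in `packets_for_frame` is an assumption for illustration; the paper relies on NTP timestamps for the actual alignment.

```python
PACKET_RATE_HZ = 1000   # default packet rate from the text
KINECT_RATE_HZ = 10     # Kinect 2.0 sampling rate from the text

packets_per_frame = PACKET_RATE_HZ // KINECT_RATE_HZ

def packets_for_frame(frame_index):
    """Indices of the CSI packets paired with one skeleton frame.

    ASSUMPTION: consecutive, non-overlapping windows starting at packet 0.
    """
    start = frame_index * packets_per_frame
    return range(start, start + packets_per_frame)

# Reported dataset size: 676,200 skeleton frames and 67,620,000 CSI packets.
SKELETON_FRAMES = 676_200
expected_packets = SKELETON_FRAMES * packets_per_frame
```

The resulting 100 packets per frame matches the 100-packet input sequence used to estimate one pose.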

Model Setting. We implement stacked six-layer CNNs in our system and use the 3D convolution operation for both the 2D AoA spectrum and the CSI measurements. The numbers of convolutional filters in the six CNN layers are 64, 128, 256, 128, 64, and 1, respectively. The dropout rate is set to 0.1. In the LSTM, the number of hidden states is set to 256 and the dropout rate is 0.1. The hyperparameters Q_p and Q_h in the loss function are 0.63 and 0.37, respectively. The networks are implemented in Keras.

Data Collection. We recruit 10 volunteers (6 males and 4 females) of various heights, weights, and ages for our experiments. To evaluate the system performance, each volunteer is asked to perform both exergaming activities, following video tutorials, and various everyday activities at will, without any specific instruction, while walking around. The exergaming activities include muscle-strengthening activities, balance and stretching exercises, aerobic exercises, and dancing. The everyday activities include lifting arms, waving hands, walking, jogging, using a smartphone, etc. The data collection spans one month. We collect 676,200 samples of 3D skeleton frames and the corresponding 67,620,000 WiFi CSI packets in 3 different environments. The data set includes both one-person and two-person data. We split the data from the 10 people into two non-overlapping sets: a training set with 8 people and a test set with the other 2 people. This ensures that the training and test sets come from different people. We perform leave-one-person-out cross-validation on the training set, where each fold uses 7 people for training and 1 person for validation. Thus, we optimize the model parameters based on 8-fold cross-validation. Finally, we report the overall system performance using the test set and the trained model.

Evaluation Metric. We use the joint localization error as the evaluation metric. It is defined as the Euclidean distance between the predicted joint location and the ground truth. Note that we evaluate the 14 key points/joints mentioned in Section 4.5.
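The evaluation metric can be sketched in a few lines. The example below computes the per-joint Euclidean distance between a predicted and a ground-truth skeleton over the 14 joints listed earlier. The joint identifiers, the dictionary layout, and the use of centimeters are assumptions chosen for illustration, and the coordinates are synthetic.

```python
import math

# The 14 key points/joints evaluated in the paper (identifier spelling is ours).
JOINTS = [
    "head", "spine",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

def joint_localization_errors(predicted, ground_truth):
    """Return the Euclidean distance per joint between two 3D skeletons.

    Each skeleton maps a joint name to an (x, y, z) tuple in the same unit.
    """
    return {j: math.dist(predicted[j], ground_truth[j]) for j in JOINTS}

def mean_joint_error(predicted, ground_truth):
    """Average the per-joint errors over all 14 joints."""
    errors = joint_localization_errors(predicted, ground_truth)
    return sum(errors.values()) / len(errors)

# Synthetic example in centimeters: only the head estimate is displaced.
ground_truth = {j: (0.0, 0.0, 0.0) for j in JOINTS}
predicted = dict(ground_truth)
predicted["head"] = (3.0, 4.0, 0.0)

errors = joint_localization_errors(predicted, ground_truth)
overall = mean_joint_error(predicted, ground_truth)
```

Here the head error is 5 cm (a 3-4-5 triangle) and all other joints have zero error, so the mean over the 14 joints is 5/14 cm.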

We first show the overall performance of our system and compare it with the results of using CSI measurements as the input to the deep learning framework (i.e., the CSI-based approach). Figure 11 shows that GoPose can accurately track the 3D pose of human activities with an error of a few centimeters. Moreover, GoPose significantly outperforms the CSI-based approach. In particular, the median joint estimation error of GoPose is 4.3cm, whereas it is 8.8cm for the CSI-based approach. Also, the 80th-percentile joint estimation error is around 6.8cm for GoPose, while it is more than 15.6cm for the CSI-based approach. GoPose significantly outperforms the CSI-based approach because the extracted 2D AoA spectrums effectively capture the spatial information of the human body and are environment-independent, whereas CSI measurements provide no spatial features and are also affected by the background environment.

Table 1 reports the joint localization error for each joint for both GoPose and the CSI-based approach. The overall localization error of GoPose is only 4.7cm, whereas it is 10.1cm for the CSI-based approach. Additionally, the average joint localization error of GoPose ranges from 3.0cm to 8.4cm, while it ranges from 6.6cm to 13.6cm for the CSI-based approach. These results again show the significantly better performance of GoPose. The error distribution of each joint of GoPose is shown in Figure 11(b) with the blue boxplot (i.e., w/ activities). All the medians are less than 8cm and most errors have a relatively small range. Moreover, we observe that the median and the range of the error of spine movements are much smaller than those of elbows, hands, and other joints. This is mainly because the torso has the largest reflection area among the different joints, whereas the reflection of the WiFi signal from the arms is much weaker because of their smaller reflection areas. The accuracy of the joints with smaller reflection areas could thus potentially be improved by increasing the signal transmission power or by using directional antennas. We also show the error distribution of each joint when the user is not performing any activities. As we can see from the magenta boxplot (i.e., w/o activities) in Figure 11(b), all the medians are less than 0.4cm and all errors have a very small range. This is because it is relatively easier to track a stationary human body with few degrees of freedom.
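The summary statistics quoted above (median and 80th-percentile error) can be derived from a list of per-sample joint errors. The sketch below uses the nearest-rank percentile definition, which is an assumption since the paper does not state which definition it uses, and the error values are synthetic rather than taken from the paper's data.

```python
import math
import statistics

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (integer p in 1..100) by the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Synthetic joint localization errors in centimeters (illustrative only).
errors_cm = [2.0, 3.5, 4.3, 5.1, 6.8, 9.0, 4.0, 3.0, 7.2, 4.6]

median_cm = statistics.median(errors_cm)
p80_cm = percentile_nearest_rank(errors_cm, 80)
```

For these ten synthetic values, the 80th percentile is the 8th smallest value (6.8) and the median is the mean of the 5th and 6th smallest values (4.3 and 4.6). Reading the same two points off an empirical CDF gives the kind of comparison reported for GoPose and the CSI-based approach.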

To better visualize the performance of GoPose, we present the 3D human skeletons constructed while the user is performing various activities. Figure 12 shows four examples of those constructed 3D skeletons. For each subfigure (e.g., Figure 12(a)), the first row shows the time series of video frames recorded by the RGB camera for visual reference, and the second row shows the ground truth skeletons recorded by Kinect. Figure 12(a), (b), and (c) show the user performing different activities while walking. Figure 12(d) shows the user randomly walking around. In addition, we use red solid rectangles to highlight the mispredicted and distorted body parts. For example, the 2nd frame in Figure 12(a) has an inaccurate construction of the subject's left arm. We can also see an incorrect left arm in the 2nd frame in Figure 12(b). There are also a few slight deformations in Figure 12(d). Nevertheless, it is easy to observe that the majority of the 3D poses estimated by GoPose are highly accurate. These results also demonstrate that the proposed system can accurately construct 3D moving human poses using WiFi signals.

To evaluate the effect of the NLoS scenario on the performance of our proposed system, we conduct experiments in two adjoining rooms separated by a wall. In particular, the WiFi transmitter and the person are in different rooms. Thus, the signals propagated to the human body and the signals reflected by the human body are obstructed by the wall. Note that we perform static environment removal in the evaluation. We utilize the trained model under the LoS scenario to test the performances of both LoS and NLoS as our system is environment-independent. The results of both LoS and NLoS scenarios are shown in Figure 13 . We can observe that the average joint localization error under the LoS scenario is 4.7cm, whereas the error under the NLoS scenario is slightly higher (i.e., 5.3cm). Such a result demonstrates that our system can construct 3D moving human poses with high accuracy even under the NLoS scenario as the WiFi signals can penetrate obstacles. It also shows that our system can apply the deep learning model trained in LoS conditions to the NLoS scenarios without re-training. Therefore, GoPose is applicable to a wider range of applications, where the computer vision-based approaches could not work due to lack of line of sight or under poor lighting conditions.

We have demonstrated in the overall performance subsection that our system is environment-independent: the performance stays the same when the location of the furniture is changed. In this subsection, we further study the system performance under more challenging scenarios. In particular, we use the data collected in one environment (e.g., living room or dining room) to train our system, and then evaluate the performance when the system is operating in a different environment (e.g., bedroom). Note that we also perform static environment removal in this evaluation. Figure 14 shows the results when the system operates in the bedroom, living room, and dining room, respectively, after being trained in a different environment. As shown in Figure 14, the average joint localization errors are 4.9cm, 5.2cm, and 4.8cm for the bedroom, living room, and dining room, respectively. Although the testing environments are not seen during the training phase, the joint localization errors remain low. This is because our learning model relies on the 2D AoA spectrums of the signals reflected only by the human body, which are independent of the background environment. Our system is thus environment-independent and can be trained in one environment and then operate in new environments without additional training.

Fig. 16. System performance under different packet rates when the system is trained with the default packet rate.

We next study the impact of the distance between the WiFi devices on the performance of our system. Here the distance is defined as the distance between two adjacent receivers and it may be adjusted by users according to the size of the room. We use the model trained at the default distance to evaluate the performance at different distances including 2.5m, 3m, and 3.5m. Figure 15 shows that the corresponding average errors are 4.7cm, 5.1cm, and 5.8cm, respectively. We can observe that system performance improves when the distance is reduced. This is because a shorter propagation distance leads to higher received signal strength. The results also show that the proposed system works well in the range of a typical room at various distance setups.

For our evaluation, the default packet rate is set at 1000pkts/s. We further study the impact of different packet rates on the performance of our system. We set the packet rate at 250pkts/s, 500pkts/s and 1000pkts/s, respectively. As shown in Figure 16 , the average errors for 250pkts/s, 500pkts/s and 1000pkts/s are 5.7cm, 5.3cm and 4.7cm, respectively. We can observe that a higher packet rate slightly improves system performance. The reason is twofold: first, a high packet rate can easily recover the high-speed movements of the body part; second, we can combine more packets to better represent the full-body movements of a user when using a higher packet rate. Still, GoPose works relatively well even under low packet rates (e.g., 250pkts/s).

We next evaluate the impact of subject diversity on the performance of our system, since the system could be used by a diverse set of users. We have a training set with 8 people and we conduct the evaluation using leave-one-person-out cross-validation on these people. Thus, each time we use 7 people for training and 1 person for validation. The results are shown in Figure 17. For example, the result of subject 1 means that the model was trained with the data from subject 2 to subject 8 and was validated with the data of subject 1. We can observe that all joint localization errors are less than 4.9cm, and there are no significant differences among these subjects. The average error of the leave-one-person-out cross-validation is only 4.6cm. This demonstrates that our system can achieve high accuracy when used by different users and that the model is not overfitting.
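The subject split and the leave-one-person-out folds described above can be sketched as follows. Subjects are identified by the integers 1 to 10, and holding out subjects 9 and 10 as the test set is an assumption made for illustration; the paper does not say which two people form the test set.

```python
def leave_one_person_out(subjects):
    """Yield (training_subjects, validation_subject) pairs, one per subject."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

all_subjects = list(range(1, 11))   # 10 volunteers
train_pool = all_subjects[:8]       # 8 people used for cross-validation
test_subjects = all_subjects[8:]    # 2 people never seen during training (assumed IDs)

folds = list(leave_one_person_out(train_pool))
for train, val in folds:
    # Each fold trains on 7 people and validates on the remaining one.
    print(f"validate on subject {val}, train on subjects {train}")
```

This produces 8 folds. The first fold validates on subject 1 and trains on subjects 2 to 8, which matches the example given for Figure 17.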

In this subsection, we focus on the impact of multiple users by asking two users to walk and perform activities simultaneously. We first detect the number of users in the environment by taking the 2D AoA spectrum as input to a simple CNN classifier. This works because multi-user scenarios produce signals reflected from multiple human bodies, which result in multiple body shapes in the 2D AoA spectrum. Thus, it is easy to recognize the number of users if the human bodies do not overlap. Meanwhile, separate model parameters are trained for each number of users. After we detect the number of users, we use the model with the corresponding parameters for pose estimation. In our evaluation, we test the multi-user scenario with two people in one room. In particular, if the system detects two people, we choose the two-user model (i.e., the model with two-user parameters) to estimate the poses. Figure 18 shows the average errors for each of the two users, i.e., user1 and user2. We observe that the overall errors are 6.0cm and 6.3cm for user1 and user2, respectively. These results are slightly worse than the overall performance of the system. This is because multiple users lead to complex signal reflections, which result in slightly larger errors. We believe that with more WiFi devices, the complex signal reflections could be resolved better. Increasing the number of WiFi devices can thus improve the system performance under multi-user scenarios. Still, our current system works well when two users perform activities simultaneously in the home environment.
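The two-stage procedure described above (count the users, then apply the model trained for that count) amounts to a simple dispatch. The sketch below illustrates only the control flow: `select_and_estimate`, the stub classifier, and the stub pose models are hypothetical placeholders, not the paper's CNN classifier or pose networks.

```python
def select_and_estimate(spectrum, count_users, pose_models):
    """Detect the number of users, then run the pose model trained for that count.

    count_users: callable mapping a 2D AoA spectrum to an integer user count.
    pose_models: dict mapping a user count to a callable returning one
                 skeleton (list of 14 (x, y, z) tuples) per user.
    """
    n_users = count_users(spectrum)
    model = pose_models.get(n_users)
    if model is None:
        raise ValueError(f"no pose model trained for {n_users} user(s)")
    return n_users, model(spectrum)

# Stubs standing in for the trained networks (purely illustrative).
def stub_count_users(spectrum):
    return spectrum["n_body_shapes"]

def make_stub_model(n_users):
    def model(spectrum):
        return [[(0.0, 0.0, 0.0)] * 14 for _ in range(n_users)]
    return model

pose_models = {1: make_stub_model(1), 2: make_stub_model(2)}
n_users, skeletons = select_and_estimate(
    {"n_body_shapes": 2}, stub_count_users, pose_models
)
```

Raising an error for an unsupported user count is a design choice of this sketch; it reflects that the evaluation covers only one-user and two-user models.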

We perform ablation studies to evaluate the contribution of each component of our system by either removing or changing one component while the others remain the same. Figure 19(a) shows the impact of the number of receivers on the system performance. The errors with one, two, three, and four receivers are 8cm, 6cm, 5.2cm, and 4.7cm, respectively. The error of the one-receiver scenario is much larger than the others, and each additional receiver beyond two improves the performance only slightly. Thus, our system could achieve comparable performance with only 2 receivers in a typical smart home environment, where multiple WiFi transceivers may be deployed. Figure 19(b) shows how the system performance varies with the number of OFDM subcarriers. Compared to using only one subcarrier (an error of about 19cm), the system performance improves significantly (an error of 4.7cm) when we leverage all 30 OFDM subcarriers. This is because multiple subcarriers provide more frequency diversity and thus improve the 2D AoA resolution.

Fig. 19. Ablation studies.

As shown in Figure 19(c), using the 2D AoA achieves a much better performance, with an error of 4.7cm, whereas the 1D AoA achieves an error of 10.1cm. The reason is that the 2D AoA provides spatial information in one more dimension than the 1D AoA does. Figure 19(d) illustrates the effectiveness of multiple transmitting antennas. The errors of using 1, 2, and 3 transmitting antennas are 8.5cm, 6.8cm, and 4.7cm, respectively. This is because multiple transmitting antennas increase the spatial diversity at the transmitter. These ablation studies show that all of these components contribute to the system performance, with multiple OFDM subcarriers and the 2D AoA playing the most important roles.
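To compare the ablation results on a common scale, the reported errors can be converted into relative error reductions. The sketch below uses only the numbers quoted in the text (the one-subcarrier error is reported as "about 19cm"); the labels and the choice of relative reduction as the comparison measure are ours, not the paper's.

```python
# (error with the reduced configuration, error with the full system), in cm,
# as quoted in the ablation discussion.
ABLATIONS_CM = {
    "receivers (1 -> 4)": (8.0, 4.7),
    "OFDM subcarriers (1 -> 30)": (19.0, 4.7),
    "AoA dimension (1D -> 2D)": (10.1, 4.7),
    "transmitting antennas (1 -> 3)": (8.5, 4.7),
}

def relative_reduction(reduced_cm, full_cm):
    """Fraction of the reduced-configuration error removed by the full system."""
    return (reduced_cm - full_cm) / reduced_cm

# Sort components from largest to smallest relative error reduction.
ranking = sorted(
    ABLATIONS_CM,
    key=lambda name: relative_reduction(*ABLATIONS_CM[name]),
    reverse=True,
)

for name in ranking:
    print(f"{name}: {relative_reduction(*ABLATIONS_CM[name]):.1%}")
```

Under this measure the subcarrier and AoA-dimension ablations rank first and second, which agrees with the observation that multiple OFDM subcarriers and the 2D AoA contribute most.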


Although GoPose can achieve on-the-go 3D pose estimation for unseen activities with high accuracy in various scenarios, it still has some limitations.

Pose Estimation for Multiple Users. Currently, our system has only been tested with one or two people in a typical room environment. We acknowledge that pose estimation for a large number of users (i.e., more than 3 people) in one room is still a challenge. This is because the human bodies could overlap in such a crowded environment. The same issue also arises in computer vision-based approaches. In addition, a large number of users would lead to more complicated signal reflections.

User Study. There are a limited number of subjects participating in our experiments as it is challenging to find a large number of volunteers to conduct the experiments due to the impact of the COVID-19 pandemic. A more comprehensive user study over a larger number of users could better evaluate our system. We would like to include more users in our future work.

Long-Range Pose Estimation. Although we improve the performance of the 2D AoA with the spatial diversity of the transmitting antennas and the frequency diversity of the OFDM subcarriers, the sensing range of our system could still be limited. As the distance becomes greater, it becomes more difficult for the system to distinguish between two objects with a fixed spatial resolution.

This paper presents GoPose, a 3D skeleton-based human pose estimation system that offers on-the-go pose tracking for unseen activities in a home environment. It analyzes the WiFi signal bounced off the human body for 3D pose estimation and can reuse the WiFi devices that already exist at home for mass adoption. In the GoPose system, the 2D AoA spectrum of the signals reflected from the human body is leveraged to locate different parts of the human body as well as to enable environment-independent sensing, while deep learning is incorporated to model the complex relationship between the 2D AoA spectrums and the 3D skeletons of the human body for 3D pose tracking. We evaluate GoPose in different home environments with various activities performed by multiple users. The evaluation shows that GoPose is environment-independent and is highly accurate in constructing 3D human poses for mobile users. Results also show that GoPose achieves around 4.7cm of accuracy under various scenarios including tracking unseen activities and under NLoS scenarios.

Pose-on-the-Go: Approximating User Pose with Smartphone Sensor Fusion and Inverse Kinematics

Simo: Interactions with distant displays by smartphones with simultaneous face and world tracking

GravitySpace: tracking users and their poses in a smart room using a pressure-sensing floor

Realtime multi-person 2d pose estimation using part affinity fields

Finexus: Tracking precise motions of multiple fingertips using magnetic sensing

Shoulder and elbow joint angle tracking with inertial sensors

Fruit Ninja VR

Tool release: Gathering 802.11n traces with channel state information

Three-dimensional antennas array for the estimation of direction of arrival. IET microwaves, antennas & propagation

Mask r-cnn

Proactive sensing for improving hand pose estimation

Acquiring in situ training data for context-aware ubiquitous computing applications

Towards 3D human pose construction using wifi

End-to-end recovery of human shape and pose

Leveraging wearables for steering and driver tracking

Adam: A method for stochastic optimization

Spotfi: Decimeter level localization using wifi

Low-complexity estimation of 2D DOA for coherently distributed sources

Id-match: A hybrid computer vision and rfid system for recognizing individuals in groups

Improved DFT algorithm for 2D DOA estimation based on 1D nested array motion

Human sensing using visible light communication

Dynamic-music: accurate device-free indoor localization

Push the limit of WiFi based localization for smartphones

Monitoring vital signs and postures during sleep using WiFi signals

Tracking vital signs during sleep leveraging off-the-shelf wifi

Vnect: Real-time 3d human pose estimation with a single rgb camera

Kinect 2 for Windows

RecoFit: using a wearable sensor to find, recognize, and count repetitive exercises

Towards accurate multi-person pose estimation in the wild

3d human pose estimation in video with temporal convolutions and semi-supervised training

Whole-home gesture recognition using wireless signals

Inferring motion direction using commodity wi-fi for interactive exergames

Smartphone based user verification leveraging gait recognition for mobile healthcare systems

User verification leveraging gait recognition for smartphone enabled mobile healthcare systems

Liquid Level Sensing Using Commodity WiFi in a Smart Home Environment

Tracking free-form activity using wifi signals

Winect: 3D Human Pose Tracking for Free-form Activity Using Commodity WiFi

Multiple emitter location and signal parameter estimation

I am a smartwatch and i can track my user's arm

Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion

Commodity WiFi Sensing in 10 Years: Status, Challenges, and Opportunities

WiFinger: Leveraging commodity WiFi for fine-grained finger gesture recognition

Enabling fine-grained finger gesture recognition on commodity wifi devices

MultiTrack: Multi-user tracking and activity recognition using commodity WiFi

Sensing fruit ripeness using wireless signals

Human Pose Estimation Technology

Motion reconstruction using sparse accelerometer data

Ultraleap. 2021. Leap Motion Controller

Comparison of a Deep Learning-Based Pose Estimation System to Marker-Based and Kinect Systems in Exergaming for Balance Training

RF-kinect: A wearable RFID-based approach towards 3D body movement tracking

Person-in-WiFi: Fine-grained person perception using WiFi

Joint 2D-DOD and 2D-DOA estimation for coprime EMVS-MIMO radar

E-eyes: device-free location-oriented activity identification using fine-grained wifi signatures

Pair-matching method by signal covariance matrices for 2D-DOA estimation

Back-Hand-Pose: 3D Hand Pose Estimation for a Wrist-worn Camera via Dorsum Deformation Network

A theoretical analysis of wireless localization using RF-based fingerprint matching

Indoor localization using improved rss-based lateration methods

MultiSense: Enabling multi-person respiration sensing with commodity wifi

mmEye: Super-resolution millimeter wave imaging

Wall++ room-scale interactive and context-aware sensing

Microsoft kinect sensor and its effect

Through-wall human pose estimation using radio signals

RF-based 3D skeletons

Pose Estimation to Empower Your Business

A study of localization accuracy using multiple frequencies and powers