key: cord-0624224-v88zdvua authors: Chatterjee, Soumyajit; Singh, Arun; Mitra, Bivas; Chakraborty, Sandip title: Accoustate: Auto-annotation of IMU-generated Activity Signatures under Smart Infrastructure date: 2021-12-08 journal: nan DOI: nan sha: fb52d7c870d30fb9ce67d9f308e6ddda239ab97c doc_id: 624224 cord_uid: v88zdvua
Human activities within smart infrastructures generate a vast amount of IMU data from the wearables worn by individuals. Many existing studies rely on such sensory data for human activity recognition (HAR); however, one of the major bottlenecks is their reliance on pre-annotated or labeled data. Manual human-driven annotations are neither scalable nor efficient, whereas existing auto-annotation techniques heavily depend on video signatures. Still, video-based auto-annotation needs high computation resources and raises privacy concerns when the data from a personal space, like a smart home, is transferred to the cloud. This paper exploits the acoustic signatures generated from human activities to label the wearables' IMU data at the edge, thus mitigating both the resource requirements and the data privacy concerns. We utilize acoustic-based pre-trained HAR models for cross-modal labeling of the IMU data even when two individuals perform simultaneous but different activities under the same environmental context. We observe that non-overlapping acoustic gaps exist with a high probability during the simultaneous activities performed by two individuals in the environment's acoustic context, which helps us resolve the overlapping activity signatures to label them individually. A principled evaluation of the proposed approach on two real-life in-house datasets, further augmented to create a dual-occupant setup, shows that the framework can correctly annotate a significant volume of unlabeled IMU data from both individuals with an accuracy of 82.59% (±17.94%) and 98.32% (±3.68%), respectively, for a workshop and a kitchen environment.
Applications on Human Activity Recognition (HAR) are essential for developing any smart infrastructure, be it a smart home, a smart workplace, or a smart factory. There have been various endeavors on HAR [1] using time-series data captured from sensors like Inertial Motion Units (IMU) attached to individuals in different forms, like a smartwatch or a smart band. However, these models primarily rely on supervised training that needs a huge volume of labeled data [2]. Traditional approaches for data annotation of IMU streams use a human-in-the-loop, which not only produces noisy and unreliable labels [3] but is also a significantly costly and time-consuming process [4] that does not scale. To reduce human participation during data annotation, approaches like Active Learning [5] and Experience Sampling (ESM) [6] have been adopted in several recent studies. However, these methods heavily depend on the choice of annotators [7] and need a partially-labeled dataset for bootstrapping. Moreover, as human activities form a continuous time-series process, it is necessary to correctly identify the IMU boundaries where the subject changes her activity and to annotate them accordingly. Any human-based annotation is likely to fail for such precise labeling of the IMU data. The problems escalate for scenarios like a smart home designed for elderly assistance [4], where human-based annotation is not only challenging but also infeasible to a certain extent.
Hence, automating the data annotation of IMU streams is necessary for such scenarios. Addressing these challenges, attempts have been made in the literature [8]-[10] to develop frameworks that can automatically annotate the IMU data streams for HAR by relying on an auxiliary modality present in the environment. The choice of the auxiliary modality is crucial in these cases, and most of these frameworks use videos as the preferred option. Although video provides a rich source of granular information, this modality is privacy-intrusive, costly, and hardly conducive to infrastructures like smart homes. In this paper, we leverage acoustic signals as the auxiliary modality, which can be captured in the environment through virtual personal assistants (VPA) and smart speakers. Audio as the auxiliary modality provides multiple advantages over video. (a) There exist sophisticated models, like [11], built over rich publicly available datasets such as YouTube-8M [12] or UrbanSound8K [13], which can efficiently recognize human activities with high accuracy, even when multiple activities are performed simultaneously. (b) Importantly, audio processing is much more lightweight than the complex video processing techniques employing large deep learning models [14]. This paves the way for on-device audio processing, conducted at the edge of the smart infrastructure, preserving data privacy. In an initial work [15], we developed a platform called LASO that can annotate user activities by exploiting the acoustic signals generated from those activities. However, LASO can correctly annotate the data only when there is a single user in the environment. Additional challenges arise for annotating IMU data in the presence of more than one user while using audio as the auxiliary modality. First, when more than one user performs activities simultaneously in the same environment, the audio signals from the respective activity sources get convoluted. Although audio-based HAR models [11] return a set of activities with corresponding confidence scores, it is challenging to separate the activity boundaries directly from such models. This challenge mainly stems from external non-human noises, which force the model to detect spurious activities and activities with low confidence. In addition, IMU streams produce highly fluctuating and noisy signals, which make it challenging to extract the activity boundaries directly from the IMU. Finally, in the presence of more than one user performing multiple activities simultaneously, the acoustic signal alone cannot determine who is doing what, which is essential for annotating the IMU data stream captured from the individual users. Owing to these challenges, we extend LASO in this paper to support dual-user activity annotation from acoustic signatures. Although a dual-user setup does not generalize the activity annotation problem to a complete multi-user setup, acoustic-based activity annotation over a dual-user setup is not straightforward and is the foremost step towards this generalization. Further, in a realistic environment, it is seldom the case that many users perform activities simultaneously within a common acoustic setup; the dual-user case can itself cover a significant portion of the use-cases [16]. In contrast to the existing works, the contributions of this paper are as follows.
(1) Unsupervised activity-to-user mapping in a dual-user setup: Accoustate exploits the acoustic patterns of activities in a dual-user setup to map an activity to a user through an unsupervised joint analysis of acoustic and IMU signals (Section IV). The crux behind the design of the proposed method, called Accoustate, is that the sounds produced by human activities are not continuous [17] and are generated from specific granular micro-activities. For example, in a smart kitchen, during the activity cooking, no sound may be produced when the user is carrying a frying pan. Indeed, when the frying pan is used to fry something, a distinctive sound will be produced.
(2) Unsupervised labeling of IMU data from a pre-trained acoustic-based HAR model: Accoustate intelligently utilizes an unsupervised nearest-neighbor model to annotate the activities associated with individual IMU change-points. For this purpose, we utilize pre-trained audio-based HAR models, which are lightweight and thus run on an edge device, preserving data privacy.
(3) Implementation, deployment, and testing of Accoustate: We have implemented and tested Accoustate over two different datasets, from a Workshop and a Kitchen environment, and benchmarked the resource consumption profile of the running modules to show the feasibility of deploying them over edge devices within the smart infrastructure. The results show that Accoustate generates annotated IMU data with an appreciable accuracy of 82.59% (±17.94%) and 98.32% (±3.68%), respectively, for the two environments, for the cases when a correct activity-to-user mapping could be done (10 out of 12 cases in the Workshop, and 3 out of 5 cases in the Kitchen) in a two-user simultaneous-activity setup (Section VIII).
Obtaining annotations for the collected data has been a prime challenge in the field of HAR. Typical locomotive and inertial sensors generate millions of instances in a day. Subsequently, a majority of the data generated by smart infrastructures becomes unusable because of the unavailability of proper annotation [4], [7]. Also, given such a huge volume of data generated by these smart infrastructures, the standard approach of human annotation becomes infeasible (both in terms of cost and time). To counter these challenges, several approaches like Active Learning [5] and Experience Sampling [6] have been adopted. These approaches reduce the overall data to be annotated by polling the user only when the system is not confident about the generated label. However, a major drawback of these approaches is their heavy dependence on the human-in-the-loop, which requires a proper choice of annotators [7] and can even be noisy in certain cases [3]. Furthermore, approaches like Active Learning need a small set of pre-labeled data to start with, and obtaining such datasets for complex ADLs can be challenging [21]. Similarly, for ESM-based approaches [22], the participants themselves may need to provide labels in situ. However, such a setup may not always be possible, especially in smart homes designed for assisting the elderly population. Understanding these challenges, a different approach that has been adopted in recent times is the development of automated annotation frameworks [8]-[10], [23]. Notably, most of these works use one or more auxiliary modalities that assist the framework in generating the labels for the unlabeled primary sensor modality.
In most cases, a typical choice for these approaches has been to use videos [9], [10] as the auxiliary modality, which provides very granular information regarding basic physical activities (like walking or standing) as well as specialized activities like fitness exercises. Additionally, video also provides rich information to perform cross-modal information association [24], albeit it is highly privacy-invasive and computationally expensive to process. Understanding these challenges, recent works like [11], [25] have pointed out different alternative modalities that can allow granular activity recognition in smart homes. Out of these, the acoustic context, in particular, can be used in a plug-and-play setup with diverse pre-trained models without using any external hardware [11] or human intervention. Interestingly, existing works like [15], [20] have used this idea for generating training data. However, these frameworks are designed explicitly for single-user scenarios and involve cross-modal detection of activities that may be difficult to map when multiple unlabeled IMU streams arrive in the system from more than one user along with a globally confounded audio stream. Additionally, the performance of frameworks like RecycleML [20] heavily depends on the availability of a small yet significant amount of bootstrapping data, which can be challenging to obtain in real-life scenarios. A summary comparison with the existing related works is shown in TABLE I. One of the major challenges in designing and testing Accoustate is the unavailability of sufficient data from a realistic smart infrastructure environment. Data from commercial solutions such as Samsung Connected Living are not publicly available. On the contrary, publicly available datasets typically collect data in a very controlled and time-constrained environment. For example, the CMU-MMAC dataset contains activity data from a smart kitchen environment; however, its primary focus has been on individual body parts' movement patterns while performing various micro-activities (for example, beating eggs, taking a fork, etc. while cooking). Further, participants are connected with multiple sensors at different body parts, hindering their free movement. Because of such constraints, natural and organic activity sequences, like bringing a frying pan from a distant shelf to the oven, get hampered. However, such activity sequences play a crucial role in designing Accoustate. Hence, we opt to rely on an in-house lab-scale smart environment, where individuals can perform the given tasks without external control and with minimum interference from the connected devices. We rely on a minimal setup where the IMU data is collected using a Moto 360 smartwatch (sampling rate = 50 Hz) worn on the preferred arm of the participant. A COTS smartphone is utilized to capture the audio generated from the environment (sampling rate = 44.1 kHz). The smartwatch was paired a priori with the smartphone over Bluetooth for inter-modality synchronization. We obtain the data from two different environments, namely Workshop and Kitchen, involving a total of 8 volunteers in this experiment. We collected the data in a single-user setup, where every volunteer was asked to perform a single activity at a time, independently and freely, without any external constraints.
In the Workshop environment, we involve 4 volunteers and ask them to separately perform two primary activities: (a) hammering a wooden plank or a metal pipe, and (b) cutting a wooden plank or a metal pipe using a saw. During this process, the participants organically perform other auxiliary micro-activities, like picking up nails or fitting the plank. Similarly, in the Kitchen environment, the other 4 volunteers are asked to separately conduct two primary activities: (a) chopping vegetables with a knife on a chopping board, and (b) cooking the vegetables in a frying pan or a cast iron wok. For both the Workshop and the Kitchen environments, we captured timestamped videos with a frame rate of 30 fps and used them to annotate the ground-truth activity labels. It can be noted that this data was collected during the design of LASO [15]. Unfortunately, due to the COVID-19 pandemic, we could not extend these experiments to a multi-user setup. Therefore, this paper applies a judicious data augmentation mechanism using a signal convolution technique to synthesize multi-user data from the available single-user activity data. We use the collected single-user datasets and augment them by time-synchronizing the IMU signals and convoluting the acoustic signals among volunteers to create a synthetic dataset that can effectively mimic a multi-user setup with two users performing simultaneous activities. Before creating the augmented dataset, we first noise-profile each audio file for noise reduction, while keeping the IMU data unfiltered. Next, we use Audacity [26] to convolute the time-series acoustic signals from two different volunteers performing two different activities, ensuring that the audio processing steps maintain the overall quality and standard of the output files. Next, for the IMU data and the ground-truth activity timings, we synchronize them against a global time reference by choosing the earliest time between the two volunteers in the combination.
Fig. 1: Accoustate Working Environment
As the IMU sampling rate is fixed (at 50 Hz) and the IMU data for both volunteers are synchronized, the final augmented dataset has an approximately equal number of IMU instances for each volunteer. In this context, it is important to note that this augmentation strategy only creates a virtual setup with two occupants and does not tamper with the existing nature of the two real-life datasets. Following this data augmentation strategy, we create a set of 12 (TABLE II) and 5 (TABLE III) unique virtual combinations of subject-activity pairs in the two setups, respectively. Let two users U_m and U_n perform two different primary activities p_i (denoted as U_m → p_i) and p_j (denoted as U_n → p_j), respectively, for a time period [0, T]. Let I_m(0, T) and I_n(0, T) be the unlabeled IMU data collected through the wearables from each of the two users (see Fig. 1), respectively, and A(0, T) be the audio signal for the entire duration [0, T] captured from a VPA deployed in the environment. The objective of Accoustate is to develop a framework to annotate the completely unlabeled IMU data I_m(0, T) and I_n(0, T) utilizing the acoustic information extracted from A(0, T). Our key idea is that there are granular auxiliary micro-activities, say {a_m^1, a_m^2, ...}, within a primary activity p_i, some of which do not generate a distinctive sound, thus producing a gap in the acoustic signal. Accordingly, we define the term "Acoustic Gap" as follows.
Definition 1 (Acoustic Gap): Let a user U_m perform a primary activity p_i over a period [0, T]. We define an acoustic gap as an intermittent time duration [t, t+Δ] (Δ ≥ 1s) when a different acoustic context is produced due to an interleaved auxiliary micro-activity a_m^k.
We aim to leverage the acoustic gaps extracted from A(0, T) during the primary activities {p_i, p_j} to opportunistically annotate the completely unlabeled IMU data I_m(0, T) and I_n(0, T). We consider that the label-space of the primary activities {p_i, p_j} is known and included in the label-space of a pre-trained audio-based HAR model like [11]. However, the user-to-activity mappings U_m → p_i and U_n → p_j are not known. Similar to previous works like [20], we also assume that the IMU and audio signals are time-synchronized using standard approaches like RTP or NTP [27], [28]. By exploiting the acoustic gaps, Accoustate first maps the primary activities {p_i, p_j} to the users {U_m, U_n} and then annotates the IMU data I_m(0, T) and I_n(0, T) with the corresponding primary activities {p_i → U_m, p_j → U_n}. As shown in Fig. 2, the entire framework is divided into three major stages. In Stage 1, Accoustate independently detects signal changes in the IMU streams I_u(0, T), u ∈ {m, n}, and the audio A(0, T) using unsupervised approaches. In Stage 2, Accoustate extracts the acoustic gaps from the signal change-points to map the primary activities to the corresponding users, i.e., p_i → U_m and p_j → U_n. Specifically, Accoustate relies on a pre-trained audio-based activity recognition module for identifying the primary activities {p_i, p_j}, which allows us to avoid human intervention and detect activities in an automated manner. However, a lack of an appropriate number of acoustic gaps and the presence of multiple acoustic sources may confuse Accoustate, and thus the framework may generate conflicting mappings. To resolve this, Accoustate applies a conflict resolution technique that allows it to judiciously map an activity label to a user over the entire duration [0, T]. Ultimately, in Stage 3, Accoustate collates all the information from the previous two stages to finally output the annotated IMU data for both users. The details of each of these steps follow. The first stage of Accoustate uses an unsupervised approach to detect the instances when the distributions of the input signals, both I_u(0, T) and A(0, T), show a change in their patterns, indicating that an activity change has happened in the environment for one of the users. The objective of this step is to find windows of duration [ν, η], 0 ≤ ν < η ≤ T, for each user U_u, such that each I_u(ν, η) corresponds to one of the activities from {p_u, a_u^1, a_u^2, ...}. 1) Calculating Change-Points: To achieve this, we first rely on the statistical change-point detection [29], [30] approach to evaluate the changes in the IMU data stream. Formally, a change-point represents a point in time where a time series or a stochastic process has changed its probability distribution, and change-point scores quantify these changes numerically. Accoustate computes the change-point scores to quantify the change in the IMU data stream considering the input from a tri-axial accelerometer. Say x_t represents the input from the tri-axial accelerometer at time t. Then, to compute the change-point scores, we create windows in the IMU data, such that a window X_t = {x_t, x_{t+1}, ..., x_{t+f-1}}, where f is the window size.
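This windowing and scoring step can be illustrated with a minimal sketch. Note that the α-relative Pearson divergence estimator of [31] is replaced here by a symmetric KL divergence between per-window Gaussian fits, which is only a simplified stand-in, and the window size f = 100 (2 s at 50 Hz) is a hypothetical choice, not the paper's setting.

import numpy as np

def sliding_windows(x, f):
    """Split a (T, 3) tri-axial accelerometer stream into consecutive
    non-overlapping windows X_t of length f (trailing remainder dropped)."""
    n = (len(x) // f) * f
    return x[:n].reshape(-1, f, x.shape[1])

def change_scores(windows):
    """Score the change between consecutive windows. Accoustate uses the
    alpha-relative Pearson divergence [31]; a symmetric KL divergence between
    per-window Gaussian fits is used here only as a simplified stand-in."""
    scores = []
    for a, b in zip(windows[:-1], windows[1:]):
        mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
        var_a, var_b = a.var(axis=0) + 1e-8, b.var(axis=0) + 1e-8
        kl_ab = 0.5 * (var_a / var_b + (mu_b - mu_a) ** 2 / var_b - 1 + np.log(var_b / var_a))
        kl_ba = 0.5 * (var_b / var_a + (mu_a - mu_b) ** 2 / var_a - 1 + np.log(var_a / var_b))
        scores.append(float((kl_ab + kl_ba).sum()))
    return np.array(scores)

# Example: 50 Hz accelerometer stream, hypothetical window size f = 100 (2 s)
imu = np.random.randn(6000, 3)      # placeholder for I_m(0, T)
mu = change_scores(sliding_windows(imu, f=100))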
Subsequently, we compute the change-point score μ_t between any two consecutive IMU windows X_t and X_{t+f} using α-relative Pearson divergence estimation (PE), following a procedure similar to that discussed in [31]. 2) Identifying the Actual Changes: Although the change-point scores quantify the amount of change in an IMU stream, they do not explicitly demarcate the actual event changes; instead, they give a relative score of the change in the distribution. Ideally, the actual activity changes within the IMU signal should produce relatively higher change-point scores. However, as the IMU data is unlabeled, we cannot determine an empirical threshold on the change-point score. To solve this problem, we apply an unsupervised approach, where we cluster the obtained change-point scores into two sets, one corresponding to the actual activity changes in the IMU data and the other containing the rest of the change-point scores. Thus, we perform k-means clustering (with k = 2, Fig. 3a) on all the scores obtained from pairwise consecutive IMU windows. Subsequently, we demarcate the set of scores belonging to the cluster with the higher mean as the actual activity change scores. For extracting the audio change-points, we split the audio signal into 1-second segments and compare consecutive segments to detect changes. However, measuring the change in the acoustic context is not straightforward, as the environmental acoustic context is a convoluted signal from different activities and other acoustic sources [32]. Therefore, existing methods like [33], which use techniques like Mel Frequency Cepstral Coefficients (MFCC), fail to locate the changes correctly in our context. It is known that MFCC becomes ineffective in the presence of multiple noise sources [34], especially for non-speech environmental acoustic signatures [35]. 1) Observing Changes in the Environmental Acoustic Context: Motivated by observations in recent works like [35], we instead measure the change between two consecutive audio segments using the Cross Power Spectral Density (CPSD) [36], which is obtained by performing a Fourier transform on the cross-correlation of the two signals. Since CPSD returns the power density across all the frequency bins present in both signals, we consider the sum of the absolute output values across all frequency bins to obtain the final CPSD value between two consecutive audio segments. In this context, an important observation is that, unlike the change-point scores computed for the IMU signatures, higher CPSD values indicate a lower chance of any change between two consecutive segments. As Fig. 4a indicates, the CPSD values are marginally lower for the majority of the frequency values during an acoustic gap (when an activity change occurs) compared to the scenario when both activities are being performed continuously. However, these are again only relative values, and we need to cluster them to demarcate the actual changes. The details follow. 2) Identifying the Actual Changes in the Acoustic Context: Similar to clustering the change-point scores for demarcating activity changes in the IMU data, here as well we cluster the CPSD values. However, a primary problem with acoustic signatures is the presence of different random noise components. Additionally, we observe that some acoustic signature changes are introduced by behavioral changes in performing the activity. For example, a user in a workshop environment may change how she holds the saw according to her ease, and depending on that, the sound generated while cutting the plank can also change.
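As a rough illustration, the CPSD-based comparison of consecutive one-second segments can be sketched as follows, assuming a mono audio array sampled at 44.1 kHz; the FFT segment length nperseg=1024 is an arbitrary choice for the sketch, not necessarily the setting used in the paper.

import numpy as np
from scipy.signal import csd

def segment_cpsd_scores(audio, fs=44100, seg_len_s=1):
    """Sum of |CPSD| across frequency bins for each pair of consecutive
    1-second audio segments; lower values hint at an acoustic change."""
    seg = fs * seg_len_s
    n = (len(audio) // seg) * seg
    segments = audio[:n].reshape(-1, seg)
    scores = []
    for a, b in zip(segments[:-1], segments[1:]):
        _, pxy = csd(a, b, fs=fs, nperseg=1024)  # cross power spectral density
        scores.append(float(np.abs(pxy).sum()))  # collapse frequency bins into one value
    return np.array(scores)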
Because of this noise and behavioral variability, clustering the CPSD values into only two sets can produce a huge number of uneventful change-points, which can negatively impact Accoustate's performance. To avoid this, we first obtain the optimal number of clusters for the CPSD values using the Silhouette score [37]. For this, we cluster the CPSD values into c clusters, where c ∈ [2, C], and choose the c with the maximum Silhouette score as the optimal number of clusters. Once the clustering is done, we mark the cluster with the minimum mean CPSD value as the cluster containing the actual activity changes.
VI. STAGE 2: ACTIVITY TO USER MAPPING
Accoustate maps the primary activities {p_i, p_j} to the users {U_m, U_n} in two steps. (a) Using I_u, u ∈ {m, n}, it first identifies the user U_u who might have taken a break in her primary activity during a time segment [ν, η], 0 ≤ ν < η ≤ T (thus producing an acoustic gap in A(ν, η) for that user). (b) In the next step, it uses A(ν, η) to obtain the activity labels from a pre-trained audio-based HAR model [11] and maps those activities to the individual users based on a timing analysis over the acoustic gaps. One of the major challenges in correlating the IMU and audio signals is that the change-points computed individually from them are not time-synchronized. Fig. 4b indicates that a difference is observed in the IMU signal whenever there is a change in the acoustic signal, albeit with a slightly smaller window compared to the acoustic change. This is because when the user stops performing the primary activity, the acoustic signature drops immediately, while the IMU signatures still record the transition. For example, when a user resumes using a saw, the acoustic signature captures this instantly; however, the IMU changes a bit earlier, when the user just picks up the saw. Therefore, Accoustate uses an opportunistic approach to exploit the acoustic gaps by combining the observations from both modalities. For this, we use the notion of an Exclusive Change, defined as follows.
Definition 2 (Exclusive Change): Say, at some time interval [ν, η], 0 ≤ ν < η ≤ T, we observe a change in I_m(ν, η) for user U_m. We define this change as an exclusive change if and only if the following two conditions are met. 1) ∃ a time interval [θ, ζ], 0 ≤ θ ≤ ν < η ≤ ζ ≤ T, where there is a change in A(θ, ζ). 2) For the entire time interval [θ, ζ], we do not observe any change in I_n(θ, ζ) for the other user U_n present in the environment.
Based on this definition, Fig. 5 shows an exclusive change for the user U_m over the IMU signal I_m. Exclusive changes indicate the presence of an acoustic gap. Once we determine these exclusive changes for the individual users, we identify the activity labels from the acoustic context and map them to the individual users as follows. The acoustic gaps during the exclusive changes help us find a unique mapping from the activity labels {p_i, p_j} to the users {U_m, U_n}. Let [β, γ] be a continuous time interval, and [ν, η], β ≤ ν < η ≤ γ, be an exclusive change detected for the user U_m from I_m(0, T). Let A_m(β, γ) and A_n(β, γ) be the pure acoustic signal components generated from the activities (primary or auxiliary) being performed by U_m and U_n, respectively, during the time interval [β, γ]. Further, consider that N(β, γ) is the environmental noise generated from non-human activities (like the sound of an AC, a dog barking, etc.). Then, A(β, γ) = A_m(β, γ) ⊕ A_n(β, γ) ⊕ N(β, γ), where ⊕ is the signal convolution operator.
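Definition 2 reduces to a small interval-containment check; the following is a minimal sketch assuming the Stage-1 change-points have already been grouped into (start, end) intervals on a common timeline, with all data structures and names being illustrative.

def exclusive_changes(imu_changes_m, imu_changes_n, audio_changes):
    """Return, for user U_m, the acoustic change intervals [theta, zeta] that
    qualify as exclusive changes (Definition 2): the interval covers an IMU
    change [nu, eta] of U_m, and U_n shows no IMU change anywhere within it.
    All intervals are (start, end) tuples in seconds."""
    exclusive = []
    for (nu, eta) in imu_changes_m:
        for (theta, zeta) in audio_changes:
            covers = theta <= nu and eta <= zeta
            conflict = any(not (e <= theta or zeta <= s) for (s, e) in imu_changes_n)
            if covers and not conflict:
                exclusive.append((theta, zeta))
                break
    return exclusive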
It can be noted that A(β, γ) should contain change-points near ν and η, as the primary activity for U_m changes at ν and η; therefore, A_m(β, ν + Δ_1), A_m(ν + Δ_1, η + Δ_2), and A_m(η + Δ_2, γ) should have different distributions in their Power Spectral Density (PSD). Here, Δ_1 and Δ_2 are small adjustment windows, as the IMU change-points and the acoustic change-points may not be perfectly time-synchronized. However, U_n does not change her activity during [β, γ], and therefore A_n(β, γ) should not ideally contain any change-points. Activity detection from acoustic context: We first detect the activities from the environmental acoustic signature A(β, γ) during the entire duration [β, γ]. As one of the main objectives of Accoustate is to minimize the human-in-the-loop, we rely on pre-trained models for audio-based activity recognition. Specifically, we adapt the model suggested in [11] for this purpose, which is pre-trained on the YouTube-8M [12] dataset and uses context information to filter out noisy activity labels effectively. For an input acoustic signature, the pre-trained model returns a set of detected activities with an associated confidence level indicating the model's confidence over the detected activities. Notably, the label space of [11] also contains the workshop and kitchen activities defined in our datasets. Separating activities: Let 𝒜(β, γ) be the set of activities returned by the pre-trained model [11] during the time interval [β, γ]. This set can be written as 𝒜(β, γ) = 𝒜_m(β, γ) ∪ {p_j} ∪ 𝒜_N(β, γ), where 𝒜_m(β, γ) is the set of activities (including the primary activity and the auxiliary activities) being performed by U_m, p_j is the single primary activity being performed by U_n during the entire duration [β, γ], and 𝒜_N(β, γ) is the set of non-human noisy activities. 𝒜_N(β, γ) is easily separable, as its elements do not belong to the target activity set. 𝒜_m(β, γ) should contain one primary activity p_i within the duration [β, ν + Δ_1] and also within [η + Δ_2, γ], and an auxiliary activity a_m within the duration [ν + Δ_1, η + Δ_2], as detected by the pre-trained acoustic-based activity recognition model [11]. Mapping the primary activities based on the IMU changes: To map the primary activities p_i and p_j to the corresponding users U_m and U_n, we now look into the change-points detected in I_m(β, γ) and I_n(β, γ). I_m(β, γ) should have a change (which is the exclusive change) within the duration [ν, η], whereas I_n(β, γ) should not contain any change-points. Consequently, we should observe a break in p_i within the duration [ν, η] (when the auxiliary micro-activity a_m was performed), but there will be no break in p_j. Thus, these two cases are easily separable based on the exclusive changes at U_m; therefore, Accoustate maps p_i to U_m and p_j to U_n unambiguously for the window [β, γ]. The above approach works well unless p_i = p_j, i.e., both U_m and U_n perform the same activity at the same time. However, it is typically a rare event for multiple users within a smart environment to perform the same activity simultaneously. Therefore, we argue that Accoustate can label the IMU data for the majority of the cases, which is also evident from the detailed experiments, as discussed next. We repeat the two steps mentioned above over the entire duration to obtain the activity mappings for all the exclusive changes across both users. Specifically, we first create lists of exclusive changes, E_m and E_n, for the users U_m and U_n, respectively. Mathematically, the contents of these lists can be defined as follows.
E_m = {[θ, ζ] : an exclusive change of U_m is observed within [θ, ζ]}, and E_n is defined analogously. (1)
Once these lists are obtained for each user, we separately get the activity mappings for each of the entries in E_m and E_n by querying the audio-based activity recognition model (choosing the activity label with the maximum confidence) with the audio segment corresponding to the given time interval [θ, ζ]. However, it can be noted that, due to the noise in the acoustic data as well as confusion within the acoustic-based activity recognition model, different exclusive changes from E_m may result in different activity labels for the other user U_n. Nevertheless, we consider that a user performs a single primary activity within the entire duration [0, T]. Therefore, to map a single primary activity label to the user U_n, we take the majority activity label over the list of exclusive changes E_m of the user U_m. Although this mapping seems straightforward, several challenges may appear while mapping the activity labels to the users. Notably, we observe that a critical condition may arise when there is a conflict and the same activity gets mapped to both users. This happens when both users have rarely performed auxiliary micro-activities, or when the acoustic context changes are caused by external noise or by the users' behavioral changes. Since Accoustate is concerned with generating annotations for the IMU data using a pre-trained acoustic model, fine-tuning the audio-based activity recognition model is entirely out of scope. However, where the results are conflicting, we apply the following strategy. For resolving the conflict where Accoustate maps the same activity to both users, we start by defining the opportunistic user among the two users in the environment.
Definition 3 (Opportunistic User): Among two users U_m and U_n performing simultaneous activities, U_m is called the opportunistic user if |E_m| > |E_n|, i.e., I_m(0, T) indicates a higher number of exclusive changes than I_n(0, T).
In other words, the opportunistic user is the user who has taken more breaks during her primary activity and thus provides the framework with more opportunities to correctly map the activity of the other user. Subsequently, to resolve the conflict, we assume the decision made by observing the exclusive changes E_m of the opportunistic user U_m to be true and map the inferred activity to U_n. The output of Stage 2 of the framework is the individual activity label (corresponding to the primary activity) for each user over the entire duration [0, T]. Next, the final task is to annotate the unlabeled IMU data of both users with the respective mapped activity labels. It can be noted that I_m(0, T) and I_n(0, T) may contain instances of the auxiliary activities performed by the users; however, we are only interested in annotating the IMU segments during which they have performed their primary activities. Although we have a unique activity-to-user mapping for both users, we cannot use this mapping directly for the IMU annotation because of the following two reasons. (1) The activity labels are returned by the pre-trained acoustic-based activity recognition model [11], and thus the returned activity labels are synchronized with the acoustic change-points. (2) The IMU change-points may not be perfectly time-synchronized with the acoustic change-points, as they are computed independently. Accoustate solves these issues as follows. The pre-trained acoustic-based HAR model [11] returns the activity labels with an associated confidence value.
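Before detailing the annotation step itself, the Stage-2 majority vote and conflict resolution described above can be summarized in a short sketch; the data structures and names are hypothetical, both lists are assumed non-empty, and the unresolved side of a conflict is simply left unlabeled here.

from collections import Counter

def map_activities(labels_from_Em, labels_from_En, Em, En):
    """labels_from_Em holds, for each exclusive change in E_m, the activity label
    inferred for the *other* user U_n (and vice versa for labels_from_En).
    Majority voting gives one label per user; if both users end up with the same
    activity, trust the opportunistic user, i.e. the one with the longer list of
    exclusive changes (Definition 3)."""
    act_n = Counter(labels_from_Em).most_common(1)[0][0]
    act_m = Counter(labels_from_En).most_common(1)[0][0]
    if act_m == act_n:              # conflict: same activity mapped to both users
        if len(Em) > len(En):       # U_m is the opportunistic user
            act_m = None            # keep act_n (derived from E_m); leave U_m unresolved
        else:
            act_n = None            # keep act_m (derived from E_n); leave U_n unresolved
    return act_m, act_n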
We first find the acoustic segment, say Ā(β, γ), for which the model returns the primary activity p_i → U_m with the maximum confidence. We then extract the corresponding IMU segment I_m(β, γ), termed the Key Segment, and label it with the activity label p_i. Based on the IMU change-point detection technique, we have already segmented I_m(0, T), where the IMU change-points mark the start and the end of each segment. We fit all these IMU segments I_m(θ, ζ) to an unsupervised nearest-neighbor model [38], which learns the patterns of these IMU segments. Similar segments should lie close to each other and collectively indicate the same activity, because the same activity should produce similar signal patterns for a user. We then use the key segment I_m(β, γ) to find the z nearest neighbors of that segment and annotate all those segments with the activity label p_i. The following section empirically analyzes the impact of different values of z. We evaluate the performance of Accoustate in the Kitchen and Workshop scenarios (see Section III for the datasets) from multiple perspectives, starting with the quality of the activity labels generated by Accoustate, moving to their utility in developing supervised models, and ending with a glimpse of its resource consumption. Evidently, Accoustate assigns incorrect activities to the users for combinations C3 and C8 in the Workshop and C2 and C4 in the Kitchen environment. Close inspection reveals that in C3 for the Workshop, the user U_4 performed sawing at a stretch, taking only a single break in the entire period. This provides fewer opportunities for Accoustate to annotate user U_1 (performing hammering). On the other hand, in C8, the framework cannot provide any conclusive activity annotation for either user, mostly due to the external noises present in the environment. Although Accoustate could identify 41 acoustic changes, the audio-based activity recognition model fails to recognize any meaningful activities for 38 of them, resulting in a drop in performance. In addition, we notice a general performance drop of Accoustate in the Kitchen environment. This is attributable to the inherently complex nature of kitchen activities, such as cooking and chopping, which results in jumbled patterns in the acoustic context that adversely affect the audio-based activity recognition model of the framework. For example, we observe that the model incorrectly detects cooking from the sound of handling utensils, which can occasionally occur while chopping as well. Next, we delve deeper and investigate the performance of Accoustate in correctly annotating the IMU stream I_m, generated by a specific user U_m, with the correct activity label p_i (say, p_i → U_m). We measure the accuracy of the annotations for a user U_m by comparing the overlap of the annotated instances with the ground-truth activity labels for each time instance. For every overlap, we assign a score of 1, and 0 otherwise. Concerning the Workshop dataset, from TABLE VI we observe that for most of the users, the framework generates annotations (for some value of z) with accuracy > 70%, although the volume of annotations is less than 20% in some instances, especially in the case of users performing sawing. The IMU data from sawing exhibits frequent change-points, and the IMU change patterns within those change-points depend on factors like how the user holds the saw and the sawing speed.
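Since this behavior is tied to the nearest-neighbor label propagation of Stage 3, a minimal sketch of that step helps make the dependence concrete; it uses scikit-learn's unsupervised NearestNeighbors, and the hand-picked statistical features and function names are illustrative assumptions rather than the paper's exact feature set.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def featurize(segment):
    """Collapse a variable-length (n, 3) IMU segment into a fixed-length
    statistical feature vector (an illustrative choice of features)."""
    return np.concatenate([segment.mean(axis=0), segment.std(axis=0),
                           segment.min(axis=0), segment.max(axis=0)])

def propagate_label(segments, key_idx, key_label, z=15):
    """Label the key segment, then annotate its z nearest neighbors among the
    change-point-delimited IMU segments with the same primary-activity label."""
    feats = np.stack([featurize(s) for s in segments])
    nn = NearestNeighbors(n_neighbors=min(z + 1, len(segments))).fit(feats)
    _, idx = nn.kneighbors(feats[key_idx:key_idx + 1])  # includes the key segment itself
    labels = {i: None for i in range(len(segments))}
    for i in idx[0]:
        labels[int(i)] = key_label
    return labels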
As a consequence, the per-window IMU patterns vary significantly. Since we rely on this nearest-neighbor strategy for labeling the IMU data, such variations in the patterns result in only a few IMU windows showing similarity with the key segment (Section VII), leading to a lower volume of annotated data. However, for the Kitchen (see TABLE VII), we observe that Accoustate performs with better accuracy. This improvement can be attributed to the inherent nature of kitchen activities like cooking and chopping, which by default require intermediate stops, thus providing many opportunities for the framework to identify the patterns accurately. However, the analysis of pattern similarity also reveals that common activities like cooking often have high variability [4]. For example, users may change the way they use the cooking spud depending on the item being cooked. These variations over time cause frequent small change windows, impacting the volume of annotation to some extent. One critical observation is that the IMU patterns heavily depend on the activity type. For example, auxiliary micro-activities are much less frequent when a user cuts a metal pipe. In general, such opportunities are fewer in number for workshop activities. In contrast, kitchen activities provide better opportunities for annotation. Therefore, Accoustate can widely be used for annotating Activities of Daily Living (ADLs); nevertheless, it can also annotate non-ADL activities (like those in the Workshop) to a satisfactory extent. The primary purpose of annotating sensor data is to create training data for supervised models. With this objective in mind, we assess the quality of our annotated data in a cross-user setup. This setup allows us to benchmark the annotated data and provides insight into whether Accoustate can be used to obtain bootstrap data for a few users and whether such bootstrapping data can then be used to label other users' unlabeled data. To evaluate this, we use a leave-one-out strategy, testing a particular user's original data with supervised models trained using the annotated labels (z = 15) and the labeled IMU streams from the other users in the setup as features. For the Workshop, we thus generate four sets (Set 1 to Set 4) of training data by leaving out one of the four users in each case and use the left-out user to test the model accuracy. For the Kitchen, we had data for both activities only for user U_2, whose data we use as the test dataset, and we create two training datasets mixing chopping data from U_1 with cooking data from U_3 (Set 1) and U_4 (Set 2), respectively. From the results with two standard supervised learning algorithms (Random Forest and Support Vector Machines) shown in Fig. 6, we observe that for most of the cases the supervised models attain an F1-score > 0.70 in correctly predicting the activity labels in the cross-user setup, which demonstrates the quality of the annotations generated by Accoustate. Accoustate uses audio as the auxiliary modality to make the overall framework lightweight, so that the privacy-sensitive data can be kept within its own premises by running the model at the edge. To assess this, we profile the stages of the framework individually on a per-module basis for time and memory consumption (using the Linux proc filesystem) on the C1 dataset from the Workshop setup. Fig. 7a shows that, except for the audio activity-recognition module, all the remaining modules take < 200 MB of memory, which allows them to run on resource-constrained edge devices.
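As a rough illustration of this kind of per-module profiling, the resident memory of a process can be read from the Linux proc filesystem as sketched below; this is an illustrative snippet under that assumption, not the exact profiling harness used for the reported numbers.

import os
import time

def rss_mb(pid=None):
    """Resident memory (VmRSS) of a process in MB, read from /proc/<pid>/status."""
    pid = pid or os.getpid()
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # value is reported in kB
    return 0.0

start = time.time()
# ... run one Accoustate module here ...
print(f"memory: {rss_mb():.1f} MB, elapsed: {time.time() - start:.1f} s")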
Although the audio-based HAR module consumes a significant amount of memory, it can still run on resource-constrained devices, as shown in [11]. Concerning the time consumption, the total running time of Accoustate on Workshop-C1 is 969.15 seconds, with the majority of the time consumed in detecting changes in the IMU data. However, this total time is for the entire dataset containing the IMU streams from both users, which may instead be provided in sessions to reduce the data volume and processing time. This section highlights a few of the insights we gained while developing Accoustate, which open up various exciting issues. Scalability: We primarily designed and tested Accoustate for dual-user simultaneous activities. However, our method can be extended to multiple users by applying a leave-one-out policy as follows. Let there be a set of n users U = {U_1, ..., U_n} and n different activities P = {p_1, ..., p_n}. Let there be an acoustic gap for U_1, indicating that the primary activity of U_1, say p_1, will be absent in that duration. So, we can recursively probe Accoustate with U \ {U_1} and P \ {p_1}. Nevertheless, such a method needs fine-tuning of the activity signatures, which can be explored as an immediate extension of this work. Change in Intra-Activity Patterns: An unsupervised nearest-neighbor model allows us to identify closely related patterns, and Accoustate uses this idea to annotate using a known key segment. However, one crucial observation in this context concerns the change of patterns over time. For example, a user may change how s(he) uses the cooking spud over time depending on how well the food is cooked. Such changes over time can increase the distance of a valid instance from the known key segment, resulting in a lower annotation volume. Although Accoustate cannot tackle such behavioral changes, it can provide significant bootstrapping data. We can use such bootstrapping data in approaches like Active Learning [5] for further annotation with minimum human intervention, or adapt approaches like [4] for tackling data variability. Significant Acoustic Gap: A primary understanding that we develop from the overall evaluation is that the acoustic gaps are critical for Accoustate to perform well. However, a significantly loud noise from the auxiliary activities may confound the overall acoustic context, making the gap difficult for the audio-based HAR model to identify. Thus, that particular acoustic gap becomes irrelevant. "Data, data, everywhere, nor any drop to use." Smart infrastructures generate millions of sensing data points per second, but the data is useless until it is appropriately labeled. Human-based annotation is neither feasible nor cost-effective for labeling such data, whereas video-based annotation has significant processing overhead and privacy concerns. Accoustate provides a framework for the automated annotation of IMU data using audio captured through a smart speaker or a VPA deployed within the infrastructure. Accoustate judiciously combines signal processing techniques with unsupervised learning mechanisms to correctly identify the activity boundaries and label the IMU data with high accuracy. The evaluation indicates that the Accoustate-generated labels are highly accurate, as the method identifies the activity boundaries by extracting the precise changes within the IMU data itself.
Although the volume of Accoustate-labeled data is sometimes low, we believe that our approach can at least generate the bootstrap labels that can then be combined with techniques like Active Learning [5] to annotate the remaining data.
REFERENCES
[1] IMU-based robust human activity recognition using feature analysis, extraction, and reduction
[2] Sequential weakly labeled multiactivity localization and recognition on wearable sensors using recurrent attention networks
[3] Fixing mislabeling by human annotators leveraging conflict resolution and prior knowledge
[4] Assessing ADL routine variability from high-dimensional sensing data using hierarchical clustering
[5] Improving generalization with active learning
[6] Validity and reliability of the experience-sampling method
[7] Active deep learning for activity recognition with context aware annotator selection
[8] Automatic annotation for human activity recognition in free living using a smartphone
[9] Automated annotation of sensor data for activity recognition using deep learning
[10] Let there be IMU data: Generating training data for wearable, motion sensor based activity recognition from monocular RGB videos
[11] Ubicoustics: Plug-and-play acoustic activity recognition
[12] YouTube-8M: A large-scale video classification benchmark
[13] A dataset and taxonomy for urban sound research
[14] Smart frame selection for action recognition
[15] LASO: Exploiting locomotive and acoustic signatures over the edge to annotate IMU data for human activity recognition
[16] ARAS human activity datasets in multiple homes with multiple residents
[17] Optimizing for happiness and productivity: Modeling opportune moments for transitions and breaks at work
[18] Automatic annotation of sensor data streams using abductive reasoning
[19] Realtime multi-person 2D pose estimation using part affinity fields
[20] Enabling edge devices that learn from each other: Cross modal training for activity recognition
[21] Active learning enabled activity recognition
[22] Sensing fine-grained hand activity with smartwatches
[23] Automatic annotation of unlabeled data from smartphone-based motion and location sensors
[24] IDIoT: Towards ubiquitous identification of IoT devices through visual and inertial orientation matching during human activity
[25] Fine-grained recognition of activities of daily living through structural vibration and electrical sensing
[26] Audacity software is copyright 1999-2019 Audacity Team; the name Audacity is a registered trademark of Dominic Mazzoni
[27] Temporal segmentation and activity classification from first-person sensing
[28] SVM-based multimodal classification of activities of daily living in health smart homes: Sensors, algorithms, and first experimental results
[29] Real-time change point detection with application to smart home time series data
[30] Change-point detection in time-series data based on subspace identification
[31] A smart segmentation technique towards improved infrequent non-speech gestural activity recognition model
[32] Sound-Adapter: Multi-source domain adaptation for acoustic classification through domain discovery
[33] Crowd++: Unsupervised speaker count with smartphones
[34] Analyzing noise robustness of MFCC and GFCC features in speaker identification
[35] A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds
[36] Cross power spectral density spectrum for noise modelling and filter design
[37] Silhouettes: A graphical aid to the interpretation and validation of cluster analysis