key: cord-0127061-7ug0jr9x authors: Peng, Kunyu; Roitberg, Alina; Yang, Kailun; Zhang, Jiaming; Stiefelhagen, Rainer title: Should I take a walk? Estimating Energy Expenditure from Video Data date: 2022-02-01 journal: nan DOI: nan sha: c623cd508ca9a19ee8a32cae1251a5f22f82748a doc_id: 127061 cord_uid: 7ug0jr9x We explore the problem of automatically inferring the amount of kilocalories burned by a person during physical activity from video observations. To study this under-researched task, we introduce Vid2Burn -- an omni-source benchmark for estimating caloric expenditure from video data featuring both high- and low-intensity activities, for which we derive energy expenditure annotations based on models established in medical literature. In practice, a training set would only cover a certain number of activity types, and it is important to validate whether the model indeed captures the essence of energy expenditure (e.g., how many and which muscles are involved and how intensely they work) instead of memorizing fixed values of specific activity categories seen during training. Ideally, the models should look beyond such category-specific biases and regress the caloric cost in videos depicting activity categories not explicitly present during training. With this property in mind, Vid2Burn is accompanied by a cross-category benchmark, where the task is to regress caloric expenditure for types of physical activities not present during training. An extensive evaluation of state-of-the-art approaches for video recognition modified for the energy expenditure estimation task demonstrates the difficulty of this problem, especially for new activity types at test-time, marking a new research direction. Dataset and code are available at https://github.com/KPeng9510/Vid2Burn. If you asked people to honestly answer "Why do you go to the gym?", a frequent reply would be to burn calories. Physical activity is connected with our health and is an important element in the prevention of obesity, diabetes, and high blood pressure, issues which have been amplified by the recent Covid-19 lockdowns and home office regulations [3]. With the rise of health tracking apps, automatic inference of energy expenditure is rapidly gaining attention [2, 5, 36, 39, 59, 70], but almost all prior research has focused on signals obtained from wearable devices, such as smart watches or heart rate monitoring chest straps. While such sensors are not always at hand or comfortable to wear, most people can easily access a video camera in their phone or laptop. Apart from helping users interested in tracking their exercise and maintaining an active lifestyle, recent studies in gerontology highlight the benefits of automatically tracking the level of physical activity in assistive smart homes in order to support the elderly [4, 22, 45]. As important as it is for our health, understanding physical activity offers new technical challenges in computer vision. Excellent progress has been made in the field of human activity recognition [6, 14, 44, 56, 65], with remarkable accuracies reported on datasets such as HMDB-51 [29] or Kinetics [6]. However, when applied to our task of estimating caloric expenditure from human observations, these methods face two main obstacles. First, the cornerstone of past research lies in rather rigid categorization into predefined actions. These categories are often relatively coarse (e.g., "football" vs.
"jogging"), so that the scene context provides the network with an excellent shortcut to the decision, leaving the actual moving person behind [10, 62] . Our task however requires fine-grained understanding of human movement, as medical research [7] lists which muscles are active and how hard they work as the main drivers of energy expenditure (although a multitude of further factors influence this complex physiological process). A second key challenge is to encourage the model to capture the essence of energy expenditure instead of memorizing average values of specific activity categories seen during training. Deep neural networks are prone to learning shortcuts [10, 16, 19] and internally casting the calorie regression problem as an "easier" task of activity categorization which might be one of such potential shortcuts. Even if the annotations are continuous calorie values and not rigid categories, in practice, the training set can only cover a finite amount of activity types. Ideally, our model should not be bounded to category-specific biases and indeed learn the nature of activity-induced energy expenditure by, e.g., understanding the type and intensity of bodily movement produced by the skeletal muscles. When developing an energy expenditure benchmark, it is therefore critical to evaluate the results on types of physical activity not present during training. In this paper, we explore the new research direction of inferring activity-induced caloric cost by observing the human in video, as shown in Fig. 1 . To tackle the lack of public large-scale datasets, we introduce Vid2Burn -a new omni-source benchmark spanning 9789 video examples of people engaged in different activities with corresponding annotations designed based on models established in medical literature [27] from (1) current activity category (2) intensity of the skeleton movement and (3) heart rate measurements obtained for a subset of activities (household activities) in a complementary study. Videos in the dataset are chosen from four diverse activity recognition datasets [25, 29, 50, 53] originally from YouTube, movies or explicitly designed for recognition in household context. Yet, a key challenge when applying energy expenditure models in practice arises from transferring the learned concepts to new activity types. To meet this requirement, Vid2Burn is equipped with a cross-category benchmark, where the caloric cost estimation models are evaluated against activity types not seen during training. In addition to potential mobile health applications, our dataset there fills the lack of benchmark studying concise recognition of body movement without relying on category-specific context biases. From the computer vision perspective, the key technical challenges of our benchmark are (1) fine-grained understanding of bodily movement and (2) generalization to previously unseen types of activities. Extensive experiments with multiple state-of-the-art approaches for videoand body pose based action recognition demonstrate the difficulty of our task using modern video classification architectures, highlighting the need for further research. Activity recognition in videos. Human activity recognition often operates on body poses [9, 31, 33, 51, 67] or learns representations end-to-end directly from the video data using Convolutional Neural Networks (CNNs) [6, 17, 20, 52, 68] . 
CNN-based approaches often deal with the temporal dimension via 3D convolution [44, 56, 57, 61, 65] or follow the 2D+1D paradigm, chaining spatial 2D convolutions and subsequent 1D modules to aggregate the features temporally [14, 15, 24, 32, 60, 71]. Fueled by multiple publicly released large-scale activity recognition datasets collected from YouTube/movies [25, 29, 53] or in home environments [50], deep learning-based activity recognition has become a very active research field, also explored in more targeted applications, e.g., in cooking [11, 46], sports [42], robotics [23, 49], and automated driving-related tasks [35]. More specialized activity recognition research also addressed topics such as the uncertainty of video classification models [47, 54]. However, all these approaches focus on categorization into previously defined activity classes, while examining their feasibility for capturing complex physiological processes of the body, such as our calorie expenditure task, has been largely overlooked. Energy expenditure prediction. Visual estimation of calorie values has been mainly investigated in food image analysis (i.e., tracking the amount of caloric intake) [34, 40, 48]. Energy expenditure induced by physical activity is mostly studied from an egocentric perspective featuring data from wearable sensors, such as accelerometers or heart rate monitors [2, 5, 18, 26, 37, 39, 41, 55, 64], with a recent survey provided in [70]. Only very few works address visual prediction of activity-related caloric expenditure [39, 55]. The only dataset collected for energy expenditure prediction by visually observing the human [55] features a highly simplistic evaluation setting (a single environment) and is comparatively small in size, restricting the investigation of data-driven CNNs in this scenario (besides, access to the collected database is restricted). To the best of our knowledge, no previous work has explored deep CNNs for the estimation of caloric cost from human observation. The research most similar to ours is presumably the work of Nakamura et al. [39], who collected an egocentric video dataset for estimating energy expenditure and explored CNN-based architectures for this task. However, the research of [39] is significantly different from ours, as the cameras are mounted on the human, so that the person is not visually observed. Our dataset is created with the opposite perspective in mind, as we target caloric expenditure estimation from video observations of the human. Given the growing demand for eHealth apps, it is surprising that there is not a larger body of work on estimating the physical intensity of activities in videos. This might be due to the general focus of video classification research evolving mostly around activity categorization [6, 8, 13, 15, 17], while virtually all exercise intensity assessment datasets focus on wearable sensors [2, 5, 39] delivering, e.g., heart rate or accelerometer signals. To promote the task of visually estimating the hourly amount of kilocalories burned by the human during the current activity, we introduce the novel Vid2Burn dataset, featuring > 9K videos of 72 different activity types with caloric expenditure annotations on both category and sample level. Vid2Burn is an omni-source dataset developed with a diverse range of movements and settings in mind. Our data collection procedure comprised the following steps. We started by surveying the well-known available datasets for categorical activity classification (e.g., [25, 29, 50, 53]).
Then, we identified categories which are not only accessible from these public datasets but for which it is also technically feasible to infer caloric cost annotations. The main sources of our dataset are UCF-101 [53], HMDB51 [29], the test set of Kinetics [25], and NTU-RGBD [50]. We manually identified 72 activity types for which the hourly caloric cost can be estimated based on established physiological models (e.g., [1, 27, 58]). Then, we estimated the labels for the energy expenditure based on these models on the category and sample level, as described in Section 3.3. The benefits of understanding caloric cost from videos extend to many applications, such as tracking of active exercise routines [21] or monitoring the daily physical activity level for elderly care [4, 22, 45]. From the technical perspective, it is also useful to distinguish settings with larger and smaller differences between the samples. Lastly, while it is feasible to derive proper ground truth for coarse behaviours or situations with well-studied energy expenditure (e.g., types of sports and exercises), many daily living activities do not fall into this category and should be addressed with different techniques. Motivated by this, we group the content of Vid2Burn into two subsets: Vid2Burn-Diverse and Vid2Burn-ADL. Table 7 gives an overview of Vid2Burn and both its variants. Vid2Burn-Diverse is collected from YouTube- and movie-based sources [25, 29, 53] and therefore features a highly uncontrolled environment (camera movement, diverse indoor/outdoor backgrounds). Since we focused on activities with well-studied energy expenditure models, a large portion of the behaviours are related to sports (e.g., PushUps). However, the database also covers certain everyday activities, such as walking, standing or eating. The distribution of different activity types is summarized in Figure 2. On average, the dataset features 129 video clips per category, using category labels inherited from the original sources, with walking unsurprisingly being the most common behaviour with 548 videos, while stretching and shopping are the least frequent ones with 47 videos. Vid2Burn-ADL, on the other hand, targets Activities of Daily Living (ADL) and might be used for physical workload tracking in smart homes. The activity types and video examples are derived from the public NTU-RGBD [50] dataset for ADL classification and, compared to Vid2Burn-Diverse, this dataset contains activities of rather lower physical intensity (e.g., pickup, take off jacket, read, drink water). The environment of Vid2Burn-ADL is much more controlled and the differences between the individual samples are at a smaller scale. In other words, Vid2Burn-ADL can be regarded as a much more fine-grained benchmark for caloric cost regression. In contrast to Vid2Burn-Diverse, the categories of Vid2Burn-ADL are rather well-balanced and the number of examples per activity type is 142 on average (detailed frequency statistics are provided in the supplementary). To adequately represent activity-induced caloric expenditure, we conducted a literature review on this physiological process [1, 37, 43, 58, 63]. Tracking the heat expended by nutrient oxidation (i.e., monitoring oxygen intake and carbon dioxide production) is considered the most accurate
way of estimating energy expenditure [43]. While this method is invasive and not practical for large-scale use, a multitude of topical studies conduct and publish such measurements for specific groups of activities, which are often summarized by meta-reviews in the form of compendiums [1]. Such catalogues provide energy expenditure values for specific activities and are often available online as look-up tables. A common way of estimating caloric cost, which we also leverage, is deriving it from the heart rate with validated physiological models [27, 37]. Our annotation scheme leverages three methods to estimate hourly energy expenditure: established medical compendiums [1], heart-rate based measurements [27], as well as adjustments based on the captured body movement [58]. Next, we describe the three different ways of obtaining caloric cost ground truth (Section 3.3.1) and explain how we leveraged them to annotate the Vid2Burn-Diverse (Section 3.3.2) and Vid2Burn-ADL (Section 3.3.3) datasets. Our derived annotation scheme leverages three types of sources: (1) the current activity category, (2) the intensity of the skeleton movement, as well as (3) heart rate measurements obtained for a subset of activities (household activities) in a complementary study. Caloric cost values from published compendiums. First, we leverage activity-specific metabolic rate values from published compendiums [1], often summarized as look-up tables available on the web. For simplicity, we assume a body weight of 150 lb, since this is also the average body weight of the subjects captured in our heart rate measurement study. Examples of the category-wise caloric expenditure annotations are marked as stars in Figure 5b for the Vid2Burn-Diverse dataset. Heart-rate based annotations. Vid2Burn-ADL focuses on daily living activities, which naturally exhibit a lower intensity of movement. The differences are at a much smaller scale compared to Vid2Burn-Diverse, and the average expected energy expenditure has not been well-studied for many of these fine-grained types of physical activity. However, due to the more restricted nature of the environment, the NTU-RGBD setting is easy to reproduce. We therefore recreate the environment of 39 activities of Vid2Burn-ADL and estimate their average caloric cost based on heart rate measurements captured in a study with four volunteer participants (1 female, 3 male, 27.75 years old on average, average weight 150 lb). Participation in our study was voluntary, and the subjects were instructed about the scope and purpose of the data collection and gave their written consent according to the requirements of our institution. The heart rate of all participants was recorded using a wrist band activity tracker (Xiaomi Mi Smart Band 4, https://mi.com/global/mi-smart-band-4/specs/). The subjects were asked to execute the 39 activities of Vid2Burn-ADL with a resting period in between to ensure heart rate recovery. More information about the study setup is provided in the supplementary. Given the measured heart rate, we compute the caloric cost of the activity with the heart-rate model of [27], which predicts Cal_M/Cal_F, the hourly caloric expenditure for males/females, from the heart rate HR, the body weight W, the participant's age A, and the duration T in hours.
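For illustration, a minimal sketch of such a heart-rate based conversion is given below. It assumes the widely used regression model of Keytel et al., which [27] refers to, with the weight in kg and the age in years; the function name, the guard against negative values, and the exact constants and unit conventions are assumptions and may differ from the annotation pipeline actually used for Vid2Burn.

```python
def hourly_calories_from_heart_rate(hr_bpm, weight_kg, age_years, sex):
    """Rough estimate of hourly energy expenditure (kcal/h) from an average
    heart rate, in the spirit of the regression model of Keytel et al. [27].
    The constants below are taken from the published model and are an
    assumption here; the paper's exact implementation may differ."""
    if sex == "male":
        kcal_per_min = (-55.0969 + 0.6309 * hr_bpm
                        + 0.1988 * weight_kg + 0.2017 * age_years) / 4.184
    else:  # female
        kcal_per_min = (-20.4022 + 0.4472 * hr_bpm
                        - 0.1263 * weight_kg + 0.074 * age_years) / 4.184
    # Clamp at zero for very low heart rates and scale the per-minute value to one hour.
    return max(kcal_per_min, 0.0) * 60.0

# Hypothetical example: a 150 lb (~68 kg), 28-year-old male participant at 95 bpm.
print(round(hourly_calories_from_heart_rate(95, 68.0, 28, "male"), 1))
```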
Body movement-based annotations. Next, we approximate the caloric cost induced by the movement by leveraging the model of Tsou et al. [58]. People can engage in the same type of activity in different ways and, since the amount of calories burned is directly linked to the amount and type of active muscles and their intensity, more active bodily movement leads to a higher caloric cost. Tsou et al. [58] formalize and validate a model based on the movement of eight body regions r. We estimate the skeleton movement using AlphaPose [12, 30, 66] for Vid2Burn-Diverse, while for Vid2Burn-ADL we use the skeleton data provided by the authors of the original datasets [38, 50]. Following [58], we group the skeleton joints into eight regions of interest and approximate the energy consumption from their motion, where Δx_rt, Δy_rt, Δz_rt denote the position difference of body region r between frame t and t-1 (the average position of all body joints inside the same region), F denotes the number of frames, λ is the frame frequency, and M_r is the mass of the r-th body region. The final hourly calorie consumption is obtained via multiplication with 0.239 (cal/J) × F × λ × 3600 (s/h). The weighting factors ω_r of the different body regions are taken from [58]. The main purpose of the body pose-based caloric cost estimation is to enable more precise annotations at sample level, since the same activity can be executed with different intensities. One strategy behind the design of Vid2Burn-Diverse was to select behaviour types for which the average caloric cost is well-studied and easily accessible [1] (for that reason, sports-related videos constitute a significant portion of Vid2Burn-Diverse). The category-level values are therefore derived from these published average category-specific caloric costs. We then correct the estimates for the individual videos based on the previously explained body-movement model [58], resulting in more precise sample-level annotations. Energy expenditure is not well-studied for many of the more fine-grained daily living situations in Vid2Burn-ADL. We therefore take a detour by conducting a study in which the participants' heart rate was recorded during the 39 target activities (as described in the heart rate paragraph of Section 3.3.1). For each activity type, we estimate (1) the average heart rate-based caloric cost value obtained from our study and (2) sample-level adjustments based on the body movement model [58], analogously to Vid2Burn-Diverse. Since we specifically aim to evaluate the generalization of the calorie estimation models to new activity types, we construct two testing scenarios: (1) known activity types evaluation, where the test videos cover the same behaviours as the training set, and (2) unknown activity types evaluation, where the train and test samples are drawn from different activity types. We randomly select 27 (Vid2Burn-Diverse) and 33 (Vid2Burn-ADL) activity types for the training set, while for both benchmark versions the 6 remaining categories are used for evaluation. Next, the data of the 27/33 training activity types is further split into training/testing (with a ratio of 7:3) for the same-category evaluation. Note that the category annotations used for constructing the splits are inherited from the source datasets. Overall, our dataset comprises 2782/3243 videos for training, 1192/1390 samples for validation on the same activity types, and 286/896 samples for the new-activity-type evaluation for the Vid2Burn-Diverse and Vid2Burn-ADL databases, respectively.
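To make the body movement-based annotation described above more concrete, the following sketch converts per-region joint displacements into a kinetic-energy based kcal/h estimate. Only the overall structure (weighted per-region kinetic energy, converted with 0.239 cal/J and scaled to one hour) follows the description above; the function name, the placeholder region masses and weighting factors, and the exact aggregation of Tsou et al. [58] as used for the Vid2Burn annotations are assumptions.

```python
import numpy as np

def hourly_calories_from_skeleton(regions_xyz, fps, region_mass_kg, region_weight):
    """Rough kinetic-energy based calorie estimate in the spirit of Tsou et al. [58].
    `regions_xyz` has shape (F, R, 3): per-frame mean 3D position (in meters) of
    each of the R = 8 body regions. Masses and weights are placeholders; the values
    used for the actual Vid2Burn annotations may differ."""
    deltas = np.diff(regions_xyz, axis=0)             # (F-1, R, 3) displacement per frame step
    speed_sq = np.sum((deltas * fps) ** 2, axis=2)    # (F-1, R) squared speed in (m/s)^2
    # Weighted kinetic energy, summed over regions and frame steps (in Joules).
    energy_j = np.sum(region_weight * 0.5 * region_mass_kg * speed_sq)
    mean_power_w = energy_j / (deltas.shape[0] / fps)  # average power over the clip (J/s)
    return mean_power_w * 3600.0 * 0.239 / 1000.0      # J/s -> cal/h -> kcal/h

# Hypothetical example: 64 frames, 8 regions, 25 fps, random smooth trajectories.
rng = np.random.default_rng(0)
poses = np.cumsum(rng.normal(0, 0.01, size=(64, 8, 3)), axis=0)
masses = np.full(8, 70.0 / 8)   # naive equal split of a 70 kg body (placeholder)
weights = np.ones(8)            # placeholder region weighting factors omega_r
print(round(hourly_calories_from_skeleton(poses, 25.0, masses, weights), 2))
```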
Given a video input, our goal is to infer the hourly energy cost of the activity in which the depicted human is involved. Note that we target the intensity of the bodily activity and not its duration, i.e., our goal is to infer kilocalories burned per hour. Since our targets are continuous caloric values, our task naturally suits regression-based losses, such as the Euclidean L2 loss. However, we observed that the regression optimization converges to a constant value in our case (a similar effect has been reported before in multimodal problems, e.g., in [69]). We therefore address this problem as multinomial classification with additional label softening. Similar to [54], we binarize each caloric value annotation l with a resolution of 1 kcal into bins n ∈ [0, N], where N is set to 1000 kcal. To keep certain regression properties, such as penalizing predictions which fall closer to the ground-truth bin less, we soften the labels through a Gaussian distribution with a given standard deviation (STD) denoted as σ. Then, for each ground truth annotation l, we obtain the softened label as a distribution over the N bins: l_s(n) = exp(-(n - l)^2 / (2σ^2)) / Σ_{n'} exp(-(n' - l)^2 / (2σ^2)), where l_s denotes the soft label used for supervision. We then use the Kullback-Leibler (KL) divergence between the ground truth and predicted distributions: L = Σ_{n=0}^{N} l_s(n) log(l_s(n) / y(n)), where y denotes the predicted distribution over n ∈ [0, N]. We adopt five modern video- and body pose-based architectures developed for categorical human activity recognition as our video representation backbones. I3D. The Inflated 3D CNN (I3D) is a widely-used activity recognition backbone [6] and is a spatio-temporal version of the Inception-v1 network. Weights are transferred from pretrained 2D CNNs, and its pretraining is achieved by repeating ("inflating") the weights along the temporal axis. R3D. This 3D convolutional architecture [17] with a remarkable depth of 101 layers (enabled through residual connections) chains multiple ResNeXt blocks, which are shallow three-layered networks leveraging group convolution. R(2+1)D. Unlike the previous models, R(2+1)D [57] "mimics" spatio-temporal convolution by factorizing it into distinct 2D spatial and 1D temporal convolutions, yielding remarkable results despite these simpler operations. This framework also leverages a residual architecture. SlowFast. Our last CNN-based architecture is the SlowFast model of Feichtenhofer et al. [13], which introduces two branches: a slow pathway and a fast pathway, capturing cues at different temporal resolutions. ST-GCN. In addition to the video-based models, we consider a popular architecture comprising a graph neural network operating on the estimated body poses [67], which uses spatial and temporal graph convolutions to harvest human motion cues. The temporal windows captured by the above backbones (16 frames for I3D [6], R(2+1)D [57] and R3D [56], and 32 frames for SlowFast [15]) are considerably smaller than the durations of the video clips captured in Vid2Burn. Given an input video of length T and a model f_θ(·) which takes F frames as input, we sequentially pass K video snippets {t_1, t_2, ..., t_K} using an overlapping sliding window, resulting in K predictions {f_θ(t_1), f_θ(t_2), ..., f_θ(t_K)}.
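A minimal PyTorch sketch of the label softening and KL objective from Section 4.1 is given below. The bin count, σ, the tensor shapes and the function names are illustrative choices, not the exact training code of the paper.

```python
import torch
import torch.nn.functional as F

def soften_label(calories, num_bins=1000, sigma=15.0):
    """Turn a scalar kcal/h annotation into a Gaussian-softened distribution
    over `num_bins` 1-kcal bins (a sketch of the scheme in Section 4.1)."""
    bins = torch.arange(num_bins, dtype=torch.float32)
    logits = -((bins - calories) ** 2) / (2.0 * sigma ** 2)
    return torch.softmax(logits, dim=0)   # normalized soft label l_s

def kl_loss(pred_logits, soft_label):
    """KL divergence between the softened ground truth and the prediction."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_pred, soft_label, reduction="batchmean")

# Hypothetical usage: a batch of two annotations and random network outputs.
targets = torch.stack([soften_label(230.0), soften_label(415.0)])
pred = torch.randn(2, 1000)
print(kl_loss(pred, targets).item())
```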
We now consider two different strategies for fusing these results: (1) averaging the outputs of the last fully connected layer and (2) learning to fuse the outputs with an additional LSTM network. The first method employs simple average pooling of the final representation: ŷ = (1/K) Σ_{k=1}^{K} f_θ(t_k). Our second fusion strategy passes the output of the last fully connected layer of our video representation backbone to an LSTM network with two layers, with the number of neurons corresponding to the input size, trained together with the backbone model in an end-to-end fashion. Since the LSTM also produces a sequential output, we average the resulting sequence to obtain the final prediction. We adopt the Mean Absolute Error (MAE) as our main evaluation metric and additionally report the Spearman Rank Correlation (SPC) and the Negative Log-Likelihood (NLL). MAE is an intuitive metric reporting the mean deviation between prediction and ground truth in our target units (i.e., kilocalories). Note that while SPC illustrates the strength of the association between the ground truth and the predictions (it is +1 if one is a perfect monotone function of the other and -1 if they are fully opposed), it should be interpreted with care, since it ignores scaling and shifting of the data. In other words, SPC would not reflect whether the number of kilocalories is consistently over- or underestimated by similar amounts. It should therefore only be viewed as a complementary metric. Note that we report SPC in % for better readability (i.e., we multiply the result by 100). Our experiments are carried out in two annotation settings: category- and sample-wise annotations (see Section 3.3 for details). We view the sample-wise version as the more precise choice, but using static category labels is the protocol used in past energy expenditure work on egocentric data [39] and is adopted for consistency. As explained in Section 3.4, we conduct the evaluations both on behaviours present during training and on new activity types. All implemented models greatly outperform the random and average baselines in all measures. The two fusion strategies are marked as AVG (average pooling fusion) and LSTM, with AVG consistently leading to notably better results. SlowFast (SF) with AVG consistently achieves the best recognition quality with an MAE of only 34.5 kcal/20.1 kcal. As expected, the results are lower in the more fine-grained sample-wise setting, since the backbones were initially developed for coarser categorization. The task of caloric cost regression in previously unseen situations is much more difficult and the performance drops significantly: for SF-AVG, the MAE is > 4 and > 2 times higher in the category- and sample-wise settings, respectively. Interestingly, the best model in the case of known activities is usually not the top-performing approach for the new activity types; in the sample-wise evaluation of Vid2Burn-Diverse, the best model for unseen activities reaches an MAE of 130.8 kcal, but the gap to SF-AVG (MAE of 134 kcal) is very small (Table 4). Presumably due to the more restricted environment and smaller average caloric cost values, the models are more accurate on Vid2Burn-ADL (Table 6). Consistent with the Vid2Burn-Diverse results, SF-AVG yields the best recognition quality on known activities (MAE of 20.1 kcal) and is the second best performing model on the new ones. Table 6 also lists our results achieved with a regression loss (L2), marked as R3D-Reg. As explained in Section 4.1, we observe convergence to a constant value, resulting in a very high MAE. Note that SPC cannot be computed in this case, since the output is a single value rather than a distribution.
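The two fusion strategies described at the beginning of this section can be sketched as follows. The backbone is abstracted as a source of one feature vector per snippet, and the class name, feature dimension and head size are placeholders rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class SnippetFusion(nn.Module):
    """Sketch of the AVG and LSTM fusion of K per-snippet backbone outputs
    into a single distribution over calorie bins (shapes are illustrative)."""
    def __init__(self, feat_dim=1024, num_bins=1000, mode="avg"):
        super().__init__()
        self.mode = mode
        self.lstm = nn.LSTM(feat_dim, feat_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(feat_dim, num_bins)

    def forward(self, snippet_feats):                 # (B, K, feat_dim)
        if self.mode == "avg":
            logits = self.head(snippet_feats)         # per-snippet fc outputs (B, K, bins)
            fused = logits.mean(dim=1)                # average the fc outputs over snippets
        else:
            out, _ = self.lstm(snippet_feats)         # learned sequential fusion
            fused = self.head(out).mean(dim=1)        # average the resulting sequence
        return torch.softmax(fused, dim=-1)

# Hypothetical usage with 5 snippets of backbone features for a batch of 2 videos.
feats = torch.randn(2, 5, 1024)
print(SnippetFusion(mode="avg")(feats).shape)         # torch.Size([2, 1000])
```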
We also consider the trade-off between the performance and the computational cost (Table 5). SF-AVG offers a good balance between speed and accuracy, while the I3D backbone is a more lightweight model, but its MAE is higher. We further look at the recognition quality for individual activity types: two known (running, climbing) and two unknown (yoga, shopping) behaviours, as presented in Table 11 (more results are provided in the supplementary). The recognition quality varies greatly depending on whether the activity is familiar. For example, SF-AVG is only off by 43.8 kcal for running, but the MAE is 174.4 kcal for shopping. Finally, in Figure 4 we showcase multiple representative qualitative results for the Vid2Burn-Diverse (top) and Vid2Burn-ADL (bottom) datasets by visualizing the activation regions of a CNN (I3D backbone). In addition to the predicted caloric value and the ground truth, we visualize the activation regions of an intermediate CNN layer (in this case, the second convolutional layer). It is evident that the largest focus is put on the body regions activated during movement, which we view as a positive property, since energy expenditure is a direct result of muscle movement [7]. However, in several cases objects (e.g., the television) are clearly highlighted despite no direct interaction, indicating category-specific biases which are presumably the leading cause of mistakes in the cross-category setting. Implementation Details. Our models are trained with ADAM [28] using a weight decay of 1e-5, a batch size of 4, and a learning rate of 1e-4 for 40 epochs, initializing the model weights from Kinetics [25], on a Quadro RTX 6000 graphics card (parameter counts and inference times are reported in Table 5). For the binarization of the continuous label space, the maximum calorie prediction limit is set to 1000 kcal for Vid2Burn-Diverse and 500 kcal for Vid2Burn-ADL with a resolution of 1 kcal. A more detailed description of the parameter settings is provided in the supplementary. We introduced the novel Vid2Burn benchmark for estimating the amount of calories burned during physical activities by visually observing the human. Through our experiments, we found that the generalization ability of modern video classification CNNs is limited in this challenging task, and we will keep tackling this issue in future work. Vid2Burn will be publicly released, opening new perspectives on specific challenges in human activity analysis, such as fine-grained understanding of bodily movement and generalization to new physical activity types, since our benchmark specifically evaluates the quality of energy expenditure estimation in new situations. We hope to foster research on human understanding models which are able to capture cues of the underlying physiological processes (e.g., active muscles and their intensity) instead of learning rigid category-specific biases seen during training. Broader Impact and Limitations. This work targets energy expenditure estimation from videos. The benefits of such methods extend to multiple applications, such as supporting a healthy lifestyle, e.g., by tracking exercise routines [21] or monitoring the daily physical activity level for elderly care [45]. However, both the annotations in our dataset and the results inferred by our models are approximations and not exact measurements, and should be used with care in medical applications, as they are simplified by assuming a body weight of 150 lb, while gender, height and age are not taken into account.
Moreover, our data-driven algorithms may learn shortcuts and biases present in the data, potentially resulting in a false sense of security. In addition to the summary of limitations at the end of our main paper, this section provides more details about the limitations of our approaches and the proposed benchmarks. This work targets the estimation of energy expenditure from videos. The benefits of such methods extend to multiple applications, such as supporting an active and healthy lifestyle, e.g., by tracking exercise routines [21] or monitoring the daily physical activity level for elderly care [4, 22, 45]. However, our work is not without limitations. Energy expenditure is a complex physiological process [7], and while bodily movement (i.e., the active muscles and their intensities) is its primary driver, there is a variety of contributing factors, such as age, gender, weight and personal metabolic rate. Many of these factors are not considered in our work. For example, for simplicity, we derive energy annotations from medical compendiums assuming a weight of 150 lb (our study with the heart rate-based ground truth estimation is an exception, where age/gender/weight were taken into account). The ground truth values of our dataset are therefore only approximate estimates. Furthermore, as with most data-driven algorithms, our models may learn shortcuts and biases present in the data (in our case, oftentimes category- and context-related biases), which may cause a false sense of security. Direct calorimetry [43] or heart rate-based estimation [1] are more accurate ways to estimate caloric cost than visual models. Our work introduces two video-based calorie consumption estimation benchmarks, Vid2Burn-Diverse and Vid2Burn-ADL, together with several deep learning-based baselines targeting end-to-end calorie consumption estimation. A wide range of applications for health monitoring and human physical activity level prediction can directly benefit from this work. Moreover, our work also tackles the generalization issue by evaluating the calorie estimation performance on unseen activity types, which simulates the scenario of facing out-of-distribution samples. The baselines leveraged in our work show a certain performance difference between the evaluations on known and unknown action types, indicating that biased predictions and possible misestimations can result in a false sense of security; this also points out a valuable direction for future research. To allow future work to build on our benchmarks and baselines, we will make our code, models, and data publicly available. Since we use multiple public datasets and online resources to form the video dataset and the annotation set, we have carefully cited the related works for these leveraged datasets and provided the links to the online resources in the corresponding footnotes of our paper. For the Vid2Burn-ADL dataset, we collected heart rate, body weight and age data from 4 subjects to improve the accuracy of our calorie consumption annotation. The collected data is only leveraged to generate the global calorie consumption annotation, which is highly aggregated and cannot directly identify a specific person. The data and annotations are fully anonymous. During the data collection procedure, each subject was instructed on how to record the heart rate data through the wrist band (Mi Band 4), which does not have any negative impact on the human body.
No personal data is involved in the Vid2Burn dataset, which will be published soon, since all collected measurements are highly aggregated. All participants were volunteers and signed a data collection agreement. We did not include the signed forms for voluntary data collection in the supplementary material. In order to further clarify the strengths of our proposed benchmarks, we compare the two proposed benchmarks, Vid2Burn-Diverse and Vid2Burn-ADL, with the two existing video-based benchmarks Stanford-ECM [39] and Sphere [55]. The cameras of our proposed benchmarks and of Sphere are in a fixed position, while Stanford-ECM leverages an egocentric perspective requiring the camera to be mounted on a wearable device, which limits the comfort of the user and requires contact if a practical application is taken into consideration. Concerning the number of actions, our Vid2Burn contains in total 72 kinds of activities, covering both high- and low-intensity activities, together with >9K video clips, which is much larger than the other two datasets and makes deep learning-based end-to-end calorie consumption estimation feasible. In addition, our benchmarks provide sample-wise calorie consumption annotations, which are more precise than those of the other datasets that only provide category-level human energy expenditure annotations. We also show part of the label-sample pairs with sample-wise calorie consumption annotations for each benchmark in Figures 7 and 8. There are 33 and 39 label-sample pairs for Vid2Burn-Diverse and Vid2Burn-ADL, respectively. First, the number of samples under each activity type, indicated by the number on each histogram, and the corresponding category-wise annotation, denoted by the number on each image, are introduced in Fig. 6. The sample numbers of the different actions show a balanced distribution, with a minimum of 122 and a maximum of 159 samples per class. Second, the statistical analysis for the Vid2Burn-ADL dataset is shown in Fig. 5b. Similar to the Vid2Burn-Diverse dataset, we use 39 categories from the NTU RGBD [50] dataset to construct Vid2Burn-ADL, where the colored dots indicate the sample-wise calorie consumption annotations. Compared with Vid2Burn-Diverse, introduced in Fig. 5a, Vid2Burn-ADL shows relatively lower movement intensity. In Fig. 5b we can see that the sample-wise calorie consumption ranges of different action types overlap. Finally, we give a detailed description of the heart rate collection procedure. During the data collection process, each participant wears the wrist band to monitor the heart rate. For a specific action, participants were asked to repeat the action for two minutes and to maintain the same action frequency as the original video (randomly selected for each leveraged action category of the NTU RGBD [50] dataset) in order to obtain stable heart rate data. The interval between actions is carefully selected to ensure, based on the measurements, that the heart rate has returned to the resting heart rate. Table 8. A comparison between the deep learning-based approach (I3D-AVR) and the skeleton-based forward computation approach (the same procedure as used for the skeleton-based calorie consumption annotation generation) on the Vid2Burn-Diverse and Vid2Burn-ADL benchmarks.
First, more details about the category-wise performance for calorie consumption estimation on the Vid2Burn-Diverse benchmark using category-wise annotations for supervision are presented in Table 11. Second, we provide an additional comparison between deep learning-based prediction and pure skeleton-based forward computation of calorie consumption. Finally, we provide an additional ablation study on different σ values used when generating the soft labels for supervision. Since one of the annotation sources for calorie consumption estimation is skeleton data, it is interesting to investigate the performance of directly computing the calorie consumption from the skeleton. We thereby conduct experiments on the two proposed benchmarks with sample-wise annotations, comparing deep learning-based approaches and pure skeleton-based forward calculation. According to the experimental results in Table 8, pure skeleton-based forward calculation shows a performance gap of 272.8 kcal and 531.1 kcal on the known and unknown action types of the Vid2Burn-Diverse dataset, and of 58.1 kcal and 58.8 kcal on the known and unknown action types of the Vid2Burn-ADL dataset, compared with I3D-AVR, illustrating the clear advantage of the deep learning-based approaches for video-based calorie consumption estimation. In order to investigate the influence of different σ values when generating the soft labels for calorie consumption prediction, we conduct the corresponding ablation study shown in Table 10, using the I3D-AVR approach on the Vid2Burn-Diverse dataset under category-wise supervision and choosing σ as 5, 15, 25 and 50 kcal. According to the experimental results, σ = 15 shows the best performance on known activity types and σ = 50 shows the best performance on unknown activity types in terms of the MAE metric. Table 10. Ablation study on the σ used for label softening on Vid2Burn-Diverse using category-wise label supervision. When digging deeper into deep learning-based calorie consumption estimation, the relationship between action recognition and calorie consumption estimation is interesting to investigate, especially the question of whether calorie consumption estimation reduces to a simple lookup from the recognized action class. First, looking at the labels, the sample-wise labels differ among the samples of the same action type according to the intensity of the body movement, which ensures that the task cannot be solved by a simple lookup. According to Fig. 5b, the calorie consumption ranges of different action types also overlap. Second, we conduct several ablation studies, listed in Table 9, to support this argument. If our models only predicted a lookup relationship between calorie consumption and action classes, the performance of a model that only fine-tunes the fc layers should be higher than the performance of our approach. Figure 8. An overview of the calorie consumption annotation for the Vid2Burn-ADL dataset for all 39 leveraged action types (sample-wise annotation). We mark the corresponding calorie consumption annotation under each category name for the selected sample. Table 11. Experimental results for human calorie consumption estimation for the selected action categories on the Vid2Burn-Diverse dataset supervised with category-wise annotation.
Since video classes are highly dependent on action classes, we conduct an experiment in which the weights of the pretrained video-based backbone are frozen while only the weights of the fully-connected layers are adjusted, denoted as Video in Table 9; the MAE of Video for both the known- and unknown-action-type evaluations is worse than that of I3D-AVR. We also test training the I3D-AVR baseline from scratch, denoted as I3D-AVR (TFS), which shows the worst performance compared with the others, illustrating that pretraining is important. The above analyses show that the relationship between human action and calorie consumption prediction is not a simple lookup relationship and that pretraining is essential. In addition to the implementation details mentioned in our paper, our model is built on the PyTorch toolbox. Since we leverage a temporal sliding window to aggregate features along the time axis, the temporal overlap of the sliding window for the I3D, R3D and R(2+1)D backbones is chosen as 6 frames, while the temporal overlap for SlowFast is chosen as 16 frames, since it requires a larger temporal window (32 frames) compared with the others (16 frames). For the Vid2Burn-ADL dataset, the estimation head outputs 500 channels, as the maximum calorie consumption estimation range is set to 500 kcal with a resolution of 1 kcal. For the Vid2Burn-Diverse dataset, the channel number of the final output is 1000.
References
[1] 2011 compendium of physical activities: a second update of codes and MET values
[2] Using wearable activity type detection to improve physical activity energy expenditure estimation
[3] COVID-19 pandemic-induced physical inactivity: the necessity of updating the global action plan on physical activity 2018-2030. Environmental Health and Preventive Medicine
[4] Physical activity classification for elderly people in free-living conditions
[5] Multitask LSTM model for human activity recognition and intensity estimation using wearable sensor data
[6] Quo vadis, action recognition? A new model and the kinetics dataset
[7] Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research
[8] Deep analysis of CNN-based spatio-temporal representations for action recognition
[9] Skeleton-based action recognition with shift graph convolutional network
[10] Why can't I dance in the mall? Learning to mitigate scene bias in action recognition
[11] Scaling egocentric vision: The EPIC-KITCHENS dataset
[12] RMPE: Regional multi-person pose estimation
[13] SlowFast networks for video recognition
[14] Spatiotemporal multiplier networks for video action recognition
[15] Convolutional two-stream network fusion for video action recognition
[16] Shortcut learning in deep neural networks
[17] Learning spatio-temporal features with 3D residual networks for action recognition
[18] Prediction of energy expenditure during activities of daily living by a wearable set of inertial sensors
[19] Women also snowboard: Overcoming bias in captioning models
[20] 3D convolutional neural networks for human action recognition
[21] Is there a benefit to patients using wearable devices such as fitbit or health apps on mobiles? A systematic review
[22] Elderly perception on the internet of things-based integrated smarthome system
[23] A human morning routine dataset
[24] Large-scale video classification with convolutional neural networks
[25] The kinetics human action video dataset
[26] Validity of wearable activity monitors for tracking steps and estimating energy expenditure during a graded maximal treadmill test
[27] Prediction of energy expenditure from heart rate monitoring during submaximal exercise
[28] Adam: A method for stochastic optimization
[29] HMDB: A large video database for human motion recognition
[30] CrowdPose: Efficient crowded scenes pose estimation and a new benchmark
[31] Actional-structural graph convolutional networks for skeleton-based action recognition
[32] TSM: Temporal shift module for efficient video understanding
[33] Disentangling and unifying graph convolutions for skeleton-based action recognition
[34] Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images
[35] Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles
[36] CaloriNet: From silhouettes to calorie estimation in private environments
[37] A combined heart rate and movement index sensor for estimating the energy expenditure
[38] Accuracy of physical activity monitors for steps and calorie measurement during pregnancy walking
[39] Jointly learning energy expenditures and activities using egocentric multimodal signals
[40] CalorieCaptorGlass: Food calorie estimation based on actual size using HoloLens and deep learning
[41] Improving energy expenditure estimates from wearable devices: A machine learning approach
[42] What and how well you performed? A multitask learning approach to action quality assessment
[43] Energy expenditure: components and evaluation methods
[44] Learning spatiotemporal representation with pseudo-3D residual networks
[45] Older women's perceptions of wearable and smart home activity sensors. Informatics for Health and Social Care
[46] A database for fine grained activity detection of cooking activities
[47] Uncertainty-sensitive activity recognition: A reliability benchmark and the CARING models
[48] Multi-task learning for calorie prediction on a novel large-scale recipe dataset enriched with nutritional information
[49] The KIT robo-kitchen data set for the evaluation of view-based activity recognition systems
[50] NTU RGB+D: A large scale dataset for 3D human activity analysis
[51] Skeleton-based action recognition with directed graph neural networks
[52] Two-stream convolutional networks for action recognition in videos
[53] UCF101: A dataset of 101 human actions classes from videos in the wild
[54] Uncertainty-aware score distribution learning for action quality assessment
[55] Energy expenditure estimation using visual and inertial sensors
[56] Learning spatiotemporal features with 3D convolutional networks
[57] A closer look at spatiotemporal convolutions for action recognition
[58] Estimation of calories consumption for aerobics using kinect based skeleton tracking
[59] Calorific expenditure estimation using deep convolutional network features
[60] Temporal segment networks: Towards good practices for deep action recognition
[61] Non-local neural networks
[62] Mimetics: Towards understanding human actions out of context. International Journal of Computer Vision
[63] Self-selected exercise intensity during household/garden activities and walking in 55 to 65-year-old females
[64] Activity-specific caloric expenditure estimation from kinetic energy harvesting in wearable devices. Pervasive and Mobile Computing
[65] Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification
[66] Pose flow: Efficient online pose tracking
[67] Spatial temporal graph convolutional networks for skeleton-based action recognition
[68] Beyond short snippets: Deep networks for video classification
[69] Colorful image colorization
[70] Energy expenditure prediction methods: Review and new developments
[71] Temporal relational reasoning in videos