key: cord-0994544-qnklowz2 authors: Shei, Ren-Jay; Holder, Ian G.; Oumsang, Alicia S.; Paris, Brittni A.; Paris, Hunter L. title: Wearable activity trackers–advanced technology or advanced marketing? date: 2022-04-21 journal: Eur J Appl Physiol DOI: 10.1007/s00421-022-04951-1 sha: 9849a13837781589d0f06cae6c85bd170995d91f doc_id: 994544 cord_uid: qnklowz2 Wearable devices represent one of the most popular trends in health and fitness. Rapid advances in wearable technology present a dizzying display of possible functions: from thermometers and barometers, magnetometers and accelerometers, to oximeters and calorimeters. Consumers and practitioners utilize wearable devices to track outcomes, such as energy expenditure, training load, step count, and heart rate. While some rely on these devices in tandem with more established tools, others lean on wearable technology for health-related outcomes, such as heart rhythm analysis, peripheral oxygen saturation, sleep quality, and caloric expenditure. Given the increasing popularity of wearable devices for both recreation and health initiatives, understanding the strengths and limitations of these technologies is increasingly relevant. Need exists for continued evaluation of the efficacy of wearable devices to accurately and reliably measure purported outcomes. The purposes of this review are (1) to assess the current state of wearable devices using recent research on validity and reliability, (2) to describe existing gaps between physiology and technology, and (3) to offer expert interpretation for the lay and professional audience on how best to approach wearable technology and employ it in the pursuit of health and fitness. Current literature demonstrates inconsistent validity and reliability for various metrics, with algorithms not publicly available or lacking high-quality validation studies. 
Advancements in wearable technology should consider standardizing validation metrics, providing transparency in used algorithms, and improving how technology can be tailored to individuals. Until then, it is prudent to exercise caution when interpreting metrics reported from consumer-wearable devices. Wearable technology has gained in popularity among a broad segment of the general population including elite, competitive, and recreational athletes, as well as both the physically active and sedentary general population. Even in 2015, approximately 1 in 8 respondents (12.5%) to an annual nationwide consumer mail panel survey in the United States indicated that they currently used a wearable activity monitor (Omura et al. 2017) . Recent survey data of fitness trends in 2019 showed that wearable technology was the number one most popular trend (Thompson 2018) , and the market for such devices continues to grow. These devices are broadly marketed to both the general population for physical fitness and health monitoring as well as specifically to elite athletes as a training tool (Wahl et al. 2017) . As such, the public health relevance of such wearable devices is increasing and may impact areas, such as physical activity, wellbeing, cardiovascular health, mortality risk, dietary habits, among others. For example, higher volumes of physical activity energy expenditure measured by wearable devices was recently shown to be associated with reduced mortality rates, and that higher-intensity activity reduced mortality rates to a larger extent than lower-intensity activity (Strain et al. 2020) . Recent advances in technology have resulted in a myriad of wrist-based sensors being built into the current generation of fitness watches (Fig. 1) . These include digital 3-axis accelerometers, pulse oximeters, optical heart rate sensors, thermometers, barometers, magnetometers, among others. 
These sensors, in combination with ever-improving algorithms (most of them proprietary), have led to fitness companies marketing these devices as being capable of estimating and monitoring such physiological parameters as step count, heart rate (HR), sleep quality, sleep rhythm, energy expenditure (EE), maximal oxygen consumption (VO2max), peripheral oxygen saturation, and the "training effect" of both individual and cumulative exercise bouts. Despite these purported advances in physiological monitoring capabilities, few published data exist to support the validity and reliability of such tracking. Due to the widespread use of these wearable-fitness tracking devices to promote health, fitness, and an overall active lifestyle, rigorous and transparent reporting of validation studies should be encouraged to improve precision, accuracy, and reliability, and to engender trust in the consumers who purchase and utilize such devices. To date, studies investigating the validity and reliability of these devices are sparse, with wide disparity in findings, likely due to a variety of reasons including differences in devices tested, study population, and experimental design. This review summarizes the available studies testing the validity and reliability of wearable-fitness devices, discusses several publicly available algorithms to estimate select physiological parameters, and presents current knowledge gaps and future directions for the wearable technology field to address.

Fig. 1 Web of variables assessed by wearable devices and factors that must be considered in accurately reporting these variables. Variables directly connected to the athlete are those recorded by technologies. Outer variables are factors that influence the inner measure. For example, "training load" must consider both "external load" and "internal load," which themselves must consider "work," "distance," "duration," etc.
Specific metrics discussed below include VO2 and VO2max estimation, EE estimation, step count estimation, HR, and HR variability (HRV). All were selected because they are common measures in applied physiology research and are commonly tracked by end-users of consumer-wearable devices. These metrics also have the most published research data available on their validity compared to gold standard laboratory or research methods of data collection. Additionally, more applied metrics, such as training load, stress, and sleep, are discussed as well. These variables are also measured in physiology research, although they are less common than the aforementioned metrics, and they have been selected for discussion because of their special interest to the lay population of end-users. Oxygen transport and utilization are the most important determinants of endurance exercise performance. The amount of oxygen that an individual utilizes per unit time (VO2) provides a comprehensive view of aerobic capacity and pathophysiology. Traditional VO2 measurement is undertaken in a laboratory setting using open-circuit spirometry and indirect calorimetry, in which expired gases are collected and analyzed and VO2 is calculated using the Haldane transformation. Continuous measurement of VO2 during an incremental exercise test to exhaustion is considered the gold standard measurement of maximal aerobic capacity, or VO2max. The major drawbacks of such testing are the need for properly trained staff and expensive equipment, and the lack of a field-based or "real-world" setting. Therefore, attempts have been made either to adapt equipment to be more portable, or to estimate VO2 based on surrogate physiological parameters, such as heart rate, workload, or accelerometry data (Carrier et al. 2020; Helm et al. 2021; Snyder et al. 2021).
Though VO2 estimation has become more common in wearable devices, research concerning the accuracy of wearables in estimating VO2 and VO2max has struggled to keep pace. Passler et al. (2019) compared five commercially available wrist-worn devices against respiratory gas analysis and found a mean absolute percentage error of > 10% [considered a high error (Henriksen et al. 2020)] for the majority of devices. Even within a given brand, VO2 was sometimes overestimated, but other times underestimated. Particularly prominent in discussing validity and reliability amongst wearable devices is the Towards Intelligent Health and Well-Being Network of Physical Activity Assessment (INTERLIVE). This consortium of six European universities and one industrial partner develops best-practice guidelines for evaluating consumer wearables. A recent review from INTERLIVE (Molina-Garcia et al. 2022) regarding the validity of wearables in estimating VO2 and VO2max concluded that population-level estimates of VO2max demonstrate some degree of accuracy, but the margin of error is large for VO2max estimates at the individual level. Within the realm of oxygen utilization, some devices also report arterial oxygen saturation (SaO2 or SpO2). Recently, Zhang and Khatami (2022) summarized existing data on the validity of this measure, especially in the Apple and Garmin smartwatches. They concluded that, despite millions of people utilizing smartwatches to monitor oxygen saturation (and possible sleep apnoea), and although the Apple Watch seems to be more accurate than the Garmin, none of the popular smartwatches had been well validated. Overall, findings suggest that for VO2 estimates, as well as oxygen saturation estimates, wrist-worn activity trackers lack the accuracy for sport and healthcare application.
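To make the error metric concrete, the following is a minimal sketch of how a mean absolute percentage error (MAPE) could be computed when comparing device estimates against laboratory reference values. The function names and the VO2max values are hypothetical and chosen only for illustration; they are not data from any cited study. The 10% cut-off is the "high error" threshold mentioned above. The signed mean percentage error is included as an assumption-level companion metric, because over- and underestimation can cancel in a signed average while still producing a large absolute error.

```python
HIGH_ERROR_THRESHOLD_PCT = 10.0  # "high error" cut-off quoted in the text


def mean_absolute_percentage_error(reference, estimates):
    """MAPE (%) of device estimates relative to reference measurements."""
    if len(reference) != len(estimates) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, est in zip(reference, estimates):
        if ref == 0:
            raise ValueError("reference values must be non-zero")
        total += abs(est - ref) / abs(ref)
    return 100.0 * total / len(reference)


def mean_percentage_error(reference, estimates):
    """Signed mean percentage error (%); positive means overestimation."""
    if len(reference) != len(estimates) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, est in zip(reference, estimates):
        if ref == 0:
            raise ValueError("reference values must be non-zero")
        total += (est - ref) / ref
    return 100.0 * total / len(reference)


# Hypothetical VO2max values in mL/kg/min (illustration only).
lab_vo2max = [40.0, 50.0, 60.0]
watch_vo2max = [44.0, 45.0, 66.0]

mape = mean_absolute_percentage_error(lab_vo2max, watch_vo2max)
bias = mean_percentage_error(lab_vo2max, watch_vo2max)
print(f"MAPE: {mape:.1f}%  signed bias: {bias:.1f}%")
```

In this made-up example every individual estimate is off by 10%, yet the signed bias is much smaller because one estimate is too low and two are too high. This mirrors the distinction drawn above between population-level and individual-level accuracy.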
Discrepancies between laboratory-based VO2 calculations and wearable technologies may be explained both by the algorithms companies utilize and by individual variations in exercising VO2. Perhaps the most significant challenge in judging the methodology used by wearable technologies to estimate VO2 is the highly secretive, proprietary nature of the variables and algorithms involved. From the little public information available, such as from patent filings and white papers, it appears some devices utilize running speed and the inclination of the terrain (both presumably derived from the GPS signal of the device) to estimate theoretical VO2. Technologies then apply a "quality" filter to improve data quality. Ultimately, VO2max is based on a constructed curve of HR vs estimated workload (Firstbeat Technologies Ltd.; US Patent Application Publication). Without full transparency in precisely how each wearable device estimates the parameters reported to the user, it is impossible to rigorously evaluate the validity and reliability of these metrics. Considering that wearable technologies often estimate VO2 from HR, along with other variables, the accuracy of HR measurement becomes relevant. Whereas heart rate monitors utilizing chest straps have a high level of validity and reliability (Gillinov et al. 2017), the accuracy of wrist-worn devices [which rely on photoplethysmography (PPG) to measure HR] remains questionable, as exercise type and intensity, as well as other factors, such as skin tone, temperature, humidity, and proximity of the sensor to the skin, may influence application (Støve et al. 2019). Acknowledging the inherent flaws when relying solely on heart rate or accelerometry to estimate VO2 (Fudge et al. 2007), some companies attempt to overcome limitations by combining methodologies. For example, one recent method attempts to estimate VO2 from HR and body movement in daily life (Bonomi et al. 2020).
This approach uses activity recognition techniques to identify contextual setting (e.g., cycling, sedentary, walking, etc.) and then applies prediction equations specific to a given setting. This method may guide future advancements. While merging accelerometry and heart rate improves accuracy (Haskell et al. 1993), VO2 predictions still differ from laboratory-based measurements, particularly at high and low exercise intensities (Acheson et al. 1980; Brage et al. 2003). Discrepancies also exist amongst the algorithms wearable technologies use. Though general information is public, such as how Firstbeat and Garmin utilize submaximal exercise data (US Patent Application Publication) whereas Polar V800 relies upon HRV at rest (Polar), proprietary information makes accessing these equations difficult and limits testing for validation and reliability. Further discrepancies between the predictive equations used in wearable technologies and laboratory-based calculation of VO2 may be explained by the individual response to exercise. Although the relationship between HR and VO2 is linear at sub-maximal intensities, HR can be influenced by age, sex, training status, efficiency, economy, emotional state, and many other factors, making the translation from heart rate to VO2 prone to errors (Fudge et al. 2007). Differences also exist in the biomechanical characteristics of locomotion, which influence VO2. For example, during running exercise, differences in vertical displacement between individuals alter the metabolic cost of running. Even within an individual, biomechanical changes occur within an event (Winter et al. 2016), with form usually deteriorating throughout an exercise bout as an individual tires. Chronic changes also occur with training, and training generally improves running economy, thereby altering the VO2 for a given pace and distance.
Other individual responses to exercise, such as the presence of the VO2 slow component or cardiac drift, may also result in a disconnect between predictive equations and actual VO2. Although current uncertainties in the validity of VO2 estimation in wearable technologies limit usefulness in health and medical practices, accuracy of wearable technology is optimized when VO2 prediction equations are based on a person's unique exercise characteristics (Fudge et al. 2007). Application of these technologies will therefore expand when technology permits a more individualized equation based on physiological and biomechanical responses to exercise, and when algorithms become more easily accessible and available for testing. Energy expenditure (EE) can be defined as the calories burned at rest or during physical activity. Direct calorimetry quantifies heat production, and while it represents the gold standard for measuring human metabolic rate, indirect calorimetry is the more common method utilized for assessing EE, even within the research laboratory (Kenny et al. 2017). While the doubly labeled water technique represents the most reliable measure of indirect calorimetry (particularly for quantifying free-living EE), open-circuit spirometry is the customary measure, primarily because of its ease of use and comparatively lower price (Ainslie et al. 2003; O'Driscoll et al. 2020). Another method of quantifying caloric expenditure, and one used by wearable devices, is to calculate EE from anthropometric data (such as body mass) and exercise parameters recorded through GPS and accelerometry. Because EE data are often used to promote body mass regulation, the ability of wearable devices to accurately report EE has implications for health and disease. Although insufficient validation studies exist for many of the features promoted by wearable technology, EE is one of the more frequently examined physiological outputs.
In general, research demonstrates strong reliability for wearable technologies (Evenson et al. 2015), but poor validity (Düking et al. 2020; Fuller et al. 2020; O'Driscoll et al. 2020; Argent et al. 2022). When pitted against the gold standard doubly labeled water technique, accelerometers demonstrated varying degrees of validity, prompting researchers to call for further development and evaluation of wearable technology (Plasqui and Westerterp 2007; Plasqui et al. 2013; Murakami et al. 2019). Just as VO2 estimation improves when the contextual setting is identified (Bonomi et al. 2020), evidence also suggests that when accelerometry algorithms recognize various types of physical activity, estimation of energy expenditure improves (Bonomi et al. 2009). Within this framework, data generally agree that the ability of wearable technology to accurately report EE depends on the type of exercise (Boudreaux et al. 2018), the intensity of exercise (Roos et al. 2017; Wahl et al. 2017), the sex and skin tone of the user (Shcherbina et al. 2017), and the specific device in question (Kendall et al. 2019; O'Driscoll et al. 2020). Though calculations for EE are proprietary and prevent specific analysis, discrepancies between laboratory-based measurement of EE and wearable technologies likely arise from limitations in EE calculations as well as individual variation in EE that would not be captured using traditional equations. Common calculations for EE utilize body mass, age, activity status, accelerometry data, and heart rate. Assuming height, weight, age, and sex are recorded correctly, calculations could fall short if levels of activity status are unclear (e.g., "I was highly-active last week but moderately-active the week prior, so what is my activity status?") or if heart rate data are inaccurate (as mentioned in the previous section on VO2max).
Individual variation in EE represents an alternative explanation for the observed discrepancies and a large hurdle for wearable technologies to overcome, particularly given contemporary limitations on technological capabilities. While discussion on the individual variations in energy expenditure is beyond the scope of this paper, we offer a few example scenarios of where energy expenditure could be altered and yet go unnoticed by wearable technologies: Resting metabolic rate (RMR) accounts for the largest contributor to total daily energy expenditure and is itself dependent on body size and body composition. While body size may be accurately assessed by inputting height and weight into wearable devices, many of these activity trackers are unable to assess body composition. Even if body fat percentages were known, many calculations on EE do not account for lean mass vs. fat mass or variations in EE that exist within lean tissue. Furthermore, a large variability in RMR-up to 250 kcal per day-exists outside of differences in body composition (Johnstone et al. 2005) , and again would not be assessed using the traditional calculations for EE. Finally, energy balance also influences RMR where RMR may demonstrate an adaptive response to caloric restriction thereby predisposing individuals to weight regain (Fothergill et al. 2016) . For example, following weight loss, RMR for a given fat-free mass may be reduced, thereby lowering the total daily energy expenditure and the calories that can be consumed if body mass is to be maintained. In addition to body composition then, an accurate estimation of EE considers both chronic and acute alterations to energy balance. It is unlikely that current wearable technologies permit such considerations. 
For the many people using wearable technologies for weight loss (or maintenance of lost weight), this last point becomes particularly relevant in that wearable devices may be most accurate for those in energy balance but may be used most often by those in energy imbalance. Therefore, those most dependent upon wearable devices for EE may be the ones most likely to experience inaccuracies in EE estimation. Each of these examples represents not only the individual response to energy expenditure, but also the complexities of quantifying caloric expenditure and potential pitfalls that may explain the lack of validity in wearable technologies regarding EE. In addition to limiting our understanding of the validity and reliability of these devices, the proprietary nature of the energy expenditure calculations also prevents proper use of these devices. For example, when using wearable devices to determine caloric expenditure for a given exercise session, are the number of calories burned for a given exercise session based on net or gross expenditure? Perhaps more importantly, is this clear to the user? If, for a given 30 min exercise session, a wearable device displays that 350 cal were burned, does this mean that an additional 350 cal were burned above resting metabolic rate, or does this mean that a total of 350 cal were burned which includes those attributable to resting metabolic rate? Overall, given the consistent finding that wearable devices lack validity in calculating EE, and given the inconsistencies in the nature of inaccuracies throughout devices, refinements are in order before these devices can be relied upon to report EE. Step count/physical activity Whereas physical activity promotes musculoskeletal health and disease prevention, a sedentary lifestyle is linked with muscle atrophy, decreased quality of life, and a less favorable cardiometabolic profile (Riel et al. 2016) . 
While it is well known that physical activity is meaningful, less is known about what constitutes meaningful physical activity. Metrics toward this goal have been developed, including the recommendation to achieve 10,000 steps per day (Hatano 1993; Choi et al. 2007). In pursuing meaningful physical activity then, monitoring daily step count represents a valuable component of health promotion and one of the major functions endorsed by wearable technologies. In the research setting, the gold standard for quantifying steps is hand tally, where physical activity is tracked by video recording and two reviewers independently assign manual step counts (Dijkstra et al. 2008; Riel et al. 2016). When video tracking is unavailable, the use of accelerometers is preferred to self-report (Riel et al. 2016), and one of the more common accelerometers used in the research setting is the ActiGraph, a triaxial accelerometer placed near the hip above the right anterior superior iliac spine (Riel et al. 2016). As physical activity and step count are common features of wearable devices, a number of reviews have examined these functions in various technologies and under a variety of settings (laboratory, free-living, walking, running, etc.). While some studies (Wahl et al. 2017; Montes et al. 2020) report promising outcomes for reliability and validity of step count, a 2020 systematic review (Fuller et al. 2020) examining 158 publications, 9 different wearable brands, and 45 devices concluded that, according to the wider body of literature, wearable devices are accurate for measuring step count in the laboratory, but exhibit a wider range of inaccuracy in free-living environments. Even within brands, validity differed. For example, whereas the Fitbit Charge tended to underestimate steps, the Fitbit Classic overestimated step count. Variability also existed for intra-device reliability, where step count differed, not only within the same company, but also within the same device.
Another 2020 systematic review (Henriksen et al. 2020) examined devices that used a triaxial accelerometer system and found that a large heterogeneity between study protocols (test duration, laboratory vs. free-living, reported metrics, statistical analyses, model investigated, etc.) limited conclusions on the overall accuracy of these technologies. One of the issues manufacturers and researchers face is determining, "What is a step?" Does it include the shuffle of the elderly, slide of the tennis player, leap of the dancer, march-in-place of the military person? For laboratory and validation purposes, as long as proprietary algorithms set the criteria for what constitutes a step, validation studies will prove difficult, as they may define a step according to different criteria. Need exists for standardized performance benchmarks that industry can meet to permit uniform testing (Bassett et al. 2017). A variety of step counters exists, utilizing diverse internal mechanisms, and placed on different regions of the body. Recent trends, however, favor wrist- and arm-worn activity trackers that utilize triaxial accelerometers to detect movement. While some data support the relationship between wrist-worn devices and observer-counted steps (El-Amrawy and Nounou 2015; Chen et al. 2016), one of the main sources of error arises when upper-body movement fails to reflect lower-body locomotion. For example, wrist-worn devices may inappropriately record steps when folding laundry (Chen et al. 2016), brushing teeth, or when worn on the dominant vs. non-dominant wrist. Conversely, some steps occur without wrist movement, such as when walking while pushing a stroller (Chen et al. 2016). Under these circumstances, devices fail to appropriately log steps. Not every arm movement of daily living is accompanied by a step, nor vice versa, a discrepancy manufacturers have attempted to overcome.
For example, some devices have altered sensitivity thresholds whereby a certain acceleration must be met before a step is counted. This appropriately limits the counting of steps due to subtle wrist movements but fails to count steps in slower-moving or clinical populations. Other devices utilize a filter where movement must be maintained for ≥ 4 s to be counted as a step (Bassett et al. 2017 ). This method, however, eliminates steps of daily, household movements, which is concerning because frequent, short-duration bouts occur in daily activity (Orendurff et al. 2008) . Ultimately, step counters are more accurate when placed at the hip or ankle compared to the wrist [exceptions exist depending on the age of the user and intensity of the activity (Mandigout et al. 2019; Fuller et al. 2020) ]. Although device placement in the laboratory often leans on proximity of the variable being measured-such as the heart rate strap worn on the torso or step count accelerometer worn on the waist-consumer preference dictates that functionality and fashion be united, shifting emphasis towards wrist-worn devices. While some studies (Evenson et al. 2015; Wahl et al. 2017 ) report a general level of accuracy or reliability for certain devices and under set conditions, recent systematic reviews (Fuller et al. 2020; Henriksen et al. 2020 ) demonstrate less confidence in the ability of current technology to accurately report step count and raise questions about consistency, even within brands. A recent study investigating five different wearable devices under both walking and jogging conditions found that when manual counters recorded ~ 800 steps, wearable devices were off by as much as 50. Another investigation (Wahl et al. 2017 ) estimated that over the course of a marathon, where 60,000 steps may be taken, some devices would be within 100 steps whereas others off by nearly 8,000. 
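The two filtering strategies described above, an acceleration threshold and a minimum duration of sustained movement, can be sketched as follows. This is a minimal illustration under stated assumptions, not any manufacturer's algorithm, since those remain proprietary. The function names, the threshold values in g, the 1 Hz sampling in the toy signal, and the 2 s maximum gap used to group steps into a bout are all hypothetical. Only the 4 s minimum comes from the text.

```python
def detect_step_times(magnitudes, sample_rate_hz, threshold_g):
    """Return times (s) at which acceleration magnitude rises to or above
    the threshold. Each upward crossing is treated as one candidate step."""
    times = []
    above = False
    for i, m in enumerate(magnitudes):
        if m >= threshold_g and not above:
            times.append(i / sample_rate_hz)
        above = m >= threshold_g
    return times


def filter_short_bouts(step_times, min_bout_s=4.0, max_gap_s=2.0):
    """Keep only candidate steps that belong to a bout of sustained movement.

    A bout is a run of steps separated by gaps of at most max_gap_s
    (hypothetical grouping rule). Bouts shorter than min_bout_s are dropped.
    """
    kept = []
    bout = []
    for t in step_times:
        if bout and t - bout[-1] > max_gap_s:
            if bout[-1] - bout[0] >= min_bout_s:
                kept.extend(bout)
            bout = []
        bout.append(t)
    if bout and bout[-1] - bout[0] >= min_bout_s:
        kept.extend(bout)
    return kept


# Toy acceleration magnitudes in g, sampled at 1 Hz (illustration only).
signal = [1.0, 1.3, 1.0, 1.05, 1.0, 1.4, 1.4, 1.0]
strict = detect_step_times(signal, 1.0, threshold_g=1.2)
lenient = detect_step_times(signal, 1.0, threshold_g=1.02)

# Candidate step times (s): a 4 s walk, a 1 s shuffle, then a 5 s walk.
candidates = [0, 1, 2, 3, 4, 10, 11, 20, 21, 22, 23, 24, 25]
sustained = filter_short_bouts(candidates)
print(len(strict), len(lenient), len(sustained))
```

With the stricter threshold the low-amplitude movement in the toy signal is not counted, which is the trade-off noted above for slower-moving populations. Likewise, the short two-step bout is discarded by the duration filter even though it may represent genuine household steps.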
Therefore, while wearable devices appear useful for tracking physical activity and step count, one must consider that error waxes or wanes depending on the type of activity, the intensity, and the environment where exercise occurs. Regarding step count and physical activity tracking, INTERLIVE noted the lack of consistency in how validity is evaluated, and identified various domains that validation studies should consider (Johnston et al. 2021) . To confidently fulfill step goals then, users, technology companies, and validation procedures must take additional steps. Heart rate (HR) increases with exertion and is used to indicate exercise intensity or as a correlate for VO 2 and physiological stress. Changes in heart rate represent useful signposts for training adaptations-where a lower heart rate for a given workload represents improved cardiorespiratory and muscular fitness. Within the clinical setting, HR is commonly measured via electrocardiogram (ECG), which detects depolarization and repolarization of the heart using electrodes placed on the chest, or by photoplethysmography such as in a pulse oximeter. In healthy populations, and in particular during exercise, HR is monitored via short-range telemetry using an electrical chest strap that transmits data via radiowave to a receiver or watch. Heart Rate Variability (HRV) describes the time variation between consecutive heartbeats and reflects the health of the autonomic nervous system. A healthy heart displays some amount of time oscillation between beats, whereas a diseased heart exhibits either a metronome-like rhythm, or abnormal variations between beats (Shaffer and Ginsberg 2017) . Changes in HRV may indicate overtraining or illness, and are useful in evaluating training intensities when comparing HRVs between exercise bouts (Hinde et al. 2021 ). Similar to HR, HRV is quantified using ECG, where algorithms calculate variability between ventricular contractions (Shaffer et al. 2014; Hinde et al. 2021) . 
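As a concrete illustration of what "variability between consecutive heartbeats" can mean numerically, the sketch below computes two widely used time-domain HRV indices, SDNN (the standard deviation of beat-to-beat intervals) and RMSSD (the root mean square of successive differences), from a list of intervals in milliseconds. The choice of these two indices, the function names, and the interval values are assumptions made for illustration; the text does not state which indices any given device reports.

```python
import math


def sdnn(rr_ms):
    """Sample standard deviation of beat-to-beat intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two intervals")
    mean = sum(rr_ms) / len(rr_ms)
    variance = sum((x - mean) ** 2 for x in rr_ms) / (len(rr_ms) - 1)
    return math.sqrt(variance)


def rmssd(rr_ms):
    """Root mean square of successive interval differences (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two intervals")
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_heart_rate_bpm(rr_ms):
    """Average heart rate implied by the intervals (beats per minute)."""
    return 60000.0 / (sum(rr_ms) / len(rr_ms))


# Hypothetical beat-to-beat intervals in ms (illustration only).
variable_rr = [800, 810, 790, 805, 795]
metronome_rr = [800, 800, 800, 800, 800]

print(f"HR {mean_heart_rate_bpm(variable_rr):.0f} bpm, "
      f"SDNN {sdnn(variable_rr):.1f} ms, RMSSD {rmssd(variable_rr):.1f} ms")
print(f"Metronome-like: RMSSD {rmssd(metronome_rr):.1f} ms")
```

Both series imply the same average heart rate, but only the first shows beat-to-beat oscillation. This is why HR alone cannot stand in for HRV, and why small timing errors in beat detection matter more for HRV than for HR.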
The ability of wrist-worn devices to accurately and reliably capture HR has been investigated over a wide range of devices and brands, with results demonstrating both accuracy and reliability, inside and outside the laboratory (Düking et al. 2020). Accuracy of HR wavers, however, with alterations in exercise intensity (Spierer et al. 2015; Thiebaud et al. 2018; Müller et al. 2019; Thomson et al. 2019; Chow and Yang 2020). Regarding accuracy and reliability of HRV, a systematic review (Georgiou et al. 2018) examined eighteen studies and found high correlations for wearable HRV and classic ECG at rest. Similar to HR, HRV accuracy and reliability decreased during exercise, a finding supported by a 2021 investigation (Hinde et al. 2021) that examined thirty-two portable devices and found that validity and reliability decreased as HR and exercise intensity increased. Previously, wearable technologies capable of detecting HR did so via chest strap and associated watch. This technique compared favorably with ECG (Leger and Thivierge 1988; Goodie et al. 2000). Recent devices, however, detect HR and HRV using PPG, although many retain compatibility with HR-measuring chest straps. PPG employs light-emitting diodes and detectors to monitor light absorption in the blood. Transmission of light through the tissue, or reflection of light from the tissue, alters light intensity and is associated with changes in tissue perfusion (Tamura et al. 2014). Changes in perfusion then indicate contraction or relaxation of the heart. Validity of devices that use PPG may be compromised by a number of factors, including motion of the extremities, wrist positioning, disturbances in the sensor-blood interface (such as sweat), and skin pigmentation. Spierer et al. (2015) also found that when the devices were evaluated over specific exercise modalities, such as stair climbing and resistance training, reliability waned, a finding confirmed in later research (Shcherbina et al. 2017).
A study on trail running similarly demonstrated poor validity with PPG-based HR sensors (Navalta et al. 2020) . To optimize the performance of PPG monitoring, most manufacturers recommend routine cleaning of the PPG sensor-wiping excess dirt, sweat, or other debris from the skin-prior to placement, and then securing the sensor tightly to minimize motion. In real-world settings, it can be difficult to achieve these recommendations given that some users sweat heavily during an exercise bout, some run in inclement weather or conditions that impair PPG sensor performance (e.g., trail running and open-water swimming, which may increase the debris and sediment that contact the sensor), and some may experience discomfort when a wrist-worn unit is tightened sufficiently to ensure optimal PPG performance. HRV can also be detected using PPG (Giles et al. 2016) , and although ECG has been considered superior due to clear detection of ventricular contraction, updated algorithms in PPG devices have improved pulse detection. Losses in accuracy associated with the exercising state may be due to a number of physiological and technological factors, such as sympathetic stiffening of the blood vessels and sampling rate of the devices themselves (Shaffer et al. 2014 ). Finally, because HRV exhibits variation between people, baseline values for HRV must be established using resting, personal data prior to the influence of other stimuli. Otherwise, HRV values by themselves lack meaningful interpretation. Daily fluctuations in body temperature, circulating hormones, sleep cycle, and metabolism contribute to HRV, and gold standard techniques in HRV assessment account for these by taking a 24 h HRV recording. Many wearable devices lack the battery capacity for 24 h measurement and rely instead upon short-term (5 min) data acquisition. 
When short-term measurements are used, recommendations call for resting measurements, which limits the use of short-term HRV during exercise (Shaffer and Ginsberg 2017). HRV can also change based on age, sex, health, and HR itself (Shaffer and Ginsberg 2017). Therefore, algorithms for HRV, and especially the diagnostic outcome of HRV measurement, should take these variables into account. When using short-term recordings (compared with 24 h recordings), recommendations suggest utilizing frequency domains instead of time interval domains (Malik et al. 1996). Therefore, if wearable algorithms utilize short-term recordings (which is likely), they will be most effective if they transform the data into frequency-based domains. Though short-term recordings refer to HRV measurement over a 5 min span, even shorter recordings (≤ 1 min) have become more prevalent. The usefulness of these snapshot views remains to be assessed and is not endorsed by leading scientific societies (Hinde et al. 2021). Once again, the INTERLIVE collaboration between universities and industry offers a best-practice protocol for validating PPG devices, which includes considering instrument placement, target population, and testing conditions, among other considerations (Mühlen et al. 2021). Ultimately, while wearables may provide a general overview of fitness and health, future devices will permit a 24 h view of HRV and demonstrate more consistency with ECG technology, especially during heavy exercise sessions.

Health and performance hang from a tenuous line separating undertraining from overtraining. Individuals (coaches, athletes, or those pursuing health and wellbeing) seek to promote stress-induced adaptation while avoiding any injury or chronic fatigue elicited by the overtrained or overstressed state. While appropriate physical stress from exercise promotes health, additional sources of stress (be it physical, social, environmental, etc.)
can compound and create an overburdened state with an elevated risk of injury or illness (Hamlin et al. 2019). Monitoring training stress, the physiological strain resulting directly from training sessions (Paquette et al. 2020), represents a useful diagnostic technique to prevent harm and optimize adaptation. The quantification of training load considers both external loads (e.g., speed, distance, duration, work) and internal loads (e.g., HR, HRV, blood lactate) during a given time. Alongside exercise-specific stressors, the monitoring of training load may also consider sleep quantity and quality as a potential diagnostic for assessing physiological strain. Finally, daily stress resulting from non-training factors, such as work, relationships, and financial stress, may also contribute to overall training load and can be assessed through visual analog scales and questionnaires. In the laboratory, the gold standard for sleep tracking is polysomnography, which records cortical and electro-ocular activity via electrodes on the scalp (De Zambotti et al. 2019). Given the vast number of contributors to physiological status and overall stress, however, no gold standard for training load quantification currently exists. For training load to be meaningful, it must relate to an outcome, such as injury occurrence or performance, and investigations must validate the strength of this relationship. Few data assess the dose-response relationship between training load measurements and training outcomes. One investigation (Sanders et al. 2017) assessed the relationships between various measures of training load and aerobic fitness in trained cyclists and found the strongest dose-response relationship when measures considered individual characteristics in the calculation of training load rather than assuming a one-size-fits-all approach with preestablished exercise intensity levels.
Similar to many of the previously discussed variables, the rapid rate of industry development and deployment, combined with a paucity of performance standards, means that assessment of accuracy and reliability struggles to keep pace with emerging techniques for quantifying training load (Passfield et al. 2022). A 2018 review of consumer wearables for monitoring stress and sleep found that only 5% of technologies had been formally validated (Peake et al. 2018). Need exists for validation of sport-specific training load algorithms, with thresholds that technology companies must meet to promote their devices as accurately assessing training load. Recently, a group investigating the use of sleep-tracking devices (Menghini et al. 2021) suggested one such evaluation technique. Further efforts such as this will promote validity and reliability of future technology.

Technological advancement in wearable devices enables the acquisition of massive datasets, which opens new opportunities for insight into health and disease but also creates new challenges in determining how best to use data in assessing these outcomes. Regarding training load, wrist-worn devices reliably track days, distance, and duration of exercise, as well as speed, cadence, and even surrogates for ground reaction forces (Moore and Willy 2019). These data are then combined and weighted to assess training load. Several quantification methods have been proposed, each relying upon a unique combination of variables, and each assigning different weights to those variables. For example, "training impulse" uses exercise duration and mean heart rate during exercise to assess the intensity and overall load of the exercise session. However, mean work rate or total accumulated work may not be truly reflective of the stress of an individual exercise bout.
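To make the limitation of a mean-based summary concrete, the following Python sketch compares two hypothetical one-hour sessions with the same mean heart rate. It assumes Banister's commonly cited training impulse formulation (duration multiplied by fractional heart rate reserve and an exponential weighting factor, with coefficients 0.64 and 1.92 for men and 0.86 and 1.67 for women); that formulation comes from the wider training-load literature and is not specified in this review. The resting and maximal heart rates, the session profiles, and all function names are illustrative assumptions.

```python
import math


def hr_reserve_fraction(hr, hr_rest, hr_max):
    """Fraction of heart rate reserve used at a given heart rate."""
    return (hr - hr_rest) / (hr_max - hr_rest)


def trimp(duration_min, hr, hr_rest, hr_max, sex="male"):
    """Banister-style training impulse for a block at a single heart rate."""
    a, b = (0.64, 1.92) if sex == "male" else (0.86, 1.67)
    x = hr_reserve_fraction(hr, hr_rest, hr_max)
    return duration_min * x * a * math.exp(b * x)


def trimp_from_mean(hr_per_min, hr_rest, hr_max, sex="male"):
    """Session load computed from total duration and mean heart rate only."""
    mean_hr = sum(hr_per_min) / len(hr_per_min)
    return trimp(len(hr_per_min), mean_hr, hr_rest, hr_max, sex)


def trimp_per_minute(hr_per_min, hr_rest, hr_max, sex="male"):
    """Session load summed minute by minute, so intensity spikes are weighted."""
    return sum(trimp(1, hr, hr_rest, hr_max, sex) for hr in hr_per_min)


HR_REST, HR_MAX = 60, 190              # hypothetical athlete
steady = [150] * 60                    # 60 min at a constant 150 bpm
intervals = [120] * 30 + [180] * 30    # same mean HR, half the time at 180 bpm

for name, session in (("steady", steady), ("intervals", intervals)):
    by_mean = trimp_from_mean(session, HR_REST, HR_MAX)
    by_minute = trimp_per_minute(session, HR_REST, HR_MAX)
    print(f"{name}: mean-based {by_mean:.1f}, minute-by-minute {by_minute:.1f}")
```

Under these assumptions the mean-based calculation scores both sessions identically, whereas the minute-by-minute sum rates the interval session as more demanding because the weighting grows faster than linearly with intensity.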
Time spent in different intensity domains, particularly at high intensity, may have little effect on overall average heart rate, but may induce significant training stress (Black et al. 2017; Clark et al. 2018). A different method calculates exercise intensity based on an individual's heart rate and blood lactate relationship. Still others utilize session rating of perceived exertion, where athletes rate their perceived level of difficulty for a given bout of exercise. Sleep patterns may also reflect the stress of training, and sleep is included in some quantifications of training load by integrating movement data from accelerometers with cardiac and environmental information (Menghini et al. 2021). Training load can also be sport-specific, and the development of power meters in cycling enables power output to be included in training load calculations for cyclists. In assessing the utility of sleep-tracking devices, Menghini et al. (2021) state that "unstandardized, undisclosed, and often unvalidated data outcomes and algorithms are among the main challenges." This sentiment can be extended to the current state of training load monitoring in general, as we remain uncertain of which variables best identify the level of stress imposed on an individual. In addition to validation standards, then, advancements in training load quantification will establish best practices for which variables to rely upon (and how heavily). In the future, additional variables may also emerge, and technology will take an individualized approach to training load, where a certain distance and pace for one person incurs a different training load compared to that same distance and pace in another person. Technology may also benefit from an ability to sift through the more general term "stress" and pinpoint anatomical, site-specific load (Moore and Willy 2019).
Until a method exists in which both internal and external loads are reliably accounted for, the calculation of training load by wearable devices remains deficient.

It should be clear that wearable activity trackers have advanced rapidly over the past decade and continue to feature new capabilities and metrics of interest to end users. Improvements in sensors, engineering, and algorithms, and the incorporation of some level of individual data (such as height, weight, and activity/training level), have all expanded and enhanced the capabilities of these devices (Düking et al. 2018). Yet, for all the advances that have been made, significant gaps still exist in validity and reliability, particularly in real-world performance circumstances. Because of the variety in manufacturer models, algorithms, and even the advertised purpose of the wearable device, this topic is quite broad, and factors such as the relatively small number of studies and the lack of transparency from manufacturers make it difficult to draw rigorous scientific conclusions. Nevertheless, recent meta-analyses on specific metrics, along with available validation studies, provide some understanding of the validity and reliability of these devices. In reviewing the available data on measures such as V̇O2, EE, step count, and others, several themes emerge. First, while some measurements appear to be valid and reliable in more controlled circumstances, such as at rest or during low-intensity activity, under more dynamic conditions, such as high-intensity exercise, validity and reliability are significantly lower and likely undependable. This issue is particularly problematic given the application of these devices as training tools and aids to improve health and fitness. Low confidence in these metrics suggests that athletes and other users should not build their training and dietary plans around the V̇O2, EE, training load, and other measures that are given by these wearable devices.
Doing so may present significant risk of over- or underestimation and could result in imbalanced training load, energy imbalance, and other factors ultimately impairing performance, health, and fitness.

Second, the proprietary nature of algorithms used to estimate these metrics impedes validation testing and likely slows innovation. This presents a challenge because independent validation of algorithms would significantly improve consumer and scientific confidence in how these calculations are made, and in whether the assumptions used in these measures are indeed valid and applicable to the population or individual using these devices. Moreover, open-source or otherwise publicly available data could help drive innovation and improvement in these algorithms and enhance transparency and consumer confidence. Systematic reviews of individual metrics have shown that although some algorithms perform well at the population level, the estimation error at the individual level is still large, and thus the estimates do not provide useful information for the end user of most consumer-wearable devices. One possibility is that a "one-size-fits-all" algorithm that is universally applied to all users is not nuanced enough to accurately estimate these metrics at the individual level. Future research should focus not only on validating algorithms at the population level, but also on reducing estimation error at the individual level.

Third, accounting for individual differences and environmental factors is both complex and challenging, yet essential for the best performance of wearable devices. Adapting different algorithms to specific sub-populations based on age, sex, training history, presence or absence of cardiometabolic and/or respiratory disease, race, ethnicity, body composition, etc. may yield results more useful to a broader audience.
Moreover, estimates are almost certainly impacted by interindividual variability in factors such as running economy (or cycling efficiency, or economy/efficiency in other exercise modalities), actual vs. predicted maximal HR, muscle fiber type distribution, V̇O2 slow component and cardiac drift during prolonged exercise, and biomechanical, musculoskeletal, and neurophysiological factors that influence stress, recovery, and injury, many of which are not properly accounted for in algorithms used to estimate common metrics.

Acknowledging the complexity of these challenges, future development should encourage more robust testing and transparency. Particularly in light of the recent COVID-19 pandemic, which required many people to monitor their own exercise regimens to a greater extent due to loss of organized exercise classes, gym access, face-to-face coaching, etc., it seems prudent to hold these wearable activity tracking devices to a higher standard. It would seem logical to develop a standard analogous to a "Phase 3" clinical trial validating the efficacy and safety of medical therapies and devices, and indeed the International Federation of Sports Medicine recently published a perspective proposing the establishment of a global standard for wearable devices in sport and fitness (Ash et al. 2020). The proposed quality assurance standard would commission testing of marketing claims and endorse the claims that are validated. However, practical constraints on the budget, facilities, and human resources necessary for testing and validation may make widespread adoption of such a standard difficult. Ultimately, the field of wearable activity trackers is dynamic, one of constant change, updates, redesign, and new models, and will necessitate continual review and research. In light of this, it is likely that the science will always lag behind new devices emerging into the market.
If so, this raises the question: Is it the role of scientists to validate the claims of manufacturers? Rather than place the burden on the scientific community, manufacturers should be encouraged to provide full transparency for the algorithms they use in their wearable devices, with rigorous, transparent, and complete reporting of algorithm development, validation, and real-world testing.

Highlighting the validity and reliability challenges in wearable technology then raises the questions, "How are people utilizing this technology?" and "What are the general levels of acceptance for metrics reported by these devices?" According to a 2019 survey, the majority of wearable technology users turn to their devices for tracking step count (60%). Heart rate (44%), calories burned (42%), and sleep monitoring (40%) represented the next highest metrics (Global Web Index 2020). Figure 2 provides a normalized representation of average acceptance for various metrics. Users mentioned managing their fitness levels (47%) and feeling in control of their health (45%) as the primary reasons they use this technology (Global Web Index 2020). Indeed, this ability to collect activity data supports one's adoption of wearable technology (Canhoto and Arp 2017). Additionally, users expressed a desire to track more health parameters in the future, with blood pressure (53%), heart rate (51%), blood sugar (45%), cardiac issues (42%), and stress issues (42%) at the top of the list (Global Web Index 2020). Wearable technology is here to stay (Market Research Future; Pew Research Center; SurveyMonkey; Thompson), which likely means that users will continue collecting health- and fitness-related data, but will also desire assistance in interpreting these data. In fact, over half of survey respondents in 2014 were willing to share their data with a physician (Accenture Interactive).
In another survey, two-thirds of respondents expressed a desire for a physician or health coach to guide them in understanding their wearable technology data when making lifestyle changes. Three-fourths of respondents were willing to pay for this service (The Harris Poll). Thus, as the use of wearable technology grows more widespread, and as companies producing these devices add features that generate even more data, health professionals will likely find their patients or clients asking them for recommendations based on self-collected health and fitness data. But if data are unreliable and lack validity, health professionals may struggle to guide patients or clients to the next best steps for their health and fitness goals. Figure 3 summarizes the Mean Absolute Percentage Error for various physiological variables recorded by consumer-wearable devices, as reported in recent studies.

So, what can be done? Perhaps this is the push the wearable technology industry needs (as the previous section mentioned) to create standards for new devices and capabilities (Cardinale and Varley 2017), and to market accurately what those capabilities are. Additionally, the health field will need to devise best practices so providers know how to handle the data their patients or clients present to them. This could include comparing the wearable technology data with already accepted measures of health and fitness (e.g., comparing time and distance walked using the wearable device with the 6-Minute Walk Test norms chart) (Chiauzzi et al. 2015). Furthermore, validity and reliability issues may be addressed if wearable technology users and their health providers look at data trends over time, instead of simply at vital signs taken in the moment.
For instance, if a wearable technology user starts experiencing a higher-than-normal resting HR, this could signal overtraining and/or possible sickness, whereas receiving a resting heart rate measurement of 80 bpm in the doctor's office only indicates that the patient is not currently experiencing tachycardia or bradycardia. Indeed, studies show that deviations in one's wearable technology metrics over time can signal disease, or risk for disease (Li et al. 2017; Rose et al. 2019; Lown et al. 2020). Currently, although wearable devices can track changes over time, they cannot detect what is a meaningful change. Rather, various algorithms have been created and must be applied to find anomalies in the user's generated data to predict future illness (Lown et al. 2020; Alavi et al. 2021; Sunny et al. 2022). These algorithms require multiple steps to be useful, including first downloading the data through an application programming interface (API), processing it for uniformity, and filling in missing data points (Sunny et al. 2022). For healthcare providers to accurately predict disease risk based on deviations in a user's wearable technology data, current restrictions require them to employ such algorithms.

If consumers and their health providers are monitoring changes over time, this is a path to greater personalized health and fitness training (Li et al. 2017; Rose et al. 2019). As stated earlier, V̇O2max estimation, HRV, and training load become more accurate when an individualized, and not a one-size-fits-all, approach is used. Designers of wearable technology could create a more robust, yet user-friendly, setup process for new purchases, making it more individualized to the wearer. This could include showing users pictures of various types of exercise, allowing them to swipe through options, thereby obtaining a closer look at their true physical activity levels.
For example, apps such as "Diet ID" have recently been developed using this method, giving health providers a more accurate view of patient or client nutrition. Figure 4 compares how users are relying upon consumer-wearable devices with an expert recommendation on how best to utilize these devices. If wearable technology companies provide consumers with more individualized capabilities, this would improve the generated outputs and represent an additional marketing tool to boost sales. It would also enable the identification of a specific baseline of health for each user so deviations that signal disease could be found more easily (Guk et al. 2019). Furthermore, creating wearables with the ability to track data changes themselves would simplify data processing and make these devices more valuable to the health field. While consumers and health professionals should continue to push the wearable technology industry toward greater standards for reliability and validity, improving data tracking and implementing greater individualization can provide a win for all stakeholders.

Wearable technologies are powerful tools for health and fitness and have become indispensable training tools for athletes of all levels. Yet, for all their merits, significant limitations exist, primarily related to the validity and reliability of the metrics these devices purport to measure. The rapid rate of development and deployment of new technologies, sensors, algorithms, and other components of these devices may lead athletes and other users to believe that these sophisticated devices are highly accurate and valid, yet published data suggest this may be the case only in limited circumstances, primarily at rest and during low-intensity activity. These limitations make it difficult to be confident in the metrics the end user sees from these devices and challenging for the practitioner to interpret the meaning of the data they generate. Improving transparency in the development and validation of these metrics, along with better tailoring to individuals, should increase the validity and reliability of these devices. While it is acknowledged that no technology or device performs perfectly under all circumstances, the breakdown in performance of many wearable devices in certain real-world settings raises the question: are they more advanced technology or advanced marketing to the end user? Scientists and practitioners alike would do well to remember that good science is often not good marketing; and conversely (and perhaps more importantly), good marketing is not always good science.

Fig. 3 Mean absolute percentage error (MAPE) for various physiological variables recorded by wearable devices from recent investigations. MAPE indicates the predictive accuracy of devices. Although no standardized thresholds exist for high or low error, MAPE > 3% has been considered high for laboratory-based studies and > 10% has been considered high for studies in free-living conditions. Note: studies normally found a range for MAPE. Therefore, "X" indicates the approximate average of various devices and scenarios tested. 1, Passler (2019); 2, Henriksen (2020); 3, Nelson (2016); 4, Montoye (2017); 5, Wallen (2016); 6, Navalta (2020); 7, LeBoeuf (2014); 8, Carrier (2020); 9, Henriksen (2021)

Fig. 4 Comparison between how users are relying upon wearable devices with expert recommendation and 5-star rating on how best to rely upon wearable devices. 5-star, excellent reliability; 4-star, good reliability; 3-star, moderate reliability; 2-star, poor reliability; 1-star, very poor reliability. 1, Evenson (2015); 2, Global Web Index (2020); 3, Canhoto (2017); 4, Rose (2019); 5, Li (2017); 6, McDonough (2021); 7, Kerner (2017); 8, Montgomery-Downs (2012)
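For readers who wish to reproduce the error statistic summarized in Fig. 3, the Python sketch below shows one way MAPE can be computed from paired device and criterion readings and then compared against the two thresholds quoted in the figure caption (> 3% for laboratory-based studies, > 10% for free-living conditions). The heart rate values are invented for illustration, and the function and variable names are assumptions rather than part of any cited validation protocol.

```python
def mape(device_values, criterion_values):
    """Mean absolute percentage error of device readings vs a criterion measure."""
    if not device_values or len(device_values) != len(criterion_values):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c == 0 for c in criterion_values):
        raise ValueError("criterion values must be non-zero")
    total = sum(abs(d - c) / abs(c)
                for d, c in zip(device_values, criterion_values))
    return 100.0 * total / len(device_values)


# Thresholds quoted in the Fig. 3 caption; no standardized cut-offs exist.
HIGH_MAPE_THRESHOLD = {"laboratory": 3.0, "free_living": 10.0}


def is_high_error(mape_percent, setting):
    """True if the error exceeds the threshold quoted for the study setting."""
    return mape_percent > HIGH_MAPE_THRESHOLD[setting]


# Hypothetical paired readings: wrist-worn PPG vs ECG heart rate (bpm).
wrist_hr = [62, 118, 150, 171]
ecg_hr = [60, 120, 156, 180]

error = mape(wrist_hr, ecg_hr)
print(f"MAPE: {error:.2f}%")
print("high for a laboratory study:", is_high_error(error, "laboratory"))
print("high for a free-living study:", is_high_error(error, "free_living"))
```

Because MAPE averages absolute errors, it hides whether a device tends to over- or underestimate; a signed bias statistic would be needed to show direction.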
Author contributions RJS and HLP conceived of the idea for the work; RJS, IGH, ASO, BAP, HLP drafted the manuscript and critically revised the work; all authors reviewed and approved the final manuscript.

Funding No funding was received for this work.

The authors declare no conflicts of interest. Ren-Jay Shei is an employee of Coherus BioSciences. The work described herein is solely reflective of the author's (R-JS) personal views and is unrelated to his job duties with Coherus BioSciences. These views do not constitute an endorsement by Coherus BioSciences, do not represent the views of Coherus BioSciences, and Coherus BioSciences had no role in the conception, writing, revision, or final approval of the manuscript. Brittni Paris is co-owner of Smart Fit Womxn LLC, a wellness coaching company. The work described here is BAP's personal view and does not reflect the view of Smart Fit Womxn.

Accenture Interactive Consumer Privacy Concerns for Wearable Tech
Acheson K, Campbell I, Edholm O et al (1980) The measurement of daily energy expenditure-an evaluation of some techniques
Estimating human energy expenditure
Real-time alerting system for COVID-19 and other stress events using wearable data
Recommendations for determining the validity of consumer wearables and smartphones for the estimation of energy expenditure: expert statement and checklist of the INTERLIVE network
Establishing a global standard for wearable devices in sport and fitness: perspectives from the New England chapter of the American college of sports medicine members
Step counting: a review of measurement considerations and health-related applications
Muscle metabolic and neuromuscular determinants of fatigue during cycling in different exercise intensity domains
Improving assessment of daily energy expenditure by identifying types of physical activity with a single accelerometer
Cardiorespiratory fitness estimation from heart rate and body movement in daily life
Validity of wearable activity monitors during cycling and resistance exercise
Reexamination of validity and reliability of the CSA monitor in walking and running
Exploring the factors that support adoption and sustained use of health and fitness wearables
Wearable training-monitoring technology: applications, challenges, and opportunities
Validation of garmin fenix 3 HR fitness tracker biomechanics and metabolics (VO2max)
Accuracy of wristband activity monitors during ambulation and activities
Patient-centered activity monitoring in the self-management of chronic health conditions
Daily step goal of 10,000 steps: a literature review
Accuracy of optical heart rate sensing technology in wearable fitness trackers for young and older adults: validation and comparison study
Effects of two hours of heavy-intensity exercise on the power-duration relationship
Wearable sleep technology in clinical and research settings
Detection of walking periods and number of steps in older adults and patients with Parkinson's disease: accuracy of a pedometer and an accelerometry-based method
Recommendations for assessment of the reliability, sensitivity, and validity of data provided by wearable sensors designed for monitoring physical activity
Wrist-worn wearables for monitoring heart rate and energy expenditure while sitting or performing light-to-vigorous physical activity: validation study
Are currently available wearable devices for activity tracking and heart rate monitoring accurate, precise, and medically beneficial?
Systematic review of the validity and reliability of consumer-wearable activity trackers
Firstbeat Technologies White Paper: Automated Fitness Estimation Firstbeat Technologies Ltd. Firstbeat Technologies White Paper: VO2 estimation
Persistent metabolic adaptation 6 years after "the biggest loser" competition
Estimation of oxygen uptake during fast running using accelerometry and heart rate
Reliability and validity of commercially available wearable devices for measuring steps, energy expenditure, and heart rate: systematic review
Can wearable devices accurately measure heart rate variability? a systematic review
Validity of the Polar V800 heart rate monitor to measure RR intervals at rest
Variable accuracy of wearable heart rate monitors during aerobic exercise
Global web index (2020) Digital healthcare report: understanding the evolution and digitization of healthcare
Validation of polar heart rate monitor for assessing heart rate during physical and mental stress
Evolution of wearable devices with real-time disease monitoring for personalized healthcare
Monitoring training loads and perceived stress in young elite university athletes
Simultaneous measurement of heart rate and body motion to quantitate physical activity
Use of the pedometer for promoting daily walking exercise
Validation of the garmin fenix 6S maximal oxygen consumption (VO2max) estimate
Measuring physical activity using triaxial wrist worn polar activity trackers: a systematic review
Wearable Devices suitable for monitoring twenty four hour heart rate variability in military populations
Recommendations for determining the validity of consumer wearable and smartphone step count: expert statement and checklist of the INTERLIVE network
Factors influencing variation in basal metabolic rate include fat-free mass, fat mass, age, and circulating thyroxine but not sex, circulating leptin, or triiodothyronine
Validity of wearable activity monitors for tracking steps and estimating energy expenditure during a graded maximal treadmill test
Direct calorimetry: a brief historical review of its use in the study of human metabolism and thermoregulation
Heart rate monitors: validity, stability, and functionality
Digital health: tracking physiomes and activity using wearable biosensors reveals useful health-related information
Machine learning detection of Atrial Fibrillation using wearable technology
Heart rate variability: Standards of measurement, physiological interpretation, and clinical use
Comparison of step count assessed using wrist- and hip-worn Actigraph GT3X in free-living conditions in young and older adults
A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code
Validity of estimating the maximal oxygen consumption by consumer wearables: a systematic review with meta-analysis and expert statement of the INTERLIVE network
Step count reliability and validity of five wearable technology devices while walking and jogging in both a free motion setting and on a treadmill
Use of wearables: tracking and retraining in endurance runners
Recommendations for determining the validity of consumer wearable heart rate devices: expert statement and checklist of the INTERLIVE network
Heart rate measures from wrist-worn activity trackers in a laboratory and free-living setting: validation study
Accuracy of 12 wearable devices for estimating physical activity energy expenditure using a metabolic chamber and the doubly labeled water method: validation study
Concurrent heart rate validity of wearable technology devices during trail running
How well do activity monitors estimate energy expenditure? a systematic review and meta-analysis of the validity of current technologies
National physical activity surveillance: users of wearable activity monitors as a potential data source
How humans walk: bout duration, steps per bout, and rest duration
Moving beyond weekly "distance": optimizing quantification of training load in runners
Validity of the training-load concept
Validity of wrist-worn activity trackers for estimating VO2max and energy expenditure
A critical review of consumer wearables, mobile applications, and equipment for providing biofeedback, monitoring stress, and sleep in physically active populations
Pew research center about one-in-five Americans use a smart watch or fitness tracker
Physical activity assessment with accelerometers: an evaluation against doubly labeled water
Daily physical activity assessment with accelerometers: new insights and validation studies
Polar tech support
Comparison between Mother, ACTIGRAPH wGT3X-BT, and a hand tally for measuring steps at various walking speeds under controlled conditions
Validity of sports watches when estimating energy expenditure during running
A longitudinal big data approach for precision health
Methods of monitoring training load and their relationships to changes in fitness and performance in competitive road cyclists
An overview of heart rate variability metrics and norms
A healthy heart is not a metronome: an integrative review of the heart's anatomy and heart rate variability
Accuracy in wrist-worn, sensor-based measurements of heart rate and energy expenditure in a diverse cohort
Comparison of the polar V800 and the garmin forerunner 230 to predict VO2max
Validation of photoplethysmography as a method to detect heart rate during rest and exercise
Accuracy of the wearable activity tracker garmin forerunner 235 for the assessment of heart rate during rest and activity
Wearable-device-measured physical activity and future health risk
SurveyMonkey consumer health and wellness: What it means to brands in 2021
Anomaly Detection framework for wearables data: a perspective review on data concepts
Wearable photoplethysmographic sensors-past and present
Validity of wrist-worn consumer products to measure heart rate and energy expenditure
Worldwide survey of fitness trends for 2019
Thompson WR worldwide survey of fitness trends for 2021. ACSM's health and fitness journal
US patent application publication firstbeat technologies US. Patent
Wahl Y, Düking P, Droszez A et al (2017) Criterion-validity of commercially available physical activity tracker to estimate step count, covered distance and energy expenditure during sports conditions
Effects of fatigue on kinematics and kinetics during overground running: a systematic review
Can we trust the oxygen saturation measured by consumer smartwatches?