Title: Environment Classification for Robotic Leg Prostheses and Exoskeletons using Deep Convolutional Neural Networks
Authors: Laschowski, Brokoslaw; McNally, William; Wong, Alexander; McPhee, John
Date: 2021-06-25
Journal: bioRxiv
DOI: 10.1101/2021.06.24.449600

Robotic leg prostheses and exoskeletons can provide powered locomotor assistance to older adults and/or persons with physical disabilities. However, the current locomotion mode recognition systems being developed for intelligent high-level control and decision-making use mechanical, inertial, and/or neuromuscular data, which inherently have limited prediction horizons (i.e., analogous to walking blindfolded). Inspired by the human vision-locomotor control system, we designed and evaluated an advanced environment classification system that uses computer vision and deep learning to forward predict the oncoming walking environments prior to physical interaction, therein allowing for more accurate and robust locomotion mode transitions. In this study, we first reviewed the development of the ExoNet database – the largest and most diverse open-source dataset of wearable camera images of indoor and outdoor real-world walking environments, which were annotated using a hierarchical labelling architecture. We then trained and tested over a dozen state-of-the-art deep convolutional neural networks (CNNs) on the ExoNet database for large-scale image classification of the walking environments, including: EfficientNetB0, InceptionV3, MobileNet, MobileNetV2, VGG16, VGG19, Xception, ResNet50, ResNet101, ResNet152, DenseNet121, DenseNet169, and DenseNet201. Lastly, we quantitatively compared the benchmarked CNN architectures and their environment classification predictions using an operational metric called NetScore, which balances the image classification accuracy with the computational and memory storage requirements (i.e., important for onboard real-time inference). Although we designed this environment classification system to support the development of next-generation environment-adaptive locomotor control systems for robotic prostheses and exoskeletons, applications could extend to humanoids, autonomous legged robots, powered wheelchairs, and assistive devices for persons with visual impairments.

There are currently hundreds of millions of individuals worldwide with mobility impairments resulting from aging and/or physical disabilities. For example, the number of people with limb amputations in the United States alone is expected to more than double from ~1.6 million in 2005 to ~3.6 million in 2050 (Ziegler-Graham et al., 2008); this increased prevalence is primarily driven by the aging population and higher rates of dysvascular diseases, especially diabetes, among older adults. The World Health Organization recently projected that the number of individuals above 65 years will increase from ~524 million in 2010 to ~1.5 billion in 2050 (Grimmer et al., 2019). The global prevalence of musculoskeletal disorders like osteoarthritis is ~151 million, and neurological conditions like Parkinson's disease, cerebral palsy, and spinal cord injury affect approximately 5.2 million, 16 million, and 3.5 million persons, respectively (Grimmer et al., 2019).
Moreover, the number of individuals needing physical rehabilitation due to immobility and/or disease has recently been exacerbated by the coronavirus disease 2019 (COVID-19) pandemic (Negrini et al., 2021). Mobility impairments, especially those resulting in wheelchair dependency, are often associated with reduced independence and secondary health complications, including osteoporosis, coronary artery disease, obesity, hypertension, and sarcopenia (Grimmer et al., 2019). Fitted with conventional passive assistive devices, individuals with mobility impairments tend to fall more frequently, walk less efficiently, and walk with greater biomechanical asymmetries compared to healthy young adults, which has implications for joint and bone degeneration resulting from disproportionate loading on the unaffected limb (i.e., assuming unilateral impairments) (Tucker et al., 2015). One reason for these gait abnormalities is that many locomotor activities, like climbing stairs, require net positive mechanical work about the lower-limb joints via power generation from the human muscles. However, limb amputation or paralysis removes/diminishes these sources of biological actuation, therein compromising the user's ability to effectively and efficiently perform many locomotor activities of daily living. Fortunately, robotic leg prostheses and exoskeletons can replace the propulsive function of the amputated or impaired biological limbs and allow users to perform locomotor activities that require net positive mechanical work by using motorized hip, knee, and/or ankle joints (Krausz and Hargrove, 2019; Laschowski and Andrysek, 2018; Tucker et al., 2015; Young and Ferris, 2017; Zhang et al., 2019a). However, control of these robotic assistive devices is exceptionally difficult and often considered one of the leading challenges to real-world deployment (Tucker et al., 2015; Young and Ferris, 2017). Most robotic prosthesis and exoskeleton control systems for human locomotion involve a hierarchical architecture, including high, mid, and low-level controllers (Tucker et al., 2015; Young and Ferris, 2017) (see Figure 1). The high-level controller is responsible for state estimation and inferring the user's locomotor intent (e.g., climbing stairs, sitting down, or level-ground walking). The mid-level controller converts the locomotor activity into mode-specific reference trajectories, often using dynamic equations of the human-robot system; this level of control consists of individual finite-state machines with discrete mechanical impedance parameters like stiffness and damping coefficients, which are heuristically tuned for different locomotor activities to generate the desired actuator joint torques (i.e., the desired device state). Tuning these system parameters can be time-consuming as the number of parameters increases with the number of states per locomotor activity, the number of activity modes, and the number of actuated joints. The low-level controller uses standard controls engineering algorithms like proportional-integral-derivative (PID) control to calculate the error between the measured and desired device states and command the robotic actuators to minimize the error using reference tracking and closed-loop feedback control (Krausz and Hargrove, 2019; Tucker et al., 2015; Young and Ferris, 2017; Zhang et al., 2019a). Nonrhythmic movements and high-level transitions between different locomotor activities remain significant challenges.
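To make the hierarchy concrete, the sketch below shows, in Python, how a mid-level finite-state impedance controller and a low-level PID tracking loop might fit together. The locomotion modes, phases, impedance parameters, and controller gains are hypothetical placeholders for illustration only, not values from any specific device.

```python
# Minimal sketch of hierarchical prosthesis/exoskeleton control.
# All state names and numerical values are hypothetical illustrations.

IMPEDANCE_PARAMS = {
    # locomotion mode -> phase -> (stiffness, damping, equilibrium angle)
    "level_ground": {"stance": (120.0, 4.0, 0.05), "swing": (25.0, 1.5, 0.60)},
    "stair_ascent": {"stance": (180.0, 6.0, 0.30), "swing": (30.0, 2.0, 0.90)},
}

def mid_level_torque(mode, phase, joint_angle, joint_velocity):
    """Impedance control law: tau = -k*(theta - theta_eq) - b*theta_dot."""
    k, b, theta_eq = IMPEDANCE_PARAMS[mode][phase]
    return -k * (joint_angle - theta_eq) - b * joint_velocity

class PID:
    """Low-level reference-tracking controller for the joint actuator."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, desired, measured):
        error = desired - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# The high-level controller (intent/environment recognition) selects the mode;
# the mid-level impedance law produces a desired torque; the low-level PID tracks it.
desired_tau = mid_level_torque("stair_ascent", "stance", joint_angle=0.25, joint_velocity=-0.1)
low_level = PID(kp=8.0, ki=0.5, kd=0.05, dt=0.001)
motor_command = low_level.update(desired=desired_tau, measured=0.0)  # measured actuator torque
print(round(desired_tau, 2), round(motor_command, 3))
```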
Inaccurate and/or delayed decisions could result in loss-of-balance and injury, which can be especially problematic when involving stairs. Switching between different mid-level controllers is supervised by the high-level controller, which infers the locomotor intent using either sensor data (i.e., for devices under research and development) or direct communication from the user (i.e., for commercially available devices) (see Figure 1). Commercial devices like the Össur Power Knee prosthesis and the ReWalk and Indego powered lower-limb exoskeletons require the users to perform exaggerated movements or use hand controls to manually switch between locomotion modes (Tucker et al., 2015; Young and Ferris, 2017). Although highly accurate in communicating the user's locomotor intent, such manual high-level control and decision-making can be time-consuming, inconvenient, and cognitively demanding (Karacan et al., 2020). Researchers have thus been working on developing automated locomotion mode recognition systems using pattern recognition algorithms and data from wearable sensors like inertial measurement units (IMUs) and surface electromyography (EMG), therein shifting the high-level control burden from the user to an intelligent controller (Tucker et al., 2015; Young and Ferris, 2017) (see Figure 2). The ideal automated high-level controller would estimate the individual states and interactions within the overall human-robot-environment system. Mechanical sensors embedded in robotic prostheses and exoskeletons (e.g., potentiometers, pressure sensors, magnetic encoders, and strain gauge load cells) can be used for state estimation by measuring the joint angles and angular velocities, and the interaction forces and/or torques between the human and device, and between the device and environment. For example, Varol and colleagues (2010) used onboard mechanical sensors and a Gaussian mixture model classification algorithm for automated intent recognition and high-level control. IMU sensors can measure angular velocities, accelerations, and direction using an onboard gyroscope, accelerometer, and magnetometer, respectively. Although mechanical and inertial sensors allow for fully integrated control systems (i.e., no additional instrumentation or wiring apart from the device need be worn), these sensor technologies can only respond to the user's movements. In contrast, the electrical potentials of biological muscles, as recorded using surface EMG, precede movement initiation and thus could predict locomotion mode transitions with small prediction horizons. Using only surface EMG and a linear discriminant analysis (LDA) pattern recognition algorithm, Huang et al. (2009) differentiated between seven locomotion modes with 90% classification accuracy. In addition to locomotion mode recognition, EMG signals could be used for proportional myoelectric control (Nasr et al., 2021). Although researchers have explored using brain-machine interfaces for direct neural control of robotic prostheses and exoskeletons (i.e., using either implanted or electroencephalography-based systems) (Lebedev and Nicolelis, 2017), these methods are still highly experimental and infrequently used for locomotor applications.
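The following is a minimal sketch of the kind of windowed EMG pattern-recognition pipeline described above (e.g., LDA as in Huang et al., 2009). The EMG data, channel count, window length, and feature set here are synthetic placeholders, not a reproduction of that study.

```python
# Sketch of locomotion-mode recognition from windowed surface-EMG features
# using linear discriminant analysis (LDA). Data are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def emg_features(window):
    """Common time-domain EMG features for one analysis window (channels x samples)."""
    mav = np.mean(np.abs(window), axis=1)                       # mean absolute value
    wl = np.sum(np.abs(np.diff(window, axis=1)), axis=1)        # waveform length
    zc = np.sum(np.diff(np.sign(window), axis=1) != 0, axis=1)  # zero crossings
    return np.concatenate([mav, wl, zc])

# Synthetic dataset: 7 locomotion modes, 8 EMG channels, 200-sample windows.
n_modes, n_windows, n_channels, n_samples = 7, 300, 8, 200
X, y = [], []
for mode in range(n_modes):
    for _ in range(n_windows):
        window = rng.normal(loc=0.1 * mode, scale=1.0, size=(n_channels, n_samples))
        X.append(emg_features(window))
        y.append(mode)
X, y = np.array(X), np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(f"LDA classification accuracy: {clf.score(X_test, y_test):.1%}")
```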
Fusing information from mechanical and/or inertial sensors with surface EMG, known as neuromuscular-mechanical data fusion, has been shown to improve the automated locomotion mode recognition accuracies and decision times compared to implementing either system individually (Du et al., 2012; Huang et al., 2011a; 2011b; Krausz and Hargrove, 2021; Liu et al., 2016; Spanias et al., 2015; Wang et al., 2013). Huang et al. (2011b) first demonstrated such improvements in classification predictions across six locomotion modes using a support vector machine (SVM) classifier, surface EMG signals, and measured ground reaction forces and torques on a robotic prosthesis. However, neuromuscular-mechanical data are user-dependent, therein requiring time-consuming experiments to amass individual datasets, and surface EMG sensors require calibration and are susceptible to fatigue, motion artifacts, changes in electrode-skin conductivity, and crosstalk between adjacent muscles (Tucker et al., 2015; Young and Ferris, 2017). EMG signals may also be inaccessible in users with high amputations where insufficient neural content is available from the residual limb. Despite the developments in automated intent recognition using mechanical, inertial, and/or neuromuscular signals, further improvements in system performance are desired for safe and robust locomotor control, especially when involving stair environments. Taking inspiration from the human vision-locomotor control system, supplementing neuromuscular-mechanical data with information about the oncoming walking environment could improve the automated high-level control performance (see Figure 2). Environment sensing would precede modulation of the user's muscle activations and/or walking biomechanics (i.e., having an extended prediction horizon), therein allowing for more accurate and real-time transitions between different locomotion modes by minimizing the high-level decision space. During natural human locomotion, the central nervous system acquires state information from biological sensors (e.g., the eyes) through ascending pathways, which is subsequently used to actuate and control the musculoskeletal system through feedforward efferent commands (Patla, 1997; Tucker et al., 2015). However, these control loops are compromised in persons using assistive devices due to limitations in the human-machine data communication. Environment sensing and classification could artificially restore these control loops for automated high-level control. In addition to supplementing the locomotion mode recognition system, environment information could be used to adapt the mid-level reference trajectories (e.g., increasing the actuator joint torques for toe clearance corresponding to an obstacle height) (Zhang et al., 2020), to plan optimal paths (e.g., identifying opportunities for energy recovery) (Laschowski et al., 2019a; 2021a), and to vary foot placement based on the walking surface. The earliest studies fusing neuromuscular-mechanical data with environment information for prosthetic leg control came from Huang and colleagues (Du et al., 2012; Huang et al., 2011a; Liu et al., 2016; Wang et al., 2013; Zhang et al., 2011). Different walking environments were statistically modelled as prior probabilities using the principle of maximum entropy and incorporated into the discriminant function of an LDA classification algorithm.
Rather than assuming equal prior probabilities as usually done, they simulated different walking environments by adjusting the prior probabilities of each class, which allowed their automated locomotion mode recognition system to dynamically adapt to different environments. For instance, when approaching an incline staircase, the prior probability of performing stair ascent would progressively increase while the prior probabilities of other locomotion modes would likewise decrease. Their locomotion mode recognition system with these adaptive, terrain-informed prior probabilities significantly outperformed (i.e., 95.5% classification accuracy) their system based on neuromuscular-mechanical data alone with equal prior probabilities (i.e., 90.6% accuracy) (Wang et al., 2013). These seminal papers showed 1) how environment information could be incorporated into an automated high-level controller, 2) that including such information could improve the locomotion mode recognition accuracies and decision times, and 3) that the controller could be relatively robust to noisy and imperfect environment predictions such that neuromuscular-mechanical data dominated the high-level decision making (Du et al., 2012; Huang et al., 2011a; Liu et al., 2016; Wang et al., 2013). Several researchers have explored using wearable radar detectors (Kleiner et al., 2018) and laser rangefinders (Liu et al., 2016; Wang et al., 2013; Zhang et al., 2011) for active environment sensing. Unlike camera-based systems, these sensors circumvent the need for computationally expensive image processing and classification. Radar can uniquely measure distances through non-conducting materials like clothing and is invariant to outdoor lighting conditions and surface textures. Using a leg-mounted radar detector, Herr's research group (Kleiner et al., 2018) measured stair distances and heights with 1.5 cm and 0.34 cm average accuracies, respectively, at distances up to 6.25 m. However, radar reflection signatures usually struggle with source separation of multiple objects and have relatively low resolution. Huang and colleagues (Liu et al., 2016; Wang et al., 2013; Zhang et al., 2011) developed a waist-mounted system with an IMU and laser rangefinder to reconstruct the geometry of the oncoming walking environment at ranges between 300 and 10,000 mm. Environmental features like the terrain height, distance, and slope were used for classification via heuristic rule-based thresholds. The system achieved 98.1% classification accuracy (Zhang et al., 2011). While simple and effective, their system required subject-specific calibration (e.g., for the device mounting height) and provided only a single distance measurement. Compared to radar and laser rangefinders, cameras can provide more detailed information about the field-of-view and detect physical obstacles and terrain changes in peripheral locations (example shown in Figure 3).
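The sketch below illustrates, with made-up numbers, how terrain-dependent prior probabilities could reweight the posterior of a locomotion-mode classifier in the spirit of the adaptive-prior approach described above. The mode names, likelihoods, and priors are illustrative assumptions only.

```python
# Sketch: adjusting locomotion-mode posteriors with terrain-dependent priors.
# All probabilities are illustrative placeholders.
import numpy as np

MODES = ["level_walk", "stair_ascent", "stair_descent", "ramp_ascent", "ramp_descent"]

def posterior(likelihood, prior):
    """Bayes rule: p(mode | data) is proportional to p(data | mode) * p(mode)."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

# Likelihoods from a neuromuscular-mechanical classifier for the current window.
likelihood = np.array([0.35, 0.30, 0.10, 0.15, 0.10])

# Case 1: equal priors (no environment information).
equal_prior = np.full(len(MODES), 1.0 / len(MODES))

# Case 2: environment sensing reports an approaching incline staircase,
# so the prior on stair ascent is increased and the others reduced.
terrain_prior = np.array([0.15, 0.60, 0.05, 0.15, 0.05])

for name, prior in [("equal priors", equal_prior), ("terrain-informed priors", terrain_prior)]:
    post = posterior(likelihood, prior)
    decision = MODES[int(np.argmax(post))]
    print(f"{name}: decision = {decision}, posterior = {np.round(post, 2)}")
```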
Most environment recognition systems have used RGB cameras (Da Silva et al., 2020; Diaz et al., 2018; Khademi and Simon, 2019; Krausz and Hargrove, 2015; Laschowski et al., 2019b; 2020b; 2021b; Novo-Torres et al., 2019; Zhong et al., 2020) and/or 3D depth cameras (Hu et al., 2018; Krausz et al., 2015; Krausz and Hargrove, 2021; Massalin et al., 2018; Varol and Massalin, 2016; Zhang et al., 2019b; 2019c; 2019d; 2020) mounted on the chest (Krausz et al., 2015; Laschowski et al., 2019b; 2020b; 2021b), waist (Khademi and Simon, 2019; Krausz et al., 2019; Krausz and Hargrove, 2021; Zhang et al., 2019d), or lower-limbs (Da Silva et al., 2020; Diaz et al., 2018; Massalin et al., 2018; Varol and Massalin, 2016; Zhang et al., 2019b; 2019c; 2020; Zhong et al., 2020) (see Table 1). Few studies have adopted head-mounted cameras for biomimicry (Novo-Torres et al., 2019; Zhong et al., 2020). Zhong et al. (2020) recently compared the effects of different wearable camera positions on classification performance. Compared to glasses, their leg-mounted camera more accurately detected closer walking environments but struggled with incline stairs, often capturing only 1-2 steps. Although the glasses could detect further-away environments, the head-mounted camera also captured irrelevant features like the sky, which decreased the overall image classification accuracy. The glasses also struggled with detecting decline stairs and had larger standard deviations in classification predictions due to head movement. For image classification, researchers have traditionally used statistical pattern recognition or machine learning algorithms like support vector machines, which require hand-engineering to develop feature extractors (Da Silva et al., 2020; Diaz et al., 2018; Hu et al., 2018; Krausz et al., 2015; Krausz and Hargrove, 2021; Massalin et al., 2018; Varol and Massalin, 2016) (see Table 2). Hargrove's research group (Krausz et al., 2015) used standard image processing and rule-based thresholds to detect convex and concave edges and vertical and horizontal planes for stair recognition. Although their algorithm achieved 98.8% classification accuracy, the computations were time-consuming (i.e., ~8 seconds/frame) and the system was evaluated using only five sampled images. In another example, Huang and colleagues (Diaz et al., 2018) achieved 86% image classification accuracy across six environment classes using SURF features and a bag-of-words classifier. Varol's research group used support vector machines for classifying depth images, which mapped extracted features into a high-dimensional space and separated samples into different classes by constructing optimal hyperplanes with maximum margins (Massalin et al., 2018; Varol and Massalin, 2016). Their system achieved 94.1% classification accuracy across five locomotion modes using a cubic kernel SVM and no dimension reduction (Massalin et al., 2018). The average computation time was ~14.9 ms per image. Although SVMs are effective in high-dimensional spaces and offer good generalization (i.e., robustness to overfitting), these algorithms require manual selection of kernel functions and statistical features, which is time-consuming and suboptimal.
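For illustration, the sketch below trains a cubic-kernel (degree-3 polynomial) SVM on flattened depth-image feature vectors, analogous in spirit to the depth-image classification described above. The synthetic depth images, class structure, and hyperparameters are placeholders, not a reproduction of that work.

```python
# Sketch: locomotion-environment classification of depth images with a
# cubic-kernel support vector machine. Data are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

n_classes, images_per_class, image_size = 5, 200, (32, 32)
X, y = [], []
for label in range(n_classes):
    for _ in range(images_per_class):
        depth = rng.normal(loc=1.0 + 0.3 * label, scale=0.2, size=image_size)
        X.append(depth.ravel())   # flatten the depth image into a feature vector
        y.append(label)
X, y = np.array(X), np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# degree=3 polynomial kernel corresponds to the "cubic SVM" described above.
svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0))
svm.fit(X_train, y_train)
print(f"Cubic-kernel SVM accuracy: {svm.score(X_test, y_test):.1%}")
```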
The latest generation of environment recognition systems has used convolutional neural networks (CNNs) for image classification (Khademi and Simon, 2019; Laschowski et al., 2019b; 2021b; Novo-Torres et al., 2019; Rai and Rombokas, 2018; Zhang et al., 2019b; 2019c; 2019d; 2020; Zhong et al., 2020) (see Table 3 and Figure 4). One of the earliest publications came from Laschowski and colleagues (2019b), who designed and trained a 10-layer convolutional neural network using five-fold cross-validation, which differentiated between three environment classes with 94.9% classification accuracy. The convolutional neural network from Simon's group (Khademi and Simon, 2019) achieved 99% classification accuracy across three environment classes using transfer learning of pretrained weights. Although CNNs typically outperform SVMs for image classification and circumvent the need for manual feature engineering (LeCun et al., 2015), deep learning requires significant and diverse training data to prevent overfitting and promote generalization. Deep learning has become pervasive in computer vision ever since AlexNet (Krizhevsky et al., 2012); however, previous environment recognition systems for robotic leg prostheses and exoskeletons have relied on relatively small-scale and private image datasets (Laschowski et al., 2020a). Motivated by these limitations in the literature, we developed and evaluated an advanced environment classification system that uses computer vision and deep learning. In this study, we 1) reviewed the development of the "ExoNet" database – the largest and most diverse open-source dataset of wearable camera images of human walking environments; 2) trained and tested over a dozen state-of-the-art deep convolutional neural networks on the ExoNet database for large-scale image classification of the different walking environments; and 3) quantitatively compared the benchmarked CNN architectures and their environment classification predictions using an operational metric called "NetScore", which balances the image classification accuracy with the computational and memory storage requirements (i.e., important for onboard real-time inference). The overall objective of this research is to support the development of next-generation environment-adaptive locomotion mode recognition systems for robotic prostheses and exoskeletons.

One participant, who was not wearing an assistive device, was instrumented with a lightweight smartphone camera system (iPhone XS Max) (see Figure 5). Compared to lower-limb systems (Da Silva et al., 2020; Diaz et al., 2018; Hu et al., 2018; Kleiner et al., 2018; Massalin et al., 2018; Rai and Rombokas, 2018; Varol and Massalin, 2016; Zhang et al., 2011; 2019b; 2019c; 2020), our chest-mounted camera can provide more stable video recording and allow users to wear pants and dresses without obstructing the sampled field-of-view. The chest-mount height was ~1.3 m from the ground when the participant stood upright. The smartphone has two 12-megapixel RGB rear-facing cameras and one 7-megapixel front-facing camera. The front and rear cameras provide 1920×1080 and 1280×720 video recording at 30 Hz, respectively. The smartphone weighs 0.21 kg and has an onboard rechargeable lithium-ion battery, 512 GB of memory storage, and a 64-bit ARM-based integrated circuit (Apple A12 Bionic) with a six-core CPU and four-core GPU; these hardware specifications can theoretically support onboard deep learning for real-time environment classification. The relatively lightweight and unobtrusive nature of the wearable camera system allowed for unimpeded human locomotion.
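As a rough illustration of the kind of onboard, frame-by-frame inference such hardware could support, the sketch below runs a single camera frame through a lightweight pretrained CNN backbone in TensorFlow/Keras. The 12-class output head is randomly initialized here as a placeholder (untrained), and the frame is random noise standing in for a wearable camera image.

```python
# Sketch: single-frame environment classification with a lightweight CNN.
# The 12-class head is randomly initialized (placeholder, untrained).
import numpy as np
import tensorflow as tf

NUM_CLASSES = 12  # ExoNet hierarchical labelling architecture

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet", pooling="avg")
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# A stand-in for one 224x224 RGB frame from the chest-mounted camera.
frame = np.random.randint(0, 255, size=(224, 224, 3), dtype=np.uint8)
inputs = tf.keras.applications.mobilenet_v2.preprocess_input(
    frame[np.newaxis].astype(np.float32))

class_scores = model.predict(inputs, verbose=0)[0]
print("Predicted environment class index:", int(np.argmax(class_scores)))
```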
Ethical review and approval were not required for this research in accordance with the University of Waterloo Office of Research Ethics. Whereas most environment recognition systems have been limited to controlled indoor environments and/or prearranged walking circuits (Du et al., 2012; Hu et al., 2018; Khademi and Simon, 2019; Kleiner et al., 2018; Krausz et al., 2015; Krausz and Hargrove, 2021; Liu et al., 2016; Rai and Rombokas, 2018; Wang et al., 2013; Zhang et al., 2011; 2019b; 2019c; 2019d), our participant walked around unknown outdoor and indoor real-world environments while collecting images with occlusions and intraclass variations (see Figure 6). We collected data at various times throughout the day to include different lighting conditions. Similar to human gaze fixation during walking (Li et al., 2019), the field-of-view was 1-5 m ahead of the participant, therein showing the oncoming walking environment rather than the ground directly underneath the subject's feet. This operating range allowed for detecting physical obstacles and terrain changes within several walking strides; research has shown that lower-limb amputees tend to allocate higher visual attention to oncoming terrain changes compared to healthy controls (Li et al., 2019). Our camera's pitch angle slightly differed between data collection sessions. Images were sampled at 30 Hz with 1280×720 resolution. We recorded over 52 hours of video, amounting to ~5.6 million images. The same environment was never sampled twice to maximize diversity in the dataset. Images were collected during the summer, fall, and winter seasons to capture different weathered surfaces like snow, grass, and multicolored leaves. The image database, which we named "ExoNet", was uploaded to the IEEE DataPort repository and is publicly available for download (Laschowski et al., 2020b). The file size of the uncompressed videos is ~140 GB. Given the subject's self-selected walking speed, there were relatively minimal differences between consecutive images sampled at 30 Hz. We therefore downsampled and labelled the images at 5 frames/second to minimize the demands of manual annotation and increase the diversity in image appearances. However, for online environment-adaptive control of robotic leg prostheses and exoskeletons, higher sampling rates would be more advantageous for accurate and robust automated locomotion mode transitions. Similar to the ImageNet dataset (Deng et al., 2009), the ExoNet database was human-annotated using a hierarchical labelling architecture. Images were mainly labelled according to common high-level locomotion modes, rather than from a purely computer-vision perspective. For instance, images of level-ground environments showing either pavement or grass were not differentiated since both surfaces would be assigned the same level-ground walking locomotion mode. In contrast, computer vision researchers might label these different surface textures as separate classes. Approximately 923,000 images were manually annotated using a 12-class hierarchical labelling architecture (see Figure 5).
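For reference, downsampling the 30 Hz video to roughly 5 frames per second for labelling amounts to keeping every sixth frame. A minimal OpenCV sketch is shown below; the video path and output file naming are placeholders.

```python
# Sketch: downsample a 30 Hz video to ~5 frames/second by keeping every 6th frame.
# The video path and output naming are placeholders.
import cv2

VIDEO_PATH = "walking_session.mp4"   # hypothetical recording
KEEP_EVERY = 6                       # 30 Hz / 6 = 5 frames per second

capture = cv2.VideoCapture(VIDEO_PATH)
frame_index, saved = 0, 0
while True:
    ok, frame = capture.read()
    if not ok:
        break
    if frame_index % KEEP_EVERY == 0:
        cv2.imwrite(f"frame_{saved:06d}.png", frame)
        saved += 1
    frame_index += 1
capture.release()
print(f"Saved {saved} images for annotation.")
```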
The dataset included: 31,628 images of "incline stairs transition wall/door" (I-T-W); 11,040 images of "incline stairs transition level-ground" (I-T-L); 17,358 images of "incline stairs steady" (I-S); and 28,677 images of "decline stairs transition level-ground" (D-T-L); the remaining images were distributed across the other classes of the 12-class hierarchy. Taking inspiration from (Du et al., 2012; Huang et al., 2011a; Khademi and Simon, 2019; Liu et al., 2016; Wang et al., 2013), our labelling architecture included both "steady" (S) and "transition" (T) states (see Figure 7). A steady state describes an environment where an exoskeleton or prosthesis user would continue to perform the same locomotion mode (e.g., an image showing only level-ground terrain). In contrast, a transition state describes an environment where an exoskeleton or prosthesis high-level controller might switch between locomotion modes (e.g., an image showing both level-ground terrain and incline stairs). Manually labelling these transition states was relatively subjective. For instance, an image showing level-ground terrain was labelled "level-ground transition incline stairs" (L-T-I) when an incline staircase came approximately within the sampled field-of-view. Similar labelling principles were applied to transitions to other conditions. Although the ExoNet database was labelled by one designated researcher, consistently determining the exact video frame where an environment would switch between steady and transition states was challenging; Huang et al. (2011a) reported experiencing similar difficulties. We used convolutional neural networks for the image classification. Unlike previous studies that used support vector machines with hand-engineered features (Massalin et al., 2018; Varol and Massalin, 2016), deep learning replaces manually extracted features with multilayer networks that can automatically and efficiently learn the optimal image features from the training data. Generally speaking, a CNN architecture contains multiple stacked convolutional and pooling layers with decreasing spatial resolutions and increasing numbers of feature maps. Starting with an input 3D tensor, the convolutional layers perform convolution operations (i.e., dot products) between the inputs and convolutional filters. The first few layers extract relatively general features, like edges, while deeper layers learn more problem-dependent features. The resulting feature maps are passed through a nonlinear activation function, typically a rectified linear unit (ReLU). The pooling layers spatially downsample the feature maps to reduce the computational effort by aggregating neighboring elements using either maximum or average values. The CNN architecture concludes with one or more fully connected layers and a softmax loss function, which estimates the probability distribution (i.e., scores) over the labelled classes. Previous CNN-based environment recognition systems (see Table 3) have used supervised learning such that the differences between the predicted and labelled class scores are computed and the learnable network parameters (i.e., weights) are optimized to minimize the loss function via backpropagation and stochastic gradient descent. After training, the CNN performs inference on previously unseen data to evaluate the generalizability of the learned parameters. These artificial neural networks are loosely inspired by biological neural networks. During data preprocessing, the images were cropped to an aspect ratio of 1:1 and downsampled to 256x256 using bilinear interpolation.
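The generic architecture just described (stacked convolution and pooling layers with ReLU activations, followed by fully connected layers and a softmax output) can be written down in a few lines of TensorFlow/Keras. The toy network below is purely schematic and is not one of the benchmarked architectures; layer sizes are arbitrary.

```python
# Schematic CNN matching the generic description: convolution + ReLU,
# spatial pooling, fully connected layers, and a softmax over class scores.
import tensorflow as tf

NUM_CLASSES = 12

toy_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu",
                           input_shape=(224, 224, 3)),            # general features (edges)
    tf.keras.layers.MaxPooling2D(pool_size=2),                    # spatial downsampling
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # more task-specific features
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),      # class score distribution
])

toy_cnn.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
toy_cnn.summary()
```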
Random crops of 224x224 were used as inputs to the neural networks; this method of data augmentation helped further increase the sample diversity. The final densely connected layer of each CNN architecture was modified by setting the number of output channels equal to the number of environment classes in the ExoNet database. We used a softmax loss function to predict the individual class scores. The labelled images were split into training (89.5%), validation (3.5%), and testing (7%) sets, the proportions of which are consistent with ImageNet (Deng et al., 2009), which is of comparable size. We experimented with transfer learning of pretrained weights from ImageNet but found no additional performance benefit. Dropout regularization was applied before the final dense layer to prevent overfitting during training such that the learnable weights were randomly dropped (i.e., activations set to zero) during each forward pass at a rate of 0.5. Images were also randomly flipped horizontally during training to increase stochasticity and promote generalization. We trained each CNN architecture for 40 epochs using a batch size and initial learning rate of 128 and 0.001, respectively; these hyperparameters were experimentally tuned to maximize the performance on the validation set (see Figure 8). We explored different combinations of batch sizes of 32, 64, 128, and 256; epochs of 20, 40, and 60; dropout rates of 0, 0.2, and 0.5; and initial learning rates of 0.01, 0.001, 0.0001, and 0.00001. The learning rate was reduced during training using a cosine decay schedule (Loshchilov and Hutter, 2016). We calculated the sparse categorical cross-entropy loss between the labelled and predicted classes and used the Adam optimization algorithm (Kingma and Ba, 2015), which computes backpropagated gradients using momentum and an adaptive learning rate, to update the learnable weights and minimize the loss function. During testing, we used a single central crop of 224x224. Training and inference were both performed on a Tensor Processing Unit (TPU) version 3-8 by Google Cloud; these customized chips allow for accelerated CNN computations (i.e., matrix multiplications and additions) compared to traditional computing devices. To facilitate onboard real-time inference for environment-adaptive control of robotic leg prostheses and exoskeletons, the ideal convolutional neural network would achieve high classification accuracy with minimal parameters, computing operations, and inference time. Motivated by these design principles, we quantitatively evaluated and compared the benchmarked CNN architectures (Ɲ) and their environment classification predictions on the ExoNet database using an operational metric called "NetScore" (Wong, 2018):

W(Ɲ) = 20 log₁₀ [ a(Ɲ)^α / ( p(Ɲ)^β · m(Ɲ)^γ ) ]

where a(Ɲ) is the image classification accuracy during inference (0-100%), p(Ɲ) is the number of parameters expressed in millions, m(Ɲ) is the number of multiply-accumulates expressed in billions, and α, β, and γ are coefficients that control the effects of the classification accuracy, and the architectural and computational complexities, on the NetScore (W), respectively. We set the coefficients to {α = 2, β = 0.5, γ = 0.5} to better emphasize the classification accuracy while partially considering the parameters and computing operations, since neural networks with low accuracy are less practical, regardless of the size and speed. Note that the NetScore does not explicitly account for inference time.
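The NetScore defined above can be computed directly, as in the short sketch below with α = 2, β = 0.5, and γ = 0.5. The accuracy, parameter count, and multiply-accumulate values used here are arbitrary placeholders rather than results from the benchmarked networks.

```python
# NetScore (Wong, 2018): balances classification accuracy against architectural
# and computational complexity. Input values below are placeholders.
import math

def netscore(accuracy_pct, params_millions, macs_billions, alpha=2.0, beta=0.5, gamma=0.5):
    """W = 20 * log10( a^alpha / (p^beta * m^gamma) )."""
    return 20.0 * math.log10(
        accuracy_pct ** alpha / (params_millions ** beta * macs_billions ** gamma))

# Example: a hypothetical network with 72% accuracy, 3.5 M parameters, 0.6 B MACs.
print(round(netscore(72.0, 3.5, 0.6), 1))
```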
The number of parameters p(Ɲ) and multiply-accumulates m(Ɲ) are assumed to be representative of the architectural and computational complexities, respectively, both of which are inversely proportional to the NetScore.

4 Results

Table 4 summarizes the benchmarked CNN architectures (i.e., the number of parameters and computing operations) and their environment classification performances on the ExoNet database (i.e., prediction accuracies and inference times), including the NetScores. The EfficientNetB0 network achieved the highest image classification accuracy (Ca) during inference (i.e., 73.2% accuracy), that being the percentage of true positives (i.e., 47,265 images) out of the total number of images in the testing set (i.e., 64,568 images), i.e., Ca = (true positives / total images) × 100%. In contrast, the VGG19 network produced the least accurate predictions, with an overall image classification accuracy of 69.2%. The range of accuracies across the different CNN architectures was thus relatively small (i.e., a maximum arithmetic difference of 4 percentage points). We observed relatively weak correlations between both the number of parameters (i.e., Pearson r = -0.3) and computing operations (i.e., Pearson r = -0.59) and the classification accuracies on the ExoNet database across the different CNN architectures. Although the VGG16 and VGG19 networks involved the largest number of computations (i.e., ~15.4 and ~19.5 billion multiply-accumulates, respectively), they resulted in the fastest inference times (i.e., on average 1.4 ms and 1.6 ms per image). For comparison, the DenseNet201 network had 72.1% and 78% fewer operations than VGG16 and VGG19, respectively, but was 364% and 306% slower. These results concur with Ding et al. (2021), who recently showed 1) that the number of computing operations does not explicitly reflect the actual inference speed, and 2) that VGG-style architectures can run faster and more efficiently on CNN computing devices compared to more complicated architectures like DenseNets due to their relatively simple designs (i.e., consisting of basic convolutions and ReLU activations). Our inference times were calculated on the Cloud TPU using a batch size of 8. Note that the relative inference speeds between the benchmarked CNN architectures (i.e., their ordering from fastest to slowest) may differ across different computing devices given that some platforms are designed to accelerate certain operations better than others (e.g., cloud computing vs. those designed for mobile and embedded systems). Tables 5-17 show the multiclass confusion matrix for each CNN architecture. The matrix columns and rows are the predicted and labelled classes, respectively. The diagonal elements are the classification accuracies for each environment class during inference, known as true positives, and the nondiagonal elements are the misclassification percentages; darker shades represent higher classification accuracies. Despite slight numerical differences, most of the benchmarked CNN architectures displayed a similar interclass trend. The networks most accurately predicted the "level-ground transition wall/door" (L-T-W) class (i.e., average Ca = 84.3% ± 2.1%), followed by the "level-ground steady" (L-S) class with an average accuracy of 77.3% ± 2.1% and the "decline stairs transition level-ground" (D-T-L) class with an average accuracy of 76.6% ± 2.4%.
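The per-class accuracies just reported are the diagonal entries of a row-normalized confusion matrix (rows: labelled classes; columns: predicted classes). A minimal sketch of that computation is shown below; the label and prediction arrays are random placeholders, not ExoNet results.

```python
# Sketch: row-normalized multiclass confusion matrix, where the diagonal gives
# the per-class (true-positive) accuracy. Labels/predictions are placeholders.
import numpy as np

NUM_CLASSES = 12
rng = np.random.default_rng(2)
y_true = rng.integers(0, NUM_CLASSES, size=1000)
y_pred = np.where(rng.random(1000) < 0.7, y_true, rng.integers(0, NUM_CLASSES, size=1000))

confusion = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=float)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1                      # rows: labelled class, columns: predicted class

row_totals = confusion.sum(axis=1, keepdims=True)
confusion_pct = 100.0 * confusion / np.maximum(row_totals, 1.0)

per_class_accuracy = np.diag(confusion_pct)   # true positives per labelled class (%)
overall_accuracy = 100.0 * (y_true == y_pred).mean()
print(np.round(per_class_accuracy, 1))
print(f"Overall accuracy: {overall_accuracy:.1f}%")
```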
These results could be attributed to the class imbalances among the training data (i.e., there were significantly more images of L-T-W environments compared to other classes). However, some classes with limited images showed relatively good classification performance. For instance, the "incline stairs transition level-ground" (I-T-L) class comprised only 1.2% of the ExoNet database but achieved 71.2% ± 5.1% average classification accuracy. The least accurate predictions were for the environment classes that contained "other" features, i.e., the "wall/door transition other" (W-T-O) class with an average accuracy of 38.3% ± 3.5% and the "level-ground transition other" (L-T-O) class with an average accuracy of 46.8% ± 2.4%; these results could be attributed to the increased noise and randomness of the environments and/or objects in the images.

In this study, we developed an advanced environment classification system for robotic leg prostheses and exoskeletons using computer vision and deep learning. Taking inspiration from the human vision-locomotor control system, environment sensing and classification could improve the automated high-level controller by reconstructing the oncoming walking environments prior to physical interaction, therein allowing for more accurate and real-time transitions between different locomotion modes. However, small-scale image datasets have impeded the development of environment classification algorithms, and private datasets have impeded comparisons between different research groups. We therefore developed ExoNet, the largest open-source dataset of wearable camera images of walking environments. Unparalleled in scale and diversity, ExoNet contains ~5.6 million images of indoor and outdoor real-world walking environments, of which ~923,000 images were annotated using a 12-class hierarchical labelling architecture. We then trained and tested over a dozen state-of-the-art deep convolutional neural networks on the ExoNet database for large-scale image classification of the walking environments, including: EfficientNetB0, InceptionV3, MobileNet, MobileNetV2, VGG16, VGG19, Xception, ResNet50, ResNet101, ResNet152, DenseNet121, DenseNet169, and DenseNet201. This research provides a benchmark and reference for future comparisons. Although we designed our environment classification system for robotic prostheses and exoskeletons, applications could extend to humanoids, autonomous legged robots, powered wheelchairs, and assistive devices for persons with visual impairments. The use of deep convolutional neural networks was made possible by the ExoNet database. In addition to being open-source, the large scale and diversity of ExoNet significantly distinguish it from previous research (see Table 1). The ExoNet database contains ~923,000 labelled images. In comparison, the previous largest dataset, developed by Varol's research group (Massalin et al., 2018), contained ~402,000 images. Whereas previous datasets included fewer than six environment classes, the most common being level-ground terrain and incline and decline stairs, the ExoNet database contains a 12-class hierarchical labelling architecture; these differences have important practical implications since deep learning requires significant and diverse training data to prevent overfitting and promote generalization (LeCun et al., 2015).
Furthermore, our image quality (1280×720) is considerably higher than that of previous datasets (e.g., 224x224 and 320x240). Unlike ExoNet, however, many previous environment recognition systems used 3D depth cameras rather than RGB images (Krausz and Hargrove, 2021; Massalin et al., 2018; Varol and Massalin, 2016; Zhang et al., 2019b; 2019c; 2019d; 2020). These range imaging systems work by emitting infrared light and measuring the light time-of-flight between the camera and the oncoming walking environment to calculate distance. Depth sensing can uniquely extract environmental features like step height and width, which can improve the mid-level control of robotic prostheses and exoskeletons (e.g., increasing actuator joint torques to assist with steeper stairs). Despite these benefits, depth measurement accuracy typically degrades in outdoor lighting conditions (e.g., sunlight) and with increasing distance (Krausz and Hargrove, 2019; Zhang et al., 2019a). Consequently, most environment recognition systems using depth cameras have been tested in controlled indoor environments and/or have had limited capture volumes (i.e., 1-2 m of maximum range imaging) (Hu et al., 2018; Krausz et al., 2015; Massalin et al., 2018; Varol and Massalin, 2016; Zhang et al., 2020). These systems usually require an onboard accelerometer or IMU to transform the 3D environment information from the camera coordinate system into global coordinates (Krausz et al., 2015; Krausz and Hargrove, 2021; Zhang et al., 2019b; 2019c; 2019d; 2020). The application of depth cameras for active environment sensing would also require robotic prostheses and exoskeletons to have onboard microcontrollers with high computing power and low power consumption. The current embedded systems would need significant modifications to support the real-time processing and classification of depth images given their added complexity. These practical limitations motivated our decision to use RGB images for the environment classification. The EfficientNetB0 network (Tan and Le, 2019) achieved the highest image classification accuracy on the ExoNet database during inference (i.e., 73.2% accuracy). However, for environment-adaptive control of robotic leg prostheses and exoskeletons, higher classification accuracies are desired since even rare misclassifications could cause loss-of-balance and injury. Although we used state-of-the-art deep convolutional neural networks, the architectures included only feedforward connections, therein classifying the walking environment frame-by-frame without knowing the preceding classification decisions. Moving forward, sequential information over time could be used to improve the classification accuracy and robustness, especially during steady-state environments. This computer vision technique is analogous to how light-sensitive receptors in the human eye capture dynamic images to control locomotion (i.e., known as optical flow) (Patla, 1997). Sequential data could be classified using majority voting (Wang et al., 2013; Massalin et al., 2018; Varol and Massalin, 2016) or deep learning models like Transformers (Vaswani et al., 2017) or recurrent neural networks (RNNs) (Zhang et al., 2019b). Majority voting works by storing sequential decisions in a vector and generating a classification prediction based on the majority of decisions (a minimal sketch follows below). Recurrent neural networks process sequential data while maintaining an internal hidden state vector that implicitly contains temporal information. Training RNNs can be challenging though, due to exploding and vanishing gradients.
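Here is a minimal sketch of majority voting over a sliding window of frame-by-frame classification decisions, as described above. The window length and the simulated decision sequence (using the class abbreviations defined earlier) are placeholders.

```python
# Sketch: majority voting over the most recent frame-by-frame decisions to
# smooth environment classification. Window size and decisions are placeholders.
from collections import Counter, deque

WINDOW_SIZE = 7
recent_decisions = deque(maxlen=WINDOW_SIZE)

def vote(new_decision):
    """Store the newest per-frame decision and return the current majority class."""
    recent_decisions.append(new_decision)
    return Counter(recent_decisions).most_common(1)[0][0]

# Simulated per-frame CNN outputs with one spurious misclassification.
frame_decisions = ["L-S", "L-S", "L-T-I", "L-S", "L-S", "L-S", "I-S"]
for decision in frame_decisions:
    smoothed = vote(decision)
print("Latest smoothed decision:", smoothed)
```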
Although recurrent neural networks were designed to learn long-term dependencies, research has shown that they struggle with storing sequential information over long periods (LeCun et al., 2015). To mitigate this issue, RNNs can be supplemented with an explicit memory module like a neural Turing machine or long short-term memory (LSTM) network, therein allowing for uninterrupted backpropagated gradients. Fu's research group (Zhang et al., 2019b) explored using sequential data for environment classification. Sequential decisions from a baseline CNN were fused and classified using a recurrent neural network, LSTM network, majority voting, and a hidden Markov model (HMM). The baseline neural network achieved 92.8% image classification accuracy across five environment classes. Supplementing the baseline CNN with an RNN, LSTM network, majority voting, and HMM resulted in 96.5%, 96.4%, 95%, and 96.8% classification accuracies, respectively (Zhang et al., 2019b). Although using sequential data can improve the classification accuracy of walking environments, these decisions require longer computation times and thus could impede real-time locomotor control. The development of deep convolutional neural networks has traditionally focused on improving classification accuracy, often leading to more accurate yet inefficient algorithms with greater computational and memory storage requirements (Canziani et al., 2016). These design features can be especially problematic for deployment on mobile and embedded systems, which inherently have limited operating resources. Despite the advances in computing platforms like graphics processing units (GPUs), the current embedded systems in robotic leg prostheses and exoskeletons would struggle to support the architectural and computational complexities typically associated with deep learning for computer vision. Accordingly, we quantitatively evaluated and compared the benchmarked CNN architectures and their environment classification predictions on the ExoNet database using an operational metric called "NetScore" (Wong, 2018), which balances the image classification accuracy with the computational and memory storage requirements (i.e., important for onboard real-time inference). Although the ResNet152 network achieved one of the highest image classification accuracies on the ExoNet database (i.e., 71.6% accuracy), it received the lowest NetScore (i.e., W = 46) due to the disproportionately large number of parameters (i.e., containing more parameters than any other CNN architecture we tested). Interestingly, the EfficientNetB0 network did not receive the highest NetScore (i.e., W = 72.6) despite achieving the highest classification accuracy and the architecture having been optimally designed using a neural architecture search to maximize the accuracy while minimizing the number of computing operations (Tan and Le, 2019). The MobileNetV2 network developed by Google (Sandler et al., 2019), which uses depthwise separable convolutions, received the highest NetScore (i.e., W = 76.2), therein demonstrating the best balance between the classification accuracy (i.e., 72.9% accuracy) and the architectural and computational complexities. The authors previously demonstrated the ability of their CNN architecture to perform onboard real-time inference on a mobile device (i.e., ~75 ms per image on a CPU-powered Google Pixel 1 smartphone) (Sandler et al., 2019).
However, our classification system could theoretically generate even faster runtimes since 1) the smartphone we used (i.e., the iPhone XS Max) has an onboard GPU, and 2) we reduced the size of the final densely connected layer of the MobileNetV2 architecture from 1,000 outputs (i.e., as originally used for ImageNet) to 12 outputs (i.e., for the ExoNet database). Compared to traditional CPUs, GPUs have many more core processors, which permit faster and more efficient CNN computations through parallel computing (LeCun et al., 2015). Moving forward, we recommend using the existing CPU embedded systems in robotic prostheses and exoskeletons for performing the high-level locomotion mode recognition based on neuromuscular-mechanical data, which is less computationally expensive, and a supplementary GPU computing platform for the image-based environment classification; these recommendations concur with those recently proposed by Huang and colleagues (Da Silva et al., 2020). Lastly, given that the environmental context does not explicitly represent the user's locomotor intent, although it constrains the movement possibilities, data from computer vision should supplement, rather than replace, the automated locomotion mode control decisions based on mechanical, inertial, and/or neuromuscular signals. Images from our wearable smartphone camera could be fused with its onboard IMU measurements to improve the high-level control performance. For example, when an exoskeleton or prosthesis user wants to sit down, the acceleration data would indicate stand-to-sit rather than level-ground walking, despite the level-ground terrain being accurately detected within the field-of-view (bottom right image in Figure 7). Our smartphone IMU measurements could also help minimize the onboard computational and memory storage requirements via sampling rate control (i.e., providing an automatic triggering mechanism for the image capture). Whereas faster walking speeds can benefit from higher sampling rates for continuous classification, standing still does not necessarily require environment information and thus the smartphone camera could be powered down or the sampling rate decreased, therein conserving the onboard operating resources. However, relatively few researchers have fused environment data with mechanical and/or inertial measurements for automated locomotion mode recognition (Du et al., 2012; Huang et al., 2011a; Krausz et al., 2019; Krausz and Hargrove, 2021; Liu et al., 2016; Wang et al., 2013; Zhang et al., 2011), and only one study (Zhang et al., 2020) has used such sensor fusion.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

(1) Manual High-Level Control; (2) Automated High-Level Control.

Figure 2. An automated locomotion mode recognition system for robotic prostheses and exoskeletons, also known as an intent recognition system or intelligent high-level controller. These systems can be supplemented with an environment recognition system to forward predict the walking environments prior to physical interaction, therein minimizing the high-level decision space. Note that the photograph (right) is the lead author wearing our robotic exoskeleton.

Figure 5. Development of the "ExoNet" database, including (1) photograph of the wearable camera system used for large-scale data collection.
Figure 6. Examples of the wearable camera images of indoor and outdoor real-world walking environments in the ExoNet database. Images were collected at various times throughout the day and across different seasons (i.e., summer, fall, and winter).

Table 3. Previous environment recognition systems that used convolutional neural networks for image classification of walking environments. Note that each network was trained and tested on different image datasets (see Table 1).

References:
Subject- and environment-based sensor variability for wearable lower-limb assistive devices
Sensor fusion of vision, kinetics and kinematics for forward prediction during walking with a transfemoral prosthesis
ImageNet classification with deep convolutional neural networks
Electromechanical design of robotic transfemoral prostheses
Lower-limb prostheses and exoskeletons with energy regeneration: Mechatronic design and optimization review
Preliminary design of an environment recognition system for controlling robotic lower-limb prostheses and exoskeletons
Comparative analysis of environment recognition systems for control of lower-limb exoskeletons and prostheses
ExoNet database: Wearable camera images of human locomotion environments
Simulation of stand-to-sit biomechanics for robotic exoskeletons and prostheses with energy regeneration
Computer vision and deep learning for environment-adaptive control of robotic lower-limb exoskeletons. bioRxiv
Brain-machine interfaces: From basic science to neuroprosthetics and neurorehabilitation
Very deep convolutional networks for large-scale image recognition
Effect of additional mechanical sensor data on an EMG-based pattern recognition system for a powered leg prosthesis
Rethinking the Inception architecture for computer vision
EfficientNet: Rethinking model scaling for convolutional neural networks
Control strategies for active lower extremity prosthetics and orthotics: A review
Multiclass real-time intent recognition of a powered lower limb prosthesis
A feasibility study of depth image based intent recognition for lower limb prostheses
Terrain recognition improves the performance of neural-machine interface for locomotion mode recognition
NetScore: Towards universal metrics for large-scale performance analysis of deep neural networks for practical on-device edge usage
State of the art and future directions for lower limb robotic exoskeletons
Preliminary design of a terrain recognition system