Scalable Identification of Partially Observed Systems with Certainty-Equivalent EM
Kunal Menda, Jean de Becdelievre, Jayesh K. Gupta, Ilan Kroo, Mykel J. Kochenderfer, Zachary Manchester
June 20, 2020

System identification is a key step for model-based control, estimator design, and output prediction. This work considers the offline identification of partially observed nonlinear systems. We empirically show that the certainty-equivalent approximation to expectation-maximization can be a reliable and scalable approach for high-dimensional deterministic systems, which are common in robotics. We formulate certainty-equivalent expectation-maximization as block coordinate-ascent, and provide an efficient implementation. The algorithm is tested on a simulated system of coupled Lorenz attractors, demonstrating its ability to identify high-dimensional systems that can be intractable for particle-based approaches. Our approach is also used to identify the dynamics of an aerobatic helicopter. By augmenting the state with unobserved fluid states, a model is learned that predicts the acceleration of the helicopter better than state-of-the-art approaches. The codebase for this work is available at https://github.com/sisl/CEEM.

The performance of controllers and state-estimators for non-linear systems depends heavily on the quality of the model of system dynamics (Hou & Wang, 2013). System identification addresses the problem of learning or calibrating dynamics models from data (Ljung, 1999), which is often a time-history of observations of the system and control inputs. In this work, we address the problem of learning dynamics models of partially observed systems (shown in Figure 1) that are high-dimensional and non-linear. We consider situations in which the system's state cannot be inferred from a single observation, but instead requires inference over time-series of observations.

Figure 1. A graphical model representing a partially observed dynamical system. Gray-box identification algorithms attempt to search a model class of dynamics and observation models for the model that maximizes the likelihood of the observations.

The problem of identifying systems from partial observations arises in robotics (Punjani & Abbeel, 2015; Cory & Tedrake, 2008; Ordonez et al., 2017) as well as in domains such as chemistry (Gustavsson, 1975) and biology (Sun et al., 2008). In many robotic settings, we have direct measurements of a robot's pose and velocity, but we often cannot directly observe relevant quantities such as the temperature of actuators, the state of the environment around the robot, or the intentions of other agents. For example, Abbeel et al. (2010) attempted to map the pose and velocity of an aerobatic helicopter to its acceleration. They found their model to be inaccurate when predicting aggressive maneuvers because of the substantial airflow generated by the helicopter, which affected the dynamics. Since it is often impossible to directly measure the state of the airflow around a vehicle, identification must be performed under partial observability.

System identification is a mature field with a rich history (Ljung, 1999; 2010). A variety of techniques have been proposed to learn predictive models from time-series data.
Autoregressive approaches directly map a time-history of past inputs to observations, without explicitly reasoning about unobserved states (Billings, 2013), and are the state-of-the-art approach to the aforementioned problem of modeling the aerobatic helicopter (Punjani & Abbeel, 2015). In contrast, state-space models (SSMs) assume an unobserved state x_t that evolves over time and emits the observations y_t that we measure, as shown in Figure 1. Recurrent Neural Networks (RNNs) (Bailer-Jones et al., 1998; Zimmermann & Neuneier, 2000) are a form of black-box non-linear SSM that can be fit to observation and input time-series, and Subspace Identification (SID) methods (Van Overschee & De Moor, 1994) can be used to fit linear SSMs. However, in many cases prior knowledge can be used to specify structured, parametric models of the system in state-space form (Gupta et al., 2019), commonly referred to as gray-box models. Such models can be trained with less data and used with a wider array of control and state-estimation techniques than black-box models (Gupta et al., 2019; Lutter et al., 2019a;b).

To identify partially observed gray-box models, the unobserved state-trajectory is often treated as missing data, and techniques based on Expectation-Maximization (EM) are used (Dempster et al., 1977; Schön et al., 2011; Kantas et al., 2015; Ghahramani & Roweis, 1999). The smoothing step (E-step) deals with state inference: the current system dynamics estimate is used to infer the distribution of unobserved state-trajectories conditioned on the observations, p(x_{1:T} | y_{1:T}). This distribution is sometimes called the joint smoothing distribution in the literature, and it is used to estimate the expected log-likelihood of the observations. In the learning step (M-step), the system's dynamics estimate is updated such that it maximizes the expected log-likelihood. The smoothing step can typically be approached with particle approximations of p(x_{1:T} | y_{1:T}), but naive implementations of particle smoothers can be computationally intensive and rapidly become intractable in high dimensions. Across various fields and disciplines, numerous methods have been developed to alleviate this computational burden, several of which are discussed in Section 2.3.

This work is motivated by robotics applications, in which the following assumptions are often valid:
• Systems evolve nearly deterministically, which implies that the process noise is small and unimodal, and
• The distribution of states conditioned on the observations, p(x_{1:T} | y_{1:T}), is unimodal.

We study the benefits of making the certainty-equivalent approximation in the E-step of the EM procedure, and we refer to this approach as CE-EM. Specifically, we use non-linear programming to tractably find the maximum-likelihood (ML) point-estimate of the unobserved states, and use it in lieu of a fully characterized approximate distribution. The contributions of this paper are to describe an efficient implementation of CE-EM, and to test the approximation against state-of-the-art approaches on a variety of system-identification problems. We demonstrate on a system of Lorenz attractors that:
• CE-EM can be faster and more reliable than approaches using particle approximations,
• CE-EM scales to high-dimensional problems, and
• CE-EM learns unbiased parameter estimates on deterministic systems with unimodal p(x_{1:T} | y_{1:T}).

We also demonstrate the algorithm on the problem of identifying the dynamics of an aerobatic helicopter.
We show that a non-linear SSM can be trained with CE-EM that outperforms various approaches, including the most recent work done on this dataset (Punjani & Abbeel, 2015). A codebase implementing CE-EM and other supplementary material can be found at our website: https://sites.google.com/stanford.edu/ceem/.

This section states the nonlinear system identification problem with partial observations and discusses approaches that use EM to solve it. In this work, we assume that we are given a batch of trajectories containing observations y_{1:T} ∈ R^{m×T} of a dynamical system as it evolves over a time horizon T, possibly forced by some known input sequence u_{1:T}. We assume that this dynamical system has a state x ∈ R^n that evolves and generates observations according to the following equations:

x_{t+1} = f_θ(x_t, u_t, t) + w_t,    y_t = g_θ(x_t, u_t, t) + v_t,    (1)

where w_t is referred to as the process noise and v_t as the observation noise. Both w_t and v_t are assumed to be additive for notational simplicity, but this is not a required assumption. Without loss of generality, we can drop the dependence on u_t, absorbing it into the dependence on t. We further assume that we are provided a class of parameterized models f_θ(x, t) and g_θ(x, t) for θ ∈ Θ that approximate the dynamical system's evolution and observation processes.

The goal of our algorithm is to find the parameters θ that maximize the likelihood of the observations. That is, we seek to find:

θ_ML = argmax_θ log p_θ(y_{1:T}) = argmax_θ log ∫ p_θ(x_{1:T}, y_{1:T}) dx_{1:T}.    (2)

Using the graphical model shown in Figure 1, we can factorize the joint likelihood in this integral as:

p_θ(x_{1:T}, y_{1:T}) = p(x_1) ∏_{t=1}^{T−1} p_θ(x_{t+1} | x_t) ∏_{t=1}^{T} p_θ(y_t | x_t).    (3)

Finally, using the notation chosen to describe a dynamical system in Equation (1), we obtain:

log p_θ(x_{1:T}, y_{1:T}) = log p(x_1) + Σ_{t=1}^{T−1} log p_w(x_{t+1} − f_θ(x_t, t)) + Σ_{t=1}^{T} log p_v(y_t − g_θ(x_t, t)),    (4)

where p_w and p_v are the densities of the process and observation noise.

The Expectation-Maximization (EM) algorithm has been used in the literature to solve problems of this form. The EM algorithm is a two-step procedure that copes with the missing state information by forming an approximation Q(θ, θ_k) of the joint state and observation likelihood p_θ(x_{1:T}, y_{1:T}) at the kth iteration of the algorithm:

Q(θ, θ_k) = ∫ log p(x_{1:T}, y_{1:T} | θ) p(x_{1:T} | y_{1:T}, θ_k) dx_{1:T}.    (5)

Classical proofs show that maximizing Q(θ, θ_k) with respect to θ results in increasing the likelihood of the observations (Dempster et al., 1977). The EM algorithm iteratively performs two steps: the E-step, which evaluates Q(θ, θ_k) under the current parameter estimate θ_k, and the M-step, which updates the parameters to θ_{k+1} = argmax_θ Q(θ, θ_k).

The evaluation of Q(θ, θ_k) in the E-step requires the estimation of p(x_{1:T} | y_{1:T}, θ_k), as well as the integration of the likelihood over this distribution. In the linear Gaussian case, EM can be performed exactly using Kalman smoothing (Rauch et al., 1965; Ghahramani & Hinton, 1996). In the more general case, there is no analytic solution, and previous work on approximating the E-step is summarized in the next section.

General approaches based on particle representations of the joint state-observation likelihood use Sequential Monte-Carlo (SMC) methods such as Particle Smoothing (PS) (Schön et al., 2011; Kantas et al., 2015). With enough particles, PS can handle any system and any joint distribution of states and observations. However, PS suffers from the curse of dimensionality, requiring an intractably large number of particles if the state space is high-dimensional (Snyder et al., 2008; Kantas et al., 2015). In the simplest form, both the E-step and the M-step can be quadratic in complexity with respect to the number of particles (Schön et al., 2011).
An important body of work has attempted to alleviate some of this burden by using Nested SMC (Naesseth et al., 2015), forward filtering-backward simulation (FFBSi) (Lindsten & Schön, 2013), and conditional particle filtering (Lindsten, 2013) in the E-step. PS can also require variance reduction techniques, such as full adaptation, to perform reliably, and likelihood estimates in the E-step can be noisy. Stochastic Approximation EM can be used to stabilize learning in the presence of this noise (Delyon et al., 1999).

Another group of methods is based on linearizing the dynamics around the current state estimates to obtain a time-dependent linear Gaussian dynamical system, to which we can apply Kalman smoothing techniques. Ghahramani & Roweis (1999) use Extended Kalman Smoothing (EKS) for a fast E-step and show that, when using radial basis functions to describe the system dynamics, no assumption is required in the M-step. The drawbacks of this approach are that the number of radial basis functions required to accurately represent a function grows exponentially with the input dimension, that the user cannot specify a gray-box SSM, and that EKS performance can vary dramatically with the hyperparameters. Goodwin & Aguero (2005) proposed a method called MAP-EM, which linearizes around the state-trajectory that maximizes the joint state and observation likelihood, approximates process and observation noise as Gaussian, and performs a local version of EM using this linearization. In comparison, the method we consider assumes that the maximum-likelihood state-trajectory estimate concentrates all of the probability mass. Goodwin & Aguero (2005) call this simplification Certainty Equivalent EM (CE-EM) and report worse results than their approach on simple examples. This paper shows that CE-EM can actually be a fast, simple, and reliable approach. The method we use here is tailored to systems that are not dominated by process noise, in which the state-trajectory distribution is unimodal, and whose high-dimensional state-spaces make other methods intractable. Such scenarios are regularly encountered in robotics, but perhaps less so in other application domains of system identification such as chemistry, biology, or finance.

This section introduces the CE-EM approximation and presents an efficient algorithm to identify high-dimensional robotic systems. Under the certainty-equivalence condition, the distribution of states conditioned on observations is assumed to be a Dirac delta function. That is, we assume:

p(x_{1:T} | y_{1:T}, θ_k) = δ(x_{1:T} − x*_{1:T}),    (6)

where x*_{1:T} is the maximum-likelihood state trajectory under θ_k. In other words, this assumption implies that there is only one state trajectory that satisfies the system dynamics and is coherent with the data. The E-step can be rewritten as:

x*_{1:T} = argmax_{x_{1:T}} log p(x_{1:T}, y_{1:T} | θ_k),    (7)
Q(θ, θ_k) = log p(x*_{1:T}, y_{1:T} | θ).    (8)

By observing the maximization of the expression in Equation (7) as the E-step, and maximizing the expression in Equation (8) as the M-step, we see that CE-EM is simply block coordinate-ascent on the single joint objective:

J(x_{1:T}, θ) = log p(x_{1:T}, y_{1:T} | θ).    (9)

Jointly maximizing this objective over x_{1:T} and θ yields θ_ML. Though this objective can be optimized as-is by a non-linear optimizer, it is not necessarily efficient to do so, since θ and x_{1:T} are highly coupled, leading to inefficient and potentially unstable updates. For this reason, we still opt for a block coordinate-ascent approach similar to EM, in which θ and x_{1:T} are each sequentially held constant while the other is optimized.
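To make the joint objective in Equation (9) and the alternating updates concrete, the following is a minimal numerical sketch, written for clarity rather than speed, and not the implementation in the released codebase. The toy models f_theta and g_theta, the Gaussian noise scales, and the use of L-BFGS-B with finite-difference gradients are assumptions made only for this illustration.

import numpy as np
from scipy.optimize import minimize

# Toy stand-ins for the parameterized dynamics and observation models.
def f_theta(x, theta):
    A = np.array([[theta[0], 0.1], [0.0, theta[1]]])
    return A @ x

def g_theta(x, theta):
    return np.array([x[0] + theta[2] * x[1] ** 2])

def joint_log_likelihood(x, y, theta, sig_w=0.1, sig_v=0.1):
    """J(x_{1:T}, theta) of Equation (9), up to additive constants, assuming
    additive Gaussian process and observation noise."""
    J = 0.0
    for t in range(len(y) - 1):   # dynamics terms
        r = x[t + 1] - f_theta(x[t], theta)
        J -= 0.5 * np.sum(r ** 2) / sig_w ** 2
    for t in range(len(y)):       # observation terms
        r = y[t] - g_theta(x[t], theta)
        J -= 0.5 * np.sum(r ** 2) / sig_v ** 2
    return J

def ce_em_iteration(x, y, theta, rho_x=1.0, rho_theta=1.0):
    """One block coordinate-ascent step: smoothing (Eq. (10)), then learning (Eq. (11))."""
    T, n = x.shape
    def neg_smooth(xflat):
        xs = xflat.reshape(T, n)
        return -joint_log_likelihood(xs, y, theta) + rho_x * np.sum((xs - x) ** 2)
    x_new = minimize(neg_smooth, x.ravel(), method="L-BFGS-B").x.reshape(T, n)
    def neg_learn(th):
        return -joint_log_likelihood(x_new, y, th) + rho_theta * np.sum((th - theta) ** 2)
    theta_new = minimize(neg_learn, theta, method="L-BFGS-B").x
    return x_new, theta_new

In this Gaussian setting, maximizing J is equivalent to minimizing a sum of squared dynamics and observation residuals, which is what makes the Gauss-Newton view of the smoothing step discussed below possible.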
Algorithm 1: CE-EM.
Input: observations y_{1:T}, control inputs u_{1:T}, stopping criterion tol.
Initialize: model parameters θ and hidden state estimates x_{1:T}.
Repeat: smoothing step (Equation (10)), then learning step (Equation (11)), until the stopping criterion tol is met.

By viewing EM as block coordinate-ascent, we may also borrow good practices typically employed when running such algorithms (Shi et al., 2016). In particular, we employ trust-region regularization in order to guarantee convergence even in non-convex settings (Grippo & Sciandrone, 2000). At iteration k, we first perform smoothing by holding θ constant and finding the ML point-estimate for x_{1:T} as follows:

x^{(k+1)}_{1:T} = argmax_{x_{1:T}} J(x_{1:T}, θ^{(k)}) − ρ_x ‖x_{1:T} − x^{(k)}_{1:T}‖²,    (10)

where ρ_x scales a soft trust-region regularizer similar to the damping terms found in Levenberg-Marquardt methods (Levenberg, 1944; Marquardt, 1963). We note that the Hessians of this objective with respect to x_{1:T} are block-sparse, and as a result this step can be efficiently solved with a second-order optimizer in O(n²T). If the cost function J(x_{1:T}, θ) is a sum of quadratic terms, i.e. if all stochasticity is Gaussian, then the smoothing step can be solved with a Gauss-Newton method, where the Hessian matrix is approximated using first-order (Jacobian) information. In this case, the solution to the smoothing step is equivalent to iteratively performing EKS (Aravkin et al., 2017). However, EKS typically involves forward and backward passes along each trajectory, performing a Riccati-like update at each time step. Though square-root versions of the algorithm exist to improve its numerical stability (Bierman, 1981), solving the problem with batch non-linear least-squares, as opposed to iteratively performing a forward-backward method, can improve stability for long time-series (Van Dooren, 1981).

In the learning step, we hold x_{1:T} constant and find:

θ^{(k+1)} = argmax_θ J(x^{(k+1)}_{1:T}, θ) + log p(θ) − ρ_θ ‖θ − θ^{(k)}‖².    (11)

Again, ρ_θ scales a soft trust-region regularizer, and specifying log p(θ) allows us to regularize θ toward a prior. The above optimization problem can be solved using any nonlinear optimizer. We find the Nelder-Mead scheme (Gao & Han, 2012) well-suited to small parameter spaces, and first-order schemes such as Adam (Kingma & Ba, 2015) or quasi-second-order schemes such as L-BFGS (Liu & Nocedal, 1989) suited to larger parameter spaces, such as those of neural networks. The routine, summarized in Algorithm 1, iterates between the smoothing and learning steps until convergence.
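Because the smoothing objective decomposes into per-time-step residuals, it can be handed to a sparse nonlinear least-squares solver, which is the strategy described in Appendix A.2. The sketch below stacks the dynamics and observation residuals of one trajectory and supplies the Jacobian sparsity pattern to scipy.optimize.least_squares; it reuses the toy f_theta and g_theta from the earlier sketch, omits the soft trust-region term of Equation (10) for brevity, and treats the noise scales and dimensions as illustrative assumptions.

import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

def smooth_trajectory(y, theta, f_theta, g_theta, n, sig_w=0.1, sig_v=0.1):
    """Smoothing step solved as batch nonlinear least squares over a whole trajectory."""
    T, m = y.shape

    def residuals(xflat):
        x = xflat.reshape(T, n)
        r_dyn = (x[1:] - np.stack([f_theta(x[t], theta) for t in range(T - 1)])) / sig_w
        r_obs = (y - np.stack([g_theta(x[t], theta) for t in range(T)])) / sig_v
        return np.concatenate([r_dyn.ravel(), r_obs.ravel()])

    # Sparsity pattern: each dynamics residual block touches only x_t and x_{t+1},
    # and each observation residual block touches only x_t.
    S = lil_matrix(((T - 1) * n + T * m, T * n), dtype=int)
    for t in range(T - 1):
        S[t * n:(t + 1) * n, t * n:(t + 2) * n] = 1
    for t in range(T):
        row = (T - 1) * n + t * m
        S[row:row + m, t * n:(t + 1) * n] = 1

    sol = least_squares(residuals, np.zeros(T * n), jac_sparsity=S, method="trf")
    return sol.x.reshape(T, n)

Because every residual involves at most two consecutive states, the Gauss-Newton approximation of the Hessian inherits the block-banded structure noted above, and the cost of each solve grows linearly with the trajectory length.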
It should be noted that making the certainty-equivalent approximation is generally known to bias parameter estimates (Celeux & Govaert, 1992), but it can yield the correct solutions under the assumptions we make (Neal & Hinton, 1998). Readers of Goodwin & Aguero (2005) will likely be left with the incorrect impression that CE-EM is an inferior method that would perform poorly in practice.

The objective of our experiments is to demonstrate that the CE-EM algorithm is capable of identifying high-dimensional nonlinear systems in partially observed settings. We do so in simulation by identifying the parameters of a system of partially observed coupled Lorenz attractors, as well as by identifying the dynamics of a real aerobatic helicopter. In the second experiment, we build on previous analysis of the dataset (Abbeel et al., 2010; Punjani & Abbeel, 2015) by attempting to characterize the interaction of the helicopter with the fluid around it, without having any direct observation of the fluid state. Instructions for reproducing all experiments are included in the supplementary material.

In this experiment, we show that:
1. CE-EM learns unbiased parameter estimates of systems that are close to deterministic, and
2. CE-EM scales to high-dimensional problems in which particle-based methods can be intractable.

To justify these claims, we use a system that is sufficiently non-linear and partially observable to make particle-based smoothing methods intractable. We choose a system of coupled Lorenz attractors for this purpose, owing to their ability to exhibit chaotic behavior and their use in nonlinear atmospheric and fluid flow models (Bergé et al., 1984). Arbitrary increases in state dimensionality can be achieved by coupling multiple individual attractors. The state of a system with K coupled Lorenz attractors is x ∈ R^{3K}, with components x_{1,k}, x_{2,k}, x_{3,k} for each attractor k. The dynamics of the system are as follows:

ẋ_{1,k} = σ_k (x_{2,k} − x_{1,k}),
ẋ_{2,k} = x_{1,k} (ρ_k − x_{3,k}) − x_{2,k},
ẋ_{3,k} = x_{1,k} x_{2,k} − β_k x_{3,k},

with an additional linear coupling term H x added to the stacked state derivative, where H is an R^{3K×3K} matrix. We nominally set the parameters (σ_k, ρ_k, β_k) to the values (10, 28, 8/3), and randomly sample the entries of H from a normal distribution to generate chaotic and coupled behavior between attractors, while avoiding self-coupling. These parameters are estimated during identification. In order to make the system partially observed, the observation y ∈ R^{3K−2} is derived from x as follows:

y = C x + v,

where C ∈ R^{(3K−2)×3K} is a known matrix with full row-rank, and v is observation noise sampled from a Gaussian with diagonal covariance σ_v² I. The entries of C are also randomly sampled from a standard normal distribution. In the following experiments, we simulate the system for T = 128 timesteps with a step size of ∆t = 0.04 s, and integrate the system using a 4th-order Runge-Kutta method. Initial conditions for each trajectory are sampled such that x_{1,k} ∼ N(−6, 2.5²), x_{2,k} ∼ N(−6, 2.5²), x_{3,k} ∼ N(24, 2.5²).
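For concreteness, the following sketch generates data from this system: it integrates the coupled Lorenz dynamics with a 4th-order Runge-Kutta scheme and emits partial observations y = Cx + v. The additive form of the coupling term Hx follows the description above; H is left at zero here (as in the single-attractor experiment below), and random off-block entries would be drawn when simulating coupled attractors.

import numpy as np

def lorenz_coupled_deriv(x, sigma, rho, beta, H):
    """Time derivative of K coupled Lorenz attractors, with the state stacked as
    (x_{1,k}, x_{2,k}, x_{3,k}) per attractor and linear coupling through H."""
    K = len(sigma)
    dx = np.zeros(3 * K)
    for k in range(K):
        x1, x2, x3 = x[3 * k:3 * k + 3]
        dx[3 * k + 0] = sigma[k] * (x2 - x1)
        dx[3 * k + 1] = x1 * (rho[k] - x3) - x2
        dx[3 * k + 2] = x1 * x2 - beta[k] * x3
    return dx + H @ x

def simulate(K=1, T=128, dt=0.04, sigma_v=0.01, seed=0):
    """Simulate one trajectory with RK4 integration and partial observations."""
    rng = np.random.default_rng(seed)
    sigma, rho, beta = np.full(K, 10.0), np.full(K, 28.0), np.full(K, 8.0 / 3.0)
    H = np.zeros((3 * K, 3 * K))                       # coupling (zero for K = 1)
    C = rng.standard_normal((3 * K - 2, 3 * K))        # known observation matrix
    x = np.concatenate([rng.normal([-6.0, -6.0, 24.0], 2.5) for _ in range(K)])
    xs, ys = [], []
    for _ in range(T):
        k1 = lorenz_coupled_deriv(x, sigma, rho, beta, H)
        k2 = lorenz_coupled_deriv(x + 0.5 * dt * k1, sigma, rho, beta, H)
        k3 = lorenz_coupled_deriv(x + 0.5 * dt * k2, sigma, rho, beta, H)
        k4 = lorenz_coupled_deriv(x + dt * k3, sigma, rho, beta, H)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        xs.append(x)
        ys.append(C @ x + sigma_v * rng.standard_normal(3 * K - 2))
    return np.array(xs), np.array(ys)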
To test the conditions under which CE-EM learns unbiased parameter estimates, we simulate a single Lorenz system with H = 0 and known C ∈ R^{2×3}. We introduce and vary the process noise w ∼ N(0, σ_w² I) and the observation noise coefficient σ_v, and then attempt to estimate the parameters (σ, ρ, β). Using initial guesses within 10% of the system's true parameter values, we run CE-EM on a single sampled trajectory. For each choice of σ_w and σ_v, we repeat this process for 10 random seeds. Table 1 shows the mean and standard errors of parameter estimates for various σ_w and σ_v. We highlight in red the mean estimates that are not within two standard errors of their true value. We see that σ and ρ are estimated without bias in all scenarios. However, the estimate of β appears to become biased as the process noise is increased, but not as the observation noise is increased. This supports the assumption that the objective used in CE-EM is sound when systems evolve close to deterministically, but can be biased when they do not.

In Section 2, we discussed methods for parameter estimation in state-space systems that are based on particle filtering and smoothing (Kantas et al., 2015). Since, in their E-step, these methods approximate the distribution over unobserved state-trajectories as opposed to only its point estimate, such methods can be asymptotically unbiased. However, for a finite number of particles, such methods can result in high-variance estimates. In this experiment, we compare the bias resulting from using CE-EM with the variance of using a state-of-the-art Particle EM algorithm. We attempt to identify the parameters (σ, ρ, β) of the same single Lorenz system as in Section 4.1.1. However, we introduce process noise w ∼ N(0, 0.1² I) and observation noise v ∼ N(0, 0.5² I). We use a training dataset of four trajectories sampled with the conditions specified in Section 4.1.

The performance of Particle EM can vary substantially depending on the implementation of the particle filter and smoother in the E-step, and the implementation of the M-step. We use a fully-adapted particle filter with systematic and adaptive resampling, following the recommendations of Doucet et al. (2000) and Hol et al. (2006). Furthermore, we use the FFBSi algorithm (Godsill et al., 2004; Lindsten & Schön, 2013) in order to generate independent samples of smoothed state-trajectories, while avoiding the complexity that is quadratic in the number of particles experienced by the forward filter-backward smoother (FFBSm) approach (Doucet et al., 2000). We then use Stochastic Approximation EM (SAEM) (Delyon et al., 1999) to perform the M-step. We have found that using FFBSi with SAEM performs more reliably than FFBSm with the M-step recommended by Schön et al. (2011); both implementations can be found in the associated codebase. We use N_p = 100 particles for filtering, and sample N_s = 10 smoothed trajectories using FFBSi.

Figure 2 shows the estimated parameters versus EM epoch using CE-EM and Particle EM. We plot learning curves for 10 random seeds, each of which initializes parameter estimates to within 10% of their true values and uses a different set of training trajectories. We see that CE-EM consistently converges to accurate parameter estimates in approximately 5 epochs. Estimates of Particle EM appear to initially diverge, but in all but one case they converge to a similar accuracy in 50 epochs. Furthermore, since the complexity of FFBSi is O(N_p N_s), the runtime per epoch of Particle EM is 47× that of CE-EM. Since the variance of particle-based methods generally increases with the effective dimension of the system, the bias induced by CE-EM may be a worthwhile trade-off for fast and reliable parameter estimation.

To demonstrate that CE-EM is capable of identifying high-dimensional systems, we show that we can estimate the dynamics of an 18-dimensional system of six coupled Lorenz attractors. Moreover, we test whether CE-EM converges to more accurate estimates when more trajectories are provided. To test these claims, we sample 2, 4, and 8 trajectories from a deterministic system with parameters θ_true and σ_v = 0.01. We randomly initialize each element of the parameters being optimized (θ = [σ_{1:K}, ρ_{1:K}, β_{1:K}, H]) to within 10% of their value in θ_true. We then run CE-EM on each batch, tracking the error in the estimated dynamics as training proceeds. We measure this error, which we call ε(θ), as the deviation of the estimated parameters from their values in θ_true. In the learning step, we do not regularize θ toward a prior and set ρ_θ = 0. For comparison, we also run the same Particle EM implementation as in the previous experiment on this problem. We use the same N_p, N_s, and hyperparameters as before.

Figure 3 shows the results of this experiment for four random seeds for each batch size. We can see that, as the number of trajectories used in training increases, the error in the estimated dynamics tends toward zero. Furthermore, we see that CE-EM converges monotonically to a local optimum in all cases. In contrast, Particle EM appears to initially improve but then converges to very poor parameter estimates.
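To give a sense of where the cost of particle-based baselines comes from, the sketch below implements a bootstrap particle filter with systematic resampling. This is deliberately simpler than the fully-adapted filter with adaptive resampling and FFBSi smoothing used in the Particle EM baseline; it only illustrates that every time step requires propagating and weighting all N_p particles, a cost that the backward smoothing pass then multiplies by N_s.

import numpy as np

def bootstrap_pf_loglik(y, f, g, n, Np=100, sig_w=0.1, sig_v=0.5, seed=0):
    """Bootstrap particle filter with systematic resampling; returns an estimate
    of log p(y_{1:T}) up to additive constants for a model with discrete-time
    dynamics f, observation map g, and Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(Np, n))                       # initial particle cloud
    log_lik = 0.0
    for t in range(len(y)):
        # Propagate every particle through the dynamics with process noise.
        X = np.stack([f(x) for x in X]) + sig_w * rng.normal(size=(Np, n))
        # Weight particles by the likelihood of the observation y_t.
        resid = y[t] - np.stack([g(x) for x in X])
        logw = -0.5 * np.sum(resid ** 2, axis=1) / sig_v ** 2
        shift = logw.max()
        w = np.exp(logw - shift)
        log_lik += shift + np.log(w.mean())
        w /= w.sum()
        # Systematic resampling keeps the particle count fixed at Np.
        positions = rng.random() / Np + np.arange(Np) / Np
        idx = np.minimum(np.searchsorted(np.cumsum(w), positions), Np - 1)
        X = X[idx]
    return log_lik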
The experiments conducted thus far have demonstrated that CE-EM can learn unbiased parameter estimates of nearly-deterministic systems, and can scale to high-dimensional problems for which particle-based methods are intractable. In the next experiment, we use CE-EM to characterize the effect of unobserved states on the dynamics of an aerobatic helicopter.

Characterizing the dynamics of a helicopter undergoing aggressive aerobatic maneuvers is widely considered to be a challenging system-identification problem (Abbeel et al., 2010; Punjani & Abbeel, 2015). The primary challenge is that the forces on the helicopter depend on the induced state of the fluid around it. The state of the fluid cannot be directly observed, and its dynamics model is unknown. Merely knowing the state of the helicopter and the control commands at a given time does not provide enough information to accurately predict the forces that act on it. In order to address this issue, Punjani & Abbeel (2015) use an approach based on Takens' theorem, which suggests that a system's state can be reconstructed from a finite number of lagged observations of it (Takens, 1981). Instead of attempting to estimate the unobserved fluid state, they directly learn a mapping from a 0.5 s history of observed state measurements and control commands to the forces acting on the helicopter. This approach is equivalent to considering the past 0.5 s of observations as the system's state. However, it can require a very large number of lagged observations to represent complex phenomena. In reality, the characteristic time of unsteady flows around helicopters can easily be up to tens of seconds. Having such a high-dimensional state can make control design and state-estimation more complicated. To avoid large input dimensions, a trade-off between the duration of the history and the sample frequency is necessary. This trade-off will either hurt the resolution of low-frequency content or alias high frequencies. We instead attempt to explicitly model the unobserved states affecting the system.

The objective of this learning problem is to predict y_t, the helicopter's acceleration at time t, from an input vector u_t containing the current measured state of the helicopter (its velocity and rotation rates) and the control commands. We use data collected by the Stanford Autonomous Helicopter Project (Abbeel et al., 2010). Trajectories are split into 10 s long chunks and then randomly distributed into train, test, and validation sets according to the established protocol (Abbeel et al., 2010; Punjani & Abbeel, 2015), which is summarized in Appendix A.1. The train, test, and validation sets respectively contain 466, 100, and 101 trajectories of 500 time-steps each. A simple success metric on a given trajectory is the root mean squared prediction error,

RMSE = sqrt( (1/T) Σ_{t=1}^{T} ‖ y_t^{(measured)} − y_t^{(pred)} ‖² ),

where y_t^{(measured)} is the measured force from the dataset, y_t^{(pred)} is the force predicted by the model, and T is the number of time-steps in each trajectory.

Naive: We first consider a naive baseline that does not attempt to account for the time-varying nature of the fluid state. We train a neural network to map only the current helicopter state and control commands to the accelerations: y_t = NN_{θ_n}(u_t), where NN_{θ_n} is a neural network with parameters θ_n. We refer to this model as the naive model.

We also compare to the work of Punjani & Abbeel (2015). They predict y_t using a time-history u_{t−H:t} of H = 25 lagged observations of the helicopter's measured state and control commands. This input is passed through a ReLU-activated neural network with a single hidden layer, combined with what they call a Quadratic Lag Model. As a baseline, we reproduce their performance with a single deep neural network y_t = NN_{θ_h}(u_{t−H:t}) with parameters θ_h. We call this neural network model the H25 model. Both of these models can be trained via stochastic gradient descent to minimize the mean squared error (MSE) of their predictions for y. The optimization methodology for these models is described in Appendix A.2.
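The two neural baselines differ only in their inputs, as the sketch below illustrates: the naive model sees the current measured state and commands, while the H25-style model sees a flattened window of lagged inputs. The input and output sizes (ten measured-state-and-command channels, six predicted acceleration components) and the exact window indexing are assumptions made for illustration; the eight-hidden-layer tanh architecture follows Appendix A.2.

import torch
import torch.nn as nn

def mlp(d_in, d_out, hidden=32, layers=8):
    """Tanh multilayer perceptron matching the architecture described in Appendix A.2."""
    mods, d = [], d_in
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    mods.append(nn.Linear(d, d_out))
    return nn.Sequential(*mods)

# Naive model: current measured state and commands -> accelerations.
naive_model = mlp(d_in=10, d_out=6)

# H25-style model: a flattened window of H = 25 lagged inputs -> accelerations.
H = 25
h25_model = mlp(d_in=10 * H, d_out=6)

def lagged_windows(u, H=25):
    """Stack u_{t-H+1:t} into one flat input per prediction (valid for t >= H - 1)."""
    return torch.stack([u[t - H + 1:t + 1].reshape(-1) for t in range(H - 1, len(u))])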
SID: As a third baseline, we compare with subspace-identification methods (Van Overschee & De Moor, 1994). We let ỹ_t = y_t − NN_{θ_n}(u_t) be the prediction errors of the trained naive model. We use the MATLAB command n4sid to fit a linear dynamical system of the following form:

x_{t+1} = A_s x_t + B_s u_t + w_t,    ỹ_t = C_s x_t + D_s u_t + v_t.

Here, x ∈ R^d is the unobserved state with arbitrary dimension d. The learned parameters are θ_s = [A_s, B_s, C_s, D_s]. We use a state dimension of 10 and call this model the SID model. The n4sid algorithm scales super-linearly with the amount of data supplied, and thus we train on 10 randomly sampled subsets of 100 trajectories each and report the distribution in prediction performance. This approach fits a linear system to the residual error of the naive model; the prediction of y_t obtained from the SID model is therefore a nonlinear function of the inputs u_t and the states x_t.

We also train an LSTM (Hochreiter & Schmidhuber, 1997) on the residual time-series ỹ_{1:T} and u_{1:T}.

We use CE-EM to train a non-linear SSM. Similar to the parameterization used for subspace identification, we fit the prediction errors of the naive model using the following dynamical system:

x_{t+1} = A_NL x_t + B_NL u_t + w_t,    ỹ_t = C_NL x_t + D_NL u_t + NN_{θ̃_NL}(x_t, u_t) + v_t,

where NN_{θ̃_NL} is a neural network and θ_NL = [A_NL, B_NL, C_NL, D_NL, θ̃_NL] are the learned parameters. We introduce non-linearity only in the observation function because it is known from Koopman theory that a non-linear system can be approximated by a high-dimensional linear system, provided the correct non-linear mapping between them (Brunton et al., 2016). While learning, we assume that both process and observation noise are distributed with diagonal Gaussian covariance matrices σ_w I and σ_v I, respectively. The values of σ_w and σ_v are treated as hyperparameters of CE-EM, and are both set to 1. Here as well, we use a state dimension of 10 and call this model the NL model. The optimization methodology for this model is described in Appendix A.2. It should be noted that the system we learn need not actually correspond to an interpretable model of the fluid state, but only to time-varying hidden states that are useful for predicting the accelerations of the helicopter. Expert knowledge of helicopter aerodynamics could be used to further inform a gray-box model trained with CE-EM.
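A sketch of the NL parameterization is given below, using PyTorch modules for the linear maps and the observation network. The way the network term enters the observation here (added to the linear terms and taking both x_t and u_t as input) follows the form written above and should be read as an assumption; the two-hidden-layer tanh network and the latent dimension of 10 follow the text, while the input and output sizes are illustrative.

import torch
import torch.nn as nn

class NLModel(nn.Module):
    """Sketch of the NL state-space model: linear latent dynamics with a
    nonlinearity only in the observation map."""
    def __init__(self, n=10, m_u=10, m_y=6, hidden=32):
        super().__init__()
        self.A = nn.Linear(n, n, bias=False)       # A_NL
        self.B = nn.Linear(m_u, n, bias=False)     # B_NL
        self.C = nn.Linear(n, m_y, bias=False)     # C_NL
        self.D = nn.Linear(m_u, m_y, bias=False)   # D_NL
        self.obs_nn = nn.Sequential(               # neural-network observation term
            nn.Linear(n + m_u, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, m_y),
        )

    def dynamics(self, x, u):                      # x_{t+1} = A_NL x_t + B_NL u_t
        return self.A(x) + self.B(u)

    def observation(self, x, u):                   # y~_t = C_NL x_t + D_NL u_t + NN(x_t, u_t)
        return self.C(x) + self.D(u) + self.obs_nn(torch.cat([x, u], dim=-1))

During CE-EM training, the dynamics and observation methods play the roles of f_θ and g_θ in Equations (10) and (11), with the residual ỹ_t as the observation.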
The test RMSE of the naive, H25, and LSTM models can be evaluated directly on the test trajectories using next-step prediction. However, the SID and NL models require an estimate of the unobserved state before making a prediction. The natural analog of next-step prediction is extended Kalman filtering (EKF), during which states are recursively predicted and corrected given observations. At a given timestep, a prediction of ỹ_t is made using the current estimate of x_t, and is used in the computation of the RMSE. The state estimate is then corrected with the measured ỹ_t. Figure 4a shows the RMSE of the compared models on trajectories in the test set.

We see that the NL model is able to consistently predict the accelerations of the helicopter with better accuracy than any of the other models. The naive model performs on average 2.9 times worse than the H25 model, and its results can be found in Appendix A.3. The LSTM model also performs poorly, on average 1.4 times worse than the H25 model. The SID model notably outperforms the state-of-the-art H25 model, suggesting that a large linear dynamical system can be used to approximate a non-linear and partially observable system (Korda & Mezić, 2018). However, introducing non-linearity as in the NL model noticeably improves performance. Figure 4b depicts the errors in prediction over a sample trajectory in the test set. Here, we also see that the NL model is able to attenuate the time-varying error present in predictions made by the H25 model, suggesting that it has accurately characterized the dynamics of unobserved, time-varying states. This experiment validates the effectiveness of CE-EM in identifying a non-linear dynamical model of unobserved states that affect the forces acting on an aerobatic helicopter.

This paper presented an algorithm for system identification of non-linear systems given partial state observations. The algorithm optimizes system parameters given a time-history of observations by iteratively finding the most likely state-history and then using it to optimize the system parameters. The approach is particularly well suited to high-dimensional and nearly deterministic problems. In simulated experiments on a partially observed system of coupled Lorenz attractors, we showed that CE-EM can perform identification on a problem that particle-based EM methods are ill-suited for. However, we also find that CE-EM yields biased parameter estimates in the presence of large process noise. This bias can be partially mitigated by locally approximating the posterior state-marginals as Gaussian, as is done by MAP-EM (Goodwin & Aguero, 2005). We then used the algorithm to model the time-varying hidden states that affect the dynamics of an aerobatic helicopter. The model trained with CE-EM outperforms state-of-the-art methods because it is able to fit large non-linear models to unobserved states.

Numerous system-identification problems can be studied using CE-EM. Recently, there have been tremendous efforts to characterize predictive models for the spread of COVID-19 (Fanelli & Piazza, 2020). Limited capacity for testing the prevalence of the disease makes relevant states partially observed, and thus CE-EM may be useful for its modeling. We also hope to apply CE-EM to very high-dimensional systems with sparsely coupled dynamics using general-form consensus optimization (Boyd et al., 2011).

In this work we use the dataset gathered by Abbeel et al. (2010), available at http://heli.stanford.edu/. A gas-powered helicopter was flown by a professional pilot to collect a large dataset of 6290 s of flight. There are four controls: the longitudinal and lateral cyclic pitch, the tail rotor pitch, and the collective pitch. The state is measured using an accelerometer, a gyroscope, a magnetometer, and vision cameras. Abbeel et al. (2010) provide the raw data, as well as state estimates in the Earth reference frame obtained with extended Kalman smoothing. Following Punjani & Abbeel (2015), we use the fused sensor data and downsample it from 100 Hz to 50 Hz.
From the Earth frame accelerations provided in the dataset, we compute body frame accelerations (minus gyroscopic terms), which are the prediction targets for our training. Using the notation from Punjani & Abbeel (2015), we can write the helicopter dynamics in the following form:

ṙ = C_{12} v,
q̇ = (1/2) q ⊗ [0, ω],
v̇ = C_{12}^⊤ g − ω × v + f_v(s, δ),
ω̇ = f_ω(s, δ),

where s ∈ R^{13} is the helicopter state consisting of its position r, quaternion-attitude q, linear velocity v, and angular velocity ω, and δ ∈ R^4 is the control command. C_{12} is the rotation matrix from the body to the Earth reference frame, and f_v and f_ω are the linear and angular accelerations caused by aerodynamic forces, which are what we aim to predict. The above notation is related to that used in Section 4.2 as follows:

• We define u as the concatenation of all inputs to the model, including the relevant state variables v and ω and the control commands δ.
• We define y as the predicted output, which corresponds to a concatenation of f_v and f_ω.
• We define x as the vector of unobserved flow states to be estimated, which is not present in their model.

The processed dataset used in our experiments can be found at https://doi.org/10.5281/zenodo.3662987 (Menda et al., 2020).

Neural networks in the naive and H25 models have eight hidden layers of size 32 each, and tanh non-linearities. We optimize these models using an Adam optimizer (Kingma & Ba, 2015) with a harmonic learning-rate decay and a minibatch size of 512. The neural network in the NL model has two hidden layers of size 32 each, and tanh non-linearities. We train the NL model with CE-EM, using ρ_x = ρ_θ = 0.5 and σ_w = σ_v = 1.0, and use an Adam optimizer to optimize Equation (11) in the learning step. The learning rate for the dynamics parameters in θ_NL is 5.0 × 10⁻⁴ and for the observation parameters in θ_NL it is 1.0 × 10⁻³. For its relative robustness, in the smoothing step we optimize Equation (10) using a non-linear least-squares optimizer with a Trust-Region Reflective algorithm (Jones et al., 2001). This step can be solved very efficiently by providing the solver with the block-diagonal sparsity pattern of the Jacobian matrix.

To evaluate the test metric, running an EKF is required. The output of an EKF depends on several user-provided parameters:

• x_0: the value of the initial state,
• Σ_0: the covariance of the error on the initial state,
• Q: the covariance of the process noise, and
• R: the covariance of the observation noise.

In this work, we assume that Q, R, and Σ_0 are all set to the identity matrix, and x_0 is assumed to be 0 in all dimensions. A well-tuned EKF with an inaccurate initial state value converges to accurate estimates after only a few time-steps of transient behavior. Since the H25 model needs 25 past inputs before making its first output prediction, we drop the first 25 predictions from the EKF when computing the RMSE, thereby omitting some of the transient regime.
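The evaluation loop for the latent-state models can be sketched as follows for the linear SID case, where the EKF reduces to a standard Kalman filter: at each step the output is predicted from the prior state estimate and logged for the RMSE, and the state is then corrected with the measured residual before being propagated forward. The identity noise covariances, zero initial state, and dropping of the first 25 predictions follow the protocol above; the matrix names and dimensions are placeholders, and the NL model would additionally require linearizing its observation network at the current state estimate.

import numpy as np

def prediction_rmse_linear(A, B, C, D, u, y_res, drop=25):
    """Kalman-filter evaluation of a linear latent-state model: predict, score, correct."""
    n, m = A.shape[0], C.shape[0]
    x, P = np.zeros(n), np.eye(n)                  # x_0 = 0, Sigma_0 = I
    Q, R = np.eye(n), np.eye(m)                    # identity noise covariances
    errs = []
    for t in range(len(y_res)):
        # Predict the output from the current (prior) state estimate.
        y_hat = C @ x + D @ u[t]
        errs.append(y_res[t] - y_hat)
        # Correct the state estimate with the measured residual.
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.solve(S, np.eye(m))
        x = x + K @ (y_res[t] - y_hat)
        P = (np.eye(n) - K @ C) @ P
        # Propagate with the identified linear dynamics.
        x = A @ x + B @ u[t]
        P = A @ P @ A.T + Q
    errs = np.array(errs[drop:])                   # drop the first predictions
    return np.sqrt(np.mean(np.sum(errs ** 2, axis=1)))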
References

Autonomous helicopter aerobatics through apprenticeship learning.
Generalized Kalman smoothing: Modeling and algorithms.
A recurrent neural network for modelling dynamical systems.
Order within Chaos: Towards a Deterministic Approach to Turbulence.
A new computationally efficient fixed-interval, discrete-time smoother.
Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains.
Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control.
A classification EM algorithm for clustering and two stochastic versions.
Experiments in fixed-wing UAV perching.
Convergence of a stochastic approximation version of the EM algorithm.
Maximum likelihood from incomplete data via the EM algorithm.
On sequential Monte Carlo sampling methods for Bayesian filtering.
Analysis and forecast of COVID-19 spreading in China, Italy and France.
Implementing the Nelder-Mead simplex algorithm with adaptive parameters.
Parameter estimation for linear dynamical systems.
Learning nonlinear dynamical systems using an EM algorithm.
Monte Carlo smoothing for nonlinear time series.
Approximate EM algorithms for parameter and state estimation in nonlinear stochastic models.
On the convergence of the block nonlinear Gauss-Seidel method under convex constraints.
A general framework for structured learning of mechanical systems.
Structured mechanical models for robot learning and control.
Survey of applications of identification in chemical and physical processes.
LSTM can solve hard long time lag problems.
On resampling algorithms for particle filters.
From model-based control to data-driven control: Survey, classification and perspective.
SciPy: Open source scientific tools for Python.
On particle methods for parameter estimation in state-space models.
A method for stochastic optimization.
Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control.
A method for the solution of certain nonlinear problems in least squares.
An efficient stochastic approximation EM algorithm using conditional particle filters.
Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning.
On the limited memory BFGS method for large scale optimization.
System identification.
Perspectives on system identification.
Deep Lagrangian networks for end-to-end learning of energy-based control for under-actuated systems.
Deep Lagrangian networks: Using physics as model prior for deep learning.
An algorithm for least-squares estimation of nonlinear parameters.
Normalized Stanford helicopter dataset.
Nested sequential Monte Carlo methods.
A view of the EM algorithm that justifies incremental, sparse, and other variants.
Learning of skid-steered kinematic and dynamic models for motion planning.
Deep learning helicopter dynamics models.
Maximum likelihood estimates of linear dynamic systems.
System identification of nonlinear state-space models.
A primer on coordinate descent algorithms.
Obstacles to high-dimensional particle filtering.
Extended Kalman filter for estimation of parameters in nonlinear state-space models of biochemical networks.
Learning dynamical systems with particle stochastic approximation EM.
Learning nonlinear state-space models using smooth particle-filter-based likelihood approximations.
Detecting strange attractors in turbulence.
A generalized eigenvalue approach for solving Riccati equations.
N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems.
Modeling dynamical systems by recurrent neural networks.

This work is supported in part by DARPA under agreement number D17AP00032, and by the King Abdulaziz City for Science and Technology (KACST) through the Center of Excellence in Aeronautics and Astronautics.
The content is solely the responsibility of the authors and does not necessarily represent the official views of DARPA or KACST.