key: cord-0212359-ymykulqz authors: Amendola, Maddalena; Arcucci, Rossella; Mottet, Laetitia; Casas, Cesar Quilodran; Fan, Shiwei; Pain, Christopher; Linden, Paul; Guo, Yi-Ke title: Data Assimilation in the Latent Space of a Neural Network date: 2020-12-22 journal: nan DOI: nan sha: eb57971a35b53182abafe5679d946b58e142f78a doc_id: 212359 cord_uid: ymykulqz

There is an urgent need to build models that tackle Indoor Air Quality issues. Since such models should be both accurate and fast, Reduced Order Modelling techniques are used to reduce the dimensionality of the problem. The accuracy of the model, which represents a dynamic system, is improved by integrating real data coming from sensors using Data Assimilation techniques. In this paper, we formulate a new methodology called Latent Assimilation that combines Data Assimilation and Machine Learning. We use a Convolutional Neural Network to reduce the dimensionality of the problem, a Long Short-Term Memory network to build a surrogate model of the dynamic system and an Optimal Interpolated Kalman Filter to incorporate real data. Experimental results are provided for the CO2 concentration within an indoor space. This methodology can be used, for example, to predict in real time the load of a virus, such as SARS-CoV-2, in the air by linking it to the concentration of CO2.

Urbanisation is the process by which people move from rural to urban areas, changing their habits. This process grows year by year: about half of the global population already lives in urban areas and, by 2050, two-thirds of the world's people are expected to do so. Urbanisation has led to an increase in buildings, human activities and energy consumption, causing environmental degradation. High building densities and the low presence of vegetation impair air quality and circulation. People who live in such areas are hesitant to open the windows of their homes, thinking that this can lead to an increase of pollution inside. A solution from their point of view is to use air conditioning, thereby increasing energy consumption. This is a vicious cycle of increased urban emissions of heat, pollutants and greenhouse gases and an associated increase in energy demand. The scope of the MAGIC (Managing Air for Green Inner Cities) project is to study and build systems that assist the reduction of energy demand through natural ventilation [1]. To this aim, systems with high accuracy in predicting air flows and air pollution concentration are needed. These systems use the Large Eddy Simulation method within the Computational Fluid Dynamics (CFD) software Fluidity [2]. Fluidity is an open-source, general-purpose, multi-phase computational fluid dynamics code capable of numerically solving the Navier-Stokes equations and advection-diffusion equations on arbitrary unstructured finite-element meshes. Fluidity is used in a number of different scientific areas including geophysical fluid dynamics, ocean modelling, mantle convection and air pollution. Numerical simulation has been widely applied in many fields including environmental sciences, aerospace engineering, bio-medicine and industrial design, and it provides powerful technical support for solving industrial problems and carrying out scientific research in these fields. However, high-fidelity numerical simulations of complex systems consume vast time and computing resources. When real data collected by instruments (i.e. sensors) are available, it is possible to use them to improve the accuracy of the prediction.
This integration is performed through Data Assimilation techniques. Data Assimilation (DA) is an approach for fusing data (observations) with prior knowledge (e.g., mathematical representations of physical laws; model output) to obtain an estimate of the distribution of the true state of a process [3]. In order to perform DA, one needs observations (i.e., a data or measurement model), a background (i.e., an a priori state or process model) and information about the distribution of the errors on these two. For applications where the background is defined on large computational grids, which leads to a big data problem that is sometimes impossible to handle without introducing approximations or space reductions, Reduced Order Modelling (ROM) techniques are used [4, 5]. ROM speeds up both the dynamic model and the DA process. Popular approaches to reduce the domain are Principal Component Analysis (PCA) and the Empirical Orthogonal Functions (EOF) technique, both based on a Truncated Singular Value Decomposition (TSVD) analysis [6]. The simplicity and the analytic derivation of these approaches are the main reasons behind their popularity in atmospheric and ocean science. However, despite their power, the accuracy of the obtained solution exhibits a severe sensitivity to the choice of the truncation parameter. This issue introduces a severe drawback to the reliability of these approaches, and hence to their usability in operational software in different scenarios [7]. An approach that reduces the dimensionality while maintaining the information content of the data is the Neural Network (NN), specifically the AutoEncoder [8, 9]. NNs can theoretically approximate almost any unknown function, which is what makes it possible for them to tackle complex problems. AutoEncoders with non-linear encoder and non-linear decoder functions can thus learn a more powerful non-linear generalisation of methods based on TSVD. In the latent space, the evolution in time of the transformed state variables can be learned using Recurrent Neural Networks (RNNs) [10, 11]. In the present work, we propose a new methodology which we call Latent Assimilation (LA). It consists in reducing the dimensionality with a NN and performing both the prediction, through a surrogate dynamic model, and the DA directly in the latent space. In the latent space, the surrogate dynamic system is built by an RNN. The future challenges of Numerical Weather Prediction (NWP) include, amongst others, producing more accurate initial conditions that take advantage of the increasing volume of real-time observations, and improving the post-processing of model outputs [12]. To answer this need, neural networks (NNs) for the correction of forecast errors have been extensively studied [13, 14, 15]. However, the error correction by a NN does not have a direct relation with the updated model system at each step, and the training is not performed on the results of the assimilation process. A framework for the integration of NNs with physical models by Data Assimilation (DA) algorithms is described in [16]: the NNs are iteratively trained when observed data are updated. Unfortunately, this approach presents a limit due to the time complexity of the numerical models involved, which restricts the use of the forecast model for large data problems. An approach for employing artificial neural networks (NNs) to emulate the Local Ensemble Transform Kalman Filter (LETKF) as a method of data assimilation is presented in [17].
Deep learning and Data Assimilation technologies are also combined to predict the production of gas from mature gas wells in [18]. The authors used a modified deep Long Short-Term Memory (LSTM) model as the prediction model within an Ensemble Kalman Filter framework for parameter estimation. A Neural Network is integrated into a conventional DA system in [16]: deep learning shows great advantage in approximating functions whose model is unknown and strongly non-linear. The authors used NNs to characterise the structural model uncertainty; the NN is implemented in an End-to-End (E2E) approach and its parameters are iteratively updated with incoming observations by applying the DA method. A framework which performs fast data assimilation with sufficient accuracy for the open ocean is proposed in [19]. The speed improvement is achieved by performing the data assimilation on a reduced space rather than on the full space: a dimension reduction of the full state is performed by an Empirical Orthogonal Function (EOF) analysis while retaining most of the explained variance. Analysis of EOFs can be used to identify structures in geophysical data which hold a large part of the variance. In this framework, the assimilation is performed in the control space. EOF analysis has become a fundamental tool in atmosphere, ocean and climate science for data diagnostics and dynamical mode reduction. Each of these applications exploits the fact that EOFs allow a decomposition of a data function into a set of orthogonal functions, which are designed so that only a few of them are needed in lower-dimensional approximations. Furthermore, since EOFs are the eigenvectors of the error covariance matrix, its condition number is reduced as well. Nevertheless, the accuracy of the solution obtained by truncating EOFs exhibits a severe sensitivity to the choice of the truncation parameter, so that a suitable choice of the number of EOFs is strongly recommended. This issue introduces a severe drawback to the reliability of EOF truncation, and hence to the usability of the operational software in different scenarios. A powerful solution to this is to use a Tikhonov regularisation, which proves to be more appropriate than the truncation of EOFs [4]. Neural networks have a tremendous ability to fit functions and can theoretically approximate almost any unknown function; this is what makes it possible for neural networks to model complex flows. In [20], the complex matrix computations are reduced by factorising the representation, deriving a latent state used by the Kalman Filter; the authors also used a linear dynamic model to compute, i.e. predict, the next timestep. A variational AutoEncoder capable of generating trajectories from a latent space where the dynamics is linear is presented in [21]. In this paper, we propose a new methodology that uses NNs to reduce the space and performs the assimilation of the sensor data in the latent space. Specifically, we use a Convolutional AutoEncoder to reduce the domain and we perform an Optimal Interpolated Kalman Filter in the latent space. In this paper, we make the following contributions:

• We have designed a novel data assimilation technology, which we call Latent Assimilation (LA), mainly composed of an AutoEncoder, a surrogate model and an Optimal Kalman Filter. The Latent Assimilation model performs the prediction of the flows and the assimilation of observed data through a Kalman Filter in the latent space.
• We have developed a Convolutional AutoEncoder to reduce the space in which the surrogate model works and in which we perform the assimilation of the observations using the Optimal Interpolated Kalman Filter. We have chosen to use an encoder-decoder model instead of Principal Component Analysis (PCA) since neural networks maintain non-linearities and perform better in modelling flows;

• We have built a Recurrent Neural Network (LSTM) to emulate a Computational Fluid Dynamics (CFD) simulation in the latent space of an AutoEncoder: the trained LSTM represents the surrogate model used to predict the CO2 concentration in a room;

• We show that our novel Latent Assimilation model answers the needs of accuracy, stability and efficiency required by real-time applications;

• We have developed software written in Python to test the Latent Assimilation model. The LA code and the pre-processed data can be downloaded from: https://github.com/DL-WG/LatentAssimilation.

Experimental results are provided for pollutant dispersion within an indoor space. This methodology can be used, for example, to predict in real time the load of a virus, such as SARS-CoV-2, in indoor spaces by linking it to the concentration of CO2 [22].

In this section, we introduce the concept of Data Assimilation (DA) and the Kalman Filter (KF), which is one of the most used approaches for DA. DA merges the estimated state x_t ∈ R^n of a discrete-time dynamic process at time t:

$$x_{t+1} = M_{t+1} x_t + w_t \qquad (1)$$

with an observation y_t ∈ R^m:

$$y_t = H_t x_t + v_t \qquad (2)$$

where M_{t+1} is a linear dynamic operator and H_t is the observation operator. The vectors w_t and v_t represent the process and observation errors, respectively. They are usually assumed to be independent, white-noise processes with Gaussian probability distributions:

$$w_t \sim \mathcal{N}(0, Q_t), \qquad v_t \sim \mathcal{N}(0, R_t)$$

where Q_t and R_t are called the error covariance matrices of the model and of the observations, respectively. DA tries to answer questions such as "what can be said about the value of an unknown variable x_t that represents the evolution of a system, if we have some measured data y_t and a model M of the underlying mechanism that generated the data?". This is the Bayesian context, where we seek a quantification of the uncertainty in our knowledge of the parameters that, according to Bayes' rule, takes the form

$$p(x_t \mid y_t) = \frac{p(y_t \mid x_t)\, p(x_t)}{p(y_t)}$$

Here, the physical model is represented by the conditional probability (also known as the likelihood) p(y_t | x_t), and the prior knowledge of the system by the term p(x_t). The denominator is considered as a normalising factor and represents the total probability of y_t. DA is a Bayesian inference that combines the state x_t with y_t at each given time. Bayes' theorem leads to the estimation of x^a_t, which maximises a probability density function given the observation y_t and a prior from x_t. This approach is implemented in one of the most popular DA methods, the Kalman Filter (KF) [23], which mainly consists of two steps: a prediction step (equation (4)) and a correction step (equations (5)-(6)). The goal of the KF is to compute an optimal a posteriori estimate, x^a_t, which is a linear combination of an a priori estimate, x_t, and a weighted difference between the actual measurement, y_t, and the measurement prediction, H_t x_t, as described in equation (6).

1. Prediction:
$$x_t = M_t\, x^a_{t-1} \qquad (4)$$

2. Correction:
$$K_t = Q_t H_t^T \left(H_t Q_t H_t^T + R_t\right)^{-1} \qquad (5)$$
$$x^a_t = x_t + K_t\left(y_t - H_t x_t\right) \qquad (6)$$

For big data problems, KF is usually implemented in a simplified version as an Optimal Interpolation method [24], for which the covariance matrix Q_t = Q is fixed at each timestep t.
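To make the prediction-correction cycle concrete, the sketch below implements equations (4)-(6) with NumPy for a single assimilation step, using the Optimal Interpolation simplification in which the background covariance Q is kept fixed. All array names, dimensions and values are illustrative placeholders, not quantities taken from the paper.

```python
import numpy as np

def kalman_step(x_prev_a, y, M, H, Q, R):
    """One prediction-correction cycle of the (Optimal Interpolation) Kalman Filter.

    x_prev_a : previous analysis state x^a_{t-1}, shape (n,)
    y        : observation y_t, shape (m,)
    M        : linear dynamic operator, shape (n, n)
    H        : observation operator, shape (m, n)
    Q        : (fixed) background error covariance, shape (n, n)
    R        : observation error covariance, shape (m, m)
    """
    # Prediction (eq. 4): propagate the previous analysis with the dynamic model.
    x_b = M @ x_prev_a
    # Kalman gain (eq. 5).
    K = Q @ H.T @ np.linalg.inv(H @ Q @ H.T + R)
    # Correction (eq. 6): weighted update with the innovation y - Hx.
    return x_b + K @ (y - H @ x_b)

# Toy example: a 4-dimensional state observed at 2 locations.
rng = np.random.default_rng(0)
n, m = 4, 2
M = np.eye(n) + 0.01 * rng.standard_normal((n, n))
H = np.zeros((m, n)); H[0, 1] = H[1, 3] = 1.0   # observe components 1 and 3
Q = 0.1 * np.eye(n)
R = 0.01 * np.eye(m)
print(kalman_step(np.ones(n), np.array([1.2, 0.8]), M, H, Q, R))
```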
The prediction-correction cycle is complex and time-consuming, and it mandates the introduction of simplifications, approximations or data-reduction techniques. In the next section, we present the Latent Assimilation approach, which consists in performing the KF in the latent space of an AutoEncoder with non-linear encoder and non-linear decoder functions. In the latent space, the dynamic system in equation (4) is replaced by a surrogate model built with an RNN.

Latent Assimilation is a model that implements the idea of assimilating real data in the latent space of a Neural Network (NN). Instead of using PCA or other mathematical approaches to reduce the space, we model the reduction with non-linear transformations using deep NNs. Specifically, we choose a Convolutional AutoEncoder to reduce the space. The model is divided into four main parts:

1. Dimensionality reduction: the physical space is transformed into a latent space of smaller dimension by a Convolutional AutoEncoder;
2. Surrogate model: a surrogate of the CFD is built in the latent space by a Recurrent Neural Network;
3. Data Assimilation: observed data are assimilated into the surrogate of the CFD by a Kalman Filter;
4. Physical space: the results of the DA in the latent space are reported back in the physical space through a Decoder.

Figure 1 shows the workflow of the Latent Assimilation model. Let us assume that we want to predict the state of the system at time t, and that the LSTM needs one observation back to predict the next timestep. The input of the system is the state x_{t-1}. We encode x_{t-1}, obtaining h_{t-1}, and the LSTM predicts the latent state h_t from it. To perform the Kalman Filter, we need the observation ŷ_t at timestep t. We encode ŷ_t and we combine the result, ĥ_t, with the prediction h_t through the KF. The result h^a_t is the updated prediction. We report the updated prediction back in the physical space through the Decoder, producing x^a_t.

The dimensionality reduction is implemented by an AutoEncoder (AE). AEs are usually used for dimensionality reduction or feature learning. To use an autoencoder for dimensionality reduction, the encoder function must return an output of lower dimension than the input: this kind of autoencoder is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data. One type of undercomplete autoencoder is the Convolutional autoencoder which, as the name suggests, uses the convolution operation. Thanks to the convolution operation, the network takes the spatial information into account: such networks are especially used with images or grid data. Usually, Convolutional AutoEncoders are composed of more than one convolutional layer, each followed by a pooling layer to reduce the input [25]. Latent Assimilation implements a Convolutional AutoEncoder which produces a representation of the state vector x_t ∈ R^n in (1) as a "latent" state vector h_t ∈ R^p defined in a latent space, where p < n. We denote with f : R^n → R^p the Encoder function which transforms the state x_t into a latent variable h_t. In the latent space, we perform a regression through a Long Short-Term Memory (LSTM) function l:

$$h_{t+1} = l(h_{t,q}) \qquad (8)$$

where h_{t,q} = {h_i}_{i=t,...,t-q} is a sequence of q encoded timesteps up to time t. The LSTM is a Recurrent Neural Network (RNN) with good performance on time-series data [26].
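Putting the pieces introduced so far together, one cycle of the workflow in Figure 1 composes the Encoder f, the LSTM surrogate l, the latent Kalman Filter and the Decoder described in the remainder of this section. The sketch below expresses this composition in Python; the encoder, lstm, kalman_update and decoder arguments stand for the trained networks and the latent-space filter, so their signatures are assumptions rather than the actual implementation.

```python
import numpy as np

def latent_assimilation_step(x_window, y_obs, encoder, lstm, kalman_update, decoder):
    """One Latent Assimilation cycle (see Figure 1).

    x_window : last q physical states, shape (q, nx, ny), the LSTM look-back window
    y_obs    : interpolated observation field at time t, shape (nx, ny)
    """
    # 1. Dimensionality reduction: move the physical states to the latent space.
    h_window = np.stack([encoder(x) for x in x_window])   # shape (q, p)
    # 2. Surrogate model: predict the next latent state h_t with the LSTM (eq. 8).
    h_t = lstm(h_window)                                   # shape (p,)
    # 3. Data Assimilation: encode the observation and correct h_t with the KF.
    h_obs = encoder(y_obs)                                 # shape (p,)
    h_a = kalman_update(h_t, h_obs)
    # 4. Physical space: decode the analysis back to the full grid.
    return decoder(h_a)                                    # shape (nx, ny)
```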
The LSTM is composed of gates and cells, as shown in Figure 3: the gates decide which information should pass using a sigmoid function. More precisely, the LSTM is composed of four elements, described below. In all formulas, b, U and W denote respectively the biases, the input weights and the recurrent weights of the corresponding gate, h^(t) is the input and u^(t) is the hidden state.

1. Forget Gate: it decides which information should be discarded using a sigmoid function on the previous hidden state and the current input:
$$f^{(t)} = \sigma\!\left(b^f + U^f h^{(t)} + W^f u^{(t-1)}\right)$$

2. Input Gate: it is similar to the Forget Gate but with its own parameters:
$$i^{(t)} = \sigma\!\left(b^i + U^i h^{(t)} + W^i u^{(t-1)}\right)$$

3. Cell State: the cell state is then updated using sigmoid and hyperbolic tangent functions:
$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tanh\!\left(b^c + U^c h^{(t)} + W^c u^{(t-1)}\right)$$

4. Output Gate: it decides the value of the next hidden state using a sigmoid function on the previous hidden state and the current input, and a hyperbolic tangent function on the newly modified cell state:
$$o^{(t)} = \sigma\!\left(b^o + U^o h^{(t)} + W^o u^{(t-1)}\right), \qquad u^{(t)} = o^{(t)} \odot \tanh\!\left(c^{(t)}\right)$$

The assimilation is performed in the latent space. In order to merge the observations in (2) with the "latent" state vector h_t, the observations are processed by the Encoder in the same way as the state vector. As y_t ∈ R^m with usually m ≤ n, i.e. the observations are usually held or measured in just a few points in space, the observation vector y_t is interpolated in the state space R^n, obtaining ŷ_t ∈ R^n. The observations ŷ_t are then processed in the same way as the state vector through f:

$$\hat{h}_t = f(\hat{y}_t)$$

The "latent" observations ĥ_t, transformed by the Encoder into the latent space, are then assimilated by the prediction-correction steps described in equations (15)-(17) and shown in Figure 4:

$$h_t = l(h_{t-1,q}) \qquad (15)$$
$$\hat{K} = \hat{Q}\hat{H}^T\left(\hat{H}\hat{Q}\hat{H}^T + \hat{R}\right)^{-1} \qquad (16)$$
$$h^a_t = h_t + \hat{K}\left(\hat{h}_t - \hat{H} h_t\right) \qquad (17)$$

where l in (15) is the surrogate model defined in (8) computed by the LSTM, and Q̂ and R̂ are the error covariance matrices of the transformed background h_t and of the observations ĥ_t, respectively: they are computed directly in the latent space. The background covariance matrix Q̂ is computed with a sample of s model state forecasts {h_i}_{i=1,...,s} that we set aside as background, such that

$$V = \frac{1}{\sqrt{s-1}}\left(h_1 - \bar{h},\; \dots,\; h_s - \bar{h}\right) \qquad (18)$$

where h̄ is the mean of the sample of background states; then Q̂ = V V^T. The observation error covariance matrix R̂ can be computed with the same process as in equation (18), by replacing h_t with ĥ_t for all t, where h̄ is then the mean of the sample of observations, giving R̂ = V̂ V̂^T. Alternatively, the covariance matrix R̂ can be estimated from evaluations of the measurement (instrument) errors and fixed as R̂ = σI, where 0 < σ < 1 and I ∈ R^{p×p} denotes the identity matrix [24]. K̂ is the Kalman gain matrix defined in the latent space and Ĥ is the observation operator. The results of the DA in the latent space are then reported in the physical space through the Decoder, applying the function g : R^p → R^n to compute

$$x^a_t = g(h^a_t)$$

The Decoder is almost a mirror of the Encoder: it is composed of a Fully Connected Layer followed by some Convolutional Layers.

The code is written in Python and is available at the following link: https://github.com/DL-WG/LatentAssimilation. The LatentAssimilation folder is composed of different subfolders:

• DataSet: it contains the Structured dataset divided into train and test;
• PreProcess: it contains all the code written to extract the Structured dataset starting from the unstructured meshes. We used the Python libraries math, numpy, vtktools and pyvista;
• AutoEncoder and LSTM: both folders contain the code used to find the structure of the model and the hyper-parameters for the Structured dataset. All results are stored and also visualised in the AnalysisLS7 Jupyter notebook. We used Python libraries such as numpy, sklearn, pandas and tensorflow;
• Data Assimilation: it contains the observation data pre-processing, the Kalman Filter and the LatentAssimilation module, which performs the assimilation in the latent space and prints the table of results.
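As a complement to the description above, the sketch below gives a minimal NumPy version of the latent-space building blocks: Q̂ built from a sample of encoded forecasts as in equation (18), R̂ fixed to σI, and the correction of equations (16)-(17) with Ĥ taken as the identity. Names, sample sizes and values are illustrative and are not taken from the repository.

```python
import numpy as np

def sample_covariance(latent_samples):
    """Q_hat = V V^T, with V the centred matrix of latent samples (eq. 18)."""
    V = (latent_samples - latent_samples.mean(axis=0)).T / np.sqrt(len(latent_samples) - 1)
    return V @ V.T                                   # shape (p, p)

def latent_kalman_update(h_pred, h_obs, Q_hat, R_hat):
    """Correct the LSTM prediction h_pred with the encoded observation h_obs (eqs. 16-17)."""
    p = h_pred.size
    H_hat = np.eye(p)                                # observations are already in latent form
    K_hat = Q_hat @ H_hat.T @ np.linalg.inv(H_hat @ Q_hat @ H_hat.T + R_hat)
    return h_pred + K_hat @ (h_obs - H_hat @ h_pred)

# Toy usage with a latent space of size p = 7.
rng = np.random.default_rng(1)
background_sample = rng.standard_normal((50, 7))     # s = 50 encoded forecasts set aside as background
Q_hat = sample_covariance(background_sample)
R_hat = 0.0001 * np.eye(7)                           # fixed diagonal form, trusting the observations
h_a = latent_kalman_update(rng.standard_normal(7), rng.standard_normal(7), Q_hat, R_hat)
print(h_a.shape)
```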
In the next section, we apply Latent Assimilation to the problem of assimilating data to improve the prediction of air flows and indoor pollution transport in a real scenario [1]. We show the performance of the model step by step and we compare the results with a standard DA performed in the physical space. The LA model presented in Section 4 is applied to real data collected in the context of the MAGIC project [1]: external and internal air quality measurements were performed within a naturally ventilated office room located on the top floor of the three-storey Clarence Centre building, Borough of Southwark, London, UK (Figure 6). The room has two windows facing a busy road (London Road), one window facing a traffic-free courtyard and a skylight in the ceiling. Seven sensors located in different positions were used to record, amongst others, the indoor temperature and CO2 concentration, with a sampling rate of 1 minute. The three windows were opened for 25 minutes to look at the cross-ventilation effect on the decay of temperature and CO2 concentration. During the whole period of the experiment, the predominant wind was south-westerly. To replicate the field study experiment, a numerical simulation was performed using the Computational Fluid Dynamics (CFD) software Fluidity (http://fluidityproject.github.io/). The same CFD simulation was used in a previous paper [27] and only the main details of the CFD setup are recalled here. The computational domain includes the Clarence Centre building as well as the immediate upwind building, and the test room office, as shown in Figure 7, in order to replicate the cross-ventilation scenario of the field study. The generated mesh is an unstructured tetrahedral mesh composed of 400,000 nodes (Figure 7). The initial and boundary conditions are set to replicate the experimental conditions and are derived from the indoor sensors and the weather station used during the field study. The initial indoor CO2 concentration is set equal to 1420 ppm, while the outdoor background CO2 level is set equal to 400 ppm. The initial indoor and outdoor temperatures are equal to 19.5 °C and 9.1 °C, respectively. The inlet velocity follows a log-law profile, reaching 2.58 m/s at 28.5 m height. The simulation was run in parallel on 20 CPUs and 15 minutes were simulated, producing approximately 3,500 timesteps. In this paper, the working variable of interest is the CO2 concentration. It is worth noting that after timestep 2,500 the concentration of CO2 is low everywhere in the room, since the room is completely ventilated. The data generated by the CFD simulation are stored on an unstructured mesh and need to be converted into structured data in order to apply the LA model presented in Section 4. Indeed, convolutional kernels work on the assumption that adjacent states are equally spaced. As a first step, only the nodes located within the test room were selected to work with, thus excluding the rest of the domain from the working dataset, as shown in Figure 8. As a second step, we chose to extract and work with the data from a 2D slice located at half the height of the room: this location is a good compromise between the different heights of the sensors used during the field experiment. Finally, two different pre-processing approaches were adopted:

• "Structured dataset": data from the unstructured 2D slice are projected on a structured grid. The "Structured dataset" is generated by interpolating the CO2 concentration values of the unstructured grid onto a structured grid. The values stored in the final matrix correspond to actual values of CO2 concentration, as shown in Figure 9.

• "RGB dataset": data from the unstructured 2D slice are directly converted into an RGB image: a screenshot of the 2D slice coloured by the CO2 concentration values is created. The scalar bar of the RGB images is set based on the minimum and maximum CO2 concentrations, i.e. 400 ppm and 1420 ppm, respectively. This transformation allows moving from the unstructured mesh to structured data, since the RGB image is a 3D structured matrix of pixel values. The values stored in the final matrix correspond to RGB values between 0 and 255, as shown in Figure 10.

Both pre-processing approaches are performed for each timestep and the final size of the working matrix is a 180×250 regular grid. As a final step, all the data are normalised between 0 and 1: the "RGB dataset" is divided by 255, while the "Structured dataset" is normalised based on the minimum and maximum values of CO2 concentration. Based on the sampling rate of the sensors, 10 CFD outputs were selected, corresponding to time levels for which sensor data are available. As a first pre-processing step, considering that the area of influence of one sensor has a radius of about 15 cm, a zone of 10 pixels × 10 pixels centred on the sensor location is defined and the value given by the sensor at that location is assigned to this whole area. The rendering of this process is shown in Figure 11. The second step consists in linearly interpolating the values of the sensors onto the entire 2D structured grid, as shown in Figure 12. This pre-processing is performed to be consistent with both the "Structured dataset" and the "RGB dataset", i.e. it is done in terms of CO2 concentration and in terms of RGB values scaled using the CO2 concentration, respectively. As a final step, all the data are normalised between 0 and 1, as for the CFD pre-processing.
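The sketch below illustrates these pre-processing steps on placeholder arrays: unstructured slice values are linearly interpolated onto the 180×250 regular grid (here with scipy.interpolate.griddata rather than the vtktools/pyvista pipeline used in the repository) and then min-max normalised to [0, 1]; an RGB screenshot would instead simply be divided by 255.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical unstructured slice: node coordinates (N, 2) and CO2 values (N,) in ppm.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(5000, 2))
co2 = rng.uniform(400.0, 1420.0, size=5000)

# Project onto a 180 x 250 structured grid (the "Structured dataset").
gy, gx = np.mgrid[0:1:180j, 0:1:250j]
structured = griddata(points, co2, (gx, gy), method="linear")
structured = np.nan_to_num(structured, nan=400.0)    # fill gaps outside the convex hull with background CO2

# Normalise between 0 and 1 using the min/max CO2 concentrations (400 and 1420 ppm).
co2_min, co2_max = 400.0, 1420.0
structured_norm = (structured - co2_min) / (co2_max - co2_min)

# An RGB image of the same slice would be normalised by dividing the pixel values by 255.
rgb_image = rng.integers(0, 256, size=(180, 250, 3)).astype(np.float32)
rgb_norm = rgb_image / 255.0
print(structured_norm.shape, rgb_norm.shape)
```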
The "Structured dataset" is generated by interpolating the CO 2 concentration values of the unstructured grid on a structured grid. The values stored in the final matrix corresponds to actual values of CO 2 concentration as shown in Figure 9 . • "RGB dataset" Data from the unstructured 2D slice are directly converted into a RGB image: a screenshot of the 2D slice coloured based on the CO 2 concentration values is created. The scalar bar of the RGB images is set based on the minimum and maximum CO 2 concentration, i.e. 400 ppm and 1420 ppm, respectively. This transformation allows to move from unstructured mesh to structured data since the RGB image is a 3D structured matrix of pixel values. The values stored in the final matrix corresponds to RGB values being between 0 and 255 as shown in Figure 10 . Both pre-processing approaches are performed for each timestep and the final size of the working matrix is a 180×250 regular grid. As a final step, all the data are normalised between 0 and 1. The "RGB dataset" is divided by 255, while the "Structured dataset" is normalised based on the minimum and maximum values of CO 2 concentration. Based on the sampling rate of the sensors, 10 CFD output were selected corresponding to time levels for which we have sensors data. As a first preprocessing step, considering that the area of influence of one sensor has a radius of about 15 cm, a zone of 10 pixels × 10 pixels centred on the sensor location is defined and the value given by the sensor at that location is assigned to this whole area. The rendering of this process is shown in Figure 11 . The second step consists in interpolating linearly the values of the sensors to the entire 2D structured grid as shown in Figure 12 . This pre-processing is performed to be consistent with both the "Structured dataset" and the "RGB dataset", i.e. is done in terms of CO 2 concentration and RGB values scaled using the CO 2 concentration, respectively. As a final step, all the data are normalised between 0 and 1 as for the CFD pre-processing. In this section, the procedure to determine the optimal network architectures of both the AutoEncoder and the LSTM is first presented. Then the results of the novel Latent Assimilation model developed in this paper are discussed based on the assimilation of the sensors data in both latent and physical space. The data set is decomposed into training, validation and testing sets. In the CFD simulation, the flow field and the associated CO 2 concentration does not change much between consecutive steps. For this reason, we decide to divide the data in training, validation and testing sets making jumps. First, the CFD output at timesteps corresponding to sensors data are excluded and are assigned to the testing set. For the remaining data, two consecutive timesteps are considered for the training, then a jump is performed. The jumped data, i.e the ones not considered yet, are assigned to the validation and testing sets alternately. Considering a jump equal to 1, this process is summarised in Figure 13 . The AutoEncoder implemented in this paper is a Convolutional AutoEncoder (CAE). Specifically, the encoder is composed by several convolutional layers followed by a flattened layer then a regular densely-connected layer which determine the shape of the latent space. In our CAE, the decoder architecture has almost the same structure than the encoder one: indeed, an additional convolutional layer is used in the decoder. 
The AutoEncoder implemented in this paper is a Convolutional AutoEncoder (CAE). Specifically, the encoder is composed of several convolutional layers followed by a flatten layer and then a regular densely-connected layer which determines the shape of the latent space. In our CAE, the decoder architecture has almost the same structure as the encoder: indeed, an additional convolutional layer is used in the decoder. Finding the optimal construction of the CAE architecture is divided into two steps: (1) finding the optimal number and structure of layers; (2) a grid search to find the optimal hyperparameters. For each CAE network architecture tested, a 5-fold cross-validation is performed, for which the data are shuffled to make the neural network independent of the order of the data. Both the training and validation sets are used for the cross-validation. The evaluation of the CAE network architecture is based on the mean and the standard deviation of the Mean Squared Error between the CAE prediction and the CFD output, i.e. Mean-MSE and Std-MSE, respectively; the mean and the standard deviation of the Mean Absolute Error between the CAE prediction and the CFD output, i.e. Mean-MAE and Std-MAE; and the mean and the standard deviation of the CAE execution time, i.e. Mean-Time and Std-Time. A low MSE/MAE standard deviation reflects that the model is stable and does not depend on the data used to train and validate it, while a low MSE/MAE means that the prediction is close to the real input, i.e. has a good accuracy. The baseline CAE network architecture uses the following fixed parameters:

• Convolutional layers parameters
- Number of filters: 32
- Activation function: sigmoid for the last decoder layer, to restrict the output to the range [0, 1] as for the input; Rectified Linear Unit (ReLU) otherwise.

• Regular densely-connected layer
- Latent space size: 7

• Training configuration
- Loss, metrics and optimiser: Adam, learning rate of 1×10⁻³. Adam optimisation is a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments, well suited to problems with large data.
- Number of epochs: 300
- Batch size: 32

The following different CAE network architectures are tested using the "Structured dataset" in order to find the optimal number and structure of layers.

Number of layers. Comparing configurations 1, 2 and 3, for which only the number of convolutional layers changes, configuration 2, i.e. 4 convolutional layers for the encoder and 5 convolutional layers for the decoder, is the one highlighting the best performance in terms of both Mean-MSE and Mean-MAE, with an MSE two orders of magnitude lower than configurations 1 and 3. Moreover, configuration 2 is the most stable regarding the standard deviations, reflecting that this CAE network architecture does not depend on the data used to train and validate it. In addition, the execution time of configuration 2 is acceptable for real-time problems. Hence, in the following, the number of layers is taken to be the same as configuration 2: 4 for the encoder and 5 for the decoder.

The accuracy (Mean-MSE/Mean-MAE) and the stability (Std-MSE/Std-MAE) are slightly better, while the execution time is slightly longer, when using convolutional layers (config. 2) rather than transpose convolutional layers (config. 4) in the decoder. As no major improvement in terms of MSE/MAE is observed when switching from convolutional (config. 2) to transpose convolutional layers in the decoder (config. 4), convolutional layers are used for the decoder.

Size of the kernel. Configurations 2, 5 and 6 have the same number of layers and only the kernel size is changed. Using a 5×5 kernel size for all the layers (config. 5) or using a mix of 3×3 and 5×5 kernel sizes (config. 6) both increase the MSE/MAE by two orders of magnitude compared to using a 3×3 kernel size for all the layers (config. 2). In addition, the execution cost is considerably increased, about 50% less efficient, when the kernel size is larger, as the complexity scales with k³ where k is the kernel size. Overall, 3×3 is used as the optimal kernel size in the following.
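To make the retained structure concrete, the following is a minimal Keras sketch of a CAE with 4 convolutional layers in the encoder, a dense bottleneck of size 7 and 5 convolutional layers in the decoder, using 3×3 kernels, ReLU activations and a sigmoid output, for the single-channel 180×250 "Structured dataset". The strides, the upsampling scheme and the final cropping are assumptions made so that the input and output shapes match; the published code should be consulted for the actual architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

LATENT_DIM = 7  # size of the latent space used throughout the paper

def build_cae(input_shape=(180, 250, 1), n_filters=64):
    """Convolutional AutoEncoder: 4 conv layers + dense bottleneck in the encoder,
    5 conv layers in the decoder, 3x3 kernels, ReLU activations, sigmoid output."""
    # Encoder: four strided 3x3 convolutions followed by the dense latent layer.
    enc_in = layers.Input(shape=input_shape)
    x = enc_in
    for _ in range(4):
        x = layers.Conv2D(n_filters, 3, strides=2, padding="same", activation="relu")(x)
    shape_before_flatten = tuple(x.shape[1:])            # (12, 16, n_filters) for a 180x250 input
    latent = layers.Dense(LATENT_DIM)(layers.Flatten()(x))
    encoder = models.Model(enc_in, latent, name="encoder")

    # Decoder: dense layer, then 5 convolutional layers (4 preceded by upsampling + 1 output layer).
    dec_in = layers.Input(shape=(LATENT_DIM,))
    y = layers.Dense(int(np.prod(shape_before_flatten)), activation="relu")(dec_in)
    y = layers.Reshape(shape_before_flatten)(y)
    for _ in range(4):
        y = layers.UpSampling2D(2)(y)
        y = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(input_shape[-1], 3, padding="same", activation="sigmoid")(y)  # output in [0, 1]
    y = layers.Cropping2D(((6, 6), (3, 3)))(y)           # 192x256 -> 180x250, back to the input grid
    decoder = models.Model(dec_in, y, name="decoder")

    cae = models.Model(enc_in, decoder(encoder(enc_in)), name="cae")
    cae.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse", metrics=["mae"])
    return encoder, decoder, cae

encoder, decoder, cae = build_cae()
cae.summary()
# Training would then follow the configuration reported above, e.g.
# cae.fit(x_train, x_train, validation_data=(x_val, x_val), epochs=300, batch_size=32)
```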
The grid search is now performed in order to find the optimal hyperparameters for both the "Structured dataset" and the "RGB dataset". The CAE network architecture uses the following fixed parameters:

• Convolutional layers parameters
- Number of encoder/decoder convolutional layers: 4 and 5
- Kernel size: 3×3

• Regular densely-connected layer
- Latent space size: 7

• Training configuration
- Adam optimiser, learning rate of 1×10⁻³

The hyperparameters tested for the grid search include the number of filters, the activation function, the number of epochs and the batch size. Table 2 shows the optimal hyperparameters found for each input dataset, while the evaluation performance is reported in Table 3. The optimal hyperparameters are the same for both datasets, i.e. 64 filters, a ReLU activation function and 400 epochs. Only the batch size differs: 32 for the "Structured dataset" and 16 for the "RGB dataset". The fact that the "RGB dataset" needs a smaller batch size than the "Structured dataset" can potentially be attributed to the fact that the former has 3 channels (the R, G and B colours). The results show that both datasets have very similar accuracy and stability: a low MSE and a low standard deviation, of the order of 10⁻⁵, meaning that the CAE does not depend on the set of inputs chosen to train it. Using the "RGB dataset" highlights better performance, with a Mean-MSE 57% lower than when using the "Structured dataset". However, using the "RGB dataset", more time is needed to train the CAE because an element of this dataset is composed of three channels, i.e. the R, G and B colour values.

Table 3: Convolutional AutoEncoder performance using the "Structured dataset" or the "RGB dataset" as input and the optimal hyperparameters found with the grid search (Table 2). Time is given in seconds.

In this model, the training set is used for the fitting step and the validation set for the validation step. An extra splitting of the training set is performed: the data are split into small sequences such that one timestep is predicted and 3 timesteps are used as "look back" values, as shown in Figure 14. All data are encoded with the AutoEncoder: each input sample of the LSTM is then a vector of 7 scalars. Finding the optimal construction of the LSTM architecture is divided into two steps: (1) finding the optimal number of layers; (2) a grid search to find the optimal hyperparameters. For each LSTM network architecture tested, the fitting and the evaluation of the model are repeated 5 times. As for the CAE network architecture, the evaluation is based on the mean and the standard deviation of the MSE, the MAE and the execution time. The baseline LSTM network architecture, tested with the "Structured dataset", uses fixed LSTM layer parameters; the results are reported in Table 4. The single-layer LSTM is the one highlighting the best accuracy, with the lowest Mean-MSE and Mean-MAE values. Indeed, the input of the LSTM consists of a 7×1 vector and adding more LSTM layers introduces an overfitting bias. In addition, the standard deviations, reflecting the stability, of the single-layer LSTM are about one order of magnitude lower than those of the other tested LSTMs. Finally, as expected, the single-layer LSTM is also the most efficient in terms of computational cost.
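A minimal sketch of this surrogate-model stage, under the same assumptions as the CAE sketch above: the encoded snapshots (vectors of 7 scalars) are arranged into look-back windows of 3 timesteps, as in Figure 14, and a single-layer LSTM is trained to predict the following latent state. The number of LSTM units and the activation shown here are placeholders; the values actually used come from the grid search described next.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

LATENT_DIM, LOOK_BACK = 7, 3

def make_sequences(latent_series, look_back=LOOK_BACK):
    """Split an encoded time series of shape (T, LATENT_DIM) into (inputs, targets):
    `look_back` consecutive latent states are used to predict the following one."""
    X, y = [], []
    for i in range(len(latent_series) - look_back):
        X.append(latent_series[i:i + look_back])
        y.append(latent_series[i + look_back])
    return np.array(X), np.array(y)

def build_surrogate(units=32):                        # number of units is a placeholder hyperparameter
    model = models.Sequential([
        layers.Input(shape=(LOOK_BACK, LATENT_DIM)),
        layers.LSTM(units, activation="elu"),         # single LSTM layer; activation shown for illustration
        layers.Dense(LATENT_DIM),                     # next latent state
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse", metrics=["mae"])
    return model

# Toy usage: in practice latent_series would be encoder.predict(training_snapshots).
latent_series = np.random.rand(500, LATENT_DIM).astype("float32")
X, y = make_sequences(latent_series)
surrogate = build_surrogate()
surrogate.fit(X, y, epochs=5, batch_size=16, verbose=0)   # 400 epochs in the reported configuration
```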
The grid search is now performed in order to find the optimal hyperparameters for both the "Structured dataset" and the "RGB dataset". The LSTM network architecture uses the following fixed parameters:

• LSTM layers parameters
- Number of layers: 1

• Regular densely-connected layer
- Latent space size: 7

• Training configuration
- Optimiser: Adam, learning rate of 1×10⁻³

The hyperparameters tested for the grid search include the number of neurons, the number of look-back observations, the activation function, the number of epochs and the batch size. Table 5 shows the optimal hyperparameters found for each input dataset, while the evaluation performance is reported in Table 6. The Exponential Linear Unit (ELU) appears to be the optimal activation function, with 400 epochs and a batch size of 16 for both input datasets. The results show that the "RGB dataset" needs more neurons and more look-back observations than the "Structured dataset". From Table 6, it can be seen that the LSTM with the "RGB dataset" as input has better accuracy and also takes less time than when using the "Structured dataset" as input. Indeed, the accuracy is about 45% higher when using the "RGB dataset" as input, while the execution time is reduced by approximately 27%.

Table 6: LSTM performance using the "Structured dataset" or the "RGB dataset" as input and the optimal hyperparameters found with the grid search (Table 5). Time is given in seconds.

In this section, the results of our novel Latent Assimilation (LA) model are presented: the assimilation takes place in the latent space. The testing set is considered and both datasets, i.e. the "Structured dataset" and the "RGB dataset", are encoded using the AutoEncoders with the optimal network architectures presented in Section 6.1.2. The predictions are performed through the LSTM and are updated using the corresponding observations through the Optimal Interpolated Kalman Filter (KF). In the KF, the error covariance matrix Q̂ is computed as Q̂ = V V^T, where V is defined in equation (18). Since both the predictions of the model and the observations are values of CO2 or pixels, i.e. the observations do not have to be transformed, the operator Ĥ is an identity matrix. We studied how the KF improves the accuracy of the prediction by testing different forms of the observation error covariance matrix R̂: computed using equation (18), or fixed as R̂ = 0.01I, 0.001I, 0.0001I, where I ∈ R^{p×p} denotes the identity matrix. This last assumption is usually made to give higher fidelity and trust to the observations [24]. The MSE between the background data and the observed data in the latent space for the "Structured dataset" and the "RGB dataset", without performing data assimilation, are 7.220 × 10⁻¹ and 5.447 × 10⁻¹, respectively. Table 7 shows the values of the MSE in the latent space between the assimilated data h^a_t and the observed data, as well as the execution time of the assimilation, for both input datasets. As expected, we can observe an improvement in the execution time of the assimilation when assuming R̂ to be a diagonal matrix instead of a full matrix. In addition, the assimilation increases the accuracy of the model, whatever the input dataset used, with MSE values about 2.2 times lower than without assimilation, highlighting that our novel Latent Assimilation model behaves as expected. Using R̂ as an identity matrix of the form 0.0001I improves the accuracy by up to 4 orders of magnitude.

Table 7: MSE values in the latent space and execution time of the assimilation (in seconds) of the Latent Assimilation model for different forms of the observation error covariance matrix R̂ in the latent space, when using the "Structured dataset" or the "RGB dataset" as input.
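In outline, the comparison of Table 7 can be reproduced with the sketch below: for each candidate form of R̂, the encoded observations are assimilated into the corresponding LSTM forecasts and the MSE in the latent space is computed, with and without assimilation. The arrays standing for the forecasts and the encoded sensor data are random placeholders, so the printed numbers are purely illustrative.

```python
import numpy as np

def sample_cov(samples):
    """V V^T with V the centred sample matrix (equation (18))."""
    V = (samples - samples.mean(axis=0)).T / np.sqrt(len(samples) - 1)
    return V @ V.T

def latent_mse_after_assimilation(h_pred, h_obs, Q_hat, R_hat):
    """Assimilate each encoded observation into the matching LSTM prediction
    (identity observation operator) and return the latent-space MSE."""
    K_hat = Q_hat @ np.linalg.inv(Q_hat + R_hat)           # Kalman gain with H_hat = I
    h_a = h_pred + (h_obs - h_pred) @ K_hat.T
    return float(np.mean((h_a - h_obs) ** 2))

rng = np.random.default_rng(2)
p = 7
h_pred = rng.standard_normal((10, p))                      # stand-in LSTM forecasts at the sensor timesteps
h_obs = h_pred + 0.5 * rng.standard_normal((10, p))        # stand-in encoded sensor observations
Q_hat = sample_cov(h_pred)

candidates = {
    "R computed with eq. (18)": sample_cov(h_obs),
    "R = 0.01 I": 0.01 * np.eye(p),
    "R = 0.001 I": 0.001 * np.eye(p),
    "R = 0.0001 I": 0.0001 * np.eye(p),
}
print("MSE without assimilation:", float(np.mean((h_pred - h_obs) ** 2)))
for name, R_hat in candidates.items():
    print(name, "->", latent_mse_after_assimilation(h_pred, h_obs, Q_hat, R_hat))
```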
After having performed the DA in the latent space, the results h^a_t are reported in the physical space through the decoder, which gives x^a_t. Figure 15 shows, in the physical space, the results of the assimilation for the timesteps 509, 1062 and 1485 using our novel LA model. The MSE in the physical space using our LA model is then compared with the one obtained using a standard Data Assimilation (sDA) procedure. sDA is performed in the physical space using a Kalman Filter approach (equations (4)-(6)), where R ∈ R^{n×n} is defined in the physical space. Table 8 shows the values of the MSE in the physical space between the assimilated data x^a_t and the observed data, as well as the execution time of the assimilation, for our LA model and for the standard methodology (sDA) when using the "Structured dataset" as input. The MSE between the background data and the observed data in the physical space, without performing data assimilation, is 6.491 × 10⁻². Both LA and sDA improve the accuracy of the forecasting, as shown in Table 8; however, it can be observed that the LA model is about 35% more accurate than the sDA model. In addition, LA performs better than sDA in terms of execution time: indeed, sDA works directly with big matrices, making it slower by six orders of magnitude.

Table 8: MSE values of x^a_t using our novel Latent Assimilation (LA) model or a standard Data Assimilation (sDA) procedure for different forms of the observation error covariance matrix R̂, when using the "Structured dataset" as input. MSEs are computed in the physical space. Execution time of the assimilation is given in seconds.

In this section, the impact of increasing the size of the latent space is discussed. Results are presented for the "Structured dataset" only. Table 9 and Table 10 give the MSE values of the Latent Assimilation model with different latent space sizes, from 1000 to 20000, computed in the latent and physical space, respectively. The column "No DA" reports the MSE values without the assimilation. Table 11 reports the execution time of the assimilation. Defining R̂ as an identity matrix always highlights better accuracy, whatever the latent space size. Increasing the latent space size tends to decrease the MSE, i.e. to gain accuracy, in both the latent and the physical space. Overall, a latent space size equal to 18000, which represents about 40% of the original data, seems optimal for this problem, whatever the form of the observation error covariance matrix R̂. However, the execution time of the assimilation can be up to 5 orders of magnitude higher when using the optimal latent space size compared to a lower size. Finding the optimal parameters of our LA depends on the expectations of the user, as a balance needs to be struck between accuracy and efficiency. Given the small accuracy gain obtained by increasing the latent space size, it is recommended to work with the smallest possible latent space in order to benefit from the best efficiency while still keeping a high accuracy.

In this paper, we proposed a new methodology called Latent Assimilation (LA) to efficiently and accurately perform Data Assimilation (DA). LA consists in performing the Optimal Kalman Filter in the latent space obtained by a Convolutional AutoEncoder with non-linear encoder and non-linear decoder functions. In the latent space, the dynamic system is represented by a surrogate model built with an LSTM network trained to emulate the dynamic system in the latent space.
The data from the dynamic model and the real data coming from the sensors are both processed through the AutoEncoder. We applied the methodology to a real test case and we have shown that LA performs better than a standard DA in terms of both accuracy and efficiency. The data of the real test case were time-series data representing the airflow within a naturally ventilated office room. The data were provided by a CFD simulation on an unstructured mesh and we pre-processed them to extract two different structured datasets: one composed of 2D matrices of CO2 concentration (the "Structured dataset") and the other composed of RGB images coloured by the CO2 concentration (the "RGB dataset"). We pre-processed the data coming from the sensors in the same manner. We tried different AutoEncoder configurations and we performed a grid search for both input datasets in order to determine the optimal configurations. The same was done for the LSTM, which acts as the surrogate model. We performed the assimilation in the latent space using the Latent Assimilation model with both datasets as input. We also tested the standard data assimilation in the physical space and we have shown that LA performs better in terms of both efficiency and accuracy. In conclusion, we have successfully proposed and developed a novel model able to assimilate data in the latent space, thus answering the needs of accuracy, stability and efficiency required by real-time systems. This methodology can be used, for example, to predict in real time the load of a virus, such as SARS-CoV-2, in indoor spaces by linking it to the concentration of CO2 [22]. There are different improvements that could be applied to the model so that it can be used in more challenging applications:

• Develop an implementation of LA to emulate a variational DA [24], which is often applied to big data problems. In particular, we will focus on a 4D Variational (4DVar) method. 4DVar is a computationally expensive method, as it is developed to assimilate several observations (distributed in time) for each timestep of the forecasting model. We will develop an extended version of LA able to assimilate a set of distributed observations for each timestep and, thus, able to perform a 4DVar;

• Add a third dimension, i.e. test the methodology on a 3D space using a 3D Convolutional AutoEncoder. Instead of cutting a slice, the 3D Convolutional AutoEncoder will work on the complete room space without losing information;

• Recent research studies have started in the direction of working directly with unstructured meshes. It will be challenging to develop Latent Assimilation with an Encoder-Decoder which works directly on a 3D adaptive and unstructured mesh;

• Instead of using only indoor data, the methodology could be applied considering the exchange with the outdoor environment, or tested in different applications, e.g. ocean modelling.
[1] Natural ventilation in cities: the implications of fluid mechanics
[2] Fluidity manual v4
[3] A Bayesian tutorial for data assimilation
[4] Optimal reduced space for variational data assimilation
[5] A domain decomposition reduced order model with data assimilation (DD-RODA)
[6] Deblurring images: matrices, spectra, and filtering
[7] A primer for EOF analysis of climate data
[8] Attention-based convolutional autoencoders for 3D-variational data assimilation
[9] Data-driven reduced order model with temporal convolutional neural network
[10] A reduced order deep data assimilation model
[11] Neural assimilation
[12] Leveraging modern artificial intelligence for remote sensing and NWP: benefits and challenges
[13] From global to local modelling: a case study in error correction of deterministic models
[14] Neural networks as routine for error updating of numerical models
[15] Data assimilation of local model error forecasts in a deterministic model
[16] Model error correction in data assimilation by integrating neural networks
[17] Data assimilation by artificial neural networks for an atmospheric general circulation model
[18] Deep learning and data assimilation for real-time production prediction in natural gas wells
[19] Fast ocean data assimilation and forecasting using a neural-network reduced-space regional ocean model of the north Brazil current
[20] Recurrent Kalman networks: factorized inference in high-dimensional deep feature spaces
[21] Embed to control: a locally linear latent dynamics model for control from raw images
[22] Exhaled CO2 as COVID-19 infection risk proxy for different indoor environments and activities, medRxiv
[23] A new approach to linear filtering and prediction problems
[24] Data assimilation: methods, algorithms, and applications
[25] Deep learning
[26] Long short-term memory
[27] Variational Gaussian process for optimal sensor placement

This work is supported by the EPSRC Grand Challenge grant Managing Air for Green Inner Cities (MAGIC) EP/N010221/1 and the EP/T003189/1 Health assessment across biological length scales for personal pollution exposure and its mitigation (INHALE).