key: cord-0835651-669pn2st
authors: Sundqvist, Nicolas; Grankvist, Nina; Watrous, Jeramie; Jain, Mohit; Nilsson, Roland; Cedersund, Gunnar
title: Validation-based model selection for (13)C metabolic flux analysis with uncertain measurement errors
date: 2022-04-11
journal: PLoS Comput Biol
DOI: 10.1371/journal.pcbi.1009999
sha: d6f3f0206d6171afb501703a9e0812714fb32b12
doc_id: 835651
cord_uid: 669pn2st

Accurate measurements of metabolic fluxes in living cells are central to metabolism research and metabolic engineering. The gold standard method is model-based metabolic flux analysis (MFA), where fluxes are estimated indirectly from mass isotopomer data with the use of a mathematical model of the metabolic network. A critical step in MFA is model selection: choosing which compartments, metabolites, and reactions to include in the metabolic network model. Model selection is often done informally during the modelling process, based on the same data that is used for model fitting (estimation data). This can lead to either overly complex models (overfitting) or overly simple ones (underfitting), in both cases resulting in poor flux estimates. Here, we propose a method for model selection based on independent validation data. We demonstrate in simulation studies that this method consistently chooses the correct model in a way that is independent of errors in measurement uncertainty. This independence is beneficial, since estimating the true magnitude of these errors can be difficult. In contrast, commonly used model selection methods based on the χ2-test choose different model structures depending on the believed measurement uncertainty; this can lead to errors in flux estimates, especially when the magnitude of the error is substantially off. We present a new approach for quantification of prediction uncertainty of mass isotopomer distributions in other labelling experiments, to check for problems with too much or too little novelty in the validation data. Finally, in an isotope tracing study on human mammary epithelial cells, the validation-based model selection method identified pyruvate carboxylase as a key model component. Our results argue that validation-based model selection should be an integral part of MFA model development.

Cellular metabolism is fundamental for all living organisms, involving thousands of metabolites and metabolic reactions that together form large interconnected metabolic networks [1, 2]. While a substantial part of the human metabolic network has been reconstructed [2], measuring fluxes through individual reactions and metabolic pathways in living cells and tissues remains a challenge. This problem is central to a variety of medically relevant processes, including T-cell differentiation [3], caloric restriction and aging [4], cancer [5, 6], the metabolic syndrome [7], and neurodegenerative diseases such as Parkinson's disease [8]. The gold standard method for measuring metabolic fluxes in a given system is model-based metabolic flux analysis (MFA) [9]. In this technique, cells or tissues are fed "labelled" substrates containing stable isotopes such as 13C (Fig 1A). These substrates are metabolized to products containing various isotopic isomers (isotopomers) (Fig 1B). By measuring the abundance of these isotopomers, mass isotopomer distributions (MIDs, Fig 1C) are obtained for each metabolite [10]. Fluxes are then inferred by fitting a mathematical model M to the observed MID data D (Fig 1D).
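As a rough illustration of this fitting step, the sketch below uses a hypothetical two-pathway toy network (the pathway MIDs, the error magnitude, and the "true" flux fraction are invented for illustration and are not from the paper): a product pool is fed by two pathways with known labelling patterns, so its steady-state MID is the flux-weighted mixture of the two, and the unknown flux split is estimated by weighted least squares.

```python
# Minimal sketch of the MFA fitting step on a hypothetical two-pathway toy network.
# All numbers (pathway MIDs, sigma, true flux fraction) are illustrative assumptions.
import numpy as np
from scipy.optimize import least_squares

mid_a = np.array([0.05, 0.90, 0.05])   # MID if all flux went through pathway A (assumed)
mid_b = np.array([0.60, 0.30, 0.10])   # MID if all flux went through pathway B (assumed)
sigma = 0.01                            # believed measurement error of each MI fraction

def predicted_mid(f):
    """Steady-state MID of the product pool when a fraction f of the flux goes via pathway A."""
    return f * mid_a + (1.0 - f) * mid_b

# Simulated measurement for a "true" flux fraction of 0.7, with Gaussian noise.
rng = np.random.default_rng(0)
observed = predicted_mid(0.7) + rng.normal(0.0, sigma, size=3)

def weighted_residuals(theta):
    # Residuals divided by sigma, so that the summed squares give the weighted SSR.
    return (predicted_mid(theta[0]) - observed) / sigma

fit = least_squares(weighted_residuals, x0=[0.5], bounds=(0.0, 1.0))
print("estimated flux fraction:", fit.x[0])
print("weighted SSR:", np.sum(fit.fun ** 2))
```

Real MFA models replace the simple mixing rule with isotopomer or EMU balance equations over the whole network, but the weighted least-squares objective is the same.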
While the above methodology is well established for assessing the fit of a given MFA model, several problems arise when it is used for model selection. In practice, MFA models are usually developed iteratively (Fig 1D), by repeatedly attempting to fit the same data to a sequence of models M1, M2, …, Mk with successive modifications (adding or removing reactions, metabolites, and so on), until a model Mk is found acceptable, i.e. not statistically rejected; in practice, this means that Mk passes the χ2-test for goodness-of-fit [11]. Given the iterative nature of modifying the model structures, model development thus turns into a model selection problem. Depending on the approach used to solve this model selection problem, different model structures might be selected, given the same data set (Fig 1E). For instance, if the traditional iterative modelling cycle is used, the first model that passes the χ2-test might be selected and used for flux estimation. On the other hand, there might be multiple model structures that pass the χ2-test. In this case, the model structure that passes the χ2-test with the largest margin may be a better option.

Generally, model selection approaches that rely solely on the χ2-test to select a model can be problematic. First, the correctness of the χ2-test depends on knowing the number of identifiable parameters, which is needed to properly account for overfitting by adjusting the degrees of freedom of the χ2-distribution [12], but which can be difficult to determine for nonlinear models [13]. Second, the χ2-test can be unreliable in practice because the underlying error model is often not accurate. Typically, the MID errors σ are estimated by sample standard deviations s from biological replicates, which for mass spectrometry data are often below 0.01 and can be as low as 0.001 (Fig 2A). However, such low estimates may not reflect all error sources. For example, MI fractions obtained from Orbitrap instruments can be biased so that minor isotopomers are underestimated [14, 15]. Also, s does not account for experimental bias, such as deviations from metabolic steady state that always occur in batch cultures. Some of these problems can be detected by repeating experiments, but others cannot. The normal distribution assumption itself is also questionable for MIDs, which are constrained to the n-simplex [16]. For these reasons, s can severely underestimate the actual errors, making it exceedingly difficult to find a model that passes a χ2-test. In this situation, one is left with two bad choices: either arbitrarily increase s to some "reasonable" value to pass the χ2-test (Fig 2B), or introduce more or less well-motivated extra fluxes into the model. The former alternative, increasing s, may lead to high uncertainty in the estimated fluxes and does not necessarily reflect the experimental bias one is trying to account for. The latter approach, introducing additional fluxes, increases model complexity and can lead to overfitting. While these issues with model selection are well known, they have, to our knowledge, not been treated systematically in the 13C MFA field. Indeed, MFA model selection is typically done in an informal fashion by trial and error, and the underlying procedure is rarely reported [17].
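To illustrate how sensitive the χ2 verdict is to the believed error magnitude, the following minimal sketch (with made-up residuals, data counts, and parameter numbers, not the paper's actual setup) compares the weighted SSR against the χ2 threshold with n − p degrees of freedom for three different believed errors σb.

```python
# Minimal sketch of the chi2 goodness-of-fit test and its dependence on the believed error.
# The residuals, data count, and parameter count are illustrative assumptions.
import numpy as np
from scipy.stats import chi2

n_data, n_params = 30, 8                       # number of MID measurements and free fluxes (assumed)
rng = np.random.default_rng(1)
raw_residuals = rng.normal(0.0, 0.02, n_data)  # unweighted residuals y - y_hat, true sigma = 0.02

def chi2_test(residuals, sigma_b, alpha=0.05):
    """Weighted SSR compared against the chi2 threshold with n - p degrees of freedom."""
    ssr = np.sum((residuals / sigma_b) ** 2)
    threshold = chi2.ppf(1.0 - alpha, df=len(residuals) - n_params)
    return ssr, threshold, ssr <= threshold

for sigma_b in (0.002, 0.02, 0.2):             # believed error: too small, about right, too large
    ssr, thr, passed = chi2_test(raw_residuals, sigma_b)
    print(f"sigma_b = {sigma_b}: SSR = {ssr:.1f}, threshold = {thr:.1f}, pass = {passed}")
```

With the believed error ten times too small, the same residuals fail the test by a wide margin; ten times too large, and essentially any model passes.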
However, in other contexts where model fitting is central, such as systems biology, the problem of model selection has been treated extensively [13, 18-27]. In these areas, a widely accepted solution is to perform model selection on a separate "validation" data set, which is not used for model fitting. Intuitively, this protects against overfitting by choosing the model that can best predict new, independent data. In this paper, we propose a formalized version of such a validation-based model selection approach for MFA. In a series of simulated examples, we demonstrate that this method consistently selects the correct metabolic network model, despite uncertainty in measurement errors, whereas "traditional" χ2-testing on the estimation data does not. By quantifying prediction uncertainty using the prediction profile likelihood, we can avoid cases where the validation data is too similar, or too dissimilar, to the estimation data. Finally, in an application to flux analysis on our own new data in human epithelial cells, we find that the same robustness to variations in measurement uncertainty still holds, and that the validation-based model selection method can identify reactions that are known to be active in this cell type.

To systematically examine the effects of the model selection procedure on MFA, we adopted a scheme where a sequence of models M1, M2, … with increasing complexity (increasing number of parameters) is tested by each model selection method, simulating typical iterative model development. We considered five possible model selection methods that use all available data for both parameter estimation and model evaluation (Table 1). Method "SSR", included as a baseline, selects the model with the smallest weighted sum of squared residuals (SSR) on the data. Method "First χ2" selects the model with the fewest parameters (the "simplest" model) that passes a χ2-test, while accounting for overfitting by subtracting the number of free parameters p from the degrees of freedom of the χ2-distribution (see Section 4.3). Method "Best χ2" selects the model that passes the χ2-threshold with the greatest margin. Methods "AIC" and "BIC" select the model that minimizes the Akaike Information Criterion or the Bayesian Information Criterion, respectively [28, 29]. These five methods all depend on the noise model Eq (5), and all except "SSR" also require knowing the number of free parameters p. Considering common practices in the field, it is probable that some combination of the "First χ2" and "Best χ2" methods is the prevailing approach in MFA modelling [17, 30], although this is not entirely clear since the model selection process is often not described. In addition to these methods, we propose a validation-based model selection method ("Validation") that divides the data D into estimation data Dest and validation data Dval. For each model, parameter estimation (model fitting) is then done using Dest, and the model achieving the smallest SSR with respect to Dval is selected. The division of the data into estimation and validation sets is described in the Methods. Note that this method only ranks the candidate model structures; it does not by itself test whether the selected model describes the data sufficiently well. In other words, the model selected with our new Validation method would still need to be subjected to some form of final model testing. A detailed description of the "SSR", "First χ2", "Best χ2", and "Validation" methods can be found in S1 Algorithm, parts A-D, respectively.
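To make the different selection rules concrete, the sketch below expresses each method as a choice over a list of already-fitted candidate models. The Candidate summary, the example numbers, and the exact AIC/BIC forms (written here for known Gaussian errors, up to an additive constant) are illustrative assumptions and may differ from Eq (5) and Table 1 in the paper.

```python
# Minimal sketch of the model selection rules, applied to pre-fitted candidate summaries.
from dataclasses import dataclass
import numpy as np
from scipy.stats import chi2

@dataclass
class Candidate:
    name: str
    p: int            # number of free parameters
    ssr_est: float    # weighted SSR on the estimation data
    ssr_val: float    # SSR on held-out validation data
    n_est: int        # number of estimation data points

def passes_chi2(c, alpha=0.05):
    return c.ssr_est <= chi2.ppf(1 - alpha, c.n_est - c.p)

def select(candidates):
    """Return the model name chosen by each selection method."""
    ordered = sorted(candidates, key=lambda c: c.p)   # simplest (fewest parameters) first
    choices = {
        "SSR":        min(candidates, key=lambda c: c.ssr_est),
        "First chi2": next((c for c in ordered if passes_chi2(c)), None),
        # largest margin between the chi2 threshold and the achieved SSR
        "Best chi2":  max(candidates, key=lambda c: chi2.ppf(0.95, c.n_est - c.p) - c.ssr_est),
        "AIC":        min(candidates, key=lambda c: c.ssr_est + 2 * c.p),
        "BIC":        min(candidates, key=lambda c: c.ssr_est + c.p * np.log(c.n_est)),
        "Validation": min(candidates, key=lambda c: c.ssr_val),
    }
    return {method: (c.name if c else "none") for method, c in choices.items()}

candidates = [
    Candidate("M1", p=3,  ssr_est=80.0, ssr_val=30.0, n_est=40),
    Candidate("M2", p=6,  ssr_est=35.0, ssr_val=8.0,  n_est=40),
    Candidate("M3", p=12, ssr_est=20.0, ssr_val=15.0, n_est=40),
]
print(select(candidates))
```

Note that only the "Validation" rule is unaffected by rescaling the believed error, since such a rescaling multiplies all weighted SSR values (and the χ2 thresholds do not move with them), whereas it leaves the ranking by validation SSR unchanged.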
Before examining the behavior of the different model selection methods on metabolic network models, it may be helpful to illustrate their properties on a simple univariate example. For this purpose, we considered a model with a single input x and a single output ŷ, where model Mn is the n-th order polynomial ŷ = h_n(x, u) = Σ_{i=0..n} u_i x^i with parameter vector u. We assumed that M7 was the correct model, with true parameters u0, and sampled 20 measurements y = h_7(x, u0) + ε for different values of x, where ε was drawn from N(0, σr) with standard deviation σr = 0.2. To simulate uncertainty about the error model, we considered σ to be unknown, and let the various model selection methods choose among M1, …, M14 with a "believed" standard deviation, denoted σb, in the range [0.1σr, 10σr]. For the "Validation" method, we reserved 4 of the 20 measurements for Dval (Fig 3, red error bars).

An illustration of the dependency on σb for a model selection method that does not use validation is shown in Fig 3. When Dval is not considered, we would expect larger values of σb to result in a simpler model, since almost all of the variation in the data is interpreted as noise (Fig 3A). Conversely, at very small values of σb an overly complex model will be required to obtain an acceptable fit to Dest (Fig 3C).

Applying the model selection methods to data from this polynomial model gave different results (Figs 4 and S1). Since the model selection process is somewhat stochastic, we resampled the data 10,000 times, each time with a new error ε drawn from N(0, σr), and report results as the fraction of times a particular model was chosen. As expected, "SSR" mostly selected the most complex polynomial M14 regardless of σb, as the most complex model always gives the lowest SSR (Fig 4A). In contrast, "First χ2" and "Best χ2" gave different results depending on σb. "First χ2" selected the correct model M7 only when σb ≈ σr. At σb ≈ 10σr, only low-degree polynomials (M1, M2, and M3) were chosen by the "First χ2" method, while at σb ≈ 0.1σr, an overly complex polynomial was chosen (Fig 4B). The "Best χ2" method selected the correct model M7 for σb ≈ σr, but selected overly complex models for smaller σb (Fig 4C). If σb were to increase further, "Best χ2" would choose a lower-degree polynomial. This is because, for these χ2-based methods, the tradeoff between model complexity and goodness-of-fit is based on σb, and such a tradeoff is correct only if we happen to have σb ≈ σr. Similar results are seen with the "AIC" and "BIC" methods, which also depend on σb (S1 Fig). In contrast, the "Validation" method predominantly selected the correct model, M7, regardless of σb (Fig 4D). This happens because, even though a polynomial of the wrong degree may fit well on Dest, it fails to predict independent validation data, resulting in a large SSR on Dval. Since the correct model structure M7 will best predict new data, agreement with validation data helps identify the right model, even in cases where the error model is inaccurate. Again, it should be recalled that the "Validation" method is only applicable to the task of selecting the best model, and that this selection approach should be followed by a final step that tests the quality of the model, e.g. using a χ2-test.
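A minimal sketch of a single round of this simulation is shown below; the true coefficients, the x-grid, and the random seed are arbitrary choices rather than the paper's exact setup. The point to note is that the validation-based selection never uses the believed error σb, since it only ranks candidates by their prediction error on the held-out points.

```python
# Minimal sketch in the spirit of the polynomial example: sample noisy data from a
# degree-7 polynomial, fit candidate polynomials of degree 1..14 to the estimation
# points, and select the degree with the smallest SSR on the held-out validation points.
import numpy as np

rng = np.random.default_rng(2)
true_coeffs = rng.normal(0.0, 1.0, 8)                        # degree-7 polynomial (assumed coefficients)
x = np.linspace(-1.0, 1.0, 20)
y = np.polynomial.polynomial.polyval(x, true_coeffs) + rng.normal(0.0, 0.2, 20)

val_idx = rng.choice(20, size=4, replace=False)              # reserve 4 of 20 points for validation
est_idx = np.setdiff1d(np.arange(20), val_idx)

val_ssr = {}
for degree in range(1, 15):
    coeffs = np.polynomial.polynomial.polyfit(x[est_idx], y[est_idx], degree)
    pred = np.polynomial.polynomial.polyval(x[val_idx], coeffs)
    val_ssr[degree] = np.sum((pred - y[val_idx]) ** 2)       # validation SSR, no sigma_b needed

best = min(val_ssr, key=val_ssr.get)
print("selected polynomial degree:", best)
```

On any single draw the selected degree can of course deviate from 7; the paper's conclusions rest on repeating such draws many times and reporting selection frequencies.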
To investigate model selection in a setting more relevant to metabolic networks, we next considered a multivariate linear model, where the output vector ŷ is a linear combination of the inputs x weighted by the model parameters. In this case, each model structure Mk is fully specified by a matrix Ak such that ŷ = Ak x, where the free parameters are the elements of Ak (Fig 5). This type of model is roughly analogous to a simple metabolic network, where x corresponds to labelled substrates and y corresponds to metabolic products. We constructed six such models (A1-A6) of increasing complexity, nested so that the parameter space of each Ak contains the parameter space of all models Al with l < k.
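One way to picture such nested linear structures is sketched below, with assumed dimensions and sparsity patterns that are not taken from the paper: each structure is a boolean mask over the matrix A marking which entries are free parameters, and nesting means that every entry free in Al is also free in Ak for l < k.

```python
# Minimal sketch of nested multivariate linear model structures y_hat = A x,
# where each structure is defined by a mask of free entries in A. The masks,
# dimensions, and "true" matrix below are illustrative assumptions.
import numpy as np

masks = [
    np.array([[1, 0, 0],
              [0, 1, 0]], dtype=bool),   # M1: 2 free parameters
    np.array([[1, 1, 0],
              [0, 1, 1]], dtype=bool),   # M2: contains all free entries of M1
    np.array([[1, 1, 1],
              [1, 1, 1]], dtype=bool),   # M3: fully parameterised, contains M2
]

def fit_structure(mask, X, Y):
    """Least-squares estimate of A restricted to the free entries given by the mask."""
    A = np.zeros(mask.shape)
    for row in range(mask.shape[0]):
        free = mask[row]                 # which inputs may feed this output
        if free.any():
            coef, *_ = np.linalg.lstsq(X[:, free], Y[:, row], rcond=None)
            A[row, free] = coef
    return A

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))             # rows = samples, columns = labelled substrates
A_true = np.array([[1.0, 0.5, 0.0],
                   [0.0, -1.0, 2.0]])
Y = X @ A_true.T + rng.normal(0.0, 0.1, size=(50, 2))

for k, mask in enumerate(masks, start=1):
    A_hat = fit_structure(mask, X, Y)
    ssr = np.sum((Y - X @ A_hat.T) ** 2)
    print(f"M{k}: free parameters = {mask.sum()}, SSR = {ssr:.2f}")
```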