key: cord-0646117-zp7w4spl
authors: Guarda, Pablo; Qian, Sean
title: Statistical inference of travelers' route choice preferences with system-level data
date: 2022-04-23
sha: 127f7025a85d8c14795887501c3e6cfb8d220703
doc_id: 646117
cord_uid: zp7w4spl

Traditional network models encapsulate travel behavior among all origin-destination pairs based on a simplified and generic utility function. Typically, the utility function consists solely of travel time and its coefficients are equated to estimates obtained from stated preference data. While this modeling strategy is reasonable, the inherent sampling bias in individual-level data may be further amplified over network flow aggregation, leading to inaccurate flow estimates. These data must be collected from surveys or travel diaries, which may be labor intensive, costly and limited to a small time period. To address these limitations, this study extends classical bi-level formulations to estimate travelers' utility functions with multiple attributes using system-level data. We formulate a methodology grounded in nonlinear least squares to statistically infer travelers' utility functions in the network context using traffic counts, traffic speeds, traffic incidents and sociodemographic information, among other attributes. The analysis of the mathematical properties of the optimization problem and of its pseudo-convexity motivates the use of normalized gradient descent. We also develop a hypothesis testing framework to examine statistical properties of the utility function coefficients and to perform attribute selection. Experiments on synthetic data show that the coefficients are consistently recovered and that the hypothesis tests reliably identify which attributes are determinants of travelers' route choices.
In addition, a series of Monte Carlo experiments suggests that statistical inference is robust to noise in the origin-destination matrix and in the traffic counts, and to various levels of sensor coverage. The methodology is also deployed at large scale using real-world multi-source data collected in Fresno, CA before and during the COVID-19 outbreak.

System-level data, as opposed to individual-level data, measures characteristics of aggregated flow in transportation networks, and it is a valuable source of information for studying travel demand. An example is the use of traffic count data for estimating origin-destination (O-D) matrices (Fisk, 1989; Yang et al., 1992; Cascetta and Postorino, 2001; Ma and Qian, 2018b; Krishnakumari et al., 2020). Interestingly, little attention has been given to the use of system-level data to estimate travelers' utility functions in route choice models. These models are key to depicting travel demand in transport planning applications, and their parameters are usually estimated with data collected from individual-level surveys and experiments. Modeling route choices at the system level using individual-level data, however, may be inaccurate and expensive due to the inherent sampling bias and high collection cost of experimental data. The increase in availability and diversity of system-level data offers opportunities to overcome those limitations and to understand the impact of a broader set of factors that may influence travelers' route choice decisions. Some of the new sources of system-level data include crashes, weather and pavement conditions, land use characteristics, sociodemographic information from the Census, travel time reliability and other relevant attributes. Those factors may well determine how a traveler makes route choices, but surveying them for each individual traveler would be infeasible.
Thus, this paper proposes a statistical method to better understand why and how a traveler makes route choices using a diverse set of system-level data only. Researchers have already leveraged system-level data within traditional network models to estimate travelers' route choice preferences (Robillard, 1974; Fisk, 1977; Daganzo, 1977; Anas and Kim, 1990; Liu and Fricker, 1996; Yang et al., 2001; Lo and Chan, 2003; García-Ródenas and Marín, 2009; Russo and Vitetta, 2011; Wang et al., 2016). However, their primary focus has been on the estimation of a utility function dependent on travel time only. Furthermore, most existing algorithms have been deployed on networks of relatively small size, which raises questions about their potential to scale up to large transportation networks and to contribute to real-world applications. A main difficulty in solving this estimation problem comes from the fact that network models are not data-driven. Discrete choice models, the gold standard for analyzing individual-level data in travel behavior studies, may be considered a promising alternative given their data-driven nature. However, these models cannot capture the endogeneity of travel times arising from the interdependence between travelers' decisions and from the effect of traffic congestion in transportation networks. With the goal of estimating the coefficients of travelers' utility functions with multiple attributes using system-level data, this paper enhances traditional network models with optimization algorithms developed in the machine learning literature. The mathematical properties of the optimization problem and its pseudo-convexity motivate the use of normalized gradient descent, a first-order method that is suitable for pseudo-convex optimization. Our solution algorithm is tailored to perform a robust estimation of the coefficients of the travelers' utility function and to also reproduce the traffic conditions observed at the system level.
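For intuition, normalized gradient descent differs from ordinary gradient descent only in that each step follows the gradient *direction*, discarding its magnitude, which is why it remains well behaved on pseudo-convex objectives whose gradient norm can be misleading. The sketch below is a generic illustration under assumed settings (the callable `grad`, the fixed step size and the toy objective are ours, not the paper's actual training loop):

```python
import numpy as np

def normalized_gradient_descent(grad, theta0, lr=0.05, iters=500):
    """Step along grad/||grad||: for pseudo-convex objectives the
    gradient direction (not its magnitude) points toward the minimizer.
    `grad` is a hypothetical callable returning the gradient at theta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        g = grad(theta)
        norm = np.linalg.norm(g)
        if norm < 1e-12:  # (numerically) stationary point reached
            break
        theta = theta - lr * g / norm
    return theta

# Toy pseudo-convex objective f(t) = sqrt(1 + t^2), whose gradient
# t / sqrt(1 + t^2) flattens far from the minimizer at t = 0.
theta_hat = normalized_gradient_descent(
    lambda t: t / np.sqrt(1.0 + t**2), theta0=np.array([5.0]))
```

With a fixed step size the iterate ends up oscillating within one step length of the minimizer; in practice a decaying step schedule tightens this to convergence.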
To ease the identification of the determinants of travelers' route choices with system-level data, we formulate hypothesis tests on the utility function coefficients. The paper is structured as follows. We first survey the existing literature and identify the main contributions of our work. Then, we present an example on a toy network to illustrate the optimization problem arising from the estimation of the travelers' utility function coefficients using system-level data. The following three sections describe the formulation of our methodology on a general transportation network, analyze the mathematical properties of the optimization problem and propose a solution algorithm. Subsequently, we describe the framework to perform statistical inference on the utility function coefficients and present the results of numerical experiments conducted in networks of small and medium size. Next, we show the estimation results obtained in a large-scale network using real-world system-level data. Finally, we describe our main conclusions and suggest avenues for further research. The mathematical notation used in the remainder of the paper is included in Appendix A.1.

The computation of network equilibrium requires specifying a travelers' route choice model. The choice of route choice model defines the class of network equilibria (Prashker and Bekhor, 2004). The simplest and most traditional class of equilibria is known as deterministic user equilibrium (DUE). At DUE, no traveler has an incentive to change to alternative routes, and travelers are assumed to pick, in a deterministic manner, the route that maximizes their utility. The multinomial logit model (MNL) remains a gold standard in travel behavior research to depict travelers' decision making (McFadden, 1973). A well-known application of the MNL model in network modeling is the computation of Stochastic User Equilibrium with logit assignment (SUE-logit).
In contrast to other probabilistic models of travelers' choices, the logit model exhibits a better compromise between behavioral realism and mathematical tractability. Similar to the MNL model, SUE-logit enjoys this mathematical convenience in the network modeling context. For instance, under mild conditions, SUE-logit has solution uniqueness in both link and path flow space, whereas DUE only has solution uniqueness in link flow space. Moreover, the existence of a closed-form expression for the choice probabilities in the multinomial logit model can be leveraged for the computation of SUE-logit, e.g., the Dial logit algorithm (Dial, 1971; Bell, 1995), or to solve the ODE problem (Ma and Qian, 2018a). A stream of research that drew our attention concerns the estimation of the travelers' utility function from traffic count data consistent with SUE-logit (Robillard, 1974; Fisk, 1977; Daganzo, 1977; Anas and Kim, 1990; Liu and Fricker, 1996; Yang et al., 2001; Lo and Chan, 2003; García-Ródenas and Marín, 2009; Russo and Vitetta, 2011; Wang et al., 2016). The solution of this problem, and of network equilibrium in general, requires knowing a priori the coefficients of the travelers' utility function in the route choice model. Hence, a standard practice is to set the values of these coefficients equal to estimates obtained in previous travel behavior studies. Nevertheless, there are multiple advantages to estimating these coefficients directly. First, it avoids searching for external estimates, which may be cumbersome and may yield inaccurate coefficients. Second, it can improve the generalization performance of the estimated O-D matrix when, for instance, the coefficients fitted from existing count data become close to the population coefficients. The following sections describe the relevant literature that has studied the aforementioned problem.
Seminal work in the literature focused on the problem of estimating the coefficient θ ∈ R of a utility function dependent solely on travel time, where both the O-D matrix and the travel costs of links/paths are assumed exogenous. To our knowledge, the LUE problem was first studied by Robillard (1974), who estimated the coefficient θ using link and path flows obtained from a traffic assignment consistent with the Dial (1971) method. The estimator of θ was assumed to be Gaussian distributed and was obtained via maximum likelihood estimation (MLE). A main limitation of this work was the requirement of knowing the traffic counts at every link/path in the network. The problem of estimating θ from a subset of traffic counts in the network was then addressed by Fisk (1977). Another major limitation of this work was assuming full knowledge of the average cost of traversing paths between O-D pairs, which is plausible in settings where costs are equated to travel times but arguably unrealistic in real-world scenarios where travelers make route choices based on multiple attributes such as monetary cost and travel time reliability. Daganzo (1977) focused on a more general setting where link flow observations were consistent with SUE but not necessarily with a logit assignment. In contrast to Robillard (1974), the distribution of the estimator of θ was derived by imposing distributional properties on the link flows and by relying on the statistical properties of the MLE estimator. This work was the first to conduct hypothesis testing on the coefficients of the travelers' utility function. However, similar to previous research, it assumed a utility function dependent on travel time only and exogenous link costs, which limits the application to uncongested networks or to settings where link costs at equilibrium are known. Under the assumption that link flow data are consistent with SUE-logit, Anas and Kim (1990) extended prior work by allowing for endogenous travel costs.
The authors studied the impact of estimating θ without accounting for the endogeneity of the travel costs and observed that the estimates of θ become biased and that statistical inference becomes less consistent. Moreover, the estimation of θ was shown to perform better with least squares than with maximum likelihood. In the following decades, the research interest shifted to the joint estimation of the O-D matrix and θ from traffic count data following SUE-logit, which we coin the O-D and Logit Utility Estimation (ODLUE) problem. Interestingly, the extension of the LUE problem to settings where the utility function depends on multiple attributes besides travel time did not gain the same attention. By this time, the ODE literature had made significant progress in estimating O-D matrices with traffic count data consistent with SUE-logit, and these advances could be directly leveraged to solve the LUE problem. A rich set of solution methods had been developed in the ODE literature, including sensitivity analysis (Patriksson, 2004) and the alternating optimization of the upper and lower levels of the classic ODE bilevel formulation (Ma and Qian, 2017). To the best of our knowledge, Liu and Fricker (1996) conducted the first study of the ODLUE problem that presented an application using system-level data collected from a real transportation network. Traffic count and travel time data were collected from a small network on the Purdue University campus, and estimates of θ were obtained for different time periods of the day. The authors proposed a two-stage calibration method that leveraged the closed form of the logit route choice probabilities to solve for θ using the secant method. The O-D matrix was estimated via least squares minimization. Key limitations of this work were that it did not account for congestion effects in the transportation network and that it assumed that traffic counts and travel time data were available for the full set of links.
This assumption may be mild in small networks where traffic counts can be collected at every link, but it is implausible in larger networks where these data are typically available for only a small proportion of the links. Yang et al. (2001) overcame some of the aforementioned limitations by solving a bilevel optimization problem that resembled the traditional mathematical program with equilibrium constraints (MPEC) used for solving the ODE problem. A main contribution of this work was to estimate both θ and the O-D matrix while also accounting for the impact of traffic congestion. A Sequential Quadratic Programming (SQP) algorithm was used to find the parameters of interest and to minimize the gap between observed and predicted traffic counts. The optimization problem also included constraints to ensure that the SUE-logit equilibrium conditions were satisfied across parameter updates. The authors also derived for the first time the analytical expression of the gradients of the outer-level objective with respect to the parameters associated with the O-D matrix and with the travelers' utility function. Lo and Chan (2003) proposed a similar framework, but an MLE problem instead of a nonlinear least squares (NLLS) problem was solved at the upper level of the bilevel formulation. The authors performed ODLUE on both synthetic and real-world data collected from the Tuen Mun Corridor network in Hong Kong. Later, Wang et al. (2016) implemented a similar approach using real-world data gathered from a small network in Seattle, WA. They also performed experiments on synthetic data to study the robustness of the parameter estimates to noise in the ground truth O-D matrix and in the parameter θ used to generate synthetic traffic count data. A number of studies have estimated additional parameters on top of the O-D matrix and the travelers' utility function coefficients.
Russo and Vitetta (2011) and Caggiani and Ottomanelli (2011) estimated the shape parameters of the link performance function using traffic counts and travel time measurements consistent with SUE-logit. To leverage the travel time data, the outer-level objective of their NLLS problem incorporated the squared difference between predicted and observed travel times. Results on synthetic data suggested that all parameters could be consistently recovered. However, little attention was paid to the identifiability of the parameters and to the overfitting caused by the additional degrees of freedom introduced in the model. In fact, ODE is known to be an underdetermined problem subject to identifiability issues even when a full set of traffic counts is available in a network (Yang et al., 2018). Naturally, this issue is worsened if additional parameters are introduced in the model. Besides the problem of identifiability (also known as observability), a high model complexity may cause overfitting and thus a poor generalization of the model to unseen data, even if a good fit is observed on training data. Russo and Vitetta (2011) also argued that in an uncongested network the optimization problem is convex in order to claim solution uniqueness. However, no attention was paid to the non-convexity of the feasible set, which makes the overall problem non-convex, nor to the fact that convexity is not a sufficient condition for solution uniqueness. Another line of research extended the application of ODLUE beyond the traffic assignment stage of the classic four-step model (Ortúzar and Willumsen, 2011). Under the assumption of an exogenous demand matrix, Cascetta and Russo (1997) simultaneously estimated the parameters of a category index trip generation model, the parameters associated with the distribution step, the mode-specific constants, and the travel time and cost coefficients of the utility function of a multinomial logit model of mode choice.
The authors tested their method with real data collected from two Italian cities and found that the magnitude and sign of the estimated parameters were consistent with their expectations. However, their work does not treat congestion in the road network and thus does not capture the endogeneity of travel times. In similar research, García-Ródenas and Marín (2009) estimated the utility function parameters of a route choice model and the mode choice parameters of a nested logit model via Nonlinear Generalized Least Squares (NGLS). A bilevel optimization problem was formulated to account for traffic congestion, and it was solved via alternating optimization of the upper- and inner-level problems. Similar to most prior work, no statistical tests were conducted to check whether those parameters were statistically significant or to assess whether the increase in goodness of fit compensated for the additional degrees of freedom introduced in the full model. More recently, Wu et al. (2018) used a computational graph to estimate the parameters of a model that incorporated the trip generation, spatial distribution and path-flow-based traffic assignment steps of the four-step model. Data from household travel surveys, mobile phones and traffic counts were leveraged to estimate these parameters, and the model was deployed on a network with more than 2500 nodes and 5000 links. This research illustrated the potential of computational graphs to increase the scalability of models developed in the ODLUE literature and to integrate heterogeneous data sources that inform transport decisions. Nevertheless, a main drawback of this work is that it does not account for traffic congestion and focuses on a travelers' utility function dependent on travel time only. This paper enhances existing formulations of the LUE problem with the goal of statistically inferring the coefficients of travelers' utility functions with multiple attributes using system-level data.
A bilevel optimization program is formulated to minimize the gap between estimated and observed traffic counts and to account for the endogenous effect of traffic congestion on travelers' route choices. More importantly, a framework of hypothesis tests is proposed to examine statistical properties of the coefficient estimates. Given the pseudo-convexity of the optimization problem, we implement a variant of gradient descent suitable for pseudo-convex optimization. First-order optimization methods seem promising to speed up computation and thus make model training scalable to large transportation networks. Below are the five main contributions of this paper to the existing literature:

1. It presents a methodology to statistically infer the coefficients of travelers' utility functions consisting of multiple attributes using system-level data. It not only extends the LUE problem, but also enables learning realistic travel behavior from diverse datasets at the system level.
2. It conducts for the first time a mathematical analysis of the non-convexity of the LUE problem with respect to the utility function coefficients estimated from traffic count data.
3. For the solution of the LUE problem, it shows that the integration of first-order and second-order optimization methods outperforms existing approaches in the literature.
4. It presents a statistical framework to perform hypothesis testing and attribute selection on the coefficients of multi-attribute utility functions.
5. It implements the methodology in a large-scale transportation network with real system-level data. All our analyses are replicable, open-sourced and ready to be used by the transportation community (Section 11).

Consider a network with two parallel links and thus with two alternative paths only (Figure 1).
The cost functions associated with each link/path are t_1(x_1) = t_1^0 (1 + α(x_1/γ_1)^β) and t_2(x_2) = t_2^0 (1 + α(x_2/γ_2)^β), where γ_1, γ_2 ∈ R are the link capacities and t_1^0 = t_2^0 are the links' free-flow travel times. The parameters of the link performance functions are α = 0.15 and β = 4. Assume that the travelers' utility function depends only on the travel time t and the monetary cost c of traversing each link of a path. In addition, suppose that the coefficients θ_t, θ_c ∈ R linearly weight each attribute of the utility function and that they are common among travelers. Thus, the deterministic component of the utility attached to each link/path can be expressed as v_1(x_1) = θ_t t_1(x_1) + θ_c c_1 and v_2(x_2) = θ_t t_2(x_2) + θ_c c_2. Consider q̄ ∈ R_+ individuals traveling between origin-destination pair 1-2 and making route choices consistent with a logit model. q̄ is small enough that link travel times are approximately equal to their free-flow travel times, i.e., t_1(x_1) ≈ t_1^0, t_2(x_2) ≈ t_2^0. Suppose link a_1 has a toll fee equal to 1 USD, i.e., c_1 = 1, c_2 = 0, and a sensor recording traffic counts. Then, path choice probabilities can be modeled with a sigmoid function and link flows can be obtained as follows:

x_1 = q̄ / (1 + exp(v_2(x_2) − v_1(x_1))) = q̄ / (1 + exp(−θ_c)),   x_2 = q̄ − x_1   (Eq. 1)

Suppose the goal is to estimate the coefficients θ_t and θ_c of the travelers' utility function with the traffic count measurement x̄_1. From the link flow solutions (Eq. 1), we first note that θ_t is not identifiable because travel times are the same on the two alternative paths. In contrast, because the monetary costs differ between paths, θ_c is identifiable and equal to the solution θ̂_c of the following nonlinear least squares minimization problem:

θ̂_c = argmin_{θ_c} ( x̄_1 − q̄ / (1 + exp(−θ_c)) )²   (Eq. 2)

Consider two scenarios for the solution of this problem (Eq. 2). In the first scenario, x̄_1 = 0.5q̄, meaning that x̄_1 is deterministic and that θ̂_c = 0.
In the second scenario, x̄_1 = x_1 + ε, ε ∼ N(0, 1), i.e., x̄_1 is not deterministic and x̄_1 ∼ N(0.5q̄, 1). Note that a single realization of x̄_1 in the second scenario could lead us to conclude, with equal probability, that the traveler's utility and the monetary cost are positively or negatively associated. However, in expectation and similar to the first scenario, this association does not exist. This example illustrates important features of the problem studied in this paper. First, it shows that the nonlinear least squares formulation is suitable to estimate the coefficients of the travelers' utility function with traffic count data. Second, the identification issue associated with θ_t illustrates the special considerations that must be made to specify the travelers' utility function and to prevent a naive estimation of coefficients. Third, the estimation of θ_c in the non-deterministic scenario illustrates the relevance of statistical inference tools to assess whether a feature is a determinant of travelers' route choices in the presence of randomness in the data generating process. For larger networks, the analytical derivation of the optimal solution for the travelers' utility function parameters is intractable, and thus methods that iteratively approach the optimal solution are needed. Moreover, due to the nonlinearity of the link flows with respect to the utility function coefficients, the objective function cannot be guaranteed to be convex, which imposes additional difficulties in searching for an optimal solution. Another challenge relates to the impact of traffic congestion on travelers' choices, which is not addressed in the illustrative example. This section discusses in detail our methodology to estimate the travelers' utility function coefficients using system-level data.
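The toy example above can be reproduced numerically. The sketch below generates a noiseless traffic count from an assumed ground-truth θ_c = −0.5 (our choice for illustration; tolls reduce utility) and recovers it from the single count; because the predicted flow is monotone in θ_c in this two-link case, simple bisection suffices in place of a general NLLS solver:

```python
import math

def link_flow(theta_c, q=1000.0, c1=1.0, c2=0.0):
    """Predicted flow on link 1 under logit assignment. Free-flow travel
    times cancel between the two parallel paths, so only the toll
    difference c1 - c2 enters the choice probability (sigmoid)."""
    p1 = 1.0 / (1.0 + math.exp(-theta_c * (c1 - c2)))
    return q * p1

def nlls_theta_c(x1_obs, q=1000.0, lo=-5.0, hi=5.0, tol=1e-10):
    """Estimate theta_c by driving the residual x1_obs - link_flow to
    zero via bisection; valid because link_flow is monotone in theta_c."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if link_flow(mid, q) < x1_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Noiseless observation from ground-truth theta_c = -0.5, then recovery.
x1_obs = link_flow(-0.5)
theta_hat = nlls_theta_c(x1_obs)
```

Adding Gaussian noise to `x1_obs`, as in the second scenario, shifts `theta_hat` away from the ground truth and motivates the hypothesis tests developed later in the paper.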
We first introduce our main assumptions and then describe the components of the bilevel optimization program formulated to estimate the utility function coefficients.

Assumption 1 (SUE-logit). Network traffic flow follows stochastic user equilibrium with logit assignment.

With the motivation of bridging transportation network analysis and the study of travel behavior with discrete choice models, we focus on the problem of estimating the travelers' utility function from link flow measurements consistent with stochastic user equilibrium under logit assignment (SUE-logit). On one hand, SUE-logit is one of many alternative representations of user equilibrium used in the network modeling community, hence it may be considered a strong assumption. On the other hand, SUE-logit is in line with the state of the practice in travel behavior research, where the logit model remains one of the gold standards to depict individual route choices. The good compromise between behavioral realism and mathematical tractability of logit is a key property for estimation in discrete choice models, and it will also be leveraged in the formulation and solution of our problem.

Assumption 2 (Exogenous and deterministic O-D). The origin-destination (O-D) demand matrix is deterministic and exogenous.

O-D demand estimation (ODE) is a complex problem that has been studied at length in the network modeling literature. As an underdetermined problem, there is no guarantee of solution uniqueness for the estimated O-D matrix, which makes it difficult to verify the consistency of the solution with a ground truth O-D matrix. The ODLUE literature has extended ODE to estimate the travelers' utility function coefficients on top of the O-D matrix. The ODLUE problem is harder to solve due to the additional degrees of freedom introduced by an endogenous O-D matrix and therefore, like ODE, it also suffers from solution non-uniqueness.
The assumption of an exogenous and deterministic O-D matrix allows us to focus on the estimation of the utility function coefficients, a first step toward understanding the mathematical properties of our problem. It can always be extended with any ODE method to solve ODLUE. In the context of the existing literature, our methodology can be seen as a case where a reasonably accurate reference O-D matrix is available and assumed to be the ground truth O-D matrix. As we will show later, we can still obtain satisfactory results if the exogenous O-D demand is noisy but not too far from the ground truth O-D. Note also that our methodology can be extended to an iterative process where ODE and LUE alternate if the goal is to simultaneously estimate both the travelers' utility function coefficients and the O-D matrix.

Assumption 3 (Linearity and homogeneity of utility function). The utility function is a linear combination of attributes weighted by coefficients, and the coefficients are common/homogeneous across all individuals traveling in the transportation network.

The choice of a linear-in-parameters utility function with homogeneous coefficients is a standard assumption for the estimation of discrete choice models in travel behavior studies. The preference for linear over non-linear specifications of the utility function is motivated by parsimony arguments and by the convenience of dealing with a concave likelihood function in the estimation of multinomial logit (MNL) models. In our case, this assumption significantly facilitates the proofs of theoretical guarantees for our optimization problem. Note that this assumption could be relaxed by allowing variation of the utility function coefficients at the O-D pair level or by letting the coefficients be drawn from some arbitrary probability density function.
Both modeling strategies can be seen as analogues of the systematic taste variation used in MNL models and of the mixing distributions used in Mixed Logit models (Train, 2002).

Assumption 4 (Monotonicity and exogeneity of link performance functions). The link performance functions are monotonically increasing with respect to their link flows, and their parameters are exogenous.

The exogeneity of the parameters of the link performance functions is a standard assumption in transportation network analysis. Following the state of the practice in prior literature, we adopt the Bureau of Public Roads (BPR) link travel time function. For each link, the travel time monotonically increases with respect to the link flow. For the sake of simplicity, we also assume that there is no interaction between links and hence that the travel time of a link depends on the traffic flow in that link only. A direct consequence of this assumption is the uniqueness of the link flow solution of SUE-logit. Some work in the literature has proposed methods to estimate the parameters of the link performance function, that is, relaxing the exogeneity assumption. Although it is feasible to estimate these parameters on top of the travelers' utility function coefficients within our modeling framework, the introduction of these additional degrees of freedom may cause identifiability issues or make the statistical inference of the utility function coefficients less reliable.

Assumption 5 (Constrained path set). The set of paths between each O-D pair has a fixed size and is dynamically updated over iterations with column generation methods.

Most prior work on the LUE problem assumes that the consideration set for an O-D pair is equal to the set of all reasonable paths for that pair (Yang et al., 2001; Wang et al., 2016; Lo and Chan, 2003; Liu and Fricker, 1996). Under this assumption, each path choice probability can be decomposed into a product of exponentials of its link utilities.
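The BPR function referenced in Assumption 4 has the standard form t(x) = t⁰(1 + α(x/γ)^β), monotonically increasing in the flow x for x ≥ 0. A minimal sketch, with the default α = 0.15 and β = 4 used in the toy example:

```python
def bpr_travel_time(x, t0, capacity, alpha=0.15, beta=4.0):
    """Bureau of Public Roads link performance function:
    t(x) = t0 * (1 + alpha * (x / capacity)**beta).
    Monotonically increasing in the link flow x for x >= 0."""
    return t0 * (1.0 + alpha * (x / capacity) ** beta)

# At zero flow the travel time is the free-flow time t0; at flow equal
# to capacity it is (1 + alpha) times t0.
t_free = bpr_travel_time(0.0, t0=10.0, capacity=1800.0)
t_cap = bpr_travel_time(1800.0, t0=10.0, capacity=1800.0)
```

The values of `t0` and `capacity` here are illustrative, not taken from the paper's networks.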
On one hand, this property avoids the enumeration of all paths in the network and can significantly speed up the computation of SUE-logit that is required to estimate the travelers' utility function coefficients. On the other hand, it implicitly assumes that travelers consider all paths to travel between an O-D pair and hence may induce overlapping among paths. Path overlapping is known to affect the computation of path choice probabilities in multinomial logit models (MNL), and it has been identified as a drawback of algorithms that rely on the set of all reasonable paths (Yang et al., 2001; Wang et al., 2016; Lo and Chan, 2003; Liu and Fricker, 1996). Navigation apps are widely used nowadays, and they recommend a constrained set of options to travel between an O-D pair. This makes us believe that in modern real-world applications the assumption of a constrained path set between O-D pairs is more behaviorally plausible than considering all reasonable paths. Moreover, a constrained path set is also expected to induce less path overlapping, which is detrimental to logit route choice models. A challenge of working with a constrained path set is the impossibility of performing a link-level decomposition of the path choice probabilities and hence the potential increase in computational cost in the network loading stage of SUE-logit. In addition, while path sets can be reasonably approximated with existing information, e.g., Google Maps recommendations, they may be inaccurate or not available for all O-D pairs. To improve our prior about the true composition of the path set, the path sets are dynamically updated via column generation methods (Damberg et al., 1996; García-Ródenas and Marín, 2009). The size of the consideration sets is treated as a hyperparameter and is constrained to small values to reduce the computational burden.
To capture the correlation between paths in the consideration set due to shared link segments, path utilities are corrected with the same principle used in the path size logit (PSL) model (Ben-Akiva and Bierlaire, 1999). Path utilities incorporate the logarithm of a factor that decreases with the amount of overlapping among paths within the same consideration set. A path with no overlapping links needs no utility adjustment since its PSL factor equals one. Thus, the PSL correction reduces the chances of a violation of the Independence of Irrelevant Alternatives (IIA) assumption of the MNL model. Previous literature suggests generating paths that do not overlap with existing paths during the column generation phase (Damberg et al., 1996). However, this approach requires solving a combinatorial problem over all possible subsets of paths and it implicitly assumes that travelers' consideration sets contain paths with few or no overlaps. In a congested network, travel times are endogenous and dependent on the traffic flows. At equilibrium, both traffic flows and travel times are expected to reach a stationary point where travelers have no incentive to switch to alternative paths. Depending on the underlying behavioral representation used to model travelers' route choices, different types of network equilibrium are induced. In Deterministic User Equilibrium (DUE), travelers' choices are deterministic, meaning that the modeler fully knows the specification of the traveler's utility function. In contrast, in Stochastic User Equilibrium (SUE), travelers' choices are assumed to be probabilistic and the modeler may be ignorant of a set of unobservable components of the traveler's utility function. In the network context, at SUE, no user believes he can improve his travel cost by unilaterally changing routes (Daganzo and Sheffi, 1977). Under SUE with logit assignment (SUE-logit), travelers are assumed to make choices consistent with a multinomial logit (MNL) model.
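The path size correction described above can be sketched as follows. This is a minimal illustration of the PS factor of Ben-Akiva and Bierlaire (1999), not the paper's implementation; the incidence matrix and link lengths are hypothetical values chosen for the example.

```python
import numpy as np

def path_size_factors(delta, link_lengths):
    """Path size (PS) factor for each path in a consideration set.

    delta: |A| x |H| link-path incidence matrix (delta[a, h] = 1 if
    link a belongs to path h); link_lengths: length of each link.
    The log of the PS factor is added to the path utility; a path
    sharing no links with other paths has PS = 1, so its correction
    ln(PS) = 0 and its utility is unchanged.
    """
    delta = np.asarray(delta, dtype=float)
    lengths = np.asarray(link_lengths, dtype=float)
    paths_per_link = delta.sum(axis=1)        # number of paths using each link
    path_lengths = delta.T @ lengths          # L_h: total length of path h
    ps = np.empty(delta.shape[1])
    for h in range(delta.shape[1]):
        on_path = delta[:, h] > 0
        ps[h] = np.sum(lengths[on_path] / path_lengths[h] / paths_per_link[on_path])
    return ps

# Two disjoint single-link paths: no overlap, so both factors equal 1
ps_disjoint = path_size_factors(np.eye(2), link_lengths=[2.0, 3.0])
# Two identical single-link paths: full overlap, PS = 1/2 for each
ps_shared = path_size_factors(np.array([[1, 1]]), link_lengths=[2.0])
```

As the second call shows, the factor shrinks (and the ln(PS) penalty grows) as overlap increases, which is what counteracts the IIA-induced overweighting of overlapping paths.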
The logit model remains the gold standard to model route choices due to its good compromise between behavioral realism and mathematical tractability (McFadden, 1973). Consider a traveler l ∈ L choosing a path i within her consideration set J_l based on the latent utility U_jl of each alternative path j ∈ J_l. Suppose that the modeler knows the observable component V_jl of the traveler's latent utility but ignores a component e_jl which can be safely assumed to be stochastic. Therefore, if any attribute relevant to the route choice decision is unobserved, travelers' choices will appear probabilistic rather than deterministic from the modeler's perspective. In particular, if e_jl i.i.d. ∼ EV(0, µ), the probability p_i that traveler l chooses path i ∈ J_l has the closed form p_i = exp(µV_il) / Σ_{j ∈ J_l} exp(µV_jl), where µ > 0 is a scale parameter of the Extreme Value (EV) Type 1 distribution, which is set to 1 for convenience and identification purposes. µ is inversely proportional to the variance of the random component, and thus it is expected to be higher the smaller the unobserved component of the latent utility is. In the extreme case where the latent utility function is fully observed, µ → ∞ and the choice probabilities reduce to an indicator function taking the value 1 if the utility of an alternative is the highest within a choice set and 0 otherwise, i.e., the DUE case. The original formulation of the SUE-logit problem entails that route choices are made based on travel cost/time only (Fisk, 1980). Utility, however, may depend on additional attributes such as travel time reliability, waiting time or monetary costs (e.g., toll fees).
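The closed-form choice probability and the µ → ∞ limit can be illustrated with a short sketch. The utility values below are hypothetical; shifting utilities by their maximum before exponentiation is a standard numerical-stability trick that leaves the probabilities unchanged, since only utility differences matter in the MNL.

```python
import numpy as np

def logit_probabilities(V, mu=1.0):
    """MNL choice probabilities over one consideration set.

    V: observable path utilities V_jl; mu: logit scale, fixed to 1
    for identification. Utilities are shifted by their maximum before
    exponentiation for numerical stability (probabilities unchanged).
    """
    z = mu * np.asarray(V, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Three paths with hypothetical utilities
V = [-2.0, -1.5, -3.0]
p = logit_probabilities(V)            # probabilistic choices (SUE-logit)
p_due = logit_probabilities(V, mu=50.0)   # large mu: approaches the DUE limit
```

With a large scale, virtually all probability mass concentrates on the best path, mimicking the indicator-function limit described above.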
To make a more explicit bridge with the travelers' utility function specified in route choice models and to extend the analysis to a multi-attribute case, we reformulate the SUE-logit problem as follows: v_a(u) = θ_t t_a(u) + Σ_{k ∈ K_Z} θ_{Z_k} Z_ak is the utility associated with link a at a traffic flow level u, Z_ak is the value of the exogenous attribute k ∈ K_Z at link a, K_Z is the set of exogenous attributes at each link, and θ_Z ∈ R^|K_Z| and θ = (θ_t, θ_Z) ∈ R^|K| are the vectors of coefficients associated with the exogenous attributes and with all attributes, respectively. Exogenous attributes may include the number of traffic lights, street intersections or the level of income of the area where a link is located. From the specification of v_a(·), it follows that travel time (t) is the only endogenous attribute in the utility function and that it is linearly weighted by the preference coefficient θ_t. The first order optimality condition of Problem 4 gives the path flow solution f_h = q_w exp(Σ_{a ∈ A} δ_ah v_a) / Σ_{h' ∈ H_w} exp(Σ_{a ∈ A} δ_ah' v_a), where δ_ah = I(link a ∈ path h) and v_a is the utility of link a ∈ A at SUE-logit. As expected, the path flow vector f at SUE-logit follows a logit distribution. From a micro-level travel behavior standpoint, q_w individuals traveling in the O-D pair w ∈ W are making route choices according to a logit model (Eq. 3), and hence the path flow f_h is the aggregate of these individual decisions (Eq. 5). A detailed derivation of the extension of the Fisk (1980) formulation from a single-attribute to a multi-attribute case is included in Appendix A.2. Remark 1. In line with the rules of parameterization used to derive the MNL, the specification of the utility function defined to compute SUE-logit should account for the fact that the utility function coefficients are scaled by a factor µ ∈ R+ inversely proportional to the variance of the unobservable component of the utility function, i.e., θ = µθ̄, where θ̄ is the unscaled vector of logit coefficients, which is not identifiable.
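The aggregation in Eq. 5 can be sketched for a single O-D pair. This is an illustrative computation under assumed link utilities v_a already evaluated at equilibrium; the incidence matrix, demand and utilities are hypothetical.

```python
import numpy as np

def sue_logit_path_flows(q_w, delta, v_links):
    """Path flows at SUE-logit for one O-D pair (Eq. 5).

    q_w: O-D demand; delta: |A| x |H| link-path incidence matrix
    (delta[a, h] = 1 if link a is on path h); v_links: link utilities
    v_a, e.g. theta_t * t_a + sum_k theta_Zk * Z_ak. The utility of a
    path is the sum of the utilities of its links.
    """
    V_paths = np.asarray(delta, dtype=float).T @ np.asarray(v_links, dtype=float)
    e = np.exp(V_paths - V_paths.max())   # stabilized exponentials
    return q_w * e / e.sum()              # demand times logit probabilities

# Two parallel single-link paths with equal utilities split the demand evenly
f = sue_logit_path_flows(q_w=100.0, delta=np.eye(2), v_links=[-1.0, -1.0])
```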
A direct consequence in the single-attribute case is that only the sign but not the magnitude of the coefficients should be assumed to be known by the modeler. This is particularly relevant in cases where estimates from external travel behavior studies are used to set the utility function coefficients of the route choice model defined for the computation of SUE-logit. Therefore, accounting for the scale factor of the logit parameters is critical for a correct economic and behavioral interpretation of the coefficients of multi-attribute utility functions, but it is oftentimes overlooked in the LUE and ODLUE literature. Suppose the vector of travelers' preferences θ ∈ R^|D|, the values of the matrix of exogenous attributes Z and the travel times at SUE-logit are known, and that the goal is to find the resulting path/link flows in the transportation network. This is precisely the problem that stochastic network loading (SNL) aims to solve. Assumptions about the composition of the consideration set can significantly speed up the computation of SNL. A well-known example is the Dial algorithm which, under the assumption of path sets containing all reasonable paths, decomposes path probabilities into link-level weights that can then be used to recursively obtain the resulting link flows in the transportation network. Alternatively, and in line with Assumption 5, we assume path sets of constrained size that are dynamically updated via a column generation method. The pseudo-code of our SNL method is shown in Algorithm 2, Appendix B.1. If travel times were insensitive to variations of link flows, namely, exogenous, a single computation of SNL would suffice to obtain the link and path flows at SUE-logit. In practice, travel times are endogenous variables and thus the computation of SUE-logit requires performing SNL multiple times.
The method of successive averages (MSA) is one of the prominent algorithms used to compute SUE-logit in the ODLUE literature. At the initial iteration, SNL is computed to generate a feasible link flow solution. At each of the following iterations, the convex combination of the SNL link flow solutions at the current and the previous MSA iteration is computed to obtain a new feasible solution. MSA defines a step size λ_i = 1/(1 + i) to set the weights λ_i and 1 − λ_i that are used to compute the convex combination of solutions at iteration i. Because the feasible set of SUE-logit is convex, the convex combination necessarily gives a feasible point. The process is repeated until some convergence criterion has been achieved, such as a small enough difference between the current and previous feasible link flow solutions. Frank-Wolfe (F-W) is a general purpose method for constrained convex optimization that can outperform MSA by making a smarter choice of the step size parameter λ. A key observation is that the convex combination of two feasible link flow solutions obtained via SNL provides a feasible descent direction for the objective function of the traffic equilibrium problem (Eq. 4). Thus, the value of λ used for the convex combination of solutions is chosen to best improve the SUE-logit objective. This solution strategy for SUE-logit was originally implemented by Chen and Alfa (1991) with a utility function dependent on travel time only, and it is also integrated in the disaggregate simplicial decomposition (DSD) algorithm developed by Damberg et al. (1996). An interesting feature of DSD is the introduction of a column generation phase that dynamically updates the consideration sets among O-D pairs. Consideration sets can be augmented with new paths according to different criteria, such as the choice probabilities or the level of dissimilarity of the candidate paths with respect to the existing paths in the consideration sets.
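The MSA scheme just described can be sketched on a toy congested network. This is a minimal illustration, not the paper's Algorithm 2: the two-link network, BPR parameters, demand and travel time coefficient are all assumed values, and the SNL oracle is the plain logit loading of one O-D pair.

```python
import numpy as np

def bpr(t0, x, cap, alpha=0.15, beta=4.0):
    """BPR link travel time function."""
    return t0 * (1.0 + alpha * (x / cap) ** beta)

def snl_two_links(x, q=100.0, theta_t=-1.0):
    """Toy SNL oracle: logit loading of demand q over two parallel
    links, given the travel times induced by the current flows x."""
    t = np.array([bpr(1.0, x[0], 50.0), bpr(1.2, x[1], 50.0)])
    p = np.exp(theta_t * t)
    return q * p / p.sum()

def msa(snl, x0, n_iter=500):
    """MSA: convex combination of the current solution and the SNL
    auxiliary solution, with the decreasing step size 1/(1 + i)."""
    x = np.asarray(x0, dtype=float)
    for i in range(1, n_iter + 1):
        lam = 1.0 / (1.0 + i)
        x = (1.0 - lam) * x + lam * snl(x)   # stays feasible: the set is convex
    return x

x_eq = msa(snl_two_links, x0=[50.0, 50.0])
residual = np.max(np.abs(x_eq - snl_two_links(x_eq)))   # fixed-point gap
```

At convergence the flows reproduce themselves under SNL (small fixed-point residual), and the link with the lower free-flow time carries more flow.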
Non-linear least squares (NLLS) is a standard estimation method in the ODE and ODLUE literature. In contrast to maximum likelihood estimation (Lo and Chan, 2003), there is no need to make distributional assumptions about the data generating process of the traffic counts (Cascetta and Russo, 1997), and there is a variety of specialized algorithms to minimize the NLLS optimization objective (Bazaraa et al., 2006). Note that when the response function is linear, NLLS reduces to ordinary least squares (OLS) and thus the NLLS solution can be proved to be unique under mild conditions. Under a non-linear response function, there may exist multiple local minima and saddle points that make the optimization sensitive to the starting points and prone to converge toward a local rather than the global minimum. Therefore, the performance of the optimization algorithms heavily depends on the class of non-linearity of the response function with respect to the parameters of interest. Consider the regression equation y = m(β, X) + u, where m(·) is the response function, X ∈ R^{|N|×|K|} is the matrix of values of a set of exogenous attributes, y ∈ R^|N| is the vector of values of the dependent variables, β ∈ R^|K| is the true parameter vector and u ∈ R^|N| is a vector of random perturbations with some arbitrary distribution. The NLLS problem consists in finding the NLLS estimator β̂_NLLS that minimizes the residuals, namely, the deviation between the vector of observed measurements ȳ and the predictions of the response function: β̂_NLLS = argmin_β ||ȳ − m(β, X)||²₂. The application of the NLLS formulation to our problem is direct. Let θ̂ ∈ R^|D| be the vector of estimated coefficients of the travelers' utility function, t ∈ R^|A_o| the vector of link travel times, Z ∈ R^{|A|×|K_Z|} the matrix of exogenous link attributes and x(θ, Z, t) the vector of link flow functions of any link a ∈ A_o.
In particular, when all attributes of the travelers' utility function are exogenous and known, such that t = t̄, x(θ, Z, t̄) becomes a vector valued function that resembles the response function m(β, X) in NLLS and has a closed form: the link flows aggregate the path flows f_h = q_w p_h(θ, Z, t̄) through the link-path incidence structure, where p(θ, Z, t̄) is the vector of path choice probabilities, p_h = exp(Σ_a δ_ah v_x,a) / Σ_{h' ∈ H_w} exp(Σ_a δ_ah' v_x,a). In vector form, the normalization amounts to an element-wise division (⊘) of the exponentiated path utilities by their sum within each O-D pair, and v_x(θ, Z, t̄) = θ_t t̄ + Zθ_Z is the vector of link utilities. Finally, if x̄ ∈ R^|A_o| is the vector of observed traffic counts, the NLLS estimator θ̂ of θ ∈ R^|D| can be obtained as θ̂ = argmin_θ ||x̄ − x(θ, Z, t̄)||²₂. The use of the MNL model in our problem imposes rules of identifiability similar to those used in discrete choice models. For the computation of the logit choice probabilities, what matters are the differences in utility among the alternatives (Walker, 2002). Therefore, if an attribute of the paths' utility function takes the same value within each path set, then the coefficient weighting that attribute will be, by construction, not identifiable. From a behavioral point of view, travelers would perceive no difference in utilities when facing a path choice decision and thus their choices will not be informative about the strength of their preference for that attribute. From an optimization standpoint, there will be no gradient with respect to the attribute's coefficient because the choice probabilities are invariant to changes in the value of that coefficient. A second rule of identifiability refers to the inclusion of alternative specific constants, which may lead to an overspecification of the utility function. The practical implication in our problem is that the alternative specific constant of one of the paths connecting each O-D pair should be fixed, e.g., set to zero. A third rule of identifiability is related to the minimum amount of traffic count measurements required to estimate a certain number of coefficients in the utility function, which can be referred to as empirical identification (Walker, 2002).
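The NLLS estimator with exogenous travel times can be sketched end to end on a toy network. This is a minimal illustration under assumed data, not the paper's estimator: three parallel single-link paths, hypothetical travel times and a single exogenous attribute, with noiseless synthetic counts generated from assumed true coefficients; `scipy.optimize.least_squares` stands in for the specialized NLLS algorithms mentioned above.

```python
import numpy as np
from scipy.optimize import least_squares

def link_flows(theta, t, Z, q=100.0):
    """Response function x(theta, Z, t): logit flows on three parallel
    single-link paths with exogenous travel times t and one exogenous
    attribute Z per link (hypothetical values)."""
    v = theta[0] * t + theta[1] * Z       # link utilities v_x = theta_t*t + Z*theta_Z
    p = np.exp(v - v.max())
    return q * p / p.sum()

t = np.array([1.0, 1.5, 2.0])
Z = np.array([0.0, 1.0, 0.5])
theta_true = np.array([-1.0, -0.5])
x_obs = link_flows(theta_true, t, Z)      # noiseless synthetic traffic counts

# NLLS: minimize ||x_obs - x(theta)||_2^2 over theta
res = least_squares(lambda th: link_flows(th, t, Z) - x_obs, x0=np.zeros(2))
theta_hat = res.x
```

With three alternatives, the two utility-difference equations are linearly independent, so both coefficients are identifiable and the zero-residual solution recovers the true values.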
Similar to the rank condition in OLS, the number of observations should be higher than or equal to the number of parameters but, in practice, a larger amount of data may be needed due to the existence of collinearity between observations. To show the relevance of the identification rules, let's consider the network in Figure 1 and suppose that there are link flow measurements available for the two links. Assume the network is uncongested, t_1(x_1) ≈ t^0_1, t_2(x_2) ≈ t^0_2, and that the observed link counts perfectly match the true link flow solution at equilibrium, i.e., (x̄_1, x̄_2) = (x_1, x_2), x̄_1 + x̄_2 = q̄. From Eq. 1, Section 2.5, we note that the link flow solution of the inner level problem can be written in closed form as x_1 = q̄ exp(θ_t t^0_1 + θ_c c_1) / (exp(θ_t t^0_1 + θ_c c_1) + exp(θ_t t^0_2 + θ_c c_2)) and x_2 = q̄ − x_1, which satisfies both the conservation constraints between travel demand and path flows and the non-negativity constraints of link flows (Eq. 4, Section 2.5) for all θ_t ∈ R, θ_c ∈ R. Equating the two link count measurements to these flows yields a system of two equations (Eq. 13) which depends on the two unknown coefficients of the utility function. Note that, in line with the first identification rule, Eq. 13 can be solved for θ_c or θ_t if the difference in travel time or cost is zero, respectively (see Section 2.5). Besides, from the third rule of empirical identification, the two traffic count measurements would seem to suffice to identify the two parameters. However, the two equations in Eq. 13 are identical, hence linearly dependent, and thus one equation does not provide additional information to identify another coefficient of the utility function. Remark 2. The objective function of the optimization problem may be globally minimized at a point that is not attainable or where the gradient does not vanish. Suppose that the traffic counts are consistent with UE, such that (x̄_1, x̄_2) = (0, q̄) or (x̄_1, x̄_2) = (q̄, 0).
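The linear dependence of the two count equations can be verified numerically. The sketch below, with hypothetical travel times and costs for the two-route network, computes a finite-difference Jacobian of the counts with respect to (θ_t, θ_c); flow conservation (x_1 + x_2 = q̄) forces its rows to be collinear, so the Jacobian has rank one and a single count measurement carries all the information.

```python
import numpy as np

def two_link_flows(theta, q=100.0):
    """Uncongested two-route network of Figure 1 with utility
    v_j = theta_t * t0_j + theta_c * c_j (hypothetical values)."""
    t0 = np.array([1.0, 1.5])
    c = np.array([2.0, 1.0])
    v = theta[0] * t0 + theta[1] * c
    p = np.exp(v - v.max())
    return q * p / p.sum()

# Central finite-difference Jacobian of the two counts w.r.t. (theta_t, theta_c)
theta0 = np.array([-1.0, -0.5])
eps = 1e-6
J = np.column_stack([
    (two_link_flows(theta0 + eps * e) - two_link_flows(theta0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
# x1 + x2 = q for every theta, so the rows of J sum to zero: rank 1
rank = np.linalg.matrix_rank(J, tol=1e-6)
```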
With a utility function dependent on travel time only, these link flow measurements are consistent with the limiting cases where θ_t → −∞ or θ_t → ∞, respectively. While the identification rules are satisfied, the data generating process does not follow SUE-logit. To our knowledge, testing for the SUE-logit assumption remains an open question in the literature. If this assumption does not hold in practice, the optimization algorithm chosen to fit the travel time coefficient is expected to improve the objective function over iterations even though the global optimum is not attainable. Remark 3. Suppose that the monetary costs and the free flow travel times are equal in the two links, i.e., c_1 = c_2, t^0_1 = t^0_2. Under UE, by definition, the link travel times will be the same for any level of demand q̄. Note that the latter will be true even if the free flow travel times of the links are different, provided that the demand level surpasses a certain threshold such that both links are utilized. Because the travelers are uniformly distributed among paths and the path utilities are the same, SUE-logit will reproduce the same equilibrium. In line with the first rule of identifiability, the travel time coefficient θ_t is not identifiable but, regardless, any value of θ_t will perfectly reproduce the observed link flows. An optimization library will not necessarily warn about these identification problems and it may return an arbitrary estimate for θ_t. A basic sanity check is to analyze whether the objective function improves over iterations. Note, however, that this identifiability issue will not affect the goodness of fit but only statistical inference. Second order optimization methods remain the gold standard to estimate the travelers' utility function from traffic counts in the LUE and ODLUE literature.
One of the earliest examples in the ODLUE literature is found in Liu and Fricker (1996), who use the secant method, a quasi-Newton method for unidimensional optimization, to estimate the coefficients of a utility function dependent on travel time only. More recent examples are Yang et al. (2001) and Wang et al. (2016), who solved the outer level problem of ODLUE via Sequential Quadratic Programming (SQP), a variant of the Newton method for constrained optimization. Applications of second order methods are also found in travel behavior research, where Broyden-Fletcher-Goldfarb-Shanno (BFGS) remains a preferred optimizer to find the maximum likelihood estimates in discrete choice models (Train, 2003). Second order optimization methods can achieve quadratic or superlinear convergence in convex problems. They are appealing when datasets are of moderate size and thus when the computational cost for the calculation or approximation of the Hessian matrix is reasonable. However, their convergence guarantees heavily rely on how well the curvature of the objective function informs about the direction of its steepest descent/ascent. Besides, in non-convex problems, their performance strongly depends on how close the initial points are to a local optimum and on whether these points are located within a locally convex region of the objective function. Furthermore, the frequent sign changes of the curvature in non-convex functions may cause second order optimization methods not to converge to a local minimum, or to get stuck at saddle points or flat regions of the optimization landscape. First order methods have become a gold standard for non-convex optimization due in part to the huge computational gains attained by avoiding the Hessian matrix computation.
The machine learning community has made continued efforts to develop multiple variants of these methods that accelerate the optimization without compromising computational cost and that perform reasonably well in highly non-convex optimization landscapes (Staib et al., 2019). Some optimizers that remain in state-of-the-art applications include stochastic gradient descent (SGD) (Robbins and Monro, 1951) and the Adagrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2015) optimizers. In the context of quasiconvex optimization problems, Hazan and Levy (2015) proposed a modified version of gradient descent (GD) in which the gradient direction is normalized by its norm. The existence of flat regions in quasiconvex problems causes the gradients to become small despite the fact that they may be pointing in the right direction of improvement. In this setting, the normalization of the gradient helps to keep improving the objective function in flat regions, as well as to avoid gradient explosion in sharp regions of the feasible space. The application of normalized gradient descent (NGD) in unidimensional unconstrained optimization problems generates parameter updates according to the sign of the first derivative of the objective function. Interestingly, the ODLUE and LUE literature has not explored the use of first order optimization methods to estimate the utility function coefficients using system level data. Further sections of this paper show theoretical and empirical results to support the choice of NGD over the standard second order optimization methods used in previous work. The sole application of the NLLS method to estimate the travelers' utility function coefficients from network level data requires that travel times are known and exogenous. In uncongested networks, this assumption may be enforced by setting the travel times equal to the free flow travel times.
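The advantage of normalizing the gradient in flat regions can be demonstrated with a short sketch. The one-dimensional objective below is a hypothetical quasi-convex function with tiny gradients everywhere; plain GD barely moves from the starting point, while NGD with a decaying step size makes steady progress toward the minimizer at zero.

```python
import numpy as np

def ngd(grad, theta0, lr=0.1, n_iter=500):
    """Normalized gradient descent (after Hazan and Levy, 2015): the
    descent direction is the gradient divided by its norm, so progress
    does not stall where the raw gradient is tiny."""
    theta = np.asarray(theta0, dtype=float)
    for i in range(1, n_iter + 1):
        g = np.asarray(grad(theta), dtype=float)
        norm = np.linalg.norm(g)
        if norm < 1e-12:            # vanishing gradient: stationary point
            break
        theta = theta - (lr / np.sqrt(i)) * g / norm   # decaying step, unit direction
    return theta

# Gradient of a flat quasi-convex toy objective, minimized at theta = 0
grad_flat = lambda th: 1e-3 * np.tanh(th)

theta_ngd = ngd(grad_flat, theta0=[5.0])

theta_gd = np.array([5.0])
for _ in range(500):                 # plain GD with the same learning rate stalls
    theta_gd = theta_gd - 0.1 * grad_flat(theta_gd)
```

In one dimension the normalized update reduces to a step of fixed magnitude in the direction of the sign of the derivative, exactly as described in the text.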
In congested networks, this assumption is not plausible and hence it is necessary to solve both the NLLS and the SUE-logit problems of a bilevel formulation in an iterative fashion. Below is the standard formulation of the problem solved in the LUE literature, which we extend to account for travelers' utility functions with multiple coefficients: min_θ̂ ℓ(θ̂) subject to (x, f) solving the lower level problem min g(θ̂, Z, x, f), where ℓ(θ̂) and g(θ̂, Z, x, f) are the objective functions at the upper and lower levels and θ̂ ∈ R^|D| is the vector of estimated utility function coefficients. The idea behind expressing the objective functions in terms of θ̂ can be understood as follows. At the inner level, any value of θ̂ leads to a different assignment of path/link flows in the network; hence, the objective function g(θ̂, x, f) depends on θ̂. Note also that, thanks to the SUE-logit property (Eq. 5), ℓ(θ̂) can be approximated as an analytical function of θ̂. A brute-force strategy to solve the bilevel formulation in Problem 14 would consist of performing a grid search over all feasible values of θ̂. For each value of θ̂, SUE-logit would first be computed at the inner level and then the objective function at the upper level would be evaluated using the closed form expression derived in Eq. 8, Section 3.2.2. Lastly, a candidate solution would be the value of θ̂ that minimizes ℓ(θ̂). Note however that this solution strategy becomes intractable in the higher dimensional space resulting from a multi-attribute travelers' utility function or from the joint estimation of the utility function coefficients and an O-D matrix. A standard heuristic employed in the ODE and ODLUE literature to solve bilevel formulations is an alternating optimization of the inner and outer level problems. Over iterations, this heuristic is expected to converge toward a stationary point that, at the outer level, minimizes the gap between observed and predicted flows and, at the inner level, induces a link/path flow solution that satisfies traffic equilibrium.
The solution strategy in our problem boils down to performing gradient-based updates of the utility function coefficients at the outer level and computing SUE-logit at the inner level using the coefficients θ̂ previously updated at the outer level. The resulting travel times from the SUE-logit computation are then fed to the outer level, and the process is repeated until some convergence criterion is met. There are many variants of solution methods in the literature but most can be summarized with the following steps: 1. Initialization of the estimated parameters (e.g., utility function coefficients or O-D matrix). 2. Inner level problem: solve SUE-logit and update the travel times in the network. 3. Outer level problem: solve NLLS with the new travel times and update the estimated parameters. 4. Repeat steps 2 and 3 until some convergence criterion is fulfilled. In our application, and similar to the LUE literature, the utility function coefficients in θ̂ are the only free parameters. Remark 4. A key advantage of SUE-logit in network modeling applications is the existence of a closed form and unique solution in path and link flow spaces. This property can significantly ease the solution of the typical bilevel optimization structure that arises in ODLUE problems. During the alternating optimization, the closed form solution is valid but only for the set of travel times obtained in the inner level problem. Thus, it is convenient to perform small updates of θ̂ at the outer level such that the inner level solution remains valid. This section studies the mathematical properties of our optimization problem. Note that our problem falls into the LUE class and, since LUE is a subclass of ODLUE, the analysis of the mathematical properties of our problem is also relevant for the ODLUE problem. To ease the analysis of the mathematical properties, it is convenient to start by assuming that the utility function depends on exogenous attributes only.
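The four alternating steps can be sketched on a toy congested network. This is a minimal illustration under assumed values, not the paper's algorithm: the inner level computes SUE-logit on a hypothetical two-link network via MSA, the outer level takes one-dimensional normalized-gradient (sign) steps on the NLLS objective with finite-difference derivatives, and the synthetic counts are generated at an assumed true coefficient.

```python
import numpy as np

def bpr(t0, x, cap, alpha=0.15, beta=4.0):
    return t0 * (1.0 + alpha * (x / cap) ** beta)

def inner_sue_logit(theta_t, q=100.0, n_iter=200):
    """Inner level (step 2): SUE-logit flows/times on a toy two-link
    network via MSA, utility dependent on travel time only."""
    t0 = np.array([1.0, 1.2])
    cap = np.array([60.0, 60.0])
    x = np.full(2, q / 2.0)
    for i in range(1, n_iter + 1):
        t = bpr(t0, x, cap)
        p = np.exp(theta_t * t)
        p = p / p.sum()
        x = x + (q * p - x) / (1.0 + i)      # MSA averaging step
    return x, bpr(t0, x, cap)

def estimate_theta_t(x_obs, theta0=-0.1, lr=0.05, n_iter=60, eps=1e-4):
    """Outer level (step 3): 1-D NGD on the NLLS objective, i.e. fixed
    steps in the direction of the sign of the derivative, alternating
    with the inner SUE-logit computation (step 4)."""
    loss = lambda th: np.sum((inner_sue_logit(th)[0] - x_obs) ** 2)
    theta = theta0
    for _ in range(n_iter):
        g = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)   # FD derivative
        theta = theta - lr * np.sign(g)
    return theta

theta_true = -1.0
x_obs, _ = inner_sue_logit(theta_true)   # step 1: synthetic counts at the truth
theta_hat = estimate_theta_t(x_obs)
```

The small fixed outer step keeps the inner solution approximately valid between updates, in line with Remark 4.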
This assumption significantly facilitates the solution of the bi-level formulation because the inner level problem at each iteration consists of a single pass of SNL. This assumption is plausible in an uncongested network where travel times are close to the free flow travel times, or when there is access to travel time measurements for every link in the network. Alternatively, this assumption may be enforced by setting either the travel time coefficient of the utility function or the parameter α of the link performance function to zero, but this is arguably less realistic. The non-convexity of the ODLUE and LUE problems with respect to the travelers' utility function coefficients has been discussed in prior literature, but via counterexamples only (Yang et al., 2001). To our knowledge, no formal proof of the non-convexity has been given for either LUE or ODLUE. A key observation to prove the non-convexity of these problems is that their objective function ℓ is upper-bounded with respect to the utility function coefficients. In the one-dimensional case where θ ∈ R, the upper bounds of ℓ correspond to the solutions of UE when θ → +∞ and θ → −∞. To extend this intuition to the multi-dimensional case where θ ∈ R^|D|, it is useful to introduce a general definition of convexity (Boyd and Vandenberghe, 2004). Proposition 1 (Bounds of the objective function). The objective function of the LUE problem is lower-bounded and upper-bounded. Proof. The objective function ℓ: R^|D| → R of the LUE problem can be expressed as ℓ(θ) = Σ_{i ∈ N} (x_i(θ) − x̄_i)², where N is the set of observed traffic count measurements. Each traffic count function x_i, ∀i ∈ N, can be bounded as 0 ≤ x_i(θ) ≤ Q, where Q ∈ R+ is the total demand, namely, the sum of all cells of the O-D matrix Q ∈ R^{V×V}. Then, lower and upper bounds of the objective function follow: 0 ≤ ℓ(θ) ≤ Σ_{i ∈ N} max(x̄_i, Q − x̄_i)² ≤ |N|Q², which completes the proof. Let's now prove the non-convexity of the LUE problem with the following proposition: Proposition 2 (Non-convexity of the LUE problem).
Suppose the coefficients of the travelers' utility function in the LUE problem are identifiable. Then, the LUE problem is not convex if its objective function is upper-bounded. Proof. Let's prove this by contradiction. Assume that the LUE objective function ℓ(θ) = ||x(θ) − x̄||²₂, ∀θ ∈ R^|D|, is convex. Then, let's use Definition 1 of convexity at the points y_1 = (θ_1 − (1−λ)θ_2)/λ and y_2 = θ_2, with y_1, y_2, θ_1, θ_2 ∈ R^|D| and λ ∈ (0, 1]: since λy_1 + (1−λ)y_2 = θ_1, convexity gives ℓ(θ_1) ≤ λℓ(y_1) + (1−λ)ℓ(θ_2), which implies that ℓ(y_1) ≥ (ℓ(θ_1) − (1−λ)ℓ(θ_2))/λ (Eq. 17). If θ is identifiable, ℓ is not a constant function and there exist θ_1, θ_2 with ℓ(θ_1) > ℓ(θ_2). From Eq. 17, when ℓ(θ_1) > ℓ(θ_2), the RHS grows without bound as λ → 0⁺. Hence, the function ℓ is not upper-bounded, which generates a contradiction and completes the proof. Convexity is a sufficient but not a necessary condition for global optimality. Therefore, before ruling out the global optimality of our problem, it is convenient to study a generalization of convexity known as pseudo-convexity, which imposes weaker conditions on the optimization problem and provides sufficient conditions for global optimality. Pseudo-convex functions are a subclass of the quasi-convex functions. Hence, the properties of quasi-convex functions are also valid for pseudo-convex functions, while the converse is not true (Mangasarian, 1965). Pseudo-convexity is a generalization of convexity that, like quasi-convexity, extends the notion of unimodality to higher dimensional spaces. In contrast to quasi-convex functions but similar to convex functions, the first order necessary optimality condition of pseudo-convex functions suffices for global optimality (Crouzeix and Ferland, 1982). This property will be key later to prove the existence of a global minimizer in our optimization problem. A formal definition of pseudo-convexity is the following (Bazaraa et al., 2006): a differentiable function ℓ: R^|D| → R is pseudo-convex iff, ∀θ_1, θ_2 ∈ R^|D|, ∇ℓ(θ_1)ᵀ(θ_2 − θ_1) ≥ 0 implies ℓ(θ_2) ≥ ℓ(θ_1) or, equivalently, ℓ(θ_2) < ℓ(θ_1) implies ∇ℓ(θ_1)ᵀ(θ_2 − θ_1) < 0. The definition of pseudo-convexity can be assessed via the augmented Hessian of the objective function (Bazaraa et al., 2006).
(Crouzeix and Ferland, 1982). This proposition can also be evaluated in terms of the signs of the principal minors of the bordered Hessian (Mereau and Paquet, 1974). For the remaining analyses of mathematical properties, it will also be helpful to introduce Definition 3 of coordinate-wise pseudo-convexity. A function ℓ is coordinate-wise pseudo-convex if, when all coefficients of the utility function except for one are fixed, it is pseudo-convex with respect to the non-constant coefficient. Note that if the function depends on a single attribute, the definitions of coordinate-wise pseudo-convexity and pseudo-convexity are equivalent. Formally, ∀i ∈ D and ∀θ_1, θ_2 ∈ R^|D| that differ only in their i-th coordinate, (∂ℓ(θ_1)/∂θ_i)(θ_2,i − θ_1,i) ≥ 0 implies ℓ(θ_2) ≥ ℓ(θ_1) or, equivalently, ℓ(θ_2) < ℓ(θ_1) implies (∂ℓ(θ_1)/∂θ_i)(θ_2,i − θ_1,i) < 0. Proposition 13, Appendix A.4 shows that coordinate-wise monotonicity of the traffic count (response) functions implies coordinate-wise pseudo-convexity of the objective function of the LUE problem under exogenous travel times. The definition of coordinate-wise monotonicity is the following: Definition 4 (Coordinate-wise monotonicity of response functions). The traffic count (response) functions are monotonic with respect to each coefficient of the utility function if, when all coefficients of the utility function except for one are kept constant, the response functions are monotonic with respect to the non-constant coefficient. Propositions 10 and 11, Appendix A.3 illustrate two cases where, under mild assumptions, the coordinate-wise monotonicity of the traffic flow functions holds in a general transportation network. While it is easy to create counterexamples where the coordinate-wise monotonicity of the traffic flow functions is violated, analysis on synthetic data suggests that the assumption holds in practice (see Section 7.2.1). In the absence of measurement error in the traffic counts, the proofs of existence and uniqueness of a global optimum are direct (Propositions 15 and 16, Appendix A.5).
While the absence of measurement error can be enforced in experiments with synthetic data, it is implausible in real world problems where traffic count data are expected to be noisy. Fortunately, the pseudo-convexity of the objective function in our problem can be leveraged to extend the proofs of existence and uniqueness to the presence of measurement error. A key property of pseudo-convex functions that also holds for convex functions is that every local minimum is also a global minimum (Mangasarian, 1965). This relaxes the positive (semi)definiteness assumption on the Hessian of the objective function used to prove (strict) global optimality in convex problems. Following the proof in Bazaraa et al. (2006) for multi-variable pseudo-convex functions, Proposition 3 shows that this property can also be applied to prove the existence of a global minimum in our problem. Proposition 3 (Sufficient condition for the existence of a global minimum). Assume the objective function ℓ of the LUE problem under exogenous travel times is coordinate-wise pseudo-convex. Then, if the gradient of ℓ vanishes at θ* ∈ R^|D|, θ* is a global minimum. Proof. By assumption, ∇_θ ℓ(θ*) = 0, which implies that ∂ℓ(θ*)/∂θ_i = 0, ∀i ∈ D. Applying Definition 3 of coordinate-wise pseudo-convexity to each coordinate i ∈ D gives ℓ(θ_2) ≥ ℓ(θ*), ∀θ_2 ∈ R^|D|, which completes the proof. Remark 5. Proposition 14, Appendix A.4 gives a sufficient condition for the existence of a vanishing gradient (derivative) in the unidimensional case. While this seems restrictive, it addresses the standard setting studied in prior literature with a utility function dependent on travel time only. Proposition 4 proves the uniqueness of the global optimum in our problem by leveraging the strict quasi-convexity of differentiable pseudo-convex functions proven by Mangasarian (1965).
A formal definition of strict quasi-convexity is the following (Bazaraa et al., 2006): Definition 5 (Strict quasi-convexity). f : R^|D| → R is said to be strictly quasi-convex iff ∀θ1, θ2 ∈ R^|D| with f(θ1) ≠ f(θ2), f(λθ1 + (1 − λ)θ2) < max{f(θ1), f(θ2)}, ∀λ ∈ (0, 1). Analogously to Definition 3, the definition of strict quasi-convexity can also be applied coordinate-wise: Definition 6 (Coordinate-wise strict quasi-convexity). f : R^|D| → R is said to be coordinate-wise strictly quasi-convex iff ∀i ∈ D and for every pair θ1, θ2 ∈ R^|D| that differ only in coordinate i and satisfy f(θ1) ≠ f(θ2), f(λθ1 + (1 − λ)θ2) < max{f(θ1), f(θ2)}, ∀λ ∈ (0, 1). Proposition 4 (Sufficient conditions for uniqueness of the global minimum). If the objective function of the LUE problem is (coordinate-wise) pseudo-convex and a global minimum exists, then the global minimum is unique. Proof. We proceed by contradiction. Let f : R^|D| → R be the objective function of the problem and θ ∈ R^|D| the vector of coefficients of the utility function. Suppose that there are two global minima θ1, θ2 ∈ R^|D| with f(θ1) = f(θ2) = f* and θ1 ≠ θ2. Consider the points θ3 = (1 − λ)θ1 + λθ2 for λ ∈ (0, 1); the segment [θ1, θ2] forms a convex set. By Property 2, Mangasarian (1965), if f is pseudo-convex, f is also strictly quasi-convex. Similarly, if f is coordinate-wise pseudo-convex, f is also coordinate-wise strictly quasi-convex. Applying Definition 6 of strict quasi-convexity to θ1, θ2 ∈ R^|D| gives f(λθ1 + (1 − λ)θ2) < max{f(θ1), f(θ2)} = f*, which leads to a contradiction and completes the proof. In a congested network, travel times are endogenous and this requires solving a bilevel formulation of the LUE problem (see Section 3.4). Note that even if the inner and outer level problems of the bilevel formulation were convex, the analysis of global optimality is complex (Dempe and Franke, 2016). Because convex functions are a subclass of pseudo-convex functions, the pseudo-convexity of the outer problem is expected to make the analysis even harder. Despite these difficulties, this section proves two properties of the inner and outer level problems of the LUE bilevel formulation that can improve the convergence of an alternating optimization algorithm.
The SUE-logit problem with a utility function dependent on travel time only is known to be strictly convex in path (and link) flow space under link performance functions that are monotonically increasing and dependent on the traffic flow on the link only. To our knowledge, Proposition 5 extends, for the first time, the aforementioned property to multi-attribute utility functions. Proposition 5 (Uniqueness of the SUE-logit path flow solution with a multi-attribute utility function). Suppose that the travelers' utility function is linear-in-parameters and that the link cost functions are monotonically increasing and dependent on the traffic flow on the link only. Then, the SUE-logit problem has a unique solution in path flow space. Proof. Let us transform the objective function of Problem 4, Section 3.2.2 into a minimization problem. From Evans (1973), it follows that g2(f) = −⟨f, ln f⟩ is strictly convex on the feasible set S defined by the constraints in Problem 4, Section 3.2.2. Note that the matrix Z ∈ R^{A×kZ} of link attribute values is exogenous and thus g1(x, θ) can be rewritten accordingly. To analyze the convexity of the first term g1(x, θ), it suffices to analyze its second derivative with respect to any path flow variable fh. Then, replacing the link flow variables using the conservation constraint between path and link flows in the second term of g1, and given the class of link cost functions used in this proposition (Assumption 4, Section 3.1) and the fact that θt < 0, it follows that g1 is convex in the path flow variables fh (Fisk, 1980). As a consequence, g(x, f, θ) is the sum of a convex and a strictly convex function, and thus it is strictly convex. Also, all constraints in Problem 4, Section 3.2.2 are affine and therefore define a convex set. Thus, Problem 4 has a unique global minimum, which completes the proof. A pseudo-convex function can exhibit both convex and concave regions over its feasible space.
Nevertheless, its negative gradient always points toward the global descent direction (Hazan and Levy, 2015). Proposition 6 shows that this property also holds for the outer level objective of the LUE problem. Proposition 6. Assume that the objective function of the outer level of the LUE problem is coordinate-wise pseudo-convex and that a global optimal solution exists. Then, the negative gradient of the objective function points toward the global descent direction. Proof. Let θ* ∈ R^|D| be the global optimum of the LUE problem and θ̂ ∈ R^|D| a feasible point that is not a global optimum. By the coordinate-wise pseudo-convexity of f : R^|D| → R, f(θ*) < f(θ̂) implies ∇f(θ̂)ᵀ(θ* − θ̂) < 0, which proves that the negative gradient always points toward the global descent direction. Remark 6. In a congested network, each value of θ ∈ R^|D| generates a different traffic assignment at the inner level of the bilevel formulation. Therefore, link travel times will change between iterations of an alternating optimization algorithm. As a consequence, the objective function of the outer level problem will be (coordinate-wise) pseudo-convex with respect to the travelers' utility function coefficients only in a small neighborhood where travel times remain approximately constant. The solution method for the inner level problem of the bilevel formulation is described in Algorithm 3, Appendix B.2. The stochastic network loading (SNL) is performed according to Algorithm 2, Appendix B.1. The column generation phase is adapted from Damberg et al. (1996) and is performed once per iteration of the alternating optimization algorithm (Step 1, Algorithm 3, Appendix B.2). Note that in the original implementation of DSD, the column generation phase is performed at each iteration of SUE-logit; in the context of a bilevel optimization, however, this path set augmentation strategy would not select good paths if the current solution for θ is far from the global optimum.
The best convex combination of solutions at each iteration of SUE-logit (Step c, Algorithm 3, Appendix B.2) is searched over a uniform grid in [0, 1] with a granularity that depends on the desired level of accuracy. To reduce computational cost, the size of the augmented path set is constrained to a fixed quantity (kg) and the path set augmentation is only performed for a proportion (ρW) of the O-D pairs at each iteration. The sample of O-D pairs is selected according to the level of demand. To capture the correlation between paths in the consideration set due to overlapping link segments, we correct path utilities using the path size logit (PSL) factor (Ben-Akiva and Bierlaire, 1999). The utility term associated with the PSL factor is weighted by a coefficient β ∈ R that can be estimated (Bierlaire and Frejinger, 2008). For the scope of this paper, the coefficient β is treated as a hyperparameter with a default value equal to one. The utility term associated with the path size correction is updated before generating new paths and after selecting paths, depending on whether the composition of the path set changes when performing Steps 1 and 3 of the inner level optimization (Algorithm 3, Appendix B.2). To study the convenience of using first or second order methods, the outer level objective function of the bilevel formulation is optimized with a hybrid algorithm (see Algorithm 4, Appendix B.3). In a non-refined stage, the algorithm uses Normalized Gradient Descent (NGD) (Algorithm 5, Appendix B.3.4). Subsequently, the best solution obtained from the non-refined stage is used as the starting point for the refined stage, where second order optimization methods are used to obtain a more accurate solution (Algorithm 4, Appendix B.3). Two specialized second order methods were initially considered.
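The non-refined stage can be sketched with a minimal normalized gradient descent loop (a generic illustration, not the paper's Algorithm 5; the pseudo-convex objective below is a hypothetical stand-in with its minimum at θt = −1):

```python
import numpy as np

def ngd(f, grad_f, theta0, eta=2.0, iters=30):
    """Normalized gradient descent: fixed-length steps along the negative
    gradient direction. Ignoring the gradient's magnitude lets the method
    make progress on the flat regions of pseudo-convex objectives where
    plain gradient descent would stall; it returns the best iterate found."""
    theta = np.asarray(theta0, dtype=float)
    best, best_val = theta.copy(), f(theta)
    for _ in range(iters):
        g = np.asarray(grad_f(theta), dtype=float)
        norm = np.linalg.norm(g)
        if norm < 1e-12:              # vanishing gradient: stop (cf. Prop. 3)
            break
        theta = theta - eta * g / norm
        val = f(theta)
        if val < best_val:
            best, best_val = theta.copy(), val
    return best

# Hypothetical pseudo-convex objective with minimum at theta = -1:
# tanh(theta + 1)^2 is nearly flat far from -1 and sharp near it.
f = lambda th: float(np.tanh(th[0] + 1.0) ** 2)
grad = lambda th: np.array([2.0 * np.tanh(th[0] + 1.0)
                            * (1.0 - np.tanh(th[0] + 1.0) ** 2)])

theta_hat = ngd(f, grad, np.array([-14.0]))
```

With a fixed step η the iterates can only land within one step length of the optimum, which motivates switching to a second order method in the refined stage to polish the solution.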
The first is Gauss-Newton (GN), one of the standard optimizers in the Econometrics literature on non-linear regression (Del Pino, 1989; Amemiya, 1983; Gallant, 1975a), which can be seen as an unconstrained version of the sequential quadratic programming (SQP) method. The second is the Levenberg-Marquardt (LM) algorithm which, by interpolating between GN and gradient descent, can increase the robustness of GN in flat regions of the optimization landscape (Marquardt, 1963). Given that GN is a particular case of the LM method (Algorithm 6, Appendix B.3.5), only LM is used in the refined stage. Algorithm 1, Section 5.3 shows the pseudocode for the alternating optimization of the inner and outer levels of our bilevel formulation (Problem 14, Section 3.4). The algorithm returns the vector of utility function coefficients θ̂ ∈ R^|D| that minimizes the discrepancy between predicted and observed link flows. To guarantee that the predicted link flows follow SUE-logit, the last iteration of the alternating algorithm only solves the inner level problem using the value of θ̂ obtained in the previous iteration of the outer level problem.
Algorithm 1 BilevelOptimization
Input: iterations I; initial vector of utility function coefficients θ0; inputs of InnerLevelOptimization T, Z, t_f, q, λg, ρW; inputs of OuterLevelOptimization T1, T2, η, x̄, x, p, refined-method, η1, η2, b, δ
Step 0: Initialization: θ := θ0
Step 1: Alternating optimization
Our bilevel formulation can be cast as a mathematical program with equilibrium constraints (MPEC). For this class of problems, the feasible region is typically non-convex (Boyles et al., 2020), which makes it difficult to find theoretical guarantees even for convergence toward local optima (Sabach and Shtern, 2017; Dempe and Franke, 2016).
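The LM damping scheme described above can be sketched compactly. The response function below is a hypothetical one-parameter logit-style stand-in for a traffic count function, not the paper's actual link flow map; the data are noiseless so the global optimum sits exactly at the ground truth θ = −1:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta0, lam=1e-2, iters=50):
    """Levenberg-Marquardt for NLLS: solve (J'J + lam*I) d = -J'r each step.
    lam -> 0 recovers Gauss-Newton; a large lam gives a short, damped
    gradient-like step, which stabilizes the method in flat regions."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        r, J = residual(theta), jacobian(theta)
        step = np.linalg.solve(J.T @ J + lam * np.eye(theta.size), -J.T @ r)
        candidate = theta + step
        if residual(candidate) @ residual(candidate) < r @ r:
            theta, lam = candidate, lam * 0.5   # accepted: move toward GN
        else:
            lam *= 2.0                          # rejected: increase damping
    return theta

# Hypothetical response x_i(theta) = q_i / (1 + exp(theta * t_i)).
t = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([100.0, 80.0, 120.0, 90.0])
x_obs = q / (1.0 + np.exp(-1.0 * t))            # counts at theta_true = -1

residual = lambda th: q / (1.0 + np.exp(th[0] * t)) - x_obs
jacobian = lambda th: (-q * t * np.exp(th[0] * t)
                       / (1.0 + np.exp(th[0] * t)) ** 2).reshape(-1, 1)

theta_hat = levenberg_marquardt(residual, jacobian, np.array([-5.0]))
```

Starting from θ = −5, the early steps are rejected until the damping grows, after which the method transitions to fast Gauss-Newton steps near the optimum.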
Thus, we found it convenient to analyze a case where all attributes of the utility function are exogenous (Section 4.1) and where we could leverage the pseudo-convexity of the optimization problem (Section 4.1.2). The analysis of mathematical properties in Section 4 provided sufficient conditions for the existence and uniqueness of a global optimum. It remains to prove that the global optimum is attainable when running normalized gradient descent (NGD) for a finite number of steps. We will consider the LUE problem with a utility function dependent on a single exogenous attribute and where only the non-refined stage of the outer level algorithm is conducted. While this seems restrictive, it addresses the standard setting studied in prior literature with a utility function dependent on travel time only. Hazan and Levy (2015) proved the convergence of NGD to a global minimum for functions that are both strictly quasi-convex and locally-Lipschitz. These results are relevant for our problem, since pseudo-convex functions are also quasi-convex (Mangasarian, 1965). Locally-Lipschitz functions have their first derivative bounded by a positive constant and are only required to be Lipschitz in a small region around the optimum. Generally, the Lipschitz constant is relevant to study the theoretical convergence speed of first order optimization methods. Formally: Definition 7 (Lipschitzness). A unidimensional function f : R → R is G-Lipschitz iff for some G > 0, |f(x) − f(y)| ≤ G|x − y|, ∀x, y ∈ R. Proposition 7. Suppose that the travelers' utility function in the LUE problem depends on a single exogenous attribute. Then, the objective function is Lipschitz. Proof. Let f : R^|D| → R denote the objective function of the LUE problem. Starting from the LHS of Eq. 23 and using the reverse triangle inequality, we can bound the range of each traffic flow function xi, ∀i ∈ N, and the norm of the difference between the vectors of traffic flow functions in Eq.
24 as follows: each traffic flow function satisfies 0 ≤ xi(θ) ≤ Q, where Q ∈ R+ is the total demand, namely, the sum of all cells in the O-D matrix Q ∈ R^{V×V}; hence each squared residual lies in [0, Q²] and |f(θ1) − f(θ2)| ≤ NQ². Now define |θ1t − θ2t| = δ > 0, set G = NQ²/δ > 0, and replace it into Eq. 25, which gives |f(θ1) − f(θ2)| ≤ NQ² = Gδ = G|θ1t − θ2t| and proves that the objective function is G-Lipschitz. Armed with Proposition 7, we can now provide guarantees for the convergence of NGD when θ ∈ R and under exogenous travel times: Proposition 8. Assume that the objective function of the LUE problem is (coordinate-wise) pseudo-convex and that the travelers' utility function in the LUE problem depends on a single exogenous attribute. If the optimization problem is solved with normalized gradient descent (NGD), then NGD converges to an ε-optimal solution in a finite number of iterations (poly(1/ε)). Proof. By assumption, the objective function f : R → R is pseudo-convex and thus strictly quasi-convex. Under exogenous travel times, we can use Proposition 7 to conclude that f is locally Lipschitz. Thus, all conditions of Theorem 4.2, Hazan and Levy (2015) are satisfied, which completes the proof. This section derives closed-form expressions of the statistical tests used to analyze the coefficients of the travelers' utility function estimated with the methodology presented in Section 3. NLLS estimation is often preferred over MLE because the former relies on weaker distributional assumptions (Cameron and Trivedi, 2005). The non-linear regression formulation of the outer level objective of our problem defines each traffic count measurement xi ∈ R, ∀i ∈ N in the true data generating process as a scalar dependent random variable with conditional mean E[xi | Z, t̄] = xi(θ0, Z, t̄), where xi(·) is the traffic count (response) function (Eq. 8, Section 3.3.1), t̄ is the set of link travel times and Z is the matrix of exogenous attributes. Note that t̄ is assumed exogenous in the NLLS problem solved at the outer level but is iteratively updated via the alternating optimization with the inner level problem (Section 5.3).
The regression equation of xi is defined as xi = xi(θ0, Z, t̄) + ui, where ui is the random error coming from the true data generating process. The differentiation of the objective function of the NLLS problem (Appendix B.3.2) leads to the first order necessary optimality condition [Dθ x(θ)]ᵀ(x̄ − x(θ)) = 0 (Eq. 29), where n is the number of observations and Dθ x(θ) ∈ R^{n×d} is the Jacobian matrix corresponding to the stacked gradient vectors for each observation i ∈ N. Also, for ease of notation and for the remainder of the paper, x(θ) = x(Z, t̄, θ). Note that Eq. 29 defines a set of d equations, namely, one equation for each element of the vector of utility function coefficients θ ∈ R^|D|, and it restricts the residual to be orthogonal to Dθ x(θ). A key difference of NLLS with respect to OLS is that the matrix of exogenous attributes of the regression equation is replaced by the Jacobian matrix Dθ x(θ) (Appendix A.6). Under this realization, the statistical properties of the NLLS problem can be derived analogously to the OLS case. The proof of consistency of the NLLS estimator relies on a series of assumptions that have been well-established in the Econometrics literature (Wu, 1981). An accessible reference on these assumptions is Cameron and Trivedi (2005). Below are the assumptions reformulated in the context of our problem: Assumption 6 (Model specification). The response function is well-specified, i.e., for the true parameter vector θ0 ∈ R^|D|, x = x(θ0, Z, t̄) + u. Assumption 7 (Orthogonality between errors and regressors). In the data generating process, E[u|Z, t̄] = 0 and E[uuᵀ|Z, t̄] = Ω0, where Ω0 is the covariance matrix of the error terms. Remark 7. Assumption 7 allows for heteroscedastic errors in the NLLS problem but it requires knowing or estimating a functional form of the covariance matrix Ω0 (Cameron and Trivedi, 2005). To ease the analysis, we introduce an additional assumption: Assumption 8 (Spherical errors).
Errors are homoscedastic and non-autocorrelated, that is, their covariance matrix is diagonal: Ω0 = σ²I, for σ ∈ R≥0. Remark 8. Assumption 8 of spherical errors is satisfied if the errors are independent and identically distributed (iid) with constant variance. Assumption 9 (Identifiability). Each traffic flow function xi(·) satisfies: ∀i ∈ N, ∀θ1, θ2 ∈ R^|D|, xi(θ1, Z, t̄) = xi(θ2, Z, t̄) iff θ1 = θ2. Remark 9. Assumption 9 is directly satisfied when θ ∈ R and the traffic flow functions are (coordinate-wise) monotonic (see Definition 4, Section 4.1.2). Assumption 10 (Rank of the limit distribution of the Hessian matrix approximation). The matrix A0 = plim N⁻¹[Dθ=θ0 x(θ)]ᵀDθ=θ0 x(θ) exists and is finite and nonsingular ∀θ0 ∈ R^|D|. Remark 10. Assumption 10 formalizes the third rule of identifiability discussed in Section 3.3.2. Proposition 9 shows that this assumption can be tested via the rank of the Jacobian of the traffic flow functions: Proposition 9 (Full rank Jacobian and non-singular Hessian matrix approximation). Suppose the Jacobian matrix of the traffic flow functions given by Dθ x(θ) ∈ R^{n×d}, with n > d, is full rank at θ0 ∈ R^|D|. Then, the square matrix [Dθ=θ0 x(θ)]ᵀDθ=θ0 x(θ) ∈ R^{d×d} is positive definite. Proof. Consider any u ∈ R^d with u ≠ 0: uᵀ[Dθ=θ0 x(θ)]ᵀDθ=θ0 x(θ)u = ‖Dθ=θ0 x(θ)u‖²₂. By assumption, the Jacobian matrix Dθ=θ0 x(θ) is full rank at θ0 and thus Dθ=θ0 x(θ)u = 0 iff u = 0. Since u ≠ 0, ‖Dθ=θ0 x(θ)u‖²₂ > 0, which proves that the matrix [Dθ=θ0 x(θ)]ᵀDθ=θ0 x(θ) is positive definite and completes the proof. Thus, if the Jacobian of the traffic flow functions Dθ x(θ) has full rank, by Proposition 9, the matrix in the argument of the probability limit defined in Assumption 10 is positive definite and hence non-singular. Assumption 11 (Asymptotic normality). N⁻¹/²[Dθ=θ0 x(θ)]ᵀu converges in distribution to N(0, B0), where B0 = plim N⁻¹[Dθ=θ0 x(θ)]ᵀΩ0 Dθ=θ0 x(θ) and Ω0 = σ²I by Assumption 8 of homoscedasticity. Remark 11. When the errors u are Gaussian and independently distributed, the linear combination of errors given by [Dθ x(θ)]ᵀu is multivariate Gaussian. As a result, Assumption 11 holds even in small samples.
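Proposition 9 can be illustrated numerically with a hypothetical Jacobian (random values standing in for the derivatives of the traffic flow functions):

```python
import numpy as np

# If the Jacobian J in R^{n x d} (n > d) has full column rank, then J'J is
# positive definite, so the NLLS Hessian approximation is non-singular.
rng = np.random.default_rng(0)
J = rng.normal(size=(10, 3))             # a generic matrix is full rank a.s.

full_rank = np.linalg.matrix_rank(J) == 3
positive_definite = bool(np.all(np.linalg.eigvalsh(J.T @ J) > 0))

# Conversely, a rank-deficient Jacobian (e.g. two perfectly collinear
# attribute columns) makes J'J singular and the coefficients unidentifiable.
J_deficient = np.column_stack([J[:, 0], J[:, 0], J[:, 2]])
min_eig = float(np.min(np.linalg.eigvalsh(J_deficient.T @ J_deficient)))
singular = min_eig < 1e-10
```

The rank-deficient case corresponds to the situation flagged in Remark 12 below, where a restriction renders some coefficients unidentifiable.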
Alternatively, if each term [Dθ xi(θ)]ᵀui follows an arbitrary distribution, the central limit theorem (CLT) must be invoked to ensure the asymptotic normality of the terms in the sequence {[Dθ xi(θ)]ᵀui}, i = 1, …, n. Since Dθ xi(θ) is assumed exogenous for statistical inference, the distribution of each term i of the previous sequence is determined by the distribution of ui. Suppose θ0 ∈ R^|D| satisfies the first order necessary optimality condition defined in Eq. 29. Then, under the assumptions stated in Section 6.2, we can establish the consistency of the NLLS estimator θ̂ at θ0 (see Proposition 5.6, Cameron and Trivedi (2005)) and that √N(θ̂ − θ0) converges in distribution to N(0, A0⁻¹B0A0⁻¹) (Eq. 30), where A0 and B0 are the probability limits defined in Assumptions 10 and 11. By the non-singularity of A0 (Assumption 10, Section 6.2) and rearranging terms in Eq. 30, the asymptotic distribution of θ̂ becomes θ̂ ∼ N(θ0, N⁻¹A0⁻¹B0A0⁻¹), where N⁻¹A0⁻¹B0A0⁻¹ is the asymptotic variance matrix of the NLLS estimator. Note that the distribution of the NLLS estimator resembles OLS, except that the design matrix X is replaced by X̃ = Dθ=θ0 x(θ) (Appendix A.6). Also, similar to the OLS case, the statistical behavior of the NLLS estimator and the variance of the error terms in large samples can be expressed in terms of u and approximated as in Gallant (1975a). Note that the variance σ² ∈ R≥0 of the error terms defined in Eq. 31 is unknown in real applications but can be consistently estimated as σ̂² = Σi (xi − xi(θ̂))²/(n − |K|), where the numerator represents the residual sum of squares (RSS) of the model and |K| ∈ Z+ in the denominator represents the dimension of the NLLS estimator. Based on the asymptotic normality of the NLLS estimator θ̂ proved in the previous section, we can derive confidence intervals and hypothesis tests for θ̂.
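The variance estimates above can be sketched as follows, with hypothetical residuals and a hypothetical Jacobian standing in for the fitted model:

```python
import numpy as np

def nlls_covariance(x_obs, x_fit, J):
    """sigma^2 estimated by RSS / (n - |K|); the estimator covariance is
    sigma2 * (J'J)^{-1}, i.e. the OLS formula with the design matrix
    replaced by the Jacobian of the response functions."""
    n, k = J.shape
    resid = x_obs - x_fit
    sigma2 = (resid @ resid) / (n - k)
    cov = sigma2 * np.linalg.inv(J.T @ J)
    return sigma2, cov

# Hypothetical fitted counts, noisy observations and Jacobian.
rng = np.random.default_rng(1)
J = rng.normal(size=(50, 2))
x_fit = rng.uniform(50.0, 150.0, size=50)
x_obs = x_fit + rng.normal(scale=5.0, size=50)

sigma2_hat, cov_theta = nlls_covariance(x_obs, x_fit, J)
std_errors = np.sqrt(np.diag(cov_theta))
```

The square roots of the diagonal of the covariance matrix are the standard errors used in the confidence intervals and test statistics derived next.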
The (1 − α)% confidence interval of θ̂ can be found by using the percentile 1 − α/2 of a t-variate with n − |K| degrees of freedom, t_{n−|K|,1−α/2}, and the variance of the NLLS estimator, i.e., θ̂d ± t_{n−|K|,1−α/2} · se(θ̂d) (Gallant, 1975a). The hypothesis H0: θd = θ_{H0} can be tested at the α significance level by computing the statistic T̂_{d,H0} = (θ̂d − θ_{H0})/se(θ̂d), and H0 is rejected when |T̂_{d,H0}| > |t_{1−α/2}|. The F-test is another useful statistic for model comparison in non-linear regression that makes it possible to test hypotheses on multiple model parameters (Gallant, 1975b). Let θ1, θ2 ∈ R^|D| be the vectors of coefficients of the restricted (or nested) and unrestricted models, respectively. Then, the F-statistic F̂_{1,2} is F̂_{1,2} = [(RSS1 − RSS2)/(‖θ2‖0 − ‖θ1‖0)] / [RSS2/(N − ‖θ2‖0)], where ‖θ1‖0 and ‖θ2‖0 are the numbers of non-zero entries in the vectors of parameters of the restricted and unrestricted models. The null hypothesis is that the models are statistically equivalent, and it is rejected when F̂_{1,2} > F_{‖θ2‖0 − ‖θ1‖0, N − ‖θ2‖0, α}, where F_{‖θ2‖0 − ‖θ1‖0, N − ‖θ2‖0, α} is the value of a central F-distribution with ‖θ2‖0 − ‖θ1‖0 numerator degrees of freedom and N − ‖θ2‖0 denominator degrees of freedom at the α-th upper percentile. Remark 12. An important consideration for hypothesis testing concerns the identifiability of the parameters under the null hypothesis (Gallant, 1975a). Assume the travelers' utility function is defined as U = θt t^α, where θt, α ∈ R are the free parameters and t is the travel time on a given path. If one tests the hypothesis θt = 0, α becomes unidentifiable. The latter violates Assumption 10, Section 3.1 on the rank condition, which is required to prove the consistency and asymptotic normality of the NLLS estimator. Conversely, under a homogeneous linear-in-parameters utility function (Assumption 3, Section 3.1), the identification of the parameters of a restricted model can be guaranteed under mild conditions (Section 3.3.2).
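The t-test and F-test above translate directly into code; the estimates, standard errors and RSS values below are hypothetical:

```python
from scipy import stats

def t_test(theta_hat_d, theta_h0, se_d, n, k, alpha=0.05):
    """Two-sided t-test of H0: theta_d = theta_h0 with n - |K| dof."""
    t_stat = (theta_hat_d - theta_h0) / se_d
    critical = stats.t.ppf(1.0 - alpha / 2.0, df=n - k)
    return t_stat, abs(t_stat) > critical

def f_test(rss_restricted, rss_full, k_restricted, k_full, n, alpha=0.05):
    """F-test of a restricted (nested) model against the unrestricted one."""
    f_stat = ((rss_restricted - rss_full) / (k_full - k_restricted)) \
             / (rss_full / (n - k_full))
    critical = stats.f.ppf(1.0 - alpha, dfn=k_full - k_restricted,
                           dfd=n - k_full)
    return f_stat, f_stat > critical

t_stat, reject_t = t_test(theta_hat_d=-1.1, theta_h0=0.0, se_d=0.2, n=100, k=3)
f_stat, reject_f = f_test(rss_restricted=950.0, rss_full=600.0,
                          k_restricted=1, k_full=3, n=100)
```

Both rejections are in line with the attribute selection procedure: a coefficient is retained when its t-test rejects H0: θd = 0, and a nested model is discarded when the F-test rejects equivalence with the full model.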
To study the performance of the proposed algorithm and some of the mathematical properties of the optimization problem reviewed in Section 4, we conduct experiments with synthetic data generated from four small networks. Subsequently, we study the statistical properties of the non-linear least squares estimators. The DGP used to conduct the experiments assumes the modeler has perfect information about the link performance functions, the network topology, the path sets, and the ground truth coefficients and attributes of the travelers' utility function. For each network and replicate of the experiment, we generate a set of synthetic traffic counts under an ideal scenario where link flows perfectly match SUE-logit. To introduce randomness, we add i.i.d. errors to each set of traffic count measurements. This i.i.d. sampling strategy (see Remark 8) ensures that Assumptions 7 and 8, Section 6.2 are satisfied and thus that statistical inference for the NLLS estimator remains consistent. For the sake of convenience, we choose Gaussian errors, but other choices of probability distribution are also reasonable, including the Poisson or Negative Binomial distributions. The Gaussian errors are generated with a standard deviation equal to 10% of the average value of the traffic counts. As a reference, the standard deviation of the error term used in the Monte Carlo experiments conducted by Gallant (1975a) was equal to 3% of the average value of the dependent variable reported in his synthetic dataset. All networks except for the toy network shown in Figure 2a have been analyzed in prior literature (Wang et al., 2016; Yang et al., 2001; Lo and Chan, 2003). For the toy network, the number of trips between origin-destination pairs is assumed equal to q1,4 = 50, q2,4 = 100, q3,4 = 150, and qi,j = 0 for all other (i, j). The configuration of the remaining networks follows the prior literature, e.g., Lo and Chan (2003). The utility function depends on travel time only, with a ground truth coefficient θt = −1 < 0.
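The noise-injection step of this DGP can be sketched as follows (the equilibrium link counts below are hypothetical placeholders for counts generated at SUE-logit):

```python
import numpy as np

# Given a vector of equilibrium link counts, add i.i.d. Gaussian errors
# with standard deviation equal to 10% of the mean count, as in the DGP.
rng = np.random.default_rng(2022)
x_equilibrium = np.array([120.0, 85.0, 150.0, 60.0, 95.0])

sigma = 0.10 * x_equilibrium.mean()                 # 10% of the average count
x_noisy = x_equilibrium + rng.normal(0.0, sigma, size=x_equilibrium.size)
```

Because the errors are i.i.d. with constant variance, Assumptions 7 and 8 hold by construction, which is what keeps the NLLS inference consistent in these experiments.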
Note that prior research defines θ = −θt > 0 as a positive dispersion coefficient that weights the link/path costs. Figure 3 shows the output of the traffic flow functions for four arbitrary links in each network and for θt ∈ [−15, 15]. The first two values of the tuples shown in the legend of each subfigure correspond to the origin and destination nodes of a link. The last element of the tuple is an index that distinguishes parallel links connecting the same O-D pair, e.g., a3, a4 in Figure 3(a). Note that most traffic flow functions are monotonic and some exhibit convex and concave regions that resemble sigmoidal functions. In the toy network, the traffic flow functions of links (2, 3) and (1, 3) are constant because all individuals traveling from node 1 or 2 are forced to traverse those links; their link flows are thus invariant to the value of the travel time coefficient and the corresponding traffic count measurements do not contribute to the identifiability of additional coefficients of the travelers' utility function. Figure 4 illustrates the coordinate-wise pseudo-convexity of the LUE problem. The dashed vertical lines represent the true value of the travel time coefficient θt. Note that since noise was introduced in the traffic count measurements, the error is not necessarily minimized at θt. In the four networks, the objective function is monotonically decreasing or increasing for values lower or higher than θ*, respectively, meaning that the negative of the first derivative points toward the region where the global optimum is located (Figure 4b). As expected, the changes in curvature of the objective function within the feasible domain suggest that it is not convex but (coordinate-wise) pseudo-convex. The proof of coordinate-wise pseudo-convexity of the LUE problem presented in Section 4 requires both coordinate-wise monotonicity of the traffic flow functions and a utility function with exogenous attributes only.
Our empirical results suggest that the exogeneity of travel time and the monotonicity of the traffic flow functions are sufficient but not necessary conditions for the pseudo-convexity of the objective function. To study the advantages of using NGD instead of a second order method in the non-refined stage of the optimization, we interchange the use of the methods between stages. Figure 5a presents the convergence results obtained by applying LM and NGD in the non-refined and refined stages, respectively. Figure 5b shows the converse case. To ensure that the global optimum at θt = −1 is attainable, no noise is introduced into the traffic count measurements. The learning rate for NGD was set to η = 2 which, for our unidimensional optimization problem on θt, equates to solution updates of magnitude 2. The starting point for the optimization of the travel time coefficient is set to θt = −14, the most challenging convergence scenario analyzed by Yang et al. (2001). We observe that the use of NGD in the non-refined stage makes the convergence faster and more stable in the four networks. In contrast, the use of LM in the non-refined stage results in unstable updates of θt. This behavior of second order optimization methods is expected due to the constant change of curvature sign and the existence of flat regions within the range of pseudo-convex functions. In fact, for all networks the objective function is concave and flat at the initial value θt = −14 and convex and sharp around the global optimum at θt = −1 (Figure 4). Table 1 shows the parameter estimates and t-tests obtained for the four small networks under four scenarios: (i) only NGD, (ii) only LM, (iii) NGD in the non-refined stage and LM in the refined stage, (iv) LM in the non-refined stage and NGD in the refined stage. Overall, this evidence confirms the advantages of integrating first and second order optimization methods to perform statistical inference.
The integration of NGD and LM in scenarios (iii) and (iv) generates solutions closer to the global optimum than the standalone application of each optimization method. The hypothesis tests correctly reject the null at the 1% significance level but only when NGD and LM are used in the non-refined and refined stages of the optimization, respectively. Note: Significance levels: * p<0.1; ** p<0.05; *** p<0.01. To study the impact on convergence and inference of assuming an inaccurate reference O-D matrix, the experiment and testing network used by Yang et al. (2001) were used as a baseline. The ground truth O-D matrix was defined as q_true = (q1,6, q1,8, q1,9, q2,6, q2,8, q2,9, q4,6, q4,8, q4,9) = (120, 150, 100, 130, 200, 90, 80, 180, 110), whereas the reference O-D matrix was q_distorted = (100, 130, 120, 120, 170, 140, 110, 170, 105). In line with Yang et al. (2001), it is assumed that traffic counts are only observed in the following links of the network: (3, 6), (5, 6), (5, 8), (5, 9), (7, 8). Figure 6 shows the convergence toward the optimal solution when the reference O-D matrix was assumed equal to (i) the true O-D matrix q_true (blue curve) or (ii) the distorted O-D matrix q_distorted (red curve). Note that in the two scenarios, the algorithm converges close to the global optimum. The Sioux Falls network is a standard testbed used by transportation researchers in network modeling studies (Shen and Wynter, 2012). The network comprises 24 nodes and 76 links (Figure 7a). We generate 1,584 paths corresponding to the three shortest-distance paths between each O-D pair. The O-D matrix is obtained from TNTP (2016) and a heatmap of its values is presented in Figure 7b. The setup to generate the synthetic data is the same as for the small networks except that exogenous attributes for the monetary cost c and the number of intersections s at each link/path are incorporated into the utility function.
The attributes c and s are generated as continuous and discrete random variables, respectively, and the utility function coefficients are set to θt = −1, θc = −6, θs = −3. If the units of the travel time and monetary cost attributes are minutes and US dollars, the implied value of time (VOT) equals 10 USD per hour. By Assumption 3, Section 3.1, the coefficients θt, θc, θs ∈ R are common among travelers and linearly weight the attributes t, c, s in the utility function. The true path set is assumed to be known and thus the experiments do not perform the column generation phase of the inner level optimization algorithm (Step 1, Algorithm 3, Appendix B.2). [Figure 7: (a) network topology; (b) O-D matrix heatmap.] To have a detailed characterization of the coordinate-wise pseudo-convexity of the outer level optimization problem, Figure 8 shows plots of the objective function (top left), its first derivative (top right), the sign of its first derivative (bottom left) and the sign of its second derivative (bottom right) with respect to a coefficient of the utility function. The vectorized expressions to compute the first and second derivatives of the objective function are presented in Appendix B.3.2 and Appendix B.3.3. To avoid excessive overlap of multiple curves in the same figure, we only analyze the curves associated with the travel time and cost coefficients. The true values of the coefficients are represented with vertical dashed lines in each figure, and the curve associated with each coefficient was generated by fixing the value of the other coefficient to its true value. To satisfy the exogeneity assumption required for the coordinate-wise pseudo-convexity of the objective function (Section 4.1), travel times are assumed known and equal to the travel times obtained when generating the synthetic traffic counts at SUE-logit.
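The implied value of time can be checked with a two-line computation:

```python
# Worked check of the implied value of time: with utility
# U = theta_t * t + theta_c * c, where t is in minutes and c in US dollars,
# VOT = theta_t / theta_c dollars per minute.
theta_t, theta_c = -1.0, -6.0
vot_per_minute = theta_t / theta_c    # 1/6 dollar per minute
vot_per_hour = 60.0 * vot_per_minute  # 10 dollars per hour
```

The ratio of coefficients is unit-free in utility terms, so the VOT inherits the units of the attributes it trades off.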
Similar to the results obtained for the small networks (Section 7.2.1), we observe that the objective function is locally convex and minimized at a point close to θt = −1, θc = −6. In addition, the sign of the first derivatives always points toward the global optimum, which illustrates this key property of pseudo-convex functions. In contrast to the results obtained in the small networks, the sign of the second derivative changes more frequently. The latter may be associated with the higher complexity of larger networks, which tends to increase the non-linearity of the traffic flow functions, and hence of the objective function, with respect to the utility function coefficients. The coordinate-wise pseudo-convexity of the outer level objective function provides theoretical guarantees for the convergence of NGD toward the global optimum, but only when the utility function depends on a single exogenous attribute (Section 5.4.1). Interestingly, our experimental results suggest that the ground truth coefficients of a multi-attribute utility function can also be recovered in the endogenous case. The top and bottom plots in Figure 9 show, respectively, the ratio between the travel time and cost coefficients, i.e., the value of time, and the value of the objective function over iterations. The non-refined and refined stages of the optimization performed 10 iterations of NGD and LM, respectively. The starting points for the optimization of the utility function coefficients were set to zero. In the exogenous and endogenous cases, we observe that the value of time is perfectly recovered and that the value of the objective function decreases over iterations. We conduct 100 replicates of two Monte Carlo experiments to study the impact of the endogeneity of travel times on the statistical inference of the utility function coefficients. The significance level for all hypothesis tests is set at α = 0.1.
Monte Carlo experiments are often used by the discrete choice modeling community to empirically study the properties of estimators in travel behavior models (Train, 2003). We also study the impact on statistical inference of three different setups of the optimization algorithm: (i) only NGD is used, in the non-refined stage; (ii) only LM is used, in the non-refined stage; (iii) NGD and LM are used in the non-refined and refined stages, respectively. In setups (i) and (ii), no iterations of the optimization methods are performed in the refined stage. Each replicate of the experiment draws a new sample of errors from a Gaussian distribution (Section 7.1) and hence a new set of traffic counts. For each replicate, we compute the bias of the utility function coefficients and of the value of time, the normalized root mean squared error (NRMSE) and the number of false negatives. The NRMSE is defined as the ratio between the root mean squared error (RMSE) and the mean of the observed traffic counts. Because the standard deviation σ of the random error is defined as a proportion of the mean of the true traffic counts (Section 7.1), the NRMSE is expected to be close to σ = 0.1. Since the ground truth coefficients of the utility function are all set to non-zero values, a false negative corresponds to a non-rejection of the null hypothesis H_0: θ = 0. The initial values for optimization of the utility function coefficients are drawn from a uniform distribution centered at the true values of the coefficients with a half-width of two. Thus, the initial value of the travel time coefficient θ_t is randomly chosen within the interval [−3, 1]. The experiment results show that the estimates were roughly unbiased in all optimization setups and under both exogenous and endogenous travel times. This result is interesting because the NLLS assumptions guarantee only consistency, not unbiasedness, of the NLLS estimators in large samples.
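The NRMSE defined above can be sketched as follows (illustrative code; variable names are ours). With additive Gaussian noise whose standard deviation is 10% of the mean true count, the statistic should indeed settle near 0.1:

```python
import numpy as np

def nrmse(observed, predicted):
    """RMSE between observed and predicted counts, normalized by the mean
    of the observed counts (the definition used in the text)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    return rmse / observed.mean()

rng = np.random.default_rng(0)
true_counts = rng.uniform(500.0, 3500.0, size=100_000)
sigma = 0.1 * true_counts.mean()                  # noise std = 10% of mean count
observed = true_counts + rng.normal(0.0, sigma, true_counts.size)
print(round(nrmse(observed, true_counts), 3))     # close to 0.1
```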
A potential explanation relates to the pseudo-convexity of the outer level objective and to the existence of a unique global optimum in the neighborhood around the initial values for optimization. Furthermore, our evidence strongly supports the use of NGD+LM over the standalone application of first or second order optimization methods. The higher standard error of the estimates obtained with LM reveals the numerical stability problems of second order methods when the initial values for optimization lie in regions where the objective function is not convex. The latter is also associated with the large proportion of false negatives reported by LM in both experimental scenarios. In contrast, the standalone application of NGD reports fewer than 5% false negatives in both scenarios, which equates to a statistical power of 95%. This is an encouraging result given the reference threshold of 80% typically used for experimental studies. In the endogenous case, we observe an increase in the difference between the NRMSE and its expected value, given by the dashed bar shown in the bottom right plots of Figures 10a and 10b. Similarly, the bias of the estimated coefficients and of the value of time increases consistently in the three optimization setups compared to the exogenous case. The computation of equilibria in the endogenous case may result in inaccurate approximations of the travel times at equilibrium. From an econometric standpoint, this introduces measurement error in the travel time attribute, which can bias the parameter estimates. Another problem in the endogenous case is the significant increase in computational burden due to solving traffic equilibria at each replicate of the experiment. To ensure good convergence over replicates and that the assumptions for consistency of the NLLS estimator are satisfied, the following experiments in this section are performed under the scenario of exogenous travel times only.
The initial values for optimization of the utility function coefficients are again drawn from a uniform distribution centered at the true values of the coefficients with a half-width of two. This choice allows the optimization to start in a region close to the ground truth values of the utility function coefficients, where the objective function is at best convex or at worst pseudo-convex. A tighter width would make it difficult to assess the reduction of the objective function under different optimization setups. The significance level for all hypothesis tests is again set at α = 0.1. The NLLS theory suggests that the NLLS estimator is asymptotically consistent. Therefore, the coefficients associated with irrelevant attributes included in the utility function should converge to zero, and they should not affect the statistical inference on coefficients weighting relevant attributes. However, with finite samples, the inclusion of irrelevant attributes in the utility function can reduce the efficiency of the coefficients of relevant attributes. As a consequence, the t-statistics of other coefficients may decrease, which could artificially increase the number of false negatives, i.e. fewer rejections of the null hypothesis for coefficients different than zero. Figure 11 shows the impact of the inclusion of irrelevant attributes on the bias of the estimated coefficients and of the value of time, and on hypothesis testing. In contrast to the Monte Carlo experiments conducted in Section 7.3.3, the utility function used to generate synthetic traffic counts includes six irrelevant attributes. Each irrelevant attribute is generated as a standard Gaussian random variable and is weighted by a coefficient equal to zero in the utility function. Hence, the hypothesis tests are expected not to reject the null for these coefficients.
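The drawing of starting points can be sketched as follows (our own illustrative code): each coefficient starts uniformly within ±2 of its ground-truth value, e.g. θ_t somewhere in [−3, 1].

```python
import numpy as np

rng = np.random.default_rng(42)

# Ground-truth coefficients from the synthetic experiments.
theta_true = np.array([-1.0, -6.0, -3.0])  # travel time, cost, attribute s

# Starting points drawn uniformly within +/- 2 of the truth.
theta_init = rng.uniform(theta_true - 2.0, theta_true + 2.0)

# Every draw lies inside the interval [truth - 2, truth + 2].
assert np.all((theta_init >= theta_true - 2.0) & (theta_init <= theta_true + 2.0))
```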
To quantify the power of the hypothesis tests to detect irrelevant attributes, we compute the proportion of false positives, namely, the proportion of times that the null hypotheses for coefficients weighting the set of irrelevant attributes are rejected. By classic statistical theory, p-values are uniformly distributed under the null hypothesis; hence, a significance level of α = 0.1 should result in 10% false positives. To analyze the impact of the inclusion of irrelevant attributes on the statistical inference of relevant attributes, we also report the proportion of false negatives. Similar to the results obtained in Section 7.3.3, the integration of NGD and LM significantly reduces the number of false negatives and the bias of the estimated coefficients and of the value of time. The inclusion of irrelevant attributes in the utility function increases the number of false negatives, but only in the case of LM. Notably, the integration of NGD and LM perfectly matches the theoretical 10% level of false positives expected at a significance level of α = 0.1. Overall, these results reaffirm that statistical inference is more reliable with the integration of first order and second order optimization methods. Given this evidence, the following experiments are conducted using NGD+LM only.

A lower sensor coverage reduces the sample size, which is expected to increase the standard error in the estimates of the utility function coefficients. As a consequence, t-statistics may decrease and the number of false negatives may increase. Figure 12 shows the impact on hypothesis testing of varying the link coverage at three levels: 25%, 50% and 75%. As expected, the number of false negatives and the standard error of the coefficients decrease with the level of sensor coverage. Notably, at a coverage of only 50% (N = 38), the statistical power already surpasses the desired threshold of 80%.
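The uniformity of p-values under the null can be checked with a quick simulation (illustrative only): rejecting whenever p < α should flag roughly a fraction α of truly-zero coefficients.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 0.1

# Under H0 the p-value of each test is uniform on [0, 1].
p_values = rng.uniform(0.0, 1.0, size=200_000)

false_positive_rate = np.mean(p_values < alpha)
print(round(false_positive_rate, 2))  # close to 0.10
```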
For all coverage levels, the proportion of false positives does not surpass the theoretical level of 10%, and the NRMSE closely matches its expected value of 10%. The standard error of the NLLS estimators is expected to increase when the variance of the random perturbation is higher. Thus, hypothesis testing should tend to reject the null less frequently for coefficients associated with relevant attributes, and hence the number of false negatives could artificially increase. Figure 13 shows the impact of increasing the standard deviation of the Gaussian perturbation on statistical inference. The standard deviation is defined as a percentage of the average value of the synthetic traffic counts (see Section 7.1). As expected, the number of false negatives and the standard error of the estimated coefficients increase with the level of variability in the random perturbation. The values of the NRMSE closely match the error levels, and the proportion of false positives does not surpass the theoretical level of 10%. In real applications, and in line with the sensor coverage experiment (Figure 12), a larger sample size should compensate for an increase in the variance of the random perturbation.

Our methodology assumes that the O-D matrix is exogenous and deterministic (Assumption 2, Section 3.1). While it is feasible to have an accurate reference O-D matrix available, its cells may be subject to random perturbations. Any gap between the true and reference O-D matrices may add measurement error to the model, which could bias or distort the t-statistics of the coefficients. Figure 14 shows the impact of introducing different levels of noise in the O-D matrix. The reference O-D matrix is assumed to be a Gaussian random variable with expected value μ_Q ∈ R^{V×V}_{≥0} equal to the true O-D matrix Q and with standard deviation σ_Q ∈ R_+ defined as a percentage of the average value of Q.
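The construction of a noisy reference matrix can be sketched as follows (our own code; the clipping at zero is an assumption we add to keep demands non-negative):

```python
import numpy as np

def noisy_reference_od(Q_true, noise_level, rng):
    """Perturb the true O-D matrix with Gaussian noise whose standard
    deviation is noise_level times the mean of Q_true; negative demands
    are clipped at zero (our assumption)."""
    sigma = noise_level * Q_true.mean()
    return np.clip(Q_true + rng.normal(0.0, sigma, Q_true.shape), 0.0, None)

rng = np.random.default_rng(1)
Q_true = np.array([[0.0, 120.0, 40.0],
                   [90.0, 0.0, 60.0],
                   [30.0, 70.0, 0.0]])
Q_ref = noisy_reference_od(Q_true, noise_level=0.1, rng=rng)
```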
To compare the impact of adding noise to either the traffic counts (Section 7.3.6) or the O-D matrix, we set the same levels of the standard deviation of the random perturbation in both experiments. Surprisingly, our results show that noise in the O-D matrix does not significantly impact statistical inference. In particular, the standard error of the estimated coefficients and the NRMSE remain invariant across all levels of random noise.

An ill-scaled O-D matrix is expected to increase the gap between predicted and observed traffic counts, which may induce an overestimation of the variance of the random error. As a consequence, the t-statistics could decrease and the false negatives could increase. Figure 15 shows the impact of an ill-scaled O-D matrix. The level represents the true value of the scale, i.e. the factor by which the reference O-D matrix should be multiplied to recover the true O-D matrix. Therefore, values lower or higher than 1 represent cases of overestimation or underestimation of the true O-D matrix, respectively. Compared to the previous experiments that assume a well-scaled O-D matrix, we observe that the NRMSE is far from its expected value, which suggests poor convergence toward the ground truth values of the utility function coefficients. The estimated coefficients are biased upward and downward when the true O-D matrix is overestimated or underestimated, respectively. The latter is directly associated with the underestimation and overestimation of the false positives with respect to the theoretical level of 10%. The false negatives surpass 20% when the true O-D matrix is underestimated, which significantly reduces the statistical power to detect the effect of relevant attributes in the utility function.

The proposed methodology was also applied to a large-scale transportation network located in the City of Fresno, California.
This network primarily covers major roads and highways around the SR-41 corridor and comprises 1,789 nodes and 2,413 links (Ma and Qian, 2017, 2018a). The following sections describe the use of various data sources to estimate our model in the Fresno network. Subsequently, we describe the attributes of the utility function, the model specifications, the estimation procedure and the results.

The Caltrans Performance Measurement System (PeMS, 2021) provides open source data with georeferenced stations that record information on traffic counts, speeds and travel times on major highways in California. Using geoprocessing tools, a total of 141 links of the Fresno network were matched to PeMS stations, which equates to 5.8% sensor coverage. We only use traffic count data collected during the first Tuesday of October 2019 and October 2020 between 4pm and 5pm, corresponding to periods before and during COVID-19, respectively. The average traffic counts for the selected periods in 2019 and 2020 are 2,213.6 and 2,113.3 vehicles per hour, respectively.

A dynamic O-D demand matrix calibrated with support of the PARAMICS estimator tool for the SR-41 corridor is adopted in our study (Ma and Qian, 2017; Zhang et al., 2008; Liu et al., 2006). This O-D demand is provided at a 15-minute resolution for the period between 4pm and 6pm on a typical weekday. For consistency between the temporal resolutions of the O-D demand and the traffic count data, both data sources are aggregated into a one-hour window starting at 4pm. The aggregation in vehicles per hour is also consistent with the temporal resolution of the link performance functions.
Below are the sources of system-level data used to compute the attributes of the travelers' utility function for every link of the Fresno network:

• Speed and travel times: INRIX travel time and traffic speed data have been successfully used in previous transportation studies and are considered a reliable data source for our analyses (Yao and Qian, 2020). Our raw data includes a shapefile with a collection of georeferenced line segments representing the major roads in Fresno, CA. It also includes csv files with traffic speeds for every line segment at a 5-minute resolution. Attributes in the dataset include a unique identifier for each line segment, a time stamp, observed speed, average speed and reference speed. Using the length of each line segment, the speed attributes were transformed into travel time attributes in minutes. The line segments of the INRIX shapefile and the links of the Fresno network are matched according to the orientation and proximity of the streets. For the 15.5% of links that were not successfully matched, we imputed the mean value of the attributes over the matched links.

• Traffic incidents: The Statewide Integrated Traffic Records System (SWITRS) is one of the two official sources of traffic incident data in California (Waetjen and Shilling, 2021). SWITRS is managed by the California Highway Patrol (CHP).

• Socio-demographics from Census data: A shapefile with the most recent selected sociodemographic data at the block level was retrieved from the US Census Bureau (2019). We only use the layers with data on the proportions of population by gender, age and income level in each block. There are other layers with features relevant to explain route choice preferences, such as one related to commuting patterns, but this data is either not available for the city of Fresno or only available at the tract and county levels.
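The mean-imputation rule for unmatched links can be sketched as follows (illustrative; array and function names are ours):

```python
import numpy as np

def impute_unmatched(attribute, matched):
    """Replace the attribute value of unmatched links with the mean of the
    attribute over the links that were matched to INRIX segments."""
    attribute = np.asarray(attribute, dtype=float)
    matched = np.asarray(matched, dtype=bool)
    out = attribute.copy()
    out[~matched] = attribute[matched].mean()
    return out

# Four links; the third has no INRIX match, so it receives the mean (60.0).
speeds = np.array([60.0, 50.0, np.nan, 70.0])
matched = np.array([True, True, False, True])
print(impute_unmatched(speeds, matched))  # [60. 50. 60. 70.]
```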
Travel time is the only endogenous attribute in the utility function and equals the output of the link performance functions. The free flow travel times defined in the link performance functions are obtained from INRIX data. The speed attribute used to compute free flow travel times corresponds to the speed driven on a road when it is wide open, which does not necessarily match the legal speed limit. Using the system-level data described in Section 8.3, we also compute the following set of exogenous attributes for every link of the Fresno network. Std. speed is computed using historical data provided by INRIX for the period between 4pm and 5pm on a weekday, and it is assumed to negatively impact the utility of choosing a path. There are alternative ways to capture the effect of travel time variability on travelers' route choices, such as the standard deviation or coefficient of variation of travel time. The main advantage of the standard deviation of speed is that it provides a measure of time variability that is independent of the length of the link segments. The rationale for aggregating the number of incidents over the current year is to obtain a proxy of drivers' perception of the road safety of a link segment. Median income is used as a proxy of the perceived level of crime in surrounding neighborhoods, and it is hypothesized to be positively associated with the utility of choosing a path. A larger number of bus stops or street intersections in a link segment should encourage drivers to use alternative paths. Thus, all attributes, except for Median income, are expected to have a negative coefficient in the utility function. To test non-linear effects of the attributes on the utility function, we also binarized the exogenous attributes as follows. Reliable speed and Low income take the value 1 if the values of Std. speed and Median income are below the 25th percentile of their distributions, respectively, and 0 otherwise.
Bus stop, Intersection and Incident take the value 1 if Bus stops, Intersections and Incidents are greater than zero, respectively, and 0 otherwise. Under this new specification of the attributes, all coefficients of the utility function, except for the coefficient weighting Reliable speed, are expected to be negative. Note that the network contains 656 links (38.8%) that are centroid connectors. These connectors usually do not have physical counterparts in the real network (Sean Qian and Zhang, 2012) and should not impact travelers' route choices. Thus, we set the values of the attributes on these links to zero, regardless of whether the attributes are continuous or binary. The path size correction factor is set to its default value of one in all model specifications.

Figure 18 shows the correlations among traffic counts and system-level attributes of the Fresno network during the first Tuesdays of October 2019 and 2020. We include free flow speed instead of free flow travel time, which normalizes the attribute by the link length. The Pearson correlations shown in the top row of the plots indicate that free flow speed is negatively correlated with the number of intersections and bus stops in the two selected periods. Also, as expected, the number of incidents is positively associated with free flow speed. An interesting result is the negative correlation between median income and free flow speed. The latter is probably explained by the higher rate of motorization in high-income areas, which should be associated with lower speeds. The standard deviation of speed shows a positive correlation with free flow speed, but only during 2020. The sign of the correlation between traffic flows and free flow speed is unclear. Overall, the correlational analysis shows sensible associations between the attributes of the utility function and suggests that the spatial matching of attributes conducted in Section 8.3 was reasonably accurate.
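The binarization rules described in Section 8.4 can be sketched as follows (illustrative code; the function and variable names are ours):

```python
import numpy as np

def binarize(std_speed, median_income, bus_stops, intersections, incidents):
    """Binary attributes as defined in the text: Reliable speed and Low
    income flag values below the 25th percentile of their distributions;
    Bus stop, Intersection and Incident flag any positive count."""
    reliable_speed = (std_speed < np.percentile(std_speed, 25)).astype(int)
    low_income = (median_income < np.percentile(median_income, 25)).astype(int)
    bus_stop = (bus_stops > 0).astype(int)
    intersection = (intersections > 0).astype(int)
    incident = (incidents > 0).astype(int)
    return reliable_speed, low_income, bus_stop, intersection, incident

reliable, low_inc, bus, inter, inc = binarize(
    std_speed=np.array([1.0, 2.0, 3.0, 4.0]),
    median_income=np.array([30e3, 45e3, 60e3, 90e3]),
    bus_stops=np.array([0, 1, 3, 0]),
    intersections=np.array([2, 0, 1, 5]),
    incidents=np.array([0, 0, 2, 1]),
)
```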
We estimate three model specifications using the data collected in each time period. Following the state of the practice in previous literature, we first estimate a Baseline model including travel time only. We also estimate a Full model including the exogenous attributes on a continuous scale: Std. speed, Median income, Incidents, Intersections, Bus stops. Finally, a Binarized model includes the exogenous attributes on a binary scale: Reliable speed, Low income, Incident, Intersection, Bus stop (Section 8.4). To assess the gain in explanatory and predictive power of the Full and Binarized models relative to the Baseline model, these models also include Travel time. By assumption, the utility function is defined as a linear combination of the attributes weighted by their respective coefficients. Travel time, the only endogenous attribute in all models, is iteratively updated during the bilevel optimization. To prevent numerical issues associated with the scale of the attributes from degrading the quality of the statistical inference, we normalize each exogenous attribute, including the free flow travel time defined in the link performance functions, by its maximum value. Because all these attributes are non-negative, their new ranges are between 0 and 1. Appendix C.1 includes figures with descriptive statistics of the attributes after this normalization, excluding links that are centroid connectors.

The models are estimated on Amazon Web Services on two t2.2xlarge instances equipped with an Intel Xeon CPU at 3.3GHz x 8 and 32 GB RAM (AWS, 2022). The three model specifications are run in parallel using the data collected in either 2019 or 2020. Because SUE-logit is path-based, the computational and memory complexity is a function of the number of paths. The CPU and RAM of the t2.2xlarge instance can handle 33,430 paths.
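The normalization step can be sketched as follows (illustrative):

```python
import numpy as np

def max_normalize(x):
    """Scale a non-negative attribute to the [0, 1] range by dividing by
    its maximum value, as done for each exogenous attribute."""
    x = np.asarray(x, dtype=float)
    return x / x.max()

print(max_normalize([0.0, 5.0, 20.0]))  # [0.   0.25 1.  ]
```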
In contrast, a t2.xlarge instance with an Intel Xeon CPU at 3.3GHz x 4 and 16 GB RAM handles 22,015 paths without memory overflow. The memory bottleneck of the algorithm is the computation of the analytical gradients performed to update the utility function coefficients during the outer level optimization. The computation time of each notebook is about 6 hours. All models are estimated using NGD in the non-refined stage, with a learning rate of η_1 = 0.5, and LM in the refined stage. Each stage performs 10 iterations of the corresponding optimization method. The starting points for optimization in the non-refined stage are set to zero for the three model specifications. The best estimate of the utility function coefficients in the non-refined stage is chosen as the starting point for optimization in the refined stage. The coefficients are projected to zero during the iterations of the bilevel optimization if their sign is not consistent with our expectation (Section 8.4). If the coefficients of some attributes are zero at the end of the non-refined stage, those attributes are excluded from the refined stage. This strategy is used to reduce computational burden and to accelerate the convergence of the optimization algorithm to a locally optimal solution.

We generate an initial set of 13,905 paths using the two shortest paths between the 6,970 O-D pairs that report trips (Figure 17, Section 8.2). The initial path set is the same for the three model specifications. The total number of paths is odd because some O-D pairs are connected by a single path only. The column generation method described in Step 1, Algorithm 3, Appendix B.2 is used to update the path sets during the bilevel optimization. To guide path exploration, new paths are generated only for the 30% of O-D pairs with the highest demand, which covers approximately 85% of the total trips in the Fresno network (Figure 17, Section 8.2).
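The restriction of path exploration to high-demand O-D pairs can be sketched as follows (our own code, not the paper's implementation):

```python
import numpy as np

def high_demand_pairs(demand, share=0.3):
    """Indices of the top `share` fraction of O-D pairs sorted by demand,
    mirroring the strategy of generating new paths only for the 30% of
    pairs with the highest demand."""
    demand = np.asarray(demand, dtype=float)
    n_select = max(1, int(np.ceil(share * demand.size)))
    return np.argsort(demand)[::-1][:n_select]

# Six hypothetical O-D pairs: the top 30% (2 pairs) carry most of the trips.
demand = np.array([5.0, 120.0, 30.0, 400.0, 10.0, 80.0])
selected = high_demand_pairs(demand, share=0.3)
print(selected)                               # [3 1]
print(demand[selected].sum() / demand.sum())  # share of total trips covered
```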
Based on the current estimate of the utility function coefficients and the link attributes, the column generation step generates the 40 shortest paths for 3% of the O-D pairs at each iteration of the bilevel optimization. The O-D pairs selected at each iteration change sequentially according to their level of demand, and the process continues until the target of 30% of O-D pairs is reached at the last iteration. The sequential selection of O-D pairs reduces the computational burden of generating new paths for all O-D pairs at every iteration, without significantly compromising path exploration. Subsequently, the inner level optimization algorithm solves SUE-logit and the path utilities are updated according to the new travel times at equilibrium. Finally, it selects the 10 paths with the highest utility for every O-D pair.

For each model specification and time period, we compute the value of the objective function (i.e. the sum of squared errors), the root mean squared error (RMSE), the normalized RMSE (NRMSE), the F-test (Section 6.4) and the adjusted pseudo R^2. The F-test compares the sum of squared errors of the model against its null version where all coefficients of the utility function are set to zero, which represents a scenario where travelers make equally likely choices among paths. F-tests with p-values lower than α reject the null hypothesis that the two models are statistically equivalent at the (1 − α)% confidence level, and hence provide evidence supporting the selection of the augmented model. Our adjusted pseudo R^2 is analogous to McFadden's pseudo R^2 (McFadden, 1973) and is adjusted by the number of coefficients of the model (Ben-Akiva et al., 1985). It is defined as one minus the ratio between the SSE of a model, minus the number of coefficients, and the SSE of the model with all coefficients set to zero.
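The adjusted pseudo R^2 definition above translates directly into code (a sketch; the SSE values below are hypothetical):

```python
def adjusted_pseudo_r2(sse_model, sse_null, n_coefficients):
    """One minus the ratio between (SSE of the model minus the number of
    coefficients) and the SSE of the null model with all coefficients zero."""
    return 1.0 - (sse_model - n_coefficients) / sse_null

# Hypothetical values: a model with 6 coefficients that reduces the SSE
# from 1000 (null model) to 600.
print(round(adjusted_pseudo_r2(600.0, 1000.0, 6), 3))  # 0.406
```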
Since traffic counts are an aggregate of individual path choices, the adjusted pseudo R^2 of our model is directly comparable with values obtained from discrete choice models. This indicator is an absolute measure of the predictive ability of discrete choice models, and it is useful for generating benchmark values against which researchers can evaluate their models (Parady et al., 2021). It tends to be considerably lower than the R^2 index used in ordinary regression analysis, and values of 0.2 to 0.4 represent an excellent fit (McFadden, 1977).

Table 2 shows the estimation results obtained with traffic count data collected during the first Tuesday of October 2019 (before COVID-19) and October 2020 (during COVID-19) for the three model specifications (Section 8.6). Besides the point estimates and t-statistics of each coefficient of the utility function, the table also includes multiple indicators for model comparison (Section 8.9). Figure 19 shows histograms with the distribution of errors obtained after performing the non-refined and refined stages of the optimization. Figures C.21 and C.22, Appendix C.2, compare the convergence of the full and binarized models against the baseline model. The top plots of the figures show the number of paths that are generated during column generation and added after the path selection step of the inner level optimization. The significant decrease in the number of paths added in the bilevel iterations of the refined stage suggests that the path exploration over the selected set of O-D pairs in the non-refined stage was reasonably exhaustive. Notably, our results show that the adjusted pseudo R^2 obtained in all models is in the order of 0.3-0.56, which is considered an excellent fit in travel behavior studies (McFadden, 1977; Bovy et al., 2008; Bierlaire and Frejinger, 2008; Bwambale et al., 2019).
The F-tests included in the table suggest that all models are statistically different from the null model at the 99% confidence level. The F-tests comparing the Full model with the Baseline model are equal to 3.7804 and 4.3697 when using data collected before and during COVID, respectively. Because these values exceed the critical value F_{6−1, 141−6} = 3.1557 at the 99% confidence level, we conclude that the Full model is not statistically equivalent to the Baseline model. The indicators for model comparison of the Binarized model were lower than those of the Full model in all cases; thus, the latter is preferred over the former. The NRMSE of the Full model is in the order of 0.5, meaning that the standard deviation of the traffic counts is approximately 50% of their sample mean. The experiments conducted in the Sioux Falls network found that an NRMSE of 0.25 results in more than 20% false negatives. Although the sample size in the Fresno network is two times larger than the number of links in the Sioux Falls network, this is unlikely to compensate for the difference in NRMSE.

(Table 2 residue, F-test (p-value) row: 71.954*** (0.000), 11.224** (0.000), 8.047*** (0.000), 99.482*** (0.000), 4.157*** (0.002), 9.166*** (0.000). Note: Significance levels: *p<0.1; **p<0.05; ***p<0.01.)

Other factors that could contribute to the decrease of the t-statistics of the coefficients in the Full model are the high correlation of travel time with some exogenous attributes and numerical issues arising from the estimation of a richer utility function. Overall, this evidence anticipates a low statistical power to detect the effect of attributes that significantly impact travelers' utility in the Fresno network. As expected, the travel time coefficient is significant at the 90% confidence level in all models.
Note that the estimated coefficients of travel time in the baseline models are exact multiples of the learning rate of NGD because there was no improvement of the objective function in the refined stage. In contrast, the estimations of the full and binarized models report improvements during the refined stage. The estimation of the Full model with the data collected before COVID-19 suggests that the total incidents during the year and the monthly household income of the neighborhoods near the link segments have a positive and a negative effect on travelers' utility, respectively. Both effects were significant at the 95% confidence level. The estimation of the Full model with the data collected during COVID-19 also found a negative effect of the number of street intersections, but the coefficients of all attributes except travel time were not significant at the 90% confidence level. As discussed earlier, the high level of noise in the Fresno network can explain the difficulty in detecting significant effects of these attributes. Although it is not required in non-linear regression models, the error distributions of the models resemble Gaussian distributions with zero mean (Figure 19). Finally, regarding the impact of COVID-19, we conclude that there are no significant differences in the attributes that determine travelers' route choices.

The network modeling community has studied the problem of estimating travelers' utility function coefficients using traffic counts and travel time measurements. However, research has been limited to utility functions dependent on travel time only, and methods have been tested on networks of relatively small size. Under the assumption of a known exogenous O-D demand matrix, we enhance existing methods to estimate the coefficients of utility functions with multiple attributes using traffic counts consistent with stochastic user equilibrium with logit assignment (SUE-logit).
We refer to this problem as the Logit Utility Estimation (LUE) problem. To perform attribute selection, we conduct hypothesis tests on the coefficients of a multi-attribute utility function. Furthermore, a rigorous analysis of the non-convexity and mathematical properties of the LUE problem is conducted to inform the design of our solution algorithm and to derive theoretical guarantees about convergence toward local optima.

(Figure 19: Distribution of errors in the non-refined and refined stages of the optimization of the baseline, full and binarized models.)

The realization of the pseudo-convexity of the optimization problem motivates the use of normalized gradient descent (NGD), a first order method developed in the machine learning community that is suitable for pseudo-convex optimization. The integration of NGD with the Levenberg-Marquardt (LM) algorithm outperforms the standalone application of second order optimization methods used in previous literature in many ways. First, the estimates of the utility function coefficients become less sensitive to the starting points for optimization, which is a common issue in non-convex problems. Second, NGD improves the convergence toward the global optimum, which makes the statistical inference more reliable. Third, the use of first order methods reduces computational cost because it requires computing only the gradient of the objective function with respect to the utility function coefficients. To our knowledge, this is the first paper to present vectorized expressions for the gradient computation, which is key to accelerating the optimization and to scaling up our methodology to the largest transportation network studied to date in the LUE literature. The analysis of the mathematical properties of the problem identifies the coordinate-wise monotonicity of the traffic flow functions as a main driver of the coordinate-wise pseudo-convexity of the objective function.
Experiments in networks used in previous studies show that the traffic flow functions are generally monotonic and that the objective function of the LUE problem is pseudo-convex with respect to the coefficient of a utility function dependent on travel time only. Results in the Sioux Falls network support the coordinate-wise pseudo-convexity of the objective function with respect to the coefficients of a multi-attribute utility function. A series of Monte Carlo experiments shows that statistical inference on the utility function coefficients is robust to traffic congestion, to different levels of sensor coverage, and to noise in the O-D matrix and in the traffic counts. The numbers of false negatives and false positives obtained in these experiments are generally well-aligned with statistical theory. For instance, the number of false positives matches the value expected for an arbitrary significance level. A higher amount of noise in the traffic counts increases false negatives and hence reduces the statistical power to identify the effects of relevant attributes in the utility function. Surprisingly, the statistical inference is resilient to high levels of noise in the cells of the reference O-D matrix. Our solution algorithm is deployed on a large-scale network in Fresno, CA, and it gives reasonable results in terms of both the estimates and the hypothesis tests of the utility function coefficients. Based on standard metrics of model comparison used in travel behavior research, the models report an excellent goodness of fit. As expected, travel time is identified as a main determinant of travelers' route choices in all model specifications, which partially supports the standard practice in the network modeling community of assuming utility functions dependent on travel time only. However, the incorporation of additional system-level attributes in the utility function significantly increases the goodness of fit relative to the baseline model that uses travel time only.
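The claim that the false positive rate matches the expected value for a given significance level can be illustrated with a small simulation. This is a generic t-test sketch under the null hypothesis (a candidate attribute whose true coefficient is zero), not the paper's experiment; the data generating process is a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05            # significance level
n, trials = 200, 5000   # observations per replication, Monte Carlo draws

false_positives = 0
for _ in range(trials):
    # Under H0, the tested attribute does not enter the data generating
    # process, so any "significant" coefficient is a false positive.
    x_feat = rng.normal(size=n)     # candidate attribute
    y = rng.normal(size=n)          # residual counts, independent of x_feat
    beta = (x_feat @ y) / (x_feat @ x_feat)          # OLS slope
    resid = y - beta * x_feat
    se = np.sqrt((resid @ resid) / (n - 1) / (x_feat @ x_feat))
    t_stat = beta / se
    false_positives += abs(t_stat) > 1.96            # normal approx., alpha=0.05
rate = false_positives / trials
```

With 5000 replications the empirical rejection rate concentrates tightly around alpha = 0.05, mirroring the behavior reported for the LUE hypothesis tests.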
These conclusions are robust when using data collected before and during COVID-19. Among the exogenous attributes, the total number of incidents during the year and the median income are the most relevant predictors of travelers' route choices. The coefficients of these attributes are significant at the 95% confidence level, but only when using data collected before COVID-19. The implementation of our methodology in a large-scale network is challenging mainly due to the high computational cost and the level of noise of real-world data. On one hand, accounting for the interdependence between travelers' choices in transportation networks and for the endogeneity of travel times requires computing traffic equilibria. On the other hand, estimating the coefficients of the travelers' utility function requires solving a regression problem. The computational complexity of the inner- and outer-level problems is a function of the number of paths; thus, the alternating optimization of the two problems becomes intractable with a large number of O-D pairs. We learned that the use of column generation methods for updating path sets in SUE-logit provides a good compromise between computational cost and prediction error. The strategy of selecting the O-D pairs according to their demand level is also key to controlling the exploration of new paths over iterations of the bilevel optimization and to identifying the paths that most decrease the prediction error. To our knowledge, this strategy is not integrated in previous column generation algorithms, and we strongly encourage its use in further applications of our methodology. To control for the effect of path correlation/overlapping, we correct path utilities according to the Path Size Logit model. Further research could enhance our methodology to incorporate multi-day data and to perform a joint estimation of the utility function coefficients and the O-D matrix.
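The demand-based O-D pair selection strategy described above amounts to picking the top share of pairs by demand before regenerating candidate paths. A minimal sketch, with `rho_w` playing the role of the selection proportion in the paper's notation:

```python
import numpy as np

def select_od_pairs(q, rho_w):
    """Return the indices of the O-D pairs with the highest demand,
    keeping a proportion rho_w of all pairs (at least one)."""
    n_keep = max(1, int(np.floor(len(q) * rho_w)))
    return np.argsort(q)[::-1][:n_keep]

# Column generation would regenerate candidate paths only for these pairs.
q = np.array([120.0, 5.0, 60.0, 300.0, 45.0])   # demand per O-D pair
selected = select_od_pairs(q, rho_w=0.4)
```

Restricting path generation to high-demand pairs bounds the growth of the path set per bilevel iteration while prioritizing the pairs whose flows contribute most to the prediction error.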
The increase in sample size should help to identify all parameters of interest and to improve the quality of the statistical inference. The use of cloud computing is important to handle a larger number of paths and thus to achieve a larger reduction of the prediction error. However, the cost of estimating a model with data from multiple time periods can become high. We expect that the use of deep learning models, computational graphs and automatic differentiation tools will help to handle large amounts of multi-day data while keeping computational cost reasonable. Regarding statistical inference, our study lacks a more careful treatment of the endogeneity that arises from the computation of travel times over the iterations of the bilevel optimization. The use of Two-Stage Least Squares (2SLS) and instrumental variables may help correct the bias and inconsistency in the coefficient estimates caused by endogeneity (Gallant and Jorgenson, 1979). We are also interested in relaxing the assumption of a homogeneous and linear-in-parameters utility function. The use of a non-homogeneous utility function and fixed effects may help capture heterogeneity of preferences among individuals traveling between different O-D pairs. Besides, the effect of travel time variability may be better captured with a non-linear specification of the utility function, such as the one used in prospect theory route choice models. We will also look at estimating the coefficient weighting the utility term associated with the path size correction. We would also like to leverage GPS data to obtain a better prior on the path sets among O-D pairs and to improve the estimation of the utility function coefficients. Here, the integration of our methodology with the nested recursive logit model (Mai et al., 2015) seems a promising avenue for further research. This study chooses traffic flows as the response function of the non-linear least squares objective function.
However, other choices of response function are also admissible, such as link travel times or traffic densities. In static traffic assignment, both quantities are monotonic functions of the traffic counts, and thus they are suitable quantities for estimating the utility function coefficients. Finally, system-level data is less subject to sampling bias than data collected from individual surveys, but it is more subject to measurement errors. Thus, we expect that the joint use of travel surveys and system-level data can help balance the strengths and weaknesses of each data source. The Python package developed to implement our methodology and the system-level data from the Fresno, CA network can be found at the following url: https://github.com/pabloguarda/isuelogit. The folder notebooks contains Jupyter notebooks that reproduce all the results presented in this paper.

K_Z: The set of exogenous attributes in the utility function
K: The set of attributes in the utility function
D: The set of utility function coefficients
L: The set of travelers in the network
J_l: Consideration set of traveler l ∈ L
t ∈ R^|A|: The vector of values of the endogenous travel times among links
Z ∈ R^{|A|×|K_Z|}: The matrix of values for the exogenous attributes among links
z_k ∈ R^|A|: The vector of values for the exogenous attribute k ∈ K_Z among links
Z_ak ∈ R: The value of the exogenous attribute k ∈ K_Z at link a ∈ A
t̄ ∈ R^|A|: The vector of exogenous travel times among links
θ ∈ R^|D|: The vector of true utility function coefficients
θ_Z: The vector of true utility function coefficients associated with the exogenous attributes
θ_d ∈ R: The utility function coefficient associated with attribute d ∈ D
v: The vector of link utilities
v_a ∈ R: The link utility associated with link a ∈ A
U_jl ∈ R: The latent (unobservable) utility that traveler l attains from alternative (path) j ∈ J_l
V_jl ∈ R: The observable utility that traveler l attains from alternative (path) j ∈ J_l
ε_jl ∈ R: The latent (unobservable) error in the utility function
associated with traveler l and alternative (path) j ∈ J_l
μ ∈ R_+: Scale parameter of the logit model and of the extreme value Type 1 distribution
θ̂ ∈ R^|D|: The vector of estimated utility function coefficients
θ̂_Z ∈ R^{|K_Z|}: The vector of estimated utility function coefficients associated with the exogenous attributes
p(θ): R^|D| → R^|H|: The vector of path choice probability functions
p_h(θ): R^|D| → R: The path choice probability function associated with path h ∈ H
x(θ): R^|D| → R^|A|: The vector of traffic count (response) functions
x_a(θ): R^|D| → R: The traffic flow (response) function associated with link a ∈ A
X̃ = D_θ x(θ) ∈ R^{|A_o|×|D|}: The design matrix in NLLS, equal to the Jacobian matrix of the vector of traffic flow functions
x̄: The vector of observed traffic counts
x̄_a ∈ R_≥0: The traffic count measurement at link a ∈ A
T_{d,H_0} ∈ R: The t-test associated with attribute d ∈ D and under the null hypothesis
F_{1,2}: The F-test comparing models 1 and 2
σ² ∈ R_≥0: The variance of the errors in the nonlinear regression
σ̂² ∈ R_≥0: The estimated variance of the errors in the nonlinear regression

Using the conservation constraint of path flows and demand:

∑_{h∈H_w} f_h = ∑_{h∈H_w} exp(θ_t λ_w + θ_t V_h − 1) = q_w (A.9)

Noting that the value of λ_w is the same ∀h ∈ H_w, replacing Eq. A.10 into Eq. A.8 and substituting V_h = ∑_{a∈A} δ_ah (t_a(x_a) + ∑_{k∈K_Z} (θ_k/θ_t) Z_ak), it is clear that the set of optimal path flows {f_h}_{h∈H} follows a logit assignment.

Appendix A.3. Monotonicity of path choice probabilities and traffic flow functions

Proposition 10 (monotone path choice probabilities). Assume the travelers' utility function is a linear combination of a set of attributes weighted by coefficients. If there are at most two paths connecting every O-D pair, the path choice probabilities are coordinate-wise monotonic functions with respect to the utility function coefficients.

Proof.
We start from the general case where there is an arbitrary number of alternative paths and attributes, so that the choice probabilities are obtained from a softmax function. Define θ_t ∈ R as the utility function coefficient associated with an attribute t ∈ K and τ_i as the utility component associated with the remaining attributes of path i ∈ H_w for an arbitrary O-D pair w ∈ W. Then, the choice probability of path i in O-D pair w is:

p_i(θ_t) = exp(θ_t t_i + τ_i) / ∑_{j∈H_w} exp(θ_t t_j + τ_j)

The first derivative of the softmax function with respect to the utility function parameter θ_t is:

∂p_i/∂θ_t = p_i(θ_t) ∑_{j∈H_w} p_j(θ_t) (t_i − t_j)

and, since p_i(θ_t) > 0, its sign is given by:

sign(∂p_i/∂θ_t) = sign( ∑_{j∈H_w} exp(θ_t t_j + τ_j) (t_i − t_j) ) (A.14)

Given that exponential functions are always positive, Eq. A.14 is negative or positive for any θ_t ∈ R when t_i = min{t_j}_{j∈H_w} or t_i = max{t_j}_{j∈H_w}, respectively. Therefore, when there are at most two paths connecting the O-D pair, the path choice probability of any path i ∈ H_w is necessarily a monotonic function with respect to θ_t. The same analysis extends to every utility function coefficient d ∈ D and O-D pair w ∈ W. This proves that the path choice probabilities are coordinate-wise monotonic functions with respect to the utility function coefficients when all path sets have at most two paths.

Proposition 11 (monotone traffic flow functions under dominant and non-dominant paths). Assume the travelers' utility function is a linear combination of a set of attributes weighted by coefficients. Define dominated and dominating paths as those paths where an attribute of the utility function reaches its maximum or minimum value, respectively, within each O-D pair. If the paths traversing a link are all dominating or all dominated paths, the traffic flow function at that link is coordinate-wise monotonic with respect to the utility function coefficients.

Proof. Consider an arbitrary attribute t ∈ K and define θ_t as the coefficient weighting that attribute in the travelers' utility function.
Denote τ_i as the utility component associated with the remaining attributes of path i ∈ H. The traffic flow function x_a(θ): R^|D| → R associated with any link a ∈ A is given by:

x_a(θ) = ∑_{i∈H} δ_ai ∑_{w∈W} q_w p_i(θ_t) δ_wi

where δ_ai takes the value 1 if path i traverses link a, and 0 otherwise, and δ_wi takes the value 1 if path i belongs to O-D pair w, and 0 otherwise. If the paths traversing the link are all dominating or all dominated, then ∀w ∈ W, i ∈ H_w, t_i = min{t_j}_{j∈H_w} or t_i = max{t_j}_{j∈H_w}, respectively. From Proposition 10, the softmax function associated with each path choice probability p_i(θ_t) is either a monotonically decreasing or a monotonically increasing function when t_i = min{t_j}_{j∈H_w} or t_i = max{t_j}_{j∈H_w}, respectively. By assumption, x_a(θ) is a positive weighted sum of choice probabilities associated with dominating or dominated paths, that is, a positive weighted sum of monotonically increasing or decreasing functions. Therefore, x_a(θ) is either a monotonically increasing or decreasing function with respect to θ_t, and the same analysis applies to every coordinate d ∈ D. Finally, x_a(θ) is a coordinate-wise monotonic function, which completes the proof.

Proposition 12 (monotone traffic flow functions under binary attributes). Suppose there are only two paths connecting every O-D pair and that the travelers' utility function depends only on a binary attribute that can take the values 0 or 1. Then, the traffic flow function is monotonic with respect to the coefficient weighting that attribute.

Proof.
The traffic flow function x_a(θ): R^|D| → R associated with any link a ∈ A can be expressed as:

x_a(θ) = ∑_{i∈H} δ_ai ∑_{w∈W} δ_wi q_w exp(θ I(z_i = 1)) / [exp(θ I(z_i = 1)) + exp(θ I(z_{−i} = 1))]
       = ∑_{i∈H} δ_ai ∑_{w∈W} δ_wi q_w / [1 + exp(θ I(z_{−i} = 1) − θ I(z_i = 1))]

where z_i ∈ {0, 1}, ∀i ∈ H, δ_ai takes the value 1 if path i traverses link a ∈ A, and 0 otherwise, and δ_wi takes the value 1 if path i belongs to O-D pair w ∈ W, and 0 otherwise. Given the existence of two alternatives per O-D pair, we can express x_a(θ) in terms of the sigmoid function σ(·), whose first derivative is proportional to σ(θ)(1 − σ(θ)). The sign of the first derivative of x_a(θ) therefore does not depend on θ, because σ(θ)(1 − σ(θ)) > 0, ∀θ ∈ R in Eq. A.16. Then, it is clear that x_a(θ) is monotonic with respect to θ, and this completes the proof.

Remark 13. Note that the left and right terms in Eq. A.17 are the sums of demand associated with dominated and dominating paths, respectively. Therefore, the traffic flow function will be monotonically increasing or decreasing if the sum associated with the dominating paths is greater or lower, respectively.

Appendix A.4. Coordinate-wise properties of the objective function

Proposition 13 (coordinate-wise pseudo-convexity of the objective function). The objective function of the LUE problem under an uncongested network is coordinate-wise pseudo-convex if the traffic flow functions are coordinate-wise monotonic with respect to each utility function coefficient.

The gradient of the objective function for each observation (Eq. B.6) and the Jacobian matrix D_θ x(θ) ∈ R^{n×d} associated with the traffic flow functions are simply the stacked gradient vectors for each observation n ∈ N. Finally, the analytical gradient of the objective function with respect to the vector of utility function coefficients is:

∇_θ ℓ(θ) = −2 [D_θ x(θ)]^T (x̄ − x(θ))

Input: # Iterations T, initial vector of coefficients θ̂_0 ∈ R^|D|, link and path flows x, f from the inner-level problem, choice of second-order optimization method, vector of observed traffic counts x̄, damping parameter δ_LM for the LM method
for t = 1 . . .
T do
Compute stochastic network loading (Algorithm 2, Appendix B.1): x, f, p ← SNL(θ̂, ∆_q, ∆_x, q, t̄, Z)
Compute the Jacobian of the traffic flow functions (Appendix B.3.2): J_t := D_θ x(θ̂, x, q, p, x̄, t̄, Z)
if method = GN then Update: θ̂_{t+1} = θ̂_t + ∆θ̂_t
end for
return θ̂_T = arg min_{θ̂_1,...,θ̂_T} ℓ(θ̂_t)

References

Chapter 6: Non-linear regression models
Network Loading Versus Equilibrium Estimation of the Stochastic Route Choice Model: Maximum Likelihood and Least Squares Revisited
AWS, 2022. Amazon EC2 Instance Types
Nonlinear Programming: Theory and Algorithms
Alternatives to Dial's logit assignment algorithm
Discrete choice methods and their applications to short term travel decisions, in: Handbook of transportation science
Discrete choice analysis: theory and application to travel demand
Route choice modeling with network-free data
The factor of revisited path size: Alternative derivation
Convex Optimization
Transportation network analysis. Volume I
Modelling long-distance route choice using mobile phone call detail record data: a case study of Senegal. Transportmetrica A: Transport Science
Calibration of equilibrium traffic assignment models and O-D matrix by network aggregate data
Microeconometrics: methods and applications
Fixed point approaches to the estimation of O/D matrices using traffic counts on congested networks
Calibrating aggregate travel demand models with traffic counts: Estimators and statistical performance
Algorithms for solving Fisk's stochastic traffic assignment model
Criteria for quasi-convexity and pseudo-convexity: Relationships and comparisons
Some statistical problems in connection with traffic assignment
On Stochastic Models of Traffic Assignment
An algorithm for the stochastic user equilibrium problem
The unifying role of iterative generalized least squares in statistical algorithms
On the solution of convex bilevel optimization problems
A probabilistic multipath traffic assignment model which obviates path enumeration
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
A relationship between the gravity model for trip distribution and the transportation problem in linear programming
Federal Highway Administration
Note on the maximum likelihood calibration of Dial's assignment method
Some developments in equilibrium traffic assignment
Trip matrix estimation from link traffic counts: The congested network case
Nonlinear Regression
The power of the likelihood ratio test of location in nonlinear regression models
Statistical inference for a system of simultaneous, non-linear, implicit equations in the context of instrumental variable estimation
Simultaneous estimation of the origin-destination matrices and the parameters of a nested logit model in a combined network equilibrium model
City of Fresno GIS Data Hub
Beyond Convexity: Stochastic Quasi-Convex Optimization
California Traffic Collision Data from SWITRS
Adam: A method for stochastic optimization
A data driven method for OD matrix estimation
A Streamlined Network Calibration Procedure for California SR41 Corridor Traffic Simulation Study
Estimation of a trip table and the Θ parameter in a stochastic network
Simultaneous estimation of an origin-destination matrix and link choice proportions using traffic counts
On the variance of recurrent traffic flow for statistical traffic assignment
A Generalized Single-Level Formulation for Origin-Destination Estimation under Stochastic User Equilibrium
Statistical inference of probabilistic origin-destination demand using day-to-day traffic data
A nested recursive logit model for route choice analysis
Pseudo-convex functions
An Algorithm for Least-Squares Estimation of Nonlinear Parameters
Conditional logit analysis of qualitative choice behavior
Quantitative methods for analysing travel behaviour of individuals: Some recent developments
Second Order Conditions for Pseudo-Convex Functions
Modelling Transport
The overreliance on statistical goodness-of-fit and under-reliance on model validation in discrete choice models: A review of validation practices in the transportation academic literature
Sensitivity analysis of traffic equilibria
Route choice models used in the stochastic user equilibrium problem: A review
A Stochastic Approximation Method
Calibration of Dial's Assignment Method
Reverse assignment: Calibrating link cost functions and updating demand from traffic counts and time measurements
A First Order Method for Solving Convex Bilevel Optimization Problems
On centroid connectors in static traffic assignment: Their effects on flow patterns and how to optimize their selections
A new one-level convex optimization approach for estimating origin-destination demand
Escaping saddle points with adaptive gradient methods
SWITRS, 2021. Codebook of Statewide Integrated Traffic Records System (SWITRS)
TIGER/Line with Selected Demographic and Economic Data
Leveraging the California Highway Incident Processing System for Traffic Safety Policy and Research
Mixed logit (or logit kernel) model: Dispelling misconceptions of identification
A two-stage algorithm for origin-destination matrices estimation considering dynamic dispersion parameter for route choice
Asymptotic Theory of Nonlinear Least Squares Estimation
Hierarchical travel demand estimation using multiple data sources: A forward and backward propagation algorithmic framework on a layered computational graph
Simultaneous estimation of the origin-destination matrices and travel-cost coefficient for congested networks in a stochastic user equilibrium
Estimation of origin-destination matrices from link traffic counts on congested networks
Stochastic travel demand estimation: Improving network identifiability using multi-day observation sets
Learning to Recommend Signal Plans under Incidents with Real-Time Traffic Prediction
Developing Calibration Tools for Microscopic Traffic Simulation Final Report Part III: Global Calibration - O-D Estimation, Traffic Signal Enhancements and a Case Study

Acknowledgments
This research is supported by National Science Foundation grant CMMI-1751448.

Author contributions
The authors confirm contribution to the paper as follows: study conception and design: Pablo Guarda, Sean Qian; data collection: Pablo Guarda, Sean Qian; programming and experiments: Pablo Guarda; analysis and interpretation of results: Pablo Guarda, Sean Qian; draft manuscript preparation: Pablo Guarda, Sean Qian. All authors reviewed the results and approved the final version of the manuscript.

Appendix A.1. Notation
Tables A.3, A.4 and A.5 present the notation used throughout the paper.
A_o, A_u: The sets of links with observed and unobserved traffic counts, respectively
∆_q ∈ R^{|H|×|W|}: The path-demand incidence matrix
∆_x ∈ R^{|A|×|H|}: The path-link incidence matrix
Q ∈ R^{|V|×|V|}: The O-D matrix
q ∈ R^|W|: The dense vector associated with the O-D matrix
q_w ∈ R: The demand in O-D pair w ∈ W
x ∈ R^|A|: The vector of link flows
x_a ∈ R_≥0: Link flow in link a ∈ A
γ ∈ R^|A|: The vector of capacities among links
γ_a ∈ R_≥0: The capacity of link a ∈ A
t_0 ∈ R^|A|_+: The vector of links' free flow travel times
t_0a ∈ R_+: The free flow travel time at link a ∈ A
f ∈ R^|H|: The vector of path flows
f_h ∈ R_≥0: The path flow on path h ∈ H
p ∈ R^|H|: The vector of path choice probabilities

The standard formulation of the SUE with logit assignment (SUE-logit) problem assumes a utility function dependent on travel time only. Fisk (1980) proved that the first-order necessary optimality condition of the following optimization problem gives a path flow solution that is logit distributed. From this formulation it is clear that if |θ| → ∞, the objective function reduces to the first term, associated with the Beckmann transformation, and thus the SUE and DUE optimization problems become equivalent. The LUE and ODLUE literature typically defines θ ∈ R_+ as a dispersion parameter measuring the sensitivity of route choices to travel times (Yang et al., 2001) and interprets it as the accuracy of travelers' perception of travel costs (Daganzo, 1977; Wang et al., 2016) or the level of information about travel costs (Lo and Chan, 2003; Liu and Fricker, 1996). Under a single-attribute utility function, the interpretation of the parameter may be irrelevant or difficult to falsify. However, in the case of a multi-attribute utility function, these interpretations would downplay the relationship between the magnitude of the dispersion parameter and the amount of unobservable components of the utility function that are ignored by the modeler.
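For reference, the SUE-logit program of Fisk (1980) referred to above can be stated in its standard form, with θ the dispersion parameter and the first term the Beckmann transformation:

```latex
\min_{f \geq 0} \; Z(f) \;=\; \sum_{a \in A} \int_0^{x_a} t_a(u)\, du
\;+\; \frac{1}{\theta} \sum_{w \in W} \sum_{h \in H_w} f_h \ln f_h
\quad \text{s.t.} \quad
\sum_{h \in H_w} f_h = q_w \;\; \forall w \in W, \qquad
x_a = \sum_{h \in H} \delta_{ah} f_h \;\; \forall a \in A
```

As θ grows, the entropy term vanishes and the program reduces to the deterministic (DUE) Beckmann formulation, consistent with the limit discussed above.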
To understand this connection, it is key to reformulate the problem into a utility-based representation. The objective function in Problem A.1 is written in terms of the link performance functions t_a: R_≥0 → R_+ instead of the observable component of the travelers' utility function v_a: R → R associated with each link a ∈ A. Assume the observable component of the travelers' utility function is given by v_a = θ_t t_a, where θ_t = μθ̃_t is the coefficient measuring travelers' preference for travel time, scaled by a factor μ ∈ R_+ proportional to the variance of the unobservable component of the utility function; θ̃_t is the unscaled logit coefficient, which is not identifiable. Then, the utility-based representation of Problem A.1 can be written in vectorized form. A key observation with respect to Problem A.1 is that θ = −θ_t > 0, and since θ_t = μθ̃_t, it is now clear that the dispersion parameter is also scaled by the scale factor μ of the logit model. The link utility function v_a(u) at a traffic flow level u is reparameterized such that Z_ak represents the value of the exogenous attribute k ∈ K_Z at link a ∈ A, and θ_t ∈ R_−, θ_Z ∈ R^{|K_Z|} are the sets of coefficients measuring travelers' preferences for the endogenous attribute t and the exogenous attributes k ∈ K_Z. Note that if both terms of the objective function in Problem A.3 are multiplied by θ_t < 0, the problem becomes a maximization, which can be written in compact form.

Appendix A.2.4. Logit assignment of path flows in utility-based formulation

Following the rationale of the proof in Fisk (1980), we can prove that the path flow solution of the optimization model presented in Problem A.3 follows a logit assignment.
The Lagrangian L of the problem is stated below, together with the set of first-order optimality conditions, where δ_hw takes the value 1 if path h belongs to O-D pair w ∈ W, and 0 otherwise, and V_h denotes the reparameterized utility function associated with path h ∈ H. We can then find an expression for f_h using Eq. A.7.

Proof. Define θ_d^1, θ_d^2 ∈ R^|D| as vectors with all coordinates set to 0 except for coordinate d ∈ D, and let θ_d^1, θ_d^2 ∈ R denote the values of their non-zero coordinates. To prove the coordinate-wise pseudo-convexity of the objective function ℓ: R^|D| → R, it suffices to show that the following holds:

∇ℓ(θ_d^1)^T (θ_d^2 − θ_d^1) ≥ 0 ⟹ ℓ(θ_d^2) ≥ ℓ(θ_d^1) (A.18)

By assumption, the traffic count (response) functions are coordinate-wise monotonic. First consider the set J^+ of functions that are monotonically increasing with respect to θ_d. Since ∂x_j/∂θ_d evaluated at θ = θ_d^1 is positive ∀j ∈ J^+, the LHS of Eq. A.18 becomes non-negative in two cases: by the increasing monotonicity of the traffic flow functions j ∈ J^+ with respect to θ_d when θ_d^2 ≥ θ_d^1, and when θ_d^2 ≤ θ_d^1. Now consider the set J^− of traffic flow functions that are monotonically decreasing with respect to θ_d ∈ R. Since ∂x_j/∂θ_d evaluated at θ = θ_d^1 is negative ∀j ∈ J^−, the LHS of Eq. A.18 becomes non-negative in two cases: by the decreasing monotonicity of the traffic flow functions j ∈ J^− with respect to θ_d when θ_d^2 ≤ θ_d^1, and when θ_d^2 ≥ θ_d^1. Thus, if the traffic flow functions are monotonic, the condition holds ∀j ∈ J^− ∪ J^+. Putting everything together proves the pseudo-convexity of the objective function of the LUE problem with respect to θ_d.
The analysis conducted for θ_d can be applied to every coordinate d ∈ D of θ ∈ R^|D|, which proves the coordinate-wise pseudo-convexity of the objective function with respect to an arbitrary utility function coefficient.

Assumption 12 (Range of response functions). The range of each response function includes the value of the corresponding traffic count measurement.

Remark 14. Assumption 12 may be tested by checking that there exist values of the vector of utility function coefficients at which each response function matches the value of the corresponding traffic count measurement. Note that this condition is analyzed for each traffic flow function independently, and thus the vector of utility function coefficients does not need to be the same for all traffic flow functions. An alternative way to test this assumption is to approximate the ranges of the traffic flow functions with the bounds derived in Proposition 1, Section 4.1.1, and then check whether the traffic count measurements fall within those ranges. Furthermore, if the traffic flow functions depend on a single attribute, their ranges can be found by evaluating them at the extreme cases θ → −∞ and θ → +∞.

Proposition 14 (Coordinate-wise vanishing gradient of the objective function). Suppose that each traffic flow function x_n(θ), ∀n ∈ N, is coordinate-wise monotonic and that the range of each includes the value of the link count measurement x̄_i, ∀i ∈ N (Assumption 12). Then, the gradient of the LUE objective function in an uncongested network vanishes coordinate-wise at least once.

Proof. Let θ_d ∈ R be the value of an arbitrary coordinate d of the vector θ ∈ R^|D| of travelers' utility function coefficients. Assume that the values of all coefficients except θ_d are kept constant and that the minimization of the objective function ℓ: R^|D| → R is performed with respect to θ_d only.
Now split the set of traffic flow functions N between those that are monotonically decreasing and increasing, and denote each set as N^− and N^+, respectively. The expression of the first derivative of the objective function ℓ: R^|D| → R can then be decomposed accordingly. By assumption, the ranges of the traffic flow functions include the values of the link count measurements; hence, for each traffic flow function i ∈ N, ∃θ^i ∈ R^|D|: x̄_i − x_i(θ^i) = 0. Let {θ̃_d^i}_{i∈N} be the set containing the values of the coefficient θ̃_d that satisfy x̄_i − x_i(θ^i) = 0 for each traffic flow function i ∈ N. Define θ̃_d^+ as the maximum value in this set and add to it some arbitrary quantity ε > 0. Let θ̃^+ ∈ R^|D| be the vector of utility function coefficients whose d-th coordinate equals θ̃_d^+ + ε. By the increasing and decreasing coordinate-wise monotonicity of the traffic flow functions within the sets N^+ and N^−, the first derivative of ℓ evaluated at θ̃^+ has a definite sign, and an analogous construction at a point θ̃^− yields the opposite sign. Finally, since the first derivative of ℓ is continuous and it changes sign between θ̃^+ and θ̃^−, by the intermediate value theorem it must vanish at least once at some feasible point θ ∈ R^|D|. The analysis conducted for θ_d can be applied to every coordinate d ∈ D of θ ∈ R^|D|, which completes the proof.

Proofs of existence and uniqueness of a global optimum typically rely on the convexity of the optimization problem. Despite the non-convexity of our problem, we can rely on Assumption 1, Section 3 (SUE-logit) and on the absence of measurement error to prove existence and uniqueness.

Proposition 15 (Existence of a global minimum with no noise in traffic counts). The LUE problem under an uncongested network has a global optimum if the traffic count measurements follow SUE-logit and have no measurement error.

Proof.
By assumption, if x̄ follows SUE-logit, then ∃θ* ∈ R^|D|: x(θ*) = x̄. Hence, the objective function reaches its lower bound ℓ(θ*) = 0, and thus θ* is a global minimizer, which completes the proof.

Proposition 16 (Uniqueness of the global minimum with no noise in traffic counts). The LUE problem under an uncongested network has a unique global optimum at θ* ∈ R^|D| if (i) the traffic count measurements x̄ follow SUE-logit and have no measurement error, (ii) the Jacobian matrix of the response functions has full rank at every feasible point θ ∈ R^|D|, and (iii) the response functions x(θ) are strictly monotone.

Proof. By Proposition 15 and the assumption that x̄ follows SUE-logit, the optimization problem has a global optimum at θ*. To prove uniqueness, we first derive the first-order necessary optimality condition:

[D_θ x(θ)]^T (x̄ − x(θ)) = 0

By assumption, the Jacobian matrix D_θ x(θ) has full rank ∀θ ∈ R^|D|, and thus [D_θ x(θ)]^T (x̄ − x(θ)) = 0 iff x(θ) = x̄. By the strict monotonicity of x(θ), there exists a unique value of θ ∈ R^|D| such that x(θ) = x̄. Therefore, θ = θ* is the unique global optimum.

Appendix A.6. Connection between OLS and NLLS

Consider the vectorized version of the NLLS regression equation (Eq. 28, Section 6.1) and express it in terms of the response function x(θ): R^|D| → R^|A| defined in Eq. 8, Section 3.3.1. Compute the first-order Taylor approximation of the response function with respect to θ ∈ R^|D| around an arbitrary vector θ_0 ∈ R^|D|:

x(θ) ≈ x(θ_0) + D_θ x(θ)|_{θ_0} (θ − θ_0)

where, for ease of notation, x(θ) = x(Z, t, θ). Replacing this back into the NLLS regression equation (Eq. 28, Section 6.1), it is clear that Eq. A.21 resembles the OLS regression equation, except that X̃ = D_θ x(θ)|_{θ_0}.

Appendix B.1.
Stochastic network loading

Input: θ ∈ R^|D|, incidence matrices ∆_q, ∆_x, non-sparse O-D vector q, vector of link travel times t̂, matrix of exogenous link attributes Z

a: Travel time initialization: if t̂ = ∅, then t̂ = t_f, where t_f is the vector of links' free-flow travel times
b: Computation of link utilities: v_x ← θ_t t̂ + θ_Z Z
c: Computation of path choice probabilities p (vectorized form of Eq. 5)
d: Computation of path flows: f ← (∆_q q) ∘ p
e: Computation of link flows: x ← ∆_x f

where ⊘ is the operator for element-wise division used in the computation of p, v_x ∈ R^|A| is the vector of link utilities, p ∈ ]0,1[^|H| is the vector of path choice probabilities, and t̂ is the vector of travel times, which is assumed to be exogenous during SNL. Note that the definition of p in step c of SNL corresponds to the vectorized form of the path flows at SUE-logit presented in Eq. 5, written as a function of the link utilities and the network incidence matrices ∆_x, ∆_q.

Input: # iterations T, initial vector of estimated coefficients θ̂ ∈ R^|D|, matrix of exogenous attributes Z, vector of links' free-flow travel times t_0, vector of link capacities γ, incidence matrices ∆_q, ∆_x, dense O-D vector q, grid of values λ_FW ∈ R^|D| for the line search in the Frank-Wolfe algorithm, proportion of selected O-D pairs in the column generation phase ρ_W, proportions of generated and selected paths in the column generation phase k_g, k_s

Step 0: Initialization
a: Compute the initial vector of link utilities: v_x ← θ̂_t t_f + θ̂_Z Z, where θ̂ = [θ̂_t, θ̂_Z], θ̂_t ∈ R, θ̂_Z ∈ R^|K_Z|
b: Generate the k shortest paths S_pq, ∀(p, q) ∈ W, based on the link utilities v_x
c: Compute the incidence matrices ∆_q, ∆_x from S_pq
d: Perform stochastic network loading

Step 1: Column generation phase
a: Select the subset of O-D pairs W_s ⊆ W with the highest travel demand, such that |W_s| = |W| ρ_W
b: Generate the set of k_g shortest paths C_pq, ∀(p, q) ∈ W_s, based on the current link utilities v_x
c: Update the path sets: S_pq ← C_pq, ∀(p, q) ∈ W_s
d: Update the incidence matrices ∆_q, ∆_x from S_pq, ∀(p, q) ∈ W

Step 2: SUE-logit
a: Perform stochastic network loading (Algorithm 2)
c: Solve a line search problem
e: Go to Step 3 if the desired level of accuracy has been achieved; otherwise, set i = i + 1 and start again from Step 2

Step 3: Path selection
a: Select the k_s shortest paths R_pq, ∀(p, q) ∈ W_s
b: Update the path sets: S_pq ← R_pq, ∀(p, q) ∈ W_s
c: Update the incidence matrices ∆_q, ∆_x from R_pq, ∀(p, q) ∈ W

Input: # iterations in the non-refined and refined stages T_1, T_2, initial vector of estimated coefficients θ̂_0 ∈ R^|D|, link flows and path choice probabilities x, p from the inner-level problem, choice of optimization algorithm for the refined stage (refined-method), vector of observed traffic counts x̂, learning rate η_1, damping parameter δ_LM for the LM method

Step 1: Non-refined stage
for t = 1 . . . T_1 do
  θ̂_{t+1} ← FirstOrderOptimization(θ̂_t, η_1, x̂, x, p, T = 1)
end for

Step 2: Refined stage
for t = 1 . . .
T_2 do
  θ̂_{T_1+t+1} ← SecondOrderOptimization(method = LM, θ̂_{T_1+t}, δ_LM, x̂, x, p, T = 1)
end for

The first derivative of the objective function ℓ(·) with respect to a utility function coefficient θ̂_d ∈ R, ∀d ∈ D, can be written in vectorized form as:

∂ℓ(θ̂)/∂θ̂_d = −2 [∂x(θ̂)/∂θ̂_d]ᵀ (x̂ − x(θ̂))

where x(·) is a vector-valued response function that receives as input the vector of utility function coefficients and returns a vector x ∈ R^|A| with the predicted traffic counts for all links in the transportation network:

x(θ̂) = ∆_x [(∆_q q) ∘ p(θ̂)]

and p(θ̂) is a vector-valued function that receives as input the vector of utility function coefficients and returns a vector with the choice probabilities associated with all paths in the transportation network.

The first derivative of x(·) with respect to a utility function coefficient θ̂_d can be written in vectorized form as:

∂x(θ̂)/∂θ̂_d = ∆_x [(∆_q q) ∘ ∂p(θ̂)/∂θ̂_d]

where ∂p(θ̂)/∂θ̂_d ∈ R^|H| is the vector of first derivatives of the path choice probabilities with respect to θ̂_d, Z_d ∈ R^|H| is a column vector with the values of attribute d ∈ D among all paths, and 1_|Z_d| ∈ R^|H| is a column vector of ones. Then, the gradient ∇_θ̂ x_i(θ̂) associated with a traffic flow function i is obtained by stacking the first derivatives in a column vector.

The second derivative of the objective function ℓ(·) with respect to a utility function coefficient θ̂_d ∈ R, ∀d ∈ D, can be written in vectorized form as:

∂²ℓ(θ̂)/∂θ̂_d² = 2 [∂x(θ̂)/∂θ̂_d]ᵀ ∂x(θ̂)/∂θ̂_d − 2 [∂²x(θ̂)/∂θ̂_d²]ᵀ (x̂ − x(θ̂))

where ∂²p(θ̂)/∂θ̂_d² ∈ R^|H| is the vector of second derivatives of the path choice probabilities with respect to θ̂_d.

Algorithm 5 shows the pseudo-code to implement the first-order optimization algorithms. Note that the only difference between normalized gradient descent (NGD) and vanilla gradient descent is the normalization step applied to the gradient.
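As an illustration of this normalization step, the sketch below implements NGD on a toy quadratic objective standing in for the LUE loss; the `target` vector, step size, and iteration count are hypothetical values, not quantities from the paper.

```python
import numpy as np

def ngd(grad, objective, theta0, eta=0.05, T=200):
    """Gradient descent where each gradient is rescaled to unit norm,
    returning the iterate with the lowest objective value."""
    theta = np.asarray(theta0, dtype=float)
    history = [theta.copy()]
    for _ in range(T):
        g = grad(theta)
        norm = np.linalg.norm(g)
        if norm > 0:
            g = g / norm  # the only change relative to vanilla gradient descent
        theta = theta - eta * g
        history.append(theta.copy())
    return min(history, key=objective)  # keep the best iterate, as in Algorithm 5

# Toy quadratic objective (hypothetical) standing in for the LUE loss.
target = np.array([1.0, -2.0])
objective = lambda th: float(np.sum((th - target) ** 2))
grad = lambda th: 2.0 * (th - target)
theta_best = ngd(grad, objective, np.zeros(2))
```

Because the step length is fixed at η regardless of the gradient magnitude, NGD makes steady progress even where the pseudo-convex objective is nearly flat, which is why the best iterate, rather than the last one, is returned.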
Input: # iterations T, initial vector of estimated coefficients θ̂_0 ∈ R^|D|, incidence matrices ∆_q, ∆_x, dense O-D vector q, path choice probabilities p, link flows x, matrix of exogenous link attributes Z, vector of link travel times t̂, vector of observed traffic counts x̂, learning rate η

for t = 1 . . . T do
  Perform stochastic network loading (Algorithm 2, Appendix B.1): x, f, p ← SNL(θ̂_t, ∆_q, ∆_x, q, t̂, Z)
  Compute the gradient of the objective function (Appendix B.3.2): g_t ← ∇_θ̂ ℓ(θ̂_t, x, q, p, x̂, t̂, Z, ∆_q, ∆_x)
  Normalize the gradient: g_t ← g_t / ‖g_t‖
  Update the solution: θ̂_{t+1} ← θ̂_t − η g_t
end for
return θ̂* = arg min_{θ̂ ∈ {θ̂_1, ..., θ̂_T}} ℓ(θ̂)

Appendix B.3.5. Second-order optimization methods

Table 6 shows the pseudo-code to implement the Gauss-Newton (GN) and Levenberg-Marquardt (LM) algorithms.
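To make the relationship between the two second-order methods concrete, the sketch below applies Gauss-Newton and its damped variant, Levenberg-Marquardt, to a toy NLLS problem with a known linear response; the matrix `A`, `theta_true`, and the damping value are hypothetical stand-ins, not the paper's network response function.

```python
import numpy as np

# Toy NLLS problem standing in for the LUE objective (hypothetical):
# minimize ||x_obs - x(theta)||^2 with a linear response x(theta) = A @ theta.
A = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 1.0]])
theta_true = np.array([0.5, -1.0])
x_obs = A @ theta_true  # noiseless counts, as in Proposition 15

def response(theta):
    return A @ theta

def jacobian(theta):
    # Jacobian of the response function; constant here because x is linear.
    return A

def gauss_newton(theta0, T=20, delta=0.0):
    """GN iterations; setting delta > 0 damps each step, which gives LM."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        J = jacobian(theta)
        r = x_obs - response(theta)
        # LM solves the GN normal equations with a damped curvature matrix.
        step = np.linalg.solve(J.T @ J + delta * np.eye(len(theta)), J.T @ r)
        theta = theta + step
    return theta

theta_gn = gauss_newton(np.zeros(2))              # pure Gauss-Newton
theta_lm = gauss_newton(np.zeros(2), delta=1e-2)  # Levenberg-Marquardt
```

With delta = 0 the update is a pure Gauss-Newton step; the damping term δI keeps the curvature matrix well conditioned when JᵀJ is near singular, at the cost of slightly slower convergence.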