key: cord-0660704-5i8ds440 authors: Fonseca, Diego; Junca, Mauricio title: Distributionally Robust Optimization with Expected Constraints via Optimal Transport date: 2021-11-08 journal: nan DOI: nan sha: 69ff23cd351c9a7fa1d78537e55e28ad573ef78d doc_id: 660704 cord_uid: 5i8ds440 We consider a stochastic program with expected value constraints. We analyze this problem in a general context via Distributionally Robust Optimization (DRO) approach using 1 or 2-Wasserstein metrics where the ambiguity set depends on the decision. We show that this approach can be reformulated as a finite-dimensional optimization problem, and, in some cases, this can be convex. Additionally, we establish criteria to determine the feasibility of the problem in terms of the Wasserstein radius and the level of the constraint. Finally, we present numerical results in the context of management inventory and portfolio optimization. In portfolio optimization context, we present the advantages that our approach has over some existing non-robust methods using real financial market data. In this work we consider stochastic programs with expected value constraints given by the following formulation: where F and G are functions such that F, G : R m × R n → R, ξ ∈ R n is a random vector with (unknown) probability distribution P supported in Ξ ⊆ R n , and X ⊆ R m is a set of constraints on the decision vectors. In addition, the objective function Φ is a risk function that depends on the performance function F . When Φ (x, F, P) := E P [F (x, ξ)], this problem appears in various contexts such as finance, [16, 9] , operations research, [21] , and, machine learning [25, 22] . Most attempts to solve (1) use Sample Average Approximation (SAA) where samples of ξ are used to replace expected values by the sample means. Strategies based on the stochastic gradient descent methods are used in [1, 15, 37] . These strategies are sensitive to alterations in the quality of the sample, and the out-of-sample performance can be poor, specifically, the constraints tend not to be satisfied out-of-sample when the sample size is small. Note that other risk functions, like Conditional Value-at-Risk, can be formulated as (1) , see [26] . When Φ (x, F, P) := Var P [F (x, ξ)], (1) is known as the mean-variance model and this problem has been mostly explored in portfolio optimization and inventory management. The mean-variance model is formulated for the first time in portfolio optimization in [20] . In this case m = n where m is the number of assets of the portfolio, ξ is a random vector of returns of each asset, x is a portfolio weights vector, and other constraints admissible for the investor are described by the set X . Additionally, in this case, F = G where F (x, ξ) := x, ξ where ·, · is the Euclidean inner product in R m so that x, ξ is the return of the portfolio x. The first attempts to solve this problem consider estimates of the vector of means and the covariance matrix of returns, but in [7] it is shown that the resulting portfolios do not perform well out of sample, and they are very sensitive to variations on the estimates. One of the first ideas to overcome this problem is to consider the vector of returns and the covariance matrix as variables, that is, the variables of the optimization problem will be the portfolio weights, the vector of means, and the covariance matrix. The choice of the feasible set for the vector and the matrix is crucial in this approach. 
Some works use sets based on a priori information about the returns or impose sets that are computationally tractable; see for example [11], [39], [23], [17], [18], and [36]. Imposing unverifiable assumptions about the moments of the returns can also harm the out-of-sample performance of these methods. Most of the strategies used in portfolio optimization can be replicated in inventory management; however, the disadvantages of these strategies carry over as well, which motivates the search for other approaches. In the context of inventory management, mean-variance models are used in the newsvendor problem, where the advances presented in [5] paved the way for this topic to become an attractive one. In this regard, there are research works such as [6], [27], and [38].

We propose a different data-driven approach to address (1) using Distributionally Robust Optimization (DRO). DRO was initially proposed for unconstrained stochastic problems such as

$$J^* := \min_{x \in X} \; \mathbb{E}_{\mathbb{P}}[f(x, \xi)], \tag{2}$$

where X is a set of feasible solutions, ξ is a random vector of parameters with unknown probability distribution P, and f(x, ξ) is a cost function. In this sense, the DRO approach for problem (2) is formulated as

$$J_{\mathcal{D}} := \min_{x \in X} \; \sup_{\mathbb{Q} \in \mathcal{D}} \; \mathbb{E}_{\mathbb{Q}}[f(x, \xi)], \tag{3}$$

where D is a set of probability distributions known as the ambiguity set. Note that J* ≤ J_D if P ∈ D. Therefore, a natural DRO version of (1) is the following optimization problem:

$$\min_{x \in X} \; \sup_{\mathbb{Q} \in \mathcal{D}} \; \Phi(x, F, \mathbb{Q}) \quad \text{s.t.} \quad \inf_{\mathbb{Q} \in \mathcal{D}} \; \mathbb{E}_{\mathbb{Q}}[G(x, \xi)] \ge \mu. \tag{4}$$

The choice of the set D is decisive for the tractability of this problem, and there are several ways to define D in the literature. For example, in [14] and [30] it is defined as a set of distributions supported on a single point, while in [8], [24], [28], and [31], D is defined as the set of distributions that satisfy specific restrictions on their moments, or that belong to a given family of parametric distributions. Another option is to endow the set of probability distributions with a notion of distance and define D as a ball with respect to this distance. Usually, this ball is centered at an empirical distribution P̂_N, given by a sample ξ_1, ..., ξ_N of the random vector ξ, and the radius is chosen so that the distribution P belongs to the ball with high probability, or so that the out-of-sample performance of the optimal solution is good. Again, the tractability of the resulting DRO problem depends on the notion of distance adopted. Some frequently used distances are Burg's entropy, see [35]; the Kullback-Leibler divergence, see [13]; and the Total Variation distance, which is adopted in [32]. In this work we use the Wasserstein distance, that is, we define D as a ball with respect to the Wasserstein metric centered at an empirical distribution and with properly chosen radius. Note that if we choose radius 0 in this approach we recover the SAA strategy.

Definition 1 (Wasserstein distance). The Wasserstein distance W_p(µ, ν) between µ, ν ∈ P_p(Ξ) is defined by

$$W_p(\mu, \nu) := \left( \inf_{\Pi \in S(\mu, \nu)} \int_{\Xi \times \Xi} d(\xi, \zeta)^p \, \Pi(d\xi, d\zeta) \right)^{1/p}, \tag{5}$$

where S(µ, ν) denotes the set of couplings of µ and ν, and d is a metric on Ξ. W_p defines a metric on P_p(Ξ) for p ∈ [1, ∞); hence the ball with respect to the p-Wasserstein distance with radius ε > 0 and center µ ∈ P(Ξ) is given by

$$B_\varepsilon(\mu) := \{ \nu \in P_p(\Xi) : W_p(\mu, \nu) \le \varepsilon \}.$$

One of the first works in which this notion of distance was defined is [33], although it arises in different fields of science almost simultaneously and, depending on the context, is known by other names. In computer science it is called the Earth Mover's distance; in physics it is called the Monge-Kantorovich-Rubinstein distance; and, in the context of optimization, some researchers call it the Optimal Transport distance.
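Since both our reformulations and our experiments work with empirical distributions on R, it is worth noting that W_p is easy to evaluate in this one-dimensional setting. The following sketch, with synthetic samples standing in for data, computes W_1 with SciPy and checks it against the sorted-pairing formula, which is valid on R because the monotone coupling is optimal; the same formula also gives W_2. The two samples and their sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
xi_a = rng.exponential(scale=10.0, size=500)   # sample from one distribution
xi_b = rng.exponential(scale=12.0, size=500)   # sample from a perturbed one

# 1-Wasserstein distance between the two empirical distributions (SciPy).
w1 = wasserstein_distance(xi_a, xi_b)

# On R, the optimal coupling between two equal-size empirical measures is the
# monotone (sorted) pairing, so W_p has a closed form for any p >= 1.
def wp_empirical(a, b, p):
    a, b = np.sort(a), np.sort(b)
    return (np.mean(np.abs(a - b) ** p)) ** (1.0 / p)

print(w1, wp_empirical(xi_a, xi_b, 1))         # should agree up to float error
print(wp_empirical(xi_a, xi_b, 2))             # 2-Wasserstein distance

# Membership of one empirical measure in the ball around the other is then
# just the check wp_empirical(...) <= eps.
```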
There are theoretical reasons, many of them exposed in [34], and practical reasons that make this distance very appealing. One of the main advantages is its dual representation. In fact, this property allows one to find a more tractable equivalent formulation of (3).

Theorem 1. Assume that D := B_ε(P̂_N) with respect to W_p and that f is upper semicontinuous. Then problem (3) is equivalent to the optimization problem

$$\min_{x \in X} \; \inf_{\lambda \ge 0} \; \left\{ \lambda \varepsilon^p + \frac{1}{N} \sum_{i=1}^N \sup_{\xi \in \Xi} \left[ f(x, \xi) - \lambda \, d(\xi, \xi_i)^p \right] \right\}. \tag{6}$$

This theorem is formulated and proved in [3]. However, the reformulation (6) was also obtained in [12] and [19] under more restrictive assumptions. It is important to note that the tractability of this reformulation does not depend on the form of the unknown true distribution P.

Concretely, in this work we use DRO with the Wasserstein metric to address stochastic programs with expected value constraints from a sample of the random vector. In that sense, our contributions are the following:

• We propose a data-driven robust formulation of (1) which prioritizes the expected value constraints without relying on regularization parameters.

• We show that our DRO approach to (1) can be reformulated as an optimization problem with finite-dimensional variables. In the portfolio optimization and inventory management contexts, for specific cases, we show that this problem is convex. We also establish criteria to determine the feasibility of the problems in terms of the radius of the ball and features of F and G.

• In inventory management, we focus on one version of the newsvendor problem. Most of the works on this problem are parametric studies that assume that the probability distribution of the random variable that influences demand is known. In contrast, our approach is non-parametric, so the performance of our strategy is compared to the SAA strategy.

• In portfolio optimization, we evaluate the performance of our approach and compare it with other traditional approaches. This evaluation is done on two types of data: the first is synthetically generated return data, and the second is real market data. The results show the advantages of our proposal, since the highest expected return and the highest Sharpe ratio are obtained compared to the other benchmarks; in addition, a low turnover is obtained compared to the SAA strategy.

The organization of this paper is as follows. In Section 2, using the Wasserstein distance, we describe our distributionally robust optimization model. In this section, we also derive tractable reformulations of the optimization problem and study its feasibility. Additionally, we establish a criterion to calibrate the size of the ambiguity set. Simulation analyses of the proposed approach are presented in Sections 3 and 4 for the inventory management and portfolio optimization contexts, respectively. Finally, conclusions are drawn in Section 5.

As stated before, in (1) the distribution P is unknown, and we assume we have access to realizations of the random vector ξ. That is, let ξ_1, ..., ξ_N be a sample of ξ, which allows us to estimate P by means of the empirical distribution

$$\widehat{\mathbb{P}}_N := \frac{1}{N} \sum_{i=1}^N \delta_{\xi_i}.$$

Note that if we replace P with P̂_N in (1) we obtain the SAA strategy, with the drawbacks already mentioned. Another possible approach could be to consider D = B_ε(P̂_N) as the ambiguity set in (4). However, this strategy has disadvantages: if we use Theorem 1 to reformulate the objective function and the constraint in (4), for general functions F and G we obtain a complicated and non-tractable semi-infinite optimization problem.
This motivates our proposal, in which we seek an ambiguity set that yields a reformulation of (4) that is tractable and performs well. To present our approach we impose the following assumption.

Assumption 1 (Lipschitz). We assume that F and G are Lipschitz functions with respect to ξ. That is, for each x there exist constants γ_{x,F}, γ_{x,G} > 0 such that

$$|F(x, \xi) - F(x, \zeta)| \le \gamma_{x,F} \, d(\xi, \zeta) \quad \text{and} \quad |G(x, \xi) - G(x, \zeta)| \le \gamma_{x,G} \, d(\xi, \zeta) \quad \text{for all } \xi, \zeta \in \Xi.$$

Our approach also uses an empirical distribution, but one that depends on x. First, we establish the following notation. For x ∈ R^m we define ζ^{x,F} := F(x, ξ) and ζ^{x,G} := G(x, ξ); note that these are random variables. We denote by P^{x,F} and P^{x,G} the probability distributions of ζ^{x,F} and ζ^{x,G}, respectively. Because they depend on P, the distributions P^{x,F} and P^{x,G} are also unknown. Additionally, we define ζ^{x,F}_i := F(x, ξ_i) and ζ^{x,G}_i := G(x, ξ_i), so that ζ^{x,F}_1, ..., ζ^{x,F}_N is a sample of ζ^{x,F} and ζ^{x,G}_1, ..., ζ^{x,G}_N is a sample of ζ^{x,G}. This allows us to define the empirical distributions of ζ^{x,F} and ζ^{x,G}, which are given by

$$\widehat{\mathbb{P}}^{x,F}_N := \frac{1}{N} \sum_{i=1}^N \delta_{\zeta^{x,F}_i} \quad \text{and} \quad \widehat{\mathbb{P}}^{x,G}_N := \frac{1}{N} \sum_{i=1}^N \delta_{\zeta^{x,G}_i}.$$

This dependence on x is justified by the fact that the decision vector x influences whether the constraint of (1) is satisfied or not. Specifically, the constraint E_P[G(x, ξ)] ≥ µ must be satisfied by G(x, ξ), which depends on x. Therefore, for a given ε > 0, we consider the following optimization problem:

$$\min_{x \in X} \; \sup_{\mathbb{Q} \in B_{\varepsilon \gamma_{x,F}}(\widehat{\mathbb{P}}^{x,F}_N)} \Phi(\mathbb{Q}) \quad \text{s.t.} \quad \inf_{\mathbb{Q} \in B_{\varepsilon \gamma_{x,G}}(\widehat{\mathbb{P}}^{x,G}_N)} \mathbb{E}_{\mathbb{Q}}[\zeta] \ge \mu, \tag{7}$$

where B_{εγ_{x,F}}(P̂^{x,F}_N) and B_{εγ_{x,G}}(P̂^{x,G}_N) are balls centered at P̂^{x,F}_N and P̂^{x,G}_N with radii εγ_{x,F} and εγ_{x,G}, respectively. These balls are defined with respect to the p-Wasserstein distance, where p is 1 or 2. The type of p-Wasserstein distance that we use depends on Φ and the support of ξ; we discuss this in detail in the following subsections. However, the following lemma explains why our ambiguity sets have radii εγ_{x,F} and εγ_{x,G}, and also shows the relationship between the balls in (7) and the ball B_ε(P̂_N).

Lemma 1. If ε > 0 is such that P ∈ B_ε(P̂_N), where the ball is taken with respect to the p-Wasserstein metric for p = 1, 2 and distributions supported on a subset of R^n, then P^{x,F} ∈ B_{εγ_{x,F}}(P̂^{x,F}_N) and P^{x,G} ∈ B_{εγ_{x,G}}(P̂^{x,G}_N), where these balls are taken with respect to the p-Wasserstein metric for p = 1, 2 and distributions supported on a subset of R.

The proof of Lemma 1 is addressed in A.1. From now on, the optimal solutions of (7) will be denoted by x̂_N(ε). Figure 1 shows an illustration of our approach for the case F = G and Φ(x, F, P) = Var_P[F(x, ξ)] in (7). In this figure, we have two decisions, x_1 and x_2, which induce two balls. The centers of these balls must belong to the blue shaded region. Additionally, these balls must be contained in the red shaded region that represents the half-space induced by the constraint in (7). From the figure we can see that x_2 is a better decision than x_1, since the measures in the ball induced by x_2 have lower variance levels compared to the measures within the ball induced by x_1. This implies that the supremum of the variance over B_{εγ_{x_2,G}}(P̂^{x_2,G}_N) is less than the supremum of the variance over B_{εγ_{x_1,G}}(P̂^{x_1,G}_N).

In this case we assume that Φ(x, F, P) := E_P[F(x, ξ)]. The first task is to reformulate (7) as an optimization problem with finite-dimensional variables. This reformulation depends on the image of the support of ξ under the functions F(x, ·) and G(x, ·) for each x ∈ X. Below, A_F(x) := (1/N)Σ_{i=1}^N F(x, ξ_i) and A_G(x) := (1/N)Σ_{i=1}^N G(x, ξ_i) denote the corresponding sample means.

Theorem 2. We have the following cases:

1. If the images of Ξ under F(x, ·) and G(x, ·) are proper intervals of R for each x ∈ X, then the optimization problem (7) is equivalent to a finite-dimensional optimization problem over x ∈ X, denoted (8), in which the worst-case expectations are the sample means A_F(x) and A_G(x) shifted by εγ_{x,F} and εγ_{x,G} and truncated at the endpoints of the corresponding intervals.

2. If the images of Ξ under F(x, ·) and G(x, ·) are equal to R for each x ∈ X, then the optimization problem (7) is equivalent to the following optimization problem:

$$\min_{x \in X} \; A_F(x) + \varepsilon \gamma_{x,F} \quad \text{s.t.} \quad A_G(x) - \varepsilon \gamma_{x,G} \ge \mu. \tag{9}$$
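Before moving on, the following numeric sketch illustrates the mean-shift structure behind Theorem 2 on a single expected value. It evaluates the dual (6) on coarse grids for the 1-Lipschitz function f(ξ) = (ξ - x)_+, which is the constraint function of the newsvendor application below, and compares the result with the sample mean shifted by ε. The grids, the sample, and the choice x = 12 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.exponential(scale=10.0, size=200)     # sample of xi
x, eps = 12.0, 0.5                              # decision and Wasserstein radius

def f(z):                                       # 1-Lipschitz in xi, so gamma = 1
    return np.maximum(z - x, 0.0)

# Dual reformulation (6) with p = 1 and d(a, b) = |a - b|, on coarse grids.
z_grid = np.linspace(0.0, 200.0, 4001)          # surrogate for Xi = [0, inf)
lam_grid = np.linspace(1.0, 4.0, 61)            # inner sup is +inf for lam < 1

def dual_value(lam):
    inner = f(z_grid)[None, :] - lam * np.abs(z_grid[None, :] - xi[:, None])
    return lam * eps + inner.max(axis=1).mean()

worst_case = min(dual_value(lam) for lam in lam_grid)

# Worst-case expectation over the ball vs. the sample mean shifted by eps:
print(worst_case, f(xi).mean() + eps)           # agree up to grid error
```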
In principle, problem (9) may be easier to solve than (8). However, in certain cases, for example in the context of inventory management, (8) is a tractable problem, as we will see later. Note that some values of ε and µ can make problem (7) infeasible. Conditions for feasibility are established in the following corollaries.

Corollary 1. The feasibility of (8) is divided into the following cases:

i) When there exists x ∈ X such that A_G(x) ≥ µ, the optimization problem (8) is feasible for all ε > 0.

ii) Otherwise, (8) is feasible for µ satisfying the inequality (10) and for ε satisfying the inequalities (11), whose bounds are obtained by optimizing over x ∈ X; we denote by ε_max^N(µ) the largest such ε.

Note that in case i) of the previous corollary we can consider ε_max^N(µ) = ∞.

Corollary 2. The optimization problem (9) is feasible if µ and ε satisfy

$$\mu \le \mu_{\max}^N(\varepsilon) := \sup_{x \in X} \left\{ A_G(x) - \varepsilon \gamma_{x,G} \right\}.$$

We now assume that Φ(x, F, P) := Var_P[F(x, ξ)]. As in the previous case, the first task is to reformulate (7); to make it computationally tractable we only consider the 2-Wasserstein distance.

Theorem 3. If the image of Ξ under F(x, ·) is equal to R for each x ∈ X, then the optimization problem (7) is equivalent to the following optimization problem with finite-dimensional variables:

$$\min_{x \in X} \; \left( \widehat{\sigma}_{x,F} + \varepsilon \gamma_{x,F} \right)^2 \quad \text{s.t.} \quad A_G(x) - \varepsilon \gamma_{x,G} \ge \mu, \tag{12}$$

where σ̂_{x,F} denotes the sample standard deviation of ζ^{x,F}_1, ..., ζ^{x,F}_N.

Note that Corollary 2 also applies in this case. One of the advantages of our approach is that the resulting optimization problem is finite-dimensional and, for some functions F and G, convexity is obtained; for example, this happens in the context of portfolio optimization. The proofs of Theorem 2 and Theorem 3 can be consulted in A.4.

To finish this section, we analyze how to choose ε. Our goal is to ensure that, out of sample, the constraint in (1) is satisfied. For example, in (7) we want ε such that, if x̂_N(ε) is an optimal solution, then E_P[G(x̂_N(ε), ξ)] ≥ µ is satisfied with high probability. In that sense, the following lemma provides a criterion to achieve this.

Lemma 2. Suppose that P ∈ B_ε(P̂_N). If ε is such that (7) is feasible and x̂_N(ε) is an optimal solution of (7), then E_P[G(x̂_N(ε), ξ)] ≥ µ.

This lemma suggests that it is enough to find an ε such that P ∈ B_ε(P̂_N), and this is achieved with large values of ε; but we must bear in mind that ε could be limited by ε_max^N(µ) in (7). Another problem is that there is no efficient method to determine from which ε the condition P ∈ B_ε(P̂_N) is satisfied. In that vein, our idea is not to concentrate on guaranteeing that condition; instead, we concentrate on satisfying E_P[G(x̂_N(ε), ξ)] ≥ µ in (7). To do this we use a strategy based on a Bootstrap method, assuming that ξ satisfies all the conditions for the Bootstrap-based technique to be valid (see [10]). Given ε > 0, the strategy is to estimate the probability that the constraint is satisfied, which we call the confidence level of ε. The following method estimates this probability.

• Confidence level of ε for (7): Given x̂_N(ε), generate K bootstrap samples with repetition from Ξ_N = {ξ_1, ..., ξ_N}, all of size N. We denote these bootstrap samples by Ξ^{bt}_1, ..., Ξ^{bt}_K. For each k = 1, ..., K, compute the sample mean of G(x̂_N(ε), ·) over Ξ^{bt}_k and check whether it is at least µ; the confidence level of ε is the percentage of the K bootstrap samples for which this check holds, as sketched in the code after this list.

If the estimated confidence level of ε is β, this means that the constraint is satisfied with an approximate probability β/100. Thus, the strategy is to find the smallest ε such that its confidence level β is acceptable, for example β = 75%. Since the confidence level decreases as ε gets smaller, the maximum confidence level is the one reached at ε = ε_max^N(µ), so we can start from this value whenever it is finite. In Sections 3.1 and 4.1 we will show the performance of this procedure in two applications. The proof of Lemma 2 is relegated to A.1.
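The procedure above is straightforward to implement. The following sketch codes the confidence-level estimate for a generic constraint function; the exponential sample, the decision x_hat = 25.0, and K = 1000 are illustrative assumptions, with G and µ instantiated as in the newsvendor application of the next section.

```python
import numpy as np

def confidence_level(G, x_hat, xi, mu, K=1000, seed=0):
    """Bootstrap estimate of the probability that the expected-value
    constraint E[G(x_hat, xi)] >= mu holds at the fixed decision x_hat.

    G     : callable (x, xi_array) -> array of values G(x, xi_i)
    x_hat : fixed decision, e.g. an optimal solution of the DRO problem
    xi    : array with the original sample of xi
    """
    rng = np.random.default_rng(seed)
    N = len(xi)
    hits = 0
    for _ in range(K):
        idx = rng.integers(0, N, size=N)        # resample with repetition
        if G(x_hat, xi[idx]).mean() >= mu:      # check on the bootstrap sample
            hits += 1
    return 100.0 * hits / K                      # confidence level in percent

# Example with the newsvendor constraint G(x, xi) = -(xi - x)_+ and mu = -alpha;
# x_hat = 25.0 is an arbitrary illustrative decision.
rng = np.random.default_rng(1)
xi = rng.exponential(10.0, size=300)
G = lambda x, z: -np.maximum(z - x, 0.0)
print(confidence_level(G, x_hat=25.0, xi=xi, mu=-0.8))
```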
In this section, we consider the following newsvendor problem:

$$\max_{x \ge 0} \; \mathbb{E}_{\mathbb{P}}[\, p \min\{\xi, x\} - c\,x \,] \quad \text{s.t.} \quad \mathbb{E}_{\mathbb{P}}[(\xi - x)_+] \le \alpha,$$

where c > 0 is the manufacturing cost, p ≥ c is the sell price, and x is the inventory level set for the good. In addition, ξ is the demand for the good, whose probability distribution P is unknown but supported on Ξ = [0, ∞). In this problem the constraint excludes inventory levels that underestimate the demand. Note that this problem is equivalent to

$$\min_{x \ge 0} \; \mathbb{E}_{\mathbb{P}}[\, c\,x - p \min\{\xi, x\} \,] \quad \text{s.t.} \quad \mathbb{E}_{\mathbb{P}}[-(\xi - x)_+] \ge -\alpha,$$

and this is a particular case of (1). Indeed, in (1) we take m = n = 1, F(x, ξ) := cx - p min{ξ, x} and G(x, ξ) := -max{ξ - x, 0}, and µ = -α. Note that F and G are Lipschitz functions with Lipschitz constants γ_{x,F} = p and γ_{x,G} = 1, respectively. Moreover, in this case, Φ(x, F, P) = E_P[F(x, ξ)]. Therefore, given a sample ξ_1, ..., ξ_N of ξ, we solve the corresponding instance of (7), which we denote (13), as a function of the Wasserstein radius ε.

Fig. 2: Out-of-sample expected constraint as a function of the Wasserstein radius ε, estimated on the basis of 1000 simulations. The blue solid line is the mean, the green dashed line is the median, and the blue shaded area is the tube between the 20% and 80% quantiles of the data generated by the 1000 simulations. In this case, α = 0.8.

From Theorem 2-1 the next proposition follows.

Proposition 1. Problem (13) is equivalent to an optimization program (14) in the variables x and z_1, ..., z_N, with z_i playing the role of min{ξ_i, x}.

The proof of this proposition can be consulted in A.5. Also, as a result of Corollary 1, it follows that the problem is feasible if ε ≤ min{ξ̄, α}, where ξ̄ = (1/N)Σ_{i=1}^N ξ_i. In addition, the reformulation obtained from this proposition is a linear optimization problem.

We use synthetically generated data, that is, data generated by a known distribution. Specifically, we consider ξ to be exponentially distributed with mean equal to 10. Additionally, we consider the price p = 2, the cost c = 1, and α = 0.8. Our first objective is to analyze the impact of the Wasserstein radius ε on the optimal distributionally robust inventory levels and their out-of-sample performance. Therefore, we solve problem (14) using samples of size N ∈ {30, 300, 3000}. Figure 2 shows the tube between the 20% and 80% quantiles (shaded area), the mean (blue solid line), and the median (green dashed line) of the out-of-sample performance of the expected constraint E_P[(ξ - x̂_N(ε))_+] as a function of ε, estimated using 1000 independent simulation runs. We observe that the out-of-sample performance of the expected constraint decreases as the Wasserstein radius ε grows. Furthermore, according to the median of the results, for small N the SAA approach (ε = 0) fails to satisfy the constraint in more than 50% of the simulations. However, this figure shows that it is possible to find ε such that the constraint is satisfied out of sample with high probability; the higher this probability, the larger the required ε. Note that this ε also depends on α (see Corollary 1). This stylized fact was observed consistently across all simulations and provides an empirical justification for adopting a distributionally robust approach.
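A minimal implementation of the robust newsvendor along the lines of Proposition 1 is sketched below, assuming the mean-shift form of the reformulation, i.e., worst-case objective (1/N)Σ(cx - p min{ξ_i, x}) + pε and robustified constraint (1/N)Σ(ξ_i - x)_+ + ε ≤ α; the exact constants in (14) may differ from this reading. The variables z_i stand in for min{ξ_i, x}, which linearizes both the objective and the constraint.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
N, p, c, alpha, eps = 300, 2.0, 1.0, 0.8, 0.2
xi = rng.exponential(10.0, size=N)

# Variables: [x, z_1, ..., z_N].
# Objective: minimize c*x - (p/N) * sum z_i (the constant p*eps is added back).
cost = np.concatenate(([c], -p / N * np.ones(N)))

# z_i <= x   <=>   -x + z_i <= 0
A1 = np.hstack((-np.ones((N, 1)), np.eye(N)))
b1 = np.zeros(N)
# (1/N) sum (xi_i - z_i) + eps <= alpha   <=>   -(1/N) sum z_i <= alpha - eps - mean(xi)
A2 = np.concatenate(([0.0], -np.ones(N) / N))[None, :]
b2 = np.array([alpha - eps - xi.mean()])

res = linprog(cost, A_ub=np.vstack((A1, A2)), b_ub=np.concatenate((b1, b2)),
              bounds=[(0, None)] + [(None, v) for v in xi])   # x >= 0, z_i <= xi_i
x_robust = res.x[0]
print(x_robust, res.fun + p * eps)   # robust inventory level and objective value
```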
Fig. 3: Profit E_P[p min{ξ, x̂_N(ε)} - c x̂_N(ε)], optimal values Ĵ_N(ε), and robust optimal inventory level (IL) x̂_N(ε) as a function of the Wasserstein radius ε, estimated on the basis of 1000 simulations. The blue solid lines are the means, the green dashed lines are the medians, and the blue shaded areas are the tubes between the 20% and 80% quantiles of the data generated by the 1000 simulations. In this case, α = 0.8.

Figure 3 shows the profit E_P[p min{ξ, x̂_N(ε)} - c x̂_N(ε)], the optimal values Ĵ_N(ε), and the robust optimal inventory level (IL) x̂_N(ε). Both the out-of-sample profit and the optimal value Ĵ_N(ε) decrease as the Wasserstein radius ε grows. In addition, it is observed that low values of the expected constraint induce low profits. However, the inventory level is high compared to that obtained with the SAA approach. Furthermore, when ε is large, the decision that our approach suggests is to consider high inventory levels. This shows that our strategy prioritizes having all possible demand covered, which is achieved with large inventory levels. Another aspect that follows from these figures is that if ε is such that E_P[(ξ - x̂_N(ε))_+] ≤ α in a large percentage of the simulations, then Ĵ_N(ε) ≤ J in a similar percentage of those simulations. This is consistent with the formulation of problem (7), because if ε is such that P ∈ B_ε(P̂_N), then the worst-case value at x̂_N(ε) bounds the out-of-sample performance. The figures verify the existence of this ε. Even though this observation was made consistently across all simulations, we were unable to validate it theoretically.

Performance of the confidence level of ε. Now we evaluate the performance of the strategy to determine the confidence level of any given ε; this strategy is based on the Bootstrap method as presented above. For this purpose we generate 500 samples of size N = 300, and for each of these samples we calculate ε_max^N(α) and consider ε = 2ε_max^N(α)/5. Given this ε we calculate its confidence level, obtaining Figure 4(a). This figure shows that the confidence levels are high for this value of ε; specifically, they are around 95.8%. This means that E_P[(ξ - x̂_N(ε))_+] should be below α in around 95.8% of the 500 simulations. Additionally, for each sample we calculate the out-of-sample expected constraint of x̂_N(ε), that is, E_P[(ξ - x̂_N(ε))_+]. Figure 4(b) shows that E_P[(ξ - x̂_N(ε))_+] ≤ α was obtained in 92.3% of the simulations. This shows that our bootstrap-based strategy has a good performance. Moreover, this is consistent with what was observed for the out-of-sample expected constraint in Figure 2.

In this case, in (1), we consider Φ as the variance and m = n, where m is the number of assets, ξ_i is the return of the i-th asset, and x_i is the proportion of the initial amount invested in the i-th asset. Given that the returns of each asset are random with unknown distribution P, ξ = (ξ_1, ..., ξ_m) ∈ R^m is a random vector. Additionally, x = (x_1, ..., x_m) ∈ R^m is a portfolio, that is, a vector of weights satisfying Σ_{i=1}^m x_i = 1 together with other additional convex constraints admissible for the investor, all described by the set X. Then ⟨x, ξ⟩ is the return of the portfolio x. Note that F satisfies Assumption 1, because F is a Lipschitz function with respect to ξ with Lipschitz constant γ_{x,F} = ‖x‖. Finally, µ is the minimum level of return admissible for the investor. With these considerations in mind, (1) becomes the Markowitz mean-variance portfolio selection problem, which consists of choosing portfolio weights that minimize the variance of the return rate subject to a constraint on its expected value. From Theorem 3 we obtain the following formulation: (12) is equivalent to the optimization problem

$$\min_{x \in X} \; \left( \sqrt{x^{\mathsf{T}} E\, x} + \varepsilon \lVert x \rVert \right)^2 \quad \text{s.t.} \quad L^{\mathsf{T}} x - \varepsilon \lVert x \rVert \ge \mu, \;\; \sum_{i=1}^m x_i = 1. \tag{15}$$

Note that E and L are the sample versions of the covariance matrix and the vector of means of ξ, respectively, generated by the sample ξ_1, ..., ξ_N.
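A sketch of (15) in CVXPY follows, using the form read off above (worst-case standard deviation √(xᵀEx) + ε‖x‖ and robustified return Lᵀx - ε‖x‖ ≥ µ); minimizing the standard deviation instead of its square leaves the minimizer unchanged. The synthetic i.i.d. returns and the values of ε and µ are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def robust_portfolio(returns, eps, mu):
    """Sketch of the distributionally robust mean-variance problem (15)
    over the simplex (no short selling)."""
    _, m = returns.shape
    E = np.cov(returns, rowvar=False)                    # sample covariance
    L = returns.mean(axis=0)                             # sample mean vector
    E_half = np.linalg.cholesky(E + 1e-10 * np.eye(m))   # E = E_half @ E_half.T

    x = cp.Variable(m)
    # sqrt(x'Ex) = ||E_half.T @ x||, so the objective below is the worst-case
    # standard deviation; squaring it would not change the optimal x.
    objective = cp.Minimize(cp.norm(E_half.T @ x) + eps * cp.norm(x))
    constraints = [cp.sum(x) == 1, x >= 0,
                   L @ x - eps * cp.norm(x) >= mu]       # robustified return
    cp.Problem(objective, constraints).solve()
    return x.value

rng = np.random.default_rng(3)
returns = rng.normal(0.001, 0.02, size=(300, 10))        # toy i.i.d. returns
print(robust_portfolio(returns, eps=0.001, mu=0.0005))
```

Note that feasibility depends on ε and µ through Corollary 2; if µ exceeds µ_max^N(ε) for the drawn sample, the solver reports the problem infeasible.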
As a result of Corollary 2, note that if X = R^m, that is, when short selling is allowed, then µ_max^N(ε) → ∞ as ε → 0; hence for every µ there is an ε that makes the problem feasible. However, regardless of whether short selling is allowed or not, ε_max^N(µ) is decreasing with respect to µ, so the number of candidate values of ε that we can explore shrinks as µ grows.

This subsection is divided into two parts. In one we use synthetically generated data, that is, data generated by a known distribution. In the other we use real data from the financial market. In both cases we consider X = R^m_+, that is, we do not allow short selling.

Using simulated data. We consider a market of m = 10 assets whose returns have the form adopted in [12]; that is, we assume that ξ_i = ψ + ζ_i, where ψ and ζ_i are independent, ψ ∼ N(0, 2%) and ζ_i ∼ N(i × 3%, i × 2.5%) for each i = 1, 2, ..., m. With this assumption, the assets are ordered from the one with the lowest return and volatility to the one with the highest return and volatility. Additionally, we denote by m the vector of means and by Σ the covariance matrix of ξ. In this case, m and Σ are easy to calculate from the distribution of ξ. Given x ∈ R^m we define R(x) := m^T x, the expected return induced by x, and V(x) := x^T Σ x, the variance induced by x. Because we have all the information about the returns, the vector of optimal weights x* is known.

Impact of the Wasserstein radius ε. Our first objective is to analyze the impact of the Wasserstein radius ε on the optimal distributionally robust portfolios and their out-of-sample performance. Therefore, we solve problem (15) using samples of size N ∈ {30, 300, 3000}. Figure 5 shows the tube between the 20% and 80% quantiles (shaded area) and the mean (solid line) of the out-of-sample expected returns R(x̂_N(ε)) as a function of ε, estimated using 500 independent simulation runs. We observe that the out-of-sample expected returns R(x̂_N(ε)) increase as the Wasserstein radius ε grows. Therefore, as in the previous application, the larger the ε, the higher the probability of satisfying the constraint. Note also that this depends on the value of µ. Regarding the out-of-sample variance V(x̂_N(ε)) and the optimal value Ĵ_N(ε), Figure 6 shows that these quantities also increase as the Wasserstein radius ε grows. In addition, it is observed that high returns induce large variances. However, this figure also shows that the Sharpe ratio is always greater than the one obtained by the SAA strategy; in fact, the Sharpe ratio reaches a maximum value near the end of the ε grid, close to ε_max^N(µ). Furthermore, for large N this maximum Sharpe ratio exceeds the true Sharpe ratio, where the latter is defined as R(x*)/√V(x*). Overall, from these figures we can extract the same conclusions obtained in the newsvendor case. Finally, Figure 7 visualizes the corresponding optimal portfolio weights x̂_N(ε) as a function of ε, averaged over 500 independent simulation runs. The thin colored bar on the right side of each graph corresponds to x*. Our numerical results show that, as the Wasserstein radius ε increases, the optimal distributionally robust portfolios tend to give little weight to assets with low return even if they have low volatility, while giving more weight to assets with high return even if they have high volatility.
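For reproducibility, the data-generating process used in these simulations can be coded in a few lines. Whether the percentages denote standard deviations or variances is not fixed by the notation, so treating them as standard deviations here is an assumption.

```python
import numpy as np

def simulate_returns(N, m=10, seed=0):
    """Draw N return vectors from the factor model of [12]:
    xi_i = psi + zeta_i with psi ~ N(0, 2%) and zeta_i ~ N(i*3%, i*2.5%).
    The second parameter is read as a standard deviation (an assumption)."""
    rng = np.random.default_rng(seed)
    psi = rng.normal(0.0, 0.02, size=(N, 1))     # common systematic factor
    means = 0.03 * np.arange(1, m + 1)            # i * 3%
    stds = 0.025 * np.arange(1, m + 1)            # i * 2.5%
    zeta = rng.normal(means, stds, size=(N, m))   # idiosyncratic components
    return psi + zeta                             # shape (N, m)

xi = simulate_returns(N=300)
print(xi.mean(axis=0))   # asset means increase with the asset index
```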
Performance of the confidence level of ε. To evaluate the performance of the strategy to determine the confidence level of any given ε, we consider µ = 0.2 and perform 500 simulations of size N = 300. Using ε = 2ε_max^N(µ)/5 and proceeding as in the previous application, Figure 8 shows that the confidence level is around 93.8%, while R(x̂_N(ε)) ≥ µ was obtained in 95.2% of the simulations. This shows that the bootstrap-based strategy has a good performance.

Fig. 8: Confidence levels and the out-of-sample performance of x̂_N(ε) for ε = 2ε_max^N(µ)/5, that is, R(x̂_N(ε)), for each of the 500 simulations with N = 300 and µ = 0.2.

Using real market data. We now consider real market data. The data used in this study correspond to the daily returns of 23 companies selected from the S&P 500 index, covering the time window between January 1, 2008 and June 30, 2021. In the experiments, we analyze the cumulative wealth over time using a rolling-horizon procedure with daily rebalancing: we use the data from January 1, 2008 to February 13, 2018 to estimate the portfolio vector for February 14, 2018; after that, we use the data from January 2, 2008 to February 14, 2018 to estimate the portfolio vector for February 15, 2018, and so on. In summary, this process is continued by removing the first return and adding a return to the dataset for the next period, until the end of the dataset is reached. The objective is to see how the cumulative wealth evolves over that period of time.

Fig. 9: The cumulative wealth of the trading strategies with µ = min{0.001, µ_max^N}.

In addition, we compare our approach with standard portfolio optimization techniques, namely SAA, EW, MinVar, and MaxSR. SAA has already been mentioned before; the other three are explained below.

• Equal Weight (EW): This approach gives equal weight to all assets in the portfolio.

• Minimum variance (MinVar): We use the sample covariance matrix of the returns to find this portfolio.

• Maximum Sharpe ratio (MaxSR): We use the sample mean and covariance matrix of the returns to find the portfolio that maximizes the Sharpe ratio.

Now, for a given µ, we consider the strategy with ε = ε_max^N(µ), called Wasserstein MaxFact, and the strategies with ε = ε_max^N(µ)/2 and ε = 3ε_max^N(µ)/4, called Wasserstein MaxFact/2 and Wasserstein 3MaxFact/4, respectively. The reason for including these strategies is to see the influence of ε on the portfolio value. Additionally, we consider µ = min{0.001, α µ_max^N}, where µ_max^N = µ_max^N(0) and α ≤ 1. Note that µ changes every day because it depends on µ_max^N, which depends on the sample. The rationale for this µ is that the minimum expected return acceptable for the investor is 0.001, but it is possible that this value is very ambitious, which would cause infeasibility; hence we correct it with the term α µ_max^N. Note that if α = 1 we could have ε_max^N(µ) = 0.

Figure 9 shows that the strategies based on our approach allow obtaining portfolios with high cumulative wealth. The wealth of these portfolios exceeds the values obtained with traditional strategies such as SAA, EW, MinVar, and MaxSR. Additionally, the period of time in which we evaluate the strategies includes the days of the beginning of the COVID-19 pandemic; this phenomenon affected all investment strategies, but our strategies somehow mitigate its effect on long-term portfolio value.
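The rolling-horizon procedure can be sketched as follows, reusing robust_portfolio(...) from the earlier sketch. The paper obtains ε_max^N(µ) and µ_max^N from Corollaries 1 and 2; here a crude single-asset lower bound stands in for ε_max^N(µ), and the window length and the fraction of the bound are illustrative assumptions.

```python
import numpy as np

def eps_max_lower_bound(sample, mu):
    # With a single-asset portfolio e_i, the constraint in (15) reads
    # L_i - eps >= mu, so any eps <= max_i L_i - mu keeps (15) feasible.
    # This is only a lower bound on eps_max_N(mu), used as a stand-in.
    return max(sample.mean(axis=0).max() - mu, 0.0)

def backtest(returns, eps_frac=0.4, mu_cap=0.001, alpha=1.0, window=250):
    """Rolling-horizon backtest with daily rebalancing; assumes the robust
    problem stays feasible at every rebalancing date."""
    T, _ = returns.shape
    wealth = [1.0]
    for t in range(window, T):
        sample = returns[t - window:t]               # trailing estimation window
        mu = min(mu_cap, alpha * sample.mean(axis=0).max())
        eps = eps_frac * eps_max_lower_bound(sample, mu)
        x = robust_portfolio(sample, eps, mu)        # from the sketch above
        wealth.append(wealth[-1] * (1.0 + returns[t] @ x))
    return np.array(wealth)
```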
On the other hand, Figure 10 also shows that the highest cumulative wealth is achieved by the Wasserstein MaxFact strategy, followed by MaxSR. This situation is due to the influence of α, which is 0.5 in this case. Finally, both figures show that the cumulative wealth of the Wasserstein-based strategies increases with respect to ε.

Table 1 shows some out-of-sample indicators of the different strategies. Recall that in our experiment µ depends on the sample and changes every day. The first thing to note is that the mean of all Wasserstein strategies exceeds the average value of µ. This is remarkable because the SAA approach does not achieve this even though the sample size is considerably large. It is known that if the sample size is large enough, then the SAA strategy gives a portfolio very close to the one obtained by solving (1) if the distribution of returns were known, and therefore this portfolio would satisfy the constraint E_P[⟨x, ξ⟩] ≥ µ with very high probability. However, in this case this is not achieved, because not all data in the sample come from the same distribution, which implies that the average returns obtained with the SAA strategy do not exceed the average value of µ. In contrast, the Wasserstein-based strategies manage to overcome this situation.

Fig. 11: Confidence level for the target mean return µ = min{0.001, α µ_max^N}: (a) α = 1, (b) α = 0.5.

Another important indicator is the standard deviation. The standard deviations of the Wasserstein approaches are among the largest, but the difference with respect to the SAA approach is not so significant; this becomes evident when we look at the Sharpe ratios. We also observe that the highest Sharpe ratio among the Wasserstein-based strategies is not always attained at Wasserstein MaxFact, as opposed to the mean. This agrees with what is observed in Figure 6 for simulated data, where it was evidenced that the highest Sharpe ratio is not always obtained at Wasserstein MaxFact. Finally, regarding the other indicators, we can see that the turnover increases with ε but in all cases is lower than the one obtained with SAA. On the other hand, the average number of assets that make up the portfolios of the Wasserstein-based strategies decreases with ε; compared to the other strategies, the situation differs between α = 1 (among the smallest) and α = 0.5 (among the highest). This indicator can be useful to identify the most promising companies within the portfolio. In summary, if we combine the information in Table 1 with the fact that the Wasserstein approaches produce portfolios that generate more cumulative wealth as time progresses, then the Wasserstein approaches are a good option for making investment decisions with real market data.

We also evaluate the confidence level of each of the Wasserstein-based strategies. Figure 11 shows that this confidence level is high for all Wasserstein strategies and, as expected, increases with ε. Clearly, the harder the constraint (α = 1), the lower the confidence level. Note that in this case it is not possible to determine exactly whether that confidence level is attained out of sample, because the true vector of means is not known; nevertheless, as we said before, Table 1 gives us indications that this level is satisfied in our financial market data, because the mean of the Wasserstein strategies always exceeded the average value of µ.
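Indicators like those in Table 1 are computed from the history of weights and realized returns. The sketch below assumes common conventions (daily Sharpe ratio as mean over standard deviation, turnover as the mean L1 change in weights between rebalancing dates); the paper's exact normalizations are not stated here, so these conventions are assumptions.

```python
import numpy as np

def indicators(weights, realized):
    """Out-of-sample indicators in the spirit of Table 1.
    weights  : (T, m) portfolio decided at each rebalancing date
    realized : (T, m) returns realized over the following day"""
    rets = (weights * realized).sum(axis=1)          # daily portfolio returns
    mean, std = rets.mean(), rets.std(ddof=1)
    sharpe = mean / std
    turnover = np.abs(np.diff(weights, axis=0)).sum(axis=1).mean()
    n_assets = (weights > 1e-6).sum(axis=1).mean()   # avg. active positions
    return dict(mean=mean, std=std, sharpe=sharpe,
                turnover=turnover, n_assets=n_assets)
```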
In this work we have shown that the Wasserstein distance-based approach (7) has an equivalent finite-dimensional formulation which, in some cases, is convex. Furthermore, we established theoretical results that characterize the values of µ and ε for which the Wasserstein approach (7) is valid and feasible. As future work, we want to extend the results presented here to a set of functions F and G larger than the set of Lipschitz functions with respect to ξ. Additionally, we want to explore other types of Φ functions, for example Φ given by a quantile, in order to consider the Value-at-Risk as the risk function.

We implemented our strategy in two applications: a newsvendor problem and a portfolio optimization problem. The experiments with synthetic data show that it is possible to find ε such that, out of sample, the constraint is satisfied with high probability. This is most evident for small sample sizes. In both contexts, the proposed bootstrap-based method for choosing this parameter performed well. The experiment with real market data showed that our approach attains higher expected returns, higher Sharpe ratios, and a reduction in turnover compared to the SAA approach.

We now present the proofs of the results of this work. Section A.1 presents the proofs of Lemmas 1 and 2. Section A.2 explores worst-case expectation problems with expected value constraints and the corresponding dual formulation; Theorem 4 is an important result on its own. Section A.3 presents a distributionally robust estimation of the variance under a known mean, a key result to prove Theorem 3. Section A.4 shows the proofs of Theorems 2 and 3 and their corollaries. Finally, Section A.5 contains the proofs of Propositions 1 and 2.

Proof (Lemma 1). Let (P_M)_M be a sequence of distributions such that P_M → P and P^{x,F}_M → P^{x,F} weakly as M goes to ∞. By Corollary 6.11 in [34] we have that

$$W_p\big(\mathbb{P}^{x,F}, \widehat{\mathbb{P}}^{x,F}_N\big) \le \liminf_{M \to \infty} W_p\big(\mathbb{P}^{x,F}_M, \widehat{\mathbb{P}}^{x,F}_N\big). \tag{16}$$

Additionally, for each M, pushing any coupling of P_M and P̂_N forward through the map (ξ, ζ) ↦ (F(x, ξ), F(x, ζ)) and using Assumption 1, we get that

$$W_p\big(\mathbb{P}^{x,F}_M, \widehat{\mathbb{P}}^{x,F}_N\big) \le \gamma_{x,F} \, W_p\big(\mathbb{P}_M, \widehat{\mathbb{P}}_N\big).$$

Therefore, by (16) we conclude that W_p(P^{x,F}, P̂^{x,F}_N) ≤ γ_{x,F} W_p(P, P̂_N) ≤ εγ_{x,F}, that is, P^{x,F} ∈ B_{εγ_{x,F}}(P̂^{x,F}_N). The argument for P^{x,G} is identical.

Proof (Lemma 2). Let ε > 0 be such that P ∈ B_ε(P̂_N) and (7) is feasible. By Lemma 1, we have that P^{x̂_N(ε),G} ∈ B_{εγ_{x̂_N(ε),G}}(P̂^{x̂_N(ε),G}_N). Since x̂_N(ε) is feasible for (7), the infimum of E_Q[ζ] over this ball is at least µ; therefore, we conclude that E_P[G(x̂_N(ε), ξ)] ≥ µ.

When we refer to worst-case expectation problems with restrictions on the expected value, we are referring to problems of the form

$$\sup_{\mathbb{Q} \in B_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}_{\mathbb{Q}}[h(\xi)] \quad \text{s.t.} \quad \mathbb{E}_{\mathbb{Q}}[g_i(\xi)] = b_i, \;\; i = 1, \dots, k, \tag{17}$$

where b_i ∈ R and g_1, ..., g_k are integrable functions with respect to each measure in B_ε(P̂_N). This problem is important for the following section and to prove Theorem 5.

Theorem 4. Suppose that at least one of the following conditions holds: i) problem (17) is feasible; ii) the set of optimal distributions of (17) is nonempty and bounded. Then (17) satisfies strong duality; that is, the optimal value of (17) is equal to

$$\min_{a \in \mathbb{R}^k} \; \left\{ \sup_{\mathbb{Q} \in B_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}_{\mathbb{Q}}\Big[ h(\xi) - \sum_{i=1}^k a_i g_i(\xi) \Big] + \sum_{i=1}^k a_i b_i \right\}. \tag{20}$$

Proof. The strategy is to characterize (17) as a Linear Conic Problem (LCP) and then apply the strong duality results from [29]. First, we consider P(Ξ) as the set of non-negative probability measures on the measurable space (Ξ, E) such that h, g_1, g_2, ..., g_k are integrable with respect to each measure in P(Ξ), where Ξ is the support of ξ and E is a σ-algebra that contains all singletons of Ξ, i.e., {ξ} ∈ E for all ξ ∈ Ξ. Also, we denote by X the linear space (over R) of signed measures generated by P(Ξ). Additionally, we define X' as the linear space of functions Ξ → R generated by h, g_1, ..., g_k, that is, the functions that can be expressed as a linear combination of h, g_1, ..., g_k. We define a bilinear form between X and X' given by

$$\langle \psi, \mathbb{Q} \rangle_X := \int_\Xi \psi \, d\mathbb{Q}, \qquad \psi \in X', \; \mathbb{Q} \in X.$$

Moreover, we consider Y = R^{k+1} and Y* as the dual space of Y; note that Y* is equal to R^{k+1}.
Also, ⟨·, ·⟩_Y is the traditional bilinear form between a space and its dual space, in this case the Euclidean inner product in R^{k+1}. Continuing with the characterizations, we set K = {0}, where 0 is the zero vector of R^{k+1}, and let G be the convex cone generated by B_ε(P̂_N). Because B_ε(P̂_N) is convex, we have G = ∪_{λ>0} λ B_ε(P̂_N) (see [4]). In addition, we define b̄ as the vector in R^{k+1} such that b̄_i = b_i for each i = 1, ..., k and b̄_{k+1} = 1, and we define ψ_i = g_i for i = 1, ..., k and ψ_{k+1}(x) = 1 for all x ∈ Ξ. Finally, we define the linear map A : X → Y given by A(Q) = (⟨ψ_1, Q⟩_X, ..., ⟨ψ_k, Q⟩_X, ⟨ψ_{k+1}, Q⟩_X). Taking into account these characterizations, we obtain an LCP, which we denote (18). However, because G = ∪_{λ>0} λ B_ε(P̂_N) and ⟨1, Q⟩_X = 1 for all Q ∈ B_ε(P̂_N), the problem (18) is equal to (17). Therefore, by conditions i) and ii) and Theorem 2.8 in [29], we have that (18) satisfies strong duality, with dual formulation (19), where A* : Y* → X* is the adjoint map of A and G* and K* are the polar cones of G and K, respectively. Because B_ε(P̂_N) is convex, G* can be expressed in terms of the worst-case expectation over B_ε(P̂_N); moreover, note that -K* = R^{k+1}. Therefore, (19) can be expressed as (20). □

Using Theorem 1 we can further reformulate (17) as a semi-infinite optimization problem.

Corollary 3. If the function ξ ↦ h(ξ) - Σ_{i=1}^k a_i g_i(ξ) satisfies the hypotheses of Theorem 1 for all a ∈ R^k, and (17) satisfies any of the conditions i) and ii) of Theorem 4, then the problem (17) can be rewritten as

$$\min_{a \in \mathbb{R}^k, \, \lambda \ge 0} \; \left\{ \lambda \varepsilon^p + \sum_{i=1}^k a_i b_i + \frac{1}{N} \sum_{j=1}^N \sup_{\xi \in \Xi} \Big[ h(\xi) - \sum_{i=1}^k a_i g_i(\xi) - \lambda \, d(\xi, \xi_j)^p \Big] \right\}.$$

A.3 Distributionally robust estimation of the variance of a random variable with known mean

In this part, we formulate a distributionally robust version of the problem of estimating the variance of a random variable with known mean, and we show that the resulting optimization problem admits an explicit solution. Let ζ be a random variable with unknown distribution P supported on Ξ ⊆ R, and assume that the expected value of ζ is known; specifically, E_P[ζ] = η. Also, consider a sample ζ_1, ..., ζ_N of ζ. Let P̂_N := (1/N)Σ_{i=1}^N δ_{ζ_i} be the empirical distribution, and denote by ζ̄ := (1/N)Σ_{i=1}^N ζ_i and σ̂² := (1/N)Σ_{i=1}^N (ζ_i - ζ̄)² the empirical mean and variance, respectively. Let B_ε(P̂_N) be the 2-Wasserstein ball with center P̂_N and radius ε. We call the following problem the distributionally robust estimate of the variance of ζ:

$$\sup_{\mathbb{Q} \in B_\varepsilon(\widehat{\mathbb{P}}_N)} \mathrm{Var}_{\mathbb{Q}}[\zeta] \quad \text{s.t.} \quad \mathbb{E}_{\mathbb{Q}}[\zeta] = \eta. \tag{21}$$

However, for some values of ε this problem may be infeasible, as the following result shows.

Proposition 3. If ε < |η - ζ̄|, then no distribution Q ∈ B_ε(P̂_N) satisfies E_Q[ζ] = η, and hence (21) is infeasible.

Proof. Let Q ∈ B_ε(P̂_N); we must show that E_Q[ζ] ≠ η. Indeed, by Observation 6.6 in [34] we know that if p ≤ q then W_p ≤ W_q. In particular, we have that W_1 ≤ W_2, and this implies that W_1(Q, P̂_N) ≤ ε < |η - ζ̄|. Therefore, defining S(Q, P̂_N) as the set of couplings between Q and P̂_N, there exists Π ∈ S(Q, P̂_N) such that ∫_{Ξ×Ξ} |ζ - δ| Π(dζ, dδ) < |η - ζ̄|. We also have

$$\big| \mathbb{E}_{\mathbb{Q}}[\zeta] - \bar{\zeta} \big| = \left| \int_{\Xi \times \Xi} (\zeta - \delta) \, \Pi(d\zeta, d\delta) \right| \le \int_{\Xi \times \Xi} |\zeta - \delta| \, \Pi(d\zeta, d\delta) < |\eta - \bar{\zeta}|.$$

From the above and the reverse triangle inequality it follows that |E_Q[ζ] - η| ≥ |η - ζ̄| - |E_Q[ζ] - ζ̄| > 0. This allows us to conclude that E_Q[ζ] ≠ η. □

The following theorem establishes an explicit expression for the optimal value of the optimization problem (21).

Theorem 5. Let ε > 0 with ε² ≥ (ζ̄ - η)², and Ξ = R. Then the optimal value of (21) is equal to

$$\left( \widehat{\sigma} + \sqrt{\varepsilon^2 - (\bar{\zeta} - \eta)^2} \right)^2. \tag{22}$$
Proof. By Theorem 4 we have that (21) satisfies strong duality, and its optimal value is equal to

$$\inf_{\beta \in \mathbb{R}} \; \sup_{\mathbb{Q} \in B_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}_{\mathbb{Q}}\big[ (\zeta - \eta)^2 - \beta \zeta + \beta \eta \big]. \tag{23}$$

Note that ζ ↦ (ζ - η)² - βζ + βη satisfies the hypotheses of Corollary 3; therefore this last formulation is equivalent to the semi-infinite optimization program

$$\inf_{\beta \in \mathbb{R}, \, \lambda \ge 0} \; \left\{ \lambda \varepsilon^2 + \frac{1}{N} \sum_{i=1}^N \sup_{\varphi \in \mathbb{R}} \big[ (\varphi - \eta)^2 - \beta(\varphi - \eta) - \lambda (\varphi - \zeta_i)^2 \big] \right\}. \tag{24}$$

If λ < 1, then λ is not optimal for (24) because, in this case, the inner supremum is unbounded. On the other hand, if λ > 1, the inner supremum can be calculated explicitly because the function is a concave quadratic polynomial: the unique maximum is attained at φ_i* = (2η + β - 2λζ_i)/(2 - 2λ), and (24) is equivalent to a problem (25) in the variables λ and β. This problem can be simplified by analyzing the objective function with respect to λ. For a fixed β ∈ R, first note that the objective goes to infinity as λ → 1⁺ or λ → ∞. Its second derivative in λ is, up to a positive factor, a quadratic polynomial in β whose discriminant is (η - ζ̄)² - (1/N)Σ_{i=1}^N (ζ_i - η)²; as a consequence of the Cauchy-Schwarz inequality this discriminant is always negative, which implies that the polynomial is always positive. Therefore, the objective function in (25) is convex in λ and has a unique minimum on the region λ > 1. Since the objective function is differentiable for all β ∈ R, after some calculations we obtain that the infimum is attained at

$$\lambda^* = 1 + \frac{\widehat{\sigma}}{\sqrt{\varepsilon^2 - (\bar{\zeta} - \eta)^2}}, \qquad \beta^* = 2(\bar{\zeta} - \eta)\left( 1 + \frac{\widehat{\sigma}}{\sqrt{\varepsilon^2 - (\bar{\zeta} - \eta)^2}} \right),$$

and the optimal value of (21) is (σ̂ + √(ε² - (ζ̄ - η)²))². □

Before proceeding with the proofs of Theorems 2 and 3, we need the following lemma, which allows us to express the feasible set of problem (7) in terms of finite-dimensional variables.

Lemma 3. Assume the same setting as in the previous sections. Then the worst-case expectations

$$\sup_{\mathbb{Q} \in B_{\varepsilon\gamma_{x,F}}(\widehat{\mathbb{P}}^{x,F}_N)} \mathbb{E}_{\mathbb{Q}}[\zeta] \quad \text{and} \quad \inf_{\mathbb{Q} \in B_{\varepsilon\gamma_{x,G}}(\widehat{\mathbb{P}}^{x,G}_N)} \mathbb{E}_{\mathbb{Q}}[\zeta]$$

admit closed-form expressions: in the setting of Theorem 2, case 1, they are the sample means shifted by the radius and truncated at the endpoints of the corresponding image intervals, and in case 2 they are the sample means shifted by the radius.

Proof. We only show the first equality of each case, since the second is analogous. Equality (28) is guaranteed by Von Neumann's minimax theorem (see [2]); from it we obtain the expression for the infimum over the ball. In case 2 the same argument, without the truncation terms, gives the plain mean shift. □

Proof (Theorem 2). Let X̃ be the feasible set of (7). By Lemma 3-1, membership in X̃ can be written as the finite-dimensional constraint appearing in (8). Therefore, (7) is equivalent to minimizing the worst-case objective over X̃; applying Lemma 3-1 again to the objective yields (8). Analogously, to prove (9), we use Lemma 3-2. □

Proof (Theorem 3). Let X̃ be the feasible set of (7). By Lemma 3-1, X̃ admits the same finite-dimensional description as before. Therefore, by Proposition 3, (7) is equivalent to a problem (29) whose inner maximization, which has a variance as its objective function, ranges over the admissible values of the mean η. By Theorem 5, this inner maximization problem can be rewritten, so that (29) is equivalent to a problem (30). But note that the internal maximization problem of (30), over η, can be explicitly solved: it attains its optimal value at η* = (1/N)Σ_{i=1}^N ζ^{x,F}_i. Therefore, (30) can be rewritten as (12), an optimization problem over x ∈ X. □

A.5 Proofs of Propositions 1 and 2.

Proof (Proposition 1). According to Theorem 2-1, problem (13) is equivalent to a program that, after introducing auxiliary variables z_i playing the role of min{ξ_i, x}, becomes the linear program (14) with constraints

$$z_i \le \xi_i \;\; \forall i = 1, \dots, N, \qquad z_i \le x \;\; \forall i = 1, \dots, N, \qquad x \ge 0,$$

together with the robustified expected-value constraint. □

Proof (Proposition 2). According to Theorem 3, problem (7) is equivalent to the corresponding instance of (12) with the additional constraint Σ_{i=1}^m x_i = 1, x ∈ X. Note that (1/N)Σ_{i=1}^N ⟨x, ξ_i⟩ = L^T x and that the sample variance of ⟨x, ξ_1⟩, ..., ⟨x, ξ_N⟩ is x^T E x. Therefore, (15) follows. □
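As a numeric cross-check of Theorem 5, the following sketch minimizes the dual (24) directly, with the inner supremum evaluated in closed form at φ_i*, and compares the result with the stated optimal value. The sample and the values of η and ε are illustrative, chosen so that ε² ≥ (ζ̄ - η)² holds with high probability.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
zeta = rng.normal(1.0, 2.0, size=50)
zbar, sig = zeta.mean(), zeta.std()        # empirical mean and std (1/N convention)
eta, eps = 1.3, 1.0                         # known mean and radius

def dual(v):                                # objective of (24)
    lam, beta = v
    if lam <= 1.0:                          # inner sup is +inf for lam <= 1
        return np.inf
    phi = (2*eta + beta - 2*lam*zeta) / (2 - 2*lam)   # per-atom maximizer phi_i*
    inner = (phi - eta)**2 - beta*(phi - eta) - lam*(phi - zeta)**2
    return lam*eps**2 + inner.mean()

res = minimize(dual, x0=[2.0, 0.0], method="Nelder-Mead")
closed = (sig + np.sqrt(eps**2 - (zbar - eta)**2))**2
print(res.fun, closed)                      # the two values should agree
```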
References

[1] Conservative stochastic optimization with expectation constraints.
[2] Convex Optimization Theory.
[3] Quantifying distributional model risk via optimal transport.
[4] Convex optimization.
[5] Mean-variance analysis of basic inventory models.
[6] Mean-variance analysis for the newsvendor problem.
[7] The effect of errors in means, variances and covariances on optimal portfolio choice.
[8] Distributionally robust optimization under moment uncertainty with application to data-driven problems.
[9] Optimization with stochastic dominance constraints.
[10] An Introduction to the Bootstrap.
[11] Worst-case value-at-risk and robust portfolio optimization: a conic programming approach.
[12] Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations.
[13] Data-driven chance constrained stochastic program.
[14] Distributionally robust Monte Carlo simulation.
[15] Algorithms for stochastic optimization with function or expectation constraints.
[16] Designing a hierarchical decentralized system for distributing large-scale, cross-sector, and multipollutant control accountabilities.
[17] Adjusted robust mean-value-at-risk model: less conservative robust portfolios.
[18] Robust VaR and CVaR optimization under joint ambiguity in distributions, means, and covariances.
[19] Decomposition algorithm for distributionally robust optimization using Wasserstein metric with an application to a class of regression models.
[20] Portfolio selection.
[21] Chance constrained programming with joint constraints.
[22] Stochastic gradient made stable: a manifold propagation approach for large-scale optimization.
[23] Tractable robust expected utility and risk models for portfolio optimization.
[24] Robust mean-covariance solutions for stochastic optimization.
[25] Neyman-Pearson classification, convexity and stochastic constraints.
[26] Optimization of conditional value-at-risk.
[27] A price-setting newsvendor problem under mean-variance criteria.
[28] A min-max solution of an inventory problem.
[29] Semi-Infinite Programming. Nonconvex Optimization and Its Applications.
[30] Worst-case distribution analysis of stochastic programs.
[31] Minimax analysis of stochastic problems.
[32] Convergence analysis for distributionally robust optimization and equilibrium problems.
[33] Markov processes over denumerable products of spaces describing large systems of automata.
[34] Optimal transport: old and new.
[35] Likelihood robust optimization for data-driven problems.
[36] Robust trade-off portfolio selection.
[37] Penalized stochastic gradient methods for stochastic convex optimization with expectation constraints.
[38] Supply chains involving a mean-variance-skewness-kurtosis newsvendor: analysis and coordination.
[39] Robust portfolio optimization with derivative insurance guarantees.