key: cord-0680242-0mdybbqo authors: Zhu, Ying; Mirzaei, Mozhgan title: Ordinary differential equations (ODE): metric entropy and nonasymptotic theory for noisy function fitting date: 2020-11-23 journal: nan DOI: nan sha: 5be4fb59eed1817024151335fd2b42bd3e639adc doc_id: 680242 cord_uid: 0mdybbqo

This paper establishes novel results on the metric entropy of ODE solution classes. In addition, we establish a nonasymptotic theory concerning noisy function fitting for nonparametric least squares and least squares based on Picard iterations. Our results on the metric entropy provide answers to "how do the degree of smoothness and the 'size' of a class of ODEs affect the 'size' of the associated class of solutions?" We establish a general upper bound on the covering number of solution classes associated with the higher order Picard type ODEs, $y^{(m)}(x) = f(x,\, y(x),\, y'(x),\, \ldots,\, y^{(m-1)}(x))$. This result implies that the covering number of the underlying solution class is (basically) bounded from above by the covering number of the class $\mathcal{F}$ that $f$ ranges over. This general bound (basically) yields a sharp scaling when $f$ is parameterized by a $K$-dimensional vector of coefficients belonging to a ball, and the noisy recovery is essentially no more difficult than estimating a $K$-dimensional element in the ball. For $m=1$, when $\mathcal{F}$ is an infinite-dimensional smooth class, the solution class ends up with derivatives whose magnitude grows factorially fast -- "a curse of smoothness". We introduce a new notion called the "critical smoothness parameter" to derive an upper bound on the covering number of the solution class. When the sample size is large relative to the degree of smoothness, the rate of convergence associated with the noisy recovery problem obtained by applying this "critical smoothness parameter" based approach improves the rate obtained by applying the general upper bound on the covering number (and vice versa).

Differential equations enjoy a long-standing history in mathematics and have numerous applications in science and engineering. Differential equations also play a crucial role in social science and business, for example, the famous Solow growth model (which is a first order ODE) in economics (see, Barro and Sala-i-Martin, 2004), and the famous Bass product diffusion model (also a first order ODE) in marketing (see, Bass, 1969, 2004). Since the COVID-19 pandemic, much attention has been given to the compartmental models in epidemiology for predicting the spread of infectious diseases (see, Vynnycky and White, 2010). Statisticians have proposed useful methods for inference of structural parameters in differential equations and addressed important computational challenges (see, e.g., Ramsay et al., 2007; Campbell et al., 2012). In contrast to the existing statistical literature on differential equations, this paper focuses on the metric entropy of ODE solution classes and a nonasymptotic theory for noisy function fitting problems, both of which, to the best of our knowledge, have not been examined in the literature. Results on the metric entropy of ODE solution classes are crucial for building an empirical process theory for differential equations. We choose to focus on ODEs in this paper as this is a natural and necessary first step before one delves into more complicated differential equations (such as partial differential equations).
Even in ODEs, there are plenty of theoretical challenges, and we are able to develop many new perspectives. Moreover, because of the deep connections between ODEs and contraction mapping, understanding the "size" of ODE solution classes could pave the way for understanding the "size" of systems that involve fixed points (such as Markov decision processes and game theoretic models in economics). When talking about the theory of ODEs, one must mention the Picard-Lindelöf theorem, a central result in ODEs. This theorem hinges on a Lipschitz continuity condition, which ensures the existence of a solution to an initial value problem and the uniqueness of this solution. In this paper, we consider ODEs in the form $y^{(m)}(x) = f(x,\, y(x),\, y'(x),\, \ldots,\, y^{(m-1)}(x))$. (1) A natural question to ask is: how do the degree of smoothness and the "size" of a class of $f$ affect the "size" of the associated class of $y$ in (1)? The answers to this question turn out to be quite involved, and the analyses require substantial effort, as our paper suggests. By letting $f$ in (1) range over various function classes, we establish novel results on the metric entropy of the associated solution classes $\mathcal{Y}$. Most of our theory is built upon or expands the smoothness structure in the Picard-Lindelöf theorem, although we also provide a couple of results that do not require such a structure.

We also establish a nonasymptotic theory concerning the following noisy function fitting/recovery problem: $Y_i = y^*(x_i) + \varepsilon_i$, $i = 1, \ldots, n$, (2) where $\{Y_i\}_{i=1}^n$ are response variables and $\{x_i\}_{i=1}^n$ is a collection of fixed design points (both observed), the unobserved noise terms $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. draws from $N(0, \sigma^2)$, and $y^*(\cdot) \in \mathcal{Y}$. We consider the so-called nonparametric least squares estimator $\hat{y} \in \arg\min_{\tilde{y} \in \mathcal{Y}} \sum_{i=1}^n (Y_i - \tilde{y}(x_i))^2$ and establish upper bounds on the prediction error $\mathbb{E}[\frac{1}{n}\sum_{i=1}^n (\hat{y}(x_i) - y^*(x_i))^2]$. Besides its theoretical importance, we point out that a Picard-type Lipschitz condition can be quite useful for practical implementations, as it paves the way for computing an approximate solution to an initial value problem in an iterative manner. If one can solve for $y^*(x)$ in an explicit form, nonlinear least squares and maximum likelihood procedures can obviously be applied. However, in many ODEs (even when the ODE is parameterized by a finite vector of coefficients), the solutions may be in implicit forms. In the situation where a good estimator of the initial value is available, one can exploit the Picard iteration to circumvent this issue. We propose estimation strategies based on Picard iterations for noisy recovery of $y^*(\cdot) \in \mathcal{Y}$ associated with first order ODEs ranging over both parametric and nonparametric classes, and establish upper bounds for the estimation errors of these strategies. We also discuss the estimation of initial values with local polynomials.

Below we highlight the novelty in some of our results. We establish a general upper bound on the covering number of solution classes associated with the higher order Picard type ODEs. This result implies that the covering number of the underlying solution class $\mathcal{Y}$ is bounded from above by the covering number of the class $\mathcal{F}$ that $f$ ranges over, as long as $\mathcal{F}$ satisfies a Lipschitz condition with respect to the $y, \ldots, y^{(m-1)}$ coordinates (which we will refer to as the Picard type Lipschitz condition). If the initial values are fixed [2], this general bound yields a sharp scaling when $f$ is parameterized by a $K$-dimensional vector of coefficients belonging to a ball and further satisfies a Lipschitz condition with respect to the coefficients.
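To make the setup concrete, the following is a minimal sketch of the recovery problem (2) for a toy parametric family. The choice of ODE $y'(x) = \theta^* y(x)$ (with explicit solution $y(x) = y_0 e^{\theta^* x}$), the coefficient grid, and all constants are illustrative assumptions, not part of the paper's theory; the estimator simply mimics a least squares program of the kind in (15).

```python
# Minimal sketch of the noisy recovery problem (2): Y_i = y*(x_i) + eps_i.
# Toy ODE y'(x) = theta* y(x) with explicit solution y(x) = y0 exp(theta* x);
# the estimator is least squares over theta, with y0 profiled out in closed form.
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.1
theta_star, y0_star = 0.5, 1.0

x = np.linspace(0.0, 1.0, n)               # fixed design points on [0, 1]
y_star = y0_star * np.exp(theta_star * x)  # true solution of y' = theta* y
Y = y_star + sigma * rng.standard_normal(n)

best = (np.inf, None, None)
for th in np.linspace(-1.0, 1.0, 401):     # grid over the coefficient ball
    p = np.exp(th * x)
    y0 = (Y @ p) / (p @ p)                 # y0 minimizing the RSS for this theta
    rss = np.sum((Y - y0 * p) ** 2)
    if rss < best[0]:
        best = (rss, th, y0)

_, theta_hat, y0_hat = best
pred_err = np.mean((y0_hat * np.exp(theta_hat * x) - y_star) ** 2)
print(f"theta_hat={theta_hat:.3f}, y0_hat={y0_hat:.3f}, prediction error={pred_err:.2e}")
```

With $K = 1$ here, the prediction error indeed behaves like the parametric scaling $\sigma^2 K / n$ discussed below.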
As a result of this scaling, the noisy recovery is essentially no more difficult than estimating a $K$-dimensional element in the ball. When $\mathcal{F}$ is an infinite-dimensional smooth class, the rate of convergence associated with the noisy recovery problem obtained by applying the general bound on the covering number can be improved when the sample size $n$ is large relative to the degree of smoothness. In the paper we focus on the first order ODE $y'(x) = f(x, y(x))$ to showcase the improvement. In particular, when $\mathcal{F}$ is the standard smooth class of degree $\beta + 1$ (where the absolute values of all partial derivatives of $f \in \mathcal{F}$ are bounded from above by 1 and the $\beta$th derivative of $f$ is 1-Lipschitz [3]), we show that the solution class $\mathcal{Y}$ is a subset of a variant smooth class $S^{\dagger}_{\beta+2}$, where the absolute value of the $k$th derivative of $y \in \mathcal{Y}$ is bounded from above by $2^{k-1}(k-1)!$ for $k = 1, \ldots, \beta+1$ and the $(\beta+1)$th derivative of $y$ is $\rho_\beta$-Lipschitz with $\rho_\beta = 2^{\beta+1}(\beta+1)!$.

Given the $\rho_\beta$-Lipschitz continuity of the $(\beta+1)$th derivative of $y \in \mathcal{Y}$, existing results (e.g., Wainwright, 2019) would suggest the rate $\rho_\beta^{\frac{2}{2(\beta+2)+1}} (\frac{\sigma^2}{n})^{\frac{2(\beta+2)}{2(\beta+2)+1}}$. (4) The minimax lower bounds in Tsybakov (2009) also imply the same scaling $\rho_\beta^{\frac{2}{2(\beta+2)+1}} (\frac{\sigma^2}{n})^{\frac{2(\beta+2)}{2(\beta+2)+1}}$. When $\beta$ is small relative to $\frac{n}{\sigma^2}$, the bound $\rho_\beta^{\frac{2}{2(\beta+2)+1}} (\frac{\sigma^2}{n})^{\frac{2(\beta+2)}{2(\beta+2)+1}}$ is clearly useful. However, if $\beta$ is large enough such that $\rho_\beta^{\frac{2}{2(\beta+2)+1}} (\frac{\sigma^2}{n})^{\frac{2(\beta+2)}{2(\beta+2)+1}} \to \infty$, the lower bound result suggests that there exist no consistent estimators in the worst case. Classical results like the above do not deliver a useful bound in this case because they concern a bigger class than $S^{\dagger}_{\beta+2}$ and fail to take into account the special structure of $S^{\dagger}_{\beta+2}$. Indeed, for estimations of functions in $S^{\dagger}_{\beta+2}$, one can easily obtain a rate much better than $\infty$ even if $\rho_\beta^{\frac{2}{2(\beta+2)+1}} (\frac{\sigma^2}{n})^{\frac{2(\beta+2)}{2(\beta+2)+1}} \to \infty$. To see this, note that the first derivative of $h \in S^{\dagger}_{\beta+2}$ is 2-Lipschitz and the optimal rate concerning estimations of twice smooth functions scales as $n^{-4/5}$, for which consistent estimators apparently exist in the worst case. What makes the class $S^{\dagger}_{\beta+2}$ interesting is that it exhibits a "curse of smoothness" as $\beta$ increases because of the factorial growth in the derivatives.

[2] Our results on the covering numbers include an additional term related to the "size" of the sets where initial values belong when they are not fixed. When they are fixed, this term is zero. [3] We say that a function $g$ on $[0, 1]^d$ is $L$-Lipschitz if $|g(z) - g(z')| \le L|z - z'|_\infty$ for all $z, z'$, where $|\cdot|_\infty$ is the $l_\infty$-norm.

Motivated by this observation, we introduce a new notion called the "critical smoothness parameter" to derive an upper bound on the covering number of $S^{\dagger}_{\beta+2}$. The resulting upper bound along with the aforementioned general upper bound for the covering number of $\mathcal{Y}$ allows us to obtain the following rate: $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$, (5) where the non-negative integer $\gamma^* \in [0, \beta]$ is the critical smoothness parameter (whose derivation is detailed in Section 2). In theory, our analysis suggests that $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$ in (5) may be improved even further, but an analytical form of this sharper bound is difficult to obtain. Nevertheless, the rate $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$ can be smaller than the rate $(\frac{\sigma^2}{n})^{\frac{\beta+1}{\beta+2}}$. Obviously, the rate $(\frac{\sigma^2}{n})^{\frac{\beta+1}{\beta+2}}$ improves upon $(\frac{\sigma^2}{n})^{\frac{4}{5}}$ only when $\beta > 3$. We also show that $\gamma^* = \beta$ when $\frac{n}{\sigma^2}$ is large enough relative to $\beta$. In this case, (5) becomes $(\frac{\sigma^2}{n})^{\frac{2(\beta+2)}{2(\beta+2)+1}}$. On the other hand, when $\beta$ is large enough relative to $\frac{n}{\sigma^2}$, the rate $(\frac{\sigma^2}{n})^{\frac{\beta+1}{\beta+2}}$ can be smaller than the rate $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$. Obviously, as we mentioned before, if $\beta$ is much smaller than $\frac{n}{\sigma^2}$, the classical bound (4) can also be useful.
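The "curse of smoothness" can be seen numerically. The sketch below evaluates the classical bound (4) (up to constants) as $\beta$ grows at a fixed sample size; the specific values of $n$, $\sigma^2$, and $\beta$ are illustrative choices. The $\rho_\beta$ factor eventually overwhelms the shrinking $(\sigma^2/n)$ factor, while the exponent-only rate stays harmless.

```python
# The classical rate rho_beta^(2/(2(beta+2)+1)) * (sigma^2/n)^(2(beta+2)/(2(beta+2)+1))
# from (4), with rho_beta = 2^(beta+1) (beta+1)!: for fixed n it eventually diverges
# as beta grows, while the (sigma^2/n) factor alone does not.
from math import lgamma, log, exp

n, sigma2 = 10_000, 1.0
for beta in [5, 100, 10**4, 10**5, 10**6]:
    log_rho = (beta + 1) * log(2.0) + lgamma(beta + 2)   # log of 2^(b+1) * (b+1)!
    expo = 2 * (beta + 2) / (2 * (beta + 2) + 1)
    classical = exp(2 * log_rho / (2 * (beta + 2) + 1) + expo * log(sigma2 / n))
    exponent_only = exp(expo * log(sigma2 / n))
    print(f"beta={beta:>7}: classical bound ~ {classical:.3e}, "
          f"without the rho_beta factor ~ {exponent_only:.3e}")
```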
We also consider a special case of (3) where its right-hand-side has a "single index" structure: $y'(x) = f(y(x))$. (6) In the ODE literature, this structure is referred to as an autonomous system while (3) is referred to as a nonautonomous system. In particular, when $\mathcal{F}$ is a standard smooth class of degree $\beta + 1$, we show that the solution class $\mathcal{Y} \subseteq AS^{\dagger}_{\beta+2}$, where the absolute value of the $k$th derivative of $y \in \mathcal{Y}$ is bounded from above by $(k-1)!$ for $k = 1, \ldots, \beta+1$ and the $(\beta+1)$th derivative of $y$ is $(\beta+1)!$-Lipschitz. These factorial bounds are tight in the case of (6). Like in the nonautonomous system, we derive an upper bound on the covering number of $AS^{\dagger}_{\beta+2}$ using the approach of "critical smoothness parameter". This result along with the general upper bound for the covering number of $\mathcal{Y}$ allows us to obtain the following rate: $\min\{(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}},\, (\frac{\sigma^2}{n})^{\frac{2(\beta+1)}{2(\beta+1)+1}}\}$. (7) Despite the tight factorial bounds and the "single index" structure in the autonomous system, we find that the critical smoothness parameters $\gamma^*$ associated with (7) and (5) can coincide if $\frac{n}{\sigma^2}$ is large relative to $\beta$. Consequently, when the rate $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$ is smaller than $(\frac{\sigma^2}{n})^{\frac{2(\beta+1)}{2(\beta+1)+1}}$, bounds (5) and (7) exhibit the same scaling. On the other hand, in situations where $(\frac{\sigma^2}{n})^{\frac{2(\beta+1)}{2(\beta+1)+1}}$ improves upon $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$ in (7) and $(\frac{\sigma^2}{n})^{\frac{\beta+1}{\beta+2}}$ improves upon $(\frac{\sigma^2}{n})^{\frac{2(\gamma^*+2)}{2(\gamma^*+2)+1}}$ in (5) (as a result of small enough $\frac{n}{\sigma^2}$ and large enough $\beta$), we can clearly see the difference in the upper bounds for the prediction errors between the autonomous system and the nonautonomous system.

While most of our theoretical results are built upon a Picard-type Lipschitz condition, we provide an oracle inequality for the prediction errors without requiring (the true) $f^*$ in the ODE $y^{*\prime}(x) = f^*(x, y^*(x))$ to be Lipschitz continuous with respect to the $y^*$ coordinate. In addition, we provide results on the covering number of separable first order ODEs with monotonic $f$, as well as the VC dimension of ODEs with polynomial potential functions.

Notation. We write $f(n) \lesssim g(n)$ if $f(n) \le c' g(n)$ for a universal constant $c' \in (0, \infty)$; and $f(n) \asymp g(n)$ means that $f(n) \lesssim g(n)$ and $g(n) \lesssim f(n)$ hold simultaneously. As a general rule for this paper, the various $c$ and $C$ constants (all $\ge 1$) denote positive universal constants that are independent of the sample size $n$ and the smoothness parameter $\beta$, and may vary from place to place.

General definitions. Given a set $\mathbb{T}$, a set $\{t^1, t^2, \ldots, t^N\} \subset \mathbb{T}$ is called a $\delta$-cover of $\mathbb{T}$ with respect to a metric $\rho$ if for each $t \in \mathbb{T}$, there exists some $i \in \{1, \ldots, N\}$ such that $\rho(t, t^i) \le \delta$. The cardinality of the smallest $\delta$-cover is denoted by $N_\rho(\delta; \mathbb{T})$, namely, the $\delta$-covering number of $\mathbb{T}$. For example, $N_\infty(\delta, \mathcal{F})$ denotes the $\delta$-covering number of a function class $\mathcal{F}$ with respect to the supremum metric $|\cdot|_\infty$. Let $\mathcal{V}$ be a class of binary-valued functions. The class $\mathcal{V}$ shatters the collection of points $(w_1, \ldots, w_d) =: w_1^d$ if the cardinality of $\mathcal{V}(w_1^d) := \{(v(w_1), \ldots, v(w_d)) : v \in \mathcal{V}\}$ is $2^d$. The VC dimension is defined as the largest $d$ such that there exists some collection $w_1^d$ which can be shattered by $\mathcal{V}$.

We begin by reviewing a number of classical existence and/or uniqueness results about first order ODEs. These results can certainly be extended to higher order ODEs but would require us to introduce extra notation; therefore, we refer the readers to Coddington and Levinson (1955). Consider the initial value problem $y'(x) = f(x, y(x))$ with $y(0) = y_0$. (8) Theorem III (Local existence and uniqueness).
Assume f (x, y) is continuous on Λ and satisfies Then for some α ∈ (0, a], there is a unique solution to (8) where C max = sup x∈ [0, α] {exp (x) [1 + x 0 exp (−s) ds]} with α = min {1, b} and Y is the class of solutions (to (9) with θ ∈ B q (1)) on [0, α]. for all y 0 such that |y 0 | ≤ C 0 . By the Cauchy-Peano Existence Theorem (Theorem I) or the Picard Existence Theorem (Theorem II), there exists a solution to every ODE in the form of (9) on [0, α] such that α = min {1, b}. We will use this definition of α throughout the results in this subsection and subsections 2.2-2.3. In all these results, the existence of a solution to their underlying ODE is guaranteed by Theorem I or Theorem II. In (12) , the part "K log 1 + 2CmaxL K, q δ " is related to the "size" of B q (1) that θ ranges over, and the part "log C 0 Cmax δ + 1 " is related to the "size" of [−C 0 , C 0 ] that the initial value y 0 ranges over. The part "log C 0 Cmax δ + 1 " reveals an interesting feature of differential equations: even if the equation is fixed and known (for example, y ′ (x) = y (x) with solutions y (x) = c * e x ), the solution class is still not a singleton (and has infinitely many solutions) unless a fixed initial value is given. As a consequence, in the simple example y ′ = y, the noisy recovery of a solution to this ODE still requires the estimation of c * . 5 The result in Proposition 2.1 implies that if the class f ranges over is a "parametric" class, then the associated solution class Y also behaves like a "parametric" class. Suppose the class of ODEs (9) has a fixed initial value. Then the term "log C 0 Cmax δ + 1 " can be dropped while the scaling of "K log 1 + Our next result concerns the estimation of the following ODE where θ * ∈ B ∞ (1) and |y * 0 | ≤ C 0 . We first consider the setup where one can solve for y * (x; θ * , y * 0 ) in an explicit form from (14) and then obtain an estimator y * x;θ,ŷ 0 with θ ,ŷ 0 ∈ arg min In particular, we assume that all the n design points are sampled from [0, α). Under (10), the Picard local existence and uniqueness theorem (Theorem III) implies that there is a unique solution to (9) on [0, α) and therefore, given θ and y 0 , y * (x i ; θ, y 0 ) in program (15) has a unique form for all x i (i = 1, ..., n) on [0, α). 5 In terms of the homogeneous higher order linear ODEs it is well understood that the solutions of (13) (where aj (·)s are fixed continuous functions) form a vector space of dimension m. A classical result on the VC dimension of a vector space of functions (Steele, 1978 and Dudley, 1978) implies that the solution class associated with (13) has VC dimension at most m, when aj (·)s (j = 0, ..., m) are fixed continuous functions. (2) where y * (·) is the (unique) solution to (14) for a sufficiently large positive universal constant c 0 , then we have (15) . Remark. Condition (16) simply restricts C 0 from being too large, and as a consequence, (12) . This upper bound implies that Y is no "larger" than the class of f s parameterized by θ ∈ B ∞ (1). As long as B L K, ∞ 1, Proposition 2.2 suggests that the noisy recovery of (14) is essentially no more difficult than estimating a K−dimensional element in B ∞ (1), where the optimal rate is σ 2 K n . Remark. Condition (17) in Proposition 2.2 simply excludes the case where b and C 0 are "too small". Without such a condition, as long as f is bounded from above, we would simply replace In many ODEs, the solutions may be in implicit forms so it is not possible to write down (15) . 
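Before turning to the Picard-based strategy for implicit solutions, here is a numerical illustration of the covering argument behind Proposition 2.1: a cover of the coefficient set together with a cover of the initial-value interval induces a sup-norm cover of the solution class. The toy family $y'(x) = \theta y(x)$ (so $y(x) = y_0 e^{\theta x}$) and the grid steps are illustrative assumptions made for this sketch.

```python
# A cover of the parameter interval plus a cover of the initial-value interval
# induces a delta-cover of the solution class {x -> y0 exp(theta x)} in sup-norm,
# echoing the proof of Proposition 2.1. Grid steps are chosen for this toy family.
import numpy as np

xs = np.linspace(0.0, 1.0, 1001)
sol = lambda theta, y0: y0 * np.exp(theta * xs)

delta = 0.1
step = delta / (2 * np.e)                       # fine enough for theta, y0 in [-1, 1]
theta_cover = np.arange(-1.0, 1.0 + 1e-12, step)
y0_cover = np.arange(-1.0, 1.0 + 1e-12, step)

rng = np.random.default_rng(1)
worst = 0.0
for _ in range(500):                            # random members of the solution class
    th, y0 = rng.uniform(-1, 1, size=2)
    th_c = theta_cover[np.argmin(np.abs(theta_cover - th))]
    y0_c = y0_cover[np.argmin(np.abs(y0_cover - y0))]
    worst = max(worst, np.max(np.abs(sol(th, y0) - sol(th_c, y0_c))))

print(f"sup-norm distance to the induced cover: {worst:.4f} (target delta = {delta})")
```

The number of induced cover elements scales like the product of the two covers, matching the two additive terms in the bound (12).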
In the situation where we have a quality estimatorŷ 0 of y * 0 , we can exploit the Picard iteration below to circumvent this issue: An estimator based on (19) performs the following steps: first, we computê . . . second, we solve the following program θ ∈ arg min (2) where y * (·) is the (unique) solution to (14) on [0,α]. In terms ofŷ R+1 x i ;θ whereθ is obtained from solving (21) , if for a sufficiently large positive universal constant c 0 , then we have Remark. If the integral in (19) is hard to compute analytically, numerical integration can be used in (20) . This would introduce an additional approximation error σ (R + 1) · Err, where Err is an upper bound on the error incurred in each iteration of (20) , depending on the smoothness of f and which numerical method is used. For example, if f is twice differentiable with bounded first and second derivatives, and the integral is approximated with the midpoint rule with T slices, then Err T −2 . Remark. When our results concern a Picard-iteration-based estimation (Proposition 2.3 above and later Propositions 2.5-2.6), we assume the sampling of all the n design points from [0,α] with α < α. This assumption ensures 1 1−α |ŷ 0 − y * 0 | +α R+1 1−α max {C 0 ,α} in (23) to be well defined in the case where α = 1. The bound (23) reflects three sources of errors:b K n is due to the estimation error inθ, 1 1−α |ŷ 0 − y * 0 | is due to the estimation error inŷ 0 , andα R+1 1−α max {C 0 ,α} is due to the error from the finite (R + 1) Picard iterations. To facilitate presentations of the upcoming results, we introduce additional definitions below. Definitions. Given C 0 > 0, b > 0, and α = min {1, b}, let Let p = (p 1 , ..., p d ) and [p] = d k=1 p k where p k s are non-negative integers. We write Given a non-negative integer γ, we let S γ+1, d ρ, [ When ρ = 1, d = 1, a = 0 and a = 1, we use the shortform S γ+1 := S γ+1, 1 (1, [0, 1]). In addition, when d = 1, a = 0 and a = 1, we denote the class of functions satisfying parts 1 and 3 above by S * γ+1 (ρ). Lastly, we let S † β+2 denote the class of functions such that any function h ∈ S † β+2 satisfies the following properties: 1. h is continuous on [0, 1] and differentiable β + 1 times; The results in this section concern the ODE in the form which is subject to the following assumption. where Y is the class of solutions (to (24) with f ∈ S β+1, 2 (1, Ξ)) on [0, α]. (ii) For a given δ 5 ∈ (0, 1), let β * (δ) = γ (≤ β) be the largest non-negative integer such that 6 We have Consequently, log N ∞ (δ, Y) is bounded from above by the minimum of (25) and (27)-(28). We can take C 1 = 2 log 21 and C 2 = 2 max 0, log 4C in Theorem 2.1. Note that the LHS of (26) is a strictly increasing function of γ (since δ 5 ∈ (0, 1)) and the RHS is a strictly decreasing function of γ, and the LHS is no greater than the RHS for any δ 5 ∈ (0, 1) when γ = 0 (to see this, note that LHS = 3 log where "log" is the natural logarithm.). Therefore, the largest non-negative solution β * (δ) = γ (≤ β) to (26) exists (i.e., β * (δ) is well defined). Essentially, bound (25) implies that Y is no "larger" than S β+1, 2 (1, Ξ), the class where f in (24) ranges over. There are situations where (27)-(28) is more useful for deriving bounds on estimation errors than (25), as we will see in the subsequent subsection (2.2.2). In proving (27)-(28), we first establish an intermediate lemma (Lemma A.1(ii) in Section A.6.2 of the supplementary materials) which shows that Y ⊆ S † β+2 . 
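Returning to the Picard-iteration construction in (19)-(20): the sketch below implements $\hat{y}_{r+1}(x) = \hat{y}_0 + \int_0^x f(s, \hat{y}_r(s))\,ds$ with the integral approximated by the midpoint rule, as suggested in the remark above. The parametric choice $f(s, y) = \theta y$, the pilot estimate $\hat{y}_0$, and all constants are illustrative assumptions; in the full estimator (21) this routine would be wrapped in a least squares search over $\theta$.

```python
# Picard iterations (19)-(20) with midpoint-rule numerical integration (T slices).
# f(s, y) = theta * y is an assumed toy parametric family; hat_y0 is a
# (hypothetical) pilot estimate of the initial value.
import numpy as np

def picard_path(theta, hat_y0, x_grid, R, T=200):
    """Return hat_y_{R+1} on x_grid after R+1 Picard iterations."""
    f = lambda s, y: theta * y
    y_curr = np.full_like(x_grid, hat_y0)          # hat_y_0(x) = hat_y0
    for _ in range(R + 1):
        y_next = np.empty_like(x_grid)
        for j, xj in enumerate(x_grid):
            mids = (np.arange(T) + 0.5) * xj / T   # midpoints for int_0^{xj}
            y_mid = np.interp(mids, x_grid, y_curr)
            y_next[j] = hat_y0 + np.sum(f(mids, y_mid)) * xj / T
        y_curr = y_next
    return y_curr

x_grid = np.linspace(0.0, 0.9, 200)  # design on [0, alpha_bar] with alpha_bar < 1
path = picard_path(theta=0.5, hat_y0=1.0, x_grid=x_grid, R=6)
exact = np.exp(0.5 * x_grid)         # closed form for this toy f
print(f"max |Picard - exact| after 7 iterations: {np.max(np.abs(path - exact)):.2e}")
```

The geometric decay in $R$ visible here is exactly the $\bar{\alpha}^{R+1}$ term in the error decomposition (23).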
Returning to the class $S^{\dagger}_{\beta+2}$: one can clearly see that $S^{\dagger}_{\beta+2} \subset S^{*}_{\beta+2}(\rho_\beta)$ where $\rho_\beta := 2^{\beta+1}(\beta+1)!$. For this bigger class $S^{*}_{\beta+2}(\rho_\beta)$, the minimax lower bounds in Tsybakov (2009) imply the scaling in (29). If $\beta$ tends to infinity at a faster rate than $n$, then the lower bound above suggests that there exist no consistent estimators in the worst case. However, this conclusion cannot hold true for estimations of functions in $S^{\dagger}_{\beta+2}$ (which is a proper subset of $S^{*}_{\beta+2}(\rho_\beta)$). To see this, note that an easy calculation shows that $S^{\dagger}_{\beta+2} \subset S_{2,1}(C \vee 2, [0, 1])$ and the optimal rate concerning estimations of functions in $S_{2,1}(C \vee 2, [0, 1])$ scales as $n^{-4/5}$, for which consistent estimators apparently exist in the worst case.

Motivated by this observation, we introduce a new notion called the $\delta$-dependent "critical smoothness parameter", $\beta^*(\delta) = \gamma (\le \beta)$, which is the largest non-negative integer solution to (26). The naming "critical smoothness parameter" echoes the so-called "critical radius" in the existing literature, although these two definitions are fundamentally different. A "critical radius" typically determines the optimal estimation rate; see, e.g., Wainwright (2019) and Zhu (2017) for a number of applications involving critical radii. Finding a "critical radius" often boils down to finding a sharp enough $\delta$ resolution in the covering numbers. This sharp $\delta$ resolution generally becomes smaller as the sample size $n$ gets larger in statistical estimation problems. Meanwhile, as $\delta$ becomes smaller, $\gamma$ in (26) may be expected to increase such that the inequality still holds, and hence the "critical smoothness parameter" increases. Intuition on why (27)-(28) can lead to sharper bounds on estimation errors than (25) can be gained from a simple example: if $\delta$ is small enough (possibly as a result of large enough $n$) so that $\beta^*(\delta) = \beta - 1$ (or even $\beta^*(\delta) = \beta$), then the entropy bound behaves essentially like that of a smooth class of the full degree, free of the factorial inflation.

In the next two subsections (2.2.2 and 2.2.3), we establish estimation theory with the assistance of the bounds in Theorem 2.1. Our results concern the estimation of the following ODE: $y^{*\prime}(x) = f^*(x, y^*(x))$ (30) with $|y^*_0| \le C_0$ and $(x, y^*(x)) \in \Xi$. In this subsection, we establish upper bounds for the constrained least squares estimator (31). As we have mentioned after Theorem 2.1, the prediction error of (31) can be bounded using the idea of the critical smoothness parameter we introduced in Theorem 2.1. To motivate Theorem 2.2, we look at a classical result in the literature applied to our context. Given the $\rho_\beta$-Lipschitz continuity of the $(\beta+1)$th derivative, we may consider the alternative estimator $\check{y}$ defined by the penalized program (32), where $\mathcal{F}_{\beta+2}$ is the $(\beta+2)$th order Sobolev class. If we solve (32) with a suitable choice of $\lambda_n$, Theorem 13.17 and Corollary 13.18 in Wainwright (2019) imply a bound (33) which essentially matches the lower bound (29). When $\beta$ is large relative to $\frac{n}{\sigma^2}$, this bound can tend to infinity even as $n \to \infty$. Classical results like the above do not deliver a useful bound in this case because they concern a bigger class than $S^{\dagger}_{\beta+2}$ and fail to take into account the special structure of $S^{\dagger}_{\beta+2}$. The prediction error of (31) can have a much better rate than $\infty$, as we show below.

Assumption G. For every $\gamma \in \{0, \ldots, \beta\}$ and a positive universal constant $c$, let $\delta(\gamma) = c(\frac{\sigma^2}{n})^{\frac{\gamma+2}{2(\gamma+2)+1}}$ and $\gamma^* (\le \beta)$ be the largest non-negative integer such that (34) holds. The sample size $n$ is large enough such that the two displayed conditions hold for some positive universal constants $c'$ and $c''$.
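Since (34) pins down $\gamma^*$ as the largest $\gamma$ at which an increasing cost term is still dominated by a decreasing resolution term, it can be found by a simple scan, as in the sketch below. The inequality used there is only a placeholder with the same monotone structure as (26)/(34) — an increasing factorial cost $\log \prod_{i \le \gamma} i!$ plus a power-of-2 term against $\delta(\gamma)^{-1/(\gamma+2)} = (n/\sigma^2)^{1/(2(\gamma+2)+1)}$ — not the paper's exact condition; $c$ is an illustrative constant.

```python
# Scan gamma = 0, ..., beta for the delta-dependent critical smoothness parameter:
# the largest gamma at which the increasing factorial cost is still dominated by
# the decreasing resolution term delta(gamma)^(-1/(gamma+2)). The inequality is a
# placeholder with the same monotone structure as (26)/(34).
from math import factorial, log

def critical_gamma(n, sigma2, beta, c=1.0):
    gamma_star = 0
    for gamma in range(beta + 1):
        delta = c * (sigma2 / n) ** ((gamma + 2) / (2 * (gamma + 2) + 1))
        lhs = sum(log(factorial(i)) for i in range(gamma + 1)) \
              + 0.5 * (gamma**2 + 3 * gamma) * log(2.0)   # increasing in gamma
        rhs = (1.0 / delta) ** (1.0 / (gamma + 2))        # decreasing in gamma
        if lhs <= rhs:
            gamma_star = gamma
        else:
            break                        # monotone structure: no larger gamma works
    return gamma_star

for n in [10**2, 10**8, 10**16, 10**32]:
    print(f"n = 1e{len(str(n)) - 1}: gamma* = {critical_gamma(n, sigma2=1.0, beta=10)}")
```

Consistent with the discussion above, $\gamma^*$ climbs toward $\beta$ only once $n/\sigma^2$ is enormous relative to $\beta$, reflecting the threshold $n/\sigma^2 \gtrsim (\beta\sqrt{\log\beta})^{4(\beta+2)+2}$.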
Almost all results (as far as we are aware) in the existing literature implicitly assume that δ −d γ+1 δ , and hence one typically sees the following expression The upper bound for log N ∞ δ, S † β+2 differs from the upper bound for log N ∞ (δ, S β+2 ) mainly in log β i=0 i! + β 2 +3β 2 log 2, which comes from the factorial growth in the derivatives of functions in S † β+2 and gives arise to the "curse of smoothness" evidenced by ρ for a sufficiently large positive universal constant c 0 , then we have where β * (δ) is defined in Theorem 2.1 andF := {g = g 1 − g 2 : g 1 , g 2 ∈ Y}. However, computing the second integral in the above analytically is difficult and therefore we use the bound For the prediction error, Theorem 2.2 suggests that the noisy recovery of (30) is no more difficult than estimating a function in S β+1, 2 1, [0, 1] 2 , whose optimal rate is σ 2 n β+1 β+2 . The rate " σ 2 n 2(γ * +2) 2(γ * +2)+1 " can be sharper than the rate " σ 2 β+2 is only useful when β > 3. 7 Note that (34) always holds for γ * = 0 and any δ 7 −resolution within (0, 1), so we easily have σ 2 n 4 5 ≤ σ 2 n β+1 β+2 as long as σ 2 n ≤ 1 and β ≤ 3. As another example, it is easy to show that when n On the other hand, when β is large enough relative to n σ 2 , it is possible that the rate " is smaller than the rate " σ 2 n 2(γ * +2) 2(γ * +2)+1 ". Obviously, as we mentioned before, if β is much smaller than n σ 2 , the classical bound ρ (33) can be also useful. So far, our results have focused on the ODE (24) where f at least satisfies a Picard-type Lipschitz condition It is possible to bound 1 In the following, we provide an oracle inequality that achieves this purpose. (2) where y * (·) is a solution to In terms of (31 then we have If an estimatorŷ 0 of y * 0 is available, then we can exploit the Picard iteration in an analogous way as in (21): Suppose all the n design points are sampled from [0,α] whereα < α. An estimator based on (44) performs the following steps: first, we computê . . . second, we solve the following program then we have Remark. For technical reasons, Proposition 2.5 excludes the trivial case β = 0. For this case, the results in Section 2.2.2 suggest that a direct estimator y ∈ arg miñ β+2 ", then Proposition 2.5 suggests that recovering (30) in (2) based on Picard iterations is essentially no more difficult than estimating a function in S β+1, 2 1, [0, 1] 2 , whose optimal rate is σ √ n β+1 β+2 . Without knowing y * 0 , the fact that Y ⊆ S † β+2 allows us to apply existing estimators such as the local polynomials to estimate y * 0 :ŷ * is a kernel, h is a bandwidth parameter, and ω is a non-negative integer. Since S † β+2 ⊂ S * β+2 (ρ β ) (recalling ρ β = 2 β+1 (β + 1)!), it is natural to consider the estimator y * 0, β constructed as above with ω = β. Proposition 1.13 and Theorem 1.6 in Tsybakov (2009) where λ min, β is the smallest eigenvalue of Suppose the conditions in Lemma 1.5 of Tsybakov (2009) hold; that is, x i = i n for i = 1, ..., n; for some constants C min > 0 and ∆ > 0, K (t) ≥ C min I (|t| ≤ ∆) for all t ∈ R. Then one has where λ min, ω is the smallest eigenvalue of B x 0 ,ω . In this subsection, we consider the ODE in the form The right-hand-side of the above has a "single index" structure. In the ODE literature, this structure is referred to as an autonomous system. The main results concerning (49) are presented in a similar fashion as those in Section 2.2. 
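As an aside on the initial-value estimate $\hat{y}^*_{0,\omega}$ described above, the following is a minimal sketch of a local polynomial fit of order $\omega$ at $x_0 = 0$, with the box kernel $K(t) = I(|t| \le 1)$, design $x_i = i/n$, and an illustrative bandwidth; all of these are assumptions made for the sketch.

```python
# Local polynomial estimate of y*_0 = y*(0): weighted regression of Y on the
# monomials ((x - 0)/h)^k, k = 0, ..., omega; the intercept is the fitted value
# at 0. Box kernel and bandwidth are illustrative choices.
import numpy as np

def local_poly_at_zero(x, Y, omega, h):
    m = len(x)
    w = (np.abs(x / h) <= 1.0).astype(float)            # box kernel weights
    U = np.vander(x / h, N=omega + 1, increasing=True)  # columns (x/h)^k
    B = U.T @ (w[:, None] * U) / (m * h)                # the matrix B_{x0, omega}
    a = U.T @ (w * Y) / (m * h)
    return np.linalg.solve(B, a)[0]                     # assumes lambda_min(B) > 0

rng = np.random.default_rng(2)
n, sigma = 2000, 0.1
x = np.arange(1, n + 1) / n                             # design x_i = i/n
y_star = np.exp(0.5 * x)                                # toy solution, y*(0) = 1
Y = y_star + sigma * rng.standard_normal(n)

beta = 3
h = n ** (-1.0 / (2 * (beta + 1) + 1))                  # bandwidth of optimal order
print(f"hat_y0 = {local_poly_at_zero(x, Y, omega=beta, h=h):.4f} (truth 1.0)")
```

Such an estimate is precisely the input $\hat{y}_0$ that the Picard-based estimators take. We now return to the main results for the autonomous system (49).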
As we show in these results, the autonomous system (49) and the nonautonomous system (24) in noisy recovery problems sometimes exhibit the same behavior while other times, a smaller upper bound can be obtained for the estimation error associated with (49) than with (24). At the end of this subsection, we present a special case of (49), where f need not be Lipschitz continuous, yet the associated solution class is essentially no "larger" than S 2 . Assumption AF. In (49), |y 0 | ≤ C 0 and (x, y (x)) ∈ Ξ; f ranges over S β+1, 1 for all k = 0, ..., β and y ∈ −C, C , where f (0) (y) = f (y); and for all y,ỹ ∈ −C, C . where Y is the class of solutions (to (24) with f ∈ S β+1, 1 1, −C, C ) on [0, α]. (ii) For a given δ 5 ∈ (0, 1), let β * (δ) = γ (≤ β) be the largest non-negative integer such that We have Consequently, log N ∞ (δ, Y) is bounded from above by the minimum of (50) and (52)-(53). We can take C 1 = 2 log 21 and C 2 = 2 max 0, log 4C in Theorem 2.3. In proving (52)-(53) of Theorem 2.3, we first show that Y ⊆ AS † β+2 (see Lemma A.1(i) in Section A.6.2), where any function h ∈ AS † β+2 satisfies the following properties: • h is continuous on [0, 1] and differentiable β + 1 times; • |h(x)| ≤ C, and h (k) (x) ≤ (k − 1)! for all k = 1, ..., β + 1 and x ∈ [0, 1]; In the next two subsections (2.3.2 and 2.3.3), we establish estimation theory with the assistance of the bounds in Theorem 2.3. Our results concern the estimation of the following ODE with |y * 0 | ≤ C 0 and (x, y * (x)) ∈ Ξ. In this subsection, we establish upper bounds on the the constrained least squares estimator The sample size n is large enough such that and for a sufficiently large positive universal constant c 0 , then we have where Given the "single index" structure of (49), bound (50) in Theorem 2.3 implies that Y is basically no "larger" than S β+1, 1 1, −C, C , the class where f in (49) ranges over. Like how we motivated (27)-(28) in Theorem 2.1, there are situations where (52)-(53) can be more useful for deriving bounds on the estimation errors than (50). For example, when n σ 2 β √ log β 4(β+2)+2 such that (56) holds for γ * = β, (60) becomes Recalling the definition of AS † β+2 , the bound (k − 1)! on the absolute value of the kth derivative of y (x) (for all k = 1, ..., β + 1) and the bound (β + 1)! in the autonomous system (49). Despite these tight bounds and the "single index" structure in the autonomous system, the left-hand-sides of (51) and (56) have the same scaling as those of (26) and (34), as the terms γ 2 +γ 2 log 2 in (26) and (γ * 2 +3γ * ) log 2 2 in (34) are dominated by c · log ( γ i=0 i!) and c ′ · log γ * i=0 i! , respectively (for some positive universal constants c and c ′ ). As a consequence, (38) and (60) can coincide if n σ 2 is large enough relative to β. On the other hand, in situations where (50) and the rate " σ 2 n 2(β+1) 2(β+1)+1 " improve upon (52)-(53) and the rate " σ 2 n 2(γ * +2) 2(γ * +2)+1 " (as a result of small enough n σ 2 and large enough β), we can clearly see the difference in the upper bounds for the prediction errors between the autonomous system (49) and the nonautonomous system (24); that is, "δ If an estimatorŷ 0 of y * 0 is available, then we can exploit the Picard iteration in an analogous way as in (44): Suppose all the n design points are sampled from [0,α] whereα < α. An estimator based on (61) performs the following steps: first, we computê second, we solve the following program (62) Proposition 2.6. 
Let Assumption AF hold for $f^*$ in (54), and $|y^*_0| \le C_0$ and $(x, y^*(x)) \in \Xi$. Suppose all the $n$ design points are sampled from $[0, \bar{\alpha}]$ where $\bar{\alpha} < \alpha$. Let us consider (2) where $y^*(\cdot)$ is the (unique) solution to (54) on $[0, \bar{\alpha}]$. In terms of $\hat{y}_{R+1}(x_i; \hat{f})$ where $\hat{f}$ is obtained from solving (62), an upper bound analogous to that in Proposition 2.5 holds. Remark. In contrast to Proposition 2.5, Proposition 2.6 is able to accommodate the case $\beta = 0$.

So far all our results for the nonparametric smooth ODEs have required $f$ to be at least uniformly Lipschitz continuous on the relevant domain, in which case we have (at least) $\log N_\infty(\delta, \mathcal{Y}) \lesssim \delta^{-1/2}$. (64) The autonomous system $y'(x) = f(x)$ with $x \in [0, 1]$ and the initial value $y(0) = y^*_0$ (known as the separable ODE) indicates that uniform Lipschitz continuity of $f$ is clearly not necessary for (64) to hold. Suppose $f$ is continuous and non-decreasing (or non-increasing) on $[0, 1]$. [8] Then the solution $y(x) = y^*_0 + \int_0^x f(s)\, ds$ is convex (respectively, concave) on $[0, 1]$ (see our Proposition B.1 in Section B of the supplementary materials for the proof). As a result, the existing results (e.g., Guntuboyina and Sen, 2013) imply that $\log N_\infty(\delta, \mathcal{Y}) \lesssim \delta^{-1/2}$. Essentially, $\mathcal{Y}$ is no "larger" than $S_2$. [8] For example, $f(x) = \sqrt{x}$ defined on $[0, 1]$ is continuous and non-decreasing, but not Lipschitz continuous.

In this subsection, we study ODEs with polynomial potential functions. These ODEs have natural geometric interpretations, and the concept of VC dimension turns out to be the most useful for understanding the "size" of such ODEs. If a class $\mathcal{F}$ has a finite VC dimension $\nu$ and $\sup_{f,g \in \mathcal{F}} |f - g|_n \le 2r$, then existing results in the literature (e.g., van der Vaart and Wellner, 1996; Wainwright, 2019) imply that $N_n(\delta, \mathcal{F}) \lesssim \nu (16e)^\nu (\frac{r}{\delta})^{2\nu}$, (65) where $N_n(\delta, \mathcal{F})$ is the $\delta$-covering number of $\mathcal{F}$ with respect to the norm $|\cdot|_n$.

We begin by briefly reviewing the exact ODEs and inexact ODEs that can be transformed into exact ODEs. The system $F_1(x, y)\, y' + F_2(x, y) = 0$ (66) with the functions $F_1$ and $F_2$ satisfying $\partial_x F_1 = \partial_y F_2$ is an exact ODE. For example, a separable ODE is exact. An inexact ODE in the form of (66) can be transformed into the exact ODE $e^{F_0(x)} F_1(x, y)\, y' + e^{F_0(x)} F_2(x, y) = 0$, (67) where $F_0 = \int f_0(s)\, ds$ and $f_0 = (\partial_y F_2 - \partial_x F_1)/F_1$, and where the "potential function" $\varphi$ satisfies $\partial_y \varphi = e^{F_0} F_1$ and $\partial_x \varphi = e^{F_0} F_2$. Consequently, the solutions of (66) have the following form: $\varphi(x, y) = c^*$, (68) where $c^* \in \mathbb{R}$ depends on the initial value. While it is difficult to derive a meaningful general theory for noisy recovery problems associated with the implicit form (68), we can develop results for special (but useful) cases. Often the potential function $\varphi$ is a polynomial. As a simple example, the (exact) ODE $y y' + x = 0$ has a potential function $\varphi = x^2 + y^2$ and solutions in the form of 2-dimensional spheres $x^2 + y^2 = c^*$.

In what follows, we restrict our attention to the class of ODEs that satisfy the following: 1. the solutions of the ODEs have the form (68), where the potential function $\varphi(x, y)$ associated with each member of this class is a polynomial (in 2 variables) of degree at most $D$; 2. writing (67) as $y'(x) = -\frac{e^{F_0(x)} F_2(x, y)}{e^{F_0(x)} F_1(x, y)} =: f(x, y)$, the associated initial value satisfies $|y_0| \le C_0$ and $f$ is continuous. (69) Let us refer to this class of ODEs satisfying the two properties above as $\mathcal{E}$. We would like to recover $y^*$ in (2) where $y^*$ is a solution to an ODE in $\mathcal{E}$. In this setup, a solution often has an implicit form and therefore, the estimation strategy based on Picard iterations is the most natural method if an estimator $\hat{y}_0$ of the initial value $y^*_0$ is available. We can adapt the idea in Section 2.2.3 to obtain an estimator $\hat{y}_{R+1}$ for $y^*$.
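To see what the VC route buys quantitatively, the sketch below evaluates the covering bound (65) using the VC dimension bound $\binom{2+D}{2}$ for degree-$D$ polynomial potentials stated in the result that follows; the values of $D$, $r$, and $\delta$ are illustrative choices.

```python
# log of the covering bound (65), N_n(delta, F) <= nu (16 e)^nu (r/delta)^(2 nu),
# with nu bounded by binom(2 + D, 2) for solution classes whose polynomial
# potential function has degree at most D (see the result below).
from math import comb, e, log

r = 1.0
for D in [1, 2, 3, 5]:
    nu = comb(2 + D, 2)                  # VC-dimension bound for degree-D potentials
    for delta in [0.1, 0.01]:
        log_cover = log(nu) + nu * log(16 * e) + 2 * nu * log(r / delta)
        print(f"D={D}, delta={delta}: nu <= {nu}, log N_n <= {log_cover:.1f}")
```

The $\log(1/\delta)$ dependence (rather than a polynomial in $1/\delta$) is the signature of a "parametric-size" class.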
If we can bound the VC dimension of the solution class $\mathcal{Y}$ underlying $\mathcal{E}$, then arguments similar to those in the proof for Proposition 2.5 along with (65) can be adapted for deriving an upper bound on $\mathbb{E}\frac{1}{n}\sum_{i=1}^n (\hat{y}_{R+1}(x_i) - y^*(x_i))^2$. The following result (from Matoušek, 2002) provides an upper bound on the VC dimension of sets defined by polynomials: the polynomials (in $d$ variables) of degree at most $D$ form a vector space of dimension $\binom{d+D}{d}$, and the VC dimension of the associated class of sets is at most $\binom{d+D}{d}$. Therefore, if $\varphi(x, y)$ is a polynomial (in 2 variables) of degree $D$, then the VC dimension of the solution class in the form of (68) is no greater than $\binom{2+D}{2}$. Note that this result by itself does not require (69), which, however, would be quite useful for estimating $y^*$ as discussed above.

We end this section with a general upper bound on the covering number of higher order Picard type ODEs. Previously in (12), (25) and (50), we have seen that the underlying solution class $\mathcal{Y}$ is "essentially" no "larger" than the class $f$ ranges over. This phenomenon also holds for higher order ODEs as long as they satisfy a Picard type condition. We would like to point out that this general bound overlooks additional structures in the solution class. For establishing an estimation theory, entropy bounds that take into account special structures (such as those we have seen in Sections 2.2-2.4) may also be useful. We let $Y(x) := (y(x), y'(x), \ldots, y^{(m-1)}(x))$ and $Y_0 := Y(0)$. Let us consider the ODE $y^{(m)}(x) = f(x, Y(x))$ (70) with $|Y_0|_2 \le C_0$ and $(x, Y(x)) \in \bar{\Gamma}$. Proposition 2.7. In terms of (70), suppose $f$ is continuous on $\bar{\Gamma}$, and (71) holds for all $(x, Y) := (x, y, \ldots, y^{(m-1)})$ and $(x, \tilde{Y}) := (x, \tilde{y}, \ldots, \tilde{y}^{(m-1)})$ in $\bar{\Gamma}$. If $f$ ranges over a class $\mathcal{F}$, then $\log N_\infty(\delta, \mathcal{Y}) \le \log N_\infty(\frac{\delta}{L_{\max}}, \mathcal{F}) + \log N_2(\frac{\delta}{L_{\max}}, B_2(C_0))$.

Below we discuss two applications of Proposition 2.7. The first one is an extension of (12) to parametric smooth higher order ODEs and the second one is an extension of (25) to nonparametric smooth higher order ODEs. In the first application, $f$ is parameterized by $\theta$ and is Lipschitz in $\theta$ for all $(x, Y) \in \bar{\Gamma}$ and $\theta, \theta' \in B_2(1)$; let $\mathcal{Y}$ be the class consisting of solutions to (73) with $\theta \in B_2(1)$ and $|Y_0|_2 \le C_0$. In the second application, in terms of (70), suppose (71) is satisfied and $f$ ranges over $S_{\beta+1, m+1}(1, \bar{\Gamma})$; that is, $f$ is continuous on $\bar{\Gamma}$, and all partial derivatives of $f$ up to order $\beta$ are bounded in absolute value by 1, with the $\beta$th order partial derivatives being 1-Lipschitz; this yields the bound (75). Remark. Essentially, bound (75) implies that $\mathcal{Y}$ is no "larger" than $S_{\beta+1, m+1}(1, \bar{\Gamma})$, the class where $f$ ranges over. It is theoretically possible to derive a bound with a similar flavor to (27)-(28) in Theorem 2.1, but this task is very laborious for smooth higher order ODEs.

Differential equations have a long-standing history in mathematics and are widely applied in science, engineering, economics, business analytics, and public health. While the statistical literature on differential equations has focused on methodological and/or computational development, this paper focuses on the metric entropy of ODE solution classes and a nonasymptotic theory for noisy function fitting problems, both of which, to the best of our knowledge, have not been examined in the literature. Given $y^{(m)}(x) = f(x, y(x), y'(x), \ldots, y^{(m-1)}(x))$ and letting $f$ range over various function classes, we establish novel results on the metric entropy of the associated solution classes $\mathcal{Y}$. Most of our theory is built upon or expands the smoothness structure in the Picard-Lindelöf theorem, although we also provide a number of results that do not require such a structure. In addition, we establish a nonasymptotic theory concerning the noisy function fitting/recovery problem for nonparametric least squares estimators and least squares estimators based on Picard iterations. We also discuss the estimation of initial values with local polynomials. Here we discuss a number of future directions. First, an obvious extension would be to explore the metric entropy of solution classes associated with partial differential equations.
Second, while this paper has delivered upper bounds for covering numbers of solution classes and estimation theory, it would be useful to establish minimax lower bounds to understand the information theoretic limitations of statistical recovery of an ODE solution. Third, this paper focuses on recovery of ODE solutions rather than recovery of structural parameters in an ODE. Nonasymptotic theory for the latter would be also worthwhile pursuing. by Ying Zhu 9 and Mozhgan Mirzaei 10 The l q −norm of a K−dimensional vector θ is denoted by |θ| q , 1 ≤ q ≤ ∞ where |θ| q := K j=1 |θ j | q 1/q when 1 ≤ q < ∞ and |θ| q := max j=1,...,K |θ j | when q = ∞. Let B q (1) := θ ∈ R K : |θ| q ≤ 1 with q ≥ 1. Define P n := 1 n n i=1 δ x i that places a weight 1 n on each observation x i for i = 1, ..., n, and the associated L 2 (P n )−norm of the vector v := {v(x i )} n i=1 , denoted by |v| n , is given by 1 for a universal constant c ′ ∈ (0, ∞); and f (n) ≍ g(n) means that f (n) g(n) and f (n) g(n) hold simultaneously. As a general rule for this paper, the various c and C constants denote positive universal constants that are independent of the sample size n and the smoothness parameter β, and may vary from place to place. Given a set T, a set t 1 , t 2 , ..., t N ⊂ T is called a δ−cover of T with respect to a metric ρ if for When ρ = 1, d = 1, a = 0 and a = 1, we use the shortform S γ+1 := S γ+1, 1 (1, [0, 1]). In terms of our least squares estimators, the basic inequality To bound the right-hand-side of (76), we define the shifted version of the function class: Given a radiusr n > 0, define the local complexity , Ω(r n ;F) = h ∈F : |h| n ≤r n , and |h| n : Our goal is to seek a sharp enoughr n > 0 that satisfies the critical inequality log N n (δ, Ω(r n ;F))dδ ≤r 2 n σ . (80) Proof. Let N q (δ, B q (1)) denote the covering number of B q (1) with respect to the l q −norm and N q δ Cmax , [−C 0 , C 0 ] denote the covering number of [−C 0 , C 0 ] with respect to the l q −norm. For a given δ > 0, let us consider the smallest δ CmaxL K, q −covering θ 1 , ..., θ N with respect to the l q −norm. By (11) , note that {f θ 1 , f θ 2 , ..., f θ N } forms a δ Cmax −cover of F with respect to the supnorm. We also consider the smallest δ Cmax −covering y 0,1 , ..., y 0,N ′ for the interval [−C 0 , C 0 ] where the initial value lies. By Theorem IV, for a solution y to the ODE with f parameterized by any θ ∈ B q (1) (and subject to (11) ) and y 0 ∈ [−C 0 , C 0 ], we can find i ∈ {1, ..., N } and i ′ ∈ 1, ..., N ′ such that where y (i,i ′ ) is a solution to the ODE with f parameterized by θ i and the initial value being y 0,i ′ . Consequently, we obtain a δ−cover of Y. By the standard volumetric argument which yields log N q (δ, B q (1)) ≤ K log 1 + 2C max L K, q δ , Proof. Let F = Y in (77). We have where we have applied a change of variable t = with probability at least 1 − c 1 exp −c 2 B 2 L K, ∞ K . Proof. By (76), we need to bound 1 , estimation error due toθ estimation error due to the finite iterations, where As a result, we have (82) with probability at least 1 − c 1 exp (−c 2 n). We first analyze 1 whereŷ R+1 (s; θ) is constructed in the following fashion: For any θ, θ ′ ∈ B ∞ (1), at the beginning, we have where the inequality follows from (11) . For the second iteration, we have where (i) and (ii) in the second inequality follow from (10) with (84) and (11), respectively. Continuing with this pattern until the (R + 1)th iteration, we obtain In particular, (85) holds for x ∈ {x 1 , x 2 , ..., x n }. 
Consequently, For a given δ > 0, let us consider the smallest δ 2b −covering θ 1 , ..., θ N (with respect to the l ∞ −norm), and by (85), for any θ, θ ′ ∈ B ∞ (1), we can find some θ i and θ j from the covering set θ 1 , ..., θ N such that Settingbr n K n ≍r 2 n σ yieldsr n ≍ σb K n and therefore, with probability at least 1 − c 1 exp −c 2b 2 K . To analyze 1 (82), note that (10) implies for all i = 1, ..., n. As a result, we have (82), standard argument for the Picard-Lindelöf Theorem implies that for any non-negative integers r and r ′ ; as a result, we have For a given δ > 0, let us consider the smallest δ Cmax −covering {f 1 , ..., f N } of S β+1, 2 (1, Ξ) with respect to the sup-norm such that We also consider the smallest δ Cmax −covering y 0,1 , ..., y 0,N ′ for the interval [−C 0 , C 0 ] where the initial value lies. Note that By Theorem IV, for a solution y to the ODE associated with any f ∈ S β+1, 2 (1, Ξ) and y 0 ∈ [−C 0 , C 0 ], we can find i ∈ {1, ..., N } and i ′ ∈ 1, ..., N ′ such that where y (i,i ′ ) is a solution to the ODE associated with f i and the initial value y 0,i ′ from the covering sets. Consequently, we obtain a δ−cover of Y. We conclude that Part (ii). The proof for part (ii) consists of three steps. Step 1: In Lemma A.1(ii) (subsection A.6.2), we establish Y ∈S † β+2 where S † β+2 denotes the class of functions such that any function h ∈ S † β+2 satisfies the following properties: • h is continuous on [0, 1] and differentiable β + 1 times; • |h(x)| ≤ C, and h (k) (x) ≤ 2 k−1 (k − 1)! for all k = 1, ..., β + 1 and x ∈ [0, 1]; Step 2: In Lemma A.2, we establish a crude bound as follows: Step 3: The third step refines the crude bound (87). For a given δ 5 ∈ (0, 1), let β * (δ) = γ (≤ β) be the largest non-negative integer such that Note that the LHS of (88) is a strictly increasing function of γ (since δ 5 ∈ (0, 1)) and the RHS is a strictly decreasing function of γ, and the LHS is smaller than the RHS for any δ 5 ∈ (0, 1) when γ = 0 (to see this, note that LHS = 3 log where "log" is the natural logarithm.). Therefore, the largest non-negative solution β * (δ) = γ (≤ β) to (88) exists (i.e., β * (δ) is well defined). Case (i): For a given δ, when β > β * (δ), we can always bound log N ∞ δ, S † β+2 by log 21 + 2 max 0, log 4C . Case (ii): For a given δ, when β = β * (δ), the LHS of (88) is dominated by the RHS and therefore, we can bound log N ∞ δ, S † β+2 by 2 where f (0) = f . Then for all x, x ′ ∈ [0, 1], we have When β = 1, note that When β = 2, we have After trying β = 1, ..., 4, we observe the following pattern: ). (f (y(x))) 3 + 2f (2) (y(x)).f (1) (y(x) ). (f (y(x))) 2 + 2f (2) (y(x)).f (1) (y(x) ). (f (y(x))) 2 + f (1) (y(x)) In what follows, we derive b k s (for all k = 1, ..., β + 1) such that and L β+1 such that f (a 1 ,··· ,a k ) (x) := f (a 1 ) (y(x)).f (a 2 ) (y(x)). · · · .f (a k ) (y(x)), where a 1 + a 2 + · · · + a k = k − 1 for all k = 1, ..., β + 1, and a i ≥ 0 are all integers. We first show that ...,a j−1 ,a j +1,0,a j+1 ,. ..,a k ) (x). The equality (96) follows from the derivations below: Now by induction on k, we show that if f is β−times differentiable, then for each 1 ≤ k ≤ β + 1, we have , and for all i in (a i 1 , · · · , a i k ), a i 1 + . . . For the base case, k = 1 ⇒ y where a 1 = 0. Now assume k ≤ β + 1 and the induction hypothesis holds for k − 1. Then Notice that a i 1 , . . . , a i j−1 , a i j + 1, 0, a i j+1 , . . . , a i k−1 has exactly k terms, and adds up to a i 1 + . . . 
+ a i k−1 + 1 + 0 = k − 2 + 1 + 0 = k − 1, and in total there are (k − 2)!(k − 1) = (k − 1)! terms. This completes the induction. By (89), we have y (k) (t) ≤ (k − 1)!. To show the second claim y (β+1) (x) − y (β+1) (x ′ ) ≤ (β + 1)!|x − x ′ |, we use (97). Note that and the third line in the above follows from (92) and (93). Hence, For β = 1, note that (1) x (x, y(x)) ≤ 2 and y (2) (x) − y (2) x In what follows, we derive b k s (for all k = 1, ..., β + 1) such that For a β−times differentiable function f, define where the second equality comes from (98); that is, Given (99), now we show that satisfies the following properties: (1) the summation has at most 2 k−1 (k − 1)! terms for each 1 ≤ k ≤ β + 1, and (2) ∀i, a i So the total number of terms in y (k+1) ≤ 2k(number of terms in y k ) ≤ 2k(k − 1)!2 k−1 . This proves the induction hypothesis. By the assumption that |f (x, y(x))| ≤ 1, we have To show the second claim and we have As a consequence, we obtain Let us define g = h (i) ∈ S † β+2−i for 0 ≤ i ≤ β + 1 and the above implies that To bound N ∞ δ, S † β+2 from above, we fix δ > 0 and x ∈ (0, 1). Suppose that for some δ 0 , . . . , δ β+1 > 0, h, g ∈ S † β+2 satisfy We can bound |h(x + ∆) − g(x + ∆)| from above for some ∆ such that x + ∆ ∈ (0, 1) as follows: If |∆| ≤ ( δ 5 ) In other words, by considering a grid of points ( δ 5 ) where ⌊x⌋ means the largest integer smaller than or equal to x. Our earlier argument implies that the number of distinct sets H (h 0 ) with h 0 ranging over S † β+2 bounds the δ−covering number of S † β+2 from above. Note that for i = 1, . . . , s and k = 0, . . . , β + 1, H (h 0 ) depends on h (k) 0 (x i ) /δ k only and the number of distinct sets H (h 0 ) is bounded above by the cardinality of Starting from x 1 , let us count the number of possible values of the vector Now we move to x 2 . Given the values of h (k) (x 1 ) /δ k , 0 ≤ k ≤ β + 1 , we count the number of possible values of the vector For each 0 ≤ k ≤ β + 1, we define Let us fix 0 ≤ i ≤ β + 1. Applying (101) with x = x 1 and ∆ = x 2 − x 1 yields 2 ). Therefore, given the values of h (k) (x 1 ) /δ k , 0 ≤ k ≤ β + 1 , h (i) (x 2 ) takes values in an interval whose length is no greater than 2δ i . Therefore, the number of possible values of h (k) (x 2 ) δ k , 0 ≤ k ≤ β + 1 is at most 2. The same argument goes through when x 1 and x 2 are replaced with x j and x j+1 for any j = 1, ..., s − 1. This result along with (104) gives where I is defined in (103). Observe that the choicer n = δ (γ * ) =: δ * solves (80) because The second line is argued as follows: When δ = δ * , the critical smoothness parameter is γ * and log N ∞ δ,F (2019), in terms of (31), we have with probability at least 1 − exp −nδ * 2 2σ 2 . Integrating the tail bound yields We now show the part T 2 . For a given δ > 0, let us consider the smallest δ 2Cmax −covering f 1 , ..., f N (w.r.t. the sup-norm) of S β+1, 2 (1, Ξ) and the smallest δ 2Cmax −covering y 0,1 , ..., y 0,N ′ for the interval [−C 0 , C 0 ] where the initial value lies. By Theorem IV and arguments similar to those in the proof for Theorem 2.1(i), for any f,f ∈ S β+1, 2 (1, Ξ) and initial values y 0 ,ỹ 0 ∈ [−C 0 , C 0 ], we can find some f i , f j ∈ f 1 , ..., f N and y 0,i ′ , y 0,j ′ ∈ y 0,1 , ..., y 0,N ′ such that where y,ỹ, y (i,i ′ ) , and y (j,j ′ ) are solutions to the ODE associated with {f, y 0 }, f ,ỹ 0 , f i , y 0,i ′ and f j , y 0,j ′ , respectively. Thus, we obtain forms a δ−cover ofF in terms of F = Y. Proof. 
When $f \in S_{1,2}(1, \Xi)$, the critical smoothness parameter $\gamma^* = 0$. Hence, (43) is a standard oracle inequality (see, e.g., Theorem 13.13 in Wainwright, 2019). The terms $E_1(x_i)$ and $E_2(x_i)$ come from Theorem IV. Note that Theorem IV does not require $g$ to satisfy a Lipschitz condition. Proof. Like in Proposition 2.3, we write $\hat{y}_{R+1}(\cdot)$ accordingly. As a result, we have (107). Following the argument in the proof for Proposition 2.3, and continuing with this pattern until the $(R+1)$th iteration, we obtain (109). In particular, (109) holds for $x \in \{x_1, x_2, \ldots, x_n\}$. Let $\bar{b} = \frac{\bar{\alpha}}{1 - \bar{\alpha}}$. For a given $\delta > 0$, let us consider the smallest $\frac{\delta}{2\bar{b}}$-covering $\{f_1, \ldots, f_N\}$ (with respect to the sup-norm) of $S_{\beta+1,2}(1, \Xi)$; by (109), for any $f, \tilde{f} \in S_{\beta+1,2}(1, \Xi)$, we can find some $f_i$ and $f_j$ from the covering set $\{f_1, \ldots, f_N\}$ such that the associated solutions are $\delta$-close. Thus, $\{g_{f_1}, g_{f_2}, \ldots, g_{f_N}\} \times \{g_{f_1}, g_{f_2}, \ldots, g_{f_N}\}$ forms a $\delta$-cover of $\bar{\mathcal{F}}$ in terms of $\mathcal{F}$ defined in (108). For $\beta > 0$, we proceed as follows. By Theorem B.1, for any $y \in \mathcal{Y}$ with $Y_0 \in B_2(C_0)$, we can find an element (indexed by $i$) from the smallest $\frac{\delta}{L_{\max}}$-covering of $\mathcal{F}$ and an element (indexed by $i'$) from the smallest $\frac{\delta}{L_{\max}}$-covering of $B_2(C_0)$ such that $y$ and $y^{(i,i')}$ are $\delta$-close, where $y^{(i,i')}$ is a solution to the ODE associated with $f_i$ and the initial value $Y_{0,i'}$ from the covering sets. Consequently, we obtain a $\delta$-cover of $\mathcal{Y}$, and we conclude the claimed bound. Theorem IV (Gronwall inequality for first order ODEs). Consider the following pair of ODEs. Remark. Theorem IV is a slight modification of Theorem 2.1 in Howard (1998), which gives a variant of the famous Gronwall inequality (Gronwall, 1919). In the following result, we extend Theorem IV to higher order ODEs. Since $f(x)$ is non-decreasing, we obtain (119); and since $f$ is continuous, (119) implies that $F(x)$ is convex.

References

Barro, R. J. and Sala-i-Martin, X. (2004). Economic Growth. MIT Press.
Bartlett, P. L. and Mendelson, S. (2002). Gaussian and Rademacher complexities: risk bounds and structural results. Journal of Machine Learning Research.
Bass, F. M. (1969). A new product growth for model consumer durables. Management Science.
Bass, F. M. (2004). Comments on "A new product growth for model consumer durables": the Bass model. Management Science.
Campbell, D., et al. (2012). Parameter estimation in differential equation models with constrained states.
Coddington, E. A. and Levinson, N. (1955). Theory of Ordinary Differential Equations. McGraw-Hill.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability.
Gronwall, T. H. (1919). Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics.
Guntuboyina, A. and Sen, B. (2013). Covering numbers for convex functions. IEEE Transactions on Information Theory.
Howard, R. (1998). The Gronwall inequality. Lecture notes.
Kolmogorov, A. N. and Tikhomirov, V. M. (1959). ε-entropy and ε-capacity of sets in functional spaces. Uspekhi Matematicheskikh Nauk.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics.
Matoušek, J. (2002). Lectures on Discrete Geometry. Springer.
Ramsay, J. O., Hooker, G., Campbell, D., and Cao, J. (2007). Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society, Series B.
Steele, J. M. (1978). Empirical discrepancies and subadditive processes. Annals of Probability.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
Vynnycky, E. and White, R. (2010). An Introduction to Infectious Disease Modelling. Oxford University Press.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
Zhu, Y. (2017). Nonasymptotic analysis of semiparametric regression models with high-dimensional parametric coefficients. Annals of Statistics.

(i) ... for all $0 < k \le \beta + 1$; and (ii) assume that all partial derivatives $D^p$ of $f$ exist for all $p$ with $[p] = p_1 + p_2 \le \beta$, and for all $0 \le k \le \beta + 1$. Remark.
Lemma A.1 can easily handle situations where the bound "1" on the absolute values of the derivatives and the Lipschitz conditions is replaced with general constants. Lemma A.1(i) concerns the autonomous system (49) and Lemma A.1(ii) concerns the nonautonomous system (24). We first prove Lemma A.1(i) and then Lemma A.1(ii). Let us begin with some intuition for the autonomous system. When $\beta = 0$, $|y'(x)| = |f(y(x))| \le 1$ for all $x$, and $|y'(x) - y'(x')| = |f(y(x)) - f(y(x'))| \le |y(x) - y(x')| \le |x - x'|$. Hence, the claim holds for $\beta = 0$. Proof. By Lemma A.1(ii), we have shown that $\mathcal{Y} \subseteq S^{\dagger}_{\beta+2}$. The rest of the argument modifies the original proof for smooth classes in Kolmogorov and Tikhomirov (1959). For every function $h \in S^{\dagger}_{\beta+2}$ and $x, x + \Delta \in (0, 1)$, we have the expansion used in (101). Let us define (77). Because we are working with $\bar{\mathcal{F}}$ in (77) in terms of $\mathcal{F} = S^{\dagger}_{\beta+2}$, all the coefficients $2^{k-1}(k-1)!$ associated with the derivatives for $k = 1, \ldots, \beta+1$, the coefficient $2^{\beta+1}(\beta+1)!$ associated with the Lipschitz condition, as well as the function value itself are multiplied by 2. Slight modifications of the proof for Lemma A.2 give the corresponding bound. For every $\gamma \in \{0, \ldots, \beta\}$, let $\delta(\gamma) = c(\frac{\sigma^2}{n})^{\frac{\gamma+2}{2(\gamma+2)+1}}$ and $\gamma^* (\le \beta)$ be the largest non-negative integer such that (105) holds. Note that the LHS of (105) is a strictly increasing function of $\gamma^*$ (since $\frac{\delta(\gamma^*)}{7} \in (0, 1)$ and the larger $\gamma^*$ is, the more negative $\log \delta(\gamma^*)$ is) and the RHS is a strictly decreasing function of $\gamma^*$, and the LHS is smaller than the RHS for any $\frac{\delta(\gamma^*)}{7} \in (0, 1)$ when $\gamma^* = 0$ (to see this, note that LHS $= 3\log 2$). Therefore, the largest non-negative solution $\gamma^* (\le \beta)$ to (105) exists (i.e., $\gamma^*$ is well defined). Step 1: In Lemma A.1(i) (subsection A.6.2), we have established $\mathcal{Y} \subseteq AS^{\dagger}_{\beta+2}$, where $AS^{\dagger}_{\beta+2}$ denotes the class of functions such that any function $h \in AS^{\dagger}_{\beta+2}$ satisfies the following properties: • $h$ is continuous on $[0, 1]$ and differentiable $\beta + 1$ times; • $|h(x)| \le C$, and $|h^{(k)}(x)| \le (k-1)!$ for all $k = 1, \ldots, \beta+1$ and $x \in [0, 1]$. Step 2: Following the argument in the proof for Lemma A.2, we establish a crude bound (111). Step 3: The third step refines the crude bound (111). For a given $\frac{\delta}{5} \in (0, 1)$, let $\beta^*(\delta) = \gamma$ be the largest non-negative integer such that the analogue of (26) holds. As for Theorem 2.4, we replace (34) with the analogous condition involving $\log 21 + \max\{0, \log 36C\}$. Proof. Let $L_{\max} = \sup_{x \in [0, \alpha]} \{e^{x L_2 \sqrt{m+1}} [1 + \int_0^x e^{-s L_2 \sqrt{m+1}}\, ds]\}$. For a given $\delta > 0$, we consider the smallest $\frac{\delta}{L_{\max}}$-covering of $\mathcal{F}$ with respect to the sup-norm. We also consider the smallest $\frac{\delta}{L_{\max}}$-covering of $B_2(C_0) := \{\theta \in \mathbb{R}^m : |\theta|_2 \le C_0\}$ (where the initial values lie) with respect to the $l_2$-norm. Note that the standard volumetric argument yields the corresponding cardinality bounds. The following arguments extend those in Howard (1998). Proof. We show (i), and the argument for (ii) is almost identical. Letting $0 \le x_1 < x_2$, we have $F(x_2) - F(\frac{x_1 + x_2}{2}) = \int_{\frac{x_1+x_2}{2}}^{x_2} f(t)\, dt$.
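As a closing illustration of how the critical inequality (80) is used in the proofs, the sketch below solves a schematic version of it by bisection for the parametric entropy $\log N(\delta) = K\log(1 + 2C/\delta)$ from (12); the normalization of the entropy integral and all constants are schematic assumptions, chosen only to show that the solution scales like $\sigma\sqrt{K/n}$.

```python
# Schematic critical radius: solve G(r) = r^2 / sigma by bisection, where
# G(r) = n^(-1/2) * int_0^r sqrt(K log(1 + 2C/t)) dt (midpoint rule). The
# normalization is schematic; the point is the scaling r_n ~ sigma sqrt(K/n).
import numpy as np

def critical_radius(n, K, C=1.0, sigma=1.0, slices=1000):
    def G(r):
        m = (np.arange(slices) + 0.5) * r / slices
        return np.sum(np.sqrt(K * np.log1p(2 * C / m))) * (r / slices) / np.sqrt(n)
    lo, hi = 1e-8, 10.0
    for _ in range(100):                 # G(r) - r^2/sigma changes sign once
        mid = 0.5 * (lo + hi)
        if G(mid) > mid**2 / sigma:
            lo = mid                     # entropy still dominates: enlarge r
        else:
            hi = mid
    return hi

for n in [10**3, 10**5, 10**7]:
    print(f"n={n:.0e}: r_n = {critical_radius(n, K=5):.5f}, "
          f"sigma*sqrt(K/n) = {np.sqrt(5 / n):.5f}")
```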