key: cord-0504477-6yeeuek4 authors: Zhao, Ruixuan; He, Xin; Wang, Junhui title: Learning linear non-Gaussian directed acyclic graph with diverging number of nodes date: 2021-11-01 journal: nan DOI: nan sha: 32f276ed28fa62672ebdb9abef3e654e93b30844 doc_id: 504477 cord_uid: 6yeeuek4 Acyclic model, often depicted as a directed acyclic graph (DAG), has been widely employed to represent directional causal relations among collected nodes. In this article, we propose an efficient method to learn linear non-Gaussian DAG in high dimensional cases, where the noises can be of any continuous non-Gaussian distribution. This is in sharp contrast to most existing DAG learning methods assuming Gaussian noise with additional variance assumptions to attain exact DAG recovery. The proposed method leverages a novel concept of topological layer to facilitate the DAG learning. Particularly, we show that the topological layers can be exactly reconstructed in a bottom-up fashion, and the parent-child relations among nodes in each layer can also be consistently established. More importantly, the proposed method does not require the faithfulness or parental faithfulness assumption which has been widely assumed in the literature of DAG learning. Its advantage is also supported by the numerical comparison against some popular competitors in various simulated examples as well as a real application on the global spread of COVID-19. Directed acyclic graph (DAG) provides an elegant way to represent directional or causal structures among collected nodes, which finds applications in a broad variety of domains, including genetics (Sachs et al., 2005) , finance (Sanford and Moosa, 2012) and social science (Newey et al., 1999) . In recent years, learning the DAG structures from observed data has attracted tremendous attention from both academia and industries. In literature, various structure learning methods have been proposed to recover the Markov equivalence class (Spirtes et al., 2000; Peters et al., 2017 ) of a DAG, which can be roughly categorized into three classes. The first class is the constraint-based method (Spirtes et al., 2000; Kalisch and Bühlmann, 2007) , which uses some local conditional independence criterion to test pairwise causal relations. The second class, referred as the score-based method (Chickering, 2002; Zheng et al., 2018) , attempts to optimize some goodness-of-fit measures among the possible graph space. The last class (Tsamardinos et al., 2006; Nandy et al., 2018) combines the constraint-based method and score-based method. Although success has been widely reported, most aforementioned methods can only recover the Markov equivalence class and their computational burden remains a severe bottleneck. Recently, efforts have been made to pursuit exact DAG recovery. Specifically, Peters and Bühlmann (2014) shows that a linear Gaussian DAG is identifiable under the equal variance assumption, and Ghoshal and Honorio (2018) and Park (2020) relax the Gaussianity assumption but still require an explicit order among noise variances. Under these assumptions, a number of learning methods are proposed to recovery the exact DAG structure (Chen et al., 2019; Yuan et al., 2019; Li et al., 2020; Park et al., 2021) , yet these assumptions of Gaussianity or ordered noise variances are often difficult to verify in practice. 
Linear non-Gaussian DAG, also known as the linear non-Gaussian acyclic model (LiNGAM) in the literature (Shimizu et al., 2006), relaxes the Gaussianity assumption, and its identifiability does not require any additional noise variance assumption. Although linear non-Gaussian DAGs can accommodate more flexible distributions, they have received only limited attention in the literature (Shimizu et al., 2006, 2011; Hyvarinen and Smith, 2013; Wang and Drton, 2020). Specifically, Shimizu et al. (2006) propose an iterative search algorithm that recovers the causal ordering of a linear non-Gaussian DAG by using linear independent component analysis (ICA) and permutation. Subsequently, Shimizu et al. (2011) propose a multiple-step algorithm that learns the linear non-Gaussian DAG via pairwise statistics, which is further extended by Hyvarinen and Smith (2013) to iteratively identify pairwise causal orderings via likelihood ratio tests. The statistical properties of these methods remain largely unknown, not to mention their expensive computational cost. Most recently, Wang and Drton (2020) propose a modified direct learning algorithm for linear non-Gaussian DAGs in high dimensional cases, which sequentially recovers the causal ordering with a moment-based criterion and reconstructs the directed structure with hard-thresholding. Yet it requires the parental faithfulness condition, and its computational complexity is exponential in the maximum in-degree.

In this paper, we propose a method to learn linear non-Gaussian DAGs in high dimensional cases based on a novel concept of topological layer, which assures that any DAG can be reformulated into a unique topological structure with T layers, where the parents of a node must belong to its upper layers, so that acyclicity is naturally guaranteed. More importantly, we show that the topological layers can be exactly reconstructed via precision matrix estimation and an independence testing procedure in a bottom-up fashion, and the parent-child relations can be directly obtained from the estimated precision matrix. These results are obtained without requiring the popular faithfulness (Uhler et al., 2013; Peters et al., 2017) or parental faithfulness assumption (Wang and Drton, 2020). The constructive proof also motivates an efficient learning algorithm for the proposed method, whose complexity is much smaller than that of most existing linear non-Gaussian DAG learning methods (Shimizu et al., 2006, 2011; Wang and Drton, 2020).

The main contribution of this paper is the development of a novel and efficient method to learn linear non-Gaussian DAGs in high dimensional cases, together with an investigation of its statistical guarantees in terms of exact causal structure recovery. More precisely, we show that the topological layers of the DAG can be exactly reconstructed in Theorem 1 and Corollary 1, and that the parent-child relations can be directly recovered along with the topological layers in Corollary 2. We connect the proposed learning method with precision matrix estimation by proving that the topological layers can be exactly reconstructed via precision matrix estimation in a bottom-up fashion, and that the parent-child relations can be obtained directly from the estimated precision matrix. The statistical guarantees of the proposed method using graphical Lasso and the distance covariance measure are established under sub-Gaussian and (4m)-th bounded moment noise distributions, respectively.
The established consistency results are governed by the sample size, the number of nodes, and the maximum cardinality of the Markov blankets (Peters et al., 2017). Most interestingly, the obtained results allow the number of nodes and the maximum cardinality of the Markov blankets to diverge with the sample size at some fast rate, which is particularly attractive in high-dimensional settings.

The rest of this paper is organized as follows. Section 2 introduces some background on linear non-Gaussian DAGs. Section 3 introduces the concept of topological layers and shows that the topological layers and the parent-child relations can be exactly reconstructed in a bottom-up fashion. Section 4 provides an efficient learning algorithm for linear non-Gaussian DAGs in high dimensional cases, and Section 5 establishes the reconstruction consistency of the proposed method under mild conditions. Numerical experiments on several simulated examples and one real application to the spread of COVID-19 are conducted in Section 6. Section 7 contains a brief discussion, and all the technical details are provided in the Appendix.

Consider a DAG $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N} = \{1, \ldots, p\}$ consists of a set of nodes associated with each coordinate of $x$, and $\mathcal{E} \subset \mathcal{N} \times \mathcal{N}$ consists of all the directed edges among the nodes. The directed edge from node $j$ to node $k$ is denoted as $j \to k$, indicating their parent-child relationship. For simplicity, we denote node $k$'s parents as $pa_k$, its children as $ch_k$, its descendants as $de_k$, its non-descendants as $nd_k$, and its Markov blanket as $mb_k = pa_k \cup ch_k \cup \{i \in pa_j \setminus \{k\} \mid j \in ch_k\}$. It is also assumed that $\mathcal{G}$ satisfies the Markov property (Spirtes et al., 2000), and thus $P(x)$ can be factorized as $P(x) = \prod_{k=1}^{p} P(x_k \mid x_{pa_k})$, where $x_{pa_k} = \{x_j : j \in pa_k\}$. Once each node $x_k$ is centered with mean zero, the graph structure in $\mathcal{G}$ can be embedded into a linear structural equation model (SEM),

$$x_k = \sum_{j \in pa_k} \beta_{kj} x_j + \epsilon_k, \quad k = 1, \ldots, p, \qquad (1)$$

where $\beta_{kj} \neq 0$ for any $j \in pa_k$, $\epsilon_k$ denotes a continuous non-Gaussian noise with variance $\sigma_k^2$, and $\epsilon_l \perp\!\!\!\perp \epsilon_k$ for any $l \neq k$. This independent noise condition further implies that $\epsilon_k \perp\!\!\!\perp x_l$ for any $l \notin de_k \cup \{k\}$. It is also often assumed that there is no unobserved confounding effect among the observed nodes in $\mathcal{G}$, which is known as the causal sufficiency condition in the literature (Spirtes et al., 2000). The SEM in (1) can be organized into the matrix form $x = Bx + \epsilon$, where $B = (\beta_{kj})_{k,j} \in \mathbb{R}^{p \times p}$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_p)^T$ is the noise vector with covariance matrix $\Omega = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_p^2\}$. Simple algebra yields that $x = (I - B)^{-1}\epsilon$ and $x_k = \sum_{j=1}^{p} a_{kj}\epsilon_j$, where $a_{kj}$ is the $(k,j)$-th element of $A = (I - B)^{-1}$, representing the total effect of $x_j$ on $x_k$. The SEM also implies that $\epsilon_j \perp\!\!\!\perp x_k$, and thus $a_{kj} = 0$, for any node $j \in de_k$. Moreover, the covariance matrix of $x$ is $\Sigma = (I - B)^{-1}\Omega(I - B)^{-T}$, and the corresponding precision matrix is $\Theta = \Sigma^{-1} = (I - B)^T \Omega^{-1} (I - B)$. In the sequel, we use $\Theta_{lk}$ to denote the $(l,k)$-th element of $\Theta$, and $\Theta_{-ll}$ to denote the $l$-th column of $\Theta$ without $\Theta_{ll}$.

In this section, we introduce a novel concept of topological layer, which allows us to convert a DAG into a unique topological structure. Particularly, given a DAG $\mathcal{G}$, we construct its topological structure by assigning each node to one and only one layer, based on its longest distance to one of the leaf nodes. Without loss of generality, we assume $\mathcal{G}$ has a total of $T$ layers, and $A_t$ denotes all the nodes contained in the $t$-th layer, for $t = 0, \ldots, T-1$.
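As a concrete check of these identities, the following minimal sketch verifies $\Theta = (I-B)^T\Omega^{-1}(I-B)$ numerically on a hypothetical 3-node chain $1 \to 2 \to 3$; the edge weights and noise variances are illustrative choices, not values from the paper.

```python
import numpy as np

p = 3
B = np.zeros((p, p))               # adjacency of a hypothetical chain 1 -> 2 -> 3
B[1, 0] = 0.8                      # beta_{21}
B[2, 1] = -0.6                     # beta_{32}
Omega = np.diag([1.0, 0.5, 2.0])   # illustrative noise variances sigma_k^2

I = np.eye(p)
A = np.linalg.inv(I - B)                             # total effects: x = A eps
Sigma = A @ Omega @ A.T                              # covariance of x
Theta = (I - B).T @ np.linalg.inv(Omega) @ (I - B)   # claimed precision matrix

assert np.allclose(np.linalg.inv(Sigma), Theta)      # Theta = Sigma^{-1}
```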
It is clear that $\cup_{t=0}^{T-1} A_t = \mathcal{N}$, and all the leaf nodes and isolated nodes in $\mathcal{G}$ belong to the lowest layer $A_0$. For each node $k \in A_t$, it follows from the layer construction that $pa_k \subset S_{t+1} = \cup_{d=t+1}^{T-1} A_d$, and thus acyclicity is automatically guaranteed. Note that $S_0 = \mathcal{N}$. Figure 1 illustrates a toy DAG in the left panel and its converted topological structure with three layers in the right panel.

Figure 1: A toy DAG and its topological structure with three layers.

In Figure 1, node 4 is an isolated node and node 3 has no child node, and thus they both belong to $A_0$. Node 3 has two parent nodes, where node 2 belongs to $A_1$ but node 1 belongs to $A_2$ due to the existence of the longer path $1 \to 2 \to 3$. It is clear that the concept of topological layer is general, and it can restructure any DAG in such a way that the causal ordering among the layers is uniquely determined. This is in sharp contrast to the idea of causal ordering in the literature (Shimizu et al., 2006, 2011; Wang and Drton, 2020), which only requires that each node be ranked behind its parents. The toy DAG in Figure 1 induces multiple possible causal orderings among the four nodes, such as $1 \to 2 \to 3 \to 4$, $1 \to 2 \to 4 \to 3$, $1 \to 4 \to 2 \to 3$, or $4 \to 1 \to 2 \to 3$. This indeterminacy of the causal ordering may cause unnecessary estimation instability and computational inefficiency in reconstructing the DAG structure.

To be self-contained, we first restate the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953), which is crucial for the reconstruction of the topological layers of a linear non-Gaussian DAG.

Lemma 1 (Darmois-Skitovitch, 1953) Define two random variables $u_1$ and $u_2$ as linear combinations of independent random variables $s_i$, $i = 1, \ldots, m$, such that $u_1 = \sum_{i=1}^{m} c_{1,i} s_i$ and $u_2 = \sum_{i=1}^{m} c_{2,i} s_i$. Then, if $u_1$ and $u_2$ are independent, all variables $s_i$ with $c_{1,i} c_{2,i} \neq 0$ are Gaussian distributed.

Lemma 1 shows that if the $s_i$'s are non-Gaussian distributed, it is impossible to construct two independent linear combinations of the $s_i$'s. This fact motivates us to use independence tests to identify the nodes in each layer in a bottom-up fashion.

Theorem 1 Suppose that $x = (x_1, \ldots, x_p)^T \in \mathbb{R}^p$ is generated from the linear SEM in (1) with precision matrix $\Theta$. For any $l \in \mathcal{N}$, we regress $x_l$ on all other nodes $x_{\mathcal{N}\setminus\{l\}}$ and denote the residual as $e_{l,\mathcal{N}}$. Then $l \in A_0$ if and only if $e_{l,\mathcal{N}} \perp\!\!\!\perp x_k$ for any $k \in \mathcal{N}\setminus\{l\}$.

Theorem 1 provides a sufficient and necessary condition to identify the nodes in $A_0$. After all the nodes in $A_0$ are identified, we can remove them from $\mathcal{N}$ and denote $S_1 = \mathcal{N} \setminus A_0$. Next, we apply a similar treatment to $S_1$ as in Theorem 1 to identify $A_1$, and then $A_2$, until all nodes are assigned to layers. Denote $\Theta_{S_t}$ as the precision matrix of $x_{S_t}$, with $\Theta_{S_0} = \Theta$. Further, denote $[\Theta_{S_t}]_{lk}$ as the element of $\Theta_{S_t}$ corresponding to nodes $l$ and $k$, and $[\Theta_{S_t}]_{-ll}$ as the column of $\Theta_{S_t}$ corresponding to node $l$ without $[\Theta_{S_t}]_{ll}$. Corollary 1 summarizes the reconstruction of all the topological layers of a linear non-Gaussian DAG.

Corollary 1 Suppose that all the conditions in Theorem 1 are satisfied and the layers $A_0, \ldots, A_{t-1}$ have been reconstructed. For any $l \in S_t$, we regress $x_l$ on $x_{S_t\setminus\{l\}}$ and denote the residual as $e_{l,S_t}$. Then we have $l \in A_t$ if and only if $e_{l,S_t} \perp\!\!\!\perp x_k$ for any $k \in S_t\setminus\{l\}$.

The proof of Corollary 1 is similar to that of Theorem 1, with the slight modification of replacing $\mathcal{N}$ with $S_t$. The reconstruction of the topological layers follows immediately by applying Corollary 1 repeatedly in the sense of mathematical induction.
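To make the layer construction concrete, the short sketch below computes each node's layer as the length of its longest directed path down to a leaf, reproducing the assignment for the toy DAG of Figure 1; the children lists are transcribed from the figure.

```python
from functools import lru_cache

# Children lists transcribed from the toy DAG in Figure 1:
# edges 1 -> 2, 1 -> 3, 2 -> 3, and node 4 isolated.
children = {1: [2, 3], 2: [3], 3: [], 4: []}

@lru_cache(maxsize=None)
def layer(node):
    """Layer = length of the longest directed path from `node` down to a leaf."""
    return 0 if not children[node] else 1 + max(layer(c) for c in children[node])

print({k: layer(k) for k in children})  # -> {1: 2, 2: 1, 3: 0, 4: 0}
```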
Furthermore, the parent sets of each node in the DAG can be determined during the reconstruction of the topological layers as well.

Corollary 2 Suppose that all the conditions in Theorem 1 are satisfied. For any $l \in A_t$ and $k \in S_{t+1}$, we have $k \in pa_l$ if and only if $[\Theta_{S_t}]_{lk} \neq 0$.

Corollary 2 follows immediately from Corollary 1, and assures that the parent-child relations can also be sequentially reconstructed along with the topological layers. It is important to point out that neither Corollary 1 nor Corollary 2 requires the popularly adopted faithfulness assumption (Uhler et al., 2013; Peters et al., 2017) or the parental faithfulness assumption (Wang and Drton, 2020). Consider a simple linear non-Gaussian DAG generated as

$$x_1 = \epsilon_1, \quad x_2 = \beta_{21} x_1 + \epsilon_2, \quad x_3 = \beta_{31} x_1 + \beta_{32} x_2 + \epsilon_3, \quad x_4 = \epsilon_4, \qquad (3)$$

where each $\epsilon_l$ is a non-Gaussian distributed noise and $\epsilon_l \perp\!\!\!\perp \epsilon_k$ for any $l \neq k$. Clearly, both the faithfulness and the parental faithfulness assumptions are violated when $\beta_{32}\beta_{21} + \beta_{31} = 0$. Yet such a linear non-Gaussian DAG can be successfully reconstructed with the proposed criteria in Section 3.1. We first regress $x_l$ on $x_{-l}$ and obtain the residuals $e_{l,\mathcal{N}}$ for $l = 1, \ldots, 4$. It is easy to see that $e_{3,\mathcal{N}} = \epsilon_3$ and $e_{4,\mathcal{N}} = \epsilon_4$, and each of them is independent of all the other three nodes. For $e_{1,\mathcal{N}}$ and $e_{2,\mathcal{N}}$, it follows from a similar treatment as in the proof of Theorem 1 that they are linear combinations of the noises, whose coefficients $M^{(l)}_k$ can be regarded as the partial correlation between nodes $k$ and $l$ given all the other nodes in $\mathcal{N}$. In particular, $M^{(1)}_3$ and $M^{(2)}_3$ are nonzero, and thus both $e_{1,\mathcal{N}}$ and $e_{2,\mathcal{N}}$ are dependent on $x_3$, leading to $A_0 = \{3, 4\}$ and $S_1 = \{1, 2\}$. It also follows from Corollary 2 that $pa_3 = \{1, 2\}$ and $pa_4 = \emptyset$. Next, we repeat the above procedure for $x_1$ and $x_2$, and obtain $e_{1,S_1}$ and $e_{2,S_1} = \epsilon_2$. Clearly, $e_{2,S_1}$ is independent of $x_1$, whereas $e_{1,S_1}$ is dependent on $x_2$ due to the fact that $M^{(1)}_2 \neq 0$. Therefore, we have $A_1 = \{2\}$ and $pa_2 = \{1\}$. Finally, the last remaining node 1 is assigned to $A_2$, and the topological layers and directed structure of the DAG in (3) are perfectly reconstructed.

Given a sample matrix $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ with $x_l = (x_{1l}, \ldots, x_{nl})^T$, we first obtain the estimated precision matrix $\widehat{\Theta}$, and then compute the residuals $\widehat{e}_{l,\mathcal{N}} = x_l + X_{-l}\widehat{\Theta}_{-ll}/\widehat{\Theta}_{ll}$, where $X_{-l}$ denotes the sample matrix without $x_l$. To invoke Theorem 1, we proceed to test the independence between $\widehat{e}_{l,\mathcal{N}}$ and all the nodes in $\mathcal{N}\setminus\{l\}$, and estimate $A_0$ as $\widehat{A}_0 = \{l : \widehat{e}_{l,\mathcal{N}}$ is tested to be independent of $x_k$ for any $k \in \mathcal{N}\setminus\{l\}\}$. Suppose that the estimated layers $\widehat{A}_0, \ldots, \widehat{A}_{t-1}$ have been obtained; we denote $\widehat{S}_t = \mathcal{N} \setminus \{\cup_{d=0}^{t-1}\widehat{A}_d\}$ and estimate the corresponding precision matrix $\widehat{\Theta}_{\widehat{S}_t}$. The residuals can be computed as $\widehat{e}_{l,\widehat{S}_t} = x_l + X_{\widehat{S}_t\setminus\{l\}}[\widehat{\Theta}_{\widehat{S}_t}]_{-ll}/[\widehat{\Theta}_{\widehat{S}_t}]_{ll}$, where $X_{\widehat{S}_t\setminus\{l\}}$ denotes the sample matrix corresponding to $\widehat{S}_t\setminus\{l\}$. By Corollary 1, $A_t$ can be estimated as $\widehat{A}_t = \{l : \widehat{e}_{l,\widehat{S}_t}$ is tested to be independent of $x_k$ for any $k \in \widehat{S}_t\setminus\{l\}\}$. Corollary 2 then implies that, for each $l \in \widehat{A}_t$, node $k$ is included in the estimated parent set $\widehat{pa}_l$ whenever $[\widehat{\Theta}_{\widehat{S}_t}]_{lk} \neq 0$. The procedure is repeated until $|\widehat{S}_t| \leq 1$, and we set $\widehat{T} = t + 1$ and $\widehat{A}_t = \widehat{S}_t$ if $|\widehat{S}_t| = 1$, and $\widehat{T} = t$ otherwise. The details of the developed DAG learning method are summarized in Algorithm 1. Note that the performance of the proposed method relies on the accuracy of the precision matrix estimation in Step 2a and the independence test procedure in Step 2b.
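A condensed sketch of this bottom-up procedure is given below. It is a simplified stand-in for Algorithm 1: the precision matrix is estimated by inverting the sample covariance (so it assumes the sample size exceeds the number of remaining nodes) rather than by graphical Lasso, and independence is judged by a naive permutation test on the squared distance covariance; all function names are ours.

```python
import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance of two 1-d samples."""
    def centered(z):
        D = np.abs(z[:, None] - z[None, :])
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    return float((centered(x) * centered(y)).mean())

def looks_independent(x, y, n_perm=200, alpha=0.01, rng=np.random.default_rng(0)):
    """Crude permutation test: fail to reject independence at level alpha."""
    stat = dcov2(x, y)
    exceed = sum(dcov2(x, rng.permutation(y)) >= stat for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1) > alpha   # permutation p-value

def recover_layers(X):
    """Peel off layers bottom-up, in the spirit of Theorem 1 / Corollary 1."""
    nodes = list(range(X.shape[1]))
    layers = []
    while len(nodes) > 1:
        Xs = X[:, nodes]
        Theta = np.linalg.inv(np.cov(Xs, rowvar=False))  # stand-in for graphical Lasso
        layer = []
        for i, node in enumerate(nodes):
            rest = [j for j in range(len(nodes)) if j != i]
            # residual of regressing x_l on the remaining nodes: x_l + X_{-l} Theta_{-ll}/Theta_{ll}
            resid = Xs[:, i] + Xs[:, rest] @ Theta[rest, i] / Theta[i, i]
            if all(looks_independent(resid, Xs[:, j]) for j in rest):
                layer.append(node)
        if not layer:        # guard against an empty round in finite samples
            break
        layers.append(layer)
        nodes = [v for v in nodes if v not in layer]
    if nodes:
        layers.append(nodes)
    return layers            # layers[0] estimates A_0, layers[1] estimates A_1, ...
```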
Many existing methods in the literature can be adopted, such as the graphical Lasso algorithm (Friedman et al., 2008; Ravikumar et al., 2011) or the constrained sparse estimation method (Cai et al., 2011) for precision matrix estimation, and the Hilbert-Schmidt independence criterion (Gretton et al., 2008), the distance covariance measure (Szekely et al., 2007; Szekely and Rizzo, 2009) or the ball divergence (Pan et al., 2018) for independence testing. For illustration, we adopt the graphical Lasso algorithm and the distance covariance measure in the proposed method, which yields satisfactory performance in all the numerical experiments in Section 6.

The computational complexity of Algorithm 1 is largely determined by the precision matrix estimation and the independence tests in Steps 2a and 2b. Particularly, the complexity of Step 2a using the graphical Lasso algorithm (Friedman et al., 2008) is of order $O\big((\sum_{k=t}^{T-1}|A_k|)^3\big) = O(|S_t|^3)$, with $|A_t|$ being the cardinality of $A_t$, and the complexity of Step 2b using the distance covariance measure (Szekely et al., 2007; Szekely and Rizzo, 2009) is of order $O(n^2|S_t|(|S_t|-1))$. Therefore, the computational complexity of Algorithm 1 in learning a random linear non-Gaussian DAG with $T$ layers is of order $O\big(\sum_{t=0}^{T-1}|S_t|^3 + n^2\sum_{t=0}^{T-1}|S_t|(|S_t|-1)\big)$. In the worst-case scenario with $T = p$, the computational complexity of Algorithm 1 becomes $O(p^4 + n^2p^3)$; and when the DAG is a shallow hub graph with $T = 2$, the computational complexity of Algorithm 1 becomes $O(p^3 + n^2p(p-1))$. It is important to remark that the computational complexity of the proposed method is significantly less than that of most existing methods for learning linear non-Gaussian DAGs. For example, the MDirect method (Wang and Drton, 2020) is one of the most recently proposed methods in the literature, and its computational complexity in the worst-case scenario is at least of order $O(J^2(n+J)p^{J+1})$, where $J$ denotes the maximum in-degree of the DAG. It is exponential in $J$, and thus MDirect may suffer serious computational challenges when some nodes have a relatively large number of parents.

In this section, we establish the asymptotic consistency of the proposed method with the graphical Lasso and the distance covariance measure in terms of exact DAG recovery. The consistency results are established with explicit dependence on the sample size $n$, the number of nodes $p$, and the maximum cardinality of the Markov blankets $d = \max_{l \in \mathcal{N}}|mb_l|$. Note that the support of $\Theta_{-ll}$ is a subset of the Markov blanket of node $l$ (Park et al., 2021). For simplicity, denote $\sigma^2_{\max} = \max_{l \in \mathcal{N}}\sigma^2_l$, $\sigma^2_{\min} = \min_{l \in \mathcal{N}}\sigma^2_l$, $\beta_{\max} = \max_{l \in \mathcal{N}, k \in pa_l}|\beta_{lk}|$ and $\beta_{\min} = \min_{l \in \mathcal{N}, k \in pa_l}|\beta_{lk}|$. Further, denote $f(n) = \Omega(g(n))$ if there exists a positive constant $a$ such that $f(n) \geq a g(n)$ for all sufficiently large $n$. The following technical assumptions are made to establish the exact DAG recovery.

Assumption 1 There exists some constant $\psi \in (0, 1]$ such that $\max_{e \in \mathcal{S}^c} \|\Gamma_{e\mathcal{S}}(\Gamma_{\mathcal{S}\mathcal{S}})^{-1}\|_1 \leq 1 - \psi$, where $\mathcal{S}$ denotes the support set of $\Theta$, $\Gamma = \Sigma \otimes \Sigma$ with $\otimes$ denoting the Kronecker product, and $\Gamma_{(l,k),(j,m)} := \Gamma_{lp+k, jp+m} = \Sigma_{lk}\Sigma_{jm}$.

Assumption 2 For any $j \in \mathcal{N}$, $\epsilon_j/\sigma_j$ follows a sub-Gaussian distribution with parameter $\gamma$.

Assumption 1 limits the correlation between the zero and non-zero elements in $\Gamma$, which is analogous to the irrepresentable condition in Zhao and Yu (2006) and the incoherence condition in Ravikumar et al. (2011). Assumption 2 characterizes the noise distribution and implies that $x_j/\sqrt{\Sigma_{jj}}$ also follows a sub-Gaussian distribution with parameter $\gamma$.

Lemma 2 Suppose Assumptions 1 and 2 hold, and $n = \Omega(d^2 \log p)$.
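As a quick sanity check of the two special cases, taking the per-layer costs $O(|S_t|^3)$ and $O(n^2|S_t|(|S_t|-1))$ stated above as given:

```latex
% Worst case: T = p layers, so |S_t| = p - t and
\sum_{t=0}^{p-1} (p-t)^3 = O(p^4), \qquad
n^2 \sum_{t=0}^{p-1} (p-t)(p-t-1) = O(n^2 p^3);
% Hub graph: T = 2 with |S_0| = p and |S_1| = 1, so the total cost is
O(p^3) + O\big(n^2 p(p-1)\big).
```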
For any $l \in \mathcal{N}$, there exist some positive constants $a_1$, $a_2$ and $\tau > 4$ such that, with probability at least $1 - a_2 p^{2-\tau}$, the estimated residuals $\widehat{e}_{l,\mathcal{N}}$ concentrate around the true residuals $e_{l,\mathcal{N}}$ at a rate governed by $L_2 = \sigma^{-2}_{\max}\min\{1, (\sigma_{\min}/\sigma_{\max})^2 L_1^{-1}\}$ with $L_1 = \beta_{\max}(1 + d\beta_{\max})$, provided that the regularization parameter in estimating $\Theta$ is $\lambda_{n,0} \propto \sqrt{\log p/n}$.

Lemma 2 builds a bridge between the estimated precision matrix and the estimated residuals, and plays a crucial role in establishing the consistency of the exact DAG recovery.

Assumption 3 $\Lambda_{\max}\big(\frac{1}{n}X^TX\big) \leq \lambda_{\max}$, where $\Lambda_{\max}(\cdot)$ denotes the maximum eigenvalue of a matrix.

Assumption 4 For any $t = 0, \ldots, T-1$, the squared distance covariance separating the nodes in $S_t \setminus A_t$ from those in $A_t$ is bounded below by $\rho^2_{n,t}$, where $\rho^2_{n,t} = \Omega(\max\{d^3 n^{-1/2}\log^{1/2}(\max\{|S_t|, n\}), n^{-\eta}\})$ with $0 < \eta < \frac{1}{2}$, and $|S_t|$ denotes the cardinality of $S_t$.

Assumption 3 is a standard regularity condition on the sample covariance matrix of $X$ in $\mathcal{N}$, which also regulates the sample covariance matrix of $X_{S_t}$, since $\Lambda_{\max}\big(\frac{1}{n}X_{S_t}^TX_{S_t}\big) \leq \Lambda_{\max}\big(\frac{1}{n}X^TX\big)$ for any $t = 0, \ldots, T-1$. Assumption 4 assures that the distance covariance is sufficient to discriminate nodes in $A_t$ from those in $S_t\setminus A_t$. Similar assumptions have also been employed in Kalisch and Bühlmann (2007) and Ha et al. (2016).

Theorem 2 (Consistency of $A_0$) Suppose that all the assumptions in Lemma 2 as well as Assumptions 3 and 4 hold, and $n = \Omega(d^6\log p)$. Then there exist some positive constants $a_3$, $a_4$ and $a_5$ such that $\widehat{A}_0 = A_0$ with probability tending to one, provided that the significance level of the independence test is set as $\alpha_n$, with $\alpha_n \to 0$ as $n$ diverges.

Theorem 2 shows that the lowest layer $A_0$ in $\mathcal{G}$ can be exactly recovered by the proposed method with high probability. After reconstructing $A_0$, the selection consistency of the parent sets of the nodes in $A_0$ can also be established, following from the fact that $\min_{l \in A_0, k \in pa_l}|\Theta_{lk}| = \min_{l \in A_0, k \in pa_l}\sigma^{-2}_l|\beta_{lk}| \geq \sigma^{-2}_{\max}\beta_{\min}$ and a similar treatment as in Theorem 2 of Ravikumar et al. (2011).

Corollary 3 Suppose that all the assumptions in Theorem 2 are satisfied, and $n = \Omega((d^6 + \beta^{-2}_{\min})\log p)$. Then there exists some positive constant $a_6$ such that the parent sets of all nodes in $A_0$ are exactly recovered with probability tending to one.

Further, let $S_1 = \mathcal{N} \setminus \widehat{A}_0$, and we apply a similar treatment to $S_1$ to establish consistency in estimating $A_1$, as well as the other upper layers. In the spirit of mathematical induction, we arrive at the following theorem on the asymptotic estimation consistency of $\mathcal{G}$.

Theorem 3 (Consistency of $\mathcal{G}$) Suppose that all the assumptions in Corollary 3 are satisfied, and $n = \Omega\big(T^{1/(\tau-4)}(d^6 + \beta^{-2}_{\min})(\log(\max\{p, n\}))^{3/(1-2\eta)}\big)$. Then $\widehat{\mathcal{G}} = \mathcal{G}$ with probability tending to one, provided that the regularization parameter in estimating $\Theta_{S_t}$ is $\lambda_{n,t} \propto \sqrt{\log(\max\{|S_t|, n\})/n}$.

Theorem 3 ensures that the linear non-Gaussian DAG $\mathcal{G}$ can be consistently recovered by the proposed method even in the high dimensional setting. Specifically, with all other terms fixed, the consistency of exact DAG recovery holds true when $\log(p) = o(n^{(1-2\eta)/3})$ with $0 < \eta < 1/2$. This is in sharp contrast with the result in Wang and Drton (2020), which requires $\log(p) = o(n^{1/(2K)})$, where $K \geq 3$ denotes the order of the moment-based statistic in Wang and Drton (2020). Theorem 3 can be further extended by relaxing the sub-Gaussian noise assumption to, for example, a noise distribution with $(4m)$-th bounded moment; that is, $\max_j E((\epsilon_j/\sigma_j)^{4m}) \leq K_m$ for a positive integer $m$ and a positive constant $K_m$.
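To read the dimension regime concretely, fixing $\eta = 1/4$ (an arbitrary admissible choice) gives:

```latex
\log p = o\big(n^{(1-2\eta)/3}\big)\,\Big|_{\eta = 1/4}
\;\Longrightarrow\;
\log p = o\big(n^{1/6}\big),
\qquad \text{i.e., } p = \exp\{o(n^{1/6})\}.
```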
Corollary 4 Suppose that all the assumptions in Theorem 3 are satisfied, except that Assumption 2 is relaxed to a noise distribution with $(4m)$-th bounded moment, Assumption 4 holds with $\rho^2_{n,t} = \Omega\big(\max\{d^3 n^{-\frac{1}{2m}}, n^{-\eta}\}\big)$ for some constants $4 < \tau < m + 4$ and $0 < \eta < \frac{1}{2}$, and the sample size requirement is strengthened accordingly. Then the exact DAG recovery in Theorem 3 continues to hold.

Corollary 4 establishes the asymptotic DAG recovery of the proposed method under $(4m)$-th bounded moment noise distributions, which requires a relatively larger sample size compared with that in Theorem 3. More importantly, both Theorem 3 and Corollary 4 are established without assuming the parental faithfulness assumption of Wang and Drton (2020), indicating a more general applicability of the proposed method.

In this section, we examine the numerical performance of the proposed method, denoted as TL, and compare it against some popular DAG learning methods, including the direct high-dimensional learning algorithm (MDirect; Wang and Drton, 2020), the pairwise learning algorithm (Pairwise; Shimizu et al., 2011; Hyvarinen and Smith, 2013), the ICA-based learning algorithm (ICA; Shimizu et al., 2006), the high dimensional constraint-based PC algorithm (PC; Kalisch and Bühlmann, 2007) and a hybrid version of the max-min hill climbing algorithm (MMHC; Tsamardinos et al., 2006). Particularly, TL adopts the graphical Lasso algorithm with a fixed regularization parameter as suggested in Section 5, and an independence test based on the distance covariance measure with significance level α = 0.01. MDirect is implemented in the R package highDLingam (Wang and Drton, 2020), and both ICA and MMHC are implemented in the R package CompareCausalNetworks. Furthermore, we implement Pairwise by using the R package causalXtreme (Gnecco et al., 2021), further extended with the Lasso algorithm for DAGs in high dimensional cases, and we implement PC by using the R package pcalg (Kalisch et al., 2012), which outputs a partial DAG; we then apply the treatment in Yuan et al. (2019) to convert it to a DAG by using the pdag2dag routine in the R package pcalg. Note that the significance level of the independence tests in both TL and PC is set to α = 0.01, and least squares estimation is also applied to estimate the connection strengths of the directed structures for MDirect, MMHC and PC.

The numerical performance of all the methods is evaluated in terms of the estimation accuracy of directed edges and coefficients. For the accuracy of the estimated directed edges, we employ the true positive rate (TPR) and the false discovery rate (FDR) as evaluation metrics. To evaluate the closeness of the estimated and true DAGs, we report the normalized structural Hamming distance (HM; Tsamardinos et al., 2006), which measures the smallest number of edge insertions, deletions, and flips needed to convert the estimated DAG into the truth. For overall accuracy of the estimated DAG structure, we use the Matthews correlation coefficient (MCC) as an overall evaluation metric, which is also considered in Yuan et al. (2019). For the evaluation of the coefficient estimation, we report the relative error between the estimated adjacency matrix $\widehat{B}$ and the true adjacency matrix $B$ in Frobenius norm, rel-Fnorm $:= \|\widehat{B} - B\|_F / \|B\|_F$. Note that good estimation is indicated by small values of FDR, HM and rel-Fnorm, but large values of TPR and MCC.
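The edge-recovery metrics above admit a compact implementation. The sketch below covers TPR, FDR, MCC and rel-Fnorm (HM is omitted), treating every ordered node pair off the diagonal as one binary prediction; the function name is ours.

```python
import numpy as np

def edge_metrics(B_hat, B_true):
    """TPR/FDR/MCC over ordered node pairs (diagonal excluded), plus rel-Fnorm."""
    E_hat, E_true = (B_hat != 0), (B_true != 0)
    p = B_true.shape[0]
    tp = float(np.sum(E_hat & E_true))
    fp = float(np.sum(E_hat & ~E_true))
    fn = float(np.sum(~E_hat & E_true))
    tn = float(np.sum(~E_hat & ~E_true)) - p    # exclude the p diagonal cells
    tpr = tp / max(tp + fn, 1.0)
    fdr = fp / max(tp + fp, 1.0)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    rel_fnorm = np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true)
    return {"TPR": tpr, "FDR": fdr, "MCC": mcc, "rel-Fnorm": rel_fnorm}
```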
In this section, the numerical performance of all the methods is evaluated in two simulated examples, where Example 1 considers a hub graph, and Example 2 considers a scale-free graph generated by the Barabási-Albert (BA) model.

Example 1. We consider a hub graph with $T = 2$, $A_0 = \{2, \ldots, p\}$ and $A_1 = \{1\}$, whose DAG structure is illustrated in Figure 2a. Moreover, we consider various noise distributions, including the uniform distribution on $[-3, 3]$, Student's t distribution with 9 degrees of freedom, and the double exponential distribution with location parameter 0 and scale parameter $\sqrt{1.5}$, and the coefficient of each directed edge is uniformly generated from $[-1.5, -0.5] \cup [0.5, 1.5]$.

Example 2. We consider a similar data generating scheme as in Wang and Drton (2020). Specifically, we start with a graph with only one node, and at each step a node with 2 directed edges is added to the graph. The probability of generating a directed edge from each previous node to the newly added node is proportional to the number of neighbors of the previous node. The resulting DAG structure is illustrated in Figure 2b. Moreover, we generate the noise terms from $\sigma \times \mathrm{Uniform}[-3, 3]$ with $\sigma \sim \mathrm{Uniform}[0.2, 1]$, and the coefficient of each directed edge is uniformly generated from $[-1.5, -0.5] \cup [0.5, 1.5]$.

For each example, we repeat the data generating scheme 50 times, and the averaged performance of all the methods in the cases with $(n, p) = (200, 100)$, $(200, 200)$, $(400, 200)$ and $(400, 1000)$ is summarized in Tables 1 and 2. Note that ICA is only designed for the low dimensional case with $p < n$, and some methods do not produce any results for the cases with large $p$ in Examples 1 and 2 after more than 48 hours. It is evident from Tables 1 and 2 that TL outperforms all the other competitors in almost all the cases, except that it yields the second best TPR in the cases with $(n, p) = (200, 100)$ and $(400, 200)$. In these two cases, Pairwise or ICA attain a higher TPR, largely due to the fact that they tend to produce very dense graphs with many false edges and thus have a much higher FDR. Note that the performance of MDirect appears less satisfactory, possibly due to its sensitivity to the data generating scheme. It is also interesting to point out that the performance of TL may be further improved with a finer tuning scheme, at the cost of increased computation.
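For reference, the Example 1 generating scheme under the uniform-noise setting can be sketched in a few lines; the function name and seed handling below are ours.

```python
import numpy as np

def hub_dag_data(n, p, seed=1):
    """Simulate n samples from the Example 1 hub DAG: node 1 -> nodes 2..p."""
    rng = np.random.default_rng(seed)
    B = np.zeros((p, p))
    signs = rng.choice([-1.0, 1.0], size=p - 1)
    B[1:, 0] = signs * rng.uniform(0.5, 1.5, size=p - 1)  # weights in [-1.5,-0.5] U [0.5,1.5]
    eps = rng.uniform(-3.0, 3.0, size=(n, p))             # non-Gaussian (uniform) noise
    X = eps @ np.linalg.inv(np.eye(p) - B).T              # x = (I - B)^{-1} eps
    return X, B
```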
We now apply TL to analyze the spread of COVID-19 based on the daily global confirmed cases collected by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which is publicly available at https://github.com/CSSEGISandData/COVID-19. Figure 3 displays the heat maps of the global cumulative confirmed cases for countries around the world from March 1st, 2020 to April 15th, 2020. It is clear that most confirmed cases of COVID-19 were first reported in China, Europe and Iran; the virus then quickly spread to the Middle East and North America, and finally most countries were affected by COVID-19, especially the USA and western European countries. Interestingly, compared with other continents, countries in Africa appear to be much less affected. Note that a DAG is an efficient tool to describe the spread of COVID-19, where a directed edge indicates that the virus spreads from one country to another. Although virus spread may not necessarily be acyclic, the DAG provides insightful information on the future infection tendency of the virus spreading among countries.

We pre-process the dataset and exclude those countries or regions with no confirmed cases for more than 10 days during March 1st, 2020 to April 15th, 2020. This leads to daily confirmed cases in p = 99 countries or regions over a total of 61 days. Further, we convert the actual number of daily confirmed cases to a percentage over all countries, and use a 3-day moving average of the percentages as the observation for each day; a sketch of this preprocessing is given at the end of this section. We then apply TL to estimate the DAG for the spread of COVID-19, yielding 99 nodes and 736 directed edges. Figure 4 shows the top 25 hub nodes, together with the number of their child nodes, in the estimated DAG for the spread of COVID-19; they consist mostly of Eastern Asian and Western Asian countries, as well as most European countries. This concurs with the heat map in Figure 3b, which shows that these countries reported many confirmed cases in mid-March and thus were more likely to spread the virus. Figure 5 presents the 30 directed edges with the largest estimated weights, showing that China, Korea, Italy and the United Kingdom are the major countries spreading the virus to others. This trend of infection appears sensible, since these countries had more confirmed cases in early March, as shown in Figure 3, and they are also closely connected to other countries due to their active economies or tourist attractions. Moreover, many directed edges are present among European countries, largely due to the fact that population movement and interaction among these countries are much more frequent than elsewhere. It is also interesting to note that Figure 5 shows no directed edges pointing from the United States of America and Canada to other countries, yet they are indeed hub nodes with 11 and 6 child nodes, respectively. Given the surge in the number of confirmed cases in these two countries observed in Figure 3, one can expect the virus to spread from these two countries to their child nodes after April 2020.
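The preprocessing described above can be sketched as follows, assuming a hypothetical (days x countries) DataFrame of new confirmed cases built from the public CSSE repository; the column layout and function name are illustrative assumptions.

```python
import pandas as pd

def preprocess(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: (days x countries) frame of new confirmed cases."""
    # exclude countries/regions with no confirmed cases for more than 10 days
    daily = daily.loc[:, (daily == 0).sum(axis=0) <= 10]
    # convert counts to each country's percentage of the global daily total
    share = daily.div(daily.sum(axis=1), axis=0)
    # take a 3-day moving average of the percentages as the daily observation
    return share.rolling(window=3).mean().dropna()
```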
This paper proposes an efficient method to learn linear non-Gaussian DAGs in high dimensional cases with statistical guarantees. The proposed method leverages a novel concept of topological layers to facilitate DAG learning, which ensures that the parents of a node must belong to its upper layers and thus naturally guarantees acyclicity. To learn the DAG, its layers can be reconstructed via precision matrix estimation and independence tests in a bottom-up fashion, and its parent-child relations can be directly obtained from the estimated precision matrix. More importantly, the proposed method can consistently recover the underlying DAG under milder conditions than existing methods in the literature. Its advantages over some popular competitors are also supported by numerical experiments on a variety of simulated examples and a real-life application.

References
A constrained ℓ1 minimization approach to sparse precision matrix estimation
On causal discovery with an equal-variance assumption
Optimal structure identification with greedy search
Analyse générale des liaisons stochastiques
Sparse inverse covariance estimation with the graphical Lasso
Learning linear structural equation models in polynomial time and sample complexity
Causal discovery in heavy-tailed models
A kernel statistical test of independence
PenPC: A two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs
Pairwise likelihood ratios for estimation of non-Gaussian structural equation models
Estimating high-dimensional directed acyclic graphs with the PC algorithm
Causal inference using graphical models with the R package pcalg
Likelihood ratio tests for a large directed acyclic graph
Feature screening via distance correlation learning
High-dimensional consistency in score-based and hybrid structure learning
Nonparametric estimation of triangular simultaneous equations models
Ball Divergence: Nonparametric two sample test
Identifiability of additive noise models using conditional variances
Learning a high-dimensional linear structural equation model via ℓ1-regularized regression
Identifiability of Gaussian structural equation models with equal error variances
Elements of Causal Inference: Foundations and Learning Algorithms
High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence
Causal protein-signaling networks derived from multiparameter single-cell data
A Bayesian network structure for operational risk modelling in structured finance operations
A linear non-Gaussian acyclic model for causal discovery
DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model
On a property of the normal distribution
Causation, Prediction, and Search. Cambridge, Massachusetts
Brownian distance covariance
Measuring and testing dependence by correlation of distances
The max-min hill-climbing Bayesian network structure learning algorithm
Geometry of the faithfulness assumption in causal inference
High-dimensional causal discovery under non-Gaussianity
Constrained likelihood for reconstructing a directed acyclic Gaussian graph
On model selection consistency of Lasso
DAGs with NO TEARS: Continuous optimization for structure learning

Acknowledgments. XH's research is supported in part by NSFC-11901375 and Shanghai Pujiang Program 2019PJC051, and JW's research is supported in part by GRF-11303918, GRF-11300919, and GRF-11304520.