key: cord-0475978-sd0zqnt5
authors: Nguyen, Thi Kim Hue; Berge, Koen Van den; Chiogna, Monica; Risso, Davide
title: Structure learning for zero-inflated counts, with an application to single-cell RNA sequencing data
date: 2020-11-24
journal: nan
DOI: nan
sha: 0ddac44883f1092b3246921c58fb295c4b4603ae
doc_id: 475978
cord_uid: sd0zqnt5

The problem of estimating the structure of a graph from observed data is of growing interest in the context of high-throughput genomic data, and single-cell RNA sequencing in particular. These, however, are challenging applications, since the data consist of high-dimensional counts with high variance and over-abundance of zeros. Here, we present a general framework for learning the structure of a graph from single-cell RNA-seq data, based on the zero-inflated negative binomial distribution. We demonstrate with simulations that our approach is able to retrieve the structure of a graph in a variety of settings and we show the utility of the approach on real data.

In recent years, a growing interest has developed around the problem of retrieving, starting from observed data, the structure of graphs representing relationships among variables of interest. In fact, reconstruction of a graphical model, known as structure learning, traces back to the beginning of the nineties, and a vast literature exists that considers the problem from various perspectives, within both frequentist and Bayesian approaches (see Drton and Maathuis (2017) for an extensive review). But a central role in the renewal of interest on structure learning has been played by molecular biology applications. In this field, the abundance of data with increasingly large sample sizes, driven by novel high-throughput technologies, has opened the door for the development and application of structure learning methods, in particular applied to the estimation of gene regulatory or gene association networks.

At the inception of transcriptomics, the technology of choice for measuring gene expression was the microarray assay, that, by optically scanning fluorophore intensities, provided data on a continuous scale (Irizarry et al., 2003) . When it came to (sparse) structure learning from these data, the first proposals assumed that data arose from a multivariate Gaussian distribution, and took advantage of the many results and tools available for such family of distributions (see Schäfer and Strimmer (2005) ; Junbai, Leo and Jan (2005) ; Peña (2008) ; Yin and Li (2011), among others).

Later, a new technology allowed for the high-throughput sequencing of RNA molecules (i.e., RNA-seq), and quickly established itself as the reference technology for the study of genome-wide transcription levels (Wang, Gerstein and Snyder, 2009 ). One of the main advantages of RNA-seq over microarrays is that it allows to analyze small amounts of RNA, making it feasible to study gene expression even at single-cell resolution (Kolodziejczyk et al., 2015) . This new technology provided statisticians with a wealth of novel problems. Indeed, RNAseq yields counts, rather than intensities on a continuous scale, as measures of gene expression. Data are usually high dimensional and, typically, come from skewed distributions with high variance. Moreover, they very often show a large number of zeros, typically larger than expected under a Poisson or negative binomial model (Van De Wiel et al., 2013) .

Structure learning of graphs with such data was initially performed by exploiting data transformations, such as log, Box-Cox, copulas, etc (Abegaz and Wit, 2015) . Although data transformation can work well in some circumstances, it can be also ill-suited, possibly leading to wrong inferences in some circumstances (Gallopin, Rau and Jaffrézic, 2013) . Awareness of these problems fueled the development of methods for learning (sparse) graphical models tailored to count data. , Yang et al. (2013) and, more recently, considered structure learning for Poisson and truncated Poisson counts. A general class of models was studied in Yang et al. (2015) , which considered graphical models for the class of exponential family.

The challenges posed by RNA-seq technology are exacerbated in single-cell RNA sequencing (scRNA-seq). scRNA-seq allows the measurement of RNA from individual cells, promising to permit the study of gene interactions at an unprecedented resolution (McDavid et al., 2019) . Some scRNA-seq platforms employ unique molecular identifiers (UMIs), which help reduce amplification biases (Islam et al., 2014) by counting unique RNA molecules rather than reads potentially representing the same molecule more than once. This implies that the distribution of the resulting data is substantially different: read-count data typically show larger counts than UMI data and a more pronounced bi-modality (Svensson, 2020) . Moreover, the small amount of RNA present in the cell and the technical limitations of the sequencing platforms (e.g., a limited number of sequenced reads per cell) lead to higher variance and larger fraction of zero counts compared to "bulk" RNA-seq McDavid et al., 2019) . As a result, single-cell RNA-seq gene-wise data distributions are highly skewed and show an abundance of zero counts. Inference using Gaussian models is definitely infeasible even after variance stabilizing transformations and even models for count data may suffer from high false discovery rates (see Gallopin, Rau and Jaffrézic (2013) , and Section 6). To account for zero-inflation, McDavid et al. (2019) proposed a Hurdle model, equivalent to a finite mixture of singular Gaussian distributions. The authors' model, however, does not account for the count nature of the data.

From this quick tour on problems and methods, it appears evident that principled solutions to structure learning that account for the possibility of overdispersion and/or zero-inflation are still lacking. In this paper, we try to fill this gap. We present a general framework, based on the zero-inflated negative binomial distribution, for learning the structure of a graph from single-cell RNA-seq data. We focus in particular on UMI data, as its growing popularity suggests that the majority of future studies will employ this technology.

The remainder of this article is organized as follows. In Section 2 we introduce a motivating dataset; we describe our proposed model in Section 3 and our structure learning procedure in Section 4. One key question in the literature is whether zero inflation needs to be accounted for in the data, we offer our perspective in Section 5. After exploring the behavior of our method in simulated data in Section 6, we apply our proposal to real single-cell RNA-seq data in Section 7. Section 8 concludes the article with a discussion.

Despite the distributional challenges described in the previous section, singlecell data offer an unprecedented opportunity to discover cellular dynamics, especially in developing cell populations. Graphical models could be an important tool to learn gene interactions from single-cell data, to learn how these change across conditions and throughout development, and to identify potentially novel transcription factor target genes. While graphs are widely used in scRNA-seq to group similar cells in the space of gene expression, our approach learns a graphical model considering genes as nodes. This allows us to model cells as i.i.d. observations from a multivariate distribution in which the genes are the variables and the cells are considered a random sample from the cell population.

In particular, here, we study gene expression from the mouse olfactory epithelium (OE). This tissue is made of two major mature cell types, olfactory neurons and sustentacular support cells. Furthermore, a stem cell niche provides a mechanism through which the tissue is regenerated Gadye et al., 2017) . As the aim of the study is to understand how stem cells mature into neurons following tissue damage, we focused only on the cells in the neuronal lineage. Briefly, the olfactory reserve stem cells, called Horizontal Basal Cells (HBC), become activated and subsequently develop into Globose Basal Cells (GBC) and then into immature (iOSN) and finally mature olfactory neurons (mOSN). By reconstructing the structure of the graph for each of these cell types separately, we hope to get a glimpse of the relationships between genes in neuronal development. For instance, the cell type that results in the most highly connected graph could indicate the most transcriptionally active developmental stage .

A probabilistic graphical model requires the definition of a pair, (G, F) say. Here, G = (V, E) represents an undirected graph, where V is the set of nodes, and E = {(s, t) : s, t ∈ V, s = t} represents the set of undirected edges. Each node in the graph corresponds to a random variable X s , s ∈ V ; the existence of an edge (s, t) ∈ E indicates the dependency of the random variables X s and X t . Moreover, F represents a family of probability measures for the random vector X V , indexed by V and with support X V .

Thanks to the well known Markov properties (global, local, pairwise, see Lauritzen (1996) ), the pattern of edges in the graph translates into conditional independence properties for variables in X V , which, in turn, allow possible factorizations of F into smaller, more tractable entities. In undirected graphical models, each absent edge (s, t) in E has the role of portraying the conditional independence, X s ⊥ ⊥ X t |X V \{s,t} , and the family F is said to satisfy the pairwise Markov property with respect to G. The smallest undirected graph G with respect to which F is pairwise Markov is given the name conditional independence graph. When all variables in X V are discrete with positive joint probabilities, as is the case of this paper, the three kinds of Markov properties are equivalent, so that a factorization of the joint probability distribution with respect to the cliques (fully connected subsets of vertices) of the graph G is also guaranteed (Lauritzen, 1996, Chap. 3) .

Let x is be the gene expression for gene s ∈ V in cell i ∈ {1, 2, . . . , n}, we assume that the distribution of each variable X is , conditional to all possible subsets of variables X iK , K ⊆ V is a zero-inflated negative binomial (zinb) distribution:

(3.1) where δ 0 (.) is the Dirac function, π is|K ∈ [0, 1] is the probability that a 0 is sampled from a distribution degenerate at zero and f nb (., µ, θ) denotes the probability mass function of the negative binomial (NB) distribution with mean µ and inverse dispersion parameter θ. We assume that

A missing edge between node s and node t corresponds to the condition β µ st|K = β µ ts|K = β π st|K = β π ts|K = 0, ∀K ⊆ V \ {s}. On the other hand, one edge between node s and node t implies that at least one of the four parameters β µ st|K , β µ ts|K , β π st|K , β π ts|K is different from 0. This specification defines a family of models that includes the most common models employed for count data and embraces a variety of situations. It is evident that, when π s|K = 0, ∀ K ⊆ V \ {s}, the model reduces to a NB distribution, which, in turn reduces to a Poisson distribution when the inverse dispersion parameter θ s tends to infinity. When π s|K > 0, zero-inflation comes into play and zero-inflated Poisson and NB models can be considered. In this case, when β π st|K = 0, ∀ t ∈ K \ {s}, the neighborhood of a node s is defined to be the set of effective predictors of µ s|K and consists of all nodes t for which β µ st|K = 0. On the other side, when β µ st|K = 0, ∀ t ∈ K \ {s}, the neighborhood of a node s is defined to be the set of effective predictors of π s|K and consists of all nodes t for which β π st|K = 0. In other words, the family includes models in which the structure of the graph is attributable only to one of the two parameter components, π s|K or µ s|K .

The difficulty with our model specification is that the definition of a set of conditional distributions does not guarantee the existence of a valid joint distribution, i.e., a joint distribution that possesses the specified conditionals. This might create difficulties in interpreting the resulting graph in probabilistic terms: if the joint distribution does not exist, graphical separations stored in G as a result of our model specification might not correspond to conditional independence properties on F. However, our formulation guarantees the existence of the joint distribution in a number of relevant subcases. In the following theorem, we clarify conditions for existence of a joint distribution coherent with the conditional specification (see Section 1.1, Supplementary Material, for a proof).

Theorem 1. Let X V = (X 1 , X 2 , . . . , X p ) be a p-random vector with support X V . Assume that a set of univariate conditional probability mass functions of the kind (3.1) are given for variables in X V . Then, a joint distribution having those conditionals exists if and only if θ s is constant for all s ∈ V , and all regression coefficients β µ st|K are negative, ∀K ⊆ V. The condition on negativity of local regression coefficients in (3.2) resembles a condition known in the literature of Markov random fields known as "competitive relationship" (Besag, 1974) . Generally speaking, the presence of only negative relations among entities is quite a rare event and incapability of capturing positive dependencies might be a severe drawback in various applications. Nevertheless, the existence of a joint distribution in these specific cases assures that statistical guarantees hold for conditional approaches to structure learning such as the one used in this paper and somehow softens the hazard of the use of such algorithms outside the conditions of existence of a joint distribution.

A conditional independence graph G = (V, E) on X V can be estimated by estimating, for each node s ∈ V, its neighborhood. Hence, one can proceed by estimating the conditional distribution of X s |X V \s and fixing the neighborhood of s to be the index set of variables X N (s) on which the conditional distribution depends.

To estimate the neighborhood of each node, we employ the PC-stable algorithm, a variant of the PC algorithm first proposed by Spirtes, Glymour and Scheines (2000) . The PC algorithm starts with a complete graph on V.

Marginal independencies for all pairs of nodes are tested, and edges removed when marginal independencies are found. Then, for every pair of linked nodes, independence is tested conditional to all subsets of cardinality one of the adjacency sets of the two nodes. This testing procedure is iterated, increasing in turn the size of the conditioning sets, until this reaches its maximum limit, or a limit imposed by the user. Reasons for choosing the PC algorithm are many, spanning from its consistency (assuming no latent confounders) under i.i.d. sampling (Spirtes, Glymour and Scheines, 2000) , to its ability to deal with a large number of variables and only moderately large sample sizes. The variant that we employ, PC-stable (Colombo and Maathuis, 2014) , allows to control instabilities due to the order in which the conditional independence tests are performed. To perform the tests, deviance test statistics are employed, for which a chi-squared asymptotic distribution can be obtained by standard asymptotic theory.

In what follows, let X = {x (1) , . . . , x (n) } be the collection of n samples drawn from the random vectors X V , with x (i) = (x i1 , . . . , x ip ), i = 1, . . . , n. Starting from the complete graph, for each s and t ∈ V \{s} and for any set of variables S ⊆ {1, . . . , p}\{s, t}, we test, at some pre-specified significance level, the null hypothesis H 0 : β µ st|K = β µ ts|K = β π st|K = β π ts|K = 0, with K = S∪{s, t}. In other words, we test if the data support the existence of the conditional independence relation X s ⊥ ⊥X t |X S . If the null hypothesis is rejected, there exists an edge (s, t) in the resulting graph. A control is operated on the cardinality of the set S of conditioning variables, which is progressively increased from 0 to p − 2 or to m, m < (p − 2). Assume X s |x K\{s} ∼ zinb(X s ; µ s|K , θ s , π s|K |x K\{s} ), as in Equation (3.1). The conditional log-likelihood for variable X s given x K\{s} is obtained by

where ν s|K , β s|K are linked to π s|K , µ s|K through Equations (3.2) -(3.3). The estimatesν s|K ,β s|K ,θ s of the parameters ν s|K , β s|K , θ s are obtained by maximizing the conditional log-likelihood given in Equation (4.1), i.e., (ν s|K ,β s|K ,θ s ) = argmax (ν s|K ,β s|K ,θs)∈R 2|K|+1 s (ν s|K , β s|K , θ s ).

See Section 1.2, Supplementary Material, for details on the estimation procedure. A deviance test statistic for the hypothesis H 0 : β µ st|K = β π st|K = 0 can be obtained as

whereν 0 s|K ,β 0 s|K ,θ 0 s are the maximum likelihood estimates of the parameters under H 0 . It is readily available that D s|K is asymptotically chi-squared distributed with 2-degrees of freedom under the null hypothesis, provided that some general regularity conditions hold.

Remark 1. On assuming faithfulness of the node conditional distributions to the graph G, consistency of the algorithm can be proved in the case of competitive relationships in µ s |K by suitably modifying results in . We recall that a distribution P X is said to be faithful to the graph G if for all disjoint vertex sets A, B, C ⊂ V it holds

where A ⊥ ⊥ G B|C means that A and B are separated in G by C. Thanks to the equivalence between local and global Markov properties, faithfulness of the local distributions guarantees faithfulness of the joint distribution.

Remark 2. Although a theoretical proof of convergence of the algorithm is in question in the case of unrestricted relationships among variables, inference on the structure is still principled within a pseudo-likelihood perspective, i.e., by approximating the likelihood function by a product of the conditional likelihood functions. Different pseudo-likelihood-based structure estimators have been shown to be consistent under a conditional model construction (see, Imre and Zsolt (2006) , among others). See also for an empirical exploration of convergence of a similar algorithm under the Poisson assumption in the case of unrestricted relationships among variables.

Remark 3. A large sample size, as typical in the applications at hand, impacts on the actual significance level of individual tests. Moreover, a multiplicity of tests are performed by the algorithm. For this reason, we advice to set the nominal level of the test α to α n = 2(1 − Φ(n b )), where 0 < b < 1/2 is related to the average neighborhood size. This choice is based on results in and guarantees that the probability that a type I or II error occurs in the whole testing procedure goes to zero as n → ∞, i.e., it asymptotically controls the family-wise error rate of all potential tests that could be done.

Remark 4. The chosen learning strategy has some advantages over alternative approaches based on sparse regressions (see also for an extended discussion in the Poisson case). Sparsity can be easily implemented by a control on the conditional set size, instead of a control on parameter magnitudes, which can lead to over-shrinkage. Moreover, it offers computational advantages, especially when sparse networks are the target of inference.

The need for modeling zero-inflation in single cell data is a question at the core of an ongoing debate, with several authors arguing that the negative binomial distribution is sufficient to fit single-cell RNA-seq data when unique molecular identifiers are used (Vieth et al., 2017; Townes et al., 2019; Svensson, 2020; Sarkar and Stephens, 2021) . Indeed, the ability to distinguish between a non zero-inflated distribution and zero-inflated alternatives highly depends on the relative size of the parameters of the distributions.

To gain a better understanding of this problem, we have tried to assess the misspecification cost due to assuming a zero-inflated distribution when no zero inflation occurs. To this aim, we confined ourselves to a univariate case with no covariates, fixed a non zero-inflated model and measured the model misspecification cost occurring when using its zero-inflated counterpart by using the squared Hellinger distance as loss function. Such a loss function should, in principle, indicate, in an inferential sense, how far apart the two distributions are.

To this aim, let Y = {0, 1, 2, . . . , +∞} be the support of a discrete variable Y . We consider for Y a true probability distribution P (y; φ 0 ), φ 0 ∈ Φ, as well as a family F = {Q(y; ψ), ψ ∈ Ψ}, Φ ⊆ Ψ, of zero-inflated versions of the true probability distribution P (y; φ 0 ). The squared Hellinger distance between two probability distributions P and Q is defined as

where p y = P (y; φ 0 ) and q y = Q(y; ψ).

In particular, assume that P is a NB distribution with φ 0 = (µ 0 , θ 0 ) and Q is a zinb distribution, defined as Q(y; ψ) = πδ 0 (0) + (1 − π)P (y; φ 0 ), with ψ = (φ 0 , π). Hence, the squared Hellinger distance of P and Q can be written as Table S1 show the value of the Hellinger distance in a number of cases. As expected, the distance increases with the probability of zero inflation π. However, when the inverse dispersion parameter θ 0 and/or the mean µ 0 are small, the distance between the distributions is small even in the case of moderate to large π. In fact, when µ 0 and θ 0 are both small (low mean and high variance) the two distributions are close even when π = 0.9.

As, broadly speaking, maximum likelihood estimators and minimum Hellinger distance estimators are asymptotically equivalent, it emerges that, in inferential terms, the degree of zero inflation of a true model could be difficult to ascertain, as suitable choices of the parameters of the non contaminated component may possibly absorb the excess of zeros generated by the contamination. This is despite identifiability of the zinb model (see Section 1.3, Supplementary Material, for a proof). These remarks might contribute to the ongoing debate about existence of zero-inflation from a novel perspective.

We devote this section to the empirical study of consistency of the proposed algorithms. In particular, we concentrate on the ability of proposed methods to recover the true structure of the graphs. We also list the running time of each algorithm. As measures of the test's accuracy, we adopt three criteria including Precision P ; Recall R; and their harmonic mean, known as F 1 -score, respectively defined as

where TP (true positive), FP (false positive), and FN (false negative) refer to the number of inferred edges (Liu, Roeder and Wasserman, 2010) . The considered algorithms are listed below, along with specifications, if needed, of tuning parameters. For all PC-like algorithms, we let the maximum cardinality of conditional independence set be m = 8 for p = 10 and m = 3 for p = 100.

-PC-zinb1: zinb models in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; -PC-zinb0: zinb models in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant (i.e., β π st|K = 0, ∀ t ∈ K \ {s}, ∀ s ∈ V ); -PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; -PC-pois: Poisson model .

Moreover, we compare the results to those obtained with the algorithm of McDavid et al. (2019) , which employs a Gaussian Hurdle model.

For two different cardinalities (p = 10 and p = 100), we consider three graphs of different structure: (i) a scale-free graph, in which the node degree distribution follows a power law; (ii) a hub graph, where each node is connected to one of the hub nodes; (iii) a Erdos-Renyi graph, where the presence of the edges is drawn from independent and identically distributed Bernoulli random variables.

To construct the scale-free and Erdos-Renyi networks, we employed the R package igraph (Csardi et al., 2006) . For the scale-free networks, we followed the Barabasi-Albert model with constant out-degree of the vertices ν = 2 for p = 10 and ν = 0.2 for p = 100. For the Erdos-Renyi networks, we followed the Erdos-Renyi model with probability to draw one edge between two vertices γ = 0.3 for p = 10 and γ = 0.03 for p = 100. To construct the hub networks, we assumed 2 hub nodes for p = 10, and 5 hub nodes for p = 100. See Supplementary Figure  S2 and Supplementary Figure S3 for representative plots of the three chosen graphs for p = 10 and p = 100, respectively.

For the given graphs, 50 datasets were sampled with four different sample sizes, n = {100, 200, 500, 1000} for p = 10, and three different sample sizes, n = {200, 500, 1000} for p = 100. To generate the data, we followed the approach of the Poisson models in . Let X ∈ R n×p be the set of n independent observations of random vector X. Then, X is obtained from the following model X = YA + , where Y = (y st ) is an n × (p + p(p − 1)/2) matrix whose entries y st are realizations of independent random variables Y st ∼ zinb(µ, θ, π) (or NB(µ, θ); or Pois(µ)) and = (e st ) is an n × p matrix with entries e st which are realizations of random variables E st ∼ zinb(µ nois , θ, π) (or nbinom(µ nois , θ); or Pois(µ nois )). This approach leverages the additive property of these distributions and allows us to generate the required dependencies. In particular, let B be the adjacency matrix of a given true graph, then A takes the following form A = [I p ; P (1 p tri(B) T )] T . Here, P is a p × (p(p − 1)/2) pairwise permutation matrix, denotes the elementwise product, and tri(B) is the (p(p − 1)/2) × 1 vectorized upper triangular part of B (Allen and Liu, 2013).

Figures 2 and 3 show the Monte Carlo means of the F 1 -scores for each of the considered methods with p = 100 and low signal-to-noise ratio (µ noise = 0.5), at high (µ = 5) and low (µ = 0.5) mean levels, respectively. Each value is computed as the average of the 50 values obtained by simulating 50 samples for the model corresponding to each network. Monte Carlo means of Precision P , Recall R, and F 1 -score are given in Supplementary Tables S2-S4. The two values of µ = {5, 0.5} were chosen to mimic typical values observed in real full-length and droplet-based datasets, respectively. In fact, the mean expression level of transcription factors in the dataset presented in Section 2 is 0.67 (median 0.14), while the mean expression level of transcription factors in a similar experiment performed with a full-length protocol is 32.03 (median 7.89).

These results indicate that the PC-zinb1 algorithm and its variants (PC-zinb0, PC-nb, PC-pois) are consistent in terms of reconstructing the structure from given data. In fact, when the model is correctly specified, the F 1 -scores of the algorithms are close to 1 when n ≥ 1000 in all scenarios. This means that the proposed algorithm is able to recover the underlying graph from the given data for both low (Fig. 3) and high (Fig. 2 ) mean levels.

When the data are generated with a high mean level (µ = 5), the PC-pois algorithm performs well only when it is the true model, i.e., for data generated from Poisson random variables ( Fig. 2 Figs. S10 and S12). In the other scenarios, PC-pois often shows a low Precision ( Fig. 2 ; Supplementary Tables S2 and S3; Supplementary Fig. S10 ). This result is expected since the node conditional Poisson distributions are unable to model the over-dispersion generated by the (zero-inflated) negative binomial distributions.

On the other end of the spectrum, the more general zinb models work well in all scenarios ( Fig. 2 ; Supplementary Tables S2 -S4; Supplementary Figs. S10 and S12). This is not surprising as the data are generated according to models (e.g, Poisson, NB) that can be seen as special cases of the zinb distribution, which means that in all tested scenarios the zinb model is correctly specified.

The PC-nb algorithm, based on the negative binomial assumption, performs reasonably well ( Fig. 2 ; Supplementary Tables S2 -S4; Supplementary Figs. S10 and S12). However, in the hub graph (center column of Fig. 2 ), its performances are slightly worse than the zinb models, showing low Precision when the true data generating distribution is node conditional zinb (Supplementary Fig. S10 ; Supplementary Table S2 ). This result indicates that a zero inflated negative binomial model may be needed when the mean is large .

As we expected from the considerations reported in Section 5, the performances of the variants of PC-zinb are quite similar to each other when the mean and the dispersion parameter are both small, i.e., when the data are characterized by low mean and high variance (µ = 0.5, θ = 0.5; Fig. 3 ; Supplementary Tables S2 -S4; Supplementary Figs. S11 and S13). This might be explained by the fact that a suitable choice of the parameters may allow non-zero inflated models to absorb the excess of zeros (see Section 5 for more details). Therefore, when applying our approach on this type of data, one should use the simplest variant, (i.e., PC-pois) to leverage the better computational performance (see last column of Supplementary Table S2 -S4) .

Finally, we see no difference in the performance of the PC-zinb variants (PC-zinb1 and PC-zinb0). This is perhaps not surprising, as we simulated the same structure of the graph for both µ and π. These results suggest that the information inferred from µ is sufficient to reconstruct the correct graph in this case.

We have focused here on p = 100, as this setting is closer to our real application. The results for p = 10 are reported in Supplementary Figures S4-S9 and  Supplementary Tables S5-S7 and lead to similar conclusions.

We demonstrate our method on the motivating example dataset described in Section 2. To this aim, we analyzed a set of cells, assayed with 10X Genomics (v2 chemistry) after injury of the OE, to characterize HBCs and their descendants during regeneration (Brann et al., 2020) . Starting from an initial set of 25,469 cells, low-quality samples as well as potential doublets were removed as described in Brann et al. (2020) . After clustering with the Leiden algorithm (Traag, Waltman and van Eck, 2019), known marker genes were used to identify cell types. Figure S3 with p = 100, µ = 5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and π s|K is constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S3 with p = 100, µ = 0.5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and π s|K is constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of .

We discarded the cell types outside of the neuronal lineage (macrophages, sustentacular cells, and microvillar cells), obtaining a dataset consisting of 7782 HBCs, 5418 activated HBCs (HBC*), 755 GBCs, 2859 iOSN, and 929 mOSN. For more details on the data preprocessing, see Brann et al. (2020) . We perform two complementary analyses on two different subsets of the dataset. First, we focus on transcription factor (TF) genes, with the aim of identifying important networks of regulation in the different cell types that constitute the neuronal developmental lineage. We then turn our attention to the activated HBCs, a critical stage of neurogenesis, with the aim of identifying important transcription factors that regulate genes important for stem cell differentiation.

Our first analysis focuses on the total set of 1543 known transcription factors in mouse, which are thought to regulate the observed differentiation processes. We furthermore focus on the differentiation path starting at the activated HBC (HBC*) stage (i.e., activated stem cells upon injury) up to mature neurons, therefore investigating the entire neuronal lineage in the trajectory of this dataset. As previously discussed in Section 2, we expect four different cell types along this path, being respectively HBC*, GBC, iOSN and mOSN, and we estimate the structure of the graph for each of these cell types. We selected the top 1000 cells with the highest means from the cell types that had more than 1000 cells (HBCs, HBC*, iOSN) to ensure a fair comparison between groups. In fact, the power of our algorithm to detect edges increases with the sample size and since one of the goals of this analysis is to compare the graphs across cell types we want to avoid a confounding effect due to the number of cells. See Supplementary material, Section 2, for details on the preprocessing.

The average degree of the graphs is highest at the activated stem cell stage, with an average degree of 4, and decreases as cells develop to mature neurons, with average degrees of 3.9, 3.3 and 3.5 for the GBC, iOSN and mOSN networks, respectively. To interpret the graph structure, we focus on the 2-core of each network, i.e., we retain TFs that are associated with at least two other TFs, a preprocessing step that helps in understanding the core structure (Wang and Rohe, 2016).

We identify communities in each graph using the Leiden algorithm (Traag, Waltman and van Eck, 2019) and, in order to validate the associations discovered by PC-zinb, we interpret each of the communities by computing overlaps with known functional gene sets in the MSigDB database Liberzon et al., 2015) , see Supplementary Material Section 2 for details. The interpretation of these communities relies on known processes involved in the development of the olfactory epithelium as found by previous research (e.g., ; Gadye et al. (2017) ).

In the HBC* cell type, cells have been injured ∼ 24h ago, so we expect response to injury, and stem cells actively preparing for differentiation, as well as replication to produce more stem cells to repair the epithelium. Four communities are discovered in the association network (Figure 4 ), broadly involved in either cell cycle, epigenetic mechanisms and (epithelial) cell differentiation (Supplementary Table S9 ). These communities reflect the need to divide in order to produce more cells, epigenetic mechanisms that are likely required to activate molecular processes upon injury, and the differentiation of stem cells to restore the damaged epithelium.

In the GBC cell type, we expect cells to proliferate to produce immature neurons. We discover four communities (Figure 4 ), broadly involved in DNA replication, cell proliferation, signaling, expression regulation and cell differentiation (Supplementary Table S10 ). Relevant pathways, such as the P53 and notch signaling pathways, are also recovered for specific communities, and have previously been found to be involved in neurogenesis in neuroepithelial stem cells (Marin Navarro et al., 2020; Wang et al., 2011) .

In the immature olfactory sensory neuron (iOSN) stage, we expect basal cells to start developing into immature neurons. Four communities are discovered (Figure 4 ), of which one community comprises the majority of the graph, i.e., 63% of all TFs retained in the graph, and importantly is involved in neurogenesis (Supplementary Table S11 ). Other, smaller, communities are enriched in processes such as cell and axon growth, wound healing, signaling and cell population maintenance.

Finally, in the mature olfactory sensory neuron stage (mOSN), we expect the final differentiation to functional neurons. Five communities are discovered (Figure 4) , again with very different sizes. The largest communities are enriched in broader processes related to chromatin organization and transcription, possibly reflecting the basic changes required for cells to develop into and maintain at the mature stage (Supplementary Table S12 ). The third largest community is enriched specifically in the TGF-Beta pathway, known to be required for neurogenesis, and to modulate inflammatory responses (Meyers and Kessler, 2017) .

Taken together, these results confirm previously known processes associated with differentiation of HBCs into mature neurons upon injury, with relevant processes highlighted by communities of transcription factors. Furthermore, while the community detection results are useful to validate the estimated graphs, they also provide a gateway to more detailed analysis, e.g., investigation of hub genes (e.g., Chen et al. (2018) ) or master regulators of development (e.g., Sikdar and Datta (2017)), therefore unlocking powerful interpretation of single-cell RNA-seq datasets. We give an example of such detailed analysis in the next paragraph, in which we focus on the role of the Trp63 TF in activated HBCs.

Our second analysis focuses on a set of 242 genes, annotated with the term "stem cell differentiation" in the Mouse Genome Database , expressed in the activated HBC cell type. Following the same preprocessing employed for the first analysis, and detailed in Section 2 of the Supplementary Material, we obtain a dataset consisting of 1000 cells and 160 genes. (Krzywinski et al., 2012; Bryan, 2020) of TF gene networks estimated with PC-zinb. Gene communities were estimated using the Leiden algorithm and are represented on the axes of the plots and by different edge colors; hub nodes, defined as nodes with more than 9 neighbors, are represented as solid black circles. Each axis (community) was annnotated with the most enriched gene set (see Supplementary Material Section 2).

Our goal here is to infer the interactions among genes, with a particular focus on the role of TFs in regulating target genes. Importantly, in this second analysis, we include many genes that are not TFs, allowing us to focus on which genes are regulated by TFs at this specific point in development.

We expect to find several TFs as hub nodes in the graph. In fact, hub nodes, i.e., nodes with a particularly high number of connections, may represent sites of signaling convergence, potentially indicating those genes that regulate other genes.

The PC-zinb algorithm inferred a sparse graph, shown in Figure 5 , where hub nodes are highlighted in orange or red (when they are TFs). It is immediate to recognize important TFs previously demonstrated to be involved with stem cell differentiation, e.g., Trp63 (Senoo et al., 2007) , Sox2 , and Sox9 (Jo et al., 2014) . Other hub nodes include genes that, while not TFs, have been shown to play a central role in this biological process. For instance, Epcam is known to be essential for the maintenance of self-renewal in stem cells (González et al., 2009) . Another example is Ptn, the gene encoding the pleiotrophin growth factor, which has significant roles in cell growth and survival and has been demonstrated to be essential for stem cell maturation and neuronal development (Tang et al., 2019) . Hub nodes are coloured in orange and red, in which hub nodes that are TF genes are red .

We next focus on one of the most important TFs for stem cell activation, Trp63, by zooming in the sub-network made of this gene and its direct neighbors (Fig. 6 ). Trp63 is one of the most important hubs in the network inferred by PC-zinb, with 20 connections. To validate the biological meaning of these connections, we leverage existing external data. In particular, Riege et al. (2020) performed a meta-analysis of 20 publicly available Chromatin Immunoprecipitation (ChIP-seq) datasets to create a curated catalog of p63 (the human ortholog of Trp63 ). Out of the 20 direct targets of Trp63 in our network, 15 have been confirmed by Riege et al. (2020) as direct targets of p63, i.e., there is experimental evidence that the p63 protein binds either at the transcription start site (TSS) or upstream, indicating that p63 is either a promoter or enhancer of these genes (Supplementary File 3 of Riege et al., 2020) .

We want to stress that PC-zinb is able to find these putative TF-target pairs only on the basis of gene expression, hence proving itself as a useful tool to predict novel TF targets to be further validated with other techniques. 

In this work, we have introduced PC-zinb, a class of constraint-based algorithms for structure learning, supporting possibly overdispersed and zero-inflated count data. In focusing on these two nonstandard but realistic situations, our framework goes beyond what has so far been proposed in the literature. Moreover, by leveraging the proposal in Nguyen and Chiogna (2021) -shown to be competitive with state-of-the-art methods supporting count data -we inherit the benefits of that approach, most notably: the existence of a theoretical proof of convergence of the algorithm under suitable assumptions; an easy implementation of sparsity by a control on the number of variables in the conditional sets; invariance to feature scaling. On the synthetic datasets considered in Section 6, we showed that the algorithms work well in terms of reconstructing the structure from given data for large enough sample sizes, while providing biologically coherent information and insight on the real dataset analyzed in Section 7.

Our simulation studies allow us to derive various recommendations on the use of PC-zinb. Clearly, these do not rule out sensitivity analyses with respect to both model specification and tuning of the algorithms, which remain an important part of the model criticism process. A control of the level of significance of the tests with respect to the sample size, n, and the expected size of the neighborhood of each node, b, is highly recommended to guarantee good reconstruction abilities. As in real applications knowledge of the expected size of the neighborhood might be difficult to elicit, it may be prudent to try a range of values for b, and check stability of results. This might also generate a sequence of models of decreasing complexity for increasing values of b and whose dynamic might also point researchers to the most significant connections.

If only the structure of the graph is of interest, irrespective of the strength of the links among variables, we suggest making use of the Poisson variant of the algorithm when the mean of the variables is small, so as to reduce computational complexity (Supplementary Tables S15 and S16). Moreover, when the mean of the variables is small, presence of zero-inflation might not influence reconstruction abilities of the algorithms, as also confirmed by the small study on zero-inflation in Section 5. In these situations, we recommend using, at least in the first instance, non zero-inflated models.

Clearly, in many applications, learning the structure might not be the only goal, and one might want to gain a quantitative insight into the dependence structure of the underlying process, by measuring the sign and the strength of the relations pictured in the graph. If the distribution needs also to be explicitly estimated, this can be achieved by using any of several existing parameter estimation methods conditional on the fixed structure learned by our approach.

If the null hypothesis H 0 : β µ st|K = β π st|K = 0 fails to be rejected, PC-zinb will remove the edge between variables s and t. While such a procedure can only be justified in settings with high power, our simulation study shows that, even in settings with small sample sizes, our algorithm is able to achieve high power, and the correct underlying structure of the graph can be learned successfully.

While it is straightforward to interpret the case in which the neighborhood of s is defined by the predictors of µ s|K , i.e., gene dependencies act on the average gene expression, the case of structure on π s|K requires more thought. If zero inflation represents true biological signal, we can interpret a non-zero β π st|K as the fact that the presence of gene t will influence the presence of gene s, regardless of their average expression. This is similar to McDavid et al. (2019) . If zero inflation represents only technical noise, a simpler model with constant π s|K might be preferable. This is a special case of our general model. Since it is unclear what is the true nature of zero inflation in scRNA-seq data, we opted for generality in our model specification. Furthermore, having a general model expands the set of applications in which our approach may be useful.

The question of whether zero-inflated models are useful for the analysis of scRNA-seq data has been frequently posed in the recent literature. In Section 5 we try to shed some light on why a negative binomial distribution can fit UMI data well, as observed by Svensson (2020) and Sarkar and Stephens (2021) among others (see also our Figure 3 ). We show that in the case of low mean and high variance the zinb and NB distributions are very close to each other, rendering the question of whether UMI data are zero inflated moot. However, we also show that in real data zinb and NB models lead to different results, albeit with decent concordance between the inferred graphs (Supplementary Table S14 ). This result is only partially in agreement with those of Sarkar and Stephens (2021) , in which the authors found that only a small percentage of genes show evidence of zero inflation. However, while Sarkar and Stephens (2021) focus much of their attention to the case of univariate gene expression, modeling zero inflation may be important when looking at correlation between genes (see also Yang and Ho, 2021) .

Latent or unmeasured variables might induce associations between observed variables that can be spurious. Theoretical proposals are available to deal with the issue of latent factors in the setting in which the latent and observed variables are jointly Gaussian with the conditional statistics of the observed variables conditioned on the latent variables being specified by a graphical model (Chandrasekaran, Parrilo and Willsky, 2012) , but, to the best of our knowledge, no similar results are available for other families of models. For this reason, in our paper, we simply leverage on convergence of the PC algorithm to the model marginalized over the latent factors.

As for the treatment of observed covariates and/or confounding factors, our proposed PC algorithm -that decomposes the structure learning problem into a series of tests performed on conditional log-likelihoods of each variable conditional on other variables -naturally allows to incorporate the covariates into the conditional regression models and, therefore, to estimate a covariate-adjusted structure for the graph. However, challenges remain if introduction of covariates augment the dimension of the conditional regression models to the point that one needs to resort to penalized tools. The treatment of both observed and latent covariates will be the object of future work.

Our real data analysis, aimed at assessing biological validity of the reconstructed network, has demonstrated the great importance of finding meaningful visualizations of large complex networks. Our proposal, based on a search for communities of variables and their association to gene ontologies via enrichment analysis, allowed us to confirm both biological interpretability of the estimated structure, and to contribute to our understanding of which and where biological processes are occurring.

The methods presented in this article are available in the learn2count R package, available at https://github.com/drisso/learn2count. The code to reproduce the analyses of this paper is available at https://github.com/drisso/ structure_learning. Supplementary material to Structure learning for zero-inflated counts, with an application to single-cell RNA sequencing data 1. Proofs and mathematical details https://www.overleaf.com/project/609d5d40edff1a87046d5d64 In Section 1.1, we provide the proof of Theorem ?? (Section ?? of main paper); Section 1.2 describes the maximum likelihood estimation procedure used in zinb regressions; in Section 1.3 we prove the identifiability of zinb models (Section ?? of main paper).

In what follows, we will first consider the case K = V . Results for general case K ⊂ V will be then deduced by zero-padding the true parameter β s|K ∈ R 2|K|−2 to include zero weights over V \K.

To simplify notation, we write (µ j , π j , β µ jk , β π jk ) instead of (µ j|V , π j|V , β µ jk|V , β π jk|V ).

Proof. It is easy to show that for x = (x 1 , x 2 , . . . , x p ) and y = (y 1 , y 2 , . . . , y p ) one can write P (x) P (y) = p j=1 P (x j |x 1 , . . . , x j−1 , y j+1 , . . . , y p ) P (y j |x 1 , . . . , x j−1 , y j+1 , . . . , y p ) .

Let y = (0, 0, . . . , 0). It holds

where µ 0 j = exp{β µ j + j−1 k=1 β µ jk x k }, and logit(1 − π 0 j ) = β π j + j−1 k=1 β π jk x k . Consider the following two cases. 

Here, µ 0 j → 0 when x ∞ → ∞. From Equation 1.1, one has

.

To prove existence of the joint distribution, it is sufficient to control the term

.

Indeed, taking the sum over x 1 , x 2 , . . . , x p , one gets x1,...,xp

for some q < 1. Therefore, when x j is large enough, from Equation 1.2, one has x1,...,xp

that proves existence of the joint distribution. • At least one parameter β µ jk > 0.

Without loss of generality, one can assume β µ p(p−1) > 0. Then, µ p−1 p → ∞, when x p−1 → ∞. From Equation 1.1, one has

Taking the sum over x 1 , x 2 , . . . , x p of the term

Therefore, the joint distribution exists if and only if all regression coefficients β µ st|K are negative, ∀K ⊆ V.

Given a set of data X, for each s ∈ K ⊆ V , we estimate parameters ν s|K , β s|K , θ s by maximizing the local log-likelihood function

To solve this standard regression problem, we alternate the estimation of the dispersion parameter θ s and of parameters (ν s|K , β s|K ), as done in . Specifically, after initialization, we iterate the following two steps until convergence (for the reader's convenience, we drop the index K). Note that this procedure can be performed independently and in parallel for each regression performed within the structure learning strategy.

Step 1: dispersion optimization. Let ln θ s = ζ s , we estimate parameter ζ s by solving the following problem

A first initial value for ζ s , ζ 0 s say, is found by applying the method of moments. If the moment estimator does not exist, as in the Poisson case, ζ 0 s is assigned to the value that maximizes the objective function s (ν s ,β s , ζ s ) over a suitable interval. In practice, we use optimize function in R over the range [-100, 100] . Once the initial value ζ 0 s is fixed, we optimize s (ν s ,β s , ζ s ) by a quasi-Newton optimization scheme.

The gradient of the objective function used by the optimization procedure is found as follows. The derivative of the negative binomial log density is:

where Ψ(θ) = Γ (θ)/(Γ(θ)) is the digamma function. Hence, the derivative of the zinb model for π ∈ [0, 1] can be written as follows:

• for y > 0, f zinb (y, µ, θ, π) = (1 − π)f nb (y, µ, θ), so that ∂ ∂θ ln f zinb (y, µ, θ, π) = Ψ(y + θ) − Ψ(θ) + ln θ + 1 − ln(µ + θ) − y + θ µ + θ ;

• for y = 0, f zinb (0, µ, θ, π) = π + (1 − π)f nb (0, µ, θ), so that ∂ ∂θ ln f zinb (0, µ, θ, π) = (1 − π)f nb (0, µ, θ) ∂ ∂θ ln f nb (0, µ, θ) π + (1 − π)f nb (0, µ, θ) = ln θ + 1 − ln(µ + θ) − θ µ+θ π(µ+θ) θ (1−π)θ θ + 1 , Therefore, the derivative of the objective function w.r.t. ζ s for each response X s , s = 1, . . . , p, is

Step 2: log-likelihood optimization. Given a collection of n samples drawn from the random vector X, X say, and dispersion parameterζ s ∈ R, we estimate parameter (β s , ν s ) by solving the following problem (β s ,ν s ) ← argmax (β s ,νs) { s (ν s , β s ,ζ s )}.

(1.4)

For convenience, we reparameterize the log-likelihood as follows, as done in :

Therefore, Problem (1.4) translates into finding a set of vectors (a µ , a π ) that locally maximizes the log-likelihood of a zinb model for a vector of count X s , i.e., F (a µ , a π ) = s (ν s , β s ,ζ s ). Starting from an initial value, we perform a local maximization of this function F using BFGS method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) . We first explicit the derivatives of the log-likelihood w.r.t. parameters (µ, π) in the zinb distribution. We note that f zinb (y; µ, θ, π) = πδ 0 (y) + (1 − π)f nb (y, µ, θ),

∂ ∂µ ln f zinb (y, µ, θ, π) = (1 − π)f nb (y, µ, θ) ∂ ∂µ ln f nb (y, µ, θ) f zinb (y, µ, θ, π) ∂ ∂π ln f zinb (y, µ, θ, π) = δ 0 (y) − f nb (y, µ, θ) f zinb (y, µ, θ, π) .

Depending on whether y is equal to zero, the above given expressions become:

-y > 0: here, δ 0 (y) = 0 and f zinb (y, µ, θ, π) = (1 − π)f nb (y, µ, θ), so that one has

-y = 0: here, δ 0 (y) = 1 and f zinb (0, µ, θ, π) = π + (1 − π)f nb (0, µ, θ), and one

Using standard calculus for the differential of compositions and the fact that:

where G and H are the n-dimensional vectors given by

Remark 1. Note that for each s ∈ V , the dispersion parameter θ s is constant, i.e., θ s does not depend on conditional sets K. Hence, in practice, we estimate θ s by applying the above given procedure for K = V . Then, we use this estimatê θ s as the input of the second step, log-likelihood optimization. Remark 2. It is worth to note that: (i) for negative binomial models, i.e., π s|K = 0, ∀s ∈ K ⊆ V , parameters are estimated similarly by considering a π as a vector 0, and A π as a matrix 0; (ii) negative binomial distributions belong to the exponential family. Hence, an alternative way to estimate parameters is the iteratively reweighted least squares (IRWLS) algorithm. This approach does not depend on the choice of initial values for parameters. As a result, PC-nb using IRWLS algorithm, called PC-nbglm could be more accurate in some scenarios with significantly reduced computational time (see Table S8 ). In practice, when dealing with high dimensional data, we suggest users to use the PC-nbglm algorithm.

In this section, we prove that the zinb model is identifiable.

Theorem 1. Assume X follows a zinb distribution, i.e., X ∼ zinb(π, θ, µ). Then, the density function of X is identifiable.

Proof. We prove the theorem by contradiction. Assume there exists (π 1 , θ 1 , µ 1 ) = (π 2 , θ 2 , µ 2 ) such that X 1 ∼ zinb(π 1 , θ 1 , µ 1 ) with density function f 1 (., π 1 , θ 1 , µ 1 ); X 2 ∼ zinb(π 2 , θ 2 , µ 2 ) with density function f 2 (., π 2 , θ 2 , µ 2 ), and f 1 (., π 1 , θ 1 , µ 1 ) ≡ f 2 (., π 2 , θ 2 , µ 2 ). For s = 1, 2, one has f s (x s , µ s , θ s , π s ) = π s δ 0 (x s ) + (1 − π s )f nb (x s , µ s , θ s ). Let x s = 0, 1, 2, 3, then

(1.6) As Γ(z + 1) = zΓ(z), Equation 1.6 yields

(1.7)

Similarly, f 1 (3, π 1 , θ 1 , µ 1 ) f 1 (2, π 1 , θ 1 , µ 1 ) = f 2 (3, π 2 , θ 2 , µ 2 ) f 2 (2, π 2 , θ 2 , µ 2 )

(1.8)

From Equation (1.7) and Equation (1.8), one gets

(1.9)

Finally, Equation (1.5), (1.7), and (1.9) yield (π 1 , θ 1 , µ 1 ) = (π 2 , θ 2 , µ 2 ), that contradicts the Assumption.

As measurements were zero-inflated and highly skewed, with total count volumes depending on experimental condition, standard preprocessing was applied to the data as in ; . In particular, we first filtered out the genes with a mean across the cells in each group smaller than 0.005 as these genes are nearly constant and are likely to not contribute much to the structure of the graph. Second, we normalized the data by 95% quantile matching between cells to account for differences in sequencing depth. Then, the top 1000 cells with the highest means were selected from the groups that had more than 1000 cells to ensure a fair comparison between groups (since the dataset has 755 GBCs, and 929 mOSNs). Furthermore, we adjust the data to be closer to a zinb distribution by using a power transformation X α , where α ∈ (0, 1] is chosen to minimize the distance between the empirical distribution and the zinb distribution, measured by Kolmogorov-Smirnov statistics. Finally, we round the normalized data to the closest smaller integer (flooring). The effect of preprocessing on four randomly selected TF genes are shown in Figure S1 , while the effect of the preprocessing steps on the overall performance of the algorithm is shown in Table S13 .

Our first analysis focuses on the total set of 2576 transcription factors (TF), which are known to play a central role in the regulation of other genes . Of these, 1543 are expressed in our dataset and the following passed our gene filtering step in each of the cell types: 1479 in HBC*, 1493 in GBC, 1437 in iOSN, 1360 in mOSN. Data after preprocessing steps were used as input to PC-zinb1 with a significance level of α = 2(1 − Φ(n 0.15 )) , and the maximum cardinality of conditional independence set m = 3.

Once estimated, we extracted the 2-core of the networks by iteratively excluding TFs with fewer than 2 edges, leaving a network with 926, 991, 465 and 772 genes for the HBC*, GBC, iOSN and mOSN cell types, respectively. For each network, we performed community detection using the Leiden algorithm (Traag, Waltman and van Eck, 2019) , where we supervise the resolution parameter to be 0.4 for HBC*, GBC and mOSN, and 0.25 for the iOSN network, as a resolution parameter of 0.4 resulted in a too high number of communities there. The TFs in each of these communities were then used in an overrepresentation analysis using the biological processes hallmark gene sets of the MSigDB database Liberzon et al., 2015) , where only gene sets containing at least 5 genes were retained. We ranked gene sets using a Fisher's exact test, and only the top 10 enriched gene sets were interpreted.

Our second analysis focuses on a set of 242 genes, annotated with the term "stem cell differentiation" in the Mouse Genome Database , expressed in the activated HBC cell type.

The raw count data set consisted of 5418 cells and 175 expressed "stem cell differentiation" genes. The preprocessing outlined above left us with 1000 cells and 160 genes. Normalized data was used as input to PC-zinb algorithm, with a significance level of α = 2(1 − Φ(n 0.15 )). Fig. S4 : F 1 -score of the considered algorithms for the three types of graphs in Supplementary Figure S2 with p = 10, µ = 5, θ = 0.5, π = 0.7: scalefree; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Fig. S5 : F 1 -score of the considered algorithms for the three types of graphs in Supplementary Figure S2 with p = 10, µ = 0.5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S2 with p = 10, µ = 5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S2 with p = 10, µ = 0.5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S2 with p = 10, µ = 5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S2 with p = 10, µ = 0.5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S3 with p = 100, µ = 5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S3 with p = 100, µ = 0.5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S3 with p = 100, µ = 5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of . Figure S3 with p = 100, µ = 0.5, θ = 0.5, π = 0.7: scale-free; hub; Erdos-Renyi. The data were simulated from Poisson (top), NB (middle), and zinb (bottom) models. PC-zinb1: zinb model in which the structure of the graph is attributable to both of the two parameter components µ s|K and π s|K ; PC-zinb0: zinb model in which the structure of the graph is attributable to only the parameter component µ s|K and consider π s|K as a constant; PC-nb: Negative binomial model, i.e., the special case of zinb models where π s|K = 0; PC-pois: Poisson model of .

Supplementary 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure S2 (p = 10), under zinb node conditional models with mean µ = 0.5, 5; θ = 0.5, π = 0.7, and levels of noise µ noise = 0.5. The levels of significance of tests α = 2(1 − Φ(n 0.15 )). 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure S2 (p = 10), under NB node conditional models with mean µ = 0.5, 5; θ = 0.5, and levels of noise µ noise = 0.5. The levels of significance of tests α = 2(1 − Φ(n 0.15 )).

PC-zinb0 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure 

Monte Carlo means of Precision (P), Recall (R), F 1 -score, and runtime obtained by simulating 50 samples from three different types of graphs in Figure S3 (p = 100) , under zinb and NB node conditional distributions mean µ = 5; θ = 0.5; π = 0.7 and levels of noise µ noise = 0.5. The levels of significance of tests α = 2(1 − Φ(n 0.2 )) for n = 500, 1000, and α = 2(1 − Φ(n 0.225 )) for n = 200. 

Copula Gaussian graphical models with penalized ascent Monte Carlo EM algorithm

A local Poisson graphical model for inferring networks from sequencing data

Spatial interaction and the statistical analysis of lattice systems

Non-neuronal expression of SARS-CoV-2 entry genes in the olfaory system suggests mechanisms underlying COVID-19-associated anosmia

HiveR: 2D and 3D Hive Plots for R R

Mouse genome database (MGD) 2019

Latent variable graphical model selection via convex optimization

Architecture of gene regulatory networks controlling flower development in Arabidopsis thaliana

Order-independent constraintbased causal structure learning

The igraph software package for complex network research. InterJournal, complex systems

Structure Learning in Graphical Modeling

Deconstructing olfactory stem cell trajectories at single-cell resolution

Injury activates transient olfactory stem cell states with diverse lineage capacities

A hierarchical Poisson lognormal model for network inference from RNA sequencing data

EpCAM is involved in maintenance of the murine embryonic stem cell phenotype

Consistent estimation of the basic neighborhood of Markov random fields

Exploration, normalization, and summaries of high density oligonucleotide array probe level data

Quantitative single-cell RNAseq with unique molecular identifiers

The versatile functions of Sox9 in development, stem cells, and human diseases

New probabilistic graphical models for genetic regulatory networks studies

The technology and biology of single-cell RNA sequencing

Hive plots-rational approach to visualizing networks

The Molecular Signatures Database Hallmark Gene Set Collection

Stability approach to regularization selection (stars) for high dimensional graphical models

The multiple roles for Sox2 in stem cell maintenance and tumorigenesis

p53 controls genomic stability and temporal differentiation of human neural stem cells and affects neural organization in human brain organoids

Graphical models for zero-inflated single cell gene expression

TGF-β Family Signaling in Neural and Neuronal Differentiation, Development, and Function. Cold Spring Harbor perspectives in biology

Structure Learning of Undirected Graphical Models for Count Data

Learning Gaussian Graphical Models of Gene Networks with False Discovery Rate Control

Dissecting the DNA binding landscape and gene regulatory network of p63 and p53

A general and flexible method for signal extraction from single-cell RNA-seq data

Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis

An empirical Bayes approach to inferring large-scale gene association networks

p63 Is essential for the proliferative potential of stem cells in stratified epithelia

A novel statistical approach for identification of the master regulator transcription factor

Causation, prediction, and search

Gene set enrichment analysis: A knowledge-based approach for interpreting genome

A local Poisson graphical model for inferring networks from sequencing data

The convergence of a class of double-rank minimization algorithms 1. general considerations

Mouse genome database (MGD) 2019

A new approach to variable metric algorithms. The computer journal

Deconstructing olfactory stem cell trajectories at single-cell resolution

A Family of Variable Metric Updates Derived by Variational Means

The Molecular Signatures Database Hallmark Gene Set Collection

Structure Learning of Undirected Graphical Models for Count Data

A general and flexible method for signal extraction from single-cell RNA-seq data

Conditioning of quasi-Newton methods for function minimization

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

From Louvain to Leiden: guaranteeing well-connected communities

The authors would like to thank Diya Das, Rebecca Chance, and John Ngai for providing access to the data and for help with the biological interpretation of

Top 10 enriched gene sets identified for each of the transcription factor communities in the HBC* cell type. Supplementary Table S13 TP, FP, FN, Precision (P), Recall (R), F 1 -score of zinb algorithm for stem Cell Differentiation gene set (consider result graphs from raw as the true graphs). The levels of significance of tests α = 2(1 − Φ(n 0.15 )).