VIA: Generalized and scalable trajectory inference in single-cell omics data / 1 VIA: Generalized and scalable trajectory inference in single-cell omics data 2 Shobana V. Stassen 1 , Gwinky G. K. Yip 1 , Kenneth K. Y. Wong 1,3 , Joshua W. K. Ho 2,4 and Kevin K. Tsia 1,3 3 1 Department of Electrical & Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong 4 2 School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong 5 3 Advanced Biomedical Instrumentation Centre, Hong Kong Science Park, Shatin, New Territories, Hong Kong 6 4 Laboratory of Data Discovery for Health, Hong Kong Science Park, Shatin, New Territories, Hong Kong 7 Abstract 8 Inferring cellular trajectories using a variety of omic data is a critical task in single-cell data science. 9 However, prediction and thus biologically meaningful discovery of cell fates are challenged by the sheer 10 size of single-cell data, diverse omic data types, and their complex data topologies. We present VIA, a 11 scalable trajectory inference algorithm that uses lazy-teleporting random walks to accurately reconstruct 12 complex cellular trajectories beyond tree-like pathways (e.g. cyclic or disconnected structures), and to 13 discover less populous lineages or those otherwise obscured in other methods. VIA outperforms existing 14 algorithms in recapitulating cell fates/lineages, and also mitigates loss of global connectivity information 15 in large datasets beyond a million cells. Furthermore, VIA demonstrates versatility by distilling cellular 16 trajectories in single-cell transcriptomic, epigenomic, proteomic and morphological data – showing new 17 promise in scalable, multifaceted single-cell analysis to explore novel biological processes. 18 Introduction 19 Single-cell omics data captures snapshots of cells that catalog cell types and molecular states with high 20 precision. These high-content single-cell readouts can be harnessed to model evolving cellular 21 heterogeneity and track dynamical changes of cell fates in tissue, tumour, and cell population. However, 22 current computational methods face four critical challenges. First, it remains difficult to accurately 23 reconstruct high-resolution cell trajectories and detect cell fates embedded within them. Even the few 24 algorithms which automate cell fate detection (e.g., SlingShot 1 and Palantir 2 ) exhibit low sensitivity and 25 are highly susceptible to changes in input parameters. Second, current trajectory inference (TI) methods 26 predominantly work well on tree-like trajectories (e.g. Slingshot, Monocle2 3 ), but lack the generalisability 27 to infer disconnected, cyclic or hybrid topologies without imposing restrictions on transitions and 28 causality 4 . Third, the growing scale of single-cell data, notably cell atlases of whole organisms 6 ,7 , 29 embryos 8 ,9 and human organs 10 , exceeds the existing TI capacity, not just in runtime and memory, but in 30 preserving global connectivity, which is often lost after extensive dimension reduction or subsampling. 31 Fourth, fueling the advance in single-cell technologies is the ongoing pursuit to understand cellular 32 heterogeneity from a broader perspective beyond transcriptomics. However, the applicability of TI to a 33 broader spectrum of single-cell data has yet to be fully exploited. 34 To overcome these recurring challenges, we present VIA, a graph-based TI algorithm that uses a new 35 strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks 36 integrated with Markov chain Monte Carlo (MCMC) refinement. VIA relaxes common constraints on 37 traversing the graph by allowing cyclic and temporally reversed movements, and thus robustly detects cell 38 fates involving complex transitions that are otherwise obscured in other methods. VIA outperforms 39 popular TI algorithms in terms of capturing cellular trajectories not limited to multi-furcations and trees, 40 but also disconnected and cyclic topologies ( Supplementary Fig. S1) . Compared to existing TI methods, .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s4aw0ujvuwn8 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=i7cha45y61s6 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=pwzzqt446v1 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=2glf5ij7qkb6 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=mtxpwbv8o1fn https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=gsc2l2p4cfhh https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 41 VIA is highly scalable with respect to number of cells (10 2 to >10 6 cells) and features, without requiring 42 extensive dimensionality reduction or subsampling which compromise global information. We 43 demonstrate VIA’s accuracy, scalability, topological-generalizability and multi-omic versatility across 44 multiple modalities by investigating 10 simulated and experimental datasets ( Supplementary Table S1 ), 45 ranging from single-cell RNA-sequencing (scRNA-seq), single-cell sequencing assay for 46 transposase-accessible chromatin (scATAC-seq), multi-omics integration, to mass and imaging cytometry. 47 Figure 1. General workflow of VIA algorithm. Step 1: Single-cell level graph is clustered such that each node 48 represents a cluster of single cells (computed by our clustering algorithm PARC 11 ). The resulting cluster graph forms 49 the basis for subsequent random walks. Step 2: 2-stage pseudotime computation: (i) The pseudotime (relative to a 50 user defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk along an 51 undirected graph. At each step, the walk (with small probability) can remain (orange arrows) or teleport (red arrows) 52 to any other state. (ii) Edges are then forward biased based on the expected hitting time (See forward biased edges 53 illustrated as the imbalance of double-arrowhead size). The pseudotime is further refined on the directed graph by 54 running Markov chain Monte Carlo (MCMC) simulations (See 3 highlighted paths starting at root). Step 3: Consensus 55 vote on terminal states based on vertex connectivity properties of the directed graph. Step 4 : lineage likelihoods 56 computed as the visitation frequency under lazy-teleporting MCMC simulations. Step 5 : visualization that combines 57 network topology and single-cell level pseudotime/lineage probability properties onto an embedding using GAMs, as 58 well as unsupervised downstream analysis (e.g. gene expression trend along pseudotime for each lineage). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 59 Results 60 Algorithm 61 VIA first represents the single-cell data as a cluster graph (i.e. each node is a cluster of single cells), 62 computed by our recently developed data-driven community-detection algorithm, PARC, which allows 63 scalable clustering whilst preserving global properties of the topology needed for accurate TI 11 ( Step 1 in 64 Fig. 1) . The cell fates and their lineage pathways are then computed by a two-stage probabilistic method, 65 which is the key algorithmic contribution of this work ( Step 2 in Fig. 1 , see Methods ). In the first stage, 66 VIA models the cellular process as a modified random walk that allows degrees of laziness (remaining at 67 a node/state) and teleportation (jumping to any other node/state) with pre-defined probabilities. The 68 pseudotime, and thus the graph directionality, can be computed based on the theoretical hitting times of 69 nodes (See the theory and derivation in Methods and Supplementary Note 2 ). The lazy-teleporting 70 behavior prevents the expected hitting time from converging to a local distribution in the graph as 71 otherwise occurs in regular random walks, especially when the sample size grows 12 . More specifically, the 72 laziness and teleportation factors regulate the weights given to each eigenvector-value pair in the expected 73 hitting time formulation such that the stationary distribution (given by the local-node degree-properties in 74 regular walks) does not overwhelm the global information provided by other ‘eigen-pairs’. Moreover, the 75 computation does not require subsetting the first k eigenvectors (bypassing the need for the user to select 76 a suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of number of 77 cells, but equal to the number of clusters. Hence all eigenvalue-eigenvector pairs can be incorporated 78 without causing a bottleneck in runtime. Consequently in VIA, the modified walk on a cluster-graph not 79 only enables scalable pseudotime computation for large datasets in terms of runtime, but also preserves 80 information about the global neighborhood relationships within the graph. In the second stage of Step 2, 81 VIA infers the directionality of the graph by biasing the edge-weights with the initial pseudotime 82 computations, and refines the pseudotime through MCMC simulations. Next (Step 3 in Fig . 1), the 83 MCMC-refined graph-edges of the lazy-teleporting random walk enable accurate predictions of terminal 84 cell fates through a consensus vote of various vertex connectivity properties derived from the directed 85 graph. The cell fate predictions obtained using this approach are more robust to changes in input data and 86 parameters compared to other TI methods ( Supplementary Fig. S1 and Fig. S16) . Trajectories towards 87 identified terminal states are resolved using lazy-teleporting MCMC simulations ( Step 4 in Fig. 1 ). The 88 probabilistic approach and relaxation of edge constraints allowed by VIA in computing differentiation 89 pathways and pseudotime enables greater sensitivity to cell fates and complex trajectories, and makes 90 allowances for asynchrony in differentiation processes by avoiding prematurely imposing constraints on 91 node-to-node mobility. Other methods resort to constraints such as reducing the graph to a tree, imposing 92 unidirectionality by thresholding edges based on pseudotime directionality, removing outgoing edges 93 from terminal states 13 , 2 and computing shortest paths for pseudotime 2 ,1 . VIA’s probabilistic approach to 94 graph-traversal allows it to infer cell fates when the underlying data spans combinations of multifurcating 95 trees and cyclic/disconnected topologies - topologies and hence lineages often obscured in existing TI 96 methods ( Supplementary Fig. S1 ). Together, these four steps facilitate holistic topological visualization 97 of TI on the single-cell level (e.g. using UMAP or PHATE embeddings 14 ,15 ) and other data-driven 98 downstream analyses such as recovering gene expression trends ( Methods ). ( Step 5 in Fig. 1 ). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=e9uxoipvwota https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=onqq15xmlfrf https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=jbvdrwuod0wd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=igxr9liha0sr https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 99 VIA accurately infers trajectories in diverse scRNA-seq data 100 VIA recapitulates differentiation topologies and identifies elusive cell fates across a wide range of 101 transcriptomic data. We first showcase the ability of VIA to explore large single-cell transcriptomic 102 datasets by employing the 1.3-million-cell mouse organogenesis cell atlas (MOCA) 8 . While this dataset is 103 inaccessible to most TI methods from a runtime and memory perspective, VIA can efficiently resolve the 104 underlying developmental heterogeneity, including 9 major trajectories ( Fig. 2a,b ) with a runtime of ~40 105 min, compared to the next fastest method which has a runtime of at least 4 hours 2 ( Supplementary Table 106 S3 ). VIA preserves wider neighborhood information and reveals a globally connected topology of MOCA 107 which is otherwise lost in the previous method. Broadly speaking, the overall cluster graph of VIA 108 consists of three main branches that concur with the known developmental process at early 109 organogenesis. 16 ( Fig. 2a) . It starts from the root stem which has a high concentration of E9.5 early 110 epithelial cells made of multiple sub-trajectories (e.g. epidermis, nose and foregut/hindgut epithelial cells 111 derived from the ectoderm and endoderm). The stem is connected to two distinct lineages: 1) 112 mesenchymal cells originated from the mesoderm which arises from interactions between the ectoderm 113 and endoderm 17 and 2) neural tube/crest cells derived from neurulation when the ectoderm folds inwards 1 . 114 The sparsity of early cells (only ~8% are E9.5) and the absence of earlier ancestral cells make it 115 particularly challenging to capture the simultaneous development of trajectories. However, the overall 116 pseudotime structure presented by VIA is reasonable. For instance, at the junction of the 117 epithelial-mesenchymal branch, we find early mesenchymal cells from E9.5-E10.5. Cells from later 118 mesenchymal developmental stages (e.g. myocytes from E12.5- E13.5) reside at the leaves of branches. 119 Similarly, at the junction of epithelial-neural tube, we find dorsal tube neural cells and notochord plate 120 cells which are predominantly from E9.5-E10.5 and more developed neural cells at the tips (e.g. 121 excitatory and inhibitory neurons from E12.5-E13.5). VIA also places the other dispersed groups of 122 trajectories (e.g. endothelial, hematopoietic) in biologically relevant neighborhoods ( Supplementary 123 Notes 3, Supplementary Fig. S11 ). While VIA’s connected topology offers a coarse-grained holistic 124 view, it does not compromise the ability to delineate individual lineage pathways (consistent with those 125 found by Cao et al., 8 ) as shown in Fig. 2c and Supplementary Fig. S11. TI using VIA uniquely 126 preserves both the global and local structures of the data and is thus particularly favorable for biological 127 exploration involving large datasets, especially for comparative studies involving cell atlases 19 . Whilst 128 manifold-learning methods are often used to extensively reduce dimensionality to mitigate the 129 computational burden of large single-cell datasets, they tend to incur loss of global information and be 130 sensitive to input parameters. VIA is sufficiently scalable to bypass such a step, and therefore retains a 131 higher degree of neighborhood information when mapping large datasets. This is in contrast to 132 Monocle3’s 8 UMAP-reduced inputs that reveal different disconnected super-groups and fluctuating 133 connectivity depending on input parameters (see Supplementary Fig. S12-15 for the biologically 134 consistent structures proposed by VIA across a range of parameters compared to the contradicting cell 135 super groups and connectivity suggested by a UMAP based TI interpretation ). 136 We next demonstrated the applicability of VIA in single-cell multi-omics analysis by inferring murine 137 Isl1+ cardiac progenitor cell (CPC) transition states using both single-cell transcriptomic and chromatin 138 accessibility information 20 ( Fig. 2d-i ). VIA consistently uncovers the bifurcating lineages towards the .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=mtxpwbv8o1fn https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s2f3i1x6uus6 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=ic935kiviuls https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=c4gwvsmlicg https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=mtxpwbv8o1fn https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=mtxpwbv8o1fn https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=dt4acr7aal4q https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 139 endothelial and cardiomyocyte fates based on the scRNA-seq, scATAC-seq datasets and their data 140 integration ( see Methods for data integration). Other methods such as Palantir and Slingshot, that are also 141 applicable to non-transcriptomic data, fail to uncover the two main lineages in the individual as well as 142 the more challenging integrated multi-omic data. They typically only detect one of the two lineages and 143 instead falsely detect several intermediate and early stages as final cell fates ( see Fig. 2i for prediction 144 accuracy). PAGA does not offer automated cell fate prediction and is therefore not benchmarked for this 145 dataset. VIA detects lineage pathways in both the scRNA-seq and scATAC-seq that can be used to 146 interpret relationships between transcription factor dynamics and gene expression in an unsupervised 147 manner. VIA automatically generates a pseudotemporal ordering of cells (without requiring manual 148 selection of relevant cells as done in Jia et al. 20 ) along respective lineages and their marker-TF pairs ( see 149 Fig. 2f and Supplementary Fig. S8e) . The highlighted gene and TF pairs in the cardiac lineage show a 150 strong correlation between expression and accessibility of Gata and Homeobox Hox genes which are 151 known to be related to the regulation of cardiomyocyte proliferation 23,24,25 . VIA’s reliable performance 152 against user-reconfiguration (choice of components, individual or integrated omic data) suggests it can be 153 used for transferable interpretation between scRNA-seq and chromatin accessibility data. 154 We further tested VIA on a wider scope of (small-to mid-sized) scRNA-seq datasets, including B-cell 155 differentiation 26 , hematopoiesis 2 , 27 , embryonic stem (ES) cell differentiation in embryoid bodies 15 , and 156 endocrine differentiation (~10 2 - 10 4 cells). By comparing VIA with top-performing and popular TI 157 algorithms, e.g. PAGA 28 , Palintir, SlingShot and CellRank 13 (See Methods , and Supplementary Fig. 158 S1-7 for full analysis), we showed that VIA consistently outperforms other methods in terms of both 159 runtime (in some cases by several magnitudes see Supplementary Table S3 for runtime comparison ), 160 and more robust and accurate lineage prediction across a wide range of pre-processing and algorithmic 161 parameters. VIA’s relaxation of graph traversal to permit cyclic sub-paths (see Supplementary Fig. S1) 162 and movements that are temporally reversed, augments its sensitivity to lineages. Notably, VIA more 163 consistently across a wide range of input parameter choice identified less populous lineages that were at 164 best detected by other methods for a narrow sweet spot of parameters. For example, VIA reliably 165 delineates the megakaryocyte, conventional and plasmacytoid dendritic cell (cDC and pDC) lineages in 166 human hematopoiesis ( Fig. 2m-o, Supplementary Fig. S3-4 for pseudotime and graph-topological gene 167 trends for all lineages); and Delta cells (3%) during the endocrine progenitor cells differentiation ( Fig. 168 2j-l, Supplementary Fig. S6 for pseudotime and topological gene trends for all lineages), as evidenced 169 by the corresponding gene-expression trend analysis and parameter stress tests. Interestingly, we find that 170 VIA often detects 2 Beta cell subpopulations (Supplementary Fig. S6b,d,f) that express typical Beta 171 markers like Dlk1, Pdx1 , but differ in their expression of Ins1 and Ins2 (Supplementary Fig. S6d) . Such 172 a Beta cell heterogeneity 29 ,30 , whereby the immature Beta-2 population strongly expresses Ins2 , and 173 weakly expresses Ins1 , and the mature Beta-1 population expresses both types of Ins 30 , can also be 174 reconciled based on the position of the Beta-2 cluster on the VIA graph (Supplementary Fig. S6f). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=dt4acr7aal4q https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=hpu96rs05k0s https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=y5ixemnm3tac https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=9ohfvgceg55 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s2mv1z2n9zoj https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=jbvdrwuod0wd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=hn4iy4esq5sd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=thb0kc2o9554 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 175 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 176 Figure 2 VIA accurately infers trajectories in diverse scRNA-seq datasets. (a) VIA cluster-graph trajectory where 177 nodes are colored by pseudotime, and branches are shaded according to major lineages of 1.3-million-cell mouse 178 organogenesis cell atlas (MOCA). The VIA analysis (which is independent of the choice of visualization) produces a 179 connected structure with linkages between some of the major cell types that have a tendency to become segregated 180 in a UMAP based TI analysis (see Supplementary Fig. S12-15 ). The stem (root) branch consists of epithelial cells 181 derived from ectoderm and endoderm, leading to two main branches: 1) the mesenchymal and 2) the neural tube and 182 neural crest. Other major groups are placed in the biologically relevant neighborhoods, such as the adjacencies 183 between hepatocyte and epithelial trajectories; the neural crest (comprising glial cells and PNS neurons) and the 184 neural tube; as well as the links between early mesenchyme with both the hematopoietic cells and the endothelial 185 cells ( see Supplementary Note 3 ). (b) (Left) Single-cell PHATE embedding colored by major cell groups. (Right) 186 Single-cell PHATE embedding colored by VIA pseudotime. (c) Lineage pathways and probabilities of neuronal, 187 myocyte and WBC lineages ( see Supplementary Fig. S6 for other lineages ). (d) scRNA-seq and scATAC-seq data 188 of Isl1+ cardiac progenitors (CPs) integrated using Seurat3 before VIA TI analysis and PHATE visualization. Cells are 189 colored by annotated cell-type and experimental modality (e) Cells are colored by VIA pseudotime with the 190 VIA-inferred trajectory towards endothelial and myocyte lineages projected on top. (f) Marker gene expression and 191 chromatin accessibility for gene-TF pairs along pseudotime axis for Cardiomyocyte lineage (g) VIA-graph trajectory 192 with nodes colored by pseudotime shows bifurcation to endothelial and myocyte cells in scRNA-seq cells (h) 193 scATAC-seq of Isl1+ CPs: VIA-graph again shows bifurcation after intermediate CP stage. (i) Lineage prediction 194 accuracy (F1-score) for methods that offer automated lineage detection and are not limited to transcriptomic data. (k) 195 Pancreatic Islets: Colored by VIA pseudotime with detected terminal states shown in red and annotated based on 196 known cell type as Alpha, Beta-1, Beta-2, Delta and Epsilon lineages where Beta-2 is Ins1 low Ins2+ Beta subtype 197 ( Supplementary Fig. S7 ). (l) VIA inferred cluster-level pathway shows gene regulation along endocrine progenitor 198 (EP) to Delta lineage specification (top) and Sst gene-expression trend shows rise of Sst in Delta lineage ( See 199 Supplementary Fig. S7 for remaining ). (m) Prediction Accuracy of the 4 major endocrine cell types when varying 200 the number of HVGs selected in pre-processing, and the number of PCs. (n) Human CD34+ hematopoiesis with 6 201 detected cell fates annotated (o) lineage pathway and gene-pseudotime trend shown for the CD41 Megakaryocytic 202 cells ( see Supplementary Fig. S3 for other lineages ). (p) Prediction accuracy of 6 cell fates when varying number 203 of K (nearest neighbors) and PCs. Note Slingshot on default mode (“V2”) uses GMM clustering and “V1” uses 204 K-means clustering (allowing for over-clustering K=15, to increase sensitivity). Runtime of each method is also 205 highlighted below the chart. 206 VIA enables multi-omic analysis beyond transcriptomic data 207 Broad applicability of TI beyond transcriptomic analysis is increasingly critical, but existing methods 208 have limitations contending with the disparity in the data structure (e.g. sparsity and dimensionality) 209 across a variety of single-cell data types and oftentimes are designed with a view to only handling 210 transcriptomic data (e.g. methods using RNA velocity to infer directionality). 211 First, we employ VIA to analyze human scATAC-seq profiles (from CD34+ human bone marrow) ( Fig. 212 3a ), and find that the continuous landscape of hematopoiesis generally mirrors the scRNA-seq human 213 hematopoietic data ( Fig. 2c ). The intrinsic sparsity of scATAC-seq data poses a challenge that can be 214 alleviated by choice of pre-processing pipelines, and we see that VIA consistently predicts the expected 215 hierarchy of furcations that leads to the lymphoid, myeloid and erythroid lineages for two commonly 216 accepted pre-processing protocols 31 ,27 ( Methods ) . This again holds across a wide range of input 217 parameters, as shown by the changes in the accessibility of TF motifs associated with known regulators, 218 e.g. Gata1 (erythroid), Cebpd (myeloid) (Fig 3b-d, Supplementary Fig. S7). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=cpy816x2qcr https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 219 We next investigated whether VIA can cope with a significant drop in data dimensionality (10-100), as 220 often presented in flow/mass cytometry data, and still delineate continuous biological processes. We run 221 VIA on a time-series mass cytometry data (28 antibodies, 90K cells) capturing murine embryonic stem 222 cells (ESCs) differentiation toward mesoderm cells (Day 0 - Day 11) 32 . Unlike previous analysis 32 of the 223 same data which required chronological labels to visualize the developmental hierarchy, we ran VIA 224 without such supervised adjustments and accurately captured the sequential development. VIA computed 225 the trajectories with faster runtime (running in 2 minutes versus Slingshot which required 6 hours see 226 Table S3 ), detecting 3 terminal states corresponding to cells in the final developmental stages: 2 227 corresponding to the main region of Day 10-11 (marked by Pdgfra , Cd44 and Gata4 expressions), and a 228 small population of Day 10-11 cells expressing EpCAM, which are otherwise obscured in other methods 229 (e.g. Palantir, Slingshot), especially the small EpCAM population (~0.5% of cells) (Fig 3e-h, Fig. S9e,f) . 230 Finally we tested the adaptability of VIA to infer cell-cycle stages based on label-free single-cell 231 biophysical morphology (38 features, see Supplementary Table S4 and Table S5 ) profiled by our 232 recently developed high-throughput imaging flow cytometer, called FACED 33 . VIA reliably reconstructed 233 the continuous cell-cycle progressions from G1-S-G2/M phase of two different types of live breast cancer 234 cells as validated by the single-cell fluorescent (DNA dye) images captured by the same system 235 ( Methods )( Fig. 3i-k for MCF7 , Supplementary Fig. S10 for MDA-MB231 ) . Intriguingly, according to 236 the pseudotime ordered by VIA, not only can it reveal the known cell growth in size and mass 34 , and 237 general conservation of cell mass density 35 (as derived from the FACED images ( Methods )) throughout 238 the G1/S/G2 phases, but also a slow-down trend during the G1/S transition, consistent with the lower 239 protein-accumulation rate during S phase 36 ( Fig. 3l, Supplementary Fig, S10 f,g ). The variation in 240 biophysical textures (e.g. phase entropy) along the VIA pseudotime likely relates to known architectural 241 changes of chromosomes and cytoskeletons during the cell cycles ( Fig. 3l, Fig. S10 f,g ). These results 242 further substantiate the growing body of work 37 ,38,39,40 on imaging biophysical cytometry for gaining a 243 mechanistic understanding of biological systems, especially when combined with omics analysis 41 . 244 Concluding Remarks 245 Overall, VIA offers an advancement to TI methods to study a diverse range of single-cell omic data, 246 including those targeted by many cell-atlas initiatives. By combining lazy-teleporting random walks and 247 MCMC simulations, VIA relaxes common constraints on graph traversal and causality. This enables 248 accurate lineage prediction that is robust to parameter configuration for a variety of complex topologies 249 and rarer lineages obscured in other methods. Our stress tests showed that the modeled developmental 250 landscape in other methods is vulnerable to user parameter choice which can incur fragmentation or 251 spurious linkages, and consequently only yield biologically sensible lineages for a narrow sweet spot of 252 parameters (See the summary in Supplementary Fig. S16 ). For example, due to algorithmic measures 253 taken to restrict permissible graph-edge transitions and progressively reduce the inherent dimensionality 254 (e.g. PCA followed by subsetting the number of diffusion components) other algorithms struggle to 255 delineate obscure lineages and maintain neighborhood relationships. VIA’s wider bandwidth of accuracy, 256 superior runtime and preservation of global graph properties for very large datasets, offers a unique and 257 well-suited approach for multifaceted exploratory analysis to uncover novel biological processes, 258 potentially those deviated from healthy trajectories. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=29jzsvekzs4z https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=29jzsvekzs4z https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=1ezm5xa09jlh https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=tbqfcii5qr09 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=wlju6tjkbefs https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=qj4499p96ubk https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=z8feu0u2i2yt https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=d11sb02yalb https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 259 Figure 3 VIA infers trajectories in single-cell multi-omic and image datasets (a) Major lineages of human 260 hematopoiesis (profiled by scATAC-seq) projected onto the UMAP embedding. Lineages are colored by FACS sorted 261 labels 27 . (b) VIA cluster-graph topology colored by VIA pseudotime. (c) Trajectory, pseudotime and detected terminal 262 states (red) projected onto the UMAP embedding. (d) F1-scores (on the k-mer Z-score input) for terminal state 263 prediction by different TI methods (for a fixed KNN = 20). Terminal states include megakaryocyte–erythroid progenitor 264 (MEP), common lymphoid progenitor (CLP), plasmacytoid dendritic cell (pDC) and monocytes (Mono) lineages. The 265 comparisons show that VIA's accuracy remains high across a wide range of PCs. (e) Differentiation of mESC to 266 mesoderm cells measured by single-cell mass cytometry. UMAP embedding is colored by different measurement time 267 points (Day 0-11). (f) VIA cluster graph with 3 detected terminal nodes (red) and colored by pseudotime. (g) VIA 268 results projected onto single-cell UMAP embedding shows 3 terminal states correspond to Day 10/11 regions. (h) 269 Correlation of inferred pseudotime and day-labels achieved by different TI methods. The benchmark was done across 270 different numbers of KNN (using all 28 antibodies). (i) Label-free cell cycle progression tracking based on FACED 271 imaging cytometry. The PHATE embedding is constructed using 38 biophysical/morphological features computed 272 from images of human breast cancer cells (MCF7) (See Supplementary Fig. S10 for additional results using another 273 breast cancer cell type (MDA-MB231)). The embedding is colored by the known cell cycle stages given by the DNA 274 fluorescence images (obtained from the same system). (j) VIA graph topology colored by pseudotime. (k) VIA 275 trajectory and pseudotime projected on embedding. (l) “Biophysical” feature expressions (Z-score normalized) over 276 pseudotime. (See Supplementary Table S4-5 for detailed definitions of the features). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=y5ixemnm3tac https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 277 Methods 278 VIA Algorithm 279 VIA applies a scalable probabilistic method to infer cell state dynamics and differentiation hierarchies by 280 organizing cells into trajectories along a pseudotime axis in a nearest-neighbor graph which is the basis 281 for subsequent random walks. Single cells are represented by graph nodes that are connected based on 282 their feature similarity, e.g. gene expression, transcription factor accessibility motif, protein expression or 283 morphological features of cell images. A typical routine in VIA mainly consists of four steps: 284 1. Accelerated and scalable cluster-graph construction . VIA first represents the single-cell data in a 285 k-nearest-neighbor (KNN) graph where each node is a cluster of single cells. The clusters are 286 computed by our recently developed clustering algorithm, PARC 11. . In brief, PARC is built on 287 hierarchical navigable small world (HNSW 58 ) accelerated KNN graph construction and a fast 288 community-detection algorithm (Leiden method 42 ), which is further refined by data-driven pruning. 289 The combination of these steps enables PARC to outperform other clustering algorithms in 290 computational run-time, scalability in data size and dimension (without relying on subsampling of 291 large-scale, high-dimensional single-cell data (>1 million cells)), and sensitivity of rare-cell detection. 292 We employ the cluster-level topology, instead of a single-cell-level graph, for TI as it provides a 293 coarser but clearer view of the key linkages and pathways of the underlying cell dynamics without 294 imposing constraints on the graph edges. Together with the strength of PARC in clustering scalability 295 and sensitivity, this step critically allows VIA to faithfully reveal complex topologies namely cyclic, 296 disconnected and multifurcating trajectories ( Supplementary Fig. S1 ). 297 2. Probabilistic pseudotime computation . The trajectories are then modeled in VIA as (i) 298 lazy-teleporting random walk paths along which the pseudotime is computed and further refined by 299 (ii) MCMC simulations. The root is a single cell chosen by the user.These two sub-steps are detailed 300 as follows: 301 (i) Lazy-teleporting random walk : We first compute the pseudotime as the expected hitting time 302 of a lazy-teleporting random walk on an undirected cluster-graph generated in Step 1. The 303 lazy-teleporting nature of this random walk ensures that as the sample size grows, the expected 304 hitting time of each node does not converge to the stationary probability given by local node 305 properties, but instead continues to incorporate the wider global neighborhood information 12 . 306 Here we highlight the derivation of the closed form expression of the hitting time of this modified 307 random walk with a detailed derivation in Supplementary Note 2 . 308 The cluster graph constructed in VIA is mathematically defined as a weighted connected graph G 309 ( V , E , W ) with a vertex set V of n vertices (or nodes), i.e. and an edge set E , V = {v , , }1 ⋯ vn 310 i.e. a set of ordered pairs of distinct nodes. W is an weight matrix that describes a set of n ×n 311 edge weights between node i and j , are assigned to the edges . For an undirected ≥0wij v ,( i vj) 312 graph, The probability transition matrix, P, of a standard random walk on this wwij = ji ×nn 313 graph G can be given by .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=e9uxoipvwota https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=nlu4dvpyxpr5 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=w4963vuo4u2f https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=onqq15xmlfrf https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 314 D WP = −1 (1) 315 where D is the degree matrix, which is a diagonal matrix of the weighted sum of the degree ×nn 316 of each node, i.e. the matrix elements are expressed as 317 where k are the neighbouring nodes connected to node i . Hence, (which can be reduced as ) dii di 318 is the degree of node i . We next consider a lazy random walk, defined as Z , with probability 319 ( ) of being lazy (where 0 ), i.e. staying at the same node, then 1 − x < x < 1 320 xP 1 )IZ = + ( − x (3) 321 where I is the identity matrix. When teleportation occurs with a probability ( ), the modified 1 − α 322 lazy-teleporting random walk Z' can be written as follows, where is an matrix of ones. J ×nn 323 αZ 1 ) JZ ′ = + ( − α n 1 (4) 324 Here we adapt the concept of personalized PageRank vector, originally used for recording (or 325 ranking ) personal preferences of a web-surfer toward particular website pages 43 , to rank the 326 importance of other nodes (clusters of cells) to a given node, depending on the similarities among 327 nodes (related to P in the graph), and the lazy-teleporting random walk characteristics in the 328 graph (set by probabilities of teleporting and being lazy). Based on this concept, one could model 329 the likelihood to transit from one node (cluster of cells) to another, and thus construct the 330 pseudotime based on the hitting time, which is a parameter describing the expected number of 331 steps it takes for a random walk that starts at node i and visit node j for the first time. Consider 332 the teleporting probability of ( ) and a seed vector s specifying the initial probability 1 − α 333 distribution across the n nodes (such that , where s m is the probability of starting at ∑ m sm = 1 334 node m ) the personalized PageRank vector (which is defined as a column vector) is the prα (s) 335 unique solution to 56 336 . αpr Z 1 )sprα (s) T = α (s) T + ( − α T (5) 337 Substituting Z (Eq. (3)) into Eq. (5), we can express the personalized PageRank vector in prα (s) 338 terms of the inverse of the 𝛃 -normalized Laplacian, of the modified random walk Rβ,N L 339 ( Supplementary Note 2), i.e. 340 , s D R Dprα (s) T = β T −0.5 β,N L 0.5 (6) 341 where , and . and are the m th eigenvector and β = (2−α) 2(1−α) Rβ,N L = ∑ m=1 Φ Φm T m β+2x(1−β)η[ m] Φm ηm 342 eigenvalue of the normalized Laplacian. In the expression of R 𝛃,NL, the 𝛃 and x regulate the 343 weight of contribution in each eigenvalue-eigenvector pair of the summation such that the first 344 eigenvalue-eigenvector pair (corresponding to the stationary distribution and given by the 345 local-node degree-properties) remains included in the overall expression, but does not overwhelm 346 the global information provided by subsequent ‘eigen-pairs’. Moreover, computation of R 𝛃,NL is 347 not limited to a subset of the first k eigenvectors (bypassing the need for the user to select a 348 suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=uzyf68ddp60p https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=fi93hnl2oym0 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 349 number of cells, but equal to the number of clusters and hence all eigenvalue-eigenvector pairs 350 can be incorporated without causing a bottleneck in runtime. 351 The expected hitting time from node q to node r is given by 44 , 352 hα (q, )r = dr pr (e ) (r)[ α r T ] − dq pr (e ) (q)[ α r T ] (7) 353 where is an indicator vector with 1 in the i th entry and 0 elsewhere (i.e. if and ei sm = 1 m = i 354 if ). We can substitute Eq. (6) into Eq. (7), making use of the fact that sm = 0 ≠im 355 , and is symmetric, to obtain a closed form expression of the 1dr = D e[ −1 r] (r) R DD−0.5 β,N L −0.5 356 hitting time in terms of Rβ,N L 357 (e ) D R D ehα (q, )r = β r − eq T −0.5 β,N L −0.5 r (8) 358 (ii) MCMC simulation : The hitting time metric computed in Step-1 is used to infer 359 graph-directionality. Instead of pruning edges in the ‘reverse’ direction, edge-weights are biased 360 based on the time difference between nodes using the logistic function with growth factor b =1. 361 (t) f = 1 1+e −b (t − t )1 0 362 We then recompute the pseudotimes on the forward biased graph: Since there is no closed form 363 solution of hitting times on a directed graph, we perform MCMC simulations (parallely processed 364 to enable fast simulations of 1000s of teleporting, lazy random walks starting at the root node of 365 the cluster graph) and use the first quartile of the simulated pseudotime values for a respective 366 node as the refined pseudotime for that node relative to the root. This refinement step ensures that 367 the pseudotime is robust to the spurious links (or conversely, links that are too weakly weighted) 368 that can distort calculations based purely on the closed form solution of hitting times 369 ( Supplementary Fig. S9d ). By using this 2-step pseudotime computation, VIA mitigates the 370 issues of convergence issues and spurious edge-weights, both of which are common in 371 random-walk pseudotime computation on large and complex datasets 12. . 372 3. Automated terminal-state detection. The algorithm then uses the refined directed and weighted 373 graph (the edges are re-weighted using the refined pseudotimes) to predict which nodes represent the 374 terminal states based on a consensus vote of pseudotime and multiple vertex connectivity properties, 375 including out-degree (i.e. the number of edges directing out from the node), closeness C( q ) , and 376 betweenness B( q ). 377 C (q) = 1 (q,r)∑ q≠r l 378 B (q) = ∑ r=q≠t/ σrt σ (q)rt 379 is the distance between node q and node r (i.e. the sum of edges in a shortest path connecting l (q, )r 380 them). is the total number of shortest paths from node r to node t . is the number of these σrt σrt (q) .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=hy8nvy7h6bta https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=onqq15xmlfrf https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 381 paths passing through node q . The consensus vote is performed on nodes that score above (or below 382 for out-degree) the median in terms of connectivity properties. We show on multiple simulated and 383 real biological datasets that VIA more accurately predicts the terminal states across a range of input 384 data dimensions and key algorithm parameters than other methods attempting the same 385 (Supplementary Fig. S16). 386 4. Automated trajectory reconstruction . VIA then identifies the most likely path of each lineage by 387 computing the likelihood of a node traversing towards a particular terminal state (e.g. differentiation). 388 These lineage likelihoods are computed as the visitation frequency under lazy-teleporting MCMC 389 simulations from the root to a particular terminal state, i.e. the probability of node i reaching 390 terminal-state j as the number of times cell i is visited along a successful path (i.e. terminal-state j is 391 reached) divided by the number of times cell i is visited along all of the simulations. In contrast to 392 other trajectory reconstruction methods which compute the shortest paths between root and terminal 393 node 1 ,2 , the lazy-teleporting MCMC simulations in VIA offer a probabilistic view of pathways under 394 relaxed conditions that are not only restricted to the random-walk along a tree-like graph, but can also 395 be generalizable to other types of topologies, such as cyclic or connected/disconnected paths. In the 396 same vein, we avoid confining the graph to an absorbing Markov chain 13,3 (AMC) as this places 397 prematurely strict / potentially inaccurate constraints on node-to-node mobility and can impede 398 sensitivity to cell fates (as demonstrated by VIA’s superior cell fate detection across numerous 399 datasets ( Supplementary Fig. S16 ). 400 Downstream visualization and analysis 401 VIA generates a visualization that combines the network topology and single-cell level 402 pseudotime/lineage probability properties onto an embedding based on UMAP or PHATE. Generalized 403 additive models (GAMs) are used to draw edges found in the high-dimensional graph onto the lower 404 dimensional visualization ( Fig. 1 ). An unsupervised downstream analysis of cell features (e.g. marker 405 gene expression, protein expression or image phenotype) along pseudotime for each lineage is performed 406 ( Fig. 1 ). Specifically, VIA plots the expression of features across pseudotime for each lineage by using 407 the lineage likelihood properties to weight the GAMs. A cluster-level lineage pathway is automatically 408 produced by VIA to visualize feature heat maps at the cluster-level along a lineage-path to see the 409 regulation of genes. VIA provides the option of gene imputation before plotting the lineage specific gene 410 trends. The imputation is fast as it relies on the single-cell KNN (scKNN) graph computed in Step 1. 411 Using an affinity-based imputation method 45 , this step computes a “diffused” transition matrix on the 412 scKNN graph used to impute and denoise the original gene expressions. 413 Benchmarked Methods 414 The methods were mainly chosen based on their superior performance in a recent large-scale 415 benchmarking study 4 , including a select few recent methods claiming to supersede those in the study. 416 Specifically, recent and popular methods exhibiting reasonable scalability, and automated cell fate 417 prediction in multi-lineage trajectories were favoured as candidates for benchmarking (See 418 Supplementary Table S1 for the key characteristics of methods). Performance stress-tests in terms of 419 lineage detection of each biological dataset, and pseudotime correlation for time-series data were .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s4aw0ujvuwn8 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=jbvdrwuod0wd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=c4deqc1by20h https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=pwzzqt446v1 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 420 conducted over a range of key input parameters (e.g. numbers of k-nearest neighbors, highly variable 421 genes (HVGs), and principal components (PCs)) and pre-processing protocols (see Fig. 2m,p, 422 Supplementary Fig. 16 ). All comparisons were run on a computer with an Intel(R) Xeon (R) W-2123 423 central processing unit (3.60GHz, 8 cores) and 126 GB RAM. 424 Quantifying terminal state prediction accuracy for parameter tests was done using the F1-score, defined 425 as the harmonic mean of recall and precision and calculated as: 426 F 1 = tp tp + 0.5(f p+f n) 427 Where tp is a true-positive: the identification of a terminal cluster that is in fact a final differentiated cell 428 fate; fp is a false positive identification of a cluster as terminal when in fact it represents an intermediate 429 state; and fn is a false negative where a known cell fate fails to be identified 430 PAGA 28 . It uses a cluster-graph representation to capture the underlying topology. PAGA computes a 431 unified pseudotime by averaging the single-cell level diffusion pseudotime computed by DPT, but 432 requires manual specification of terminal cell fates and clusters that contribute to lineages of interest in 433 order to compare gene expression trends across lineages. 434 Palantir 2 . It uses diffusion-map 46. components to represent the underlying trajectory. Pseudotimes are 435 computed as the shortest path along a KNN-graph constructed in a low-dimensional diffusion component 436 space, with edges weighted such that the distance between nodes corresponds to the diffusion 437 pseudotime 47. (DPT). Terminal states are identified as extrema of the diffusion maps that are also outliers 438 of the stationary distribution. The lineage-likelihood probabilities are computed using Absorbing Markov 439 Chains (constructed by removing outgoing edges of terminal states, and thresholding reverse edges). 440 Slingshot 1 . It is designed to process low-dimensional embeddings of the single-cell data. By default 441 Slingshot runs clustering based on Gaussian mixture modeling and recommends using the first few PCs as 442 input. Slingshot connects the clusters using a minimum spanning tree and then fits principle curves for 443 each detected branch. It uses the orthogonal projection against each principal curve to fit a separate 444 pseudotime for each lineage, and hence the gene expressions cannot be compared across lineages. Also, 445 the runtimes are prohibitively long for large datasets or high input dimensions. 446 CellRank 13 . This method combines the information of RNA velocity (computed using scVelo 48. ) and 447 gene-expression to infer trajectories. Given it is mainly suited for the scRNA-seq data, with the 448 RNA-velocity computation limiting the overall runtime for larger dataset, we limit our comparison to the 449 pancreatic dataset which the authors of CellRank used to highlight its performance. 450 Simulated Data 451 We employed the DynToy 4 ( https://github.com/dynverse/dyntoy ) package, which generates synthetic 452 single-cell gene expression data (~1000 cells x 1000 ‘genes’), to simulate different complex trajectory 453 models. Using these datasets, we tested that VIA consistently and more accurately captures both tree and 454 non-tree like structures (multifurcating, cyclic, and disconnected) compared to other methods .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s2mv1z2n9zoj https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=6th1gotw3ydi https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=l3wpb1nev23n https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s4aw0ujvuwn8 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=jbvdrwuod0wd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=xpkwepv274v https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=pwzzqt446v1 https://github.com/dynverse/dyntoy https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 455 (Supplementary Fig. S1) . All methods are subject to the same data pre-processing steps, PCA dimension 456 reduction and root-cell to initialize the path. 457 Multifurcating structure . This dataset consists of 1000 ‘cells’ multifurcating into 4 terminal states. VIA 458 robustly captures all four terminal cell fates across a range of input PCs and the pseudotimes are well 459 inferred relative to the root node (Supplementary Fig. S1a) . Note that two terminal states (M2 and M8), 460 which are very close to each other, are easily merged by the other methods (Slingshot, Palantir and 461 PAGA). 462 Cyclic structure. We ran VIA and other methods for different values of K nearest neighbors. VIA 463 unambiguously shows a cyclic network for a range of K (in KNN). Slingshot does not use a KNN 464 parameter and shows 3 fragmented different lineages (top to bottom). PAGA fails to capture the 465 connected cyclic structure at K = 10 and 5, while Palantir visually shows a linear (K = 10, 30) or 466 disconnected structure (K = 5). Van den Berge et al 57 note that the challenge of cyclic trajectory 467 reconstruction is also common in other popular methods, such as Monocle3 that consistently fragments or 468 fits branching structures onto cyclic simulated datasets. 469 Disconnected structure. This dataset comprises two disconnected trajectories (T1 and T2). T1 is cyclic 470 with an extra branch (M5 to M6), T2 has a bifurcation at M3 ( Supplementary Fig. S1c) . VIA captures 471 the two disconnected structures as well as the M6 branch in the cyclic structure, and the bifurcation in the 472 smaller structure. PAGA captures the underlying structure at PC = 20 but becomes fragmented for other 473 numbers of PCs. Palantir also yields multiple fragments and is not able to capture the overall structure, 474 while Slingshot (using the default clustering based on Gaussian mixture modeling) connects T1 and T2, 475 and only captures one of the bifurcations in T1. 476 Biological Data 477 The pre-processing steps described below for each dataset are not included in the reported runtimes as 478 these steps are typically very fast, (typically less than 1-10% of the total runtime depending on the 479 method. E.g. only a few minutes for pre-processing 100,000s of cells) and only need to be performed 480 once as they remain the same for all subsequent analyses. It should also be noted that visualization (e.g. 481 UMAP, t-SNE) are not included in the runtimes. VIA provides a subsampling option at the visualization 482 stage to accelerate this process for large datasets without impacting the previous computational steps. 483 However, to ensure fair comparisons between TI methods (e.g. other methods do not have an option to 484 compute the embedding on a subsampled input and transfer the results between the full trajectory and the 485 sampled visualization, or rely on a slow version of tSNE), we simply provide each TI method with a 486 pre-computed visualization embedding on which the computed results are projected. 487 ScRNA-seq of mouse pre-B cells. This dataset 26 models the pre-BI cell (Hardy fraction C’) process 488 during which cells progress to the pre-BII stage and B cell progenitors undergo growth arrest and 489 differentiation. Measurements were obtained at 0, 2, 6, 12, 18 and 24 hours (h) for a total of 313 cells x 490 9,075 genes. We follow a standard Scanpy preprocessing recipe 49 that filters cells with low counts, and 491 genes that occur in less than 3 cells. The filtered cells are normalized by library size and log transformed. 492 The top 5000 highly variable genes (HVG) are retained. Cells are renormalized by library count and 493 scaled to unit variance and zero mean. VIA identifies the terminal state at 18-24 h and accurately .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=mfw3e1p8x8hj https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=hpu96rs05k0s https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=s9izyt6361ye https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 494 recapitulates the gene expression trends 26 along inferred pseudotime of IgII1 , Slc7a5 , Fox01 , Myc , Ldha 495 and Lig4 . ( Supplementary Fig. S2a). We show the results generalize across a range of PCs for two 496 values of K of the graph with higher accuracy in locating the later cell fates than Slingshot and Palantir. 497 ( Supplementary Fig. S2b). 498 ScRNA-seq of human CD34+ bone marrow cells. This is a scRNA-seq dataset of 5800 cells 499 representing human hematopoiesis 2. . We used the filtered, normalized and log-transformed count matrix 500 provided by Setty et al 2. ., with PCA performed on all the remaining genes. The cells were annotated using 501 SingleR 50. which automatically labeled cells based on the hematopoietic reference dataset Novershtern 502 Hematopoietic Cell Data - GSE24759 51. . The annotations are in agreement with the labels inferred by 503 Setty et al. for the 7 clusters, including the root HSCs cluster that differentiates into 6 different lineages: 504 monocytes, erythrocytes, and B cells, as well as the less populous megakaryocytes, cDCs and pDCs. VIA 505 consistently identifies these lineages across a wider range of input parameters and data dimensions (e.g. 506 the number of K and PCs provided as input to the algorithms see Fig. 2p, and Supplementary Fig. S3c ). 507 Notably, the upregulated gene expression trends of the small populations can be recovered in VIA, i.e. 508 pDC and cDC show elevated CD123 and CSF1R levels relative to other lineages, and the upregulated 509 CD41 expression in megakaryocytes ( Supplementary Fig. S3-S4) . 510 ScRNA-seq of human embryoid body. This is a midsized scRNA-seq dataset of 16,825 human cells in 511 embryoid bodies (EBs) 15 . We followed the same pre-processing steps as Moon et al. to filter out dead 512 cells and those with too high or low library count. Cells are normalized by library count followed by 513 square root transform. Finally the transformed counts are scaled to unit variance and zero mean. The 514 filtered data contained 16825 cells × 17580 genes. PCA is performed on the processed data before 515 running each TI method. VIA identifies 6 cell fates, which, based on the upregulation of marker genes as 516 cells proceed towards respective lineages, are in accord with the annotations given by Moon et al., (See 517 the gene heatmap and changes in gene expression along respective lineage trajectories in Supplementary 518 Fig. S5). Note that Palantir and Slingshot do not capture the cardiac cell fate, and Slingshot also misses 519 the neural crest ( see the F1-scores summary for terminal state detection Supplementary Fig. S5). 520 ScRNA-seq of mouse organogenesis cell atla s . This is a large and complex scRNA-seq dataset of mouse 521 organogenesis cell atlas (MOCA) consisting of 1.3 million cells 6. . The dataset contains cells from 61 522 embryos spanning 5 developmental stages from early organogenesis (E9.5-E10.5) to organogenesis 523 (E13.5). Of the 2 million cells profiled, 1.3 million are ‘high-quality’ cells that are analysed by VIA. The 524 runtime is approximately 40 minutes which is in stark contrast to the next fastest tool Palantir which takes 525 4 hours (excluding visualization). The authors of MOCA manually annotated 38 cell-types based on the 526 differentially expressed genes of the clusters. In general, each cell type exclusively falls under one of 10 527 major and disjoint trajectories inferred by applying Monocle3 to the UMAP of MOCA. The authors 528 attributed the disconnected nature of the 10 trajectories to the paucity of earlier stage common 529 predecessor cells. We followed the same steps as Cao et al. 6 to retain high-quality cells (i.e. remove cells 530 with less than 400 mRNA, and remove doublet cells and cells from doubled derived sub-clusters). PCA 531 was applied to the top 2000 HVGs with the top 30 PCs selected for analysis. VIA analyzed the data in the 532 high-dimensional PC space. We bypass the step in Monocle3 6 which applies UMAP on the PCs prior to 533 TI as this incurs an additional bias from choice of manifold-learning parameters and a further loss in .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=hpu96rs05k0s https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=14ruyzz6p81 https://bioconductor.org/packages/3.11/SingleR https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=gqvc4cw37qlq https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24759 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=8w1hbpz1mcwd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=9ohfvgceg55 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=2glf5ij7qkb6 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=2glf5ij7qkb6 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=2glf5ij7qkb6 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 534 neighborhood information. As a result, VIA produces a more connected structure with linkages between 535 some of the major cell types that become segregated in UMAP (and hence Monocle3), and favors a 536 biologically relevant interpretation ( Fig. 2, Supplementary Fig. S11 ). A detailed explanation of these 537 connections (graph-edges) extending between certain major groups using references to literature on 538 organogenesis is presented in Supplementary Note 3. 539 ScRNA-seq of murine endocrine development 5 . This is an scRNA-seq dataset of E15.5 murine 540 pancreatic cells spanning all developmental stages from an initial endocrine progenitor-precursor (EP) 541 state (low level of Ngn3 , or Ngn3 low ), to the intermediate EP (high level of Ngn3 , or Ngn3 high ) and Fev + 542 states, to the terminal states of hormone-producing alpha, beta, epsilon and delta cells5. 5. Following steps 543 by Lange et al 13. , we preprocessed the data using scVelo to filter genes, normalize each cell by total counts 544 over all genes, keep the top most variable genes, and take the log-transform. PCA was applied to the 545 processed gene matrix. We assessed the performance of VIA and other TI methods (CellRank, Palantir, 546 Slingshot) across a range of number of retained HVGs and input PCs ( Fig. 2m , Supplementary Fig. S6) . 547 ScATAC-seq of human bone marrow cells. This scATAC-seq data profiles 3072 cells isolated from 548 human bone marrow using fluorescence activated cell sorting (FACS), yielding 9 populations 27 : HSC, 549 MPP, CMP, CLP, LMPP, GMP, MEP, mono and plasmacytoid DCs ( Fig. 3a and Supplementary Fig. 550 S7 ). We examined TI results for two different preprocessing pipelines to gauge how robust VIA is on the 551 scATAC-seq analysis which is known to be challenging for its extreme intrinsic sparsity. We used the 552 pre-processed data consisting of PCA applied to the z-scores of the transcription factor (TF) motifs used 553 by Buenrostro et a 27. . Their approach corrects for batch effects in select populations and weighting of PCs 554 based on reference populations and hence involves manual curation. We also employed a more general 555 approach used by Chen et al. 31. which employs ChromVAR to compute k-mer accessibility z-scores across 556 cells. VIA infers the correct trajectories and the terminal cell fates for both of these inputs, again across a 557 wide range of input parameters ( Fig. 3d and Supplementary Fig. S7 ). 558 ScRNA-seq and scATAC-seq of Isl1+ cardiac progenitor cells. This time-series dataset captures 559 murine Isl1+ cardiac progenitor cells (CPCs) from E7.5 to E9.5 characterized by scRNA-seq (197 cells) 560 and scATAC-seq (695 cells) 20. . The Isl1+ CPCs are known to undergo multipotent differentiation to 561 cardiomyocytes or endothelial cells. For the scRNA-seq data, the quality filtered genes and the size-factor 562 normalized expression values are provided by Jia et al. 20 as a “Single Cell Expression Set” object in R. 563 Similarly, the cells in the scATAC-seq experiment were provided in a “SingleCellExperiment” object with 564 low quality cells excluded from further analysis. The accessibility of peaks was transformed to a binary 565 representation as input for TF-IDF (term frequency-inverse document frequency) weighting prior to 566 singular value decomposition (SVD). The highlighted TF motifs in the heatmap ( Fig. 2j ) correspond to 567 those highlighted by Jia et al. We tested the performance when varying the number of SVDs used. We 568 also considered the outcome when merging the scATAC-seq and scRNA-seq data using Seurat3 52. . 569 Despite the relatively low cell count of both datasets, and the relatively under-represented scRNA-seq cell 570 count, the two datasets overlapped reasonably well and allowed us to infer the expected lineages in an 571 unsupervised manner ( Fig. 2d and Supplementary Fig. S8 . In contrast, Jia et al., performed a supervised 572 TI by manually selecting cells relevant to the different lineages (for the scATAC-seq cells) and choosing 573 the two diffusion components that best characterize the developmental trajectories in low dimension 20 . .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=4l8bltp8p9u2 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=4l8bltp8p9u2 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=jbvdrwuod0wd https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=y5ixemnm3tac https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=y5ixemnm3tac https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=cpy816x2qcr https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=dt4acr7aal4q https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=dt4acr7aal4q https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=hkri0zw1jhb3 https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=dt4acr7aal4q https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 574 Mass cytometry data of mouse embryonic stem cells (mESC) . This is a mass cytometry (or CyTOF) 575 dataset, consisting of 90,000 cells and 28 antibodies (corresponding to ~7000 cells each from Day 0-11 576 measurements), that represents differentiation of mESC to mesoderm cells 32. . An arcsinh transform with a 577 scaling factor of 5 was applied on all features - a standard procedure for CyTOF datasets, followed by 578 normalization to unit variance and zero mean. Given the small feature set, no PCA is required 579 (Supplementary Fig. S9) . VIA identifies 3 main terminal states corresponding to Day 11 and Day 10, 580 Palantir on the other hand identifies three Terminal states that all correspond to Days in the first half of 581 the experiment and the pseudotime is heavily influenced by the root node being very weakly connected to 582 the other stages of the process. Slingshot appears to capture the overall pseudotime but the 6 lineages 583 imposed onto the low dimensional representation are difficult to interpret and distinguish. To improve 584 Palantir performance we used 5000 waypoints but this takes almost 20 minutes to complete (excluding 585 time taken for embedding the visualization). VIA runs in ~3 minutes and produces results consistent with 586 the known ordering. The pseudotime reflects the range of Days very well, even capturing the small 587 population of Day 11 cells on the left hand side of the Day 6 cells in the embedding (Fig. 3, and 588 Supplementary Fig. S9) . 589 Single-cell biophysical phenotypes derived from imaging flow cytometry. This is the in-house dataset 590 of single-cell biophysical phenotypes of two different human breast cancer types (MDA-MB231 and 591 MCF7). Following our recent image-based biophysical phenotyping strategy 53 , 54 , we defined the 592 spatially-resolved biophysical features of a cell in a hierarchical manner based on both bright-field and 593 quantitative phase images captured by the FACED imaging flow cytometer (i.e., from the bulk features to 594 the subcellular textures). At the bulk level, we extracted the cell size, dry mass density, and cell shape. At 595 the subcellular texture level, we parameterized the global and local textural characteristics of optical 596 density and mass density at both the coarse and fine scales (e.g., local variation of mass density, its 597 higher-order statistics, phase entropy radial distribution etc.). This hierarchical phenotyping approach 53 , 54 598 allowed us to establish a single-cell biophysical profile of 38 features, which were normalized based on 599 the z-score ( See Supplementary Table S4 and Table S5 ). All these features, without any PCA, are used 600 as input to VIA. In order to weigh the features, we use a mutual information classifier to rank the features, 601 based on the integrated fluorescence intensity of the fluorescence FACED images of the cells (which 602 serve as the ground truth of the cell-cycle stages). Following normalization, the top 3 features (which 603 relate to cell size) are weighted (using a factor between 3-10). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=29jzsvekzs4z https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=msgdsdsh4sty https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=qiw3flfp2a4l https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=v1z8vlisxo3y https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=7jdtkd13bs9a https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 604 Imaging flow cytometry experiment 605 FACED imaging flow cytometer setup 606 A multimodal FACED imaging flow cytometry (IFC) platform was used to obtain the quantitative phase 607 and fluorescence images of single cells in microfluidic flow at an imaging throughput of ~70,000 608 cells/sec. The light source consisted of an Nd:YVO picosecond laser (center wavelength = 1064 nm, 609 Time-Bandwidth) and a periodically-poled lithium niobate (PPLN) crystal (Covesion) for second 610 harmonic generation of a green pulsed beam (center wavelength = 532 nm) with a repetition rate of 20 611 MHz. The beam was then directed to the FACED module, which mainly consists of a pair of 612 almost-parallel plane mirrors. This module generated a linear array of 50 beamlets (foci) which were 613 projected by an objective lens (40X, 0.6NA, MRH08430, Nikon) on the flowing cells in the microfluidic 614 channel for imaging. Each beamlet was designed to have a time delay of 1 ns with the neighboring 615 beamlet in order to minimize the fluorescence crosstalk due to the fluorescence decay. Detailed 616 configuration of the FACED module can be referred to Wu et al. 33. . The epi-fluorescence image signal 617 was collected by the same objective lens and directed through a band-pass dichroic beamsplitter (center: 618 575nm, bandwidth: 15nm). The filtered orange fluorescence signal was collected by the photomultiplier 619 tube (PMT) (rise time: 0.57 ns, Hamamatsu). On the other hand, the transmitted light through the cell was 620 collected by another objective lens (40X, 0.8NA, MRD07420, Nikon). The light was then split equally by 621 the 50:50 beamsplitter into two paths, each of which encodes different phase-gradient image contrasts of 622 the same cell (a concept similar to Scherlien photography 55. ). The two beams are combined, 623 time-interleaved, and directed to the photodetector (PD) (bandwidth: >10 GHz, Alphalas) for detection. 624 The signals obtained from both PMT and PD were then passed to a real-time high-bandwidth digitizer (20 625 GHz, 80 GS/s, Lecroy) for data recording. 626 Cell culture and preparation 627 MDA-MB231 (ATCC) and MCF7 (ATCC), which are two different breast cancer cell lines, were used for 628 the cell cycle study. The culture medium for MDA-MB231was ATCC modified RPMI 1640 (Gibco) 629 supplemented with 10% fetal bovine serum (FBS) (Gibco) and 1% antibiotic-antimycotic (Anti-Anti) 630 (Gibco), while that for MCF7 was DMEM supplemented with 10% FBS (Gibco) and 1% Anti-Anti 631 (Gibco). The cells were cultured inside an incubator under 5% CO 2 and 37°C, and subcultured twice a 632 week. 1e6 cells were pipetted out from each cell line and stained with Vybrant DyeCycle orange stain 633 (Invitrogen). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=1ezm5xa09jlh https://docs.google.com/document/d/17hKqD3B5gmaBgnhrDUYigPD_-Ue37OFupUwD5jHxSOU/edit#smartreference=k1dnflcadtcx https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 634 Data Availability 635 Data used in Figures 1-3 as well as Supplementary Figures S1-S15) is available on: 636 1. Pancreatic data: Gene Expression Omnibus (GEO) under accession code GSE132188. 637 2. Cardiac progenitor data is available from the ENA repository under the accession code 638 PRJEB23303 or from [ https://github.com/loosolab/cardiac-progenitors ]. 639 3. B-cell: STATegraData GitHub repository. [ https://github.com/STATegraData/STATegraData ] 640 4. Mass cytometry mesoderm: Cytobank 641 [ https://community.cytobank.org/cytobank/experiments/71953 ]. 642 5. Raw and processed data for scRNA-seq Human Hematopoeisis are available through the Human 643 Cell Atlas data portal at 644 https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45 . 645 6. Embryoid Body: Mendeley Data repository at https://doi.org/10.17632/v6n743h5ng.1. 646 7. Mouse Organogenesis : NCBI Gene Expression Omnibus under accession number GSE119945 647 8. FACED cell cycle: https://github.com/ShobiStassen/VIA and on FigShare 648 https://doi.org/10.6084/m9.figshare.13601405.v1 649 9. scATAC-seq Hematopoiesis: GEO: GSE96772. Processed scATAC-seq data, which include PC 650 values and TF scores per cell can be found in Data S1. of 651 https://doi.org/10.1016/j.cell.2018.03.074 652 10. Toy Data: https://github.com/ShobiStassen/VIA 653 Code Availability 654 VIA is available as a pip installable python library “pyVIA” with tutorials and sample data available on 655 https://github.com/ShobiStassen/VIA and https://pypi.org/project/pyVIA/ 656 References 657 1. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. 658 BMC Genomics 19, 477 (2018). 659 2. Setty M, Kiseliovas V, Levine J, Gayoso A, Mazutis L, Pe'er D. Characterization of cell fate 660 probabilities in single-cell data with Palantir [published correction appears in Nat Biotechnol. 661 2019 Oct;37(10):1237]. Nat Biotechnol. 2019;37(4):451-460. doi:10.1038/s41587-019-0068-4 662 3. Qiu, X., Mao, Q., Tang, Y. et al. Reversed graph embedding resolves complex single-cell 663 trajectories. Nat Methods 14, 979–982 (2017). https://doi.org/10.1038/nmeth.4402 664 4. Saelens, W., Cannoodt, R., Todorov, H. et al. A comparison of single-cell trajectory inference 665 methods. Nat Biotechnol 37, 547–554 (2019). https://doi.org/10.1038/s41587-019-0071-9 666 5. Bastidas-Ponce, A. et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap 667 for pancreatic endocrinegenesis. Development 146, (2019). 668 6. Cao, J. et al. Comprehensive single- cell transcriptional profiling of a multicellular organism. 669 Science 357,661–667 (2017). .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://github.com/loosolab/cardiac-progenitors https://github.com/STATegraData/STATegraData https://community.cytobank.org/cytobank/experiments/71953 https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119945 https://github.com/ShobiStassen/VIA https://doi.org/10.6084/m9.figshare.13601405.v1 https://doi.org/10.1016/j.cell.2018.03.074 https://github.com/ShobiStassen/VIA https://github.com/ShobiStassen/VIA https://pypi.org/project/pyVIA/ https://doi.org/10.1038/nmeth.4402 https://doi.org/10.1038/s41587-019-0071-9 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 670 7. Packer, J. S. et al. A lineage- resolved molecular atlas of C. elegans embryogenesis at single- cell 671 resolution.Science 365, eaax1971 (2019). 672 8. Cao, J., Spielmann, M., Qiu, X. et al. The single-cell transcriptional landscape of mammalian 673 organogenesis. Nature 566, 496–502 (2019). 674 9. Briggs, J. A. et al. The dynamics of gene expression in vertebrate embryogenesis at single- cell 675 resolution.Science 360, eaar5780 (2018). 676 10. Litviňuková, M., Talavera-López, C., Maatz, H. et al. Cells of the adult human heart. Nature 677 (2020). 678 11. Stassen SV, Siu DMD, Lee KCM, Ho JWK, So HKH, Tsia KK. PARC: ultrafast and accurate 679 clustering of phenotypic data of millions of single cells. Bioinformatics. 2020 May 680 1;36(9):2778-2786. doi: 10.1093/bioinformatics/btaa042. 681 12. Ulrike von Luxburg, Agnes Rad, Matthias Hein. Hitting and Commute Times in Large Random 682 Neighborhood Graphs. Journal of Machine Learning Research 15, 1751-1798 (2014) 683 13. Marius Lange, Volker Bergen, Michal Klein, Manu Setty, Bernhard Reuter, Mostafa Bakhti, 684 Heiko Lickert, Meshal Ansari, Janine Schniering, Herbert B. Schiller, Dana Pe’er, Fabian J. 685 Theis. CellRank for directed single-cell fate mapping. bioRxiv 2020.10.19.345983; doi: 686 https://doi.org/10.1101/2020.10.19.345983 687 14. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and 688 projection. J. Open Source Software. 3, 861 (2018). 689 15. Moon, K.R., van Dijk, D., Wang, Z. et al. Visualizing structure and transitions in 690 high-dimensional biological data. Nat Biotechnol 37, 1482–1492 (2019). 691 https://doi.org/10.1038/s41587-019-0336-3 692 16. Tam PP, Behringer RR. Mouse gastrulation: the formation of a mammalian body plan. Mech Dev. 693 1997;68(1-2):3-25. doi:10.1016/s0925-4773(97)00123-8 694 17. Chin AM, Hill DR, Aurora M, Spence JR. Morphogenesis and maturation of the embryonic and 695 postnatal intestine. Semin Cell Dev Biol. 2017 Jun;66:81-93. doi: 10.1016/j.semcdb.2017.01.011. 696 Epub 2017 Feb 1. 697 18. Gilbert SF. Developmental Biology. 6th edition. Sunderland (MA): Sinauer Associates; 2000. The 698 Neural Crest. Available from: https://www.ncbi.nlm.nih.gov/books/NBK10065/ 699 19. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program, Nature 700 (2019) https://doi.org/10.1038/s41586-019-1629-x 701 20. Jia G, Preussner J, Chen X, Guenther S, Yuan X, Yekelchyk M, Kuenne C, Looso M, Zhou Y, 702 Teichmann S, Braun T. Single cell RNA-seq and ATAC-seq analysis of cardiac progenitor cell 703 transition states and lineage settlement. Nat Commun. 2018 Nov 19;9(1):4877. 704 21. Tanya E. Foley, Bradley Hess, Joanne G. A. Savory, Randy Ringuette, David Lohnes.Role of Cdx 705 factors in early mesodermal fate decisions.Development 2019 146: dev170498 doi: 706 10.1242/dev.170498 Published 1 April 2019 707 22. Yao Y, Yao J, Boström KI. SOX Transcription Factors in Endothelial Differentiation and 708 Endothelial-Mesenchymal Transitions. Front Cardiovasc Med. 2019;6:30. Published 2019 Mar 709 28. doi:10.3389/fcvm.2019.00030 710 23. Potta SP, Liang H, Winkler J, Doss MX, Chen S, Wagh V, Pfannkuche K, Hescheler J, Sachinidis 711 A. Isolation and functional characterization of alpha-smooth muscle actin expressing .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://doi.org/10.1101/2020.10.19.345983 https://doi.org/10.1038/s41587-019-0336-3 https://www.ncbi.nlm.nih.gov/books/NBK10065/ https://doi.org/10.1038/s41586-019-1629-x https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 712 cardiomyocytes from embryonic stem cells. Cell Physiol Biochem. 2010;25(6):595-604. doi: 713 10.1159/000315078. Epub 2010 May 18. PMID: 20511704. 714 24. Warkman AS, Whitman SA, Miller MK, Garriock RJ, Schwach CM, Gregorio CC, Krieg PA. 715 Developmental expression and cardiac transcriptional regulation of Myh7b, a third myosin heavy 716 chain in the vertebrate heart. Cytoskeleton (Hoboken). 2012 May;69(5):324-35. doi: 717 10.1002/cm.21029. Epub 2012 Apr 30. Erratum in: Cytoskeleton (Hoboken). 2012 718 Dec;69(12):1086. PMID: 22422726; PMCID: PMC4734749. 719 25. Mahmoud AI, Kocabas F, Muralidhar SA, et al. Meis1 regulates postnatal cardiomyocyte cell 720 cycle arrest. Nature. 2013;497(7448):249-253. doi:10.1038/nature12054 721 26. Gomez-Cabrero, D., Tarazona, S., Ferreirós-Vidal, I. et al. STATegra, a comprehensive 722 multi-omics dataset of B-cell differentiation in mouse. Sci Data 6, 256 (2019). 723 https://doi.org/10.1038/s41597-019-0202-7 724 27. Jason D. Buenrostro, M. Ryan Corces, Caleb A. Lareau, Beijing Wu, Alicia N. Schep, Martin J. 725 Aryee, Ravindra Majeti, Howard Y. Chang, William J. Greenleaf, Integrated Single-Cell Analysis 726 Maps the Continuous Regulatory Landscape of Human Hematopoietic Differentiation, Cell, 173, 727 1535-1548.e16, (2018) https://doi.org/10.1016/j.cell.2018.03.074 . 728 28. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through 729 a topology preserving map of single cells. Genome Biol. 20, 59 (2019). 730 29. Gutierrez GD, Gromada J, Sussel L. Heterogeneity of the Pancreatic Beta Cell. Front Genet. 731 2017;8:22. Published 2017 Mar 6. doi:10.3389/fgene.2017.00022 732 30. Krentz NAJ, Lee MYY, Xu EE, Sproul SLJ, Maslova A, Sasaki S, Lynn FC. Single-Cell 733 Transcriptome Profiling of Mouse and hESC-Derived Pancreatic Progenitors. Stem Cell Reports. 734 2018 Dec 11;11(6):1551-1564. doi: 10.1016/j.stemcr.2018.11.008. PMID: 30540962; PMCID: 735 PMC6294286. 736 31. Chen, H., Lareau, C., Andreani, T. et al. Assessment of computational methods for the analysis of 737 single-cell ATAC-seq data. Genome Biol 20, 241 (2019). 738 https://doi.org/10.1186/s13059-019-1854-5 739 32. Ko, M.E., Williams, C.M., Fread, K.I. et al. FLOW-MAP: a graph-based, force-directed layout 740 algorithm for trajectory mapping in single-cell time course datasets. Nat Protoc 15, 398–420 741 (2020). https://doi.org/10.1038/s41596-019-0246-3 742 33. Wu J. L., Xu Y. Q., Xu J. J., Wei X. X., Chan A. C. S., Tang A. H. L., Lau A. K. S., Chung B. M. 743 F., Cheung Shum H., Lam E. Y., Wong K. K. Y., Tsia K. K., “Ultrafast laser-scanning time-stretch 744 imaging at visible wavelengths,” Light Sci. Appl. 6(1), e16196 (2016).10.1038/lsa.2016.196 745 34. Popescu G, Park Y, Lue N, Best-Popescu C, Deflores L, Dasari RR, Feld MS, Badizadegan K. 746 Optical imaging of cell mass and growth dynamics. Am J Physiol Cell Physiol. 2008 747 Aug;295(2):C538-44. doi: 10.1152/ajpcell.00121.2008. Epub 2008 Jun 18. 748 35. Kyoohyun Kim, Jochen Guck The Relative Densities of Cytoplasm and Nuclear Compartments 749 Are Robust against Strong PerturbationBiophysical Journal. Volume 119, Issue 10, 17 November 750 2020, Pages 1946-1957 751 36. Kafri R, Levy J, Ginzberg MB, Oh S, Lahav G, Kirschner MW. Dynamics extracted from fixed 752 cells reveal feedback linking cell growth to cell cycle. Nature. 2013 Feb 28;494(7438):480-3. doi: 753 10.1038/nature11897. PMID: 23446419; PMCID: PMC3730528. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://doi.org/10.1038/s41597-019-0202-7 https://doi.org/10.1016/j.cell.2018.03.074 https://doi.org/10.1186/s13059-019-1854-5 https://doi.org/10.1038/s41596-019-0246-3 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 754 37. Park SR, Namkoong S, Friesen L, Cho CS, Zhang ZZ, Chen YC, Yoon E, Kim CH, Kwak H, 755 Kang HM, Lee JH. Single-Cell Transcriptome Analysis of Colon Cancer Cell Response to 756 5-Fluorouracil-Induced DNA Damage. Cell Rep. 2020 Aug 25;32(8):108077. doi: 757 10.1016/j.celrep.2020.108077. 758 38. Zangle TA, Teitell MA. Live-cell mass profiling: an emerging approach in quantitative 759 biophysics. Nat Methods. 2014 Dec;11(12):1221-8. doi: 10.1038/nmeth.3175. PMID: 25423019; 760 PMCID: PMC4319180. 761 39. Tse HT, Gossett DR, Moon YS, Masaeli M, Sohsman M, Ying Y, Mislick K, Adams RP, Rao J, 762 Di Carlo D. Quantitative diagnosis of malignant pleural effusions by single-cell 763 mechanophenotyping. Sci Transl Med. 2013 Nov 20;5(212):212ra163. doi: 764 10.1126/scitranslmed.3006559. PMID: 24259051. 765 40. Otto, O., Rosendahl, P., Mietke, A. et al. Real-time deformability cytometry: on-the-fly cell 766 mechanical phenotyping. Nat Methods 12, 199–202 (2015). https://doi.org/10.1038/nmeth.3281 767 41. Kimmerling, R.J., Prakadan, S.M., Gupta, A.J. et al. Linking single-cell measurements of mass, 768 growth rate, and gene expression. Genome Biol 19, 207 (2018). 769 https://doi.org/10.1186/s13059-018-1576-0 770 42. Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected 771 communities. Sci Rep 9, 5233 (2019). https://doi.org/10.1038/s41598-019-41695-z 772 43. Langville, Amy N., and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search 773 Engine Rankings. Princeton University Press, 2006. 774 44. Chung F., Zhao W. (2010) PageRank and Random Walks on Graphs. In: Katona G.O.H., 775 Schrijver A., Szőnyi T., Sági G. (eds) Fete of Combinatorics and Computer Science. Bolyai 776 Society Mathematical Studies, vol 20. Springer, Berlin, Heidelberg. 777 45. van Dijk D, Sharma R, Nainys J, et al. Recovering Gene Interactions from Single-Cell Data 778 Using Data Diffusion. Cell. 2018;174(3):716-729.e27. doi:10.1016/j.cell.2018.05.061 779 46. Coifman, R. R. et al. Geometric diffusions as a tool for harmonic analysis and structure definition 780 of data: diffusion maps. Proc. Natl Acad. Sci. USA 102,7426–7431 (2005). 781 47. Haghverdi L, Büttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly 782 reconstructs lineage branching. Nat Methods. 2016;13(10):845-848. doi:10.1038/nmeth.3971 783 48. Bergen, V., Lange, M., Peidli, S. et al. Generalizing RNA velocity to transient cell states through 784 dynamical modeling. Nat Biotechnol 38, 1408–1414 (2020). 785 https://doi.org/10.1038/s41587-020-0591-3 786 49. Zheng GX, Terry JM, et al. Massively parallel digital transcriptional profiling of single cells. Nat 787 Commun. 2017 Jan 16;8:14049. doi: 10.1038/ncomms14049. 788 50. Aran D et al., (2019). “Reference-based analysis of lung single-cell sequencing reveals a 789 transitional profibrotic macrophage.” Nat. Immunol., 20, 163-172 790 51. Novershtern N. et al., Densely interconnected transcriptional circuits control cell states in human 791 hematopoiesis. Cell. 2011 Jan 21;144(2):296-309. 792 52. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, 793 Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell. 2019 794 Jun13;177(7):1888-1902.e21. doi: 10.1016/j.cell.2019.05.031. Epub 2019 Jun 6. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://doi.org/10.1038/nmeth.3281 https://doi.org/10.1186/s13059-018-1576-0 https://doi.org/10.1038/s41598-019-41695-z https://doi.org/10.1038/s41587-020-0591-3 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/ / 795 53. Siu, KCM Lee, MCK Lo, SV Stassen, M Wang, IZQ Zhang, HKH So. Deep-learning-assisted 796 biophysical imaging cytometry at massive throughput delineates cell population heterogeneity. 797 Lab on a Chip 20 (20), 3696-3708 798 54. KCM Lee, M Wang, KSE Cheah, GCF Chan, HKH So, KKY Wong, KK Tsia.. Quantitative 799 phase imaging flow cytometry for ultra‐large‐scale single‐cell biophysical phenotyping. 800 Cytometry Part A 95 (5), 510-520 801 55. Wenwei Yan Jianglai Wu Kenneth K. Y. Wong Kevin K. Tsia, A high‐throughput all‐optical 802 laser‐scanning imaging flow cytometer with biomolecular specificity and subcellular resolution, 803 J. Biophotonics (2017) https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201700178 804 56. F. Chung and S.-T. Yau, Discrete Green’s Functions. Journal of Combinatorial Theory, Series A, 805 91(1-2) (2000), pp. 191–214 806 57. Van den Berge, K., Roux de Bézieux, H., Street, K. et al. Trajectory-based differential expression 807 analysis for single-cell sequencing data. Nat Communications 808 58. Yury A. Malkov, D. Yashunin. Efficient and Robust Approximate Nearest Neighbor Search Using 809 Hierarchical Navigable Small World Graph, Computer Science, Medicine, Mathematics, IEEE 810 Transactions on Pattern Analysis and Machine Intelligence, 2020 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430705doi: bioRxiv preprint https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201700178 https://doi.org/10.1101/2021.02.10.430705 http://creativecommons.org/licenses/by-nc/4.0/