key: cord-0569563-fw3ea5qj
authors: LaRock, Timothy; Scholtes, Ingo; Eliassi-Rad, Tina
title: Sequential Motifs in Observed Walks
date: 2021-12-10
journal: nan
DOI: nan
sha: 968f578a180ddcc2ae9e7e789ce7e7814d2aff46
doc_id: 569563
cord_uid: fw3ea5qj

The structure of complex networks can be characterized by counting and analyzing network motifs. Motifs are small subgraphs that occur repeatedly in a network, such as triangles or chains. Recent work has generalized motifs to temporal and dynamic network data. However, existing techniques do not generalize to sequential or trajectory data, which represents entities moving through the nodes of a network, such as passengers moving through transportation networks. The unit of observation in these data is fundamentally different, since we analyze full observations of trajectories (e.g., a trip from airport A to airport C through airport B), rather than independent observations of edges or snapshots of graphs over time. In this work, we define sequential motifs in trajectory data, which are small, directed, and edge-weighted subgraphs corresponding to patterns in observed sequences. We draw a connection between counting and analysis of sequential motifs and Higher-Order Network (HON) models. We show that by mapping edges of a HON, specifically a $k$th-order DeBruijn graph, to sequential motifs, we can count and evaluate their importance in observed data. We test our methodology with two datasets: (1) passengers navigating an airport network and (2) people navigating the Wikipedia article network. We find that the most prevalent and important sequential motifs correspond to intuitive patterns of traversal in the real systems, and show empirically that the heterogeneity of edge weights in an observed higher-order DeBruijn graph has implications for the distributions of sequential motifs we expect to see across our null models.

Motifs in complex networks are small subgraphs that can be counted and analyzed from observed data. Methods for using motifs to characterize static complex networks have been developed over the last 20 years [43, 4, 56, 71] , and equivalent or similar concepts were investigated even earlier in the social sciences (see e.g., triadic analysis as reviewed in [72] ). More recent research has focused on identifying motif structures in temporal or dynamic network data, which comes in the form of either timestamped independent edge observations or snapshots of a process over time [32, 31, 33, 75, 45, 69, 40] . In this work, we focus our attention on data that represents sequences, trajectories, or walks through networks. 1 Observations in this data are walks of varying lengths through a network. Each walk is observed independently and generally without temporal information about the appearance of individual edges within the walk. These walks may represent trajectories through physical networks (for example passengers or freight moving through a railway network) or through non-physical networks (for example sequences of proteins that make up the proteome of an organism, or sequences of words in a language such as n-grams [50] ).

The goal of this work is to count and analyze sequential motifs in observed walks. Intuitively, a sequential motif is a small directed subgraph that is an abstract representation of a walk through a network. In the simplest case, we can think of an edge from node A to node B, or the self-loop from node A to itself, as sequential motifs of length 1 edge. These structures are the objects of study in traditional network analysis, thus in practice we exclude them from our analyses; and in general, we will exclude self-loops. At length 2 edges, there are two motifs: the backtracking or bidirectional motif A-B-A, and the chain motif A-B-C. These motifs are both relevant for understanding the interaction between a network structure and the processes that unfold on top of it. For example, in an airport network representing passenger movements, the sequential motif A-B-A represents a round-trip flight originating and terminating at the same airport.

In the left panel of Fig. 1 , we show all simple (e.g., no multi-edges) directed motifs on 3-nodes. The left column shows motifs that are only interpretable in static network data, while the right column shows motifs that can be interpreted in either static or sequential contexts. A motif has a sequential interpretation if, starting at any node, there is a directed Eulerian path-a path that visits each edge exactly once-through the simple (i.e., without parallel edges) version of the motif. Motifs without Eulerian paths cannot possibly correspond to observed sequences of edges, since the lack of such a path indicates that there is at least 1 edge that cannot be traversed in sequence. 2 In the right panel of Figure 1 , we show the observed frequency of the directed triangle sequential motif in data representing passenger flight itineraries ( [68] , see Section 4 for details). For air travel, the cycle motif corresponds to a trip with a direct flight in one direction and a layover in the other, or a trip that visits two different locations before returning to the origin, such as a multi-city business trip. The key takeaway from this comparison is that the directed triangle appears to be overrepresented compared to random expectation of sequential motifs, since it is observed considerably more often in the data than in the randomized DeBruijn graphs. However, based on static motifs computed from a directed and unweighted graph constructed from the flights (using the G-tries algorithm [54] ), the motif appears to be underrepresented, as it occurs slightly less frequently than expected at random. This sort of discrepancy is the motivation for our work.

Our methodology is strongly connected to recent research on Higher-Order Network Models (HONs). Here, higher-order refers to sequential and temporal correlations in how a network is traversed that violate the typical Markovian assumption implicit in the study of traditional, first-order networks [60, 59, 74, 58, 34, 36] . 3 The violated assumption is that the t + 1st step taken by a walker moving randomly on the edges of a network is dependent only on the position of the walker at time t, e.g., that the walker has no memory of where it was at times i < t. HONs for sequential data directly incorporate memory into a graphical representation, moving the Markovian assumption from edges between two nodes to paths through k nodes. These graphical representations can be then interpreted and analyzed using modified techniques from graph theory and network science.

In this paper, we count sequential motifs in observed data and evaluate their importance by mapping the edges of a particular HON, known as the DeBruijn graph, to their corresponding motif structures. In a DeBruijn graph of order k, each node represents a walk of length k − 1 (e.g., A → B for k = 2); and each directed, weighted edge represents the frequency of a length k walk through a traditional (or first-order) graph (e.g., A → B, B → C). Our work extends recent research showing the utility of DeBruijn graphs as representations for modeling correlations in pathway data through networks to study motif structures [58, 36, 26] . 4 Definitions We compute sequential motifs from a dataset of n walks S = {(s 1 , w 1 ), (s 2 , w 2 ), · · · (s n , w n )}. Each s i = u 1 , u 2 , · · · , u |si| represents an |s i |length walk through a directed graph G = (V, E) with N = |V | nodes and M = |E| edges. For each walk s i , the optional value w i represents the frequency of the observed walk in the data. 5 If frequencies are not provided, S is considered a multiset where duplicate appearances of the same walk are aggregated and assigned a frequency equal to their multiplicity in S. We denote as M k the set of k-edge sequential motifs, where each motif m = σ 1 , σ 2 , · · · , σ k+1 ∈ M k represents a walk of length k edges through an arbitrary alphabet Σ = {σ 1 , σ 2 , . . . σ }, where ≤ k + 1 is the maximum number of unique nodes in a 

Average Sequential Observed Average Static Figure 1 : Left: Enumeration of 3-node simple directed motifs. Motifs in the right column have interpretations in both sequential data and static graphs, while those in the left column have no sequential interpretation because the edges involved cannot appear in sequence (equivalently there is no Eulerian path through the motif starting from any node). Right: Comparison of the frequency of the directed cycle motif on 3-nodes (i.e., a directed triangle) in data representing passenger trajectories through the airport-to-airport network. The first (orange) bar shows the observed frequency using the proposed sequential motif methodology. The second bar shows the same count but averaged over many randomizations of the data. The third (blue) bar shows the observed count in a static, directed, and unweighted graph, computed using the G-tries algorithm [54] . The last bar again shows the average frequency of the motif in randomized networks. Static motif analysis suggests cycles are under-represented, since the motif is more prevalent in randomized networks. In contrast, sequential motif analysis shows that the directed triangle is over-represented in the motifs compared with random realizations of the DeBruijn graph ensemble.

k-edge walk. The alphabet may consist of any arbitrary set of symbols. The only requirement is that the number of unique symbols is at least as great as the number of unique nodes in any observed walk. Therefore, the worst-case alphabet size is the number of nodes N . 6 Since we assume our input data are walks through a network, sequential motifs m ∈ M k may have parallel edges that represent multiple traversals of the same edge in m. 7 Finally, we define a 6 In principle we can compute sequential motifs of any length k, provided that sequences of length at least k edges exist in the input data. However, larger values of k quickly become impractical to analyze. Therefore, we we will limit our results to motifs up to 3 edges, with the exception of Figure 1 , where we show motifs up to 6-edges. 7 We could equivalently say the edges of sequential motifs are weighted by the number of times the edge is used. However, to avoid confusion with other notions of "weight" used in kth-order DeBruijn graph as G k = (V k , E k ) where V k is the set of N k kth-order nodes #" v , each representing a path of length k − 1 through G, and E k is the set of M k kth-order edges ( #" v , #" w), each representing a path of length k. The higher-order nodes in an edge (

Contributions Our main contribution is to make explicit the connection between sequential motifs and DeBruijn graph models of observed walks. We use this connection to develop methods for evaluating the importance of a given sequential motif to a dataset or process (see Section 3.2). Since the edges of DeBruijn graphs correspond to walks of length k, we can interpret randomizalater analyses, we describe motifs as multigraphs for clarity.

tion of the DeBruijn graph structure as randomization of walks of length k (i.e., motif realizations, and vice versa). Thus our methods attempt to answer the question: is this motif realized substantially more or less often in the observed data than expected at random?

The key insight driving our approach is that we can count sequential motifs involving k edges by constructing a kth-order DeBruijn graph G k and for every edge e = ( #" v , #" w) ∈ E k that defines a k-edge walk p e = #" v + #" w[k − 1] mapping the first-order nodes u i ∈ p e , i ∈ 1 . . . k + 1 one-to-one with the alphabet Σ. Projecting every edge in this way and counting the frequency of each projected motif pattern allows us to uncover patterns of traversal in observed walks. Fig. 2 shows a toy example illustrating this process. We then adopt and extend appropriate null models for DeBruijn graphs, which we use to evaluate the statistical importance of the motifs based on different randomizations.

Our work makes the following contributions to the literature on understanding patterns in trajectories through complex networks:

• We define sequential motifs, which are small and directed subgraphs, sometimes with parallel edges, representing abstractions of walks through a directed graph.

• We show how counting sequential motifs can be achieved by mapping the edges of a DeBruijn graph to motifs; and provide an algorithm called CAC for simultaneously constructing a DeBruijn graph and counting sequential motifs in time that scales linearly with the number of observed walks.

• We provide methods for evaluating the importance of motifs based on sampling edges uniformly at random from complete DeBruijn graphs, which represent all possible walks of length k edges.

• We study sequential motifs in datasets of passenger trips through the domestic airport network in the United States, as well as clickstreams representing players navigating from random source pages to random targets in Wikipedia. We find correspondence between how we expect people to navigate the networks and the motifs that appear more or less often than expected at random based on our null models.

The remainder of the paper is organized as follows. In the next section, we review some work related to our research. In Section 3, we describe our methodology for computing sequential motifs and evaluating their importance. Then, in Section 4, we present empirical experiments and discuss the circumstances in which sequential motifs are the right tool for analyzing a dataset. Finally, we conclude the paper in Section 5 with a discussion of future directions for this research.

Motifs in network data have been studied in various forms across disciplines for decades [66, 53, 30] . The formulation presented in this work was inspired by work published mostly in the last twenty years, starting with the identification of network motifs as building blocks of complex networks [43] . This work measured and compared motifs across contexts, including gene regulatory networks (building on [63] ), neuronal networks, food webs, electronic circuits, and the World Wide Web, using network randomization to define a notion of significance for each motif. Much of the early work on motifs in biological systems has been reviewed in [3] and more recently in [46] . The null models that underly motif comparisons are typically samples from ensembles of random graphs that preserve properties relevant to the particular networks under study [27] . Some work has also been done investigating heterogeneous network motifs, where heterogeneous refers to networks where nodes do not all have the same type [55] .

The closely related concept of closed frequent subgraph mining was developed in parallel in the data mining community [76, 77] . In this formulation, the goal is to quickly identify interesting and maximal subgraph patterns in very large graph datasets, including over sets of graphs. These patterns are typically larger and more complex than what is studied in the literature on network motifs.

Another related literature addresses higher-order structure in graphs [6] , for example by studying simplicial complices [79, 47, 8, 28] and hypergraphs [7, 15] , including work on hypergraph motifs [37] . Both of these areas are focused on simultaneous interaction among more than two nodes, sometimes referred to as polyadic interactions [15] . In contrast, our focus is on data that is characterized by dyadic interactions that happen in sequence.

Also related is the study of subgraph frequencies, most notably in [70] . This work defines a coordinate system for collections of disconnected graphs based on the frequency of small connected subgraphs, defined equivalently to motifs.

The work discussed so far largely addresses networks viewed as static, meaning that the nodes and edges remain the same over time. Yet another field of study has emerged to understand motifs in temporal network data [32, 31, 33, 75, 45, 69, 40] . Temporal network data comes in the form of timestamped edge observations e = (u, v, t), where u and v are nodes and t is a timestamp. Temporal, dynamic, or streaming network data is common, especially in (online) communication [1, 32, 31, 33, 75, 69] and also in biological systems [57] . The difference between temporal and sequential data is the unit of observation: temporal network data usually consists of independent and timestamped edge observations, while sequential data consists of complete observations of sequences, usually without timestamps. In this data we begin from whole observations of sequences, but we do not necessarily know the order of the observations of the sequences themselves or the subobservations (e.g., individual edges) across different sequences, since we do not have timestamps. In contrast, in temporal network data we only have partial observations (individual edges) and need to infer or define a time-scale to find "whole" observations of paths, but we know for certain the order of the individual events given their timestamps. Although we do not address it in the current paper, this distinction does not necessarily impede us from analyzing temporal network data using sequential motifs. We can apply existing techniques to transform edge data into pathway datasets [48] , then apply our sequential motif analysis to the resulting pathway dataset. This moves the problem of determining the appropriate timescale at which to study the data to a pre-processing step [65] .

Our work is also related to research on community detection using the line graph transformation, which is closely related to the DeBruijn graph [21] . An unweighted 2nd-order DeBruijn graph is the same as the line graph transformation of the first-order representation. In [21] , the concept of modularity that is the basis of many community detection algorithms is generalized to line graphs to discover link partitions or communities in the network.

Our work is also related to the concept of k-motifs introduced by Sinatra et al. [64] . They identified subsequences of strings of symbols that they described as "fundamental units" of the process generating the strings, analogous to identifying words in English sentences where the punctuation has been removed. Although our methods seem similar at first glance, there is a fundamental difference in our respective deployments of the word "motif." In our case, a motif is defined arbitrarily, and trajectories through real networks can realize the abstract motif. Or, put another way, we are interested in motifs as types and walks as tokens, each of which is a realization of a type. In Sinatra et al [64] , the most significant realizations-strings representing observed sequences, i.e., tokens-are taken to be motifs, and there is no study of motifs as we define them in this work. In their study, a k-motif is defined as a significantly re-occuring sub-sequence in a sequential dataset. They defined and analyzed networks of k-motifs by first determining which subsequences were observed at higher than expected frequencies based on statistical significance of the observed frequencies in kth order Markov models. They then defined significant k-motifs as nodes in a motif graph, and connected pairs of motifs if they co-occurred in at least one sequence of the input data, where two motifs (i, j) co-occur if i appears immediately before j. Finally, they filtered out edges between k-motifs whose co-occurrences were not significant with respect to random expectation. They go on to show that community detection algorithms on k-motif networks can uncover patterns in diverse data, from clusters of proteins in the human proteome, to information cascades on online social media platforms.

Recent work has attempted to understand the role of motifs in dynamics on networks [62, 57, 61] . Our work is related to that of Schwarze and Porter [61] , who recently defined process motifs on graphs. A process motif is defined by the walks that are possible on a specific substructure of a network. Schwarze and Porter [61] show how the connection between structural and process motifs can be used to determine the role of different substructures in shaping a dynamical process on a network. Our work focuses on counting structural motifs from sequential data, but we note that because sequential data is the result of a discrete dynamical process (e.g., a random walk), our work blends structure and process, suggesting a close connection with this line of research.

We also build on the literature on DeBruijn graphs, which have been studied across fields for decades. In discrete mathematics, for example, the appearance and disappearance of cycles in DeBruijn graphs was a topic of interest in the 1970s [44, 39] . In biology and bioinformatics, variations of DeBruijn graphs are widely used to assemble DNA sequencing data into full genomes [49, 80, 29, 23] .

In computer networks, optimal or near-optimal routing schemes have been found for DeBruijn graph representations [9, 41, 14, 22] , and DeBruijn graphs have been used to design and analyze feedback shift registers in memory systems [38, 12] . Most recently, DeBruijn graphs have been introduced as an appropriate representation for sequential and pathway data on networks [58, 36, 26] , which is the line of research we are contributing to most directly.

Finally, our research is related to the literature on network sampling. In general, network sampling is the process of gathering random samples of the network, for example random nodes or random edges (for a review, see [2] ). These samples can then be used to approximate important network properties like the degree distribution [25, 78, 18, 51, 52, 16, 17] . Our work is related to the well-developed literature on sampling nodes uniformly at random using random walks [5, 13] . However, in our work we sample entire k-edge walks uniformly at random from all walks on k-edges, which is a fundamentally different problem. The work that is most closely related is about sampling induced subgraphs or graphlet realizations on a specific number of vertices [42, 10] . However, our work differs from these in that the subgraphs we are sampling are a subset of all possible subgraphs, since we only sample k-edge subgraphs that are valid walks through the graph.

Our contribution is to link DeBruijn graphs as models of higher-order correlations in pathway data to counting and analyzing sequential motifs from trajectory data. In Section 3, we describe our methodology for computing sequential motifs and evaluating their importance using DeBruijn graphs.

In this section we first present our method for simultaneously constructing a kth order DeBruijn graph and counting k-edge sequential motifs, which we call ConstructAndCount, or CAC. Then we describe a method for evaluating the importance of motifs using randomizations of complete DeBruijn graphs.

We start from a dataset of n trajectories S = {(s 1 , w 1 ), (s 2 , w 2 ), · · · (s n , w n )} through the directed graph G = (V, E) defined by the edges in S. We are also given an alphabet Σ = (σ 1 , σ 2 , . . . , σ ) of arbitrary symbols (see problem statement in Section 1 for more details). Finally, we assume we have chosen an integer k that will be the length in edges of sequential motifs we are interested in computing.

We provide the pseudocode for CAC in Algorithm 1. CAC takes as input a dataset of sequences S, motif alphabet Σ, and order k, and proceeds as follows. First, CAC initializes the set of kth-order nodes V k and edges E k to be empty. These will define the DeBruijn graph. CAC also initializes all entries of the lookup table C[m], indexed by motifs m ∈ M k , to 0. This lookup table will contain the frequency counts of each motif. CAC then enters the outer for loop (line 3) that iterates over all pairs of sequences and their associated weights: (s, w) ∈ S. The default weight is 1. In the inner for loop (line 4), CAC slides a window of length k across s, where each index i corresponds to the first position of a k-edge subsequence seq starting at s[i], computed in line 5. CAC then determines which motif the sequence seq corresponds to by projecting it into the alphabet Σ using the ProjectMotif(seq, Σ) procedure, which maps each node u ∈ s to a distinct element σ j ∈ Σ in order of appearance, as illustrated in Fig. 2 . After a node has been mapped to a symbol, it is replaced everywhere in the projected path with that symbol. Once every node in seq has been mapped, the procedure returns the sequential motif m ∈ M k corresponding to seq. Now CAC increments the count of m in C by the frequency of s, constructs the edge e from seq, and adds the node, edge and frequency to V k , E k , and W k . Note that paths observed in S with length shorter than k can not be included by definition, since |s| − k will be negative and thus the inner for loop will not begin.

Algorithm 1 ConstructAndCount(S, k, Σ): Construct the kth order weighted DeBruijn Graph & count all length k motifs from sequence dataset S using motif alphabet Σ.

for i from 0, · · · , |s| − k do 5:

The runtime of CAC depends on three variables: (i) n, the number of sequences in S; (ii) p( ), the distribution of lengths of sequences in S; and (iii) the motif length k. Assuming all paths are of the maximum length max , the worst case time is O(n · ( max − k)). We note that in practice distributions of path lengths tend to have tails in large values, meaning that in real data the average path length avg is likely to be much lower than the maximum (see Figure 4 ). Lastly, we note that the two loops in the CAC algorithm can also be parallelized over the edges, since the operations for each edge (lines 4 through 10) are independent of any other edge (see Appendix A).

To evaluate the importance of a motif, we compare the count of that motif in the observed data with the average count of the same motif in many random samples from a null model. This is a flexible framework for determining importance, since the null model we choose determines which properties of the input data are randomized and to what extent, and therefore null models can be designed to test different hypotheses about the processes generating the observed data (similar to [43, 4] ). Here we propose three null models. First, we describe a null model that relies on sampling from a complete kth-order DeBruijn graph G k c , where all possible higher-order edges based on the directed first-order topology G = (V, E) can be sampled with non-zero probability. Second, we propose a null model that samples uniformly from all observed k-edge walks by sampling from the edges of the observed DeBruuijn graph G k . Finally, we adopt the null model proposed in [36] , which samples from the edges of the observed kth-order DeBruijn graph based on a k − 1st-order null model.

Sampling from G k c is advantageous because it is conceptually simple to construct a complete kth-order DeBruijn graph G k c from a directed graph G that includes every possible kth order node and edge. The procedure works as follows: first, take all of the edges in G and turn them into nodes in G 2 c ; then, add edges between higher-order nodes (s, u) and (v, t) if u = v; and finally, repeat this procedure k times. 8 Sampling edges from this graph is equivalent to sampling a k-edge walk uniformly at random from all possible k-edge walks through G. 9 Importantly, this procedure is not equivalent to computing random walks on the network topology. In Figure 3 we show a concrete case where sampling kedge random walks would not results in a uniform distribution over all possible walks.

In practice, computing complete DeBruijn graphs G k c for even relatively small values of k is computationally expensive, since the number of nodes and edges grows exponentially. In Appendix C, we give two strategies for sampling approximately uniformly from the edges of G k c .

Sampling from all possible k-edge walks using G k c is maximally random while maintaining the 1st-order network structure. This is both an advantage, because it gives us a sense for what we could expect if the observed motifs were based on fully random data, but also a disadvantage, since many of the walks sampled were not observed in the real data. As a first step towards a more data-informed null model, we also sample uniformly at random from the edges of the observed DeBruijn graph G k . This is a uniform random sample over the observed k-edge walks. If all of the edge weights in G k are equal to 1, then our sampled motif distribution should be approximately equal to the observed motif distribution. However, since this is not the case in our datastets, sampled counts for motifs corresponding to edges in G k that have high weights are likely to be substantially smaller, and vice-versa.

The previous two null model sought to sample uniformly from a set of k-edge walks. We also evaluate motif importance by sampling from ensemble derived from previous work on anomaly detection in DeBruijn graphs, which corresponds to sampling k-edge walks from a k − 1st-order model of the observed data [36] . HYPA represents a soft configuration model for the edge weights of a kth order observed DeBruijn graph, called the Generalized Hypergeometric Ensemble of Random Graphs [11] . Samples from this ensemble preserve the average in and out weight for each node. We note that each of the null models we have proposed here can be customized to incorporate known correlations. For example, we may observe spatial correlations that indicate a high likelihood of certain walks occurring, and so rather than sampling uniformly, we may weight the walks based on a measure that captures this correlation. In fact, the Generalized Hypergeometric Ensemble that underlies HYPA is designed to incorporate this information using the propensity matrix [11] . We leave investigation of these bespoke null models for future work.

In this section, we present experiments on counting and analyzing sequential motifs in two datasets. We first describe the datasets-a transportation network and an online hyperlink navigation network-then we compare sequential motifs within and across the datasets. Appendix D contains more experimental results, and an implementation of our methodology is available online [35] .

Our first dataset consists of a large sample of flight itineraries representing trajectories of passengers through the domestic airport network in the United States in Quarter 1 of 2020 [68] . Each itinerary is a walk through the network corresponding to a starting airport, i ≥ 0 layover airports, and a destination. We are also given frequencies for each walk corresponding to the number of people who bought a ticket with that itinerary (e.g., a number w of people bought tickets with the itinerary JFK to Chicago O'Hare to Seattle, Washington).

The second dataset consists of walks through the Wikipedia network during successful runs of the Wikispeedia game, where a player was given a random target page in Wikipedia, then placed on a random start page and asked to navigate from the start to the target using only internal Wikipedia links [73] . These walks are not associated with frequencies, though we note that the DeBruijn graph edges are still weighted, since the same paths are repeated in different instances of the game. We present statistics of each of the datasets in Table 1 , and show the distribution of sequence lengths in each dataset in Figure 4 . 

In Figure 5 , we show the distribution of sequential motifs for orders 2 and 3 in both of our datasets. Each row of the figure corresponds to a single motif. The gray bar shows the observed frequency of each motif. The remaining bars show the average count of the motif over 10 randomized datasets from each of the following ensembles: uniformly from G k c ; uniformly from G k ; the HYPA ensemble; unweighted first-order random walks; and weighted first-order random walks. 10 We will consider a motif overrepresented with respect to a null model if its observed frequency is greater than its frequency in samples from the null model, and underrepresented if its observed frequency is less than its frequency in the null model. If a motif has similar frequency in both the observed data and null model samples, we say its frequency is within expectation.

We begin with some general observations about the motif distributions. First, we note that all sampling methods are biased towards chain motifs (second and last rows in Figure 5 ). This is not surprising since the number of possible chain motifs on k edges grows with the in-and out-degrees of the hubs, which are present in both networks.

A second observation is that sampling random walks is generally a good proxy for sampling uniformly from all possible walks. There are only a small number of motifs measured from our datasets where the unweighted random walks result in qualitatively different results (Flights, 3rd row and Wikispeedia, 6th row in Figure 5) .

We now compare the distributions of each motif in turn, beginning from the first 2-edge motif at the top of Figure 5 , which corresponds to the sequence A-B-A. In the flights data, paths that correspond to this motif represent round Each row corresponds to a sequential motif (shown on the left). The gray panel shows the total number of motifs counted in the data. Note that the y-axes are measured in thousands. The orange bars show the average counts when sampling uniformly from the G k c and G k ensembles; the dark green bar shows the average counts when sampling from HYPA and the blue bars show counts from unweighted (RW) and weighted (RW-W) random walks. A sequential motif is overrepresented with respect to a null model if its observed frequency is greater than its frequency in the null model, and underrepresented if its observed frequency is less than its frequency in the null model. In the Flights data, directed triangles are over-represented according to all ensembles. However, the results are mixed for the same motif in the Wikispeedia data, where it is considerably less prevalent. All null models except for the HYPA ensemble show the directed triangle as either within expectation or slightly underrepresented, while the HYPA ensemble suggests that the triangle is very underrepresented, appearing about twice as much in random samples. Since the directed triangle is a 3-edge sequential motif, the corresponding HYPA ensemble preserves the frequency of 2nd-order patterns in the Wikispeedia data. The fact that more triangles would be expected under this null model means that players opt not to close triangles as often as we would expect at random. trips starting and ending at the same airport, whereas in the Wikispeedia game these correspond to a sort of "guess and check" behavior, where a player starts at page A, chooses a promising page B, then finds that the page does not contain any links that are better than those on the previous page, and so clicks a link back to page A. The motif is prevalent in both datasets, and is also overrepresented based on almost all ensembles. In the flights data, overrepresentation indicates that round trips occur substantially more often than expected in the real data than in any of the null models. This is not surprising, since people presumably prefer to take fewer flights, thus a direct flight in both directions is ideal. However, it is worth noting that the results for 2-edge motifs in the flights data are essentially the same across all of the null models. To understand this, we first note that at k = 2, all of the null models, including HYPA, are sampling some variation of a 1st-order random walk. In fact, at k = 2 the HYPA ensemble and the weighted random walk (RW-W) are sampling from the same distribution-random walks on the weighted 1st-order topology-but with different strategies. In the flights data, regardless of which null model we choose, the result is the same: backtracks are very overrepresented, and chains are very underrepresented. One implication of this pattern is that the weights on the 2nd-order edges that correspond to backtracks must be non-uniform in the flights data, otherwise we would expect samples from G k and HYPA to contain more backtracks. This makes intuitive sense because some round-trip flights-such as flights from major cities to Washington, D.C.-are likely to have extreme edge weights relative to less traveled trips. However, this does not hold in the Wikispeedia data, where sampling edges uniformly from the observed G k results in an average count of backtrack motifs that is close to the observed count, and sampling from HYPA results in considerably more backtracks than the other null models. This indicates that the weight of the 2nd-order edges corresponding to the backtrack motif in the Wikispeedia data are relatively small on average, since sampling uniformly from those edges results in approximately the same number of backtracks. Since the Wikispeedia network is larger and more sparse than the flights network, it is not surprising that the distribution of 2nd-order backtrack weights is more uniform, as we show in the bottom left panel of Figure 6 .

The second 2-edge motif is the chain on 3 nodes: A-B-C. This motif is also prevalent in both datasets. In the flights data, the 2-edge chain appears underrepresented compared to all of the ensembles. As we observed previously, all of the random sampling procedures are biased towards chains, since there is a disproprtionate number of possible chains due to the presence of hubs. The relatively low prevalence in the flights data is the flip side of the observation we made above: since people prefer to take direct flights in both directions, chains occur less often than expected at random. In the Wikispeedia data, the motif appears at approximately the same rates as all of the ensembles predict. In contrast to the flights, where the goal is often to return to the origin, in the Wikispeedia network the goal is to move away from the source and towards the target, so while backtracks are overrepresented, 2-edge chains are also an important part of the searching process.

The third motif from the top is the first 3-edge motif, corresponding to a round trip that doubles back again (e.g., the sequence A-B-A-B). This motif is rarely observed in the flights dataset, and is observed at approximately the same rates as in samples from the G k c , G k , and weighted random walk ensembles. However, the motif appears under-represented based on the HYPA and unweighted random walk ensembles. This is likely related to the prevalence of round trips in the observed data, which make the second backtracking edge much more likely than expected at random in a 2nd-order model or weighted random walk. In the Wikispeedia data the motif appears to be within expectation for the observed G k and HYPA ensembles, but is over-represented with respect to the others. As we discussed previously, the G k and HYPA ensembles are constrained to walks that were observed in the real data, and so even the relatively small prevalence of this motif creates a substantial bias in samples from those ensembles.

We will analyze the next two motifs (rows 4 and 5 from the top in Figure 5 ) together, since they are complementary to one another. The first represents the sequence A-B-A-C, while the second represents A-B-C-B. Taken together these motifs have an intuitive interpretation in the flights data: they represent a round trip with a layover at the same airport in each direction. In the first direction, we move from the source airport a 1 to a layover airport a 2 , then on to our destination to a 3 . On the return trip, we go from airport a 3 back to the layover airport a 2 , before finally returning again to a 1 . The 4th motif in Figure 5 corresponds to the first direction plus the edge (a 3 , a 2 ) going back to the layover airport, while the 5th motif corresponds to the same edge plus the rest of the trip. These motifs are overrepresented with respect to the G k c and random walk ensembles in the flights data, which is intuitive since, as we noted in our discussion of the round trip motif, people tend to take round trips when they fly, plus the fact that airlines send flights between the same airports on a regular schedule, making it likely that the layover will be the same in both directions. Again we see that the ensembles constrained to the observed walks result in counts similar to the observed frequencies. However, in this case the uniform samples from G k still result in fewer observations of these motifs, while the counts based on HYPA are approximately the same as the observed data. The interpretation of these motifs in the Wikispeedia data is analogous if less intuitive, since they represent mediated round trips from one page, through two intermediary pages, then back where it started. This is indicative of a player giving up on a path and returning to an earlier page to try again, perhaps remembering that there was another promising link a few pages back. Accordingly, these motifs are also overrepresnted in the Wikispeedia sequences with respect to the G k c and random walk ensembles, and just like in the flights data, the G k and HYPA ensembles produce counts much closer to the observations.

Next we analyze the directed triangle (row 6), compared earlier to its static counterpart in Figure 1 . In the flights data, this motif represents a trip with a direct flight in one direction and a layover in another, or a multi-city trip starting and ending at the same airport. In Wikispeedia, it corresponds to essentially the same discussion as in the last paragraph, but where the player did not need to backtrack because they found a link to the page to which they wanted to return. This motif is overrepresented with respect to all ensembles in the flights data, although the average count based on sampling uniformly from G k is greater than the rest, suggesting that observed kth-order edges that correspond to triangles are a larger share of the edges in G k compared to G k c . In the Wikispeedia data, where this motif occurs relatively rarely, all but the HYPA ensemble averages suggest the motif appears approximately within expectation, while the HYPA ensemble suggests the motif is underrepresented, indicating that the directed triangle is more likely to occur in k-edge walks through a 2nd-order model of the data.

Finally, we come to the 3-edge chain on 4 nodes (row 7). This is a very common motif in both datasets, and it appears to be underrepresnted or within expectation according to all of the ensembles.

Taken together with Figure 1 , this analysis shows how sequential motifs can be used to understand a dataset in a way that is consistent with the inherent sequential nature of the process generating the data.

We introduced a method, CAC, for counting and analyzing sequential motifs on k edges in observed walk data using DeBruijn graph ensembles and random walks. A major advantage of sequential motifs is that their interpretation is straightforward: we count the prevalence of a particular structure in the data, not after aggregating it into a first-order network, thus throwing away sequential information. We presented three sampling procedures to assist in evaluating the importance of sequential motifs: (1) a uniform k-edge walk sampling procedure from a complete DeBruijn graph G k c , (2) a uniform sampling procedure over observed walks based on the edges of the observed DeBruijn graph G k , and (3) a sampling procedure based on the Generalized Hypergeometric Ensemble of DeBruijn Graphs defined in HYPA [36] . We showed that sequential motifs are substantially different than static network motifs, and analyzed two datasets, highlighting the subtleties between the various null models and comparing to simply sampling random walks on the 1st-order graph.

There are numerous future directions for this research. For example, we can design bespoke DeBruijn graph null models for specific datasets based on external correlations such as geographical or conceptual distance between nodes. The convergence of our sampling methods to the true motif distribution, as well as the related question of when a random walk ensemble is good enough, also remain open. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes not withstanding any copyright notation hereon.

As mentioned in Section 3, the for loops in CAC can be parallelized. We note that constructing the DeBruijn graph from the list of sequences S is the same as counting n-grams, a problem that has been extensively studied and parallelized [50] . In particular, we can do the counting within a Map-Reduce framework [19] , where the mapping step splits each walk w ∈ S into length k segments (edges in the kth order DeBruijn graph), while simultaneously projecting each segment into the alphabet Σ. Then the Reduce step aggregates the count of each segment (i.e., edges of the DeBruijn Graph) as well as each projected motif. Algorithm 2 shows the pseudocode for this procedure. Reduce(kseg, 1)

Reduce(m, 1) Increment the count of x, which is either a sequence or a motif, by f 9: function Reduce(x, f ) 10: IncreaseCount(x, f )

In this section we briefly examine the relationship between sampling uniformly from G k c and recovering network structure, both 1st and higher-order. In the first column of Figure 7 , we show for each dataset the proportion of observed 1st-order nodes and edges recovered after sampling increasing number of walks from G k c . In both the flights (top) and Wikispeedia (bottom) datasets, all 1storder nodes appear after a relatively small number of walks, while the whole edgeset is not sampled until about 75k walks have been sampled in the flights, and 500k walks sampled in Wikispeedia.

In the middle column of Figure 7 , we show the proportion of observed and unobserved 3rd-order nodes observed as the number of walks sampled from G k c increases. Note that we analyze the 3rd-order rather than the 2nd-order network because unobserved nodes in the 2nd-order network would be unobserved edges in the 1st-order network, which will never appear by definition. After sampling 2 million walks uniformly at random, we discover all of the observed 3rd-order nodes in the flights dataset, but only about 60% of the unobserved nodes. In the Wikispeedia data, after 2 million walks we have only sampled about 70% of the observed nodes and 50% of unobserved nodes.

Finally, in the last column of Figure 7 , we show the same relationship as above but for 3rd-order edges. After 2 million samples, the proportion of sampled edges, both observed and unobserved, is less than 1% for both datasets. The relationship between walks sampled and edges observed is linear, which is expected since walks are being sampled uniformly, and the number of walks sampled is much smaller than the total number of possible walks. This means that on average each sample adds 1 to either the observed or unobserved count. 

We reduce the computational burden of sampling from G k c by using the following procedure to sample edges uniformly without generating the entire DeBruijn graph. First, we sample a node u from the set of all nodes in V that are the source of at least one k-edge walk. Formally, we compute the number of walks originating from a node u by taking the row summation of the kth power of the adjacency matrix of the directed graph G, which we will denote as w u = v A k uv . We then sample from the set of nodes where w u > 0 with probability proportional to w u , and generate all k-edge walks starting from u by constructing all walks of length one-e.g., all of the directed edges (u, v) pointing away from u-and iteratively expanding the set of walks by one edgee.g., appending each neighbor of v to the end of the previous walk-until the walks become longer than k-edges. Finally, we sample one of these walks uniformly at random.

We argue that the above sampling procedure is equivalent to sampling from all edges in G k c . We note that the probability of sampling a given first-order walk from G k c is simply the probability of choosing any of its m edges: p ei = 1 m . In contrast, our procedure requires two steps. First, we sample a starting node u with probability wu v∈V wv . Second, we sample one of the w u k-edge walks starting from u uniformly. The probability of sampling an arbitrary walk w starting from any node u is the product of these two terms:

Thus the probability p w of sampling a k-edge walk w using our two-step procedure is equal to the probability of sampling the same walk as an edge e i from the complete DeBruijn graph G k c . While the above procedure is more computationally tractable than constructing G k c at the outset, if the number of walks we want to compute is substantially larger than N , we will implicitly construct G k c anyway, since we will eventually compute all k-edge walks starting from all nodes u. To further reduce the computational burden, we use random walk simulations to sample. Instead of constructing all w u k-edge walks starting from a node u, we simulate some number x < w u random walks, rejecting any walks with fewer than k edges (e.g., if a walker arrives at a node with no out-degree). Then we sample a walk from this subset of random walks. If a node is sampled again, we compute another set of x random walks, and union them with the previously computed set. Note that this procedure is no longer a uniform sample from the edges of G k c , since the probability of a random walk is determined by the out-degrees of its constituent nodes, meaning we are sampling a walk uniformly from a non-uniform sample of walks starting from u. We address this by sampling a random walk with probability proportional to the product of the out-degrees of the first k nodes of the walk. This means that walks, which are individually less likely because they visit multiple hubs with many choices, are weighted more highly by our sampling procedure. This procedure will also eventually construct G k c in its entirety, but we can now trade between speed and uniformity of the sample by setting x lower (faster) or higher (more accurate) compared to the true number of walks per node. An even simpler approximation is to simulate M k-edge random walks on the first-order graph G, both with and without edge weights. This corresponds to comparing the observed DeBruijn graph to a version were all of the unobserved edges (including between nodes that were not observed in G k ) have weight proportional to their probability to be observed in G. This method also benefits from the familiarity of interpretation of random walks on graphs. The vertical axis shows the KL-divergence between the true motif distribution and the sampled motif distribution after accumulating the corresponding number of walks on the horizontal axis. The parameter λ indicates the number of random walks computed per walk sampled for the approximate methods, with λ = ∞ corresponding to sampling one walk from all possible walks for each sampled node. "RW" means sampling a single random walk in the 1st-order graph G at every step. "Uniform" means sampling a walk uniformly from all possible k-edge walks based on G k c . All methods converge to the uniform distribution at similar rates, but as k increases, we see that the true uniform sampling method converges more quickly on average.

We evaluate the efficacy of the sampling procedures described above in the following way. First, we generate a random network G from G N,p [24, 20] , with N = 500 and p = 0.005. For orders k = 2, 3, 4, we compute the true motif distribution by computing the complete DeBruijn graph of G and mapping all of its edges-all k-edge walks-to motifs, then normalize the motif counts to sum to 1. We then choose integers w and i, where i represents the number of iterations and w represents the number of walks sampled per iteration. For every iteration we sample w walks using each sampling method, map the walks to motifs, and accumulate the count of each motif for each method. Then we use the accumulated motif counts to compute a sample distribution and compute the KL-divergence between the sampled distribution up to iteration j and the true distribution.

We show the results of this simulation using w = 25 and i = 500 in Figure 8 , where each panel corresponds to an order k. In each plot of Figure 8 , the horizontal axis represent the cumulative number of walks sampled so far-i.e., w, w · 2, · · · , w · i-and the vertical axis represents the KL-divergence between the true motif distribution and the sampled distribution so far, averaged over 1000 samplings. Each line corresponds to a different sampling procedure: λ = ∞ refers to computing all k-edge walks for every sampled start node; λ = 1, 10, 50, 100 refer to computing λ walks per sampled node; "RW" refers to sampling a single random walk from the 1st-order graph G; and "Uniform" refers to sampling one walk from all possible k-edge walks based on G k c . At k = 2, where there are only 2 sequential motifs, the KL-divergence is already close to the true distribution after only a small number of walks are sampled. However, for k = 4, we see that the uniform sampling has a slightly lower average for the first 1000 samples.

In Figure 9 , we reproduce the Wikispeedia -Finished results from Figure 5 on the left, and show the results for Wikispeedia -Unfinished on the right. The results are very similar overall, however there is 1 key difference worth highlighting. The directed triangle motif (6th row of Figure 9 appears at similar frequencies in both finished and unfinished games (around 500 observations), but in the finished games, the uniform ensemble suggests that the motif is either within expectation or slightly under-represented, while in the unfinished games the motif appears over-represented. This could indicate that triangles are indicative of a bad game, because the player makes two steps, gets frustrated with their options, then returns to an earlier node. In Figure 10 , we compare flight motifs between Q1 (left, reproduced from Figure 5 ) and Q2 (right) of 2020. The overall number of flights dropped considerably, likely due to the COVID-19 pandemic. Specifically, the use of the 2-node chain motif (2nd row in Figure 10 ) decreased both in terms of observed trips and in the extent to which it is underrepresented, potentially indicating that passengers were opting for 1-way trips during the pandemic. Future work should examine these patterns while controlling for year-to-year seasonality. Figure 5 ) compared to Quarter 2 (right). Comparing Q1 and Q2, we observe chain motifs (row 2) in Q2 at rates much closer to the expectation the null models. This might indicate a preference on the part of passengers for 1-way trips to destinations where they planned to stay during the pandemic, rather than round trips that are typical according to Q1 data.

Detecting Novel Discrepancies in Communication Networks

Network Sampling: From Static to Streaming Graphs

Network motifs: Theory and experimental approaches

Comment on "Network Motifs: Simple Building Blocks of Complex Networks" and "Superfamilies of Evolved and Designed Networks

Approximately uniform random sampling in sensor networks

Networks beyond pairwise interactions: Structure and dynamics

Three Hypergraph Eigenvector Centralities

Simplicial closure and higher-order link prediction

Strategies for interconnection networks: Some methods from graph theory

GUISE: Uniform Sampling of Graphlets for Large Graph Analysis

From Relational Data to Graphs: Inferring Significant Links Using Generalized Hypergeometric Ensembles

Constrained de bruijn codes: Properties, enumeration, constructions, and applications

On sampling nodes in a network

On the Representation of De Bruijn Graphs

Configuration models of random hypergraphs

Estimating network parameters using random walks. Social Network Analysis and Mining

Fast low-cost estimation of network properties using random walks

Exploring complex networks through random walks

MapReduce: Simplified data processing on large clusters

On the evolution of random graphs

Line graphs, link partitions, and overlapping communities

Random Regular Graph and Generalized De Bruijn Graph with $k$ -Shortest Path Routing

Detection of simple and complex de novo mutations with multiple reference sequences

Genome Research, page genome;gr.255505.119v1

Random graphs

Random walks in peer-to-peer networks: Algorithms and evaluation

Predicting Sequences of Traversed Nodes in Graphs using Network Models with Multiple Higher Orders

Network comparison and the within-ensemble graph distance

Simplicial models of social contagion

De novo assembly and genotyping of variants using colored de Bruijn graphs

Motif discovery algorithms in static and temporal networks: A survey

Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia

Temporal motifs in time-dependent networks

Temporal motifs reveal homophily, gender-specific patterns, and group talk in call sequences

From networks to optimal higherorder models of complex systems

DeBruijnNets.jl software package

HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks

Hypergraph Motifs: Concepts, Algorithms, and Discoveries. Proceedings of the VLDB Endowment

On a Homomorphism of the de Bruijn Graph and its Applications to the Design of Feedback Shift Registers

On extremal factors of the de Bruijn graph

Temporal Network Motifs: Models, Limitations, Evaluation

Graph-theoretic Analysis of Structured Peer-to-Peer Systems: Routing Distances and Fault Resilience

Sampling connected induced subgraphs uniformly at random

Network Motifs: Simple Building Blocks of Complex Networks

A proof of Golomb's conjecture for the de Bruijn graph

Motifs in Temporal Networks

Review of tools and algorithms for network motif discovery in biological networks

Simplicial Activity Driven Model

Counting Causal Paths in Big Times Series Data on Networks

An Eulerian path approach to DNA fragment assembly

Handling Massive N-Gram Datasets Efficiently

Estimating and sampling graphs with multidimensional random walks

Sampling directed graphs with random walks

A Survey on Subgraph Counting: Concepts, Algorithms and Applications to Network Motifs and Graphlets

G-Tries: A data structure for storing and finding subgraphs

Heterogeneous Network Motifs

Characterizing Motifs in Weighted Complex Networks

Using network motifs to characterize temporal network evolution leading to diffusion inhibition

Multi-order graphical model selection in pathways and temporal networks

Higher-order aggregate networks in the analysis of temporal networks: Path structures and centralities

Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks

Motifs for processes on networks

Fundamental structures of dynamic social networks

Network motifs in the Transcriptional regulation network of Escherichia coli

Networks of Motifs from Sequences of Symbols

Generating Graph Snapshots from Streaming Edge Data

Network motifs and their origins

The Why, How, and When of Representations for Complex Systems

Origin and destination survey database

Network Classification in Temporal Networks Using Motifs

Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections

Motif-based spectral clustering of weighted directed networks

Social Network Analysis: Methods and Applications

Human wayfinding in information networks

Representing higher-order dependencies in networks

Temporal motifs reveal collaboration patterns in online task-oriented networks

CloseGraph: Mining Closed Frequent Graph Patterns

Mining closed relational graphs with connectivity constraints

Statistical properties of sampled networks by random walks

Construction of and efficient sampling from the simplicial configuration model

Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Acknowledgements TL thanks Vahan Nanumyan for preliminary discussions and analyzes that led to this work; Leo Torres for advice and discussion about the methodology; andBrennan Klein for help with designing the visualizations.