key: cord-0482692-h284r8lg authors: Moshiri, Niema title: NiemaGraphGen: A memory-efficient global-scale contact network simulation toolkit date: 2022-01-13 journal: nan DOI: nan sha: af867508480cb33ce637830f925fe164c2493f7c doc_id: 482692 cord_uid: h284r8lg Epidemic simulations require the ability to sample contact networks from various random graph models. Existing methods can simulate city-scale or even country-scale contact networks, but they are unable to feasibly simulate global-scale contact networks due to high memory consumption. NiemaGraphGen (NGG) is a memory-efficient graph generation tool that enables the simulation of global-scale contact networks. NGG avoids storing the entire graph in memory and is instead intended to be used in a data streaming pipeline, resulting in memory consumption that is orders of magnitude smaller than existing tools. NGG provides a massively-scalable solution for simulating social contact networks, enabling global-scale epidemic simulation studies. NiemaGraphGen (NGG) is a memory-efficient undirected graph generation tool that enables the simulation of global-scale contact networks. NGG is intended to be used in data-streaming epidemic simulation pipelines and thus avoids storing the entire contact network in memory, resulting in faster runtime as well as memory consumption that is orders of magnitude smaller than existing tools ( Fig. ) . NGG is written in C++ and has no dependencies beyond a modern C++ compiler (and optionally the command line make tool for convenience). When NGG is compiled, a separate executable is produced for each model. NGG is also available via a Docker container on DockerHub (niemasd/niemagraphgen). NGG currently supports the following stochastic and deterministic models: Barabási-Albert [ ], Barbell, Complete, Cycle, Empty, Erdős-Rényi [ ], Newman-Watts-Strogatz [ ], Path, and Ring Lattice. By default, NGG uses -byte unsigned integers to represent nodes in the network, which supports networks with up to -≈ . billion nodes, but users can use -byte (up to -= , nodes) or -byte (up to -= nodes) unsigned integers to reduce memory consumption, or they can use -byte unsigned integers (up to -≈ quintillion nodes) to support larger networks at the cost of higher memory consumption. By default, NGG outputs networks in the tab-delimited edge list format used by FAVITES [ ]. Output files in this format can then be used as input files within FAVITES, which will then be able to simulate a transmission network, viral phylogeny, and sequences along the given contact network. However, for ultra-large simulation studies, plain-text edge list representations of networks may result in extremely large files, so NGG also implements a proprietary compact binary output format that uses exactly b | E | + bytes to represent a network with | E | edges in which nodes are represented using b-byte unsigned integers. Both supported output formats are highly structured and can thus be compressed reasonably well using standard compression tools (e.g. gzip). FAVITES does not currently support this compact binary format, so contact networks output in this binary format will not be usable as input files in the current version of FAVITES (v . . ), but support for this binary format will be implemented into FAVITES in the near future. Code examples for loading contact networks in NGG's output formats can be found in the NGG GitHub Wiki (https://github.com/niemasd/NiemaGraphGen/wiki). In this subsection, we discuss the memory-efficient graph sampling algorithms implemented within NGG. Most models implemented in NGG are sampled in O ( ) memory. The complete graph, in which every node has an edge to every other node, is trivial to sample in O ( ) memory (Alg. ). The path graph, in which n nodes are connected in a linear path, is trivial to sample in O ( ) memory (Alg. ). Algorithm Sample path graph procedure P (n) for u ← to ndo Output edge (u, u + ) The barbell graph, which consists of two complete graphs with n nodes (Alg. ) connected by a path graph with n nodes (Alg. ), can be sampled in O ( ) memory (Alg. ). Sample Path (n ) for u ← n to n + ndo Output edge (u, u + ) Connect Complete (n ) and Complete (n ) to Path (n ) Output edge ( , n ) Output edge (n , n + n -) The cycle graph, which consists of a single n-node cycle, is trivial to sample in O ( ) memory: it is simply a path graph (Alg. ) with a single additional edge connecting the start and end nodes (Alg. ). Output edge ( , n -) The ring lattice graph, in which every node has an edge to each of its k neighbors (where k must be even), is essentially a generalization of the cycle graph. Specifically, Cycle (n) is equivalent to RingLattice (n, ). The ring lattice graph can be sampled in O ( ) memory (Alg. ). Algorithm Sample ring lattice graph The Erdős-Rényi model is a random graph model for generating networks, and it has two parameters: the total number of nodes in the network (n) and the probability that any of the n possible edges is included (p). A naive algorithm can be used to sample graphs under the model in O ( ) memory (Alg. ). Algorithm Sample Erdős-Rényi model (naive) However, the time complexity of the naive algorithm is O n , making it unsuitable for ultra-large large networks. Instead, an alternative algorithm can also be implemented in O ( ) memory (Alg. ), which is faster than the naive algorithm when the expected number of edges p n is relatively low (i.e., the network is relatively sparse) [ ], as is the case with social contact networks. | GigaByte, , Vol. , No. Algorithm Sample Erdős-Rényi model The Barabási-Albert model is a random graph model for generating scale-free networks, and it has two parameters: the total number of nodes in the network (n) and the number of edges to attach from new nodes to existing nodes (m). An algorithm exists to sample graphs under the model in O (nm) memory. Graphs sampled under BarabasiAlbert (n, m) will have exactly m (n -m) edges, with exactly m targets selected during each iteration of the sampling algorithm. Thus, when implementing the sampling algorithm, memory for repeat and targets can be reserved up-front to avoid array resizing operations during the algorithm (Alg. ). Algorithm Sample Barabási-Albert model The Newman-Watts-Strogatz model, an extension of the Watts-Strogatz model [ ], is a random graph model for generating connected networks with small-world properties. Unlike the Watts-Strogatz model, which may yield in disconnected graphs, the Newman-Watts-Strogatz model is guaranteed to yield connected graphs. The Newman-Watts-Strogatz model begins by sampling RingLattice n, k , and for each edge (u, v) in in the initial ring lattice, a new "shortcut" edge (u, w) is added with probability p. This motivates a naive sampling algorithm (Alg. ). Algorithm Sample Newman-Watts-Strogatz model (naive) However, the naive algorithm requires all edges of the graph to be stored in memory, which results in prohibitively large memory requirements for ultra-large networks. An alternative memory-efficient algorithm can be devised. There are n nodes, and in the original ring lattice, each node has k edges. Therefore, the initial ring lattice graph has nk / undirected edges, meaning we sample from Bernoulli (p) exactly nk / . The total number of successful Bernoulli trials is thus a single sampling from Binomial ( nk / , p). Further, each node has nk -possible new edges that can be added during the "shortcut"-adding step; these edges can be represented by a matrix with n rows (representing u) and nk -columns (representing w): then (v, u) cannot be selected because the graph is undirected. Thus, we can disregard the bottom-right portion of the matrix. We can then represent each cell of the matrix with its corresponding index in an array representation. For example, for n = and k = (X denotes "disregarded"): : : : : : : : : : : : : : With this representation, sampling "shortcut" edges can be reduced to an efficient algorithm: randomly select a collection of Binomial ( nk / , p) integers from Uniform , n(n-k-) -without replacement, then map from the selected integers to their corresponding cells in the matrix, and finally map from cells in the matrix to edges (u, w). Define a "full" row to be a row without any X symbols (i.e., no disregarded cells), and define an "empty" row to be a row that only contains X symbols (i.e., all cells were disregarded). The last column in the first row contains node nk / -, and the last column in the last full row has node n -, so there are (n -) -(n -k / + ) + = k / + non-empty rows: through k / . Thus, for rows through k / (i.e., the full rows of the matrix), we can imagine the following representation in which cells are filled with the corresponding index of the array representation of the matrix: Row k / + has exactly empty cell, row k / + has exactly empty cells, etc. Thus, the first row that is completely empty (i.e., nkempty cells) is row k / + nk -= nk / -. Thus, the remaining portion of the matrix from which "shortcuts" can be sampled can be represented as follows (X denotes "disregarded", and Y denotes "not disregarded"): This is simply a nk --dimensional square matrix with a triangle in the upper-left. We can now use these findings to define an efficient algorithm that only has to keep the "shortcut" edges in memory, rather than all edges (Alg. ). Sample Newman-Watts- To benchmark network generation runtime and memory consumption, we used NetworkX, iGraph, and NGG to simulate replicate networks of various sizes, and we used the GNU time command line tool to measure total runtime and peak memory usage. We chose to explore Complete, Erdős-Rényi, Barabási-Albert, and Newman-Watts-Strogatz graphs in this benchmarking experiment due to their popularity in modeling social contact networks in epidemiological studies. In addition to the number of nodes in the network (n), the Erdős-Rényi, Barabási-Albert, and Newman-Watts-Strogatz models have additional parameters that controls the expected degree E d of the network; the choice of E d = was made arbitrarily, and the same trend was observed for E d = and E d = . All tools are single-threaded, and all runs were executed sequentially on an -core . GHz Intel Xeon CPU with GB of memory. The results of the benchmarking experiment can be found in Figure . iGraph was excluded from the Newman-Watts-Strogatz simulations because iGraph does not support sampling from the Newman-Watts-Strogatz model. Further, NetworkX was unable to run to completion on larger network sizes due to memory requirements that exceeded the GB memory of the benchmarking machine. In all scenarios, NGG was the fastest and least memory-intensive of the three tools. With respect to Complete graphs, NGG is marginally faster than NetworkX and iGraph, and the peak memory usage of NGG is orders of magnitude smaller than both NetworkX and iGraph, with the gap widening as network size grows. With respect to Erdős-Rényi graphs, NGG is ∼ x faster than NetworkX and ∼ . x faster than iGraph, and its peak memory usage is orders of magnitude smaller than both tools, with the gap again widening as network size grows. With respect to Barabási-Albert graphs, NGG is ∼ x faster than NetworkX and ∼ . x faster than iGraph, and its peak memory usage is consistently ∼ x smaller than NetworkX and ∼ x smaller than iGraph. With respect to Newman-Watts-Strogatz graphs, NGG is ∼ x faster than NetworkX, and its peak memory usage is ∼ x smaller than NetworkX, with the gap widening as network size grows. Importantly, aside from the Barabási-Albert and Newman-Watts-Strogatz models, all network models implemented in NGG have constant memory usage regardless of network size. We introduce NiemaGraphGen (NGG), a memory-efficient graph generation tool that enables the simulation of global-scale contact networks. We benchmarked NGG against the two most popular network simulation tools, NetworkX and iGraph, and we showed that NGG was consistently the fastest and had orders of magnitude lower memory consumption than other tools (typically constant with respect to network size). Benchmarking results. Total runtime (left) and peak memory usage (right) for NetworkX, iGraph, and NGG for various network models and sizes. Each point is the average of replicates, and error bars (which are smaller than the marker sizes) represent % confidence intervals. All tools are single-threaded, and all runs were executed sequentially on an -core . GHz Intel Xeon CPU with GB of memory. ; : of source code and requirements • Project name: NiemaGraphGen (NGG) Operating system(s): Platform independent • Programming language: C++ • Other requirements: C++ or higher • License: GNU GPL v We would like to thank Jonathan Pekar, Joel O. Wertheim, Michael Worobey, and Tajana Rosing for fruitful conversations. . Moshiri N, Smith DM, Mirarab S. HIV Care Prioritization Using Phylogenetic Branch Length. Journal of Acquired Immune Deficiency Syndromes The data sets supporting the results of this article, along with all relevant scripts and commands, are available in the following GitHub repository:https://github.com/niemasd/NiemaGraphGen-Paper The same data and scripts can be found in the following portable Code Ocean environment: https://doi.org/10.24433/CO.4009211.v1 Not applicable. The authors declare that they have no competing interests. This work has been supported by the National Science Foundation (NSF) grant NSF-to N.M. N.M. implemented the software tool described in this manuscript, designed and executed the benchmarking experiment, and wrote the manuscript.