key: cord-0196173-jjxwejoi
authors: Cappelletti, Luca; Fontana, Tommaso; Casiraghi, Elena; Ravanmehr, Vida; J.Callahan, Tiffany; Joachimiak, Marcin P.; Mungall, Christopher J.; Robinson, Peter N.; Reese, Justin; Valentini, Giorgio
title: GraPE: fast and scalable Graph Processing and Embedding
date: 2021-10-12
journal: nan
DOI: nan
sha: 0eb9e250bfa039475dcc8c2536ef4f21beca9339
doc_id: 196173
cord_uid: jjxwejoi

Graph Representation Learning methods have enabled a wide range of learning problems to be addressed for data that can be represented in graph form. Nevertheless, several real world problems in economy, biology, medicine and other fields raised relevant scaling problems with existing methods and their software implementation, due to the size of real world graphs characterized by millions of nodes and billions of edges. We present GraPE, a software resource for graph processing and random walk based embedding, that can scale with large and high-degree graphs and significantly speed up-computation. GraPE comprises specialized data structures, algorithms, and a fast parallel implementation that displays everal orders of magnitude improvement in empirical space and time complexity compared to state of the art software resources, with a corresponding boost in the performance of machine learning methods for edge and node label prediction and for the unsupervised analysis of graphs.GraPE is designed to run on laptop and desktop computers, as well as on high performance computing clusters

In a broad range of fields, knowledge is often represented in the form of graphs of interrelated concepts and entities. This in turn motivated the development of Graph Representation Learning methods that allow nodes and edges to be represented as embeddings, e.g. vectors in a low-dimensional Euclidean space [1] . These embeddings aim to preserve the original topological, structural, and semantic relationships of the original non-Euclidean graph space, and can be used for unsupervised clustering analysis, multi-resolution graph visualization, as well as supervised and semi-supervised node label and edge prediction problems [2, 3] . Approaches for generating graph embeddings range from matrix factorization-based methods [4] , random walk based methods [5, 6] , edge modeling methods [7] , generative adversarial networks [8] , to deep learning methods [2] , and have been applied to the analysis of social network, economics, biology, medicine and other disciplines [9, 3] . and billions of edges, raising the problem of the scalability of existing software resources implementing Graph Representation Learning algorithms [1] .

To deal with this scalability problem, we developed GraPE (Graph Processing and Embedding), a fast and scalable software resource that leverages quasi-succinct data structures [19] and parallel programming [20] to analyze big graphs and provide embeddings based on random walk samples.

The generation of billions of sampled random walks made feasible by the fast and efficient implementation of GraPE can lead to more accurate embedded representations of graphs and boost the performance of machine learning methods that learn from the embedded vector representation of nodes and edges. The quality of the graph embedding computed by current state-of-the-art libraries is limited by the reduced capability of efficiently generating enough data to accurately represent the topology of the underlying graph. This limitation affects the performance of node and edge label prediction methods, since they strongly depend on the informativeness of the underlying embedded graph representation. GraPE offers a modular and flexible solution to these problems through well-documented, scalable, and fast software libraries that can run on general-purpose desktop and laptop computers with or without additional GPU/TPU devices, as well as on high performance computing clusters.

GraPE is a fast graph processing and embedding library that can efficiently scale with big graphs using parallel computation and efficient data structures. The main functionalities of the library are depicted in Figure 1 .

GraPE is composed of two main modules: Ensmallen (ENabler of SMALL runtimE and memory Needs) and Embiggen (EMBeddInG GENerator). Ensmallen efficiently loads big graphs and executes graph processing operations including large-scale first and second-order random walks. Embiggen leverages the large amount of sampled random walks generated by Ensmallen to effectively compute node and edge embeddings that can be used for unsupervised exploratory analysis of graphs or to train the flexible neural models provided by Embiggen itself for solving edge and node label prediction problems in big graphs.

Ensmallen is implemented in Rust to allow efficient parallel computation [20] , with Python bindings for ease of use, and implements both traditional map-reduce thread-based parallelism and branch-less Single Instruction Multiple Data (SIMD) parallelism, in order to achieve a speed-up in graph processing operations of several orders of magnitude. Ensmallen is designed to load graphs using a fraction of the main memory required by other libraries while maintaining a high speed through a succinct data structure exploiting Elias-Fano [19] representation and an indexing strategy we devised to achieve constant-time rank and select operations with memory usage close to the theoretical minimum (Section 4.1).

Ensmallen can lazily generate tens of millions of 100-length random walk samples per minute using offthe-shelf desktop computers, to avoid memory bottlenecks and to train Skip-gram [21] , CBOW [22] , and GloVe [23] embedding modules, efficiently implemented in the Embiggen library using TensorFlow and GPU computing (Section 4.2). The resulting accurate graph embeddings generated by Embiggen boost the performance of the deep neural models implemented in the GraPE resource, achieving results competitive with state-of-the-art methods for edge and node label prediction problems, and facilitating unsupervised clustering analysis and visualization of complex graphs (Section 2.5).

Ensmallen efficiently implements approximated weighted first and second-order random walks with no statistically significant decrease in the performance on node and edge prediction tasks, thus allowing graphs containing high-degree nodes (degree > 10 6 ) to be processed, which is otherwise unmanageable with exact random walk algorithms (Section 2.3).

GraPE provides also other resources, such as multiple graph Monte Carlo holdout techniques, Bader and Kruskal algorithms for computing random and minimum spanning arborescence and connected components, parallel computation of betweenness centrality [24] , graph node and edge filtering methods, graph algebraic set operations, as well as immediate access to an ever-increasing curated list of commonly used graphs (currently about 60000, detailed in the Supplementary section S3.2.4), making it easier to design and reproduce experiments (Section 2.6).

Multiple time-memory trade-off settings can be enabled according to the available resources to enable running GraPE on different hardware architectures, ranging from off-the-shelf notebook computers to High-Performance-Computing clusters (Section 4.1.6).

In the following sections, we detail the main characteristics of GraPE , and we compare it with state-of-the-art graph-processing libraries across several types of graphs having different size and characteristics. 

Ensmallen can process graphs in different formats while simultaneously checking for common format errors. Figures 2-a and 2-b show respectively the empirical space and time complexity of Ensmallen with respect to state-of-the-art graph processing libraries, including NetworkX [25] , CSRGraph, PecanPy [14] , for the loading task across 44 real-world graphs spanning different sizes and topological characteristics. The graphs were downloaded from Network Repository [26] , BioGRID [27] , WormNet [28] , and KG-COVID-19 [29] . Libraries and graphs used in the experiments are further detailed in the Supplementary Information (S1 and S2).

Through extensive use of both thread and SIMD parallelism, and specialized quasi-succinct data structures, Ensmallen outperforms state-of-the-art libraries by one to four orders of magnitude in the computation of random walks, both in terms of empirical computational time and space requirements ( Figure 2 -c, d, e, f and Section 2.3.1).

Further speed-up is obtained by automatically (and transparently) dispatching one of 8 optimized implementations of the second-order random walks for node2vec [6] preprocessing based on the values of the return and in-out parameters and on the type of the graph (weighted or unweighted). By analyzing these parameters GraPE can automatically provide the version best suited to the requested task, with minimal code redundancy (Section 4.1.3). The time performance difference between the least and the most computationally expensive configurations is around two orders of magnitude (Supplementary Section S4.2 and  Supplementary Table S4 ).

Ensmallen also implements optional time-space trade-offs (Section 4.1.6), thanks to which the algorithms can be executed on computing environments ranging from simple notebooks to large HPC clusters. Approximated random walk versions for graphs with high-degree nodes allow efficient scaling with graphs containing nodes with degree higher than 10 6 , with no statistically significant decrease in performance (Section 2.3.2).

2.3.1 Experimental comparison of Ensmallen with state-of-the-art graph processing libraries.

We compared Ensmallen with a set of state-of-the-art libraries including GraphEmbedding, Node2Vec, CSR-Graph and PecanPy [14] , on a large set of first and second-order random walk tasks. The random walk procedures in the GraphEmbedding and Node2Vec libraries use the alias method (Supplementary section S4.1). The PecanPy library also employs the alias method for small graphs use-cases (less than 10000 nodes). CSRGraph, on the other hand, computes the random walks lazily using Numba [30] . Similarly, PecanPy leverages Numba lazy generation for graphs having more than 10000 nodes. All libraries are further detailed in the Supplementary Information (Section S1). Ensmallen largely outperforms all the compared graph libraries on both first and second-order random walks for the experimental evaluations of both space and time complexity. Note that Ensmallen nicely scales with the biggest graphs considered in the experiments, while the other libraries either crash, exceed 200GB of memory, or take more than 4 hours to execute the task (Figure 2 c, d, e, f). More detailed results are provided in the Supplementary Information.

To overcome the computational burden of processing graphs characterized by nodes with high degree, Ensmallen provides an approximated implementation of random walks that scales with graphs containing nodes with millions of neighbours (Figure 3 a, b, c) -see Section 4.1.5 for more details. We compared exact and approximated random walk samples for graph embedding using SkipGram in the context of an edge prediction problem with the H. sapiens STRING PPI network [31] , achieving results that are comparable between the two approaches (Wilcoxon rank-sum p-value> 0.2, Fig. 3 d) , by running 30 holdouts and setting a degree threshold equal to 10 for the approximated random walk. In contrast, the maximum degree in the training set ranged between 3325 and 4184 across the holdouts. These results show that no relevant performance decay is registered, even if we introduce a relatively stringent degree threshold.

Embiggen is the second main module of GraPE; it provides node embedding models, which may be processed by efficient methods for node-label prediction and edge prediction. All the models are implemented using the Keras/TensorFlow2 [32] library. Embiggen and Ensmallen are seamlessly integrated modules whose relationships are schematically summarized in Figure 5 . Ensmallen lazily generates random walk samples to train graph embedding models (i.e. CBOW, SkipGram and GloVe [21, 22, 23] ), whose node and edge representations are used to train deep neural models for node and edge prediction problems, or also for the unsupervised analysis of graphs ( Figure 4 ). The synergistic relationships between Ensmallen and Embiggen are enforced by the complementary exploitation of memory and computational resources: Ensmallen runs on CPU/RAM, while Embiggen primarily on GPU(s) ( Figure 5 ). If required, Embiggen can run also on CPU.

Embiggen also includes tools to easily visualize in a fully unsupervised way the computed node and edge embedding and their properties, including for instance the edge weights, node degrees, and edge and node types. For example the node and edge categories of the KG-COVID19 graph are visualized in Figure 4 using t-SNE decomposition [33] . Note that no information about the node or edge type was used during the computation of the embedding. embedding was computed using Node2Vec with a SkipGram model, and the edge embedding was obtained by concatenating the node embeddings of the source and destination nodes. The embeddings were projected onto 2D space using t-SNE decomposition [33] . Colors indicate the Biolink model category for each node (on the left) and edge (on the right). The KG-COVID-19 graph contains multi-label nodes: for those nodes we visualized the first of their node types, which partially accounts for the different node types overlaps. The embedding shows a meaningful grouping of the node and edge types of KG-COVID-19.

Embiggen supports different edge embedding methods, ranging from concatenation of the node embedding vectors, to Hadamard element-wise multiplication between the embedding vectors, element-wise difference in L 1 or L 2 norm, and element-wise mean or sum between the node embedding vectors.

On the basis of the embeddings obtained through the Embiggen module, the GraPE library provides Keras/Tensorflow neural models that can be used for node-label, edge-label and edge prediction problems.

Ensmallen efficiently generates massive random walk samples ( Figure 5 -a) to construct accurate node and edge embeddings ( 7

The library also allows further tuning of the node embeddings in a supervised way through the same backpropagation procedure used to train the prediction model, leading to improved performance ( Figure 5 ) -see sections 4.2.2 and 4.2.3 for more details. Figure 5 : Relationships between Ensmallen sample generation methods and Embiggen node embedding and node and edge prediction models. a: Ensmallen provides random walk samples (pink boxes) to Embiggen node embedding models and generates balanced mini-batches for the edge and node label prediction modules (blue boxes). b: Embiggen node embedding models receive sampled random walks from the Ensmallen module and output node embedding data to the Embiggen prediction modules. c: Embiggen edge prediction models receive as input embedded nodes from node embedding models, and indices of the nodes for mini-batch training of the classifier (green box) from Ensmallen. Embedded edge vectors are constructed from the embedded nodes (gray box) and given as input to the classifier module. d: Similarly Embiggen node prediction models receive node indices and embedded nodes from the other GraPE modules. The gray box merges the node representations of the node with those of its neighbours (gray node) to regularize the training of the node label classifier (green node). The models are further detailed in section 4.2.

In edge prediction problems, the negative edges are the edges not present in a given graph; considering that most real-world graphs are sparse, the resulting classification task is highly imbalanced. The same problem also arises in node classification since several problems are highly imbalanced. To deal with unbalancing issues Ensmallen implements balanced mini-batches, randomly sampling a number of negative edges/nodes equal or comparable to the number of positive edges/nodes ( Figure 5 -a). The number and size of minibatches per epoch can be chosen depending on the graph size, as well as the balancing rate between positive and negative examples (Section 4.2).

We compared Embiggen with state-of-the-art embedding libraries for edge prediction implementing matrix factorization based methods (GraRep [34] and HOPE [35] ), random-walk based (Node2vec and struc2vec [36] and deep neural network-based (SDNE [37] , GAE [38] and LINE [7] ) methods. In all the edge prediction experiments, the Embiggen module used a neural network composed of two dense regularized layers with ReLU activation and a top output layer with a sigmoid unit, without any hyper-parameter selection (including the graph node embedding parameters). The GraPE neural models were trained using node embeddings computed using CBOW, SkipGram and GloVe, and edge embeddings were computed using the concatenation method. Details about the neural model implementation are available in section 4.2.3.

We compared Embiggen across a set of graphs used in [9] , including the Comparative Toxicogenomics Database (CTD-DDA), the DrugBank, drug-drug interaction (DrugBank DDI), the National Drug File Reference Terminology, drug-disease association (NDFRT-DDA) and STRING protein-protein interaction 8 (STRING-PPI). A report about the graph characteristics, automatically generated using Ensmallen (Section 2.6), and the hyper-parameters considered for the embedding process of each graph are available in the Supplementary Information (section S3.1).

We used the same edge prediction experimental set-up adopted in [9] by applying multiple holdouts with 80% of data for training and 20% for the test, using an unbiased generation of the training and test edge data (Section 2.6). Results show that GraPE outperforms the other methods in all the considered edge prediction tasks (Figure 6 a, b and c). Differences in favour of GraPE were statistically significant in each comparison (Wilcoxon rank-sum test, at 0.05 significance level).

2.5.2 Experimental comparison with state-of-the-art node label prediction models.

We compared Embiggen with state-of-the-art libraries for node-label prediction, including methods based on graph neural networks (GNN) like APPNP (both the Pytorch geometric implementation [16] , and Pagerank implementation [39] ) and Planetoid [40] , spectral-based Graph Convolutional Neural Networks (GCNN) which approximate spectral filters by truncated Chebysev polynomials (ChebNet [41] ), Learnable Graph Convolutional Networks (LGCN) [42] , SplineCNN, based on geometric deep-learning with continuous bspline kernels [43] and Graph U-nets, that extends to graphs the encoder-decoder architecture of U-net through novel graph pooling and unpooling operations [44] .

For the experimental comparison of Embiggen with state-of-the-art libraries we considered Cora [45] , CiteSeer [45] and PubMed [46] graphs. A report about the graph characteristics, automatically generated using Ensmallen (Section 2.6), and the hyper-parameters considered for the embedding process of each graph are available in the Supplementary Information (section S3.2).

For each graph, the node embedding was computed with CBOW, SkipGram or GloVe. In all the node-label prediction experiments the model represented in figure 5 -d was used, where the classifier module is a neural network composed by dense layers, a dropout layer followed by a dense SoftMax layer with a number of units equal to the number of the considered node labels and l1-l2 regularization. The node-label classifier module implements a regularization technique by which the embedded node is averaged with the embeddings of its neighbour nodes [2] . Other aggregation methods such as Hadamard product or sum can also be used in order to compute the node context. The optimizer used during the training of the node-label prediction model is Nadam, and the loss used is a categorical cross-entropy. The top three bar plots (a, b and c) refer respectively to the Accuracy, AUROC and F1-Score on the edge prediction task. The columns of the same top three bar plots refer to the four graphs considered for edge prediction, which are, respectively (from left to right) CTD-DDA, DrugBank-DDI, NDFRT-DDA and String-PPI. Figure d refers to the accuracy of the node-label prediction task. The bar plots of figure d refers to the three graphs considered for node label prediction, i.e.: CiteSeer, Cora and Pubmed Diabetes. The error bars represent the standard deviation, when provided from the respective work. The GraPE methods are always colored in orange, while the other libraries for edge prediction are colored blue and the other libraries for node prediction are colored purple. All bar plots are zoomed in the the range between 0.6 and 1.0, to better show the differences between the compared methods. The GraPE methods, without any model selection, always outperform the other libraries in the edge prediction task. Additionally, the GraPE methods, again without any model selection, outperforms most other node-label prediction methods, with the considerable exception of the SplineCNN method.

Ensmallen provides several methods to facilitate the automatic construction of unbiased train and test data for edge prediction, such as K-folds and connected Monte Carlo holdouts tailored to edge prediction problems, to avoid the generation of new connected components not present in the original graph.

GraPE implements fast algorithms to analyze the overall characteristics of the graph, including Breadth and Depth-first search, Dijkstra, Tarjan's strongly connected components, efficient Diameter computation, spanning arborescence and connected components, approximated vertex cover, triads counting, transitivity, clustering coefficient and triangles counting, Betweenness and stress centrality, Closeness and harmonic centrality, as well as optimized implementations for algebraic set graph-operations and node and edge filters.

Moreover Ensmallen provides three kinds of human-readable graph-reports in natural language (Supplementary section S4. 4) . Some examples are available in Supplementary section S1, S3 and Supplementary Figure  S1 .

We have presented GraPE a software resource with specialized data structures, algorithms, and fast parallel implementations of graph processing coupled to implementations of algorithms for graph machine learning.

The experimental results demonstrate that GraPE significantly outperforms several state of the art graph processing libraries in terms of empirical space and time complexity (Wilcoxon rank-sum test at 0.05 significance level) and can even load massive graphs (e.g. the Twitter graph) when the other libraries fail ( Figure 2 a and b).

Moreover, for both first and second-order random walks, we observed a speedup ranging from two to four orders of magnitude compared with state of the art libraries ( Figure 2 c and e). We registered a statistically significant improvement for all the considered random walk tasks (Wilcoxon rank-sum test at 0.01 significance level) for both the memory use and empirical time requirements (Figure 2 c, d, e and f).

Quasi-succinct data structures and lazy loading are crucial to efficiently generate large sets of random walk training data and enable the processing of large graphs in off-the-shelf computers. In fact, when we analyze a graph of the size of KG-COVID-19 [29] , the static generation of a huge amount of data fed into the models would easily occupy tens of terabytes of memory, making data processing unfeasible even in well-equipped computers.

GraPE can also run an approximated version of first and second-order random walks on big and highly connected graphs where the exact algorithm is extremely inefficient. For instance, on the graph sk-2005 [47] , that includes about 50 millions of nodes and 1.8 billions of edges and some nodes with degrees over 8 million, by linearly extrapolating the results reported in Fig. 3 -e to the entire graph, the exact algorithm requires about 23 days, while the approximate one about 11 minutes, both running on a PC with with two AMD EPYC 7662 64-core processors, 256 CPU threads, and 1TB RAM.

The synergy between parallel computation and efficient data structures enables a fast generation of huge volumes of random walk samples, that can be used to obtain more refined and accurate graph embeddings [1] . Moreover, the unprecedented training sample generation speed obtained using Ensmallen opens the door to model selection, for instance, through Bayesian optimization [48] , that previously would have required a prohibitive amount of computation time. Indeed the sheer amount of training samples that can be quickly generated meets the well-known greedy nature of neural network training sample requirements, leading to boosted performance in node and edge predictions tasks.

Results showed that GraPE outperforms the other competing libraries in all the considered edge prediction tasks (Figure 6 a, b and c). Differences in favour of GraPE were statistically significant in each comparison (Wilcoxon rank-sum test, at 0.05 significance level). Similar results have been also obtained for node label prediction tasks, confirming that the huge amount of sampled random walks generated by the GraPE engine can provide accurate embeddings that boost the performance of machine learning methods for edge and node label prediction.

The GraPE models reported here were trained without any model selection and hence it is likely that results can be further improved by using e.g. random search or Bayesian optimization techniques.

The modularity of the library allows use of graph embeddings obtained with other methods and libraries to train the neural networks implemented in the Embiggen module. For the same reason, other learning machines, implemented, e.g., in the scikit-learn Python package, can be trained with the node and edge vector representations computed by Embiggen.

Ensmallen also implements optional time-space trade-offs (Section 4.1.6), thanks to which the algorithms can be executed on computing environments ranging from simple notebooks to large HPC clusters.

Even if GraPE focuses on efficient random walk-based embeddings that allows edge and node label prediction problems in big graphs to be addressed, it also provides utilities for graph loading, processing and analysis, facilities to generate train and test graph data and also automatic human readable reports in natural language to summarize the characteristics of any graph.

A limitation of the current GraPE version is that does not provide explicit algorithms for the analysis of heterogeneous graphs, characterized by different types of nodes and edges. Nevertheless the object-oriented architecture and the design of GraPE allow the resource to be easily extended in this direction, through class inheritance and composition. A planned ongoing development is precisely the design and implementation of algorithms for node and edge label prediction in heterogeneous graphs [49] .

The two main modules of GraPE, Ensmallen and Embiggen, implement a large spectrum of graph processing methods and utilities, including fast graph loading, efficient first and second order random walks, graph embedding, node and edge label prediction, and a large range of graph processing algorithms that nicely scale with big graphs, using parallel computation and efficient data structures to speed-up the computation.

Ensmallen is implemented using the Rust language 1 , with fully documented Python bindings to provide a seamless integration with Deep Learning libraries such as Keras and Tensorflow. Rust is gaining importance in the scientific community [20] thanks to its robustness, reliability, and speed. Rust programming for lowlevel implementation of different procedures allows us to reduce computational costs, and to robustly and safely exploit both threads and data parallelism. In particular, Rust makes it easy to develop a highly parallel implementation of graph algorithms, including first and second order random-walks, sampling of edges for holdouts, and the Bader parallel spanning tree algorithm, while at the same time guarantees freedom from data-races and memory corruption bugs. To further improve efficiency, some core functionalities of the library, such as the generation of pseudo-random numbers and sampling procedures from a discrete distribution, are written in assembly (see Supplementary Sections S4.2 and S4.2.1 for further details).

The library has a comprehensive test suite that runs at every commit to check for correctness. Additionally, to thoroughly test Ensmallen against many different scenarios, we also employed fuzzers 2 available from the Google library HonggFuzz, and from the LLVM library libFuzzer, to test the latest version of Ensmallen over two billion unique test cases, obtaining successful results.

Beside heavy exploitation of parallelism, the second ingredient of our efficient implementation is the careful design of the data-structures for both using as little memory as possible and quickly performing operations on them. The naive representation of graphs explicitly stores its adjacency matrix, with a O(|V | 2 ) time and memory complexity being |V | the number of nodes, which leads to intractable memory costs on large graphs. However, since most large graphs are extremely sparse, this problem can be mitigated by storing only the existing edges. This is usually done by using the Compressed Sparse Rows (from now on referred to as CSR [50] ) format, which stores the source and destination indices of existing edges into two sorted vectors. In Ensmallen we further compressed the graph adjacency matrix by adopting the Elias-Fano succinct data scheme which allows to efficiently store the edges (Supplementary section S4.1 ).

Since Elias-Fano representation stores a sorted set of integers using memory close to the information theoretical limit, we defined a bijective map from the graph edge set and a sorted integer set. To define such encoding we firstly assigned a numerical id from a dense set to each node, and then we defined the encoding of an edge as the concatenation of the binary representations of the numerical ids of the source and destination nodes. The aforementioned encoding has the appealing property of representing the neighbours of a node as a sequential and sorted set of numeric values, which allows an efficient retrieval of neighbors during each random walk computation, since Elias-Fano has a faster sequential access when compared to random access (Supplementary section S4.1.1).

Memory Complexity. More precisely, Elias-Fano is a quasi-succinct data representation scheme, which provides a memory efficient storage of a monotone list of n sorted integers, bounded by u, by using at most EF(n, u) = 2n + n log 2 u n bits, which was proven to be less than half a bit per element away from optimality [19] and assuring random access to data in average constant-time. Thus, when Elias-Fano is paired with the previously presented encoding, the final memory complexity to represent a graph G(V, E)

which is asymptotically smaller than the complexity of the CSR scheme,

Edge Encoding. Ensmallen converts all the edges of a graph G(V, E) into a sorted list of integers: considering an edge e = (v, x) ∈ E connecting nodes v and x represented, respectively with integers a and b, the binary representation of a and b are concatenated through the function φ k (a, b) to generate an integer index uniquely representing the edge e itself:

This implementation is particularly fast because it requires only few bit-wise instructions:

where << is the left bit-shift, | is the bit-wise OR and & is the bit-wise AND (see Supplementary S4.1.1 for an example and an implementation of the encoding). Since the encoding uses 2k bits, it has the best performances when it fits into a CPU word, which is usually 64-bits on modern computers, meaning that the graph must have less than 2 32 nodes and and less than 2 64 edges. However, by using multi-word integers it can be easily extended to even larger graphs.

Operations on Elias-Fano. Beside being extremely simple, when paired with Elias-Fano representation, the aforementioned encoding allows an even more efficient computation of random walk samples. Indeed Elias-Fano representation allows performing rank and select operations by requiring on average constant time. These two operations were initially introduced by Jacobson [51] to simulate operations on general trees, and were subsequently proven fundamental to support operations on data structures encoded through efficient schemes [52] .

In particular, given a set of integers S, Jacobson defined the rank and select operations as follows [51] :

rank(S, m) returns the number of elements in S less or equal than m select(S, i) returns the i-th smallest value in S As explained below, to speed up computation, we deviate from this definition by defining the rank operation as the number of elements strictly lower than m. To compute the neighbours of a node using the rank and select operations, we observe that for every pair of nodes α, β with numerical ids a, b respectively, it holds that:

Thus, the encoding of all the edges with source α will fall in the discrete range φ k (a, 0), φ k (a + 1, 0) = a 2 k , (a + 1) 2 k Thanks to our different definition of the rank operation and the aforementioned property of the encoding, we can easily derive the computation of the degree d(a) of any node v with numerical id a for the set of encoded edges Γ of a given graph, which is equivalent to the number of outgoing edges from that node:

Moreover, we can retrieve the encoding of all the edges Γ a starting from v encoded as a, by selecting every index value i falling in in the range [φ k (a, 0), φ k (a + 1, 0):

We can then decode the numerical id of the destination nodes from Γ a , thus finally obtaining the set of numerical ids of the neighbours nodes N (a):

In this way, by exploiting the above simple integer encoding of the graph and the Elias-Fano data scheme, we can efficiently compute the degree and neighbours of a node using rank and select operations.

Efficient implementation of Elias-Fano. The performance and complexity of Elias-Fano heavily relies on the details of its implementation. In this paragraph our implementation is sketched, to show how we obtain an average constant time complexity for rank and select operations. A more detailed explanation can be found in the Supplementary Section S4.1.

Elias-Fano essentially represents a sorted list of integers y 0 , ..., y n bounded by u, i.e. ∀i ∈ {1, . . . , n − 1} we have 0 ≤ y i−1 ≤ y i ≤ y i+1 ≤ u by splitting each value, y i , into a low-bits, l i , and a high-bits part, h i . It can be proven that the optimal split between the high and low bits requires log 2 u n bits [53] . The lower-bits are consecutively stored into a low-bits array L = [l 1 , ..., l n ], while the high-bits are stored in a bit-vector H = [h 1 , ..., h n ], by concatenating the inverted unary encoding 3 , U of the differences (gaps) between consecutive high-bits parts: 

that similarly stores the position of every q ones. Thanks to the constructed index, when the i-th value v must be found, the scan can be started from a position, o j , for j = i q that is already near to the i-th v. Therefore, instead of scanning the whole high-bits array for each search, we only need to scan the high-bits array from position o j to position o j+1 . It can be shown that such scans take an average constant time O(q) at a low expense of the memory complexity, since we need O n q log 2 n bits for storing the two indexes (See Supplementary section S4.1).

Indeed, in our implementation we chose q = 1024 which provides good performance at the cost of a low memory overhead of 3.125% over the high-bits and, on average, for every select operation we need to scan 16 words of memory.

For a second order random walk that traverses the edge from the node v to x, starting at the previous step at node t ( Supplementary Fig. S8 ), Leskovec et al. [6] define the un-normalized transition probability π vx of moving from v to x as a function of the weight w vx of the edge (v, x), and the search bias α pq (t, x):

The search bias α pq (t, x) is defined as a function of the distance d(t, x) between t and x, and two parameters p and q, called, respectively, return and in-out:

If the return parameter p is small, the walk will be enforced to return to the preceding node, otherwise, if p is large, the walk will be encouraged to visit new nodes. The in-out parameter q allows to vary smoothly between Breadth First Search (BFS) and Depth First Search (DFS) behaviour: if q is small, the walk will prefer outward nodes thus mimicking DFS, otherwise it will prefer inward nodes emulating in this case BFS.

Since α must be recomputed at each step of the walk, we sped up their computation by decomposing the search bias α pq (t, x) into the in-out bias β q (t, x), related to the q parameter, and the return bias γ p (t, x), related to p:

where the two new biases are defined as:

It is easy to see that eq. 2 is equivalent to eq. 1.

Efficient computation of the in-out and return bias. The in-out bias can be re-formulated to allow an efficient implementation: starting from an edge (t, v) we need to compute β q (t,

This formulation ( Supplementary Fig. S9 ) allows us to compute in batch the set of nodes X β affected by the in-out parameter q:

Therefore the selection of the nodes X β affected by β q can be reduced to computing the difference of the two sets N (v) \ N (t). We efficiently computed X β by using a SIMD algorithm implemented in assembly, leveraging AVX2 instructions that work on node-set representations as sorted vectors of the indices of the nodes (see Supplementary sections S4.2 and S4.2.1 for more details).

The return bias γ p can be simplified as:

Therefore this can be efficiently computed using a binary search for the node t in the sorted vector of neighbours.

Summarizing we can re-formulate the transition probability π vx of a second order random walk in the following way:

If p, q are equal to one, the biases can be simplified and thus we can avoid computing them. In general, depending on the values of p, q and on the type of the graph (weighted or unweighted), Ensmallen provides eight different specialized implementation of the Node2Vec algorithm, in order to significantly speed-up the computation (Supplementary Tables S4 and S5 ).

Ensmallen automatically selects and runs the specialized algorithm that corresponds to the choice of the parameters p, q and the graph type. This automatic adaptation to the complexity of the task allows a significant speed-up. For instance, in the simplest case (p = q = 1 and an unweighted graph) the specialized algorithm runs more than 100 times faster than the most complex one (p = 1, q = 1, weighted graph). Moreover, as expected, we observe that the major bottleneck is the computation of the in-out bias (Supplementary Table S5 ).

Sampling from a discrete probability distribution is a fundamental step for computing a random walk and can be a significant bottleneck. Many graph libraries implementing the Node2Vec algorithm speed up sampling by using the Alias method (see Supplementary section S.4.3.3 ), which allows sampling in constant time from a discrete probability distribution with support of cardinality n, with a pre-processing phase which scales linearly with n.

The use of the Alias Method for Node2Vec incurs the "memory explosion problem" [54] since the preprocessing phase for a second-order random walk on a graph with |E| edges has a support whose cardinality is O eij ∈E deg (j) , where deg(j) is the degree of the destination node of the edge e ij ∈ E. Therefore, the time and memory complexity needed for preprocessing make the Alias method intractable even on relatively small graphs. For instance, on the unfiltered Human STRING PPI graph (19.354 nodes and 5.879.727 edges) it would require 777 GB of RAM.

To avoid this problem, we compute the distributions on the fly. For a given source node s, our sampling algorithm applies the following steps: 1) computation of the un-normalized transition probabilities to each neighbour of s according to the provided in-out and return biases; 2) computation of the un-normalized cumulative distribution, which is equivalent to a cumulative sum; 3) uniform sampling of a a random value between 0 and the maximum value in the un-normalized cumulative distribution; 4) the corresponding index is then identified through either a linear scan or a binary search, according to the degree of the node s.

To compute the cumulative sum efficiently, we implemented a SIMD routine that processes at once in CPU batches of 24 values. Moreover, when the length of the vector is smaller than 128, we apply a linear scan instead of a binary search because it is faster thanks to lower branching and better cache locality. Further details are available in the Supplementary section S4.2.2 .

Since the computational time complexity of the sampling algorithm described in the previous Section 4.1.4 scales linearly with the degree of the considered source node, computing an exact random walk on graph with high degree nodes (where "high" refers to nodes having an outbound degree larger than 10000) would be impractical, considering also that such nodes have a higher probability to be visited.

To cope with this problem, we designed an approximated random walk algorithm, where each step of the walk considers only a sub-sampled set of k neighbors, where the parameter k is set to a value significantly lower than the maximum node degree. Recalling that the efficiency of our node2vec implementation relies in our "sorted" data representation (i.e. where the neighbors of each node are stored contiguously and in ascending order), an efficient neighborhood sub-sampling for nodes with degree greater than k requires to uniformly sample unique neighbors whose original order must be maintained. To uniformly sample distinct neighbors in a discrete range [0, n] we implemented an algorithm (Sorted Unique Sub-Sampling -SUSS, see Supplementary S4.2.4 ) that divides the range [0, n] into k uniformly spaced buckets and then randomly samples a value in each bucket. This allows to efficiently extract k distinct and sorted values with complexity Θ(k).

The disadvantage of this sub-sampling approach is that two consecutive neighbors will never be selected in the same sub-sampled neighborhood. However, they have the same probability of being selected in different subsamplings, since the sub-sampling, thanks to its efficiency, is repeated at each step of the walk and multiple random walks from each node are performed (more details in Supplementary section S4.2.4 , Algorithm 1).

Even if the Elias-Fano quasi-succinct data structure enables efficient operations on graphs, Ensmallen can optionally further speed-up the computation at the expense of a more expensive usage of the main memory. We used three techniques, as listed below.

Explicit destinations vector The first and most important option is to explicitly create the vector of the destinations, avoiding to execute a select from the Elias-Fano data structure to extract a given edge destination and achieving a speedup during the random walks (on average an x3-4 speedup) while spending twice as much memory.

Explicit out-bound degrees vector The second most important option is to create the vector of the outbounds node degree, which avoids extracting the degree of a source node from the Elias-Fano. While spending a relatively limited amount of RAM, this grants, on average, an additional 10% speedup in the computation of random walks. When combined with the explicit destinations vector, it can achieve a combined speedup of up to x5-6 of random walks' computation.

We suggest enabling this option when computing a random walk-based model, such as CBOW or SkipGram.

Explicit sources vector In the context of the generation of link prediction batches, by explicitly creating both the vector of sources and the vector of destinations, we can avoid accessing the Elias-Fano data-structure at all, and by spending around three times more memory, we can achieve between three to four times speedup for the generation of link prediction batches.

The Embiggen module provides node embedding models (section 4.2.1), node-label prediction models (section 4.2.2), edge prediction models (section 4.2.3) and node and edge visualization tools.

The goal of node embedding models is to encode nodes as low-dimensional vectors that summarize the topological characteristics of their corresponding nodes. In other words, node embedding maps nodes topologically related into vectors close in the mapped metric (Euclidean) space.

Once computed, node embeddings can be used for different graph representation learning tasks [1] , including node label prediction, edge and edge label prediction, unsupervised graph analysis (e.g. node clustering) or to visualize graphs and their properties ( fig. 4 ) in combination with dimensionality reduction techniques (e.g. Siamese networks [55] , TSNE [33] ).

Embiggen currently provides an efficient Python object-oriented implementation of GloVe [23] , SkipGram and CBOW [21] , node embedding models, using Keras and Tensorflow as backend and optional GPU processing to speed-up computation.

All of Embiggen models have been designed, when possible, according to the "composition over inheritance" paradigm, to ensure a simpler usage experience through better modularity and polymorphic behaviour [56] . Each component of the node embedding models has been implemented as a separate Keras layer object to reduce code duplication and increase reusability (e.g. edge embedding layers like Hadamard or L2, approximated Dense softmax layers like NCE or Sampled Softmax). Thanks to its highly modular architecture, Embiggen node embedding models can be used as they are provided, or rearranged to be customized to the users' needs.

Embiggen is tightly linked with Ensmallen: indeed the input for the node embedding models (e.g. sampled random walks and skip-grams) are provided by Ensmallen in a scalable, highly efficient parallel way (Fig. 5) .

GloVe. The GloVe model learns to predict the logarithm of the normalized co-occurrence counts of nodes in windows sampled from random walks. This value can be interpreted as a proxy of the probability of an edge between two given nodes. Specifically, each node co-occurrence is additionally weighted by the distance between the contextual node and the node at the center of the window. The computation of the sparse co-occurrence matrix is executed within Ensmallen, allowing to sample massive quantities of random walks exploiting the techniques described in section 4.1.3.

SkipGram and CBOW. The SkipGram and CBOW models learn to predict respectively the contextual nodes from the central node and the central node from the contextual nodes. Nodes are defined as contextual when there exists a path between them with length less or equal to the given window size. In several implementations the output of SkipGram and CBOW uses the Softmax activation, which scales linearly with the number of nodes of the graph, thus making unfeasible the computational time requirements also for medium-sized graphs. To overcome this limitation GraPE implements the Noise-Contrastive Estimation approach (NCE) [57] and the Sampled Softmax [58] to improve SkipGram and CBOW execution efficiency without a significant decay in classification performance. The two approximation methods address this problem by evaluating for a given sample only the positive outputs (which are either two times the window size in the context of SkipGram or 1 in the context of CBOW) and a subset of randomly sampled negative outputs. To achieve a better approximation, Embiggen samples the negative output labels using a Zipfian distribution, which is a good approximation of most common real-world node-degree distributions [59] . By oversampling high degree nodes, which account for a disproportionate percentage of the exact Softmax loss, our implementation avoids falling into local minima wherein the node embedding high degree nodes tend to be represented close to all the other nodes. In this way, our implementation avoids learning that nodes with a high degree are neighbours of all nodes, which is a common problem affecting several node embedding models exploiting graph topology, such as SkipGram and CBOW themselves. By providing a better approximation of the loss function, the models reach convergence more rapidly (see Supplementary section S5 for more details).

The skip-gram samples for both the SkipGram and CBOW models are generated lazily from Ensmallen as described in section 4.1.4: this allows for training the models on massive amounts (billions) of random walk samples without requiring to pre-compute them, hence avoiding a steep memory peak.

The node embedding pipeline. To automate and make easier the generation of node and edge embeddings, the Embiggen module provides an automated pipeline that for any graph and embedding model available in the GraPE resource will provide:

• Detection and notification of errors in presence of GPU mis-configurations

• Creation of models for single or mirrored multi-GPUs training • Automated selection of optimized time/memory trade-offs specific for the requested model and graph features

• Identification and reporting of graph oddities relevant to the selected model (e.g., singleton nodes, singleton with self-loops, multiple connected component)

• Automated nodes sorting, according to the node-degree distribution, when necessary for the selected model

• Automatic cache handling (i.e. if the model was already trained with the same data and parameters, automatic loading of the embedding from cache)

Computation of edge embedding Embiggen allows computing the edge embeddings by applying different operators (e.g. concatenation, average, L1, L2 and Hadamard operators [6] ) to the node embeddings of the source and destination node. Creating a new custom edge embedding layer can also be achieved by extending the relative abstract class. The computation of edge embeddings is usually executed lazily for a given subset of the edges at a time since the amount of RAM that would be required to explicitly rasterize the edge embedding can be prohibitive on most systems, depending on the edge set cardinality of the considered graph.

Additional features All Embiggen models come equipped with optional node embedding regularization techniques meant to accelerate the convergence and quality of the embedding, such as Gradient centralization [60] , L1 and L2 normalization, batch normalization [61] , Dropout, and include instance support for a wide variety of optimizers (e.g. Adam and Nadam). Of note when embedding large graphs, also Mixed Precision may be used to trade numerical stability for a smaller model size. Additionally, all models are provided in specialized versions that support both single-GPU and multi-GPUs training. The optimal version for a given device parametrization is transparently instantiated.

Embiggen provides flexible and modular neural models trained on the embedded vectors and/or features associated to each node to solve any node label prediction problem in graphs.

Implementation of the models. Embiggen leverages the neural models and features available from the Keras library. The neural models provided by Embiggen can be trained using the embedded vectors generated through the node embedding models discussed in Section 4.2.1 ( Figure 5 ), or using also features associated to each node (e.g. domains in proteins or chemical characteristics for drugs), by leveraging the multi-modal neural architecture provided by Embiggen. Moreover, since recent advances in the graph representation learning literature showed that aggregating the neighbouring nodes embeddings can improve performances [62] , Embiggen implements this feature using element-wise weighted mean, based on normalized Laplacian or symmetric normalized Laplacian. We note that other learning machines, including, e.g. Random Forests, Support Vector Machines or Decision Trees implemented e.g. in the scikit-learn library, can be trained using the embeddings obtained with the Skip-Gram, CBOW and GloVe models described in Section 4.2.1.

The node-label prediction model used in section 2.5.2, is a bi-modal neural network that uses the node embeddings, computed with any of the aforementioned models as input of the first module, and the node feature vector as input for the second module (in our experiments the feature vector are one-hot encoded words for Cora and CiteSeer, and Term Frequency-Inverse Document Frequency vectors (TFIDF) for PubMed). This model uses the mean to aggregate both the embeddings and the features of the neighbouring nodes. The resulting aggregated embeddings and features undergo a "dropout" procedure to account for the sparsity of the one-hot encoded or TFIDF node features, and are finally concatenated and fed into a Feed-Forward Neural Network (FFNN) . 

Embiggen provides neural models implemented in the Keras library to predict edges using their vectorial representation obtained from the graph embedding models introduced in Section 4.2.1.

Implementation of the models. Embiggen can directly aggregates the node representations obtained with Skip-gram, CBOW and GloVe (Section 4.2.1) to obtain edge embeddings, using Concatenation, Average, Sum, Hadamard product, L 1 and L 2 norm, as well as other edge embedding methods that can be derived from the EdgeEmbedding layer Python class. The resulting embeddings can be directly used as input to train a FFNN for edge prediction problems, or can be used to construct the initial weights of the "embedding layer" of the neural model, in such a way that we can further refine their values through backpropagation. Indeed, according to this second approach the input of the neural model is simply the index of the edge to be predicted that extracts the corresponding vectorial embedding from the embedding layer and the output is the predicted existence of the edge in the graph. The embedding layer is shared for both the source and destination node of the edge, thus resembling in a certain way a a Siamese network [55] . To achieve better performance, more complex FFNN models can be employed and their hyper-parameters can be automatically tuned through techniques such as Bayesian Optimization. 

Data used for the edge prediction experiments are downloadable from the following Github site: https://github.com/xiangyue9607/BioNEV/tree/master/data.

Moreover, GraPE (https://github.com/AnacletoLAB/grape) allows the automatic retrieval of about 60000 graphs, among which all the graphs from the STRING repository [31] and several graphs used in state-ofthe-art works, among which all the graphs used in the proposed experiments.

The GraPE resource can be readily installed from PyPI in a Python environment with TensorFlow: https://pypi.org/project/grape/. Since properly configuring such an environment is non-trivial, we provide Docker containers [63] for both AMD ROCM GPUS and NVIDIA CUDA GPUs. 

The node/edge embedding problem has been approached multiple times in the literature. In this section we briefly recall renowned packages that are mostly related to GraPE, most of which have been used in our experiments as benchmarks for comparison.

NetworkX [1] is an extremely renowned Python language package for the exploration and analysis of networks. Even though this library does not provide fast first/second-order random walks, we includes it in our comparisons since it it used to handle the graph structure in the GraphEmbedding and Node2Vec packages mentioned below.

GraphEmbedding [2] is a Python package for embedding networks via random-walk based methods such as Node2Vec. The graph is loaded using NetworkX. Within this package, the random walks are executed using the alias method [3] (see section S15), whose complexity during the preprocessing phase hampers the computation of random-walks to large graphs.To our knowledge, GraphEmbedding handles only undirected homogeneous graphs.

Node2Vec is a Python language package for embedding networks via random-walk based methods such as Node2Vec [4] . The graph is loaded using NetworkX. As mentioned for GraphEmbedding, also Node2Vec makes is limited by the usage of the alias method to execute the random walks. To our knowledg Node2Vec handles only undirected homogeneous graphs.

iGraph [5] is a library collection for creating and manipulating graphs and analyzing networks. While it was originally written in C language, some implementations are presently available as Python and R packages. We didn't use this library in our comparisons because it does not currently implement fast parallel first or second-order random walks, nor it does support their easy implementation.

CSRGraph [6] is a library to execute fast first and second-order random walks using Numba [7] . Different from the previously mentioned libraries (e.g., GraphEmbedding or Node2Vec) CSRGraph doesn't store the graph into a NetworkX object, but instead exploits a compressed sparse row (CSR) matrix, which significantly reduces the memory requirements with respect to methods expliting NetworkX objects. Additionally, instead of using the aforementioned alias method, it computes the node probabilities and samples them lazily, as it is required. To our knowledge, this library only handles undirected homogeneous graphs.

PecanPy [8] is a library to execute fast first and second-order random walks based on a set of different solutions depending on the graph densities and node number. For graphs with less than 10000 nodes, PecanPy uses the alias-method, while for larger graphs it uses a strategy similar to CSR libray, where node probabilities and sampling are lazily computed and the CSR data structure is used to store the edges. Finally, for graphs with edge densities larger than 10% it switches the graph data structure to a dense adjacency matrix. To our knowledge, this library only handles undirected homogeneous graphs.

This section presents the main features of the graphs used in previous publications and used here to test Ensmallenvs. state-of-the-art graph libraries. 44 graphs come from Network Repository [9] , having a considerably different number of nodes and edges have been collected as benchmark for comparison; the other graphs are GiantTN provided by Zenodo, Homo Sapiens being provided by STRING and KGCOVID19 provided by KGHub. Their main features are summarized in Table S1 . [23] , available at https://github.com/xiangyue$9607$/BioNEV/tree/master/data. The following graph reports are generated automatically using Ensmallen reporting utility.

The undirected graph CTD DDA has 12765 nodes and 92813 unweighted edges, of which none are self-loops. The graph is sparse as it has a density of 0.00114 and has 20 connected components, where the component with most nodes has 12724 nodes and the component with the least nodes has 2 nodes. The graph median node degree is 3, the mean node degree is 14.54 and the node degree mode is 1. The top 5 most central nodes are D056486 (degree 1217), D012640 (degree 1098), D006973 (degree 838), D009336 (degree 831) and D007674 (degree 705).

The undirected graph DrugBank DDI has 2191 nodes and 242027 unweighted edges, of which none are selfloops. The graph is quite dense as it has a density of 0.10088 and has 2 connected components, where the component with most nodes has 2189 nodes and the component with the least nodes has 2 nodes. The graph median node degree is 155, the mean node degree is 220.93 and the node degree mode is 11. The top 5 most central nodes are DB01174 (degree 1034), DB00564 (degree 1028), DB00252 (degree 992), DB00176 (degree 963) and DB00794 (degree 952).

The undirected graph NDFRT DDA has 13545 nodes and 56515 unweighted edges, of which none are selfloops. The graph is quite sparse as it has a density of 0.00062 and has 85 connected components, where the component with most nodes has 13033 nodes and the component with the least nodes has 2 nodes. The graph median node degree is 3, the mean node degree is 8.34 and the node degree mode is 1. The top 5 most central nodes are C0030193 (degree 845), C0004623 (degree 741), C0004096 (degree 653), C0038160 (degree 575) and C0020538 (degree 534).

The undirected graph STRINGPPI has 15131 nodes and 359776 unweighted edges, of which none are selfloops. The graph is sparse as it has a density of 0.00314 and has 87 connected components, where the component with most nodes has 14932 nodes and the component with the least nodes has 2 nodes. The graph median node degree is 14, the mean node degree is 47.55 and the node degree mode is 1. 

For the node-label prediction experiments the Citeseer [24, 25] , the Cora [24, 25] , and the Pubmed [26] graphs have been used. To describe them in the following we copied the graph reports of mid-length computed by Ensmallen.

The undirected graph CiteSeer has 3327 heterogenous nodes and 4676 edges. The RAM requirements for the nodes and edges data structures are 658.26KB and 14.72KB respectively.

Degree centrality The minimum node degree is 1, the maximum node degree is 99, the mode degree is 1, the mean degree is 2.77 and the node degree median is 2.

The nodes with highest degree centrality are: brin98anatomy (degree 99 and node type IR), rao95bdi (degree 51 and node type Agents), chakrabarti98automatic (degree 35 and node type IR), bharat98improved (degree 34 and node type IR) and lawrence98searching (degree 30 and node type IR).

Disconnected nodes Disconnected nodes are nodes that are not connected to any other node.The graph contains 48 disconnected nodes.

Singleton nodes with selfloops Singleton nodes with selfloops are nodes with no edge to other nodes and have exclusively selfloops. The graph contains 48 singleton nodes with selfloops, which are 408356 (degree 1 and node type DB), eiter98firstorder (degree 1 and node type AI), 156949 (degree 1 and node type DB), bruno01stholes (degree 1 and node type DB) and kumar01behaviorbased (degree 1 and node type ML), plus other 43 singleton nodes with selfloops.

Node types The graph has 7 node types, of which the 5 most common are DB (701 nodes, 21.07%), IR (668 nodes, 20.08%), Agents (596 nodes, 17.91%), ML (590 nodes, 17.73%) and HCI (508 nodes, 15.27%). The RAM requirements for the node types data structure is 167.29KB.

The undirected graph Cora has 2708 heterogenous nodes and 5278 edges. The RAM requirements for the nodes and edges data structures are 399.52KB and 16.26KB respectively.

Degree centrality The minimum node degree is 1, the maximum node degree is 168, the mode degree is 2, the mean degree is 3.90 and the node degree median is 3. The nodes with highest degree centrality are: 35 (degree 168 and node type Genetic Algorithms), 6213 (degree 78 and node type Reinforcement Learning), 1365 (degree 74 and node type Neural Networks), 3229 (degree 65 and node type Neural Networks) and 910 (degree 44 and node type Neural Networks).

Node types The graph has 7 node types, of which the 5 most common are Neural Networks (818 nodes, 30.21%), Probabilistic Methods (426 nodes, 15.73%), Genetic Algorithms (418 nodes, 15.44%), Theory (351 nodes, 12.96%) and Case Based (298 nodes, 11.00%). The RAM requirements for the node types data structure is 136.99KB.

The undirected graph Pubmed has 19717 heterogenous nodes and 44327 edges. The RAM requirements for the nodes and edges data structures are 3.27MB and 165.38KB respectively.

Degree centrality The minimum node degree is 1, the maximum node degree is 171, the mode degree is 1, the mean degree is 4.50 and the node degree median is 2. The nodes with highest degree centrality are: 9742976 (degree 171 and node type Diabetes Mellitus Type 2), 8366922 (degree 154 and node type Diabetes Mellitus Type 1), 11832527 (degree 131 and node type Diabetes Mellitus Type 2), 19479186 (degree 130 and node type Diabetes Mellitus Type 2) and 18776148 (degree 125 and node type Diabetes Mellitus Type 1).

Node types The graph has 3 node types, which are Diabetes Mellitus Type 1 (7875 nodes, 39.94%), Diabetes Mellitus Type 2 (7739 nodes, 39.25%) and Diabetes Mellitus, Experimental (4103 nodes, 20.81%). The RAM requirements for the node types data structure is 986.88KB. Table S3 : Brief description of the graphs used during the node-label prediction tasks, namely Cora, CiteSeer and PubMed.

Graph description Cora [24] 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. CiteSeer [24] 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. PubMed [26] 19717 scientific publications from the PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

GraPE allows the automatic retrieval of all the graphs included into the STRING repository [10] (56691 graphs), KGHub [12] , Monarch Initiative [27] , Linqs [24] , PheKnowLator [28] , and over 1000 graphs from Network Repository [9] .

After importing an exhaustive graph automatic graph report can be generated with Ensmallen. As an example, in Figure S1 we show the graph report produced for Monarch.

8 Figure S1 : Exhaustive graph-report of Monarch Initiative [27] as automatically produced by Ensmallen.

Ensmallen allows fast graph loading and processing of graph queries by using an optimized graph representation. In particular, after assigning a numeric identifier to each node of the graph, the adjacency matrix is transformed into a sorted list of integers by a bijective mapping that exploits the numeric representation of the nodes (see subsection S4.1.1), which is then represented in memory by using

Elias-Fano scheme [29, 30] . Such representation allows storing n non-negative integers, sorted in increasing orders and bounded by u, with at most EF(n, u) = 2n+n log 2 u n bits, and a memory usage that is less than 9 half a bit away [31, 29, 32] from the succinct bound that is Z + o(Z), where Z is the theoretical minimum number of bits needed to store the data: Z = log 2 u n = n log 2 u n + O(n) [33] . We support the description of Elias-Fano schema with an example, depicted in Figure S2 of Elias-Fano [29] quasi-succint representation of the monotone list of integers [5, 8, 8, 15, 32] .

In particular, the binary representation of the i-th value of the list, b i , i = 1, . . . , n, is initially split into two parts: a low-bits part, l i , containing the lower log 2 u n bits, and an high-bits part, h i , containing the remaining bits (referred to as high-bits). Then, a low-bits array, (light-blue and referred to as L in figure  S2 ) is formed by sequentially concatenating the explicit copies of the low-bits parts, while an high-bits array (red, and named H in figure S2 ) is composed by sequentially concatenating the gaps (differences) between consecutive high-bits parts (where the preceding value of the first element is assumed to be equal to zero), where each gap is represented by using the inverted unary representation, where an integer value equal to k is encoded by using k zeros followed by a one (for example, 0 in inverted unary representation is 1, 3 is 0001). Figure S2 : Example of Elias-Fano. Elias-Fano splits the sorted values into high and low bits; the low-bits parts are then consecutively copied in the low-bits array, L (on the right, blue color), while the high-bits parts are coded into an high-bits array H (on the left, red color) by consecutively storing the gaps between consecutive high-bits parts, encoded by using the inverted unary representation.

Instead of computing the gaps, Pibiri et al. [34] propose a faster method to build an high bits array H which is equal to that obtained by Elias Fano. Specifically, Pibiri's et al. show that H can be created as a bit-vector with all zeros, exept for the elements at indexes h i + i (where i indexes the high-bits parts, h i ), which are set to 1. This method is faster because the encoding and decoding of each value no longer depends on the previous values (as gaps would) so that the representation may be built in parallel to further speed up the graph representation. Moreover, once the index of each value is known, exploiting atomic integers we can build Elias-Fano fully in parallel 1 without any locks. An example of this method is in Figure S3 , where the high bits index is no-longer computed using the gaps. Figure S3 : Algorithm for Elias-Fano encoding presented in [34] . While the low-bits array (on the right, blue color) is simply composed by sequentially concatenating the l i s for each i = 1, . . . , n − 1, the high-bits array is composed by setting the bit at index h i + i to 1.

The two fundamental operations to perform on Elias-Fano are Jacobson's rank and select.

In particular, given a set of integers S, Jacobson defined the rank and select operations as follows [35] :

returns the i-th smallest value in S rank(S, m) returns the number of elements in S less or equal than m

To speed up computation, we deviate from this definition by implementing a rank operation that extracts the number of elements strictly lower than m. The Rust implementations of these operations on our Elias-Fano representation are presented in Figures S5 and S4 .

/// Find the index-th smallest value fn select(&self, index: u64) -> u64 { // find the index of the high-th one in the bit-vector and subtract the index // to obtain the number of zeros before the index-th one. let high = self.high_bits.find_one_of_index(index) -index; // get the lower bits of the value let low = self.read_lowbits(index); // merge the high and low bits (high << self.low_bits_size) | low } Figure S4 : Rust implementation of a simplified select operation: in this implementation we omitted the logic needed to handle all the corner-cases. The full implementation can be found at https://github.com/zommiommy/elias_fano_rust/ blob/master/src/elias_fano.rs /// Get how many elements strictly smaller than value exists fn rank(&self, value: u64) -> u64 { // split the value into its higher and lower bits let high = value >> self.low_bits_size; let low = value & self.low_bits_mask;

// find the index of the high-th zero in the bit-vector let mut index = self.high_bits.find_zero_of_index(high); // start scanning the lower bits to find the first element bigger or equal than value while self.high_bits[index] == 1 && self.read_lowbits(index -high) < low { index += 1; } // the number of elements is equivalent to the index of the value in the high-bits // minus the higher bits because the count is the number of ones before index and // high is the number of zeros before index. index -high } Figure S5 : Rust implementation of a simplified rank operation: in this implementation we omitted the logic needed to handle all the corner-cases. The full implementation can be found at https://github.com/zommiommy/elias_fano_rust/ blob/master/src/elias_fano.rs

The complexity of the select mainly depends on the implementation of find one of index since all the other operations take constant time. Similarly, the complexity of the rank depends on the implementation of find zero of index. Therefore, to have rank and select in constant time we need to obtain a constant computational time for both find one of index and find zero of index. Both functions need to find the i-th one/zero in a bit-array, so that the naive way to solve the problem would be to scan the array from the start and, at each step, count how many values v ∈ {0, 1} have been encountered so far. This algorithm scales linearly with the length of the bit-vector and thus is not practical for large arrays.

A simple solution to improve the linear scan is to get a starting point that is closer to the result. To do so, given a bit-array to scan, we choose a quantum q and store the position of every q-th value v ∈ {0, 1} into an auxiliary array O = [o 0 , o 1 , ...].

Therefore, to find the i-th value v in Elias-Fano's high-bits array, the linear scan will start from position o k , where k = i/q , and it will have to scan at most until the next position o k+1 . Considering that Elias-Fano's high-bits array contains approximately half uniformly distributed ones and half uniformly distributed zeros, the average distance between two consecutive values v ∈ {0, 1} is equal to 2 bits, which implies that the average distance between two consecutive positions o k and o k+1 , that is the maximum number of bits to scan for searching a the i-th value v, is E[o k+1 − o k ] = 2q. Therefore, if the high-bits array has n bits, by using the auxiliary vector O, the average time complexity in the worst case is reduced from

Of note, we can further speed up the linear can by computing the number of vs in a word of memory (64-bits) using the popcnt instruction, thus allowing us to skip 64 bits each time.

Finally, if the CPU supports the BMI2 instruction set, we can use the instructions pdep and tzcnt to find the wanted value v in a word in constant time (5 cycles) 2 .

For what regards the memory complexity of the proposed index, we need on average O n 2q integers for storing the positions of the ones (or of the zeros), and each integer position needs at most log 2 n bits to be stored. Therefore, each of the two indexes (one index for the ones to speed the select, and one index for the zeros to speed the rank) cause a memory occupation, in the worst case, of O n 2q log 2 n bits, which resolves to a total worst-case memory occupation of O n q log 2 n bits. Note that while indices with memory complexity of o(n) exist, a careful implementation allows the use of a relatively high value for q, which practically results in low overhead.

In particular to store KG Covid, which has around 450'000 nodes and 32'000'000 edges, Elias-Fano uses 56Mib to store the adjacency matrix, of which 0.6 Mib are for the 1s and 0s indices. Therefore, the overhead ratio of the indices is 1.07% when compared to the size of the whole structure. On this KG Covid we are able to perform ranks in 50ns and selects in 118 ns on a Ryzen 9 3900x. 

In this subsection we describe how Ensmallen converts all the edges of a graph G(V, E) into a sorted list of integers. In particular, considering an edge e = (v, x) ∈ E connecting nodes v and x represented, respectively with integers a and b, the binary representation of a and b are concatenated through the function φ k (a, b) to generate an integer index uniquely representing the edge e itself:

This implementation is particularly fast because it requires only few bit-wise instructions:

where << is the left bit-shift, | is the bit-wise OR and & is the bit-wise AND. Since the encoding uses 2k bits, it has the best performances when it fits into a CPU word, which is usually 64-bits on modern computers, meaning that the graph must have less than 2 32 nodes and and less than 2 64 edges. However, by using multi-word integers it can be easily extended to even larger graphs [36] .

As an example, considering a graph with at most 8 nodes, encoded with integers numbers (v ∈ [0, ..., 7]) In figure S7 we schematize the encoding of the edge (2, 6) which has 2 as source node, and 6 as destination node. On the right we report the Rust implementation of the edge encoding and decoding. Once the edges are encoded, we can sort them and use Elias-Fano to store them.

While computing a second-order random walk, Leskovec et al. [37] , when the walk has stepped from node t to node v, where it now resides, the next step is guided by the un-normalized transition probability π vx of moving from v to x. π vx is a function of the weight w vx of the edge (v, x), and the search bias α pq (t, x):

In figure S8 the node2vec schema for defining the search bias is schematized. Figure S8 : Illustration of the random walk procedure in node2vec. The walk just transitioned from t to v and is now evaluating on which node to step next. Edge labels indicate search biases α. The nodes in blue are at distance 2 from t, so that the edge connecting them to v has α equal to the explore (in-out) bias; red nodes, at distance 1 from t, are connected to v by and edge with α equal to the return bias.

As explained in section 4.1.4 the in-out bias can be re-formulated to allow an efficient implementation: starting from an edge (t, v) we need to compute β q (t, x) for each x ∈ N (v), where N (v) is the set of nodes adjacent to v including node v itself.

This formulation (figure S9c) allows us to compute in batch the set of nodes X β affected by the in-out parameter q:

Therefore the selection of the nodes X β affected by β q can be reduced to computing the difference of the two sets N (v) \ N (t) (N (t) is the neighborhood of t already computed at the preceding step.

X β is efficiently computed by using a SIMD algorithm implemented in assembly, leveraging AVX2 instructions that work on node-set representations as sorted vectors of the indices of the nodes (see figure S10 ). The algorithm is adapted from Lemire's et al. [38] SIMD algorithm for set intersection, which similarly works on sets represented as sorted arrays. Depending on the values of p, q and on the type of the graph (weighted or unweighted), Ensmallen provides eight different specialized implementation of the Node2Vec algorithm, detailed in table S5. This trick allows to significantly speed-up the computation, as showed by the empirical computational times reported in table S6, which were obtained by each of the specialized algorithms when computing 1000 random walks of length 100 on the KGCovid19 Graph, using an AMD Ryzen 9 3900x processor. Figure S10: The complete implementation of the 8 way interleaved AVX2 xorshift. For compactness, the code is divided in two columns, the intended order is from left to right that uses the vectorized xorshift described in section S4.2.1.

Many of the algorithms inside of Ensmallen, e.g. the sampling of destination nodes during the random walk, or the generation of random negative edges for Skipgram [4] model, rely on the generation of random numbers. Therefore, a random number generator algorithm could be a bottleneck if not efficiently implemented. To guarantee efficiency, in Ensmallen we use the two following fast pseudo-random number generators, Vigna's Xoshiro256+ [39] and the Marsaglia's Xorshift [40] .

Xoshiro256+ has a shallower dependencies chain than Xorshift, which results in lower latency, while Xorshift has fewer instruction than Xoshiro256+, so that it can be implemented to achieve higher throughput. Therefore, Xoshiro256+ is faster in the generation of a single random value, e.g. during the sampling of the destination node in a step of the random walk. On the other side, Xorshift is faster in the generation of multiple random numbers, e.g. when generating the random negative edges for Skipgram.

Modern CPUs allow to execute up to 4 different independent instructions in the same cycle and can have eight or more memory requests running at a time, so a common way to exploit the super-scalarity of modern CPUs and out-of-order execution is to interleave different instances of the same algorithm. This reduces the importance of the depth of dependency chain of an algorithm and favors the number of instructions.

For these reasons, we interleave eight different instances of Xorshift and, to further improve its throughput, we implemented a vectorized version which exploits Intel's AVX2 instruction sets to execute 4 instances in parallel. An non-interleaved example of the vectorized Xorshift is illustrated in Fig. S11 With the aforementioned implementation, we achieve a throughput of 256 random bytes (32 64-bits integers) at the cost of 4 concurrent cache misses (which are adjacent so the values should already be in the L1 cache). The complete implementation is available in figure S10 and achieves a throughput of more than 20 times higher than standard methods as shown in table S7. 

The cumulative sum, also called prefix sum, is one of the most known examples in the field of parallel computing; in Ensmallen it is fundamental during the computation of the random walks, to compute the un-normalized cumulative sum of the un-normalized transition probabilities to each node (see Section 4.1.4 ).

To obtain efficient computations, the prefix sum algorithm implemented in Ensmallen exploits the Streaming SIMD Extensions (SSE 128-bit) instruction set, which is an extension of the SIMD instruction set for the x86 64 architecture. CPUs with SSE support have a special set of 8 128-bits registers, on which vector operations may be executed, that operate in parallel on 4 floats, therefore increasing the throughput by 4 times.

The algorithm is based on just two SSE instructions: vaddps (Fig. S12a) , which provides an element-wise sum of two registers, each containing 4 32-bits floats, and vpalignr (Fig. S12b) , which concatenate two registers into a temporary register, shifts the result by a chosen number of bytes and then returns the lower 128-bits. Figure S14: Inner core of the SIMD prefix sum algorithm This code assumes that in in the registers from xmm0 to xmm5 are loaded with the data and that xmm6 is zero filled. The result will, also, be in the registers from xmm0 to xmm5. We need 13 xmm registers but the SSE standard only provide 8 of them, luckily the AVX expansion increase this number to 16, thus making this algorithm possible.

The generalization of the algorithm for wider instruction sets such as AVX2 (256-bit) or AVX512 (512-bit) is limited by the lack of multi-lane logical shifts (the vpalignr instruction) which can be avoided by using opportunely shuffle and permutation instructions which introduces additional latency so the throughput don't scales linearly.

While this does not improve the complexity of the algorithm it is almost ten times faster than the naive implementation and five time faster than the unrolled version.

Loop unrolling [41] is an optimization technique for tight loops which increase the number of steps executed for each cycle, this reduces the overhead for the loop, allows to better exploit the CPU super-scalarity and 23 reduces the number of branches. This technique increase the size of the function and thus if abused might degrade performances due to cache missing.

The alias method [3] efficiently samples k integers from a discrete probability distribution. The algorithm is often used since it requires O(n) steps in the pre-processing phase (when Vose's algorithm is used [42] ) and O(1) steps for each point sampling. The algorithm is sketched in figure S15 , using a discrete probability distribution [p 0 , ..., p n−1 ] where each p i is the probability of sampling an integer i ∈ [0, ..., n − 1] (n = 4 in figure S15 ).

In the figure, the heights of the bars correspond to [p 0 , ..., p n−1 ] and their width is also assumed to be one, so that the total area of the histogram is 1. Point sampling from the depicted distribution would require to draw a random value 0 ≤ u ≤ 1 and then find the segment (from p 0 to p 1 , from p 1 to p 2 , and so on) where u falls. This check could be performed by the following comparisons: if u < p 0 then pick 0; otherwise, if p 0 ≤ u < p 0 + p 1 then sample 1; otherwise, if p 0 + p 1 ≤ u < p 0 + p 1 + p 2 then sample 2; otherwise if u ≥ p 0 + p 1 + p 2 then sample 3. This would be computationally expensive since each sampling would require, in the worst case, n − 1 comparisons.

To reduce the computational complexity, a more efficient algorithm could be designed that draws 2D points falling in the gray rectangle shown in figure S15 -b, where each bar represent the probability of sampling of one integer value according to the discrete probability distribution, or, in other words, each bar represents a sampling event. Therefore the algorithm needs only to check whether the drawn point falls in any of the bars. To this aim, an iterative process could be used that exploits only an indexed array with the discrete probabilities P vec = [p 0 , ..., p n−1 ] of each integer. Each iteration should draw two values: an index 0 ≤ s ≤ n, that selects one of the elements of the array P vec , and a random value 0 ≤ u ≤ 1. The drawing continues until the drawn s and u are such that u ≤ P vec [s], which leads to sampling s. In figure S15 -b the index s = 1 and the value (corresponding to the gray dot) u ≥ P vec [1] correspond to a miss (the gray dot) that does not allow selecting 1 as the drawn value; conversely, the green dot corresponds to a drawn index s = 3 and a drawn u ≤ P vec [3] , which allows selecting the value 3 as the sampled value.

However, as shown in figure S15 -b, usage the aforementioned method has the disadvantage of having an "empty space" that would cause repeated iterations, therefore increasing the computational time. To guarantee success at each draws, in [3] authors propose composing a rectangle where the probabilities ("rectangles" in the figure) are opportunely rearranged by splitting them, so as no empty spaces are left and each column contains at (figure S15-c). In this way, if each draw of a 2D point would always fall into a decision region, therefore bringing to a decision.

To this aim, the alias method applies a pre-processing phase, generally performed through Vose's algorithm [42] that, given the range [0−max] among which to sample, and the number n of integers to sample, computes two indexed vectors: the separating probability vector P sep = [p sep 0 , ..., p sep n−1 ] , and the "other class" (also called alias) vector OC = [oc 0 , ..., oc n−1 ].

The algorithm then draws a unique point 0 ≤ x ≤ 1, from which it then computes an integer s = nx , uniformly distributed in 0, 2, . . . , n − 1, and a value u = nx + 1 − s uniformly distributed on [0, 1). If u ≤ P sep [s] the alias method samples s, otherwise it samples OC[s].

Though efficient in the sampling process, when the range from which to sample becomes high, the preprocessing phase for computing the separating probabilities and the alias vector is impractical. Therefore, in Ensmallen we have implemented the SUSS algorithm, described in S4.2.4. Figure S15 : The Alias method. Left: a discrete probability distribution. Center: a quasi-efficient sampling method containing empty spaces. Right: the alias method.

The Sorted Unique Sub-Sampling algorithm (SUSS) has been designed in Ensmallento allow sub-sampling k unique sorted integers among n integers, by following an approximate uniform distribution. In practice, after splitting the range [0, . . . , n − 1] into k equal segments (buckets) with length delta/k , SUSS samples an integer value from each bucket by using the Xorshift random number generator, which allows a particularly efficient implementation (any arbitrary random number generator would work, though).

The implementation of the algorithm is reported in Algorithm 1. 

extracted.push(min val + step * i + s % step); s ← xorshift(s); end extracted.push(max value − s % (delta − step * (k − 1)) − 1); return extracted

To establish whether the distribution of the integers sampled with SUSS is truly approximating an uniform distribution, we sampled n = 10.000.000 integers over [0, ..., 10.000], by using both SUSS and by drawing from a uniform distribution in [0, ..., 10.000]. We then used the Wilcoxon signed-rank test to compare the frequencies of the obtained indices and we obtained a p-value of 0.9428, meaning that there is not a statistically significant difference among the two distributions. Therefore, by using a time complexity Θ(k) and a spatial complexity Θ(k) SUSS produces reliable approxi-25 mations of a uniform distribution.

For properly measuring the peak memory usage and the time requirements of both Ensmallen and the stateof-the-art libraries we use as benchmark comparison, we created an additional thread for logging purposes. We have measured the used memory by reading bash/proc/meminfo, which makes available five different metrics:

MemTotal RAM installed on the system MemFree RAM not used Buffers RAM used for I/O buffers such as disks and sockets, typically less than 20Mb

Cached RAM used to save dirty pages and for ramdisks Slab RAM used by the kernel we have therefore defined the memory in use as:

We executed Ensmallen and all the benchmarks on a dedicated server with no significant running process, except the sshd service we use to connect to it.

Anyhow, to obtain a truthful evaluation of the memory usage due to the execution of a specific task, we have logged the average memory usage before the task starts, and we have subtracted such value from the memory usage we measure during the task execution.

There are more accurate methods such as jemalloc [43] , Valgrind [44] , hooking malloc using bash malloc hook or using bash LD PRELOAD to hook the malloc function, but this method is precise enough since we do not need to measure the difference in bytes, but we care about significant differences.

We designed the tracker to have a linearly increasing delay between measurements because we have to measure tasks that might take from few microseconds to hours: long-running tasks would otherwise log too much data and start using Gigabytes of RAM.

To further reduce this problem, we log the values in a constant size buffer, and when either the task finishes or the delay between measurement is significantly longer than the time necessary to write the log to disk, we dump the log to a file.

One of the Ensmallen utilities regards the automatic generation of graph reports, which describe the content of the graph. In particular, three types of graph reports are available. The first graph-report shortly reports base graph characteristics (e.g., number of nodes, number of edges, number of self loops, average node degree, and so on). Example of such reports have been inserted in sections S1 S2 and S3. The second (mid-length) graph report adds some more information like a list of nodes with highest degree centrality, the number of connected components, and a report on the node types and their distribution. An example of such graph reports can be seen in section S3.2. The third graph-report exhaustively reports all the main graph characteristics, adding statistics on the edge types too. An example of the computed graph reports has been inserted in section S3.2.4, Figure S1 .

Skip-gram and CBOW algorithms use the SoftMax function for their predictions. Nevertheless the computational cost of the evaluation of this function is proportional to the number of output classes |Y |, and can therefore become prohibitive when |Y | is very large, as it often happens in graph representation learning problems.

A possible solution is represented by candidate sampling methods, where for the each training sample we only need to evaluate the output of the model for the positive output classes Y p ⊆ Y (typically |Y p | << |Y |), plus a random sample of negative output classes Y n ⊆ Y \ Y p .

The schema chosen to sample the negative classes Y n should best approximate the original exhaustive training function. In most practical graph learning scenarios, the node degree distribution follows a Zipfian distribution. It follows that, to best approximate the Softmax loss adopted in the SkipGram and CBOW models, the negative classes Y n are sampled following a Zipfian distribution (also referred to as log-uniform).

Using a better approximation of the Softmax function allows in turn for a better approximation of the gradient, which ultimately leads to a faster convergence rate. As an example, figure S16 shows the different values of the Noise Contrastive Estimation (NCE) loss [45] obtained during the node embedding computation using a SkipGram model of the KGCOVID19 graph; the loss computed using the Zipfian sampling (in blue) decreases more rapidly and converges (i.e. the early stopping kicks in) in 7 epochs, while the loss using uniform sampling (in orange) requires 11 epochs and decreases less rapidly. Figure S16 : Comparison of the use of Zipfian and uniform distributions to sample the negative nodes in the SkipGram model. The graph used for both the experiments is KGCOVID19 (˜500k nodes,˜30M edges), the random walk used to sample the contextual nodes (the positive classes) is a first order random walk with length 128, with a single iteration of random walk per node. The window size used is 4. The loss computed using the Zipfian sampling (blue) decreases more rapidly and converges (i.e. the early stopping kicks in) in 7 epochs, while the loss obtained when using uniform sampling (orange) requires 11 epochs and decreases less rapidly.

In this section we report details about the Embiggen models, their hyper-parameters and results obtained in the edge and node label prediction experiments.

For all the graphs all the tested models used a walk length of 128, embedding size of 100, window size of 4, 20 iterations, learning rate of 0.01 with Nadam optimizer, a maximum number of epochs equal to 100, an early stopping on the training loss with patience of 5 epochs.

The parameters relative to the embedding specific for each graph are specified in table S8 and those specific to each node embedding model are reported in table S9. In table S9 negative samples refers to the number of samples extracted using the NCE loss [45] in SkipGram and CBOW models, alpha is the α exponent parameter used in the GloVe weighting function [46] and min delta is the minimum loss improvement over 5 epochs before triggering the aforementioned early-stopping mechanism. Note that the loss functions used by GloVe and the SkipGram/CBOW models are different and tend to be, in average, considerably different in magnitude: SkipGram and CBOW losses tend to be several orders of magnitude higher, thus resulting in significantly different early stopping minimum deltas.

For all the graphs the FFNNs concatenate the node embeddings for obtaining the edge embeddings. FFNNs have a batch size of 262144 and 16384 batches per epoch. Moreover it uses L1 + L2 regularization for both kernel and bias with coefficients 10 −5 and 10 −4 respectively. 

In table S10 we show the results obtained when applying the link prediction models implemented in GraPE to the graphs used in [23] . 

For all the graphs all the tested models used a walk length of 128, embedding size of 100, window size of 4, 20 iterations (number of random walks starting from each node), learning rate of 0.01 with Nadam optimizer, a maximum number of epochs equal to 100, an early stopping on the training loss with patience of 5 epochs. All random walks are second order random walks with the inverse in-out weight q = 0.5 and return weight p = 2. The negative samples number for the CBOW and SkipGram models are fixed at 100.

In table S11 we show the results obtained when applying the node-label prediction models. 

Network representation learning: A survey

Representation learning on graphs: Methods and applications

A survey of link prediction in complex networks

Inductive matrix completion for predicting gene-disease associations

Deepwalk: Online learning of social representations

node2vec: Scalable feature learning for networks

Line: Large-scale information network embedding

Learning graph representation with generative adversarial nets

Graph embedding on biomedical networks: methods, applications and evaluations

The igraph software package for complex network research

Graphlab: A new framework for parallel machine learning

GraphX: Graph processing in a distributed dataflow framework

Snap: A general-purpose network analysis and graph-mining library

Pecanpy: a fast, efficient, and parallelized python implementation of node2vec

Deep graph library: A graph-centric, highly-performant package for graph neural networks

Fast graph representation learning with PyTorch Geometric

Graph neural networks in tensorflow and keras with spektral

PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings

Efficient storage and retrieval by content and address of static files

Why scientists are turning to Rust

Distributed representations of words and phrases and their compositionality

Efficient estimation of word representations in vector space

Glove: Global vectors for word representation

Parallel algorithms for evaluating centrality indices in real-world networks

Exploring network structure, dynamics, and function using networkx

The network data repository with interactive graph analytics and visualization

Biogrid: a general repository for interaction datasets

Wormnet v3: a network-assisted hypothesis-generating server for caenorhabditis elegans

Kg-covid-19: a framework to produce customized knowledge graphs for covid-19 response

Numba: A llvm-based python jit compiler

Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks

Bayesian optimization for automated model selection

Het-node2vec: second order random walk sampling for heterogeneous multigraphs embedding

Iterative methods for sparse linear systems

Space-efficient static trees and graphs

Fast, small, simple rank/select on bitmaps

Dynamic elias-fano representation

Memory-aware framework for efficient second-order random walk on large graphs

Siamese neural networks: An overview

Elements of reusable object-oriented software

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

On using very large target vocabulary for neural machine translation

Univariate Discrete Distributions

Gradient centralization: A new optimization technique for deep neural networks

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Graph representation learning

An introduction to Docker for reproducible research

Exploring network structure, dynamics, and function using networkx

On the alias method for generating random variables from a discrete distribution

Distributed representations of words and phrases and their compositionality

The igraph software package for complex network research

Numba: A llvm-based python jit compiler

Pecanpy: a fast, efficient, and parallelized python implementation of node2vec

The network data repository with interactive graph analytics and visualization

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets

Growth of the flickr social network

Kg-covid-19: a framework to produce customized knowledge graphs for covid-19 response

Exploratory social network analysis with Pajek

Bigbrain: An ultrahigh-resolution 3d human brain model

Wormnet v3: a network-assisted hypothesis-generating server for caenorhabditis elegans

Community identification using extremal optimization phys

The human disease network

Global alignment of multiple protein interaction networks with application to functional orthology detection

Biogrid: a general repository for interaction datasets

How to infer gene networks from expression profiles

Lethality and centrality in protein networks

Supervised learning is an accurate method for network-based gene classification

Graph embedding on biomedical networks: methods, applications and evaluations

Link-based classification

Collective classification in network data

Query-driven active surveying for collective classification

The monarch initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species

Efficient storage and retrieval by content and address of static files

Quasi-succinct indices

On binary representations of monotone sequences

Universal codeword sets and representations of the integers

Dynamic elias-fano representation

Dynamic Elias-Fano Encoding

Space-efficient static trees and graphs

Seminumerical algorithms. The art of computer programming

node2vec: Scalable feature learning for networks

Simd compression and the intersection of sorted integers. Software: Practice and Experience

Scrambled linear pseudorandom number generators

Xorshift rngs

Generalized loop-unrolling: a method for program speedup

A linear algorithm for generating random numbers with a given distribution

A scalable concurrent malloc (3) implementation for freebsd

Valgrind: a framework for heavyweight dynamic binary instrumentation

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

Glove: Global vectors for word representation