title: On node ranking in graphs
authors: Dudkina, Ekaterina; Bin, Michelangelo; Breen, Jane; Crisostomi, Emanuele; Ferraro, Pietro; Kirkland, Steve; Marecek, Jakub; Murray-Smith, Roderick; Parisini, Thomas; Stone, Lewi; Yilmaz, Serife; Shorten, Robert
date: 2021-07-20

The ranking of nodes in a network according to their "importance" is a classic problem that has attracted the interest of different scientific communities over the last decades. The current COVID-19 pandemic has recently rejuvenated interest in this problem, as it is related to the selection of which individuals should be tested in a population of asymptomatic individuals, or of which individuals should be vaccinated first. Motivated by the COVID-19 spreading dynamics, in this paper we review the most popular methods for node ranking in undirected unweighted graphs, and compare their performance on a realistic benchmark network that takes into account the community-based structure of society. Also, we generalize a classic benchmark network, originally proposed by Newman for ranking nodes in unweighted graphs, to show how the ranks change in the weighted case.

The recent outbreak of COVID-19 and the various attempts to find effective non-pharmaceutical interventions to mitigate the impact of this virus have highlighted both the potential value of network science and the pressing need to further understand complex interaction networks, in order to support the development of efficient machine learning techniques for graphs. Several diverse scientific communities have already developed different techniques to better analyze and understand complex networks:

• In the study of electrical grids, a graph with its vertices (system components) and links (interactions or dependencies among components) can, for example, represent a power system. Analyses of this grid using graph theory can assess the vulnerability of grid components during cascade failure (a model called extended betweenness combines network structure with electrical characteristics of the power grid [1]) or help to integrate renewable energy sources (for example, networks with the small-world property might be more resistant to small variations in load and generation [2]).
• In neuroscience, networks may represent, for example, connections between neurons, and help to understand the architecture and development of the brain. In this context, the detection of communities and bridges helps to "separate functionally related neural elements" and to study the flow of neural signals and information [3].
• Another common application of graph theory arises in the context of social networks, where several applications have been developed, including that of understanding the propagation of epidemics over graphs [4].

There are many other research areas in which the application of graph theory has played an important role. Despite such great interest from different communities, COVID-19 has illustrated both the importance and the enormous potential of graph theory, but also its limitations in the eyes of policy makers. For example, as we write this note, there are more than 7000 results for the combination of "COVID-19" and "graph theory" on Google Scholar. Yet few of these results, to the best of our knowledge, have actually influenced policy in response to COVID-19.
This is despite the fact that all networks arising in studies prior to the pandemic share many important features with networks in which disease propagates. All are "driven by common organizing principles" and obey fundamental laws [5], and in fact common mathematical methods can be applied to them. A natural question arises in this context: why is this the case?

Answering this question is not trivial. One feature of COVID-19 that certainly limited the usefulness of graph theory is the lack of a proper debate on the trade-off between network utility (saving lives) and the right of individuals to cast-iron guarantees of privacy. Many (but not all) techniques for analyzing networks suffered in this context, due to the fact that they rely on centralized knowledge of community structure, and thus of personal acquaintances. Another important factor is the fog that surrounds this area. As we have mentioned, the field of network science has benefited from the fact that it is applicable in many research areas. Consequently, overlapping contributions have been made in diverse areas. This is both good and bad; good, as progress has been fast, but also bad, as it makes parsing available results difficult, both from the perspective of understanding how they connect with each other, and from the perspective of applying these results to real-world problems. Our objective in this paper is to partially address this latter issue.

Motivated by our interest in the spread of COVID-19, and building on our recent paper on this topic [6], the specific objective of the present manuscript is to review the many existing indicators that have been proposed to rank nodes in graphs, and to compare them in terms of their effectiveness and efficiency (i.e., in terms of the required computational effort for large-scale graphs) in the specific context of COVID-19. While we start with the classic case of undirected unweighted graphs, we also provide an interesting comparison for the weighted case, for which fewer results can be found in the literature. This paper provides the following main contributions:

1. First, we review the most popular centrality measures that have been used in the existing literature to rank nodes in networks. While many other review papers have been published on this topic (see [7-9]), in our review we also include some ranking methods which are not particularly popular, but have recently been proposed for epidemiology applications; for instance, the Kemeny index [6].
2. Second, we propose a benchmark case study which can be used for comparison purposes. This includes a description of how to create a realistic network of contacts, taken from a work on epidemiological studies [10], and a simplified model to mimic the COVID-19 spreading dynamics.
3. We compare many different indicators, taken from different scientific communities (e.g., the epidemiology, operations research, graph theory, control theory, communications and computer science communities), and rank them on their ability to mitigate the spreading of the virus in the same network. Moreover, the indicators are also compared in terms of their required computational burden. This is particularly important to predict their scalability to real-world large-scale social networks.
4. Finally, we revisit the classic undirected unweighted network that was proposed by Newman to support the need for new ranking algorithms (namely, random walk betweenness [11]), and we extend it to the weighted case.
Such an example allows us to observe which indicators may be more appropriate to use in the case of weighted networks (in the context of COVID-19, this may be useful if not only contacts are registered, but also their durations). The paper is organized as follows: firstly, we review the state of the art and briefly describe the indicators that will later be considered in the comparisons. In the following section we present the benchmark case study of an undirected and unweighted network and the comparison of the indicators. Next, we describe the revisited network used to compare indicators in the weighted framework, and show the corresponding ranks of the nodes. Finally, we conclude our manuscript.

In this section we review methods of ranking nodes by how 'central' they are to the network. While many surveys of this type exist (see, for example, [7-9]), the aim of the current work is to compare and contrast these methods and their effectiveness in the context of disease spread in a contact network. Furthermore, we consider alternative methods for ranking node centrality by considering how much each node's removal contributes to a change in some global measure of connectivity of the network. As such, we also review some network connectivity measures, and their interpretation in the context of disease spread.

A simple undirected graph is denoted G = (V, E), with vertex set V = {1, . . . , N} and edge set E ⊆ {{i, j} | i, j ∈ V}. We say that i and j are adjacent, and denote this as i ∼ j, if there is an edge between i and j (i.e., {i, j} ∈ E). If i ∼ j, we say that j is a neighbour of i. The number of neighbours of i is referred to as the degree of vertex i, and denoted deg(i). The adjacency matrix of G is the matrix A defined entrywise as

a_ij = 1 if {i, j} ∈ E, and a_ij = 0 otherwise.

A directed graph with vertex set V as above is one in which the edges are ordered pairs of vertices (i, j); i.e., E ⊆ V × V. The edge (or arc) (i, j) is considered to represent a connection from i to j, sometimes denoted i → j, to indicate that the arc's initial vertex is i and terminal vertex is j. The adjacency matrix is defined in the same way as above, though a_ij = 1 if and only if (i, j) ∈ E, and a_ij ≠ a_ji in general. A weighted graph arises when each edge {i, j} is assigned a weight w_ij. This is taken into account in the adjacency matrix by simply allowing the (i, j) entry of the matrix to be the weight w_ij.

Directed and weighted graphs provide generalizations of the simple undirected graph which may be very useful in the context of disease spread in a contact network. Differing weights of edges may account for differing strengths of transmission between pairs of individuals, due to, for example, different circumstances of the interaction: length of time, distance, etc. Asymmetric values allow for transmission to be more likely in one direction than the other between two individuals, for reasons more pertinent to the individual; for example, different levels of susceptibility, preventative measures, waning immunity, age differences, etc. Many centrality measures and connectivity measures are defined for simple undirected graphs, and while some definitions may allow extensions to the weighted or directed case, they may lose some aspect of their interpretation; some do not generalize at all. In what follows, we attempt to indicate for each metric listed whether or not it generalizes in this way.
The ones which extend most naturally, we find, are those which are derived from random walks, and so we include some mathematical preliminaries pertaining to these here. A random walk on a connected graph G (and here we shall consider strongly connected graphs in the directed case) is a discrete-time stochastic process in which, at any given time, a "random walker" occupies one vertex of the graph and, in the subsequent time-step, moves to a vertex j adjacent to his/her current vertex i, according to some transition probability p_ij. For a simple random walk on an undirected graph, the transition probability is simply p_ij = 1/deg(i); that is, the random walker chooses his/her next position uniformly at random from among the neighbours of his/her current vertex. This process is a Markov chain whose state space is the vertex set of G, since the state in any time-step depends only on the state of the chain in the previous time-step. The probability transition matrix P = [p_ij] for this Markov chain is easily determined from the adjacency matrix A by normalizing the rows so that they sum to 1. In the case of weighted graphs, the transition matrix is determined in exactly the same way from the weighted adjacency matrix; it follows similarly in the case of directed graphs, although one runs into trouble if there are any vertices with no outgoing edges, in which case the random walk is no longer well-defined.

For an ergodic Markov chain with states indexed 1, . . . , n and n × n transition matrix P, the stationary distribution vector of P, denoted w, is a left eigenvector of P corresponding to the eigenvalue 1, normalized so that the entries of w sum to 1, and thus represents a probability distribution across the states. In particular, w_i represents the long-term probability that the Markov chain occupies the i-th state. Note that in the case of a simple random walk on a connected undirected graph, the stationary distribution vector entries are proportional to the vertex degrees; that is, w_i = deg(i)/2|E|, where |E| denotes the number of edges. For any two states i and j, the mean first passage time from i to j, denoted m_ij, is the expected time it takes to reach the j-th state, given that the chain starts in the i-th state.

Degree centrality

The first and simplest proposal for ranking nodes in a network is by considering each node's degree (sometimes called valency), or the number of neighbours of each node. The degree of node i can be easily computed via the i-th row sum of the adjacency matrix:

deg(i) = Σ_j a_ij.

The degree is a simple centrality measure for undirected networks, where it identifies the most connected nodes, in the sense that highly-ranked nodes under this metric will have the most neighbours. In the context of disease spread in a contact network, these highly-ranked nodes correspond to individuals with the most contacts; as such, a high-degree node infected with the disease has more opportunity to spread the disease to other individuals. In the case of directed graphs, one has to distinguish between incoming and outgoing edges, defining the in-degree and out-degree, respectively, as

deg_in(i) = Σ_j a_ji and deg_out(i) = Σ_j a_ij.

Given this dual definition, it is more difficult to consider degree as a measure of centrality in the directed case; some options are to consider the sum or the average of the two (as in [7]). For weighted graphs, the degree of a vertex is easily extended by defining deg(i) as the sum of the weights of incident edges. This is sometimes referred to as the strength of a vertex (see, for example, [12]).
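To make these preliminaries concrete, the following minimal Python sketch (ours, not from the paper; networkx's karate club graph is merely a stand-in for a contact network) builds the transition matrix of the simple random walk and checks that its stationary distribution is proportional to the vertex degrees:

```python
# A minimal sketch, assuming any connected undirected graph will do.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G, weight=None)  # 0-1 adjacency matrix
deg = A.sum(axis=1)                    # vertex degrees (row sums)
P = A / deg[:, None]                   # simple random walk: p_ij = a_ij / deg(i)

# Stationary distribution: left eigenvector of P for the eigenvalue 1,
# normalized so that its entries sum to 1.
vals, vecs = np.linalg.eig(P.T)
w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
w = w / w.sum()

# For a connected undirected graph, w_i = deg(i) / 2|E|.
assert np.allclose(w, deg / deg.sum())
```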
The distance between nodes i and j, denoted dist(i, j), is defined as the minimum number of consecutive edges needed to move from node i to node j or, equivalently, as the length of a shortest path between them. The closeness centrality of a node i is computed by taking the inverse of the average distance from i to any other node:

CC(i) = (N − 1) / Σ_{j ≠ i} dist(i, j).

The larger the value of CC(i), the more central node i is in the network, in the sense that it is, on average, close to many other nodes. As with degree centrality, the definition of closeness centrality can be extended to directed networks, though a distinction must be made as to whether the distances are computed from, or to, the reference node i. Note also that "distances" are not symmetric in directed networks. For weighted graphs, one could consider the edge-weights as a "cost" of traversing the edge, and thus define the shortest distance between u and v as the minimum weight of any path from u to v (where the weight of a path is the sum of the weights of its edges). With such a definition for distance in hand, it is reasonable to generalize closeness centrality to weighted graphs; see [13] for some limited discussion. In a disease-spread context, ranking vertices in a contact network by their closeness centrality would, in theory, highlight individuals for whom there is (on average) a low degree of separation between the individual and all other members of the community. If such an individual were infected, it would take fewer secondary infections on average to infect others.

The betweenness centrality of a node i is computed in terms of how many shortest paths pass through that node [14, 15]. In particular, fix a source node s and a target node t (both distinct from i), and let σ_st denote the total number of shortest paths from s to t (or geodesics, i.e., paths of length dist(s, t)). Letting σ_st(i) denote the number of those paths that include node i, we take the ratio of these and then average over all choices of s and t with s ≠ i ≠ t:

BC(i) = (1/((N − 1)(N − 2))) Σ_{s ≠ i ≠ t} σ_st(i)/σ_st.

Accordingly, the betweenness centrality of a node corresponds to the fraction of shortest paths that pass across that node, and this expression is valid for both directed and undirected networks. In principle, betweenness centrality is expected to rank highly the nodes that behave as bridges between clusters in the network [7]. In the context of disease spread, these highly-ranked nodes would correspond to individuals who bridge multiple communities. We encounter similar difficulties with the extension of betweenness centrality to weighted and directed graphs as with closeness centrality. Some limited work exists; see, for example, [16], which focuses on the betweenness centrality of an edge in a weighted network, rather than the betweenness centrality of nodes, and [17], which discusses several possible extensions to directed graphs and their limitations.
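Both shortest-path-based centralities can be computed with off-the-shelf routines. The sketch below (ours; the edge weights are hypothetical) also anticipates the cost transformation 1 − w_ij used later in the paper for the weighted case:

```python
# Sketch: shortest-path-based centralities on a small synthetic graph.
import random
import networkx as nx

G = nx.connected_watts_strogatz_graph(30, 4, 0.3, seed=1)
cc = nx.closeness_centrality(G)    # (N-1) / sum of distances to other nodes
bc = nx.betweenness_centrality(G)  # averaged fraction of shortest paths via i

# Weighted variant: a HIGH transmission probability w_ij should be a LOW
# traversal cost, so use cost = 1 - w_ij (weights here are made up).
random.seed(1)
for u, v, d in G.edges(data=True):
    d["w"] = random.uniform(0.1, 0.9)
    d["cost"] = 1.0 - d["w"]
bc_weighted = nx.betweenness_centrality(G, weight="cost")
```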
The PageRank algorithm computes a ranking for every web page based on the graph of the World Wide Web. PageRank has applications in search engines and traffic estimation [18]. The PageRank algorithm is best understood via a random walk on the network: PageRank can be thought of as a model of a "random surfer" who starts on a random webpage and keeps clicking on links randomly, never hitting "back"; that is, he/she takes a random walk on the World Wide Web, a directed graph in which (i, j) is an edge if webpage i has a hyperlink to webpage j. At any point, the surfer may get bored and "teleport" to a random page in the network. The long-term probability that the random surfer occupies webpage i in this stochastic process is its PageRank; this corresponds to the stationary distribution of the corresponding Markov chain. Note that the transition matrix for this Markov chain is

P = α D^{-1} A + ((1 − α)/N) J,

where D is the diagonal matrix of vertex out-degrees, A is the adjacency matrix, and J is the N × N matrix of all ones [19]. The parameter α is referred to as the damping factor and represents the probability at each step that the "random surfer" will get bored and request another random page [20]. This is usually, as a matter of convention, set to α = 0.85. We note that in the case of the World Wide Web, the underlying directed graph is not strongly connected; indeed, there may be many webpages with no outgoing hyperlinks. This causes immediate problems with the random walk being well-defined, but is usually fixed by replacing any zero row of the adjacency matrix with a row of all 1s. Thus, when the random surfer ends up on a webpage with no outgoing links, he/she chooses a webpage uniformly at random in the next step. This allows computation of the PageRank vector in the setting where some nodes have out-degree 0.

The interpretation of this measure of centrality is interesting in the context of disease spread. In the case that we work with a simple undirected graph, if the damping factor is set to α = 1 (i.e., a simple random walk with no teleportation), then the PageRank centrality ranking corresponds exactly to the node degree centrality ranking. However, including a damping factor may allow for the possibility that a person contracts the disease not from another individual they have contact with, but rather by chance (e.g., touching a surface with traces of the virus), or by some mechanism not accounted for in the contact network model. The PageRank centrality ranking of the nodes can thus essentially be thought of as strongly related to the node degree ranking in this context, with some relaxation that may actually reflect the features of the disease more accurately. If one wishes to consider a weighted or directed graph, PageRank centrality is a measure which extends naturally along with its interpretability, simply by considering a random walk on the given weighted or directed graph.
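A minimal sketch (our own implementation, not code from the paper) of the computation just described, using power iteration on P = α D^{-1}A + ((1 − α)/N)J together with the dangling-node fix:

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10):
    """PageRank vector of a (possibly directed) 0-1 adjacency matrix A."""
    N = A.shape[0]
    A = A.astype(float).copy()
    A[A.sum(axis=1) == 0] = 1.0            # dangling nodes: zero rows -> all ones
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    G = alpha * P + (1.0 - alpha) / N      # teleportation term: (1-alpha)/N * J
    w = np.full(N, 1.0 / N)                # start from the uniform distribution
    while True:                            # power iteration: w <- w G
        w_next = w @ G
        if np.abs(w_next - w).sum() < tol:
            return w_next
        w = w_next
```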
A measure of betweenness centrality based on random walks, called random walk betweenness (RWB), was introduced in [11]. The idea is to calculate the centrality of a given node i via the proportion of random walks that pass through node i. This is very similar to betweenness centrality, but does not consider shortest paths. In particular, the author of [11] describes it as "the expected net number of times a random walk passes through vertex i on its way from a source s to a target t, averaged over all s and t". This measure was shown to better rank the importance of nodes in graphs with existing communities, and to be less correlated with vertex degree in most networks [11]. Interestingly, the method for computing this measure depends strongly on considering the graph in question as an electrical network and considering the current flow through a vertex. The author then proves that this is equivalent to the "flow" of a random walk; however, this means that this particular definition of random walk betweenness does not extend to directed or weighted graphs, and there is no literature which attempts this. Random walk betweenness centrality is obtained by averaging the current flow through vertex i over all source-target pairs s, t; see [11] for further information on how this quantity is calculated.

While it is not necessarily evident from the electrical network description above, this value does indeed compute the expected net number of times a random walk would pass through node i before reaching a target t, given that it starts at a source s, averaged over all pairs s, t. This intuitively seems like exactly the metric we are looking for when considering "pivotal" individuals in a contact network in which a disease is spreading, and as we will see in the next section, this metric is particularly effective in controlling the disease when used exclusively to determine testing protocols. It is also known that in networks with strong community structure, immunization interventions targeted at individuals bridging communities (e.g., using random walk betweenness) are more effective than those simply targeting highly connected individuals [10]. Note that this definition can be directly applied only to simple undirected graphs; there does not exist a generalization to weighted or directed graphs.

Random walk centrality is introduced in [21], and is said to "quantify how central a node i is located regarding its potential to receive information which is randomly diffusing over the network". Given a graph G, consider a random walk on the graph with transition matrix P. The stationary distribution vector for P is denoted w, and the k-th entry of w is denoted w_k. The characteristic relaxation time of vertex k is introduced as

τ_k ≡ Σ_{j=0}^{∞} ((P^j)_{kk} − w_k).

This quantity converges whenever the transition matrix P is primitive, which is the case for a random walk on a connected non-bipartite graph. The random walk centrality [21] of vertex k is then calculated as

C_k = w_k / τ_k.

In [22], the author observes that the random walk centrality is the reciprocal of the measure of centrality known as the accessibility index. For a given vertex k, the accessibility index of vertex k is defined as

α_k = Σ_j w_j m_jk,

where m_jk is the mean first passage time from j to k. This definition allows the interpretation of the accessibility index as the expected time to reach vertex k for the first time, given that we start in a randomly-chosen initial state. It is shown in [22] that C_k = 1/α_k. This is particularly useful as it admits a definition of this quantity as a measure of centrality not just for simple undirected graphs (as introduced in [21]) but for any Markov chain, as a measure of state centrality. In particular, this is a useful metric for both weighted and directed graphs. Further work on estimating the accessibility index may be found in [23].

In the disease-spread context, this measure ranks highly the vertices which are "easily accessed" from other vertices of the graph. This is a useful way to consider how "central" an individual is in a community. However, in the context of the spread of disease, and in particular when considering testing protocols in order to control the disease, this may not be appropriate. Controlling the disease in our simulations means determining the individuals who are most instrumental in the disease spreading through the whole graph, and while those individuals with high random walk centrality may have an increased likelihood of being infected, they may not coincide with the individuals who ought to be tested and isolated as quickly as possible.
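The accessibility index (and hence the random walk centrality) can be computed from the mean first passage times. The sketch below is ours and assumes an ergodic chain; it uses the classical fundamental-matrix identity m_jk = (z_kk − z_jk)/w_k, which is not spelled out in the text above:

```python
import numpy as np

def accessibility_index(P):
    """alpha_k = sum_j w_j m_jk for an ergodic transition matrix P;
    the random walk centrality is then C_k = 1 / alpha_k (see [22])."""
    n = P.shape[0]
    vals, vecs = np.linalg.eig(P.T)
    w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    w = w / w.sum()                                   # stationary distribution
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), w))  # fundamental matrix
    M = (np.diag(Z)[None, :] - Z) / w[None, :]        # m_jk, with m_kk = 0
    return w @ M                                      # alpha_k = sum_j w_j m_jk
```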
Many criticality measures exist which provide a single numerical value describing the "connectedness" of a network in some way. Such measures can be easily extended to measure the criticality of a single node i in a graph G by inferring the criticality of the i-th node from the change in criticality of the network after the i-th node is removed from the network. This is described as follows in [24]: given some graph invariant cr(G) measuring the connectedness of the graph G in some manner, define the criticality of node i as

cr(i) = cr(G \ {i}) − cr(G),

where G \ {i} denotes the graph obtained from G by removing node i and its incident edges. We include several such measures of node centrality here for consideration in later simulations.

The effective graph resistance is interpreted as a "robustness measure" of a network [25]. To formulate the effective graph resistance (also known as the total graph resistance, or the Kirchhoff index of the graph), the (undirected and connected) graph G is seen as an electrical circuit, where an edge {i, j} corresponds to a resistor of r_ij Ohm, and the effective graph resistance is the sum of the effective resistances over all pairs of vertices. These effective resistances R_st can be computed using the Laplacian matrix of the graph. The Laplacian matrix of a graph is defined as the adjacency matrix subtracted from the diagonal matrix of vertex degrees: L = D − A. Note that in the context of the electrical network, an undirected, unweighted graph is considered to have resistors of r_ij = 1 Ohm on each edge. However, this can be extended to weighted networks, where the resistors have r_ij = 1/w_ij, with w_ij denoting the edge weight between vertices i and j. In this case, the weighted Laplacian matrix can easily be defined and used to extend the following results. Through applications of Kirchhoff's current and circuit laws, one can show that the effective resistance R_st between vertices s and t may be calculated using a pseudoinverse L^+ of the Laplacian matrix of the graph:

R_st = (e_s − e_t)^T L^+ (e_s − e_t),

where e_i denotes the standard unit vector, i.e., a vector consisting of all zeros except a single 1 in the i-th position. Given this expression for the effective resistance between two vertices, the total graph resistance R_G can be expressed in terms of the sum of the reciprocals of the nonzero eigenvalues 0 < ρ_1 ≤ · · · ≤ ρ_{N−1} of the Laplacian matrix [26]:

R_G = N Σ_{k=1}^{N−1} 1/ρ_k.

As previously mentioned, this measure can be easily extended to weighted graphs, though it is not clear that a generalization for directed graphs is possible.
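A sketch (ours) of the effective graph resistance via the Laplacian pseudoinverse, combined with the node-removal criticality scheme described above; it assumes that removing a node leaves the graph connected, since otherwise the quantity loses its meaning:

```python
import numpy as np
import networkx as nx

def effective_graph_resistance(G):
    """Kirchhoff index: sum over pairs of R_st = (e_s - e_t)^T L^+ (e_s - e_t);
    equivalently N times the sum of reciprocal nonzero Laplacian eigenvalues."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    Lp = np.linalg.pinv(L)                       # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return np.sum(d[:, None] + d[None, :] - 2 * Lp) / 2

def resistance_criticality(G):
    """cr(i) = cr(G \\ {i}) - cr(G); assumes G minus any node stays connected."""
    base = effective_graph_resistance(G)
    return {v: effective_graph_resistance(G.subgraph(set(G) - {v})) - base
            for v in G}
```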
Given an ergodic Markov chain with transition matrix P, stationary vector w, and mean first passage times m_ij, one can define the following quantity:

K_i = Σ_j m_ij w_j.

This can be interpreted as the expected time to reach a randomly-chosen state j, given that the chain starts in state i. Astonishingly, this quantity is independent of the initial state i [27]. Thus Kemeny's constant, denoted K(P), can be computed as

K(P) = Σ_j m_ij w_j, for any choice of state i.

An interpretation of this result is that the expected time to reach a randomly-selected destination state j from a fixed initial state i (where state j is selected randomly according to the stationary distribution w) does not depend on the starting point i [28]. Furthermore, since the w_i sum to 1, we can write Kemeny's constant as a double sum:

K(P) = Σ_i Σ_j w_i m_ij w_j,

admitting the interpretation of Kemeny's constant as the expected length of a random trip in the chain, where both the initial and destination states are chosen randomly according to the stationary distribution. Therefore, Kemeny's constant is an intrinsic measure of a Markov chain. If the transition matrix P has eigenvalues λ_1 = 1, λ_2, . . . , λ_n, then another way of computing K(P) is [29]

K(P) = Σ_{j=2}^{n} 1/(1 − λ_j).     (7)

Note that this expression furnishes a computationally useful method for determining K(P) in practice; K(P) is computed as the trace of a given generalized inverse of the singular matrix I − P. Kemeny's constant is a proxy for the global "connectedness" of a network, given its interpretation as the expected length of a random trip between states in the chain. Thus, networks characterised by small values of Kemeny's constant should be more efficient in terms of flow [30]. In a contact network, small values of Kemeny's constant for the random walk on the graph indicate that the individuals in the network are well-connected, while large values of Kemeny's constant may be indicative of clustered behaviour, and that it is difficult to traverse the graph. If one can identify the vertex whose removal causes the largest increase in the value of Kemeny's constant, this could be interpreted as a "central" vertex. Since Kemeny's constant can be computed for any ergodic Markov chain, it extends most usefully to weighted, directed graphs.
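Both computational routes to K(P) mentioned above can be sketched as follows (our code; an ergodic chain is assumed). Here the generalized inverse of I − P is taken to be the fundamental matrix Z = (I − P + 1w^T)^{-1}, for which K(P) = tr(Z) − 1:

```python
import numpy as np

def kemeny_from_eigenvalues(P):
    """K(P) = sum over j >= 2 of 1 / (1 - lambda_j), as in expression (7)."""
    vals = np.linalg.eigvals(P)
    vals = np.delete(vals, np.argmin(np.abs(vals - 1)))  # drop lambda_1 = 1
    return float(np.real(np.sum(1.0 / (1.0 - vals))))

def kemeny_from_trace(P):
    """K(P) via the trace of the fundamental matrix Z = (I - P + 1 w^T)^{-1}."""
    n = P.shape[0]
    vals, vecs = np.linalg.eig(P.T)
    w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    w = w / w.sum()                                  # stationary distribution
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), w))
    return float(np.trace(Z) - 1.0)
```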
For an ergodic Markov chain with transition matrix P, the eigenvalues may be listed as λ_1 = 1, λ_2, . . . , λ_n, and by Perron-Frobenius theory, |λ_j| ≤ 1 for all j = 2, . . . , n. As is evident from our discussion of Kemeny's constant above, much information regarding the dynamics of the Markov chain may be extracted from these eigenvalues. We are frequently interested in the eigenvalue λ_j whose modulus |λ_j| is second-largest after 1 (the second largest eigenvalue in modulus, SLEM). Without loss of generality, suppose this is λ_2. We outline here several ways in which the value of λ_2 can be used to infer how well-connected the states of a Markov chain are (or the vertices of a graph, when the chain in question represents a random walk on a graph). In general, if λ_2 is bounded away from 1, the states of the Markov chain are considered to be "well-connected".

For a Markov chain with transition matrix P, the probability distribution after k time-steps is given by u_k^T = u_0^T P^k, where u_0 is the initial probability distribution vector. It is well-known that if the Markov chain is ergodic (i.e., the matrix P is primitive), then u_k^T converges to the stationary distribution vector w^T of the chain as k → ∞, independently of the initial distribution. The asymptotic rate of this convergence is clearly dictated by the moduli of the eigenvalues of P. If all subdominant eigenvalues have modulus bounded away from 1, convergence happens quickly. This convergence is often referred to as the "mixing rate" of the chain, and is framed in terms of how quickly the initial information is lost. In the context of a random walk on an undirected connected graph, the total variation distance between u_k and w after k time-steps can be bounded above by a quantity proportional to λ^k, where λ = max_{j≥2} |λ_j| (see [31]). As P has real eigenvalues in this case, issues arise when λ_j is close to 1 or close to −1. We note that there is some difference in the dynamics of the Markov chain in these two cases: a subdominant eigenvalue close to 1 in value is indicative of clustering behaviour, or of near-decoupling [30, 32, 33], while −1 occurs as an eigenvalue of P if the undirected graph is bipartite, causing periodic behaviour in the Markov chain. For this reason, in the context of disease spread in a network, we are more interested in a subdominant eigenvalue whose value is close to 1 (not merely whose modulus is close to 1). See [33] for more discussion.

For a connected graph G, the transition matrix for the random walk on G is P = D^{-1}A; denote its eigenvalues by 1 = λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. The subdominant eigenvalue λ_2 and its relationship to graph structure is well-studied in the context of the normalized Laplacian matrix, defined as

L = I − D^{-1/2} A D^{-1/2}

when G has no isolated vertices. It is easily seen that L is similar to the matrix I − D^{-1}A, so that the normalized Laplacian eigenvalues 0 = μ_0 ≤ μ_1 ≤ · · · ≤ μ_{n−1} are in one-to-one correspondence with the eigenvalues of D^{-1}A, where μ_j = 1 − λ_{j+1}. In particular, the eigenvalue μ_1 = 1 − λ_2 is often referred to as the algebraic connectivity of the graph [31] (not to be confused with Fiedler's algebraic connectivity, defined below as an eigenvalue of the combinatorial Laplacian L = D − A). The eigenvalue μ_1 (or the spectral gap 1 − λ_2) is well-known to be related to isoperimetric numbers of the graph. In particular, we have the following relationship with Cheeger's constant h(G) [31]:

2h(G) ≥ μ_1 > h(G)^2 / 2.

Cheeger's constant is a measure of how much a vertex set can expand in the graph; formally,

h(G) = min_{S : vol(S) ≤ vol(G)/2} e(S, S̄) / vol(S),

where vol(S) denotes the sum of the degrees of the vertices in S, and e(S, S̄) denotes the number of edges in the graph between vertices in S and vertices outside S. We note that graphs with low values of Cheeger's constant typically have "bottleneck" vertices, or clusters/communities with very few edges between them. Finally, the quantity 1/μ_1 is referred to as the relaxation time of the random walk, and is studied in itself as a measure of the asymptotic rate of convergence to the stationary distribution (see [34]). Note that the relaxation time 1/(1 − λ_2) is the first and largest term in the expression (7) for Kemeny's constant.

In the context of disease spread in a graph G, the value of the second-largest eigenvalue λ_2 of the transition matrix for the random walk (and how close it is to 1) may be used to indicate how quickly the disease may disperse through the graph, as a measure of clustering behaviour in the graph, or as a stand-in for Cheeger's constant, indicating whether the graph has good expansion properties (which would imply that the disease spreads quickly). Since subdominant eigenvalues can be calculated for any Markov chain, this measure is easily extended to weighted and directed networks. We remark that, in the case that a subdominant eigenvalue is a complex number, some work has been done to show how clustering behaviour may be derived from the value of λ_2 in that case [33].

The eigenvalues of the Laplacian matrix L = D − A of a graph G can be listed as ρ_0 = 0 ≤ ρ_1 ≤ · · · ≤ ρ_{n−1}, and the second-smallest eigenvalue ρ_1 is referred to as the algebraic connectivity of the graph, denoted a(G) [35]. This eigenvalue is strictly greater than 0 if and only if G is a connected graph. The magnitude of this value reflects how well-connected the overall graph is. It has been used in analysing the robustness and synchronizability of networks. There are many papers and results relating the value of a(G) to many other structural quantities of the graph, such as vertex- and edge-connectivity, diameter, minimum degree, isoperimetric constants, and many others. For an overview, see the excellent survey by De Abreu [36]. There is also some research in the area of extending the definition of algebraic connectivity to directed graphs [37].
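For undirected graphs (where all of these eigenvalues are real), the spectral quantities above can be computed directly; a minimal sketch of ours:

```python
import numpy as np
import networkx as nx

def spectral_indicators(G):
    """SLEM and spectral gap of the random walk, plus Fiedler's a(G)."""
    A = nx.to_numpy_array(G)
    P = A / A.sum(axis=1)[:, None]                 # random walk transition matrix
    lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]  # 1 = lam[0] >= ... >= lam[-1]
    L = nx.laplacian_matrix(G).toarray().astype(float)
    rho = np.sort(np.linalg.eigvalsh(L))           # 0 = rho[0] <= rho[1] <= ...
    return {
        "SLEM": max(abs(lam[1]), abs(lam[-1])),    # second largest modulus
        "spectral_gap": 1.0 - lam[1],              # mu_1; 1/mu_1 = relaxation time
        "algebraic_connectivity": rho[1],          # Fiedler value a(G)
    }
```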
The basic reproduction number R_0 is perhaps the most popular indicator in the epidemiology community, and it may be defined as the expected number of individuals that a randomly infected individual can infect during his/her infection period in a fully-susceptible population [38]. This value depends on the specific disease (e.g., its intrinsic infectivity) and on the topology of the network of contacts. In particular, it can be shown that in a network-SIS model (a susceptible-infected-susceptible model), R_0 is proportional to λ_max, where λ_max denotes the dominant eigenvalue of the adjacency matrix, which is equal to its spectral radius [38]. Accordingly, in this paper we shall use the dominant eigenvalue of the adjacency matrix as a proxy for R_0, with the ultimate goal of testing the individuals whose removal gives rise to the lowest value of R_0.

The last measure we discuss here is the modularity of a network, which is considered to quantify the degree of community structure of the network. In order to define this measure (as defined in [39]), we assume there exist some pre-determined communities to which the vertices of a graph G belong, and partition the vertex set accordingly; say V(G) = S_1 ∪ S_2 ∪ · · · ∪ S_r. We define e_ij to be the fraction of all edges in the network that join vertices from S_i with vertices from S_j, with e_ii considered as the fraction of edges within the community S_i (i.e., vol(S_i)/vol(G)). Define a_i = Σ_{j=1}^{r} e_ij, considered as the fraction of edges that connect to vertices in S_i. Then the modularity is defined as

Q = Σ_{i=1}^{r} (e_ii − a_i^2).

This quantifies the degree of community structure in the network by comparing the fraction of all edges that are within a community to the fraction of the edges connecting that community to the other communities, and observing that if there were no community structure, one could expect that e_ij = a_i a_j, giving a modularity of Q = 0. In the present manuscript we do not consider modularity as one of our criticality measures used to rank nodes by the effect of their removal on this quantity. Instead, modularity plays a more vital role in our simulations, as we construct contact networks with varying levels of community structure as described by modularity. We follow the method outlined in [10] to construct these; see the section Case study: Creation of the network of daily contacts.

Case study: Creation of the network of daily contacts

Our case study is created according to the procedure outlined in [10], and later reused also in [6]. The procedure to create the network can be summarized in the following steps (a minimal sketch implementing them is given after the list):

1. Six small-world communities of 40 nodes each are first created using the Watts-Strogatz algorithm [40], so that each node has exactly N_d^0 edges connecting it to nodes of the same community;
2. We then add 30 · N_d^0 edges in a random way to connect different communities; after this step, the average node degree becomes N_d = (5/4) N_d^0;
3. We then rewire between-community edges (i.e., edges that connect nodes belonging to two different communities) so that they become within-community edges (i.e., edges that connect nodes belonging to the same community). In doing this, the modularity of the graph increases, and we stop the procedure once a desired level of modularity M is achieved.
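The sketch below reflects our reading of this three-step construction from [10]; the Watts-Strogatz rewiring probability, the stopping details, and the parameter defaults are our assumptions (N_d^0, here deg0, should be even for the ring construction):

```python
import random
import networkx as nx

def build_daily_network(n_comm=6, comm_size=40, deg0=6, target_mod=0.8):
    # Step 1: small-world communities; p=0 keeps each within-community
    # degree exactly deg0.
    G, comms = nx.Graph(), []
    for c in range(n_comm):
        H = nx.watts_strogatz_graph(comm_size, deg0, p=0)
        H = nx.relabel_nodes(H, {v: c * comm_size + v for v in H})
        G.update(H)
        comms.append(set(H))
    comm_of = {v: i for i, S in enumerate(comms) for v in S}
    # Step 2: add 30 * deg0 random between-community edges.
    added = 0
    while added < 30 * deg0:
        u, v = random.sample(list(G), 2)
        if comm_of[u] != comm_of[v] and not G.has_edge(u, v):
            G.add_edge(u, v)
            added += 1
    # Step 3: rewire between-community edges into within-community edges
    # until the desired modularity M is reached.
    while nx.algorithms.community.modularity(G, comms) < target_mod:
        between = [e for e in G.edges if comm_of[e[0]] != comm_of[e[1]]]
        if not between:
            break
        u, v = random.choice(between)
        options = [x for x in comms[comm_of[u]] if x != u and not G.has_edge(u, x)]
        if options:
            G.remove_edge(u, v)
            G.add_edge(u, random.choice(options))
    return G
```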
In particular, we compare the different indicators for different values of N_d, which corresponds to the number of daily contacts of individuals, and for different values of the modularity M, which allows one to evaluate what happens in societies with milder or stronger community structures. Thus, every day we create a new graph according to the procedure previously given, and we assume that it corresponds to a network of daily contacts.

Case study: simulation of epidemic spread on a graph

The previously described procedure had been proposed to mimic the network of contacts of individuals in a society, with the final objective of evaluating the impact of different testing/vaccination campaigns, to better mitigate the spreading of epidemics. While such networks could only be guessed at in that context, contact tracing applications now give the unprecedented advantage of actually knowing when individuals meet at a close distance for a time sufficient to infect a new individual (e.g., at least 15 minutes at a distance of 1 meter). An implicit assumption here is that individuals allow the sharing of the information on their daily contacts with some centralized data center that aggregates this kind of information and knows the network (e.g., in terms of an adjacency matrix). We then model the spreading of the virus, and the testing procedure, in the following simplified way: on the first day, we randomly label two individuals as "infected"; they correspond to our initial condition. Every single day, we assume that a susceptible individual who gets in contact with an infectious individual has a probability of 10% of being infected. Also, every day we test N_test individuals selected according to the different indicators introduced in the section Review of the state-of-the-art: if a tested individual is found to be infected, then he/she is quarantined for the following 14 days, after which we assume the individual is fully recovered and not susceptible anymore (i.e., they cannot be infected again). In addition, all of his/her contacts of the same day are further tested (but not those of the previous days). During the quarantine, we assume that the individual is fully compliant with the rules and does not have contact with any other individual. It may also happen that infectious individuals are never tested; obviously, they may infect other individuals, as they are not quarantined, and we assume that they also recover after 14 days, after which they are not susceptible anymore. The previous model is obviously a simplification of the COVID-19 dynamics, and may be seen as a simplified agent-based SIQR model, where each individual in the population may belong to only one of the disjoint compartments of Susceptible, Infected, Quarantined, and Recovered individuals. Since we only test individuals on the basis of the topology of the network, depending on the specific indicator of interest, this method may be applied in practice to the target-testing of individuals in a population of asymptomatic individuals, and should obviously be complemented with the good practice of testing individuals who, independently of the topology of the network, exhibit COVID-19 symptoms.
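The daily spreading-and-testing loop just described can be sketched as follows; here `rank_nodes` stands for any of the reviewed indicators (returning nodes sorted from most to least critical) and `build_daily_graph` for the construction above, so both names are placeholders of ours:

```python
import random

def simulate(build_daily_graph, rank_nodes, n=240, days=30, n_test=20,
             p_inf=0.1, t_rec=14):
    state = {v: "S" for v in range(n)}       # S/I/Q/R compartments
    since = {}                               # day on which a node was infected
    for v in random.sample(range(n), 2):     # two random initial infections
        state[v], since[v] = "I", 0
    for day in range(1, days + 1):
        G = build_daily_graph()              # a fresh contact network each day
        # transmission: every susceptible-infectious contact infects w.p. p_inf
        newly = set()
        for u, v in G.edges:
            for a, b in ((u, v), (v, u)):
                if state[a] == "I" and state[b] == "S" and random.random() < p_inf:
                    newly.add(b)
        for v in newly:
            state[v], since[v] = "I", day
        # testing: quarantine infected top-ranked nodes, together with any
        # infected same-day contacts (previous days' contacts are not traced)
        for v in rank_nodes(G)[:n_test]:
            if state[v] == "I":
                state[v] = "Q"
                for u in G.neighbors(v):
                    if state[u] == "I":
                        state[u] = "Q"
        # recovery with immunity after t_rec days, quarantined or not
        for v in list(since):
            if state[v] in ("I", "Q") and day - since[v] >= t_rec:
                state[v] = "R"
    return sum(s == "S" for s in state.values())
```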
Simulation results are described in Fig. 1, where the final number of susceptible individuals at the end of the simulation (i.e., after 30 days) is shown. Note that, due to the adopted COVID-19 spreading model, the final number of susceptible individuals corresponds to 240 (the size of the considered population) minus the number of infected individuals (so a larger number of susceptible individuals corresponds to a smaller number of infected individuals). Due to the stochasticity of the considered model, the results shown in Fig. 1 are averaged over 100 simulations, in each of which two randomly chosen individuals are initially infectious. The evolution of the number of infected individuals throughout the 30 days of simulation is shown in Fig. 2. In the simulations we have considered networks of modularity equal to 0.8, an average node degree N_d equal to 7.5, and the possibility of testing 20 different individuals every day. Fig. 1 then shows how the performance of the different ranking algorithms changes if one of these quantities is slightly changed. From the figure it is possible to appreciate that better results can be obtained if networks with a stronger community structure are considered (values of modularity greater than 0.8). However, as can be seen from the second panel of Fig. 1, the most relevant quantity is the average node degree N_d. This could be expected, since it corresponds to the number of daily relevant contacts. Finally, the last panel shows the obvious result that, as the number of daily tests is increased, a reduced number of individuals gets infected.

With an increase in network size, the computational burden could become too high for some indicators, thus limiting their scalability. Table 1 reports the computation times for the different indicators on a personal computer equipped with a 6-core i7-8700 CPU @ 3.20 GHz and 16 GB of RAM. The values correspond to the average time over ten runs of the whole 30-day simulation, and the indicators are sorted accordingly. The indicators that can be computed fastest are those measuring node centrality, i.e., PageRank, closeness centrality, node degree, betweenness centrality, and RWC. As expected, node connectivity measures are more resource-demanding and are at least one order of magnitude slower. Among them, the most computationally expensive are the second largest eigenvalue in modulus (SLEM) and Kemeny's constant. Finally, it is noteworthy that although RWB measures node centrality, it is the most burdensome measure among all indicators (almost 300 times slower than PageRank).

The primary contribution of this article is the survey and comparison of various node ranking procedures in the control of disease spread in a contact network. However, throughout our review of existing centrality indicators, we have stressed the importance of considering metrics which generalize to weighted or asymmetric (i.e., directed) graphs. This is motivated in no small part by the nature of the current pandemic and the emphasis on the increased risk of transmission caused by a lengthier interaction, or by a contact event with no personal protective equipment. In this section, we present a small example to motivate further consideration of this issue. In Newman's 2005 paper (see [11]), the idea of random walk betweenness was introduced, motivated first by the example shown in Fig. 3. This example indicates that, while a node may not sit on any shortest path, intuition would still suggest that it is somehow more "central" than others.
Finding existing measures of centrality (which depended largely on shortest paths) falling short in appropriately categorizing such nodes, Newman introduced the idea of random walk betweenness. Inspired by this key example, and by its natural appeal to the reader's intuition, we present a comparable example here which highlights the importance of considering the weights of connections between nodes; first by an intuitive argument, then backed up by computations. In Fig. 4, we have a network consisting of 10 nodes and edges of different weights, collected in a weight matrix W. For our purposes, we will interpret the edge weight w_ij as the probability of transmission from individual i to j (or vice versa), based on their contact time and conditions, if one of them is already infected. Note that in Fig. 4 the weights are represented proportionally by thinner or thicker lines. It is a reasonable argument that, based on these infection probabilities, nodes 2 and 9 should play a larger role in the spread of the disease through the entire community than nodes 5 and 6. However, if we were to assume that the graph is unweighted, or that all infection probabilities are equal, then we would conclude that all four nodes (2, 5, 6, and 9) have equal importance. If testing/quarantine strategies are based on this assumption, we may fail to prevent or slow the spread of the disease.

For the remainder of this section, we analyse this example in a few ways. First, we compare the rankings of the nodes of this example according to each centrality indicator from the previous section, where we consider the network to be unweighted (i.e., all weights of contacts/edges are either 0 or 1). Then we compute the rankings when the weights of the edges are taken into account; to do so, we elaborate further on how these indicators are computed in the weighted case. Finally, we run some simulations to compare the effectiveness of each indicator (both unweighted and weighted) in controlling the disease. These simulations differ from those in the previous section, since this network is too small for those same simulations to be sensible. For the unweighted indicators (node degree, betweenness centrality, random walk betweenness, second-largest eigenvalue, and Kemeny's constant), we compute as before, taking the weight of every edge to be 1 and working with either the 0-1 adjacency matrix A or the simple random walk on the graph with transition matrix D^{-1}A. For the weighted indicators, we make a few adjustments:

• For weighted node degree, we use W as the weighted adjacency matrix, and determine the weight of node i as the sum of the weights of its incident edges, i.e., the sum of row i.
• In computing the betweenness centrality of a node in a weighted graph, the edge-weights typically correspond to a "cost", and minimum distance corresponds to paths or walks of minimal cost. As such, it does not make sense in this context to compute betweenness centrality using infection probabilities as weights, since if the value of w_ij is high, it is easier for the disease to spread, not harder (which we would associate with a high cost). We choose to simply replace the weight w_ij of the edge between i and j by the cost 1 − w_ij, and then compute the betweenness centrality of each node in the usual way.
• There is no known formulation of random walk betweenness for weighted graphs, so we do not include a weighted version here.
• For the second-largest eigenvalue of the probability transition matrix, and for Kemeny's constant, there are a few choices we could make for how to represent a random walk on the weighted network. One option is to simply normalize the rows of the matrix W, producing a stochastic matrix. However, in the disease-spread context, this may not be appropriate. Instead, to account for individuals with less-risky contacts overall, we implement an adjusted random walk, in which for certain nodes there is a nontrivial probability that the random walker stays in place in the next time-step (in our application, this corresponds to assuming that an infectious individual does not infect another individual). To do this, we find the maximum row-sum of W and normalize the entire matrix by that value. Then we replace the diagonal entry of each row i by the difference between 1 and the i-th row-sum, so as to produce a stochastic matrix. See Fig. 5 for the stochastic matrix representing the adjusted random walk for this example with these weights. The advantage of this normalization is that all contacts having the same duration are eventually associated with the same probability of spreading the virus in the transition matrix; without the diagonal correction this is not guaranteed to occur, due to the normalization steps required to make transition matrices row-stochastic. With this transition matrix in hand, we compute the weighted versions of these indicators using this representation of the Markov chain. For the sake of comparison, we also compute Kemeny's constant for the weighted random walk (without the diagonal correction).

All ranking indicators group the nodes of this network into at most three groups: {2, 9}, {5, 6} and {1, 3, 4, 7, 8, 10}, where nodes in the same group receive the same rank. The results are as follows:

• All unweighted indicators group {2, 5, 6, 9} together as being equally the most critical nodes, while the remaining nodes rank second.
• Both the weighted node degree and the weighted version of Kemeny's constant for the adjusted random walk rank {2, 9} as most critical, {1, 3, 4, 7, 8, 10} next, and {5, 6} as least critical.
• The weighted betweenness centrality measure ranks {2, 9} as the most central nodes in the weighted network, and ranks all other nodes equally.
• The second-largest eigenvalue measure for the adjusted random walk ranks {2, 9} as most critical, then {5, 6}, then the remaining nodes {1, 3, 4, 7, 8, 10}. Interestingly, Kemeny's constant for the weighted random walk without the diagonal correction ranks the nodes in the same way.

Intuition and visual inspection of the graph are sufficient to determine that the unweighted indicators fail to properly rank the nodes, because nodes {2, 9} are more critical than {5, 6} (the virus is more likely to spread from one community to the other through the {2, 9} link, as the duration of that contact is longer than that of {5, 6}). Indeed, all indicators based on the weighted graph reach a consensus in indicating nodes {2, 9} as the most critical. Conversely, there is a discrepancy as to which should be the second most critical set of nodes (i.e., Kemeny's constant indicates {1, 3, 4, 7, 8, 10}, while the weighted betweenness centrality measure indicates {5, 6}). Accordingly, we run disease-spread simulations on this network so as to establish which is correct. As mentioned already, this network is too small to obtain meaningful results from the heuristic outlined in the previous section.
Instead, we do the following: for each node, we remove it from the network, infect one other node chosen at random, and run an infection simulation according to the probabilities given in W. We take note of the number of days it takes for the disease to spread to all of the remaining 9 nodes in the network. We run this same simulation 1000 times and record the average number of days for the disease to spread to the entire network. Thus, for each node in the network, we have a simulated measure of how "critical" it is in the spread of disease through the network: the higher the number of days for the disease to spread with node i removed, the more critical node i must be to the spread in the underlying network shown in Fig. 4. Note that in 1000 simulations on the full weighted network, the average number of days for the disease to spread to all ten nodes is 7.86 days. Removing some nodes might either speed up or slow down the spreading of the disease. If we assume that the averages of the 1000 stochastic simulations of the epidemic spreading in the network provide the "correct" ranking of the nodes, then it is evident that indicators that were very convenient in the unweighted case, most notably those based on betweenness centrality, fail to correctly rank the nodes in the weighted case. If we further consider that extensions of other indicators to the weighted case are not straightforward (e.g., random walk betweenness), it appears that when weighted graphs can be reconstructed (i.e., when a contact tracing app is able to store relevant information such as the durations of contacts, or the distances at which contacts occur), indicators based on adjusted random walks, such as Kemeny's constant, are the most convenient choice.
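A sketch (ours) of this node-removal experiment; W is the 10 × 10 matrix of transmission probabilities from Fig. 4, which we do not reproduce here, and the code assumes the network remains connected after the removal, so that full spread is eventually reached:

```python
import random
import numpy as np

def days_to_full_spread(W, removed):
    """Days until all nodes except `removed` are infected, starting from one
    random infection; transmission uses the daily probabilities in W."""
    alive = [v for v in range(W.shape[0]) if v != removed]
    infected = {random.choice(alive)}
    days = 0
    while len(infected) < len(alive):
        days += 1
        for i in list(infected):
            for j in alive:
                if j not in infected and random.random() < W[i, j]:
                    infected.add(j)
    return days

def removal_criticality(W, runs=1000):
    """Average spreading time with each node removed; larger = more critical."""
    return {v: float(np.mean([days_to_full_spread(W, v) for _ in range(runs)]))
            for v in range(W.shape[0])}
```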
Conclusions

Inspired by the recent problems arising in the context of the COVID-19 pandemic, most notably in terms of who should be tested and who should be vaccinated first, this manuscript has reviewed some of the most popular ranking methodologies for identifying the importance of nodes in networks of individuals. While the dynamics of COVID-19 have been simplified, it is still possible to observe that significantly different results may be obtained if different ranking methods are adopted to select the most suitable individuals for tests or vaccination. In particular, while the actual effectiveness of a strategy depends on a number of variables (most notably, the modularity exhibited by the network of individuals, the average number of daily contacts, and the number of available tests), it is possible to appreciate that, for some combinations of such variables, indicators like algebraic connectivity, betweenness centrality, the second largest eigenvalue in modulus, Kemeny's constant and random walk betweenness may actually be twice as effective as other indicators (e.g., PageRank or node degree) in abating the number of infected individuals.

The comparison becomes even more interesting under the assumption that the durations of contacts may be measured and shared, in which case weighted graphs can be considered. Indeed, not all the aforementioned indicators can be generalized to work in this context. In addition, indicators that appeared to perform very well in the unweighted case, most notably betweenness centrality and the second largest eigenvalue, fail to correctly rank nodes in the proposed simple network, which is a weighted revisitation of the classic Newman network. Other indicators that explicitly take the (weighted) transition matrix into account, such as Kemeny's constant, appear to be the most suitable for correctly ranking the nodes.

As mentioned in the introductory section, the problem of ranking nodes according to their "importance" arises in a number of different contexts, and different solutions have been proposed, and have become common practice, in different communities. Motivated by the recent concerns around COVID-19, this manuscript has attempted to fairly compare some of the most popular methodologies in a single application, with the ultimate objective that this comparison may inspire more work and more discussion in the field of graph theory, and ultimately provide valuable support for testing and vaccination policies.

References

1. Integrated Security Analysis on Cascading Failure in Complex Networks.
2. Networks and Power Systems.
3. Graph theory methods: applications in brain networks.
4. Forecast and control of epidemics in a globalized world.
5. Network Science.
6. Kemeny-based testing for COVID-19.
7. Centrality in Networks: Finding the Most Important Nodes.
8. Analyzing complex networks through correlations in centrality measurements.
9. Consistency and differences between centrality measures across distinct classes of networks.
10. Dynamics and Control of Diseases in Networks with Community Structure.
11. A measure of betweenness centrality based on random walks.
12. The architecture of complex weighted networks.
13. Node centrality in weighted networks: Generalizing degree and shortest paths.
14. A Set of Measures of Centrality Based on Betweenness.
15. Centrality in social networks: conceptual clarification.
16. Betweenness centrality in a weighted network.
17. Betweenness centrality measures for directed graphs.
18. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab Technical Report 1999-66; 1999.
19. Google's PageRank and Beyond: the science of search engine rankings.
20. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems.
21. Random Walks on Complex Networks.
22. Random walk centrality and a partition of Kemeny's constant.
23. Estimating random walk centrality in networks.
24. Algorithms and Models for Network Data and Link Analysis.
25. Kemeny's constant and the effective graph resistance.
26. Resistance distance.
27. Finite Markov Chains. Van Nostrand.
28. The Kemeny constant of a Markov chain.
29. Kemeny's Constant and the Random Surfer.
30. A Google-like model of road network dynamics and its application to regulation and control.
31. Spectral Graph Theory.
32. On the structure of stochastic matrices with a subdominant eigenvalue near 1. Linear Algebra and its Applications.
33. Clustering behaviour in Markov chains with eigenvalues close to one.
34. Reversible Markov Chains and Random Walks on Graphs.
35. Laplacian of graphs and algebraic connectivity.
36. Old and new results on algebraic connectivity of graphs.
37. Algebraic connectivity of directed graphs. Linear and Multilinear Algebra.
38. On the dynamics of deterministic epidemic propagation over networks.
39. Finding and evaluating community structure in networks.
40. Collective dynamics of 'small-world' networks.