key: cord-0181399-cxvdextn authors: Floros, Dimitris; Liu, Tiancheng; Pitsianis, Nikos; Sun, Xiaobai title: Using Graphlet Spectrograms for Temporal Pattern Analysis of Virus-Research Collaboration Networks date: 2020-09-01 journal: nan DOI: nan sha: 8eeaab71359a1e495af76355ff95fe73e41341f8 doc_id: 181399 cord_uid: cxvdextn We introduce a new method for temporal pattern analysis of scientific collaboration networks. We investigate in particular virus research activities through five epidemic or pandemic outbreaks in the recent two decades and in the ongoing pandemic with COVID-19. Our method embodies two innovative components. The first is a simple model of temporal collaboration networks with time segmented in publication time and convolved in citation history, to effectively capture and accommodate collaboration activities at mixed time scales. The second component is the novel use of graphlets to encode topological structures and to detect change and persistence in collaboration activities over time. We discover in particular two unique and universal roles of bi-fork graphlet in (1) identifying bridges among triangle clusters and (2) quantifying grassroots as the backbone of every collaboration network. We present a number of intriguing patterns and findings about the virus-research activities. I. INTRODUCTION We present a new approach for uncovering and analyzing temporal patterns of author collaboration networks over different time periods. Author collaboration networks are among the most studied over a half century [1] [2] [3] [4] . We make three key contributions. First, we introduce a live literature graph (LG) we created in March 2020 [5] and have made frequent updates since. The literature body is mainly on research of various viruses, including HIV, SARS, Swine flu, MERS, Ebola, Avian flu and the coronavirus responsible for the ongoing, ravaging pandemic with COVID-19. The LG contains articles that date back to 1744. For temporal pattern detection and analysis, we extract 6 open citation networks over different time windows, 6 related article-author bipartite graphs and 6 author collaboration networks. The basic information is in Table 1 . We apply our temporal analysis approach to these networks and use them in turn as real-world network examples. We describe the LG in Section II. Secondly we introduce an original method for constructing author collaboration networks with time segmented in publication time and convolved (overlapped) in citation history. This is a significant deviation from conventional methods, in which temporal changes in data was neglected in static analysis, or simplistically sliced by time windows at a macroscopic scale, or regulated by timedependent models at a microscopic time scale. We describe The first two authors contributed equally to this work. (1) (2) (3) (4) (5) (6) (7) Thirdly, we use graphlets as coding elements for encoding topological structures and dynamic changes of live networks. In Fig. 2 Fig. 2 : A dictionary of 5 graphlets with 1 to 3 nodes. There are the singleton 0 , edge 1 , bi-fork 2 ( 1,2 ), 2-path 3 ( 2 ) and triangle 4 ( 3 a.k.a. 3 ). In each graphlet, the red square node specifies the designated incidence node, nodes in light red are automorphic to the incidence node. wavelets for spectro-temporal analysis of signal processing [6] , shapelets for time series classification [7] , super-pixels for image analysis [8] , and n-grams for natural language processing [9] , [10] . Network analysis using graphlets has advanced in recent years, since the original work by Pržulj et al in 2004 [11] [12] [13] [14] . Graphlets are mostly used for statistical characterization and modeling of entire networks. We recently established a new way of using graphlets to encode microscopic structure at vertex neighborhood to macroscopic structure such as cluster configurations [15] . In this work, we succeed in using graphlet encoding schemes to detect changes and persistence across author collaboration networks at different epochs, while the conventional approach based on degree distributions is short of such differentiation capacity, as shown in Fig. 4 . We present in Section V a few remarkable findings. LG-COVID19-HOTP We introduce briefly a particular real-world literature graph, LG-covid19-HOTP [5] , created by three of the authors in early March, released onto the Internet on March 26, 2020, and updated weekly ever since. 1 The literature body, centering on virus research, is composed of COVID-19 scholarly articles, their precursor and contemporary studies on viruses. Our data collection starts with a set of seed articles (not a single ego article) and makes backward (citing) and forward (being cited) spans from several very large literature databases [16] . In principle, the collection is by preferential attachment [17] . Detailed collection information can be found at the home website. There are several other COVID-19 themed literature datasets 2 , some of them are absent of citation links. We provide in Fig. 3 and Table 1 the basic quantitative information of LG-covid19-HOTP. The number of articles and the author population increase steadily, without noticeable bursts. The dips in recent two years are due to update latency in the databases our collection relies on. We expect, however, a burst in 2020 by the current collection up to June. Our temporal pattern analysis rests on recognizing and exploring important properties of the literature graph. LG-covid19-HOTP is of multiple attribute dimensions, or a multiplex network. It has several types of vertices/nodes and several types of edges/links. The primary nodes are articles; the primary edges are citation links from citing articles to cited articles. The adjacency matrix for the citation network among Table 1 . The bin size is 4 except the last one that includes all authors with degree above 800. The network associated with COVID-19 is over a time window of only 6 months up to the present, 5 times as small as the 3-year periods for the others. The burst at 100, common to all 6 histograms, is an artifact caused by the cutoff threshold at 100 set by Scopus (and similarly by other data sources) over the number of authors per article. The artifact is also manifested in Fig. 8 and explained in detail there. Observation. All six degree distributions follow the power-law pattern. A synthetic power law curve with parameter = 2.7 is provided and shown in black to serve as a reference. the article nodes is nearly upper triangular, where the articles are ordered chronologically, the first row identifies with the earliest article. We depict the citation adjacency matrix at the top-left of Fig. 1 . Although primary, the article nodes are actually the results of actions by the author nodes. The LG contains the bipartite between article nodes and author nodes. An article is written by one or more authors; and an author is connected to one or more articles. We depict the bipartite incidence matrix at the top-right of Fig. 1 . Via the bipartite, we get author collaboration network/adjacency matrix at the bottom-right of Fig. 1 . Other nodes represent author affiliations, author profiles, semantic entities in titles, abstracts and text bodies, and data figures and tables. More importantly, the LG is a live network, changing and evolving incessantly, with growth and collectively selected memory, not ephemeral nor static. However, author collaboration networks had been largely made static (by time integration). The publication timestamp with each article records the birth time of the article, which may herald a new path or trajectory. The bibliographical references are links to selected precursor work in time as well as in concept. Some other earlier work may be forgotten for a while or for good, not as indelible as seemed in data records. We investigate on temporal differentiation and persistence across author collaboration networks at different epochs. The conventional static collaboration network is constructed from the article-author bipartite deprived of temporal activity information. We introduce instead epoch-centered triad citation networks to enable temporal pattern analysis at mixed time scales, in adaptation to continuous and new collaboration activities. We describe our model, its properties, computational procedure and underlying rationale. Let be a time window or epoch. Denote by Core the set of articles published in the period. See the citation matrix in (a) of Fig. 1 . The core articles cite each other in the same period, represented by the matrix block Core on the diagonal. This is a closed network. In reality, core articles cite articles in set Cout (outward links in the same column block), and they are cited by articles in set Cin (inward links in the same row block). We term such open network as an epoch-centered triad network. We take the sub-bipartite with the article vertices from the triad network on the one side and the involved authors on the other. See the bipartite in (b) of Fig. 1 . We partition the involved authors into seven cohorts. The septa-partition is shown in the Venn diagram in (c) of Fig. 1 . In particular, the authors in cohort (4) are active not only during the epoch , but also before and after the epoch by one hop in citation. We refer to this cohort as the persistent cohort. Often, dynamic network analysis takes one of the extremes in dealing with time scale [4] , [18] , [19] . At the one extreme, one assumes a closed network with fixed boundary and a dynamics model that describes internal change at microscopic time scale, subject to initial condition, boundary condition, and some additional regulation condition. At the other extreme, a dynamic network is sliced into multiple ones by nonoverlapping time windows at a macroscopic time scale. No time overlap nor memory/impact among the sliced networks; no finer temporal resolution to differentiate within each. Each time-sliced network is then treated as static; subsequent analysis across the networks is subject to the fixed time resolution by the window size. We reason differently. Literature graphs are dynamic, but not on an assumed uniform scale. Collaboration activities take place at various and mixed time scales, similar to many social networks, dissimilar to those physical-sensor networks with built-in clocks or biological networks with intrinsic circadian rhythms. Our model of epoch-centered triad citation networks is simple, and innovative in using a data-adaptive meta resolution (e.g. adapt to each epidemic period) in order to capture and accommodate the variation in time resolution or scale. By our schema, the networks are open, not only time shifted but also permitting temporal convolution and dilation. The temporal convolution is by one-hop topological links in citation to precursors and successors. It is time dilated in citation, not closed to or confined within an imposed time window. These properties of triad citation networks are transported at ease to the collaboration networks, each of which is consequently time-segmented within, at the meta time scale. Based on the model, we are able to investigate connection, continuation, differentiation, and deviation across networks centered at different epochs. The conventional network characterization and correlation based on degree distribution have limited capacity to reveal or differentiate temporal and topological relations among time-segmented collaboration networks, as shown evidently in Fig. 4 . We introduce in this section how we differentiate the collaboration networks at three granularity levels, and associate them as well, via topological encoding with graphlets. We review generic graphlets and graphlet dictionaries by their forms and attributes. A graphlet is a connected graph with a small vertex set and a designated node to be incident with. We show in Fig. 2 a dictionary of 5 (undirected) graphlets, Σ 5 = { } =0:4 . The dictionary contains small subgraph patterns: singleton, edge ( 2 ), binary fork ( 1,2 ), 2-path ( 2 ), and triangle ( 3 , 3 ). (2) is marked by red boundary lines. A pixel in brighter color indicates a higher frequency value. Observation: In each of the top two spectrograms, the persistent cohort (4) has higher concentration of vertices with higher 2 frequencies and 3 frequencies. In fact, vertices with higher 2 frequencies have higher centrality positions, see Section V. The persistent cohort (4) is missing in the spectrogram at the bottom, as the network is in its developing stage. The designated incidence node is shown with a red square, up to an isomorphic permutation (shown in red circles). Graphlets with the same number of nodes form a family with an partial ordering. For example, 2 , 3 and 4 are a family of trinode graphlets. The partial ordering 2 , 3 ≺ 4 is by the relationship that 2 and 3 are subgraphs of 4 . Specific to any undirected graph = ( , ), we obtain at every vertex ∈ a graphlet frequency vector of length |Σ|, the element-of which is the number/frequency of -pattern induced subgraphs that are incident to , = 0, 1, · · · , |Σ|−1. In other words, we make a transform of graph to a field of vectors over . The vectors encode, with graphlets as the coding words/elements, the topological and statistical information of the graph. The dictionary Σ 2 = { 0 , 1 } encodes the very basic information. However, it limits network analysis to the ordinary degree distributions, types, correlations and models [20] , [21] . We use the dictionary Σ 5 with much greater coding capacity, with little cost in computation. The relationship among the graphlets, computation formulas, and complexities are detailed in [15] . We describe the graphlet frequencies in a graph = ( , ) with the help of a vertex-graphlet incidence structure. Denote by = ( , Σ; ) the bipartite between the graph vertices and the graphlets, ⊂ ×Σ. There is a link ( , ) between a vertex ∈ and a graphlet ∈ Σ if is an incident node on an induced subgraph of -pattern. The incident node on a graphlet is uniquely specified, up to an isomorphic mapping. For example, graphlet 4 (clique 3 or cycle 3 ) in Fig. 2 is an automorphism. There may be multiple links between and . We denote them by a single link ( , ) with a positive integer weight ( ) for the multiplicity, which is the frequency with graphlet . However, the multiplicities from vertex to multiple graphlets in the same family are not independently determined. For example, the multiplicities on links from vertices to 2 do not include those to sub-graphlets within 4 . The weight on ( , 1 ) is counted independently as 1 has no other family member. We describe formally the transformation of to the frequency vector field over , with = |Σ| − 1, For any , 0 ≤ < , ( ) is the frequency of patternsubgraphs incident at . In particular, 0 ( ) = 1, 1 ( ) is the degree of on graph . The descriptor f ( ) encodes the topological structure of the neighborhood of vertex , with the graphlets as the coding words/elements. We use the longhand notation for node-wise descriptor f ( | ) when necessary. We present graphlet spectrograms of three epoch-associated collaboration networks, labeled respectively as SARS, MERS and COVID-19 in Fig. 6 . The descriptors are color-coded and placed in columns, with nodes in the same ordering as in Fig. 5 . The graphlet dependencies, fast and exact transform formulas and complexity analysis are detailed in [22] . In implementation, we use GraphBLAS [23] to exploit sparse patterns and Cilk [24] for multi-threaded programming. = , the two networks and may still differ in topological connections, rendering non-empty symmetric difference between the edge sets, However, the size of the symmetric difference between the edge sets lacks the capacity to characterize and differentiate the changes in collaboration patterns. We measure the discrepancy (discordance), or agreement (concordance), in topological structures between and , using the graphlet spectral descriptors. We measure the discrepancy at three granularity levels. Define the relative differ- 3: Agreement scores by 1 − , at the cohort level, between MERS as the comparison target and the other networks as the references, along with aggregation weights (in red) of (3). In the left table, = { } =1:7 is the septa-cohort-partition for the target MERS, the scores are 1− ( , ), where changes across the columns associated respectively with the reference networks. In the right table, is the septa-cohort-partition of each target network, the scores (in black) are 1− ( , ). The weights (in red) are based on the cohort sizes in Table 4 . The last rows in both tables are the components of (4). The sums give entries of Table 2 , which summarizes all such comparisons at cohort level. Key observation. The cohorts of MERS mostly remain in network Ebola (by the left table), and are more influential to the developing network COVID-19 (by the right table) TABLE 4 : Sizes of 7 author cohorts in each of the 6 collaboration networks associated with the labeled epochs of Table 1 , with cohorts grouped by the septa-partition of (c) in Fig. 1 . They are used in the weighting scheme for topological discrepancy aggregation of (3) and Tables 2 and 3. The size of the largest connected component is in the last row. Observation. The total number of authors with each of the networks increases steadily with the forward shift in the epochs. We expect a burst with COVID-19 as the network already has a massive size while in its early developing stage (and therefore the absence of cohorts (3)- (6)). ence between two vectors a and b as rdiff = |a − b|./|a + b|, where ./ denotes the Hadamard division, i.e., element-wise division. Let ′ be a graph. If / ∈ ( ′ ), we set f ( | ′ ) = 0. We measure the relative difference in topological connectivity at each vertex between and , where |a| is the order-Hölder mean of vector a with its elements in absolute values. The weights are predetermined on the graphlets in the dictionary; equal weights are used for empirical study later. Strictly speaking, the difference f ( | ) − f ( | ) characterizes the change in the topological neighborhoods of vertex in graph and graph . The first element, associated with the singleton graphlet 0 , accumulates to the difference in the vertex set between and . The second element reflects the difference in the ordinary degree at every vertex. When graphlet , > 0, is absent in both vertex neighborhoods, element-in the relative difference is zero. We aggregate the local changes to the vertex subsets by the septa-partition and then to the entire vertex set of , The weights in (3a) are equal. The weights in (3b) increase monotonically with | |/| | and sum to 1. Similarly, we define ( , , ), ( , ), 1 ≤ ≤ 7, and ( , ). We note the inequalities at many vertices, Fig. 7 : Graphlet frequencies of 2 (bi-fork 1,2 ) against 4 (cycle 3 ), in logarithmic scatter plots, among the author vertices in the persistent cohort (4), one for each of the 6 collaboration networks, except that Cin is empty in network COVID-19. See Table 4 for their respective sizes. The points in red are common to all 6 persistent cohorts across the networks, they distinguish and highlight 103 researchers who are continuously active and influential in virus research through two decades or longer. Among them are S. J. M. Peiris (at the top in darker red) and A. S. Fauci (in bright green). Observation. First, the dynamic persistence information lost in a static collaboration network and analysis is uncovered by our novel analysis method. Secondly, we note in particular the phenomenon, common to all 6 networks, that the authors in red tend to have relatively higher 2 -frequencies than the others in gray in the same local range of 4 -frequencies. See further elaboration in Section V. (i) Author nodes with the same degree of 1 (in coordinate) differ greatly in 2 -frequencies by over an order of magnitude. An author with higher 2frequency makes more connections among different author (triangle) clusters on different articles. (ii) Nodes with higher 2 -frequencies include the hubnodes (with top degrees) but also significantly outnumber in accumulated 2 -frequencies those of the hub-nodes by orders of magnitude. They are responsible for the grass-root connectivity of a network. (iii) The difference between 2 and 1 is incidentally magnified by the artifact of 2 gaps/voids, the most noticeable one is around location 100 of degree ( 1 ). Some articles of scientific experiments have many contributing authors, above a threshold typically set by data record registration. In particular, the threshold is 100 by Scopus. The retained co-authors have at least degree 100 each, many of them may not be co-authors of other articles in the LG database and thus the 2 gap at 100. The co-authors rounded off have their 2 counts reduced. We start with the pairwise comparisons in Table 2 among the networks at the coarsest granularity. To see the comparisons of one network with the others in the upper or lower triangular, follow the row and column by the same label (such as MERS) and take the turn at the diagonal. The intersection set sizes are consistently increasing with time forward. However, the agreement profiles are not necessarily monotonically related. Specifically, among the first five, the agreement scores of SARS (or Avian flu) with the rest decrease with time distance; of MERS, peaked with Ebola, and vice versa; of Swine flu, peaked with MERS. The developing network labeled COVID-19 is more influenced by more recent networks. The next comparisons are at the cohort level. We see a few phenomena in Table 3 . The size of cohort (3) is larger with network more distant from the present, although the network author size | | is not monotonically increasing. This correlates reasonably with the reverse trend in cohort (1). Together they indicate the lasting and expanding impact of the earlier work by collective selection and memory of the research community. We highlight our discovery and understanding of the topological betweenness encoded by 2 , the bi-fork graphlet, which contributes what lacks with 1 (degree) and 4 (triangle). In Fig. 6 we point out a remarkable connection at the persistent cohort (4) between temporal segmentation and spectral differentiation. In Fig. 7 we detail the connection with the scatter plot of 2 -frequencies against 4 -frequencies over the persistent cohort. The triangle is a celebrated motif for the cluster coefficients, a centrality measure. In fact, the modest bi-fork plays the critical role of connecting those triangle clusters. An author has a triangle connection with every other two co-authors. All authors of the same article gain the same amount of triangle count and degree count as well. In contrast, an author is at the root of a bi-fork only if the author has more than one articles with different sets of authors. That is, a bi-fork vertex connects two triangle clusters. The 2 uniquely encodes such betweenness. Furthermore, we take the intersection of all persistent cohorts over time, and find 103 authors. Among them are Malik Peiris, the first person to isolate the SARS virus, and Anthony S. Fauci, well known for his research on HIV and other infectious diseases. These authors have continued presence and influence in two decades or longer. They typically have higher 2 frequencies than those at the same 4 frequencies. One may use a similar approach to measuring temporal-topological betweenness. We provide in Fig. 8 solid evidence of 2 playing another important role in quantifying the grassroots as the backbone of network connectivity. The data points in all 6 plots show unambiguously that while the elite club of hub nodes (with top degrees) is credited for the small-world phenomenon, the collaboration network rests largely on the 2 bridge builders. This finding implies that a conclusion drawn solely from the 1 degree information could be much biased to the hub nodes and prone to changes at the hubs. Our analysis suggests that a network containing larger mass of nodes with high 2 frequencies is likely more robust to adversarial changes. We add that, in its unique roles, the bi-fork graphlet must co-exist with the triangle graphlet, the count of bi-forks at each vertex excludes those in any triangle, see Section IV-A. Our analysis results are readily interpretable. The remarkable findings reported above are beyond the reach of conventional approaches. Our method suggests new ways to investigate the dynamics of author collaboration. And what is your Erdos number? On properties of a well-known graph or what is your Ramsey number? The structure of scientific collaboration networks Twenty years of network science: A bibliographic and co-authorship network analysis LG-covid19-HOTP: Literature graph of scholarly articles relevant to COVID-19 Study Wavelets and signal processing Time series shapelets: A new primitive for data mining Learning a classification model for segmentation A mathematical theory of communication Prediction and entropy of printed English Modeling interactome: Scalefree or geometric Revealing the hidden language of complex networks Graphlet-based characterization of directed networks Efficient graphlet kernels for large graph comparison Measures of discrepancy between network cluster configurations using graphlet spectrograms Construction of the literature graph in Semantic Scholar Coauthorship networks and patterns of scientific collaboration The Theory of Evolution and Dynamical Systems: Mathematical Aspects of Selection The Structure and Dynamics of Networks Effect of correlations on network controllability Fast graphlet transform of sparse graphs Graph algorithms via SuiteSparse: GraphBLAS: Triangle counting and K-truss Cilk: An efficient multithreaded runtime system Acknowledgements. This work is partially supported by grant 5R01EB028324-02 from the National Institute of Health (NIH), USA, and EDULLL 34, co-financed by the European Social Fund (ESF) 2014-2020. We are grateful to the reviewer who made numerous suggestions on improving the manuscript quality. We also thank Thaleia-M. Passia for helpful comments.