Co-citation and Co-authorship Networks of Statisticians

Pengsheng Ji, Jiashun Jin, Zheng Tracy Ke, Wanshan Li

April 24, 2022

Abstract. We collected and cleaned a large data set on publications in statistics. The data set consists of the coauthor relationships and citation relationships of 83,331 papers published in 36 representative journals in statistics, probability, and machine learning, spanning 41 years. The data set allows us to construct many different networks, and motivates a number of research problems about the research patterns and trends, research impacts, and network topology of the statistics community. In this paper we focus on (i) using the citation relationships to estimate the research interests of authors, and (ii) using the coauthor relationships to study the network topology. Using the co-citation networks we constructed, we discover a "statistics triangle", reminiscent of the statistical philosophy triangle (Efron, 1998). We propose new approaches to constructing the "research map" of statisticians, as well as the "research trajectory" of a given author, to visualize the evolution of his/her research interests. Using the co-authorship networks we constructed, we discover a multi-layer community tree and produce a Sankey diagram to visualize the author migrations among different sub-areas. We also propose several new metrics for the research diversity of individual authors. We find that "Bayes", "Biostatistics", and "Nonparametric" are three primary areas in statistics. We also identify 15 sub-areas, each of which can be viewed as a weighted average of the primary areas, and identify several underlying reasons for the formation of co-authorship communities.
We also find that the research interests of statisticians have evolved significantly in the 41-year time window we studied: some areas (e.g., biostatistics, high-dimensional data analysis) have become increasingly popular.

In the past decades, the size of the scientific community has grown substantially. This rapid growth motivates many interesting Big Data projects, one of which is how to use the vast volume of publications of a scientific field to delineate a complete picture of the research habits, trends, and impacts of that field. Such studies are useful for examining national and global publication-related activities, ranking universities, and making decisions about funding, promotions, and awards.

There are two main approaches to studying scientific publications: the subjective approach and the quantitative approach. The subjective approach is more traditional, but it is time-consuming and susceptible to bias. The quantitative approach (which uses statistical tools to analyze such data) is comparatively inexpensive, fast, objective, and transparent, and will play an increasingly important role (Silverman, 2016). From a statistical standpoint, most existing quantitative approaches are overly simple, relying on preliminary metrics (e.g., counts of papers or citations). The h-index and the journal impact factor are examples of more sophisticated approaches, but they are still not principled statistical methods. Statistical modeling of publication data is a significantly underdeveloped area, with only a small number of interesting papers, sparsely scattered over the spectrum, each typically focusing on a specific problem. On the other hand, this can also be viewed as a golden opportunity for statisticians.
The publication data provide a valuable data resource, important problems in science and social science, and interesting Big Data projects that demand sophisticated statistical tools. Having seen such an opportunity, Hall encouraged statisticians to take a more active role in such research (Hall, 2011). Hall's viewpoint is shared by Donoho (2017), among others. In his illuminating paper "50 Years of Data Science" (Donoho, 2017), Donoho predicted that "science about data science" will become one of the major divisions of data science, and that one task of this division is to evaluate scientific research outputs. This paper is a response to the call by Hall and others. We contribute a large-scale, high-quality data set on the publications of statisticians and use it to showcase how modern statistical tools can be employed to analyze such data.

A new data set on the publications of statisticians. We present a new data set on the publications of statisticians, which we collected and cleaned ourselves with enormous effort. The data set consists of the coauthor relationships and citation relationships of 83K research papers published in 36 representative journals in statistics, probability, machine learning, and related fields, spanning 41 years. See the table below. More information about these journals is presented in Table B.1 of the supplement.

    #journals    time span    #authors    #papers
    36           1975-2015    47,311      83,331

One might think that such a data set is easy to obtain, as BibTeX and citation data seem easy to download. Unfortunately, when a large-volume, high-quality data set is needed, this is not the case. For example, the citation counts from Google Scholar are not always accurate, and many online resources do not allow large-volume downloads. Our data were downloaded from a handful of online resources by techniques including, but not limited to, web scraping.
The data set was also carefully cleaned by a combination of manual effort and computer algorithms we developed. Both data collection and cleaning are sophisticated and time-consuming processes, during which we had to overcome a number of challenges. For a detailed discussion of data collection and cleaning, see Section B.2 of the supplement.

Results, findings, and challenges. First, we overview the results. Our data set provides rich material for research and motivates many interesting problems about the research trends, patterns, and impacts of the statistics community. In this paper, we focus on two topics: (1) how to use the citation data to estimate the research interests of statisticians, and (2) how to use the coauthorship data to study the network topology of statisticians.

Section 2 studies the first topic. How to model the research interests of an author is an open problem in bibliometrics. Our idea is to first use the co-citation relationships to construct a citee network, and then model the research interests of an author as the mixed memberships he/she has over different network communities. This gives rise to the degree-corrected mixed-membership (DCMM) model (Jin et al., 2017). Such a framework allows us to use principled statistical tools to attack problems about research interests. Specifically, we develop new models, methods, and theory for (i) estimating the research interests of authors, (ii) clustering authors by research interests, (iii) studying how the research interests of an author evolve over time, and (iv) measuring the research-interest diversity of individual authors. We discover a "Research Map" (a cloud of points in R^2, each representing the research interests of an author), which consists of a "statistics triangle" and 15 sub-area clusters.

Last but not least, as statisticians, we know partial ground truth about our own community.
For this reason, our data set may provide a benchmark for comparing different methods in statistics, machine learning, and especially network analysis, and thus substantially aid the development of methods and theory in these areas.

Our study has (potential) impacts in science, social science, and even daily life. It provides an array of ready-to-use and easy-to-extend statistical tools that administrators, award committees, and individuals can use to study the research profile of an individual, an area, or the whole statistics community. For example, suppose a committee wishes to learn the research profile of an individual researcher. Our study provides a long list of tools to help characterize and visualize that profile: the researcher's interests and position on the Research Map, his/her research-interest trajectory, the network community to which he/she belongs, his/her research diversity in terms of citations and in terms of co-authorship, his/her personalized networks, the importance of his/her research area, and his/her research impact and ranking relative to peers. Such information is not available from a curriculum vitae or a Google Scholar profile, and can be very useful to award committees or administrators for decision making. Our study also provides a useful guide for researchers (especially junior researchers) in selecting research topics, looking for references, and building social networks. It also helps with several important problems in science and social science: characterizing the evolution of research, predicting emerging communities and significant advancements in each research area, checking whether the development of different areas is balanced, and identifying unknown biases in publications. We discuss these in more detail in Section 4.
As disclaimers, note that we have to use real names, as our data are about real-world publications, but we have not used any information that is not publicly available. It is not our intention to rank a researcher (or a paper, or an area) over others. While we tried very hard to create a high-quality data set, the time and effort one can invest are limited, and so is the scope of our study; as a result, some of our results may have biases. Our paper can be viewed as a starting point for an ambitious task, in which we create a research template that researchers in other fields (e.g., economics) can use, together with statisticians' expertise in data analysis, to study their own fields. Despite these limitations, the main contributions of our paper remain valid. See Section A of the supplement for a longer version of the disclaimers.

Contents. Section 2 studies co-citation networks, where the focus is on how to estimate the research interests of an author and how these interests evolve over time. Section 3 focuses on coauthorship networks; it studies hierarchical and dynamic community detection, and proposes two new diversity measures. Section 4 concludes.

A good understanding of the research interests of statisticians helps us understand the research trends, research impacts, and network topology of the statistics community, as well as the research profiles of individual statisticians. For example, suppose we are given an author with a total of 1000 citations. To decide whether he/she is highly cited, it is crucial to understand his/her major areas of interest, because the average citation count for a researcher in one area may be several times higher than that in another. The citation counts in our data set provide a valuable resource for studying research interests.
In this section, we consider four problems: (a) how to model the research interests of individual authors; (b) how to estimate these research interests and use the estimates for author clustering; (c) how to study the dynamic evolution of an author's research interests; (d) how to measure the diversity of an author's research interests. We propose new approaches to (a)-(d). Below is a sketch of our ideas.

Consider Problem (a) first. How to model the research interests of individual authors is an open problem. We observe that two authors being frequently cited together in the same papers (i.e., co-cited) indicates that their works are scientifically related and that they share some common research interests. Motivated by this, we propose the following approach to Problem (a). First, we use the co-citation relationships to construct an undirected network which we call the citee network (see Section 2.1). We assume that the citee network has K communities, each representing a primary research area in statistics (primary areas can be further divided into sub-areas). For author i, we model his/her research interests as a weight vector π_i ∈ R^K, with π_i(k) being the fraction of his/her interest in community k, 1 ≤ k ≤ K. We further model the citee network with the recent degree-corrected mixed-membership (DCMM) model, where the π_i are the vectors of mixed memberships. In a network, communities are tight-knit groups of nodes that have more edges within than between (Goldenberg et al., 2010). For example, suppose K = 3 and we have three communities, each being a primary area in statistics: "Bayes", "Biostatistics", and "Nonparametric". Suppose for author i, π_i = (0.5, 0.3, 0.2)'. In this case, we think of author i as having 50%, 30%, and 20% of his research interest or impact in these primary areas, respectively. The DCMM model is a recent network model (Jin et al., 2017; Zhang et al., 2020).
It models both severe degree heterogeneity and mixed memberships and is reasonable for the current setting. Let A ∈ R^{n,n} be the adjacency matrix of the citee network, where A(i,j) = 1 if i ≠ j and there is an edge between nodes i and j, and A(i,j) = 0 otherwise. As above, let π_i be the K-dimensional vector that models the research interests of author i, 1 ≤ i ≤ n. For a nonnegative, unit-diagonal matrix P ∈ R^{K,K} that models the community structure and parameters θ_1, θ_2, ..., θ_n > 0 that model the degree heterogeneity, we assume that the upper triangle of A contains independent Bernoulli variables, where for any 1 ≤ i < j ≤ n,

    P(A(i,j) = 1) = θ_i θ_j π_i' P π_j.    (2.1)

The problem of estimating the trajectory is related to the problem of dynamic mixed-membership analysis. Consider a sequence of citee networks, each for a different time window. We extend the DCMM model for static networks in (2.1) to dynamic networks, where π_i may vary with time. In such a setting, how to estimate π_i is largely an open problem. Related works include Kim et al. (2018) and Liu et al. (2018).

We construct a citee network using the co-citations during 1991-2000. We limit the time to 1991-2000 because we will later use this network as a reference network to study the research trajectories of selected authors. For each year t, 1991 ≤ t ≤ 2000, define a year-t weighted network where each node is an author, and for any two nodes i and j, the weight of the edge between them is the number of times that the papers by author i published between years t − 9 and t and the papers by author j published between years t − 9 and t have been cited together in a paper by another author published in year t. This results in a weighted adjacency matrix for year t. Summing the adjacency matrices for t = 1991, 1992, ..., 2000 gives rise to a weighted network. Let the degree of node i be the sum of the weights of the edges between node i and the other nodes.
We remove all nodes with degree smaller than 60, and define a symmetric unweighted network on the remaining nodes, where two nodes have an edge if and only if the weight between them in the previous network is at least 2. We call the giant component of this network the citee network for 1991-2000; it has 2,831 nodes (these nodes form a subset of the most active and most cited authors). There are different ways to construct the citee network (we have studied many options and recommend the one above). We restricted to "fresh" citations only (a citation from one paper to another is considered "fresh" if the two publication times are no more than 10 years apart). We removed low-degree nodes and low-weight edges in the intermediate weighted graph to reduce noise. In Section C.3 of the supplement, we also study the case where the threshold 60 is replaced by 50 or 70, and observe similar results (e.g., a similar triangle and research map for statisticians). Thresholding the edge weights is a common practice. It may cause some information loss, but since the goal is to identify active communities, it is unclear how such a loss would affect the results. Also, just as in different fields of science, the average citations (per paper or author) can vary dramatically across areas. For this reason, we might threshold the edge weights adaptively, with different thresholds for different areas; however, it is not immediately clear how to implement such an approach. We leave these studies to the future.

We wish to use this citee network to study the research interests of individual authors. We model the network with the aforementioned DCMM model (2.1). Under this model, each of the K communities can be interpreted as a research area, and the research interests of author i are modeled by the mixed-membership vector π_i ∈ R^K. Estimating π_i is known as the problem of mixed-membership estimation, for which we use the method mixed-SCORE (Jin et al., 2017).
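As a concrete illustration of the DCMM model (2.1), below is a minimal simulation sketch. All parameter values here (n, K, θ, Π, P, and the community names in the comment) are hypothetical choices for illustration, not the data set's estimated parameters:

```python
import numpy as np

def simulate_dcmm(theta, Pi, P, rng=None):
    """Draw a symmetric adjacency matrix from the DCMM model:
    P(A[i,j] = 1) = theta_i * theta_j * pi_i' P pi_j, for i < j."""
    rng = np.random.default_rng(rng)
    n = len(theta)
    # Edge-probability matrix Omega = Theta (Pi P Pi') Theta, clipped to [0, 1].
    Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
    Omega = np.clip(Omega, 0.0, 1.0)
    # Fill the upper triangle with independent Bernoulli draws, then symmetrize.
    upper = np.triu(rng.random((n, n)) < Omega, k=1).astype(int)
    return upper + upper.T  # symmetric, zero diagonal

# Hypothetical example with K = 3 communities (think "Bayes", "Biostatistics",
# "Nonparametric") and degree heterogeneity.
rng = np.random.default_rng(0)
n, K = 200, 3
theta = rng.uniform(0.2, 0.9, size=n)        # degree parameters theta_i
Pi = rng.dirichlet(np.ones(K), size=n)       # mixed-membership vectors pi_i
P = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])              # nonnegative, unit-diagonal P
A = simulate_dcmm(theta, Pi, P, rng=rng)
print(A.shape, A.trace(), bool((A == A.T).all()))
```

The clipping step only guards against parameter choices where θ_iθ_jπ_i'Pπ_j would exceed 1; under the model's usual scaling it is inactive.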
The approach uses the SCORE embedding, which embeds all authors into a low-dimensional space and provides a way to visualize the research interests of each author. Specifically, let ξ̂_1, ..., ξ̂_K ∈ R^n be the first K eigenvectors of the adjacency matrix. Each node i is embedded into a (K − 1)-dimensional space by the vector

    r̂_i = (ξ̂_2(i)/ξ̂_1(i), ξ̂_3(i)/ξ̂_1(i), ..., ξ̂_K(i)/ξ̂_1(i))'.    (2.2)

Now, first, the embedded points are approximately contained in a simplex with K vertices in R^{K−1}, where each vertex represents a community. Second, each embedded point r̂_i is approximately a convex combination of the vertices:

    r̂_i ≈ w_i(1) v_1 + w_i(2) v_2 + ... + w_i(K) v_K,

where v_1, v_2, ..., v_K are the vertices of the simplex. The weight vector w_i is an order-preserving transformation of π_i, in the sense that w_i ∝ π_i ∘ b, where ∘ is the Hadamard product and b ∈ R^K is a positive vector (not depending on i). Therefore, if an embedded point r̂_i is close to one vertex, then w_i is nearly degenerate (with only one nonzero entry, equal to 1), and node i is a pure node (i.e., node i is called a pure node of community k if π_i(k) = 1 and π_i(l) = 0 for all l ≠ k). If r̂_i is deep in the interior of the simplex, then all entries of w_i are bounded away from 0 and node i is highly mixed; see Jin et al. (2017) for more discussion.

Why K = 3 is the most reasonable choice. To use mixed-SCORE, we need to decide K, which is unknown. First, we use the scree plot of the adjacency matrix to narrow the range of K to [2, 6]. Second, we implemented mixed-SCORE for each K ∈ {2, 3, ..., 6} and investigated the goodness of fit, by checking whether the rows of R̂ fit the aforementioned (K − 1)-dimensional simplex structure (it is hard to visualize the simplex when K ≥ 4, so we plot two coordinates of the r̂_i's at a time to visualize a projection of the simplex onto R^2). Last, for each K, we manually checked the large-degree pure nodes in each community and examined whether the results fit our knowledge of the statistics community. This analysis suggests K = 3 as the best choice.
See Section C.2 of the supplement for details.

The statistics triangle. Since K = 3, the simplex in the SCORE embedding is a triangle, each vertex (perceivably) representing a primary statistical research area. See Figure 1. To interpret these areas, we apply mixed-SCORE to the citee network with K = 3, and obtain estimates π̂_1, π̂_2, ..., π̂_n of the membership vectors π_1, π_2, ..., π_n. We divide the nodes into three groups: if the largest entry of π̂_i is the kth entry, then node i is assigned to group k, 1 ≤ k ≤ 3. In Section C of the supplement, we investigate the research interests of the authors in each group, using the topic weights estimated from the abstracts of their papers. The analysis suggests that the three vertices represent three primary research areas: "Bayes," "Biostatistics," and "Nonparametric." This triangle is reminiscent of the statistical philosophy triangle of Efron (1998), whose three vertices are "Bayes", "Fisherian", and "frequentist". Efron argued that these are the three major philosophies in statistics, and that most statistical methodologies (e.g., the bootstrap) can be viewed as weighted averages of the three philosophies. Different from Efron's triangle, our statistics triangle is data-driven.

Figure 1: The research map. Each grey dot represents a 2-dimensional SCORE embedding vector r̂_i, 1 ≤ i ≤ n; the 15 clusters and the Voronoi diagram are obtained by applying the k-means algorithm to r̂_1, r̂_2, ..., r̂_n. The dashed green lines represent the triangle, whose vertices represent the 3 primary areas. In each cluster, the cluster center is also shown (blue crosses), together with the 5 authors with the highest degrees (blue dots). The results are based on citations: it is possible that an author does not work in an area but has many citations in that area.

The research map. Perceivably, we can further split each primary area into sub-areas, and a convenient approach is to use the SCORE embedding.
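The SCORE embedding can be sketched as follows. The two-block network below is simulated with hypothetical parameters; with K = 2 the embedding is one-dimensional and should separate the two blocks:

```python
import numpy as np

def score_embedding(A, K):
    """SCORE embedding: entrywise ratios of the leading K eigenvectors.
    Row i of the returned n x (K-1) matrix is r_i in the text's notation."""
    A = np.asarray(A, dtype=float)
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(vals))[:K]   # K largest eigenvalues in magnitude
    xi = vecs[:, order]
    return xi[:, 1:] / xi[:, [0]]           # r_i(k) = xi_{k+1}(i) / xi_1(i)

# Hypothetical two-block network (a simple block model, no mixed membership).
rng = np.random.default_rng(1)
n = 100
labels = np.repeat([0, 1], n // 2)
prob = np.where(labels[:, None] == labels[None, :], 0.50, 0.05)
upper = np.triu(rng.random((n, n)) < prob, k=1).astype(int)
A = upper + upper.T
R = score_embedding(A, K=2)
print(R.shape)
```

Taking entrywise ratios against the leading eigenvector is what removes the degree parameters θ_i, which is the point of SCORE.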
For each author i in the citee network, 1 ≤ i ≤ n, since K = 3, r̂_i can be viewed as a point in R^2. The distance between authors in this space is a measure of the closeness of their research areas. Therefore, it makes sense to further cluster the authors into sub-areas by applying the k-means algorithm to r̂_1, r̂_2, ..., r̂_n. We tried the k-means algorithm with L = 10, 11, ..., 20 clusters, and picked L = 15 because it gives the most reasonable results. We then apply k-means with L = 15 and obtain 15 clusters, each of which can be interpreted as a sub-area of statistics after a careful investigation of the research works of representative authors in the cluster (while we tried very hard to find a reasonable label for each cluster, we should not expect a simple label to explain the research interests of all authors in the cluster). Figure 1 shows the 15 clusters and their labels, which we call the research map of the citee network. In this map, each point represents r̂_i for some node i, 1 ≤ i ≤ n, and the two axes are the two entries of r̂_i. The statistics triangle is illustrated by the dashed green lines, where the three vertices are estimated by mixed-SCORE and represent the three primary areas "Bayes," "Biostatistics," and "Nonparametric." We also present the Voronoi diagram of the clusters (boundaries illustrated by dashed blue lines), and the names of the 5 authors with the largest degrees in each cluster. Although Rubin rarely works in Longitudinal I (GEE), he is clustered into GEE because he is cited together with quite a few authors in GEE (e.g., Scott Zeger, Nan Laird, and Daniel F. Heitjan). The research map in Figure 1 is based on this clustering.

We consider 21 time windows (see Table 1) and construct a citee network for each of them. As the number of papers published per year has been steadily increasing, we use gradually smaller windows so that the average node degrees of all 21 citee networks are roughly the same.
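The sub-area clustering step (k-means on the 2-dimensional embedded points) can be sketched as follows. The point cloud is synthetic and the cluster count is 3 rather than the paper's L = 15, purely for illustration; the k-means routine is a minimal Lloyd's algorithm:

```python
import numpy as np

def lloyd_kmeans(X, L, n_iter=100, seed=0):
    """Minimal Lloyd's k-means: cluster the rows of X into L groups."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=L, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster goes empty).
        new_centers = np.array([
            X[labels == l].mean(axis=0) if (labels == l).any() else centers[l]
            for l in range(L)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy 2-D "research map": three well-separated point clouds.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
labels, centers = lloyd_kmeans(X, L=3)
print(labels.shape, centers.shape)
```

In the paper's setting, X would be the n x 2 matrix of SCORE embeddings r̂_1, ..., r̂_n, and the Voronoi cells of the L centers give the cluster boundaries drawn in Figure 1.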
We use the citee network for the first window (1991-2000) as the reference network for our study below. This network is the same as the citee network used to study the statistics triangle and the research map in Figure 1. Recall that this network has 2,831 nodes. We restrict each of the other 20 networks to the same set of nodes.

We propose a dynamic DCMM model by extending the (static) DCMM model (2.1). Consider T citee networks on the same set of n nodes, and let A_1, A_2, ..., A_T be their adjacency matrices. Let P ∈ R^{K,K} be the time-invariant community structure matrix, and let θ^(t)_i > 0 and π^(t)_i ∈ R^K be the degree parameter and mixed-membership vector of node i at time t, 1 ≤ i ≤ n, 1 ≤ t ≤ T. Write Θ_t = diag(θ^(t)_1, ..., θ^(t)_n) and Π_t = [π^(t)_1, ..., π^(t)_n]'. Given {(Θ_t, Π_t)}_{t=1}^T, we assume A_1, A_2, ..., A_T are independently generated, and the upper triangle of A_t contains independent Bernoulli variables satisfying

    P(A_t(i,j) = 1) = θ^(t)_i θ^(t)_j (π^(t)_i)' P π^(t)_j,  1 ≤ i < j ≤ n.    (2.3)

Here, we assume A_1, A_2, ..., A_T are independent given {(Θ_t, Π_t)}_{t=1}^T, but this can be relaxed to allow for weak dependence. Also, to allow flexible temporal dependence in {(Θ_t, Π_t)}_{t=1}^T, we do not impose any extra conditions on them.

Estimating π^(t)_i is known as the problem of dynamic mixed-membership estimation. Existing works include Kim et al. (2018) and Liu et al. (2018). However, these works focus on the dynamic MMSB model (a special dynamic DCMM) where it is required that θ^(t)_i ≡ α_t for all 1 ≤ i ≤ n at each time t. It is therefore unclear how to extend their ideas to our setting. Alternatively, one may use naive mixed-SCORE (i.e., apply mixed-SCORE to each network in the sequence separately). Unfortunately, this approach is also unsatisfactory. One challenge is that the estimates {π̂^(t)_i}_{1≤i≤n} for each time window t are only identified up to an unknown permutation of the K communities.
Since we have T different time windows, there is a large number of possible combinations of such permutations, and it is unclear how to pick the right one. The other challenge is that each A_t is constructed from a relatively short time period and can be very sparse. In such cases, the spectral decomposition of A_t may be rather noisy, and naive mixed-SCORE may perform unsatisfactorily.

We propose dynamic network embedding as a new approach to dynamic mixed-membership estimation. Note that the network A_1 from the first window was used in Section 2.1 to build a "research map" of all the authors. This motivates us to treat A_1 as a reference network and project all the other networks onto this "research map." Let λ̂_1, λ̂_2, ..., λ̂_K be the K largest eigenvalues (in magnitude) of A_1, and let ξ̂_1, ξ̂_2, ..., ξ̂_K be the corresponding eigenvectors. For each 1 ≤ t ≤ T and each node 1 ≤ i ≤ n, define a (K − 1)-dimensional vector r̂^(t)_i by (e_i: the ith standard basis vector of R^n)

    r̂^(t)_i(k) = (λ̂_1 ξ̂_{k+1}' A_t e_i) / (λ̂_{k+1} ξ̂_1' A_t e_i),  1 ≤ k ≤ K − 1.    (2.4)

Now, for each time t, we obtain the low-dimensional embedding {r̂^(t)_i}_{1≤i≤n} of all n nodes, and for each node i, we obtain the embedded "trajectory" (r̂^(1)_i, r̂^(2)_i, ..., r̂^(T)_i). For t = 1, r̂^(1)_i coincides with the SCORE embedding (2.2). This implies that the starting point of each embedded trajectory is always the position of the author in the "research map." For t > 1, the proposed embedding is different from the SCORE embedding (2.2) applied to A_t: in (2.2), we use the eigenvectors of A_t to construct the embedding at time t, while in (2.4), we use the eigenvectors and eigenvalues of A_1 to construct the embeddings for all t.

We now explain how the approach overcomes the two aforementioned challenges. First, the new approach uses the same (ξ̂_1, ξ̂_2, ..., ξ̂_K) to obtain the embeddings for all t, so that these networks are projected into the same low-dimensional space. Consequently, the projected points r̂^(t)_i are automatically aligned across time.
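The reference-network embedding just described can be sketched as follows on a simulated toy network. The exact ratio form used here (scaling by λ̂_1/λ̂_{k+1}) is our reading of the construction, chosen so that with A_t = A_1 the embedding reduces to the plain SCORE ratios, as the text requires; the sanity check below verifies exactly that:

```python
import numpy as np

def reference_embedding(A_ref, A_t, K):
    """Project network A_t onto the spectral space of a reference network A_ref:
    r_i(k) = (lam_1 * xi_{k+1}' A_t e_i) / (lam_{k+1} * xi_1' A_t e_i)."""
    vals, vecs = np.linalg.eigh(np.asarray(A_ref, dtype=float))
    order = np.argsort(-np.abs(vals))[:K]   # K largest eigenvalues in magnitude
    lam, xi = vals[order], vecs[:, order]
    proj = xi.T @ A_t                       # proj[k, i] = xi_k' A_t e_i
    R = (lam[0] * proj[1:]) / (lam[1:, None] * proj[0])
    return R.T                              # n x (K-1)

# Toy network: with A_t = A_ref, the embedding should reduce to the plain
# SCORE embedding xi_{k+1}(i) / xi_1(i).
rng = np.random.default_rng(3)
n = 80
upper = np.triu(rng.random((n, n)) < 0.2, k=1).astype(int)
A = upper + upper.T
R_dyn = reference_embedding(A, A, K=3)
vals, vecs = np.linalg.eigh(A.astype(float))
order = np.argsort(-np.abs(vals))[:3]
xi = vecs[:, order]
R_score = xi[:, 1:] / xi[:, [0]]
print(bool(np.allclose(R_dyn, R_score)))
```

Note that for t > 1 the projection directions come from A_1, not from A_t, which is what aligns the embeddings across time windows.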
Second, in spectral projection and its variants (e.g., SCORE), the data to project (rows of A_t) and the projection directions (eigenvectors of A_t) are dependent on each other. In contrast, in (2.4), the data to project, A_t e_i, and the projection directions, ξ̂_k, are independent of each other for any t ≥ 2. Thus, the projected points are much less noisy. In a preliminary theoretical analysis, we find that r̂^(t)_i has a sharp large-deviation bound even when A_t is very sparse and ξ̂_k is only a moderately good estimate of the population eigenvector of A_1.

We now explain why the approach is reasonable by defining a population counterpart of (2.4). Write Θ^(t) = diag(θ^(t)_1, ..., θ^(t)_n), Π^(t) = [π^(t)_1, ..., π^(t)_n]', and Ω_t = Θ^(t) Π^(t) P (Π^(t))' Θ^(t), 1 ≤ t ≤ T. Let Λ = diag(λ_1, λ_2, ..., λ_K) and Ξ = [ξ_1, ξ_2, ..., ξ_K], where λ_k is the kth largest (in magnitude) eigenvalue of Ω_1 and ξ_k is the corresponding eigenvector. For each t and node i, define the population embedding r^(t)_i as in (2.4), with (A_t, λ̂_k, ξ̂_k) replaced by (Ω_t, λ_k, ξ_k).

Theorem 2.1 states the following. First, for each t there is a simplex S_t in R^{K−1} with K vertices such that if node i is a pure node of community k, then r^(t)_i falls on the kth vertex of S_t; if i is not a pure node of any community, then r^(t)_i is in the interior of S_t (including the edges and faces, but not any of the vertices). Second, each r^(t)_i is a convex combination of the vertices of S_t, with weight vector w^(t)_i ∝ π^(t)_i ∘ h_t, where ∘ is the Hadamard product and h_t ∈ R^K is a positive vector that does not depend on i. Theorem 2.1 is proved in the supplement. By Theorem 2.1, in the noiseless case, the embedded data cloud {r^(t)_i}_{1≤i≤n} at every t forms a low-dimensional simplex, similar to that in Jin et al. (2017). We can then borrow the idea there and estimate π^(t)_i from the embedded data cloud via a simplex vertex-hunting algorithm. This explains the rationale of our procedure. To focus on real data analysis, we relegate a more detailed analysis of the approach to a forthcoming manuscript.

We now apply the procedure to our data set. Research trajectories for individual authors. Recall that we have constructed a 2,831-node citee network for each of the 21 time windows in Table 1. Applying (2.4), we obtain an embedded trajectory for each author, plotted over the same background as Figure 1 (within the dashed yellow square, with the same Voronoi diagram).
Each trajectory has 21 knots, corresponding to the 21 time windows in Table 1 (knots 1, 11, and 21 are marked with 1, 11, and 21, respectively). The starting point (marked with 1) is the same as the author's position in Figure 1. For interpretation, we selected some authors we are familiar with, but we can plot the trajectory of any author with a reasonably long publication history in our data set. The results are based on citations: it may happen that an author (e.g., D. Rubin) does not work in an area but has many citations in that area. The red dots represent the 10 highest-degree authors. The orange dots represent (among the top 200 highest-degree nodes) the 5 authors with the largest se-distance and the 5 authors with the largest difference between the max-distance and the se-distance.

The research trajectories in Section 2.2 suggest that the research interests of some authors vary more significantly than those of others. This motivates us to propose metrics for the research diversity of individual authors. Recall that the 21 knots of the trajectory of author i are r̂^(1)_i, r̂^(2)_i, ..., r̂^(21)_i. We introduce two diversity metrics: the max-distance M_i = max_{2≤t≤21} ||r̂^(t)_i − r̂^(1)_i|| and the se-distance (start-to-end distance) E_i = ||r̂^(21)_i − r̂^(1)_i||. For author i, if both M_i and E_i are large, we call the changes of the research areas of author i significant and persistent (SP); for short, author i is an SP type. If M_i is large but E_i is relatively small, we call the changes of the research areas of author i significant but not persistent (SnP); for short, author i is an SnP type. Among the 20 authors whose names are shown in the figure, Charles J. Stone has the largest E_i value and is seen to be an SP type, and Lueping Zhao has the largest M_i value and is seen to be an SnP type.

The study of coauthorship patterns and community structures in an academic society is an interesting topic (Newman, 2004). The co-author relationships in our data set provide a valuable resource for studying the community structure, which is the focus of this section.
Compared to the co-citation relationship (the focus of Section 2), the co-author relationship is quite different in nature: citations are primarily driven by scientific relevance, but collaborations may be driven by many factors (e.g., geographical proximity, academic genealogy, cultural ties). Therefore, the study below may shed new light that we do not see in Section 2. We focus on the following problems: (a) hierarchical community detection (and especially the interpretation of different communities), (b) the evolution of communities, and (c) diversity measures for individual authors. We discuss these in Sections 3.1-3.3, respectively.

Compared to the citee networks, the effect of mixed memberships in co-authorship networks is notably less significant; see Section D.5 of the supplement for a detailed discussion. So instead of focusing on mixed memberships as in Section 2, we focus on the problem of recursive community detection: we think of the co-authorship network as having many communities (each a research sub-area of statistics), and the sub-areas may have a tree structure. The goal is to (possibly recursively) cluster the authors into these sub-areas. A popular strategy for recursive community detection is as follows. First, partition the network into K_0 groups, for a small integer K_0 < K, where K is the total number of communities. This gives rise to K_0 subnetworks, one restricted to each group. Next, for each subnetwork, test whether it has only one community (null hypothesis) or multiple communities (alternative hypothesis). If the null hypothesis is rejected, the subnetwork is further split. The algorithm stops when the null hypothesis is accepted in every subnetwork. The output is a hierarchical tree, with each leaf being an estimated community. As the mixed-membership effect here is less significant than in the citee networks, it is reasonable to use the DCBM model (Karrer and Newman, 2011).
Compared with the DCMM model in (2.1), DCBM is a special case where we require all vectors π_i to be degenerate (i.e., one entry is 1 and all other entries are 0), so the nodes partition into non-overlapping communities C_1, C_2, . . . , C_K. Let A ∈ {0, 1}^{n×n} be the symmetric adjacency matrix of a coauthorship network, where A(i, j) = 1 if and only if authors i and j have co-authored papers in the range of interest. In DCBM, we assume P(A(i, j) = 1) = θ_i θ_j P(k, ℓ) for i ∈ C_k and j ∈ C_ℓ, where (P, θ_1, θ_2, . . . , θ_n) are the same as those in (2.1). In this subsection, we assume both the whole network and the subnetworks satisfy the DCBM. A more careful modeling of the hierarchical structure is possible (e.g., Li et al. (2020)). But since our primary focus here is to analyze a valuable new data set, we leave this to the future. There are many interesting works on recursive community detection (e.g., Li et al. (2020)), but they focused on the stochastic block models, a special case of the DCBM model in (3.6) that does not allow degree heterogeneity. It is unclear how to extend their methods to our settings. We propose a new algorithm for recursive community detection, consisting of a community detection module and a hypothesis testing module. Both modules are able to properly deal with severe degree heterogeneity. We now discuss them separately. The community detection module clusters the nodes of a network into K0 communities, for a given K0 ≥ 2. We use the following algorithm. For a tuning parameter c0 > 0, let I_n be the identity matrix, let μ̂_k be the k-th largest eigenvalue (in magnitude) of A + c0 I_n, and let ξ̂_k be the corresponding eigenvector, 1 ≤ k ≤ K0. Define a matrix R̂ ∈ R^{n×(K0−1)} by R̂(i, k) = ξ̂_{k+1}(i)/ξ̂_1(i), 1 ≤ i ≤ n, 1 ≤ k ≤ K0 − 1. For a threshold t > 0, we apply element-wise truncation on R̂ to obtain R̂*, where R̂*(i, k) = sign(R̂(i, k)) · min{|R̂(i, k)|, t}. We then apply the k-means algorithm to the rows of R̂*, assuming there are ≤ K0 clusters. There are two tuning parameters (c0, t). We set c0 = 1 and t = log(n). The approach extends SCORE (Jin, 2015), where c0 = 0.
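A minimal sketch of the community detection module (eigenvectors of A + c0·I, entrywise ratios, truncation at t = log n); the final k-means step is omitted, and the toy network below is ours:

```python
import numpy as np

def score_embedding(A, k0, c0=1.0, t=None):
    """Modified-SCORE embedding (a sketch of the community detection module
    described above; the final k-means step on the rows of the output is
    omitted).

    Takes the k0 leading eigenvectors (by eigenvalue magnitude) of A + c0*I,
    forms the entrywise ratios R(i, k) = xi_{k+1}(i) / xi_1(i), and truncates
    entries at +/- t (t = log n by default).
    """
    n = A.shape[0]
    t = np.log(n) if t is None else t
    vals, vecs = np.linalg.eigh(A + c0 * np.eye(n))
    xi = vecs[:, np.argsort(-np.abs(vals))[:k0]]  # sort by |eigenvalue|
    R = xi[:, 1:] / xi[:, [0]]                    # SCORE ratios
    return np.clip(R, -t, t)                      # entrywise truncation

# Two triangles {0,1,2} and {3,4,5} joined by the bridge (2,3): the ratio
# column separates the two communities by sign.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
R = score_embedding(A, k0=2)
```

On this toy graph, the single ratio column of R takes one sign on the first triangle and the opposite sign on the second, so k-means on the rows would recover the two communities.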
Recall that we call ξ̂_k the k-th largest eigenvector of A if it corresponds to the k-th largest (in magnitude) eigenvalue of A. SCORE uses the first K eigenvectors of A for clustering, but unfortunately, when applied directly here, the estimated network is dis-assortative (a network is assortative if, for any pair of communities, there are more edges within than between (Lu and Szymanski, 2019)). For co-authorship networks, such a result is hard to interpret. Note that for an assortative network, a negative eigenvalue is more likely to be spurious than a positive one. This motivates the above approach, where we replace A by A + c0 I_n: the term c0 I_n penalizes the rankings of negative eigenvalues, so the set of the first K eigenvectors of A + c0 I_n differs from that of A. How to choose c0 is an interesting problem. We find that all estimated networks for c0 ≥ 1 are assortative, so we choose c0 = 1 for convenience. The asymptotic consistency of the proposed approach is similar to that of the original SCORE. Given a cluster (subnetwork), the hypothesis testing module determines whether the cluster should be further split. Abusing notation a little, let A be the adjacency matrix of the network formed by restricting nodes and edges to the set of nodes in the current cluster. As before, we assume A follows a DCBM model with K0 communities and test the null hypothesis K0 = 1. We use the Signed-Quadrilateral (SgnQ) test. Define η̂ = (1′_n A 1_n)^{−1/2} A 1_n ∈ R^n and A* = A − η̂η̂′ ∈ R^{n×n}. The SgnQ test statistic ψ_n is a properly centered and normalized version of the signed-quadrilateral count Σ_{i,j,k,ℓ (distinct)} A*(i, j)A*(j, k)A*(k, ℓ)A*(ℓ, i). It was shown that, under mild conditions, ψ_n → N(0, 1) under the null hypothesis. This asymptotic normality holds even when the network has severe degree heterogeneity. We can therefore compute the p-value conveniently and use it to set the stopping rule of the recursive algorithm (e.g., when the p-value is ≥ 0.05, a cluster will not be split). The coauthorship network (36 journals).
We build a coauthorship network using all the data in the 36 journals during 1975-2015 as follows: each node is an author; there is an edge between two nodes if they have coauthored at least m0 papers in the data range. As we wish to focus on (a) the subset of long-term active researchers and (b) solid collaborations, choosing m0 = 1 would be too low (see Ji and Jin, 2016): we may include too many edges between active researchers and non-active ones (e.g., a Ph.D. advisee who joined industry and stopped publishing in academic journals). We take m0 = 3 and focus on the giant component, which has 4,383 nodes. Taking m0 = 2 may also be a reasonable choice, but the network is considerably denser and larger (10,741 nodes), and so requires more time and effort to interpret the results (as we need to check each identified community manually, one by one). Below, we present the results for m0 = 3 and leave the results for m0 = 2 to Section D.6 of the supplement, where we see that the results of the two cases are largely consistent. We now apply our proposed algorithm. Note that the community detection module still requires an input of K0. Similar to Section 2.1, we choose K0 by combining the scree plot, goodness-of-fit, and evaluation of the output communities (details are in Section D.4 of the supplement). Since we use the eigenvectors of (A + I_n) for community detection, the scree plot contains the absolute eigenvalues of (A + I_n) instead of those of A. The stopping rule of the recursive algorithm is set as follows: either the SgnQ p-value is > 0.001 or the community has ≤ 250 nodes. The output is the hierarchical community tree in Figure 4. The hierarchical community tree. First, we investigate the 6 communities in the first layer. To help with interpretation, we apply topic modeling on paper abstracts (see Section D of the supplement, especially Figure D.6).
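The network construction above (an edge when two authors coauthored at least m0 papers, then the giant component) can be sketched as follows; the toy paper list is hypothetical:

```python
from collections import Counter, deque
from itertools import combinations

def coauthor_giant_component(papers, m0=3):
    """Thresholded coauthorship network as constructed above.

    papers: list of author lists, one per paper. Two authors are joined by
    an edge if they coauthored at least m0 papers; returns the node set of
    the giant component of the resulting graph.
    """
    pair_counts = Counter()
    for authors in papers:
        pair_counts.update(combinations(sorted(set(authors)), 2))
    adj = {}
    for (a, b), c in pair_counts.items():
        if c >= m0:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen, giant = set(), set()
    for start in adj:                      # BFS over connected components
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            for v in adj[queue.popleft()] - comp:
                comp.add(v)
                queue.append(v)
        seen |= comp
        if len(comp) > len(giant):
            giant = comp
    return giant

papers = [["A", "B"], ["A", "B"], ["A", "B", "C"], ["C", "D"], ["X", "Y"]]
giant = coauthor_giant_component(papers, m0=2)  # only the A-B pair reaches m0
```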
Combining the topic modeling results with a careful reading of the large-degree nodes in each community, we propose to label these communities as in Table 2, where we also list some comments on each community. Next, we look at the other layers of the tree. The stopping rule of the recursive partition is that either the SgnQ p-value is > 0.001 or the community size is ≤ 250, but there are a few exceptions in Figure 4: (a) C6 has 264 nodes, but its giant component has no more than 250 nodes. (b) For C3-1, after we further split it into 2 sub-communities by SCORE, one sub-community contains only 8 nodes, and the other has a p-value of 0.1; we thus keep C3-1 unchanged. For each leaf community (i.e., the community corresponding to a leaf of the tree), we provide a manual label using two commonly used centrality measures, the betweenness (Freeman, 1977) and the closeness (Bavelas, 1950). For a node in a community, its betweenness is defined as the number of pairs of nodes in the same community that are connected through this node via the shortest path (therefore, a node with a large betweenness plays an important role in bridging other nodes), and the closeness of the node is defined as the reciprocal of the sum of distances from all other nodes in the same community to this node. Given a leaf community, we use the last names of the two nodes with the largest betweenness and the one node with the largest closeness to label the community (of course, if the latter happens to be one of the former, we do not use the same name twice). As a result, each leaf community is labeled with the last names of either two or three authors (not necessarily in alphabetical order). Table 3 presents a few representative nodes for each leaf community. More information on each leaf community is in the tables of the supplement. One common reason for community formation is geographical or institutional proximity: an author is more likely to collaborate with someone in the same institute/country than with others. Example 3. Academic genealogy. The academic advisor-advisee relationship is also a common source of collaboration.
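The closeness measure as defined above can be computed with a single BFS per node; a minimal sketch (betweenness, the other measure used for labeling, would be computed analogously from the same shortest paths):

```python
from collections import deque

def closeness(adj, node):
    """Closeness centrality as defined above: the reciprocal of the sum of
    shortest-path distances from all other (reachable) nodes to the given
    node. adj: dict node -> set of neighbours.
    """
    dist, queue = {node: 0}, deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(d for n, d in dist.items() if n != node)
    return 1.0 / total if total else 0.0

# Star graph: the hub has the largest closeness, as expected.
adj = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
```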
For example, the leaf community C1-1-1 Shen-Wong-Hettmansperger has a component of 29 nodes, which is largely formed by the students of three authors, Wing H. Wong, Jun Liu, and Xiaotong Shen; Liu and Shen are former students of Wong. We also note that this leaf community has sub-communities. For example, the network has a component of 24 nodes containing Thomas P. Hettmansperger. We did not further split C1-1-1 simply because its size falls below 250. Recall that we named the first-layer communities, C1, C2, . . . , C6, using the results of topic modeling (see Figure D.6 and Table 2). In most cases, the interpretations of the leaf communities under a first-layer community match its name. One exception is "C3-3 Pepe-Leisenring-Sun." It is under "C3 Mathematical Statistics" but consists of a group of biostatisticians. After some investigation, we find that this group is brought together with the other groups in C3 largely by the author Xingqiu Zhao: she collaborated with both Narayanaswamy Balakrishnan, a hub node of C3, and Jianguo Sun, a hub node of C3-3. The community tree is constructed by SCORE. To compare with other clustering methods, we apply the Newman-Girvan modularity approach (Newman's spectral approximation) (Newman, 2006) to the same co-authorship network and obtain 6 communities. We then check the number of nodes in the intersection between each of these communities and each of the 26 leaves of our tree. The results are in Table D.7 of the supplement. We find that for most of the 26 leaf communities identified by SCORE, the majority of the nodes in the community are contained in one of the 6 communities identified by Newman's approach. Therefore, at least to some extent, the two clustering results are consistent with each other. Our data set spans a relatively long time period (1975-2015), and it is interesting to study and visualize how the network communities evolve over time. The Sankey plot is a popular visualization tool for dynamic networks.
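The SCORE-vs-Newman comparison above amounts to a contingency table between two label assignments, plus a per-community "majority containment" check; a minimal sketch, with hypothetical toy labels:

```python
from collections import Counter

def cross_tab(labels_a, labels_b):
    """Contingency table between two clusterings of the same node set,
    as in the SCORE-vs-Newman comparison described above.
    labels_*: dict node -> community label.
    """
    return Counter((labels_a[n], labels_b[n]) for n in labels_a)

def majority_containment(labels_a, labels_b):
    """For each community in labels_a, the fraction of its nodes that land
    in its single best-matching community of labels_b."""
    table = cross_tab(labels_a, labels_b)
    sizes = Counter(labels_a.values())
    best = {}
    for (ca, _), n in table.items():
        best[ca] = max(best.get(ca, 0), n)
    return {ca: best[ca] / sizes[ca] for ca in sizes}

# Hypothetical labels: community "y" is fully contained in "q"; "x" splits.
labels_score = {1: "x", 2: "x", 3: "y"}
labels_newman = {1: "p", 2: "q", 3: "q"}
containment = majority_containment(labels_score, labels_newman)
```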
However, to have a nice plot with interpretable results, we face many challenges: (a) the coauthorship network constructed using all the data has too many communities (so it is hard to interpret all of them, and the resultant Sankey plot would be too crowded); (b) it is unclear how to determine the number of communities; (c) it is also unclear how to interpret each community. For (a), we decided to focus on the coauthorship network constructed with only the papers from 4 representative journals, AoS, Bka, JASA, and JRSSB (the full journal names are in Table B.1). Compared to the co-authorship network constructed with the papers in all 36 journals, the research interests of the authors in the current network are more homogeneous. As a result, the network has many fewer communities and is considerably easier to analyze. We have also spent a great deal of effort dealing with challenges (b)-(c); see the details below. The dynamic coauthorship networks (4 journals). We consider three time windows in our study: (i) 1975-1997, (ii) 1995-2007, and (iii) 2005-2015. As in many works on dynamic network analysis (Kim et al., 2018), we let adjacent time windows overlap slightly, so the results on community detection are much more stable. The Sankey diagram. By careful investigation, we found that the three networks have 3, 4, and 3 communities, respectively. Once these numbers are determined, we first perform community detection for each network by applying the modified SCORE described in Section 3.1, and then use the estimated community labels to generate a Sankey diagram; see Figure 5. Since the node sets of the three networks are different, we focus on the set V = (G1 ∩ G2) ∪ (G2 ∩ G3), where G1, G2, and G3 denote the node sets of the three networks; this set has 1,687 nodes. We explain some notation in Figure 5. Consider the network for time period 1 first.
By a similar analysis as before, we propose to label the three communities obtained from applying the modified SCORE to the network as semiparametric statistics (SP), nonparametric statistics (NP), and Bayes (Bay). We do not have a separate community for biostatisticians, but a significant number of biostatisticians (e.g., Jason Fine, Lu Tian, Hongtu Zhu) are outside V, and another significant number of them (e.g., Lee-Jen Wei, Zhiliang Ying, Joseph Ibrahim, Nicholas P. Jewell) are in SP. Let SP1, NP1, and Bay1 be the intersections of V with each community, respectively. We have V = SP1 ∪ NP1 ∪ Bay1 ∪ O1, where O1 = V \ G1. The discussion of the third network is similar, except that the estimated communities are interpreted as high-dimensional data analysis (HD), nonparametric and semiparametric (NP/SP), and Bayes (Bay). Similarly, V = HD ∪ (NP/SP) ∪ Bay3 ∪ O3, where O3 = V \ G3. Last, consider the second network. The four communities obtained by applying SCORE can be similarly interpreted as semiparametric statistics and Bayes (SP/Bay), nonparametric (NP), Bayes (Bay), and biostatistics (Bio). We have V = (SP/Bay) ∪ NP2 ∪ Bay2 ∪ Bio2, where NP2 is the intersection of NP with G2; similarly for Bay2 and Bio2. Note here that V is a subset of G2 (but not a subset of G1 or G3), so O2 = V \ G2 is an empty set. See Figure 5 for details. The Sankey diagram suggests several noteworthy observations. First, in time period 1, our algorithm suggests that there is no "Bio" community, although many biostatisticians (e.g., Jason Fine, Hongtu Zhu, Lu Tian) are outside the set V (recall that V = (G1 ∩ G2) ∪ (G2 ∩ G3)). In time period 2, our algorithm suggests that there is a "Bio" community, where a significant fraction of the members come from outside of V, and another significant fraction were in "SP" in time period 1, migrate to "Bio" in period 2, and migrate to "HD" in period 3. In Section 2.3, we proposed two diversity metrics for the research interests of individual authors, using the trajectory.
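The flows drawn in a Sankey diagram like Figure 5 are simply counts of community-label transitions between consecutive time windows; a minimal sketch, with hypothetical labels:

```python
from collections import Counter

def sankey_flows(assignments):
    """Flow counts for a Sankey diagram of community migration.

    assignments: one dict per time window, mapping author -> community label;
    authors absent from a window map to "O" (the outside-of-window group, by
    analogy with O1 and O3 above). Returns, for each pair of consecutive
    windows, a Counter of (label_t, label_{t+1}) -> number of authors flowing
    along that link.
    """
    authors = set().union(*(set(a) for a in assignments))
    return [Counter((assignments[t].get(a, "O"), assignments[t + 1].get(a, "O"))
                    for a in authors)
            for t in range(len(assignments) - 1)]

flows = sankey_flows([
    {"u": "SP", "v": "SP", "w": "NP"},   # hypothetical time period 1
    {"u": "Bio", "v": "SP", "w": "NP"},  # hypothetical time period 2
])
```

Each Counter entry gives the width of one ribbon in the diagram (e.g., here one author migrates from "SP" to "Bio").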
In this section, we propose a new approach to measuring research diversity by using personalized networks and a recent tool in network global testing. The approach is quite different from that in Section 2.3 (and also from those in the literature) and provides new insight into the research diversity of statisticians. Fixing a node in a symmetric network, the personalized network (also called the ego network) is the subnetwork consisting of the node itself and all of its adjacent nodes. We construct a coauthorship network similar to that in Section 3.1 but with m0 = 1: every author who ever published a paper in any of the 36 journals between 1975 and 2015 is a node, and two nodes have an edge if and only if they coauthored one or more papers. Once this large network is constructed, for every author, we can obtain a personalized coauthorship network accordingly. We model each personalized coauthorship network with a DCBM model (see Section 3.1) with K communities. We consider the global testing problem (Yuan et al., 2018) where we test H0: K = 1 versus H1: K > 1. Viewing each community as a tight-knit group, this tests whether the given personalized coauthorship network has only one or multiple tight-knit groups. We approach the testing problem by the SgnQ test described in Section 3.1. Let Q_i be the test score ψ_n in (3.7) for the personalized coauthorship network of author i. When the null hypothesis is true, Q_i → N(0, 1) as the size of the personalized network grows to ∞. We thus calculate the p-value by p_i = P(N(0, 1) ≥ Q_i) and assign p_i to author i. We propose to use p_i to measure the coauthorship diversity of author i: a large p-value suggests that his/her coauthors form a tightly knit group, and a small p-value suggests that his/her coauthors are from two or more groups, so he/she is more diverse in coauthorship. Extension to measuring the diversity of citers and citees.
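A minimal sketch of this pipeline: extract the personalized (ego) network of an author and evaluate the signed-quadrilateral statistic on it. Only the raw quadrilateral count is computed here; the centering and normalization that yield ψ_n and the p-value are omitted, and the brute-force sum is for tiny illustration networks only.

```python
import numpy as np
from itertools import permutations

def ego_network(adj, node):
    """Personalized (ego) network: adjacency matrix of the subnetwork
    induced by a node and all of its neighbours. adj: dict node -> set."""
    members = sorted({node} | adj[node])
    index = {u: i for i, u in enumerate(members)}
    A = np.zeros((len(members), len(members)))
    for u in members:
        for v in adj[u] & set(members):
            A[index[u], index[v]] = 1
    return A

def sgnq_raw(A):
    """Raw signed-quadrilateral count (a sketch; the centering/scaling that
    makes psi_n ~ N(0,1) is omitted). O(n^4): tiny networks only."""
    d = A.sum(axis=1)
    eta = d / np.sqrt(d.sum())          # eta_hat = A 1 / sqrt(1' A 1)
    Astar = A - np.outer(eta, eta)      # A* = A - eta_hat eta_hat'
    return sum(Astar[i, j] * Astar[j, k] * Astar[k, l] * Astar[l, i]
               for i, j, k, l in permutations(range(A.shape[0]), 4))

# Hypothetical author "i" with coauthors a, b, c; a and b also collaborate.
adj = {"i": {"a", "b", "c"}, "a": {"i", "b"}, "b": {"i", "a"}, "c": {"i"}}
A = ego_network(adj, "i")
Q = sgnq_raw(A)
```

As a sanity check, on a complete graph K5 (one tight-knit group) the raw count works out to exactly 120/625 ≈ 0.192.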
We extend the study to personalized citer/citee networks. In a citer network, two authors have an edge if they have both cited some other author. In a citee network, two authors have an edge if they have both been cited by some other node. Similarly to the above, we construct a personalized citer network and a personalized citee network for each author i. We apply the SgnQ test and denote the two test scores by T_i^citer and T_i^citee, respectively. Figure 7 shows the two test scores for the 1,000 authors with the largest numbers of coauthors. First, for most authors (705 out of 1,000), the personalized citer network is more diverse than the personalized citee network. This is because each author typically focuses on only a few research areas, but his/her work may be cited by researchers from various areas. Second, there is a group of authors whose T_i^citee is much smaller than T_i^citer, most of whom are theoretical statisticians (e.g., Bradley Efron, Iain Johnstone). This is probably because theoretical papers mainly cite theoretical papers but can be cited by many methodology and applied papers. Third, there is a group of authors in biostatistics (e.g., Michael Kosorok, Tze Leung Lai) whose test score for the citee network is much larger than that for the citer network. This is probably because biostatistics papers cite a variety of methodology papers; another reason is that many citations to papers in biostatistics come from other disciplines not covered by our data set. Last, for Raymond Carroll, Jianqing Fan, Peter Hall, and Joseph Ibrahim, both test scores are relatively large, suggesting that they are diverse in both citers and citees. We have proposed 5 metrics for measuring research interests and diversity: two (denoted by A1 and A2) in Section 2.3, where we measure diversity using the research trajectory computed from the co-citation networks, and three (denoted by B1-B3) in this section, for the co-authorship, citer, and citee networks, respectively.
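The citer-network construction above can be sketched as follows (the citee network is the same construction applied to the reversed citation relation); the toy citation lists are hypothetical:

```python
from itertools import combinations

def citer_edges(cites):
    """Edges of the citer network described above: two authors are joined
    if they have both cited some common author. cites: dict author -> set
    of authors they cite. (For the citee network, apply the same function
    to the reversed relation: author -> set of authors citing them.)
    """
    edges = set()
    for a, b in combinations(sorted(cites), 2):
        if cites[a] & cites[b]:  # a common cited author
            edges.add((a, b))
    return edges

# Hypothetical citation lists: a and b both cite y; c shares no citation.
cites = {"a": {"x", "y"}, "b": {"y"}, "c": {"z"}}
edges = citer_edges(cites)
```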
These metrics measure diversity from different angles, using different types of networks. Also, the networks are based on data in different ranges. For these reasons, our results on diversity may have some inconsistencies, and we must interpret them with caution. For example, it is not rare that a paper on one research topic impacts several other research topics, so an author who is not diverse in co-authorship can be significantly diverse in research impacts. For example, most of Donald Rubin's papers are in Bayesian statistics and causal inference, but he has impacts on many other areas (e.g., GEE); see Figures 1-2. Xihong Lin is regarded as highly diverse in research impact, but not as diverse in co-authorship (based on the results in our data range); see Figures 2 and 7. Also, while Approaches A1-A2 and B3 are both for citee networks, A1-A2 are for a dynamic DCMM setting and measure how the membership vector π_it evolves over time, whereas B3 considers a (static) DCMM setting and measures whether the personalized network has one or multiple communities. For reasons of space, we focus on the network approach in this paper, where we model the co-author relationships by networks. As an extension, we may model the co-author relationships by the more sophisticated hypergraph model (e.g., Yuan et al. (2021); Ke et al. (2019)). In comparison, the literature on the hypergraph approach is much less developed than that on the network approach, so we leave the study of the hypergraph approach to the future. We have several contributions. First, we produce a large-scale, high-quality data set. Second, we set an example for how to conduct a data science project that is highly demanding (in data resources, tools, computing, and time and effort).
We showcase this by creating a research template where we (a) collect and clean a valuable large-scale data set, (b) identify a list of interesting problems in science and social science, (c) attack these problems by developing new tools and by adapting existing tools, (d) deal with a long array of challenges in real data analysis so as to obtain meaningful results, and (e) use multiple resources to interpret the results from perspectives in science and social science. We have also made significant contributions in methods and theory by developing an array of ready-to-use tools (for analysis and for visualization). Our study has (potential) impact in social science, science, and real life. For example, suppose an administrator (at a university or a funding agency) wants to learn the research profile of a researcher. Our study provides a long list of tools to characterize and visualize the research profile of the researcher. Such information can be very useful for decision making. Our study also provides a useful guide for researchers (especially junior researchers) in selecting research topics, looking for references, and building social networks. In social science, an important problem is to study the evolvement of a scientific community (Rosvall and Bergstrom, 2010). We attack the problem by providing several tools (e.g., the research map, research trajectory, and Sankey plot) for characterizing and visualizing the evolvement of the statistics community. Another important problem is to check whether the development of a research field is balanced (e.g., whether some areas are over-studied or under-studied) and whether there are unknown biases (e.g., whether scientists have biases when publishing papers related to COVID-19) (Foster et al., 2015). Our study can tell which areas have far more researchers, papers, or citations than others, and so helps check the balance of the field. Our study is also potentially useful for checking unknown biases.
In science, an important problem is how to identify patterns so as to predict new discoveries ahead of time. For example, in materials science, one can use the abstracts of published papers to recommend materials for functional applications several years ahead of time (Tshitoyan et al., 2019). We can do similar things with our data set to predict emerging new areas and significant advancements. For example, in Ji et al. (2021), we combine our citation data with the paper abstracts (treated as text data) to rank different research topics and identify the most active ones. We find that in the past decade, machine learning has risen to become one of the most active research topics in statistics. Though our data set is of high quality, some data preprocessing is still necessary, and we focus on networks with sizes much smaller than 47K. The bottleneck for studying much larger networks is the time and effort required to manually label each research area and to interpret the results in each case. For better use of such a valuable data set, our hope is that the data set (which will be publicly available soon) will motivate many lines of research, so that over the years, researchers may continue to use different parts of the data set for new projects and new discoveries. For future work, note that our data set provides at least two data resources: co-author relationships and citation relationships. It is noteworthy that most existing works in bibliometrics have focused on one data source and one specific problem. Our results suggest the following: (a) The two data resources provide different information about the same group of researchers, and analyses of different data resources may yield different results. The data resources and the results complement each other. (b) An analysis focusing on only one aspect may have limited insight. Combining analyses of different aspects helps paint a more complete picture.
(c) Therefore, it is highly preferable to combine the data resources in our study, within a multi-dimensional framework and with multi-way analysis. In our real data analysis, we have combined the two data resources. For example, in Section 3.3, we use different metrics to measure the diversity of an author, where some metrics are based on the co-citation data and others are based on the coauthorship data. How to combine different data resources more efficiently is an interesting problem. We leave this to future work. Supplemental Material: The supplemental material contains a disclaimer, details of the data set, supplemental data analysis results, and the proof of Theorem 2.1. It is not our intention to rank a researcher (or a paper, or an area) over others. For example, when we say a paper is "highly cited," we only mean that the citation counts are high, and we do not intend to judge how important or influential the paper is. Our results on journal ranking are based on journal citation exchanges, but we do not intend to interpret the ranking beyond the numerical results we obtain from the algorithms we use. As our data set is drawn from real-world publications, we have to use real names, but we have not used any information that is not publicly available. For interpretation purposes, we frequently need to suggest a label for a research group or a research area, and we wish to clarify that the labels do not always accurately reflect all the authors/papers in the group. Our primary interest is the statistics community as a whole, and it is not our intention to label a particular author (or paper, or topic) as belonging to a certain community (group, area). While we have tried very hard to create a large-scale and high-quality data set, the time and effort one can invest in a project are limited. As a result, the scope of our data set is limited.
Our data set focuses on the development of statistical methods and theory in the past 40 years, and it covers research papers in 36 journals between 1975 and 2015 (we began downloading data in 2015). These journals were selected from the 175 journals on the 2010 ranked list of statistics journals by the Australian Research Council (see Section B.1). Journals on special themes and most journals on econometrics, interdisciplinary research, and applications are not included (see Section 6.1 for a detailed description). As a result, papers on econometrics, interdisciplinary research, and applications may be underrepresented. Due to the limited scope of our data set, some of our results may be biased. For example, for the citations a paper has received, we count only those within our data range, so the resultant citation counts may be lower than the real counts the paper has received. Alternatively, for each paper, we could count the citations by web searching (e.g., Google Scholar, which is known to be not very accurate) or by reference matching (e.g., Web of Science and Scopus). Our approach allows us to perform advanced analyses (e.g., ranking authors/papers by citation counts, reporting the most cited authors and papers, excluding self-citations, and calculating cross-journal citations). For such analyses, it is crucial that we know the title, authors, author affiliations, references, and the time and venue of publication for each paper under consideration. With each of the two alternative approaches, we can gather such information for a small number of papers, but it is hard to obtain such information for all 83,336 papers as in our data set. A full-scope study of a scientific community is impossible to accomplish in one paper. The primary goal of our paper is to serve as a starting point for this ambitious task by creating a template where researchers in other fields (e.g., physics) can use statisticians' expertise in data analysis to study their own fields.
For these reasons, the main contributions of our paper remain valid, despite the limitations discussed above. One of our contributions is creating a high-quality, large-scale data set on the publications in 36 statistics-related journals (see Section B.2 of the main paper). We present information on these journals and describe how the 36 journals were selected. We also discuss the challenges we encountered in data collection and cleaning, and how we overcame them. Our data set consists of papers from 36 journals in statistics, probability, machine learning, and related fields; see Table 5. One might think that such a data set is easy to obtain, as it seems that BibTeX and citation data are easy to download. Unfortunately, when we need a large-volume, high-quality data set, this is not the case. For example, the citation data from Google Scholar are not very accurate, and many online resources do not allow large-volume downloads. Our data were downloaded using a handful of online resources, by techniques including, but not limited to, web scraping. The data set was also carefully cleaned by a combination of manual effort and computer algorithms we developed. Both data collection and cleaning are sophisticated and time-consuming processes, during which we encountered a number of challenges. The first challenge is that, for many papers, we need multiple online resources to acquire the complete information. For example, to download the complete information of a paper, we might need online resources 1, 3, and 5 for paper 1, but online resources 2, 4, and 6 for paper 2. Also, each online resource may have a different system for labeling its papers. As a result, we also need to carefully match papers in one online resource to the same papers in another. These make the downloading process rather complicated. The second challenge is name matching and cleaning.
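As an illustration of the kind of name normalization involved, here is a minimal sketch; real disambiguation also requires affiliations and manual checks, and the rules below are our assumptions for illustration, not the exact procedure used:

```python
import re
import unicodedata

def normalize_name(name):
    """Heuristic author-name key for matching spelling variants such as
    "Kung-Yee Liang" vs "Kung Yee Liang". This is only a sketch: truly
    ambiguous cases (initials-only names, distinct authors sharing a name)
    still need affiliations and manual checks.
    """
    s = unicodedata.normalize("NFKD", name)
    s = "".join(c for c in s if not unicodedata.combining(c))  # strip accents
    s = re.sub(r"[-.,']", " ", s.lower())  # punctuation -> spaces
    return " ".join(s.split())             # collapse whitespace

key = normalize_name("Kung-Yee Liang")  # same key as "Kung Yee Liang"
```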
For example, some journals list authors with only the last name and first initial, so it is hard to tell whether "D. Rubin" is Donald Rubin or Daniel Rubin. Also, the name of the same author may be spelled differently in different papers (e.g., "Kung-Yee Liang" and "Kung Yee Liang"). A more difficult case is that different authors may share the same name (e.g., Hao Zhang at Purdue University and Hao Zhang at Arizona State University). To correctly match names and authors, we have to combine manual effort with computer algorithms. Last, an online resource frequently has internal inconsistencies, syntax errors, encoding issues, etc. We need a substantial amount of time and effort to fix these issues. Table 5: For each of the 36 journals, we present the full name, abbreviated name, starting time, total number of authors, total number of papers, and impact factors in 2014 and 2015. For each journal, our data set consists of all papers between a certain year (i.e., the starting time) and 2015. The starting time is not necessarily the year the journal was launched. In this section, we present supplementary results about the citee networks. Section C.1 describes how to interpret the three vertices of the statistics triangle using topic modeling on paper abstracts. Section C.2 explains the choice of K in mixed-membership estimation. Section C.3 investigates the robustness of the results with respect to the way we construct the citee networks. Section C.4 discusses a variant of the dynamic network embedding method and compares the results with those in Section 2.2. In Section 2.1, we constructed a citee network using the co-citation data in 1991-2000 and applied mixed-SCORE (Jin et al., 2017) to obtain a statistics triangle. Our interpretation of the three vertices of this triangle was based on estimating a topic model on paper abstracts, which we explain below. Topic modeling (Blei et al., 2003) is a popular tool in text analysis.
Given n documents written on a vocabulary of p words, a topic modeling algorithm outputs K "topic vectors" A_1, A_2, . . . , A_K ∈ R^p and n "weight vectors" w_1, w_2, . . . , w_n ∈ R^K. Each A_k is a probability mass function (PMF) on the vocabulary, representing a conceptual "topic." By investigating the words that correspond to large entries of A_k, one can relate a conceptual "topic" to a real-world topic (e.g., news, sports, etc.). Each w_i is a nonnegative vector whose entries sum up to 1, where w_i(k) is the weight that document i puts on topic k. In a companion paper, Ji et al. (2021), we conducted topic modeling using the abstracts of the papers in our data set via the spectral algorithm in Ke and Wang (2017), obtaining K = 11 topics. We obtained a vector w_i ∈ R^11 for each abstract. Let w̄ be the average of the w_i's over all the abstracts in our data set. Then, for each author j, we obtained a vector z_j ∈ R^11, defined as the average of w_i − w̄ over all the abstracts i coauthored by author j. We call z_j the centered topic interest of author j. We now use these vectors z_i to interpret the three vertices of the "statistics triangle." Let π̂_i be the estimated mixed membership vector of node i by mixed-SCORE, 1 ≤ i ≤ n. We divide all the nodes into 3 groups: if the largest entry of π̂_i is the k-th entry, then node i is assigned to group k, 1 ≤ k ≤ 3. Recall that the topic modeling results in Ji et al. (2021) produce a vector z_i for each author i. We compute z̄_1, z̄_2, z̄_3 by taking the within-group average of the z_i's for each of the 3 groups, respectively. The three vectors are presented in Figure 8. Figure 8: Within-group averages of the centered topic weights for "Bayes," "Biostatistics," and "Nonparametric" (x-axis: the 11 topics; y-axis: within-group average of the centered topic weights). For each author, the community label is obtained by mixed-SCORE, and the centered topic weight vector is obtained in Ji et al. (2021).
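The centered topic interest z_j defined above can be computed directly from the matrix of weight vectors; a minimal sketch with a hypothetical K = 2 toy example:

```python
import numpy as np

def centered_topic_interest(W, author_papers):
    """Centered topic-interest vectors z_j as defined above.

    W: (n_abstracts x K) matrix whose rows are the topic weight vectors w_i
    (each row sums to 1). author_papers: dict author -> row indices of the
    abstracts he/she coauthored. z_j is the average of (w_i - w_bar) over
    the author's abstracts, with w_bar the overall average weight vector.
    """
    w_bar = W.mean(axis=0)
    return {a: (W[idx] - w_bar).mean(axis=0)
            for a, idx in author_papers.items()}

# Hypothetical toy example with K = 2 topics and 3 abstracts.
W = np.array([[0.8, 0.2], [0.2, 0.8], [0.5, 0.5]])
z = centered_topic_interest(W, {"j1": [0], "j2": [1, 2]})
```

A positive entry of z_j means the author leans toward that topic more than the corpus-wide average; the entries of each z_j sum to zero.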
By this figure, we propose to interpret the three communities from mixed-SCORE as three primary statistical areas: (a) Bayes: the topic interests are mainly in "Bayesian statistics," "experimental design," and "machine learning" (including research on the EM algorithm and Markov chain Monte Carlo (MCMC)). (b) Biostatistics: the topic interests are mainly in "biostatistics and medical statistics," "clinical trials," and "time series" (including research on longitudinal data). (c) Nonparametric statistics: the topic interests are in "mathematical statistics," "regression," "hypothesis testing," and "statistical inference."

Furthermore, recall that each author is assigned to one of the three primary areas as above. For each primary area, we first select the 30 nodes with the highest degrees, and then, out of these 30 nodes, we select the 10 with the highest purity. The selected nodes are presented in Table 6, where for each node we present the 2 (out of 11) representative topics on which the node has the largest weights (note: "time series" includes longitudinal data, and "machine learning" includes the EM algorithm and MCMC). The results in Table 6 are largely consistent with the above interpretation.

Table 6: From left to right: high-degree pure nodes in "Bayes", "Biostatistics", and "Nonparametric". In each primary area, we first select the 30 nodes with the highest degrees, out of which we then select the 10 with the highest purity. For each node, we present the 2 (out of 11) representative topics on which the node has the largest weights (note: "time series" includes longitudinal data, "machine learning" includes the EM algorithm and MCMC).

We explain how we decided K = 3 for the citee network (1991-2000; 2831 nodes). The scree plot is shown in Figure 9, where K = 4 is an elbow point. We thus search for the value of K in the neighborhood of this elbow point.
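The two-stage selection rule used for Table 6 (highest degree first, then highest purity) can be sketched as follows. Purity is taken here as the largest entry of the estimated membership vector π̂_i, an assumption for illustration; the input arrays are hypothetical stand-ins for the mixed-SCORE output.

```python
import numpy as np

def high_degree_pure_nodes(degrees, Pi, n_deg=30, n_pure=10):
    """Select representative nodes of a primary area, as described above:
    first the n_deg highest-degree nodes, then, among them, the n_pure
    nodes with the highest purity.

    degrees : (n,) node degrees within the area.
    Pi      : (n, K) estimated membership vectors (rows are pi_hat_i);
              purity of node i is taken here as max_k Pi[i, k] (an assumption).
    Returns the indices of the selected nodes.
    """
    top_deg = np.argsort(-np.asarray(degrees))[:n_deg]  # highest-degree nodes
    purity = np.asarray(Pi)[top_deg].max(axis=1)        # purity of those nodes
    return top_deg[np.argsort(-purity)[:n_pure]]        # highest purity among them
```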
Additionally, since we believe that the citee network is assortative (a network is assortative if there are more edges within communities than between communities; in such a network, a negative eigenvalue is more likely to be spurious), we focus on K ≤ 6 to avoid including negative eigenvalues. We then apply mixed-SCORE to each K ∈ {2, 3, . . . , 6} and check (i) the goodness of fit and (ii) the research interests of authors in the estimated communities. We combine (i)-(ii) to make a choice of K.

As an illustration, we explain why we preferred K = 3 to K = 4. The results for K = 3 are reported in the main article. We now consider K = 4 and apply mixed-SCORE to get a matrix R̂ = [r̂_1, r̂_2, . . . , r̂_n] ∈ R^{n×3}. The theory of mixed-SCORE indicates that the data cloud {r̂_i}_{1≤i≤n} has approximately the silhouette of a simplex in R^3 with four vertices. We estimate the four vertices using the SVS algorithm in Jin et al. (2017). In Figure 10, the top left panel shows the data cloud and the estimated simplex (a tetrahedron). Recall that for K = 3, the data cloud fits a triangle very well; here, however, the data cloud does not fit a tetrahedron very well. This is also seen from the pairwise coordinate plots in Figure 10, which are projections of the data cloud onto 2-dimensional subspaces. All these pairwise coordinate plots have the approximate shape of a triangle, not a quadrilateral. Although it is possible for a tetrahedron to become a triangle in a projection, we find that the fit between the data cloud and the estimated simplex is not as good as that for K = 3.

The above analysis alone is still insufficient for us to choose between K = 3 and K = 4. We further investigate the pure nodes in each community. For K = 3, the results can be found in Section 2.1, and the three communities are interpreted as 'Bayes', 'Biostatistics' and 'Nonparametrics', respectively.
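The eigenvalue screening used above to restrict the candidate range of K can be sketched as follows: sort the eigenvalues by magnitude, and, under the assortativity assumption, cap K before the first negative leading eigenvalue. This is a minimal illustration, not the paper's full selection procedure.

```python
import numpy as np

def screen_K(A, k_max=12):
    """Rank the eigenvalues of A by magnitude and cap the candidate K
    before the first negative leading eigenvalue, following the
    assortativity argument above (a negative leading eigenvalue is more
    likely spurious in an assortative network).
    """
    vals = np.linalg.eigvalsh(A)                 # A is symmetric
    leading = vals[np.argsort(-np.abs(vals))][:k_max]
    neg = np.flatnonzero(leading < 0)
    K_cap = int(neg[0]) if neg.size else k_max   # largest K with all-positive leading eigenvalues
    return leading, K_cap
```

For example, for two disjoint triangles the leading eigenvalues (by magnitude) are 2, 2, −1, −1, . . ., so the cap is K = 2.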
For K = 4, the results are in Table 7: for each 1 ≤ k ≤ 4, we first select the 300 nodes with the highest π̂_i(k); out of these 300 nodes, we further select the 10 nodes with the largest degrees. Denote by C_1, C_2, C_3, and C_4 the four communities. According to Table 7, we interpret these communities as C_1: 'Biostatistics I', C_2: 'Bayes', C_3: 'Biostatistics II', and C_4: 'Nonparametrics'. Comparing them with the three vertices in Figure 1, we conclude as follows: when K is increased from 3 to 4, the 'Nonparametrics' and 'Bayes' communities stay the same, but the 'Biostatistics' community is split into two (this is also seen from the top right panel of Figure 10, where V1 and V3 are both near the 'Biostatistics' vertex of the statistics triangle in Figure 1). A careful investigation suggests that the difference in author research interests between 'Biostatistics I' and 'Biostatistics II' is smaller than their difference from 'Nonparametrics' or 'Bayes'. Therefore, it is more appropriate to treat the communities from K = 3 as the 'primary areas'.

In the above, we have compared K = 4 with K = 3 in terms of (i) the goodness-of-fit of the simplex and (ii) the interpretation of communities. Both (i) and (ii) suggest that K = 3 is the better choice. We also studied other choices of K in a similar fashion and came to the final decision of K = 3.

We had a few preprocessing steps when constructing the citee network using the co-citations in 1991-2000. One of them is first forming a weighted adjacency matrix and then removing all nodes with a degree smaller than 60. We now vary the threshold 60 and investigate the robustness of the statistics triangle.

Table 7: The relatively pure nodes of each community for K = 4 in mixed-SCORE. For each 1 ≤ k ≤ 4, we first selected the 300 nodes with the highest π̂_i(k); out of these 300 nodes, we then output the 10 nodes with the highest degrees.
The four communities are interpreted as 'Biostatistics I', 'Bayes', 'Biostatistics II', and 'Nonparametrics', respectively.

By setting the threshold at 50, 60 and 70, the resulting citee network has 3125, 2831 and 2558 nodes, respectively. We observe that the size of the network changes only moderately as the threshold varies. We then apply mixed-SCORE to all three networks (with exactly the same algorithm parameters) to produce the research maps and statistics triangles. For the threshold 60, the research map and statistics triangle are shown in Figure 1 of the main article. For the thresholds 50 and 70, the results are shown in Figure 11. The shape of the triangle changes with the threshold. This is because the leading eigenvectors change, which causes the projected subspace to change. However, it is very clear that the three vertices can always be interpreted as 'Bayes', 'Nonparametrics' and 'Biostatistics', no matter which threshold is used. As the threshold varies, the author clustering results also change, but the 15 'sub-areas' identified by author clustering are largely the same. We conclude that the results are robust to the choice of threshold in constructing the citee network.

In Section 2.2, we proposed dynamic network embedding as a new approach to drawing research trajectories and calculating diversity metrics for individual nodes. The key idea of this method is to use the leading eigenvalues and eigenvectors of A_1 to embed the networks at all time points. As a variant of this method, we can also use the leading eigenvalues and eigenvectors of A_{t_0}, for some t_0 > 1, to embed the networks at all time points. Let λ̂_{k,t_0} be the kth largest eigenvalue (in magnitude) of A_{t_0}, and let ξ̂_{k,t_0} be the associated eigenvector. For each 1 ≤ i ≤ n and 1 ≤ t ≤ T, let r̂_i^(t) ∈ R^{K−1} be the embedding of node i at time t defined in (C.8). This gives a research trajectory {r̂_i^(1), r̂_i^(2), . . . , r̂_i^(T)} ⊂ R^{K−1} for each node i.
The embedded point r̂_i^(t) coincides with the SCORE embedding (Jin, 2015; Jin et al., 2017) at t = t_0, but the two embeddings are different at t ≠ t_0. Our approach in Section 2.2 corresponds to t_0 = 1. We now consider t_0 ∈ {2, 5, 10} and compare the results with those for t_0 = 1. Given t_0, for each i, we generate the trajectory using (C.8) and calculate two diversity metrics, the se distance E_i and the max distance M_i, in the same way as in Section 2.3. The results are shown in Figure 12. The top panels contain the scatter plots of E_{i,t_0} versus E_{i,t_0=1}, for the top 100 nodes with the highest degrees. When t_0 ∈ {2, 5}, the scatter plot fits the line y = x very well, indicating that the diversity metrics change little. When t_0 = 10, E_{i,t_0} deviates from E_{i,t_0=1} for many nodes. This is as expected, because the eigenvectors of A_10 are quite different from the eigenvectors of A_1, which causes the projected subspace and the resulting research trajectories to be quite different. However, this does not really affect the assessment of authors' research diversity, because the scatter plot suggests that E_{i,t_0=10} and E_{i,t_0=1} are strongly positively correlated. We conclude that switching to a different t_0 does not yield any significant change of the results in Section 2.3.

In this section, we present supplementary results on the analysis of coauthorship networks. Section D.1 describes how to interpret the communities using topic modeling on paper abstracts. Section D.2 presents the representative authors of each community. Section D.3 compares the current community detection results with those by the Newman-Girvan modularity. Section D.4 explains the choice of K in community detection. Section D.5 justifies the assumption of no mixed memberships. Section D.6 investigates the robustness of the results with respect to the way we construct the coauthorship network.
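The dynamic-embedding comparison above can be sketched as follows. The paper's formula (C.8) is not reproduced here; as an illustrative assumption, each A_t is projected onto the leading eigenvectors of the base matrix A_{t_0} and SCORE-type entrywise ratios are formed, together with the "max distance" diversity metric.

```python
import numpy as np

def dynamic_embedding(A_list, K, t0=0):
    """Embed a sequence of symmetric adjacency matrices using the leading
    eigenvectors of the base matrix A_list[t0], in the spirit of the
    method above (a sketch only; not the paper's exact formula (C.8)).
    """
    vals, vecs = np.linalg.eigh(A_list[t0])
    Xi = vecs[:, np.argsort(-np.abs(vals))[:K]]  # leading eigenvectors of the base matrix
    traj = []
    for A_t in A_list:
        Y = A_t @ Xi                             # project A_t onto the base eigenvectors
        with np.errstate(divide="ignore", invalid="ignore"):
            traj.append(Y[:, 1:] / Y[:, [0]])    # SCORE-type ratios -> R^(K-1)
    return traj                                  # traj[t][i] is node i's position at time t

def max_distance(traj, i):
    """The 'max distance' diversity metric: the largest pairwise distance
    among node i's trajectory points."""
    pts = np.array([R[i] for R in traj])
    return float(np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1).max())
```

Under this sketch, a node whose network neighborhood is unchanged over time has a degenerate trajectory (max distance 0), matching the intuition that the metrics measure the movement of a node across the embedded map.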
In Section 3, we constructed a coauthorship network (36 journals, 1975-2015) and performed hierarchical community detection. This gives rise to the community tree in Figure 4. The first layer of this tree has 6 communities. As in Section C.1, we use the topic modeling results in Ji et al. (2021) to interpret these communities. As explained in Section C.1, we obtained a centered topic interest vector z_a ∈ R^11 for each author a. Now, for each of the 6 communities, we take the within-community average of z_a. The 6 resulting vectors are displayed in Figure 13. Combining the figure with a careful read of the large-degree nodes in each community, we propose to label the 6 communities as "C1 Non-parametric Statistics", "C2 Biostatistics (Europe)", "C3 Mathematical Statistics", "C4 Biostatistics (UNC)", "C5 Semi-parametric Statistics", and "C6 Biostatistics (UM)", where we also list some comments on each community in Table 2.

The hierarchical tree in Figure 4 has 26 leaf communities. To aid interpretation, we present the representative authors in each leaf community, ordered by their degrees within the leaf community. Some representative authors are shown in Table 3 of the main article. We now further expand this table. Recall that each leaf community is labeled using the last names of the two nodes with the largest betweenness metric (Freeman, 1977) and the one node with the largest closeness metric (Bavelas, 1950) (if the latter happens to be one of the former, we do not use the same name twice). Table 8 contains the representative authors of the offspring leaf communities of C1. Table 9 contains the representative authors of the offspring leaf communities of C2, C3, and C4.

Table 9: The offspring leaf communities of C2, C3, and C4, as well as the representative authors (ordered by degree within the leaf community). Names in italics: same as in Table 8.

In obtaining the hierarchical community tree, we used a modification of SCORE (Jin, 2015) for community detection.
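The leaf-naming convention above (the two largest-betweenness names plus the largest-closeness name, without repetition) can be sketched as follows. The centrality dictionaries are assumed to be precomputed (e.g., with standard network-analysis software); the input format is a hypothetical stand-in.

```python
def label_leaf_community(betweenness, closeness):
    """Name a leaf community by the rule described above: the last names of
    the two nodes with the largest betweenness metric, plus the node with
    the largest closeness metric (the name is not repeated if it already
    appears among the former).

    betweenness, closeness : dicts mapping last name -> centrality score,
    assumed precomputed.
    """
    top_btw = sorted(betweenness, key=betweenness.get, reverse=True)[:2]
    top_cls = max(closeness, key=closeness.get)
    names = top_btw + ([top_cls] if top_cls not in top_btw else [])
    return "-".join(names)
```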
There are other existing methods for community detection. One popular approach is the Newman-Girvan modularity (Girvan and Newman, 2002). We apply this method to the coauthorship network (36 journals, 1975-2015) and compare the results with those by our method. The Newman-Girvan modularity method requires an exhaustive search over all possible community assignments. However, the coauthorship network is large (4,383 nodes), making it computationally infeasible to implement the method directly. We instead use the spectral approximation in Newman (2006). This method also requires an input of K. We fix K = 6, the same as the number of first-layer communities in our hierarchical community tree.

The results are shown in Table 11. Each row represents a leaf community in Figure 4, and we report its proportion of nodes in each of the six communities estimated by the Newman-Girvan modularity. We aim to check whether the 26 leaf communities found by our method are indeed tight-knit clusters of nodes that are not split by the Newman-Girvan method. This is obviously true for some leaf communities: e.g., C1-2 has 98% of its nodes in NG6, and C5-3-2 has 89% of its nodes in NG2. In the last column of Table 11, we say that a leaf community has its majority in one NG community if their intersection occupies more than 60% of the nodes in that leaf community (we use 60% as a heuristic threshold; readers may change it to a different threshold, and the corresponding results can be obtained immediately from Table 11). We see that 21 out of 26 leaf communities have their majority in an NG community, suggesting that they are indeed tight-knit clusters.

Table 11: Comparison with the community detection results by applying the Newman-Girvan modularity with K = 6. Rows correspond to the leaf communities in Figure 4. For each leaf community, we report its proportion of nodes in each of the six NG communities; numbers < 10% are omitted.
Among the remaining 5 leaf communities, C4-4 and C6 have a SgnQ p-value < 0.001; hence, they are supposed to be further split by our algorithm. We did not split them only because their sizes are already small (see Section 3.1 for details). We conclude that the results by our method and the results by the Newman-Girvan modularity are concordant with each other to some extent. However, the Newman-Girvan modularity is not consistent under the DCBM model (Zhao et al., 2012), while SCORE is proved to yield consistent community detection (Jin, 2015). For this reason, we stick to the approach in the main article.

We explain how we decided K = 6 in the first layer of the hierarchical community detection tree. The determination of K for the other layers is similar. Given the adjacency matrix A of this network, we first checked the scree plot (left panel of Figure 14). The elbow points are 4, 7, 8, and 11. We thus focused on K ∈ {4, 5, . . . , 11}. Note that the 7th, 8th, and 11th largest eigenvalues (in magnitude) are negative. Since the coauthorship network only has assortative communities, these negative eigenvalues are less likely to contain true signals. As explained in Section 3.1, we penalize negative eigenvalues by conducting the eigen-decomposition on A + I_n, instead of A. The scree plot associated with A + I_n is shown in the right panel of Figure 14, suggesting that K = 6 is appropriate.

The above analysis alone is insufficient to determine K = 6. We next ran the modified SCORE algorithm in Section 3.1 for each K ∈ {4, 5, . . . , 11} and investigated the permuted adjacency matrix. It showed evidence of a bad fit when we increased K from 6 to 7 (see Figure 15). Note that the 7th and 8th largest eigenvalues (in magnitude) are negative; due to the assortativity of coauthorship networks, they are likely to be driven by noise. This is confirmed by the fit of SCORE. For K = 7, the estimated communities 4 and 6 have a lot of inter-community edges; for K = 8, the estimated communities 1, 7 and 8 have many inter-community edges.
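The shift from A to A + I_n used above has a simple effect: every eigenvalue increases by 1 while the eigenvectors are unchanged, so a negative eigenvalue −c is ranked by |1 − c| rather than by c, which penalizes negative eigenvalues in the magnitude ordering. A minimal sketch:

```python
import numpy as np

def leading_eigs_shifted(A, k):
    """Rank eigenvalues by magnitude after the shift A -> A + I, as
    described above. The shift adds 1 to every eigenvalue (eigenvectors
    are unchanged), so negative eigenvalues are demoted in the ordering.
    """
    vals = np.linalg.eigvalsh(A + np.eye(A.shape[0]))
    order = np.argsort(-np.abs(vals))[:k]
    return vals[order] - 1.0  # report on the original scale of A
```

For instance, if A has eigenvalues 3, −2.5, 1, the unshifted magnitude ordering is (3, −2.5, 1), while the shifted ordering is (3, 1, −2.5): the negative eigenvalue drops behind the positive ones.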
Figure 16: The community detection results of SCORE on C1, C3, C4 and C5 (these are the first-layer communities in Figure 4).

As far as we know, there are no existing tests for testing DCMM against DCBM. In this hypothesis testing problem, both hypotheses are highly composite and have many free parameters. It is challenging to find a test statistic that simultaneously has a tractable null distribution and enjoys good power. We leave this to future work.

When constructing the coauthorship network, we define an edge between two nodes if and only if the two authors have coauthored at least m_0 papers in the range of our data set. Since we prefer to focus on long-term active researchers and solid collaborations, m_0 = 1 would be too small (Ji and Jin, 2016), as the network may include too many edges between active researchers and non-active ones (e.g., a Ph.D. advisee who joined industry and ceased to publish in academic journals). The choices of m_0 = 2 and m_0 = 3 are both acceptable. We now compare their corresponding community detection results. In the main article, we used m_0 = 3 and obtained a coauthorship network with 4,383 nodes in the giant component. By choosing m_0 = 2, we obtained a coauthorship network with 10,741 nodes in the giant component. We then applied the same community detection algorithm as in Section 3.1 with K = 6 and compared the resulting communities, denoted by D1-D6, with the six first-layer communities, C1-C6, in Figure 4. The results are summarized in Table 12. When m_0 is decreased from 3 to 2, the size of the coauthorship network expands greatly, and only 41% of the nodes of the expanded network are in the original 4383-node network. As the network expands, there are two possible cases: (a) each of the old communities absorbs new nodes and grows; (b) the new nodes form some new communities.

Table 12: The first-layer communities (D1-D6) for m_0 = 2 versus the first-layer communities (C1-C6) for m_0 = 3.
For each of D1-D6, we report its total number of nodes and its number of nodes in N m 0 =3 (the 4383-node network). For each of C1-C6, we report its proportion of nodes in each of D1-D6; numbers < 10% are omitted. new nodes and grows; (b) the new nodes form some new communities. The results in Table 12 suggest that it is Case (a): In this table, for each of D1-D6, we report the fraction of nodes that are in the original 4383-node network (denoted by N m 0 =3 ). These numbers range from 38% and 50% and are largely comparable with each other. It implies that, as m 0 decreases from 3 to 2, none of D1-D6 purely consists of newly added nodes; instead, each of them is built on one or more old communities. Furthermore, for each of C1-C6, we calculate its fraction of nodes in each of D1-D6. We then use these fractions to map this community to one of the six new communities; see Table 12 . C1-C4 are mapped to D1-D4 respectively, and both C5 and C6 are mapped to D5. We can similarly apply the recursive community detection algorithm on the 10741-node network associated with m 0 = 2. However, the resulting hierarchical tree has more layers, branches, and leaves, and it requires a lot more efforts to manually interpret each leaf. For this reason, we only report the results for m 0 = 3 in the main article. The first bullet point follows directly from the second bullet point. It suffices to prove the second bullet point. We now show the second bullet point. Fix t ≥ 1. Define Y t = Ω t ΞΛ −1 . Then, for every 1 ≤ i ≤ n, we have Under the DCMM model, Ω t = Θ (t) Π (t) P (Π (t) ) Θ (t) . Therefore, It follows that Since min 1≤k≤K {M t (1, k)} > 0, we immediately have Y t (i, 1) > 0 for every 1 ≤ i ≤ n. We plug (E.10) into (E.9) to get Let h t = (M t (1, 1), M t (2, 1), . . . , M t (K, 1)) and w . Combining the above gives t) in the vector form. Our dataset is called the Multi-Attribute Dataset on Statisticians (MADStat). 
The dataset and all the code are publicly accessible at the following sites:

• The MADStat project: http://zke.fas.harvard.edu/MADStat.html
• GitHub: https://github.com/ZhengTracyKe/MADStat

It contains a list of ready-to-use data matrices:

• The full data are in the file AuthorPaperInfo.RData. It has two variables.
  - AuPapMat: This matrix summarizes the bibtex data. It has 4 columns, where idxAu is the author ID, idxPap is the paper ID, year is the publication year, and journal is the publication journal.
  - PapPapMat: This matrix summarizes the citation data. It has 5 columns, where FromPap and ToPap are the paper IDs, FromYear and ToYear are the publication years of the two papers, and SelfCite is an indicator of whether this is a self-citation.
  Additionally, the file author name.txt contains all author names, in the same order as their IDs. The file BibtexInfo.RData contains the bibtex information of the papers.
• The adjacency matrices of co-citation networks
  - CiteeAdjFinal.mat: The adjacency matrix of the citee network (1991-2000). This is the network used to produce the Statistics Triangle and Research Map in Section 2.1. It has 2831 authors. To get the names of the 2831 authors, use the variable keepNodeID in CiteeAdjFinal.mat, as well as the file author name.txt.
  - CiteeDynamicFinal.mat: The adjacency matrices of the 21 citee networks. These are the networks used to produce the Research Trajectories in Section 2.2.
• The adjacency matrices of co-authorship networks

References

Pseudo-likelihood methods for community detection in large sparse networks
Communication patterns in task-oriented groups
Latent Dirichlet allocation
50 years of data science
Fisher in the 21st century
Tradition and innovation in scientists' research strategies
A set of measures of centrality based on betweenness
Community structure in social and biological networks
A survey of statistical network models
"Ranking our excellence" or "assessing our quality" or whatever
Coauthorship and citation networks for statisticians (with discussions)
Journal ranking, topic modeling, and citation prediction for statistical publications
Fast community detection by SCORE
Sharp impossibility results for hypergraph testing
Estimating network memberships by simplex vertex hunting
Optimal adaptivity of signed-polygon statistics for network testing
Stochastic blockmodels and community structure in networks
Community detection for hypergraph networks via regularized tensor power iteration
A new SVD approach to optimal topic estimation
A review of dynamic network models with latent variables
Hierarchical community detection by recursive partitioning
Global spectral clustering in dynamic networks
A regularized stochastic block model for the robust community detection in complex networks
Coauthorship networks and patterns of scientific collaboration
Modularity and community structure in networks
Mapping change in large networks
Coauthorship and citation networks for statisticians
Unsupervised word embeddings capture latent knowledge from materials science literature
A likelihood-ratio type test for stochastic block models with bounded degrees
Testing community structures for hypergraphs
Detecting overlapping communities in networks using spectral methods
Community extraction for social networks
Consistency of community detection in networks under degree-corrected stochastic block models
Based on these plots, we further focused on K ≤ 6. By analyzing the author names in the estimated communities (using our internal knowledge of the academic statistics community), we decided that K = 6 was the best choice.

In Section 3, we model the coauthorship networks with the DCBM model, a special case of the DCMM model obtained by eliminating mixed memberships. We now justify why we use DCBM instead of DCMM. As discussed in Jin et al. (2017), there are two common strategies to model the communities in a large social network: (a) a mixed membership model with a relatively small K, or (b) a non-mixing model with a relatively large K. When constructing the hierarchical tree in Figure 4, our goal is to divide the coauthorship network into a large number of leaf communities, each being a tight-knit cluster of authors. For this reason, we used strategy (b) and adopted the DCBM model. Furthermore, we checked the goodness-of-fit of DCBM on the whole network and on each first-layer community, using the permuted adjacency matrix with estimated community labels. The plot for the whole network is in the left panel of Figure 15. The permuted adjacency matrix is nearly blockwise diagonal, suggesting that the data do not provide enough evidence of mixed memberships. The plots for the first-layer communities are shown in Figure 16. Among the six first-layer communities, only C1, C3, C4 and C5 are further split in the tree. For each of them, we applied SCORE and plotted the permuted adjacency matrix. Again, the data do not provide enough evidence of mixed memberships.

- CoauAdjFinal.mat: The adjacency matrix of the co-authorship network (36 journals). This is the network used to obtain the community tree in Section 3.1. It has 4383 authors. The names of the 4383 authors are given by the variable authorNames in CoauAdjFinal.mat.
- CoauSankeyFinal.mat: The adjacency matrices of the 3 co-authorship networks (4 journals). This is the network used to produce the Sankey diagram in Section