key: cord-0314837-sqizwovu
authors: Rohatgi, Shaurya; Downey, Doug; King, Daniel; Feldman, Sergey
title: S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications
date: 2022-04-22
journal: nan
DOI: 10.1145/3529372.3533283
sha: 6d8b9829512f66a1fb92326dc4f6e2b314363625
doc_id: 314837
cord_uid: sqizwovu

Mentorship is a critical component of academia, but is not as visible as publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, there are few large representative mentorship datasets available. We contribute two datasets to the study of mentorship. The first has over 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually-curated sources, and linked to the Semantic Scholar (S2) knowledge graph. We use this dataset to train an accurate classifier for predicting mentorship relations from bibliographic features, achieving a held-out area under the ROC curve of 0.96. Our second dataset is formed by applying the classifier to the complete co-authorship graph of S2. The result is an inferred graph with 137 million weighted mentorship edges among 24 million nodes. We release this first-of-its-kind dataset to the community to help accelerate the study of scholarly mentorship: https://github.com/allenai/S2AMP-data

Mentorship is a crucial part of the scholarly enterprise. A mentor who has gained experience working in a field over time guides a mentee who is new to the area, and this phase can have long-term effects throughout a mentee's career [8, 9, 12, 16]. Evidence of effective mentorship can be an important factor in promotion and tenure decisions or in a student's choice of advisor. However, records of mentorship are not available at scale the way standard bibliographic measures like h-index, citation count, and number of published papers are.
To remedy this, we introduce two related datasets for studying and inferring publication-evidenced mentorship in science at scale, which we collectively refer to as Semantic Scholar Analysis of MentorshiP (S2AMP). S2AMP is aimed at identifying explicit mentorship relations, such as those between Ph.D. advisors and their students or senior research managers and their less senior coworkers. We start by aggregating Web resources containing ground truth mentorship relations of this type. These form our first dataset. Our second dataset is then obtained by applying a mentorship classifier trained on the first dataset to infer mentorship relations across a complete bibliographic knowledge graph. We find that accurate inference of mentorship is possible, and that the inferred dataset greatly exceeds the coverage of previous mentorship datasets. The largest similar resource is the crowd-sourced Academic Family Tree (AFT) [6] linked to the Microsoft Academic Graph (MAG), which is approximately 180 times smaller than our inferred dataset in terms of the number of mentor-mentee pairs (see Table 1). Other previous work, like SHIFU [7, 10, 19], built mentorship models using publication information provided by MAG [18]. This model was then used to infer mentorship over a part of the MAG (one million authors). However, this inferred data was not open-sourced by the authors and was only applied to a small portion of all candidate author pairs. Other work modeling mentorship has focused on data for a specific field of study, like physics [4] or mathematics and AI [13]. Our new S2AMP resource includes the following contributions: (1) S2AMP Gold - a ground truth dataset drawn from multiple online sources containing known mentor-mentee pairs, and links to rich bibliographic records of the individuals in Semantic Scholar (S2) [1]. 1 (2) A classifier that infers mentorship relations from bibliographic data based on the bibliographic records of the pair (e.g.,
inferred seniority, co-publication, and author order), and graph-based features derived from an initial estimated mentorship graph (e.g., estimated number of mentees). The final classifier is efficient, and has an area under the ROC curve of 0.96. The area under the precision-recall curve is 0.74, and a manual analysis (detailed in Section 5.3) of 100 false positives revealed that 92% of them were in fact true positives that were simply absent from the gold data. (3) S2AMP Inferred - a mentor-mentee dataset of 137 million inferred mentorship relations obtained by applying the trained classifier to the entire S2 knowledge graph. The rest of the paper proceeds as follows. In Section 2, we detail how we collect ground truth mentor-mentee relations and look up the pairs of authors in S2. In Section 3, we describe how we create negative examples for each true mentor-mentee pair to train our classifier. We also show how a novel two-stage model, which first classifies pairs individually and then re-classifies them using graph features derived from the first stage, can improve accuracy (Section 4). Finally, in Section 5 we discuss the creation of our inferred large-scale mentorship graph, and provide preliminary analyses using this graph to illustrate its utility for studying mentorship. The first step in our data collection is to acquire ground-truth mentor-mentee pairs from online sources. Specifically, we build custom crawlers for Open ProQuest 2 , The Mathematics Genealogy Project 3 , and Handle.net 4 pages associated with major universities around the world. We obtain approximately 280,000 pairs from these sources. Lastly, we use the Academic Family Tree 5 dataset. This dataset has 1.5 million mentor-mentee pairs populated by end-users (of which 743,176 are linked to MAG in Ke et al. [6]). Linking to Semantic Scholar. Our mentorship classifier relies on the bibliographic records of the mentors and mentees.
To obtain these records, we link our gold mentor-mentee pairs to author nodes in an existing bibliographic knowledge base, Semantic Scholar (S2). We first link the mentees and then the mentors, as described below. Given a mentor-mentee pair, and one or more author records from S2 that may correspond to the mentee, we find the mentor by searching through each candidate author's co-authors for a textual match for the mentor. We retain the mentee id for which there is a matching mentor name. In case of multiple matches, we keep the pair with the most co-authored papers. Using the above procedure, we successfully match over 300,000 mentor-mentee pairs to S2's authorship graph. For each matched author, we obtain their co-authors and complete publication history using S2. 6 These linked ground-truth mentor-mentee pairs and their bibliographic data comprise S2AMP-Gold. Linking Errors. The linking procedure suffers from occasional errors. Picking between ambiguous name pairs using the "most co-published papers" heuristic is sometimes wrong, and sometimes the linking may be correct but the author profiles themselves may be incorrect due to automated author-disambiguation errors. The linked pairs described in the previous section serve as positive mentor-mentee pairs for model training. Next, we describe how we obtain negative examples and the features used to train our model. We define the co-publication period as the time interval (in years) between the first co-authored article and the latest co-authored article. This co-publication period may be longer than the period of explicit mentorship: for example, a Ph.D. student may graduate, become a professor, and later collaborate again with their Ph.D. advisor. For such cases, it is unclear when to call a mentorship complete from the publication history alone.
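The linking heuristic described above can be sketched as follows. The record schema and the `names_match` helper are hypothetical simplifications (real S2 author records are richer, and robust matching must handle name variants, initials, and transliterations):

```python
def names_match(a, b):
    # Crude stand-in for the paper's textual match: case-insensitive
    # comparison of normalized names.
    return a.strip().lower() == b.strip().lower()

def link_pair(mentor_name, s2_candidates):
    """Resolve a gold mentor-mentee pair to S2 author ids.

    s2_candidates: list of dicts, one per S2 author record that may be the
    mentee, each with an 'id' and a 'coauthors' list of
    (coauthor_id, coauthor_name, n_copubs) tuples (hypothetical schema).
    Returns (mentee_id, mentor_id) or None if no co-author matches.
    """
    matches = []
    for cand in s2_candidates:
        for coa_id, coa_name, n_copubs in cand["coauthors"]:
            if names_match(coa_name, mentor_name):
                matches.append((cand["id"], coa_id, n_copubs))
    if not matches:
        return None
    # Disambiguate multiple matches by keeping the pair with the most
    # co-authored papers, as in the text.
    mentee_id, mentor_id, _ = max(matches, key=lambda m: m[2])
    return mentee_id, mentor_id
```

The "most co-published papers" tie-break is exactly the heuristic the Linking Errors paragraph flags as an occasional source of mistakes.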
As such, we also define a dense co-publication period as the shortest period during which the authors published P% (P > 60) of their co-publications, and use this as a proxy for the most important collaboration period between a pair of authors. When we derive the features for each pair, we compute two versions of each feature, one for the co-publication period and another for the dense co-publication period. To construct negative examples, we obtain the co-authors of a mentee who are plausibly mentors but not marked as such in the ground truth data. These may include, for example, other more senior Ph.D. students, post-doctoral researchers, or other co-author professors. We call them candidate mentors, and they serve as negative examples for the classification model. Using the bibliographic data, we select as candidate mentors all the co-authors of a mentee where (a) the co-author has more publications than the mentee before the date of their first co-authored article, (b) the co-author's first publication came earlier than the mentee's, and (c) the co-author and the mentee have written at least n scholarly articles together. In our experiments we set n = 2, after finding that n = 1 yielded significantly lower precision and a computationally prohibitive number of candidate mentors during the inference stage. Requiring two co-publications can be helpful for capturing short-term mentorships for pre-doctoral mentees in some high-publication fields, in addition to longer-term mentorships with Ph.D. students in many other fields that publish before graduation. Using author and publication data from S2, we extract various types of features for each positive and negative mentor-mentee pair: • Co-publication - Features based on the co-publications of the mentor and the mentee, as well as total individual publications of the mentor and mentee during the co-publication period.
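The dense co-publication window and the candidate-mentor criteria (a)-(c) can be sketched as below. The per-author record layout (`pub_years`, `copub_years_with_mentee`) is a hypothetical simplification of the S2 data:

```python
import math

def dense_copub_period(copub_years, p=0.6):
    """Shortest [start, end] window covering at least a fraction p of the
    pair's co-publications (P% with P > 60 in the text).
    copub_years: publication years of the co-authored papers."""
    years = sorted(copub_years)
    need = max(1, math.ceil(p * len(years)))
    best = None
    for i in range(len(years) - need + 1):
        start, end = years[i], years[i + need - 1]
        if best is None or end - start < best[1] - best[0]:
            best = (start, end)
    return best

def is_candidate_mentor(coauthor, mentee, n_min=2):
    """Criteria (a)-(c): records are dicts with 'pub_years' (all publication
    years) and 'copub_years_with_mentee' (years of joint papers)."""
    first_copub = min(coauthor["copub_years_with_mentee"])
    pubs_before = lambda a: sum(1 for y in a["pub_years"] if y < first_copub)
    return (
        pubs_before(coauthor) > pubs_before(mentee)                 # (a)
        and min(coauthor["pub_years"]) < min(mentee["pub_years"])   # (b)
        and len(coauthor["copub_years_with_mentee"]) >= n_min       # (c)
    )
```

With n_min = 2 this reproduces the paper's filter; lowering it to 1 admits far more candidate pairs, matching the precision and cost issues noted above.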
These features capture our expectations that, for example, the mentor will have a significantly higher number of publications than the mentee for a given period, and that the mentee will have fewer years of scholarly activity than their mentor. • Co-authors - The count of co-authors for each mentor and mentee. We expect mentors to typically have more co-authors or collaborators than mentees for the period of co-publication. • Author list position - We use the authorship position of the mentee and the mentor in the publications they co-author, as well as in those they do not, for the period of co-publication. This can be a helpful feature in fields where mentors tend to occur at the end of the authorship list. 7 We curate a total of 48 features, which are detailed in the Appendix in Table 4. Using these features we train a classifier (described in the next section) to predict whether a given pair of authors is a positive mentor-mentee pair or not. The model trained on features from the previous section predicts mentor-mentee relationships with no context about surrounding colleagues, and this is a limitation. In practice, mentors tend to mentor more than one mentee over the course of their career, and a mentee who already has a high-scoring mentor is less likely to have others with as high a score. As such, we can improve our model's first-stage predictions by taking into account other nearby predictions for each candidate mentor and mentee. Formally, we define a graph where each candidate mentor and mentee corresponds to a node, and weighted directed edges point from each mentor to each mentee (where the edge weight is the score produced by the first-stage model). We extract additional features for each node in the graph. For example, for each node we sum the incoming edge weights to get the total "menteeship" received, and sum the outgoing edge weights to obtain the total mentorship given.
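The node-level aggregation over the first-stage graph can be sketched as follows; this shows only the two sums named in the text (the full model derives about 20 such aggregates, including maxima, means, ratios, and differences):

```python
from collections import defaultdict

def graph_features(edges):
    """Aggregate first-stage scores into per-node graph features.

    edges: iterable of (mentor_id, mentee_id, score), where score is the
    first-stage model's probability, used as the directed edge weight.
    Returns per-node mentorship given (sum of out-edge weights) and
    menteeship received (sum of in-edge weights).
    """
    out_sum = defaultdict(float)
    in_sum = defaultdict(float)
    for mentor, mentee, score in edges:
        out_sum[mentor] += score   # mentorship given by `mentor`
        in_sum[mentee] += score    # menteeship received by `mentee`
    nodes = set(out_sum) | set(in_sum)
    return {n: {"mentorship_given": out_sum[n],
                "menteeship_received": in_sum[n]} for n in nodes}
```

These aggregates are then concatenated with the pair-level features before training the second-stage model.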
We extract a total of 20 features using the first-stage mentorship graph, including the maximum mentorship given by an author, the ratio of the sum of in-edge weights to the sum of out-edge weights, the difference between these two sums, and others, all of which are described in the Appendix in Table 4. These new features are concatenated with the first-stage features, and a second-stage model is trained with the same methodology as the first-stage model. Filtering our gold mentor-mentee pairs from our linking pipeline to just those pairs that have co-authored at least n = 2 papers leaves a subset of 219,331 positive pairs, to which we add the 1.6 million generated negative mentor-mentee pairs. We then split the data into train, validation, and test sets (3:1:1 split), ensuring that mentees present in one split are not present in another. Using the extracted features we train a binary classification model to differentiate true mentor-mentee pairs from false ones. As this model is eventually used for inference at scale on approximately 137 million pairs, we chose the efficient LightGBM [5] implementation of gradient-boosted decision tree ensembles. Evaluating on the validation set, we use the hyperopt library [2] to search for the best configuration of nine hyperparameters. 8 We use the probabilistic output from the trained LightGBM classifier as the edge weight in the mentorship graph. To report final performance on the test set, we use two metrics: area under the ROC curve (AUROC) and area under the precision-recall curve (AUPRC). Results for the first and second stages are reported in Table 2. We also study which features are important for predicting a mentorship relationship using SHAP importance [11] values.
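The training and evaluation setup can be sketched on synthetic data as below. This is not the paper's pipeline: the features and labels are random stand-ins for the 48 pair features, scikit-learn's GradientBoostingClassifier is used as a stand-in for LightGBM, and the hyperopt search over nine hyperparameters is omitted:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; label 1 = true mentor-mentee pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# The probabilistic output is what the paper uses as the mentorship
# edge weight; AUROC and AUPRC are the two reported metrics.
scores = clf.predict_proba(X_te)[:, 1]
auroc = roc_auc_score(y_te, scores)
auprc = average_precision_score(y_te, scores)
```

In the real pipeline, splitting is done so that no mentee appears in more than one of the train/validation/test sets, which a random row-wise split does not guarantee.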
We observe that the number of publications before the co-publication period for both mentors and mentees, the number of co-authors before the co-publication period for both mentors and mentees, and the average authorship position of the mentor are the top three features for the first-stage model. For the second stage, the two most important … (8: See Figure 3 in the Appendix for complete details.) The complete co-publication graph of Semantic Scholar contains around 2 billion edges. We use the criteria from Section 3.2 to select candidate mentors for each mentee from these co-publication edges, after which we are left with 137 million pairs of mentees and their candidate mentors. These pairs are run through our feature extraction pipeline and the trained two-stage model to obtain a set of predicted scores, which serve as weights on the inferred mentorship edges. We can then calculate the following mentorship metrics for each author node: • Menteeship sum - sum of all the incoming edges of a node (mentorship received). • Menteeship mean - mean of all the incoming edges. • Mentorship sum - sum of all the outgoing edges of a node (mentorship given). • Mentorship mean - mean of all the outgoing edges. We hypothesize that the derived mentorship scores from the preceding section not only meaningfully quantify mentorship given and received, but can also partially explain success in academia more broadly, as measured by h-index. We thus model the relationship between h-index (the dependent variable) and total publication count, total citation count, field of study, and the four derived mentorship scores (independent variables) 9 with a negative binomial generalized linear fixed-effects model, 10 with conditional mean E[h | x] = exp(β · x), where x is the vector of covariates and β_i is the learned coefficient for the i-th covariate. We represent the field of study covariate using one-hot encoding, while the continuous variables are each binned into quintiles (five equal-sized bins).
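The log-link conditional mean above can be illustrated numerically. The paper fits the model with statsmodels; here we only evaluate E[h | x] = exp(β · x) by hand. All coefficients below are made-up placeholders except the 0.14 value for the 5th quintile of menteeship sum, which is quoted in the text:

```python
import math

# Illustrative coefficients for the log-link mean E[h | x] = exp(beta . x).
# Only menteeship_sum_q5 = 0.14 comes from the paper; the rest are
# hypothetical. Quintile indicators are one-hot: the 1st quintile of each
# covariate has coefficient 0 by construction.
beta = {"intercept": 2.0, "menteeship_sum_q5": 0.14, "paper_count_q3": 0.30}
x = {"intercept": 1, "menteeship_sum_q5": 1, "paper_count_q3": 1}

log_mean = sum(beta[k] * x[k] for k in beta)
expected_h = math.exp(log_mean)

# Multiplicative reading: being in the 5th menteeship-sum quintile scales
# the expected h-index by exp(0.14) relative to the 1st quintile.
multiplier = math.exp(beta["menteeship_sum_q5"])
```

Because the coefficients sit inside an exponential, they combine multiplicatively on the h-index scale rather than additively as in linear regression.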
We fit the model using the entire inferred graph (both mentors and mentees) after removing outliers; the learned coefficients are in Table 3. Note that, due to the exponential term in the regression function above, the coefficients have a multiplicative effect rather than an additive one as in linear regression. For example, the coefficient for the 5th quintile of menteeship sum is 0.14, which we interpret as having the effect of multiplying the expected h-index by exp(0.14) ≈ 1.15. The first quintile of each covariate has a coefficient of 0 by construction. We exclude coefficients for field of study to save space, but Table 4 in the Appendix has complete coefficient information. The coefficients for citation count are the largest, followed by paper count, as expected since the h-index is constructed from these two variables. Menteeship sum is the next-largest statistically significant variable: as the menteeship sum increases, its coefficient for predicting h-index also increases. Interestingly, both menteeship mean and mentorship sum have negative coefficients, with the exception of the 5th quintile of mentorship sum. Mentorship mean has a small, roughly constant, but statistically significant positive effect. Menteeship sum and mentorship mean are also the two most important covariates for predictive power according to SHAP. We note that causal conclusions should not be drawn from this analysis. Mentorship scores can help us identify expert scholars and their descendants. For example, our model predicts a high mentorship sum (272.44) for Leo Paquette 11 , an American organic chemist who guided approximately 150 graduate students to their Ph.D. degrees. To highlight the coverage of our model, S2AMP has 279 mentees (above a mentorship score of 0.5) for Dr. Paquette, while Academic Tree reports only 39.
12 We note again that these are the mentees detectable by our algorithm, and it is likely Leo Paquette mentored others with whom he published zero or one paper. More examples of expert discovery are discussed in the Appendix. To assess the quality of our model, we manually examined 200 misclassified sample pairs from the validation data (100 false positives and 100 false negatives). We found via manual web search and author homepage review that 92% of the false positives were true mentor-mentee pairs. The other 8% were cases where the candidate mentor was very senior to the mentee, but did not appear to be a mentor. Looking at the false negatives, we found that the model gives low probabilities to pairs with mentors who have not …

We introduce S2AMP, a pair of novel datasets for studying publication-evidenced mentorship. S2AMP Gold is used to train a model which can generalize mentorship prediction to unseen pairs. We used this model to generate S2AMP Inferred, a graph with 137 million edges and 24 million nodes. The Gold and Inferred datasets are linked to S2, which is regularly updated. We hope that S2AMP helps further new studies and findings in understanding the role of mentorship in academia and beyond. Limitations. As discussed in the introduction, S2AMP targets explicit mentorship relationships, such as Ph.D. advising or management in industry. However, this is not the only type of mentorship that occurs in scholarly work. Mentorship can be informal and not explicitly documented, e.g. when more senior lab mates mentor junior ones. Our ground truth data does not cover this informal mentorship. Similar to previous work [17], our model cannot detect mentorship unless there is co-publication, and we require a minimum of two co-published papers. This naturally excludes early academic career mentorship detection in many fields where publishing during a Ph.D. is not as common, e.g. the humanities and economics.
Both S2AMP Gold and S2AMP Inferred exclude any form of mentorship which has no publication trace. Future Work. We believe that our inferred mentorship graph can help answer many science-of-science [12, 14] questions including: • Do doctoral students whose advisors have superior research output and/or are better positioned in the co-publication network of their respective scientific communities have a higher probability to become advisors themselves? [3] • How does mentorship differ across different fields of study? Are cross-disciplinary mentorships correlated with success? • Can we identify significant "intermediate actors" that play a role in the mentorship graph, e.g. post-doctoral students? One important question is to what degree our model can discover informal mentorship relationships in addition to formal ones like Ph.D. advisors and their students. The error analysis in Section 5.3 showed promising results for discovering formal relationships. We speculate that informal relationships will be reflected by somewhat elevated scores from our model, and measuring this is an item of future work. 
Name: Description

copub_count: total papers published together
total_mte_pubs: total publications of the mentee till copub end date
total_coa_pubs: total publications of the coauthor till copub end date
mte_copub_total: # of papers published by mentee in copub period
coa_copub_total: # of papers published by coauthor in copub period
mte_copub_prcnt: ratio of copub_count to mte_copub_total
coa_copub_prcnt: ratio of copub_count to coa_copub_total
ratio_mte_coa: ratio of total_mte_pubs to total_coa_pubs
copub_years: # of years of collaboration
mte_years: mentee publication years till copub end date
coa_years: coauthor publication years till copub end date
mte_copub_years_prcnt: ratio of copub_years to mte_years
coa_copub_years_prcnt: ratio of copub_years to coa_years
dense_mte_copub_total: # of papers published by mentee in dense copub period
dense_coa_copub_total: # of papers published by coauthor in dense copub period
dense_total_coa_pubs: total publications of the coauthor till dense copub end date
dense_total_mte_pubs: total publications of the mentee till dense copub end date
dense_copub_count: total papers published together during the dense copub period
dense_mte_copub_prcnt: ratio of dense copub_count to mte_copub_total
dense_coa_copub_prcnt: ratio of dense copub_count to coa_copub_total
dense_ratio_mte_coa: ratio of dense_total_mte_pubs to dense_total_coa_pubs
dense_mte_years: mentee publication years till dense copub end date
dense_coa_years: coauthor publication years till dense copub end date
dense_mte_copub_years_prcnt: ratio of dense copub_years to mte_years
dense_coa_copub_years_prcnt: ratio of dense copub_years to coa_years
coa_pubs_before_copub: coauthor publication count before co-publication period
mte_pubs_before_copub: mentee publication count before co-publication period
coa_out_max: coauthor out-edge max weight
coa_in_max: coauthor in-edge max weight
mte_out_max: mentee out-edge max weight
mte_in_max: mentee in-edge max weight
coa_out_sum: sum of out-edge weights for coauthor
coa_in_sum: sum of in-edge weights for coauthor
mte_out_sum: sum of out-edge weights for mentee
mte_in_sum: sum of in-edge weights for mentee
mte_weight_sum: mte_out_sum + mte_in_sum
coa_weight_sum: coa_out_sum + coa_in_sum
mte_avg_in: average of in-edge weights for mentee
mte_avg_out: average of out-edge weights for mentee
coa_avg_in: average of in-edge weights for coauthor
coa_avg_out: average of out-edge weights for coauthor
mte_ratio_in_out: ratio of mte_in_sum to mte_out_sum
coa_ratio_in_out: ratio of coa_in_sum to coa_out_sum

Construction of the Literature Graph in Semantic Scholar
Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures
The Next Generation (Plus One): An Analysis of Doctoral Students' Academic Fertility Using a Novel Approach for Identifying Advisors
The next generation (plus one): an analysis of doctoral students' academic fecundity based on a novel approach to advisor identification
LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems
A dataset of mentorship in science with semantic and demographic estimations
Web of Scholars: A Scholar Knowledge Graph
Early coauthorship with top scientists predicts success in academic careers
Intellectual synthesis in mentorship determines success in academic careers
Shifu2: A Network Representation Learning Based Model for Advisor-advisee Relationship Mining
A Unified Approach to Interpreting Model Predictions
Mentorship and protégé success in STEM fields. Association for Computing Machinery
Impact of gender on the formation and outcome of mentoring relationships in academic research
statsmodels: Econometric and statistical modeling with Python
Quantifying the evolution of individual scientific impact
Mining advisor-advisee relationships from research publication networks
Microsoft Academic Graph: When experts are not enough
Mining Advisor-Advisee Relationships in Scholarly Big Data: A Deep Learning Approach

We are immensely grateful to James Thorson, Jevin West, Dashun Wang, and Daniel S. Weld for their insightful discussions, and to JJ Yang for their relentless support. This work was supported in part by NSF Convergence Accelerator Grant 2132318.