key: cord-0475326-t7714bxs
authors: Pollacci, Laura
title: EMAKG: An Enhanced Version Of The Microsoft Academic Knowledge Graph
date: 2022-03-17
journal: nan
DOI: nan
sha: 5badad9e84f5ee78bb055197a782e608657a10e0
doc_id: 475326
cord_uid: t7714bxs

Scholarly knowledge graphs are valuable sources of information in several research fields. Despite the number of existing datasets related to publications and researchers, resource quality, coverage and accessibility are still limited. This article presents the Enhanced Microsoft Academic Knowledge Graph, a large dataset of information about scientific publications and involved entities, and the methods developed to build it. Data includes geographical information, researchers' collaborative networks and movements between institutions, academic-related metrics, and linguistic features. The dataset merges information from several data sources and has high temporal and spatial coverage, allowing several use cases.

Sharing knowledge is ever more crucial, especially in scientific research [1]. Data representing highly skilled personnel is key to interpreting and understanding scientific collaborations and knowledge exchange phenomena. Given its multifaceted nature, numerous strands of research are involved in the analysis of highly skilled personnel and scholarly data, including digital libraries [2, 3], collaborator discovery, expert finding, and recommendation systems [4]. Furthermore, scientific networks of collaboration and exchange, understood as physical displacement (mobility), are at the centre of research attention. Despite the recent interest in knowledge exchange and the increase in movements of highly skilled personnel, mobile researchers have attracted limited interest. Besides a few exceptions [5] [6] [7], there is a notable gap in the understanding of researchers' mobility, knowledge exchange, and scientific collaboration networks. One of the challenges in modelling researchers' mobility and collaborations is the lack of data and international statistics providing definitions and specific indicators, e.g., socio-economic, educational, and professional indicators [8, 9]. The most recent research has focused on alternative data sources to fill the gaps left by traditional data, e.g., register statistics. Unconventional data describing publications, researchers' careers and movements have opened new research opportunities for multiple fields of study [10]. There is large variability in terms of available data sources, accessibility, format, coverage, type, and amount of content, as discussed in Section 3. Researchers have benefited from alternative data sources to study academic collaboration networks, to develop scientific mobility indicators [11] [12] [13], and to examine scientific ethnic and mobility networks [14, 15].

This paper presents the Enhanced MAKG (DOI: 10.5281/zenodo.5888647) 1, a large dataset of scientific publications and related entities, including authors, and the methods 2 developed to build it. The proposed dataset originates from the Microsoft Academic Knowledge Graph (MAKG) [2, 3, 16], one of the most extensive freely available knowledge graphs on publications. I first assess the limitations of the current MAKG dataset in Section 3.1. Then, based on these, several methods are designed to enhance the data and increase the number of use case scenarios, particularly in mobility and network analysis. The dataset provides two main advantages. First, it has improved usability, facilitating access for non-expert users. Second, it includes an increased number of types of information obtained by integrating various datasets and sources, which helps expand the application domains. For instance, geographical information could help mobility (and migration) research. The knowledge graph completeness is improved by retrieving and merging information on publications and other entities no longer available in the latest version of MAKG. Furthermore, geographical and collaboration network details are employed to provide data on authors, including their working connections and movements between institutions and countries, opening several new possible research and use cases for the dataset. Further, data is generally enriched and standardised by designing semi-supervised Natural Language Processing (NLP) approaches.

The rest of this paper is organised as follows. Section 2 describes the main contributions of this article. Section 3 discusses available scholarly data resources and their main differences. In particular, Section 3.1 provides a general overview of the Microsoft Academic Knowledge Graph with its enhancements over time, current limitations, and usages. The methods developed to build the EMAKG are described in Sections 4 to 9, while the description of the dataset is in Section 10. After a brief discussion of possible usages of the Enhanced MAKG in Section 11, Section 12 concludes the paper with a final discussion and future work.

This section summarises the main contributions proposed to enhance and improve the Microsoft Academic Knowledge Graph.

Facilitation of Use Case Scenarios. The dissemination of the dataset and its use scenarios strictly rely on data accessibility. Thus, providing most of the data in Comma Separated Value (.csv) and .txt formats could lower the skills necessary to access, manage and analyse them. Use cases may also depend on the number, type, and ease of understanding of data. To this end, data standardisation is enhanced. In addition, entities' properties rely on official coding systems, including ISO 3166 codes for countries 3, ISSN for journals 4, and ISO 639-1 for languages.

Knowledge Graph. As discussed in Section 3.1, the MAKG has often been updated, but such changes also impose a data loss. To this end, the knowledge graph completeness is improved by retrieving information no longer available in MAKG from its parent, the Microsoft Academic Graph (MAG) [17] , e.g., links between institutions, papers, and authors. Furthermore, other sources, i.e., Wikipedia, are exploited to add new information on entities. For instance, the semi-supervised method in Section 4 combines a) reverse geocoding, b) information retrieval and c) data integration to provide additional information on affiliations. These include homepage, foundation date, type, acronym, and a set of geographical data, e.g., city, country name, and ISO codes.

Mobility. As discussed in Section 3.1, the application scenario of the MAKG lacks mobility-related studies. Geographical information and the geolocalisation of affiliations allow describing researchers' movements over institutions and countries. Starting from the hypothesis that an author lives in the country where their affiliation is located, annual publications and geolocalised affiliations are computed by author. The relationship between these provides (a) the author's annual location, as the most frequent country among locations of institutions related to an author's annual publications; and (b) the author's career nationality, as the country of the first geolocated institution over an author's career. Researchers' locations allow dealing with mobility (and migration) related concepts, such as flows and stocks (Section 5). Combining the literature [18] and the information on authors' locations and careers, the concept of i. working-nativity is introduced: given an author A and a country C, A is a working-native of C if C corresponds to the author's career nationality. Furthermore, the following are defined: ii. authors' stocks, the number of authors identified as international working migrants in a given country and year; and iii. authors' flows, the number of authors entering or leaving a given country in a given year. Thus, given a country C, authors' stocks are computed by counting the number of non-working-native authors annually located in C. Moreover, authors' flows are modelled as the directed graph between worldwide countries.

Networks of collaborations. Similar to mobility, the role and the topology of scientific collaboration networks have been extensively analysed for different research purposes (Section 6). Authors' collaboration networks could be employed to understand dynamics between researchers and to study knowledge exchange over institutions and countries. According to the literature, authors who have authored a paper together are linked on a yearly basis [19]. Thus, authors' networks of collaborations are modelled as annual ego networks (Section 6). Annual authors' ego networks may facilitate network analysis and help in understanding dynamics between researchers over time.

Fields of study. The latest MAKG version includes a descriptive classification of fields of study. Nevertheless, together with the data loss [16], the newest classification method seems to be more suitable for some research fields than for others. Here, I propose a method that starts from a limited set of top fields of study and propagates them based on parenting relationships. The procedure yields fields of study labelled with one or more top-level disciplines and described by a score in the range [0, 1], computed as the proportion over the list of inherited labels. The obtained labelled fields of study (FOS) can help understand phenomena and dynamics about researchers, e.g., exploring trends in publication rates by field and disciplines' attractiveness (as the number of authors publishing in a given area).

General Enhancements. Data is enriched and aggregated by designing Natural Language Processing semi-supervised approaches aiming to include a) academia related metrics (h-index); b) abstracts and linguistic features, i.e., standard language codes, tokens, and types; c) entities' general information, e.g., date of foundation, type and acronym of institutions, among others.

To date, various systems allow exploration of scientific data via repository interfaces and ensure access to integrated datasets from multiple resources [20]. Major data resources include but are not limited to: Scopus 5, a multidisciplinary composite source; DBLP 6, a computer science bibliography website; Google Scholar 7, which allows search and citation services over the academic literature; CiteSeerX 8, a large-scale harvesting of indexed papers; Web of Science (WOS) 9, a research dataset for scholarly publications; Microsoft Academic Search (MAS) 10 [17] [25] is too specific for the purpose of EMAKG since it refers exclusively to Semantic Web conferences.

The Microsoft Academic Knowledge Graph derives from the Microsoft Academic Graph, an extensive database about scientific publications modelled as a connected knowledge graph.

The MAKG provides information about scientific publications and the entities involved in and related to them, including authors, venues, and institutions. The dataset includes data for almost 240 million papers and 245 million authors affiliated with more than 25 thousand institutions, as shown by the distribution of the entities among the main entity types in Table 1. The last version of MAKG (v. 2020-06-19 18) shows notable changes with respect to the previous ones, such as new properties for modelling relations between entities, author disambiguation, and geographical coordinates for institutions. Also, the number of subsets in the dataset changed from 18 to 26.

The MAKG strongly benefits from the existing resources, such as DBpedia 19 , the Dublin Core Metadata Initiative (DCMI) 20 , and Semantic Publishing and Referencing (SPAR) ontologies [26] , which include FaBiO 21 , CiTO 22 , PRISM 23 , DataCite 24 , and C4O 25 .

Due to data richness and high coverage, the MAKG has been employed in several research fields and scenarios, including bibliometrics and scientific impact [27] [28] [29], recommender systems [30], data analytics (e.g., Nesta business intelligence tools 26), and benchmarking [31]. Further, the MAG, from which the MAKG originally derives, has been extensively investigated [32] [33] [34] and used for scientific ethics and mobility networks [24, 25], and in COVID-19 related studies [35, 36].

Starting with the first version (v. 2018-11-09), the MAKG has been notably enhanced. According to [16], the most significant limitations concern data replication, the field of study hierarchy, and the scarcity of entity embeddings. The authors of the latest MAKG version have addressed these issues, making substantial changes to the dataset and its information.

• According to [37], the design of the parent of MAKG leads to more author entities than real authors. Thus, in the last version, the replications of author entities have been addressed by performing name disambiguation.
• Originally, the MAKG structures the information on research areas as its parent. Fields of study are organised in a multi-level hierarchy where parent research areas are fine-grained and multiple. The last MAKG version provides a descriptive classification of fields of study based on abstracts of publications.
• In previous versions, entity pre-trained embeddings are provided only for publications using RDF2Vec. The last MAKG version includes embeddings for journals, conferences, and authors.

Along with these improvements, it is also worth noting the inclusion of the geographic coordinates of the institutions, i.e., affiliations.

One of the strengths of the MAKG is its free distribution, which allows free, direct and unlimited access via an RDF knowledge graph dump with resolvable URIs 27, a public SPARQL endpoint 28 and Zenodo 29. Although the advantages of the triples-RDF format are pointed out in [3], the skill level needed to access the dataset could represent a limit. On the one hand, the MAKG has already proven to be a highly versatile resource across both study scenarios and research fields. On the other hand, the RDF format may represent a limit, or at least a challenge, for some researchers, students and non-professionals. By providing .csv and .txt data, even users with fewer computer skills can easily access the dataset without software and online platforms for format conversion, even given the size. In addition, splitting the dataset allows users to download, manage and store subsets individually, as with MAKG and MAG. The application scenario of the MAKG lacks mobility-related studies. Conversely, the MAG has already been exploited to analyse scientific ethics and mobility networks [24, 25]. This could be due to (a) the scarcity of geographic information and (b) the lack of standardised geographic data. The latest MAKG version only partially adds geographic information, i.e., coordinates (longitude and latitude) for affiliations. However, geographic data are not standardised, e.g., the data on conferences includes the DBpedia location (city). The country is not provided, and several homonymous cities exist 30. As for the data format, standardised and easily accessible geographical information could favour multidisciplinary mobility-related studies. Other limitations of the MAKG depend on its design. First, the MAKG provides only the last affiliation of authors [3], making it impossible to build geolocalised authors' careers. On the contrary, the MAG provides the relationship between papers, authors and affiliations, allowing the re-integration of data useful to locate authors over time. Secondly, the new design choice to label papers with fields of study based on the abstracts (which are not available for all papers) imposes a loss of data [16]. Also, the results show that the design is more suitable for some research fields. The performance could suffer from the low specificity of abstracts' terms since the best performances are for fields with strictly domain-dependent vocabularies [16].

19 DBpedia: https://www.dbpedia.org/.

Geographic coordinates, i.e., latitude and longitude, for affiliations are one of the improvements of the latest MAKG version. However, further enhancing the geographical dimension of the dataset could facilitate studies in human mobility and migration, especially of scholars. In recent decades, researchers have benefited from alternative data sources such as bibliometric repositories, e.g., Scopus and Web of Science, to study academic collaboration networks and scientific mobility indicators [11] [12] [13]. However, international scientific mobility and migration patterns are still not fully explored besides a few studies [38, 39]. Thus, geolocating affiliations aims to add standardised and detailed metadata facilitating mobility-related studies.

By leveraging coordinates provided in MAKG, reverse geocoding methods can be applied to transform (latitude, longitude) pairs into addresses, or at least parts of them, e.g., country name. To this end, an ad-hoc semi-supervised NLP function is built to return an array of standardised geographical metadata from coordinates, including the city name, state, postcode, and the country with its ISO 3166 codes and official name. The algorithm takes as inputs the affiliations' latitude and longitude coordinates and applies Geopy 31 and Reverse Geocoder 32 reverse geocoding methods. Then, results are cross-checked following a set of rules to assign at least a country, together with standardised geographical information, to each affiliation. Geopy and Reverse Geocoder results show that the libraries classify some countries differently, e.g., Unincorporated territories of the United States 33. Since both Alpha 2 codes are correct depending on the country classification, i.e., whether Unincorporated territories of the United States are included under the US label or not, since they are non-incorporated, EMAKG provides both. Once all affiliations are labelled with an Alpha 2 code, the algorithm uses the PyCountry 34 library to retrieve the Alpha 3 ISO 3166 code, the official name and the country name.
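The cross-check between the two reverse geocoders can be sketched as follows. This is a minimal illustration under stated assumptions, not the EMAKG implementation: the two geocoders are passed in as plain callables (hypothetical stand-ins for the Geopy and Reverse Geocoder calls), and the reconciliation rule, which keeps both Alpha 2 codes when the libraries disagree, is inferred from the description above.

```python
from typing import Callable, Optional, Tuple

# A geocoder maps (latitude, longitude) to an ISO 3166-1 alpha-2 code, or None.
Geocoder = Callable[[float, float], Optional[str]]


def cross_check_country(
    lat: float,
    lon: float,
    geocoder_a: Geocoder,
    geocoder_b: Geocoder,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (primary_alpha2, secondary_alpha2) for one affiliation.

    Assumed rule set: if both geocoders agree, or only one answers,
    a single code is returned; if they disagree, both codes are kept.
    """
    code_a = geocoder_a(lat, lon)
    code_b = geocoder_b(lat, lon)
    code_a = code_a.upper() if code_a else None
    code_b = code_b.upper() if code_b else None
    if code_a and code_b and code_a != code_b:
        return code_a, code_b
    return code_a or code_b, None


# Toy geocoders standing in for the real libraries (illustrative values only).
def toy_a(lat: float, lon: float) -> Optional[str]:
    return "US"


def toy_b(lat: float, lon: float) -> Optional[str]:
    return "PR"


print(cross_check_country(18.2, -66.5, toy_a, toy_b))
```

Passing the geocoders as parameters keeps the rule set testable without network access; in practice each callable would wrap one of the libraries named above.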

An ad-hoc pipeline provides various information from Wikipedia URLs, including geographic data. This algorithm uses a parser based on WpTools 35 to obtain an array of raw information from Wikipedia infoboxes given a URL. The parser provides geographical and non-geographical data, such as country, state, city, acronym, foundation date, and homepage. However, city, homepage, and foundation date labels are inconsistent between affiliations; thus, keyword sets are used to gather data 36. For non-geographical fields, supervised NLP-based rules allow extracting and standardising valuable data. Standardised geographical information, i.e., city name with coordinates (latitude and longitude), state, and country name with ISO 3166 Alpha 2 and Alpha 3 codes and official name, is provided by a two-step URL-based geolocation algorithm. First, the algorithm uses city labels to search for city names in the GeoText cities set. It then retrieves country names and related data, such as country ISO 3166 alpha codes and official names, by applying a support function. This function uses the raw country text to retrieve the standardised and official name with ISO 3166 alpha codes from the GeoText country set, while the "fuzzy search" provided by PyCountry is used if GeoText does not give results. Also, the support function includes two supplementary methods used if a country is still not found. The first evaluates whether the state field can represent a valid country by re-applying the main support function with the state as the input parameter. The second extracts the country from the city name with Geopy and gathers related data by re-applying the main support function.

Both reverse geocoding (Section 4.1) and URL-based geolocation (Section 4.2) algorithms are applied to all affiliations, and results are cross-checked and integrated by a set of semi-supervised NLP-based rules. The method iterates over all the reverse-geocoded affiliations, checking whether coordinates and country were retrieved by reverse geocoding. If not, the geographical data extracted by the URL-based algorithm are added. When coordinates and country are provided by reverse geocoding and coincide with those of the URL-based algorithm, city and state are added if missing. On the contrary, the coordinates are included if the cities of both geocoding methods overlap. Finally, affiliations are further enriched with foundation dates, entity types, and acronyms. The entire enriching procedure yields a dataset of affiliations described by coordinates (from MAKG), a standardised city with coordinates, a state, a postcode, the standardised country with ISO Alpha codes and official name 37, entity type, the foundation date, and URLs.
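The integration rules can be illustrated with a small sketch. It assumes each affiliation is represented by two dictionaries with hypothetical keys (`country_alpha2`, `city`, `state`, `city_coordinates`), simplifies the "coincide" check to a comparison of country codes, and reads the last rule as "take the URL-based city coordinates when the city names match". These are interpretations of the prose above, not the actual rule set.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def integrate(reverse: Record, url_based: Record) -> Record:
    """Merge reverse-geocoded and URL-based records for one affiliation."""
    merged = dict(reverse)

    # Rule 1 (assumed): no country from reverse geocoding, so fall back
    # on whatever the URL-based algorithm extracted.
    if not merged.get("country_alpha2"):
        for key, value in url_based.items():
            if not merged.get(key):
                merged[key] = value
        return merged

    # Rule 2 (assumed): both sources agree on the country, so fill in
    # city and state when reverse geocoding left them empty.
    if merged["country_alpha2"] == url_based.get("country_alpha2"):
        for key in ("city", "state"):
            if not merged.get(key):
                merged[key] = url_based.get(key)

    # Rule 3 (assumed): city names overlap, so keep the city coordinates
    # provided by the URL-based algorithm.
    if merged.get("city") and merged.get("city") == url_based.get("city"):
        merged["city_coordinates"] = url_based.get("city_coordinates")
    return merged


reverse_record = {"country_alpha2": "IT", "city": None, "state": None}
url_record = {
    "country_alpha2": "IT",
    "city": "Pisa",
    "state": "Tuscany",
    "city_coordinates": (43.72, 10.40),
}
print(integrate(reverse_record, url_record))
```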

35 WpTools: https://wp-tools.com/author/wptools/.
37 eventually plus the ISO Alpha 2 code of the second country the affiliation could be localised.

Authors' careers are computed by collecting their papers tagged with the year of publication. However, the MAKG makes it impossible to retrieve information on the authors' location, since it does not include the relationship between publications, authors, and institutions (for which geographic information is available). The relationship between the three entities is gathered from the latest available corresponding subset in MAG 38. The MAG models the triple (paper, author, institution) as "has authors" relationship with a direct edge from a publication to each of its authors [37]. Thus, careers and geolocated affiliations (Section 4.1) are crossed to provide a set of paper and author pairs yearly linked to a geolocalised affiliation, i.e., (paper, author): year, affiliation_ID, affiliation_alpha2. The obtained data describes the authors' geolocalised careers with annual resolution.

38 Since the time coverage of datasets is different, i.e., MAKG 2020 while MAG 2019, this matching phase imposes a data loss.

Starting from the hypothesis that the authors live where they are affiliated, thus in the country of their affiliations, it is possible to define:

• the author's annual location, as the most frequent country among locations of institutions related to the author's annual publications.
• the author's career nationality, as the country of the first geolocated institution over the author's career.
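The two definitions above can be sketched in a few lines. The input layout, one `(author_id, year, country_alpha2)` tuple per geolocalised (paper, author) pair, is an assumption for illustration. Two further assumptions are made where the text is silent: ties between equally frequent countries are broken by first occurrence, and the career nationality is taken as the annual location of the author's earliest geolocated year.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, str]  # (author_id, year, country_alpha2)


def annual_locations(records: Iterable[Record]) -> Dict[Tuple[str, int], str]:
    """Most frequent affiliation country per (author, year)."""
    counts = defaultdict(Counter)
    for author, year, country in records:
        counts[(author, year)][country] += 1
    # most_common(1) returns the first-seen country in case of a tie.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def career_nationality(locations: Dict[Tuple[str, int], str]) -> Dict[str, str]:
    """Country of the earliest geolocated year in each author's career."""
    first_year: Dict[str, int] = {}
    for author, year in locations:
        if author not in first_year or year < first_year[author]:
            first_year[author] = year
    return {a: locations[(a, y)] for a, y in first_year.items()}


records = [
    ("a1", 2000, "IT"),
    ("a1", 2000, "IT"),
    ("a1", 2000, "FR"),
    ("a1", 2001, "FR"),
    ("a2", 1999, "DE"),
]
locations = annual_locations(records)
print(locations)
print(career_nationality(locations))
```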

By applying the existing literature to geolocalised careers, the concepts of i. working-nativity (Theorem 1), ii. authors' stock (Theorem 2) and iii. authors' flow (Theorem 3) are introduced based on the definitions of migrant stocks and migrant flows [18].

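Theorem 1 (working-nativity). An author A is a working-native of a country C if C is the author's career nationality.

Theorem 2 (authors' stock). Authors' stock refers to the number of authors identified as international working migrants during a given year in a country.

Theorem 3 (authors' flow). Authors' flow refers to the number of authors entering or leaving a given country in a given year.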

Following Theorems 1, 2 and 3, the total number of authors per country is obtained by aggregating authors' annual locations. Then, stocks are computed by counting the number of non-working-native researchers in each country. This is because an author cannot be defined as a working migrant of their working-native country. Flows are modelled as the directed graph between worldwide countries based on changes in authors' annual locations. A flow is represented by a weighted link between countries describing the origin and destination of a researcher's movement (C_Origin and C_Destination, respectively). The weight is the number of authors who moved from C_Origin to C_Destination. Moreover, flows are enriched with:

• returners: the number of authors located in the destination country for at least the second time during their career.
• origin natives: the number of authors leaving their working-native country.
• destination natives: the number of authors returning to their working-native country.
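A minimal sketch of the stock and flow computation is given below. It assumes annual locations are available as a `{(author, year): country}` mapping and career nationalities as an `{author: country}` mapping (both hypothetical layouts). A movement is detected between two consecutive observed years of the same author and attributed to the year of arrival, which is an assumption; the enrichment counts (returners, origin natives, destination natives) are omitted for brevity.

```python
from collections import Counter
from typing import Dict, List, Tuple

Locations = Dict[Tuple[str, int], str]  # (author, year) -> country


def stocks(locations: Locations, nationality: Dict[str, str]) -> Counter:
    """Per (country, year), count located authors who are not working-natives."""
    out: Counter = Counter()
    for (author, year), country in locations.items():
        if nationality[author] != country:
            out[(country, year)] += 1
    return out


def flows(locations: Locations) -> Counter:
    """Weighted directed edges (origin, destination, year) -> number of movers."""
    history: Dict[str, List[Tuple[int, str]]] = {}
    for (author, year), country in locations.items():
        history.setdefault(author, []).append((year, country))
    edges: Counter = Counter()
    for steps in history.values():
        steps.sort()  # chronological order
        for (_, origin), (year, destination) in zip(steps, steps[1:]):
            if origin != destination:
                edges[(origin, destination, year)] += 1
    return edges


locations = {
    ("a1", 2000): "IT",
    ("a1", 2001): "FR",
    ("a1", 2002): "FR",
    ("a2", 2001): "FR",
}
nationality = {"a1": "IT", "a2": "FR"}
print(stocks(locations, nationality))
print(flows(locations))
```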

The role of scientific collaborations, together with their topology and dynamics, has been extensively analysed in different research strands for several purposes [19, 40, 41, 42]. Studies have been conducted at different resolution levels, e.g., micro-level (individuals), meso-level (institutions), and macro-level (countries) [42]. According to the literature, two scholars are connected if they have authored a paper together [19]. EMAKG provides the networks of collaborations by building authors' ego networks 39. Various social relations can link together egos and alters depending on the network, e.g., working and personal relationships. In this case, an author's ego network is the weighted graph of their scientific collaborators in publishing papers. To this end, an ad-hoc algorithm takes as input the relationships among papers, authors, and years of publication. Then, for each author, it computes the annual list of co-authors over the author's publications. Further, since two authors may have published more than one paper together, the links between nodes (co-authors) are tagged with a weight equal to the count of the shared publications.

39 An ego network consists of a central node (ego) and the nodes to which it is directly connected (alters), plus the links among the nodes.
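As an illustration of the construction just described, the sketch below builds annual weighted ego networks from two hypothetical inputs: a `{paper_id: [author_id, ...]}` mapping and a `{paper_id: year}` mapping. Only ego-to-alter links are stored, with the weight equal to the number of papers shared in that year; links among alters are left out of this sketch.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List, Tuple


def ego_networks(
    paper_authors: Dict[str, List[str]],
    paper_year: Dict[str, int],
) -> Dict[Tuple[str, int], Counter]:
    """Return {(ego, year): Counter({alter: shared_papers})}."""
    networks: Dict[Tuple[str, int], Counter] = defaultdict(Counter)
    for paper, authors in paper_authors.items():
        year = paper_year[paper]
        # Every ordered pair of distinct co-authors adds one shared paper.
        for ego, alter in permutations(set(authors), 2):
            networks[(ego, year)][alter] += 1
    return networks


paper_authors = {"p1": ["a", "b", "c"], "p2": ["a", "b"], "p3": ["a", "d"]}
paper_year = {"p1": 2020, "p2": 2020, "p3": 2021}
networks = ego_networks(paper_authors, paper_year)
print(dict(networks[("a", 2020)]))
```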

Paper abstracts are not included in the latest version of the MAKG. Abstracts have been extensively investigated, particularly in linguistics. Studies focus on the type and provenance of publications [43, 44], and on research strands, e.g., medical [43], applied linguistics and educational [45, 46], and biomedical [47], to analyse styles [48], linguistic complexity [49] and rhetorical forms [50].

A semi-supervised pipeline is built to add linguistic data to abstracts obtained from the penultimate version of the MAKG. The method first infers the language code using LangDetect 40. Then, it employs Spacy 41 and Html 42 to clean abstracts and extract tokens, lemmas, and types from them. The method provides the texts of abstracts together with their ISO 639-1 codes, the list of tokens with frequency counts, and the list of types.
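The token and type extraction can be approximated with the standard library alone. The sketch below is not the pipeline described above: a regular-expression tokenizer is swapped in for Spacy, language detection and lemmatisation are left out, and the function name and output keys are hypothetical.

```python
import html
import re
from collections import Counter
from typing import Any, Dict

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher
# Runs of letters, optionally joined by hyphens or apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def abstract_features(raw_abstract: str) -> Dict[str, Any]:
    """Clean an abstract and return its text, token frequencies and types."""
    text = TAG_RE.sub(" ", html.unescape(raw_abstract))
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    frequencies = Counter(tokens)
    return {
        "text": " ".join(text.split()),  # normalised whitespace
        "tokens": frequencies,           # token -> frequency count
        "types": sorted(frequencies),    # distinct tokens
    }


features = abstract_features("<p>Graphs &amp; graphs of knowledge</p>")
print(features["tokens"])
print(features["types"])
```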

The h-index [51] (also known as the Hirsch index or Hirsch number) is a measure of an author's scientific achievements that considers both the number of papers published and the citations they receive. Despite its drawbacks 43 [53, 54], it has become one of the most well-known metrics in academia. The index refers to the highest number h such that an author has h publications with at least h citations each. Google Scholar 44 is one of the most well-known academic search engines, but it has no official APIs. Among available third-party APIs, ScraperAPI 45 (to be combined with prebuilt Google Scholar scraping libraries, e.g., Scholarly 46), SERP API 47 and Publish or Perish 48 are specifically designed for Google Scholar. However, Publish or Perish may require proxy solutions, e.g., ScraperAPI; SerpWow 49 has no dedicated Google Scholar documentation; Scale SERP 50 cannot be customised and provides low-granularity data compared to most other APIs. All the third-party APIs are subscription-based, with different request limits and pricing. Freely available libraries include Scholarly, which allows searching authors by name, by the id in the URL of an author's profile (in Google Scholar), by keywords, and by (titles of) publications, and Scholar.py 51, which allows searching authors by name and by keywords. However, these libraries a) permit few queries compared to the dataset dimension, b) may return non-disambiguated results since searches are based on name, and c) may not be consistent with the number of publications and citations per paper in the dataset. To face these limitations, the h-index is calculated directly from the data. The array of citations is computed starting from authors' publications. Then, the index is computed following three different methods, and results are cross-checked to ensure their reliability given the lack of ground truth. I compute the h-index by applying Scholarmetrics 52, the function 53 derived from [51], and the h index array function 54. The three methods agree in 100% of cases, providing an h-index for all authors.
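The definition of the h-index translates directly into code. The sketch below gives two independent implementations and cross-checks them, in the spirit of the agreement check described above; neither is one of the three functions used in the paper, and the function names are illustrative.

```python
from typing import Sequence


def h_index_sorted(citations: Sequence[int]) -> int:
    """Rank papers by citations; h is the last rank with citations >= rank."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h_index_counting(citations: Sequence[int]) -> int:
    """Largest h such that at least h papers have at least h citations."""
    candidates = range(len(citations) + 1)
    return max(h for h in candidates if sum(c >= h for c in citations) >= h)


for sample in ([10, 8, 5, 4, 3], [25, 8, 5, 3, 3], [0, 0], []):
    assert h_index_sorted(sample) == h_index_counting(sample)

print(h_index_sorted([10, 8, 5, 4, 3]))  # 4
```

The sorted variant runs in O(n log n) per author, while the counting variant is quadratic and serves only as an independent check.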

Fields of study represent research strands and concepts papers are associated with. Information about which field(s) of study a publication belongs to is very valuable for many tasks, but it is often complicated to collect or calculate [32]. Each field of study is represented by a name, paper and citation counts, and a hierarchy level of abstraction ranging in [0, 5]. Fields of study are structured according to parent-child relationships, and each research field can have multiple parents. The latest MAKG version includes a descriptive classification obtained by assigning abstracts of papers to the 19 level-0 top fields of study from MAG (Section 3.1). Together with the data loss [16], the classification method seems to be more suitable for research fields with a highly specialised and domain-dependent lexicon, i.e., geology, psychology, medicine, and biology. The result of descriptive classification is a list of tags describing the topic of the paper. Here, levels of abstraction with kinship relationships are used to propagate and assign top-level FOSs to fields. The 19 top-level FOSs are directly tagged with the corresponding research area since these have no FOS parents. Conversely, FOSs of lower levels (from 1 to 5) first inherit their parents' tags. Then, scores (as proportions) in the range [0, 1] are computed for each research area based on the tag lists 55. The FOS labelling allows, in turn, assigning one or more research areas to publications. For each paper, a) the list of FOSs with which it is associated is obtained; b) the scores are added up by research area; c) the obtained scores are divided by the sum of the scores of all the research areas, rescaling the scores to the range [0, 1].
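The propagation and rescaling steps can be sketched as follows. The input layout, a `{fos: [parent_fos, ...]}` mapping in which top-level FOSs have no parents, is hypothetical. The sketch also assumes that inherited tags are counted with multiplicity when computing proportions and that the hierarchy is acyclic; the text does not state either point explicitly.

```python
from collections import Counter
from typing import Dict, List


def label_fields(parents: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each FOS against the top-level research areas it inherits."""
    cache: Dict[str, List[str]] = {}

    def tags(fos: str) -> List[str]:
        if fos not in cache:
            if not parents.get(fos):
                cache[fos] = [fos]  # top-level FOS: tagged with itself
            else:
                cache[fos] = [t for p in parents[fos] for t in tags(p)]
        return cache[fos]

    scores = {}
    for fos in parents:
        counts = Counter(tags(fos))
        total = sum(counts.values())
        scores[fos] = {area: n / total for area, n in counts.items()}
    return scores


def label_paper(
    paper_fos: List[str], fos_scores: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Sum FOS scores by research area, then rescale so they add up to 1."""
    summed: Counter = Counter()
    for fos in paper_fos:
        for area, score in fos_scores.get(fos, {}).items():
            summed[area] += score
    total = sum(summed.values())
    return {area: s / total for area, s in summed.items()} if total else {}


parents = {
    "Biology": [],
    "Computer science": [],
    "Bioinformatics": ["Biology", "Computer science"],
    "Genomics": ["Biology", "Bioinformatics"],
}
fos_scores = label_fields(parents)
print(fos_scores["Bioinformatics"])
print(label_paper(["Biology", "Bioinformatics"], fos_scores))
```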

The Enhanced Microsoft Academic Knowledge Graph (Appendix A) is a large dataset of scientific publications composed of several subsets representing the entities involved in publications and their relationships.

Papers. Papers represent the core of the EMAKG graph. The dataset comprises 238,670,900 papers published from 1800 to 2021 with different rates. As shown in Figure 1, starting from 1900 the number of publications constantly grows 56. Publications are described by several properties, including the unique identifier, the entity class, and the unique identifier of the journal, the conference series, and the conference instance in which the article is published. Also, papers have a rank, a family Id, and the counts of citations and references. Other properties are based on DBpedia, DCMI, FaBiO, and PRISM data. As shown by Table 2, the most represented document type is journal articles, which account for 35.93% of total papers and 57.87% of papers with a non-null type. Patents cover 22% of the entire dataset, conference papers, books and book chapters do not reach 2% each, and 37.90% of the total publications do not have a document type 57. Papers are associated with abstracts, for which original texts, ISO 639-1 codes, tokens with frequencies, and types are provided (Section 7).

55 The 19 top-level FOSs have a 1.0 score.
56 Note that the decrease in 2019, 2020, and 2021 is due to a gap between data collection (before June 2020) and data release.

Affiliations. Among the entities, the dataset comprises 25,768 enriched affiliations (Section 4) described by coordinates (latitude and longitude, from MAKG), a standardised city and its coordinates, a state, a postcode, the standardised country with ISO Alpha codes and official name, and the ISO Alpha 2 code of the second country in which the affiliation could be localised. Figure 2 describes the features obtained together with the percentages with respect to the entire subset. Most affiliations (about 96%) are described by geographic information, e.g., coordinates, country name and ISO Alpha codes. Additionally, Figure 3 shows the location of individual affiliations based on geographic coordinates. The world map is highly heterogeneous, with densely populated areas, e.g., the United States of America and Europe, contrasted with areas with poor geolocation, e.g., African states and Russia. North America, Europe (especially Central), Brazil and Mexico, India, China, and Oceania include most affiliations. In contrast, Central America, western South America, Africa (excluding South Africa and Nigeria), Arab and Western Asian states, plus Russia are poorly represented.

Authors. Like MAKG, the EMAKG includes just over 243 million authors and more than 151 million disambiguated authors [16]. An author is identified by a unique identifier and described by the class, the rank, the last known affiliation in MAKG, the FaBiO name, and the paper, paper family, and citation counts. Authors' careers 58 are computed following the method in Section 5 and include, for each year, the publications and the related affiliations. On average, an author published 4.21 papers. Authors are further described by their h-index (Section 8). On average, an author has an h-index of 1.01, as shown in Figure 4. The five highest values in the dataset are 240, 252, 257, 300, and 375; these may be partially plausible but are likely misleading to some extent because of unclean data, as already noted for some cases in [16]. Information on authors also includes their annual locations 59, together with the affiliation, and their ego networks, defined as the set of yearly co-authors. The time coverage of networks ranges from 1856

Field of Study. The dataset comprises more than 740 thousand FOSs labelled with at least one field of study, i.e., the research area(s) to which the FOS belongs according to the multiple parent-child relationships. Most FOSs are labelled with one or two main research areas, while fewer than 4500 FOSs are tagged with more than six areas. Most FOSs refer to the so-called STEM 60 disciplines, i.e., science, technology, engineering, and mathematics, and any subjects that fall under these four disciplines, such as computer science (CS), biology, and chemistry. Conversely, humanities-related disciplines such as history, art, and philosophy seem to be characterised by fewer FOSs, as shown in Figure 6. Using FOSs, more than 44 million papers are tagged with at least one research area. Figure 7 shows the distribution of research areas over papers, obtained by summing, per research area, the paper scores greater than or equal to 0.5.
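The aggregation behind Figure 7 can be illustrated with a minimal sketch. It assumes the paper-to-research-area labels are available as (paper id, research area, score) rows; this layout, the function name `area_totals`, and the sample rows are hypothetical and are not taken from the dataset files.

```python
from collections import defaultdict


def area_totals(paper_area_scores, threshold=0.5):
    """Sum, per research area, the paper scores that reach the threshold."""
    totals = defaultdict(float)
    for _paper_id, area, score in paper_area_scores:
        # Scores below the threshold are ignored, as described for Figure 7.
        if score >= threshold:
            totals[area] += score
    return dict(totals)


# Hypothetical (paper_id, research_area, score) rows, for illustration only.
rows = [
    ("p1", "Computer science", 0.75),
    ("p1", "Mathematics", 0.25),
    ("p2", "Computer science", 0.5),
    ("p3", "Biology", 1.0),
]
totals = area_totals(rows)
```

In this toy input the Mathematics label of paper p1 falls below 0.5 and therefore contributes nothing, while both Computer science labels are summed.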

Careers, Stocks & Flows. The EMAKG includes authors' careers as the set of their annual publications and related affiliations. Authors' careers are calculated by leveraging geolocalised affiliations and data about publications (Section 5). EMAKG provides careers for 27,647,403 authors, of which 27,448,058 are geolocalised (199,345 fewer). The dataset also includes researchers' stocks (Theorem 2) and flows (Theorem 3) (Section 5) with annual temporal resolution and worldwide coverage. Both span from 1857 to 2020, with some sparse gaps before 1945. Figure 8 shows researchers' stocks in 2000 and 2019, respectively; it highlights the general increase in the number of authors worldwide and outlines the consolidation of the power of some countries, e.g., the United States of America and the United Kingdom. The data allow annual analyses at different levels of spatial granularity, i.e., country level, continent level, and customised, ad-hoc geographical sets. Flows are represented as annual directed graphs of researchers' movements between country pairs, obtained by leveraging the geolocalisation of affiliations and aggregating data. Figure 9 shows the trend of researchers' movements over the entire time axis, as well as in detail from 1995 onwards. Global flows of researchers generally tend to grow over time, with minimal decreases only before 1975. From 2018 the trend reverses. This may be due to data loss when merging sources with different time coverage.

60 The acronym STEM stands for Science, Technology, Engineering, and Mathematics.
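To make the notions of annual stocks and flows concrete, the following sketch derives both from authors' annual locations. It rests on simplifying assumptions that may differ from the formal definitions in Section 5: a stock is taken to be the number of authors located in a country in a given year, and a flow is counted whenever an author's country changes between two consecutive recorded years, dated at the year of arrival. The record layout and the function name `stocks_and_flows` are hypothetical.

```python
from collections import Counter


def stocks_and_flows(locations):
    """Compute annual stocks and country-to-country flows.

    locations: iterable of (author_id, year, country), assuming one country
    per author and year (an assumption of this sketch).
    """
    by_author = {}
    for author, year, country in locations:
        by_author.setdefault(author, {})[year] = country

    stocks = Counter()  # (year, country) -> number of authors
    flows = Counter()   # (year, origin, destination) -> number of moves
    for years in by_author.values():
        ordered = sorted(years.items())
        for year, country in ordered:
            stocks[(year, country)] += 1
        # Compare each recorded year with the next recorded one.
        for (_y0, c0), (y1, c1) in zip(ordered, ordered[1:]):
            if c0 != c1:
                flows[(y1, c0, c1)] += 1
    return stocks, flows


# Hypothetical annual locations, for illustration only.
records = [
    ("a1", 2018, "IT"), ("a1", 2019, "DE"),
    ("a2", 2018, "IT"), ("a2", 2019, "IT"),
    ("a3", 2019, "DE"),
]
stocks, flows = stocks_and_flows(records)
```

Each year's flow counter can be read as the weighted edge list of that year's directed graph, with countries as nodes.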

The Enhanced MAKG is built on top of the Microsoft Academic Knowledge Graph. Although significant additions have been made, most of the original structure has been preserved, including the main original entities, relationships, and properties. Thus, the EMAKG can be exploited in all the use cases and applications of MAKG. In addition, the EMAKG a) reintroduces subsets that are no longer available, e.g., abstracts, b) merges new knowledge retrieved from external resources (i.e., Wikipedia) and libraries, and c) adds new relationships among entities, computed by aggregating data from the dataset. The proposed improvements could open the dataset to new analyses and applications. Among the possible research areas, the data could be used to measure institutions' research output [29] and for the science of science [55], also considering the geographical dimension at different spatial resolutions. Authors' connections can be leveraged to study knowledge exchange. Moreover, researchers' flows and stocks can be compared with official statistics to study highly skilled mobility and migration. Linguistics and computational linguistics could benefit from abstracts and pre-computed tokens and types for language studies. Also, the authors' annual ego networks can be explored for research in network analysis. More generally, thanks to the amount and multifaceted nature of the information in the Enhanced MAKG and its easily accessible formats, e.g., .csv and .txt, the data can be analysed in several fields and for multiple purposes.
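As an illustration of the network-analysis use case, the sketch below rebuilds yearly ego networks, understood as the set of an author's co-authors in a given year, from a list of papers. The input layout (publication year plus author identifiers) and the function name `yearly_ego_networks` are assumptions made for this example and do not reflect the actual file structure of the dataset.

```python
from collections import defaultdict


def yearly_ego_networks(papers):
    """Map (author, year) to the set of that author's co-authors in that year.

    papers: iterable of (year, list_of_author_ids).
    """
    ego = defaultdict(set)
    for year, authors in papers:
        unique = set(authors)
        for author in unique:
            # Every other author of the same paper is a co-author.
            ego[(author, year)] |= unique - {author}
    return dict(ego)


# Hypothetical papers, for illustration only.
papers = [
    (2019, ["a1", "a2", "a3"]),
    (2019, ["a1", "a4"]),
    (2020, ["a1", "a2"]),
]
ego = yearly_ego_networks(papers)
```

An author with only single-author papers in a year gets an empty set, which keeps that author visible as an isolated node.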

This article presents the Enhanced MAKG, an enriched version of the Microsoft Academic Knowledge Graph, and the methods developed to build it. The main aims of the dataset are a) making data accessible and easy to use and b) enriching the available information to allow and facilitate new analyses. To this end, a set of methods for retrieving, standardising, and adding new information about existing entities was developed to improve the available data. Geographical information and geolocalisation are enhanced by combining reverse geocoding, information retrieval, and data integration. Author-related data include working connections (ego networks) and movements between institutions, publications, and general information. Further, EMAKG provides authors' annual locations and career nationalities, together with worldwide yearly stocks and flows. Among others, the subsets include a) fields of study (and publications) labelled by their discipline(s); b) abstracts and linguistic features, i.e., standard language codes, tokens, and types; c) entities' general information, e.g., date of foundation and type of institution; and d) academia-related metrics, i.e., h-index. The resulting dataset maintains all the characteristics of the parent datasets and includes additional subsets and data that can be used for new case studies relating to network analysis, knowledge exchange, linguistics and computational linguistics, and mobility and human migration, among others.

• CountryAnnualFlowsAggregated: Flows aggregated by country and year.
• 30.FlowsAnnual: Annual country-to-country flows.

• Authors: Authors' subset.
• 06.PaperAuthorAffiliations: Relationships between papers and authors.
• 07.PaperExtendedAttributes: Patent numbers and PubMedIds.

To share or not to share? Research-knowledge sharing in higher education institution: preliminary results

The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data

The microsoft academic knowledge graph: a linked data source with 8 billion triples of scholarly data. International Semantic Web Conference

A survey on scholarly data: From big data perspective. Information Processing & Management

Rituals of encounter: campus life, liminality and being the familiar stranger. Crossing boundaries and weaving intercultural work, life, and scholarship in globalizing universities

Academic mobility, transnational identity capital, and stratification under conditions of academic capitalism

Academic and teacher expatriates: Mobilities, positionalities, and subjectivities

Anatomy of a Misfit: International Migration Statistics

International migration under the microscope

Human migration: the big data perspective

Dynamics of Scientific Collaboration Networks Due to Academic Migrations

Scholarly migration within Mexico: analyzing internal migration among researchers using Scopus longitudinal bibliometric data

An investigation of the relationship between scientists' mobility to/from China and their research performance

The preeminence of ethnic diversity in scientific collaboration

The mobility network of scientists: analyzing temporal correlations in scientific careers

Enhancing the Microsoft Academic Knowledge Graph via Author Name Disambiguation, Publication Classification, and Embeddings. Semantic Web

An overview of microsoft academic service (mas) and applications

Handbook on Measuring International Migration through Population Censuses

The structure of scientific collaboration networks

Exploring scholarly data with rexplore. International semantic web conference

Setting our bibliographic references free: towards open citation data

Acekg: A large-scale knowledge graph for academic data mining

SPedia: a central hub for the linked open data of scientific publications

Conference live: Accessible and sociable conference semantic data

Conference linked data: the scholarlydata project. International Semantic Web Conference

International Semantic Web Conference

Schüber, F. Identifying Used Methods and Datasets in Scientific Publications. SDU@ AAAI

Investigating software usage in the social sciences: A knowledge graph approach. European Semantic Web Conference

How Can a University Take Its First Steps in Open Data?

A scalable hybrid research paper recommender system for microsoft academic

Horrocks, I. Streaming Partitioning of RDF Graphs for Datalog Reasoning. European Semantic Web Conference

An analysis of the microsoft academic graph. D-lib Magazine

Analysing trends in computer science research: A preliminary study using the microsoft academic graph

Investigations on rating computer sciences conferences: An experiment with the Microsoft Academic Graph dataset

A glimpse of the first eight months of the covid-19 literature on microsoft academic graph: Themes, citation contexts, and uncertainties

Cost-effectiveness of Microsoft Academic Graph with machine learning for automated study identification in a living map of coronavirus disease 2019 (COVID-19) research

Microsoft academic graph: When experts are not enough

A bibliometric approach to tracking international scientific migration

The many faces of mobility: Using bibliometric data to measure the movement of scientists

Power quality: Scientific collaboration networks and research trends

Dynamics of scientific collaboration networks due to academic migrations

The structure of scientific collaboration networks in Scientometrics

Abstracts in German medical journals: A linguistic analysis. Information Processing & Management

Analyses of rhetorical moves and linguistic realizations in accounting research article abstracts published in international and Thai-based journals

Research article abstracts in applied linguistics and educational technology: A study of linguistic realizations of rhetorical structure and authorial stance

Prominent messages in Education and Applied Linguistic abstracts: How do authors appeal to their prospective readers

Structuralizing biomedical abstracts with discriminative linguistic features. Computers in biology and medicine

Do linguistic style and readability of scientific abstracts affect their virality?

Linguistic complexity of abstracts and titles in highly cited journals

Literary research article abstracts: An analysis of rhetorical moves and their linguistic realizations

An index to quantify an individual's scientific research output

The emperor has no clothes

Trends in the usage of ISI bibliometric data: Uses, abuses, and implications. portal: Libraries and the Academy

The journal impact factor: Don't expect its demise any time soon