Dual network embedding for representing research interests in the link prediction problem on co-authorship networks


Ilya Makarov1,2, Olga Gerasimova1, Pavel Sulimov1 and
Leonid E. Zhukov1

1 School of Data Analysis and Artificial Intelligence, National Research University Higher School
of Economics, Moscow, Russia

2 Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

ABSTRACT
We present a study on co-authorship network representation based on network
embedding together with additional information on topic modeling of research
papers and a new edge embedding operator. We use the link prediction (LP) model for
constructing a recommender system for searching collaborators with similar research
interests. Extracting topics for each paper, we construct a keyword co-occurrence
network and use its embedding for further generalizing author attributes. Standard
graph feature engineering and network embedding methods were combined for
constructing a co-author recommender system formulated as an LP problem and
prediction of future graph structure. We evaluate our approach on a dataset
containing temporal information on National Research University Higher School of
Economics over 25 years of research articles indexed in Russian Science Citation
Index and Scopus. Our model of network representation shows better performance
for the stated binary classification tasks on several co-authorship networks.

Subjects Artificial Intelligence, Data Mining and Machine Learning, Digital Libraries,
Network Science and Online Social Networks, World Wide Web and Web Science
Keywords Co-occurrence network, Network embedding, Machine learning, Link prediction,
Recommender systems, Co-authorship networks

INTRODUCTION
Nowadays, researchers struggle to find relevant scientific contributions among a large
variety of international conferences and journal articles. In order not to miss important
improvements in various related fields of study, it is important to know the current
state-of-the-art results without reading all the papers tagged by research interests. One
solution is to search for the most “important” articles, taking into account citation or
centrality metrics of the paper and the authors with high influence on a specific research
field (Liang, Li & Qian, 2011). However, such a method does not include collaborative
patterns and the previous history of research publications in co-authorship. It also does not
measure the author’s professional skills and the ability to publish research results according
to paper influence metrics, for example, journal impact factor.

We study the problem of finding a collaborator depending on his/her research
community, the quality of publications and structural patterns based on the co-authorship
network suggested by Newman (2004a, 2004b). Early unsupervised learning approaches

How to cite this article Makarov I, Gerasimova O, Sulimov P, Zhukov LE. 2019. Dual network embedding for representing research
interests in the link prediction problem on co-authorship networks. PeerJ Comput. Sci. 5:e172 DOI 10.7717/peerj-cs.172

Submitted 20 September 2018
Accepted 28 December 2018
Published 21 January 2019

Corresponding author
Ilya Makarov, iamakarov@hse.ru

Academic editor
Diego Amancio

Additional Information and
Declarations can be found on
page 16

DOI 10.7717/peerj-cs.172

Copyright
2019 Makarov et al.

Distributed under
Creative Commons CC-BY 4.0



for community detection in research networks were studied in Morel et al. (2009),
Cetorelli & Peristiani (2013), Yan & Ding (2009), and Velden & Lagoze (2009). A review on
social network analysis and network science can be found in Wasserman & Faust (1994),
Barabási & Pósfai (2016), and Scott (2017).

We focus on the link prediction (LP) problem (Liben-Nowell & Kleinberg, 2007) in
order to predict links in temporal networks and restore missing edges in complex networks
constructed over noisy data. LP algorithms can be used to extract missing links
or to detect abnormal interactions in a given graph. However, the most suitable use case is
predicting the most probable persons for future collaboration, which we state
as a problem of recommending a co-author using LP ranking (Li & Chen, 2009). Our
model is designed to predict whether a pair of nodes in a network will have a
connection. We can also predict the parameters of such an edge in terms of publication
quality or number of collaborators corresponding to the predicted link (Makarov et al.,
2019a, 2019b).

In general, LP algorithms are widely used in several applications, such as web linking
(Adafre & De Rijke, 2005), search for real-world friends on social networks (Backstrom &
Leskovec, 2011), and citation recommender systems for digital libraries (He et al., 2010).
A complete list of existing applied LP techniques can be found in Srinivas & Mitra (2016).

Recently, the improvement of machine learning techniques shifted the attention
from manual feature engineering to vectorized information representation. Such
methods have been successfully applied to natural language processing and are now being
tested on network topology representation, despite the fact that an arbitrary graph cannot be
described by its invariants. The approach of representing network vertices by a vector
model depending on an actor’s neighborhood and similar actors is called graph (network)
embedding (Perozzi, Al-Rfou & Skiena, 2014). The current progress of theoretical
and practical results on network embeddings (Perozzi, Al-Rfou & Skiena, 2014; Tang et al.,
2015; Chang et al., 2015; Grover & Leskovec, 2016) shows state-of-the-art performance on
such problems as multi-class actor classification and LP. Moreover, the existing methods
use not only structural equivalence and network homophily properties, but also
actor attributes, such as labels, texts, images, etc. A list of surveys on graph embedding
models and applications can be found in Cai, Zheng & Chang (2018), Cui et al. (2018),
Goyal & Ferrara (2018), and Chen et al. (2018).

In this paper, we study a co-authorship recommender system based on co-authorship
network where one or more of the coauthors belong to the National Research University
Higher School of Economics (NRU HSE) and the co-authored publications are only
those indexed in Scopus. We use machine learning techniques to predict new edges
based on network embeddings (Grover & Leskovec, 2016; Wu & Lerman, 2017) and edge
characteristics obtained from author attributes. We compare our approach with
state-of-the-art algorithms for the LP problem using structural, attribute and combined
feature space to evaluate the impact of the suggested approach on the binary classification
task of predicting links in a co-authorship network. The resulting system could be
applied for expert search, recommending a collaborator or scientific adviser, and
searching for relevant research publications, similar to the work proposed in

Makarov et al. (2019), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.172 2/20

Makarov, Bulanov & Zhukov (2017) and Makarov et al. (2018a). In what follows, we describe
a solution to the LP problem leading to the evaluation of our recommender system based on
co-authorship network embeddings and manually engineered features for HSE researchers.

RELATED WORK
Link prediction
The LP problem was stated in Liben-Nowell & Kleinberg (2007), in which Liben-Nowell
and Kleinberg proposed using node proximity metrics. The evaluation of the proposed
metrics for large co-authorship networks showed promising results for predicting
future links based on network topology without any additional information on authors.
Unsupervised structural learning was proposed in Tang & Liu (2012). Gao, Denoyer &
Gallinari (2011) presented temporal LP based on node proximity and its attributes
determined by the content using matrix factorization.

Two surveys on LP methods, describing core approaches to feature engineering, Bayesian
approaches and dimensionality reduction, were presented in Hasan & Zaki (2011) and Lü &
Zhou (2011). Another survey on LP was published in Wang et al. (2015).

The simplest baseline solution using network homophily is based on common
neighbors or other network similarity scores (Liben-Nowell & Kleinberg, 2007). However,
Gao et al. (2015) showed that the similarity measures are not robust to the network global
properties and, thus, could add noise to a prediction model that uses similarity scores only.
The impact of attribute-based formation in social networks was considered in
Robins et al. (2007) and McPherson, Smith-Lovin & Cook (2001). All these observations
require feature engineering depending on the domain.

Graph-based recommender systems formulated via the LP problem were suggested in Chen,
Li & Huang (2005), Liu & Kou (2007), and Li & Chen (2009). Kossinets & Watts (2009)
studied the effect of homophily in a university community. They considered a temporal
co-authorship network accompanied by author attributes and concluded that not only
structural proximity, but also author homophily influences the social network structure.

Another approach focusing on interdisciplinary collaboration inside a university
was presented in Cho & Yu (2018). The authors used the existing co-authorship network
and academic information for the University of Bristol and proposed a new LP model for
co-authorship prediction and recommendation.

Kong et al. (2018) developed a scientific paper recommender system called VOPRec.
However, in contrast to our work, they constructed vector representations of research
papers in citation networks. Their system uses both text information represented
with word embedding to find papers of similar research interest and structural identity
converted into vectors to find papers of similar network topology. To combine text and
structural information with the network, the vector representation of an article can be
learned with network embedding.

Network embedding
In general, knowledge retrieval and task-dependent feature extraction would require
a domain-specific expert to construct a real-valued feature vector for node and edge
representation. The quality of such an approach is influenced by particular tasks and
expert work, and it does not scale to large noisy networks. Recently, the theory of
hidden representations has had an impact on machine learning and artificial intelligence.
It shifted the attention from manual feature engineering to defining a loss function and then
solving an optimization task. The early works on network vectorized models were presented
in Local Linear Embedding (Roweis & Saul, 2000), IsoMAP (Tenenbaum, De Silva &
Langford, 2000), Laplacian Eigenmap (Belkin & Niyogi, 2002), Spectral Clustering (Tang &
Liu, 2011), MFA (Yan et al., 2007), and GraRep (Cao, Lu & Xu, 2015). These works try to
embed the networks into a real-valued vector space using several proximity metrics.
However, the development of representation learning for networks stagnated due to
the non-robust and inefficient machine learning methods of dimensionality reduction
based on network matrix factorization or spectral decomposition. These methods
were not applicable to large networks and noisy edge and attribute data, providing
low accuracy and having high time complexity of constructing the embedding.

The modern methods of network embedding try to improve the performance on several
typical machine learning tasks using a conditional representation of a node based on its
local and global neighborhood defined via random walks. The first-order and
second-order node proximities were suggested in the LINE (Tang et al., 2015) and SDNE
(Wang, Cui & Zhu, 2016) models. Generalizing this approach, the DeepWalk (Perozzi,
Al-Rfou & Skiena, 2014) and node2vec (Grover & Leskovec, 2016) algorithms use the
Skip-gram model (Mikolov et al., 2013) based on simulation of breadth-first sampling
and depth-first sampling. Although Carstens et al. (2017) showed
some drawbacks of the node2vec (Grover & Leskovec, 2016) graph embedding, it
remains a competitive structural-only embedding for representing both homophily
and structural equivalence in the network. Its generalization to global network
representation learning from Wu & Lerman (2017) shows results comparable with
the original model.

Several works cover the node attributes, such as label and text content (see TADW
(Yang et al., 2015) and LANE (Huang, Li & Hu, 2017)). In the TriDNR paper, Pan et al. (2016)
proposed to separately learn structural embedding from DeepWalk (Perozzi,
Al-Rfou & Skiena, 2014) and content embedding via Doc2Vec (Le & Mikolov, 2014).
In contrast, ASNE (Liao et al., 2017) learns combined representations for structural
and node attribute representation using an end-to-end neural network.

We focus on graph embedding and feature engineering methods applied to an LP task
on a particular network, consisting of HSE researchers who have co-authored at least one
paper, with additional attributes representing authors. Using network features only fails to
include the information about actors obtained from other sources, thus decreasing the
efficiency of network embeddings. We aim to include information on the feature space of
authors’ research interests using data from the Scopus digital library containing manually
input and automatically selected keywords for each research article. Based on this
information, we constructed a keyword co-occurrence network and considered its embedding
for further generalizing author attributes.


DATASET DESCRIPTION AND PREPROCESSING
We use the NRU HSE portal (National Research University Higher School of Economics,
2017) containing information on research papers co-authored by at least one HSE
researcher, which were later uploaded to the portal by one of the co-authors. The HSE
database contains information on over 7,000 HSE researchers who have published over 31,000
research papers. The portal site contains a web interface for researchers to extract
metadata of publications for a given time period, which could also be used by external
researchers. The database records contain information on title, list of authors, keywords,
abstract, year and place, journal/conference and publishing agency, and indexing flags for
Scopus, Web of Science (WoS) Core Collection and Russian Science Citation Index (RSCI).

Unfortunately, the database has no interface for managing bibliography databases and
is not synchronized with indexing digital libraries, in contrast to
Google Scholar or personal researcher profile management services such as ResearcherID
or ORCID. As a consequence, a large amount of noisy data occurs due to such problems
as author name ambiguity or incorrect/incomplete information on the publications.

In order to resolve the ambiguity, we considered standard disambiguation approaches
for predicting the necessity to merge authors. We used Levenshtein distance (useful for
records with one or two erroneous letters) for abbreviated author last and first names and
then used two thresholds, based on cosine similarity and common neighbors metrics, to
validate whether to merge two authors with the same abbreviation in the database. The
threshold values were found manually via validation on a small labeled set of ambiguous
records. The number of authors with ambiguous spelling does not exceed 2% of the whole
database. We have also removed all non-HSE authors due to the lack of information on their
publications in the HSE dataset.
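
The merge rule described above can be illustrated with a minimal sketch. It is not the
authors' implementation: the function names, the keyword-count profiles used for the cosine
similarity, and the threshold values (max_edits, min_cosine, min_common) are assumptions
chosen only for illustration, since the text states that the actual thresholds were tuned
manually.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cosine(x: dict, y: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def should_merge(name_a, name_b, coauthors_a, coauthors_b, profile_a, profile_b,
                 max_edits=2, min_cosine=0.5, min_common=1):
    """Hypothetical rule: close names, similar profiles and shared co-authors.

    All three thresholds are placeholders, not values from the paper.
    """
    if levenshtein(name_a, name_b) > max_edits:
        return False
    common = len(set(coauthors_a) & set(coauthors_b))
    return cosine(profile_a, profile_b) >= min_cosine and common >= min_common
```

Requiring both thresholds to pass keeps the rule conservative: two records with nearly
identical abbreviated names are merged only if their profiles and co-author sets also agree.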

We also retrieved the Scopus database of research papers co-authored by researchers
from NRU HSE and indexed by Elsevier (2018). The database contains information on
paper author list, document title, year, source title, volume, issue, pages, source and
document type, DOI, author keywords, and index keywords. We also added the information on
research interests based on Scopus subject categories for the journals in which
authors have published their articles. We manually entered the research interest list
according to RSCI categorization in order to compensate for the lack of keywords and
attributes for the papers.

We then stated the problem of indexing author research interests in terms of keywords
attached to paper descriptions in both databases, and retrieved from the HSE dataset using
the BigARTM (Vorontsov et al., 2016) topic modeling framework. For the Scopus dataset,
we use automatically chosen keywords previously prepared by the service together with
the list of keywords manually input by authors. We also use additional keywords written
in terms of subject categories of journals and proceedings according to the indexing in
Scopus and WoS research paper libraries.

These two datasets (HSE, Scopus) have common papers; however, the HSE dataset contains
a lot of noisy data and, unfortunately, low-level publications not indexed outside RSCI,
while Scopus contains precise information on 25% of the papers and exact research

interest representation, while lacking weak connections between authors who have written
only poor-quality papers.

We visualized the HSE network in Fig. 1, with author names as node labels and edge
width showing the cumulative quantity of joint publications based on the

Figure 1 Visualization of HSE co-authorship network. We plot the whole HSE co-authorship network (A) and its subgraphs induced by local
proximities around influential persons from our university such as rector Kuzminov Y.I. (B), first vice-rector responsible for science Gokhberg
L.M. (C), and university research supervisor Yasin E.G. (D). Full-size DOI: 10.7717/peerj-cs.172/fig-1

papers’ quartiles and their number. In particular, we plotted the whole network structure
(Fig. 1A) and zoomed-in parts corresponding to dense communities with the most influential
persons from our university such as rector Kuzminov Y.I. (Fig. 1B), first vice-rector
responsible for science Gokhberg L.M. (Fig. 1C), and university research supervisor
Yasin E.G. (Fig. 1D). It is easy to see that the rector’s co-authors are people responsible
for the core directions of university development and vice-rectors realizing the university
strategy in research, education, government, expert, and company areas of collaboration.
On the other hand, from dense subgraphs one can find exact matches to university staff
departments, such as research institutes, for example the Institute for Industrial and Market
Studies headed by Yakovlev A.A. or the Institute for Statistical Studies and Economics of
Knowledge headed by Gokhberg L.M.

This shows that the network visualization can reveal the most important administrative
staff units and their heads, thus giving insight into the connection between publication
activity and the administrative structure of HSE university.

FEATURE ENGINEERING
Our new idea was to obtain additional edge attributes embedded based on the keyword
network as a part of model evaluation. We constructed the network of stemmed keyword
co-occurrence. To construct this network, we used the principle that two nodes are
connected if the corresponding keywords occur in the same paper. For a given list of
keywords, we built a standard node2vec embedding (Grover & Leskovec, 2016). Next, for
each author the most frequent and relevant keyword was defined, and its embedding was
used as an additional node feature vector for our LP tasks.
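
The construction rule for the keyword network can be sketched as follows. This is an
illustrative sketch only: the function names and the toy input format (one list of already
stemmed keywords per paper) are assumptions, the relevance criterion mentioned in the text is
not modeled (only frequency is), and the node2vec step is omitted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(papers):
    """Count how often each unordered keyword pair appears in the same paper.

    `papers` is an iterable of keyword lists, one list per paper (hypothetical format).
    """
    edges = Counter()
    for keywords in papers:
        unique = sorted(set(keywords))           # drop repeats within one paper
        for a, b in combinations(unique, 2):     # every unordered pair of keywords
            edges[(a, b)] += 1
    return edges


def top_keyword(author_papers):
    """Most frequent keyword over all papers of one author (frequency only)."""
    counts = Counter(k for keywords in author_papers for k in keywords)
    return counts.most_common(1)[0][0]
```

The resulting edge counts can serve as edge weights of the co-occurrence network before an
embedding method is applied; whether the original network was weighted is not stated in the
text.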

We considered the problem of finding authors with interests similar to those of a selected
one as a collaboration search problem. In terms of social network analysis, we studied the
problem of recommending a similar author as an LP problem. We operate with author similarity
and use the similarity scores described in Liben-Nowell & Kleinberg (2007) as a baseline for
network descriptors for pairs of nodes presented in Table 1.

Thus, we represented each node by a vector model of author attributes using manually
engineered features such as HSE staff information and publication activity represented by
centralities of the co-authorship network and descriptive statistics. We added graph
embeddings for author research interests and node proximity and evaluated different
combinations of models corresponding to the node feature space representation.

LINK EMBEDDINGS
To use node2vec, we obtained the vector node representations. The node2vec embedding
parameters were chosen via ROC–area under the curve (AUC) optimization over
embedding size with respect to different edge embedding operators. For edge embedding,
we applied specific component-wise functions that map the node embeddings of the
source and target nodes of a given edge to an edge representation. This model was suggested
in Grover & Leskovec (2016), in which four functions for such edge embeddings were presented
(see the first four rows in Table 2). We leave an evaluation of the approaches from
Abu-El-Haija, Perozzi & Al-Rfou (2017), which use bi-linear form learning on node embeddings
reduced by a deep neural network, for future work, while also working on a new model of joint
node-edge graph embedding, similar to Goyal et al. (2018). The model presented in this paper
suggests a simple generalization of the idea of pooling the first-order neighborhood
of nodes while constructing the edge embedding operator, which is much faster than
dimensionality reduction approaches.

We evaluated our model on two additional functions involving not only edge source and
target node representations, but also their neighborhood representations as average
over all the nodes in first-order proximity. These measures were first presented
in Makarov et al. (2018b) but were not properly evaluated. The resulting list of link
embeddings is presented in Table 2.

For each author, we also chose the most frequent keyword in the co-occurrence network
and then constructed general embedding using node2vec with automatically chosen
parameters.

Table 1 Similarity scores for a pair of nodes u and v with local neighborhoods N(u) and N(v),
respectively, and for vectors x and y corresponding to the research interests of two authors.

Similarity metric         Definition

Common neighbors          $|N(u) \cap N(v)|$
Jaccard coefficient       $\frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$
Adamic-Adar score         $\sum_{w \in N(u) \cap N(v)} \frac{1}{\ln |N(w)|}$
Preferential attachment   $|N(u)| \cdot |N(v)|$
Graph distance            Length of shortest path between u and v
Metric score              $\frac{1}{1 + \|x - y\|}$
Cosine score              $\frac{(x, y)}{\|x\| \, \|y\|}$
Pearson coefficient       $\frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{cov}(x, x)} \cdot \sqrt{\mathrm{cov}(y, y)}}$
Generalized Jaccard       $\frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$
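
The topological scores in the first five rows of Table 1 can be computed directly from
adjacency sets. The sketch below is illustrative rather than the authors' code; the
representation of the graph as a dict mapping each node to the set of its neighbors, and all
function names, are assumptions.

```python
from collections import deque
from math import log


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    # A common neighbor w of two distinct nodes touches both of them,
    # so |N(w)| >= 2 and ln|N(w)| > 0.
    return sum(1.0 / log(len(adj[w])) for w in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])


def graph_distance(adj, u, v):
    """Shortest-path length by breadth-first search; None if v is unreachable."""
    if u == v:
        return 0
    seen, queue = {u}, deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt == v:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

For example, in a graph where nodes a and b are not adjacent but share neighbors c and d,
the common-neighbors score is 2 and the graph distance is 2.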

Table 2 Binary operators for computing the vectorized (u, v)-edge representation f(u, v) based
on node attribute embeddings f(x); definitions are given for the ith component of f(u, v).

Symmetry operator      Definition

Average                $\frac{f_i(u) + f_i(v)}{2}$
Hadamard               $f_i(u) \cdot f_i(v)$
Weighted-L1            $|f_i(u) - f_i(v)|$
Weighted-L2            $(f_i(u) - f_i(v))^2$
Neighbor Weighted-L1   $\left| \frac{\sum_{w \in N(u) \cup \{u\}} f_i(w)}{|N(u)| + 1} - \frac{\sum_{t \in N(v) \cup \{v\}} f_i(t)}{|N(v)| + 1} \right|$
Neighbor Weighted-L2   $\left( \frac{\sum_{w \in N(u) \cup \{u\}} f_i(w)}{|N(u)| + 1} - \frac{\sum_{t \in N(v) \cup \{v\}} f_i(t)}{|N(v)| + 1} \right)^2$
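
The six operators of Table 2 can be written down in a few lines. The following is a minimal
sketch under stated assumptions: embeddings are stored in a dict `emb` from node to a list of
floats, neighborhoods in a dict `adj` from node to a set of neighbors without self-loops, and
all function names are illustrative rather than taken from the authors' code.

```python
def average(fu, fv):
    return [(a + b) / 2 for a, b in zip(fu, fv)]


def hadamard(fu, fv):
    return [a * b for a, b in zip(fu, fv)]


def weighted_l1(fu, fv):
    return [abs(a - b) for a, b in zip(fu, fv)]


def weighted_l2(fu, fv):
    return [(a - b) ** 2 for a, b in zip(fu, fv)]


def neighborhood_mean(emb, adj, u):
    """Component-wise mean of the embeddings over N(u) united with {u}."""
    nodes = set(adj[u]) | {u}                 # |N(u)| + 1 nodes when there is no self-loop
    dim = len(emb[u])
    return [sum(emb[w][i] for w in nodes) / len(nodes) for i in range(dim)]


def neighbor_weighted_l1(emb, adj, u, v):
    return weighted_l1(neighborhood_mean(emb, adj, u), neighborhood_mean(emb, adj, v))


def neighbor_weighted_l2(emb, adj, u, v):
    return weighted_l2(neighborhood_mean(emb, adj, u), neighborhood_mean(emb, adj, v))
```

The neighbor variants reuse the plain Weighted-L1 and Weighted-L2 operators on
neighborhood-averaged vectors, which makes explicit that they differ from the original
operators only by the first-order pooling step.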


Overall edge embedding contained several feature spaces described by the following List 1:

(1) Edge embedding based on node2vec node embeddings

(2) Edge embedding based on node2vec embedding for keywords co-occurrence network

(3) Network similarity scores (baselines in Liben-Nowell & Kleinberg (2007))

- Common neighbors

- Jaccard’s coefficient

- Adamic-Adar score

- Preferential attachment

- Graph distance

(4) Author similarity scores

- Cosine similarity

- Common neighbors

- Jaccard’s generalized coefficient

- Pearson’s correlation coefficient

- Metric score
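
The vector-based author similarity scores from Table 1 can be sketched as follows. This is an
illustration under assumptions, not the authors' implementation: the norm in the metric score
is taken to be Euclidean (the table does not specify it), the generalized Jaccard score
assumes non-negative components, and the function names are hypothetical.

```python
from math import sqrt


def metric_score(x, y):
    """1 / (1 + ||x - y||), with the Euclidean norm assumed."""
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)


def cosine_score(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def _cov(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def pearson(x, y):
    denom = sqrt(_cov(x, x)) * sqrt(_cov(y, y))
    return _cov(x, y) / denom if denom else 0.0


def generalized_jaccard(x, y):
    """Sum of component-wise minima over sum of maxima (non-negative inputs assumed)."""
    top = sum(min(a, b) for a, b in zip(x, y))
    bottom = sum(max(a, b) for a, b in zip(x, y))
    return top / bottom if bottom else 0.0
```

For two proportional vectors such as (1, 2, 3) and (2, 4, 6), the cosine and Pearson scores
both equal 1, while the generalized Jaccard score is 0.5, so the scores capture different
notions of similarity and are used jointly as features.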

TRAINING MODEL
We consider machine learning models for the binary classification task of predicting
whether a given pair of nodes will be connected by a link, based on the previous or current
co-authorship network. We compare Logistic Regression with Lasso regularization,
Random Forest, Extreme Gradient Boosting (XGBoost), and Support Vector Machine
(SVM) models.

We use the most common machine learning frameworks, which we briefly describe below.
Logistic Lasso Regression is a linear model with a logit target function; in addition,
Lasso regularization sets some coefficients to zero, effectively choosing a simpler model
with a reduced number of coefficients. Random Forest and Gradient Boosting are ensemble
learning methods. The former operates by constructing a multitude of decision trees; the
latter combines several weak models, building the model in a stage-wise fashion and
generalizing them by allowing optimization of an arbitrary differentiable loss function.
Random Forest adds additional randomness to the model while growing the trees. Instead of
searching for the most important feature while splitting a node, it searches for the best
feature among a random subset of features. SVM constructs a separating hyperplane; a good
classification is achieved by the hyperplane that has the largest distance to the nearest
training-data point of any class.

We use standard classification performance metrics for evaluating quality such as
Precision, Accuracy, F1-score (micro, macro), Log-loss, ROC–AUC. Next, we shortly
define them.

Precision is a measure that tells us what proportion of the publications that we predicted
as existing actually existed. Accuracy in classification problems is the number of correct

predictions made by the model over all predictions made. Recall is a measure
that tells us what proportion of the publications that actually existed was predicted by the
algorithm as existing. To balance these metrics, the F1-score is calculated as the harmonic
mean of Precision and Recall. Micro-averaged metrics, which are calculated globally by
counting all true and false predictions, are usually more useful if the class distribution is
uneven; macro-averaged metrics, which are calculated for each label and then averaged, are
used when we want to evaluate system performance across different classes. Log-loss, or
logarithmic loss, is a “soft” measurement of accuracy that incorporates the idea of
probabilistic confidence. In binary classification, Log-loss can be calculated as

$$-\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right),$$

where $y_i$ is a binary indicator (0 or 1) of whether observation i belongs to the given class,
and $p_i$ is the predicted probability that observation i belongs to that class. Log-loss
measures the unpredictability of the “extra noise” that comes from using a predictor as
opposed to the true labels. ROC–AUC, the area under the receiver operating characteristic
curve, is equal to the probability that a classifier will rank a randomly chosen existing
edge higher than a randomly chosen non-existing one. ROC curves typically feature the true
positive rate (the rate of correctly predicted existing publications) on the Y axis and the
false positive rate (the rate of falsely predicted existing publications) on the X axis.
A larger AUC is usually better. All metrics were averaged using fivefold
cross validation with five negative sampling trials for a fixed train set, which
we describe below.
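
To make the metric definitions concrete, the following sketch computes them from scratch for
a binary task. It is illustrative only: the function names, the clipping constant `eps` and
the pairwise formulation of ROC–AUC (quadratic in the number of samples, with ties counted as
one half) are choices made here for clarity and are not claimed to match the evaluation code
used for the reported results.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 1)
    fp = sum(1 for y, z in zip(y_true, y_pred) if y == 0 and z == 1)
    fn = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With labels (1, 1, 0, 0) and scores (0.9, 0.4, 0.6, 0.1), three of the four
positive-negative pairs are ranked correctly, which gives a ROC–AUC of 0.75.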

In our LP problem for the co-authorship network, we have two possible formalizations of
predicting links. We consider either the temporal network structure, using information from
previous years to predict links corresponding to the current year, or the whole network
to predict missing links. For the first task, we use the combined HSE+Scopus dataset of 2015
publications and learn to predict papers appearing in 2016. We test our model by shifting
the years one year ahead and evaluate our predictions for 2017 based on publications
until 2016. For the second task, we remove 50% of the existing edges from our dataset while
preserving connectivity and add negative samples of the same size as the
number of edges left in order to balance classes for the classification problem.
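
One possible way to realize the second setup is sketched below. The paper does not describe
how connectivity was preserved, so the approach here (protecting a random spanning forest
built with union-find and removing only the remaining edges) is an assumption, as are the
function names and the uniform sampling of non-adjacent node pairs as negative examples.

```python
import random


def split_edges(nodes, edges, remove_frac=0.5, seed=0):
    """Remove up to remove_frac of the edges while keeping a spanning forest intact."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree, extra = [], []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components: protect it
            parent[ru] = rv
            tree.append((u, v))
        else:                               # edge closes a cycle: safe to remove
            extra.append((u, v))
    n_remove = min(int(remove_frac * len(edges)), len(extra))
    return tree + extra[n_remove:], extra[:n_remove]


def sample_negatives(nodes, edges, k, seed=0):
    """Sample k distinct node pairs that are not connected in the original graph."""
    rng = random.Random(seed)
    nodes = list(nodes)
    existing = {frozenset(e) for e in edges}
    if k > len(nodes) * (len(nodes) - 1) // 2 - len(existing):
        raise ValueError("not enough non-adjacent pairs to sample from")
    negatives = set()
    while len(negatives) < k:
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in existing:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]
```

In sparse graphs fewer than the requested share of edges may be removable without
disconnecting the network, which is why `split_edges` caps the number of removed edges at the
number of cycle-closing edges.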

In Table 3, for the future links prediction task, we compare the chosen predictive models,
fixing the Neighbor Weighted-L2 link operator to construct edge embeddings considered
as model features. It is interesting to see that the XGBoost model is significantly
overfitted, while the best model appears to be the Support Vector Machine. In Table 3 and in
further tables we highlight the best values of quality metrics in bold.

Table 3 Comparing machine learning models based on the Neighbor Weighted-L2 link embedding applied to future links prediction on the
Scopus dataset.

Model                 Precision  Accuracy  F1-score (macro)  F1-score (micro)  Log-loss  ROC–AUC

Logistic regression   0.382      0.372     0.483             0.471             2.089     0.534
Random forest         0.622      0.671     0.652             0.641             0.893     0.725
Gradient boosting     0.487      0.521     0.527             0.582             1.275     0.631
SVM                   0.712      0.784     0.761             0.754             0.634     0.816


In what follows, we aim to compare several link embedding operators (see Table 2) for the
best machine learning model. Evaluating our approach for the first task on the Scopus
dataset, we can see in Table 4 that the new link embedding suggested by the authors
outperforms existing approaches on all the binary classification quality metrics.

As for the second task, we evaluate the LP task over the HSE and Scopus datasets in terms of
predictive models and link embeddings. In Tables 5 and 6, we can see that the SVM
model outperforms the other models on the HSE dataset, but Random Forest gets the best
AUC for the Scopus dataset, being competitive with the XGBoost and SVM models
(SVM performs slightly better). This could happen due to sparse data in Scopus
publications after removing 50% of the link information and the lack of ensemble methods for
choosing proper negative sampling for such a sparse dataset.

In Tables 7 and 8, we can see that the suggested local proximity operator for link
embedding, which we call the Neighbor Weighted-L2 link embedding, outperforms all the other
approaches for embedding edges based on node vector representations.

To choose the best predictive model and edge embedding operator, we consider several
feature space combinations based on List 1. The results of their comparison are shown
in Table 9 for the first task and in Table 10 for the second task of LP (the original
results appear in Makarov et al. (2018b)).

Table 4 Comparing link embeddings for future links prediction on the Scopus dataset.

Link embedding         Precision  Accuracy  F1-score (macro)  F1-score (micro)  Log-loss  ROC–AUC

Average                0.436      0.578     0.661             0.589             1.341     0.599
Hadamard               0.613      0.654     0.657             0.628             1.125     0.634
Weighted-L1            0.645      0.678     0.674             0.632             0.979     0.723
Weighted-L2            0.672      0.682     0.688             0.637             0.915     0.742
Neighbor Weighted-L1   0.644      0.692     0.701             0.696             0.832     0.783
Neighbor Weighted-L2   0.712      0.784     0.761             0.754             0.634     0.816

Table 5 Comparing machine learning models based on the Neighbor Weighted-L2 link embedding for link prediction problem on the HSE dataset.

| Model | Precision | Accuracy | F1-score (macro) | F1-score (micro) | Log-loss | ROC–AUC |
|---|---|---|---|---|---|---|
| Logistic regression | 0.421 | 0.462 | 0.502 | 0.472 | 1.873 | 0.583 |
| Random forest | 0.836 | 0.871 | 0.774 | 0.831 | 0.193 | 0.888 |
| Gradient boosting | 0.771 | 0.732 | 0.703 | 0.734 | 0.742 | 0.663 |
| SVM | 0.823 | 0.845 | 0.782 | 0.812 | 0.273 | 0.828 |

Table 6 Comparing machine learning models based on the Neighbor Weighted-L2 link embedding for link prediction problem on the Scopus dataset.

| Model | Precision | Accuracy | F1-score (macro) | F1-score (micro) | Log-loss | ROC–AUC |
|---|---|---|---|---|---|---|
| Logistic regression | 0.482 | 0.491 | 0.522 | 0.563 | 1.452 | 0.613 |
| Random forest | 0.812 | 0.844 | 0.745 | 0.811 | 0.176 | 0.876 |
| Gradient boosting | 0.852 | 0.821 | 0.733 | 0.806 | 0.337 | 0.815 |
| SVM | 0.834 | 0.837 | 0.701 | 0.725 | 0.302 | 0.818 |



Both the embedding of author research interests and the author embedding itself play a
significant role in improving prediction quality for both tasks.
When considering only the structural embedding or only node similarity features, we obtained
worse results in terms of all binary classification quality metrics. In both tasks, the
combined approach with direct node similarity scores does not improve the quality of
prediction: it overfits the model to particular properties of the network and thus affects
the predictions for a network with missing links. Makarov et al. (2018a) evaluate their
recommender system with research interests based only on subject categories from the
respective journal index in Scopus. This leads to worse LP results for authors with a small
number of research publications, so they succeeded only in predicting so-called strong
connections for authors who had written at least three to five papers in co-authorship.
Our approach allows us to work with an arbitrary research interest representation, thus
making it possible to use the recommender system for novice researchers with few or no
connections in the network.

Table 7 Comparing link embeddings for the link prediction problem on the HSE dataset.

| Link embedding | Precision | Accuracy | F1-score (macro) | F1-score (micro) | Log-loss | ROC–AUC |
|---|---|---|---|---|---|---|
| Average | 0.628 | 0.611 | 0.693 | 0.738 | 1.092 | 0.641 |
| Hadamard | 0.721 | 0.728 | 0.673 | 0.687 | 0.703 | 0.771 |
| Weighted-L1 | 0.776 | 0.765 | 0.727 | 0.759 | 0.376 | 0.817 |
| Weighted-L2 | 0.786 | 0.782 | 0.751 | 0.782 | 0.355 | 0.827 |
| Neighbor Weighted-L1 | 0.816 | 0.834 | 0.762 | 0.801 | 0.214 | 0.839 |
| Neighbor Weighted-L2 | 0.836 | 0.871 | 0.774 | 0.831 | 0.193 | 0.888 |

Table 8 Comparing link embeddings for the link prediction problem on the Scopus dataset.

| Link embedding | Precision | Accuracy | F1-score (macro) | F1-score (micro) | Log-loss | ROC–AUC |
|---|---|---|---|---|---|---|
| Average | 0.588 | 0.531 | 0.563 | 0.569 | 1.321 | 0.553 |
| Hadamard | 0.672 | 0.637 | 0.611 | 0.632 | 0.784 | 0.626 |
| Weighted-L1 | 0.746 | 0.653 | 0.675 | 0.694 | 0.693 | 0.668 |
| Weighted-L2 | 0.786 | 0.771 | 0.705 | 0.707 | 0.597 | 0.774 |
| Neighbor Weighted-L1 | 0.794 | 0.821 | 0.732 | 0.781 | 0.337 | 0.832 |
| Neighbor Weighted-L2 | 0.812 | 0.844 | 0.745 | 0.811 | 0.176 | 0.876 |

Table 9 Prediction of publications for 2017 based on information from 2016 and earlier.

| Feature spaces | Precision | Accuracy | F1-score (macro) | F1-score (micro) | Log-loss | ROC–AUC |
|---|---|---|---|---|---|---|
| (1) | 0.712 | 0.784 | 0.761 | 0.754 | 0.634 | 0.816 |
| (3) | 0.652 | 0.745 | 0.731 | 0.719 | 0.682 | 0.767 |
| (1)+(2) | 0.735 | 0.822 | 0.775 | 0.759 | 0.593 | 0.836 |
| (1)+(3) | 0.728 | 0.796 | 0.781 | 0.789 | 0.618 | 0.827 |
| (1)+(4) | 0.722 | 0.791 | 0.772 | 0.751 | 0.611 | 0.819 |
| (3)+(4) | 0.683 | 0.762 | 0.782 | 0.781 | 0.671 | 0.762 |
| (1)+(2)+(3) | 0.742 | 0.863 | 0.787 | 0.751 | 0.582 | 0.855 |
| (1)+(2)+(4) | 0.738 | 0.852 | 0.791 | 0.731 | 0.604 | 0.842 |
| (1)+(2)+(3)+(4) | 0.798 | 0.866 | 0.793 | 0.786 | 0.573 | 0.878 |

EXPERIMENTS FOR A LARGE NETWORK
To evaluate the scalability of our approach, we considered the LP task on a large network called
AMiner (Tang et al., 2008). This network describes collaborations among authors and
contains 1,560,640 nodes and 4,258,615 edges.

To construct node embeddings, we used node2vec with model parameters
(p, q) = (1, 1), embedding dimension d = 128, walk length l = 60, and
number of walks per node n = 3. We decreased the values of l and n relative to
the defaults because the process was terminated on our machine due to memory issues.
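
The walk-generation stage that these parameters control can be sketched as follows. This is a minimal pure-Python illustration of the second-order biased walk of Grover & Leskovec (2016), not the implementation used in our experiments; the adjacency-dictionary input and the function names are assumptions, and the graph is treated as unweighted and undirected.

```python
import random

def node2vec_walk(adjacency, start, length, p=1.0, q=1.0, rng=random):
    """Generate one biased walk containing at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adjacency.get(cur, ()))
        if not neighbors:
            break  # dead end: isolated node
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adjacency.get(prev, ()):
                weights.append(1.0)      # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move to distance 2 from the previous node
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

def generate_walks(adjacency, num_walks=3, walk_length=60, p=1.0, q=1.0, seed=0):
    """Start `num_walks` walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(adjacency, node, walk_length, p, q, rng))
    return walks
```

The resulting walks are then treated as sentences for a skip-gram model that learns the d-dimensional node vectors. The number of stored walk steps grows as n · l · |V|, which is why lowering l and n reduces memory consumption on a graph of this size.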

We studied the impact of the train/test split on different edge embedding operators while
fixing the Logistic Regression model for LP. We considered training sets consisting of 20%,
40%, 60% and 80% of the graph edges, averaging the binary classification quality metrics over
five negative samplings that provide negative examples for non-existent edges. We compared the
Log-loss (Figs. 2A and 2B) and Accuracy (Figs. 2C and 2D) metrics computed on the train
and test sets for the different edge embeddings. We found that the Hadamard and
Neighbor Weighted-L2 edge embedding operators produce highly accurate results
when trained on sparse data from the original graph. Moreover, as the size of the training
data increases (which is the case for our temporal co-authorship network), the Hadamard product
becomes inferior to our Neighbor Weighted-L2 operator. In addition, the Neighbor
Weighted-L1 operator performed better here than on the HSE dataset.
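
The evaluation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure we ran: the helper names are hypothetical, negatives are drawn uniformly from non-adjacent node pairs in the same number as the observed edges, and `evaluate` stands for any user-supplied callback that trains a classifier and returns one metric value.

```python
import random

def split_edges(edges, train_fraction, seed=0):
    """Shuffle the observed edges and split them into train and test positives."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def sample_negative_edges(nodes, edges, count, seed=0):
    """Draw `count` distinct node pairs that are not edges of the graph."""
    rng = random.Random(seed)
    nodes = list(nodes)
    existing = {frozenset(e) for e in edges}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if count > max_negatives:
        raise ValueError("not enough non-edges to sample from")
    seen, negatives = set(), []
    while len(negatives) < count:
        u, v = rng.sample(nodes, 2)
        pair = frozenset((u, v))
        if pair not in existing and pair not in seen:
            seen.add(pair)
            negatives.append((u, v))
    return negatives

def averaged_metric(evaluate, nodes, edges, train_fraction, repeats=5):
    """Average a metric over several independent negative samplings."""
    train_pos, test_pos = split_edges(edges, train_fraction)
    scores = []
    for r in range(repeats):
        negatives = sample_negative_edges(nodes, edges, len(edges), seed=r)
        train_neg = negatives[:len(train_pos)]
        test_neg = negatives[len(train_pos):]
        scores.append(evaluate(train_pos, train_neg, test_pos, test_neg))
    return sum(scores) / len(scores)
```

Rejection sampling is adequate here because a sparse graph has far more non-edges than edges, so almost every random pair is accepted.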

Interestingly, the node2vec node embedding model overall produces very precise results
for the LP task, although these may vary depending on the graph, as
was shown in Grover & Leskovec (2016).

DISCUSSION
We have found that the combined approach of embedding the co-authorship and keywords
co-occurrence networks while preserving several author attributes leads to a significant
improvement in the classification quality metrics for the future link prediction and LP tasks.
However, we will continue to experiment with the node2vec model. The (1) feature space
from List 1 is the node2vec model with d = 128 (compared also on d = 128) and parameters
(p, q) = (1, 0.25), which we obtained via model fitting for a given d with a logistic regression
baseline model. The small q value shows that considering local proximity was more
important than higher-order proximities.

Table 10 Link prediction for 2017 on the Scopus dataset.

| Feature spaces | Precision | Accuracy | F1-score (macro) | F1-score (micro) | Log-loss | ROC–AUC |
|---|---|---|---|---|---|---|
| (1) | 0.767 | 0.831 | 0.703 | 0.783 | 0.236 | 0.859 |
| (3) | 0.731 | 0.822 | 0.698 | 0.778 | 0.244 | 0.813 |
| (1)+(2) | 0.811 | 0.843 | 0.726 | 0.805 | 0.216 | 0.864 |
| (1)+(3) | 0.763 | 0.821 | 0.703 | 0.775 | 0.231 | 0.839 |
| (1)+(4) | 0.772 | 0.833 | 0.715 | 0.794 | 0.223 | 0.861 |
| (3)+(4) | 0.747 | 0.824 | 0.719 | 0.791 | 0.241 | 0.825 |
| (1)+(2)+(3) | 0.808 | 0.835 | 0.724 | 0.807 | 0.193 | 0.866 |
| (1)+(2)+(4) | 0.821 | 0.842 | 0.733 | 0.813 | 0.203 | 0.872 |
| (1)+(2)+(3)+(4) | 0.812 | 0.844 | 0.745 | 0.811 | 0.176 | 0.876 |

However, we aim to study this question further, taking into account modern methods based on
graph auto-encoders (Kipf & Welling, 2016). We are also working on a new embedding
model called JONNEE, which learns high-quality node representations in tandem with edge
representations. We showed that the embeddings learned by JONNEE are almost always superior
to those learned by state-of-the-art models of all different types (matrix factorization
based, sequence based and deep learning based). However, the model has a drawback:
its training is long, similar to matrix factorization, but less parallelizable, which forces
a choice between the quality and the processing speed of the suggested solution for
edge embedding construction.

Figure 2 Low log-loss and high accuracy (plotted on a log-scale Y-axis) indicate the best edge
embedding. We show that the Neighbor Weighted-L1 and Neighbor Weighted-L2 operators are on par
with the state-of-the-art Hadamard product presented in Grover & Leskovec (2016) and outperform it
as the training data size increases, based on the Log-loss (A and B) and Accuracy (C and D) metrics
computed for the train and test sets. Full-size DOI: 10.7717/peerj-cs.172/fig-2



While the LP task remains a hard problem in network analysis, its application to
matching collaborators based on structural, attribute and content information shows
promising results for the applicability of a graph-based recommender system that predicts
links in the co-authorship network and incorporates author research interests into
collaborative patterns. We aim to generalize the model by extracting research interests
from the full text of collections of source documents and by studying deep learning
solutions for representing a combined embedding of structural and content information for
co-authorship networks.

The code for computing all the models, including classification evaluation,
the choice of a proper edge embedding operator and the tuning of node embedding
hyper-parameters, will be uploaded to GitHub (http://github.com/makarovia/jcdl2018/)
together with the HSE and Scopus datasets.

CONCLUSION
We have improved the recommender systems of Makarov, Bulanov & Zhukov (2017) and Makarov
et al. (2018a) by choosing a proper link embedding operator (Makarov et al., 2018b)
and by including research interest information, represented as an embedding of nodes in a
keywords co-occurrence network that connects the keywords related to a given research article.
We have compared several machine learning models for the future and missing LP problems,
interpreted as binary classification problems. The edge embedding operator suggested
by Makarov et al. (2018b), called Neighbor Weighted-L2
(see Table 2), outperforms all the other edge embedding functions because it involves the
neighborhood of the edge in the graph; it was properly evaluated in this paper for both
tasks. Among the machine learning models, SVM outperforms all the others except on the
LP problem for the sparse Scopus dataset, while XGBoost was significantly overfitted;
however, training an SVM on large graphs is computationally hard.

The resulting model may be considered a recommender system for searching for
collaborators based on mutual research interests and publishing patterns.
The recommender system demonstrates good results in predicting new collaborations
between existing authors even if they have little data in the co-authorship network,
owing to the availability of their research interests.

We are looking forward to evaluating our system at universities, which have to
deal with the problems of finding an expert based on a text for evaluation, matchmaking
novice researchers for co-authored research papers, and searching for collaborators
on a specific grant proposal or for suitable scientific advisers.

Focusing only on the machine learning task is not suitable for a real-world application
involving social interactions, so we aim to implement a framework with the possibility of
manually adding positive and negative preferences for collaboration recommendations, thus
providing a useful service that could be integrated into the university business process of
managing researchers’ publication activity.

Our system may also be used for predicting the number of publications corresponding
to a given administrative staff unit using network collaborative patterns, and is thus able
to evaluate the efficiency of the authors or of the whole staff. It may also be used for
suggesting collaborations between separate staff units by considering a combined network
with staff units as vertices and mutual connections between them weighted by the number of
publications. Evaluating such a network for HSE University shows that
the most popular faculties, Economics and Management, have many mutual connections
because many researchers work at these faculties, but restricting the connections
to Scopus publications reveals the influence of the Computer Science and Engineering
faculties, which reflects a trend toward computer science research in applied sciences. We leave
for future work the study of the applicability of our system for suggesting a
new university publication strategy based on collaboration patterns, and we invite
researchers to compare existing solutions on the HSE researchers’ dataset.
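
The staff-unit network described above can be obtained by aggregating papers to the unit level. The following is a minimal sketch, assuming each paper is given as a list of author identifiers and that a hypothetical `author_unit` mapping assigns every author to a single staff unit; the weight of an edge counts the papers that involve both units.

```python
from collections import defaultdict
from itertools import combinations

def unit_collaboration_graph(papers, author_unit):
    """Return {(unit_a, unit_b): number of papers co-authored across the two units}."""
    weights = defaultdict(int)
    for authors in papers:
        # Each unit is counted once per paper; authors without a known unit are skipped.
        units = sorted({author_unit[a] for a in authors if a in author_unit})
        for u, v in combinations(units, 2):
            weights[(u, v)] += 1
    return dict(weights)
```

Papers written entirely within one unit contribute no edge, so the resulting graph reflects only collaboration between units.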

ACKNOWLEDGEMENTS
Part of this article is an extended and revised version of an oral talk presented at the
international conference AIST’18 and of posters presented at ACM/IEEE JCDL’18, WebSci’18 and
SunBelt’18. We thank all the colleagues from NRU HSE who participated in the discussion of
this research.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The work was supported by the Russian Science Foundation under grant 17-11-01294
and performed at the National Research University Higher School of Economics, Russia.
The funders had no role in study design, data collection and analysis, decision to
publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Russian Science Foundation: 17-11-01294.
National Research University Higher School of Economics, Russia.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
• Ilya Makarov conceived and designed the experiments, analyzed the data, prepared
figures and/or tables, performed the computation work, authored or reviewed drafts of
the paper, approved the final draft.

• Olga Gerasimova performed the experiments, analyzed the data, prepared figures and/or
tables, performed the computation work, approved the final draft.

• Pavel Sulimov performed the experiments, analyzed the data, prepared figures and/or
tables, performed the computation work, approved the final draft.

• Leonid E. Zhukov conceived and designed the experiments, analyzed the data,
performed the computation work, approved the final draft.



Data Availability
The following information was supplied regarding data availability:

GitHub: https://github.com/MakarovIA/jcdl2018.

REFERENCES
Abu-El-Haija S, Perozzi B, Al-Rfou R. 2017. Learning edge representations via low-rank
asymmetric projections. In: Proceedings of the 2017 ACM on Conference on Information and
Knowledge Management. New York: ACM, 1787–1796.

Adafre SF, De Rijke M. 2005. Discovering missing links in Wikipedia. In: Proceedings of the
3rd International Workshop on Link Discovery, LinkKDD ’05. New York: ACM, 90–97.

Backstrom L, Leskovec J. 2011. Supervised random walks: predicting and recommending links in
social networks. In: Proceedings of the Fourth ACM International Conference on Web Search
and Data Mining, WSDM ’11. New York: ACM, 635–644.

Barabási A-L, Pósfai M. 2016. Network science. Cambridge: Cambridge University Press.

Belkin M, Niyogi P. 2002. Laplacian eigenmaps and spectral techniques for embedding and
clustering. Advances in Neural Information Processing Systems. Cambridge: MIT Press, 585–591.

Cai H, Zheng VW, Chang K. 2018. A comprehensive survey of graph embedding: problems,
techniques and applications. IEEE Transactions on Knowledge and Data Engineering
30(9):1616–1637 DOI 10.1109/tkde.2018.2807452.

Cao S, Lu W, Xu Q. 2015. Grarep: learning graph representations with global structural
information. In: Proceedings of the 24th ACM International on Conference on Information
and Knowledge Management, CIKM ’15, New York: ACM, 891–900.

Carstens BT, Jensen MR, Spaniel MF, Hermansen A. 2017. Vertex similarity in graphs
using feature learning. Available at https://projekter.aau.dk/projekter/files/259997796/
mi109f17___Vertex_Similarity.pdf (accessed 7 June 2017).

Cetorelli N, Peristiani S. 2013. Prestigious stock exchanges: a network analysis of
international financial centers. Journal of Banking & Finance 37(5):1543–1551
DOI 10.1016/j.jbankfin.2012.06.011.

Chang S, Han W, Tang J, Qi G-J, Aggarwal CC, Huang TS. 2015. Heterogeneous
network embedding via deep architectures. In: Proceedings of the 21th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York: ACM,
119–128.

Chen H, Li X, Huang Z. 2005. Link prediction approach to collaborative filtering. In: Proceedings of
the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’05), New York, NY, USA,
141–142.

Chen H, Perozzi B, Al-Rfou R, Skiena S. 2018. A tutorial on network embeddings. arXiv preprint
arXiv:1808.02590.

Cho H, Yu Y. 2018. Link prediction for interdisciplinary collaboration via co-authorship network.
Social Network Analysis and Mining 8(1):25 DOI 10.1007/s13278-018-0501-6.

Cui P, Wang X, Pei J, Zhu W. 2018. A survey on network embedding. IEEE Transactions on
Knowledge and Data Engineering, 21 pages. Available at https://ieeexplore.ieee.org/abstract/
document/8392745.

Elsevier. 2018. Scopus. Available at http://www.scopus.com/ (accessed 9 January 2018).

Gao F, Musial K, Cooper C, Tsoka S. 2015. Link prediction methods and their accuracy for
different social networks and network metrics. Scientific Programming 2015:1–13
DOI 10.1155/2015/172879.



Gao S, Denoyer L, Gallinari P. 2011. Temporal link prediction by integrating content and
structure information. In: Proceedings of the 20th ACM International Conference on Information
and Knowledge Management, CIKM ’11, New York: ACM, 1169–1174.

Goyal P, Ferrara E. 2018. Graph embedding techniques, applications, and performance: a survey.
Knowledge-Based Systems 151:78–94 DOI 10.1016/j.knosys.2018.03.022.

Goyal P, Hosseinmardi H, Ferrara E, Galstyan A. 2018. Capturing edge attributes via
network embedding. arXiv preprint arXiv:1805.03280.

Grover A, Leskovec J. 2016. Node2vec: scalable feature learning for networks. In: Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’16, New York: ACM, 855–864.

Hasan MA, Zaki MJ. 2011. A survey of link prediction in social networks. Boston: Springer, 243–275.

He Q, Pei J, Kifer D, Mitra P, Giles L. 2010. Context-aware citation recommendation. In: Proceedings
of the 19th International Conference on World Wide Web, WWW ’10, New York: ACM, 421–430.

Huang X, Li J, Hu X. 2017. Label informed attributed network embedding. In: Proceedings of
the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17,
New York: ACM, 731–739.

Kipf TN, Welling M. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kong X, Mao M, Wang W, Liu J, Xu B. 2018. Voprec: vector representation learning of papers
with text information and structural identity for recommendation. Epub ahead of print 26 April
2018. IEEE Transactions on Emerging Topics in Computing DOI 10.1109/tetc.2018.2830698.

Kossinets G, Watts DJ. 2009. Origins of homophily in an evolving social network.
American Journal of Sociology 115(2):405–450 DOI 10.1086/599247.

Le Q, Mikolov T. 2014. Distributed representations of sentences and documents. In: Proceedings
of the 31st International Conference on Machine Learning (ICML-14), Cambridge: MIT Press,
1188–1196.

Li X, Chen H. 2009. Recommendation as link prediction: a graph kernel-based machine learning
approach. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries,
JCDL ’09. New York: ACM, 213–216.

Liang Y, Li Q, Qian T. 2011. Finding relevant papers based on citation relations. In: International
Conference on Web-Age Information Management, Berlin: Springer, 403–414.

Liao L, He X, Zhang H, Chua T-S. 2017. Attributed social network embedding. arXiv preprint
arXiv:1705.04969.

Liben-Nowell D, Kleinberg J. 2007. The link-prediction problem for social networks.
Journal of the Association for Information Science and Technology 58(7):1019–1031.

Liu Y, Kou Z. 2007. Predicting who rated what in large-scale datasets. ACM SIGKDD
Explorations Newsletter 9(2):62–65 DOI 10.1145/1345448.1345462.

Lü L, Zhou T. 2011. Link prediction in complex networks: a survey. Physica A: Statistical
Mechanics and its Applications 390(6):1150–1170.

Makarov I, Bulanov O, Gerasimova O, Meshcheryakova N, Karpov I, Zhukov LE. 2018a.
Scientific matchmaker: collaborator recommender system. In: Analysis of Images,
Social Networks and Texts, Cham: Springer International Publishing, 404–410.

Makarov I, Bulanov O, Zhukov L. 2017. Co-author recommender system. In: Springer
Proceedings in Mathematics and Statistics, Berlin: Springer, 1–6.

Makarov I, Gerasimova O, Sulimov P, Korovina K, Zhukov L. 2019a. Joint node-edge network
embedding for link prediction. In: Springer Proceedings in Mathematics and Statistics (to appear),
Berlin: Springer, 1–12.



Makarov I, Gerasimova O, Sulimov P, Zhukov L. 2019b. Co-authorship network embedding and
recommending collaborators via network embedding. In: Springer Proceedings in Mathematics
and Statistics (to appear), Berlin: Springer, 1–6.

Makarov I, Gerasimova O, Sulimov P, Zhukov LE. 2018b. Recommending co-authorship via
network embeddings and feature engineering: the case of national research university higher
school of economics. In: Proceedings of the 18th ACM/IEEE on Joint Conference on
Digital Libraries. New York: ACM, 365–366.

McPherson M, Smith-Lovin L, Cook JM. 2001. Birds of a feather: homophily in social networks.
Annual Review of Sociology 27(1):415–444 DOI 10.1146/annurev.soc.27.1.415.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. 2013. Distributed representations of words and
phrases and their compositionality. In: Advances in Neural Information Processing Systems,
3111–3119. Available at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-
phrases-and-their-compositionality.pdf.

Morel CM, Serruya SJ, Penna GO, Guimarães R. 2009. Co-authorship network analysis:
a powerful tool for strategic planning of research, development and capacity building
programs on neglected diseases. PLOS Neglected Tropical Diseases 3(8):e501
DOI 10.1371/journal.pntd.0000501.

National Research University Higher School of Economics. 2017. Publications of HSE. Available
at http://publications.hse.ru/en (accessed 9 May 2017).

Newman MEJ. 2004a. Coauthorship networks and patterns of scientific collaboration. Proceedings
of the National Academy of Sciences of the United States of America 101(suppl 1):5200–5205
DOI 10.1073/pnas.0307545100.

Newman MEJ. 2004b. Who is the best connected scientist? A study of scientific coauthorship
networks. Complex Networks 1:337–370.

Pan S, Wu J, Zhu X, Zhang C, Wang Y. 2016. Tri-party deep network representation. Network
11(9):12.

Perozzi B, Al-Rfou R, Skiena S. 2014. Deepwalk: online learning of social representations.
In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD ’14, New York: ACM, 701–710.

Robins G, Snijders T, Wang P, Handcock M, Pattison P. 2007. Recent developments in
exponential random graph (p*) models for social networks. Social Networks 29(2):192–215
DOI 10.1016/j.socnet.2006.08.003.

Roweis ST, Saul LK. 2000. Nonlinear dimensionality reduction by locally linear embedding.
Science 290(5500):2323–2326 DOI 10.1126/science.290.5500.2323.

Scott J. 2017. Social network analysis. Thousand Oaks: Sage.

Srinivas V, Mitra P. 2016. Applications of link prediction. Cham: Springer International
Publishing, 57–61.

Tang J, Liu H. 2012. Unsupervised feature selection for linked social media data. In: Proceedings of
the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’12. New York: ACM, 904–912.

Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. 2015. Line: large-scale information network
embedding. In: Proceedings of the 24th International Conference on World Wide Web,
WWW ’15, Republic and Canton of Geneva, Switzerland: International World Wide Web
Conferences Steering Committee, 1067–1077.

Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z. 2008. Arnetminer: extraction and mining of
academic social networks. In: KDD’08, New York, NY, USA, 990–998.



Tang L, Liu H. 2011. Leveraging social media networks for classification. Data Mining and
Knowledge Discovery 23(3):447–478 DOI 10.1007/s10618-010-0210-x.

Tenenbaum JB, De Silva V, Langford JC. 2000. A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500):2319–2323 DOI 10.1126/science.290.5500.2319.

Velden T, Lagoze C. 2009. Patterns of collaboration in co-authorship networks in
chemistry-mesoscopic analysis and interpretation. In: 12th International Conference on
Scientometrics and Informetrics, Rio de Janeiro: ISSI Society, 1–12.

Vorontsov K, Frei O, Apishev M, Romov P, Dudarenko M. 2016. Bigartm. v0.8.2. Available at
https://doi.org/10.5281/zenodo.288960.

Wang D, Cui P, Zhu W. 2016. Structural deep network embedding. In: Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16.
New York: ACM, 1225–1234.

Wang P, Xu B, Wu Y, Zhou X. 2015. Link prediction in social networks: the state-of-the-art.
Science China Information Sciences 58(1):1–38 DOI 10.1007/s11432-014-5237-y.

Wasserman S, Faust K. 1994. Social network analysis: methods and applications. Vol. 8.
Cambridge: Cambridge University Press.

Wu H, Lerman K. 2017. Network vector: distributed representations of networks with global
context. arXiv preprint arXiv:1709.02448.

Yan E, Ding Y. 2009. Applying centrality measures to impact analysis: a coauthorship network
analysis. Journal of the American Society for Information Science and Technology
60(10):2107–2118 DOI 10.1002/asi.21128.

Yan S, Xu D, Zhang B, Zhang HJ, Yang Q, Lin S. 2007. Graph embedding and extensions: a
general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and
Machine Intelligence 29(1):40–51 DOI 10.1109/tpami.2007.250598.

Yang C, Liu Z, Zhao D, Sun M, Chang EY. 2015. Network representation learning with rich text
information. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence,
Palo Alto, CA, USA, 2111–2117.



    /DEU <FEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002e>
    /ESP <FEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002e>
    /FRA <FEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002e>
    /ITA <FEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002e>
    /JPN <FEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002>
    /KOR <FEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002e>
    /NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers. De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 5.0 en hoger.)
    /NOR <FEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002e>
    /PTB <FEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002e>
    /SUO <FEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002e>
    /SVE <FEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002e>
    /ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers.  Created PDF documents can be opened with Acrobat and Adobe Reader 5.0 and later.)
  >>
  /Namespace [
    (Adobe)
    (Common)
    (1.0)
  ]
  /OtherNamespaces [
    <<
      /AsReaderSpreads false
      /CropImagesToFrames true
      /ErrorControl /WarnAndContinue
      /FlattenerIgnoreSpreadOverrides false
      /IncludeGuidesGrids false
      /IncludeNonPrinting false
      /IncludeSlug false
      /Namespace [
        (Adobe)
        (InDesign)
        (4.0)
      ]
      /OmitPlacedBitmaps false
      /OmitPlacedEPS false
      /OmitPlacedPDF false
      /SimulateOverprint /Legacy
    >>
    <<
      /AddBleedMarks false
      /AddColorBars false
      /AddCropMarks false
      /AddPageInfo false
      /AddRegMarks false
      /ConvertColors /NoConversion
      /DestinationProfileName ()
      /DestinationProfileSelector /NA
      /Downsample16BitImages true
      /FlattenerPreset <<
        /PresetSelector /MediumResolution
      >>
      /FormElements false
      /GenerateStructure true
      /IncludeBookmarks false
      /IncludeHyperlinks false
      /IncludeInteractive false
      /IncludeLayers false
      /IncludeProfiles true
      /MultimediaHandling /UseObjectSettings
      /Namespace [
        (Adobe)
        (CreativeSuite)
        (2.0)
      ]
      /PDFXOutputIntentProfileSelector /NA
      /PreserveEditing true
      /UntaggedCMYKHandling /LeaveUntagged
      /UntaggedRGBHandling /LeaveUntagged
      /UseDocumentBleed false
    >>
  ]
>> setdistillerparams
<<
  /HWResolution [2400 2400]
  /PageSize [612.000 792.000]
>> setpagedevice