Submitted 20 July 2016
Accepted 23 May 2017
Published 19 June 2017

Corresponding author
Angelo A. Salatino,
angelo.salatino@open.ac.uk

Academic editor
Filippo Menczer

Additional Information and
Declarations can be found on
page 24

DOI 10.7717/peerj-cs.119

Copyright
2017 Salatino et al.

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

How are topics born? Understanding
the research dynamics preceding the
emergence of new areas
Angelo A. Salatino, Francesco Osborne and Enrico Motta
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom

ABSTRACT
The ability to promptly recognise new research trends is strategic for many stakeholders,
including universities, institutional funding bodies, academic publishers and
companies. While the literature describes several approaches which aim to identify the
emergence of new research topics early in their lifecycle, these rely on the assumption
that the topic in question is already associated with a number of publications and
consistently referred to by a community of researchers. Hence, detecting the emergence
of a new research area at an embryonic stage, i.e., before the topic has been consistently
labelled by a community of researchers and associated with a number of publications, is
still an open challenge. In this paper, we begin to address this challenge by performing
a study of the dynamics preceding the creation of new topics. This study indicates that
the emergence of a new topic is anticipated by a significant increase in the pace of
collaboration between relevant research areas, which can be seen as the ‘parents’ of the
new topic. These initial findings (i) confirm our hypothesis that it is possible in principle
to detect the emergence of a new topic at the embryonic stage, (ii) provide new empirical
evidence supporting relevant theories in Philosophy of Science, and also (iii) suggest
that new topics tend to emerge in an environment in which weakly interconnected
research areas begin to cross-fertilise.

Subjects Artificial Intelligence, Data Science, Digital Libraries
Keywords Scholarly data, Topic emergence detection, Empirical study, Research trend detection,
Topic discovery, Digital libraries

INTRODUCTION
Early awareness of the emergence of new research topics can bring significant benefits to
anybody involved in the research environment. Academic publishers and editors can exploit
this knowledge to offer the most up-to-date and interesting content. Researchers may
not only be interested in new trends related to their areas but may also find it very useful
to be alerted about significant new research developments in general. Institutional funding
bodies and companies also need to be regularly updated on how the research landscape
is evolving, so that they can make early decisions about critical investments. Considering
the growth rate of research publications (Larsen & Von Ins, 2010), keeping up with novel
trends is a challenge even for expert researchers. Traditional methods, such as the manual
exploration of publications in significant conferences and journals, are no longer viable.
This has led to the emergence of several approaches capable of detecting novel topics and

How to cite this article Salatino et al. (2017), How are topics born? Understanding the research dynamics preceding the emergence of
new areas. PeerJ Comput. Sci. 3:e119; DOI 10.7717/peerj-cs.119

research trends (Bolelli, Ertekin & Giles, 2009; Duvvuru, Kamarthi & Sultornsanee, 2012; He
et al., 2009; Wu, Venkatramanan & Chiu, 2016). However, all of these approaches focus on
topics that are already associated with a number of publications and consistently referred
to by a community of researchers. This limitation hinders the ability of stakeholders to
anticipate and react promptly to new developments in the research landscape.

Hence, there is a need for novel methods capable of identifying the appearance of new
topics at a very early stage, assessing their potential and forecasting their trajectory. To
this end, we need first to achieve a better understanding of the dynamics underlying the
creation of new topics and then investigate whether such understanding can be exploited to
develop computationally effective methods, which are capable of detecting the emergence
of new topics at a very early stage.

The field of Philosophy of Science offers a number of interesting theories about the
emergence of new topics. Kuhn (2012) theorised that science evolves through paradigm
shifts. According to him, scientific work is performed within a set of paradigms and when
these paradigms cannot cope with certain problems, there is a paradigm shift that can lead
to the emergence of a new scientific discipline. This happens often through the creation
of novel scientific collaborations. In this context, Becher & Trowler (2001) explained
that, even if science proceeds towards more specific disciplines, and thus researchers in
different communities become less compatible, they are still inclined to collaborate for
mutual benefit. Herrera, Roberts & Gulbahce (2010), Sun et al. (2013) and Nowotny, Scott
& Gibbons (2013) suggested that the development of new topics is encouraged by the cross-
fertilisation of established research areas and recognised that multidisciplinary approaches
foster new developments and innovative thinking. Sun et al. (2013) and Osborne, Scavo &
Motta (2014) provided empirical evidence for these theories by analysing the social dynamics
of researchers and their effects on the formation and life-cycle of research communities
and topics.

According to these theories, when a new scientific area emerges, it goes through two
main phases. In the initial phase a group of scientists agree on some basic tenets, build a
conceptual framework and begin to establish a new scientific community. Afterwards, the
area enters a recognised phase, in which a substantial number of authors become active in
the area, producing and disseminating results (Couvalis, 1997).

Inspired by these theories, we hypothesize the existence of an even earlier phase, which we
name the embryonic phase, in which a topic has not yet been explicitly labelled and recognized
by a research community, but it is already taking shape, as evidenced by the fact that
researchers from a variety of fields are forming new collaborations and producing new
work, starting to define the challenges and the paradigms associated with the emerging
new area.

We also hypothesize that it could be possible to detect topics at this stage by analysing
the dynamics of already established topics. In this context, we use the term dynamics to refer
to the significant trends associated with a topic, including the interactions between topics
and those between entities linked to these topics, such as publications, authors and venues.
For example, the sudden appearance of some publications concerning a combination of
previously uncorrelated topics may suggest that some pioneer researchers are investigating

Salatino et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.119 2/28

new possibilities and maybe shaping a new emerging area. In the same way, as pointed
out by Salatino (2015), we can hypothesize a wide array of relevant patterns of activity,
which could anticipate the creation of a new research area. These may include a new
collaboration between two or more research communities (Osborne, Scavo & Motta, 2014),
the creation of interdisciplinary workshops, a rise in the number of experts working on
a certain combination of topics, a significant change in the vocabulary associated with
relevant topics (Cano Basave, Osborne & Salatino, 2016), and so on.

In this paper we present a study that aims to uncover key elements associated with the
research dynamics preceding the creation of novel topics, thus providing initial evidence to
support our hypotheses. In particular, our study provides evidence that the emergence of a
novel research topic can be anticipated by a significant increase in the pace of collaboration
between relevant research areas, which can be seen as the ‘parents’ of the new topic.

Our study was performed on a sample of three million publications in the 2000–2010
interval. It was conducted by comparing the sections of the co-occurrence graphs where new
topics are about to emerge with a control group of subgraphs associated with established
topics. These graphs were analysed by using two novel approaches that integrate both
statistics and semantics. We found that the pace of collaboration and the density measured
in the sections of the network that will give rise to a new topic are significantly higher than
those in the control group. These findings support our hypothesis about the existence of an
embryonic phase and also yield new empirical evidence consistent with the aforementioned
theories in Philosophy of Science. In addition, the identified dynamics could be used as the
starting point for developing new automatic methods, which could detect the emergence
of a new research topic well before this becomes explicitly recognised and established.

The study presented in this paper is an extension of the work by Salatino & Motta
(2016). The new contributions of this paper are: (1) a larger sample including 75 debutant
topics and 100 established ones, (2) a new technique for measuring the density of the topic
graph, (3) a more exhaustive statistical analysis, including a comparison of the different
approaches, (4) a revised state of the art, and (5) a more comprehensive discussion of the
findings.

The rest of the paper is organized as follows. We first review the literature regarding the
early detection of topics, pointing out the existing gaps. We then describe the experimental
approach used for the study, present the results and discuss their implications. Finally, we
summarize the main conclusions and outline future directions of research.

RELATED WORK
Topic detection and tracking is a task that has drawn much attention in recent years
and has been applied to a variety of scenarios, such as social networks (Cataldi, Di Caro
& Schifanella, 2010; Mathioudakis & Koudas, 2010), blogs (Gruhl et al., 2004; Oka, Abe &
Kato, 2006), emails (Morinaga & Yamanishi, 2004) and scientific literature (Bolelli, Ertekin
& Giles, 2009; Decker et al., 2007; Erten et al., 2004; Lv et al., 2011; Osborne, Scavo & Motta,
2014; Sun, Ding & Lin, 2016; Tseng et al., 2009).

The literature presents several works on research trend detection, which can be
characterised either by the way they define a topic or the techniques they use to detect it

(Salatino, 2015). Blei, Ng & Jordan (2003) have developed the well-known Latent Dirichlet
Allocation (LDA), an unsupervised learning method to extract topics from a corpus, which
models topics as a multinomial distribution over words. Since its introduction, LDA has
been extended and adapted to several applications. For example, Blei & Lafferty (2006)
have introduced the Correlated Topic Model using the logistic normal distribution instead
of the Dirichlet one, to address the issue that LDA fails to model correlations between
topics. Griffiths et al. (2004) have developed the hierarchical LDA, where topics are grouped
together in a hierarchy. Further extensions incorporate other kinds of research metadata.
For example, Rosen-Zvi et al. (2004) present the Author-Topic Model (ATM), which
includes authorship information and associates each topic to a multinomial distribution
over words and each author to a multinomial distribution over topics. Bolelli, Ertekin &
Giles (2009) introduce the Segmented Author-Topic model which extends ATM by adding
the temporal ordering of documents to address the problem of topic evolution. In addition,
Chang & Blei (2010) have developed the relational topic model which combines LDA and
the network structure of documents to model topics. Similarly, He et al. (2009) have
combined LDA and citation networks in order to address the problem of topic evolution.
Their approach detects topics in independent subsets of a corpus and leverages citations
to connect topics in different time frames. In a similar way, Morinaga & Yamanishi (2004)
employ a probabilistic model called Finite Mixture Model to represent the structure of
topics and analyse the changes in time of the extracted components to track emerging
topics. However, their evaluation rests on an email corpus, so it is not clear how it would
perform on a scientific corpus. A general issue affecting this kind of approach is that it is
not always easy to associate clearly identifiable research areas to the resulting topic models.

In addition to LDA, the Natural Language Processing (NLP) community have proposed
a variety of tools for identifying topics. For example, Chavalarias & Cointet (2013) used
CorText Manager to extract a list of 2000 n-grams representing the most salient terms
from a corpus and derived a co-occurrence matrix on which they performed clustering
analysis to discover patterns in the evolution of science. Jo, Lagoze & Giles (2007) developed
an approach that correlates the distribution of terms extracted from a text with the
distribution of the citation graphs related to publications containing these terms. Their
work assumes that if a term is relevant to a topic, documents containing that term will
have a stronger connection than randomly selected ones. This approach is not suitable
for topics in their very early stage since it takes time for the citation network of a term to
become tightly connected.

Duvvuru et al. (2013) have analysed the network of co-occurring keywords in a scholarly
corpus and monitored the evolution in time of the link weights, to detect research trends and
emerging research areas. However, as Osborne & Motta (2012) pointed out, keywords tend
to be noisy and do not always represent research topics—in many cases different keywords
even refer to the same topic. For example, Osborne, Scavo & Motta (2014) showed that a
semantic characterisation of research topics yields better results than keywords for the
detection of research communities. To cope with this problem, some approaches rely on
taxonomies of topics. For example, Decker et al. (2007) matched a corpus of research papers
to a taxonomy of topics based on the most significant words found in titles and abstracts,

and analysed the changes in the number of publications associated with such topics.
Similarly, Erten et al. (2004) adopted the ACM Digital Library taxonomy for analysing
the evolution of topic graphs and monitoring research trends. However, human crafted
taxonomies tend to evolve slowly and, in a fast-changing research field, such as Computer
Science (Pham, Klamma & Jarke, 2011), it is important to rely on constantly updated
taxonomies. For this reason, in our experiment we adopted an ontology of Computer
Science automatically generated and regularly updated by the Klink-2 algorithm developed
by Osborne & Motta (2015).

In brief, the literature comprises a wide collection of approaches for detecting research
trends. However, they focus on already recognised topics, which are either already associated
with a recognized label or, in the case of probabilistic topic models, with a set of terms that
have previously appeared in a good number of publications. Detecting research trends at
an embryonic stage remains an open challenge.

MATERIALS AND METHODS
The aim of this study was to measure the association between the emergence of a new topic
and the increase of the pace of collaboration and density previously observed in the co-
occurrence graphs of related topics. To this end, we represent topics and their relationships
in a certain time frame as a graph in which nodes are topics and edges represent their co-
occurrences in a sample of publications. This is a common representation for investigating
topic dynamics (Boyack, Klavans & Börner, 2005; Leydesdorff, 2007; Newman, 2001). In the
following we will refer to it as topic graph or topic network. We analysed 75 topics that
debuted in the 2000–2010 period using 100 established topics as a control group.

In our previous work (Salatino & Motta, 2016), we conducted a similar analysis on a
smaller sample. The sample analysed in this paper was selected by iteratively adding new
topics until we reached data saturation (Fusch & Ness, 2015), i.e., the results of the analysis
did not vary significantly with the inclusion of new data points.

In the following sections, we will describe the dataset, the semantically enhanced topic
graph, and the methods used to measure the pace of collaboration and the density of the
subgraphs.

The raw data and the results of this study are available at https://osf.io/bd8ex/.

Semantic enhanced topic network
We use as dataset the metadata describing three million papers in the field of Computer
Science from a dump of the well-known Scopus dataset (https://www.elsevier.com/
solutions/scopus). In this dataset each paper is associated to a number of keywords
that can be used to build the topic graph. However, as pointed out in Osborne & Motta
(2012), the use of keywords as proxies for topics suffers from a number of problems: some
keywords do not represent topics (e.g., case study) and multiple keywords can refer to the
same topic (e.g., ontology mapping and ontology matching).

The literature offers several methods for characterizing research topics. Probabilistic
topic models, such as LDA, are very popular solutions, which however are most effective in
scenarios where fuzzy classification is acceptable, there is no good domain knowledge and

it is not important for users to understand the rationale of a classification. However, these
tenets do not apply to this study. Furthermore, it is not easy to label the topics produced
by a probabilistic topic model with specific and distinct research areas. Conversely, in this
study it is important to be able to associate topics with well-established research areas.

A second approach, used by several digital libraries and publishers, is tagging publications
with categories from a pre-determined taxonomy of topics. Some examples include the
ACM computing classification system (http://www.acm.org/publications/class-2012), the
Springer Nature classification (http://www.nature.com/subjects), Scopus subject areas
(https://www.elsevier.com/solutions/scopus/content), and the Microsoft Academic Search
classification (http://academic.research.microsoft.com/). This solution has the advantage
of producing sound topics, agreed upon by a committee of experts. However, these
taxonomies suffer from a number of issues. First, building large-scale taxonomies requires
a sizable number of experts and it is an expensive and time-consuming process. Hence,
they are seldom updated and grow obsolete very quickly. For example, the 2012 version of
the ACM classification was finalized fourteen years after the previous version. In addition,
these taxonomies are very coarse-grained and usually contain general fields rather than
fine-grained research topics.

We addressed these issues by characterizing our topics according to the Computer
Science Ontology (CSO) produced by Klink-2 (Osborne & Motta, 2015), which describes
the relationships between more than 15,000 research areas extracted from a corpus of
16 million publications. Klink-2 is an algorithm that is able to generate very granular
ontologies and update them regularly by analysing keywords and their relationships with
research papers, authors, venues, and organizations, and by taking advantage of multiple
knowledge sources available on the web. Klink-2 is currently integrated in the Rexplore
system (Osborne, Motta & Mulholland, 2013), a platform for exploring and making sense
of scholarly data, which provides semantic-aware analytics.

We used the CSO ontology to semantically enhance the co-occurrence graphs by
removing all keywords that did not refer to research areas and by aggregating keywords
representing the same concept, i.e., keywords linked by a relatedEquivalent relationship in
the ontology (Osborne, Motta & Mulholland, 2013). For example, we aggregated keywords
such as “semantic web”, “semantic web technology” and “semantic web technologies” into a
single semantic topic and assigned it to all publications associated with these keywords.
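
The filtering and aggregation step described above can be sketched as follows. This is a minimal illustration, not the Klink-2 or Rexplore implementation: the dictionary `EQUIVALENT_TO`, its contents, the choice of canonical labels and the function name `to_semantic_topics` are all assumptions introduced here.

```python
from typing import Dict, Iterable, Set

# Hypothetical excerpt of an equivalence mapping derived from an ontology:
# keyword -> canonical topic label. Keywords that are absent from the mapping
# are treated as non-topics (e.g., "case study") and are dropped.
EQUIVALENT_TO: Dict[str, str] = {
    "semantic web": "semantic web",
    "semantic web technology": "semantic web",
    "semantic web technologies": "semantic web",
    "ontology mapping": "ontology matching",  # canonical label chosen arbitrarily
    "ontology matching": "ontology matching",
}


def to_semantic_topics(keywords: Iterable[str]) -> Set[str]:
    """Map the raw keywords of one paper to canonical topics, discarding non-topics."""
    topics = set()
    for keyword in keywords:
        canonical = EQUIVALENT_TO.get(keyword.strip().lower())
        if canonical is not None:
            topics.add(canonical)
    return topics
```

Under these assumptions, a paper tagged with "Semantic Web Technologies", "case study" and "ontology mapping" would be assigned the two semantic topics "semantic web" and "ontology matching".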

We built sixteen topic networks representing topic co-occurrences in the 1995–2010
timeframe. Each network is a fully weighted graph $G_{year} = (V_{year}, E_{year})$, in which $V_{year}$ is the
set of topics while $E_{year}$ is the set of links representing the topic co-occurrences. The node
weight represents the number of publications in which a topic appears in a year, while the
link weight is equal to the number of publications in which two topics co-occur in the
same year.
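
As an illustration of these definitions, the sketch below builds the node and link weights for a single year from a list of per-paper topic sets. The function name `build_topic_network` and the input layout are assumptions made for the example; they do not come from the study's code.

```python
from collections import Counter
from itertools import combinations


def build_topic_network(papers):
    """Build one yearly topic network.

    papers: iterable of topic collections, one per publication of that year.
    Returns (node_weight, link_weight), where node_weight[t] is the number of
    papers in which topic t appears and link_weight[(a, b)] is the number of
    papers in which a and b co-occur (pairs are stored in sorted order).
    """
    node_weight = Counter()
    link_weight = Counter()
    for topics in papers:
        unique = sorted(set(topics))  # count each topic once per paper
        node_weight.update(unique)
        link_weight.update(combinations(unique, 2))
    return node_weight, link_weight
```

Repeating this construction for each year from 1995 to 2010 would yield the sixteen networks, assuming the papers have already been mapped to semantic topics.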

Graph selection
We randomly selected 75 topics that debuted in the period between 2000 and 2010 as
treatment group (also referred to as debutant group). A topic debuts in the year in which
its label first appears in a research paper. The control group (also referred to as non-debutant

Figure 1 Evolution of the topic Software Agents in terms of number of authors and number of publications
per year. The chart has been produced using the Rexplore system.

Figure 2 Workflow representing all the steps for the selection phase.

group) was obtained by selecting 100 well-established topics. We considered a topic as
well-established if: (i) it debuted before 2000, (ii) it appears in the CSO Ontology, (iii) it
is associated each year with a substantial and consistent number of publications. As an
example, Fig. 1 shows the evolution through time of the well-established topic Software
Agents, in terms of number of active authors and publications. The figure shows that the
topic made its debut in 1993 and in the year 2000 reached a rate of over 500 publications
per year with more than 1,500 authors working on it. It can thus be considered established
in the context of our study.

We assume that a new topic will continue to collaborate with the topics that contributed
to its creation for a certain time after its debut. This assumption was discussed and tested in
a previous study (Osborne & Motta, 2012), where it was used to find historical subsumption
links between research areas. Hence, as summarized in Fig. 2, for each debutant topic we

extracted the portion of topic network containing its n most co-occurring topics from the
year of debut to the present and analysed their activity in the five years preceding the year
of debut. Since we want to analyse how the dimension of these subgraphs influences the
results, we tested different values of n (20, 40, and 60). For example, if a topic A made its
debut in 2003, the portion of network containing its most co-occurring topics is analysed
in the 1998–2002 timeframe. We repeated the same procedure on the topics in the control
group, assigning them a random year of analysis within the decade 2000–2010.
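
The selection step can be sketched as below. The sketch rests on two assumptions that the text does not spell out: co-occurrences are ranked by summing link weights over all years from the debut onwards, and yearly link weights are stored as `{year: {(topic_a, topic_b): weight}}`. The function names are hypothetical.

```python
from collections import Counter


def analysis_years(reference_year, window=5):
    """Years analysed for a topic: the `window` years before its debut
    (or before the assigned year of analysis for control topics)."""
    return list(range(reference_year - window, reference_year))


def most_cooccurring_topics(topic, links_by_year, debut_year, n):
    """Return the n topics that co-occur most with `topic` from its debut onwards.

    Assumption: the ranking sums the link weights over all available years
    greater than or equal to debut_year.
    """
    totals = Counter()
    for year, links in links_by_year.items():
        if year < debut_year:
            continue
        for (a, b), weight in links.items():
            if a == topic:
                totals[b] += weight
            elif b == topic:
                totals[a] += weight
    return [other for other, _ in totals.most_common(n)]
```

For a topic that debuted in 2003, `analysis_years(2003)` gives the years 1998 to 2002, matching the example in the text; the study used n = 20, 40 and 60.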

In the previous study (Salatino & Motta, 2016), we selected 50 established topics and
assigned a random year of analysis to each of them. For this study, we randomly assigned
each established topic to two consecutive years within the decade 2000–2010, with the
consequence of doubling the control group, thus reducing noise and smoothing the
resulting measures.

In brief, the selection phase associates to each topic in the treatment and control groups
(also referred to as input topics) a graph:

$$G^{topic} = G^{topic}_{year-5} \cup G^{topic}_{year-4} \cup G^{topic}_{year-3} \cup G^{topic}_{year-2} \cup G^{topic}_{year-1}. \qquad (1)$$

This graph corresponds to the co-occurrence network of a debutant topic in the five years
prior to its emergence (or year of analysis for non-debutant topics). In particular, each
year corresponds to the sub-graphs:

$$G^{topic}_{year-i} = (V^{topic}_{year-i}, E^{topic}_{year-i}) \qquad (2)$$

in which $V^{topic}_{year-i}$ is the set of most co-occurring topics in a year and $E^{topic}_{year-i}$ is the set of
edges linking the nodes in the set.

The graphs associated to the debutant topics included 1,357 unique topics, while the
ones associated to the control group included 1,060 topics.

Graph analysis
We assess the dynamics in the graphs with two main approaches: clique-based and triad-
based. The first decomposes the graph into 3-cliques, associates to each of them a measure
reflecting the increase in collaboration between relevant topics and then averages the results
over all 3-cliques. The second measures the increase in the topic graph density using the
triad census technique (Davis & Leinhardt, 1967). In the following two sections we describe
both methods in detail.

Clique-based method
We measure the collaboration pace of a graph by analysing the diachronic activity of
triangles of collaborating topics. To this end, we first extract all 3-cliques from the five
sub-graphs associated to each topic under analysis. A 3-clique, as shown in Fig. 3, is a
complete sub-graph of order three in which all nodes are connected to one another and is
employed for modelling small groups of entities close to each other (Luce & Perry, 1949).

To study the dynamics preceding the debut of each topic, we analyse the evolution of
the same 3-clique in subsequent years. Figure 4 summarizes the process. Considering a
3-clique having nodes {A,B,C}, we quantify its collaboration index µ1 in a year by taking

Figure 3 An instance of a 3-clique containing node and link weights.

Figure 4 Main steps of the analysis phase.

into account both node weights {Wa,Wb,Wc} and link weights {Wab,Wbc,Wca}.

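$$
\begin{aligned}
\mu_{A-B} &= \mathrm{mean}(P(A|B), P(B|A)) \\
\mu_{B-C} &= \mathrm{mean}(P(B|C), P(C|B)) \\
\mu_{C-A} &= \mathrm{mean}(P(C|A), P(A|C)) \\
\mu_{1} &= \mathrm{mean}(\mu_{A-B}, \mu_{B-C}, \mu_{C-A}).
\end{aligned}
\qquad (3)
$$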

The index µ1 is computed by aggregating the three coefficients µA−B, µB−C and µC−A
as illustrated by Eq. (3). The strength of collaboration µx−y between two nodes of the
topic network, x and y, is computed as the mean of the conditional probabilities P(y|x)
and P(x|y), where P(y|x) is the probability that a publication associated with a topic x
will be associated also with a topic y in a certain year. The advantage of using conditional
probabilities instead of the number of co-occurrences is that the value µx−y is normalised
with respect to the number of publications associated to each topic. Finally, µ1 is computed
as the mean of the strengths of collaboration of the three links in a 3-clique. This solution

Figure 5 The four isomorphism classes of triad: H0 (empty), H1 (one edge), H2 (two-star) and
H3 (triangle). The triad census counts the frequencies of Hi in the input graph.

was adopted after testing alternative approaches during the preliminary evaluation, as
discussed in the Results section.
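
The collaboration index of Eq. (3) can be sketched in a few lines. The sketch assumes that P(y|x) is estimated as the link weight between x and y divided by the node weight of x, which follows from the definitions of the weights but is not stated as a formula in the text; the function names and the use of `frozenset` keys for undirected links are choices made for the example.

```python
def collaboration_strength(w_x, w_y, w_xy):
    """Mean of P(y|x) and P(x|y), with P(y|x) estimated as w_xy / w_x (assumption)."""
    return (w_xy / w_x + w_xy / w_y) / 2


def clique_collaboration_index(node_weight, link_weight, clique):
    """Collaboration index of a 3-clique (a, b, c) in one year, as in Eq. (3).

    node_weight: {topic: number of papers}
    link_weight: {frozenset({topic_x, topic_y}): number of co-occurrences}
    """
    a, b, c = clique
    strengths = [
        collaboration_strength(node_weight[x], node_weight[y],
                               link_weight[frozenset((x, y))])
        for x, y in ((a, b), (b, c), (c, a))
    ]
    return sum(strengths) / 3
```

Because each term is a ratio of co-occurrences to publications, two large topics with many shared papers and two small topics with few shared papers can obtain the same strength, which is the normalisation effect described above.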

The evolution of the 3-clique collaboration pace can be represented as a timeline of
values in which each year is associated with its collaboration pace, as in Eq. (4). We assess
the increase of the collaboration pace in the period under analysis by computing the slope
of the linear regression of these values.

$$\mu^{clique-i}_{1\,time} = [\mu_{1\,yr-5}, \mu_{1\,yr-4}, \mu_{1\,yr-3}, \mu_{1\,yr-2}, \mu_{1\,yr-1}]. \qquad (4)$$

Initially, we tried to determine the increase in the collaboration pace exhibited by a
clique by simply taking the difference between the first and last values of the timeline
(µ1yr−5−µ1yr−1). However, this method ignores the other values in the timeline and
can thus neglect important information. For this reason, we instead fitted a straight line
to the five measures using the least-squares approximation, obtaining the linear
regression of the time series f(x) = a · x + b. The slope a is then
used to assess the increase of collaboration in a clique. When a is positive, the degree of
collaboration between the topics in the clique is increasing over time, while, when it is
negative, the number and intensity of collaborations are decreasing.

Finally, the collaboration pace of each sub-graph is measured by computing the mean
of all slopes associated with the 3-cliques.
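
A minimal sketch of this computation is given below, using the closed-form least-squares slope with the years indexed 0 to 4. The function names are hypothetical and the code is an illustration of the described procedure, not the authors' implementation.

```python
def least_squares_slope(values):
    """Slope a of the least-squares line f(x) = a*x + b fitted to `values`,
    taking x = 0, 1, 2, ... as the (equally spaced) years."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


def collaboration_pace(clique_timelines):
    """Mean slope over the timelines of all 3-cliques of a sub-graph."""
    slopes = [least_squares_slope(timeline) for timeline in clique_timelines]
    return sum(slopes) / len(slopes)
```

Unlike the difference between the first and last values, the slope is affected by all five measures, so a single anomalous year has less influence on the result.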

To summarize, for each input topic we select a subgraph of related topics in the five
years preceding the year of debut (or analysis for topics in the control group). We then
extract the 3-cliques and associate each of them with a vector representing the evolution of
their pace of collaboration. The trend of each clique is computed as the angular coefficient
of the linear regression of these values. Finally, the increase in the pace of collaboration of
a subgraph is obtained by averaging these values.

Triad-based method
The triad-based method employs the triad census (Davis & Leinhardt, 1967) to measure
the change of topology and the increasing density of the subgraphs during the five-year
period. The triad census of an undirected graph, also referred to as global 3-profiles, is a
four-dimensional vector representing the frequencies of the four isomorphism classes of
triad, as shown in Fig. 5.

Table 1 Frequencies of Hi obtained performing triad census on the debutant topic “Artificial Bee
Colonies”.

Graph                      H0      H1      H2      H3
$G^{topic}_{year-5}$      446     790     807     882
$G^{topic}_{year-4}$      443     854     915   1,064
$G^{topic}_{year-3}$      125     486     967   1,698
$G^{topic}_{year-2}$      100     410     908   1,858
$G^{topic}_{year-1}$       68     486     849   2,251

The triad census summarises structural information in networks and is useful to analyse
structural properties in social networks. It has been applied to several scenarios, such
as identifying spam (Kamaliha et al., 2008; O’Callaghan et al., 2012), comparing networks
(Pržulj, 2007), and analysing social networks (Faust, 2010; Ugander, Backstrom & Kleinberg,
2013).

In this study, we use triad census to describe all the sub-graphs associated to an input
topic in terms of frequencies of Hi (see Fig. 5) and we then evaluate how the frequencies
of empties (H0), one edges (H1), two-stars (H2) and triangles (H3) change in time. Figure 5
illustrates the four classes of triads for an undirected graph in the case of topic networks. An
increase in the number of triangles suggests the appearance of new collaboration clusters
among previously distant topics.

In contrast with the 3-cliques approach, the triad census does not consider the weight
of the links, but only their existence. Hence, it is useful to assess how the inclusion of links
with different strengths affects the analysis. To this end, we performed three experiments in
which we considered only links associated with at least 3, 10 and 20 topic co-occurrences.

We initially perform the triad census over the five graphs associated to each input
topic. For example, Table 1 shows the results of the triad census over the five sub-graphs
associated with the debutant topic Artificial Bee Colonies.

Next, we check whether the co-occurrence graph is becoming denser by analysing
the change of frequencies associated with Hi (see Fig. 6). We first calculate the percentage
growth of each Hi (Eq. (5)) and then compute their weighted summation (Eq. (6)). We label
the resulting metric the growth index. We empirically tested other solutions for aggregating
the various contributions (e.g., considering only H3, summing the values, weighting the
sum in a variety of ways) and found that this definition of growth index provides the best
discrimination between the two classes of graphs.

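\[
\%\mathrm{Growth}_{H_i} = \frac{\left(H_i^{Yr-1} - H_i^{Yr-5}\right) \cdot 100}{H_i^{Yr-5}} \tag{5}
\]

\[
\mathrm{Growing\ Index}_{topic} = \sum_{i=0}^{3} i \cdot \%\mathrm{Growth}_{H_i} \tag{6}
\]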
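
To make the computation concrete, the sketch below applies the percentage growth between the first and the last graph, and then the sum weighted by i, to the Table 1 frequencies for Artificial Bee Colonies. The function names are illustrative choices, and the code assumes non-zero frequencies in the first year.

```python
def percentage_growth(first, last):
    """Percentage growth of a triad frequency between year-5 and year-1 (Eq. 5)."""
    return (last - first) * 100 / first


def growth_index(census_first, census_last):
    """Weighted sum of the percentage growths, with weight i for H_i (Eq. 6).

    H0 has weight 0, so it is skipped; this also avoids dividing by H0.
    """
    return sum(
        i * percentage_growth(census_first[i], census_last[i])
        for i in range(1, 4)
    )


# Table 1, 'Artificial Bee Colonies': [H0, H1, H2, H3] at year-5 and year-1.
abc_year_minus_5 = [446, 790, 807, 882]
abc_year_minus_1 = [68, 486, 849, 2251]
abc_index = growth_index(abc_year_minus_5, abc_year_minus_1)
print(round(abc_index, 2))
```

With these values the percentage growths are roughly -38.5% (H1), 5.2% (H2) and 155.2% (H3), giving a growth index of about 437.6.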

The growth index takes into account the contributions from H1, H2 and H3 (H0 receives
weight i = 0 in Eq. (6)). Although the number of triangles (H3) can by itself be a fair
indicator of the density, previous
studies showed that all four classes of triads are useful for computing network properties,


[Figure 6 plot: frequencies of triads (y-axis, 0 to 2,500) for the graphs Year-5 to Year-1 (x-axis); series H0, H1, H2 and H3.]

Figure 6 Development in time of the frequencies of Hi in the network related to the emergence of
‘‘Artificial Bee Colonies’’.

including transitivity, intransitivity and density (Faust, 2010; Holland & Leinhardt, 1976).
Taking into consideration only H3 might fail to detect some subtler cases, characterized
for example by a simultaneous increase of H2 and decrease of H1.

To summarize, the triad-based method receives the same input as the clique-based
method. For each of the five subgraphs associated to a topic, we perform the triad
census obtaining the different frequencies, Hi, in different years. We then analyse them
diachronically to quantify the increase in density.

RESULTS
In this section we report the results obtained by analysing the debutant and control groups
using the previously discussed methods. We will describe:

• The preliminary evaluation performed on a reduced dataset for assessing the metrics
used in the clique-based method;
• The full study using the clique-based method;
• The full study using the triad-based method.

Preliminary evaluation with alternative clique-based methods
We conducted a preliminary evaluation aimed at choosing the most effective clique-based
method for assessing the pace of collaboration. This test focused on the subgraphs of the
20 most co-occurring topics associated with the topics Semantic Web (debuting in 2001)
and Cloud Computing (2006) versus a control group of 20 subgraphs associated with a group
of non-debutant topics. On this dataset we tested two techniques to compute the weight
of a clique (harmonic mean and arithmetic mean) and two methods to evaluate its trend

[Figure 7 plot: pace of collaboration (y-axis, -1 to 2) for each experiment (x-axis): AM-N, AM-CF, HM-N and HM-CF, each shown for the debutant and non-debutant groups.]

Figure 7 Overall directions of the sub-graphs related to input topics in both debutant and control
groups with all four approaches.

(computing the difference between the first and the last values and linear regression).
Hence, we evaluated the following four approaches:

• AM-N, which uses the arithmetic mean and the difference between first and last value;
• AM-CF, which uses the arithmetic mean and the linear regression coefficient;
• HM-N, which uses the harmonic mean and the difference between first and last value;
• HM-CF, which uses the harmonic mean and the linear regression coefficient.
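
The four combinations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clique is represented by hypothetical link weights for five consecutive years, the difference between first and last value is taken without any normalisation, and the regression coefficient is the ordinary least-squares slope against the year index. The names `clique_trend`, `slope` and `APPROACHES` are not from the original study.

```python
from statistics import harmonic_mean, mean


def slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    denominator = sum((x - x_bar) ** 2 for x in range(n))
    return numerator / denominator


def first_last_difference(values):
    """Plain difference between the last and the first value of the series."""
    return values[-1] - values[0]


def clique_trend(yearly_link_weights, average, trend):
    """Average the link weights of a clique in each year, then measure the trend."""
    series = [average(weights) for weights in yearly_link_weights]
    return trend(series)


APPROACHES = {
    "AM-N": (mean, first_last_difference),
    "AM-CF": (mean, slope),
    "HM-N": (harmonic_mean, first_last_difference),
    "HM-CF": (harmonic_mean, slope),
}

# Hypothetical weights of the three links of one clique over five years.
clique = [
    (0.10, 0.20, 0.40),
    (0.20, 0.20, 0.50),
    (0.30, 0.30, 0.30),
    (0.40, 0.50, 0.60),
    (0.50, 0.50, 1.00),
]
for name, (average, trend) in APPROACHES.items():
    print(name, round(clique_trend(clique, average, trend), 3))
```

The harmonic mean is dominated by the weakest link of a clique, so it rewards cliques whose three links all strengthen, while the regression slope uses all five yearly values rather than only the two endpoints.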

Figure 7 illustrates the average pace of collaboration for the sub-graphs associated with
each topic according to these methods and the range of their values (thin vertical line). The
results support the initial hypothesis: the pace of collaboration of the cliques within the
portion of the network associated with the emergence of new topics is positive and higher than
that of the control group. Interestingly, the pace of collaboration of the control group
is also slightly positive. Further analysis revealed that this behaviour is probably caused by
the fact that the topic network becomes denser and noisier over time. Figure 8 supports this
intuition, illustrating the fast growth of the number of publications per year in the dataset
during the time window 1970–2013.

The approaches based on the simple difference (AM-N and HM-N) exhibit the largest
gaps between the two groups in terms of average pace of collaboration. However, the ranges
of values overlap, making it harder to assess whether a certain sub-group is incubating a novel
topic. The same applies to AM-CF. HM-CF performs better: even though the values slightly
overlap when averaging the pace over different years, they do not when considering single
years. Indeed, analysing the two ranges separately in 2001 and 2006 (see Fig. 9), we can
see that the overall collaboration paces of the debutant topics (DB) are always significantly
higher than those of the control group (NDB).

With the null hypothesis: ‘‘The differences in the pace of collaboration between the
debutant topics and topics in the control group result purely from chance’’, we ran Student’s
t-test on the sample of data provided by the HM-CF approach, to verify whether the two


1 We consider p < 0.0001 as a conventional statistical representation to indicate an extremely
high statistical significance (>500 times stronger than the conventional 0.05 threshold for
claiming significance). It includes all mathematical outcomes below 0.0001, which are
essentially equivalent in assessing excellent significance.

[Figure 8 plot: number of papers in thousands (y-axis, 0 to 1,000) per year (x-axis, 1970 to 2012).]

Figure 8 Number of papers each year in the period 1970–2013 in the dataset under analysis.

[Figure 9 plot: pace of collaboration (y-axis, -0.2 to 0.4) for the experiments DB 2001, NDB 2001, DB 2006 and NDB 2006 (x-axis).]

Figure 9 Overall directions of the sub-graphs related to input topics in both debutant and control
groups in the HM-CF approach.

groups belong to different populations. The test yielded p < 0.0001, which allowed us
to reject the null hypothesis that the differences between the two distributions were due
to random variations.1 Based on this result, we could further confirm that the HM-CF
approach performs better than the other approaches. For this reason, we selected
the combination of harmonic mean and linear regression as the approach for the full study
using the clique-based method.
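
For readers who want to see the mechanics of such a test, the sketch below computes a two-sample Student's t statistic with pooled variance. It is an illustration only: it borrows the 2004 pace-of-collaboration values listed in Table 9 rather than the samples used in this preliminary evaluation, and the pooled-variance form is an assumption, since the text does not state which variant of the test was run. The function name `student_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, dof


# Pace of collaboration in 2004 as listed in Table 9.
debutant = [0.538, 0.499, 0.488, 0.463, 0.446, 0.433, 0.426, 0.416, 0.313]
control = [0.409, 0.401, 0.357, 0.326, 0.325, 0.308, 0.300, 0.298, 0.250]
t_stat, dof = student_t(debutant, control)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

Turning the statistic into a p-value requires the cumulative distribution of Student's t, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`) would normally be used for that step.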

The results of HM-CF give interesting insights into the creation of some well-known
research topics. Tables 2 and 3 list the cliques that exhibited the steepest slopes for Semantic
Web and Cloud Computing. We can see that Semantic Web was anticipated in the 1996–2001
timeframe by a significant increase in collaboration of the World Wide Web area with topics
such as Information Retrieval, Artificial Intelligence, and Knowledge Based Systems. This is


Table 2 Ranking of the cliques with the highest slope values for ‘‘semantic web’’.

Topic 1           Topic 2                    Topic 3                     Slope
World Wide Web    Information retrieval      Search engines              2.529
World Wide Web    User interfaces            Artificial intelligence     1.12
World Wide Web    Artificial intelligence    Knowledge representation    0.974
World Wide Web    Knowledge based systems    Artificial intelligence     0.850
World Wide Web    Information retrieval      Knowledge representation    0.803

Table 3 Ranking of the cliques with the highest slope values for ‘‘cloud computing’’.

Topic 1           Topic 2                         Topic 3                    Slope
Grid computing    Distributed computer systems    Web services               1.208
Web services      Information management          Information technology     1.094
Grid computing    Distributed computer systems    Quality of service         1.036
Internet          Quality of service              Web services               0.951
Web services      Distributed computer systems    Information management     0.949

consistent with the initial vision of the semantic web, defined in 2001 by the seminal
work of Berners-Lee, Hendler & Lassila (2001). Similarly, Cloud Computing was anticipated
by an increase in the collaboration between topics such as Grid Computing, Web Services,
Distributed Computer Systems and Internet. This suggests that our approach can be used
both for forecasting the emergence of new topics in distinct subsections of the topic network
and for identifying the topics that gave rise to a research area.

Clique-based method study
We applied the clique-based method to the subgraphs associated with topics in the treatment
and control groups. Figure 10 reports the results obtained by using subgraphs composed of
the 20, 40 and 60 topics with the highest co-occurrence. Each bar shows the mean value of
the average pace of collaboration for the debutant (DB) and non-debutant (NDB) topics.
As before, the pace computed in the portion of the network related to debutant topics is
higher than the corresponding pace for the control group.
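
The selection of the n most co-occurring topics can be sketched as follows. The data structure (a dictionary from unordered topic pairs to the number of papers in which both topics appear), the function name `most_cooccurring` and the counts are hypothetical and serve only to illustrate the ranking step; ties are broken alphabetically, which is an arbitrary choice.

```python
def most_cooccurring(topic, cooccurrences, n):
    """Return the n topics that co-occur most often with `topic`.

    `cooccurrences` maps an unordered topic pair (a frozenset of two names)
    to the number of papers in which both topics appear.
    """
    counts = {}
    for pair, count in cooccurrences.items():
        if topic in pair and len(pair) == 2:
            (other,) = pair - {topic}
            counts[other] = counts.get(other, 0) + count
    # Highest count first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return ranked[:n]


# Hypothetical co-occurrence counts, for illustration only.
cooc = {
    frozenset({"new topic", "topic A"}): 120,
    frozenset({"new topic", "topic B"}): 85,
    frozenset({"new topic", "topic C"}): 97,
    frozenset({"topic A", "topic C"}): 300,
}
print(most_cooccurring("new topic", cooc, 2))
```

Here the two selected topics are "topic A" and "topic C"; the pair that does not involve the topic under analysis is ignored during selection.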

Since the pace of collaboration shows significant changes within the period considered,
we studied its behaviour across the 2000–2010 interval. Figures 11A–11C show the average
yearly collaboration pace when considering the 20, 40 and 60 most co-occurring topics.
In all cases the collaboration pace for the debutant topics is higher than that of the
control group. We can also notice that in the last five years the overall pace of collaboration
fell for both debutant and non-debutant topics. This may be due to the fact that
the topic network became denser and noisier in the final years of the interval. Moreover,
the most recent debutant topics often have an underdeveloped network of co-occurrences,
which may result in a suboptimal selection of the group of topics to be analysed in the
previous years. Therefore, simply selecting the 20 most co-occurring topics may not allow
us to highlight the real dynamics preceding the creation of a new topic.


[Figure 10 plot: pace of collaboration (y-axis, -0.05 to 0.55) for the experiments 20 DB, 20 NDB, 40 DB, 40 NDB, 60 DB and 60 NDB (x-axis).]

Figure 10 Average collaboration pace of the sub-graphs associated with the treatment (DB) and control
group (NDB), when selecting the 20, 40 and 60 most co-occurring topics. The thin vertical lines
represent the ranges of values.

Table 4 compares the collaboration pace of 24 debutant topics with the collaboration
pace of the control group in the same year. We can see how the appearance of a good number
of well-known topics, which emerged in the last decade, was anticipated by the dynamics of
the topic network. The Student’s t-test confirmed that the debutant and established topics
do not belong to the same population (p<0.0001). The results of the t-test also suggest
that the experiment involving the 60 most co-occurring topics, represented in Fig. 11C,
provides a better discrimination of debutant topics from non-debutant ones. For the sake
of completeness, in Table 5 we report the p-values yielded by each experiment.

In conclusion, the results confirm that the portions of the topic network in which a
novel topic will eventually appear exhibit a measurable fingerprint, in terms of increased
collaboration pace, well before the topic is recognized and labelled by researchers.

Triad-based method study
We applied the triad-based method to the subgraphs composed of the 60 most co-occurring
topics, since this configuration provided the best outcomes in previous tests.
We performed multiple tests by filtering out links associated with fewer than 3, 10 and 20
co-occurrences, to understand how collaboration strength influences the outcome.
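
Because the triad census only looks at whether a link exists, the threshold has to be applied before counting triads. A minimal sketch of this filtering step is given below; the dictionary of co-occurrence counts and the function name `filter_links` are hypothetical.

```python
def filter_links(weighted_links, min_cooccurrences):
    """Keep only the links with at least `min_cooccurrences` co-occurrences."""
    return [
        pair
        for pair, weight in weighted_links.items()
        if weight >= min_cooccurrences
    ]


# Hypothetical co-occurrence counts between topic pairs.
weighted_links = {("A", "B"): 25, ("A", "C"): 12, ("B", "C"): 4, ("C", "D"): 2}
for threshold in (3, 10, 20):
    print(threshold, filter_links(weighted_links, threshold))
```

Raising the threshold from 3 to 20 leaves three, two and then one link in this toy example, which shows how stricter filtering thins out the graph on which the census is computed.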

Figure 12A reports the average value of the growth indexes when discarding links with fewer
than 3 co-occurrences. The approach allows us to discriminate well the portions of the network
related to debutant topics from the ones related to the control group. In particular, the
density of the network associated with the debutant topics is always higher than its counterpart.
Figures 12B and 12C report the results obtained by removing links with fewer than 10 and
20 co-occurrences. As in the previous experiment, we adopted the Student’s t-test to


[Figure 11 plots, panels (A), (B) and (C): pace of collaboration (y-axis, -0.05 to 0.65) per year (x-axis, 2000 to 2010); series Debutant and Non Debutant.]

Figure 11 Average collaboration pace per year of the sub-graphs related to input topics in both
debutant and control groups considering their 20 (A), 40 (B) and 60 (C) most co-occurring topics. The year
refers to the year of analysis of each topic.


Table 4 Collaboration pace of the sub-graphs associated with selected debutant topics versus the average
collaboration pace of the control group in the same year of debut (standard collaboration pace).

Topic (year of debut)                      Collaboration pace    Standard collaboration pace
Service discovery (2000)                   0.455                 0.156
Ontology engineering (2000)                0.435                 0.156
Ontology alignment (2005)                  0.386                 0.273
Service-oriented architecture (2003)       0.360                 0.177
Smart power grids (2005)                   0.358                 0.273
Sentiment analysis (2005)                  0.349                 0.273
Semantic web services (2003)               0.349                 0.177
Linked data (2004)                         0.348                 0.250
Semantic web technology (2001)             0.343                 0.147
Vehicular ad hoc networks (2004)           0.342                 0.250
Mobile ad-hoc networks (2001)              0.342                 0.147
p2p network (2002)                         0.340                 0.145
Location based services (2001)             0.331                 0.147
Service oriented computing (2003)          0.331                 0.177
Ambient intelligence (2002)                0.289                 0.145
Social tagging (2006)                      0.263                 0.192
Community detection (2006)                 0.243                 0.192
Cloud computing (2006)                     0.241                 0.192
User-generated content (2006)              0.240                 0.192
Information retrieval technology (2008)    0.231                 0.057
Web 2.0 (2006)                             0.224                 0.192
Ambient assisted living (2006)             0.224                 0.192
Internet of things (2009)                  0.221                 0.116

Table 5 P-values obtained by performing the Student’s t-test over the distributions of both debutant and
control groups considering their 20, 40 and 60 most co-occurring topics. The best result is marked with an asterisk.

Experiment                       p-value          Associated chart
20 most co-occurring topics      4.22·10^−2       Fig. 11A
40 most co-occurring topics      6.84·10^−2       Fig. 11B
60 most co-occurring topics      4.64·10^−45 *    Fig. 11C

understand which of the three tests provides the best discrimination between
the two classes of topics. The results of the t-test suggest that the experiment in which
we discard links with fewer than three co-occurrences provides a better discrimination of
debutant topics from non-debutant ones. This suggests that considering weak connections
is beneficial for discriminating the two groups. The 2004 peak is caused by the debut
of a number of topics associated with particularly strong underlying dynamics, such as
Linked Data, Pairing-based Cryptography, Microgrid and Privacy Preservation.

Table 6 reports as an example the triad census performed over the subgraph associated
with the topic Semantic Web Technologies (SWT) debuting in 2001. We can see an increase


Table 6 The results of the triad census performed on the network associated with the debutant topic ‘‘semantic web technology’’ removing
links associated with less than 3 (left), 10 (centre) and 20 (right) publications.

         Removing links <3                Removing links <10               Removing links <20
Graph    H0      H1      H2      H3       H0      H1      H2     H3        H0     H1     H2     H3
1996     1,124   1,157   658     337      641     676     316    138       796    509    174    618
1997     928     1,237   670     441      1,022   828     315    135       632    432    204    62
1998     1,255   1,353   657     389      585     705     300    181       525    418    145    52
1999     1,307   1,431   861     461      1,222   1,098   413    192       569    497    187    77
2000     913     1,399   1,043   705      1,482   1,361   554    257       842    618    228    83

in the number of triangles (H3) and two-stars (H2), mirroring the increasing density of the
topic network. Again, this phenomenon is more evident when also using weak links (<3).
The percentage growth of full triangles is 109% in the first test and then decreases to
86% (<10) and 36% (<20).

Table 7 shows a selection of debutant topics and their growth indexes compared with
the growth index of the control group in the same year. If we compare this table with
Table 4, we can see that the two methods used in this study reflect the same dynamics.

With the null hypothesis ‘‘The differences in growth index between the debutant topics and
topics in the control group result purely from chance’’, we ran Student’s t-test over the two
distributions of growth indexes, for all three experiments. It yielded p<0.0001 for all the
experiments. More details about the p-values computed for each experiment performed
in this triad-based study can be found in Table 8. Figure 13 shows, as an example, the
distributions associated with the two groups of topics obtained in the first test.

Hence, the results from this second experiment confirm our initial hypothesis too. In
addition, per Table 8, the results from the t-test also suggest that the first experiment,
which ignores the links associated with less than three publications, better discriminates
the two populations.

DISCUSSION
We analysed the topic network with the aim of experimentally confirming our hypothesis
that the emergence of new research areas is anticipated by an increased rate of interaction of
pre-existing topics. We examined the pace of collaboration (via the clique-based method)
and the change in topology (via the triad-based method) in portions of the network related
to debutant topics, showing that it is possible to effectively discriminate areas of the topic
graph associated with the future emergence of new topics. The first experiment showed that
the subgraphs associated with the emergence of a new topic exhibit a significantly higher
pace of collaboration than the control group of subgraphs associated with established
topics. Similarly, the second experiment showed that the graphs associated with a new
topic display a significantly higher increase in their density than the control group. We can
thus confirm that these two aspects can play a key role in the context of defining methods
for detecting embryonic topics.


[Figure 12 plots, panels (A), (B) and (C): density of network (y-axis) per year (x-axis, 2000 to 2010); series Debutant and Non Debutant.]

Figure 12 Average growth index per year of the sub-graphs related to the topics in both debutant and
non-debutant groups considering their 60 most co-occurring topics and filtering out links with fewer
than 3 (A), 10 (B) and 20 (C) publications.


Table 7 Growth indexes of sub-graphs associated with selected debutant topics versus the average
growth index of the control group in the same year of debut (Standard Growth Index).

Topic (year of debut)                      Growth index    Standard growth index
Service discovery (2000)                   290.29          35.97
Ontology engineering (2000)                207.22          35.97
Ontology alignment (2005)                  399.60          186.89
Service-oriented architecture (2003)       628.07          140.17
Smart power grids (2005)                   637.53          186.89
Sentiment analysis (2005)                  354.10          186.89
Semantic web services (2003)               439.85          140.17
Linked data (2004)                         590.81          289.94
Semantic web technology (2001)             465.53          72.71
Vehicular ad hoc networks (2004)           859.44          289.94
Mobile ad-hoc networks (2001)              87.31           72.71
p2p network (2002)                         305.28          18.92
Location based services (2001)             595.90          72.71
Service oriented computing (2003)          422.92          140.17
Ambient intelligence (2002)                308.34          18.92
Social tagging (2006)                      429.77          157.69
Community detection (2006)                 583.21          157.69
Cloud computing (2006)                     695.79          157.69
User-generated content (2006)              485.89          157.69
Information retrieval technology (2008)    552.14          227.02
Web 2.0 (2006)                             387.42          157.69
Ambient assisted living (2006)             940.79          157.69
Internet of things (2009)                  580.33          167.86

Table 8 P-values obtained by performing the Student’s t-test over the distributions of both debutant and
control groups considering their 60 most co-occurring topics and filtering out links with fewer than 3, 10
and 20 publications. The best result is marked with an asterisk.

Experiment                    p-value          Associated chart
Less than 3 publications      6.43·10^−16 *    Fig. 12A
Less than 10 publications     1.69·10^−11      Fig. 12B
Less than 20 publications     3.52·10^−10      Fig. 12C

Interestingly, the ability of the two approaches to discriminate the debutant group from
the control group varies with the time interval considered. It appears that the clique-based
approach (see Fig. 11) discriminates them more effectively in the initial period, whereas
the triad-based method (Fig. 12) seems to perform better in the central years (2004–2007).
We intend to investigate in future work whether these behaviours are associated with specific
characteristics of the network.

The results of these two experiments allow us to effectively discriminate specific sections
of the topic graph and suggest that a significant increase in the rate of collaboration


[Figure 13 plot: percentage of growth index (y-axis, 0 to 0.4) against growth index (x-axis, -200 to 1,400); series Debutant and Non Debutant.]

Figure 13 Distributions of growth indexes for both groups when filtering out links associated with fewer
than three publications.

between existing topics provides a strong indicator for predicting the emergence of new
research areas. Even a simple threshold over the indexes introduced in this study allows
us to discriminate well the subgraphs that will produce new research areas. For example,
Table 9 reports the pace of collaboration obtained for both debutant and non-debutant
topics in 2004. Here we can appreciate that a 0.41 threshold corresponding to a 100%
precision is able to retrieve 8 out of 9 debutant topics. Table 10 displays other cases in
which it is possible to obtain a very good recall when choosing a threshold corresponding
to 100% precision. The application of this technique in a realistic setting would however
require a scalable method for identifying promising topic graphs.
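
The threshold-based discrimination described above can be checked directly against the 2004 values in Table 9. The sketch below treats every topic whose pace of collaboration reaches the threshold as a predicted debutant and then computes precision and recall; the function name `precision_recall` is an illustrative choice.

```python
def precision_recall(scores, threshold):
    """Treat every topic scoring at least `threshold` as a predicted debutant."""
    predicted = [is_debutant for pace, is_debutant in scores if pace >= threshold]
    true_positives = sum(predicted)
    total_debutants = sum(is_debutant for _, is_debutant in scores)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_debutants
    return precision, recall


# (pace of collaboration, is debutant) for the 2004 topics in Table 9.
table_9 = [
    (0.538, True), (0.499, True), (0.488, True), (0.463, True), (0.446, True),
    (0.433, True), (0.426, True), (0.416, True), (0.409, False), (0.401, False),
    (0.357, False), (0.326, False), (0.325, False), (0.313, True), (0.308, False),
    (0.300, False), (0.298, False), (0.250, False),
]
precision, recall = precision_recall(table_9, 0.41)
print(precision, recall)
```

At the 0.41 threshold the eight highest-scoring topics are all debutant, so precision is 8/8 and recall is 8/9, with Zigbee (0.313) being the one debutant topic missed. Lowering the threshold to 0.40 would admit two control topics and reduce precision to 8/10 without improving recall.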

While these results are satisfactory, our analysis presents some limitations, which we
shall address in future work.

In particular, we identified the relevant subgraph during the selection phase simply by
selecting the n most co-occurring topics of the topic under analysis. This solution allows
us to compare graphs of the same dimension; however, it introduces two issues.

First of all, it assumes that all topics derive from the same number of research areas,
which is an obvious simplification. Emerging topics may have a different nature, based
on their origin, development patterns, interactions of pioneer researchers, and so on.
Therefore, each of them will be linked to a different number of established research areas.
A manual analysis of the data suggests that using a constant number of co-occurring topics
is one of the reasons why the overall pace of collaboration and growth index associated
with the emergent topics are not much higher than the ones of the control group. When


Table 9 List of topics, both debutant and non-debutant, with their pace of collaboration analysed in
2004.

Testing topic                  Pace of collaboration    Debutant/Control
Linked data                    0.538                    D
Bilinear pairing               0.499                    D
Wimax                          0.488                    D
Separation logic               0.463                    D
Phishing                       0.446                    D
Micro grid                     0.433                    D
Privacy preservation           0.426                    D
Vehicular ad hoc networks      0.416                    D
Mobile computing               0.409                    C
Electromagnetic dispersion     0.401                    C
Online learning                0.357                    C
Wavelet analysis               0.326                    C
Program interpreters           0.325                    C
Zigbee                         0.313                    D
Natural sciences computing     0.308                    C
Knowledge discovery            0.300                    C
Fuzzy neural networks          0.298                    C
Three term control systems     0.250                    C

Table 10 Precision and recall when choosing particular thresholds for distinguishing the classes of
topics.

Year         2001    2004    2006
Threshold    0.35    0.41    0.23
Recall       8/9     8/9     11/14
Precision    8/8     8/8     11/11

selecting too many co-occurring topics, we may include less significant research areas or,
alternatively, research areas that started to collaborate with the topic in question only after
its emergence. Conversely, when selecting too few topics, the resulting graph may exclude
some important ones.

A second limitation is that the selection phase performed in our study cannot be reused
in a system capable of automatically detecting embryonic topics, since it requires knowledge
of the set of topics with which the embryonic topic will co-occur in the future. However,
this could be fixed by developing techniques that are able to select promising subgraphs
according to their collaboration pace and density. For this purpose we are currently
developing an approach that generates a topic graph in which (i) links are weighted
according to the acceleration in the pace of collaboration between the two relevant topics
and (ii) community detection algorithms are applied to select portions of the network
characterized by an intense collaboration between topics. We expect that this solution will
be able to detect at a very early stage that ‘something’ new is emerging in a certain area of


the topic graph, even if it may not be able to accurately define the topic itself. It would thus
allow relevant stakeholders to react very quickly to developments in the research landscape.

The findings of this analysis also provide contributions of potential value to research in
Philosophy of Science. Firstly, they appear to support our hypothesis about the existence of
an embryonic phase in the lifecycle of research topics. Secondly, they bring new empirical
evidence to fundamental theories in Philosophy of Science, which are concerned with the
evolution of scientific disciplines, e.g., Herrera, Roberts & Gulbahce (2010), Kuhn (2012),
Nowotny, Scott & Gibbons (2013), and Sun et al. (2013). Finally, they highlight that new
topics tend to be born in an environment in which previously less interconnected research
areas start to cross-fertilise and generate new ideas. This suggests that interdisciplinarity
is one of the most significant forces that drives innovation forward, allowing researchers
to integrate a diversity of expertise and perspectives, and yield new solutions and new
scientific visions. Hence the results of our analysis could be used to support policies that
promote interdisciplinary research.

CONCLUSIONS
We hypothesised the existence of an embryonic phase for research topics, where, while
they have not yet been consistently labelled or associated with a considerable number
of publications, they can nonetheless be detected through an analysis of the dynamics
of already existing topics. To confirm this hypothesis, we performed an experiment on
75 debutant topics in Computer Science, which led to the analysis of a topic network
comprising about 2,000 topics extracted from a sample of three million papers in the 2000–
2010 time interval. The results confirm that the creation of novel topics is anticipated by a
significant increase in the pace of collaboration and density of the portions of the network in
which they will appear. These findings provide supporting evidence for the existence of an
embryonic phase for research topics and can be built on to foster further research to develop
new techniques for the detection of topics at this stage. They also bring new empirical
evidence to theories in Philosophy of Science. Finally, they suggest that an interdisciplinary
environment provides a fertile ground for the creation of novel research topics.

We now plan to exploit the dynamics discussed in this study to create a fully automatic
approach for detecting embryonic topics. We also intend to study and integrate additional
dynamics involving other research entities, such as authors and venues. The aim is to
produce a robust approach to be used by researchers and companies alike for gaining a
better understanding of where research is heading.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
PhD studentship for Angelo A. Salatino is funded by Springer. The funders had no role
in study design, data collection and analysis, decision to publish, or preparation of the
manuscript.


Grant Disclosures
The following grant information was disclosed by the authors:
Springer.

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Angelo A. Salatino conceived and designed the experiments, performed the experiments,
analyzed the data, contributed reagents/materials/analysis tools, wrote the paper,
prepared figures and/or tables, performed the computation work, reviewed drafts
of the paper.
• Francesco Osborne conceived and designed the experiments, analyzed the data, wrote
the paper, reviewed drafts of the paper.
• Enrico Motta wrote the paper, reviewed drafts of the paper.

Data Availability
The following information was supplied regarding data availability:

Open Science Framework: https://osf.io/bd8ex/.

REFERENCES
Becher T, Trowler P. 2001. Academic tribes and territories: intellectual enquiry and the
culture of disciplines. New York: McGraw-Hill Education (UK).

Berners-Lee T, Hendler J, Lassila O. 2001. The semantic web. Scientific American
284:28–37.

Blei DM, Lafferty JD. 2006. Correlated topic models. In: Weiss Y, Schölkopf PB, Platt JC,
eds. Advances in neural information processing systems, vol. 18, 147–154.

Blei DM, Ng AY, Jordan MI. 2003. Latent Dirichlet allocation. Journal of Machine
Learning Research 3:993–1022.

Bolelli L, Ertekin Ş, Giles CL. 2009. Topic and trend detection in text collections using
latent Dirichlet allocation. In: Proceedings of the 31th European conference on IR
research on advances in information retrieval, ECIR’09. Berlin, Heidelberg: Springer-Verlag,
776–780.

Boyack KW, Klavans R, Börner K. 2005. Mapping the backbone of science. Scientometrics
64:351–374 DOI 10.1007/s11192-005-0255-6.

Cano Basave AE, Osborne F, Salatino AA. 2016. Ontology forecasting in scientific
literature: semantic concepts prediction based on innovation-adoption priors. In:
Knowledge engineering and knowledge management: 20th international conference,
EKAW 2016, Bologna, Italy, Proceedings. New York: Springer International Publishing,
51–67.

Cataldi M, Di Caro L, Schifanella C. 2010. Emerging topic detection on twitter based
on temporal and social terms evaluation. In: Proceedings of the tenth international
workshop on multimedia data mining. ACM, p 4.

Chang J, Blei DM. 2010. Hierarchical relational models for document networks. The
Annals of Applied Statistics 4(1):124–150.

Chavalarias D, Cointet J-P. 2013. Phylomemetic patterns in science evolution—the rise
and fall of scientific fields. PLOS ONE 8:e54847 DOI 10.1371/journal.pone.0054847.

Couvalis G. 1997. The philosophy of science: science and objectivity. Thousand Oaks: Sage.

Davis JA, Leinhardt S. 1967. The structure of positive interpersonal relations in small
groups. Washington, D.C.: Institute of Education Sciences.

Decker SL, Aleman-Meza B, Cameron D, Arpinar IB. 2007. Detection of bursty and
emerging trends towards identification of researchers at the early stage of trends.
Athens: University of Georgia.

Duvvuru A, Kamarthi S, Sultornsanee S. 2012. Undercovering research trends: network
analysis of keywords in scholarly articles. In: Computer Science and Software Engineering
(JCSSE), 2012 International Joint Conference on. 265–270.

Duvvuru A, Radhakrishnan S, More D, Kamarthi S, Sultornsanee S. 2013. Analyzing
structural & temporal characteristics of keyword system in academic research
articles. Procedia Computer Science 20:439–445 DOI 10.1016/j.procs.2013.09.300.

Erten C, Harding PJ, Kobourov SG, Wampler K, Yee G. 2004. Exploring the computing
literature using temporal graph visualization. Electronic Imaging 2004:45–56.

Faust K. 2010. A puzzle concerning triads in social networks: graph constraints and the
triad census. Social Networks 32:221–233 DOI 10.1016/j.socnet.2010.03.004.

Fusch PI, Ness LR. 2015. Are we there yet? Data saturation in qualitative research. The
Qualitative Report 20(9):1408–1416.

Griffiths TL, Jordan MI, Tenenbaum JB, Blei DM. 2004. Hierarchical topic models
and the nested Chinese restaurant process. In: Thrun S, Saul LK, Schölkopf PB, eds.
Advances in neural information processing systems, vol. 16. Cambridge: MIT Press,
17–24.

Gruhl D, Guha R, Liben-Nowell D, Tomkins A. 2004. Information diffusion through
blogspace. In: Proceedings of the 13th international conference on World Wide Web.
491–501.

He Q, Chen B, Pei J, Qiu B, Mitra P, Giles L. 2009. Detecting topic evolution in scientific
literature: how can citations help? In: Proceedings of the 18th ACM conference on
Information and knowledge management. New York: ACM, 957–966.

Herrera M, Roberts DC, Gulbahce N. 2010. Mapping the evolution of scientific fields.
PLOS ONE 5:e10355 DOI 10.1371/journal.pone.0010355.

Holland PW, Leinhardt S. 1976. Local structure in social networks. Sociological Methodology
7:1–45 DOI 10.2307/270703.

Jo Y, Lagoze C, Giles CL. 2007. Detecting research topics via the correlation between
graphs and texts. In: Proceedings of the 13th ACM SIGKDD international conference
on Knowledge discovery and data mining. New York: ACM, 370–379.

Kamaliha E, Riahi F, Qazvinian V, Adibi J. 2008. Characterizing network motifs to
identify spam comments. In: 2008 IEEE international conference on data mining
workshops. Piscataway: IEEE, 919–928.

Kuhn TS. 2012. The structure of scientific revolutions. Chicago: University of Chicago
Press.

Larsen PO, Von Ins M. 2010. The rate of growth in scientific publication and the
decline in coverage provided by Science Citation Index. Scientometrics 84:575–603
DOI 10.1007/s11192-010-0202-z.

Leydesdorff L. 2007. Betweenness centrality as an indicator of the interdisciplinarity
of scientific journals. Journal of the American Society for Information Science and
Technology 58:1303–1319 DOI 10.1002/asi.20614.

Luce RD, Perry AD. 1949. A method of matrix analysis of group structure. Psychometrika
14:95–116 DOI 10.1007/BF02289146.

Lv PH, Wang G-F, Wan Y, Liu J, Liu Q, Ma F-C. 2011. Bibliometric trend analysis on
global graphene research. Scientometrics 88:399–419
DOI 10.1007/s11192-011-0386-x.

Mathioudakis M, Koudas N. 2010. Twittermonitor: trend detection over the twitter
stream. In: Proceedings of the 2010 ACM SIGMOD international conference on
management of data. New York: ACM, 1155–1158.

Morinaga S, Yamanishi K. 2004. Tracking dynamics of topic trends using a finite
mixture model. In: Proceedings of the tenth ACM SIGKDD international conference
on Knowledge discovery and data mining. New York: ACM, 811–816.

Newman ME. 2001. The structure of scientific collaboration networks. Proceedings
of the National Academy of Sciences of the United States of America 98:404–409
DOI 10.1073/pnas.98.2.404.

Nowotny H, Scott PB, Gibbons MT. 2013. Re-thinking science: knowledge and the public
in an age of uncertainty. Hoboken: John Wiley & Sons.

O’Callaghan D, Harrigan M, Carthy J, Cunningham P. 2012. Identifying discriminating
network motifs in YouTube spam. ArXiv preprint. arXiv:1202.5216.

Oka M, Abe H, Kato K. 2006. Extracting topics from weblogs through frequency segments.
In: Proceedings of WWW 2006 annual workshop on the weblogging ecosystem:
aggregation, analysis, and dynamics.

Osborne F, Motta E. 2012. Mining semantic relations between research areas. In: Cudré-Mauroux
P, Heflin J, Sirin E, Tudorache T, Euzenat J, Hauswirth M, Xavier Parreira
J, Hendler J, Schreiber G, Bernstein A, Blomqvist E, eds. The Semantic Web—ISWC
2012. ISWC 2012. Lecture notes in computer science, vol. 7649. Berlin, Heidelberg:
Springer.

Osborne F, Motta E. 2015. Klink-2: integrating multiple web sources to generate
semantic topic networks. In: Arenas M, Corcho O, Simperl E, Strohmaier M,
d’Aquin M, Srinivas K, Groth P, Dumontier M, Heflin J, Thirunarayan K, Staab S,
eds. The Semantic Web—ISWC 2015. Lecture notes in computer science, vol. 9366.
Cham: Springer.

Osborne F, Motta E, Mulholland P. 2013. Exploring scholarly data with rexplore. In:
The Semantic Web—ISWC 2013. Berlin, Heidelberg: Springer.

Osborne F, Scavo G, Motta E. 2014. A hybrid semantic approach to building dynamic
maps of research communities. In: Janowicz K, Schlobach S, Lambrix P, Hyvönen E,
eds. Knowledge engineering and knowledge management. EKAW 2014. Lecture notes in
computer science, vol. 8876. Berlin, Heidelberg: Springer.

Pham MC, Klamma R, Jarke M. 2011. Development of computer science disciplines: a
social network analysis approach. Social Network Analysis and Mining 1:321–340
DOI 10.1007/s13278-011-0024-x.

Pržulj N. 2007. Biological network comparison using graphlet degree distribution.
Bioinformatics 23:e177–e183 DOI 10.1093/bioinformatics/btl301.

Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P. 2004. The author-topic model for
authors and documents. In: Proceedings of the 20th conference on Uncertainty in
artificial intelligence. Arlington: AUAI Press, 487–494.

Salatino A. 2015. Early detection and forecasting of research trends. In: ISWC-DC 2015
The ISWC 2015 doctoral consortium. Available at http://ceur-ws.org/Vol-1491/paper_5.pdf.

Salatino AA, Motta E. 2016. Detection of embryonic research topics by analysing
semantic topic networks. In: 2016 Workshop on semantics, analytics, visualisation:
enhancing scholarly data (SAVE-SD 2016). Cham: Springer.

Sun X, Ding K, Lin Y. 2016. Mapping the evolution of scientific fields based on cross-
field authors. Journal of Informetrics 10:750–761 DOI 10.1016/j.joi.2016.04.016.

Sun X, Kaur J, Milojević S, Flammini A, Menczer F. 2013. Social dynamics of science.
Scientific Reports 3:1069 DOI 10.1038/srep01069.

Tseng Y-H, Lin Y-I, Lee Y-Y, Hung W-C, Lee C-H. 2009. A comparison of methods for
detecting hot topics. Scientometrics 81:73–90 DOI 10.1007/s11192-009-1885-x.

Ugander J, Backstrom L, Kleinberg J. 2013. Subgraph frequencies: mapping the
empirical and extremal geography of large graph collections. In: Proceedings of the
22nd international conference on World Wide Web. International World Wide Web
Conferences Steering Committee, 1307–1318.

Wu Y, Venkatramanan S, Chiu DM. 2016. Research collaboration and topic trends
in Computer Science based on top active authors. PeerJ Computer Science 2:e41
DOI 10.7717/peerj-cs.41.
