Semantic Enrichment for Recommendation of Primary Studies in a Systematic Literature Review

Giuseppe Rizzo1*, Federico Tomassetti2, Antonio Vetrò3, Luca Ardito2, Marco Torchiano2, Maurizio Morisio2, Raphaël Troncy1

1 EURECOM, Sophia Antipolis, France — [giuseppe.rizzo, raphael.troncy]@eurecom.fr
2 Politecnico di Torino, Turin, Italy — [federico.tomassetti, luca.ardito, marco.torchiano, maurizio.morisio]@polito.it
3 Technische Universität München, Germany — vetro@in.tum.de

* Corresponding author.

Published in: Digital Scholarship in the Humanities, 32(1), 2017, pp. 195–208. ISSN 2055-7671. DOI: 10.1093/llc/fqv031. Oxford University Press.

Abstract. A Systematic Literature Review (SLR) identifies, evaluates, and synthesizes the literature available on a given topic. This generally requires a significant human workload and involves a subjectivity bias that can affect the results of the review. Automated document classification can be a valuable tool for recommending the selection of studies. In this paper, we propose an automated pre-selection approach based on text mining and semantic enrichment techniques. Each document is first processed by a named entity extractor. The DBpedia URIs coming from the entity linking process are used as external sources of information: our system collects the bag of words of those sources and adds them to the initial document. A Multinomial Naive Bayes classifier then discriminates whether the enriched document belongs to the positive example set or not. We used an existing manually performed SLR as benchmark dataset, trained our system with different configurations of relevant documents, and tested the goodness of our approach with an empirical assessment. Results show an 18% reduction of the manual workload that a human researcher has to spend, while holding a remarkable 95% recall, an important condition given the very nature of SLRs. We also measured the effect of the enrichment process on the precision of the classifier and observed a gain of up to 5%.

1 Introduction

A Systematic Literature Review (SLR) is a research methodology used to identify, analyze, and interpret all available evidence related to a specific research question in a way that is unbiased and (to a degree) repeatable (Kitchenham 2007). A SLR has to be performed according to a pre-defined protocol describing how primary studies[4] are selected and categorized, reducing subjectivity bias as much as possible. The protocol changes depending on the research field where it is applied.
In this paper, we focus on a SLR applied to the field of Software Engineering, where the protocol can be summarized by the following steps (Kitchenham 2004): (i) identification of research, (ii) selection of primary studies, (iii) study quality assessment, (iv) data extraction and monitoring progress, (v) data synthesis. The first step defines the search space, i.e. the set of documents in which researchers select papers; a small sample set of relevant documents is used to define it. The second step identifies and analyzes, among the papers contained in the search space, all possibly useful studies that can help to answer the research questions. In the third step, an assessment of the quality of the collected studies is performed, while in the fourth step the data extraction forms are delivered according to the review under evaluation. The last step delivers the data synthesis methods. Although these steps seem to be sequential, it is worth considering them as iterative and, therefore, the outputs may evolve together with the evolving topics. The entire process is supervised and guided by researchers who summarize all existing information about some phenomena in a thorough and, potentially, unbiased manner. The final goal is to draw more general conclusions about some phenomena derived from individual studies, or to serve as a prelude to further research activities.

A SLR is of crucial importance in all research fields, but it is extremely time-consuming, requiring a substantial human workload which is costly and error prone. Even though full automation of a SLR is not possible, due to the need for human reasoning in the aggregation and interpretation of scientific results, we believe that tool support in the selection of the primary studies can reduce the human workload necessary in that phase without losing knowledge (a particularly important condition given the very nature of SLRs). Therefore, the objective of this paper is to reduce the human workload in a SLR by semi-automating the selection of primary studies (i.e. the second step of the SLR process). The benefit depends on the dimensions of the search space: the larger the search space, the more effective our proposed approach will be. Our method relies on a filter strategy, resorting to semantic enrichment and text mining techniques to reduce the number of papers that researchers performing a SLR should read. We use a text classifier to filter potentially interesting documents within the search space. The classifier produces a reduced set which contains a higher percentage of interesting documents than the initial set. Afterwards, this reduced set is manually examined by the researchers. In this way, we reduce the workload required of the researchers, limiting the human error rate. This phenomenon usually occurs when a set is sparse and searching through it requires more effort than in a clean set, where the noise is smaller.

[4] A primary study is (in the context of evidence) an empirical study investigating a specific research question (Kitchenham 2007).

To this end, we address the following research questions:

RQ1 Does the automatic selection process based on the Multinomial Naive Bayes classifier and semantic enrichment (enriched process) reduce the amount of manual work of a SLR with respect to the original process?
RQ2 Does the automatic selection process based on the Multinomial Naive Bayes classifier and semantic enrichment (enriched process) reduce the amount of manual work with respect to the alternative version of the process with only the Multinomial Naive Bayes classifier (non-enriched process)?

In other words, we aim to validate the idea behind the use of enriched papers as test samples instead of the original papers. The approach presented in this paper is based on previous work (Tomassetti et al. 2011), with the following improvements. While previously the automatic classification was meant to fully automate the entire selection step, in this paper we propose a semi-supervised approach: papers selected by the automatic classifier can be immediately discarded by a human researcher just by looking at the title and the abstract, and do not necessarily need to be read in full. In addition, we perform an evaluation on a much larger dataset, extending the benchmark dataset size from the previous 111 papers to the current 2215 papers (almost 20 times larger). Finally, we present an exhaustive task-based evaluation.

The remainder of this paper is organized as follows. Section 2 compares our approach with the state of the art in the SLR domain. Section 3 details the step of selecting primary studies and Section 4 presents our approach to improve this step. Section 5 describes the use case we use to validate our approach. In Section 6, we report and discuss the results we obtained. Finally, we give our conclusions and outline future work in Section 7.

2 Related Work

Automatic text classification applied to a systematic review is more challenging than the typical classification task. This is basically due to the dynamic nature of a SLR, a supervised and iterative process whose initial scope often evolves during the review. Numerous research efforts have been spent to reduce the human workload when a SLR is performed. We focus on two different types of studies: i) machine learning based, and ii) ontology based.

Cohen et al. proposed a first attempt to reduce the human workload in the SLR field (Cohen et al. 2006). They used automatic classification to discard non-interesting papers in fifteen different medical systematic literature reviews, each one considering the validity of a particular drug. Their classification model uses a reduced set of features gathered from the paper, such as author name, journal name, journal references, abstract, introduction, and conclusion. The classification model is built using negative examples as well as positive examples, where negative examples are selected from the pool of papers which do not adhere to the chosen SLR. This model is then used to create a perceptron-modified vector for each feature in the feature set. Negative examples bias the model; in order to limit this phenomenon, they introduced a perceptron learning adjustment evaluating the false negatives and false positives and monitoring them according to the False Negative Linear Rate (FNLR). A test article is classified by taking the scalar product of the document feature vector with the perceptron vector and comparing the output values. At a recall of 95%, the reduction of workload ranges from 0% to 68% depending on the SLR under evaluation.
Similarly to Cohen et al.'s work, in our approach we evaluate the reduction of human workload while holding a recall of 95% for the classifier. The experiment we conduct is inspired by theirs, but we differ in terms of feature selection and classifier: we use a bag-of-words model enriched with further descriptions available in an external knowledge base, and we use a Multinomial Naive Bayes classifier. The human workload and the precision we achieve are of a comparable order of magnitude with the ones observed by Cohen et al. (above the average) on fifteen medical literature reviews. However, due to the difference of the SLR domains (medical for Cohen et al., Software Engineering in this paper), we cannot exhaustively compare the two approaches. Among their findings, Cohen et al. suggested that automatic classification may be useful to regularly monitor new relevant journal issues in order to identify interesting primary studies, easing the task of keeping a SLR constantly updated. According to this result, it is crucial to consider the classification problem in the SLR field as a semi-supervised approach in which a human being supervises the inclusion or exclusion of the possibly relevant studies selected by the classifier.

Another attempt to reduce the human workload in selecting relevant primary studies was performed by (Matwin et al. 2010). They proposed an approach mainly based on the Naive Bayes classifier with some optimizations based on the Complement Naive Bayes (CNB) (Rennie et al. 2003). The results they achieved outperform those detailed in (Cohen et al. 2006), but with a different parameter configuration (they consider only title and abstract for each document instead of the large set of features considered by Cohen).

Leveraging Natural Language Processing (NLP) techniques, Cohen tackled the problem of paper handling once the review starts (Cohen 2008). This is practically done to allow the reviewer to first analyze the documents which are labelled as potentially relevant, leaving the evaluation of the remaining ones to the end. They combined unigrams and Medical Subject Headings (MeSH) to create the histogram of documents which potentially fit the scope of the review.

In (Ruttenberg et al. 2009), the authors proposed a hybrid approach for automating scientific literature search by means of data aggregation and text mining algorithms that ease the search process. The key point of their work was to find a way to represent and share, by means of an ontology, the knowledge learned by human beings reading relevant papers. Through it, it was possible to combine the outcomes of each single document and to represent them in a graph, which is mapped to the ontology. The first step of this process consists of identifying the key phrases of the document (outcomes). Then, key phrases are used to link different concepts in the graph. Following this process, concepts are linked together, obtaining a chain of relationships. This work is usually done by human beings who are experts of the domain. Ideally, they should be objective, but the authors assessed that the graph mapping is strongly affected by the experts' subjectivity. They therefore proposed a mechanism based on text mining algorithms to navigate and cluster inferences.
This work represents the first attempt to introduce the concept of knowledge representation in a SLR and, among the findings, the authors stated that a pre-clustering and linking of documents limits the human subjectivity, improving the overall result.

3 Selection of primary studies

In this section, we detail the selection step of the SLR process, analyzing its strengths and weaknesses according to the guidelines described in (Kitchenham 2004). This step takes as input the set of primary studies W, gathered from a collection assumed to be the universe of all scientific papers in the domain of interest of the review. W results from the first step of the process and is obtained as the output of the search performed by human beings using keywords on dedicated sources. For instance, W could be composed of all papers published by a given set of journals, or of all papers that a digital library returned as the result of a keyword search. The selection of primary studies is divided into two sub-steps: the former operates a selection based on reading titles and abstracts (first selection), the latter is the decision based on the full-text human analysis (second selection). Both steps are basically driven by the following choice criterion: does the study fit the research field? We define C (candidate studies) as the set of studies that successfully passed the first selection and are eligible to be processed by researchers in the second selection step. The second selection has the goal of splitting C into I (included studies) and E (excluded studies), where those sets are:

– I is the set of studies ∈ C which successfully passed the second manual selection and will contribute to the systematic review. The following relation holds: I ⊆ C.
– E is the set of studies ∈ C which did not pass the second manual selection and will not contribute to the systematic review and synthesis. Hence, E ⊆ C and E ∩ I = ∅.

Figure 1 illustrates the selection of primary studies step. As introduced in the previous section, the selection of primary studies is performed by human beings who usually apply selection criteria. However, the application of those criteria can rarely be completely objective, and it is frequently affected by the subjective opinions of the involved researchers. A semi-supervised approach aims to reduce this potential bias.

Fig. 1. Selection of primary studies in a Systematic Literature Review

4 Approach

The proposed approach relies on text mining techniques and semantic enrichment to reduce the set of candidate papers a researcher has to evaluate. The approach consists of a semi-supervised iterative process built on top of the following assumptions: W ≠ ∅ (as a result of the applied search strategy) and I ≠ ∅ at the beginning (the set of relevant documents already known is not empty when the systematic review starts). The output of this approach is the set of most interesting papers W′ gathered from a larger set of unread papers W.

4.1 I0 construction

The initial set of sources contained in I is named I0 and is composed of primary studies already classified as relevant for the review: this is the first step of our process and it is needed to start the iterative part of the algorithm. I0 can be built in two different ways. The first way is to ask researchers to use their previous knowledge, indicating the most well-known and fundamental papers in the field of interest. This strategy considers that, often, systematic reviews are undertaken by experts in the field.
The second way is to explore a portion of the search space using the basic process, e.g. searching on digital libraries or selecting the issues of (a) given journal(s). This portion is marked as I0 and the enriched process is used to explore the remaining search space.

4.2 Model building

The second step of our approach consists of automatically computing a model M from I0. The idea is to build a bag-of-words (BoW) model starting from the primary studies in I0. For each study, we consider the words from the abstract and the introduction. According to (Cohen et al. 2006), words which appear at the beginning and at the end of a document (such as in the title, abstract, introduction, and conclusion) are more significant. We empirically assessed that using a reduced set of words, coming only from the abstract and introduction, provides the same results as considering the extended set of words (i.e. words coming from the title, abstract, introduction, and conclusion). The explanation is that the semantic enrichment stage (cf. Section 4.3) compensates for the reduced cardinality of the BoW by linking external sources and gathering textual data from them. Finally, we perform stop-word elimination and stemming, using the Porter algorithm (Porter 1980). The model built this way is used to train a Multinomial Naive Bayes classifier which computes the weight of each word according to the TF-IDF normalized approach (Kibriya et al. 2005).

4.3 Semantic enrichment

We define wi as the document composed of the BoW collected from the abstract and the introduction of one paper wi ∈ W. Each wi is processed to get a bag of named entities N which characterizes wi. A named entity is the name of a person or an organization, a location, a brand, a product, or a numeric expression (including time, date, money, and percent) found in a sentence (Grishman & Sundheim 1996). Basically, it is an information unit described by a set of classes (e.g. person, location, organization) which may be further disambiguated by an entry in a knowledge base such as DBpedia or Freebase. In this work we disambiguate entities to DBpedia (Bizer et al. 2009), with the rationale of linking them to external knowledge base entries; we then fetch the abstract description of those entries and join the existing textual content with the retrieved textual data. The encyclopedic nature of this dataset is appropriate to enrich the content of each wi. Once we have extracted the bag of named entities N, we link each ni ∈ N to the corresponding DBpedia resource (when available). The extraction of named entities is performed using OpenCalais[5]. OpenCalais provides a classification for each named entity and suggests a URI of an external source where the information is disambiguated. Relying on it, we point to a DBpedia resource defined by the owl:sameAs property. Since not all the instances in the OpenCalais knowledge base have the owl:sameAs property, to minimize the loss, we use a fallback logic that looks up entries in DBpedia matching the labels of the extracted entities (e.g. an occurrence of Systematic Literature Review is mapped to http://dbpedia.org/resource/Systematic_review). Once the resource is found, we collect all words contained in the description field (the dbpedia-owl:abstract property). The abstract property is one of the descriptive properties whose usage is consistent across the entire DBpedia dataset.

[5] http://www.opencalais.com
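To make this lookup concrete, the following minimal sketch (our illustration, not the original tool's code, which goes through OpenCalais and its owl:sameAs links) shows how the fallback label lookup and the retrieval of the dbpedia-owl:abstract text could be issued against the public DBpedia SPARQL endpoint using the SPARQLWrapper library; the endpoint URL and the exact-label matching are assumptions that simplify the interlinking logic described above.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def fetch_dbpedia_abstract(label, lang="en"):
        # Fallback interlinking: find a DBpedia resource whose rdfs:label
        # matches the extracted entity label, and return the text of its
        # dbpedia-owl:abstract property (None if no match is found).
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX dbo:  <http://dbpedia.org/ontology/>
            SELECT ?abstract WHERE {
                ?resource rdfs:label "%s"@%s ;
                          dbo:abstract ?abstract .
                FILTER (lang(?abstract) = "%s")
            } LIMIT 1""" % (label, lang, lang))
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        return bindings[0]["abstract"]["value"] if bindings else None

    # e.g. an entity labelled "Systematic review" extracted from a paper:
    extra_text = fetch_dbpedia_abstract("Systematic review")

The words of such retrieved abstracts are what the enrichment step appends to the paper's bag of words, as described next.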
After collecting these descriptions, we add them to the bag of words natively extracted from the document wi. We call this the enrichment process; the resulting document is defined as wi+, and with BoW+ we refer to the bag of words extracted from wi+. Finally, it is compared with the trained model M using the Naive Bayes classifier described below.

4.4 Classification

We used a Multinomial Naive Bayes (MNB) classifier and we implemented the TF-IDF weight normalization. The choice of the Multinomial Naive Bayes classifier was based on two criteria: (1) the characteristics of the specific data and classification problem, and (2) the focus of the approach:

1. A first characteristic of this use case is the small training set, which is a peculiarity of the problem under study (i.e. the common situation is that the set of available papers is not large at the beginning of a literature search). Usually, a specific configuration of the classification algorithm parameters can improve the performance of a classifier (Forman & Cohen 2004). However, this is not a task that we expect from a normal user, given that we address a very transversal and general problem; moreover, Naive Bayes models are more robust to shifts in the training distribution (Elkan 2001). Another characteristic is the data heterogeneity: every word is interpreted as a feature, leading to the well-known problem of sparsity (which produces the so-called curse of dimensionality). Common text classifiers such as Support Vector Machines (SVMs), which are more often used for text classification purposes (Murphy 2012), particularly suffer from it, with consequent overfitting issues (Cawley & Talbot 2010). In such fuzzy contexts, Naive Bayes (NB) approaches corrected with TF-IDF are competitive (Rennie et al. 2003). We therefore opted for the MNB setting, since it is proven to yield the best results compared with other NB variants in such a context (Kibriya et al. 2005). Finally, SLRs produce highly imbalanced datasets: as a matter of fact, in our case study only 50 articles out of 2215 are interesting (cf. Section 5.1). Typical solutions to this type of problem are resampling techniques or hybrid algorithms (Chawla et al. 2004, Chawla 2005). While the first type of solution is not applicable to the case of systematic literature reviews, the second one carries the risk of an overly specific implementation, which is not in the focus of our study.

2. The classification task in our case is subordinate to the enrichment process. For this reason our focus is to show that even with a very simple classifier, such as the MNB, the enrichment process is worthwhile: in fact, we show that using BoW+ produces better results than using the original BoW in terms of saved manual work (from a 15% to an 18% reduction), while preserving a recall above 95%, which is a very high value for any type of classification.

We use the classifier to compare wi+ with the model M and we determine whether the conditional probability that wi+ belongs to I is significant or not. This allows us to preserve the context of the initial documents from which the entities are extracted, hence enabling the classifier to decide according to the entire bag of words rather than only the extracted named entities. We assume that all papers which do not belong to I belong to E, adopting Boolean algebra. The comparison is done for each wi+ ∈ W: papers with P[wi+ ∈ I] ≥ threshold are moved to W′ and they are manually analyzed by researchers.
Finally, all the papers whose P[wi+ ∈ I] < threshold remain in W.

4.5 Iteration

The papers with P[wi+ ∈ I] ≥ threshold are moved to W′ to be manually processed, whilst the remaining ones stay in W. It is likely that some of the papers moved to W′ will pass the manual selection and go to I, while the others will go to E. When I is modified, M becomes obsolete and it is necessary to re-build the model and repeat the classification step for all papers wi+ ∈ W. Again, if P[wi+ ∈ I] ≥ threshold, wi+ is moved to W′ to be manually analyzed. If no wi+ goes to W′, i.e. W′ = ∅ after a classification, the iteration stops. Papers that remain in W after the last iteration are finally discarded and not considered by researchers. The exclusion of these papers represents the reduction in workload for the human researchers. At each iteration, the model is progressively tailored to the domain of interest, refining the selection of primary studies.

Algorithm 1 Enriched selection process algorithm

    Define I0
    Init I with I0
    repeat
        /* automatic recommendation of primary studies */
        Train classifier with I
        Extract model M
        for all wi in W do
            Enrich wi obtaining wi+
            Compare wi+ with model M:
            if P[wi+ in I] >= threshold then move wi to W′ end if
        end for
        /* first selection */
        for all w′i in W′ do
            Manually read title and abstract
            (w′i in I)? move w′i to C : discard w′i
        end for
        /* second selection */
        for all ci in C do
            Manually read full paper
            (ci in I)? move ci to I : move ci to E
        end for
    until C = ∅ (no new candidates emerged)
    Discard all wi remaining in W

We provide in Algorithm 1 the synopsis of the whole study selection process proposed in this paper, and in Figure 2 its complementary graphical representation. Comparing this picture with Figure 1, which represents the selection process provided by the guidelines (Kitchenham 2004), we observe that the original process is not changed, but we have added a recommendation step which, at each iteration, suggests the primary studies most similar to the model. We also report in Figure 2 the steps of the new process described in subsections 4.1 to 4.4: the use of a bag-of-words model (b) derived from I0 or I (a), the enrichment of papers through semantic enrichment (c), and the comparison of the model M with the studies through a Multinomial Naive Bayes classifier (d).

5 Experimental Settings

The proposed approach has been implemented in the Semantic Systematic Review tool, which is publicly available at https://github.com/ftomassetti/semreview.[6] The tool allows loading an already performed SLR for which both the set of interesting papers and the set of non-interesting ones are known. This makes it possible to run experiments assessing the effectiveness of our approach. The tool creates the initial set of relevant papers I0 (papers which belong to the I set) by randomly selecting a subset of the interesting papers defined by the SLR. In doing so, the tool simulates the operation performed by human researchers at the beginning of the SLR. The other interesting papers, together with the non-interesting ones, end up in W. This set is used for assessing the performance of the approach. From I0, the tool extracts the corresponding BoW and initializes the model M. Then, for all the papers in W, the tool automatically performs the recommendation of the primary studies (the second step in the SLR process), implementing the approach described in Section 4.

[6] The version released is a research prototype. It does not include some of the additional scripts used to run the experiments.
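As an illustration of this recommendation step, the following minimal sketch (our simplification, not the tool's actual implementation) fits a multinomial bag-of-words model on the TF-IDF weighted enriched papers in I and moves to W′ every unread paper whose score meets the threshold. Since the process relies on positive examples only, the posterior P[wi+ ∈ I] is approximated here by a length-normalized likelihood under the positive model; stop-word removal is included, while Porter stemming is omitted for brevity.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def recommend(relevant_docs, unread_docs, threshold):
        # TF-IDF weighted bag of words of the (enriched) relevant papers I,
        # turned into multinomial word probabilities with Laplace smoothing,
        # in the spirit of Multinomial Naive Bayes.
        vectorizer = TfidfVectorizer(stop_words="english")
        x_pos = vectorizer.fit_transform(relevant_docs)
        counts = np.asarray(x_pos.sum(axis=0)).ravel() + 1.0
        log_theta = np.log(counts / counts.sum())

        # Score each unread paper by its log-likelihood under the model,
        # normalized by the document weight so long papers are not penalized.
        x_unread = vectorizer.transform(unread_docs)
        loglik = x_unread @ log_theta
        weights = np.asarray(x_unread.sum(axis=1)).ravel()
        scores = np.where(weights > 0,
                          np.exp(loglik / np.maximum(weights, 1e-9)),
                          0.0)  # papers sharing no vocabulary score 0

        # Papers meeting the threshold go to W' for manual review.
        return [doc for doc, s in zip(unread_docs, scores) if s >= threshold]

Each iteration of Algorithm 1 would then re-fit this model after I has grown.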
Finally, the tool reports the performance of the approach using as ground truth the SLR taken as reference. The performance is measured as the amount of saved manual work. The baseline in the experiment is given by the semi-supervised automatic approach without the semantic enrichment mechanism.

Fig. 2. The enriched study selection process and its principal steps: model extraction (b) after I is built (a), enrichment of papers through semantic enrichment (c), and comparison with the model through a Multinomial Naive Bayes classifier (d).

5.1 Benchmark dataset

As a case study we selected a SLR on Software Cost Estimation by (Jorgensen & Shepperd 2007) and we limit the ground truth to all the papers mentioned in the SLR coming from the IEEE Transactions on Software Engineering (IEEE TSE) journal. They cover a timeframe ranging from 1977 to April 2004. We had to exclude the first volume of IEEE TSE because it is not accessible from the IEEEXplore portal[7]. The resulting set contains 2215 candidates, all of them evaluated by the SLR taken as reference. The original SLR contains 51 interesting papers. However, only 50 of them are actually present in the set of candidates available from IEEEXplore, the missing one having been published in the first volume of IEEE TSE. Our benchmark dataset is therefore composed of 2215 papers, 50 of which belong to the I set. The others are considered non-interesting papers, i.e. they do not pass the selection criteria defined at the beginning of the performed study and they belong to the E set.

[7] http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=32

5.2 Variable selection

The main outcome under measurement is the manual work, consisting of reading primary studies, either entirely or only title and abstract, to select the ones relevant to the subject of the SLR. We measure the manual work as the number of papers read, assuming this number to be a proxy for the actual time that would be spent reading the articles. The minimum manual work ideally required is the total number of interesting papers; however, this minimum can reasonably never be reached in a SLR. Indeed, the relation I ⊂ W holds, where I is the set of relevant papers and W is the containing set of papers defined by the search criterion. This choice is motivated by the fact that the SLR selected as subject of the case study reports neither the time spent for paper selection nor which papers were read entirely and which only partially (title and abstract). As a consequence, we define the following two metrics:

mw is the manual work. More specifically, mwO is the manual work performed in the original SLR, i.e. manually selecting and reading all papers; mwNE is the manual work obtained applying the selection based on the Multinomial Naive Bayes classifier using original papers (non-enriched process); mwE is the manual work obtained applying the selection based on the Multinomial Naive Bayes classifier using enriched papers (enriched process).

t is the applied task. Three levels are possible: manual, non-enriched, enriched.

5.3 Hypothesis formulation

The last step of the design is the hypothesis formulation. We formulate a pair of null and alternative hypotheses for each of the two research questions. The goal of the experiment is to reject the null hypothesis H0, monitoring the p-value (Hubbard & Lindsay 2008).
In other words, we discard the null hypothesis H0 and accept the alternative one HA if the p-value is lower than 0.001; that is, when choosing the alternative hypothesis HA, the probability of committing an error is lower than 0.001.

1. H1_0: mwO ≤ mwE, recall = 0.95
   H1_A: mwO > mwE, recall = 0.95

2. H2_0: mwNE ≤ mwE, recall = 0.95
   H2_A: mwNE > mwE, recall = 0.95

5.4 Parameter configuration

We assessed the validity of our process with different sizes of I0, ranging between 1 and 5. In order to limit the bias introduced by a particular configuration of selected papers, we built 30 different I0 sets per dimension, choosing them randomly among the 50 relevant papers. We used each generated I0 to kick off the two variants of the process: enriched and non-enriched. Moreover, we replicated the experiment varying the classification threshold between 0 and 1 with steps of 0.01. The classifier threshold represents the posterior probability for a sample to belong to I (the interesting set). Overall, we executed the complete algorithm 30,300 times = 5 (number of I0 sizes) × 30 (number of I0 sets for each size) × 2 (variants of the algorithm) × 101 (thresholds).

A preliminary step consisted of defining the best classifier threshold T which maximizes the recall for the two variants. Following (Cohen et al. 2006), we decided to aim at a recall of 95%. Although this recall value is a strong constraint, we adopted it to limit as much as possible the elimination of interesting papers. In Table 1, we report the distribution of the maximum classifier threshold which permits obtaining the target recall with the different I0 sets. We chose the maximum threshold because it is the one which minimizes the workload while still satisfying the requirement of a recall equal to or greater than 95%. We selected the median values to set the classifier, i.e. 0.22 for the enriched process and 0.17 for the non-enriched one.

                   Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
    non-enriched   0.11700  0.1700    0.1700   0.1729   0.1775    0.1900
    enriched       0.2100   0.2100    0.2200   0.2201   0.2200    0.2600

Table 1. Analysis of the best classifier threshold for both the enriched and non-enriched process across the different I0 sets. The first and last columns show the minimum and maximum values, the second and fifth columns the first and third quartiles of the distribution, and the middle columns the median and the mean.

5.5 Analysis methodology

The goal of the data analysis is to apply proper statistical tests to reject the null hypotheses we formulated. Since the values are not normally distributed (according to the Shapiro test), we adopt a non-parametric test; in particular, we select the Mann-Whitney test (Hollander & Wolfe 1973), which compares the medians of the vectors of mw. To do so, we considered all papers extracted from the dataset except those used to build I0.

6 Results and Discussion

Figure 3 shows the distributions for the different settings of I0 according to the two types of recommendation approaches proposed: enriched process and non-enriched process. On the y-axis, the workload needed from a human being after each process (enriched, E, and non-enriched, NE) is reported. On the x-axis, we indicate the number of papers used for the I0 training set and the process used (e.g.
1.E means an I0 composed of 1 paper with the process performed using the enrichment mechanism). We observe a reduction of the workload with both approaches. Comparing the semantic enrichment with the baseline, we observe a greater reduction of the workload: the gain ranges from 2.5% to 5% for all I0 settings, except for the I0 composed of 1 paper (1.E in Figure 3), where the gain is lower than 1% with respect to the non-enriched process (1.NE in Figure 3).

Fig. 3. Number of papers to read for different I0 sizes and tasks applied: E (with enrichment) and NE (without).

We present below the results according to the two research questions addressed in this paper (see Section 1): evaluating whether the automatic semantic classification process reduces the amount of work of a SLR (RQ1), and evaluating whether the semantic enrichment increases the performance of the plain classification process (RQ2).

6.1 RQ1: Reduction of the Human Workload

The results of the Mann-Whitney test are shown in Table 2. The table reports the I0 size (column 1), the manual work in the original SLR process (column 2), the manual work obtained with our enriched process (column 3), the estimated percentage of manual work to be performed with our enriched approach with respect to the total work required by the common approach (column 4), and the p-value obtained from the Mann-Whitney test (column 5). The p-values for all the configurations indicate that the null hypothesis can be rejected, and we accept the alternative one, which motivates the choice of the semantic enrichment approach. In addition, we notice that the workload reduction increases with the size of I0.

    |I0|   mwO    mwE (median)   mwE/mwO   p-value
    1      2214   1897.567       85%       < 0.001
    2      2213   1864.367       84%       < 0.001
    3      2212   1863.833       84%       < 0.001
    4      2211   1843.133       83%       < 0.001
    5      2210   1829.1         82%       < 0.001

Table 2. For each I0 configuration, we compare the workload required of a human being in the original SLR with the mean workload when our process is performed. To verify the goodness of our process, we compute the Mann-Whitney test and reject the hypothesis mwO ≤ mwE with a recall = 0.95.

6.2 RQ2: Assessing the Performance of the Enrichment Process

We used the Mann-Whitney test to reject the null hypothesis stating that mwNE ≤ mwE. Table 3 reports the I0 size (column 1), the estimated difference of manual workload between the two processes (column 2), and the p-value of the Mann-Whitney test (column 3). While we can observe that the enriched process requires less workload for every size of I0, we can affirm it with p < 0.001 only when the size of I0 is 5.

    |I0|   workload median pairwise difference   p-value
    1      26.67                                 0.0192
    2      66.00                                 0.0073
    3      40.83                                 0.0090
    4      33.00                                 0.0083
    5      49.99                                 0.0009

Table 3. For each I0 configuration, we performed the Mann-Whitney test, evaluating the median pairwise difference and the p-value to estimate the minimum workload for both processes, enriched and non-enriched. As for RQ1, the minimum recall is 0.95.

6.3 Discussion

The results show that our approach actually reduces the human workload needed to perform a SLR, while aiming to maintain a high level of completeness. Indeed, by limiting the recall to 95%, we adhere to the state of the art in the automation of SLRs while maintaining their high quality. However, since it relies only on positive papers, this approach introduces one more configuration step for defining the threshold, sketched below.
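To make that configuration step concrete, the following minimal sketch (our illustration, under the assumption that pilot runs with known relevant papers are available) reproduces the grid search of Section 5.4: the threshold is swept from 0 to 1 in steps of 0.01 and, for each run, the largest value still retaining at least 95% recall on the known relevant papers is kept, with the median across runs used as the operating threshold.

    import numpy as np

    def calibrate_threshold(runs, step=0.01, target_recall=0.95):
        # runs: for each pilot run, the classifier scores assigned to the
        # papers known to be relevant.  Returns the median, across runs,
        # of the largest threshold whose recall is still >= target_recall.
        per_run = []
        for relevant_scores in runs:
            p = np.asarray(relevant_scores)
            grid = np.arange(0.0, 1.0 + step, step)
            feasible = [t for t in grid if (p >= t).mean() >= target_recall]
            per_run.append(max(feasible) if feasible else 0.0)
        return float(np.median(per_run))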
The threshold can change according to the field of the SLR. In our tests, we empirically observed that the probability threshold is almost consistent across the different test scenarios; for this reason, we consider it a baseline value for further investigations. In addition, we observed that the enriched process performs better than the variant without enrichment by up to 5%. There are still two shortcomings: i) the entities extracted by OpenCalais sometimes point to resources in the OpenCalais knowledge base which do not contain sameAs links to DBpedia resources. We observe that the enrichment process fails in around 20% of the cases. The fallback strategy, which relies on a further interlinking step using the named entity labels and a lookup in DBpedia, partially fills the gap: 19.9% of the resources can be located this way, leaving a loss of 0.1% of matched resources. However, this does not entirely close the semantic gap, since the interlinking step used as fallback does not consider the context from which the named entity was extracted (raising an ambiguity issue which should be further analyzed with domain-adaptive techniques). ii) a massive use of encyclopedic sources can bias the content of the enriched paper, penalizing words which do not appear often in the linked source but are frequent in the initial document.

Differently from what we expected, the I0 configuration does not affect the recall. Indeed, our results suggest that the number of papers in I0 is not relevant; its composition, in terms of which papers are used to create it, may play a more important role. For instance, I0 could be initialized with papers that are not strictly related to each other, that represent just a niche of the research field, or that are completely off topic. While in the last case a wrong initialization affects the whole process and requires rebuilding the initial set, in the former cases the enrichment process enlarges I, escaping from the niche. Experiments show that the subjective bias in the composition of I0 is reduced when we use the semantic enrichment approach. While we do not have statistical evidence for it, the I0 size seems to play a role in the workload reduction.

An important positive consequence of the use of automatic classification is the possibility of operating on larger search spaces, because the effort of exploring W is reduced by means of partial automation. As a consequence, the search strategies can also explore further potentially interesting sources. For example, with the standard approach, searching a high number of journals and conferences is commonly quite expensive; resorting to partially automatic classification makes this search more affordable. Moreover, using an external knowledge base, we are able to capture not just the papers we recognize as similar to the ones already selected, but also papers that have conceptual relations (named entities) with the content expressed in the already selected papers. This strategy allows dealing with an incomplete description of the field of interest, which cannot be completely captured by the set of already selected papers. Therefore the proposed approach allows, as reported by the results, using an I set which is relatively small and not representative of the whole field, and obtaining results which outperform the classification process using only the original sources.
In addition, the experimental results show that these improvements are obtained with a still high recall (above 95%), which means losing a negligible amount of relevant information, an essential condition given the very nature of SLRs.

7 Conclusion and Future Work

In this paper, we presented a semantic enrichment approach for the recommendation of primary studies in a SLR. Resorting to text mining techniques and semantic enrichment, we improved the second step of the SLR process in order to filter the set of possible studies a researcher should read, automatically discarding the non-relevant papers. Our approach has two main advantages: i) reduction of the workload required to classify sources and ii) reduction of subjectivity in the overall process. We tested our approach using a real SLR (Jorgensen & Shepperd 2007) as benchmark dataset. Keeping a recall of 95% (i.e. we expect to discard a paper only when the system is at least 95% sure that the paper is out of scope), we obtained a saved workload of 18% when I0 is composed of 5 papers. In addition, we demonstrated that the enrichment process outperforms, by up to 5%, the automatic recommendation process without enrichment, which is used as baseline.

As future work, we plan to improve the classification step by using negative examples in addition to positive ones. We believe that, by also using negative examples, the process may obtain a more accurate estimate of the probability that a sample belongs to the interesting set. The first idea is to use some of the papers not included in the SLR as training negative examples. Although this may be intuitive, we may face the problem of a short distance between positives and negatives, due to the cross-cutting topics which these papers may report. A further evaluation of the distance among papers from different journal issues may give a better idea about the use of negative examples; therefore, a deeper analysis of which studies may be considered negative is needed. In addition, we plan to extract one paper i at a time from the set of relevant papers I, to use the remaining papers ∈ I to train the classifier and, then, to evaluate whether it recognizes i as similar to the others. In this way, the classifier is used to give a "second opinion" on the selection process, potentially reducing the number of researchers necessary to undertake this step.

In the presented approach, we rely on the MNB classifier. It is considered the baseline for text classification, but its results are often comparable to the state of the art, such as SVMs and Markov chains (Rennie et al. 2003), as also discussed in Section 4.4. We plan to validate the use of semantic enrichment with other classifiers to investigate the changes in performance. The experiments also addressed an important weakness in the named entity extraction task: the disambiguation mechanism provided by OpenCalais does not always link, via the sameAs property, to DBpedia resources. The loss of this process is recovered by an in-house interlinking logic which disambiguates the entity to DBpedia considering only the name of the entity. Currently, we are investigating the effect of NERD (Rizzo et al. 2014), which disambiguates to DBpedia considering the surroundings of the text where the entity has been spotted, hence preserving the semantics. Finally, the semantic enrichment mechanism has been validated using one SLR.
We plan to validate it also using other SLRs, especially ones coming from other fields of research. We believe that our approach could be adopted by scientific content providers, such as journal portals, to index sources and to automatically classify and cluster the papers they publish. This approach may be used to propose a faceted view of the sources queried by a user. The challenge will be to compute this operation in real time to limit the human effort.

Acknowledgments

This work was partially supported by the European Union's 7th Framework Programme via the project LinkedTV (GA 287911).

References

Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R & Hellmann S 2009 DBpedia – A crystallization point for the Web of Data, Web Semantics: Science, Services and Agents on the World Wide Web 7(3), 154–165.
Cawley G C & Talbot N L 2010 On over-fitting in model selection and subsequent selection bias in performance evaluation, The Journal of Machine Learning Research 11, 2079–2107.
Chawla N V 2005 Data Mining for Imbalanced Datasets: An Overview, in Data Mining and Knowledge Discovery Handbook, pp. 853–867.
Chawla N V, Japkowicz N & Kotcz A 2004 Editorial: Special Issue on Learning from Imbalanced Data Sets, ACM SIGKDD Explorations Newsletter 6(1), 1–6.
Cohen A M 2008 Optimizing feature representation for automated systematic review work prioritization, in Annual Symposium of the American Medical Informatics Association (AMIA), pp. 121–125.
Cohen A M, Hersh W R, Peterson K & Yen P Y 2006 Reducing Workload in Systematic Review Preparation Using Automated Citation Classification, Journal of the American Medical Informatics Association (JAMIA) 13(2), 206–219.
Elkan C 2001 The Foundations of Cost-sensitive Learning, in 17th International Joint Conference on Artificial Intelligence (IJCAI'01).
Forman G & Cohen I 2004 Learning from Little: Comparison of Classifiers Given Little Training, in Knowledge Discovery in Databases: PKDD 2004.
Grishman R & Sundheim B 1996 Message Understanding Conference-6: a brief history, in 16th International Conference on Computational Linguistics (COLING'96), pp. 466–471.
Hollander M & Wolfe D A 1973 Nonparametric Statistical Methods, John Wiley and Sons, New York.
Hubbard R & Lindsay R M 2008 Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing, Theory & Psychology 18(1), 69–88.
Jorgensen M & Shepperd M 2007 A Systematic Review of Software Development Cost Estimation Studies, IEEE Transactions on Software Engineering 33(1), 33–53.
Kibriya A, Frank E, Pfahringer B & Holmes G 2005 Multinomial Naive Bayes for Text Categorization Revisited, in 17th Australian Joint Conference on Advances in Artificial Intelligence (AI'05).
Kitchenham B 2004 Procedures for performing systematic reviews, Technical Report TR/SE-0401, Software Engineering Group, Department of Computer Science, Keele University.
Kitchenham B 2007 Guidelines for performing systematic literature reviews in software engineering, Technical Report EBSE-2007-01.
Matwin S, Kouznetsov A, Inkpen D, Frunza O & O'Blenis P 2010 A new algorithm for reducing the workload of experts in performing systematic reviews, Journal of the American Medical Informatics Association (JAMIA) 17(4), 446–453.
Murphy K P 2012 Machine Learning: a Probabilistic Perspective, The MIT Press.
Porter M 1980 An algorithm for suffix stripping, Program 14(3), 130–137.
URL: http://www.emeraldinsight.com/doi/abs/10.1108/eb046814
Rennie J D M, Shih L, Teevan J & Karger D R 2003 Tackling the Poor Assumptions of Naive Bayes Text Classifiers, in 20th International Conference on Machine Learning (ICML'03).
Rizzo G, van Erp M & Troncy R 2014 Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web, in 9th edition of the Language Resources and Evaluation Conference (LREC'14).
Ruttenberg A, Rees J A, Samwald M & Marshall M S 2009 Life sciences on the Semantic Web: the Neurocommons and beyond, Briefings in Bioinformatics 10(2), 193–204.
Tomassetti F, Rizzo G, Vetro A, Ardito L, Torchiano M & Morisio M 2011 Linked Data Approach for Selection Process Automation in Systematic Reviews, in Evaluation and Assessment in Software Engineering (EASE'11).