key: cord-0196221-3ovwp4fu authors: Singhal, Trisha; Liu, Junhua; Blessing, Lucienne T.M.; Lim, Kwan Hui title: Analyzing Scientific Publications using Domain-Specific Word Embedding and Topic Modelling date: 2021-12-24 journal: nan DOI: nan sha: c55b498c0baa458d0be24cc947dfe1a42b4dba96 doc_id: 196221 cord_uid: 3ovwp4fu The scientific world is changing at a rapid pace, with new technology being developed and new trends being set at an increasing frequency. This paper presents a framework for conducting scientific analyses of academic publications, which is crucial to monitor research trends and identify potential innovations. This framework adopts and combines various techniques of Natural Language Processing, such as word embedding and topic modelling. Word embedding is used to capture the semantic meanings of domain-specific words. We propose two novel scientific publication embeddings, i.e., PUB-G and PUB-W, which are capable of learning the semantic meanings of general as well as domain-specific words in various research fields. Thereafter, topic modelling is used to identify clusters of research topics within these larger research fields. We curated a publication dataset consisting of two conferences and two journals from 1995 to 2020 from two research domains. Experimental results show that our PUB-G and PUB-W embeddings are superior in comparison to other baseline embeddings by a margin of ~0.18-1.03 based on topic coherence. The scientific and technological worlds are changing at an unprecedented rate, thus increasing the importance of monitoring research trends to identify innovation potential. Trend research and identification can be done using a variety of sources, with scientific literature such as books, articles, and publications receiving substantial attention from researchers worldwide [1]-[3]. The analysis of publications has proved useful in identifying emerging topics and tracking their growth or decline over the years using linguistic features. The use of Natural Language Processing (NLP) techniques to support this analysis makes it easier to discover patterns and allows for answering more specific research questions. Word embedding is an important and widely used NLP technique for identifying the semantic meanings of a text corpus. These semantic meanings are useful for identifying and quantifying word-word similarities and the global contextual meaning of text corpora. An increasing number of word embeddings can be found in the literature, such as count vectorizers [4] and TF-IDF [5], which are more classical word representation techniques. These classical approaches are linear language modelling approaches and often fail to model the true contextual meaning of text corpora. In contrast, Word2Vec [6], GloVe [7], and ELMo [8] are some of the more modern techniques for contextualizing the meanings of text corpora, which incorporate neural networks for non-linear language modelling. However, these models are often trained on datasets derived from Twitter, Wikipedia, or general pieces of text and are therefore not entirely suitable for the analysis of scientific publications due to the existence of domain-specific words in these corpora. With this motivation, we present two domain-specific word embeddings termed PUB-G and PUB-W. We use these embeddings to cluster scientific publications based on their abstracts to identify various areas of research in the respective domain. We then use topic modelling to identify more detailed research topics in these areas.
The paper makes the following contributions: 1) We curate a publication dataset for two conference series and two journals from two different disciplines, with a total of 10.4k publications for the period 1995 to 2020. 2) We propose novel domain-specific embeddings for publication data, termed PUB-G and PUB-W, based on GloVe and Word2Vec respectively. We further use these embeddings to cluster publications based on their abstracts. 3) We develop a baseline classical approach for publication clustering and compare this approach with other competitive baseline embeddings. 4) We show that the research topics identified using the PUB-G embedding achieve a better coherence score. The remainder of this paper is organized as follows. Section II discusses related work. Section III shows the workflow of our proposed framework and provides a detailed explanation of each component, and Section IV describes our dataset. Section V discusses the experimental results and main findings. The conclusions can be found in Section VI. II. RELATED WORK NLP has widespread use and is being applied in a myriad of tasks ranging from language translation [9]-[11] and sentiment analysis of text [12]-[14] to document analysis [15], [16] and social media analytics [17]-[19]. In the following sections, we discuss some related work in several relevant sub-fields of NLP. Digital document analysis has been a research field for several years. [20] provides a detailed discussion of the traditional approaches that were used to analyze the structure of electronic documents. On the other hand, there are various tools available in the market today to perform information extraction from scientific literature [21]-[24]. Many researchers used traditional approaches such as Support Vector Machines (SVM), Latent Dirichlet Allocation (LDA), Singular Value Decomposition (SVD) and Hidden Markov Models (HMM) to implement various text analysis techniques. [25] used an extended HMM to extract bibliographic attributes from references, [26] used an SVM classifier for two-stage metadata extraction from the headers of research publications, and [27] applied an ensemble ML approach to automatically extract users from patents. The application of neural networks (NN) to digital documents helped enormously in extracting and analyzing documents and gaining valuable insights. On this account, [28] extracted text information by identifying various sections of scientific publications in the form of PDF documents using a deep learning-based NN, U-Net. Some researchers worked beyond the textual information, such as [29], who used end-to-end multimodal fully convolutional neural networks to perform pixel-wise page segmentation to extract semantic features of the document. Other researchers [30], [31] extracted figures from research papers at NIPS, ICML and AAAI. Trend analysis has been a significant research topic in several fields. Recently, Ordun et al. [32] did a thorough analysis of COVID-19 tweets using topic modeling and pattern matching to identify high-level trends, events with sudden spikes, distinctive topics, speed of tweeting and retweeting, and network behaviors. On a similar topic, Kwan and Lim [33], [34] used sentiment analysis, topic modeling and temporal analysis techniques on tweets to study trends and discussions about COVID-19 in various countries. Schoch et al. [35] explored literary genre using topic modeling. Chiarello et al.
[36] used state-of-the-art text mining techniques to analyze research papers published in the Engineering Design field to identify the evolution of various research themes. Pek and Lim used academic publications to identify key business trends, particularly the various popular topics and the frequency of these topics over the years [37]. Similarly, research publications were used in HCI: Yang et al. [38] visualized the use of ML to improve user experience (UX), Carter et al. [39] analyzed the understanding of games and play research within four research paradigms in the field of HCI, and [40] studied the emerging trends and changes in HCI over a decade. Textual word embeddings are often used in NLP research for language modeling [41], [42]. These embeddings transform textual words into an n-dimensional vector space, which is useful to 1) quantify word-word similarity and 2) model global contextual meanings of text corpora. There are various techniques for developing such a language model. Some of the classical methods in the literature are count vectorization and TF-IDF vectorization. These methods are based on word frequencies. Word2Vec presents a shallow neural network-based approach that is optimized to predict a masked word given the words before and after the masked word [6]. GloVe presents a regression-based model to predict the conditional probability of a word appearing given another word. Context-aware word embeddings, such as Embeddings from Language Models (ELMo) [8] and Bidirectional Encoder Representations from Transformers (BERT) [43], were more recently proposed to generate word representations that better consider the context of the sentence. However, all these embeddings are usually trained on common text corpora [7]. Thus, these existing embeddings are not suitable for analyzing scientific publications, which are often abundant with domain-specific words. Our analysis is based on the abstracts of publications from two conferences and two journals in the Human-Computer Interaction (HCI) and Engineering Design research fields (more details later). We use a classical and neural network-based hybrid linguistic analysis approach to uncover trends in these conferences and journals. Figure 1 shows our analysis methodology, which is explained in detail in this section. A. Topic Clustering with Abstracts 1) Text Pre-Processing: The abstracts are pre-processed to reduce noise and facilitate the subsequent analyses. The pre-processing consists of the following steps: Hyperlinks. The first step performed to clean the dataset is the removal of hyperlinks. Regex patterns, which are short sequences of characters defining a search syntax used to match sets of strings in a given text, are implemented for this purpose. These regular expressions are supported and accessed via the Python module 're'. Punctuations. Next, all punctuation symbols, including '[', ',', '\', '.', '!', '?', ']', are replaced by an empty string using regular expressions (regex). Numeric Values. Numerical values such as dates, amounts, etc. do not contribute much information for our purpose and hence we removed numbers from the documents, again using regex. Lowercase. To prevent the model from being case-sensitive, we converted the text to lowercase using the string method lower(). Whitespaces. It is essential to remove unnecessary whitespace from the data to reduce noise. This is done using the string method strip(), which removes leading and trailing whitespace.
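As an illustration of these pre-processing steps, the following is a minimal Python sketch (not the authors' exact code); the helper name clean_abstract and the specific regex patterns are assumptions.

```python
import re

def clean_abstract(text: str) -> str:
    """Hypothetical pre-processing helper following the steps described above."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^\w\s]", " ", text)                 # strip punctuation symbols
    text = re.sub(r"\d+", " ", text)                     # drop numeric values
    text = text.lower()                                  # case-normalize
    text = re.sub(r"\s+", " ", text).strip()             # collapse leading/trailing/extra whitespace
    return text

print(clean_abstract("Augmented Reality (AR) was studied in 2019; see https://example.org!"))
# -> "augmented reality ar was studied in see"
```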
2) Tokenization: The tokenization process assigns a unique identifier to each unique word in the publication corpus. This is usually done as a preliminary step in many natural language processing pipelines for obtaining language features. We used the Gensim library to perform tokenization on the corpus. 3) Part-Of-Speech Tagging and Lemmatization: Part-Of-Speech (POS) tagging is used to allocate each token a POS tag, such as noun, adjective, verb, or adverb, based on its contextual interpretation. Subsequently, lemmatization transforms all tokens from their inflected forms to their root form. 4) N-grams: In computational linguistics, an n-gram is a contiguous sequence of n words in a text corpus. An n-gram of size 1 is referred to as a unigram, size 2 as a bigram, and size 3 as a trigram. Identifying such patterns in texts is often necessary to effectively uncover the contextual meanings of language. Some of the most common bigrams and trigrams identified in our datasets are 'augmented reality', 'privacy concern', 'computer mediated communication', etc. It is evident that if these words are extracted as unigrams, their linguistic meaning is lost. B. Text Embeddings 1) Baseline Textual Embedding: A text embedding is an N-dimensional vector for each unique word in the corpus. In other words, it maps words into an N-dimensional vector space from which, e.g., the semantic similarity among different words can be derived. We use Term Frequency-Inverse Document Frequency (TF-IDF) [44] as our baseline textual embedding, which essentially is a 1-dimensional text embedding. The TF-IDF score for a word $t$ in document $d$ from the document set $D$ is obtained as follows:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|} \tag{1}$$

Let us assume that there are $T$ unique words, bigrams or trigrams in the publication corpus, and let $t_i$ be the $i$-th unique word. Our textual embedding is $T$-dimensional, where $s_i$ is the $i$-th value of the $T$-dimensional vector. $s_i$ for the abstract $d$ is calculated as follows:

$$s_i = \mathrm{tfidf}(t_i, d, D) \tag{2}$$

$S$ is used as the baseline textual embedding. Our publication corpus has 15,125 unique TF-IDF features for ICED, 21,066 for CHI, 6,026 for TOCHI, and 4,998 for RIED. 2) PUB-W Embedding: We explored another textual embedding, the Continuous Bag-of-Words (CBOW) variant of Word2Vec [6]. The model contains a two-layer neural network for training that includes an input, a hidden, and an output layer. The input is given in vector form, obtained by converting words into vectors using one-hot encoding. The hidden layer is a dense (fully-connected) layer whose weights are the word embeddings, and the output layer uses a softmax classifier to generate the probabilities of the target words. Assume the input text has $T$ words. For each $t \in [1:T]$, the optimization function first computes the log of the conditional probability of predicting the $t$-th word given the $n$ words before and the $n$ words after the $t$-th word. It then sums these log conditional probabilities over all words in the text. Finally, the objective is to minimize the negative of this summation, which is represented in Equation 3:

$$J = -\sum_{t=1}^{T} \log p\left(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}\right) \tag{3}$$

We present a 100-dimensional Word2Vec embedding trained on our publication dataset. For further analysis, we compute the average PUB-W embedding over all words. 3) PUB-G Embedding: We believe that TF-IDF may not capture the actual semantic similarity among words in the vector space. Numerous embeddings have been proposed over the years that better capture the semantic meaning of textual information.
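To make the two embedding pipelines concrete, here is a minimal sketch (not the authors' exact code) of the baseline TF-IDF representation and a PUB-W-style document embedding obtained by averaging CBOW Word2Vec vectors. The toy corpus, window size, and epoch count are assumptions; the 100 dimensions, CBOW objective, and averaging over words follow the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# Toy stand-in for the pre-processed abstracts (the real corpora have 366-6,365 documents).
abstracts = [
    "augmented reality improves user experience in mobile interaction",
    "topic modelling uncovers research trends in engineering design publications",
    "word embedding captures semantic similarity between domain specific words",
]

# Baseline: T-dimensional TF-IDF vector per abstract over unigrams, bigrams and trigrams (Eqs. 1-2).
tfidf = TfidfVectorizer(ngram_range=(1, 3))
S = tfidf.fit_transform(abstracts)            # shape: (num_abstracts, T)

# PUB-W style: 100-dimensional CBOW Word2Vec trained on the corpus (window and epochs are assumptions).
tokenized = [a.split() for a in abstracts]
w2v = Word2Vec(tokenized, vector_size=100, window=5, sg=0, min_count=1, epochs=50)

def pub_w_embedding(tokens, model):
    """Average the word vectors of an abstract; out-of-vocabulary tokens are skipped."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_pub_w = np.vstack([pub_w_embedding(t, w2v) for t in tokenized])  # shape: (num_abstracts, 100)
```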
With this motivation, we propose to use a GloVe [7] embedding trained on our publication corpus. In contrast to Word2Vec [6] or TF-IDF, GloVe does not rely only on the local context of words. It captures global statistics by means of a word-word co-occurrence matrix. Let us define the co-occurrence matrix as $X \in \mathbb{R}^{N \times N}$, where $N$ is the number of unique words in the dataset. $X_{i,j}$ is defined as the number of times word $i$ has co-occurred with word $j$. Let $X_i = \sum_k X_{i,k}$ be the number of times any word appears in the context of word $i$. Furthermore, $P_{i,j}$ is defined as follows:

$$P_{i,j} = \frac{X_{i,j}}{X_i} \tag{4}$$

Let $w_i \in \mathbb{R}^d$ be the $d$-dimensional GloVe word embedding for word $i$. GloVe defines a regression model to learn the ratio $P_{i,k}/P_{j,k}$. Here, learning the word embeddings depends on three words $i$, $j$ and $k$, where word $k$ is the context word. The regression model is parameterized as follows:

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{i,k} \tag{5}$$

Here, for the context word $k$, a separate embedding $\tilde{w}_k$ is used. In our analysis, we use 100 as the dimensionality of the GloVe word embedding. For the abstract texts, we first compute the PUB-G embedding for each word in the abstract and then compute the average embedding over all the words. The resulting vectorized matrix obtained from the feature vector space is then used to cluster the documents of all four corpora using K-Means clustering [45]. To visualize these clusters in 2-dimensional space, we used t-distributed stochastic neighbor embedding (t-SNE) [46]. Both approaches are discussed in detail below. 1) K-means: K-means [45] is one of the most straightforward unsupervised machine learning (ML) algorithms and is nowadays widely used in various real-world applications to recognize hidden patterns. It classifies the datapoints into a fixed number of clusters K, each having a centroid representing the center of the cluster. The algorithm first initializes random centroids and then iteratively computes the distance between each datapoint and its corresponding cluster's centroid until the cluster centers stabilize or the given number of iterations has been reached. The main idea is to minimize an error function known as the squared error, which can be represented by the objective function in Equation 6; the recalculation step of the new centroids is given by Equation 7:

$$J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left(\lVert x_j - v_i \rVert\right)^2 \tag{6}$$

$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j \tag{7}$$

Here, $X = \{x_1, x_2, x_3, \ldots, x_n\}$ is the set of datapoints and $V = \{v_1, v_2, v_3, \ldots, v_c\}$ is the set of cluster centroids. In our case, datapoint $x_i$ is the $T$-dimensional textual embedding of the $i$-th abstract. $\lVert x_i - v_j \rVert$ is the Euclidean distance between the $i$-th abstract and the $j$-th cluster centroid, $c_i$ is the number of datapoints in the $i$-th cluster, and $c$ is the total number of clusters. A datapoint is assigned to a particular cluster based on the following function:

$$k_i = \arg\min_{j \in \{1, \ldots, c\}} \lVert x_i - v_j \rVert$$

Here $k_i$ is the assigned cluster ID of the $i$-th document. For our datasets, we used the Elbow Method [47] to find the optimal number of clusters, which led us to use K=10 clusters across the different publication datasets. 2) t-SNE: Based on [48], [46] developed t-distributed stochastic neighbor embedding (t-SNE), which is extremely useful for visualizing high-dimensional data in lower dimensions, specifically the two-dimensional plane. It is an unsupervised, non-linear machine learning technique that is also used as a dimensionality reduction method.
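A minimal sketch of the clustering and visualization step using scikit-learn is shown below, applied to random stand-in document embeddings; K=10 and perplexity=100 follow the paper, while everything else (data, random seeds, plot styling) is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# X: document embeddings, e.g. the averaged 100-d PUB-G/PUB-W vectors (random stand-in here).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# K-means with K=10, as selected via the Elbow Method in the paper.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# 2-D t-SNE projection; perplexity=100 follows the paper, early exaggeration stays at the default 12.
X_2d = TSNE(n_components=2, perplexity=100, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE view of abstract clusters (toy data)")
plt.show()
```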
t-SNE creates a probability distribution by computing pairwise similarities between neighbouring datapoints. These pairwise similarities are based on the conditional probability density between two points, which is high for nearby points and low for far-apart points. t-SNE has two important parameters: perplexity and early exaggeration. Perplexity relates to the number of nearest neighbours of a centre point and impacts the variance of the Gaussian distribution, whereas early exaggeration controls the space between the clusters. For our experiments, we set the perplexity to 100 and kept the early exaggeration at its default value of 12. Figure 2 shows the cluster visualizations generated using t-SNE. After clustering the documents into groups, we discovered topics to understand what each cluster represents. We identified 10 topics for each cluster using Latent Dirichlet Allocation (LDA) [49]. LDA is a widely used topic modeling approach that assumes each document is a mixture of $k$ different topics, with each topic having its own inherent word probability distribution. Hence, the objective of the LDA algorithm is to find these $k$ topics and their word probability distributions. More concretely, let us assume that there are $D$ abstracts, $T$ words, $K$ topics, and $N$ words per abstract. The goal of LDA is to calculate the joint posterior probability as stated in the following equation:

$$p(\theta, z, \beta \mid D, \alpha, \eta) = \frac{p(\theta, z, \beta, D \mid \alpha, \eta)}{p(D \mid \alpha, \eta)}$$

Here $\theta$ is a distribution of topics, one for each document; $z$ is a distribution of topics, one for each word; $\beta$ is a distribution of words, one for each topic; $\alpha$ is the parameter vector for each document; $\eta$ is the parameter vector for each topic; and $D$ is the abstracts dataset. We curate a dataset that represents leading conferences and journals in the fields of Human-Computer Interaction and Engineering Design, namely the CHI conference and the TOCHI journal for HCI, and the ICED conference and the RIED journal for Engineering Design. The Engineering Design venues publish papers on design theory and methodology in all fields of engineering, focusing on mechanical, civil, architectural, and manufacturing engineering. Topics covered include functional representation, feature-based design, shape grammars, process design, redesign, product data base models, and empirical studies. The selection of a journal and a conference series from two different fields allows us to compare the usefulness of our methodology between fields and between publication types. Table I shows a summary description of our dataset, which we elaborate on later. The Web of Science platform is used to fetch the data for the CHI conference and the TOCHI and RIED journals. For CHI, publications from 2007 to 2020 were extracted, resulting in 6,365 data points. Similarly, the TOCHI dataset includes 370 data points covering 2007 to 2019, while the RIED dataset contains 366 datapoints from 1995 to 2020, excluding 1996. These datasets include abstracts, titles, authors' names, and keywords. For the ICED conference dataset, we extracted the publication information and papers from the ICED Design Society website (https://iced.designsociety.org/group/7/Proceedings+of+ICED) for the period 2003 to 2019. The total number of instances extracted is 3,308. Due to the unavailability of some attributes on the website, the missing datapoints were manually extracted from the publications' PDFs. The dataset comprises (i) Title, (ii) Year, (iii) Editor, (iv) Author, (v) Series, (vi) Institution, (vii) Section, (viii) DOI number/ISBN, (ix) ISSN, (x) Abstract, and (xi) Keywords. The data is assembled as a delimited text file using a comma to separate values.
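As an illustration of the per-cluster topic discovery step, here is a minimal sketch using Gensim's LDA (not the authors' exact code); the toy token lists and the mapping from K-means labels to clusters are assumptions, while the choice of 10 topics per cluster follows the paper.

```python
from gensim import corpora
from gensim.models import LdaModel

# Stand-in structure: {cluster_id: list of tokenized abstracts}, built from the K-means assignments.
tokenized_by_cluster = {
    0: [["augmented", "reality", "user", "experience"],
        ["virtual", "reality", "headset", "interaction"]],
    1: [["design", "method", "engineering", "process"],
        ["product", "design", "requirement", "analysis"]],
}

topics_per_cluster = {}
for cid, docs in tokenized_by_cluster.items():
    dictionary = corpora.Dictionary(docs)                 # word <-> id mapping for this cluster
    bow = [dictionary.doc2bow(doc) for doc in docs]       # bag-of-words corpus for this cluster
    lda = LdaModel(bow, num_topics=10, id2word=dictionary, passes=10, random_state=0)
    topics_per_cluster[cid] = lda.show_topics(num_topics=10, num_words=5)
```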
A. Evaluation Metrics 1) Topic Coherence: A set of statements or facts is said to be coherent if the statements support each other. As an example, consider the statement 'GloVe is a word embedding used for language modeling'. This statement is coherent since its facts support each other. In the literature, various topic coherence scores are used to quantify this semantic similarity. As explained in Section III-D, we apply LDA topic modeling to each cluster to identify research topics. For a single topic, a coherence score measures the degree of similarity between the high-scoring words in that topic. There are two types of coherence metrics: 1) intrinsic and 2) extrinsic methods. Intrinsic methods do not use any external task for measuring semantic similarity. In comparison, extrinsic methods apply the discovered topics in an external task such as information retrieval. However, we believe that applying topics generated from a corpus abundant with domain-specific words to an external task is not well suited. Therefore, we use the intrinsic UMass coherence score [50] for our evaluations. The following equation depicts the pairwise score function used to calculate the coherence score. Let us assume that $w_1, \ldots, w_n$ represent the top-n words of a topic:

$$\mathrm{score}(w_i, w_j) = \log \frac{D(w_i, w_j) + 1}{D(w_i)}$$

Here, $D(w_i)$ is the number of times the word $w_i$ appears in the corpus and $D(w_i, w_j)$ is the number of times $w_i$ and $w_j$ appear together in the corpus, where $w_i$ is selected to be the more common word of the pair. This score is then averaged over all the topics and subsequently over all the clusters we generated. 2) Mutual Information Scores: We applied 3 different types of word embeddings, which generated distinct clusterings. The mutual information (MI) measures the similarity between two labellings of the same data as follows:

$$MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log \frac{N \, |U_i \cap V_j|}{|U_i| \, |V_j|}$$

Here $U$ and $V$ are two different clusterings, $|U|$ and $|V|$ are the numbers of clusters in $U$ and $V$ respectively, $|U_i|$ is the number of samples in cluster $U_i$, $|V_j|$ is the number of samples in cluster $V_j$, and $N$ is the total number of samples. Furthermore, we also calculate two other MI-related metrics: 1) the normalized mutual information score (NMI) and 2) the adjusted mutual information score (AMI). The NMI score scales the MI score to be between 0 and 1. AMI is another adjustment of the MI score for chance. It is calculated as follows:

$$AMI(U, V) = \frac{MI(U, V) - \mathbb{E}[MI(U, V)]}{\mathrm{mean}\bigl(H(U), H(V)\bigr) - \mathbb{E}[MI(U, V)]}$$

Here $H(\cdot)$ denotes the entropy of a clustering. B. Comparing the Coherence Score against Various Embeddings As explained in Section III-D, we applied LDA topic modeling to identify 10 topics in each cluster separately. Then, the average UMass coherence score over the clusters was computed for the various text embeddings. Table II summarizes these values, with TF-IDF as the baseline, GloVe as the pre-trained embedding, and PUB-G and PUB-W as the proposed embeddings. Based on these, we make the following key observations: 1) Our proposed PUB-G and PUB-W embeddings generate the best coherence scores for all the publication datasets in comparison to all other embeddings. For CHI and ICED, PUB-W works best, whereas for TOCHI and RIED, PUB-G works best. PUB-G seems to generalize better than PUB-W in cases with limited data points. 2) Across all the datasets, TF-IDF's performance is lower than that of both of the proposed embeddings, PUB-G and PUB-W. We believe this is because TF-IDF is not able to model the semantic meaning of the text. 3) GloVe performs marginally better than PUB-G for ICED and CHI.
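Both evaluation metrics can be computed with standard libraries. The following self-contained sketch (toy data, not the authors' code) uses Gensim's CoherenceModel for the intrinsic UMass score and scikit-learn for the MI, NMI and AMI scores; the tiny corpus and label arrays are assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

# Tiny stand-in corpus for one cluster.
docs = [["augmented", "reality", "user", "experience"],
        ["virtual", "reality", "interaction", "user"],
        ["user", "study", "interaction", "design"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

# Intrinsic UMass coherence of the topics found in this cluster.
umass = CoherenceModel(model=lda, corpus=bow, dictionary=dictionary,
                       coherence="u_mass").get_coherence()

# Agreement between cluster assignments produced by two embeddings
# (hypothetical label arrays standing in for the TF-IDF / PUB-G / PUB-W clusterings).
labels_a = [0, 0, 1, 2, 2, 1]
labels_b = [0, 1, 1, 2, 2, 0]
mi = mutual_info_score(labels_a, labels_b)
nmi = normalized_mutual_info_score(labels_a, labels_b)
ami = adjusted_mutual_info_score(labels_a, labels_b)
print(umass, mi, nmi, ami)
```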
However, GloVe's performance is still lower than that of our proposed embedding PUB-W. Based on these key observations, we note that existing pre-trained textual embeddings are incapable of fully capturing the semantic meaning of textual data in a scientific domain. C. Comparing the MI Scores between TF-IDF, PUB-G, and PUB-W Different embeddings generated different clusters based on the inherent semantic similarity among words. In order to analyze the degree of similarity among these clusterings, we compute the MI scores among the three embeddings. Tables III, IV, and V show the MI scores between 1) TF-IDF and PUB-G, 2) PUB-G and PUB-W, and 3) PUB-W and TF-IDF. Based on these findings, we find that the MI score between PUB-W and PUB-G is higher than that between PUB-G and TF-IDF. This indicates that TF-IDF generates different clusters in comparison to the other two. We demonstrate a qualitative analysis of the detected clusters by discussing a case study of the CHI publications. Upon applying k-means clustering to all the documents and further applying topic modeling to the 10 clusters, we examined the representative topics of each cluster. From the time-series graph in Figure 3, it can be noticed that cluster 8 has shown a decrease, which somewhat reflects the fact that when the first VR technologies came into existence there was a sudden spike of interest in the research community, which gradually decreased over the years and might decrease further due to COVID-19. On similar grounds, cluster 2 has shown a spike, which can also be explained by the increase in data analysis of online communities, groups and forums, especially in terms of behavioral analysis, opinion mining, sentiment analysis, etc. Cluster 3, which mostly includes research on User Experience, appears to have had a steady and high level of interest in the CHI community over the years. This is to be expected given that the theme of CHI research is focused on developing human-computer interactive technologies. In this paper, we present a framework to facilitate the scientific analyses of academic publications, which is important for monitoring the growth of a particular research field and identifying potential innovations. Our framework adopts and combines data collection, word embedding, topic modelling and temporal trend analysis. Many word embeddings are trained on general text articles and may not be able to capture the features relevant to the domain-specific texts found in scientific publications. To address this problem, we curated a publication dataset consisting of two conferences and two journals from 1995 to 2020 in two research disciplines. Using this dataset, we propose two scientific publication embeddings, i.e., PUB-G and PUB-W, which are capable of learning the semantic meanings of general as well as domain-specific words in various research fields. Experimental results show that our PUB-G and PUB-W embeddings outperform other baseline embeddings based on topic coherence.
[1] Science communication as a field of research: identifying trends, challenges and gaps by analysing research papers
[2] Identifying general trends and patterns in complex systems research: An overview of theoretical and practical implications
[3] Exploring research trends in big data across disciplines: A text mining analysis
[4] sklearn.feature_extraction.text.CountVectorizer, scikit-learn 0.24.1 documentation
[5] sklearn.feature_extraction.text.TfidfVectorizer, scikit-learn 0.24.1 documentation
[6] Efficient estimation of word representations in vector space
[7] GloVe: Global vectors for word representation
[8] Deep contextualized word representations
[9] Neural machine translation by jointly learning to align and translate
[10] Automatic machine translation evaluation using source language inputs and cross-lingual language model
[11] Re-translation versus streaming for simultaneous translation
[12] Opinion mining and sentiment analysis
[13] SentiLARE: Linguistic knowledge enhanced language representation for sentiment analysis
[14] ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis
[15] Document analysis as a qualitative research method
[16] Identification of cybersecurity specific content using the doc2vec language model
[17] EPIC30M: An Epidemics Corpus of Over 30 Million Relevant Tweets
[18] Crisis-BERT: A Robust Transformer for Crisis Classification and Contextual Crisis Embedding
[19] Real-time Spatio-temporal Event Detection on Geotagged Social Media
[20] Document structure analysis algorithms: a literature survey
[21] ParsCit: an open-source CRF reference string parsing package
[22] GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications
[23] CERMINE: automatic extraction of metadata and references from scientific literature
[24] OCR++: a robust framework for information extraction from scholarly articles
[25] Bibliographic attribute extraction from erroneous references based on a statistical model
[26] Automatic document metadata extraction using support vector machines
[27] Automatic users extraction from patents
[28] DeepPDF: A deep learning approach to extracting text from PDFs
[29] Learning to extract semantic structure from documents using multimodal fully convolutional neural networks
[30] Looking beyond text: Extracting figures, tables and captions from computer science papers
[31] PDFFigures 2.0: Mining figures from research papers
[32] Exploratory analysis of COVID-19 tweets using topic modeling, UMAP, and DiGraphs
[33] TweetCOVID: A System for Analyzing Public Sentiments and Discussions about COVID-19 via Twitter Activities
[34] Understanding Public Sentiments, Opinions and Topics about COVID-19 using Twitter
[35] Topic modeling genre: An exploration of French classical and Enlightenment drama
[36] A text mining based map of engineering design: Topics and their trajectories over time
[37] Identifying and Understanding Business Trends using Topic Models with Word Embedding
[38] Mapping machine learning advances from HCI research to reveal starting places for design innovation
[39] Proceedings of the first ACM SIGCHI annual symposium on Computer-human interaction in play
[40] Trends and changes in the field of HCI the last decade from the perspective of the HCII conference
[41] Word embedding for understanding natural language: a survey
[42] Word embeddings: A survey
[43] BERT: Pre-training of deep bidirectional transformers for language understanding
[44] Term-weighting approaches in automatic text retrieval
[45] Some methods for classification and analysis of multivariate observations
[46] Visualizing data using t-SNE
[47] Research on k-value selection method of k-means clustering algorithm
[48] Stochastic neighbor embedding
[49] Latent Dirichlet allocation
[50] Optimizing semantic coherence in topic models

This research is funded in part by the Singapore University of Technology and Design under grant SRG-ISTD-2018-140. The authors thank the anonymous reviewers for their useful comments.