AOBTM: Adaptive Online Biterm Topic Modeling for Version Sensitive Short-texts Analysis

Mohammad Abdul Hadi, Fatemeh H Fard

Abstract-Analysis of mobile app reviews has shown its important role in requirement engineering, software maintenance and evolution of mobile apps. Mobile app developers check their users' reviews frequently to clarify the issues experienced by users or capture the new issues that are introduced due to a recent app update. App reviews have a dynamic nature and their discussed topics change over time. The changes in the topics among collected reviews for different versions of an app can reveal important issues about the app update. A main technique in this analysis is using topic modeling algorithms. However, app reviews are short texts and it is challenging to unveil their latent topics over time. Conventional topic models such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) suffer from the sparsity of word co-occurrence patterns while inferring topics for short texts. Furthermore, these algorithms cannot capture topics over numerous consecutive time-slices (or versions). Online topic modeling algorithms such as Online LDA (OLDA) and Online Biterm Topic Model (OBTM) speed up the inference of topic models for the texts collected in the latest time-slice by saving a fraction of data from the previous time-slice.
But these algorithms do not analyze the statistical data (such as topic distributions) of all the previous time-slices, which can contribute to the topic distribution of the current time-slice. In this paper, we propose the Adaptive Online Biterm Topic Model (AOBTM) to model topics in short texts adaptively. AOBTM alleviates the sparsity problem in short texts and considers the statistical data of an optimal number of previous time-slices. We also propose parallel algorithms to automatically determine the optimal number of topics and the best number of previous versions that should be considered in the topic inference phase. Automatic evaluation on collections of app reviews and real-world short text datasets confirms that AOBTM can find more coherent topics and outperforms the state-of-the-art baselines. For reproducibility of the results, we open source all scripts.

Index Terms-App review analysis, adaptive topic model, biterm, online algorithm, automatic parameter setting

I. INTRODUCTION

Mobile app reviews form a main feedback channel for app developers [1] to evaluate their products and improve application maintenance and evolution tasks [2]. The developers must analyze app reviews in order to gain insights about the current state of their apps from users' perspectives. Several studies have been proposed to analyze app reviews, including extracting informative text [3], summarizing user reviews [4], identifying bugs or feature requests [5]-[7], prioritizing feature inclusions [8], and extracting insights about the apps [1]. Although (popular) apps are updated frequently [9], app review analysis studies mostly treat app reviews as static [10], [11]. However, app reviews have a dynamic nature, and their discussed topics change over time. Neglecting update-centric analysis misses the point that feedback is written about a certain update [11]. The change in the topics extracted from reviews of different app versions can reveal important issues about the app [10], [11].

A recent example of the importance of discussed topics over time is Zoom Cloud Meetings, a popular app for video conferencing. Zoom received massive one-star ratings (the lowest rating) in the Google Play Store during the COVID-19 outbreak in March 2020. Most of the issues were related to users' concerns about data privacy and security malpractices. These issues were so severe that, in a letter to Zoom, the New York attorney general's office expressed concerns and addressed security flaws [12]. These issues were extensively discussed in the user reviews and could have been uncovered months earlier through proper inspection of the user reviews and topic changes in user feedback on the Google Play Store.

App reviews are short texts that are time/version sensitive, as these texts are generated constantly and are collected regularly for consecutive app versions [9], [10]. The underlying latent topics derived from app reviews can benefit the developers extensively [13]. However, extracting relevant topics from app reviews is challenging due to their dynamic nature and the lack of rich context in short texts.
The most popular topic modeling methods for discovering the underlying topics of a text corpus are LDA [14] and PLSA [15]. But these topic models do not perform well on corpora containing short texts as documents [16]. These algorithms consider individual short texts as separate documents and model each of these documents as a mixture of topics, where each topic is a probability distribution over words. The models then utilize various statistical techniques to determine the topic components and mixture coefficients of each document by implicitly capturing the document-level word co-occurrence patterns [17], [18]. When dealing with typical lengthy documents, the mentioned algorithms can rely on larger word counts to infer how the words are related. However, the natural sparseness of the word co-occurrence patterns in each short document makes these models suffer from the data-sparsity problem [19]. Moreover, short texts lack richness of context, making it more difficult for these topic models to identify the senses of ambiguous words in short documents. The Biterm Topic Model (BTM) alleviates these problems by learning topics over short texts and explicitly modeling the generation of biterms over the whole corpus to enhance topic learning [16].

As mentioned, app reviews are usually collected in batches over consecutive time-slices [10]. Each time a new batch of text data arrives, these topic models (e.g., LDA, BTM) require retraining to discover the latent topic distributions of the new dataset, which is prohibitively time and memory consuming. The popular way to alleviate this scalability problem is to develop online algorithms such as Online LDA (OLDA) [20] and Online BTM (OBTM) [16]. These online algorithms store a small fraction of data on the fly in order to accommodate the dataset of the upcoming time-slice. When a new batch of text data arrives, online algorithms model the topics of the texts either by using the statistics of samples collected in the immediately previous version/time-slice¹ [21] or by naively aggregating the statistics of all the previous time-slices [16]. However, these online algorithms do not take different versions' varying contributions into account. Statistics of the textual data collected over different time periods or different versions may have a non-negligible difference in similarity with those of the latest time-slice, and thus can contribute differently to the latest version [10], [18]. Adaptive versions of online algorithms, such as Adaptively Online LDA (AOLDA), can be used to address the problem of the varying contributions of different versions [10]. But the underlying model in AOLDA is LDA, which again makes it suffer from the mentioned data-sparsity problem while working with short texts.

In this paper, we propose a new adaptive online topic model for short texts which takes previous versions' varying contributions into account. We refer to this novel model as the Adaptive Online Biterm Topic Model (AOBTM). AOBTM inherits the characteristics of BTM to deal with the data sparsity issue. It is an online algorithm that can scale with the increasing volume of data that is generated frequently. AOBTM also endows the statistics of the previous versions with different contributions to the topic distributions of the current version of the dataset. In addition, we employ a preprocessing technique that yields better top contributing key-terms, which helps the manual investigation of the inferred topics.
Our contributions are listed below:
1) We propose a novel method called AOBTM for version-sensitive content analysis of short texts. This method adaptively combines the topic distributions of a selected number of prior versions to generate the topic distributions of the current version.
2) We propose two parallel algorithms: the first algorithm can identify an optimal number of topics to be derived in the latest version, and the second algorithm can identify the optimal number of previous versions to be taken into consideration for the adaptive aggregation of statistical data.
3) To encourage replicability of the research results, we make all scripts, code, and graphs available to the community².

¹ Time-slice and version are semantically equal in this paper.

We have conducted experiments on app review datasets and a Twitter dataset with a large number of records to evaluate the performance of AOBTM against five baseline algorithms. Also, we integrated AOBTM into the state-of-the-art online app-review analysis framework, IDEA, for comparison [10]. Our results show that the topics captured by AOBTM are more coherent than the topics extracted by the baseline methods.

The rest of the paper is organized as follows: Sections II and III describe the background and our topic model design. Sections IV and V are dedicated to the proposed parallel algorithms and the experiments and results, followed by the related works in Section VI. We add threats to validity in Section VII and conclude the paper in Section VIII.

II. BACKGROUND

Topic modeling algorithms such as PLSA and LDA are widely embraced for identifying latent semantic structures in a text corpus without requiring any prior annotation of the documents [22]. These algorithms observe each document as a mixture of topics, while a distribution over the vocabulary terms characterizes each topic. Statistical techniques such as variational methods and Gibbs sampling are then applied to infer the latent topic distributions of given documents and the word distributions of inferred topics [23]. Although these algorithms and their variants have contributed largely to modeling text collections such as blogs, research papers, and news articles [20], [24], [25], they endure considerable performance deterioration while handling short texts [16], [26]. Directly applying these models to short texts incurs a severe data-sparsity problem [19], as the frequency of words in short texts plays a less discriminative role, which makes it hard to infer word correlations from short documents [19]. The limited context in a short text also makes it challenging to identify the senses of ambiguous words.

Researchers have proposed numerous topic modeling algorithms for short texts by trying to address one or two of the following inherent characteristics of short texts: i) the lack of enough word co-occurrence information and the probability that most individual short texts are generated by a singular topic, ii) the inability to fully capture semantically co-related but rarely co-occurring words (due to the lack of statistical information of words among texts), and iii) the probability that a single-topic assumption is too strong for some short texts [22]. In [22], Qiang et al. divided short text topic modeling algorithms into three major categories: Dirichlet multinomial mixture (DMM) based methods, global word co-occurrences based methods, and self-aggregation based methods. Brief introductions to the related works can be found in Section VI.
C. Online Topic Modeling for Short Texts

A particular issue with the traditional topic modeling algorithms is that they cannot scale with an expanding dataset. Whenever a new batch of data arrives, these topic models (e.g., LDA, PLSA) need to be trained from scratch. Moreover, these conventional topic models cannot guarantee consistency in the sequence of topics if independent training runs are performed on different batches of the same corpus [14], [16]. This inconsistency occurs because, before each independent training run, we set the prior topic distribution to a default value, a Dirichlet parameter; so the fixed number of topics (say, K) can be generated in any sequence [16]. For example, when we train a corpus using BTM or LDA with K topics, each independent training run generates K topic distributions, where we cannot ascertain that each time we train the corpus, topic number k (here, k ∈ [1, ..., K]) will always correspond to a specific topic (e.g., "UI component").

Researchers proposed online models (e.g., OBTM, OLDA) to circumvent these problems for streaming datasets. Here, we can assume that the documents are generated in streams and are collected in different time-slices or versions, where the documents are exchangeable within a time-slice. For example, Online BTM (OBTM) accommodates and deals with batches of short-text documents divided into different time-slices or versions. Assume that OBTM has already obtained the topic distribution of the (t−1)-th time-slice. When a new batch (the t-th time-slice) arrives, OBTM utilizes the topic distribution of the (t−1)-th time-slice to set the prior topic distribution of the t-th time-slice. This, in turn, ensures that after the training on the t-th time-slice, the k-th topic in the t-th time-slice is closely related to the k-th topic generated in the (t−1)-th time-slice. If a completely new topic is introduced in the latest batch of documents (the t-th time-slice), the new topic will merge into one (or more) existing topic(s) generated in the (t−1)-th time-slice that has (or have) strong correlation(s) with the introduced topic.

The problem with online topic modeling algorithms is that they do not consider or compare the varying consequential correlations among all the preceding time-slices or versions of the short texts while inferring topics for the latest time-slice. For example, in OBTM, this limitation transpires as the topic model generates the topic distribution of a time-slice by making it directly dependent on the topic distribution of the preceding time-slice. If new topics are being introduced in each time-slice, the k-th topic in the latest time-slice would be significantly different from that of the first time-slice. We cannot reliably compare the distribution of the k-th topic between two non-consecutive time-slices, as OBTM does not impose or investigate the varying correlation between them.

In our proposed method, we aim to alleviate the mentioned problem by adaptively integrating the topic distributions of all the previous time-slices, with their respective weights as contributions, to generate the prior distribution of the latest (t-th) time-slice. This way, we can warrant both the coherence of specific topics and the consistency of the topics' sequence across all the available versions. The adaptiveness enables us to compare the topic distributions of any two different time-slices reliably. In Figure 1, we show an overview of the framework.
Here, version-tagged short texts are processed and fed into the AOBTM algorithm to find a better topic distribution for the latest version by leveraging the previous versions' statistical data. The details of each part are discussed in Sections III and IV.

III. TOPIC MODEL DESIGN

In this section, we discuss the details of the Adaptive Online Biterm Topic Model (AOBTM). This method introduces adaptiveness to give the online algorithm OBTM a version or time-slice sensitivity, so that the prior topic distribution of the latest time-slice takes the varying contributions (topic distribution-wise) of the previous time-slices into account. After setting the prior, we train the model to find the final topic distribution. The details of the proposed method are described below.

We have adopted the definition of a Biterm from [16], where it denotes an unordered term-pair co-occurring in a small, fixed-size window over a term sequence. The fixed-size window is referred to as the short-context. The optimal size of the short-context varies from dataset to dataset and can be considered an important parameter setting. In a given short-context, an unordered pair of any two distinct terms can form a biterm. For example, a short-context of size 3 over the terms (w1, w2, w3) generates the following biterms: (w1, w2), (w1, w3), (w2, w3).

In previous works on topic labeling (e.g., [27], [28]), inferred topics have been labeled with the term(s) that have the most significant contribution to the respective topics. Here, a term denotes any non-redundant word in the document which cannot be found in the Natural Language Toolkit's (NLTK) stopword list. But Gao et al. [9] showed that the most contributing singular term, or a combination of such terms, cannot adequately represent the respective topic. So, instead of using singular terms, we use meaningful phrases to label the topics, where a Phrase refers to two frequently co-occurring words. To ensure the comprehensibility of the extracted phrases, we use a Pointwise Mutual Information (PMI) based phrase extraction method [29], where a higher frequency of two words' co-occurrence warrants the generation of a more meaningful phrase. For our model, we have empirically set our frequency threshold to 24. After identifying the phrases, we convert them into single terms by joining their words with an underscore (e.g., w1_w2), to train them along with the other terms using our algorithm. During biterm extraction, words constructing the phrases are also considered as terms when they appear outside the identified phrases. We train the phrases to capture their underlying semantics, which, in turn, helps us label the topics with the most relevant phrases. We further demonstrate this modification's impact in the experiment section.

To alleviate the data-sparsity problem faced by AOLDA and to capture more coherent, comprehensible, and discriminative topics, we propose an adaptive online topic modeling method, AOBTM, which improves OBTM by adaptively combining the topic distributions of previous versions. The details of the proposed AOBTM method are depicted in Figure 2. After preprocessing, the short texts are separated into different time-slices or versions and input into AOBTM sequentially. AOBTM treats the short-text set of each time-slice as a separate corpus. We denote the whole corpus as R = {R^1, R^2, ..., R^t}, where t indicates the t-th time-slice. Following the literature, we denote the prior distribution over corpus-topics as α and the prior distributions over topic-words as β; both α and β are defined initially.
The topic-word distributions determine each topic's distribution over all the non-redundant terms (including the phrases) that appear in the corpus. The number of topics is specified as K. For the k-th topic, Φ_k^t is the probability distribution vector over all the input terms in the t-th time-slice. We introduce a new parameter, win (window size), which defines the number of previous versions to be considered for inferring the topic distributions of the current version. The overview of the AOBTM model is depicted in Figure 2.

Different from OBTM (and similar to AOLDA), as shown in Figure 2, we adaptively integrate the topic distributions of the previous win versions, denoted as Φ^{t−1}, Φ^{t−2}, ..., Φ^{t−i}, ..., Φ^{t−win}, to generate the prior β^t for the t-th version. The adaptive integration sums up the topic distributions of the different versions with different weights γ^{t,i}:

\beta_k^t = \sum_{i=1}^{win} \gamma_k^{t,i}\,\Phi_k^{t-i} \quad (1)

Here, i denotes the i-th previous version (1 ≤ i ≤ win), and n_{w|k}^t denotes the number of times word w is assigned to topic k in time-slice t. The weight γ_k^{t,i} is determined by considering the similarity of the k-th topic between the (t−i)-th version and the (t−1)-th version, which is calculated by the following softmax function:

\gamma_k^{t,i} = \frac{\exp\left(\Phi_k^{t-i} \cdot \beta_k^{t-1}\right)}{\sum_{j=1}^{win} \exp\left(\Phi_k^{t-j} \cdot \beta_k^{t-1}\right)} \quad (2)

where the dot product (computed as an Einstein summation) measures the similarity between the topic distribution Φ_k^{t−i} and the prior of the (t−1)-th version, β_k^{t−1}. This adaptive aggregation allows the topics of the previous versions to endow different contributions to the topic distributions of the current version. The steps are shown in Algorithm 1.

Algorithm 1: Adaptive Online BTM
  Set β^(t) by Eq. 1 and α^(t) by Eq. 4;
  Randomly assign topics to biterms in B^(t);
  for iteration = 1 to N_iter do
      for each biterm b_i in B^(t) do
          Draw topic z_i = k from Eq. 3 and update the counts n_k and n_{w|k};
      end
  end
  Compute Φ^(t) by φ_{k,w} = (n_{w|k} + β) / (n_{·|k} + Wβ); (refer to [16])
  Compute θ^(t) by θ_k = (n_k + α) / (N_B + Kα); (refer to [16])

In Algorithm 1, B denotes the biterm collection. Here, N_B is the number of biterms, and b_i denotes a biterm with the two terms w_{i,1} and w_{i,2} in the i-th biterm. We use W as the total number of words in the vocabulary, and θ^(t) as a K-dimensional multinomial distribution which denotes the corpus-topic distribution of a time-slice. Here, n_k, n_{k|d}, and n_{w|k} denote the number of biterms assigned to topic k, the number of words in document d assigned to topic k, and the number of times word w is assigned to topic k, respectively. In Algorithm 1, the topics are drawn from Eq. 3, and the prior distribution α of the latest time-slice is calculated using Eq. 4:

P(z_i = k \mid \mathbf{z}_{-i}, B^{(t)}) \propto (n_k + \alpha_k^t)\,\frac{(n_{w_{i,1}|k} + \beta_{k,w_{i,1}}^t)(n_{w_{i,2}|k} + \beta_{k,w_{i,2}}^t)}{\left(\sum_{w}(n_{w|k} + \beta_{k,w}^t)\right)^2} \quad (3)

\alpha_k^{(t+1)} = \alpha_k^{(t)} + n_k^{(t)} \quad (4)

where z ∈ [1, K] refers to the topic indicator variable and P(z) refers to the prevalence of topics in the corpus. We use symmetric Dirichlet distributions as the initial priors by setting α^1 = (α, ..., α) and β_k^1 = (β, ..., β). Given α^t and [β_k^t] for k = 1, ..., K, we iteratively draw topic assignments for each biterm b_i ∈ B^t according to the conditional distribution stated in Eq. 3. Once the iterations are completed, we obtain the counts n_k^t and n_{w|k}^t. We adjust the hyperparameters α^t and [β_k^t] for time-slice (t+1) by setting α_k^{(t+1)} and β_k^{(t+1)} using Eq. 4 and Eq. 1, respectively. The derivation of Eq. 3 and Eq. 4 can be found in [16].

Complexity Analysis. In this section, we discuss the details of the running time and memory requirements of AOBTM and compare them with different batch, online, and adaptive online algorithms. We have listed the time-complexity and the number of in-memory variables of different topic models in Table I.
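As a concrete reference for the adaptive aggregation of Eq. 1 and Eq. 2 (which is also the source of the extra O(vKW) term discussed below), here is a minimal sketch; the function and variable names are ours, not those of the released implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of Eq. 1 and Eq. 2 for a single topic k: build the prior
// beta_k^t for the latest time-slice from the topic-word distributions
// of the previous `win` versions. phiPrev[i] holds Phi_k^{t-1-i} and
// betaPrev holds beta_k^{t-1}; all vectors have length W (vocabulary size).
std::vector<double> adaptivePrior(
    const std::vector<std::vector<double>>& phiPrev,  // win x W
    const std::vector<double>& betaPrev) {            // W
  const std::size_t win = phiPrev.size();
  const std::size_t W = betaPrev.size();

  // Similarity of each previous version's topic with the last prior:
  // a plain dot product, i.e., the Einstein summation of Eq. 2.
  std::vector<double> sim(win, 0.0);
  for (std::size_t i = 0; i < win; ++i)
    for (std::size_t w = 0; w < W; ++w)
      sim[i] += phiPrev[i][w] * betaPrev[w];

  // Softmax normalizer over the similarities (denominator of Eq. 2).
  double norm = 0.0;
  for (std::size_t i = 0; i < win; ++i) norm += std::exp(sim[i]);

  // Weighted sum of the previous distributions yields beta_k^t (Eq. 1).
  std::vector<double> betaT(W, 0.0);
  for (std::size_t i = 0; i < win; ++i) {
    const double gamma = std::exp(sim[i]) / norm;  // gamma_k^{t,i} (Eq. 2)
    for (std::size_t w = 0; w < W; ++w)
      betaT[w] += gamma * phiPrev[i][w];
  }
  return betaT;
}
```

Because the weights are a softmax over similarities to the most recent prior, a previous version whose k-th topic has drifted away contributes only marginally to β_k^t.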
In the following discussion, l̄ refers to the average document length and N_D refers to the number of documents in the corpus. We can assume that all the documents in a short-text corpus have almost the same length [16], [21]. It is reasonable to infer N_B (the number of biterms in the corpus) using this assumption, as we apply N_B only to the topic models devised for short texts (i.e., BTM, OBTM, and AOBTM). According to our assumption, each document of length l̄ produces l̄(l̄−1)/2 biterms; so we have N_B ≈ N_D · l̄(l̄−1)/2.

Furthermore, in Table I, win denotes the user-defined window size (the number of previous versions to consider) in the adaptive online algorithms, W denotes the total number of terms, and v refers to the number of available time-slices. The most time-consuming part of these topic models is the component calculating the conditional probability of topic assignments, which requires O(K) time. While LDA draws a topic for each word occurrence, BTM draws a topic for each biterm. So the overall time-complexities of LDA and BTM turn out to be O(N_iter · K · N_D · l̄) and O(N_iter · K · N_B), respectively [14], [16]. From our assumption-based calculation of N_B, we can further expand the time-complexity of BTM to O(N_iter · K · N_D · l̄(l̄−1)/2), which is approximately (l̄−1)/2 times the time-complexity of LDA. As BTM works with short texts, where the value of l̄ is considerably small, the runtime of BTM can still be compared to that of LDA [21].

The online algorithms, such as OLDA and OBTM, deal with the documents and short texts, respectively, present in the latest time-slice. In Table I, we have used the superscript t to denote the latest time-slice or version. But the adaptive online algorithms (i.e., AOLDA and AOBTM) compare and determine the contributions of the previous v topic-word distributions of different time-slices, which requires an additional O(vKW) time.

Number of Variables Stored in Memory. LDA maintains the following counts in cached memory: the number of words in document d assigned to topic k, n_{k|d} (= N_D·K), and the number of times word w is assigned to topic k, n_{w|k} (= W·K). LDA also stores the topic assignment of each word occurrence (= N_D·l̄) [30]. On the other hand, BTM stores the following variables: the per-topic biterm counts n_k (= K), the number of times word w is assigned to topic k, n_{w|k} (= W·K), and the topic assignment of each biterm (= N_B) [16].

Unlike the batch algorithms, online topic models do not require running over all documents (in the case of OLDA) or all biterms (in the case of OBTM) observed up to the latest time-slice. Instead, OLDA only iteratively runs over the words present in the current time-slice's documents, whereas OBTM only iterates over the biterm set of the latest time-slice. These online algorithms require an almost constant memory cost to update the models, since the number of documents, their average length, and the number of biterms are often stable [20], [21].

In the adaptive online algorithms, the topic-word assignments of different versions are compared, weighted, and combined to set the prior topic-word distribution of the latest time-slice. Therefore, the counts n_{w|k} (= W·K) of all the previous time-slices need to be stored in cache memory. As win ∈ [1, ..., v], we consider vWK as the number of counts stored in memory. From Table I, we can see that AOBTM's time complexity is higher, but it is comparable to the other algorithms when dealing with fewer short texts. In practice, the number of texts is bound to decline as they are separated into different versions or time-slices.
On the other hand, AOBTM has to store some additional variables to accommodate adaptiveness, yet it incurs less memory cost than the other adaptive online algorithm, AOLDA.

IV. NUMBER OF TOPICS TO INFER AND PREVIOUS VERSIONS TO CONSIDER

Two parameters in the adaptive online topic modeling method play key roles in the quality of the discovered topics: (i) the number of topics to derive, and (ii) the number of previous versions to consider for adaptive integration. In previous studies, the values of these crucial parameters were set via informed guesses established from manual examinations of the dataset [10]. We propose algorithms to determine the values of these parameters automatically.

Before developing algorithms to find optimal values of these parameters, we need to determine a suitable evaluation metric for measuring the quality of the discovered topics. Perplexity (or marginal likelihood) evaluated on a held-out test set has been utilized in many studies to assess the effectiveness and efficiency of a generated topic model [14], [31], [32]. But minimized perplexity as a metric is not suitable for our approach, for the following reasons. First, the mentioned studies focused on LDA-based topic models, where the likelihood of word occurrences in documents is optimized, whereas in our approach the likelihood of biterm occurrences in the latest time-slice is optimized. Second, it was argued in [33] that topic models with a better held-out likelihood might infer less semantically meaningful topics, which deviates from our underlying expectations of topic models (e.g., better interpretability and coherence of the derived topics).

For our purpose, we can use the Coherence Score or the PMI-Score. The Coherence Score is a metric for automatically measuring the quality of the discovered topics [34]. It posits that a topic is more coherent if the most probable words in that topic co-occur more frequently in the corpus. On the other hand, the PMI-Score measures the coherence of a topic based on pointwise mutual information, using large-scale text datasets from external sources such as Wikipedia [29]. This idea resonates with the underlying assumption of our approach, which maintains that words co-occurring more frequently in an external dataset should be more likely to belong to the same topic. Since the external dataset is model-independent, the generated PMI-Score fluctuates consistently for distinct topic models with different parameter values [16]. Therefore, we exploit the PMI-Score to evaluate the discovered topic quality. It measures the pairwise association among the T most contributing words of a discovered topic k:

\text{PMI-Score}(k) = \frac{2}{T(T-1)} \sum_{1 \le i < j \le T} \log\frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \quad (5)

Here, P(w_i), P(w_j), and P(w_i, w_j) are the probabilities of word w_i, word w_j, and the co-occurring word-pair (w_i, w_j), respectively. The probabilities are estimated empirically from a fixed external dataset. Following the literature [16], [21], we computed the PMI-Scores using 5.4 million English Wikipedia articles as the external dataset. We used an open-source web-scraper API³ to scrape the articles, which have an average length of 362.7 words. To determine the overall PMI-Score of a topic model, we take the average of the PMI-Scores produced by its distinct topics: PMI-Score = (1/K) Σ_k PMI-Score(k).
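Both parallel algorithms below rank candidate parameter values by this score. A minimal sketch of Eq. 5, assuming the word and word-pair probabilities have already been estimated from the external corpus (the container layout is our own choice):

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of Eq. 5: PMI-Score of one topic, given its T most probable
// words. p1 holds P(w) and p2 holds P(w_i, w_j); both are assumed to be
// estimated beforehand from the external (e.g., Wikipedia) corpus.
double pmiScore(
    const std::vector<std::string>& topWords,
    const std::map<std::string, double>& p1,
    const std::map<std::pair<std::string, std::string>, double>& p2) {
  const std::size_t T = topWords.size();
  if (T < 2) return 0.0;
  double sum = 0.0;
  for (std::size_t i = 0; i + 1 < T; ++i)
    for (std::size_t j = i + 1; j < T; ++j) {
      const auto it = p2.find({topWords[i], topWords[j]});
      if (it == p2.end()) continue;  // unseen pair; smoothing is an alternative
      sum += std::log(it->second /
                      (p1.at(topWords[i]) * p1.at(topWords[j])));
    }
  return 2.0 * sum / static_cast<double>(T * (T - 1));
}
```

The model-level score is then the mean of this value over the K discovered topics.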
An inadequate number of topics could render our topic model too coarse to identify distinct and particular topics. Conversely, an inordinate number of topics could deliver a model that is too involved, making subjective validation and interpretation difficult. To estimate the most appropriate number of topics for our topic modeling approach, we propose a 2-step parallel algorithm. For the parallelization, we have employed OpenMP, a set of compiler directives and an API for our program (written in C++) that provides support for multi-platform, multiprocessing programming in shared-memory environments. OpenMP enabled us to write the algorithm so that the multithreading directives are skipped (or replaced with regular arguments) on machines that do not have OpenMP installed. The algorithm designed to determine the optimal number of topics to infer is provided in Algorithm 2.

The first step of our parallel algorithm takes an array of integers, InputArr. This array stores candidate numbers of topics, such as [n_1, n_2, ..., n_t], where each n is an integer and t is the array size. If core refers to the number of available CPU cores, it is advisable to limit t to [2, core − 1]. This limit warrants that only one core is assigned to each element of the array. Each core, in turn, builds iter AOBTM models and calculates the respective PMI-Scores. If we independently train AOBTM multiple times on the same dataset with the same number of topics, we end up with slightly different PMI-Scores for each independent training. So, we designed our algorithm in such a way that, for one candidate number of topics, each core builds iter models and generates separate PMI-Scores. We stabilize the metric for the corresponding candidate number by taking the average of the distinct PMI-Scores. We find the optimal candidate number (optVal) through reduction (an OpenMP operator) after all the threads finish their execution.

In the second step of our algorithm, we fine-tune near the optimal candidate number (optVal) to determine the final value of the optimal topic number, optTopicNum. The user may specify span with an integer, which defines the breadth of the grid-search around optVal observed in the algorithm's first step. Without any user specification, span is automatically set to core − 1. For each integer in the range of span around optVal, we repeat the procedure of the first step to determine the optimal value and set it as the appropriate number of topics. Fig. 3 illustrates the first and second phases of Algorithm 2. In essence, the first phase determines the appropriate number of topics to infer (optVal) by evaluating the elements of InputArr; the second phase determines the optimal topic number by evaluating the integers around optVal. The algorithm ensures that in each phase, one core evaluates only one integer topic number.

Earlier, we discussed how previous versions or time-slices make different contributions to the topic distributions of the latest time-slice. In AOBTM, we form the prior topic distributions of the t-th time-slice by taking the weighted contributions of the previous win time-slices into account. Users can define the parameter win ∈ [1, v − 1] in Algorithm 1, where v denotes the number of available time-slices. To make an educated decision about the parameter win, we can analyze the change in PMI-Scores for different values of win. We build (v − 1) AOBTM models for the latest time-slice with different values of win and calculate their PMI-Scores. In the parallel block, we use the OpenMP reduction clause to find the maximized PMI-Score. If OpenMP is not enabled on a machine, we store the scores in an array, where each score's index corresponds to the number of considered versions, as sketched below.
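A minimal sketch of this parallel block, in the array-based form whose scores are scanned next (buildAndScoreAOBTM is a hypothetical stand-in, not part of the released code):

```cpp
#include <vector>

// Hypothetical stand-in: train one AOBTM model on the latest time-slice
// with window size `win` and return its averaged PMI-Score. The real
// implementation would run Algorithm 1 and score the topics via Eq. 5.
static double buildAndScoreAOBTM(int win) {
  return -0.1 * win;  // placeholder value, for the sketch only
}

// Sketch of the parallel block of Algorithm 3: one candidate window size
// per thread, scores stored in an array indexed by win. If OpenMP is not
// enabled, the pragma is ignored and the loop simply runs sequentially.
std::vector<double> scoreWindows(int v) {  // v = number of time-slices
  std::vector<double> scores(v - 1, 0.0);
  #pragma omp parallel for schedule(dynamic)
  for (int win = 1; win <= v - 1; ++win)
    scores[win - 1] = buildAndScoreAOBTM(win);
  return scores;  // scanned afterwards for the cutoff point
}
```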
Then, with one scan of the array, we determine the cutoff point where the PMI-Score dropped and did not rise again. We propose Algorithm 3 to determine the appropriate number of previous versions to consider automatically.

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the performance of AOBTM in identifying consistent and distinctive latent topics from corpora comprising short text documents. We explain the datasets and compare the results of the different topic modeling algorithms. Our focus is to answer the following research questions:
• RQ1: Can AOBTM achieve better performance compared to other topic modeling methods?
• RQ2: How do different parameter settings, document lengths, and preprocessing approaches impact the performance of AOBTM?
• RQ3: Using the parameters set by our parallel algorithms, how discriminative and coherent are the topics discovered by the different topic modeling methods?

A. Setups

1) Datasets: To show the effectiveness of our approach, in addition to using app reviews, we use a large dataset of Twitter microblogs. Tweets are considered short texts, and evaluation on this dataset shows the applicability of AOBTM to short text analysis. The details of the datasets are as follows:

App Reviews from Apple Store and Google Play. We use the dataset provided by Gao et al. [10], which was previously studied to evaluate AOLDA for extracting topics from app reviews. The dataset includes reviews that are related to a number of versions of the collected apps. The subject apps are distributed across different categories and platforms; this choice ensures the generalizability of our approach. We enriched the provided datasets by adding the user reviews collected from the latest versions of the subject apps. However, one of the apps in this dataset, "Clean Master," was discontinued, and we could not acquire its app-changelogs. Another app in the provided dataset, namely "eBay," had pulled an enormous number of app reviews from the app stores. As the changelogs are critical to our evaluation metrics, we decided to discard these two apps from the evaluation. We double-checked the provided app reviews and changelogs in the dataset against the app stores and discarded the ones that could not be found. Table II summarizes the specifications of the app review datasets.

Tweets2020 is a collection of approximately 200,000 tweets scraped from Twitter between January 1st and May 20th, 2020, where each month is considered a time-slice. For the collection of the tweets, we used an open-source twitter-scraper⁴. We used the 300 top trending topics over the region of North America to collect the tweets with timestamps. Besides the content, each tweet includes the user id, timestamp, number of retweets, and likes.

The user review and tweet collections contain many noisy words, such as repetitive words, casual words, misspelled words, and non-informative words (e.g., "normally"). We performed common text preprocessing techniques, including removing meaningless words, lowercasing, lemmatization, and digit and name replacement, following [35]. We apply the preprocessing technique in [35] for lemmatization and replace all digits with ".". We also removed duplicate records and documents with a single word.

2) Baselines: We select LDA [14], BTM [16], OLDA [20], OBTM [21], and AOLDA [10] as the baseline methods to evaluate the performance of AOBTM. The details of the baseline algorithms are explained in Section II. All the experiments were carried out on a Linux machine with an Intel 2.21 GHz CPU and 16 GB memory.
Following the literature [10], we have used implementations of all algorithms based on Gibbs sampling in C++⁵.

3) Evaluation Metrics: Good topic models deliver coherent [28] and discriminative topics, which cover unique and comprehensive aspects of the corpus [10]. So, we utilize the PMI-Score as a measure of coherence [16] and a Discreteness Score (Dis Score) to measure the discriminative property of the derived topics, inspired by the semantic similarity mapping in [10]. Higher values of the PMI-Score and Dis Score suggest the discovery of more coherent and discriminative topics. We also present the time cost (in seconds) per iteration (Time Cost in Table III) as a third performance metric. We picked the top 10 terms of each generated topic to calculate the PMI-Scores, as explained in Section IV. For calculating the Dis Score, we use the Jensen-Shannon (JS) divergence, D_JS [36], to estimate the difference between two topic distributions (Φ). The equations are provided below:

\text{Dis Score} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{K}\sum_{j=1, j\ne k}^{K} D_{JS}(\Phi_k^t \,\|\, \Phi_j^t) \quad (6)

D_{JS}(P\|Q) = \frac{1}{2}D_{KL}(P\|M) + \frac{1}{2}D_{KL}(Q\|M) \quad (7)

D_{KL}(P\|Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)}, \qquad M = \frac{1}{2}(P+Q) \quad (8)

Eq. 7 elaborates the D_JS of Eq. 6. Eq. 8 defines the D_KL (Kullback-Leibler divergence) and the M of Eq. 7. In Eq. 6, the inner term, Σ_{j=1,j≠k}^K D_JS(Φ_k^t || Φ_j^t)/K, measures the average difference of a single topic distribution from the rest of the topic distributions. In Eq. 8, P(i) (or Q(i)) is the i-th item of P (or Q).

To provide a better comparison, we adopted three more performance metrics used in [10] to evaluate the performance of AOLDA. These metrics are Precision_E, Recall_L, and F_hybrid. Further details about these metrics are discussed in [10]. For the app reviews and the Twitter data, we have taken app-changelogs and popular hashtags, respectively, as our ground truths. Here, Precision_E measures the accuracy in detecting emerging topics in the latest time-slice, t [10]. Recall_L evaluates whether our prioritized topics (including both emerging and non-emerging ones) reflect the changes mentioned in the changelogs or hashtags. A higher F_hybrid suggests that the changelogs and hashtags are more explicitly covered by the detected topics. A higher F_hybrid score also signifies that the prioritized issues reflect more of the changelog and hashtag contents [10].

⁴ https://pypi.org/project/twitter-scraper/
⁵ Code of BTM: http://code.google.com/p/btm/

B. Result of RQ1: Comparison Results with Different Methods

Table III presents the evaluation results, where P_E, R_L, and F_h refer to Precision_E, Recall_L, and F_hybrid, respectively. During the evaluation for RQ1, we set the parameters as win = 3 and K = 10 for the adaptive online algorithms, for the sake of uniformity. We initialized α = 0.05 and β = 0.01 for the LDA-based methods, as they have achieved the best performance with these values for short texts in [16]. We set α = 50/K and β = 0.01 for the BTM-based algorithms [16].

From Table III, we observe that AOBTM delivers the highest PMI-Scores on every dataset by alleviating the data sparsity problem and considering the varying contributions of different time-slices or versions. So, the topics discovered by AOBTM are more coherent and comprehensible. For discriminative topic learning, AOBTM performs better than the other methods, except on the Tweets2020 dataset. The large number of short texts per time-slice in the Tweets2020 dataset helps AOLDA learn better document-level word correlations and infer more discriminative topics; still, AOLDA did not generate a higher PMI-Score than AOBTM for Tweets2020.
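For reference, the Dis Score used in Table III can be computed directly from the topic-word distributions; a small self-contained sketch of Eqs. 6-8:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Kullback-Leibler divergence (Eq. 8); terms with P(i) = 0 contribute 0.
double klDiv(const std::vector<double>& p, const std::vector<double>& q) {
  double d = 0.0;
  for (std::size_t i = 0; i < p.size(); ++i)
    if (p[i] > 0.0) d += p[i] * std::log(p[i] / q[i]);
  return d;
}

// Jensen-Shannon divergence (Eq. 7) with M = (P + Q) / 2; M(i) > 0
// wherever P(i) > 0, so klDiv never divides by zero here.
double jsDiv(const std::vector<double>& p, const std::vector<double>& q) {
  std::vector<double> m(p.size());
  for (std::size_t i = 0; i < p.size(); ++i) m[i] = 0.5 * (p[i] + q[i]);
  return 0.5 * klDiv(p, m) + 0.5 * klDiv(q, m);
}

// Dis Score (Eq. 6) over the K topic-word distributions of one time-slice.
double disScore(const std::vector<std::vector<double>>& phi) {
  const std::size_t K = phi.size();
  double outer = 0.0;
  for (std::size_t k = 0; k < K; ++k) {
    double inner = 0.0;
    for (std::size_t j = 0; j < K; ++j)
      if (j != k) inner += jsDiv(phi[k], phi[j]);
    outer += inner / K;  // the inner sum is divided by K, following Eq. 6
  }
  return outer / K;
}
```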
From the results, it is apparent that AOBTM exceeds the benchmark methods, including its online version, OBTM, as well as AOLDA. Although AOBTM only marginally improved on the performance of AOLDA, the performance improvement of AOLDA over OLDA is approximately the same as that of AOBTM over OBTM. AOBTM also generated the highest scores on every dataset for Precision_E, Recall_L, and F_hybrid, which indicates that our topic model can select emerging topics more precisely. We acknowledge that AOBTM is more time-expensive than all the other baselines, but its runtime is comparable to the adaptive online methods when the dataset is small. From Table III, we observe that the difference in runtime between AOLDA and AOBTM is trivial; AOBTM even outperforms AOLDA in runtime on the NOAA Radar dataset, which has the lowest average number of short texts per version.

Effect of Different Parameter Settings. In Fig. 4 and Fig. 5, we compare our method with AOLDA, as the number of previous versions to consider is unique to the adaptive online algorithms. In Fig. 4, we consider a distinct uniform topic number for each dataset and calculate the PMI-Scores for different numbers of previous versions. The topic number for each dataset is calculated by Algorithm 2. In Fig. 5, we consider a fixed window size (number of versions) and calculate the PMI-Scores for varying numbers of topics. We can observe that AOBTM in general generates the highest PMI-Scores, and the trendlines of both methods are analogous. In Fig. 4, the declines in the performance of AOBTM (e.g., win = 25 on the YouTube dataset) can transpire for the following reasons: the emergence of an unrelated novel topic in the recently considered versions (e.g., from the 20th to the 25th version of the YouTube dataset), content drifting, and a higher occurrence of meaningless texts in the newly included versions [21]. In Fig. 5, in all cases, the methods produce increasing PMI-Scores until the tipping point that generates the highest score. Once the methods reach their peak, a further increase in the number of topics generates incoherent, coinciding topics, inducing a reduction in the PMI-Scores.

Effect of Document Length. In Fig. 6, we present AOBTM's performance, using PMI-Scores, with respect to varying document lengths. We consider the average document length of each dataset to evaluate the considered methods. Tweets2020, NOAA Radar, YouTube, Viber, and SwiftKey have average document lengths of 68.3, 8.5, 13.6, 9.4, and 6.2, respectively. AOBTM performs better than the other methods on all datasets. It is worth noting that, as expected, the LDA-based methods performed well with large document lengths and bigger corpora, mostly because of such datasets' content richness and abundance of document-level word co-occurrences.

Effect of Preprocessing. In Section III, we mentioned a preprocessing technique with phrase extraction that can deliver more comprehensible top contributing terms for each discovered topic. We implement AOBTM+ with the phrase-extraction preprocessing technique. In Table IV, we explicate how this technique helps AOBTM+ generate better topic words than AOBTM without phrase extraction. We selected a topic discovered from the Twitter dataset which is related to racism. We can see that extracting phrases during preprocessing and training them with the rest of the terms helps distinguish key terms for topic representation, which are captured only by AOBTM+.
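The phrase-merging step behind AOBTM+ can be sketched as follows; for brevity this is a simplified frequency-threshold variant (the paper's extractor is PMI-based [29]), with the threshold of 24 taken from Section III and everything else being our own illustration:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of the phrase-merging preprocessing behind AOBTM+: adjacent
// words that co-occur at least `minFreq` times are joined into a single
// term "w1_w2" before biterm extraction.
std::vector<std::string> mergePhrases(const std::vector<std::string>& tokens,
                                      int minFreq = 24) {
  // Count adjacent word pairs over the token stream (document boundaries
  // are ignored here to keep the sketch short).
  std::map<std::pair<std::string, std::string>, int> pairFreq;
  for (std::size_t i = 0; i + 1 < tokens.size(); ++i)
    ++pairFreq[{tokens[i], tokens[i + 1]}];

  // Re-tokenize, joining frequent pairs with '_'. Words still appear as
  // plain terms wherever they occur outside an identified phrase.
  std::vector<std::string> out;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    if (i + 1 < tokens.size() &&
        pairFreq[{tokens[i], tokens[i + 1]}] >= minFreq) {
      out.push_back(tokens[i] + "_" + tokens[i + 1]);
      ++i;  // skip the second word of the merged phrase
    } else {
      out.push_back(tokens[i]);
    }
  }
  return out;
}
```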
In [37] [38] [39] [40] , researchers have explored different ways to finetune the topic models' parameters. Their basic approach is to train various topic models (with different parameter settings) over several iterations to select one with the best performance. All the proposed procedures are computationally expensive, especially when executed sequentially. Moreover, all the existing libraries and packages that implement the mentioned procedures use LDA based models [41] . So, we proposed two parallel algorithms as described in Section IV to determine 2 important parameters in our approach automatically: i) the number of topics to discover (K) and ii) the number of previous versions/time-slices to consider (win). In Table V , time-complexities are provided for both algorithms. It is worth noting that the proposed algorithms run sequentially if the environment is not set up to perform parallelism. We have employed the proposed algorithms to determine the best values for the parameters K and win for each dataset. For Tweets2020, Youtube, Viber, NOAA, and Swiftkey, Algorithm 2 determined the corresponding numbers of topics to be derived as 31, 22, 18, 13, and 11, respectively, whereas Algorithm 3 determined the best win value to be considered as 5, 34, 9, 16 , and 11, respectively. After setting the best detected parameters for the online and adaptively online algorithms, we have calculated the PMI-Scores for each dataset and present the results in Fig. 7 . The plot shows that AOBTM outperforms all the other online algorithms for all datasets. Compared to OBTM, AOLDA and OLDA perform better for datasets that contain longer documents. Furthermore, OBTM outperforms both of the LDA based methods when it comes to the limited datasets containing short texts. Prompt detection of emerging topics from user reviews is crucial for app developers as these topics reveal users requirements, preferences, and complaints [10] . Developers can proactively identify user-complaints and take quick actions to improve user experience through efficient analysis of the app-reviews. Timely and precisely identifying emerging issues helps developers to fix bugs, refine existing features, and add essential functions in the subsequent update of the application. For this purpose, Gao et al. developed a framework named IDEA to detect emerging issues from the app-reviews of popular applications [10] . IDEA collects user reviews after the publication of different versions of the app and implement AOLDA to get the topic-word distributions for the app reviews collected after the publication of the latest version. We have modified the open-source framework IDEA and incorporated AOBTM instead of AOLDA to generate the topic-word distribution. The rest of the framework's components, such as preprocessing, emerging topic identification, and topic interpretation, remain the same. The modified version is denoted as OPRA (Online App Review Analysis). Inspired by the zoom case study as explained in Section I, we have collected around 15,000 app reviews for Zoom Cloud Meetings from Google Play. The average review length for this dataset is 7.8. These reviews were generated after the publication of the latest 5 versions of the app. For emerging topic detection, we set the parameters as win = 3 and K = 10 for fair evaluation. We also changed the initial values of α and β to 0.1 and 0.01, respectively, as these values yielded best performance for IDEA (implementing AOLDA) in [10] . 
In Table VI , we have reported top 5 most contributing words from two topics generated by IDEA and OPRA: first topic is closely related to app-security, and second topic is closely related to messaging feature of the app. In Table VII , we have reported the corresponding PMI-Scores and Time-Cost for both frameworks. We can see that applying AOBTM in the framework slightly increases the time-cost, but generates more comprehensive and coherent topics. For the topic modeling for short texts, PLSA, LDA, and their variants suffer from the lack of enough word cooccurrences. To boost the performance of topic models, researchers had utilized external knowledge to produce supplementary essential word co-occurrences across short texts [42] [43] [44] . The problem rests in that auxiliary information can be too scarce or too expensive (or both) for deployment. In the short text topic modeling regime, Yin at al. introduced the DMM based topic modeling method in [45] , where it is presumed that each short-text is sampled from only one latent topic. But this proved to be too simple and too strong of an assumption for any reasonable short text topic model [46] . In self-aggregation based methods, short texts are merged into long pseudo-documents before topic inference to help develop rich word co-occurrence information. Researchers have used this type of method in [47] , [48] , where they have presumed that each short text is sampled from a long concatenated pseudo-document (unobserved in current text collection). This presumption enables inferring latent topics from long pseudo-documents. But the concatenation yielded suboptimal results in [49] , [50] as merging short texts into long pseudo-documents using word embeddings cannot alleviate the loss of auxiliary information or metadata. Global word cooccurrences based methods (i.e., [16] , [51] ) try to use the rich global word co-occurrence patterns for inferring latent topics, where the adequacy of these co-occurrences alleviates the sparsity problem of short texts. [16] posits that the two words in a biterm share the same topic drawn from a mixture of topics over the whole corpus. This topic modeling algorithm is comparatively more robust and suitable for all the mentioned characteristics of short texts [52] . To solve the problems with streaming short texts and unordered topic generation, researchers proposed online models such as OLDA [20] and Online BTM [16] . In essence, online algorithms fit conventional topic models (i.e., BTM, LDA, respectively) over the data in a time-slice t and use the inferred statistical data to adjust Dirichlet hyperparameters for the next time slice. Gao et al. [10] introduced the Adaptively Online LDA (AOLDA) by factoring in all the previous time-slices' contribution, instead of just the preceding one. Here, they have shown that the comparison among the topic distributions for more than two consecutive time-slices/versions can lead to more coherent and distinguishable topic learning. But this specific method uses LDA as their underlying topic model, which suffers from several discussed issues when it comes to short texts [16] . Human Evaluation. In our experiments, we have only used deterministic scores to evaluate our results against the benchmarks. The authors of this paper have manually reviewed the outcomes of the considered topic models. 
We have consulted with fellow researchers about the model outcomes and have considered the opinions of industry developers, which confirmed that AOBTM generates more coherent key-words for the extracted topics. But we did not perform any formal human evaluation to assess how well our model performs in practice compared to the others.

Datasets. Our proposed topic model works with text corpora distributed over different versions or time-slices. Evaluating our model using only version-tagged app reviews from mobile applications that have multiple published versions in the app store would not give us a precise idea about how the algorithm works with time-slices. Moreover, we have evaluated our approach using only a few mobile applications, which might affect the generality of our model. To mitigate both problems, we have incorporated a new dataset with around 200,000 timeline-tagged tweets, scraped by considering the 300 top trending topics from Canada and the USA. Furthermore, we carefully selected the apps so that we could demonstrate our topic model's performance for apps that have a small or large number of reviews per version (~623-2,147).

Ground Truth. To measure the extensibility of our topic model, we wanted to know how it scales against other online algorithms for prioritizing topics and detecting emerging ones. For selecting the ground truth, we have used key-terms from app-changelogs for the app reviews, as Gao et al. did in [10]. In order to calculate precision, recall, and F-score, the app-changelogs are used as the ground truth, similar to [10]. However, we did not have any changelogs or tweet summaries to take as the ground truth for Tweets2020. We have manually selected top-trending hashtags over different time-slices to mine the key-terms. Other approaches for evaluation should be studied.

Memory and Time Cost. We acknowledge that our model cannot compete with the benchmarks when it comes to memory and time cost. Still, we are currently endeavoring to incorporate a word co-occurrence pattern algorithm to make our topic model faster while using significantly fewer resources.

VIII. CONCLUSION

In this paper, we proposed a novel adaptive topic modeling algorithm, AOBTM, which is able to discover coherent and discriminative topics from short texts. AOBTM addresses the problems of conventional topic models by adopting a version-sensitive strategy. Along with AOBTM, we use a preprocessing technique that enables capturing distinguishable terms in an extracted topic. Moreover, we implemented two parallel algorithms to automatically determine the values of the two most important parameters of our model. The results of several experiments on app review and Twitter datasets confirm the performance of AOBTM compared to the state-of-the-art algorithms. We plan to improve the underlying BTM method using short text expansion and concept drifting detection, and to integrate it with a topic visualization tool specifically designed for app reviews. For the parallel algorithms, we plan to use GPU cores and a shared memory cache to make the program run faster.

REFERENCES
[1] Infar: Insight extraction from app reviews.
[2] What parts of your apps are loved by users?
[3] Ar-miner: Mining informative reviews for developers from mobile app marketplace.
[4] A survey of app store analysis for software engineering.
[5] User reviews matter! Tracking crowdsourced reviews to support evolution of successful apps.
[6] User feedback in the appstore: An empirical study.
[7] Bug report, feature request, or simply praise? On automatically classifying app reviews.
[8] Analysis of user comments: An approach for software requirements evolution.
[9] Emerging app issue identification from user feedback: Experience on WeChat.
[10] Online app review analysis for identifying emerging issues.
[11] Studying bad updates of top free-to-download apps in the Google Play Store.
[12] (March) New York attorney general looks into Zoom's privacy practices.
[13] Too many user-reviews, what should app developers look at first?
[14] Latent Dirichlet allocation.
[15] Probabilistic latent semantic indexing.
[16] BTM: Topic modeling over short texts.
[17] Syntactic topic models.
[18] Topics over time: A non-Markov continuous-time model of topical trends.
[19] Empirical study of topic modeling in Twitter.
[20] Online learning for latent Dirichlet allocation.
[21] Online biterm topic model based short text stream classification using short text expansion and concept drifting detection.
[22] Short text topic modeling techniques, applications, and performance: A survey.
[23] Probabilistic topic models.
[24] Integrating document clustering and topic modeling.
[25] Incorporating word correlation knowledge into topic modeling.
[26] The dual-sparse topic model: Mining focused topics and focused terms in short text.
[27] Best topic word selection for topic labelling.
[28] Automatic labeling of multinomial topic models.
[29] Automatic evaluation of topic coherence, in Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). USA: Association for Computational Linguistics.
[30] Parameter estimation for text analysis.
[31] Hidden topic Markov models.
[32] Integrating topics and syntax.
[33] Reading tea leaves: How humans interpret topic models.
[34] Optimizing semantic coherence in topic models.
[35] Experience report: Understanding cross-platform app issues from user reviews.
[36] Jensen-Shannon divergence.
[37] On finding the natural number of topics with latent Dirichlet allocation: Some observations.
[38] A density-based method for adaptive LDA model selection.
[39] Accurate and effective latent concept modeling for ad hoc information retrieval.
[40] Finding scientific topics.
[41] Latent Dirichlet allocation in R.
[42] Transferring topical knowledge from auxiliary long texts for short text clustering.
[43] Learning to classify short and sparse text & web with hidden topics from large-scale data collections.
[44] Improving LDA topic models for microblogs via tweet pooling and automatic labeling.
[45] A Dirichlet multinomial mixture model-based approach for short text clustering.
[46] Comparing Twitter and traditional media using topic models.
[47] Short and sparse text topic modeling via self-aggregation.
[48] Topic modeling of short texts: A pseudo-document view.
[49] Topic modeling for short texts with auxiliary word embeddings.
[50] A general framework to expand short text for topic modeling.
[51] Word network topic model: A simple but general solution for short and imbalanced texts.
[52] Enhancing topic modeling for short texts with auxiliary word embeddings.