key: cord-0232600-99zf9oox
authors: Wu, Qiong; Hare, Adam; Wang, Sirui; Tu, Yuwei; Liu, Zhenming; Brinton, Christopher G.; Li, Yanhua
title: BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation
date: 2020-08-05
journal: nan
DOI: nan
sha: 175d05fe518d39537aad005e8af04ca40d2c3cfa
doc_id: 232600
cord_uid: 99zf9oox

Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of"topic identification"and"text segmentation"for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called BATS: Biclustering Approach to Topic modeling and Segmentation. BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and award important words to further improve performance. Experiments on four datasets show that our approach outperforms several state-of-the-art baselines when considering topic coherence, topic diversity, segmentation, and runtime comparison metrics.

Innovations in topic modeling and text segmentation have demonstrated the potential for automated analyses of large collections of documents. Broadly speaking, topic modeling refers to finding a collection of topics (e.g., groups of words) that represent a given document, whereas document segmentation refers to partitioning a document into components (e.g., sentences) about the topics. Existing solutions to these problems are usually based on analyzing statistical patterns in text across datasets that consist of large collections of documents. For example, the popular Latent Dirichlet Allocation (LDA) algorithm for topic modeling [5] assumes that each document comprising a corpus, and every word in them, are generated according to the Dirichlet process. With this assumption, EM-based algorithms can then be employed to infer the latent states of the documents [30] . Word embedding models such as word2vec [45] and GloVe [52] have also become popular, building joint distributions of word sequences by transforming every word in a document into a high-dimensional space learnt over a large corpus. The resulting high-dimensional representations then help to identify topics in the document and perform segmentation based on these topics.

While algorithms for finding topics [5, 16, 30] and segmenting documents [12, 29, 61] have been extensively studied, none have fully addressed issues posed by the "new and single document" setting. In this setting, we may need to analyze a newly created text whose topics have not been seen before, posing unique modeling challenges we aim to address in this paper.

The "new and single document" setting manifests itself in several contemporary scenarios. We consider first that new words may arise rapidly and garner the most interest when there is relatively little written about them [31, 83] . A topic modeling algorithm that relies on a model trained on a dataset from several years ago may have rejected a topic word such as "COVID" as out-of-vocabulary at the moment when the public was most interested in finding out about the emerging disease. Although a news aggregator like Google may be able to find a suitable number of documents despite so few being available, a model used by a single outlet (e.g., an internal search on the New York Times website) is unlikely to have the same resources.

These neologisms are not the only circumstance in which the "new and single document" setting arises. Often, existing words acquire new context-specific meanings as their usage changes over time [79] . For example, before the ubiquity of "like buttons" on social media, it may have been safe to assume the word "like" was too generic to be a suitable topic or topic word. Additionally, words may have different meaning in different domains: consider "transformer" in the context of computer science [72] , electrical engineering, and popular culture. Although this issue has helped spark interest in contextualized word embeddings [17, 43, 53] , these models are still sensitive to biases in the historical data used to train them [82] . These differences may be especially relevant as one source begins to add content from a new domain, e.g., when an online archive begins accepting submissions from a new subject area. In each of these settings, when working across a diverse database of documents, treating each one individually -i.e., as a new and single document -can allow these semantic differences to be better captured. The same could be said of a general-purpose aggregator that can receive documents in new formats and, as a result, cannot afford to make strong assumptions about future input based on what it has already seen. A key challenge we face as a result of the single document setting is dealing with sparser amounts of text available for modeling.

The "new and single document" setting also manifests itself when rapid processing is required in the presence of constrained computational resources. These scenarios are becoming more widespread today as a result of edge computing technologies [67, 70] , which are moving processing tasks from the cloud to the network edge in an effort to leverage the improved intelligence capabilities of mobile devices for more real-time results. Edge devices are heterogeneous in their processing capabilities, which requires computationally efficient learning techniques [11, 71] . In the text analysis space, we can consider continually-updated document streams (e.g., a social media newsfeed) that need to be processed in real time with limited resources (e.g., on a smartphone), which will require efficient algorithms that avoid computation across a large corpus [77] . When speed is a concern and parallelization is a possibility, this approach allows each document to be treated independently and pushed to the next step in the pipeline as it is processed. It also avoids relying on a large corpus or pre-trained model, both of which may demand substantial computational resources.

In this paper, we design a statistically sound, computationally efficient, unsupervised algorithm that can simultaneously extract topics and segment text from a single document of interest. Designing such an algorithm is challenging because we need to determine model parameters on a sparse dataset. Our development is guided by three key ideas: 1.2.1 Idea 1: Using word ordering information properly. Traditional topic modeling approaches assume bag-of-words models [5] where information on the order in which words appear is neglected. While this has proven effective in the analysis of full corpora, compression to a bag-of-words in the case of a single document may lose information valuable to the task at hand. The recent success of recurrent models and the addition of positional encodings in non-recurrent models for the application of machine translation [72] is further evidence of the potential value of word-order information on single document.

Motivated by this, our approach aims to leverage word-order information to achieve good performance in the presence of a small, single document training dataset. In particular, we consider the location of words in neighboring sentences. In designing this mechanism, we will make two assumptions guided by basic rules of written language: (i) words appearing in the same sentences are more likely to be on the same topic, and (ii) words located in nearby sentences are more likely to be on the same topic. 1.2.2 Idea 2: Design a biclustering algorithm that addresses sparsity. For joint topic modeling and text segmentation, we will find it convenient to model documents with sentence-word matrices. But word-to-word interactions and word-to-sentence interactions are noisy by nature [7] . This problem becomes even more pronounced with small datasets like single documents where these interactions are likely to be sparse (e.g., the sentence-word matrices for datasets considered in this paper have only 15% of entries nonzero on average). A well-designed denoising process is necessary so that a sentence-word matrix can be utilized effectively in the downstream topic extraction and text segmentation tasks. We also seek to avoid reliance on pre-trained word embeddings in this step given the computational cost consideration in our "single and new document" setting.

Our approach connects the denoising problem here with the denoising problem in stochastic block models, where we consider structure in the "blocks" of a document's sentence-word matrix [56, 75] . In particular, motivated by an expected power law of node degrees in a document's sentenceword bipartite graph, we design a specialized spectral biclustering algorithm which operates on a regularized version of the graph Laplacian to address sparsity. In this algorithm, the topics and segments emerge from clustering the right and left singular vectors of the Laplacian. Given this, we term our overall solution BATS: Biclustering Approach for Topic modeling and Segmentation. 1.2.3 Idea 3: Optimize heuristics to analyze single documents. We design several heuristics to enhance our algorithm's performance. Our heuristics are designed based on two major observations: (i) extremely low frequency words tend to introduce noise to document analysis and thus need to be removed, and (ii) part-of-speech (POS) tagging can help to identify more important elements of a document and thus should be considered in our model. Therefore, we remove the low frequency words in the text, but award the important words according to their POS tags. Specifically, because nouns and verbs often convey the body and condition of a sentence, they are typically more informative in topic modeling than other parts of speech [26] .

We evaluate BATS against 12 total baselines, six for topic modeling and six for text segmentation tasks. For topic modeling, we compare performance in terms of topic coherence (i.e., quality of individual topics) and topic diversity (i.e., overlap in topic words) on five datasets, in which we find that BATS obtains comparable or higher performance to the best baselines in each case. For text segmentation, we add in one more standard dataset, and show that we outperform all of the baselines in most cases in terms of agreement with a ground truth. In most cases, the highest performing baselines on topic coherence and text segmentation are based on pre-trained lanaguage models. To this end, we additionally show that the runtime of our method scales significantly better than any of the highest-performing baselines as the size of the input document increases. For these reasons, we can conclude that BATS obtains the most desirable combination of metrics for the "new and single document" setting. Figure 1 outlines the methodology we develop and provides a roadmap for the paper. The inputs to BATS are a single document and a single hyperparameter (segment number, which also indicates topic number). Then, the two major stages of BATS are preprocessing and extraction. In the preprocessing stage (Sections 3.1 and 3.2), we leverage ideas 1 and 3 to build an effective feature matrix representation of a document under sparse and noisy conditions. In the extraction stage (Sections 3.3 and 3.4), we use idea 2 to identify low-dimensional representations of the signals through spectral biclustering, with agglomerative methods to segment the text and KMeans to identify the topics. Our subsequent evaluation (Section 4) assesses performance of the resulting text segments and topic words in terms of diversity, coherence, segmentation, and runtime metrics.

Our key contributions are summarized as follows:

• We develop a novel methodology called BATS that performs topic modeling and text segmentation on a single document simultaneously. BATS is unsupervised and scalable in its implementation, as it does not rely on pre-trained word embedding models. • We connect the joint topic extraction and segmentation problem to spectral biclustering of sentence-word matrices, and show how a factorization of the graph Laplacian with appropriate pre-processing and post-clustering can lead to effective topic modeling and text segmentation. • Our evaluation on several datasets establishes that BATS achieves the best combination of topic modeling, text segmentation, and runtime metrics when compared with baselines on single documents. It also verifies the contribution of different components of BATS.

We identify three areas of related work: biclustering techniques, topic modeling, and text segmentation algorithms.

Biclustering techniques (e.g., [18, 19, 63] ) have been proposed to model interactions among two types of nodes represented in a bipartite graph, with nodes of each type grouped into clusters according to different methods. These techniques are widely used in part because of their sound theoretical properties [19] . In [18] and [34] , the authors propose algorithms which translate input data into bipartite graphs and apply spectral techniques to the adjacency matrices; in [18] , a block diagonal structure is assumed, while in [34] , the case of a checkerboard pattern is considered, with implications to the spectral decomposition. [63] can be viewed as an extension of the algorithm in [18] to deal with asymmetric data matrices. By contrast, [19] proposes a probabilistic approach to graph biclustering, where the input data matrix is treated as a joint probability distribution between two random variables, which are then clustered according to relative entropy and mutual information metrics. Our work builds off the spectral clustering foundations in [18, 63] , accommodating rectangular sentence-word data matrices instead of traditionally assumed square matrices.

Several models have been proposed to extract topics from a corpus consisting of multiple long documents, including Latent Semantic Analysis (LSA) [16] , Non-negative Matrix Factorization (NMF) [51] , Probabilistic Latent Semantic Analysis (pLSA) [30] , Latent Dirichlet Allocation (LDA) [5] , and variants on LDA, e.g., hierachical modeling [69] (see [14] for a survey). Recent work has incorporated word and document embeddings jointly to capture "topic vectors" [2] ; however, even when these approaches use large pre-trained embedding models, they require corpora on the order of thousands of documents to achieve competitive results, making them unsuitable to applications where data is limited. Analysis on short texts usually faces the issue of sparsity in word occurrences. To overcome this challenge, works such as [76, 80] make additional assumptions on word co-occurrence patterns; [48, 55, 66, 81] have resorted to word embeddings which leverage pre-trained models; [13, 25] depend on further external knowledge including social relationships in microblogs and user preferences. One of these, Semantics-assisted NMF (SeaNMF) [66] , is a variant of NMF designed to handle short documents by learning semantic relationships between words in the corpus. It has been found to outperform other standard techniques such as NMF and LDA [66] . Different from these methods, ours aims at identifying topics in a single, newly created document without an extensive training component. To overcome issues of input data sparsity and noise, BATS turns to word-ordering information between sentences and regularization in the spectral clustering phase, as opposed to making additional assumptions on word co-occurrence patterns. Through evaluation on several datasets, we show that BATS outperforms many of the above models on single document topic modeling in terms of topic coherence, topic diversity, and scalability metrics. In particular, compared to the state-of-the-art SeaNMF method, we will see through our experiments that BATS achieves better performance on topic coherence in most cases, and smaller/more scalable runtimes, which is important in the "new and single document" setting we consider in this paper.

Text segmentation algorithms are designed to detect breakpoints in a document and split the document into multiple segments accordingly. Algorithms such as Lexical Chains [41] and Text-Tiling [29] use lexical co-occurrence and distribution patterns to divide sets of paragraphs into multi-paragraph sub-blocks that become segments. A potential drawback of these approaches, however, is that the segments are not associated or labeled with explicit topic information, and that it is not always clear how to translate from a lexical distribution to topics. This motivates the consideration of topic modeling and text segmentation jointly.

More recently, to improve segmentation performance, topic-based segmentation methods such as TopSeg [6] , LDA_MDP [47] , and TopicTiling [61] have been proposed. Similar to the topic modeling algorithms discussed above, these segmentation methods depend heavily on the training process, and usually require training on a large corpus [9] . This is problematic when only small datasets are available, let alone in the single document case that we consider in this paper. Through biclustering of the sentence-word matrix and development of other pre-processing techniques, BATS does not demand an expensive training process. Other recent approaches, such as SupervisedSeg [35] and SegBot [39] , have modeled the task as a supervised learning problem and employed deep recurrent networks while others rely on massive language representation models such as Bidirectional Encoder Representations from Transformers (BERT) [17] . Generally, these models are pre-trained on large historical datasets and can be used in an application with little to no fine-tuning required. However, loading such models may use substantial computational resources and their reliance on historical data can introduce biases and gaps. By contrast, BATS is designed to be computationally efficient and flexible enough to accurately handle novel text. We find that BATS obtains an order of magnitude smaller runtime compared with BERT applied to text segmentation, which is important in our "new and single document" setting. Further, our evaluation shows that BATS outperforms the segmentation methods discussed here on single documents across several datasets.

As shown in Figure 1 , our proposed methodology BATS consists of two main stages: the text preprocessing stage (Section 3.1) and the extraction stage, with the latter broken down into graph Laplacian regularization (Section 3.2), singular vector extraction (Section 3.3), and sentence/word clustering (Section 3.4). Topics and segments emerge from the word and sentence clusters, respectively. In this section, we detail the development of these modules, and discuss how they address the issue of sparsity associated with modeling a single document.

Consider an input document comprised of sentences, indexed = 1, ..., . We denote W = { 1 , ..., } as the set of words we are interested in for modeling, indexed = 1, ..., . In defining W, we do not include all the words that ever appear in the document; instead, a word is included in W if and only if it appears in more than one sentence in the document and it is not in a stopword list. In this way, W excludes "degree-one" words that can skew models in single documents; we observe that these words often behave as pure noise in our inference algorithms. Let = [ ] ∈ R × denote the sentence-word matrix. Even after excluding degree-one words, we still expect this matrix will be sparse, with a significant number of = 0. We develop two steps to construct , taking into account both word order and parts-of-speech information: 3.1.1 Step 1. Using parts-of-speech information. Our first optimization trick is based on parts-ofspeech (POS) tags, which are generated through analysis of the word positions in the sentences [26] . In particular, the lexical model presented in [22] shows hierarchies exist according to syntactic/semantic similarities of words; looking into them, it is clear that nouns and verbs convey more information than other word types, and thus should be given a larger weight [60] . As a result, letting

= [ ] where is the number of occurrences of word ∈ W in sentence , we define

where = [ ], = 1 if ≠ 0 and is tagged as a noun or verb in sentence , and = 0 otherwise. > 0 is a scalar parameter for awarding POS; by default, = 1. In our implementation, Python's spaCy module is used to tag the words, as this pre-trained model based on word positions is more robust to novel words or topics than would be, for instance, a word-embedding model. 3.1.2 Step 2. Transformation by using word-order information. Our incorporation of word-order information is based on the intuition that words in neighboring sentences are likely to be similar in their constituent topics, with this effect decaying as the sentences grow further apart. Assumptions on words appearing within a certain window being related can be found in other text analysis techniques as well, including word embedding models [52] . Concretely, we bond neighboring sentences to the current sentence according to

where = ( 1 , ..., ) is the th row of and ℓ is the ℓth row of for ℓ = 1, ..., (for ℓ < 1 and ℓ > , ℓ is taken as a vector of zeros). Parameter controls the size of the bonding window, and ∈ [0, 1] is a decay rate for the distance. In this way, the presence of a word in one sentence will impact neighboring sentences, and words appearing in several consecutive sentences are increased in importance. Doing so also alleviates the issue of sparsity associated with modeling single documents, as each sentence's data smoothens its neighbors' representations too. The procedure for determining the values of and will be discussed in Section 3.2.

3.2.1 Intuition. Consider the bipartite graph G( ) of the sentence-word matrix , where the sentences = 1, ..., and words = 1, ..., each form a node set, and edge ( , ) of weight , is in G( ) if and only if , ≠ 0. Intuitively, this graphical representation will capture some relatedness information between words and sentences, e.g., words that frequently occur together, and are therefore likely to be part of the same topic, should have a high number of short paths to each other. This bipartite graph is likely to have nodes whose degrees follow a power law distribution, both as a product of Zipf's Law for word frequency distributions [85, 86] and because empirically real world networks often exhibit power laws [20] . The graph Laplacian is known to be a useful matrix representation for clustering graphs with heterogeneous node degrees given the relationship between its spectral properties and the connected components of the graph [73] . To further address the sparsity issue, we will develop a regularized version of the Laplacian, as it has been shown to improve the performance of spectral clustering algorithms on sparse matrices [1, 10, 56] . 3.2.2 Constructing the Laplacian. Formally, define two diagonal matrices = diag( 1 , ..., ) ∈ R × and = diag( 1 , ..., ) ∈ R × where = =1 , = 1, ..., and = =1 , = 1, ..., are the row and column sums of . With regularization parameters , ≥ 0, the regularized graph Laplacian ∈ R × is computed according to

where and are identity matrices. Multiplying these by regularization parameters and can resolve issues due to poor concentration since the degrees for every vertex are inflated. Following prior work [56] which has indicated that such regularization parameters should be proportional to the average degrees of the vertices (so that the asymptotic bounds will be indicative of the mis-clustering rate), we set the average degrees as defaults, i.e., = / and = / .

Our main observation is that sentence-word interactions in a document may be modeled by a stochastic co-block model (SCBM) [33, 56] , whose structure can be inferred through spectral clustering. SCBM is a bipartite graph generalization of the standard block model [63] , where there are 1 blocks of nodes on the left side of the graph and 2 blocks of nodes on the right. ← tf-idf( ) //Tf-idf assignment 7: for ← 1, ..., do 8:

for ← 1, ..., do 10:

for ← 1, ..., do 12:

13: 

The interactions between nodes at two sides are governed by a matrix ∈ R 1 × 2 , i.e., for any in the -th left block and in the -th right block, Pr[{ , } ∈ ] = , . In our setting, each right node corresponds to a word, and words in the same block can be interpreted as in the same topic. Each node to the left corresponds to a sentence, and sentences in the same block corresponds to the same segment. A word is connected to a sentence if it appears in the sentence at least once. Spectral clustering can be used to recover the block structure in an SCBM. This algorithm first finds the leading singular vectors and values of the bipartite graph's adjacency matrix, and then runs standard clustering algorithms (e.g., k-means) on the leading singular vectors/values. The algorithm is known to be effective for sparse SCBM because it can effectively remove singular vectors along directions that contain stronger noise than signal. A synthetic example of this is shown in Fig. 2 , where we have generated 2000 sentences and 5000 words according to an SCBM with four blocks each. The dataset is too sparse for standard degree-based algorithms, such as counting shared neighbors [32] or using Jaccard similarity [50] , to recover these blocks (Fig. 2b) . Nevertheless, we notice that all the singular vectors/values beyond the first three correspond to noise (Fig. 2a) . Therefore, when we perform clustering only on the leading dimensions of the singular vectors, we are able to exactly recover the blocks (Fig. 2c ). This motivates us using the singular vectors to cluster words and sentences. 3.3.2 Obtaining a low dimensional embedding. We consider the singular value decomposition (SVD) of the graph Laplacian . By definition, the SVD yields

where ∈ R × and ∈ R × are unitary matrices and Σ contains the singular values 1 , ..., max{ , } on its diagonal. Since = (Σ Σ) is a measure of similarity between words, counting their degrees of connectivity via sentences, and = (ΣΣ ) is a measure of similarity between sentences, counting their degree of connectivity via words, the SVD can be used to cluster words (using ) and sentences (using ). Further, as the eigenvalues of and are the square of the singular values in Σ, we introduce another parameter which denotes the number of left 1 , ..., ∈ R and right 1 , ..., ∈ R dominant singular vectors used, where we assume the singular values are in decreasing order 1 ≥ 2 ≥ · · · . We then re-normalize the rows of the resulting matrices

to have unit length, i.e., so that ℓ ′ 2 ℓ = ℓ ′ 2 ℓ = 1 for each sentence and word . Following [73] , which suggests that the dimensionality should be consistent with the number of clusters to be grouped, we use the same parameter for both and .

The full decomposition process developed in Sections 3.1-3.3 is summarized in Algorithm 1.

Recall the window and decay parameters from (2). We investigate the impact of these parameters on the matrix decomposition in (5) by considering the L2-norm distances between the resulting sentence vectors in . Figure 3 gives heatmaps of these distances for two arbitrary documents in one of our datasets (see Section 4.1). Since neighboring sentences should cover similar topics, we seek values of and for which ordering information is clearly embedded in the matrix. In Figure 3 (a), for small values of (i.e., = 0, 1), the sentence order is less clear as the elements near the diagonal are more blurry. As increases, the pattern becomes more obvious, and when = 3 we observe clear block patterns in the heatmap. When is increased further (i.e., to = 5), the sharpness of the block pattern does not continue to improve; intuitively, sentences at the far ends of the bonding window for large will have higher dissimilarity, but this effect is blunted by the decay (which is 0.7 here). Since a higher also increases the runtime of the method, in considering several documents, we find that the best choice of is typically between 3 and 5 (i.e., the number of topic-neighboring sentences is 6 to 10). By this logic, then, the value of should be significantly lower than 1. As it is decreased in Figure  3 (b) (i.e., from = 0.9), we see that the sharpness of the blocks improves, with = 0.7 giving the clearest pattern. Beyond this (i.e, = 0.5, 0.3), however, the sharpness begins to decrease. In these cases, neighboring sentences are assigned lower weights, confirming that the SVD uncovers topic similarity between neighbors. In considering several documents, we find that the best choice is ≈ 0.7 for this reason. In Section 4, we will verify that = 5, = 0.7 leads to the highest average performance on the topic modeling and text segmentation metrics for all datasets considered. 

With the embedding from (5) in hand, we move to obtain topics and segments via spectral clustering of the word and sentence matrices respectively. Following [73] , we consider the problem from a graph cut point of view, where the cuts are taken on a similarity graph of words or sentences. Formally, let = [ , ] ∈ R × be the similarity matrix among a set of nodes 1 , ..., (i.e., words or sentences), where , ≥ 0 is the similarity between nodes and . We seek to minimize

a grouping of the nodes into disjoint sets 1 , ..., . A simple and straightforward solution for this minimization problem is to cut off individual nodes which are weakly connected to the rest. However, there is usually no topic with one word or text segment with one sentence, therefore, the groups of words or sentences are supposed to have more balanced sizes. As a result, the objective function needs to take group sizes into consideration and build the balanced cut problem

Taking group sizes into account makes the problem NP hard and requires further relaxation. We reorganise the problem by defining a group indication matrix = [ℎ 1 · · · ℎ ] ∈ R × consisting of weighted indicator vectors ℎ = (ℎ 1, , ..., ℎ , ) , = 1, ..., where

We can see = where is the identity. Letting be the node similarity graph, we define its degree matrix as = diag( 1 , ..., ) where = , = 1, ..., , and the unnormalized graph Laplacian as = − . Through some easy math, we can get

Combining this with (6), we conclude that

Thus, the minimization problem can be presented as min 1 ,··· , ( ) subject to = , defined as (7).

This problem is equivalent to minimizing (6) and is NP-hard. Therefore, in the BATS methodology, we relax this constraint by allowing ℎ , ∈ R to take any arbitrary value, and turn (10) into

This approach allows us to employ clustering algorithms to solve the minimization problem. In the following sections, we detail our methods for solving (11) to cluster words (Section 3.4.1) and sentences (Section 3.4.2), respectively.

To obtain the topics, we consider spectral clustering for the normalized word matrix ′ in (5). Since each row ′ ∈ R , = 1, ..., of is a -dimensional representation of a word, the clustering optimization in (9) takes these words as the nodes to be grouped into sets 1 , 2 , ..., based on the similarities , ′ between pairs of word representation vectors and ′ . This is equivalent to minimizing the pairwise deviations between representations of nodes within the sets:

This optimization is equivalent to a KMeans clustering [40] of the vectors ′ 1 , ..., ′ . The number of clusters is determined by the number of segments , and each resulting word cluster refers to a topic. To obtain a description of each topic in terms of its top words, we further rank the words in each cluster according to the standard term frequency-inverse document frequency (tf-idf) metric [58] applied to the awarded sentence-word matrix in (1). The tf-idf assignment matrix is obtained during the matrix decomposition process and it is the same size of . To assign each word a single tf-idf score for sorting, we sum the tf-idf scores of each word over all sentences. 3.4.2 Segments via sentence clustering. We then turn to clustering the normalized sentence matrix ′ in (5) to obtain the segments. Compared with the topic clustering problem, this one will have more constraints since the clusters are related to the sentence orders and the cluster sizes can be uneven. As a result, the KMeans method is no longer applicable, and we resort instead to an agglomerative clustering method with connectivity constraints [15] to solve (9) . In agglomerative clustering, nodes are grouped together sequentially according to pairwise similarities: the process recursively merges two groups of nodes that yield the minimum between-cluster distance. Formally, recall that the sentence embeddings are the rows ′ ∈ R , = 1, ..., of the matrix ′ . We form the graph of sentences being the cosine similarity between ′ and ′ . The ordering constraint should be such that only adjacent sentences can be clustered; we therefore initialize a connectivity graph 1 = ( , 1 ) where for all pairs of nodes , ∈ , ( , ) ∈ 1 if and only if = + 1, i.e., each node connects to the next sentence. Letting denote Algorithm 2 BATS topic modeling and text segmentation.

INPUT: Single text document, segment number PARAMETER: Awarding value , window size , decaying rate OUTPUT: Topic words, text segments 1: procedure MainProcess(text, , ,

Remove degree-one words from text. 3: Compute sentence-word matrix and POS-based matrix . 4 : Cluster the rows of ′ with KMeans into clusters. Sort the words in each cluster by (tf-idf scores). 6: Cluster the rows of ′ with agglomerative clustering into clusters with a connectivity constraint. 7: Topic words ← Sorted words in each cluster 8: Text segments ← Sentence clusters 9: return Topic words, Text segments cluster of the sentences at the th iteration, initialized as 1 = { } for each , the merging operation of our constrained agglomerative clustering is given by

for = 1, ..., − , where ( , ) refers to the distance between the sets and , which is treated as the average distance between sentences in and according to their link weights in , and ( ) is the single node that points to in . In each iteration, the two adjacent clusters of sentences that have minimum distance are merged together. The procedure ends after = − iterations, when there are clusters for which ≠ ∅; these are taken as the segments. 3.4.3 Advantages of joint topic modeling and text segmentation. In the BATS methodology, segmentations are produced from sentence clusters whereas topics are produced from word clusters. A key observation we have made in the design of BATS is that we can indeed use one simple bi-clustering algorithm to simultaneously identify sentence and word clusters, based on two simple assumptions: (i) when two sentences are in the same cluster (segment), the same set of words are more likely to appear in both sentences, and (ii) when two words are in the same cluster (topic), these words are more likely to co-appear in the same sentences.

In other words, the segmentation and topic modeling tasks are coupled through the bi-clustering algorithm in BATS, which uses sentence-word interactions to identify a joint latent structure for segments and topics. The model of which sentences are contained in which segments simplifies the process of identifying topics, and vice versa. Let us consider a concrete example, in which a document consists of 10 sentences. Each sentence may have different topic distributions, e.g., sentence 1 has 80% of words from a "science" topic and 20% from a "health" topic, whereas sentence 7 has 40% from the "science" topic and 60% from the "health" topic. If we assume that words from the same segment come from the same topic distribution, when the ground-truth of the segmentation is known -e.g., segment 1 consists of the first four sentences and segment 2 consists of the last six sentences -we are narrowing the set of topics these words are assumed to be sampled from. Specifically, in this example, we effectively observe two sets of words sampled from two topic distributions (one for each segment), instead of 10 (one for each sentence). This extra information simplifies the process of identifying topics. In fact, the bi-clustering algorithm implicitly uses this information in the singular value decomposition step: it uses sentence clusters (segments) to improve word clusters, and vice versa. Therefore, we expect that doing topic modeling and text segmentation jointly will improve the quality of both tasks, with the added advantage of being more computationally efficient than doing both separately.

The full BATS topic modeling and text segmentation methodology (including Algorithm 1) developed throughout this section is summarized in Algorithm 2. The inputs are the single text document of interest and , the number of topics and segments to extract. The algorithm begins with denoising, which removes all degree-one words, and constructing the sentence-word matrix and parts-of-speech matrix . and are then inputted to the matrix decomposition procedure, detailed in Algorithm 1, which employs sentence bonding and graph Laplacian regularization to obtain the matrices ′ and ′ , containing the encodings of the sentences and words, and the tf-idf matrix . The rows of ′ are then clustered into clusters of words via KMeans, with the words in each cluster sorted by tf-idf score in , forming the topics. Finally, the rows of ′ are clustered into clusters of sentences via constrained agglomerative clustering, forming the segments.

Recall from Section 1.1 that an important setting for the "single and new document" setting is in the presence of constrained computational resources. Thus, we also perform a complexity analysis to investigate the efficiency of our algorithm, given the importance of low runtimes.

From Algorithm 1, note that there are three main procedures in BATS which have major impacts on the time complexity: matrix decomposition, KMeans clustering for words, and agglomerative clustering for sentences. The matrix decomposition process consists of multiple matrix multiplication and summation operations, of which matrix multiplication dominates with a complexity of (max( 3 , 3 )), where and are the number of sentences and words, respectively. However, in our application, the sentence-word matrix is sparse and therefore enables some optimizations in the matrix decomposition procedure. Specifically, iterative methods such as primme [68] can decompose sparse matrices with a time complexity of ( ), where is the number of singular vectors. Formally, we can show that when the number of singular vectors is small, the time complexity is described as follows: Lemma 1. For a given document, assume and are the number of sentences and words, respectively. If the number of non-zero singular values of the sentence-word matrix is sufficiently small, then the runtime of BATS on the document can be approximated as ( + ).

Proof. In the KMeans clustering procedure, all word vectors are compared to centroids to find the closest centroid, and this step iterates times, leading to the time complexity of ( ). In the constrained agglomerative procedure, the similarities between sentence vectors are computed for clustering, and with iterations, the time complexity is ( log ) with the efficient priority queue implementation [42] . For the matrix decomposition, as discussed above, using an iterative method such as primme leads to a time complexity of ( ). The overall time complexity of our method is the sum of all these procedures, which leads to

Since ( ) ≫ ( log ), ≪ , and ≪ by assumption of a low-rank sentence-word matrix, this complexity can be approximated as ( + ). □ Thus, BATS has a much lower time complexity than (max( 3 , 3 )), with the difference being particularly pronnounced when ≫ . The scalability and small runtimes of BATS will be verified experimentally in Section 4.4.

We turn now to evaluating our BATS methodology. After describing the datasets (Section 4.1), we consider performance against baselines on the topic modeling (Section 4.2) and text segmentation (Section 4.3) tasks. Then, we consider the scalability of our method (Section 4.4). Finally, we conduct an ablation study to determine the impact of various components of BATS (Section 4.5).

We consider documents from six datasets -Textbook, Lectures, Introductions, Choi, Wiki, and News -obtained from different text applications. Basic statistics on these datasets are given in Table 1 , including the number of documents, the average sentences per document, the average word counts per document, and the average sparsity per document (fraction of zero entries in the matrix), both before and after the preprocessing procedures described in Section 3.1. More specifics on these datasets are as follows: (i) Textbook dataset: This is drawn from the medical textbook in [74] . Each chapter is treated as a document, and each section as a segment. The numbers of segments per document and sentences per segment have high variance. Moreover, segments within a document tend to be similar in their constituent words, as they are different sections of the same chapter. As a result, this dataset helps us test on cases where documents have different segments discussing similar topics.

(ii) Lectures dataset: This dataset contains transcripts of conversational lectures on AI and physics topics [21] . As each lecture is divided into sections, we treat lectures as documents and sections as segments. Each lecture script has 6-10 sections. Compared with the other datasets, the sentences are more conversational, tending to be shorter and simpler. Therefore, this dataset helps us examine algorithm performance on lengthy conversational documents. (iii) Introductions dataset: In this dataset, every document is an artificial combination of abstracts and introductions from academic articles in different fields [8] . We randomly choose 3-8 articles, extract the abstract and introduction as one sample, and combine multiple samples into one document. Each sample is treated as one segment. Compared with the other datasets, this will allow us to test on cases with large segment sizes, uneven segment lengths, and a diverse set of topics. (iv) Choi dataset: This is a standard dataset [12] widely used to evaluate text segmentation approaches. The documents in the dataset are artificial combinations of the first ℓ sentences of the documents in the Brown corpus [24] . Each document has 10 segments, with few sentences per segment. Because the dataset lacks explicit topic distributions and contains mostly segments that are too short for topic modeling, we use it only for evaluating text segmentation.

(v) Wiki dataset: This is the WIKI-727k dataset from [35] which is comprised of over 727,000 articles from English Wikipedia. Its vocabulary size exceeds 800,000 tokens. We treat articles as documents and, following [35] , use the Punkt Sentence Tokenizer from the nltk library [4] to generate ground truth segments. We include this dataset primarily to test the performance of BATS on large corpora, although it also provides a new genre of natural text covering a wide variety of encyclopedia topics. (vi) News dataset: Finally, we include a news dataset generated by following links shared on Twitter. We use a collection of Tweets from [37] that focus on news related to the 2016 US presidential election and follow the shared links to scrape text from the linked articles. The news dataset consists of about 300,000 documents with a vocabulary size of over 80,000. Each article is treated as a document and ground truth segments are generated using the Punkt Sentence Tokenizer. (i) Latent Dirichlet Allocation (LDA) [5, 64] : LDA is a probabilistic topic model which uses two independent Dirichlet priors for the document-topic and word-topic distributions. It trains a model to best estimate the Bayesian probabilities ( | ) and ( | ). We use the sklearn implementation in Python with the default parameters.

(ii) Hierarchical Dirichlet Process (HDP) [69] : HDP is a mixed-membership model which extends LDA to an unknown number of topics by building a hierarchy. Specifically, it builds a two-level hierarchical Dirichlet process at the document-level and the word-level to perform parameter inference. We use the gensim implementation in Python with the default parameters. (iii) Latent Semantic Analysis (LSA) [16] : LSA decomposes a document-word matrix, based on TF-IDF scores, into a document-topic matrix and a topic-word matrix; the decomposition is performed through a truncated SVD technique. We use the gensim implementation in Python. (iv) Probabilistic Latent Semantic Analysis (pLSA) [30] : pLSA is developed from LSA, using a probabilistic method instead of SVD to find the latent topics via generative modeling of the observed document-word matrix. We implement pLSA de-novo in Python, using 30 as the max number of iterations, 10.0 as the breaking threshold, and as the number of topics.

(v) Non-negative Matrix Factorization (NMF) [51] : NMF is a linear-algebraic model which factorizes a high-dimensional matrix into two lower-dimensional ones. In this case, NMF decomposes the document-word matrix (based on TF-IDF scores) into a topic matrix and a coefficient matrix for the topics. We use the sklearn implementation in Python with the default parameters.

(vi) Semantics-assisted Non-negative Matrix Factorization (SeaNMF) [66] : SeaNMF introduces wordcontext semantic correlations into NMF to extract topics particularly from short texts. The semantic correlations between the words and their contexts are learned from the Skip-Gram with Negative [44, 45] to address sparsity. The objective function of SeaNMF preserves both the word-document matrix and semantic correlation matrix. We use the author's Python implementation available at https://github.com/tshi04/SeaNMF. Since our focus is on single document topic modeling, we evaluate the models on each document separately. Given that the baselines usually learn across multiple documents, to provide a fair comparison, we treat the sentences within each document as the "documents" for the baselines, i.e., we feed them the preprocessed sentence-word matrices. For each document, the number of topics assumed by each baseline is taken to be the number of segments. The performance of each baseline is averaged over several trials.

We employ two popular coherence metrics to assess extracted topic: pointwise mutual information (PMI) [23] and UMass [46] . Higher values of these metrics have been associated with better performance in terms of interpretability and consistency of topics with human evaluation [62] . Since these metrics treat topics separately, in order to evaluate the diversity between topics, we also include two similarity measures: Jaccard (Jacc) Index and Sørensen-Dice (Dice) Index [27] . They measure overlaps in words between the topics, with lower values (i.e., less overlap) being better. More specifically: (i) Topic coherence measures: We use the PMI score, an external metric which takes the top-K words under each topic into consideration and has been widely used in recent papers for evaluating topic coherence [28, 36, 38, 49, 57, 65, 66, 78] . Formally, for each topic , the PMI score is calculated as

where K is the number of words from topic that are considered, ( , ) is the number of documents containing both of the words and divided by the number of documents in the dataset, and ( ) and ( ) are the number of document containing the words and divided by the total number of documents, respectively. When selecting the K words from topic , if the topic modeling algorithm orders its results (e.g., from most likely to be part of the topic to least likely) then the top K words are used. The average PMI score over all the topics is then used to evaluate the quality of the topic models. Following [65, 66] , we set the default value of K to 10. We calculate ( , ), ( ), and ( ) over the entire dataset we are evaluating.

By contrast, UMass is an intrinsic evaluation metric which takes the sequence of words into consideration by computing the conditional log-probability of each pair of words; the pairwise scores are not symmetric, and therefore the order of the words matters. The UMass score is given as

where K, ( , ), and ( ) have the same meaning as in (15) . Note that this equation assumes the top-K words have been ordered from most to least likely to be part of topic . In our singledocument evaluation, we consider the internal corpus for UMass to be the document itself. Note that we include both an external (PMI) and an internal (Umass) metric for completeness to ensure that our evaluation does not give an unfair advantage to any particular method. (ii) Similarity score measures: With and as the sets of words comprising topics and , the Jacc ( , ) and Dice ( , ) similarity scores are computed as:

The example in Figure 4 shows the importance of considering both types of metrics. The topics extracted by the LSA baseline tend to have many duplicated words (50% in the example) as compared with results from LDA and BATS, even though it has roughly the same number of words that are consistent with a human-generated summary as our method. Further, since the overall scores for each document are averaged across topics, poor results in terms of one metric on any given topic can be outweighed by high performance on other topics. Since the overall topic coherence scores for each document are averaged across topics, similar topics with duplicate words and high coherence scores will achieve a high average score. Figure 5 shows another example of this for LSA: though this method achieves high topic coherence, the topics are highly overlapped, motivating the need to take diversity into consideration. We note that because BATS prevents overlapping words by design, it will achieve a similarity score of 0 for both Jacc and Dice. For this reason, we will not use these metrics to claim that BATS has the "best" performance on the topic modeling task (the difference between "low overlap" and "no overlap" is not likely important). Instead, we include Jacc and Dice for BATS and the baselines to provide a more holistic picture of the baselines. As some baselines will have a high Jacc or Dice score as well as strong performance on other metrics, these scores highlight a potential tradeoff between low overlap and high performance on the coherence metrics. As a result, we also define composite metrics for evaluation which penalize the coherence scores on pairs of topics according to the similarity scores. Specifically, using PMI and UMass as the coherence scores for topic and sim , as the similarity score between topics and according to Jacc or Dice, we compute:

where refers to the total number of topics and sgn( ) is the sign function that evaluates to 1 if ≥ 0 and -1 if < 0. (18) uses the sgn function to ensure that pairs of topics with high similarity scores are penalized regardless of the sign of the sum of their PMI scores. Because the UMass scores are always negative, we can simplify this penalty to 1 + sim , in (19) .

inspecting heatmaps of pairwise distances between the SVD sentence vectors. We now test these parameters against the PMI metric to validate our choices on the topic modeling task. Looking at Table 6 , we can see that the best choices of and are (i) consistent with our earlier analysis (i.e., ≈ 0.7, = 3 or 5) and (ii) reasonably consistent across datasets. This validates the idea that BATS can be fully unsupervised, where we do not have an initial training procedure for hyperparameters.

Additionally, for completeness, we study the impact of changing K in the PMI score across the Textbook, Lectures, and Introduction datasets for BATS. The results are shown in Table 2 . We find that increasing K decreases the PMI score: this makes intuitive sense because as the number of words in the topics increases, less relevant words should be included, if the words within each topic are ranked effectively. In BATS, our ranking is done according to the TF-IDF scores. 4.2.4 Results and discussion. The results obtained by each algorithm on the five datasets are given in Table 3 . We present the mean and standard deviations on topic diversity, topic coherence, and the four cases of joint metrics. The first two columns, Jacc and Dice, indicate the diversities of the topics (smaller being better). The following columns then give the topic coherence scores, PMI and UMass (larger being better), followed by their combinations with the similarity measures (e.g., is PMI with Dice used for sim in (18)). Overall, we see that compared with the baselines, our method BATS obtains competitive topic coherence scores, the lowest similarity scores, and the best composite scores in most cases. For the Introductions, Wiki, and News datasets, which are the three largest we consider, our method maintains higher performance than all baselines in all metrics. On the Textbook and Lectures datasets, BATS achieves the best performance on the PMI composite metrics and is a close second to LSA on the non-composite PMI. BATS performs second only to SeaNMF for the UMass composite metrics and comparably to the best baselines on the non-composite UMass metric. The baseline which tends to outperform our algorithm in terms of topic coherence, LSA, also performs the worst in terms of topic diversity. To interpret this diversity performance, we note that a Jacc score of 0.25 and a Dice score of 0.4 correspond roughly to | ∩ | ∝ 0.4 in (17), i.e., a 40% duplication between topics. Thus, LSA (as well as NMF) usually has up to 40% average overlap in topic words, leading to confusing topics, while our method yields no noticeable overlap. On the other hand, the baseline which matches our algorithm in topic diversity, LDA, is among the lowest performing in terms of coherence, which is also reflected in the composite metrics.

Although SeaNMF achieves both a high topic diversity score and UMass score, on the PMI metric it falls substantially short of BATS (for all datasets) and LSA (for the Textbook and Lectures datasets). This may be due to the fact that, for our purposes, the UMass score for each document is focused only on topics in that document whereas the PMI score includes information from the rest of the dataset. We can thus conclude that, among the algorithms tested, our algorithm finds the best balance between topic coherence and diversity for single document topic modeling; its consistent performance across the datasets also shows that it is robust to variations in dataset properties.

We also observe an interesting pattern in the baselines: the spectral methods -LSA and NMF -perform high in coherence but low in similarity, while the generative models -LDA and pLSA -have the opposite trends. While spectral approaches can extract topics that are interpretable when taken individually, there is high similarity between them because they are based on matrix decomposition and do not consider diversity. Generative models can extract diversified topics, but when they are operating on single documents with few word co-occurrences, the resulting topics will not be as coherent. These observations are consistent with Figure 4 .

4.3.1 Baselines. We compared BATS against six baselines for the text segmentation task: Table 3 . Performance of each algorithm on the Textbook, Lectures, Introductions, Wiki, and News datasets in terms of topic coherence, similarity, and composite metrics. Our algorithm has the highest performance on most of the metrics, indicating it achieves the best balance between topic coherence and diversity. (i) TextTiling [29] : TextTiling divides the text into pseudosentences, assigns similarity scores at the gaps, detects peak differences in the scores, and marks the peaks as boundaries. The boundaries are normalized to the closest sentence breaks. We use the implementation from the nltk package.

(ii) C99 [12] : C99 is another popular text segmentation algorithm that inserts boundaries based on average inter-sentence similarities. More specifically, a ranking transformation is performed, pairwise cosine distances between sentences are computed based on the ranking, and boundaries are determined based on these similarities. We implement C99 de-novo in Python.

(iii) Modified DP Algorithm with LDA (LDA_MDP) [47] : LDA_MDP performs text segmentation based the LDA topic model, with the segmentation being implemented with dynamic processing (DP) techniques. The method has also been tested using an alternate topic model, multinomial We find that = 5, = 0.7 leads to the best performance, consistent with the PMI results in Figure 6 .

mixture, but LDA has has been found to obtain better performance. We implement the LDA_MDP model de-novo in Python.

(iv) TopicTiling [61] : TopicTiling is based on TextTiling, with additionally integrated topic information from the LDA topic model for text segmentation. We implement TopicTiling de-novo in Python, using a window size of 2 and 500 iterations.

(v) SupervisedSeg [35] : SupervisedSeg formulates text segmentation as a supervised learning task, training a hierarchical bidirectional neural LSTM model on the WIKI-727K dataset. The authors transform the text into word embeddings using the GoogleNews word2vec pre-trained model, and use the word embeddings as inputs to the neural model. SupervisedSeg then predicts a cut-off probability for each sentence. We used the open source pre-trained model available from [35] .

(vi) BERTSeg: To test the efficacy of pre-trained models, we designed BERTSeg, an algorithm that leverages the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) [17] contextualized word embeddings for text segmentation. As the name suggests, BERT uses the Transformer deep learning architecture [72] to learn representations of words from unlabeled text, considering both the right context and left context in every layer (i.e., learning bidirectionally). BERTSeg is based on the open source Pytorch BERT model [59] . After using the pre-trained BERT model to generate sentence embeddings, BERTSeg employs the agglomerative segment clustering method from BATS to convert the sentence embeddings into segments.

To say consistent across the algorithms, for the topic-based text segmentation methods -LDA_MDP and TopicTiling -we train the topic model with the single document being segmented. 4.3.2 Evaluation metrics. We consider two standard text segmentation metrics, [3] and Win-dowDiff (WD) [54] . Lower values indicate better performance. Each of these metrics compares the ground truth (i.e., reference) segmentation ref to the estimated (i.e., hypothesized) segmentation hyp. The metric calculates the number of disagreements in the positions of segment boundaries between hyp and ref; in doing so, it ignores the exact number of boundaries to be detected, and weights false positives more heavily [54] . WD, on the other hand, slides a fixed-sized window across the document, calculates the number of boundaries within that window, and records an error if ref and hyp disagree on the number.

Formally, let ( 1 , 2 , ..., ) be the sequence of words comprising a document, where each ∈ W, the set of document words. With ( , ) as the binary indicator of whether words and are in the same segment under segmentation , and ( , ) as the number of segment boundaries between and under , the metrics are calculated as 

where the window size ℓ is set to one less than half the average segment length, and 1 is the indicator function which evaluates to 1 if the argument is true and 0 otherwise.

We now test and against the WD metric to validate our choices on the topic modeling task. Looking at Table 7 , we can see that our results are consistent with those in Section 4.2.3: the optimal values are ≈ 0.7, = 3 or 5 for each of the datasets. Taken together, all of these results support the idea that we can use this set of parameters as default, across different datasets and evaluation metrics. This allows users to use BATS "out of the box" without fine-tuning which may be difficult due to limited data in the "new and single document" setting or computationally burdensome when data is available. 4.3.4 Results and discussion. The results obtained by each algorithm on each of the four datasets are given in Table 4 . The mean and standard deviation across documents is shown in each case.

Overall, we see that our method BATS consistently outperforms all of the baselines in terms of text segmentation on four of out of six datasets. BERTSeg is the most competitive on the baseline on all of the datasets except for Lectures, where SupervisedSeg does slightly better, and Wiki where C99 performs better. The three topic-based text segmentation methods (LDA_MDP, and TopicTiling) actually perform considerably worse than these other baselines, possibly due to single documents containing insufficient data for training their topic models (recall in particular that LDA had poor topic diversity performance in Table 4 ). Although for the Textbook and Introductions datasets BATS achieves only modest gains over BERTSeg, this achievement is noteworthy as BERT is very large (around 335 million parameters) and, even pre-trained, has a runtime two orders of magnitude worse than BATS (see Section 4.4). The performance gains are more pronounced in the Lectures and News datasets where BATS improves more substantially on BERTSeg.

The pre-trained models outperforming the simpler baselines is perhaps not surprising. However, the fact that they achieve comparable but slightly worse results than BATS across three natural datasets suggests that (i) given their computational complexity, pre-trained embedding models are not as well suited for the "new and single document" setting as BATS, and (ii) the relatively simple and efficient pre-processing and biclustering techniques of BATS in combination are effective.

Considering these results along with those in in Sec. 4.2, we conclude that our method is capable of identifying accurate segment boundaries and topic words for a single document simultaneously. The C99 baseline performs remarkably well on the Choi and Wiki datasets. C99 is designed specifically with datasets such as Choi in mind, where documents are artificially built with identical numbers of short segments and sparse content in each segment [12] . Specifically, as shown in Table  1 , the average sentences per document in Choi are significantly smaller than the other datasets. This is due to the way it is constructed -with each document as combinations of first ℓ sentences from documents in another corpus -making it less realistic than the other datasets. Similarly, because BERTSeg uses agglomerative clustering on pre-trained embeddings directly (instead of using embeddings as input to another supervised model), it may be particularly apt for identifying breaks among unrelated topics (i.e., topics from different articles) as opposed to the harder task of segmenting natural text with related topics. Put simply, using the embeddings directly makes a "coarse" separation between unrelated topics easier as word embeddings are designed to capture these semantic differences. By contrast, BATS is designed to segment natural text with a "finer" separation between topics that are related but distinct in the "single and new document" setting. In this setting, even contextualized embeddings are not as effective as the semantic differences between topics is likely to be relatively small and so BATS outperforms embeddings-based models, as evidenced by results on the other datasets. Similar results have been observed in, e.g., [35] .

Next, we evaluate the effect of the number of sentences and segments on the runtime of our method compared with the baselines. Table 5 shows the increase in runtime from varying the number of sentences in each segment for the Choi dataset, relative to the case of 50 sentences (we choose this dataset because all documents are constructed with a constant number of segments). We can see that the growth in runtime of our methodology BATS is comparable to the most scalable baselines, with the rate of increase in runtime less than the corresponding increase in sentences. Additionally, BATS is the only methodology performing both topic modeling and text segmentation. Out of the baselines in Table 5 , TextTiling, C99, and pLSA have considerably higher increases in runtime, with pLSA performing the worst. The substantial difference between LSA, the most scalable, and pLSA is consistent with spectral approaches being known to scale better than generative algorithms that require multiple iterations [84] . Although BERTSeg and SupervisedSeg scale relatively well, they have absolute runtimes second and third to pLSA for all sentence lengths. This indicates that there is a high fixed overhead associated with loading the large pre-trained models for segmentation. Table 6 shows the impact on runtime from varying the number of segments per document for the Lectures dataset (recall from Table 1 this dataset has the longest documents available). Here we have excluded pLSA, as its runtime is significantly longer, and also the text segmentation baselines, as their runtimes are not dependent on the number of segments. SeaNMF is by far impacted the most, followed by NMF and LSA, while HDP and LDA exhibit the best scalability. Our method remains impacted under 10% for a 5-fold increase in segments, again implying that our method supports changes in the size of input efficiently. We note that while for a small number of topics BATS has a longer absolute runtime than SeaNMF, this disadvantage disappears when the number of segments exceeds 50. Taken together, Tables 5 and 6 validate our theoretical analysis in Section 3.6 which concluded that BATS has low computational complexity.

When we consider these runtime results in the context of the evaluation metrics, we find that BATS provides the best balance of computational efficiency and performance across evaluation metrics for both the text segmentation and the topic modeling tasks. When compared with the most competitive topic modeling baselines (LSA and SeaNMF), BATS is not as efficient as LSA but scores much better on topic similarity measures. BATS scales better than SeaNMF and has a significantly lower runtime on longer documents with many segments, as evidenced by its performance on the Lectures dataset (Table 6 ). On the text segmentation task, BATS both scales better and has a much lower absolute running time than the two best baselines (BERTSeg and SupervisedSeg) in addition to outperforming both of them on three real-world datasets. These results show that BATS is a practical algorithm with substantial advantages over existing baselines for the "single and new document" setting that we are focused on in this paper.

To test the effectiveness of different parts of the BATS methodology, we conduct an ablation study that excludes components and measures the resulting performance effect. The results for both the topic coherence and text segmentation metrics are given in Table 7 . Referring to Figure 1 , we focus on the following parts of BATS: (i) parts of speech (POS) awarding (i.e., setting = 0 in (1)), (ii) low frequency processing (i.e., not excluding degree one words), (iii) regularization of the graph Laplacian (i.e., setting the terms to 0 in (3)), (iv) sentence bonding (i.e., setting = 0 in (2)), and (v) the Laplacian. For (v), we remove the Laplacian entirely, and instead perform SVD directly on the sentence-word matrix after POS awarding and sentence bonding.

We draw three main conclusions from these results. First, each component of BATS has a substantial impact on one or more of the metrics, which validates its inclusion in the methodology. Table 7 . Ablation studies on the Textbook, Lectures, and Introduction datasets. Each row gives the performance on metrics obtained when excluding a component of the methodology. Overall, we see that each component is important to BATS, as excluding any of them results in lower performance on one or more metrics. Second, the topic coherence metrics are impacted by each component more significantly than the text segmentation metrics, which implies that results at the sentence level are not impacted as significantly by preprocessing than those at the topic level. Third, out of all the components, the Laplacian has the greatest impact, followed by sentence bonding. This validates our graph Laplacian representation as input to the spectral clustering algorithm and suggests that our connection to denoising under the stochastic block model is appropriate. The impact of removing sentence bonding reflects the importance of incorporating word order information in the presence of sparsity.

The "single and new document" setting arises when it is desirable to perform modeling tasks on a single document, due to the potential novelty of words in the document and/or computational limitations. In this work, we developed an unsupervised, computationally efficient, statistically sound methodology called BATS that simultaneously extracts the topics and segments the text from one single document. BATS first leverages word-order information together with optimization tricks such as parts-of-speech (POS) tagging to refine a document's sentence-word matrix. It then obtains a singular value decomposition from a regularized form of the graph Laplacian, with the singular vectors yielding low dimensional embeddings of words and sentences. Finally, BATS employs clustering algorithms to extract topics and text segments from the left and right singular vectors. Through evaluations against 12 baselines on six datasets, we confirmed that our algorithm achieves the best overall results considering runtime, scalability, and standard metrics in both topic modeling and text segmentation tasks for the "single and new document" setting. For topic modeling, this was especially true when considering the dual objectives of coherence maximization and similarity minimization across topics. Our experimental results also showed that BATS scales well with the size of the input data, and that it is robust to changes in dataset characteristics. We identify several potential avenues of future work. First, BATS could potentially be tuned to improve performance on the topic modeling or text segmentation task at the cost of increased computation. Specifically, although we found that BATS outperformed baselines based on word embedding techniques, further experimentation may leverage embeddings to enhance BATS. As this would likely come at substantial computational cost (see Tables 5 and 6 ), we leave this as an avenue for future work beyond the "single and new document" setting. Second, a more elaborate POS awarding scheme in the sentence-word matrix construction phase of BATS may improve topic coherence further. Third, since BATS provides both topic and text segment information, the application of our methodology to text summarization can also be considered, e.g., in identifying the most important segments according to the number of corresponding topic words.

We thank anonymous reviewers for helpful comments and suggestions. Christopher G. Brinton was supported in part by the Charles Koch Foundation. Qiong Wu and Zhenming Liu are supported by NSF grants NSF-2008557, NSF-1835821, and NSF-1755769. Yanhua Li was supported in part by NSF grants IIS-1942680 (CAREER), CNS-1952085, CMMI1831140, and DGE-2021871. The authors acknowledge William & Mary Research Computing for providing computational resources and technical support that have contributed to the results reported within this paper.

Pseudo-likelihood methods for community detection in large sparse networks

Top2Vec: Distributed Representations of Topics

Statistical models for text segmentation

Natural language processing with Python: analyzing text with the natural language toolkit

Latent dirichlet allocation

Topic-based document segmentation with probabilistic latent semantic analysis

On the efficiency of online social learning networks

Statistical models for text classification: Applications and analysis

Reading tea leaves: How humans interpret topic models

Spectral clustering of graphs with general degrees in the extended planted partition model

Fog and IoT: An overview of research opportunities

Advances in domain independent linear text segmentation

Short Text Analysis Based on Dual Semantic Extension and Deep Hashing in Microblog

Knowledge discovery through directed probabilistic topic models: a survey

Agglomerative hierarchical clustering with constraints: Theoretical and empirical results

Indexing by latent semantic analysis

Bert: Pre-training of deep bidirectional transformers for language understanding

Co-clustering documents and words using bipartite spectral graph partitioning

Information-theoretic co-clustering

Revisiting power-law distributions in spectra of real world networks

Bayesian unsupervised topic segmentation

Finding structure in time

Transmission of information: A statistical theory of communications

Brown Corpus Manual

Finding semantically valid and relevant topics by association-based topic selection model

A dynamic oracle for arc-eager dependency parsing

A survey of text similarity approaches

Combining IR and LDA topic modeling for filtering microblogs

TextTiling: Segmenting text into multi-paragraph subtopic passages

Probabilistic latent semantic indexing

Empirical study of topic modeling in twitter

Clustering using a similarity measure based on shared near neighbors

Stochastic blockmodels and community structure in networks

Spectral biclustering of microarray data: coclustering genes and conditions

Text segmentation as a supervised learning task

Neural word embedding as implicit matrix factorization

From which world is your graph

Integration of Knowledge Graph Embedding Into Topic Modeling with Hierarchical Dirichlet Process

SegBot: A Generic Neural Text Segmentation Model with Pointer Network

IJCAI

Least squares quantization in PCM

Word sense disambiguation and text segmentation based on lexical cohesion

Introduction to information retrieval

Learned in translation: Contextualized word vectors

Efficient estimation of word representations in vector space

Distributed representations of words and phrases and their compositionality

Optimizing semantic coherence in topic models

Text segmentation: A topic modeling perspective

Improving topic models with latent feature word representations

Topic quality metrics based on distributed word representations

Using of Jaccard coefficient for keywords similarity

Text mining using non-negative matrix factorizations

Glove: Global vectors for word representation

Semi-supervised sequence tagging with bidirectional language models

A critique and improvement of an evaluation metric for text segmentation

Topic modeling over short texts by incorporating word embeddings

Regularized spectral clustering under the degree-corrected stochastic blockmodel

Short and sparse text topic modeling via selfaggregation

Mining of Massive Datasets: Data Mining (Ch01). Min. Massive Datasets

Sentence-bert: Sentence embeddings using siamese bert-networks

Selection and information: a class-based approach to lexical relationships

TopicTiling: a text segmentation algorithm based on LDA

Exploring the space of topic coherence measures

Co-clustering for directed graphs: the Stochastic co-Blockmodel and spectral algorithm Di-Sim

Unsupervised topic extraction from privacy policies

Topic modeling based on keywords and context

Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations

Edge computing: Vision and challenges

PRIMME: preconditioned iterative multimethod eigensolver-methods and software description

Sharing clusters among related groups: Hierarchical Dirichlet processes

Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges

Network-Aware Optimization of Distributed Learning for Fog Computing

Attention is all you need

A tutorial on spectral clustering

The Oral Cavity and Associated Structures-Clinical Methods: The History, Physical, and Laboratory Examinations

A biterm topic model for short texts

Efficient methods for topic model inference on streaming document collections

Incorporating knowledge graph embeddings into topic modeling

Dynamic word embeddings for evolving semantic discovery

A dirichlet multinomial mixture model-based approach for short text clustering

MetaLDA: a topic model that efficiently incorporates meta information

Gender bias in contextualized word embeddings

Comparing twitter and traditional media using topic models

Generative model-based document clustering: a comparative study

Human behavior and the principle of least effort: An introduction to human ecology

The psycho-biology of language: An introduction to dynamic philology