title: Scalable Community Extraction of Text Networks for Automated Grouping in Medical Databases
authors: Komolafe, Tomilayo; Fong, Allan; Sengupta, Srijan
date: 2021-11-28

Abstract: Networks are ubiquitous in today's world. Community structure is a well-known feature of many empirical networks, and many statistical methods have been developed for community detection. In this paper, we consider the problem of community structure in text networks, which is highly relevant to medical error and patient safety databases. We adapt a well-known community extraction method to develop a scalable algorithm for community extraction in large text databases. The application of our method to a real-world patient safety report database demonstrates that the groups generated by community extraction are much more accurate than manual tagging by frontline workers.

Many complex systems in today's world consist, at an abstract level, of agents who interact with one another. This general agent-interaction framework arises in a range of disciplines, such as the biological sciences (Lynall et al., 2010), physical sciences (Huberman and Adamic, 1999; Pagani and Aiello, 2013), and social sciences (Milgram, 1967), to name a few. By denoting agents as nodes and their interactions as edges, any such system can be represented as a network. Such network data provide a versatile framework for analyzing a broad spectrum of complex systems.

Community structure is a well-known feature of many empirical networks: nodes are often found to belong to groups, or communities, that exhibit similar behavior. The identification of this network structure, called community detection, is an important problem in network analysis. Community detection has important scientific implications; these communities often turn out to be groups of agents that share common properties and/or play similar roles within the network. For example, in Jonsson et al. (2006), the communities in a protein interaction network turned out to be functional groups (proteins having the same or similar function), a conclusion with important implications for cancer research. Fortunato (2010) provides a multidisciplinary exposition on community detection in networks. Fittingly, several useful tools for community detection have been developed and studied in the statistics literature. These include spectral methods (Rohe et al., 2011; Jin, 2015; Sengupta and Chen, 2015), modularity-based methods (Newman and Girvan, 2004; Bickel and Chen, 2009; Sengupta and Chen, 2018), and likelihood-based methods (Amini et al., 2013), to name a few. Most of these methods have theoretical guarantees for the accuracy of community detection.

In this paper, we study text networks, where vertices represent documents and edges represent similarity between document pairs. Similarity between text documents can be measured in a number of ways based on representational learning (Mikolov et al., 2013, 2017; Hofmann, 1999; Landauer et al., 1998; Papadimitriou et al., 2000; Dumais, 2004). We provide more details on document representation in Section 3. Text networks provide a useful framework for representing large databases of documents, and statistical network analysis techniques can be applied to the analysis of such databases.
In particular, community detection techniques can be used to group text databases into homogeneous clusters, which enables downstream analysis of the clusters thus formed. However, there has not been much work on community detection for text networks, with some very recent exceptions such as Yan et al. (2021) and Dong et al. (2020).

Our main contributions in this paper are as follows. We develop a method for clustering text networks based on representational learning combined with the well-known community extraction method proposed by Zhao et al. (2011). Most real-world text databases are large, which can lead to high computational expense when applying community extraction; we propose a novel divide-and-conquer strategy to address this issue. We demonstrate our method by applying it to a large patient safety event database, where it generates much better groups than manual tagging, as measured by document similarity.

The rest of the paper is structured as follows. In Section 2, we describe the scientific application area of medical errors and patient safety events which motivated this work, and we also introduce the patient safety error database on which our method is applied. In Section 3, we describe the proposed methodology. In Section 4, we report the results of our analysis, and we conclude the paper with a short discussion in Section 5.

The Institute of Medicine (IOM), an authority at the intersection of medicine and society, released a report titled "To Err is Human: Building a Safer Health System" in November 1999 (Donaldson et al., 2000). Its goal was to break the cycle of inaction regarding medical errors by advocating a comprehensive approach to improving patient safety. Based on two studies conducted in 1984 and 1992, the IOM concluded that between 44,000 and 98,000 patients die every year in United States (U.S.) hospitals due to medical errors. Costs from medical errors alone were estimated at $37.6 billion per year, of which about $17 billion were associated with preventable errors (Donaldson et al., 2000). Given the intense level of public and scientific reaction to the report, various stakeholders responded swiftly. In February 2000, President Clinton announced a national action plan to reduce preventable medical errors by fifty percent within five years (HOUSE, 2020), and Congress mandated the monitoring of progress in preventing patient harm. In July 2004, a Healthgrades Quality Study asserted that the IOM had in fact vastly underestimated the number of deaths due to medical errors, citing 195,000 deaths per year (Harrington, 2005). Two decades later, medical errors continue to be a leading cause of death in the United States (Makary and Daniel, 2016).

The Institute of Medicine and several state legislatures have recommended the use of patient safety event reporting systems (PSRS) to better understand and improve safety hazards (Aspden et al., 2004; Rosenthal and Booth, 2005). Numerous healthcare providers have adopted these systems, which provide a framework for healthcare staff, including frontline clinicians, nurses, and technicians, to report patient safety events ranging from "near misses", where no patient harm occurs, to serious safety events that result in patient harm (Clarke, 2006). However, the potential of these reports to systematically identify hazards and reduce harm has gone largely unrealized, in part because of the limited techniques used to analyze these data.
If the reported data can be analyzed effectively, reporting systems have the potential to dramatically improve the safety and quality of care by exposing possible weaknesses in the care process (Pronovost et al., 2008). Patient safety event (PSE) reports are free-text narratives written by front-line staff. These narratives describe incidents in which healthcare service delivery did not go as expected; the front-line staff member witnessing the incident documents his or her perspective of the events that occurred. Aggregating similar PSEs therefore has the potential to give insight into trends in the different types of medical errors that healthcare organizations encounter. There is significant variation between documents because these narratives do not have to follow any specified format. For example, documents describing similar events can vary drastically in their word usage, vocabulary, document length, and prevalence of grammatical errors. Therefore, the notion of similarity has to be based on semantic representation rather than simple features defined on the documents. In this work, we consider a PSE database from MedStar Health consisting of 2,072 documents. Our goal is to develop a clustering algorithm to find homogeneous groups of documents, and we now propose a method to accomplish this using community extraction.

In this section, we describe the process of community extraction for finding homogeneous clusters in a text database. Figure 1 provides a schematic representation of the different steps involved, and the subsequent subsections provide details on each step. Note that while this work is motivated by patient safety event reports, the general methodology can be applied to any text database. The first two steps (pre-processing and term-document matrix construction) are well-known strategies from natural language processing, while the last step (community extraction) is a well-known method from the statistical network analysis literature. We integrate these well-known approaches in our work.

A pre-processing step is critical to the performance of any natural language processing (NLP) model in order to reduce errors (Vijayarani et al., 2015). Pre-processing of text can be compared to exploratory data analysis in traditional statistical analysis. In our work, we first create a manually curated dictionary of commonly misspelled words in our corpus and replace them with their proper spelling; to do this, we extracted terms that appeared in two or more documents and corrected any misspelled terms. As PSEs typically contain information such as the date an event occurred, the time it occurred, or the dosage of a particular medication, any permutation of a date, dosage, or time is replaced with the words "date", "dose", and "time" respectively. This is because the exact time an event occurred or the exact dosage of a medication is irrelevant for our analysis; however, we should not simply remove the word, because the sentence would then lose its syntactic coherence. Therefore, we replace specific times with the general concept word "time", and likewise for dates and doses. In addition, special characters other than periods are removed, and all remaining numbers are removed from the text. For example, the sentence "On Dec. 13 at 5PM resident was prescribed 2mc/mg of oxycotine" is converted to "On date at time resident was prescribed dose of oxycotine". A minimal sketch of this pre-processing step is given below.
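To make these replacement rules concrete, here is a minimal Python sketch. The regular expressions and the small misspelling dictionary are illustrative stand-ins, not the exact patterns or the curated dictionary used in our implementation.

```python
import re

# Illustrative dictionary of corpus-specific misspellings (an assumption,
# not the actual curated dictionary described in the text).
MISSPELLINGS = {"pateint": "patient", "recieved": "received"}

MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"

def preprocess(report: str) -> str:
    text = report.lower()
    # Replace known misspellings with their proper spelling.
    for wrong, right in MISSPELLINGS.items():
        text = re.sub(rf"\b{wrong}\b", right, text)
    # Replace dates, times, and dosages with concept words so the
    # sentence keeps its syntactic coherence.
    text = re.sub(rf"\b(?:{MONTH}\s*\d{{1,2}}|\d{{1,2}}/\d{{1,2}}(?:/\d{{2,4}})?)\b", "date", text)
    text = re.sub(r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b", "time", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|mc/mg|ml|units?)\b", "dose", text)
    # Remove special characters other than periods, then remove leftover numbers.
    text = re.sub(r"[^a-z0-9.\s]", " ", text)
    text = re.sub(r"\b\d+\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("On Dec. 13 at 5PM resident was prescribed 2mc/mg of oxycotine"))
# -> "on date at time resident was prescribed dose of oxycotine"
```

Stemming, described next, would then be applied to each remaining token.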
Furthermore, to ensure that words with similar morphology are represented identically, we carried out stemming of the words, which is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form (generally a written word form). The goal of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form, and this is a common pre-processing step in text analytics (Vijayarani et al., 2015).

The next step is to represent the text database as a numeric matrix. This is a common approach in natural language processing, where the entire corpus is converted to a term-document matrix whose rows represent terms and whose columns represent documents. The weighting of the terms in our term-document matrix is critical to any subsequent analysis. We use the common term frequency-inverse document frequency weighting, referred to as "tf-idf" in the literature (Aizawa, 2003; Ramos et al., 2003). Here, term frequency is an adjusted version of the number of times a term appears in a document. Let t be a term and d be a document in the corpus. Then, term frequency is defined as

tf(t, d) = f_{t,d} / Σ_{t' ∈ d} f_{t',d},

where f_{t,d} is the raw count of term t in document d, i.e., the number of times that term t occurs in document d. Note that the denominator is the total frequency of all terms in the document, i.e., the total length of the document, which scales the raw count and allows for comparison between documents of differing lengths. The inverse document frequency is a measure of how much information the word provides, i.e., how common or rare the term t is across all documents in the corpus. Let D denote the set of all documents in the corpus and let N = |D| be the total number of documents. Then, inverse document frequency is defined as

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| ),

where the denominator is the number of documents that contain the term t. Finally, the tf-idf score is calculated as

tfidf(t, d, D) = tf(t, d) × idf(t, D).

The tf-idf term-document matrix is constructed as follows. First, we consider the set of all unique terms that appear in the corpus. Then, for each term t and each document d, we compute the tf-idf score and populate the entries of the matrix. For a toy illustration, consider the following short patient safety event reports.

• "The patient schedule did not match the script."
• "Script and schedule mismatch. Script stated vasculab and schedule xray"
• "Xray monitor will not transmit images"

The resulting term-document matrix is displayed in Table 1.
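As a concrete illustration, the following sketch builds the tf-idf term-document matrix directly from the formulas above for the three toy reports. The whitespace tokenizer is an assumption for brevity; it is not our full pre-processing pipeline.

```python
import math
from collections import Counter
import numpy as np

docs = [
    "the patient schedule did not match the script",
    "script and schedule mismatch script stated vasculab and schedule xray",
    "xray monitor will not transmit images",
]
tokenized = [d.split() for d in docs]           # simple whitespace tokenizer
terms = sorted(set(t for doc in tokenized for t in doc))
N = len(docs)

# df[t]: number of documents containing term t
df = Counter(t for doc in tokenized for t in set(doc))

# X[i, j] = tf(term_i, doc_j) * idf(term_i, corpus)
X = np.zeros((len(terms), N))
for j, doc in enumerate(tokenized):
    counts = Counter(doc)
    length = len(doc)                           # total frequency of all terms in doc
    for i, t in enumerate(terms):
        tf = counts[t] / length
        idf = math.log(N / df[t])
        X[i, j] = tf * idf
```

Note that a term appearing in every document gets idf = log(1) = 0, so uninformative terms are automatically down-weighted.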
Once we have a weighted term-document matrix, we apply the well-known technique of Latent Semantic Analysis (LSA) for dimension reduction (Turney, 2001; Dumais, 2004). LSA is able to handle obstacles prevalent in natural language processing and analysis, such as the presence of synonyms and polysemy. In what follows, we provide only a brief description of LSA; for a more detailed description of the approach, see Landauer et al. (1998). For a term-document matrix X of m terms and n documents with rank r, the singular value decomposition (SVD) can be written as

X = T Σ D^T,

where X is the m × n term-document matrix, T is an m × m matrix whose columns are the orthonormal eigenvectors of X X^T (we write X^T for the transpose of X), D is an n × n matrix whose columns are the orthonormal eigenvectors of X^T X, and Σ is an m × n diagonal matrix whose first r diagonal entries are √λ_i, 1 ≤ i ≤ r, with zeros everywhere else, where the λ_i are the eigenvalues of X X^T. The eigenvalues of X X^T are the same as the eigenvalues of X^T X, and the values √λ_i are called the singular values of X.

The implementation of LSA used in this work is a low-rank approximation based on the truncated SVD. For this, we choose a positive integer k ≤ r and keep only the k largest singular values. Since λ_i ≥ λ_{i+1}, setting the trailing singular values to zero when they are close to zero does not significantly alter the original matrix X. For a given k, the truncated SVD yields the rank-k matrix X_k that minimizes the Frobenius-norm error between X and X_k among all rank-k matrices. If k = r, the difference in the Frobenius norm is 0, while for k < r we have a low-rank approximation of our matrix that is also easier to manipulate. By keeping only the first k columns (or entries) of each of our matrices, we obtain X_k, and furthermore a low-rank approximation of both terms and documents. Therefore, we have

X_k = T_k Σ_k D_k^T,

where we keep only the first k columns of T, so T_k is an m × k matrix, D_k^T is a k × n matrix, and Σ_k is a k × k diagonal matrix. The rows of the matrix D_k are the LSA-based vector representations of the documents in the corpus.

Finally, we generate a network of documents by creating a similarity matrix from the matrix D_k. We define the similarity between two documents d_i and d_j as the correlation between the corresponding rows of D_k, resulting in an n × n correlation matrix. This correlation matrix serves as our adjacency matrix for the next step of community extraction; note that it is a weighted adjacency matrix.
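The following numpy sketch carries out the truncated SVD and produces the document correlation (adjacency) matrix, continuing from the term-document matrix X built above. The choice k = 2 is an arbitrary illustration, not a tuned value from our analysis.

```python
import numpy as np

def lsa_adjacency(X: np.ndarray, k: int) -> np.ndarray:
    """Return the n x n document correlation matrix from a rank-k LSA."""
    # Thin SVD: X = T @ diag(s) @ Dt, singular values s in decreasing order,
    # so keeping the first k components keeps the k largest singular values.
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    # Rows of D_k (= columns of Dt[:k]) are the k-dimensional document vectors.
    doc_vectors = Dt[:k].T                      # shape (n_documents, k)
    # Pairwise correlations between document vectors form the weighted adjacency matrix.
    return np.corrcoef(doc_vectors)

A = lsa_adjacency(X, k=2)                       # X from the tf-idf sketch above
print(A.shape)                                  # (3, 3) for the toy corpus
```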
Most community detection methods aim to partition a network into communities, with the goal of maximizing the number of edges within communities and minimizing the number of edges between communities. This framework assumes that all nodes belong to some community. However, there are scenarios where some nodes do not belong to any particular community, and forcing these nodes into a community will distort the community detection results. For example, suppose we have a network of high school students where a link between two students signifies that they participate in similar extra-curricular activities. Applying traditional community detection algorithms to this network will produce unsatisfactory results: some students naturally do not participate in any extra-curricular activity and therefore do not belong to a community, yet these algorithms will force such nodes into one of the formed communities. The text networks from PSE databases also have this property. We expect that the majority of PSE reports will fall into groups, but there could be some "miscellaneous" documents that do not belong to any group. Community detection methods that partition all nodes into communities would force such "miscellaneous" reports into groups, which is unwarranted. Therefore, we use the community extraction method proposed by Zhao et al. (2011), which can handle these types of networks.

We describe a network graph G as composed of vertices V and edges E, i.e., G = (V, E). The total number of vertices in a network graph G gives us the network size N, that is, N = |V|, and the number of edges is M = |E|. We consider only non-overlapping communities in this paper; therefore, once community extraction is applied to a network G, the partition results in two disjoint sets V_1 and V_2, where V_1 ∩ V_2 = ∅ and V_1 ∪ V_2 = V. A network can also be represented by an N × N adjacency matrix A with elements A_ij, i, j = 1, 2, ..., N; here A_ij ∈ (−1, 1), making it a weighted network. For text networks, the adjacency matrix A is the correlation matrix of the rows of D_k from the preceding subsection.

Communities are extracted one at a time, with the criterion of extracting a set of nodes whose total edge weight is largest within the set and smallest between the set and its complement (Zhao et al., 2011). We call this set of extracted nodes S, and its complement S^c. Following Zhao et al. (2011), the objective function maximized at each iteration step is

W̃(S) = |S| |S^c| ( O(S)/|S|^2 − B(S)/(|S| |S^c|) ),

where O(S) = Σ_{i,j ∈ S} A_ij and B(S) = Σ_{i ∈ S, j ∈ S^c} A_ij. The term O(S) is twice the weight of the edges within S, and B(S) represents the total weight of edges from the set S to the rest of the network. In large sparse networks, particularly as in our application, a very small community S could otherwise yield a large objective value; the factor |S||S^c|, which is maximized at |S| = N/2, penalizes very small and very large communities and ensures that sufficiently sized communities are extracted at each step.

To maximize the objective function, we implement the tabu search technique, a local optimization technique based on label switching (Beasley, 1998; Glover and Laguna, 1998). In this optimization technique, a string of binary values representing membership of each node in S or S^c is passed to the tabu search function (Zhao et al., 2011; Beasley, 1998; Glover and Laguna, 1998). The function tracks which nodes have been switched, ensuring that they are not switched again until a certain number of iterations have passed, making these nodes "tabu". To guard against being trapped at a local maximum, the algorithm is run repeatedly with random initial label assignments. In our implementation, the community extraction algorithm is repeated until only a small set of nodes, 30 or fewer, remains in the network, which was sufficient for our application. Zhao et al. (2011) proposed a stopping criterion only for networks that can be represented by the block model; future work will investigate a more appropriate stopping criterion.
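The sketch below illustrates the extraction objective and a bare-bones version of the tabu search. It is a simplified rendering of the approach of Zhao et al. (2011): the tenure and iteration defaults are arbitrary assumptions, and the full algorithm would evaluate the best non-tabu flip at each step rather than a random one.

```python
import numpy as np

def extraction_objective(A, in_S):
    """Criterion of Zhao et al. (2011):
    W~(S) = |S||S^c| * (O(S)/|S|^2 - B(S)/(|S||S^c|))."""
    s = int(in_S.sum())
    sc = len(in_S) - s
    if s == 0 or sc == 0:
        return -np.inf
    O = A[np.ix_(in_S, in_S)].sum()        # twice the edge weight within S
    B = A[np.ix_(in_S, ~in_S)].sum()       # edge weight between S and S^c
    return s * sc * (O / s**2 - B / (s * sc))

def tabu_extract(A, n_iter=2000, tenure=20, seed=0):
    """Bare-bones tabu search over binary node labels (one flip per step)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    in_S = rng.random(n) < 0.5             # random initial label assignment
    best_S, best_val = in_S.copy(), extraction_objective(A, in_S)
    tabu_until = np.zeros(n, dtype=int)    # step until which each node is tabu
    for step in range(n_iter):
        movable = np.flatnonzero(tabu_until <= step)
        if movable.size == 0:
            continue
        i = rng.choice(movable)            # pick a non-tabu node to switch
        in_S[i] = not in_S[i]
        tabu_until[i] = step + tenure      # the switched node becomes tabu
        val = extraction_objective(A, in_S)
        if val > best_val:
            best_S, best_val = in_S.copy(), val
    return best_S                          # boolean mask of the extracted set S
```

In the full procedure, the extracted set is removed from the network and the search is repeated on the remaining nodes until 30 or fewer remain.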
In practice, we observed run times on the order of O(n^2), where n is the size of the corpus. Our original PSE corpus contains 2,072 documents, and running one iteration of the tabu search algorithm on the entire corpus takes over 120 hours. One alternative is a divide-and-conquer strategy: splitting the corpus of 2,072 documents into chunks of 200 or 400 documents. We observed run times of about 22 hours and 44 hours when partitioning into sizes of 200 and 400, respectively. However, it is then crucial to knit similar communities from the different partitions back together, since partitioning the corpus will also result in some communities being arbitrarily split up. We therefore developed a methodology for combining similar communities from different partitions, which relies on the correlation matrix of the entire 2,072-document corpus.

We compare pairs of communities across the different partitions and combine communities whose combined density is greater than some threshold. We denote by S_{a,p} a community extracted during the implementation of our algorithm, where the integer a refers to the iteration at which the community was extracted within its partition and the integer p refers to the partition the community belongs to, with 1 ≤ a ≤ x, where x is the number of communities extracted for that partition, and 1 ≤ p ≤ y, where y is the total number of partitions for that particular implementation. Therefore, to establish whether two extracted communities, say S_{1,1} and S_{4,2}, originally belonged to the same community, we compare each of their densities, D_{a,p}, to their combined density, D_{(1,1),(4,2)}. That is,

D_{a,p} = (1 / |S_{a,p}|^2) Σ_{i,j ∈ S_{a,p}} A_ij, and D_{(1,1),(4,2)} = (1 / (|S_{1,1}| |S_{4,2}|)) Σ_{i ∈ S_{1,1}, j ∈ S_{4,2}} A_ij.

In this paper, two communities are combined if D_{(1,1),(4,2)} > 0.85 · D_{1,1} and D_{(1,1),(4,2)} > 0.85 · D_{4,2}.
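A minimal sketch of this merging rule follows, assuming each community is given as a list of document indices into the full correlation matrix A; the 0.85 threshold is the value used in the text.

```python
import numpy as np

def density(A, S, T):
    """Average correlation between members of S and members of T."""
    return A[np.ix_(S, T)].sum() / (len(S) * len(T))

def should_merge(A, S1, S2, threshold=0.85):
    """Combine two communities from different partitions if their combined
    density exceeds `threshold` times each community's own density."""
    d1 = density(A, S1, S1)          # D_{a,p} for the first community
    d2 = density(A, S2, S2)          # D_{a,p} for the second community
    d12 = density(A, S1, S2)         # combined density, e.g. D_{(1,1),(4,2)}
    return d12 > threshold * d1 and d12 > threshold * d2
```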
In this section, we report the results from applying the methodology proposed in Section 3 to the MedStar PSE corpus of 2,072 documents. First, we establish a reference method for benchmarking. These PSE reports are manually tagged by the front-line staff using options available from a drop-down menu. Tags include a general event description, with 20 options to select from in our reports, and a specific event description, with 187 options that are sub-categories of the general event descriptions. If the tags are descriptive enough, we would expect the diagonal of the correlation matrix, representing average correlation within a group, to be high, and conversely the off-diagonal entries to be low; this would suggest that front-line staff are tagging similar documents with similar tags. If the correlation matrix does not follow this pattern, it suggests that the tags available to the front-line staff are not descriptive enough for each report type.

The benchmark results from manual tagging are displayed in Figure 2 as a heatmap. Clearly, manual tagging fails to obtain high correlation within groups and low correlation between groups.

Figure 2: Heatmap of groups generated by manual tagging, from 52 × 52 combined event types, i.e., a general event tag paired with its specific event.

Beyond the visual illustration, we can also examine statistics of the correlation matrix obtained under manual tagging. Specifically, are there communities (tags) whose documents are more related to the documents of another community (tag) than to each other? We assess this by looking at the percentage of off-diagonal cells whose value is equal to or greater than the diagonal cell for a given column of the correlation matrix. Some examples of manually tagged categories that are more similar to other categories than within themselves are given below.

• "Medication": more related to "Fluid-Outdated" and "Unusable Medication"
• "Equipment": more related to "Medical Device-Sterilization" and "Cleanliness Issue"
• "Diagnostic Imaging-Test -Wrong Side (L vs. R)": more related to "Blood Bank-Patient Testing (Blood Bank Use Only)", "Diagnostic Imaging-Image -Misidentified", and "Diagnostic Imaging-Test -Test Delayed"

Next, we applied the methodology described in Section 3 to obtain automated tags via community extraction. Recall that implementing the method on the full network of 2,072 documents is computationally very intensive, and therefore we applied the divide-and-conquer approach described at the end of Section 3. We used subnetworks of 200 or 400 documents, and used a correlation cut-off of 0.15 or 0.2 to keep the network from getting too dense. The results are plotted in Figure 3. Note that the community extraction method does not require pre-specification of the number of communities; rather, the number of communities is an output of the method. We obtained 113, 156, and 125 clusters, respectively, from top to bottom of Figure 3. From the correlation heatmaps, it is clear that the documents have very high within-group correlation and very low between-group correlation, which indicates that the grouping is effective. This is a substantial improvement over manual tagging (Figure 2). Note that the results from community extraction are better than those from manual tagging across the range of tuning parameters, i.e., subgraph size and correlation threshold.

Next, recall that we observed "heterophilic" behavior with manual tagging, where documents in some groups have higher between-group correlation than within-group correlation. To compare manual tagging and community extraction with respect to this property, we looked at each group and computed the fraction of other groups that have higher between-group correlation than within-group correlation. The boxplots are shown in Figure 4, where we compare manual tagging to a representative community extraction. We observe that the groups from community extraction exhibit very little "heterophilic" behavior compared to manual tagging.

Figure 4: Comparison of the predefined PSE categories/tags against community extraction, showing the distribution of the percentage of communities that are more similar (higher correlation score) to a given community than the documents within that community.

Finally, recall that our divide-and-conquer strategy involves random partitioning of the large text network into a number of smaller subnetworks. A natural question is "How stable are the groupings generated under random partitioning?" To answer this question, we implemented several random iterations of the divide-and-conquer strategy and computed the Normalized Mutual Information (NMI) between the document groupings arising in different iterations. A high value of NMI indicates high stability of the document grouping across random iterations. The results are plotted in Figure 5 for several tuning parameter values. We observe that the NMI values are quite high, indicating stability of the clustering.
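As an illustration of this stability check, the sketch below computes NMI between two cluster labelings with scikit-learn; the label vectors are toy values, whereas in our analysis they come from separate random runs of the divide-and-conquer extraction.

```python
from sklearn.metrics import normalized_mutual_info_score

# Toy example: cluster labels for the same 8 documents from two random
# iterations of the divide-and-conquer extraction (illustrative values only).
labels_run_1 = [0, 0, 1, 1, 2, 2, 2, 3]
labels_run_2 = [1, 1, 0, 0, 3, 3, 3, 2]   # same grouping, permuted label names

# NMI is invariant to label permutation; identical groupings give NMI = 1.0.
print(normalized_mutual_info_score(labels_run_1, labels_run_2))  # 1.0
```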
This analysis demonstrates the advantage of a network-driven approach to extracting communities in patient safety event free-text. The results clearly show categories with less overlap compared to the categories manually selected by the front-line staff. The analysis can likely identify communities of reports, or themes in the reports, that are not dependent on the structured categories. For example, communication and hand-off are often prevalent themes in patient safety reports that are not typically captured in structured fields. Structured fields are predefined and often difficult or time-consuming to change and update. A network-driven approach that leverages the free text is more flexible and can identify hazards in a more timely manner as environments and care processes change. Having a flexible approach is particularly important as new workflows are introduced (e.g., COVID-19 protocols, telehealth). A network-driven approach to identifying communities and themes in free text can also help reduce the burden on reporters of choosing among complex taxonomies, which is both time-consuming and error-prone. In addition, these results highlight the potential to identify communities of related reports that might be missed when analyzing only the structured categories. Such categorization flexibility could greatly help safety analysts and safety leaders better identify meaningful signals and insights from all the data.

This analysis was performed on data from one healthcare system. As a result, the comparison of extracted communities with the structured categories is specific to the structured categories implemented at that healthcare system. It is possible that other healthcare systems use different categorization taxonomies, highlighting the need to understand the generalizability of this approach across taxonomies and healthcare systems. In addition, the present method does not consider temporal effects on communities. Expanding this approach to include temporally stable communities or emerging communities would be important, especially as changes to policy, workflow, and safety hazards often occur.

References

Aizawa (2003). An information-theoretic perspective of tf-idf measures. Information Processing & Management.
Amini et al. (2013). Pseudo-likelihood methods for community detection in large sparse networks.
Aspden et al. (2004). Patient safety reporting systems and applications.
Beasley (1998). Heuristic algorithms for the unconstrained binary quadratic programming problem.
Bickel and Chen (2009). A nonparametric view of network models and Newman-Girvan and other modularities.
Clarke (2006). How a system for reporting medical errors can and cannot improve patient safety.
Donaldson et al. (2000). To err is human: building a safer health system.
Dong et al. (2020). Overlapping community detection in weighted temporal text networks.
Dumais (2004). Latent semantic analysis. Annual Review of Information Science and Technology.
Fortunato (2010). Community detection in graphs.
Glover and Laguna (1998). Tabu search.
Harrington (2005). Revisiting medical error: five years after the IOM report, have reporting systems made a measurable difference?
Hofmann (1999). Probabilistic latent semantic indexing.
HOUSE (2020). Clinton-Gore administration announces new actions to improve patient safety and assure health care quality.
Huberman and Adamic (1999). Internet: growth dynamics of the World-Wide Web.
Jin (2015). Fast community detection by SCORE. The Annals of Statistics.
Jonsson et al. (2006). Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis.
Landauer et al. (1998). An introduction to latent semantic analysis.
Lynall et al. (2010). Functional connectivity and brain networks in schizophrenia.
Makary and Daniel (2016). Medical error: the third leading cause of death in the US.
Mikolov et al. (2013). Efficient estimation of word representations in vector space.
Mikolov et al. (2017). Advances in pre-training distributed word representations.
Milgram (1967). The small world problem.
Newman and Girvan (2004). Finding and evaluating community structure in networks.
Pagani and Aiello (2013). The power grid as a complex network: a survey.
Papadimitriou et al. (2000). Latent semantic indexing: a probabilistic analysis.
Pronovost et al. (2008). Improving the value of patient safety reporting systems.
Ramos et al. (2003). Using tf-idf to determine word relevance in document queries.
Rohe et al. (2011). Spectral clustering and the high-dimensional stochastic blockmodel.
Rosenthal and Booth (2005). Maximizing the use of state adverse event data to improve patient safety.
Sengupta and Chen (2015). Spectral clustering in heterogeneous networks.
Sengupta and Chen (2018). A block model for node popularity in networks with community structure.
Turney (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL.
Vijayarani et al. (2015). Preprocessing techniques for text mining: an overview.
Yan et al. (2021). Overlapping community detection in temporal text networks.
Zhao et al. (2011). Community extraction for social networks.

Appendix

Table 2: List of General Event Tags from PSE