ARTICLE

Text Analysis and Visualization Research on the Hetu Dangse During the Qing Dynasty of China

Zhiyu Wang, Jingyu Wu, Guang Yu, and Zhiping Song

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2021
https://doi.org/10.6017/ital.v40i3.13279

Zhiyu Wang (mikemike248@gmail.com) is PhD Candidate, School of Management, Harbin Institute of Technology and Associate Professor, School of History, Liaoning University. Jingyu Wu (734665532@qq.com) is graduate student, School of History, Liaoning University. Guang Yu (yug@hit.edu.cn) is Professor, School of Management, Harbin Institute of Technology. Zhiping Song (1367123893@qq.com) is graduate student, School of History, Liaoning University. © 2021.

ABSTRACT

In traditional historical research, interpreting historical documents subjectively and manually causes problems such as one-sided understanding, selective analysis, and one-way knowledge connection. In this study, we aim to use machine learning to automatically analyze and explore historical documents from a text analysis and visualization perspective. This technology solves the problem of large-scale historical data analysis that is difficult for humans to read and intuitively understand. In this study, we use the historical documents of the Qing Dynasty Hetu Dangse, preserved in the Archives of Liaoning Province, as data analysis samples. China’s Hetu Dangse is the largest Qing Dynasty thematic archive with Manchu and Chinese characters in the world.
Through word frequency analysis, correlation analysis, co-word clustering, the word2vec model, and SVM (Support Vector Machines) algorithms, we visualize historical documents, reveal the relationships between functions of the government departments in the Shengjing area of the Qing Dynasty, achieve the automatic classification of historical archives, improve the efficient use of historical materials, and build connections between historical knowledge. This gives archivists practical guidance in the management and compilation of historical materials.

INTRODUCTION

China has a long history documented in numerous archives. At present, various local archive departments preserve large numbers of historical documents from different periods. Owing to the development of China’s archive digitization, archive management departments at all levels have established digital archive abstracts, catalogs, and subject indexes of historical documents in their collections, realizing online retrieval of historical archives. As research on Chinese history deepens, simple catalog retrieval cannot satisfy researchers’ demand for related knowledge in historical archives. Owing to the limitations of the catalog retrieval system, complex catalog data still need to be read manually. However, it is difficult to view the overall picture of the recorded content and impossible to easily distinguish important information in historical materials; this leads to various difficulties for researchers of Chinese history, such as in the compilation of historical materials. Thus, in this study, we aim to use text analysis and visualization methods in machine learning to conduct data mining analysis of historical document data.
These methods will help us discover the logical relationships of historical records and their purposes, accomplish visual presentations of historical entities and knowledge discovered in historiography, improve knowledge representation and automatic classification of historical data, and provide valuable information for historical archive researchers.

During the process of analyzing traditional manual methods for interpreting historical documents, we find the following phenomena: macro description, single angle, selective analysis, and one-way knowledge connection, among others. For example, the Hetu Dangse preserved in the Liaoning Archives contains a total of 1,149 volumes and 127,000 pages, making it difficult to fully grasp and understand the overall content of such documents. Relying on manual reading and analysis of entire archives is an unrealistic task. Therefore, this paper proposes using machine learning, natural language processing (NLP), and other technologies to address various problems from traditional manual reading. First, information from historical documents can be revealed from different angles, and this allows the content of the documents to be displayed more comprehensively and scientifically through visual charts. Second, use of objective quantitative analysis methods, such as text analysis and NLP, prevents subjective interpretations of the same content. Third, NLP and other technologies can solve the problem of calculating massive text training data sets while forming systematic knowledge that avoids the omission and one-sided understanding of knowledge in the historical archive.
The application of machine learning in historical data analysis has attracted the attention of researchers in management, history, and computer science. Tao used the Latent Dirichlet Allocation (LDA) topic modeling algorithm to analyze the themes of documents from 1700 to 1800 included in the German Archives, providing a more three-dimensional interpretation and explanation of the spiritual world of Germany during the eighteenth century.1 Chinese scholars Kaixu et al. proposed a method of automatic sentence punctuation based on conditional random fields in ancient Chinese.2 This method was proved to better solve the problem of automatic punctuation processing compared with the single-layer conditional random field strategy in ancient Chinese as tested on the two corpora of The Analects and Records of the Grand Historian. Swiss and South African scholars Stauffer, Fischer, and Riesen, and Chinese scholars Wu, Wang, and Ma used the KWS technology and deep reinforcement learning to automatically recognize handwritten pictures in historical documents.3 Solar and Radovan used the National and University Library of Slovenia’s historical pictures and maps as research data. Using GIS technology, they created a novel display method, and interdisciplinary data resource web application to access and research the data.4 Chinese scholars Dong et al. and Polish scholars Kuna and Kowalski used the WebGIS technology to conduct efficient management and visualization research on historical data of natural disasters in ancient China and Russia. 5 Meanwhile, Latvian scholars Ivanovs and Varfolomeyev and Dutch scholars Schreiber et al. used web technology to develop a web service platform and explored the intelligent environment of cultural heritage service utilization.6 Korean scholars Kim et al. 
used machine learning technology to determine the complex relationships between tasks of various classes in a specific historical period through the network of historical figures.7 Judging from results in related fields, the semantic analysis and visualization of historical archives in an intelligent way are gradually moving from statistical description to knowledge mining. These results provide theoretical feasibility and practical technical experience for this study.

At present, research on historical documents mainly focuses on the retrieval and utilization of historical material databases. Since the words, semantics, grammar, and sentence patterns recorded in historical materials differ from modern texts, using data mining technologies such as machine learning and NLP to intelligently identify historical documents and organize historical data will help us more than traditional methods. This requires the cooperation of artificial intelligence and historical researchers to establish an effective method of historical big data analysis to achieve the transformation from traditional manual historical document analysis to automatic artificial intelligence analysis methods. In this paper, we use machine learning and data visualization as tools to examine the content of historical documents differently from traditional literature reading, reveal valuable information in that content, and promote a more systematic, efficient, and detailed understanding of the literature.

RELATED TECHNOLOGY DEFINITION

To perform text analysis and visualization of the Hetu Dangse, we use machine learning technology such as word vector processing, the SVM (Support Vector Machines) model, and network analysis.
Word vector is a numerical vector representation of a word’s literal and implicit meaning.8 We segmented the Hetu Dangse’s catalog data and used the word2vec model to transform the segmented data’s word vector form into a set of 50-dimensional numerical vectors representing a catalog’s vector data set. To accurately visualize historical document records’ relationship features, we reduced the vector data set’s dimensionality. Dimensionality reduction, or dimension reduction, is data’s transformation from a high- into a low-dimensional space so that the representation retains some of the original data’s meaningful properties, ideally close to its intrinsic dimension.9 After dimensionality reduction, each catalog data in the vector data set is reduced from 50 to 2 dimensions to facilitate flat display. We used the SVM model and network analysis technology to analyze the vector data set. The SVM model is a set of supervised learning methods used for classification, regression, and outlier detection.10 It is given a vector data set as training to represent historical document records as points in space, and learns independently through the kernel algorithm. Using the algorithm, it maps the separated new records to the same space, and predicts their category based on which side of the interval they fall. Network analysis techniques derive from network theory, a computer science system demonstrating social networks’ powerful influences. Network analysis technology’s characteristics determine that it is suitable for books and historical archives’ visualization in the library and information science field, because the visualization technique involves mapping entities’ relationships based on the symmetry or asymmetry of their relative proximity.11 Thus, it helps to discover historical documents’ knowledge relevance. 
For example, citation network analysis can identify emerging relationships in healthcare domain journals.12

SAMPLE DATA PREPROCESSING AND CLASSIFICATION

This study uses the catalog of the Qing Dynasty historical archives from the Hetu Dangse collected by the Liaoning Archives as the research sample to conduct text analysis and visualization research. China’s Hetu Dangse is the largest Qing Dynasty thematic archive with Manchu and Chinese characters, both in China and internationally. The Hetu Dangse comprises the official documents exchanged between the Shengjing General Yamen, the Wubu of Shengjing, and the Fengtian Office, and the documents exchanged between the Beijing Internal Affairs Office in Charge and the Liubu of Beijing during the Qing Dynasty. The Hetu Dangse was published from 2015 to 2018, including the Hetu Dangse·Kangxi period (56 volumes), Hetu Dangse·Yongzheng period (30 volumes), Hetu Dangse·Qianlong period (24 volumes), Hetu Dangse·Qianlong period (17 volumes), Hetu Dangse·Daoguang period (52 volumes), Hetu Dangse·Jiaqing period (58 volumes), Hetu Dangse·Qianlong period Official Documents (46 volumes), Hetu Dangse·Qianlong period Official Documents (46 volumes), and Hetu Dangse·general list (16 volumes).13 The Hetu Dangse is an important document for studying the history of the Qing Dynasty. Owing to the special status of Shengjing in the Qing Dynasty, it has a unique historical significance as the companion capital of Beijing and the hometown of the Qing royal family. This provides original evidence from this time for studying politics, economy, culture, history, and natural ecology in Northeast China.
In this study, we preprocess the catalog data of the Hetu Dangse by performing text segmentation, creating a corpus, and labeling data before using text analysis and visualization technology to analyze the catalog data of Hetu Dangse. First, we use word frequency analysis and statistics to study the functions of institutions. Second, we use the co-word clustering algorithm to quantify and visualize the institutional relationships. Finally, we use the SVM model to automatically classify and explore the catalog data of the Hetu Dangse. Figure 1 illustrates this process.

Figure 1. Text analysis flowchart.

Data Preparation and Preprocessing

We collected 95,680 catalog data items in the Hetu Dangse of the Liaoning Archives, including 25,148 items from the Kangxi period; 1,096 items from the Yongzheng period; 23,819 items from the Qianlong period; 20,730 items from the Jiaqing period; and 15,887 items from the Daoguang period. The content of each catalog data item includes three parts: title information, time of publication (Chinese lunar calendar), and responsible agency. The catalog data are not evenly distributed across periods; the Kangxi period has the highest proportion (26.2%). Through the catalog data information, we can perform an in-depth analysis of the content of the Hetu Dangse from three perspectives: institutional functions, institutional relationships, and topic classification.

Data Cleaning

As the text recorded in the archives of the Hetu Dangse is in Manchu and ancient Chinese, using Chinese word segmentation tools (jieba, SnowNLP, THULAC, etc.) based on modern Chinese will cause errors. Therefore, it is necessary to construct a special text corpus for word segmentation. First, we construct a stop vocabulary list to remove words with little impact on semantics in the Hetu Dangse, such as for (为), please (请) and of (之).
Second, we use the word segmentation tools mentioned above for preliminary word segmentation and then perform part-of-speech tagging and word segmentation corrections based on the word segmentation results. The title part of the catalog data of the Hetu Dangse mainly contains three dimensions of information: the record title of the catalog, issuing institution, and receiving institution. Accordingly, we set a total of four types of tags in the text corpus: issuing institution, receiving institution, record type, and keywords. The receiving institution and the issuing institution correspond to the institutions at the beginning and the end of the catalog, respectively, such as the words Shengjing Zhangguan Fang Zuoling, and Shengjing Ministry of Justice. The record type is the word immediately preceding the receiving institution, such as counseling (咨) and please (请). The keywords are words that can represent the overall semantics in the record title of the catalog, such as arrest (缉拿) and advance (进送). Table 1 presents the corpus we developed.

Table 1. Hetu Dangse corpus

Num      Word              Property 1      Property 2
1        盛京掌关防佐领      Organization    Noun
2        为                Stop_words      Preposition
3        缉拿              Keywords        Verb
4        逃人              Keywords        Noun
5        舒廷              Name            Noun
6        官事              Stop_words      Noun
7        咨                Keywords        Verb
8        盛京刑部          Organization    Noun
9        正白旗佐领        Organization    Noun
10       兆麟              Name            Noun
11       呈                Stop_words      Preposition
12       为                Stop_words      Preposition
13       交纳              Keywords        Verb
14       壮丁              Keywords        Noun
15       银两事            Keywords        Noun
┋        ┋                 ┋               ┋
61047    收讫事            Keywords        Noun
61048    盛京佐领          Organization    Noun

Label Data

To improve the utilization efficiency of the Hetu Dangse and show the document content information from multiple angles, we use a supervised machine learning method to automatically classify the catalog data of the Hetu Dangse. Therefore, the original catalog data set must be labeled.
We determine the classification and label of the Hetu Dangse catalog according to the Chinese Archives Classification Law, Chapter 12. Table 2 presents the 11 categories of the catalog. With this, we complete the Hetu Dangse catalog sampling classification and labeling, laying the foundation for automatic catalog classification. The Hetu Dangse has a total of 95,680 catalog records involving five periods: Kangxi, Yongzheng, Qianlong, Jiaqing, and Daoguang. We randomly select 500 records from each period and manually label these 2,500 records as the sample data set. The data classification after manual labeling is shown in figure 2. The overall distribution is relatively even, making it suitable for machine learning processing.

Table 2. Data labels

Num    Category
1      Type of Official Documents (政务种类)
2      Palace, Royal Family and Eight Banners Affairs (宫廷、皇族及八旗事务)
3      Bureaucracy, Officials (职官、吏役)
4      Military (军事)
5      Politics and Law (政法)
6      Sino-foreign Relations (中外关系)
7      Culture, Education, Health and Scientific Cultural study (文化、教育、卫生及科学文化研究)
8      Finance (财政)
9      Agriculture, Water Conservancy, Animal Husbandry (农业、水利、畜牧业)
10     Building (建筑)
11     Transportation, Post and Telecommunication (交通、邮电)

Figure 2. Percentage of the Hetu Dangse catalog data label chart.

RESULTS

In this study, we used the catalog data of the Hetu Dangse as a sample to analyze and reveal the Hetu Dangse catalog data from three perspectives: institutional function, institutional relationship, and automatic classification. This will improve the usage efficiency of the Hetu Dangse and, in turn, researchers’ mastery of relevant information about the document.
To achieve the functional requirements of text analysis, we adopted four methods: word vector conversion, word frequency analysis, co-word clustering, and the SVM model.

Word Vector Conversion of Text Catalog Data

The automatic classification of machine-learning technology is based on vector data sets. Thus, the Hetu Dangse text catalog data set must be vectorized before automatic classification. Currently, word vector conversion technology mainly includes methods such as one-hot, word2vec, and GloVe. The Hetu Dangse records more than 200 years of Qing Dynasty history. There are inevitable relationships among the contents recorded in the documents, indicating that they are not isolated from each other. The word2vec model provides an efficient implementation of the CBOW and skip-gram architectures for computing vector representations of words, both of which are simple neural network models with one hidden layer. The word2vec model takes a text corpus as input and produces word vectors as output. It generates a vocabulary from the input words and then learns the word vectors via backpropagation and stochastic gradient descent.14 Because both architectures enrich semantic relevance using the surrounding context, the word2vec model is well suited to the semantic relevance of historical documents such as the Hetu Dangse. Therefore, we adopt the skip-gram model to analyze the catalog data of Hetu Dangse. Using the Gensim library in Python, we trained word2vec embeddings on the segmented catalog data from the corpus and obtained the htd.model.bin vector file and the htd.text.model model file. The trained model can then be used to find the correlation between words in the Hetu Dangse catalog.
For example, if the word Bannerman (旗人) is input into the model, the most relevant words are Minren (民人, with 0.84726 relevance), accused (被控, with 0.812017), and robbery (抢劫, with 0.795359). To visualize the ethnic relationships recorded in the Hetu Dangse catalog, we input the first 300 words of the word vector into the trained word2vec model and performed dimensionality reduction to realize a planar graph. To understand the structure of the data intuitively, we used the t-SNE algorithm to reduce the dimensions of the word vector. The t-SNE algorithm is a type of nonlinear dimensionality reduction used to ensure that similar data points in high-dimensional space are as close as possible in low-dimensional space. We set the embedded space dimension parameter of t-SNE to 2 and the initialization parameter to pca, which is more globally stable than random initialization. The maximum number of optimization iterations is 5,000. Figure 3 presents the results.

In figure 3, the terms Sanling, Yongling, Zhaoling, Prime Minister, and Fuling form clusters. In Shengjing, the Qing set up the Sanling Prime Minister’s Office, and the Prime Minister’s Mausoleum Affairs Minister was appointed concurrently by General Shengjing. Near Fujinmen, the Sanling Prime Minister’s Office was established. In the 30th year of Guangxu, the government office was changed to the Prime Minister’s Office of Shengjing mausoleum affairs, and the governor of the three provinces concurrently served. Under the Sanling Prime Minister’s Office, the Sanling office was set up to undertake the sacrifice and repair affairs of the three tombs (Xinbin Yongling, Shenyang Fuling, and Zhaoling).15 Therefore, the clustering in figure 3 verifies the close relationship between the Sanling Prime Minister’s Office and the tombs.

Figure 3. 2D t-SNE visualization of word2vec vectors.
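The projection step described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: it assumes scikit-learn and NumPy are available, random vectors stand in for the 300 word2vec vectors, and the helper name `project_to_2d` is hypothetical. Only the settings stated in the text (2 output dimensions, PCA initialization, an iteration cap) come from the article.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(vectors, iterations=5000, seed=0):
    """Reduce word vectors to 2 dimensions with t-SNE (PCA initialization)."""
    # Newer scikit-learn releases call the iteration count `max_iter`,
    # older ones call it `n_iter`; use whichever this version accepts.
    params = inspect.signature(TSNE.__init__).parameters
    iter_key = "max_iter" if "max_iter" in params else "n_iter"
    tsne = TSNE(n_components=2, init="pca", random_state=seed,
                **{iter_key: iterations})
    return tsne.fit_transform(vectors)


# Placeholder data: 300 random 50-dimensional vectors stand in for the
# word2vec vectors of the first 300 words (not the real Hetu Dangse data).
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(300, 50)).astype(np.float32)

# Fewer iterations than the 5,000 reported in the text, to keep the demo quick.
coords = project_to_2d(word_vectors, iterations=500)
print(coords.shape)  # one (x, y) pair per word
```

Each row of `coords` could then be drawn on a scatter plot and annotated with its word, which is the kind of planar graph shown in figure 3.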
Analysis of the Relationship Between the Documents Received and Sent of the Institution

With the statistics of the text data obtained after word segmentation, we can find the quantitative relationship between the documents received and sent by an institution, using the Pearson correlation coefficient to judge whether there is a correlation between the number of documents received and the number of documents sent by the same institution.

$$\rho(r,s) = \frac{\operatorname{cov}(R,S)}{\sigma_r \sigma_s} \tag{3.1}$$

Here, $\rho(r,s)$ is the Pearson correlation coefficient between the number of documents received and the number of documents sent. $R = \{r_1, r_2, r_3, \ldots, r_{11}\}$ is the variable set of documents received by the institutional sample, and $S = \{s_1, s_2, s_3, \ldots, s_{11}\}$ is the variable set of documents sent by the institutional sample. By dividing the covariance of $R$ and $S$ by the product of their respective standard deviations, we obtain the value of the correlation coefficient of the documents sent and received by the same institution.

Mining the Relationship Between Institutions’ Sending and Receiving Documents Based on Co-word Clustering

To mine the relationship between the institutions’ sending and receiving documents, we adopt a co-word clustering algorithm to generate a visualized network map of institutional relationships. The global co-occurrence rate represents the probability of two words appearing together in all the data sets. In large-scale data sets, if two words often appear together in the text, these two words are considered to be strongly related semantically.16 Clustering is a method that places objects into a group by similarity or dissimilarity. Thus, keywords with high correlation to each other tend to be placed in the same cluster. Social network analysis, which evaluates the unique structure of interrelationships among individuals, has been extensively used in social science, psychological science, management science, and scientometrics.17

We can obtain a sociogram from the institutional function analysis. The main purpose of the sociogram is to provide information about the relationship between institutions’ sending and receiving documents. In the sociogram, each member of a network is described by a “vertex” or “node.” Vertices represent high-frequency words, and the sizes of the nodes indicate the occurrence frequency. The smaller the size of a node, the lower the occurrence frequency. Lines depict the relationships between two institutions. A line between two keywords indicates that the corresponding institutions received documents from or sent documents to each other. The thickness of a line is proportional to the correlation between the keywords: the thicker the line, the stronger the connection. Using this rationale, the map visualization and network characteristics (centrality, density, core-periphery structure, strategic diagram, and network chart) can be obtained by analyzing Pearson’s correlation matrix or other similarity matrices.18 In this study, we conducted network analysis on a binary matrix to display the relationships between the documents sent and received by the institutions in the Shengjing area during the Qing Dynasty recorded in the Hetu Dangse. Further, we extracted the receiving institution and issuing institution from each record of catalog data in the Hetu Dangse, and then we composed a new data set with the following fields: receiving institution, issuing institution, and title content. We used Python to convert the new data set to EndNote format and imported it into VOSviewer 1.6.15 to calculate and draw a visual map of the new data set.
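The counting that underlies such a sociogram can be illustrated with a short sketch. It is not the VOSviewer pipeline itself, which applies its own normalization, layout, and clustering; it only shows how node sizes (occurrence frequency) and line thickness (co-occurrence strength) could be derived from issuing and receiving institution pairs. The records and the function name `build_network` are hypothetical.

```python
from collections import Counter

# Hypothetical records: (issuing institution, receiving institution).
records = [
    ("Shengjing Zuoling", "Shengjing Internal Affairs Office"),
    ("Shengjing Internal Affairs Office", "Shengjing Ministry of Revenue"),
    ("Shengjing Zuoling", "Shengjing Internal Affairs Office"),
    ("Shengjing Ministry of Justice", "Shengjing Internal Affairs Office"),
]


def build_network(pairs):
    """Count node frequencies and undirected co-occurrence edge weights."""
    node_freq = Counter()
    edge_weight = Counter()
    for issuer, receiver in pairs:
        node_freq.update([issuer, receiver])
        # Sort the pair so that A-B and B-A map to the same undirected edge.
        edge_weight[tuple(sorted((issuer, receiver)))] += 1
    return node_freq, edge_weight


node_freq, edge_weight = build_network(records)
for (a, b), weight in edge_weight.most_common():
    print(f"{a} -- {b}: {weight}")
```

In a drawn network, `node_freq` would drive node size and `edge_weight` would drive line thickness.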
Van Eck and Waltman of the Netherlands’ Leiden University developed VOSviewer, a bibliometric analysis software used for constructing and visualizing network graphs.19 Although the software’s development principle is based on documents’ co-citation principles, it can be applied to the construction of data network knowledge graphs in various fields. Combined with the co-word clustering algorithm, we can create an entity connection network map for historical documents through VOSviewer software to reflect the recorded content.

Automatic Classification Method of Historical Archives Catalog Based on the SVM Model

We used the SVM model in machine learning for automatic classification. The SVM model has the advantages of strong generalization, low error rate, strong learning ability, and support for small sample data sets, making it suitable for historical archive catalog data samples with small sample characteristics. Therefore, we attempted to classify the catalog data set of Hetu Dangse using the SVM model. First, we divided the vectorized labeled data set into a training set and a testing set. The training set accounts for 70% of the data, and the testing set accounts for 30%. To ensure the accuracy of the model prediction, we adopted a random division method to avoid overfitting. Second, we used a linear kernel in the SVM model and grid search to find the best parameter. Various combinations of the penalty coefficient (C) and gamma parameter in the SVM model were tested and ranked by accuracy from high to low. We then determined the best parameter combination. After the model was established, we validated the predictive performance of the model from multiple perspectives such as precision, recall, and F1 score to ensure the generalization ability and availability of the model. We set the penalty coefficients to 10, 100, 200, and 300, while the gamma parameters are set to 0.1, 0.25, 0.5, and 0.75.
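The split-and-search procedure above might look roughly like the following sketch, assuming scikit-learn. The data are synthetic (a smaller stand-in for the 2,500 labeled, vectorized records), and the RBF kernel is an assumption made for illustration: scikit-learn ignores gamma for a linear kernel, so a kernel that uses gamma is needed for the grid over gamma to have any effect. The grid values are the ones listed in the text.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized, labeled catalog records
# (50 features, 11 classes); not the real Hetu Dangse data.
X, y = make_classification(
    n_samples=1100, n_features=50, n_informative=20,
    n_classes=11, n_clusters_per_class=1, random_state=0,
)

# Random 70/30 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Grid values taken from the text. The RBF kernel is an assumption made here
# so that gamma has an effect; it is not confirmed by the article.
param_grid = {"C": [10, 100, 200, 300], "gamma": [0.1, 0.25, 0.5, 0.75]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=3)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
# Precision, recall, and F1 score per class on the held-out testing set.
print(classification_report(y_test, search.predict(X_test), zero_division=0))
```

`search.cv_results_` holds the mean score for every (C, gamma) pair, which is the kind of data a three-axis tuning plot would be drawn from.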
We used the precision evaluation criteria to find the optimal parameter combination of the model and then applied it to the model. The penalty coefficient is set to the X-axis, the gamma parameter to the Y-axis, and the precision to the Z-axis. We implemented the model to obtain the visualization that is shown in figure 4. Clearly, the optimal parameter combination is a penalty coefficient of 10 and a gamma parameter of 0.075.

Figure 4. SVM grid search parameter tuning diagram.

DISCUSSION

The history of a nation is the foundation on which it is built. Historical documents are the witnesses and recorders of history. Through the study of historical documents, we can go back to the past, cherish the present, and look forward to the future. An increasing number of scholars have studied these documents in recent years due to their importance. The Hetu Dangse records the document communications between institutions in Shengjing (now Shenyang) and Beijing during the Qing Dynasty. It is an important historical document that cannot be ignored when studying the history of Northeast China during the Qing Dynasty. Here, we use the catalog data of the Hetu Dangse as the sample data to test the machine learning methods previously mentioned. We explore the results from the perspectives of institutional function, institutional relationship, and automatic classification to determine the feasibility of our methods.

Functions of Institutions

The number of institutions involved in the Hetu Dangse is over 150. These functional departments formed the governance system of the Shengjing area during the Qing Dynasty.
To gain a deeper understanding of the Qing Dynasty’s ruling system in the Shengjing area, the functions of these institutions should be examined. This study analyzes the functions of the institutions in the Shengjing area through the number of documents and the frequency of content of the sending and receiving institutions.

Analysis of the Number of Documents Received and Sent by Institutions

By sorting and statistically analyzing the catalog data of Hetu Dangse, we obtained data on the number of documents received and sent by institutions in the Shengjing area recorded in the Hetu Dangse. We set the vertical axis as the total number of communicated documents, number of issued documents, and number of received documents. We set the horizontal axis as the names of the institutions and then drew a histogram. This study analyzes the number of institutional archives of the Hetu Dangse catalog from three perspectives (total number of sent and received documents, number of received documents, and number of issued documents) to find the institutions with the highest research value in the Shengjing area. In the histogram shown in figure 5(A), the top three institutions in total number of communicated documents are Shengjing Internal Affairs Office, Shengjing Zuoling, and Shengjing Ministry of Revenue. We can also observe that the top 10 institutions differ in the volumes of documents they received and sent. Therefore, the ranking of the total number of communicated documents is not directly related to the respective rankings of the number of documents received and the number of documents sent. In figure 5(B), we can observe that the top three institutions in number of documents received in the Hetu Dangse are Shengjing Internal Affairs Office, Shengjing Ministry of Revenue, and Shengjing General Yamen.
Figure 5(C) shows that the top three institutions in number of documents sent in the Hetu Dangse are Shengjing Internal Affairs Office, Shengjing Zuoling, and Shengjing General Yamen. The total number of communicated documents, number of documents sent, and number of documents received by the Shengjing Internal Affairs Office all rank first; this indicates that the Shengjing Internal Affairs Office was the most important department of the ruling system in the Shengjing area during the Qing Dynasty.

Figure 5. Number of documents received and sent by institutions (panels A, B, and C).

By using the number of documents received and sent by the institutions, we calculated the Pearson correlation coefficient to determine whether the numbers of documents received and sent by the same institution are correlated. As institutional samples, we selected the Shengjing Internal Affairs Office, Shengjing Ministry of Revenue, (Beijing) Internal Affairs Office in Charge, Shengjing Zuoling, Shengjing Ministry of Works, Shengjing Ministry of Justice, Shengjing General Yamen, Shengjing Close Defense Zuoling, Shengjing Ministry of War, Fengtian General Yamen, and Shengjing Ministry of Rites. The resulting Pearson correlation coefficient is 0.69 (rounded to two decimal places), so there is a correlation between the number of sent and received documents, as shown in figure 6.

Figure 6. Scatter plot of Pearson correlation coefficient.

The Hetu Dangse is a copy of official documents dealing with the royal affairs of the Shengjing Internal Affairs Office during the Qing Dynasty. It contains the official documents between the Shengjing Internal Affairs Office and the Beijing Internal Affairs Office in Charge, the Liubu, etc.
and the local Shengjing General Yamen, Fengtian Office, the Wubu of Shengjing, and other yamens.16 Thus, there exists a large stock of documents with the Shengjing Internal Affairs Office as the sending or receiving agency. The Wubu of Shengjing, Shengjing General Yamen, Shengjing Zuoling, and other institutions were important hubs for the operation of institutions in Shengjing. They played an important role in maintaining and stabilizing the society of Shengjing. In number of documents, they rank second only to the Shengjing Internal Affairs Office.

Analysis of the Frequency of Documents Received and Sent by Institutions

To further explore the functions of institutions with research value, we extracted the contents of the catalogs from the top three institutions in total number of documents sent and received: Shengjing Internal Affairs Office, Shengjing Ministry of Revenue, and Shengjing Zuoling. We then divided the catalogs of these institutions into received and sent documents. Subsequently, we used word segmentation and word frequency statistics to process the two types of catalog information and drew comparison diagrams to explore their specific functions in the Hetu Dangse. As shown in figure 7, we can roughly divide the segmented words into two categories. One consists of the names of the institutions with which official documents were exchanged, such as the Ministry of Revenue, the Ministry of Justice, and the Ministry of Rites on one side of the word frequency comparison (see fig. 7[A]). The other consists of terms for the content of the official documents, such as Zhuangtou (庄头), Dimu (地亩), and Zhuangding (壮丁) on the side showing word frequencies in the documents sent.
Through a comparative analysis of the top 10 words in the documents received and sent by the same institution, we conclude that the institutions most closely associated with an institution’s received documents are not the same as those associated with its sent documents. For example, the Ministry of Revenue of Shengjing Internal Affairs Office ranks first in the frequency of documents sent by institutions, while the Shengjing Zuoling ranks first for receiving institutions (see fig. 7[B]). The contents of documents sent and received by the same institution also differ. Figure 7(C) shows how the affairs sent by Shengjing Zuoling to Ula (乌拉), Forage (粮草), and License (执照) differ from those represented by the Zhuangtou (庄头), Accounting (会计), and Close Defense (关防) in the frequency of documents sent and frequency of receipts, respectively. Based on previous research on the functions of Shengjing’s institutions, the Shengjing Internal Affairs Office was set up in the companion capital of Shengjing during the Qing Dynasty to be in charge of Shengjing cemetery, sacrifice, organization of staff transfer, and other matters.20 This relates to the meaning of words such as sacrifice (祭祀) in figure 7(A). The functions of the Shengjing Ministry of Revenue were recorded in Guangxu’s Great Qing Huidian. The cashiers in charge of taxation in Shengjing, number of annual losses in official villages, and Banner Land were carefully recorded. The expenditures were distinguished and the accounting obeyed the regulations according to the Beijing Ministry of Revenue at the end of the year.21 This is related to the meaning of words such as Dimu (地亩), land sale (卖地), and money and grain (钱粮) in figure 7(B).
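The received-versus-sent word frequency comparison described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' pipeline: it assumes the catalog titles have already been segmented into words (the segmentation tool is not specified here), and the sample titles below are invented, reusing a few terms that appear in figure 7.

```python
from collections import Counter

# Hypothetical, already-segmented catalog titles for a single institution,
# split by whether the institution received or sent the document.
received_titles = [
    ["庄头", "地亩", "户部"],
    ["户部", "钱粮"],
    ["庄头", "户部"],
]
sent_titles = [
    ["庄头", "壮丁"],
    ["地亩", "庄头"],
    ["庄头", "壮丁"],
]


def top_words(titles, n=10):
    """Return the n most frequent words across a list of segmented titles."""
    counts = Counter(word for title in titles for word in title)
    return counts.most_common(n)


# Compare the leading words on each side, as in a figure 7 style diagram.
received_top = top_words(received_titles, n=2)
sent_top = top_words(sent_titles, n=2)

# Words that rank highly on both sides point to shared subject matter;
# words that rank highly on only one side point to direction-specific business.
shared = {word for word, _ in received_top} & {word for word, _ in sent_top}

print("received:", received_top)
print("sent:", sent_top)
print("shared:", shared)
```

Counting the two directions separately is what allows the comparison: a single pooled count would hide the fact that an institution's incoming and outgoing documents concern different matters.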
In Fu Yonggong and Guan Jialu’s research on Shengjing Zuoling’s functions, Shengjing Zuoling handled the transfer of communicated documents; supervised and urged the various departments of Guangchu, Duyu, Zhangyi, Accounting, Construction, and Qingfeng to undertake matters; managed officials and various people; maintained the Shengjing palace and the warehouse; selected women to send to Beijing for inspection; heard all types of cases; undertook the emperor’s general letter; managed the Ula people and tributes; and accepted the emperor or the Internal Affairs Office in Charge, among other tasks.22 This is connected to the meaning of words such as Ula (乌拉), Close Defense (关防), and License (执照) in figure 7(C).

Figure 7. Word frequency comparison of documents received (in blue) and sent (in orange) by institutions (panels A, B, and C).

Institutional Relationship Analysis

To further study the governance structure of the Shengjing area, we need not only to understand the functions of each institution but also to explore the overlap between the functions of institutions. The catalog data of the Hetu Dangse consist of three parts: receiving institution, issuing institution, and record title of the catalog. A document often involves two institutions, the receiving institution and the issuing institution, and the content of a document necessarily relates closely to the functions of the two institutions.
By visualizing how closely the institutions were connected, we conducted a quantitative analysis of the catalog records that share the same receiving and issuing institutions in the Hetu Dangse to provide reliable data for further research on the intersection of institutional functions in the Shengjing area.

Results of Institutional Connection Analysis

Using the co-word clustering algorithm, we counted the number of archive catalog records for each pair of receiving and issuing institutions. We set the vertical axis as the issuing institution and the horizontal axis as the receiving institution to obtain figure 8. The number inside each box is the quantity of catalog records for that pair of issuing and receiving institutions. To facilitate measurement in the statistical process, pairs with 50 or fewer communicated documents between the receiving institution and the issuing institution have been set to zero. As shown in figure 8, the institutions having close relations with the documents recorded in the Hetu Dangse are concentrated in the issuing institutions Shengjing Zuoling and Shengjing Internal Affairs Office, and the receiving institutions Shengjing Internal Affairs Office and Shengjing Zuoling. Among the receiving institutions, the number of documents received by the Shengjing Internal Affairs Office from Shengjing General Yamen reached as high as 11,936. The top three sources of documents received by Shengjing Zuoling were Fengtian General Yamen (2,265 pieces), Shengjing Ministry of Revenue (1,527 pieces), and Shengjing Ministry of Justice (1,520 pieces). It is worth noting that there are fewer than 50 documents from Shengjing Zuoling in the Shengjing Internal Affairs Office. The overlapping functions of the institutions in the Shengjing area enabled individual offices to play bureaucratic games, passing responsibility to other offices, leading to low efficiency in handling affairs.
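The counting step behind a matrix like figure 8, including the rule that pairs with 50 or fewer documents are set to zero, can be sketched as follows. This is a hedged illustration only: it covers the pair counting and thresholding described above, not any clustering, and the function name `pair_matrix`, the institution labels, and the small demonstration threshold are assumptions made for the example.

```python
from collections import Counter

THRESHOLD = 50  # pairs with this many documents or fewer are zeroed, per the text


def pair_matrix(records, threshold=THRESHOLD):
    """Count documents per (issuer, receiver) pair and zero out small counts.

    records: iterable of (issuing institution, receiving institution) tuples.
    Returns a nested dict: matrix[issuer][receiver] -> document count.
    """
    records = list(records)
    counts = Counter(records)
    issuers = sorted({issuer for issuer, _ in records})
    receivers = sorted({receiver for _, receiver in records})
    return {
        issuer: {
            receiver: counts[(issuer, receiver)]
            if counts[(issuer, receiver)] > threshold
            else 0
            for receiver in receivers
        }
        for issuer in issuers
    }


# Tiny invented example; a threshold of 1 stands in for the paper's 50.
demo_records = (
    [("Office A", "Office B")] * 3
    + [("Office B", "Office A")] * 1
    + [("Office A", "Office C")] * 2
)
matrix = pair_matrix(demo_records, threshold=1)

for issuer, row in matrix.items():
    print(issuer, row)
```

Rows are issuers and columns are receivers, matching the axes described for figure 8; zeroing weak pairs keeps occasional correspondence from cluttering the picture of the main channels.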
For example, the military and political power in the Shengjing area was jointly controlled by the Shengjing General Office and the Shengjing Ministry of War. The Shengjing area’s tax power was controlled by the Shengjing Ministry of Revenue and Fengtian Office and their subordinate offices. This phenomenon ran through the entire Qing Dynasty. Research on the cross-functionality of institutions has always been a hot topic in Qing historiography. By analyzing the official documents exchanged between institutions, we can further explore the overlap of functions as well as the advantages and disadvantages of the Qing Dynasty Shengjing ruling system, study the history of Shengjing institutions in the Qing Dynasty more thoroughly, and provide a reference for the design of current institutions.

Figure 8. Relationship of communicated documents by the Hetu Dangse Institutions diagram.

Visualization of Institutional Network Map

We used the Hetu Dangse catalog as sample data and the co-word clustering algorithm to obtain the closeness of the relationships between institutions and the frequency with which each institution appears. We drew a visual network diagram using VOSviewer 1.6.15 to obtain figure 9. In figure 9, institutions are represented by default as a circle with their names. The size of the label and the circle of an institution are determined by the weight of the item. The higher the weight of an item, the larger the label and the circle of the item. For some items, labels may not be displayed to avoid overlapping labels. The color of an institution is determined by the cluster the institution belongs to, and lines between items represent links.
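To make the notions of item weight and link concrete, the sketch below derives two common node weights from directed pair counts: the number of distinct links an institution has and its total link strength. This is an assumption-laden illustration, not a description of what VOSviewer computed for figure 9; the text does not say which weight attribute was used, and the function name `link_summary` and the sample counts are hypothetical.

```python
from collections import defaultdict


def link_summary(pair_counts):
    """Collapse directed (issuer, receiver) counts into undirected links.

    pair_counts: dict mapping (issuer, receiver) -> number of documents.
    Returns (links, strength, degree) where
      links[frozenset({a, b})] is the combined document count between a and b,
      strength[a] is the sum of a's link counts (total link strength),
      degree[a] is the number of distinct institutions a is linked to.
    """
    links = defaultdict(int)
    for (issuer, receiver), n in pair_counts.items():
        if issuer == receiver or n == 0:
            continue  # ignore self-loops and empty pairs
        links[frozenset((issuer, receiver))] += n

    strength = defaultdict(int)
    degree = defaultdict(int)
    for pair, n in links.items():
        for node in pair:
            strength[node] += n
            degree[node] += 1
    return dict(links), dict(strength), dict(degree)


# Invented counts for illustration only.
pair_counts = {
    ("Office A", "Office B"): 3,
    ("Office B", "Office A"): 1,
    ("Office A", "Office C"): 2,
}
links, strength, degree = link_summary(pair_counts)

for node in sorted(strength, key=strength.get, reverse=True):
    print(node, "strength:", strength[node], "links:", degree[node])
```

Under either weight, an institution that exchanges many documents with many partners is drawn larger, which is why hubs stand out in a map of this kind.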
As shown in figure 9, the relationships between the institutions and departments in the Hetu Dangse form three core groups: the Shengjing Internal Affairs Office (in Charge), Shengjing Zuoling, and Beijing Internal Affairs Office in Charge. However, the relationships between the three groups are not similar; the distance between the group (Beijing) Internal Affairs Office in Charge and the two other groups is relatively large. The group at the core of Shengjing Internal Affairs Office and the group at the core of Shengjing Zuoling are closely connected to each other through the Wubu of Shengjing (Shengjing Ministry of Revenue, Shengjing Ministry of Rites, Shengjing Ministry of War, Shengjing Ministry of Justice, and Shengjing Ministry of Works). Further, there are two larger individuals: Fengtian General Yamen and Shengjing General Yamen. Fengtian General Yamen and Shengjing Zuoling are closely related to each other, and the relationship between Shengjing General Yamen and Shengjing Internal Affairs Office is relatively close.

Figure 9. Co-occurrence of Institutions network map.

The city of Shengjing was the companion capital of the Qing Dynasty. The Qing government implemented special governance measures in these areas that differed greatly from those of direct inland provinces.23 To ensure the stable rule of the Shengjing area, the Qing Dynasty performed the following tasks. First, the Qing Dynasty set up a general garrison as the highest military and political chief in the Shengjing area to be responsible for all military and political affairs within its jurisdiction. Second, they established the Fengtian Office, a capital of the same level as the Shuntian Office, to rule the common people of the Shengjing area.
The states and counties, as well as the Garrison Banner Officer, which was under the rule of general garrison, were local administrative institutions under the Fengtian Office. These institutions implemented the dual management rule of the Bannerman and Common people. Third, as the companion capital, the Shengjing area followed the Ming Dynasty companion capital system to set up the Wubu of Shengjing to maintain power. In addition, the Shengjing Internal Affairs Office, which was in charge of palace affairs, communicated with the Beijing Internal Affairs Office in Charge.

Results of Automatic Classification Analysis

Catalogs are important information resources in the field of historical archives. The classification of archival catalogs can not only link relevant information in archives or archive fonds, improve researchers’ utilization efficiency, and save time spent searching for required archives, but it can also present archives to readers in clusters. As the Hetu Dangse catalog is a series of historical documents stored for a long period of time, its original classification system does not suit existing archival management methods well. The Hetu Dangse has a total of 1,149 volumes and 127,000 pages. Each volume contains a different number of documents, and the ink characters on Chinese art paper are in Manchu and Chinese. Reading and categorizing the full text of the Hetu Dangse not only requires a lot of manpower, material, and financial resources but also places extremely high demands on the classification staff. They need to possess a good knowledge of Manchu, archival science, document taxonomy, and other related disciplines. Therefore, sorting and organizing the content of the Hetu Dangse by manual reading and comprehension alone is impractical.
To address this problem, we used the SVM model of machine learning to automatically classify and explore the catalog data of the Hetu Dangse. This model further demonstrates the relevance of the knowledge between documents in the Hetu Dangse and facilitates an in-depth analysis. We imported the vectorized labeled data set into the SVM model and selected the optimal parameter combination to run the model. To visualize the results, the 50-dimensional word vectors were reduced to 2-dimensional vectors using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. We used the SVM model to establish a hyperplane visualized in 2-dimensional form. Owing to the large number of categories, the legend in figure 10 shows the data distribution of only the six categories with the highest proportion. To test the classification effect of the SVM model, we used precision and recall as metrics and calculated the F1 score to validate the model. The results are presented in table 3. Based on the created SVM model, 95,680 catalog records of the Hetu Dangse were predicted and classified. The results are shown in figure 11. Although there exist certain deficiencies in accuracy and other aspects, the classification has a positive impact on the content research, management, utilization, and retrieval and discovery of the Hetu Dangse.

Table 3. SVM model validation

Parameters    Result
Precision     0.736
Recall        0.717
F1 Score      0.716

Figure 10. SVM decision region boundary.

Figure 11. Hetu Dangse catalog data prediction classification.

CONCLUSION

In this study, we used machine learning to analyze and visualize the catalog data of the Hetu Dangse, revealing the functional relationships of the Qing Dynasty Shengjing regional institutions recorded in this historical document and showing the communication relationships between institutions.
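For readers less familiar with the metrics in table 3, the sketch below computes per-class precision, recall, and F1 and then macro-averages them. It is a minimal illustration on invented labels, not the authors' evaluation code, and the averaging scheme is an assumption: the source does not state how the multi-class scores were aggregated. One observation motivates showing it: the harmonic mean of the reported precision (0.736) and recall (0.717) is about 0.726, slightly above the reported F1 of 0.716, which would be consistent with F1 being averaged over classes rather than derived from the averaged precision and recall.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for a multi-class prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Invented category labels for six catalog records.
y_true = ["land", "land", "land", "tax", "tax", "ritual"]
y_pred = ["land", "land", "tax", "tax", "ritual", "ritual"]

scores = per_class_scores(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(scores)
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print("macro precision:", round(macro_p, 3))
print("macro recall:", round(macro_r, 3))
print("macro F1:", round(macro_f1, 3))
print("harmonic mean of macro P and R:", round(harmonic, 3))
```

In this toy example the macro F1 and the harmonic mean of the macro precision and recall differ, which illustrates why an averaged F1 need not equal the value implied by the averaged precision and recall.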
Using the SVM model, we achieved automatic classification of the Hetu Dangse catalog from the category perspective. Owing to the massive archives of historical materials in ancient China, the fonts of many historical materials cannot be recognized by computers or humans. The digitization of catalogs has become a digital bridge between researchers and historical documents. This not only achieves a concise summary and refinement of these documents but also greatly improves researchers’ efficiency in using them. The SVM model can “learn” from the labeled sample data and realize automatic classification of large amounts of unlabeled catalog data. Through automatic classification of catalog data, historical data researchers and archive managers can use and manage a large number of historical documents and catalog data more effectively, greatly increasing their utilization. The co-occurrence algorithm can reveal the patterns inherent in the catalog data itself, discover the distance between catalog records, and form clusters, providing a clearer direction for researchers using historical documents. The algorithm also saves researchers the time otherwise spent searching for documents without a clear direction, making the presentation of historical documents’ content to readers clearer. This paper improves archivists’ awareness of archive data compilation and management. First, data are observed, topics are identified, and potential relationships between these are found and established to improve historical archives’ compilation. Second, the visual presentation method and carrier are chosen, and the established relationships are visualized in the web browser for users to access and utilize.
It can be said that scientometric research methods can promote the transformation of historical research and archives management and compilation research from traditional explanatory scholarship to truth-seeking scholarship. Currently, the application of machine learning technology has gradually extended from applied disciplines to traditional fields of literature, art, and sociology. However, there are still many opportunities in the field of historical research. This study used methods from the field of artificial intelligence to conduct text mining and visualize historical archive document catalog data, and it proposes a new digital and intelligent solution for researching Chinese historical documents. With the development of science and technology, research methods for historical documents are undergoing constant changes from the traditional manual subjective analysis of historical data to reliance on quantitative analysis represented by deep learning and data mining technology. It is an irreversible trend to research historical documents more comprehensively, accurately, and scientifically by means of artificial intelligence and other technologies on the scientific frontier. For future work, we plan to conduct research on the Qing Dynasty historical documents from a deeper semantic analysis level, construct a knowledge graph through the method of named entity recognition, and construct an ontological model transforming historical documents into a structured knowledge base to discover new knowledge from historical documents in an automated manner.

ACKNOWLEDGMENTS

Funding Statement

This work was supported by the General Program of the National Natural Science Foundation of China [grant number 72074060], the Research Foundation of the Ministry of Education of China [grant number 20JHQ012], and the National Social Science Fund of China [grant number 16BTQ089].
Data Accessibility

The data sets supporting this article have been uploaded as part of the Supplementary Material. https://drive.google.com/drive/folders/1bZs17otRUyvA_QKbShMF836yGDTi40y0?usp=sharing

Competing Interests

We have no competing interests.

ENDNOTES

1 Wang Tao, “Data Mining of German Historical Documents in the 18th Century, Taking Topic Models as Examples,” Xuehai 1, no. 20 (2017): 206–16, https://doi.org/10.16091/j.cnki.cn32-1308/c.2017.01.021.
2 Kaixu Zhang and Yunqing Xia, “CRF-based Approach to Sentence Segmentation and Punctuation for Ancient Chinese Prose,” Journal of Tsinghua University (Science and Technology) 10, no. 27 (2009): 39–49, https://doi.org/10.16511/j.cnki.qhdxxb.2009.10.027.
3 Michael Stauffer, Andreas Fischer, and Kaspar Riesen, “Keyword Spotting in Historical Handwritten Documents Based on Graph Matching,” Pattern Recognition 81 (2018): 240–53, https://doi.org/10.1016/j.patcog.2018.04.001; Wu Sihang et al., “Precise Detection of Chinese Characters in Historical Documents with Deep Reinforcement Learning,” Pattern Recognition 107 (2020): 107503, https://doi.org/10.1016/j.patcog.2020.107503.
4 Renata Solar and Dalibor Radovan, “Use of GIS for Presentation of the Map and Pictorial Collection of the National and University Library of Slovenia,” Information Technology and Libraries 24, no. 4 (2005): 196–200, https://doi.org/10.6017/ital.v24i4.3385.
5 Shaochun Dong et al., “Semantic Enhanced WebGIS Approach to Visualize Chinese Historical Natural Hazards,” Journal of Cultural Heritage 14, no.
3 (2013): 181–89, https://doi.org/10.1016/j.culher.2012.06.009; Jakub Kuna and Łukasz Kowalski, “Exploring a Non-existent City via Historical GIS System by the Example of the Jewish District ‘Podzamcze’ in Lublin (Poland),” Journal of Cultural Heritage 46 (2020): 328–34, https://doi.org/10.1016/j.culher.2020.07.010.
6 Aleksandrs Ivanovs and Aleksey Varfolomeyev, “Service-oriented Architecture of Intelligent Environment for Historical Records Studies,” Procedia Computer Science 104 (2017): 57–64, http://doi.org/10.1016/j.procs.2017.01.062; Guus Schreiber et al., “Semantic Annotation and Search of Cultural-heritage Collections: The MultimediaN E-Culture Demonstrator,” Journal of Web Semantics 6, no. 4 (2008): 243–49, https://doi.org/10.1016/j.websem.2008.08.001.
7 M Kim et al., “Inference on Historical Factions Based on Multi-layered Network of Historical Figures,” Expert Systems with Applications 161 (2020): 113703, http://doi.org/10.1016/j.eswa.2020.113703.
8 Hobson Lane, Cole Howard, and Hannes Hapke, Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python (New York: Manning Publications, 2019), 165.
9 Laurens Van der Maaten, Eric Postma, and Jaap van den Herik, “Dimensionality Reduction: A Comparative Review,” Tilburg University Technical Report, TiCC-TR 2009-005 (2009), https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf.
10 Gavin Hackeling, Mastering Machine Learning with Scikit-learn (Birmingham: Packt Publishing, 2017).
11 Richard Smiraglia, Domain Analysis for Knowledge Organization: Tools for Ontology Extraction (Oxford: Chandos Publishing, 2015).
12 Kuo-Chung Chu, Hsin-Ke Lu, and Wen-I Liu, “Identifying Emerging Relationship in Healthcare Domain Journals via Citation Network Analysis,” Information Technology and Libraries 37, no. 1 (2018): 39–51, https://doi.org/10.6017/ital.v37i1.9595.
13 Archives of Liaoning Province in China, “The Hetu Dangse Series Archives Publication,” Qing History Research 6, no. 2 (2009): 1.
14 Amit Kumar Sharma, Sandeep Chaurasia, and Devesh Kumar Srivastava, “Sentimental Short Sentences Classification by Using CNN Deep Learning Model with Fine Tuned Word2Vec,” Procedia Computer Science 167 (2020): 1139–47, https://doi.org/10.1016/j.procs.2020.03.416.
15 B Hongxi, “Research on the Sanling Management Institutions of the Qing Dynasty Outside the Pass,” Manchu Minority Research 4, no. 12 (1997): 38–56.
16 Guangli Zhu et al., “Building Multi-subtopic Bi-level Network for Micro-blog Hot Topic Based on Feature Co-occurrence and Semantic Community Division,” Journal of Network and Computer Applications 170 (2020): 102815, https://doi.org/10.1016/j.jnca.2020.102815.
17 S. Ravikumar, Ashutosh Agrahari, and S. N. Singh, “Mapping the Intellectual Structure of Scientometrics: A Co-word Analysis of the Journal Scientometrics (2005–2010),” Scientometrics 102 (2015): 929–55, https://doi.org/10.1007/s11192-014-1402-8.
18 Jiming Hu and Yin Zhang, “Research Patterns and Trends of Recommendation System in China Using Co-word Analysis,” Information Processing and Management 51, no. 4 (2015): 329–39, https://doi.org/10.1016/j.ipm.2015.02.002.
19 Nees Jan Van Eck and Ludo Waltman, “Software Survey: VOSviewer, a Computer Program for Bibliometric Mapping,” Scientometrics 84, no. 2 (2010): 523–38, https://doi.org/10.1007/s11192-009-0146-3.
20 Z Yanchang and L Xinzhu, “The Study of the Function of Shengjing Office from the Use of the Official Communication: An Academic Investigation Based on Hetu Dangse,” Shanxi Archives 8, no. 12 (2020): 179–88.
21 Shengjing Ministry of Revenue, Guangxu's Great Qing Huidian Volume 25 (Zhonghua Book Company, 1991), 211–12.
22 F Yonggong and G Jialu, “Brief Introduction of Shengjing Upper Three Banners Baoyi Zuoling,” Historical Archives 9, no. 30 (1992): 93–7.
23 Wangyue, “Research on the Yamens and Their Affair Relationships in Shengjing Area,” Shenyang Palace Museum Journal 1, no. 31 (2011): 67–77.