key: cord-0047092-uodby95t authors: Chen, Tianlei; Miao, Duoqian; Zhang, Yuebing title: A Graph-Based Keyphrase Extraction Model with Three-Way Decision date: 2020-06-10 journal: Rough Sets DOI: 10.1007/978-3-030-52705-1_8 sha: b5e7b4eee82a6689b9cbaf013c2a4fcd8150a6bf doc_id: 47092 cord_uid: uodby95t Keyphrase extraction has been a popular research topic in the field of natural language processing in recent years. But how to extract keyphrases precisely and effectively is still a challenge. The mainstream methods are supervised learning methods and graph-based methods. Generally, the effects of supervised methods are better than unsupervised methods. However, there are many problems in supervised methods such as the difficulty in obtaining training data, the cost of labeling and the limitation of the classification function trained by training data. In recent years, the development of the graph-based method has made great progress and its performance of extraction is getting closer and closer to the supervised method, so the graph-based method of keyphrase extraction has got a wide concern from researchers. In this paper, we propose a new model that applies the three-way decision theory to graph-based keyphrase extraction model. In our model, we propose algorithms dividing the set of candidate phrases into the positive domain, the boundary domain and the negative domain depending on graph-based attributes, and combining candidate phrases in the positive domain and the boundary domain qualified by graph-based attributes and non- graph-based attributes to get keyphrases. Experimental results show that our model can effectively improve the extraction precision compared with baseline methods. Keyphrase extraction has been a popular research topic in the natural language processing research field. Especially with the current increasing requirements for applications of texts, keyphrase extraction has attracted widespread attention from researchers. Although it has been greatly developed in recent years at home and abroad, the extracted results are far from the ideal. With the rapid growth of text applications, the analysis of text data has become an important research area that has attracted much attention. Among them, how to extract keyphrases that reflect the subjects of texts has always been a research hotspot in the field of natural language processing, and its research results can be widely used in text retrieval, text summarization, text classification and question answering systems. Especially with the rise of research on unstructured big data of texts in recent years, the issue of keyphrase extraction has received in-depth research, and many researches have appeared in the international top conferences of artificial intelligence and natural language processing, such as the International Joint Conference on Artificial Intelligence (IJCAI) [1] , The Annual Meeting of the Association for the Advance of Artificial Intelligence (AAAI) [2] [3] [4] , International Computational Linguistics Association The Annual Meeting of the Association for Computational Linguistics (ACL) [5] , The International Conference of World Wide Web (WWW) [6] and Conference on Empirical Methods in Natural Language Processing (EMNLP) [7] , etc. Researchers generally believe that the extracted keyphrases [8] should meet the following basic standards: (i) Keyphrases should be meaningful phrases. For example, "keyphrase extraction" is a meaningful phrase, but "and" does not meet the standard. (ii) Keyphrase extraction should meet the relevance standard that keyphrases must be closely related to the subjects of texts, which is the most essential requirement for keyphrase extraction. For example, the subtitle "Introduction" in this paper is not an appropriate keyphrase obviously. (iii) Keyphrase extraction should correspond to the coverage standard. Keyphrases should be able to cover various topics of the text and the main aspects of each topic, not just focus on only one topic and ignore others. (iv) Keyphrases extraction should meet the coherence standard. Several keyphrases of the text should be semantically and logically related. For an instance, a piece of academic paper that mainly introduces a graph-based keyphrase extraction model. The set of keyphrases is {"keyphrase extraction", "graph-based"}, which is more suitable than {"keyphrase extraction", "target detection"}. (v) Keyphrase extraction should correspond the conciseness standard. The number of keyphrases is limited, and the set of keyphrases should not contain any redundant phrase. To meet any of the above standard, there is a huge challenge. Although there are many methods to solve this scientific problem such as statistical-based methods, supervised learning methods and graph-based methods, how to extract keyphrases precisely and efficiently is still a challenge. In this paper, we propose a new model that applies the three-way decision theory to the graph-based keyphrase extraction model. In our model, we propose algorithms dividing the set of candidate phrases into the positive domain, the boundary domain and the negative domain depending on graph-based attributes, and combining candidate phrases in the positive domain and the boundary domain qualified by graph-based attributes and non-graph-based attributes to get keyphrases. Experimental results show that our model can effectively improve the extraction precision compared with baseline methods. In Sect. 2, we briefly introduce the three-way decision theory and some related works in the field of keyphrase extraction. In Sect. 3, we describe the structure of our model and algorithms we proposed. In Sect. 4, we report the experimental results and analysis. Finally, we make a conclusion in Sect. 5. Using statistical-based methods to extract keyphrases of texts is relatively simple, because it requires neither training data nor external knowledge. After the preprocessing of texts, simple statistical rules can be used to form a set of candidate phrases. The estimation of candidate phrases usually uses quantification of feature values. The main statistical-based keyphrase extraction method is TF-IDF (Term Frequency-Inverse Document Frequency) [9] and its improved methods. The advantage of the TF-IDF algorithm is that it is simple and fast. However, the traditional TF-IDF algorithm also has obvious shortcomings that it is not comprehensive enough to measure the importance of phrases based on the frequency. Sometimes important phrases may not appear frequently. The graph-based keyphrase extraction method is the most effective and widely studied unsupervised keyphrase extraction method, because the method considers the cooccurrence relationship between phrases in the text. If there is a co-occurrence relationship between two phrases, it indicates that they are semantically related in the text. On the other hand, the graph-based method can incorporate more other features, so it has reached better effect of Extraction. The graph-based method has been widely concerned by researchers, from the TextRank method proposed by Mihalcea [10] to the PositionRank method proposed by Florescu [4] . In this paper, we propose a new model that applies the three-way decision theory to graph-based keyphrase extraction method. As generally considered, there are only acceptance and rejection in making a decision, which is a two-branch decision model, but it is often not the case in practical application. Based on the rough set theory proposed by Pawlak [11] , Yao's three-way decision theory [12] provided a third alternative. The idea of three-way decision is based on three categories: acceptance, rejection and non-commitment. The goal is to divide a domain into three disjoint parts. Positive rules acquired from positive domain are used to accept something, negative rules acquired from negative domain are used to deny something, and rules that fall on boundary domain need further observation, which called delayed decision-making. Miao [13] has made some researches about three-way decision theory with multi-granularity, and Zhang [14] has applied it to the application of sentiment classification. The way of three-way decision describes the thinking mode of human beings in solving practical decision-making problems. We propose a graph-based keyphrase extraction model with three-way decision. As Fig. 1 illustrated, we could obtain candidate phrases through the preprocessing of texts from the raw, and then transform texts to text graphs with candidate phrases as nodes to get their graph-based attributes and non-graph-based attributes. With the support of the three-way decision theory, we divide the set of candidate phrases into the positive domain, the boundary domain and the negative domain depending on their graph-based attributes, and combine candidate phrases in the positive domain and the boundary domain qualified by their graph-based attributes and non-graph-based attributes to get keyphrases. The step of preprocessing of texts from the raw plays an important role in the process of extracting keyphrases due to its output affecting the result deeply. The generic preprocessing way of graph-based keyphrases extraction: (i) Tokenizing: The process of tokenizing is to split strings into phrases. (ii) Tagging [15] : The task of tagging is to tag part-of-speech of phrases preparing for filtering. (iii) Filtering: Filter out phrases that do not meet the part-of-speech requirements according to the result of tagging. (iv) Stemming [16, 17] : Stemming phrases is in order to eliminate the effects of phrases forms that can get the main part of phrases. The differences between the phrases before stemming and after stemming are as follows (Table 1) : After preprocessing the raw, we construct the text graph to obtain graph-based attributes of candidate phrases. The text graph G ¼ V; E; W ð Þ , V is the set of nodes representing candidate phrases, E is the set of edges and W is the set of corresponding and v j co-occurring in consecutive sentences, adopting the context-aware graph construction method from Duari [18] due to its simple construction method and well performance. The higher value of w ij is, the stronger relationships between v i and v j are. In our opinion, the three-way decision is making a delayed decision on uncertainty, and decides based on other information in the future. In this paper, we propose two Algorithms, which applies the three-way decision theory to the graph-based keyphrase extraction model (see Algorithm 1 and Algorithm 2). The main notations in this paper are listed in Table 2 . The positive domain b The boundary domain n The negative domain C i The set of candidate phrases G i The set of graph-based attributes NG i The set of non-graph-based attributes R The set of keyphrases extraction results Th The threshold of the three-way decision From Cohen's [19] Trusses theory, for a weighted, undirected and simple graph G ¼ V; E; W ð Þ , a k-truss subgraph of G is the maximal subgraph G k ¼ V k ; E K ; W k ð Þ , such that each edge e ij 2 E k belongs to at least k À 2 ð Þtriangles. The truss level of an edge e ij is k if it lies in k-truss but not in k þ 1 ð Þ-truss. Kaur [20] expanded the concept of truss to nodes and defined truss level k i of node v i as follows. where N i is the set of neighbours of node v i and l ij is the truss level of edge e ij . Based on the definition of the truss level of nodes, Duari [18] defined the semantic strength v i of node v i and the semantic connectivity SC i of node v i as follows. We take these attributes on the basis of the graph into account and define the graphbased attributes ga i of node v i as follows. In this paper, we propose Algorithm 1 to classify the candidate phrases by graphbased attributes and divide the set of candidate phrases C into the positive domain C P , boundary domain C b and negative domain C n respectively. Position information is an important factor in identifying keyphrases except for graph-based attributes. Florescu [4] proposed PositionRank and took the position of candidate phrases into account to identify keyphrases, we regard it as non-graph-based attributes nga i of node v i with the following definition. In this paper, we propose Algorithm 2 taking graph-based and non-graph-based attributions of the candidate phrases into account in the boundary domain. Generally, both of the candidate phrases in the positive domain and the boundary domain are considered as the output of the Algorithm 2, where Th is the threshold of the three-way decision and the value of k represents the count of keyphrases to extract. We evaluate the performance of the model with two widely used benchmark datasets, which are Hulth2003 and Krapivin2009. Hulth2003 is a dataset including about 2,000 abstracts of academic articles. Krapivin2009 consists of over 2,000 scientific papers from computer science domain published by ACM used for keyphrase extraction specially. We use the uncontrolled list of keyphrases of Hulth2003 and gold-standard keyphrases of Krapivin2009 for evaluation. We take Textrank [10] , DegExt [21] , kcore retention [22] and PositionRank [4] as baseline methods and evaluate our model against them. Duari [18] reported that values of k are 25 for Hulth2003 and 10 for Krapivin2009 that yield the highest F1-measure with all algorithms mentioned above, which correlate with the average number of labeled keyphrases in datasets, and we adopted the reported values of k and the results of baseline methods. In the experiment, we separate a part of data from data sets as validation sets to explore the most appropriate value of Th. The results show the value of Th is 0.1 for Hulth2003 and 0.4 for Krapivin2009 yields the best performance (see Table 3 and Table 4 ). To verify the value of k yields the highest F1-measure mentioned above, we compared the F1-measure where the value of k was 5, 10, 15, 20, 25 and 30. The result shows that the F1-measure reaches the best when the value of k is 20 or 25 for Hulth2003 and 10 for Krapivin2009 (see Fig. 2 ). We find that the result of recall increases and the result of precision decreases when the value of k increases, which meets the fact. The performance evaluation of keyphrase extraction can be divided into microstatistical evaluation and macro-statistical evaluation. The micro one calculates the performance for each text first and then takes the average value. In comparison, the macro one statistics the result of extraction first and then calculates the performance at one time. We compared our model with Textrank [10] , DegExt [21] , k-core retention [22] and PositionRank [4] under the macro-statistical evaluation, where the value of k was 25 for Hulth2003 and 10 for Krapivin2009. The result shows that our model gets the best performance where the F1-measure reaches 51.85 for Hulth2003 and 43.64 for Krapivin2009 (see Table 5 and Fig. 3) . In this paper, we propose a new model that applies the three-way decision theory to graph-based keyphrase extraction model. In our model, we propose algorithms dividing the set of candidate phrases into the positive domain, the boundary domain and the negative domain depending on graph-based attributes, and combining candidate phrases in the positive domain and the boundary domain qualified by graph-based attributes and non-graph-based attributes to extract keyphrases. Experimental results show that our model can effectively improve the extraction accuracy compared with baseline methods. In future work, we will do more experiments to prove the performance of keyphrase extraction. Integrating semantic relatedness and words' intrinsic features for keyword extraction Extracting keyphrases from research papers using citation networks Incorporating expert knowledge into keyphrase extraction A position-biased PageRank algorithm for keyphrase extraction Deep keyphrase generation Topical word importance for fast keyphrase extraction Supervised keyphrase extraction as positive unlabeled learning Comparison of automatic keyphrase extraction systems in scientific papers Term-Weighting approaches in automatic text retrieval TextRank: bringing order into texts Rough sets Three-way decisions with probabilistic rough sets Uncertainty Analysis in Granular Computing A cost-sensitive three-way combination technique for ensemble learning in sentiment classification Feature-rich part-of-speech tagging with a cyclic dependency network Development of a stemming algorithm NLTK: the natural language toolkit sCAKE: semantic connectivity aware keyword extraction Trusses: cohesive subgraphs for social network analysis Leveraging hierarchy and community structure for determining influencers in networks DegExt-a languageindependent graph-based keyphrase extractor Main core retention on graph-of-words for single-document keyword extraction Acknowledgments. Authors would like to thank the anonymous reviewer for their critical and constructive comments and suggestions. This work was supported by National Natural Science Foundation of China (Grant No. 61976158). It was also supported by National Natural Science Foundation of China (Grant No. 61673301).