key: cord-0921119-7ypg25l0 authors: Li, Ren-De; Ma, Hao-Tian; Wang, Zi-Yi; Guo, Qiang; Liu, Jian-Guo title: Entity Perception of Two-Step-Matching Framework for Public Opinions date: 2020-06-30 journal: nan DOI: 10.1016/j.jnlssr.2020.06.005 sha: 7dae0476e1854716e43aabcef4484117eecc911f doc_id: 921119 cord_uid: 7ypg25l0

Abstract Entity perception of ambiguous user comments is a critical problem in identifying the targets of a huge volume of public opinions. In this paper, a Two-Step-Matching method is proposed to identify the precise target entity when multiple entities are mentioned. Firstly, potential entities are extracted from public comments by a BiLSTM-CRF model and characteristic words by a TF-IDF model. Secondly, the first matching is performed between the potential entities and an official business directory using the Jaro-Winkler distance algorithm. Then, in order to find the precise target, an industry-characteristic dictionary is introduced into a second matching process, and the precise entity is identified according to the count of characteristic words matching the industry-characteristic dictionary. In addition, the associated rate (a global indicator) and the accuracy rate (a sample indicator) are defined to evaluate matching accuracy. The results for three data sets of public opinions about major public health events show that the highest associated rate and accuracy rate reach 0.93 and 0.95, on average 32% and 30% higher than when the first matching process is used alone. This framework provides a way to find the true target entity that users really intend to express in public opinions. (⋆ This work is partially supported by the National Natural Science Foundation of China, Grant Nos. 71901144, 71771152, 61773248.)

Public opinions usually contain critical information pointing to a certain entity that matters for social perception. For instance, the novel coronavirus of 2019 was initially linked to the Huanan Seafood Wholesale Market through public opinions Wu, Leung and Leung (2020). Especially in emergency safety management, there is a delay between the occurrence of an event and its perception by regulators and authorities. However, early signals, such as complaints or comments about the event, can be found on social media at an early stage. These signals may be ambiguous and hard to associate with real entities. Therefore, entity identification is one of the core natural language processing techniques for information retrieval in the safety domain; it was initially proposed by Lim, Srivastava, Prabhakar and Richardson (1996), who analyzed the use of extended keys, the union of keys in the relations to be matched, and their corresponding identity rules to determine the equivalence between tuples in relations that may not share any common keys. Hernández and Stolfo (1995, 1998) developed a system to solve the merge/purge problem, which processes the data multiple times with different keys so as to sort on a different key in each successive traversal. Arasu, Chaudhuri and Kaushik (2008) and Arasu and Kaushik (2009) proposed a procedural framework for record matching that takes user-defined string transformations as input. Fan, Wang and Wu (2013) and Fan, Li, Wang and Wu (2012) suggested matching graph patterns based on the concept of bounded simulation. Furthermore, matching methods increasingly take semantic information into account. Cohen (2000) extended a TF-IDF-based similarity measure in order to identify entities.
Augsten, Bohlen, Dyreson and Gamper (2008) and Schallehn, Sattler and Saake (2004) defined the pq-gram distance to match data items that represent the same real-world object. Rizzo and Troncy (2012) proposed a framework, NERD, which unifies entity extractors through natural language processing. Song, Kim, Lee, Heo and Kang (2015) produced PKDE4J, an extensible and flexible text mining system that enables entity extraction and relationship extraction. Gradually, content-based record matching methods have been integrated with rule-based methods. Recently, the disambiguation step of entity identification has attracted more attention. Pappu, Blanco, Mehdad, Stent and Thadani (2017) suggested a lightweight, multilingual named entity recognition (NER) and linking (NEL) system to detect entities in new languages with limited labelled resources. Nguyen, Theobald and Weikum (2017) offered J-REED, a joint approach for entity disambiguation and relation extraction based on probabilistic graphical models. Nguyen et al. (2017) illustrated TwitterNEED, a hybrid approach for named entity extraction and named entity disambiguation in tweets. Canale, Lisena and Troncy (2018) presented a novel ensemble method for named entity recognition and disambiguation based on neural networks. These studies distinguish taxonomies and perform disambiguation with different knowledge bases. However, matching the entity precisely remains a major challenge when public opinions contain multiple targets and the referred entity is ambiguous. For example, consider the following online user comment: "This Honeywell N95 mask bought from Taobao is very easy to use. It is very comfortable to wear after half a day of wearing. The air valve is not blocked in any way. Please see the attached photos, which were taken by my Huawei mobile phone. Must be given high praise!" In this comment, three entity names are mentioned around the topic of a mask. Accordingly, ambiguity arises in machine recognition because the content cannot be located to a specific target. As shown in Figure 1, existing technologies can effectively extract targets and match entities. The first matching process can realize the matching between user comments and real entities, but how to distinguish the precise one is the main concern of entity perception. In this work, a Two-Step-Matching (TSM) method is introduced into the framework. At the first stage, potential entities are extracted from user comments using the BiLSTM-CRF model, and the characteristic words in the comments are recorded. The first matching process pairs the potential entities with an official business directory by the Jaro-Winkler distance algorithm. Meanwhile, an industry-characteristic dictionary is constructed by combining the characteristic words in comments and the description words in the business directory. The next stage identifies the precise entity when the first matching process returns multiple targets: the industry-characteristic dictionary is introduced to count the characteristic words matching the user comment. As illustrated in Figure 1, if a comment contains four characteristic words about entity A while the other two entities match fewer characteristic words, the comment is taken to point mainly to entity A, which is the precisely perceived target. Finally, the associated rate (a global indicator) and the accuracy rate (a sample indicator) are defined to evaluate the effect of entity identification.
The results show that the TSM framework performs significantly better than using the first matching process alone.

To obtain the potential entities from users' comments, the text string components have to be prepared at the beginning. The research of Dong, Zhang, Zong, Hattori and Di (2016) indicated that the character-based bidirectional LSTM-CRF (BiLSTM-CRF) model offers both effectiveness and efficiency, making it suitable for sentence-level Wang, Xuan, Xu, Ruifeng, He, Yulan, Chen and Tao (2017) or document-level text Ling, Yang, Pei, Yin, Lei, Lin and Jian (2017). The model treats a Chinese phrase or sentence $s$ as a sequence of characters $s = (c_1, c_2, \ldots, c_n)$, where $c_i$ denotes the $i$-th character of $s$. Each character is represented as a $d$-dimensional one-hot vector, and each character in the sentence carries a tag from the BIO encoding set {B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O}. These tags represent the common entity types person (PER), location (LOC), organization (ORG) and none of any type (O); B and I are short for the "Beginning" and "Inside" characters, respectively. The first layer of the BiLSTM-CRF model, the look-up layer, maps each character representation from a one-hot vector into a lower-dimensional dense vector that includes character embedding and word embedding; the character embeddings are initialized randomly and pre-trained word embeddings are employed. The inputs of the second layer, the BiLSTM layer, are these embeddings, e.g. $(x_1, x_2, x_3, x_4)$, and the outputs of the BiLSTM layer are the scores of each label. The BiLSTM layer processes the data in both directions with two separate hidden layers: it computes the forward hidden sequence $(\overrightarrow{h}_1, \overrightarrow{h}_2, \overrightarrow{h}_3, \overrightarrow{h}_4)$ and the backward hidden sequence $(\overleftarrow{h}_1, \overleftarrow{h}_2, \overleftarrow{h}_3, \overleftarrow{h}_4)$. The representation of a character produced by the BiLSTM layer is obtained by concatenating its forward and backward context representations, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, which effectively captures the character in its context and is helpful for the tagging in the next layer. The extracted sentence features are recorded as a matrix $P = (p_1, p_2, p_3, p_4)$, where each entry $P_{ij}$ is regarded as the score of classifying character $c_i$ to label $j$. These label scores predicted by the BiLSTM layer are the inputs of the third layer, the CRF layer, and the output of the CRF layer is the predicted label sequence with the highest prediction score for the characters of the sentence. Considering $P$ as the matrix of scores output by the BiLSTM layer, its size is $n \times k$, where $n$ is the number of characters of sentence $s$ and $k$ is the number of distinct tags; $P_{ij}$ corresponds to the score of the $j$-th label of the $i$-th character. For a sequence of predicted labels $y = (y_1, y_2, \ldots, y_n)$, where $n$ equals the length of sentence $s$, the score of the label sequence is

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},$$

where $A$ is a matrix of transition scores and $A_{ij}$ represents the transition score from the $i$-th label to the $j$-th label. With the aim of having a more robust transition score matrix, two extra labels, START and END, are added to the sentence, so the size of $A$ is $(k+2) \times (k+2)$. The score is thus made up of two parts: one is the LSTM output $P$, and the other is determined by the transition matrix $A$. A softmax yields a probability for the predicted label sequence $y$:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}},$$

where $Y_X$ is the set of all possible label sequences for the sentence. Then, the log-probability of the correct label sequence maximized during training is

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})},$$

and the final output is the predicted label sequence with the maximum score,

$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y}).$$
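To make the scoring and normalisation above concrete, the sketch below (not the authors' code; the tag set, the emission matrix $P$ and the transition matrix $A$ are random illustrative values) computes the sequence score $s(X, y)$ and the probability $p(y \mid X)$ by brute-force enumeration over all label sequences. Real implementations compute the partition function with the forward algorithm and decode with Viterbi, but on toy inputs the enumeration is enough to check the formulas.

```python
import itertools
import numpy as np

# BIO tag set as in the paper; START/END are the two extra labels used only in A.
TAGS = ["B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "O"]
START, END = len(TAGS), len(TAGS) + 1

rng = np.random.default_rng(0)
n, k = 4, len(TAGS)                       # 4 characters, k distinct tags (toy values)
P = rng.normal(size=(n, k))               # emission scores from the BiLSTM layer
A = rng.normal(size=(k + 2, k + 2))       # transition scores, size (k+2) x (k+2)

def sequence_score(P, A, y):
    """s(X, y): transition scores along START -> y_1 ... y_n -> END plus emissions."""
    path = [START] + list(y) + [END]
    transition = sum(A[path[i], path[i + 1]] for i in range(len(path) - 1))
    emission = sum(P[i, label] for i, label in enumerate(y))
    return transition + emission

def sequence_probability(P, A, y):
    """p(y | X): softmax of s(X, y) over every possible label sequence Y_X."""
    all_scores = [sequence_score(P, A, cand)
                  for cand in itertools.product(range(len(TAGS)), repeat=P.shape[0])]
    log_partition = np.log(np.sum(np.exp(all_scores)))
    return float(np.exp(sequence_score(P, A, y) - log_partition))

y = [TAGS.index(t) for t in ("B-ORG", "I-ORG", "O", "O")]
print(sequence_score(P, A, y), sequence_probability(P, A, y))
```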
The trained BiLSTM-CRF model is then used to extract the entities from the user comment texts, and they are recorded as the entity names. The other text ingredient taken from the user comments is the set of characteristic words. After Chinese text segmentation (with jieba), the TF-IDF algorithm Jones (1972) is used to extract the characteristic words. The word list of each comment is regarded as a text document, and the term frequency is expressed as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}},$$

where $n_{i,j}$ represents the number of occurrences of term $t_i$ in document $d_j$ and $\sum_{k} n_{k,j}$ represents the total number of words in document $d_j$. The inverse document frequency is

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|},$$

where $|D|$ is the total number of documents and $|\{j : t_i \in d_j\}|$ is the number of documents in which term $t_i$ appears. Then tf-idf is calculated as

$$tfidf_{i,j} = tf_{i,j} \times idf_i.$$

A high tf-idf weight is reached by a high term frequency together with a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The 20 words with the highest weights are chosen as the feature words of each document, and the characteristic keywords are thereby obtained. In the same way, the description words can be obtained by applying the TF-IDF algorithm to the description of the business scope of each entity in the business directory. Usually, the full names of Chinese entities contain four basic elements: administrative division, short name, industry and organization format. This work focuses on the short name and the industry to build the business directory. In Table 1, the third and fourth columns give examples, and the business scope is shown in the last column. String matching techniques can find a string A within a string B regardless of the order of characters, and can thereby index and retrieve information from a database Hakak, Kamsin, Shivakumara, Gilkar, Khan and Imran (2019). In this work, string matching happens between the entity names extracted from user comments and the short names in the business directory. This first matching process identifies the entity pairs, and the second matching process then compares the characteristic keywords with the industry-characteristic dictionary to pick out the target entity. The Jaro-Winkler distance Winkler (1990) is an algorithm for calculating the similarity between two strings. It was first proposed to determine whether two names on health records are the same, and it is well suited to short strings. Given two strings $s_1$ and $s_2$, the Jaro similarity is calculated as

$$sim_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),$$

where $sim_j$ is the Jaro similarity score, $|s_1|$ and $|s_2|$ are the lengths of the two strings, $m$ is the number of matching characters (a pair of identical characters counts as matching only if their positions differ by no more than $\lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1$), and $t$ is the number of transpositions, i.e. half the number of matching characters that appear in a different order in the two strings. The Jaro-Winkler algorithm gives a higher score to strings that share the same starting part, defined as a common prefix of length $\ell$. If the two strings share such a prefix, the Jaro-Winkler distance is

$$sim_w = sim_j + \ell\, p\, (1 - sim_j),$$

where $\ell$ is the length of the matched prefix, capped at 2 in this work, and $p$ is a range factor constant that adjusts the reward for prefix matching. The Jaro-Winkler distance lies between 0 and 1, where 0 represents no similarity between the two strings and 1 represents an exact match. In the first match, the Jaro-Winkler distance algorithm is applied between the entity names from the comments and the short names in the business directory. However, after this first match based on the Jaro-Winkler distance, multiple matching pairs may appear.
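Both text-processing primitives of this first step can be sketched directly from the formulas above. The following is an illustrative implementation, not the authors' code: the tf-idf function expects token lists (e.g. produced by jieba), the Jaro and Jaro-Winkler functions follow the definitions given here, and the directory entries, query names and scaling constant p = 0.1 are assumptions for demonstration only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf weights; `docs` is a list of token lists (e.g. jieba output).
    The top-20 terms of each returned dict would serve as the characteristic words."""
    df = Counter(term for doc in docs for term in set(doc))     # document frequency
    weights = []
    for doc in docs:
        tf, total = Counter(doc), len(doc)
        weights.append({term: (cnt / total) * math.log(len(docs) / df[term])
                        for term, cnt in tf.items()})
    return weights

def jaro(s1, s2):
    """Jaro similarity: characters matched within the window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    flags1, flags2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):                                 # count matches
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not flags2[j] and s2[j] == ch:
                flags1[i] = flags2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, j = 0, 0
    for i in range(len(s1)):                                    # count transpositions
        if flags1[i]:
            while not flags2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0

def jaro_winkler(s1, s2, p=0.1, max_prefix=2):
    """Jaro similarity boosted by the common prefix (capped at `max_prefix`)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)

# First-match sketch: pair each potential entity name with the most similar
# short name in a hypothetical business directory.
directory = ["Honeywell", "Huawei", "Taobao"]
for name in ["Honeywel", "Huawei phone"]:
    best = max(directory, key=lambda short: jaro_winkler(name, short))
    print(name, "->", best, round(jaro_winkler(name, best), 3))
```

In the pipeline described above, only the 20 highest-weighted terms per comment are kept as characteristic words, and the same tf-idf step is applied to the business-scope descriptions to obtain the description words.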
Therefore, a second matching process is introduced, based on an industry-characteristic dictionary whose construction is shown in Figure 2.

Figure 2: The construction of the industry-characteristic dictionary. The first match happens between the potential names from the user comment texts and the short names from the business directory, shown in (a). Then, pairs are reorganized according to industry, shown in (b). For instance, two entities $c_1$ ($C_1$) and $c_2$ ($C_2$) belong to industry $I_1$, where the two entities are mentioned twice (through keyword $k_1$). In this illustration, the keywords that appear more than once are kept; in the empirical experiment, the keywords that appear more than 10 times in each industry are kept. Then, the descriptions of entities $C_1$ and $C_2$ are extracted as description words $D_1$ and $D_2$ for the second matching process. The industry-characteristic dictionary therefore contains both the characteristic keywords and the industry description words.

As shown in Figure 2(a), the first string match happens between the keywords $k_1, k_2, k_3, \ldots$ and the names $c_1, c_2, c_3, \ldots$ from the user comments on one side and the short names $C_1, C_2, C_3, \ldots$ from the business directory on the other side; these short names correspond to the industries $I_1, I_2, I_3, \ldots$. The industry-characteristic dictionary consists of the reorganized business directory and the entity names of the first match (the names $c$ and short names $C$). The corresponding characteristic keywords from the user comment texts and the description words $D$ are then added to the industry dictionary, as shown in Figure 2(b). The second matching process relies on this industry-characteristic dictionary: using the Jaro-Winkler distance, each keyword (in the test set) is compared with the characteristic keywords and description words (from the training set) in the industry-characteristic dictionary. The unique entity-industry pair between the characteristic keywords and the industry-characteristic dictionary can then be found (Figure 3), which is the identified target entity. The second matching process thus resolves the ambiguity problem. In Figure 3, after the first matching, the three potential entities of comment 1 correspond to three names in the business directory. Which entity is the user most concerned with? The answer offered in this work is to examine the matching of characteristic keywords: for example, characteristic keyword $k_1$ appears both in the user comment and in the industry-characteristic dictionary, so $C_1$ is identified as the target entity of comment 1. For comment 2, one name corresponds to multiple entities (three) in the business directory. According to the second matching of characteristic keywords appearing both in the comment and in the industry-characteristic dictionary, industry $I_2$ appears twice and industry $I_1$ once, so the name $C_3$ corresponding to industry $I_2$ is considered the target entity. Besides, two further functions are applied in the TSM process. One removes comments irrelevant to the entity: as shown in Figure 3, the characteristic word $k_1$ relates to name $c_1$ corresponding to industry $I_1$, whereas the irrelevant keyword $k_3$ is removed during the process. The other excludes redundant matches: as shown in Figure 3, one name corresponds to two industries, but the second matching result ignores the less relevant industry $I_1$, which contains only one characteristic keyword $k_1$. In order to evaluate the results of the TSM entities, two indicators are defined: 1) association rate: the number of records matched by the TSM framework divided by the number of comment texts in the test set.
This indicator is a global indicator, which does not involve artificial verification. 2) accuracy rate: the number of correct matching records divided by the number of artificial matching records. Here $|c|$ is the total number of potential names, $N$ is the number of comments in the test set, $\Gamma_{match}$ is the matching set of the second matching method, and $\Gamma_{art}$ refers to the matching set of the artificial labelling method. In the test sets, when one comment includes multiple entities, the first match (taken as the benchmark) considers the first mentioned entity as the correct matching record in $\Gamma_{match}$; since the second matching method can identify the right one among multiple entities, its $\Gamma_{match}$ record for a comment is unique.

Figure 3: (Colour online) The illustration of two examples of the TSM process identifying one target entity. The first match happens between the potential names $c$ and the short names $C$ to find the entities mentioned by users. The second matching process involves the user comment keywords and the industry-characteristic dictionary. For comment 1, the target entity is $C_1$, since only keyword $k_1$ from the user comment matches characteristic keyword $k_1$ in the industry-characteristic dictionary. For comment 2, although the name could catch three entities uncertainly, two keywords can match the industry-characteristic dictionary in the second matching process; they belong to industry $I_2$ and correspond to name $C_3$ according to the first match result, so the target entity for comment 2 is $C_3$.

This study is based on public opinion monitoring of major public health events, related to Shopping (consumption on e-commerce platforms), Tourism (reservation of hotels and air tickets) and Rent (rental of houses and cars). Therefore, three categories were selected for the sample pool: Shopping (454 comments), Tourism (285 comments) and Rent (391 comments), covering the period from 1 January to 1 February 2020. User comments are sentence-level texts. To construct the business directory and the industry-characteristic dictionary, entity registration details were acquired from the China State Administration for Market Regulation. These are used in the TSM process to identify the target entity as described below. At the first stage, the BiLSTM-CRF model is employed to obtain the entity names from the user comment texts. These names are then matched with the short names in the business directory, but multiple pairs may remain, and the subsequent matching process is the complement that targets the most accurate entity. The training set and the test set are divided 6:4; the training set is used to obtain the industry-characteristic dictionary. The details of the TSM process are demonstrated in Figure 4.

Figure 4: The flowchart of the TSM process, showing the first match between the potential names ($n_1, n_2, \ldots, n_m$) and the directory entries ($C_j + I_k$), the un-matched branch, and the second match (steps 1-14).

The input of this method is the user comments, and the output is the list of matched target entities. For each comment text, the comment set (names and keywords) is matched against the business directory (short names and industries) by the Jaro-Winkler distance algorithm (Figure 4 step 1). If only one name corresponds to an entry of the business directory, the match is complete and the record goes into the final matching list. If a name corresponds to multiple entity names (Figure 4 step 2), it is necessary to check whether each entity name appears in the comment. If one of the entity names of the business directory appears in the comment (Figure 4 step 3), the match is complete as well.
However, if multiple names in the comment correspond to entries of the business directory (Figure 4 step 4), the number of times each entity name appears in the comment needs to be counted (Figure 4 step 5). If one name occurs more often than the others (Figure 4 step 6), it is checked whether that entity name is in the potential name set (Figure 4 step 7); if it is, the match is completed, otherwise the record is unmatched. If several entity names have the same count, it is checked again whether each entity name appears in the characteristic word set; if only one entity name hits and the other entity names do not (Figure 4 step 8), the match is completed. In order to improve the efficiency of the second matching process, the industry characteristic word set is excluded from the entity set (Figure 4 step 9). After that, if only one name remains in the name set and it corresponds to multiple entity names (Figure 4 step 10), the procedure returns to step (3) and checks as in the first match. If there is a single mapping relationship between the name set and the business directory set (Figure 4 step 11), that is, multiple names correspond to the entities one by one, the second matching process is started. The second matching process requires the industry-characteristic dictionary. Firstly, the characteristic words of the user comments are matched again against the industry-characteristic dictionary (Figure 4 step 12). If the number of second-matching hits of one entity name is higher than those of the other names (Figure 4 step 13), this second matching process completes and the most frequently hit entity name is put into the final matching list. Since step (13) already implies a single mapping relationship, the repeated check of step (7) is not needed (Figure 4 step 14), and the second match is completed. Eventually, the second matching results form the matching list.

Table 2: An example comment and the entities extracted by the first match. Comment: "This Honeywell N95 mask bought from Taobao is very easy to use. It is very comfortable to wear after half a day of wearing. The air valve is not blocked in any way. Please see the attached photos, which were taken by my Huawei mobile phone. Must be given high praise!" Extracted entities: Honeywell, Taobao, Huawei.

Take Table 2 as an example. After the first match, three entities are identified, but the specific target still cannot be confirmed. The second matching process then identifies "Honeywell" by characteristic keywords, i.e. "N95 mask", "wearing" and "air valve" are industry characteristics. Although the words "photos" and "mobile phone" also represent characteristics of "Huawei", more characteristics of "Honeywell" occur, so the target entity can be identified.
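The counting logic of the second match (steps 12-13) can be summarised in a short sketch. The dictionary contents, candidate list and keyword list below are invented purely for illustration and are not the authors' data or code; the tie-breaking and filtering branches of Figure 4 are omitted.

```python
from collections import Counter

# Hypothetical industry-characteristic dictionary: industry -> characteristic
# keywords (from training comments) plus description words (from the directory).
industry_dict = {
    "protective equipment": {"N95 mask", "wearing", "air valve", "filter"},
    "e-commerce platform":  {"order", "delivery", "seller", "refund"},
    "mobile phone":         {"photos", "mobile phone", "screen", "camera"},
}

# Candidate entities surviving the first (Jaro-Winkler) match: name -> industry.
candidates = {"Honeywell": "protective equipment",
              "Taobao":    "e-commerce platform",
              "Huawei":    "mobile phone"}

def second_match(comment_keywords, candidates, industry_dict):
    """Count, per candidate, how many comment keywords fall into the characteristic
    set of its industry, and return the best-supported entity (None if no hits)."""
    hits = Counter({entity: sum(kw in industry_dict[industry] for kw in comment_keywords)
                    for entity, industry in candidates.items()})
    entity, count = hits.most_common(1)[0]
    return entity if count > 0 else None

keywords = ["N95 mask", "wearing", "air valve", "photos", "mobile phone"]
print(second_match(keywords, candidates, industry_dict))   # -> Honeywell (3 hits vs 2)
```

A tie in the counts would fall back to the additional checks of steps (6)-(10) described in the walkthrough above.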
In order to check whether the matched results point to the real entities precisely, 150 matching records were randomly selected from the three data sets, and the entities were manually labelled according to the business directory. The artificial matching results were then compared with the list produced by the second matching method. The results for the three kinds of text corpus are illustrated in Table 3. The first matching process uses BiLSTM-CRF before the Jaro-Winkler match. It is obvious from this table that the matching method including BiLSTM-CRF performs better than the CRF model and than Jaro-Winkler alone. Compared with the Jaro-Winkler method, BiLSTM-CRF matching can avoid errors caused by wrong segmentation cuts. For example, consider the comment "I bought a Jingdongfang screen". If only Jaro-Winkler matching is used, the word that is eventually cut out is "Jingdong", which is a wrong result. Conversely, the word that BiLSTM-CRF finally cuts out according to the longest matching field is "Jingdongfang". So the associated rate and accuracy rate obtained by the BiLSTM-CRF matching method are better than those of the Jaro-Winkler matching method. In the three categories, the associated rate has increased from 0.39, 0.43 and 0.35 to 0.55, 0.60 and 0.49, respectively; meanwhile, the accuracy rate has increased from 0.44, 0.45 and 0.41 to 0.66, 0.69 and 0.61. Because the BiLSTM-CRF model combines the advantages of the BiLSTM model and the CRF model, compared with the traditional CRF model the average values of the associated rate and accuracy rate of the BiLSTM-CRF model have increased from 0.46 and 0.57 to 0.55 and 0.65, respectively, so the results of the BiLSTM-CRF model are better than those of the CRF model. The BiLSTM model can automatically extract the features of the observation sequence according to the target, but its disadvantage is that it cannot learn the relationships between the state sequences. The advantage of CRF is that it can model the hidden states and learn the characteristics of the state sequence, but its disadvantage is that the sequence features need to be extracted manually. So the general approach is to add a CRF layer after the BiLSTM to get the best of both. The results of the first matching with BiLSTM-CRF are then used as the benchmark for comparison. Both the associated rate and the accuracy rate of the second matching framework proposed in this work are much better than the results of matching only once. In the Shopping and Rent categories, the highest rates reach 0.93 (an increase of 38%) and 0.95 (an increase of 29%), respectively. The increments indicate that the comments in these two categories include multiple-entity mapping relations, i.e., users' intent is mainly on one entity, but other irrelevant entities are also mentioned; the second matching process improves the matching results precisely because of the industry characteristic terms. Meanwhile, in the Rent category the rate has increased from 0.49 to 0.77 (a 28% increase), probably because of the wide coverage of the industries involved. The accuracy rates of the second matching results are all over 0.9 in the three categories (an average increase of 30%), indicating that the TSM framework is much closer to human recognition. The reason for the improvement in precision by the TSM method is the introduction of industry characteristics. The first matching process can only identify the obvious target, such as when a comment mentions only one entity, or the most frequently mentioned entity even when other irrelevant entities are brought up in passing. However, if multiple entities are mentioned at the same time, the first matching process can only guess the first-appearing entity as the target, which causes significant errors. The second matching process improves the accuracy because it automatically recognizes the entities with industry characteristics, which link the content of the comments to the character of the entities. Therefore, the improved accuracy rate indicates that the artificially labelled human recognition results are close to the model recognition results. The ambiguity of user comments may also take a different form: a user may mention entities B and C while the really intended target is A. The TSM method considers the descriptions of the entities, so that the recognized entity is consistent with human recognition.
In safety science, for example in public opinion monitoring and emergencies, entity identification plays a crucial role in early warning, diagnosis and intervention, which can greatly reduce the response time. This paper presents the Two-Step-Matching method designed to identify the precise target from ambiguous user comments in public opinions. Firstly, potential entities are extracted from user comments by the BiLSTM-CRF model and user characteristic keywords are extracted by the TF-IDF algorithm. Secondly, the potential entities are matched with the official business directory by the Jaro-Winkler distance algorithm to form the match pairs. Then, the industry-characteristic dictionary, which includes characteristic keywords and description words, is used to find the precise target. In the second matching process, the target company is identified as the one with the largest count of matches between the characteristic keywords mentioned in the comments and the industry-characteristic dictionary. Meanwhile, two evaluation indicators are defined in this work to verify the proposed method: 1) association rate: the number of records matched via the TSM matching method divided by the number of comment records; 2) accuracy rate: the number of accurately matched records divided by the number of artificially matched records. In the empirical experiments, three categories of user comments about major public health events, i.e., Shopping, Tourism and Rent, are verified. Compared with using the first matching process alone, the TSM method increases the associated rate by 38%, 30% and 28%, respectively, and the accuracy rate by 29%, 28% and 34%, respectively. On average, the association rate and accuracy rate for the Shopping, Tourism and Rent data sets are enhanced by 32% and 30%. For practical application, the proposed method can be conveniently employed in the perception of social media Derczynski, Yang and Jensen (2013). For example, in public opinion applications, entity identification techniques can be employed to draw attention to trends in public opinion, to mine opinions about competing entities in typical opinion-mining applications, such as users' attitudes towards competing products or brands, and to discover early-warning signals about a product's drawbacks in the details of comment texts in product quality management Zhang and Liu (2014); Zhong, Xing, Luo, Zhou, Li, Rose and Fang (2020). Therefore, it is significant for a system to automatically identify the true target entities that users really talk about from relevant comment texts. Nevertheless, several other issues should be investigated in future. For instance, the TSM method proposed in this paper can also be employed in knowledge graph research to identify named entities and to construct the relationships among entities, so comparing the results with other knowledge graph construction techniques is one of the future works. The similarities can also be calculated by mapping characters into a high-dimensional space with various graph embedding techniques, such as node embedding with node information and network information Hamilton, Ying and Leskovec (2017). In addition, the main operational limitation is that our method is suited to ambiguity among multiple targets appearing simultaneously. The ambiguity may also arise when multiple targets appear a different number of times: for example, entity A appears three times, entity B appears twice and entity C appears once, but the user is actually concerned about entity C.
In this case, our method cannot recognize the intended result, because entity A would be recognized as the final target. Therefore, semantic information from the context could be considered to explore more information from users, such as synonyms, hypernyms and hyponyms. A professional word dictionary may help to speed up the identification, and the training of the industry-characteristic dictionary could be iterated over the text corpus. Moreover, it is necessary to keep a balance among the complexity of the text corpus structure, the time consumed by the training process, and the accuracy of the results.

References
Transformation-based framework for record matching
A grammar-based entity representation framework for data cleaning
Approximate joins for data-centric XML
A novel ensemble method for named entity recognition and disambiguation based on neural network
Data integration using similarity joins and a word-based information representation language
Towards context-aware search and analysis on social media data
Character-based LSTM-CRF with radical-level features for Chinese named entity recognition
Query preserving graph compression
Incremental graph pattern matching
Exact string matching algorithms: survey, issues, and future research directions
Inductive representation learning on large graphs
The merge/purge problem for large databases
Real-world data is dirty: data cleansing and the merge/purge problem
A statistical interpretation of term specificity and its application in retrieval
Entity identification in database integration
An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition
J-REED: joint relation extraction and entity disambiguation
Lightweight multilingual entity extraction and linking
NERD: a framework for unifying named entity recognition and disambiguation extraction tools
Efficient similarity-based operations for data integration
PKDE4J: entity and relation extraction for public knowledge discovery
Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN
String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
Aspect and entity extraction for opinion mining, in: Data Mining and Knowledge Discovery for Big Data
Deep learning-based extraction of construction procedural constraints from construction regulations

Acknowledgement: We thank Qiang Yue for the preliminary experience of industry application.