The Extraction of Comment Information and Sentiment Analysis in Chinese Reviews

Li Danyang
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China
e-mail: 821563942@qq.com

Fan Huimin
School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China
e-mail: 492896361@qq.com

Zhao Yingze
School of Marxism, Xi'an Jiaotong University, Xi'an, China
e-mail: yingze1013@163.com

Abstract—Sentiment analysis, also known as opinion mining, analyzes the content of a text to determine the emotional tendency expressed by its author. Text sentiment analysis mainly comprises sentiment classification, sentiment information extraction, and sentiment information retrieval and summarization. Based on conditional random fields (CRF), this paper extracts the pairs of theme words and sentiment words that appear in e-commerce reviews and judges the sentiment inclination of the extracted sentiment words. The experimental results show that CRF performs well on the extraction of sentiment information.

Keywords—CRF; theme word extraction; sentiment word extraction; sentiment analysis

I. INTRODUCTION

With the rapid development of Web 2.0 technology, network reviews on platforms such as micro-blogs, news sites and e-commerce sites have grown exponentially. E-commerce is a business activity based on information network technology and centered on commodity exchange. With the diversification of consumer information in the twenty-first century, the trading volume of e-commerce has increased rapidly; it has become an important part of the national economy and plays an extremely important role. On an e-commerce platform, comment information greatly affects consumers' purchase decisions [1]. Extracting the comment information from Chinese review text can both guide consumers toward rational consumption and help merchants improve product quality.

The comment information consists of the theme words and sentiment words that appear in a review. A theme word is the evaluation object of the comment, i.e., the object modified by a sentiment word in the sentence, and it usually denotes some attribute of the product. The extraction of comment information is one of the key tasks of text sentiment analysis. Existing extraction methods fall into two main categories: rule/template methods and statistical methods.

II. EXTRACTION METHODS FOR EVALUATION INFORMATION

The rule/template method relies mainly on the characteristics of the text itself, crafting rules or templates to identify evaluation objects in a specific domain. Liu Bing first posed the problem of evaluation object extraction; he took high-frequency nouns as evaluation objects and took the adjective nearest to each evaluation object as its sentiment word [2]. Based on the characteristics of the Chinese language, Qiu Yunfei and Chen Yifang proposed a method that extracts commodity evaluation objects using word features and syntactic analysis [3].
However, the rule/template method requires domain experts to define the evaluation objects and rules for each field, so it cannot keep up with emerging neologisms and does not transfer across domains. The most effective extraction methods are therefore statistical: a trained statistical model is used to extract the comment information. Niklas Jakob et al. proposed using a conditional random field model to extract evaluation objects, casting the extraction problem as a sequence labeling task [4]. Jin Lijun studied the automatic recognition of review usefulness based on SVM [5]. This paper studies the application of the CRF statistical model to the extraction of comment information.

III. COMMENT INFORMATION EXTRACTION BASED ON THE CRF STATISTICAL MODEL

A. Review information extraction process based on CRF

This paper uses the CRF model as the core model and combines it with a constructed sentiment dictionary to extract comment information. As shown in Figure 1, the process mainly includes building the sentiment lexicon, data preprocessing, part-of-speech tagging, training the CRF model, using the trained model to extract theme words and sentiment words, judging the sentiment inclination of the extracted sentiment words, and exporting the final results.

Figure 1. Review information extraction process based on CRF

B. Data preprocessing

Data preprocessing is an essential part of text data mining. In this paper it mainly includes the following steps (a code sketch of steps 1–3 follows Figure 2):

1) Building a sentiment word dictionary: This paper extends a sentiment dictionary for the corpus at hand. First, the new dictionary is applied to Chinese word segmentation, which makes segmentation more accurate and largely prevents theme words and sentiment words from being split apart during segmentation. Second, the new sentiment dictionary is used to judge the sentiment tendency of sentiment words.

2) Chinese word segmentation: Chinese word segmentation is the basis of text mining. Unlike English, Chinese text has no natural word boundaries, so Chinese segmentation is considerably more complicated than English segmentation. This paper uses the mature Jieba segmentation algorithm combined with the new sentiment dictionary, which achieves a good segmentation effect.

3) Removing stop words: Stop words are words that carry little or no meaning, such as auxiliary words, modal particles and punctuation marks. Removing them improves retrieval efficiency, saves storage space, and excludes interfering words.

4) Sequence labeling: To extract theme words and sentiment words more accurately, this paper divides the elements of the text into three categories: theme words are marked T (Theme), sentiment words are marked S (Sentiment), and all remaining words are marked O (Other).

The results of some of the data after the above preprocessing are shown in Figure 2.

Figure 2. An example of data preprocessing results
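As an illustration of steps 1–3, the following minimal Python sketch (an assumption for this exposition, not the authors' released code) loads the extended sentiment dictionary into Jieba, segments a review, and removes stop words. The file names user_dict.txt and stopwords.txt are hypothetical.

```python
import jieba

# Hypothetical resource files: the extended sentiment dictionary and a
# stop-word list (one word per line).
jieba.load_userdict("user_dict.txt")  # keep lexicon entries intact during segmentation

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f}

def preprocess(review: str) -> list:
    """Segment a review with Jieba and drop stop words."""
    return [w for w in jieba.lcut(review) if w.strip() and w not in stopwords]

tokens = preprocess("这款手机屏幕很清晰")  # "this phone's screen is very clear"
print(tokens)  # e.g. ['这款', '手机', '屏幕', '很', '清晰']
```

Each remaining token is then assigned one of the three labels T, S or O (step 4) to form the training sequences.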
C. Part-of-speech tagging

The CRF model transforms the information extraction problem into a sequence labeling problem. Therefore, to train the CRF model, the corpus must be given part-of-speech tags in addition to the three custom labels defined above.

Part of speech describes the function of a word in context, and part-of-speech (POS) tagging is the process of assigning the correct part of speech to every word in the segmentation result. Different languages use different POS tag sets; this paper adopts the annotation set of the parsing treebank. Part of the data after tagging is shown in Figure 3.

Figure 3. An example of part-of-speech tagging
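As a sketch of this step (an illustration only: the paper's tag set comes from the parsing treebank, whereas Jieba's posseg module uses its own tag names), POS tags can be produced as follows.

```python
import jieba.posseg as pseg

# Tag a segmented sentence; pseg.cut yields (word, flag) pairs, where
# flag is Jieba's POS code (n = noun, d = adverb, a = adjective, ...).
for word, flag in pseg.cut("屏幕很清晰"):  # "the screen is very clear"
    print(word, flag)  # expected roughly: 屏幕 n / 很 d / 清晰 a
```

Each output row (word, POS tag, T/S/O label) then becomes one line of the column-formatted training file that the CRF tool reads.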
D. Introduction to CRF

Conditional random fields (CRF or CRFs) [6] were first proposed by John Lafferty et al. in 2001. The CRF combines the strengths of the maximum entropy model and the hidden Markov model; it is a probabilistic undirected graph model, often used for sequence segmentation and tagging, in which the conditional probability of the output nodes is computed given the input nodes. The CRF model captures context information well [7] and accurately identifies key information. It has been widely applied in natural language processing and performs well on Chinese NLP tasks such as part-of-speech tagging, machine translation, prosodic structure prediction and speech recognition.

Lafferty et al. define a CRF formally as follows. Let $G=(V,E)$ be an undirected graph, where $V$ is the set of vertices and $E$ the set of edges. Let $Y=\{Y_v \mid v \in V\}$, so that each vertex $v$ of $G$ indexes a component $Y_v$ of the label sequence, represented by a random variable. Conditioned on the observation sequence $X$, the joint distribution associated with $G$ has the form $p(y_1, y_2, \ldots, y_n \mid X)$, where $y$ is a label sequence and $X$ is the observation sequence. If the random variables satisfy the Markov property with respect to $G$, that is,

$$p(Y_v \mid X, Y_u, u \neq v) = p(Y_v \mid X, Y_u, u \sim v) \qquad (1)$$

where $u \sim v$ indicates that $u$ and $v$ are adjacent in $G$, then $(X, Y)$ constitutes a conditional random field.

In theory, if graph $G$ represents the conditional dependencies among the labels to be modeled, its structure can be arbitrary. When modeling a sequence labeling task, however, the simplest and most common structure is a first-order chain over the elements of $Y$. This CRF is called a linear-chain CRF; its structure is shown in Figure 4, where $X=(x_1, x_2, \ldots, x_n)$ is the observation sequence and $y=(y_1, y_2, \ldots, y_n)$ is the output sequence.

Figure 4. Structural representation of the linear-chain CRF

Given the observation sequence, the conditional probability of the output sequence is

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,k} \mu_k s_k(y_i, x, i) \right) \qquad (2)$$

where $t_k(y_{i-1}, y_i, x, i)$ is a state transition feature function, $s_k(y_i, x, i)$ is a state feature function, $\lambda_k$ and $\mu_k$ are the weights of the feature functions, learned during training, and $Z(x)$ is a normalization factor:

$$Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,k} \mu_k s_k(y_i, x, i) \right) \qquad (3)$$

As one of the most important undirected graph structures, the linear-chain CRF has been applied in practical research, and most natural language processing tasks use linear-chain CRFs.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

The data set in this paper, provided by the 2017 Big Data & Computing Intelligence Contest, is divided into a training set and a test set. The CRF model is trained on the training set; theme words and sentiment words are then extracted from the test set, and the sentiment inclination of the extracted sentiment words is judged. F1 is used as the evaluation index of the model.

A. Experimental evaluation index

Three evaluation indexes are commonly used in data mining and natural language processing: precision, recall and F1. Precision concerns the prediction results: it indicates how many of the predicted positive samples are truly positive. Recall concerns the original samples: it indicates how many of the positive samples in the data set are predicted correctly. F1 is the harmonic mean of precision and recall, and this paper uses the F1 value as the evaluation standard. The calculation formulas are:

$$P = \frac{\text{number of correctly extracted theme/sentiment words}}{\text{number of extracted theme/sentiment words}} \qquad (4)$$

$$R = \frac{\text{number of correctly extracted theme/sentiment words}}{\text{total number of theme/sentiment words in the data set}} \qquad (5)$$

$$F1 = \frac{2 \times P \times R}{P + R} \qquad (6)$$

B. Training the CRF model

The CRF model is trained after the data set has been preprocessed by segmentation, labeling and so on. This paper uses the open-source tool CRF++ to train the model. A feature template must be prepared before training; the feature template file of this paper is shown in Figure 5.

Figure 5. The feature template

In each template line "T**:%x[#,#]", T denotes the template type, of which there are two. The first is the Unigram template, whose first character is U; it describes unigram features. The second is the Bigram template, whose first character is B. The two "#" values are the relative row offset and the column offset. Each "%x[#,#]" expression generates a CRF state function $f(s, o)$, where $s$ is the output label at time $t$ and $o$ is the context at time $t$.
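For concreteness, the following is an illustrative feature template in CRF++'s own syntax; the specific window offsets are an assumption for this sketch, not the paper's actual template (Figure 5). Column 0 holds the word and column 1 the part of speech.

```
# Unigram templates: %x[row,col] expands to the token at relative row
# offset `row` and column `col` of the training file.
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]
U06:%x[-1,0]/%x[0,0]

# Bigram template: also conditions on the previous output label.
B
```

Given such a template and a column-formatted training file, CRF++ is trained with `crf_learn template_file train_file model_file` and applied to new data with `crf_test -m model_file test_file`.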
The trained CRF model contains the feature template, the feature dimension, the number of data records, and the feature functions with their weights. A series of status messages is output during training; some of this information is shown in Figure 6. The parameters have the following meanings:

Iter: the number of iterations; when it reaches the maximum, iteration terminates.
Terr: the tag error rate.
Serr: the sentence error rate.
Obj: the current value of the objective; when it converges to a fixed value, training is complete.
Diff: the relative difference from the previous objective value; when it falls below eta, training is complete.

Figure 6. Output file

C. Experimental results and analysis

After the theme words and sentiment words have been extracted, the next step is to judge the sentiment inclination of the extracted sentiment words, and this step is much simpler. If a sentiment word belongs to the positive sentiment dictionary, its sentiment is positive; if it belongs to the negative sentiment dictionary, its sentiment is negative; otherwise it is neutral.
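A minimal sketch of this polarity judgment, assuming the positive and negative entries of the optimized dictionary are available as two Python sets (the names pos_words and neg_words and their toy contents are hypothetical):

```python
# Toy samples standing in for the optimized sentiment dictionary.
pos_words = {"清晰", "流畅", "好"}   # positive entries ("clear", "smooth", "good")
neg_words = {"模糊", "卡顿", "差"}   # negative entries ("blurry", "laggy", "bad")

def polarity(sentiment_word: str) -> str:
    """Label an extracted sentiment word as positive, negative or neutral."""
    if sentiment_word in pos_words:
        return "positive"
    if sentiment_word in neg_words:
        return "negative"
    return "neutral"

print(polarity("清晰"))  # -> positive
```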
The experimental results before and after the dictionary optimization are compared below: Figure 7 shows example results without the optimized sentiment dictionary, and Figure 8 shows example results after optimization.

Figure 7. Examples of unoptimized results

Figure 8. Examples of optimized results

Some comments contain no definite theme word but do contain a corresponding sentiment word; in that case the theme word is marked NULL. Comparison shows that after optimization the recognition of theme words is more accurate than before, and the precision is further improved. Table I compares the F1 values before and after optimization; the F1 value with the optimized dictionary is about 3% higher than before.

TABLE I. COMPARISON OF F1 BEFORE AND AFTER OPTIMIZATION

Data Set                                  F1
20,000 comments (before optimization)     0.58498
20,000 comments (after optimization)      0.61827

V. CONCLUSION

Accurate recognition of the theme words and sentiment words in a review is the key to comment information extraction and the basis for further analysis of the text [8]. The extraction of comment information from Chinese reviews is of great significance to both merchants and consumers: merchants can adjust their goods or improve product quality according to the comment information, and consumers can use it to support their purchase decisions. In this paper, the CRF statistical model is used to extract the theme words and sentiment words in review sentences, and the CRF model is shown to be effective at identifying them. In addition, this paper optimizes the sentiment dictionary, combining the data set with CNKI's sentiment lexicon to build a new sentiment dictionary better suited to this task. The experimental results show that the optimized dictionary makes the recognition of theme words and sentiment words more accurate and further improves the F1 value.

Besides the CRF model used in this paper, other methods can also extract comment information, such as the LDA topic model and dependency parsing. This paper also has shortcomings: Chinese expression is much richer than English, and for comments containing irony or Internet slang the theme words and sentiment words cannot yet be identified accurately, so further study of Chinese semantics is needed.

REFERENCES

[1] Li Piji, Ma Jun, Zhang Dongmei, et al., "Label extraction and sorting in user reviews," Journal of Chinese Information Processing, vol. 26, no. 5, pp. 14-19, 45, 2012.
[2] Hu M. Q. and Liu B., "Mining and summarizing customer reviews," in Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 2004, pp. 168-177.
[3] Qiu Yunfei, Chen Yifang, Wang Wei, et al., "Product evaluation object extraction based on word character and syntactic analysis," Computer Engineering, vol. 42, no. 7, pp. 173-180, 2016.
[4] Jakob N. and Gurevych I., "Extracting opinion targets in a single- and cross-domain setting with conditional random fields," in Proc. of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, Massachusetts, 2010.
[5] Jin Lijun, "Research on the automatic recognition of the usefulness of SVM-based search commodity reviews," dissertation, Harbin Institute of Technology, 2013.
[6] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 2001, pp. 282-289.
[7] Wang Rongyang, Ju Jiupeng, Li Shoushan, et al., "Research on feature extraction of evaluation objects based on CRFs," Journal of Chinese Information Processing, vol. 26, no. 2, pp. 56-61, 2012.
[8] Xia Yuan and Zhang Zheng, "Evaluation object extraction based on CRF," Computer Systems and Applications, vol. 26, no. 11, pp. 254-259, 2017.