UEF//eRepository DSpace https://erepo.uef.fi
Articles, Faculty of Science and Forestry
2017
Using linguistic features to automatically extract web page title
Gali N
info:eu-repo/semantics/article
info:eu-repo/semantics/acceptedVersion
© Elsevier Ltd
CC BY-NC-ND https://creativecommons.org/licenses/by-nc-nd/4.0/
https://erepo.uef.fi/handle/123456789/4243
Downloaded from University of Eastern Finland's eRepository

Accepted Manuscript

Using Linguistic Features to Automatically Extract Web Page Title

Najlah Gali, Radu Mariescu-Istodor, Pasi Fränti

PII: S0957-4174(17)30135-5
DOI: 10.1016/j.eswa.2017.02.045
Reference: ESWA 11153
To appear in: Expert Systems With Applications
Received date: 5 October 2016
Revised date: 27 February 2017
Accepted date: 28 February 2017

Please cite this article as: Najlah Gali, Radu Mariescu-Istodor, Pasi Fränti, Using Linguistic Features to Automatically Extract Web Page Title, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.02.045

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights
 Successful title extraction must analyze both the DOM nodes and the title tag.
 Natural language processing improves the quality of the title.
 Visual and formatting features are less relevant for the task.
 Simpler classifiers like k-NN perform as well as advanced classifiers like SVM.
 The proposed method significantly outperforms all existing ones by a clear margin.

Using Linguistic Features to Automatically Extract Web Page Title

Najlah Gali, Radu Mariescu-Istodor, Pasi Fränti
Machine Learning group, School of Computing, University of Eastern Finland, Joensuu FI-80101, Finland
{najlaa, radum, franti}@cs.uef.fi

Abstract

Existing methods for extracting titles from HTML web pages mostly rely on visual and structural features. However, this approach fails in the case of service-based web pages because advertisements are often given more visual emphasis than the main headlines. To improve the current state of the art, we propose a novel method that combines statistical features, linguistic knowledge, and text segmentation. Using an annotated English corpus, we learn the morphosyntactic characteristics of known titles and define part-of-speech tag patterns that help to extract candidate phrases from the web page. To evaluate the proposed method, we use two datasets, Titler and Mopsi, and evaluate the extracted features using four classifiers: Naïve Bayes, k-NN, SVM, and clustering. Experimental results show that the proposed method outperforms the solution used by Google, improving the accuracy from 0.58 to 0.85 on the Titler corpus and from 0.43 to 0.55 on the Mopsi dataset, and offers a readily available solution for the title extraction problem.

Keywords: Web content mining, Information extraction, Title extraction, Natural language processing, Machine learning

1 Introduction

Humans are impatient by nature when seeking information from the Internet; the user wants to obtain the right answer immediately and with minimal effort (Marchionini, 1992; Song, Xin, Shi, Wen, & Ma, 2006). In most applications, the title of the content is the first thing to which the user pays attention. Even a single word or phrase can dramatically change the whole message of the content.
A correct title is important; it should be descriptive, concise, and grammatically correct (Hu et al., 2005). In web pages, titles usually exist in two places for different purposes. First, the title appears somewhere in the body text with high visual emphasis for the human reader. This is important for humans who browse the web page quickly on a computer display. Second, the title is placed in the title field (between the <title> and </title> tags) for robots, crawlers, and programs that prepare a summary of the web page. However, the designers of the web page often ignore this or abuse the title tag by adding extra content such as keywords, an address, or other less relevant text. Its content then becomes vague or incorrect, or it might even be missing completely (Xue et al., 2007). In mobile applications the problem is exacerbated. The small screens of miniaturized devices are even more restrictive to the displayed content and require the title to also fit spatially. The title is also needed for indexing in search engines like Google.

Title extraction aims at producing a compact title for a web page automatically. Due to the problems of the title tag, the existing literature has mainly focused on extracting the title from the body text. Methods have been developed for web pages of standard format such as news pages and pages of educational institutions. However, less attention has been given to service-based web pages such as those of entertainment venues, sports facilities, and restaurants. Existing methods also make assumptions such as that the title is always located in the top region of the page and has visual prominence; they often fail to correctly extract the title of service-based pages where the title is exchanged for a logo, or is positioned elsewhere on the page (see Figure 1). For example, Hu et al. (2005) and Xue et al. (2007) explicitly state that the title must be in the top area of the page.
Furthermore, Fan, Luo, and Joshi (2011) hypothesize that the title is located in the upper part of the main text. Changuel, Labroche, and Bouchon-Meunier (2009) implicitly assume that the title appears in the top portion of the page and as a result extract only the first 20 text nodes from the Document Object Model (DOM) tree. Another assumption often made is that the title in the body is a separate line of text (i.e., it has its own text node in the DOM tree). However, modern web page design allows the title to appear as part of other phrases in a text node of the DOM tree. For example,

<div>Welcome to Petter Pharmacy, please select one of the five options below:</div>

will produce an ill-fitting title unsuitable for mobile devices. According to our experiments, about 68% of the title nodes also contain additional information similar to this example and are therefore prone to errors in title extraction.

In this work, we develop a novel method to overcome these problems in title extraction for service-based content and mobile applications. Our key finding is that the title tag is still the best source; however, it needs to be segmented and further processed. Our preliminary version was presented in (Gali & Fränti, 2016). Here, we further enhance this approach by applying additional part-of-speech (POS) tagging. POS tagging processes every word in the text and assigns it a tag based on its relationship with adjacent and related words in the phrase, sentence, or paragraph. POS tagging has been successfully employed in other domains such as keyword extraction (Hulth, 2003). However, to the best of our knowledge, no language model has been applied to title extraction in the context of web pages. We are aware only of the work of (Lopez, Prince, & Roche, 2010; Lopez, Prince, & Roche, 2014), which gives a POS model for mailing lists and news articles in the French language. We therefore investigate the contribution of POS tagging to this task.

We aim at identifying features that are independent of the format of the web page. Our method uses the following features: syntactic structure, similarity with the link of the web page, appearance in the title tag, appearance in meta tags, popularity on the web page, appearance in heading tags, capitalization, capitalization frequency, independent appearance, and phrase length. We consider four alternative classifiers: Naive Bayes, clustering-based, k-nearest neighbors (k-NN), and support vector machine (SVM), which to our knowledge have not been compared previously on the title extraction task. We compare the proposed method against related ones on two datasets: Titler and Mopsi.
Experiments show that our method gives a significant improvement and achieves an accuracy of 0.85 on the Titler dataset. The corresponding results of the baseline (title tag as such), Google, and the best content-based method are 0.52, 0.58, and 0.47, respectively.

The rest of the paper is organized as follows. In Section 2, we review existing methods for title extraction, and the new method is introduced in Section 3. Experiments are presented in Section 4. The effect of the POS patterns is studied in Section 4.5, feature extraction in 4.6, choice of the classifier in 4.7, and title selection methods in 4.8. Comparisons to the existing methods are then performed in Section 4.9. We compare to all existing identification methods that are accessible. These include Styling (Changuel et al., 2009), TitleFinder (Mohammadzadeh, Gottron, Schweiggert, & Heyer, 2012), Title Tag Analyzer (Gali & Fränti, 2016), and the Baseline. We also compare to the titles provided by Google in the search results page. The results show that the proposed approach outperforms all these methods. The method with k-NN improves the Jaccard score of the baseline from 0.50 to 0.84 on the Titler corpus and from 0.44 to 0.59 on the Mopsi dataset.

Figure 1 Different layouts of web pages [1-3] (white squares refer to images of logos while red ovals refer to titles in the body of the web pages).

2 Related work

Figure 2 shows typical steps for title extraction (left) and possible approaches to each step (right); the modules that are covered in this work are highlighted in blue.

2.1. Content source for title

The title of a web page is usually found in one or more of three places: the title tag (i.e., between <title> and </title>), the text of the body, and the logo image.
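As a minimal illustration of the first two sources, both the title tag and the visually emphasized heading text can be read directly from the raw HTML with a standard parser. The sketch below uses Python's built-in html.parser; the class name and the sample page are our own illustrative choices, not part of the proposed method:

```python
from html.parser import HTMLParser

class TitleSourceExtractor(HTMLParser):
    """Collect text from the <title> tag and from heading tags (h1-h6)."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open tags
        self.title = ""     # content of the <title> tag
        self.headings = []  # candidate titles found in the body

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        if self._stack[-1] == "title":
            self.title += text
        elif self._stack[-1] in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.headings.append(text)

html = ("<html><head><title>Petter Pharmacy | Home | London</title></head>"
        "<body><h1>Welcome to Petter Pharmacy</h1></body></html>")
parser = TitleSourceExtractor()
parser.feed(html)
print(parser.title)     # Petter Pharmacy | Home | London
print(parser.headings)  # ['Welcome to Petter Pharmacy']
```

Even this toy page illustrates the core problem: neither source yields the clean title Petter Pharmacy directly, which is why further segmentation and filtering are needed.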
According to our experiments with 1,002 websites, the occurrence of the title in these three places is as follows:
 Title tag (91%)
 Text of the body (97%)
 Logo (89%)

[1] http://karaautos.co.uk/
[2] http://www.petterpharmacy.co.uk/
[3] http://theapollo.com.au/

Figure 2 Typical steps for title extraction.

The title tag is the obvious source, and the author of the page is expected to fill it with a proper title. However, people often do not complete this tag carefully, as it does not have a visual impact on the page. A title tag often contains additional text, such as the name of the hosting website, information about the place offering services, a slogan, and contact details (see Table 1).

The body text of a web page is a second source for a title. It has been given more focus by researchers given that a title in the body is visible to users and is thus expected to be written more carefully than the title tag (Changuel et al., 2009; Hu et al., 2005; Wang et al., 2009; Xue et al., 2007). However, extracting a title from the body of the web page is not an easy task, as roughly half of a page's content is irrelevant text (Gibson, Punera, & Tomkins, 2005). This irrelevant text (e.g., advertisements) is often given even more visual emphasis than the main headlines, which makes the task even more challenging. Furthermore, no standard location exists for title placement. In this paper, we extract the candidate titles from both the body of the web page and the title tag.

Table 1 The most typical problems related to the title tag and the frequency with which they appear (according to our experiments).

Long description (62%)
  Example: Brook's Diner | 24 Hampden Square, Southgate, London N14 5JR | 020 8368 7201 | eat@brooksdiner.com | Like us on Facebook — Home
  Annotated title: Brook's Diner
Incorrect (6%)
  Example: Hot Tubs, hot tub hire, swimming pools, Bristol, Gloucester
  Annotated title: Rio Pool
Vague (2.4%)
  Example: home index
  Annotated title: Hellard Bros Ltd.
Short description (0.5%)
  Example: Toby's Estate
  Annotated title: Toby's Estate coffee
Empty (0.2%)
  Example: (none)
  Annotated title: Zavino Hospitality Group

The third source for a title is the logo image. However, extracting a title from this image would be very challenging. One reason is that the logo image must first be identified. Another reason is that the standard optical character recognition (OCR) approach would not generally work given that the content of the image is highly complex. We are not aware of any technique that attempts this approach. It should technically be possible, but as shown by the examples in Figure 3, such a technique would need to handle a wide variety of complex text fonts that involve shadowing effects, textures, and other artistic features.

2.2. Content analysis and candidate title extraction

Most title extraction methods use either the DOM tree representation or combine the DOM structure with the visual cues of the page in a vision-based tree. The vision-based tree is built using the vision-based page segmentation algorithm (VIPS) introduced by Cai, Yu, Wen, and Ma (2003). The vision-based tree provides a visual partitioning of the page where the blocks (i.e., DOM nodes) are grouped visually, while the DOM tree describes the parent-child relationship between the tree nodes; therefore, nodes in the vision-based tree do not necessarily correspond to nodes in the DOM tree. VIPS needs to refer to all styling information (including external sheets) to locate the proper place of a block in the tree. If the web page lacks rich visual properties, the hierarchies are incorrectly constructed. A wrong structure can also result from the algorithm not detecting separators represented by thin images. We therefore use the DOM tree representation. In both tree representations, existing methods use the entire text of the leaf nodes as candidate titles.
In this paper, we extract only the relevant part of the text nodes by using POS tag patterns (see Section 3.3).

Figure 3 Examples of web page logos.

2.3. Features for candidate titles

Researchers have extracted a wide range of features from either the DOM or the vision-based tree. Those found in the literature are listed below. The features used in this paper are underlined.

Features from the DOM tree:
 Visual: font weight, font family, font color (Changuel et al., 2009; Hu et al., 2005; Xue et al., 2007); font style, background color (Hu et al., 2005; Xue et al., 2007); alignment (Fan et al., 2011; Hu et al., 2005; Xue et al., 2007); and font size (Changuel et al., 2009; Fan et al., 2011; Hu et al., 2005; Xue et al., 2007);
 HTML tag: bold, strong, emphasized text, paragraph, span, division (Changuel et al., 2009); image, horizontal ruler, line break, directory list (Hu et al., 2005; Xue et al., 2007); underline, list, anchor (Changuel et al., 2009; Hu et al., 2005; Xue et al., 2007); meta; title (Changuel et al., 2009; Fan et al., 2011; Mohammadzadeh et al., 2012); heading level (h1-h6) (Changuel et al., 2009; Fan et al., 2009; Gali & Fränti, 2016; Hu et al., 2005; Xue et al., 2007); and position in tags (Gali & Fränti, 2016);
 DOM structure: number of sibling nodes in the DOM tree (Hu et al., 2005; Xue et al., 2007); relation with the root, parent, sibling, next and previous nodes in terms of visual format (Changuel et al., 2009; Hu et al., 2005; Xue et al., 2007);
 Positional information: position of the text unit from the beginning of the body of the page and width of the text unit with respect to the width of the page (Hu et al., 2005);
 Linguistic: length of text, negative words, positive words (Hu et al., 2005; Xue et al., 2007); position in text (Lopez, Prince, & Roche, 2011); syntactic structure, letter capitalization and phrase length; and
 Statistical: term frequency (Mohammadzadeh et al., 2012); term frequency-inverse document frequency (Lopez et al., 2011;
Mohammadzadeh et al., 2012); capitalization frequency, and independent appearance.

Features from the vision-based tree:
 Page layout: height, width, and position relative to the top left corner (Xue et al., 2007);
 Block: type, height, width, position (Wang et al., 2009; Xue et al., 2007); and front screen position (Wang et al., 2009);
 Unit position: from the top and left side of the page and from the top and left side of the block (Xue et al., 2007); and
 Content: number of words in a block (Wang et al., 2009).

Other features:
 Web page URL (Gali & Fränti, 2016).

The majority of these features are based on formatting, whereas the features we consider are independent of the design of the page.

2.4. Ranking candidate titles

Ranking techniques can be divided into two broad classes: rule based and machine learning based (Xue et al., 2007). Rule-based techniques use a set of predefined heuristic rules to score the candidate titles. These rules are derived from the content of the DOM tree (Fan et al., 2009; Gali & Fränti, 2016; Mohammadzadeh et al., 2012), the link structure between web pages (Jeong, Oh, Kim, Lyu, & Kim, 2014), and the text (Lopez et al., 2011). The key advantage of the rule-based technique is that it does not require training data. Moreover, the technique is easy for humans to interpret and improve, as the weighting procedure and scoring formulas are explicit. However, heuristic methods often require determining thresholds and weights for the feature parameters, which are not always straightforward to calculate. For example, if the number of features is n = 9 and each feature is assigned one of m = 6 values (0 to 5), it takes O(m^n) time to test all weight combinations. In this example, testing would take about four months if each attempt took 1 second (6^9 ≈ 10^7 seconds).

In contrast, machine learning-based techniques involve two steps: training and testing.
In training, the goal is to learn a set of rules that maps the inputs to outputs, so that the rules generalize beyond the training data. In testing, the generated classifier receives unseen data as input and predicts the output values. Proper training of the model is the key to generalizing the classifier beyond the training data. Several machine learning algorithms have been considered by the existing methods. These include perceptron (Li, Zaragoza, Herbrich, Shawe-Taylor, & Kandola, 2002), decision tree (C4.5) (Quinlan, 1993), random forest (Breiman, 2001), support vector machine (SVM) (Vapnik, 1995), and conditional random fields (CRF) (Lafferty, McCallum, & Pereira, 2001). While SVM has been shown to be an effective classifier for the title extraction task, it has not been compared against simpler algorithms such as Naïve Bayes (Domingos & Pazzani, 1997), k-nearest neighbor (k-NN) (Cover & Hart, 1967), and clustering (Fränti & Kivijärvi, 2000), all of which we investigate in this paper.

3 Title extraction

We consider title extraction as a machine learning task, in which the computer is given training data with assigned ground truth titles as the expected output. Three important issues are addressed: how to determine the candidate phrases, what features should be extracted, and which classifier to use. We add linguistic knowledge (syntactic POS) to the process to improve the extraction of the candidate phrases. The proposed method is based on four steps: extracting candidate phrases, feature extraction, phrase classification, and title selection (see Figure 4). A pre-processing step, which involves corpus creation and learning POS patterns, is applied before the training starts. The following subsections describe these steps in detail.

Figure 4 Workflow for title extraction.

3.1. Corpus creation

Several corpora have been created to evaluate title extraction methods. Changuel et al. (2009) created two corpora in the education domain.
The first corpus contains 624 websites in the English and French languages, collected by submitting queries such as chemistry + courses to a search engine. The second corpus contains 424 websites in the French language, collected from the online educational portal Eureka (http://eureka.ntic.org/). Lopez et al. (2014) created a corpus of 300 news articles from three French newspaper websites, covering the politics, sport, society, and science domains.

However, these corpora do not cover web pages of places that offer services, such as sports facilities, hospitals, shops, banks, and restaurants, or web pages that host information about these places, such as Wikipedia, Facebook, business directories, and information pages. These types of web pages are more challenging because they do not follow a certain template or standard format. None of these corpora is publicly available. We therefore built our own corpus by collecting 1,002 unique websites from Google Maps (http://maps.google.com) search results, using queries such as restaurant + Australia, hospital + Canada, pharmacy + London, fitness + Ireland, and auto repair + California, to obtain reasonable geographical diversity and different layouts of websites. The main challenge of creating a corpus is that it is not enough to store the web link and the ground truth title. The entire content of the web page should be stored because the links become obsolete quite fast. Storing the content takes a lot of space, especially when the web page contains maps and images. The websites were collected during 18-31 July 2014 and 19-23 April 2015, and they cover various domains: food & drink, entertainment, auto & vehicles, beauty & fitness, health, sport, and hotels & accommodation. The resulting corpus is publicly available at http://cs.uef.fi/mopsi/TitlerCorpus/.
Similarly to previous studies (Lopez et al., 2014; Wang et al., 2009; Xue et al., 2007), we created the ground truth by manually extracting the titles from the pages. We define the title as the most obvious description of the web page (see Figure 5), following the specifications in (Xue et al., 2007) with a few modifications of our own:
 A web page can have more than one title; for example, we extracted V-Café and Viet-Café from http://www.viet-cafe.com/, and C&A Bennett Ltd and C&A Bennett Tiling Contractors from http://www.candabennett-tiling-bristol.co.uk/;
 The title cannot be part of numbering or bullets;
 The title cannot be a phrase like last updated, a slogan like aim high go low, a time, or an address;
 The title should not be too long;
 The title must be concise and relevant to the page content;
 The title must be grammatically correct;
 The title must be understandable to humans.
We did not use specifications such as that the title must be in the top region of the page, or that the title cannot be a link, because the correct title can be located in any part of the page. Further, a title can be clickable, especially in the case of business directory pages, in which the title is usually linked to the home page of the service. Two people were assigned to extract the ground truth titles independently of each other, and in cases of disagreement, a third person made a judgment between the two.

Figure 5 Identifying the title of the web page [7].

3.2. Learning POS tag pattern

We define a set of specific POS tag patterns that correspond to the syntactic structure of the titles in the ground truth, in order to remove the n-grams that have an unwanted format. A POS tag is the part-of-speech label of a word in a text.
[5] http://maps.google.com
[6] http://cs.uef.fi/mopsi/TitlerCorpus/
[7] http://www.uef.fi/en/research/faculty-of-science-and-forestry

For example, the POS tag of the university is the_DT university_NN, where DT stands for determiner and NN stands for noun. A POS tag pattern is a sequence of part-of-speech tags (e.g., JJ NN
), where JJ stands for adjective. See Appendix A for a complete list of the POS tags that we use in this paper.

We first extract all n-grams (n = 1 to 6) as candidate phrases. We observed that the number of candidates is excessively high (1,024,142), of which only 2,179 phrases are in the title class. Evaluating them all would slow down the process, and it would also cause a high class-size imbalance. There are also many grammatically incorrect title candidates among the n-grams, like At Thai Food You, by Quay, and Portishead Open Air Pool The. Most of these can be eliminated by applying the POS patterns.

To generate the POS patterns, we searched all POS tags that appeared among the ground truth titles in our corpus. We used the tagger developed by Stanford University [8] (Toutanova, Klein, Manning, & Singer, 2003). We observed that the following syntactic features are common for titles:
 Starts with a general noun, proper noun, personal pronoun, foreign word, adjective, determiner, adverb, cardinal number, preposition, or verb;
 Ends with a general noun, proper noun, personal pronoun, adjective, adverb, cardinal number, preposition, particle, verb, or possessive ending;
 In case it contains more than two words, the middle words are allowed to be a general noun, proper noun, personal pronoun, adjective, determiner, cardinal number, preposition, foreign word, coordinating conjunction, adverb, or verb;
 Nouns appear much more often than verbs (see Table 2).
Based on these observations, we generated 151 patterns with lengths varying from 1 to 6 (see Appendix B). In our study, we follow traditional English grammar, in which a noun preceded by determiners or premodifiers such as adjectives is considered a noun phrase. It is important to note that this set of syntactic patterns is language-dependent and applicable only to the English language. The same process could also be done for other languages if a set of ground truth titles and a POS tagger exist.
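The filtering step itself reduces to matching the tag sequence of each n-gram against the allowed patterns. The following sketch operates on text that has already been tagged by an external POS tagger; the pattern subset and the tagged example are illustrative only, not the full set of 151 patterns used in our experiments:

```python
# Toy subset of the allowed POS tag patterns (the full set has 151).
PATTERNS = {
    ("NNP",), ("NNP", "NNP"), ("JJ", "NN"),
    ("NNP", "NNP", "NNP"), ("NNP", "JJ", "NN"),
}

def filter_candidates(tagged, max_n=6):
    """Keep only n-grams whose POS tag sequence matches an allowed pattern.

    tagged -- list of (word, POS tag) pairs produced by an external tagger.
    """
    kept = []
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            if tuple(tag for _, tag in gram) in PATTERNS:
                kept.append(" ".join(word for word, _ in gram))
    return kept

# The ill-formed candidate "At Thai Food You" is rejected as a whole,
# while the well-formed sub-phrase "Thai Food" (tagged NNP NNP) survives.
tagged = [("At", "IN"), ("Thai", "NNP"), ("Food", "NNP"), ("You", "PRP")]
print(filter_candidates(tagged))  # ['Thai', 'Food', 'Thai Food']
```

Because the patterns are stored as a set of tag tuples, each n-gram is checked in constant time, so the filtering cost grows only linearly with the number of candidate n-grams.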
Table 2 Features of ground truth titles

POS tags                                    Presence in title (%)
Nouns and proper nouns                      83
Determiner                                  5
Coordinating conjunction                    4
Adjective                                   3
Possessive ending                           2
Preposition or subordinating conjunction    1
Verb                                        0.7
Cardinal number                             0.7
Foreign word                                0.2
Pronoun                                     0.2
Adverb                                      0.1
Particle                                    0.1

3.3. Candidate phrase extraction

We first construct the DOM tree of the web page and strip it off the