Text Categorization Methods for Automatic Estimation of Verbal Intelligence

F. Fernández-Martínez (a), K. Zablotskaya (b), W. Minker (b)

(a) Grupo de Tecnología del Habla, Universidad Politécnica de Madrid, Madrid, Spain
(b) Institute of Communications Engineering, University of Ulm, Germany

Abstract

In this paper we investigate whether conventional text categorisation methods may suffice to infer different verbal intelligence levels. This research goal relies on the hypothesis that the vocabulary that speakers use reflects their verbal intelligence levels. Automatic verbal intelligence estimation of users in a Spoken Language Dialogue System may be useful when defining an optimal dialogue strategy by improving its adaptation capabilities. The work is based on a corpus containing descriptions (i.e. monologues) of a short film by test persons with different educational backgrounds, together with the verbal intelligence scores of the speakers. First, a one-way analysis of variance was performed to compare the monologues with the film transcription and to demonstrate that there are differences in the vocabulary used by test persons with different verbal intelligence levels. Then, for the classification task, the monologues were represented as feature vectors using the classical TF-IDF weighting scheme. The Naive Bayes, k-nearest neighbours and Rocchio classifiers were tested. In this paper we describe and compare these classification approaches, define the optimal classification parameters and discuss the classification results obtained.

Keywords: spoken dialogue systems, Naive Bayes classification, Rocchio approach, k-nearest neighbours

Email addresses: ffm@die.upm.es (F. Fernández-Martínez), kseniya.zablotskaya@uni-ulm.de (K. Zablotskaya), wolfgang.minker@uni-ulm.de (W. Minker)

Preprint submitted to Expert Systems with Applications, December 22, 2011

Figure 1: Spoken Language Dialogue System

1.
Introduction

Next-generation spoken language dialogue systems (SLDS), developed to provide users with required information and/or to help them accomplish certain goals, are expected to deal with difficult tasks and to react to a wide range of situations and problems. They should help users to feel free and comfortable when interacting with them. Moreover, they should be user-friendly and easy to use. Including aspects of adaptation to users in SLDS may help to increase the systems' communicative competences and improve their acceptability (Figure 1).

Next-generation SLDS may change the level of dialogue depending on users' experience. For example, a spoken dialogue system aimed at providing guidance and support for the installation of some software may try to estimate whether the user is an expert or a novice in this field. Based on this information, suitable words and explanations may be generated. These explanations may be very detailed and free of specialised vocabulary for a non-experienced user; in contrast, for an expert, the system may provide only a sequence of important steps or inform about more difficult operations. From the beginning of the dialogue, the SLDS may analyse the user's speech, behaviour and requests, as well as any existing difficulties. When deciding on the best response to a user, the dialogue manager may change words and sentence structures based on the information about cognitive processes. Its responses may become more helpful and the user-friendliness of the system may be improved. For this purpose it is necessary to identify differences in the language use of people with different educational backgrounds and different abilities to analyse situations and to solve problems.

The ability to use language for accomplishing certain goals is called verbal intelligence (VI) [5, 3]. In other words, verbal intelligence is “the ability to analyse information and to solve problems using language-based reasoning” [13].
Automatic verbal intelligence estimation may help dialogue systems to choose the level of communication and to be simpler, more useful and more effective. Figure 2 explains the adaptation process of spoken dialogue systems based on verbal intelligence estimation in more detail. When a user talks to the system, all j spoken utterances of the user are analysed to determine the verbal intelligence. This means that the intelligence level is re-estimated at each turn based on features extracted from the new spoken utterances and from all the phrases pronounced at the previous turns. In Figure 2 the SLDS has three different dialogue scenarios corresponding to users with a higher, an average and a lower verbal intelligence. At the beginning of the dialogue, the system uses scenarios corresponding to users with an average verbal intelligence. At the following turns, the system might switch to alternative dialogue scenarios.

Figure 2: Adaptation to the user.

The automatic estimation of users' verbal intelligence may help SLDS to control the flow of the dialogues more effectively, engage users in the interaction and be more attentive to human needs and preferences. For training machine learning algorithms, we need to identify as many language features as possible that reflect differences in the language use of people with different verbal intelligence. In this work we investigate to what extent the vocabulary of test persons reflects their levels of verbal intelligence when they all describe the same event and explain their thoughts and feelings about it. The investigation is based on a corpus containing descriptions of a short film along with the corresponding intelligence scores of the speakers [32].

The paper is structured as follows. In Section 2 we describe the corpus which was used for the experimental research.
Section 3 describes our initial efforts at defining film-related features which could be useful for distinguishing test persons with different verbal intelligence. Section 4 describes typical TF-IDF approaches and explains the details of the feature selection process for the monologues. In Section 5 we describe and compare the Naive Bayes, k-nearest neighbours and Rocchio classifiers. Classification results are presented and discussed in Section 6. Finally, Section 7 presents conclusions and future work.

2. Corpus Description

For the data acquisition in [32], a short film was shown to German native speakers. It described an experiment on how long people could stay without sleep. The test persons were asked to imagine that they had met an old friend and wanted to tell him about this film. Our goal was to record everyday speech of the kind used when talking to relatives and friends. This corpus, described in [32], consists of 56 descriptions (3.5 hours of audio data) of a short film (i.e. monologues).

The test persons were also asked to participate in the verbal part of the Hamburg Wechsler Intelligence Test for Adults (HAWIE) [29]. According to Wechsler, intelligence is “the global capacity of a person to act purposefully, to think rationally, and to deal effectively with his environment” [28]. The verbal part consists of the following subtests:

• Information: this subtest measures general knowledge and includes questions about history, geography, literature, etc. For example, What is the capital of Italy?

• Comprehension: test persons are asked to solve different practical problems and explain some social situations. For example, What would you do if you lost your way in a forest?

• Digit Span: test persons are asked to repeat increasingly longer strings of numbers forward and then backward; the subtest measures short-term memory.
• Arithmetic: test persons are asked to solve some arithmetic problems given in a story-telling way; the subtest measures their concentration and computational ability. For example, How many rolls can you buy if you have 36 cents and one roll costs 4 cents?

• Similarities: test persons are asked to find a similarity between a pair of words. For example, Please find a similarity between “wood” and “alcohol”.

• Vocabulary: test persons are asked to explain increasingly more difficult words using their vocabulary. For example, What does “to creep” mean?

The raw scores of each test person on the verbal test are based on the correct answers (Figure 3). The raw scores are then converted into “Scaled Scores” using special tables [29]. The Scaled Scores vary between 0 and 16 and may be used to compare the performance of the participants. The sum of the scaled scores and the age of a test person are used to estimate the corresponding verbal intelligence score.

Figure 3: Verbal Part of the Hamburg Wechsler Intelligence Test for Adults

Overall, 56 test persons with different educational levels were tested; therefore 56 monologues about the same topic were collected. All the monologues and the film were transcribed according to the transcription standards by Mergenthaler [15].

3. Modelling Verbal Intelligence by Using Film Derived Features

To analyse the vocabulary of people with different verbal intelligence when describing the same event, we first decided to compare the monologues with the film transcription. Figure 4 shows excerpts from the film and from one of the monologues.¹

Excerpt from the film:
Max and Funda have been without sleep for fifty eight hours. They have laid down on the sofa. Is it a mistake? Actually they would like to move. But now they cannot any more. The blood pressure is down, the energy reserves are over. They both are freezing despite the fire-place and the jacket. The question is who closes the eyes first. It is Max. Funda wins.
She stays awake a few minutes longer.

Excerpt from a corresponding monologue:
After fifty eight hours, they were really tired. And, they had frozen. Despite they had very warm clothes. And then the man fell asleep and then the woman.

Figure 4: Excerpts from the film and one of the recorded monologues.

¹ As the conversation language is German, the example was directly translated into English.

For the comparison, the following features were extracted:

• Number of reused words: the number of words which a test person “reused” from the film. For our example in Figure 4 the reused words are: fifty, eight, hours, they, and, they, despite, they, and, the, and, the.

• Number of unique reused words: the number of reused words without repetitions. In Figure 4, the unique reused words are: fifty, eight, hours, they, and, despite, the.

• Number of all reused lemmas. This feature is similar to the Number of reused words, but refers to lemmas instead.

• Number of unique reused lemmas. This feature is similar to the Number of unique reused words, but considers unique lemmas instead.

• Cosine similarity between the film and a kth monologue using lemmas. For this feature extraction, we created a matrix consisting of all unique lemmas from the film, including the frequency of these lemmas within the film and within a kth monologue. Table 1 shows this matrix for the texts from our example (Figure 4).
Table 1: Matrix for lemma frequency

Lemmas from film    Frequency (film)    Frequency (monologue)
Max                 2                   0
and                 1                   3
Funda               1                   0
have                2                   2
be                  6                   1
without             1                   0
sleep               1                   0
for                 1                   0
fifty               1                   1
eight               1                   1
hour                1                   1

The frequencies were normalized by the total number of words in the corresponding text; the cosine similarity between the two normalized vectors (lemma frequencies within the film and lemma frequencies within a kth monologue) was calculated as:

$$\text{similarity} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}},$$

where $n$ is the number of unique lemmas in the film, $a_i$ is the frequency of the $i$th lemma in the film and $b_i$ is the frequency of the $i$th lemma in the monologue.²

² Cosine similarity will be further used in the Rocchio classification approach that will be explained in detail in Section 5.2.

• Number of reused n-grams. For this feature we calculated the number of n-grams (n = 2, ..., 10) that were used in the film and then reused by a test person in the corresponding monologue. In our example, the number of reused 2-grams equals 2 (the reused 2-grams are fifty eight and eight hour), the number of reused 3-grams equals 1 (fifty eight hour), etc.

• Cosine similarity using n-grams. The cosine similarity was calculated from a feature vector composed of the counts of different n-grams for each monologue.

• We also determined the number of lemmas that were used by the candidates but were not used in the film. For each monologue the following features were calculated:

$$\text{Own lemmas}_1 = \sum_{i=1}^{n} \text{frequency}(\text{lemma}_i) \cdot \text{count}(\text{lemma}_i)$$

and

$$\text{Own lemmas}_2 = \sum_{i=1}^{n} \text{frequency}(\text{lemma}_i),$$

where $n$ is the number of unique lemmas that were used by a test person but were not used in the film; $\text{count}(\text{lemma}_i)$ shows how many times $\text{lemma}_i$ was used in the monologue; $\text{frequency}(\text{lemma}_i)$ is the frequency class of $\text{lemma}_i$ according to a frequency dictionary of the German language [11]. This dictionary consists of 40000 German words with frequency classes from 1 to 17: 1 corresponds to more frequent words, 17 corresponds to less frequent words.
If a word from the monologues was not found in the dictionary, its frequency was set to 20.

3.1. Feature Analysis

The k-means algorithm, which is frequently used for data clustering in machine learning, was applied to the scaled scores of the test persons (Figure 5). For the feature analysis two experiments were performed. In the first experiment the observations were partitioned into two clusters: cluster P1 consisted of test persons with a lower verbal intelligence, P2 contained candidates with a higher verbal intelligence. In the second experiment the test persons were partitioned into three clusters: P1 - lower verbal intelligence, P2 - average verbal intelligence, P3 - higher verbal intelligence.

Figure 5: The K-means algorithm

The averaged values of all the features in the clusters were compared using a one-way analysis of variance (ANOVA). In Experiment I with two clusters, features with small p-values were:

• Number of reused 3-grams (averaged value for the first class AVlow = 0.021, averaged value for the second class AVhigh = 0.031, p = 0.012, F = 6.63);

• Cosine similarity using lemmas (AVlow = 0.79, AVhigh = 0.83, p = 0.03, F = 4.64);

• Cosine similarity using repeated n-grams (AVlow = 0.13, AVhigh = 0.15, p = 0.01, F = 7.07).

In Experiment II with three clusters, a feature with a small p-value was:

• Cosine similarity using repeated n-grams (AVlow = 0.13, AVaver = 0.14, AVhigh = 0.16, p = 0.01, F = 7.07).

As we can see, participants with a higher verbal intelligence used more words from the film, and the similarity between their descriptions and the film was higher than for participants with an average or a lower verbal intelligence. This may be explained in the following way. Test persons with a higher verbal intelligence (class HIGH) may have a better ability to listen to and recall spoken information from the film.
Memory is indeed measured by one of the verbal subtests of the HAWIE, so a high memory score relates to a high verbal intelligence score of a test person. Therefore, people with good memory (i.e. higher verbal intelligence) were more easily able to remember many details of the film and to use words which they heard when watching the program. They may also better understand the relationships between language concepts, make more sophisticated language analogies or comparisons and perform a more complex language-based analysis. Hence, we may conclude that the vocabulary of test persons with different verbal intelligence was different when they talked about the same event, even though they were asked to talk about the film just after they had watched it.

4. Text Categorization Solutions

The film derived features presented in the previous section proved to be good predictors of verbal intelligence. In particular, some of them suggested that test persons belonging to different verbal intelligence classes may be distinguished by word or lemma patterns, regardless of the order of these words and lemmas in the monologues.

This result led us to the main hypothesis that we investigate in this work: is it possible to solve the problem of inferring the corresponding level of verbal intelligence by simply applying conventional text categorization (TC) techniques?

To validate this hypothesis, typical TF-IDF features (introduced in the next section) were extracted from the transcripts of the monologues (henceforth we do not make any use of the film transcription).

Three of the most popular TC methods were applied for the automatic classification of monologues into three groups: test persons with a lower, an average and a higher verbal intelligence.

4.1. TF-IDF based Approaches

TF-IDF (term frequency - inverse document frequency) based approaches are often used in information retrieval and text mining.
Text categorization is an example of a typical text mining task. The goal of TC is the classification of documents into a fixed number of predefined categories.

The applicability of TC techniques has grown significantly in recent years. Organizing news by subject topics (e.g. to disambiguate information and to provide readers with better search experiences) or papers by research domains (e.g. for large databases of information that need indexing for retrieval) are just some of the most popular examples. Moreover, Security (e.g. analysis of plain text sources such as Internet news), Biomedical (e.g. indexing of patient reports in health care organizations according to disease categories) or Software (e.g. for tracking and monitoring terrorist activities) domains have also benefited from these techniques.

New domains, like Marketing (e.g. analytical customer relationship management) or Sentiment analysis (e.g. analysis of movie reviews), are starting to use text mining solutions. In this work we have applied these techniques to a new domain: the estimation of speakers' verbal intelligence.

For TC, every document has to be transformed into a representation which is suitable for learning algorithms and classification tasks. As reviewed in [16], most TC algorithms are based on the vector space model (VSM). State-of-the-art TC systems widely apply the VSM approach [1, 26, 16].

Information retrieval (IR) research suggests that words work well as representation units. In the VSM, each document in a corpus is represented by a list of words (i.e. bag of words). Each word is considered as a feature; the value of the feature is a weight derived from the number of times the word occurs in the document (i.e. the word's frequency). Thus, a document is represented as a feature vector and its relevance to a query submitted by a user is measured through appropriate matching functions.
These matching functions are typically based on statistical measures, like TF-IDF, that weight the importance of each word. The importance of a word increases proportionally to its frequency within a document but is offset by its frequency within the corpus. Variations of this TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

4.1.1. Mathematical Details

TF-IDF is a common feature transforming or weighting function. The term count, $n_{i,j}$, denotes the frequency of a given term $t_i$ in a given document $d_j$. This count is usually normalized to prevent a bias towards longer documents. Thus, the term frequency $tf_{i,j}$ measures the importance of a term $t_i$ within a document $d_j$ and is defined as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1}$$

where the denominator is the number of words in the document $d_j$, that is, the size of the document $|d_j|$.

The inverse document frequency $idf_i$ is a measure of the general importance of a term:

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{2}$$

where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents in which the term $t_i$ appears (i.e. documents for which $n_{i,j} \neq 0$).

The feature weighting function is then computed by using the following formula:

$$tfidf_{i,j} = tf_{i,j} \cdot idf_i \tag{3}$$

These weights show the importance of the words in each document. As can be seen, terms that are more frequent in a document are more representative of it and, as the number of documents in which a term occurs increases, the term becomes less discriminative.

At this point, we may view each document as a vector that contains terms and their corresponding weights. For those terms from the vocabulary that do not occur in a document, the weight equals zero. In the following sections we will show the advantage of such a document representation.

4.2.
Feature Selection

Typical TC approaches make use of different feature selection techniques to further reduce the dimensionality of the data space by removing irrelevant features that do not contribute to category discrimination.

Different feature selection techniques based on information theory were studied in detail in [31]. As a result of this study, information gain (IG) and the χ²-test (CHI) were reported to be the top performing methods out of the five methods under test in terms of feature removal aggressiveness and classification accuracy improvement. However, the document frequency thresholding approach, the simplest method with the lowest computational cost, was reported to perform similarly.

The Document Frequency (DF) is the number of documents in which a term occurs. As described in [31], it is possible to compute the document frequency for each unique term in the training corpus and to remove from the feature space those terms whose document frequency is less than a certain predefined threshold. By doing so we are adopting a basic assumption: rare terms are either non-informative for the category prediction (i.e. intelligence estimation in our case) or not influential in global performance. In either case, the removal of rare terms reduces the dimensionality of the feature space and may improve the classification accuracy (i.e. if the rare terms happen to be noise terms).

To summarize the pros and cons of the document frequency thresholding approach, the positive aspects are:

• It is the simplest technique for vocabulary reduction (easily scalable to very large corpora).

• Computational complexity is approximately linear in the number of documents.

On the other hand, the negative aspects are:

• The technique is usually considered an ad-hoc approach to improve efficiency rather than a principled criterion for selecting predictive features.
• The technique is typically considered, from an IR point of view, as an inappropriate approach for aggressive term removal (low-DF terms are assumed to be relatively informative and therefore should not be removed aggressively).

In this work a slightly modified version of this DF thresholding approach was applied to the data: TF-IDF measures were used instead of DF measures. As another notable difference, we did not remove the lowest TF-IDF terms but selected the highest TF-IDF terms. In particular, instead of defining a threshold for the TF-IDF measures, we defined a fixed number of terms to be selected (i.e. N). Therefore, we first sorted all the terms according to their TF-IDF measures. Then, we selected the top N most representative or indicative terms according to their TF-IDF weights. The remaining terms were removed as stop or common words that did not add any meaningful content. By observing the evolution of the classification accuracy with an increasing value of N, we determined the minimum size of the vocabulary (i.e. dimensionality) required to achieve the optimum performance.

4.2.1. Class-based vs Corpus-based

As stated above, in our framework each word is considered as a feature and each document is represented as a feature vector. In [19] two alternative ways of implementing the selection of these keywords or features are presented. In the first one, the so-called corpus-based keyword selection, a common keyword or feature set that reflects the most important words for all classes (i.e. highest TF-IDF terms) in all documents is selected. In the alternative approach, called class-based keyword selection, the keyword selection process is performed separately for each class. In this way, the most important and specific words for each class are determined.

4.2.2. Word Lemmatisation

Word lemmatisation is often applied in the area of IR, where the goal is to enhance the system performance and to reduce the number of unique words [23].
In particular, word lemmatisation is part of the data pre-processing required to convert a natural language document to the feature space. Formally, it is the process of reducing inflected (or sometimes derived) words to their lemmas. For example, as a result of lemmatisation, different words like “play”, “plays”, “playing” and “played” are mapped to the same feature (i.e. lemma) “play”.

Word lemmatisation was applied to our monologues to assess its impact on performance (i.e. classification accuracy). Like removing stop words, lemmatisation also contributed to the reduction of the size of the lexicon, thus saving computational resources.

5. Vector Space Classification

As stated above, the vector space model represents each document as a vector with one real-valued component (i.e. TF-IDF weight) for each term. Therefore, we need text classification methods that can operate on real-valued vectors. In this section we introduce the ones that have been tested so far.

A number of classifiers have been used to classify text documents, including regression models, Bayesian probabilistic approaches, Nearest Neighbours approaches, the Rocchio algorithm, decision trees, inductive rule learning, neural networks, on-line learning, Support Vector Machines (SVMs), and combining classifiers [12, 7, 22, 30]. In this work we used three well-known vector space classification methods: Naive Bayes (NB), Rocchio and Nearest Neighbour classification (kNN).

NB is often used as a baseline in text classification research as it combines efficiency (training and classification can be accomplished with one pass over the data) and good accuracy (particularly if there are many equally important features that jointly contribute to the classification decision). The Rocchio algorithm is a very simple and efficient text categorization method for applications such as web searching, on-line query, etc. because of its simplicity in both training and testing [26].
kNN requires no explicit training and can use the unprocessed training set directly in classification. However, it is less efficient than the other classification methods (i.e. with kNN all the work is done at run-time, so it can have poor run-time performance if the training set is large).

Rocchio and Naive Bayes are linear classifiers whereas kNN is an example of a non-linear one. Generally speaking, if a problem is non-linear and its class boundaries cannot be approximated well with linear hyperplanes, non-linear classifiers are often more accurate than linear classifiers (in particular, if the training set is large, then kNN can handle complex classes better than Rocchio and NB). On the other hand, if a problem is linear, then it is better to use a simpler linear classifier. However, this needs to be taken with a grain of salt, since the previous assertion is always conditioned by the well-known bias-variance trade-off (i.e. with limited training data, a more constrained model tends to perform better). These approaches are described in more detail in the following sections.

Among the enumerated alternatives, SVMs are widely used, mainly because they have considerable theoretical and empirical appeal and perform at the state-of-the-art level. According to [22], SVMs, example-based methods, regression methods and boosting-based combining classifiers deliver top-notch performance. Lewis et al. (2004) also found that SVMs perform better on the Reuters-RCV1 corpus than kNN and Rocchio.

Nonetheless, recent revisions of the selected algorithms have proposed enhanced versions of these methods that achieve performance relatively close to that of the top-notch TC classifier, the SVM. For instance, Miao and Kamel (2010) have re-examined the applicable assumptions and the parameter optimization method of the traditional Rocchio algorithm and proposed an enhanced version of this method that clearly outperforms the original one by using a pairwise optimized strategy. Salles et al.
(2010) also present a methodology to determine the impact that temporal effects may have on TC and to minimize it. By extending the three algorithms (namely kNN, Rocchio and NB) to incorporate a Temporal Weighting Function (TWF), experiments showed that these temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.

In any case, and as discussed in [14], despite the belief of many researchers that SVM is better than kNN in terms of effectiveness, kNN is better than Rocchio and Rocchio is better than NB, the ranking of classifiers ultimately depends on the classes, the document collection and the experimental setup.

5.1. Naive Bayes Classification

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method [14]. According to this method, the probability of a document $d$ being in a class $c$ can be computed as:

$$P(c \mid d) \propto P(c) \cdot \prod_{1 \le k \le n_d} P(t_k \mid c) \tag{4}$$

where $P(t_k \mid c)$ is the conditional probability of a term $t_k$ occurring in a document of a class $c$. It may also be interpreted as a measure of how much evidence $t_k$ contributes that $c$ is the correct class. $P(c)$ is the prior probability of a document occurring in a class $c$. The terms $\langle t_1, t_2, \cdots, t_{n_d} \rangle$ are part of the vocabulary that is used for the classification; $n_d$ is the number of these terms in $d$.

In NB classification, the best class for a document $d$ is determined as:

$$c_{map} = \arg\max_{c \in C} \hat{P}(c \mid d) = \arg\max_{c \in C} \hat{P}(c) \cdot \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c) \tag{5}$$

where $\hat{P}$ refers to the parameters estimated from the training data by applying Maximum Likelihood Estimation (MLE).

The interpretation of this equation is rather simple. Each conditional parameter $\hat{P}(t_k \mid c)$ is a weight that indicates how good an indicator $t_k$ is for a class $c$. Similarly, the prior $\hat{P}(c)$ is a weight that indicates the relative frequency of $c$. More frequent classes are more likely to be determined as the correct class.
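As an illustration of the decision rule in Equation 5, the following minimal Python sketch estimates class priors and term probabilities from tokenised documents and picks the class with the highest score. The function names, the toy documents and the add-one smoothing constant `alpha` are assumptions made for this example and are not taken from the experiments reported here; sums of logarithms replace the product in Equation 5 to avoid floating-point underflow.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(t|c) from tokenised documents.

    alpha is an additive smoothing constant (an assumption of this sketch);
    it keeps unseen term/class pairs from getting zero probability.
    """
    vocab = sorted({t for doc in docs for t in doc})
    docs_per_class = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)

    # Prior: relative frequency of each class in the training set.
    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}

    # Conditional term probabilities, smoothed and stored as logarithms.
    log_cond = {}
    for c in docs_per_class:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_cond[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_cond


def classify_nb(doc, log_prior, log_cond):
    """Return the class maximising log P(c) + sum_k log P(t_k|c)."""
    scores = {}
    for c in log_prior:
        # Terms outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_cond[c][t] for t in doc if t in log_cond[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data with two classes, for illustration only.
train_docs = [
    ["tired", "sleep"],
    ["tired", "cold"],
    ["experiment", "blood", "pressure"],
    ["experiment", "energy", "reserve"],
]
train_labels = ["lower", "lower", "higher", "higher"]
log_prior, log_cond = train_multinomial_nb(train_docs, train_labels)
print(classify_nb(["experiment", "pressure"], log_prior, log_cond))
```

With equal priors, as in this toy data, the decision is driven entirely by the term weights; with unbalanced classes the prior term shifts the decision towards the more frequent class, as described above.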
To reduce the number of parameters, we adopted the Naive Bayes conditional independence assumption, where attribute values are independent of each other given the class, so that for our multinomial model:

$$P(d \mid c) = P(\langle t_1, t_2, \cdots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(X_k = t_k \mid c) \tag{6}$$

where $X_k$ is a random variable for a position $k$ in the document and the values of $X_k$ are terms from the vocabulary. $P(X_k = t_k \mid c)$ is the probability that in a document of a class $c$ the term $t_k$ will occur in position $k$.

To further reduce the complexity of our multinomial model (assuming a different probability distribution for each position $k$ in the document still results in too many parameters), we made a second independence assumption: the conditional probabilities for a term are the same regardless of its position in a document:

$$P(X_{k_1} = t \mid c) = P(X_{k_2} = t \mid c) \tag{7}$$

so that $X$ is a single distribution of terms which is exactly the same for any position $k_1, k_2, \cdots, k_i$. Equation 7 applies for all terms $t$ and classes $c$. This positional independence assumption is equivalent to adopting the bag of words model, which we introduced in Section 4.1. The bag-of-words model discards all information that is communicated by the order of words in natural language sentences.

5.1.1. A Variant of the Multinomial Model

A critical step in solving a text classification problem is to choose the document representation. An alternative formalization of the multinomial model represents each document $d$ as an $M$-dimensional vector of weights $\langle \text{tf-idf}_{t_1,d}, \text{tf-idf}_{t_2,d}, \cdots, \text{tf-idf}_{t_M,d} \rangle$, where $\text{tf-idf}_{t_i,d}$ is the TF-IDF measure for a term $t_i$ in a document $d$. $P(d \mid c)$ is then computed as follows:

$$P(d \mid c) = P(\langle \text{tf-idf}_{t_1,d}, \text{tf-idf}_{t_2,d}, \cdots, \text{tf-idf}_{t_M,d} \rangle \mid c) \tag{8}$$

All the model parameters (i.e. class priors and feature probability distributions) may be estimated from the training set by using MLE. For every class prior we calculated an estimate of the class probability from the training set (i.e.
(prior for a given class) = (number of samples in the class) / (total number of samples)). To estimate the parameters for our feature dis- tribution, we adopted the typical assumption that the continuous values as- sociated with each class are distributed according to a Gaussian distribution. Particularly, assuming that the training data contains continuous attributes, i.e. TF-IDF measures for each term and document, we first segmented the data by the class and then computed the mean and variance of every term specific TF-IDF measure in each class. 17 5.1.2. About the Independence Assumptions Typically, TC tasks rather look at the words themselves and not at their corresponding positions in the documents (i.e. bag of words). This relies on the hypothesis that each topic or class to be distinguished is fairly represented by only some specific words from our bag. The NB models often perform well for TC tasks despite the conditional independence and the positional independence assumptions. In fact, both assumptions are very important to avoid problems in estimation owing to data sparseness. By adopting both independence assumptions, we are committed to a spe- cific way of processing the evidence. Particularly, in the NB classification we look at each term separately so that we do not make a di"erence between word A followed by word B and word B followed by word A (although there is a di"erence between them). However, the conditional independence assump- tion does not really hold for text data as terms are conditionally dependent on each other. Additionally, the position of a term in a document by itself may carry more information about the class than expected. 5.2. The Rocchio Approach The Rocchio classification [20] divides the vector space into di"erent re- gions centred on prototypes. These prototypes or centroids, one for each class, define the class boundaries (i.e. hyperplanes). 
For a given training dataset, the centroid of a class $c$ can be computed as the vector average or centre of mass of its members (i.e. all documents in the class) [14, 8]:

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{V}(d) \tag{9}$$

where $D_c$ is the set of documents from class $c$, $D_c = \{ d : \langle d, c \rangle \in D \}$, and $\vec{V}(d) = (V_1(d), \ldots, V_M(d))$ is a vector that contains the tf-idf weights for each term of a document $d$.

Like many vector space classifiers (e.g. kNN classification, which computes the nearest neighbours), the Rocchio approach relies on distance-based decisions (from a TC point of view, the relatedness of two documents can typically be expressed in terms of similarity or distance). In particular, the Rocchio classification rule is to classify a point in accordance with the region it falls into. To do this, we determine the centroid $\vec{\mu}(c)$ that the point is closest to and then assign it to $c$.

In our experiments, we used the cosine similarity measure as the underlying distance. Cosine similarity is the cosine of the angle between two vectors and determines whether they are pointing in roughly the same direction. Since the components of our vectors (i.e. tf-idf weights) cannot be negative, the angle between two tf-idf vectors cannot be greater than 90°. Given the vector representations $\vec{V}(d_1)$ and $\vec{V}(d_2)$, the cosine similarity between two documents $d_1$ and $d_2$ is:

$$sim(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)| \, |\vec{V}(d_2)|} \tag{10}$$

where $|\vec{V}(d_1)|$ and $|\vec{V}(d_2)|$ are the Euclidean lengths of the vectors.

By using this measure, we are also applying a normalization process which makes all vectors the same length [21]. If we look at the magnitude of the vector difference between two vectors corresponding to documents with very similar content, it may happen that this difference is significant simply because one is much longer than the other.
Cosine similarity compensates for this effect of document length, so that the similarity between document vectors reduces to the cosine of the angle between them. We can rewrite Equation 10 as follows:

$$sim(d_1, d_2) = \vec{v}(d_1) \cdot \vec{v}(d_2) \tag{11}$$

where $\vec{v}(d_1) = \vec{V}(d_1) / |\vec{V}(d_1)|$ and $\vec{v}(d_2) = \vec{V}(d_2) / |\vec{V}(d_2)|$. The assignment criterion for a document $d$ and its vector representation $\vec{V}(d)$ can be defined as:

$$c_{rocchio} = \arg\max_{c \in C} sim\left(\vec{\mu}(c), \vec{V}(d)\right) \tag{12}$$

In our implementation of the Rocchio approach [4, 10], only positive training samples are considered for obtaining the prototype for each class (i.e. training samples that belong to the corresponding class). However, recent variations of Rocchio [22, 17, 2] consider the effects of negative samples (i.e. training documents that belong to all other classes) when computing the prototypes for the defined classes. Different parameters may be used to control the relative importance of positive and negative samples. These Rocchio classifiers reward not only the closeness of a test document to the centroid of the positive training instances, but also its distance from the centroid of the negative training instances.

5.3. K-nearest Neighbours

In pattern recognition, the k-nearest neighbour algorithm (kNN) is a method for classifying objects based on the closest training examples in the feature space. In TC, kNN takes an arbitrary input document and ranks the k nearest neighbours among the training documents through the use of a similarity score (i.e. cosine similarity distance). It then assigns to the input the category or class of the most similar document or documents. A constant k, defined by the user, denotes the number of neighbours included in the evaluation.

The kNN algorithm is a valid non-parametric method. Despite being amongst the simplest of all machine learning algorithms, it is one of the best methods when the text is described by using VSM [30].
However, traditional kNN has two main drawbacks: the intensive computational effort, especially when the size of the training set grows (training examples are vectors in a highly multidimensional feature space), and its sensitivity to the local structure of the data [9].

New nearest neighbour algorithms have recently been proposed, mainly with the purpose of reducing the number of distance evaluations actually performed, thus trying to make kNN computationally tractable even for large data sets. For instance, [27] presents a fast kNN algorithm that reduces the cost of similarity computation in order to raise the classification speed and applicability of kNN.

In Naive Bayes and Rocchio classification we have to estimate the corresponding parameters: priors and conditional probabilities for the former, centroids for the latter. In kNN we do not need to estimate any parameters but simply memorize all examples in the training set and then compare a test document to them. For this reason, kNN is also called memory-based learning or instance-based learning.

The kNN algorithm is known for its strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data) [14]. kNN is guaranteed to approach the Bayes error rate for a certain value of k (where k increases as a function of the number of data points). The k-nearest neighbour methods may be improved by using proximity graphs [25].

5.3.1. Choosing the Class for an Unclassified Document

To make a decision on a number of unclassified documents, we measure their similarity with all the documents that have already been classified. The already classified documents are then ranked according to their similarity scores. Appropriate classes for the documents may be assigned in the following ways:

• If we choose k = 1, the class is predicted to be the class of the closest training sample. This is called the nearest neighbour algorithm.

• If we choose k > 1, then all the documents whose ranks are smaller than or equal to k will be included in the ranked list. We can then use different means to find a class for our document, like:

  – we may assign the document to the most common class amongst its k nearest neighbours (if we are dealing with a binary, i.e. two-class, classification problem, it is helpful to choose k to be an odd number as this avoids tied votes).

  – we may estimate the probability of membership in a class c as the proportion of the k nearest neighbours in c. This is commonly referred to as the probabilistic version of the kNN classification algorithm.

  – for the individual classes, we may sum the distances to all the documents in which the class occurs, and then choose the class corresponding to the highest accumulated value (recall that our distance measure is the cosine similarity, so a higher value means a closer document).

  – etc.

If we decide to use the basic "majority voting", the probabilistic method or the sum-of-distances classification, those classes with more frequent examples will tend to dominate the prediction of a new vector. This is a drawback, as such classes tend to come up among the k nearest neighbours simply because of their large number (it is important to recall that the available data is certainly imbalanced). To overcome this problem, we compensate for this possible imbalance by introducing a slightly modified classification method for our kNN-based approach.

Typically, the implementation of these versions of the algorithm starts by computing the distances from the test sample to all stored vectors of the training data set. Next, all these training samples are sorted according to these distances, thus ranking the nearest k training samples regardless of their corresponding class.
If we look at the class labels instead (neighbours are taken from a set of documents for which the correct classification is known), we can identify specific top k neighbours for each class c (i.e. the k nearest neighbours labelled as c, thus resulting in an overall list composed of k · |C| elements). Finally, by computing the distance to each class as the average distance between the test sample and those top k class-specific neighbours, we manage to compensate for any possible imbalance in the distribution of the training data among the defined classes.

The best class for kNN classification can then be derived from:

$$c_{kNN} = \arg\max_{c \in C} score(c, d) = \arg\max_{c \in C} \frac{1}{k} \sum_{d' \in S_k(c,d)} sim\left(\vec{v}(d'), \vec{v}(d)\right) \tag{13}$$

where $S_k(c, d)$ is the class-specific set of the k nearest neighbours of $d$ labelled as $c$.

As could be derived from Equation 13, it may also be useful to weight the contributions of the neighbours so that nearer ones contribute more to the average than more distant ones [24]. This classification method is weighted by taking into account not only the distance from the test sample to the class-specific set of k nearest neighbours, but also the class compactness, particularly for high values of k.

5.3.2. Parameter Selection

The parameter k in kNN is typically defined by using some previous experience or specific knowledge about the classification domain. Normally, 1NN is found to be not very robust. The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or irrelevant features (also if the feature scales are not consistent with their importance). 1NN means that the classification decision for each test document relies only on the class of a single training document, whose label could be incorrect or atypical. kNN with k > 1 is more robust, as larger values of k tend to reduce the effect of noise on the classification (although they also make boundaries between classes less distinct).
As an alternative, a good value of k can be assigned heuristically via cross validation or empirically via the bootstrap method [6].

In our experiments, instead of applying any of these parameter selection methods, we tried different k values and selected as optimal the value of k that yielded the best performance.

6. Experimental Results and Discussion

6.1. Experimental Set-up

Our main goal is to identify the algorithm that best computes class boundaries and reaches the highest classification accuracy. To compare the performance of the different approaches in our experiments, a Leave-One-Out cross validation (LOO-CV) method was used. The idea of this method is to use N-1 observations for training (where N is the number of data points) and only 1 data point for testing. This procedure is repeated N times, so that each observation is used once as the testing data.

6.2. Baseline Approach: Class-Based vs Corpus-Based Feature Selection

As introduced in Section 4.2.1, our experiments covered the comparison of the class-based and corpus-based keyword selection approaches. The corpus-based approach implies the selection of a common feature set for all classes with the top N most representative or indicative terms. The class-based approach instead implies the selection of the most important words for each particular class. In this case, to preserve the balance between classes, N/M words were selected for each specific class, where M is the number of classes. For our classification task M equals 3, where the first class contained test persons with a lower verbal intelligence, the second class contained participants with an average verbal intelligence and the third class contained participants with a higher verbal intelligence. Then, we composed our feature vector by concatenating all the class-specific features, thus resulting in a vector comparable to the N-dimensional vector corresponding to the corpus-based approach.
However, when using the class-based approach, a particular word may be included in several class-specific subsets (i.e. a word that is important not only to one single class but to several classes). To avoid using duplicate features, words falling in the intersection between class-specific subsets were included only once. Therefore, the dimension of the resulting feature vector in these cases was necessarily lower than N.

For simplicity, we will refer to the number of features per class (i.e. F = N/3) rather than to the final dimensions of the vectors. Consequently, if we report, for instance, 50 words or features per class, this means that we are using a 150-dimensional corpus-based vector. In this case, for the class-based approach, 150 is the maximum number of dimensions. To determine its actual value, it is necessary to check the possible intersection.

Of course, the higher the value of F, the more significant the intersection between class-specific word subsets, and also the bigger the difference with respect to the corpus-based vector dimensions. Analysing the corpus, 2210 different words were extracted from all the monologue transcripts. Table 2 shows how the intersection evolved according to F. Considering the size of the vocabulary, the observed difference is significant.

Table 2: Dimension differences between class-based and corpus-based approaches.

# of features per class (3 classes)    50   100   150   200   250   300   350   400   450   500
Corpus-based                          150   300   450   600   750   900  1050  1200  1350  1500
Class-based                           150   289   393   486   557   601   737   858   992  1102
Difference                              0    11    57   114   193   299   313   342   358   398
Rel. diff. (%)                          0   3.7  12.7    19  25.7  33.2  29.8  28.5  26.5  26.5

Figure 6 presents the accuracy results obtained using either the corpus-based or the class-based feature selection methods. The results were obtained using the NB approach for different dimensions of the feature vector. Confidence intervals of 95% are also shown in the figure.
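The way the class-based dimension falls below N can be illustrated with a short sketch. It assumes that each class provides a ranking score per word (the scoring function itself is defined elsewhere in the paper and is treated here as a given); the function names and the toy scores are hypothetical. The last lines recompute the second column of Table 2 as a check on the arithmetic.

```python
def class_based_features(class_scores, f):
    """Merge the top-f words of every class, keeping each word only once.

    class_scores : dict mapping class -> {word: relevance score}
                   (scores are assumed to be precomputed)
    f            : number of features requested per class
    """
    selected, seen = [], set()
    for c in sorted(class_scores):
        scores = class_scores[c]
        top = sorted(scores, key=scores.get, reverse=True)[:f]
        for word in top:
            if word not in seen:  # skip words already chosen for another class
                seen.add(word)
                selected.append(word)
    return selected


def relative_difference(corpus_dim, class_dim):
    """Relative dimension gap (in %) with respect to the corpus-based vector."""
    return 100.0 * (corpus_dim - class_dim) / corpus_dim


# Toy scores for three classes (made-up words and values).
toy = {
    "low": {"a": 3, "b": 2, "c": 1},
    "average": {"b": 3, "d": 2, "a": 1},
    "high": {"e": 3, "a": 2, "f": 1},
}
features = class_based_features(toy, f=2)
print(len(features), "dimensions instead of", 2 * len(toy))

# Check against Table 2 for F = 100: 300 corpus-based vs 289 class-based.
print(round(relative_difference(300, 289), 1))
```

In the toy example the words "a" and "b" are each selected by two classes, so the merged vector has 4 dimensions rather than 6.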
As can be seen in the figure, the class-based approach outperformed the corpus-based one regardless of the dimensionality used. Although the observed differences were not statistically significant in any case, it is interesting to pinpoint the result for the 155-dimensional value. At this point the class-based approach reached its top performance, while the difference with respect to the corpus-based alternative was also the largest, becoming almost significant.

From a different point of view, we may also analyse the minimum dimensionality required by the class-based approach to outperform the corpus-based one. The corpus-based approach obtained a maximum accuracy of 51.79%. As can be observed in Figure 6, this performance was reached with dimensionality equal to or higher than 110. The same figure shows that the class-based approach obtained a better performance of 57.14% (though not statistically significant) using "only" 20 features per class. The class-based feature selection, by definition, focuses on finding the most crucial or indicative class keywords. The corpus-based one, on the other hand, simply tends to find general keywords concerning all classes. This clearly tips the balance in favour of the class-based approach, particularly when we use a reduced set of features. This is important as there may be a significant gain in classification time when a small number of features is used.

Figure 6: Baseline approach: Class-Based vs Corpus-Based feature selection.

By confirming these differences with additional statistical evidence (i.e. more data), we may also conclude that the class-based feature selection improved on the performance of the corpus-based one for the NB approach, not only in terms of accuracy but also in terms of time. Similar results were already confirmed in [19].

When using the corpus-based approach, most features (i.e. words) tend to be selected from the prevailing classes, so that rare classes are not well represented. In contrast, when using the class-based approach, all the classes are represented equally well, as class-specific features are used for their representation. Thus, the class-based approach achieved consistently higher accuracies than the corpus-based approach.

Similar differences between the class-based and corpus-based methods have been consistently observed throughout all of our experiments. Therefore, in the next sections we will only focus on the class-based versions.

6.3. Comparison between Approaches: Rocchio "Wins"

In this section we compare the results that were obtained using the different approaches. Before proceeding with this comparison, we first need to determine the optimal configuration (i.e. k value) for the kNN approach.

Figure 7 presents classification results corresponding to several k values. As expected, 1NN was found to be not very robust. Optimal performance may be reached by using k = 3 in combination with a dimensionality of 155. However, if we keep increasing the value of k, which is typically more robust as it helps to reduce the effect of noise on the classification, then the results apparently start to be affected by sparse data bias.

Figure 7: kNN results for different k values.

As a result of the initial k-means clustering, only 13 samples were assigned to the least populated class. Therefore, starting with 1NN, we tested values up to k = 12, since one sample is left out for testing (the LOO approach was applied). For clarity, Figure 7 presents classification results only for some values of k.

The observed differences were found to be statistically significant for the top performance dimensionality (i.e. F = 155) when comparing the best configuration (i.e. k = 3) with all the others for k > 5. No statistically significant differences were observed between the best k and any k ≤ 5 configurations.
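The evaluation protocol of Section 6.1, applied to the two vector space classifiers compared in this section, might be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in the experiments: documents are taken to be already represented as tf-idf vectors, the function names and the toy data are hypothetical, and the class score of the kNN variant is averaged over the neighbours actually available (which equals the 1/k factor of Equation 13 whenever every class has at least k training members).

```python
import math


def cosine(u, v):
    """Cosine similarity (Equation 10); defined as 0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def rocchio_predict(train_x, train_y, x):
    """Assign x to the class whose centroid is most similar (Equations 9 and 12)."""
    best, best_sim = None, float("-inf")
    for c in sorted(set(train_y)):
        members = [v for v, y in zip(train_x, train_y) if y == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        s = cosine(centroid, x)
        if s > best_sim:
            best, best_sim = c, s
    return best


def knn_predict(train_x, train_y, x, k=3):
    """Class-specific kNN: average similarity to the k nearest members of each class."""
    best, best_score = None, float("-inf")
    for c in sorted(set(train_y)):
        sims = sorted(
            (cosine(v, x) for v, y in zip(train_x, train_y) if y == c),
            reverse=True,
        )[:k]
        score = sum(sims) / len(sims)
        if score > best_score:
            best, best_score = c, score
    return best


def loo_accuracy(xs, ys, predict):
    """Leave-one-out: train on N-1 samples, test on the remaining one, N times."""
    hits = 0
    for i in range(len(xs)):
        train_x, train_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Toy data: two well separated classes in a 2-dimensional space.
xs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
ys = ["a", "a", "a", "b", "b", "b"]
print(loo_accuracy(xs, ys, rocchio_predict))
print(loo_accuracy(xs, ys, lambda tx, ty, x: knn_predict(tx, ty, x, k=2)))
```

Sweeping `k` in the last call is one simple way to reproduce the kind of comparison summarised in Figure 7.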
Figure 8 compares the results of the NB approach, the Rocchio approach and the kNN approach with k = 3.

Figure 8: Comparison between approaches: Rocchio wins.

A first important result that we can derive from Figure 8 is that both Rocchio and kNN clearly outperform the NB approach, although the top performance is reached at different dimensionalities in each case. kNN reached a maximum accuracy of 92.86% at dimensionality 155, while Rocchio required just 15 features per class to improve on it, reaching 95.6%. Both results were statistically significantly better than the NB top performance, 66.07%, also at dimensionality 155. However, we did not observe any significant difference between Rocchio and 3NN (naturally, Rocchio also significantly outperformed any kNN configuration with k > 5).

As typically occurs in TC tasks, most of the learning takes place with a small yet crucial portion of features (i.e. keywords) for a class. This is evident in the steep learning curves, which reach the top performance at relatively low dimensionality. Therefore, we may conclude that the class-based feature selection approach is successful in quickly finding the most crucial or indicative class keywords.

Another visible result in Figure 8, common to all the tested approaches, is the performance decrease as the value of F increases (particularly beyond a 200-dimensional value). As we already introduced in Section 6.2 and showed in Table 2, the higher the value of F, the more significant the intersection between class-specific word subsets. If we expand this interpretation, the more significant the intersection, the less discriminative the class-specific subsets and the more likely they are to include words that are not really indicative of any of the classes, and so the performance decreases.

6.4. Using Words vs Lemmas

As we introduced in Section 4.2.2, we also tried a word lemmatisation strategy (i.e. to group together those words that are in different forms but share the same lemma). This strategy was implemented as part of the data pre-processing stage during the classification task. Figure 9 shows the results with and without word lemmatisation for our top performing approach: the Rocchio one.

The main advantage of word lemmatisation is to reduce the dimensionality of the data space. In a TC task, it is basically applied under the assumption that all the documents belonging to the same category or topic may include these lemmas appearing in different forms, and of course it makes sense to use them as they refer to words with similar meanings. TC tasks typically rely on this. However, to be successful and thus really enhance system performance, there is another important hypothesis that also needs to be confirmed: each topic or class to be distinguished should be fairly represented by only some class-specific lemmas.

While the former happens to be true in most cases, the latter, though also successfully applied in typical TC tasks, may reasonably not be true in our case. The main reason for this is that, from this point of view, all the documents (i.e. monologues) could be regarded as belonging to the same category according to their topic or content: all the documents are about the film which the participants watched. Consequently, we could expect a considerable number of lemmas to be shared among the participants, as they were all talking about the same topic.

Figure 9: Classification results using words vs lemmas.

This is an important difference from conventional TC tasks where, normally, the topics or classes are well separated according to their conceptualization. In contrast, in our domain we may expect the participants to be distinguishable from one another not by the concepts or ideas themselves but by the way they express these ideas.
Therefore, in this particular case, we may expect the different endings and forms, rather than the lemmas, to contribute to category discrimination. Hence, missing this discriminative information because of lemmatisation (simplifying words with different forms into their more common roots) could have some undesirable consequences in classification and clustering.

Moreover, the fact that all the participants were German native speakers could be particularly critical for this problem. In this regard it is important to remark that German is a very agglutinative language [18]. Compound words, or words that consist of more than one lemma (i.e. compounding or word-compounding occurs when a person attaches two or more words together to make one word), can be found very often in the German language. "Donaudampfschifffahrtsgesellschaftskapitänsmütze" (i.e. Danube steamboat shipping company captain's hat) is a good example of how long these compound words can be (they can be practically unlimited in length, particularly in the case of biochemistry).

The meaning of a compound word differs from the meanings of the words which it consists of. Lemmatisation of compound words would simply reduce them to their more common lemmas, thus losing this discriminative information.

To what extent this argument holds can be derived from Figure 9. In fact, the word-based approaches systematically outperform the lemma-based ones. Confidence boundaries for both cases are also shown in this figure. As we can observe, the differences become statistically significant mostly around the same dimensionality range that was previously pinpointed when referring to the top performance for both kNN and NB (particularly at a 155-dimensional value). However, the differences are not statistically significant at F = 15, the point at which Rocchio reaches its maximum accuracy for both word-based and lemma-based approaches.

6.5. Tempted to Use More Classes

Although the three-class scheme can be considered entirely suitable from a practical implementation point of view (i.e. participants with a lower, an average and a higher verbal intelligence), we were also interested in analysing the performance of the suggested approaches for a higher number of classes. This would enable a finer granularity for the verbal intelligence classification.

Figure 10 presents benchmarking results for 4 classes instead of 3 (as shown in Figure 8). From a practical point of view, these classes may correspond to the following levels of verbal intelligence: poor, average, high and very high respectively. In this regard it is important to remark that working with a higher number of classes, like 5 or more, was practically infeasible because of sparse data problems (i.e. k-means resulted in unpopulated classes).

As for 3 classes, the Rocchio approach again showed the highest accuracy (i.e. 87.5% at F = 15). The optimal dimensionality remained the same as for 3 classes (i.e. F = 15). Regarding the comparison between Rocchio and kNN, the observed differences also remained not statistically significant.

Figure 10: Tempted to use more classes: 4 classes.

Additionally, both Rocchio and kNN clearly outperformed the NB algorithm, once again by a significant margin. For these two, another effect starts to become evident: the top performance region, previously observed for dimensionality values up to 200, now becomes narrower, with its limit located approximately around a value of 100. As we increase the number of classes, it seems evident that the number of terms or features that are really indicative of each particular class becomes smaller, thus affecting the performance.

From a general point of view, the resulting performance can still be deemed satisfactory, as the error rate is only roughly 7% higher than with three classes.
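The four-class accuracy quoted above can be cross-checked against the confusion matrix reported in Table 3. The sketch below copies the counts from that table, recomputes the overall accuracy and then collapses the two highest classes into one; the helper names are hypothetical and only serve this illustration.

```python
# Rows are actual classes, columns are predicted classes,
# in the order: poor, average, high, very high (counts from Table 3).
confusion = [
    [4, 0, 1, 0],
    [1, 14, 1, 0],
    [1, 0, 22, 1],
    [0, 1, 1, 9],
]


def accuracy(matrix):
    """Share of samples on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def merge_last_two(matrix):
    """Collapse the last two classes (both rows and columns) into a single class."""
    n = len(matrix)
    # First merge the last two columns of every row ...
    rows = [row[: n - 2] + [row[n - 2] + row[n - 1]] for row in matrix]
    # ... then add the last two rows together.
    merged_row = [a + b for a, b in zip(rows[n - 2], rows[n - 1])]
    return rows[: n - 2] + [merged_row]


print(round(100 * accuracy(confusion), 2))                  # four classes
print(round(100 * accuracy(merge_last_two(confusion)), 2))  # high + very high merged
```

With 56 monologues in total, 49 lie on the diagonal of the four-class matrix and 51 after merging, which gives 87.5% and 91.07% respectively.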
If we look at the confusion matrix presented in Table 3, we may check that, by collapsing the four classes into three (grouping the high and very high classes), predictability would be more similar, reducing the gap roughly by half (i.e. 94.64% for three classes and 91.07% for four classes collapsed into three).

Finally, it may be interesting, particularly from a practical point of view, to have a look at the upper-right and lower-left corners of the matrix: 0 errors. This means there were no critical errors, such as regarding a lower verbal intelligence individual as a higher verbal intelligence one and vice versa.

Table 3: Confusion matrix corresponding to Rocchio's top performance using 4 classes (F = 15).

                      Prediction outcome
Actual value          a     b     c     d
a = poor              4     0     1     0
b = average           1    14     1     0
c = high              1     0    22     1
d = very high         0     1     1     9

7. Conclusions and Future Work

This work showed that verbal intelligence may be recognized by computers through language cues. The achieved classification accuracy can be deemed satisfactory for a number of classes that is high enough to enable its integration into SLDSs. To our knowledge, this is the first report of experiments attempting to automatically predict verbal intelligence.

Some of the most popular TC algorithms were applied to this task: NB, Rocchio and kNN. NB models are typically expected to perform well for TC tasks despite the conditional independence and the positional independence assumptions. However, the performance of the NB approach was significantly worse than that of the other approaches: kNN and Rocchio. This suggests that this probabilistic classifier was more sensitive to the low number of examples available, mainly resulting in inaccurate probability estimates, than the vector space ones (computing distances to some relevant members or to a prototype of each defined class seems to be more robust against sparse data).
On the other hand, and in connection with those independence assumptions, it is well known that conditional independence does not really hold for text data (even more so considering that our features are highly correlated). Furthermore, we firmly believe that, for this specific task, the position of a term in a document could by itself carry more information about the class than expected, mainly because of the above-mentioned peculiarities of our classification task (i.e. it is not only the words that participants used that denote their intelligence, but also the way they combined them). Therefore, our data violates these independence assumptions to some extent, which may explain why the NB approach performed so poorly. In this regard, it would be very interesting to test an LM-based TC approach to better validate this argument.

Using the class-based feature selection approaches has proven to be an essential factor, not only to achieve a better inference performance but also to reduce its computational cost.

Despite being typically successful when applied to TC tasks, word lemmatisation was not really helpful for our task. The word-based approaches systematically outperformed the lemma-based approaches, thus pointing out some peculiarities of the classification task. In particular, these results were found to be mainly explained by two different factors: the use of the same topic for collecting the monologues and the use of the German language.

Unlike typical TC tasks, our verbal intelligence prediction task is influenced by the fact that the different categories or classes to be identified are necessarily not well separated from a conceptualization point of view. Of course, it might be easier to distinguish people talking about different topics from their everyday life, although the results of such a comparison might not be objective. By letting the participants (i.e. people with different interests and hobbies) discuss their own topics, we would be recognizing the topics themselves rather than people with different cognitive processes.

On the other hand, the use of German, a very agglutinative language, turned out to be a drawback with regard to word lemmatisation. By lemmatising compound words (compounding is a fairly common phenomenon in German) we are basically losing the extra meaning that arises from the combination of the interrelated words. This meaning has proven to be really helpful to correctly discriminate between different levels of verbal intelligence.

In future work, it would also be interesting to examine how well the suggested approaches perform when integrated into existing SLDS. In this regard, it is important to remark that any application involving speech recognition will always introduce noise in the features that we used. This needs to be considered, as it will surely reduce the presented accuracies. Testing these approaches with conventional SLDS would allow us to assess whether the accuracies we achieve are high enough for our intended application (i.e. dialogue system adaptation). On the other hand, this also suggests the importance of finding some other features that could be more robust when used in a conventional system. Prosodic features could be a good alternative, so it would be interesting to start working on a multimodal inference framework that could jointly exploit the potential of, among others, this kind of feature.

As we have already mentioned, the linguistic cues that we have used in this work could pose a problem, for instance, if we want to apply these solutions with the same users but across different domains. In this regard, prosodic features would be advantageous, as they would also allow us to explore the possibility of finding topic-independent solutions.
Acknowledgement

This work is partly supported by the DAAD (German Academic Exchange Service). Parts of the research described in this article are supported by the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by the German Research Foundation (DFG). For this work, Fernando was granted a fellowship by the Caja Madrid foundation.

References

[1] Baeza-Yates, R. A., Ribeiro-Neto, B., 1999. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[2] Bi, Y., Bell, D., Wang, H., Guo, G., Guan, J., March 2007. Combining multiple classifiers using Dempster’s rule for text categorization. Appl. Artif. Intell. 21, 211–239. URL http://dl.acm.org/citation.cfm?id=1392641.1392644
[3] Cianciolo, A. T., Sternberg, R. J., 2004. Intelligence: a Brief History. Blackwell Publishing.
[4] Dumais, S., Platt, J., Heckerman, D., Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. In: Proceedings of the seventh international conference on Information and knowledge management. CIKM ’98. ACM, New York, NY, USA, pp. 148–155. URL http://doi.acm.org/10.1145/288627.288651
[5] Goethals, G., Sorenson, G., Burns, J., 2004. Encyclopedia of leadership. No. v. 1 in Encyclopedia of Leadership. Sage Publications. URL http://books.google.es/books?id=kjLspnsZS4UC
[6] Hall, P., Park, B. U., Samworth, R. J., 2008. Choice of neighbor order in nearest-neighbor classification. Annals of Statistics 36, 2135. URL doi:10.1214/07-AOS537
[7] Hui, G. G., Wang, H., Bell, D., Bi, Y., Greer, K., 2003. Using kNN model-based approach for automatic text. In: Proc. of ODBASE’03, the 2nd International Conference on Ontologies, Database and Applications of Semantics, LNCS. pp. 986–996.
[8] Ittner, D. J., Lewis, D. D., Ahn, D. D., 1995. Text categorization of low quality images.
In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval. pp. 301–315.
[9] Jianliang, Y., Yongcheng, W., 2004. Application of iterative-kNN based on kNN and automatic retrieval in automatic categorization. Journal of The China Society For Scientific and Technical Information 23, 137–141.
[10] Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. In: European Conference on Machine Learning (ECML). Springer, Berlin, pp. 137–142.
[11] Kupietz, M., Belica, C., Keibel, H., Witt, A., 2010. The German reference corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N., et al. (Eds.), Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010). pp. 1848–1854.
[12] Lewis, D. D., Yang, Y., Rose, T. G., Li, F., December 2004. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397. URL http://dl.acm.org/citation.cfm?id=1005332.1005345
[13] Logsdon, A., 2011. Learning disabilities. URL http://www.learningdisabilities.about.com/
[14] Manning, C. D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
[15] Mergenthaler, E., 1996. Emotion-abstraction patterns in verbatim protocols: A new way of describing psychotherapeutic processes. Journal of Consulting and Clinical Psychology 64 (6).
[16] Miao, Y.-Q., Kamel, M., January 2011. Pairwise optimized Rocchio algorithm for text categorization. Pattern Recogn. Lett. 32, 375–382. URL http://dx.doi.org/10.1016/j.patrec.2010.09.018
[17] Moschitti, A., 2003. A study on optimal parameter tuning for Rocchio text classifier. In: Proceedings of the 25th European conference on IR research. ECIR’03. Springer-Verlag, Berlin, Heidelberg, pp. 420–435. URL http://dl.acm.org/citation.cfm?id=1757788.1757828
[18] Olsen, S., 2000.
Ein internationales Handbuch zur Flexion und Wortbildung. In: Booij, G., Lehmann, C., Mugdan, J. (Eds.), Morphologie. Berlin / New York: de Gruyter, pp. 897–916.
[19] Özgür, A., Özgür, L., Güngör, T., 2005. Text categorization with class-based and corpus-based keyword selection. In: ISCIS. pp. 606–615.
[20] Rocchio, J., 1971. Relevance Feedback in Information Retrieval. Prentice-Hall Inc., Ch. 14, pp. 313–323.
[21] Salton, G., 1989. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[22] Sebastiani, F., March 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47. URL http://doi.acm.org/10.1145/505282.505283
[23] Solka, J., July 2008. Text data mining: Theory and methods. Statistics Surveys 2, 94–112. URL http://dx.doi.org/10.1214/07-SS016
[24] Tan, S., May 2005. Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 28, 667–671. URL http://dx.doi.org/10.1016/j.eswa.2004.12.023
[25] Toussaint, G. T., 2005. Geometric proximity graphs for improving nearest neighbor methods in instance-based learning and data mining. Int. J. Comput. Geometry Appl. 15 (2), 101–150.
[26] Vinciarelli, A., October 2005. Application of information retrieval techniques to single writer documents. Pattern Recogn. Lett. 26, 2262–2271. URL http://dx.doi.org/10.1016/j.patrec.2005.03.036
[27] Wang, Y., Wang, Z.-O., August 2007. A fast kNN algorithm for text categorization. In: Machine Learning and Cybernetics, 2007 International Conference on. Vol. 6. pp. 3436–3441.
[28] Wechsler, D., 1939. The Measurement of Adult Intelligence. Baltimore (MD): Williams & Wilkins.
[29] Wechsler, D., 1982.
Handanweisung zum Hamburg-Wechsler-Intelligenztest fuer Erwachsene (HAWIE). Separatdr., Bern; Stuttgart; Wien, Huber.
[30] Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR ’99. ACM, New York, NY, USA, pp. 42–49. URL http://doi.acm.org/10.1145/312624.312647
[31] Yang, Y., Pedersen, J. O., 1997. A comparative study on feature selection in text categorization. In: Fisher, D. H. (Ed.), Proceedings of ICML-97, 14th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, US, Nashville, US, pp. 412–420. URL citeseer.nj.nec.com/yang97comparative.html
[32] Zablotskaya, K., Walter, S., Minker, W., May 2010. Speech data corpus for verbal intelligence estimation. In: Proceedings of LREC’10. pp. 1077–1080.