On the Replicability of Combining Word Embeddings and Retrieval Models
Luca Papariello, Alexandros Bampoulidis, Mihai Lupu
Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_7

Abstract. We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings into document representations, and about the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that using a mixture model of von Mises-Fisher (VMF) distributions instead of Gaussian distributions would be beneficial, because both VMF and the vector space model traditionally used in information retrieval focus on cosine distances. Previous experiments had validated this hypothesis. Our replication was not able to validate it, despite scanning a large parameter space.

The last five years have shown that neural network-based word embedding models provide term representations that are a useful information source for a variety of tasks in natural language processing. In information retrieval (IR), "traditional" models remain a high baseline to beat, particularly when considering efficiency in addition to effectiveness [6]. Combining word embedding models with the traditional IR models is therefore very attractive, and several papers have attempted to improve on the baseline by adding word-embedding information in a more or less ad hoc fashion. Onal et al. [10] summarized the developments of the last half-decade in neural IR and grouped the methods into two categories: aggregate and learn. The first, also known as compositional distributional semantics, starts from term representations and uses some function to combine them into a document representation (a simple example is a weighted sum). The second uses the word embedding as the first layer of another neural network that outputs a document representation. The advantage of the first type of method is that it often distills down to a linear combination (perhaps via a kernel), from which an explanation of the document representation is easier to induce than from the neural network layers built on top of a word embedding. Recently, the issue of explainability in IR and recommendation has been generating renewed interest [15].
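As a minimal illustration of the aggregate family (not a method from any of the papers discussed here), the sketch below builds a document vector as an IDF-weighted average of word embeddings; the function name and its inputs are hypothetical.

    import numpy as np

    def aggregate_document(tokens, embeddings, idf, dim):
        # embeddings: dict mapping term -> d-dimensional numpy array
        # idf: dict mapping term -> inverse document frequency weight
        # Weighted average of the embeddings of the document's known terms.
        vectors = [idf[t] * embeddings[t] for t in tokens if t in embeddings and t in idf]
        if not vectors:
            return np.zeros(dim)
        return np.sum(vectors, axis=0) / len(vectors)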
In this sense, Zhang et al. [14] introduced a new model for combining high-dimensional vectors, using a mixture of von Mises-Fisher (VMF) distributions instead of the Gaussian distributions previously suggested by Clinchant and Perronnin [3]. This is an attractive hypothesis because the Gaussian mixture model (GMM) works on Euclidean distance, while the mixture of von Mises-Fisher (moVMF) model works on cosine distance, the typical distance function in IR.

In the following sections, we set out to replicate the experiments described by Zhang et al. [14]. They are grouped into three sets: classification, clustering, and information retrieval, and compare "standard" embedding methods with the novel moVMF representation. In general, we follow the experimental setup of the original paper and, for lack of space, do not repeat details that are clearly explained there. All experiments are conducted on publicly available datasets, briefly described below.

Classification. Two subsets of the movie review dataset: (i) the subjectivity dataset (subj) [11]; and (ii) the sentence polarity dataset (sent) [12].

Clustering. The 20 Newsgroups dataset was used in the original paper, but the concrete version was not specified. We selected the "bydate" version because it is, according to its creators, the most commonly used in the literature. It is also the version directly loadable in scikit-learn, making it more likely that the authors used this version.

Retrieval. The TREC Robust04 collection [13].

The methods used to generate vectors for terms and documents are:

TF-IDF. The basic term frequency-inverse document frequency weighting [5], implemented in the scikit-learn library.

LSI. Latent Semantic Indexing [4].

LDA. Latent Dirichlet Allocation [2].

cBoW. Word2vec [9] in the Continuous Bag-of-Words (cBoW) architecture.

PV-DBOW/DM. Paragraph Vector (PV) is a document embedding algorithm that builds on Word2vec. We use both of its implementations: Distributed Bag-of-Words (PV-DBOW) and Distributed Memory (PV-DM) [7].

The LSI, LDA, cBoW, and PV implementations are available in the gensim library.

The FK framework offers the option to aggregate word embeddings into fixed-length representations of documents. We use Fisher vectors (FV) based on (i) a Gaussian mixture model (FV-GMM) and (ii) a mixture of von Mises-Fisher distributions (FV-moVMF) [1]. We first fit (i) a GMM and (ii) a moVMF model on previously learnt continuous word embeddings. The fixed-length representation of a document X containing T words w_t, expressed as X = {x_1, ..., x_T} with x_t = E_{w_t}, is the concatenation G^X = (G_1^X, ..., G_K^X), where K is the number of mixture components. The vectors G_i^X, having the dimension (d) of the word vectors E_{w_t}, are explicitly given by [3, 14]:

(i)  G_i^X = (1 / (T √ω_i)) Σ_{t=1..T} γ_t(i) (x_t − μ_i) / σ_i

(ii) G_i^X = (1 / (T √ω_i)) Σ_{t=1..T} γ_t(i) κ_i x_t

where ω_i are the mixture weights, γ_t(i) = p(i|x_t) is the soft assignment of x_t to (i) Gaussian and (ii) VMF distribution i, and σ_i^2 = diag(Σ_i), with Σ_i the covariance matrix of Gaussian i. In (i), μ_i refers to the mean vector; in (ii), it indicates the mean direction and κ_i is the concentration parameter.

We implemented the FK-based algorithms ourselves, with the help of the scikit-learn library for fitting a mixture of Gaussians and of the Spherecluster package for fitting a mixture of von Mises-Fisher distributions to our data. The implementation details of each algorithm are described in what follows.

Each of the following experiments is conceptually divided into three phases: first, text processing (e.g. tokenisation); second, creating a fixed-length vector representation for every document; third, the task itself, i.e. classification, clustering, or retrieval.

For the first phase, the same pre-processing is applied to all datasets. In the original paper, this phase was only briefly described as tokenisation and stopword removal; it is not stated which tokeniser, linguistic filters (stemming, lemmatisation, etc.), or stopword list were used. Knowing that the gensim library was used, we took all standard parameters (see the provided code). Gensim, however, does not come with a pre-defined stopword list, and therefore, based on our own experience, we used the English stopword list provided by the NLTK library.

For the second phase, transforming terms and documents to vectors, Zhang et al. [14] specify that all trained models are 50-dimensional. We have additionally experimented with dimensionality 20 (used by Clinchant and Perronnin [3] for clustering) and 100, as we hypothesized that 50 might be too low.
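To make the FV-GMM aggregation concrete, the sketch below is our reading of equation (i): it fits a diagonal-covariance GMM on the word embeddings with scikit-learn and concatenates the per-component gradients for one document. The function names and array shapes are our own; the 15 components and 10 initialisations follow the setup described in the next paragraph.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_gmm(word_vectors, n_components=15, seed=0):
        # word_vectors: (N, d) array of embeddings of all vocabulary terms
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              n_init=10, random_state=seed)
        return gmm.fit(word_vectors)

    def fisher_vector_gmm(doc_vectors, gmm):
        # doc_vectors: (T, d) array of embeddings of the T words of one document
        T = doc_vectors.shape[0]
        gamma = gmm.predict_proba(doc_vectors)      # (T, K) soft assignments γ_t(i)
        mu, omega = gmm.means_, gmm.weights_        # (K, d) means, (K,) mixture weights
        sigma = np.sqrt(gmm.covariances_)           # (K, d) diagonal standard deviations
        parts = []
        for i in range(gmm.n_components):
            g_i = (gamma[:, [i]] * (doc_vectors - mu[i]) / sigma[i]).sum(axis=0)
            parts.append(g_i / (T * np.sqrt(omega[i])))
        return np.concatenate(parts)                # K * d dimensional Fisher vector

An FV-moVMF variant would follow the same pattern, replacing the Gaussian parameters with the mean directions and concentrations of a fitted moVMF model (e.g. Spherecluster's VonMisesFisherMixture).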
The TF-IDF model is 5000-dimensional (i.e. only the top 5000 terms by tf-idf value are used), while the Fisher kernel models are 15 × d dimensional, where d ∈ {20, 50, 100}, as just explained. In what follows, d refers to the dimensionality of the LSI, LDA, cBoW, and PV models. The cBoW and PV models are trained using a default window size of 5, keeping both low- and high-frequency terms, again following the setup of the original experiment. The LDA model is trained using a chunk size of 1000 documents and a number of iterations over the corpus ranging from 20 to 100. For the FK methods, both fitting procedures (GMM and moVMF) are independently initialised 10 times and the best-fitting model is kept. For the third phase, the parameters are explained in the following sections.

Logistic regression is used for classification in Zhang et al. [14] and is therefore also used here. The results of our experiments, for d = 50 and d = 100 feature vectors, are summarised in Table 1. For all methods, we perform a parameter scan of the (inverse) regularisation strength of the logistic regression classifier, as shown in Fig. 1(a) and (b). Additionally, the learning algorithms are trained for different numbers of epochs and the resulting classification accuracy is assessed, cf. Fig. 1(c) and (d). Figure 1(a) indicates that cBoW, FV-GMM, FV-moVMF, and the simple TF-IDF, when properly tuned, exhibit very similar accuracy on subj: the confidence intervals do not allow us to identify a single best model. Surprisingly, TF-IDF outperforms all the others on the sent dataset (Fig. 1(b)). Increasing the dimensionality of the feature vectors from d = 50 to 100 reduces the gap between TF-IDF and the rest of the models on the sent dataset (see Table 1).

[Fig. 1 caption: LSI and LDA achieve low accuracy (see Table 1) and are omitted for visibility. The left panels, (a) and (b), show the effect of the (inverse) regularisation strength of the logistic regression classifier on accuracy; the right panels, (c) and (d), show the effect of the number of training epochs. The two symbols on the right axis in panels (a) and (b) indicate the best (FV-moVMF) results reported in [14].]

For the clustering experiments, the obtained feature vectors are passed to the k-means algorithm. The results, measured in terms of Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), are summarised in Table 2. We used both d = 20 and d = 50 feature vectors. Note that the evaluation of the clustering algorithms is based on the ground-truth class assignments available in the 20 Newsgroups dataset. As opposed to classification, the clustering experiments show a pronounced imbalance in performance and firmly speak in favour of PV-DBOW. Interestingly, TF-IDF, FV-GMM, and FV-moVMF, which all provide high-dimensional document representations, have low clustering effectiveness.
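As an illustration of this clustering setup (not the authors' code), the snippet below clusters the document vectors with scikit-learn's k-means and computes ARI and NMI against the 20 Newsgroups labels; the choice of 20 clusters, one per newsgroup, and the variable names are our assumptions.

    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    newsgroups = fetch_20newsgroups(subset="all")   # the "bydate" 20 Newsgroups data
    labels = newsgroups.target
    # doc_vectors: (N, d) array of document representations (e.g. PV-DBOW or FV-moVMF),
    # built from newsgroups.data with one of the models described above.
    kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(doc_vectors)
    print("ARI =", adjusted_rand_score(labels, kmeans.labels_))
    print("NMI =", normalized_mutual_info_score(labels, kmeans.labels_))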
For the retrieval experiments, we extracted all the raw text from every document of the test collection and preprocessed it as described at the beginning of this section. The documents were indexed and retrieved with BM25 using the Lucene 8.2 search engine. We experimented with three ways of processing the topics: (1) title only, (2) description only, and (3) title and description. The third way produces the best results, and the closest to those reported by Zhang et al. [14]; hence it is the only one reported here.

An important aspect of BM25 is that varying its parameters k1 and b can bring significant improvements in performance, as reported by Lipani et al. [8]. We therefore performed a parameter scan for k1 ∈ [0, 3] and b ∈ [0, 1], with a step size of 0.05 for both parameters. For every TREC topic, the scores of the top 1000 documents retrieved by BM25 were normalised to [0, 1] with the min-max normalisation method and used in calculating the scores of the documents for the combined models [14]. The original results and those of our replication experiments, with standard (k1 = 1.2 and b = 0.75) and best BM25 parameter values, measured in terms of Mean Average Precision (MAP) and Precision at 20 (P@20), are outlined in Table 3.

We replicated previously reported experiments that presented evidence that a new mixture model, based on von Mises-Fisher distributions, outperformed a series of other models in three tasks: classification, clustering, and retrieval (the latter in combination with standard retrieval models). Since the source code was not released with the original paper, important implementation and formulation details were omitted, and the authors never replied to our request for information, a significant effort had to be devoted to reverse engineering the experiments. In general, for none of the tasks were we able to confirm the conclusions of the previous experiments. For classification, we do not have enough evidence to conclude that FV-moVMF outperforms the other methods. The situation is rather different when considering the effectiveness of these document representations for clustering: there, FV-moVMF significantly underperforms, contradicting previous conclusions. In the case of retrieval, although Zhang et al.'s proposed method (FV-moVMF) indeed boosts BM25, it does not outperform most of the other models it was compared to.

Acknowledgements. The authors are partially supported by the H2020 Safe-DEED project (GA 825225).

References
[1] Clustering on the unit hypersphere using von Mises-Fisher distributions
[2] Latent Dirichlet allocation
[3] Aggregating continuous word embeddings for information retrieval
[4] Indexing by latent semantic analysis
[5] Distributional structure. Word
[6] Let's measure run time! Extending the IR replicability infrastructure to include performance aspects
[7] Distributed representations of sentences and documents
[8] Verboseness fission for BM25 document length normalization
[9] Efficient estimation of word representations in vector space
[10] Neural information retrieval: at the end of the early years
[11] A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts
[12] Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales
[13] The TREC robust retrieval track. SIGIR Forum
[14] Aggregating neural word embeddings for document representation
[15] EARS 2019: the 2nd international workshop on explainable recommendation and search