key: cord-0478836-buu4afgh
authors: Liu, Zhenghao; Zhang, Han; Xiong, Chenyan; Liu, Zhiyuan; Gu, Yu; Li, Xiaohua
title: Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder
date: 2022-05-06
journal: nan
DOI: nan
sha: 7dab6939200309e97ac5c0e3e5b51f6e061226eb
doc_id: 478836
cord_uid: buu4afgh

Dense retrievers encode texts and map them into an embedding space using pre-trained language models. Keeping these embeddings high-dimensional is critical for effectively training dense retrievers, but it leads to high costs for index storage and retrieval. To reduce the embedding dimensions of dense retrieval, this paper proposes a Conditional Autoencoder (ConAE) that compresses the high-dimensional embeddings while maintaining the same embedding distribution and better recovering the ranking features. Our experiments show the effectiveness of ConAE in compressing embeddings: the compressed embeddings achieve ranking performance comparable to the raw ones and make the retrieval system more efficient. Our further analyses show that ConAE can mitigate the redundancy of dense retrieval embeddings with only one linear layer. All codes of this work are available at https://github.com/NEUIR/ConAE.

As the first stage of numerous multi-stage IR and NLP tasks (Nogueira et al., 2019; Chen et al., 2017; Thorne et al., 2018), dense retrievers (Xiong et al., 2021a) have shown clear advances in conducting semantic search and avoiding the vocabulary mismatch problem (Robertson and Zaragoza, 2009). Dense retrievers usually encode queries and documents as high-dimensional embeddings, which are necessary to guarantee retrieval effectiveness during training (Reimers and Gurevych, 2021) but exhaust the memory needed to store the index and lead to longer retrieval latency (Indyk and Motwani, 1998; Meiser, 1993). Research on building efficient dense retrieval systems has therefore been stimulated recently. To reduce the dimensions of document embeddings, existing work retains the principal dimensions or compresses query and document embeddings to build more efficient retrievers (Yang and Seo, 2021).

There are two challenges in compressing dense retriever embeddings: (1) the compressed embeddings should share a similar distribution with the original embeddings, making the low-dimensional embedding space uniform and the document embeddings distinguishable; (2) the compressed embeddings should maintain the maximal information for matching related queries and documents during retrieval, which helps better align related query-document pairs.

This paper proposes the Conditional Autoencoder (ConAE), which aims to build efficient dense retrieval systems by reducing the embedding dimensions of queries and documents. ConAE first encodes high-dimensional embeddings into a low-dimensional embedding space and then generates embeddings that can be aligned to related queries or documents in the original embedding space. In addition, ConAE designs a conditional loss to regularize the low-dimensional embedding space so that it keeps a distribution similar to that of the original embeddings. Our experiments show that ConAE effectively compresses the high-dimensional embeddings and avoids redundant ranking features, achieving retrieval performance comparable to vanilla dense retrievers and yielding better visualizations of the embedding space with t-SNE.

This section describes our Conditional Autoencoder (ConAE). Given a query q and a document collection D = {d_1, ..., d_j, ..., d_n}, dense retrievers (Xiong et al., 2021b,a; Karpukhin et al., 2020) employ pre-trained language models (Liu et al., 2019) to encode q and d as K-dimensional embeddings h_q and h_d. We can then calculate the retrieval score f(q, d) of q and d with the dot product f(h_q, h_d) = h_q \cdot h_d and contrastively train the encoders by maximizing the probability P(d^+ | q, {d^+} ∪ D^-) of the relevant document d^+ (Karpukhin et al., 2020; Xiong et al., 2021b,a):

P(d^+ \mid q, \{d^+\} \cup D^-) = \frac{\exp(f(h_q, h_{d^+}))}{\exp(f(h_q, h_{d^+})) + \sum_{d^- \in D^-} \exp(f(h_q, h_{d^-}))},   (1)

where d^- is a document from the irrelevant document set D^-.
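For illustration, this contrastive objective reduces to a cross-entropy loss over dot-product scores in which the relevant document is the target class. The following PyTorch sketch shows one way to implement it; the function name, tensor shapes and random inputs are illustrative assumptions and do not correspond to the released code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_q, h_pos, h_negs):
    # h_q: [B, K] query embeddings; h_pos: [B, K] embeddings of the relevant documents d+;
    # h_negs: [B, N, K] embeddings of N irrelevant documents per query.
    pos_scores = (h_q * h_pos).sum(dim=-1, keepdim=True)   # f(h_q, h_d+) -> [B, 1]
    neg_scores = torch.einsum("bk,bnk->bn", h_q, h_negs)   # f(h_q, h_d-) -> [B, N]
    scores = torch.cat([pos_scores, neg_scores], dim=-1)   # [B, 1 + N]
    labels = torch.zeros(scores.size(0), dtype=torch.long) # d+ sits at index 0
    return F.cross_entropy(scores, labels)                 # -log P(d+ | q, {d+} U D-)

# Toy usage with random 768-dimensional embeddings and 7 negatives per query.
loss = contrastive_loss(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 7, 768))
print(loss.item())
```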
In this subsection, we introduce ConAE, which compresses the high-dimensional query and document embeddings h_q and h_d into low-dimensional embeddings h^e_q and h^e_d.

Encoder. We first obtain the initial dense representations of the query q and the document d from a dense retriever, such as ANCE (Xiong et al., 2021a). These high-dimensional embeddings are then compressed into low-dimensional ones:

h^e_q = \text{Linear}_q(h_q), \quad h^e_d = \text{Linear}_d(h_d),   (2)

where h^e_q and h^e_d are L-dimensional embeddings of q and d, encoded by two different linear layers, Linear_q and Linear_d. The dimension L can be much lower than the dimension of the initial representations, e.g., 256, 128 or 64. We then use a KL divergence loss to regularize the encoded embeddings so that they mimic the initial embedding distributions of queries and their top-ranked documents D^{top}:

L_{KL} = \mathrm{KL}\big(P(d \mid q, D^{top}) \,\|\, P_e(d \mid q, D^{top})\big),   (3)

where P_e(d | q, D^{top}) is calculated with Eq. 1 using the encoded embeddings h^e_q and h^e_d.

Decoder. The decoder module maps the encoded embeddings h^e_q and h^e_d back into the original embedding space and aligns them with h_q and h_d, which optimizes the encoder to maximally retain the ranking features of h_q and h_d. We first use one linear layer to project h^e_q and h^e_d to K-dimensional embeddings \hat{h}_q and \hat{h}_d:

\hat{h}_q = \text{Linear}(h^e_q), \quad \hat{h}_d = \text{Linear}(h^e_d).   (4)

We then train the decoded embeddings \hat{h}_q and \hat{h}_d against the original frozen document and query embeddings of ANCE, respectively, to recover the ranking features of vanilla ANCE and better align them with h_q and h_d in the original embedding space. The first loss L_q optimizes the generated query representation \hat{h}_q:

L_q = -\log \frac{\exp(\hat{h}_q \cdot h_{d^+})}{\exp(\hat{h}_q \cdot h_{d^+}) + \sum_{d^- \in D^-} \exp(\hat{h}_q \cdot h_{d^-})},   (5)

and the second loss L_d optimizes the generated document representation \hat{h}_d:

L_d = -\log \frac{\exp(h_q \cdot \hat{h}_{d^+})}{\exp(h_q \cdot \hat{h}_{d^+}) + \sum_{d^- \in D^-} \exp(h_q \cdot \hat{h}_{d^-})}.   (6)

Training Loss. Finally, we train our conditional autoencoder with the following loss L:

L = L_{KL} + \lambda (L_q + L_d),   (7)

where λ is a hyper-parameter that weights the autoencoder losses.
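To make the architecture concrete, the PyTorch sketch below puts the encoder, decoder and training loss together under a few simplifying assumptions: a shared decoder layer, one query with its candidate documents per loss computation, and the relevant document placed at index 0 of the candidate list. The module and function names are illustrative and may differ from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConAE(nn.Module):
    """Sketch of the conditional autoencoder: K-dim -> L-dim -> K-dim."""

    def __init__(self, k_dim=768, l_dim=128):
        super().__init__()
        self.enc_q = nn.Linear(k_dim, l_dim, bias=False)  # Linear_q in Eq. 2
        self.enc_d = nn.Linear(k_dim, l_dim, bias=False)  # Linear_d in Eq. 2
        self.dec = nn.Linear(l_dim, k_dim, bias=False)    # shared decoder in Eq. 4

    def forward(self, h_q, h_d):
        e_q, e_d = self.enc_q(h_q), self.enc_d(h_d)       # compressed embeddings
        return e_q, e_d, self.dec(e_q), self.dec(e_d)     # plus decoded embeddings

def ranking_log_probs(h_q, h_docs):
    # Log-softmax over dot-product scores of one query against its candidates (Eq. 1).
    return F.log_softmax(torch.einsum("k,nk->n", h_q, h_docs), dim=-1)

def conae_loss(model, h_q, h_docs, lam=0.1):
    # h_q: [K] frozen ANCE query embedding; h_docs: [N, K] frozen candidate embeddings.
    # This sketch assumes the relevant document d+ sits at index 0 of h_docs.
    e_q, e_d, rec_q, rec_d = model(h_q.unsqueeze(0), h_docs)
    e_q, rec_q = e_q.squeeze(0), rec_q.squeeze(0)

    # KL term (Eq. 3): compressed embeddings mimic the teacher ranking distribution.
    teacher = ranking_log_probs(h_q, h_docs).exp()        # P(d | q, D_top)
    student = ranking_log_probs(e_q, e_d)                 # log P_e(d | q, D_top)
    l_kl = F.kl_div(student, teacher, reduction="sum")

    # Autoencoder terms (Eq. 5 and 6): decoded embeddings rank against frozen ones.
    target = torch.zeros(1, dtype=torch.long)
    l_q = F.cross_entropy(torch.einsum("k,nk->n", rec_q, h_docs).unsqueeze(0), target)
    l_d = F.cross_entropy(torch.einsum("k,nk->n", h_q, rec_d).unsqueeze(0), target)

    return l_kl + lam * (l_q + l_d)                       # Eq. 7

# Toy usage with frozen (here random) 768-dimensional ANCE embeddings.
model = ConAE(k_dim=768, l_dim=128)
loss = conae_loss(model, torch.randn(768), torch.randn(8, 768))
loss.backward()
```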
This section describes the datasets, evaluation metrics, baselines and implementation details of our experiments.

Dataset. Four datasets are used to evaluate retrieval performance: MS MARCO (Passage Ranking) (Nguyen et al., 2016), NQ (Kwiatkowski et al., 2019), TREC DL (Craswell et al., 2020) and TREC-COVID (Roberts et al., 2020). In our experiments, we train on MS MARCO and evaluate on MS MARCO (Dev), TREC DL and TREC-COVID. We randomly sample 50,000 queries from the training set of MS MARCO as the development set.

Evaluation Metrics. NDCG@10 is used as the evaluation metric on MS MARCO, TREC DL and TREC-COVID. MS MARCO also uses MRR@10 as its primary evaluation metric (Nguyen et al., 2016). For the NQ dataset, the hit accuracy at top 20 and top 10 is used to evaluate retrieval performance (Karpukhin et al., 2020).

Baselines. We compare ConAE with two baselines from previous work: Principal Component Analysis (PCA) and CE. PCA reduces the embedding dimension by retaining the principal dimensions that preserve most of the variance of the original representation. The CE model uses two linear layers, W_q and W_d, without biases to transform the dense representations of queries and documents into lower-dimensional embeddings. We also start from CE models and continuously train the whole model to obtain ANCE models that produce embeddings with the same reduced dimensions.

Implementation Details. All embedding dimension reduction models are based on ANCE (Xiong et al., 2021a), one of the best-performing dense retrievers, and build the document index with exact match (flat index) implemented by FAISS (Johnson et al., 2019). For each query, we sample 7 negative documents to contrastively train CE and ANCE, and 1 negative document to train ConAE. During training, we set the batch size to 2 and the gradient accumulation steps to 8 for ANCE, while the batch size and accumulation steps are 128 and 1 for the other models. All models are implemented with PyTorch and optimized with Adam. The learning rates of ANCE and the other models are set to 2e-6 and 0.001, respectively. λ of ConAE is set to 0.1.
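As a reference for this retrieval setup, the snippet below sketches how compressed embeddings can be indexed and searched with a FAISS flat (exact inner-product) index; the collection size and the random embedding matrices are placeholders, not our actual data.

```python
import numpy as np
import faiss

dim = 128                                               # reduced embedding dimension
doc_emb = np.random.rand(10000, dim).astype("float32")  # compressed document embeddings
query_emb = np.random.rand(16, dim).astype("float32")   # compressed query embeddings

index = faiss.IndexFlatIP(dim)                          # exact (flat) inner-product search
index.add(doc_emb)

scores, doc_ids = index.search(query_emb, 10)           # top-10 documents per query
print(doc_ids.shape)                                    # (16, 10)
```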
Three experiments are conducted in this section to study the effectiveness of ConAE in reducing embedding dimensions for dense retrieval.

The performance of different dimension reduction models is shown in Table 1. PCA, CE and ConAE are based on ANCE (teacher): they freeze the teacher model and only optimize the dimension projection layers, while ANCE starts from CE and continuously tunes all parameters of the model. Compared with PCA and CE, ConAE achieves the best performance on almost all datasets, which shows its effectiveness in compressing dense retrieval embeddings. With 128-dimensional embeddings for the document index on MS MARCO, ConAE achieves performance comparable to ANCE (teacher) while significantly reducing the retrieval latency (from 17.152 ms to 3.942 ms per query) and the index storage (from 26 GB to 4.3 GB). This demonstrates that ConAE is effective at eliminating the redundancy of the embeddings learned by dense retrievers. Among the baselines, PCA shows significantly worse ranking performance on MS MARCO, indicating that the embedding dimensions of dense retrievers are usually non-orthogonal. ConAE achieves improvements of more than 11% over CE and performs much better on TREC-COVID, demonstrating the ranking effectiveness and generalization ability of its compressed embeddings. ANCE can further improve the retrieval performance of CE by adapting the teacher model to the low-dimensional setting (Zhou et al., 2021).

This subsection presents ablation studies in Table 2 to investigate the roles of the modules in ConAE. The effectiveness of ConAE derives mainly from the KL module, which directly optimizes the encoder module of ConAE (Eq. 3). The KL module learns the distributions of the low-dimensional query and document embeddings from ANCE (teacher) by mimicking its document ranking probability. When only the embedding projection layers are optimized, the KL module achieves much better retrieval performance than CE (Sec. 4.1), which benefits from optimizing the embedding encoders with the more fine-grained ranking signals provided by ANCE (teacher). In addition, the autoencoder module further improves the ranking performance of the compressed embeddings, verifying our claim that the compressed embeddings should maintain the maximal information and recover the ranking features of the high-dimensional embeddings.

Finally, we randomly sample one case from MS MARCO and visualize the embedding space of the query and documents in Figure 1. We first employ t-SNE (van der Maaten and Hinton, 2008) to visualize the embedding spaces of ANCE (teacher) and ConAE. Compared with ANCE (teacher), the reduced embeddings usually form a more uniform space that better retains the ranking features of the teacher model. As shown in Figure 1(b), ConAE-128 produces a more meaningful visualization: the related query-document pair is closer, and the other documents are distributed around the golden document according to their relevance to the query. The visualization of ANCE (teacher) is slightly distorted and differs from expectation, mainly because of the redundancy of its embeddings. The redundant features tend to mislead t-SNE into overfitting them; reducing the embedding dimension of the dense retriever to 128 to eliminate redundant features therefore provides a possible way to visualize the embedding space of dense retrievers with t-SNE. Besides, ConAE-64 shows lower retrieval performance than ConAE-128 (Sec. 4.1), which indicates that ConAE-64 loses some ranking features due to its limited embedding dimension. Another way to visualize the embedding space is to use ConAE (w/o Decoder) to project the embeddings onto two-dimensional coordinates, which maps embeddings with a uniform function and focuses on maintaining the primary features needed to mimic the relevance score distribution of documents. As shown in Figure 1(d), the distributions of documents are distinguishable and provide an intuitive way to analyze the ranking-oriented document distribution. In addition, the query is usually far away from the documents. The main reason is that the relevance scores are calculated with the dot product, so the embedding norms are also meaningful for distinguishing the relevant documents.
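A t-SNE visualization of this kind can be produced along the lines of the following sketch, assuming the query and candidate document embeddings are already available as a NumPy array; the array layout (query first, relevant document second), the perplexity setting and the color choices are assumptions of this example rather than the exact settings behind Figure 1.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Row 0 is the query, row 1 the relevant document, the remaining rows are other candidates.
embeddings = np.random.rand(51, 128).astype("float32")  # placeholder for ConAE-128 vectors

points = TSNE(n_components=2, perplexity=15, init="pca",
              random_state=0).fit_transform(embeddings)

plt.scatter(points[2:, 0], points[2:, 1], c="lightgray", label="other documents")
plt.scatter(points[1, 0], points[1, 1], c="tab:green", label="relevant document")
plt.scatter(points[0, 0], points[0, 1], c="tab:red", marker="*", s=120, label="query")
plt.legend()
plt.savefig("embedding_space.png", dpi=200)
```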
Dense retrievers use a BERT-Siamese architecture to encode queries and documents into an embedding space for retrieval (Karpukhin et al., 2020; Xiong et al., 2021b,a; Lewis et al., 2020; Zhan et al., 2021; Yu et al., 2021). To learn an effective embedding space, dense retrievers are forced to maintain high-dimensional embeddings during training. The most direct way to reduce the embedding dimension is to retain only part of the dimensions of the high-dimensional embeddings (Yang and Seo, 2021). Some work uses the first 128 dimensions to encode both questions and documents (Yang and Seo, 2021), while PCA can retain the principal dimensions to recover most of the information in the raw embeddings. Supervised models instead use neural networks to compress the high-dimensional embeddings into lower-dimensional ones; they provide a better dimension reduction approach than unsupervised models by avoiding the loss of too much information. To optimize the encoders, some work continuously trains dense retrievers with the contrastive training objective (Karpukhin et al., 2020; Xiong et al., 2021a).

This paper presents ConAE to reduce the embedding dimensions of dense retrievers. Our experiments show the effectiveness of ConAE: it achieves performance comparable to the teacher model while significantly reducing the index storage, accelerating the search process, and visualizing the embedding space more intuitively and effectively.

In our experiments, we use four datasets to evaluate retrieval performance: MS MARCO (Passage Ranking) (Nguyen et al., 2016), NQ (Kwiatkowski et al., 2019), TREC DL (Craswell et al., 2020) and TREC-COVID (Roberts et al., 2020). These datasets come from different retrieval scenarios, and all data statistics are shown in Table 3. Besides exact search, we also report retrieval results obtained with an approximate nearest neighbor (ANN) method, Hierarchical Navigable Small World (HNSW), in Table 4. With HNSW, retrieval efficiency can be further improved, especially for high-dimensional embeddings, and ConAE again keeps its strong retrieval performance with less than 1 ms of retrieval latency.

Reading Wikipedia to answer open-domain questions.
Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 deep learning track.
BERT: Pre-training of deep bidirectional transformers for language understanding.
Approximate nearest neighbors: Towards removing the curse of dimensionality.
Billion-scale similarity search with GPUs.
Dense passage retrieval for open-domain question answering.
Natural Questions: A benchmark for question answering research.
Pre-training via paraphrasing.
More robust dense retrieval with contrastive dual learning.
RoBERTa: A robustly optimized BERT pretraining approach.
Simple and effective unsupervised redundancy elimination to compress dense vectors for passage retrieval.
Point location in arrangements of hyperplanes.
NeurIPS 2020 EfficientQA competition: Systems, analyses and lessons learned.
MS MARCO: A human generated machine reading comprehension dataset.
Document expansion by query prediction.
The curse of dense low-dimensional information retrieval for large index sizes.
TREC-COVID: Rationale and structure of an information retrieval shared task for COVID-19.
The probabilistic relevance framework: BM25 and beyond.
The fact extraction and VERification (FEVER) shared task.
Visualizing data using t-SNE.
Approximate nearest neighbor negative contrastive learning for dense text retrieval.
Douwe Kiela, and Barlas Oguz. 2021b. Answering complex open-domain questions with multi-hop dense retrieval.
Designing a minimal retrieve-and-read system for open-domain question answering.
Few-shot conversational dense retrieval.
Optimizing dense retrieval model training with hard negatives.
Meta learning for knowledge distillation.

This work is mainly supported by Beijing Academy of Artificial Intelligence (BAAI) as well as sup-