id: cord-020815-j9eboa94
author: Kamphuis, Chris
title: Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants
date: 2020-03-24
pages:
extension: .txt
mime: text/plain
words: 2249
sentences: 154
flesch: 60
summary: Experiments on three newswire collections show that there are no significant effectiveness differences between them, including Lucene's often maligned approximation of document length. Although learning-to-rank approaches and neural ranking models are widely used today, they are typically deployed as part of a multi-stage reranking architecture, over candidate documents supplied by a simple term-matching method using traditional inverted indexes [1]. Our goal is a large-scale reproducibility study to explore the nuances of different variants of BM25 and their impact on retrieval effectiveness. Their findings are confirmed: effectiveness differences in IR experiments are unlikely to be the result of the choice of BM25 variant a system implemented. We implemented a variant that uses exact document lengths, but is otherwise identical to the Lucene default. Storing exact document lengths would allow for different ranking functions to be swapped at query time more easily, as no information would be discarded at index time.
cache: ./cache/cord-020815-j9eboa94.txt
txt: ./txt/cord-020815-j9eboa94.txt