key: cord-0039598-lin7ya2l
authors: Sanchez, Luis; He, Jiyin; Manotumruksa, Jarana; Albakour, Dyaa; Martinez, Miguel; Lipani, Aldo
title: Easing Legal News Monitoring with Learning to Rank and BERT
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_42
sha: d2928c8d4a116265bd208cdafcd2d76cacde4fad
doc_id: 39598
cord_uid: lin7ya2l

While ranking approaches have made rapid advances in Web search, retrieval systems that cater to the complex information needs of professional search tasks are not widely developed; solutions typically rely on dedicated search strategies backed by ad-hoc retrieval models. In this paper we present a legal search problem in which professionals monitor news articles with constant queries on a periodic basis. Firstly, we demonstrate the effectiveness of traditional retrieval models compared to Boolean search over documents in chronological order. In an attempt to capture the complex information needs of users, a learning to rank approach is then adopted with user-specified relevance criteria as features. This approach, however, only achieves mediocre results compared to the traditional models. In contrast, we find that fine-tuning a contextualised language model (e.g. BERT) yields significantly improved retrieval performance, providing a flexible solution for satisfying complex information needs without explicit feature engineering.

In information retrieval (IR), there has been a long-standing interest in professional search, as demonstrated by various TREC tracks dedicated to a diverse range of professional domains [4, 8]. Unlike traditional Web search, an important characteristic of professional search is the complex information needs of professional users. For instance, a professional user may ask for information within a certain time range, written with professional quality [24]. Although there have been ongoing discussions and studies calling for search systems that address common issues faced in professional search, solutions typically rely on dedicated databases or specialised search strategies that are backed by ad-hoc retrieval models [20, 23]. Meanwhile, although traditional retrieval models as well as learning to rank (L2R) approaches have made rapid advances in Web search, retrieval models that cater to the diverse requirements of professional search tasks are not widely developed.

In this paper, we study a case in the context of legal professional search and investigate how different retrieval approaches can be employed to address the complex needs of professional users. The work task of our users is to monitor a number of legal topics of interest in the news and to periodically select articles to be included in a report, according to a set of clearly defined criteria ranging from topical relevance to language quality. As in many other professional search scenarios, our users set up their searches against a news stream with complex Boolean queries [20], where results are ranked in chronological order; and they deem recall an important metric, as they do not want to miss relevant articles. While Boolean queries are often preferred by professional searchers due to their need for results that are "efficient, trustable, explainable and accountable" [16, 24], they fall short in addressing complex relevance criteria beyond term matching. Traditional retrieval models such as BM25 and Language Models (LM) capture topical relevance.
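For reference, a standard formulation of two such models, BM25 and the Dirichlet-smoothed query-likelihood language model, is given below; the notation is introduced here for illustration only: f(t,d) is the frequency of term t in document d, |d| the document length, avgdl the average document length, N the collection size, n_t the document frequency of t, P(t|C) the collection language model, and k_1, b and mu the usual free parameters.

```latex
\mathrm{BM25}(q,d) = \sum_{t \in q}
  \log\frac{N - n_t + 0.5}{n_t + 0.5}\cdot
  \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

\mathrm{LM}_{\mathrm{Dir}}(q,d) = \sum_{t \in q}
  \log\frac{f(t,d) + \mu\,P(t \mid C)}{|d| + \mu}
```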
As a first step in going beyond the Boolean search practice, we answer the following research question: RQ1. Do traditional IR models help our users identify relevant documents more effectively than the Boolean search practice? Further, in order to satisfy users' complex information needs beyond topicality, it seems natural to encode indicators of the different criteria as features and combine them with an L2R approach. Therefore, our next research question is: RQ2. Can we provide better results by adopting an L2R approach to satisfy users' complex information needs beyond topicality? However, feature engineering for every criterion can be time-consuming and may not be convenient when switching from one use case to another. Recently, pre-trained contextualised language models (e.g. BERT [10]) have effectively addressed various NLP tasks [10, 25], eliminating the need for feature engineering. This leads to the investigation of a follow-up research question: RQ3. Can we further improve the quality of the search results by fine-tuning a pre-trained language model on our search task?

Our contributions can be summarised as follows: 1. Unlike simulation-based studies such as TREC tasks, where the information needs, relevance criteria and judgements are set by different parties rather than the actual users, the complex information needs in our study come from real users, who also define the relevance criteria and provide the judgements. Our study not only reveals the practical challenges for professional search systems, but also demonstrates possible solutions to effectively address these challenges; and 2. We also contribute to a generic solution to professional search (i.e. search with complex information needs). Our experiments show the potential of employing pre-trained contextualised language models to learn relevance criteria without handcrafted features, which leads to a flexible solution that adapts to varying complex needs.

In this case study, our users have three relevance criteria: topical relevance, factual information, and language quality. Specifically, the topicality of retrieved articles must be associated with a specific legal area; only factual articles are considered relevant; and articles written in technical and linguistically accurate language are preferred. We first explore the effectiveness of traditional models in satisfying the users' needs. We include four models: TF-IDF, BM25, and a unigram Language Model (LM) with Jelinek-Mercer and Dirichlet smoothing, applied to three fields of the news articles (title, summary, and content). As for the query, we extract the keywords from the complex Boolean queries that our users created and concatenate them into a long query (typically ∼100 words), ignoring negated terms.

In order to estimate the relevance of a document with respect to the combined relevance criteria described above, we employ an L2R approach and encode these criteria as features. We devise 28 features (see Table 1) as follows.

Topical Relevance. We model topical relevance using the outputs of traditional retrieval models, as is usually done in the literature [17].
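As an illustration, the sketch below shows one way to turn a user's Boolean query into the long keyword query described above and to compute a per-field topical score with a Dirichlet-smoothed LM. The regular expressions, helper names and toy collection statistics are illustrative assumptions rather than our actual implementation; in the real feature set, the scores of all four models on each field are included in the same way.

```python
# Illustrative sketch: keyword extraction from a Boolean query and per-field
# Dirichlet-smoothed LM scores used as topical features (names are our own).
import math
import re
from collections import Counter

def keywords_from_boolean(boolean_query: str) -> list:
    """Drop operators, parentheses and negated clauses; keep the positive terms."""
    q = re.sub(r"NOT\s*\([^)]*\)|NOT\s+\S+", " ", boolean_query, flags=re.I)
    q = re.sub(r'\b(AND|OR)\b|[()"*]', " ", q, flags=re.I)
    return q.lower().split()

def lm_dirichlet_score(query_terms, field_text, collection_tf, collection_len, mu=2000.0):
    """Query log-likelihood of a field under a Dirichlet-smoothed unigram LM."""
    doc_terms = field_text.lower().split()
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        p_background = (collection_tf.get(t, 0) + 1) / (collection_len + 1)
        score += math.log((tf[t] + mu * p_background) / (len(doc_terms) + mu))
    return score

def topical_features(query_terms, article, collection_tf, collection_len):
    """One retrieval score per field; the other models' scores are added analogously."""
    return [lm_dirichlet_score(query_terms, article.get(field, ""),
                               collection_tf, collection_len)
            for field in ("title", "summary", "content")]

# toy usage with made-up collection statistics
query = keywords_from_boolean('("data protection" OR GDPR) AND fine NOT rumour')
article = {"title": "Regulator issues GDPR fine", "summary": "...", "content": "..."}
print(topical_features(query, article, collection_tf={"gdpr": 3, "fine": 10}, collection_len=1000))
```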
Factual Information. We model factual information with three types of features: (1) Subjectivity: it measures the degree of subjectivity of an article, which is directly related to the "factuality" of the article. (2) Modality: it shows the degree of certainty of the statements of an article, by looking at the verb tense in which the article is written. (3) Sentiment: it provides the degree of negativity or positivity of the language used; while not directly related to the factuality dimension, there can be entanglement between the subjective and opinionated dimensions [13]. We employ a lexicon-based approach to compute these features [15, 22]. We also include the number of lexicon entries assessed in an article as a normalisation factor for articles with different numbers of lexicon matches.

Language Quality. Since the content our users request is technical and sometimes hard for non-experts to read, we employ readability scores as features (e.g. Anderson's index [5] and the Dale-Chall index [9]); these measure the ease with which a reader can understand a written text.

Apart from devising task-specific features, we exploit a pre-trained contextualised language model to automatically learn the complex relevance criteria. By fine-tuning the model on our search task, we expect it to associate these language features with the relevance judgements. We use BERT [10], which shows state-of-the-art performance on a wide range of NLP tasks [10, 25]. Inspired by the work of MacAvaney et al. [18], we employ BERT in its regression form [19] (known as Vanilla BERT in [18]). Specifically, the input consists of a query-document pair, and the output is a predicted relevance score. For the document input, we use (i) a combined title and summary field (referred to as BERT on summary), and (ii) the content of the article (referred to as BERT on content).

Dataset. The dataset we use to evaluate our retrieval approaches comes from the interaction data of legal professionals with a news monitoring system over a one-year period. The users monitor a specific legal topic by periodically querying the news stream with that topic, and all the retrieved results are tagged with a relevance judgement for later use. Given this context, we group the data into equal intervals corresponding to the report creation times and evaluate the retrieved results per interval. The initial ranked lists were generated using the Boolean queries created by the users, ranked in chronological order. We apply the alternative ranking approaches as a re-ranking task. In total, the dataset consists of 206 queries and 60,512 labelled news articles, among which 2,872 (21%) are marked as relevant. By grouping the searches into equal intervals and removing sessions with no relevant articles, we obtain 1,774 search sessions (i.e. query-results pairs). The average number of relevant articles per session varies from 1.5 to 4.4 depending on the query. We randomly split the dataset into training (80%), validation (10%), and test (10%) sets. The same setup holds for the traditional retrieval models as well as for the L2R approaches. We use the validation set to tune the models' parameters.

Evaluation Measures. We use Mean Average Precision (MAP) to train and measure the performance of the retrieval models. Since recall is important in this user task, we use two recall-oriented metrics: R@3 (given the small number of relevant documents per search), and average Search Length (SL), which measures the amount of effort a user needs to find all relevant documents.

Features. The topical features take the scores generated by the traditional IR models with their optimal parameter settings. For the language-usage features, we use an implementation of the CLiPS pattern-en module for subjectivity and modality [22], and VADER [3] to compute sentiment scores. The 9 readability features are computed using the Python Readability package [2].
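A minimal sketch of the language-usage and readability feature extraction is shown below, assuming the pattern and vaderSentiment packages; textstat is used here only as a stand-in for the Readability package, and the feature names are illustrative rather than the exact ones in Table 1.

```python
# Illustrative extraction of the language-usage and readability features.
# pattern-en provides subjectivity and modality [22], VADER the sentiment scores [3];
# textstat is a stand-in for the readability scores (the system uses the Python
# Readability package [2]). Feature names are our own.
from pattern.en import parse, Sentence, modality, sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import textstat

vader = SentimentIntensityAnalyzer()

def language_features(text: str) -> dict:
    polarity, subjectivity = sentiment(text)                   # degree of subjectivity
    certainty = modality(Sentence(parse(text, lemmata=True)))  # degree of certainty
    sentiment_scores = vader.polarity_scores(text)             # negativity/positivity
    return {
        "subjectivity": subjectivity,
        "modality": certainty,
        "sentiment_compound": sentiment_scores["compound"],
        # two of the nine readability scores, as examples
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
    }

print(language_features("The court ruled that the merger may be blocked."))
```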
As the L2R approach, we use the LambdaMART implementation from RankLib [1]. We apply a linear normalisation to our features as implemented by the library; each feature is normalised according to its minimum and maximum values. BERT is fine-tuned using our labelled data, as described by MacAvaney et al. [18]. The input of BERT is the concatenation of a [CLS] token, the query, a [SEP] token, and the document. The document is truncated when the input is longer than 512 tokens. The output of BERT is a vector representation for each input token. We use the BERT-base uncased version, where each vector has a dimension of 768. For fine-tuning, we stack a linear layer on top of BERT, which takes as input the output vector of the [CLS] token. We rank documents according to the score output by the linear layer.
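A minimal sketch of this re-ranker is given below, using the HuggingFace transformers library purely for illustration; the class and variable names are ours, and the fine-tuning loop (loss, optimiser, batching) is omitted.

```python
# Illustrative "Vanilla BERT" re-ranker: [CLS] query [SEP] document, with a linear
# layer on the [CLS] vector producing a relevance score (a sketch, not our exact code).
import torch
from transformers import BertModel, BertTokenizerFast

class VanillaBertRanker(torch.nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.score = torch.nn.Linear(self.bert.config.hidden_size, 1)  # 768 -> 1

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0]   # vector of the [CLS] token
        return self.score(cls_vec).squeeze(-1)  # one relevance score per pair

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = VanillaBertRanker()

# toy query-document pair; the document side is truncated so the input fits in 512 tokens
query_text = "data protection regulation fine"
doc_text = "The regulator issued a record fine for breaching data protection rules."
enc = tokenizer(query_text, doc_text, truncation="only_second",
                max_length=512, return_tensors="pt")
with torch.no_grad():
    relevance = model(**enc)
print(float(relevance))
```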
Regarding RQ1, Table 2 lists the results of the traditional retrieval models (TF-IDF, BM25, and the Language Model (LM) with Jelinek-Mercer smoothing (J-M) and Dirichlet smoothing (Dir)) compared to those of the Boolean search with chronological ordering, i.e. the working practice of our professional users. We see that all retrieval models significantly outperform the Boolean search results on all measures. This suggests that, without further effort in terms of feature engineering and model fitting, traditional models already improve the ranking quality by capturing topical relevance. Further, we see that a field may be best for one model but not for another, suggesting that their combination in an L2R approach may be beneficial. Hereafter, we choose Dir on content, which has the best MAP score, as the baseline for the remaining experiments.

To address RQ2 and RQ3, Table 3 shows the results of LambdaMART with explicitly encoded features and of the BERT scores, compared to the Dir baseline. Firstly, we see that LambdaMART with explicitly encoded features yields no significant improvement over the baseline. In particular, the topical features (i.e. a linear combination of the traditional models and different fields) do not provide better performance than Dir. However, among the different feature configurations, LambdaMART with all features performs best, suggesting that both types of features are somewhat useful in capturing the relevance criteria. In response to RQ2, these results imply that the L2R approach with explicit feature engineering does not achieve competitive performance; perhaps the hand-crafted features were not able to match the user-specified relevance criteria well. Next, we observe that the BERT-based approaches significantly outperform Dir. In particular, in terms of SL, with the baseline a user would need to read on average 5.5 irrelevant documents before finding all relevant documents, while with the BERT-based models this is reduced to fewer than 2, potentially improving the user experience. Moreover, compared to explicit feature engineering, fine-tuning BERT seems to have captured the users' information needs in an implicit manner. This is encouraging, as it not only learns the complex relevance criteria more accurately, but also provides more flexibility, since the model can be fine-tuned for use cases with different criteria without dedicated feature engineering.

The above results show promising performance of the different ranking approaches in terms of off-line IR evaluation, compared to the original Boolean setup. From a user perspective, this means that users may be able to confidently stop reading results after seeing a certain number of irrelevant results, which would be particularly useful when the result list is long and relevant articles are few. On the other hand, we should also be aware that as model complexity increases, model explainability and user controllability decrease; these are properties of Boolean search that professional users appreciate [16, 24]. Therefore, for future work, we find it crucial to investigate methods that explain and control complex models such as BERT.

We explored different retrieval approaches to address the complex information needs of professional users in a legal search context. We found that, compared to Boolean search, traditional retrieval models are effective in improving the ranking quality and reducing the user effort needed to find relevant information (e.g. as measured by SL). Learning to rank with explicit feature encoding does not seem to easily improve over the traditional models. However, fine-tuning a pre-trained language model (BERT) shows strong improvements over both the traditional models and the L2R models, with the advantage of not requiring dedicated feature encoding. In particular, our study opens up a number of research questions in the context of professional search: (i) what kind of features allow pre-trained LMs to capture the implicit information needs from users' relevance judgements? (ii) what are the limitations of pre-trained LMs in capturing fine-grained information needs? and (iii) how does the above depend on the number and quality of the relevance judgements, particularly in the case of niche retrieval tasks?

References
Guidelines for the 2011 TREC medical records track
Lix and Rix: variations on a little-known readability index
A computer readability formula designed for machine scoring
Overview of the TREC 2010 legal track
A formula for predicting readability: instructions
BERT: pre-training of deep bidirectional transformers for language understanding
Simplification of Flesch reading ease formula
An information nutritional label for online documents
The Technique of Clear Writing
VADER: a parsimonious rule-based model for sentiment analysis of social media text
Automatic Boolean query suggestion for professional search
Learning to rank for information retrieval
CEDR: contextualized embeddings for document ranking
SMOG grading - a new readability formula
Information retrieval in the workplace: a comparison of professional search practices
Automated readability index
Pattern for Python
First international workshop on professional search
Information search in a professional context - exploring a collection of professional search tasks
XLNet: generalized autoregressive pretraining for language understanding