title: QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
authors: Rogers, Anna; Gardner, Matt; Augenstein, Isabelle
date: 2021-07-27

Alongside the huge volume of research on deep learning models in NLP in recent years, there has also been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of "reasoning types" in question answering and propose a new taxonomy. We also discuss the implications of over-focusing on English, and survey the current monolingual resources for other languages as well as multilingual resources. The study is aimed both at practitioners looking for pointers to the wealth of existing data, and at researchers working on new resources.

The result is a potpourri of datasets that is difficult to reduce to a single taxonomy, and for which it would be hard to come up with a single defining feature that would apply to all the resources. For instance, while we typically associate "question answering" (QA) and "reading comprehension" (RC) with a setting where there is an explicit question that the model is supposed to answer, even that is not necessarily the case. Some such datasets are in fact based on statements rather than questions (as in many cloze-formatted datasets, see §3.2.3), or on a mixture of statements and questions. The chief contribution of this work is a systematic review of the existing resources with respect to a set of criteria, which also broadly correspond to the research questions NLP has focused on so far. After discussing the distinction between probing and information-seeking questions (§2), and the issue of question answering as a task vs. a format (§3.1), we outline the key dimensions of the format of the existing resources: questions (questions vs. statements, §3.2), answers (extractive, multi-choice, categorical and freeform, §3.3), and input evidence (in terms of its modality, amount of information, and conversational features, §3.4). Then we consider the domain coverage of the current QA/RC resources (§4) and the types of reasoning (§6), providing an overview of the current classifications and proposing our own taxonomy (along the dimensions of inference, information retrieval, world modeling, input interpretation, and multi-step reasoning). We conclude with a discussion of the issue of "requisite" skills and the gaps in the current research (§7). For each of these criteria, we discuss how it is conceptualized in the field, with representative examples of English resources of each type. (Most QA/RC resources are currently in English, so the examples we cite are in English unless specified otherwise; §5 discusses the languages represented in the current monolingual and multilingual resources, including the tendencies and incentives for their creation.) This set of criteria allows us to place QA/RC work in the broader context of work on machine reasoning and the linguistic features of NLP data, in a way that allows for easy connections to other approaches to NLU such as inference and entailment. It also allows us to map out the field in a way that highlights cross-field connections (especially multi-modal NLP and commonsense reasoning) and gaps for future work to fill.
This survey focuses exclusively on the typology of the existing resources, and its length is proof that data work on RC/QA has reached a volume at which it can no longer be surveyed in conjunction with modeling work. We refer the reader to the existing surveys [204, 293] and tutorials [46, 223] for the current approaches to modeling in this area.

The most fundamental distinction in QA datasets is based on the communicative intent of the author of the question: was the person seeking information they did not have, or trying to test the knowledge of another person or machine? (There are many other possible communicative intents of questions in natural language, such as expressing surprise, emphasis, or sarcasm. These do not as yet have widely-used NLP datasets constructed around them, so we do not focus on them in this survey.) Many questions appear "in the wild" as a result of humans seeking information, and some resources such as Natural Questions [137] specifically target such questions. Many other datasets consist of questions written by people who already knew the correct answer, for the purpose of probing NLP systems. These two kinds of questions broadly correlate with the "tasks" of QA and RC: QA is more often associated with information-seeking questions and RC with probing questions, and many of the other dimensions discussed in this survey tend to cluster based on this distinction. Information-seeking questions tend to be written by users of some product, be it Google Search [53, 137], Reddit [79], or community question answering sites like StackOverflow [e.g. 36] and Yahoo Answers [e.g. 104], though there are some exceptions where crowd workers were induced to write information-seeking questions [54, 64, 82]. Most often, these questions assume no given context (§3.4.2) and are almost never posed as multiple choice (§3.3). Industrial research tends to focus on this category of questions, as research progress directly translates into improved products. An appealing aspect of these kinds of questions is that they typically arise from real-world use cases, and so can be sampled to create realistic evaluation data.

Fig. 1. When is question answering a task, and when is it a format? (The figure orders tasks by how easily their questions could be replaced with ids: classification is easy, e.g. "What is the sentiment of ...?"; template filling is doable, e.g. "When was ... born?"; open-ended QA is difficult, as there are too many templates and/or variables.)

At the "easy" end of this spectrum, the questions correspond to a few labels and could easily be replaced: an NLP system does not actually need to "understand" the wording of the recast question, beyond the part that needs to be classified. This heuristic is not a strict criterion, however, and the boundaries are fuzzy. Some datasets that have been published and used as QA or RC datasets can be templated with a few dozen templates [e.g. 272]. Still, such datasets have enabled undeniable progress, and will likely continue to be useful. What has changed is our awareness of how low diversity of patterns in the training data leads to over-reliance on these patterns [90, 118, 154, 171, among others]. One should also not conflate format with reasoning types (§6).
For example, "extractive QA" is often discussed as if were a cohesive problem -however, extractive QA is an output format, and datasets using this format can differ wildly in the nature of the problems they encapsulate. 3.2.1 Natural language questions. Most QA and RC datasets have "questions" formulated as questions that a human speaker could ask, for either information-seeking or probing purposes (see §2). They could further be described in terms of their syntactic structure: yes/no questions (Did it rain on Monday?) wh-questions (When did it rain?), tag questions (It rained, didn't it?), or declarative questions (It rained?) [108] . Resources with syntactically well-formed questions as question format may come with any type of answer format described in §3.3. Bordering on what could be characterized as "questions" linguistically are queries: the pieces of information that could be interpreted as a question (e.g. tallest mountain Scotland −→ which mountain is the tallest in Scotland?). Some QA resources (especially those with tables and knowledge bases (KB) as input) start with logically well-formed queries. From there, syntactically well-formed questions can be generated with templates [e.g. 272] , and then optionally later edited by people [e.g . 96] . On the messy side of that spectrum we have search engine queries that people do not necessarily form as either syntactically well-formed questions or as KB queries. The current datasets with "natural" questions use filters to remove such queries [16, 137] . How we could study the full range of human interactions with search engines is an open problem at the boundary of QA and IR. Cloze-format resources are based on statements rather than questions or queries: it is simply a sentence(s) with a masked span that, like in extractive QA (see §3.3.1), the model needs to predict. The key difference is that this statement is simply an excerpt from the evidence document (or some other related text), rather than something specifically formulated for information extraction. The sentences to be converted to Cloze "questions" have been identified as: • simply sentences contained within the text [159] ; • designating an excerpt as the "text", and the sentence following it as the "question" [112, 194] ; • given a text and summary of that text, use the summary as the question [110] ; The Cloze format has been often used to test the knowledge of entities (CNN/Daily Mail [110] , WikiLinks Rare Entity [159] ). Other datasets targeted a mixture of named entities, common nouns, verbs (CBT [112] , LAMBADA [194] ). While the early datasets focused on single words or entities to be masked, there are also resources masking sentences in the middle of the narrative [62, 132] . The Cloze format has the advantage that these datasets can be created programmatically, resulting in quick and inexpensive data collection (although it can also be expensive if additional filtering is done to ensure answerability and high question quality [194] ). But Cloze questions are not technically "questions", and so do not directly target the QA task. The additional limitation is that only the relations within a given narrow context can be targeted, and it is difficult to control the kind of information that is needed to fill in the Cloze: it could simply be a collocation, or a generally-known fact -or some unique relation only expressed within this context. 
The Cloze format is currently resurging in popularity as a way to evaluate masked language models [78, 92], since fundamentally the Cloze task is what these models do in pre-training.

A popular format in commonsense reasoning is the choice of an alternative ending for a passage (typically combined with the multi-choice answer format, see §3.3.2). It could be viewed as a variation of the Cloze format, but many Cloze resources have been generated automatically from existing texts, while choice-of-ending resources tend to be crowdsourced for this specific purpose. Similarly to the Cloze format, the "questions" are not necessarily linguistically well-formed questions. They may be unfinished sentences (as in SWAG [284] and HellaSWAG [285]) or short texts (as in RocStories [181]) to be completed.

The outputs of the current text-based datasets can be categorized as extractive (§3.3.1), multi-choice (§3.3.2), categorical (§3.3.3), or freeform (§3.3.4).

3.3.1 Extractive format. Given a source of evidence and a question, the task is to predict the part of the evidence (a span, in the case of a text) which is a valid answer to the question. This format is very popular both thanks to its clear relevance for QA applications and the relative ease of creating such data (questions need to be written or collected, but answers only need to be selected in the evidence). In its classic formulation, extractive QA is the task behind search engines. The connection is very clear in early QA research: the stated goal of the first TREC QA competition in 2000 was "to foster research that would move retrieval systems closer to information retrieval as opposed to document retrieval" [265]. To answer questions like "Where is the Taj Mahal?" given a large collection of documents, the participating systems had to rank the provided documents and the candidate answer spans within the documents, and return the best five. Some of the more recent QA datasets also provide a collection of candidate documents rather than a single text (see §3.4.2). A step back in this direction came with the introduction of unanswerable questions [3, 209]: questions that target the same context as the regular questions, but do not have an answer in that context. With the addition of unanswerable questions, systems trained on extractive datasets can be used as a component of a search engine: first the candidate documents are assessed for whether they can be used to answer a given question, and then span prediction is conducted on the most promising candidates. It is, however, possible to achieve search-engine-like behavior even without unanswerable questions [45, 52]. Many extractive datasets are "probing" in that the questions were written by people who already knew the answer, but, as the datasets based on search engine queries show, it does not have to be that way. A middle ground is questions written by humans to test human knowledge, such as Trivia questions [124]: in this case, the writer still knows the correct answer, but the question is not written while looking at the text containing the answer, and so is less likely to contain trivial lexical matches. The advantage of the extractive format is that only the questions need to be written, and the limited range of answer options means that it is easier to define what an acceptable correct answer is. The key disadvantage is that it restricts the possible questions to those whose answers are directly contained in the text.
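To make the extractive setting concrete, here is a minimal sketch using an off-the-shelf SQuAD-style model via the Hugging Face transformers pipeline; the specific checkpoint is an illustrative public choice, not one used in the surveyed work:

```python
from transformers import pipeline

# Any SQuAD-style extractive checkpoint can be plugged in here.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The first TREC QA competition was held in 2000, with the stated goal of moving "
    "retrieval systems closer to information retrieval as opposed to document retrieval."
)
result = qa(question="When was the first TREC QA competition held?", context=context)

# The pipeline returns the predicted span together with its character offsets,
# e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': '2000'}
print(result)
```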
While it is possible to pose rather complex questions (§6.2.5), it is hard to use this format for any interpretation of the facts of the text, any meta-analysis of the text or its author's intentions, or inference to unstated propositions.

3.3.2 Multi-choice format. Multiple choice questions are questions for which a small number of answer options are given as part of the question text itself. Many existing multi-choice datasets are expert-written, stemming from school examinations (e.g. RACE [138], CLEF QA [197]). This format has also been popular in datasets targeting world knowledge and commonsense information, typically based on crowdsourced narratives: MCTest [216], MCScript [192], RocStories [181]. The advantage of this format over the extractive one is that the answers are no longer restricted to something explicitly stated in the text, which enables a much wider range of questions (including commonsense and implicit information). The question writer also has full control over the available options, and therefore over the kinds of reasoning that the test subject needs to be capable of. This is why this format has a long history in human education. Evaluation is also straightforward, unlike with freeform answers. The disadvantage is that writing good multi-choice questions is not easy: if the incorrect options are easy to rule out, the questions are not discriminative. Since multi-choice questions have been extensively used in education, there are many insights into how to write such questions in a way that best tests human students, both for low-level and high-level knowledge [15, 31, 163, 165]. However, it is increasingly clear that humans and machines do not necessarily find the same things difficult, which complicates direct comparisons of their performance. In particular, teachers are instructed to ensure that all the answer options are plausible and given in the same form [31, p.4]. This design could make the questions easy for a model backed by collocation information from a language model. However, NLP systems can be distracted by shallow lexical matches [118] or nonsensical adversarial inputs [267], and be insensitive to at least some meaning-altering perturbations [225]; for humans, such options would be easy to reject. Humans may also act differently when primed with different types of prior questions and/or when they are tired, whereas most NLP systems do not change between runs. QuAIL [221] made the first attempt to combine questions based on textual evidence, world knowledge, and unanswerable questions, finding that this combination is difficult in human evaluation: when exposed to all three levels of uncertainty, humans have trouble deciding between making an educated guess and marking the question as unanswerable, while models do not.

3.3.3 Categorical format. We describe as "categorical" any format where the answers come from a strictly pre-defined set of options. As long as the set is limited to a semantic type with a clear similarity function (e.g. dates, numbers), we get the benefit of automated evaluation metrics without the limitations of the extractive format. One could view the "unanswerable" questions in the extractive format [209] as a categorical task, which is then followed by answering the questions that can be answered (§3.3.1). Perhaps the most salient example of the categorical answer format is boolean questions, for which the most popular resource is currently BoolQ [53].
It was collected as "natural" information-seeking questions from Google search queries, similarly to Natural Questions [137]. Other resources not focusing on boolean questions specifically may also include them (e.g. MS MARCO [16], bAbI [272], QuAC [50]). Another kind of categorical output format is when the set of answers seen during training is used as the set of allowed answers at test time. This allows for simple prediction (the final prediction is a classification problem), but is quite limiting in that no test question can have an unseen answer. Visual question answering datasets commonly follow this pattern (e.g. VQA [8], GQA [115], CLEVR [123]).

3.3.4 Freeform format. The most natural setting for human QA is to generate the answer independently rather than choose it from the evidence or from available alternatives. This format allows for asking any kind of question, and any other format can be instantly converted to it by having the system generate, rather than select, the available "gold" answer. The problem is that the "gold" answer is probably not the only correct one, which makes evaluation difficult. Most questions have many correct or acceptable answers, and they would need to be evaluated on at least two axes: linguistic fluency and factual correctness. Both of these are far from being solved. On the factual side, it is possible to get high ROUGE-L scores on ELI5 [79] with answers conditioned on irrelevant documents [135], and even human experts find it hard to formulate questions so as to exactly specify the desired level of answer granularity, and to avoid presuppositions and ambiguity [33]. On the linguistic side, evaluating generated language is a huge research problem in itself [39, 261], and annotators struggle with longer answers [135]. There are also sociolinguistic considerations: humans answer the same question differently depending on the context and their background, which should not be ignored [219]. So far the freeform format has not been very popular. Perhaps the best-known example is MS MARCO [16], based on search engine queries with human-generated answers (written as summaries of provided Web snippets), in some cases with several answers per query. Since 2016, the dataset has grown to a million queries and is now accompanied by satellite IR tasks (ranking, keyword extraction). For NarrativeQA [131], crowd workers wrote both questions and answers based on book summaries. CoQA [213] is a collection of dialogues of questions and answers from crowd workers, with an additional step for answer verification and collecting multiple answer variants. The writers were allowed to see the evidence, so the questions are not information-seeking, but the workers were dynamically alerted to avoid words directly mentioned in the text. ELI5 [79] is a collection of user questions and long-form abstractive answers from the "Explain like I'm 5" subreddit, coupled with Web snippet evidence. There is a lot of work to be done in the direction of evaluation for freeform QA. As a starting point, Chen et al. [43] evaluate the existing automated evaluation metrics (BLEU, ROUGE, METEOR, F1) for extractive and multi-choice questions converted to freeform format, concluding that these metrics may be used for some of the existing data, but that they limit the kinds of questions that can be posed and, since they rely on lexical matches, necessarily do poorly for the more abstractive answers.
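For reference, the token-overlap F1 in this family of metrics is simple to state; the sketch below is a rough approximation of the SQuAD-style evaluation (the normalization rules are simplified relative to the official script):

```python
import re
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, strip punctuation and English articles, split on whitespace
    (a rough approximation of the SQuAD evaluation script's normalization)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Partial lexical overlap is rewarded, paraphrases are not:
print(token_f1("in the city of Agra", "Agra, India"))         # 0.33
print(token_f1("where the Taj Mahal stands", "Agra, India"))  # 0.0
```

The second example illustrates exactly the failure mode discussed above: a lexically disjoint but potentially acceptable answer scores zero.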
Chen et al. further argue for developing new metrics based on representation similarity rather than n-gram matches [44], although the current implementations are far from perfect.

To conclude the discussion of answer formats in QA/RC, let us note that, as with the other dimensions for characterizing existing resources, these formats do not form a strict taxonomy based on one coherent principle. Conceptually, the task of extractive QA could be viewed as a multi-choice one: the choices are simply all the possible spans in the evidence document (although most of them would not make sense to humans). The connection is obvious when these options are limited in some way: for example, the questions in CBT [112] are extractive (Cloze-style), but the system is provided with 10 possible entities from which to choose the correct answer, which also makes it a multi-choice dataset. If the goal is general language understanding, we arguably do not even want to impose strict format boundaries. To this end, UnifiedQA [128] proposes a single "input" format to which they convert extractive, freeform, categorical (boolean) and multi-choice questions from 20 datasets, showing that cross-format training often outperforms models trained solely in-format.

By "evidence" or "context", we mean whatever the system is supposed to "understand" or use to derive the answer from.

3.4.1 Modality. While QA/RC is traditionally associated with natural language texts or structured knowledge bases, research has demonstrated the success of multi-modal approaches to QA (audio, images, and even video). Each of these areas is fast-growing, and multimedia work may be key to overcoming issues with implicit knowledge that is not "naturally" stated in text-based corpora [26].

Unstructured text. Most resources described as RC benchmarks [e.g. 210, 216] have textual evidence in natural language, while many QA resources come with multiple excerpts as knowledge sources (e.g. [16, 55]). See §3.4.2 for more discussion of the variation in the amount of text that is given as the context in a dataset.

Semi-structured text (tables). A fast-growing area is QA based on information from tables. At least four resources are based on Wikipedia tabular data, including WikiTableQuestions [195] and TableQA [260]. Two of them have supporting annotations for attention supervision: SQL queries in WikiSQL [290], operand information in WikiOps [49].

Structured knowledge. Open-domain QA with a structured knowledge source is an alternative to looking for answers in text corpora, except that in this case the model has to explicitly "interpret" the question by converting it into a query (e.g. by mapping the text to a triplet of entities and relation, as in WikiReading [111]). The questions can be composed based on the target structured information, as in SimpleQuestions [32] or Event-QA [57]. The process is reversed in FreebaseQA [121], which collects independently authored Trivia questions and filters them to identify the subset that can be answered with Freebase information. The datasets may target a specific knowledge base: a general one such as WikiData [111] or Freebase [22, 121], or one restricted to a specific application domain [109, 274].

Images. While much of this work is presented in the computer vision community, the task of multi-modal QA (combining visual and text-based information) is a challenge for both the computer vision and NLP communities.
The complexity of the verbal component is on a sliding scale: from simple object labeling, as in MS COCO [153], to complex compositional questions, as in GQA [115]. While the NLP community is debating the merits of "natural" information-seeking vs. probing questions and both types of data are prevalent (see §2), for visual QA the situation is skewed towards probing questions, since most datasets are based on large image collections such as COCO, Flickr or ImageNet, which do not come with any independently occurring text. Accordingly, the verbal part may be created by crowdworkers based on the provided images (e.g. [242]), or (more frequently) generated, e.g. AQUA [84], IQA [95]. In VQG-Apple [196] the crowd workers were provided with an image and asked to write questions one might ask a digital assistant about that image, but the paper does not analyze how realistic the result is.

Audio. "Visual QA" means answering questions about images; similarly, there is a task of QA about audio clips. DAQA [80] is a dataset consisting of audio clips and questions about what sounds can be heard in the audio, and in what order. As with most VQA work, the questions are synthetic. Interestingly, despite the boom of voice-controlled digital assistants that answer users' questions (such as Siri or Alexa), public data for purely audio-based question answering is so far a rarity: the companies developing such systems undoubtedly have a lot of customer data, but releasing portions of it would be both ethically challenging and not aligned with their business interests. The result is that in audio QA the QA part seems to be viewed as a separate, purely text-based component of a pipeline with speech-to-text input and text-to-speech output. That may not be ideal, because in real conversations humans take prosodic cues into account for disambiguation; but so far there are few such datasets, making this a promising future research area. So far there are two small-scale datasets produced by human speakers: one based on TOEFL listening comprehension data [258], and one for a Chinese SQuAD-like dataset [139]. Spoken-SQuAD [145] and Spoken-CoQA [282] have audio clips generated with a text-to-speech engine. Another challenge for audio-based QA is the conversational aspect: questions may be formulated differently depending on the previous dialogue. See §3.4.3 for an overview of the text-based work in that area.

Video. QA on videos is also a growing research area. Existing datasets are based on movies (MovieQA [251], MovieFIB [164]), TV shows (TVQA [141]), games (MarioQA [183]), cartoons (PororoQA [129]), and tutorials (TutorialVQA [56]). Some are "multi-domain": VideoQA [294] comprises clips from movies, YouTube videos and cooking videos, while TGIF-QA is based on miscellaneous GIFs [117]. As with other multimedia datasets, the questions in video QA datasets are most often generated [e.g. 117, 294], and the source of the text used for generating those questions matters a lot: audio descriptions tend to focus on visual features, while text summaries focus on the plot [141]. TVQA questions are written by crowd workers, but they are still clearly probing rather than information-seeking. It is an open question what "natural" video QA would even look like: questions asked by someone deciding whether to watch a video? Questions asked to replace watching a video? Questions asked by movie critics?

Other combinations.
While most current datasets fall into one of the above groups, there are also other combinations. For instance, HybridQA [47] and TAT-QA [292] target information combined from text and tables, and MultiModalQA [250] adds images to that setting. MovieQA [251] has different "settings" based on what combination of input data is used (plots, subtitles, video clips, scripts, and DVS transcription). The biggest challenge for all multimodal QA work is to ensure that all the input modalities are actually necessary to answer the question [253]: it may be possible to pick the most likely answer based only on linguistic features, or to detect the most salient object in an image while ignoring the question. After that, there is the problem of ensuring that all of the multimodal information actually needs to be taken into account: for instance, if a model learns to answer questions about the presence of objects based on a single image frame instead of the full video, it may answer incorrectly when the object is added or removed during the video. See also §7.1 for discussion of the problem of "required" skills.

Fig. 2. How much knowledge for answering the question is provided in the dataset? The figure shows a continuum from a single source (one document needs to be considered for answering the question), through multiple sources (evidence is provided, but it has to be ranked and found) and partial sources (some evidence is provided, but it has to be combined with missing knowledge), to no sources (the model has to retrieve evidence or have it memorized).

The second dimension for characterizing the input of a QA/RC dataset is how much evidence the system is provided with. Here, we observe the following options:
• Single source: the model needs to consider a pre-defined tuple of a document and a question (and, depending on the format, answer option(s)). Most RC datasets, such as RACE [138] and SQuAD [210], fall into this category. A variation of this setting are resources with a long input text, such as complete books [131] or academic papers [64].
• Multiple sources: the model needs to consider a collection of documents to determine which one is the best candidate to contain the correct answer (if any). Many open-domain QA resources fall into this category: e.g. MS MARCO [16] and TriviaQA [124] come with retrieved Web snippets as the "texts" (see the retrieval sketch below). Similarly, some VQA datasets have multiple images as contexts [243].
• Partial source: the dataset provides documents that are necessary, but not sufficient to produce the correct answer. This may happen when the evidence snippets are collected independently and are not guaranteed to contain the answer, as in ARC [55]. Another frequent case is commonsense reasoning datasets such as RocStories [181] or CosmosQA [114]: there is a text, and the correct answer depends both on the information in this text and on implicit world knowledge. E.g. for SemEval2018 Task 11 [192] the organizers provided a commonsense reasoning dataset, and participants were free to use any external world knowledge resource.
• No sources: the model needs to rely only on some external source of knowledge, such as the knowledge stored in the weights of a pre-trained language model, a knowledge base, or an information retrieval component. A notable example is commonsense reasoning datasets, such as the Winograd Schema Challenge [142] or COPA [94, 217].
As shown in Figure 2, this is also more of a continuum than a strict taxonomy.
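In the multiple-sources setting referenced above, the dataset already folds an information-retrieval step into QA. The sketch below ranks candidate evidence documents with plain TF-IDF cosine similarity via scikit-learn, standing in for whatever retriever a real open-domain system would use; the example documents and function name are our own illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def rank_documents(question: str, documents: list, top_k: int = 3):
    """Rank candidate evidence documents by TF-IDF cosine similarity to the question.

    This covers only the retrieval step of a multiple-source pipeline;
    a reader model would then extract or generate an answer from the top hits.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(documents)      # (n_docs, vocab)
    question_vec = vectorizer.transform([question])       # (1, vocab)
    scores = linear_kernel(question_vec, doc_matrix)[0]   # cosine similarity (rows are L2-normalized)
    ranked = sorted(zip(scores, documents), reverse=True)
    return ranked[:top_k]

docs = [
    "The Taj Mahal is a mausoleum on the bank of the Yamuna river in Agra.",
    "Mount Everest lies on the border between Nepal and China.",
    "The Louvre in Paris is the world's most visited museum.",
]
print(rank_documents("Where is the Taj Mahal?", docs, top_k=2))
```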
As we go from a single well-matched source of knowledge to a large heterogeneous collection, the QA problem increasingly incorporates an element of information retrieval. The same could be said of long single sources, such as long documents or videos, if answers can be found in a single relevant excerpt and do not require a high-level summary of the whole context. So far, our QA/RC resources tend to target more complex reasoning for shorter texts, as it is more difficult to create difficult questions over larger contexts. Arguably, an intermediate case between the single-source and multiple-source settings are datasets that collect multiple sources per question, but provide them already coupled with questions, which turns each example into a single-source problem. For example, TriviaQA [124] contains 95K questions, but 650K question-answer-evidence triplets.

The vast majority of questions in the datasets discussed so far were collected or created as standalone questions, targeting a static source of evidence (text, knowledge base and/or any multimedia). The pragmatic context modeled in this setting is simply a set of standalone questions that could be asked in any order. But due to the active development of digital assistants, there is also active research on QA in conversational contexts: in addition to any sources of knowledge being discussed or used by the interlocutors, there is the conversation history, which may be required to even interpret the question. For example, the question "Where did Einstein die?" may turn into "Where did he die?" if it is a follow-up question; after that, the order of the questions can no longer be swapped. The key differences to the traditional RC setting are that (a) the conversation history grows dynamically as the conversation goes on, and (b) it is not the main source of information (that comes from some other context, a knowledge base, etc.). While "conversational QA" may intuitively be associated with spoken (as opposed to written) language, the current resources for conversational QA do not necessarily originate in this way. For example, similarly to RC datasets like SQuAD, CoQA [213] was created in written form, by crowd workers provided with prompts. It could be argued that "natural" search engine queries have some spoken language features, but they also have their own peculiarities stemming from the fact that functionally, they are queries rather than questions (see §3.2). The greatest challenge in creating conversational datasets is making sure that the questions are really information-seeking rather than probing (§2), since humans would not normally use the latter with each other (except perhaps in language learning contexts or checking whether someone slept through a meeting). From the perspective of how much knowledge the questioner has, existing datasets can be grouped into three categories:
• Equal knowledge. For example, CoQA [213] collected dialogues about the information in a passage (from seven domains) from two crowd workers, both of whom see the target passage. The interface discouraged the workers from using words occurring in the text.
• Unequal knowledge. For example, QuAC [50] is a collection of factual questions about a topic, asked by one crowdworker and answered by another (who has access to a Wikipedia article). (In most conversational QA datasets collected in the unequal-knowledge setup, the target information is factual, and the simulated scenario is that only one of the participants has access to that information, though in principle anyone could have such access. An interesting alternative direction is questions where the other participant is the only possible source of information, i.e. personal questions: CCPE-M [206] is a collection of dialogues where one party elicits the other party's movie preferences.) A similar setup to QuAC was used for the Wizards of Wikipedia [67], which, however, focuses on chitchat about Wikipedia topics rather than question answering, and could perhaps be seen as complementary to QuAC.
ShARC [227] uses more than two annotators for authoring the main and follow-up questions to simulate different stages of a dialogue.
• Repurposing "natural" dialogue-like data. An example of this approach is Molweni [146], based on data from the Ubuntu Chat corpus; its unique contribution is discourse-level annotation with sixteen types of relations (comments, clarification questions, elaboration etc.). MANtIS [199] is similarly based on StackExchange dialogues, with a sample annotated for nine discourse categories. MSDialog [205] is based on Microsoft support forums, and the Ubuntu dialogue corpus [161] likewise contains many questions and answers from the Ubuntu ecosystem.
Again, these proposed distinctions are not clear-cut, and there are in-between cases. For instance, DoQA [36] is based on "real information needs" in that the questions are based on StackExchange questions, but the actual questions were still generated by crowdworkers in the "unequal knowledge" scenario, with real queries serving as "inspiration". ShARC [227] has a separate annotation step in which crowd workers formulate a scenario in which the dialogue they see could take place, i.e. they try to reverse-engineer the information need. An emerging area in conversational QA is question rewriting: rephrasing questions in a way that would make them easier to answer, e.g. through Web search results. CANARD [76] is a dataset of rewritten QuAC questions, and SaAC [7] is similarly based on a collection of TREC resources. QReCC [7] is a dataset of dialogues with seed questions from Natural Questions [137] and follow-up questions written by professional annotators. All questions come in two versions: the "natural" version and a search-engine-friendly version, obtained e.g. by resolving pronouns to the nouns mentioned in the dialogue history. Disfl-QA [99] is a derivative of SQuAD with questions containing typical conversational "disfluencies" such as "uh" and self-corrections.
The above line of work is what one could call conversational QA. In parallel with that, there are datasets for dialogue comprehension, i.e. datasets for testing the ability to understand dialogues as opposed to static texts. They are "probing" in the same sense as e.g. RACE [138], the only difference being that the text is a dialogue script. In this category, FriendsQA [278] is based on transcripts of the 'Friends' TV show, with questions and extractive answers generated by crowd workers. There is also a Cloze-style dataset based on the same show [162], targeting named entities. DREAM [244] is a multi-choice dataset based on English exam data, with the texts being dialogues. Another related subfield is task-oriented (also known as goal-oriented) dialogue, which typically includes questions as well as transactional operations. The goal is for the user to collect the information they need and then perform a certain action (e.g. find out what flights are available, choose one, and book it).
There is some data for conversations with travel agents [116, 130], conducting meetings [6], navigation, scheduling and weather queries to an in-car personal assistant [77], and others [12, 232], as well as multi-domain resources [35, 200]. Conversational QA is actively studied not only in NLP, but also in information retrieval, and that community has produced many studies of actual human behavior in information-seeking dialogues that should be better known in NLP, so as to inform the design of future resources. For instance, outside of perhaps conference poster sessions and police interrogations, human dialogues do not usually consist only of questions and answers, which is e.g. the CoQA setting. Studies of human-system interaction [e.g. 255] elaborate on the types of conversational moves performed by users (such as informing, rejecting, promising etc.) and how they could be modeled. In conversational QA there are also potentially many more signals useful in evaluation than simple correctness: e.g. MISC [252] is a small-scale resource produced by in-house MS staff that includes not only transcripts, but also audio, video, affectual and physiological signals, as well as recordings of search and other computer use, and post-task surveys on emotion, success, and effort.

One major source of confusion in the domain adaptation literature is the very notion of "domain", which is often used to mean the source of the data rather than any coherent criterion such as topic, style, genre, or linguistic register [211]. In the current QA/RC literature it seems to be predominantly used in the senses of "topic" and "genre" (a type of text with a certain structure, stylistic conventions, and area of use). For instance, one could talk about the domains of programming or health, but either of them could be the subject of forums, encyclopedia articles, etc., which are "genres" in the linguistic sense. The classification below is primarily based on the understanding of "domain" as "genre", with caveats where applicable.

Encyclopedia. Wikipedia is probably the most widely used source of knowledge for constructing QA/RC datasets [e.g. 71, 111, 210, 277]. The QA resources of this type, together with those based on knowledge bases and Web snippets, constitute what some communities refer to as "open-domain" QA. Note that here the term "domain" is used in the "topic" sense: Wikipedia, as well as the Web and knowledge bases, contains much specialist knowledge, and the difference from the resources described below as "expert materials" is only that it is not restricted to particular topics.

Fiction. While fiction is one of the areas where large amounts of public-domain data are available, surprisingly few attempts have been made to use them as reading comprehension resources, perhaps due to the incentive for more "useful" information-seeking QA work. CBT [112] is an early and influential Cloze dataset based on children's stories. BookTest [17] expands the same methodology to a larger number of Project Gutenberg books. Being Cloze datasets, they inherit the limitations of the format discussed in §3.2.3. The first attempt to address a key challenge of fiction (understanding a long text) is NarrativeQA [131]: for this dataset, a QA system is supposed to answer questions while taking into account the full text of a book. However, a key limitation of this data is that the questions were formulated based on book summaries, and thus are likely to only target major plot details.
The above resources target literary or genre fiction: long, complex narratives created for human entertainment or instruction. NLP papers also often rely on fictional mini-narratives written by crowdworkers for the purpose of RC tests. Examples of this genre include MCTest [216], MCScript [179, 191, 192], and RocStories [181].

Academic tests. This is the only "genre" outside of NLP where experts devise high-quality, discriminative probing questions. Most of the current datasets were sourced from materials written by expert teachers to test students, which, in addition to the division by subject, yields a "natural" division by student level (different school grades, college etc.). Arguably, this corresponds to the level of difficulty of the target concepts (if not necessarily of the language). Among the college exam resources, the CLEF competitions [197, 198] and the NTCIR QA Lab [235] were based on small-scale data from Japanese university entrance exams. RACE-C [149] draws on similar data developed for Chinese university admissions. ReClor [283] is a collection of reading comprehension questions from standardized admission tests like the GMAT and LSAT, selected specifically to target logical reasoning. Among the school-level tests, the most widely used datasets are RACE [138] and DREAM [244], both comprised of tests created by teachers to test the reading comprehension of English by Chinese students (on narratives and multi-party dialogue transcripts, respectively). ARC [55] targets science questions authored for US school tests. OpenBookQA [174] also targets elementary science knowledge, but the questions were written by crowdworkers. ProcessBank [23] is a small-scale multi-choice dataset based on biology textbooks.

News. Given the increasing problem of online misinformation (see §7.3), question answering over news is a societally important area of research, but it is hampered by the lack of public-domain data. The best-known reading comprehension dataset based on news is undoubtedly the CNN/Daily Mail Cloze dataset [110], focusing on the understanding of named entities and coreference relations within a text. Subsequently, NewsQA [256] also relied on CNN data; it is an extractive dataset with questions written by crowd workers. Most recently, NLQuAD [238] is an extractive benchmark with "non-factoid" questions (originally BBC news article subheadings) that need to be matched with longer spans within the articles. In the multi-choice format, a section of QuAIL [221] is based on CC-licensed news. There is also a small test dataset of temporal questions about news events over a New York Times archive [269].

E-commerce. There are two e-commerce QA datasets based on Amazon review data. The earlier one was based on a Web crawl of questions and answers about products posed by users [167], and the more recent one (AmazonQA [101]) built upon it by cleaning up the data and providing review snippets and (automatic) answerability annotation. SubjQA [28] is based on reviews from more sources than just Amazon, has manual answerability annotation and, importantly, is the first QA dataset to also include labels for the subjectivity of answers.

Expert materials. This is a loose group of QA resources defined not by genre (the knowledge source is presumably materials like manuals, reports, scientific papers etc.), but by a narrow, specific topic known only to experts on that topic.
This might be the most common category of QA data in practice, since domain-specific chatbots for answering frequent user questions are increasingly used by companies, but such datasets are rarely made available to the research community. Most existing resources are based on answers provided by volunteer experts: e.g. TechQA [38] is based on naturally-occurring questions from tech forums. A less common option is to hire experts, as done for Qasper [64]: a dataset of expert-written questions over NLP papers. The "volunteer expert" setting is the focus of the subfield of community QA. It deserves a separate survey, but the key difference from the "professional" support resources is that the answers are provided by volunteers with varying levels of expertise, on platforms such as WikiAnswers [2], Reddit [79], or AskUbuntu [68]. Since the quality and quantity of both questions and answers vary a lot, new QA subtasks emerge in this setting, including duplicate question detection and ranking multiple answers to the same question [184-186]. The one expert area with abundant expert-curated QA/RC resources is biomedical QA. BioASQ is a small-scale biomedical corpus targeting different NLP system capabilities (boolean questions, concept retrieval, text retrieval), initially formulated by experts as part of the CLEF competitions [197, 198, 257]. PubMedQA [122] is a corpus of biomedical literature abstracts that treats the titles of articles as pseudo-questions, most of the abstract as context, and the final sentence of the abstract as the answer (with a small manually labeled section and a larger unlabeled/artificially labeled section). In the healthcare area, CliCR [245] is a Cloze-style dataset of clinical records, and Head-QA [264] is a multimodal multi-choice dataset written to test human experts in medicine, chemistry, pharmacology, psychology, biology, and nursing. emrQA [193] is an extractive dataset of clinical records with questions generated from templates, repurposing annotations from other NLP tasks such as NER. There is also data specifically on the COVID pandemic [180].

Social media. Social media data present a unique set of challenges: user language is less formal, more likely to contain typos and misspellings, and more likely to contain platform-specific phenomena such as hashtags and usernames. So far there are not many such resources. The most notable dataset in this sphere is currently TweetQA [275], which crowdsourced questions and answers for (newsworthy) tweet texts.

Multi-domain. Robustness across domains is a major issue in NLP, and especially in question answering, where models trained on one dataset do not necessarily transfer well to another, even within one domain [280]. However, so far there are very few attempts to create multi-domain datasets that could encourage generalization by design, and, as discussed above, they are not necessarily based on the same notion of "domain". In the sense of "genre", the first one was CoQA [213], combining prompts from children's stories, fiction, high school English exams, news articles, Wikipedia, science, and Reddit articles. It was followed by QuAIL [221], a multi-choice dataset balanced across news, fiction, user stories, and blogs.
In the sense of "topic", two more datasets are presented as "multi-domain": MMQA [100] is an English-Hindi dataset of Web articles that is presented as a multi-domain dataset, but is based on Web articles on the topics of tourism, history, diseases, geography, economics, and environment. In the same vein, MANtIS [199] is a collection of information-seeking dialogues from StackExchange fora across 14 topics (Apple, AskUbuntu, DBA, DIY, ELectronics, English, Gaming, GIS, Physics, Scifi, Security, Stats, Travel, World-building). Manuscript submitted to ACM There are also "collective" datasets, formed as a collection of existing datasets, which may count as "multi-domain" by different criteria. In the sense of "genre", ORB [70] includes data based on news, Wikipedia, fiction. MultiReQA [97] comprises 8 datasets, targeting textbooks, Web snippets, Wikipedia, scientific articles. As in other areas of NLP, the "default" language of QA and RC is English [20] , and most of this survey discusses English resources. The second best-resourced language in terms or QA/RC data is Chinese, which has the counterparts of many popular English resources. Besides SQuAD-like resources [59, 233] , there is shared task data for open-domain QA based on structured and text data [72] . WebQA is an open-domain dataset of community questions with entities as answers, and web snippets annotated for whether they provide the correct answer [147] . ReCO [268] targets boolean questions from user search engine queries. There are also cloze-style datasets based on news, fairy tales, and children's reading material, mirroring CNN/Daily Mail and CBT [60, 61] , as well as a recent sentence-level cloze resource [62] . DuReader [107] is a freeform QA resource based on search engine queries and community QA. In terms of niche topics, there are Chinese datasets focusing on history textbooks [288] and maternity forums [276] . In the third place we have Russian, which a version of SQuAD [75] , a dataset for open-domain QA over Wikidata [134] , a boolean QA dataset [91] , and datasets for cloze-style commonsense reasoning and multi-choice, multi-hop RC [81] . The fourth best resourced language is Japanese, with a Cloze RC dataset [270] , a manual translation of a part of SQuAD [10] , and a commonsense reasoning resource [189] . Three more languages have their versions of SQuAD [210] : French [66, 126] , Vietnamese [187] , and Korean [150] , and there are three more small-scale evaluation sets (independently collected for Arabic [182] ), human-translated to French [10] ). Polish has a small dataset of open-domain questions based on Wikipedia "Did you know...?" data [166] . And, to the best of our knowledge, this is it: not even the relatively well-resourced languages like German necessarily have any monolingual QA/RC data. There is more data for individual languages that is part of multilingual benchmarks, but that comes with a different set of issues ( §5.2). In the absence of data, the researchers resort to machine translation of English resources. For instance, there is such SQuAD data for Spanish [37] , Arabic [182] , Italian [58] , Korean [140] . However, this has clear limitations: machine translation comes with its own problems and artifacts, and in terms of content even the best translations could differ from the questions that would be "naturally" asked by the speakers of different languages. 
The fact that so few languages have high-quality QA/RC resources reflecting the idiosyncrasies and information needs of their speakers says a lot about the current distribution of funding for data development, and about the NLP community's appetite for publishing non-English data at top NLP conferences. There are reports of reviewer bias [220]: such work may be perceived as "niche" and low-impact, which makes it look like a natural candidate for second-tier venues, which in turn makes such work hard to pursue for early-career researchers. This situation is not only problematic in terms of inclusivity and diversity (where it contributes to unequal access to the latest technologies around the globe). The focus on English is also counter-productive because it creates the wrong impression of progress on QA/RC in general, rather than on the subset of QA/RC that happens to be easy in English. For instance, as pointed out by the authors of TydiQA [54], questions that can be solved by string matching are easy in English (a morphologically poor language), but can be very difficult in languages with many morphophonological alternations and compounding.

Another factor contributing to the perception of non-English work as "niche" and low-impact is that many such resources are "replications" of successful English resources, which makes them look derivative (see e.g. the above-mentioned versions of SQuAD). However, conceptually the contribution of such work is arguably comparable to incremental modifications of popular NLP architectures (a genre that does not seem to raise objections of low novelty), while having potentially much larger real-world impact. Furthermore, such work may also require non-trivial adaptations to transfer an existing methodology to a different language, and/or propose first-time innovations. For instance, MATINF [276] is a Chinese dataset jointly labeled for classification, QA and summarization, so that the same data can be used to train for all three tasks. The contribution of Watarai and Tsuchiya [270] is not merely a Japanese version of CBT, but also a methodology to overcome some of its limitations.

One way in which non-English work seems to be easier to publish is multilingual resources. Some of them are data from cross-lingual shared tasks, and there are also independent academic resources (such as the English-Chinese cloze-style XCMRC [157]). But in terms of the number of languages, the spotlight is currently on the following larger-scale resources:
• MLQA [144] targets extractive QA over Wikipedia with partially parallel texts in seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese. The questions are crowdsourced and translated.
• XQuAD [9] is a subset of SQuAD professionally translated into 10 languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi.
• XQA [156] is an open-domain QA dataset targeting entities; it provides training data for English, and test and development data for English and eight other languages: French, German, Portuguese, Polish, Chinese, Russian, Ukrainian, and Tamil.
• TydiQA [54] is the first resource of "natural" factoid questions in ten typologically diverse languages in addition to English: Arabic, Bengali, Finnish, Japanese, Indonesian, Kiswahili, Korean, Russian, Telugu, and Thai.
• XOR QA [11] builds on TydiQA data to pose the task of cross-lingual QA: answering questions where the answer data is unavailable in the same language as the question.
It is a subset of TydiQA with data in seven languages: Arabic, Bengali, Finnish, Japanese, Korean, Russian and Telugu, with English as the "pivot" language (professionally translated).
• XQuAD-R and MLQA-R [222] are based on the above-mentioned XQuAD and MLQA extractive QA resources, recast as multilingual information retrieval tasks.
• MKQA [160] is based on a subset of Natural Questions [137] professionally translated into 26 languages, focusing on "translation-invariant" questions.
While these resources are a very valuable contribution, in multilingual NLP they seem to play a role similar to that of large-scale language models in the development of NLP models: small labs are effectively out of the competition [218]. In comparison with large multilingual leaderboards, monolingual resources are perceived as "niche", less of a valuable contribution, and less deserving of the main-track publications on which the careers of early-stage researchers depend. But such scale is only feasible for industry-funded research: of all the above multilingual datasets, only the smallest one (XQA) was not produced in affiliation with either Google, Apple, or Facebook. Furthermore, scale is not necessarily the best answer: a focus on multilinguality necessarily means missing a lot of the nuance that is only accessible through in-depth work on individual languages performed by experts in those languages.

A key issue in multilingual resources is collecting data that is homogeneous enough across languages to be considered a fair and representative cross-lingual benchmark. That objective necessarily competes with the objective of getting a natural and representative sample of questions in each individual language. To prioritize the latter objective, we would need comparable corpora of naturally occurring multilingual data. This is what happened in XQA [156] (based on the "Did you know...?" Wikipedia question data), but there is not much such data in the public domain. TydiQA [54] attempts to approximate "natural" questions by prompting speakers to formulate questions about topics for which they are shown the header excerpts of Wikipedia articles, but it is hard to tell to what degree this matches real information needs, or samples all the linguistic phenomena that are generally prominent in questions for a given language and should be represented. A popular solution that sacrifices the representativeness of individual languages for cross-lingual homogeneity is translation, as done in MLQA [144], XQuAD [9], and MKQA [160]. However, translationese has many issues. In addition to the high cost, even the best human translation is not necessarily similar to naturally occurring question data, since languages differ in what information is made explicit or implicit [54], and cultures also differ in what kinds of questions typically get asked. A separate (but related) problem is that it is not guaranteed that translated questions will have answers in the target-language data. This issue led XQuAD to translate both questions and texts, MLQA to settle for partial cross-lingual coverage, MKQA to provide only questions and answers without the evidence texts, and XOR QA [11] to pose the task of cross-lingual QA.

One more issue in multilingual NLP that does not seem to have received much attention in QA/RC research is code-switching [237], even though it clearly has high humanitarian value.
For instance, in the US context, better question answering over code-switched English/Spanish data could be highly useful in civil service and education, supporting the learning of immigrant children and the social integration of their parents. So far there are only a few small-scale resources for Hindi [18, 40, 102, 207], Telugu and Tamil [40].
6 REASONING SKILLS
We discussed above how different QA and RC datasets may be based on different understandings of "format" ( §3.1) and "domain" ( §4), but by far the least agreed-upon criterion is "types of reasoning". While nearly every paper presenting an RC or QA dataset also presents some exploratory analysis of a small sample of its data, the categories employed vary too much to enable direct comparisons between resources.
Before discussing this in more detail, let us recap how "reasoning" is defined. In philosophy and logic, "any process of drawing a conclusion from a set of premises may be called a process of reasoning" [30]. Note that this is similar to the definition of "inference": "the process of moving from (possibly provisional) acceptance of some propositions, to acceptance of others" [29]. But this definition does not cover everything that is discussed as "reasoning types" or "skills" in the QA/RC literature. To date, two comprehensive taxonomies of QA/RC "skills" have been proposed in the NLP literature:
• Sugawara and Aizawa [239] and Sugawara et al. [240] distinguish between object tracking skills, mathematical reasoning, logical reasoning, analogy, causal and spatiotemporal relations, ellipsis, bridging, elaboration, metaknowledge, schematic clause relations, and punctuation.
• Schlegel et al. [231] distinguish between operational (bridge, constraint, comparison, intersection), arithmetic (subtraction, addition, ordering, counting, other), and linguistic (negation, quantifiers, conditionals, monotonicity, con-/disjunction) meta-categories, as opposed to temporal, spatial and causal reasoning, reasoning "by exclusion", and "retrieval". They further describe questions in terms of knowledge (factual/intuitive) and linguistic complexity (lexical and syntactic variety, lexical and syntactic ambiguity).
A problem with any taxonomy is that using it to characterize new and existing resources involves expensive fine-grained expert annotation. A frequently used workaround is a kind of keyword analysis of the initial words of the question (which for English means what, where, when and other question words); this was done e.g. in [16, 131, 192] and by Dzendzik et al. [74].
Based on [231, 239, 240], we propose an alternative taxonomy of QA/RC "skills" along the following dimensions:
• Inference ( §6.2.1): "the process of moving from (possibly provisional) acceptance of some propositions, to acceptance of others" [29].
• Retrieval ( §6.2.2): knowing where to look for the relevant information.
• Input interpretation & manipulation ( §6.2.3): correctly understanding the meaning of all the signs in the input, both linguistic and numeric, and performing any operations on them that are defined by the given language/mathematical system (identifying coreferents, summing up, etc.).
• World modeling ( §6.2.4): constructing a valid representation of the spatiotemporal and social aspects of the world described in the text, as well as positioning/interpreting the text itself with respect to the reader and other texts.
• Multi-step ( §6.2.5): performing chains of actions along any of the above dimensions.
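As a concrete (if simplified) illustration of how a single question could be annotated along these dimensions, consider the sketch below. It is our own illustration rather than an annotation scheme used by any of the cited resources, and the label values are hypothetical placeholders; the example question is the comparison example discussed in §6.2.3.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SkillAnnotation:
    """One question annotated along the five (orthogonal) dimensions of the taxonomy.
    The label inventories are illustrative, not fixed."""
    inference: str             # e.g. "deductive", "inductive", "abductive"
    retrieval: str             # e.g. "single-context", "open-world", "unanswerable"
    input_ops: List[str]       # e.g. ["coreference", "comparison", "counting"]
    world_modeling: List[str]  # e.g. ["temporal", "spatial", "belief states"]
    multi_step: bool           # whether several of the above must be chained

# Example: "John wears white, Mary wears black. Who wears darker clothes?"
example = SkillAnnotation(
    inference="deductive",
    retrieval="single-context",
    input_ops=["comparison"],
    world_modeling=[],
    multi_step=False,
)
```

Records of this kind allow the same question to receive labels on several dimensions at once, which is the orthogonality property discussed next.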
A key feature of the taxonomy is that these dimensions are orthogonal: the same question can be described in terms of its linguistic form, the kind of inference required to arrive at the answer, the retrievability of the evidence, the compositional complexity, and the level of world modeling (from generic open-domain questions to questions about character relations in specific books). In a given question, some of these may be more prominent/challenging than others. Our proposal is shown in Figure 3 and discussed in more detail below.
6.2.1 Inference type. Fundamentally, the problem of question answering can be conceptualized as the classification of the relation between the premise (context + question) and the conclusion (a candidate answer) [226]. Then, the type of reasoning performed by the system can be categorized in the terms developed in logic and philosophy 10 . Among the criteria developed for describing different types of reasoning is the direction of reasoning: deductive (from premise to conclusion) vs abductive (from conclusion to the premise that would best justify the conclusion) [69]. Another key criterion is the degree to which the premise supports the conclusion: in deductive reasoning, the hypothesis is strictly entailed by the premise, while in inductive reasoning the support is weaker [106]. Finally, reasoning can be analysed with respect to the kind of support for the conclusion, including analogical reasoning [19], defeasible reasoning ("what normally 11 happens", [133]), and "best explanation" [69].
10 Note that logical reasoning is only a subset of human reasoning: human decisions are not necessarily rational; they may be based on biases or heuristics, or fall prey to different logical fallacies. It is not clear to what extent such human reasoning "shortcuts" should be replicated in machine RC systems, especially given that these systems develop their own biases and heuristics (see §7.1).
While the above criteria are among the most fundamental and well-recognized for describing human reasoning, none of them is actively used to study machine reasoning, at least in the current QA/RC literature. Even though deductive reasoning is both fundamental and the most clearly mappable to what we could expect from machine reasoning, to the best of our knowledge there is so far only one dataset targeting it: LogiQA [155], a collection of multi-choice questions from civil servant exam materials. To further complicate the matter, the above-mentioned terms are sometimes used differently. For instance, ReClor [283] is presented as a resource targeting logical reasoning, but it is based on GMAT/LSAT teaching materials, and much of it actually targets meta-analysis of the logical structure rather than logical reasoning itself (e.g. identifying claims and conclusions in the provided text). CLUTRR [236] is an inductive reasoning benchmark for kinship relations, but the term "inductive" is used in the sense of "inducing rules" (similar to the above definition of "inference") rather than as "non-deductive" (i.e. offering only partial support for the conclusion). A kind of non-deductive reasoning that has historically received a lot of attention in the AI literature is defeasible reasoning [48, 169], which is now making a comeback in NLI [224] (formulated as the task of re-evaluating the strength of the conclusion in the light of an additional premise strengthening/weakening the evidence offered by the original premise).
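Returning to the view of QA as classifying the relation between a premise (context + question) and a conclusion (a candidate answer), the sketch below shows one way this framing can be operationalized for multi-choice questions. It is only an illustration under our own assumptions: the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint are our choices, not something prescribed by the cited work, and label names may differ for other NLI models.

```python
# A minimal sketch: rank candidate answers by how strongly an NLI model judges
# them to be entailed by the premise formed from context + question.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # an assumption; any premise-hypothesis NLI model could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
ENTAILMENT_ID = model.config.label2id.get("ENTAILMENT", 2)

def rank_answers(context: str, question: str, candidates: list) -> str:
    """Return the candidate answer with the highest entailment probability."""
    premise = f"{context} {question}"
    scores = []
    for answer in candidates:
        inputs = tokenizer(premise, answer, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        scores.append(probs[ENTAILMENT_ID].item())
    return max(zip(scores, candidates))[1]

# Example (the comparison question discussed later in this section):
print(rank_answers("John wears white, Mary wears black.",
                   "Who wears darker clothes?", ["John", "Mary"]))
```

Whether such entailment-based ranking reflects deductive or merely inductive support is exactly the kind of distinction that the criteria above make explicit, but that current resources rarely annotate.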
There is also ART [25], an abductive reasoning challenge where the system needs to come up with a hypothesis that better complements incomplete observations.
6.2.2 Retrieval. It could be argued that information retrieval happens before inference: to evaluate a premise and a conclusion, we first have to have them. However, inference can also be viewed as the ranking mechanism of retrieval: NLP systems consider the answer options so as to choose the one offering the strongest support for the conclusion. This is how the current systems approach closed-world reading comprehension tests like RACE [138] or SQuAD [210]. In the open-world setting, instead of a specific text, we have a much broader set of options (a corpus of snippets, a knowledge base, knowledge encoded by a language model, etc.), but fundamentally the task is still to find the best answer out of the available knowledge. We consider two sub-dimensions of the retrieval problem: determining whether an answer exists, and knowing where to look for it.
Answerability. SQuAD 2.0 [209] popularized the distinction between questions that are answerable with the given context and those that are not. However, the distinction is arguably not binary, and at least two resources argue for a 3-point uncertainty scale. ReCO [268] offers boolean questions with "yes", "no" and "maybe" answer options. QuAIL [221] distinguishes between full certainty (answerable with a given context), partial certainty (a confident guess can be made with a given context + some external common knowledge), and full uncertainty (no confident guess can be made even with external common knowledge). A more general definition of unanswerable questions would be: questions that cannot be answered given all the information that the reader has access to. This is different from invalid questions: questions that a human would reject rather than attempt to answer. Table 2 shows examples for different kinds of violations: answers that are impossible to retrieve, loaded questions, ill-formed questions, rhetorical questions, "useless" questions, and others.
Have you stopped beating your wife? - invalid premise (that the wife is beaten)
What is the meaning of life, the universe, and everything? [5] - not specific enough
At what age can I get a driving license? - missing information (in what country?)
Can quantum mechanics and relativity be linked together? - information not yet discovered
What was the cause of the US civil war? [33] - no consensus on the answer
Who can figure out the true meaning of 'covfefe'? - uninterpretable due to language errors
Do colorless ideas sleep furiously? - syntactically well-formed but uninterpretable
What is the sum of angles in a triangle with sides 1, 1, and 10 cm? 14 - such a triangle cannot exist
What have the Romans ever done for us? [41] - rhetorical question
What is the airspeed velocity of a swallow carrying a coconut? [273] - the answer would not be useful to know 15
Table 2. Types of invalid questions
Where to look for the target knowledge? The classical RC case in resources like SQuAD is a single context that is the only possible source of information: in this case, the retrieval problem is reduced to finding the relevant span. When the knowledge is not provided, the system needs to know where to find it, 12 and in this case it may be useful to know whether it is factual (e.g. "Dante was born in Florence") or world knowledge (e.g. "bananas are yellow"). 13
This is the core distinction between the subfields of open-domain QA and commonsense reasoning, respectively. Note that in both of those cases, the source of knowledge is external to the question and must be retrieved from somewhere (Web snippets, knowledge bases, model weights, etc.). The difference is in the human competence: an average human speaker is not expected to have all the factual knowledge, but is expected to have a store of world knowledge (even though the specific subset of that knowledge is culture- and age-dependent). Many resources for the former were discussed in §4. Commonsense reasoning resources deserve a separate survey, but overall, most levels of description discussed in this paper also apply to them. They have the analog of open-world factoid QA (e.g. CommonsenseQA [249], where the task is to answer a multi-choice question without any given context), but more resources are described as "reading comprehension", with multi-choice [114, 192] or cloze-style [286] questions asked in the context of some provided text. Similarly to "domains" in open-world QA (see §4), there are specialist resources targeting specific types of world knowledge (see §6.2.4).
12 For human readers, McNamara and Magliano [172] similarly distinguish between bridging (linking new information, in this case from the question, to previous context) and elaboration (linking information to some external information).
13 Schlegel et al. [230] distinguish between "factual" and "intuitive" knowledge. The latter is defined as that "which is challenging to express as a set of facts, such as the knowledge that a parenthetic numerical expression next to a person's name in a biography usually denotes [their] life span".
6.2.3 Interpreting & manipulating input. This dimension necessarily applies to any question: both humans and machines should have knowledge of the meaning of the individual constituent elements of the input (words, numbers), and the ability to perform operations on them that are defined by the language/shared mathematical system (rather than given in the input). 16 It includes the following subcategories:
• Linguistic skills. SQuAD [210], one of the first major RC resources, predominantly targeted argument extraction and event paraphrase detection. Currently many resources focus on coreference resolution (e.g. Quoref [63], part of DROP [71]). Among the reasoning types proposed in [239, 240], "linguistic skills" also include ellipsis, schematic clause relations, and punctuation. The list is not exhaustive: arguably, any question formulated in a natural language depends on a large number of linguistic categories (e.g. reasoning about temporal relations must involve knowledge of verb tense), and even questions targeting a single phenomenon as it is defined in linguistics (e.g. coreference resolution) also require other linguistic skills (e.g. knowledge of parts of speech). Thus, any analysis based on linguistic skills should allow the same question to belong to several categories, and it is not clear whether we can reliably determine which of them are more "central". Questions (and answers/contexts) could also be characterized in terms of "ease of processing" [172], which is related to the set of linguistic phenomena involved in their surface form. But it probably does not mean the same thing for humans and machines: the latter have a larger vocabulary, do not get tired in the same way, etc.
• Numeric skills.
In addition to the linguistic knowledge required for interpreting numeral expressions, an increasing number of datasets test NLP systems' ability to answer questions that require mathematical operations over information in the question and the input context. DROP [71] involves numerical reasoning over multiple paragraphs of Wikipedia texts. Mishra et al. [177] contribute a collection of small-scale numerical reasoning datasets including extractive, freeform, and multi-choice questions, some of them requiring retrieval of external world knowledge. There are also a number of resources targeting school algebra word problems [136, 173, 234, 259] and multimodal counting benchmarks [4, 42].
• Operations on sets. This category targets operations such as union, intersection, ordering, and determining subset/superset relations, which go beyond the lexical knowledge subsumed by the hypernymy/hyponymy relations. The original bAbI [272] included "lists/sets" questions such as Daniel picks up the football. Daniel drops the newspaper. Daniel picks up the milk. John took the apple. What is Daniel holding? (milk, football). Among the categories proposed by Schlegel et al. [230], the "constraint" skill is fundamentally the ability to pick the subset of members that satisfy an extra criterion.
14 https://philosophy.stackexchange.com/questions/37311/are-all-answers-to-a-contradictory-question-correct-or-are-all-wrong-or-is-it
15 The practical utility of questions is hard to estimate objectively, given the wide range of human interests (especially cross-culturally). Horbach et al. [113] annotate questions for centrality to the given topic, and for whether a teacher would be likely to use that question with human students, but the human agreement on their sample is fairly low. The agreement is likely even lower for more niche, specialist questions: the low agreement on acceptance recommendations in peer review [201] is likely partly due to the fact that different groups of researchers simply do not find each other's research questions equally exciting.
16 The current NLP systems can perform well on QA/RC benchmarks even when they are transformed to become uninterpretable to humans [241]. It is an open question whether we should strive for systems to reject inputs that a human would reject, and on the same grounds.
Some linguistic phenomena highly correlate with certain reasoning operations, but overall these two dimensions are still orthogonal. A prime example is comparison: 17 it is often expressed with comparative degrees of adjectives (in the question or context) and so requires interpretation of those linguistic signs. At the same time, unless the answer is directly stated in the text, it also requires a deductive inference operation. For example: John wears white, Mary wears black. Who wears darker clothes?
6.2.4 World modeling. One of the key psychological theories of human RC is based on mental simulation: when we read, we create a model of the described world, which requires that we "instantiate" different objects and entities, track their locations, and ingest and infer the temporal and causal relations between events [262, 295]. Situation modeling has been proposed as one of the levels of representation in discourse comprehension [263], and it is the basis for the recent "templates of understanding" [73] that include spatial, temporal, causal and motivational elements.
We further add the category of belief states [221], since human readers keep track not only of spatiotemporal and causal relations in a narrative, but also of who-knows-what information. A challenge for psychological research is that different kinds of texts have a different mixture of prominent elements (temporal structure for narratives, referential elements in expository texts, etc.), and the current competing models were developed on the basis of different kinds of evidence, which makes them hard to reconcile [172]. This is also the case for machine RC, and partly explains the lack of agreement about the classification of "types of reasoning" across the literature. Based on our classification, the following resources explicitly target a specific aspect of situation modeling, in either RC (i.e. "all the necessary information is in the text") or commonsense reasoning (i.e. "the text needs to be combined with extra world knowledge") settings: 18
• spatial reasoning: bAbI [272], SpartQA [176], many VQA datasets [e.g. 117, see §3.4.1];
• temporal reasoning: event order (QuAIL [221], TORQUE [188]), event attribution to time (TEQUILA [120], TempQuestions [119]), script knowledge (MCScript [192]), event duration (MCTACO [291], QuAIL [221]), temporal commonsense knowledge (MCTACO [291], TIMEDIAL [203]), some multimodal datasets [80, 117];
• belief states: Event2Mind [212], QuAIL [221];
• causal relations: ROPES [152], QuAIL [221], QuaRTz [247];
• tracking entities: across locations (bAbI [272]), in coreference chains (Quoref [63], resources in the Winograd Schema Challenge family [142, 228]). Arguably the cloze-style resources based on named entities also fall into this category (CBT [112], CNN/DailyMail [110], WhoDidWhat [190]), but they do not guarantee that the masked entity is in some complex relation with its context;
• entity properties and relations: 19 social interactions (SocialIQa [229]), properties of characters (QuAIL [221]), physical properties (PIQA [27], QuaRel [246]), numerical properties (NumerSense [151]).
The text + alternative endings format used in several commonsense datasets like SWAG (see §3.2.4) has the implicit question "What happened next?". These resources cross-cut causality and temporality: much of such data seems to target causal relations (specifically, the knowledge of possible effects of interactions between characters and entities), but also script knowledge, and the format clearly presupposes knowledge of the temporal before/after relation.
A separate aspect of world modeling is meta-analysis skills: the ability of the reader to identify the likely time, place and intent of the text's writer, the narrator, the protagonist/antagonist, as well as to identify stylistic features and other such categories. These skills are considered a separate category by Sugawara et al. [240], and are an important target of the field of literary studies, but so far they have not been systematically targeted in machine RC. That being said, some existing resources include questions formulated to include words like "author" and "narrator" [221]. They are also a part of some resources that were based on existing pedagogical materials, such as those ReClor [283] questions that focus on identifying claims and conclusions in the provided text.
6.2.5 Multi-step reasoning. Answering a question may require one or several pieces of information.
In recent years, a lot of attention has been drawn to what could be called multi-step information retrieval, with resources focusing on "simple" and "complex" questions:
• "Simple" questions have been defined as those that "refer to a single fact of the KB" [32]. In an RC context, this corresponds to the setting where all the necessary evidence is contained in the same place in the text.
• "Complex" questions, accordingly, are questions that rely on several facts [248]. In an RC setting, this corresponds to the so-called multi-hop datasets that necessitate combining information across sentences [127], paragraphs [71], and documents [279]. By definition, it also includes questions that require a combination of context and world knowledge [e.g. 221].
That being said, the "multi-step" skill seems broader than simply combining several facts. Strictly speaking, any question is linguistically complex just because it is a compositional expression. We use some kind of semantic parsing to find the missing pieces of information for the question "Who played Sherlock Holmes, starred in Avengers and was born in London?", but we must rely on the same mechanism to interpret the question in the first place. We may likewise need to perform several inference steps to retrieve the knowledge if it is not explicitly provided, and we regularly make chains of guesses about the state of the world (Sherlock Holmes stories exemplify a combination of these two dimensions).
This section concludes the paper with a broader discussion of reasoning skills: the types of "skills" that are minimally required for our systems to solve QA/RC benchmarks ( §7.1) vs the ones that a human would use ( §7.2). We then proceed to highlight the gaps in the current research, specifically the kinds of datasets that have not been created yet ( §7.3).
A key assumption in the current analyses of QA/RC data in terms of the reasoning skills they target (including our own taxonomy in §6.2) is that the skills that a human would use to answer a given question are also the skills that a model would use. However, that is not necessarily true. Fundamentally, DL models search for patterns in the training data, and they may and do find various shortcuts that happen to also predict the correct answer [90, 103, 118, 241, inter alia]. An individual question may well target e.g. coreference, but if it contains a word that is consistently associated with the first answer option in a multi-choice dataset, the model could potentially answer it without knowing anything about coreference. What is worse, how a given question is answered could change with a different split of the same dataset, a model with a different inductive bias, or, most frustratingly, even a different run of the same model [170]. This means that there is a discrepancy between the reasoning skills that a question seems to target and the skills that are minimally required to "solve" a particular dataset. In the context of the traditional machine learning workflow with training and testing, we need to reconsider the idea that whether or not a given reasoning skill is "required" is a characteristic of a given question; it is rather a characteristic of the combination of that question and the entire dataset. The same limitation applies to the few-shot (or in-context) learning paradigm based on extra-large language models [34], where only a few samples of the target task are presented as examples and no gradient updates are performed.
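To make the in-context learning setup concrete, here is a minimal sketch (our own illustration; the prompt format and the example questions are hypothetical, not taken from any of the cited datasets) of how a few demonstrations and a test question might be packed into a single prompt for a frozen language model:

```python
# A minimal sketch of few-shot ("in-context") prompting for multi-choice QA.
# The demonstrations and the test question are hypothetical; a real evaluation
# would sample them from the target dataset.
demonstrations = [
    ("What color is a ripe banana? (A) yellow (B) purple", "A"),
    ("Which is larger, a cat or a whale? (A) a cat (B) a whale", "B"),
]
test_question = "Who wrote Hamlet? (A) Shakespeare (B) Tolstoy"

prompt = ""
for question, answer in demonstrations:
    prompt += f"Question: {question}\nAnswer: {answer}\n\n"
prompt += f"Question: {test_question}\nAnswer:"

print(prompt)
# The prompt is sent to a frozen language model, which is expected to continue
# it with "A" or "B"; no gradient updates are involved. Superficial regularities
# in the demonstrations (such as the position of the correct option) can bias
# the continuation, which is the kind of artifact discussed in the text.
```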
Conceptually, such models still encapsulate the patterns observed in the language model training data, and so may still be choosing the correct answer option e.g. because there were more training examples with the correct answer listed first (see [289] for an approach to countering this bias). The difference is only that it is much harder to perform the training data analysis and find any such superficial hints.
How can we ever tell that the model is producing the correct answer for the right reasons? There is now enough work in this area to deserve its own survey, but the main directions are roughly as follows:
• creating larger collections of generalization tests, both out-of-domain ( §4) and cross-lingual ( §5.2), on the assumption that as their number grows, the likelihood of the model solving them all with benchmark-specific heuristics decreases;
• work on controlling the signal in the training data, on the assumption that if a deep learning model has a good opportunity to learn some phenomenon, it should do so (although that is not necessarily the case [89]). This direction also includes methodology work on crafting the data to avoid reasoning shortcuts, e.g. using human evaluation to discard questions that humans could answer without considering the full context [194];
• interpretability work on generating human-interpretable explanations for a given prediction, e.g. by context attribution [e.g. 214] or influential training examples [e.g. 208]. However, the faithfulness of such explanations is itself an active area of research [e.g. 202, 281]. The degree to which humans can use explanations to evaluate the quality of the model also varies depending on the model quality and prior belief bias [93].
While we may never be able to say conclusively that a blackbox model relies on the same strategies as a human reader, we should (and, under article 13 of the AI Act proposal, could soon be legally required to 20 ) at least identify the cases in which they succeed and in which they fail, as this is a prerequisite for safe deployment.
20 https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a372-11eb-9585-01aa75ed71a1.0001.02/DOC_1&format=PDF
Section 7.1 discussed the fundamental difficulties with identifying how a blackbox neural model was able to solve a QA/RC task. However, we also have trouble even identifying the processes a human reader would use to answer a question. As discussed in §6.1, there are so far only two studies attempting cross-dataset analysis of reasoning skills according to a given skill taxonomy, and they both target only small samples (50-100 examples per resource). This is due to the fact that such analysis requires expensive expert annotation. Horbach et al. [113] showed that crowdworkers have consistently lower agreement even on annotating question grammaticality, centrality to topic, and the source of information for the answers. What is worse, neither experts nor crowdworkers were particularly successful with annotating "types of information needed to answer this question".
The dimension of our taxonomy ( §6.2) that has received the least attention so far seems to be the logical aspects of the inference performed. Perhaps not coincidentally, this is the most abstract dimension, requiring the most specialist knowledge. However, the logical criterion of the strength of support for the hypothesis is extremely useful: to be able to trust NLP systems in the real world, we would like to know how they handle reasoning with imperfect information.
This makes analysis of the "best-case" inference (assuming a non-heuristic reasoning process based on full interpretation of the questions and inputs) in the existing QA/RC resources a promising direction for future work. It could be bootstrapped by the fact that at least some of the inference types map to question types familiar in the QA/RC literature:
• Questions that involve only interpreting and manipulating (non-ambiguous) linguistic or numerical input fall under deductive reasoning, because the reader is assumed to have a set of extra premises (definitions of words and mathematical operations) shared with the question author.
• Questions about the future state of the world, as well as commonsense questions, necessarily have a weaker link between the premise and the conclusion, and could be categorized as inductive.
• Other question types could target inductive or deductive reasoning, depending on how strong the evidence provided in the premise is: e.g. temporal questions are deductive if the event order strictly follows from the narrative, and inductive if uncertainties are filled in on the basis of script knowledge.
Notwithstanding the numerous datasets created in recent years, the space of unexplored possibilities remains large. Defining what datasets need to be created is itself a part of the progress towards machine NLU, and any such definitions will necessarily improve as we make that progress. At this point we would name the following salient directions.
Linguistic features of questions and/or the contexts that they target. The current pre-trained language models do not acquire all linguistic knowledge equally well or equally fast: e.g. RoBERTa [158] learns the English irregular verb forms already with 100M tokens of pre-training, but struggles with (generally rarer) syntactic island effects even after 1B tokens of pre-training [287]. Presumably, knowledge that is less easily acquired in pre-training will also be less available to the model in fine-tuning. There are a few datasets that focus on questions requiring a specific aspect of linguistic reasoning, but there are many untapped dimensions. How well do our models cope with questions that a human would answer using their knowledge of e.g. scope resolution, quantifiers, or verb aspect?
Pragmatic properties of questions. While deixis (contextual references to people, time and place) clearly plays an important role in multimodal and conversational QA resources, there does not seem to be much work focused on it specifically (although many resources cited in §3.4.3 contain such examples). Another extremely important direction is factuality: there is already much research on fact-checking [13, 14, 105, 254], but beyond that, it is also important to examine questions for presuppositions.
QA for the social good. A very important dimension of the practical utility of QA/RC data is their domain ( §4): domain adaptation is generally very far from being solved, and that includes transfer between QA/RC datasets [83, 280]. There are many domains that have not received much attention because they are not backed by commercial interests, and are not explored by academics because there is no "wild" data like StackExchange questions that could back them up. For instance, QA data that could be used to train FAQ chatbots for the education and nonprofit sectors could make a lot of difference for low-resource communities, but is currently notably absent.
And beyond the English-speaking world, high-quality QA/RC data is generally scarce (see §5.1).
Documented data. Much work has been invested in the investigation of biases in the current resources (see §7.1), and the conclusion is clear: if the data has statistically conspicuous "shortcuts", we have no reason to expect neural nets not to pick up on them [154]. Much of the work discussed in this survey proposed various improvements to data collection methodology (which deserves a separate survey), but it is hard to guarantee the absence of spurious patterns in naturally occurring data, and it gets harder as dataset size grows [87]. The field of AI ethics is currently working on documenting speaker demographics and possible social biases [21, 88], with the idea that this would then be useful for model certification [178]. Given that neither social nor linguistic spurious patterns in the data are innocuous [219], we need a similar practice for documenting spurious patterns in the data we use. New datasets will be a lot more useful if they come with documented limitations, rather than with the impression that there are none.
The number of QA/RC datasets produced by the NLP community is large and growing rapidly. We have presented the most extensive survey of the field to date, identifying the key dimensions along which the current datasets vary. These dimensions provide a conceptual framework for evaluating current and future resources in terms of their format, domain, and target reasoning skills. We have categorized over two hundred datasets while highlighting the gaps in the current literature, and we hope that this survey will be useful both for NLP practitioners looking for data and for those seeking to push the boundaries of QA/RC research.
REFERENCES
X-WikiRE: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension ComQA: A Community-Sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters VQD: Visual Query Detection In Natural Scenes TallyQA: Answering Complex Counting Questions The Hitchhiker's Guide to the Galaxy (del rey trade pbk Dialogue Acts in Verbmobil 2 Open-Domain Question Answering Goes Conversational via Question Rewriting VQA: Visual Question Answering On the Cross-Lingual Transferability of Monolingual Representations Multilingual Extractive Reading Comprehension by Runtime Machine Translation XOR QA: Cross-Lingual Open-Retrieval Question Answering Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems Generating Fact Checking Explanations MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims Multiple-Choice Item Format. The TESOL Encyclopedia of English Language Teaching MS MARCO: A Human Generated MAchine Reading COmprehension Dataset Embracing Data Abundance: BookTest Dataset for Reading Comprehension The First Cross-Script Code-Mixed Question Answering Corpus Analogy and Analogical Reasoning The #BenderRule: On Naming the Languages We Study and Why It Matters Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science Semantic Parsing on Freebase from Question-Answer Pairs Modeling Biological Processes for Reading Comprehension STARC: Structured Annotations for Reading Comprehension Annual Meeting of the Association for Computational Linguistics.
ACL, Online Abductive Commonsense Reasoning Experience Grounds Language PIQA: Reasoning about Physical Commonsense in Natural Language SubjQA: A Dataset for Subjectivity and Review Comprehension Reasoning Multiple Choice Questions: An Introductory Guide Large-Scale Simple Question Answering with Memory Networks What Question Answering Can Learn from Trivia Nerds Ilya Sutskever, and Dario Amodei. 2020. Language Models Are Few-Shot Learners MultiWOZ -A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling DoQA-Accessing Domain-Specific FAQs via Conversational QA Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering The TechQA Dataset Evaluation of Text Generation: A Survey Code-Mixed Question Answering Challenge: Crowd-Sourcing Data and Techniques Monty Python (Comedy troupe), Handmade Films, and Criterion Collection (Firm) Counting Everyday Objects in Everyday Scenes Evaluating Question Answering Evaluation MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics Reading Wikipedia to Answer Open-Domain Questions Open-Domain Question Answering HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data Logical Models of Argument Adversarial TableQA: Attention Supervision for Question Answering on Tables QuAC: Question Answering in Context Decontextualization: Making Sentences Stand-Alone Simple and Effective Multi-Paragraph Reading Comprehension BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. TACL Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge TutorialVQA: Question Answering Dataset for Tutorial Videos Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs Enabling Deep Learning for Large Scale Question Answering in Italian A Span-Extraction Dataset for Chinese Machine Reading Comprehension Dataset for the First Evaluation on Chinese Machine Reading Comprehension Consensus Attention-Based Neural Networks for Chinese Reading Comprehension A Sentence Cloze Dataset for Chinese Machine Reading Comprehension Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers Logic and Probability FQuAD: French Question Answering Dataset Wizard of Wikipedia: Knowledge-Powered Conversational Agents Learning Hybrid Representations to Retrieve Semantically Equivalent Questions The Stanford Encyclopedia of Philosophy ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs Overview of the NLPCC 2017 Shared Task: Open Domain Chinese Question Answering To Test Machine Comprehension, Start by Defining Comprehension English Machine Reading Comprehension Datasets: A Survey SberQuAD -Russian Reading Comprehension Dataset: Description and Analysis Can You Unpack That? 
Learning to Rewrite Questions-in-Context Key-Value Retrieval Networks for Task-Oriented Dialogue What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models ELI5: Long Form Question Answering Temporal Reasoning via Audio Question Answering Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian IIRC: A Dataset of Incomplete Information Reading Comprehension Questions MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension Yuta Nakashima, and Teruko Mitamura. 2020. A Dataset and Baselines for Visual Question Answering on Art Evaluating Models' Local Decision Boundaries via Contrast Sets Question Answering Is a Format; When Is It Useful? Competency Problems: On Finding and Removing Artifacts in Language Data Datasheets for Datasets Posing Fair Generalization Tasks for Natural Language Inference Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets DaNetQA: A Yes/No Question Answering Dataset for the Russian Language Assessing BERT's Syntactic Abilities On the Interaction of Belief Bias and Explanations SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning Iqa: Visual Question Answering in Interactive Environments 2021. Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models IJCNLP-2017 Task 5: Multi-Choice Question Answering in Examinations Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering MMQA: A Multi-Domain Multi-Lingual Question-Answering Framework for English and Hindi AmazonQA: A Review-Based Question Answering Task Transliteration Better than Translation? Answering Code-Mixed Questions over a Knowledge Base Annotation Artifacts in Natural Language Inference Data ANTIQUE: A Non-Factoid Question Answering Benchmark Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by ClaimBuster Inductive Logic DuReader: A Chinese Machine Reading Comprehension Dataset from Real-World Applications Meanings and configurations of questions in English The ATIS Spoken Language Systems Pilot Corpus Teaching Machines to Read and Comprehend WikiReading: A Novel Large-Scale Language Understanding Task over Wikipedia The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations Oier Lopez de Lacalle, and Montse Maritxalar. 2020. 
Linguistic Appropriateness and Pedagogic Usefulness of Reading Comprehension Questions Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering SRI's Amex Travel Agent Data TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering Adversarial Examples for Evaluating Reading Comprehension Systems TempQuestions: A Benchmark for Temporal Question Answering TEQUILA: Temporal Question Answering over Knowledge Bases FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase PubMedQA: A Dataset for Biomedical Research Question Answering Clevr: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension Learning The Difference That Makes A Difference With Counterfactually-Augmented Data Project PIAF: Building a Native French Question-Answering Dataset Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences UnifiedQA: Crossing Format Boundaries With a Single QA System DeepStory: Video Story QA by Deep Embedded Memory Networks Dialog State Tracking Challenge 5 Handbook v The NarrativeQA Reading Comprehension Challenge SCDE: Sentence Cloze Dataset with High Quality Distractors From Examinations Defeasible Reasoning RuBQ: A Russian Dataset for Question Answering over Wikidata Hurdles to Progress in Long-Form Question Answering Learning to Automatically Solve Algebra Word Problems Natural Questions: A Benchmark for Question Answering Research RACE: Large-Scale ReAding Comprehension Dataset From Examinations ODSQA: Open-Domain Spoken Question Answering Dataset Semi-Supervised Training Data Generation for Multilingual Question Answering TVQA: Localized, Compositional Video Question Answering The Winograd Schema Challenge Zero-Shot Relation Extraction via Reading Comprehension MLQA: Evaluating Cross-Lingual Extractive Question Answering Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension Molweni: A Challenge Multiparty Dialogues-Based Machine Reading Comprehension Dataset with Discourse Structure Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering Entity-Relation Extraction as Multi-Turn Question Answering A New Multi-Choice Reading Comprehension Dataset for Curriculum Learning KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension Birds Have Four Legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models Reasoning Over Paragraph Effects in Situations Microsoft COCO: Common Objects in Context How Can We Accelerate Progress Towards Human-like Linguistic Generalization? LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning XQA: A Cross-Lingual Open-Domain Question Answering Dataset XCMRC: Evaluating Cross-Lingual Machine Reading Comprehension RoBERTa: A Robustly Optimized BERT Pretraining Approach World Knowledge for Reading Comprehension: Rare Entity Prediction with Hierarchical LSTMs Using External Descriptions MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems Challenging Reading Comprehension on Daily Conversation: Passage Completion on Multiparty Dialog Multiple-Choice Tests Can Support Deep Learning! 
Proceedings of the Atlantic Universities A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering Developing Certification Exam Questions: More Deliberate Than You May Think Open Dataset for Development of Polish Question Answering Systems International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE The Natural Language Decathlon: Multitask Learning as Question Answering Some Philosophical Problems From the Standpoint of Artificial Intelligence BERTs of a Feather Do Not Generalize Together: Large Variability in Generalization across Models with Similar Test Set Performance Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference Toward a Comprehensive Model of Comprehension A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering AmbigQA: Answering Ambiguous Open-Domain Questions SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning Bhavdeep Sachdeva, and Chitta Baral. 2020. Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks Model Cards for Model Reporting InScript: Narrative texts annotated with script information COVID-QA: A Question Answering Dataset for COVID-19 Shared Task: The Story Cloze Test Neural Arabic Question Answering MarioQA: Answering Questions by Watching Gameplay Videos SemEval-2017 Task 3: Community Question Answering SemEval-2015 Task 3: Answer Selection in Community Question Answering Hamdy Mubarak, abed Alhakim Freihat, Jim Glass, and Bilal Randeree A Vietnamese Dataset for Evaluating Machine Reading Comprehension TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions A Method for Building a Commonsense Inference Dataset Based on Basic Events Who Did What: A Large-Scale Person-Centered Cloze Dataset MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge SemEval-2018 Task 11: Machine Comprehension Using Commonsense Knowledge emrQA: A Large Corpus for Question Answering on Electronic Medical Records The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context Compositional Semantic Parsing on Semi-Structured Tables Generating Natural Questions from Images for Multimodal Assistants Overview of CLEF Question Answering Track Overview of the CLEF Question Answering Track Introducing MANtIS: A Novel Multi-Domain Information Seeking Dialogues Dataset Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data The NIPS Experiment Learning to Deceive with Attention-Based Explanations TIMEDIAL: Temporal Commonsense Reasoning in Dialog A Survey on Neural Machine Reading Comprehension Analyzing and Characterizing User Intent in Information-Seeking Conversations Coached Conversational Preference Elicitation: A Case Study in Understanding Movie Preferences Answer Ka Type Kya He?": Learning to Classify Questions in Code-Mixed Language Richard Socher, and Caiming Xiong. 2020. 
Explaining and Improving Model Behavior with k Nearest Neighbor Representations Know What You Don't Know: Unanswerable Questions for SQuAD SQuAD: 100,000+ Questions for Machine Comprehension of Text Neural Unsupervised Domain Adaptation in NLP-A Survey Event2Mind: Commonsense Inference on Events, Intents, and Reactions CoQA: A Conversational Question Answering Challenge Why Should I Trust You?": Explaining the Predictions of Any Classifier Beyond Accuracy: Behavioral Testing of NLP Models with CheckList MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning How the Transformers Broke NLP Leaderboards Changing the World by Changing the Data What Can We Do to Improve Peer Review in NLP Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks LAReQA: Language-Agnostic Answer Retrieval from a Multilingual Pool Multi-Domain Multilingual Question Answering Thinking Like a Skeptic: Defeasible Inference in Natural Language Does It Care What You Asked? Understanding Importance of Verbs in Deep Learning QA System Learning Answer-Entailing Structures for Machine Comprehension Interpretation of Natural Language Rules in Conversational Machine Reading WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale Social IQa: Commonsense Reasoning about Social Interactions Beyond Leaderboards: A Survey of Methods for Revealing Weaknesses in Natural Language Inference Data and Models A Framework for Evaluation of Machine Reading Comprehension Gold Standards A Survey of Available Corpora for Building Data-Driven Dialogue Systems DRCD: A Chinese Machine Reading Comprehension Dataset Automatically Solving Number Word Problems by Semantic Parsing and Reasoning Overview of the NTCIR-11 QA-Lab Task CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text 2020. A Survey of Code-Switched Speech and Language Processing NLQuAD: A Non-Factoid Long Question Answering Data Set An Analysis of Prerequisite Skills for Reading Comprehension Evaluation Metrics for Machine Reading Comprehension: Prerequisite Skills and Readability Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets A Corpus of Natural Language for Visual Reasoning A Corpus for Reasoning about Natural Language Grounded in Photographs DREAM: A Challenge Data Set and Models for Dialogue-Based Reading Comprehension CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension QuaRel: A Dataset and Models for Answering Questions about Qualitative Relationships QuaRTz: An Open-Domain Dataset of Qualitative Relationship Questions The Web as a Knowledge-Base for Answering Complex Questions CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge MultimodalQA: Complex Question Answering Over Text, Tables and Images MovieQA: Understanding Stories in Movies through Question-Answering MISC: A Data Set of Information-Seeking Conversations Shifting the Baseline: Single Modality Performance on Visual Navigation & QA FEVER: a Large-scale Dataset for Fact Extraction and VERification Informing the Design of Spoken Conversational Search: Perspective Paper NewsQA: A Machine Comprehension Dataset Ion Androutsopoulos, and Georgios Paliouras. 2015. 
An Overview of the BIOASQ Large-Scale Biomedical Semantic Indexing and Question Answering Competition Towards Machine Comprehension of Spoken Content: Initial TOEFL Listening Comprehension Test by Machine Annotating Derivations: A New Evaluation Strategy and Dataset for Algebra Word Problems TableQA: Question Answering on Tabular Data Best Practices for the Human Evaluation of Automatically Generated Text Temporal Order Relations in Language Comprehension Strategies of Discourse Comprehension HEAD-QA: A Healthcare Dataset for Complex Reasoning Building a Question Answering Test Collection Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions Universal Adversarial Triggers for Attacking and Analyzing NLP ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion Improving Question Answering for Event-Focused Questions in Temporal Collections of News Articles Developing Dataset of Japanese Slot Filling Quizzes Designed for Evaluation of Machine Reading Comprehension Jack the Reader -A Machine Reading Framework Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks Python (Monty) Pictures, and Columbia TriStar Home Entertainment (Firm) Learning for Semantic Parsing with Statistical Machine Translation TWEETQA: A Social Media Focused Question Answering Dataset MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization WikiQA: A Challenge Dataset for Open-Domain Question Answering FriendsQA: Open-Domain Question Answering on TV Show Transcripts HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC On the Faithfulness Measurements for Model Interpretations Towards Data Distillation for End-to-End Spoken Conversational Question Answering ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference HellaSwag: Can a Machine Really Finish Your Sentence ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension When Do You Need Billions of Words of Pretraining Data? One-Shot Learning for Question-Answering in Gaokao History Challenge Calibrate Before Use: Improving Few-Shot Performance of Language Models Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning Going for a Walk": A Study of Temporal Commonsense Understanding TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and Reading: A Comprehensive Survey on Open-Domain Question Answering Uncovering the Temporal Context for Video Question Answering Situation Models, Mental Simulations, and Abstract Concepts in Discourse Comprehension