Towards a Science of Human-AI Decision Making: A Survey of Empirical Studies
Vivian Lai, Chacha Chen, Q. Vera Liao, Alison Smith-Renner, Chenhao Tan
2021-12-21

As AI systems demonstrate increasingly strong predictive performance, their adoption has grown in numerous domains. However, in high-stakes domains such as criminal justice and healthcare, full automation is often not desirable due to safety, ethical, and legal concerns, yet fully manual approaches can be inaccurate and time-consuming. As a result, there is growing interest in the research community to augment human decision making with AI assistance. Besides developing AI technologies for this purpose, the emerging field of human-AI decision making must embrace empirical approaches to form a foundational understanding of how humans interact and work with AI to make decisions. To invite and help structure research efforts towards a science of understanding and improving human-AI decision making, we survey recent literature of empirical human-subject studies on this topic. We summarize the study design choices made in over 100 papers in three important aspects: (1) decision tasks, (2) AI models and AI assistance elements, and (3) evaluation metrics. For each aspect, we summarize current trends, discuss gaps in current practices of the field, and make a list of recommendations for future research. Our survey highlights the need to develop common frameworks to account for the design and research spaces of human-AI decision making, so that researchers can make rigorous choices in study design, and the research community can build on each other's work and produce generalizable scientific knowledge. We also hope this survey will serve as a bridge for HCI and AI communities to work together to mutually shape the empirical science and computational technologies for human-AI decision making.

Thanks to recent advances, AI has become a ubiquitous technology and has been introduced into high-stakes domains such as healthcare, finance, criminal justice, hiring, and more [8, 33, 38, 72, 142, 150]. To prevent hazardous consequences of failures, complete automation is often not desirable in such domains. Instead, AI systems are often introduced to augment or assist human decision makers, by providing a prediction or recommendation for a decision task that humans can choose to follow or ignore in making their own decision. Besides predictions, current AI technologies can provide a range of other capabilities to help humans gauge the model predictions and make better final decisions, such as providing performance metrics of the model, or uncertainty and explanations for the prediction. In this paper, we refer to these capabilities as different AI assistance elements that an AI system can choose to provide. We refer to this general paradigm as human-AI decision making, though in the relevant literature we found a multitude of generic terms used, such as human-AI interaction, human-AI collaboration, and human-AI teaming. Figure 1 shows that the number of papers on this topic has been growing dramatically in the past five years. Many of these papers made technical contributions for AI to better support human decisions, such as developing new algorithms to generate explanations for model predictions.
Meanwhile, the research community is increasingly recognizing the importance of empirical studies of human-AI decision making that involve human subjects performing decision tasks. These studies are not only necessary to evaluate the effectiveness of AI technologies in assisting decision making, but also to form a foundational understanding of how people interact with AI to make decisions. This understanding can serve multiple purposes, including: (1) to inform new AI techniques that provide more effective assistance or are more human compatible; (2) to guide practitioners in making technical and design choices for more effective decision-support AI systems; (3) to provide input for human-centric policy, infrastructure, and practices around AI for decision making, such as regulatory requirements on when AI can or cannot be used to augment certain human decisions [132].

However, empirical human-subject studies on human-AI decision making are distributed across multiple research communities, asking diverse research questions and adopting various methodologies. Currently there is a lack of a coherent overview of this area, let alone coherent practices in designing and conducting these studies, hindering concerted research efforts and the emergence of scientific knowledge. We recognize several challenges to coherence. First, empirical studies of human-AI decision making are conducted in various domains with different decision tasks. Without investigating the scope of these tasks and their impact, we may not be able to generalize from individual findings. Second, human interactions with AI are enabled and augmented by the affordances of chosen AI assistance elements. Individual empirical studies tend to focus on a small set of AI assistance elements. There is a lack of common frameworks to understand how results for these AI assistance elements generalize and, therefore, their effect on human interactions with AI. Lastly, the design of human-subject studies is inherently complex, varying with the research questions, disciplinary practices, accessible subjects, and other resources. This challenge is further exacerbated by the fact that methodologies, including evaluation metrics, to study human-AI interaction are still at an early development stage.

To facilitate coherence and develop a rigorous science of human-AI decision making, we provide in this survey an overview of the current state of the field, focusing on empirical human-subject studies of human-AI decision making. We focus on studies with the goal of evaluating, understanding, and/or improving human performance and experience for a decision making task, rather than improving the model. This scope differentiates our survey from prior surveys on empirical studies of human-AI interactions that either deviate from the scope of decision making or focus on only one aspect of it, like trust [63, 89, 126, 127, 141, 146]. With the above-mentioned challenges in mind, our survey focuses on analyzing three aspects of the study design choices made in the surveyed papers: the decision tasks, the types of AI and AI assistance elements, and the evaluation metrics. For each aspect we summarize current trends, identify potential gaps, and provide recommendations for future work.

The remainder of this survey paper is structured as follows. We first discuss the scope of papers surveyed, and the methodology for paper inclusion and coding. We then present our analysis for each of the three areas mentioned above, with summary tables provided.
We conclude the survey with a call to action for developing common frameworks for the design and research spaces of human-AI decision making, and for the mutual shaping of empirical science and computational technologies in this emerging area. To allow for easy access to the papers that we cite, they are available at https://haidecisionmaking.github.io.

In this section, we define the scope of our survey in detail and describe how we selected papers to include.

Survey scope and paper inclusion criteria. The focus of this survey is on empirical human-subject studies of human-AI decision making, where the goal is to evaluate, understand, and/or improve human performance and experience for a decision making task, rather than to improve the model. As such, we specify the following inclusion criteria:
• The paper must include evaluative human-subject studies. We thus exclude purely formative studies that focus on exploring user needs to inform the design of AI systems, often qualitatively.
• The paper must target a decision making task, thus we exclude tasks of other purposes (e.g., debugging and other forms of improving the model, co-creation, gaming).
• The task must involve and focus on studying human decision makers, thus we exclude papers on AI automation or other AI stakeholders (model developers, ML practitioners). However, we do not limit our scope to studies that implement complete decision making processes, but also include studies that claim to evaluate some aspects of decision makers' perceptions, such as their understanding, satisfaction, and perceived fairness of the AI.

Search strategy. In addition to papers that we were already aware of, we looked through proceedings of premier conferences where AI and human-computer interaction (HCI) work is published, from 2018 to 2021, to identify papers that fit the criteria mentioned above. Specifically, the conferences we searched include the ACM CHI Conference on Human Factors in Computing Systems, ACM Conference on Computer-supported Cooperative Work and Social Computing, ACM Conference on Fairness, Accountability, and Transparency, ACM Conference on Intelligent User Interfaces, Conference on Empirical Methods in Natural Language Processing, Conference of the Association for Computational Linguistics, and Conference of the North American Chapter of the Association for Computational Linguistics. We focused on NLP conferences in AI because (1) the fraction of papers with empirical studies on human subjects is low in AI (including NLP) conferences such as AAAI and NeurIPS; and (2) the expertise of the authors enables regular examination of all papers in NLP conferences. We further expanded the considered papers by following relevant references from each paper and examining the papers that cite them. An initial search process yielded over 130 papers. All the authors then looked through the list and discussed with each other to exclude out-of-scope papers, resulting in over 80 in-scope papers. We collated them in a spreadsheet and then started coding these papers by the three study design choices they make: decision tasks, types of AI and AI assistance elements, and evaluation metrics. We started the coding by having one author extract relevant information from each paper, such as what kinds of decision tasks, AI models and assistance elements, and evaluation metrics were used in the study. A second round of coding was then performed, focusing on merging similar codes and grouping related codes into areas.
For example, decision tasks were grouped by domains, as shown in Table 1. For the second-round coding, all authors were assigned to look through the initial codes for one of the three study design choices. All authors met regularly to discuss the grouping and ambiguous cases to ensure the integrity of the process, until consensus on the codes and grouping, as presented in the summary tables, was reached. We provide summary tables for each study design choice; these tables provide a quick overview of the literature space.

We start by reviewing decision tasks that researchers have used to conduct empirical studies of human-AI decision making. We group the decision tasks used in prior studies based on their application domains (e.g., healthcare, education, etc.). To facilitate consideration of how the choices of decision tasks may impact the generalizability of study results, we highlight four dimensions that differ across these tasks, namely risk, required expertise, subjectivity, and AI for emulation vs. discovery, to help interpret results in these studies. We group the decision tasks used in the surveyed papers by their application domains, as summarized in Table 1.

Law & Civic. This domain includes tasks in the justice system and for civil purposes. The most commonly used task is recidivism prediction, which has attracted a lot of interest since the ProPublica article on biases in recidivism prediction algorithms used by the criminal justice system in the United States [6]. Popular datasets for recidivism prediction and the discussed variants include COMPAS [6], rcdv [121], and ICPSR [138]. Given these datasets, how these studies defined recidivism prediction varied. A common formulation is to predict whether a person with a particular profile will recidivate, i.e., be rearrested within two years of their most recent crime [94, 139, 143] or reoffend before bail [58]. Slight variations of this definition include: (1) predicting if a rearrest is for a violent crime within two years [54, 55] or (2) predicting if the defendant will reoffend or not, without the two-year time limit [39, 115]. In comparison, Green and Chen [54, 55] define the recidivism prediction task not as a binary prediction, but as assessing the likelihood that the defendant will commit a crime or fail to appear in court if they are released before trial, on a scale from 0% to 100% in intervals of 10%. This more fine-grained question could potentially make it harder for human subjects to assess and use the model output. In a slightly different set-up, Anik and Bunt [7], Harrison et al. [61], and Lakkaraju et al. [82] define the decision task as either predicting whether a defendant is released on bail or predicting one of four bail outcomes: (1) the defendant is not arrested while out on bail and appears for further court dates (No Risk); (2) the defendant fails to appear for further court dates (FTA); (3) the defendant commits a non-violent crime (NCA); and (4) the defendant commits a violent crime (NVCA) when released on bail. Prior works also explore how AI assistance can be applied to civic activities in the public sector. For example, De-Arteaga et al. [35] examine the child maltreatment hotline and develop a model that assists call workers in identifying potential high-risk cases.

Medicine & Healthcare. This domain includes tasks related to clinical decision making, ranging from medical diagnosis to subtasks such as medical image search.
The general formulation of medical diagnosis is to predict whether a patient has a disease given a list of symptoms or other information about the patient. Researchers have studied AI assistance for a range of medical diagnosis tasks, including general disease diagnosis [82], COVID-19 diagnosis [137], and balance disorder diagnosis [22].

Table 1. Decision tasks used in the surveyed papers, grouped by application domain.
Law & Civic: recidivism prediction [58, 94, 139, 143] and its slight variations [39, 115], likelihood to recidivate [54, 55, 94], bail outcome prediction [7, 61, 82], child maltreatment risk prediction [35].
Medicine & Healthcare: medical disease diagnosis [82], cancer image search [24], cancer image classification [73], COVID-19 diagnosis [137], balance disorder diagnosis [22], clinical notes annotation/medical coding [90], stroke rehabilitation assessment [86, 87].
Finance & Business: income prediction [62, 115, 144, 155], loan approval [15, 55, 139], loan risk prediction [29], sales forecast [36, 97], property price prediction [1, 111], apartment price prediction [101], selecting overbooked airline passengers for re-routing [15], determining whether to freeze bank accounts due to money laundering suspicion [15], stock price prediction [16], marketing email prediction [60], dynamically pricing car insurance premiums [15].
Education: students' performance forecasting [36, 37, 144], student admission prediction [7, 28], student dropout prediction [82], LSAT question answering [13].
Leisure: music recommendation [76, 77], movie recommendation [78], song rank order prediction [95], speed dating [96, 152], Facebook news feed prioritization [112], Quizbowl [43], draw-and-guess [23], word guessing [48], chess playing [32], plant classification [147], goods division [88].
Professional: job promotion [15], meeting scheduling assistance [74], email topic classification [125], cybersecurity monitoring [42], profession prediction [94], military planning (monitor and direct unmanned vehicles) [130].
Artificial: alien medicine recommendation [79, 104], alien recipe recommendation [79, 104], jellybean counting [109], broken glass prediction [153], defective object pipeline [11, 12], news reading time prediction [134], water pipe failure prediction [10], math questions [45].
Generic: question answering [27, 52], image classification [3], review sentiment analysis [13, 62, 106].
Others: deception detection [80, 81, 94], forest cover prediction [143], toxicity classification [26], nutrition prediction [20, 21], person weight estimation [95], attractiveness estimation [95], activity recognition [92, 107], emotion analysis [128], religion prediction [114].

Another popular area is imaging-related assistance to help medical staff make better decisions during medical diagnosis. For example, Cai et al. [24] developed a tool for pathologists to search for similar images when diagnosing prostate cancer. Other imaging tasks include interpreting chest x-rays [73]. In addition, Lee et al. [86, 87] investigate a support system for stroke rehabilitation assessment, which assists physical therapists in assessing patients' progress. Due to the difficulty of understanding and predicting biological processes in healthcare, these tasks can be difficult even for medical experts (e.g., pathologists, radiologists). Lastly, unlike the previously mentioned studies where the focus is on medical diagnosis, Levy et al. [90] investigate the effects of AI assistance on annotating medical notes.

Finance & Business. This domain includes decisions related to income, businesses, and properties.
Popular datasets used in income and credit prediction include the Adult Income dataset in the UCI Machine Learning Repository [113] and Lending Club [129]. A common income prediction task, driven by the Adult Income dataset [113], is to predict whether a person with a particular profile earns more than $50K annually [62, 115, 144, 155]. Note that the $50K threshold is outdated and somewhat arbitrary given inflation. Other similar tasks include loan approval (e.g., assessing the likelihood of an applicant defaulting on a loan) [15, 55, 139], loan risk prediction [29], and freezing of bank account prediction [15]. In addition to income and credit prediction in the financial domain, prior work explores how AI assistance can help make other business-related decisions. Classification tasks include sales forecasting, where Dietvorst et al. [36] ask participants to predict the rank (1 to 50) of individual U.S. states in terms of the number of airline passengers that departed from that state in 2011; marketing email prediction [60], where the task is to predict the better email to send given customers' reactions; and predicting overbooked airline flights [15]. Regression tasks include forecasting monthly sales of Ahold Delhaize's stores [97], property price prediction [1, 111], apartment rent prediction [101], and car insurance prediction [15]. Lastly, Biran and McKeown [16] ask participants to decide if they would buy a stock given various AI assistance.

Education. This domain includes decisions performed within the education system. Most tasks are broadly about forecasting student performance. Different variations exist: (1) predicting how well students perform in a given program [36]; (2) predicting if a student will not graduate on time or will drop out [82]; (3) predicting students' grades in tests such as math exams [37, 144]; and (4) making admission decisions [7, 28]. Bansal et al. [13] also considered test questions from the Law School Admission Test (LSAT).

Leisure. This domain includes tasks serving entertainment purposes. Prior works have explored AI assistance for a range of leisure activities, e.g., recommending music [76, 77], recommending movies [78], predicting songs' chart ranking (i.e., song popularity) [95], and reordering a user's Facebook news feed [112]. Other works used gamified tasks, such as predicting if a person would date another person given a profile [96, 152], classifying types of leaves [147], and distributing goods fairly [88]. Games are also used, such as Quizbowl [43], draw-and-guess [23], word-guessing [48], and chess playing [32].

Professional. This domain includes tasks related to employment and professional progress. Binns et al. [15] define a task to predict whether a person's profile would receive a promotion. Liu et al. [94] define a task to predict a person's occupation given a short biography. Other tasks not directly related to jobs include classifying email topics [125], AI-assisted meeting scheduling [74], military planning via monitoring and directing unmanned vehicles [130], and cybersecurity monitoring [42].

Artificial. This domain includes tasks that are artificial or fictional, usually made up to explore specific research questions. Lage et al. [79] and Narayanan et al. [104] created two fictional tasks to evaluate the effect of providing model explanations: (1) predicting aliens' food preferences in various settings and (2) recommending personalized treatment strategies for various fictional symptoms.
Other tasks include predicting the number of jellybeans in an image [109], predicting water pipe failure [10], predicting news reading time [134], predicting broken glass [153], predicting defective objects in a pipeline [11, 12], and answering math questions [45]. Artificial tasks have the advantage of being easily accessible to lay people and allowing researchers to control for confounding factors. The flip side is that results obtained from artificial tasks may not generalize to real applications.

Generic. Generic tasks are ones without specified applications and can be applied to different domains. These include AI benchmarks where crowdsourced datasets are used to test how well AI models can emulate human intelligence, such as object recognition in images (e.g., horses, trains) [3] and question answering [27, 52]. Another popular generic task is review sentiment analysis, which is performed on various content such as movie reviews [62, 106], beer reviews [13], and book reviews [13].

Others. Finally, we list decision tasks that do not fit in any of the domains above: attractiveness estimation [95]; activity recognition (e.g., exercise [92] and kitchen [107]); deception detection, predicting if a hotel review is deceptive [80, 81, 94]; toxicity classification [26], predicting the external consensus of whether a comment is toxic or not; predicting a person's weight based on an image [95]; predicting the nutrition value of a dish given an image [20, 21]; religion prediction [114], predicting whether text is about Christianity or atheism; emotion analysis [128], predicting the emotion of text; and forest cover prediction, predicting if an area is covered by spruce-fir forest [143].

Given the wide variety of decision tasks that have been studied, it is important to understand how findings generalize across tasks. Although domain can serve as a thematic umbrella, it is not useful for evaluating generalizability because each domain includes tasks with drastically different properties (e.g., medicine & healthcare includes both diagnosing cancer and annotating medical notes). Here we seek to identify meaningful task characteristics. Characteristics of the decision task can determine whether a task is appropriate for the claims in a study as well as their generalizability. For example, a low-stakes decision task may not create an ideal condition with vulnerability to study trust. A task that is more challenging for humans to perform may induce higher baseline reliance on the AI, so the results may not generalize to settings where the human outperforms the AI, and vice versa. However, the existing literature often does not provide explicit justification for the choice of decision task, nor indicate the scope of generalizability of the results. To facilitate such considerations in future research, we look across the surveyed papers and highlight four dimensions that vary in the chosen decision tasks: (1) task risk (e.g., high, low, or artificial stakes), (2) required expertise in the task, (3) decision subjectivity, and (4) AI for emulation vs. discovery. We do not claim these four as an exhaustive list, but hope to illuminate the challenges in interpreting and generalizing from results in studies that adopt different decision tasks, and encourage future studies to justify the choice of tasks and report their characteristics.
Risk. The risk of a task, including its stakes and potential societal impact (whether high, low, or artificial), is an important characteristic that could impact decision behaviors, particularly user trust and reliance. In fact, Jacovi et al. [65] argue that trust can only be manipulated and evaluated in high-stakes scenarios where vulnerability is at play. In comparison, tasks in the leisure and artificial domains are mostly of relatively low stakes. Tasks in the professional domain can have varying stakes; for example, human resource related decisions are high stakes, while email topic classification is low stakes. The generic category is driven by the creation of AI benchmark datasets; it is unclear how to interpret the stakes or societal relevance of these tasks, as their stakes are contingent on the contexts in which they are adopted. Researchers should carefully consider design choices with respect to risk: it is critical to be cognizant of potential ethical concerns when using AI assistance for high-stakes decisions, such as recidivism prediction [53], and increasing unwarranted trust can be highly problematic in high-stakes decisions [65]. Moreover, generalizability may be affected by risk as well; for example, findings related to the effectiveness of AI assistance in low-risk scenarios, especially on reliance, may not generalize to high-risk scenarios without further research.

Required expertise. Levels of expertise or prior training in a task can lead to different decision behaviors with AI. For some tasks, limited to no domain expertise is required (e.g., artificial tasks), whereas others require significant expertise (e.g., cancer image classification [73]). While many works in human-AI decision making categorize decision makers as either "domain experts" or "lay users", AI literacy is also an important consideration, especially as it relates to one's ability to interpret AI assistance elements. A framework proposed by Suresh et al. [133] suggests decomposing stakeholder expertise into both context (domain vs. machine learning) and knowledge for that context (e.g., formal, personal, etc.). For example, decision making performance with AI-enabled medical decision support tools [24] is affected by formal, instrumental, and personal domain expertise as well as instrumental machine learning expertise (i.e., familiarity with ML toolkits). As such, systems should be evaluated with the targeted expertise, or even varied levels of expertise, to investigate the generalizability of results. Studies should also carefully report on participants' expertise to allow appropriate interpretation and usage of the results.

Subjectivity. Many decision tasks are framed as supervised prediction problems in machine learning, where a groundtruth exists. This choice often implicitly assumes that the prediction task is objective (at least in hindsight), e.g., whether a person has a balance disorder or not [22] or whether a person will pay back a loan [15, 55]. Only in these tasks are quantitative measures of human performance appropriate. In comparison, personal decision making can be subjective; for example, whether a music recommendation is good is subjective to the person receiving it [76, 77], and whether or not language is perceived as "toxic" depends on the person assessing it, whose determination is hard for others to refute [26]. Subjective decision tasks typically have high variability (low agreement) on what the correct model output is. Yet, AI assistance is still valuable to help people make subjective decisions.
However, human performance might not be a good measure for evaluating these subjective decision making tasks, or non-trivial assumptions are required to convert such decisions into objective tasks (e.g., predicting which movie has the highest box office proceeds). As a major focus so far in human-AI decision making is to improve the performance of human-AI teams, most of the tasks in our surveyed papers are objective tasks.

AI for emulation vs. discovery. We highlight a final dimension that affects how one should interpret results from a study but is often overlooked in the choice of tasks. Within objective tasks, we can further distinguish tasks based on the source of groundtruth. In many high-stakes decisions, groundtruth can come from (social and biological) processes that are external to human judgments (e.g., the labels in recidivism prediction are based on observing the behavior of defendants after bail rather than on judges' decisions). In these tasks, machine learning models can be used to discover patterns that humans may not recognize, and can be useful for tasks such as recidivism prediction [39, 54, 55, 58, 94, 115, 139, 143], deception detection [80, 81, 94], and income prediction [62, 115, 139, 144, 155]. We refer to such tasks as AI for discovery tasks. These tasks are usually more challenging for humans, because they require humans to reason about external (social and biological) processes that are not innate to human intelligence. In fact, human performance in some of these tasks, such as deception detection [80, 81], was found to be close to random guessing (manual annotation is thus inappropriate for obtaining groundtruth). Human decisions are also prone to biases in challenging reasoning tasks such as recidivism prediction. AI can improve decision efficacy and alleviate potential biases not only by providing predictions, but also by elucidating the embedded patterns in these decision tasks, such as by providing explanations. However, a challenge lies in the difficulty for humans to determine whether counterintuitive or inconspicuous patterns are genuinely valid or driven by spurious correlations.

In comparison, a typical narrative of AI is to emulate human intelligence. For example, humans perform well at simple recognition tasks, such as determining whether images include people or whether documents discuss sports, and we build AI to emulate this ability. That is, machine learning models are designed to emulate human intelligence for these tasks, and human performance is considered the upper bound. In these tasks, the groundtruth comes from human decision makers. We refer to these tasks as emulation tasks. As such, these tasks are designed for automation purposes and are not preferred choices for studying human-AI decision making, because humans are less likely to benefit from AI assistance. However, there are still a handful of experimental studies investigating human-AI decision making in emulation tasks [3, 27, 52]. These tasks are typically in the generic domain, and improving human performance might be interpreted as reducing the mistakes of crowdworkers (possibly due to lack of attention). It is unclear whether results would generalize to discovery tasks, where humans reason about external processes and models may identify counterintuitive patterns, and future research should explicitly consider the boundary between the two.
We summarize current trends in the choices of decision tasks, discuss gaps we see in current practices of the field, and make recommendations towards a more rigorous science of human-AI decision making. We follow this organization when summarizing each of the remaining sections.

Current trends. (1) Variety: Existing studies on human-AI decision making cover a wide variety of tasks in many application domains. This variety demonstrates the potential of human-AI decision making, but also leads to challenges in generalizing results and developing scientific knowledge across studies. (2) Task characteristics: Most existing studies focus on high-stakes domains such as justice systems, medicine & healthcare, finance, and education, while artificial and generic tasks are still used by some. Although many decision tasks require domain expertise, experts are seldom the subjects of study. Finally, most existing studies focus on "AI for discovery" tasks, because humans typically need or can benefit from AI's assistance in these tasks more than in "AI for emulation" tasks. However, studies often do not explicitly justify using decision tasks with these characteristics, nor discuss their implications for other study design choices (e.g., subjects) and the generalizability of results.

Gaps in current practices. (1) Choices of tasks are driven by dataset availability. For instance, due to the popularity of the COMPAS [6] and ICPSR [138] datasets, many studies used recidivism prediction as the decision task and focused on the law & civic domain. In comparison, despite the public discourse on the potential harm of AI in other domains like hiring [33], there is relatively little research on AI assistance in human resources due to a lack of available datasets. We suspect that this is also the reason that emulation tasks are used in some studies (e.g., the prevalence of AI benchmarks such as visual question answering). (2) Lack of frameworks for generalizable knowledge. A key question for the research community is how to develop scientific knowledge by validating and comparing results from studies across many different domains and types of decision tasks. For example, when an artificial task is used, how much can the results generalize to other domains? How should one interpret differences between the results of a medical diagnosis task and a movie recommendation task? Do results on medical diagnosis generalize to bail decisions? We believe a first step is to identify the underlying characteristics of decision tasks, such as risk and required expertise, in order to make meaningful comparisons across studies and reconcile differences in empirical results. (3) Misalignment with application reality. The focuses and study design choices of current studies may not align with how AI is or will be used in real-world decision-support applications. For instance, the overwhelming focus on high-stakes domains is worrisome if the study designs (subjects, consequences, context) do not align with reality. Tasks defined based on easily available datasets may deviate from realistic decision making scenarios. For example, experiments based on generic tasks such as visual question answering can be quite different from real-world imaging-related tasks such as medical diagnosis. This misalignment is analogous to the discrepancies between the recent burst of COVID-related machine learning papers and clinical practices [116].

Recommendations for future work. (1) Develop frameworks to characterize decision tasks.
To allow scientific understanding across studies, there is an urgent need for the field to have frameworks that can characterize the space of human-AI decision tasks. As a starting point, we suggest the four dimensions in Section 3.2: risk, required expertise, subjectivity, and AI for emulation vs. discovery. We encourage future work to further develop such frameworks. We further encourage specification, such as including meta-data on task characteristics, whenever a new decision task or dataset is introduced to study human-AI decision making. (2) Justify choices of decision task. We encourage researchers to articulate the rationale behind the choice of decision task, including its suitability to answer the research questions. Researchers should also consider whether other study design choices, such as system design, subject recruitment, and evaluation methods, align with the characteristics of the task. Such practices can help interpret and consolidate results across studies and identify important new dimensions of decision task characteristics. (3) Expand dataset availability. A bottleneck hindering the community from studying broader and more realistic decision tasks is the availability of datasets. Popular datasets are often introduced for AI algorithmic research and may not reflect what is needed for realistic AI decision-support tasks. The field should motivate dataset creation by what decision tasks are needed to better understand human-AI decision making, which may require first better understanding decision makers' needs for AI support.

To use AI to accomplish decision tasks, people not only rely on the model's predictions, but can also leverage other information provided by the system to make informed decisions, including gauging whether the model predictions are reliable. For example, with the recent surge of the field of explainable AI (XAI), many have contended that AI explanations could provide additional insights to assist decision making [40, 81, 93]. Therefore, we take a broad view on "AI assistance elements" and review the system features studied in prior work that could impact people's decision making. We start with an overview of the types of model and data used in the surveyed studies and then unpack AI assistance elements.

An important driving factor for the recent surge of interest in human-AI decision making is the growing capacity of AI models to aid decisions. This subsection provides a summary of the different types of models used in the surveyed studies, as listed in Table 2.

Table 2. Types of AI models used in the surveyed studies.
Deep learning models: Convolutional Neural Networks [3, 27], Recurrent Neural Networks [26, 60], BERT [80], RoBERTa [13], VQA model (hybrid LSTM and CNN) [115], not specified [23, 24, 30, 48, 50, 52, 62, 73, 86, 87, 107, 110, 152].
"Shallow" models: logistic regression [16, 39, 41, 45, 68, 86, 106, 143], linear regression [28, 37, 101, 111], generalized additive models (GAMs) [1, 13, 39, 43, 49, 128, 135], decision trees/random forests [29, 45, 54, 55, 86, 92, 97, 137, 144, 155], support-vector machines (SVMs) [41, 80, 81, 86, 94, 114, 147, 152], Bayesian decision lists [82], K-nearest neighbors [77], shallow (1- to 2-layer) neural networks [45, 106], naive Bayes [125], matrix factorization [78].
Wizard of Oz: [7, 15, 20-22, 79, 95, 96, 104, 109, 139].

Deep learning models. Much recent excitement around AI is driven by the popularity of deep learning models that demonstrate strong performance in a wide variety of tasks and can even outperform humans.
Deep learning models are based on neural networks, which usually consist of more than two layers. Deep learning models have been included in many recent studies on human-AI decision making [3, 13, 23, 24, 27, 30, 48, 50, 52, 60, 62, 73, 80, 86, 87, 101, 107, 110, 152]. Some papers specified their deep learning models, e.g., convolutional neural networks [3, 27], recurrent neural networks [26, 60], BERT [80], RoBERTa [13], and a hybrid LSTM and CNN model for a VQA task [115]. In typical training settings, deep learning models provide greater predictive power than traditional "shallow" models, but at the expense of added system complexity. Deep learning models are commonly considered not directly interpretable and thus raise concerns about user trust. To tackle this challenge, many "post-hoc" explanation techniques [98, 114] have been developed to approximate the complex model's logic, which in turn raises concerns about explanation fidelity [2, 123]. We discuss examples of post-hoc explanation techniques in Section 4.2.

"Shallow" models. Despite the superior performance of deep learning models, many empirical studies employed traditional "shallow" models, which are often easier to train and debug. These models include generalized additive models (e.g., logistic regression and linear regression), shallow neural networks [45, 106], Bayesian decision lists [82], K-nearest neighbors [77], naive Bayes [125], and matrix factorization [78], among others (see Table 2 for the full list and corresponding studies). In prediction tasks with a small number of features, "shallow" models are able to achieve accuracy competitive with deep learning models [117]. Moreover, some of the simpler "shallow" models are deemed directly interpretable; for example, coefficients in linear models can be read as feature importance, and shallow decision trees are relatively intuitive to comprehend. It is worth noting that more papers used shallow models than deep learning models to conduct empirical studies on human-AI decision making.

Wizard of Oz. Finally, many experiments did not use an actual model, but instead had researchers manually create and simulate the model output, a method commonly called "Wizard of Oz" (WoZ) in HCI research [71]. Researchers have used the WoZ method with fictional cases of model predictions and explanation styles [7, 15, 20-22, 79, 95, 96, 104, 109, 139]. WoZ is not only convenient for conducting user studies without investing in technical development, but also gives researchers full control over the model behaviors of interest. For instance, this approach allowed researchers to adjust the algorithm's accuracy [109], control error types [21], and test different explanation styles [7, 15, 20, 22]. However, it can be challenging to design realistic WoZ studies given the complexity of model behaviors, and failing to do so could impair the validity and generalizability of study results.

Data types. Besides the models, it is also important to distinguish the different types of data used: text, imagery, audio, video, and tabular (or structured) data. The surveyed papers used a number of data types, including plain text (e.g., LSAT questions [13], hotel reviews [80, 81], etc.), imagery (potentially cancerous images [25], meal images [21], etc.), video (stroke patient rehabilitation videos [87], kitchen activity videos [107], etc.), and tabular (or structured) data (e.g., company financial data [16], personal financial data [29], etc.). For some tasks, combinations of data types are used.
For example, in music recommendation, humans might review structured data (e.g., song title, artist, and genre) and listen to audio when determining whether to listen to a recommended song [76]. Visual question answering requires assessing both images and textual questions about them [27]. The data type influences what ML models can be applied and how well they perform. For example, shallow models may perform relatively better on structured (or tabular) data than on unstructured text compared to deep learning approaches, because of the high dimensionality of text data. More importantly, the data type can determine the nature of the decision task and the affordances of AI assistance for the decision. A common form of AI assistance is explanation of the model outputs, such as the model's attention. Prior work found that attention explanations have limited utility for explaining image classifications [3]. This might be because such explanations, where important areas of the image are highlighted, can be noisy and confusing to humans compared to attention explanations for text data, where important words are highlighted. Data type can also influence the experience of the decision task in many ways: video can take longer to review than images or short text.

A central consideration in studies of human-AI decision making is what kind of assistance is effective in improving decision outcomes. At a minimum, models can assist decision makers by providing predictions, for example, music or movie recommendations, or a generated health risk score. It is often desirable to provide information about model predictions to help users judge whether they should follow them, especially in cases of disagreement or in high-stakes domains. It is also common to provide information about the model to help users gain an overall understanding of how the model works or the data it was trained on, which can influence their perceptions of and interactions with the model in decision making tasks. Figure 2 illustrates our taxonomy of AI assistance elements, which includes predictions, information about predictions, information about models, and other AI system elements that govern the use of the system (e.g., workflows, user control, and varied model quality). Based on this conceptualization, we categorize the AI assistance elements studied in the surveyed papers into these four groups, as listed in Table 3.
Table 3 (excerpt). AI assistance elements studied in the surveyed papers.
Global example-based explanations: representative examples [80], prototypes [23, 43, 74].
Model documentation: overview of the model or algorithm [74, 76, 77, 88, 112], model prediction distribution [139].
Information about training data: input features or information the model considers [61, 77, 111, 155], aggregate statistics (e.g., demographics) [15, 39], full training "data explanation" [7].
Other AI system elements affecting user agency or experience:
- Interventions or workflows affecting cognitive processes: user makes a prediction before seeing the model's [21, 58, 96, 111, 143, 152, 155], varied system response times [21, 109], outcome feedback to the user [12, 13, 27, 58, 147, 153], training phase [27, 76, 80, 111, 155], source of recommendation or local explanation (human or AI) [36, 78, 95], varied model quality [42, 74, 90, 107, 109, 125, 152, 153].
- User agency and control: allowing user feedback or personalization of the model [43, 76, 125], outcome control (after decision [37, 88, 155], before decision [74]), interactive explanations [24, 28, 94], user direction on input data [24, 73, 90], level of machine agency [21, 90].

It is interesting to note that many of the studies reviewed focused on providing information about individual predictions. Such information includes uncertainty estimates and several forms of local explanations: feature importance, rule-based explanations, example-based explanations, counterfactual explanations, and natural language explanations. For our review, we primarily focus on the forms of explanation and how they are presented to humans rather than the underlying algorithm/computation.

Uncertainty. Showing the model's uncertainty for a prediction can make humans aware of when the model is more or less sure, so long as the uncertainty estimates are reliable [51]. In theory, a low confidence (high uncertainty) should alert users not to over-rely on the prediction and to resort to their own judgment or other resources. The most common form of uncertainty information is a confidence score, or prediction probability, for a classification model. Here, uncertainty is usually quantified as the probability, a numeric value between 0 and 1, associated with the predicted label (as opposed to alternative labels) given by the underlying ML model. Many prior systems expose classification confidence scores to human decision makers [10, 13, 21, 22, 43, 60, 68, 73, 86, 87, 144, 155]. Among these studies, Buçinca et al. [21] and Bussone et al. [22] generate confidence scores using Wizard of Oz, while the others use the classification model to generate probabilities and typically present these scores alongside the prediction. Other classification systems expose uncertainty with labeled categories; for example, Levy et al. [90] label some predictions as low confidence in one version of their clinical notes annotation system. In contrast to classification, uncertainty information for regression models is currently under-explored in human-AI decision making, with one recent exception [101] (though the effect of uncertainty information on decision making has long been studied outside the context of AI assistance, e.g., [70]). Uncertainty for regression models can take the form of an uncertainty distribution (how the possible values are distributed, often centered around the given prediction) or a prediction interval (the range of possible values). Currently there is also a lack of discussion on the reliability, sometimes referred to as calibration [51], of uncertainty estimates, and on whether decision makers can make sense of uncertainty estimates properly. For example, in some deep learning models, prediction probabilities are prone to overconfidence [59].
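To make the notions of a confidence score and its calibration concrete, the minimal sketch below uses scikit-learn on synthetic data (an illustrative assumption on our part, not a setup drawn from any surveyed system): the confidence shown to a decision maker is the probability of the predicted label, and calibration is checked by comparing predicted probabilities against the observed fraction of correct labels within probability bins.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary decision task standing in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Confidence score shown to the decision maker: probability of the predicted label.
proba = model.predict_proba(X_test)        # shape (n_instances, n_classes)
predictions = proba.argmax(axis=1)
confidence = proba.max(axis=1)             # lies in [0.5, 1.0] for binary tasks
print("accuracy:", (predictions == y_test).mean())
print("mean stated confidence:", confidence.mean())

# Calibration check: within each probability bin, the observed fraction of
# positives should roughly match the mean predicted probability of the positive class.
frac_positive, mean_predicted = calibration_curve(y_test, proba[:, 1], n_bins=10)
for p_pred, p_obs in zip(mean_predicted, frac_positive):
    print(f"predicted ~{p_pred:.2f} -> observed {p_obs:.2f}")
```

When the observed frequencies fall systematically below the stated probabilities, the scores are overconfident, which is exactly the concern raised above for some deep models.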
Recent work has experimented with deep probabilistic models to give more reliable uncertainty estimates, including Bayesian neural networks [136, 154], deep neural networks that integrate dropout [47], and ensemble methods that approximate Bayesian inference [83]. How the reliability of uncertainty information affects decision making, and how to communicate that reliability or a lack thereof, remain open questions in the context of human-AI decision making.

Local feature importance. Local explanation techniques provide information about how and why a given prediction is made, to assist humans' judgment of the prediction and inform their final decision. A common local explanation type is local feature importance, which, given a single instance, quantifies the contribution (or importance) of each of its features to the model's prediction for it. For example, when predicting property values, certain features are more important to the prediction (e.g., lot size and number of rooms) while others might be less important (e.g., distance to a school). Methods for generating local feature importance can be grouped as follows (see the illustrative sketch after the explanation types below):
• Built-in methods derive importance directly from the model itself; for example, Lai et al. [80] adopt an attention mechanism to model the local feature importance.
• Post-hoc methods learn to generate explanations separately for a trained model, often a non-interpretable model such as a deep neural network. These methods can be grouped into gradient-based [27, 73, 106], propagation-based (LRP [3]), and perturbation-based (e.g., LIME [3, 13, 62, 106], SHAP [29, 144, 155]) methods. First, gradient-based methods compute the gradient of the prediction with respect to the input features; examples include class activation maps (CAM) [73] and Grad-CAM [27]. Second, propagation-based methods, especially Layer-wise Relevance Propagation (LRP), use a forward pass and then a backward pass to calculate the relevance of input features; Alqaraawi et al. [3] adopted LRP in their experiments. Third, perturbation-based methods, such as SHAP and LIME, manipulate parts of the inputs to generate explanations. LIME [114] uses a sparse linear model to approximate the behavior of a machine learning model locally; the coefficients of this sparse linear model can then serve as explanations. Alqaraawi et al. [3], Bansal et al. [13], Hase and Bansal [62], and Nguyen [106] use explanations from LIME as AI assistance. SHAP (SHapley Additive exPlanations), first proposed by Lundberg and Lee [98], provides the marginal contribution of each feature to a particular prediction, averaged over all possible permutations; Shapley values assign each feature an importance value for a particular prediction. In the context of human-AI decision making, Weerts et al. [144] and Zhang et al. [155] use SHAP for local feature importance. In addition, deep learning models that generate video-specific feature captions [107] have also been used.

Rule-based explanations. Rule-based explanations are constructed from a combination of rules, where a rule can be a simple 'if-then' statement. Both built-in approaches (e.g., decision sets and decision trees) and post-hoc approaches (e.g., anchors) have been explored for generating rule-based explanations in the context of human-AI decision making.
• Decision sets: Lakkaraju et al. [82] generate interpretable decision sets, which are sets of if-then rules to explain model decisions. In their study, participants are asked to describe the characteristics of certain classes (e.g., depression) based on the learned decision set for that class.
Lage et al. [79] and Narayanan et al. [104] also used decision sets as local explanations in their studies.
• Tree-based explanations: For decision tree-based models, local explanations can be generated directly from the decision-tree path, i.e., the rules that the model followed to reach the given decision. Tree-based explanations were used by Kulesza et al. [77] in the context of a music recommendation system, which employed a content-based decision tree approach for selecting songs. Lim et al. [92] used the underlying decision tree model to generate multiple types of explanations (why, why not, how to, and what if), such as the decision-tree paths to reach an alternative decision.
• Anchors, proposed by Ribeiro et al. [115], learn if-then rules representing "sufficient" conditions (important features) that guarantee the prediction for the given input, such that changes to the rest of the features will not change the prediction. Hase and Bansal [62] and Ribeiro et al. [115] explored anchors as an explanation method in their experiments.

Example-based methods. Example-based explanation methods explain a prediction with examples (with known outcomes) to support case-based reasoning. A common formulation is to find instances from the training dataset that are similar to the given input. The explanations should include these instances' groundtruth labels to help users make sense of the reasons behind the current prediction. For example, we might explain the predicted price for a given home by showing similar homes with their actual prices. A common and simple way to generate similar instances is to find the nearest neighbors in the embedding (latent representation) space. This method is used by many papers to explain predictions for human-AI decision making [15, 20, 23, 24, 39, 62, 77, 78, 81, 137, 143].

Counterfactual explanations. Counterfactual explanations help people understand how the current input should change to obtain an alternative prediction, answering a "why not" (a different prediction) or "how to" (obtain a different prediction) question instead of a "why" question. Counterfactual explanations can be based on either features or examples. Feature-based ones are often called contrastive feature or sensitive feature methods: they highlight features that, if changed (often with minimum change implied), will alter the prediction to the alternative class. For example, a counterfactual explanation for a loan prediction task could be "you would have received the loan if your income was higher by $10,000." For example-based counterfactuals, Wang and Yin [143] show instances with minimal changes that result in the desired output, and Friedler et al. [45] asked users to answer a 'what if' question given a perturbed input.

Natural language explanations. Natural language explanations, sometimes referred to as rationale-based explanations, are a form of "why" explanations that provide the reasoning or rationale behind a particular decision. For example, Tsai et al. [137] study rationale-based explanations for their COVID-19 chatbot, such as why the chatbot asks particular diagnostic questions to the user. These explanation types are sometimes referred to as "justifications" [16]. Natural language explanations can be differentiated by how they are generated: either model/algorithm generated [16, 32, 137], where the explanations are produced by the system, or human-expert generated [13], meaning domain experts (or algorithm developers) provide rationales behind types of predictions to be shown to users.
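As a simple illustration of two of the local explanation forms above, the sketch below (a simplified stand-in using scikit-learn on synthetic tabular data; it is not the implementation of any surveyed paper, nor of LIME or SHAP themselves) computes (a) a naive perturbation-based local feature importance, by replacing one feature at a time with its training mean and recording the drop in the predicted class probability, and (b) an example-based explanation, by retrieving the nearest training neighbors of the instance together with their groundtruth labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, y_train, x = X[:900], y[:900], X[950]          # x: instance to explain
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

pred = int(model.predict(x.reshape(1, -1))[0])
p_orig = model.predict_proba(x.reshape(1, -1))[0, pred]

# (a) Perturbation-based local feature importance: replace each feature with its
# training mean; a larger drop in the predicted-class probability means the
# feature mattered more for this particular prediction.
importance = []
for j in range(X_train.shape[1]):
    x_pert = x.copy()
    x_pert[j] = X_train[:, j].mean()
    p_pert = model.predict_proba(x_pert.reshape(1, -1))[0, pred]
    importance.append(p_orig - p_pert)
print("local feature importance:", np.round(importance, 3))

# (b) Example-based explanation: nearest training instances and their labels,
# shown so the user can reason by analogy to known cases.
nn = NearestNeighbors(n_neighbors=3).fit(X_train)
_, idx = nn.kneighbors(x.reshape(1, -1))
print("similar examples (index, label):",
      list(zip(idx[0].tolist(), y_train[idx[0]].tolist())))
```

Principled methods such as LIME, SHAP, or anchors replace step (a) with local surrogate models, Shapley value estimation, or rule search, but the kind of information surfaced to the decision maker is the same.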
Hase and Bansal [62] showed the model's partial decision boundary by traversing the latent space around a specific input, in order to show how the model behaves as the input changes. The method was initially proposed in the computer vision domain [67, 119]; Hase and Bansal [62] developed and adapted it for text and tabular data.

Beyond information about individual predictions, AI systems can also provide "global" information about the underlying model to help users form an appropriate mental model that can help them interact more effectively. "Global" information about the model can include the model's overall performance, global explanations (e.g., how the model weighs different features, or visualizing the whole model process for simple models), input and output spaces, information about the training data, provenance, and more. Recently, there is growing interest in providing documentation or an "About Me" page to present such global information (e.g., Model Cards [102], FactSheets [9]). In this section, we discuss what types of global information about models have been studied in the surveyed papers on human-AI decision making.

Model performance. Model performance describes how well a model works in general. In studies of human-AI decision making, model performance has mainly been presented in the form of accuracy (i.e., the percentage of correctly predicted instances) [61, 80, 81, 147, 152]. These works typically explore how observing model accuracy affects people's perception of and decision making with the model. For example, Lai and Tan [81] investigate whether human subjects' awareness of the ML model's accuracy improves their performance in decision making tasks, and Yin et al. [152] study the effect of accuracy on humans' trust in ML models. Model performance has also been described by false positive rates, or how frequently the system mislabels an input as a particular class. For example, Harrison et al. [61] showed that presenting false positive rates in addition to accuracy helped people gauge the fairness of the model in recidivism prediction tasks. It is useful to note that accuracy information is usually estimated on a held-out dataset, and the model's actual performance in deployment can shift, especially when the actual decision inputs or their distribution differ from those of the training data. This gap between communicated accuracy and experienced accuracy has been studied by Yin et al. [152]. Future work should also explore the effects of other types of performance metrics, such as precision and recall.

Global feature importance. Different from local feature importance, which quantifies each feature's importance to a specific prediction, global feature importance quantifies each feature's overall importance to the model's decisions for a given task. Methods used in the surveyed papers for computing global feature importance include built-in methods, post-hoc methods, and Wizard of Oz. Early work used permutation importance to compute global feature importance for random forests [18], followed by a rich line of research on the topic [4, 34, 56, 57, 131, 156]. More recently, Fisher et al. [44] proposed a model-agnostic version of permutation importance; Wang and Yin [143] adopt the method of Fisher et al. [44] to compute global feature importance in their paper, exploring whether such explanations are helpful during decision making tasks. Finally, Wizard of Oz studies manually construct the global feature importance; for example, Binns et al.
Presentation of simple models. For simple models, it is possible to present the whole or part of the model internals to humans to give them a detailed view of how the model makes decisions. These simple models, often referred to as inherently transparent or intrinsically interpretable, can be presented to humans in the form of decision trees, rule sets, graphs, or other visualizations. For this reason, such models are often preferred over more complicated architectures (e.g., neural nets) when interpretability is desired. In the context of human-AI decision making, researchers have explored presentations of simple models, including decision sets [82] and trees [45], linear [111] and logistic regressions [45], and one-layer multilayer perceptrons (MLPs) [45]. For example, Lakkaraju et al. [82] construct a small number of compact decision sets that are capable of explaining the behavior of blackbox models in certain parts of the feature space. Friedler et al. [45] compare three models, representing a decision tree as a node-link diagram and both a logistic regression and a one-layer MLP as math worksheets that are intended to "walk the users through the calculations without any previous training in using the model."

Global example-based explanations. Example-based global explanations are instances from the training set that explain the prediction or provide insights into the data that can help humans make task decisions. Lai et al. [80] use the SP-LIME algorithm [114] to select examples from the training set whose features provide good coverage, and present them as a tutorial for the task. They also propose a Spaced Repetition algorithm that creates a set of examples exposing humans to important features repeatedly. Another common approach is to pinpoint one or a set of training samples that are prototypical of a given prediction class [105]; such a representative data instance is called a prototype. For linear models, it is natural to find important training examples based on distance in the representation space [108]. For nonlinear models, influence functions [75] and representer values [151] have been proposed. Kocielnik et al. [74] use Wizard of Oz to generate a table of representative instances for each prediction class to help users understand how the AI component operates. Note that prototypical examples can also be used to explain a prediction locally, by presenting the prototype in its proximity, such as the explanations used in Cai et al. [23] and Feng and Boyd-Graber [43].

Model documentation. The requirements for model documentation or an "About Me" page, which provides not only an overview of the model characteristics but also how it was developed and is intended to be used, are discussed in recent literature as critical to AI transparency and governance [9, 102]. However, only a small number of surveyed studies explored using relevant features in human-AI decision making [74, 76, 77, 88, 112]. For example, the meeting scheduling assistant of Kocielnik et al. [74] includes a description of how the scheduling assistant works, specifically, "The Scheduling Assistant examines each sentence separately and looks for meeting related phrases to make a decision." Some work on model documentation argued for the importance of providing an overview of the model's input and output spaces, such as the output distribution. For example, van Berkel et al. [139] display race and sex group filters based on the participant's demographic information (e.g., a male participant first sees the data of male loan applicants).
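Returning to the presentation of simple models discussed above, the following is a minimal, hypothetical sketch of the kind of artifact that can be shown to participants: a shallow decision tree rendered in full as human-readable rules with scikit-learn. It is only an illustration, not the interface of any surveyed system.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A small, shallow tree is simple enough to present to participants in full.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Render the entire model as if-then rules that walk through its decisions.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```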
Information about training data. Finally, human-AI decision making systems can help people better understand the models by providing information about the data on which they were trained, such as the input features used or the data distribution [39, 61, 77, 111, 155]. For example, in the income prediction system of Zhang et al. [155], humans are made aware of whether or not the model considers "marital status" as a feature. Some studies present demographic-based or aggregated statistics of the training data [15, 39]. In a more detailed treatment, Anik and Bunt [7] present a "full" training data explanation, including how the data was collected, as well as its demographics, recommended usage, potential issues, and so on.

Besides providing information about the AI to assist decision making, prior research has also studied additional system elements that can affect user experience, mainly interventions that affect users' cognitive processes of decision making or users' agency over the system.

Interventions or workflows affecting cognitive processes. Besides providing information about the model and predictions, how people process such information to form perceptions of the AI and make decisions can be impacted by interventions that change their cognitive processes. One area of intervention concerns how to design the workflow, such as when users make their own decisions versus seeing the model's predictions. The typical paradigm of human-AI decision making is to have the model provide predictions, which users can choose to follow or ignore. Some studies explored having users make their own predictions before being shown the model output [21, 58, 96, 111, 143, 152, 155]. Such designs force people to engage more deliberately with their own decision making rather than relying on the model predictions. Prior work also explored the impact of different workflow designs on users' mental models of how AI assistants make decisions. Some systems include a training phase prior to the task, during which users review model outputs and explanations of how the system works [27, 76, 80, 111, 155]. In some real-world scenarios, decision makers can see the actual outcomes of decisions. Studies have also explored how receiving outcome feedback on the correctness of their own or the model's decisions [12, 13, 27, 58, 147, 153] impacts people's perception of the models and their task performance. Another type of intervention studied is system response time, or how long the system takes to provide a decision [21, 109]. For example, Buçinca et al. [21] compare the effect of cognitive forcing functions, where participants get suggestions immediately as opposed to having to wait 30 seconds for the machine's prediction, on over-reliance and subjective perception. The model's actual performance (as opposed to the communicated performance metrics described in Section 4.2.3) can govern the usefulness of the decision support. Prior work also explored how varying model performance or prediction quality impacts human-AI decision making [42, 74, 90, 107, 109, 125, 152, 153]. For example, Smith-Renner et al. [125] explored whether describing model errors (which enables users to gauge model performance) without the opportunity to make fixes yielded user frustration for both low and high quality models.
Similarly, Kocielnik et al. [74] studied the difference between high-precision and high-recall models (without explicitly showing this information to users) on user perceptions. Lastly, some studies looked at how the source of assistance, whether from an AI or from a human, affects decision making [36, 78, 95]. For example, Kunkel et al. [78] explore how machine-generated versus human-generated explanations impact the acceptance and trust of a system for movie recommendations.

Levels of user agency. Typical decision-support AI systems work in a closed loop without the possibility of guidance from end users. This kind of setup limits the agency users have for controlling or improving the decision assistance from the AI. Some studies have explored improving user agency, such as allowing and incorporating user feedback on predictions or personalization of the model [43, 76, 125]. For example, in the music recommendation system of Kulesza et al. [76], participants can provide feedback about the recommended songs and guidelines to the model to improve future recommendations. Another studied form of user agency is the ability to guide the prediction (or outcome) either before [74] or after the model's decisions [37, 88, 155]. For example, Kocielnik et al. [74] explore user experience with a system for detecting meeting requests from emails. They compare whether providing users a slider to control whether the system tends towards false positives (high recall) or false negatives (high precision) improves experience. This control occurs before the system makes predictions, but can be updated as needed. Lee et al. [88] study whether allowing participants to override the outputs of a system that decides how to split food between graduate students promotes fairness perception. More recently, interactive explanations have been investigated, which allow humans to have better control over what kinds of explanations they can get from the model [24, 28, 94]. For example, Cai et al. [24] propose and evaluate an interactive refinement tool to retrieve similar images for pathologists in the medical domain. In their tool, users are not provided with explicit explanations, but instead can interact with the system and test it under different hypotheses. Cheng et al. [28] compare different explanation interfaces for understanding an algorithm's university admission decisions. They find that an interactive explanation approach is more effective at improving comprehension compared to static explanations, but at the expense of taking more time. Another form of user control in human-AI decision making is to allow users to choose the input data on which to get model predictions. Prior work explores cases where users choose the input data (or the underlying features) for the model to consider [24, 73, 90]. For example, in the similar image search system of Cai et al. [24], participants indicate the important points of the initial image for the system to attend to when looking for similar images. Finally, researchers have compared different levels of machine agency. For example, Levy et al. [90] experimented with two distinct clinical note annotation systems: one only suggests annotation labels after users choose text spans to be labeled, while the other suggests both spans and labels. Similarly, Buçinca et al. [21] examined receiving predictions only on demand.

We summarize current trends and gaps in how AI models and assistance elements are used and studied, then make recommendations for future work.

Current trends. (1) Limited use of deep learning models.
Despite the popularity of deep learning models, a large proportion of empirical studies still adopted traditional shallow models, or even Wizard of Oz approaches, possibly due to their ease of development or access. Further, shallow models are typically less complicated to explain, making them an easier choice for studying human-AI decision making assisted by information about the prediction or model. (2) Assistance beyond predictions. Besides the prediction itself, empirical studies have explored the effect of a wide range of elements that provide information about the prediction and the model on improving decision performance. By summarizing the elements studied, we hope to also inform the design space of AI decision support systems. (3) A focus on AI explanations. A large portion of prior work focused on studying the effect of explanations, both local explanations for a prediction and global explanations for the model. Explanations provide information people need to better understand the AI they interact with, but this focus is also partly due to a recent surge in the field of explainable AI (XAI), which has made techniques for generating explanations increasingly available. (4) Beyond the model. Beyond model-centric assistance elements, a small portion of work explored system elements that affect user agency and action spaces, including workflow and user control.

Gaps in current practices. (1) A fragmented understanding and limited design space of AI assistance elements. Current human-subject studies often focus on one or a small set of AI assistance elements. We have very limited understanding of the interplay between different assistance elements, and thus limited knowledge of how to choose between, or combine, them when designing AI systems. More problematically, studies are often driven by technical availability, such as new explanation techniques. This practice risks losing sight of what users need to better accomplish decision tasks and of the necessary elements of the design space of human-AI decision making, which is especially important knowledge for practitioners developing effective AI systems. For example, only a small number of studies explored non-model-centric system elements that can affect users' action space and cognitive processes, and showed that they are critical for user experience and for users' interaction with model assistance features [21, 43, 125]. (2) A focus on decision trials only instead of the holistic experience with AI. Existing work commonly experimented with participants performing discrete decision trials: seeing an instance and making a decision with the AI's assistance. However, in reality, when people use a decision-support AI system, many other steps and aspects can affect their experience and interaction with the AI, such as the system on-boarding experience, workflow, the contexts where the decision happens, and repeated experiences with the system. Their effects are currently under-explored for human-AI decision making. This narrow use of experimental tasks may also have led to certain gaps or biases in the assistance elements studied. For example, studies tended to focus on decision-specific assistance and less on model-wide information. (3) Gaps in models used. Our analysis revealed a current bias in the model types used in the studies: more used traditional shallow models than deep learning models. It is necessary to elucidate how model types and their associated properties affect the experimental setup and the generalizability of results, which can guide future studies to make appropriate choices.
For example, some deep learning models not only tend to perform better in average (though not all) training settings, but are also likely less interpretable and prone to over-confidence. While Wizard of Oz approaches have a long tradition in HCI, applying them to the study of AI models faces many new challenges, such as how to simulate model errors and explanations in a realistic way. We caution against using them without justifying the design as sufficiently approximating the model behaviors of interest and stating the limitations. Another gap in models used is the limited study of regression models. Beyond the form of the prediction itself, some assistance elements take distinct forms for regression versus classification (e.g., uncertainty information), and their effects are under-explored for regression.

Recommendations for future work. (1) Human-centered analysis to define the design space of AI assistance elements. Complementary to current practices of studying fragmented AI assistance elements, the field can benefit from top-down frameworks that define the design space of AI assistance elements needed for better human-AI decision making, which requires analysis centering on what decision makers need rather than on technical availability. Such frameworks can guide researchers to identify gaps in the literature and formulate research questions, and ultimately produce unified knowledge that can better help practitioners make appropriate design choices when developing AI decision support systems. We hope our analysis can inform such efforts. (2) Extend the design space and studies beyond decision trials. Centering research efforts on real user needs also means we should look beyond the discrete decision trials used by current studies, which not only may lack ecological validity but also fail to account for many temporal, contextual, and individual factors that can shape how people perceive and interact with AI, such as on-boarding experience, time constraints, workload, prior experience, and individual differences. Future work should explore these factors, and conduct field and longitudinal studies of human-AI decision making. (3) Task-driven studies to complement feature-driven studies. Current studies are often motivated by understanding the effect of certain assistance elements or design features, and a decision task is then chosen in an ad-hoc fashion, or even arbitrarily in some cases. To inform the design space of AI assistance elements and actionable design guidelines for different types of AI systems, we believe it is useful to complement current practices with task-driven studies, which may require conducting formative studies to understand user needs and behaviors for a given decision task.

Deciding on the evaluation metrics is one of the most critical research design choices. This decision often involves choosing the construct (what to evaluate) and then the specific formulation or content of the metrics (how to evaluate the target construct). Our survey reveals a wide range of constructs evaluated in studies of human-AI decision making, likely due to the broad research questions asked by the community and a lack of standardized evaluation methods. As mentioned in the Methodology section, our survey focuses on quantitative evaluation metrics, although some studies used qualitative analysis to gain a further understanding of user perceptions and behaviors.
At a high level, we group the evaluation metrics into two categories: (1) evaluation with respect to the decision making task and (2) evaluation with respect to the AI. Under each, we group the metrics into areas of evaluation, such as task efficacy versus efficiency, and further classify them as either objective or subjective measurements. Later in this section we discuss that subjective and objective measurements may in fact target different constructs (perception or attitude versus behavioral outcomes guided by the attitude); here we classify them based on what the studies claim they are measuring. Note that our analysis stays at the granularity of measurement constructs rather than detailed differences in their content or formulation (e.g., what specific survey items are used). It is worth noting that many papers did not provide access to the survey scales or questionnaires. As a result, it can be difficult to interpret some of these findings or for future research to replicate them.

Decision making performance, which the AI assistance is designed to support, is intuitively the most important outcome measurement for human-AI decision making. Evaluation of the decision task mainly falls into two categories: (1) efficacy (i.e., the quality of the decisions) and (2) efficiency (i.e., the speed of making these decisions). In addition, we include a category on measuring people's task satisfaction. Table 4 gives a summary of these measures and how they are collected, both objectively and subjectively, in the surveyed papers.

Table 4. Metrics for evaluating the decision making task, grouped by area and by subjective versus objective measurement.
Efficacy (subjective): self-rated error/accuracy [37, 81, 137], perceived performance improvement [32], confidence in the decisions [54, 55, 60, 95], soundness of participants' mental models [76]
Efficacy (objective): accuracy/error [11, 13, 16, 20, 24, 26, 42, 43, 52, 58, 62, 68, 73, 74, 79-81, 90, 92, 94, 97, 99, 101, 107, 130, 144, 155], F1 [16, 104], precision [16], recall [16, 90], AUC-ROC [41], false positive rate [26, 54, 99], false negative rate [26, 41], true positive rate [41], true negative rate [41], Brier score [54, 55], mean prediction error [111], win rate [32, 48], cumulative award [52], customized score/return [12], human percentile rank [32], agreement between labels [87]
Efficiency (objective): time taken on the task (response time, average time for a game round, speed) [1, 10, 26, 28, 45, 48, 52, 74, 79, 82, 90, 92, 104, 125, 130, 144, 147], total number of labels [90]
Task satisfaction & mental demand (subjective): satisfaction with the process [37], confidence in the process [37], frustration/annoyance [77, 125], mental demand/effort [20, 21, 77, 144], workload [24, 87, 128, 130], task difficulty [10]
Task satisfaction & mental demand (objective): number of words in user feedback [82]

Efficacy. We start with objective metrics of decision task performance. The most commonly used metric is accuracy, measured as the percentage of correctly predicted instances (or, equivalently, error rate, the percentage of incorrectly predicted instances) [11, 26, 27, 41, 43, 49, 79-82, 97, 99, 104, 106, 114]. Typically, the metric of interest is the accuracy of the joint outcome of the human-AI team, compared against the baseline accuracy of humans without AI assistance or of the AI alone. As the evaluation essentially compares decision labels against ground-truth labels, other metrics that are typically used to evaluate AI performance can also be used to evaluate the performance of human-AI decision making.
These metrics include F1 [16, 104], precision [16], recall [16, 90], and AUC-ROC [41], which are commonly adopted for imbalanced datasets in the machine learning literature. For cases where the cost of mistakes differs significantly between the positive and negative classes, false positive rates [26, 41, 54, 99], false negative rates [26, 41, 99], true positive rates [41], and true negative rates [41] have been used. In regression tasks, such as asking people to predict the likelihood of recidivism [54, 55], prior work similarly adopts continuous counterparts of accuracy, including mean prediction error [111] and the Brier score, 1 − (prediction − outcome)² [54, 55]. In gamified tasks, researchers also use win rate [32, 48], cumulative award [52], customized return [12], and human percentile rank [32] to capture the performance of human-AI teams. Finally, in cases where ground-truth labels are not available, agreement between labels (inter-annotator agreement) has also been used [87]. In addition to objective metrics, subjective metrics can help understand human perception of task performance. A natural extension of the objective performance metrics is perceived accuracy (i.e., self-rated error/accuracy) [37, 81, 137] and perceived performance improvement [32]. Another common metric is to ask humans about their confidence in the decisions [54, 55, 60, 95]. These perceived confidence measurements are usually based on Likert scales. Finally, Kulesza et al. [76] introduce a unique metric that combines subjective and objective measurements to quantify mental model soundness as Σ_i (correctness_i × confidence_i), where i indexes the comprehension questions. While this metric was originally used for comprehension questions about how a recommendation system works, Kulesza et al. [76] show it can be adapted to measure the soundness of mental models on test instances.
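To make these formulations concrete, the following is a minimal, hypothetical sketch of computing a few of the efficacy measures discussed above (joint accuracy, the 1 − (prediction − outcome)² score, and the correctness-weighted-by-confidence soundness measure) from logged study data. The arrays are placeholders, not the instruments of any particular study.

```python
import numpy as np

# Assumed placeholder logs, one entry per decision trial.
final_decisions = np.array([1, 0, 1, 1, 0])              # human-AI team's final labels
ground_truth    = np.array([1, 0, 0, 1, 0])              # true labels
risk_estimates  = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # continuous predictions in [0, 1]
outcomes        = np.array([1, 0, 0, 1, 0])              # realized outcomes

# Joint accuracy of the human-AI team.
accuracy = (final_decisions == ground_truth).mean()

# Score of the form 1 - (prediction - outcome)^2, averaged over trials (higher is better).
brier_style_score = (1 - (risk_estimates - outcomes) ** 2).mean()

# Mental model soundness: sum over comprehension questions of correctness_i x confidence_i.
correctness = np.array([1, 1, 0, 1])    # 1 if question i was answered correctly
confidence  = np.array([5, 4, 2, 3])    # self-reported confidence for question i
soundness = np.sum(correctness * confidence)

print(accuracy, brier_style_score, soundness)
```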
Efficiency. In addition to efficacy (how accurately participants make decisions), another important dimension to consider is efficiency (how quickly they can make them). The main motivation is to gauge whether the AI assistance can help humans make decisions faster. The most common objective metric is time taken on the task [1, 10, 26, 28, 45, 48, 52, 74, 79, 82, 90, 92, 104, 125, 130, 144, 147]. Alternatively, the total number of labels (or task outputs) can be used to measure efficiency, as in Levy et al. [90]; this is most appropriate when task time is held constant. Notably, subjective metrics of efficiency are not seen in the papers we reviewed, although self-reported task efficiency is a common metric used in usability testing [46].

Task-level satisfaction and mental demand. Finally, an important consideration in human-AI decision making is whether AI assistance improves humans' satisfaction or enjoyment with the decision task. We consider both direct measurements of task satisfaction and the counter-measurement of task mental demand in this category. Most metrics in this category are subjective, typically solicited through questions in an exit survey or questionnaire. The only exception in the papers we surveyed is Lakkaraju et al. [82], who used the number of words in user feedback to gauge user satisfaction. Researchers have asked participants about their subjective satisfaction with the process [37], confidence in the process [37], frustration/annoyance [77, 125], mental demand/effort [20, 21, 77, 144], workload [24, 87, 128, 130], and task difficulty [10].

In addition to evaluating the decision task, work on human-AI decision making also focuses on evaluating users' perception of and response to the AI system itself, including understanding of the AI, trust in the AI, fairness perception, AI system satisfaction, and others. Table 5 summarizes these measures.

Table 5. Metrics for evaluating perceptions of and interactions with the AI, grouped by area and by subjective versus objective measurement.
Understanding (subjective): self-reported understanding [7, 15, 20, 23, 28, 97, 125, 143, 147], confidence in understanding [76], confidence in simulation [3, 106], ease of understanding [60, 111], intuitiveness [134], perceived transparency/interpretability [112, 137]
Understanding (objective): forward simulation [1, 3, 20, 27, 29, 45, 62, 106, 107, 111, 115, 143, 153], counterfactual simulation [62, 143], model error detection [143], identifying important features [28, 134], correctness of described model behaviors [29, 77, 112], correctness of estimated model performance/accuracy [36, 55, 107, 125], comprehension quiz [28, 48, 76, 143]
Trust and reliance (subjective): self-reported trust [1, 3, 20, 28, 29, 36, 45, 48, 55, 76, 111, 115, 125, 128, 137], model confidence/acceptance [3, 29, 77, 125, 134, 143], self-reported agreement/reliance [27], perceived accuracy [74, 125, 128], perceived capability/benevolence/integrity [107, 112], usage intention/willingness [1, 24, 29, 36, 74, 77, 106, 112]
Trust and reliance (objective): agreement/acceptance of model suggestions [13, 16, 22, 26, 35, 80, 81, 90, 94, 96, 101, 143, 152, 153, 155], switch [58, 96, 101, 109, 152, 155], weight of advice [95, 111], model influence (difference between conditions) [54, 55], disagreement/deviation [111], choice to use the model [11, 36, 37, 114], over-reliance [21, 22, 143, 147], under-reliance [22, 143, 147], appropriate reliance [52, 111, 143, 147]
Fairness (subjective): perceived fairness [7, 39, 54, 61, 139], individual fairness [88], group fairness [88], process fairness [15], deserved outcome [15], feature fairness [15, 139], accountability [112]
Fairness (objective): decision bias [54, 55]
System satisfaction and usability (subjective): satisfaction [16, 37, 74, 79, 97, 104, 137], helpfulness/support [16, 20, 24, 42, 74, 147], usefulness [60, 86], effectiveness [137], quality [137], appropriateness [20], preference/likability [76, 86, 115, 147], system affect [128], system usability [130], complexity [21], ease/comfort of use [7, 137], system frustration [74, 86, 125], richness/informativeness [7, 86, 87], learning [137], recommendation to others [74]
System satisfaction and usability (objective): time spent on the application [23]
Others (subjective): ratings of specific features (e.g., explanations): quality/soundness/completeness of explanation [77, 78, 134], usefulness/helpfulness of explanation [24, 87, 107, 134], agreement with explanation [144], ease of using explanation [134], explanation workload [1], attribution to AI versus self [23], desire to provide feedback [125], expected model improvement [125]

Understanding. Since a significant proportion of empirical studies focus on AI explanations as a form of decision making assistance, users' understanding of the AI is a commonly used measurement. Subjective metrics of understanding typically ask participants to directly rate their understanding of the AI [7, 15, 20, 23, 28, 97, 125, 143, 147], or some variation of it, such as confidence in understanding [76], ease of understanding [60, 111], or confidence in simulation [3, 106]. Other metrics ask participants to rate the perceived intuitiveness [134] or transparency [112, 137] of the AI system. Objective metrics often test how well people understand the system against ground-truth facts about its outputs or how it works. The most commonly used metric is forward simulation [1, 3, 20, 27, 29, 45, 62, 106, 107, 111, 115, 143, 153], which asks participants to simulate the model's predictions on unseen instances. Some researchers have also used counterfactual simulation [62, 143] (i.e., predicting feature changes that would lead to a different prediction). Other metrics measure the correctness of people's assessment of model performance [36, 55, 107, 125], detection of errors [143], identification of important features [28, 134], or whether they can provide a correct description of model behaviors [29, 77, 112]. Other studies design comprehension quizzes to evaluate human understanding [28, 48, 76, 143]. Using many of these objective measures, researchers aim to evaluate humans' mental models of the AI system's inner workings and how it makes predictions. It is important to note that objective and subjective understanding do not always align, due to the phenomenon of illusory confidence, in which one believes they understood the model better than they actually did [29].
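The objective understanding measures above reduce to simple comparisons against ground truth. The following minimal sketch, with placeholder arrays not tied to any surveyed study, shows how a forward-simulation score and a subjective-objective gap might be computed.

```python
import numpy as np

# Assumed placeholders: for each held-out instance, the participant's guess of
# what the model will predict, and the model's actual prediction.
participant_guesses = np.array([1, 1, 0, 1, 0, 0])
model_predictions   = np.array([1, 0, 0, 1, 0, 1])

# Forward simulation score: how often the participant correctly anticipates the model.
forward_simulation_accuracy = (participant_guesses == model_predictions).mean()

# Self-reported understanding (e.g., a 1-7 Likert rating), rescaled to [0, 1] so it
# can be contrasted with the objective score when probing illusory confidence.
self_reported_understanding = 6 / 7
subjective_objective_gap = self_reported_understanding - forward_simulation_accuracy

print(forward_simulation_accuracy, subjective_objective_gap)
```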
Trust and reliance. Trust in AI is an important research topic and a commonly used metric. For subjective metrics, direct self-reported trust is often used [1, 3, 20, 28, 29, 36, 45, 48, 55, 76, 111, 115, 125, 128, 137], or some variation of it, such as acceptance of or confidence in the model [3, 29, 77, 125, 134, 143], self-reported agreement or reliance [27], or perceived accuracy of the AI [74, 125, 128]. Some work measured user trust based on the well-established ABI framework [100] or a subset of it, which measures subjective trust belief (i.e., perceived trustworthiness) as perceived capability, benevolence, and integrity [107, 112], and/or trust intention as usage willingness [1, 24, 29, 36, 74, 77, 106, 112]. Objective metrics of trust often focus on reliance as a direct outcome of trusting (i.e., how much people's decisions rely on or are influenced by the AI's), such as acceptance of model suggestions [13, 16, 22, 26, 35, 80, 81, 90, 94, 96, 101, 143, 152, 153, 155], likelihood to switch [58, 96, 101, 109, 152, 155], weight of model advice [95, 111], choice to use the model [11, 36, 37, 114], model influence (difference between conditions) [54, 55], as well as disagreement with or deviation from the model's recommendations [111]. Some researchers have also looked at more fine-grained reliance measures such as over-reliance (relying on the model when it is wrong) [21, 22, 143, 147], under-reliance (not relying on the model when it is right) [22, 143, 147], and appropriate reliance [52, 111, 143, 147].
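Several of these behavioral reliance measures can be computed directly from logged trial data. Below is a minimal, hypothetical sketch of agreement, switch, over-reliance, under-reliance, and appropriate reliance rates; the arrays are placeholders, and the exact operationalizations of these measures vary across the surveyed papers.

```python
import numpy as np

# Assumed placeholder logs, one entry per trial.
initial = np.array([1, 0, 1, 0, 1])   # human's decision before seeing the AI
ai_pred = np.array([1, 1, 0, 0, 1])   # AI's recommendation
final   = np.array([1, 1, 1, 0, 1])   # human's final decision
truth   = np.array([1, 0, 1, 0, 0])   # ground-truth labels

agreement_rate = (final == ai_pred).mean()           # acceptance of model suggestions

disagreed_initially = initial != ai_pred
switch_rate = (final[disagreed_initially] == ai_pred[disagreed_initially]).mean()

ai_wrong = ai_pred != truth
ai_right = ai_pred == truth
over_reliance  = (final[ai_wrong] == ai_pred[ai_wrong]).mean()    # following a wrong AI
under_reliance = (final[ai_right] != ai_pred[ai_right]).mean()    # rejecting a right AI

# One possible operationalization of appropriate reliance:
# follow the AI when it is right, reject it when it is wrong.
appropriate_reliance = ((final == ai_pred) == ai_right).mean()

print(agreement_rate, switch_rate, over_reliance, under_reliance, appropriate_reliance)
```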
These objective trust metrics often influence the joint decision outcomes and thus correlate with the task efficacy metrics reviewed above. It is worth noting that although trust as an attitude guides the behavior of reliance, the two are in fact different constructs.

Fairness. Studying how people perceive the fairness of AI, and what designs impact that perception, is an active research area. These studies primarily rely on subjective metrics, from general perceived fairness [7, 39, 54, 61, 139] to perceptions of more fine-grained types of fairness such as individual fairness [88], group fairness [88], process fairness [15], deserved outcome [15], feature fairness [15, 139], and accountability (i.e., the extent to which participants think the system is fair and that they can control the outputs the system produces) [112]. Only a small number of studies leveraged decision bias (e.g., following the model's recommendations despite their lack of fairness) [54, 55] as an objective metric of perceived fairness.

System satisfaction and usability. Many subjective metrics have been used to measure general satisfaction with the AI [16, 37, 74, 79, 97, 104, 137] or related constructs such as perceived helpfulness [16, 20, 24, 42, 74, 147], usefulness [60, 86], effectiveness [137], quality [137], appropriateness [20], likability [76, 86, 115, 147], etc. Some studies leverage usability-related metrics such as system usability [130], system complexity [21], ease of use [7, 137], system frustration [74, 86, 125], information richness [7, 86, 87], learning effect [137], and recommendation to others (net promoter) [74]. Of the papers we surveyed, only Cai et al. [23] leveraged an objective satisfaction measure, using the time spent with the AI system to reflect users' interest and satisfaction.

Others. Other measures focus on evaluating a specific feature of the AI. For example, AI explanations are frequently studied in the context of human-AI decision making, and subjective metrics have been used to measure people's perceived explanation quality [77, 78, 134], explanation usefulness [24, 87, 107, 134], ease of using the explanation [134], explanation workload [1], and agreement with the explanation [144]. Other metrics include users' outcome attribution to the AI versus themselves [23], desire to provide feedback [125], and expected improvement of the AI system over time [125].

Qualitative analysis. While the measurements described above are quantitative, we note that it is common for studies to supplement them with qualitative analysis, either to further gauge the target measure (e.g., coding participants' statements about how the system works to measure understanding [77]) or to understand the underlying mechanisms or reasons. For example, some studies asked open-ended "why" questions following survey scales, while others conducted exit interviews or asked participants to think aloud while using the AI system [15, 19, 22, 25, 30, 48, 61, 69, 124, 125, 140].
Thematic analysis is then typically used to analyze these qualitative data, allowing researchers to extract the main themes from a large body of information and distill them into insights. Other qualitative analyses performed include grounded theory [91, 103] and affinity diagramming [14, 64, 148, 149].

We summarize current trends and gaps in how surveyed studies evaluate human-AI decision making, and make recommendations for future work.

Current trends. (1) Diverse evaluation focuses. Depending on the research questions and assistance elements studied, prior studies focused on different evaluation constructs. Our analysis reveals a framework that differentiates between dimensions evaluating the human-AI decisions and dimensions evaluating human perception of and interactions with the AI, each with subjective and objective measurements. (2) A focus on efficacy when evaluating decision tasks, although efficiency and subjective satisfaction are also useful indicators. (3) A focus on understanding, trust, system satisfaction, and fairness with regard to the AI. We note that some of these focuses could be a result of the field's emphasis on explanation features and fair machine learning. (4) A lack of common measurements. Within a given measurement area, there exist significant variations in the choices of evaluation construct, content, and formulation. For example, trust has been measured by a single item, by multiple items, by trustworthiness dimensions, by trust intention, and by objective reliance, among others; similarly, there are many nuanced constructs for measuring satisfaction.

Gaps in current practices. (1) A focus on decision efficacy (i.e., performance), with less emphasis on efficiency and user satisfaction. The three are commonly used constructs for usability measures [46]. This reflects a deep value of the field [17], which prioritizes optimizing decision outcomes over the experience of human decision makers. That being said, we acknowledge that not all decision tasks require high efficiency (and the efficiency of AI alone is trivially better than that of human-AI teams). An open question for the field is to better understand the role of efficiency in tasks where it is necessary for humans and AI to collaborate. (2) The use of subjective versus objective measurements needs to be better understood and regulated. It is important to recognize that the results from subjective and objective metrics do not always align, and the two may in fact be measuring different constructs, although some studies make mixed claims. For example, participants can express high subjective understanding without objectively understanding the model's behaviors. In some cases, objective metrics measure behavioral outcomes that are guided by the user attitudes evaluated by subjective measures, but often in a non-linear way. One example that has long been studied in the human factors literature is trust as an attitude (measured subjectively) versus reliance as a behavior (measured objectively). Although some studies claim to use reliance behaviors to reflect trust, many other factors besides trust can influence reliance, such as required effort, perceived risk, self-confidence, and time constraints [85]. We do not claim the superiority of either. Studies may choose to focus on objective versus subjective measurements for many reasons; for example, the research questions may deal with user attitudes versus behavioral outcomes, or it may be easier or only feasible in practice to gather data for one type of measure.
However, this choice is often not explicitly justified or disentangled in terms of the actual constructs being measured. (3) Home-grown measurements, especially subjective survey items, are often used. There is a lack of practices to validate, re-use (and enable re-use of) measurements, and to leverage existing psychometrics or survey scales developed in HCI. Especially for subjective measurements, it is also not common practice to publish the survey scales used in the experiments. As a result, it can be difficult to replicate a study or compare different studies. (4) Variance in the coverage of measurements. Some studies measure only task efficacy or ask only about user trust, while other studies cover many aspects. While the choice should be driven by the research questions, this variance might reflect a lack of a common framework or of awareness for researchers to make choices in a principled manner.

Recommendations for future work. (1) Choose evaluation metrics based on research questions/hypotheses and targeted constructs. It is important to articulate what constructs, whether subjective perceptions or attitudes or objective behavioral outcomes, and whether with regard to the AI or to the decision task, should be measured for the research questions or hypotheses. In general, researchers should pay attention to the concepts of measurement validity established in statistics and the social sciences, including construct validity (does the test measure the concept it is intended to measure?) [31] and content validity (is the test fully representative of what it aims to measure?) [84]. Meanwhile, thinking through the different areas of evaluation metrics (i.e., what can a given design or assistance element impact?) can help formulate more comprehensive and insightful hypotheses. (2) Work towards common metrics and a shared understanding of the meanings, strengths, and weaknesses of different evaluation methods. Such an understanding is key to a rigorous and replicable science. We must also recognize that human-AI decision making is a nascent area where new metrics may need to be developed; studies should not be limited to the areas reviewed in this paper or to existing metrics. (3) Keep reflecting on common evaluation metrics as value-laden choices. Evaluation metrics, if widely accepted and used, can profoundly shape the outcomes of a field. At a collective level, we should keep questioning whether the evaluation measures we use capture what matters for stakeholders and society, and what the potential long-term outcome could be if we prioritize one set of measures over another. This will also help the field expand its measurements and ultimately lead to more principled and responsible AI for decision making.

By summarizing the design choices made in more than 100 papers with empirical studies of human-AI decision making, specifically around the decision tasks, the AI assistance elements studied, and the evaluation metrics, we reflect on the barriers for the field to produce scientific knowledge and effectively advance human-AI decision making. A few core recommendations for future work emerged from our analysis, as we summarize below.

Building on each other's work. The advancement of empirical science requires joint effort. The field should learn from other experimental sciences, such as psychology, to practice replication, meta-analysis across studies, rigorous methodology and metrics development, and theory development that helps consolidate (sometimes contradicting) empirical results.
Building on each other's work also means that researchers should prioritize enabling others to re-use and reproduce their work when publishing results, by articulating the rationales behind design choices, reflecting on them to build shared knowledge, and making study materials accessible. The field should also strive to establish common practices or infrastructure that make knowledge sharing easier. For example, in the context of evaluation for human-centered machine learning, Sperrle et al. [126] propose the use of two artifacts: a checklist to help researchers make more principled choices in study design, and a reporting template that, besides results, covers many aspects of study design such as hypotheses, procedure, tasks, data, participants, and analysis.

Developing common frameworks for human-AI decision making. Another way to enable generalizable and unified scientific knowledge is to develop frameworks that account for the research space of human-AI decision making. In this paper, we discuss the need for the field to develop frameworks that characterize different decision tasks, lay out the design space of AI assistance elements, and map out areas of evaluation metrics. Such frameworks can shape research efforts in several ways. First, they provide researchers a shared understanding with which to identify important research problems and articulate research questions in a common language. For example, with a framework on the design space of AI assistance elements, researchers can identify under-explored areas. Second, frameworks make explicit otherwise latent or disregarded factors that can help interpret and consolidate results across studies, ultimately leading to more robust knowledge and theories. For example, a framework on task characteristics can help differentiate between the setups of two studies, and a framework on evaluation metrics can help differentiate their coverage of measurements. Last but not least, developing principled frameworks is also a critical and reflective practice: reflecting on the limitations and gaps in current research, and questioning the missing perspectives. For example, we urge the field to consider the design space of AI assistance beyond supporting discrete decision trials, by paying attention to the entire decision process, the holistic experience with an AI system, and contextual and individual factors. We also encourage research efforts that systematically examine what should be measured for human-AI decision making, considering what matters for different stakeholders beyond the decision makers (e.g., people whose lives will be impacted by the decisions), and what the ethical principles of decision-support AI should be.

Bridging the AI and HCI communities to mutually shape human-AI decision making. Advancing human-AI decision making requires both a foundational understanding of human needs and behaviors, and, building on that understanding, the development of more effective and human-compatible AI to support decision makers. At present, research efforts are somewhat one-directional: the HCI community typically works as the receiver of new AI techniques, then builds systems or designs evaluative studies around them. How can the two communities work better together? How can HCI research drive AI technical development? Such questions have long been contemplated in other interdisciplinary areas such as interactive machine learning [5] and human-robot interaction [118]. We believe one aspect is to reconsider the priorities of HCI research contributions.
Rather than focusing on conducting evaluative studies, developing theories and principled frameworks based on empirical studies and engagement with user needs can help guide AI research efforts. For example, a framework of AI assistance elements can inform what kinds of AI techniques are needed to better support human-AI decision making, and a framework of evaluation metrics can guide technical optimization efforts. Meanwhile, the AI community should prioritize technical work that is informed by human needs and behaviors, and actively seek to distill insights from empirical studies, as well as from psychological and behavioral theories, into computational work. As always, cross-disciplinary collaboration will require a change of culture and translational research to bridge different perspectives. We hope that the common goal of improving human-AI decision making can unite researchers from the two communities, and that this survey can serve as a bridge for joint research efforts.