authors: Shi, Li; Bhattacharya, Nilavra; Das, Anubrata; Lease, Matthew; Gwizdka, Jacek
title: The Effects of Interactive AI Design on User Behavior: An Eye-tracking Study of Fact-checking COVID-19 Claims
date: 2022-02-17
journal: 7th ACM SIGIR Conference on Human Information Interaction and Retrieval, CHIIR 2022
DOI: 10.1145/3498366.3505786

We conducted a lab-based eye-tracking study to investigate how the interactivity of an AI-powered fact-checking system affects user interactions, such as dwell time, attention, and the mental resources involved in using the system. A within-subject experiment was conducted in which participants used an interactive and a non-interactive version of a mock AI fact-checking system and rated their perceived correctness of COVID-19 related claims. We collected web-page interactions, eye-tracking data, and mental workload using NASA-TLX. We found that the affordance of interactively manipulating the AI system's prediction parameters affected users' dwell times and eye fixations on AOIs, but not their mental workload. In the interactive system, participants spent the most time evaluating claims' correctness, followed by reading news. This promising result shows a positive role of interactivity in a mixed-initiative AI-powered system.

As an important task in Information Retrieval (IR), fact-checking has various implications for both system-centered and user-centered IR. These include which results should be returned, how they should be presented, what models of interaction should be provided, and how success can be evaluated. Many new models for automatic fact-checking of claims have recently been developed in the machine learning and natural language processing literature [7, 8, 10, 18]. These models, however, have focused on fully-automated fact-checking and on maximizing predictive accuracy. While accurate predictions are important, a user doubtful of online information is likely to remain skeptical of any fact-checking tool [12]. As with all AI systems, fact-checkers operate on limited resources and are thus fallible and prone to errors. Users may arrive at a wrong decision influenced by model errors: Nguyen et al. [15] show that users might trust a fact-checking model even when the model is wrong. Further, Mohseni et al. [13] show that transparent systems prevent users from over-trusting model predictions. Effective human-AI teaming [3] might alleviate such issues; for this, a fact-checking system needs to be transparent. A transparent system could reveal to a user how it made a prediction, supporting the user's understanding and calibrating trust [1, 3, 4]. Moreover, individual claim assessments will certainly rely in part on the user's prior worldviews concerning the perceived credibility of sources and claims. Thus, a fact-checking system needs to integrate user beliefs and enable users to infuse their views and knowledge into the system. Additionally, such a system needs to be transparent by communicating the prediction uncertainty of its model and enabling users to perform their own in-depth reasoning. Recent investigations have also suggested that search engines, like any technology, have the potential to harm their users [2, 11, 17]. Allowing users to interact with a tool might help mitigate such harm.

In this short paper, we present an eye-tracking and usability analysis of a human-AI mixed-initiative fact-checking system.
We believe that such systems can potentially augment the efficiency and scalability of automated IR with transparent and explainable AI [12, 14, 15]. Starting with a claim as a query, the system retrieves relevant news articles. It then infers the degree to which each article supports or refutes the claim (stance), as well as each news source's reputation. The system then aggregates this evidence and predicts the claim's correctness. It shows the user how the information is being used and what the sources of the model's uncertainty are. Our focus in this paper is not to evaluate the fact-checking IR model per se, but rather to understand how users interact with the system via two different user interfaces: an interactive interface and a non-interactive interface. In the interactive interface (Figure 1), the model's predicted source reputation and stance for each retrieved article are shown to the user. These can be revised via simple sliders to reflect a user's beliefs and/or to correct erroneous model estimates. The overall claim prediction is then updated visually in real time as the user interacts with the system. In the non-interactive interface (Figure 2), the source reputation and stance values are shown as static bars, with no option to change any of the data presented. The key research aim is therefore to study the effects of interactive AI design on user behavior in fact-checking systems. This paper investigates three aspects of user behavior: dwell time, attention, and mental workload. We hypothesize that people would spend more time in the interactive interface, pay more attention to the interactive elements, and that the interactive interface would impose a higher mental workload.

A controlled, within-subjects eye-tracking study was conducted in the Information eXperience usability lab at the University of Texas at Austin (N=40, 22 females). Voluntary participants interacted with two versions (user interfaces) of a fact-checking system. Participants were pre-screened for native-level English familiarity, 20/20 vision (uncorrected or corrected), and non-expert familiarity with the topic of the content being shown. A Tobii TX-300 eye-tracker was used to record the participants' eye movements. Upon completion of the experiment, each participant was compensated with USD 25.

In each interface, there were 12 trials, each consisting of viewing a claim and, optionally, its corresponding news articles. Screenshots of single trials from the two versions of the system are shown in Figures 1 and 2. In each trial, a claim was shown at the top of the interface, and surrogates of five related news articles were presented below, each with its corresponding news source, source reputation, news headline, and the article's stance towards the claim. Based on the articles' stances and the news sources' reputations, the system provided a prediction of the claim's correctness at the bottom. The news headlines were clickable and, upon clicking, opened the news article in a new tab. The claims and corresponding news articles were on the topic of the COVID-19 pandemic. They were handpicked by the researchers to simulate a mock version of the fact-checking system for usability analysis. Each claim was selected so as to have a pre-assigned ground-truth correctness value of TRUE, FALSE, or UNSURE (claims that are partially true, or not totally proven).
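The paper does not specify the aggregation model behind the correctness prediction shown at the bottom of each trial. The minimal Python sketch below illustrates the kind of data a trial carries and one simple way stance and source reputation could be combined into that prediction, re-computed whenever a slider moves. All class names, value ranges, example headlines, and the weighting scheme are hypothetical illustrations, not the authors' implementation.

```python
# A minimal, hypothetical sketch of one trial's data and of how an interface
# like this one could aggregate per-article stance and source reputation into
# the overall correctness prediction. NOT the system's actual model.
from dataclasses import dataclass

@dataclass
class ArticleSurrogate:
    source: str
    headline: str
    reputation: float  # 0.0 (unreliable) .. 1.0 (highly reputable); slider in the interactive UI
    stance: float      # -1.0 (refutes claim) .. +1.0 (supports claim); slider in the interactive UI

@dataclass
class Trial:
    claim: str
    ground_truth: str  # "TRUE", "FALSE", or "UNSURE"
    articles: list[ArticleSurrogate]

def predicted_correctness(articles: list[ArticleSurrogate]) -> float:
    """Reputation-weighted mean stance, mapped to [0, 1]
    (0 = likely false, 0.5 = unsure, 1 = likely true)."""
    total_weight = sum(a.reputation for a in articles)
    if total_weight == 0:
        return 0.5  # no usable evidence: maximal uncertainty
    weighted_stance = sum(a.stance * a.reputation for a in articles) / total_weight
    return (weighted_stance + 1.0) / 2.0

# Re-evaluating predicted_correctness whenever a slider moves yields the
# real-time update of the overall prediction in the interactive interface.
trial = Trial(
    claim="Asymptomatic people can transmit COVID-19",
    ground_truth="TRUE",
    articles=[
        ArticleSurrogate("WHO", "How COVID-19 spreads", reputation=0.95, stance=0.9),
        ArticleSurrogate("Unknown blog", "No proof of silent spread", reputation=0.2, stance=-0.6),
    ],
)
print(f"Predicted correctness: {predicted_correctness(trial.articles):.2f}")
```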
The TRUE and UNSURE claims were handpicked from reputable websites in the medical domain, such as the World Health Organization, WebMD, Mayo Clinic, Johns Hopkins University, US state government webpages, and others. The FALSE claims were selected by searching for "coronavirus myths" on popular search engines. The supporting news articles for each claim were collected by manually searching the web. The source reputations for news articles were taken from existing datasets [5, 16], while the stance values of each news article towards each claim were labelled by the researchers. Two example claims are "wearing masks is not required during exercising" and "asymptomatic people can transmit COVID-19". In total there were 24 claims (8 TRUE, 8 FALSE, 8 UNSURE), distributed equally between both interfaces. The order of the interfaces (interactive / non-interactive) was balanced, and the order of the claims was randomized. The list of claims and corresponding news articles used is shared in the GitHub repository: https://github.com/ixlab-ut/chiir-2022.

The overall procedure of the experimental session is illustrated in Figure 3. Each session started with two training trials (one in the interactive interface, one in the non-interactive interface) for participants to get familiar with the two fact-checking interfaces. Participants then started the trials in one of the two interfaces (experiment blocks), which was randomly chosen and balanced across all participants. In each trial (viewing a claim), participants interacted with the interface freely without a time limit. Participants were also instructed to click on news headlines to open the underlying news articles in a new browser tab, and to read them, if they considered it necessary for evaluating the claim. Before and after viewing each claim in the system, participants indicated their perceived correctness of the claim (which is not analyzed in this short paper). After completing 12 trials in the first interface (block), participants reported their mental workload using the NASA-TLX questionnaire [6]. They were then allowed to take a five-minute break before resuming the second block (in the other interface). The NASA-TLX questionnaire was again administered at the end of that block.

To study the effect of the differences in the design of the user interfaces on users' information behavior, we collected eye-tracking data to reflect the participants' attention, reading, and information processing. We divided the interfaces into six areas of interest (AOIs): (i) claim text (T); (ii) news source names (S); (iii) source reputation (R); (iv) news article headlines (H); (v) stance of the news article towards the claim (A); and (vi) predicted correctness of the claim (C). The eye-tracking (ET) measures comprise total fixation count, total fixation duration, and average fixation duration. For analysis, we discarded low-quality and non-reading eye-tracking data, i.e., fixations with durations shorter than 100 milliseconds or longer than 1500 milliseconds [9].

To assess how much time participants spent interacting with our system vis-à-vis how much time they spent reading the original news articles (which opened in a new browser tab upon clicking the news headlines in the interface), we recorded the total dwell times while participants were viewing the interface (interface dwell time) and while they were reading news articles (news dwell time). As shown in Figure 4, participants generally spent more time reading news than using the interface.
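To make the measurement and analysis pipeline concrete, the following Python sketch shows one way it could be implemented: the 100-1500 ms fixation-duration filter and per-AOI aggregation into the three fixation metrics described above, plus the paired Wilcoxon signed-ranks comparison and two-way ANOVA used for the results reported below. This is an assumed illustration rather than the authors' code; the column names, helper functions, and the effect-size convention (r = Z / sqrt(n)) are hypothetical choices.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm, wilcoxon
from statsmodels.formula.api import ols


def aggregate_fixations(fixations: pd.DataFrame) -> pd.DataFrame:
    """fixations: one row per fixation, with hypothetical columns
    ['participant', 'interface', 'aoi', 'duration_ms']."""
    # Discard low-quality / non-reading fixations outside the 100-1500 ms window.
    valid = fixations[fixations["duration_ms"].between(100, 1500)]
    # Aggregate into total fixation count, total and average fixation duration.
    return (valid.groupby(["participant", "interface", "aoi"])["duration_ms"]
                 .agg(total_fixation_count="count",
                      total_fixation_duration="sum",
                      average_fixation_duration="mean")
                 .reset_index())


def paired_wilcoxon(interactive, non_interactive):
    """Wilcoxon signed-ranks test on one per-participant measure
    (e.g., news dwell time) under the two interfaces."""
    res = wilcoxon(interactive, non_interactive)   # T statistic and two-sided p-value
    z = norm.isf(res.pvalue / 2)                   # z recovered from the two-sided p-value
    effect_size_r = z / np.sqrt(len(interactive))  # one common convention; an assumption here
    return res.statistic, res.pvalue, effect_size_r


def two_way_anova(df: pd.DataFrame, metric: str, factor: str) -> pd.DataFrame:
    """Two-way ANOVA of interface x factor ('aoi' or 'claim_correctness')
    on one eye-fixation metric; returns the ANOVA table."""
    model = ols(f"{metric} ~ C(interface) * C({factor})", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)
```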
Overall, in the interactive interface, participants spent more time using the fact-checking system (interface dwell time) as well as reading the underlying news articles (news dwell time). A Wilcoxon signed-ranks test (T=255, n=40, p<.05) indicated that the total dwell time spent on reading the underlying news articles was significantly different between the two interfaces. The sum of the positive difference ranks (564) was larger than the sum of the negative difference ranks (255), showing that people spent more time reading news in the interactive interface than in the non-interactive interface. The effect size for the test was 0.33, signifying that the presence of interactivity in the system had an effect of making users spend more time reading the underlying news articles.

UI and ET. Figure 5 shows that total fixation count and total fixation duration have a wider spread and higher maximums in the interactive interface. We employed Wilcoxon signed-ranks tests between the two interfaces to test for significant differences in total fixation count (T=99, n=40, p<.05) and total fixation duration (T=77, n=40, p<.05). The results indicated that these fixation measures differed significantly between the two interfaces. In addition, the sums of the positive difference ranks for total fixation count (721) and total fixation duration (743) were larger than the respective sums of the negative difference ranks (99 and 77). Therefore, there were more fixations and an overall longer eye-dwell time in the interactive interface. Moreover, the effect size for the matched-pair samples was 0.66 for total fixation count and 0.71 for total fixation duration, showing that the presence of interactivity had a strong effect of attracting participants' visual attention and processing. We did not find any significant difference in average fixation duration.

AOIs, UI and ET. Figure 6 illustrates that the most fixations and the longest fixation durations were on the 'news article headlines' area of interest (AOI), followed by the 'source reputation', 'news source', and 'stance of article' AOIs. The 'predicted correctness' AOI had fewer fixations and shorter durations, while the 'claim text' AOI had the fewest. A two-way ANOVA was conducted to examine the effect of interface and AOI on the eye-fixation metrics. There was a significant interaction between the effects of interface and AOI on total fixation count, F(5, 464) = 4.00, p < .05, total fixation duration, F(5, 464) = 4.41, p < .05, and average fixation duration, F(5, 464) = 3.42, p < .05. Simple main effects analysis showed that both interface and AOI had significant main effects on all three eye-fixation metrics.

Claim correctness, UI and ET. Figure 7 shows the eye-tracking metrics for viewing TRUE, FALSE, and UNSURE claims in the two interfaces. The total fixation count and total fixation duration while viewing UNSURE claims were higher than when viewing TRUE or FALSE claims. To investigate further, a two-way ANOVA was conducted to examine the effect of interface and claim correctness on the eye-fixation metrics. Interface type had a significant main effect on all three eye-fixation metrics: total fixation count, F(1, 234) = 37.42, p < .05, total fixation duration, F(1, 234) = 42.18, p < .05, and average fixation duration, F(1, 234) = 4.85, p < .05.
Claim correctness had a significant main effect on total fixation count, F(2, 234) = 5.95, p < .05, and total fixation duration, F(2, 234) = 5.50, p < .05, but not on average fixation duration. Interaction effects were also not significant.

A Wilcoxon signed-ranks test (T=208, n=40, p>.05) indicated that the participants' mental workload was not significantly different between the two interfaces. Therefore, interactivity did not significantly influence the mental workload level.

We conducted a lab-based experiment to investigate how the interactivity of an AI-powered fact-checking system affects user interactions. We found that the interactivity of the system influences the system dwell time and the fixation counts and durations, but not mental workload. Overall, people tended to spend more time reading the original news than looking at and interacting with the two systems. This indicates that they did not rely on the system unilaterally, but read the original news to help them make informed decisions. Furthermore, people engaged more and spent more time reading the news in the interactive system than in the non-interactive system.

We found that the interactivity of the interface makes a difference in the fixation counts and durations regardless of the AOI type and claim condition: people consistently paid more attention to the interactive interface when using the fact-checking system. The news headlines drew the most attention among all the interface elements; by reading the headlines, people decided which news articles to view. We also found that the fixation counts and durations differed significantly between the news headlines, news source, source reputation, and article stance AOIs, and users tended to pay more attention to these elements when they were interactive. The difference between the claim conditions showed that people paid more attention to the UNSURE claims: they needed more information from the system when dealing with UNSURE claims.

People's mental workload was not influenced by the system interactivity. Even though they paid more attention to the interactive interface, the amount of perceived mental resources required was apparently not significantly changed. Thus, the interactivity did not increase the self-reported effort of using the system. These findings suggest that system interactivity plays a positive role in a mixed-initiative AI-powered system: it encourages people to spend the most time evaluating claim correctness, and then reading the news, while not imposing a higher mental workload on users.

Limitations of our work include using only researcher-assigned tasks and researcher-selected claims, which were assessed by participants in the lab environment rather than in the context of their natural information tasks and fact-checking. Additionally, we did not capture participants' prior knowledge regarding the particular claims in the study. In practice, fact-checking is often performed under time constraints and with high error costs; however, this study did not incentivize the participants for accuracy or efficiency. Future work will include using sets of claims on different topics and investigating user interaction in the context of naturally generated tasks. Surprisingly, we notice that users spend a significant portion of their time reading the linked articles. Future work could explore the effect of time constraints and of incentivizing participants' accuracy on engagement with interactive AI.
Specifically, a time-constrained setup might encourage participants to engage more with the AI's outputs instead of reading all the articles themselves. We also see that participants pay more attention to the interactive interface; however, it is unclear whether this increased attention reflects deeper engagement or simply the additional attention that manipulating interactive interface elements inherently requires. Future work could bolster the current findings by closely observing prolonged user interaction with the tool.

References:
[1] Guidelines for human-AI interaction.
[2] Cognitive biases in search: A review and reflection of cognitive biases in Information Retrieval.
[3] Beyond accuracy: The role of mental models in human-AI team performance.
[4] What are people doing about XAI user experience? A survey on AI explainability research and practice.
[5] NELA-GT-2019: A large multi-labelled news dataset for the study of misinformation in news articles.
[6] NASA-task load index (NASA-TLX); 20 years later.
[7] The quest to automate fact-checking.
[8] Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster.
[9] Eye tracking: A comprehensive guide to methods and measures.
[10] Factoring fact-checks: Structured information extraction from fact-checking articles.
[11] Understanding credibility judgements for web search snippets.
[12] Mix and match: Collaborative expert-crowd judging for building test collections accurately and affordably.
[13] Machine learning explanations to prevent overtrust in fake news detection.
[14] An interpretable joint graphical model for fact-checking from crowds.
[15] Believe it or not: Designing a human-AI partnership for mixed-initiative fact-checking.
[16] NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles.
[17] The positive and negative influence of search results on people's decisions about the efficacy of medical treatments.
[18] Detection and resolution of rumours in social media: A survey.

We thank the reviewers for their valuable feedback, and the research participants for their time. This research was completed under UT Austin IRB study 2017070049 and supported in part by Wipro, the Micron Foundation, the Knight Foundation, and by Good Systems.