AI-Driven Interface Design for Intelligent Tutoring System Improves Student Engagement
Byungsoo Kim, Hongseok Suh, Jaewe Heo, Youngduck Choi
2020-09-18

An Intelligent Tutoring System (ITS) has been shown to improve students' learning outcomes by providing a personalized curriculum that addresses the individual needs of every student. However, despite the effectiveness and efficiency that an ITS brings to students' learning process, most studies in ITS research have devoted little effort to designing an ITS interface that promotes students' interest in learning, motivation and engagement by making better use of AI features. In this paper, we explore AI-driven design for the interface of an ITS that describes diagnostic feedback for students' problem-solving process, and we investigate its impact on their engagement. We propose several interface designs powered by different AI components and empirically evaluate their impact on student engagement through Santa, an active mobile ITS. Controlled A/B tests conducted on more than 20K students in the wild show that AI-driven interface design improves the factors of engagement by up to 25.13%.

The recent COVID-19 pandemic has had an unprecedented impact across the globe. With social distancing measures in place, many organizations have implemented virtual and remote services to prevent the widespread infection of the disease and to support the social needs of the public. Educational systems are no exception and have changed dramatically with the distinctive rise of online learning. Also, the demand for evaluation methods of learning outcomes that are safe, reliable and acceptable has led the educational environment to take a paradigm shift toward formative assessment. An Intelligent Tutoring System (ITS), which provides pedagogical services in an automated manner, is a promising technique to overcome the challenges that the post-COVID-19 educational environment has brought. However, despite the development and growing popularity of ITSs, most studies in ITS research have mainly focused on diagnosing students' knowledge state and suggesting proper learning items, and less effort has been devoted to designing an ITS interface that promotes students' interest in learning, motivation and engagement by making better use of AI features. For example, Knowledge Tracing (KT), the task of modeling students' knowledge through their learning activities over time, is a long-standing problem in the field of Artificial Intelligence in Education (AIEd). From Bayesian Knowledge Tracing [10, 48] to Collaborative Filtering [27, 43] and Deep Learning [5, 15, 36, 49], various approaches have been proposed, and KT is still being actively studied. Learning path construction is also an essential task that an ITS performs, where learning items are suggested to maximize students' learning objectives. This task is commonly formulated as a reinforcement learning problem [3, 20, 32, 51] and is also an active research area in AIEd. On the other hand, little work has been done in the context of the user interface for ITSs, including intelligent authoring shells [17], affective interfaces [30, 31] and usability testing [8, 26, 38]. Although these works cover important aspects of ITSs, the methods are outdated and their effectiveness is not reliable since the experiments were conducted on a small scale.
An ITS interface that does not fully support making the AI's analysis transparent to students adversely affects their engagement. Accordingly, improving the interface of an ITS is closely related to explainable AI. Explaining what exactly makes AI models arrive at their predictions and making them transparent to users is an important issue [12, 18, 19], and it has been actively studied in both the human-computer interaction [1, 24, 40, 45] and machine learning [39] communities. There is a large body of work on explainability in many subfields of AI, including computer vision [13, 23, 34, 50], natural language processing [14, 22, 29, 35] and speech processing [25, 37, 41, 42]. Explainability in AIEd is mainly studied in the direction of giving feedback that helps students identify their strengths and weaknesses. A method of combining item response theory with deep learning has been proposed, from which students' proficiency levels on specific knowledge concepts can be found [4, 46, 47]. Also, [2, 7] attempted to give students insight into why the system recommends a specific learning material.

In this paper, we explore AI-driven design for the interface of an ITS that describes diagnostic feedback for students' problem-solving process and investigate its impact on their engagement. We propose several interface designs composed of different AI-powered components. Each page design couples the interface with AI features at a different level, providing a different amount of information and explainability. We empirically evaluate the impact of each design on student engagement through Santa, an active mobile ITS. We consider the conversion rate, Average Revenue Per User (ARPU), total profit, and the average number of free questions a student consumed as factors measuring the degree of engagement. Controlled A/B tests conducted on more than 20K students in the wild show that AI-driven interface design improves the factors of engagement by up to 25.13%.

Santa (https://aitutorsanta.com) is a multi-platform AI tutoring service with more than a million users in South Korea, available through Android, iOS and the Web, that exclusively focuses on the Test of English for International Communication (TOEIC) standardized examination. The test consists of two timed sections, Listening Comprehension (LC) and Reading Comprehension (RC), each with 100 questions divided into four and three parts, respectively. The final test score ranges from 10 to 990 in steps of 5 points. Santa helps users prepare for the test by diagnosing their current state and dynamically suggesting learning items appropriate for their condition. Once a user solves each question, Santa provides educational feedback on their response, including an explanation, a lecture or another question. The flow of a user entering and using the service is described in Figure 1. When a new user first opens Santa, they are greeted by a diagnostic test (1). The diagnostic test consists of seven to eleven questions resembling the questions that appear on the TOEIC exam (2). As the user progresses through the diagnostic test, Santa records the user's activity and feeds it to a back-end AI engine that models the individual user. At the end of the diagnostic test, the user is presented with a diagnostic page detailing the analytics of the user's problem-solving process in the diagnostic test (3). After that, the user may choose to continue their study by solving practice questions (4).
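To make the flow above concrete, the following is a minimal sketch of the diagnostic-test loop: each response is recorded and streamed to a back-end engine that models the individual user, and the accumulated record is then used to render the diagnostic page. All names here (Interaction, AIEngine, run_diagnostic_test) are hypothetical illustrations under these assumptions, not Santa's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Interaction:
    question_id: str
    answer: str
    is_correct: bool
    elapsed_ms: int  # time spent on the question


class AIEngine:
    """Stand-in for the back-end engine that models an individual user."""

    def __init__(self) -> None:
        self.history: List[Interaction] = []

    def update(self, interaction: Interaction) -> None:
        # The real engine would update its predictions (correctness,
        # timeliness, score, dropout, engagement) from this interaction.
        self.history.append(interaction)

    def diagnostic_report(self) -> dict:
        # Placeholder analytics used to render the diagnostic page.
        solved = len(self.history)
        correct = sum(i.is_correct for i in self.history)
        return {
            "questions_seen": solved,
            "correct_rate": correct / solved if solved else 0.0,
        }


def run_diagnostic_test(questions: list,
                        get_user_response: Callable[[dict], Interaction],
                        engine: AIEngine) -> dict:
    """Present the 7-11 diagnostic questions, stream each response to the
    AI engine, and return the data behind the diagnostic page."""
    for question in questions:
        interaction = get_user_response(question)  # user answers in the app
        engine.update(interaction)                 # activity fed to the engine
    return engine.diagnostic_report()
```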
As the user decides whether to continue their study through Santa after viewing the diagnostic page, we consider a page design that encourages user engagement and motivates the user to study further. Throughout this section, we explore the design of a diagnostic page that can most effectively express the AI features brought by the back-end AI models and better explain the user's problem-solving process in the diagnostic test. We propose two page designs summarizing the user's diagnostic test result: page design A (Figure 2a) and page design B (Figure 2b). Each page design provides analytics of the diagnostic test result at a different level of information and explainability, and is powered by different AI models running behind Santa. The effectiveness of each page design and its impact on user engagement is investigated through controlled A/B tests in Section 4.

Page design A presents the following four components: Estimated Score, Grade by Part, Comparison to Users in the Target Score Zone and Tutor's Comment.

Estimated Score. This component presents the user's estimated score based on their responses in the diagnostic test. The estimated scores and the target scores are presented together so that the user can easily compare them. The percentile rank is obtained by comparing the estimated score with the scores of more than a million users recorded in the database of Santa.

Grade by Part. This component provides detailed feedback on the user's ability for each question type to help them identify their strengths and weaknesses (Figure 3b). For each part of the TOEIC exam, the red and white bar graphs show the user's current proficiency level and the proficiency level required to achieve the target score, respectively. The red bar graphs are obtained by averaging the estimated probabilities of the user correctly answering the potential questions for each part. Similarly, the white bar graphs are obtained by computing the averaged correctness probabilities for each part over users in the target score zone.

Comparison to Users in the Target Score Zone. This component shows a radar chart of five features representing the user's particular aspects of ability (Figure 4). The radar chart is intended to give the feeling that an AI teacher is analyzing the user closely from multiple perspectives. The five features give explanations of how the AI models analyze the user's problem-solving process, making Santa look more like an AI teacher. The five features are the following:
• Performance: The user's expected performance on the actual TOEIC exam.
• Correctness: The probability that the user will correctly answer each given question.
• Timeliness: The probability that the user will solve each given question within the time limit.
• Engagement: The probability that the user will continue studying with Santa.
• Continuation: The probability that the user will continue the current learning session.
The red and white pentagons present the five features with the values of the current user and the averaged values of users in the target score zone, respectively. This component is particularly important: as shown in Section 4, users' engagement factors vary greatly depending on the presence or absence of the radar chart.

Tutor's Comment. This component presents natural language text describing the user's current ability and suggestions for achieving the target score (Figure 3c). This feature is intended to provide the learning experience of being taught by a human teacher through a more human-friendly interaction. Based on the user's diagnostic test result, the natural language text is selected from a set of pre-defined templates.
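The per-part bars and radar-chart features above are all derived from the back-end model's predictions. The sketch below illustrates that derivation, assuming a hypothetical predict_correctness(user, question) handle and an engine object exposing the five predictions; it is an illustration of the averaging described above under those assumptions, not Santa's actual implementation.

```python
from collections import defaultdict
from statistics import mean


def part_proficiency(user, question_bank, predict_correctness):
    """Red bars in Grade by Part: average predicted correctness probability
    of the potential questions in each TOEIC part for the current user."""
    probs_by_part = defaultdict(list)
    for q in question_bank:
        probs_by_part[q["part"]].append(predict_correctness(user, q))
    return {part: mean(probs) for part, probs in probs_by_part.items()}


def target_zone_proficiency(target_zone_users, question_bank, predict_correctness):
    """White bars: the same per-part averages, computed over users whose
    estimated score already falls in the target score zone."""
    per_user = [part_proficiency(u, question_bank, predict_correctness)
                for u in target_zone_users]
    return {part: mean(p[part] for p in per_user) for part in per_user[0]}


def radar_features(user, engine):
    """Five radar-chart features; each value comes from a model prediction
    (assumed here to already be normalized to [0, 1] for display)."""
    return {
        "performance": engine.predict_expected_performance(user),
        "correctness": engine.predict_avg_correctness(user),
        "timeliness": engine.predict_avg_timeliness(user),
        "engagement": engine.predict_engagement(user),
        "continuation": 1.0 - engine.predict_session_dropout(user),
    }
```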
Although page design A is proposed to provide AI-powered feedback on the user's diagnostic test result, it has limitations in that its composition makes it difficult to deliver detailed information and insufficient to contain all the features computed by the AI models. To this end, page design A is revised in the direction of making better use of AI features, resulting in page design B.

Score Change of Santa Users Similar to You. This component provides information on the change in average scores of Santa users at a similar level to the current user, conveying the feeling that a specific score can be attained by using Santa (Figure 6b). It shows how the scores of these Santa users change by dividing them into the top 20%, middle 60% and bottom 20%, and presents the estimated average score attained after studying with Santa for 60 hours. This feature is obtained by finding Santa users with the same estimated score as the current user and computing their estimated score every time they consume a learning item.

Curriculum. This component presents the learning path personalized to the user to achieve their learning objective (Figure 6c). When the user changes the target date and target score by swiping, Santa dynamically suggests the number of questions and lectures the user must study per day based on their current position. The amount of study the user needs to consume every day is computed by finding Santa users whose initial state is similar to the current user's and tracking how their learning progresses, so that the user can achieve the target score on the target date.

When the user presses the flask button next to each component, a window pops up and provides an explanation of the AI models used to compute the features of the component (Figure 7). For instance, when the user presses the flask button next to the Estimated Score component, a window appears with an explanation of Assessment Modeling [6], Santa's score estimation method. This component conveys information about the AI technology provided by Santa to the user, giving them a feeling that the AI is actually analyzing them and increasing the credibility of the system.

The features in the components of each page design are computed by processing the output of Santa's AI engine, which takes the user's past learning activities and models individual users. Whenever the user consumes a learning item suggested by Santa, the AI engine updates the models of individual users and makes predictions on specific aspects of their ability. The predictions that the AI engine makes include the following: response correctness, response timeliness, score, learning session dropout and engagement. The response correctness prediction is made by following the approaches introduced in [27] and [5]. [27] is a Collaborative Filtering (CF) based method which models users and questions as low-rank matrices. Each vector in the user matrix and the question matrix represents the latent traits of a user and the latent concepts of a question, respectively. SAINT [5] is a deep learning based model that follows the Transformer [44] architecture. The deep self-attentive computations in SAINT allow it to capture complex relations among exercises and responses.
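As a concrete illustration of the collaborative-filtering view of response correctness, the sketch below scores a user-question pair as the inner product of their latent vectors passed through a sigmoid. The dimensions, the random factors and the sigmoid link are illustrative assumptions for exposition; the actual model of [27] is trained on Santa's response logs.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
n_users, n_questions, n_latent = 1000, 500, 16

# Low-rank factors: a latent-trait vector per user and a latent-concept
# vector per question. In practice these would be learned from response logs.
user_factors = rng.normal(scale=0.1, size=(n_users, n_latent))
question_factors = rng.normal(scale=0.1, size=(n_questions, n_latent))


def correctness_prob(user_id: int, question_id: int) -> float:
    """Predicted probability that the user answers the question correctly."""
    return float(sigmoid(user_factors[user_id] @ question_factors[question_id]))


# A single matrix product scores every (user, question) pair at once, which
# is why the CF model is the cheap, full-coverage counterpart to SAINT.
all_probs = sigmoid(user_factors @ question_factors.T)  # shape (n_users, n_questions)
```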
Since the CF-based model can quickly compute the probabilities of response correctness over all questions for all users, and SAINT predicts the response correctness probability for each user with high accuracy, the two models are complementary in real-world applications where both accuracy and efficiency are important. Assessment Modeling (AM) [6] is a pre-train/fine-tune approach to address label-scarce educational problems, such as score estimation and review correctness prediction. Following the pre-train/fine-tune method proposed in AM, a deep bidirectional Transformer encoder [11] based score estimation model is first pre-trained to predict the response correctness and timeliness of users conditioned on their past and future interactions, and then fine-tuned to predict the score of each user. The response timeliness and the score are predicted by the pre-trained model and the fine-tuned model, respectively. The learning session dropout prediction is based on the method proposed in DAS [28]. DAS is a deep learning based dropout prediction model that follows the Transformer architecture. Defining session dropout in a mobile learning environment as inactivity for 1 hour, DAS computes the probability that the user drops out of the current learning session whenever they consume each learning item. The engagement prediction is made by a Transformer encoder based model. The model is trained by taking the user's learning activity record as input and matching it to the payment status, based on the assumption that a user who makes a payment is highly engaged with the system.

In this section, we provide supporting evidence that AI-driven interface design for ITS promotes student engagement. We consider the conversion rate, Average Revenue Per User (ARPU) and total profit as factors for evaluating the users' engagement, since paying for a service means that the users are highly satisfied with the service and requires a strong determination to actively use the service. For users without the determination to make a payment, the average number of free questions a user consumed after the diagnostic test is a significant measure of engagement since it represents their motivation to continue the current learning session.

From April 15th to 24th, we conducted an A/B test by randomly assigning two different diagnostic test analytics pages to users: one without the radar chart of page design A (1,391 users) and the other the full page design A (1,486 users). Table 1 shows the overall results. We see that page design A with the radar chart improves all factors of user engagement. With the radar chart, the conversion rate, ARPU, total profit and the average number of free questions a user consumed increased by 22.68%, 17.23%, 25.13% and 11.78%, respectively, leading us to conclude that a more AI-like interface design for ITS encourages student engagement. Figure 8 and Figure 9 compare, per day, the conversion rate and the average number of free questions a user consumed between the two groups of the A/B test, respectively. We observe that the users given page design A with the radar chart made more payments and solved more free questions throughout the A/B test period.

The A/B test of page designs A and B was conducted from August 19th to September 11th by randomly allocating them to users. 9,442 users were allocated to page design A and 9,722 users were provided page design B. The overall results are shown in Table 2.
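For clarity, the sketch below spells out how the four engagement factors reported in Tables 1 and 2 can be computed for a test group, and how the relative improvements (e.g., the 25.13% increase in total profit) follow as percentage lifts of one group over the other. The UserRecord fields are illustrative assumptions, not Santa's actual data schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UserRecord:
    paid: bool            # did the user purchase after seeing the diagnostic page?
    revenue: float        # payment amount (0.0 if no purchase)
    free_questions: int   # free questions solved after the diagnostic test


def engagement_factors(group: List[UserRecord]) -> Dict[str, float]:
    """Conversion rate, ARPU, total profit and average free questions consumed."""
    n = len(group)
    total_profit = sum(u.revenue for u in group)
    return {
        "conversion_rate": sum(u.paid for u in group) / n,
        "arpu": total_profit / n,
        "total_profit": total_profit,
        "avg_free_questions": sum(u.free_questions for u in group) / n,
    }


def relative_lift(control: Dict[str, float], treatment: Dict[str, float]) -> Dict[str, float]:
    """Percentage improvement of the treatment page design over the control design."""
    return {k: 100.0 * (treatment[k] - control[k]) / control[k] for k in control}
```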
Compared to page design A, page design B is better at promoting all factors of user engagement, increasing the conversion rate, ARPU, total profit and the average number of free questions a user consumed by 11.07%, 10.29%, 12.57% and 7.19%, respectively. Note that although the page design with the radar chart in the previous subsection and page design A are the same, the values of the engagement factors of the page design with the radar chart in Table 1 differ from those of page design A in Table 2. The absolute value of each number can be changed by external factors, such as timing and the company's public relations strategy, and these external factors are not a problem since they apply to both groups within each A/B test. The conversion rate and the average number of free questions a user consumed, aggregated over every two days, are compared between the users assigned to page designs A and B in Figure 10 and Figure 11, respectively. We can observe in the figures that users experiencing page design B made more payments and solved more free questions during the A/B test period. Throughout the experiments, the results show that a more informative and explainable ITS interface design that makes better use of AI features improves student engagement.

6 RELATED WORKS

Although the development of ITSs has become an active area of research in recent years, most studies have mainly focused on learning science, cognitive psychology and artificial intelligence, resulting in little work done in the context of UI. [17] describes the UI issues of an intelligent authoring shell, which is an ITS generator. Through experience with the usage of TEx-Sys, the authoring shell proposed in the paper, the authors discuss the importance of a well-designed UI that brings system functionality to users. [16] considers applying multiple views to UIs for ITSs. The paper substantiates the usage of multiple perspectives on the domain knowledge structure of an ITS through the implementation of MUVIES, a multiple-views UI for ITSs. Understanding students' emotional states has become increasingly important for motivating their learning. Several works [30, 31] incorporate an affective interface into an ITS to monitor and correct students' emotional states during learning. [31] studies the usage of an affective ITS in Accounting remedial instruction. Virtual agents in the system analyze, behave and give feedback appropriate to students' emotions to motivate their learning. [30] proposes ATSDAs, an affective ITS for digital arts. ATSDAs analyzes the textual input of a student to identify their emotion and learning status. A visual agent in the system adapts to the student, provides text feedback based on the inferred results and thereby increases their learning interest and motivation. The performance of software can be measured by its usability, a quality that quantifies ease of use. Whether applying usability testing and usability principles to the design of the UI can improve the performance of an ITS is an open question [8]. [26] discusses the importance of UI design, usability and software requirements and suggests employing heuristics from the software engineering and learning science domains in the development process of an ITS. An example of applying basic usability techniques to the development and testing of an ITS is presented in [38]. The paper introduces Writing Pal, an ITS for helping to improve students' writing proficiency.
The design of Writing Pal incorporates many usability engineering methods, such as internal usability testing, focus groups and usability experiments. Providing explainable feedback that can identify the strengths and weaknesses of a student is a fundamental task in many educational applications [9]. DIRT [4] and NeuralCDM [46] propose methods to enhance the explainability of educational systems through cognitive diagnosis modeling, which aims to discover a student's proficiency levels on specific knowledge concepts. EKT, proposed in [21], is a bidirectional LSTM based knowledge tracing model. EKT explains the change in a student's knowledge mastery levels by modeling the evolution of their knowledge state on multiple concepts over time. Also, equipped with an attention mechanism, EKT quantifies the relative importance of each exercise for the mastery of the student's multiple knowledge concepts. As pointed out in [33], explainability also poses challenges to educational recommender systems. [2] addresses this issue by providing a visual explanation interface composed of a concepts' mastery bar chart, a recommendation gauge and a textual explanation. When a certain learning item is recommended, the concepts' mastery bar chart shows the concept-level knowledge of a student, the recommendation gauge represents the suitability of the item and the textual explanation describes the recommendation rule explaining why the item is suggested. Rocket, a Tinder-like UI introduced in [7], also provides explainability in learning content recommendation. When an ITS proposes a learning material to a user, Rocket shows a polygonal visual summary of AI-extracted features, such as the probability of the user correctly answering the presented question and the expected score gain when the user answers it correctly, which gives the user insight into why the system recommends the learning material. Based on the AI-extracted features, the user can decide whether or not to consume the suggested learning material through a swiping or tapping action.

REFERENCES

[1] Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda
[2] Explaining educational recommendations through a concept-level knowledge visualization
[3] Reinforcement Learning for the Adaptive Scheduling of Educational Activities
[4] DIRT: Deep Learning Enhanced Item Response Theory for Cognitive Diagnosis
[5] Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing
[6] Assessment Modeling: Fundamental Pre-training Tasks for Interactive Educational Systems
[7] Choose Your Own Question: Encouraging Self-Personalization in Learning Path Construction
[8] Usability evaluation of intelligent tutoring system: ITS from a usability perspective
[9] AI in Education needs interpretable machine learning: Lessons from Open Learner Modelling
[10] Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction
[11] BERT: Pre-training of deep bidirectional transformers for language understanding
[12] Monsters, Metaphors, and Machine Learning
[13] Interpretable explanations of black boxes by meaningful perturbation
[14] A compositional and interpretable semantic space
[15] Context-aware attentive knowledge tracing
[16] Interacting with educational systems using multiple views
[17] User interface aspects of an intelligent tutoring system
[18] Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA)
[19] DARPA's explainable artificial intelligence program
[20] Exploring multi-objective exercise recommendations in online education systems
[21] EKT: Exercise-aware knowledge tracing for student performance prediction
[22] Interpretable rationale augmented charge prediction system
[23] Interpretable 3D human action analysis with temporal convolutional networks
[24] How much information? Effects of transparency on trust in an algorithmic interface
[25] Interpretable deep learning model for the detection and reconstruction of dysarthric speech
[26] A Design Model for Educational Multimedia Software
[27] Machine Learning Approaches for Learning Analytics: Collaborative Filtering or Regression with Experts
[28] Deep Attentive Study Session Dropout Prediction in Mobile Learning Environment
[29] Interpretable neural models for natural language processing
[30] Usability of affective interfaces for a digital arts tutoring system
[31] The influence of using affective tutoring system in accounting remedial instruction on learning performance and usability
[32] Exploiting cognitive structure for adaptive learning
[33] Recommender systems for learning
[34] Learning conditioned graph structures for interpretable visual question answering
[35] Word2Sense: sparse interpretable word embeddings
[36] Deep knowledge tracing
[37] Interpretable convolutional filters with SincNet
[38] The Writing Pal intelligent tutoring system: Usability testing and development
[39] Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models
[40] Interacting meaningfully with machine learning systems: Three experiments
[41] Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis
[42] Improving the interpretability of deep neural networks with stimulated learning
[43] Recommender system for predicting student performance
[44] Attention is all you need
[45] Designing theory-driven user-centric explainable AI
[46] Neural Cognitive Diagnosis for Intelligent Education Systems
[47] Deep-IRT: Make deep learning based knowledge tracing explainable using item response theory
[48] Individualized Bayesian knowledge tracing models
[49] Dynamic key-value memory networks for knowledge tracing
[50] Interpretable convolutional neural networks
[51] Improving Student-System Interaction Through Data-driven Explanations of Hierarchical Reinforcement Learning Induced Pedagogical Policies