title: Knowledge Tracing: A Survey
authors: Abdelrahman, Ghodai; Wang, Qing; Nunes, Bernardo Pereira
date: 2022-01-08

Humans' ability to transfer knowledge through teaching is one of the essential aspects of human intelligence. A human teacher can track the knowledge of students to customize the teaching to students' needs. With the rise of online education platforms, there is a similar need for machines to track the knowledge of students and tailor their learning experience. This is known as the Knowledge Tracing (KT) problem in the literature. Effectively solving the KT problem would unlock the potential of computer-aided education applications such as intelligent tutoring systems, curriculum learning, and learning materials' recommendation. Moreover, from a more general viewpoint, a student may represent any kind of intelligent agent, including both human and artificial agents. Thus, the potential of KT can be extended to any machine teaching application scenario that seeks to customize the learning experience for a student agent (i.e., a machine learning model). In this paper, we provide a comprehensive and systematic review of the KT literature. We cover a broad range of methods, starting from the early attempts to the recent state-of-the-art methods using deep learning, while highlighting the theoretical aspects of models and the characteristics of benchmark datasets. Besides these, we shed light on key modelling differences between closely related methods and summarize them in an easy-to-understand format. Finally, we discuss current research gaps in the KT literature and possible future research and application directions. Teaching is a vital activity to facilitate the transfer of knowledge. It is well known that one key factor of teaching is the ability of human teachers to track the learning progress of their students. This ability allows human teachers to adjust their teaching pace, materials and methodology to maximize the knowledge growth of each individual student. Over the past 30 years, a variety of online education platforms such as Massive Open Online Courses (MOOCs) [109], intelligent tutoring systems [90], and educational games [67] have emerged to complement and sometimes completely replace conventional education systems. The COVID-19 pandemic, for example, has challenged conventional classroom-based teaching and sped up digital transformation in education systems. To alleviate the disruption of COVID-19, teachers and students around the world had to rapidly adjust to an online education mode. However, while there is a pressing need for teaching using computer technologies, technology-enhanced teaching has also posed new challenges, one of which is how to effectively track the learning progress of a student through their online interaction with teaching materials, a challenge known as the Knowledge Tracing (KT) problem [5, 6, 112]. In a nutshell, KT aims to observe, represent, and quantify a student's knowledge state, e.g., the mastery level of skills underlying the teaching materials. To better understand the KT problem, let us consider the learning activity depicted in Figure 1. It shows an interaction scenario between a student and an Intelligent Tutoring System (ITS) in which the student is given a sequence of questions taken from a question set {q_1, q_2, q_3, q_4} and asked to answer these questions.
During the interaction, the ITS estimates the student's knowledge states over the skills {s_1, s_2, s_3, s_4} (e.g., math skills such as addition, subtraction and multiplication) that are required to answer these questions. However, capturing a student's knowledge state is a challenging task for several reasons: • Each question might require more than one skill, which adds complexity to tracing knowledge states. For instance, as shown by the arrows going from skills to questions in Figure 1, question q_2 requires the skills {s_1, s_3}. It is worth noting that a skill is also referred to as a knowledge component in some previous studies [65]. • Dependency among skills is another important factor to consider when tackling the KT problem. For example, although q_3 requires only skill s_2, skills s_1 and s_4 are prerequisites for skill s_2 according to the dependency graph shown in Figure 1. Thus, estimating the mastery level of the skill required by question q_3 should also consider s_1 and s_4, in addition to skill s_2. • A student's forgetting behavior [30] may result in decaying their knowledge over skills. By modeling forgetting features, skills can be ranked by their susceptibility to forgetting. For example, the bottom of Figure 1 shows that skill s_1 is least affected by forgetting when the latest question q_1 is reached, whereas skill s_2 is the most affected one. Historically, the notion of knowledge tracing was introduced by Anderson et al. in a technical report [5] for cognitive modeling and intelligent tutoring in 1986, which was later published in the Artificial Intelligence journal in 1990 [6]. Since then, many attempts have been made to design machine learning models for solving the KT problem. Early attempts [27, 112] followed Bayesian inference approaches, which usually relied on oversimplified model assumptions (e.g., assuming only one skill) to make the posterior computation tractable. Later, with the rise of classic machine learning methods such as logistic regression models [117], another line of KT research adopted parametric factor analysis approaches, which trace a student's knowledge states and perform answer prediction based on modeling a variety of factors [16, 17, 86], including: (1) aspects about students, such as prior knowledge, learning capacity, or learning rate; (2) aspects about learning materials, such as familiarity, number of previous practices, or difficulty; (3) aspects about the learning environment itself, such as the nature of the learning channel (paper- or computer-based) and the temporal context of the practice time (within an examination period or a regular study period). In addition to these, psychological studies on the learning behavior [75] and forgetting behavior [52] of students have also suggested additional factors to consider when tracing knowledge states, such as the time lapse between a student's different interactions and the number of times learning materials are practiced. It is worth noting that this direction of KT is still active [35, 111] and is considered an alternative to the recent state-of-the-art approaches based on deep learning. Motivated by breakthroughs achieved by deep learning techniques [61], deep learning KT models have emerged rapidly. Piech et al. [89] pioneered this direction of research and revealed the power of deep learning techniques for knowledge tracing.
They proposed a model called Deep Knowledge Tracing (DKT), which applies Recurrent Neural Networks (RNNs) [63] to capture the temporal dynamics in a student's sequence of question-answer interactions and, based on that, to predict the student's answer to a new question. Empirical results showed that DKT outperformed traditional KT models on several benchmark datasets. This attempt highlighted the potential of deep learning models for addressing the KT problem. In recent years, an increasing number of studies have advanced the development of deep learning KT models from different perspectives, including: • Memory structures. Inspired by memory-augmented neural networks [40, 72], deep learning KT models have been extended by augmenting more powerful memory structures, typically a key-value memory, for dynamically capturing knowledge states at a finer granularity, such as the mastery level of each individual skill (e.g., [1, 125]). • Attention mechanisms. Inspired by the Transformer architecture [110] and further developments in natural language processing applications, attention mechanisms have been incorporated into deep learning KT models to capture the relationships among questions and their relevance to a student's knowledge states (e.g., [21, 36, 79, 80, 99]). • Graph representation learning. Inspired by the representational power of graph learning techniques such as Graph Neural Networks [59, 93], deep learning KT models have been equipped with graph learning techniques to leverage the rich structural information of graphs, which can flexibly model relationships among questions and skills (e.g., [78, 108, 120]). • Textual features. Question text may potentially contain a wealth of information, such as the skills required by questions, the difficulty of questions, and the relationships between questions. Several deep learning KT models have leveraged textual features from question text for learning question representations and tracing a student's knowledge states (e.g., [64, 101, 122]). • Forgetting features. Motivated by the learning curve theory [75], a recent trend in developing deep learning KT models is to incorporate forgetting features so that a student's forgetting behavior can be taken into consideration for knowledge tracing (e.g., [2, 20, 77]). These studies facilitate a seamless translation of breakthroughs in deep learning techniques into the KT domain. So far, deep learning KT models have achieved state-of-the-art results on the majority of benchmark datasets for knowledge tracing (a summary of the results obtained by different KT models is presented in Table 7). This paper performs a detailed survey that reviews, summarizes, classifies, and analyzes KT methods from both the traditional machine learning perspective and the recent deep learning perspective. It also presents KT benchmark datasets and applications. The main contributions are as follows: • We highlight the key categories of KT methods and compare their architectures across multiple aspects, including model design, knowledge state representation, assumptions about the relationships between questions and skills, and consideration of students' forgetting behavior. • We present the chronological evolution of each KT category and discuss how each method builds on previous work. • We summarize the characteristics of well-known KT datasets and compare the performance of key KT methods on each dataset by consolidating results from the relevant literature.
• We discuss application areas of KT that are currently not well explored, to help derive future research directions in new venues. Note that a recent KT survey has been presented by Liu et al. [65]. Despite their contributions to the field, our survey differs in depth and topics covered. The most notable differences are the comprehensive coverage of KT models; an important discussion on the differences in representation learning techniques related to key aspects of KT (such as knowledge states, forgetting behaviors, and knowledge components); an in-depth coverage of the KT datasets used by the relevant literature, highlighting their similarities and issues; and an extensive report of the results obtained by different KT models, which allows future and accurate comparisons between them. This paper surveys the knowledge tracing literature to answer the following five research questions: (RQ1) How have traditional machine learning techniques been applied to solve the KT problem? (RQ2) What types of deep learning techniques have been proposed for solving the KT problem? (RQ3) What datasets are available for evaluating KT models, and what are their characteristics? (RQ4) What are the application areas of KT techniques? (RQ5) What are the future research directions for KT? In the following sections, we discuss each of these five research questions in detail, as illustrated in Figure 2. To answer RQ1, we first survey the history of traditional KT techniques by summarizing the relevant studies and analysing their connections in Section 2. Then, we investigate the different types of deep learning techniques proposed in the literature for solving the KT problem with respect to RQ2, which results in a taxonomy of deep learning KT models. For each type of deep learning technique, we discuss the key assumptions and analyze their differences and connections. After that, in Section 3, we conduct a comprehensive analysis of the datasets in terms of data collection, pre-processing, characteristics, and ground truth information, which answers RQ3. For RQ4, we discuss several types of KT applications in Section 4, particularly in relation to how KT techniques can enhance students' personalized learning experience and performance. Finally, with respect to RQ5, we explore future research directions for KT in Section 5, which may enrich the field of study and provide a broad understanding of the opportunities and limitations of existing KT models, and conclude the paper. This section introduces a comprehensive categorization of the KT models according to the related works found in the literature. Generally speaking, there are two broad categories: (1) Traditional Knowledge Tracing Models; and, (2) Deep Learning Knowledge Tracing Models. Traditionally, there are two popular lines of research for knowledge tracing: Bayesian Knowledge Tracing and Factor Analysis Models. Figure 3 provides an overview of the major traditional knowledge tracing models that have been developed in the KT literature. Bayesian Knowledge Tracing (BKT) was motivated by the concepts of mastery learning [23]. Mastery learning assumes that any student can practise a skill to the point of mastering it if two conditions are satisfied: (a) knowledge is appropriately described as a hierarchy of skills; and, (b) learning experiences are structured to ensure that students master skills lower in the hierarchy before those higher up [24]. BKT models often use a probabilistic graphical model, such as a Hidden Markov Model [28] or a Bayesian Belief Network [113], to trace students' changing knowledge states as they practise skills. Central to these models is Bayes' theorem: for two events A and B,

P(A|B) = P(B|A) P(A) / P(B).

In what follows we discuss the standard Bayesian approaches and their variations. • Standard BKT Model The first BKT model was introduced by Corbett and Anderson in 1994 [24].
The proposed model associates a skill with a binary knowledge state: {learned, unlearned}. This model only considers transitions from the unlearned state to the learned state, overlooking forgetting (i.e., the probability of a transition from a learned state to an unlearned state is always zero). Additionally, note that a student may make a mistake while in a learned state or guess correctly in an unlearned state. We refer to this model as the standard BKT model from now on. The four model parameters are: p(L_0), the probability of skill mastery by a student before learning; p(T), the probability of transition from an unlearned state to a learned state; p(S), the probability of slipping by a student in a learned state; and p(G), the probability of guessing correctly by a student in an unlearned state. There are two types of variables in the standard BKT model: (1) binary latent variables, which represent the knowledge states of a given student (i.e., a single variable per skill indicating a learned or unlearned state); and, (2) binary observed variables, which represent how students attempt questions (i.e., a variable per question indicating whether a question is answered correctly or not). Figure 4 shows the four types of model parameters in the standard BKT model. For each skill, there is one set of four corresponding parameters. At each time step t >= 1, the model estimates the probability p(L_t) of skill mastery by a student by:

p(L_t) = Posterior(L_{t-1}) + (1 - Posterior(L_{t-1})) * p(T),     (1)

where Posterior(L_{t-1}) is the posterior probability of being in a learned state given the student's observed attempt, calculated as:

Posterior(L_{t-1}) = p(L_{t-1}) (1 - p(S)) / [p(L_{t-1}) (1 - p(S)) + (1 - p(L_{t-1})) p(G)]   if the attempt is correct;
Posterior(L_{t-1}) = p(L_{t-1}) p(S) / [p(L_{t-1}) p(S) + (1 - p(L_{t-1})) (1 - p(G))]   otherwise.     (2)

Hence, the probability that a student correctly answers a question at each time step is the sum of the probability of either mastering the skill without a "slip" or not mastering the skill but making a correct "guess", as formulated by:

p(C_t) = p(L_t) * (1 - p(S)) + (1 - p(L_t)) * p(G).     (3)

Figure 4a shows the standard BKT model with one skill node. Starting with the prior probability p(L_0) of skill mastery, the latent variable for the skill is transitioned from one time step t-1 to the next time step t based on the probability p(T). The corresponding observed variable represents the answer node, i.e., how a student attempts questions that require the skill, which is based on the probabilities p(S) and p(G).
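To make equations (1)-(3) concrete, below is a minimal sketch of one BKT step in Python. The function names and the illustrative parameter values are our own assumptions, not taken from any cited study; only the update rules mirror the equations above.

```python
def bkt_posterior(p_L, p_S, p_G, correct):
    """Posterior probability of the learned state after one observed attempt (eq. 2)."""
    if correct:
        num = p_L * (1 - p_S)
        den = p_L * (1 - p_S) + (1 - p_L) * p_G
    else:
        num = p_L * p_S
        den = p_L * p_S + (1 - p_L) * (1 - p_G)
    return num / den

def bkt_update(p_L, p_T, p_S, p_G, correct):
    """One BKT step: Bayesian posterior followed by the learning transition p(T) (eq. 1)."""
    post = bkt_posterior(p_L, p_S, p_G, correct)
    return post + (1 - post) * p_T

def bkt_predict(p_L, p_S, p_G):
    """P(correct): mastered without slipping, or unmastered but guessing correctly (eq. 3)."""
    return p_L * (1 - p_S) + (1 - p_L) * p_G

# Illustrative parameter values (assumed, not fitted to any dataset):
p_L, p_T, p_S, p_G = 0.3, 0.2, 0.1, 0.25
for correct in [1, 1, 0, 1]:  # a short toy answer sequence
    print(f"P(correct) = {bkt_predict(p_L, p_S, p_G):.3f}")
    p_L = bkt_update(p_L, p_T, p_S, p_G, correct)
```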
• Individualized BKT Model One limitation of the standard BKT model is that it has no model parameters specific to students. All students are assumed to have the same prior knowledge and the same learning rate for any skill. As a result, the standard BKT model may underestimate the learning performance of above-average students but overestimate the learning performance of below-average students [24]. To alleviate this limitation, several attempts [24, 56, 62, 83, 123] have been made to extend the standard BKT model by introducing student-specific parameters. Corbett and Anderson [24] considered adding individual weights for each student, one weight corresponding to each of the four parameter types in the standard BKT model. Pardos and Heffernan [83] focused on individualizing the prior probability of skill mastery p(L_0) heuristically for each student. Lee and Brunskill [62] also individualized the four parameter types in the standard BKT model for students; nonetheless, the combined effects of skill-specific and student-specific parameters were unexplored. Yudelson et al. [123] introduced an approach for individualizing BKT models that accounts for student differences with respect to two types of model parameters: the prior probability of skill mastery p(L_0) and the probability of transition p(T). The idea was to first define student-specific parameters and skill-specific parameters. Then, the gradients were explicitly computed in terms of both student-specific and skill-specific parameters. The underlying Hidden Markov Model remains unchanged. It turns out that adding a student-specific parameter for p(T) is more beneficial for model accuracy than adding a student-specific parameter for p(L_0). Later, Khajah et al. [56] proposed to extend the standard BKT model by personalizing the guess and slip probabilities p(G) and p(S) based on student ability and problem difficulty. Compared to the standard BKT model, the individualized models can provide a better correlation between actual and expected accuracy among students, leading to more effective decisions or improving the accuracy of predicting student performance. Table 2 summarizes the skill- and student-specific parameters used in the individualized BKT models. In the early years, BKT models assumed that each question requires only one skill and that different skills are independent of each other [24, 62, 83, 123]. Thus, these models cannot handle questions that require multiple skills, nor represent relationships between different skills. To address this limitation, Käser et al. [54] proposed to jointly model multiple skills and dependencies between different skills using a Dynamic Bayesian Network (DBN). They aimed to capture prerequisite skill hierarchies within a single model, e.g., one skill is conditionally dependent on another skill if the former is a prerequisite for mastering the latter. Similar to the standard BKT model, DBN considers the same two types of variables: binary latent variables and binary observed variables. At each time step, a latent variable for each skill is associated with an observed variable. A forgetting probability p(F) is introduced, in addition to the four types of model parameters {p(L_0), p(T), p(S), p(G)} in the standard BKT model. Dependencies between different skills are learnt as the weights w of a log-linear model. Let Φ: S × Y → R^d denote a mapping from a latent space S and an observed space Y to d-dimensional feature vectors, and Z be a normalizing constant. The objective of the log-linear model is to find the model parameters {p(L_0), p(T), p(S), p(G), p(F), w} that maximize the likelihood of the joint probability of s ∈ S and y ∈ Y as formulated below:

p(s, y) = exp(w^T Φ(s, y)) / Z.

Figure 4b shows the Dynamic Bayesian Knowledge Tracing (DBKT) model with three skill nodes (denoted as s_1, s_2 and s_3). As indicated by the directed arrows, the latent variable for skill s_2 depends on the latent variable for skill s_1, and the latent variable for skill s_1 depends on the latent variable for skill s_3. Further, at each time step t, the latent variables for skills depend on their latent variables in the previous time step t-1, while y_t denotes the corresponding observed answer nodes. Factor analysis models are theoretically supported by the Item Response Theory (IRT) [31], which has played a large role in educational assessment and measurement. The key idea is to estimate student performance by learning a function, usually a logistic function, based on various factors in a population of students who solve a set of problems.
It is important to note that, although an item in the original IRT corresponds to a question involving a single skill, later works in this line have been generalized to consider an item that may involve multiple skills. The mapping between items and skills is often represented in the form of a Q-matrix, i.e., an entry in a Q-matrix is 1 if an item involves a skill, and 0 otherwise. The Q-matrix is commonly assumed to be given as side information. • Item Response Theory (IRT) The history of IRT can be traced back to Thurstone's pioneering work in the 1920s [69] and several other works in the 1950s and 1960s [7, 13, 41, 70]. IRT is built upon the following assumptions: (a) the probability that a student correctly answers an item can be formulated as an item response function based on the parameters of the student and the item; (b) the item response function monotonically increases with respect to the ability θ_i of a student i; (c) for a student with the ability θ_i, items are considered conditionally independent. The Rasch model [92] is often referred to as the simplest IRT model, in which the item response function is defined by a one-parameter logistic regression (1PL) model. Let L(·) be a logistic function. By taking into account a difficulty parameter d_j that models the difficulty of an item j, the probability that a student i correctly answers an item j is defined as:

p_{ij} = L(θ_i - d_j).

Several multiple-parameter logistic regression models have also been developed for IRT, e.g., a four-parameter logistic (4PL) model introduced by Barton and Lord [10]:

p_{ij} = g_j + (1 - s_j - g_j) L(a_j (θ_i - d_j)),

where a_j is a discrimination parameter to model how well an item can differentiate students, g_j is a guessing parameter to model the effect of guessing, and s_j is a slipping parameter to model the effect of careless errors (entering through the upper asymptote 1 - s_j). IRT has been extended in many different directions. Wilson et al. [116] proposed Hierarchical IRT (HIRT) and Temporal IRT (TIRT). HIRT exploits structure among questions by assuming that related questions (i.e., questions sharing similar skills) have difficulty parameters drawn from the same distribution. Different questions might vary in difficulty, but questions for trivial skills tend to be easier while ones for difficult skills tend to be harder. TIRT models each parameter in the logistic model (e.g., 4PL) as a time-varying stochastic process such as a Wiener random process [105]. • Additive Factor Model (AFM) The Additive Factor Model [17], which originated from Learning Factors Analysis (LFA) [16], is a logistic regression model under four assumptions: (1) the prior knowledge of students may vary; (2) students learn at the same rate; (3) some skills are more likely to be known than others; and, (4) some skills are easier to learn than others. In this model, a difficulty parameter β_k and a learning rate parameter γ_k are assigned to each skill k. The key idea of AFM is that the probability of answering an item correctly by a student is proportional to an additive combination of the ability of the student, the difficulty of the skills involved in the item, and the amount of learning gained from each attempt. Let KC(j) be the set of skills involved in an item j, which can be obtained from a Q-matrix, and t_{ik} be the number of times that a student i has attempted items involving a skill k. AFM defines the probability that a student i answers an item j correctly as:

p_{ij} = L(θ_i + Σ_{k ∈ KC(j)} (β_k + γ_k t_{ik})).
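As an illustration of the AFM equation above, the following is a minimal sketch of an AFM prediction in Python; the skill names and parameter values are hypothetical assumptions, not fitted values from any cited study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def afm_prob(theta_i, beta, gamma, skills, attempts):
    """AFM: the logit of a correct answer is the student ability plus, for each skill
    of the item, its difficulty/easiness term and its learning rate times practice count."""
    logit = theta_i + sum(beta[k] + gamma[k] * attempts[k] for k in skills)
    return sigmoid(logit)

# Assumed toy parameters for two skills:
beta = {"addition": 0.5, "multiplication": -0.8}    # per-skill difficulty terms
gamma = {"addition": 0.1, "multiplication": 0.3}    # per-skill learning rates
attempts = {"addition": 4, "multiplication": 1}     # prior practice counts t_{ik}
print(afm_prob(theta_i=0.2, beta=beta, gamma=gamma,
               skills=["addition", "multiplication"], attempts=attempts))
```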
• Performance Factor Analysis (PFA) Performance Factor Analysis (PFA) [86] overcomes a limitation of AFM, which ignores the evidence of learning contained in the successful and unsuccessful attempts on items by a student. The key idea of PFA is to discard the student ability parameter used in the previous models and instead count successful and unsuccessful attempts separately. PFA has three parameters for each skill: (1) β_k for the difficulty of a skill k; (2) γ_k for the effect of learning a skill after successful attempts; and, (3) ρ_k for the effect of learning a skill after unsuccessful attempts. Conceptually, γ_k and ρ_k reflect the learning rate for a skill when it is applied successfully and unsuccessfully, respectively. Let s_{ik} and f_{ik} be the number of successful attempts and the number of unsuccessful attempts made by a student i on a skill k, respectively. Then PFA calculates the probability that a student i correctly answers an item j as:

p_{ij} = L(Σ_{k ∈ KC(j)} (β_k + γ_k s_{ik} + ρ_k f_{ik})).

• Knowledge Tracing Machine (KTM) Knowledge Tracing Machine (KTM) was recently proposed by Vie and Kashima [111], which generalizes the Factorization Machine (FM) [104, 106] for student modeling. KTM makes it possible to consider an arbitrary number of factors about students, items, skills, successful and unsuccessful attempts, or extra information about the learning environment, such as using a mobile or a laptop. For n factors, we denote all factors involved in an event by a sparse vector x of length n such that x_i > 0 if a factor i ∈ [1, n] is involved in the event, and x_i = 0 otherwise. Then KTM estimates the probability of a correct answer on an item by a student with an event involving x as follows:

p(x) = L(μ + Σ_{i=1}^{n} w_i x_i + Σ_{1 <= i < j <= n} ⟨v_i, v_j⟩ x_i x_j),

where μ is a global bias, and each factor i is modeled by both a weight w_i ∈ R and an embedding vector v_i ∈ R^d for some dimension d. The first term models the logistic regression of all factors and the second term models pairwise interactions between different factors. When L is a logistic function, KTM includes IRT, AFM and PFA as special cases [111]. Gan et al. [35] proposed the Knowledge Tracing Machine by modeling cognitive item Difficulty and Learning and Forgetting (KTM-DLF), which extends KTM by adding factors related to the forgetting behavior of students. They represented forgetting by the time elapsed since the last successful attempt on the involved skills.
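The following sketch illustrates the KTM scoring function above in Python, using the standard factorization machine identity to compute the pairwise term efficiently. The factor layout and the randomly initialised parameters are assumptions for illustration only, standing in for trained values.

```python
import numpy as np

def ktm_predict(x, mu, w, V):
    """KTM/FM score: global bias + linear term + pairwise embedding interactions.
    x: feature vector (length n), w: weights (n,), V: factor embeddings (n, d)."""
    linear = w @ x
    # FM identity: sum_{i<j} <v_i, v_j> x_i x_j
    #            = 0.5 * (||V^T x||^2 - sum_i ||v_i||^2 x_i^2)
    Vx = V.T @ x
    pairwise = 0.5 * (Vx @ Vx - np.sum((V ** 2).T @ (x ** 2)))
    return 1.0 / (1.0 + np.exp(-(mu + linear + pairwise)))  # logistic link L

rng = np.random.default_rng(0)
n, d = 6, 3                               # assumed: 6 factors, embedding dim 3
x = np.array([1., 0., 1., 0., 1., 0.])    # active factors: e.g. student, item, one skill
mu, w, V = 0.0, rng.normal(size=n) * 0.1, rng.normal(size=(n, d)) * 0.1
print(ktm_predict(x, mu, w, V))
```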
Both Bayesian knowledge tracing and factor analysis models have strengths and weaknesses. We discuss their connections and differences from three aspects: model parameters, model inference, and temporal analysis. • Model parameters: Most recent models in BKT and FAM take into account both student- and skill-specific model parameters. Early BKT models were primarily centered on the four parameters, the prior learning parameter p(L_0), the learning rate parameter p(T), the guess parameter p(G), and the slip parameter p(S) [24], and their student-specific variants. These early BKT models usually assume that there is no forgetting parameter; only some recent works have incorporated one [54]. Most factor analysis models have been developed with similar or more flexible model parameters than BKT models. In particular, recent works such as KTM [111] consider a wide range of factors, enabling a flexible way to incorporate side information into student modelling. • Model inference: To keep the Bayesian inference tractable, BKT models typically assume a first-order Markov chain when making inferences based on a sequence of past question-answering history, i.e., only considering the most recent observation. This assumption, however, limits their ability to model complex dynamics in student learning behaviors. Some recent dynamic BKT models such as DBN [54] are often computationally intractable, and thus they must trade off prediction accuracy against computational efficiency. On the other hand, factor analysis models usually do not explicitly make inferences about the knowledge states of a student (e.g., deciding whether a student has achieved a certain level of skill mastery by tracing knowledge states). Instead, they aim to estimate other model parameters, such as the learning rate. • Temporal analysis: BKT models essentially deal with a sequence prediction problem based on the history of student learning. In contrast, factor analysis models do not consider the order in which a student's answers to questions are observed. For example, given two questions and their corresponding answers from a student, whether one question is answered before the other is not important for factor analysis models. Nonetheless, by incorporating extra temporal features of student learning behaviors, factor analysis models can be enhanced to analyze temporal aspects of student learning. A number of attempts have been made to leverage the best of both worlds of Bayesian knowledge tracing and factor analysis models. For example, [47, 57, 58] extended IRT using Bayesian inference to customize the estimation of question difficulty based on observations of each student. Further work has been done by incorporating factors that reflect the characteristics of individual students [28, 84], or the characteristics of specific items of assessment within skills [85]. Inspired by the success of deep learning [38, 39, 61, 95, 102], recent research on knowledge tracing has applied deep learning techniques. Figure 5 presents a taxonomy of deep learning knowledge tracing models. A knowledge tracing task is typically modeled as a sequence prediction problem from a machine learning perspective. Let Q = {q_1, . . . , q_{|Q|}} be the set of all distinct questions in a dataset. Each q ∈ Q may have a different level of difficulty, which is not explicitly provided. When a student interacts with the questions in Q, a sequence of interactions X = ⟨x_1, x_2, . . . , x_{t-1}⟩ undertaken by the student can be observed, where x_i = (q_i, a_i) consists of a question q_i and an answer a_i ∈ {0, 1}. Here, a_i = 0 means that q_i is incorrectly answered and a_i = 1 means that q_i is correctly answered. Definition 2.1. Given a sequence of interactions X that contains the previous question answering of a student, the knowledge tracing problem is to predict the probability of the student correctly answering a new question q_t at the time step t, i.e., p_t = P(a_t = 1 | q_t, X). Deep Knowledge Tracing (DKT) [89] pioneered the use of deep learning for knowledge tracing. It employs a Recurrent Neural Network (RNN) [63] and a Long Short-Term Memory (LSTM) [46] to predict the probability of correctly answering a question at each time step. A sequence of hidden states ⟨h_1, h_2, . . . , h_t⟩ is computed, which encodes the sequence information obtained from previous interactions. At each time step t, the model calculates the hidden state h_t and the student's response y_t as follows:

h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h),
y_t = σ(W_yh h_t + b_y),

where tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}) and σ(x) = 1/(1 + e^{-x}) are activation functions, W_hx, W_hh and W_yh are weight matrices, and b_h and b_y are bias vectors. Despite its promising performance, DKT has several limitations. First, it assumes only one hidden KC (i.e., a skill) in a student's knowledge state h_t. Second, it cannot model the relationships among multiple KCs. Third, it assumes that all questions are equally related to each other, which may not hold in many scenarios, as some questions may be more relevant to each other than to the rest of the sequence.
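As an illustration of the DKT recurrence above, here is a minimal NumPy sketch of the forward pass using a vanilla RNN cell (DKT is also commonly implemented with an LSTM). The dimensions and the randomly initialised weights are assumptions, standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, hidden = 4, 8
in_dim, out_dim = 2 * n_questions, n_questions   # one-hot (question, answer) input

# Randomly initialised weights stand in for trained parameters.
W_hx = rng.normal(size=(hidden, in_dim)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_yh = rng.normal(size=(out_dim, hidden)) * 0.1
b_h, b_y = np.zeros(hidden), np.zeros(out_dim)

def one_hot_interaction(q, a):
    """Encode interaction (q, a) as a 2|Q|-dimensional one-hot vector, as in DKT."""
    x = np.zeros(in_dim)
    x[q + a * n_questions] = 1.0
    return x

h = np.zeros(hidden)
for q, a in [(0, 1), (2, 0), (1, 1)]:            # a toy interaction sequence
    x = one_hot_interaction(q, a)
    h = np.tanh(W_hx @ x + W_hh @ h + b_h)       # hidden state = knowledge state
    y = 1.0 / (1.0 + np.exp(-(W_yh @ h + b_y)))  # P(correct) for every question
    print(np.round(y, 3))
```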
Thus, various attempts have been made to extend DKT with the aim of enhancing the model capacity for tackling the KT problem. Below, we review the related work in the following areas of extension. A number of KT models have extended DKT [89] to address its limitations. For example, Xiong et al. [119] proposed Extended-Deep Knowledge Tracing, which extends DKT by adding auxiliary student features, such as previous knowledge, question answering rates and time spent on learning and practice, as well as exercise features, such as textual information, question difficulty, skill hierarchies and skill dependencies. A variant of DKT, called DKT+ [121], was proposed to augment the original DKT loss function with two additional regularization terms to address the limitations in DKT's ability to reconstruct an answer input and to reduce the inconsistency of answer prediction for questions sharing similar KCs. Minn et al. [73] proposed an extension of DKT, named Deep Knowledge Tracing with Dynamic Student Classification (DKT-DSC), which uses K-means to cluster student profiles into groups based on their performance over the KCs and dynamically updates the current cluster information over time as performance changes. To trace complex KCs learned by students, several works have extended DKT by augmenting an external memory structure, inspired by memory-augmented neural networks [40]. In particular, following the Key-Value Memory Network (KVMN) [72], a key-value memory has been employed to represent a knowledge state, which has more representational power than the hidden variable used in DKT. Such a key-value memory consists of two matrices: key and value. The key matrix stores the representations of KCs and the value matrix stores the student's mastery level of each KC. Below, we discuss two popular key-value memory networks for knowledge tracing. Dynamic Key-Value Memory Network (DKVMN) [125] has augmented DKT with two memory matrices: key and value. To trace how the knowledge state of a student evolves over time, unlike KVMN in which both key and value matrices are static [72], DKVMN designs the value matrix to be dynamic while keeping the key matrix static. Figure 6a shows the model architecture of DKVMN, where M^k ∈ R^{N×d_k} is the key matrix and M^v_t ∈ R^{N×d_v} is the value matrix at the time step t. It is assumed that there are N latent KCs underlying all questions in a learning task. For a question q_t at the time step t, a correlation weight w_t is computed, which represents the correlation between the question and the underlying latent KCs stored in the key matrix M^k. The model first retrieves a student's knowledge state with regard to the question q_t from the value matrix M^v_t, calculated as:

r_t = Σ_{i=1}^{N} w_t(i) M^v_t(i).

Then, the student's response to the question q_t is predicted based on the retrieved knowledge state. After the student answers the question q_t, the value matrix is updated to reflect the knowledge growth of the student after working on q_t. Sequential Key-Value Memory Network (SKVMN) [1] builds on the key-value memory of DKVMN and additionally models sequential dependencies among interactions when reading from and writing to the memory.
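The read operation of DKVMN described above can be sketched as follows; the dimensions and the random matrix contents are assumptions, and the write/update step and embedding layers are omitted for brevity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d_k, d_v = 5, 8, 16                # assumed: 5 latent KCs, key/value dimensions
M_k = rng.normal(size=(N, d_k))       # static key matrix (KC representations)
M_v = rng.normal(size=(N, d_v))       # dynamic value matrix (per-KC mastery levels)

k_t = rng.normal(size=d_k)            # embedding of the current question q_t
w_t = softmax(M_k @ k_t)              # correlation weights over the N latent KCs
r_t = w_t @ M_v                       # read content: knowledge state w.r.t. q_t
print(np.round(w_t, 3))
```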
Following the Transformer architecture [110] and further developments in natural language processing applications, several works [21, 36, 79, 80, 99] have attempted to incorporate an attention mechanism into KT models. Although the attention mechanisms introduced by these works vary, their key ideas are similar, i.e., to learn the attention weights of questions in a sequence of interactions in a way that reflects the relative importance of these questions for predicting the probability of correctly answering the next question. This mitigates one limitation of DKT, which treats all questions in a sequence of interactions as equally important. In what follows, we discuss the main attentive knowledge tracing models. Table 3 briefly summarizes these models, comparing the attention mechanism each model uses (e.g., multi-head self-attention for SAKT, and relational multi-head self-attention for RKT). • Self-Attentive Knowledge Tracing (SAKT) Self-Attentive Knowledge Tracing (SAKT) [79] was the first to add an attention mechanism to KT models. It uses the scaled dot-product attention mechanism proposed by Vaswani et al. [110] to learn attention matrices using multiple attention heads. Specifically, each attention matrix contains relative weights from a representative subspace, which indicate the importance of questions in the past interactions for predicting a student's answer to the current question. Then, attention matrices from different representative subspaces are sent to a feed-forward network for predicting student performance. • Attentive Knowledge Tracing (AKT) Attentive Knowledge Tracing (AKT) was proposed by Ghosh, Heffernan, and Lan [36]. AKT differs from SAKT in its attention mechanism, called monotonic attention (i.e., a modified, monotonic version of the scaled dot-product attention mechanism [110]), which reduces attention weights for questions in a sequence of interactions in proportion to their time distance, at an exponential decay rate. The exponential weight decay is meant to capture the forgetting effect in a student's memory over time. In addition, an embedding representation was proposed that, following the Rasch model [92], takes into account a parameter controlling how far a question deviates from the knowledge component it involves. • Separated Self-AttentIve Neural Knowledge Tracing (SAINT) Separated Self-AttentIve Neural Knowledge Tracing (SAINT), proposed by Choi et al. [21], differs from AKT and SAKT in that it further applies an encoder-decoder model along with the scaled dot-product attention mechanism, as in the original architecture of the Transformer [110]. Specifically, SAINT separates a sequence of interactions by a student into a question embedding sequence and a response embedding sequence, which are then sent to the encoder and the decoder as input, respectively. The encoder and decoder are combinations of multi-head attention networks with the scaled dot-product attention mechanism [110]. Recently, SAINT was extended by adding two time-related features into the response embedding sequence: the elapsed time taken by a student to answer each question, and the lag time between two consecutive learning interactions. This variant was named the SAINT+ model [99]. • Relation-Aware Self-Attention for Knowledge Tracing (RKT) Relation-Aware Self-Attention for Knowledge Tracing (RKT) was proposed by Pandey and Srivastava [80]. Similar to SAKT and SAINT, RKT employs the scaled dot-product attention mechanism proposed by Vaswani et al. [110] to learn attention weights using multiple attention heads. However, unlike the other attention-based KT models, RKT combines attention weights with relation coefficients, which are obtained from exercise relation modeling and forgetting behavior modeling. For the exercise relation modeling, it leverages the textual content of questions to represent questions and estimate the relations between questions in a sequence of past interactions. For the forgetting behavior modeling, similar to AKT, RKT uses an exponential decay to account for a student's forgetting behavior over time.
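The following sketch illustrates the shared idea behind these models: causal scaled dot-product attention over past interactions, here combined with an AKT-style exponential time decay. The decay form is a simplification of AKT's monotonic attention, and all dimensions, timestamps and values are assumed for illustration.

```python
import numpy as np

def decayed_attention(Q, K, V, timestamps, theta=0.5):
    """Scaled dot-product attention whose weights are damped exponentially with
    the time distance between interactions, mimicking a forgetting effect."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (T, T) raw scores
    dist = np.abs(timestamps[:, None] - timestamps[None, :])
    scores = scores - theta * dist                       # exp(-theta*dist) after softmax
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                               # causal: no attending to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, d = 4, 8                                   # assumed sequence length and model dim
Q = K = V = rng.normal(size=(T, d))           # self-attention over interaction embeddings
t = np.array([0., 1., 5., 6.])                # assumed interaction timestamps
print(decayed_attention(Q, K, V, t).shape)
```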
In KT tasks, various relational structures often exist, for example, similarity between KCs, dependency between KCs, and correspondence between questions and their KCs. To capture such relational structures for better addressing the KT problem, a recent trend is to explore the power of graph representation learning techniques such as Graph Neural Networks (GNNs). Table 4 briefly summarizes several main graph-based knowledge tracing models, and we discuss each of them separately. • Graph-based Knowledge Tracing (GKT) Graph-based Knowledge Tracing (GKT), proposed by Nakagawa, Iwasawa, and Matsuo [78], incorporates a graph, where nodes represent KCs and edges represent the dependency relations between KCs, as a relational inductive bias. They reformulated the KT problem as a time-series node-level classification problem and solved it using standard graph learning techniques such as message-passing GNNs [94]. Since such a graph is not explicitly given in KT tasks, the authors proposed two approaches to construct it from a sequence of interactions by a student: (1) a statistics-based approach, which constructs a graph based on statistics such as how many times one KC was answered after another KC was answered (a minimal sketch of this construction is given at the end of this subsection); and, (2) a learning-based approach, which learns a graph through performance optimization in an end-to-end manner. • Graph-based Interaction Knowledge Tracing (GIKT) Graph-based Interaction Knowledge Tracing (GIKT), proposed by Yang et al. [120], leverages the relation between questions and KCs, represented as a graph, to learn useful embeddings for answer prediction. Different from GKT, which implicitly assumes that each question corresponds to one KC, GIKT assumes that one KC may be related to many questions and one question may correspond to more than one KC. Thus, GIKT can use a GNN to aggregate the embeddings of questions and KCs based on their relations in the graph, and it sends the embedding of each question in a sequence of interactions to an RNN model to predict a student's answer to the next question. • Structure-based Knowledge Tracing (SKT) Structure-based Knowledge Tracing (SKT), proposed by Tong et al. [108], aims to capture multiple relations among KCs, such as the similarity relation and the prerequisite relation. Similar to GKT, SKT assumes that each question corresponds to one KC. However, instead of the single relation between KCs captured by GKT, SKT exploits multiple relations between KCs. Further, SKT supports information propagation to jointly model the temporal and spatial effects when summarizing graph data. These two kinds of graph embeddings are combined at each time step and fed to a recurrent model to predict the correct answer by a student.
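As a concrete illustration of GKT's statistics-based graph construction mentioned above, the sketch below builds a row-normalized KC transition matrix from an interaction sequence; the toy sequence and KC count are assumptions.

```python
import numpy as np

def kc_transition_graph(kc_sequence, n_kcs):
    """Build a directed KC graph from transition counts in an interaction sequence:
    edge (i, j) reflects how often KC j is practiced right after KC i."""
    counts = np.zeros((n_kcs, n_kcs))
    for prev, nxt in zip(kc_sequence, kc_sequence[1:]):
        counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Row-normalize, leaving rows with no outgoing transitions as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# A toy sequence of practiced KC ids (assumed data):
seq = [0, 0, 1, 2, 1, 2, 2, 0]
print(kc_transition_graph(seq, n_kcs=3))
```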
Until now, the deep KT models that we have discussed mainly focus on a student's interactions with questions in a sequence to predict the probability of the student correctly answering the latest question. Yet, they do not make much use of the textual features of the questions themselves. Text-aware KT models are motivated by leveraging the textual features of questions to enhance performance on KT tasks. • Exercise-Enhanced Recurrent Neural Network (EERNN) Exercise-Enhanced Recurrent Neural Network (EERNN), proposed by Su et al. [101], is a text-aware KT model that predicts the probability of correctly answering a given question. The model uses a bi-directional LSTM module to extract the representation (i.e., a vector) of each question from the question's text and then traces a student's knowledge states by combining it with the representations of the previously answered questions using another LSTM module. Two variants of EERNN were developed: EERNNM and EERNNA. The EERNNM variant assumes that a sequence of interactions satisfies the Markov property, i.e., the answer prediction for the next question only depends on the latest observed knowledge state; thus it only considers the last hidden state. The EERNNA variant considers all the previous knowledge states and combines them through an attention mechanism. Later, Yin et al. [122] further extended the work of Su et al. [101] by leveraging a pre-training task to learn question representations. The authors followed a masked language model (MLM) objective [71] and showed that this pre-training step could further enhance the model's performance compared to the original model. • Exercise-Aware Knowledge Tracing (EKT) Exercise-Aware Knowledge Tracing (EKT) [64] extends EERNN to incorporate the information of multiple KCs during answer prediction, where a student's knowledge state is represented by a knowledge state matrix, rather than a knowledge state vector. Specifically, the model uses a memory network to quantify how much each question affects the student's mastery of multiple KCs during a sequence of interactions. In addition to the above text-aware KT models, other types of KT models, such as Relation-Aware Self-Attention for Knowledge Tracing (RKT) [80] and Hierarchical Graph Knowledge Tracing (HGKT) [107], also extract features from the textual information of questions for learning question representations. Psychological studies of learning [52, 87, 97] showed that forgetting is an important aspect to consider for an accurate estimation of a student's knowledge state. This is because the knowledge mastery level of a student tends to decline at an exponential rate over the time since the last practice of the relevant questions. From an experimental psychology perspective, Hermann Ebbinghaus [30, 76] studied the forces that affect memory retention, leading to the formulation of what is currently known as the learning curve theory [75]. Two effects reflecting these forces on memory retention are the forgetting effect and the learning effect. Modeling the forgetting effect is one of the major challenges that the KT literature has aimed to tackle. Traditional KT models have attempted to incorporate forgetting behavior by adding features such as the number of past trials or the lag time since the previous interaction [55, 88, 91, 98]. In recent years, several deep learning KT models have been developed to take a student's forgetting behavior into consideration when tracing knowledge states. Nagatani et al. [77] proposed to extend the Deep Knowledge Tracing (DKT) model [89] by adding sequence-related forgetting features. These features include: (1) the number of times a student has answered questions with the same KC up to the current point in time; (2) the time lapse since the last interaction on a question with the same KC; and, (3) the time lapse since the last interaction on any question, regardless of its related KC. The first feature reflects the learning effect while the other two features reflect the forgetting effect.
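A minimal sketch of extracting the three forgetting-related features described above from a timestamped interaction sequence; the data layout, KC names and timestamps are assumptions for illustration.

```python
def forgetting_features(interactions):
    """For each interaction (kc, timestamp), compute: (1) the number of past
    attempts on the same KC, (2) time since the last same-KC attempt, and
    (3) time since the last attempt on any question. None marks 'no history'."""
    counts, last_kc_time, last_time = {}, {}, None
    feats = []
    for kc, ts in interactions:
        feats.append((
            counts.get(kc, 0),
            ts - last_kc_time[kc] if kc in last_kc_time else None,
            ts - last_time if last_time is not None else None,
        ))
        counts[kc] = counts.get(kc, 0) + 1
        last_kc_time[kc], last_time = ts, ts
    return feats

# Toy (kc, timestamp-in-minutes) sequence:
print(forgetting_features([("frac", 0), ("frac", 30), ("geom", 45), ("frac", 200)]))
```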
Different from previous traditional KT models, which use forgetting features only with regard to questions with the same KC [55, 88, 91, 98], this work considers a student's interactions over the whole sequence so as to model more complex forgetting behavior. • Knowledge Proficiency Tracing (KPT) Knowledge Proficiency Tracing (KPT), proposed by Chen et al. [20], is a probabilistic matrix factorization model that leverages prior information for knowledge tracing. Specifically, two kinds of priors are considered in this model: (1) question priors: the model uses a Q-matrix, marked by experts, to depict the relationship between questions and KCs for generating question representations; and, (2) student priors: the model captures the changes in a student's knowledge state over time by jointly applying both the learning curve and forgetting curve theories. The learning and forgetting factors are designed based on the assumption that a student's current knowledge state is mainly influenced by two underlying causes: (a) the more exercises she does, the higher the level of the related knowledge state she will attain; and, (b) the more time passes, the more knowledge she will forget. To further improve the predictive performance, an improved version of KPT, called Exercise-correlated Knowledge Proficiency Tracing (EKPT) [48], was later developed, which incorporates the connectivity among questions over knowledge concepts into the probabilistic modeling. • HawkesKT Inspired by the Hawkes process [44], Wang et al. [114] proposed HawkesKT, a model that uses a point process to adaptively model temporal cross-effects in KT. It assumes that the mastery of a KC by a student is affected not only by previous interactions on questions of the same KC, but also by interactions on other questions (cross-effects). Further, the model assumes that cross-effects caused by different previous interactions may also have different temporal evolutions on the mastery of different KCs. Although cross-effects all decay with time, their decay rates differ from each other because some KCs may be easier to forget than others. • Deep Graph Memory Network (DGMN) Deep Graph Memory Network (DGMN) [2] is a hybrid KT model that combines graph neural networks with memory for forgetting-aware knowledge tracing. The model aims at modeling the forgetting behavior over a KC space, which has the advantage of capturing indirect relationships between questions. DGMN builds a dynamic graph from a knowledge state memory to capture relationships across KCs. Given a sequence of interactions, DGMN uses an attention mechanism to associate questions with their relevant KCs. Then, it calculates forgetting features over the sequence, and fuses question embeddings, KC graph embeddings, and forgetting features using a gating mechanism. The gating output is used to predict the probability of answering the next question correctly. Deep learning KT models have demonstrated their great potential in solving the KT problem. Below, we discuss several key aspects that are crucial to consider in their designs. • Knowledge state: One fundamental assumption underlying each deep learning KT model is whether a knowledge state is considered over a single KC or multiple KCs. Accordingly, modeling a student's knowledge states based on the student's mastery level of KCs is an important task in designing deep learning KT models.
Generally, from early works such as DKT, which uses a hidden state to model knowledge states over a single KC, to later memory-augmented KT models (e.g., DKVMN and SKVMN) and text-aware KT models (e.g., EKT), which use matrices to model knowledge states over multiple KCs, a trend in deep learning KT models is to develop mechanisms that are expressive enough to dynamically capture knowledge state representations over complex KCs. • KC dependencies: In a KT task, each question is assumed to be associated with a single KC or multiple KCs, which is often provided as prior knowledge, such as a Q-matrix. One main challenge faced by deep learning KT models is to discover the dependencies among different KCs, for example, when one KC requires several other KCs as prerequisite skills. To address this challenge, two lines of research have been explored in the KT literature: (1) using an attention mechanism to learn how questions are related to each other in terms of their required KCs; and, (2) using a graph-based learning model, such as graph neural networks, to learn the relationships between KCs or between questions according to their required KCs. • Feature augmentation: To improve model performance on KT tasks, additional features, such as temporal features relating to forgetting behavior and textual features relating to question texts, have been leveraged by a number of deep learning KT models in recent years. On one hand, augmenting additional features can usually lead to more accurate prediction of student learning performance; on the other hand, the augmentation of such features depends on their availability in databases, thus limiting their applicability to specific KT applications. Table 5 summarizes the descriptive characteristics of the main knowledge tracing models (including IRT, HIRT and TIRT [116], LFA [16], AFM [17], PFA [86], KTM [111] and KTM-DLF [35]), where HMM is an abbreviation for Hidden Markov Model [28], BN for Bayesian Network [113], DBN for Dynamic Bayesian Network [54], LR for Logistic Regression [117], FM for Factorization Machine [104, 106], GNN for Graph Neural Network [94], FFN for Feed-Forward Network [110], KVMN for Key-Value Memory Network [72], ED for Encoder-Decoder model [110], MSA for Multi-head Self-Attention mechanism or variants [110], and AM for Attention Mechanism [8, 110]. This section presents an overview of the benchmark datasets used in the literature to support the evaluation of KT models. All publicly available datasets were downloaded, inspected, and the relevant information reported. Table 6 lists the datasets and provides general information such as student interactions, the number of questions, and data availability. More details about the datasets are presented below. The ASSISTments datasets [33, 82] contain longitudinal data collected from the free online tutoring ASSISTments platform 1. Table 7 shows that the ASSISTments datasets are the most popular datasets used to benchmark KT models and the ones containing the most questions in total. Despite the popularity of this dataset, its original version is not reliable, as discussed in [125], and the updated 'skill-builder' 3 version is preferred, as it fixes data modeling issues and removes duplicated records. It is also worth mentioning that results obtained with the different versions of this dataset (or with duplicated records) are often reported in the literature, but they should not be directly compared to other approaches [116].
• ASSISTments2012 4: This is the largest version of the ASSISTments datasets, consisting of data collected over one year (from Sept 2012 to Oct 2013). Although the ASSISTments team report that the dataset contains approximately 10 million 'exercises', the available dataset consists of 179,999 distinct questions answered by 46,674 students, resulting in 6,123,270 interactions. We note that the vast majority (126,908) of the questions do not have any of the 265 KCs associated with them. The lack of questions annotated with KCs may explain the overall lower performance of the KT models when applied to this dataset (see Table 6). On average, this dataset has 298.17 answers per question, placing it second in terms of the answers-per-question ratio (only lower than the ASSISTments2015 dataset). • ASSISTments Challenge: Released as part of a data mining competition, this dataset contains the most descriptive information among the ASSISTments datasets. • Statics2011: This dataset is available upon request and contains students' interactions with questions related to the Engineering Statics course 7 taught at Carnegie Mellon University during Fall 2011 [60]. The original dataset contains 361,092 interactions, 335 students and 1,224 questions. In the KT literature, this dataset is often preprocessed [125], resulting in 1,223 distinct questions answered by 333 students over 85 KCs. After preprocessing, the number of students' interactions is almost halved to 189,297. This preprocessing was justified by the large number of interactions without information on whether the questions had been correctly answered. The preprocessing considers the concatenation of the attributes 'problem name' and 'step name' and keeps only the interactions with a valid first attempt. On average, this dataset has 568.45 answers per question. • Junyi: This dataset 8 was collected between November 2010 and March 2015 from the Junyi Academy [19], an e-learning platform in Taiwan. The original dataset contains 25,925,992 interactions, 247,606 students, 722 distinct questions, and 41 KCs covering a number of topics in math. However, applying the same preprocessing steps as for the previous datasets (i.e., keeping only interactions where no hints were given to help solve the questions and only students who attempted each question once), the number of interactions drops to 21,571,469 (a reduction of ~17%), the number of students to 220,441 (~11%), and the number of distinct questions to 716 (<1%). Finally, on average, this dataset has 97.85 answers per question. Although this dataset has been commonly used in the KT literature, the performance reported in some of the works cannot be directly compared. This is because these works use different subsets or preprocessing techniques [1, 80, 108]. Further, note that an updated version of the Junyi dataset is available on Kaggle 9 with data collected from August 2018 to August 2019. This version contains 11,468,379 interactions, 25,649 students, and 1,701 distinct questions where no hints were given and students attempted each question only once. • Synthetic: This synthetic dataset was proposed by Piech et al. [89]; it simulates virtual students answering the same sequence of questions over a set of KCs in a controlled environment 10. The dataset is divided into two subsets: training and testing. Each subset contains 50 distinct questions, each associated with a single KC and a difficulty level. In total, each question is answered by 4,000 virtual students, resulting in 200,000 interactions.
The classic Item Response Theory [34] was used to create the interactions and simulate students' learning over time [89, 121]. This dataset provides a standard format and does not require preprocessing steps such as removing duplicates or inferring the sequence of questions answered by a student, potentially allowing direct comparison between different KT models. • KDDcup: This dataset was presented at the KDDcup 2010 Educational Data Mining challenge 11 [100] and contains 13-14 year old students' responses to Algebra questions from 2005 to 2007, extracted from the intelligent tutoring system called "The Cognitive Tutors" developed by Carnegie Learning Inc. in the US. The dataset is split into three subsets: Algebra 2005-2006, Algebra 2006-2007, and Bridge to Algebra 2006-2007. • EdNet: EdNet 12 is a hierarchical dataset composed of four subsets identified by the ids 'KT1', 'KT2', 'KT3' and 'KT4', each containing different types of student activities. 'KT1', for example, contains question-response pairs similar to other datasets. The main difference is that some questions in this dataset are organized in bundles (a set of questions that must be completed altogether). This subset contains over 95,293,926 interactions, 13,169 questions, 784,309 students, and 188 KCs. Unlike the other datasets, 'KT2' contains the actions of the users during question-solving activities. For example, it records the final submission and the student's decision-making (alternating choices) before submitting the final answer. This subset has 56,360,602 interactions and 297,444 students. Besides the actions in 'KT1' and 'KT2', the 'KT3' subset includes information about how students interact with learning activities to answer a question (e.g., watching a lecture); it covers more KCs (293) and contains 89,270,654 interactions. 'KT4' is the most comprehensive subset, containing every action recorded by the EdNet system, including students' purchases (e.g., course purchases). This subset contains 131,441,538 interactions in total. Overall, the EdNet dataset series incrementally provides richer information about student activities and behaviors. The dataset was collected over two years from the intelligent online tutoring platform named Riiid TUTOR 13, dedicated to practicing for the Test of English for International Communication (TOEIC) [22] in South Korea. The variety of recorded behaviors and the large number of data points are unique aspects of this dataset. Table 7 presents the results obtained by several KT models on the aforementioned datasets. The results are reported using AUC-ROC, a traditional performance measure for binary classification models. The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR), while the area under the curve (AUC) reports how well a KT model can distinguish between correct and incorrect answers. The AUC-ROC ranges from 0 to 1, where 0.5 indicates an uninformative classifier (random guesses) and 1 a perfect classifier. It is important to note that the results cannot be directly compared if experimental settings are not standardized. As discussed in [116], results obtained without removing duplicate interactions (a preprocessing step) may be inaccurate and appear inflated. The existence of duplicates is also acknowledged in the KDDcup dataset 14. We reported in the previous sections the raw numbers of interactions, students, KCs, etc., and, whenever different, the totals after removing duplicates and discarding noisy data.
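For reference, AUC-ROC can be computed with scikit-learn as sketched below; the labels and scores here are placeholders, not values from Table 7.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder labels and model scores; in a KT experiment, y_true holds the
# observed answers (1 = correct) and y_score the predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.3, 0.7, 0.6, 0.4, 0.8, 0.5, 0.2])
print(f"AUC-ROC: {roc_auc_score(y_true, y_score):.3f}")  # 1.0 = perfect, 0.5 = random
```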
[Table 7: AUC-ROC results (80%-20% train-test splits) for BKT [24], IRT [41], HIRT [116], AFM [17], PFA [86], KTM [111], KTM-DLF [35], DKT [89], DKT+ [121], DKT-DSC [73], SAKT [79], and AKT [36] across the benchmark datasets, with values as reported in [21, 35, 36, 73, 77, 78, 79, 80, 89, 99, 108, 111, 114, 116, 120, 121].]
The lack of accurate and descriptive information about the attributes in the datasets also hinders experiments. An example is the ASSISTments datasets, where the terminology used is confusing 15 or lacking 16 . This may explain the different AUC values reported for the same approach-dataset pairs in Table 7. The sequence of students' interactions is also an important component of a KT task that is affected by the quality of the datasets. Due to noise in the data (e.g., incorrect timestamps, null values, etc.), the sequence of interactions may not be correctly extracted from the datasets, which may in turn impact the performance of the proposed KT models. For example, after inspecting the datasets, we identified timestamp errors in Algebra 2006-2007, which may explain its low adoption in comparison to the other two Algebra datasets. The fact that the benchmark datasets are often used for tasks other than the KT problem also hinders the correct use and interpretation of the datasets in the KT context. The works in [36, 73, 79, 114, 120, 121, 125], for example, report different numbers of knowledge components for the same datasets, and the work in [36] discusses how different experimental settings (the association between KCs and questions) may impact the reported results. The data model and file format chosen to represent the datasets may also contribute to misinterpretation of the data. For example, the data is often extracted from relational databases and stored as comma-separated values (CSV), compressing information into a single attribute using non-standard characters; e.g., the KDDcup dataset uses double tilde characters ('~~') to assign multiple KCs to a question. Given the hierarchical and relational nature of the data, XML and JSON are more suitable and explicit formats.
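To illustrate the preprocessing concerns discussed above (duplicate removal, first-attempt filtering, and parsing '~~'-separated KC lists), here is a minimal pandas sketch; all column names ('attempt_count', 'hint_count', 'problem_name', 'step_name', 'kc_list') are hypothetical stand-ins, as the actual attribute names vary across datasets.

```python
import pandas as pd

# Hypothetical schema; real attribute names differ across benchmark datasets.
df = pd.read_csv("interactions.csv")

# Drop exact duplicate interactions, which [116] reports as inflating results.
df = df.drop_duplicates()

# Keep only valid first attempts with no hints, mirroring the preprocessing
# described for the Statics2011 and Junyi datasets.
df = df[(df["attempt_count"] == 1) & (df["hint_count"] == 0)]

# Statics2011-style question id: concatenate 'problem name' and 'step name'.
df["question_id"] = df["problem_name"] + "::" + df["step_name"]

# KDDcup-style multi-KC attribute: split the double-tilde-separated KC list
# into one row per (interaction, KC) pair.
df["kc"] = df["kc_list"].str.split("~~")
df = df.explode("kc")
```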
Another critical aspect is the lack of more diverse datasets. The existing public datasets often come from a specific domain (mainly math) and a specific region (e.g., the ASSISTments datasets from the US, and EdNet, the most recent publicly available dataset, from South Korea). Most of the datasets do not provide demographic information, so gender-based or similar analyses cannot be performed. Finally, the benchmark datasets have been updated over the years, and the version used by a KT model is not readily identifiable, which can compromise direct comparisons. A version control mechanism for the publicly available datasets would help keep track of every change made to a dataset over time and allow for consistent and comparable results.
This section explores possible application areas that can benefit from KT models. We broadly divide these application areas into four categories: (i) Recommender Systems; (ii) Learning Provision and Quality Assurance; (iii) Interactive Learning; and (iv) Learning to Teach.
An application area of KT that comes directly to mind is online education systems, whose main objective is to provide an effective learning experience for their students. Tracing the knowledge state of a student makes it possible to tailor the learning experience to the student's capabilities and skills. This can be achieved by recommending learning materials (e.g., lectures, labs, and/or exercises) based on the learnt knowledge state of a student. Thus, the aim of such online education systems is twofold: (1) to estimate the knowledge state of a student using a KT model; and (2) to recommend learning materials conditioned on the knowledge state using a recommendation model [14]. Below are some examples of applications that have been recently studied.
A graph-based recommendation method has been proposed by Chanaa and Faddouli [18] to help an educational instructor segment students into groups based on their knowledge states and recommend the most appropriate kinds of exercises for each group. More specifically, the instructor first selects a specific knowledge component (KC); the model then constructs a dynamic knowledge graph from historical practice information, with student knowledge vectors as nodes and edges representing mastery-level similarities around the selected KC. This graph is clustered into node groups, and for each group a shared embedding is constructed using GNNs to produce the final recommendations. Cai et al. [15] followed an interactive recommendation approach in which a reinforcement learning recommender agent selects learning materials to recommend based on a reward signal calculated from the progress in the knowledge state estimated by a KT model. Huang et al. [49] proposed an interactive educational video recommender model that follows a multi-objective reward setup. The authors designed three reward functions to reflect three main aspects of online education systems: a reviewing reward for recommending videos about KCs on which a student previously performed poorly, a smoothing reward for recommending videos whose difficulty increases gradually, and an engagement reward for recommending videos about KCs a student has recently started to master. The recommender agent follows a reinforcement learning design, with the state estimated by a DKT variant, the action being the id of the video to recommend, and a combined reward computed as a weighted sum of the three reward functions.
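As a concrete illustration of this multi-objective design, the sketch below combines three reward signals as a weighted sum in the spirit of [49]; the individual reward definitions, thresholds, and weights are illustrative assumptions, not the exact functions used by the authors.

```python
def reviewing_reward(mastery_before, mastery_after, kc):
    # Reward progress on KCs the student previously performed poorly on (illustrative).
    return (1.0 - mastery_before[kc]) * max(0.0, mastery_after[kc] - mastery_before[kc])

def smoothing_reward(difficulty, last_difficulty):
    # Penalize abrupt jumps in difficulty between consecutive videos (illustrative).
    return -abs(difficulty - last_difficulty)

def engagement_reward(mastery_after, kc, threshold=0.7):
    # Reward videos on KCs the student has recently started to master (illustrative).
    return 1.0 if mastery_after[kc] >= threshold else 0.0

def combined_reward(mastery_before, mastery_after, kc, difficulty, last_difficulty,
                    weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three reward signals; the weights are placeholders."""
    w1, w2, w3 = weights
    return (w1 * reviewing_reward(mastery_before, mastery_after, kc)
            + w2 * smoothing_reward(difficulty, last_difficulty)
            + w3 * engagement_reward(mastery_after, kc))

# Toy usage: knowledge states before/after watching a video on 'algebra'.
print(combined_reward({"algebra": 0.3}, {"algebra": 0.8}, "algebra", 0.6, 0.5))
```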
Another potential application area is the provision of a learning curriculum (e.g., an ordered list of topics to study) for a specific subject based on the knowledge state of a student. One direction [25, 96] in this application area follows a semi-automated approach in which an instructor conducts simulations using a KT model trained on the historical exercise records of students to identify a suitable curriculum of learning materials for a course that maximizes the knowledge gain of students. Another direction [81] uses KT models to assess the effectiveness of a course structure in achieving its targeted objectives by assessing the impact of each module (i.e., a collection of learning materials) on the knowledge growth of students. Recently, deep KT models have been adopted in this direction to provide quality assurance for course design [66]. The authors used a DKT model [89] to trace the progress in a student's knowledge state after taking a specific course module; an actor-critic reinforcement learning agent [42] then considers the knowledge state across different course modules and their predefined relationships (i.e., prerequisites) and selects the next module for the student to work on so as to maximize their knowledge gain towards the course objectives. Following this approach, the structure of a course can adapt dynamically to a student's needs and skills instead of having one fixed structure that does not fit all students.
Interactive education aims at making the learning process more exciting and engaging by delivering knowledge components within a gaming shell. Cognitive studies [45] showed that students gain new knowledge components more easily and are willing to spend more time on learning when the learning materials are delivered in an engaging manner such as gaming. This is attributed to the human mind being naturally suited to learning through real-world practice, which usually takes the form of an interactive experience (i.e., states, actions, and rewards). Thus, an educational game can provide an experience that is more aligned with our natural learning capabilities than conventional educational formats (e.g., textbooks, lectures, etc.) that lack engaging interaction. Long and Aleven [68] evaluated the effect of educational games in comparison to conventional online tutoring systems through a study involving two groups of students: one group learned with an educational game for solving math equations, while the other learned with a non-interactive system presenting math concepts in a conventional manner through demonstrative examples and exercises. The study found that the group using the educational game was more excited and engaged to continue the learning process than the other group. Another study [4] focused on the effect of mobile educational games on the learning progress of elementary school children. The authors divided the students into two groups: one was allowed to use mobile educational games to review and practice math concepts taught at school, while the other practiced the same concepts using conventional text exercises. The study concluded that students with access to the mobile educational games retained the math concepts better than the other group. These findings demonstrate the great potential of educational games as a promising application area for KT models.
For an educational game to be more effective, it has to assess the knowledge progress of a player and adjust the gaming experience accordingly. This might include adjusting the difficulty of challenges, opening new parts of the game, or adjusting the competency of a computer opponent. Kantharaju et al. [53] proposed a KT model that detects when a player attempts a specific skill in an educational game and quantifies their knowledge state across the different skills, so that the game experience can focus on the skills each player finds challenging. Cui et al. [26] used the BKT model [112] to trace the knowledge state of fifth-grade elementary school students in Canada during a science game-based assessment. The authors were not only able to effectively predict the final score of a student from partial observations of the game assessment, but also to identify pitfalls in the game design, i.e., assumptions that did not work as the designers expected in terms of assessing dedicated skills.
Going beyond the conventional assumption of a human student in a KT setting opens the door to a wide range of application areas. Virtual students, such as intelligent agents adopting a reinforcement learning setup or machine learning models, can be treated in a similar way to real-life students who need to learn a set of skills for different machine learning tasks. For example, this can be a deep neural network model that needs to master the skill of classifying different class labels (e.g., cats, dogs, furniture, etc.) in an image classification task, or a reinforcement learning agent aiming at mastering different skills in an Atari game. Curriculum learning (CL) [12] aims at learning a curriculum of tasks that enables a student agent to master a set of skills. A CL policy implies a statistical distribution over learning tasks that gradually drives the student agent towards convergence. Another relevant paradigm is machine teaching (MT) [126], which aims at minimizing the teaching cost, represented by the size of the training sample drawn from the training data in a machine learning scenario. In MT, two models are involved: a teacher model and a student model. The former aims to sample training data from which the latter learns an optimal parameter set θ* that minimizes the loss function of the task.
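Following [126], the MT objective can be stated compactly; the notation below is one common formulation rather than a verbatim reproduction of the original.

```latex
% Machine teaching: find the smallest teaching set D whose learner output
% matches the target parameter set (one common formulation of [126]).
\theta^{*} = \operatorname*{arg\,min}_{\theta}\; \mathcal{L}(\theta),
\qquad
\min_{D \subseteq \mathcal{X}}\; |D|
\quad \text{subject to} \quad A(D) = \theta^{*}
```

where A denotes the student's learning algorithm and L the task loss.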
Learning to teach (L2T) [32, 118] targets customizing the learning process of a student agent/model by optimizing three main aspects: training data sampling, neural architecture design, and loss function design. In L2T, a teacher agent follows a reinforcement learning approach to optimize a teaching policy that handles one or more of these three aspects. It can be observed that a shared characteristic across these different attempts to enhance the conventional machine learning procedure is the need to trace the knowledge state of a student model. Thus, there is significant potential for KT models to contribute to this application area by tracing the knowledge state of a student model during the training procedure. The output of the KT model would form the input/state of a teacher model that customizes the training procedure to speed up the student model's convergence. Knowledge Augmented Data Teaching (KADT) [3] aims at improving the data teaching strategy for a student ML model by tracking its knowledge progress across multiple knowledge components in a learning task. The KADT method includes a knowledge tracing model to dynamically capture the knowledge progression of the student model in terms of latent knowledge components. The authors developed an attention-pooling mechanism to extract knowledge representations of the student model with respect to class labels, which enables the development of a data-teaching strategy over significant training samples. The authors evaluated the performance of KADT on four different machine learning tasks, including knowledge tracing, sentiment analysis, movie recommendation, and image classification. Empirical comparisons show that KADT consistently outperforms state-of-the-art machine teaching methods on all tasks.
Despite the promising results achieved by current state-of-the-art KT models, the limitations and gaps of current approaches and available datasets open up several opportunities for future research, presented below.
Multimodal and informative representation learning & datasets. The choice of data representation directly impacts the performance of any machine learning model [11]. KT models tend to learn embedding representations for questions and KCs from abstract formats such as one-hot encoding; however, data in the description of a question, such as images and mathematical equations, that could lead to more informative embedding representations is overlooked, either by the proposed models or by the available datasets. This opens up research directions (RD) through the following questions: (1) What information/satellite data can be used to improve the performance of KT models? (2) How should such data be represented for KT tasks? (3) How can datasets be created for KT tasks that enable more informative embedding representation learning? The Exercise-aware Knowledge Tracing (EKT) approach proposed by Liu et al. [64] is a recent attempt to learn richer embedding representations, taking into account the textual context and the relationships between questions. Despite such efforts, however, the representation of multimodal and domain-specific data such as mathematical equations and code snippets remains mostly unexplored in the literature, resulting in low-informative representation learning for KT models. Fusing signals from multiple feature spaces may enable better representation learning in addition to mitigating noise in the data [9]. To address (1) and (2), a new range of benchmark datasets for KT must be created, containing contextual information rather than only id-encoded data (as previously seen in Section 3). Datasets spanning various knowledge domains and from different cultures and educational levels are also needed, as the performance of KT models can be affected when applied to other demographic and educational contexts. This leads us to the next research opportunity, self-supervised learning in knowledge tracing.
Self-supervised learning in knowledge tracing. Although supervised learning has led to advances in different areas, it still has a major drawback: the need for large, high-quality labeled data for training. Self-supervised learning (SSL) [74, 124], on the other hand, has proved effective in several areas (e.g., natural language processing [29] and computer vision [37]) by learning from unlabeled data. SSL often adopts similarity ranking loss functions (e.g., contrastive loss [115]) in a process called pre-training or a pretext task [74] to automatically generate labels.
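As a concrete illustration of such a pretext objective, the sketch below computes an InfoNCE-style contrastive loss between two augmented views of the same batch of student-sequence embeddings; the encoder and augmentations are left abstract, and this is a generic SSL objective rather than an established KT pre-training recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: embeddings of two views of the same student sequence
    (z1[i], z2[i]) are pulled together; all other pairs are pushed apart."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 student sequences encoded into 32-dim embeddings per view.
z_view1, z_view2 = torch.randn(8, 32), torch.randn(8, 32)
print(contrastive_loss(z_view1, z_view2))
```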
Such pre-trained models can then be transferred to a downstream task and trained in a supervised manner with a limited amount of labeled data. Along with SSL, therefore, further contributions to the KT field can be made, for example, by (4) creating pre-trained models (e.g., building on existing pre-trained language and computer vision models) to generate informative representations for KT; and (5) investigating how SSL can mitigate the limited training data on students' activities in cold-start scenarios or under skewed participation, e.g., when only a small number of students contribute most of the activities.
Interactive knowledge tracing. Most KT models adopt a passive approach, observing the history of question-answering responses to estimate students' knowledge states; interactive methods that actively drive the question-answering behavior to better understand the dynamics of students' knowledge states remain unexplored. Interactive methods are particularly useful in cold-start scenarios, where an interactive approach can reveal students' knowledge states by directly asking questions related to different KCs. Thus, another potential future work is to (6) develop optimized question sampling policies to enhance the performance of KT models in, but not limited to, cold-start scenarios. Reinforcement learning (RL) [103] approaches are a natural choice given their reward-maximizing scheme; a minimal sampling-policy sketch is given at the end of this section.
Last but not least, considering that knowledge tracing involves human knowledge and learning, transparency in the internal logic and the results obtained by KT models would benefit educational stakeholders and processes. This leads us to investigate eXplainable Artificial Intelligence (XAI) approaches, as seen in other research fields [50, 51]. Potential research avenues include (7) the development of techniques and methods to understand and explain the prediction process in KT models; and (8) studying how algorithmic decisions impact learning processes, course design, instructor performance, the quality of learning materials, and student engagement. Promising research on explaining deep learning models has been carried out using knowledge distillation [43] to understand and explain predictions of other models.
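As a minimal illustration of research direction (6), the sketch below implements an epsilon-greedy question-sampling policy that probes the KC whose estimated mastery is most uncertain; the mastery representation and the uncertainty heuristic are illustrative assumptions, not a published method.

```python
import random

def sample_next_question(mastery, questions_by_kc, epsilon=0.2):
    """Cold-start question sampling: with probability epsilon explore a random
    KC, otherwise probe the KC whose mastery estimate is closest to 0.5
    (i.e., the most uncertain one). Illustrative heuristic only."""
    if random.random() < epsilon:
        kc = random.choice(list(mastery))                       # explore
    else:
        kc = min(mastery, key=lambda k: abs(mastery[k] - 0.5))  # probe uncertainty
    return random.choice(questions_by_kc[kc])

# Toy usage: 'subtraction' is the most uncertain skill, so it is probed first.
mastery = {"addition": 0.9, "subtraction": 0.55, "multiplication": 0.1}
questions_by_kc = {"addition": ["q1"], "subtraction": ["q2", "q3"], "multiplication": ["q4"]}
print(sample_next_question(mastery, questions_by_kc, epsilon=0.0))
```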
In this work, we presented a comprehensive survey of knowledge tracing. To cover the related fundamental concepts, we reviewed early attempts at knowledge tracing in chronological order and illustrated the relevant background concepts. We built upon this historical knowledge tracing landscape to lay out a categorization of the relevant literature based on shared theoretical and algorithmic aspects. As deep learning is the prominent toolkit behind the majority of state-of-the-art KT approaches, we reviewed its different approaches in depth and contrasted their characteristics along multiple dimensions, such as knowledge representation learning, consideration of forgetting behavior, and architecture design. We also presented a chronological flow of deep learning approaches showing how later methods extended earlier ones, providing a cohesive understanding of their limitations and open research directions. Moreover, we introduced a detailed review of the KT datasets used in the literature, describing their characteristics, limitations, and contributions, and provided a curated summary of the reported performance results of key KT methods on these datasets. Finally, we discussed various application areas for knowledge tracing to show its potential in addressing human and machine teaching domains, and highlighted future research directions that could push the boundaries of current knowledge tracing methods by harnessing data from multiple modalities, learning with weak or no supervision, and following an interactive reinforcement learning paradigm to overcome cold-start challenges.
References
Knowledge Tracing with Sequential Key-Value Memory Networks
Deep Graph Memory Networks for Forgetting-Robust Knowledge Tracing
Learning Data Teaching Strategies Via Knowledge Tracing
Effect of Mobile Gaming on Mathematical Achievement among 4th Graders
Cognitive Modelling and Intelligent Tutoring
Cognitive modeling and intelligent tutoring
Book Review: Probabilistic Models for Some Intelligence and Attainment Tests
Neural Machine Translation by Jointly Learning to Align and Translate
Multimodal Machine Learning: A Survey and Taxonomy
An upper asymptote for the three-parameter logistic item-response model
Representation Learning: A Review and New Perspectives
Curriculum Learning
Statistical theory for logistic mental test models with a prior distribution of ability
Recommender systems survey. Knowledge-Based Systems
Learning Path Recommendation Based on Knowledge Tracing Model and Reinforcement Learning
Learning Factors Analysis - A General Method for Cognitive Model Evaluation and Improvement
Comparing Two IRT Models for Conjunctive Skills
Predicting Learners Need for Recommendation Using Dynamic Graph-Based Knowledge Tracing
Modeling Exercise Relationships in E-Learning: A Unified Approach
Tracking Knowledge Proficiency of Students with Educational Priors
Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing
EdNet: A Large-Scale Hierarchical Dataset in Education
Cognitive mastery learning in the ACT Programming Tutor
Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction
Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction
Analyzing Student Process Data in Game-Based Assessments with Bayesian Knowledge Tracing and Dynamic Bayesian Networks
More Accurate Student Modeling through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing
More Accurate Student Modeling through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Memory: A contribution to experimental psychology
Item response theory
Learning to Teach
Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction
Review of item response theory practices in organizational research: Lessons learned and paths forward
Modeling learner's dynamic knowledge construction procedure and cognitive item difficulty for knowledge tracing
Context-Aware Attentive Knowledge Tracing
Self-supervised Pretraining of Visual Features in the Wild
Generating Sequences With Recurrent Neural Networks. arXiv e-prints
Speech recognition with deep recurrent neural networks
Neural Turing Machines
A general solution for the latent class model of latent structure analysis
A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients
Towards Black-Box Explainability With Gaussian Discriminant Knowledge Distillation
Spectra of some self-exciting and mutually exciting point processes
Increasing the effectiveness of digital educational games: The effects of a learning instruction on students' learning, motivation and cognitive load
Long Short-Term Memory
General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge
Learning or Forgetting? A Dynamic Approach for Tracking the Knowledge Proficiency of Students
Exploring Multi-Objective Exercise Recommendations in Online Education Systems
Towards Trustable Explainable AI
Towards Quantification of Explainability in Explainable Artificial Intelligence Methods
Practice and Forgetting Effects on Vocabulary Memory: An Activation-Based Model of the Spacing Effect
Tracing Player Knowledge in a Parallel Programming Educational Game
Dynamic Bayesian Networks for Student Modeling
How Deep is Knowledge Tracing
Integrating latent-factor and knowledge-tracing models to predict individual differences in learning
Integrating latent-factor and knowledge-tracing models to predict individual differences in learning
Integrating knowledge tracing and item response theory: A tale of two frameworks
Semi-Supervised Classification with Graph Convolutional Networks
A data repository for the EDM community: The PSLC DataShop. Handbook of Educational Data Mining
Deep learning
The Impact on Individualizing Student Models on Necessary Practice Opportunities
A critical review of recurrent neural networks for sequence learning
EKT: Exercise-Aware Knowledge Tracing for Student Performance Prediction
A Survey of Knowledge Tracing
Exploiting Cognitive Structure for Adaptive Learning
Educational game and intelligent tutoring system: A classroom study and comparative design analysis
Educational Game and Intelligent Tutoring System: A Classroom Study and Comparative Design Analysis
Statistical theories of mental test scores
A Theory of Test Scores and Their Relation to the Trait Measured
Efficient Estimation of Word Representations in Vector Space
Key-Value Memory Networks for Directly Reading Documents
Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing
Self-Supervised Learning of Pretext-Invariant Representations
Revealing the Learning in Learning Curves
Replication and Analysis of Ebbinghaus' Forgetting Curve
Augmenting Knowledge Tracing by Considering Forgetting Behavior
Graph-Based Knowledge Tracing: Modeling Student Proficiency Using Graph Neural Network
A Self Attentive Model for Knowledge Tracing
RKT: Relation-Aware Self-Attention for Knowledge Tracing
Adapting Bayesian knowledge tracing to a massive open online course in edX
Affective States and State Tests: Investigating How Affect and Engagement during the School Year Predict End-of-Year Learning Outcomes
Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing
Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing
KT-IDEM: Introducing Item Difficulty to the Knowledge Tracing Model
Performance Factors Analysis - A New Alternative to Knowledge Tracing
Modeling Students' Memory for Application in Adaptive Educational Systems
Modeling Students' Memory for Application in Adaptive Educational Systems
Deep Knowledge Tracing
Intelligent tutoring systems: Lessons learned
Does Time Matter? Modeling the Effect of Time with Bayesian Knowledge Tracing
Probabilistic models for some intelligence and attainment tests
The Graph Neural Network Model
The Graph Neural Network Model
Deep learning in neural networks: An overview
Adaptive Robot Language Tutoring Based on Bayesian Knowledge Tracing and Predictive Decision-Making
An Individual's Rate of Forgetting Is Stable Over Time but Differs Across Materials
A trainable spaced repetition model for language learning
SAINT+: Integrating Temporal Features for EdNet Correctness Prediction
Algebra I 2008-2009. Challenge data set from KDD Cup 2010 Educational Data Mining Challenge
Exercise-Enhanced Sequential Modeling for Student Performance Prediction
Sequence to Sequence Learning with Neural Networks
Reinforcement learning: An introduction
Next-Term Student Performance Prediction: A Recommender Systems Approach
An elementary introduction to the Wiener process and stochastic integrals
Using factorization machines for student modeling
HGKT: Introducing Problem Schema with Hierarchical Exercise Graph for Knowledge Tracing
Structure-based Knowledge Tracing: An Influence Propagation View
Will MOOCs destroy academia?
Attention Is All You Need
Knowledge Tracing Machines: Factorization Machines for Knowledge Tracing
Probabilistic Student Models: Bayesian Belief Networks and Knowledge Space Theory
Probabilistic Student Models: Bayesian Belief Networks and Knowledge Space Theory
Temporal Cross-Effects in Knowledge Tracing
Understanding the Behaviour of Contrastive Loss
Back to the basics: Bayesian extensions of IRT outperform neural networks for proficiency estimation
Logistic Regression. In Reading and Understanding Multivariate Statistics
Learning to Teach with Dynamic Loss Functions
Going Deeper with Deep Knowledge Tracing
GIKT: A Graph-Based Interaction Model for Knowledge Tracing
Addressing two problems in deep knowledge tracing via prediction-consistent regularization
QuesNet: A Unified Representation for Heterogeneous Test Questions
Individualized Bayesian Knowledge Tracing Models
S4L: Self-Supervised Semi-Supervised Learning
Dynamic Key-Value Memory Networks for Knowledge Tracing
Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education