title: Consistency and Monotonicity Regularization for Neural Knowledge Tracing
authors: Lee, Seewoo; Choi, Youngduck; Park, Juneyoung; Kim, Byungsoo; Shin, Jinwoo
date: 2021-05-03

Knowledge Tracing (KT), tracking a human's knowledge acquisition, is a central component in online learning and AI in Education. In this paper, we present a simple, yet effective strategy to improve the generalization ability of KT models: we propose three types of novel data augmentation, coined replacement, insertion, and deletion, along with corresponding regularization losses that impose certain consistency or monotonicity biases on the model's predictions for the original and augmented sequences. Extensive experiments on various KT benchmarks show that our regularization scheme consistently improves model performance, under 3 widely-used neural networks and 4 public benchmarks, e.g., it yields a 6.3% improvement in AUC under the DKT model and the ASSISTmentsChall dataset.

Figure 1: Distribution of the correctness rate of past interactions when the response correctness of the current interaction is fixed, for 4 knowledge tracing benchmark datasets. Orange (resp. blue) represents the distribution of past-interaction correctness rates where the current interaction's response is correct (resp. incorrect); the x-axis is the previous interactions' correctness rate (values in [0, 1]). The orange distribution leans further to the right than the blue distribution, which shows the monotonic nature of the interaction datasets.

In recent years, Artificial Intelligence in Education (AIEd) has gained much attention as an emerging field to elevate educational technology. Especially due to the circumstances of the COVID-19 pandemic, much of the education industry was forcibly moved to an online environment, which opened up substantial opportunities to utilize educational data. The ability to diagnose students through data and provide personalized learning paths has become a critical edge in online education. Assessing a student's current knowledge state, a task commonly known as Knowledge Tracing (KT), has been a central focus of AIEd research, and building more precise and robust KT models has become essential for developing highly effective, AI-based educational systems. Since the work of [Piech et al., 2015], deep neural networks have been widely used for KT modeling. Current research trends in the KT literature concentrate on building more sophisticated, complex, and large-scale models, inspired by model architectures from Natural Language Processing (NLP) such as LSTM [Hochreiter and Schmidhuber, 1997] or Transformer [Vaswani et al., 2017] architectures, and further investigations introduce additional components based on educational contexts such as textual information. However, not all educational data are sufficiently large, and more often than not, larger model sizes lead to overfitting and ultimately hurt the model's generalizability [Gervet et al., 2020] (see Figure 1 of the Appendix). To the best of our knowledge, only a handful of works in the literature address such issues, and even then, the scope is limited to regularization [Yeung and Yeung, 2018; Sonkar et al., 2020].
To address the issue, we propose simple, yet effective data augmentation strategies for improving the generalization ability of KT models, along with novel regularization losses for each strategy. In particular, we suggest three types of data augmentation, coined (skill-based) replacement, insertion, and deletion. Specifically, we generate augmented (training) samples by randomly replacing questions that a student solved with similar questions, or by inserting/deleting interactions with fixed responses. Furthermore, during training, we impose certain consistency (for replacement) and monotonicity (for insertion/deletion) biases on the model's predictions by optimizing corresponding regularization losses that compare the original and the augmented interaction sequences. Such regularization strategies are motivated by our observation that existing knowledge tracing models' predictions often fail to satisfy the consistency and monotonicity conditions, e.g., see Figure 4 in Section 3. Here, our intuition behind the proposed consistency regularization is that the model's outputs for two interaction sequences with the same response logs for similar questions should be close. Next, the proposed monotonicity regularization is designed to enforce the model's predictions to be monotone with respect to the number of questions that are correctly (or incorrectly) answered, i.e., a student is more likely to be correct (or incorrect) on the next question if the student was more correct in the past. By analyzing the distribution of previous correctness rates, we can observe that existing student interaction data are indeed monotonic, as shown in Figure 1. The overall augmentation and regularization strategies are sketched in Figure 2.

We demonstrate the effectiveness of the proposed method with 3 widely used neural knowledge tracing models - DKT [Piech et al., 2015], DKVMN [Zhang et al., 2017], and SAINT [Choi et al., 2020a] - on 4 public benchmark datasets - ASSISTments2015, ASSISTmentsChall, STATICS2011, and EdNet-KT1. Extensive experiments show that, regardless of dataset or model architecture, our scheme remarkably increases prediction performance - a 6.3% gain in Area Under Curve (AUC) for DKT on the ASSISTmentsChall dataset. In particular, ours is much more effective under smaller datasets: by using only 25% of the ASSISTmentsChall dataset, we improve the AUC of the DKT model from 69.68% to 75.44%, which even surpasses the baseline performance of 74.4% with the full training set. We further provide various ablation studies for the selected design choices, e.g., the AUC of the DKT model on the ASSISTments2015 dataset drops from 72.44% to 66.48% when we impose 'reversed' (wrong) monotonicity regularization. The findings from this study contribute to the existing KT literature by providing a novel generalization mechanism that serves as a strong baseline for future augmentation and regularization research.

Data augmentation is arguably the most reliable technique to prevent overfitting or improve the generalizability of machine learning models. In particular, it has been developed as an effective way to impose a domain-specific, inductive bias on a model. For example, for computer vision models, simple image warpings such as flip, rotation, distortion, color shifting, blur, and random erasing are the most popular data augmentation methods [Shorten and Khoshgoftaar, 2019].
For NLP models, it is popular to augment texts by replacing words with synonyms [Zhang et al., 2015] or with words that have similar (contextualized) embeddings [Kobayashi, 2018]. Recently, [Wei and Zou, 2019] showed that even simple methods like random insertion/swap/deletion can improve text classification performance. The aforementioned data augmentation techniques have been used not only in standard supervised learning setups, but also in various unsupervised and semi-supervised learning frameworks, by imposing certain inductive biases on models. For example, consistency learning [Berthelot et al., 2019] imposes a consistency bias on a model so that the model's output is invariant under the augmentations, by training the model with a consistency regularization loss (e.g., an L_2 loss between outputs). [Abu-Mostafa, 1992] suggested a general theory for imposing such inductive biases (stated as hints) via additional regularization losses. Their successes highlight the importance of domain-specific knowledge for designing appropriate data augmentation strategies, but such results are rare in the domain of AIEd, especially for Knowledge Tracing.

Knowledge Tracing (KT) is the task of modeling student knowledge over time based on the student's learning history. Formally, for a given student interaction sequence (I_1, . . . , I_T), where each I_t = (Q_t, R_t) is a pair of a question id Q_t and the student's response correctness R_t ∈ {0, 1} (1 means correct), KT aims to estimate the probability

    P(R_t = 1 | I_1, . . . , I_{t-1}, Q_t),

i.e., the probability that the student answers the question Q_t at the t-th step correctly [Corbett and Anderson, 1994].

For a given set of data augmentations A, we train KT models with the following loss:

    L = L_ori + Σ_{aug ∈ A} (λ_aug · L_aug + λ_reg-aug · L_reg-aug),     (2)

where L_ori is the commonly used binary cross-entropy (BCE) loss for the original training sequences and L_aug are the same BCE losses for augmented sequences generated by applying the augmentation strategies aug ∈ A. L_reg-aug are the regularization losses that impose consistency and monotonicity biases on the model's predictions for the original and augmented sequences, which will be defined in the following sections. Finally, λ_aug, λ_reg-aug > 0 are hyperparameters that control the trade-off among L_ori, L_aug, and L_reg-aug. In the following sections, we introduce the three augmentation strategies, replacement, insertion, and deletion, with the corresponding consistency and monotonicity regularization losses L_reg-rep, L_reg-cor-ins (or L_reg-incor-ins), and L_reg-cor-del (or L_reg-incor-del).

Replacement, similar to the synonym replacement strategy in NLP, is an augmentation strategy that replaces questions in the original interaction sequence with other similar questions without changing their responses, where similar questions are defined as the questions that share the same skills as the original question. Our hypothesis is that the predicted correctness probabilities for questions in an augmented interaction sequence should not change drastically compared to the original interaction sequence. Formally, for each interaction in the original interaction sequence (I_1, . . . , I_T), we randomly decide whether the interaction will be replaced or not, following the Bernoulli distribution with probability α_rep. If an interaction I_t = (Q_t, R_t), with a set of skills S_t associated with the question Q_t, is set to be replaced, we replace it by I^rep_t = (Q^rep_t, R_t), where Q^rep_t is randomly chosen among the questions whose skill set shares at least one skill with S_t and the response R_t is kept fixed, for each t ∈ R, where R is the set of indices to replace.
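To make the replacement procedure concrete, a minimal sketch is given below. It is illustrative only: the function name, the toy skill map, and the choice of Python are ours rather than the paper's, while the skill-sharing criterion follows the definition above.

```python
import random

def skill_based_replacement(sequence, skills, alpha_rep=0.3):
    """Replace each question with probability alpha_rep by another question that shares
    at least one skill, keeping the response fixed. Returns the augmented sequence and
    the set R of replaced indices (excluded later from the consistency loss)."""
    augmented, replaced = [], set()
    for t, (q, r) in enumerate(sequence):
        if random.random() < alpha_rep:
            candidates = [q2 for q2, s in skills.items() if q2 != q and s & skills[q]]
            if candidates:
                q = random.choice(candidates)
                replaced.add(t)
        augmented.append((q, r))
    return augmented, replaced

# Toy example: interactions are (question_id, response) pairs; the skill map is made up.
skills = {3: {0}, 7: {1}, 12: {0, 2}, 5: {1}, 8: {0}, 9: {2}}
print(skill_based_replacement([(3, 1), (7, 0), (12, 1), (5, 1)], skills))
```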
Then we consider the following consistency regularization loss:

    L_reg-rep = Σ_{t ∉ R} (p_t − p^rep_t)²,

where p_t and p^rep_t are the model's predicted correctness probabilities for the t-th question of the original and augmented sequences, respectively. The outputs for the replaced interactions are not included in the loss computation. Also, the replacement strategy has several variants depending on the dataset. For instance, randomly selecting a question for Q^rep_t from the question pool is an alternative strategy when the related skill set information is not available. It is also possible to consider only the outputs for interactions that are replaced, or the outputs for all interactions in the augmented sequence, for the loss computation. We investigate the effectiveness of each strategy in Section 3.

The insertion strategy is based on the notion that a student's knowledge level should be estimated higher when more questions are answered correctly. Specifically, data are augmented in a monotonic manner by inserting new interactions into the original interaction sequence. Formally, we generate an augmented interaction sequence (I^ins_1, . . . , I^ins_{T'}) by inserting correctly (resp. incorrectly) answered interactions I^ins_t = (Q^ins_t, 1) (resp. I^ins_t = (Q^ins_t, 0)) at the indices t ∈ I, where each question Q^ins_t is randomly selected from the question pool and I, the set of indices of inserted interactions, has size equal to an α_ins proportion of the original sequence length. Then our hypothesis is formulated as p_t ≤ p^ins_σ(t) (resp. p_t ≥ p^ins_σ(t)), where p_t and p^ins_t are the model's predicted correctness probabilities for the t-th question of the original and augmented sequences, respectively. Here, σ maps each index of the original sequence to its position in the augmented sequence (e.g., in Figure 2, σ sends {1, 2, 3, 4} to {2, 3, 4, 6}). We impose our hypothesis through the following losses:

    L_reg-cor-ins = Σ_{t ∈ [T]} max(p_t − p^ins_σ(t), 0),
    L_reg-incor-ins = Σ_{t ∈ [T]} max(p^ins_σ(t) − p_t, 0),

where L_reg-cor-ins and L_reg-incor-ins are the losses for augmented interaction sequences obtained by inserting correctly and incorrectly answered interactions, respectively.

Similar to the insertion augmentation strategy, we enforce monotonicity by removing some interactions from the original interaction sequence, based on the following hypothesis: if a student's response record contains fewer correct answers, the correctness probabilities for the remaining questions should decrease, and vice versa. Formally, from the original interaction sequence (I_1, . . . , I_T), we randomly sample a set of indices D ⊂ [T], where R_t = 1 (resp. R_t = 0) for t ∈ D, based on the Bernoulli distribution with probability α_del. We remove the interactions at the indices t ∈ D and impose the hypothesis p_t ≥ p^del_σ(t) (resp. p_t ≤ p^del_σ(t)) for the remaining indices, where p_t and p^del_t are the model's predicted correctness probabilities for the t-th question of the original and augmented sequences, respectively. Here, σ maps each remaining index of the original sequence to its position in the augmented sequence. We impose the hypothesis through the following losses:

    L_reg-cor-del = Σ_{t ∉ D} max(p^del_σ(t) − p_t, 0),
    L_reg-incor-del = Σ_{t ∉ D} max(p_t − p^del_σ(t), 0),

where L_reg-cor-del and L_reg-incor-del are the losses for augmented interaction sequences obtained by deleting correctly and incorrectly answered interactions, respectively.
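A minimal PyTorch sketch of these regularizers is shown below. The squared-difference and hinge forms follow the consistency and monotonicity hypotheses stated above, but the exact functional forms, tensor layout, and names are our own illustrative choices rather than the paper's reference implementation.

```python
import torch

def consistency_loss(p, p_rep, replaced_mask):
    """L_reg-rep: squared difference between predictions on the original and replaced
    sequences, skipping the replaced positions (replaced_mask[t] = True if t is in R)."""
    keep = ~replaced_mask
    return ((p[keep] - p_rep[keep]) ** 2).sum()

def monotonicity_loss(p, p_aug, sigma, direction):
    """Hinge penalty on violations of the monotonicity hypothesis.
    direction = +1 enforces p[t] <= p_aug[sigma(t)]  (correct insertion / incorrect deletion),
    direction = -1 enforces p[t] >= p_aug[sigma(t)]  (incorrect insertion / correct deletion)."""
    diff = p - p_aug[sigma]
    return torch.clamp(direction * diff, min=0.0).sum()

# Toy example with made-up probabilities: one correct interaction inserted at position 1.
p     = torch.tensor([0.40, 0.60, 0.70])
p_ins = torch.tensor([0.45, 0.55, 0.55, 0.75])
sigma = torch.tensor([0, 2, 3])   # original steps 0, 1, 2 now sit at positions 0, 2, 3
# Step 1's probability dropped after a correct insertion, so it is penalized.
print(monotonicity_loss(p, p_ins, sigma, direction=+1))
```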
We demonstrate the effectiveness of the proposed method on 4 widely used benchmark datasets: ASSISTments2015, ASSISTmentsChall, STATICS2011, and EdNet-KT1; detailed descriptions of the benchmarks, their statistics, and pre-processing procedures are given in the Appendix. We test the DKT [Piech et al., 2015], DKVMN [Zhang et al., 2017], and SAINT [Choi et al., 2020a] models. For DKT, we set the embedding dimension and the hidden dimension to 256. For DKVMN, the key, value, and summary dimensions are all set to 256, and we set the number of latent concepts to 64. SAINT has 2 layers with hidden dimension 256, 8 attention heads, and feed-forward dimension 1024.

None of the models use any additional interaction features beyond question ids and responses as input, and the model weights are initialized with Xavier initialization [Glorot and Bengio, 2010]. The models are trained from scratch with batch size 64, using the Adam optimizer with learning rate 0.001 scheduled by the Noam scheme with 4000 warm-up steps. We set each model's maximum sequence length to 100 on the ASSISTments2015 and EdNet-KT1 datasets and to 200 on the ASSISTmentsChall and STATICS2011 datasets. The augmentation hyperparameters α_aug, λ_reg-aug, and λ_aug are searched over α_aug ∈ {0.1, 0.3, 0.5}, λ_reg-aug ∈ {1, 10, 50, 100}, and λ_aug ∈ {0, 1}. For all datasets, we evaluate our results using 5-fold cross-validation and use the Area Under Curve (AUC) as the evaluation metric.

The results (AUCs) are shown in Table 1, which compares models without and with augmentations; we report the best result for each strategy. (The detailed hyperparameters for these results are given in the Supplementary material.) The 4th column shows results using both insertion and deletion, and the last column shows results with all 3 augmentations. Since there is no large difference in performance gain between insertion and deletion, we only report the performance obtained by using one or both of them together. We use skill-based replacement when skill information for each question in the dataset is available, and question-random replacement, which selects new questions among all questions, when it is not (e.g., ASSISTments2015). As one can see, the models trained with consistency and monotonicity regularization outperform the models without augmentations by a large margin, regardless of model architecture or dataset. Using all three augmentations gives the best performance in most cases. For instance, there is a 6.3% gain in AUC on the ASSISTmentsChall dataset under the DKT model. Furthermore, beyond enhancing prediction performance, our training scheme also resolves the vanilla model's issue that the monotonicity condition on the predictions for the original and augmented sequences is violated. As shown in Figure 4, the predictions of the model trained with monotonicity regularization (correct insertion) increase after insertion, in contrast to the vanilla DKT model's outputs.

Since overfitting is expected to be more severe when using a smaller dataset, we conduct experiments using various fractions of the training datasets (5%, 10%, 25%, 50%) and show that our augmentations yield more significant improvements for smaller training sets. Figure 3 shows the performance of the DKT model on various datasets, with and without augmentations. For example, on the ASSISTmentsChall dataset, using 100% of the training data gives an AUC of 74.4%, while the same model trained with augmentations achieves an AUC of 75.44% with only 25% of the training data.

Are constraint losses necessary? One might think that data augmentation alone is enough to boost performance, and that imposing consistency and monotonicity is not necessary. However, we found that including such regularization losses during training is essential for further performance gains.
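The comparison in this ablation amounts to switching the regularization terms on or off in the training objective. The sketch below is illustrative: the function and argument names, the batch layout, and the single shared weights are our own simplifications of the per-augmentation weights in the losses (2) and (8).

```python
def training_loss(model, batch, augmentations, lam_aug, lam_reg, bce_loss):
    """With lam_reg > 0 this follows the full objective (2); setting lam_reg = 0 and
    lam_aug = 1 reduces it to the augmentation-only objective (8) used in the ablation."""
    p = model(batch)                                   # per-step predicted correctness probabilities
    loss = bce_loss(p, batch["labels"])                # L_ori
    for augment in augmentations:                      # each augment returns an augmented batch
        aug_batch, reg_fn = augment(batch)             # and a regularizer comparing the two predictions
        p_aug = model(aug_batch)
        loss = loss + lam_aug * bce_loss(p_aug, aug_batch["labels"])   # L_aug
        loss = loss + lam_reg * reg_fn(p, p_aug)                       # L_reg-aug
    return loss
```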
To see this, we compare the performance of the model trained only with the KT losses for both the original and augmented sequences, i.e., using the loss

    L = L_ori + Σ_{aug ∈ A} λ_aug · L_aug     (8)

(where λ_aug = 1), with the performance of the model trained with the consistency and monotonicity regularization losses (2), where A is a set that contains a single augmentation. Training a model with the loss (8) can be thought of as using augmentations without imposing any consistency or monotonicity biases. (In the following tables, the AUC of the vanilla DKT model is given in parentheses below each dataset name, and O (resp. X) represents a correct (resp. incorrect) response.) Table 2 shows the results under the DKT model. Using only data augmentation (training the model with the loss (8)) gives a marginal gain in performance, or even worse performance. However, training with both data augmentation and the consistency or monotonicity regularization losses (2) gives significantly higher performance gains. On the ASSISTmentsChall dataset, using replacement along with consistency regularization improves AUC by 6%, which is much higher than the 1% improvement from using data augmentation alone.

Ablation on monotonicity constraints. We perform an ablation study to compare the effects of monotonicity regularization and reversed monotonicity regularization. Monotonicity regularization introduces a constraint loss that aligns the predictions on the original and the inserted/deleted sequences so that the predicted correctness probabilities for the original sequence move in the direction implied by the insertion or deletion. For example, when a correct response is inserted into the sequence, the predicted correctness probabilities for the original sequence should increase. Reversed monotonicity regularization modifies the correctness probabilities in the opposite manner: inserting a correct response would decrease the predicted correctness probabilities for the original sequence. For each aug ∈ {cor-ins, incor-ins, cor-del, incor-del}, we can define a reversed version of the monotonicity regularization loss, L^rev_reg-aug, which imposes the opposite constraint on the model's output; e.g., we define

    L^rev_reg-cor-ins := L_reg-incor-ins     (9)

(applied to the sequence augmented by inserting correct responses), which forces the model's predicted correctness probabilities to decrease when correct responses are inserted. In these experiments, we do not include the KT loss for augmented sequences (we set λ_aug = 0) in order to observe the effect of the constraint loss only. Also, the same hyperparameters (α_aug and λ_reg-aug) are used for both the original and reversed constraints. Table 3 shows the performance of the DKT model with the original and reversed monotonicity regularizations: the second row represents the performance with no augmentations, the 3rd to 6th rows represent the results of using the original (aligned) insertion/deletion monotonicity regularization losses, and the last four rows represent the results when the reversed monotonicity regularization losses are used. The results demonstrate that the aligned monotonicity regularization losses outperform the reversed monotonicity regularization losses. Moreover, reversed monotonicity shows a large decrease in performance on several datasets, even compared to the model with no augmentation. On the EdNet-KT1 dataset, correct insertion with the original regularization improves the AUC from 72.75% to 73.70%, while using the reversed regularization drops the performance to 69.67%.

Ablation on replacement. We compare our consistency regularization with two other variations of replacement, namely consistency regularization on the replaced interactions only and on all interactions, corresponding to the following losses:

    L^ro_reg-rep = Σ_{t ∈ R} (p_t − p^rep_t)²,
    L^full_reg-rep = Σ_{t ∈ [T]} (p_t − p^rep_t)²,

where ro stands for replaced only. We compare these variations with the original consistency loss L_reg-rep, which does not include predictions for the replaced interactions; a short sketch of the three variants is given below.
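A hypothetical sketch of the three consistency-loss variants, under the same squared-difference reading used earlier (the names and the mask-based interface are ours):

```python
import torch

def consistency_loss_variants(p, p_rep, replaced_mask):
    """Compare the default loss (skip replaced positions), the 'ro' variant
    (replaced positions only), and the 'full' variant (all positions)."""
    sq = (p - p_rep) ** 2
    return {
        "default": sq[~replaced_mask].sum(),   # L_reg-rep
        "ro":      sq[replaced_mask].sum(),    # L^ro_reg-rep
        "full":    sq.sum(),                   # L^full_reg-rep
    }

# Toy example: position 1 was replaced.
p     = torch.tensor([0.40, 0.60, 0.70])
p_rep = torch.tensor([0.42, 0.30, 0.69])
mask  = torch.tensor([False, True, False])
print(consistency_loss_variants(p, p_rep, mask))
```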
For all variations, we use the same replacement probability α_rep and loss weight λ_reg-rep, and, as before, we do not include the KT loss for the replaced sequences. Table 4 shows that including the replaced interactions' outputs hurts performance. To see the effect of using the skill information of questions for replacement, we also compare skill-based replacement with three alternative versions of replacement: question-random replacement, interaction-random replacement, and skill-set-based replacement. For question-random replacement, we replace questions with randomly chosen ones (without considering skill information), while interaction-random replacement changes both questions and responses (each new response is sampled with probability 0.5). Skill-set-based replacement is almost the same as the original skill-based replacement, but the candidate questions are required to be associated with exactly the same set of skills as the original question, rather than merely sharing common skills. The results in Table 5 show that the performance of question-random replacement depends on the nature of the dataset: it performs similarly to skill-based replacement on the ASSISTmentsChall and EdNet-KT1 datasets, but gives only a minor gain or even drops performance on the other datasets. However, applying interaction-random replacement significantly hurts performance on all datasets, e.g., the AUC decreases from 86.43% to 84.50% on the STATICS2011 dataset. This demonstrates the importance of keeping the responses of the interactions fixed for consistency regularization. Lastly, skill-set-based replacement works similarly to, or even worse than, the original skill-based replacement. Note that each question in the STATICS2011 dataset has a single skill attached, so the performances of skill-based and skill-set-based replacement coincide on that dataset.

Comparison with other regularization methods in KT. We also compare our regularization scheme with previous works: DKT+ [Yeung and Yeung, 2018] and qDKT [Sonkar et al., 2020]. DKT+ uses two types of regularization losses: a reconstruction loss and a waviness loss. The reconstruction loss encourages the model to recover the current interaction's label, and the waviness loss makes the model's predictions consistent over all timesteps. qDKT uses a Laplacian loss that regularizes the variance of predicted correctness probabilities for questions that fall under the same skill, which is similar to the variation L^ro_reg-rep of our consistency loss. We explain these losses in detail in the Appendix. The results in Table 6 show that our regularization approach yields the largest performance gain over all benchmarks compared to the other methods. In some cases, using DKT+ or qDKT even harms performance, while consistency and monotonicity regularization yields substantial performance gains over all datasets.

We propose simple augmentation strategies with corresponding constraint regularization losses for KT and show their efficacy. We only consider the most basic features of interactions, namely questions and response correctness; other features, such as elapsed time or question text, would enable more diverse augmentation strategies when available. Furthermore, exploring the applicability of our idea to other AIEd tasks (e.g., dropout prediction or at-risk student prediction) is another interesting future direction.

As shown in Figure 1, the monotonic nature of the benchmark datasets can be observed from the distribution of past interactions' correctness rates.
Formally, for a given interaction sequence (I_1, . . . , I_T) with I_t = (Q_t, R_t) and each 2 ≤ t ≤ T, we compare the distributions of the past interactions' correctness rate

    (1 / (t − 1)) · Σ_{τ=1}^{t−1} 1_{R_τ = 1},

where 1_{R_τ=1} is an indicator function which is 1 (resp. 0) when R_τ = 1 (resp. R_τ = 0). We compare the distributions of this correctness rate for the two cases R_t = 1 and R_t = 0, which are plotted in Figure 1.
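The statistic underlying Figure 1 can be computed with a few lines of Python; this is a sketch under an assumed data layout (lists of (question_id, response) pairs), not the authors' preprocessing code.

```python
from collections import defaultdict

def past_correctness_distributions(sequences):
    """For every step t >= 2 of each sequence of (question_id, response) pairs, compute the
    mean correctness of steps 1..t-1 and bucket it by the current response R_t (1 or 0)."""
    rates = defaultdict(list)          # {1: [...], 0: [...]}, the two histograms in Figure 1
    for seq in sequences:
        correct_so_far = 0
        for t, (_, r) in enumerate(seq):
            if t >= 1:                 # t is 0-indexed here, so this covers step 2 onwards
                rates[r].append(correct_so_far / t)
            correct_so_far += r
    return rates

# Toy example with two made-up students.
print(past_correctness_distributions([[(3, 1), (7, 0), (5, 1)], [(2, 0), (9, 0), (4, 1)]]))
```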