MUSE: Multi-Scale Temporal Features Evolution for Knowledge Tracing
Chengwei Zhang, Yangzhou Jiang, Wei Zhang, Chengyu Gu
arXiv:2102.00228v1 [cs.AI], 30 Jan 2021

Abstract

Transformer-based knowledge tracing is an extensively studied problem in the field of computer-aided education. By integrating temporal features into the encoder-decoder structure, transformers can process exercise information and student response information in a natural way. However, current state-of-the-art transformer-based variants still share two limitations. First, extremely long temporal feature sequences cannot be handled well, because the complexity of the self-attention mechanism is O(n^2). Second, existing approaches track knowledge drift under a fixed window size, without considering different temporal ranges. To overcome these problems, we propose MUSE, which is equipped with a multi-scale temporal sensor unit that takes both local and global temporal features into consideration. The proposed model is capable of capturing the dynamic changes in users' knowledge states at different temporal ranges, and it provides an efficient and powerful way to combine local and global features to make predictions. Our method won 5th place among 3,395 teams in the Riiid AIEd Challenge 2020.

Introduction

The recent COVID-19 pandemic has forced most countries to temporarily close schools, and offline education is in a tough place. Equity gaps in every country could grow wider, since student knowledge is hard to trace over time outside the classroom. With the fast evolution of data science, data scientists can help teachers realize personalized teaching by developing knowledge tracing models and relevant datasets (Choi et al. 2020b). Existing works mainly focus on diagnosing the knowledge proficiency of students and predicting whether they can answer a given exercise correctly. To capture complex relations among exercises and responses over time, the attention mechanism has been widely explored by various knowledge tracing models such as (Pandey and Karypis 2019; Choi et al. 2020a; Shin et al. 2020). However, most previous works focus on a fixed timescale, without paying attention to multiple scales.

Multi-scale structures are widely used in computer vision (CV), NLP, and signal processing. They help a model capture patterns at different scales and extract robust features. For example, (Guo et al. 2020) introduce a multi-scale structure into the self-attention framework and propose a multi-scale multi-head self-attention network. Our work is inspired by their success, and we propose a more powerful network called MUSE to capture multi-scale temporal-range features. Our model is composed of two separate modules, MUSE-Local and MUSE-Global, which are designed to capture changes in knowledge over short and long temporal ranges, respectively. MUSE-Local is a transformer-based variant built on SAINT+ (Shin et al. 2020). More robust structures, such as the attentional aggregator (Li, Yang, and Zhang 2020) and attentional pooling (Zhou et al. 2018), are introduced to fully capture short temporal-range dynamic changes under a fixed window size. MUSE-Global is a Recurrent Neural Network (RNN) based model. We take full advantage of the fact that an RNN can be unrolled to arbitrary length to capture extremely long temporal-range global features. These two modules are complementary to each other. Our experiments demonstrate the effectiveness of these ideas, and our model won 5th place among 3,395 teams in the Riiid AIEd Challenge 2020.
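The following is a minimal PyTorch sketch of this two-branch design, under stated assumptions: the module name, layer sizes, and the joint forward pass are all illustrative, and in the actual system MUSE-Local and MUSE-Global are separate models whose predictions are blended (see Training Details), not a single joint module.

```python
import torch
import torch.nn as nn

class MuseSketch(nn.Module):
    """Illustrative two-branch model: local windowed self-attention plus
    a global RNN. Not the authors' exact architecture."""
    def __init__(self, d_model=128, window=200):
        super().__init__()
        self.window = window
        # Local branch: self-attention over only the last `window` steps,
        # keeping the O(n^2) attention cost bounded.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.local = nn.TransformerEncoder(layer, num_layers=3)
        # Global branch: an RNN can be unrolled over the full history.
        self.global_rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(2 * d_model, 1)

    def forward(self, x):  # x: (batch, seq_len, d_model) interaction features
        local = self.local(x[:, -self.window:, :])[:, -1]  # short-range state
        global_out, _ = self.global_rnn(x)                 # full-history state
        fused = torch.cat([local, global_out[:, -1]], dim=-1)
        return torch.sigmoid(self.head(fused))  # P(correct) at current step
```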
Problem Definition

Given the history of how a student responded to a set of exercises, MUSE follows SAINT (Choi et al. 2020a) and SAINT+ (Shin et al. 2020) in predicting the probability that the student will answer a particular new exercise correctly. In SAINT (Choi et al. 2020a), student activity is recorded as a sequence I_1, ..., I_n of interactions I_i = (E_i, R_i), where E_i represents the i-th exercise given to the student together with related metadata, such as the type of the exercise. The response information R_i denotes the student's i-th response to E_i, with related metadata such as the duration of time the student took to respond. In this competition, additional metadata such as lectures is provided as well; we denote it as L_i in this report, so the whole interaction can be represented as X_i = (I_i, L_i). The goal of MUSE is to predict the probability

P(r_k = 1 | I_1, I_2, ..., E_k)

that the student answers the k-th exercise correctly.

Input Features

In this section, we describe the input features of our model, as well as the feature engineering behind each feature. All features are embedded by either the categorical or the continuous embeddings proposed by (Shin et al. 2020); categorical embeddings are applied by default if not specified.

Exercise embeddings. The exercise embedding, which contributes to E_i, includes the following features:

• Content and bundle id: the content id and bundle id are embedded into a common latent space via a shared embedding matrix.
• Part, tag, content answer: the part, tag, and content answer, which carry different information about an exercise, are each embedded separately into a latent vector.
• Position: the position embedding of a content in the input sequence. Note that the position embeddings are NOT shared across the exercise sequence and the response sequence in our model.

Response embeddings. The user embedding, which contributes to R_i, includes the following features:

• Task container id: embedded using a categorical embedding.
• Response: the user's historical response sequence of 0/1 values is embedded into a latent vector.
• Prior question elapsed time: the elapsed time is the amount of time a student spent solving a given exercise. This feature is embedded directly using a continuous embedding.
• Lag time: SAINT+ (Shin et al. 2020) defines lag time as the time gap between the end of the previous response and the beginning of the current exercise. Here, we instead define it as the time gap between the beginnings of two exercises. We set the maximum lag time to 300 seconds; any larger value is cut off at 300 seconds. A continuous embedding is used for this feature.
• Prior question had explanation: a 0/1 value, using a categorical embedding.
• Prior question had been attempted: whether the user has attempted the given exercise before. To improve diversity, MUSE-Local represents this state as a 0/1 value, while MUSE-Global uses 1 − exp(−x), where x is the user's cumulative attempt count for the given exercise.
• Position: as described under the exercise embedding.
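As a concrete illustration of the embedding scheme above, here is a minimal sketch: categorical ids go through embedding tables, continuous values such as elapsed time and clipped lag time go through a learned linear map (our stand-in for the continuous embedding of (Shin et al. 2020)), and everything is concatenated and projected to the common d_model space. The cardinalities and per-feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class InteractionEmbedding(nn.Module):
    """Sketch: one embedding table per categorical feature, a linear map
    for the continuous features, then concatenation (no LayerNorm/Dropout,
    per the text) and a linear projection to d_model."""
    def __init__(self, cardinalities, n_continuous, d_feat=32, d_model=128):
        super().__init__()
        self.cat_embs = nn.ModuleList(
            nn.Embedding(card, d_feat) for card in cardinalities)
        self.cont_proj = nn.Linear(n_continuous, d_feat)
        self.to_model = nn.Linear(d_feat * (len(cardinalities) + 1), d_model)

    def forward(self, cat_feats, cont_feats):
        # cat_feats: (batch, seq, n_cat) integer ids
        # cont_feats: (batch, seq, n_continuous), e.g. lag time already
        # clipped to 300 s before being passed in
        parts = [emb(cat_feats[..., i]) for i, emb in enumerate(self.cat_embs)]
        parts.append(self.cont_proj(cont_feats))
        return self.to_model(torch.cat(parts, dim=-1))
```

For instance, the response features above would contribute a cardinality of 2 for the 0/1 response and explanation flags, and two continuous slots for elapsed time and lag time.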
Lecture embeddings. The lecture embedding, which contributes to L_i, includes the following features:

• Last part: whether each part of the lectures has been recorded; this takes 7 values.
• Last type: whether each type of the lectures has been recorded; this takes 4 values.
• Last tag: whether each tag of the lectures has been recorded. Since the lectures share 188 tags, we select only the most frequently used tags (the top 14) for the embedding space and map all other tags to one specific value, so the last tag takes 15 values.

Note that all the embedded features are concatenated directly, without LayerNorm or Dropout, and then linearly transformed into a predefined common space (of dimension d_model).

Global features. We also add several global features without any embedding layer:

• Exercise hotness: the total count of each content id, capped at 22,000.
• Exercise hardness: the accuracy of each content id.
• Exercise part hardness: the accuracy of each part.
• User's cumulative response ratio: tracks the user's preference for each answer; for example, many users choose C when they do not know the answer.
• User's cumulative correct rate: tracks the accuracy with which the user has answered exercises.
• User's cumulative lecture-watching count: tracks how many lectures the user has watched.

Model Architecture

In this section, we describe our model architecture in detail. The overall MUSE-Local architecture is shown in Figure 1.

Self-Attentive Encoder-Decoder. Following SAINT+ (Shin et al. 2020), we directly adopt the self-attentive encoder-decoder to encode the user embeddings and exercise embeddings. The differences are summarized as follows:

I. We use more sequence features and combine them by concatenation and a linear transform instead of adding them together directly.
II. We add another self-attentive encoder layer to encode the lecture embeddings separately.
III. The position embeddings of the above encoders and decoders are all unshared.

Attentional Aggregator. To capture the dynamic changes in user interactions, we introduce the attentional aggregator proposed by (Li, Yang, and Zhang 2020):

x̃_i = Σ_{j=1}^{w} α_j · x_{i+j−1},

where w is the window size of the aggregator and the α_j are attention parameters learned during training. In this competition, we use two aggregators as our default setting, with w = 3 and stride = 1 to keep the sequence length unchanged. We did not observe further improvements from enlarging the number of layers or the size of w.

Attention Pooling Layer. When making the final decision, we only make a prediction for the current step, so we need to transform the 2D feature representation into 1D. Global Average Pooling (GAP) or Global Max Pooling (GMP) could do this, but the result is not satisfactory, because the most query-related interaction is overwhelmed by the historical interactions. Instead, we use the attention pooling layer proposed by (Zhou et al. 2018), formulated as:

ζ_Sequence(Query) = Σ_{j=1}^{l} a(S_j, Query) · S_j,

where {S_1, S_2, ..., S_l} is the list of embedding vectors in the sequence, Query is the embedding vector of the content id, and l is the sequence length. In this way, ζ_Sequence(Query) varies with the content id. Here a(·) is a feed-forward network whose output is the activation weight.

Finally, to make the prediction, we concatenate all the pooled features, the global features, and the GRU output, and transform them into the decision space with 3 fully connected layers.
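A sketch of the attention pooling layer above, in the style of (Zhou et al. 2018): a small feed-forward network a(·) scores each history vector S_j against the content-id query, and the sequence collapses to one vector as the weighted sum Σ_j a(S_j, Query) · S_j. Feeding a(·) the pair together with its elementwise product, and the hidden size, are our assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """DIN-style attention pooling: score each history vector against the
    query, then take the weighted sum over the sequence (2D -> 1D)."""
    def __init__(self, d_model=128, d_hidden=64):
        super().__init__()
        # a(S_j, Query): a feed-forward net over the pair and its product.
        self.score = nn.Sequential(
            nn.Linear(3 * d_model, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1))

    def forward(self, seq, query):
        # seq: (batch, l, d_model) history vectors S_1..S_l
        # query: (batch, d_model) content-id embedding
        q = query.unsqueeze(1).expand_as(seq)
        w = self.score(torch.cat([seq, q, seq * q], dim=-1))  # (batch, l, 1)
        return (w * seq).sum(dim=1)                           # (batch, d_model)
```

Note that, as in (Zhou et al. 2018), the weights are not softmax-normalized, so the pooled vector can reflect the overall intensity of query-related interactions rather than only their relative proportions.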
Training Details

To capture the local sequence features, we use the following training techniques:

• Normal training. We use a transformer with d_model = 128 and N = 3, M = 2 layers. All models are trained from scratch. The window size, dropout rate, and batch size are set to 200, 0.0, and 2048, respectively. We use AdamW (Loshchilov and Hutter 2019) with lr = 0.001, β_1 = 0.9, β_2 = 0.999, and weight decay 0.001. We schedule the learning rate with the so-called Noam scheme as in (Vaswani et al. 2017), with about 8,000 warmup steps (see the sketch after this list).

• Random answer masking. Inspired by (Devlin et al. 2019), MUSE randomly masks some of the user's responses in the input sequence, and the objective is to predict the original value of each masked response based only on its context. We refer to this technique as RAM in this report; the RAM ratio is 25% by default.

• Adversarial training. Adversarial training, originally proposed to defend against adversarial examples and enhance the security of machine learning systems (Goodfellow, Shlens, and Szegedy 2015), has recently been shown to be effective at improving the generalization of language models (Zhu et al. 2020). Here, we also introduce adversarial training to improve the generalization of knowledge tracing models. However, given the huge computational cost of adversarial training (it takes 3-30 times longer to train a robust network (Shafahi et al. 2019)) and the competition deadline, we only train MUSE-Local for another 10k steps using the techniques introduced by (Zhu et al. 2020), which achieves an AUC improvement of less than 0.001. We believe the performance would be better if the models were trained longer.
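For reference, the Noam scheme of (Vaswani et al. 2017) sets the learning-rate factor to d_model^{-0.5} · min(step^{-0.5}, step · warmup^{-1.5}), rising linearly over the warmup steps and then decaying as 1/√step. A minimal sketch follows; how this factor is combined with the base learning rate of 0.001 is not specified in the text, so the wiring below is one plausible setup.

```python
import torch

def noam_lambda(d_model=128, warmup=8000):
    """Noam factor from 'Attention Is All You Need':
    d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    def factor(step):
        step = max(step, 1)  # avoid 0^-0.5 on the first call
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
    return factor

# Hypothetical wiring with AdamW: base lr is set to 1.0 so the lambda
# alone supplies the scale; call sched.step() once per training step.
model = torch.nn.Linear(128, 1)  # stand-in for MUSE-Local
opt = torch.optim.AdamW(model.parameters(), lr=1.0,
                        betas=(0.9, 0.999), weight_decay=0.001)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam_lambda())
```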
Multi-Scale Model Fusion

To better fuse the local and global models, we train MUSE-Local and MUSE-Global on 90M rows of the training data and simply blend their results on the remaining training data via 5-fold cross-validation with LightGBM (Ke et al. 2017), using default parameters and early stopping at 100 rounds, without any further post-processing.

Experiments

Effect of adversarial training. We have already discussed the role of adversarial training in Training Details. Here we firmly believe that performance would be better if the models were trained longer.

Effect of multi-scale model fusion. Our final models are all trained on 90M rows of data, and we did not find further improvements from adding more training data. We therefore use the remaining nearly 10M rows to blend all the MUSE-Local and MUSE-Global models, which achieves about a 0.003 private-LB improvement and demonstrates the effectiveness of multi-scale model fusion. We can also infer that transformer-based variants are limited by window size and have drawbacks in dealing with extremely long dependencies.

Conclusion

In this report, we propose a powerful modified transformer-based model called MUSE for knowledge tracing, which automatically aggregates multi-scale temporal features. Experiments show that our solution performs better than a single-transformer method. With this method, we won 5th place in the Riiid AIEd Challenge 2020.

References

Choi, Y., et al. 2020a. Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing. CoRR abs/2002.07033.
Choi, Y., et al. 2020b. EdNet: A Large-Scale Hierarchical Dataset in Education.
Devlin, J., et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples.
Guo, Q., et al. 2020. Multi-Scale Self-Attention for Text Classification.
Ke, G., et al. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
Li; Yang; and Zhang. 2020. MRIF: Multi-Resolution Interest Fusion for Recommendation.
Loshchilov, I., and Hutter, F. 2019. Decoupled Weight Decay Regularization.
Pandey, S., and Karypis, G. 2019. A Self-Attentive Model for Knowledge Tracing.
Shafahi, A., et al. 2019. Adversarial Training for Free! In NeurIPS.
Shin, D., et al. 2020. SAINT+: Integrating Temporal Features for EdNet Correctness Prediction.
Vaswani, A., et al. 2017. Attention Is All You Need.
Zhou, G., et al. 2018. Deep Interest Network for Click-Through Rate Prediction.
Zhou, G., et al. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction.
Zhu, C., et al. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding.