title: Adaptive Forgetting Curves for Spaced Repetition Language Learning
authors: Zaidi, Ahmed; Caines, Andrew; Moore, Russell; Buttery, Paula; Rice, Andrew
date: 2020-06-10
journal: Artificial Intelligence in Education
DOI: 10.1007/978-3-030-52240-7_65

The forgetting curve has been extensively explored by psychologists, educationalists and cognitive scientists alike. In the context of Intelligent Tutoring Systems, modelling the forgetting curve for each user and knowledge component (e.g. vocabulary word) should enable us to develop optimal revision strategies that counteract memory decay and ensure long-term retention. In this study we explore a variety of forgetting curve models incorporating psychological and linguistic features, and we use these models to predict the probability of word recall by learners of English as a second language. We evaluate the impact of the models and their features using data from an online vocabulary teaching platform, and find that word complexity is a highly informative feature which may be successfully learned by a neural network model.

Optimal human learning techniques have been extensively studied by researchers in psychology [4] and computer science [8, 16, 19, 20]. The impact of learning techniques can be measured by how they affect long-term retention of the learning materials. Measuring retention requires a model of the human forgetting curve, which plots the probability of recall over time. The first version of the forgetting curve was defined by Ebbinghaus [5], but it has since been developed further by many researchers who have incorporated additional psychologically grounded variations into the model [3, 9, 13, 14, 17]. The ideal forgetting curve should adapt to the learning materials as well as to user meta-features (including current ability).

In this study we examine the task of vocabulary learning. We investigate a range of linguistically motivated features, meta-features and a variety of models in order to predict the probability that a given learner will correctly recall a particular word. We use the Duolingo spaced repetition dataset [15] to train and evaluate our features and models. The dataset is filtered for English language learners, which results in approximately 4.28 million learner-word datapoints.

Our models are modifications of the half-life regression (HLR) model proposed by Settles and Meeder [16]. The half-life regression model is defined as

p = 2^(−Δ/h),

where p is the probability of recall, Δ is the time since the word was last seen (in days), and h is the half-life, or strength of the learner's memory. We denote the estimated half-life by ĥ_Θ, and it is defined as

ĥ_Θ = 2^(Θ·x),

where Θ is a vector of weights for the features x. The features of the model are lexeme tags, one tag for each word in the vocabulary (e.g. the lexeme tag for the word camera is camera.N.SG). The aim of these features is to capture the inherent difficulty of the word. The HLR model is trained using the following loss function:

ℓ(⟨p, Δ, x⟩; Θ) = (p − p̂_Θ)² + α(h − ĥ_Θ)² + λ‖Θ‖₂²,

where p and p̂_Θ are the true and the model-estimated probability of recall, respectively. In practice, it was found that optimising for both p and h in the loss function improved the model. The true value of h is obtained from the observed recall probability as h = −Δ/log₂(p).

We now expand the HLR model by adding further linguistic, psychological and meta-features to x. We refer to this model as HLR+.
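To make the relationship between Δ, the half-life and the HLR loss concrete, the following is a minimal sketch in Python/NumPy under the definitions above. Function names such as recall_probability and hlr_loss are our own illustrative choices rather than the authors' implementation, and the gradient-based optimisation step is omitted.

```python
import numpy as np

def recall_probability(delta_days, half_life):
    """p = 2^(-Δ/h): probability of recalling a word Δ days after it was last seen."""
    return 2.0 ** (-delta_days / half_life)

def estimated_half_life(theta, x):
    """ĥ_Θ = 2^(Θ·x): half-life predicted from the feature vector x and weights Θ."""
    return 2.0 ** np.dot(theta, x)

def true_half_life(p_observed, delta_days, eps=1e-6):
    """h = -Δ / log2(p), with p clipped away from 0 and 1 so that h stays finite."""
    p = np.clip(p_observed, eps, 1.0 - eps)
    return -delta_days / np.log2(p)

def hlr_loss(p_observed, delta_days, x, theta, alpha=0.01, lam=0.1):
    """Squared error on recall probability and half-life, plus L2 regularisation on Θ."""
    h = true_half_life(p_observed, delta_days)
    h_hat = estimated_half_life(theta, x)
    p_hat = recall_probability(delta_days, h_hat)
    return (p_observed - p_hat) ** 2 + alpha * (h - h_hat) ** 2 + lam * np.dot(theta, theta)
```

Minimising this loss over all learner-word datapoints (e.g. with stochastic gradient descent) recovers the HLR training procedure; HLR+ simply extends x with the additional features described next.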
The features include word complexity scores estimated by a pre-trained model [6], mean concreteness scores and 'percent known' scores based on human judgements [2], SUBTLEX word frequencies [18], and user ids. The motivation for including complexity as a feature is the intuition that the more complex a word is, the harder it is to remember. Concreteness is included based on previous work showing that concrete words are easier to remember than abstract words because they activate perceptual memory codes in addition to verbal codes [10]. SUBTLEX gives the relative frequency of an English word in a corpus of 201.3 million words: we hypothesise that more frequent words are more likely to be encountered and reinforced during the time since last seen, Δ. Similarly, we expect that 'percent known' (the proportion of survey respondents familiar with each word) will correlate with the probability of recall. Lastly, we include the user id to capture latent behavioural aspects of the learners.

In addition to adding new features, we describe a new model that modifies p so that it directly incorporates word complexity. Gooding et al. [6] derived word complexity to express perceived difficulty, and we hypothesise that it will correlate with the probability of recall: as the complexity of a word rises, its forgetting curve becomes steeper. The new model is therefore

p = 2^(−C·Δ/ĥ_Θ),

where C is the mean complexity of word i. We define the estimated half-life ĥ_Θ as 2^(Θ·x), where x is a vector composed of all of the features described in Sect. 2.2.

Motivated by the recent success of neural networks, we also describe the N-HLR+ model, which replaces ĥ_Θ = 2^(Θ·x) with a neural network:

ĥ_Θ = w₂ · σ(w₁x),

where the network contains a single hidden layer: x is the vector of input features, w₁ is the weight matrix between the inputs and the hidden layer, w₂ is the weight matrix between the hidden layer and the output, and σ is the hidden-layer activation (a code sketch of this network follows the results below). We use the same loss function as HLR, which optimises for both p and h.

We use the mean absolute error (MAE) of the probability of recall for a lexical item as our evaluation metric, which, despite some known problems [11], is in line with previous work [16]. MAE is defined as

MAE = (1/D) Σᵢ |pᵢ − p̂ᵢ|,

where D is the number of test datapoints.

We divided the Duolingo English data into 90% training and 10% test. We trained all non-neural models (e.g. HLR, HLR+, C-HLR+) using the following parameters, tuned on the first 500k datapoints: learning rate 0.001, α = 0.01, λ = 0.1. For all neural models (e.g. N-HLR+) we used: learning rate 0.001, 200 epochs, hidden dimension 4.

Table 1. Mean absolute error (MAE, lower is better) on the test set.

Model                  MAE↓
Pimsleur [12]          0.396
Leitner [7]            0.214
Logistic Regression    0.196
HLR [16]               0.195
HLR-lex [16]           0.130
HLR+                   0.129
C-HLR+                 0.109
N-HLR+                 0.105
CN-HLR+                0.105

We can see in Table 1 that HLR+ did not perform much better than HLR. By modifying the loss function to include complexity as a parameter in the C-HLR+ model, we considerably improved performance. This is in line with our hypothesis that more complex words are forgotten faster and that complexity is therefore an important feature in modelling the forgetting curve. The N-HLR+ model provided a further improvement over C-HLR+, because neural models are better at capturing nonlinear relationships between the features and the expected output. Furthermore, compared with N-HLR+, including complexity in the loss function (CN-HLR+) provides no clear improvement in performance; this is because the neural model already learns to place more importance on the complexity feature.
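To illustrate the N-HLR+ half-life estimator and the MAE metric described above, here is a minimal sketch consistent with the reported setup (hidden dimension 4). The class name NHLRPlus, the ReLU activation, and the clamping of the predicted half-life are illustrative assumptions rather than details taken from the authors' implementation.

```python
import numpy as np

class NHLRPlus:
    """Sketch of the N-HLR+ half-life estimator: a single-hidden-layer network
    mapping the feature vector x to an estimated half-life ĥ_Θ."""

    def __init__(self, n_features, hidden_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(hidden_dim, n_features))  # inputs -> hidden
        self.w2 = rng.normal(scale=0.1, size=(hidden_dim,))             # hidden -> output

    def half_life(self, x):
        hidden = np.maximum(0.0, self.w1 @ x)   # assumed ReLU hidden-layer activation
        return float(self.w2 @ hidden)

    def recall_probability(self, x, delta_days):
        h_hat = max(self.half_life(x), 1e-3)    # keep the half-life positive (assumption)
        return 2.0 ** (-delta_days / h_hat)

def mean_absolute_error(p_true, p_pred):
    """MAE = (1/D) * sum_i |p_i - p̂_i| over the D test datapoints."""
    return float(np.mean(np.abs(np.asarray(p_true) - np.asarray(p_pred))))
```

The weights w₁ and w₂ would be trained with the same loss as in the HLR sketch earlier, e.g. by stochastic gradient descent with the learning rate, epoch count and hidden dimension reported above.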
We confirm this by analysing the average weights in the hidden layer of the model. The model learns to give greater importance to word complexity, percent known and concreteness, respectively. It does not, however, learn much from the user id or from SUBTLEX. This is probably because a single dimension is not sufficient to capture user behaviour, and because SUBTLEX does not adequately represent learners' experience of English as a second language.

We present a new model for adaptively learning a forgetting curve for language learning, using a modified HLR loss function and a neural network. We incorporate linguistically and psychologically motivated features and show that word complexity is an important feature in predicting the probability of recall for a vocabulary item. Furthermore, we illustrate that neural networks can capture the importance of word complexity, whereas a simple HLR model fails to take advantage of that signal. This work lays the foundation for neural approaches to understanding language learning over time. Future work in this area includes incorporating high-dimensional user embeddings to capture user-specific signals that might influence the forgetting curve, as well as alternative models, such as the Pareto and power functions proposed in prior work [1].

References

[1] The form of the forgetting curve and the fate of memories
[2] Concreteness ratings for 40 thousand generally known English word lemmas
[3] DAS3H: modeling student learning and forgetting for optimally scheduling distributed practice of skills
[4] Improving students' learning with effective learning techniques: promising directions from cognitive and educational psychology
[5] Ueber das Gedächtnis
[6] Complex word identification as a sequence labelling task
[7] So lernt man lernen: angewandte Lernpsychologie-ein Weg zum Erfolg
[8] Skills embeddings: a neural approach to multicomponent representations of students and tasks
[9] Artificial intelligence to support human instruction
[10] Imagery and Verbal Processes
[11] Metrics for evaluation of student models
[12] A memory schedule. Modern Lang
[13] Accelerating human learning with deep reinforcement learning
[14] One hundred years of forgetting: a quantitative description of retention
[15] Replication data for: a trainable spaced repetition model for language learning
[16] A trainable spaced repetition model for language learning
[17] Enhancing human learning via spaced repetition optimization
[18] SUBTLEX-UK: a new and improved word frequency database for British English
[19] Accurate modelling of language learning tasks and students using representations of grammatical proficiency
[20] Curriculum Q-learning for visual vocabulary acquisition

Acknowledgements. This paper reports on research supported by Cambridge Assessment, University of Cambridge.