Time-to-Event Prediction with Neural Networks Håvard Kvamme Time-to-Event Prediction with Neural Networks Thesis submitted for the degree of Philosophiae Doctor Department of Mathematics Faculty of Mathematics and Natural Sciences 2019 © Håvard Kvamme, 2019 Series of dissertations submitted to the Faculty of Mathematics and Natural Sciences, University of Oslo No. 2346 ISSN 1501-7710 All rights reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission. Cover: Hanne Baadsgaard Utigard. Print production: Reprosentralen, University of Oslo. Preface This thesis is submitted in partial fulfillment of the requirements for the degree of Philosophiae Doctor at the University of Oslo. The thesis is a collection of four papers with the common theme of time-to-event prediction with neural networks. Their chronological order is representative of my understanding of the field. Although I have a background in statistics from the Norwegian University of Science and Technology, I have always found the empirical mindset of machine learning to be closer to my way of thinking. It has, therefore, been interesting to work at the intersection of statistical survival analysis and a part of machine learning that originated in computer science. I am deeply grateful to my three supervisors Ørnulf Borgan, Ida Scheel, and Kjersti Aas, as they allowed me the freedom to explore my ideas, but managed to always be available with knowledge, assistance, and encouragement. In particular, I want to thank my main supervisor and main collaborator Ørnulf. His incredible knowledge of survival analysis has been invaluable for our work, and his rigor and attention to detail has caught more of my mistakes than I care to admit. I want to thank Ida for always having time for my questions, no matter how little time she had available. I want to thank Kjersti for having multiple interesting projects available for me when I started in 2016, and for getting us through the revisions and publishing process of our first paper. I am also grateful to my other co-authors Nikolai Sellereite and Steffen Sjursen. Additionally, I want to thank Nikolai for showing me the KKBox churn prediction data set which ended up being an important part of two other papers he was not a part of. During my time as a research fellow, I got to visit Trevor Hastie at Stanford University for six months, and I am very grateful to him for including me in his research group. His students, in addition to the rest of the Stanford statistics students and staff, made my stay very enjoyable. I am grateful to BigInsight for providing me with necessary funding for the stay, in addition to the funding for all other research-related trips. I want to thank all my colleagues at the Section of Statistics and Data Science at UiO and at the Norwegian Computing Center, in addition to Gudmund Horn Hermansen and Erlend Aune for inspiring conversations and conference trips. Finally, I want to thank my friends and family for making my time outside the university so great. In particular, I am grateful to Amalie for her patience, understanding, and for never questioning the time I spent on my research, even when it came at the expense of our time together. Håvard Kvamme Oslo, December 2019 i List of Papers Paper I Håvard Kvamme, Nikolai Sellereite, Kjersti Aas, and Steffen Sjursen. Predicting Mortgage Default Using Convolutional Neural Networks. Expert Systems with Applications, 102: 207–217, 2018. 
Paper II Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-Event Prediction with Neural Networks and Cox Regression. Journal of Machine Learning Research, 20(129): 1–30, 2019.

Paper III Håvard Kvamme and Ørnulf Borgan. Continuous and Discrete-Time Survival Prediction with Neural Networks. Submitted for publication.

Paper IV Håvard Kvamme and Ørnulf Borgan. The Brier Score under Administrative Censoring: Problems and Solutions. Submitted for publication.

Contents
Preface
List of Papers
Contents
1 Introduction
1.1 Neural Networks and Statistical Models
2 Neural Networks
2.1 A Brief Historical Perspective
2.2 The Multilayer Perceptron
2.3 Training Networks
2.4 Overfitting and Regularization
2.5 Convolutional Networks
3 Time-to-Event Prediction
3.1 Event-Time Modeling
3.2 Regression Models
3.3 Evaluation of Survival Estimates
4 Summary of Papers
4.1 Paper I
4.2 Paper II
4.3 Paper III
4.4 Paper IV
5 Discussion
5.1 The Survival Methods
5.2 Can We Trust Individual Predictions?
5.3 Extensions of the Methods
5.4 Evaluation of Survival Estimates
5.5 Time-to-Event Prediction and Machine Learning
Bibliography
Papers

Chapter 1
Introduction

As the title of this thesis suggests, we investigate how machine learning, and in particular neural networks, can be applied to time-to-event prediction. The topic of time-to-event prediction is a subfield of survival analysis predominantly concerned with when in the future an event will occur. So, contrary to the rest of survival analysis research, we make little effort to understand why the event occurred. The event in question can be a mortgage default, as in Paper I; customers stopping their subscription to a service, as in Paper II; the failure time of a mechanical system; an actual death from a disease; or any other suitable event. In the context of this thesis, the term neural networks refers to a set of machine learning algorithms, also known as artificial neural networks, connectionist systems, or deep learning, and is not to be confused with the eponymous biological networks.

So, when are we willing to trade some of the understanding of the event-time process for improved predictive performance? Often, we are interested in both. Take customer relationship management as an example. One would, of course, like to know why customers are churning (leaving the service). But to retain current customers, a viable strategy can be to simply identify customers that are likely to churn and provide them with some discount or offer. Even if we are primarily interested in understanding the effect of the variables, valuable insights might be obtained by studying methods purely focused on prediction.
If such a “prediction-focused” model has much better predictive performance than our “understandable” model, the latter is clearly not using the available information to its full potential and might require some modifications. This way, a “prediction-focused” model can be used to benchmark our “understandable” model.

The thesis essentially consists of three parts: the first is mainly about finding a good neural network model for a specific problem, the second part explores how methods from survival analysis can be combined with neural networks for improved predictive performance, and the third part is concerned with the evaluation of such predictions. The three topics represent the chronological progress of my research, as each of them was motivated by experiences encountered in the previous topic. In the first project, we created a neural network model for predicting mortgage defaults for Norway’s largest commercial bank. We later understood that such time-to-event problems could benefit from combining methodology from survival analysis and machine learning. Then, while developing such methodology, we found that some of the evaluation criteria did not behave as expected and could, possibly, give misleading results. So, naturally, we continued by investigating the performance metrics.

To make one’s research accessible to others, it is important to provide the necessary details in a usable format. Code has, therefore, been an essential addition to the papers. This has resulted in a Python package that contains implementations of the proposed methods, data sets, simulations, and evaluation metrics in Papers II, III, and IV. The package is built on the popular deep learning framework PyTorch (Paszke et al., 2017) and is available at github.com/havakv/pycox.

The thesis is structured as follows: In Chapter 2, I will give an introduction to neural networks in the context of our papers, as all the predictive methodology is built on neural networks. This chapter is most relevant for Paper I, as it is concerned with finding a network model that is suitable for time-series modeling. The three remaining papers, on the other hand, consider the networks as general function approximators that minimize some objective function and are more concerned with the survival methodology. In Chapter 3, I will give a brief introduction to the field of survival analysis while focusing on predictive modeling. I will introduce the statistical models that are the foundations of Papers II and III, in addition to similar methods in the literature. In short, most of the approaches are classical statistical models where a linear predictor function is replaced by a neural network. This chapter also addresses some of the common criteria for evaluating the predictive performance of time-to-event models. Here I will briefly discuss some of the problems that led us to the investigations in Paper IV. A summary of the four papers is given in Chapter 4. In Chapter 5, I will discuss some natural extensions of the proposed methods. Also, I will use the opportunity to address some shortcomings of the papers and proposed methods.

1.1 Neural Networks and Statistical Models
A substantial part of this thesis is about improving the predictive performance of classical statistical models with methodology from machine learning. To illustrate the general approach, I will start by explaining how the Logistic Regression model for binary classification can be extended with neural networks.
This should also provide the reader with the necessary terminology for understanding Chapter 2 on neural networks.

Consider an individual i with covariates xi ∈ Rp and response yi ∈ {0, 1}. We assume that yi is an observation of the random variable Yi ∼ Bernoulli(πi), where the parameter πi depends on the covariates, πi = P(Yi = 1 | xi). Logistic Regression assumes that the relationship between xi and πi is given by the logistic function

πi = σ(xi) = 1 / (1 + exp[−φ(xi)]),   (1.1)

where φ(xi), typically, is the linear predictor function φ(xi) = wT xi with parameters w ∈ Rp. For a data set of n individuals, we can estimate the parameter values w by minimizing the negative log-likelihood of Bernoulli data,

loss = −∑_{i=1}^{n} ( yi log[σ(xi)] + (1 − yi) log[1 − σ(xi)] ),   (1.2)

with respect to the parameters w. For a new individual j, π̂j = σ̂(xj) is the estimated probability of Yj = 1, and π̂j can be used to make a prediction of the response yj.

The terminology in machine learning differs slightly from that of classical statistics. The response yi is often called a label or target, and the parameters w are called weights. The logistic function in (1.1) is referred to as the sigmoid function, and we rarely address its inverse, the link function that is so common in statistics. The data set is typically split into multiple subsets that serve different purposes, and the subset used to minimize the loss function (1.2) is called the training set. A model generalizes well if it makes accurate predictions, and we often refer to this topic as predictive performance. Consequently, we are typically not interested in minimizing the loss (1.2) with respect to the training set; instead, we want the loss to be small for data we have not yet seen. This can be approached by considering the loss of a held-out subset, called a validation set, which essentially serves the same purpose as in cross-validation. Finally, we use a held-out test set to quantify the predictive performance of the model.

If we now refer to the linear predictor φ(xi) = wT xi as a neural network, we have left the field of classical statistics and are instead doing machine learning. This is because a neural network is simply a parametric model such as φ(xi), though we are typically interested in more complex functions than wT xi. A neural network consists of a set of transformations of the covariates that are applied in sequence. Consider three such transformations denoted h1, h2, and h3. A network can then be defined as φ(xi) = h3(h2(h1(xi))). The transformations are typically referred to as layers, and each layer’s outputs are the values obtained from the transformation. We refer to the number of layers as the depth of the network, and the term deep learning refers to deep networks. There is, however, no formal depth requirement for a network to be considered deep. The structure of the network is often called the network architecture and refers to the general design of the network.

Returning to the Logistic Regression, but with neural network terminology, we find that the linear predictor φ(xi) is a single-layer network (depth of one), consisting of a single dense, or fully-connected, layer. If we include the sigmoid function (1.1) as a part of the network, we can call σ(xi) a two-layer network where φ(xi) is a hidden layer. It is, however, more common to refer to transformations without parameters, such as the sigmoid, as activation functions, and we typically consider them part of a parametric layer. One can also consider the loss function (1.2) a part of the network, as it is conceptually no different from the other layers. This interpretation is mostly useful for illustrative purposes.

The most standard neural network structure is the multilayer perceptron (MLP). It consists of multiple dense layers of the form h(z) = g(Wz), where g is an activation function, W is a matrix of parameters or weights, and z represents the output of the previous layer. For layers h1, h2, and h3, with parameters W1 ∈ Rq×p, W2 ∈ Rr×q, w3 ∈ Rr, and sigmoid activation functions (applied element-wise), we have the network

φ(xi) = w3ᵀ σ(W2 σ[W1 xi]).   (1.3)

If we replace the linear predictor in the Logistic Regression with this network, we will have a much more flexible model. We do, however, also have many more parameters. To train the model, we apply some version of the gradient descent algorithm. The gradients are obtained with the back-propagation algorithm, which is, essentially, an application of the chain rule of differentiation. It should now be clear that neural networks can be applied to many statistical models simply by replacing the linear predictor wT x of the classical statistical model with a network such as in (1.3). The network architecture of modern MLP’s differs slightly from the example in (1.3). This is one of the topics of Chapter 2, together with gradient descent, back-propagation, other layer types, and more.
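As a concrete illustration of this construction, the following is a minimal sketch in PyTorch of a Logistic Regression model where the linear predictor has been replaced by a small MLP as in (1.3). The dimensions, data, and layer sizes are arbitrary placeholders, and the sketch is not code from the papers.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: p covariates and two hidden layers, as in (1.3).
p, q, r = 10, 32, 32

# phi(x): the linear predictor replaced by an MLP.
phi = nn.Sequential(
    nn.Linear(p, q), nn.Sigmoid(),   # h1 with sigmoid activation
    nn.Linear(q, r), nn.Sigmoid(),   # h2 with sigmoid activation
    nn.Linear(r, 1),                 # output layer: a single linear unit
)

# The Bernoulli negative log-likelihood (1.2); the sigmoid in (1.1) is
# folded into the loss for numerical stability.
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(64, p)                        # a batch of 64 individuals
y = torch.randint(0, 2, (64, 1)).float()      # binary responses
loss = loss_fn(phi(x), y)
loss.backward()                               # gradients via back-propagation
```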
Chapter 2
Neural Networks

The field of neural networks is a branch of machine learning concerned with gradient-based optimization of models composed of sequences of transformations. It is commonly referred to as deep learning, artificial neural networks, or even artificial intelligence. One might argue that these terms actually represent different research areas or subcategories, a distinction that will be ignored in this thesis, as I will simply use the term neural networks.

Recently, neural networks have received much attention, both from academia and industry. This has led to impressive results in areas such as image recognition (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2015a), image segmentation (Long et al., 2015; He et al., 2017), language modeling and language translation (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017; Radford et al., 2018, 2019), and image and audio generation (Goodfellow et al., 2014; Oord et al., 2016; Brock et al., 2018; Razavi et al., 2019). In part, the successes of neural networks are owed to the modularity of the back-propagation algorithm for automatic gradient computation (Kelley, 1960; Rumelhart et al., 1986). While back-propagation is essentially the chain rule of differentiation, in practice it allows for modular implementations of the desired transformations. This makes the networks very applicable, as it is easy to extend or change existing network structures.

In the following, I will first give a brief overview of the development of neural networks, starting from the late 1950s. I will then present multilayer perceptrons (MLP’s), which are the most basic networks, and I will cover how they are trained and regularized. At the end of the chapter, I will describe convolutional neural networks, which are an important part of Paper I.
2.1 A Brief Historical Perspective
The development of neural networks dates back to the perceptron by Rosenblatt (1958) and ADALINE by Widrow and Hoff (1960), both of which are binary classifiers with linear decision boundaries. Both methods were heavily criticized, especially by Minsky and Papert (1969), for their limited ability to express richer functions, and received limited academic attention. More than a decade later, Rumelhart et al. (1985, 1986) reestablished interest in neural networks by presenting the benefits of applying multiple perceptrons sequentially with non-linear functions in between them. This directly addressed the concerns raised by Minsky and Papert (1969). Furthermore, Hornik et al. (1989) established the notion that these multilayer perceptrons (MLP’s) are universal approximators in the sense that they can “approximate any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.”

At the same time, LeCun (1989) introduced convolutional neural networks, or CNN’s, which are essentially a sparse version of the MLP with some parameters in the network forced to be identical (called weight sharing). This idea was tailored to the grid structure of images represented by pixels and was found to be quite successful in recognizing handwritten digits for the US Postal Service (LeCun et al., 1989). The networks were, however, very computationally expensive and somewhat impractical to train. This period also includes the development of the generative models: Boltzmann machines (Ackley et al., 1985), Belief networks (Neal, 1992), and the Helmholtz machine (Dayan et al., 1995), in addition to Recurrent neural networks (Rumelhart et al., 1986) and Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997) for processing sequential data. At the time, the methods were quite computationally expensive, and it was problematic to train deeper networks. As a result, other methods such as Random Forests and Support-Vector Machines received more attention.

Renewed interest in the field came with the breakthrough of Hinton et al. (2006), who were able to efficiently train deep Belief networks by greedy layer-wise pre-training. Shortly after, Bengio et al. (2007) and Poultney et al. (2007) showed that other deep network structures could be trained with the same strategy. Furthermore, they were able to show that both unsupervised pre-training and deeper networks improved generalization. With this new focus on depth, neural networks were rebranded as “deep learning” (Allen, 2015; Goodfellow et al., 2016), which marks the beginning of modern neural network research.

The following period was marked by three important discoveries. The first was that graphics processing units (GPUs) could massively speed up the training process of the networks (Mohamed et al., 2009; Raina et al., 2009), which enabled researchers to explore new networks faster and allowed for more parameters in the networks. The second discovery was that better non-linear activation functions and better parameter initialization made the optimization problem much nicer (Jarrett et al., 2009; Glorot and Bengio, 2010; Nair and Hinton, 2010; Glorot et al., 2011). This had a large impact on both training time and generalization error. The third discovery was the importance of large labeled data sets.
While this might seem obvious, researchers had, to some extent, been focusing on marginal improvements on smaller data sets. Faster computers and much “nicer” networks enabled models to be fitted to much larger data sets. The most prominent example of the importance of large data sets is likely the 2012 computer vision contest ILSVRC (Russakovsky et al., 2015). Here, the convolutional network by Krizhevsky et al. (2012) achieved a top-5 error rate of 15.3% (compared to 26.1% for the second-place entry). Their network architecture was actually quite similar to the architectures proposed a decade earlier by LeCun et al. (1998), but deeper and with the advancements discussed above. Although there had been increasing interest in neural network research since 2006, the entry by Krizhevsky et al. (2012) was, in fact, the only neural network entry to the competition that year. Their success, however, marks a turning point in the popularity of neural networks, and the competition has since only been won by neural network entries. In the years after 2012, there have been numerous success stories of neural networks for a variety of applications, some of which were mentioned in the introduction to this chapter. For a more detailed historical overview, see, e.g., the blog series by Andrey Kurenkov (andreykurenkov.com/writing/ai/a-brief-history-of-neural-nets-and-deep-learning), or the book by Goodfellow et al. (2016).

2.2 The Multilayer Perceptron
Multilayer perceptrons, or MLP’s, are the simplest and, arguably, the most fundamental network structures. MLP’s were briefly covered in the context of Logistic Regression in Section 1.1, but we will here provide a more general overview of the topic. To fit a standard MLP, we need input covariates x ∈ Rp, targets y ∈ Rm, and a differentiable loss function loss(φ(x), y), where φ(x) denotes the output of the MLP. The loss function can, for instance, be the mean squared error, or the negative log-likelihood of Bernoulli data (binary cross-entropy).

An MLP consists of one or more dense, or fully-connected, layers of the form h(x) = Wx + b, where W ∈ Rq×p and b ∈ Rq are learnable parameters. It is common to refer to the parameters W as weights and b as biases. The biases b are sometimes excluded from the model, as we did in Section 1.1. For φ(x) to be a non-linear function of x, we need to add non-linear transformations to the network. These transformations are called activation functions and are commonly applied to the output of a dense layer. In fact, activation functions are so common that we typically consider them part of the layer, meaning we have the layer h(x) = g(Wx + b). The most commonly used activation function is the rectified linear unit, or ReLU, defined as g(z) = max{0, z}. There are, however, many proposed alternatives to this function, such as the ELU (Clevert et al., 2015), PReLU (He et al., 2015b), and SELU (Klambauer et al., 2017). Up until the late 2000s, it was more common to use the sigmoid function g(z) = 1/(1 + exp[−z]) or g(z) = tanh(z), as these more closely mimic the on/off behavior of a biological neural network. Both were, however, found to work rather poorly, as the resulting gradients become very small when z is not close to 0. Nevertheless, they are still useful in other settings; the sigmoid is, for instance, commonly used to scale the final output of a network to be in [0, 1] for classification tasks.

This brings us to the final layer of the MLP, the output layer. This is often a dense layer, but with an output activation that is suited to the loss function.
For instance, for binary classification we can use the sigmoid activation, for multi-class classification we can use the softmax, gj(z) = exp(zj) / ∑_{k=1}^{m} exp(zk) for j = 1, . . . , m, and for regression one can simply use the identity g(z) = z. Conventions here vary, but sometimes the output activation is implicitly included in the loss function to ensure numerical stability. This is, however, most relevant for the implementation and usage of a method, and often less relevant for descriptive purposes.

Figure 2.1: Simple MLP illustration with three covariates (x1, x2, x3), a hidden layer with five units (z1, . . . , z5), and a single output φ(x), where z = h1(x) and φ(x) = h2(z).

Figure 2.1 shows the typical illustration of a one-hidden-layer MLP. Although the figure does not provide much additional insight compared to simply writing the model as φ(x) = h2(h1(x)), it illustrates that it is common to think of a neural network as a computational graph. In particular, frameworks for working with neural networks, such as Theano, Tensorflow, and PyTorch, are typically built on this graph representation. The nodes represent values and the arrows represent the matrix multiplication with the parameters. As the activation functions are considered part of the layers, the hidden layer is actually h1(x) = g1(Wx + b), and the output layer is h2(z) = g2(wT z). It can be shown that with a sufficiently “wide” hidden layer h(x), an MLP can approximate any function (Hornik et al., 1989). In practice, however, it is typically better to stack multiple hidden layers sequentially, giving networks such as φ(x) = h3(h2(h1(x))).

2.3 Training Networks
The parameter values of a neural network φ(x) are generally found by minimizing the loss function with some version of gradient descent. In the example of the Logistic Regression in Section 1.1, the loss function was the negative log-likelihood of Bernoulli data, and we will generally consider negative log-likelihoods as loss functions in this thesis. Consider the network φ(x) with parameters denoted by w for simplicity, i.e., w contains all the parameters of all the layers. For a training set of n individuals, each with covariates xi, we want to obtain the minimizers

w∗ = argmin_w ∑_{i=1}^{n} loss(φ(xi), yi),   (2.1)

which, for a negative log-likelihood loss, corresponds to the maximum likelihood estimates. In practice, we find the parameter estimates by some version of gradient descent, meaning we calculate the partial derivatives of the loss with respect to the parameters, and move the parameter values in the negative direction of these partial derivatives. With α denoting a step size, a somewhat informal notation for this is

w ← w − α ∇w ( ∑_{i=1}^{n} loss(φ(xi), yi) ).

By repeating this gradient step multiple times, we will eventually reach a minimum of the loss function. Usually, we perform each gradient step by only considering a small subset of the data set at a time. This subset is called a batch and is changed for every gradient update. In the neural networks literature, we refer to this procedure as stochastic gradient descent, or just SGD, and it comes in many flavors such as Adagrad (Duchi et al., 2011), RMSprop (Hinton et al., 2012), and Adam (Kingma and Ba, 2014). Their distinction is not important for the understanding of this thesis.
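To illustrate the training procedure, the following is a minimal sketch of a mini-batch SGD loop in PyTorch. The model, data, learning rate, and number of epochs are arbitrary placeholders; this is not code from the papers.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model; all sizes are arbitrary.
x = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000, 1)).float()
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

net = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(10):                 # passes over the training set
    for xb, yb in loader:               # one batch per gradient step
        optimizer.zero_grad()           # clear old gradients
        loss = loss_fn(net(xb), yb)     # forward pass
        loss.backward()                 # backward pass (back-propagation)
        optimizer.step()                # gradient update of the parameters
```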
If we have a model with many parameters compared to the size of the data set, the optimal parameters in (2.1) are probably overfitted to the training set, meaning that they generalize poorly. It is, therefore, common to monitor the loss on a held-out validation set, and stop the optimization procedure when this validation loss stops improving. This is known as early stopping and is just one of many regularization techniques, a topic we will further investigate in Section 2.4.

2.3.1 Back-propagation
For all but the simplest neural networks, it can be cumbersome to derive the gradients of all the parameters. To this end, back-propagation, an application of the chain rule of differentiation, was developed to efficiently obtain the gradients of the parameters in neural networks. Consider the neural network φ(x) = hm(hm−1(. . . (h1(x)))), consisting of the transformations z1 = h1(x), z2 = h2(z1), . . . , zm = hm(zm−1), each with parameters denoted w1, w2, . . . , wm. We require that each transformation in the network can calculate the partial derivatives of its output zk with respect to its parameters wk and its input zk−1. Informally written, this means that layer k needs to calculate ∂zk/∂wk and ∂zk/∂zk−1. By the chain rule of differentiation, the gradients of the parameters with respect to the loss can be obtained by

∂loss(φ(x), y)/∂wk = [∂loss(φ(x), y)/∂zk] [∂zk/∂wk] = [∂loss(φ(x), y)/∂zm] [∂zm/∂zm−1] · · · [∂zk+1/∂zk] [∂zk/∂wk].

Obtaining the gradients of all parameters in the network is called a backward pass, while calculating the loss is called a forward pass. It should now be clear that modifying or replacing a layer zk = hk(zk−1) only requires redefining ∂zk/∂wk and ∂zk/∂zk−1, and the rest of the implementation is left untouched. This modularity is rather powerful, as it greatly simplifies experimentation with new layers and network structures. In fact, in Papers II and III, we propose methods based on standard neural network structures, but with new loss functions, which can be viewed as the final layer of the computational graph. The development of neural network methodology is further simplified by frameworks such as Tensorflow (Abadi et al., 2015), Keras (Chollet et al., 2015), PyTorch (Paszke et al., 2017), Torch (Collobert et al., 2002), Caffe (Jia et al., 2014), and Theano (Bergstra et al., 2010). Some of these also perform automatic differentiation, meaning one only needs to implement the transformations and not the partial derivatives. For a more in-depth discussion of back-propagation, see, e.g., Chapter 6.5 of the book by Goodfellow et al. (2016).

2.4 Overfitting and Regularization
Neural networks are generally overparameterized. Rumelhart et al. (1986) found that networks with just enough connections to perform a given task tend to get stuck in local minima that are far worse than the global minima, and that “adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces”. So, creating a neural network is not simply about finding the right number of parameters in the network. Typically, the best model, in terms of generalization error, is an appropriately regularized large model (Goodfellow et al., 2016). There is, of course, much literature on regularization, and in the following, we will investigate some of the most relevant approaches.
2.4.1 Weight Decay
Neural networks aside, regularization in statistics and machine learning is typically associated with penalization of the loss with respect to the size of the parameter values. Consider a model φ(x) with parameters w. The loss penalized by the L2 norm is then

lossγ = loss(φ(x), y) + γ · ½‖w‖₂²,

where γ is a hyperparameter. The gradient updates with vanilla gradient descent will then take the form w ← w − α∇lossγ, or, equivalently, as ∇(½‖w‖₂²) = w,

w ← (1 − αγ)w − α∇loss(φ(x), y).   (2.2)

From expression (2.2), we can interpret the gradient update as performing the regular non-penalized update w ← w − α∇loss(φ(x), y), but where we shrink the weights w by a constant factor (1 − αγ) at every iteration. As a result, L2-regularization is referred to as weight decay in the neural network literature. In fact, we generally do not calculate the penalty ½‖w‖₂² directly but, instead, consider the weight decay a part of the optimization scheme.

2.4.2 Ensemble Learning and Dropout
A common approach to improving the generalization of a model is to train an ensemble of models and combine their predictions. Bagging (Breiman, 1996) and Random Forest (Breiman, 2001) are two well-known methods built on this principle, and both average the predictions of many tree models. Averaging of multiple predictions reduces the variance and, consequently, improves the generalization error of methods with high variance and low bias. As neural networks are high-variance models, such ensemble strategies are applicable. There are many different approaches to ensemble learning for neural networks, as one can bootstrap or subsample the training set (such as in bagging), use subsamples of the covariates, combine randomly initialized homogeneous network structures, etc. In Paper I, we found that averaging the predictions of networks fitted to subsets of the covariates (individual time series) substantially improved the performance over combining all covariates in a single network. We also averaged predictions from multiple randomly initialized networks for further improvements. Similar approaches are commonly used by the winning entries in prediction competitions such as the ILSVRC (Russakovsky et al., 2015).

Dropout (Srivastava et al., 2014) can be thought of as a computationally efficient alternative to training an ensemble of neural networks and is one of the most commonly applied regularization strategies for deep neural networks. The idea is to randomly “drop” units from the network (represented by nodes in Figure 2.1), and thus train an ensemble of all possible subnetworks. With φγ(x) denoting a subnetwork γ obtainable with dropout, we want to minimize the expected loss over all possible subnetworks, Eγ[loss(φγ(x), y)]. This is, however, typically intractable, so we instead minimize an unbiased estimate by sampling γ. In practice, dropout is applied by multiplying the output of a layer with a mask of iid 0/1 variables drawn from a Bernoulli(p) distribution (at every batch iteration), where p is a hyperparameter. The ensemble prediction can then be obtained by computing the average over a sample of masks,

φensemble(x) = (1/m) ∑_{k=1}^{m} φγk(x).

A more common, less computationally expensive, alternative is to scale the output of each dropout layer by the dropout probability p, meaning we compute an estimate of the expected output of that layer. This approach, called weight scaling, has limited theoretical justification but works well in practice.
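The following small sketch illustrates the masking and weight-scaling described above; the keep-probability p, layer sizes, and data are arbitrary, and in practice one would simply use a dropout layer from a framework, as noted in the comments.

```python
import torch

p = 0.8                            # probability of keeping a unit
h = torch.randn(64, 32)            # output of some hidden layer for a batch of 64

# Training: multiply by an iid Bernoulli(p) 0/1 mask, redrawn for every batch.
mask = torch.bernoulli(torch.full_like(h, p))
h_train = h * mask

# Prediction with weight scaling: no mask, but scale by p so that the output
# matches its expected value under the random masks.
h_test = h * p

# torch.nn.Dropout implements the same idea with the opposite convention:
# its argument is the drop-probability, and it rescales by 1/p during training
# ("inverted dropout") so that no scaling is needed at prediction time.
drop = torch.nn.Dropout(1 - p)
```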
2.4.3 Data Augmentation
Possibly the simplest way to improve the generalization of a machine learning model is by increasing the size of the training set (assuming the quality of the data is reasonable). Obtaining more data is often hard or expensive, so data augmentation techniques instead attempt to artificially increase the size of the training set. The most basic form of data augmentation is to introduce noise to the covariates (Sietsma and Dow, 1991), in which case dropout can be viewed as an augmentation scheme. For image recognition, data augmentation has become an important consideration when training networks. There has been great success in artificially augmenting data sets by subjecting images to label-preserving transformations such as rotation, translation, scaling, and mirroring. Krizhevsky et al. (2012) emphasize this approach as an important part of their winning entry to the ILSVRC-2012 competition (Russakovsky et al., 2015). In Paper I, we explore how data augmentation can be applied to time-series data by considering multiple overlapping windows of the same time series per individual. This is found to have a substantial impact on the performance of our models.

2.4.4 Other Approaches to Regularization
Three other common regularization techniques are batch normalization (Ioffe and Szegedy, 2015), though its main objective is to improve optimization; cyclical learning rates (Loshchilov and Hutter, 2016), to find better local minima; and early stopping of the optimization routine based on a held-out validation set. Finally, an alternative approach to regularization is to change the structure of the neural networks. If one tailors the network architecture to a specific problem, one can potentially reduce the number of parameters without sacrificing the network’s ability to model the objective. Two such approaches are the use of sparse layers, meaning that many weights are set to zero, and the use of shared parameter values, meaning that multiple parameters are fixed to be identical. The most prominent application of these two techniques is the development of the convolutional networks by LeCun (1989).

2.5 Convolutional Networks
The name neural networks comes from biology, as some of the first learning algorithms were modeled after the brain. It is, however, hard to argue that modern networks are more than loosely inspired by biological networks. Still, convolutional networks claim to capture some of the same properties as the primary visual cortex (LeCun et al., 2010; Goodfellow et al., 2016). Convolutional networks, convolutional neural networks, ConvNets, and CNN’s are all terms we use to describe neural networks where convolutional layers are a crucial part of the network architecture. LeCun (1989) is considered the inventor of CNN’s, though he was heavily inspired by the works of Fukushima (1980) and Lang (1988). The main motivation of CNN’s was to reduce the number of free parameters in a network by tailoring its architecture to the grid structure of an image represented by pixels. By enforcing sparsity and weight sharing, LeCun could accomplish this reduction without necessarily reducing the size of the network.

Figure 2.2: The LeNet-5 architecture from LeCun et al. (1998).

In Figure 2.2, you can see the CNN proposed by LeCun et al. (1998) for recognizing digits (even though the figure uses the character A as an example). In the figure, convolutions are represented by the small square on the input image.
By applying this square to different parts of the input image, one obtains the outputs in “C1: feature maps”. The small squares represent convolutional kernels and are often referred to as filters. This is because each filter is “looking” for a pattern in the image, and its output (C1: feature maps) is a map describing which parts of the input image contain the filter pattern. For example, a filter might look for horizontal lines in the image, while a second filter might look for vertical lines. Figure 2.2 also contains other layers than the convolutional layers, but for now, we only note that the last layers are an MLP as described in Section 2.2.

Images are not really part of this thesis, but in Paper I, we used CNN’s for time-series classification. Therefore, when describing the convolutions in detail, I will phrase them in terms of time series. The reader should, however, note that the following generalizes to images. Convolutional layers are, essentially, the discrete convolution operation known from mathematics (see, e.g., Goodfellow et al., 2016). We do, however, prefer to think of a convolution as a filter w applied to a patch x̃ of the time series x, where w is a vector of parameters. In this case, the convolution takes the form of the cross-correlation c = wT x̃. The same filter w is then applied to numerous patches from the time series, denoted x̃i, which results in a vector of outputs c. We can alternatively express this operation as c = Ax, where x is the full time series and A is a sparse diagonal-constant matrix (Toeplitz matrix) with w descending along the diagonal. This means that the convolution can be viewed as a fully-connected layer, but with sparse weights and weight sharing. The new feature series c tells us something about the correlation between the time series and the filter w. In other words, c tells us where in the time series we find patterns close to that of the filter w.

Figure 2.3: Illustration from Paper I of a convolutional layer applied to a time series. Only the first part of the full network is included. From left to right, the figure shows a time series, a convolution operation with the resulting features, and a pooling operation with the resulting features.

As w can only express a single pattern, we repeat the operation with q different filters {w1, w2, . . . , wq}, giving the corresponding feature series {c1, c2, . . . , cq}. This is illustrated in Figure 2.3, where we have a time series of length 365 and apply 32 different filters of size 9. Each filter w consists of randomly initialized parameters and is fitted in the same manner as the dense layers in the MLP’s. After the convolutional layers, we typically add an MLP that performs the final transformations. When trained, the different filters can extract various information that is useful for minimizing our loss function. When we build a neural network consisting of multiple convolutional layers applied in sequence (with non-linear activation functions in between), the network can learn a hierarchy of representations. This is hard to illustrate for time series, but such visualizations are quite common for images (see, e.g., Simonyan et al., 2013; Zeiler and Fergus, 2014). For images, we generally find that the first layers only learn to recognize simple patterns such as edges with different orientations, while the subsequent layers are able to detect more complex patterns such as corners and textures. The last layers can find whole objects such as faces, text, and animals.
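To make the convolution operation concrete, the following is a minimal PyTorch sketch matching the example above (a time series of length 365 and 32 filters of size 9); the batch size and random data are placeholders.

```python
import torch
import torch.nn as nn

# One year of daily values for a batch of individuals.
x = torch.randn(8, 1, 365)          # (batch, input channels, length)

conv = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=9)
c = torch.relu(conv(x))             # 32 feature series per individual
print(c.shape)                      # torch.Size([8, 32, 357]), i.e. 365 - 9 + 1
```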
2.5.1 Controlling the Feature Space
Consider a time-series input of length m and a filter of length v. If we slide the filter across the time series one index at a time, the resulting output will be of length m − v + 1. For deeper networks (many convolutional layers), the output might become very short. Zero padding can be applied to prevent this shrinkage from occurring and works by adding zeros to the beginning and end of the time series, thus making the time series longer. The typical argument for applying zero padding is, however, that it preserves the information at the borders of the series.

On the other hand, it can be beneficial to reduce the length of the convolved features, as larger feature spaces require more computations and are more prone to overfitting. We can, therefore, use downsampled convolutions, or strided convolutions, where we move the filter by s indices at a time. This will reduce the size of the output by a factor 1/s. An alternative to downsampled convolutions is to use some version of pooling. Pooling summarizes an area of the input (i.e., input to the pooling operation) by computing a simple function such as the average or maximum. This creates a representation that is approximately invariant to small translations in the input and is, therefore, particularly useful when the existence of a pattern is more important than the exact location of that pattern. If we apply pooling to non-overlapping patches, we reduce the length of the output. This is illustrated in Figure 2.3, where we perform non-overlapping max pooling over 4 features (days) at a time and, consequently, reduce the feature length by a factor of 4. Correspondingly, an example of pooling for images is shown in Figure 2.2, where pooling is referred to as subsampling.

2.5.2 CNN's for Time Series
In Paper I, we applied CNN’s to the time-series classification problem of mortgage default. At that time, there was limited literature on CNN’s for time series, as recurrent neural networks (RNN’s), such as the long short-term memory networks (Hochreiter and Schmidhuber, 1997), were the preferred architectures for processing sequential data. This is, in a sense, surprising, as the predecessor of the CNN’s, the time-delay neural networks (Lang, 1988), can be considered convolutions applied to time series. Although versions of RNN’s continue to be the most popular architecture for time series, CNN’s have the benefit of being more computationally efficient. Therefore, CNN’s for time series is an active research area, and there have been multiple examples of CNN’s claiming to have better performance than RNN’s (Zhang et al., 2015; van den Oord et al., 2016; Bai et al., 2018; Elbayad et al., 2018).
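To close the chapter, here is a small sketch, under the same placeholder setup as above, of the downsampling operations from Section 2.5.1 in the time-series setting: strided convolutions, non-overlapping max pooling, and zero padding.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 1, 365)                      # (batch, channels, length)

# Strided convolution: move the filter s = 4 indices at a time.
conv_strided = nn.Conv1d(1, 32, kernel_size=9, stride=4)
print(conv_strided(x).shape)                    # torch.Size([8, 32, 90])

# Convolution followed by non-overlapping max pooling over 4 time steps,
# as in Figure 2.3; the feature length shrinks by a factor of 4.
conv = nn.Conv1d(1, 32, kernel_size=9)
pool = nn.MaxPool1d(kernel_size=4)
print(pool(conv(x)).shape)                      # torch.Size([8, 32, 89])

# Zero padding keeps the output length equal to the input length.
conv_pad = nn.Conv1d(1, 32, kernel_size=9, padding=4)
print(conv_pad(x).shape)                        # torch.Size([8, 32, 365])
```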
Chapter 3
Time-to-Event Prediction

Time-to-event prediction is a subfield of survival analysis concerned with prediction of future events. In classical statistics, survival analysis tends to focus on understanding how variables affect the event-time distribution. We here distinguish ourselves from the majority of the statistical survival literature, as we will make no claim of understanding the survival process. Instead, we model the event times as a “black box” that produces estimates of the event-time distribution. In the following, we will start with a basic introduction to survival analysis to make it clear how we can model the event-time distributions. Next, we will look at some classical regression models and see how they can be extended with neural networks. At the end of the chapter, we will discuss some of the typical evaluation criteria used for evaluating survival predictions.

3.1 Event-Time Modeling
In this chapter, individuals are generally denoted by i and their covariates by xi. But in the following basic introduction to survival analysis, the covariates are disregarded for simplicity of notation and, instead, only a single individual is considered. Let f(t) be the probability density function (PDF) of the continuous-time event time T∗. In survival analysis, it is common to study the survival function S(t), which gives the probability of survival beyond time t,

S(t) = P(T∗ > t).

The hazard rate h(t) is also quite commonly studied and, for continuous time, it is defined by the conditional probability

h(t) = lim_{∆t→0} P(t ≤ T∗ < t + ∆t | T∗ ≥ t) / ∆t.

This means that the hazard tells us something about the immediate likelihood of an event, given that the event has yet to occur,

h(t) ∆t ≈ P(t ≤ T∗ < t + ∆t | T∗ ≥ t).

Note that the density, the survival function, and the hazard rate are all representations of the same distribution, so knowing one is sufficient to obtain the other two. So, with F(t) and H(t) denoting the cumulative distribution function and the cumulative hazard rate,

F(t) = ∫₀ᵗ f(u) du,    H(t) = ∫₀ᵗ h(u) du,

we have that

f(t) = −S′(t) = h(t) S(t),
S(t) = 1 − F(t) = exp[−H(t)],
h(t) = f(t)/S(t) = −d log[S(t)]/dt.

Figure 3.1: Illustration of the density f(t), survival function S(t), and hazard rate h(t) for two individuals.

An illustration of these three representations of the event-time distribution is shown in Figure 3.1 for two individuals (blue and orange). Clearly, the three functions give contrasting views of the survival distribution, and it is reasonable to study all three. In the four papers of this thesis, the objective was to obtain estimates of the survival function, and we will, therefore, consider estimation of the survival function the objective of this thesis as well. The hazard and density are, however, very useful means to this end.
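As a small numerical illustration of these relations, the following sketch uses an arbitrarily chosen Weibull distribution and checks that f(t) = −S′(t) holds (approximately) when S(t) and f(t) are computed from the hazard.

```python
import numpy as np

# Weibull event times, an arbitrary choice used only to illustrate the
# relations between f(t), S(t), and h(t).
k, b = 1.5, 5.0                       # shape and scale
t = np.linspace(0.01, 10, 2000)

h = (k / b) * (t / b) ** (k - 1)      # hazard rate h(t)
H = (t / b) ** k                      # cumulative hazard H(t)
S = np.exp(-H)                        # S(t) = exp[-H(t)]
f = h * S                             # f(t) = h(t) S(t)

# Check f(t) = -S'(t) by numerical differentiation.
print(np.allclose(f, -np.gradient(S, t), atol=1e-3))   # True
```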
3.1.1 Censoring and Truncation
In Figure 3.1, one might notice that the survival functions S(t) do not drop below approximately 0.2, but it looks like both would continue to decrease if we extend the time axis beyond 10. If we are only able to make observations in the interval between the start and 10, we say that individuals are censored at time 10. Censoring, and the related topic of truncation, do, however, extend far beyond this illustrative case and are a fundamental part of survival analysis. Let C∗ denote a right-censoring time. C∗ is typically considered a random variable, though that may not always be the case. I will only use the notation C∗ for the censoring time, and the reader is expected to understand from context whether or not it is a random variable.

Let us first consider random censoring, where C∗ is a random variable with density fC∗(t) and survival function SC∗(t) defined in the same manner as for the event time T∗. A typical example of random censoring is when an individual i decides to no longer be part of a study and, consequently, we only know that the individual had still to experience the event at time C∗i. As both C∗i and T∗i are random variables, we are only able to observe the shorter of the two. We, therefore, define the observed time Ti and event indicator Di as follows,

Ti = min{T∗i, C∗i},   (3.1)
Di = 1{T∗i ≤ C∗i}.   (3.2)

In Papers II and III, we consider observations of this form.

Administrative censoring, or progressive type I censoring, is an alternative where the right-censoring times C∗i are known for all individuals. So we know C∗i even when T∗i < C∗i. Administrative censoring often occurs in studies where individuals have different times of entry to the study (in calendar time) and a defined end date of the study. This is illustrated in Figure 3.2, where the left figure shows the event and censoring times in the study, while the right figure shows the calendar times at which the individuals were recorded.

Figure 3.2: Example of administrative censoring (progressive type I) for five individuals (two censored and three observed events). The left figure represents the time scale in which we model the event times. The right figure represents the same individuals, but in the calendar time in which the data was collected.

As an example, the figure could illustrate the time to mortgage defaults, such as in Paper I, where the mortgages are started on different dates and the end date of the study is today. The censoring time C∗i will, in this case, be the difference between the entry date and the end-of-study date. The observations are still given by (3.1) and (3.2), but we now also know all the C∗i’s. As the time of entry can be considered random, it is common to consider C∗i random in these data sets too. Also, researchers tend to disregard the additional information provided by C∗i and treat the problem as we did for non-administrative censoring.

Only right-censored observations were considered in the papers that constitute this thesis. Left-truncation is, however, also quite common in survival data. As an example, for a left-truncation time Vi, we will only observe individual i if T∗i > Vi. This means that if T∗i < Vi, we might not even know that the individual ever existed. Consequently, we are limited to modeling the distribution conditioned on the truncation time, or we can make sufficiently strong assumptions on the distribution of Vi to model the full event times. As an example of left-truncation, Klein and Moeschberger (2003) consider the age at death of residents at a retirement center. Individuals need to survive to a sufficient age to enter the retirement center, meaning all individuals that died before this time never entered the study. The topic of left-truncation will be revisited in Chapter 5. For a more extensive review of other types of censoring and truncation, see, e.g., Chapter 3 of the book by Klein and Moeschberger (2003).

3.1.2 Discrete-Time Survival
Thus far, we have considered the time scale to be continuous. Sometimes, however, a discrete time scale can be more appropriate. As an example of discrete-time survival data, we can consider the event of leaving a subscription service where each subscription payment lasts for a full month at a time. This means that the events can only occur after month number 1, 2, 3, etc., even though the decision to leave might happen sometime between the discrete time points. It is also quite common to have events that occur in continuous time, but where we are only able to make observations at discrete points in time.
Let 0 = τ0 < τ1 < · · · denote the discrete time scale. The probability mass function (PMF), the survival function, and the discrete hazard are defined as

f(τj) = P(T∗ = τj),
S(τj) = P(T∗ > τj) = ∑_{k>j} f(τk),
h(τj) = P(T∗ = τj | T∗ > τj−1) = f(τj) / S(τj−1).

Consequently, we also have that

f(τj) = S(τj−1) − S(τj),
S(τj) = [1 − h(τj)] S(τj−1) = ∏_{k=1}^{j} [1 − h(τk)].

Assuming the same time scale for the censoring distribution, we denote the corresponding PMF and survival function by fC∗(τj) and SC∗(τj). In Paper III, we are concerned with discrete-time models, and we also address how the discrete-time models can be applied to continuous-time data. Both discretization schemes for continuous-time data and interpolation schemes for obtaining continuous-time survival estimates are considered.

3.1.3 The Likelihood for Right-Censored Event Times
Continuing with the discrete time scale and right-censored observations of the form Ti = min{T∗i, C∗i}, Di = 1{T∗i ≤ C∗i}, we will derive the likelihood. For observations t and d, the probability of the observations is

P(T = t, D = d) = P(T∗ = t, C∗ ≥ t)^d P(T∗ > t, C∗ = t)^{1−d}.

If the observed censoring time c is deterministic, then d = 1 implies that t ≤ c, and d = 0 implies that t = c. This gives us

P(T = t, D = d) = P(T∗ = t)^d P(T∗ > t)^{1−d} = f(t)^d S(t)^{1−d}.   (3.3)

It is, however, more common to consider random censoring C∗. If we assume that T∗ and C∗ are independent, we obtain the probability

P(T = t, D = d) = [P(T∗ = t) P(C∗ ≥ t)]^d [P(T∗ > t) P(C∗ = t)]^{1−d}
= [f(t) (SC∗(t) + fC∗(t))]^d [S(t) fC∗(t)]^{1−d}
= [f(t)^d S(t)^{1−d}] [fC∗(t)^{1−d} (SC∗(t) + fC∗(t))^d].

Now, assuming f(t) does not depend on the parameters of fC∗(t), we can view [fC∗(t)^{1−d} (SC∗(t) + fC∗(t))^d] as a constant and only consider the proportional expression

P(T = t, D = d) ∝ f(t)^d S(t)^{1−d},

which is identical to the expression with deterministic censoring in (3.3). So, in both cases, for individuals denoted by i, with covariates xi, observed times ti, and event indicators di, we consider the likelihood

L = ∏_i f(ti | xi)^{di} S(ti | xi)^{1−di}.   (3.4)

If we substitute the PMF with the PDF in (3.4), we obtain the corresponding likelihood for continuous-time data. As both the PMF and PDF are denoted f(ti | xi), the expression is unchanged. The derivations are also very close to that of (3.4). The likelihood for right-censored event times in (3.4) is the foundation for many survival methods, some of which will be addressed in the next sections. In fact, the survival methods presented in Papers II and III are built on this likelihood. However, if we have, e.g., truncated or left-censored data, some alterations need to be made. This is somewhat outside the scope of this thesis, so I will refer the interested reader to Chapter 3.5 of the book by Klein and Moeschberger (2003).
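To connect the likelihood (3.4) to the neural network models discussed later, the following is a rough sketch of how the corresponding negative log-likelihood can be computed for a discrete-time model whose hazards h(τj | x) are parameterized by a network (in the spirit of Paper III, but not the papers' actual implementation, which is available in the pycox package). Shapes and names are illustrative, and a practical version would work on the log scale for numerical stability.

```python
import torch

def neg_log_likelihood(logits, idx, d):
    """Right-censored likelihood (3.4) for a discrete-time model.

    logits: (n, m) network outputs, one per discrete time tau_1, ..., tau_m
    idx:    (n,) index j such that the observed time t_i equals tau_{j+1}
    d:      (n,) event indicators
    """
    haz = torch.sigmoid(logits)                  # discrete hazards h(tau_j | x)
    surv = torch.cumprod(1 - haz, dim=1)         # S(tau_j | x) = prod_k [1 - h(tau_k | x)]
    surv_prev = torch.cat([torch.ones(len(surv), 1), surv[:, :-1]], dim=1)
    pmf = haz * surv_prev                        # f(tau_j | x) = h(tau_j | x) S(tau_{j-1} | x)

    i = torch.arange(len(idx))
    log_lik = d * torch.log(pmf[i, idx]) + (1 - d) * torch.log(surv[i, idx])
    return -log_lik.mean()

# Toy usage with arbitrary shapes: 5 individuals, 4 discrete time points.
logits = torch.randn(5, 4, requires_grad=True)
idx = torch.tensor([0, 2, 1, 3, 2])
d = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
loss = neg_log_likelihood(logits, idx, d)
loss.backward()
```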
3.2 Regression Models
The application of regression models in survival analysis typically involves maximization of the survival likelihood (3.4). It is, however, more common to phrase the likelihood in terms of the hazard rate h(ti | xi), rather than the PMF/PDF f(ti | xi). Considering continuous-time data, with individuals denoted by i, with covariates xi, observed times ti, and event indicators di, we can rewrite (3.4) to

L = ∏_i h(ti | xi)^{di} exp[−H(ti | xi)].   (3.5)

Correspondingly, for discrete time we have

L = ∏_i h(ti | xi)^{di} [1 − h(ti | xi)]^{1−di} ∏_{j: τj < ti} [1 − h(τj | xi)].   (3.6)

For continuous-time data, a classical alternative is to assume that the hazard is piecewise constant, h(t | xi) = ηj(xi) for t in the interval (τj−1, τj]. With yij = 1{τj−1 < ti ≤ τj, di = 1} indicating whether individual i experienced an event in interval j, and with the time at risk in interval j,

∆t̃ij = τj − τj−1,   if ti > τj,
∆t̃ij = ti − τj−1,   if τj−1 < ti ≤ τj,
∆t̃ij = 0,           if ti ≤ τj−1,

the likelihood in (3.5) can be written as

L = ∏_i ∏_{j: τj−1 < ti} [∆t̃ij ηj(xi)]^{yij} exp[−∆t̃ij ηj(xi)].

This is proportional to the likelihood of independent Poisson-distributed variables yij with expectations µij = ∆t̃ij ηj(xi). So if we define ηj(x) = exp[φj(x)], we can use GLM software for the optimization procedure. Fornili et al. (2014) extended this methodology by parameterizing φj(x) with a neural network. In Paper III, we revisit this method, but with modern frameworks for neural networks and some modifications to the link function.

3.2.4 Binary Classifiers
Outside the survival literature, a common approach to event-time prediction is to frame the problem as binary classification. For a time τ, we want to estimate the probability of experiencing an event before this time, meaning we want to estimate P(T∗i ≤ τ). By disregarding individuals censored before time τ, meaning that we remove individuals with C∗i < min{T∗i, τ}, we can fit a binary classifier to the remaining individuals with labels yi = 1{ti > τ}. This was our approach in Paper I for predicting mortgage defaults. If we want to obtain predictions for more than a single point in time, we can either repeat the steps above for multiple τj’s, or we can create a model that does this implicitly. The latter can be achieved with a neural network with an output for every τj. This approach was applied in Paper IV.

Clearly, estimates obtained from such methods are dependent on the censoring distribution. However, for a negligible amount of censoring, the estimates of the survival function will be close to those of a method that accounts for censoring. Furthermore, if we are only interested in discrimination, meaning that well-calibrated estimates are not important, the binary classifier can be a reasonable approach. There are, of course, some benefits to using a survival method in this case too, as the ranking can be affected by the censoring bias. But, at least in my experience, these differences are smaller than those of the evaluation metrics that also consider the calibration of the estimates.
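The construction of such a classification data set can be sketched as follows; the observed times, indicators, covariates, and the horizon τ are toy placeholders.

```python
import numpy as np

# Observed times, event indicators, and covariates (toy values).
t = np.array([0.5, 2.0, 3.5, 1.2, 4.0])
d = np.array([1, 0, 1, 0, 1])           # 1 = event, 0 = censored
x = np.random.randn(5, 3)

tau = 3.0                               # prediction horizon

# Keep individuals with an observed event, or still under observation at tau;
# drop those censored before tau.
keep = (t >= tau) | (d == 1)
y = (t[keep] > tau).astype(int)         # labels 1{t_i > tau}, as in Section 3.2.4
x_keep = x[keep]
# A binary classifier fitted to (x_keep, y) then estimates P(T* > tau | x) = S(tau | x).
```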
Returning to survival data, a seemingly obvious choice for evaluating survival predictions is to use the likelihood for right-censored event times in (3.4) and (3.5). The likelihood is, however, not very well suited for comparing estimates from different approaches, such as discrete and continuous time, and is, therefore, rarely used. An alternative metric, that we use in Papers II, III, and IV, is the Brier score. If there is no censoring, the Brier score is essentially the mean squared prediction error between our survival estimates Ŝ(t | xi) and the current state of the individuals 1{T∗i > t},

\[
\mathrm{BS}(t) = \frac{1}{n} \sum_{i=1}^{n} \left[ 1\{T^*_i > t\} - \hat{S}(t \mid x_i) \right]^2. \tag{3.12}
\]

Therefore, the Brier score is a function of the time t. In practice, to obtain a single number for the score, we can calculate the integrated Brier score by numerically approximating the integral

\[
\mathrm{IBS} = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{BS}(u) \, du,
\]

for a suitable time interval [t1, t2].

3.3.1 Inverse Probability of Censoring Weighting

For right-censored event times, the Brier score can be approximated by a weighting scheme that preserves the expectation of the score. Graf et al. (1999) proposed to weight the Brier score (3.12) by the inverse probability of censoring,

\[
\mathrm{BS}_{\mathrm{IPCW}}(t) = \frac{1}{n} \sum_{i=1}^{n} \left[
\frac{\hat{S}(t \mid x_i)^2 \, 1\{t_i \le t,\, d_i = 1\}}{\hat{S}_{C^*}(t_i{-} \mid x_i)}
+ \frac{\left[ 1 - \hat{S}(t \mid x_i) \right]^2 \, 1\{t_i > t\}}{\hat{S}_{C^*}(t \mid x_i)}
\right],
\]

where ti− denotes the time right before ti. They proposed to use the Kaplan-Meier estimator to obtain censoring estimates ŜC∗(t | x), meaning one assumes that the censoring distribution is independent of the covariates. Gerds and Schumacher (2006) extended the idea to covariate-dependent censoring estimates.

The Brier score is obtained directly from the survival estimates. This makes the Brier score attractive for comparing any methods that produce survival estimates, which is the case for most survival methods. A drawback is that the scores are dependent on our estimate of the censoring distributions, which can be as difficult to estimate as the event-time distributions. If the censoring distributions substantially affect the scores, it can be hard to make any strong claims based on the results. In Paper IV, we discuss some of the issues when this scheme is applied to administratively censored times, in particular when covariates are very informative of the censoring times.

The IPCW scheme is more general than the Brier score and can be applied to any suitable function. For example, in Paper II, we use the same weighting scheme for the binomial log-likelihood, as proposed by Graf et al. (1999).
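As an illustration of the weighting scheme, the following is a minimal NumPy sketch of the IPCW Brier score at a single time point. The censoring survival estimates ŜC∗ are taken as given (for example from a Kaplan-Meier fit to the censoring times), and the function and argument names are my own, not those of any particular library.

```python
import numpy as np

def brier_ipcw(time, surv, t_obs, d, cens_surv_at_time, cens_surv_at_t_obs):
    """IPCW Brier score at a single time point `time`.

    surv:                S_hat(time | x_i) for each individual, shape (n,)
    t_obs, d:            observed times t_i and event indicators d_i
    cens_surv_at_time:   S_hat_{C*}(time | x_i), shape (n,)
    cens_surv_at_t_obs:  S_hat_{C*}(t_i- | x_i), shape (n,)
    """
    score = np.zeros(len(surv))
    event_before = (t_obs <= time) & (d == 1)   # observed event by `time`
    still_alive = t_obs > time                  # known to be event-free at `time`
    score[event_before] = surv[event_before] ** 2 / cens_surv_at_t_obs[event_before]
    score[still_alive] = (1.0 - surv[still_alive]) ** 2 / cens_surv_at_time[still_alive]
    return score.mean()
```

Individuals censored before `time` drop out of the sum, and the inverse weights of the remaining terms compensate for this in expectation.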
3.3.2 Concordance

The Harrell Jr et al. (1982) concordance index, or C-index, is a commonly used evaluation metric in the predictive survival literature. For uncensored data, it can be interpreted as the relative frequency of concordant pairs, meaning, for a given time t, it provides an estimate of

\[
C(t) = P\left( \hat{S}(t \mid x_i) < \hat{S}(t \mid x_j) \mid T^*_i < T^*_j \right). \tag{3.13}
\]

This is particularly useful for "one-to-one" corresponding models, defined by the property S(t | xi) < S(t | xj) ⟺ S(s | xi) < S(s | xj) for all s and t, as the concordance is then not dependent on the time, i.e., C(t) = C(s). The Cox proportional hazards model is an example of a "one-to-one" corresponding model, as the survival functions are given by S(t | x) = S_0(t)^{exp[g(x)]}. For models that are not "one-to-one", it is somewhat common to replace the survival functions in (3.13) with summary statistics M(x), as this results in a single score that is not a function of the time. In this case, M(xi) is meant to represent the ranking of individual i's "predicted event time", and can, for example, be the integral of the survival function,

\[
M(x) = \int_0^{\tau} S(t \mid x) \, dt.
\]

This approach might suppress some drawbacks of model restrictions, such as proportional hazards, but the interpretability of a single number for the concordance might be worth it. Alternatively, one can, of course, integrate the concordance scores in (3.13) with respect to the time t by some numerical approximation.

For censored data, the concordance is dependent on the censoring distribution, making the scores harder to interpret. It is, however, still quite common to use it. To retain the original interpretation, Uno et al. (2011) and Gerds et al. (2013) proposed an IPCW concordance in a similar manner to that of the Brier score in Section 3.3.1. Antolini et al. (2005) proposed a version of the concordance that instead estimates

\[
C^{\mathrm{td}} = P\left( \hat{S}(T_i \mid x_i) < \hat{S}(T_i \mid x_j) \mid T_i < T_j,\, D_i = 1 \right),
\]

which they showed is equivalent to a weighted ROC AUC over time. This metric is attractive because it provides a single score (without integrating over time) while still considering the time dependence of the estimates. Furthermore, for "one-to-one" corresponding models it is equivalent to the Harrell Jr et al. (1982) concordance index. In Paper II, we show that by using the concordance of Antolini et al. (2005) for hyperparameter tuning of Random Survival Forests (Ishwaran et al., 2008), we obtain better predictive performance than that obtained by the Harrell Jr et al. (1982) concordance, both in terms of discrimination and calibration.

Chapter 4 Summary of Papers

4.1 Paper I

Håvard Kvamme, Nikolai Sellereite, Kjersti Aas, and Steffen Sjursen. Predicting Mortgage Default Using Convolutional Neural Networks. Expert Systems with Applications, 102: 207-217, 2018.

This paper is concerned with predicting defaults of private mortgages at Norway's largest bank, DNB. In contrast to similar work in the literature, we only use the state of each customer's checking account, savings account, and credit card, while disregarding all other information about the customers. Hence, the aim of the paper is twofold: to show that the transaction data holds information useful for estimating the risk of a mortgage, and to show how to extract this information. No survival methodology was considered in this paper. Instead, the problem was approached as time-series classification. As was argued in Section 3.2.4, it is reasonable to apply a binary classifier to survival data if there are relatively few censored observations. Although this was not addressed in the paper, the binary classifier only considers mortgage defaults in the span of one year, meaning there was very little censoring.

The data set consists of the daily balances of the checking accounts, the savings accounts, and the credit cards, in addition to the daily number of checking account transactions and the total amount transferred to the checking account. For each customer, a full year of data was used as covariates. Convolutional networks were applied to the time series, and it was found better to fit individual networks for each of the time series, rather than fitting a single network with all the time series as input. At the time of this project, it was not very common to apply convolutional networks to time series, and most literature instead considered recurrent networks for this purpose.
We found, however, that the convolutions performed better than such recurrent networks, and in the years since, we have seen an increase in the use of convolutions for these types of problems. The training set only consisted of 13,000 individuals. Hence, an augmentation technique was proposed. The scheme uses multiple "windows" of a customer's time series, and was shown to improve predictive accuracy substantially. In relation to this, empirical studies found that increasing the size of the training set would likely improve the performance of the classifier. This suggests that we were not able to take advantage of all the information available in the time series. In conclusion, the proposed methodology was able to extract valuable information from the transaction time series. We do, however, believe that by combining this research with other data sources, further improvements would be possible.

4.2 Paper II

Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-Event Prediction with Neural Networks and Cox Regression. Journal of Machine Learning Research, 20(129): 1-30, 2019.

In this paper, we propose new ways to apply neural networks to survival data. The task is approached by extending the well-known Cox regression. There are numerous papers on parameterizing a Cox regression with neural networks, but we distinguish ourselves from them by proposing a model with non-proportional hazards. This is achieved by allowing for interactions between the covariates and the time. In particular, the hazard model is h(t | x) = h0(t) exp[g(t, x)], where h0(t) is a non-parametric function and g(t, x) is parameterized by a neural network. This model is actually a relative risk model rather than a Cox model. Regardless, we refer to the method as Cox-Time, as the name is likely more familiar to most readers.

Training the proposed Cox-Time method with the Cox partial log-likelihood would be very computationally expensive. By instead using an approximation of the partial log-likelihood from nested case-control studies, we are able to fit the model in reasonable time. Prediction with the Cox-Time method can, however, still be quite computationally expensive. Some suggestions were proposed to alleviate the cost, but further research is warranted. A proportional Cox model, fitted with the same approximation, is also proposed. The method is referred to as CoxCC or Cox-MLP (CC), where CC refers to the case-control approximation of the partial log-likelihood. The validity of the approximation was verified through simulation studies, and we found that, in general, very few control samples were needed.

The proposed methodology was compared with other methods from the literature in a study of five data sets. Four of the data sets were obtained from the survival literature, each consisting of 1,000 to 10,000 individuals. The fifth was created with data made available by the Kaggle KKBox Churn Prediction Challenge. Although not originally intended for time-to-event prediction, we were able to create a data set of more than 2 million individuals. As neural networks typically excel with larger data sets, we wanted to have at least one large data set in our experiments. Our proposed Cox-Time method was found to perform well compared to competing methods, in particular in terms of the Brier score and binomial log-likelihood, both weighted by the inverse probability of censoring.

4.3 Paper III

Håvard Kvamme and Ørnulf Borgan. Continuous and Discrete-Time Survival Prediction with Neural Networks.
Submitted for publication, 2019.

This paper is, in some ways, a continuation of Paper II. However, the semi-parametric Cox model is abandoned, and we instead direct our attention to fully parametric approaches that directly optimize the likelihood of right-censored event times. The explored methods either assume a discrete time scale or consider the time scale to be divided into intervals. Consequently, we need to address the issue of how discrete-time models can be applied to continuous-time data sets. Two discretization schemes are considered: one that splits the time scale into equidistant intervals, and one that uses the quantiles of the estimated event-time distribution to split the time scale. To obtain continuous-time predictions from the discrete-time models, we consider two interpolation schemes: one that assumes a constant density function in each interval (linear interpolation of survival estimates), and one that assumes constant hazard rates (exponential survival function in each interval).

For smaller data sets, a coarse discretization grid can be beneficial as it reduces the number of parameters in the neural networks. This was verified in a simulation study. It was also found that the discretization grid based on quantiles was especially useful for the small data sets. Larger data sets, on the other hand, allow for finer discretization grids, meaning the differences between the two discretization and interpolation schemes decrease. It was, generally, found that the granularity of the discretization grid had a much larger impact on the predictive performance than the choice of discretization and interpolation schemes.

The paper compares the two approaches to discrete-time modeling discussed in Section 3.2.2, i.e., the Logistic-Hazard and the PMF method. In terms of predictive performance, the two methods are very close, but we find that the Logistic-Hazard generally gives more stable performance, while the PMF method is more sensitive to hyperparameter configurations. In this paper, we also propose a continuous-time method that assumes constant hazards in each interval. This is essentially the piecewise exponential model described in Section 3.2.3, but with modern neural networks and a softplus function replacing the regular exponential function in h(τj | x) = exp[φj(x)]. We find that its performance is competitive with the two other methods discussed in this paper, in addition to other methods found in the literature. In general, the methods described in this paper seem to perform slightly better than the Cox-Time methods proposed in Paper II.

4.4 Paper IV

Håvard Kvamme and Ørnulf Borgan. The Brier Score under Administrative Censoring: Problems and Solutions. Submitted for publication, 2019.

This paper is an investigation of the Brier score, which is commonly used to evaluate survival predictions. In short, the Brier score is the mean squared difference between the survival estimates and the state of the individuals at a given time, denoted 1{T∗i > t}. For right-censored event times, the Brier score can be weighted by the inverse probability of censoring, called IPCW. The paper shows that if the censoring times can be identified from the covariates, then the IPCW Brier score might be biased. Furthermore, this bias is equivalent to the bias of the binary classifiers in Section 3.2.4 and, therefore, benefits the binary classifiers.
This is also verified through a simulation study, where we show that a binary classifier can obtain better IPCW Brier scores than the true survival function. The paper also addresses that estimation of the censoring distribution can be problematic, especially if the distribution is dependent on the covariates. Under administrative censoring, meaning all censoring times are observed, we propose the administrative Brier score, which does not require estimation of the censoring distribution and does not have the bias of the IPCW scores when the censoring times can be identified from the covariates. This is verified through a simulation study.

The KKBox churn prediction data set from Paper II is administratively censored and is, therefore, used as a case study in the paper. We find that the data set exhibits the traits we produced with simulations, meaning the covariates are likely to hold information about the censoring times of a subset of the customers. We show that the IPCW Brier score is substantially affected by the estimated censoring distribution. For the highest times, meaning the highest proportion of censored customers, binary classifiers outperform the Logistic-Hazard method. However, for the administrative Brier score, we find that the Logistic-Hazard performs better than the binary classifiers, as we would expect.

Chapter 5 Discussion

In this thesis, we have investigated how neural networks can be used to improve the predictive performance of existing survival methods. As this topic is still somewhat unexplored, there are almost limitless extensions that could be addressed here. I have tried to choose a relevant subset and instead provide some more details. I will also use the opportunity to discuss some of the weaknesses of the discussed methodology, in addition to my experiences working with each of the proposed methods. Instead of addressing each of the four papers individually, the following is a general discussion of the machine learning methodology for time-to-event prediction.

5.1 The Survival Methods

In Papers II and III, the predictions by various survival methods are compared using metrics such as the concordance and the Brier score. After working with numerous approaches to survival prediction, I would like to take the opportunity to address some of my experiences that are not as easily quantifiable and share some of my thoughts on the methods.

Random Survival Forests (RSF) by Ishwaran et al. (2008) is by far the simplest method to use. As with other Random Forest methods, very little preprocessing of the data is required (if any at all), it is easy to find a good set of hyperparameters, and as it is fully non-parametric, there are no numerical issues. This means that it is easy to get good and stable performance. However, in my experience, RSF is very slow to train on larger data sets and, in general, it took many times longer to train than the neural networks. I do not know if this is because the method is very computationally expensive or if it was just the implementation I used that was slow.

The Cox-Time method generally performs better than the Cox methods with proportional hazards, but obtaining predictions is much more computationally expensive. This was discussed in Paper II.
Comparing the proportional CoxPH method and the CoxCC method, where the first minimizes the negative partial log-likelihood as described in Section 3.2.1 and the latter minimizes an approximation of this negative partial log-likelihood, we found that CoxCC has slightly less stable performance than CoxPH. It is also worth mentioning that the implementation of CoxCC is more complicated than that of CoxPH. However, the CoxCC method has a validation loss that is more comparable with the training loss, which gives better insight into the training process.

In Paper III, we found that the Logistic-Hazard and PC-Hazard were generally the two best-performing methods. Compared to Cox-Time, predictions are much less computationally expensive and, compared to all the semi-parametric Cox methods, the implementation is much simpler. In fact, considering the implementation, the Logistic-Hazard method is probably the simplest, as it can be implemented in only a few lines of code in a deep learning framework. Also, it is likely one of the more intuitive methods for those without a background in survival analysis. The downside of the Logistic-Hazard and PC-Hazard is that they require discretization of the time scale, which introduces additional hyperparameters. RSF and the Cox methods, on the other hand, all handle the continuous time non-parametrically.
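To substantiate the claim that the Logistic-Hazard loss takes only a few lines of code, here is a minimal PyTorch sketch of the discrete-time negative log-likelihood, written as a masked binary cross-entropy over the time intervals. The function and argument names are my own and the sketch is not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def logistic_hazard_loss(phi, idx_durations, events):
    """Negative log-likelihood of the discrete-time Logistic-Hazard model.

    phi:           network output of shape (n, m), one column per time interval
    idx_durations: index kappa(t_i) of the interval containing t_i, shape (n,)
    events:        event indicators d_i, shape (n,)
    """
    n, m = phi.shape
    # y_ij = 1 if individual i has an event in interval j, else 0
    y = torch.zeros_like(phi)
    y[torch.arange(n), idx_durations] = events.float()
    # only intervals j <= kappa(t_i) contribute to the likelihood
    mask = torch.arange(m).unsqueeze(0) <= idx_durations.unsqueeze(1)
    bce = F.binary_cross_entropy_with_logits(phi, y, reduction='none')
    return (bce * mask).sum(1).mean()
```

Writing the loss as a masked binary cross-entropy means standard framework primitives can be reused directly.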
5.2 Can We Trust Individual Predictions?

The methods discussed in Papers II and III give predictions in the form of survival estimates. These estimates were evaluated with multiple performance metrics such as the concordance index and the Brier score. However, these scores only verify that the survival estimates work well on average and do not address the individual predictions. As none of the methods produce uncertainty estimates, can we really trust the survival estimate for a single person?

To illustrate this concern, I have created a simple simulation study with constant and covariate independent hazard, meaning that all individuals have the same survival function. I then include a single covariate that is a monotone function of the administrative censoring time. This is supposed to illustrate a scenario where we have covariates that contain a lot of information about the censoring times, just as in Paper IV. If we fit a Cox model with proportional hazards to this data set, it would simply learn to disregard this covariate and give survival estimates close to the true survival function. The non-proportional methods discussed in Papers II and III, on the other hand, have more difficulties with this estimation.

In Figure 5.1, I have plotted the survival estimates of various methods, obtained by training on 1,000 samples from the simulation study. The vertical dotted red line in each plot gives the censoring time. Note that this censoring time represents the covariate used to obtain the survival predictions in each plot. The blue line in each plot represents the true survival function we want to estimate and, as it is independent of the censoring time, it is the same in all the six plots. As expected, for times smaller than the censoring time (left of the vertical red line), all methods are able to estimate the survival function reasonably well. However, the methods clearly struggle with understanding that all individuals have the same survival function. As a result, the methods are not concerned with the survival estimation after the censoring time, as this is never considered in the likelihood. At least this is my interpretation of the results. In a sense, these results are perfectly reasonable, as we should not predict for longer time horizons than what is available in the training set for a given set of covariates. However, usage of such methods clearly requires the practitioner to be aware of what time horizon is applicable for each individual.

Figure 5.1: Comparison of survival estimates by various survival methods (Cox-Time, Logistic-Hazard, PC-Hazard, and PMF, together with the true survival function and the censoring time). The only covariate is a monotone function of the censoring time, which is represented by the vertical red dotted lines. The blue line is the true survival function, which is independent of the covariate. Hence, the true survival function is identical in all six plots.

This could be handled by also estimating the censoring distribution and only using survival estimates for times t that satisfy ŜC∗(t | xi) > ξ, where ξ is some positive threshold. Estimation of the censoring distribution is, in itself, a survival problem censored by the event times. This means that we need to know the time horizons that are reasonable for our censoring estimates. The point I am trying to make is that one needs to be careful when using the individual survival estimates of these methods, especially as the methods can give survival predictions for much longer time horizons than what is reasonable for a particular individual. Uncertainty estimates would highly benefit these methods, though they may be hard to obtain. A reasonable starting point could be to consider estimates of the censoring distribution.

5.3 Extensions of the Methods

The methods discussed in Papers II and III are somewhat basic, in the sense that they only consider right-censored event times. DeepHit (Lee et al., 2018) is the exception, as it also handles multiple event types. Natural extensions of the methods would be to account for other types of censoring and truncation (left, right, interval), competing risks, dynamic prediction (time-dependent covariates), recurrent events, and multistate models. These are all topics addressed in the statistical survival literature. In the following, we will investigate how a few of these extensions can be approached for the methods in Papers II and III. We will here use the same notation as presented in Chapter 3, meaning that, for individual i with covariates xi, the event time is T∗i, the censoring time is C∗i, and the potentially right-censored time is Ti, with Di denoting whether it is an event or a censored observation.

5.3.1 Left Truncation

Section 3.5 of the book by Klein and Moeschberger (2003) addresses how the survival likelihood changes with censoring and truncation of the types left, right, and interval. This is a good starting point for extending the methods that directly optimize the survival likelihood, such as the Logistic-Hazard, the PMF method, and PC-Hazard. For the corresponding extension of Cox regression, see, e.g., Pan and Chappell (2002) and Rennert and Xie (2018). In the following, I will shortly address the case of left truncation, as it is the most common of the truncation types. If we have left-truncated data, our estimates will be conditioned on these left-truncation times. So, for ν denoting a left-truncation time, we can estimate S(t | ν) = P(T∗ > t | T∗ > ν). Note that this conditioning is similar to the hazard specification.
This makes the hazard suitable for working with left-truncated data. Consider a discrete time scale. In the same manner as in Section 3.1.3, we can derive the probability distribution of T and D, but now conditioned on the left-truncation time ν < t,

\[
P(T = t, D = d \mid T > \nu)
= \frac{P(T = t, D = d, T > \nu)}{P(T > \nu)}
= \frac{P(T = t, D = d)}{P(T^* > \nu)\, P(C^* > \nu)}
= \left[ \frac{f(t)^d \, S(t)^{1-d}}{S(\nu)} \right]
  \left[ \frac{f_{C^*}(t)^{1-d} \left( S_{C^*}(t) + f_{C^*}(t) \right)^{d}}{S_{C^*}(\nu)} \right]
\propto \frac{f(t)^d \, S(t)^{1-d}}{S(\nu)}.
\]

This means that the "regular" likelihood contribution of an individual (3.4) is scaled by the inverse of the survival probability at the left-truncation time. For an individual i with left-truncation time νi, the discrete-time likelihood contribution is then

\[
L_i = \frac{f(t_i \mid x_i)^{d_i} \, S(t_i \mid x_i)^{1-d_i}}{S(\nu_i \mid x_i)}. \tag{5.1}
\]

Written in terms of the hazard, this is equivalent to

\[
L_i = h(t_i \mid x_i)^{d_i} \left[ 1 - h(t_i \mid x_i) \right]^{1-d_i}
\prod_{j = \kappa(\nu_i)+1}^{\kappa(t_i)-1} \left[ 1 - h(\tau_j \mid x_i) \right],
\]

where the τj's denote the discrete time scale, and κ(ti) is the index corresponding to time ti, meaning ti = τκ(ti). These two ways of expressing the likelihood contribution can be used to extend the Logistic-Hazard and PMF methods in Paper III to account for left-truncated event times.

For continuous time, the likelihood contribution can still be expressed by (5.1), but with f(ti | xi) denoting the event-time density function instead of the probability mass function. Written in terms of the hazards, the continuous-time likelihood contribution is

\[
L_i = h(t_i \mid x_i)^{d_i} \exp\left[ H(\nu_i \mid x_i) - H(t_i \mid x_i) \right].
\]

This can be used to extend the PC-Hazard method in Paper III to account for left-truncated event times. For Cox regression, the left-truncated event times can be handled by modifying the risk set in Section 3.2.1 to Ri = {j : νj < ti ≤ tj}, as explained in the book by Aalen et al. (2008, p. 32). This should also work for the proposed Cox-Time and CoxCC methods in Paper II.

For all the methods discussed here, the survival functions are monotone. Hence, the estimates of S(t | x) are dependent on the estimates up till time t. Left truncation can cause a data set to contain very few individuals that are followed from time zero, so if we estimate the survival from time zero, poor estimation at the lower times might ruin the later survival estimates. It is, therefore, more reasonable to estimate the conditional survival

\[
S(t \mid \nu_0, x) = P(T^* > t \mid T^* > \nu_0, x), \tag{5.2}
\]

for a lower time ν0 at which the set of individuals we follow, {i : νi ≤ ν0}, is of reasonable size.
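As a sketch of how the left-truncated discrete-time likelihood could be implemented, the masked binary cross-entropy formulation of the Logistic-Hazard loss (cf. the sketch in Section 5.1) only needs its mask changed so that intervals up to and including the entry time are excluded. The names below are placeholders, and the code assumes both the entry times νi and the observed times ti have been mapped to their interval indices.

```python
import torch
import torch.nn.functional as F

def logistic_hazard_loss_left_truncated(phi, idx_entry, idx_durations, events):
    """Discrete-time Logistic-Hazard loss with left truncation.

    phi:           network output, shape (n, m)
    idx_entry:     index kappa(nu_i) of the left-truncation time, shape (n,)
    idx_durations: index kappa(t_i) of the observed time, shape (n,)
    events:        event indicators d_i, shape (n,)
    """
    n, m = phi.shape
    y = torch.zeros_like(phi)
    y[torch.arange(n), idx_durations] = events.float()
    cols = torch.arange(m).unsqueeze(0)
    # only intervals after the entry time and up to the observed time contribute
    mask = (cols > idx_entry.unsqueeze(1)) & (cols <= idx_durations.unsqueeze(1))
    bce = F.binary_cross_entropy_with_logits(phi, y, reduction='none')
    return (bce * mask).sum(1).mean()
```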
5.3.2 Dynamic Prediction

The KKBox data set explored in Papers II and IV contains the subscription histories of KKBox's customers. We created a simple data set for predicting churn, with stationary covariates obtained from each customer's first subscription. This means that we considered the starting point t = 0 to be the time of the first subscription for each customer. In practice, however, it is probably more interesting to make dynamic predictions in calendar time, meaning that from a given calendar date t we want to predict the probability of churn for each time t + v. This is more in line with the original KKBox Kaggle competition, as they wanted churn predictions from a given date, and not from the time of each customer's first subscription. Explicitly, for time-dependent covariates x(t) and a given date t, we want to predict time v into the future, conditioned on all information up till time t. Using the notation of (5.2), we can express the conditional survival as

\[
S(t + v \mid t, \{x(u)\}_{u \le t}) = P(T^* > t + v \mid T^* > t, \{x(u)\}_{u \le t}).
\]

One approach to this objective is to jointly model the event-time distribution and the distribution of the time-dependent covariates. While there is an extensive statistical literature on this topic (e.g., Ibrahim et al., 2010; Lawrence Gould et al., 2015), it requires estimation of the covariate process. A more straightforward approach is that of landmarking (Van Houwelingen, 2007; van Houwelingen and Putter, 2011). In its simplest form, one creates K data sets, one for each landmarking time sk, by truncating individuals with Ti < sk, and fits individual models to each of the K data sets. This way, model k can predict

\[
S(s_k + v \mid s_k, x(s_k)) = P(T^* > s_k + v \mid T^* > s_k, x(s_k)),
\]

where x(sk) is the covariates at time sk. This approach requires one model for each landmarking time sk, which is problematic for a large K. It is, therefore, desirable to combine all these K models into a single model. This can be approached by conditioning on the landmarking time, in the sense that we consider sk a covariate. Also, it is probably better to let the covariates be some function of all previous information, ensuring that the covariate space is of constant dimension p. We denote these covariates x̃(sk) ∈ R^p, as they are a function of the landmarking times and all previous information, giving x̃(sk) = g(sk, {x(u)}_{u ≤ sk}).

Partly conditional modeling (Zheng and Heagerty, 2005) closely resembles landmarking, but considers the residual time v instead of t + v. This means that we reset the time at every landmark sk. For a set of landmarking times {sk}_{k=1}^{K}, we substitute individual i by a set of "fake" individuals ik, one for each sk, with covariates x̃i(sk) and observed time vik = ti − sk. This means that we can fit the methods in Papers II and III with no alterations, and their survival predictions will represent S(v | sk, x̃(sk)). This approach was used by Martinsson (2016), who considered v | sk, x̃(sk) to be Weibull distributed, and constructed x̃(sk) with a recurrent neural network over {x(u)}_{u ≤ sk}. To my understanding, we are, technically, not restricted to using the same landmarking times for all individuals, and one could have a different set for each individual. However, I have not seen anyone address this in the literature. By following the reasoning of landmarking or partly conditional modeling, it should be possible to extend the methods in Papers II and III to work for dynamic prediction. Note, however, that this section should be considered a starting point, as I have not tested any of the approaches, and there may be issues I have failed to address.

In Paper I, we proposed an augmentation scheme for artificially increasing the size of the mortgage default data set. In short, for each individual, we extracted data from multiple time periods in a similar manner to that of partly conditional modeling. However, the objective was to classify mortgage defaults within one year, and we only used covariates from one year of data. This means that, for each landmarking time sk, we estimated S(365 | sk, {x(u)}_{u = sk−365}^{sk}), where the data {x(u)}_{u = sk−365}^{sk} was processed by a convolutional network. Paper I could benefit from a partly conditional modeling approach, as the specific time of each mortgage default holds more information than the binary labels, and building on an established approach is likely more reasonable than our ad hoc data augmentation scheme.
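To illustrate the partly conditional construction, the following is a minimal pandas sketch that expands a baseline survival data set over a common set of landmark times. The column names ('id', 'time', 'event') and the restriction to baseline covariates are assumptions made for the example; building x̃(sk) from the covariate history would replace the simple copy used here.

```python
import pandas as pd

def partly_conditional_dataset(df, landmark_times):
    """Expand a survival data set for partly conditional modeling.

    df:             DataFrame with columns 'id', 'time' (t_i), 'event' (d_i)
                    and covariate columns
    landmark_times: iterable of landmark times s_k

    Each individual is replicated once per landmark time s_k < t_i, with the
    landmark time added as a covariate and the residual time v = t_i - s_k
    as the new duration.
    """
    frames = []
    for s in landmark_times:
        at_risk = df[df['time'] > s].copy()   # individuals still at risk at s
        at_risk['landmark'] = s
        at_risk['duration'] = at_risk['time'] - s
        frames.append(at_risk)
    return pd.concat(frames, ignore_index=True)
```

The methods in Papers II and III could then be fitted to the 'duration' and 'event' columns of the expanded data set without further changes.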
5.3.3 Competing Risks

In this thesis, I have only considered situations where there is only one event type for each data set. In some situations, there can, however, be multiple competing event types. For example, there are multiple diseases a patient can die from. We typically think of these types of events as competing processes where we only observe the event type with the shortest event time. The survival literature, therefore, refers to these situations as competing risks. In the following, we will investigate possible extensions of the methods in Papers II and III to enable them to handle multiple event types. To this end, we need the random variable R ∈ {0, ..., nr} for denoting the event type, with nr possible event types and R = 0 denoting a censored time. As before, the lower case letters represent the corresponding observations, meaning an observation consists of the tuple (ti, ri), and we let di be the event indicator di = 1{ri > 0}. The cumulative incidence functions

\[
F_r(t \mid x) = P(T^* \le t, R = r \mid x)
\]

are commonly reported in competing risks studies, and we will consider the estimation of these functions our objective.

5.3.3.1 Logistic-Hazard

From the statistical literature on discrete-time survival analysis, we know that a natural extension of the Logistic-Hazard model is the Multinomial Logit Hazard model. The following is a derivation of the Multinomial Logit Hazard model, based on the presentation by Tutz and Schmid (2016). The event-specific hazard at the discrete time τj is

\[
h_r(\tau_j \mid x) = P(T^* = \tau_j, R = r \mid T^* \ge \tau_j, x),
\]

and the overall hazard is

\[
h(\tau_j \mid x) = P(T^* = \tau_j \mid T^* \ge \tau_j, x) = \sum_{r=1}^{n_r} h_r(\tau_j \mid x).
\]

This gives the survival function and event probability

\[
S(\tau_j \mid x) = P(T^* > \tau_j \mid x) = \prod_{k=1}^{j} \left[ 1 - h(\tau_k \mid x) \right], \qquad
f_r(\tau_j \mid x) = P(T^* = \tau_j, R = r \mid x) = h_r(\tau_j \mid x)\, S(\tau_{j-1} \mid x).
\]

The cumulative incidence functions can be expressed as

\[
F_r(\tau_j \mid x) = \sum_{k=1}^{j} f_r(\tau_k \mid x). \tag{5.3}
\]

For a function φ(x) ∈ R^{m×nr}, such as a neural network with an output node for each combination of the time and the event type, we define the event-specific hazards as

\[
h_r(\tau_j \mid x) = \sigma_{jr}(x) = \frac{\exp[\phi_{jr}(x)]}{1 + \sum_{k=1}^{n_r} \exp[\phi_{jk}(x)]},
\]

and we let

\[
\sigma_{j0}(x) = P(T^* > \tau_j \mid T^* \ge \tau_j, x) = \frac{1}{1 + \sum_{k=1}^{n_r} \exp[\phi_{jk}(x)]}
\]

denote the conditional survival. Note that σjr(x) is the softmax function with the statistical convention of a reference class φj0(x) = 0. Recall that di = 1{ri > 0}. Disregarding x for simpler notation and assuming that the censoring time is independent of the event time (given covariates), the probability of T and R is

\[
P(T = t, R = r)
= \left[ P(T^* = t, R = r)\, P(C^* \ge t) \right]^d \left[ P(T^* > t)\, P(C^* = t) \right]^{1-d}
\propto P(T^* = t, R = r)^d \, P(T^* > t)^{1-d}.
\]

This means the likelihood contribution of individual i is

\[
L_i = f_{r_i}(t_i \mid x_i)^{d_i} \, S(t_i \mid x_i)^{1-d_i}
    = h_{r_i}(t_i \mid x_i)^{d_i} \left[ 1 - h(t_i \mid x_i) \right]^{1-d_i}
      \prod_{j=1}^{\kappa(t_i)-1} \left[ 1 - h(\tau_j \mid x_i) \right], \tag{5.4}
\]

where κ(ti) denotes the index corresponding to time ti, meaning that ti = τκ(ti). By introducing labels

\[
y_{ijr} =
\begin{cases}
1\{t_i = \tau_j,\, r_i = r\}, & \text{if } r > 0, \\
1\{t_i > \tau_j\} + 1\{t_i = \tau_j\}(1 - d_i), & \text{if } r = 0,
\end{cases}
\]

such that \(\sum_{r=0}^{n_r} y_{ijr} = 1\) for j ≤ κ(ti), we can write the negative log-likelihood as

\[
\text{loss} = - \sum_{i=1}^{n} \sum_{j=1}^{\kappa(t_i)} \sum_{r=0}^{n_r} y_{ijr} \log[\sigma_{jr}(x_i)].
\]

We recognize this as the negative of the multinomial log-likelihood, also known as the categorical cross-entropy. This is one of the most commonly used loss functions for neural networks (e.g., for image classification) and should, therefore, be available in any deep learning framework.
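As a minimal PyTorch sketch (not taken from the papers), the multinomial logit hazard loss above can be written with a log-softmax over the event types, with a zero logit prepended for the reference class r = 0 and intervals after κ(ti) masked out. The function and argument names are placeholders.

```python
import torch
import torch.nn.functional as F

def multinomial_logit_hazard_loss(phi, idx_durations, event_types):
    """Competing-risks Logistic-Hazard (multinomial logit hazard) loss.

    phi:           network output of shape (n, m, n_risks), one logit per
                   interval and event type; the 'survival' class r = 0 is the
                   reference class with logit fixed to zero.
    idx_durations: index kappa(t_i) of the interval containing t_i, shape (n,)
    event_types:   r_i in {0, ..., n_risks}, with 0 meaning censored, shape (n,)
    """
    n, m, n_risks = phi.shape
    # prepend a zero logit for the reference class r = 0
    logits = torch.cat([phi.new_zeros(n, m, 1), phi], dim=2)   # (n, m, n_risks + 1)
    log_sigma = F.log_softmax(logits, dim=2)
    # labels y_ijr: the event type in the interval of the observed time,
    # and the reference class r = 0 for all earlier intervals
    y = torch.zeros(n, m, dtype=torch.long)
    y[torch.arange(n), idx_durations] = event_types
    # intervals after kappa(t_i) do not enter the likelihood
    mask = torch.arange(m).unsqueeze(0) <= idx_durations.unsqueeze(1)
    ll = log_sigma.gather(2, y.unsqueeze(2)).squeeze(2)        # (n, m)
    return -(ll * mask).sum(1).mean()
```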
5.3.3.2 PMF

For the discrete-time PMF method, Lee et al. (2018) have already proposed a competing risks version expressed by the event probabilities fr(τj | x) = P(T∗ = τj, R = r | x). To my understanding, however, they assume S(τm | x) = 0. We will, therefore, investigate a version without this restriction. By considering observations on the discrete time scale τ1, ..., τm, we simply let

\[
\sum_{j=1}^{m} \sum_{r=1}^{n_r} f_r(\tau_j \mid x) + S(\tau_m \mid x) = 1,
\]

with S(τm | x) ∈ [0, 1]. In the same manner as in Paper III, for j ∈ {1, ..., m} and r ∈ {1, ..., nr}, we let the risk-specific PMF be defined by the softmax

\[
f_r(\tau_j \mid x) = \frac{\exp[\phi_{jr}(x)]}{1 + \sum_{k=1}^{m} \sum_{q=1}^{n_r} \exp[\phi_{kq}(x)]}, \qquad
S(\tau_m \mid x) = \frac{1}{1 + \sum_{k=1}^{m} \sum_{q=1}^{n_r} \exp[\phi_{kq}(x)]}.
\]

The survival function can then be written as

\[
S(\tau_j \mid x) = \sum_{k=j+1}^{m} \sum_{r=1}^{n_r} f_r(\tau_k \mid x) + S(\tau_m \mid x).
\]

The negative log-likelihood corresponding to (5.4) can now be written as

\[
\text{loss} = - \sum_{i=1}^{n} d_i \log[f_{r_i}(t_i \mid x_i)]
- \sum_{i=1}^{n} (1 - d_i) \log\left[ \sum_{k=\kappa(t_i)+1}^{m} \sum_{r=1}^{n_r} f_r(\tau_k \mid x_i) + S(\tau_m \mid x_i) \right].
\]

This does not correspond to any commonly used loss function, meaning implementation requires some more work than for the competing risks version of the Logistic-Hazard. The cumulative incidence functions can be obtained from the risk-specific PMF fr(τj | x) with (5.3).

5.3.3.3 PC-Hazard

For continuous time, the cumulative incidence functions take the form of the integral

\[
F_r(t \mid x) = \int_0^t f_r(u \mid x) \, du = \int_0^t h_r(u \mid x)\, S(u \mid x) \, du. \tag{5.5}
\]

The continuous-time PC-Hazard method in Paper III can be extended to competing risks by considering the cause-specific hazards. With κ(t) now denoting the interval of time t, meaning that t ∈ (τ_{κ(t)−1}, τ_{κ(t)}], the cause-specific hazard is constant in each interval, hr(t | x) = η_{κ(t),r}(x). The continuous-time survival function can be expressed as (Prentice et al., 1978)

\[
S(t \mid x) = \exp\left[ - \sum_{r=1}^{n_r} \int_0^t h_r(u \mid x) \, du \right],
\]

and, by following the same line of reasoning as in Paper III, we can write the survival function as

\[
S(t \mid x) = \prod_{r=1}^{n_r} \left( \exp\left[ -\eta_{\kappa(t),r}(x)\, (t - \tau_{\kappa(t)-1}) \right]
\prod_{j=1}^{\kappa(t)-1} \exp\left[ -\eta_{jr}(x)\, \Delta\tau_j \right] \right),
\]

where ∆τj = τj − τj−1. The continuous-time likelihood contribution, corresponding to (5.4), is then

\[
L_i = h_{r_i}(t_i \mid x_i)^{d_i} \, S(t_i \mid x_i)
    = \eta_{\kappa(t_i),r_i}(x_i)^{d_i} \prod_{r=1}^{n_r} \left( \exp\left[ -\eta_{\kappa(t_i),r}(x_i)\, (t_i - \tau_{\kappa(t_i)-1}) \right]
      \prod_{j=1}^{\kappa(t_i)-1} \exp\left[ -\eta_{jr}(x_i)\, \Delta\tau_j \right] \right),
\]

and the negative log-likelihood follows. We can, alternatively, represent the likelihood as a Poisson likelihood. This is achieved by considering independent Poisson-distributed variables with expectations µijr = ∆t̃ij ηjr(xi), as described in Appendix C of Paper III.

5.3.3.4 Cox-Time

The Cox proportional hazards model can be extended to competing risks by considering the cause-specific hazards

\[
h_r(t \mid x) = h_{0r}(t) \exp[\phi_r(x)],
\]

and the partial likelihood (Holt, 1978; Prentice et al., 1978)

\[
PL = \prod_{r=1}^{n_r} \prod_{i=1}^{n} \left( \frac{\exp[\phi_r(x_i)]}{\sum_{j \in R_i} \exp[\phi_r(x_j)]} \right)^{1\{r_i = r\}},
\]

where Ri = {j : tj ≥ ti} and ri = 0 for censored times. Here φ(x) ∈ R^{nr} can be parameterized by a neural network with an output node for each event type. The cause-specific baseline hazards can be obtained with the Breslow estimator (3.9), and the cumulative incidence functions can be expressed in terms of the risk-specific hazard increments (Cheng et al., 1998). The Cox-Time method proposed in Paper II can be extended to competing risks in the same manner by considering

\[
PL_{CC} = \prod_{r=1}^{n_r} \prod_{i=1}^{n} \left( \frac{\exp[\phi_r(t_i, x_i)]}{\sum_{j \in \tilde{R}_i} \exp[\phi_r(t_i, x_j)]} \right)^{1\{r_i = r\}},
\]

where R̃i ⊂ Ri. In summary, it should be possible to make competing risks versions of the methods in Papers II and III.
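To indicate the extra implementation work mentioned for the competing risks PMF loss in Section 5.3.3.2, the following is a minimal PyTorch sketch of that loss under the parameterization given there. The names are placeholders, and numerical details such as devices and dtypes are ignored.

```python
import torch
import torch.nn.functional as F

def pmf_competing_risks_loss(phi, idx_durations, event_types):
    """Competing-risks PMF loss of Section 5.3.3.2 (minimal sketch).

    phi:           network output of shape (n, m, n_risks); a zero logit is
                   appended internally for S(tau_m | x), the probability of
                   surviving past the last interval.
    idx_durations: index kappa(t_i), shape (n,)
    event_types:   r_i in {0, ..., n_risks} with 0 meaning censored, shape (n,)
    """
    n, m, n_risks = phi.shape
    logits = torch.cat([phi.reshape(n, m * n_risks), phi.new_zeros(n, 1)], dim=1)
    log_p = F.log_softmax(logits, dim=1)                  # (n, m * n_risks + 1)
    log_f = log_p[:, :-1].reshape(n, m, n_risks)          # log f_r(tau_j | x)
    log_surv_last = log_p[:, -1]                          # log S(tau_m | x)

    events = event_types > 0
    # event contribution: log f_{r_i}(t_i | x_i) (ignored later for censored)
    loss_event = -log_f[torch.arange(n), idx_durations, event_types - 1]
    # censored contribution: log of the total probability mass after kappa(t_i)
    cols = torch.arange(m).view(1, m, 1)
    later = cols > idx_durations.view(n, 1, 1)            # intervals after t_i
    masked = log_f.masked_fill(~later, float('-inf')).reshape(n, m * n_risks)
    loss_cens = -torch.logsumexp(
        torch.cat([masked, log_surv_last.unsqueeze(1)], dim=1), dim=1)
    return torch.where(events, loss_event, loss_cens).mean()
```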
5.4 Evaluation of Survival Estimates

Neural networks can represent much more flexible models than classical statistical models, but this added flexibility can introduce new issues not really relevant for the simpler models. As an example, in Paper IV we discuss how the IPCW Brier score is biased for the KKBox churn data set. For simpler models than ours, this bias is likely not an issue, but by introducing neural networks, we find that these biases need to be addressed.

A substantial part of machine learning research is concerned with empirical evaluation of methods on various data sets. To some degree, for prediction, it does not matter if your method is reasonable as long as it achieves good scores. Combining machine learning methodology with survival analysis is somewhat problematic, as censoring and truncation cause the test set to have incomplete observations. We typically account for censoring and truncation by making assumptions about their distributions and relationship to the event-time distribution. When these assumptions are essential for the evaluation metric, the choices made by the researcher may substantially affect the results. Phrased differently, we have missing data in our test set, and the score accounts for the missing data by assuming something about its distribution. Different assumptions may lead to different results.

The IPCW Brier score in Paper IV can be biased if the covariates contain information that can identify some of the censoring times. Interestingly, the bias is in favor of a binary classifier that disregards individuals as they are censored. By choosing a method based on "best performance" in terms of the IPCW Brier score, we can end up choosing a binary classifier over a method that accounts for censored observations in a reasonable manner. For someone not familiar with survival analysis, the aforementioned binary classifier can seem to be a reasonable approach. To convince this audience to move to survival methodology, we need to be able to show the benefits of survival approaches. Rather than arguing that survival methodology is better suited for time-to-event prediction than binary classifiers, we need to provide evaluation metrics that verify this empirically.

5.4.1 The Concordance Index

Recall that the concordance index is a metric for evaluating the discriminatory performance of survival estimates Ŝ(t | x). It aims at estimating the probability

\[
C(t) = P\left( \hat{S}(t \mid x_i) < \hat{S}(t \mid x_j) \mid T^*_i < T^*_j \right),
\]

which can be somewhat problematic for right-censored observations. Uno et al. (2011) and Gerds et al. (2013) handle right censoring by inverse probability of censoring weighting (IPCW) of the concordance, in a similar manner to that of the Brier score in Section 3.3.1. For administrative censoring with identifiable censoring times, this IPCW score might suffer from the same problems as the Brier score addressed in Paper IV. Therefore, a similar study to that of Paper IV is warranted and, depending on the findings, one could create an administrative concordance index by following the same line of reasoning as for the Brier score. The concordance by Antolini et al. (2005), discussed in Section 3.3.2, estimates

\[
C^{\mathrm{td}} = P\left( \hat{S}(T_i \mid x_i) < \hat{S}(T_i \mid x_j) \mid T_i < T_j,\, D_i = 1 \right),
\]

which has the benefit of giving the score as a single number instead of a function of t. Unfortunately, the Ctd is dependent on the censoring distribution, leaving the interpretation somewhat unintuitive.
It would be desirable to find a scaling, such as IPCW, that accounts for the censoring distribution in a manner that lets us instead estimate

\[
P\left( \hat{S}(T^*_i \mid x_i) < \hat{S}(T^*_i \mid x_j) \mid T^*_i < T^*_j \right).
\]

If one could manage to create such a score, it would also be reasonable to investigate the impact of administrative censoring with identifiable censoring times.

5.4.2 On the Administrative Brier Score

The administrative Brier score, presented in Paper IV, is useful when the censoring times C∗i can be identified from the covariates xi. This is because the IPCW Brier score requires a censoring distribution with SC∗(t | xi) > 0, but with sufficient information about the censoring time, we approach SC∗(t | xi) = 1{C∗i > t}. In short, the administrative Brier score BSA(t) disregards all individuals with a censoring time C∗i < t. Note that we observe the censoring time for all individuals, meaning we remove both individuals that are censored and individuals that have experienced the event prior to time t.

For future survival prediction, it is problematic to use covariates that can predict the censoring time, as this essentially means that the covariates of future individuals are different from the covariates in our training set. Extrapolation is generally hard, and we want to avoid it if possible. It is, therefore, reasonable to remove covariates that are closely connected to the censoring times. If this substantially changes the predictions, one has at least learned something about the non-stationary behavior of the data set. Note that if we are able to remove covariates closely related to the censoring time, we can use the IPCW Brier score instead, as we now have SC∗(t | xi) > 0. This will require estimation of the censoring distribution, but the interpretability of the IPCW score can be worth this disadvantage. If one prefers the IPCW scores, the administrative score can still be useful for understanding the data set and could be applied to verify that the IPCW scores are reasonable.
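A minimal sketch of the idea behind the administrative Brier score is given below. It follows the verbal description above rather than the exact definition in Paper IV (in particular, averaging over the retained individuals is my assumption), and the function and argument names are placeholders.

```python
import numpy as np

def brier_administrative(time, surv, t_obs, d, cens_times):
    """Administrative Brier score at a single time point `time` (sketch).

    surv:       S_hat(time | x_i) for each individual, shape (n,)
    t_obs, d:   observed times t_i and event indicators d_i
    cens_times: the known administrative censoring times C*_i, shape (n,)

    Individuals with C*_i < time are disregarded; for the remaining
    individuals the status 1{T*_i > time} is known.
    """
    keep = cens_times >= time
    alive = ~((d == 1) & (t_obs <= time))    # known status 1{T*_i > time}
    sq_err = (alive.astype(float) - surv) ** 2
    return sq_err[keep].mean()
```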
5.5 Time-to-Event Prediction and Machine Learning

I believe there is great potential in combining methodology from machine learning with that of survival analysis. This research field is relatively new, and the approaches discussed in this thesis are only concerned with a minor subfield of survival analysis. There are still parts of the survival literature that could benefit from machine learning methodology, and at the time of writing there are plenty of "low-hanging fruits". For example, in this chapter, we have briefly addressed the topics of left truncation, competing risks, dynamic prediction, and extensions of the concordance index.

Returning to the mortgage defaults in Paper I, this problem could benefit from a survival approach to the default modeling. Interestingly, there are actually multiple levels of delinquencies before a customer defaults on their mortgage. By considering a multistate model appropriate for the delinquencies, one could provide the model with more information than the simple binary responses of the defaults. Such multistate models have not been addressed in this thesis and would be an interesting direction for future research. For the customer churn data sets in Papers II and IV, the customers can start and stop their subscription to KKBox multiple times. This means that individuals can move between two states (customer and non-customer) multiple times. This type of data is reasonable to consider with recurrent events models or multiple-spell modeling (Tutz and Schmid, 2016, Chapter 10). This is also a widely studied topic in survival analysis that has received limited attention with machine learning methodology.

A substantial part of machine learning research compares methods by empirical evaluation on a held-out test set. For event-time prediction, there are, however, very few large data sets, and the current literature generally relies on somewhat small data sets. The resulting variance in the evaluation scores makes it hard to compare methods. In Paper II, we contribute to this issue by modifying the existing KKBox Kaggle data set, resulting in an event-time data set with more than 2 million individuals. We do, however, need more large data sets for the further advancement of the field. I hope that the research I have been a part of can help to draw some attention to this field, and maybe inspire someone to further improve on the methodology for time-to-event prediction.

Bibliography

Odd Aalen, Ørnulf Borgan, and Håkon Gjessing. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media, 2008. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985. Kate Allen. How a Toronto professor's research revolutionized artificial intelligence. The Star, Apr 2015. Laura Antolini, Patrizia Boracchi, and Elia Biganzoli. A time-dependent discrimination index for survival data. Statistics in Medicine, 24(24):3927–3944, 2005. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018. Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153–160, 2007. James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4. Austin, TX, 2010. Elia Biganzoli, Patrizia Boracchi, Luigi Mariani, and Ettore Marubini. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. Statistics in Medicine, 17(10):1169–1186, 1998. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. Charles C Brown. On the use of indicator variables for studying the time-dependence of parameters in a response-time model. Biometrics, 31(4):863–872, 1975. S C Cheng, Jason P Fine, and L J Wei. Prediction of cumulative incidence function under the proportional hazards model. Biometrics, 54:219–228, 1998. Travers Ching, Xun Zhu, and Lana X Garmire. Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data.
PLoS Computational Biology, 14(4):e1006076, 2018. François Chollet et al. Keras. https://keras.io, 2015. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015. Ronan Collobert, Samy Bengio, and Johnny Marithoz. Torch: A modular machine learning software library, 2002. David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187–220, 1972. Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011. Maha Elbayad, Laurent Besacier, and Jakob Verbeek. Pervasive attention: 2d convolutional neural networks for sequence-to-sequence prediction. arXiv preprint arXiv:1808.03867, 2018. David Faraggi and Richard Simon. A neural network model for survival data. Statistics in Medicine, 14(1):73–82, 1995. Marco Fornili, Federico Ambrogi, Patrizia Boracchi, and Elia Biganzoli. Piecewise exponential artificial neural networks (PEANN) for modeling hazard function with right censored data. In Computational Intelligence Methods for Bioinformatics and Biostatistics, pages 125–136, 2014. Michael Friedman. Piecewise exponential models for survival data with covariates. The Annals of Statistics, 10(1):101–113, 1982. Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980. 48 https://keras.io Bibliography Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR. org, 2017. Michael F Gensheimer and Balasubramanian Narasimhan. A scalable discrete- time survival model for neural networks. PeerJ, 7:e6257, 2019. Thomas A Gerds and Martin Schumacher. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biometrical Journal, 48(6):1029–1040, 2006. Thomas A Gerds, Michael W Kattan, Martin Schumacher, and Changhong Yu. Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Statistics in Medicine, 32(13):2173–2184, 2013. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15, pages 315–323, 2011. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z Ghahramani, M Welling, C Cortes, N D Lawrence, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. 
Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18):2529–2545, 1999. Frank E Harrell Jr, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating the yield of medical tests. Journal of the American Medical Association, 247(18):2543–2546, 1982. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015b. 49 Bibliography Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017. Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006. Geoffrey E Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning: Lecture 6a: Overview of mini-batch gradient descent, 2012. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. Theodore R Holford. Life tables with concomitant information. Biometrics, 32 (3):587–597, 1976. John D Holt. Competing risk analyses with special reference to matched pair experiments. Biometrika, 65(1):159–165, 1978. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989. Joseph G Ibrahim, Haitao Chu, and Liddy M Chen. Basic concepts and methods for joint models of longitudinal and survival data. Journal of Clinical Oncology, 28(16):2796, 2010. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015. Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, and Michael S Lauer. Random survival forests. Annals of Applied Statistics, 2(3):841–860, 2008. Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pages 2146–2153, Sep. 2009. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014. Søren Johansen. An extension of Cox’s regression model. International Statistical Review, 51(2):165–174, 1983. Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1):24, Feb 2018. 50 Bibliography Henry J Kelley. Gradient theory of optimal flight paths. Ars Journal, 30(10): 947–954, 1960. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 
Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017. John P Klein and Melvin L Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York, 2. edition, 2003. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. Kevin J Lang. The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, 1988. A Lawrence Gould, Mark Ernest Boye, Michael J Crowther, Joseph G Ibrahim, George Quartey, Sandrine Micallef, and Frederic Y Bois. Joint modeling of survival and longitudinal non-survival data: current methods and issues. report of the dia bayesian joint modeling working group. Statistics in Medicine, 34 (14):2181–2195, 2015. Yann LeCun. Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto, 1989. Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989. Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324, 1998. Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256. IEEE, 2010. Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 51 Bibliography Egil Martinsson. WTTE-RNN : Weibull Time To Event Recurrent Neural Network. Master’s thesis, Chalmers University Of Technology, 2016. Marvin Minsky and Seymour Papert. Perceptrons: An introduction to computational geometry. MIT press, 1969. Abdelrahman Mohamed, George Dahl, and Geoffrey Hinton. Deep belief networks for phone recognition. In Nips Workshop on Deep Learning for Speech Recognition and Related Applications, volume 1, page 39. Vancouver, Canada, 2009. Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010. Radford M Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. Wei Pan and Rick Chappell. Estimation in the Cox proportional hazards model with left-truncated and interval-censored data. Biometrics, 58(1):64–70, 2002. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 
In NIPS Autodiff Workshop, 2017. Christopher Poultney, Sumit Chopra, Yann L Cun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pages 1137–1144, 2007. Ross L Prentice, John D Kalbfleisch, Arthur V Peterson Jr, Nancy Flournoy, Vern T Farewell, and Norman E Breslow. The analysis of failure times in the presence of competing risks. Biometrics, 34(4):541–554, 1978. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019. Rajat Raina, Anand Madhavan, and Andrew Y Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 873–880. ACM, 2009. 52 Bibliography Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high- fidelity images with vq-vae-2. arXiv preprint arXiv:1906.00446, 2019. Lior Rennert and Sharon X Xie. Cox regression model with doubly truncated data. Biometrics, 74(2):725–733, 2018. Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958. David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985. David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. Jocelyn Sietsma and Robert JF Dow. Creating artificial neural networks that generalize. Neural Networks, 4(1):67–79, 1991. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015. Gerhard Tutz and Matthias Schmid. Modeling Discrete Time-to-Event Data. Springer, 2016. Hajime Uno, Tianxi Cai, Michael J Pencina, Ralph B D’Agostino, and L J Wei. On the c-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine, 30(10):1105–1117, 2011. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. 53 Bibliography Hans van Houwelingen and Hein Putter. 
Dynamic Prediction in Clinical Survival Analysis. CRC Press, 1st edition, 2011. Hans C Van Houwelingen. Dynamic prediction by landmarking in event history analysis. Scandinavian Journal of Statistics, 34(1):70–85, 2007. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. Bernard Widrow and Marcian E Hoff. Adaptive switching circuits. Technical report, Stanford Univ Ca Stanford Electronics Labs, 1960. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. Safoora Yousefi, Fatemeh Amrollahi, Mohamed Amgad, Chengliang Dong, Joshua E Lewis, Congzheng Song, David A Gutman, Sameer H Halani, Jose Enrique Velazquez Vega, Daniel J Brat, et al. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Scientific Reports, 7(11707), 2017. Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014. Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015. Yingye Zheng and Patrick J Heagerty. Partly conditional survival models for longitudinal data. Biometrics, 61(2):379–391, 2005. Xinliang Zhu, Jiawen Yao, and Junzhou Huang. Deep convolutional neural network for survival analysis with pathological images. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 544–547, 2016. Xinliang Zhu, Jiawen Yao, Feiyun Zhu, and Junzhou Huang. Wsisa: Making survival prediction from whole slide histopathological images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6855–6863, 2017. 54 Papers I 57 Expert Systems With Applications 102 (2018) 207–217 Contents lists available at ScienceDirect Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa Predicting mortgage default using convolutional neural networks Håvard Kvamme a , ∗, Nikolai Sellereite b , Kjersti Aas b , Steffen Sjursen c a Department of Mathematics, University of Oslo, Niels Henrik Abels hus Moltke Moes vei 35, Oslo 0851, Norway b Statistical Analysis, Machine Learning and Image Analysis, Norwegian Computing Center, Gaustadalleen 23a, Oslo 0373, Norway c Group Risk Modelling, DNB ASA, Dronning Eufemias gate 30, Oslo 0191, Norway a r t i c l e i n f o Article history: Received 15 August 2017 Revised 17 February 2018 Accepted 18 February 2018 Available online 19 February 2018 Keywords: Consumer credit risk Machine learning Deep learning Mortgage default model Time series a b s t r a c t We predict mortgage default by applying convolutional neural networks to consumer transaction data. For each consumer we have the balances of the checking account, savings account, and the credit card, in addition to the daily number of transactions on the checking account, and amount transferred into the checking account. With no other information about each consumer we are able to achieve a ROC AUC of 0.918 for the networks, and 0.926 for the networks in combination with a random forests classifier. © 2018 Elsevier Ltd. 
All rights reserved. 1. Introduction The ability to discriminate bad customers from good ones is im- portant for banks and other lending companies. A small improve- ment in prediction accuracy may result in a large gain in profitabil- ity. Early identification of high risk consumers may aid the pre- vention of loan defaults and help the consumers to better manage their personal economy. In credit scoring, one builds a model for the correspondence be- tween default and various loan obligor characteristics based on a relevant sample of people, and use this model to predict the prob- ability that a person will repay his debts. There is an extensive literature on credit scoring, both for as- sessing private loans ( Butaru et al., 2016; Chi & Hsu, 2012; Khan- dani, Kim, & Lo, 2010; Sousa, Gama, & Brandão, 2016 ) and corpo- rate loans ( Jones, Johnstone, & Wilson, 2015; Ravi Kumar & Ravi, 2007 ). Some recent work include Abellán and Castellano (2017) ; Chen, Zhou, Wang, and Li (2017) ; Xia, Liu, Li, and Liu (2017) , and Barboza, Kimura, and Altman (2017) . For an overview and com- parison of papers, see García, Marqués, and Sánchez (2014) and Lessmann, Baesens, Seow, and Thomas (2015) . All the papers above attempt to model delinquencies and de- faults by applying machine learning algorithms to a set of ex- planatory variables. While there are variations in the information ∗ Corresponding author. E-mail addresses: haavakva@math.uio.no (H. Kvamme), nikolai.sellereite@nr.no (N. Sellereite), kjersti.aas@nr.no (K. Aas), sasjurse@gmail.com (S. Sjursen). that is available to researchers (see e.g. Lessmann et al., 2015 ), the constructed explanatory variables tend to be quite similar. Papers typically use information from credit bureaus, such as number of outstanding accounts, delinquent accounts, and balance on other loans; individual account characteristics, such as current balance of the individuals accounts and monthly income; and demographic data, such as age and marital status. Butaru et al. (2016) also in- clude macroeconomic variables, such as interest rates and unem- ployment statistics, as an attempt to make the delinquency model generalize better over longer periods of time. As all these papers use similar explanatory variables, the re- searchers commonly explore differences between scoring models rather than the benefit of adding new explanatory variables. There are however some exceptions: Khandani et al. (2010) explore the benefit of adding information from detailed purchase volumes to their models. This includes travel expenses, gas station expenses, bar expenses, etc. Chi and Hsu (2012) also introduce consumer transaction data through an aggregated measure called average uti- lization ratio of credit. In this paper we further investigate how transaction data can be used for credit scoring. In a joint research with Norway’s largest financial service group, DNB, we use transaction data to predict mortgage defaults. In 2012, the average Norwegian made 323 card transactions, where 71% of the value transferred was through debit payments ( Norges-Bank, 2012 ). Hence, transactional data may pro- vide a useful description of user behavior, and subsequently con- sumer credit risk. The transaction information consists of the daily balances on consumers’ credit, checking, and savings accounts, in addition to https://doi.org/10.1016/j.eswa.2018.02.029 0957-4174/© 2018 Elsevier Ltd. All rights reserved. 208 H. Kvamme et al. 
/ Expert Systems With Applications 102 (2018) 207–217 the daily number of transactions on the checking account, and the amount transferred into the checking account. Our dataset is thus a collection of time series for each customer. We do not use any of the common covariates mentioned above, as our goal is to investi- gate the value of the information available in the time series data. In a production setting, our model can be combined with existing risk models to improve the overall performance. The idea behind this paper is to use deep learning, or deep neural networks ( LeCun, Bengio, & Hinton, 2015 ), to predict mort- gage defaults. Deep learning have had a dramatic impact in fields like image classification, text mining, and speech recognition. Com- mon to these three fields is that the data is unstructured. Thus, the original data (pixels, letters, words, frequencies) needs to be trans- formed into meaningful covariates before they can be passed to a classical algorithm. Until recently, such tasks have been solved by the manual creation of informative covariates from the data. Deep neural nets, on the other hand, process the data in a sequential manner, learning multiple levels of abstraction. Hence, a deep net use data to “learn” how to create good explanatory variables for the problem at hand. In this paper we apply a type of deep neural networks denoted convolutional neural networks (CNNs). To the extent of our knowl- edge, we are the first to apply CNNs to consumers’ account bal- ances to predict mortgage defaults. Transaction data has previously been used for credit scoring, see e.g. Khandani et al. (2010) . How- ever, most existing work relies on heuristic hand-crafted feature design. It is often hard for us humans to figure out appropriate features or covariates for credit scoring. Hence, the purpose of this study is to use a convolutional neural network to automate fea- ture engineering from the raw time series, in a systematic way. The learned features resulting from such a model may be viewed as the higher level abstract representation of the low level raw time se- ries signals. There have been some successful attempts applying deep neural nets to time series data in other application ar- eas. In the field of human activity recognition, Ordóñez and Roggen (2016) ; Yang, Nguyen, San, Li, and Krishnaswamy (2015) , and Hammerla, Halloran, and Ploetz (2016) use sensor data to recognize activities improving the state-of-the-art. Cui, Chen, and Chen (2016) ; Prasad and Prasad (2014) ; Zheng, Liu, Chen, Ge, and Zhao (2014) , and Le Guennec, Malinowski, and Tave- nard (2016) present network structures for time series classifi- cation across multiple domains. Also, Sirignano, Sadhwani, and Giesecke (2016) assess mortgage risk using a deep net on 294 ex- planatory variables. However, on the contrary to our work, they use multilayer perceptrons instead of CNNs, and the majority of the variables are loan-level feature and performance variables. The deep net is mainly used to identify complex interactions between the input variables. Much work in credit scoring is related to regulatory frame- works such as the Basel Accord. The Basel II regulations require transparency in the loan-granting process. Due to their non-linear structure, CNNs are usually considered as black boxes. That is, it is usually not obvious what exactly makes them arrive at their predictions. 
However, since this lack of transparency in many ap- plications is considered a major drawback, the development of methods for explaining and interpreting deep learning models has recently attracted increased attention, see e.g. Lundberg and Lee (2016) ; Samek, Wiegand, and Müller (2017) ; Shrikumar, Green- side, Shcherbina, and Kundaje (2016) , and Ribeiro, Singh, and Guestrin (2016) . Hence, it is not obvious that CNNs will remain inappropriate for building regulatory models in the future. Mean- while, a credit scoring system based on a CNN may be useful in many other applications. Financial institutions may e.g. use them in their own internal risk estimation or as part of a validation ex- ercise (Pillar 2 of the Basel framework). A lender commonly makes two types of credit decisions: first, whether to grant credit to a new applicant, and second, how to deal with existing applicants, including whether to increase their credit limits. The first problem is denoted application scor- ing and the latter behavioral scoring. The majority of previous work focus on application scoring, while prior work on behav- ioral scoring is much less developed, see e.g. Thomas (20 0 0) and Kennedy, Mac Namee, Delany, O’Sullivan, and Watson (2013) . The methodology proposed in this paper may be used both for appli- cation and behavioral scoring. Consumer transaction histories con- tain implicit repayment behavior, making them suitable for behav- ior scoring. However, as the new Revised Payment Service Direc- tive (PSD2) becomes implemented across the EU and the European Economic Area during 2018, the banks (and other lending institu- tions) will have access to transaction data even for new customers. Hence, this kind of data may also be used for application scoring. The remainder of the paper is organized as follows. In Section 2 we describe our dataset and how it was processed be- fore fitting the neural networks. In Section 3 we introduce convo- lutional neural networks, and show how they were applied to our problem. In Section 4 we evaluate the performance of the proposed algorithms. Finally, in Section 5 , we summarize our findings. 2. Data This study is a collaboration with the largest Norwegian finan- cial service group, DNB, using data from their banking services. The dataset consists of a sample of 20,989 customers who either pre- viously had a mortgage, or got approved for a mortgage at some point between January 2012 and April 2016. For every customer we have the daily balances on their checking account, savings account, and credit card, in addition to the daily number of transactions on the checking account. We also know the daily amounts that are transferred into the checking account (e.g. from salary, payments from friends, etc.). Hence, we have a set of five time series for each customer. Finally, adding the sum of checking account, savings ac- count, and credit card as a new time series, we end up with a total of six series. A summary of the six series can be found in Table 1 . For a customer to be granted a mortgage by DNB, he or she is required to open a checking account at the bank, given that this criterion is not already satisfied. There are however no require- ments as far as savings accounts and credit cards are concerned. As a result, there are many missing series in the dataset. Our definition of default is the one used in Basel II, i.e. that the obligor is past due for more than 90 days on the obligation to the bank. 
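To make the data layout concrete, the following is a small, hypothetical sketch of how the six daily series and the default label could be represented for one customer. All column names are placeholders; only the construction of the sum series and the more-than-90-days-past-due rule follow the description above.

```python
import pandas as pd

def assemble_series(df: pd.DataFrame) -> pd.DataFrame:
    """Add the sixth series (sum of the three balances) to one customer's daily data.

    `df` is assumed to be indexed by date with placeholder columns for the five
    raw series: checking, savings, credit_card, checking_transactions, into_checking.
    """
    out = df.copy()
    # The sixth series is the sum of the checking, savings, and credit card balances.
    out["balance_sum"] = out["checking"] + out["savings"] + out["credit_card"]
    return out

def is_default(days_past_due: int) -> bool:
    """Basel II-style definition used here: past due for more than 90 days."""
    return days_past_due > 90
```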
Housing prices in Norway have generally increased steadily since 2003, and thus, the mortgage market has seen few defaults ( Finanstilsynet, 2016 ). The lack of defaults poses a problem due to the fact that our algorithms need large datasets to generalize well. In addition to this, the large class imbalance (ratio between non- defaults and defaults) is commonly regarded as a problem for clas- sification algorithms. Both issues are addressed in later sections. 2.1. Training and test set We divide our data into a training set and a test set. For train- ing, we extract transaction histories from the time period Decem- ber 31, 2012 to December 31, 2013 and use the outcome (default / non-default) in the period from January 1, 2014 to January 1, 2015 as response variables. A subset of the training set was held out from training and used as a validation set for tuning of our algo- rithms. For testing, we extract transaction histories from the time period February 28, 2014 to February 28, 2015, and response vari- ables from the period March 1, 2015 to March 1, 2016. There are no customers who appear in both the training and test set, mean- ing that our study is both “out-of-time” and “out-of-sample”. Cus- H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 209 Table 1 Time series. Series Abbrev. Explanation Checking account ch Balance on the checking account. Savings account sa Balance on the savings account. Credit card cc Balance on the bank issued credit card. Checking transactions trch Number of transaction on the checking account. Into checking in Sum of transactions into the checking account. Sum sum Sum of series ch, sa, and cc. The time series used in this paper. Each time series consists of values aggregated to a daily level. Table 2 Data proportions for training, validation, and test sets. Dataset Defaults Non-defaults Total Default type Training 1298 11,398 12,696 90 or 60 days past Training augmented 95,647 841,247 936,894 90 or 60 days past Validation 329 6043 6372 90 or 60 days past Test 96 1825 1921 90 days past tomers for whom we did not have a full year of transaction data, were removed both from the training and the test set. However, we evaluate the performance of our methods on customers with miss- ing data in Section 4.5 . For a further specification of the datasets, see Table 2 . 2.2. Preprocessing time series Convolutional neural networks require the input to be scaled. Hence, for images it is common to normalize the pixels to lie in the range [0, 1] ( Goodfellow, Bengio, & Courville, 2016 ). Pixels how- ever have the benefit of all being in [0, 255] and somewhat evenly distributed in this interval. On the contrary, there are great dif- ferences in the magnitude of the accounts. If we scale all series based on the most extreme account values, most of the series will have very small variations, making it hard for our neural net to learn from the data. Therefore, the time series were scaled such that each individual series is in the range [0, 1] through x = x − min (x ) max (x ) − min (x ) , (1) where x denotes one time series. With this scaling, the neural net receives inputs of similar magnitude. However, this comes at the cost of removing the magnitude of the individual series. Hence, the network can only find information in the relative patterns of the series, and has no knowledge about the actual size of the dif- ferent accounts. Missing series were treated as accounts without any movements. 3. 
Methods While there is an extensive literature on time series classifica- tion, we have chosen to focus only on the subset of algorithms that falls under deep learning. For a review of other state-of-the-art methods, see e.g. Bagnall, Lines, Bostrom, Large, and Keogh (2016) . In the following, we will first, in Section 3.1 , introduce convolu- tional neural networks and how they were applied to our credit scoring problem. Then, in Section 3.2 , we will discuss methods used to avoid overfitting and how we approached the class imbal- ance problem. 3.1. Convolutional neural networks Deep learning arguably owes much of its success to its extent of modularity. The framework is based on combining differentiable transformations, where each such transformation is commonly re- ferred to as a layer . The way one organizes the layers is called the network architecture . When fitting such a network to a dataset, or performing the “training” as it is denoted in machine learning, it is necessary to define a loss function that can be optimized. The loss function does not necessarily need to be the objective we actually want to optimize, but rather a function that indirectly optimizes our true goal ( Goodfellow et al., 2016 ). For binary classification it is common to use the binary cross entropy Loss = − ∑ i { y i log p i + (1 − y i ) log (1 − p i ) } , (2) where y i denotes the true class label as {0, 1}, and p i denotes our prediction for customer i in [0, 1]. Eq. (2) is the same as the neg- ative log likelihood used in logistic regression. Note that the loss decreases as the p i ’s move closer to the y i ’s, hence reducing the loss should improve our objective. During training the data is passed through the network, the loss is calculated, the gradients are computed with respect to all the parameters in the network, and the parameters are optimized through some version of gradient descent. The gradients can be ob- tained in a nice way using backpropagation ( LeCun, Bottou, Orr, & Müller, 1998 ), which basically is application of the chain rule for differentiation. Backpropagation calculates the gradients in a se- quential manner, such that the gradients of each transformation can be derived solely from the error passed back from the suc- ceeding layer. The main takeaway here is that layers can be im- plemented independent of each other, simplifying the process of combining them in an architecture. Convolutional neural networks (CNNs) constitute the subset of neural networks that contain convolutional layers. However, they typically contain other layers as well. The following sections de- scribe the different types of layers that have been used in our neu- ral networks. 3.1.1. Convolutional layers The first two layers of our network are presented in Fig. 1 . The left block shows an input time series that is 365 days long. For now, assume that the dataset only consist of a single time series per customer (instead of six). A convolutional layer consists of multiple filters that are applied to the input series. A filter is simply a vector (or matrix) consisting of weights that need to be optimized. The first convolutional layer has filters of length 9, as shown in the figure. The application of such a filter to the input series is just a sum over the element- wise product of the filter weights and the time series, resulting in 210 H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 Fig. 1. The first convolutional layer and first pooling layer of our model. 
The left block is the input series. The gray area marks the part of the series for which a convolutional filter is currently being applied. The middle block shows the convolved features (or activations) resulting from the 32 different convolutional filters. The right block shows the result of applying max pooling to the middle block. a convolved feature (or activation). A convolved feature is there- fore a measure of the correlation between the filter and the rele- vant part of the input. The middle block of Fig. 1 displays all the convolved features from the input, using 32 different filters. Each filter is slided across the input to generate as many convolved fea- tures as the length of the time series 1 . Due to the fact that the 32 filters are randomly initialized, they end up extracting different information from the input. After applying the filters to a time series, the convolved features are passed through a ReLU transform, which is simply max {0, x }. Such activation functions are applied to make the transformation non-linear, and they are commonly considered a part of the convo- lutional layer rather than a separate layer. See Gu et al. (2017) for a discussion of other activation functions. 3.1.2. Max pooling layers A convolutional layer preserves the resolution in the temporal dimension. There can however be some benefits of reducing the temporal resolution, such as reducing the number of parameters in the next layer, and introducing some shift-invariance ( Gu et al., 2017 ). As a result, many CNNs include so-called max pooling lay- ers, which simply replace a set of values with their maximum. Note that this layer does not introduce any additional parameters. We use max pooling after each convolutional layer. An example is shown to the right in Fig. 1 , where four values of each column of the convolved features are pooled into one. As shown in the fig- ure, pooling is applied separately for each feature series. Thus, the pooling layer preserves the number of feature series, while reduc- ing the temporal resolution. See Gu et al. (2017) for a discussion of other pooling layers. 3.1.3. Fully connected layers To perform the actual classification, it is necessary to combine the convolved features into a binary classifier. This is typically done through one or more fully connected (or dense ) layers, which form 1 Zero-padding is used to preserve the temporal dimension ( Goodfellow et al., 2016 ). a multilayer perceptron (a classical neural net). A dense layer re- quires a one-dimensional input, so the columns of the last layer are concatenated into a single vector x . The transformation is then simply the inner product with a weight matrix, z = x T W, with some non-linear activation function (typically ReLU). The number of columns in W determines the dimension of the output. Our fi- nal layer has only two outputs, one for each class, and therefore a softmax activation function is applied to ensure that the predic- tions are in the interval [0, 1]. Softmax (z) c = e z c ∑ j e z j , (3) where c and j refers to classes. Note that the softmax can be ap- plied for an arbitrary number of classes, but for binary classifica- tion it it equivalent to the logit function. As a result, in combina- tion with our choice of loss function, this final layer is equivalent to a logistic regression. 3.1.4. Network architecture and training In our network architecture we alternate between a convolu- tional layer and a max pooling layer two times, followed by two fully connected layers. 
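As a rough illustration of such a stack, the following PyTorch-style sketch alternates convolution and max pooling twice and ends with two fully connected layers. Only the first block's 32 filters of length 9, the pooling of four values, the dropout rate of 0.5, and the two-class softmax output are taken from the text; the second block's filter count and the hidden dense width are placeholders, so the parameter count will not match the 199,234 weights of the architecture specified in Fig. 2.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=32, kernel_size=9, padding=4),  # zero-padding keeps the 365-day length
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=4),          # 365 -> 91
    nn.Conv1d(32, 32, kernel_size=9, padding=4),  # placeholder filter count for the second block
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=4),          # 91 -> 22
    nn.Flatten(),                         # 32 channels x 22 time steps = 704 features
    nn.Linear(32 * 22, 64),               # placeholder hidden width
    nn.ReLU(),
    nn.Dropout(p=0.5),                    # dropout between the two last layers
    nn.Linear(64, 2),                     # two outputs; the softmax is folded into the loss
)

loss_fn = nn.CrossEntropyLoss()           # softmax + cross entropy, cf. Eqs. (2)-(3)
scores = model(torch.randn(8, 1, 365))    # a batch of 8 customers, one 365-day series each
loss = loss_fn(scores, torch.randint(0, 2, (8,)))
```

With this padding choice each convolution preserves the 365-day temporal resolution, so only the pooling layers reduce the temporal dimension, mirroring the description of the layers above.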
For an input of length 365 days, this results in 199,234 weights that need to be optimized. The whole structure is displayed in Fig. 2 , were we also show the length of the filters and number of filters for the convolutions, the size of the pooling kernels, and the number of output nodes in the dense layers. This architecture was developed manually by evaluating on a validation set 2 , and the hyper parameter search was therefore not as rigor- ous as with e.g. a grid search. However, it was found that a variety of model configurations resulted in quite similar performance. In particular, further increasing the number of nodes, filters, or layers did not really seem to affect the performance. We train one network for each of our six time series, com- pletely independently, using the Adam optimization algorithm ( Kingma & Ba, 2014 ) with default parameters and batch size of 512. The resulting six predictions are averaged to give a final prediction 2 Note that the validation set is not the same as the test set. H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 211 Fig. 2. Our convolutional neural net. The blue layers are convolutional layers and the yellow layers are fully connected layers. All activations are ReLU, with the ex- ception of softmax in the final layer. Dropout, with rate 0.5, is performed between the last two layers. (For interpretation of the references to color in this figure leg- end, the reader is referred to the web version of this article.) for each customer. When computing the averages, predictions cor- responding to missing series are discarded. We also experimented with averaging the log-odds instead of the predictions, which re- sulted in a slight performance decrease. We also tried an alternative model where we train a network on all six series simultaneously. Thus, the input is 365 × 6 rather than 365 × 1, meaning that the first convolutional filter will have size 9 × 6, rather than 9 × 1. Apart from this, the architecture is identical to the architecture for the individual series. By combin- ing all series in the input, the network has the potential to learn interactions between a customer’s time series. However, as this model is more complex, it will require more data than the pre- vious model. In addition to the two models presented here, we have also tried deeper and more shallow models, as we will get back to in Section 4.2 . In early stages of the project, we even experimented with recurrent neural networks in the form of Long Short-Term Memory networks (LSTM). They were found to be outperformed by the CNNs, in addition to being much more computationally ex- pensive. Hence, they were dropped from further investigation. 3.2. Overfitting and class imbalance In the following sections we explain the techniques we have used to prevent the models from overfitting and how we handled the large class imbalance between defaulting and non-defaulting customers. 3.2.1. Regularization Deep neural networks are commonly overparameterized, mak- ing them prone to overfitting ( Gu et al., 2017 ). Subsequently, regu- larizing techniques have been introduced to cope with these prob- lems. Among the more common is the use of a validation set for monitoring the training process. We used this set to stop the gradi- ent descent iterations when the net starts to overfit. This technique is commonly referred to as early stopping . 
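A generic, hedged sketch of such validation-based early stopping for a PyTorch-style model is given below. Here train_one_epoch and validation_loss are hypothetical helpers standing in for the usual mini-batch training and evaluation code, and the patience parameter is a common convention rather than a detail taken from the paper.

```python
import copy

def fit_with_early_stopping(model, train_data, val_data, max_epochs=200, patience=10):
    """Stop the gradient descent iterations once the validation loss stops improving."""
    best_val = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)           # hypothetical helper: one epoch of mini-batch gradient descent
        val_loss = validation_loss(model, val_data)  # hypothetical helper: loss on the held-out validation set
        if val_loss < best_val:
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                                # the net has started to overfit
    model.load_state_dict(best_state)                # restore the weights that performed best on validation
    return model
```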
Another common form of regularization is dropout ( Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014 ), which simply sets activations to zero with a given probability dur- ing training. This prevents units in the network from co-adapting too much. At test time, the dropout layer scales the activations according to the dropout rate. We use dropout with rate 0.5 between the two last layers, but it can in practice be applied between any two layers. See Goodfellow et al. (2016) and Gu et al. (2017) for a further review of regularization. 3.2.2. Data augmentation The problem of predicting mortgage defaults is highly imbal- anced in that the number of defaults is very small compared to the number of non-defaults. As a classifier’s performance commonly deteriorates under imbalance, we under-sample the non-defaulting customers to make the two classes more similar in size. More- over, to increase the number of defaults in the training set, we use delinquencies of 60 days in addition to those of 90 days. Hence, we ended up with a training set consisting of 12,696 customers, 10% of whom where characterized with default. Deep neural nets commonly require more data than classi- cal machine learning algorithms to perform well. As a result, data augmentation schemes have been explored in the literature ( Goodfellow et al., 2016; Krizhevsky, Sutskever, & Hinton, 2012 ). Most of this work has been on images, and does not necessar- ily generalize well to other data sources. However, recently some attempts of performing data augmentation on time series have been proposed. Le Guennec et al. (2016) explore the possibilities of training a CNN in a semi-supervised manner using time series from other datasets. Cui et al. (2016) propose a sliding window technique where they split the time series in many overlapping slices, and train a classifier using each slice as an individual series. This is in some sense similar to other time series prediction tasks where the dataset originally consists of many overlapping series of data (e.g. Ordóñez & Roggen (2016) ). In this paper, we use a slight variation of the sliding windows approach of Cui et al. (2016) . Our approach can be described as fol- lows. For each defaulting customer we first define a training period which ends one month before the default, and starts two years ear- lier. If the customer did not exist in the dataset at the start of this period, the start was set to the earliest date the customer existed. We then extract the time series corresponding to the first 365 days of the training period and use that as an observation. By jumping a fixed number of days, called stride in the machine learning lit- erature, we can get a new observation that is partially overlapping with the previous observation. This process is repeated until the end of our defined training period is reached. In this way, we get a training set consisting of multiple overlapping series from each de- faulting customer. This augmentation scheme is illustrated in Fig. 3 . The gray area shows the defined training periods for three different customers, and the red dots mark the time of their defaults. For id: 2 we show three of the observations (windows) that are added to the training set. Note that the first of the corresponding periods ends one year before the default occurs, while the last ends one month before. Thus both these observations, and all in between, fit our objective of predicting a default in the following 365 days. 
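A minimal sketch of this window extraction for a single defaulting customer is given below, together with the per-series [0, 1] scaling of Eq. (1) applied to each window (one reasonable reading of the preprocessing). The one-month offset is taken as 30 days, and the zero-division guard for constant series is an addition not taken from the paper.

```python
import numpy as np

def scale_01(x):
    """Eq. (1): scale a series to [0, 1]; constant series (e.g. missing accounts) become all zeros."""
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x, dtype=float)

def sliding_windows(series, default_day, window=365, stride=5):
    """Extract overlapping 365-day windows for one defaulting customer.

    The training period ends one month (taken here as 30 days) before the default
    and starts two years (730 days) earlier, truncated to the start of the series;
    windows of length `window` are taken every `stride` days within this period.
    """
    end = default_day - 30
    start = max(0, end - 730)
    windows = []
    left = start
    while left + window <= end:
        windows.append(scale_01(series[left:left + window]))
        left += stride
    return np.stack(windows) if windows else np.empty((0, window))
```

With a stride of 5 and a full two-year training period, this gives about 74 windows per defaulting customer, which is roughly consistent with the 74-fold increase from 12,696 to 936,894 rows reported for the augmented training set in Table 2.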
We also need to perform a similar augmentation for the non- defaulting individuals. Our first experiments showed that sampling non-defaulting and defaulting customers from different time peri- ods made our net learn different characteristics of these periods instead of the actual objective. Hence, to avoid this sampling bias, we had to sample the start and end dates for the non-defaulting customers from the ones of the defaulting customers. Ideally one would use strides of 1 as this would give most diversity in the training data. However, our preliminary analysis 212 H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 Fig. 3. Illustration of our data augmentation scheme. The gray area identifies the time period from which we extract data, and the red dots show the time of default. The sliding window approach is illustrated for id: 2. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) showed that strides of 1 did not give better results than strides of 5. Hence, since the memory usage of the latter is only a fifth of the first, we decided to use strides of 5 in our final analysis. In Table 2 we have shown the size of our training set, the train- ing set augmented with the sliding windows, the validation set, and the test set. The financial group DNB does not want to dis- close its true default ratio. Hence, to generate the test set, 20 0 0 customers were randomly sampled in such a way that the fraction of defaulting customers was 5%. After removing customers with missing history, we ended up with a test set of 1921 customers. The size of the test set is quite small, which results in larger vari- ance in our results. Note however that we have experimented with different test sets from different time periods, and the results seem to be quite consistent. 4. Experiments In the following, we first present the measures used to evaluate the performance of our models. We then describe our experiments and their results. 4.1. Evaluation measures The Receiver Operating Characteristic (ROC) curve is a com- mon measure for evaluating a classifier’s ability to separate two classes. The ROC curve is a plot of the true positive rate (for de- fault) against the false positive rate, for all thresholds. A threshold is the chosen cut-off in the estimated scores from the neural net. Since the ROC curve is based on the true positive rates and the false positive rates, it is invariant to class proportions. Moreover, it is not dependent on the quality of the predictions (probability estimates), only on the classifier’s ranking capabilities. ROC curves are often summarized through the area under the curve, or AUC. With perfect classification, the AUC will be 1, while random predictions result in an AUC of 0.5 on average. AUC is a general measure of binary classification, and is widely used across disciplines. There are however more specialized mea- sures for evaluating a credit scoring model. One such measure is the Expected Maximum Profit (EMP) ( Verbraken, Bravo, Weber, & Baesens, 2014 ), which is a profit based performance measure ac- counting for the benefits generated by healthy loans and the costs caused by loan defaults. We have chosen not to use EMP in this paper. The main reason is that this measure is heavily dependent Table 3 AUC for different CNN architectures. Model Mean AUC Std AUC AUC for average pred. 
Full, 1 conv layer 0.8884 0.0055 0.8979 Full, 2 conv layers 0.8879 0.0041 0.8982 Full, 3 conv layers 0.8897 0.0048 0.9015 Indiv, 1 conv layer 0.9022 0.0022 0.9043 Indiv, 2 conv layers 0.9146 0.0025 0.9184 Indiv, 3 conv layers 0.9106 0.0033 0.9148 AUC mean and standard deviation for 10 experiments for each of six differ- ent CNNs. “Full” refers to the approach with all time series as input, and “Indiv” to the approach where we compute the average of the predictions from six individually trained models. The rightmost column gives the AUC of the predictions averaged over the 10 trained models. on the class proportions. As stated earlier, we have used a fictive default rate, since the bank did not want to disclose its true ratio. Hence, it does not make any sense to use a measure that signifi- cantly changes when the class proportions are altered. Our convolutional neural network is initialized with ran- dom weights. Moreover, we use dropout by multiplying random Bernoulli variables with the input to the last layer, and the net is trained with stochastic gradient descent. As a result, the network will be slightly different every time it is trained. Hence, to control how our results are influenced by randomness, we repeat all ex- periments 10 times and report the average and standard deviation of the 10 resulting AUC values. 4.2. Competing architectures In the following section, the two CNNs introduced in Section 3.1.4 are compared and evaluated. The architectures of the two models are identical, except that the first takes individual time series as input ( Indiv model ), while all six time series are input to the latter ( Full model ). Note that all training sets are augmented as described in Section 3.2.2 . Summary statistics for the AUCs are shown in Table 3 under the names Indiv, 2 conv layes and Full, 2 conv layers . We see that the individual model is clearly better than the full model. This might indicate that the full model is not able to take advantage of interactions between the series, or at least that the added benefit is smaller than the downside of the added complexity of the model. H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 213 Table 4 AUC for individual series. Series Mean AUC Std NM Mean AUC NM Std Prop Missing ch 0.8632 0.0040 0.8646 0.0036 0.0068 in 0.7116 0.0108 0.7140 0.0107 0.0068 cc 0.7636 0.0080 0.8099 0.0101 0.2405 sa 0.7902 0.0056 0.8178 0.0039 0.1463 sum 0.8752 0.0078 0.8752 0.0078 0 trch 0.8472 0.0064 0.8480 0.0063 0.0068 All 0.9146 0.0025 0.9146 0.0025 0 In the NM columns (Not Missing) we have removed customers having missing series or series without any activity. “Prop missing” gives the proportion of series removed from the test set in NM. The AUC is averaged over 10 experiments. The “All” series gives the score of the averaged predictions (Indiv model). Table 5 AUC for different subsets of the training data. Proportion Mean AUC Std AUC 1.00 0.9146 0.0026 0.90 0.9109 0.0048 0.75 0.9086 0.0054 0.50 0.9028 0.0033 0.25 0.8954 0.0082 0.10 0.8827 0.0062 Non-augmented 0.8880 0.0042 For each subset, we sample the customers from the training set, and use sliding window for data augmentation. The left- most column shows the proportion of customers sampled from the training set, while the remaining columns show the average and standard deviation of AUC for the 10 repetitions. 1.00 gives the results for the full training set (Indiv model), while “Non- augmented” refers to using all customers, but with no data aug- mentation. 
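For reference, the summary columns reported in these tables ("Mean AUC", "Std AUC", and "AUC for average pred.") can be computed from the predictions of the 10 repeated trainings roughly as in the hedged sketch below, where all_predictions is a hypothetical array of shape (number of repetitions, number of customers) and y_true holds the default labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def summarize_runs(all_predictions, y_true):
    """Mean and standard deviation of AUC over repetitions, plus the AUC of the averaged predictions."""
    all_predictions = np.asarray(all_predictions)
    aucs = np.array([roc_auc_score(y_true, preds) for preds in all_predictions])
    auc_of_average = roc_auc_score(y_true, all_predictions.mean(axis=0))
    return aucs.mean(), aucs.std(), auc_of_average
```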
Table 6 AUC for different lengths of time series. Model Mean AUC Std AUC AUC for average pred. 365 days 0.9146 0.0026 0.9184 180 days 0.8945 0.0051 0.8981 91 days 0.8744 0.0051 0.8791 31 days 0.8425 0.0043 0.8460 The “Mean AUC” is averaged over 10 repetitions. The “AUC for average pred.” gives the AUCs for the averaged predictions of the 10 repetitions of the same model. All models have the same ar- chitecture as the 365 days CNN. Table 3 also includes four other models. The only difference be- tween the models is the number of convolutional layers (and pool- ing layers), and whether the input consist of individual series or all six series simultaneously. For the full specification of the models, see Fig. A.1 in the Appendix. From the table we see that all the individual models have higher average AUC than the full models. Additionally, it seems that there is no reason to have more than two convolutional layers in the networks. Averaging the predictions from six individually trained models has a regularizing effect. To check whether the differences in per- formance between the full models and the individual models are caused by this averaging, we also, for each model, compute the AUC for the average of the 10 predictions for each customer. The results are shown in the rightmost column of Table 3 (“AUC for average pred.”) We see that this improves the AUC for all models. The individual models are still better than the full models, but the difference seems to be smaller. In the remainder of this section we concentrate on the Indiv model, 2 conv layers , since this is the one that shows the best per- formance in Table 3 . Table 7 AUC for different length of series. Model Mean AUC Std AUC AUC for average pred. RF 365 days 0.9130 – – RF 31 days 0.8907 – – Combined 365 days 0.9254 0.0011 0.9260 Combined 31 days 0.8941 0.0 0 06 0.8946 RF is a random forests classifier with covariates extracted from the unscaled time series. Combined gives the averaged prediction of the CNN and the random forests classifier. The results for the combined classifier are aver- aged over 10 repetitions of the CNN. The “AUC for average pred.” gives the AUC for the averaged predictions of the 10 repetitions of the CNN. 4.3. Importance of different series Our model creates predictions based on six different time se- ries: checking account balance (ch), amount transferred into check- ing account (in), credit card balance (cc), savings account balance (sa), the sum of checking, saving and credit card balances (sum), and number of transactions on the checking account (trch). To in- vestigate how much information there is in each series, we report their individual AUC values in Table 4 . As before, we report the re- sults averaged over 10 experiments. The checking account (ch and trch) and the sum seem to be most informative, while the amount transferred into the checking account (in) provides less informa- tion. In Section 2 it was stated that missing savings accounts and credit card accounts are regarded as accounts without any move- ments. This means that the comparison between the different se- ries in the two leftmost columns of Table 4 might not be com- pletely fair. Hence, we did a new experiment in which we re- moved all customers missing the relevant time series from the test set. The results are shown in columns 3 and 4 of the table (NM Mean and NM Std), while the proportions of removed customers is shown in column 5. 
From the table we see that the AUC for credit card accounts (cc) and savings accounts (sa) significantly increase, while there are small differences for the other series. This is rea- sonable, as the proportion of missing data is largest for the credit cards and savings accounts. 4.4. Training data In this section we investigate how the size of the training set affects the performance of our classifier. This can help us un- derstand to what extent more training data would improve our discriminator. We also evaluate the effect of our augmentation scheme. Table 5 shows AUC for models trained on different subsets of the training data. The subsets were created by randomly sampling customers. For a given proportion of the full training set we train 10 models in the same way as previously for the full training set. Investigating the table, the performance seems to be improved with the size of the training data. This suggests that with more data we might be able to obtain even better results than those pre- sented in this paper. The last row of the leftmost column in Table 5 shows the re- sults of fitting the same model to the original (not augmented) training set. We see that using 10% to 25% of the original customers with augmented data results in the same performance as using all customers with non-augmented data. 4.5. Length of time series To investigate to what extent lack of information affects the performance of our model, we have also fitted our CNN to series of different lengths, More specifically, we used the last month (31 214 H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 Table 8 Evaluation metrics. Model Threshold Accuracy Sensitivity Specificity AUC Brier Score H-measure CNN 0.3 0.954 0.374 0.985 0.915 0.0381 0.564 Combined 0.3 0.947 0.527 0.970 0.925 0.0364 0.592 LR 0.3 0.910 0.490 0.935 0.864 0.0458 0.455 MLP (512) 0.3 0.899 0.590 0.918 0.875 0.0550 0.484 RF 0.3 0.920 0.602 0.940 0.913 0.0386 0.557 CNN 0.5 0.953 0.064 0.999 0.915 0.0381 0.564 Combined 0.5 0.956 0.177 0.997 0.925 0.0364 0.592 LR 0.5 0.943 0.219 0.981 0.864 0.0458 0.455 MLP (512) 0.5 0.927 0.455 0.953 0.875 0.0550 0.484 RF 0.5 0.957 0.332 0.991 0.913 0.0386 0.557 Different evaluation metrics for two different threshold values. All results are based on the time series with length 365 days. For each model we have averaged the metrics over 10 trained models. CNN is the best- performing CNN from Table 3 , LR is a logistic regression model, MLP is a Multilayer perceptron with one hidden layer and 512 nodes, RF is the random forest model, and Combined is the combined model in which we averaged the predictions from the CNN and the RF. Fig. 4. ROC curves for the CNN, the RF, and their averaged predictions model. The time series are the full 365 day long series. The CNN comes from the averaged predictions over the 10 experiments. Fig. 5. ROC curves for the CNN, the RF, and their averaged predictions model. The time series used are contain only the last 31 days of data. The CNN comes from the averaged predictions over the 10 experiments. H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 215 days), the last quarter (91 days), and the two last quarters (182 days) of the original dataset to train the same model as previously. The results are shown in Table 6 . As can be seen, there is a clear drop in performance when the time series get shorter. On the downside, this makes it harder to evaluate customers with a short relationship to the bank. 
However, the results also suggest that it might be possible to increase the performance of our models by applying them to time series longer than 365 days. 4.6. Classical covariates As described in Section 2.2 , all time series are scaled to be in [0, 1] before fitting the CNN. This kind of scaling means that in- formation about the magnitude of the series is lost. To assess the potential decrease in results connected to this removal of relevant information, we fit a random forests classifer (RF), consisting of 800 trees, to explanatory variables created from the original, un- scaled time series. RF is often credited as a very strong classifier outperforming several alternative methods (including Support Vec- tor Machines and Neural networks) when applied to credit scor- ing ( Brown & Mues, 2012; Kruppa, Schwarz, Arminger, & Ziegler, 2013; Lessmann et al., 2015 ). For each of the six time series, we computed the mean, max, minimum, standard deviation, and the standard deviation scaled by the mean. These covariates were com- puted both for the full series and for the last month (31 days). In addition, we divided each of the covariates for the full series by the equivalent covariate calculated for the last month, producing a total of 15 features for each series. The resulting AUC is shown in the upper row of Table 7 . We also fitted a logistic regression model and a Multilayer perceptron (with one hidden layer and 512 nodes) using the same explanatory variables. The AUC values for these models were significantly worse (0.86 and 0.88 see Table 8 ) than that for the RF, confirming the observations from previous studies. Hence, we decided not to include these models in the fur- ther comparisons. Comparing the numbers in the upper rows of Tables 6 and 7 we see that for the 365 days period, the AUC for the CNN is slightly better than that for the RF. We also used a combined model in which we averaged the predictions from the CNN and the RF. From the third row of Table 6 we see that there is a small in- crease in performance compared to that obtained using one of the models only. The ROC curves for the CNN, the RF and the com- bined model are shown in Fig. 4 . Running the diagnostic test for comparing ROC curves described in DeLong, DeLong, and Clarke- Pearson (1988) verifies that there are no significant differences be- tween the AUCs for any pair of ROC-curves. Table 8 gives the values of 6 evaluation metrics for each of the 5 models. In addition to Accuray, Sensitivity, Specificity and AUC, the Brier Score ( Brier, 1950 ) and the H-measure ( Hand, 2009; 2010 ) are reported. The AUC is sometimes criticised for treating the relative severities of misclassifications differently when different classifiers are used. The H-measure, on the other hand, may accommodate expert knowledge regarding misclassification costs, whenever that is available. We want to treat misclassifications of the smaller class as more serious than those of the larger class, since otherwise very little loss would be made by assigning everything to the larger class. Hence, we set the cost of misclassifying a bad customer as healthy to 0.95 and the cost of misclassifying a good customer as defaulter to 0.05. We have also compared the three methods using data only from the last 31 days. The AUC-values are given in the last row of Table 6 and rows two and four in Table 7 , respectively, while the corresponding ROC-curves are shown in Fig. 5 . 
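For concreteness, the 15 hand-crafted covariates per series used for the random forests baseline (five summary statistics on the full unscaled series, the same five on the last 31 days, and the ratios between the two) can be sketched as below; the small epsilon guards against division by zero are additions not taken from the paper.

```python
import numpy as np

def summary_stats(x):
    """Mean, max, minimum, standard deviation, and standard deviation scaled by the mean."""
    mean, std = x.mean(), x.std()
    return np.array([mean, x.max(), x.min(), std, std / (mean + 1e-8)])

def rf_covariates(series, last_month=31):
    """The 15 covariates for one unscaled series: full-series statistics,
    last-month statistics, and the ratio of the two."""
    full = summary_stats(np.asarray(series, dtype=float))
    recent = summary_stats(np.asarray(series[-last_month:], dtype=float))
    return np.concatenate([full, recent, full / (recent + 1e-8)])
```

Applied to each of the six series, this gives 90 covariates per customer, which are then fed to the random forests classifier.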
As can be seen from the tables and the figure, the performance of the RF does not degrade as much as that of the CNN when decreasing the length of the time series to 31 days. The p -value of the test for comparing the two ROC-curves is 0.005, meaning that the AUC for the RF is significantly higher than that for the CNN in this case. 5. Discussion The ability to discriminate bad customers from good ones is very important for banks and other lending companies. There is a large literature on methods used to predict defaults of consumer loans. Status quo in both the industry and the academic litera- ture is to use models with handcrafted explanatory variables. In contrast, we use a highly nonlinear convolutional neural network (CNN), where the input is raw account transactional data. More specifically, we have used daily balances of customers’ checking ac- counts, savings accounts, and credit card accounts, in addition to daily number of transaction on the checking accounts, and amount transferred into the checking accounts to predict mortgage delin- quency. A CNN has a number of hyperparameters which need to be cho- sen, such as the number of layers, type of layers, and the type of regularization. Our best performing CNN with two convolutional and two fully connected layers achieved an AUC of 0.915. Consider- ing the fact that our training set only consists of 12,696 customers, and that we do not use any other information than the account data, the results are quite promising. In our analysis, we found that the AUC for our CNN was in- creasing with the size of the training set, suggesting that a larger dataset might result in even higher performance. In addition, the performance increased with the length of the time series, indicat- ing that using a time series that is longer than 365 days may give better results. Comparing the predictions from a single CNN with ensemble predictions from 10 randomly initialized models showed that the increase in AUC using several model was limited. By using transaction data only, one throws away a lot of other informative data that is typically used for credit scoring, e.g. the loan balance, socioeconomic data, payment history, and credit bu- reau data. Further, all time series are scaled to be in the interval [0,1] before fitting the CNN, meaning that information about the magnitude of the time series is lost. To account for the scaling, a random forests classifier was fitted to covariates extracted from the unscaled series, resulting in an AUC of 0.913. By combining the predictions of the CNN and the random forests, we achieved an AUC of 0.925. Future research directions could include the combination of our model with other credit scoring models that use more clas- sical credit information; applying our model on a larger dataset, as CNNs tend to perform better with more data; and using more granular data in the form of labeled transactions (e.g. groceries, gas, rent, vacation). Acknowledgments This work was supported by The Norwegian Research Council 237718 through the Big Insight Center for research-driven innova- tion. We thank Ørnulf Borgan for comments and suggestions on the paper. We also thank Sven Haadem for his help on creating a dataset from DNB’s mortgage portfolio. Appendix A Fig. A.1 . 216 H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 Fig. A.1. Our three convolutional neural net architectures for single time series. We also have a version of each model with 365 × 6 input. 
The blue layers are convolutional layers and the yellow layers are fully connected layers. All activations are ReLU, with the exception of softmax in the final layer. Dropout, with rate 0.5, is performed between the last two layers. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) References Abellán, J., & Castellano, J. G. (2017). A comparative study on base classifiers in en- semble methods for credit scoring. Expert Systems with Applications, 73 , 1–10. doi: 10.1016/j.eswa.2016.12.020 . Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2016). The great time series classification bake off: A review and experimental evaluation of re- cent algorithmic advances. Data Mining and Knowledge Discovery . doi: 10.1007/ s10618- 016- 0483- 9 . Barboza, F., Kimura, H., & Altman, E. (2017). Machine learning models and bankruptcy prediction. Expert Systems with Applications, 83 , 405–417. doi: 10. 1016/j.eswa.2017.04.006 . Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78 (1), 1–3 . Brown, I. , & Mues, C. (2012). An experimental comparison of classification algo- rithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39 (3), 3446–3453 . Butaru, F., Chen, Q., Clark, B., Das, S., Lo, A. W., & Siddique, A. (2016). Risk and risk management in the credit card industry. Journal of Banking & Finance, 72 , 218– 239. doi: 10.1016/j.jbankfin.2016.07.015 . Chen, X. , Zhou, C. , Wang, X. , & Li, Y. (2017). The credit scoring model based on lo- gistic-bp-adaboost algorithm and its application in p2p credit platform. In X. Li, & X. Xu (Eds.), Proceedings of the fourth international forum on decision sciences (pp. 119–130). Singapore: Springer Singapore . Chi, B.-W., & Hsu, C.-C. (2012). A hybrid approach to integrate genetic algorithm into dual scoring model in enhancing the performance of credit scoring model. Expert Systems with Applications, 39 (3), 2650–2661. doi: 10.1016/j.eswa.2011.08. 120 . Cui, Z., Chen, W., & Chen, Y. (2016). Multi-Scale Convolutional Neural Networks for Time Series Classification. CoRR http://arxiv.org/abs/1603.06995 . DeLong, E. R. , DeLong, D. M. , & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A non- parametric approach. Biometrics, 44 (3), 837–845 . Finanstilsynet (2016). Finansielt utsyn 2016. http://www.finanstilsynet.no/Global/ Venstremeny/Rapport/2016/Finansielt _ utsyn _ 2016.pdf . Accessed: 2017-02-06. García, V., Marqués, A. I., & Sánchez, J. S. (2014). An insight into the experimental design for credit risk and corporate bankruptcy prediction systems. Journal of Intelligent Information Systems, 44 (1), 159–189. doi: 10.1007/s10844- 014- 0333-4 . Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning . MIT Press . http:// www.deeplearningbook.org . Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., & Wang, G. (2017). Recent Advances in Convolutional Neural Networks. CoRR http: //arxiv.org/abs/1512.07108 . Hammerla, N. Y., Halloran, S., & Ploetz, T. (2016). Deep, Convolutional, and Recur- rent Models for Human Activity Recognition Using Wearables. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (pp. 1533– 1540). New York, New York, USA: AAAI Press. http://dl.acm.org/citation.cfm?id= 3060832.3060835 . Hand, D. J. (2009). 
Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77 (1), 103–123. doi: 10.1007/ s10994- 009- 5119- 5 . Hand, D. J. (2010). Evaluating diagnostic tests: The area under the ROC curve and the balance of errors. Statistics in Medicine, 29 (14), 1502–1510. doi: 10.1002/sim. 3859 . Jones, S., Johnstone, D., & Wilson, R. (2015). An empirical evaluation of the perfor- mance of binary classifiers in the prediction of credit ratings changes. Journal of Banking & Finance, 56 , 72–85. doi: 10.1016/j.jbankfin.2015.02.006 . Kennedy, K. , Mac Namee, B. , Delany, S. J. , O’Sullivan, M. , & Watson, N. (2013). A window of opportunity: Assessing behavioural scoring. Expert Systems with Ap- plications, 40 (4), 1372–1380 . Khandani, A. E., Kim, A. J., & Lo, A. W. (2010). Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance, 34 (11), 2767–2787. doi: 10.1016/j.jbankfin.2010.06.001 . Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. CoRR http://arxiv.org/abs/1412.6980 . Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bot- tou, & K. Q. Weinberger (Eds.), Advances in neural information processing sys- tems 25 (pp. 1097–1105). Curran Associates, Inc. http://papers.nips.cc/paper/ 4824- imagenet- classification- with- deep- convolutional- neural- networks.pdf . Kruppa, J. , Schwarz, A. , Arminger, G. , & Ziegler, A. (2013). Consumer credit risk: In- dividual probability estimates using machine learning. Expert Systems with Ap- plications, 40 (13), 5125–5131 . Le Guennec, A. , Malinowski, S. , & Tavenard, R. (2016). Data augmentation for time series classification using convolutional neural networks. In Proceedings of the ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data ,Riva Del Garda, Italy . LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 (7553), 436– 4 4 4. doi: 10.1038/nature14539 . LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade , 9–50. doi: 10.1007/3- 540- 49430- 8 _ 2 . Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state- of-the-art classification algorithms for credit scoring: An update of research. Eu- ropean Journal of Operational Research, 247 (1), 124–136. doi: 10.1016/j.ejor.2015. 05.030 . H. Kvamme et al. / Expert Systems With Applications 102 (2018) 207–217 217 Lundberg, S., & Lee, S. (2016). An unexpected unity among methods for interpreting model predictions CoRR abs/1611.07478 . Norges-Bank (2012). Årsrapport om betalingssystem 2012. http://static.norges-bank. no/pages/94894/Betalingssystem _ 2012 _ o.pdf?v=27052013101201&ft=.pdf . Ac- cessed: 2017-02-09. Ordóñez, F., & Roggen, D. (2016). Deep convolutional and LStm recurrent neu- ral networks for multimodal wearable activity recognition. Sensors, 16 (1), 115. doi: 10.3390/s16010115 . Prasad, S. C., & Prasad, P. (2014). Deep Recurrent Neural Networks for Time Series Prediction. CoRR http://arxiv.org/abs/1407.5949 . Ravi Kumar, P., & Ravi, V. (2007). Bankruptcy prediction in banks and firms via sta- tistical and intelligent techniques – A review. European Journal of Operational Research, 180 (1), 1–28. doi: 10.1016/j.ejor.2006.08.043 . Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why should i trust you?: Explaining the predictions of any classifier CoRR abs/1602.04938 . 
Samek, W., Wiegand, T., & Müller, K. (2017). Explainable artificial intelligence: Un- derstanding, visualizing and interpreting deep learning models CoRR abs/1708. 08296 . Shrikumar, A., Greenside, P., Shcherbina, A., & Kundaje, A. (2016). Not just a black box: Learning important features through propagating activation differences CoRR abs/1605.01713 . Sirignano, J., Sadhwani, A., & Giesecke, K. (2016). Deep Learning for Mortgage Risk. ArXiv e-prints http://adsabs.harvard.edu/abs/2016arXiv160702470S . Sousa, M. R., Gama, J., & Brandão, E. (2016). A new dynamic modeling framework for credit risk assessment. Expert Systems with Applications, 45 , 341–351. doi: 10. 1016/j.eswa.2015.09.055 . Srivastava, N. , Hinton, G. , Krizhevsky, A. , Sutskever, I. , & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15 , 1929–1958 . Thomas, L. C. (20 0 0). A survey of credit and behavioural scoring: forecasting fi- nancial risk of lending to consumers. International journal of forecasting, 16 (2), 149–172 . Verbraken, T., Bravo, C., Weber, R., & Baesens, B. (2014). Development and applica- tion of consumer credit scoring models using profit-based classification mea- sures. European Journal of Operational Research, 238 (2), 505–513. doi: 10.1016/j. ejor.2014.04.001 . Xia, Y., Liu, C., Li, Y., & Liu, N. (2017). A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78 , 225–241. doi: 10.1016/j.eswa.2017.02.017 . Yang, J. B. , Nguyen, M. N. , San, P. P. , Li, X. L. , & Krishnaswamy, S. (2015). Deep convo- lutional neural networks on multichannel time series for human activity recog- nition. In Proceedings of the twenty-fourth international conference on artificial intelligence . In IJCAI’15 (pp. 3995–4001). AAAI Press . Zheng, Y., Liu, Q., Chen, E., Ge, Y., & Zhao, J. L. (2014). Time series classification using multi-channels deep convolutional neural networks. Lecture Notes in Computer Science , 298–310. doi: 10.1007/978- 3- 319- 08010- 9 _ 33 . II 71 Journal of Machine Learning Research 20 (2019) 1-30 Submitted 6/18; Revised 7/19; Published 8/19 Time-to-Event Prediction with Neural Networks and Cox Regression H̊avard Kvamme haavakva@math.uio.no Ørnulf Borgan borgan@math.uio.no Ida Scheel idasch@math.uio.no Department of Mathematics University of Oslo P.O. Box 1053 Blindern 0316 Oslo, Norway Editor: Jon McAuliffe Abstract New methods for time-to-event prediction are proposed by extending the Cox proportional hazards model with neural networks. Building on methodology from nested case-control studies, we propose a loss function that scales well to large data sets and enables fitting of both proportional and non-proportional extensions of the Cox model. Through simulation studies, the proposed loss function is verified to be a good approximation for the Cox partial log-likelihood. The proposed methodology is compared to existing methodologies on real-world data sets and is found to be highly competitive, typically yielding the best performance in terms of Brier score and binomial log-likelihood. A python package for the proposed methods is available at https://github.com/havakv/pycox. Keywords: Cox regression, customer churn, neural networks, non-proportional hazards, survival prediction 1. Introduction In this paper, we consider methodology for time-to-event prediction, a part of survival analysis that reasons about when a future event will occur. 
Applications of time-to-event predictions can be found in a variety of settings such as survival prediction of cancer patients (e.g., Vigan et al., 2000), customer churn (e.g., Van den Poel and Lariviere, 2004), credit scoring (e.g., Dirick et al., 2017), and failure times of mechanical systems (e.g., Susto et al., 2015). Arguably, the field of survival analysis has predominantly focused on interpretability, potentially at some cost of predictive accuracy. This is perhaps the reason why binary classifiers from machine learning are commonly used in industrial applications where survival methodology is applicable. However, while the binary classifiers can provide predictions for one predetermined duration, one loses the interpretability and flexibility provided by modeling the event probabilities as a function of time. Furthermore, in time-to-event data, it is common that some individuals are not followed all the way to their event time, resulting in censored times rather than event times. While binary classifiers typically ignore these observations, one of the main objectives in survival analysis is to account for them. Hence, c©2019 H̊avard Kvamme, Ørnulf Borgan, and Ida Scheel. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-424.html. Kvamme, Borgan, and Scheel in applications with a substantial amount of censoring, the use of survival models tends to be advantageous. In our work, we propose an approach for combining machine learning methodology with survival models. We do this by extending the Cox proportional hazards model with neural networks, and further remove the proportionality constraint of the Cox model. Building on methodology from nested case-control studies (e.g., Langholz and Goldstein, 1996) we are able to do this in a scalable manner. The resulting methods have the flexibility of neural networks while modeling event times continuously. Building on the PyTorch frame- work (Paszke et al., 2017), we provide a python package for our methodology, along with all the simulations and data sets presented in this paper.1 The paper is organized as follows. Section 2 contains a summary of related work. In Section 3, we review some basic concepts from survival analysis and introduce the Cox proportional hazards model with our extensions. In Section 4 we discuss some evaluation criteria of methods for time-to-event prediction. In Section 5, we conduct a simulation study, verifying that the methods we propose behave as expected. In Section 6 we evaluate our methods on five real-world data sets and compare their performances with existing methodology. We conclude in Section 7. 2. Related Work The extension of Cox regression with neural networks was first proposed by Faraggi and Simon (1995), who replaced the linear predictor of the Cox regression model, cf. formula (3) below, by a one hidden layer multilayer perceptron (MLP). It was, however, found that the model generally failed to outperform regular Cox models (Xiang et al., 2000; Sargent, 2001). Katzman et al. (2018) revisited these models in the framework of deep learning and showed that novel networks were able to outperform classical Cox models in terms of the C-index (Harrell Jr et al., 1982). Our work distinguishes itself from this in the following way: The method by Katzman et al. 
(2018), denoted DeepSurv, is constrained by the proportionality assumption of the Cox model while we propose an extension of the Cox model where proportionality is no longer a restriction. In this regard, we propose an alternative loss function that scales well for both the proportional and the non-proportional cases. Similar works based on Cox regression include SurvivalNet (Yousefi et al., 2017), a frame- work for fitting proportional Cox models with neural networks and Bayesian optimization of the hyperparameters, and Zhu et al. (2016) and Zhu et al. (2017) which extended the Cox methodology to images. Both Zhu et al. (2016) and Zhu et al. (2017) replace the MLP of DeepSurv with a convolutional neural network and applied these methods to pathological images of lung cancer and to whole slide histopathological images. An alternative approach to time-to-event prediction is to discretize the duration and compute the hazard or survival function on this predetermined time grid. Luck et al. (2017) proposed methods similar to DeepSurv, but with an additional set of discrete out- puts for survival predictions and computed an isotonic regression loss over this time grid. Fotso (2018) parameterized a multi-task logistic regression with a neural net that directly computes the survival probabilities on the time grid. Lee et al. (2018) proposed a method, 1. Implementations of methods and the data sets are available at https://github.com/havakv/pycox. 2 Time-to-Event Prediction with Neural Networks and Cox Regression denoted DeepHit, that estimates the probability mass function with a neural net and com- bine the log-likelihood with a ranking loss; see Appendix D for details. Furthermore, the method has the added benefit of being applicable for competing risks. The majority of the papers mentioned benchmark their methods against the random survival forests (RSF) by Ishwaran et al. (2008). RSF computes a random forest using the log-rank test as the splitting criterion. It computes the cumulative hazards of the leaf nodes and averages them over the ensemble. Hence, RSF is a very flexible continuous-time method that is not constrained by the proportionality assumption. 3. Methodology In the following, we give a brief review of some concepts in survival analysis. For a more in-depth introduction to the field, see, for example, Klein and Moeschberger (2003). Our objective is to model the event distribution as a continuous function of time. So with f(t) and F(t) denoting the probability density function and the cumulative distribution function of an event time T∗, we want to model P(T∗ ≤ t) = ∫ t 0 f(s) ds = F(t). As alternatives to F(t), it is common to study the survival function S(t) and the hazard rate h(t). The survival function is defined as S(t) = P(T∗ > t) = 1 −F(t), and is commonly used for visualizing event probabilities over time. For specifying models, however, it is rather common to use the hazard rate h(t) = f(t) S(t) = lim ∆t→0 1 ∆t P(t ≤ T∗ < t + ∆t | T∗ ≥ t). If we have the hazard rate, the survival function can be retrieved through the cumulative hazard, H(t) = ∫ t 0 h(s)ds, by S(t) = exp[−H(t)]. (1) The survival function and the hazard rate therefore provide contrasting views of the same quantities, and it may be useful to study both. Working with real data, the true event times are typically not known for all individuals. This can occur when the follow-up time for an individual is not long enough for the event to happen, or an individual may leave the study before its termination. 
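As a small numerical aside (not part of the original paper), the relation in (1) can be checked directly: below, the hazard rate of a Weibull distribution is integrated numerically and exponentiated, recovering the closed-form survival function. The shape and scale values, and the use of NumPy/SciPy, are illustration choices only.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid  # called cumtrapz in older SciPy
from scipy.stats import weibull_min

k, lam = 1.5, 10.0                              # illustrative Weibull shape and scale
t = np.linspace(1e-6, 30, 1000)
h = (k / lam) * (t / lam) ** (k - 1)            # hazard rate h(t) of the Weibull distribution
H = cumulative_trapezoid(h, t, initial=0.0)     # cumulative hazard H(t) = int_0^t h(s) ds
S_from_hazard = np.exp(-H)                      # Eq. (1): S(t) = exp[-H(t)]
S_exact = weibull_min.sf(t, c=k, scale=lam)     # closed-form survival function
print(np.abs(S_from_hazard - S_exact).max())    # small; numerical integration error only
```

The same identity is what turns estimated cumulative hazards into survival predictions later in the paper.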
Instead of observing the true event time T∗, we then observe a possibly right-censored event time T = min{T∗,C∗}, where C∗ is the censoring time. In addition, we observe the indicator D = 1{T = T∗} labeling the observed event time T as an event or a censored observation. Now, denoting individuals by i, with covariates xi and observed duration Ti, the likelihood for censored survival times is given by L = ∏ i f(Ti |xi)DiS(Ti |xi)1−Di = ∏ i h(Ti |xi)Di exp[−H(Ti |xi)]. (2) We will later refer to this as the full likelihood. 3 Kvamme, Borgan, and Scheel 3.1. Cox Regression The Cox proportional hazards model (Cox, 1972) is one of the most used models in survival analysis. It provides a semi-parametric specification of the hazard rate h(t |x) = h0(t) exp[g(x)], g(x) = βT x, (3) where h0(t) is a non-parametric baseline hazard, and exp[g(x)] is the relative risk function. Here, x is a covariate vector and β is a parameter vector. Note that the linear predictor g(x) = βT x does not contain an intercept term (bias weight). This is because the intercept would simply scale the baseline hazard, and therefore not contribute to the relative risk function, h0(t) exp[g(x) + b] = h0(t) exp[b] exp[g(x)] = h̃0(t) exp[g(x)]. The Cox model in (3) is fitted in two steps. First, the parametric part is fitted by max- imizing the Cox partial likelihood, which does not depend on the baseline hazard, then the non-parametric baseline hazard is estimated based on the parametric results. For individual i, let Ti denote the possibly censored event time and Ri denote the set of all individuals at risk at time Ti (not censored and have not experienced the event before time Ti). Note that Ri includes individuals with event times at Ti, so i is part of Ri. The Cox partial likelihood, with Breslow’s method for handling tied event times, is given by Lcox = ∏ i ( exp[g(xi)]∑ j∈Ri exp[g(xj)] )Di , (4) and the negative partial log-likelihood can then be used as a loss function loss = ∑ i Di log   ∑ j∈Ri exp[g(xj) −g(xi)]   . (5) Let β̂ be the value of β that maximizes (4), or equivalently, minimizes (5). Then the cumulative baseline hazard function can be estimated by the Breslow estimator Ĥ0(t) = ∑ Ti≤t ∆Ĥ0(Ti) (6) ∆Ĥ0(Ti) = Di∑ j∈Ri exp[ĝ(xj)] , where ĝ(x) = β̂ T x. If desired, the baseline hazard h0(t) can be estimated by smoothing the increments, ∆Ĥ0(Ti), of the Breslow estimate, but the cumulative baseline hazard typically provides the information we are interested in. 3.2. Cox with SGD The Cox partial likelihood is usually minimized using Newton-Raphson’s method. In our work, we instead want to fit the Cox model with mini-batch stochastic gradient descent 4 Time-to-Event Prediction with Neural Networks and Cox Regression (SGD), to better scale to large data sets. As the loss in (5) sums over risk sets Ri, which can be as large as the full data set, it cannot be computed in batches. Nevertheless, it is possible to do batched iterations by subsampling the data set (to a batch) and restrict the set Ri to only contain individuals in the current batch. This scales well for proportional methods such as DeepSurv (Katzman et al., 2018), but would be very computationally expensive for our non-proportional extension presented in Section 3.4. Hence, we propose an approximation of the loss that is easily batched. Intuitively, we can approximate the risk set Ri with a sufficiently large subset R̃i, and weight the likelihood accordingly with weights wi, L = ∏ i ( exp[g(xi)] wi ∑ j∈R̃i exp[g(xj)] )Di . 
(7) The weights should ensure that the weighted sum over the subset R̃i in (7) is a reasonable approximation of the full sum over Ri in (4). By choosing a fixed sample size of the sampled risk sets R̃i, we can now optimize the objective by batched gradient descent. The individual i is always included in the sampled risk set R̃i to ensure that each of the products in (7) is bounded above by 1. As the weights wi do not contribute to the gradients of the logarithm of (7) (as can be seen by differentiating with respect to the model parameters), we can simply drop them from the loss function. Also, in practice we do not compute the loss for Di = 0 as these entries do not contribute to (7). Finally, if we average the loss to make it independent of the data set size, we obtain loss = 1 n ∑ i:Di=1 log   ∑ j∈R̃i exp[g(xj) −g(xi)]   , (8) where n denotes the number of events in the data set. In our experiments in Sections 5 and 6, we find that it is often sufficient to sample only one individual j from the risk set, which gives us the loss loss = 1 n ∑ i:Di=1 log (1 + exp[g(xj) −g(xi)]) , j ∈Ri\{i}. (9) One benefit of (8) is that it is, in a sense, more interpretable than the negative partial log- likelihood in (5). Due to the sample dependence in the mean partial log-likelihood (MPLL), i.e., the expression in (5) divided by n, the magnitude of the MPLL is dependent on the size of the risk sets. Hence, for a change of batch size, the mean partial log-likelihood changes. This prohibits a comparison of losses across different batch sizes. Comparably, the loss in (8) is not affected by the choice of batch size, as the size of R̃i is fixed. As a result, we can derive the range of values we expect the loss to be in. Using (9) as an example, we know that it is typically in the range (0, 0.693], as a trivial g(x) = const, gives a loss = log(2) ≈ 0.693, and the minimum is obtained by letting g(xi) →∞, g(xj) →−∞, which results in a loss that tends towards 0. Sampling of the risk sets in Cox’s partial likelihood is commonly done in epidemiology and formalized through the nested case-control design, originally suggested by Thomas 5 Kvamme, Borgan, and Scheel (1977). In (8), case refers to the i’s, while the controls are the j’s sampled from Ri\{i}. Goldstein and Langholz (1992) show that for the Cox partial likelihood, the sampled risk sets produce consistent parameter estimators. While their results do not extend to non- linear models, it is still an indication that the loss function in (8) is reasonable. Our sampling strategy deviates from that of the nested case-control literature in two ways. Firstly, we sample a new set of controls for every iteration, instead of keeping control samples fixed. Secondly, we sample controls with replacement, as this requires less com- putation than sampling without replacement. Note, however, that we typically sample a single control, in which case it does not matter if we sample with or without replacement. 3.3. Non-Linear Cox Having established the simple loss function in (8), which can be computed with SGD, the generalization of the relative risk function exp[g(x)] is rather straightforward. In this paper, we replace the linear predictor g(x) = βT x by a g(x) parameterized by a neural network. While our proposed loss function is not a requirement for the adaptation of a neural network (see, e.g., DeepSurv by Katzman et al., 2018), it really helps for the further extensions in Section 3.4. 
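To make the case-control loss concrete, here is a minimal, loop-based PyTorch sketch of (9) with a single sampled control per case. The function name and the per-batch sampling details are our own; the vectorized implementation in the pycox package may differ.

```python
import torch
import torch.nn.functional as F

def cox_cc_loss(g, durations, events):
    """Sketch of the loss in (9): for each case i (an event), sample one control j
    at risk at T_i and average log(1 + exp[g(x_j) - g(x_i)]) over the cases.
    g, durations, and events are 1-D tensors for one mini-batch, and the batch is
    assumed to contain at least one event."""
    losses = []
    for i in torch.where(events == 1)[0]:
        at_risk = torch.where(durations >= durations[i])[0]   # risk set R_i within the batch
        at_risk = at_risk[at_risk != i]                       # controls: R_i \ {i}
        if len(at_risk) == 0:
            continue                                          # no control available for the latest event time
        j = at_risk[torch.randint(len(at_risk), (1,))]        # one control, sampled with replacement
        losses.append(F.softplus(g[j] - g[i]))                # log(1 + exp[g(x_j) - g(x_i)])
    return torch.cat(losses).mean()
```

Note that a constant predictor g(x) gives softplus(0) = log 2 ≈ 0.693 for every case, matching the range discussed above.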
Also, it has been found that batched iterations can improve predictive performance (Keskar et al., 2016; Hoffer et al., 2017). Our generalization of g(x) leaves the presented theory in Sections 3.1 and 3.2 essentially unchanged, so we do not repeat the likelihoods and loss functions for this model. We will later refer to the Cox proportional hazards model parameterized with a neu- ral network as Cox-MLP. To differentiate between minimizing the negative partial log- likelihood in (5), as done by DeepSurv, and our case-control approximation in (8), we will denote the corresponding methods by Cox-MLP (DeepSurv) and Cox-MLP (CC), respec- tively. For the non-linear Cox models, the loss does not necessarily have a unique minimizer for g(x). Therefore, we add a penalty to the loss function to encourage g(x) to not deviate too far from zero penalty = λ ∑ i:Di=1 ∑ j∈R̃i |g(xj)|. (10) Here λ is a tuning parameter, and note that i is included in R̃i. 3.4. Non-Proportional Cox-Time The proportionality assumption of the Cox model can be rather restrictive, and parame- terizing the relative risk function with a neural net does not affect this constraint. Ap- proaches for circumventing this restriction are typically based on grouping the data based on a categorical covariate and applying a stratified version of the Cox model (Klein and Moeschberger, 2003, chap. 9). We propose a parametric approach that does not require stratification. Continuing with the semi-parametric form of the Cox model, we now let the relative risk function depend on time, h(t |x) = h0(t) exp[g(t, x)]. (11) 6 Time-to-Event Prediction with Neural Networks and Cox Regression In practice, we let g(t, x) handle the time as a regular covariate, which enables g(t, x) to model interactions between time and the other covariates. This is similar to the approach taken in classical survival analysis, where the non-proportional effect of a covariate x may be modeled by including time-dependent covariates like x · t and x · log t. The model (11) is no longer a proportional hazards model. However, it is still a relative risk model with the same partial likelihood as previously, only now with an additional covariate. Following the approach from Section 3.2, we have the loss function loss = 1 n ∑ i:Di=1 log   ∑ j∈R̃i exp[g(Ti, xj) −g(Ti, xi)]   , (12) and we include the penalty in (10), with g(Ti, xj) replacing g(xj). We will later refer to models fitted by (12) as Cox-Time. Note that the loss has the same Ti for both xi and the xj’s. Consequently, if we had used the full risk set Ri instead of the subset R̃i, as is the case for the loss in (5), the loss would become very computationally expensive. In fact, for the full risk set, the time complexity of the loss would be O(n · |Ri|) = O(n2), where |Ri| denotes the size of the risk set. But for (12) we get O(n · |R̃i|) = O(n), as |R̃i| is fixed and small. In the proportional case, to compute the loss in (5) one only needs to compute g(xj) once (per iteration) for each j, and reuse that value in all other risk sets. This ensures the linear time complexity for the classical Cox models. We can find the Breslow estimate for the cumulative baseline hazard H0(t) using (6) with ĝ(xj) replaced by ĝ(Ti, xj). Note that we still need the non-parametric baseline, as g(t, x) is restricted to model interactions between the time and the covariates. To see this, consider g(t, x) = a(t, x) + b(t), and observe that b(t) cancels out in the loss. 3.5. 
Prediction We can obtain predictions from the relative risk models by estimating the survival function in (1), Ŝ(t |x) = exp[−Ĥ(t |x)]. For the proportional hazards models, the relative risk function does not depend on time, enabling us to integrate only over the baseline hazard and compute the relative risk separately, H(t |x) = ∫ t 0 h0(s) exp[g(x)] ds = H0(t) exp[g(x)]. By first estimating H0(t) on the training data with (6), we only need to compute exp[g(x)] to obtain predictions. Computation of the estimate Ĥ0(t) requires a single pass over the whole training set, in addition to sorting the training set by time. In the case of models with non-proportional hazards, such as models fitted by Cox-Time in Section 3.4, predictions are much more computationally expensive. As the relative risk is time-dependent, we now need to integrate over both the baseline hazard and g(t, x), H(t |x) = ∫ t 0 h0(s) exp[g(s, x)] ds. 7 Kvamme, Borgan, and Scheel In practice, we estimate the cumulative hazards by Ĥ(t |x) = ∑ Ti≤t ∆Ĥ0(Ti) exp[ĝ(Ti, x)], where ∆Ĥ0(Ti) is an increment of the Breslow estimate and ĝ(Ti, x) is the estimate of g(Ti, x) obtained from the neural network. This is clearly rather computationally expensive as we need to compute ĝ(Ti, x) for all distinct event times Ti ≤ t. Furthermore, for continuous- time data, computation of the cumulative baseline hazard through the Breslow estimate, ∆Ĥ0(Ti) = Di∑ j∈Ri exp[ĝ(Ti, xj)] , (13) scales quadratically. To alleviate the computational cost, one can compute the cumulative hazards over a reduced number of distinct time points. Hence, Cox-Time is trained on continuous-time data but produces discrete-time predictions, with the benefit of the discretization happening after the network is fitted. In practice, we perform this discretization by computing the baseline on a random subset of the training data and subsequently control the resolution of the time grid through the sample size. 4. Evaluation Criteria Metrics for evaluating the performance of methods for time-to-event prediction should ac- count for the censored individuals. In the following, we describe the metrics used in the experimental sections of this paper. 4.1. Concordance Index In survival analysis, the concordance index, or C-index (Harrell Jr et al., 1982), is arguably one of the most commonly applied discriminative evaluation metrics. This is likely a result of its interpretability, as it has a close relationship to classification accuracy (Ishwaran et al., 2008) and ROC AUC (Heagerty and Zheng, 2005). In short, the C-index estimates the probability that, for a random pair of individuals, the predicted survival times of the two individuals have the same ordering as their true survival times. See Ishwaran et al. (2008) for a detailed description. As the C-index only depends on the ordering of the predictions, it is very useful for evaluating proportional hazards models. This is because the ordering of proportional haz- ards models does not change over time, which enables us to use the relative risk function instead of a metric for predicted survival time. It is, however, not obvious how the C-index should be applied for non-proportional hazards models (Gerds et al., 2012; Ishwaran et al., 2008). We will use a metric based on the time-dependent C-index by Antolini et al. (2005), which estimates the probability that observations i and j are concordant given that they are comparable, Ctd = P{Ŝ(Ti |xi) < Ŝ(Ti |xj) | Ti < Tj, Di = 1}. 
(14) 8 Time-to-Event Prediction with Neural Networks and Cox Regression However, to account for tied event times and survival estimates, we make the modifications listed by Ishwaran et al. (2008, Section 5.1, step 3). This is to ensure that predictions independent of x, Ŝ(t |x) = Ŝ(t), yields Ctd = 0.5 for unique event times. Note that for proportional hazards models, our metric is equivalent to the regular C-index. 4.2. Brier Score The Brier score (BS) for binary classification is a metric of both discrimination and cal- ibration of a model’s estimates. In short, for N binary labels yi ∈ {0, 1} with probabil- ities pi of yi = 1, the BS is the mean squared error of the probability estimates p̂i, i.e., BS = 1 N ∑ i (yi − p̂i) 2 . To get binary outcomes from time-to-event data, we choose a fixed time t and label data according to whether or not an individual’s event time is shorter or longer than t. Graf et al. (1999) generalize the Brier score to account for censoring by weighting the scores by the inverse censoring distribution, BS(t) = 1 N N∑ i=1 [ Ŝ(t |xi) 2 1{Ti ≤ t,Di = 1} Ĝ(Ti) + (1 − Ŝ(t |xi)) 2 1{Ti > t} Ĝ(t) ] . (15) Here N is the number of observations, Ĝ(t) is the Kaplan-Meier estimate of the censoring survival function, P(C∗ > t), and it is assumed that the censoring times and survival times are independent. The BS can be extended from a single duration t to an interval by computing the integrated Brier score IBS = 1 t2 − t1 ∫ t2 t1 BS(s) ds. (16) In practice, we approximate this integral by numerical integration, and we let the time span be the duration of the test set. In our experiments we found that using 100 grid points were sufficient to obtain stable scores. 4.3. Binomial Log-Likelihood The mean binomial log-likelihood is a commonly used metric in binary classification that measures both discrimination and calibration of the estimates of a method. Using the same inverse censoring weighting as for the Brier score, we can apply this metric to censored duration time data, BLL(t) = 1 N N∑ i=1 [ log[1 − Ŝ(t |xi)] 1{Ti ≤ t,Di = 1} Ĝ(Ti) + log[Ŝ(t |xi)] 1{Ti > t} Ĝ(t) ] . (17) The binomial log-likelihood can also be integrated in the same manner as (16) IBLL = 1 t2 − t1 ∫ t2 t1 BLL(s) ds. 9 Kvamme, Borgan, and Scheel 0 10 20 30 40 50 60 Epochs 8.7 8.6 8.5 8.4 8.3 8.2 M PL L 1 2 4 8 Figure 1: Box plots giving the mean partial log-likelihood (MPLL) of the test sets for different training epochs. The colors show how many controls were sampled during training (in addition to the case). 5. Simulations To empirically investigate our proposed methodology, we conducted a series of simulations. These experiments are by no means exhaustive but are rather meant to verify that the methods behave as expected. In the following, we let classical Cox refer to a Cox regression with g(x) = βT x obtained with the Lifelines python package (Davidson-Pilon et al., 2018). For experimental details exempt from the main article, we refer the reader to Appendix C.3. We first investigate the behavior of our proposed loss (8). In particular, we want to examine the impact the number of sampled controls has on the fitted models, in addition to how well the results from using our loss agree with those from using the Cox partial log- likelihood, i.e., the loss (5). To this end, we simulated survival times from a proportional hazards model h(t |x) = h0(t) exp[g(x)], g(x) = βT x, (18) with constant baseline hazard h0(t) = 0.1, and β T = [0.44, 0.66, 0.88]. The covariates were sampled uniformly from [−1, 1]. 
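As an aside, the data-generating mechanism in (18) is easy to reproduce; the sketch below (ours, not the paper's code) draws event times from the constant-hazard model and applies the censoring scheme described in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h0 = 10_000, 0.1
beta = np.array([0.44, 0.66, 0.88])

x = rng.uniform(-1, 1, size=(n, 3))
rate = h0 * np.exp(x @ beta)                    # constant hazard h0 * exp(beta^T x)
event_times = rng.exponential(scale=1 / rate)   # constant hazard => exponential event times

# Censoring, as described in the following paragraph: constant censoring hazard 1/30,
# plus administrative censoring of everyone still under observation at time 30.
cens_times = np.minimum(rng.exponential(scale=30.0, size=n), 30.0)
durations = np.minimum(event_times, cens_times)
events = (event_times <= cens_times).astype(int)
```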
We drew censoring times independent of the covariates with constant hazard c(t) = 1 30 , and, in addition, we censored all individuals that were still under observation at time 30. This resulted in approximately 30 % censored individuals. We sampled 10,000 individuals for training and 10,000 for testing, and fitted our Cox model by SGD as described in Section 3.2. This method will be referred to as Cox-SGD. Four models were fitted by sampling 1, 2, 4, and 8 controls (in addition to the case). The whole experiment was repeated 100 times, and the mean partial log-likelihood (MPLL) of the test sets are visualized in Figure 1. The figure indicates that the number of sampled controls does not affect the rate of convergence, but we note that the computational complexity increases with the number of sampled controls. 10 Time-to-Event Prediction with Neural Networks and Cox Regression 100 1000 10000 Size data set 0.3 0.2 0.1 0.0 0.1 D iff . C oe f 0.44 100 1000 10000 Size data set 0.66 1 2 4 8 100 1000 10000 Size data set 0.88 Figure 2: Differences between Cox-SGD and Classical Cox parameter estimates for different data set sizes. The numbers above the plots give the value of the true coefficient. Each box plot is based on 100 observations. The legend above the plots states the number of sampled controls for Cox-SGD. The experiment was repeated with training sets of 1,000 individuals to verify that the results were not simply due to the size of the training data. The same patterns were found in this setting, so the figure is exempted from the paper. Next, we compare the parameter estimates obtained from our proposed loss (Cox-SGD) with the estimates obtained with classical Cox regression. For data sets of sizes 100, 1,000, and 10,000, we fitted models to 100 sampled data sets. The differences between the Cox- SGD parameter estimates and the classical Cox estimates are displayed in Figure 2, where the legend above the plots gives the number of controls sampled for the Cox-SGD method. For the data sets of size 100, we observe that the Cox-SGD estimates seem to be slightly smaller than the Cox estimates, and this difference is larger for fewer sampled controls. However, as the data sets increase in size, the estimates for the two methods agree well. Finally, we want to compare the likelihoods obtained by the two methods. As the mean partial log-likelihood depends on data set size, it is really not comparable across data sets. We will therefore instead use the full likelihood (2), which also depends on the baseline hazard, to compare the methods. Note that the partial likelihood may be interpreted as a profile likelihood, so the full likelihood and the partial likelihood are closely related (see, e.g., Klein and Moeschberger, 2003, p. 258). We obtain cumulative baseline hazard estimates by (6), and baseline hazard estimates by numerical differentiation of the cumulative baseline hazard estimates. We report the mean log-likelihood (MLL) of (2) to compare results for different data set sizes. In Figure 3, we show the difference in the MLL between the Cox- SGD method and the classical method for Cox regression (for each sampled data set). We used the training MLL as we are here only interested in the losses’ abilities to optimize the objective, and not the generalization to a test set. From the figure, we observe that, for smaller data sets, more sampled controls in Cox-SGD seems to give likelihood estimates closer to those of a classical Cox regression. 
As the data sets increase in size, the MLL of the Cox-SGD seems to converge to that of the classical Cox, regardless of control sample 11 Kvamme, Borgan, and Scheel 100 1000 10000 Size data set 0.010 0.008 0.006 0.004 0.002 0.000 D iff . M LL 1 2 4 8 Figure 3: Differences in mean log-likelihood between Cox-SGD and classical Cox using (2). The figure gives results for different data set sizes, and the legend gives the number of sampled controls. Each box plot is based on 100 observations. size. The MLL’s of the classical Cox regression is approximately -2.2, meaning that even for the smallest data sets the differences in MLL for 1 sampled control is around 0.1 %. Hence, it seems that our loss (8) approximates the negative partial log-likelihood rather well. Furthermore, while a higher number of sampled controls can give lower training error, the effect of the number of sampled controls decreases with the size of the data sets. In the further experiments, we will for simplicity only use one sampled control, as this was found sufficient. Furthermore, in Appendix C we have included simulations for the methods using neural networks for non-linear and non-proportional models, to verify that our proposed methods behave as expected. In summary, these simulations verify that for observations drawn from a proportional hazards model with a non-linear relative risk function, Cox models parameterized by neural networks provide better estimates of the partial log-likelihood than classical Cox models. Further, by drawing observations from a non-proportional relative risk model, the simulations verify that our Cox-Time method (Section 3.4) is able to obtain better estimates of the survival functions than methods which assume proportional hazards. 6. Experiments In the following, we will evaluate our proposed methods on real data sets and compare their performance to existing methods from the literature. In total, we use five data sets. One large data set for a more in-depth analysis (see Section 6.2), and four smaller data sets commonly used in the survival analysis literature. 6.1. Four Common Survival Data Sets We base this experimental section on the data sets provided by Katzman et al. (2018), as they are made available through the DeepSurv python package, and need no further prepro- 12 Time-to-Event Prediction with Neural Networks and Cox Regression Data set Size Covariates Unique Durations Prop. Censored SUPPORT 8,873 14 1,714 0.32 METABRIC 1,904 9 1,686 0.42 Rot. & GBSG 2,232 7 1,230 0.43 FLCHAIN 6,524 8 2,715 0.70 Table 1: Summary of the four data sets used in the experiments in Section 6.1. cessing. The data sets include the Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT), the Molecular Taxonomy of Breast Cancer Inter- national Consortium (METABRIC), and the Rotterdam tumor bank and German Breast Cancer Study Group (Rot. & GBSG). Katzman et al. (2018) also provide the Worcester Heart Attack Study (WHAS) data set. However, their version of the data set is actually a case-control data set, meaning it contains multiple replications of some individuals, some- thing the authors seem to have overlooked. We replace it with the Assay Of Serum Free Light Chain (FLCHAIN) made available in the survival packages of R (Therneau, 2015). For FLCHAIN, we remove individuals with missing values. Further, we remove the “chap- ter” covariate, which gives the cause of death. Table 1 provides a summary of the data sets. 
For a more detailed description, we refer to the original sources (Therneau, 2015; Katzman et al., 2018). 6.1.1. Methods and Hyperparameter Tuning In Sections 3.3 and 3.4 we presented two new survival methods based on case-control sam- pling and neural networks: a proportional Cox method and a non-proportional Cox method, which we will refer to as Cox-MLP (CC) and Cox-Time respectively. We will compare our methods to a classical linear Cox regression referred to as Classical Cox (Linear), DeepHit (Lee et al., 2018), and Random Survival Forests (RSF) (Ishwaran et al., 2008). We will also compare to a proportional Cox method similar to DeepSurv (Katzman et al., 2018), but our version performs batched SGD by computing the negative partial log-likelihood in (5) on a subset of the data set. Furthermore, we choose not to restrict the network structure and optimization scheme to that of Katzman et al. (2018). Hence, this method is identical to our proportional Cox method in Section 3.3, except that it computes the negative partial log-likelihood of a batch, while we use case-control sampling in the loss function. We will refer to these two methods as Cox-MLP (DeepSurv) and Cox-MLP (CC) respectively. We do not compare with Luck et al. (2017) as their method is another proportional hazard method and is therefore restricted in all the same ways as our other proportional methods. As we will show in Section 6.1.2, the proportional hazards assumption is very restrictive, and methods based on this assumption are therefore not able to compete with other methods such as DeepHit, RSF, and Cox-Time. For the neural networks, we standardize the numerical covariates and encode the cat- egorical covariates by entity embeddings (Guo and Berkhahn, 2016) half the size of the number of categories. For the Classical Cox (Linear) regression, we one-hot encoded the categorical covariates (dummy variables), and in RSF we simply passed the covariates with- out any transformations. The networks are standard multi-layer perceptrons with the same 13 Kvamme, Borgan, and Scheel Method SUPPORT METABRIC Rot. & GBSG FLCHAIN Classical Cox (Linear) 0.598 0.628 0.666 0.790 Cox-MLP (DeepSurv) 0.611 0.636 0.674 0.790 Cox-MLP (CC) 0.613 0.643 0.669 0.793 Cox-Time 0.629 0.662 0.677 0.790 DeepHit 0.642 0.675 0.670 0.792 RSF (Mortality) 0.628 0.649 0.667 0.784 RSF (Ctd) 0.634 0.652 0.669 0.786 Table 2: Concordance (Ctd) for the experiments in Section 6.1. number of nodes in every layer, ReLU activations, and batch normalization between layers. We used dropout, normalized decoupled weight decay (Loshchilov and Hutter, 2019), and early stopping for regularization. SGD was performed by AdamWR (Loshchilov and Hut- ter, 2019) with an initial cycle length of one epoch, and we double the cycle length after each cycle. Learning rates were found using the methods proposed by Smith (2017). As the four data sets are somewhat small, we scored our fitted models using 5-fold cross- validation, where the hyperparameter search was performed individually for each fold. For all the neural networks, we performed a random hyperparameter search over 300 parameter configurations and chose the model with the best score on a held-out validation set. We scored the proportional Cox methods by the partial log-likelihood, and for the Cox-Time method, we used the loss (12). Model hyperparameter tuning for DeepHit and RSF was performed with the time-dependent concordance index (Antolini et al., 2005). 
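For reference, a hedged PyTorch sketch of a network of the kind described above is given below; the class name, the exact layer ordering, and the rounding of the embedding sizes are our assumptions rather than the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SurvivalMLP(nn.Module):
    """Sketch: entity embeddings (roughly half the number of categories) for the
    categorical covariates, followed by identical dense blocks with ReLU,
    batch normalization, and dropout, and a single output node for g(x)."""
    def __init__(self, n_numeric, category_sizes, n_layers=2, n_nodes=128, dropout=0.1):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(c, (c + 1) // 2) for c in category_sizes])
        in_features = n_numeric + sum((c + 1) // 2 for c in category_sizes)
        blocks = []
        for _ in range(n_layers):
            blocks += [nn.Linear(in_features, n_nodes), nn.ReLU(),
                       nn.BatchNorm1d(n_nodes), nn.Dropout(dropout)]
            in_features = n_nodes
        blocks.append(nn.Linear(in_features, 1, bias=False))  # a bias would only rescale h_0(t)
        self.net = nn.Sequential(*blocks)

    def forward(self, x_numeric, x_categorical):
        embedded = [emb(x_categorical[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.net(torch.cat([x_numeric] + embedded, dim=1))
```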
However, we also include an RSF tuned in the manner proposed by the authors, i.e., by computing the concordance of the mortality (Ishwaran et al., 2008, Section 4.1). The two versions of RSF are in the following denoted RSF (Ctd) and RSF (Mortality), respectively. A list of the hyperparameter search spaces can be found in Appendix A.1. 6.1.2. Results We compare the methods using the metrics presented in Section 4, i.e., the time-dependent concordance, the integrated Brier score, and the integrated binomial log-likelihood. While the concordance solely evaluates a method’s discriminative performance, the Brier score and binomial log-likelihood also evaluate the calibration of the survival estimates. Table 2 shows the time-dependent concordance, or Ctd, averaged over the five cross- validation folds. As expected, the methods that assume proportional hazards, in general, perform worse than the less restrictive methods, and Cox-MLP (DeepSurv) and Cox-MLP (CC) are very close to each other. We see that RSF (Ctd) has better concordance than RSF (Mortality), which is expected as RSF (Ctd) uses the same metric for hyperparameter tuning as for evaluation. Cox-Time seems to do slightly better than the RSF methods, which is impressive as we have not used concordance for hyperparameter tuning. DeepHit seems to have the best discriminative performance overall, but, as we will see next, this comes at the cost of poorly calibrated survival estimates. 14 Time-to-Event Prediction with Neural Networks and Cox Regression Method SUPPORT METABRIC Rot. & GBSG FLCHAIN Classical Cox (Linear) 0.217 0.183 0.180 0.096 Cox-MLP (DeepSurv) 0.214 0.176 0.170 0.092 Cox-MLP (CC) 0.213 0.174 0.171 0.093 Cox-Time 0.212 0.172 0.169 0.102 DeepHit 0.223 0.184 0.178 0.120 RSF (Mortality) 0.215 0.175 0.171 0.093 RSF (Ctd) 0.212 0.176 0.171 0.093 Table 3: Integrated Brier score weighted by estimates of the censoring distribution for the experiments in Section 6.1. Method SUPPORT METABRIC Rot. & GBSG FLCHAIN Classical Cox (Linear) -0.623 -0.538 -0.529 -0.322 Cox-MLP (DeepSurv) -0.619 -0.532 -0.514 -0.309 Cox-MLP (CC) -0.615 -0.515 -0.509 -0.314 Cox-Time -0.613 -0.511 -0.502 -0.432 DeepHit -0.637 -0.539 -0.524 -0.487 RSF (Mortality) -0.619 -0.515 -0.507 -0.311 RSF (Ctd) -0.610 -0.517 -0.507 -0.311 Table 4: Integrated binomial log-likelihood weighted by estimates of the censoring distri- bution for the experiments in Section 6.1. Tables 3 and 4 show the integrated Brier score and integrated binomial log-likelihood, both weighted with Kaplan-Meier estimates of the censoring distribution. Here, for both metrics, closer to zero is better. We find that both metrics yield very similar results, as the orderings of the methods are almost identical. First, we see that Cox-Time seems to generally perform the best, but it struggles with the FLCHAIN data set. However, we note that, for FLCHAIN, Cox-MLP (DeepSurv) has the best integrated Brier score and binomial log-likelihood while Cox-MLP (CC) has the best concordance. This indicates that the proportionality assumption is quite reasonable for this data set. Again, we find that there is very little difference between Cox-MLP (DeepSurv) and Cox-MLP (CC), which is as expected. The RSF methods generally perform well. Note that the two RSF methods perform equally well here, even though RSF (Ctd) had the best concordance. Classical Cox (Linear) does rather poorly, which was expected as it has very restrictive model assumptions. 
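The integrated scores in Tables 3 and 4 follow (15)-(17). For reference, a minimal NumPy sketch of the Brier score at a single time t is given below; the function name and the vectorized callable Ĝ (a Kaplan-Meier estimate of the censoring survival function) are our assumptions.

```python
import numpy as np

def brier_score(t, surv_at_t, durations, events, G):
    """IPCW Brier score at time t, cf. (15). surv_at_t[i] = S_hat(t | x_i);
    G(u) is an estimate of the censoring survival function P(C* > u)."""
    had_event = (durations <= t) & (events == 1)      # weighted by 1 / G(T_i)
    still_at_risk = durations > t                     # weighted by 1 / G(t)
    score = np.zeros_like(surv_at_t, dtype=float)     # individuals censored before t contribute zero
    score[had_event] = surv_at_t[had_event] ** 2 / G(durations[had_event])
    score[still_at_risk] = (1 - surv_at_t[still_at_risk]) ** 2 / G(t)
    return score.mean()
```

The binomial log-likelihood in (17) uses the same weights with the squared errors replaced by log[1 - Ŝ(t | x_i)] and log[Ŝ(t | x_i)], and the integrated versions average either score over a time grid as in (16).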
Even though DeepHit, in general, had the best concordance, it has the worst integrated Brier score and binomial log-likelihood in three out of the four data sets. The loss function in DeepHit, given by formula (D.2) in Appendix D, is a convex combination of the negative log-likelihood and a ranking loss, determined by a parameter α. For α = 1 we obtain only the negative log-likelihood and for α = 0 we obtain only the ranking loss. As we 15 Kvamme, Borgan, and Scheel Data set Size Churned Censored Prop. Censor Unique users Train 1,786,333 1,279,358 506,975 0.28 1,582,202 Test 661,748 473489 188,259 0.28 586,001 Validation 198,665 142,104 56,561 0.28 175,801 Table 5: Summary of the KKBox churn data set. do hyperparameter tuning based on the concordance, we see that α tends towards smaller values, which results in excellent discriminative performance at the cost of poorly calibrated survival estimates. 6.2. KKBox Churn Case Study Thus far we have compared competing survival methodologies on fairly small data sets. We now perform a case study on a much larger data set, as this is more interesting in the context of neural networks. The WSDM KKBox’s churn prediction challenge was proposed in preparation for the 11th ACM International Conference on Web Search and Data Mining.2 The competition was hosted by Kaggle in 2017, with the goal of predicting customer churn on a data set donated by KKBox, the leading music streaming service in Asia. The available data provide us the opportunity to create a survival data set with event times and censoring indicators. We stick with the churn definition given by KKBox, were a customer is considered churned if he or she fails to resubscribe within 30 days after the previous subscription expired. Note, however, that our use of the data is not comparable to the Kaggle competition, as we work with survival times from the start of a subscription period, while they consider durations from a fixed calendar date. KKBox provides multiple data sources, but as we are primarily interested in evaluating our methods, we spend less time on feature engineering and only use a subset of covariates with general customer information (e.g., city, age, price of subscription). Furthermore, a customer that has previously churned and later resubscribed, is treated as a new customer with some covariate information describing the previous subscription history. This gives us a total of 15 covariates. We split the data into a training, a testing, and a validation set, and some information about these subsets are listed in Table 5. A more in-depth description of the KKBox data set can be found in Appendix B.1. 6.2.1. Methods and Hyperparameter Tuning We use the same methods as in Section 6.1. However, as this data set is very large, we replace the classical Cox regression with our Cox-SGD (Linear) method from Section 3.2. We standardize and encode the covariates in the same manner as for the smaller data sets. The networks and training are also the same as earlier, but we multiply the learning rate by 0.8 at the start of every cycle, as we found this to give more stable training. The KKBox data set is quite large, so we are not able to explore as large a hyperparameter space as in the previous experiments. Hence, we do not include weight decay, and perform 2. 
https://www.kaggle.com/c/kkbox-churn-prediction-challenge 16 Time-to-Event Prediction with Neural Networks and Cox Regression Method Layers Nodes Dropout α σ Cox-MLP (DeepSurv) 6 256 0.1 - - Cox-MLP (CC) 6 128 0 - - Cox-Time 8 256 0 - - DeepHit 6 512 0.1 0.001 0.5 Table 6: KKBox model architectures. α and σ only applies the DeepHit (see Appendix D). Method Ctd IBS IBLL Cox-SGD (Linear) 0.816 0.127 -0.406 Cox-MLP (DeepSurv) 0.841 0.111 -0.349 Cox-MLP (CC) 0.844 0.119 -0.379 Cox-Time 0.861 0.107 -0.334 DeepHit 0.888 0.147 -0.489 RSF (Mortality) 0.855 0.112 -0.352 RSF (Ctd) 0.870 0.111 -0.352 Table 7: Evaluation metrics for the KKBox data. a grid search over a small number of suitable parameters. The hyperparameter search is described in detail in Appendix A.2. The best configurations are given in Table 6. For RSF, the hyperparameter search based on Ctd yielded 8 covariates sampled for each split and a minimum of 50 observations in each leaf node. With concordance of mortality as the validation criterion, the best fitted model used 2 covariates for splitting, and a minimum leaf node size of 10. Furthermore, we found that 500 trees were sufficient, as there was little improvement compared to 250 trees. 6.2.2. Results For evaluation, we fitted each of the methods five times and computed the time-dependent concordance index (Ctd), the integrated Brier score (IBS), and the integrated binomial log- likelihood (IBLL) of the fitted models. The median scores are presented in Table 7. We use the median because the two proportional Cox-MLP methods yielded rather unstable results, where some of the fitted models performed very badly. From the table, we see that DeepHit continues to outperform the other methods in terms of concordance while having the worst performance in terms of IBS and IBLL. Furthermore, Cox-Time has the best IBS and IBLL, while still providing a decent concordance. RSF continues to do well across all metrics, while again, the tuning based on Ctd seems to yield better results than tuning base on the concordance of the mortality. Cox-SGD (Linear) does rather poorly, as it is very restricted, and serves more as a baseline in this context. The Cox-MLP methods seem to again perform reasonably close to each other, at least when taking into account that we found both methods to yield rather unstable results. We are 17 Kvamme, Borgan, and Scheel 0 100 200 300 400 500 600 700 800 Time 0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.175 0.200 Br ie r s co re Cox-SGD (Linear) Cox-MLP (DeepSurv) Cox-MLP (CC) Cox-Time DeepHit RSF (Mortality) RSF (Ctd) Figure 4: Brier score on KKBox data set. The methods are the same as in Table 7. not sure why this was the case, but note that the combination of the flexible neural net and the proportionality constraint might be problematic for large data sets. In Figure 4, we display the Brier scores used to obtain the IBS. Again, we see that DeepHit does poorly for all durations. The instability of the Cox-MLP (CC) is also very apparent for shorter durations. Cox-Time is clearly doing very well for all durations, but interestingly, we observe the Cox-MLP (DeepSurv) provides the best fitted model for larger durations. We could make a corresponding figure for the binomial log-likelihood, but as it is very similar to the Brier score plot, and provide us with no new insights, we have not included it. 6.2.3. Survival Curves One of the benefits survival models have over binary classifiers is their ability to produce survival curves. 
In Figure B.1 in Appendix B, we show nine examples of estimated survival curves from Cox-Time on the test data. The curves nicely illustrate the extent of detail the method has learned. To obtain a more general view of the predictions, we cluster the estimated survival curves of the test set. For an equidistant grid 0 = τ0 < τ1 < · · · < τm, the survival curve of individual i is represented by a vector [Ŝ(τ0 |xi), Ŝ(τ1 |xi), . . . , Ŝ(τm |xi)], and by considering these as feature vectors, we apply K-means clustering to the test set with 10 clusters. In Figure 5, we display the cluster centers and the proportions of the test set assigned to each of the clusters. This is a reasonable approach for segmenting customers based on their churning behavior. A natural next step would be to further investigate the clusters, but as we consider this somewhat outside the scope of this paper, we only make 18 Time-to-Event Prediction with Neural Networks and Cox Regression 0 100 200 300 400 500 600 700 800 Time 0.0 0.2 0.4 0.6 0.8 1.0 S( t) 19 % 18 % 13 % 12 % 10 % 7 % 7 % 7 % 4 % 3 % Figure 5: Cluster centers of survival curves from KKBox test data. The centers were gener- ated by K-means clustering on survival curves from Cox-Time. The legend gives the proportion of test data assigned to each cluster (rounded to nearest integer). a few observations. First, we see that 19 % of the customers are assigned to a cluster that does not provide much detailed information about their behavior, but instead provides a survival curve with a rather constant slope. In sharp contrast, the second largest group (18 %) is at high risk of churning immediately. Furthermore, we observe that many of the curves seem to have higher hazards at the end of each month (drops in the survival curves around 30, 60, 90 days), and we hypothesize that this is a result of customers paying for a full month at a time. The smallest cluster, constituting only 3 % of the test data, has a sharp drop around day 400. Investigating the covariates of the assigned customers reveals that most of them had prepaid for a 411 days long subscription. However, the large drop after 400 days indicates that only around 25 % of them were interested in continuing their subscription. Our choice of 10 clusters is mainly motivated by how many curves that can be visualized in a single plot, without being too crowded. Further increasing the number of clusters would likely reveal more detailed behavior. 7. Discussion In this paper, we propose extensions of the Cox proportional hazards model. By parame- terizing the relative risk function of a Cox model with neural networks, we can model rich relationships between the covariates and event times. Furthermore, we allow the networks to model interactions between the covariates and time, resulting in models that are no longer constrained by the proportionality assumption of the Cox model. Building on methods for nested case-control studies, we propose a loss function that can be computed in batches, enabling the models to scale to large data sets. We verify through simulation studies that fitting a Cox model using our proposed loss function gives results close to those obtained using the full Cox partial likelihood. 
19 Kvamme, Borgan, and Scheel We compare our suggested methodology with classical Cox regression, random survival forests (RSF), DeepHit, and DeepSurv on 5 real-world data sets, and find that our proposed Cox-Time method performs very well, and has the best overall performance in terms of integrated Brier score (IBS) and integrated binomial log-likelihood (IBLL). DeepHit has, in general, the best discriminative performance, but this comes at the cost of poorly calibrated survival estimates. Finally, we show how estimated survival curves (event probabilities as functions of time), can be used as a descriptive tool to better understand event-time data sets. This is illus- trated by an example where we cluster the survival estimates of a customer churn data set, and show that this customer segmentation provides a useful view of the churning process. Interesting expansions of our methodology include the extension to handle multiple competing events, time-dependent covariates, dynamic predictions, and recurrent events. Furthermore, it would be of interest to explore other data sources that require more ad- vanced network structures such as convolutions and recurrent neural networks. Finally, less computationally expensive alternatives for creating survival estimates for the Cox-Time method should be explored. Acknowledgments This work was supported by The Norwegian Research Council 237718 through the Big Insight Center for research-driven innovation. Appendix A. Details on Hyperparameter Tuning In the following, we provide further details about the experiments conducted in Section 6. Here we list hyperparameter configurations and details about the model fitting procedures. A.1. Four Common Survival Data Sets Tuning Table A.1 gives the hyperparameter search space used for Rot. & GBSG, SUPPORT, METABRIC, and FLCHAIN. The square brackets describe continuous variables. In the ex- periments in Section 6.1, we sample 300 random parameter configurations for each method, for each fold of each data set. In the table, “α” and “σ” are given in (D.2) in Appendix D, “Num. durations” are the number of discrete durations (equidistant) used in DeepHit, “λ” is the penalty in (10), “Log duration” refers to a log-transform of the durations passed to Cox-Time, “Ridge” is a ridge penalty used in classical Cox regression, “Split covariates” and “Size leaf” are the number of covariates used for splitting, and the minimum node size of RSF. A.2. KKBox Tuning Hyperparameters in the KKBox study were found by a grid search over the relevant pa- rameters in Table A.2. The table consists of three sections, where the top represents the networks, the bottom represents RSF, and the middle contains network parameters that were found on a smaller network with two layers and 128 nodes. “α” and “σ” controls the 20 Time-to-Event Prediction with Neural Networks and Cox Regression Hyperparameter Values Layers {1, 2, 4} Nodes per layer {64, 128, 256, 512} Dropout [0, 0.7] Weigh decay {0.4, 0.2, 0.1, 0.05, 0.02, 0.01, 0} Batch size {64, 128, 256, 512, 1024} α (DeepHit) [0, 1] σ (DeepHit) {0.1, 0.25, 0.5, 1, 2.5, 5, 10, 100} Num. durations (DeepHit) {50, 100, 200, 400} λ (Cox-Time and Cox-MLP (CC)) {0.1, 0.01, 0.001, 0} Log durations (Cox-Time) {True, False} Ridge (Cox (Linear)) {1000, 100, 10, 1, 0.1, 0.01, 10−3, 10−4, 10−5} Split covariates (RSF) {2, 4, 6, 8} Size leaf (RSF) {2, 8, 32, 64, 128} Table A.1: Hyperparameter search space for experiments on Rot. & GBSG, SUPPORT, METABRIC, and FLCHAIN. 
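To illustrate the random search described in A.1, the following small sketch (dictionary keys and function names are ours) draws configurations by sampling continuous ranges uniformly and discrete sets uniformly at random; only part of the search space in Table A.1 is shown.

```python
import random

# Part of the search space in Table A.1; the keys are ours, the values are from the table.
search_space = {
    "layers":       [1, 2, 4],
    "nodes":        [64, 128, 256, 512],
    "dropout":      lambda: random.uniform(0.0, 0.7),        # continuous range [0, 0.7]
    "weight_decay": [0.4, 0.2, 0.1, 0.05, 0.02, 0.01, 0],
    "batch_size":   [64, 128, 256, 512, 1024],
}

def sample_config(space):
    """Draw one configuration: call samplers for continuous ranges, pick uniformly from lists."""
    return {k: (v() if callable(v) else random.choice(v)) for k, v in space.items()}

configs = [sample_config(search_space) for _ in range(300)]   # 300 configurations per fold
```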
A.2. KKBox Tuning

Hyperparameters in the KKBox study were found by a grid search over the relevant parameters in Table A.2. The table consists of three sections: the top represents the networks, the bottom represents RSF, and the middle contains network parameters that were found with a smaller network with two layers and 128 nodes. "α" and "σ" control the loss function of DeepHit, and we assumed that suitable values should generalize well across network structures. The same goes for whether or not we should log-transform the durations passed to Cox-Time. Hence, to reduce the hyperparameter search, we found suitable values with a smaller network. For the proposed Cox-MLP (CC) and Cox-Time, we used a fixed penalty λ = 0.001 in (10). All networks were trained with a batch size of 1028, and the best performing architectures can be found in Table 6.

Hyperparameter        Values
Layers                {4, 6, 8}
Nodes per layer       {128, 256, 512}
Dropout               {0, 0.1, 0.5}
α (*)                 {0, 0.001, 0.1, 0.2, 0.5, 0.8, 0.9, 0.99, 0.999, 1}
σ (*)                 {0.01, 0.1, 0.25, 0.5, 1, 10, 100}
Log durations (*)     {True, False}
Split covariates      {2, 4, 6, 8}
Size leaf             {8, 10, 20, 50}

Table A.2: KKBox hyperparameter configurations. (*) denotes parameters found with a two-layer network with 128 nodes.

Appendix B. Details from KKBox Churn Case Study

In the following, we provide some details of the KKBox case study that were left out of the main article. In Figure B.1, we show an example of nine survival curves estimated by Cox-Time on the KKBox data set. Each line represents an individual from the test set. It is clear that the Cox-Time method has learned to represent a variety of survival curves.

Figure B.1: Survival curves from the case study in Section 6.2, which models the times at which customers of a streaming service churn. Each curve gives the estimated probabilities for a customer not having churned (probabilities of still being a customer at any given time). The time axis shows the number of days since first subscription. The curves are generated by the Cox-Time method from Section 3.4. [Axes: time in days (0–350) against Ŝ(t).]

B.1. KKBox Data Set

KKBox provides data consisting of general customer information (city, age, gender, initial registration), transaction logs listing how customers manage their subscription plans, and user logs containing some aggregate information on the customers' usage of the streaming service. KKBox defines a customer as churned if he or she has failed to resubscribe to their service more than 30 days after the last subscription expired. Through the transaction information, we can create a data set with survival times and censoring indicators. We keep KKBox's original churn definition of 30 days of absence and calculate the survival time from the date of the first subscription, or the earliest record of the customer in the transaction logs. Customers that have previously churned but resubscribed are treated as new customers, with some covariate information describing their previous subscription history. All covariates are calculated at the time of subscription of each customer, and they do not change with time. This is a simplification, as the covariates of a customer are typically not stationary. However, as our objective is to evaluate the proposed methodology, we refrain from doing extensive feature engineering. In this regard, we further disregard the user logs, as we get a reasonable set of covariates from the customer and transaction information. The data sets are summarized in Table 5. As some of the customers have churned multiple times, and are therefore included multiple times, the table also includes the number of unique customers in the respective data sets. We have a total of 15 covariates, of which 7 are numeric and 8 are categorical. The numerical covariates give the time between subscriptions (for customers that have previously churned), the number of previous churns, the time since first registration, the number of days until the current subscription expires, the listed price of the current subscription, the amount paid for the subscription, and age. All numerical covariates, except the number of previous churns, were log-transformed. The categorical covariates are gender (3 categories), city (22 categories), how the customer registered (6 categories), and 5 indicator variables specifying whether an individual has previously churned, whether the subscription is canceled, whether the age is unrealistic, whether the subscription is automatically renewed, and whether we do not know when the customer first registered.
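As an illustration of how survival times and churn indicators of this kind might be derived from transaction logs, the sketch below splits one customer's transactions into subscription spells and labels each spell. The column names follow the public KKBox data, but they, the placeholder study-end date, and the simplified spell logic are assumptions made here for illustration, not a description of the exact preprocessing used in the paper.

    import pandas as pd

    def label_spells(trans, churn_gap=30, study_end=pd.Timestamp("2017-01-31")):
        """Label subscription spells for one customer; study_end is a placeholder cutoff.

        trans is assumed to hold one customer's transactions with datetime columns
        'transaction_date' and 'membership_expire_date'.
        """
        trans = trans.sort_values("transaction_date")
        # A new spell (a "new customer") starts when more than churn_gap days have
        # passed since the previous subscription expired.
        gap = (trans["transaction_date"] - trans["membership_expire_date"].shift()).dt.days
        spell_id = (gap > churn_gap).cumsum()
        records = []
        for _, spell in trans.groupby(spell_id):
            first_sub = spell["transaction_date"].iloc[0]
            last_expire = spell["membership_expire_date"].max()
            churned = (study_end - last_expire).days > churn_gap  # no resubscription in time
            end_time = last_expire if churned else study_end      # censored at the study end
            records.append({"duration": (end_time - first_sub).days, "event": int(churned)})
        return pd.DataFrame(records)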
Appendix C. Additional Simulations

In the following, we provide some additional simulations used to further investigate the proposed methods, and in Appendix C.3 we explain how the simulated data were generated. We again stress that the aim of the simulations is only to verify the expected behavior of our methods, and that they should not be interpreted as a general evaluation of them.

C.1. Non-Linear Hazards

Continuing from the simulations in Section 5, we do a simple study of the increased flexibility provided by replacing the linear predictor in a Cox regression with a neural network. To evaluate this, we extend the simulations in Section 5 by replacing g(x) = β^T x with the non-linear function

    g(x) = \beta^T x + \frac{2}{3}\left(x_1^2 + x_3^2 + x_1 x_2 + x_1 x_3 + x_2 x_3\right),    (C.1)

where x_i denotes element i of x. The simulations are otherwise unchanged from the linear case in Section 5. We sample 10,000 training samples, 10,000 test samples, and 1,000 samples used for validation (the validation data is used for early stopping), and fit the Cox-SGD and classical Cox regression from Section 5. Additionally, we fit the Cox model in Section 3.3, where g(x) is parameterized by a one-hidden-layer MLP (multilayer perceptron) with 64 hidden nodes and ReLU activations. This model will be referred to as Cox-MLP (we drop (CC) as we do not consider DeepSurv here).

In Figure C.1 we have, for 2,000 individuals of the test set, plotted the individual partial log-likelihood estimates of Cox-SGD and Cox-MLP against the true individual partial log-likelihoods. That is, for each i with D_i = 1, we plot

    \hat{\ell}_i = -\log\Big(\sum_{j \in R_i} \exp[\hat{g}(x_j) - \hat{g}(x_i)]\Big)

against the true \ell_i. The closer a method estimates g(x) to the true predictor in (C.1), the closer the scatter plot should be to the identity function (a straight line with slope 1 through the origin). As expected, we see that Cox-MLP produces likelihood estimates very close to the true likelihoods, while Cox-SGD struggles to represent this non-linear function.

Figure C.1: Partial log-likelihood estimates of Cox-SGD and Cox-MLP plotted against the true partial log-likelihoods, for 2,000 samples of the test set. [Axes: true PLL against estimated PLL.]
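As a concrete illustration of these individual partial log-likelihood terms, here is a minimal sketch that computes ℓ̂_i from an array of estimated risk scores ĝ(x_i); the deliberately naive O(n²) risk-set construction and the made-up inputs are for illustration only.

    import numpy as np

    def individual_pll(g_hat, durations, events):
        """Return -log sum_{j in R_i} exp[g_hat(x_j) - g_hat(x_i)] for every i with D_i = 1."""
        pll = []
        for i in np.where(events == 1)[0]:
            at_risk = durations >= durations[i]  # risk set R_i: still under observation at T_i
            pll.append(-np.log(np.exp(g_hat[at_risk] - g_hat[i]).sum()))
        return np.array(pll)

    # Made-up example inputs standing in for test-set predictions:
    rng = np.random.default_rng(0)
    g_hat = rng.normal(size=2000)
    durations = rng.exponential(scale=10.0, size=2000)
    events = rng.binomial(1, 0.7, size=2000)
    print(individual_pll(g_hat, durations, events)[:5])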
C.2. Non-Proportional Hazards

In our final simulations, we investigate the effect of removing the proportionality constraint in the Cox models. Building on the previous simulations, we add a time-dependent term to the risk function. The hazard function is now given by h(t | x) = h_0 \exp[g(t, x)], with h_0 = 0.02 and

    g(t, x) = a(x) + b(x)\, t,
    a(x) = g_{\mathrm{ph}}(x) + \mathrm{sign}(x_3),
    b(x) = |0.2\,(x_0 + x_1) + 0.5\, x_0 x_1|,

where g_ph(x) is the function in (C.1). The simulations are otherwise unchanged from the previous experiments. We require the term b(x) to be non-negative to ensure that the hazards increase with time. Furthermore, we have added sign(x_3) to a(x), as this is essentially equivalent to having two different baselines, h̃_0 = h_0 \exp[\mathrm{sign}(x_3)] ∈ {0.0074, 0.054}. Both of these choices are motivated by our attempt to produce reasonable-looking survival curves.

We sample 10,000 training samples, 10,000 test samples, and 2,000 samples for validation, and fit a Cox-MLP model and a Cox-Time model. Both Cox-MLP and Cox-Time parameterize g with a four-hidden-layer MLP with ReLU activations, 128 nodes in each layer, and dropout between layers with rate 0.1.

In Figure C.2, we have plotted the estimated partial log-likelihoods of Cox-MLP and Cox-Time against the true partial log-likelihoods for 2,000 samples of the test set. Though the difference between the methods is not very large, Cox-Time appears to capture the true values better than Cox-MLP.

Figure C.2: Partial log-likelihood estimates of Cox-MLP and Cox-Time plotted against the true partial log-likelihoods, for 2,000 samples of the test set. [Axes: true PLL against estimated PLL.]

To further illustrate some of the differences between Cox-MLP and Cox-Time, we have in Figure C.3 plotted nine survival curves Ŝ(t | x) from the test set. The figure shows both the true curves and the curves estimated by the two methods. It is seen that Cox-Time estimates the true survival curves better than Cox-MLP.

Figure C.3: Examples of survival curves Ŝ(t | x) from the test set in the non-proportional simulation study. [Nine panels of Ŝ(t) against time, each comparing the true curve with the Cox-MLP and Cox-Time estimates.]

C.3. Simulation Details

In the following, we explain in detail how we generated the simulated data sets in Section 5 and Appendices C.1 and C.2. We want to generate survival times T^* from relative risk models of the form

    h(t | x) = h_0(t) \exp[g(t, x)].    (C.2)

If the function g does not depend on t, we have a proportional hazards model, which is just a special case of the relative risk models. Let H(t | x) denote the continuous (increasing) cumulative hazard

    H(t | x) = \int_0^t h(s | x) \, ds,

and let V be exponentially distributed with parameter 1. Then we can obtain survival times through the inverse cumulative hazard, T^* = H^{-1}(V | x). This can be shown to hold through the hazard's relationship to the survival function in (1),

    S(t | x) = P(T^* > t | x) = P(H^{-1}(V | x) > t) = P(V > H(t | x)) = \exp[-H(t | x)],

as V is exponentially distributed with P(V > v) = \exp(-v). Hence, we can obtain survival times by transforming samples generated from an exponential distribution. To obtain an analytical expression for the inverse cumulative hazard H^{-1}(v | x), we restrict the models in (C.2) to have a constant baseline h_0, in addition to a simple time dependence in g(t, x). For the linear predictor in Section 5 and the non-linear predictor in Appendix C.1, we simulate with proportional hazards, meaning g(t, x) = g(x). Hence, we get cumulative hazards and inverse cumulative hazards of the form

    H(t | x) = t\, h_0 \exp[g(x)],  \qquad  H^{-1}(v | x) = \frac{v}{h_0 \exp[g(x)]}.

In the non-proportional simulations in Appendix C.2, we add a time-dependent term, g(t, x) = a(x) + b(x)\, t, which gives us

    H(t | x) = \frac{h_0 \exp[a(x)]\,\big(\exp[b(x)\, t] - 1\big)}{b(x)},  \qquad  H^{-1}(v | x) = \frac{1}{b(x)} \log\Big(1 + \frac{v\, b(x)}{h_0 \exp[a(x)]}\Big).
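A minimal sketch of this sampling scheme is given below. The inversion formulas are the ones derived above, but the covariate distribution, the coefficients, and the linear g(x) standing in for the predictors used in the paper are placeholders for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    n, h0 = 10_000, 0.02
    x = rng.uniform(-1, 1, size=(n, 5))          # placeholder covariate distribution
    beta = np.array([0.5, 0.5, 0.5, 0.0, 0.0])   # placeholder coefficients

    # Proportional hazards: T* = H^{-1}(V | x) = V / (h0 exp[g(x)]) with V ~ Exp(1).
    g = x @ beta
    v = rng.exponential(scale=1.0, size=n)
    t_proportional = v / (h0 * np.exp(g))

    # Non-proportional hazards: H^{-1}(v | x) = log(1 + v b(x) / (h0 exp[a(x)])) / b(x),
    # here with the linear g in place of g_ph.
    a = g + np.sign(x[:, 3])
    b = np.abs(0.2 * (x[:, 0] + x[:, 1]) + 0.5 * x[:, 0] * x[:, 1])
    t_nonproportional = np.log1p(v * b / (h0 * np.exp(a))) / b

Censoring times can then be drawn separately and combined with these event times to form the observed durations and event indicators.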
Appendix D. DeepHit

DeepHit by Lee et al. (2018) is a discrete survival model for competing risks. However, as we only consider one type of event, we will express the method in terms of a single cause. DeepHit considers time to be discrete, so to fit it to the continuous-time data sets in Section 6, we discretize the event times with an equidistant grid between the smallest and largest duration in the training set. The number of discrete time points is considered a hyperparameter, given by "Num. durations" in Table A.1.

Now, assume time is discrete with 0 = τ_0 < τ_1 < τ_2 < · · · < τ_m. Let y(x) = [y_0(x), y_1(x), . . . , y_m(x)]^T denote the output of a neural net with covariates x and a softmax output activation. Then y(x) can be interpreted as the estimated probability mass function of the duration times, y_j(x_i) = \hat{P}(T^*_i = τ_j | x_i). The estimated survival function is then given by

    \hat{S}(\tau_j | x) = 1 - \sum_{k=1}^{j} y_k(x),

and the discrete negative log-likelihood corresponding to the continuous version in (2) is

    \mathrm{loss}_L = -\sum_{i=1}^{N} \Big[ D_i \log(y_{e_i}(x_i)) + (1 - D_i) \log(\hat{S}(T_i | x_i)) \Big].

Here, e_i denotes the index of the event time for observation i, i.e., T_i = τ_{e_i}. Furthermore, DeepHit adds a second loss that attempts to improve its ranking capabilities,

    \mathrm{loss}_{\mathrm{rank}} = \sum_{i,j} D_i \, 1\{T_i < T_j\} \exp\!\Big(\frac{\hat{S}(T_i | x_i) - \hat{S}(T_i | x_j)}{\sigma}\Big).    (D.1)

The loss of DeepHit is a combination of these two losses, where the ranking loss is scaled by a constant. We deviate slightly from the original implementation here and instead use a convex combination of the two,

    \mathrm{loss} = \alpha \, \mathrm{loss}_L + (1 - \alpha) \, \mathrm{loss}_{\mathrm{rank}},    (D.2)

where α and σ from (D.1) are considered hyperparameters.
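To make the single-event loss concrete, here is a sketch of the combination in (D.2) written with PyTorch tensor operations. It is not the reference implementation: phi denotes the network output before the softmax, the event-time indices e_i are assumed given, and averaging over observations and pairs (rather than summing) is a scaling choice made for the sketch.

    import torch
    import torch.nn.functional as F

    def deephit_loss(phi, idx_durations, events, alpha=0.5, sigma=0.1):
        """phi: (n, m+1) network outputs; idx_durations: grid index e_i of T_i; events: D_i."""
        idx = idx_durations.long()
        events = events.float()
        n = phi.shape[0]
        rows = torch.arange(n)

        y = F.softmax(phi, dim=1)                  # estimated PMF over tau_0, ..., tau_m
        surv = 1.0 - y.cumsum(dim=1) + y[:, :1]    # S(tau_j | x) = 1 - sum_{k=1}^{j} y_k(x)

        # Likelihood part: log PMF at the event time for events, log survival for censorings.
        pmf_at_event = y[rows, idx].clamp(min=1e-7)
        surv_at_event = surv[rows, idx].clamp(min=1e-7)
        loss_l = -(events * pmf_at_event.log() + (1 - events) * surv_at_event.log()).mean()

        # Ranking part (D.1): compare S(T_i | x_i) with S(T_i | x_j) for pairs with T_i < T_j.
        s_ii = surv[rows, idx].view(-1, 1)         # S(T_i | x_i)
        s_ij = surv[:, idx].t()                    # element (i, j) is S(T_i | x_j)
        valid_pair = (idx.view(-1, 1) < idx.view(1, -1)).float() * events.view(-1, 1)
        loss_rank = (valid_pair * torch.exp((s_ii - s_ij) / sigma)).mean()

        return alpha * loss_l + (1 - alpha) * loss_rank

During training, a call of the form deephit_loss(net(x), idx, d) would then replace the plain negative log-likelihood as the objective.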
References

Laura Antolini, Patrizia Boracchi, and Elia Biganzoli. A time-dependent discrimination index for survival data. Statistics in Medicine, 24(24):3927–3944, 2005.

David R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187–220, 1972.

Cameron Davidson-Pilon, Jonas Kalderstam, Ben Kuhn, Andrew Fiore-Gartland, Luis Moneda, Paul Zivich, Alex Parij, Kyle Stark, Steven Anton, Lilian Besson, et al. CamDavidsonPilon/lifelines: v0.14.1, 2018.

Lore Dirick, Gerda Claeskens, and Bart Baesens. Time to default in credit scoring using survival analysis: a benchmark study. Journal of the Operational Research Society, 68(6):652–665, 2017.

David Faraggi and Richard Simon. A neural network model for survival data. Statistics in Medicine, 14(1):73–82, 1995.

Stephane Fotso. Deep neural networks for survival analysis based on a multi-task framework. arXiv preprint arXiv:1801.05512, 2018.

Thomas A. Gerds, Michael W. Kattan, Martin Schumacher, and Changhong Yu. Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Statistics in Medicine, 32(13):2173–2184, 2012.

Larry Goldstein and Bryan Langholz. Asymptotic theory for nested case-control sampling in the Cox regression model. Annals of Statistics, 20(4):1903–1928, 1992.

Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18):2529–2545, 1999.

Cheng Guo and Felix Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.

Frank E. Harrell Jr, Robert M. Califf, David B. Pryor, Kerry L. Lee, and Robert A. Rosati. Evaluating the yield of medical tests. Journal of the American Medical Association, 247(18):2543–2546, 1982.

Patrick J. Heagerty and Yingye Zheng. Survival model predictive accuracy and ROC curves. Biometrics, 61(1):92–105, 2005.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1609.04836, 2017.

Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random survival forests. Annals of Applied Statistics, 2(3):841–860, 2008.

Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1), 2018.

Nitish S. Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

John P. Klein and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York, 2nd edition, 2003.

Bryan Langholz and Larry Goldstein. Risk set sampling in epidemiologic cohort studies. Statistical Science, 11(1):35–53, 1996.

Changhee Lee, William R. Zame, Jinsung Yoon, and Mihaela van der Schaar. DeepHit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Margaux Luck, Tristan Sylvain, Héloïse Cardinal, Andrea Lodi, and Yoshua Bengio. Deep learning for patient-specific kidney graft survival analysis. arXiv preprint arXiv:1705.10245, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Daniel J. Sargent. Comparison of artificial neural networks with other statistical approaches. Cancer, 91(8):1636–1642, 2001.

L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472, 2017.

Gian A. Susto, Andrea Schirru, Simone Pampuri, Seán McLoone, and Alessandro Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, 2015.

Terry M. Therneau. A package for survival analysis in S. Version 2.38, 2015.

Duncan C. Thomas. Addendum to: Methods of cohort analysis: appraisal by application to asbestos mining, by F. D. K. Liddell, J. C. McDonald and D. C. Thomas. Journal of the Royal Statistical Society: Series A (General), 140(4):469–491, 1977.

Dirk Van den Poel and Bart Lariviere. Customer attrition analysis for financial services using proportional hazard models. European Journal of Operational Research, 157(1):196–217, 2004.
Antonio Viganò, Marlene Dorgan, Jeanette Buckingham, Eduardo Bruera, and Mari E. Suarez-Almazor. Survival prediction in terminal cancer patients: a systematic review of the medical literature. Palliative Medicine, 14(5):363–374, 2000.

Anny Xiang, Pablo Lapuerta, Alex Ryutov, Jonathan Buckley, and Stanley Azen. Comparison of the performance of neural network methods and Cox regression for censored survival data. Computational Statistics & Data Analysis, 34:243–257, 2000.

Safoora Yousefi, Fatemeh Amrollahi, Mohamed Amgad, Chengliang Dong, Joshua E. Lewis, Congzheng Song, David A. Gutman, Sameer H. Halani, Jose E. V. Vega, Daniel J. Brat, et al. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Scientific Reports, 7(1):11707, 2017.

Xinliang Zhu, Jiawen Yao, and Junzhou Huang. Deep convolutional neural network for survival analysis with pathological images. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 544–547, 2016.

Xinliang Zhu, Jiawen Yao, Feiyun Zhu, and Junzhou Huang. WSISA: Making survival prediction from whole slide histopathological images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6855–6863, July 2017.