key: cord-0043171-gwnbnkwg
authors: Kim, Sundong; Song, Hwanjun; Kim, Sejin; Kim, Beomyoung; Lee, Jae-Gil
title: Revisit Prediction by Deep Survival Analysis
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_39
sha: e1fea16bd575af6c01ddaae0248c3fa7ed9a8382
doc_id: 43171
cord_uid: gwnbnkwg

In this paper, we introduce SurvRev, a next-generation revisit prediction model that can be tested directly in business. The SurvRev model offers many advantages. First, SurvRev can use partial observations, which were treated as missing data and removed in previous regression frameworks. Using deep survival analysis, we can estimate the next customer arrival from an unknown distribution. Second, SurvRev is an event-rate prediction model: it generates the predicted event rate for the next k days rather than directly predicting the revisit interval and revisit intention. We demonstrate the superiority of the SurvRev model by comparing it with diverse baselines, such as a feature engineering model and state-of-the-art deep survival models.

Predicting customer revisits in offline stores has become feasible because of advances in sensor technology. In addition to well-known but difficult-to-obtain customer revisit attributes, such as purchase history, store atmosphere, and customer satisfaction with products, large-scale customer motion patterns captured via in-store sensors are effective in predicting customer revisit [9]. Market leaders, such as Alibaba, Amazon, and JD.com, have opened a new generation of retail stores to satisfy customers. In addition, small retail chains are beginning to apply third-party retail analytics solutions built upon Wi-Fi fingerprinting and video content analytics to learn more about their customer behavior. For small stores that have not yet captured all aspects of customer behavior, the appropriate use of sensor data becomes more important to ensure their long-term benefit.

By knowing the visitation pattern of customers, store managers can indirectly gauge the expected revenue. Targeted marketing also becomes possible once the revisit intention of customers is known. By offering discount coupons, merchants can encourage customers to casually revisit their nearby stores. Moreover, they can offer a sister brand with finer products to provide new shopping experiences to customers. Consequently, they can simultaneously increase their revenue and satisfy their customers. A series of previously conducted works [9, 10] introduced a method of applying feature engineering to estimate important attributes for determining customer revisit. The proposed set of features was intuitive and easy to reproduce, and the model was powered by widely known machine learning models, such as XGBoost [2].

However, some gaps existed between their evaluation protocol and real application settings. Although their approach could effectively perform customer-revisit prediction in downsampled and cross-validated settings, it was not guaranteed to work satisfactorily on imbalanced visitations with partial observations. In the case of class imbalance, the predictive power of each feature might disappear because of the dominance of the majority label, and such gaps might require further adjustment in actual deployment. In addition, in a longitudinal prediction setup, the cross-validation policy results in implicit data leakage because the testing set is not guaranteed to be collected later than the training set.

By evaluating the frameworks using chronologically split imbalanced data, the gap between previously conducted works and real-world scenarios can be narrowed. However, a previously unconsidered challenge, i.e., partial observations, arises after splitting the dataset by time. Partial observations occur for every customer, as the model can be trained only up to a certain observation time. In the case of typical offline check-in data, most customers are only one-time visitors for a certain point of interest [9]. Therefore, the amount of partial observations is considerably large at the individual store level. However, previously conducted works [9, 10] ignored partial observations, as their regression models required labels, resulting in not only significant information loss but also biased prediction, as a model is trained using only revisited cases. In this study, we adopt survival analysis [18] to address these issues.

A practical model must predict the behavior of both partially observed customers and new visitors who first appear during the testing period. Predicting the revisit of both censored customers and new visitors simultaneously is very challenging, as the characteristics of the two groups, such as the remaining observation time and their visit histories, inherently differ. In a usual classification task, it is assumed that the class distributions of the training and testing sets are the same. However, the expected arrival rate of new visitors might be lower than that of the existing customers, as the former did not appear during the training period [16]. To understand the revisit pattern using visitation histories with irregular arrival rates, we use deep learning to avoid assuming a fixed arrival rate λ and, subsequently, predict quantized revisit rates.

The abovementioned principles of a practical model might be crucial in applied data science research, and they offer considerable advantages over previously conducted works, which sidestep these difficulties. In the following section, we introduce our principled approach, i.e., SurvRev, to resolve customer-revisit prediction in more realistic settings.

Customer-Revisit Prediction [10]: Given a set of visits $V_{train} = \{v_1, \ldots, v_n\}$ with known revisit intentions $RV_{bin}(v_i)$ and revisit intervals $RV_{days}(v_i)$, where $v_i \in V_{train}$, $RV_{bin}(v_i) \in \{0, 1\}$, and

$$RV_{days}(v_i) \begin{cases} \in \mathbb{R}_{>0}, & \text{if } RV_{bin}(v_i) = 1, \\ = \infty, & \text{otherwise,} \end{cases}$$

build a classifier $C$ that predicts $\widehat{RV}_{bin}(v_j)$ and $\widehat{RV}_{days}(v_j)$ for a new visit $v_j$.

In this section, we introduce our customer-revisit prediction approach. We named our model SurvRev, which is a condensed form of Survival Revisit predictor.

Fig. 2 depicts the architecture of the low-level visit encoder. In the encoder, the main area sequence inputs go through three consecutive layers and are, subsequently, combined with auxiliary visit-level inputs, i.e., user embeddings and handcrafted features. We first introduce three-tiered main layers for the area inputs, followed by introducing the process line of the auxiliary visit-level inputs. 

The first layer that an area sequence passes through is a pretrained area embedding layer, which yields a dense representation for each sensor ID. We used the area and user embeddings pretrained via Doc2Vec [12] as initialization. The area embedding is concatenated with the dwell time and, subsequently, goes through a bidirectional LSTM (Bi-LSTM) [17], which is expected to learn meaningful sequential patterns by looking back and forth. Each LSTM cell emits its learned representation, and the resulting sequences pass through a one-dimensional convolutional neural network (CNN) to learn higher-level representations from wider semantics. We expect the CNN layers to automate the manual process of grouping the areas for obtaining multilevel location semantics, such as category- or gender-level semantics [10]. In business, the number of CNN layers can be determined depending on the number of meaningful semantic levels that the store manager wants to observe. The output of the CNN layer goes through self-attention [1] to examine all the information associated with the visit. Using the abovementioned sequence of processes, SurvRev can learn the diverse levels of meaningful sequential patterns that determine customer revisits.

Adding Visit-Level Features: Here, we concatenate a user representation with an area sequence representation and, subsequently, apply FC layers with ReLU activation [4]. We can implicitly control the importance of both the representations by changing the dimensions for both the inputs. Finally, we concatenate selected handcrafted features with the combination of user and area representations. The handcrafted features contain the summary of each visit that may not be captured using the boxed component depicted in Fig. 2. The selected handcrafted features are the total dwell time, average dwell time, number of areas visited, number of unique areas visited, day of the week, hour of the day, number of prior visits, and previous interval. We applied batch normalization [7] before passing the final result through the high-level module of SurvRev.
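As a rough illustration, the eight visit-level summary features listed above could be derived from a raw area sequence as shown below. This is only a sketch under stated assumptions: the `Visit` record layout, the feature key names, the units (seconds for dwell time, days for the previous interval), and the helper `handcrafted_features` are hypothetical and are not taken from the original implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Visit:
    # Hypothetical record layout: visit start time plus (area_id, dwell_seconds) pairs.
    start: datetime
    areas: List[Tuple[str, float]]


def handcrafted_features(visit: Visit, prior_visits: List[Visit]) -> dict:
    """Summarize one visit with the eight handcrafted features named in the text."""
    dwell = [seconds for _, seconds in visit.areas]
    total = sum(dwell)
    # Previous interval in days; None when the customer has no earlier visit.
    prev_interval: Optional[float] = None
    if prior_visits:
        last = max(prior_visits, key=lambda v: v.start)
        prev_interval = (visit.start - last.start).total_seconds() / 86400.0
    return {
        "total_dwell_time": total,
        "avg_dwell_time": total / len(dwell) if dwell else 0.0,
        "num_areas": len(visit.areas),
        "num_unique_areas": len({area for area, _ in visit.areas}),
        "day_of_week": visit.start.weekday(),  # Monday = 0
        "hour_of_day": visit.start.hour,
        "num_prior_visits": len(prior_visits),
        "previous_interval_days": prev_interval,
    }
```

In the model described above, such a feature vector would be concatenated with the learned user and area representations before batch normalization.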

The blue box in Fig. 1 depicts the architecture of the high-level event-rate predictor. Its main functionality is to consider the histories of a customer by using dynamic LSTMs [6] and predict the revisit rate for the next k days. For each customer, the sequence of outputs from the low-level encoder becomes the input to the LSTM layers. We use dynamic LSTMs to allow sequences with variable lengths, which include a parameter to control the maximum number of events to consider. The output from the final LSTM cell goes through the FC layers with softmax activation. We set the dimension k of the final FC layer to be 365 to represent quantized revisit rates [8] for the next 365 days. For convenience, we refer to this 365-dim revisit rate vector as $\hat{\lambda} = [\hat{\lambda}_t,\ 0 \le t < k,\ t \in \mathbb{N}]$, where each element $\hat{\lambda}_t$ indicates a quantized revisit rate in a unit-time bin $[t, t + 1)$.

In this section, we explain the procedure to convert the 365-dim revisit rate $\hat{\lambda}$ to other criteria, such as the probability density function, expected value, and complementary cumulative distribution function (CCDF). These criteria will be used to calculate the various loss functions in Sect. 2.5. Remember that $\widehat{RV}_{days}(v)$ denotes the predicted revisit interval of visit $v$, meaning that SurvRev expects a revisit to occur $\widehat{RV}_{days}(v)$ days after the customer made visit $v$ to a store.

1. Subtracting the quantized event rate $\hat{\lambda}$ from 1 gives the survival rate, i.e., $1 - \hat{\lambda}$, which denotes the rate at which a revisit will not occur during the next unit time provided that the revisit has not happened thus far. Therefore, the cumulative product of the survival rate with time gives the quantized probability density function as follows:

2. Subsequently, the predicted revisit interval can be represented as a form of expected value as follows:

3. On the basis of the last time of the observation period, it can be predicted whether a revisit is made within a period, which is denoted by $\widehat{RV}_{bin}(v)$.

Here, we define a suppress time $t_{supp}(v) = t_{end} - t_v$, where $t_v$ denotes the visit time of $v$ and $t_{end}$ the time when the observation ends. We used the term suppress time to convey that the customer suppresses his or her desire to revisit until the time the observation ends by not revisiting the store. Thus,

4. Calculating the survival rate using suppress time gives CCDF and CDF, both of which will be used to compute the cross-entropy loss. When $t_{supp}(v)$ is a natural number, the following holds:
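The four conversions above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the use of bin midpoints (t + 0.5) in the expected value and the strict product bound r < t are assumptions, and the threshold rule for the binary prediction (a revisit is predicted when the expected interval does not exceed the suppress time) is one plausible reading of step 3.

```python
from typing import List


def revisit_pdf(rates: List[float]) -> List[float]:
    """Step 1: p[t] = rates[t] * prod_{r < t} (1 - rates[r]), the mass of bin [t, t+1)."""
    pdf, survival = [], 1.0
    for rate in rates:
        pdf.append(rate * survival)   # revisit happens in this bin after surviving so far
        survival *= 1.0 - rate        # no revisit in this bin either
    return pdf


def expected_interval(rates: List[float]) -> float:
    """Step 2: expected revisit interval over the k bins.

    Assumption: each bin is represented by its midpoint t + 0.5, and any
    probability mass beyond the last bin is ignored.
    """
    return sum((t + 0.5) * p for t, p in enumerate(revisit_pdf(rates)))


def ccdf(rates: List[float], t_supp: int) -> float:
    """Step 4: P(no revisit before t_supp) = prod_{r < t_supp} (1 - rates[r])."""
    survival = 1.0
    for rate in rates[:t_supp]:
        survival *= 1.0 - rate
    return survival


def cdf(rates: List[float], t_supp: int) -> float:
    """Step 4: P(revisit before t_supp) = 1 - CCDF."""
    return 1.0 - ccdf(rates, t_supp)


def predicted_revisit(rates: List[float], t_supp: int) -> int:
    """Step 3 (assumed rule): predict a revisit if the expected interval fits
    within the suppress time t_supp = t_end - t_v."""
    return 1 if expected_interval(rates) <= t_supp else 0
```

For example, with a constant rate of 0.5 over three bins, `revisit_pdf([0.5, 0.5, 0.5])` yields the masses 0.5, 0.25, and 0.125, and `ccdf([0.5, 0.5, 0.5], 2)` is 0.25, which equals one minus the mass of the first two bins.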

We designed a custom loss function to learn the parameters of our SurvRev model. We defined four types of losses: negative log-likelihood loss, root-mean-squared error (RMSE) loss, cross-entropy loss, and pairwise ranking loss. The prefixes $L_{uc}$, $L_{c}$, and $L_{uc-c}$ mean that each loss is calculated for uncensored, censored, and all samples, respectively. Depending on the task domain, the losses to be considered will be slightly different. Therefore, the final $L_{uc-nll}$ can be a weighted sum of five variants.

The second loss is the RMSE loss, which minimizes the error between the predicted revisit interval $\widehat{RV}_{days}(v)$ and the actual interval $RV_{days}(v)$. The term $L_{uc-rmse}$ minimizes the error of the model for uncensored samples. One might consider the RMSE loss a continuous expansion of the negative log-likelihood loss.

Using the cross-entropy loss, one can measure the performance of a classification model whose output is a probability value between 0 and 1. The cross-entropy loss decreases as the predicted probability converges to the actual label. We separate $L_{uc-c-ce}$ into $L_{uc-ce}$ and $L_{c-ce}$, denoting the partial cross-entropy values of the uncensored and censored sets, respectively.
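To make the split into uncensored and censored parts concrete, the following sketch computes both partial cross-entropy terms from predicted rates. It assumes, as one plausible reading of the text, that the predicted probability is the CDF evaluated at the suppress time, that uncensored visits carry label 1 (a revisit was observed before the observation ended), and that censored visits carry label 0. The function names and the `(rates, t_supp, revisited)` sample layout are hypothetical.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], int, bool]  # (predicted rates, suppress time, revisit observed?)


def cdf_at(rates: List[float], t_supp: int) -> float:
    """P(revisit before t_supp) = 1 - prod_{r < t_supp} (1 - rates[r])."""
    survival = 1.0
    for rate in rates[:t_supp]:
        survival *= 1.0 - rate
    return 1.0 - survival


def cross_entropy_losses(samples: List[Sample], eps: float = 1e-12) -> Tuple[float, float]:
    """Return (L_uc_ce, L_c_ce): summed cross-entropy over uncensored and censored visits."""
    loss_uc = 0.0
    loss_c = 0.0
    for rates, t_supp, revisited in samples:
        p = cdf_at(rates, t_supp)
        if revisited:
            # Uncensored: label 1, so penalize a low predicted revisit probability.
            loss_uc += -math.log(max(p, eps))
        else:
            # Censored: label 0, so penalize a high predicted revisit probability.
            loss_c += -math.log(max(1.0 - p, eps))
    return loss_uc, loss_c
```

The `eps` clamp only guards against taking the logarithm of zero and is an implementation convenience, not part of the described loss.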

Pairwise Ranking Loss $L_{uc-c-rank}$: Motivated by the ranking loss function [13] and c-index [14], we introduce the pairwise ranking loss to compare the orderings between the predicted revisit intervals. This loss function fine-tunes the model by making the tendency of the predicted and the actual intervals similar to each other. The loss function $L_{uc-c-rank}$ is formally defined using the following steps.

1. First, we define two matrices P and Q as follows:

For a censored visit $v$, we use the suppress time $t_{supp}(v)$ instead of the actual revisit interval $RV_{days}(v)$ to draw a comparison between uncensored and censored cases.

2. The final pairwise ranking loss is defined as follows:

By minimizing $L_{uc-c-rank}$, our model encourages the correct ordering of pairs while discouraging the incorrect one. Both the constraint $v_i \in V_{uncensored}$ and the variable $Q_{ij}$ remove the influence of incomparable pairs, such as $v_i$ and $v_j$ with $RV_{days}(v_i) = 3$ and $t_{supp}(v_j) = 2$, respectively, due to the censoring effect.
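The role of the two matrices can be illustrated with a simple counting version of the loss. The sketch below is an assumption-laden reading of the description, not the formal definition: it takes $Q_{ij} = 1$ when the observed time of $v_i$ is strictly smaller than that of $v_j$ (so the pair is comparable), $P_{ij} = 1$ when the model orders the pair the other way around, and sums $P_{ij} Q_{ij}$ over uncensored $v_i$. A trainable model would need a differentiable surrogate, which is not shown; the function name is hypothetical.

```python
from typing import List


def pairwise_ranking_loss(predicted: List[float],
                          observed: List[float],
                          uncensored: List[bool]) -> int:
    """Count comparable pairs (i, j) whose predicted order contradicts the observed one.

    observed[i] is the actual revisit interval for an uncensored visit and the
    suppress time for a censored visit.
    """
    loss = 0
    n = len(predicted)
    for i in range(n):
        if not uncensored[i]:
            continue  # constraint: v_i must be uncensored
        for j in range(n):
            # Q_ij: pair is comparable and v_i is observed to revisit first.
            q_ij = 1 if observed[i] < observed[j] else 0
            # P_ij: the model predicts v_i to revisit later than v_j.
            p_ij = 1 if predicted[i] > predicted[j] else 0
            loss += p_ij * q_ij
    return loss
```

With the incomparable pair from the text (an uncensored visit with an interval of 3 and a censored visit with a suppress time of 2), the pair contributes nothing to the loss regardless of the predictions, because the censored visit might revisit either before or after day 3.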

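Combining all the losses, we can design our final objective $L$ to train our SurvRev model. Thus,

$$\arg\min_{\theta} L = \arg\min_{\theta}\; L_{uc-nll} \cdot L_{uc-rmse} \cdot L_{uc-c-ce} \cdot L_{uc-c-rank},$$

where $\theta$ denotes the model parameters of SurvRev. We used the product loss to benefit from all the losses and to reduce the number of weight parameters among the losses.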

To evaluate the efficacy of our model, we performed various experiments using a real-world in-store mobility dataset collected by Walkinsights. After introducing the tuned parameter values of the SurvRev model, we summarize the evaluation metrics required for revisit prediction (see Sect. 3.1). In addition, we demonstrate the superiority of our SurvRev model by comparison with seven different baseline event prediction models (see Sect. 3.2).

Data Preparation: We used a Wi-Fi fingerprinted dataset introduced in [9], which represents customer mobility captured using dozens of in-store sensors in flagship offline retail stores located in Seoul. We selected four stores that had collected data for more than 300 days from Jan 2017. We consider each store independently because only a few customers overlapped among the stores. We randomly selected 50,000 customers that had visits longer than 1 min, which is a sufficiently large number of customers to guarantee satisfactory model performance according to [10]. If a customer reappears within 10 min, we do not consider that particular subsequent visit as a new visit. We also designed several versions of training and testing sets by varying the training length to 180 and 240 days.

The following default values were used. For the low-level module, the 64-dim Bi-LSTM unit was used. The kernel size of the CNN was 3 with 16-dim filters, and the number of neurons in the FC layer was 128. We used only one dense layer. For a visit with a long sequence, we considered m areas that could cover up to 95% of all the cases, where m depends on each dataset. In the high-level module, the dynamic LSTM had 256-dim units and processed up to 5 events. We used two layers of LSTM with tanh activation. For the rate predictor, we used two FC layers with 365 neurons and ReLU activation. For training the model, we used the Adam [11] optimizer with a learning rate of 0.001. We set the mini-batch size to 32 and ran 10 epochs. The NLL loss $L_{uc-nll}$ was set as the average of $L_{uc-nll-season}$ and $L_{uc-nll-month}$. Some of these hyperparameters were selected empirically via grid search.
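The visit-delimiting rules from the data preparation step (a reappearance within 10 min is not a new visit, and only visits longer than 1 min are kept) could be applied to raw sensor sessions roughly as follows. This is a sketch under assumptions: the `(start, end)` session layout in seconds, the choice to merge a quick reappearance into the preceding visit, the inclusive 10-min boundary, and the name `build_visits` are all hypothetical.

```python
from typing import List, Tuple

MERGE_GAP_SEC = 10 * 60  # a reappearance within 10 min is not a new visit
MIN_VISIT_SEC = 60       # keep only visits longer than 1 min


def build_visits(sessions: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge (start, end) sessions of one customer, given in seconds, into visits."""
    visits: List[Tuple[int, int]] = []
    for start, end in sorted(sessions):
        if visits and start - visits[-1][1] <= MERGE_GAP_SEC:
            # Assumed handling: extend the previous visit instead of opening a new one.
            visits[-1] = (visits[-1][0], max(visits[-1][1], end))
        else:
            visits.append((start, end))
    # Drop visits that are not longer than the minimum duration.
    return [(s, e) for s, e in visits if e - s > MIN_VISIT_SEC]
```

Merging before filtering means that several short sessions separated by brief gaps can still form one valid visit; whether the original pipeline used this order is not stated.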

We added a switch to control how much user history is used while training the SurvRev model. For predicting partially-observed instances ($v_{tep}$), all the histories up to the observation time were used to train the model. For instance, if an input visit $v_5$ is a partial observation, then $\{v_1, \cdots, v_5\}$ and $t_{supp}(v_5)$ are fed into the high-level event-rate predictor. For predicting first-time visitors, only the first appearances ($v_1 \in V_{train}$) were used to train the model. In the latter case, the LSTM length in the high-level event-rate predictor is always one because each training instance has no prior log.

We used two metrics, namely, F-score and c-index, to evaluate the prediction performance.

- F-score: F-score measures the binary revisit classification performance.

- C-index [14]: C-index measures the global pairwise ordering performance, and it is the most generally used evaluation metric in survival analysis [13, 15].
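Both metrics can be computed directly from predictions and observations. The sketch below is illustrative only: it uses the standard F1 definition for the F-score and a Harrell-style concordance for the c-index, in which a pair is comparable when the visit with the smaller observed time is uncensored. Counting tied predictions as half-concordant and the function names are assumptions; the evaluation code of the paper is not reproduced here.

```python
from typing import List


def f_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 score of binary revisit predictions (1 = revisit)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def c_index(predicted: List[float], observed: List[float], uncensored: List[bool]) -> float:
    """Fraction of comparable pairs whose predicted intervals are ordered correctly.

    observed[i] is the revisit interval for an uncensored visit and the
    suppress time for a censored visit.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(predicted)):
        if not uncensored[i]:
            continue  # the earlier event of a pair must be observed
        for j in range(len(predicted)):
            if observed[i] < observed[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0
                elif predicted[i] == predicted[j]:
                    concordant += 0.5  # assumed tie handling
    return concordant / comparable if comparable else float("nan")
```

A c-index of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ordering, and 0.0 means every comparable pair is reversed.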

Comparison with Baselines: We verify the effectiveness of our SurvRev model on the large-scale in-store mobility data. To compare our method with various baseline methods, we implemented eight different event-prediction models.

Baselines Not Considering Covariates: The first three baselines focus on the distribution of revisit labels and consider them an arrival process. They do not consider the attributes, i.e., covariates, obtained from each visit.

- Majority Voting (Majority): Predictions are dictated by the majority class for classification and by the average value for regression; this baseline is a naive but powerful method for an imbalanced dataset.
- Personalized Poisson Process (Poisson) [16]: We assume that the inter-arrival time of customers follows the exponential distribution with a constant λ. To make it personalized, we control λ for each customer by considering his or her visit frequency and observation time.

Baselines Considering Covariates: The following two models consider the covariates derived from each visit. For ensuring fairness, we used the same set of handcrafted features for the latter baseline.

- Cox Proportional Hazard model (Cox-PH) [3]: It is a semi-parametric survival analysis model with the proportional hazards assumption.
- Gradient Boosting Tree with Handcrafted Features (XGBoost) [9]: It uses carefully designed handcrafted features with the XGBoost classifier [2].

Baselines Using Deep Survival Analysis: The last two models are state-of-the-art survival analysis models that applied deep learning.

- Neural Survival Recommender (NSR) [8]: It is a deep multi-task learning model with LSTM and a three-way factor unit used for music subscription data with sequential events. However, the disadvantage of this model is that the input for each cell is simple, and the input does not consider lower-level interactions.
- Deep Recurrent Survival Analysis (DRSA) [15]: It is an auto-regressive model with LSTM. Each cell emits a hazard rate for each timestamp. However, the drawback of this model is that each LSTM considers only a single event.

Comparison Results: Tables 2 and 3 summarize the performance of each model on partially observed customers ($V_{tep}$) and first-time visitors ($V_{tef}$), respectively. The prediction results on the partially observed set show that SurvRev outperforms the other baselines in terms of the c-index in most cases. In addition, regarding first-time visitors, SurvRev outperforms the other baselines in terms of the F-score. As a preliminary result, it is encouraging that our model was effective in two different settings. However, we might need to further tune our model parameters to achieve the best results for every evaluation metric.

Through ablation studies, we expect to observe the effectiveness of the components of both the low-level encoder and the high-level event-rate predictor. The variations in both low-level encoders (L1-L6) and high-level event-rate predictors (H1-H2) are as follows:

Ablation by simplifying the low-level module:

Ablation by simplifying the high-level module:

- H1 (FC + FC): Concatenate the outputs of the low-level encoder and, subsequently, apply an FC layer instead of LSTMs.
- H2 (LSTM + FC): Stack the outputs of the low-level encoder and, subsequently, apply two-level LSTM layers. This one is equivalent to our original high-level event-rate predictor described in Sect. 2.3.

Figure 3 depicts the results of the ablation study. The representative c-index results are evaluated on a partially-observed set of store D with a 240-day training interval. The results show that the subcomponents of both the low-level visit encoder and the high-level event-rate predictor are critical to designing the SurvRev architecture.

In this study, we proposed the SurvRev model for customer-revisit prediction. In summary, our SurvRev model successfully predicted customer revisit rates for the next time horizon by encoding each visit and managing the personalized history of each customer. Upon applying survival analysis with deep learning, we could easily analyze both first-time visitors and partially-observed customers with inconsistent arrival behaviors. In addition, SurvRev did not involve any parametric assumption. Through comparison with various event-prediction approaches, SurvRev proved effective by realizing several prediction objectives. For future work, we would like to extend SurvRev to other prediction tasks that suffer from partial observations and sessions with multilevel sequences.

[1] Neural machine translation by jointly learning to align and translate
[2] XGBoost: a scalable tree boosting system
[3] Regression models and life-tables
[4] Deep sparse rectifier neural networks. In: AISTATS
[5] Spectra of some self-exciting and mutually exciting point processes
[6] Long short-term memory
[7] Batch normalization: accelerating deep network training by reducing internal covariate shift
[8] Neural survival recommender
[9] Utilizing in-store sensors for revisit prediction
[10] A systemic framework of predicting customer revisit with in-store sensors
[11] Adam: a method for stochastic optimization
[12] Distributed representations of sentences and documents
[13] DeepHit: a deep learning approach to survival analysis with competing risks
[14] On ranking in survival analysis: bounds on the concordance index
[15] Deep recurrent survival analysis
[16] Stochastic Processes
[17] Bidirectional recurrent neural networks
[18] Machine learning for survival analysis: a survey