Machine Learning for Fraud Detection in E-Commerce: A Research Agenda

Niek Tax, Kees Jan de Vries, Mathijs de Jong, Nikoleta Dosoula, Bram van den Akker, Jon Smith, Olivier Thuong, Lucas Bernardi

July 5, 2021

Fraud detection and prevention play an important part in ensuring the sustained operation of any e-commerce business. Machine learning (ML) often plays an important role in these anti-fraud operations, but the organizational context in which these ML models operate cannot be ignored. In this paper, we take an organization-centric view on the topic of fraud detection by formulating an operational model of the anti-fraud departments in e-commerce organizations. We derive 6 research topics and 12 practical challenges for fraud detection from this operational model. We summarize the state of the literature for each research topic, discuss potential solutions to the practical challenges, and identify 22 open research challenges.

E-commerce is an important and rapidly growing sector that has tripled its share of the world GDP from 0.5% to more than 1.5% in the past decade [55]. This surge in economic importance is accompanied by a rapid increase in the total cost of global cybercrime, which rose from $445 billion in 2014 to more than $600 billion in 2017 [52]. Fraud and cybercrime in the e-commerce domain span a variety of fraud types, such as fake accounts [16], payment fraud, account takeovers [39], and fake reviews. Machine learning (ML) plays an important role in the detection, prevention, and mitigation of fraud in e-commerce organizations; publicly known examples include Microsoft [56], LinkedIn [83], and eBay [63]. In practice, fraud detection ML models in e-commerce organizations do not operate in isolation: they are embedded in a larger anti-fraud department that also employs fraud analysts or fraud investigators who perform case investigations and proactively search for fraud trends. This requires fraud detection models to be embedded in the way of working and daily operations of an anti-fraud department. While the existing literature on fraud detection is extensive, to the best of our knowledge there is currently no work that provides an explicit formulation of the daily operations of anti-fraud departments. This creates a gap between academic work on fraud detection and practical applications of fraud detection in industry. Furthermore, it makes it more difficult to assess whether novel fraud detection methods fit the practical way of working of anti-fraud departments, or whether they address practically relevant challenges.

In this paper, we describe the operational model of an anti-fraud department. We use this operational model to derive a set of practically relevant research topics for fraud detection. For each research topic, we summarize the state of the literature and put forth a set of open research challenges that are formulated from a practical angle. The main aim of this paper is to put forward a research agenda of open challenges in fraud detection.

This paper is structured as follows. In section 2 we introduce and discuss the operational model of fraud detection from an organizational point of view, discuss the role that machine learning plays in it, and derive research topics from it. In sections 3 to 8, we zoom in on each of the individual research topics that we introduce in section 2.
In each of those sections, we discuss one research topic, list practical considerations from industry experience, summarize the current state of the literature, and formulate open research challenges. We conclude this paper in section 9.

In this section, we introduce our operational model (Figure 1) of the daily operations of anti-fraud departments in e-commerce organizations. We highlight the role of machine learning in these daily operations and derive research topics and practical challenges.

E-Commerce Platform: An online service where users can buy and sell products (e.g., Amazon, Booking.com, or Zalando).

Users: Genuine users perform legitimate transactions (e.g., purchases or sales) on the e-commerce platform. Fraudulent users are wrongful or criminal actors who intend to achieve financial or personal gains through fraudulent activity on the e-commerce platform. Examples of such fraudulent activity include purchase attempts with stolen credit cards, abuse of marketing initiatives (e.g., incentive programs), registering fake accounts (e.g., merchant accounts or user accounts), phishing, or other attempts at account take-overs. Users interact with the e-commerce platform (1), which in turn generates data (2).

Data: Generated by the e-commerce platform as a result of user interactions. From the ML viewpoint, data can be transformed into features and labels. Features represent relevant behavior (e.g., browsing, purchasing, messaging, or managing accounts) or business entities (e.g., purchases, products, or users) of fraudulent and legitimate users. Labels indicate whether or not behavior or an entity is fraudulent. Labels often result from the investigations of fraud investigators (10). Sometimes, labels arrive through external escalations (12), e.g., through notifications of fraud from credit card issuers.

[Figure 1 relations: (1) Interacts; (2) Generates data; (3) Informs; (4) Manual action; (5) Defines rules; (6) Train/deploy model; (7) Informs; (8) Automated action; (9) Triggers investigation; (10) Provides labels; (11) Triggers investigation; (12) Generates labels.]

Fraud Investigator: Professionals who investigate suspected fraud cases, using the data (3). These suspected fraud cases might originate from internal escalations (11) (e.g., complaints through customer service), or the decision system may have triggered an investigation (9). For fraud that they find, they take remediating actions (4) (e.g., canceling orders or blocking users) and/or preventative steps by defining rules (5) that are aimed at identifying similar fraud in the future and that are used in the decision system.

Decision System: A system that can take concrete actions for instances. Instances arise from specific user requests, e.g., the purchase of a product or the registration of an account. Instances require a decision on what action to take, e.g., no intervention, a request for additional verification, or fully blocking the user's request. The decision system can take automatic action (8) or trigger an investigation by a fraud investigator (9) (who could then take manual action through (4)). Actions can be either synchronous (i.e., blocking the user request) or asynchronous (i.e., without blocking the user request). The decision system decides on its actions by combining ML models (7) and rules (5). In addition, some use cases require exploration, e.g., by occasionally triggering investigations on instances where there is high uncertainty about whether they are fraudulent.
Model: A machine learning model that aims to distinguish between fraudulent and genuine users. This model is trained (6) on the data.

We now discuss the research topics that arise from Figure 1. These research topics form the basis of the remainder of this paper, where we dedicate one section per research topic, list their concrete practical challenges, provide a summary of existing solution areas in the literature, and identify open research challenges. Table 1 summarizes all research topics, their connection to Figure 1, their practical challenges, and the solution areas in the literature. [Table 1: An overview of the research topics that we derived from Figure 1, the practical challenges, and the solution areas in the literature that relate to them.] These connections are either a set of edges or a path of edges in Figure 1. In the latter case, ⇒(X, Y) denotes a path consisting of edges X and Y. Below we introduce the research topics and highlight the practical challenges in bold.

Investigation support: Fraud investigations (triggered by (9) or (11)) are performed based on evidence from the data (3). Fraud investigators have limited time capacity, and to avoid alarm fatigue, (9) and (11) must yield high precision. Furthermore, investigators must be enabled and supported to reach decisions efficiently and accurately. There are several opportunities for machine learning to play a role in supporting these investigations and the evidence-gathering process that they entail. For example, some fraud cases are highly similar because they are part of the same attack (e.g., they might be performed from the same IP address). Ideally, these are grouped into a single investigation, both to minimize the number of investigations and to provide context to the fraud investigator during the investigation. We summarize and discuss this research topic in section 3.

Decision-making: Relations (5), (7), (8), and (9) show the role of the decision system, which is tasked with deciding which instances to take action on by combining the output of the ML model and the rules that were created by the fraud investigators. The decision system also decides how to take action: either automatically and immediately, or by sending the case to fraud investigators for further review. These decisions should be made with the aim of managing risk, i.e., the possible risk of negatively impacting genuine users should be traded off against the possible risk of failing to block fraud. The preventative or remediating actions (e.g., disabling accounts or stopping purchases) have great consequences for the user by design. Therefore, it is essential to limit false positives and to take fairness into account. In section 4, we discuss how the research areas of cost-sensitive learning, AI fairness, and uncertainty quantification offer partial solutions to these challenges and formulate the remaining open challenges.

Labels: The two sources of labels are fraud investigators (10) and automatic escalations (12). These mechanisms introduce selection bias through delay or incompleteness of labels. In addition, automated actions (8) that block suspected fraud mask labels from automatic escalations. We discuss in section 5 how learning under selection bias and multi-armed bandits offer solutions, and we formulate open challenges.

Concept drift: The cycles ⇒(1, 2, 6, 7, 8) and ⇒(1, 2, 6, 7, 9, 4) show how the actions of the decision system or the fraud investigator impact fraudulent users.
The fraudulent user may consequently adapt their behavior, i.e., adversarial drift. However, the behavior of genuine users can also change, i.e., natural drift. Moreover, several decisions may be taken at different stages in the life-cycle of a business entity, resulting in upstream models. We discuss methods to deal with the adaptivity that this requires from the ML model in section 6.

ML-investigator interaction: The cycle ⇒(6, 7, 9, 10) highlights the ability of fraud analysts to provide labels to aid ML models. One objective is to investigate the most suspicious instances (i.e., exploitation). A contrasting objective is to investigate those instances that are expected to be the most informative to the model (i.e., exploration). This creates an explore/exploit trade-off regarding which instances are presented to the investigator through (9). We discuss the aspects involved in the interaction between the ML model and the fraud investigator in section 7.

Model: Relation (6) concerns the training and deployment of the ML model. The fraud detection setting has particular requirements for model deployment and monitoring, which we discuss in section 8.

Relations (3), (9), and (11) in Figure 1 describe the investigations of potential fraud instances that were either found proactively, presented by the decision system, or escalated. Fraud investigations are often time-consuming and require a high degree of experience and expertise. Much of the time of a fraud investigation goes into gathering evidence and documenting the decision with relevant evidence. Because fraud investigations are time-consuming, the number of investigations that can be processed is limited. Therefore, any support from machine learning in aiding evidence gathering and decision-making is of great benefit. The aim is to make evidence gathering more efficient and more effective, respectively resulting in the ability to process more investigations and to increase the accuracy of the investigation outcome.

Explainable AI methods and visualizations thereof [3] can provide decision support to the fraud investigator when embedded in the user interface of investigation tools. Weerts et al. [84] found no strong evidence that SHAP model explanations increase the accuracy and efficiency of fraud investigators' decision-making. However, in many fraud detection systems, feature interactions are important to fraud detection accuracy. Therefore, one can hypothesize that rule-based explanations such as anchors [66], which in contrast to SHAP explain predictions in terms of rules over multiple features rather than in contribution scores of individual features, might be better suited for the fraud detection setting. More generally, more research is needed into how fraud investigators can best be supported through model explanations. In the limited research on the use of model explanations for decision support of the fraud investigator, some work (e.g., [8]) uses crowdsourced labelers (e.g., from Amazon Mechanical Turk) for the experiments. Fraud investigators in industry are highly trained; it is therefore questionable whether empirical results obtained with untrained crowdsourced participants transfer to a real-life setting with highly trained fraud professionals.
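To illustrate how such local explanations might be surfaced inside an investigation tool, the sketch below computes SHAP contributions for a single suspicious instance and ranks the features by their contribution. The dataset is synthetic, and the feature names, model choice, and presentation are illustrative assumptions rather than a description of any particular production system.

```python
# Minimal sketch: surfacing per-prediction SHAP contributions to a fraud
# investigator. Feature names and the model choice are hypothetical.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["order_amount", "account_age_days", "num_prior_chargebacks",
                 "ip_country_mismatch", "items_in_basket"]
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.97], random_state=0)  # ~3% "fraud"
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
case = X[:1]                                   # one instance under review
contributions = explainer.shap_values(case)[0]

# Present the top contributing features alongside the model score.
score = model.predict_proba(case)[0, 1]
ranked = sorted(zip(feature_names, contributions, case[0]),
                key=lambda t: abs(t[1]), reverse=True)
print(f"model score: {score:.3f}")
for name, contrib, value in ranked:
    print(f"{name:>22} = {value:8.3f}   contribution {contrib:+.3f}")
```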
In contrast to local interpretability methods that explain individual predictions, global interpretability methods provide insight into how a model as a whole makes its decisions, and they might be useful to increase the overall trust of fraud investigators and other stakeholders of the anti-fraud department in the ML model. To the best of our knowledge, empirical work on whether global interpretability methods increase model trust in fraud detection settings is lacking.

Multiple instance learning (MIL) allows ML models to classify whole groups of instances at once instead of single instances. In many fraud use cases, the fraudulent user targets multiple business entities through repeated actions on the e-commerce platform. For example, the same fraudster might perform multiple attempts to compromise accounts. In such cases, performing group-level investigations, i.e., investigating multiple instances related to that same actor, yields more appropriate evidence and leads to taking action on more instances per fixed unit of time. MIL is the ML counterpart of group-level investigations, where individual instances are grouped into bags, e.g., multiple energy states (instances) of a molecule (bag) in drug discovery, or multiple segments (instances) of an image (bag). Surveys of MIL can be found in [4, 13]. Multiple instance learning methods can aid the group-level investigations of fraud investigators by identifying groups of instances that could be fraudulent. The bags that are presented to fraud investigators must be relevant, i.e., they must contain a sufficient share of fraud. This is addressed by appropriately defining bag-level labels [24, 13]. There is often a trade-off between bag-level and instance-level model performance [13]. In manual investigations, the former might be more important, while the latter might be more important for automated decisions.

Network learning is closely related to grouped investigations. Graph-based visualizations can show the fraud investigator which instances are connected based on some identifier, e.g., an IP address or e-mail address. Like group-level investigations, the graphs provide the fraud investigator with visual information on which instances are connected to some identifier, and which might therefore possibly also be fraudulent. Network learning [46] and graph neural networks [31] are ML counterparts of graph-based investigations. Such models can aid the fraud investigator by identifying graph nodes or subgraphs where the fraud investigator is likely to find fraud.

Challenge 1: Model explanations for decision support. There is limited research on the effect of model explanations on the quality and efficiency of the fraud investigator's decisions. It is unclear whether these explanations can sometimes bias decisions. It is also unclear which types of model explanations would empirically be most helpful to the fraud investigator, and whether or not this depends on aspects like the application domain or the experience level of the fraud investigator.

Challenge 2: Multiple instance learning. As pointed out in [13], most of the current literature on MIL covers applications in biology and chemistry, computer vision, document classification, and web mining. To our knowledge, the area of fraud detection in e-commerce, especially the interaction with fraud investigators, has received little attention in the literature, although related topics like the detection of fraudulent financial statements have been discussed in [41], as well as HTTP network traffic in [59].
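As a minimal illustration of group-level scoring, the sketch below forms bags by grouping events that share an identifier (here an IP address) and scores each bag with the simplest MIL-style aggregation, the maximum over instance-level model scores. The column names, the choice of identifier, and the review threshold are hypothetical.

```python
# Minimal sketch of group-level (bag-level) scoring: events are grouped into
# bags by a shared identifier and each bag is scored as the maximum of the
# instance-level model scores. Column names and thresholds are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
events = pd.DataFrame({
    "event_id": range(12),
    "ip_address": ["10.0.0.%d" % (i % 4) for i in range(12)],
    "model_score": rng.uniform(0, 1, size=12),   # instance-level fraud scores
})

bags = (events.groupby("ip_address")
              .agg(bag_score=("model_score", "max"),
                   n_events=("event_id", "count"),
                   event_ids=("event_id", list))
              .sort_values("bag_score", ascending=False))

# Bags above a review threshold are queued as one group-level investigation.
REVIEW_THRESHOLD = 0.8
print(bags[bags["bag_score"] >= REVIEW_THRESHOLD])
```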
A particularly interesting challenge is how to compose the bags. In practice, this is often done using characteristics of the user actions (e.g., the IP address and date of the user action, cf. [85]). We are not aware of MIL literature that addresses bag construction, especially in the context of fraud detection in e-commerce.

Relations (8) and (9) in Figure 1 show that the decision system can take automated action or trigger an investigation for the instances that it suspects of fraud. The consequences of wrong decisions can be severe. False positives lead, respectively, to adding friction for or blocking a genuine user (in the case of automated action) or to wasting the time of fraud investigators (in the case of a triggered investigation). False negatives result in allowing fraudulent behavior. Another, less severe, wrong decision is to trigger an investigation for a fraud case instead of automatically blocking it. It is the decision system's task to manage risk by appropriately trading off the risks of these different types of wrong decisions, combining ML and heuristic rules that are developed by fraud investigators. Relations (5) and (7) in Figure 1 highlight the two types of information sources that the decision system has available to make its decisions: ML models and rules. Models and rules aim to complement each other, and it is the task of the decision system to aggregate them into a single action when models and rules recommend different actions for the same instance.

Probability calibration aims to transform the output of a classifier in such a way that the predicted model score approximately matches the probability of an instance belonging to the positive class. Calibrated model scores play an important role in managing risk in decision-making, as they are a prerequisite for some cost-sensitive learning techniques (see below). They also enable the calculation of expected values of key performance indicators of the business. Methods for probability calibration include Platt scaling [61], beta calibration [42], and isotonic regression [87]. Such methods require a calibration set of data that is held out from the training data. The adversarial concept drift in fraud settings (see section 6) and the sometimes rapidly changing prevalence (i.e., fraud rate) make it difficult to obtain model scores that are close to probabilities on the latest production data and not just on the calibration set.

Cost-sensitive learning takes misclassification costs into account and thereby enables making decisions that minimize the expected cost of fraud to the business operations, rather than simply minimizing the number of classification errors [47]. This addresses the challenge of managing risk in decision-making. The application of cost-sensitive learning requires estimating the cost (or benefit) to the business of each of the four cells of the confusion matrix: false positive, false negative, true positive, and true negative. The cost of a false positive could, for example, be the missed income from a blocked transaction, while the cost of a false negative could be the financial cost of that fraud instance. Practical difficulties often arise because some aspects of these costs can be difficult to quantify or measure, such as the reputational damage to the business in the case of misclassification. More research is needed on guidelines and frameworks for designing cost functions for cost-sensitive learning when some aspects of the costs are not easily quantified financially.
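The sketch below illustrates the two ingredients discussed above under illustrative assumptions: a classifier is calibrated with isotonic regression on a held-out calibration set, and a block/approve threshold is then derived from hypothetical misclassification costs. The cost values are made up for the example and are not guidance on how to cost fraud decisions.

```python
# Minimal sketch: calibrate a fraud classifier on a held-out calibration set
# (isotonic regression), then derive a decision threshold from hypothetical
# misclassification costs.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.98], random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25,
                                                  stratify=y, random_state=0)

base = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv="prefit")
calibrated.fit(X_cal, y_cal)

# Hypothetical per-instance costs of each confusion-matrix cell.
COST_FP = 30.0    # blocking or adding friction for a genuine user
COST_FN = 200.0   # letting a fraudulent transaction through
COST_TP = 0.0
COST_TN = 0.0

# With a calibrated probability p, blocking is the cheaper action whenever
# p * COST_TP + (1 - p) * COST_FP  <  p * COST_FN + (1 - p) * COST_TN,
# which for these cost values reduces to a simple threshold on p:
threshold = COST_FP / (COST_FP + COST_FN)

p = calibrated.predict_proba(X_cal)[:, 1]
decisions = p >= threshold        # True -> block / escalate, False -> approve
print(f"decision threshold: {threshold:.3f}, blocked share: {decisions.mean():.4f}")
```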
Once cost functions are in place, the expected costs can be calculated trivially if the classifier returns calibrated scores; alternatively, a method like empirical thresholding [77] can be used to minimize the expected costs under uncalibrated model scores. The label might not be available in cases where a fraud attempt was blocked by an automated action. In that case, one might make use of a control group or some other source of unbiased data (see section 5) to optimize the threshold with regard to the cost function.

AI fairness is an important topic in fraud detection because misclassification can have severe consequences: a false positive often harms genuine users, e.g., purchases being canceled or accounts being disabled. Since any decision taken in fraud detection potentially directly affects a real person, attention should be paid to ensuring fairness and mitigating fairness issues where they may exist. The body of literature covering fairness in ML is extensive [53], including considerations for its practical application in production systems [9, 34]. While little fraud-specific research has been done on ML fairness, the problem can be cast more generally as a supervised classification setting where positive predictions signify actions taken against individuals. Similar aspects can be found in mortgage default prediction [32] or recidivism prediction [17]. One particular challenge for e-commerce organizations looking to achieve ML fairness is low observability of certain protected attributes: dimensions such as race or sex are often not explicitly collected on e-commerce platforms, rendering some attribute-dependent ML fairness methods [82] inapplicable. The presence of an adversary (the fraudster) in the fraud context presents a unique challenge to the implementation of fairness controls. Protected attributes can be spoofed by dishonest actors, for example, by using a VPN to pretend to be in a different country. If the system uses a fairness control that applies corrections conditioned on these protected attributes, the fraudster may be able to tweak their attributes to maximize their success. Data poisoning attacks that target fairness controls have recently been developed [78, 54]. Anomaly detection is also important to fraud detection for flagging novel and potentially malicious behaviors, but it has its own set of fairness pitfalls. Recent work [21, 76] aims to quantify and mitigate these issues, but overall, fairness in anomaly detection systems is still a novel area of research.

Uncertainty quantification has clear applications in the settings of active learning (see section 7) and classification with a reject option [18, 33], where the ML model has the option to refuse or delay making a decision when the uncertainty around the model's prediction is too high. Hüllermeier and Waegeman [36] provide a detailed survey of methods for uncertainty quantification. Classification with a reject option has applications in fraud detection for so-called trust systems, which are tasked with assigning a permanent trust status to subsets of genuine users who are important customers and clearly not fraudulent, and on whom taking any action should at all times be avoided. Accidentally trusting fraudulent users could do a lot of damage, and therefore the model can reject making a prediction when there is too much uncertainty regarding the instance.
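A minimal sketch of such a trust decision with a reject option is shown below: a user is marked as trusted only when the predicted fraud score is very low and the per-tree scores of a random forest agree closely, a crude proxy for how certain the prediction is; in all other cases the decision is withheld. The thresholds and the uncertainty proxy are illustrative assumptions, not a recommended design.

```python
# Minimal sketch of a reject option for a "trust" decision: a user is trusted
# only if the mean fraud score is very low AND the per-tree scores of a random
# forest agree closely (a crude proxy for predictive uncertainty); otherwise
# the decision is withheld. Thresholds are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, weights=[0.97], random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

candidates = X[:500]
# Per-tree probability of the fraud class for each candidate user.
per_tree = np.stack([tree.predict_proba(candidates)[:, 1]
                     for tree in forest.estimators_], axis=1)
mean_score = per_tree.mean(axis=1)
spread = per_tree.std(axis=1)

TRUST_SCORE_MAX = 0.02    # mean fraud score must be very low
TRUST_SPREAD_MAX = 0.05   # trees must largely agree

trusted = (mean_score <= TRUST_SCORE_MAX) & (spread <= TRUST_SPREAD_MAX)
print(f"trusted: {trusted.sum()} users, decision withheld: {(~trusted).sum()} users")
```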
An important concept in both classification with a reject option and active learning is epistemic uncertainty, i.e., the degree of uncertainty due to a lack of data in the part of the feature space for which the prediction needs to be made (reducible uncertainty). This contrasts with aleatoric uncertainty, i.e., the degree of uncertainty due to an overlap in class distributions in the part of the input space for which the prediction needs to be made (irreducible uncertainty). Methods that explicitly quantify the epistemic part of the uncertainty and can separate it from the aleatoric part are summarized in [36] and include density estimation, anomaly detection, Bayesian models, and the framework of reliable classification [73]. The argument is that rejecting or delaying a decision is only reasonable if the uncertainty is expected to decrease, which is not the case with aleatoric (irreducible) uncertainty. Likewise, in the active learning setting, spending the fraud investigator's time on investigating an instance only makes sense if there is reducible uncertainty regarding that instance. Marking a user as trustworthy seems safe when a model predicts a user to be of the non-fraud class with low epistemic uncertainty regarding that prediction. However, quantification of epistemic uncertainty is a rather novel research direction in the field of machine learning, and more research is needed. More specifically, the application of epistemic uncertainty quantification in adversarial problem domains is not well understood.

Rule-based systems are created by fraud investigators and are designed to supplement ML models in the detection of fraud. Because of concept drift, rules can become ineffective shortly after they have been added to the decision system. After fraud investigators have first observed a new type of fraud attack that is not recognized by the ML model, a rule can be used for the period until the ML model picks up on that attack. In some sense, decision-making based on model output and multiple rule outputs can be seen as analogous to ensemble learning [69], which concerns decision-making based on multiple ML models. A specific challenge is that the set of rules is subject to change: new rules get developed and old ones get decommissioned. This creates a need for ensemble models that combine a non-stationary set of components.

Challenge 7: Epistemic uncertainty quantification for trust systems. Predictions of the non-fraud class that are made with low epistemic uncertainty could be applied to identify trustworthy users who should never be marked as fraudulent. However, more research is needed into the application of such techniques in adversarial problem settings.

Challenge 8: Ensemble learning for non-stationary sets of components. The output of the ML model and the rules ultimately need to be combined into a single decision. Ensemble learning methods address this task but do not currently handle dynamic sets of ensemble components.

Labels are obtained through fraud investigations (10) and automatic escalations (12). From the machine learning perspective, we would like to train models using labeled instances that are uniformly sampled from the population. In practice, there are several sources of selection bias. First, delay in labeling arises for both label sources: manual investigations can take minutes up to days, whereas it can take days up to weeks for notifications of fraud to arrive through escalations [19].
Second, manual investigations may overlook fraudulent instances (e.g., due to resource constraints or well-hidden fraud). Third, automated actions (8) block suspicious transactions, and as a result, there will be merely a suspicion (no documented evidence) that these transactions are fraudulent. Because there is no certainty that these blocked transactions are fraudulent, they cannot be considered labeled instances. In many e-commerce applications, similar issues are commonly addressed with a control group, i.e., by always approving a certain percentage of transactions. While this control group would be an unbiased sample of labeled data, collecting it would come at the high cost of needing to purposefully let a share of fraud go through without blocking it.

Learning under selection bias has been studied, for example, in [86] and [80], both relying on ideas from causal inference such as inverse propensity weighting. Another example worth mentioning is [35], where matching weights are computed directly. In [37], the authors study the problem in the low-prevalence regime, which fits the fraud detection problem particularly well, although they don't construct an unbiased model. Rather, they propose to utilize unlabeled data to construct a case-ranking model, which might or might not be appropriate depending on the specific problem at hand. The domain adaptation field also tackles this problem, defined as learning a model with data sampled from a source domain to be applied in a target domain. This field distinguishes several settings, two of which are of particular interest to fraud detection: unsupervised domain adaptation (e.g., [26, 79]), which assumes no labeled examples from the target domain, matching the setting without a control group; and semi-supervised domain adaptation (e.g., [20]), which adds some labeled examples from the target domain, matching the with-control-group setting.

Multi-armed bandits (MAB) [43] is an area of research that studies the trade-off between exploration and exploitation. A control group is a simple form of exploration. In the context of automated actions, approving a transaction allows us to observe both the consequences of approving and the consequences that would have been observed if the transaction had been blocked. This is known as partial feedback and differs from the bandit feedback setting, where feedback is only observed for the action that was taken. No feedback is observed for blocked transactions. To handle this setting, one can create more sophisticated exploration policies that don't necessarily approve cases uniformly at random but explore according to some optimization criteria. The principle of optimism in the face of uncertainty [7, 45, 50] is of particular interest for the fraud detection problem. The core idea is to prioritize exploration (acceptance) of transactions where the expected cost is lower or the model has higher uncertainty. Typically, the expected cost is estimated through standard supervised learning techniques and the uncertainty is modeled with the variance of the mean cost estimate. Other approaches such as Thompson sampling [81, 2], or more generally posterior sampling [68], can balance exploration and exploitation in fraud detection.

Challenge 9: Bias-variance trade-off. Removing the bias from the data almost always involves an increase in the variance of the predictions. This variance might lead to models with poor generalization error, defeating the purpose of bias reduction.
Most bias reduction techniques focus on completely removing the bias, and although there exists work on variance reduction, it is always under the no-bias constraint. Creating principled mechanisms to tune this trade-off, potentially allowing positive bias while improving generalization error, is still an open challenge and an active area of research, mainly in the domain adaptation field. A related and harder challenge is the fact that in practice, at training time, there is no data available from the target domain. This can be considered an adversarial version of the unsupervised domain adaptation problem, where the goal is to learn a model from labeled source-domain data that generalizes sufficiently well to a large set of potential target domains.

Challenge 10: Pseudo-MAB setting. Only the approval action reveals full feedback, whereas rejection reveals no feedback. This setting does not exactly match the MAB setting, which opens questions about the optimality of standard MAB policies. An alternative formulation is simply to select a subset of transactions for rejection (or acceptance) so as to minimize some carefully crafted loss function that combines the monetary costs with the value of the gathered information. This can be addressed from the perspective of set-function optimization and online active learning [71]. However, the MAB formulation addresses other relevant challenges such as delayed feedback and non-stationarity, which have been studied to a large extent in the MAB literature (e.g., [38, 30]).

In Figure 1, the cycles ⇒(1, 2, 6, 7, 8) and ⇒(1, 2, 6, 7, 9, 4) highlight how fraud detection is an adversarial problem domain. When the decision system is successful in blocking the fraud attempts of a fraudulent user (i.e., (4) or (8)), the fraudster is likely to try to circumvent the system by modifying their attack until it is successful. Due to this behavior, fraud detection systems experience concept drift nearly constantly. Besides adversarial drift from changing fraud attacks, the data distributions that are generated by genuine users can also be subject to concept drift. Examples include seasonal patterns, unexpected events (e.g., the COVID-19 pandemic), or changes in the e-commerce platform. However, the drift of genuine users is often largely independent of (4) and (8) and is thus not adversarial. The third source of concept drift in fraud detection systems is related to updates to so-called upstream models. For example, imagine a webshop that requires users to log in before they can make a purchase. An update to a login-time fraud detection model shifts the distributions of the data that reach a payment-time fraud detection model that occurs later in the sales funnel, because the population of fraudulent users that are already caught by the login-time model will likely change with the update.

Concept drift adaptation [25, 49], or dataset shift [64], is a well-studied research topic. Drift can be categorized by its distributional type: covariate shift concerns a shift in P(X), prior shift a shift in P(y), and real concept drift a shift in P(y | X). Orthogonally, drift can be categorized by its temporal type: it can be sudden, gradual, incremental, or recurring. Finally, drift can be adversarial or natural: adversarial drift is specifically aimed at beating a detection system, while natural drift happens for reasons that are exogenous to it. Fraud detection has adversarial drift in the fraudulent class.
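As a small illustration of this taxonomy, the sketch below monitors a recent window against a reference window for the two drift types that can be checked without waiting for per-instance labels: covariate shift in P(X), via per-feature two-sample Kolmogorov-Smirnov tests, and prior shift in P(y), via the observed fraud rate among labeled cases. Detecting real concept drift in P(y | X) generally requires labels. The data and thresholds are synthetic and illustrative.

```python
# Minimal sketch: compare a reference window against a recent window for
# covariate shift (P(X), per-feature KS tests) and prior shift (P(y), fraud
# rate among labeled cases). Thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n_features = 5
X_ref = rng.normal(size=(5000, n_features))
y_ref = rng.binomial(1, 0.02, size=5000)

X_new = rng.normal(loc=[0, 0, 0.4, 0, 0], size=(5000, n_features))  # feature 2 drifted
y_new = rng.binomial(1, 0.05, size=5000)                            # fraud rate rose

ALPHA = 0.01
for j in range(n_features):
    stat, p_value = ks_2samp(X_ref[:, j], X_new[:, j])
    if p_value < ALPHA:
        print(f"covariate shift suspected in feature {j} (KS={stat:.3f})")

rate_ref, rate_new = y_ref.mean(), y_new.mean()
if abs(rate_new - rate_ref) > 0.01:
    print(f"prior shift suspected: fraud rate {rate_ref:.3f} -> {rate_new:.3f}")
```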
Changes to the attack patterns of fraudsters often result in gradual and incremental drift, as fraudsters tend to gradually increase the frequency of their successful and undetected attacks while decreasing the frequency of their detected and unsuccessful attacks. Fraudsters can also cause recurring drift, as they occasionally retry old attempts to check whether the fraud detection systems still catch them. Drift due to changes in the e-commerce platform is often sudden, because changes (e.g., in an account registration portal or a payment process) take effect at once at the time of a new code deployment. However, in practice, many changes in the e-commerce platform are first evaluated in an A/B test, and thus the drift that results from such a change might at first affect only a fraction of the users. Furthermore, the drift that results from changes in the e-commerce platform is natural drift, contrasting with the adversarial drift that originates from fraudsters' attempts to remain undetected. Much of the adversarial concept drift detection and adaptation literature ignores that such tasks often need to be performed in the presence of sudden and natural drift that originates from changes to the platform itself. While many concept drift detection techniques exist [25], there is a practical need for methods that can distinguish the fraudsters' gradual adversarial drift from the sudden and natural drift that is caused by platform changes.

Delayed labels make the task of concept drift adaptation much more challenging. Until the labels are known, concept drift is only detectable when a change in P(y | X) is accompanied by a change in P(X) [88]. Likewise, adaptation to a change in P(y | X) is not possible without a change in P(X). Several methods exist to address the problem of concept drift adaptation under delayed labels, including positive-unlabeled (PU) learning [40, 22], or explicit modeling of the expected label delay of individual instances through survival modeling. Dal Pozzolo et al. [19] proposed a solution specific to the fraud detection case in which they train two separate models: the first model is trained on the labels found by fraud investigators, while the second is trained on labels obtained through the often much more delayed label source of escalations. In practice, some fraud detection use cases deal with label delay that is theoretically upper-bounded, such as credit card chargebacks, which have a deadline set by the credit card issuers. To the best of our knowledge, concept drift detection under upper-bounded label delay has not yet been studied.

Supervised methods for fraud detection often outperform purely unsupervised anomaly detection in industry applications [29]. Supervised models, in particular, outperform anomaly detection models in detecting fraud instances that are continuations of fraud attacks that were ongoing at the time the model was trained. The field of evolving data stream classification has developed several methods to incrementally update ML models in a streaming setting to adapt to distributional changes. State-of-the-art methods include adaptive random forests [27] and streaming random patches [28]. The field is heavily focused on updating ML models instead of retraining them from scratch, which is motivated by computational efficiency. In practice, however, e-commerce organizations do have the computational resources required to retrain models daily.
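The plain alternative mentioned above can be sketched as follows: a model is refit from scratch on a sliding window of recent labeled events on a fixed cadence. The window length, cadence, feature columns, and model choice are hypothetical.

```python
# Minimal sketch: periodic retraining from scratch on a sliding window of
# recent labeled events, as an alternative to incremental stream learners.
# Window length, cadence, and column names are hypothetical.
from datetime import timedelta
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

WINDOW = timedelta(days=90)   # train only on the most recent 90 days

def retrain(labeled_events, today):
    """Fit a fresh model on events labeled within the sliding window."""
    recent = labeled_events[labeled_events["event_time"] >= today - WINDOW]
    X = recent[["f1", "f2", "f3"]].to_numpy()
    y = recent["is_fraud"].to_numpy()
    return GradientBoostingClassifier(random_state=0).fit(X, y)

# Toy stream of hourly labeled events.
rng = np.random.default_rng(0)
events = pd.DataFrame({
    "event_time": pd.date_range("2021-01-01", periods=10000, freq="h"),
    "f1": rng.normal(size=10000),
    "f2": rng.normal(size=10000),
    "f3": rng.normal(size=10000),
    "is_fraud": rng.binomial(1, 0.03, size=10000),
})

# In production this would run on a daily schedule; here we retrain once.
model = retrain(events, today=events["event_time"].max())
print(model)
```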
Finally, there is currently limited insight into how fraudsters respond, adapt their attacks, and cause drift. In practice, fraudsters don't have direct control over the feature vectors that their attacks produce; they instead control them only indirectly through their interactions with the e-commerce platform. This constrains how fraudsters can change the distributions of the feature values that they generate. The field currently lacks methods to identify potential fraud attacks that the e-commerce platform theoretically allows for but that have not yet been observed. One possible direction is the use of attack trees [51], a common method in the cybersecurity field for mapping out possible attack angles for hackers.

Adversarial robustness in ML [11] is a research area that focuses on building ML models that make it hard for attackers to create adversarial examples, i.e., data points that the model predicts wrongly. Adversarial robustness closely links to concept drift: a fraud detection system that is adversarially robust makes it more difficult for fraudsters to generate new types of fraud that remain undetected. Current work on adversarial robustness is heavily focused on computer vision and natural language processing tasks, while the majority of fraud detection systems use tabular data. Adversarial robustness methods for tabular data are an open research challenge with applicability in fraud detection.

Anomaly detection is a class of methods that separate normal data points from outlier data points. The task of anomaly detection is strongly linked to density estimation and can be seen as its inverse. Novelty detection [60] concerns the detection of novel behavior that emerges after drift and is therefore of particular relevance to fraud detection. Novelty detection typically uses anomaly detection: what is normal w.r.t. pre-drift data is likely to be an outlier w.r.t. post-drift data. Several empirical benchmark studies [1, 23] have compared anomaly detection methods, often identifying that isolation forest [48] performs consistently well. New attacks by fraudsters generate feature values that are distinct from their previous attacks, causing the new attacks to be marked as outliers. However, a drift in the behavior of genuine users (e.g., due to changes in the e-commerce platform or exceptional events like COVID-19) is also likely to generate feature values that are distinct from earlier behavior. Therefore, not every outlier can be assumed to be fraudulent, and not every distribution shift is caused by a change in fraud attacks. Marking all outliers as possible fraud cases that require investigation by the fraud investigator introduces spikes of false positives. In practice, such a spike would, for example, be expected with every release of a new change in the e-commerce platform. This calls for investigation into methods that account for the existence of "harmless" outliers caused by external changes.

Challenge 11: Separating platform changes from changes in attack patterns. Changes in the e-commerce platform and changes in fraudster behavior both drive concept drift. In the former case, drift tends to be natural and sudden, while in the latter case it tends to be adversarial and gradual. Concept drift detection methods that alert in the latter case but not in the former would be of practical value for fraud detection.

Challenge 12: Accounting for platform changes in novelty detection.
Changes in the e-commerce platform can cause large spikes in the number of outliers that are flagged by novelty detection algorithms, thereby limiting their practical use. This creates a need for methods that aim to detect outliers that are novel fraud types, but not outliers that result from platform changes.

Challenge 13: Mapping attack angles. There is a need for methods and frameworks to map possible attack angles in an e-commerce platform, and for decision-making frameworks that leverage these in work prioritization.

Challenge 14: Methods for adversarial robustness for tabular data. Many fraud detection systems work on tabular data, which is an understudied data modality in the research field of adversarial robustness.

Challenge 15: Balancing anomaly detection and supervised methods. While supervised methods are more accurate in detecting recurring fraud types, anomaly detection methods can detect new attacks. Can a strategy for combining both types of models be automatically inferred?

Challenge 16: Concept drift adaptation in the delayed label setting. Supervised methods for concept drift adaptation often assume that labels are immediately available. How do we adapt to concept drift if labels may be delayed? In some fraud problems, there is a theoretical upper bound on the label delay. Can this upper bound be used in concept drift adaptation?

Fraud investigations are a vital part of the operational model: they stop fraudulent behavior through manual action (4) and, as a result, generate new labels for the ML model (10). Two objectives are involved here: the goal of identifying fraud and the goal of generating labels that are most useful to the model. These two objectives can sometimes compete. Below we describe various machine learning (ML) methods to trigger investigations (9) in ways that address and balance these objectives, and we discuss open research questions.

Active learning (AL) is a paradigm that naturally fits cycle ⇒(6, 7, 9, 10) in Figure 1. The AL paradigm decides which unlabeled data points to prioritize for labeling depending on how much they are expected to improve the ML model. Fraud investigators label these data points, after which the model can be retrained and a new iteration of data point prioritization is started. AL can help to generate labels rapidly when escalations (i.e., (11) and (12)) are slow. This helps to mitigate label delay (see section 5) and therefore improves the model's ability to adapt to concept drift (see section 6). Furthermore, selecting the instances that would maximize learning is an efficient use of the fraud investigator's time. AL has not been widely applied in an industry context so far despite extensive academic research [6, 75]. This might be due to uncertainty about which AL technique to use, how to deal with extreme class imbalance, the possibility of viable alternatives, the engineering overhead, and uncertainty about the validity of the assumptions made in AL. The problem of class imbalance is particularly relevant for many fraud detection use cases. Carcillo et al. [14] investigated AL methods under the high class-imbalance setting of fraud detection and showed on credit card fraud data that simply selecting the instances with the highest probability of being fraudulent maximizes learning and obtains high precision. The popular uncertainty sampling method [74] explores data points that lie close to the model's decision boundary.
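A minimal sketch of one such investigation-triggering iteration is shown below: the current model scores the unlabeled pool, and a batch is selected either by highest fraud score (the exploitation-style selection that [14] found effective) or by proximity to the decision boundary (uncertainty sampling). The batch size, model choice, and data are illustrative.

```python
# Minimal sketch of one active-learning iteration: score the unlabeled pool,
# select a batch either by highest fraud score (exploitation) or by proximity
# to the decision boundary (uncertainty sampling), and queue it for manual
# investigation. Batch size and strategy are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=2)
labeled = np.zeros(len(X), dtype=bool)
labeled[:500] = True                       # small initial labeled set

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
pool_idx = np.flatnonzero(~labeled)
scores = model.predict_proba(X[pool_idx])[:, 1]

BATCH = 20
STRATEGY = "uncertainty"                   # or "exploit"

if STRATEGY == "exploit":
    # Most suspicious first: maximizes fraud found per investigation.
    chosen = pool_idx[np.argsort(-scores)[:BATCH]]
else:
    # Closest to the decision boundary: aims to be most informative to the model.
    chosen = pool_idx[np.argsort(np.abs(scores - 0.5))[:BATCH]]

# "chosen" would be sent to fraud investigators; their labels are added to the
# labeled set and the model is retrained in the next iteration.
print(f"queued {len(chosen)} instances for investigation")
```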
More recent work on active learning [57] aims to distinguish epistemic uncertainty from aleatoric uncertainty (see section 4). The rationale is that the fraud investigator's time is wasted if they investigate data points in parts of the feature space where the uncertainty cannot be reduced (i.e., where the uncertainty is aleatoric). Inter-rater disagreement between multiple fraud investigators about the same instance can cause aleatoric uncertainty. Traditional AL methods that do not distinguish between the two uncertainty types tend to repeatedly select instances with high aleatoric uncertainty [57], thereby wasting the fraud investigator's time. New fraud patterns are likely to come from regions of the feature space with high epistemic uncertainty. Therefore, fraud detection AL systems would benefit from focusing on sampling instances based on epistemic uncertainty. This research area is novel and lacks real-life evaluations in fraud settings.

Guided learning [5] contrasts with AL by asking fraud investigators to search for fraudulent examples themselves (i.e., ⇒(3, 10)), instead of asking them to provide labels for specific instances that were selected by the AL model. Guided learning is possible when investigators have sufficient domain knowledge to find positive examples themselves, which is typically the case in the fraud domain. Guided learning can be particularly useful when prevalence is very low and in situations with disjunct classes, such as when there are different fraud modus operandi. A disadvantage is that the cost per label for guided learning is most likely higher than for AL. A further disadvantage is that relying on investigator searches induces selection bias that is unique to each investigator, the impact of which has, to the best of our knowledge, not been studied. Practically, guided learning can be supported by empowering investigators to generate queries based on the ML model's input features [72]. This approach allows investigators to directly investigate specific areas of the feature space. While it is not a well-studied methodology, it presents a potential area of research. Guided learning and AL can complement each other: fraud investigators can both proactively search for fraud and label suggestions from an AL model. Some success has been obtained with hybrid variants that start with guided learning and evolve to AL once some initial data set has been gathered [5], or that supplement AL with additional searched labels [10]. Guided learning is particularly successful compared to AL in settings with low prevalence [5]. However, the exact success factors in applications of searching and labeling are not well understood beyond the dependence on prevalence.

Weak supervision techniques such as Snorkel [65] solve the problem of inferring labels for instances using so-called labeling functions that fraud investigators create. These labeling functions are expected to be imperfect (i.e., weak) and can be seen as analogous to the rules in relation (5). The main idea of weak supervision is to infer reliable labels from a collection of weak labels using a generative model; these inferred labels can then be used as ground truth to train the fraud detection model. An important aspect of weak supervision tools is a user interface that allows a fraud investigator and an ML practitioner to collaborate and develop new labeling functions that assign labels to instances that are not yet labeled by existing labeling functions.
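The sketch below illustrates the labeling-function idea in plain Python: each function encodes an investigator heuristic and may abstain, and the weak votes are combined here by a simple majority vote, whereas a tool such as Snorkel [65] would instead fit a generative label model over the labeling-function outputs. The rules and column names are hypothetical.

```python
# Minimal sketch of weak supervision with labeling functions: each function
# encodes an investigator heuristic and may abstain (-1). Weak votes are
# combined by a simple majority vote here; Snorkel [65] would fit a generative
# label model instead. Rules and column names are hypothetical.
import numpy as np
import pandas as pd

ABSTAIN, GENUINE, FRAUD = -1, 0, 1

def lf_many_failed_payments(row):
    return FRAUD if row["failed_payment_attempts"] >= 3 else ABSTAIN

def lf_old_account_small_order(row):
    return GENUINE if row["account_age_days"] > 365 and row["order_amount"] < 50 else ABSTAIN

def lf_disposable_email(row):
    return FRAUD if row["email_domain"] in {"mailinator.com", "tempmail.io"} else ABSTAIN

labeling_functions = [lf_many_failed_payments, lf_old_account_small_order,
                      lf_disposable_email]

events = pd.DataFrame({
    "failed_payment_attempts": [0, 4, 1, 5],
    "account_age_days": [900, 3, 500, 10],
    "order_amount": [20, 300, 40, 800],
    "email_domain": ["gmail.com", "tempmail.io", "gmail.com", "mailinator.com"],
})

votes = np.array([[lf(row) for lf in labeling_functions]
                  for _, row in events.iterrows()])

def aggregate(row_votes):
    valid = row_votes[row_votes != ABSTAIN]
    if len(valid) == 0:
        return ABSTAIN                      # no rule fired: leave unlabeled
    return FRAUD if (valid == FRAUD).sum() >= (valid == GENUINE).sum() else GENUINE

weak_labels = np.array([aggregate(v) for v in votes])
print(weak_labels)    # [0 1 0 1] for this toy data: rows 1 and 3 get weak fraud labels
```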
To apply weak supervision successfully, fraud investigators must be able to quickly find patterns in currently unlabeled data points and develop new rules. This procedure is represented by the cycle ⇒(3, 10) and by (5). In the existing literature, like [65], the focus of this user interface is on textual data, where it is, for example, easy for a fraud investigator to instantly spot whether a tweet is spam or not. In practice, many fraud detection problems concern tabular data, and much more expert knowledge and deeper investigation are needed for the fraud investigator to conclude whether a certain instance is fraudulent. This requires further research into user interfaces that support the iterative process of developing labeling functions for weak supervision in the context of tabular data.

Challenge 17: Exploration/exploitation trade-off in active learning. While in typical use cases of active learning the goal is to label the data points that are most helpful for improving the model, in the fraud detection use case it is important to trade off this goal against finding more fraud cases. The trade-off between these two goals is currently an open research challenge.

Challenge 18: Epistemic vs. aleatoric uncertainty sampling. How can we leverage active learning methods while avoiding wasting our investigative resources on parts of the feature space with high aleatoric uncertainty? Epistemic uncertainty sampling is a promising direction of research, but applications of epistemic uncertainty estimates for active learning in a practical fraud prevention setting with adversarial drift are lacking.

Challenge 19: Label vs. search. The relative value of AL-type labeling and guided-learning-type searching depends on the cost of the two types of investigations and on the fraud prevalence. In many situations, searching is most likely more expensive than labeling, but the exact conditions that influence the costs of both approaches are not clear.

Challenge 20: Weak supervision tools for fraud detection. The applicability of weak supervision methods is highly dependent on the ability to quickly and accurately assess the class of observed data points. This is often difficult because fraud investigations can be complex and time-consuming. Therefore, better decision support tools for fraud investigators are needed, not only to assist the fraud investigations themselves but also as a requirement for applying weak supervision methods to fraud detection.

Training, deployment, and monitoring of ML models (6) in a production environment come with a variety of challenges, some of which are specific to the setting of fraud detection. Fraud detection models are often integrated into vital parts of the e-commerce platform, such as the payment portal or the account registration portal. The financial consequences for the business are large when such systems malfunction. For example, business revenue would almost come to a complete halt if an outage caused the platform to be unable to process payments or login requests. Note that this situation is distinct from, for example, a recommender system, where an outage would be undesirable but of smaller consequence. Therefore, it is important to take risk mitigation measures to ensure that model deployment is safe. Additionally, fraud models are retrained and deployed particularly frequently compared to ML models in other parts of the business, because the adversarial drift creates a need to do so (see section 6).
This creates a need for deployment safety measures that are efficient and automated.

Model verification methods make it possible to validate that the ML model satisfies certain desired properties. CheckList [67] is a verification method inspired by metamorphic testing in software engineering; it requires the ML practitioner to formulate a set of unit-test-like checks that the model needs to pass. While the main use case of CheckList is during offline evaluation, these unit tests can additionally be used as a sanity check to verify that a model that has been deployed to the model serving platform still passes them. For natural language data, CheckList provides a mechanism to automatically generate test cases at scale. For other forms of data, formulating test cases is currently still a manual process. Automated test case generation for data formats other than natural language is still an open challenge.

Deployment best practices have recently and increasingly become a subject of study, and they provide guidance on how to manage and mitigate the risks that are involved in model deployment. Early work includes the ML test score [12], which provides a list of checks to be performed during model deployment that can catch common problems and mistakes. More recent work includes [58, 44]. A commonly recommended practice is canary testing, i.e., exposing a newly deployed model first to a small group of users, where it is closely monitored, before exposing all users to the model. Alternatively, in a shadow mode deployment, the new model starts making predictions for every instance without being used by the decision system. Another common recommendation is to create the infrastructure that allows for quick and safe rollbacks to an earlier model version, which enables quick recovery in case of unforeseen problems.

Automated data validation methods focus on monitoring and validating the feature values that are used in the ML model. Fraud detection models often consume data generated by parts of the e-commerce platform that are not directly maintained by the anti-fraud department. For example, a payment processing service (or an account registration portal) often has a dedicated team that builds the service, owns its database tables, and controls its data schema. These database tables are then read by the fraud detection system to calculate feature values. There is a risk that newly deployed changes (or bugs) in these upstream dependencies affect the feature values and therefore the performance of the production model. For example, the account registration portal could contain an age field. A fraud detection model that uses this value as a feature is negatively impacted when the team managing the registration portal changes the semantics of the age field, for instance by changing it from a mandatory to an optional field. Automated data validation for ML applications has been studied extensively [62, 70, 15]. Solutions typically perform simple checks, such as validating that all feature values are within a reasonable range (e.g., age must not be negative), or validating that the feature values have a reasonable distribution (e.g., values are not constant). Another frequent approach is to validate that recent values of a feature are within some threshold of similarity compared to older values of that same feature (e.g., using the Kullback-Leibler divergence).
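The sketch below illustrates such checks under illustrative assumptions: a range check, a missing-value and constant-value check, and a histogram-based Kullback-Leibler comparison of a recent window against a reference window. In the toy data, a previously mandatory field has become optional, so the missing-value check fires.

```python
# Minimal sketch of automated feature validation: a range check, a missing- or
# constant-value check, and a histogram-based Kullback-Leibler comparison of a
# recent window against a reference window. Thresholds are illustrative.
import numpy as np
import pandas as pd

def validate(reference, recent, low, high, kl_threshold=0.2):
    alerts = []
    if (recent < low).any() or (recent > high).any():
        alerts.append("values outside the expected range")
    if recent.isna().mean() > 0.10:
        alerts.append("unusually many missing values")
    if recent.nunique(dropna=True) <= 1:
        alerts.append("feature is (near-)constant")

    # KL divergence between histograms of the two windows (with smoothing).
    bins = np.histogram_bin_edges(reference.dropna(), bins=20)
    p, _ = np.histogram(reference.dropna(), bins=bins)
    q, _ = np.histogram(recent.dropna(), bins=bins)
    p = (p + 1) / (p + 1).sum()
    q = (q + 1) / (q + 1).sum()
    kl = float(np.sum(p * np.log(p / q)))
    if kl > kl_threshold:
        alerts.append(f"distribution shift (KL={kl:.2f})")
    return alerts

rng = np.random.default_rng(0)
reference = pd.Series(rng.normal(35, 10, size=5000))          # e.g. an age-like feature
recent = pd.Series(np.where(rng.random(5000) < 0.4, np.nan,   # field became optional
                            rng.normal(35, 10, size=5000)))
print(validate(reference, recent, low=0, high=120))
```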
While comparing feature distributions over time is useful to detect change, in practice there is an important distinction between a change to the semantics of a feature and a change in user behavior. The former case might require the data team to repair data pipelines, while the latter case requires the model to adapt to concept drift (see section 6). There is an open challenge in detecting changes in the semantics of features without raising false alarms on changes in user behavior.

Challenge 21: Test case generation for model verification. There are existing methods for automated test case generation in the natural language domain. This is an open challenge for other types of data.

Challenge 22: Automated data validation under concept drift. Existing methods for automated data validation often compare recent feature values to older values. There is a need for automated data validation methods that distinguish between broken data pipelines or changes to feature semantics on the one hand and a change of user behavior on the other hand. This would enable alerting only in the former scenario.

We presented an operational model of how an anti-fraud department in an e-commerce organization operates. We formulated a list of practical challenges related to fraud detection, and we derived a list of machine learning research topics that are practically relevant and applicable in anti-fraud departments because they address some of these practical challenges. We summarized the state of the scientific literature on these research topics and formulated open research challenges that we believe to be relevant to the industry for anti-fraud operations. By formulating these open challenges, this paper functions as a research agenda with industry practicality in mind. At the same time, this paper aims to enable future work in fraud detection to embed its methods in the organizational context using the operational model presented in this paper.

References

[1] Outlier ensembles: An introduction
[2] Analysis of Thompson sampling for the multi-armed bandit problem
[3] Power to the people: The role of humans in interactive machine learning
[4] Multiple instance classification: Review, taxonomy and comparative study
[5] Why label when you can search? Alternatives to active learning for applying human resources to build classification models under extreme class imbalance
[6] Inactive learning? Difficulties employing active learning in practice
[7] Using confidence bounds for exploitation-exploration trade-offs
[8] Does the whole exceed its parts? The effect of AI explanations on complementary team performance
[9] Putting fairness principles into practice: Challenges, metrics, and improvements
[10] Search improves label for active learning
[11] Wild patterns: Ten years after the rise of adversarial machine learning
[12] The ML test score: A rubric for ML production readiness and technical debt reduction
[13] Multiple instance learning: A survey of problem characteristics and applications
[14] Streaming active learning strategies for real-life credit card fraud detection: Assessment and visualization
[15] TensorFlow Data Validation: Data analysis and validation in continuous ML pipelines
[16] Wells Fargo's fake accounts scandal and its legal and ethical implications for management
[17] A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions
[18] On optimum recognition error and reject tradeoff
[19] Credit card fraud detection and concept-drift adaptation with delayed supervised information
[20] Frustratingly easy semi-supervised domain adaptation
[21] A framework for determining the fairness of outlier detection
[22] Learning classifiers from only positive and unlabeled data
[23] A meta-analysis of the anomaly detection problem
[24] A review of multi-instance learning assumptions
[25] A survey on concept drift adaptation
[26] Unsupervised domain adaptation by backpropagation
[27] Adaptive random forests for evolving data stream classification
[28] Streaming random patches for evolving data stream classification
[29] Toward supervised anomaly detection
[30] Adapting to delays and data in adversarial multi-armed bandits
[31] Graph representation learning
[32] Equality of opportunity in supervised learning
[33] The nearest neighbor classification rule with a reject option
[34] Improving fairness in machine learning systems: What do industry practitioners need?
[35] Correcting sample selection bias by unlabeled data
[36] Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods
[37] On selection bias with imbalanced classes
[38] Online learning under delayed feedback
[39] Internet fraud: The case of account takeover in online marketplace
[40] Positive-unlabeled learning with non-negative risk estimator
[41] Multi-instance learning for predicting fraudulent financial statements
[42] Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers
[43] Bandit algorithms
[44] Technology readiness levels for machine learning systems
[45] A contextual-bandit approach to personalized news article recommendation
[46] Uncovering insurance fraud conspiracy with network learning
[47] Cost-sensitive learning
[48] Isolation forest
[49] Learning under concept drift: A review
[50] LogUCB: An explore-exploit algorithm for comments recommendation
[51] Foundations of attack trees
[52] The economic impact of cybercrime: No slowing down
[53] A survey on bias and fairness in machine learning
[54] Exacerbating algorithmic bias through fairness attacks
[55] Retail e-commerce (e-tail): Evolution, characteristics and perspectives in China, the USA and Europe
[56] Microsoft uses machine learning and optimization to reduce e-commerce fraud
[57] Epistemic uncertainty sampling
[58] Challenges in deploying machine learning: A survey of case studies
[59] Nested multiple instance learning in modelling of HTTP network traffic
[60] A review of novelty detection
[61] Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
[62] Data validation for machine learning
[63] Credit card fraud detection in e-commerce
[64] Dataset shift in machine learning
[65] Snorkel: Rapid training data creation with weak supervision
[66] Anchors: High-precision model-agnostic explanations
[67] Beyond accuracy: Behavioral testing of NLP models with CheckList
[68] Learning to optimize via posterior sampling
[69] Ensemble learning: A survey
[70] Automating large-scale data quality verification
[71] Online active learning methods for fast label-efficient spam filtering
[72] Detecting adversarial advertisements in the wild
[73] Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty
[74] Active learning literature survey
[75] Active Learning and Experimental Design workshop in conjunction with AISTATS 2010
[76] FairOD: Fairness-aware outlier detection
[77] Thresholding for making classifiers cost-sensitive
[78] Poisoning attacks on algorithmic fairness
[79] Return of frustratingly easy domain adaptation
[80] Counterfactual risk minimization: Learning from logged bandit feedback
[81] On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
[82] Fairness without harm: Decoupled classifiers with preference guarantees
[83] Deep learning for anomaly detection
[84] A human-grounded evaluation of SHAP for alert processing
[85] Detecting clusters of fake accounts in online social networks
[86] Learning and evaluating classifiers under sample selection bias
[87] Transforming classifier scores into accurate multiclass probability estimates
[88] Change with delayed labeling: When is it detectable?