key: cord-0126235-rds598cj
authors: Kim, Sundong; Mai, Tung-Duong; Han, Sungwon; Park, Sungwon; Khanh, Thi Nguyen Duc; So, Jaechan; Singh, Karandeep; Cha, Meeyoung
title: Active Learning for Human-in-the-Loop Customs Inspection
date: 2020-10-27
journal: nan
DOI: nan
sha: ef430bf3ae296efcef65789e3a5df809789c3ba4
doc_id: 126235
cord_uid: rds598cj

We study the human-in-the-loop customs inspection scenario, where an AI-assisted algorithm supports customs officers by recommending a set of imported goods to be inspected. If the inspected items are fraudulent, the officers can levy extra duties. The resulting logs are then used as additional training data for successive iterations. Choosing to inspect suspicious items first leads to an immediate gain in customs revenue, yet such inspections may not bring new insights for learning dynamic traffic patterns. On the other hand, inspecting uncertain items can help acquire new knowledge, which will be used as a supplementary training resource to update the selection systems. Based on multiyear customs datasets obtained from three countries, we demonstrate that some degree of exploration is necessary to cope with domain shifts in trade data. The results show that a hybrid strategy of selecting likely fraudulent and uncertain items will eventually outperform the exploitation-only strategy.

Suppose you are given the task of developing an AI-based selection system to assist customs officers working on-site who inspect goods based on recommendations. With the increasing prevalence of online retail, the kinds and amounts of trade traffic are growing astronomically, as are emerging fraudulent trades that try to deceive the system with sophisticated tactics. For example, during the COVID-19 pandemic, the World Customs Organization reported increased numbers of attempted fraud and tax evasion incidents [1]. A detection algorithm that is tailored to past logs will inevitably degrade over time.
However, relearning the algorithm entirely may waste domain knowledge that has been accumulated over decades. How should the system balance between exploiting existing knowledge and exploring new trends?

As illustrated in this story, machine learning models in online prediction settings must adapt well to changes in the input, a challenge known as concept drift [2]. In the context of customs operations, the list of countries procuring a particular product will change over time, and products that are foreign to systems may be declared (e.g., new technology). Even a well-trained machine learning model can fall into the trap of confirmation bias and may not capture these changes. Particularly for situations in which manual labeling is expensive, it can be challenging to make significant changes to the model's working logic.

To mitigate this problem, active learning techniques can help users decide how to query and interactively annotate data points in light of unknown concepts [3]. In the recent active learning setting, the model encourages querying uncertain samples while ensuring sample diversity. However, in customs risk management systems, queried data are often subject to evaluation. The system needs to be profitable while securing knowledge for the future. Therefore, the learning model cannot fully follow the exploration principles of active learning.

To describe the problem, we introduce a case study in which customs administrations maintain an AI-based selection model to support officers in collecting duties. Figure 1 depicts a customs clearance process. Importers need to specify the trade items' information in import declaration forms to trade goods across borders. We hypothesize that a trade selection model plays a role in prioritizing items for inspection. Officers follow the recommendation to manually inspect the authenticity of the chosen items and levy additional tariffs if there is fraud.
In most customs offices, only a small subset of the import goods (say, 1-5%) is inspected due to the large trade volume. Once items are inspected, whether there is fraud or not, the performance of the customs offices is evaluated; at this time, the selection model's parameters can be updated with new knowledge. Many customs offices worldwide set aside a small set of random samples to be inspected to learn new fraud patterns [4].

This paper aims to innovate the sample selection strategy, together with the existing inspection process. We propose a hybrid selection strategy to maximize long-term revenue from fraud detection while maintaining a high income during short-term inspection. By leveraging the concept of exploration, our model remains up to date against concept drifts. In contrast, the concept of exploitation maintains high precision for current fraudulent transactions. As an alternative to the random exploration used by customs, we propose the gATE exploration methodology, built upon a state-of-the-art active learning approach [5]. gATE is designed to select the most informative samples in a diverse way to capture the dynamically changing traits of trade flows. Tested on actual multiyear trade data from three African countries, we empirically demonstrate that the proposed hybrid model outperforms state-of-the-art models in detecting fraud and securing revenue.

Our key contributions are as follows:
1) We define the problem of concept drifts in the context of customs fraud detection, i.e., a dynamic trade selection setting that adaptively acknowledges new trends in data while securing revenue collected from fraud detection.
2) We propose a novel hybrid sampling strategy based on active learning that combines exploration and exploitation strategies.
3) The experiments demonstrate the long-term benefit of exploration strategies on real trade logs from three African countries.
4) We provide code for simulating customs trade selection that reflects the needs of customs administrations. See the reproducibility section.

Since 2019, we have collaborated with customs communities represented by the World Customs Organization and their partner countries, including the Nigeria Customs Service. Namely, our prior work DATE classifies and ranks illegal trade flows that contribute most to the overall customs revenue when they are identified [6]. DATE is open source (footnote 1) and is being studied widely to advance data analytics capabilities in customs organizations. However, in the process of piloting DATE in a live environment, we found that concept drift could be fatal to the model performance. Considering that various fraud detection algorithms were tested only in an offline setting [4], [6], [7], [8], it is very likely that such models will suffer from confirmation bias in a live setting. We also show that our DATE model suffers from confirmation bias in Figure 4(c). Avoiding bias in the model is the primary motivation for extending the research. In contrast to our prior work, we propose a hybrid algorithm with a new exploration strategy, gATE, and refine our experiments from an 80-20% data split to a long-term simulation with consecutive inspections. We also tested our algorithm on datasets from multiple countries.

Footnote 1: http://bit.ly/kdd20-date

Earlier research on customs fraud detection focused on rule-based or random selection algorithms [4], [7]. While they are intuitive and widely used, these classical methods need to relearn patterns periodically, leading to high maintenance costs. The application of machine learning in customs administration has been a closed task, primarily due to the proprietary nature of the data. Several recent studies have shown the use of off-the-shelf machine learning techniques such as XGBoost and the support vector machine (SVM) [8], [9].
Recently, dual attentive tree-aware embedding (DATE) was proposed, employing transaction-level embeddings in customs fraud detection [6]. This model provides interpretable decisions that can be checked by customs officers and yields high revenue through the collected tax. However, these algorithms are expected to face performance degradation over a long period due to their limited adaptability to uncertainty, diversity, and concept drifts in trade data. To address this issue, we introduce the concept of exploration so that the model remains up to date against concept drifts.

Concept drift describes unexpected changes in the underlying distribution of streaming data over time [2]. Past research has studied three aspects of concept drift: drift detection, drift understanding, and drift adaptation [10]. Several methods are available for adapting existing learning models to concept drift. The most straightforward way is to retrain a model with the latest data and replace the obsolete model parameters when drift is observed [11]. In cases with recurring drift, ensemble methods are known to be effective. A classic example involves using tree-based classifiers and replacing an obsolete tree with a new tree [12]. Voting schemes have also been applied to manage base classifiers by adding them to ensembles [13]. The requirement of maintaining a set of pre-defined classifiers is a major drawback of these methods.

For stream applications, where only a fraction of the given data is annotated by human effort, one can consider sample selection via active learning to maintain the optimal level of performance. This situation is often subject to the concept drift problem [14]. Since our customs trade selection problem also falls under this setting, we consider active learning as a potential solution.

Active learning enables an algorithm to elicit ground truth labels for uncertain data instances and enhance its performance [3], [15].
It has been utilized in training models to deal with high-dimensional data [16], to offer long-term benefits [17], [18], to select appropriate data instances to speed up the model training [19], and to train the model with a limited budget [20]. For example, one study proposed a way to measure the 'informativeness' of given samples [21]. Others have proposed collecting as much information as possible by prioritizing inspection of uncertain samples [22], [23], [24]. Another line of research has focused on improving diversity by strategically collecting samples to represent the overall data distribution. Diversity-based algorithms include region-based active learning [25] and core-set-based approaches [26]. Recent research has also focused on the concurrent inclusion of uncertainty and diversity aspects [5], [27].

Collectively, these approaches share common limitations in practical use. First, active learning research has assumed an offline setting [5], [28], [29], [30]. For example, HAL showed that including simple exploration helps margin sampling in a skewed dataset [30], and BADGE showed the effectiveness of sampling uncertain data points in a diverse way [5]. However, their evaluations are based on fixed test data, which cannot accommodate concept drifts in real active learning scenarios. Real-world logs exhibit substantial changes and dynamics over time, as we will demonstrate in this paper, making most static machine learning models obsolete. Second, extant models separate the processes of exploitation (i.e., inspection) and exploration (i.e., annotation). In practice, every manual inspection or annotation incurs a cost, and the budget is often limited (for example, by the number of inspection officers in customs). Given the shared budget, an exploration-oriented active learning algorithm is unlikely to succeed if it is learned separately. This constrained optimization setting has not been handled in conventional approaches.
Our work addresses these two realistic settings.

The customs administration aims to detect fraudulent transactions and maximize the tax revenue from illicit trades; this is the customs fraud detection problem [6], [8]. Given an import trade flow $B$, the main goal is to predict both the fraud score $y^{cls}$ and the raised revenue $y^{rev}$ obtainable by inspecting each transaction $x$. Given the limited budget of inspection and annotation, we address the problem of devising an efficient selection strategy to identify suspicious trades and increase revenue as follows:

Customs trade selection problem. Given trade flows $B$, construct a sample selection strategy $f$ that maximizes the detection of fraudulent transactions and the associated tax revenue.

The trade flows $B$ consist of the online stream of trade records (footnote 2), including the importer ID, commodity ID (such as the Harmonized System Codes), and declared price of goods. The characteristic distributions of illicit transactions are assumed to change over time (see Sec. 5.1.1).

Extant research on customs fraud detection mainly concentrates on the static setting, in which a model is trained on large training batches and deployed for fraud detection without further updates [6], [8]. We consider a practical online setting where the characteristic distribution of trade flows $B$ and the traits of illicit trades change over time. We call this the active customs trade selection problem. It requires the selection strategy to help the model update and adapt to new fraud types. Every inspected item can bring additional information, and the problem is to choose strategically the items that maximize the model performance. We formally define the active customs trade selection problem as follows:

At each time $t$, given a batch of items $B_t$ from trade flows $B$, based on a strategy $f$ trained with $X_t$, customs officers select a batch of items $B_t^S$ to inspect physically.
After inspection, the annotated results are used to update the strategy $f$. We evaluate the model from timestamp $t_k$ onward. The goal is to devise a strategy $f^*$ that maximizes the precision and revenue in the long term:

$$f^* = \arg\max_f \sum_{t \ge t_k} m\big(B_t^S(f)\big),$$

where $m$ is the evaluation metric, which is the precision or revenue from fraud detection. Table 1 lists the notation used frequently throughout the paper, and the main training process for fraud detection with active customs selection is described in Algorithm 1.

Footnote 2: These terms are used interchangeably: transactions, items, goods.

Table 1: Notation.
  $x$: A transaction (item) from import flows $B$.
  $y^{cls}$: A binary label denoting that item $x$ is fraudulent.
  $y^{rev}$: A non-negative value (label) denoting item $x$'s additional revenue upon its inspection.
  $B_t$: Import flows arriving during a unit interval before time $t$.
  $B_t^S(f)$: A set of items selected by strategy $f$ at time $t$. These items are subject to inspection. After inspection, their labels $y^{cls}$ and $y^{rev}$ are obtained. For simplicity, we denote it as $B_t^S$.
  $X_t$: Training data at time $t$. $X_t$ is used to update the parameters of strategy $f$.
  $r_t$: An inspection (selection) rate at time $t$.
  $m(B_t^S(f))$: An evaluation metric for $B_t^S(f)$ (e.g., Revenue@k%).

Algorithm 1 (active customs trade selection)
  Input: Previous inspection histories $H$, initial inspection rate $r_0$, target inspection rate $r$, unlabeled data stream of new items $B_t$ at each timestamp
  Output: Items for inspection $B_t^S$ at each timestamp $t$
  /* Assuming inspections are made weekly. */
  Initialize the training data $X_1$ from inspection histories $H$;
  for $t = t_1, t_2, \ldots$ do
    Obtain the batch of new items $B_t$;
    Determine the weekly inspection budget $r_t$, using $r_0$ and $r$;
    /* Selection by the algorithm */
    Train the selection strategy $f$ with $X_t$;
    Based on $f$, select a set $B_t^S$ of $r_t |B_t|$ items for inspection;
    /* Inspection by officers */
    Obtain the ground-truth annotation $(x_i, y_i^{cls}, y_i^{rev})$ for each item $x_i \in B_t^S$ after manual inspection;
    Evaluate the results by precision and revenue;
    Add the newly annotated items into the training data: $X_{t+1} = X_t \cup B_t^S$;
  end

The quality of a solution to the active customs trade selection problem depends on having a good selection strategy $f$. We propose a new strategy that combines two approaches: exploitation and exploration. The exploitation approach selects the most likely fraudulent and highly profitable items to secure short-term revenue for customs administration. The exploration approach, in contrast, selects uncertain items at the risk of temporary revenue regret, yet potentially detects novel fraud patterns in the future. Our algorithm mixes these two components to gain long-term benefits and secure immediate revenue from imbalanced customs datasets. Figure 2 illustrates the overall framework of the proposed model.

We employ the current state-of-the-art algorithm in illicit trade detection, DATE [6], as a baseline for this research. It is a tree-enhanced dual-attentive model that optimizes the dual objectives of (1) illicit transaction classification and (2) revenue prediction. We leverage the predicted fraud score of DATE for our exploitation strategy. We update the DATE model at each timestamp and select the most suspicious items as per the inspection budget (see Algorithm 2).

Algorithm 2 (DATE exploitation strategy)
  Input: Training set $X_t$, items received $B_t$, inspection rate $r_t$
  Output: A batch of selected items $B_t^S$
  /* Corresponds to the selection part in Alg. 1. */
  Train the DATE model using training set $X_t$;
  Perform prediction on $B_t$, obtain the predicted annotation $(x_i, \hat{y}_i^{cls}, \hat{y}_i^{rev})$ for each item $x_i \in B_t$;
  Obtain the set $B_t^S$ of $r_t |B_t|$ items with the highest fraud score $\hat{y}_i^{cls}$;

The exploitation strategy selects the more familiar and highly suspicious transactions for inspection; therefore, it tends to underperform over time as trade patterns gradually change. In contrast, our hybrid strategy adds a small portion of new and uncertain trades to the training data as learning samples, which gradually affects the model's future prediction performance.

Since fraud types are constantly evolving, the model performance might drop over time. To resolve this issue, we propose an exploration strategy that selects uncertain trade items, with additional consideration of diversity and revenue.

One approach to detecting new fraud types is to utilize uncertainty in the query strategy. Selecting items for which the model is least confident can provide more information on similar new observations. However, this strategy can create an unfavorable scenario in which newly labeled data do not cover diverse transaction types, and labels for near-identical transactions continue to accumulate. Considering this, we include the concept of diversity along with uncertainty in our selection strategy; i.e., we choose the most diverse samples possible for stable and fast exploration [5], [28].

We adopt the gradient embedding and k-means++ initialization of BADGE [5] on top of our exploitation model DATE to determine which trades should be queried for inspection, considering both uncertainty and diversity. The detailed implementation of each concept is described below.

Uncertainty. If a sample produces a large loss gradient and, consequently, a large parameter update, the item potentially contains useful information. This means that the magnitude of the gradient embedding reflects the uncertainty of the model on samples.
With this motivation, we aim to choose trade flows with uncertainty using the magnitude of the gradient embedding. At time $t$, for each trade item $x_i$ in $B_t$, the illicitness classifier $h_\theta$ from the DATE model returns its fraud score $\hat{y}_i^{cls}$, which indicates the illicit class of $y_i^{cls}$:

$$\hat{y}_i^{cls} = h_\theta(x_i) = \sigma\big(W \cdot z_\phi(x_i)\big),$$

where $W$ is a weight matrix that projects the transaction embedding $z_\phi$ to the DATE illicitness class space. The gradient embedding $g_{x_i}$ is the gradient of the loss function with respect to $W$ for sample $x_i$. Since the received data points are unlabeled (not yet inspected), we predict the pseudo label $\hat{c}_i$ from the fraud score with a threshold of 0.5 (i.e., $\hat{c}_i = \mathbb{1}(\hat{y}_i^{cls} \ge 0.5)$). This pseudo label is used to calculate the loss, resulting in the gradient embedding

$$g_{x_i}^{c} = \big(p_i^{c} - \mathbb{1}(\hat{c}_i = c)\big)\, z_\phi(x_i),$$

where $c \in \{0, 1\}$ corresponds to the two classes and $p_i^{c}$ is the predicted probability for class $c$; $p_i^{c=0} = 1 - \hat{y}_i^{cls}$ and $p_i^{c=1} = \hat{y}_i^{cls}$.

Diversity. For effective exploration, we inspect items that bring large changes and diverse views to the model. Taking diversity into account, we can avoid a situation where similar items are inspected redundantly. We construct the batch by selecting the most representative samples through the seeding step of the k-means++ algorithm [31], which produces a good initial clustering. K-means++ builds a set $B_t^S$ of $k$ centroids by sampling each new centroid with probability proportional to its squared distance from the nearest centroid already chosen. Samples with small gradients are also unlikely to be chosen, as the distances between them are small. Gradient embedding with k-means++ seeding therefore tends to select a batch of large and diverse gradient samples. By doing so, our selection strategy can consider both sample uncertainty and batch diversity.

To induce the algorithm to select more uncertain and high-revenue items, we introduce extra weights to amplify the effect of uncertainty and revenue.
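The uncertainty and diversity steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names `gradient_embeddings` and `kmeanspp_select` are hypothetical, the array `z` is a stand-in for DATE's transaction embeddings, and starting from the largest-norm embedding is a choice made for this sketch (standard k-means++ picks the first center uniformly at random).

```python
import numpy as np

def gradient_embeddings(fraud_scores, z):
    """Pseudo-label gradient embeddings for a binary classifier.

    fraud_scores: shape (n,), predicted fraud probability per item.
    z: shape (n, d), transaction embeddings (hypothetical stand-in).
    Returns shape (n, 2 * d): one d-dimensional block per class c in {0, 1}.
    """
    fraud_scores = np.asarray(fraud_scores, dtype=float)
    z = np.asarray(z, dtype=float)
    probs = np.stack([1.0 - fraud_scores, fraud_scores], axis=1)  # p_i^c
    pseudo = (fraud_scores >= 0.5).astype(int)                    # pseudo label
    onehot = np.eye(2)[pseudo]                                    # 1(pseudo == c)
    coeff = probs - onehot                                        # shape (n, 2)
    return (coeff[:, :, None] * z[:, None, :]).reshape(len(z), -1)

def kmeanspp_select(emb, k, rng):
    """Pick k distinct row indices by k-means++ seeding (D^2 sampling)."""
    n = emb.shape[0]
    norms = np.einsum("ij,ij->i", emb, emb)
    chosen = [int(np.argmax(norms))]  # assumption: start at the largest gradient
    d2 = np.sum((emb - emb[chosen[0]]) ** 2, axis=1)
    while len(chosen) < min(k, n):
        total = d2.sum()
        if total <= 0:
            # Every remaining row coincides with a chosen one: fall back to uniform.
            remaining = np.setdiff1d(np.arange(n), chosen)
            nxt = int(rng.choice(remaining))
        else:
            # Sample proportionally to squared distance from the nearest chosen row.
            nxt = int(rng.choice(n, p=d2 / total))
        chosen.append(nxt)
        d2 = np.minimum(d2, np.sum((emb - emb[nxt]) ** 2, axis=1))
    return chosen
```

Because a chosen row has zero distance to itself, it can never be sampled again, so the returned indices are distinct. Items whose fraud score is close to 0 or 1 have short embeddings that sit near each other, which is why they are rarely picked.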
The weights, called the uncertainty scale and the revenue scale, adjust the probability that a sample is chosen by resizing its gradient embedding vector.

Uncertainty scale. We magnify the impact of uncertain items by quantifying how confidently the model can score an item. We give each item an uncertainty score (Eq. 12) that indicates the magnitude of the model's uncertainty about the item. The uncertainty score $unc_i$ is defined as follows:

$$unc_i = -1.8 \times \left|\hat{y}_i^{cls} - 0.5\right| + 1 \qquad (12)$$

This concave function maximizes the uncertainty score when the system cannot determine whether an item is fraudulent or not (i.e., the uncertainty score is the largest when the predicted fraud score $\hat{y}_i^{cls}$ is 0.5). We adjust the uncertainty score by using a multiplier of -1.8 and set its range between 0.1 and 1, which leads the exploration algorithm to select every item with some chance (footnote 3). Our intention here is to leave some degree of uncertainty even when the base model is overconfident about its prediction results (i.e., $\hat{y}_i^{cls}$ is close to 0 or 1) [32].

Footnote 3: This setting shows the best results on our datasets. For practical use, the best parameters can be found using the validation set.

Revenue scale. Active learning in customs operation requires additional consideration, as revenue needs to be collected as the customs duty. Maximizing the customs duty is one of the top priorities of customs authorities. Therefore, we further amplify the gradient embedding by the DATE model's predicted revenue $\hat{y}_i^{rev}$. The distribution of the amount of customs duty is right-skewed, so we take the log of the predicted revenue (Eq. 13). We can define the final scale factor $S_i$ of $x_i$ as

$$S_i = unc_i \times \log\big(\hat{y}_i^{rev} + k\big) \qquad (13)$$

As a result, the gradient embedding $g_{x_i}^{c}$ becomes

$$g_{x_i}^{c} = S_i \times \big(p_i^{c} - \mathbb{1}(\hat{c}_i = c)\big)\, z_\phi(x_i) \qquad (14)$$

where $k$ is a constant for computational stability. The algorithm covered in Sections 4.2.1 to 4.2.2 is named bATE, inspired by BADGE [5] and DATE [6].

In practice, some importers might commit fraud by analyzing and reverse engineering the model's prediction patterns.
We can call them adaptive adversaries of the model. In this situation, randomness is known to improve the robustness and competitiveness of the online algorithm [33]. With this motivation, we introduce randomness to our sampling strategy. Using the validation performance of the DATE model, we establish a gatekeeper. If Rev@n% is higher than the predefined value of $\theta$, the bATE exploration algorithm is used. Otherwise, the DATE model's outputs are deemed highly unreliable and the inputs can be considered an attack, so items are selected for inspection at random.

To address the issues above, we propose the final exploration strategy, gATE, which is formally written as Algorithm 3:

Algorithm 3 (gATE exploration strategy)
  Input: Training set $X_t$, items received $B_t$, inspection rate $r_t$
  Output: A batch of selected items $B_t^S$
  /* Corresponds to the selection part in Alg. 1. */
  Train the DATE model using training set $X_t$;
  Obtain Rev@n% from validation set;
  if Rev@n% > $\theta$ then
    Perform prediction on $B_t$, obtain the predicted annotation $(x_i, \hat{y}_i^{cls}, \hat{y}_i^{rev})$ for each item $x_i \in B_t$;
    Calculate the gradient embedding $g_{x_i}$ (Eq. 14);
    Obtain the set $B_t^S$ of $r_t |B_t|$ items by k-means++ initialization;
  else
    Obtain the set $B_t^S$ of $r_t |B_t|$ items by random sampling;
  end

The exploitation-only model can lead to confirmation bias. Because it is trained only on historical data while customs datasets are subject to concept drift, the model tends to become unreliable in the presence of outliers. However, a pure exploration strategy cannot secure customs revenue and is unrealistic in the customs setting. Hence, we balance the two to achieve both short-term and long-term performance. We propose a hybrid selection strategy under the online active learning setting that includes two main approaches, exploitation and exploration, by utilizing DATE and gATE. To select items that will potentially enhance the model's performance, we design a gATE strategy for exploration (§4.2.1-4.2.3).
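The uncertainty scale, the gatekeeper, and the exploitation/exploration budget split described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `uncertainty_score`, `gate_exploration`, `hybrid_select`, and `explore_ratio` are hypothetical, the uncertainty function is one piecewise-linear form consistent with the stated multiplier of -1.8 and range of 0.1 to 1, and the informed explorer is represented only by a ranked list of candidate indices.

```python
import numpy as np

def uncertainty_score(fraud_scores):
    """Assumed form: 1.0 at a fraud score of 0.5, falling linearly to 0.1 at 0 or 1."""
    s = np.asarray(fraud_scores, dtype=float)
    return -1.8 * np.abs(s - 0.5) + 1.0

def gate_exploration(rev_at_n, theta, informed_order, n_items, rng):
    """Gatekeeper: trust the informed explorer only if validation Rev@n% > theta.

    informed_order: candidate indices ranked by an informed explorer
    (a hypothetical stand-in for a bATE-style selection).
    """
    if rev_at_n > theta:
        return [int(i) for i in informed_order]
    # The base model looks unreliable: fall back to a random ordering.
    return [int(i) for i in rng.permutation(n_items)]

def hybrid_select(fraud_scores, budget, explore_order, explore_ratio=0.1):
    """Spend most of the budget on top fraud scores, the rest on exploration."""
    scores = np.asarray(fraud_scores, dtype=float)
    n_explore = int(round(budget * explore_ratio))
    n_exploit = budget - n_explore
    exploit = [int(i) for i in np.argsort(-scores, kind="stable")[:n_exploit]]
    taken = set(exploit)
    # Skip candidates that exploitation already picked, to avoid double inspection.
    explore = [int(i) for i in explore_order if int(i) not in taken][:n_explore]
    return exploit, explore
```

The default `explore_ratio` of 0.1 mirrors the nine-to-one split used in the experiments; in a deployment it would be a tunable parameter.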
We use the DATE strategy for exploitation.

We employed item-level import declarations for three countries in Africa. The import data fields included numeric variables such as the item price, weight, and quantity and categorical variables such as the commodity code (HS code), importer ID, country code, and received office. After matching the data format for each country, we preprocessed the variables by following the approach used in a previous study [6]. For categorical variables, we quantified the risk indicators of the importers, declarants, HS codes, and countries of origin from their non-compliance records. For example, importers were ranked by their past fraud rates. Importers whose ranks were above the 90th percentile were regarded as high-risk importers, and their risk indicators were given values of 1; otherwise, the values were set to 0. This is called risk profiling, and it is more efficient than one-hot encoding those variables. We also added frequently used cross features such as unit.value (= cif.value / quantity), value/kg (= cif.value / gross.weight), tax.ratio (= total.taxes / cif.value), unit.tax (= total.taxes / quantity), and face.ratio (= fob.value / cif.value). Table 2 illustrates the import declaration data.

All three customs administrations conducted detailed inspections (i.e., they achieved a nearly 100% inspection rate). However, this practice is not sustainable, and the customs offices of these countries plan to reduce the inspection rate in the future. Due to the manual inspection policy, the item labels and tariffs charged are accurately recorded in these logs at the single-goods level. Table 3 and Figure 3 depict the statistics of the data we utilized.

Fig. 4: For some cases, the performance of the exploitation strategy DATE drops over time, but the performance of hybrid strategies remains stable even for cases in which the exploitation strategy fails. This shows that exploration is necessary for maintaining a selection system in the long run. Panel (c): In country T, the exploitation strategy failed, unlike a hybrid model.

The experiment aims to find the best selection strategy to maintain the customs trade selection model in the long run. Therefore, we simulated an environment in which a selection model is deployed and maintained for multiple years (footnote 4). Given that one month of training data is available, the system receives import declarations and selects a batch of items to inspect during the week. A selection model is trained based on a predefined strategy, and the most recent four weeks of data are used to validate the model. By using the inspection results, the model is updated every week.

To simulate a scenario of data providers who are willing to reduce the inspection rate gradually, we implemented several methods to decay the inspection rate over time. In this experiment, we set the target inspection rate to 10%. Starting with 100% inspection, we used a linear decaying policy by reducing the inspection rate by 10% each week. Once the target inspection rate is reached, the system maintains this inspection rate for the remaining period. In Figs. 4, 6, 7, and 13, we use a vertical dashed line to indicate when the decay ends and the target rate is maintained (footnote 5).

We evaluate the selection strategy performance by referring to two metrics used in previous work [6]: Precision@n% and Revenue@n%. Since the underlying data distribution changes each week in an online setting, these value metrics also fluctuate significantly. In other words, unless the illicit rates or item prices are fixed, these two value metrics will be difficult to interpret directly. We therefore used normalized performances, obtained by dividing each value metric by the maximum achievable value, namely, the oracle value.
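The inspection-rate schedule and the oracle normalization described above can be sketched as follows. This is a minimal sketch under assumptions, not the authors' evaluation code: all function names are hypothetical, "reducing the inspection rate by 10% each week" is read here as 10 percentage points per week with week 0 at 100%, and the oracle is taken to be the best selection achievable with the same number of inspected items.

```python
import numpy as np

def inspection_rate(week, start=1.0, target=0.1, step=0.1):
    """Linear decay from `start` to `target` (assumed: `step` points per week)."""
    return max(target, start - step * week)

def precision_at_n(selected, is_fraud):
    """Share of inspected items that are truly fraudulent."""
    is_fraud = np.asarray(is_fraud, dtype=bool)
    return float(is_fraud[list(selected)].mean())

def oracle_precision_at_n(is_fraud, n_select):
    """Best achievable precision when inspecting n_select items."""
    n_fraud = int(np.sum(is_fraud))
    return min(n_fraud, n_select) / n_select

def norm_precision_at_n(selected, is_fraud):
    """Precision of the selection divided by the oracle precision."""
    oracle = oracle_precision_at_n(is_fraud, len(selected))
    return precision_at_n(selected, is_fraud) / oracle if oracle > 0 else 0.0

def norm_revenue_at_n(selected, revenue):
    """Revenue collected divided by the revenue of the best same-size selection."""
    revenue = np.asarray(revenue, dtype=float)
    got = revenue[list(selected)].sum()
    best = np.sort(revenue)[::-1][: len(selected)].sum()
    return float(got / best) if best > 0 else 0.0
```

With 1,000 items of which 20 are fraudulent, a 100-item selection that catches 18 of them has a precision of 0.18 against an oracle precision of 0.2, which normalizes to 0.9.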
• Norm-Precision@n%: In a situation where n% of all declared goods are inspected, Pre@n% indicates how many actual instances of fraud exist among the inspected items. The Norm-Pre@n% value of an algorithm is defined as the Pre@n% of the algorithm divided by the Pre@n% of the oracle.

Footnote 4: Previous works split the data into training and testing sets on a temporal basis and compared the performance of diverse machine learning models [6], [8]. However, the algorithm's performance in a static prediction state cannot indicate the model's performance in a real setting when the model is deployed.

Footnote 5: In countries where the daily import declaration volume is larger, it would be possible to update the selection strategy every day, and more reliable results could be obtained even with a shorter period.

Example. If a system with a 10% inspection rate is operating in an environment with a 2% illicit rate, the Pre@10% and Rev@10% of the oracle would be 0.2 and 1, respectively. Suppose that the deployed selection strategy achieves a Pre@10% value of 0.18. To prevent any potential interpretation bias caused by the illicit rate, which varies from country to country, we divide 0.18 by the performance upper bound of 0.2, which results in 0.9 for Norm-Pre@10%.

Note. We employ a fully labeled dataset and these metrics as the ground truth information. For countries already maintaining a low inspection rate, these metrics can be modified by conditioning on their observable goods.

Securing tax revenue was the most critical screening factor for the developing countries we interacted with since their fiscal income depends on customs services [34]. Therefore, we mainly used Norm-Rev@n% in reporting the results in the following sections.

To explore this possibility, we first compare the performance of the pure exploitation strategy DATE and a partial-exploitation strategy with some random exploration, called a naive hybrid strategy.
The naive hybrid strategy uses DATE and random exploration at a ratio of nine to one. The data reveal that a pure exploitation strategy can lead to a substantial degree of malfunctioning. Figure 4(c) shows that for country T, the performance of the state-of-the-art DATE exploitation strategy drops unexpectedly from time to time, yet the hybrid strategy remains stable. The degradation continues despite the increasing size of the training data, confirming that items chosen for inspection are uninformative and indicating a concept drift in the country's trade pattern. Hence, we conclude that the items selected by the exploration strategy significantly boost the performance of the exploitation strategy. Considering that the randomly selected items may affect only 1% of the total revenue on average, the performance boost arises from inspecting unknown items.

The longitudinal data also allow us to examine how frequently concept drifts occurred in the trading pattern of country T. Figure 5 shows the share of each source country for an item with a commodity code starting with 620 in 2015, 2017, and 2019, indicating a significant level of concept drift in trade rates for the top imported items. Countries A and B used to be where the item was imported from the most, but starting in 2017, the mix of source countries changed sharply, and country C became the dominant source country for imported goods.

Is it common for the performance of the exploitation strategy to decrease over time? We check whether these behaviors are common across all countries. Reassuringly, we observe that the full-exploitation strategy does not always fail. Figure 4(a)-(b) shows the results obtained from country M and country N. For these countries, maintaining the strategy of screening the most fraudulent items is still valid.
However, when we compare the average performance of the exploitation strategy and the hybrid strategy, we find that the former does not outperform the latter (Norm-Rev@10%, Exploitation vs. Hybrid: 59.6% vs. 61.5% for country M, 83.2% vs. 84.0% for country N; moving average over the previous 13 weeks). It is interesting to see that inspecting a set of random items is even better than inspecting reasonably fraudulent items with high ŷ_cls values (top 9-10%) for maintaining a customs trade selection system in the long run. How much, then, will the performance improve if a better exploration strategy is used instead of random selection? We measured the performance of the proposed exploration strategies. A natural question arises regarding which strategies would be best for exploration. We first compared the performance of pure exploration strategies, assuming a need to build a system that tends to explore. This experimental setting is relevant for customs administrations that do not have enough import histories available and therefore want to construct a working selection system as quickly as possible. The setting is also widely used in the active learning community [5], [28] to compare pool-based active learning algorithms. We performed experiments with four exploration strategies, including our proposed model designed in Section 4.2.
• Random [4]: Known to be used as an exploration strategy to detect novel fraud in the production systems of some countries.
• BADGE [5]: State-of-the-art active learning approach that selects items considering uncertainty and diversity.
• bATE: Explores by considering predicted revenue as well as item uncertainty and item diversity.
• gATE: Strategically determines whether the exploration strategy is random or bATE, depending on the performance of the base model.
Figure 6 shows the 13-week moving average performance of the exploration strategies.
The results show that the three advanced strategies, BADGE, bATE, and gATE, outperform the random strategy by a large margin. bATE is the top-performing strategy in countries M and T, and gATE performs the best in country N. From the point of view of active learning, these results suggest that the scaling components introduced in our method help construct a working customs trade selection system more rapidly. However, the performance of the full-exploration strategy is not comparable to that of the full-exploitation strategy. In our experiments, the full-exploitation strategy reaches 0.844 in country N (Table 4), while the full-exploration strategy reaches only 0.52 in the same country, which is not impressive in itself. Although a set of explored samples consists of items with uncommon HS codes or under-invoiced items near the decision boundary, they are not always frauds. Including these items helps the model training process to some extent. However, a model trained solely on these items (i.e., full exploration) is susceptible to noise and hence does not exhibit the best performance. Thus, the exploration and exploitation strategies need to be used together to guarantee reliable performance of the customs trade selection system. Next, we compare the performance of these exploration strategies when each is combined with an exploitation strategy. Following Sections 5.2-5.3, each hybrid strategy selects 90% of the items by DATE, and one of the four exploration strategies selects the remaining 10%. We also compare the strategies with DATE alone to show the long-term sustainability of the hybrid strategies. Figure 7 and Table 4 summarize the performance of the hybrid models with different exploration strategies. First, we can see that all hybrid strategies outperform the full-exploitation strategy DATE by some margin.
For country T, where a staggering decline in the exploitation strategy's performance is recorded, our hybrid strategy performs exceptionally stably, and the model ultimately improves. Even though the DATE model for exploitation remains effective for the other two countries, the 10% trade-off for exploration does not hurt the overall performance; rather, the hybrid strategy slightly outperforms the exploitation algorithm. This supports our initial claim that even if some inspections are spent on items other than the most suspicious ones, we can maintain similar performance by learning new patterns from the unknown items.
(b) Norm-Pre@10% performance of four exploration strategies on three country datasets. Fig. 6: Advanced exploration strategies outperform random selection when the customs trade selection system is operated by the exploration strategy. Note that random exploration is widely used in many customs offices. In addition, bATE outperforms BADGE, suggesting that the introduced scaling components are practical in active customs trade selection settings.
(b) Norm-Pre@10% performance of four hybrid strategies on three country datasets. Fig. 7: Hybrid strategies outperform the state-of-the-art full-exploitation strategy DATE. We confirm that a robust hybrid model can be made even with a simple exploration strategy.
Second, advanced exploration strategies can further improve the performance of the whole hybrid model. This is shown by the result that DATE+bATE achieves 1.6% higher revenue than DATE+random in country T. It is noteworthy that the best exploration algorithm contributes the most to the hybrid strategy when it is used in the country with the largest trade volume and the highest illicit rate. Third, the performance of the hybrid model with a random exploration strategy is still comparable to that of the hybrid models with advanced exploration strategies in general.
In practice, we encourage customs administrations to start with a simple exploration strategy that requires no additional computing power. Compared with relying on a single exploitation or exploration model, the customs trade selection model becomes more robust when both strategies are used.
Fig. 8: Trade log of a fraudster. Thanks to the exploration strategy, which noticed the importer's fraudulent activity, the hybrid targeting system is later able to detect the major frauds he committed. The system that operates only the exploitation strategy could not detect the frauds until the end.
Timely exploration allows customs to inspect goods from unknown importers and extend their knowledge. Based on this input, the updated model can prevent potential frauds. Figure 8 shows a successful case of detecting sequential frauds with our hybrid strategy. This example presents the trade log of an importer who has imported goods since 2015. After 2.5 years, one of the transactions was subjected to physical inspection through exploration and was labeled a fraud. (Footnote 6: From Jan 2015 to Aug 2017, the importer processed 59 transactions, including eight frauds, but none of these transactions had been inspected.) The importer mixed normal and fraudulent transactions to avoid further inspection. Yet, the newly updated exploitation strategy was able to catch his sequential frauds. Without being triggered by exploration (i.e., under full exploitation), the targeting system would not have detected frauds from unidentified importers. In our experiment, 1,652 importers and their 170,683 items followed the same pattern (i.e., sequential and sporadic frauds) and were subjected to inspection by the hybrid strategy. Among them, only 74 importers and their trades were inspected by the full-exploitation model. This case study demonstrates that timely exploration triggers targeting systems to cope better with frauds from new importers.
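The simplest hybrid scheme discussed in this section, in which 90% of the inspected items come from an exploitation model and the remaining 10% from random exploration, can be sketched as follows. This is a minimal illustration and not the code from the released repository: `hybrid_select` and its parameters are hypothetical names, and the `scores` list merely stands in for the fraud scores an exploitation model such as DATE would produce.

```python
import random


def hybrid_select(scores, n_inspect, explore_ratio=0.1, seed=0):
    """Pick items for inspection: mostly top-scored, the rest at random.

    scores        -- one fraud score per declared item (stand-in for DATE output)
    n_inspect     -- inspection budget for this round
    explore_ratio -- share of the budget reserved for random exploration
    """
    n_explore = round(n_inspect * explore_ratio)
    n_exploit = n_inspect - n_explore
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    exploit = ranked[:n_exploit]   # most suspicious items first
    rest = ranked[n_exploit:]      # everything not yet selected
    rng = random.Random(seed)
    explore = rng.sample(rest, min(n_explore, len(rest)))
    return exploit, explore


# Toy round: 1,000 declarations, 10% inspection rate, 9:1 split.
rng = random.Random(42)
scores = [rng.random() for _ in range(1000)]
exploit, explore = hybrid_select(scores, n_inspect=100, explore_ratio=0.1)
print(len(exploit), len(explore))
```

Drawing the exploration sample only from items that the exploitation step did not select keeps the two sets disjoint, so the full inspection budget is spent on distinct items. Replacing the random draw with an uncertainty- or diversity-aware selector would correspond to the more advanced exploration strategies compared above.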
We further compared the statistics of the selected items between the two targeting systems (i.e., hybrid and full exploitation). Figure 9 illustrates how we break down the results into five components and shows their statistics. In the left figure, A ∪ B is the set of items selected by hybrid exploitation, and D ∪ E is the set of items selected by hybrid exploration. Likewise, B ∪ C ∪ E is the set of items selected by the full-exploitation model. The hybrid targeting system makes better trade selections based on the inputs it receives through exploration. Since the exploration strategy inspected 41,100 items from 8,744 importers (D ∪ E), the exploitation module selected 369,849 suspicious items from 7,944 importers (A ∪ B). In contrast, the full-exploitation model operated on a limited importer pool. In total, 410,949 items were selected from merely 1,392 importers (B ∪ C ∪ E). In addition, the detection rate of the hybrid model and the corresponding revenue per item are higher than those of the full-exploitation model. For a comprehensive understanding, we also compared the performance of the hybrid model and the full-exploitation model on various criteria over time. Detailed results are summarized in Figure 10 with explanations.
(i) Combining these effects, the hybrid model is able to secure three times more revenue than the full-exploitation model.
This paper investigates the human-in-the-loop online active learning problem, where the indicators of the annotated samples are the key criteria for evaluation. One such example can be found in customs inspection, where customs officers need to decide which new cargo to examine (i.e., an exploration strategy) while retaining the history of existing illicit trades (i.e., an exploitation strategy). We present a selection strategy that efficiently combines exploration and exploitation strategies.
Our numerical evaluation, based on multiyear transaction logs, provides insights for practical guidelines for setting model parameters in the context of customs screening systems. To facilitate adoption of the proposed approach in customs administrations, the model code is open source. It currently supports diverse exploitation and exploration strategies with various tunable parameters, ranging from models to simulation settings, so that users can confirm whether our proposed work is well suited for their data. With minor adjustments, our code can also support various decision-making problems with constrained resources. Refer to the supplementary material for the code and data availability. Our forthcoming work will include the following two areas:
• Determining the right balance for hybrid strategies [35]: In this paper, the ratio between exploration and exploitation is set empirically. The model performance is sensitive to this ratio, and the performance numbers vary depending on the dataset (Fig. 11). An adaptive algorithm for selecting this ratio will manage this trade-off better. The RP1 algorithm [36] leverages an online learning mechanism with an exponential weight framework [37] to dynamically tune this ratio, which could be applicable to our model.
• Incorporating semi-supervised techniques: Higher performance can be achieved by using richer information from the set of uninspected imports through a semi-supervised learning strategy in our framework. Building a set of augmented customs data and learning from it would be a key challenge in devising a semi-supervised learning model.
Fig. 11: The best-performing exploration ratio differs by dataset. When the exploitation strategy does not work well, increasing the exploration ratio helps (country T). (b) Country T, week 120: the model performs best with 30% exploration.
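To make the first future-work item more concrete, the sketch below shows one generic way an exponential-weights scheme could adapt the exploration ratio online. It is not the RP1 algorithm and not part of the released code: it is a plain EXP3-style bandit over a fixed set of candidate ratios, the class and parameter names are hypothetical, and the weekly rewards in the toy loop are made up (in practice the reward could be, for example, the weekly Norm-Rev@n% scaled to [0, 1]).

```python
import math
import random


class ExplorationRatioTuner:
    """EXP3-style bandit over a fixed set of candidate exploration ratios."""

    def __init__(self, ratios, gamma=0.1, seed=0):
        self.ratios = list(ratios)
        self.gamma = gamma  # uniform-mixing rate in (0, 1]
        self.weights = [1.0] * len(self.ratios)
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix normalized weights with a uniform distribution."""
        total = sum(self.weights)
        k = len(self.ratios)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def choose(self):
        """Sample the index of the ratio to use in the next round."""
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, index, reward):
        """Reweight the chosen ratio; reward must lie in [0, 1]."""
        p = self.probabilities()[index]
        estimate = reward / p  # importance-weighted reward estimate
        self.weights[index] *= math.exp(
            self.gamma * estimate / len(self.ratios))
        top = max(self.weights)  # rescale to avoid numeric overflow
        self.weights = [w / top for w in self.weights]


# Toy simulation with made-up weekly rewards (not results from the paper).
toy_reward = {0.0: 0.4, 0.1: 0.6, 0.3: 0.8, 0.5: 0.5}
tuner = ExplorationRatioTuner(list(toy_reward), gamma=0.1, seed=7)
for week in range(500):
    i = tuner.choose()
    tuner.update(i, toy_reward[tuner.ratios[i]])
print([round(p, 3) for p in tuner.probabilities()])
```

Because only the reward of the ratio actually used in a given week is observed, the update divides the reward by the selection probability, and the uniform mixing term keeps every candidate ratio being tried occasionally, which matters when trade patterns drift.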
In line with this study, we prepared a GitHub repository for simulating customs selection considering the needs of customs administration. Our code is released at https://github.com/Seondong/Customs-Fraud-Detection. The import transaction data used in the paper cannot be made public due to nondisclosure agreements. Nevertheless, the source code runs with the synthetic data we included in the repository. In the next section, we will share a step-by-step guide for running our code with synthetic data. Instructions for running the code and reproducing our experiments are as follows:
1) Set up the Python environment (e.g., Anaconda Python):
   $ source activate py37
2) Install the requirements:
   (py37) $ pip install -r requirements.txt
3) Run the simulation: Run main.py by selecting the query strategies defined in ./query_strategies/. The command below runs on synthetic data with a hybrid strategy consisting of 90% DATE and 10% bATE. By running the command, the orange line in Figure 13(c) can be reproduced.
   (py37) $ export CUDA_VISIBLE_DEVICES=3 && python main.py --data synthetic --semi_supervised 0 --batch_size 512 --sampling hybrid --subsamplings DATE/bATE --weights 0.9/0.1 --mode scratch --train_from 20130101 --test_from 20130201 --test_length 7 --valid_length 28 --initial_inspection_rate 100 --final_inspection_rate 10 --epoch 10 --closs bce --rloss full --save 0 --numweeks 100 --inspection_plan fast_linear_decay
4) Check the results: The simulation summaries are saved in .csv format in ./results/performances/. The figures in this paper can be drawn by running the Jupyter Notebooks in the ./analysis/ directory.
5) For further usage: The .sh files in the ./bash directory will give you some ideas for running repeated experiments. See main.py for hyperparameter descriptions. Customs officers can simulate our strategies using their data by plugging them into the ./data directory and adding an argument in main.py.
The framework can support new selection strategies (a simple XGBoost selection method can be found in ./query_strategies/xgb.py). For reproducibility, we provide the experimental results using synthetic import declarations. The dataset is generated by CTGAN [38] and shares similar data fields with the real datasets. It consists of 100,000 artificial imports spanning Jan 2013 to Dec 2013. The number of unique importers is 8,653, and the average illicit rate is 7.6%. Figure 12 depicts the weekly statistics of the dataset. Figure 13 shows the experimental results on synthetic data. We confirm that the synthetic data we introduce can help simulate customs selection. According to Figure 13, the advanced models show higher performance, supporting our statement that 'Synthetic data also has its fraudulent patterns'. Looking into the details further, we can re-establish our findings from Sections 5.4-5.5 with the synthetic data. We run all the experiments five times and report their averages. Figure 13(a)-(b) compares the performance of exploration strategies and exploitation strategies. Among the exploration strategies, the state-of-the-art active learning approaches (BADGE and bATE) outperform random selection by a large margin. Additionally, gATE behaves almost like random selection because its default hyperparameter θ = 0.3 (Algorithm 3) is set too high for synthetic data. However, pure exploration is not comparable to exploitation due to the nature of our problem. Note the large gap between the performance of bATE in Figure 13(a) and the performance of simple XGBoost [9] in Figure 13(b). Customs administrations should secure short-term revenue by inspecting the most likely fraudulent and highly profitable items, while also inspecting uncertain items that bring new insights into changing traffic.
DATE is well designed for that purpose, showing its effectiveness compared to the state-of-the-art classification models for tabular data (e.g., XGB, XGB with logistic regression [39], and TabNet [40]). Since the data length is relatively short, it is difficult to say that mixing exploration boosts the customs selection performance (Figure 13(c)), and yet simple exploration is effective enough to be used as a component of the hybrid strategy. Moreover, the benefit of inspecting 1% of the uncertain items is meaningful enough to compensate for the loss of not inspecting 1% of the fraudulent items.
This work was supported by the Institute for Basic Science (IBS-R029-C2, IBS-R029-Y4). We thank the World Customs Organization (WCO) and their partner countries for their support with the datasets. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of WCO.
[1] COVID-19 urgent notice: Counterfeit medical supplies and introduction of export controls on personal protective equipment
[2] Learning under concept drift: A review
[3] Active learning literature survey
[4] Performance measurement of the KCS customs selectivity system
[5] Deep batch active learning by diverse, uncertain gradient lower bounds
[6] DATE: Dual attentive tree-aware embedding for customs fraud detection
[7] Hybrid approaches for detecting credit card fraud
[8] Customs fraud detection: Assessing the value of behavioural and high-cardinality data under the imbalanced learning issue
[9] XGBoost: A scalable tree boosting system
[10] Learning with drift detection
[11] Paired learners for concept drift
[12] Adaptive random forests for evolving data stream classification
[13] Concept drift learning with alternating learners
[14] Active learning with drifting streaming data
[15] A survey of deep active learning
[16] Deep Bayesian active learning with image data
[17] Multi-step online unsupervised domain adaptation
[18] Active online domain adaptation
[19] Carpe diem, seize the samples uncertain "at the moment" for adaptive batch selection
[20] Online adaptive asymmetric active learning for budgeted imbalanced data
[21] Active learning by querying informative and representative examples
[22] Bayesian active learning for classification and preference learning
[23] Uncertainty in deep learning
[24] Learning loss for active learning
[25] Adaptive region-based active learning
[26] Active learning for convolutional neural networks: A core-set approach
[27] Diverse mini-batch active learning
[28] BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning
[29] Ada-boundary: Accelerating DNN training via adaptive boundary batch selection
[30] Active learning for skewed data sets
[31] K-means++: The advantages of careful seeding
[32] On calibration of modern neural networks
[33] Online primal-dual algorithms for maximizing ad-auctions revenue
[34] World Customs Organization
[35] Customs fraud detection in the presence of concept drift
[36] Ready policy one: World building through active learning
[37] The weighted majority algorithm
[38] Modeling tabular data using conditional GAN
[39] Practical lessons from predicting clicks on ads at Facebook
[40] TabNet: Attentive interpretable tabular learning
Sundong Kim is a senior researcher at the Data Science Group, Institute for Basic Science. Before joining IBS, he obtained his Ph.D. at KAIST. His research interests include predictive analytics with real-world data that are temporal and imbalanced in nature. He has published over 10 peer-reviewed articles in leading conferences and journals. He is a leading expert in the BACUDA initiative, developing fraud detection and category classification algorithms with the World Customs Organization.
Tung-Duong Mai is an undergraduate student at the School of Computing, KAIST. His research interests include predictive analytics with machine learning techniques. He has participated in the BACUDA project, developing customs fraud detection algorithms with the World Customs Organization.
He worked on developing an algorithm to mitigate concept drift.
Sungwon Han is a Ph.D. student at the School of Computing, KAIST. His research interests include developing robust representation learning algorithms to deal with data deficiency and corruption, which are common in publicly available datasets. He has worked on unsupervised learning algorithms for unstructured data and anomaly detection to discriminate out-of-distribution samples from data. He also developed a semi-supervised anomaly detection algorithm to discriminate illicit trades.
Sungwon Park is a master's student at the School of Computing, KAIST. His research interests include general machine learning theory and machine learning applications for social good. He worked on a cross-national customs fraud detection model using domain generalization to support developing countries' customs administrations.
Thi Nguyen D.K is an undergraduate student at the School of Computing, KAIST. Her research interest is data science, especially predictive analytics. In this project, she worked on concept drift analysis and experimented with active customs trade selection algorithms.
Jaechan So is an undergraduate student majoring in electrical engineering and computer science at KAIST. His research interests include active learning, efficient and accurate online learning strategies, and data analysis in terms of uncertainty.
Karandeep Singh is a senior researcher at the Data Science Group, Institute for Basic Science. He obtained his Ph.D. in 2019 from ETRI, Daejeon. His research interests include the application of computational techniques for networked systems and their information flows. Specific areas of application include graph-structured systems at all scales, such as interpersonal relationships in a social network, (mis)information flows on the web, and customs flows across different countries.
Meeyoung Cha is an associate professor at KAIST in the School of Computing and a Chief Investigator at the Institute for Basic Science. Her research focuses on network and data science with an emphasis on modeling, analyzing complex information propagation processes, machine learning-based computational social science, and deep learning. She has served on the editorial boards of PeerJ and ACM TSC.