title: A comparative study on machine learning models combining with outlier detection and balanced sampling methods for credit scoring
authors: Qian, Hongyi; Zhang, Shen; Wang, Baohui; Peng, Lei; Gao, Songfeng; Song, You
date: 2021-12-25

Peer-to-peer (P2P) lending platforms have grown rapidly over the past decade as network infrastructure has improved and the demand for personal lending has grown. Such platforms allow users to create peer-to-peer lending relationships without the help of traditional financial institutions. Assessing borrowers' credit is crucial to reduce the default rate and support the healthy development of P2P platforms. Building a personal credit scoring model with machine learning can effectively predict whether users will repay loans on a P2P platform, and the handling of data outliers and sample imbalance affects the final performance of such models. Balanced sampling methods have received some attention, but the effect of outlier detection methods, and of their combination with balanced sampling methods, on machine learning models has not been fully studied. In this paper, the influence of different outlier detection methods and balanced sampling methods on commonly used machine learning models is investigated. Experiments on 44,487 Lending Club samples show that proper outlier detection can improve the effectiveness of machine learning models, while balanced sampling benefits only a few models, such as MLP.

Personal credit is one of the main businesses of digital financial services, and peer-to-peer (P2P) lending is the main battlefield of the personal credit business. As intermediary institutions, P2P lending platforms connect borrowers and lenders, bypassing the cumbersome processes of traditional financial institutions and reducing transaction costs (Zhao et al., 2017). This type of lending usually involves smaller amounts but carries a higher risk of default. Since the outbreak of COVID-19 in late 2019, the global P2P lending market has faced a growing number of loan defaults, which places higher demands on lenders when reviewing personal loan applications. To reduce losses from loan defaults, scholars have proposed methods for predicting the default probability of personal loans, also called credit scoring (Crook et al., 2007; Dastile et al., 2020; Lessmann et al., 2015). Specifically, credit scoring comprehensively examines the various indicators of a borrower and assesses their ability to fulfill economic commitments, which is usually modeled as a binary classification problem. For example, Lending Club is the largest online loan marketplace, and borrowers can easily access lower interest rates than those offered by traditional channels.

Notes for Table 1: bagging with differentiated sampling rates; SBC is an under-sampling method based on clustering; REMDD is a resampling ensemble model based on data distribution; DM-ACME is a distance-to-model and adaptive clustering-based multi-view ensemble learning method; MV-ACME is DM-ACME with ensemble strategies composed of hard probability and majority voting; BRF is the balanced random forest; KNU is the k-Nearest Oracles-Union; RMkNN is the reduced minority kNN; B-Net is the Bayesian network; RP is the random projection; Mg-GBDT is the multi-grained augmented gradient boosting decision trees; AugBoost is the unsupervised feature augmentation Boosting; GCN is the graph convolutional network.
As one of the most commonly used datasets in the field of credit scoring, the Lending Club dataset has been the subject of much research. Some of these results are shown in Table 1. It should be noted that, although the table shows the optimal results of these studies, they are not directly comparable because of differing experimental conditions. Nevertheless, an observable trend is that ensemble learning methods and hybrid models often achieve better results than single classifiers (Liu et al., 2021; Xia et al., 2018). Recently, some studies have applied deep learning methods to capture nonlinear relationships between borrowers' attributes and default risk: (Bastani et al., 2019) and (Lee et al., 2021) applied wide and deep learning (WDL)-based and graph convolutional network (GCN)-based credit default prediction models to P2P lending, respectively.

Like most personal credit scoring datasets, the Lending Club dataset suffers from the class imbalance problem. Defaulting customers are only a minority, which affects the construction of credit scoring models. Many balanced sampling methods have been applied to address this problem (Moscato et al., 2021; Namvar et al., 2018). Commonly used methods include simple random sampling methods, such as random undersampling (RUS) and random oversampling (ROS), and more complex oversampling methods, such as the synthetic minority oversampling technique (SMOTE) and its variants.

Besides the problem of an unbalanced dataset, outliers also affect the robustness of credit scoring models built on real-world data, and outlier detection algorithms can identify noise or abnormal values and reduce their impact on the model. There have been some attempts to study outlier detection on other credit scoring datasets (Wei et al., 2019; Xia, 2019; Zhang et al., 2021). However, existing Lending Club research focuses only on balanced sampling methods, leaving outlier detection methods unexplored. This article aims to build a credit scoring framework that covers a variety of outlier detection methods, balanced sampling methods, and models; to find the applicable conditions of the different methods; and to find their optimal combination. To the best of our knowledge, this is the first comparative study that combines outlier detection, imbalanced data processing, and machine learning models on the Lending Club dataset.

The rest of the paper is organized as follows. Section 2 introduces common outlier detection methods, balanced sampling methods, and machine learning models, and their applications in the field of credit scoring. Section 3 presents the proposed credit scoring framework. Section 4 describes the experimental settings, including the dataset description and evaluation criteria. The experimental results are analyzed in Section 5. Section 6 presents the conclusion and discusses directions for future work.

The proposed framework includes three technologies: outlier detection methods, balanced sampling methods, and machine learning models. This section reviews the representative work in each area. Real-world datasets usually contain a small number of outliers, arising from natural or man-made causes such as small-probability events or input errors in the data acquisition system, and these outliers affect the performance of machine learning models. In response to this problem, scholars have proposed a series of outlier detection algorithms to identify the outliers in datasets.

The local outlier factor (LOF) (Breunig et al., 2000) and isolation forest (IForest) (Liu et al., 2008) are classical outlier detection methods. LOF judges whether an instance is an outlier by comparing the density of each point with that of its neighboring points; a low-density point is more likely to be recognized as an outlier. However, (Liu et al., 2008) pointed out that density is not always a good measure for outlier detection, because the density of a group of outliers can be high, while the density at the edges of inlier clusters can be low. IForest is an ensemble of multiple decision trees that recursively partition the space with hyperplanes; the number of hyperplanes needed to isolate an outlier in a low-density region is smaller than that needed for an inlier. (He et al., 2003) improved LOF and proposed a cluster-based LOF outlier detection method (CBLOF). Among the latest outlier detection methods, the copula-based outlier detection method (COPOD) is inspired by copulas for modeling multivariate data distributions: COPOD first constructs an empirical copula and then uses it to predict the tail probability of each given data point to determine its level of extremeness, which makes it an ideal choice for high-dimensional datasets.

Outlier detection algorithms have also been applied in credit scoring. One line of work proposed a bagging-based local outlier factor algorithm that identifies outliers and subsequently boosts them back into the training set to form an outlier-adapted training set, enhancing the outlier adaptability of base classifiers. In a follow-up study, (Zhang et al., 2021) proposed a new voting-based outlier detection method that enhances classic outlier detection algorithms by integrating their outlier scores through a weighted voting mechanism and boosting the scores into the training set to form an outlier-adapted training set. Outlier detection is also used for reject inference in credit scoring, where employing the rejected data can mitigate the sample bias caused by building the model only on accepted applicants. For example, (Xia, 2019) applies IForest to find "good" applicants whose loan applications were rejected due to accidental factors.
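To make the interface concrete, the sketch below scores a synthetic feature matrix with three of the detectors reviewed above via the pyod library; CBLOF follows the same fit/decision_scores_ interface. This is a minimal sketch with default parameters on made-up data, not the paper's setup:

```python
import numpy as np
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.copod import COPOD

# Synthetic stand-in for a credit feature matrix: 990 inliers plus 10 planted outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(990, 10)),
               rng.normal(6, 1, size=(10, 10))])

for det in (LOF(), IForest(), COPOD()):
    det.fit(X)                      # unsupervised: fit on the features only
    scores = det.decision_scores_   # higher score = more anomalous
    top = np.argsort(scores)[-10:]  # indices of the 10 most anomalous rows
    print(type(det).__name__, "flags", np.sum(top >= 990), "of the 10 planted outliers")
```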
Real-world datasets often face the class imbalance problem; default instances in personal credit scoring datasets are usually the minority class. When the ratio of majority to minority samples exceeds 3:1, the predictions of classifiers become skewed toward the majority class. Using a balanced sampling method, a dataset with balanced positive and negative samples can be obtained to eliminate this bias during classifier training. The existing balanced sampling methods can be divided into three categories (a minimal sketch of one method from each category follows this list):

• Under-sampling methods: selecting samples from the original set so that the numbers of positive and negative instances become equal. For example, random undersampling (RUS) is a fast and easy way to balance the data by randomly selecting a subset of the majority class, and the edited nearest neighbours (ENN) rule (Wilson, 1972) removes samples whose class differs from the majority class of their neighborhood.

• Over-sampling methods: generating new minority samples until the classes are balanced, such as random oversampling (ROS), the synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002), and adaptive synthetic sampling (ADASYN) (He et al., 2008). Specifically, ADASYN focuses on generating samples next to the original samples that are wrongly classified by a k-nearest neighbors classifier, while the basic implementation of SMOTE makes no distinction between easy and hard samples when interpolating with the nearest neighbors rule.

• Hybrid methods: combining over-sampling and under-sampling. SMOTE can generate noisy samples by interpolating new points between marginal outliers and inliers, and this issue can be solved by cleaning the space resulting from over-sampling. In this regard, SMOTETomek (Batista et al., 2004) and SMOTEENN (Batista et al., 2003) add Tomek's links (Tomek, 1976) and ENN, respectively, to the pipeline after applying SMOTE over-sampling to obtain a cleaner space.
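The sketch below applies one method from each category to an imbalanced toy dataset with the imbalanced-learn library (default parameters; the dataset is illustrative, not the paper's):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Toy data with roughly a 3:1 majority-to-minority ratio.
X, y = make_classification(n_samples=4000, weights=[0.75, 0.25], random_state=0)

for sampler in (RandomUnderSampler(random_state=0),   # under-sampling
                SMOTE(random_state=0),                # over-sampling
                SMOTEENN(random_state=0)):            # hybrid
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```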
The balanced sampling method is widely used in the credit scoring field. (Moscato et al., 2021) and (Namvar et al., 2018) compared the performance of various balanced sampling methods on the Lending Club dataset. In their research, RUS achieved the best results among all sampling methods; however, neither study included the no-sampling baseline in the experiments as a comparison. Beyond the usual methods, there have been other attempts: (Engelmann & Lessmann, 2021) proposed an approach based on a conditional Wasserstein GAN that can effectively model tabular datasets with numerical and categorical variables and pays special attention to the downstream classification task through an auxiliary classifier loss.

Machine learning is dedicated to the study of how to use experience to improve the performance of a system itself through computation on data, and it is an effective method for solving complex real-world problems. Over the years of its development, many machine learning models have emerged. Among linear models, logistic regression (LR) is the most commonly used algorithm and often serves as a benchmark for model performance comparisons (Ala'Raj & Abbod, 2016). For datasets with fewer data dimensions (e.g., the Australia, Germany, and Japan credit scoring datasets), it can effectively mine the interrelationships between credit data variables (Abellán & Castellano, 2017; Crook et al., 2007; Zhang et al., 2019). However, its effectiveness is limited when dealing with high-dimensional, sparse data. Another commonly used credit scoring benchmark is the multilayer perceptron (MLP) (Garcia et al., 2019; Hájek, 2011; Moscato et al., 2021). An MLP is composed of an input layer, hidden layers, and an output layer, and a backpropagation algorithm is used to learn the model parameters. MLP is susceptible to outliers and sample imbalance.

With the development of machine learning, ensemble learning models have gradually become a popular research direction. Unlike traditional methods that train a single classifier, ensemble learning methods usually consist of multiple base classifiers. (Hansen & Salamon, 1990) and (Schapire, 1990) pointed out that an ensemble of classifiers often predicts more accurately than the best individual classifier. The representative ideas of ensemble learning are Bagging (Breiman, 1996) and Boosting (Schapire, 1999). Random forest (RF) (Breiman, 2001) is a representative Bagging method: an ensemble of multiple decision trees, each acting as a weak learner. The gradient boosting decision tree (GBDT) is one of the popular Boosting algorithms (Friedman, 2001); GBDT uses an additive model that is a linear combination of basis functions and continuously reduces the residuals generated during training. Ensemble tree models are usually more robust to outliers and sample imbalance.

In credit scoring, (Malekipirbazari & Aksakalli, 2015) proposed an RF-based classification method for predicting borrower status; compared with other algorithms such as LR, SVM, and KNN, RF obtained the best results. Recently, more and more state-of-the-art solutions have been built on GBDT: (Liu et al., 2021) proposed step-wise multi-grained augmented gradient boosting decision trees (mg-GBDT) for credit scoring, which adopts multi-granularity scanning for feature enhancement, enriching the input features of GBDT.
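As a minimal illustration of the Bagging/Boosting distinction reviewed above (stock scikit-learn implementations with default parameters on synthetic data; not the paper's tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly matching a 24% default rate.
X, y = make_classification(n_samples=2000, weights=[0.76, 0.24], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Bagging: many independent trees grown on bootstrap samples, then averaged.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("RF AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))

# Boosting: trees added sequentially, each reducing the current ensemble's residual.
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
for i, proba in enumerate(gbdt.staged_predict_proba(X_te)):
    if (i + 1) % 50 == 0:  # AUC typically improves as boosting stages accumulate
        print(f"GBDT AUC after {i + 1} trees:", roc_auc_score(y_te, proba[:, 1]))
```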
In this paper, a systematic study combining multiple outlier detection methods, balanced sampling methods, and machine learning models is presented. The experimental process is shown in Figure 1 and can be divided into four steps: data preprocessing, outlier detection, balanced sampling, and model training and testing. Data preprocessing includes missing value filling and standardization. Then, samples containing outliers are removed from the dataset by an outlier detection algorithm. After that, a balanced sampling method is used to balance the numbers of positive and negative samples. Finally, the machine learning model is trained and tested on the processed dataset. The entire process is discussed in detail below.

Step 1. Data preprocessing. Fill missing values and standardize. Some numeric variables in the original dataset have missing values, which need to be preprocessed; these missing values are filled with the mean of the indicator across all samples. In addition, since different variables usually have different value ranges, each indicator needs to be standardized to ensure the validity of the credit scoring model. In this paper, each variable is normalized according to formula 1:

z = (x - u) / s    (1)

where u is the mean of the samples and s is the standard deviation of the samples. Specifically, the scikit-learn library is used to fill in missing values and normalize the data.

Step 2. Outlier detection. Calculate the anomaly degree of each sample and remove samples with high anomaly degrees from the training data according to a set threshold. Real datasets always contain some outlier samples, which can mislead the training of machine learning models, especially linear models such as LR. In this paper, four outlier detection algorithms are used: LOF, CBLOF, IForest, and COPOD. Specifically, the algorithms imported from the pyod library are used to evaluate the outlier degree of each sample, with default parameters for each algorithm. The samples with the top 0.5% of outlier scores are removed from the training data.
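A minimal sketch of Steps 1 and 2 under the assumptions stated above (mean imputation, z-score standardization, trimming the top 0.5% of outlier scores); IForest is shown, but any of the four detectors fits, and the function name and numpy-array inputs are illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from pyod.models.iforest import IForest

def preprocess_and_trim(X_train, y_train, trim_quantile=0.995):
    # Step 1: fill missing values with the column mean, then standardize.
    X = SimpleImputer(strategy="mean").fit_transform(X_train)
    X = StandardScaler().fit_transform(X)

    # Step 2: score every sample and drop the top 0.5% most anomalous ones.
    det = IForest()
    det.fit(X)
    keep = det.decision_scores_ <= np.quantile(det.decision_scores_, trim_quantile)
    return X[keep], y_train[keep]
```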
Step 3. Balanced sampling. Construct a dataset with balanced positive and negative samples using a balanced sampling method. Credit scoring datasets often face the problem of sample imbalance, which can heavily bias a classification model. For example, MLP calculates the loss of each sample and back-propagates the weight updates, so sample imbalance easily leads to an MLP whose weight coefficients "serve" the majority-class samples and which has less ability to discriminate the minority-class samples. Therefore, balanced sampling methods are used in this paper to build datasets with balanced positive and negative samples, as shown in Table 2. A total of eight sampling methods in three categories are used to solve the sample imbalance problem: RUS, IHT, and ENN are under-sampling methods; ROS, SMOTE, and ADASYN are over-sampling methods; SMOTEENN and SMOTETomek are hybrid methods. Specifically, the imbalanced-learn library is used for the balancing process, with default parameters for each algorithm.

Step 4. Model training and testing. Train the models and compare the predicted results with the true values. Experiments were conducted with a variety of machine learning models, including LR, MLP, RF, and GBDT. Specifically, the machine learning models are imported from the scikit-learn library, and a 5-fold cross-validation training process is used: each time, 80% of the data is selected as the training set and the remaining 20% is used as the test set. The hyperparameters of each model are shown in Table 3.
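A sketch of the Step 4 training loop, assuming the four scikit-learn models with default hyperparameters (the paper's tuned hyperparameters are in Table 3 and are not reproduced here); the `outlier_step` and `sampler` hooks are hypothetical names for the steps above, and X, y are assumed to be numpy arrays:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

models = {
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500),
    "RF": RandomForestClassifier(),
    "GBDT": GradientBoostingClassifier(),
}

def evaluate(X, y, outlier_step=None, sampler=None):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # 80/20 folds
    results = {}
    for name, model in models.items():
        aucs = []
        for tr, te in cv.split(X, y):
            X_tr, y_tr = X[tr], y[tr]
            if outlier_step is not None:   # e.g. preprocess_and_trim from above
                X_tr, y_tr = outlier_step(X_tr, y_tr)
            if sampler is not None:        # e.g. an imbalanced-learn sampler
                X_tr, y_tr = sampler.fit_resample(X_tr, y_tr)
            model.fit(X_tr, y_tr)
            aucs.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))
        results[name] = np.mean(aucs)
    return results
```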
This section describes the experimental setup in detail, consisting of two parts: the dataset description and the evaluation criteria. The dataset used in this paper is from Lending Club in Q2 2017. Lending Club is a US P2P lending company headquartered in San Francisco, California. It was the first P2P lender to offer loan trading on a secondary market and is the world's largest P2P lending platform. The original dataset contains 105,451 samples and 151 features, with "loan status" as the target variable. As shown in Table 4, "loan status" has 7 states: "Current", "Fully Paid", "Charged Off", "Late (31-120 days)", "In Grace Period", "Late (16-30 days)", and "Default". Following the practice of previous papers (Moscato et al., 2021; Namvar et al., 2018), only samples with "loan status" equal to "Charged Off" or "Fully Paid" are taken as positive and negative samples, respectively. This results in an unbalanced dataset of 44,487 samples with a positive sample ratio of 23.80%. Some variables are preprocessed before being input to the model:

• Models such as LR cannot directly handle categorical variables. The category variables "sub grade", "home ownership", "verification status", "purpose", "addr state", "initial list status", and "application type" are one-hot encoded;

• The original credit score "fico score" provides "low" and "high" values, which are replaced by their average;

• Log transformations are applied to exponentially distributed numerical variables such as "annual inc" and "revol bal".

After the above processing, the actual number of variables input to the model is 123. Table 6 shows the statistical characteristics of the numerical variables.

Comparing the effectiveness of models requires appropriate evaluation criteria. In the field of credit scoring, overall accuracy (ACC) is one of the most commonly used indicators, defined as the ratio of the number of correctly classified samples to the total number of samples. In addition, TNR, TPR, FPR, and G-mean are also popular evaluation indicators. These indicators are calculated according to formulas 2-6:

ACC = (TP + TN) / (TP + TN + FP + FN)    (2)
TNR = TN / (TN + FP)    (3)
TPR = TP / (TP + FN)    (4)
FPR = FP / (FP + TN)    (5)
G-mean = √(TPR × TNR)    (6)

The probability threshold for classifying the test set is set so that the predicted positive and negative sample ratios remain consistent with the original dataset. The confusion matrix consisting of TP, TN, FP, and FN in these formulas is shown in Table 7, where TP and TN represent the numbers of correctly classified normal and default users, respectively, and FP and FN denote the numbers of misclassified normal and default users, respectively. However, the above metrics are strongly influenced by the probability threshold; the comprehensive performance of a classifier can be better evaluated using the area under the curve (AUC). AUC is an evaluation metric based on the receiver operating characteristic (ROC) curve: it equals the area under the ROC curve and reflects the trade-off between the true positive rate and the false positive rate of the prediction model under different probability thresholds. AUC is used as the main evaluation metric.
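Formulas 2-6 and the ratio-preserving threshold can be computed from model scores as in the following sketch (standard scikit-learn calls; the function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate_scores(y_true, y_score, positive_ratio=0.2380):
    # Threshold so the predicted positive ratio matches the dataset's ratio.
    threshold = np.quantile(y_score, 1.0 - positive_ratio)
    y_pred = (y_score >= threshold).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)   # formula 2
    tnr = tn / (tn + fp)                    # formula 3
    tpr = tp / (tp + fn)                    # formula 4
    fpr = fp / (fp + tn)                    # formula 5
    g_mean = np.sqrt(tpr * tnr)             # formula 6
    auc = roc_auc_score(y_true, y_score)    # threshold-independent
    return {"ACC": acc, "TNR": tnr, "TPR": tpr, "FPR": fpr,
            "G-mean": g_mean, "AUC": auc}
```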
The performance of the classifiers was evaluated under different outlier detection and sampling strategies.

5.1. The original results of each model

Table 8 shows the classification results of LR. In general, the differences introduced by the different outlier detection and balanced sampling methods are not significant: the gap between the first- and last-ranked AUC is only 0.0064. Among the combinations, the most effective is LOF with no sampling (AUC 0.7079), and the least effective is no outlier detection with ADASYN (AUC 0.7015). Here, the most and least effective strategies are those achieving the maximum and minimum AUC, as for the other models. However, under other criteria such as G-mean, the best strategy for LR is IForest with SMOTE.

Table 9 shows the classification results of MLP, for which the different outlier detection and balanced sampling methods produce the largest differences: the gap between the first- and last-ranked AUC is 0.0879. This suggests that MLP is very sensitive to outliers and sample imbalance. The most effective combination is LOF with IHT (AUC 0.7041), slightly below LR, and the least effective is no outlier detection with ADASYN (AUC 0.6162), the same combination as for LR. Under other criteria such as G-mean, the best strategy for MLP is CBLOF with IHT.

Among the best strategies for each model, LOF outlier detection and no balanced sampling each take three seats. IForest is more effective for RF, and the IHT under-sampling method is more effective for MLP. RF performs better than MLP when no outlier detection or balanced sampling is used; however, combining LOF and IHT, the AUC of the best MLP model even surpasses that of the best RF model.

The mode "None" compares the models' classification results without outlier detection or balanced sampling. The mode "Mean" compares the average classification results of each model over every outlier detection and balanced sampling method. The mode "Best" compares the best classification results that each model can achieve.

Table 13 synthesizes the results of the different outlier detection methods. In general, using any outlier detection method gives better results than using none. Averaging the experimental results of the same outlier detection algorithm (9 × 4 = 36 settings), the ranking from highest to lowest is IForest, COPOD, CBLOF, LOF, None. The average classification results using IForest are the best, but in most cases LOF helps the models achieve the best single results, according to Table 12. The "Rank" is calculated as the mean AUC ranking of the outlier detection methods under the same model and the same balanced sampling method.

Table 14 synthesizes the results of the different balanced sampling methods. Averaging the experimental results of the same balanced sampling method (5 × 4 = 20 settings), not using any balanced sampling method performs best, with simple RUS and ROS ranked second and third. More complex balanced sampling methods are less effective: the commonly used SMOTE algorithm ranks only second to last, and the ADASYN algorithm, which achieved the worst results across the four models, ranks last. Except for the effect of IHT on MLP, the balanced sampling methods reduce the classification results for the other models. This ranking is similar to two related studies (Moscato et al., 2021; Namvar et al., 2018) that also used Lending Club data: in their research, RUS achieved the best results among all sampling methods, but neither included the no-sampling baseline as a comparison. The "Rank" is calculated as the mean AUC ranking of the balanced sampling methods under the same model and the same outlier detection method.

Credit scoring is crucial for P2P lending platforms, and outliers and sample imbalance make the construction of credit scoring models challenging. There have been many studies using balanced sampling methods to deal with the sample imbalance problem (Moscato et al., 2021; Namvar et al., 2018; Niu et al., 2020), but few have incorporated outlier detection methods into the modeling process. This paper provides a systematic study of the application of outlier detection and balanced sampling methods in the field of credit scoring. The experiments use real P2P lending platform data and cover a variety of outlier detection methods, balanced sampling methods, and machine learning models. The experimental results show that outlier detection algorithms can enhance the robustness of the models. In addition, a suitable balanced sampling method brings a large improvement to MLP but is relatively less useful for LR, RF, and GBDT. In future work, firstly, more advanced models, such as ensemble strategies or deep learning models, will be studied. Secondly, feature selection is also an important part of credit scoring model construction and will be integrated into the existing framework.
References

A comparative study on base classifiers in ensemble methods for credit scoring
Classifiers consensus system approach for credit scoring
Wide and deep learning for peer-to-peer lending
Balancing Training Data for Automated Annotation of Keywords: a Case Study
A study of the behavior of several methods for balancing machine learning training data
Bagging predictors
Random forests
LOF: Identifying Density-Based Local Outliers
SMOTE: synthetic minority oversampling technique
Statistical and machine learning models in credit scoring: A systematic literature survey
Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning
Greedy function approximation: A gradient boosting machine
Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction
Municipal credit rating modelling by neural networks
Neural network ensembles
ADASYN: Adaptive synthetic sampling approach for imbalanced learning
A novel ensemble method for credit scoring: Adaption of different imbalance ratios
Discovering cluster-based local outliers
Dempster-Shafer Fusion of Semi-supervised Learning Methods for Predicting Defaults in Social Lending
Graph convolutional network-based credit default prediction utilizing three types of virtual distances among borrowers
Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research
COPOD: Copula-Based Outlier Detection
Isolation Forest
Step-wise multi-grained augmented gradient boosting decision trees for credit scoring
Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning
Risk assessment in social lending via random forests
A novel approach to define the local region of dynamic selection techniques in imbalanced credit scoring problems
A benchmark of machine learning approaches for credit score prediction. Expert Systems with Applications
Credit risk prediction in an imbalanced social lending environment
Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending
A Comparison of Credit Rating Classification Models Based on Spark-Evidence from Lending-club
The Strength of Weak Learnability
A Brief Introduction to Boosting
An instance level analysis of data complexity
Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending
Best classification algorithms in peer-to-peer lending
Two modifications of CNN
A Novel Noise-Adapted Two-Layer Ensemble Model for Credit Scoring Based on Backflow Learning
Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. Systems, Man and Cybernetics
A Novel Reject Inference Model Using Outlier Detection and Gradient Boosting Technique in Peer-to-Peer Lending
Predicting loan default in peer-to-peer lending using narrative data
A novel heterogeneous ensemble credit scoring model based on bstacking approach
Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending
A novel multi-stage hybrid model with enhanced multi-population niche genetic algorithm: An application in credit scoring
A new hybrid ensemble model with voting-based outlier detection and balanced sampling for credit scoring
A novel multi-stage ensemble model with enhanced outlier adaptation for credit scoring
P2P Lending Survey: Platforms, Recent Advances and Prospects

Acknowledgments

This work was supported by HuaRong RongTong (Beijing) Technology Co., Ltd. We acknowledge HuaRong RongTong (Beijing) for providing us with high-performance machines for computation. We also acknowledge the anonymous reviewers for their detailed suggestions, which helped us improve the quality of this manuscript.