With the rapid development of the interest rate market and big data, the banking industry exhibits a clear 80/20 pattern: roughly 20% of high-quality customers account for most of a bank's assets, so preventing the loss of credit card customers has become a growing concern for banks. It is therefore particularly important to establish a customer churn early-warning model. In this paper, we use the random forest method to build such a model, focusing on the churn of bank credit card customers and predicting the likelihood of future churn. Because banks hold large volumes of data, serve a complex customer base, and track diverse user characteristics, accurately predicting churned customers is difficult, and few churn early-warning studies target banks specifically. Compared with traditional bank credit risk prediction algorithms, the proposed method proves useful at the early stage of churn warning and offers high prediction accuracy, the capacity to process large amounts of data, and good model interpretability, helping banks retain valuable customers in advance and thereby reduce costs and increase efficiency.

With the emergence of Internet finance, the business situation of commercial banks has changed drastically and competition among major banks has intensified. Both the internal transformation of banks and changes in the external environment have made customer churn an increasingly serious problem. Churn has a significant impact on banking profits, and it may damage the corporate image while raising the cost of marketing to new customers [

According to customer life cycle theory, customer churn is inevitable. Through reasonable marketing strategies, banks can maximize a customer's lifetime value, thereby reducing the churn rate and increasing business revenue. Preventing churn and retaining existing customers have therefore become important concerns for major banks. Because of the large volume of data, the complexity of the customer base, and the diversity of user characteristics, it is not easy for banks to predict churn accurately, yet doing so and proposing corresponding retention measures is urgent. So far there are few studies on bank customer churn, so conducting churn prediction studies for bank customers is relevant. Reducing churn not only extends the customer life cycle and generates profit for the company, but also improves the bank's corporate image, builds word of mouth, and increases loyalty.

For the customer churn prediction problem, researchers have proposed some methods. Prasad et al. [

Chen et al. [

A comparison of existing studies shows that every model has its advantages and disadvantages. Although the studies above have contributed to customer churn prediction, few existing churn studies target bank customers, so churn research on bank customers is of practical importance. Bank data are usually high-dimensional, and random forests can handle a large number of input variables [

Random forest is a classifier that trains multiple CART (Classification And Regression Tree) decision trees and predicts samples by combining them. The basic idea is to randomly select and train multiple base classifiers, each with weak individual classification ability, so that the ensemble formed by their combination has strong classification ability.

Simply put, a random forest is a forest built in a random way. Randomness is central to the model's operation: the correlation between decision trees is reduced by randomly selecting samples and features. This randomness has two aspects: first, an equal-sized set of training samples is drawn from the original training data with replacement; second, when each decision tree is built, only a randomly selected subset of features is considered. These two kinds of randomness keep the correlation between the decision trees small, further improving the model's accuracy. The forest consists of many decision trees, each uncorrelated with the others. Whenever a new input sample enters the forest, each decision tree determines the class of the sample separately, and the class chosen most often becomes the predicted result.

Assume that the training data set contains M objects. For each tree, N samples are randomly drawn with replacement (so the bootstrap samples differ from tree to tree), and these samples form that tree's training set. At each node, p attributes (p << m) are randomly selected from all m attributes, and the attribute with the greatest information gain is chosen to split the node. An unpruned decision tree is generated. Repeating this procedure several times yields multiple decision trees, which together form a random forest. Finally, the prediction for a sample is decided by voting over the trees' individual predictions. The specific flow is shown in
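The procedure just described can be sketched as follows: bootstrap-sample the training data, grow unpruned trees that each consider only a random feature subset per split, and predict by majority vote. This is an illustrative sketch on synthetic data using scikit-learn's CART implementation; all variable names are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

k = 25                         # number of trees (odd, to avoid voting ties)
p = int(np.sqrt(X.shape[1]))   # features considered per split (p << m)

trees = []
for _ in range(k):
    # bootstrap: draw N samples with replacement from the N originals
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features=p, criterion="gini",
                                  random_state=0)  # unpruned by default
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def forest_predict(X_new):
    # each tree votes; the most frequent class wins (binary labels here)
    votes = np.stack([t.predict(X_new) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

print((forest_predict(X) == y).mean())  # training accuracy of the ensemble
```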

SMOTE (Synthetic Minority Oversampling Technique) is an improved scheme based on the random oversampling algorithm and one of the methods for improving classification performance on imbalanced data: random oversampling tends to cause model overfitting, making the information learned by the model too specific and insufficiently general. In contrast to simple resampling, SMOTE analyzes the minority-class samples and adds new samples to the data set by synthesizing them artificially from the characteristics of the minority class. To avoid the overfitting that simple resampling causes by duplicating the small number of minority samples, and to enhance the generalization of the model, the basic idea is as follows.

Obtain all minority-class samples, and for each minority-class sample x_0, find its k nearest minority-class neighbours (k is user-defined)

Set a sampling ratio according to the class imbalance, and randomly select a corresponding number of samples from the neighbours of the minority-class sample x_0

For each randomly selected neighbour x_1, create a new sample according to the following formula and add it to the data set

Expressed as a formula: x_new = x_0 + rand(0, 1) × (x_1 − x_0), where rand(0, 1) is a random number drawn uniformly from the interval (0, 1).
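The synthesis step above can be sketched by hand (illustrative only, not the reference SMOTE implementation): for each minority sample x_0, pick one of its k nearest minority neighbours x_1 and interpolate a new point between them. The toy data and names are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 3))  # toy minority-class samples

k = 5
# +1 neighbour because each point's nearest neighbour is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, idx = nn.kneighbors(minority)

synthetic = []
for i, x0 in enumerate(minority):
    x1 = minority[rng.choice(idx[i][1:])]  # random neighbour, skipping self
    gap = rng.random()                     # rand(0, 1)
    synthetic.append(x0 + gap * (x1 - x0)) # x_new = x_0 + rand * (x_1 - x_0)
synthetic = np.array(synthetic)
print(synthetic.shape)
```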

Principal Component Analysis (PCA) is a linear-algebra-based method for data dimensionality reduction. The original variables in a dataset are often correlated, so a smaller number of composite variables can capture the information they share. PCA therefore converts many variables into a few uncorrelated composite variables that still reflect the whole data set comprehensively. These composite variables are called principal components; they are mutually uncorrelated, i.e., the information they represent does not overlap.
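As a brief illustration of this idea, the sketch below builds five strongly correlated columns and projects them onto two principal components with scikit-learn; the data and component count are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# five columns that are all noisy copies of one underlying variable
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(5)])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # uncorrelated composite variables
print(Z.shape)
print(pca.explained_variance_ratio_)     # first component dominates
```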

The sample for this study consists of 10,127 customer records from a commercial bank, containing 21 fields including customer ID, churn status, age, gender, education status, marital status, estimated income, credit limit, and working balance, etc. The data were obtained from the Kaggle dataset website. The details of the customer data model as well as the business category data model are shown in the

No. | Attribute name | Attribute category |
---|---|---|
1 | Customer ID | Interval |
2 | Customer age | Interval |
3 | Gender | Nominal |
4 | Education level | Nominal |
5 | Marital status | Nominal |
6 | Income | Nominal |
… | … | … |

No. | Attribute name | Attribute category |
---|---|---|
1 | Months of user inactivity | Interval |
2 | Credit card limit | Interval |
3 | Number of banking transactions | Interval |
4 | Number of monthly bills | Interval |
5 | Total amount of customer transactions | Interval |
6 | Churn or not | Nominal |

Data pre-processing is the further processing of the raw data before modelling: checking the integrity and consistency of the data, reducing its volume and noise, and transforming it into a form suitable for computer processing. To ensure data reliability and exclude the influence of abnormal data on the results, the data must be processed and the features filtered before constructing the model:

Missing values may arise from data loss during transmission or saving failures caused by machine faults, or from subjective human error. To make effective use of the data and build an effective model, missing values in the sample data are filled with 0.

Outliers are usually considered to be data points that differ significantly from the rest of the data or do not conform to the expected normal pattern of the phenomenon being represented. Outlier detection addresses the problem of finding patterns that do not conform to expected behaviour: locating outliers and judging, according to common sense, whether they are reasonable.

Some of the data are stored as text and must be digitized before the model can process them. Attributes that are obviously irrelevant to the result, such as the user number, are removed; binary text attributes such as gender and churn status are transformed into statistics-friendly 1/0 values; non-numerical attributes such as total income, education level, and marital status are one-hot encoded, which expands the features to a certain extent; and columns containing empty data are cleaned up. After cleaning, 10,127 valid records remain.
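The cleaning steps just described can be sketched in pandas on a toy frame with the same kinds of columns; the column names and values here are illustrative assumptions, not the actual data set.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["M", "F", "M"],
    "churn": ["yes", "no", "no"],
    "education": ["Graduate", "High School", None],
    "credit_limit": [12000.0, None, 3500.0],
})

df = df.drop(columns=["customer_id"])             # irrelevant to the result
df["gender"] = (df["gender"] == "M").astype(int)  # binary text -> 1/0
df["churn"] = (df["churn"] == "yes").astype(int)
df = pd.get_dummies(df, columns=["education"])    # one-hot encoding
df = df.fillna(0)                                 # missing values -> 0
print(df.shape)
```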

After the above pre-processing of the data, the preliminary pre-processed data model is obtained, as shown in

No. | Attribute name | Attribute category |
---|---|---|
1 | Customer ID | Interval |
2 | Customer age | Interval |
3 | Gender | Interval |
4 | Number of children | Interval |
5 | Education level | Interval |
6 | Marital status | Interval |
7 | Income | Interval |
… | … | … |
16 | Months of user inactivity | Interval |
17 | Credit card limit | Interval |
18 | Number of banking transactions | Interval |
19 | Number of monthly bills | Interval |
20 | Total amount of customer transactions | Interval |
21 | Churn or not | Interval |

To test the predictive effectiveness of the model, the training data are divided in a 7:3 ratio into a training set and a validation set. The training set is used to train the model and the validation set to verify its validity, and the best-performing model is selected until a satisfactory one is obtained. Finally, after the model “passes” the validation set, the test set is used to measure its final results and evaluate its accuracy and error. The accuracy, error rate, recall, and precision of the predictions on the test set determine the model's predictive effectiveness.

The dataset contains 10,127 customers, of whom churned users account for 16%. The specified features are selected as input data, with churn status as the prediction target. Seventy percent of the samples are randomly selected as the training set to train the model, and the remaining 30% serve as the test set to evaluate its predictions. Please see

Imbalanced data can make a classifier appear numerically accurate even when its judgments on the minority class are wrong. The imbalance in the training set is handled with the SMOTE method, which adds synthetic minority-class samples to make the data better suited to the model.

Use pandas to read the data set obtained after the previous processing

Divide the dataset into feature set x and target feature y

Divide the dataset into a training set and a test set, with the test set taking a 0.3 share

Use the SMOTE function to transform the training set into a dataset with a 1:1 ratio of positive to negative samples

Use principal component analysis to reduce the dimensionality of the one-hot encoded categorical variables and thus reduce the variance, building a better model from a few principal components instead of tens of one-hot features.

The model is trained on the resulting training set and its performance is evaluated on the test set. The experiments show that training on the balanced dataset yields higher accuracy than training on the original, imbalanced data.

The processed data set was analyzed for correlation coefficients and visualized as a heat map, from which variables with correlation coefficients below 0.2 were selected; the heat map is shown in

This part mainly uses the RandomForestClassifier method from Python's sklearn package to construct the model; the important parameters are described as follows.

n_estimators: the number of trees in the forest

max_features: the number of features in the subset of features randomly selected by each base learner for slicing

max_depth: the maximum depth of the tree

min_samples_split: the minimum number of samples a node must contain for the current tree node to be split further

criterion: cutting strategy, gini or entropy

min_impurity_decrease: the minimum impurity decrease required for a split, used as a stopping condition

class_weight: set the weight of different classes of samples in the dataset, default is None, that is, the weight of all classes of samples is 1.

A random forest with k CART decision trees is constructed; based on the bootstrap sampling method, the input samples are resampled k times to serve as the inputs of the k trees. In this paper k = 100. The number of input features m for each tree is determined from the full feature set M (m = |M|), and each tree is split according to the minimum-Gini-index rule until no further splits are possible.
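A minimal sketch of this construction with sklearn's RandomForestClassifier, using the parameters discussed above (k = 100 trees, Gini splitting); the synthetic data is a stand-in for the bank data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,     # k = 100 CART trees, each on a bootstrap sample
    criterion="gini",     # split by minimum Gini index
    max_features="sqrt",  # random feature subset considered at each split
    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))    # accuracy on the training data
```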

Different test-set divisions of the same data are selected and the model is cross-validated five times to obtain the F1-score. The model maintains a high F1-score of 91.42%–92.2%. Please see
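The five-fold cross-validation could be run as below; the data is a synthetic stand-in, so the scores will not match the paper's reported range.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# five different train/test divisions of the same data, scored by F1
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(scores.mean())
```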

In this paper, the four parameters min_samples_split, max_depth, criterion, and n_estimators are optimized using the GridSearchCV grid search method, which iterates over all combinations of the candidate parameter values and returns the best result within the grid. The optimal parameters obtained are listed in the following

Parameter | Value |
---|---|
min_samples_split | 5 |
max_depth | 13 |
criterion | entropy |
n_estimators | 46 |
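The tuning step could look like the sketch below; the grid values are illustrative, narrowed around the reported optimum rather than the paper's full search range, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# candidate values for the four tuned parameters (illustrative grid)
param_grid = {
    "min_samples_split": [2, 5],
    "max_depth": [10, 13],
    "criterion": ["gini", "entropy"],
    "n_estimators": [46, 100],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X, y)  # tries every combination with 5-fold cross-validation
print(search.best_params_)
```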

The ACC and AUC obtained using the parameter-tuned random forest model for the test set are 0.9084 and 0.9103, respectively, please see

Model | ACC (train) | ACC (test) | AUC (train) | AUC (test) |
---|---|---|---|---|
Random forest | 0.9226 | 0.9083 | 0.9635 | 0.9103 |
AdaBoost | 0.8963 | 0.8901 | 0.8907 | 0.8901 |
SVM | 0.8979 | 0.8859 | 0.8941 | 0.8870 |
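The ACC and AUC figures in the table could be computed as sketched below; synthetic data stands in for the bank data set, so the printed values will not reproduce the reported numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

acc = accuracy_score(y_te, clf.predict(X_te))            # ACC (test)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]) # AUC (test)
print(acc, auc)
```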

Customer churn forecasting is the process of identifying potentially lost customers from historical customer records. It is an important concern for many industries, particularly the highly competitive and increasingly liberalized telecommunications, finance, passenger transportation, and newspaper industries at home and abroad, and has received wide attention from both academia and industry. The key to churn prediction lies in the accuracy of the model, its interpretability, and the quality of the data feature variables; building an efficient prediction model and finding effective data variables are central problems in customer churn prediction and customer relationship management. In previous studies, scholars have predicted customer churn with various algorithms and models, showing that churn prediction models are effective, can accurately reveal the real churn situation for enterprises, can provide decision support for comprehensive and effective customer relationship management, and have wide application prospects. Establishing customer churn early warning has therefore become an important way to save lost customers. In this paper, we construct a model using the random forest algorithm, focus on the churn of bank credit card customers, predict the likelihood of future churn, and establish a practically meaningful customer churn early-warning model. Compared with traditional bank credit risk prediction algorithms, it predicts churned customers more accurately and efficiently and provides guidance for banks' customer retention strategies.