
Imbalanced Data in classification: A case study of credit scoring


MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH


Ho Chi Minh City - 2024


STATEMENT OF AUTHENTICATION

I certify that the Ph.D. dissertation, "Imbalanced data in classification: A case study of credit scoring", is solely my own research.

This dissertation is used only for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH); no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.

Ho Chi Minh City, April 2, 2024

ACKNOWLEDGEMENTS

First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Dr. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the process of conducting this Ph.D. dissertation.

I sincerely thank the teachers of UEH's doctoral training program for imparting valuable knowledge, and the teachers at the Department of Mathematics and Statistics, UEH, for their sincere comments on my dissertation.

I sincerely thank Dr. Le Thi Thanh An for her moral and academic support so that I could complete the research. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.

Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.

Ho Chi Minh City, April 2, 2024

TABLE OF CONTENTS

1.3 Research gap identifications
1.3.1 Gaps in credit scoring
1.3.2 Gaps in the approaches to solving imbalanced data
1.3.3 Gaps in Logistic regression with imbalanced data
1.4 Research objectives, research subjects, and research scopes
1.5 Research data and research methods
1.6 Contributions of the dissertation
1.7 Dissertation outline

2 LITERATURE REVIEW OF IMBALANCED DATA
2.1 Imbalanced data in classification
2.1.1 Description of imbalanced data
2.1.2 Obstacles in imbalanced classification
2.1.3 Categories of imbalanced data
2.2 Performance measures for imbalanced data
2.2.1 Performance measures for labeled outputs
2.2.1.1 Single metrics
2.2.1.2 Complex metrics
2.2.2 Performance measures for scored outputs
2.2.2.1 Area under the Receiver Operating Characteristic curve
2.3.3.1 Integration of algorithm-level method and ensemble classifier algorithm


2.3.3.2 Integration of data-level method and ensemble classifier algorithm
2.3.3.3 Comments on ensemble-based approach
2.3.4 Conclusions of approaches to imbalanced data
2.4 Credit scoring
2.4.1 Meaning of credit scoring
2.4.2 Inputs for credit scoring models
2.4.3 Interpretability of credit scoring models
2.4.4 Approaches to imbalanced data in credit scoring
2.4.5 Recent credit scoring ensemble models
2.5 Chapter summary

3 IMBALANCED DATA IN CREDIT SCORING
3.1 Classifiers for credit scoring
3.1.1.6 Support vector machine
3.1.1.7 Artificial neural network
3.1.2 Ensemble classifiers
3.1.2.1 Heterogeneous ensemble classifiers
3.1.2.2 Homogeneous ensemble classifiers
3.1.3 Conclusions of statistical models for credit scoring
3.2 The proposed credit scoring ensemble model based on Decision tree
3.2.1 The proposed algorithms
3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm
3.2.2 Empirical data sets


3.2.3 Computation process
3.2.4 Empirical results
3.2.4.1 The optimal Decision tree ensemble classifier
3.2.4.2 Performance of the proposed model on the Vietnamese data sets
3.2.4.3 Performance of the proposed model on the public data sets
3.2.4.4 Evaluations
3.2.5 Conclusions of the proposed credit scoring ensemble model based on Decision tree
3.3 The proposed algorithm for imbalanced and overlapping data
3.3.1 The proposed algorithms
3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data
3.3.1.2 Algorithm for constructing ensemble model
3.3.2 Empirical data sets

4 A MODIFICATION OF LOGISTIC REGRESSION WITH IMBALANCED DATA
4.2.1 Prior correction
4.2.2 Weighted likelihood estimation (WLE)
4.2.3 Penalized likelihood regression (PLR)
4.3 The proposed works
4.3.1 The modification of the cross-validation procedure
4.3.2 The modification of Logistic regression
4.4.6 Important variables for output
4.4.6.1 Important variables for the F-LLR fitted model
4.4.6.2 Important variables of the Vietnamese data set
4.5 Discussions and Conclusions
5 CONCLUSIONS
5.1.1 The interpretable credit scoring ensemble classifier
5.1.2 The technique for imbalanced data, noise, and overlapping samples

C.1 German credit data set (GER)
C.2 Vietnamese 1 data set (VN1)
C.3 Vietnamese 2 data set (VN2)
C.4 Taiwanese credit data set (TAI)
C.5 Bank personal loan data set (BANK)
C.6 Hepatitis C patients data set (HEPA)
C.7 The Loan schema data from lending club (US)
C.8 Vietnamese 3 data set (VN3)
C.9 Australian credit data set (AUS)
C.10 Credit risk data set (Credit 1)
C.11 Credit card data set (Credit 2)
C.12 Credit default data set (Credit 3)
C.13 Vietnamese 4 data set (VN4)

LIST OF ABBREVIATIONS

ADASYN: Adaptive synthetic sampling
ANN: Artificial neural network
AUC: Area under the Receiver Operating Characteristic curve
DTE(B): Decision tree ensemble algorithm
F-CV: F-measure-oriented cross-validation procedure
F-LLR: F-measure-oriented Lasso-Logistic regression
FN, FNR: False negative, False negative rate
FP, FPR: False positive, False positive rate
ID: Imbalanced data
IR: Imbalanced ratio
KS: Kolmogorov-Smirnov statistic
LLR: Lasso-Logistic regression
LR: Logistic regression
OUS(B): Algorithm for balancing data
PLR: Penalized likelihood regression
ROS: Random over-sampling
RUS: Random under-sampling
SMOTE: Synthetic Minority Over-sampling Technique
SVM: Support vector machine
TN, TNR: True negative, True negative rate
TOUS(B), TOUS-F(B): Algorithms for dealing with noise, overlapping, and imbalanced data
TP, TPR: True positive, True positive rate
WLE: Weighted likelihood estimation

LIST OF FIGURES

2.7 Illustration of ROS technique
2.8 Illustration of SMOTE technique
2.9 Approaches to imbalanced data in classification
3.1 Illustration of a Decision tree
3.2 Illustration of a decision boundary of SVM
3.3 Illustration of a two-hidden-layer ANN
3.4 Importance level of features of the Vietnamese data sets
3.5 Computation protocol of the proposed ensemble classifier
4.1 Illustration of F-CV
4.2 Illustration of F-LLR

LIST OF TABLES

1.1 General implementation protocol in the dissertation
2.1 Confusion matrix
2.2 Representatives employing the algorithm-level approach to ID
2.3 Cost matrix in Cost-sensitive learning
2.4 Summary of SMOTE algorithm
2.5 Representatives employing the data-level approach to ID
2.6 Representatives employing the ensemble-based approach to ID
3.1 Representatives of classifiers in credit scoring
3.2 OUS(B) algorithm
3.3 DTE(B) algorithm
3.4 Description of empirical data sets
3.5 Computation protocol of empirical study on DTE
3.6 Performance measures of DTE(B) on the Vietnamese data sets
3.7 Performance of ensemble classifiers on the Vietnamese data sets
3.8 Performance of ensemble classifiers on the German data set
3.9 Performance of ensemble classifiers on the Taiwanese data set
3.10 TOUS(B) algorithm
3.11 TOUS-F(B) algorithm
3.12 Description of empirical data sets
3.13 Average testing AUC of the proposed ensembles
3.14 Average testing AUC of the models based on LLR
3.15 Average testing AUC of the tree-based ensemble classifiers
4.1 Cross-validation procedure for Lasso Logistic regression
4.2 F-measure-oriented Cross-Validation Procedure
4.3 Algorithm for F-LLR classifier
4.4 Description of empirical data sets
4.5 Implementation protocol of empirical study
4.6 Average testing performance measures of classifiers
4.7 Average testing performance measures of classifiers (cont.)
4.8 The number of wins of F-LLR on empirical data sets
4.9 Important features of the Vietnamese data set
4.10 Important features of the Vietnamese data set (cont.)
B.1 Algorithm of Bagging classifier
B.2 Algorithm of Random Forest
B.3 Algorithm of AdaBoost
C.1 Summary of the German credit data set
C.2 Summary of the Vietnamese 1 data set
C.3 Summary of the Vietnamese 2 data set
C.4 Summary of the Taiwanese credit data set (a)
C.5 Summary of the Taiwanese credit data set (b)
C.6 Summary of the Bank personal loan data set
C.7 Summary of the Hepatitis C patients data set
C.8 Summary of the Loan schema data from lending club (a)
C.9 Summary of the Loan schema data from lending club (b)
C.10 Summary of the Loan schema data from lending club (c)
C.11 Summary of the Vietnamese 3 data set
C.12 Summary of the Australian credit data set
C.13 Summary of the Credit 1 data set
C.14 Summary of the Credit 2 data set
C.15 Summary of the Credit 3 data set
C.16 Summary of the Vietnamese 4 data set

ABSTRACT

In classification, imbalanced data occurs when the classes of the training data set differ greatly in size. This problem frequently arises in various fields, for example, credit scoring and medical diagnosis. With imbalanced data, predictive modeling for real-world applications poses a challenge because most machine learning algorithms are designed for balanced data sets. Therefore, addressing imbalanced data has attracted much attention from researchers and practitioners.

In this dissertation, we propose solutions for imbalanced classification. Furthermore, these solutions are applied to a credit scoring case study. The solutions are derived from three papers published in scientific journals.

• The first paper presents an interpretable decision tree ensemble model for imbalanced credit scoring data sets.

• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.

• The final paper proposes a modification of Logistic regression focused on optimizing the F-measure, a popular metric in imbalanced classification.

These classifiers have been trained on a range of public and private data sets that are highly imbalanced and have overlapping classes. The primary results demonstrate that the proposed methods outperform both traditional models and some recent ones.


TÓM TẮT (ABSTRACT)

In classification problems, imbalanced data occurs when the classes of the training set differ considerably in their numbers of samples. This problem is common in many fields, for example, credit scoring and medical diagnosis. With imbalanced data, building predictive models for real-world applications poses a major challenge because most machine learning algorithms are designed for balanced data sets. Therefore, handling imbalanced data in classification has attracted, and continues to attract, much attention from researchers and practitioners.

In this dissertation, we propose several solutions for classification with imbalanced data. These solutions are applied to a case study of credit scoring. The new results of the dissertation are drawn from three papers published in scientific journals, including:

• The first paper proposes an interpretable model. It is an ensemble of decision tree models, applied to credit scoring.

• The second paper introduces a new technique for imbalanced data, especially for data with overlapping classes and noise.

• The third paper proposes a modification of the Logistic regression model. This adjustment focuses on maximizing the F-measure, a popular performance measure in imbalanced classification problems.

These classification models were tested on public and private data sets that are imbalanced and have overlapping classes. The results demonstrate that our models outperform traditional models as well as recently proposed ones.


Chapter 1
INTRODUCTION

1.1 Overview of imbalanced data in classification

Nowadays, classification plays a crucial role in several fields, for example, medicine (cancer diagnosis), finance (fraud detection), business administration (customer churn prediction), information retrieval (oil spill tracking, telecommunication fraud), image identification (face recognition), and so on. Classification is the problem of predicting a class label for a given sample. On training data sets that comprise samples with different label types, classification algorithms learn the samples' features to recognize the patterns of the labels. These patterns, encoded as a fitted classification model, are then used to predict the labels of new samples.

Classification is categorized into two types: binary and multi-classification. Binary classification, the basic type, focuses on two-class label problems. In contrast, multi-classification solves tasks with several class labels. Multi-classification can sometimes be treated as binary classification with two classes: one class corresponding to the label of concern, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive class is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.

Definition 1.1.1. Consider the set of samples S = X × Y, where X ⊂ R^k is the domain of the samples' features and Y = {0, 1} is the set of labels. The subset of samples labeled 1 is called the positive class, denoted S+. The remaining subset is called the negative class, denoted S−. A sample s ∈ S+ is called a positive sample; otherwise, it is called a negative sample.


Definition 1.1.2. A binary classifier is a function f mapping the domain of features X to the set of labels {0, 1}.

Definition 1.1.3. Given a binary classifier f and a sample s0 = (x0, y0) ∈ S, there are four possibilities:

• If f(s0) = y0 = 1, s0 is called a true positive sample.
• If f(s0) = y0 = 0, s0 is called a true negative sample.
• If f(s0) = 1 and y0 = 0, s0 is called a false positive sample.
• If f(s0) = 0 and y0 = 1, s0 is called a false negative sample.

The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.

Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).

Accuracy = (TP + TN) / N,  TPR = TP / (TP + FN),  TNR = TN / (TN + FP),
FPR = FP / (FP + TN),  FNR = FN / (FN + TP),

where N = TP + TN + FP + FN is the total number of samples.
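For concreteness, these rates can be computed directly from a vector of true labels and a vector of predicted labels. The base-R sketch below only illustrates the formulas above; the function and variable names are ours, not the dissertation's.

# Illustrative base-R implementation of the rates above; 'truth' and 'pred'
# are assumed to be 0/1 vectors of actual and predicted labels.
rates <- function(truth, pred) {
  TP <- sum(pred == 1 & truth == 1); TN <- sum(pred == 0 & truth == 0)
  FP <- sum(pred == 1 & truth == 0); FN <- sum(pred == 0 & truth == 1)
  N  <- TP + TN + FP + FN
  c(accuracy = (TP + TN) / N,
    TPR = TP / (TP + FN), TNR = TN / (TN + FP),
    FPR = FP / (FP + TN), FNR = FN / (FN + TP))
}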


In many application domains where the positive and negative classes are balanced, accuracy is the first target of classifiers. However, the class of interest (the positive class) sometimes consists of unusual or rare events. The number of samples in the positive class is then too small for classifiers to recognize the positive patterns. In such situations, if classifiers make mistakes on the positive class, the cost of the loss is very heavy. Therefore, accuracy is no longer the most important performance criterion; metrics related to TP, such as the TPR, take its place.

For example, in fraud detection, the customers are divided into “bad” and “good” classes. Since the credit regulations are made public and the customers have preliminarily been screened before applying for a loan, a credit data set often includes a majority class of good customers and a minority class of the bad. The loss of misclassifying the “bad” into “good” is often far greater than


the loss of misclassifying the "good" into "bad". Hence, identifying the bad is often considered more crucial than the other task. Consider a list of credit customers consisting of 95% good and 5% bad. If pursuing a high accuracy, we can choose a trivial classifier that assigns the good label to every customer. The accuracy of this classifier is 95%, but its TPR is 0%. In other words, this classifier is unable to identify any bad customer. Instead, another classifier with a lower accuracy but a greater TPR should be considered to replace this trivial one.

Another example of rare-event classification is cancer diagnosis. In this case, the data set has two classes, "malignant" and "benign". The number of malignant patients is always much smaller than that of benign ones. However, malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.

The phenomenon of skewed class distribution in training data sets for classification is known as imbalanced data.

Definition 1.1.4. Let S be a binary classification data set with positive and negative classes S+ and S−, respectively. If the quantity of S+ is far less than that of S−, S is called an imbalanced data set. The imbalanced ratio (IR) of S is defined as the ratio of the quantities of the negative and positive classes:

IR = |S−| / |S+|.
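As a quick illustration (our own notation, not from the dissertation), the IR of a 0/1 label vector y is one line of base R:

IR <- function(y) sum(y == 0) / sum(y == 1)
# e.g., 95 good (0) and 5 bad (1) customers give IR = 95 / 5 = 19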

When a training data set is imbalanced, simple classifiers usually have a very high accuracy but a low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by type I and type II errors (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class


(the positive class) is usually ignored since the common classifiers often treat it as noise or outliers. Hence, the target of recognizing the patterns of the positive class fails although identifying the positive samples is often the crucial task of imbalanced classification. Therefore, imbalanced data is a challenge in classification.

Besides, experimental studies showed that as the imbalanced ratio increases, the overall model performance decreases (Brown & Mues, 2012). Furthermore, some authors stated that imbalanced data is not the only reason for poor performance: noise and overlapping samples also degrade the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of their data sets to handle them correctly.

A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance-sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020, while the gross bad debt ratio (including on-balance-sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020. Although bad customers account for a very small part of the credit customers, the consequences of bad debt for a bank are extremely heavy. In countries where most economic activities rely on the banking system, an increase in the bad debt ratio may not only threaten the operation of the banking system but also push the economy toward a series of collapses. Therefore, it is important to identify the bad customers in credit scoring.

In Vietnam, the credit market is tightly controlled by regulations of the State bank. Commercial banks now consciously manage credit risk by strictly applying credit appraisal processes before funding. In the field of academic research, credit scoring has attracted many authors (Bình & Anh, 2021; Hưng & Trang, 2018; Quỳnh, Anh, & Linh, 2018; Thắng, 2022). However, few works have solved the imbalanced issue (Mỹ, 2021).


These facts prompted us to study imbalanced classification deeply. The dissertation, titled "Imbalanced data in classification: A case study of credit scoring", aims to find suitable solutions for imbalanced data and related issues, with a particular case study of credit scoring in Vietnam.

1.3 Research gap identifications

1.3.1 Gaps in credit scoring

In the dissertation, we choose credit scoring as a case study of imbalanced classification.

Credit scoring is an arithmetical representation based on the analysis of the creditworthiness of customers (Louzada, Ara, & Fernandes, 2016). Credit scoring provides valuable information to banks and financial institutions, not only to hedge credit risk but also to standardize regulations on credit management. Therefore, credit-scoring classifiers have to meet two significant requirements:

i) The ability to accurately classify the bad customers;

ii) The ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been addressed by the development of methods to improve the performance of credit scoring models: traditional statistical methods (K-nearest neighbors, Discriminant analysis, and Logistic regression) and popular machine learning models (Decision tree, Artificial neural network, and Support vector machine) (Baesens et al., 2003; Brown & Mues, 2012; Louzada et al., 2016). These are called single classifiers. The effectiveness of a single classifier varies across data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, while Baesens et al. (2003) found that Support vector machine was better than Logistic regression, Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an


insignificant difference among Support vector machine, Logistic regression, and Linear discriminant analysis. In summary, empirical credit scoring studies lead to the important conclusion that there is no best single classifier for all data sets.

With the development of computational software and programming languages, there has been a shift from single classifiers to ensemble ones. The term "ensemble classifier" or "ensemble model" refers to a collection of multiple classifier algorithms. Ensemble models work by leveraging collective power for decision-making across multiple sub-classifiers. In the credit scoring literature, empirical studies concluded that ensemble models had superior performance to single ones (Brown & Mues, 2012; Dastile, Celik, & Potsane, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.

While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons behind the classification results, which form the framework for assessing, managing, and hedging credit risk. For example, customers' features are nowadays collected into empirical data sets more and more diversely, but not all of them are useful for credit scoring. Administrators need to know which inputs of the classification model influence the likelihood of default in order to set transparent credit standards. There is usually a trade-off between the effectiveness and transparency of classifiers (Brown & Mues, 2012): as performance measures increase, explaining the predicted results becomes more difficult. For example, single classifiers such as Discriminant analysis, Logistic regression, and Decision tree are interpretable, but they usually work far less effectively than Support vector machine and Artificial neural network, which are representatives of "black box" classifiers. Another case is ensemble classifiers. Most of them operate through an incomprehensible process although they have outstanding performance. Even for popular ensemble classifiers such as Bagging tree, Random forest, or AdaBoost, which do not have very complicated structures, their interpretability is rarely discussed. According to Dastile et al. (2020), in the credit


scoring application, only 8% of studies proposed new models with a discussion of interpretability.

Therefore, building a credit-scoring ensemble classifier that satisfies both requirements is an essential task.

In Vietnam, credit data sets usually suffer from imbalance, noise, and overlapping issues. Although the economy is under the influence of the digital transformation process and credit scoring models have developed rapidly, Vietnamese commercial banks still apply traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016), Support vector machine (Nhâm, 2021), Random forest (Ha, Nguyen, & Nguyen, 2016), and ensemble models (Luu & Hung, 2021). The idea of these studies is to support the application of advanced methods in credit scoring, but they are not concerned with the imbalanced issue and interpretability. Very few studies dealt with the imbalance issue (Mỹ, 2021; Toàn, Lịch, Hương, & Thọ, 2017); however, these works only solved imbalanced data and ignored noise and overlapping samples.

In summary, it is necessary to build a credit-scoring ensemble classifier that can tackle imbalanced data and other related issues, such as noise and overlapping samples, to raise the performance measures, especially on Vietnamese data sets. Furthermore, the proposed model should point out the important features for predicting credit risk status.

1.3.2 Gaps in the approaches to solving imbalanced data

There are three popular approaches to imbalanced classification in the literature: algorithm-level, data-level, and ensemble-based approaches (Galar et al., 2011).

The algorithm-level approach solves imbalanced data by modifying classifier algorithms to reduce the bias toward the majority class. This approach requires deep knowledge of the intrinsic classifiers, which users usually lack. In addition, designing specific corrections or modifications for given classifier algorithms makes this approach less versatile. A representative of the algorithm-level approach is Cost-sensitive learning, which imposes or corrects costs of loss upon misclassifications and seeks the minimal total loss of the classification process (Xiao, Xie, He, & Jiang, 2012; Xiao et al., 2020). However, the values of the costs of losses are usually assigned at the researchers' discretion. In short, the algorithm-level approach is inflexible and unwieldy.
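A minimal sketch of the cost-weighting idea behind Cost-sensitive learning is given below. It is a generic base-R illustration, not the method of the works cited above, and the cost values are arbitrary placeholders chosen by the user.

# Weighted maximum likelihood as a simple cost-sensitive device: errors on
# the positive class are penalized cost_fn times more heavily than errors
# on the negative class. 'train' is a hypothetical data frame with a 0/1
# response column y.
fit_cost_sensitive <- function(train, cost_fn = 5, cost_fp = 1) {
  w <- ifelse(train$y == 1, cost_fn, cost_fp)  # per-sample misclassification costs
  glm(y ~ ., family = binomial, data = train, weights = w)
}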

The data-level approach balances training data sets by applying re-sampling techniques, which belong to three main groups: over-sampling, under-sampling, and hybrids of over- and under-sampling. Over-sampling techniques increase the quantity of the minority class, while under-sampling techniques decrease that of the majority class. This approach is easy to implement and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to a poor classification model. For instance, random over-sampling increases the computation time and may replicate noisy and overlapping samples, probably leading to an over-fitted classification model. Some hierarchical over-sampling methods can cause other problems; for example, the Synthetic Minority Over-sampling Technique (SMOTE) can exacerbate the overlapping issue. In contrast, under-sampling techniques may miss useful information about the majority class, especially on severely imbalanced data (Baesens et al., 2003; Sun, Lang, Fujita, & Li, 2018).
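The base-R sketches below illustrate the three groups: random over-sampling, random under-sampling, and a bare-bones SMOTE-style interpolation. They are simplified illustrations of the ideas, not the reference implementations of the cited techniques; 'data' is assumed to hold a 0/1 label column y (with fewer positives than negatives) and numeric features.

ros <- function(data) {            # duplicate minority samples until classes match
  pos <- data[data$y == 1, ]; neg <- data[data$y == 0, ]
  rbind(data, pos[sample(nrow(pos), nrow(neg) - nrow(pos), replace = TRUE), ])
}
rus <- function(data) {            # keep all positives, drop random negatives
  pos <- data[data$y == 1, ]; neg <- data[data$y == 0, ]
  rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
}
smote_one <- function(X, k = 5) {  # X: numeric matrix of minority samples
  i  <- sample(nrow(X), 1)                               # a random minority seed
  d  <- as.matrix(dist(rbind(X[i, , drop = FALSE], X)))[1, -1]
  nn <- order(d)[2:(k + 1)]                              # k nearest minority neighbours (skip self)
  j  <- sample(nn, 1)
  X[i, ] + runif(1) * (X[j, ] - X[i, ])                  # synthetic point on the segment
}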

The third is the ensemble-based approach, which integrates ensemble classifier algorithms with algorithm-level or data-level approaches. This approach exploits the advantages of ensemble classifiers to improve the performance criteria, and it seems to be the trend in dealing with imbalanced data (Abdoli, Akbari, & Shahrabi, 2023; Shen, Zhao, Kou, & Alsaadi, 2021; Yang, Qiao, Huang, Wang, & Wang, 2021; Zhang, Yang, & Zhang, 2021). However, the ensemble-based approach often yields complex models whose results are too difficult to interpret. This is a concern that must be fully recognized.
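A generic flavour of this integration is "under-bagging": each base learner sees a re-balanced sample, and the ensemble averages their scores. The sketch below (base R plus the rpart package for CART trees) is our simplified illustration of the idea only; it is not the OUS(B)/DTE(B) algorithms proposed later in this dissertation.

library(rpart)                     # CART decision trees

under_bagging <- function(train, newdata, B = 25) {
  pos <- train[train$y == 1, ]; neg <- train[train$y == 0, ]
  scores <- replicate(B, {
    bal   <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])  # balanced draw
    bal$y <- factor(bal$y, levels = c(0, 1))
    fit   <- rpart(y ~ ., data = bal, method = "class")
    predict(fit, newdata = newdata)[, "1"]                    # each tree's P(y = 1)
  })
  rowMeans(scores)                 # averaged score of the ensemble
}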


In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, very few studies deal with imbalance together with noise and overlapping samples, and on some data sets the available methods do not raise the performance measures as high as expected. Hence the idea of a new algorithm that can deal with imbalance, noise, and overlapping samples to increase the performance measures on the positive class.

1.3.3 Gaps in Logistic regression with imbalanced data

Logistic regression (LR) is one of the most popular single classifiers, especially in credit scoring (Onay & Öztürk, 2018). LR provides an understandable output: the conditional probability of belonging to the positive class. This probability is the reference for predicting the sample's label by comparing it with a given threshold; the sample is classified into the positive class if and only if its conditional probability is greater than this threshold. This characteristic of LR extends naturally to multi-classification. Besides, the computation process of LR, which employs the maximum likelihood estimator, is quite simple and fast, since several software packages and programming languages implement it. Furthermore, LR can show the impact of predictors on the output by evaluating the statistical significance of the parameters corresponding to the predictors. In other words, LR provides an interpretable and affordable model.
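In base R this whole workflow is a few lines; the sketch below uses hypothetical train and test data frames with a 0/1 response y.

fit  <- glm(y ~ ., family = binomial, data = train)      # maximum likelihood fit
p    <- predict(fit, newdata = test, type = "response")  # P(y = 1 | x) for new samples
yhat <- as.integer(p > 0.5)                              # label rule with threshold 0.5
summary(fit)                                             # coefficients and their p-values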

However, LR is ineffective on imbalanced data sets (Firth, 1993; King & Zeng, 2001); specifically, the conditional probability of positive samples is underestimated, so positive samples are likely to be misclassified. Besides, the statistical significance of predictors is usually based on the parameter testing procedure, which uses the p-value criterion as a framework. Meanwhile, the p-value has recently been criticized in the statistical community because it is widely misunderstood (Goodman, 2008). These issues limit the application fields of LR although it has several advantages.


There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them belong to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some PLR methods are too sensitive to the initial values in the computation of the maximum likelihood estimate. Furthermore, some PLR methods correct only the biased parameter estimates, not the biased conditional probabilities (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature on LR with imbalanced data, although such hybrids could exploit the advantages of each individual method and directly solve imbalanced data for LR.
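As one concrete example, prior correction in the style of King and Zeng (2001) shifts only the fitted intercept, using the population positive rate tau, which must be supplied by the user. A base-R sketch under the assumption that tau is known:

prior_correct <- function(fit, tau, y) {
  ybar <- mean(y)                  # positive rate observed in the sample
  b <- coef(fit)
  # shift the intercept; the slope estimates are left unchanged
  b["(Intercept)"] <- b["(Intercept)"] - log(((1 - tau) / tau) * (ybar / (1 - ybar)))
  b
}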

In summary, LR for imbalanced data needs to be modified in the computation process by a combination of data-level and algorithm-level approaches. The modification can deal with imbalanced data and still retain the ability to provide the impacts of the predictors on the response without the "p-value" criterion.

1.4 Research objectives, research subjects, and research scopes

1.4.1 Research objectives

In this dissertation, we aim to achieve the following objectives.

The first objective is to propose a new ensemble classifier that satisfies the two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform traditional classification models and popular balancing methods such as the Bagging tree, Random forest, and AdaBoost combined with random over-sampling (ROS), random under-sampling (RUS), SMOTE, and Adaptive synthetic sampling (ADASYN). Furthermore, the proposed model can identify the significance of input features in predicting credit risk status.

The second objective is to propose a novel technique to address the challenges of imbalanced data, noise, and overlapping samples. This technique leverages the strengths of re-sampling methods and ensemble models to tackle these critical issues in classification. Subsequently, it can be applied to credit scoring and other imbalanced classification applications, for example, medical diagnosis.

The final objective is to modify the computation process of Logistic regression to address imbalanced data and mitigate the issue of overlapping samples. This modification directly targets the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balancing methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.
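For reference, the F-measure combines precision and recall as F = 2PR / (P + R). The base-R sketch below computes F at a given threshold and searches a grid for the threshold that maximizes it on validation data; this is a generic illustrative device of ours, not the F-LLR modification proposed in the dissertation.

f_measure <- function(y, p, t) {   # y: 0/1 labels, p: scores, t: threshold
  pred <- as.integer(p > t)
  TP <- sum(pred == 1 & y == 1); FP <- sum(pred == 1 & y == 0); FN <- sum(pred == 0 & y == 1)
  prec <- TP / (TP + FP); rec <- TP / (TP + FN)
  2 * prec * rec / (prec + rec)
}
best_threshold <- function(y, p, grid = seq(0.05, 0.95, 0.01)) {
  grid[which.max(sapply(grid, function(t) f_measure(y, p, t)))]  # NAs are ignored
}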

1.4.2 Research subjects

This dissertation investigates the phenomenon of imbalanced data and other related issues, such as noise and overlapping samples, in classification. We examine various balancing methods, encompassing algorithm-level, data-level, and ensemble-based approaches, in a case study of credit scoring. Within these approaches, data-level and ensemble-based methods receive more attention than algorithm-level ones. Additionally, Lasso-Logistic regression, a penalized version of Logistic regression, is studied in two application contexts: as the base learner of an ensemble classifier and as an individual classifier.

1.4.3 Research scopes

The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-Logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques, such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and the Neighborhood Cleaning Rule, are


investigated in this study. In addition, popular performance criteria suitable for imbalanced classification, such as AUC (Area Under the Receiver Operating Characteristic Curve), KS (Kolmogorov-Smirnov statistic), F-measure, G-mean, and H-measure, are used to evaluate the effectiveness of the considered classifiers.
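To make the scored-output measures concrete, the base-R sketches below compute AUC (via its Mann-Whitney form), the KS statistic, and G-mean. Here y is a 0/1 label vector, p a score vector, and pred a 0/1 prediction vector; the notation is ours.

auc <- function(y, p) {            # Mann-Whitney formulation of AUC
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
ks_stat <- function(y, p) {        # max gap between the two score CDFs
  grid <- sort(unique(p))
  max(abs(ecdf(p[y == 1])(grid) - ecdf(p[y == 0])(grid)))
}
g_mean <- function(y, pred) sqrt(mean(pred[y == 1] == 1) * mean(pred[y == 0] == 0))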

1.5 Research data and research methods

1.5.1 Research data

The case study of credit scoring uses six secondary data sets. Three of them come from the UCI machine learning repository: the German, Taiwanese, and Bank personal loan data sets. These data sets are very popular in credit scoring research and are used as benchmarks in the literature. The other three are private data sets collected from commercial banks in Vietnam. All Vietnamese data sets are highly imbalanced, at different levels. Furthermore, to verify the ability of the proposed works to improve the performance measures, the empirical study also used one data set from the medical field, the Hepatitis C data set, which is available from the UCI machine learning repository.

The case study of Logistic regression employs nine data sets. Four of them, which are German, Taiwanese, Bank personal loan, and Hepatitis data sets, are also used in the case study of credit scoring. The others are easy to access through the Kaggle website and UCI machine learning repository.

1.5.2 Research methods

The dissertation applies the quantitative research method to clarify the effectiveness of the proposed works: the credit scoring ensemble classifier, the algorithm for balanced and overlap-free data, and the modification of Logistic regression.

The general implementation protocol of the proposed works follows the steps in Table 1.1. This protocol is applied in all computation processes in the dissertation, although in each case the content of Step 2 may vary in some ways.


Table 1.1: General implementation protocol in the dissertation

Step 1. Splitting the data set into training and testing sets.
Step 2. Constructing the new model with different hyper-parameters to find the optimal model on the training data.
Step 3. Constructing the benchmark models, which combine popular balancing methods and classifier algorithms, on the same training data.
Step 4. Applying the fitted models to the testing data, then calculating their performance measures.

The computation processes are conducted in the programming language R, which has been widely used in the machine learning community.
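In miniature, and with illustrative names only, the protocol amounts to the following base-R skeleton: a hold-out split, model fitting on the training part, and evaluation on the testing part.

set.seed(1)
idx   <- sample(nrow(ds), 0.7 * nrow(ds))             # hold-out split of a data set 'ds'
train <- ds[idx, ]; test <- ds[-idx, ]
fit   <- glm(y ~ ., family = binomial, data = train)  # stand-in for any candidate model
p     <- predict(fit, newdata = test, type = "response")
mean((p > 0.5)[test$y == 1])                          # e.g., TPR on the testing data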

1.6 Contributions of the dissertation

The dissertation contributes three methods to the literature on credit scoring and imbalanced classification. The proposed methods were published in three articles, including:

(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent & Fuzzy Systems, Vol 45, No 6, 10853–10864, 2023.

(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision and Control, Vol 429, 595–612, 2022, Springer.

(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.

Regarding the literature on credit scoring, the dissertation suggests an interpretable ensemble classifier that can address imbalanced data. The proposed model, which uses Decision tree as the base learner, has specific advantages over the popular approaches, namely higher performance measures and interpretability. The proposed model corresponds to the first article.


Regarding the literature on imbalanced data, the dissertation proposes a method for balancing data and removing noisy and overlapping samples, thanks to the ensemble-based approach. This method outperforms integrations of re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and the Neighborhood Cleaning Rule) with popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article.

Regarding the literature on Logistic regression, the dissertation provides a modification of its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the importance level of input features without using the p-value. This modification is in the third article.

1.7 Dissertation outline

The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.

• Chapter 1. Introduction

• Chapter 2. Literature review of imbalanced data
• Chapter 3. Imbalanced data in credit scoring

• Chapter 4. A modification of Logistic regression with imbalanced data
• Chapter 5. Conclusions

Chapter 1 is the introduction, which briefly introduces the contents of the dissertation. This chapter presents the overview of imbalanced data in classification. Besides, other contents are the motivations, research gap identifications, objectives, subjects, scopes, data, methods, contributions, and the dissertation outline.

Chapter 2 is the literature review on imbalanced data in classification. This chapter provides the definition, obstacles, and related issues of imbalanced data, for example, overlapping classes. Besides, this chapter presents in detail the performance measures for imbalanced data. The most important section is the


review of approaches to imbalanced data, including algorithm-level, data-level, and ensemble-based ones. Chapter 2 also examines the basic background and recently proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods, which is the framework for developing the new balancing methods in the dissertation.

Chapter 3 is the case study of imbalanced classification - credit scoring. This chapter is based on the main contents of the first and second articles referred to in Section 1.6. We propose an ensemble classifier that can address imbalanced data and provide the importance level of predictors. Furthermore, we innovate the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. The empirical studies are conducted to verify the effectiveness of the proposed algorithms.

Chapter 4 is another study on imbalanced data, related to Logistic regression. This chapter proposes a modification of the inner and outer parts of the computation process of Logistic regression. The inner part is a change in the performance criterion used to estimate the score, and the outer part is a selective application of re-sampling techniques to re-balance the training data. Experiments are conducted on nine data sets to verify the performance of the modification. Chapter 4 corresponds to the third article referred to in Section 1.6.

Chapter 5 is the conclusion, which summarizes the dissertation, discusses the applications of the proposed works, and refers to some further studies.


Chapter 2

LITERATURE REVIEW OF IMBALANCED DATA

2.1 Imbalanced data in classification

2.1.1 Description of imbalanced data

According to Definition 1.1.4, any data set with a skewed quantity of samples in its two classes is technically imbalanced data (ID). In other words, any two-class data set with an imbalanced ratio (IR) greater than one is considered ID. There is no conventional definition of the IR threshold beyond which a data set is deemed imbalanced. Most authors simply define ID as data in which one class has a much greater (or smaller) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors consider a data set imbalanced if the class of interest has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.

2.1.2 Obstacles in imbalanced classification

In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to achieve the highest global accuracy, with very low accuracy on the minority class. On the other hand, the patterns of the minority class are often specific, especially under extreme ID, which leads to minority samples being ignored (they may be treated as noise) in favor of the more general patterns of the majority class. As a


consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.

The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, defined as the proportion of the performance difference between ID and the corresponding balanced data, became significant for IRs of 90/10 and greater, and tended to increase quickly for higher values of IR.

In short, IR is the factor that reduces the effectiveness of standard classifiers.

2.1.3 Categories of imbalanced data

In real applications, combinations of ID and other phenomena make classification processes more difficult. Some authors even claim that ID is not the only main reason for poor performance: overlapping, small sample size, small disjuncts, and borderline, rare, and outlier samples are also causes of the low effectiveness of popular classifier algorithms (Batista et al., 2004; Fernández et al., 2018; Napierala & Stefanowski, 2016; Sun et al., 2009).

• Overlapping or class separability (Fig. 2.1b) is the phenomenon of an unclear decision boundary between two classes; some samples of the two classes are blended. On data sets with overlapping, standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors are harder to apply. Batista et al. (2004) stated that the IR is less important than the degree of overlap between classes. Similarly, Fernández et al. (2018) believed that any simple classifier algorithm could perform classification independently of the IR when there is no overlapping.

• Small sample size: Learning algorithms need a sufficient number of samples to generalize a rule that discriminates the classes. Without large training sets, a classifier not only fails to generalize the characteristics of the data but may also produce an over-fitted model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data sets, the lack of information about the positive class becomes even more serious. Krawczyk and Woźniak (2015) stated that, for a fixed IR, the more samples the minority class has, the lower the error rate of classifiers.

</div><span class="text_page_counter">Trang 39</span><div class="page_container" data-page="39">

Figure 2.1: Examples of circumstances of imbalanced data. Source: Galar et al. (2011).

• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces of the feature space. Small disjuncts therefore provide classifiers with a smaller number of positive samples than large disjuncts. In other words, small disjuncts cover rare samples that are hard to find in the data sets, and learning algorithms often ignore rare samples when setting general classification rules, which leads to a higher error rate on small disjuncts (Prati, Batista, & Monard, 2004; Weiss, 2009).

• The characteristics of positive samples, such as borderline, rare, and outlier samples, affect the performance of standard classifiers. Borderline samples are always difficult to recognize, and rare and outlier samples are extremely hard to identify. According to Napierala and Stefanowski (2016) and Van Hulse and Khoshgoftaar (2009), an imbalanced data set with many borderline, rare, or outlier samples makes standard classifiers less efficient.

In summary, studies of ID should pay attention to related issues such as overlapping, small sample size, small disjuncts, and the characteristics of the positive samples.


2.2 Performance measures for imbalanced data

The quality of a classifier is evaluated by inspecting how effectively it performs on testing data; the outputs of the classifier are compared with the true labels of the testing data, which are hidden during the process of constructing the classifier. There are two types of outputs: labeled and scored. Depending on the type, different metrics are used to analyze the performance of classifiers. In ID, there are some notes on the choice of performance measures.

2.2.1 Performance measures for labeled outputs

Most learning algorithms produce labeled outputs, for example, K-nearest neighbors, Decision tree, and ensemble classifiers based on Decision tree. A convenient way to present the performance of labeled-output classifiers is a cross-tabulation between actual and predicted labels, known as the confusion matrix.

Table 2.1: Confusion matrix

                    Predicted positive   Predicted negative   Total
Actual positive     TP                   FN                   POS
Actual negative     FP                   TN                   NEG
Total               PPOS                 PNEG                 N

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of actual positive and negative samples in the data set, respectively; PPOS and PNEG are the numbers of predicted positive and negative samples; N is the total number of samples.

From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types, single and complex metrics.

2.2.1.1 Single metrics

The most popular single metric is accuracy, or its complement, the error rate. Accuracy is the proportion of correct outputs, and the error rate is the proportion of incorrect ones. Therefore, the higher the accuracy (or the lower the error rate), the better a classifier appears to perform.
