
Imbalanced Data in classification: A case study of credit scoring



MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY


Ho Chi Minh City - 2024


STATEMENT OF AUTHENTICATION

I certify that the Ph.D. dissertation, "Imbalanced data in classification: A case study of credit scoring", is solely my own research.

This dissertation is only used for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH), and no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.

Ho Chi Minh City, April 2, 2024


First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Dr. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the research process. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.

Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.

Ho Chi Minh City, April 2, 2024


1.3.2 Gaps in the approaches to solving imbalanced data
1.3.3 Gaps in Logistic regression with imbalanced data
1.4 Research objectives, research subjects, and research scopes


2.3.3.2 Integration of data-level method and ensemble classifier algorithms
3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm
3.2.2 Empirical datasets


4.2.1 Prior correction
4.2.2 Weighted likelihood estimation (WLE)
4.2.3 Penalized likelihood regression (PLR)
4.3 The proposed works
4.3.1 The modification of the cross-validation procedure
4.3.2 The modification of Logistic regression
4.4.6 Important variables for output
4.4.6.1 Important variables for F-LLR fitted model
4.4.6.2 Important variables of the Vietnamese dataset
5.1.1 The interpretable credit scoring ensemble classifier

5.1.2 The technique for imbalanced data, noise, and overlapping samples


C.5 Bank personal loan data set (BANK)
C.6 Hepatitis C patients data set (HEPA)
C.7 The Loan schema data from lending club (US)
C.8 Vietnamese 3 data set (VN3)
C.9 Australian credit data set (AUS)
C.10 Credit risk data set (Credit1)
C.11 Credit card data set (Credit2)
C.12 Credit default data set (Credit3)
C.13 Vietnamese 4 data set (VN4)


LIST OF ABBREVIATIONS

ADASYN   Adaptive synthetic sampling
ANN      Artificial neural network
AUC      Area under the ROC curve
AUS      Australian credit data set
BANK     Bank personal loan data set
CART     Classification and regression tree algorithm
CHAID    Chi-square automatic interaction detector algorithm
CNN      Condensed nearest neighbors
Credit1  Credit risk data set
Credit2  Credit card data set
Credit3  Credit default data set
FLAC     Firth's logistic regression with added covariate
FLIC     Firth's logistic regression with intercept correction
F-LLR    F-measure-oriented Lasso-Logistic regression
FIR      Firth-type, a version of Penalized likelihood regression
FN, FNR  False negative, False negative rate


FP, FPR  False positive, False positive rate
GER      German credit data set
HEPA     Hepatitis patient data set
HEOM     Heterogeneous Euclidean-Overlap metric
HVDM     Heterogeneous value difference metric
ID       Imbalanced data
IR       Imbalanced ratio
KNN      K-nearest neighbor classifier
KS       Kolmogorov-Smirnov statistic
LDA      Linear discriminant analysis
LLE      Lasso-Logistic regression ensemble classifier
LR       Logistic regression
LLR      Lasso-Logistic regression
MLE      Maximum likelihood estimate
NCL      Neighborhood cleaning rule
OSS      One-side selection
OUS      Over-Under sampling, the proposed algorithm for balancing data
PLR      Penalized likelihood regression
QDA      Quadratic discriminant analysis
ROC      Receiver Operating Characteristics Curve
ROS      Random over-sampling
RPART    Recursive Partitioning and Regression Tree algorithm
RUS      Random under-sampling
SMOTE    Synthetic Minority Over-sampling technique


UCI      University of California, Irvine
US       Loan schema data set from lending club
VAMC     Vietnam Asset Management Company
VN1      Vietnamese credit 1 data set
VN2      Vietnamese credit 2 data set
VN3      Vietnamese credit 3 data set
VN4      Vietnamese credit 4 data set
WLE      Weighted likelihood estimation


2.8  Illustration of SMOTE technique
2.9  Approaches to imbalanced data in classification
3.1  Illustration of a Decision tree
3.2  Illustration of a decision boundary of SVM
3.3  Illustration of a two-hidden-layer ANN
3.4  Importance level of features of the Vietnamese data sets
3.5  Computation protocol of the proposed ensemble classifier
4.1  Illustration of F-CV
4.2  Illustration of F-LLR


3.9   Performance of ensemble classifiers on the Taiwanese data set
3.10  TOUS(B) algorithm


4.5   Implementation protocol of empirical study
4.6   Average testing performance measures of classifiers
4.7   Average testing performance measures of classifiers (cont.)
4.8   The number of wins of F-LLR on empirical data sets
4.9   Important features of the Vietnamese data set
4.10  Important features of the Vietnamese data set (cont.)
B.1   Algorithm of Bagging classifier
B.2   Algorithm of Random Forest
B.3   Algorithm of AdaBoost
C.1   Summary of the German credit data set
C.2   Summary of the Vietnamese 1 data set
C.3   Summary of the Vietnamese 2 data set
C.4   Summary of the Taiwanese credit data set (a)
C.5   Summary of the Taiwanese credit data set (b)
C.6   Summary of the Bank personal loan data set
C.7   Summary of the Hepatitis C patients data set
C.8   Summary of the Loan schema data from lending club (a)
C.9   Summary of the Loan schema data from lending club (b)
C.10  Summary of the Loan schema data from lending club (c)
C.11  Summary of the Vietnamese 3 data set
C.12  Summary of the Australian credit data set
C.13  Summary of the Credit 1 data set
C.14  Summary of the Credit 2 data set
C.15  Summary of the Credit 3 data set
C.16  Summary of the Vietnamese 4 data set


In classification, imbalanced data occurs when there is a great difference in the quantities of the classes of the training data set. This dissertation presents three papers dealing with imbalanced credit scoring data sets.

• The first paper proposes an interpretable decision tree ensemble model for imbalanced credit scoring data sets.

• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.

• The final paper proposes a modification of Logistic regression focusing on the optimization of the F-measure, a popular metric in imbalanced classification.

These classifiers have been trained on a range of public and private data sets with highly imbalanced status and overlapping classes. The primary results demonstrate that the proposed works outperform both traditional and some recent models.




Binary classification, which is the basic type, focuses on the two-class label problems. In contrast, multi-classification solves the tasks of several class labels. Multi-classification is sometimes considered binary with two classes: one class corresponding to the concerned label, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive is the interest class, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.

Definition 1.1.1. A data set with k input features for binary classification is the set S = {(x_i, y_i) : x_i ∈ X ⊆ R^k, y_i ∈ {0, 1}, i = 1, ..., n}, where X is the feature space. The subset of samples labeled 1 is called the positive class, denoted S+. The subset of samples labeled 0 is called the negative class, denoted S−.


Definition 1.1.2. A binary classifier is a function mapping the feature space X to the set of labels {0, 1}.

Definition 1.1.3. Consider a data set S and a classifier f : X → {0, 1}. With a sample s0 ∈ S whose actual label is y0:

• If f(s0) = 1 and y0 = 1, s0 is called a true positive sample.
• If f(s0) = 0 and y0 = 0, s0 is called a true negative sample.
• If f(s0) = 1 and y0 = 0, s0 is called a false positive sample.
• If f(s0) = 0 and y0 = 1, s0 is called a false negative sample.

The numbers of the true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.

Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).
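As a minimal illustration (a sketch in Python, not code from the dissertation), these criteria can be computed directly from the four counts TP, TN, FP, and FN; the counts in the example are hypothetical.

```python
def classification_rates(tp, tn, fp, fn):
    """Basic performance criteria computed from the confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "TPR": tp / (tp + fn),  # true positive rate (recall, sensitivity)
        "TNR": tn / (tn + fp),  # true negative rate (specificity)
        "FPR": fp / (fp + tn),  # false positive rate = 1 - TNR
        "FNR": fn / (fn + tp),  # false negative rate = 1 - TPR
    }

# Hypothetical counts for a classifier evaluated on 1,000 samples
print(classification_rates(tp=60, tn=880, fp=20, fn=40))
```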


In credit scoring, the loss of misclassifying a "bad" customer as "good" is usually much greater than the loss of misclassifying the "good" into "bad". Hence, identifying the bad is often considered more crucial than the other task. Similarly, detecting malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.

The phenomenon of skew distribution in training data sets for classification is known as imbalanced data.

Definition 1.1.4. Let S = S+ ∪ S− be the data set, where S+ and S− are the positive and negative classes, respectively. S is called an imbalanced data set if the number of samples of S+ is much smaller than that of S−. The imbalanced ratio of S is IR = |S−| / |S+|.

When a training data set is imbalanced, simple classifiers usually have a very high accuracy but low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by the error type I and error type II (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class may even be treated as noise, which further reduces the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers or practitioners should deeply understand the nature of data sets to handle them correctly.
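The bias described above is easy to reproduce on synthetic data. The sketch below is an illustration only, with made-up data rather than the dissertation's data sets: it fits a plain Logistic regression on a training set whose imbalanced ratio is roughly 99 and compares accuracy with TPR.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: about 99% negative and 1% positive samples
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.99],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ir = (y_tr == 0).sum() / (y_tr == 1).sum()   # imbalanced ratio of the training set
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(f"IR       = {ir:.1f}")
print(f"accuracy = {accuracy_score(y_te, pred):.3f}")  # typically very high
print(f"TPR      = {recall_score(y_te, pred):.3f}")    # typically much lower
```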

A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the bad debt ratio in the on-balance sheet was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including on-balance sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020. Although bad customers account for a very small part of the credit customers, the consequences of the bad debt of the bank are extremely heavy. In countries where most economic activities rely on the banking system, the increase in the bad debt ratio may not only threaten the operation of the banking system but also push the economy to a series of collapses. Therefore, it is important to identify the bad customers.


These facts prompted us to study imbalanced classification deeply. The dissertation titled "Imbalanced data in classification: A case study of credit scoring" aims to find suitable solutions for the imbalanced data and related issues, especially a case study of credit scoring in Vietnam.

ii) The ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been solved with the development of methods to improve the performance of credit scoring models on various data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another result concluded that the Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, according to Baesens et al. (2003), Support vector machine was better than Logistic regression, while Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an


insignificant difference among Support vector machine, Logistic regression, and other classifiers. In addition, many studies showed that ensemble models had superior performance to the single ones (Brown & Mues, 2012; Dastile, Celik, & Potsane, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.

While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons for the classification results, which is the framework for assessing, managing, and hedging credit risk. For example, nowadays, customers' features are collected into empirical data sets more and more diversely, but not all of them are relevant to credit risk, and black-box models cannot show which features drive the predictions. Another case is ensemble classifiers. Most of them operate in an incomprehensible process although they have outstanding performance. Even with popular ensemble classifiers such as Bagging Tree, Random Forest, or AdaBoost, which do not have very complicated structures, their interpretability is not discussed. According to Dastile et al. (2020), interpretability has received little attention in the credit scoring literature.


Meanwhile, banks have still applied traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016) and Support vector machine.

The algorithm-level approach solves imbalanced data by modifying the classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge about the intrinsic classifiers, which users usually lack.


In addition, designing specific corrections or modifications for the given classifier algorithms makes this approach not versatile. A representative of the algorithm-level approach is the Cost-sensitive learning method, which imposes or corrects the costs of loss upon misclassifications and requires the minimal total loss of the classification process (Xiao, Xie, He, & Jiang, 2012). In contrast, the data-level approach uses re-sampling techniques to increase the number of samples of the minority class or decrease the one of the majority class. This approach implements easily and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to the loss of information or over-fitting.
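As a rough illustration of the cost-sensitive idea (a sketch only; it uses scikit-learn's class_weight option as a stand-in for the cost-matrix formulations in the cited papers, and the cost values are assumptions rather than recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A plain tree versus a tree whose loss charges a false negative roughly
# 99 times more than a false positive (an assumed cost matching the IR).
plain = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
costed = DecisionTreeClassifier(max_depth=5, random_state=0,
                                class_weight={0: 1.0, 1: 99.0}).fit(X_tr, y_tr)

for name, model in [("plain", plain), ("cost-sensitive", costed)]:
    tpr = recall_score(y_te, model.predict(X_te))
    print(f"{name:15s} TPR = {tpr:.3f}")
```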


In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, there are very few studies dealing with either imbalance or noise and overlapping samples among the available studies.

Logistic regression (LR) estimates the conditional probability that a sample belongs to the positive class. This probability is the reference to predict the sample's label by comparing it with a given threshold: the sample is classified into the positive class if and only if its conditional probability is greater than this threshold.
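The decision rule can be sketched as follows (illustrative only; the alternative threshold of 0.2 is an arbitrary assumption used to show the trade-off, not a value recommended in the dissertation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.95], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = lr.predict_proba(X_te)[:, 1]        # estimated conditional probability P(y = 1 | x)

for threshold in (0.5, 0.2):                # 0.5 is the common default
    pred = (proba > threshold).astype(int)  # positive iff the probability exceeds the threshold
    print(f"threshold={threshold}: TPR={recall_score(y_te, pred):.3f}, "
          f"precision={precision_score(y_te, pred):.3f}")
```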

This characteristic of LR can be naturally extended to multi-classification. Besides, the computation process of LR employs the maximum likelihood estimation, whose estimates can be biased when the positive class is rare. Meanwhile, the p-value has recently been criticized in the statistical community because of its frequent misinterpretation (Goodman, 2008). Those lead to the limitation in the application fields of LR although it has several advantages.


There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them are related to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some methods of PLR are too sensitive to initial values in the computation process of the maximum likelihood estimation. Furthermore, some methods of PLR address only the biased parameter estimates, not the biased conditional probability (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature.
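To make the role of that population ratio concrete, the sketch below follows the King and Zeng (2001) formulation of prior correction and of the WLE sample weights; the population share tau used here is an assumed value, and obtaining it is precisely the practical difficulty noted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic case-control-style sample: positives are over-represented
# relative to an assumed population share tau.
X, y = make_classification(n_samples=10000, weights=[0.90], random_state=2)
tau = 0.01        # assumed population proportion of positives (rarely known in practice)
ybar = y.mean()   # proportion of positives in the sample

# Prior correction: only the intercept of the fitted model is adjusted.
lr = LogisticRegression(max_iter=1000).fit(X, y)
b0_corrected = lr.intercept_[0] - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Weighted likelihood estimation: samples are re-weighted by population/sample shares.
w = np.where(y == 1, tau / ybar, (1 - tau) / (1 - ybar))
lr_wle = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

print("intercept (raw)            =", round(lr.intercept_[0], 3))
print("intercept (prior-corrected)=", round(b0_corrected, 3))
print("intercept (WLE)            =", round(lr_wle.intercept_[0], 3))
```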

1.4.1 Research objectives

The first objective is to propose a new ensemble classifier that satisfies two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform the traditional classification models and popular balancing methods.

The second objective is to propose a modification of Logistic regression for imbalanced data.


This modification directly impacts the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balancing methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.

1.4.2 Research subjects

This dissertation investigates the phenomenon of imbalanced data and other related issues such as noise and overlapping samples in classification. We examine various balancing methods, encompassing algorithm-level, data-level, and ensemble-based approaches.

1.4.3 Research scopes

The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-Logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation

concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques, such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and Neighborhood Cleaning Rule, are


investigated in this study, as sketched in the example below. In addition, popular performance criteria which are suitable for imbalanced classification, such as AUC (Area under the ROC curve), are used to evaluate the classifiers.
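For reference, here is a usage sketch of these re-sampling techniques. It assumes the third-party imbalanced-learn package, which is one common implementation and not necessarily the toolkit used in this dissertation; the data are synthetic.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE
from imblearn.under_sampling import (NeighbourhoodCleaningRule, RandomUnderSampler,
                                     TomekLinks)

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=3)
print("original class sizes:", Counter(y))

samplers = [RandomOverSampler(random_state=3), SMOTE(random_state=3),
            ADASYN(random_state=3), RandomUnderSampler(random_state=3),
            TomekLinks(), NeighbourhoodCleaningRule()]
for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)   # returns a re-balanced training set
    print(f"{type(sampler).__name__:25s}", Counter(y_res))
```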

The empirical study also used one data set belonging to the medical field, the Hepatitis data. This data set was available on the UCI machine learning repository.


Table 1.1: General implementation protocol in the dissertation

Steps   Contents
        ... algorithms on the same training data.
        ... calculating their performance measures.

(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol 45, No 6, 10853–10864, 2023.

(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision, and Control, Vol 429, 595–612, 2022, Springer.

(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.

Regarding the literature on credit scoring, the dissertation suggests the interpretable ensemble classifier which can address imbalanced data. The proposed model, which uses Decision tree as the base learner, has more specific advantages than the popular approaches, such as higher performance measures and interpretability. The proposed model corresponds to the first article.


Regarding the literature on imbalanced data, the dissertation proposes a method for balancing, de-noising, and removing overlapping samples, which belongs to the ensemble-based approach. This method outperforms the integration of the re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and Neighborhood Cleaning Rule) and popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article.

Regarding the literature on Logistic regression, the dissertation provides a modification to its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the important level of input features without using the p-value. This modification is in the third article.

1.7 Dissertation outline

The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.

• Chapter 1. Introduction
• Chapter 2. Literature review of imbalanced data
• Chapter 3. Imbalanced data in credit scoring
• Chapter 4. A modification of Logistic regression with imbalanced data


Chapter 2 presents the review of approaches to imbalanced data, including algorithm-level, data-level, and ensemble-based-level. Chapter 2 also examines the basic background and recent proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods. That is the basis of the proposed works in the subsequent chapters. Chapter 3 proposes an interpretable credit-scoring ensemble classifier which can provide the importance level of predictors. Furthermore, we innovate the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. The empirical studies are conducted to verify the effectiveness of the proposed works.


There are no conventional definitions of the IR threshold to conclude that a data set is imbalanced. Most authors simply define ID as the case in which there is a class with a much greater (or lower) number of samples (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the interest class has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered as ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.

2.1.2 Obstacles in imbalanced classification

In ID, the minority class is usually misclassified since there is too little information about their patterns. Besides, standard classifier algorithms often operate according to the rules of the maximum accuracy metric. Hence, the classification results are usually biased toward the majority class to get the highest global accuracy and very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to the ignorance of minority samples (they may be treated as noise) to favor the more general patterns of the majority class. As a


consequence, the minority class, which is the interested object in the classification process, is usually misclassified in ID.

The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, which was the proportion of the performance difference between ID and the balanced data, became significant when IR was from 90/10 and greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of IR.

In short, IR is the factor that reduces the effectiveness of standard classifiers.

On data sets with overlapping classes, the standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors become harder to perform. Batista et al. (2004) stated that over-sampling can balance the data, but it can also produce an over-fitting model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009).


Figure 2.1: Examples of circumstances of imbalanced data.

Source: Galar et al. (2011)

On imbalanced and small data sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that when fixing the IR, the more samples of the minority class, the lower the error rate of classifiers.

• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space.

• The characteristics of positive samples, such as borderline, rare, and outlier, affect the performance of standard classifiers. The fact is that borderline samples are always too difficult to be recognized. In addition, the rare and outlier samples are extremely hard to be identified (Napierala & Stefanowski, 2016; Van Hulse & Khoshgoftaar, 2009).


2.2 Performance measures for imbalanced data

Most performance measures are built from the cross-tabulation between actual and predicted labels, known as the confusion matrix.

Table 2.1: Confusion matrix

                    Predicted positive   Predicted negative   Total
Actual positive     TP                   FN                   POS
Actual negative     FP                   TN                   NEG
Total               PPOS                 PNEG                 N

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of the actual positive and negative samples in the training data, respectively. PPOS and PNEG are the numbers of the predicted positive and negative samples, respectively. N is the total number of samples.

From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types, single and complex metrics.


Firstly, on an imbalanced data set with very high IR, standard classifiers often get a very high accuracy and low error rate. It means the number of positive samples classified correctly is small despite their crucial role in the classification task. Secondly, the error rate considers the cost of misclassifying the positive class and the negative equally. Whereas in ID, the misclassification of the positive sample is often more costly than the one of the negative. Therefore, imbalanced classification studies use some single metrics that focus on a specific class, such as TPR (or recall), FPR, TNR, FNR, and precision.

TPR is the proportion of the positive samples classified correctly. Other single metrics are defined analogously: TNR is the proportion of the negative samples classified correctly, FPR and FNR are the proportions of negative and positive samples classified incorrectly, and precision is the proportion of true positives among the predicted positive samples.


In imbalanced classification, instead of accuracy, TPR is the most favored metric because of the importance of the positive class. However, in credit scoring and cancer diagnosis, if only focusing on the TPR and ignoring the FPR, a trivial classifier will assign all samples with the positive label. In other words, the classifier cannot identify the negative samples. Hence, a single metric is not enough to reflect the overall performance of a classifier, especially in ID. It leads to combinations of the above single metrics into complex metrics. The F-measure is one of the most popular complex metrics:

Fβ = (1 + β²) · precision · recall / (β² · precision + recall).

The parameter β is set greater than 1 if and only if FN is more concerned than FP. F1 is the special case of Fβ when the importance of precision and recall is equal.
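A small sketch of the F-measure computed from the confusion counts (the counts and the value β = 2 below are only illustrative):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta: the weighted harmonic mean of precision and recall,
    where recall is weighted beta times as heavily as precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # this is the TPR
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(tp=40, fp=20, fn=60, beta=1))   # F1: precision and recall weighted equally
print(f_measure(tp=40, fp=20, fn=60, beta=2))   # F2: false negatives weighted more heavily
```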

