Construct credit scoring models using logistic regression, neural network and the hybrid model


UNIVERSITY OF ECONOMICS, HO CHI MINH CITY, VIETNAM

INSTITUTE OF SOCIAL STUDIES, THE HAGUE, THE NETHERLANDS

VIETNAM – NETHERLANDS
PROGRAMME FOR M.A. IN DEVELOPMENT ECONOMICS

CONSTRUCT CREDIT SCORING MODELS USING
LOGISTIC REGRESSION, NEURAL NETWORK AND
THE HYBRID MODEL

BY
LE MINH TIEN

MASTER OF ARTS IN DEVELOPMENT ECONOMICS

HO CHI MINH CITY, NOVEMBER 2015

A thesis submitted in partial fulfilment of the requirements for the degree of
MASTER OF ARTS IN DEVELOPMENT ECONOMICS

By
LE MINH TIEN

Academic Supervisor:
DR. PHAM DINH LONG

HO CHI MINH CITY, NOVEMBER 2015


Abstract
Viet Nam's economy is facing many difficulties, and the ineffective operation of enterprises has pushed up the non-performing loan ratio of banks. Between 2007 and 2014, credit growth in Viet Nam trended downward, from 53.89% in 2007 to 11.8% in 2014, with no sign of a strong recovery in the next period. A decline in credit growth implies that enterprises are finding it difficult to obtain credit from lending institutions, and the enterprises that depend mainly on credit are the most strongly affected. The non-performing loan ratio of banks in Viet Nam rose over the same period, from 2% in 2007 to 3.25% in 2014, peaking at 4.08% in 2012. In this period, most enterprises could not obtain bank loans, while banks feared a further rise in the non-performing loan ratio. At the same time, banks are competing strongly with domestic and foreign rivals to gain market share and maintain profits. Viet Nam is a densely populated country (a market of 90 million people with a high proportion of young people), which makes it a potential retail market for banks to expand into in the coming period. To increase the competitiveness of banks and improve loan risk management, this study applies methods commonly used to build credit scoring models: logistic regression, neural networks and a hybrid model. Credit scoring models have been developed and widely applied in finance and banking over recent decades, and they are useful in accelerating the credit analysis process of banks. The final results confirm that characteristics such as age, education, marital status, current living status, living time in the current place, type of job, working time in the current job, working time in the current field, number of dependent people and historical payment have a statistically significant effect on the repayment capacity of a customer. Credit scoring models can classify customers according to the different strategic purposes of their users, and the hybrid models appeared to perform better and more reliably than the separate models.


Contents
CHAPTER 1: INTRODUCTION

CHAPTER 2: LITERATURE REVIEW
2.1 The concept of credit scoring model:
2.2 Judgmental analysis method and credit scoring model:
2.3 Advantages and disadvantages of credit scoring models:
2.4 Historical development of credit scoring models:
2.4.1 Development in credit card and instant loan markets:
2.4.2 Development in mortgage markets:
2.4.3 Development in consumer credit market:
2.5 Common variables in constructing credit scoring models:
2.6 Common techniques employed in credit scoring models:

CHAPTER 3: METHODOLOGY
3.1 Data:
3.1.1 Variables:
3.1.2 Assumptions:
3.2 Methodology:
3.3 Logistic regression:
3.3.1 Theory:
3.3.2 Odds ratio:
3.3.3 Information value:
3.3.4 Quality of the model:
3.3.4.1 Log-likelihood ratio (LR) test:
3.3.4.2 Pearson Chi-Square test:
3.3.4.3 Akaike Information Criterion (AIC):
3.4 Neural Network:
3.4.1 Theory:
3.4.2 Components of artificial neural network:
3.4.3 Back Propagation Algorithm:
3.5 The hybrid model:
3.6 Comparison of models:

CHAPTER 4: EMPIRICAL RESULTS
4.1 Data:
4.1.1 Dependent Variable:
4.1.2 Independent Variables:
4.2 Estimation results:
4.2.1 Construction of Logit models:
4.2.1 Comparison of Logit models:
4.2.1.1 Log-likelihood ratio (LR) test:
4.2.1.2 Pearson Chi-square test:
4.2.1.3 Akaike Information Criterion (AIC):
4.2.1.4 Classification tables:
4.2.1.5 Comparison summary:
4.3 Neural network:
4.3.1 Measurement of Model performance:
4.3.2 Importance of independent variables:
4.4 Hybrid model:
4.4.1 Hybrid model 1:
4.4.2 Hybrid model 2:
4.5 Summary comparison:

CHAPTER 5: CONCLUSION
5.1 Research summary and implication:
5.1.1 Research summary:
5.1.2 Implication:
5.2 Limitations of the study:

References

List of tables
Table 01 Common variables in previous studies
Table 02 Common methods in previous studies
Table 03 Variables and their definitions
Table 04 Summary of selected variables in logit models
Table 05 Log-likelihood ratio (LR) test
Table 06 Pearson Chi-square test result
Table 07 Akaike Information Criterion (AIC) result
Table 08 Classification table of logit models
Table 09 Summary logit model comparison
Table 10 Neural network model summary
Table 11 Classification of Neural network model
Table 12 Importance of independent variables of Neural network model
Table 13 Hybrid model 1 summary
Table 14 Classification of Hybrid model 1
Table 15 Hybrid model 2 summary
Table 16 Classification of Hybrid model 2
Table 17 Selected model summary
Table 18 Correlation Matrix
Table 19 Collinearity Test
Table 20 Results of logit model 1
Table 21 Results of logit model 2
Table 22 Results of logit model 3
Table 23 Results of Neural network model
Table 24 Results of Hybrid model 1
Table 25 Results of Hybrid model 2
Table 26 Summary of Information value of variables

List of figures
Figure 01 Viet Nam credit growth in 2006-2014
Figure 02 Non performing loan ratio in 2006-2014
Figure 03 Steps to construct Credit scoring model
Figure 04 Processing information in an Artificial Neuron
Figure 05 Neural network with one hidden layer
Figure 06 Example of Summation function
Figure 07 Example of Sigmoid function of ANN
Figure 08 Back propagation algorithm of single neuron
Figure 09 Ratio of good/bad customer of dataset
Figure 10 Ratio of good/bad customer based on age of customer
Figure 11 Ratio of good/bad customer based on Current living status
Figure 12 Ratio of good/bad customer based on Education level
Figure 13 Ratio of good/bad customer based on Gender
Figure 14 Ratio of good/bad customer based on Marital status
Figure 15 Ratio of good/bad customer based on Living time at current place
Figure 16 Ratio of good/bad customer based on Type of job
Figure 17 Ratio of good/bad customer based on Working time in present job
Figure 18 Ratio of good/bad customer based on Working time in current field
Figure 19 Ratio of good/bad customer based on Number of dependent people
Figure 20 Ratio of good/bad customer based on Historical payment

CHAPTER 1: INTRODUCTION
In 2007, the financial crisis began in the United States (US) with a sharp decline in home prices, then affected the entire US economy and spread through the world economy. A deep cut in demand all over the world created many difficulties for Viet Nam's export sector at that time. Enterprises had to scale down their operations, and as a result the credit growth of the banking system has slowed in the recent period.
[Chart: Viet Nam credit growth by year, 2006-2014; vertical axis from -10% to 60%. Data labels (not in year order): 53.89%, 37.53%, 31.19%, 25.44%, 25.43%, 12.51%, 11.80%, 12.00%, 8.91%.]

Figure 01: Viet Nam credit growth in 2006-2014
Source: The State Bank of Vietnam’s annual report 2006-2014.
Banks fear losing their capital because of the economy's existing difficulties while signs of economic recovery remain very weak, so they are careful in making their lending decisions. Economists forecast that this situation will continue for the next few years. To survive and develop in this period, some economists have suggested that the retail banking segment will be an alternative strategy that could help banks develop their businesses, and may be the key to growth, because Viet Nam has a market of 90 million people (with a high proportion of young people). This market will generate opportunities for banks to expand services that help consumers increase the value of their assets, manage their businesses better and carry out daily payment activities. Viet Nam, with some typical characteristics of a low-income developing country such as a dynamic young population, rising income and a desire to improve quality of life and lifestyle, offers great potential for retail banking development. To take advantage of this opportunity, banks have to improve their procedures to make them more convenient and strengthen risk management to develop this new segment.
[Chart: non-performing loan ratio by year, 2006-2014; vertical axis from 0% to 5%. Data labels (not in year order): 4.08%, 3.60%, 3.25%, 2.66%, 3.79%, 2.40%, 2.14%, 2.18%, 2%.]

Figure 02: Non performing loan ratio in 2006-2014
Source: The State Bank of Vietnam’s annual report 2006-2014.
The credit scoring model is a useful tool that was first introduced in 1940 and has developed rapidly over the last two decades. It is a statistical technique that helps banks or lending institutions predict the probability that a customer will pay back a loan on time (Mester, 1997). The model enables banks and financial institutions to classify and evaluate customers' risks easily and quickly, so that lending decisions are faster and more accurate than under a judgmental system. This paper builds credit scoring models using logistic regression, a neural network and a hybrid of the two, in order to find the most suitable one and to draw some implications for assessing customer risk.
Research questions and objectives:
This study aims to build credit scoring models using three techniques, namely logistic regression, a neural network and a hybrid of the two, to identify which customer characteristics affect the probability of default, and then to compare the performance of the models and find the best one. The results of this study will answer the following questions:
• Which customer characteristics can be used to identify whether a customer can pay back the loan?
• Which technique is better for constructing a credit scoring model in this study?
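
As a rough illustration of what comparing the performance of models can mean in practice, the sketch below builds a simple classification table for two candidate models and compares their overall accuracy. The labels, predictions and function names are hypothetical toy values chosen for illustration; they are not taken from the MBBank dataset or from the results of this study.

```python
# Toy illustration: compare two classifiers with a classification table.
# 1 = good customer, 0 = bad customer. All data below are made up.

def classification_table(actual, predicted):
    """Count how many good/bad customers were classified as good/bad."""
    table = {"good_as_good": 0, "good_as_bad": 0,
             "bad_as_good": 0, "bad_as_bad": 0}
    for a, p in zip(actual, predicted):
        key = ("good" if a == 1 else "bad") + "_as_" + ("good" if p == 1 else "bad")
        table[key] += 1
    return table

def accuracy(table):
    """Share of customers placed in the correct group."""
    total = sum(table.values())
    return (table["good_as_good"] + table["bad_as_bad"]) / total

# Hypothetical observed loan status and predictions from two models.
actual  = [1, 1, 1, 0, 0, 1, 0, 1]
model_a = [1, 1, 0, 0, 1, 1, 0, 1]
model_b = [1, 0, 0, 0, 1, 1, 1, 1]

table_a = classification_table(actual, model_a)
table_b = classification_table(actual, model_b)
print("Model A:", table_a, "accuracy =", accuracy(table_a))
print("Model B:", table_b, "accuracy =", accuracy(table_b))
```

Overall accuracy is only one possible criterion; a lender may weigh the two kinds of error differently, since accepting a bad customer and rejecting a good one carry different costs.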

Scope of this study:
The sample collected for this study comprises the personal information of 690 MBBank customers who had a loan with the bank during 2012. The status of their loans (default/delinquent or good) was recorded at the end of 2013. The personal information covers the following customer characteristics: gender, age, education, marital status, current living status, living time in the current place, type of job, working time in current job, working time in current field, number of dependent people and historical payment.
Structure of this study:
The first chapter presents the reasons for conducting the study, the research questions and the objectives. The second chapter gives an overview of credit scoring models, then discusses the independent variables that are commonly used and the methods most often applied to construct a credit scoring model. The third chapter details the independent variables used in the research, including the meaning of each variable and the assumptions about it, and the steps taken to construct the models. The first part of the fourth chapter gives an overview of the dataset used in this study and illustrates the initial relationships between the independent variables and the dependent variable. The second part of the fourth chapter outlines the steps taken to construct and select the final credit scoring model. Findings, implications and limitations of this study are discussed in the last chapter.


CHAPTER 2: LITERATURE REVIEW
This chapter presents an overview of credit scoring models, then discusses the independent variables that are commonly used and the methods most often applied to construct a credit scoring model.
2.1 The concept of credit scoring model:
Credit appraisal is the process of gathering, analyzing and classifying factors or variables to give the credit analyst an overview of a customer and so help in making the final financing decision. It is an important process that plays a key role in keeping the operations of credit institutions safe from risk. Credit scoring is a common tool that lenders use to classify the credit of customers. The term “credit scoring” has two components, “credit” and “scoring”. “Credit” means that a person borrows money now and has to pay it back in the future, including principal and interest, while “scoring” denotes applying different measurements to assess customers or classify them into separate groups based on the different purposes of credit institutions. Hence, “credit scoring” is interpreted as the use of statistical models to convert the appropriate data into variables that have a statistically significant effect on the payment ability of customers. A credit scoring model is a model that can quantify customer characteristics and classify customers according to the different purposes of credit institutions. Through this classification, lenders can make final decisions, choosing which customers should receive finance and which should be rejected.
Consumer loans are usually small relative to the total loan market, so evaluating each loan individually is not always cost-effective. Along with the development of technology and of research in risk management, banks and lenders have come to depend heavily on scoring models for approving these loans, to shorten processing time and make approval decisions more accurate. According to Mester (1997), credit scoring models were applied to assess credit card applications, accounting for 70% of total small business applications in banks and other lending institutions. A credit scoring system was first applied in the 1950s, though only to a limited extent in US banks; in the 1990s it was also applied to evaluating loans in housing finance (Straka, 2000). Multiple discriminant analysis, introduced by Altman in 1968, is known as the oldest and most common model in the history of credit scoring. Since then, other techniques such as logit and probit models, multiple discriminant models and neural networks have been developed in this field. All of these methods have succeeded in identifying variables or customer characteristics that significantly affect a customer's probability of default.
Hörkkö (2010) claimed that the ultimate objective of such models is to help bankers or lenders distinguish “good” from “bad” customers and so make accurate lending decisions, which minimizes credit risk and default rates. Individual items of customers' personal information, such as age, education, marital status and number of children, are treated as input variables, and statistical techniques are then used to discover which customer characteristics distinguish “good” customers from “bad” ones. A credit scoring model (CSM) can be purchased or constructed in-house, depending on the bank's capability. There is no fixed CSM: the model varies with the sample data and the techniques used to create it. In general, historical information such as defaulted or non-defaulted status is employed as the binary dependent variable, and the personal information described above serves as the independent variables. A statistical technique is then used to estimate how well these variables distinguish “good” customers from “bad” customers, with the support of empirical modeling. The CSM produces a score for each applicant, and this score is compared with the lender's required cut-off value to decide whether to grant the loan. If the score is higher than the required threshold, the lender accepts the application.
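
The cut-off rule described above can be sketched in a few lines. This is a minimal illustration that assumes the score is a logistic function of a weighted sum of customer characteristics; the variable names, coefficients and the 0.5 cut-off are hypothetical values chosen for the example and are not the estimates of this study.

```python
import math

# Hypothetical coefficients, for illustration only (not estimated in this study).
INTERCEPT = -1.0
WEIGHTS = {"age_over_30": 0.8, "is_married": 0.6, "num_dependents": -0.4}

def repayment_score(features):
    """Map customer characteristics to a score between 0 and 1.

    A higher score means the applicant looks more like a "good" customer.
    Characteristics missing from `features` are treated as 0.
    """
    z = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def decide(score, cutoff=0.5):
    """Accept the application only if the score exceeds the lender's cut-off."""
    return "accept" if score > cutoff else "reject"

# Two made-up applicants.
applicant_1 = {"age_over_30": 1, "is_married": 1, "num_dependents": 0}
applicant_2 = {"age_over_30": 0, "is_married": 0, "num_dependents": 3}

for applicant in (applicant_1, applicant_2):
    score = repayment_score(applicant)
    print(round(score, 3), decide(score))
```

Raising the cut-off makes the lender more conservative (fewer approvals, fewer expected defaults), while lowering it does the opposite, which is one way a lender can adapt the same model to different purposes.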
2.2 Judgmental analysis method and credit scoring model:
Credit assessment is a process of reviewing the characteristics of a customer and comparing them with those of past customers. If a customer's characteristics match those of past customers who did not pay back their loans on time, the credit officer will reject the loan application. By contrast, customers whose characteristics match those of past customers who paid back their loans on time will have their financing approved. Two techniques are commonly used in the credit assessment process: judgmental analysis and the credit scoring model.
Each analysis technique has its own advantages and disadvantages. The performance of judgmental analysis depends heavily on the accumulated experience and analytical thinking of the credit analyst. This technique is therefore strongly affected by subjectivity, uncertainty and personal perspective when credit decisions are made. Alongside these disadvantages, judgmental analysis has its own advantages, such as the ability to use quantitative factors in the analysis process and more accurate decisions when the credit analyst is experienced.
Credit scoring, on the other hand, is a process in which credit institutions use a large amount of data on past customers (customer characteristics) to create a quantitative/statistical model that classifies potential customers into a bad group or a good group and so produces a quick financing decision. Credit scoring models are said to have helped lenders improve customer service and save time, operating costs and approval costs in the analysis process. However, the technique also faces some criticism regarding the reliability of these models, because statistical models always rest on assumptions and raise issues related to the type of data used to build them. Despite this criticism, credit scoring systems have developed rapidly and have become a crucial technique in the finance and banking sectors in recent decades.
2.3 Advantages and disadvantages of credit scoring models:
The rapid development of credit scoring applications confirms that they bring many benefits to users, especially in the finance and banking sectors in recent decades. An outstanding advantage of these applications is that they require less information to produce a classification result. Credit scoring models use only those variables that are statistically significant in reflecting the repayment capacity of customers; variables that are not statistically significant are removed from the model. Judgmental analysis, meanwhile, makes decisions by reviewing all the information related to a customer and does not exclude any element. In addition, credit scoring systems consider the characteristics of both good and bad customers, while judgmental analysis focuses largely on the characteristics of bad customers. Credit scoring models are built by feeding a large amount of historical data into statistical models, while judgmental analysis is based primarily on the experience and analytical thinking of evaluators or credit officers. Credit scoring systems help lenders make objective decisions, while judgmental analysis contains the subjectivity of evaluators, which can cause issues related to discrimination. Lenders can understand the relationship between customer characteristics and payment behavior when applying credit scoring, while this relationship is difficult to describe clearly in judgmental analysis. Another important advantage of credit scoring is that the classification result for a customer is the same regardless of which credit evaluator handles the case, which is not true of judgmental analysis.
Besides the benefits above, credit scoring models have others: they save time, speed up decisions, lower approval costs and reduce mistakes compared with judgmental analysis; they make it easier to provide information for the risk control process; they require less customer information to classify; and their structure can be changed to classify customers according to the different purposes of users.
Despite the many benefits that credit scoring models bring in practice, their application has drawn some criticism. First, a credit scoring model may use any feature or characteristic of the client as an input variable. The model will retain any variable that is statistically significant, even when the relationship between that variable and the creditworthiness of the customer is ambiguous. In addition, credit scoring models often leave out proxies for economic factors that may affect the repayment capacity of the customer. Credit scoring models can vary across regions, countries and cities because of regional differences, so there is no single official credit scoring model for the whole world. Moreover, the budget to buy a credit scoring model and the cost of training analysts to use it are two serious issues for those who want to apply a credit scoring system. Sometimes a credit scoring model rejects a good customer because of a sudden change in the customer's circumstances (for example, a change of residence or employer), without the deeper analysis that judgmental analysis would provide. Another weakness is that a credit scoring model is built from historical data, so the weights of its variables are fixed and the model loses accuracy when the pattern of customers changes over time.
2.4 Historical development of credit scoring models:
Risk management plays an important role in the operation of banks and financial institutions. These institutions are aware that banking activities influence and are influenced by the economy, so they have a responsibility to set requirements that keep the banking system, and the economy in general, operating safely. At the same time, the trade-off between risk and return has to be analyzed carefully, because it is widely believed that high risk goes with high return. The key to risk management is combining all the information about a customer in order to analyze it and make the final financing decision. The performance of the customer classification system therefore helps lenders control risk and operate efficiently.

The rapid development of the credit industry all over the world in recent decades has made managing a large volume of loans very difficult, and it is hard to control risk for the entire system. Credit scoring was established and developed to handle these issues. In recent decades, credit scoring models have been widely applied in the financial sector and have proved able to classify good and bad customers quickly. Applying a credit scoring model can reduce the cost of the approval process, reduce the number of wrong final decisions, and save time and effort in the analysis process. Typically, the credit approval process is conducted using one of two common techniques: judgmental analysis or a credit scoring model. Judgmental analysis is a technique that relies on the knowledge, experience and analytical thinking of the credit analysis officer. Owing to the rapid development of the credit industry and the need to quantify risk in their operations, financial institutions have decided to apply credit scoring models in the credit analysis process.
Credit scoring systems can classify customers as good customers, who are expected to pay back their loans on time, or bad customers, who are expected to be insolvent. Credit scoring models are said to classify customers more accurately than judgmental analysis, which allows banks to manage different customer groups and offer them cross-services. One of the main objectives of using credit scoring in the financial sector is to contribute to the development of credit management and to support the credit approval process effectively.
In developed countries, credit scoring models have been applied widely and effectively. The number of applications has increased rapidly because it is supported by good infrastructure and the availability of large amounts of data, while developing countries face many limitations in data availability and IT infrastructure that prevent them from applying credit scoring models effectively.
According to West (2000), credit scoring models are widely applied in the financial sector, primarily to improve the process of gathering information and analyzing credit, reduce costs and speed up decision-making. According to statistics, about 82% of banks in the US used credit scoring models to decide which customers should be approved for credit cards. Recently, some credit institutions and mortgage lenders have started to develop credit scoring models to support credit decisions, in an attempt to improve and enhance risk management.
The most important task in building a credit scoring model is collecting customer information,
and this process must ensure accuracy and honesty. In general, this information is collected
from customers' loan applications and other related sources. Personal information such as age,
gender, marital status, education, income, type of current job, experience, homeownership,
number of dependents and state of birth is used as input to build a credit scoring model.
2.4.1 Development in credit card and instant loan markets:
Agarwal et al (2009) evaluated the effect of customer characteristics on the likelihood of
default. The dataset of the study included 170,000 samples. They observed the payment behavior
of these customers and pointed out that factors such as monthly spending, amount of debt, income,
asset accumulation, economic conditions, legal environment and demographic structure affect the
repayment capacity of customers. The final result also suggested that customers who had left
their place of birth were more likely to default than others, while customers who were married
and owned a house had a very low probability of default. Another interesting result is that
customers under 30 and over 60 years old were better able to pay back their loans than the rest.
Finally, customers with high income or assets were consistently responsible in repayment.
In 1999, Dunn and Kim conducted research to identify the factors that determine the probability
of default in the credit card sector. They telephoned 500 customers living in the state of Ohio.
The interview results showed that the ratio of the minimum monthly payment to income has a
statistically significant effect on customers' repayment capacity. Further results of this study
showed that age, marital status and number of dependents are also strongly linked to the
possibility of default, while education level, income and ownership status do not have the
significant effect that was initially assumed.
Some studies did not focus on classification but on profit maximization. According to Boyes et
al (2002), the main purpose of credit analysis is to give a more accurate estimate of the
probability of default; depending on the level of default probability, lenders can offer
customers loans at different interest rates commensurate with their risk. The results showed
that age, education, asset ownership, number of dependents and the ratio of spending to income
are the most influential factors. Autio et al (2009) conducted an internet survey of 1951 young
adults between 18 and 29 years old. They collected personal information such as age, gender,
financial status, income, employment status and family structure, as well as the status of their
credit, such as mortgages, student loans and small loans. In particular, the study also measured
respondents' attitudes towards borrowing money. The results showed that those aged 18 to 23
tended to apply for instant loans more frequently than the others, while the group with high
income and stable employment preferred credit card loans. The research also suggested that
gender did not significantly affect a customer's borrowing decisions, while employment status,
income and family structure were the most influential factors.
2.4.2 Development in mortgage markets:
In the mortgage market, lenders usually require collateral from customers to guarantee their
loans. The probability of default in this market is strongly affected by changes in exchange
rates or interest rates because these loans are always long term (Zorn and Lea, 1989).
In 2006, Vasanthi and Raja investigated the relationship between income and customer
characteristics. They concluded that the age of the head of household is very important: for
example, a younger head of household has a higher probability of default because of high
financial stress and less experience in money management. Customers who have a high income and
borrow only a small loan have the lowest probability of default because they can control and
manage their finances sensibly. In addition, traditional characteristics such as education level
and marital status also significantly affected customers' payment behavior. Those with a high
education level find it easier to get a good and stable job and thus have strong financial
capacity. The results also suggested that young customers and divorced customers were more
likely to default than others because they lack experience in financial management and are
psychologically unstable.

2.4.3 Development in consumer credit market:
Kocenda and Vojtek (2009) used 3403 observations and collected 21 customer characteristics
(variables). The main purpose of their research was to identify which customer characteristics
significantly affect payment behavior and which technique is better for building a credit
scoring model. The results showed that logistic analysis and CART analysis perform equally well.
Both techniques indicated that characteristics such as education level, marital status, purpose
of borrowing, asset accumulation and the transaction history between lender and customer are the
most influential factors. A disadvantage of this study is that it used a small dataset to build
the credit scoring model. The authors also suggested that non-parametric methods could be
considered as an alternative for building a good credit scoring model.
In 1997, Arminger et al applied three techniques, LDA, classification tree analysis and a
feed-forward network, to build credit scoring models, compare them and identify the best
technique. They collected information on 8163 observations from 1991 and 1992 in Germany. The
input variables included basic information such as gender, experience, age, ownership and
marital status. The findings indicated that the three techniques perform similarly and all have
high classification power, but LDA performs slightly better than the others. Customers who have
experience, have high asset accumulation, are female or are married are less likely to default.
Jacobson and Roszbach conducted a study of credit scoring models in 2003, in which they
estimated the bias in the data selection process. The study employed a bivariate approach to
build the credit scoring model, using both rejected and approved loans as inputs. The research
was conducted on 13338 observations in Sweden from 1994 and 1995. The data sources included
financial and personal information of customers. They collected 57 input variables, but only 16
were finally used to build the model. The results indicated that income, age, changes in annual
income and loan amount were the most influential factors in a customer's default probability.
In 2004, Roszbach used the same dataset to investigate the time until customers default. A
typical loan has many installments, creating a cash flow until the customer pays off the loan or
defaults. The study explained two ways to calculate the net present value (NPV) of a loan. When
the customer pays back the loan on time, NPV is calculated as usual; when the customer defaults,
NPV is calculated from the cash flow generated while part of the loan was being repaid, taking
into account the cost of handling the non-performing loan. The author applied a tobit model to
estimate when customers default. The results indicated that lenders did not act rationally when
assessing the trade-off between risk and return. Moreover, the lending policy of these
institutions did not encourage extending loans to earn more profit. Another result showed that
lenders did not differentiate by loan value. Roszbach also provided evidence that lenders did
not act consistently with their objective of profit maximization. Using the tobit model, lenders
could estimate when a customer is likely to default, choose customers who survive longer, and
set lending policy effectively.
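The two NPV cases described above can be sketched as follows. This is a minimal illustration, not Roszbach's actual specification: the function names, the flat per-period payment, and the way the recovery amount and handling cost enter in the period after the last payment are all assumptions made for the example.

```python
def npv(cash_flows, rate):
    """Discount a list of per-period cash flows; cash_flows[0] occurs at time 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def loan_npv_repaid(principal, payment, n_periods, rate):
    """NPV when the customer makes every scheduled payment on time."""
    flows = [-principal] + [payment] * n_periods
    return npv(flows, rate)


def loan_npv_defaulted(principal, payment, periods_paid, rate,
                       recovery=0.0, handling_cost=0.0):
    """NPV when the customer defaults after `periods_paid` payments.

    Assumption: any recovered amount, net of the cost of handling the
    non-performing loan, arrives one period after the last payment.
    """
    flows = [-principal] + [payment] * periods_paid + [recovery - handling_cost]
    return npv(flows, rate)


# Hypothetical loan: 1000 lent, 10 payments of 110, 1% discount rate per period.
print(loan_npv_repaid(1000, 110, 10, 0.01))
print(loan_npv_defaulted(1000, 110, 4, 0.01, recovery=200, handling_cost=50))
```

Under these assumptions, the later a customer defaults, the more payments enter the cash flow, which is why an estimate of the time to default matters for the lender's return.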
Dinh and Kleimeier (2007) employed 56037 observations from one of the biggest banks in Viet Nam
to build a credit scoring model. They applied a forward-stepwise method to select variables.
However, the study faced many limitations because of a lack of necessary information, which is a
basic problem when conducting research in a developing country like Viet Nam. The study
indicated that the length of the customer's relationship with the lender is the most influential
factor, followed by gender and loan amount. The authors also suggested that credit scoring
models should be updated frequently to maintain their performance as economic conditions change.

The study of Updegrave (1987) showed that variables such as number of variables in a model,
payment history, work experience, time living at the current address, income, ownership, age and
saving rate are the factors that most strongly affect customers' payment behavior. This result
was also supported by Steenackers and Goovaerts (1989), who conducted similar research in
Belgium. The authors initially employed 19 variables to create a credit scoring model, but in
the end only 11 variables were statistically significant. Built with logistic regression, the
final model indicated that factors such as age, time working and living at the current place,
loan amount, phone call, working in the state or private sector, monthly income and asset
accumulation strongly affected customers' payment behavior.
In 2004, Ozdemir applied logistic regression to model the relationship between default risk and
demographic and financial factors in the retail credit market. The observations were collected
from a bank in Turkey. The results confirmed that demographic factors did not have a
statistically significant effect on customers' payment behavior, while financial factors did. In
particular, interest rate and loan term were the two factors that most strongly affected payment
behavior: customers with higher interest rates and longer loan terms were more likely to default
than others. The authors explained that with long-term loans, customers are more likely to face
sudden changes in economic conditions, exchange rates or interest rates.
In 1997, Hand and Henley reviewed the research on methods used to build credit scoring models.
They concluded that there is no optimal method for building a credit scoring model; each method
has its own advantages and disadvantages, which depend heavily on the data structure and the
purpose of the users.
2.5 Common variables in constructing credit scoring models:
The main responsibility of credit scoring models is to classify customers into a good group or a
bad group. With the rapid development of credit scoring applications in America, England and
other developed countries, credit scoring models have become more important and are considered a
crucial tool for managing risk and accelerating the lending process. By applying credit scoring
models, lenders can easily analyze customers, assess their payment history and determine their
creditworthiness before making a final credit decision.
To build a credit scoring model, personal characteristics such as gender, age, education level,
number of dependents, type of job and working experience are commonly used as input variables
(Hand et al., 2005; Lee and Chen, 2005; Lee et al., 2002; Steenackers and Goovaerts, 1989).
However, other information, such as loan amount, asset accumulation, monthly income, saving rate
and purpose of borrowing, should also be considered as input variables (Lee and Chen, 2005; Ong
et al., 2005; Steenackers and Goovaerts, 1989) to enhance the performance of the credit scoring
model.
All customer information can be entered into statistical models as inputs; the variables that
are statistically significant are then used in the credit scoring model to classify customers.
The rapid development of credit scoring applications has proved the usefulness of this kind of
model. However, no research explicitly explains why these variables are used in credit scoring
models to classify customers. Additionally, the variables selected for a credit scoring model
depend heavily on the initial data structure provided to build the model. At first, credit
scoring models were built to classify customers into two groups, “good” and “bad”. Then, with
the rapid development of credit scoring and more complex requirements, credit scoring models
were developed to classify customers into three groups, “good”, “bad” and “confuse”, so that
lenders have more information about customers when making final decisions.
There is no explicit requirement on the number of variables in a credit scoring model, so the
variables selected will depend heavily on the data structure and on the specific culture and
economic conditions of each region. However, a credit scoring model commonly contains
approximately twenty variables. Some studies hold that increasing the number of variables
enhances the performance of a credit scoring model (Salchenberger et al, 1992; Leshno and
Spector, 1996; Dvir et al, 2006).
Credit scoring models have demonstrated their necessity and important role in practical
applications, especially in the field of finance and banking. There is some criticism related to
identifying the cut-off point of the model. Past research suggests that there is no optimal
cut-off point; it depends heavily on the attitude of lenders. Lenders who want to increase
lending growth and market share will set the cut-off point lower than usual; by contrast,
lenders who want to control risk strictly will set it higher. Besides the issues regarding the
number of variables and data structure, researchers also pay attention to the sample size used
to build a model. It is believed that the larger the sample size, the higher the accuracy.
However, the sample size used to build a model depends heavily on the availability of
information. Some studies used small samples of only about 300 or 400 observations, such as
Dutta et al (1994) and Fletcher and Goss (1993), while other studies used large samples of
thousands of observations, such as Bellotti and Crook (2009), Hsieh (2004) and Banasik et al
(2003). In particular, credit scoring models for the consumer credit market are commonly built
on small samples of fewer than 1100 observations (Sustersic et al, 2009; Lee and Chen, 2005; Kim
and Sohn, 2004).
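The effect of the cut-off point discussed above can be illustrated with a small sketch. It assumes a score where higher values mean lower risk and an applicant is approved when the score is at or above the cut-off; the scores, outcomes and function name are hypothetical and chosen only for illustration.

```python
def summarize_cutoff(scores, defaulted, cutoff):
    """Return (approval rate, bad rate among approved) for a given cut-off.

    Assumption: a higher score means a better customer, and an applicant
    is approved when score >= cutoff.
    """
    approved = [s >= cutoff for s in scores]
    n_approved = sum(approved)
    n_bad = sum(1 for a, d in zip(approved, defaulted) if a and d)
    approval_rate = n_approved / len(scores)
    bad_rate = n_bad / n_approved if n_approved else 0.0
    return approval_rate, bad_rate


# Hypothetical scores and observed outcomes (1 = defaulted, 0 = repaid).
scores = [820, 760, 700, 640, 600, 560, 500, 450]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

for cutoff in (650, 550):
    rate, bad = summarize_cutoff(scores, defaulted, cutoff)
    print(f"cut-off {cutoff}: approval rate {rate:.2f}, bad rate among approved {bad:.2f}")
```

In this toy data, lowering the cut-off approves more applicants but also raises the share of defaulters among them, which is the trade-off between growth and risk control that lenders resolve according to their own attitude.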
Finally, some studies faced a data bias problem because the authors used only customers whose
loans had been approved as input data to build the credit scoring model. This problem limits how
well the model represents the whole population and therefore affects its performance.
The table below shows the independent variables commonly used in some previous studies. The
names of these variables may differ from the names used in those studies, but their meaning is
the same across studies.
Studies (columns): Jacobson and Roszbach (2003); Dinh et al. (2007); Agarwal et al. (2009); Kocenda and Vojtek (2009)
Variables (rows): Age***; Education; Gender; Marital status***; Central city; State of birth; Migrating out of state of birth; Time living in current place; Income; Monthly expenses; Number of dependent people; Type of job; Working time in current job; Wealth; Cell phone; Residential status***; Old loans; Length of relationship; Maturity of the loan; Savings
Table 01: Common variables in previous studies
Note: *** the most significant variables in previous studies
Most previous studies found that customer characteristics such as age, education, marital status
and residential status have a significant effect on customers' ability to pay (Agarwal et al,
2009; Dinh and Kleimeier, 2007; Kocenda and Vojtek, 2009). Moreover, other characteristics such
as income, length of relationship, maturity of the loan and savings also affect customers'
ability to pay (Vasanthi & Raja, 2006; Ozdemir & Boran, 2004; Jacobson and Roszbach, 2003).
2.6 Common techniques employed in credit scoring models:
Arminger et al. (1997) used three different methods, LR, CT and NN, in credit modeling and
compared their performance. They used input variables such as gender, time in present job, age
and available/married as independent variables. The dataset was collected from one of the
largest retail banks in Germany. They used cross-validation to build the models and test their
performance. The results implied that all three techniques have similar predictive power, but LR
is slightly better than the others and CT performs worst.
Similarly, in 1996, Desai and his co-authors used neural network, logit regression and linear
discriminant analysis models to test their performance in building CSM. Their data was collected
from 53 different credit institutions in the US from 1988 to 1991. The results were mixed: NN
outperformed the others in predicting bad loans, but LR and NN performed equally in classifying
good and bad loans. Overall, LR was always better than LDA. In another study, Lee et al. (2002)
compared the performance of four techniques, LDA, LR, NN and a neural discriminant method, and
found that the four models had the same predictive power in distinguishing good from bad
customers.
The parametric and nonparametric techniques LR and CT were used by Kocenda and Vojtek (2009) to
estimate the determinants of default. They claimed that both results are reliable and suggested
that the CT method could be used to create better models. However, the earlier studies of Luo
(2008) and Yang (2009) suggested that LR always outperforms because of its power in identifying
which customer characteristics affect the default rate.

Hand and Henley (1997) and previous studies revealed that the best method depends on the
specific data and input variables used. Each classification method has its own advantages and
disadvantages: LR and the nearest neighbour method are easy to apply and their results are easy
to understand, while neural networks have high predictive power but it is difficult to explain
exactly how their results were produced. Paliwal (2009) pointed out that over the last decade,
neural networks have become more popular and widely applied in lending institutions as an
alternative to traditional statistical models in constructing CSM. Some other studies found that
a hybrid model combining feed-forward neural networks with traditional statistical methods such
as DA and LR enhances model performance (Cheng et al, 1994; Paliwal et al, 2009).
Studies (rows): Cheng et al, 1994; Desai et al, 1996; Arminger et al, 1997; Hand & Henley, 1997; Lee et al, 2002; Kocenda & Vojtek, 2009; Paliwal et al, 2009
Methods (columns): LDA; CT; LR; NN; Hybrid Model
Table 02: Common methods in previous studies
Note: *** the better method in their study
However, logistic regression is the most popular method and is proposed by many papers because
of its high performance in distinguishing good from bad customers (Cheng et al 2003, Laitinen
2000). Some other studies also criticize this method because it does not require an assumption
of a linear relationship between the independent variables and the dependent variable, and the
dependent variable need not be normally distributed. Another study, by Chen and Huang (2003),
showed that non-linearity is weak in most credit scoring datasets and thus logistic regression
gives reliable estimates. Previous research has suggested that estimates from logit or probit
regression are always more accurate than those from DA (Wilson et al., 2000).
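For reference, the standard form of the logistic regression model discussed above can be written as below. The notation is chosen for this illustration only and is not taken from the cited studies: $y = 1$ denotes a bad customer, $x_1, \dots, x_k$ are the input variables and $\beta_0, \dots, \beta_k$ are the coefficients to be estimated.

```latex
P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}},
\qquad
\ln \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
```

The second expression shows that the model is linear in the log-odds rather than in the probability itself, which is one way to read the remark that weak non-linearity in credit scoring data leaves logistic regression estimates reliable.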

Recently, some studies have proposed a new approach to building CSM. They combine different
techniques to construct credit scoring models because each model has its own advantages in a
specific segment or criterion (Koh et al, 2006). They thus take advantage of the strengths of
individual models to create a better CSM. The studies of Lee et al (2000) and Zhu et al (2001)
supported this view: their results showed that the hybrid model performed significantly better.
Similarly, Lee & Chen (2005) compared the performance of individual models such as DA and LR
with a hybrid model combining a neural network and multivariate adaptive regression splines and
reached the same result.
Based on the above justifications, this study will apply logistic regression, neural network and
a hybrid model of these techniques to build CSM and compare their performance.
