Performance of Modern Techniques for Rating Model Design



Master Thesis

Supervising Professor: Prof. Dr. Uwe Schmock
Practical Supervisor: Ernst Young Zürich
Master of Advanced Studies in Finance

Student: Anca Antonov
Year: 2002/2003

Zürich 2004


Table of Contents

1. Introduction

2. Classification problem
2.1. Classification theoretical framework
2.2. Classification problem and corporate rating problem
2.3. Generalization Performance

3. Data Set Description

4. Linear Classifiers

4.1. Linear Regression
4.2. Gradient Descent
4.3. Discriminant Analysis
4.3.1. Linear Discriminant Analysis

5. Non-Linear Classifiers
5.1. Quadratic Discriminant Analysis
5.2. Polynomial Regression
5.3. Logistic Regression
5.4. K-Nearest Neighbors Regression
5.5. Parzen Windows density estimator

6. Neural Networks
6.1. Multilayer Perceptrons
6.1.1. Backpropagation Training with Gradient Descent
6.1.2. QuickPropagation
6.1.3. Scaled Conjugate Gradient
6.1.4. Stochastic Learning Process
6.1.5. Pruning – Optimal Brain Surgeon

6.2. Radial Basis Neural Networks
6.2.1. Probabilistic Neural Networks
6.3. Learning Vector Quantization
6.4. Self-Organizing Maps

7. Conclusions and Further Research

1. Introduction

Credit risk forecasting is one of the leading topics in modern finance, as bank regulation has made increasing use of external
and internal credit ratings. One of the most important examples is the package of rules for determining the required capital for
market risk in the trading book, issued by the Basel Committee on Banking Supervision (BCBS). The discussion process that led to
the June 1999 BCBS proposal for a revised international accord follows this trend of growing importance attached to the credit
scoring process, culminating in a more prominent proposed role for credit ratings in the determination of overall capital for
banking institutions. The problems that regulators and practitioners are facing, however, are not few.

Most of the problems that we face in credit scoring are technical rather than theoretical. In an ideal
(theoretical) world, probabilities of default (PDs) could be assigned directly to obligors. In such a world the model
builder would know the probability distribution of future defaults within the population of borrowers. This
information is, however, unknown to the model builder a priori. Because of this data restriction, a two-step
approach is usually carried out. First, based on past information, default risk models assign a credit
score to each corporate observation, which leads to a ranking of the corporations under consideration.
Second, given the ranking, corporations are mapped to an internal grade for which a PD has to be estimated.

Statistical scoring methods combine and weight individual accounting ratios to produce a measure, a credit risk score, that
discriminates between healthy and problem firms. The most widely used statistical methods are linear discriminant analysis and
logistic regression.

The classic Fisher linear discriminant analysis seeks to find a linear function of accounting variables that maximizes the
differences (variance) between the two groups of firms while minimizing the differences within each group. The variables of the
scoring function are generally selected among a large set of accounting ratios on the basis of their statistical significance. The
coefficients of the scoring functions represent the contributions (weights) of each ratio to the overall score.

All in all, multivariate accounting-based credit-scoring models have been shown to perform quite well. In particular, linear
discriminant analysis seems robust even when the underlying statistical hypotheses do not hold exactly, especially when used
with large samples. Logistic analysis has produced similar results. Some recent studies (see [4] for a review of all methods
applied in rating methodologies) use both methods and choose the one with the best out-of-sample performance, to avoid
problems of sample-specific bias and overfitting.

A relatively new – and less thoroughly tested – approach to the problem of credit risk classification is based on artificial
intelligence methods, such as expert systems and automated learning (neural networks, decision trees and genetic algorithms).
These methods dispense with some of the restrictive assumptions of the earlier statistical models, as explained at the beginning
of Chapter 6.

The scope of this master thesis is to test the efficiency and accuracy of logistic regression, linear and quadratic
discriminant analysis, polynomial regression and k-nearest neighbours in comparison with neural networks. The thesis is
organized as follows: the second part gives an overview of the classification problem and of the appropriate statistical
measures of generalization performance. The data used are analysed in the third part, and the statistical methods are explained
in parts 4 to 6 together with their results. The conclusions are detailed in the seventh part, together with some final remarks
and possible extensions of the topic.

2. Classification problem

2.1. Classification theoretical framework

The decision-theoretical framework is based on classical decision theory as presented in the book of Christopher Bishop [2]. An
observation is a feature vector $X = x$ from some space $X = \{x_j,\ j = 1..N\}$, where all the available observations
constitute the sample space. The true class membership is $\omega \in \Omega = \{w_1, w_2, w_3, \dots, w_M\}$, M being the
total number of classes, supposed finite. We assume that pairs $(X, \omega)$ are drawn independently from the joint
distribution $p(x, \omega)$ on $X \times \Omega$.

Given an observation (vector of features) $X = x$, the goal is to make a decision regarding the true class membership of the
observation. This is the classification problem: predict the true class $\omega$ for an observation $X = x$. In this context,
we define the action space $Y$, which is the space of allowable decisions in our classification problem. The action space can
be seen as an extension of the space of the classes, $Y = \{\Omega, \text{'outlier'}, \text{'doubt'}, \dots\}$, as in Ripley.
In our case, we will assume that the action space equals the space of classes, $Y = \Omega$. The relationship between the
action space $Y$ and the space of the classes $\Omega$ is quantified by the loss function $L(\omega, y)$, which gives the loss
incurred if $\omega$ is the true state of nature and action $y$ is taken.

The link between the sample space and the action space is the classifier. The classifier is a nonrandomized decision rule
$\delta$ that specifies, for each observation $x \in X$, what action $y \in Y$ from the space of allowable decisions to take
if $X = x$ is observed. The classifier is thus a map from the sample space $X$ to the action space $Y$, $\delta: X \to Y$.

A classifier $\delta$ can be evaluated by its risk function $R(\omega, \delta)$, given by:

$$R(\omega, \delta) = E_{x \mid \omega}\, L(\omega, \delta(x)) = \int_X L(\omega, \delta(x))\, P(x \mid \omega)\, dx$$
The risk function is the expected loss incurred by using the classifier $\delta$ when $\omega$ is the true state of nature. It
is desirable to choose a classifier that has a small value of the risk function $R(\omega, \delta)$ for all classes $\omega$.
Comparing classifiers on the basis of their respective risk functions can be difficult, since different classifiers might give
superior results in separate subsets of $\Omega$.
In Bayes decision analysis one attempts to choose the classifier $\delta$ that minimizes the risk over the space of the
classes, $\Omega$, weighted by the prior probabilities of the classes. This is called the Bayes risk, $B(\pi, \delta)$, and is
given by

$$B(\pi, \delta) = \sum_{\omega \in \Omega} R(\omega, \delta) \cdot \pi(\omega)$$

where $\pi(\omega)$ is the prior probability of the class $\omega$.
Minimizing the Bayes risk is achieved by choosing, for each $x$, the decision $\delta(x)$ that minimizes the posterior expected
loss $E_{\omega \mid x}\, L(\omega, \delta(x))$ (which is usually called the conditional risk),

$$E_{\omega \mid x}\, L(\omega, \delta(x)) = \sum_{\omega} L(\omega, \delta(x))\, p(\omega \mid x)$$

where $p(\omega \mid x)$ is the posterior probability for class $\omega$ given $x$.

Using the zero-one loss function,

$$L_{0/1}(\omega, y) = \begin{cases} 0 & \text{if } y = \omega \\ 1 & \text{otherwise} \end{cases}$$

the classifier is constructed by choosing the class giving the maximum posterior probability.
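
As an illustration of the decision rule just described, the following minimal sketch computes the conditional risk for every possible action and picks the action with the smallest value. The posterior values and the function names are hypothetical and chosen only for this example; they are not taken from the thesis.

```python
import numpy as np

def conditional_risk(loss, posterior):
    """Posterior expected loss for every action y.

    loss[w, y] is L(w, y); posterior[w] is p(w | x).
    Returns a vector whose y-th entry is sum_w L(w, y) * p(w | x).
    """
    loss = np.asarray(loss, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    return posterior @ loss

def bayes_decision(loss, posterior):
    """Index of the action that minimizes the conditional risk."""
    return int(np.argmin(conditional_risk(loss, posterior)))

# Hypothetical posterior p(w | x) over M = 3 classes for one observation x.
posterior = np.array([0.2, 0.5, 0.3])

# Zero-one loss: 0 when the action equals the true class, 1 otherwise.
zero_one = 1.0 - np.eye(3)

risk = conditional_risk(zero_one, posterior)
decision = bayes_decision(zero_one, posterior)
```

Under the zero-one loss the conditional risk of action $y$ reduces to $1 - p(y \mid x)$, so minimizing it is the same as choosing the class with the largest posterior probability.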

So, as stated before, in classification the aim is to predict the true class $\omega$ for a certain observation. In
discrimination the aim is to separate the sample space into disjoint regions for the classes $\{w_1, w_2, \dots, w_M\}$, but
both of them put a lot of emphasis on the posterior distribution of the class $\omega$.

2.2. Classification problem and corporate rating problem

Usually an expert committee of a specialized financial agency performs the process of bond rating, but this process is shrouded
in mystery because of confidentiality issues. Accordingly, many researchers have tried to formulate alternative approaches to
predict company ratings. Since the rating agencies update their “grades” infrequently, there is considerable value in being
able to anticipate rating changes before they are announced. In this framework, we can consider the process of mapping the
company ratios to the rating as a classical classification problem, where the feature space is five-dimensional and contains
the following ratios:
• X1 – Working Capital over Total Assets
• X2 – Retained Earnings over Total Assets
• X3 – EBIT over Total Assets
• X4 – Capital over Total Assets
• X5 – Sales over Total Assets
and the categories are the Standard & Poor’s classification, starting with AAA (companies with extremely strong capabilities to
meet their financial commitments) and ending with D (already defaulted companies).

So far, three main approaches to constructing classifiers have been analysed, depending on the philosophy behind their
construction:
• A Posteriori Classifiers, which try to model the a posteriori probabilities $p(c_K \mid x)$
• Probability Density Classifiers, which try to model the conditional probabilities and combine them with the help of Bayes'
rule
• Decision Boundary Classifiers, which construct the discrimination function and the decision boundary as well.

$$\omega_j = \delta(x_j) + \varepsilon_j$$


2.3. Generalization Performance

The biggest statistical challenge in this case is to differentiate the models according to their generalization performance,
that is, their accuracy on previously unseen data. The learning or regression process consists of finding an estimate
$\hat{\delta}_\lambda(x; D)$, given the data set $D$, from a class of predictors $\delta_\lambda$ indexed by $\lambda$, where
in general $\lambda \in \Lambda = (S, A, W)$. Here $S \subset X$ denotes a chosen subset of the available data inputs, $A$ is
a selected architecture within a class of model architectures $\mathcal{A}$, and $W$ is the space of adjustable parameters.
The prediction risk $P(\lambda)$ is defined as the expected performance on future data:

$$P(\lambda) = \int p(x)\, [\delta(x) - \hat{\delta}_\lambda(x)]^2\, dx + \sigma_\varepsilon^2$$

which is approximated by the expected performance on the finite test set:

$$P(\lambda) \approx E\left\{ \frac{1}{N} \sum_{j=1}^{N} (\omega_j - \hat{\delta}_\lambda(x_j))^2 \right\}$$

where $(x_j, \omega_j)$ are new observations that were not used in constructing the classifier. Throughout the thesis,
$P(\lambda)$ is used as a measure of the generalization ability of the model, and our strategy is to choose the
model/architecture $\lambda$ that minimizes the prediction risk.

Since we cannot directly calculate the prediction risk $P(\lambda)$, we have to estimate it from the available data set $D$.
The standard split into a training set and a validation (test) set is not advisable when the data set is not very large.
Instead, we can use a sample reuse method which makes maximally efficient use of the data: Cross-Validation (CV). Furthermore,
the method has the advantage of making minimal assumptions about the statistics of the data.

Let $\hat{\delta}_{\lambda(j)}(x)$ be a predictor trained using all observations except $(x_j, \omega_j)$, such that
$\hat{\delta}_{\lambda(j)}(x)$ minimizes:

$$MSE_j = \frac{1}{N-1} \sum_{k \neq j} (\omega_k - \hat{\delta}_{\lambda(j)}(x_k))^2$$

then an estimator for the prediction risk $P(\lambda)$ is the cross-validation mean square error:

$$CV(\lambda) = \frac{1}{N} \sum_{j=1}^{N} (\omega_j - \hat{\delta}_{\lambda(j)}(x_j))^2$$

which is known as leave-one-out cross-validation. However, this form of cross-validation is expensive to compute, especially
for neural networks, and therefore v-fold cross-validation has been chosen instead, where larger subsets of D are left out in
the training phase. The data set D is divided into $v$ randomly selected disjoint subsets of roughly equal size such that:

$$\bigcup_{j=1}^{v} B_j = D, \qquad B_i \cap B_j = \emptyset \quad \forall\, i \neq j$$

Let $\hat{\delta}_{\lambda(B_j)}(x)$ be the estimator trained on all the data except $(x, \omega) \in B_j$; then the
cross-validation estimate is defined as:

$$CV(\lambda) = \frac{1}{v} \sum_{j=1}^{v} \frac{1}{card(B_j)} \sum_{(x_k, \omega_k) \in B_j} (\omega_k - \hat{\delta}_{\lambda(B_j)}(x_k))^2$$

Typical choices for $v$ are 5 or 10, and leave-one-out cross-validation is obtained in the limit $v \to N$.

For neural networks in particular, two other useful criteria have been developed, Generalized Cross Validation (GCV) and
Akaike's Final Prediction Error (FPE), with the following forms:

$$GCV(\lambda) = MSE(\lambda) \cdot \frac{1}{\left(1 - \frac{S(\lambda)}{N}\right)^2}$$

$$FPE(\lambda) = MSE(\lambda) \cdot \frac{1 + \frac{S(\lambda)}{N}}{1 - \frac{S(\lambda)}{N}}$$

where $S(\lambda)$ denotes the number of weights of model $\lambda$. Note that they are slightly different for small sample
sizes, but they are asymptotically equivalent for large N:

$$\hat{P}(\lambda) \equiv MSE(\lambda) \left(1 + 2\, \frac{S(\lambda)}{N}\right) \approx GCV(\lambda) \approx FPE(\lambda)$$

It has been shown by Moody [14] that FPE is an unbiased estimator of the prediction risk for neural network models, provided
that the noise in the observed targets is independent and identically distributed and that weight decay is not used.
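
The estimators discussed in this section can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses an ordinary least-squares model as a stand-in for the predictor, synthetic data generated on the spot, and hypothetical function names (`v_fold_cv`, `gcv`, `fpe`); none of this reproduces the thesis' own implementation.

```python
import numpy as np

def fit_least_squares(X, y):
    """Stand-in predictor: ordinary least squares weights."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def v_fold_cv(X, y, v=5, seed=0):
    """v-fold cross-validation estimate of the mean squared prediction error."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), v)  # disjoint subsets B_j
    fold_errors = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False               # leave B_j out of training
        w = fit_least_squares(X[train_mask], y[train_mask])
        residuals = y[test_idx] - X[test_idx] @ w
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

def gcv(mse, n_weights, n):
    """Generalized Cross Validation: MSE / (1 - S/N)^2."""
    return mse / (1.0 - n_weights / n) ** 2

def fpe(mse, n_weights, n):
    """Final Prediction Error: MSE * (1 + S/N) / (1 - S/N)."""
    return mse * (1.0 + n_weights / n) / (1.0 - n_weights / n)

# Synthetic data (assumption): linear signal plus Gaussian noise of variance 0.01.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=n)

w_hat = fit_least_squares(X, y)
train_mse = float(np.mean((y - X @ w_hat) ** 2))
cv_5 = v_fold_cv(X, y, v=5)
cv_loo = v_fold_cv(X, y, v=n)          # v = N gives leave-one-out
gcv_value = gcv(train_mse, X.shape[1], n)
fpe_value = fpe(train_mse, X.shape[1], n)
```

With $S/N$ small, as here, the GCV and FPE values are almost identical and both lie slightly above the training error, in line with the asymptotic equivalence noted above.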

3. Data Set Description

Our underlying belief is that we can analyse the data in such a manner that we finally understand what separates the different
categories, that is, which features have strong discriminatory power.
To analyse the ability of each model to capture the features that have strong discriminatory power, we have used five
different sets, four synthetic and one real:

• One randomly generated (Rand)
• One with an obvious clustering (Separable Set - SS)
• One with a lightly overlapping distribution (Overlapping Set - OS)
• One with a heavily overlapping distribution (Hard Overlapping Set - HOS)
• A real data set composed of a database of 720 companies, for which we will try to find the right class membership, i.e.
the rating (Real Data - RD)

The first four data sets are plotted below:

Fig. No. 1: Random, Separable, Overlapping, and Hardly Overlapping Sets

The real data set has 720 records, with the following characteristics of the categories.

Group  Initial No. of  Initial Prior Probability  After Excluding  Prior Probability
       Observations    Estimates (%)              Outliers         Estimates (%)
AAA    10              1.41                       10               1.55
AA     72              10.14                      66               10.20
A      214             30.14                      196              30.30
BBB    249             35.07                      230              35.55
BB     95              13.38                      92               14.22
B      57              8.03                       53               8.18
CCC    10              1.41                       -                -
D      3               0.42                       -                -
Table 1. Categories Summary – with and without outliers

The main statistics of the features are summarized in Table 2:

          X1       X2       X3       X4       X5
Min       -5.152   -1.388   -0.4253  -0.5301  0.00
Max       4.217    0.9606   0.4913   3.46455  4.68
25% Q     -0.7275  0.0294   0.0379   0.31552  0.512
75% Q     0.649    0.2681   0.1173   0.89595  1.1675
Mean      0        0.1450   0.0812   0.67701  0.9592
Variance  1        0.0523   0.0059   0.29792  0.4846
Skew      0.360    -1.315   0.0473   1.58505  1.903
Kurtosis  1.679    8.5697   5.7096   3.65485  4.798
Table 2. Features Summary

The above summary and the box plots presented in Fig. 2 clearly show the presence of outliers, especially for variables 1 and
5. Also, the density plots in Fig. 3 show the estimated density of the feature variables and indicate some suspicious outlying
values.
The problem is that the usual outlier tests assume a normal distribution, while the outliers themselves affect both the
distribution and the testing of normality. In this case, the outliers have been defined as the observations for which a
normalized feature exceeds three in absolute value:

$$\frac{|X_i - \mu_i|}{\sigma_i} \geq 3$$

where $X_i$ is feature i and $\mu_i$, $\sigma_i^2$ are the mean and variance of feature i.

Proceeding like this, our data set is reduced by 7% to 647 records, and the last rating classes, CCC and D, are removed. The
category BBB remains the largest one and can be considered the benchmark for clever guessing, having a prior probability of
35.55%.
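
The outlier rule described above can be sketched as follows. This is a minimal illustration on synthetic data; the function name `remove_outliers`, the use of the sample standard deviation, and the choice to drop a whole record as soon as one of its features exceeds the threshold are assumptions made for the example.

```python
import numpy as np

def remove_outliers(X, threshold=3.0):
    """Drop every record with at least one feature whose normalized value
    |X_i - mu_i| / sigma_i reaches the threshold."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)          # sample standard deviation per feature
    z = np.abs(X - mu) / sigma
    keep = (z < threshold).all(axis=1)     # True for records without outlying features
    return X[keep], keep

# Synthetic data (assumption): 500 records, 5 features, two planted outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[0, 0] = 12.0
X[1, 4] = -15.0

X_clean, keep = remove_outliers(X)
removed_share = 1.0 - keep.mean()          # fraction of records removed
```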

Fig.2: Histogram and Density plots of the Data

The first step in analysing the data is to look at the scatter plots to see if we can observe any structure in the data,
especially in the case of the non-linear regression problem. We can plot either the output variable against the input
variables, or the feature variables against one another, but the former is the most relevant.

Fig.3 Scatter Plots of the Variables

Based on these plots, we can assume that there is a linear relationship, which we are going to use for the polynomial
regression, between the following pairs of variables: X1 & X2, X2 & X3, X2 & X5.

The collinearity test included in the Finmetrics module gives a result of 3.966, which clearly shows that the features are not
collinear. A result greater than 20 should be interpreted as a sign of high collinearity, and doubts should be raised if the
value is larger than 10. The same idea is confirmed by the covariance matrix, which exhibits very low values.

      X1         X2         X3         X4         X5
X1    0.012645   0.0039532  0.0010997  0.0026837  0.023819
X2    0.0039532  0.031686   0.0041818  0.03103    0.020852
X3    0.0010997  0.0041818  0.004312   0.0067614  0.01077
X4    0.0026837  0.03103    0.0067614  0.19612    0.0070272
X5    0.023819   0.020852   0.01077    0.0070272  0.2993
Table 3. Covariance Matrix

Nevertheless, when the data are prepared for training the neural network, a normalization must be performed such that each
input variable has zero mean and unit variance. This is the standard procedure. A slightly more complicated normalization is
“whitening”, where the idea is to decorrelate the input variables by rotating the basis vectors and also to standardize them.

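The underlying idea is that the covariance matrix $\Sigma = (X - 1 \cdot \mu^T)^T \cdot (X - 1 \cdot \mu^T)$ is symmetric and
can thus be diagonalized using the orthonormal transformation $Q$. If we choose

$$Z = (X - 1 \cdot \mu^T) \cdot Q \cdot \Lambda^{-\frac{1}{2}}$$

with $\Lambda$ the diagonal matrix of eigenvalues of $\Sigma$,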

then our new variables are zero-mean, uncorrelated and have unit variance (the covariance matrix is the identity matrix).
Also, in order to make them lie in the interval between -1 and 1, we transform the patterns according to the following
relationship:

$$pattern = \frac{2 \cdot pattern - (Max + Min)}{Max - Min}$$
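
The whitening transform and the rescaling to the interval [-1, 1] described above can be sketched as follows. This is an illustrative example on synthetic data; the function names are hypothetical, and the covariance matrix is left unnormalized (no division by the number of records) to match the definition given in the text.

```python
import numpy as np

def whiten(X):
    """Decorrelate and standardize the columns of X."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # X - 1 * mu^T
    sigma = Xc.T @ Xc                    # unnormalized covariance, as in the text
    eigvals, Q = np.linalg.eigh(sigma)   # sigma = Q diag(eigvals) Q^T
    return Xc @ Q / np.sqrt(eigvals)     # (X - 1 mu^T) Q Lambda^(-1/2)

def scale_to_unit_interval(P):
    """Map every column linearly so that its minimum is -1 and its maximum is 1."""
    lo, hi = P.min(axis=0), P.max(axis=0)
    return (2.0 * P - (hi + lo)) / (hi - lo)

# Synthetic correlated features (assumption), shifted away from zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) + 3.0
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] -= 0.5 * X[:, 0]

Z = whiten(X)
S = scale_to_unit_interval(Z)
```

With this definition `Z.T @ Z` is the identity matrix, which is the sense in which the whitened variables are uncorrelated with unit variance.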

Regarding the output variables, we can consider two approaches: binary (where the classes are encoded with 1 and 0 bits and
each record is assigned to the category whose code is at minimum Euclidean distance) or continuous (where we allow the
classifier to take any value between 1 and 6). Although the binary approach is the classical one, we prefer the continuous
one, as it yields similar results.
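
The two output encodings can be illustrated with a short sketch. It assumes that the binary encoding is a one-of-six code (one bit per class) and that a continuous output is mapped to a class by rounding to the nearest integer between 1 and 6; both choices, as well as the function names, are assumptions made for this example and are not stated in the text.

```python
import numpy as np

N_CLASSES = 6  # rating classes kept after outlier removal (AAA ... B)

def encode_binary(labels, n_classes=N_CLASSES):
    """One bit per class: row i has a 1 in column labels[i] - 1."""
    labels = np.asarray(labels)
    codes = np.zeros((len(labels), n_classes))
    codes[np.arange(len(labels)), labels - 1] = 1.0
    return codes

def decode_binary(outputs):
    """Assign each output vector to the class whose code is nearest
    in Euclidean distance."""
    codes = np.eye(outputs.shape[1])
    dist = np.linalg.norm(outputs[:, None, :] - codes[None, :, :], axis=2)
    return dist.argmin(axis=1) + 1

def decode_continuous(outputs, n_classes=N_CLASSES):
    """Round a continuous output to the nearest class label in 1..n_classes."""
    return np.clip(np.rint(outputs), 1, n_classes).astype(int)

targets = encode_binary([2, 6])
noisy = np.array([[0.1, 0.7, 0.2, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.1, 0.2, 0.3, 0.6]])
binary_classes = decode_binary(noisy)
continuous_classes = decode_continuous(np.array([0.4, 3.6, 7.2]))
```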

4. Linear Classifiers

4.1. Linear Regression

In the case of linear regression we assume that the classifier has the following form: $\delta(x) = w^T \cdot X$. In other
words, we assume that there is a linear process underlying the classification decision, $\delta^*(x) = w^T \cdot X +
\varepsilon$, and that the noise process fulfills the Gauss-Markov conditions. Applying the least squares criterion,
$w = \arg\min_w \lVert \omega - X w \rVert^2$, we obtain $w = (X^T X)^{-1} X^T \omega$. For our data sets, the results are
summarized in Table 4.

Data Set           Accuracy (%)  1 Class Error (%)  CV        GCV       FPE       Free Parameters
Separable Set      99.722        100                0.040302  0.049906  0.049752  4
Light Overlapping  90.972        100                0.092036  0.10045   0.10037   4
Hard Overlapping   25.833        76.528             1.9399    2.1653    2.1636    4
Random Set         24.861        71.944             1.8605    2.0469    2.0453    4
Real Data Set      42.222        88.254             0.89425   1.0327    1.0233    6
Table 4. Linear Regression Accuracy
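
The closed-form least-squares fit of this section can be sketched as below. The data are synthetic and purely illustrative; rounding the continuous output to the nearest class label is an assumption about how a class is read off the regression output, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_classifier(X, omega):
    """Least-squares weights w = (X^T X)^(-1) X^T omega.

    The normal equations are solved directly instead of forming the inverse.
    """
    return np.linalg.solve(X.T @ X, X.T @ omega)

def predict_class(X, w, n_classes):
    """Round the linear output to the nearest class label in 1..n_classes."""
    return np.clip(np.rint(X @ w), 1, n_classes).astype(int)

# Synthetic separable data (assumption): one informative feature, one noise feature.
rng = np.random.default_rng(0)
n, n_classes = 300, 3
labels = rng.integers(1, n_classes + 1, size=n)
X = np.column_stack([
    np.ones(n),                              # bias term
    labels + rng.normal(scale=0.1, size=n),  # informative feature
    rng.normal(size=n),                      # uninformative feature
])

w = fit_linear_classifier(X, labels.astype(float))
accuracy = float(np.mean(predict_class(X, w, n_classes) == labels))
```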
4.2. Gradient Descent

Another fix to the singularity problem is to avoid inverting the matrix $X^T X$ by using the gradient descent scheme, which
iteratively searches for the minimum defined in the equation above. The algorithm adapts the parameters in the direction of
the negative gradient of the error function, which is also the basic strategy for training neural networks. The algorithm
(also known as Adaline) goes as follows:

• Initialize the weights
• Repeat until the convergence criterion is met or the maximum number of iterations is reached:
    • Compute the update $\Delta w(t) = \eta \cdot \sum_{n=1}^{N} e(n) \cdot x(n)$, where $e(n) = \delta(n) - w^T(n) \cdot x(n)$
    • Compute the new weights $w(t+1) = w(t) + \Delta w(t)$
    • Compute the new error $E(t+1)$

The crucial parameter in this case is the learning rate $\eta$. A learning rate that is too large means that the algorithm
will fail to converge, while an $\eta$ that is too small will make it very slow. It has been shown in the literature that the
learning rate should satisfy the following relationship:

$$0 < \eta < \frac{2}{Trace[X^T X]}$$

but the results are highly sensitive to the way the parameter is chosen, as can be seen in the graph below, which was produced
on the separable set.

Fig. No.4 Performance Versus Learning Rate
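
The iterative scheme of this section can be sketched as a batch update. The example is an illustration only: the data are synthetic, the error is taken to be half the sum of squared residuals, the residuals are computed with the current weight vector, and the learning rate is set to $1 / Trace[X^T X]$, which lies inside the admissible interval; these are assumptions, not details given in the text.

```python
import numpy as np

def adaline_batch(X, targets, eta, n_iter=1000, tol=1e-12):
    """Batch gradient descent on the squared error of a linear model."""
    w = np.zeros(X.shape[1])                 # initialize the weights
    errors = []
    for _ in range(n_iter):
        e = targets - X @ w                  # e(n) = target(n) - w^T x(n)
        w = w + eta * (X.T @ e)              # w(t+1) = w(t) + eta * sum_n e(n) x(n)
        errors.append(0.5 * float(np.sum((targets - X @ w) ** 2)))
        # stop once the error no longer changes noticeably
        if len(errors) > 1 and abs(errors[-2] - errors[-1]) < tol:
            break
    return w, errors

# Synthetic data (assumption): linear targets with a small amount of noise.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
targets = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(scale=0.05, size=n)

eta = 1.0 / np.trace(X.T @ X)                # inside 0 < eta < 2 / Trace[X^T X]
w_gd, errors = adaline_batch(X, targets, eta)
w_ls, *_ = np.linalg.lstsq(X, targets, rcond=None)  # closed-form reference
```

Because the trace of $X^T X$ is an upper bound on its largest eigenvalue, any learning rate below $2 / Trace[X^T X]$ keeps the update stable, and the iterates approach the closed-form least-squares solution.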
