Finding Minimal Neural Networks for
Business Intelligence Applications
Rudy Setiono
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~rudys
Outline
• Introduction
• Feed-forward neural networks
• Neural network training and pruning
• Rule extraction
• Business intelligence applications
• Conclusion
• References
• For discussion: Time-series data mining using neural network rule extraction
Introduction
• Business Intelligence (BI): A set of mathematical models and analysis methodologies that exploit available data to generate information and knowledge useful for complex decision-making processes.
• Mathematical models and analysis methodologies for BI include various inductive learning models for data mining such as decision trees, artificial neural networks, fuzzy logic, genetic algorithms, support vector machines, and intelligent agents.
Introduction
BI Analytical Applications include:
• Customer segmentation: What market segments do my customers fall into, and what are their characteristics?
• Propensity to buy: Which customers are most likely to respond to my promotion?
• Fraud detection: How can I tell which transactions are likely to be fraudulent?
• Customer attrition: Which customer is at risk of leaving?
• Credit scoring: Which customer will successfully repay his loan or will not default on his credit card payment?
• Time-series prediction.
Feed-forward neural networks
A feed-forward neural network with one hidden layer:
• Input variable values are given to the input units.
• The hidden units compute the activation values using input values and connection weight values W.
• The hidden unit activations are given to the output units.
• Decision is made at the output layer according to the activation values of the output units.
Feed-forward neural networks
Hidden unit activation:
• Compute the weighted input: $w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
• Apply an activation function to this weighted input, for example the logistic function: $f(x) = 1/(1 + e^{-x})$
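The two steps of the hidden unit activation can be sketched in a few lines of Python. This is a minimal illustration only: the function names `logistic` and `hidden_activation` and the numeric values are assumptions, not taken from the slides.

```python
import math

def logistic(x):
    """Logistic activation f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def hidden_activation(weights, inputs):
    """Weighted input w_1*x_1 + ... + w_n*x_n passed through the logistic function."""
    weighted_input = sum(w * x for w, x in zip(weights, inputs))
    return logistic(weighted_input)

# Illustrative weights and inputs (hypothetical values).
print(hidden_activation([0.5, -1.0, 2.0], [1.0, 0.0, 1.0]))
```

A weighted input of 0 gives an activation of exactly 0.5, the midpoint of the logistic function.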
Neural network training and pruning
Neural network training:
• Find an optimal set of weights (W, V).
• Minimize a function that measures how well the network predicts the desired outputs (class labels).
• Error in prediction for the i-th sample: $e_i = (\text{desired output})_i - (\text{predicted output})_i$
• Sum of squared error function: $E(W,V) = \sum_i e_i^2$
• Cross-entropy error function: $E(W,V) = -\sum_i \left[ d_i \log p_i + (1 - d_i) \log(1 - p_i) \right]$, where $d_i$ is the desired output, either 0 or 1, and $p_i$ is the predicted output.
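The two error functions can be written down directly. The sketch below assumes that the predicted outputs are probabilities in (0, 1); the function names and the small clipping constant `eps` are illustrative choices, not part of the slides.

```python
import math

def sum_squared_error(desired, predicted):
    """E = sum_i e_i^2 with e_i = desired_i - predicted_i."""
    return sum((d - p) ** 2 for d, p in zip(desired, predicted))

def cross_entropy_error(desired, predicted, eps=1e-12):
    """E = -sum_i [d_i log p_i + (1 - d_i) log(1 - p_i)], with d_i in {0, 1}."""
    total = 0.0
    for d, p in zip(desired, predicted):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); an added safeguard
        total += d * math.log(p) + (1 - d) * math.log(1 - p)
    return -total

# Hypothetical desired labels and network outputs.
desired = [1, 0, 1]
predicted = [0.9, 0.2, 0.6]
print(sum_squared_error(desired, predicted))
print(cross_entropy_error(desired, predicted))
```

Both functions are zero only when every prediction matches its label, but the cross-entropy penalizes confident wrong predictions much more heavily.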
Neural network training and pruning
Neural network training:
• Many optimization methods can be applied to find an optimal (W, V):
o Gradient descent/error backpropagation
o Conjugate gradient
o Quasi-Newton method
o Genetic algorithm
• A network is considered well trained if it can predict training data and cross-validation data with acceptable accuracy.
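As a minimal sketch of the first method in the list, the code below applies batch gradient descent to the cross-entropy error of a single logistic unit. It is a deliberate simplification: there is no hidden layer, so it does not show full error backpropagation over (W, V), and the learning rate, epoch count and toy data are assumptions chosen for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_gradient_descent(samples, targets, learning_rate=0.5, epochs=2000):
    """Fit one logistic unit by batch gradient descent on the cross-entropy error."""
    weights = [0.0] * (len(samples[0]) + 1)  # last weight acts as the bias
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, d in zip(samples, targets):
            xb = list(x) + [1.0]  # append constant bias input
            p = logistic(sum(w * xi for w, xi in zip(weights, xb)))
            for k, xi in enumerate(xb):
                grad[k] += (p - d) * xi  # derivative of cross-entropy w.r.t. weight k
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad)]
    return weights

def predict(weights, x):
    xb = list(x) + [1.0]
    return 1 if logistic(sum(w * xi for w, xi in zip(weights, xb))) > 0.5 else 0

# Toy data: logical OR, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 1]
w = train_gradient_descent(X, T)
print([predict(w, x) for x in X])
```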
Neural network training and pruning
Neural network pruning: Remove irrelevant/redundant network connections
1. Initialization.
(a) Let W be the set of network connections that are still present in the network, and
(b) let C be the set of connections that have been checked for possible removal.
(c) W corresponds to all the connections in the fully connected trained network and C is the empty set.
2. Save a copy of the weight values of all connections in the network.
3. Find w ∈ W and w ∉ C such that when its weight value is set to 0, the accuracy of the network is least affected.
4. Set the weight for network connection w to 0 and retrain the network.
5. If the accuracy of the network is still satisfactory, then
(a) Remove w, i.e. set W := W − {w}.
(b) Reset C := ∅.
(c) Go to Step 2.
6. Otherwise,
(a) Set C := C ∪ {w}.
(b) Restore the network weights with the values saved in Step 2 above.
(c) If C ≠ W, go to Step 2. Otherwise, stop.
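The pruning loop can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: connections are stored in a dictionary, a removed connection is modelled by deleting its key (equivalent to a zero weight for the toy classifier below), and `retrain` and `accuracy` are hypothetical callbacks supplied by the caller.

```python
def prune_connections(weights, retrain, accuracy, min_accuracy):
    """Greedy pruning of connections, following Steps 1-6 of the slide."""
    present = dict(weights)  # W: connections still in the network
    checked = set()          # C: connections already checked for removal
    while set(present) != checked:               # stop when C = W
        saved = dict(present)                    # Step 2: save weights
        candidates = [c for c in present if c not in checked]

        def without(c):
            return {k: v for k, v in present.items() if k != c}

        # Step 3: connection whose removal hurts accuracy the least.
        w = max(candidates, key=lambda c: accuracy(without(c)))
        trial = retrain(without(w))              # Step 4: zero it and retrain
        if accuracy(trial) >= min_accuracy:      # Step 5: accept the removal
            present, checked = trial, set()
        else:                                    # Step 6: restore and mark checked
            checked.add(w)
            present = saved
    return present

# Toy linear classifier: predict 1 if the weighted sum is positive.
DATA = [({"a": 1, "b": 1, "c": 1}, 1), ({"a": -1, "b": 1, "c": 0}, 0),
        ({"a": 1, "b": 0, "c": 1}, 1), ({"a": -1, "b": 0, "c": 1}, 0)]

def toy_accuracy(weights):
    hits = 0
    for x, label in DATA:
        score = sum(wt * x[name] for name, wt in weights.items())
        hits += int((1 if score > 0 else 0) == label)
    return hits / len(DATA)

# Retraining is replaced by the identity function to keep the example short.
pruned = prune_connections({"a": 2.0, "b": 0.1, "c": -0.1},
                           retrain=lambda w: w,
                           accuracy=toy_accuracy,
                           min_accuracy=1.0)
print(pruned)
```

In the toy data only input `a` determines the label, so the loop removes `b` and `c` and stops once removing `a` is found to break the accuracy requirement.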
Neural network training and pruning
Pruned neural network for LED recognition (1)
[Figure: seven-segment LED display with segments z1 to z7]
How many hidden units and network connections are needed to recognize all ten digits correctly?
Neural network training and pruning
Pruned neural network for LED recognition (2)
[Figure: raw data, a neural network for data analysis, processed data]
z1  z2  z3  z4  z5  z6  z7   Digit
1   1   1   0   1   1   1    0
0   0   1   0   0   1   0    1
1   0   1   1   1   0   1    2
1   0   1   1   0   1   1    3
0   1   1   1   0   1   0    4
1   1   0   1   0   1   1    5
1   1   0   1   1   1   1    6
1   0   1   0   0   1   0    7
1   1   1   1   1   1   1    8
1   1   1   1   0   1   1    9
Neural network training and pruning
Pruned neural network for LED recognition (3)
Many different pruned neural networks can recognize all 10 digits correctly.
Part 2. Novel techniques for data analysis
Neural network training and pruning
Pruned neural network for LED recognition (4): What do we learn?
[Figure: segment patterns for digits 0, 1 and 2; legend: Must be on, Must be off, Doesn't matter]
Classification rules can be extracted from pruned networks.
Rule extraction
Re-RX: an algorithm for rule extraction from neural networks
• New pedagogical rule extraction algorithm: Re-RX (Recursive Rule Extraction)
• Handles mix of discrete/continuous variables without need for discretization of continuous variables
– Discrete variables: propositional rule tree structure
– Continuous variables: hyperplane rules at leaf nodes
• Example rule:
If Years Client < 5 and Purpose ≠ Private Loan, then
  If Number of applicants ≥ 2 and Owns real estate = yes, then
    If Savings amount + 1.11 Income − 38249 Insurance − 0.46 Debt > −1939300, then
      Customer = good payer
    Else …
• Combines comprehensibility and accuracy
Rule extraction
Algorithm Re-RX(S, D, C):
Input: A set of samples S having discrete attributes D and continuous attributes C
Output: A set of classification rules
1. Train and prune a neural network using the data set S and all its attributes D and C.
2. Let D' and C' be the sets of discrete and continuous attributes still present in the network, respectively. Let S' be the set of data samples that are correctly classified by the pruned network.
3. If $D' = \emptyset$, then generate a hyperplane to split the samples in S' according to the values of their continuous attributes C' and stop. Otherwise, using only discrete attributes D', generate the set of classification rules R for the data set S'.
4. For each rule $R_i$ generated:
If support($R_i$) > $\delta_1$ and error($R_i$) > $\delta_2$, then:
– Let $S_i$ be the set of data samples that satisfy the condition of rule $R_i$, and $D_i$ be the set of discrete attributes that do not appear in the rule condition of $R_i$.
– If $D_i = \emptyset$, then generate a hyperplane to split the samples in $S_i$ according to the values of their continuous attributes $C_i$ and stop.
Otherwise, call Re-RX($S_i$, $D_i$, $C_i$).
Business intelligence applications
• One of the key decisions financial institutions have to make is to decide whether or not to grant credit to a customer who applies for a loan.
• The aim of credit scoring is to develop classification models that are able to distinguish good from bad payers, based on the repayment behaviour of past applicants.
• These models usually summarize all available information of an applicant in a score:
• P(applicant is good payer | age, marital status, savings amount, …).
• Application scoring: if this score is above a predetermined threshold, credit is granted; otherwise credit is denied.
• Similar scoring models are now also used to estimate the credit risk of entire loan portfolios in the context of Basel II.
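The application scoring rule amounts to a single threshold comparison. The sketch below is illustrative only: the function name, the return strings and the threshold value 0.7 are assumptions, and the score stands for an estimate of P(applicant is good payer | …) produced by some scoring model.

```python
def application_decision(score, threshold):
    """Grant credit when the estimated probability of a good payer exceeds the threshold."""
    return "grant" if score > threshold else "deny"

# Hypothetical scores for two applicants and a hypothetical threshold.
print(application_decision(0.85, threshold=0.7))
print(application_decision(0.40, threshold=0.7))
```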
Business intelligence applications
• Basel II capital accord: framework regulating minimum capital requirements for banks.
• Customer data → credit risk score → how much capital to set aside for a portfolio of loans.
• Data collected from various operational systems in the bank, based on which scores are periodically updated.
• Banks are required to demonstrate and periodically validate their scoring models, and report to the national regulator.
Business intelligence applications
Experiment 1: CARD datasets.
• The 3 CARD datasets:
Dataset   Training set         Test set             Total
          Class 0   Class 1    Class 0   Class 1    Class 0   Class 1
CARD1     291       227        92        80         383       307
CARD2     284       234        99        73         383       307
CARD3     290       228        93        79         383       307
• Original input: 6 continuous attributes and 9 discrete attributes
• Input after coding: $C_4$, $C_6$, $C_{41}$, $C_{44}$, $C_{49}$, and $C_{51}$ plus binary-valued attributes $D_1$, $D_2$, $D_3$, $D_5$, $D_7$, …, $D_{40}$, $D_{42}$, $D_{43}$, $D_{45}$, $D_{46}$, $D_{47}$, $D_{48}$, and $D_{50}$
Business intelligence applications
Experiment 1: CARD datasets.
• 30 neural networks for each of the datasets were trained.
• Each neural network starts with one hidden neuron.
• The number of input neurons, including one bias input, was 52.
• The initial weights of the networks were randomly and uniformly generated in the interval [−1, 1].
• In addition to the accuracy rates, the Area under the Receiver Operating Characteristic (ROC) Curve (AUC) is also computed.
Business intelligence applications
Experiment 1: CARD datasets.
• In the AUC computation, $\alpha_i$ are the predicted outputs for Class 1 samples, $i = 1, 2, \dots, m$, and $\beta_j$ are the predicted outputs for Class 0 samples, $j = 1, 2, \dots, n$.
• AUC is a more appropriate performance measure than ACC when the class distribution is skewed.
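One common way to compute AUC from the outputs $\alpha_i$ and $\beta_j$, assumed here rather than taken from the slide, is the Wilcoxon-Mann-Whitney statistic: the fraction of (Class 1, Class 0) pairs in which $\alpha_i > \beta_j$, with ties counted as one half. The function name and sample values below are illustrative.

```python
def auc_pairwise(alpha, beta):
    """Fraction of (Class 1, Class 0) output pairs ranked correctly; ties count 0.5."""
    m, n = len(alpha), len(beta)
    wins = 0.0
    for a in alpha:          # predicted outputs for Class 1 samples
        for b in beta:       # predicted outputs for Class 0 samples
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (m * n)

# Hypothetical network outputs.
print(auc_pairwise([0.9, 0.3], [0.5, 0.1]))
```

Because it depends only on how Class 1 outputs rank against Class 0 outputs, this measure does not change when one class heavily outnumbers the other.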
Business intelligence applications
Experiment 1: CARD datasets.
Dataset      #connections   ACC($\theta_1$)   AUC_d($\theta_1$)   ACC($\theta_2$)   AUC_d($\theta_2$)
CARD1 (TR)   9.13 ± 0.94    88.38 ± 0.56      87.98 ± 0.32        86.80 ± 0.90      86.03 ± 1.04
CARD1 (TS)                  87.79 ± 0.57      87.75 ± 0.43        88.35 ± 0.56      88.16 ± 0.48
CARD2 (TR)   7.17 ± 0.38    88.73 ± 0.56      88.72 ± 0.57        86.06 ± 1.77      85.15 ± 2.04
CARD2 (TS)                  81.76 ± 1.28      82.09 ± 0.88        85.17 ± 0.37      84.25 ± 0.55
CARD3 (TR)   7.57 ± 0.63    88.02 ± 0.51      88.02 ± 0.69        86.48 ± 1.07      87.07 ± 0.60
CARD3 (TS)                  84.67 ± 2.45      84.28 ± 2.48        87.15 ± 0.88      87.15 ± 0.85
• $\theta$ is the cut-off point for neural network classification: if output is greater than $\theta$, then predict Class 1, else predict Class 0.
• $\theta_1$ and $\theta_2$ are cut-off points selected to maximize the accuracy on the training data and the test data sets, respectively.
• AUC_d = AUC for the discrete classifier = (1 − fp + tp)/2
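The discrete-classifier AUC can be computed from the thresholded predictions. The sketch below assumes that tp and fp denote the true positive rate and false positive rate, which the slide does not state explicitly; the function name and data are illustrative.

```python
def auc_discrete(labels, outputs, theta):
    """AUC_d = (1 - fp + tp) / 2 for a classifier predicting Class 1 when output > theta."""
    predictions = [1 if o > theta else 0 for o in outputs]
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    false_pos = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tp = true_pos / positives    # true positive rate (assumed meaning of tp)
    fp = false_pos / negatives   # false positive rate (assumed meaning of fp)
    return (1 - fp + tp) / 2

# Hypothetical labels and network outputs, evaluated at two cut-off points.
labels = [1, 1, 0, 0]
outputs = [0.9, 0.4, 0.6, 0.1]
print(auc_discrete(labels, outputs, theta=0.5))
print(auc_discrete(labels, outputs, theta=0.3))
```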
Business intelligence applications
Experiment 1: CARD datasets.
• One pruned neural network was selected for rule extraction for each of the 3 CARD datasets:
Dataset   #connections   AUC (TR)   AUC (TS)   Unpruned inputs
CARD1     8              93.13%     92.75%     $D_{12}$, $D_{13}$, $D_{42}$, $D_{43}$, $C_{49}$, $C_{51}$
CARD2     9              93.16%     89.36%     $D_7$, $D_8$, $D_{29}$, $D_{42}$, $D_{44}$, $C_{49}$, $C_{51}$
CARD3     7              93.20%     89.11%     $D_{42}$, $D_{43}$, $D_{47}$, $C_{49}$, $C_{51}$
• Error rate (%) comparison versus other methods:
Methods                     CARD1   CARD2   CARD3
Genetic Algorithm           12.56   17.85   14.65
NN (other)                  13.95   18.02   18.02
NeuralWorks                 14.07   18.37   15.13
NeuroShell                  12.73   18.72   15.81
Pruned NN ($\theta_1$)      12.21   18.24   15.33
Pruned NN ($\theta_2$)      11.65   14.83   12.85
Business intelligence applications
Experiment 1: CARD datasets.
• Neural networks with just one hidden unit and very few connections outperform more complex neural networks!
• Rules can be extracted to provide more understanding about the classification.
• Rules for CARD1 from Re-RX:
Rule $R_1$: If $D_{12}$ = 1 and $D_{42}$ = 0, then predict Class 0,
Rule $R_2$: else if $D_{13}$ = 1 and $D_{42}$ = 0, then predict Class 0,
Rule $R_3$: else if $D_{42}$ = 1 and $D_{43}$ = 1, then predict Class 1,
Rule $R_4$: else if $D_{12}$ = 1 and $D_{42}$ = 1, then Class 0,
o Rule $R_{4a}$: If $C_{49}$ − 0.503 $C_{51}$ > 0.0596, then predict Class 0, else
o Rule $R_{4b}$: predict Class 1,
Rule $R_5$: else if $D_{12}$ = 0 and $D_{13}$ = 0, then predict Class 1,
Rule $R_6$: else if $C_{51}$ = 0.496, then predict Class 1,
Rule $R_7$: else predict Class 0.
Business intelligence applications
Experiment 1: CARD datasets.
• Rules for CARD2:
Rule $R_1$: If $D_7$ = 1 and $D_{42}$ = 0, then predict Class 0,
Rule $R_2$: else if $D_8$ = 1 and $D_{42}$ = 0, then predict Class 0,
Rule $R_3$: else if $D_7$ = 1 and $D_{42}$ = 1, then Class 1
  Rule $R_{3a}$: if $D_{29}$ = 0, then Class 1
    Rule $R_{3a-i}$: if $C_{49}$ − 0.583 $C_{51}$ < 0.061, then predict Class 1,
    Rule $R_{3a-ii}$: else predict Class 0,
  Rule $R_{3b}$: else Class 0
    Rule $R_{3b-i}$: if $C_{49}$ − 0.583 $C_{51}$ < −0.274, then predict Class 1,
    Rule $R_{3b-ii}$: else predict Class 0.
Rule $R_4$: else if $D_7$ = 0 and $D_8$ = 0, then predict Class 0,
Rule $R_5$: else predict Class 0.
Business intelligence applications
Experiment 1: CARD datasets.
• Rules for CARD3:
Rule $R_1$: If $D_{42}$ = 0, then Class 1
  Rule $R_{1a}$: if $C_{51}$ > 1.000, then predict Class 1,
  Rule $R_{1b}$: else predict Class 0,
Rule $R_2$: else Class 1
  Rule $R_{2a}$: if $D_{43}$ = 0, then Class 1
    Rule $R_{2a-i}$: if $C_{49}$ − 0.496 $C_{51}$ < 0.0551, then predict Class 1,
    Rule $R_{2a-ii}$: else predict Class 0,
  Rule $R_{2b}$: else Class 0
    Rule $R_{2b-i}$: if $C_{49}$ − 0.496 $C_{51}$ < 2.6525, then predict Class 1,
    Rule $R_{2b-ii}$: else predict Class 0,
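The CARD3 rule tree can be transcribed as a small function, which shows how the discrete conditions select a branch and the hyperplane on the continuous attributes decides the class at the leaf. The function and argument names are illustrative, and the input values in the example are hypothetical, since the slides do not describe how $C_{49}$ and $C_{51}$ are scaled.

```python
def predict_card3(d42, d43, c49, c51):
    """Class prediction following the CARD3 rules R1 and R2 from the slide."""
    if d42 == 0:                              # Rule R1
        return 1 if c51 > 1.000 else 0        # R1a / R1b
    # Rule R2: all remaining samples (D42 != 0)
    hyperplane = c49 - 0.496 * c51
    if d43 == 0:                              # Rule R2a
        return 1 if hyperplane < 0.0551 else 0    # R2a-i / R2a-ii
    return 1 if hyperplane < 2.6525 else 0        # R2b-i / R2b-ii

# Hypothetical attribute values.
print(predict_card3(d42=0, d43=0, c49=0.0, c51=1.5))
print(predict_card3(d42=1, d43=0, c49=1.0, c51=0.0))
```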