Rule Extraction from Artificial Neural Networks
Rudy Setiono
School of Computing
National University of Singapore
Outline
1. The train problem
2. Motivations
3. Feedforward neural networks for classification
4. Rule extraction from neural networks
5. Examples
6. Different types of classification rules
7. Regression rules
8. Hierarchical rules: the Re-RX algorithm
9. Conclusions
1. The train problem

[Figure: Westbound trains vs. Eastbound trains]

Describe which trains are east/westbound?

Attributes of a train:
‐ long cars can only be rectangular, and if closed then their roofs are either jagged or flat
‐ if a short car is rectangular then it is also double sided
‐ a short closed rectangular car can have either a flat or peaked roof
‐ a long car can have either two or three axles
‐ a car can be either open or closed
‐ a train has 2, 3 or 4 cars, each can be either short or long
Answers:
‐ if a train has a short closed car, then it is westbound, otherwise eastbound
‐ if a train has two cars, or has a car with a jagged roof, then it is eastbound, otherwise westbound
‐ and many others
All the above rules can be obtained by neural networks!
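The first answer rule above can be written directly as a one-line test over a train's cars. A minimal sketch, where the attribute encoding and the two example trains are hypothetical, not from the slides:

```python
# Hypothetical encoding (not from the slides): a train is a list of cars,
# each car a dict with the attributes described above.
def is_westbound(train):
    """First answer rule: a train with a short closed car is westbound."""
    return any(car["length"] == "short" and car["closed"] for car in train)

west = [{"length": "short", "closed": True}, {"length": "long", "closed": False}]
east = [{"length": "long", "closed": True}, {"length": "short", "closed": False}]
print(is_westbound(west))  # True: has a short closed car, so westbound
print(is_westbound(east))  # False: no short closed car, so eastbound
```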
2. Motivations

• Neural networks have been applied to solve many application problems involving
  ‐ pattern classification
  ‐ function approximation / data fitting
  ‐ data clustering
• They often give better predictive accuracy than other methods such as regression or decision trees.
• Data mining using neural networks: if we can extract rules from a trained network, a better understanding of the data and the problem can be gained.
• How do we extract such rules?
3. Feedforward neural networks for pattern classification

• Data is fed into the network input units.
• Pattern classification is determined by the output unit with the largest output value.
• Units in the hidden layer allow the network to separate any number of disjoint sets.
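The bullets above can be sketched as a single forward pass followed by an argmax over the output units. A minimal sketch: the tanh hidden activation matches the bipolar function discussed below, while the layer sizes and random weights are stand-ins for a trained network, not values from the slides:

```python
import numpy as np

def forward(x, W_hidden, W_output):
    """One forward pass: tanh hidden units, linear output units."""
    h = np.tanh(W_hidden @ x)   # hidden unit activations in (-1, 1)
    return W_output @ h         # one score per class

# Hypothetical sizes: 4 inputs (as in the Iris example later), 2 hidden
# units, 3 classes; the weights are random stand-ins for trained ones.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(2, 4))
W_o = rng.normal(size=(3, 2))
x = rng.normal(size=4)

scores = forward(x, W_h, W_o)
predicted_class = int(np.argmax(scores))  # output unit with largest value wins
```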
Network hidden units

For each unit:
• One input, I_N, is usually fixed: I_N = 1 (this serves as the bias input).
• The sum of the weighted inputs is computed:
  net = I^T w
• A nonlinear function is applied to obtain the unit's activation value:
  o = f(net)
• This activation function is usually the logistic sigmoid function (unipolar) or the hyperbolic tangent function (bipolar).
Hyperbolic tangent function

• The function is used to approximate the on-off function.
• If the sum of the weighted inputs is large, the output is close to 1 (on).
• If the sum of the weighted inputs is small, the output is close to -1 (off).
• Differentiable:
  f'(net) = (1 - o²)/2, where o = f(net)
• The derivative is largest when o = 0, that is, when net = 0, and approaches 0 as |net| becomes large.
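The derivative formula above corresponds to the bipolar activation f(net) = tanh(net/2), whose output range is (-1, 1); the plain tanh would instead give 1 - o². A quick numerical check of the closed form (a sketch, not from the slides):

```python
import math

def f(net):
    """Bipolar activation with outputs in (-1, 1); f(net) = tanh(net/2)
    is the form whose derivative matches the slide's (1 - o^2)/2."""
    return math.tanh(net / 2.0)

def f_prime(net):
    o = f(net)
    return (1.0 - o * o) / 2.0  # closed form from the slide

# Compare the closed form to a central-difference derivative.
for net in (-3.0, 0.0, 0.5, 4.0):
    eps = 1e-6
    numeric = (f(net + eps) - f(net - eps)) / (2 * eps)
    assert abs(numeric - f_prime(net)) < 1e-8

print(f_prime(0.0))  # 0.5, the maximum, attained at net = 0
```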
Neural network training

• Given a set of data, minimise the total error:
  Σ_i (target_i - predicted_i)²
• Supervised learning.
• Nonlinear optimisation problem: find a set of neural network weights that minimises the total error.
• Optimisation methods used: backpropagation/gradient descent, quasi-Newton method, conjugate gradient method, etc.
• A penalty term is usually added to the error function so that redundant connections have small/zero weights.
• An example of an augmented error function:
  Σ_i (target_i - predicted_i)² + C Σ_j w_j²
  where i runs over the N samples, j runs over the K weights, and C is a penalty parameter.
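The augmented error function above can be minimised with plain gradient descent. A minimal sketch for a single linear output unit (the toy data, learning rate, and penalty value C are hypothetical choices for illustration):

```python
import numpy as np

def train(X, y, C=0.01, lr=0.05, epochs=2000, seed=0):
    """Gradient descent on sum_i (predicted_i - target_i)^2 + C * sum_j w_j^2.
    A single linear unit keeps the sketch short; a real network would
    backpropagate the same error through the hidden layer as well."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        err = X @ w - y                    # predicted - target
        grad = 2 * X.T @ err + 2 * C * w   # error gradient + penalty gradient
        w -= lr * grad / len(y)
    return w

# Hypothetical toy data for illustration.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])
w = train(X, y)  # fits y closely; the small penalty C keeps weights bounded
```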
Neural network pruning

• After a network has been trained, redundant connections and units are removed by pruning.
• Pruned networks generalise better: they can predict new patterns better than fully connected networks.
• Simple classification rules can be extracted from skeletal pruned networks.
• Various methods for network pruning can be found in the literature.
A simple pruning algorithm:
1. Start with a trained fully connected network.
2. Identify a potential connection for pruning (for example, one with small magnitude).
3. Set the weight of this connection to 0.
4. Retrain the network (if necessary).
5. If the network still meets the required accuracy, go to step 2.
6. Otherwise, restore the removed connection and its corresponding weight. Stop.
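The six steps above can be sketched as a loop. Here `accuracy_of` is a hypothetical callback standing in for the retrain-and-evaluate step (steps 4 and 5), which the slides leave abstract:

```python
import numpy as np

def prune(weights, accuracy_of, required_accuracy):
    """Magnitude-based version of the pruning loop above.
    `accuracy_of(w)` is an assumed callback: it retrains if necessary
    and returns the network's accuracy with weight vector w."""
    w = weights.copy()
    while True:
        candidates = np.flatnonzero(w)               # remaining connections
        if candidates.size == 0:
            return w
        i = candidates[np.argmin(np.abs(w[candidates]))]  # smallest magnitude
        saved, w[i] = w[i], 0.0                      # step 3: zero it out
        if accuracy_of(w) < required_accuracy:       # step 6: failed check,
            w[i] = saved                             # restore and stop
            return w

# Toy stand-in: "accuracy" stays high as long as the two big weights survive.
w0 = np.array([2.0, -0.01, 0.3, -1.5])
acc = lambda w: 1.0 if (w[0] != 0 and w[3] != 0) else 0.5
pruned = prune(w0, acc, required_accuracy=0.9)  # the two small weights go
```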
4. Rule extraction from neural networks

1. Train and prune a network with a single hidden layer.
2. Cluster the hidden unit activation values:
   ‐ original activation values lie in [-1, 1]
   ‐ clustering implies dividing this interval into subintervals, for example [-1, -0.8), [-0.8, 0.5), [0.5, 1]
   ‐ an algorithm is needed to ensure the network does not lose its accuracy
3. Generate classification rules in terms of the clustered activation values.
4. Generate rules which explain the clustered activation values in terms of the input data attributes.
5. Merge the two sets of rules.

Decompositional approach!
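Step 2, dividing the activation interval [-1, 1] into subintervals, amounts to a binning function. A minimal sketch using the example boundaries from the slide (the sample activation values are hypothetical):

```python
import numpy as np

def discretise(activations, boundaries=(-0.8, 0.5)):
    """Assign each activation in [-1, 1] to a subinterval index:
    0 -> [-1, -0.8), 1 -> [-0.8, 0.5), 2 -> [0.5, 1]."""
    return np.digitize(activations, boundaries)

h = np.array([-0.95, -0.3, 0.7, 0.9])  # hypothetical hidden activations
print(discretise(h))  # [0 1 2 2]
```

After binning, the network's accuracy must be rechecked with the discretised activations, which is why the slide calls for an algorithm that ensures no accuracy is lost.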
Rule extraction by the decompositional approach

[Figure: a network diagram showing the input layer, hidden layer, and output layer]
5. Example: Iris classification problem

• 150 instances.
• 4 continuous attributes: sepal length, sepal width, petal length, petal width.
• Three different iris flowers: setosa, versicolor, virginica.
A network with 2 hidden units

[Figure: pruned network with output units for setosa, versicolor, and virginica]

• Three-class problem: 3 output units.
• Four input attributes: 4 input units + 1 for bias.
• The network has only 2 hidden units and 10 connections after pruning.
• It correctly classifies all but one training pattern.
Scatter plot of hidden unit activations

• 2-dimensional plot of the activation values (H1, H2): the setosa, versicolor, and virginica patterns fall into separate regions.

Rule in terms of the hidden unit activations:
  If H1 <= -0.7: Iris setosa
  Else if H2 <= -0.55: Iris versicolor
  Else: Iris virginica
Iris classification rules

If petal length <= 2.23, then Iris setosa.
Else if 3.57 petal length + 3.56 petal width - sepal length - 1.57 sepal width <= 12.63, then Iris versicolor.
Else Iris virginica.

[Figure: decision tree over the hidden activations: H1 <= -0.7 gives setosa; on the H1 > -0.7 branch, H2 <= -0.55 gives versicolor, else virginica]
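The extracted rule set above, written as a function. The thresholds and coefficients are from the slide; the comparison operators were lost in the source layout and are assumed here to be <=, matching the tree labels (e.g. H2 <= -0.55). The sample measurements are typical values chosen for illustration:

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Extracted Iris rules (operators assumed to be <=)."""
    if petal_length <= 2.23:
        return "setosa"
    # Linear combination of attributes corresponding to the H2 condition.
    if (3.57 * petal_length + 3.56 * petal_width
            - sepal_length - 1.57 * sepal_width) <= 12.63:
        return "versicolor"
    return "virginica"

print(classify_iris(5.1, 3.5, 1.4, 0.2))  # setosa
```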
Example: Breast cancer diagnosis

• Nine measurements taken from fine needle aspirates of human breast tissues: clump thickness, uniformity of cell size, uniformity of cell shape, etc.
• Each measurement is integer-valued, 0 to 10.
• 458 benign samples and 241 malignant samples from 699 patients.
• Data is split into 350 training samples and 349 test samples.
• 100 neural networks were trained:
  ‐ original number of hidden units: 5
  ‐ original number of connections: 460
• After pruning:
  ‐ average number of connections: 10.70
  ‐ average predictive accuracy: 92.70%
Breast Cancer Diagnosis: Example 1

o Extracted rules:
  If uniformity of cell size <= 4 and bare nuclei <= 5, then benign.
  Else malignant.
o Predictive accuracy: 93.98%.

[Figure: pruned network with a bias unit and inputs "cell size <= 4" and "bare nuclei <= 5" feeding the benign/malignant outputs]
Breast Cancer Diagnosis: Example 2

o Extracted rules:
  If clump thickness <= 6, bland chromatin <= 3, and normal nucleoli <= 9, then benign.
  Else malignant.
o Predictive accuracy: 93.12%.

[Figure: decision tree testing clump thickness <= 6, bland chromatin <= 3, and normal nucleoli <= 9 to separate benign from malignant]
Example: Application to hepatobiliary disorders

• Data collected from 536 patients in a Japanese hospital.
• Nine real-valued measurements obtained from biomedical tests.
• Patients are diagnosed as having one of 4 liver disorders (ALD, PH, LC, or C).
• Accuracy from different methods:

          Linear discriminant   Fuzzy neural   Neural network
          analysis              networks       extracted rules
  ALD     57.6%                 69.7%          87.9%
  PH      64.7%                 82.4%          92.2%
  LC      65.7%                 71.4%          80.0%
  C       63.6%                 81.8%          90.9%
  Total   63.2%                 77.3%          88.3%
Example: LED display recognition

An LED (Light Emitting Diode) device and the digits 0, 1, ..., 9:

[Figure: seven-segment LED display with segments labelled z1 through z7]
[Figure: segment pattern for the digit 0, with each segment marked "must be on", "must be off", or "doesn't matter"]

[Figure: segment pattern for the digit 1, with each segment marked "must be on", "must be off", or "doesn't matter"]