Rule Extraction from Artificial Neural Networks
Rudy Setiono
School of Computing
National University of Singapore
Outline
1. The train problem
2. Motivations
3. Feedforward neural networks for classification
4. Rule extraction from neural networks
5. Examples
6. Different types of classification rules
7. Regression rules
8. Hierarchical rules: the Re-RX algorithm
9. Conclusions
1. The train problem

[Figure: Westbound trains vs. Eastbound trains]

Describe which trains are east/westbound?

Attributes of a train:
‐ long cars can only be rectangular, and if closed then their roofs are either jagged or flat
‐ if a short car is rectangular then it is also double sided
‐ a short closed rectangular car can have either a flat or peaked roof
‐ a long car can have either two or three axles
‐ a car can be either open or closed
‐ a train has 2, 3 or 4 cars, each can be either short or long
Answers:
‐ if a train has a short closed car, then it is westbound, otherwise eastbound
‐ if a train has two cars, or has a car with a jagged roof, then it is eastbound, otherwise westbound
‐ and many others
All the above rules can be obtained by neural networks!
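The first answer rule above can be written directly as a one-line test over a train's cars. A minimal sketch, where the attribute encoding and the two example trains are hypothetical, not from the slides:

```python
# Hypothetical encoding (not from the slides): a train is a list of cars,
# each car a dict with the attributes described above.
def is_westbound(train):
    """First answer rule: a train with a short closed car is westbound."""
    return any(car["length"] == "short" and car["closed"] for car in train)

west = [{"length": "short", "closed": True}, {"length": "long", "closed": False}]
east = [{"length": "long", "closed": True}, {"length": "short", "closed": False}]
print(is_westbound(west))  # True: has a short closed car, so westbound
print(is_westbound(east))  # False: no short closed car, so eastbound
```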
2. Motivations

• Neural networks have been applied to solve many application problems involving
  ‐ pattern classification
  ‐ function approximation / data fitting
  ‐ data clustering
• They often give better predictive accuracy than other methods such as regression or decision trees.
• Data mining using neural networks: if we can extract rules from a trained network, a better understanding of the data and the problem can be gained.
• How do we extract such rules?
3. Feedforward neural networks for pattern classification

• Data is fed into the network input units.
• Pattern classification is determined by the output unit with the largest output value.
• Units in the hidden layer allow the network to separate any number of disjoint sets.
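The bullets above can be sketched as a single forward pass followed by an argmax over the output units. A minimal sketch: the tanh hidden activation matches the bipolar function discussed below, while the layer sizes and random weights are stand-ins for a trained network, not values from the slides:

```python
import numpy as np

def forward(x, W_hidden, W_output):
    """One forward pass: tanh hidden units, linear output units."""
    h = np.tanh(W_hidden @ x)   # hidden unit activations in (-1, 1)
    return W_output @ h         # one score per class

# Hypothetical sizes: 4 inputs (as in the Iris example later), 2 hidden
# units, 3 classes; the weights are random stand-ins for trained ones.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(2, 4))
W_o = rng.normal(size=(3, 2))
x = rng.normal(size=4)

scores = forward(x, W_h, W_o)
predicted_class = int(np.argmax(scores))  # output unit with largest value wins
```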
Network hidden units

For each unit:
• One input, I_N, is usually fixed: I_N = 1 (this serves as the bias input).
• The sum of the weighted inputs is computed:
  net = I^T w
• A nonlinear function is applied to obtain the unit's activation value:
  o = f(net)
• This activation function is usually the logistic sigmoid function (unipolar) or the hyperbolic tangent function (bipolar).
Hyperbolic tangent function

• The function is used to approximate the on-off function.
• If the sum of the weighted inputs is large, the output is close to 1 (on).
• If the sum of the weighted inputs is small, the output is close to -1 (off).
• Differentiable:
  f'(net) = (1 - o²)/2, where o = f(net)
• The derivative is largest when o = 0, that is, when net = 0, and approaches 0 as |net| becomes large.
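The derivative formula above corresponds to the bipolar activation f(net) = tanh(net/2), whose output range is (-1, 1); the plain tanh would instead give 1 - o². A quick numerical check of the closed form (a sketch, not from the slides):

```python
import math

def f(net):
    """Bipolar activation with outputs in (-1, 1); f(net) = tanh(net/2)
    is the form whose derivative matches the slide's (1 - o^2)/2."""
    return math.tanh(net / 2.0)

def f_prime(net):
    o = f(net)
    return (1.0 - o * o) / 2.0  # closed form from the slide

# Compare the closed form to a central-difference derivative.
for net in (-3.0, 0.0, 0.5, 4.0):
    eps = 1e-6
    numeric = (f(net + eps) - f(net - eps)) / (2 * eps)
    assert abs(numeric - f_prime(net)) < 1e-8

print(f_prime(0.0))  # 0.5, the maximum, attained at net = 0
```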
Neural network training

• Given a set of data, minimise the total error:
  Σ_i (target_i - predicted_i)²
• Supervised learning.
• Nonlinear optimisation problem: find a set of neural network weights that minimises the total error.
• Optimisation methods used: backpropagation/gradient descent, quasi-Newton method, conjugate gradient method, etc.
• A penalty term is usually added to the error function so that redundant connections have small/zero weights.
• An example of an augmented error function:
  Σ_i (target_i - predicted_i)² + C Σ_j w_j²
  where i runs over the N samples, j runs over the K weights, and C is a penalty parameter.
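The augmented error function above can be minimised with plain gradient descent. A minimal sketch for a single linear output unit (the toy data, learning rate, and penalty value C are hypothetical choices for illustration):

```python
import numpy as np

def train(X, y, C=0.01, lr=0.05, epochs=2000, seed=0):
    """Gradient descent on sum_i (predicted_i - target_i)^2 + C * sum_j w_j^2.
    A single linear unit keeps the sketch short; a real network would
    backpropagate the same error through the hidden layer as well."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        err = X @ w - y                    # predicted - target
        grad = 2 * X.T @ err + 2 * C * w   # error gradient + penalty gradient
        w -= lr * grad / len(y)
    return w

# Hypothetical toy data for illustration.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])
w = train(X, y)  # fits y closely; the small penalty C keeps weights bounded
```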
Neural network pruning

• After a network has been trained, redundant connections and units are removed by pruning.
• Pruned networks generalise better: they can predict new patterns better than fully connected networks.
• Simple classification rules can be extracted from skeletal pruned networks.
• Various methods for network pruning can be found in the literature.
A simple pruning algorithm:
1. Start with a trained fully connected network.
2. Identify a potential connection for pruning (for example, one with small magnitude).
3. Set the weight of this connection to 0.
4. Retrain the network (if necessary).
5. If the network still meets the required accuracy, go to step 2.
6. Otherwise, restore the removed connection and its corresponding weight. Stop.
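The six steps above can be sketched as a loop. Here `accuracy_of` is a hypothetical callback standing in for the retrain-and-evaluate step (steps 4 and 5), which the slides leave abstract:

```python
import numpy as np

def prune(weights, accuracy_of, required_accuracy):
    """Magnitude-based version of the pruning loop above.
    `accuracy_of(w)` is an assumed callback: it retrains if necessary
    and returns the network's accuracy with weight vector w."""
    w = weights.copy()
    while True:
        candidates = np.flatnonzero(w)               # remaining connections
        if candidates.size == 0:
            return w
        i = candidates[np.argmin(np.abs(w[candidates]))]  # smallest magnitude
        saved, w[i] = w[i], 0.0                      # step 3: zero it out
        if accuracy_of(w) < required_accuracy:       # step 6: failed check,
            w[i] = saved                             # restore and stop
            return w

# Toy stand-in: "accuracy" stays high as long as the two big weights survive.
w0 = np.array([2.0, -0.01, 0.3, -1.5])
acc = lambda w: 1.0 if (w[0] != 0 and w[3] != 0) else 0.5
pruned = prune(w0, acc, required_accuracy=0.9)  # the two small weights go
```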
4. Rule extraction from neural networks

1. Train and prune a network with a single hidden layer.
2. Cluster the hidden unit activation values:
   ‐ original activation values lie in [-1, 1]
   ‐ clustering implies dividing this interval into subintervals, for example [-1, -0.8), [-0.8, 0.5), [0.5, 1]
   ‐ an algorithm is needed to ensure the network does not lose its accuracy
3. Generate classification rules in terms of the clustered activation values.
4. Generate rules which explain the clustered activation values in terms of the input data attributes.
5. Merge the two sets of rules.

Decompositional approach!
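Step 2, dividing the activation interval [-1, 1] into subintervals, amounts to a binning function. A minimal sketch using the example boundaries from the slide (the sample activation values are hypothetical):

```python
import numpy as np

def discretise(activations, boundaries=(-0.8, 0.5)):
    """Assign each activation in [-1, 1] to a subinterval index:
    0 -> [-1, -0.8), 1 -> [-0.8, 0.5), 2 -> [0.5, 1]."""
    return np.digitize(activations, boundaries)

h = np.array([-0.95, -0.3, 0.7, 0.9])  # hypothetical hidden activations
print(discretise(h))  # [0 1 2 2]
```

After binning, the network's accuracy must be rechecked with the discretised activations, which is why the slide calls for an algorithm that ensures no accuracy is lost.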
Rule extraction by the decompositional approach

[Figure: a network diagram showing the input layer, hidden layer, and output layer]
5. Example: Iris classification problem

• 150 instances.
• 4 continuous attributes: sepal length, sepal width, petal length, petal width.
• Three different iris flowers: setosa, versicolor, virginica.
A network with 2 hidden units

[Figure: pruned network with output units for setosa, versicolor, and virginica]

• Three-class problem: 3 output units.
• Four input attributes: 4 input units + 1 for bias.
• The network has only 2 hidden units and 10 connections after pruning.
• It correctly classifies all but one training pattern.
Scatter plot of hidden unit activations

• 2-dimensional plot of the activation values (H1, H2): the setosa, versicolor, and virginica patterns fall into separate regions.

Rule in terms of the hidden unit activations:
  If H1 <= -0.7: Iris setosa
  Else if H2 <= -0.55: Iris versicolor
  Else: Iris virginica
Iris classification rules

If petal length <= 2.23, then Iris setosa.
Else if 3.57 petal length + 3.56 petal width - sepal length - 1.57 sepal width <= 12.63, then Iris versicolor.
Else Iris virginica.

[Figure: decision tree over the hidden activations: H1 <= -0.7 gives setosa; on the H1 > -0.7 branch, H2 <= -0.55 gives versicolor, else virginica]
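The extracted rule set above, written as a function. The thresholds and coefficients are from the slide; the comparison operators were lost in the source layout and are assumed here to be <=, matching the tree labels (e.g. H2 <= -0.55). The sample measurements are typical values chosen for illustration:

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Extracted Iris rules (operators assumed to be <=)."""
    if petal_length <= 2.23:
        return "setosa"
    # Linear combination of attributes corresponding to the H2 condition.
    if (3.57 * petal_length + 3.56 * petal_width
            - sepal_length - 1.57 * sepal_width) <= 12.63:
        return "versicolor"
    return "virginica"

print(classify_iris(5.1, 3.5, 1.4, 0.2))  # setosa
```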
Example: Breast cancer diagnosis

• Nine measurements taken from fine needle aspirates of human breast tissues: clump thickness, uniformity of cell size, uniformity of cell shape, etc.
• Each measurement is integer-valued, 0 to 10.
• 458 benign samples and 241 malignant samples from 699 patients.
• Data is split into 350 training samples and 349 test samples.
• 100 neural networks were trained:
  ‐ original number of hidden units: 5
  ‐ original number of connections: 460
• After pruning:
  ‐ average number of connections: 10.70
  ‐ average predictive accuracy: 92.70%
Breast Cancer Diagnosis: Example 1

o Extracted rules:
  If uniformity of cell size <= 4 and bare nuclei <= 5, then benign.
  Else malignant.
o Predictive accuracy: 93.98%.

[Figure: pruned network with a bias unit and inputs "cell size <= 4" and "bare nuclei <= 5" feeding the benign/malignant outputs]
Breast Cancer Diagnosis: Example 2

o Extracted rules:
  If clump thickness <= 6, bland chromatin <= 3, and normal nucleoli <= 9, then benign.
  Else malignant.
o Predictive accuracy: 93.12%.

[Figure: decision tree testing clump thickness <= 6, bland chromatin <= 3, and normal nucleoli <= 9 to separate benign from malignant]
Example: Application to hepatobiliary disorders

• Data collected from 536 patients in a Japanese hospital.
• Nine real-valued measurements obtained from biomedical tests.
• Patients are diagnosed as having one of 4 liver disorders (ALD, PH, LC, or C).
• Accuracy from different methods:

          Linear discriminant   Fuzzy neural   Neural network
          analysis              networks       extracted rules
  ALD     57.6%                 69.7%          87.9%
  PH      64.7%                 82.4%          92.2%
  LC      65.7%                 71.4%          80.0%
  C       63.6%                 81.8%          90.9%
  Total   63.2%                 77.3%          88.3%
Example: LED display recognition

An LED (Light Emitting Diode) device and the digits 0, 1, ..., 9:

[Figure: seven-segment LED display with segments labelled z1 through z7]
[Figure: segment pattern for the digit 0, with each segment marked "must be on", "must be off", or "doesn't matter"]

[Figure: segment pattern for the digit 1, with each segment marked "must be on", "must be off", or "doesn't matter"]