
6 Discretization Methods

Ying Yang¹, Geoffrey I. Webb², and Xindong Wu³

¹ School of Computer Science and Software Engineering, Monash University, Melbourne, Australia
² Faculty of Information Technology, Monash University, Australia
³ Department of Computer Science, University of Vermont, USA
Summary. Data-mining applications often involve quantitative data. However, learning from
quantitative data is often less effective and less efficient than learning from qualitative data.
Discretization addresses this issue by transforming quantitative data into qualitative data. This
chapter presents a comprehensive introduction to discretization. It clarifies the definition of
discretization. It provides a taxonomy of discretization methods together with a survey of
major discretization methods. It also discusses issues that affect the design and application of
discretization methods.
Key words: Discretization, quantitative data, qualitative data.
Introduction
Discretization is a data-processing procedure that transforms quantitative data into
qualitative data.
Data Mining applications often involve quantitative data. However, there exist
many learning algorithms that are primarily oriented to handle qualitative data (Ker-
ber, 1992, Dougherty et al., 1995, Kohavi and Sahami, 1996). Even for algorithms
that can directly deal with quantitative data, learning is often less efficient and less
effective (Catlett, 1991, Kerber, 1992, Richeldi and Rossotto, 1995, Frank and Wit-
ten, 1999). Hence discretization has long been an active topic in Data Mining and
knowledge discovery. Many discretization algorithms have been proposed. Evaluation of these algorithms has frequently shown that discretization helps improve the performance of learning and makes the learning results easier to understand.
This chapter presents an overview of discretization. Section 6.1 explains the ter-
minology involved in discretization. It clarifies the definition of discretization, which
has been defined in many differing ways in previous literature. Section 6.2 presents a
comprehensive taxonomy of discretization approaches. Section 6.3 introduces typi-
cal discretization algorithms corresponding to the taxonomy. Section 6.4 addresses the issue that different discretization strategies are appropriate for different learning problems; hence the design or application of discretization should not be blind to its learning context. Section 6.5 provides a summary of this chapter.
6.1 Terminology
Discretization transforms one type of data to another type. In the large amount of
existing literature that addresses discretization, there is considerable variation in the
terminology used to describe these two data types, including ‘quantitative’ vs. ‘qual-
itative’, ‘continuous’ vs. ‘discrete’, ‘ordinal’ vs. ‘nominal’, and ‘numeric’ vs. ‘cat-
egorical’. It is necessary to make clear the difference among the various terms and
accordingly choose the most suitable terminology for discretization.
We adopt the terminology of statistics (Bluman, 1992, Samuels and Witmer,
1999), which provides two parallel ways to classify data into different types. Data
can be classified as either qualitative or quantitative. Data can also be classified
into different levels of measurement scales. Sections 6.1.1 and 6.1.2 summarize this
terminology.
6.1.1 Qualitative vs. quantitative
Qualitative data, also often referred to as categorical data, are data that can be
placed into distinct categories. Qualitative data can sometimes be arranged in a meaningful order, but no arithmetic operations can be applied to them. Examples of qual-
itative data are: blood type of a person: A, B, AB, O; and assignment evaluation: fail,
pass, good, excellent.
Quantitative data are numeric in nature. They can be ranked in order. They also
admit meaningful arithmetic operations. Quantitative data can be further classified
into two groups, discrete or continuous.
Discrete data assume values that can be counted. The data cannot assume all val-
ues on the number line within their value range. An example is: number of children
in a family.
Continuous data can assume all values on the number line within their value
range. The values are obtained by measuring. An example is: temperature.
6.1.2 Levels of measurement scales

In addition to being classified into either qualitative or quantitative, data can also be
classified by how they are categorized, counted or measured. This type of classifi-
cation uses measurement scales, and four common levels of scales are: nominal,
ordinal, interval and ratio.
The nominal level of measurement scales classifies data into mutually exclusive
(non-overlapping), exhaustive categories in which no meaningful order or ranking
can be imposed on the data. An example is: blood type of a person: A, B, AB, O.
The ordinal level of measurement scales classifies data into categories that can
be ranked. However, the differences between the ranks cannot be calculated by arith-
metic. An example is: assignment evaluation: fail, pass, good, excellent. It is mean-
ingful to say that the assignment evaluation of pass ranks higher than that of fail. It
is not meaningful in the same way to say that the blood type of A ranks higher than
that of B.
The interval level of measurement scales ranks data, and the differences between
units of measure can be calculated by arithmetic. However, zero in the interval level
of measurement does not mean ‘nil’ or ‘nothing’ as zero in arithmetic means. An
example is: Fahrenheit temperature. It has a meaningful difference of one degree between each unit, but 0 degrees Fahrenheit does not mean there is no heat. It is meaningful to say that 74 degrees is two degrees higher than 72 degrees. It is not
meaningful in the same way to say that the evaluation of excellent is two degrees
higher than the evaluation of good.
The ratio level of measurement scales possesses all the characteristics of interval
measurement, and there exists a zero that, like the arithmetic zero, means ‘nil’ or
‘nothing’. In consequence, true ratios exist between different units of measure. An
example is: number of children in a family. It is meaningful to say that family X has
twice as many children as does family Y. It is not meaningful in the same way to say
that 100 degrees Fahrenheit is twice as hot as 50 degrees Fahrenheit.
The nominal level is the lowest level of measurement scales. It is the least power-
ful in that it provides the least information about the data. The ordinal level is higher, followed by the interval level. The ratio level is the highest. Any data conversion
from a higher level of measurement scales to a lower level of measurement scales
will lose information. Table 6.1 gives a summary of the characteristics of different
levels of measurement scales.
Table 6.1. Measurement Scales

Level     Ranking?  Arithmetic operation?  Arithmetic zero?
Nominal   no        no                     no
Ordinal   yes       no                     no
Interval  yes       yes                    no
Ratio     yes       yes                    yes
6.1.3 Summary
In summary, the following classification of data types applies:
1. qualitative data:
a) nominal;
b) ordinal;
2. quantitative data:
a) interval, either discrete or continuous;
b) ratio, either discrete or continuous.
We believe that ‘discretization’ as it is usually applied in data mining is best de-
fined as the transformation from quantitative data to qualitative data. In consequence,
we will refer to data as either quantitative or qualitative throughout this chapter.
6.2 Taxonomy
Diverse taxonomies have been proposed in the literature to classify discretization
methods. Different taxonomies emphasize different aspects of the distinctions among
discretization methods.
Typically, discretization methods can be either primary or composite. Primary
methods accomplish discretization without reference to any other discretization
method. Composite methods are built on top of some primary method(s).
Primary methods can be classified as per the following taxonomies.

1. Supervised vs. Unsupervised (Dougherty et al., 1995). Methods that use the
class information of the training instances to select discretization cut points are
supervised. Methods that do not use the class information are unsupervised.
Supervised discretization can be further characterized as error-based, entropy-
based or statistics-based according to whether intervals are selected using met-
rics based on error on the training data, entropy of the intervals, or some statisti-
cal measure.
2. Parametric vs. Non-parametric. Parametric discretization requires input from
the user, such as the maximum number of discretized intervals. Non-parametric
discretization only uses information from data and does not need input from the
user.
3. Hierarchical vs. Non-hierarchical. Hierarchical discretization selects cut points
in an incremental process, forming an implicit hierarchy over the value range.
The procedure can be split or merge (Kerber, 1992). Split discretization ini-
tially has the whole value range as an interval, then continues splitting it into
sub-intervals until some threshold is met. Merge discretization initially puts
each value into an interval, then continues merging adjacent intervals until some
threshold is met. Some discretization methods utilize both split and merge pro-
cesses. For example, intervals are initially formed by splitting, and then a merge
process is performed to post-process the formed intervals. Non-hierarchical dis-
cretization does not form any hierarchy during discretization. For example, many
methods scan the ordered values only once, sequentially forming the intervals.
4. Univariate vs. Multivariate (Bay, 2000). Methods that discretize each attribute
in isolation are univariate. Methods that take into consideration relationships
among attributes during discretization are multivariate.
5. Disjoint vs. Non-disjoint (Yang and Webb, 2002). Disjoint methods discretize
the value range of the attribute under discretization into disjoint intervals. No
intervals overlap. Non-disjoint methods discretize the value range into intervals
that can overlap.

6. Global vs. Local (Dougherty et al., 1995). Global methods discretize with re-
spect to the whole training data space. They perform discretization once only,
using a single set of intervals throughout a single classification task. Local meth-
ods allow different sets of intervals to be formed for a single attribute, each set
being applied in a different classification context. For example, different dis-
cretizations of a single attribute might be applied at different nodes of a decision
tree (Quinlan, 1993).
7. Eager vs. Lazy (Hsu et al., 2000, Hsu et al., 2003). Eager methods perform
discretization prior to classification time. Lazy methods perform discretization
during classification time.
8. Time-sensitive vs. Time-insensitive. Under time-sensitive discretization, the qualitative value associated with a quantitative value can change over time. That is, the same quantitative value can be discretized into different values depending on the previous values observed in the time series. Time-insensitive discretization only uses the stationary properties of the quantitative data.
9. Ordinal vs. Nominal. Ordinal discretization transforms quantitative data into
ordinal qualitative data. It aims at taking advantage of the ordering information
implicit in quantitative attributes, so as not to make values 1 and 2 as dissimi-
lar as values 1 and 10. Nominal discretization transforms quantitative data into
nominal qualitative data. The ordering information is hence discarded.
10. Fuzzy vs. Non-fuzzy (Wu, 1995, Wu, 1999, Ishibuchi et al., 2001). Fuzzy dis-
cretization first discretizes quantitative attribute values into intervals. It then
places some kind of membership function at each cut point as fuzzy borders.
The membership function measures the degree of each value belonging to each
interval. With these fuzzy borders, a value can be discretized into a few different
intervals at the same time, with varying degrees. Non-fuzzy discretization forms
sharp borders without employing any membership function.
Composite methods first choose some primary discretization method to form the
initial cut points. They then focus on how to adjust these initial cut points to achieve
certain goals. The taxonomy of a composite method is sometimes flexible, depending on the taxonomy of its primary method.
6.3 Typical methods
Corresponding to our taxonomy in the previous section, we here enumerate some
typical discretization methods. There are many other methods that are not reviewed
due to space limits. For a more comprehensive study of existing discretization algorithms, Yang (2003) and Wu (1995) offer good sources.
6.3.1 Background and terminology
A term often used for describing a discretization approach is ‘cut point’. Discretiza-
tion forms intervals according to the value range of the quantitative data. It then as-
sociates a qualitative value to each interval. A cut point is a value among the quanti-
tative data where an interval boundary is located by a discretization method. Another
commonly-mentioned term is ‘boundary cut point’, which refers to a value between two
instances with different classes in the sequence of instances sorted by a quantitative
attribute. It has been proved that evaluating only the boundary cut points is sufficient
for finding the minimum class information entropy (Fayyad and Irani, 1993).
We use the following terminology. Data comprises a set or sequence of instances.
Each instance is described by a vector of attribute values. For classification learning,
each instance is also labelled with a class. Each attribute is either qualitative or quan-
titative. Classes are qualitative. Instances from which one learns cut points or other
knowledge are training instances. If a test instance is presented, a learning algo-
rithm is asked to make a prediction about the test instance according to the evidence
provided by the training instances.
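To make the notion of boundary cut points concrete, the following short Python sketch (our own illustration; the function name and the toy data are hypothetical, not taken from the chapter) sorts the instances by the attribute value and returns the midpoints between successive distinct values whose neighbouring instances carry different classes:

```python
def boundary_cut_points(values, classes):
    """Midpoints between successive distinct values whose adjacent instances
    carry different class labels, after sorting by the attribute value."""
    pairs = sorted(zip(values, classes))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:          # class changes across two distinct values
            cuts.append((v1 + v2) / 2.0)   # place the candidate cut at the midpoint
    return cuts

# e.g. temperature readings labelled with two classes
print(boundary_cut_points([64, 65, 68, 69, 70, 71],
                          ['y', 'n', 'n', 'y', 'y', 'n']))   # -> [64.5, 68.5, 70.5]
```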
6.3.2 Equal-width, equal-frequency and fixed-frequency discretization
We present these three methods together because they are seemingly similar but actually different. They are all typical of unsupervised discretization. They are also typical of parametric discretization.
When discretizing a quantitative attribute, equal-width discretization (EWD) (Catlett, 1991, Kerber, 1992, Dougherty et al., 1995) predefines k, the number of intervals. It then divides the number line between v_min and v_max into k intervals of equal width, where v_min is the minimum observed value and v_max is the maximum observed value. Thus the intervals have width w = (v_max − v_min)/k and the cut points are at v_min + w, v_min + 2w, ..., v_min + (k − 1)w.
When discretizing a quantitative attribute, equal-frequency discretization (EFD)
(Catlett, 1991, Kerber, 1992, Dougherty et al., 1995) predefines k, the number of in-
tervals. It then divides the sorted values into k intervals so that each interval contains
approximately the same number of training instances. Suppose there are n training
instances; each interval then contains n/k training instances with adjacent (possibly
identical) values. Note that training instances with identical values must be placed
in the same interval. In consequence it is not always possible to generate k equal-
frequency intervals.
When discretizing a quantitative attribute, fixed-frequency discretization (FFD) (Yang and Webb, 2004) predefines a sufficient interval frequency k. It then discretizes the sorted values into intervals so that each interval contains approximately the same number k of training instances with adjacent (possibly identical) values. (Just as for EFD, because of the existence of identical values, some intervals can have an instance frequency exceeding k.)
It is worthwhile contrasting EFD and FFD, both of which form intervals of equal frequency. EFD fixes the number of intervals, which is usually chosen arbitrarily. FFD fixes the interval frequency, which is not arbitrary but is chosen to ensure that each interval contains sufficient instances to supply information, for example for estimating probabilities.
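As a minimal sketch of these three methods, assuming plain Python lists of numeric values (the function names are ours; the chapter itself prescribes no implementation), the differences show up only in how the cut points are chosen:

```python
def equal_width_cut_points(values, k):
    """EWD: k intervals of equal width between the observed minimum and maximum."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / k
    return [v_min + i * width for i in range(1, k)]

def equal_frequency_cut_points(values, k):
    """EFD: k intervals, each holding roughly len(values)/k sorted instances
    (identical values may force some intervals to hold more)."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    return [sorted_vals[(i * n) // k] for i in range(1, k)]

def fixed_frequency_cut_points(values, freq):
    """FFD: as many intervals as needed so that each holds roughly `freq` instances."""
    sorted_vals = sorted(values)
    return [sorted_vals[i] for i in range(freq, len(sorted_vals), freq)]
```

Note how EWD and EFD both take the interval number k as a parameter, whereas FFD takes the interval frequency; this is exactly the contrast drawn above.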
6.3.3 Multi-interval-entropy-minimization discretization (MIEMD)
Multi-interval-entropy-minimization discretization (Fayyad and Irani, 1993) is typ-
ical of supervised discretization. It is also typical of non-parametric discretization.
To discretize an attribute, MIEMD evaluates as a candidate cut point the midpoint
between each successive pair of the sorted values. For evaluating each candidate cut
point, the data are discretized into two intervals and the resulting class information
entropy is calculated. A binary discretization is determined by selecting the cut point
for which the entropy is minimal amongst all candidates. The binary discretization
is applied recursively, always selecting the best cut point. A minimum description
length (MDL) criterion is applied to decide when to stop discretization.
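The following sketch shows one way the recursive entropy-minimization search and the Fayyad-Irani MDL stopping rule could be coded. It is illustrative only: the helper names are ours, and ties and degenerate partitions are handled in the simplest possible way.

```python
import math
from collections import Counter

def entropy(labels):
    """Class information entropy of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_cut_points(values, labels):
    """Recursive binary entropy-minimization with the Fayyad-Irani MDL stop rule."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    n = len(pairs)
    best = None
    for i in range(1, n):
        if vals[i] == vals[i - 1]:
            continue                                  # candidate cuts lie between distinct values
        left, right = labs[:i], labs[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or e < best[0]:
            best = (e, i, left, right)
    if best is None:
        return []
    e, i, left, right = best
    gain = entropy(labs) - e
    k, k1, k2 = len(set(labs)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labs)
                                     - k1 * entropy(left) - k2 * entropy(right))
    if gain <= (math.log2(n - 1) + delta) / n:        # MDL criterion: stop discretizing
        return []
    cut = (vals[i - 1] + vals[i]) / 2.0
    return (mdl_cut_points(vals[:i], left) + [cut]
            + mdl_cut_points(vals[i:], right))
```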
6.3.4 ChiMerge, StatDisc and InfoMerge discretization
EWD and EFD are non-hierarchical discretization. MIEMD involves a split proce-
dure and hence is hierarchical discretization. A typical merge approach to hierarchi-
cal discretization is ChiMerge (Kerber, 1992). It uses the χ² (chi-square) statistic to determine if the relative class frequencies of adjacent intervals are distinctly different or if they are similar enough to justify merging them into a single interval. The ChiMerge algorithm consists of an initialization process and a bottom-up merging process. The initialization process contains two steps: (1) sort the training instances in ascending order according to their values for the attribute being discretized, and (2) construct the initial discretization, in which each instance is put into its own interval. The interval-merging process contains two steps, repeated continuously: (1) compute the χ² value for each pair of adjacent intervals, and (2) merge the pair of adjacent intervals with the lowest χ² value. Merging continues until all pairs of intervals have χ² values exceeding a predefined χ²-threshold; that is, until all intervals are considered significantly different by the χ² independence test. The recommended χ²-threshold is at the 0.90, 0.95 or 0.99 significance level.
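A rough sketch of the ChiMerge loop follows. For simplicity it initializes one interval per distinct value rather than per instance (instances with identical values must share an interval in any case), and the χ²-threshold is passed in directly, for example about 2.71 for two classes at the 0.90 significance level; all names are ours.

```python
from collections import Counter

def chi2_of_pair(left_labels, right_labels, all_classes):
    """Chi-square statistic over the 2 x C class-frequency table of two adjacent intervals."""
    rows = [Counter(left_labels), Counter(right_labels)]
    n = len(left_labels) + len(right_labels)
    chi2 = 0.0
    for cls in all_classes:
        col = rows[0][cls] + rows[1][cls]
        for row, labels in zip(rows, (left_labels, right_labels)):
            expected = len(labels) * col / n or 1e-6      # guard against zero expectation
            chi2 += (row[cls] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, threshold):
    """Bottom-up merging: repeatedly merge the adjacent pair with the lowest
    chi-square value until every pair exceeds `threshold`."""
    pairs = sorted(zip(values, labels))
    intervals = []                                        # (lower value, labels in interval)
    for v, c in pairs:
        if intervals and intervals[-1][0] == v:
            intervals[-1][1].append(c)                    # identical values share an interval
        else:
            intervals.append((v, [c]))
    classes = set(labels)
    while len(intervals) > 1:
        chis = [chi2_of_pair(intervals[i][1], intervals[i + 1][1], classes)
                for i in range(len(intervals) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)
        if chis[i] > threshold:
            break                                         # all pairs significantly different
        intervals[i:i + 2] = [(intervals[i][0], intervals[i][1] + intervals[i + 1][1])]
    return [v for v, _ in intervals[1:]]                  # lower bounds of later intervals = cut points
```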
StatDisc discretization (Richeldi and Rossotto, 1995) extends ChiMerge to allow any number of intervals to be merged at a time instead of only two as ChiMerge does. Both ChiMerge and StatDisc are based on a statistical measure of dependency. The statistical measures treat an attribute and a class symmetrically. A third merge discretization, InfoMerge (Freitas and Lavington, 1996), argues that an attribute and a class should be treated asymmetrically, since one wants to predict the value of the class attribute given the discretized attribute but not the reverse. Hence InfoMerge uses information loss, calculated as the difference between the amount of information necessary to identify the class of an instance after merging and that before merging, to direct the merge procedure.
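One possible reading of this information-loss measure, under the assumption that the ‘amount of information necessary to identify the class’ is the class information entropy, is sketched below (illustrative only; names are ours):

```python
import math
from collections import Counter

def entropy(labels):                                   # same helper as in the MIEMD sketch
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(left_labels, right_labels):
    """Class entropy of the merged interval minus the frequency-weighted class
    entropy of the two intervals before merging; InfoMerge would merge the
    adjacent pair with the smallest loss."""
    merged = left_labels + right_labels
    n = len(merged)
    before = (len(left_labels) * entropy(left_labels)
              + len(right_labels) * entropy(right_labels)) / n
    return entropy(merged) - before
```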
6.3.5 Cluster-based discretization
The above-mentioned methods are all univariate. A typical multivariate discretization
technique is cluster-based discretization (Chmielewski and Grzymala-Busse, 1996).
This method consists of two steps. The first step is cluster formation to determine
initial intervals for the quantitative attributes. The second step is post-processing to
minimize the number of discretized intervals. Instances here are treated as points in an n-dimensional space defined by the n attribute values. During cluster formation,
the median cluster analysis method is used. Clusters are initialized by allowing each
instance to be a cluster. New clusters are formed by merging two existing clusters that
exhibit the greatest similarity between each other. The cluster formation continues as
long as the level of consistency of the partition is not less than the level of consistency
of the original data. Once this process is completed, instances that belong to the same
cluster are indiscernible by the subset of quantitative attributes, thus a partition on
the set of training instances is induced. Clusters can be analyzed in terms of all
attributes to find cut points for each attribute simultaneously. After the discretized intervals are formed, post-processing picks, among all quantitative attributes, the pair of adjacent intervals whose merging yields the smallest class entropy. If the consistency of the dataset after the merge is above a given threshold, the merge is performed. Otherwise this pair of intervals is marked as non-mergable and the next candidate is processed. The process stops when every possible pair of adjacent intervals is marked as non-mergable.
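A very rough sketch of the cluster-formation idea is given below. It uses median-linkage agglomerative clustering from SciPy as a stand-in for the median cluster analysis method, fixes the number of clusters instead of using the consistency-level stopping rule, and omits the merge post-processing; all names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_cut_points(X, n_clusters):
    """Cluster the instances with median-linkage agglomerative clustering, then,
    for every attribute, place a cut point midway between adjacent,
    non-overlapping cluster ranges along that attribute."""
    X = np.asarray(X, dtype=float)
    clusters = fcluster(linkage(X, method='median'), n_clusters, criterion='maxclust')
    cuts = []
    for j in range(X.shape[1]):                          # one cut-point list per attribute
        spans = sorted((X[clusters == c, j].min(), X[clusters == c, j].max())
                       for c in np.unique(clusters))
        cuts.append([(hi + spans[i + 1][0]) / 2.0        # midpoint between adjacent spans
                     for i, (_, hi) in enumerate(spans[:-1])
                     if hi < spans[i + 1][0]])           # only where ranges do not overlap
    return cuts
```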
6.3.6 ID3 discretization
ID3 provides a typical example of local discretization. ID3 (Quinlan, 1986) is an
inductive learning program that constructs classification rules in the form of a de-
cision tree. It uses local discretization to deal with quantitative attributes. For each
quantitative attribute, ID3 divides its sorted values into two intervals in all possible
ways. For each division, the resulting information gain of the data is calculated. The
attribute that obtains the maximum information gain is chosen to be the current tree
node, and the data are divided into subsets corresponding to its two value intervals.
In each subset, the same process is recursively conducted to grow the decision tree.
The same attribute can be discretized differently if it appears in different branches of
the decision tree.
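The split search performed at each tree node can be sketched as follows (our own illustration of the idea, not Quinlan's code), with the entropy helper from the MIEMD sketch repeated so the block is self-contained:

```python
import math
from collections import Counter

def entropy(labels):                                   # as in the MIEMD sketch
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every binary division of the sorted values and return the cut point
    with the highest information gain, as ID3 does locally at each tree node."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    n = len(pairs)
    best_gain, best_cut = -1.0, None
    for i in range(1, n):
        if vals[i] == vals[i - 1]:
            continue                                   # cut only between distinct values
        left, right = labs[:i], labs[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = entropy(labs) - remainder
        if gain > best_gain:
            best_gain, best_cut = gain, (vals[i - 1] + vals[i]) / 2.0
    return best_cut, best_gain
```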
6.3.7 Non-disjoint discretization
The above-mentioned methods are all disjoint discretization methods. Non-disjoint discretization (NDD) (Yang and Webb, 2002), on the other hand, forms overlapping intervals for a quantitative attribute, always locating a value toward the middle of its discretized interval. This strategy is desirable since it can efficiently form a most appropriate interval for each single quantitative value.
When discretizing a quantitative attribute, suppose there are N instances. NDD identifies among the sorted values t' atomic intervals, (a'_1, b'_1], (a'_2, b'_2], ..., (a'_t', b'_t'], each containing s' instances, so that

    s' = s/3,  s' × t' = N.    (6.1)

(Theoretically, any odd number k other than 3 is acceptable in (6.1), as long as the same number k of atomic intervals are grouped together later for the probability estimation. For simplicity, we take k = 3 for demonstration.)
One interval is formed for each set of three consecutive atomic intervals, such that the kth (1 ≤ k ≤ t' − 2) interval (a_k, b_k] satisfies a_k = a'_k and b_k = b'_(k+2). Each value v is assigned to interval (a'_(i−1), b'_(i+1)], where i is the index of the atomic interval (a'_i, b'_i] such that a'_i < v ≤ b'_i, except when i = 1, in which case v is assigned to interval (a'_1, b'_3], and when i = t', in which case v is assigned to interval (a'_(t'−2), b'_t']. Figure 6.1 illustrates the procedure. As a result, except in the case of falling into the first or the last atomic interval, a numeric value is always toward the middle of its corresponding interval, and intervals can overlap with each other.

Fig. 6.1. Atomic Intervals Compose Actual Intervals
6.3.8 Lazy discretization
The above-mentioned methods are all eager. In comparison, lazy discretization (LD) (Hsu et al., 2000, Hsu et al., 2003) defers discretization until classification time. It waits until a test instance is presented to determine the cut points for each quantitative attribute of this test instance. When classifying an instance, LD creates only one interval for each quantitative attribute, containing the instance's value for that attribute, and leaves the other value regions untouched. In particular, it selects a pair of cut points for each quantitative attribute such that the value is in the middle of its corresponding interval. Where the cut points are located is decided by LD's primary discretization method, such as EWD.
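A minimal sketch of LD with an equal-width primary method: at classification time only the interval around the test value is materialized, and everything here (names, the `width` parameter) is illustrative rather than the authors' implementation.

```python
def lazy_interval(test_value, training_values, width):
    """Form, at classification time, the single interval of the given width
    centred on the test value, and collect the training values falling in it
    (e.g. for estimating probabilities); `width` could come from EWD."""
    low, high = test_value - width / 2.0, test_value + width / 2.0
    in_interval = [v for v in training_values if low < v <= high]
    return (low, high), in_interval
```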
