31
Quality Assessment Approaches in Data Mining

Maria Halkidi¹ and Michalis Vazirgiannis²

¹ Department of Computer Science and Engineering, University of California at Riverside, USA, and Department of Informatics, Athens University of Economics and Business, Greece
² Department of Informatics, Athens University of Economics and Business, Greece
Summary. The Data Mining process encompasses many different specific techniques and algorithms that can be used to analyze the data and derive the discovered knowledge. An important problem regarding the results of the Data Mining process is the development of efficient indicators for assessing the quality of the results of the analysis. This, the quality assessment problem, is a cornerstone issue of the whole process because: i) the analyzed data may hide interesting patterns that the Data Mining methods are called to reveal, and due to the size of the data the requirement for automatically evaluating the validity of the extracted patterns is stronger than ever; ii) a number of algorithms and techniques have been proposed which, under different assumptions, can lead to different results; and iii) the number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert who is analyzing the data. In this chapter we introduce the main concepts and quality criteria in Data Mining. We also present an overview of approaches that have been proposed in the literature for evaluating Data Mining results.
Key words: cluster validity, quality assessment, unsupervised learning, clustering
Introduction
Data Mining is mainly concerned with methodologies for extracting patterns from large data repositories. There are many Data Mining methods, each of which accomplishes a limited set of tasks and produces a particular enumeration of patterns over data sets. The main tasks of Data Mining, which have already been discussed in previous sections, are: i) Clustering, ii) Classification, iii) Association Rule Extraction, iv) Time Series, v) Regression, and vi) Summarization.
Since a Data Mining system can generate, under different conditions, thousands or millions of patterns, questions arise about the quality of the Data Mining results, such as which of the extracted patterns are interesting and which of them represent knowledge.
In general terms, a pattern is interesting if it is easily understood, valid, potentially useful and novel. A pattern is also considered interesting if it validates a hypothesis that a user seeks to confirm. An interesting pattern represents knowledge. The quality of patterns depends both on the quality of the analyzed data and the quality of the Data Mining results. Thus, several techniques have been developed aiming at evaluating and preparing the data used as input to the Data Mining process. Also, a number of techniques and measures have been developed aiming at evaluating and interpreting the extracted patterns.
Generally, the term ‘Quality’ in Data Mining corresponds to the following issues:
• Representation of the ‘real’ knowledge included in the analyzed data. The analyzed data hide interesting information that the Data Mining methods are called to reveal. The requirement for evaluating the validity of the extracted knowledge and representing it in a form exploitable by the domain experts is stronger than ever.
• Algorithm tuning. A number of algorithms and techniques have been proposed which, under different assumptions, could lead to different results. Also, there are Data Mining approaches considered more suitable for specific application domains (e.g. spatial data, business, marketing, etc.). The selection of a suitable method for a specific data analysis task, in terms of its performance and the quality of its results, is one of the major problems in Data Mining.
• Selection of the most interesting and representative patterns for the data. The number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert analyzing the data. Many of the patterns are either irrelevant or obvious and do not provide new knowledge. The selection of the most representative patterns for a data set is another important issue in terms of quality assessment.
Depending on the Data Mining task, the quality assessment approaches aim at estimating different aspects of quality. Thus, in the case of classification, the quality refers to: i) the ability of the designed classification model to correctly classify new data samples, ii) the ability of an algorithm to define classification models with high accuracy, and iii) the interestingness of the patterns extracted during the classification process. In clustering, the quality of the extracted patterns is estimated in terms of their validity and their fitness to the analyzed data. The number of groups into which the analyzed data can be partitioned is another important problem in the clustering process. On the other hand, the quality of association rules corresponds to the significance and interestingness of the extracted rules. Another quality criterion for association rules is the proportion of the data that the extracted rules represent. Since quality assessment is widely recognized as a major issue in Data Mining, techniques for evaluating the relevance and usefulness of discovered patterns attract the interest of researchers. These techniques are broadly referred to as:
• Interestingness measures, in the case of classification or association rule applications.
• Cluster validity indices (or measures), in the case of clustering.
In the following section, there is a brief discussion of the role of pre-processing in quality assessment. Then we proceed with the presentation of quality assessment techniques related to the Data Mining tasks. These techniques are organized into the following categories, depending on the Data Mining task they refer to: i) Classifier accuracy techniques and related measures, ii) Classification rule interestingness measures, iii) Association rule interestingness measures, and iv) Cluster validity approaches.
31.1 Data Pre-processing and Quality Assessment
Data in the real world tends to be ‘dirty’. Database users frequently report errors, unusual values, and inconsistencies in the stored data. Thus, it is usual for the analyzed data to be:
• incomplete, i.e. lacking attribute values, lacking certain attributes of interest, or containing only aggregate data,
• noisy, i.e. containing errors or outliers,
• inconsistent, i.e. containing discrepancies in the codes used to categorize items or in the names used to refer to the same data items.
When the analyzed data lack quality, the results of the mining process unavoidably tend to be inaccurate and of little interest to the domain expert. In other words, quality decisions must be based on quality data. Data pre-processing is a major step in the knowledge discovery process. Data pre-processing techniques applied prior to the Data Mining step can help to improve the quality of the analyzed data and, consequently, the accuracy and efficiency of the subsequent mining processes.
There are a number of data pre-processing techniques aimed at substantially improving the overall quality of the extracted patterns (i.e. the information included in the analyzed data). The most widely used are summarized below (Han and Kamber, 2001):
• Data cleaning, which can be applied to remove noise and correct inconsistencies in the data.
• Data transformation. A common transformation technique is normalization, which is applied to improve the accuracy and efficiency of mining algorithms involving distance measurements (a minimal sketch follows this list).
• Data reduction, which is applied to reduce the data size by aggregating or eliminating redundant features.
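To make the transformation step concrete, the sketch below applies min-max normalization to a single numeric attribute before it is fed to a distance-based mining algorithm. The function name, the target range [0, 1] and the sample income values are assumptions made for illustration, not part of the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numeric attribute values to the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant attribute: every value maps to the lower bound of the new range.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

# Hypothetical example: annual incomes rescaled so that income-like attributes
# contribute comparably to a distance measurement.
incomes = [12000, 35000, 47000, 98000]
print(min_max_normalize(incomes))   # [0.0, 0.267..., 0.407..., 1.0]
```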
31.2 Evaluation of Classification Methods
Classification is one of the most commonly applied Data Mining tasks, and a number of classification approaches have been proposed in the literature. These approaches can be compared and evaluated based on the following criteria (Han and Kamber, 2001):
• Classification model accuracy: the ability of the classification model to correctly predict the class into which new or previously unseen data are classified.
• Speed: the computation costs involved in building and using a classification model.
• Robustness: the ability of the model to handle noise or data with missing values and make correct predictions.
• Scalability: the ability of the method to construct the classification model efficiently given large amounts of data.
• Interpretability: the level of understanding that the constructed model provides.
31.2.1 Classification Model Accuracy
The accuracy of a classification model designed according to a set of training data is one of the
most important and widely used criteria in the classification process. It allows one to evaluate
how accurately the designed model (classifier) will classify future data (i.e. data on which the
model has not been trained). Accuracy also helps in the comparison of different classifiers.
The most common techniques for assessing the accuracy of a classifier are:

1. Hold-out method. The given data set is randomly partitioned into two independent sets, a training set and a test set. Usually, two thirds of the data are allocated to the training set and the remaining data to the test set. The training data are used to define the classification model (classifier). Then the classifier's accuracy is estimated based on the test data. Since only a proportion of the data is used to derive the model, the estimate of accuracy tends to be pessimistic. A variation of the hold-out method is the random sub-sampling technique, in which the hold-out method is repeated k times and the overall accuracy is estimated as the average of the accuracies obtained from each iteration.
2. k-fold cross-validation. The initial data set is partitioned into k subsets, called ‘folds’, $S = \{S_1, \ldots, S_k\}$. These subsets are mutually exclusive and have approximately equal size. The classifier is iteratively trained and tested k times. In iteration i, the subset $S_i$ is reserved as the test set while the remaining subsets are used to train the classifier. The accuracy is then estimated as the overall number of correct classifications from the k iterations, divided by the total number of samples in the initial data (a minimal sketch of this procedure is given after this list). A variation of this method is stratified cross-validation, in which the subsets are stratified so that the class distribution of the samples in each subset is approximately the same as that in the initial data set.
3. Bootstrapping. This method is k-fold cross-validation with k set to the number of initial samples; it samples the training instances uniformly with replacement (leave-one-out). In each iteration, the classifier is trained on the set of k − 1 samples that is randomly selected from the set of initial samples, S. The testing is performed using the remaining subset.
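The following sketch estimates classifier accuracy with k-fold cross-validation as described in item 2 above. The train and classify callables are placeholders for whatever classification method is being evaluated; they, and the function name, are assumptions of this example.

```python
import random

def k_fold_accuracy(samples, labels, train, classify, k=10, seed=0):
    """Estimate accuracy as (correct classifications over the k folds) / (total samples)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k mutually exclusive folds

    correct = 0
    for i in range(k):
        test_idx = set(folds[i])                       # fold i is reserved as the test set
        train_x = [samples[j] for j in indices if j not in test_idx]
        train_y = [labels[j] for j in indices if j not in test_idx]
        model = train(train_x, train_y)                # the remaining folds train the classifier
        correct += sum(1 for j in folds[i] if classify(model, samples[j]) == labels[j])
    return correct / len(samples)
```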
Though the use of the above-discussed techniques for estimating classification model accuracy increases the overall computation time, they are useful for assessing the quality of classification models and/or for selecting among several classifiers.
Alternatives to the Accuracy Measure
There are cases where the estimation of an accuracy rate may mislead one about the quality of a derived classifier. For instance, assume a classifier is trained to classify a set of data as ‘positive’ or ‘negative’. A high accuracy rate may not be acceptable, since the classifier could achieve it by correctly classifying only the negative samples, giving no indication of its ability to distinguish positive from negative samples. In this case, the sensitivity and specificity measures can be used as an alternative to the accuracy measure (Han and Kamber, 2001).
Sensitivity assesses how well the classifier can recognize positive samples and is defined as

$$\mathrm{Sensitivity} = \frac{true\_positive}{positive} \qquad (31.1)$$

where $true\_positive$ is the number of true positive samples and $positive$ is the number of positive samples.
Specificity measures how well the classifier can recognize negative samples. It is defined as

$$\mathrm{Specificity} = \frac{true\_negative}{negative} \qquad (31.2)$$

where $true\_negative$ is the number of true negative samples and $negative$ is the number of negative samples.
The measure that assesses the percentage of samples classified as positive that are actually positive is known as precision. That is,

$$\mathrm{Precision} = \frac{true\_positive}{true\_positive + false\_positive} \qquad (31.3)$$

Based on the above definitions, accuracy can be expressed as a function of sensitivity and specificity:

$$\mathrm{Accuracy} = \mathrm{Sensitivity} \cdot \frac{positive}{positive + negative} + \mathrm{Specificity} \cdot \frac{negative}{positive + negative} \qquad (31.4)$$
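As a quick check of Equations 31.1–31.4, the short sketch below computes the four measures directly from the counts of true positives, true negatives, positives, negatives and false positives; the function names and the example counts are illustrative assumptions.

```python
def sensitivity(true_pos, pos):
    return true_pos / pos                              # Eq. 31.1

def specificity(true_neg, neg):
    return true_neg / neg                              # Eq. 31.2

def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)           # Eq. 31.3

def accuracy(true_pos, true_neg, pos, neg):
    # Eq. 31.4: accuracy as a weighted combination of sensitivity and specificity.
    total = pos + neg
    return sensitivity(true_pos, pos) * pos / total + specificity(true_neg, neg) * neg / total

# Hypothetical test set: 100 positives (90 recognized), 50 negatives (40 recognized),
# and 10 negatives wrongly labelled positive.
print(accuracy(true_pos=90, true_neg=40, pos=100, neg=50))   # 0.866..., i.e. 130/150
print(precision(true_pos=90, false_pos=10))                  # 0.9
```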
In the classification problem discussed above, it is assumed that each training sample belongs to only one class, i.e. the data are uniquely classified. However, there are cases where it is more reasonable to assume that a sample may belong to more than one class. It is then necessary to derive models that assign data to classes with an attached degree of belief, i.e. classifiers that return a class probability distribution rather than a class label. The accuracy measure is not appropriate in this case, since its definition assumes a unique classification of the samples. An alternative is to use heuristics, where a class prediction is considered correct if it agrees with the first or second most probable class.
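A minimal sketch of the heuristic just mentioned, assuming the classifier returns a dictionary of class probabilities; the function name is a placeholder introduced for this example.

```python
def correct_within_top_two(class_probabilities, true_class):
    """Count a prediction as correct if the true class is the first or second most probable."""
    ranked = sorted(class_probabilities, key=class_probabilities.get, reverse=True)
    return true_class in ranked[:2]

# The true class 'B' is only the second most probable, yet the prediction is accepted.
print(correct_within_top_two({'A': 0.5, 'B': 0.3, 'C': 0.2}, 'B'))   # True
```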
31.2.2 Evaluating the Accuracy of Classification Algorithms
A classification (learning) algorithm is a function that, given a set of examples and their classes, constructs a classifier. A classifier, in turn, is a function that, given an example, assigns it to one of the predefined classes. A variety of classification methods have already been developed (Han and Kamber, 2001). The main question that arises in the development and application of these algorithms concerns the accuracy of the classifiers they produce.
Below we shall discuss some of the most common statistical methods proposed (Dietterich, 1998) for answering the following question: Given two classification algorithms A and
B and a data set S, which algorithm will produce more accurate classifiers when trained on
data sets of the same size?
McNemar’s Test
Let S be the available set of data, which is divided into a training set R and a test set T. We consider two algorithms A and B trained on the training set; the result is the definition of two classifiers, $\hat{f}_A$ and $\hat{f}_B$. These classifiers are tested on T and, for each example $x \in T$, we record how it was classified. Thus the contingency table presented in Table 31.1 is constructed.
Table 31.1. McNemar's test: contingency table

Number of examples misclassified by both classifiers ($n_{00}$)   |   Number of examples misclassified by $\hat{f}_A$ but not by $\hat{f}_B$ ($n_{01}$)
Number of examples misclassified by $\hat{f}_B$ but not by $\hat{f}_A$ ($n_{10}$)   |   Number of examples misclassified by neither $\hat{f}_A$ nor $\hat{f}_B$ ($n_{11}$)
The two algorithms should have the same error rate under the null hypothesis, $H_0$. McNemar's test is based on a $\chi^2$ goodness-of-fit test that compares the distribution of counts expected under the null hypothesis to the observed counts. The expected counts under $H_0$ are presented in Table 31.2.

Table 31.2. Expected counts under $H_0$

$n_{00}$                |   $(n_{01} + n_{10})/2$
$(n_{01} + n_{10})/2$   |   $n_{11}$
The following statistic, s, is distributed as $\chi^2$ with 1 degree of freedom. It incorporates a "continuity correction" term (of −1 in the numerator) to account for the fact that the statistic is discrete while the $\chi^2$ distribution is continuous:

$$s = \frac{\left(|n_{10} - n_{01}| - 1\right)^2}{n_{10} + n_{01}}$$
According to probability theory (Athanasopoulos, 1991), if the null hypothesis is correct, the probability that the value of the statistic s is greater than $\chi^2_{1,0.95}$ is less than 0.05, i.e. $P(|s| > \chi^2_{1,0.95}) < 0.05$. Then, to compare the algorithms A and B, the defined classifiers $\hat{f}_A$ and $\hat{f}_B$ are tested on T and the value of s is estimated as described above. If $|s| > \chi^2_{1,0.95}$, the null hypothesis can be rejected in favor of the hypothesis that the two algorithms have different performance when trained on the particular training set R.
The shortcomings of this test are:
1. It does not directly measure variability due to the choice of the training set or the internal randomness of the learning algorithm. The algorithms are compared using a single training set R. Thus McNemar's test should only be applied if we consider that these sources of variability are small.
2. It compares the performance of the algorithms on training sets that are substantially smaller than the whole data set. Hence we must assume that the relative difference observed on these training sets will still hold for training sets of size equal to the whole data set.
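The continuity-corrected McNemar statistic defined above is straightforward to compute from the two disagreement counts; a minimal sketch follows, in which the 3.841 threshold is the $\chi^2_{1,0.95}$ critical value quoted in the text and the example counts are invented for illustration.

```python
def mcnemar_statistic(n01, n10):
    """Continuity-corrected statistic s = (|n10 - n01| - 1)^2 / (n10 + n01)."""
    return (abs(n10 - n01) - 1) ** 2 / (n10 + n01)

CHI2_1_095 = 3.841   # critical value of the chi-square distribution, 1 degree of freedom

# The classifiers disagree on 40 test examples: f_A alone errs on 10, f_B alone on 30.
s = mcnemar_statistic(n01=10, n10=30)
print(s, s > CHI2_1_095)   # 9.025 True -> reject the hypothesis of equal error rates
```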
A Test for the Difference of Two Proportions
This statistical test is based on measuring the difference between the error rate of algorithm A and the error rate of algorithm B (Snedecor and Cochran, 1989). More specifically, let $p_A = (n_{00} + n_{01})/n$ be the proportion of test examples incorrectly classified by algorithm A and let $p_B = (n_{00} + n_{10})/n$ be the proportion of test examples incorrectly classified by algorithm B. The assumption underlying this statistical test is that when algorithm A classifies an example x from the test set T, the probability of misclassification is $p_A$. Then the number of misclassifications of n test examples is a binomial random variable with mean $np_A$ and variance $p_A(1 - p_A)n$.
The binomial distribution can be well approximated by a normal distribution for reasonable values of n. The difference between two independent normally distributed random variables is itself normally distributed. Thus, the quantity $p_A - p_B$ can be viewed as normally distributed if we assume that the measured error rates $p_A$ and $p_B$ are independent. Under the null hypothesis, $H_0$, it will have a mean of zero and a standard error of

$$se = \sqrt{2p\,(1 - p)/n}, \qquad p = \frac{p_A + p_B}{2},$$

where n is the number of test examples.
Based on the above analysis, we obtain the statistic

$$z = \frac{p_A - p_B}{\sqrt{2p\,(1 - p)/n}}$$

which has a standard normal distribution. According to probability theory, if the z value is greater than $Z_{0.975}$, the probability of incorrectly rejecting the null hypothesis is less than 0.05. Thus the null hypothesis can be rejected if $|z| > Z_{0.975} = 1.96$, in favor of the hypothesis that the two algorithms have different performances. There are several problems with this statistic, two of the most important being:
1. The probabilities $p_A$ and $p_B$ are measured on the same test set and thus they are not independent.
2. The test does not measure variation due to the choice of the training set or the internal variation of the learning algorithm. Also, it measures the performance of the algorithms on training sets of size significantly smaller than the whole data set.
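A minimal sketch of the z statistic above, computed from the joint misclassification counts; the function name and the example counts are assumptions for illustration, and $|z| > 1.96$ is the rejection rule quoted in the text.

```python
import math

def proportions_z(n00, n01, n10, n):
    """z statistic for the difference between the error proportions p_A and p_B."""
    p_a = (n00 + n01) / n              # error rate of algorithm A on the test set
    p_b = (n00 + n10) / n              # error rate of algorithm B on the test set
    p = (p_a + p_b) / 2
    return (p_a - p_b) / math.sqrt(2 * p * (1 - p) / n)

# 1000 test examples: both algorithms err on 50, A alone on 40, B alone on 20.
z = proportions_z(n00=50, n01=40, n10=20, n=1000)
print(z, abs(z) > 1.96)   # about 1.65, so the null hypothesis is not rejected
```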
The Resampled Paired t Test
The resampled paired t test is the most popular in machine learning. Usually, the test conducts a series of 30 trials. In each trial, the available sample S is randomly divided into a training set R (typically two thirds of the data) and a test set T. The algorithms A and B are both trained on R and the resulting classifiers are tested on T. Let $p_A^{(i)}$ and $p_B^{(i)}$ be the observed proportions of test examples misclassified by algorithms A and B, respectively, during the i-th trial. If we assume that the 30 differences $p^{(i)} = p_A^{(i)} - p_B^{(i)}$ were drawn independently from a normal distribution, then we can apply Student's t test by computing the statistic
$$t = \frac{\bar{p}\,\sqrt{n}}{\sqrt{\dfrac{\sum_{i=1}^{n}\left(p^{(i)} - \bar{p}\right)^2}{n-1}}}$$

where $\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p^{(i)}$. Under the null hypothesis this statistic has a t distribution with n − 1 degrees of freedom. Then, for 30 trials, the null hypothesis can be rejected if $|t| > t_{29,0.975} = 2.045$. The main drawbacks of this approach are:
1. Since $p_A^{(i)}$ and $p_B^{(i)}$ are not independent, the difference $p^{(i)}$ will not have a normal distribution.
2. The $p^{(i)}$'s are not independent, because the test and training sets in the trials overlap.
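The t statistic above can be computed directly from the per-trial differences; a minimal sketch, assuming the 30 misclassification proportions of each algorithm have already been collected (the function name is an assumption of this example).

```python
import math

def resampled_paired_t(p_a, p_b):
    """t statistic over the per-trial differences p(i) = p_A(i) - p_B(i)."""
    diffs = [a - b for a, b in zip(p_a, p_b)]
    n = len(diffs)
    p_bar = sum(diffs) / n
    sample_var = sum((d - p_bar) ** 2 for d in diffs) / (n - 1)
    return p_bar * math.sqrt(n) / math.sqrt(sample_var)

# With n = 30 trials, |t| > 2.045 rejects the hypothesis of equal error rates.
```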
The k-fold Cross-validated Paired t Test

This approach is similar to the resampled paired t test, except that instead of constructing each pair of training and test sets by randomly dividing S, the data set is randomly divided into k disjoint sets of equal size, $T_1, T_2, \ldots, T_k$. Then k trials are conducted. In each trial, the test set is $T_i$ and the training set is the union of all the other sets $T_j$, $j \neq i$. The t statistic is computed as described in Section 31.2.2. The advantage of this approach is that each test set is independent of the others. However, there is the problem that the training sets overlap. This overlap may prevent this statistical test from obtaining a good estimate of the amount of variation that would be observed if each training set were completely independent of the other training sets.
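A sketch of how the k-fold variant produces the per-fold differences that are then fed into the t statistic of the previous subsection; error_a and error_b are placeholder callables that train one algorithm on the training portion and return its error rate on the held-out fold, and the function name is an assumption of this example.

```python
def cv_paired_differences(samples, labels, error_a, error_b, k=10):
    """Per-fold error-rate differences p(i) = p_A(i) - p_B(i) for the paired t test."""
    folds = [list(range(i, len(samples), k)) for i in range(k)]   # k disjoint index sets
    diffs = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        train = ([samples[j] for j in train_idx], [labels[j] for j in train_idx])
        test = ([samples[j] for j in test_idx], [labels[j] for j in test_idx])
        diffs.append(error_a(train, test) - error_b(train, test))
    return diffs   # the t statistic is computed over these k differences (k - 1 dof)
```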
