
Machine Learning and
Data Mining
(IT4242E)
Quang Nhat NGUYEN


Hanoi University of Science and Technology
School of Information and Communication Technology
Academic year 2018-2019

The course’s content:
• Introduction
• Performance evaluation of the ML and DM system
• Probabilistic learning
• Supervised learning
• Unsupervised learning
• Association rule mining



Performance evaluation (1)


The performance of an ML or DM system is usually evaluated
experimentally rather than analytically
• An analytical evaluation aims to prove that a system is
correct and complete (e.g., theorem provers in logic)
• But it is impossible to give a formal definition of the
problem solved by an ML or DM system (for an ML or DM
problem, what would correctness and completeness mean?)



Performance evaluation (2)


The evaluation of the system performance should:
• Be done automatically by the system, using a set of
test examples (i.e., a test set)
• Not involve any test users


Evaluation methods
→ How to obtain a convincing (confident) evaluation of the
system performance?


Evaluation metrics
→ How to measure (i.e., compute) the performance of
the system?


Evaluation methods (1)

The whole dataset is divided into:
• Training set – used to train the system
• Validation set – optional, used to optimize the values of the system’s parameters
• Test set – used to evaluate the trained system



Evaluation methods (2)


How to get a confident/convincing evaluation of the
system performance?
• The larger the training set, the higher the performance of the
trained system
• The larger the test set, the more confident/convincing the
evaluation
• Problem: it is rarely possible to have a (very) large dataset


The system performance depends not only on the ML/DM
algorithm used, but also on:
• The class distribution
• The cost of misclassification
• The size of the training set
• The size of the test set


Evaluation methods (3)


• Hold-out (splitting)
• Stratified sampling
• Repeated hold-out
• Cross-validation (k-fold, leave-one-out)
• Bootstrap sampling



Hold-out (Splitting)


The whole dataset D is divided into 2 disjoint subsets
• Training set D_train – to train the system
• Test set D_test – to evaluate the performance of the trained system
→ D = D_train ∪ D_test, and usually |D_train| >> |D_test|


Requirements:
❑ No example in the test set D_test may be used in the training
of the system
❑ No example used in the training of the system (i.e., in
D_train) may be used in the evaluation of the trained system
❑ The test examples in D_test should allow an unbiased evaluation of
the system performance


Usual split: |D_train| = (2/3)·|D|, |D_test| = (1/3)·|D|
Suitable if we have a large dataset D
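A minimal sketch of a hold-out split, assuming the dataset is a NumPy feature matrix X with a label vector y (the function name and the 2/3 ratio below are just illustrative, following the slide):

```python
import numpy as np

def hold_out_split(X, y, train_fraction=2/3, seed=42):
    """Randomly split (X, y) into disjoint training and test sets."""
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)            # shuffle the example indices
    n_train = int(train_fraction * n)    # |D_train| = (2/3)|D| by default
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Usage (hypothetical data):
# X_train, y_train, X_test, y_test = hold_out_split(X, y)
```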


Stratified sampling









For datasets that are small or imbalanced, the
examples in the training and test sets may not be
representative
For example: there are (very) few examples of a specific
class label
Goal: the class distribution in the training and test sets should
be approximately equal to that in the original dataset D
Stratified sampling
• An approach to keep the class distribution balanced across the splits
• Guarantees that the class distributions (i.e., the percentages of examples
per class label) in the training and test sets are approximately equal


Stratified sampling cannot be applied to a regression
problem (because there the system’s output is a real value,
not a discrete value / class label)
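A sketch of a stratified hold-out split, assuming scikit-learn is available; the synthetic data below is illustrative only, and the `stratify=y` argument is what preserves the class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic, imbalanced dataset: 90 examples of class 0, 10 of class 1
X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=1/3,      # |D_test| = (1/3)|D|
    stratify=y,         # keep the ~90/10 class ratio in both subsets
    random_state=42,
)
print(np.bincount(y_train), np.bincount(y_test))  # class counts stay proportional
```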


Repeated hold-out


Apply the hold-out evaluation method multiple times
(i.e., multiple runs), each run using a different training and
test set
• For each run, a certain percentage of the dataset D is randomly
selected to create the training set (possibly combined with
stratified sampling)
• The error values (or the values of other evaluation metrics) are
averaged over the runs to get the final (average) error value


This evaluation method is still not perfect
• Each run uses a different test set
• But examples may still overlap (i.e., be reused)
across those test sets
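A sketch of repeated hold-out, assuming scikit-learn; the iris data and the decision-tree classifier are used purely as illustrative stand-ins:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
accuracies = []
for run in range(10):                              # 10 independent hold-out runs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=run)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))     # accuracy on this run's test set
print(np.mean(accuracies), np.std(accuracies))     # average (and spread) over runs
```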


Cross-validation




Avoids any overlap among the used test sets (i.e., no
example appears in two different test sets)
k-fold cross-validation
• The whole dataset D is divided into k disjoint subsets (called folds)
of approximately equal size
• In each of the k runs, one fold in turn is used as the
test set, and the remaining (k-1) folds are used as the training set
• The k error values (one per fold) are averaged to get the
overall error value


Usual choices of k: 10 or 5


Often, each fold is built by stratified sampling (i.e., to
approximate the class distribution) before applying
cross-validation
Suitable if we have a small to medium dataset D
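A sketch of stratified k-fold cross-validation, assuming scikit-learn; the dataset and classifier are illustrative only:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_accuracies = []
for train_idx, test_idx in skf.split(X, y):         # k = 10 disjoint, stratified folds
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    fold_accuracies.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(fold_accuracies))                     # overall (average) accuracy
```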





Leave-one-out cross-validation


A special case of the cross-validation method
• The number of folds is exactly the size of the original dataset
(k = |D|)
• Each fold contains just one example


Maximally exploits the original dataset


No random sub-sampling


Stratified sampling cannot be applied
→ because in each run, the test set contains just one example


(Very) high computational cost


Suitable if we have a (very) small dataset D
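A minimal leave-one-out sketch, assuming scikit-learn; it behaves like k-fold with k = |D|, each test set holding a single example (the classifier is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# One run per example: train on |D|-1 examples, test on the single held-out one
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(scores.mean())   # fraction of examples predicted correctly
```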



Bootstrap sampling (1)




The cross-validation method applies sampling without replacement
→ once an example is selected for the training set, it cannot be
selected again for the training set
The bootstrap sampling method applies sampling with replacement
to create the training set
• Assume the whole dataset D contains n examples
• Sample with replacement (i.e., with repetition) n times from the
dataset D to create a training set D_train containing n examples:
➢ From the dataset D, randomly select an example x (but do not remove x
from the dataset D)
➢ Put the example x into the training set: D_train = D_train ∪ {x}
➢ Repeat the above 2 steps n times
• Use the set D_train to train the system
• Use the examples in D but not in D_train as the test
set: D_test = {z ∈ D; z ∉ D_train}
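A sketch of one bootstrap round in NumPy (the function name is illustrative); the examples never drawn for training, often called out-of-bag examples, form the test set:

```python
import numpy as np

def bootstrap_split(X, y, seed=0):
    """One bootstrap round: sample n indices with replacement for training,
    and use the never-selected (out-of-bag) examples as the test set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    train_idx = rng.integers(0, n, size=n)     # sample with replacement, n times
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[train_idx] = False                # examples never drawn into D_train
    return X[train_idx], y[train_idx], X[oob_mask], y[oob_mask]

# On average roughly 36.8% of D ends up in the test set (out-of-bag examples)
```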



Bootstrap sampling (2)


Important notes:
• The training set has size n, and an example in D may appear
multiple times in D_train
• An example appears at most once in the test set D_test


Suitable if we have a (very) small dataset D



Validation set











The examples in the test set must not be used (in any way!) in the
training of the system
In some ML/DM problems, the system’s training process includes 2
stages:
• Stage 1: train the system (i.e., learn an approximation of the target
function)
• Stage 2: optimize the values of the system’s parameters
The test set cannot be used to optimize the
system’s parameters
→ Divide the whole dataset D into 3 disjoint subsets: training set,
validation set, and test set
The validation set is used to optimize the values of the system’s
parameters and of the ML/DM algorithm used
→ For a parameter, the optimal value is the one that yields the
best performance on the validation set
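A sketch of choosing a parameter value on a separate validation set, assuming scikit-learn; the dataset, the split ratios, and the tuned parameter (tree depth) are illustrative only, and the test set is touched just once at the end:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Three disjoint subsets: 60% training, 20% validation, 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Pick the parameter value that performs best on the validation set
best_depth = max(
    [1, 2, 3, 5, 8],
    key=lambda d: DecisionTreeClassifier(max_depth=d)
                  .fit(X_train, y_train).score(X_val, y_val))

# Evaluate the chosen configuration once on the untouched test set
final = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
print(best_depth, final.score(X_test, y_test))
```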


Evaluation metrics (1)
◼ Accuracy
→ How accurately the trained system predicts the test examples

◼ Efficiency
→ The time and memory costs needed for training and testing
the system

◼ Robustness
→ The system’s tolerance to noisy, erroneous, or
missing-value examples



Evaluation metrics (2)
◼ Scalability
→ How the system’s performance (e.g., training/prediction
speed) varies with the size of the dataset

◼ Interpretability
→ How easy the system’s results and operation are for users
to understand

◼ Complexity
→ The complexity of the model (i.e., the target function)
learned by the system




Accuracy


For a classification problem
→ The system’s output is a nominal (discrete) value

$\text{Accuracy} = \frac{1}{|D_{test}|} \sum_{x \in D_{test}} \text{Identical}(o(x), c(x))$, where $\text{Identical}(a, b) = 1$ if $a = b$, and $0$ otherwise

• x: an example in the test set D_test
• o(x): the class label produced by the system for the example x
• c(x): the true (expected/real) class label for the example x

For a regression problem
→ The system’s output is a real number

$\text{Error} = \frac{1}{|D_{test}|} \sum_{x \in D_{test}} Error(x)$, where $Error(x) = |d(x) - o(x)|$

• o(x): the output value produced by the system for the example x
• d(x): the true (expected/real) output value for the example x
• Accuracy is an inverse function of the Error
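A small NumPy sketch of both metrics (classification accuracy and an average absolute error for regression); the array contents are illustrative only:

```python
import numpy as np

# Classification: o = predicted class labels, c = true class labels on D_test
o = np.array([1, 0, 2, 1, 1])
c = np.array([1, 0, 1, 1, 2])
accuracy = np.mean(o == c)               # fraction of test examples predicted correctly

# Regression: o_reg = predicted values, d_reg = true values on D_test
o_reg = np.array([2.5, 0.8, 1.9])
d_reg = np.array([3.0, 1.0, 2.0])
error = np.mean(np.abs(d_reg - o_reg))   # average absolute error per test example
print(accuracy, error)
```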


Confusion matrix




Also called contingency table
Can be used only for a classification problem
(cannot be used for a regression problem)

Confusion matrix for class ci:

                           Classified by the system
                           Class ci        Not class ci
True class   Class ci        TPi               FNi
             Not class ci    FPi               TNi

• TPi: the number of examples of class ci that are correctly classified as ci
• FPi: the number of examples not belonging to class ci that are incorrectly
classified as ci
• TNi: the number of examples not belonging to class ci that are correctly
classified
• FNi: the number of examples of class ci that are incorrectly classified into
classes different from ci
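A sketch using scikit-learn's confusion_matrix; for binary labels, ravel() unpacks the counts TN, FP, FN, TP for the positive class (the label vectors below are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # true class labels (class c_i = 1)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels produced by the system
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)               # TP_i, FP_i, TN_i, FN_i for class 1
```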



Precision and Recall (1)


Very often used in the evaluation of text
mining and information retrieval
systems


Precision for class ci
→ the number of examples correctly
classified as class ci divided by the
number of examples classified as
class ci

$\text{Precision}(c_i) = \frac{TP_i}{TP_i + FP_i}$


Recall for class ci
→ the number of examples correctly
classified as class ci divided by the
number of examples of class ci

$\text{Recall}(c_i) = \frac{TP_i}{TP_i + FN_i}$
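A minimal per-class computation from the confusion-matrix counts; the helper names and the numbers are illustrative only:

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0   # correctly classified as c_i / classified as c_i

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0   # correctly classified as c_i / examples of c_i

print(precision(tp=3, fp=1), recall(tp=3, fn=1))  # e.g., TP_i = 3, FP_i = 1, FN_i = 1
```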


Precision and Recall (2)


How to compute the overall Precision and Recall
values over all the class labels C = {ci}?


Micro-averaging

$\text{Precision} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}$,  $\text{Recall} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$


Macro-averaging

$\text{Precision} = \frac{\sum_{i=1}^{|C|} \text{Precision}(c_i)}{|C|}$,  $\text{Recall} = \frac{\sum_{i=1}^{|C|} \text{Recall}(c_i)}{|C|}$
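A sketch with scikit-learn's precision_score and recall_score, whose average parameter corresponds to these two schemes (the label vectors are illustrative):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # true class labels
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # labels produced by the system

# Micro-averaging: pool TP_i, FP_i, FN_i over all classes, then divide
print(precision_score(y_true, y_pred, average='micro'),
      recall_score(y_true, y_pred, average='micro'))

# Macro-averaging: compute Precision(c_i) / Recall(c_i) per class, then average
print(precision_score(y_true, y_pred, average='macro'),
      recall_score(y_true, y_pred, average='macro'))
```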




F1 measure


The F1 evaluation metric is a combination of Precision
and Recall

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$

The F1 measure is the harmonic mean of the 2 metrics
Precision and Recall
• The F1 measure tends to be close to the smaller of
Precision and Recall
• The F1 measure has a high value only if both Precision and Recall are
high
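A small worked example (numbers chosen purely for illustration): with Precision = 0.9 and Recall = 0.5, $F_1 = \frac{2 \cdot 0.9 \cdot 0.5}{0.9 + 0.5} = \frac{0.9}{1.4} \approx 0.64$, noticeably closer to the smaller value than the arithmetic mean of 0.7.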


Top-k accuracy
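Top-k accuracy counts a prediction as correct if the true class is among the k classes to which the system assigns the highest scores (top-1 accuracy is the standard accuracy). A minimal sketch, assuming the system outputs a score per class; the function name and the scores below are illustrative only:

```python
import numpy as np

def top_k_accuracy(scores, y_true, k=3):
    """scores: (n_examples, n_classes) array of class scores; y_true: true labels."""
    top_k = np.argsort(scores, axis=1)[:, -k:]          # k highest-scoring classes per example
    hits = [y in row for y, row in zip(y_true, top_k)]  # is the true class among the top k?
    return np.mean(hits)

# Usage (hypothetical scores for 3 examples over 4 classes):
scores = np.array([[0.1, 0.5, 0.3, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
print(top_k_accuracy(scores, y_true=[2, 0, 1], k=2))
```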




Select a trained model


The selection of a trained model should balance between:
• The complexity of the trained model
• The prediction accuracy of the trained model


Occam’s razor: a good trained model is one that is simple
and achieves high (prediction) accuracy on the used
dataset


For example:
• A trained classifier Sys1: (very) simple, and fits the training set
reasonably well
• A trained classifier Sys2: more complex, and fits the training set
perfectly
→ Sys1 is preferred to Sys2


