
Auto Annotation of Multimedia Contents: Theory and Application



Chapter 1
Introduction

With the steady progress in image/video compression and communication technologies, many home users now have high-bandwidth connections for viewing images and DVD-quality videos. Many users are putting large amounts of digital images/videos online, and more and more media content providers are delivering live or on-demand images/videos over the Internet. While the amount of image/video data in such collections is rapidly increasing, multimedia applications remain very limited in their content management capabilities. There is a growing demand for new techniques that can efficiently process, model and manage image/video contents.
There are two main approaches to browsing and searching for images and videos in large multimedia collections (Smith et al. 2003). One is query by example (QBE), in which an image or a video is used as the query. Almost all QBE systems use visual content features such as color, texture and shape as the basis for retrieval. These low-level content features are inadequate for modeling the contents of images and videos effectively. Moreover, it is difficult to formulate precise queries using visual features or examples. As a result, QBE is not very effective and not readily accepted by ordinary users. The other approach is query by concepts¹ or keywords (QBK), which essentially retrieves images based on text annotations attached to images/videos. The QBK approach is easy to use and readily accepted by ordinary users because humans think in terms of semantics. However, for QBK to be effective, good annotations for images and videos


¹ Throughout the thesis, we liberally use the terms “concept” and “keyword” interchangeably.


are needed. As most current image/video collections either have no annotations or come with few and incomplete ones, effective techniques must be developed to annotate images. Most commercial image collections are annotated manually. As the size of an image/video collection is large, on the order of 10^4 to 10^7 items or more, manually annotating or labeling such a large collection is tedious, time consuming and error prone. Recently, supervised statistical learning approaches have been developed to perform automatic and semi-automatic annotation in order to reduce human effort. However, the performance of such supervised learning approaches is still low. Moreover, they need large amounts of labeled training samples to learn the target concepts. These problems have motivated our research to explore machine learning approaches for the auto-annotation of large image/video collections.
Throughout this thesis, we use the term image to denote both images and videos. We also loosely use the terms keyword and concept interchangeably to denote text annotations of images. In effect, the problem is reduced to learning a set of classifiers, one for each predefined keyword or concept, in order to automatically annotate the contents of an image. Although the title of the thesis refers to multimedia contents, the thesis is about visual contents, namely images and videos.

1.1 Motivation
There are several factors that motivate our research:
(1) There are large collections of images/videos that need annotation: such collections typically come with incomplete annotations or none at all. However, for effective searching and browsing, users prefer to use semantic concepts or keywords.



(2) Supervised learning approaches still need large amounts of training data for effective learning: manually annotating a large amount of training data is error prone, and the errors can affect the final learning performance. Because of subjective judgment and perceptual differences, different users/experts may assign different annotations to the same image/video. It is therefore important to minimize the amount of manually labeled data required to learn the target concepts.

(3) The need to develop effective techniques for the auto-annotation of image collections: the goal is to find an effective way to auto-annotate image/video collections based on a predefined list of concepts while requiring as little training data as possible.

The annotated image/video collections can then be used as the basis to support keyword-
based search and filtering of images.

1.2 Our Approaches

In this dissertation, we propose a scalable and flexible framework to automatically annotate large collections of images and videos based on a predefined list of concepts. The framework is based on the idea of hierarchical learning, i.e., performing auto-annotation from the image region level up to the image level, which is consistent with human cognition. In particular, at the region level, we assign one or more predefined concepts to a region based on the association between visual contents and concepts, while at the image level, we make use of contextual relationships among the concepts and regions to disambiguate the concepts learned. The framework is open and can incorporate different base learners, ranging from traditional single-view learners to multi-view learners. In the multi-view learning approach, two learners, representing two orthogonal views of the problem, are trained to solve the problem collaboratively. In addition, it is well known that labeling training data for machine learning is tedious, time-consuming and error prone, especially for multimedia data. Consequently, it is of utmost importance to minimize the amount of labeled data needed to train the classifiers for the target concepts.
Based on the framework, we implement three learning approaches to auto-annotate image collections, with the aim of contrasting the scalability, flexibility and effectiveness of the different learning approaches, and the extensibility and efficiency of the framework. The three learning approaches are as follows:

(1) The fully supervised single-view learning-based approach. In this approach, we consider the use of fully supervised SVM classifiers, one for each concept, to associate the visual contents of segmented regions with concepts. In order to alleviate the unreliability of current segmentation methods, we employ two different segmentation methods to derive two sets of segmented regions for each image. We then employ a contextual model to disambiguate the concepts learned from multiple overlapping regions. We also evaluate the performance of classifiers built on different types of SVM: (a) the hard SVM, which returns a binary decision for each concept; and (b) the soft SVM, which returns a probability value for each concept.
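To make the hard/soft distinction concrete, here is a minimal sketch of the per-concept classifier setup, written with scikit-learn purely for illustration (the concept list, feature dimensionality and random data are placeholders, not the features or tools used in the thesis):

```python
# A minimal sketch of per-concept SVM annotation over region features.
# The concept list, feature matrix and labels below are all synthetic.
import numpy as np
from sklearn.svm import SVC

concepts = ["sky", "water", "grass"]                      # placeholder concepts
X_regions = np.random.rand(200, 64)                       # region feature vectors
Y = {c: np.random.randint(0, 2, 200) for c in concepts}   # per-concept labels

classifiers = {}
for concept in concepts:
    # probability=True yields a soft SVM (Platt-scaled probabilities);
    # without it, predict() gives only the hard binary decision.
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X_regions, Y[concept])
    classifiers[concept] = clf

region = X_regions[:1]
hard = {c: int(clf.predict(region)[0]) for c, clf in classifiers.items()}
soft = {c: float(clf.predict_proba(region)[0, 1]) for c, clf in classifiers.items()}
print(hard, soft)
```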


(2) A bootstrapping scheme with two view-independent learners. Here we develop two sets of view-independent SVM classifiers by using two disjoint subsets of content features for each region. The two sets of view-independent classifiers are then used in a co-training framework to learn the concepts for each region collaboratively. As with the single-view approach, we investigate: (a) the annotation of regions generated by two different segmentation methods; (b) the use of a contextual model to disambiguate the concepts learned from multiple overlapping regions; and (c) the performance of both hard- and soft-SVM models for training the classifiers. We compare the performance of the bootstrapping approach with that of the fully supervised single-view approach. We expect the performance of the bootstrapping approach to be comparable to or better than that of the fully supervised single-view approach, while requiring a much smaller set of training samples. In addition, we investigate the role of active learning, in which users/experts participate in the learning loop by manually judging the classes of samples selected by the system. We aim to demonstrate that the resulting bootstrapping cum active learning framework is scalable and requires a much smaller set of training samples to kick-start the learning process.
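For illustration, the following sketches the generic co-training idea that underlies this scheme, in a deliberately simplified form: a combined-confidence variant, not the exact procedure of Chapter 7, and the data, round counts and use of scikit-learn are all assumptions:

```python
# Generic co-training sketch: two view-independent classifiers grow the
# labeled set by pseudo-labeling the unlabeled samples they are most
# confident about (a simplified, combined-confidence variant).
import numpy as np
from sklearn.svm import SVC

def cotrain(X1, X2, y, U1, U2, rounds=5, per_round=10):
    """X1/X2: labeled features for views 1/2; U1/U2: unlabeled features."""
    for _ in range(rounds):
        c1 = SVC(probability=True).fit(X1, y)
        c2 = SVC(probability=True).fit(X2, y)
        if len(U1) == 0:
            break
        # Score every unlabeled sample in both views.
        p1 = c1.predict_proba(U1)[:, 1]
        p2 = c2.predict_proba(U2)[:, 1]
        conf = np.maximum(np.abs(p1 - 0.5), np.abs(p2 - 0.5))
        pick = np.argsort(-conf)[:per_round]          # most confident samples
        pseudo = ((p1[pick] + p2[pick]) / 2 > 0.5).astype(int)
        # Move the pseudo-labeled samples from U into L.
        X1 = np.vstack([X1, U1[pick]]); X2 = np.vstack([X2, U2[pick]])
        y = np.concatenate([y, pseudo])
        keep = np.setdiff1d(np.arange(len(U1)), pick)
        U1, U2 = U1[keep], U2[keep]
    return c1, c2
```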

(3) A bootstrapping scheme for Web image/video mining. In order to evaluate the flexibility and extensibility of our bootstrapping cum active learning framework, we apply the framework to web image/video annotation and retrieval. Web images possess both intrinsic visual contents (the visual view) and text annotations derived from the associated HTML pages (the text view). We develop two sets of view-independent SVM-based classifiers based on these two orthogonal views. For effective learning, we also incorporate a language model into our framework.

1.3 Contributions
In this thesis, we make the following three contributions:

1. We introduce a two-level framework for the auto-annotation of images/videos. The framework generates multiple sets of segmented regions and concepts and employs a contextual model at the image level to disambiguate the concepts. It is designed to incorporate different base learners.

2. We propose and implement a bootstrapping cum active learning approach and incorporate it into our framework, which reduces the amount of labeled samples required for effective learning.

3. We extend the bootstrapping approach to handle the heterogeneous media on the Internet and explore web image annotation and retrieval by incorporating both visual and textual features.

1.4 Thesis Overview

The dissertation is organized as follows:

Chapter 2 discusses the basic question of what auto-annotation of images/videos is. The chapter also motivates the need for such an approach and discusses how we may measure the performance of learning approaches.
Chapter 3 overviews the state-of-the-art research on image/video retrieval, covering feature extraction, feature dimension reduction, indexing and retrieval.
Chapter 4 overviews existing research on statistical learning, its principles and related applications. We also introduce the ideas of supervised, unsupervised, semi-supervised and active learning schemes.
Chapter 5 discusses our proposed hierarchical image/video semantic concept structure and learning framework. Based on the hierarchical lexicon (structure), we introduce the learning framework, which is an open framework that can incorporate different base learners.
In Chapter 6, we propose and implement an approach to auto-annotate image/video collections using the framework introduced in Chapter 5. In this chapter, we evaluate and verify the framework with a traditional learning approach, i.e., a single-view learner.
Chapter 7 extends the work in Chapter 6 by proposing a bootstrapping approach to annotate large image/video collections. Our bootstrapping cum active learning approach performs image annotation using a small set of labeled data and a large set of unlabeled data.
Chapter 8 applies the bootstrapping framework to annotate and retrieve WWW images. Here we explore the integration of image visual features (the visual view) and the associated textual features (the textual view) to collaboratively learn the semantic concepts of web images. We also evaluate different combinations of visual and textual features to find an effective combination for the auto image annotation task.
Finally, Chapter 9 concludes the thesis with a discussion of future research.


Chapter 2
Auto-annotation of Images

In this chapter, we discuss what we mean by auto-annotation of images, why one needs to perform auto-annotation of images, why it is difficult, some possible approaches that may be employed, why we use machine learning techniques, and what characteristics make auto-annotation of images difficult for machine learning. We also discuss several evaluation criteria for machine learning approaches.

2.1 What is Auto-annotation of Images?

The purpose of auto-annotation of images is to assign the appropriate concept labels (or keywords) to each image. In effect, concepts are categories that describe the main contents of images (one image may belong to multiple categories). Thus auto-annotation of images is basically a classification or pattern recognition problem. Our aim is to learn the annotation of as many concepts as possible instead of just the main theme of the image. The list of concepts is predetermined, and each image can be assigned one or more concepts. We can imagine that an image in a collection enters a concept channel composed of tandem concept learners, and when the image exits, it is assigned concepts that represent its contents. The process by which the different learners determine the concepts of the image draws on many aspects of pattern recognition and natural language analysis, as there are interactions between the concepts and relations between the context of regions and the content of the image, as we will see in later chapters.
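In code, the concept channel amounts to nothing more than passing an image through a sequence of independent concept detectors and collecting the concepts that fire; a toy sketch, in which the learner objects are hypothetical:

```python
# Toy sketch of the "concept channel": an image's feature vector passes
# through a tandem of per-concept learners; each learner independently
# decides whether its concept applies, so one image may collect several labels.
def annotate(image_features, learners):
    """learners: dict mapping a concept name to a trained binary classifier
    exposing predict(); any classifier with that interface will do."""
    labels = []
    for concept, learner in learners.items():
        if learner.predict([image_features])[0] == 1:
            labels.append(concept)
    return labels
```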

2.2 Why Do We Need Auto-annotation of Images?

Semantic concepts of images are very important for multimedia retrieval because humans think in terms of semantics. Although content-based image/video retrieval (CBIR) has been developed extensively and has achieved a certain level of retrieval effectiveness, it has been limited to the research community and is not readily accepted by ordinary users. One important reason is that it is difficult for ordinary users to master and understand the relationship between what they want and the low-level visual features used in CBIR systems. With semantic concepts of images, this problem can be easily handled. Auto-annotation of images can be used for:
1. Routing/filtering images/videos of interest, such as animals, plants, vehicles, etc.
2. Image/video retrieval: by assigning semantic concepts to images, ordinary users can master and understand the retrieval process for their own use.
3. Using the assigned semantic concepts as the initial retrieval step for a content-based image/video retrieval system, and then performing relevance feedback so that the user actively participates in the process.
4. Combining with web techniques to enrich web content search and facilitate image search as is done in text search engines.
However, auto-annotation of images is a very difficult and challenging task, for many reasons:

1. Auto-annotation of images requires associating the low-level features (attributes) of images with concepts. However, representations based on low-level features such as color, texture and shape are limited. For example, due to variations in light intensity, the same image may have different low-level features. Also, the same object often has multiple visual representations.
2. Due to factors such as light intensity and occlusion, image segmentation is a very difficult task and its results are unstable and unreliable. Thus the intermediate results, such as objects/regions, often do not correspond to meaningful objects/regions, and the semantic concepts of images based on such regions/objects are incomplete.
3. For the same reasons, such as variations in light intensity, occlusion and different object postures, meta-level features such as face detection do not achieve high accuracy.
4. Last but not least, there are many synonyms for the same object or concept at the language level, i.e., different words with the same or similar meanings.

2.3 Why Use Machine Learning?

There are in general two approaches to problem solving: one is “knowledge engineering” and the other is “machine learning”. In the knowledge engineering approach, one creates a program that solves the problem of interest directly. As with text categorization, a knowledge engineering approach to image annotation would require one to determine a set of rules that correctly apply to the problem. Determining a specific solution that can be applied to all kinds of images is a daunting task, and current computer vision techniques cannot provide a general solution for this. We want to develop a system that is able to adjust itself to handle different semantic concepts for different images.
The machine learning approach provides us with an indirect approach, wherein the system itself learns how to solve the problem of interest. As noted in (Mitchell 1997), machine learning involves acquiring general concepts from specific training samples. In concept learning, a learner is used to automatically infer the definition of a concept starting from samples that are members or non-members of the concepts of interest. Training images (or image regions/objects) with pre-assigned concepts (labels) are provided to a learning program. This program then analyzes the training samples, extracts and generalizes the knowledge (learning parameters) they contain, and stores this knowledge in a knowledge base. This knowledge base is then used to solve previously unseen problems. Of course, in general, we assume that the unseen samples come from the same distribution as the training samples. Often the knowledge base is viewed as a collection of hypotheses, with some ordering imposed on how they are to be utilized for prediction, for example, for hierarchical image classification.
We choose to use the machine learning approach. This allows us to avoid the knowledge engineering bottleneck of having to acquire, organize, and resolve large amounts of incomplete and conflicting expert knowledge. Instead, we design a machine learning framework that computes a solution that can be easily applied and has reasonable accuracy. Also, using machine learning makes the system very flexible: one can potentially retrain the system when one has a new set of semantic concepts with enough image samples.

2.4 Supervised Learning and Semi-supervised Learning

When most people use machine learning, they are often referring to a specific type of machine learning approach called “supervised learning”. In this section, we discuss supervised learning and semi-supervised learning in general and explain the terminology and notation used in this dissertation.
By definition, the concept to be learned is called the target concept, and it can be seen as a function $c: X \rightarrow \{-1, 1\}$ that classifies any instance $x \in X$ as a member or a non-member of the concept of interest. In order to learn the target concept, the user typically provides a set of training samples, each of which consists of an instance $x \in X$ and its label $y$. The notation $\langle x, y \rangle$ denotes such a training sample. Sometimes we also write $y = c(x)$ for convenience. Instances for which $y = 1$ are called positive samples, while those for which $y = -1$ are called negative samples. The symbol $L$ is used to denote the set of labeled training samples (also known as the training set). By definition, an algorithm that uses only the labeled samples in $L$ is called a supervised learner. Similarly, a learning algorithm that trains on both labeled and unlabeled samples is called a semi-supervised learner, and an unsupervised learner is trained solely on unlabeled samples (for example, k-means clusters the unlabeled samples based on their similarities with each other).
Before we continue the discussion, let us introduce the basic concepts of statistical learning, namely induction, deduction, and transduction. Vapnik (1995) gives an intuitive explanation. Classical philosophy considers two types of inference: deduction, describing the movement from the general to the particular, and induction, describing the movement from the particular to the general. Induction derives a function from the given data. The model of estimating the value of a function at a given point of interest describes a new kind of inference, moving from particular to particular, which is referred to as transductive inference (Figure 2.1). Deduction derives the values of a given function for points of interest, while transduction derives the values of the unknown function for points of interest directly from the given data. The classical scheme derives S, the values of the unknown function at the points of interest, in two steps, first the inductive step and then the deductive step, rather than obtaining a direct solution in one step.
As we will see later, linear discriminant functions (including support vector machines) perform transductive inference, while maximum likelihood and Bayesian parameter estimation (including GMM, GLM and EM, and the k-nearest neighbor classifier) perform inductive inference.
Now we return to inductive inference for concept learning. Given a training set $L$ for the target concept $c$, an inductive learning algorithm searches for a function $h: X \rightarrow \{-1, 1\}$ such that $\forall x \in X, h(x) = c(x)$. The learner searches for $h$ within the set $H$ of all possible hypotheses, which is (typically) determined by the designer of the learning algorithm. A hypothesis $h$ is consistent with the training set $L$ if and only if $\forall \langle x, c(x) \rangle \in L, h(x) = c(x)$. Finally, the version space $V$ is the subset of hypotheses in $H$ that are consistent with the training set $L$.
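These definitions translate directly into code; the following toy sketch computes the version space by brute force over a small, illustrative hypothesis set of one-dimensional threshold functions (not a construction used later in the thesis):

```python
# Toy sketch: the version space V is the subset of hypotheses in H that
# are consistent with every labeled sample <x, c(x)> in L.
L = [(0.2, -1), (0.4, -1), (0.7, 1), (0.9, 1)]   # toy labeled samples

# H: illustrative 1-D threshold hypotheses h_t(x) = 1 if x > t else -1
H = [lambda x, t=t: 1 if x > t else -1 for t in (0.1, 0.3, 0.5, 0.8)]

def consistent(h, L):
    return all(h(x) == y for x, y in L)

V = [h for h in H if consistent(h, L)]   # here only the t=0.5 hypothesis survives
print(len(V))
```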


[Figure: a triangle diagram connecting “Samples”, “Approximating function”, and “Values of the function at points of interest”; induction leads from the samples to the approximating function, deduction from the approximating function to the values at the points of interest, and transduction directly from the samples to those values.]
Figure 2.1 Different types of inference (courtesy of Vapnik (1995))

The paradigms of supervised learning and semi-supervised learning are shown in Figure 2.2. As we can see, they share common function blocks, except that the semi-supervised learner also learns from unlabeled data via different strategies, such as active learning, as we will discuss later.
The raw data (images/videos in our case, text documents, etc.) are preprocessed to extract the features that represent the raw data. In the training mode, outlined by the dotted-line box, the teacher (also known as the oracle or expert, who can label all the samples in the set) assigns concepts to each training sample. This labeling by the teacher is in fact where the term “supervised learning” comes from: the “supervision” is the provision of the label by the teacher for all training samples. In semi-supervised learning, as shown in Figure 2.2(b), the teacher labels only part of the samples. The learner uses both labeled and unlabeled samples, and during learning it queries unlabeled samples.

There are two approaches to training a learner. One is the incremental approach, in which the learner learns incrementally from each sample; that is, it may update its knowledge each time a labeled sample is received. An advantage of incremental learning is that one can stop the learner at any time and it will have developed at least a portion of a knowledge base. The incremental approach is generally used by on-line learners in real-time environments. The other approach is the batch mode, in which the learner reads through all of the labeled samples before doing any computation. Batch mode is often more efficient in time and space.
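The contrast can be sketched using the fit/partial_fit convention of scikit-learn (assumed here purely for illustration; the thesis does not commit to a particular library):

```python
# Sketch: batch learning sees all labeled samples at once; incremental
# learning updates the model one (mini-)batch at a time and can be
# stopped at any point with a usable partial model.
import numpy as np
from sklearn.linear_model import SGDClassifier

X, y = np.random.rand(100, 16), np.random.randint(0, 2, 100)  # synthetic data

batch_learner = SGDClassifier().fit(X, y)     # batch mode: one call over all data

online_learner = SGDClassifier()
for i in range(0, 100, 10):                   # incremental mode: stream of samples
    online_learner.partial_fit(X[i:i+10], y[i:i+10], classes=[0, 1])
    # the model is already usable here, before all samples have arrived
```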
Some learners may go through the set of samples more than once. The teacher only has to label each sample once, but the learner can make as many passes as it wants to boost the learning, as in bagging and boosting (Freund and Schapire 1996, Breiman 1998). Each pass is referred to as an “epoch”. Learners that require multiple epochs often take more time than single-pass learners, but can achieve higher accuracy. As we will see later, active learning needs more epochs, during which the learner selects and uses the most informative samples to boost the learning.
Once training has been completed, the resulting knowledge base can be used to predict previously unseen samples (without labels). As shown in Figure 2.2, the dotted line from the unlabeled samples to the predictor indicates that the predictor utilizes the knowledge base obtained via supervised or semi-supervised learning. In general, the labeled samples are divided into two sets, a training set and a testing set. As their names suggest, the training set is used to train the learner, while the testing set is used to test the model. As shown in Figure 2.2, both sets can also be used to test the ability of the learner to construct an accurate and generalizable knowledge base.


[Figure: two side-by-side block diagrams. (a) Supervised learning: raw data (image/video frames, text documents) pass through feature extraction (preprocessing); the teacher fully labels the samples; the learner builds a knowledge base from the labeled samples; a predictor then assigns concept(s) to unseen samples. (b) Semi-supervised learning (via active learning): the teacher labels only part of the samples; the learner uses both labeled and unlabeled samples to build the knowledge base.]
Figure 2.2 Supervised learning vs. semi-supervised learning

2.5 Why Is Auto-annotation of Images Difficult for Machine Learning?


Auto-annotation of images is difficult for machine learning for several reasons:
1. There is no clear mapping from a set of visual properties to semantic concepts.
Although region-based systems attempt to decompose images into constituent objects/regions, a representation composed of the visual properties of regions is only indirectly related to the image's semantics. There is no clear mapping from a set of visual properties to high-level semantics. For example, an approximately round brown region might be a flower, an apple, a face, or part of a sunset sky. Moreover, visual properties such as the color, texture and shape of an object vary dramatically across images due to variations in light intensity, occlusion, object pose, etc.
2. The curse of dimensionality.
Most machine learning approaches do not scale up well: as one increases the number of input features, the performance of a machine learning system often degrades. There are at least two reasons why the curse of dimensionality affects the performance of a learner. One reason is that, in general, the demand for training samples grows exponentially with the dimensionality of the feature space. This severely restricts the application of machine learning methods such as k-nearest neighbor algorithms. The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and these complications are harder to discern (Duda et al. 2000). The other reason is that if we have d feature dimensions, we should have more than d training samples. Thus the more feature dimensions there are, the more training samples we need, and in turn the more effort is required to label them.
3. Many features are irrelevant to the target concepts.
Most learners, except rule-based or decision-tree approaches (which select only a subset of the instance features when forming hypotheses), use all the features whether or not they are relevant to the target concepts, and have difficulty deciding which features are irrelevant. For example, suppose we use the k-nearest neighbor algorithm to learn a target concept, and each sample is described by 30 features of which only 5 are relevant. In this case, samples that have identical values for the 5 relevant features may still be distant from one another in the 30-dimensional sample space. As a result, the similarity metric used by the k-nearest neighbor algorithm over all features may be misleading: the distance between neighbors is likely to be dominated by the large number of irrelevant features. The sketch below reproduces this effect.
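In the following small sketch (illustrative numbers only), two samples that agree exactly on the 5 relevant features are nevertheless far apart when the distance is computed over all 30 dimensions:

```python
# Sketch: two samples agree on all 5 relevant features, yet the 25
# irrelevant dimensions dominate the Euclidean distance used by k-NN.
import numpy as np

rng = np.random.default_rng(0)
relevant = rng.random(5)
a = np.concatenate([relevant, rng.random(25)])   # same relevant part,
b = np.concatenate([relevant, rng.random(25)])   # different irrelevant part

d_all = np.linalg.norm(a - b)          # large: driven by the irrelevant features
d_rel = np.linalg.norm(a[:5] - b[:5])  # zero: the samples actually "agree"
print(d_all, d_rel)
```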
4. Feature and label noise.
Feature noise refers to the fact that feature values in the samples may not be what they are supposed to be. For example, in text documents, words may be misspelled, or the wrong words may have been used. Label noise refers to the fact that the labels assigned to the samples by humans may contain errors; in other words, some of the training samples may have wrong labels. This is especially unavoidable in image object/region labeling in our case due to: (a) differences in human perception, (b) unreliable/unstable region segmentation, and (c) image quality, etc.

2.6 Why Do We Need to Minimize the Amount of Labeled Data Required for Learning?

In many supervised learning settings, such as the auto-annotation of images, we have many thousands of images and many more segmented image regions, and for text documents from the web, the number of samples is larger still. Labeling a reasonable set of samples in order to create a training set $L$ is thus tedious, time consuming, error prone and costly. In many cases, the labeling is not only one-to-one (one sample with one concept) but often one-to-many (one sample with more than one concept). Hence labeling also requires the expertise of trained personnel and a great deal of work. Because of these difficulties, finding a way to minimize the number of labeled samples is beneficial.
Usually, the training set is chosen by randomly sampling from the available samples. Fortunately, in many cases active learning can be employed, in which the learner actively chooses its training data. In this dissertation, we are primarily interested in two types of techniques that can help reduce the number of labeled samples while retaining reasonable learning performance: semi-supervised learning and active learning. The former boosts the accuracy of a supervised learner based on an additional set of unlabeled samples, while the latter minimizes the amount of labeled samples by asking the user/expert to label only the most informative samples in the domain.
Semi-supervised learning and active learning share two important characteristics. First, besides the set $L$ of labeled samples, they use an additional working set $U$ of unlabeled samples. The labeled samples in $L$ can be seen as pairs $\langle x, y \rangle$, where $x$ denotes the sample and $y$ is a numeric symbol representing the degree of confidence in the presence of a concept. The samples in $U$ can be seen as pairs $\langle x, ? \rangle$, where “?” signifies that the sample's label is unknown. Second, both semi-supervised learning and active learning try to maximize the accuracy of the supervised learners that are used to learn the target concepts.
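As a deliberately simplified illustration of the active-learning side, the following sketch repeatedly asks the teacher (simulated here by a hypothetical oracle function over synthetic data) to label only the unlabeled sample the current classifier is least certain about:

```python
# Minimal pool-based active learning sketch (uncertainty sampling):
# query the unlabeled sample whose predicted probability is closest to
# 0.5, ask the teacher for its label, retrain, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.random((500, 8))                      # synthetic sample pool U
oracle = lambda x: int(x[0] + x[1] > 1.0)          # stand-in for the human teacher

# Seed set L: one positive and one negative sample, labeled by the teacher.
pos = next(i for i in range(len(X_pool)) if oracle(X_pool[i]) == 1)
neg = next(i for i in range(len(X_pool)) if oracle(X_pool[i]) == 0)
labeled_idx, y = [pos, neg], [1, 0]

for _ in range(20):                                # 20 queries instead of 500 labels
    clf = LogisticRegression().fit(X_pool[labeled_idx], y)
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
    p = clf.predict_proba(X_pool[unlabeled])[:, 1]
    query = unlabeled[np.argmin(np.abs(p - 0.5))]  # most informative sample in U
    labeled_idx.append(int(query)); y.append(oracle(X_pool[query]))
```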

2.7 Performance Measurement

There are deterministic ways to evaluate a CBIR system in specific domains, such as the detection of image copyright violations and the identification of objectionable online images. However, it is difficult to evaluate a general-purpose CBIR system due to the complexity of image semantics and the lack of a “gold standard” (Wang 2001). It is very difficult to specify a single performance measure for use in all situations, since the choice of performance measure depends on the characteristics of the image collection and the needs of the user.
In this thesis, we adopt two performance measures: one derived from information retrieval, and the other, ROC (receiver operating characteristic) analysis, from the signal processing community.

2.7.1 Contingency Table for Performance Measurement
In the field of information retrieval, the contingency table is often used for performance measurement; it underlies the measures precision, recall and $F_\beta$ for $\beta > 0$. Here we introduce the contingency table in information retrieval terms; the term sample refers to a document or an image (object/region) depending on the context.
A system using a $2 \times 2$ contingency table structure is shown in Table 2.1; this table is also known as the confusion matrix for binary classification. We use “Positive” to denote “samples in the category of interest”, and “Negative” to denote “samples not in the category of interest”. Context will indicate whether we are talking about predicted or actual categorization values. The contingency table classifies the training/testing samples into one of 4 categories: False Positive (FP) if the system labels a sample as positive while it is actually negative; False Negative (FN) if the system labels a sample as negative while it is actually positive; and True Positive (TP) and True Negative (TN) if the system correctly predicts the label. In the following, we use TP, TN, FP and FN to denote the numbers of true positive, true negative, false positive and false negative samples, respectively. Note that with this notation, the number of positive points (if we take the feature vector of a sample as a point in the feature space) in the training/testing set can be written as TP+FN, and the training/testing set size as TP+FP+TN+FN. In ROC terminology, the true positive is also called the “hit”, while the false positive is called the “false alarm” (Duda et al. 2000).

Table 2.1  2×2 Contingency Table

                              Actual label value
                              Negative      Positive
  Predicted      Negative     TN            FN
  label value    Positive     FP            TP


Directly comparing such tables is difficult, so several performance measures based on the contingency table have been developed. These performance measures extract a single value from the 4 values in the table. They usually lie in the range [0.0, 1.0] or are expressed as percentages. The process of transforming 4 values into a single value causes some loss of information, so there are situations where certain performance measures may be preferred over others (Liere 1999).
We use the following performance measures based on the contingency table.
1. Sensitivity
Sensitivity (also known as recall) is defined as the ratio between the number of true positive predictions TP and the number of positive instances (TP+FN) in the test set:

$$\text{Sensitivity} = \frac{TP}{TP+FN}\,; \qquad \text{if } TP+FN=0 \text{, then Sensitivity} = 0 \qquad (2.1)$$
2. Specificity
Specificity is defined as the ratio between the number of true negative predictions TN and the number of negative instances (TN+FP) in the test set:

$$\text{Specificity} = \frac{TN}{TN+FP}\,; \qquad \text{if } TN+FP=0 \text{, then Specificity} = 0 \qquad (2.2)$$
3. Accuracy
Accuracy measures the ability of the system to correctly categorize samples. It is defined as the ratio between the number of correctly identified samples and the testing set size:

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}\,; \qquad \text{if } TP+TN+FP+FN=0 \text{, then Accuracy} = 0 \qquad (2.3)$$

For medical diagnosis, sensitivity gives the percentage of correctly classified diseased individuals, and specificity the percentage of correctly classified individuals without the disease (Swets and Pickett 1982, Weiss and Kulikowski 1990).
ROC analysis is a classical method in signal detection theory (Swets and Pickett 1982) and is used in statistics and medical diagnosis (Weiss and Kulikowski 1990, Center 1991). It is also used in machine learning as an alternative method for comparing systems (Provost et al. 1998). ROC space denotes a coordinate system used for visualizing the performance of a classifier, where the true positive rate (also called the “hit” rate) is plotted on the y-axis and the false positive rate (the “false alarm” rate) on the x-axis. In this way, classifiers are compared not by a single number, but by a point in a plane. For classifiers obtained by thresholding a real-valued function, or depending on a real parameter, this produces a curve (the ROC curve), which describes the trade-off between sensitivity and specificity. Two systems can therefore be compared, with the better one being higher and further to the left.
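A minimal sketch of how an ROC curve arises from thresholding a real-valued classifier output (the scores and labels below are synthetic):

```python
# Sketch: sweep a threshold over real-valued scores; each threshold gives
# one (false-positive-rate, true-positive-rate) point in ROC space.
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,    0,   0,   1,   0])   # ground truth

for t in np.unique(scores):
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (labels == 1)); fn = np.sum((pred == 0) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0)); tn = np.sum((pred == 0) & (labels == 0))
    tpr, fpr = tp / (tp + fn), fp / (fp + tn)     # "hit" rate vs. "false alarm" rate
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```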

4. Precision
Precision and recall are commonly used in information retrieval; we give their definitions here to complete the set of performance measures. Precision measures the ability of the system to classify as positive only samples that are actually positive. It is defined as the ratio between the number of correctly identified positive samples (TP) and the total number of samples identified as positive (TP+FP):

$$\text{Precision} = \frac{TP}{TP+FP}\,; \qquad \text{if } TP+FP=0 \text{, then Precision} = 0 \qquad (2.4)$$

5. Recall
Recall measures the ability of the system to identify the samples that are actually positive. It is defined as the ratio between the number of correctly identified positive samples (TP) and the number of actual positive samples (TP+FN):

$$\text{Recall} = \frac{TP}{TP+FN}\,; \qquad \text{if } TP+FN=0 \text{, then Recall} = 0 \qquad (2.5)$$

6. $F_\beta$
In practice, classification systems exhibit a precision-recall tradeoff. That is, for a system that has been tuned for optimal performance, if an adjustment of parameters causes precision to rise, then recall will fall, and vice versa. For example, if the system is not sure whether to predict a sample as positive or negative, and the goal of the system is high precision, it should predict the sample as negative, since the computation of precision does not involve the negative predictions (neither TN nor FN appears in Equation (2.4)). A system seeking a high precision rating should classify as positive only samples of which it is very sure, so that it has a better chance of achieving higher performance. However, if the same system in the same situation wants a high recall value, it should classify the sample as positive, since from Equation (2.5) we can see that the computation of recall involves false negatives but not false positives. A system seeking high recall should classify samples as positive even when it is not very sure, so that it has a better chance of predicting positive for all samples that are actually positive. The $F_\beta$ measure combines precision and recall into a single value:

$$F_\beta = \frac{(\beta^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \qquad (2.6)$$
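For reference, all of the measures above, including $F_\beta$, reduce to a few lines of code over the four counts of the contingency table; the following sketch adopts the zero-denominator conventions of Equations (2.1)-(2.5):

```python
# Sketch: all contingency-table measures from the four counts TP, FP, TN, FN,
# using the zero-denominator conventions of Equations (2.1)-(2.5).
def safe_div(a, b):
    return a / b if b != 0 else 0.0            # "if denominator = 0, measure = 0"

def measures(tp, fp, tn, fn, beta=1.0):
    sensitivity = safe_div(tp, tp + fn)        # Eq. (2.1), same as recall
    specificity = safe_div(tn, tn + fp)        # Eq. (2.2)
    accuracy    = safe_div(tp + tn, tp + tn + fp + fn)  # Eq. (2.3)
    precision   = safe_div(tp, tp + fp)        # Eq. (2.4)
    recall      = safe_div(tp, tp + fn)        # Eq. (2.5)
    f_beta      = safe_div((beta**2 + 1) * precision * recall,
                           beta**2 * precision + recall)
    return dict(sensitivity=sensitivity, specificity=specificity,
                accuracy=accuracy, precision=precision,
                recall=recall, f_beta=f_beta)

print(measures(tp=40, fp=10, tn=45, fn=5))
```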