
4 Object Detection and Tracking

4.1 Introduction

This section is devoted to selected problems in object detection and tracking. Objects in this context are characterized by their salient features, such as color, shape, texture, or other traits. The problem is then to tell whether an image contains a defined object and, if so, to indicate its position in the image. If a video sequence is processed instead of a single image, the task can be to track, or follow, the position and size of an object seen in the previous frames. This assumes high correlation between consecutive frames in the sequence, which is usually the case. Eventually, an object will disappear from the sequence and the detection task can be started again.
Detection can be viewed as a classification problem in which the task is to tell the presence or absence of a specific object in an image. If it is present, then the position of the object should be provided. Classification within a group of already detected objects is usually stated as a separate problem, in which the question is what particular object is observed. Although the two groups are similar, recognition methods are left to the next chapter. Thus, examples of object detection in images are, for instance, detection of human faces, hand gestures, cars, and road signs in traffic scenes, or just ellipses in images. On the other hand, if we were to spot a particular person or a particular road sign, etc., we would call this recognition. Since detection relies heavily on classification, as already mentioned, one of the methods discussed in the previous chapter can be used for this task. However, no less important is the proper selection of features that define an object. The main goal here is to choose features that are the most characteristic of the searched object or, in other words, that are highly discriminative, thus allowing an accurate response of a classifier. Finally, the computational complexity of the methods is also essential due to the usually high dimensions of the feature and search spaces. All these issues are addressed in this section, with special stress on automotive applications.

4.2 Direct Pixel Classification

Color conveys important information about the contents of an environment. A very appealing
natural example is a coral reef. Dozens of species adapt the colors of their skin so as to
be as indistinguishable from the background as possible to gain protection from predators.
Object Detection and Recognition in Digital Images: Theory and Practice, First Edition. Bogusław Cyganek.
© 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.



The latter do the same to outwit their prey, and so on. Thus, objects can be segmented out from a scene based exclusively on their characteristic colors. This can be achieved with direct pixel classification into one of two classes: object and background. An object, or pixels potentially belonging to an object, is defined by providing a set or range of allowable colors. The background, on the other hand, is either also defined explicitly or can be understood as "all other values." Such a method is usually applied first in the processing chain of a computer vision system, to sieve out the pixels of one object from all the others. For example, Phung et al. proposed a method for skin segmentation using direct color pixel classification [1]. Road signs are detected by direct pixel segmentation in the system proposed by Cyganek [2]. Features other than color can also be used. For instance, Viola and Jones propose using Haar wavelets in a chain of simple classifiers to separate pixels which can belong to human faces from the background [3]. Although not perfect, the methods in this group have the important property of dimensionality reduction. Last but not least, many of them allow very fast image pre-processing.

4.2.1 Ground-Truth Data Collection

Ground-truth data allow verification of the performance of machine learning methods. However, the process of their acquisition is tedious and time consuming because of the high quality requirements for this type of data.
Acquisition of ground-truth data can be facilitated by an application built for this purpose [4, 5]. It allows different modes of point selection, such as individual point positions, as well as rectangular and polygonal outlines of visible objects, as shown in Figure 4.1.
An example of its operation for points marked inside the border of a road sign is depicted in Figure 4.2. Only the positions of the points are saved as meta-data to the original image. These can then be processed to obtain the requested image features, in this case color in the chosen color space. This tool was used to gather point samples for the pixel-based classification for human skin detection and road sign recognition, as will be discussed in the next sections.

Figure 4.1 A road sign manually outlined by a polygon defined by points marked by an operator. This allows selection of simple (a) and more complicated shapes (b). Selected points are saved as meta-data to an image with the help of a context menu. Color versions of this and subsequent images are available at www.wiley.com/go/cyganekobject.




Figure 4.2 View of the application for manual point marking in images. Only the positions of the
selected points are saved in the form of meta-data to the original image. These can be used to obtain
image features, such as color, in the indicated places.

4.2.2 CASE STUDY – Human Skin Detection

Human skin detection gets much attention in computer vision due to its numerous applications.
The most obvious is detection of human faces for their further recognition, human hands for
gesture recognition,1 or naked bodies for parental control systems [6, 7], for instance.
Detection of human skin regions in images requires the definition of characteristic parameters, such as color and texture, as well as the choice of proper methods of analysis, such as the color space used, the classifiers, etc. There is still ongoing research in this respect. As already discussed, a method for human skin segmentation based on a mixture of Gaussians was proposed by Jones and Rehg [8]. Their model contains J = 16 Gaussians which were trained from almost one billion labeled pixels from RGB images gathered mostly from the Internet. The reported detection rate is 80% with about 9% false positives. A similar method based on MoG was undertaken by Yang and Ahuja in [9].
On the other hand, Jayaram et al. [10] report that the best results are obtained with histogram
methods rather than using the Gaussian models. They also pointed out that different color
spaces improve the performance but not consistently. However, a fair trade-off in this respect
is the direct use of the RGB space. A final observation is that in all color spaces directly
partitioned into achromatic and chromatic components, performance was significantly better
if the luminance component was employed in detection. Similar results, which indicate the
positive influence of the illumination component and the poor performance of the Gaussian
modeling, were reported by Phung et al. [1]. They also found that the Bayesian classifier with the histogram technique, as well as the multilayer perceptron, perform the best. The Bayes classifier operates in accordance with Equation (3.77), in which x is a color vector, ω0 denotes the "skin" class, whereas ω1 denotes the "nonskin" class, as described in Section 3.4.5. However, the Bayes classifier requires much more memory than, for example, a mixture of Gaussians. Therefore there is no unique "winner," and the choice of a specific detector can be driven by other factors, such as the computational capabilities of target platforms.
With respect to the color space, some authors advocate using perceptually uniform color
spaces for object detection based on pixel classification. Such an approach was undertaken by
Wu et al. [11] in their fuzzy face detection method. The front end of their detection constitutes
1 A method for gesture recognition is presented in Section 5.2.


Table 4.1 Fuzzy rules for skin detection in sun lighting.

Rule no   Rule description

R1:  Range of skin color components in daily conditions found in experiments
     IF R > 95 AND G > 40 AND B > 20 THEN T0 = high;
R2:  Sufficient separation of the RGB components; elimination of gray areas
     IF max(R,G,B) - min(R,G,B) > 15 THEN T1 = high;
R3:  R, G should not be close together
     IF |R-G| > 15 THEN T2 = high;
R4:  R must be the greatest component
     IF R > G AND R > B THEN T3 = high;

skin segmentation operating in the Farnsworth color space. The perceptual uniformity of this color space makes the classification process resemble the subjective classification made by humans, due to a similar sensitivity to changes of color.
Surveys on pixel based skin detection are provided in the papers by Vezhnevets et al. [12], by Phung et al. [1], and the recent one by Khan et al. [13]. Conclusions reported in the latter publication indicate that the best results were obtained with the cylindrical color spaces and with the tree based classifiers (Random forest, J48). Khan et al. also indicate the importance of the luminance component in the feature data, which agrees with the results of Jayaram et al. [10] and Phung et al. [1].
In this section a fuzzy based approach is presented with an explicit formulation of the human skin color model, as proposed by Peer et al. [14]. Although simple, the conversion of the histogram to a membership function greatly reduces memory requirements, while the fuzzy inference rules allow real-time operation. A similar approach was also applied to road sign detection based on characteristic colors, which is discussed in Section 4.2.3.
The method consists of a series of the fuzzy IF . . . THEN rules presented in Table 4.1 for
daylight conditions and in Table 4.2 for artificial lighting, respectively. These were designed
based on expert knowledge from data provided in the paper by Peer et al. [14], although other
models or modifications can be easily adapted.
The combined (aggregated) fuzzy rule for human skin detection directly in the RGB space is as follows:

RHS:  IF T0–3 are high OR T4–6 are high THEN H = high;    (4.1)

Table 4.2 Fuzzy rules for flash lighting.

Rule no   Rule description

R5:  Skin color values for flash illumination
     IF R > 220 AND G > 210 AND B > 170 THEN T4 = high;
R6:  R and G components should be close enough
     IF |R-G| ≤ 15 THEN T5 = high;
R7:  B component has to be the smallest one
     IF B < R AND B < G THEN T6 = high;
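As a concrete illustration, the rules of Tables 4.1 and 4.2 and the aggregated rule (4.1) can be sketched in their crisp (non-fuzzy) form; the function names below are ours, not from the book:

```python
# Crisp sketch of the skin-detection rules of Tables 4.1/4.2,
# aggregated as in rule (4.1). Inputs are 8-bit RGB components.

def daylight_rules(r, g, b):
    """Thresholds T0..T3 from Table 4.1 (daylight conditions)."""
    t0 = r > 95 and g > 40 and b > 20            # R1: skin color range
    t1 = max(r, g, b) - min(r, g, b) > 15        # R2: eliminate gray areas
    t2 = abs(r - g) > 15                         # R3: R, G not close together
    t3 = r > g and r > b                         # R4: R is the greatest
    return t0 and t1 and t2 and t3

def flash_rules(r, g, b):
    """Thresholds T4..T6 from Table 4.2 (flash illumination)."""
    t4 = r > 220 and g > 210 and b > 170         # R5: flash skin color range
    t5 = abs(r - g) <= 15                        # R6: R and G close enough
    t6 = b < r and b < g                         # R7: B is the smallest
    return t4 and t5 and t6

def is_skin(r, g, b):
    # Aggregated rule (4.1): skin if either rule group holds.
    return daylight_rules(r, g, b) or flash_rules(r, g, b)

print(is_skin(200, 120, 90))   # typical daylight skin tone -> True
print(is_skin(230, 225, 180))  # bright, washed-out flash tone -> True
print(is_skin(50, 50, 50))     # gray pixel -> False
```

In the fuzzy formulation discussed next, each boolean comparison is replaced by a membership value in [0, 1].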


Figure 4.3 A possible membership function for the relation R > 95.

The advantage of the fuzzy formulation (4.1) over its crisp version is that the influence of each particular rule can be controlled separately. Also, new rules can be easily added if necessary. For instance, in rule R1, when checking the condition that the component R is greater than 95, this can be assigned values other than the simple "true" or "false" of the classical formulation. Thus, given the linear membership function presented in Figure 4.3, the relation R > 95 can be evaluated differently (in the range from 0 to 1) depending on the value of R. Certainly, the type of membership function can be chosen with additional "expert" knowledge. Here, we assume a margin of noise in the measurement of R which in this example spans from 90 to 105. Outside this region we reach two extremes: for R "significantly lower," with the membership function spanning 0–0.1, and for R "significantly greater," with a corresponding membership function from 0.9–1. Such a fuzzy formulation offers much more control than a crisp formulation. Therefore it can be recommended for tasks which are based on empirical or heuristic observations. A similar methodology was undertaken in fuzzy image matching, discussed in the book by Cyganek and Siebert [15], and in the task of figure detection, discussed in Section 4.4. The fuzzy AND operation can be defined with the multiplication or the minimum rule of the membership functions [16], as was already formulated in Equations (3.162) and (3.163), respectively.
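A piecewise-linear membership function of this kind is straightforward to evaluate by linear interpolation. The sketch below assumes breakpoints read off Figure 4.3 (μ = 0.1 at R = 90, 0.5 at R = 95, 0.9 at R = 105); the exact values are our assumption:

```python
import numpy as np

# Sketch of the membership function of Figure 4.3 for the relation R > 95.
# Breakpoints (assumed from the figure): 0.1 below 90, 0.5 at 95, 0.9 at 105,
# saturating toward 1 at 255.
def mu_r_greater_95(r):
    xs = [0, 90, 95, 105, 255]
    ys = [0.0, 0.1, 0.5, 0.9, 1.0]
    return float(np.interp(r, xs, ys))

# Fuzzy AND per Equations (3.162)/(3.163): minimum rule or product rule.
def fuzzy_and(*mus, rule="min"):
    return min(mus) if rule == "min" else float(np.prod(mus))

print(mu_r_greater_95(95))               # 0.5
print(round(mu_r_greater_95(94), 2))     # 0.42
print(fuzzy_and(0.4, 0.9))               # 0.4
```

With these assumed breakpoints, a value such as R = 94 evaluates to roughly 0.4 rather than a hard "false."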
On the other hand, for fuzzy implication reasoning the two common methods of Mamdani and Larsen,

μP⇒C(x, y) = min(μP(x), μC(y)),
μP⇒C(x, y) = μP(x) μC(y),    (4.2)

can be used [17, 18]. In practice the Mamdani rule is usually preferred since it avoids multiplication. It is worth noting that the above inference rules are conceptually different from the definition of implication in traditional logic. Rules (4.2) convey the intuitive idea that the truth value of the conclusion C should not be larger than that of the premise P.
In traditional implication, if P is false and C is true, then P ⇒ C is defined also to be true. Thus, assuming about a 5% transient region as in Figure 4.3, the rule R1 in Table 4.1 for the exemplary values R = 94, G = 50, and B = 55 would evaluate to (0.4 × 0.95 × 0.96) × 1 ≈ 0.36 in accordance with the Larsen rule in (4.2). For Mamdani this would be 0.4. On the other hand, the logical AND in the traditional crisp formulation would produce false. However, the result of the implication would be true, since false ⇒ true evaluates to true. Thus, neither crisp false nor true reflects an insight into the nature of the real phenomenon or the expert knowledge (in our case these are the heuristic values found empirically by Peer et al. [14] and used in Equation (4.1)).
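The worked example above can be reproduced directly. The membership values 0.4, 0.95, and 0.96 are those quoted in the text (they follow from the assumed 5% transition regions, not from any exact formula here), and the trailing × 1 is the conclusion membership under the Larsen implication:

```python
# Evaluation of rule R1 for R = 94, G = 50, B = 55, using the premise
# membership values quoted in the text (assumptions, read off the example).
mu_r, mu_g, mu_b = 0.4, 0.95, 0.96

mu_premise_larsen = mu_r * mu_g * mu_b * 1.0   # product AND, conclusion mu = 1
mu_premise_mamdani = min(mu_r, mu_g, mu_b)     # minimum AND (Mamdani)

print(round(mu_premise_larsen, 2))   # 0.36
print(mu_premise_mamdani)            # 0.4
```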
The rule RHS in (4.1) is an aggregation of the rules R1–R7. The common method of fuzzy aggregation is the maximum rule, i.e. the maximum of the output membership functions of the rules which "fired." Thus, having the output fuzzy sets for the rules, the aggregated response can be inferred as

μH = max(μ1P⇒C, . . . , μnP⇒C),    (4.3)

where the μP⇒C are obtained from (4.2). Finally, from μH the "crisp" answer can be obtained after defuzzification. In our case the simplest method for this purpose is also the maximum rule (we need a "false" or "true" output), although in practice the centroid method is very popular.
The presented fuzzy rules were then incorporated into a system for automatic human face detection and tracking in video sequences. For face detection the abovementioned method by Viola and Jones was applied [3]; for the tests the OpenCV implementation was used [19, 20]. However, in many practical examples it showed a high rate of false positives. These can be suppressed, but at the cost of the recall factor. Therefore, to improve the former without sacrificing the latter, the method was augmented with a human skin segmentation module, to take advantage of color images when they are available. Faces found this way can be tracked, for example, with the method discussed in Section 4.6. The system is a simple cascade of a prefilter, which partitions a color image into areas-of-interest (i.e. areas with human skin), and a cascade for face detection in monochrome images, as developed by Viola and Jones. Thus, the prefilter realizes the already mentioned dimensionality reduction, improving the speed of execution and increasing accuracy. This shows the great power of a cascade of simple classifiers, which can be recommended for many tasks in computer vision. The technique can be seen as an ensemble of cooperating classifiers which can be arranged in a series, parallel, or mixed fashion. These issues are further discussed in Section 5.6. The system is depicted in Figure 4.4.
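The serial arrangement of Figure 4.4 can be sketched as a minimal reject-early cascade. The stage functions below are placeholders of our own devising, not the book's implementation:

```python
# Minimal sketch of a serial classifier cascade (cf. Figure 4.4): each stage
# may reject a sample; only samples accepted by every stage survive.

def cascade(stages, sample):
    """Run classifiers in series; reject as soon as any stage rejects."""
    for stage in stages:
        if not stage(sample):
            return False   # rejected at this stage
    return True            # accepted by every stage

# Toy stages: a cheap, high-recall skin-color prefilter followed by a more
# expensive, more precise face detector (both hypothetical score functions).
prefilter = lambda x: x["skin_ratio"] > 0.1
detector  = lambda x: x["face_score"] > 0.8

sample = {"skin_ratio": 0.4, "face_score": 0.9}
print(cascade([prefilter, detector], sample))  # True
```

The design point is that the cheap prefilter runs on every pixel or window, while the costly detector only runs on the small fraction that survives it.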
In a cascade simple classifiers are usually employed, for which speed is preferred over
accuracy. Therefore one of the requirements is that the preceding classifier should have a high

Figure 4.4 A cascade of classifiers for human face detection. The first classifier does dimensionality reduction, selecting only pixels-of-interest based on a fuzzy-rule model of human skin color.




Figure 4.5 Results of face detection with the system in Figure 4.4. A color skin map (a). A test image with outlined face areas (b).

recall factor. In other words, it is better if such a classification module has a high ratio of false positives rather than too high a rate of false negatives, i.e. it passes even possibly wrong answers to the next classification stage rather than rejecting too many. If this is done, then there is still hope that the next expert module in the chain will have a chance to correctly classify an object, and so on. Thus, in the system in Figure 4.4 the human skin detector operates in accordance with the fuzzy method (4.1). For all relations in the particular rules of (4.1) the fuzzy margin of 5% was set, as presented in Figure 4.3. Summarizing, this method was chosen for three reasons. Firstly, as found by comparative experiments, it has the desirable property of a high recall factor, for the discussed reasons, at the cost of slightly lower precision when compared with other methods. Secondly, it does not require any training and it is very fast, allowing run-time operation. Thirdly, it is simple to implement.
Figure 4.5(b) depicts results of face detection in a test color image carried out with the system presented in Figure 4.4. Results of human skin detection computed in accordance with (4.1) are shown in Figure 4.5(a). The advantage of this approach is a reduction in the computations which depends on the contents of an image, since classifiers further along the chain exclusively process pixels passed by the preceding classifiers. This reduction reached up to 62% in experiments with different images downloaded from the Internet from the links provided in the paper by Hsu [21].

4.2.3 CASE STUDY – Pixel Based Road Sign Detection

In this application the task was to segment out image regions which could belong to road signs. Although shapes and basic colors are well defined for these objects, in real situations there can be high variations in the observed colors due to many factors, such as the materials and paint used in manufacturing the signs, their wear, lighting and weather conditions, and many
others. Two methods were developed which are based on manually collected samples from
a few dozen images from real traffic scenes. In the first approach a fuzzy classifier was built
from the color histograms. In the second, the one-class SVM method, discussed in Section
3.8.4, was employed. These are discussed in the following sections.


4.2.3.1 Fuzzy Approach

For each of the characteristic colors of each group of signs, color histograms were created based on a few thousand gathered samples. An example for the red component in the HSV color space, and for two groups of signs, is presented in Figure 4.6. Histograms allow assessment of the distributions of the different colors of road signs in different color spaces. Secondly, they allow derivation of the border values for segmentation based on simple thresholding. Although not perfect, this method is very fast and can be considered in many other machine vision tasks (e.g. due to its simple implementation) [22].
Based on the histograms it was observed that threshold values could be derived in the HSV space, which gives an insight into the color representation. However, this usually requires a prior conversion from the RGB space.
From these histograms the empirical range values for the H and S channels were determined
for all characteristic colors encountered in Polish road signs from each group [23]. These
are given in Table 4.3. In the simplest approach they can be used as threshold values for
segmentation. However, for many applications the accuracy of such a method is not satisfactory.
The main problem with crisp threshold based segmentation is usually the high rate of false
positives, which can lower the recognition rate of the whole system. However, the method is
one of the fastest ones.
Better adapted to the actual shape of the histograms are piecewise linear fuzzy membership functions. At the same time they do not require storage of the whole histogram, which can be a desirable feature, especially for higher dimensional histograms such as 2D or 3D. Table 4.4 presents the piecewise linear membership functions for the blue and yellow colors of Polish road signs, obtained from the empirical histograms of Figure 4.7. Due to specific Polish conditions it was found that detection of warning signs (group "A" of signs) is more reliable based on their yellow background rather than their red border, which is thin and usually greatly deteriorated.
Experimental results of segmentation of real traffic scenes with different signs are presented in Figure 4.8 and Figure 4.9. In this case the fuzzy membership functions from Table 4.4 were used. In comparison to the crisp thresholding method, the fuzzy approach allows more flexibility in the classification of a pixel into one of the classes. In the presented experiments the decision threshold was set experimentally to 0.25. Thus, if for a pixel p we have min(μHR(p), μSR(p)) ≥ 0.25, it is classified as possibly the red rim of a sign.
It is worth noticing that direct application of the Bayes classification rule (3.77) requires evaluation of the class probabilities. Their estimation using, for instance, 3D histograms can occupy a matrix of up to 255 × 255 × 255 entries (which makes 16 MB of memory assuming only 1 byte per counter). This could be reduced to 3 × 255 if channel independence is assumed. However, this assumption does not seem justified, especially for the RGB color space, and it usually leads to a higher false positive rate. On the other hand, the parametric methods which evaluate the PDF with a MoG do not fit well to some recognition tasks, which results in poor accuracy, as frequently reported in the literature [10, 1].

4.2.3.2 SVM Based Approach

Problems with direct application of the Bayes method, as well as the sometimes insufficient precision of the fuzzy approach presented in the previous section, have encouraged the search


Figure 4.6 HSV histograms of the red color of warning signs (a,b,c) and the prohibitive signs in daylight conditions (d,e,f).


Table 4.3 Empirical crisp threshold values for different colors encountered in Polish road signs. The values refer to the normalized [0–255] HSV space.

Color     H                        S
Red       [0–10] ∪ [223–255]       [50–255]
Blue      [120–165]                [80–175]
Yellow    [15–43]                  [95–255]
Green     [13–110]                 [60–170]
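As an illustration, crisp thresholding with the ranges of Table 4.3 amounts to simple comparisons on the H and S channels. The sketch below shows the red case; the function name and example values are ours:

```python
import numpy as np

# Crisp threshold segmentation for "red" per Table 4.3, on the normalized
# [0-255] HSV space. Inputs: H and S channel arrays of equal shape.
def red_mask(h, s):
    h_ok = (h <= 10) | ((h >= 223) & (h <= 255))   # red hue wraps around 0
    s_ok = (s >= 50) & (s <= 255)                  # sufficiently saturated
    return h_ok & s_ok

h = np.array([5, 100, 240])    # hue values: red, green-ish, red
s = np.array([120, 120, 30])   # saturation: ok, ok, too low
print(red_mask(h, s))          # [ True False False]
```

Note that the red hue range is a union of two intervals, since red wraps around the hue origin.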

for other classifiers. In this case the idea is to use the one-class SVM, operating in one of the color spaces, to select pixels with the characteristic colors of road signs. Once again, the main objectives of the method are accuracy and speed, since it is intended to be applied in real-time applications. Operation of the OC-SVM is discussed in Section 3.12. In this section we outline the main properties and extensions of this method [2].
The idea is to train the OC-SVM with color values taken from examples of pixels encountered in images of real road signs. This fits the OC-SVM well, since significantly large amounts of low dimensional data from one class are available. Thus, a small number of SVs is usually sufficient to outline the boundaries of the data clusters. A small number of SVs means faster computation of the decision function, which is one of the preconditions for automotive applications. For this purpose, and to avoid conversion, the RGB color space is used. During operation each pixel of a test image is checked to see if it belongs to the class or not with the help of formulas (3.286) and (3.287). The Gaussian kernel (3.211) was found to provide the best results.
A single OC-SVM was trained in a 10-fold fashion. Then its accuracy was measured in terms of ROC curves, discussed in Appendix A.5. However, the speed of execution – the second important parameter of this system – is directly related to the number of support vectors, which define a hypersphere encompassing the data and are used in the classification of a test point, as discussed in Section 3.12. These, in turn, are related to the parameter γ of the Gaussian kernel (3.211), as depicted in Figure 4.10. For γ ≤ 10 the processing time of the software implementation is on the order of 15–25 ms per frame of resolution 320 × 240,

Table 4.4 Piecewise linear membership functions for the red, blue, and yellow colors of Polish road signs.

Attribute        Piecewise-linear membership functions – coordinates (x,y)

Red H (HR)       (0, 1) - (7, 0.9) - (8, 0) - (245, 0) - (249, 0.5) - (255, 0)
Red S (SR)       (75, 0.4) - (80, 1) - (180, 1) - (183, 0)
Blue H (HB)      (125, 0) - (145, 1) - (150, 1) - (160, 0)
Blue S (SB)      (100, 0) - (145, 1) - (152, 1) - (180, 0)
Yellow H (HY)    (20, 0) - (23, 1) - (33, 1) - (39, 0)
Yellow S (SY)    (80, 0) - (95, 0.22) - (115, 0.22) - (125, 0) - (128, 0) - (150, 0.48) - (155, 0.48) - (175, 0.18) - (200, 0.22) - (225, 1) - (249, 0.95) - (251, 0)
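The breakpoint lists of Table 4.4 can be evaluated directly by linear interpolation, combined with the 0.25 decision threshold mentioned in the text. The helper names below are ours:

```python
import numpy as np

# Evaluating the piecewise-linear membership functions of Table 4.4 by linear
# interpolation, with the 0.25 decision threshold used in the experiments.
HR = ([0, 7, 8, 245, 249, 255], [1, 0.9, 0, 0, 0.5, 0])   # red hue breakpoints
SR = ([75, 80, 180, 183], [0.4, 1, 1, 0])                 # red saturation breakpoints

def membership(value, fn):
    xs, ys = fn
    return float(np.interp(value, xs, ys))

def is_red_rim(h, s, threshold=0.25):
    # Pixel p is possibly a red rim if min(muHR(p), muSR(p)) >= threshold.
    return min(membership(h, HR), membership(s, SR)) >= threshold

print(is_red_rim(3, 120))    # strongly red hue, well saturated -> True
print(is_red_rim(100, 120))  # hue far from red -> False
```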


Figure 4.7 Piecewise linear membership functions created from the histograms of color values for selected road signs (from [22]).

Figure 4.8 Results of fuzzy segmentation for different road signs. Color versions of this and subsequent images are available at www.wiley.com/go/cyganekobject.


Figure 4.9 Results of image segmentation with the fuzzy method for different road signs.


which is an acceptable result for automotive applications. Thus, in the training stage the two parameters of the OC-SVM need to be discovered which fulfill these requirements.
Other kernels, such as the Mahalanobis (3.218) or a polynomial, gave worse results. The former required a much higher number of support vectors for the task, leading to much slower classification. The latter resulted in the worst accuracy.

Figure 4.10 Number of support vectors with respect to the parameter γ of the Gaussian kernel for the blue (a) and red (b) sample points and in different color spaces, for C = 1 (from [2]).


Figure 4.11 Comparison of image segmentation with the fuzzy method (middle row) and the one-class SVM with RBF kernel (lower row) (from [2]). (For a color version of this figure, please see the color plate section.)


Segmentation with the presented method proved to be especially useful for objects which are placed against a similar background, as shown in Figure 4.11. In this respect it allows a more precise response compared with the already discussed fuzzy approach, in which only two color components are used in classification. It can be seen that the fuzzy method is characterized by lower precision, which manifests in many false positives (middle row in Figure 4.11). This leads to incorrect figure detections and system responses, which will be discussed in the next sections.
On the other hand, SVM based solutions can suffer from overfitting, in which their generalization properties diminish. This often happens in configurations which require comparatively large numbers of support vectors. Such behavior was observed for the Mahalanobis kernel (3.218), and also for the Gaussian kernel (3.211) for large values of the parameter γ. However, the RBF kernel operates well for the majority of scenes from the verification group, i.e. those which were not used for training, such as those presented in Figure 4.11. To find the best operating parameters, as well as to test and compare the performance of the OC-SVM classifier with different settings, a special methodology was undertaken, which is described next. Thanks to its properties the method is quite versatile. Specifically, it can be used to segment out pixels of an object from the background, especially if the number of samples in the training set is much smaller than the expected number of all other pixels (background).
The used f-fold cross-validation method consists of dividing a training data set into f partitions of the same size. Then, sequentially, f − 1 partitions are used to train a classifier, while the remaining partition is used for testing. The procedure continues until all partitions have been used for testing. In the implementation the LIBSVM library was employed,


Figure 4.12 ROC curves of the OC-SVM classifier trained with the red color of Polish prohibition road signs in different color spaces: HSI, ISH, and IJK (a); Farnsworth, YCbCr, and RGB (b). Color versions of the plots are available at www.wiley.com/go/cyganekobject.

also discussed in Section 3.12.1.1. In this library, instead of the control parameter C, the parameter υ = 1/(CN) is used [24]. Therefore training can be stated as a search for the best pair of parameters (γ, υ) using the described cross-validation and a grid search procedure [24]. The parameters of the search grid are preselected to ranges which show promising results. In the presented experiments the search space spanned the ranges 0.0005 ≤ γ ≤ 56 and 0.00005 ≤ υ ≤ 0.001, respectively.
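The (γ, υ) grid search with f-fold cross-validation can be sketched as follows. The book used LIBSVM; here, as an assumption, we use scikit-learn's OneClassSVM (whose nu parameter plays the role of υ), synthetic "red pixel" samples, and a far coarser grid than the ranges quoted above:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import KFold

# Sketch of a (gamma, nu) grid search with f-fold cross-validation for a
# one-class SVM with RBF kernel. Synthetic RGB samples stand in for the
# road-sign pixel data; the grid values are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(loc=[200, 40, 40], scale=10, size=(300, 3))  # "red" pixels

best = None
for gamma in [0.001, 0.01, 0.1]:
    for nu in [0.01, 0.05, 0.1]:
        tprs = []
        for train_idx, test_idx in KFold(n_splits=5).split(X):
            clf = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu)
            clf.fit(X[train_idx])
            # True-positive rate on the held-out fold (all samples are positives);
            # a full evaluation would also measure FPR on background pixels.
            tprs.append(np.mean(clf.predict(X[test_idx]) == 1))
        score = float(np.mean(tprs))
        if best is None or score > best[0]:
            best = (score, gamma, nu)

print(best)  # best mean held-out TPR and its (gamma, nu)
```

Each grid point thus costs f train-test runs, which is why a coarse preselection of the ranges matters in practice.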
Figure 4.12 depicts ROC curves for the OC-SVM classifier tested in the 10-fold cross-validation fashion for the red color encountered in prohibition road signs. To compute a single point in the ROC curve an entire f-fold cycle has to be completed. In other words, if in this experiment there are 10 sets of training data, then for a single ROC point the classifier has to be trained and checked 10 times (i.e. each time with 10 − 1 = 9 sets used to build the classifier and 1 left for performance checking). The FPR and TPR values of a single point are then the arithmetic average over all 10 build-check runs. Six color spaces were tested: HSI, ISH, and IJK, shown in Figure 4.12(a), and RGB, YCbCr, and Farnsworth in Figure 4.12(b). The best results were obtained for the perceptually uniform Farnsworth color space (black in Figure 4.12(b)). Apart from the Farnsworth color space, the YCbCr space gave very good results with the lowest FPR. This is interesting since computation of the latter from the original RGB space is much easier. Differences among the other color spaces are not so significant. These and other color spaces are discussed in Section 4.6. In this context the worst performance was obtained for the HSI color space. As shown in Table 4.5, the comparably high number of support vectors means that in this case the OC-SVM with the RBF kernel was not able to closely encompass this data set.
Figure 4.13 shows two ROC curves of the OC-SVM classifier trained on blue color samples
which were collected from the obligation and information groups of road signs (in Polish
regulations regarding road signs these are called groups C and D, respectively [23]). In this

experiment a common blue color was assumed for the two groups of signs. The same 10-fold
cross-validation procedure and the same color spaces were used as in the case of the red



Table 4.5 Best parameters found for the OC-SVM based on the f-fold cross-validation method for
the red and blue color signs. The grid search method was applied with the range 0.0005 ≤ γ ≤ 56
and 0.00005 ≤ ν ≤ 0.001.

                            Red                             Blue
Color space       γ         ν        #SVs        γ         ν         #SVs
HSI             5.6009   0.000921     52       0.0010   0.0002400      3
ISH             0.0010   0.000287      2       0.0010   0.0003825      4
IJK             0.0020   0.000050      2       0.0020   0.0009525      8
Farnsworth      0.2727   0.001000      8       0.0910   0.0002875      8
YCbCr           0.0010   0.000050      2       0.0010   0.0003350      4
RGB             1.2009   0.000020     10       2.8009   0.0001450     25

color data. The best performance once again was obtained for the Farnsworth color space,
though other color spaces do almost as well, including the HSI space. The RGB space shows
the highest FPR rate, though at relatively large TPR values. Such characteristics can be also
verified by analyzing the number of support vectors contained in Table 4.5. In this case
it is the largest one (25) which shows the worst adaptation of the hypersphere to the blue
color data.
Only in one case does the number of support vectors (#SVs) exceed ten. For the best
performing Farnsworth color space #SVs is 5 for red and 3 for blue colors, respectively. A
small number of SVs indicates sufficient boundary fit to the training data and fast run time
performance of the classifier. This, together with the small number of control parameters,

gives a significant advantage to the OC-SVM solution. For instance a comparison of OC-SVM
with other classifiers was reported by Tax [25]. In this report the best performance on many


Figure 4.13 ROC curves of the OC-SVM classifier trained with the blue color of Polish information
and obligation road signs in different color spaces: HIS, ISH, and IJK (a), Farnsworth, YCbCr, and
RGB (b). Color versions of the plots are available at www.wiley.com/go/cyganekobject.



test data was achieved by the Parzen classifier (Section 3.7.4). However, this required a large
number of prototype patterns which resulted in a run-time response that was much longer than
for other classifiers. On the other hand, the classical two-class SVM with many test data sets
requires a much larger number of SVs.

4.2.4 Pixel Based Image Segmentation with Ensemble of Classifiers

For more complicated data sets than those discussed in the previous section, for example those
showing a specific distribution, segmentation with only one OC-SVM may not be sufficient. In
such cases the idea of prior data clustering and of building an ensemble operating on data
partitions, presented in Section 3.12.2, can be of help. In this section we discuss the operation
of this approach for pixel-based image clustering. Let us recall that the operation of the method
can be outlined as follows:
1. Obtain sample points characteristic of the objects of interest (e.g. color samples);
2. Perform clustering of the point samples (e.g. with a version of the k-means method); for
best performance this process can be repeated a number of times, each time checking the
quality of the obtained clustering; after the clustering each point is endowed with a weight
indicating the strength of membership of that point in a partition;
3. Form an ensemble consisting of the WOC-SVM classifiers, each trained with points from
a different data partition along with their membership weights.
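The three steps above can be sketched in code. For illustration only, each trained member is replaced by a plain hypersphere (cluster centroid plus enclosing radius), a crude stand-in for a WOC-SVM trained on one partition; the membership weights of step 2 are omitted. Everything here is an assumption for the sketch, not the book's implementation.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, M, iters=50, seed=0):
    """Step 2: plain k-means returning centroids and each point's cluster index."""
    rnd = random.Random(seed)
    cents = rnd.sample(points, M)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(M), key=lambda m: dist(p, cents[m])) for p in points]
        for m in range(M):
            members = [p for p, a in zip(points, assign) if a == m]
            if members:
                cents[m] = tuple(sum(c) / len(members) for c in zip(*members))
    return cents, assign

class Ensemble:
    """Step 3: one member classifier per data partition."""
    def __init__(self, points, M):
        cents, assign = kmeans(points, M)
        self.models = []
        for m in range(M):
            members = [p for p, a in zip(points, assign) if a == m]
            if members:
                radius = max(dist(p, cents[m]) for p in members)
                self.models.append((cents[m], radius))

    def classify(self, p):
        # Accept a pixel when exactly one member accepts it (the arbitration
        # rule discussed later in this section).
        hits = sum(1 for c, r in self.models if dist(p, c) <= r)
        return hits == 1
```

With well-separated color clusters such an ensemble accepts pixels close to either cluster and rejects everything in between.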
Thus, to run the method, a number of parameters need to be preset both for the clustering
and for the training stages, respectively. In the former the most important is the number of
expected clusters M, as well as parameters of the kernel, if the kernel version of the k-means
is used (Section 3.11.3). On the other hand, for each of the WOC-SVM member classifiers
two parameters need to be determined, as discussed in the previous sections. These are the
optimization constant C (or its equivalent ν = 1/(NC)), given in Equation (3.263), as well as the
σ parameter if the Gaussian kernel is chosen (other kernels can require different parameters,
as discussed in Section 3.10). However, the two parameters can be discovered by a grid
search, i.e. at first a coarse range of the parameters can be checked, and then a more detailed
search around the best values can be performed [24]. As already mentioned, the points in each
partition are assigned weights. However, for a given cluster 1 ≤ m ≤ M the weights have to
fulfill the summation condition (3.240), i.e.
0 < \sum_{i=1}^{N_m} w_{mi} < N_m.    (4.4)


Therefore, combining condition (3.288) with (4.4), the following is obtained

1 < 1/C ≤ \sum_{i=1}^{N_m} w_{mi} ≤ N_m.    (4.5)



Thus, for a given partition and its weights the training parameter C should be chosen in
accordance with the following condition

1 / (\sum_{i=1}^{N_m} w_{mi}) ≤ C < 1.    (4.6)


In practice, a range of C and σ values is chosen and then for each the 10-fold cross-validation
is run. That is, the available training set is randomly split into 10 parts, of which 9 are used
for training and 1 is left for testing. The procedure is run a number of times and the parameters
for which the best accuracy was obtained are stored.
The described method assumes a twofold transformation of the data into two different feature
spaces. The first mapping is carried out during the fuzzy segmentation. The second is obtained
when training the WOC-SVM classifiers. Hence, by using different kernels or different sets
of features for clustering and training, specific properties of the ensemble can be obtained.
The efficacy of the system can be measured by the number of support vectors per number
of data in partitions, which should be the minimum possible for the required accuracy. Thus,
efficacy of an ensemble of WOC-SVMs can be measured as follows [26]
ρ = \sum_{i=1}^{M} ρ_i = \sum_{i=1}^{M} (#SV_i / #D_i),    (4.7)

where #SV_i is the number of support vectors for a data set D_i, M denotes the number of
members of the ensemble, and #D_i is the cardinality of the i-th data subset. A rule of thumb is to
keep ρ_i below 0.1–0.2. Higher values usually indicate data overfitting and, as a consequence,
worse generalization properties of the classifier. The value of Equation (4.7) can be used in
the training stage to indicate a set of parameters with sufficiently small ρ. On the other hand,
if only one subset i shows an excessive value of its ρ_i, then new clustering of this specific subset
can be considered. In other cases, the clustering process can be repeated with different initial
numbers of clusters M.
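Equation (4.7) and the rule of thumb can be coded directly. The helper below is an illustrative sketch, not part of the cited implementation; the 0.2 default reflects the upper end of the recommended range.

```python
def ensemble_efficacy(sv_counts, partition_sizes, rho_max=0.2):
    """Compute rho_i = #SV_i / #D_i per ensemble member and their sum rho
    (Equation (4.7)); flag members whose rho_i exceeds the rule-of-thumb
    limit, which usually indicates overfitting of that partition."""
    rhos = [s / d for s, d in zip(sv_counts, partition_sizes)]
    flagged = [i for i, r in enumerate(rhos) if r > rho_max]
    return sum(rhos), rhos, flagged
```

A flagged subset would be a candidate for re-clustering, as described above.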
During operation a pixel is assigned as belonging to the class only if accepted by exactly
one of the member classifiers of that ensemble. Nevertheless, depending on the problem this
arbitration rule can be relaxed, e.g. a point can be also assigned if accepted by more than
one classifier, etc. The classification proceeds in accordance with Equations (3.286) and (3.287).
Thus, its computation time depends on the number #SV, as used in (4.7). Nevertheless, for a
properly trained system #SV is much lower than the total number of original data. Therefore,
the method is very fast and this is what makes it attractive for real-time applications.
In the following, two experimental results exemplifying the properties of the proposed
ensemble of classifiers for pixel based image segmentation are presented [26]. In the first
experiment, a number of samples of the red and blue colors occurring in the prohibitive and
information road signs, respectively, were collected. Then these two data sets were mixed and
used to train different versions of the ensembles of WOC-SVMs, presented in the previous
sections. Then the system was tested on the image in Figure 4.14(a). Results of the
red-and-blue color segmentation are presented in Figure 4.14(b–d), for M = 1, 2, and 5 clusters,
respectively. We see a high number of false positives in the case of one classifier, i.e. for M = 1.
However, the situation is significantly improved if only two clusters are used. In the case of




Figure 4.14 Red-and-blue color image segmentation with the ensemble of WOC-SVMs trained with the
manually selected color samples. An original 640 × 480 color image of a traffic scene (a). Segmentation
results for M = 1 (b), M = 2 (c), and M = 5 (d) (from [26]). (For a color version of this figure, please
see the color plate section.)

five clusters, M = 5, we notice an even lower number of false positives. However, the red rim
of the prohibition sign is barely visible indicating lowered generalization properties (i.e. tight
fit to the training data).
In this experiment the kernel c-means with Gaussian kernels was used. Deterministic annealing was also employed. That is, the parameter γ in (3.253) starts from 3 and is then gradually
lowered to the value 1.2.
The second experiment was run with an image shown in Figure 4.15(a) from the Berkeley
Segmentation Database [27]. This database contains manually outlined objects, as shown in
Figure 4.15(b). From the input image a number of color samples of bear fur were manually
gathered, as shown in Figure 4.15(c).
The image in Figure 4.16(a) depicts manually filled animals in an image, based on their
outline in Figure 4.15(b). Figure 4.16(b–c) show results of image segmentation with the
ensemble composed of 1 and 7 members, respectively. As can be seen, an increase in the
number of members in the ensemble leads to fewer false positives. Thanks to the ground-truth
data in Figure 4.16(a) these can be measured quantitatively, as precision and recall (see Section
A.5). These are presented in Table 4.6.
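Precision and recall against the ground-truth mask can be computed pixelwise as follows. The binary masks are flattened to sequences here, and all names are illustrative.

```python
def precision_recall(predicted, ground_truth):
    """Pixelwise precision P = TP/(TP+FP) and recall R = TP/(TP+FN)
    of a binary segmentation mask against a ground-truth mask."""
    tp = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    fp = sum(1 for p, g in zip(predicted, ground_truth) if p and not g)
    fn = sum(1 for p, g in zip(predicted, ground_truth) if not p and g)
    return tp / (tp + fp), tp / (tp + fn)
```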




Figure 4.15 A 481 × 321 test image (a) and manually segmented areas of image from the Berkeley
Segmentation Database [27] (b). Manually selected 923 points from which the RGB color values were
used for system training and segmentation (c), from [26]. Color versions of the images are available at
the book web page [28]. (For a color version of this figure, please see the color plate section.)

The optimal number of clusters was obtained with the entropy criterion (3.259). Its values
for color samples used to segment images in Figure 4.14(a) and Figure 4.15(a) are shown in
Figure 3.28 with the groups of bars for the 4th and 5th data set.
From the results presented in Table 4.6 we can easily see that the highest improvement in
accuracy is obtained by introducing a second classifier. This is due to the best entropy
parameter for the two classes in this case, as shown in Figure 3.28. Then accuracy improves
with increasing numbers of classifiers in the ensemble, reaching a plateau. Also, kernel based
clustering allows slightly better precision of response as compared with the crisp version.
Further details of this method, also applied to data sets other than images, can be found in
paper [26].

4.3 Detection of Basic Shapes

Detection of basic shapes such as lines, circles, ellipses, etc. belongs to one of the fundamental
low-level tasks of computer vision. In this context the basic shapes are those that can be

described parametrically by means of a certain mathematical model. For their detection the
most popular is the method by Hough [29], devised over half a century ago as a voting method


Figure 4.16 Results of image segmentation based on chosen color samples from Figure 4.15(c).
Manually segmented objects from Figure 4.15(b–c) used as a reference for comparison. Segmentation
results with the ensemble of WOC-SVMs for only one classifier, M = 1 (b), and for M = 7 classifiers
(c). Gaussian kernel used with parameter σ = 0.7 (from [26]).



Table 4.6 Accuracy parameters precision P vs. recall R of the pixel based image segmentation from
Figure 4.15 with results shown in Figure 4.16 (from [26]).
          Crisp k-means             Kernel c-means (Gaussian kernel)
M           P        R                P        R
1          0.63     0.97             0.63     0.97
2          0.63     0.95             0.76     0.90
5          0.77     0.92             0.79     0.92
7          0.76     0.91             0.81     0.91

for recognition of lines, it was then introduced to the computer vision community by Duda
and Hart [30], and further extended to detection of arbitrary shapes by Ballard [31]. However,
in the case of general shapes the method is computationally expensive.
A good overview of the Hough method and its numerous variations can be found for instance
in the book by Davies [32]. However, what is less known is that application of the structural
tensor, discussed in Section 2.7, can greatly facilitate detection of basic shapes. Especially fast
and accurate information can be obtained by analyzing the local phase ϕ of the tensor (2.94),
as well as its coherence (2.97). Such a method, called the orientation-based Hough transform, was
proposed by Jähne [33]. The method does not require any prior image segmentation. Instead,
for each point the structural tensor is computed which provides three pieces of information,
that is, whether a point belongs to an edge and, if so, what is its local phase and what is the
type of the local structure.
Then, only one parameter is left to be determined, the distance p0 of a line segment to the
origin of the coordinate system. The relations are as follows (see Figure 4.17).
(x_2 − x_{20}) / (x_{10} − x_1) = ctg ϕ,   x_{20} = p_0 sin ϕ,   x_{10} = p_0 cos ϕ,    (4.8)

Figure 4.17 Orientation-based Hough transform and the UpWrite method.



(these are lower and upper indices, not powers), which after rearranging yields

[cos ϕ   sin ϕ] [x_1   x_2]^T = w^T x = p_0.    (4.9)

In the above, w = [cos ϕ, sin ϕ]^T is a normal vector to the sought line and p_0 is the distance of
a line segment to the center of the image coordinate system.
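A minimal sketch of the orientation-based Hough accumulation: assuming each edge point already carries its local phase ϕ from the structural tensor, only p_0 = x_1 cos ϕ + x_2 sin ϕ remains to be computed per point, so a single vote is cast instead of a whole sinusoid. The bin sizes and all names below are illustrative assumptions.

```python
import math
from collections import Counter

def orientation_hough(edge_points, phi_bins=180, p_step=1.0):
    """Accumulate one vote per edge point (x1, x2, phi) into a (phi bin,
    p0 bin) accumulator; peaks of the returned Counter correspond to lines."""
    acc = Counter()
    for x1, x2, phi in edge_points:
        p0 = x1 * math.cos(phi) + x2 * math.sin(phi)
        key = (round(phi * phi_bins / math.pi) % phi_bins, round(p0 / p_step))
        acc[key] += 1
    return acc
```

Points on the horizontal line x_2 = 5 all share ϕ = π/2 and vote for p_0 = 5, so the corresponding accumulator cell dominates.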
It is interesting to observe that such an orientation-based approach is related to the idea
called the UpWrite method, originally proposed for detection of lines, circles, and ellipses by
McLaughlin and Alder [34]. Their method assumes computation of local orientations as the
phase of the dominant eigenvector of the covariance matrix of the image data. Then, a curve
is found as a set of points passing through consecutive mean points m of local pixel blobs
with local orientations that follow, or track, the assumed curvature (or its variations). In other
words, the inertia tensor (or statistical moments) of pixel intensities are employed to extract
a curve – these were discussed in Section 2.8. Finally, the points found can be fitted to the
model by means of the least-squares method.
The two approaches can be combined into a method for shape detection in multichannel
and multiscale signals2 based on the structural tensor [35]. The method joins the ideas of
the orientation-based Hough transform and the UpWrite technique. However, in the former
case the ST was extended to operate in multichannel and multiscale images. Then the basic
shapes are found in accordance with the additional rules. On the other hand, it differs from the
UpWrite method mainly by application of the ST which operates on signal gradients rather
than statistical moments used in the UpWrite. The two approaches are discussed in the next
sections. Implementation details can be found in the papers [35, 36].


4.3.1 Detection of Line Segments
Detection of compound shapes which can be described in terms of line segments can be done
with trees or with simple grammar rules [35, 37, 38]. In this section the latter approach is
discussed. The productions describe expected local structure configurations that could contain
a shape of interest. For example the SA and SD,E,F,T productions help find silhouettes of shapes
for the different road signs (these groups are named “A” and “D”, “E”, “F”, “T”). They are
formed by concatenations of simple line segments L_i. The rules are as follows

S_A → L_1 L_2 L_3,    S_{D,E,F,T} → L_3 L_4.    (4.10)

The line segments L_i are defined by the following productions

L_i → L(η_i π, p_i, κ_i),   L → L_H | L_U,    (4.11)

where L_i defines a local structure segment with a slope η_i π ± p_i, which is returned by the
detector L controlled by a set of specific parameters κ_i. The segment detector L, described by
2 These can be any signals, so the method is not restricted to operating only with intensity values of the pixels.





Figure 4.18 Shape detection with the SA grammar rule. For detection the oriented Hough transform,
computed from the structural tensor operating in the color image at one scale, was used. Color version
of the image is available at www.wiley.com/go/cyganekobject.

the second production in Equation (4.11), can be either the orientation-based Hough transform
LH from the multichannel and multiscale ST [35], or the UpWrite LU .
If all Li of a production are parsed, i.e. they respond with a nonempty set of pixels (in
practice above a given threshold), then the whole production is also fulfilled. However, since
images are multidimensional structures, these simple productions lack spatial relations. In
other words, a production defines only necessary, but not sufficient, conditions. Therefore
further rules of figure verification are needed. These are discussed in Section 4.4.
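The necessary-condition test described above can be checked mechanically: a production such as (4.10) is parsed when every component detector returns a sufficiently large pixel set. The helper below is an illustrative sketch; the threshold name and value are assumptions.

```python
def production_fulfilled(segment_pixel_sets, min_pixels=10):
    """A production such as S_A -> L1 L2 L3 is parsed when every component
    detector L_i responds with at least min_pixels pixels. This is only a
    necessary condition; spatial verification of the figure must follow."""
    return all(len(pixels) >= min_pixels for pixels in segment_pixel_sets)
```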
Figure 4.18 depicts the results of detection of triangular shapes with the presented technique.
The input is a color image shown in Figure 4.18(a). Its three color channels R, G, and B, presented
in Figure 4.18(b–d), are used directly to compute the ST, as defined in Equation (2.107) on
p. 53. The weights are the same, c_k = 1/3, for all channels. The parameters η_i in (4.11) are η_1 =
1/3, η_2 = 2/3, and η_3 = 0. The parameter p_i, which controls slope variation, is p_i = 3%, i.e. it is the
same for all component detectors. Results of the L1 , L2 , and L3 productions, as well as their
combined output, are depicted in Figure 4.18(e–h), respectively. The shape that is found can be
further processed to find parameters of its model, e.g. with the Hough transform. However, in
many applications explicit knowledge of such parameters is not necessary. Therefore in many
of them a detected shape can be tracked, as discussed in Section 3.8, or it can be processed
with the adaptive window technique, discussed in Section 4.4.3.

4.3.2 UpWrite Detection of Convex Shapes

As alluded to previously, components of the ST provide information on areas with high local
structure together with their local phases, as discussed in Section 2.7. The former can be
used to initially segment an image into areas with strong local structures (such as curves,
for instance), then the latter provides their local curvatures. These, in turn, can be tracked as
long as they do not differ significantly, or in other words, to assure curvature continuity. This
forms a foundation to the version of the UpWrite method presented here which is based on the
structural tensor.




Figure 4.19 Curve detection with the UpWrite tensor method. Only places with a distinct structure are
considered, for which their local phase is checked. If a change of phase fits into the predefined range,
then the point is included into the curve and the algorithm follows.

The condition on strong local structure can be formulated in terms of the eigenvalues of the
structural tensor (2.92), as follows
λ_1 > τ,   λ_2 ≈ 0,    (4.12)

where τ is a constant threshold. In other words, phases of local structures will be computed
only in the areas for which there is one dominating eigenvalue. A classification of types of local
areas based on the eigenvalues of the ST can be found in publications such as [39, 40, 15].
A similar technique for object recognition with local histograms computed from the ST is
discussed in Section 5.2.
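For a 2 × 2 structural tensor the eigenvalues and the local phase have closed forms, which a sketch of condition (4.12) might compute as follows. The threshold names and values are illustrative assumptions.

```python
import math

def st_eigen(Jxx, Jxy, Jyy):
    """Eigenvalues and local phase of a 2x2 structural tensor [[Jxx, Jxy],
    [Jxy, Jyy]] via the trace/discriminant formulas for symmetric matrices."""
    mean = 0.5 * (Jxx + Jyy)
    root = math.hypot(0.5 * (Jxx - Jyy), Jxy)
    phase = 0.5 * math.atan2(2.0 * Jxy, Jxx - Jyy)
    return mean + root, mean - root, phase

def strong_structure(Jxx, Jxy, Jyy, tau, eps=1e-3):
    """Condition (4.12): one dominating eigenvalue marks an ideal local
    orientation; two comparable eigenvalues mark a corner or isotropic area."""
    l1, l2, _ = st_eigen(Jxx, Jxy, Jyy)
    return l1 > tau and abs(l2) < eps
```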
Figure 4.19 depicts the process of following local phases of a curve. A requirement of curve
following from point to point is that their local phases do not differ by more than an assumed
threshold. Hence, a constraint on the gradient of curvature is introduced

Δϕ = ϕ_k − ϕ_{k+1} < κ,    (4.13)


where κ is a positive threshold. Such a formulation allows detection of convex shapes, however.
Thus, the choice of the allowable phase change κ can be facilitated by providing the degree of a
polygon approximating the curve. The method is depicted in Figure 4.20.
In this way Δϕ from Equation (4.13) can be stated in terms of a degree N of a polygon,
rather than a threshold κ, as follows

Δϕ_max = 2π/N,   and   Δϕ = ϕ_k − ϕ_{k+1} < 2π/N.    (4.14)

In practice it is also possible to set some threshold on the maximum allowable distance
between pairs of consecutive points of a curve. This allows detection of curves in real discrete
images, in which it often happens that the points are not locally connected, mostly due to image
distortions and noise.
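The curve following with the Δϕ constraint of (4.13)–(4.14) and the distance threshold can be sketched as a greedy chaining loop. Backtracking is omitted and all names are illustrative; with the polygon parameterization one would set dphi_max = 2π/N.

```python
import math

def follow_curve(points_with_phase, dphi_max, d_max):
    """Greedy UpWrite-style chaining over (x, y, phi) candidates: extend the
    chain while a candidate lies within d_max of the last accepted point and
    its local phase differs by less than dphi_max (curvature continuity)."""
    chain = [points_with_phase[0]]
    for cand in points_with_phase[1:]:
        x0, y0, ph0 = chain[-1]
        x, y, ph = cand
        if math.hypot(x - x0, y - y0) <= d_max and abs(ph - ph0) < dphi_max:
            chain.append(cand)
    return chain
```

A candidate is dropped either when it is too far away (a gap larger than the separation threshold) or when its phase jumps, which is how outliers and corners are rejected.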




Figure 4.20 The allowable phase change in each step of the method can be determined providing a
degree of the approximating polygon.

Figure 4.21 presents results of detection of the circular objects in real traffic scenes. Detected
points for the allowable phase change, set with a polygon of degree N = 180, are visualized
in Figure 4.21(b). The maximum separation between consecutive points was set to not exceed
4 pixels.
Figure 4.22 also shows detection of oval road signs. In this case, however, a finer phase
change was allowed, setting N = 400. The same minimal distance was used as in the previous
example.
The method is fast enough for many applications. In the C++ implementation this requires
about 0.4 s on average to process an image of 640 × 480 pixels. At first some time is consumed for computation of the ST, as discussed in Section 2.7.4.1. Although subsequent phase
computations are carried out exclusively in areas with strong structure, some computations
are necessary to follow a curve with backtracking. That is, the algorithm attempts to find
the longest possible chain of segments of a curve. A minimal allowable length of a segment
is set as a parameter. If this is not possible then it backtracks to the previous position and
starts in another direction, if there are such possibilities. Nevertheless, memory requirements are


Figure 4.21 Detection of ovals in a real image (a). Detected points with the UpWrite tensor method
for the allowable phase change as in a polygon of degree N = 180 (b). Color versions of the images are
available at www.wiley.com/go/cyganekobject.





Figure 4.22 Detection of ovals in a real image (a). Detected points with the method for the allowable
phase change set to N = 400 (b). (For a color version of this figure, please see the color plate section.)

moderate, i.e. some storage is necessary for ST as well as to save the positions of the already
processed pixels. Such requirements are convenient when compared with other algorithms,
such as circle detection with the Hough method.
The next processing steps depend on the application. If parameters of a curve need to be
determined, then the points can be fitted to the model by the voting technique as in the Hough
transform. Otherwise, the least-squares method can be employed to fit a model to data [41, 42].
However, such a method should be able to cope with outliers, i.e. the points which do not
belong to a curve at all and which are results of noise. In this respect the so-called RANSAC
method could be recommended [43, 44]. It has found broad application in other areas of
computer vision, such as determination of the fundamental matrix [15, 45, 46]. Nevertheless,
in many practical applications the parameters of a model are irrelevant or a model is not
known. For example in the system for road sign recognition, presented in Section 5.7, such
information would be redundant. A found object needs to be cropped from its background
and then, depending on the classifier, it is usually registered to a predefined viewpoint and
size. For this purpose a method for the tight encompassing of a found set of points is more
important. This can be approached with the adaptive window growing method, discussed in
Section 4.4.3. The mean shift method can also be used (Section 3.8).

4.4 Figure Detection


Many objects can be found based on detection of their characteristic points. The problem
belongs to the dynamically changing domain of sparse image coding. The main idea is to
detect characteristic points belonging to an object which are as invariant as possible
to potential geometrical transformations of the view of that object, as well as to noise and
other distortions. The most well known point descriptors are SIFT [47], HOG [48], DAISY
[49], SURF [50], as well as many of their variants, such as PCA-SIFT proposed by Ke
and Sukthankar [51], OpponentSIFT [52], etc. A comparison of sparse descriptors can be
found in the paper by Mikolajczyk and Schmid [53]. They also propose an improvement
called the gradient location and orientation histogram descriptor (GLOH), which as reported
outperforms SIFT in many cases. These results were further verified and augmented in the

