Crane Gesture Recognition Using Pseudo 3-D Hidden Markov Models
Stefan Müller, Stefan Eickeler, Gerhard Rigoll
Gerhard-Mercator-University Duisburg
Department of Computer Science
Faculty of Electrical Engineering
47057 Duisburg – Germany
e-mail: {stm,eickeler,rigoll}@fb9-ti.uni-duisburg.de
Abstract
A recognition technique based on novel pseudo 3-D Hidden
Markov Models, which can integrate spatially as well as
temporally derived features, is presented in this paper. The
approach allows the recognition of dynamic gestures such
as waving hands as well as static gestures such as standing
in a special pose. Pseudo 3-D Hidden Markov Models
(P3DHMMs) are an extension of the pseudo 2-D case,
which has been successfully used for the classification of
images and the recognition of faces. In the P3DHMM
case the so-called superstates contain P2DHMMs, and thus
whole image sequences can be generated by these models.
Our approach has been evaluated on a crane signal
database, which consists of 12 different predefined gestures
for maneuvering cranes.
1. Introduction
Many recent publications report on the use of Hidden
Markov Models (HMMs) for the recognition of human
actions in image sequences. For example, Yamato et al. [1],
probably the first publication addressing this problem, use
discrete HMMs and thus a sequence of VQ-labels in order
to recognize six classes representing tennis strokes. In their
approach several preprocessing steps, including low-pass
filtering, background subtraction and binarization, are
applied to each image of a sequence. The outcome of these
steps is a two-level image in which the pose of the human is
roughly extracted. Prior to the calculation of the features
themselves, size normalization and a centering step are
applied to the binarized image. The features are the numbers
of black pixels in the cells of a mesh, i.e. a subsampled
image arranged in a feature vector. These features are vector
quantized, and thus the image sequence becomes a sequence
of VQ-labels, which can be processed by a discrete HMM
(at that time the preferred modeling technique).
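The mesh feature and the vector quantization step described above can be illustrated with a minimal sketch. The mesh size, the foreground convention (True marks a pixel of the extracted pose) and the codebook are assumptions made for illustration only; the source does not state the values used in [1], and the function names are hypothetical.

```python
import numpy as np


def mesh_features(binary_image, mesh_rows=4, mesh_cols=4):
    """Count foreground pixels in each cell of a mesh laid over a binarized image.

    The 4 x 4 mesh is an assumed size, not a value taken from [1].
    """
    img = np.asarray(binary_image, dtype=bool)
    height, width = img.shape
    # array_split tolerates image sizes that are not multiples of the mesh size.
    row_blocks = np.array_split(np.arange(height), mesh_rows)
    col_blocks = np.array_split(np.arange(width), mesh_cols)
    counts = [img[np.ix_(r, c)].sum() for r in row_blocks for c in col_blocks]
    return np.array(counts, dtype=float)


def vq_label(feature, codebook):
    """Return the index of the nearest codebook vector (Euclidean distance)."""
    distances = np.linalg.norm(np.asarray(codebook) - feature, axis=1)
    return int(np.argmin(distances))
```

Applying `vq_label` to the mesh feature of every frame turns an image sequence into the kind of label sequence that a discrete HMM can process.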
Schuster and Rigoll also applied discrete HMMs to the
task of image sequence recognition in [2]. Their approach
utilizes a much simpler preprocessing, which leads to a
system with real-time capabilities. The color images of a
sequence are subsampled for each RGB plane separately, and
horizontal or vertical stripes are fed directly into a vector
quantizer. Alternatively, the same steps are applied to a
difference image sequence. This real-time capable system has
been evaluated on a ten class database, which consists of
gestures such as nod-no, nod-yes, kotow and clapping.
The system mentioned above has been improved by utilizing
continuous HMMs in conjunction with geometric
moments calculated on difference images. As reported in
[3] the improved system is capable of classifying 24 gestures
with a recognition accuracy of 90%.
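As an illustration of moment features computed on difference images, the following sketch calculates the total intensity, the center of gravity and the second central moments of an absolute difference image. The choice of these five components and the use of the absolute difference are assumptions; the exact feature set of [3] is not given in this paper.

```python
import numpy as np


def difference_image(frame, previous_frame):
    """Absolute intensity change between two consecutive grayscale frames."""
    return np.abs(np.asarray(frame, dtype=float) - np.asarray(previous_frame, dtype=float))


def moment_features(diff):
    """Return (mass, cx, cy, var_x, var_y) of a non-negative difference image."""
    height, width = diff.shape
    ys, xs = np.mgrid[0:height, 0:width]
    mass = diff.sum()
    if mass == 0:
        # No motion between the frames: all moments are undefined, return zeros.
        return np.zeros(5)
    cx = (xs * diff).sum() / mass
    cy = (ys * diff).sum() / mass
    var_x = (((xs - cx) ** 2) * diff).sum() / mass
    var_y = (((ys - cy) ** 2) * diff).sum() / mass
    return np.array([mass, cx, cy, var_x, var_y])
```

Two identical frames yield an all-zero vector, which shows why features of this kind carry no information for gestures without movement.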
Continuous HMMs in combination with moments are
also used by Starner et al. in [4]. This system recognizes
American Sign Language by extracting the hands of a person
from images and performing a second moment analysis on
the extracted blobs. Besides the components derived from
the extracted shapes of the hands, dynamic features such as
the change of the position between frames are also part of
the feature vector.
Most of the systems mentioned previously heavily rely
on the existence of motion or moving body parts, due to
the calculation of e.g. moments on the difference images.
In order to overcome this limitation, we propose the usage
of pseudo 3-D HMMs, which are able to integrate features
derived from temporal as well as spatial information and
which can also perform an elastic matching on the individ-
ual images. This is different from the previously mentioned
approaches, because either VQ-labels are assigned to whole
images ([1, 2]) or global features are calculated ([3],[4]) and
thus no elastic matching on the image itself is performed.
The elastic matching procedure should also allow a position
invariant recognition of gestures.
This paper is organized as follows. Section 2 gives an
introduction to pseudo 3-D HMMs and describes the fea-
ture extraction used in the experiments. Section 3 presents
experimental results. A summary is given in Section 4.
2. Pseudo 3-D HMMs for the Stochastic Modeling of Three-Dimensional Data
Hidden Markov Models are finite non-deterministic state
machines which have been successfully applied to continuous
speech [5] and online handwriting recognition [6]. They
consist of a fixed number of states with associated output
probability density functions (pdfs)
$b_j(\mathbf{x}) = p(\mathbf{x} \mid q_t = s_j)$ as well as
transition probabilities
$a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$, where $q_t$ denotes
the actual state at time $t$, $s_j$ is a distinct state and
$\mathbf{x}$ denotes a feature vector. Especially large feature
vectors consisting of inhomogeneous components are often
divided into statistically independent streams (see e.g. [7]),
and thus for $K$ streams and given stream weights $w_k$ the
pdf of state $s_j$ can be calculated as

$$b_j(\mathbf{x}) = \prod_{k=1}^{K} \left[ b_{jk}(\mathbf{x}_k) \right]^{w_k} \qquad (1)$$

For every stream $k$, the pdfs $b_{jk}(\mathbf{x}_k)$ are usually
given by finite Gaussian mixtures of the form

$$b_{jk}(\mathbf{x}_k) = \sum_{m=1}^{M_k} c_{jkm} \, \mathcal{N}(\mathbf{x}_k; \boldsymbol{\mu}_{jkm}, \boldsymbol{\Sigma}_{jkm}) \qquad (2)$$

where $c_{jkm}$ is the mixture coefficient for the $m$th mixture
in stream $k$ and $\mathcal{N}(\cdot)$ is a multivariate Gaussian
density with mean vector $\boldsymbol{\mu}_{jkm}$ and covariance
matrix $\boldsymbol{\Sigma}_{jkm}$. The use of streams allows the
integration of features derived from temporal as well as spatial
data into a single model. Furthermore, the stream weights provide
the opportunity to adjust the influence of temporal and spatial
features.
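The stream-weighted output density can be sketched in the log domain, where the weighted product over streams in Eq. 1 becomes a weighted sum of the per-stream log mixture densities of Eq. 2. The sketch assumes full covariance matrices and that each stream is passed as a tuple of mixture coefficients, means and covariances; these conventions and all function names are illustrative assumptions.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log of a multivariate Gaussian density evaluated at x."""
    dim = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + mahalanobis)


def log_mixture(x, coefficients, means, covs):
    """Log of a finite Gaussian mixture density for one stream."""
    terms = np.array([
        np.log(c) + log_gaussian(x, m, s)
        for c, m, s in zip(coefficients, means, covs)
    ])
    return np.logaddexp.reduce(terms)


def log_state_pdf(stream_vectors, stream_mixtures, stream_weights):
    """Weighted sum over streams of the per-stream log mixture densities."""
    return sum(
        w * log_mixture(x, *mixture)
        for x, mixture, w in zip(stream_vectors, stream_mixtures, stream_weights)
    )
```

With one stream for static and one for dynamic features, raising the weight of a stream increases its influence on the state score.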
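An HMM $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$
with N states is fully described by the N $\times$ N-dimensional
transition matrix $\mathbf{A}$, the N-dimensional output pdf
vector $\mathbf{B}$ and the initial state distribution vector
$\boldsymbol{\pi}$, which consists of the probabilities
$\pi_i = P(q_1 = s_i)$. After the model has been trained using
the Baum-Welch algorithm, feature sequences
$\mathbf{X} = \mathbf{x}_1, \ldots, \mathbf{x}_T$ can be scored
according to

$$P(\mathbf{X} \mid \lambda) = \sum_{\text{all } Q} \pi_{q_1} b_{q_1}(\mathbf{x}_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} \, b_{q_t}(\mathbf{x}_t) \qquad (3)$$

Usually the likelihood $P(\mathbf{X} \mid \lambda)$ is estimated
by the Viterbi algorithm, which is an approximation based on the
most likely state sequence
($P(\mathbf{X} \mid \lambda) \approx \max_{Q} P(\mathbf{X}, Q \mid \lambda)$).
For recognition tasks, $P(\mathbf{X} \mid \lambda_p)$ is used to
classify an unknown pattern to class $p^*$ which satisfies Eq. 4.

$$p^* = \operatorname{argmax}_{p} \, P(\mathbf{X} \mid \lambda_p) \qquad (4)$$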
A very detailed explanation of the HMM-framework is
given by Rabiner in [5].
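The Viterbi approximation of the likelihood and the decision rule of Eq. 4 can be written compactly in the log domain. The sketch below assumes that the log output densities of all states have already been evaluated for every frame; the function names and the dictionary of class scores are illustrative assumptions.

```python
import numpy as np


def viterbi_log_score(log_pi, log_a, log_b):
    """Log probability of the most likely state sequence.

    log_pi: (N,) log initial state probabilities
    log_a:  (N, N) log transition probabilities, log_a[i, j] for i -> j
    log_b:  (T, N) log output densities of each state for each frame
    """
    delta = log_pi + log_b[0]
    for t in range(1, log_b.shape[0]):
        # Best predecessor for every state j, then emit frame t in state j.
        delta = np.max(delta[:, None] + log_a, axis=0) + log_b[t]
    return float(np.max(delta))


def classify(scores_by_class):
    """Return the class label whose model gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)
```

An unknown sequence is scored once per class model, and `classify` picks the class with the largest Viterbi score.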
It has been shown that HMMs can be applied successfully
not only to time series problems, but also to pattern
recognition problems in which the pattern varies in space
rather than in time. HMMs have therefore recently been
applied to image recognition problems with promising
results [8, 9]. In both publications pseudo 2-D HMMs, also
known as planar HMMs, have been utilized. A P2DHMM is
an extension of the one-dimensional HMM paradigm which
has been developed in order to model two-dimensional data.
P2DHMMs are called pseudo because the state alignments of
consecutive columns are calculated independently of each
other. They are stochastic state machines with a
two-dimensional arrangement of the states, as outlined in
Fig. 1. The states in horizontal direction are denoted as
superstates, and each superstate consists of a
one-dimensional HMM in vertical direction. The P2DHMM
shown in Fig. 1 can be trained from data, after features have
been extracted, using the segmental k-means algorithm. Once
the models have been trained for each class, recognition is
accomplished by calculating the class-dependent probability
that the (unclassified) data has been generated by the
corresponding HMM. For this procedure, the doubly embedded
Viterbi algorithm proposed by Kuo and Agazzi in [8] can be
utilized. Alternatively, Samaria shows in [10] that a
P2DHMM can be transformed into an equivalent
one-dimensional HMM by the insertion of special
start-of-line states and features. Fig. 2 shows an augmented
P2DHMM with start-of-line states (indicated by a cross).
These states generate a high probability for the emission of
start-of-line features. When using the structure in Fig. 2,
one has to ensure that the value of the start-of-line feature
is different from all possible ordinary features. These
equivalent HMMs can be trained by the standard Baum-Welch
algorithm, and the recognition step can be carried out using
the standard Viterbi algorithm.
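The transition structure of such an equivalent one-dimensional HMM can be sketched as follows. Each superstate contributes one start-of-line state followed by its ordinary states; after the last ordinary state the model either re-enters the same superstate for the next line or advances to the next superstate. The strict left-to-right topology and the probability values are assumptions chosen for illustration, and the function name is hypothetical; emission densities are not modeled here.

```python
import numpy as np


def p2dhmm_as_1d_transitions(n_superstates, n_states, p_stay=0.5, p_same_super=0.5):
    """Transition matrix of a 1-D HMM equivalent to a left-to-right P2DHMM.

    State layout per superstate: [start-of-line, state 1, ..., state n_states].
    """
    block = n_states + 1
    size = n_superstates * block
    a = np.zeros((size, size))
    for s in range(n_superstates):
        sol = s * block            # index of the start-of-line state
        first = sol + 1
        last = sol + n_states
        a[sol, first] = 1.0        # the marker state always enters the line model
        for j in range(first, last):
            a[j, j] = p_stay
            a[j, j + 1] = 1.0 - p_stay
        a[last, last] = p_stay
        leave = 1.0 - p_stay
        if s + 1 < n_superstates:
            a[last, sol] = leave * p_same_super                  # next line, same superstate
            a[last, sol + block] = leave * (1.0 - p_same_super)  # next line, next superstate
        else:
            a[last, sol] = leave   # final superstate can only be repeated
    return a
```

Because the result is an ordinary transition matrix, the standard Baum-Welch and Viterbi routines can be applied without modification.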

The natural extension of the two-dimensional case leads
to a structure as shown in Fig. 3, which shows a pseudo
3-D HMM. Each superstate now consists of a P2DHMM.
We implemented the structure in Fig. 3 by applying the
technique suggested by Samaria twice, i.e. by additionally
inserting special start-of-image states and features. Due
to this implementation technique, the P3DHMM shown in
Fig. 3 can be trained from data, by applying standard HMM
techniques.
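One possible way to prepare the observation sequence for such a model is to flatten the block features of an image sequence into a single sequence and to insert a marker vector at the start of every image and of every line. The marker encoding used here, an extra leading component that is zero for ordinary features, is an assumption that merely guarantees that markers differ from all ordinary features; the marker values and function names are hypothetical.

```python
import numpy as np

START_OF_LINE = 1.0    # hypothetical marker value
START_OF_IMAGE = 2.0   # hypothetical marker value


def marker_vector(flag, dim):
    """Marker observation: flag in the first component, zeros elsewhere."""
    vector = np.zeros(dim + 1)
    vector[0] = flag
    return vector


def serialize_sequence(block_features):
    """Flatten features of shape (n_images, n_lines, n_blocks, dim).

    Returns an array of shape (T, dim + 1) with
    T = n_images * (1 + n_lines * (1 + n_blocks)).
    """
    feats = np.asarray(block_features, dtype=float)
    n_images, n_lines, n_blocks, dim = feats.shape
    observations = []
    for i in range(n_images):
        observations.append(marker_vector(START_OF_IMAGE, dim))
        for line in range(n_lines):
            observations.append(marker_vector(START_OF_LINE, dim))
            for block in range(n_blocks):
                # Ordinary features carry a zero in the marker component.
                observations.append(np.concatenate(([0.0], feats[i, line, block])))
    return np.stack(observations)
```

The resulting sequence can be fed to a one-dimensional HMM whose marker states are trained to emit the marker vectors with high probability.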
The feature extraction used throughout this paper is
based on the discrete cosine transform (DCT). Each image
of a sequence is scanned with a sampling window top to
bottom and left to right. The pixels $f(x, y)$ in the sampling
window of the size $N \times N$ are transformed using the DCT
according to the equation:

$$C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N} \qquad (5)$$

with $\alpha(0) = \sqrt{1/N}$ and $\alpha(u) = \sqrt{2/N}$ for
$u > 0$. A triangle shaped mask extracts the first 15
coefficients ($u + v \le 4$), which are arranged in a vector.
These DCT coefficients are calculated on the individual images
(static feature component) of a sequence as well as the
difference images (dynamic feature component). Due to the
utilization of the HMM framework, both features can be
integrated by using feature streams and by assigning stream
weights in order to control the influence of the individual
streams (see also Eq. 1).
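The block-wise DCT feature extraction can be illustrated with the following sketch. The window size of 8 pixels, the step of 4 pixels, the orthonormal DCT-II normalization and the signed difference image are assumptions, since the corresponding values are not recoverable from the text; the triangular mask is taken as all coefficients with u + v <= 4, which yields exactly 15 values. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis: d[u, x] = alpha(u) * cos((2x + 1) * u * pi / (2n))."""
    x = np.arange(n)
    u = x[:, None]
    d = np.cos((2 * x[None, :] + 1) * u * np.pi / (2 * n)) * np.sqrt(2.0 / n)
    d[0] *= np.sqrt(0.5)   # alpha(0) = sqrt(1 / n)
    return d


def triangle_indices(max_order=4):
    """Index pairs (u, v) with u + v <= max_order; 15 pairs for max_order = 4."""
    return [(u, v) for u in range(max_order + 1) for v in range(max_order + 1 - u)]


def dct_block_features(image, window=8, step=4):
    """Scan the image top to bottom, then left to right, and keep 15 DCT coefficients."""
    img = np.asarray(image, dtype=float)
    d = dct_matrix(window)
    us, vs = (list(axis) for axis in zip(*triangle_indices()))
    height, width = img.shape
    columns = []
    for x0 in range(0, width - window + 1, step):
        column = []
        for y0 in range(0, height - window + 1, step):
            coeffs = d @ img[y0:y0 + window, x0:x0 + window] @ d.T
            column.append(coeffs[us, vs])
        columns.append(column)
    return np.array(columns)   # shape (n_columns, n_rows, 15)


def static_and_dynamic(frame, previous_frame):
    """Static features from the frame, dynamic features from the difference image."""
    diff = np.asarray(frame, dtype=float) - np.asarray(previous_frame, dtype=float)
    return dct_block_features(frame), dct_block_features(diff)
```

The two returned arrays correspond to the two feature streams, so that stream weights can balance static against dynamic information.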
3. Experiments and Results
In order to obtain a detailed evaluation of the P3DHMM
approach, experiments on a crane signal database consisting
of 12 classes have been performed. Crane signals are
a well-defined set of gestures which allow a crane to be
maneuvered in the presence of obstacles or in problematic
environments (see also [11]). Fig. 4 shows the 12 classes slew
left (right), travel to (from) me, extend (retract) jib, jib up
(down), hoist, lower, stop and emergency stop, where the
latter two classes represent two examples of static gestures
with hardly any movement involved. Five individuals
performed each of the 12 gestures several times; two
repetitions of each gesture formed the training set, whereas
the remaining repetitions were used for testing. Fig. 5
illustrates the two classes jib up and jib down in the upper
and lower row, respectively, taken from the stm set.
Table 1 shows the recognition accuracies achieved in the
experiments and also presents results on the crane signal task
using one-dimensional HMMs and geometric moments as
described in [3]. In the experiments, four superstates with
P2DHMMs per superstate have been used as the
configuration of the P3DHMMs. Note that the P3DHMM
approach shows a slightly higher recognition accuracy
compared to the one-dimensional case. However, there are two
more important reasons for using P3DHMMs: One is the
fact that static and dynamic gestures can now be mixed and
handled with the same recognition paradigm. The
other is that, due to the warping capabilities
of the P3DHMM, an elastic matching can be performed on
the individual images, which results in a position and size
invariant gesture recognition mode.

[Figure 4: slew left (right), travel to (from) me, extend (retract) jib, jib up (down), hoist, lower, stop, emergency stop]
4. Summary
Image sequence recognition based on novel pseudo
three-dimensional Hidden Markov Models has been
presented. The modeling technique allows the integration of
spatial and temporal derived features in an elegant way and
is also capable of recognizing static gestures where hardly
any body movement is involved. Compared to an approach
based on one-dimensional HMMs and geometric moments,
the P3DHMMs showed a slightly better recognition
accuracy on a 12 class crane signal task. Due to the warping
capabilities of the P3DHMMs, the proposed approach leads
to a position independent recognition mode. However, this
has not been fully evaluated yet and the present publication
shows mainly the feasibility of this modeling approach.

Table 1: Recognition accuracies
Set        1D HMM    P3DHMM
ste        100%      88.6%
stm        85.3%     91.2%
ank        100%      100%
bw         88.2%     94.1%
jmr        80.5%     80.5%
average    90.74%    90.88%
References
[1] J. Yamato, J. Ohya, and K. Ishii, “Recognizing Hu-
man Action in Time-Sequential Images Using Hidden
Markov Model”, In Proc. IEEE Int. Conference on
Computer Vision and Pattern Recognition, 1992, pp.
379–385.
[2] M. Schuster and G. Rigoll, “Fast Online Video Image
Sequence Recognition with Statistical Methods”,
In Proc. IEEE Int. Conference on Acoustics, Speech
and Signal Processing, Atlanta, 1996, pp. 3450–3453.
[3] G. Rigoll and A. Kosmala, “New Improved Feature
Extraction Methods for Real-Time High Performance
Image Sequence Recognition”, In Proc. IEEE Int.
Conference on Acoustics, Speech, and Signal Processing,
Munich, 1997, pp. 3373–3376.
[4] T. Starner, J. Weaver, and A. Pentland, “Real-Time
American Sign Language Recognition Using Desk
and Wearable Computer Based Video”, IEEE Trans.
on Pattern Analysis and Machine Intelligence,
Vol. 20, No. 12, Dec. 1998, pp. 1371–1375.
[5] L. R. Rabiner, “A Tutorial on Hidden Markov Mod-
els and Selected Applications in Speech Recognition”,
Proc. of the IEEE, Vol. 77, No. 2, Feb. 1989, pp. 257–
285.
[6] K. S. Nathan, J. R. Bellegarda, D. Nahamoo, and
E. J. Bellegarda, “On-line Handwriting Recognition
Using Continuous Parameter Hidden Markov Models”,
In Proc. IEEE Intern. Conference on Acoustics,
Speech, and Signal Processing, Minneapolis, 1993,
Vol. 5, pp. 121–124.
[7] V. N. Gupta, M. Lennig, and P. Mermelstein,
“Integration of Acoustic Information in a Large Vocabulary
Word Recognizer”, In Proc. IEEE Intern. Conference
on Acoustics, Speech, and Signal Processing, Dallas,
1987, pp. 697–700.
[8] S. Kuo and O. Agazzi, “Keyword Spotting in Poorly
Printed Documents Using Pseudo 2-D Hidden Markov
Models”, IEEE Trans. on Pattern Analysis and
Machine Intelligence, Vol. 16, No. 8, 1994, pp. 842–
848.
[9] S. Eickeler, S. Müller, and G. Rigoll, “High Quality
Face Recognition in JPEG Compressed Images”, In
Proc. IEEE Intern. Conference on Image Processing,
Kobe, 1999.
[10] F.S. Samaria, “Face Recognition Using Hidden
Markov Models”, Ph. D. Thesis, Cambridge Univer-
sity, 1994.
[11] A. Parrish, “Mechanical Engineer’s Reference
Book”, Butterworth, London, 1980.
