ON MERGING HIDDEN MARKOV MODELS WITH DEFORMABLE
TEMPLATES
Ram R. Rao
and
Russell M. Mersereau
School of Electrical and Computer Engineering
Georgia Institute of Technology Atlanta, Georgia 30332

ABSTRACT
Hidden Markov modeling has proven extremely useful
for statistical analysis of speech signals. There are,
however, inherent problems in two dimensional exten-
sions to HMM’s, one of which is the exponential com-
plexity associated with fully 2-D HMM’s. In this paper,
we propose a new 2-D HMM-like structure obtained by
embedding states within regions of a deformable template
structure. With this state-embedded deformable template
(SEDT), each region of a deformable template has an
underlying observation probability distribution. This
structure allows for computation of
$P[\text{image} \mid \text{template}]$. The template that maximizes this
probability provides an optimal segmentation of the
image. This segmentation capability will be demon-
strated in facial analysis applications.
1. INTRODUCTION
Facial analysis is a difficult problem which has many
potential applications. Robust facial analysis systems
are an integral part of any model-based coding, facial
recognition [1], or visual speech recognition system [2].
Many researchers are attempting to provide a
standard framework for tackling these image analysis

tasks. Two of the more interesting analysis approaches
are deformable templates and hidden Markov model-
ing. Both of these approaches have advantages and
shortcomings.
Deformable templates [3] have been used to model
the eyes, lips, and face for applications such as visual
speech recognition and face recognition. These templates
have certain structural characteristics, such as
associating the head with an ellipse, or the lips with
four parabolas. They also have energy functions, which
are often the sum of an image-related energy term and
an internal energy term. The image-related term is
usually a function of the edge, peak, and valley fields
derived from the image. The internal energy is often
heuristically designed to keep template parameters
within acceptable ranges. Minimization of the energy
function yields the template which best matches the
image. The main problem with deformable templates is
that the energy functions are experimentally designed,
and they do not statistically segment the image.

This work is supported by the U.S. Army Research Office,
Contract DAAL03-92-G-0068.
There is strong motivation for statistically modeling
the pixel values which occur in an image. Since there
is a difference between “skin” colors and background
colors in head and shoulder images [4], one would like
to model these distributions and use this information
to segment the image.
Hidden Markov models [5] provide a strong sta-
tistical framework for analyzing one-dimensional ran-

dom processes. The key concept behind HMM’s is
a set of states which have probabilistic output
distributions. Two-dimensional HMM’s are less tractable:
fully two-dimensional HMM’s have been shown to have
exponential complexity [6]. One practical solution has
been to use pseudo-2D HMM’s [7]. Essentially,
one-dimensional HMM’s operate on the rows of the image,
and these HMM’s are nested in another HMM. Pseudo-2D
HMM’s, however, cannot incorporate any shape constraints,
since each row is analyzed independently.
2. STATE-EMBEDDED DEFORMABLE
TEMPLATES
Since it seems that deformable templates provide a
good framework for structurally analyzing an image,
and HMM’s provide a good framework for statistically
analyzing an image, it makes sense to capitalize on
the benefits of both. Our solution entails associating a
state with each region of a deformable template. These
states have observation probability density functions
which reflect the probability of observing a particular
pixel value while in the state. For example, the head
can be modeled by an ellipse with a foreground state
and a background state (Figure 1(a)). This makes
intuitive sense, since the face normally has different
statistical characteristics than the background,
especially when using color.

Figure 1: (a) SEDT used for facial extraction,
$\lambda = (x_1, x_2, y_1, y_2)$. (b) SEDT used for lip
tracking, $\lambda = (x_1, x_2, y_1, y_2, y_3)$.

Proceedings of the 1995 International Conference on Image Processing (ICIP '95)
0-8186-7310-9/95 $10.00 © 1995 IEEE
Our SEDT’s are specified as follows:
• The variable $\lambda = (\lambda_1, \ldots, \lambda_K)$ parameterizes a
  deformable template structure. For example, if the
  template were a rectangle, then $K = 4$, and $\lambda$ could
  be the $x$ and $y$ coordinates of the upper-left and
  lower-right corners of the rectangle.
• The template divides the image into $N$ regions
  $R_0, \ldots, R_{N-1}$. In the case of $N = 2$, we have an
  image divided into foreground and background regions.
  Each region has an associated observation probability
  density function, $b_j(O)$, where $O$ is a (possibly
  multidimensional) pixel value. $b_j(O)$ can be any
  parameterized pdf, such as a Gaussian or Gaussian
  mixture.
From this, it follows that:

$$P[I \mid \lambda] = \prod_{j=0}^{N-1} \; \prod_{(x,y) \in R_j} b_j\big(I(x, y)\big) \qquad (1)$$

where $I$ is the image, and $I(x, y)$ is the (possibly
multidimensional) pixel value at location $(x, y)$.
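As a rough illustration of the likelihood in (1), the following sketch evaluates $-\log P[I \mid \lambda]$ for the two-region case with a rectangular template and scalar Gaussian pdfs. The function names, the parameter ordering `(x1, y1, x2, y2)`, and the synthetic test image are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gaussian_logpdf(values, mean, var):
    """Elementwise log of a scalar Gaussian density."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (values - mean) ** 2 / var)

def rectangle_mask(shape, lam):
    """Foreground mask for lam = (x1, y1, x2, y2), corners inclusive (assumed layout)."""
    x1, y1, x2, y2 = lam
    mask = np.zeros(shape, dtype=bool)
    mask[y1:y2 + 1, x1:x2 + 1] = True
    return mask

def neg_log_likelihood(image, lam, fg, bg):
    """-log P[I | lam] for N = 2 regions; fg and bg are (mean, variance) pairs."""
    mask = rectangle_mask(image.shape, lam)
    log_fg = gaussian_logpdf(image[mask], *fg)    # pixels inside the template
    log_bg = gaussian_logpdf(image[~mask], *bg)   # pixels outside the template
    return -(log_fg.sum() + log_bg.sum())

# Synthetic image: bright rectangle (rows 10..29, columns 12..27) on a dark background.
rng = np.random.default_rng(0)
image = rng.normal(50.0, 5.0, size=(40, 40))
image[10:30, 12:28] = rng.normal(200.0, 5.0, size=(20, 16))

fg_pdf, bg_pdf = (200.0, 25.0), (50.0, 25.0)
true_energy = neg_log_likelihood(image, (12, 10, 27, 29), fg_pdf, bg_pdf)
shifted_energy = neg_log_likelihood(image, (15, 10, 30, 29), fg_pdf, bg_pdf)
```

The template aligned with the rectangle gives the lower energy, because the shifted template assigns background pixels to the foreground pdf and vice versa.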

Maximizing $P[I \mid \lambda]$ over $\lambda$ yields the optimal
template. Equivalently, we can minimize
$-\log P[I \mid \lambda]$. Looking at SEDT’s from a deformable
template perspective, we can think of
$-\log P[I \mid \lambda]$ as our energy function.
Alternatively, looking at our solution from an HMM
perspective, we can think of the optimal template as
being analogous to the optimal state sequence
partitioning. The analog of Viterbi training would be to
partition the data using the optimal templates,
reestimate the output probability distribution functions
given the partitioned data, and repeat until convergence.

Figure 2: (a) Original image with the initial and final
position of the template (foreground: $\mu = 200$,
$\sigma^2 = 100$; background: $\mu = 200$, $\sigma^2 = 10$);
(b) points for which
$P[\text{pixel} \in \text{foreground}] > P[\text{pixel} \in \text{background}]$.
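To make the energy-function view concrete, the block below writes out the energy for the special case of scalar Gaussian region pdfs, together with the reestimation step of the Viterbi-style training loop. The Gaussian form and the maximum-likelihood updates are assumptions chosen for illustration; the text above allows any parameterized pdf.

```latex
% Energy function for a template \lambda with regions R_0(\lambda), ..., R_{N-1}(\lambda)
E(\lambda) = -\log P[I \mid \lambda]
           = -\sum_{j=0}^{N-1} \sum_{(x,y) \in R_j(\lambda)} \log b_j\big(I(x,y)\big)

% Special case (assumed here): b_j is a scalar Gaussian with mean \mu_j and variance \sigma_j^2
E(\lambda) = \sum_{j=0}^{N-1} \sum_{(x,y) \in R_j(\lambda)}
             \left[ \frac{\big(I(x,y) - \mu_j\big)^2}{2\sigma_j^2}
                    + \tfrac{1}{2}\log\big(2\pi\sigma_j^2\big) \right]

% Reestimation given the partition induced by the current optimal template
\mu_j \leftarrow \frac{1}{|R_j|} \sum_{(x,y) \in R_j} I(x,y),
\qquad
\sigma_j^2 \leftarrow \frac{1}{|R_j|} \sum_{(x,y) \in R_j} \big(I(x,y) - \mu_j\big)^2
```

Each pixel contributes a squared-error term scaled by its region's variance plus a constant that depends only on the region, so moving a boundary changes the energy only through the pixels that switch regions.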
3. SYNTHETIC EXAMPLE
The first test of our template was to find an arbitrarily
sized rectangle within an image. The rectangle had
pixels with intensity specified by a Gaussian with mean
and variance $\mu_f$ and $\sigma_f^2$, respectively.
Likewise, the background had intensity specified by a
Gaussian with $\mu_b$ and $\sigma_b^2$. Our template was a
rectangle specified by the coordinates of its upper-left
and lower-right corners.
Starting with an initial template, estimates of the
foreground and background pdf’s were made. A steepest
descent minimization algorithm was then used to
minimize $-\log P[I \mid \lambda]$ over $\lambda$. This new
template was then used to reestimate the foreground and
background pdf’s, and the process was repeated until
convergence.
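The alternating procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a greedy search that moves one corner coordinate by one pixel at a time stands in for the steepest descent minimizer, the pdfs are scalar Gaussians, and all function names and the synthetic image are hypothetical.

```python
import numpy as np

def rect_mask(shape, lam):
    """Foreground mask for lam = (x1, y1, x2, y2), corners inclusive."""
    x1, y1, x2, y2 = lam
    mask = np.zeros(shape, dtype=bool)
    mask[y1:y2 + 1, x1:x2 + 1] = True
    return mask

def region_stats(pixels):
    """Sample mean and variance, with a floor that keeps the variance positive."""
    return float(pixels.mean()), max(float(pixels.var()), 1e-3)

def energy(image, lam, fg, bg):
    """-log P[I | lam] with one scalar Gaussian (mean, variance) per region."""
    mask = rect_mask(image.shape, lam)
    total = 0.0
    for pixels, (mean, var) in ((image[mask], fg), (image[~mask], bg)):
        total += 0.5 * np.sum(np.log(2.0 * np.pi * var) + (pixels - mean) ** 2 / var)
    return float(total)

def is_valid(lam, shape):
    """Keep a one-pixel border so the background region is never empty."""
    x1, y1, x2, y2 = lam
    return 1 <= x1 < x2 < shape[1] - 1 and 1 <= y1 < y2 < shape[0] - 1

def fit_template(image, lam, fg, bg):
    """Greedy +/-1 search over the four parameters (a stand-in for steepest descent)."""
    best = energy(image, lam, fg, bg)
    improved = True
    while improved:
        improved = False
        for k in range(4):
            for step in (-1, 1):
                cand = lam[:k] + (lam[k] + step,) + lam[k + 1:]
                if is_valid(cand, image.shape):
                    e = energy(image, cand, fg, bg)
                    if e < best:
                        best, lam, improved = e, cand, True
    return lam

def train(image, lam, max_iter=20):
    """Alternate pdf reestimation and template fitting until lam stops changing."""
    for _ in range(max_iter):
        mask = rect_mask(image.shape, lam)
        fg, bg = region_stats(image[mask]), region_stats(image[~mask])
        new_lam = fit_template(image, lam, fg, bg)
        if new_lam == lam:
            break
        lam = new_lam
    return lam, fg, bg

# Synthetic test: bright rectangle (rows 10..29, columns 12..27) on a dark background.
rng = np.random.default_rng(0)
image = rng.normal(50.0, 5.0, size=(40, 40))
image[10:30, 12:28] = rng.normal(200.0, 5.0, size=(20, 16))
lam, fg, bg = train(image, (2, 2, 37, 37))  # start from a template covering most of the image
```

In this well-separated case the initial foreground estimate is a broad Gaussian that still explains the bright pixels better than the tight background pdf does, so the template shrinks onto the rectangle and the second reestimation pass leaves it unchanged.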

It was seen that this process is sensitive to the initial
placement of the template. Good results were obtained
when the initial template completely covered the
unknown rectangle, or when it was contained within the
unknown rectangle. These template choices work well
because either the foreground pdf or the background
pdf is reliably estimated initially. Since the position
of the rectangle was unknown, our system was always
started with a rectangle that covered a majority of the
input image (Figure 2).
There is a problem with this procedure. Consider
the case where the foreground has a lower variance
than the background, and they both have equal means.
The choice of a large initial template would likely
contain pixels from both the foreground and background.
Thus, the estimate of the variance of the foreground pdf
would approach the variance of the background pdf.
When there is a large overlap between the two pdf’s,
the system will not work well. This can be remedied
by altering the reestimation procedure to ensure that
there is adequate separation between the two pdf’s.

Figure 3: Initialization procedure. (a) Region used
to estimate facial distribution for “Chris”; (b) result
of applying this distribution to “Haluk” and applying
threshold; (c) probability of pixel being part of face
for “Haluk” using distribution derived from (a) (dark
region = high probability); (d) “Haluk” image, with
initial template position.
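The variance problem described above can be made explicit with a short calculation. The mixing fraction $\alpha$ is a symbol introduced here for illustration; it denotes the share of pixels under the initial template that truly belong to the foreground.

```latex
% Pixels under the initial template: a fraction \alpha from the foreground
% (mean \mu, variance \sigma_f^2) and a fraction 1 - \alpha from the background
% (same mean \mu, variance \sigma_b^2). With equal means, the variance of the
% pooled sample is the weighted average of the two variances:
\hat{\sigma}^2_{\text{fg}} \approx \alpha\,\sigma_f^2 + (1 - \alpha)\,\sigma_b^2

% For a large template, \alpha is small, so
\hat{\sigma}^2_{\text{fg}} \to \sigma_b^2 \quad \text{as } \alpha \to 0
```

The initial foreground pdf is then nearly identical to the background pdf, and the energy function carries little information about where the boundary should lie.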
4. FACIAL EXTRACTION
One of our main objectives was to find a robust pro-
cedure for extracting the boundary of a person’s head
in a full-color head and shoulders video sequence. The
head was modeled as an ellipse with no rotation, and
the foreground and background pdf’s were modeled as
Gaussian mixtures. Each mixture contained two Gaus-
sians with full covariance matrices.
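A minimal sketch of the two ingredients named above, an unrotated ellipse template and a Gaussian mixture with full covariance matrices, is given below. It assumes that the four template parameters are the bounding box of the ellipse, ordered `(x1, x2, y1, y2)`, and that the mixture parameters are supplied by the caller; both the function names and these conventions are illustrative only.

```python
import numpy as np

def ellipse_mask(shape, lam):
    """Axis-aligned ellipse inscribed in the box lam = (x1, x2, y1, y2) (assumed layout)."""
    x1, x2, y1, y2 = lam
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # ellipse center
    ax, ay = (x2 - x1) / 2.0, (y2 - y1) / 2.0      # semi-axes
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return ((xs - cx) / ax) ** 2 + ((ys - cy) / ay) ** 2 <= 1.0

def gmm_logpdf(pixels, weights, means, covs):
    """Log density of a Gaussian mixture with full covariances; pixels has shape (n, d)."""
    d = pixels.shape[1]
    components = []
    for w, mu, cov in zip(weights, means, covs):
        diff = pixels - mu
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        components.append(np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))
    # Sum the component densities in the log domain for numerical stability.
    return np.logaddexp.reduce(np.stack(components), axis=0)

def energy(image, lam, fg, bg):
    """-log P[I | lam] for a color image of shape (H, W, d); fg and bg are mixture tuples."""
    mask = ellipse_mask(image.shape[:2], lam)
    return -(gmm_logpdf(image[mask], *fg).sum() + gmm_logpdf(image[~mask], *bg).sum())

# Sanity check: two identical unit-covariance components behave like a single Gaussian.
weights = [0.5, 0.5]
means = [np.zeros(2), np.zeros(2)]
covs = [np.eye(2), np.eye(2)]
value_at_origin = gmm_logpdf(np.zeros((1, 2)), weights, means, covs)[0]
mask = ellipse_mask((21, 21), (5, 15, 8, 12))
```

Here `fg` and `bg` are each a `(weights, means, covs)` tuple with two components, matching the two-Gaussian mixtures described in the text.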
In the development of our system, a number of facts
became clear. First, if the foreground and background
pdf’s are available, minimizing the energy function,
$-\log P[I \mid \lambda]$, would successfully segment the
face from the background. However, since these
distributions are unknown in the initial frame, they
must somehow be estimated. Second, if a point on the
person’s face could be located, a region around this
point could be used to estimate the foreground pdf.
Assuming everything outside this region was background,
we could also estimate a background pdf. The facial
border could then be found by iterating between
minimizing the energy function and reestimating
foreground and background distributions.

Figure 4: Facial extraction. (a) Original image with
initial and final placement of template; (b) pixels for
which $P_{\text{head}} > P_{\text{background}}$; (c) and (d)
probability of head and background, respectively
(dark = high probability).
One important task was to develop a subsystem
which could locate a point on a person’s face. This
could be done by first developing a general “face” pdf.
Ideally, one would like to collect a large database of
faces under varying lighting conditions to estimate a
general “face” pdf, but we didn’t have such a large
database. We chose to use the facial distribution of one
person as an approximation of the facial distribution
for a different person. A point in the face was found
by applying this pdf to the input image. A threshold
was applied to the new image to find all points which
had probability within a certain range of the pixel with
maximum probability. The median x and y values of
these pixels would be located in the person’s face. The
median operation works much better than averaging,
and also works better than attempting to find an n by
n square of pixels whose joint probability is greatest. It
also seems to implicitly use the fact that for the most
part, the face of interest is near the center of the image.
This procedure is shown in Figure 3.
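The thresholding and median steps above can be sketched as follows. The text does not say how the "certain range" of the maximum is defined, so the multiplicative `ratio` parameter is an assumption, as are the function name and the toy probability map.

```python
import numpy as np

def locate_face_point(prob_map, ratio=0.5):
    """Median (x, y) of the pixels whose probability is at least ratio * max.

    prob_map holds the general "face" pdf evaluated at every pixel.
    The multiplicative threshold is an assumed reading of "within a
    certain range of the pixel with maximum probability".
    """
    threshold = ratio * prob_map.max()
    ys, xs = np.nonzero(prob_map >= threshold)   # row and column indices
    return float(np.median(xs)), float(np.median(ys))

# Toy map: a face-like block (rows 20..30, columns 15..25) plus two stray
# high-probability pixels in a corner of the image.
prob_map = np.zeros((50, 50))
prob_map[20:31, 15:26] = 1.0
prob_map[0, 49] = 1.0
prob_map[1, 48] = 1.0

face_x, face_y = locate_face_point(prob_map)

# For comparison: averaging is pulled toward the stray pixels.
ys, xs = np.nonzero(prob_map >= 0.5)
mean_x, mean_y = xs.mean(), ys.mean()
```

In this toy case the median lands on the center of the block, while the mean is shifted toward the stray pixels, which is in line with the observation that the median works better than averaging.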
Figure 4 shows the convergence of the template to
the final head border. Image-specific distributions for
the foreground and background are estimated using the
initial template. A steepest descent minimization
algorithm is then used to minimize
$-\log P[I \mid \lambda]$. This process is repeated until
convergence. Comparing Figure 4(c) and Figure 3(c)
shows the difference between using a general facial
distribution and one matched to the actual image.
Notice how the facial region is much darker in
Figure 4, indicating a higher probability.

Figure 5: Results of lip tracking algorithm (top); pixels
for which $P_{\text{lips}} > P_{\text{background}}$ (bottom).
5. LIP TRACKING
Another goal of our research is to develop a robust lip
analysis system. As a first step, we wanted to test
the ability of SEDT’s to track the border of the lips
through a video sequence. Our template is shown in
Figure 1(b). The template has two parabolas which
are embedded in a rectangle. There are a total of five
parameters - four for the rectangle, and one to specify
the vertical position of the intersection of the parabo-
las.
Our test consisted of manually placing the tem-
plate in frame 1 of the video sequence, and estimating
the foreground and background distributions. These
distributions were applied to successive frames, and
a minimization algorithm was run to find the opti-
mal template. As shown in Figure 5, the results are
very promising. Likewise, the inner contour of the lips
can be tracked by estimating the distribution of the
mouth opening, and considering the lips themselves to
be background.
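One possible geometry for the five-parameter lip template is sketched below. The text does not specify how the parabolas are placed inside the rectangle, so this construction is hypothetical: the upper parabola has its vertex on the top edge, the lower parabola has its vertex on the bottom edge, and the two meet on the left and right edges at height `y3`. Image coordinates are assumed, with y increasing downward.

```python
import numpy as np

def lip_mask(shape, lam):
    """Foreground (lip) mask for lam = (x1, x2, y1, y2, y3), with y1 < y3 < y2.

    Hypothetical layout: x1, x2 are the left and right edges of the
    rectangle, y1 its top, y2 its bottom, and y3 the height at which
    the two parabolas intersect at the mouth corners.
    """
    x1, x2, y1, y2, y3 = lam
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cx = (x1 + x2) / 2.0
    half_width = (x2 - x1) / 2.0
    t = ((xs - cx) / half_width) ** 2    # 0 at the center, 1 at the mouth corners
    upper = y1 + (y3 - y1) * t           # upper parabola, vertex at y1
    lower = y2 - (y2 - y3) * t           # lower parabola, vertex at y2
    inside_box = (xs >= x1) & (xs <= x2)
    return inside_box & (ys >= upper) & (ys <= lower)

mask = lip_mask((30, 40), (10, 30, 5, 25, 15))
```

Pixels inside the mask would be scored with the foreground pdf and all remaining pixels with the background pdf, exactly as in the two-region energy used for the other templates.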
6. CONCLUSION
In this paper, we have presented an extension to
deformable templates which allows for statistical
segmentation of images. The system performed well on
many foreground/background segmentation tasks, including
facial extraction and lip tracking. Our method
capitalizes on the statistical segmentation properties
of HMM’s and incorporates the shape coherence properties
of deformable templates. Work remains in finding
automatic methods for initializing the templates,
particularly for the lip tracking algorithm. It is also
necessary to assess which color spaces and parameter
sets work best and which ones are most invariant to
varying lighting conditions and differing speakers.
7. REFERENCES
[1] R. Chellappa, C. Wilson, and S. Sirohey, “Human
and machine recognition of faces: A survey,”
Proceedings of the IEEE, vol. 83, pp. 705-740, May 1995.
[2] M. Hennecke, K. Prasad, and D. Stork, “Using
deformable templates to infer visual speech dynamics,”
in Proceedings of the 28th Annual Asilomar Conference
on Signals, Systems, and Computers, (Pacific Grove, CA),
November 1994.
[3] A. Yuille, P. Hallinan, and D. Cohen, “Feature
extraction from faces using deformable templates,”
International Journal of Computer Vision, vol. 8,
no. 2, pp. 99-111, 1992.
[4] H. M. Hunke, “Locating and tracking of human
faces with neural networks,” Tech. Rep. CMU-CS-94-155,
Carnegie Mellon University, August 1994.
[5] L. Rabiner and B. Juang, Fundamentals of Speech
Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[6] E. Levin and R. Pieraccini, “Dynamic planar warping
for optical character recognition,” in Proc. Int. Conf.
Acoust., Speech, Signal Processing, pp. III-149 -
III-152, 1992.
[7] O. Agazzi and S. Kuo, “Hidden Markov model
based optical character recognition in the presence
of deterministic transformations,” Pattern Recognition,
vol. 26, no. 12, pp. 1813-26, 1993.
