
de Haan, G. “Video Scanning Format Conversion and Motion Estimation”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
54
Video Scanning Format Conversion and Motion Estimation

Gerard de Haan
Philips Research Laboratories

54.1 Introduction
54.2 Conversion vs. Standardization
54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals
    Temporal Interpolation • Vertical Interpolation and Interlaced Scanning
54.4 Alternatives for Sampling Rate Conversion Theory
    Simple Algorithms • Advanced Algorithms
54.5 Motion Estimation
    Pel-Recursive Estimators • Block-Matching Algorithm • Search Strategies
54.6 Motion Estimation and Scanning Format Conversion
    Hierarchical Motion Estimation • Recursive Search Block-Matching
References
54.1 Introduction
The scanning format of a video signal is a major determinant of general picture quality. Specifically, it determines such aspects as stationary and dynamic resolution, motion portrayal, aliasing, scanning structure visibility, and flicker. Various formats have been designed and standardized to strike a particular balance between quality, cost, transmission capacity, and compatibility with other standards.
The field of video scanning format conversion is concerned with the translation of video signals
from one format into another. It consists of two basic parts: temporal interpolation and spatial
interpolation. A particular case is de-interlacing, which poses an inseparable spatio-temporal inter-
polation problem.
Vertical and temporal interpolation cause practical and fundamental difficulties in achieving high-quality scanning format conversion. This is because the conditions of the sampling theorem are generally not met in video signals. If they were satisfied, standard conversions of arbitrary accuracy would be possible using suitable linear filters.
The earlier conversion methods neglected the fundamental problems and, consequently, negatively influenced the resolution and the motion portrayal. More recent algorithms apply motion vectors to predict the position of moving objects at unregistered temporal instances to improve the quality of the picture at the output format. A so-called motion estimator extracts these vectors from the input
signal. The motion vectors partly solve the fundamental problems, but the demands on the motion estimator for scanning format conversion are severe.
In this section we shall first briefly indicate why we can expect the importance of scanning format conversion to grow. Then we discuss in more detail the fundamental problems of temporal interpolation of video signals. Next we provide a concise overview of the basic methods in scanning format conversion, focused on temporal sampling rate conversion and de-interlacing. Finally, we give an overview of motion estimation algorithms, which are crucial in the more advanced scanning format converters.
54.2 Conversion vs. Standardization
Scanning formats have been designed in the past to strike a particular compromise between quality, cost, transmission capacity, and compatibility with other standards. There were three main formats in use a decade ago: 50 Hz interlaced, 60 Hz interlaced, and 24 (or 25) Hz progressive (film). With the arrival of video-conferencing, HDTV, workstations, and PCs, many new video formats have appeared. These include low-end formats such as CIF and QCIF with smaller picture size and lower frame rates, progressive and interlaced HDTV formats at 50 Hz and 60 Hz, and other video formats used on computer workstations and enhanced television displays with field rates up to 100 Hz. It will be clear that the problem of scanning format conversion is of growing importance, despite many attempts to globally standardize video formats.
54.3 Problems with Linear Sampling Rate Conversion Applied to
Video Signals
High-quality scanning format conversion is difficult to achieve, as the conditions of the sampling theorem are generally not met in video signals. The solution of Sample Rate Conversion (SRC) for systems satisfying the conditions of the sampling theorem is well known for arbitrary sampling ratios [1].
Figure 54.1 illustrates the procedure for a ratio of 2. To arrive at the double output sampling rate, in a first step, zero-valued samples are inserted between every pair of input samples. In a second step, a low-pass filter (LPF) at the output rate is applied to remove the first repeat spectrum from the input data. In case of a temporal SRC, the interpolating LPF has to be a temporal LPF, i.e., a filter including picture delays. Though feasible, this makes it a fairly expensive filter.
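The two steps above (zero insertion, then low-pass filtering at the output rate) can be sketched as follows. This is a generic NumPy illustration, not code from the chapter; the half-band windowed-sinc design and the filter length of 31 taps are illustrative choices.

```python
import numpy as np

def upsample_x2(signal, num_taps=31):
    """Factor-2 sample rate conversion: zero insertion + low-pass filtering.

    Step 1: insert a zero-valued sample between every pair of input samples.
    Step 2: apply an LPF at the output rate (cutoff at half the new Nyquist
    frequency) to remove the first repeat spectrum.
    """
    # Step 1: zero insertion doubles the sampling rate.
    up = np.zeros(2 * len(signal))
    up[::2] = signal

    # Step 2: windowed-sinc half-band LPF, cutoff 0.25 of the output rate.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    # Gain of 2 compensates the energy lost by zero insertion.
    return np.convolve(up, 2 * h, mode="same")

t = np.arange(64)
x = np.sin(2 * np.pi * 0.05 * t)          # slow sine, well below Nyquist
y = upsample_x2(x)
# The even output samples reproduce the input (half-band filter property);
# compare a stretch away from the borders to avoid edge effects.
err = np.max(np.abs(y[20:108:2] - x[10:54]))
```

A half-band filter has zero-valued taps at even offsets from the center, so the original samples pass through unchanged while the inserted zeros are replaced by interpolated values.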
A more complicated, though still not fundamental, problem occurs at the signal acquisition stage. Since scenes do occur with almost unlimited spatial and/or temporal bandwidth, the sampling theorem requires that this signal be low-pass filtered prior to the scanning process. Interlaced scanning, as commonly applied, even demands two-dimensional prefiltering in the vertical-temporal frequency plane. In a video system, it is the camera that samples the scene in a vertical and temporal sense; therefore, the prefilter has to be realized in the optical path. Although there are considerable practical problems achieving this filtering, it would apparently bring down the problem of temporal interpolation of video images to the common sampling rate conversion problem. The next section will show, however, that in addition to the practical problems there is a fundamental problem as well.
54.3.1 Temporal Interpolation
Considering the eye's sine-wave temporal frequency response for full brightness potential and full field display [2], as shown in Fig. 54.2, temporal prefiltering with a bandwidth of 75 Hz at first sight seems sufficient. The fundamental problem now is that the relation shown in Fig. 54.2 holds for temporal frequencies as they occur at the retina of the observer. These frequencies, however, equal the frequencies at the display only if the eye is stationary with respect to this display. Particularly with the eye tracking objects moving on the screen, this assumption is no longer valid. For a tracking observer, very high temporal frequencies on the screen can be transformed to much lower frequencies or even DC at the retina. Consequently, suppression of these frequencies, with an interpolating lowpass filter, results in excessive blurring of moving objects, as will be discussed next.

FIGURE 54.1: Consecutive steps in upsampling with a factor of two.
Figure 54.3 shows, in a time-discrete representation, a simple object, a square, moving with a constant velocity. Again, in this example, we consider up-sampling with a factor of two. Therefore, the true position of the object is available at every second temporal position only (e.g., the odd numbered samples). The "tracking observer" views along the motion trajectory, represented with a line in the illustration, which results in a stationary image of the object on the retina. If the output field sampling frequency exceeds the cutoff temporal frequency of the human visual system,¹ the viewer will have the illusion that the object is continuously present.
Therefore, the object is actually seen at a position corresponding with the motion trajectory. If now, e.g., in the 6th output field, the object is interpolated according to SRC theory, weighted copies of the object from surrounding fields resulting from the interpolating LPF are displayed. Figure 54.3 illustrates the case of a symmetrical transversal lowpass filter. In this situation, the viewer sees the object at the correct position, but also various attenuated and displaced copies (the impulse response of the interpolating temporal filter) of the object in a neighborhood. The attenuation depends on the coefficients of the interpolating filter, and the distance between the copies is related to the displacement of the moving object in a field period. For the object-tracking observer, therefore, the temporal LPF is transformed into a spatial LPF. For an object velocity of one pixel per field period (one pel/field), its frequency characteristic equals the temporal frequency characteristic of the interpolating LPF.² One pel/field is a slow motion; in broadcast picture material, velocities in a range exceeding 16 pel/field do occur. Thus, the spatial blur caused by the SRC process becomes unacceptable even for moderate object velocities.

FIGURE 54.2: The contrast sensitivity of the human observer (y-axis) for large areas of uniform brightness, as a function of the temporal frequency (x-axis).

FIGURE 54.3: The effect of temporal interpolation for an object-tracking observer. The field numbers are counted at the output field rate.

¹ Actually, the picture update frequency may be even as low as 16 Hz to guarantee smooth perceived motion (see, e.g., [3]). The higher display rates are merely necessary to prevent the annoying large-area flicker.
54.3.2 Vertical Interpolation and Interlaced Scanning
Similar to the situation of field rate conversion, it may seem that sequential scan conversion is an up-sampling problem for which SRC-theory provides an adequate solution. However, straightforward, one-dimensional up-sampling in the vertical frequency domain is incorrect, as the data is clearly sub-Nyquist sampled due to interlace.
If, more correctly, the sequential scan conversion is considered as a two-dimensional up-sampling problem in the vertical-temporal frequency domain, we arrive at a discussion similar to the one
² It is assumed here that both filters are normalized to their respective sampling frequency.
in Section 54.3.1: the problem cannot be solved, as we do not know the temporal frequency at the retina of a movement-tracking observer. It is possible to disregard this problem and to perform a two-dimensional SRC, implicitly assuming a stationary viewer and prefiltered information. Such systems were described and have been implemented for studio applications. With the older image pick-up tubes the results can be satisfactory, as these devices have a poor dynamic resolution. When modern (CCD-)cameras are used, however, the limitations of the assumptions become obvious.
54.4 Alternatives for Sampling Rate Conversion Theory
With the problem of linear interpolation of video signals clarified, we will discuss alternative algorithms developed over time. These algorithms fall into two categories. The first category simplifies the interpolation filter prescribed by SRC-theory, considering that a completely correct solution is impossible anyway. The resulting "simple algorithms" are more attractive for hardware realization than the method from which they are derived, and under certain conditions can perform quite similarly. The second category includes the most "advanced algorithms" for scanning format conversion. These methods can be characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. The difference between the various options lies mainly in the number of possible directions, and dimensions, which are considered. The implementation can show various linear interpolation filters controlled by one or more detectors, or a multi-dimensional nonlinear filter that has an inherent edge adaptivity. As this description allows a large number of algorithms, we will illustrate it with some important examples.
54.4.1 Simple Algorithms
SRC-theory in the temporal and vertical frequency domain is not applicable due to the missing prefilter in common video systems. A sophisticated linear interpolation filter therefore makes little sense. Any interpolating (spatio-)temporal low-pass filter will suppress original temporal frequency components as well as aliased signal components, as they occupy, by definition, the same spectrum. As the first effect is desired and the second is not, the transfer function of the filter strikes a compromise between alias and blurring. Repetition of the most recent sample in this sense is optimal for the dynamic resolution and worst for alias. A strong temporal low-pass filter suppresses much (not necessarily all) alias and yields a poor dynamic resolution. The annoyance of the temporal alias depends on the input and output picture frequency, and particularly their difference. In the easiest case, both frequencies are high and their difference is 50 Hz or more. In the worst case, input and output picture rates are low and their difference is in the order of 10 Hz. In case of an annoying beat frequency, an interpolating LPF usually improves picture quality; otherwise the best compromise is closer to repetition of the most recent sample.
54.4.2 Advanced Algorithms
As indicated before, these methods are characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. To this end they have either an explicit or an implicit detector to find this direction. In case of (1-D) temporal interpolation the explicit detector is usually called a motion detector, for 2-D spatial interpolation it is called an edge detector, while the most advanced device, estimating the optimal spatio-temporal (3-D) interpolation direction, is usually called a motion estimator. The interpolation filter can be recursive or transversal, and can have any number of taps, but a transversal filter with one or two taps is the most common choice.
For a two-tap FIR approach we can write the interpolated video signal $F_{\mathrm{int}}$, in picture $n$, at spatial position $\vec{x} = (x, y)^T$, as a function of the input video signal $F(\vec{x}, n)$:

$$ F_{\mathrm{int}}(\vec{x}, n) = 0.5\left[ F\!\left(\vec{x} + \begin{pmatrix}\delta_1\\ \delta_2\end{pmatrix},\, n + \delta_3\right) + F\!\left(\vec{x} - \begin{pmatrix}\delta_1\\ \delta_2\end{pmatrix},\, n - \delta_3\right)\right] \qquad (54.1) $$

In this terminology, a motion detector controls $\delta_3$, an edge detector $\delta_1$ and $\delta_2$, while a motion estimator can be applied to determine $\delta_1$, $\delta_2$, and $\delta_3$.
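A direct transcription of Eq. (54.1) is sketched below. The frame-buffer layout (a list of 2-D luminance arrays) and the clipping of offset addresses at the picture borders are illustrative assumptions, not prescribed by the chapter.

```python
import numpy as np

def two_tap_interpolate(frames, x, y, n, d1, d2, d3):
    """Two-tap FIR interpolation of Eq. (54.1).

    frames: sequence of 2-D luminance arrays indexed by picture number n.
    (d1, d2) is the spatial offset, d3 the temporal offset; a motion
    detector would steer d3, an edge detector (d1, d2), and a motion
    estimator all three.
    """
    fa = frames[n + d3]
    fb = frames[n - d3]
    h, w = fa.shape
    # Clip offset addresses to the picture area (border handling choice).
    xa, ya = np.clip(x + d1, 0, w - 1), np.clip(y + d2, 0, h - 1)
    xb, yb = np.clip(x - d1, 0, w - 1), np.clip(y - d2, 0, h - 1)
    return 0.5 * (fa[ya, xa] + fb[yb, xb])

# Three flat test pictures with values 10, 20, 30: pure temporal averaging
# (d1 = d2 = 0, d3 = 1) returns the mean of pictures 0 and 2.
frames = [np.full((4, 4), v, dtype=float) for v in (10.0, 20.0, 30.0)]
result = two_tap_interpolate(frames, 1, 1, 1, 0, 0, 1)
```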
Algorithms with a Motion Detector
Todetectmotion, the differencebetween two successive pictures is calculated. It is too simple,
however,toexpectthissignaltobecomezeroinapicturepartwithoutmovingobjects. Thecommon
problems with the detection are noise and alias. Additional problems occurringin some systems are
colorsubcarriers causing non-stationarities incolored regions, interlacecausing nonstationarities in
vertically detailed picture parts,and timing jitter of the sampling clock which is particularlyharmful
in detailed areas.
All these problems imply that the output of the motion detector usually is not a binary, but r ather
a multi-level signal, indicating the probability of motion. Usual (but not always valid) assumptions
made to improve the detector are:
1. Noise is small and signal is large.
2. The spectrum part around the color carrier carries no motion information.
3. Low-frequency energy in the signal is larger than in the noise and alias.
4. Moving objects are large compared to a pixel.
The general structure of the motion detector resulting from these assumptions is depicted in Figure 54.4. As can be seen, the difference signal is first low-pass (and carrier-reject) filtered to profit from assumptions 2 and 3. It also makes the detector less "nervous" for timing jitter in detailed areas.
FIGURE 54.4: General structure of a motion detector.
After the rectification, another low-pass filter improves the consistency of the motion signal, based on assumption 4. Finally, the nonlinear (but monotonic) transfer function in the last block translates the signal into a probability figure for motion, P_m, using assumption 1. This last function may have to be adapted to the expected noise level. Low-pass filters are not necessarily linear. More than one detector can be used, working on more than just two pictures in the neighborhood of the current image, and a logical or linear combination of their outputs may lead to a more reliable indication of motion.
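The chain of Fig. 54.4 (picture difference → low-pass filter → rectifier → low-pass filter → monotonic nonlinearity) can be sketched as below. The 3×3 box kernels and the exponential shape of the nonlinearity are illustrative assumptions; a real design would tune both, and in particular the sensitivity k, to the expected noise level.

```python
import numpy as np

def box3(img):
    """3x3 box low-pass filter with edge replication."""
    p = np.pad(img, 1, mode="edge")
    return sum(p[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def motion_probability(prev, curr, k=0.1):
    """Multi-level motion signal P_m in [0, 1], per Fig. 54.4."""
    diff = curr - prev                      # difference of two pictures
    # First LPF suppresses noise/alias/carrier residues (assumptions 1-3)
    # before rectification, and calms timing jitter in detailed areas.
    lp = box3(diff)
    rect = np.abs(lp)                       # rectification
    # Second LPF improves consistency, exploiting that moving objects
    # are large compared to a pixel (assumption 4).
    cons = box3(rect)
    # Monotonic nonlinearity maps the level to a probability of motion.
    return 1.0 - np.exp(-k * cons)

# Static background with a bright 4x4 block that jumps 2 pixels right.
prev = np.zeros((12, 12)); prev[4:8, 2:6] = 100.0
curr = np.zeros((12, 12)); curr[4:8, 4:8] = 100.0
pm = motion_probability(prev, curr)
# pm is high near the moving block and essentially zero far from it.
```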
The motion detector (MD) is applied to switch or fade between two processing modes, one of which is optimal for stationary and the other for moving image parts. Examples are:
• De-interlacing. The MD fades between intra-field interpolation (line averaging, or edge-dependent spatial interpolation) and inter-field interpolation (repetition of the previous field, averaging of neighboring fields, etc.).
• Field rate doubling on interlaced video. The MD fades between repetition of fields (best dynamic resolution without motion compensation for moving picture parts) and repetition of frames (best spatial resolution in stationary image parts).
To slightly elaborate on the first example of de-interlacing, we define the interpolated pixel $X_m(\vec{x}, n)$ in a moving picture part as:

$$ X_m(\vec{x}, n) = 0.5\left[ F\!\left(\vec{x} - \begin{pmatrix}0\\1\end{pmatrix}, n\right) + F\!\left(\vec{x} + \begin{pmatrix}0\\1\end{pmatrix}, n\right)\right] \qquad (54.2) $$

while for stationary picture parts the interpolated pixel $X_s(\vec{x}, n)$ is taken as:

$$ X_s(\vec{x}, n) = F(\vec{x}, n - 1) \qquad (54.3) $$

and, taking the probability of motion $P_m$ from the motion detector into account, the output is given by:

$$ F_{\mathrm{int}}(\vec{x}, n) = P_m \, X_m(\vec{x}, n) + (1 - P_m) \, X_s(\vec{x}, n) \qquad (54.4) $$

In most practical cases the output $P_m$ has a nonlinear relation with the actual probability.
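For one interpolated pixel, Eqs. (54.2) to (54.4) combine as below; a minimal sketch assuming the missing line's vertical neighbors exist and that P_m is already available from a motion detector.

```python
def deinterlace_pixel(above, below, prev_same, p_m):
    """Motion-adaptive de-interlacing of one missing pixel.

    above/below: vertical neighbors in the current field (Eq. 54.2),
    prev_same:   same position in the previous field (Eq. 54.3),
    p_m:         probability of motion from the detector (Eq. 54.4).
    """
    x_m = 0.5 * (above + below)        # line averaging for moving parts
    x_s = prev_same                    # field insertion for static parts
    return p_m * x_m + (1.0 - p_m) * x_s

# Static detail (p_m = 0): the previous field is trusted entirely.
static = deinterlace_pixel(80.0, 120.0, 90.0, 0.0)   # -> 90.0
# Motion (p_m = 1): the average of the vertical neighbors is used.
moving = deinterlace_pixel(80.0, 120.0, 90.0, 1.0)   # -> 100.0
```

Intermediate values of p_m fade smoothly between the two modes, which hides detector errors better than a hard switch.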

Algorithms with an Edge Detector
To detect the orientation of a spatial edge, usually the differences between pairs of spatially
neighboring pixels are calculated. Again it is a bit unrealistic to expect that a zero difference is a
reliable indication of a spatial direction in which the signal is stationary. The same problems (noise,
alias, carriers, timing-jitter) occur as with motion detection. The edge detector (ED) is applied to
switch or fade between at least two but usually more processing modes, each of them optimal for
interpolation of a certain orientation of the spatial edge. Examples are:
• De-interlacing. The ED fades between vertical line averaging and diagonal averaging (±45°, or even more angles).
• Up-conversion to a higher resolution format. A simple bi-linear interpolation filter is
applied with its coefficients adapted to the output of the edge detector.
FIGURE 54.5: Identification of pixels as applied for direction dependent spatial interpolation.
In Fig. 54.5, X is the pixel to be interpolated for the sequential scan conversion, and the result applying pixels in a neighborhood (A, B, C, D, E, and F) is either $X_a$, $X_b$, or $X_c$, where:
$$ X_a = 0.5\,[A + F] = 0.5\left[ F\!\left(\vec{x} - \begin{pmatrix}1\\1\end{pmatrix}, n\right) + F\!\left(\vec{x} + \begin{pmatrix}1\\1\end{pmatrix}, n\right)\right] \qquad (54.5) $$

and:

$$ X_b = 0.5\,[B + E] = 0.5\left[ F\!\left(\vec{x} - \begin{pmatrix}0\\1\end{pmatrix}, n\right) + F\!\left(\vec{x} + \begin{pmatrix}0\\1\end{pmatrix}, n\right)\right] \qquad (54.6) $$

and:

$$ X_c = 0.5\,[C + D] = 0.5\left[ F\!\left(\vec{x} + \begin{pmatrix}+1\\-1\end{pmatrix}, n\right) + F\!\left(\vec{x} + \begin{pmatrix}-1\\+1\end{pmatrix}, n\right)\right] \qquad (54.7) $$

The selection of $X_a$, $X_b$, or $X_c$ as the interpolated output $F_{\mathrm{int}}$ is controlled by a luminance gradient indication calculated from the same neighborhood:

$$ F_{\mathrm{int}}(\vec{x}, n) = \begin{cases} X_a, & \left(|A-F| < |C-D| \;\wedge\; |A-F| < |B-E|\right)\\ X_b, & \left(|B-E| \le |A-F| \;\wedge\; |B-E| \le |C-D|\right)\\ X_c, & \left(|C-D| < |A-F| \;\wedge\; |C-D| < |B-E|\right) \end{cases} \qquad (54.8) $$
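Equations (54.5) to (54.8) translate directly into a selection routine; a sketch using the pixel names of Fig. 54.5, with the same tie-breaking priority as Eq. (54.8).

```python
def edge_directed_pixel(A, B, C, D, E, F):
    """Direction-dependent spatial interpolation of pixel X (Fig. 54.5).

    (A, F), (B, E), and (C, D) are the diagonal, vertical, and
    anti-diagonal neighbor pairs; the pair with the smallest luminance
    gradient selects the interpolation direction (Eqs. 54.5-54.8).
    """
    g_af, g_be, g_cd = abs(A - F), abs(B - E), abs(C - D)
    if g_af < g_cd and g_af < g_be:
        return 0.5 * (A + F)           # X_a, Eq. (54.5)
    if g_be <= g_af and g_be <= g_cd:
        return 0.5 * (B + E)           # X_b, Eq. (54.6)
    return 0.5 * (C + D)               # X_c, Eq. (54.7)

# A diagonal edge: the (A, F) pair matches best, so interpolation
# follows that direction instead of blurring across the edge.
diag = edge_directed_pixel(A=50.0, B=50.0, C=200.0, D=50.0,
                           E=200.0, F=50.0)   # -> 50.0
```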
In this example, the gradient is calculated on the same pixels that are used in the interpolation step. This is not necessarily the case. Similar to the earlier described motion detector, it is advantageous to filter the video signal prior to and/or after the rectification in Eq. (54.8). Also the decision, i.e., the optimal interpolation angle, can be low-pass filtered to improve the consistency of the interpolation angle. Finally, the edge dependent interpolation can be combined with (motion adaptive or motion compensated) temporal interpolation to improve the interpolation quality of near-horizontal edges.
Implicit Detection in Nonlinear Interpolation Filters
Many nonlinear interpolation methods have been described. Most popular is the class of
order statistical filters. Combinations w ith linear (bandsplitting) filters are known, optimizing the
interpolation for individual spectrum parts. We will limit ourselves to some basic examples here.
An illustration of a basic inherently adapting filter is shown in Figure 54.6. The line to be inter-
FIGURE 54.6: Sequential scan conversion with three-tap vertical-temporal median filtering. The
thin lines show which pixels are input for the median filter.
c


1999 by CRC Press LLC
polated is found as the median of the spatially neighboring lines (a and b) and the corresponding
line (c) from the previous field:
F
int
(x,n)= median [a, b, c]=
median

F

x
+

0
1

,n

,F

x −

0
1

,n

,F


x,n− 1


(54.9)
with:

$$ \mathrm{median}(X, Y, Z) = \begin{cases} X, & (Y \le X \le Z \;\vee\; Z \le X \le Y)\\ Y, & (X < Y \le Z \;\vee\; Z \le Y < X)\\ Z, & (\mathrm{otherwise}) \end{cases} \qquad (54.10) $$
The inherent adaptation to edges is understood as follows. In case of a temporal edge (i.e., motion) larger than the spatial edge (i.e., vertical detail), the difference between a and b is relatively small compared to their difference with c. Therefore, an intra-field interpolation results (a or b is copied). In case of a non-moving vertical edge, the difference between a and b will be relatively large compared to the difference between c and a or b. In this case, the inter-field interpolation (c is copied) is most likely.
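The behavior of Eqs. (54.9) and (54.10) can be sketched per line as below; the two test cases mirror the two situations just described.

```python
def median3(x, y, z):
    """Median of three values (Eq. 54.10)."""
    return sorted((x, y, z))[1]

def deinterlace_median(above, below, prev_same):
    """Three-tap vertical-temporal median de-interlacing (Eq. 54.9).

    above/below are lines a and b of the current field, prev_same is the
    corresponding line c from the previous field.
    """
    return [median3(a, b, c) for a, b, c in zip(above, below, prev_same)]

# Non-moving vertical edge: a and b differ strongly, c agrees with one
# of them, so the inter-field sample c is copied (no resolution loss).
static = deinterlace_median([10.0], [200.0], [10.0])      # -> [10.0]
# Motion: c is the outlier, so an intra-field value (a or b) is copied.
moving = deinterlace_median([100.0], [110.0], [250.0])    # -> [110.0]
```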

It is possible to combine edge detectors with non-linear filters, e.g., a so-called weighted median
filter. In a weighted median filter, the (integer) weight given to a sample indicates the number of
times its value is included in the input of the filter to the ranking stage. An increase of this weight
increases the chance this sample value is selected as the median. It therefore provides a method,
using the output of an edge detector with uncertainties, to statistically improve the performance of
the interpolation.
We will again use Fig. 54.5 to identify the location of the pixels used in the interpolation. The
output value for the pixel position indicated with X results as:
$$ F_{\mathrm{int}}(\vec{x}, n) = \mathrm{median}\left[ A, B, C, D, E, F,\; \alpha \cdot X_{-1},\; \beta \cdot \frac{B + E}{2} \right], \qquad (\alpha, \beta \in \mathbb{N}) \qquad (54.11) $$
with:

$$ X_{-1} = F(\vec{x}, n - 1), \quad A = F\!\left(\vec{x} - \begin{pmatrix}1\\1\end{pmatrix}, n\right), \quad B = F\!\left(\vec{x} - \begin{pmatrix}0\\1\end{pmatrix}, n\right), \qquad (54.12) $$
as illustrated in Fig. 54.5. The weighting (α and β) implies that an assumed "important" pixel is fed more than once to the median calculating circuit:

$$ \alpha \cdot A = \underbrace{A, A, \ldots, A}_{\alpha\ \mathrm{times}} \qquad (54.13) $$

The combination arises if a motion detector is used to control the weighting factors of the pixel from the previous field and that of the value found by line averaging. A large value of α increases the probability of field insertion, while a large β causes an increased probability of line averaging.
Although the examples in this section are limited to de-interlacing, it should be noted that proposals exist for field rate conversion as well.
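The weighted median of Eq. (54.11) can be sketched by repeating samples in the rank list; a minimal illustration, with the pixel names of Fig. 54.5 and the weights chosen freely for the demonstration.

```python
import statistics

def weighted_median_deinterlace(A, B, C, D, E, F, X_prev, alpha, beta):
    """Weighted median de-interlacing of Eq. (54.11).

    Integer weights alpha and beta repeat the previous-field sample
    X_prev and the line average (B+E)/2 in the rank list; a motion
    detector would choose them (large alpha favors field insertion,
    large beta favors line averaging).
    """
    samples = [A, B, C, D, E, F]
    samples += [X_prev] * alpha               # alpha copies of X_-1
    samples += [0.5 * (B + E)] * beta         # beta copies of (B+E)/2
    return statistics.median(samples)

# Static case: a large alpha pushes the previous-field value through.
out = weighted_median_deinterlace(10, 20, 30, 40, 50, 60, 35.0,
                                  alpha=5, beta=1)   # -> 35.0
```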
Algorithms with a Motion Estimator
The idea to interpolate picture content in the direction in which it is most correlated can be extended to a three-dimensional case. This results in an interpolation along the motion trajectory. Figure 54.7 defines the motion trajectory as the line that connects identical picture parts in a sequence of pictures.

FIGURE 54.7: Identical picture parts of successive images lie on the motion trajectory. Its projection in the image plane is the motion vector.

The projection of this motion trajectory between two successive pictures on the image
plane, called the motion vector, is also shown in this figure. Not all temporal information changes
can be described adequately as object velocities: e.g., fades and concealed or obscured background.
Nevertheless, this method has the strongest physical background, as due to their inertia it always
takes time for objects to completely disappear, or change geometry, resulting in a strong correlation
of successive images after compensation for motion. This is in contrast to spatial (edge adaptive)
interpolation for which there is a statistical but no physical background.
Knowledge of motion vectors allows us to interpolate image data at any temporal instance between two successive pictures. The most common form uses motion compensated averaging according to:

$$ F_{\mathrm{int}}\left(\vec{x}, n + \alpha\right) = \frac{1}{2}\left[ F\!\left(\vec{x} - \alpha D(\vec{x}, n),\, n\right) + F\!\left(\vec{x} + (1 - \alpha) D(\vec{x}, n),\, n + 1\right) \right], \quad (0 \le \alpha \le 1) \qquad (54.14) $$

where $D(\vec{x}, n)$ is the object displacement at position $\vec{x} = (x, y)^T$ estimated between fields $n$ and $n + 1$, while $\alpha$ determines the temporal instance for which the interpolated data has to be valid. However, all previously mentioned interpolation methods that involve a temporal component can be used as a basis of a motion compensated interpolation. So linear, nonlinear, motion adaptive, edge adaptive, and inherently adapting interpolation methods can be upgraded toward their motion-compensated counterparts. Furthermore, band-splitting can be used to sophisticate the interpolation.
We will not elaborate further on these methods, as they follow straightforwardly from the earlier text. We will make an exception, however, for temporal interpolation on interlaced signals, as this poses non-trivial problems even with knowledge of local motion.
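The motion compensated averaging of Eq. (54.14) can be sketched as below, on progressive pictures for simplicity. Rounding the fetch positions to the pixel grid and clipping them at the borders are illustrative simplifications; as the following subsection explains, sub-pixel fetches in the vertical direction are exactly where interlace makes things hard.

```python
import numpy as np

def mc_average(f_n, f_n1, disp, alpha):
    """Motion compensated averaging of Eq. (54.14).

    f_n, f_n1: pictures n and n+1; disp: displacement field D(x, n) as
    an array of (dx, dy) per pixel; alpha in [0, 1] selects the temporal
    position of the interpolated picture.
    """
    h, w = f_n.shape
    out = np.empty_like(f_n)
    for y in range(h):
        for x in range(w):
            dx, dy = disp[y, x]
            # Backward fetch from picture n, forward fetch from n+1.
            xb = int(np.clip(round(x - alpha * dx), 0, w - 1))
            yb = int(np.clip(round(y - alpha * dy), 0, h - 1))
            xf = int(np.clip(round(x + (1 - alpha) * dx), 0, w - 1))
            yf = int(np.clip(round(y + (1 - alpha) * dy), 0, h - 1))
            out[y, x] = 0.5 * (f_n[yb, xb] + f_n1[yf, xf])
    return out

# A bar moving 2 pixels right per picture, interpolated halfway
# (alpha = 0.5): the bar appears at the intermediate position x = 3.
f_n = np.zeros((5, 8)); f_n[:, 2] = 100.0
f_n1 = np.zeros((5, 8)); f_n1[:, 4] = 100.0
disp = np.tile(np.array([2.0, 0.0]), (5, 8, 1))
out = mc_average(f_n, f_n1, disp, 0.5)
```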
Motion Compensated De-Interlacing
In general, the pixels required for the motion compensated interpolation do not exist in the time-discrete input signal, e.g., due to non-integer velocities. In the horizontal domain this problem can be solved with linear SRC-theory, but not in the vertical domain. Three solutions for this problem have been proposed:
1. Application of a generalized sampling theory (GST).
2. Straight extension of the motion vector into earlier pictures until it points (almost) to an existing pixel.
3. Recursive de-interlacing of the signal.
The implication of GST is that it is possible to perfectly reconstruct a signal sampled at 1/n times the Nyquist rate if n independent sets of samples describe the signal. The de-interlacing problem is a specific case for which n = 2. The required two sets are the current field and the motion compensated previous field, respectively. If the two do not coincide, i.e., the object does not have an odd integer vertical motion vector component, the independency constraint is fulfilled, and the problem can theoretically be solved. Practical problems are:
a. The velocity can have an odd vertical component.
b. Perfect reconstruction requires the use of pixels from many lines, for which the velocity
need not be constant.
c. For nearly odd integer valued vertical velocities, noise may be enhanced.
Solution 2 is valid only if we assume the velocity constant over a larger temporal interval. This is a rather severe limitation, which makes the method practically useless. Solution 3 is based on the assumption that it is possible at some time to have a perfectly de-interlaced picture in a memory. Once this is true, the picture is used to de-interlace the next input field. With motion compensation, this solution can be perfect, as the de-interlaced picture in the memory allows the use of SRC-theory also in the vertical domain. If this new de-interlaced field is written in the memory, it can be used to de-interlace the next incoming field. Limitations of this method are:
a. Propagation of motion vector and interpolation errors.
b. Even a perfectly de-interlaced picture can contain alias in the vertical domain in the
common case of a camera without an optical prefilter.
In practice, problem a is the more serious one, particularly for nearly odd vertical velocities.
Although there are restrictions, motion compensated interpolation techniques for field rate up-conversion and de-interlacing provide the most advanced option. However, they require nontrivial algorithms to measure object displacements between consecutive images. These motion estimation methods will therefore be discussed more extensively in the next section.
54.5 Motion Estimation
This section provides an overview of motion estimation algorithms developed over time. The estimators applicable for scanning format conversion require additional constraints, which are discussed in the last part of this section.
54.5.1 Pel-Recursive Estimators
The category of pel-recursive motion estimators can be derived from iterative methods that use a previously calculated motion vector $D^{i-1}$ to find the result vector $D^i$ according to:

$$ D^{i} = D^{i-1} + \mathrm{update} \qquad (54.15) $$
Several algorithms based on iteration can be found in the literature. A common form applies iterative minimization of the squared value of the displaced frame difference (DFD) along the steepest gradient of the luminance function:

$$ D^{i} = D^{i-1} - \frac{1}{2}\,\alpha \begin{pmatrix} \partial/\partial D^{i-1}_x \\ \partial/\partial D^{i-1}_y \end{pmatrix} \mathrm{DFD}^2\!\left(\vec{x}, D^{i-1}, n\right) \qquad (54.16) $$
where the DFD is defined as:

$$ \mathrm{DFD}\!\left(\vec{x}, D^{i-1}, n\right) = F(\vec{x}, n) - F\!\left(\vec{x} - D^{i-1}, n - 1\right) \qquad (54.17) $$

and:

$$ D^{i} = \begin{pmatrix} D^{i}_x \\ D^{i}_y \end{pmatrix} \qquad (54.18) $$
As before, $n$ stands for the field or picture number. The constant $\alpha$ is positive and determines the speed of convergence and the accuracy of the estimate. The value of $\alpha$ is limited to a maximum, since instability or a noisy estimation result can occur for higher values. Equation (54.16) can be rewritten as:

$$ D^{i} = D^{i-1} - \alpha \cdot \mathrm{DFD}\!\left(\vec{x}, D^{i-1}, n\right) \cdot \begin{pmatrix} \partial/\partial x \\ \partial/\partial y \end{pmatrix} F\!\left(\vec{x} - D^{i-1}, n\right) \qquad (54.19) $$
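One iteration of Eq. (54.19) can be sketched as below. The bilinear sub-pixel fetch and the central-difference gradient are common discretizations, not prescribed by the chapter, and the test scene (a luminance ramp shifted two pixels) is purely illustrative.

```python
import numpy as np

def bilinear(F, xf, yf):
    """Bilinearly interpolated fetch from picture F (sub-pixel access)."""
    x0 = int(np.clip(np.floor(xf), 0, F.shape[1] - 2))
    y0 = int(np.clip(np.floor(yf), 0, F.shape[0] - 2))
    ax, ay = xf - x0, yf - y0
    return ((1 - ax) * (1 - ay) * F[y0, x0] + ax * (1 - ay) * F[y0, x0 + 1]
            + (1 - ax) * ay * F[y0 + 1, x0] + ax * ay * F[y0 + 1, x0 + 1])

def steepest_descent_step(F_prev, F_curr, x, y, d, alpha):
    """One update of the steepest-descent estimator, Eq. (54.19).

    d is the prediction D^{i-1}; the DFD of Eq. (54.17) is evaluated with
    a sub-pixel fetch, and the luminance gradient with central
    differences.
    """
    xs, ys = x - d[0], y - d[1]
    dfd = F_curr[y, x] - bilinear(F_prev, xs, ys)          # Eq. (54.17)
    gx = 0.5 * (bilinear(F_prev, xs + 1, ys) - bilinear(F_prev, xs - 1, ys))
    gy = 0.5 * (bilinear(F_prev, xs, ys + 1) - bilinear(F_prev, xs, ys - 1))
    return d - alpha * dfd * np.array([gx, gy])            # Eq. (54.19)

# A luminance ramp shifted 2 pixels to the right between successive
# pictures; iterating from D = (0, 0) converges toward dx = 2.
xx = np.arange(16, dtype=float)
F_prev = np.tile(10.0 * xx, (16, 1))
F_curr = np.tile(10.0 * (xx - 2.0), (16, 1))
d = np.array([0.0, 0.0])
for _ in range(30):
    d = steepest_descent_step(F_prev, F_curr, 8, 8, d, alpha=0.005)
```

With this gradient magnitude, each iteration halves the remaining error, illustrating how alpha trades convergence speed against stability.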
The method is known as the "steepest descent algorithm". The updating process can be stopped after a fixed number of iterations, at the moment the update term falls under a threshold, or in case slow convergence or even divergence is detected. Rather than iterating the estimation process in a fixed position of the picture, the estimated result from a previously scanned position in the same picture can be used as the prediction for the present location. We shall then speak of a spatial recursive process, and if for every pixel an update is calculated, the name "pel-recursive motion estimation" is commonly used. The spatial prediction can be based on either a single previously calculated result, in which case the convergence shall be one-dimensional, or on a number of earlier calculated vectors. In case more than one vector is used, the design can select the best according to a criterion before or after updating, e.g., the smallest DFD, or a weighted average can be calculated. The coefficients that determine the weighting can be based on statistical properties of the vector field. Depending on the choice of the relative positions in the picture from which prediction vectors are taken, a one- or two-dimensional convergence can result. In the case of temporal recursion, a further refinement can be obtained by motion compensating the prediction values from the preceding field before weighting them with the values from the present field.
The algorithm can be improved by calculating the update term from a group of pixels rather than from only one pixel. This is then referred to as the "gradient summed error algorithm":

$$ D^{i} = D^{i-1} - \alpha \sum_{\vec{x} \in \mathrm{group}} \left[ \mathrm{DFD}\!\left(\vec{x}, D^{i-1}, n\right) \cdot \begin{pmatrix} \partial/\partial x \\ \partial/\partial y \end{pmatrix} F\!\left(\vec{x} - D^{i-1}, n\right) \right] \qquad (54.20) $$
Again the group can extend into a one-, two-, or three-dimensional neighborhood. Weighted averaging is an option, and weights can be adapted to image statistics. In case of gradients taken from a temporally neighboring position, motion compensation can be applied prior to weighting with the spatially neighboring gradients.
Simplifications of the algorithm are possible. Particularly the prevention of multiplication is useful, and possible, e.g., by only using the sign of the gradient to determine the direction of the update with a fixed length.
In the literature, many variants of the steepest descent or gradient summed error algorithm are described, which mainly differ from the above-mentioned algorithms in that the convergence-speed-determining constant α is substituted by variables to adapt the estimator to local picture statistics.
54.5.2 Block-Matching Algorithm
In block-matching motion estimation algorithms, a displacement vector is assigned to the center $X$ of a block of pixel positions $B(X)$ in the current field $n$ by searching a similar block within a search area $SA(X)$, also centered at $X$, but in the previous field $n - 1$. The similar block has a center that is shifted with respect to $X$ over the displacement vector $D(X, n)$. To find $D(X, n)$, a number of candidate vectors $C$ are evaluated, applying an error measure $\epsilon(C, X, n)$ to quantify block similarity. Figure 54.8 illustrates the procedure.

FIGURE 54.8: Block of size $X \times Y$ in current field $n$ and trial block in search area $SA(X)$ in previous field $n - 1$, shifted over candidate vector $C$.
More formally, $CS_{\mathrm{max}}$ is defined as the set of candidate vectors $C$, describing all possible displacements (integer on the pixel grid) with respect to $X$ within the search area $SA(X)$ in the previous image:

$$ CS_{\mathrm{max}} = \left\{ C \mid -N \le C_x \le +N,\; -M \le C_y \le +M \right\} \qquad (54.21) $$
where N and M are constants limiting SA(X). Furthermore, a block B(X), centered at X and of size X × Y, consisting of pixel positions x in the present field n, is now considered:

$$ B(\vec{X}) = \left\{ \vec{x} \mid X_x - X/2 \le x \le X_x + X/2 \;\wedge\; X_y - Y/2 \le y \le X_y + Y/2 \right\} \qquad (54.22) $$
The displacement vector D(X,n) resulting from the block-matching process is the candidate vector C that yields the minimum value of an error function ε(C,X,n):

$$ \vec{D}(\vec{X}, n) = \left\{ \vec{C} \in CS_{\max} \mid \varepsilon(\vec{C}, \vec{X}, n) \le \varepsilon(\vec{F}, \vec{X}, n)\;\; \forall \vec{F} \in CS_{\max} \right\} \qquad (54.23) $$
If the vector D(X,n) with the smallest matching error is assigned to all pixel positions x in the block B(X):

$$ \forall \vec{x} \in B(\vec{X}): \quad \vec{D}(\vec{x}, n) = \left\{ \vec{C} \in CS_{\max} \mid \varepsilon(\vec{C}, \vec{X}, n) \le \varepsilon(\vec{F}, \vec{X}, n)\;\; \forall \vec{F} \in CS_{\max} \right\} \qquad (54.24) $$

rather than to the center pixel only, a large reduction of computations is achieved. As an implication, consecutive blocks B(X) are not overlapping.
The error value for a given candidate vector C is a function (COST) of the luminance values of the pixels in the current block and those of the shifted block from a previous field, summed over the block B(X):

$$ \varepsilon(\vec{C}, \vec{X}, n) = \sum_{\vec{x} \in B(\vec{X})} \mathrm{COST}\!\left( F(\vec{x}, n),\; F(\vec{x} - \vec{C}, n - p) \right) \qquad (54.25) $$
A common choice for p is either 1 or 2, depending on whether the signal is interlaced or not.
Although the COST function itself can be rather straightforward and simple to implement, the high repetition factor for this calculation creates a huge burden. To save calculational effort in block-matching motion estimation algorithms, several methods have been published. The usual ingredients are:
1. The use of a simpler COST function.
2. Estimation on sub-sampled picture material.
3. Design of a clever search strategy, avoiding the need to check all possible vectors.
Concerning option 1, there is almost general consensus. The most popular choice thus far for the error function is the Summed Absolute Difference (SAD) criterion:

$$ \varepsilon(\vec{C}, \vec{X}, n) = \mathrm{SAD}(\vec{C}, \vec{X}, n) = \sum_{\vec{x} \in B(\vec{X})} \left| F(\vec{x}, n) - F(\vec{x} - \vec{C}, n - p) \right| \qquad (54.26) $$
The most important alternatives are the Mean Square Error (MSE) and the Normalized Cross Correlation Function (NCCF) criteria. The simpler error functions that have been designed will not be discussed here, as the economizing hardly ever justifies the performance loss.
Option 2 is straightforward and has little negative effect on the performance with sub-sampling factors up to four. Option 3 is the most effective, and will be dealt with separately in the next section.
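As a baseline, Eqs. (54.21) to (54.26) combine into the brute-force matcher sketched below. Fields are plain nested lists of luminance values, candidates whose displaced block leaves the previous field are simply skipped, and all function and parameter names are our own:

```python
def sad(cur, prev, bx, by, cx, cy, bsize):
    """Summed absolute difference (Eq. 54.26) between the block at
    (bx, by) in the current field and the block displaced by the
    candidate (cx, cy) in the previous field."""
    total = 0
    for y in range(by, by + bsize):
        for x in range(bx, bx + bsize):
            total += abs(cur[y][x] - prev[y - cy][x - cx])
    return total

def full_search(cur, prev, bx, by, bsize=4, n=2, m=2):
    """Exhaustive search over CS_max = {C | -n <= Cx <= n, -m <= Cy <= m}."""
    best_err, best_vec = None, (0, 0)
    for cy in range(-m, m + 1):
        for cx in range(-n, n + 1):
            # skip candidates whose displaced block leaves the previous field
            if not (0 <= by - cy and by - cy + bsize <= len(prev)
                    and 0 <= bx - cx and bx - cx + bsize <= len(prev[0])):
                continue
            err = sad(cur, prev, bx, by, cx, cy, bsize)
            if best_err is None or err < best_err:
                best_err, best_vec = err, (cx, cy)
    return best_vec
```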
54.5.3 Search Strategies
Sub-Sampled Full Search
In the most straightforward search strategy, the matching error is calculated for all candidate vectors C in the search area for a block B(X) of pixel positions. The method is referred to as full search, exhaustive search, or brute-force block-matching. To economize the calculational effort, the matching errors of only half of the possible result vectors D(X,n) can be calculated in a first step, using a first candidate set CS_1 which is a subset of CS_max. Figure 54.9 illustrates this option, further showing the candidate vectors in the second step of the algorithm.
FIGURE 54.9: Candidate vectors tested in the second step around D_1(X,n) for sub-sampled full search block-matching. The grid shown is the pixel grid.
N-Step Search
The idea to adapt the search area from coarse to fine is not limited to a two-step process. As illustrated in Fig. 54.10, the first step of a three-step block-matcher performs a search on a coarse grid consisting of only nine candidate vectors in the entire search area. The second step includes a finer search, with eight candidate vectors around the best matching vector of the first step, and finally in the third step a search on the full-resolution grid is performed, with another eight candidates around the best vector of the second step. Note that a search range of ±6 pels is assumed; other search areas require modifications, either resulting in less accurate vectors or in more consecutive steps. Generalizations to N-step block-matching are obvious.

FIGURE 54.10: Illustration of the three-step search. Vectors resulting from the steps are indicated with the step number. The candidates in each step are shaded as in a, b, and c, respectively.
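The coarse-to-fine principle can be sketched generically. With step sizes (4, 2, 1), the sketch below is the classic three-step search (covering a ±7-pel range; the ±6-pel range assumed above would use slightly different step sizes). The `cost` callable stands for any block-match error such as the SAD:

```python
def n_step_search(cost, steps=(4, 2, 1)):
    """N-step search: evaluate a 3x3 grid of candidates around the current
    best vector, shrinking the grid spacing at every step."""
    best = (0, 0)
    for step in steps:
        candidates = [(best[0] + dx * step, best[1] + dy * step)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        best = min(candidates, key=lambda c: cost(*c))
    return best
```

With three steps of nine candidates each, at most 27 error evaluations replace the 169 of a full search over the same ±6/±7 range.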
Related is the 2-D logarithmic, or cross-search, method, which checks five vectors per step: one in the middle and four symmetrically around it (two with a different x-component and two with a different y-component). Again, four vectors are checked around the result, and the distance between the candidates is halved when the best matching vector is the middle one. Hence, the number of consecutive steps depends on the resulting vector, which is a drawback, as the hardware has to be designed for the worst-case situation and cannot profit from a low average number of steps.
One-at-a-Time-Search
Yet a further reduction of candidate vectors can be realized if the two-dimensional optimization problem is split into two separate one-dimensional optimizations. The candidate set for step i of the algorithm, CS^i(X,n), is adapted during the process, as in the previously discussed algorithms, but contains only three candidate vectors C(X,n). Departing from vector 0, this method performs a search for the minimum error along the x-axis of the search area:

$$ CS^{\,i}_{x}(\vec{X}, n) = \left\{ \vec{C} \mid \vec{C} = \vec{D}^{\,i-1}(\vec{X}, n) + \vec{U},\; U_x = 0 \vee \pm 1,\; U_y = 0 \right\} \qquad (54.27) $$
The procedure is repeated N times until D^N(X,n) = D^{N−1}(X,n). From this minimum, a search is started parallel to the y-axis and repeated M times until D^{M+N}(X,n) = D^{M+N−1}(X,n).
In its simplest form, shown in Fig. 54.11, the process stops at this minimum, and the estimated motion vector D(x,n) = D^{M+N}(X,n) for all pixel positions x in B(X). It is possible, however, to refine the result by repeating the OTS procedure, departing with every iteration from the previous result D^{M+N}(X,n).
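The simplest (non-iterated) form of the OTS procedure might be sketched as below: a descent along the x-axis until the error no longer decreases, followed by the same along the y-axis. The `cost` callable again abstracts the block-match error; all names are illustrative:

```python
def one_at_a_time_search(cost, start=(0, 0)):
    """One-at-a-time search: two separate 1-D minimizations."""
    def line_search(vec, axis):
        while True:
            cands = []
            for u in (-1, 0, 1):
                c = list(vec)
                c[axis] += u
                cands.append(tuple(c))
            best = min(cands, key=lambda c: cost(*c))
            if cost(*best) >= cost(*vec):   # no further improvement
                return vec
            vec = best
    vec = line_search(start, 0)             # along the x-axis first
    return line_search(vec, 1)              # then along the y-axis
```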
FIGURE 54.11: One-at-a-time search (OTS) block-matching. The new candidate vectors that have to be evaluated in a number of successive steps are indicated.
A problem of all efficient search techniques is the risk of converging to a local rather than the global minimum of the match error function. The coarser the initial grid of candidate vectors, the higher this risk. It can be reduced by prefiltering the video information prior to the motion estimation, but this introduces inaccuracies in detailed picture parts. If the prefiltering and the block size are adapted separately for every step in the search procedure, we arrive at the hierarchical block-matching algorithms, dealt with in the next subsection.
54.6 Motion Estimation and Scanning Format Conversion
In situations where motion vectors are generated for temporal interpolation of pictures, it is important that the vectors represent the real velocities of objects, or the "true motion" as it is called, in the picture. None of the described motion estimators is guaranteed to yield true motion vectors. They generate a vector that yields the "best match", or the minimal displaced frame difference, and often even only the local best match, or the local minimum of the DFD.

To improve this relation between estimated displacement vectors and actual object velocity, methods have been designed which modify either the algorithm or the displacement vectors. The common solution is based on the observation that the velocity field does not usually contain many fine details. In other words, the motion vector field is spatially consistent: large areas (objects, background) with identical vectors usually exist. Object inertia further causes velocity fields to be temporally consistent. To improve consistency, a number of methods have been proposed. Two classes can be distinguished, combinations of which are possible:
• Methods that perform a post-processing on the output vector field to improve the consistency.
• Methods in which a smoothness constraint is integrated in the estimator.

Post-processing can be straightforward, applying basically low-pass filtering to improve the spatial and/or temporal consistency or smoothness of the vector field generated by any motion estimation algorithm. Often the filter is a nonlinear one; the median in particular is popular, as it is edge preserving. More sophisticated methods in this class merely use the output vector field of the estimator to initialize a simulated annealing or genetic optimization algorithm using a new cost function, usually including smoothness constraints.
Integrated solutions can be expected to realize a better performance than the straightforward
representatives of the first class at a lower expense than the sophisticated processing methods. The
constraint can either be explicit, e.g., by adding a "discontinuity penalty" to the error criterion of a block-matcher:

$$ \varepsilon(\vec{C}, \vec{X}, n) = \sum_{\vec{x} \in B(\vec{X})} \left| F(\vec{x}, n) - F(\vec{x} - \vec{C}, n - 1) \right| + \alpha \cdot \left\| \vec{D}\!\left(\vec{X} - \binom{X}{0}, n\right) - \vec{C} \right\| + \beta \cdot \left\| \vec{D}\!\left(\vec{X} - \binom{0}{Y}, n\right) - \vec{C} \right\| \qquad (54.28) $$

(where the values of α and β determine the smoothness, and it is proposed to adapt their values in the neighborhood of edges in the image), or implicit through hierarchy or recursion, which will be discussed separately. Again, both classes can be combined.
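In code, such an explicitly penalized matching error might look like the sketch below: the plain SAD of a candidate is increased by its distance to the vectors already estimated for the left and upper neighboring blocks. The city-block norm and all names are illustrative choices, not prescribed by Eq. (54.28):

```python
def penalized_sad(sad_value, cand, left_vec, upper_vec, alpha=1.0, beta=1.0):
    """Matching error with a discontinuity penalty, cf. Eq. (54.28):
    the SAD is increased by the distance of the candidate to the vectors
    already assigned to the left and upper neighboring blocks."""
    def dist(a, b):
        # city-block norm, chosen here purely for illustration
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    return sad_value + alpha * dist(cand, left_vec) + beta * dist(cand, upper_vec)
```

Larger α and β bias the estimator towards the neighboring vectors, trading match accuracy for smoothness of the vector field.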
54.6.1 Hierarchical Motion Estimation
Hierarchical motion estimators realize a consistent velocity field by initializing local estimators with a global estimate, often in more than two steps. In sub-band coding terminology, a resolution pyramid is built and coarse vectors are estimated on the low-frequency band. The result is used as a prediction for a more accurate estimate at the next sub-band, which contains higher frequencies, etc. At the top of the pyramid, the signal is strongly prefiltered and sub-sampled. The bandwidth of the filter increases and the sub-sampling factors decrease, going down in the hierarchy, until the full resolution is reached on the lowest hierarchical level.
The value of the motion vector in field n at hierarchical level l, D^{i−1}(X,n,l), using logarithmic search, is found as:

$$ \vec{D}^{\,i-1}(\vec{X}, n, l) = \begin{cases} \vec{D}^{\,N}(\vec{X}, n, l-1), & (i = 1) \\[4pt] \left\{ \vec{C} \in CS^{\,i-1}(\vec{X}, n, l) \mid \varepsilon(\vec{C}, \vec{X}, n) \le \varepsilon(\vec{F}, \vec{X}, n)\;\; \forall \vec{F} \in CS^{\,i-1}(\vec{X}, n, l) \right\}, & (i > 1) \end{cases} \qquad (54.29) $$
where the search area is defined as:

$$ CS^{\,i}(\vec{X}, n, l) = \left\{ \vec{C} \mid \vec{C} = \vec{D}^{\,i-1}(\vec{X}, n, l) + \vec{U},\; U_x = 0 \vee \pm 2^{N-i} \;\wedge\; U_y = 0 \vee \pm 2^{N-i} \right\}, \quad i = 1 \ldots N,\; l = 1 \ldots L \qquad (54.30) $$
D^N(X,n,l−1) is the result vector for the block at position X in field n in the last (Nth) step of the logarithmic search, at the next higher (l − 1) hierarchical level.
The method is also referred to as multi-resolution, or multi-grid, motion estimation. The initial block size can be the total image, which prevents limitation of the consistency to parts of the picture. The inverted approach has also been published, performing block-matching on initially small blocks, which are grown to larger sizes until the minimum of the match error is considered clearly distinct. Combinations with other than the logarithmic search strategy are possible, and the hierarchical method is not limited to block-matching algorithms either.
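A toy version of such a multi-resolution estimator is sketched below for a single (global) displacement: a two-level pyramid is built by 2×2 averaging, the coarse estimate is doubled, and a small local search refines it at the finer level. Border handling is simplified by wrap-around indexing; this is an illustration under our own conventions, not the algorithm of any particular publication:

```python
def downsample(img):
    """Halve the resolution by 2x2 averaging (a crude low-pass filter
    followed by sub-sampling)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) // 4
             for x in range(w)] for y in range(h)]

def hierarchical_estimate(cur, prev, levels=2, search=1):
    """Coarse-to-fine estimation of one global displacement: the coarse
    result is doubled and refined by a +/- `search` local search at each
    finer level; image borders wrap around to keep the sketch simple."""
    pyramid = [(cur, prev)]
    for _ in range(levels - 1):
        c, p = pyramid[-1]
        pyramid.append((downsample(c), downsample(p)))
    dx = dy = 0
    for c, p in reversed(pyramid):               # coarsest level first
        dx, dy = 2 * dx, 2 * dy                  # scale up the prediction
        h, w = len(c), len(c[0])

        def cost(cx, cy):
            return sum(abs(c[y][x] - p[(y - cy) % h][(x - cx) % w])
                       for y in range(h) for x in range(w))

        dx, dy = min(((dx + ux, dy + uy)
                      for uy in range(-search, search + 1)
                      for ux in range(-search, search + 1)),
                     key=lambda v: cost(*v))
    return dx, dy
```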
FIGURE 54.12: Hierarchical block-matching. Results from an estimation process on a down-sampled image are used to initialize the next estimation process on a higher-resolution image.

Phase Plane Correlation

An important variant of two-step hierarchical motion estimation is a method called phase plane correlation (PPC). This algorithm is an extension of earlier Fourier techniques for motion estimation, which were capable of generating global displacement vectors only. In the PPC algorithm, a two-level hierarchy is proposed.

In the first hierarchical level, on fairly large blocks (typically 64 by 64), a limited number of candidate vectors, usually fewer than 10, is generated and fed to the second level. Here one of these candidate vectors is assigned as the resulting vector to a much smaller area (typically 1 by 1 up to 8 by 8 is reported) inside the large initial block.
The name of the method refers to the procedure used for generating the candidate vectors in the first step. For the block in the current field n, the Discrete Fourier Transform (DFT) of the luminance function F(x,n) will be notated as G(f,n). The so-called phase difference matrix PD(x,n) is calculated according to:

$$ PD(\vec{x}, n) = \mathcal{F}^{-1} \left[ \frac{ G(\vec{f}, n) \cdot G^{*}(\vec{f}, n-p) }{ \left| G(\vec{f}, n) \right| \cdot \left| G(\vec{f}, n-p) \right| } \right] \qquad (54.31) $$
The resulting matrix, or "correlation surface", exhibits peaks corresponding to the relative displacement of the information in the two blocks. The Fourier transformation reduces the computational complexity and enables simple filtering in the frequency domain. Most important is the significantly increased sharpness of the correlation peaks, obtained by normalizing each frequency component prior to the reverse transformation. A "peak hunting" algorithm is applied to find the largest peaks in the phase difference matrix, which correspond to the best matching candidate vectors. Sub-pixel accuracy better than a tenth of a pixel can be achieved by fitting a quadratic curve through the elements of this matrix. For interlaced video signals, p = 2 is the common choice.
The peaks in the phase plane can be applied to identify the most likely candidate vectors C(x,n) for a consecutive block-matching algorithm, which evaluates all candidates in each sub-block of the area to which the phase plane corresponds.
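The core of the candidate-generation step can be sketched as follows: the cross-power spectrum of two blocks is normalized to unit magnitude, transformed back, and the position of the strongest peak gives the dominant integer displacement. A naive O(N^4) DFT is used for brevity (a real implementation would use an FFT), and only the single largest peak is returned rather than a set of candidates; all names are our own:

```python
import cmath

def dft2(img):
    """Naive 2-D DFT; O(N^4), adequate for the small illustration blocks."""
    n = len(img)
    return [[sum(img[y][x] *
                 cmath.exp(-2j * cmath.pi * (fy * y + fx * x) / n)
                 for y in range(n) for x in range(n))
             for fx in range(n)] for fy in range(n)]

def phase_correlate(cur, prev):
    """Phase plane correlation for one block, cf. Eq. (54.31): normalize
    the cross-power spectrum, transform back, locate the highest peak."""
    n = len(cur)
    g1, g2 = dft2(cur), dft2(prev)
    norm = [[g1[fy][fx] * g2[fy][fx].conjugate() /
             ((abs(g1[fy][fx]) * abs(g2[fy][fx])) or 1.0)
             for fx in range(n)] for fy in range(n)]
    # inverse DFT of the normalized spectrum yields the correlation surface
    surface = [[sum(norm[fy][fx] *
                    cmath.exp(2j * cmath.pi * (fy * y + fx * x) / n)
                    for fy in range(n) for fx in range(n)).real / (n * n)
                for x in range(n)] for y in range(n)]
    py, px = max(((y, x) for y in range(n) for x in range(n)),
                 key=lambda pos: surface[pos[0]][pos[1]])
    # peaks beyond n/2 correspond to negative displacements (wrap-around)
    dy = py - n if py > n // 2 else py
    dx = px - n if px > n // 2 else px
    return dx, dy
```

For a block that is a pure (cyclic) shift of its predecessor, the surface is a unit impulse at the displacement, which is why the normalization sharpens the peaks so effectively.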
54.6.2 Recursive Search Block-Matching
Rather than calculating promising candidate vectors for a block-matching algorithm on a lower resolution level, or in the frequency domain on larger blocks, the recursive search block-matcher takes spatial and/or temporal "prediction vectors" from a 3-D neighborhood. This implicitly assumes spatial and/or temporal consistency. If the assumption is false, this consistency in the vector field results anyway, as there are no other candidate vectors available. As far as the predictions are concerned, there is a strong similarity with the pel-recursive algorithms, and the various options described there are globally valid here too. Figure 54.13 illustrates a proposed choice of predictions.

The most common updating process involves a single, or very few, update vectors added to either of the prediction vectors. It was suggested, for example, to apply a candidate set CS(X,n):
FIGURE 54.13: Relative position of current block and blocks from which prediction vectors can be
taken in a recursive search block-matcher.
$$ CS(\vec{X}, n) = \left\{ \vec{C} \in CS_{\max} \;\middle|\; \vec{C} = \vec{D}\!\left(\vec{X} - \binom{X}{Y}, n\right) + \vec{U}_a(\vec{X}, n) \;\vee\; \vec{C} = \vec{D}\!\left(\vec{X} - \binom{-X}{Y}, n\right) + \vec{U}_b(\vec{X}, n) \right\} \cup \left\{ \vec{D}\!\left(\vec{X} - \binom{X}{Y}, n\right),\; \vec{D}\!\left(\vec{X} - \binom{-X}{Y}, n\right),\; \vec{D}\!\left(\vec{X} + \binom{0}{2Y}, n-1\right) \right\} \qquad (54.32) $$
where the update vectors U_a(X,n) and U_b(X,n) may be alternately available and taken from a limited, fixed, integer update set, such as:
$$ US_i = \left\{ \binom{0}{0}, \binom{0}{1}, \binom{0}{-1}, \binom{0}{2}, \binom{0}{-2}, \binom{1}{0}, \binom{-1}{0}, \binom{3}{0}, \binom{-3}{0} \right\} \qquad (54.33) $$
Result vectors can have sub-pixel accuracy if the update set (also) contains fractional update values. Quarter-pel resolution, for example, is realized by adding:
$$ US_f = \left\{ \binom{0}{0.25}, \binom{0}{-0.25}, \binom{0.25}{0}, \binom{-0.25}{0} \right\} \qquad (54.34) $$
The method is very efficient and realizes, due to the inherent smoothness constraint, very coherent
and close to true-motion vector fields, most suitable for scanning format conversion.
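A much-simplified sketch of one field scan of such a recursive-search block-matcher is given below. Only the two spatial predictions (blocks above-left and above-right, each plus a random update from the integer set of Eq. (54.33)) and a zero fall-back vector are used; the temporal prediction of Eq. (54.32) is omitted for brevity, and all names are illustrative:

```python
import random

UPDATES = [(0, 0), (0, 1), (0, -1), (0, 2), (0, -2),
           (1, 0), (-1, 0), (3, 0), (-3, 0)]    # cf. Eq. (54.33)

def recursive_search(cur, prev, block=8, seed=0):
    """One field scan of a simplified recursive-search block-matcher."""
    rng = random.Random(seed)
    hb, wb = len(cur) // block, len(cur[0]) // block
    field = {}                      # (block_y, block_x) -> vector (dx, dy)

    def sad(bx, by, c):
        s = 0
        for y in range(by * block, (by + 1) * block):
            for x in range(bx * block, (bx + 1) * block):
                py, px = y - c[1], x - c[0]
                if 0 <= py < len(prev) and 0 <= px < len(prev[0]):
                    s += abs(cur[y][x] - prev[py][px])
                else:
                    s += 255        # penalty for leaving the previous field
        return s

    for by in range(hb):
        for bx in range(wb):
            cands = [(0, 0)]        # zero fall-back candidate
            for dbx in (-1, 1):     # spatial predictions from the row above
                p = field.get((by - 1, bx + dbx))
                if p is not None:
                    u = rng.choice(UPDATES)
                    cands.append((p[0] + u[0], p[1] + u[1]))
            field[(by, bx)] = min(cands, key=lambda c: sad(bx, by, c))
    return field
```

Because every block only tests a handful of candidates derived from its neighbors, the cost per block is tiny, and the smoothness of the resulting vector field is built in rather than enforced afterwards.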