de Haan, G. “Video Scanning Format Conversion and Motion Estimation”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
54
Video Scanning Format Conversion
and Motion Estimation
Gerard de Haan
Philips Research Laboratories
54.1 Introduction
54.2 Conversion vs. Standardization
54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals
Temporal Interpolation • Vertical Interpolation and Interlaced Scanning
54.4 Alternatives for Sampling Rate Conversion Theory
Simple Algorithms • Advanced Algorithms
54.5 Motion Estimation
Pel-Recursive Estimators • Block-Matching Algorithm • Search Strategies
54.6 Motion Estimation and Scanning Format Conversion
Hierarchical Motion Estimation • Recursive Search Block-Matching
References
54.1 Introduction
The scanning format of a video signal is a major determinant of general picture quality. Specifi-
cally, it determines such aspects as stationary and dynamic resolution, motion portrayal, aliasing,
scanning structure visibility, and flicker. Various formats have been designed and standardized to
strike a particular balance between quality, cost, transmission capacity, and compatibility with other
standards.
The field of video scanning format conversion is concerned with the translation of video signals
from one format into another. It consists of two basic parts: temporal interpolation and spatial
interpolation. A particular case is de-interlacing, which poses an inseparable spatio-temporal inter-
polation problem.
Vertical and temporal interpolation cause practical and fundamental difficulties in achieving high-
quality scanning format conversion. This is because the conditions of the sampling theorem are
generally not met in video signals. If they were satisfied, standard conversions of arbitrary accuracy
would be possible using suitable linear filters.
The earlier conversion methods neglected the fundamental problems and, consequently, negatively
influenced the resolution and the motion portrayal. More recent algorithms apply motion vectors to
predict the position of moving objects at unregistered temporal instances to improve the quality of
the picture at the output format. A so-called motion estimator extracts these vectors from the input
signal. The motion vectors partly solve the fundamental problems, but the demands on the motion
estimator for scanning format conversion are severe.
In this section we shall first briefly indicate why we can expect that the importance of scanning
format conversion will grow. Then we discuss in more detail the fundamental problems of temporal
interpolation of video signals. Next we provide a concise overview of the basic methods in scanning
format conversion, focused on temporal sampling rate conversion and de-interlacing. Finally, we
give an overview of motion estimation algorithms, which are crucial in the more advanced scanning
format convertors.
54.2 Conversion vs. Standardization
Scanning formats have been designed in the past to strike a particular compromise between quality,
cost, transmission capacity, and compatibility with other standards. There were three main formats
in use a decade ago: 50 Hz interlaced, 60 Hz interlaced, and 24 (or 25) Hz progressive (film). With
the arrival of video-conferencing, HDTV, workstations, and PCs, many new video formats have
appeared. These include low end formats such as CIF and QCIF with smaller picture size and lower
frame rates, progressive and interlaced HDTV formats at 50 Hz and 60 Hz, and other video formats
used on computer workstations and enhanced television displays with field rates up to 100 Hz. It will
be clear that the problem of scanning format conversion is of growing importance, despite many
attempts to globally standardize video formats.
54.3 Problems with Linear Sampling Rate Conversion Applied to
Video Signals
High-quality scanning format conversion is difficult to achieve, as the conditions of the sampling
theorem are generally not met in video signals. The solution to the Sample Rate Conversion (SRC)
problem for systems satisfying the conditions of the sampling theorem is well known for arbitrary
sampling ratios [1].
Figure 54.1 illustrates the procedure for a ratio of 2. To arrive at the double output sampling rate,
in a first step, zero-valued samples are inserted between every input pair of samples. In a second
step, a low-pass filter (LPF) at the output rate is applied to remove the first repeat spectrum from the
input data. In case of a temporal SRC, the interpolating LPF has to be a temporal LPF, i.e., a filter
including picture delays. Though feasible, this makes it a fairly expensive filter.
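The two steps above can be sketched in a few lines of Python. This is only an illustration for a one-dimensional signal: the function name `upsample_by_two`, the Hann-windowed sinc filter, and the `half_taps` parameter are assumptions made for this sketch and are not taken from the text. For temporal SRC, each sample would be a complete picture, so every filter tap would imply a picture delay.

```python
import math

def upsample_by_two(samples, half_taps=8):
    """Double the sampling rate: zero insertion followed by a low-pass filter."""
    # Step 1: insert a zero-valued sample after every input sample.
    stuffed = []
    for s in samples:
        stuffed.extend([s, 0.0])

    # Step 2: low-pass filter at the output rate, cutoff at a quarter of the
    # output sampling frequency, to remove the first repeat spectrum.
    # A Hann-windowed sinc with passband gain 2 is assumed here.
    taps = []
    for k in range(-half_taps, half_taps + 1):
        sinc = 1.0 if k == 0 else math.sin(math.pi * k / 2) / (math.pi * k / 2)
        window = 0.5 + 0.5 * math.cos(math.pi * k / (half_taps + 1))
        taps.append(sinc * window)

    # Plain convolution; samples outside the signal are treated as zero.
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for j, h in enumerate(taps):
            idx = i - (j - half_taps)
            if 0 <= idx < len(stuffed):
                acc += h * stuffed[idx]
        out.append(acc)
    return out
```

With this filter the original samples pass through essentially unchanged at the even output positions, and the odd positions receive the interpolated values.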
A more complicated, though still not fundamental, problem occurs at the signal acquisition stage.
Since scenes do occur with almost unlimited spatial and/or temporal bandwidth, the sampling theorem
requires that this signal be low-pass filtered prior to the scanning process. Interlaced scanning, as
commonly applied, even demands two-dimensional prefiltering in the vertical-temporal frequency
plane. In a video system, it is the camera that samples the scene in a vertical and temporal sense;
therefore, the prefilter has to be realized in the optical path. Although there are considerable practical
problems achieving this filtering, it would apparently bring down the problem of temporal inter-
polation of video images to the common sampling rate conversion problem. The next section will
show, however, that in addition to the practical problems there is a fundamental problem as well.
54.3.1 Temporal Interpolation
Considering the eye’s sine-wave temporal frequency response for full brightness potential and full
field display [2], as shown in Fig. 54.2, temporal prefiltering with a bandwidth of 75 Hz at first sight
seems sufficient. The fundamental problem now is that the relation shown in Fig. 54.2 holds for
FIGURE 54.1: Consecutive steps in upsampling with a factor of two.
temporal frequencies as they occur at the retina of the observer. These frequencies, however, equal
the frequencies at the display only if the eye is stationary with respect to this display. Particularly
with the eye tracking objects moving on the screen, this assumption is no longer valid. For a tracking
observer very high temporal frequencies on the screen can be transformed to much lower frequencies
or even DC at the retina. Consequently, suppression of these frequencies, with an interpolating
lowpass filter, results in excessive blurring of moving objects as will be discussed next.
Figure 54.3 shows, in a time-discrete representation, a simple object, a square, moving with a
constant velocity. Again, in this example, we consider up-sampling with a factor of two. Therefore,
the true position of the object is available at every second temporal position only (e.g., the odd
numbered samples). The “tracking observer” views along the motion trajectory, represented with a
line in the illustration, which results in a stationary image of the object on the retina. If the output
field sampling frequency exceeds the cutoff temporal frequency of the human visual system,¹ the
viewer will have the illusion that the object is continuously present.
Therefore, the object is actually seen at a position corresponding with the motion trajectory. If
now, e.g., in the 6th output field, the object is interpolated according to SRC theory, weighted copies
of the object from surrounding fields resulting from the interpolating LPF are displayed. Figure 54.3
illustrates the case of a symmetrical transversal lowpass filter. In this situation, the viewer sees the
object at the correct position but also various attenuated and displaced copies (the impulse response
of the interpolating temporal filter) of the object in a neighborhood. The attenuation depends on the
coefficients of the interpolating filter, and the distance between the copies is related to the displacement
of the moving object in a field period. For the object-tracking observer, therefore, the temporal LPF
is transformed into a spatial LPF. For an object velocity of one pixel per field period (one pel/field),
its frequency characteristic equals the temporal frequency characteristic of the interpolating LPF.²
A velocity of 1 pel/field is slow motion: in broadcast picture material, velocities exceeding
16 pel/field do occur. Thus, the spatial blur caused by the SRC process becomes unacceptable even
for moderate object velocities.

¹ Actually, a picture update frequency as low as 16 Hz may suffice to guarantee smooth perceived motion (see, e.g., [3]).
The higher display rates are merely necessary to prevent the annoying large area flicker.

FIGURE 54.2: The contrast sensitivity of the human observer (y-axis) for large areas of uniform
brightness, as a function of the temporal frequency (x-axis).
FIGURE 54.3: The effect of temporal interpolation for an object-tracking observer. The field numbers
are counted at the output field rate.
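The tracking-observer argument can be checked numerically with a small sketch. The setup below is an assumption chosen for illustration: a single bright pixel on one image line, input fields at output field numbers 1 and 3, and the simplest symmetrical two-tap temporal filter (the average of the neighboring input fields) for the interpolated field 2. The function names are hypothetical.

```python
def single_pixel_field(position, width):
    """One image line containing a single bright pixel."""
    return [1.0 if x == position else 0.0 for x in range(width)]

def tracked_copies(velocity, start=8, width=32):
    """Return {retinal position: amplitude} for the interpolated field 2.

    The object sits at screen position start + n * velocity in field n.
    An observer tracking the object sees screen position x of field n at
    retinal position x - n * velocity, so the true object maps to `start`.
    """
    field1 = single_pixel_field(start + 1 * velocity, width)
    field3 = single_pixel_field(start + 3 * velocity, width)
    # Two-tap temporal low-pass filter: average of the neighboring fields.
    field2 = [0.5 * (a + b) for a, b in zip(field1, field3)]

    copies = {}
    for x, value in enumerate(field2):
        if value > 0.0:
            copies[x - 2 * velocity] = value
    return copies
```

For a velocity of 4 pel/field the observer sees two half-amplitude copies, 4 pixels to either side of the position where the object is expected. The temporal filter taps thus reappear as a spatial impulse response whose tap spacing equals the displacement per field, so a larger velocity spreads the copies further apart.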
54.3.2 Vertical Interpolation and Interlaced Scanning
Much as in the case of field rate conversion, it may seem that sequential scan conversion is
an up-sampling problem for which SRC theory provides an adequate solution. However, straightforward
one-dimensional up-sampling in the vertical frequency domain is incorrect, as the data is
clearly sub-Nyquist sampled due to interlace.
If, more correctly, the sequential scan conversion is considered as a two-dimensional up-sampling
problem in the vertical-temporal frequency domain, we arrive at a discussion similar to the one
in Section 54.3.1: the problem cannot be solved as we do not know the temporal frequency at the
retina of a movement-tracking observer. It is possible to disregard this problem and to perform a
two-dimensional SRC, implicitly assuming a stationary viewer and prefiltered information. Such
systems were described and have been implemented for studio applications. With the older image
pick-up tubes the results can be satisfactory, as these devices have a poor dynamic resolution. When
modern (CCD) cameras are used, however, the limitations of the assumptions become obvious.

² It is assumed here that both filters are normalized to their respective sampling frequency.
54.4 Alternatives for Sampling Rate Conversion Theory
With the problem of linear interpolation of video signals clarified, we will discuss alternative algo-
rithms developed over time. These algorithms fall into two categories. A first category simplifies
the interpolation filter prescribed by SRC-theory, considering that a completely correct solution is
impossible anyway. The resulting “simple algorithms” are more attractive for hardware realization
than the method from which they are derived and under certain conditions can perform quite similarly.
The second category includes the most “advanced algorithms” for scanning format conversion.
These methods can be characterized by their common attempt to interpolate the 3-D image data in
the direction in which the correlation is highest. The difference between the various options lies
mainly in the number of possible directions, and dimensions, which are considered. The imple-
mentation can show various linear interpolation filters controlled by one or more detectors, or a
multi-dimensional nonlinear filter that has an inherent edge adaptivity. As this description allows a
large number of algorithms, we will illustrate it with some important examples.
54.4.1 Simple Algorithms
SRC-theory in the temporal and vertical frequency domain is not applicable due to the missing
prefilter in common video systems. A sophisticated linear interpolation filter therefore makes little
sense. Any interpolating (spatio-)temporal low-pass filter will suppress original temporal frequency
components as well as aliased signal components, as they occupy, by definition, the same spectrum.
As the first effect is desired and the second not, the transfer function of the filter strikes a compromise
between alias and blurring. Repetition of the most recent sample in this sense is optimal for the
dynamic resolution and worst for alias. A strong temporal low-pass filter suppresses much (not
necessarily all) alias and yields a poor dynamic resolution. The annoyance of the temporal alias
depends on the input and output picture frequency, and particularly their difference. In the easiest
case, both frequencies are high and their difference 50 Hz or more. In the worst case, input and
output picture rate are low and their difference in the order of 10 Hz. In case of an annoying beat
frequency, an interpolating LPF usually improves picture quality, otherwise the best compromise is
closer to repetition of the most recent sample.
54.4.2 Advanced Algorithms
As indicated before, these methods are characterized by their common attempt to interpolate the 3-D
image data in the direction in which the correlation is highest. To this end they either have an explicit
or implicit detector to find this direction. In case of (1-D) temporal interpolation the explicit detector
is usually called a motion detector, for 2-D spatial interpolation it is called an edge detector, while
the most advanced device estimating the optimal spatio-temporal (3-D) interpolation direction is
usually called a motion estimator. The interpolation filter can be recursive or transversal, and can
have any number of taps, but a transversal filter with one or two taps is the most common choice.
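For a two-tap FIR approach we can write the interpolated video signal $F_{int}$, in picture $n$, at spatial
position $\mathbf{x} = (x, y)^T$, as a function of the input video signal $F(\mathbf{x}, n)$:

$$F_{int}(\mathbf{x}, n) = 0.5 \left[ F\left(\mathbf{x} + \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}, n + \delta_3\right) + F\left(\mathbf{x} - \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}, n - \delta_3\right) \right] \tag{54.1}$$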
In this terminology a motion detector controls $\delta_3$, an edge detector $\delta_1$ and $\delta_2$, while a motion
estimator can be applied to determine $\delta_1$, $\delta_2$, and $\delta_3$.
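A direct transcription of Eq. (54.1) may help to make the role of the three offsets concrete. The sketch below assumes the video is stored as a nested list indexed `video[n][y][x]` (picture, line, pixel) and that the offsets are integers. The function name and the storage layout are illustrative assumptions, and no boundary handling is included.

```python
def two_tap_interpolate(video, x, y, n, d1, d2, d3):
    """Eq. (54.1): average two samples placed symmetrically around (x, y, n).

    d1 and d2 are the horizontal and vertical offsets, d3 the temporal one.
    The caller must keep all indices inside the stored video.
    """
    forward = video[n + d3][y + d2][x + d1]
    backward = video[n - d3][y - d2][x - d1]
    return 0.5 * (forward + backward)
```

As an example, consider a pattern that moves one pixel to the right per picture. With `d1 = 1, d2 = 0, d3 = 1` both taps sample the same point of the moving pattern, whereas the purely temporal choice `d1 = d2 = 0, d3 = 1` averages two different points of it.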
Algorithms with a Motion Detector
To detect motion, the difference between two successive pictures is calculated. It is too simple,
however, to expect this signal to become zero in a picture part without moving objects. The common
problems with the detection are noise and alias. Additional problems occurring in some systems are
color subcarriers causing nonstationarities in colored regions, interlace causing nonstationarities in
vertically detailed picture parts, and timing jitter of the sampling clock which is particularly harmful
in detailed areas.
All these problems imply that the output of the motion detector usually is not a binary, but rather
a multi-level signal, indicating the probability of motion. Usual (but not always valid) assumptions
made to improve the detector are:
1. Noise is small and signal is large.
2. The spectrum part around the color carrier carries no motion information.
3. Low-frequency energy in the signal is larger than in the noise and alias.
4. Moving objects are large compared to a pixel.
The general structure of the motion detector resulting from these assumptions is depicted in
Figure 54.4. As can be seen, the difference signal is first low-pass (and carrier reject) filtered to profit
from assumptions 2 and 3. This filtering also makes the detector less “nervous” for timing jitter in
detailed areas. After the rectification another low-pass filter improves the consistency of the motion
signal, based on assumption 4. Finally, the nonlinear (but monotonic) transfer function in the last block
translates the signal into a probability figure for the motion, $P_m$, using assumption 1. This last function
may have to be adapted to the expected noise level. The low-pass filters are not necessarily linear. More
than one detector can be used, working on more than just two pictures in the neighborhood of the current
image, and a logical or linear combination of their outputs may lead to a more reliable indication of
motion.

FIGURE 54.4: General structure of a motion detector.
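The chain of blocks in the motion detector can be sketched for a single image line as follows. The 3-tap [1, 2, 1]/4 low-pass filter, the piecewise-linear transfer function, and the parameters `noise_level` and `full_motion` are assumptions made for this illustration; the text does not prescribe particular filters or thresholds, and a carrier-reject filter is left out.

```python
def lowpass(line):
    """3-tap [1, 2, 1] / 4 low-pass filter; edge samples are repeated."""
    last = len(line) - 1
    return [0.25 * line[max(i - 1, 0)] + 0.5 * line[i] + 0.25 * line[min(i + 1, last)]
            for i in range(len(line))]

def motion_probability(current, previous, noise_level=4.0, full_motion=32.0):
    """Return a multi-level motion signal in [0, 1] for every pixel of a line."""
    # Difference between two successive pictures.
    diff = [c - p for c, p in zip(current, previous)]
    # First low-pass filter: suppress noise and alias before rectification.
    filtered = lowpass(diff)
    # Rectification.
    rectified = [abs(v) for v in filtered]
    # Second low-pass filter: moving objects are large compared to a pixel.
    smoothed = lowpass(rectified)
    # Nonlinear but monotonic transfer function: values below noise_level map
    # to 0, values above full_motion map to 1, with a linear ramp in between.
    span = full_motion - noise_level
    return [min(max((v - noise_level) / span, 0.0), 1.0) for v in smoothed]
```

Small differences that stay below `noise_level` yield a motion signal of zero, while a large displaced edge drives the signal to one in its neighborhood.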
The motion detector (MD) is applied to switch or fade between two processing modes, one of
which is optimal for stationary and the other for moving image parts. Examples are:
• De-interlacing. The MD fades between intra-field interpolation (line-averaging, or edge-dependent
spatial interpolation) and inter-field interpolation (repetition of the previous
field, averaging of neighboring fields, etc.).
• Field rate doubling on interlaced video. The MD fades between repetition of fields (best
dynamic resolution without motion compensation for moving picture parts) and repetition
of frames (best spatial resolution in stationary image parts).
To slightly elaborate on the first example of de-interlacing, we define the interpolated pixel
$X_m(\mathbf{x}, n)$ in a moving picture part as:

$$X_m(\mathbf{x}, n) = 0.5 \left[ F\left(\mathbf{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) + F\left(\mathbf{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) \right] \tag{54.2}$$

while for stationary picture parts the interpolated pixel $X_s(\mathbf{x}, n)$ is taken as:

$$X_s(\mathbf{x}, n) = F(\mathbf{x}, n - 1) \tag{54.3}$$

and taking the probability of motion $P_m$ from the motion detector into account, the output is given
by:

$$F_{int}(\mathbf{x}, n) = P_m X_m(\mathbf{x}, n) + (1 - P_m) X_s(\mathbf{x}, n) \tag{54.4}$$

In most practical cases the output $P_m$ has a nonlinear relation with the actual probability.
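Equations (54.2) to (54.4) translate almost literally into code. The sketch below assumes that both fields are stored on the full frame grid as nested lists indexed `[y][x]`, that line `y` is missing in the current field but present in the previous one, and that `p_m` is supplied by a motion detector. The function name and storage layout are illustrative assumptions.

```python
def deinterlace_pixel(current, previous, x, y, p_m):
    """Motion-adaptive interpolation of the missing pixel at (x, y)."""
    # Eq. (54.2): line averaging within the current field (moving parts).
    x_m = 0.5 * (current[y - 1][x] + current[y + 1][x])
    # Eq. (54.3): repetition of the previous field (stationary parts).
    x_s = previous[y][x]
    # Eq. (54.4): fade between both modes with the motion probability.
    return p_m * x_m + (1.0 - p_m) * x_s
```

With `p_m = 1` the result is pure line averaging, with `p_m = 0` it is pure field repetition, and intermediate values fade between the two.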
Algorithms with an Edge Detector
To detect the orientation of a spatial edge, usually the differences between pairs of spatially
neighboring pixels are calculated. Again it is a bit unrealistic to expect that a zero difference is a
reliable indication of a spatial direction in which the signal is stationary. The same problems (noise,
alias, carriers, timing-jitter) occur as with motion detection. The edge detector (ED) is applied to
switch or fade between at least two but usually more processing modes, each of them optimal for
interpolation of a certain orientation of the spatial edge. Examples are:
• De-interlacing. The ED fades between vertical line-averaging and diagonal averaging
(±45°, or even more angles).
• Up-conversion to a higher resolution format. A simple bi-linear interpolation filter is
applied with its coefficients adapted to the output of the edge detector.
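The de-interlacing example can be illustrated with a deliberately simplified sketch. Instead of fading between modes, it switches hard to the direction (vertical or one of the two diagonals) whose pixel pair shows the smallest absolute difference. This hard switch, the function name, and the restriction to three directions are assumptions made for illustration; as noted above, a small difference is not a fully reliable indication of edge orientation.

```python
def edge_adaptive_pixel(upper, lower, x):
    """Interpolate the missing pixel between the lines `upper` and `lower`.

    The caller must ensure 1 <= x <= len(upper) - 2 so that both diagonal
    neighbors exist.
    """
    candidates = [
        (upper[x], lower[x]),          # vertical line averaging
        (upper[x - 1], lower[x + 1]),  # one diagonal direction
        (upper[x + 1], lower[x - 1]),  # the opposite diagonal direction
    ]
    # Pick the pair with the smallest absolute difference; on a tie the
    # vertical direction wins because it is listed first.
    a, b = min(candidates, key=lambda pair: abs(pair[0] - pair[1]))
    return 0.5 * (a + b)
```

On a diagonal edge the selected pair lies along the edge, so the interpolated line keeps a sharp transition where plain vertical averaging would produce an intermediate gray value.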
FIGURE 54.5: Identification of pixels as applied for direction dependent spatial interpolation.