Digital Signal Processing Handbook, Chapter 55

Osama Al-Shaykh, Ralph Neff, David Taubman, and Avideh Zakhor, "Video Sequence Compression," in Digital Signal Processing Handbook, CRC Press LLC, 2000.
Video Sequence Compression

Osama Al-Shaykh, University of California, Berkeley
Ralph Neff, University of California, Berkeley
David Taubman, Hewlett Packard
Avideh Zakhor, University of California, Berkeley
55.1 Introduction
55.2 Motion Compensated Video Coding
    Motion Estimation and Compensation · Transformations · Discussion · Quantization · Coding of Quantized Symbols
55.3 Desirable Features
    Scalability · Error Resilience
55.4 Standards
    H.261 · MPEG-1 · MPEG-2 · H.263 · MPEG-4
Acknowledgment
References
The image and video processing literature is rich with video compression algorithms. This chapter overviews the basic blocks of most video compression systems, discusses some important features required by many applications, e.g., scalability and error resilience, and reviews the existing video compression standards such as H.261, H.263, MPEG-1, MPEG-2, and MPEG-4.
55.1 Introduction
Video sources produce data at very high bit rates. In many applications, the available bandwidth is usually very limited. For example, the bit rate produced by a 30 frame/s color common intermediate format (CIF) (352 × 288) video source is 73 Mbits/s. In order to transmit such a sequence over a 64 Kbits/s channel (e.g., an ISDN line), we need to compress the video sequence by a factor of 1140. A simple approach is to subsample the sequence in time and space. For example, if we subsample both chroma components by 2 in each dimension, i.e., 4:2:0 format, and the whole sequence temporally by 4, the bit rate becomes 9.1 Mbits/s. However, to transmit the video over a 64 Kbits/s channel, it is necessary to compress the subsampled sequence by another factor of 143. To achieve such high compression ratios, we must tolerate some distortion in the subsampled frames.
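The arithmetic above can be checked in a few lines (a sketch; the 73 Mbits/s figure assumes 24 bits/pixel, i.e., full 4:4:4 color at 30 frame/s):

```python
# Back-of-the-envelope bit-rate arithmetic for the CIF example in the text.
CIF_W, CIF_H = 352, 288

raw_rate = CIF_W * CIF_H * 24 * 30       # bits/s at 24 bpp, 30 frame/s
print(raw_rate / 1e6)                    # ~73 Mbits/s

channel = 64e3                           # 64 Kbits/s ISDN channel
print(round(raw_rate / channel))         # ~1140x compression needed

# 4:2:0 (chroma subsampled by 2 in each dimension -> 12 bpp) and
# temporal subsampling by 4 (7.5 frame/s):
sub_rate = CIF_W * CIF_H * 12 * (30 / 4)
print(sub_rate / 1e6)                    # ~9.1 Mbits/s
print(round(sub_rate / channel))         # ~143x still required
```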
Compression can be either lossless (reversible) or lossy (irreversible). A compression algorithm is lossless if the signal can be reconstructed exactly from the compressed information; otherwise it is lossy. The compression performance of any lossy algorithm is usually described in terms of its rate-distortion curve, which represents the potential trade-off between the bit rate and the distortion associated with the lossy representation. The primary goal of any lossy compression algorithm is to optimize the rate-distortion curve over some range of rates or levels of distortion.

© 1999 by CRC Press LLC

For video applications, rate
is usually expressed in terms of bits per second. The distortion is usually expressed in terms of the
peak-signal-to-noise ratio (PSNR) per frame or, in some cases, measures that try to quantify the
subjective nature of the distortion.
In addition to good compression performance, many other properties may be important or even
critical to the applicability of a given compression algorithm. Such properties include robustness
to errors in the compressed bit stream, low complexity encoders and decoders, low latency require-
ments, and scalability. Developing scalable video compression algorithms has attracted considerable
attention in recent years. Generally speaking, scalability refers to the potential to effectively decom-
press subsets of the compressed bit stream in order to satisfy some practical constraint, e.g., display
resolution, decoder computational complexity, and bit rate limitations.
The demand for compatible video encoders and decoders has resulted in the development of different video compression standards. The International Organization for Standardization (ISO) has developed MPEG-1 to store video on compact discs, MPEG-2 for digital television, and MPEG-4 for a wide range of applications including multimedia. The International Telecommunication Union (ITU) has developed H.261 for video conferencing and H.263 for video telephony.
All existing video compression standards are hybrid systems. That is, the compression is achieved
in two main stages. The first stage, motion compensation and estimation, predicts each frame from
its neighboring frames, compresses the prediction parameters, and produces the prediction error
frame. The second stage codes the prediction error. All existing standards use a block-based discrete cosine transform (DCT) to code the residual error. In addition to the DCT, other non-block-based coders, e.g., wavelets and matching pursuits, can also be used.
In this chapter, we will provide an overview of hybrid video coding systems. In Section 55.2, we discuss the main parts of a hybrid video coder. This includes motion compensation, signal decompositions and transformations, quantization, and entropy coding. We compare various transformations such as the DCT, subband decompositions, and matching pursuits. In Section 55.3, we discuss scalability and error resilience in video compression systems. We also describe a non-hybrid video coder that provides scalable bit streams [28]. Finally, in Section 55.4, we review the key video compression standards: H.261, H.263, MPEG-1, MPEG-2, and MPEG-4.
55.2 Motion Compensated Video Coding
Virtually all video compression systems identify and reduce four basic types of video data redun-
dancy: inter-frame (temporal) redundancy, interpixel redundancy, psychovisual redundancy, and
coding redundancy. Figure 55.1 shows a typical diagram of a hybrid video compression system.
First, the current frame is predicted from previously decoded frames by estimating the motion of blocks or objects, thus reducing the inter-frame redundancy. Afterwards, to reduce the interpixel redundancy, the residual error left after frame prediction is transformed to another format or domain such that the energy of the new signal is concentrated in a few components and these components are as uncorrelated as possible. The transformed signal is then quantized according to the desired
compression performance (subjective or objective). The quantized transform coefficients are then
mapped to codewords that reduce the coding redundancy. The rest of this section will discuss the
blocks of the hybrid system in more detail.
55.2.1 Motion Estimation and Compensation
Neighboring frames in typical video sequences are highly correlated. This inter-frame (temporal)
redundancy can be significantly reduced to produce a more compressible sequence by predicting
each frame from its neighbors. Motion compensation is a nonlinear predictive technique in which
the feedback loop contains both the inverse transformation and the inverse quantization blocks, as
FIGURE 55.1: Motion compensated coding of video.
shown in Fig. 55.1.
Most motion compensation techniques divide the frame into regions, e.g., blocks. Each region
is then predicted from the neighboring frames. The displacement of the block or region, d, is not
fixed and must be encoded as side information in the bit stream. In some cases, more elaborate prediction models, e.g., affine transformations, are used to predict regions. These prediction parameters should also be encoded in the bit stream.
To minimize the amount of side information, which must be included in the bit stream, and to simplify the encoding process, motion estimation is usually block based. That is, every pixel $\vec{i}$ in a given rectangular block is assigned the same motion vector, $\vec{d}$. Block-based motion estimation is an integral part of all existing video compression standards.
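As a concrete illustration (a sketch, not taken from any particular standard), a minimal exhaustive-search block matcher might look like the following; the function name, block size, and search range are illustrative choices:

```python
# Full-search block matching: for each block of the current frame, find the
# displacement d within a +/- `search` window of the reference frame that
# minimizes the sum of absolute differences (SAD).
import numpy as np

def full_search(cur, ref, block=8, search=7):
    """Return per-block motion vectors (dy, dx) and the prediction frame."""
    H, W = cur.shape
    mvs = np.zeros((H // block, W // block, 2), dtype=int)
    pred = np.zeros_like(cur)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            target = cur[by:by + block, bx:bx + block].astype(int)
            best, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue  # candidate block falls outside the frame
                    cand = ref[y:y + block, x:x + block].astype(int)
                    sad = np.abs(target - cand).sum()
                    if best is None or sad < best:
                        best, best_d = sad, (dy, dx)
            mvs[by // block, bx // block] = best_d
            dy, dx = best_d
            pred[by:by + block, bx:bx + block] = ref[by + dy:by + dy + block,
                                                     bx + dx:bx + dx + block]
    return mvs, pred
```

The residual `cur - pred` is what the second (transform) stage of the hybrid coder encodes, and the vectors in `mvs` are the side information.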
55.2.2 Transformations
Most image and video compression schemes apply a transformation to the raw pixels or to the residual
error resulting from motion compensation before quantizing and coding the resulting coefficients.
The function of the transformation is to represent the signal in a few uncorrelated components. The
most common transformations are linear transformations, i.e., the multi-dimensional sequence of input pixel values, $f[\vec{i}]$, is represented in terms of the transform coefficients, $t[\vec{k}]$, via

$$f[\vec{i}] = \sum_{\vec{k}} t[\vec{k}]\, w_{\vec{k}}[\vec{i}] \tag{55.1}$$

for some $w_{\vec{k}}[\vec{i}]$. The input image is thus represented as a linear combination of basis vectors, $w_{\vec{k}}$.
It is important to note that the basis vectors need not be orthogonal. They only need to form an
over-complete set (matching pursuits), a complete set (DCT and some subband decompositions), or
very close to complete (some subband decompositions). This is important since the coder should be
able to code a variety of signals. The remainder of the section discusses and compares DCT, subband
decompositions, and matching pursuits.
The DCT
There are two properties desirable in a unitary transform for image compression: the energy
should be packed into a few transform coefficients, and the coefficients should be as uncorrelated
as possible. The optimum transform under these two constraints is the Karhunen-Loève transform (KLT), where the eigenvectors of the covariance matrix of the image are the vectors of the transform [10]. Although the KLT is optimal under these two constraints, it is data-dependent and expensive to compute. The discrete cosine transform (DCT) performs very close to the KLT, especially when the input is a first-order Markov process [10].
The DCT is a block-based transform. That is, the signal is divided into blocks, which are indepen-
dently transformed using orthonormal discrete cosines. The DCT coefficients of a one-dimensional
signal, f , are computed via

$$t_{\mathrm{DCT}}[Nb+k] = \frac{1}{\sqrt{N}}\begin{cases}\displaystyle\sum_{i=0}^{N-1} f[Nb+i], & k = 0\\[2ex]\displaystyle\sum_{i=0}^{N-1} \sqrt{2}\, f[Nb+i]\cos\frac{(2i+1)k\pi}{2N}, & 1 \le k < N\end{cases}\quad \forall b \tag{55.2}$$

where N is the size of the block and b denotes the block number.
The orthonormal basis vectors associated with the one-dimensional DCT transformation of
Eq. (55.2) are

$$w^{\mathrm{DCT}}_{k}[i] = \frac{1}{\sqrt{N}}\begin{cases}1, & k = 0,\ 0 \le i < N\\[1ex]\sqrt{2}\cos\frac{(2i+1)k\pi}{2N}, & 1 \le k < N,\ 0 \le i < N\end{cases} \tag{55.3}$$
Figure 55.2(a) shows these basis vectors for N = 8.
FIGURE 55.2: DCT basis vectors (N = 8): (a) one-dimensional and (b) separable two-dimensional.
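The basis vectors of Eq. (55.3) are easy to generate and check numerically; the following sketch builds them for N = 8 and verifies that they form an orthonormal set, so the forward transform of Eq. (55.2) is simply an inner product with each basis vector:

```python
# Build the N-point DCT basis of Eq. (55.3) and verify orthonormality.
import numpy as np

def dct_basis(N=8):
    i = np.arange(N)
    W = np.zeros((N, N))
    W[0] = 1.0 / np.sqrt(N)                                   # k = 0 (DC)
    for k in range(1, N):
        W[k] = np.sqrt(2.0 / N) * np.cos((2 * i + 1) * k * np.pi / (2 * N))
    return W

W = dct_basis(8)
print(np.allclose(W @ W.T, np.eye(8)))   # True: the basis is orthonormal
f = np.random.randn(8)
t = W @ f                                 # forward DCT, Eq. (55.2)
print(np.allclose(W.T @ t, f))            # True: perfect inversion
```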
The one-dimensional DCT described above is usually separably extended to two dimensions for
image compression applications. In this case, the two-dimensional basis vectors are formed by the
tensor product of one-dimensional DCT basis vectors and are given by
$$w^{\mathrm{DCT}}_{\vec{k}}[\vec{i}] = w^{\mathrm{DCT}}_{k_1,k_2}[i_1,i_2] = w^{\mathrm{DCT}}_{k_1}[i_1]\cdot w^{\mathrm{DCT}}_{k_2}[i_2]; \quad 0 \le k_1, k_2, i_1, i_2 < N$$
Figure 55.2(b) shows the two-dimensional basis vectors for N = 8.
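The separable construction means a two-dimensional DCT can be computed by applying the one-dimensional transform to the rows and then to the columns; a sketch (the basis construction follows Eq. (55.3)):

```python
# Separable 2-D DCT of an N x N block via the 1-D orthonormal basis.
import numpy as np

def dct_basis(N=8):
    i = np.arange(N)
    W = np.zeros((N, N))
    W[0] = 1.0 / np.sqrt(N)
    for k in range(1, N):
        W[k] = np.sqrt(2.0 / N) * np.cos((2 * i + 1) * k * np.pi / (2 * N))
    return W

def dct2(block):
    W = dct_basis(block.shape[0])
    return W @ block @ W.T      # transform rows, then columns

def idct2(coef):
    W = dct_basis(coef.shape[0])
    return W.T @ coef @ W       # inverse: transpose of an orthonormal matrix

block = np.random.randn(8, 8)
print(np.allclose(idct2(dct2(block)), block))   # True: exact round trip
```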
The DCT is the most common transform in video compression. It is used in the JPEG still image
compression standard, and all existing video compression standards. This is because it performs
reasonably well at different bit rates. Moreover, there are fast algorithms and special hardware chips
to compute the DCT efficiently.
The major objection to the DCT in image or video compression applications is that the non-overlapping blocks of basis vectors, $w_{\vec{k}}$, are responsible for distinctly "blocky" artifacts in the decompressed frames, especially at low bit rates. This is because the transform coefficients of each block are quantized independently of neighboring blocks. Overlapped DCT representations address this problem [15]; however, the common solution is to post-process the frame by smoothing the block boundaries [18, 22].
Due to bit rate restrictions, some blocks are represented by only one or a small number of coarsely quantized transform coefficients, and hence the decompressed block will consist of only these basis vectors. This causes artifacts commonly known as ringing and mosquito noise.

Figure 55.8(b) shows frame 250 of the 15 frame/s CIF Coast-guard sequence coded at 112 Kbits/s using a DCT hybrid video coder.¹ This figure provides a good illustration of the "blocking" artifacts.
Subband Decomposition

The basic idea of subband decomposition is to split the frequency spectrum of the image into
(disjoint) subbands. This is efficient when the image spectrum is not flat and is concentrated in a few
subbands, which is usually the case. Moreover, we can quantize the subbands differently according
to their visual importance.
As for the DCT, we begin our discussion of subband decomposition by considering only a one-
dimensional source sequence, f[i]. Figure 55.3 provides a general illustration of an N-band one-
dimensional subband system. We refer to the subband decomposition itself as analysis and to the
FIGURE 55.3: 1D, N-band subband analysis and synthesis block diagrams. (Source: Taubman, D.,
Chang, E., and Zakhor, A., Directionality and scalability in subband image and video compression,
in Image Technology: Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz,
Ed., Springer-Verlag, New York, 1996. With permission).
inverse transformation as synthesis. The transform coefficients of bands $1, 2, \ldots, N$ are denoted by the sequences $u_1[k], u_2[k], \ldots, u_N[k]$, respectively. For notational convenience and consistency with the DCT formulation above, we write $t_{\mathrm{SB}}[\cdot]$ for the sequence of all subband coefficients, arranged

¹ It is coded using H.263 [3], which is an ITU standard.
according to $t_{\mathrm{SB}}[(\beta-1)+Nk] = u_{\beta}[k]$, where $1 \le \beta \le N$ is the subband number. These coefficients are generated by filtering the input sequence with filters $H_1, \ldots, H_N$ and downsampling the filtered sequences by a factor of N, as depicted in Fig. 55.3. In subband synthesis, the coefficients for each band are upsampled, interpolated with the synthesis filters, $G_1, \ldots, G_N$, and the results summed to form a reconstructed sequence, $\tilde{f}[i]$, as depicted in Fig. 55.3.
If the reconstructed sequence, $\tilde{f}[i]$, and the source sequence, $f[i]$, are identical, then the subband system is referred to as perfect reconstruction (PR) and the corresponding basis set is a complete basis set. Although perfect reconstruction is a desirable property, near perfect reconstruction (NPR), for which subband synthesis is only approximately the inverse of subband analysis, is often sufficient in practice. This is because the distortion introduced by quantization of the subband coefficients, $t_{\mathrm{SB}}[k]$, usually dwarfs that introduced by an imperfect synthesis system.
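A minimal two-band example makes the analysis/synthesis round trip concrete. The sketch below uses the simplest (Haar) filter pair, not the filters of Fig. 55.5, and achieves perfect reconstruction:

```python
# Two-band Haar subband analysis/synthesis: filter-and-downsample into a
# low band (averages) and a high band (differences), then invert exactly.
import numpy as np

def haar_analysis(f):
    f = np.asarray(f, dtype=float)          # assumes even length
    lo = (f[0::2] + f[1::2]) / np.sqrt(2)   # u1[k]: local averages
    hi = (f[0::2] - f[1::2]) / np.sqrt(2)   # u2[k]: local differences
    return lo, hi

def haar_synthesis(lo, hi):
    f = np.empty(2 * len(lo))
    f[0::2] = (lo + hi) / np.sqrt(2)
    f[1::2] = (lo - hi) / np.sqrt(2)
    return f

f = np.random.randn(16)
lo, hi = haar_analysis(f)
print(np.allclose(haar_synthesis(lo, hi), f))   # True: perfect reconstruction
```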
The filters, $H_1, \ldots, H_N$, are usually designed to have band-pass frequency responses, as indicated in Fig. 55.4, so that the coefficients $u_{\beta}[k]$ for each subband, $1 \le \beta \le N$, represent different spectral components of the source sequence.
FIGURE 55.4: Typical analysis filter magnitude responses. (Source: Taubman, D., Chang, E., and Za-
khor, A., Directionalityand scalability insubband imageand video compression, inImageTechnology:
Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz, Ed., Springer-Verlag,
New York, 1996. With permission).
The basis vectors for subband decomposition are the N-translates of the impulse responses, $g_1[i], \ldots, g_N[i]$, of the synthesis filters $G_1, \ldots, G_N$. Specifically, denoting the kth basis vector associated with subband $\beta$ by $w^{\mathrm{SB}}_{Nk+\beta-1}$, we have

$$w^{\mathrm{SB}}_{Nk+\beta-1}[i] = g_{\beta}[i - Nk] \tag{55.4}$$

Figure 55.5 illustrates five of the basis vectors for a particularly simple, yet useful, two-band PR
subband decomposition, with symmetric FIR analysis and synthesis impulse responses. As shown in
Fig. 55.5 and in contrast with the DCT basis vectors, the subband basis vectors overlap.
As for the DCT, one-dimensional subband decompositions may be separably extended to higher
dimensions. By this we mean that a one-dimensional subband decomposition is first applied along
one dimension of an image or video sequence. Any or all of the resulting subbands are then further
decomposed into subbands along another dimension and so on. Figure 55.6 depicts a separable two-
dimensional subband system. For video compression applications, the prediction error is sometimes
decomposed into subbands of equal size.
Two-dimensional subband decompositions have the advantage that they do not suffer from the
disturbing blocking artifacts exhibited by the DCT at high compression ratios. Instead, the most
noticeable quantization-induced distortion tends to be ‘ringing’ or ‘rippling’ artifacts, which become
most bothersome in the vicinity of image edges. Figures 55.11(c) and 55.8(c) clearly show this
effect. Figure 55.11 shows frame 210 of the Ping-pong sequence compressed using a scalable, three-
dimensional subband coder [28] at 1.5 Mbits/s, 300 Kbits/s, and 60 Kbits/s. As the bit rate decreases,
we notice loss of detail and introduction of more ringing noise. Figure 55.8(c) shows frame 250 of
the Coast-guard sequence compressed at 112 Kbits/s using a zerotree scalable coder [16]. The edges
of the trees and the boat are affected by ringing noise.
c

1999 by CRC Press LLC
FIGURE 55.5: Subband basis vectors with $N = 2$, $h_1[-2 \ldots 2] = \sqrt{2}\cdot(-\tfrac{1}{8}, \tfrac{1}{4}, \tfrac{3}{4}, \tfrac{1}{4}, -\tfrac{1}{8})$, $h_2[-2 \ldots 0] = \sqrt{2}\cdot(-\tfrac{1}{4}, \tfrac{1}{2}, -\tfrac{1}{4})$, $g_1[-1 \ldots 1] = \sqrt{2}\cdot(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4})$, and $g_2[-1 \ldots 3] = \sqrt{2}\cdot(-\tfrac{1}{8}, -\tfrac{1}{4}, \tfrac{3}{4}, -\tfrac{1}{4}, -\tfrac{1}{8})$. $h_i$ and $g_i$ are the impulse responses of the $H_i$ (analysis) and $G_i$ (synthesis) filters, respectively. (Source: Taubman, D., Chang, E., and Zakhor, A., Directionality and scalability in subband image and video compression, in Image Technology: Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz, Ed., Springer-Verlag, New York, 1996. With permission).
Matching Pursuit
Representing a signal using an over-complete basis set implies that there is more than one
representation for the signal. For coding purposes, we are interested in representing the signal with
the fewest basis vectors. This is an NP-complete problem [14]. Different approaches have been
investigated to find or approximate the solution. Matching pursuit is a multistage algorithm which, in each stage, finds the basis vector that minimizes the mean squared error of the residual [14].
Suppose we want to represent a signal $f[i]$ using basis vectors from an over-complete dictionary (basis set) $G$. Individual dictionary vectors can be denoted as

$$w_{\gamma}[i] \in G \tag{55.5}$$

Here $\gamma$ is an indexing parameter associated with a particular dictionary element. The decomposition begins by choosing $\gamma$ to maximize the absolute value of the following inner product:

$$t = \langle f[i], w_{\gamma}[i] \rangle \tag{55.6}$$
where $t$ is the transform (expansion) coefficient. A residual signal is computed as

$$R[i] = f[i] - t\, w_{\gamma}[i] \tag{55.7}$$
This residual signal is then expanded in the same way as the original signal. The procedure continues iteratively until either a set number of expansion coefficients is generated or some energy threshold for the residual is reached. Each stage $k$ yields a dictionary element specified by $\gamma_k$, an expansion coefficient $t[k]$, and a residual $R_k$, which is passed on to the next stage. After a total of $M$ stages, the signal can be approximated by a linear combination of the dictionary elements:

$$\hat{f}[i] = \sum_{k=1}^{M} t[k]\, w_{\gamma_k}[i] \tag{55.8}$$
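The iteration of Eqs. (55.5) through (55.8) can be sketched as follows; the random unit-norm dictionary is purely illustrative, not the dictionary of [14]:

```python
# Greedy matching pursuit: each stage picks the atom maximizing |<R, w>|,
# subtracts its contribution, and passes the residual to the next stage.
import numpy as np

def matching_pursuit(f, G, M=10, tol=1e-6):
    """Return (indices gamma_k, coefficients t[k], final residual)."""
    R = f.astype(float).copy()
    gammas, coefs = [], []
    for _ in range(M):
        inner = G @ R                      # <R, w_gamma> for every atom
        g = int(np.argmax(np.abs(inner)))  # best-matching atom, Eq. (55.6)
        t = inner[g]
        gammas.append(g)
        coefs.append(t)
        R = R - t * G[g]                   # residual update, Eq. (55.7)
        if np.dot(R, R) < tol:             # residual energy threshold
            break
    return gammas, coefs, R

rng = np.random.default_rng(1)
G = rng.standard_normal((64, 16))          # 64 atoms in R^16: over-complete
G /= np.linalg.norm(G, axis=1, keepdims=True)
f = 3.0 * G[5] - 2.0 * G[40]               # signal built from two atoms
gammas, coefs, R = matching_pursuit(f, G, M=50)
approx = sum(t * G[g] for g, t in zip(gammas, coefs))   # Eq. (55.8)
print(np.allclose(approx + R, f))          # True by the telescoping recursion
```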