
P. Pirsch, et al., "VLSI Architectures for Image Communications," 2000 CRC Press LLC.
VLSI Architectures for Image Communications

P. Pirsch
Laboratorium für Informationstechnologie, University of Hannover

W. Gehrke
Philips Semiconductors

59.1 Introduction
59.2 Recent Coding Schemes
59.3 Architectural Alternatives
59.4 Efficiency Estimation of Alternative VLSI Implementations
59.5 Dedicated Architectures
59.6 Programmable Architectures
    Intensive Pipelined Architectures • Parallel Data Paths • Coprocessor Concept
59.7 Conclusion
Acknowledgment
References
59.1 Introduction
Video processing has been a rapidly evolving field for the telecommunications, computer, and media industries. In particular, real-time video compression applications are expected to gain growing economic significance in the coming years. Besides digital TV broadcasting and videophone, services such as multimedia education, teleshopping, or video mail will become audiovisual mass applications. To facilitate worldwide interchange of digitally encoded audiovisual data, there is a demand for international standards defining coding methods and transmission formats. International standardization committees have been working on the specification of several compression schemes. The Joint Photographic Experts Group (JPEG) of the International Standards Organization (ISO) has specified an algorithm for compression of still images [4]. The ITU proposed the H.261 standard for video telephony and videoconferencing [1]. The Motion Pictures Experts Group (MPEG) of ISO has completed its first standard, MPEG-1, which will be used for interactive video and provides a picture quality comparable to VCR quality [2]. MPEG made substantial progress on the second phase of standards, MPEG-2, which will provide audiovisual quality for both broadcast TV and HDTV [3].
Besides the availability of international standards, the successful introduction of the named services depends on the availability of VLSI components supporting a cost-efficient implementation of video compression applications. In the following, we give a short overview of recent coding schemes and discuss implementation alternatives. Furthermore, the efficiency estimation of architectural alternatives is discussed and implementation examples of dedicated and programmable architectures are presented.
59.2 Recent Coding Schemes
Recent video coding standards are based on a hybrid coding scheme that combines transform coding
and predictive coding techniques. An overview of these hybrid encoding schemes is depicted in
Fig. 59.1.
FIGURE 59.1: Hybrid encoding and decoding scheme.
The encoding scheme consists of the tasks motion estimation, typically based on block matching algorithms, computation of the prediction error, discrete cosine transform (DCT), quantization (Q), variable length coding (VLC), inverse quantization (Q⁻¹), and inverse discrete cosine transform (IDCT or DCT⁻¹). The reconstructed image data are stored in an image memory for further predictions. The decoder performs the tasks variable length decoding (VLC⁻¹), inverse quantization, and motion compensated reconstruction.
Generally, video processing algorithms can be classified in terms of regularity of computation and
data access. This classification leads to three classes of algorithms:
• Low-Level Algorithms — These algorithms are based on a predefined sequence of opera-
tions and a predefined amount of data at the input and output. The processing sequence
of low-level algorithms is predefined and does not depend on the values of data processed.
Typical examples of low-level algorithms are block matching or transforms such as the
DCT.
• Medium-Level Algorithms — The sequence and number of operations of medium-level algorithms depend on the data. Typically, the amount of input data is predefined, whereas the amount of output data varies according to the input data values. With respect to hybrid coding schemes, examples for these algorithms are quantization, inverse quantization, or variable length coding.
• High-Level Algorithms — High-level algorithms are associated with a variable amount of input and output data and a data-dependent sequence of operations. As for medium-level algorithms, the sequence of operations is highly data dependent. Control tasks of the hybrid coding scheme can be assigned to this class.
Since hybrid coding schemes are applied for different video source rates, the required absolute
processing power varies in the range from a few hundred MOPS (Mega Operations Per Second) for
video signals in QCIF format to several GOPS (Giga Operations Per Second) for processing of TV
or HDTV signals. Nevertheless, the relative computational power of each algorithmic class is nearly
independent of the processed video format. In the case of hybrid coding applications, approximately 90%
of the overall processing power is required for low-level algorithms. The amount of medium-level
tasks is about 7% and nearly 3% is required for high-level algorithms.
59.3 Architectural Alternatives
In terms of a VLSI implementation of hybrid coding applications, two major requirements can be identified. First, the high computational power requirements have to be met by the hardware. Second, low manufacturing cost of video processing components is essential for the economic success of an architecture. Additionally, implementation size and architectural flexibility have to be taken into account.
Implementations of video processing applications can either be based on standard processors from workstations or PCs or on specialized video signal processors. The major advantage of standard processors is their availability. Application of these architectures for the implementation of video processing hardware does not require the time-consuming design of new VLSI components. The disadvantage of this implementation strategy is the insufficient processing power of recent standard processors. Video processing applications would still require the implementation of cost-intensive multiprocessor systems to meet the computational requirements. To achieve compact implementations, video processing hardware has to be based on video signal processors adapted to the requirements of the envisaged application field.
Basically, two architectural approaches for the implementations of specialized video processing
components can be distinguished. Dedicated architectures aim at an efficient implementation of one
specific algorithm or application. Due to the restriction of the application field, the architecture
of dedicated components can be optimized by an intensive adaptation of the architecture to the
requirements of the envisaged application, e.g., arithmetic operations that have to be supported,
processing power, or communication bandwidth. Thus, this strategy will generally lead to compact
implementations. The major disadvantage of dedicated architectures is the associated low flexibility.
Dedicated components can only be applied for one or a few applications. In contrast to dedicated
approaches with limited functionality, programmable architectures enable the processing of different
algorithms under software control. The particular advantage of programmable architectures is the
increased flexibility. Changes of architectural requirements, e.g., due to changes of algorithms or
an extension of the targeted application field, can be handled by software changes. Thus, a generally
cost-intensive redesign of the hardware can be avoided. Moreover, since programmable architectures
cover a wider range of applications, they can be used for low-volume applications, where the design
of function specific VLSI chips is not an economical solution.
For both architectural approaches, the computational requirements of video processing applications demand the exploitation of the algorithm-inherent independence of the basic arithmetic operations to be performed. Independent operations can be processed concurrently, which enables a decrease of the processing time and thus an increased throughput rate. For the architectural implementation of concurrency, two basic strategies can be distinguished: pipelining and parallel processing.
In the case of pipelining, several tasks, operations, or parts of operations are processed in subsequent steps in different hardware modules. Depending on the granularity level selected for the implementation of pipelining, intermediate data of each step are stored in registers, register chains, FIFOs, or dual-port memories. Assuming a processing time of $T_P$ for a non-pipelined processor module and $T_{D,IM}$ for the delay of the intermediate memories, we get in the ideal case the following estimate for the throughput rate $R_{T,\mathrm{Pipe}}$ of a pipelined architecture applying $N_{\mathrm{Pipe}}$ pipeline stages:

$$
R_{T,\mathrm{Pipe}} = \frac{1}{\dfrac{T_P}{N_{\mathrm{Pipe}}} + T_{D,IM}} = \frac{N_{\mathrm{Pipe}}}{T_P + N_{\mathrm{Pipe}} \cdot T_{D,IM}} \tag{59.1}
$$
From this it follows that the major limiting factor for the maximum applicable degree of pipelining is the access delay of the intermediate memories.
The alternative to pipelining is the implementation of parallel units, processing independent data
concurrently. Parallel processing can be applied on operation level as well as on task level. Assuming
the ideal case, this strategy leads to a linear increase of processing power and we get:
$$
R_{T,\mathrm{Par}} = \frac{N_{\mathrm{Par}}}{T_P} \tag{59.2}
$$

where $N_{\mathrm{Par}}$ is the number of parallel units.
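As a quick numerical illustration of Eqs. (59.1) and (59.2), the following minimal C sketch evaluates both throughput estimates for a hypothetical module; the parameter values (100 ns module delay, 2 ns intermediate-memory delay, four stages or units) are assumptions chosen only to make the formulas concrete.

```c
#include <stdio.h>

/* Illustrative throughput estimates for Eqs. (59.1) and (59.2); the
 * parameter values below are hypothetical, not taken from the chapter. */
int main(void)
{
    const double T_P    = 100.0e-9;  /* processing time of the non-pipelined module [s] */
    const double T_D_IM = 2.0e-9;    /* access delay of an intermediate memory [s]      */
    const int    N_Pipe = 4;         /* number of pipeline stages                       */
    const int    N_Par  = 4;         /* number of parallel units                        */

    /* Eq. (59.1): throughput of the pipelined architecture */
    double R_pipe = N_Pipe / (T_P + N_Pipe * T_D_IM);
    /* Eq. (59.2): throughput of the parallel architecture (ideal case) */
    double R_par  = N_Par / T_P;

    printf("R_T,Pipe = %.2f Mop/s\n", R_pipe / 1e6);  /* ~37.0 Mop/s */
    printf("R_T,Par  = %.2f Mop/s\n", R_par  / 1e6);  /*  40.0 Mop/s */
    return 0;
}
```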

Generally, both alternatives are applied for the implementation of high-performance video processing components. In the following sections, the exploitation of algorithmic properties and the application of architectural concurrency are discussed considering the hybrid coding scheme.
59.4 Efficiency Estimation of Alternative VLSI Implementations
Basically, architectural efficiency can be defined as the ratio of performance to cost. To obtain a figure of merit for architectural efficiency, we assume in the following that the performance of a VLSI architecture can be expressed by the achieved throughput rate $R_T$ and that the cost is equivalent to the silicon area $A_{Si}$ required for the implementation of the architecture:

$$
E = \frac{R_T}{A_{Si}} \tag{59.3}
$$
Besides the architecture, efficiency mainly depends on the applied semiconductor technology and the design style (semi-custom, full-custom). Therefore, a realistic efficiency estimation has to consider the gains provided by the progress in semiconductor technology. A sensible way is the normalization of the architectural parameters to a reference technology. In the following, we assume a reference process with a grid length $\lambda_0 = 1.0\,\mu\mathrm{m}$. For the normalization of silicon area, the following equation can be applied:

$$
A_{Si,0} = A_{Si} \left( \frac{\lambda_0}{\lambda} \right)^2 \tag{59.4}
$$

where the index 0 is used for the system with the reference grid length $\lambda_0$.
According to [7], the normalization of the throughput rate can be performed by:

$$
R_{T,0} = R_T \left( \frac{\lambda}{\lambda_0} \right)^{1.6} \tag{59.5}
$$
From Eqs. (59.3), (59.4), and (59.5), the normalization of the architectural efficiency can be derived:

$$
E_0 = \frac{R_{T,0}}{A_{Si,0}} = \frac{R_T}{A_{Si}} \left( \frac{\lambda}{\lambda_0} \right)^{3.6} \tag{59.6}
$$
$E$ can be used for the selection of the best architectural approach out of several alternatives. Moreover, assuming a constant efficiency for a specific architectural approach leads to a linear relationship between throughput rate and silicon area, and this relationship can be applied for the estimation of the silicon area required for a specific application. Due to the power of 3.6 in Eq. (59.6), the semiconductor technology chosen for the implementation of a specific application has a significant impact on the architectural efficiency.
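A small C sketch of this normalization chain, Eqs. (59.4) through (59.6), is given below; the design data (0.8 µm process, 30 mm², 40 Mpel/s) are invented solely to illustrate the scaling and do not refer to any of the designs discussed here.

```c
#include <stdio.h>
#include <math.h>

/* Normalization of area, throughput, and efficiency to a 1.0 um reference
 * process, following Eqs. (59.4)-(59.6). The sample design data are
 * hypothetical and only serve to illustrate the scaling. */
int main(void)
{
    const double lambda0 = 1.0;   /* reference grid length [um]             */
    const double lambda  = 0.8;   /* grid length of the actual process [um] */
    const double A_si    = 30.0;  /* measured silicon area [mm^2]           */
    const double R_t     = 40.0;  /* measured throughput [Mpel/s]           */

    double A_si0 = A_si * pow(lambda0 / lambda, 2.0);  /* Eq. (59.4) */
    double R_t0  = R_t  * pow(lambda / lambda0, 1.6);  /* Eq. (59.5) */
    double E0    = R_t0 / A_si0;                       /* Eq. (59.6) */

    printf("A_Si,0 = %.2f mm^2\n", A_si0);
    printf("R_T,0  = %.2f Mpel/s\n", R_t0);
    printf("E_0    = %.4f (Mpel/s)/mm^2\n", E0);
    return 0;
}
```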
In the following, examples of dedicated and programmable architectures for video processing
applications are presented. Additionally, the discussed efficiency measure is applied to achieve a
figure of merit for silicon area estimation.
59.5 Dedicated Architectures
Due to their algorithmic regularity and the high processing power they require, the discrete cosine transform and motion estimation are the prime candidates for a dedicated implementation. As typical examples, alternatives for a dedicated implementation of these algorithms are discussed in the following.
The discrete cosine transform (DCT) is a real-valued frequency transform similar to the discrete Fourier transform (DFT). When applied to an image block of size L × L, the two-dimensional DCT (2D-DCT) can be expressed as follows:
$$
Y_{k,l} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} x_{i,j} \cdot C_{i,k} \cdot C_{j,l} \tag{59.7}
$$

where

$$
C_{n,m} =
\begin{cases}
\dfrac{1}{\sqrt{2}} & \text{for } m = 0 \\[4pt]
\cos\!\left(\dfrac{(2n+1)\,m\,\pi}{2L}\right) & \text{otherwise}
\end{cases}
$$

with
$(i, j)$ = coordinates of the pixels in the initial block
$(k, l)$ = coordinates of the coefficients in the transformed block
$x_{i,j}$ = value of the pixel in the initial block
$Y_{k,l}$ = value of the coefficient in the transformed block

Computing a 2D-DCT of size $L \times L$ directly according to Eq. (59.7) requires $L^4$ multiplications and $L^4$ additions.
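For reference, a direct evaluation of Eq. (59.7) can be sketched in C as follows; the 8 × 8 block size, the constant PI, and the function names are illustrative choices rather than anything prescribed by the chapter.

```c
#include <math.h>

#define L 8  /* block size; illustrative choice, the text treats general L */

/* Basis function C[n][m] of Eq. (59.7). */
static double dct_coeff(int n, int m)
{
    const double PI = 3.14159265358979323846;
    return (m == 0) ? 1.0 / sqrt(2.0)
                    : cos((2.0 * n + 1.0) * m * PI / (2.0 * L));
}

/* Direct 2D-DCT: the four nested loops perform on the order of L^4
 * multiply/accumulate operations, as counted in the text. */
void dct2d_direct(const double x[L][L], double Y[L][L])
{
    for (int k = 0; k < L; k++)
        for (int l = 0; l < L; l++) {
            double acc = 0.0;
            for (int i = 0; i < L; i++)
                for (int j = 0; j < L; j++)
                    acc += x[i][j] * dct_coeff(i, k) * dct_coeff(j, l);
            Y[k][l] = acc;
        }
}
```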
The required processing power for the implementation of the DCT can be reduced by the exploitation of the arithmetic properties of the algorithm. The two-dimensional DCT can be separated into two one-dimensional DCTs according to Eq. (59.8):

$$
Y_{k,l} = \sum_{i=0}^{L-1} C_{i,k} \cdot \left( \sum_{j=0}^{L-1} x_{i,j} \cdot C_{j,l} \right) \tag{59.8}
$$
The implementation of the separated DCT requires 2L³ multiplications and 2L³ additions. As an example, the DCT implementation according to [9] is depicted in Fig. 59.2. This architecture is based on two one-dimensional processing arrays. Since the arrays use a pipelined multiplier/accumulator implementation in carry-save technique, vector-merging adders are located at the output of each array. The results of the first 1D-DCT have to be reordered for the second 1D-DCT stage. For this purpose, a transposition memory is used. Since both one-dimensional processor arrays require identical DCT coefficients, these coefficients are stored in a common ROM.
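A behavioral C sketch of the separated (row-column) evaluation of Eq. (59.8) is shown below; as before, the 8-point block size and the helper names are illustrative assumptions, and the explicit transposition loop stands in for the transposition memory of Fig. 59.2.

```c
#include <math.h>

#define L 8  /* block size; illustrative choice, the text treats general L */

static double dct_coeff(int n, int m)   /* C[n][m] of Eq. (59.7) */
{
    const double PI = 3.14159265358979323846;
    return (m == 0) ? 1.0 / sqrt(2.0)
                    : cos((2.0 * n + 1.0) * m * PI / (2.0 * L));
}

/* One L-point 1D-DCT: out[m] = sum_n in[n] * C[n][m]. */
static void dct1d(const double in[L], double out[L])
{
    for (int m = 0; m < L; m++) {
        double acc = 0.0;
        for (int n = 0; n < L; n++)
            acc += in[n] * dct_coeff(n, m);
        out[m] = acc;
    }
}

/* Separated 2D-DCT of Eq. (59.8): row transform, transposition, column
 * transform -- 2*L^3 multiply/accumulate operations in total. */
void dct2d_separable(const double x[L][L], double Y[L][L])
{
    double tmp[L][L], col_in[L], col_out[L];

    for (int i = 0; i < L; i++)          /* 1D-DCT along each row */
        dct1d(x[i], tmp[i]);

    for (int l = 0; l < L; l++) {        /* 1D-DCT along each column */
        for (int i = 0; i < L; i++)
            col_in[i] = tmp[i][l];       /* "transposition memory" access */
        dct1d(col_in, col_out);
        for (int k = 0; k < L; k++)
            Y[k][l] = col_out[k];
    }
}
```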
FIGURE 59.2: Separated DCT implementation according to [9].

Moving from a mathematical definition to an algorithm that minimizes the number of calculations required is a problem of particular interest in the case of transforms such as the DCT. The 1D-DCT can also be expressed by the matrix-vector product

$$
[Y] = [C][X] \tag{59.9}
$$

where $[C]$ is an $L \times L$ matrix and $[X]$ and $[Y]$ are the $L$-point input and output vectors. As an example, with $\theta = \pi/16$, the 8-point DCT can be written as denoted in Eq. (59.10):

$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \end{bmatrix}
=
\begin{bmatrix}
\cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\
\cos \theta  & \cos 3\theta & \cos 5\theta & \cos 7\theta & -\cos 7\theta & -\cos 5\theta & -\cos 3\theta & -\cos \theta \\
\cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta & -\cos 2\theta & -\cos 6\theta & \cos 6\theta & \cos 2\theta \\
\cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta & \cos 5\theta & \cos \theta & \cos 7\theta & -\cos 3\theta \\
\cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta & \cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\
\cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta & -\cos 3\theta & -\cos 7\theta & \cos \theta & -\cos 5\theta \\
\cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta & -\cos 6\theta & \cos 2\theta & -\cos 2\theta & \cos 6\theta \\
\cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta & \cos \theta & -\cos 3\theta & \cos 5\theta & -\cos 7\theta
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
\tag{59.10}
$$

Exploiting the symmetries of this matrix, the even-indexed and odd-indexed coefficients can be computed separately from sums and differences of mirrored input samples:

$$
\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix}
=
\begin{bmatrix}
\cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\
\cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta \\
\cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\
\cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta
\end{bmatrix}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}
\tag{59.11}
$$

$$
\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix}
=
\begin{bmatrix}
\cos \theta  & \cos 3\theta & \cos 5\theta & \cos 7\theta \\
\cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta \\
\cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta \\
\cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta
\end{bmatrix}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
\tag{59.12}
$$
More generally, the matrices in Eqs. (59.11) and (59.12) can be decomposed into a number of simpler matrices, the composition of which can be expressed as a flowgraph. Many fast algorithms have been proposed. Figure 59.3 illustrates the flowgraph of B.G. Lee's algorithm, which is commonly used [10]. Several implementations using fast flowgraphs have been reported [11, 12].
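The even/odd split of Eqs. (59.11) and (59.12) can be written out directly in C as one butterfly stage followed by two 4 × 4 matrix-vector products. The sketch below is a straightforward transcription (hypothetical function name, no further flowgraph decomposition as in Lee's algorithm); it already halves the multiplication count compared with the full 8 × 8 product of Eq. (59.10).

```c
#include <math.h>

/* Even/odd decomposition of the 8-point 1D-DCT, Eqs. (59.11) and (59.12):
 * one stage of butterflies (sums and differences of mirrored samples)
 * followed by two 4x4 matrix-vector products, written out term by term. */
void dct8_even_odd(const double x[8], double Y[8])
{
    const double t = 3.14159265358979323846 / 16.0;  /* theta = pi/16 */
    double s[4], d[4];                                /* butterfly outputs */

    for (int i = 0; i < 4; i++) {
        s[i] = x[i] + x[7 - i];                       /* inputs of Eq. (59.11) */
        d[i] = x[i] - x[7 - i];                       /* inputs of Eq. (59.12) */
    }

    /* even-indexed coefficients, Eq. (59.11) */
    Y[0] = cos(4*t)*s[0] + cos(4*t)*s[1] + cos(4*t)*s[2] + cos(4*t)*s[3];
    Y[2] = cos(2*t)*s[0] + cos(6*t)*s[1] - cos(6*t)*s[2] - cos(2*t)*s[3];
    Y[4] = cos(4*t)*s[0] - cos(4*t)*s[1] - cos(4*t)*s[2] + cos(4*t)*s[3];
    Y[6] = cos(6*t)*s[0] - cos(2*t)*s[1] + cos(2*t)*s[2] - cos(6*t)*s[3];

    /* odd-indexed coefficients, Eq. (59.12) */
    Y[1] = cos(1*t)*d[0] + cos(3*t)*d[1] + cos(5*t)*d[2] + cos(7*t)*d[3];
    Y[3] = cos(3*t)*d[0] - cos(7*t)*d[1] - cos(1*t)*d[2] - cos(5*t)*d[3];
    Y[5] = cos(5*t)*d[0] - cos(1*t)*d[1] + cos(7*t)*d[2] + cos(3*t)*d[3];
    Y[7] = cos(7*t)*d[0] - cos(5*t)*d[1] + cos(3*t)*d[2] - cos(1*t)*d[3];
}
```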
Another approach that has been used extensively is based on the technique of distributed arithmetic. Distributed arithmetic is an efficient way to compute the DCT totally or partially as scalar products. To illustrate the approach, let us compute a scalar product between two length-$M$ vectors $C$ and $X$:

$$
Y = \sum_{i=0}^{M-1} c_i \cdot x_i \quad \text{with} \quad x_i = -x_{i,0} + \sum_{j=1}^{B-1} x_{i,j} \cdot 2^{-j} \tag{59.13}
$$
where the $\{c_i\}$ are $N$-bit constants and the $\{x_i\}$ are coded in $B$ bits in 2's complement. Then Eq. (59.13) can be rewritten as:

$$
Y = \sum_{j=0}^{B-1} C_j \cdot 2^{-j} \quad \text{with} \quad C_j = \sum_{i=0}^{M-1} c_i\, x_{i,j} \;\; (j \neq 0) \quad \text{and} \quad C_0 = -\sum_{i=0}^{M-1} c_i\, x_{i,0} \tag{59.14}
$$
FIGURE 59.3: Lee FDCT flowgraph for the one-dimensional 8-point DCT [10].
The change of summing order in $i$ and $j$ characterizes the distributed arithmetic scheme, in which the initial multiplications are distributed to another computation pattern. Since the term $C_j$ has only $2^M$ possible values (which depend on the $x_{i,j}$ values), it is possible to store these $2^M$ possible values in a ROM. An input set of $M$ bits $\{x_{0,j}, x_{1,j}, x_{2,j}, \ldots, x_{M-1,j}\}$ is used as an address, allowing retrieval of the $C_j$ value. These intermediate results are accumulated over $B$ clock cycles to produce one $Y$ value. Figure 59.4 shows a typical architecture for the computation of an $M$-input inner product. The inverter and the MUX are used for inverting the final output of the ROM in order to compute $C_0$.
FIGURE 59.4: Architecture of an M-input inner product using distributed arithmetic.
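The ROM-and-accumulator datapath of Fig. 59.4 can be modeled bit-serially in a few lines of C. The sketch below is a behavioral model only, with hypothetical word lengths M = 4 and B = 8 and plain integer data; the sign-bit slice is subtracted rather than added, which corresponds to the inverter and MUX of the figure.

```c
#include <stdint.h>

#define M 4   /* number of inputs of the inner product (illustrative)     */
#define B 8   /* word length of the inputs, 2's complement (illustrative) */

/* Bit-serial distributed-arithmetic inner product following Eq. (59.14):
 * the 2^M partial sums are precomputed into a ROM table, addressed by one
 * bit slice of the M inputs per clock cycle. */
int32_t da_inner_product(const int8_t c[M], const int8_t x[M])
{
    int32_t rom[1 << M];

    /* "ROM" contents: rom[addr] = sum of the coefficients selected by addr */
    for (int addr = 0; addr < (1 << M); addr++) {
        int32_t sum = 0;
        for (int i = 0; i < M; i++)
            if (addr & (1 << i))
                sum += c[i];
        rom[addr] = sum;
    }

    /* Bit-serial accumulation, MSB (sign bit) first; the sign-bit slice is
     * subtracted, modeling the inverter/MUX of Fig. 59.4. */
    int32_t acc = 0;
    for (int j = 0; j < B; j++) {
        int addr = 0;
        for (int i = 0; i < M; i++)
            addr |= (((uint8_t)x[i] >> (B - 1 - j)) & 1) << i;  /* bit slice j */
        acc = 2 * acc + ((j == 0) ? -rom[addr] : rom[addr]);
    }
    return acc;  /* equals sum_i c[i] * x[i] */
}
```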
Figure 59.5 illustrates two typical uses of distributed arithmetic for computing a DCT. Figure 59.5(a) implements the scalar products described by the matrix of Eq. (59.10). Figure 59.5(b) takes advantage of a first stage of additions and subtractions and the scalar products described by the matrices of Eq. (59.11) and Eq. (59.12).

FIGURE 59.5: Architecture of an 8-point one-dimensional DCT using distributed arithmetic. (a) Pure distributed arithmetic. (b) Mixed D.A.: first stage of flowgraph decomposition of the 8 points followed by 2 × 4 scalar products of 4 points.
Properties of several dedicated DCT implementations have been reported in [6]. Figure 59.6 shows the silicon area as a function of the throughput rate for selected design examples. The design parameters are normalized to a fictitious 1.0 µm CMOS process according to the discussed normalization strategy. As a figure of merit, a linear relationship between throughput rate and required silicon area can be derived:

$$
\alpha_{T,0} \approx 0.5\ \mathrm{mm^2 / (Mpel/s)} \tag{59.15}
$$
Equation (59.15) can be applied for the silicon area estimation of DCT circuits. For example, assuming TV signals according to the CCIR-601 format and a frame rate of 25 Hz, the source rate equals 20.7 Mpel/s. As a figure of merit, a normalized silicon area of about 10.4 mm² can be derived from Eq. (59.15). For HDTV signals the video source rate equals 110.6 Mpel/s, and approximately 55.3 mm² of silicon area is required for the implementation of the DCT. Assuming an economically sensible maximum chip size of about 100 mm² to 150 mm², we can conclude that the implementation of the DCT does not necessarily require the realization of a dedicated DCT chip; the DCT core can be combined with several other on-chip modules that perform additional tasks of the video coding scheme.
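A short C sketch tying together Eq. (59.15) and the technology scaling of Eq. (59.4) is given below; only the 0.5 mm²/(Mpel/s) figure and the two source rates come from the text, while the 0.5 µm target process is an arbitrary assumption.

```c
#include <stdio.h>
#include <math.h>

/* Silicon area estimation for a DCT core based on Eq. (59.15), with an
 * optional rescaling to a finer process via Eq. (59.4) read in reverse. */
int main(void)
{
    const double alpha   = 0.5;    /* mm^2 per Mpel/s, normalized to 1.0 um */
    const double lambda0 = 1.0;    /* reference grid length [um]            */
    const double lambda  = 0.5;    /* assumed target process [um]           */
    const double rates[] = { 20.7, 110.6 };  /* CCIR-601 TV and HDTV [Mpel/s] */

    for (int i = 0; i < 2; i++) {
        double area0 = alpha * rates[i];                    /* Eq. (59.15)       */
        double area  = area0 * pow(lambda / lambda0, 2.0);  /* inverse Eq. (59.4) */
        printf("%.1f Mpel/s: %.1f mm^2 (1.0 um), %.1f mm^2 (%.1f um)\n",
               rates[i], area0, area, lambda);
    }
    return 0;
}
```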
For motion estimation, several techniques have been proposed in the past. Today, the most important technique for motion estimation is block matching, introduced in [21]. Block matching is based on the matching of blocks between the current image and a reference image. This can be done by a full (or exhaustive) search within a search window, but several other approaches have been
FIGURE 59.6: Normalized silicon area and throughput for dedicated DCT circuits.