A High Performance VLSI Architecture for Integer Motion
Estimation in HEVC
Xu Yuan1, Liu Jinsong1, Gong Liwei1, Zhang Zhi1, Robert K. F. Teng1,2
1 Shenzhen Key Lab of Advanced Communication and Information Processing, College of Information Engineering, Shenzhen University, Shenzhen, 518000, China
2 Department of Electrical Engineering, California State University, Long Beach, 90840, USA
Abstract

A high performance VLSI architecture for integer motion estimation (IME) in High Efficiency Video Coding (HEVC) is presented in this paper. It supports the coding tree block (CTB) structure with the asymmetric motion partition (AMP) mode. The architecture contains two parallel sub-architectures to meet 1080p@30fps real-time video coding. The size L×L of the CTB in the architecture is set to L=32 pixels by default, and it can be extended to L=64 and L=16 pixels. A serial mode decision module that finds the optimal partition mode for the architecture has also been implemented.

978-1-4673-6417-1/13/$31.00 ©2013 IEEE

1. Introduction

High Efficiency Video Coding (HEVC) is the recent video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations [1]. Its bit-rate reduction at equal perceptual video quality has been demonstrated in HM10.0. Compared to previous video coding standards, HEVC has many new concepts, such as the quadtree structure and asymmetric motion partition (AMP) [2] in integer motion estimation (IME), resulting in higher coding efficiency and more design complexity. The IME is the critical part of video coding design because of its high memory bandwidth, high hardware cost, and complex control logic. Therefore, a high performance IME architecture is important for the HEVC encoder. Many IME VLSI architectures have been studied targeting various standards (e.g. H.264): Cao Wei et al. have proposed a reconfigurable architecture for VBSME in H.264 with a memory partition scheme [3]; G.A. Ruiz et al. have proposed an efficient VLSI processor including a Lagrangian cost module and a mode decision module [4]; Tuan et al. have defined four levels of data-reuse scheme according to memory situations [5]. However, few architectures targeting IME in HEVC have been reported so far.

This paper studies a parallel VLSI architecture for IME in HEVC. The structure supports AMP mode and aims at high resolution applications. A serial mode decision module that finds the optimal partition mode has also been implemented.

2. HEVC Motion Estimation Theory

HEVC motion estimation provides the theoretical basis for the VLSI architecture studied in this paper. The coding object in HEVC is the CTB. Its size can be represented as L×L (L=16, 32, 64), while the traditional macroblock size is 16×16. A CTB is kept as one coding block (CB) or further partitioned into several CBs according to a quadtree, as shown in figure 1. The root of the quadtree is the CTB. The size of CBs can be represented as M×M (M=8, 16, 32, 64). Figure 1 shows a quadtree decomposition and the corresponding trellis diagram; smaller CBs are typically distributed around the object boundary.

Figure 1 Example of quadtree structure (a) Quadtree structure (b) Corresponding trellis diagram

A CB is further partitioned into prediction blocks (PBs) through three types of modes for inter prediction, as shown in figure 2. They are two square modes, M×M and M/2×M/2; two symmetric modes, M/2×M and M×M/2; and four asymmetric modes, M/4×M (L), M/4×M (R), M×M/4 (U), M×M/4 (D). M denotes the size of the corresponding parent CB.

Figure 2 Modes for splitting a CB into PBs

Meanwhile, three constraints must be complied with:
(1) Asymmetric motion partition mode is turned off when M=8,
(2) To reduce the memory bandwidth, 4×4 PBs are not allowed for inter prediction,
(3) 4×8 PBs and 8×4 PBs are only adopted in uni-predictive coding.

3. VLSI Architecture

Based on the newly introduced HEVC motion estimation theory, a high performance VLSI architecture for integer motion estimation has been studied. The top level of the architecture is shown in figure 3. The on-chip RAM holds the current CTB and the search area. Once the size of the CTB is decided, the complexity of IME is also determined. The maximum size of the CTB is 64×64, while the default block size of the architecture is 32×32. When the size of the CTB is 64×64, a quarter down sampling module is used to reduce the number of search points. This strategy strikes a good balance between hardware resources and compression quality. Other schemes, such as the full search algorithm and Level D data reuse [5], have also been used in the architecture design. The full search algorithm has the advantages of computational regularity and excellent output video quality. The Level D scheme reuses pixel data across the entire search window strips of consecutive current blocks.

Figure 3 Top-level of design
[Block labels in the figure: DDR (Reference Frame, Current Frame), AXI Interconnect, On-chip RAM (Search Area, Current CTB), Shift_Regs array, Registers, two PE arrays (SADs), Comparators (Best SADs), Mode Decision, RAM]

A 32×33 pixel 2D three-direction shift register array is proposed to improve data fetching efficiency. Since two processing element (PE) arrays are used in parallel, two consecutive reference candidate blocks are stored in the shift register array. When the next two reference candidates are pushed into the shift register array, 32×2 pixels are updated, whereas the other 32×31 pixels are reused. In order to eliminate bubble clock cycles, a column of 33 pixels is added to the array, so the array size becomes 33×33 pixels.
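The data reuse of the shift register array can be illustrated with a small software model. This is only a sketch under stated assumptions, not the register-level design: it models the vertical scan of one 32-pixel-wide strip, the names `scan_column`, `BLOCK`, `PE_ARRAYS` and `WINDOW_ROWS` are hypothetical, and the extra column used to hide the horizontal shift is not modeled.

```python
from collections import deque

BLOCK = 32                            # CTB width and height in pixels
PE_ARRAYS = 2                         # candidates matched per clock cycle
WINDOW_ROWS = BLOCK + PE_ARRAYS - 1   # 33 rows hold two overlapping candidates


def scan_column(strip):
    """Slide a 33-row window down a 32-pixel-wide strip, two rows per step.

    `strip` is a list of rows (each a list of BLOCK pixels).
    Yields (top_row, candidate_a, candidate_b, rows_loaded), where
    rows_loaded is the number of new rows fetched for that step.
    """
    # The first step has to fill the whole window.
    window = deque(strip[:WINDOW_ROWS], maxlen=WINDOW_ROWS)
    top = 0
    loaded = WINDOW_ROWS
    while True:
        rows = list(window)
        # Two candidates, one row apart, share 31 of their 32 rows.
        yield top, rows[:BLOCK], rows[1:BLOCK + 1], loaded
        start = top + WINDOW_ROWS
        new_rows = strip[start:start + PE_ARRAYS]
        if len(new_rows) < PE_ARRAYS:
            return
        # maxlen makes the deque drop the two oldest rows automatically,
        # so only 32x2 pixels change and 32x31 pixels are reused.
        window.extend(new_rows)
        top += PE_ARRAYS
        loaded = PE_ARRAYS
```

With a strip of 79 rows (32 rows plus 47 further vertical positions, that is, 48 candidate positions), the generator yields 24 steps, and every step after the first loads only two new rows.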

The search window can be scanned in three directions: upward, downward, and left to right. When the direction is left to right, the shift register array uses one cycle to update the requested 33 pixels. For the typical search range [-24, 23], each column contains 48 candidate positions; because the two PE arrays match two candidates per cycle, it takes (24+24)/2 = 24 clock cycles to finish matching one column, plus one clock cycle to shift to another column. With 48 columns, 24×48 = 1152 clock cycles are needed to process one CTB.
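The cycle count above, and the CTB rate that 1080p@30fps video requires, can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 110 MHz clock is the figure reported in the results, and the ideal rate it computes ignores the initial clock cycles, which is why it is higher than the "over 70K" CTBs per second that the paper reports.

```python
import math


def cycles_per_ctb(search_min=-24, search_max=23, pe_arrays=2):
    """Clock cycles to match one CTB over a square full-search window."""
    positions = search_max - search_min + 1        # 48 positions per axis
    cycles_per_column = positions // pe_arrays     # 24 with two PE arrays
    return cycles_per_column * positions           # 24 * 48


def required_ctbs_per_second(width=1920, height=1080, fps=30, ctb=32):
    """CTBs per second needed for real-time coding (partial CTBs round up)."""
    return math.ceil(width / ctb) * math.ceil(height / ctb) * fps


def ideal_ctbs_per_second(clock_hz=110_000_000):
    """Upper bound on throughput, ignoring all start-up and load overhead."""
    return clock_hz // cycles_per_ctb()


if __name__ == "__main__":
    print(cycles_per_ctb())             # 1152
    print(required_ctbs_per_second())   # 61200
    print(ideal_ctbs_per_second())      # 95486
```

The required rate of 61,200 CTBs per second agrees with the 61K figure quoted for 1080p@30fps.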


A processing element (PE) calculates the difference of a pair of pixels; these differences are accumulated into the Sum of Absolute Differences (SAD) of each PB. The size of the PB decides the number of PEs. One PE array includes 1024 PEs. Two PE arrays can concurrently calculate 2×145 SADs in one clock cycle. The contents of the 145 SADs are shown in table 1.

In order to implement the CTB quadtree partition, each PE array is divided into 16×16 execution units (16×16 EUs) as shown in figure 4(a). Each 16×16 EU is divided into sixteen 4×4 EUs as shown in figure 4(b). When the CTB size is 64×64, quarter down sampling is performed first to reduce the number of SADs from 593 to 145, as a tradeoff between performance and precision.

Table 1 Content of 145 SADs
Block Size      Block Number    Block Size      Block Number
8×4             32              16×16           4
4×8             32              32×16           2
8×8             16              16×32           2
16×8            8               32×8(up)*       2
8×16            8               32×8(down)*     2
16×4(up)*       8               8×32(left)*     2
16×4(down)*     8               8×32(right)*    2
4×16(left)*     8               32×32           1
4×16(right)*    8
Note: * represents AMP mode

(a) PE array architecture
(b) 16×16 EU architecture
Figure 4 The design architecture
[Block labels in the figure: Reference data, Current data, EU blocks, SAD outputs]
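The relationship between the 4×4 execution units and the larger block SADs listed in Table 1 can be sketched in software. This is a minimal illustration, assuming every larger SAD is obtained by summing 4×4 SADs; the names `TABLE1`, `sad_4x4_grid` and `block_sad` are hypothetical and the code does not describe the actual adder wiring of the design.

```python
# Number of SADs per prediction-block size, transcribed from Table 1.
TABLE1 = {
    "8x4": 32, "4x8": 32, "8x8": 16, "16x8": 8, "8x16": 8,
    "16x4(up)": 8, "16x4(down)": 8, "4x16(left)": 8, "4x16(right)": 8,
    "16x16": 4, "32x16": 2, "16x32": 2,
    "32x8(up)": 2, "32x8(down)": 2, "8x32(left)": 2, "8x32(right)": 2,
    "32x32": 1,
}


def sad_4x4_grid(cur, ref, size=32):
    """Return a (size/4) x (size/4) grid holding the SAD of each 4x4 block."""
    n = size // 4
    grid = [[0] * n for _ in range(n)]
    for y in range(size):
        for x in range(size):
            # One PE: absolute difference of one current/reference pixel pair.
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def block_sad(grid, x, y, w, h):
    """SAD of the w x h pixel block at pixel (x, y), built from 4x4 SADs.

    x, y, w and h must be multiples of 4.
    """
    return sum(
        grid[r][c]
        for r in range(y // 4, (y + h) // 4)
        for c in range(x // 4, (x + w) // 4)
    )


if __name__ == "__main__":
    cur = [[3] * 32 for _ in range(32)]
    ref = [[1] * 32 for _ in range(32)]
    grid = sad_4x4_grid(cur, ref)
    print(sum(TABLE1.values()))            # 145
    print(block_sad(grid, 0, 0, 32, 32))   # 2048
```

Summing the block counts in `TABLE1` gives 145, which matches the number of SADs each PE array produces per clock cycle.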


Hierarchical adder trees are built to help the PE array generate 145 SADs for each shift operation, as shown in figure 4(a) and 4(b). The best 145 SADs selected by the comparator modules are sent to the mode decision module.

The mode decision module proposed in this paper is based on the structure presented by G.A. Ruiz [4]. The improvement of the circuit is shown in figure 5. Three extra adders are used to sum up the costs of the four 16×16 blocks, eliminating the bubble clock cycles of shifting data, and sixteen registers store the block costs to meet the requirements of HEVC.

Figure 5 Mode Decision architecture
[Block labels in the figure: Accumulator, MUX, reg, COST, COMPARATOR]

4. Results and Performance Analysis

The proposed architecture has been implemented in Verilog, then simulated and verified with ModelSim. The Verilog code has been synthesized, placed and routed onto a Xilinx Virtex-6 XC6VLX-550T using the Xilinx ISE tool. The synthesis results are given in table 2.

Table 2 Resources Utilization of the FPGA
Resource          Utilization     Percentage
Slice logic       55346           16%
Slice register    19744           2.9%
BRAM              148kB           5.2%

The performance of the IME is related to the size of the search window. The default range of displacement is [-24, 23]. It takes 1152 clock cycles to process one CTB. The architecture can process over 70K CTBs per second at a 110 MHz system clock when the initial clock cycles are taken into account. It can meet the requirement of 1080p@30fps video, which needs 61K CTBs per second.

5. Conclusion

A high performance parallel VLSI architecture for integer motion estimation has been studied. It can meet real-time HEVC encoding requirements for 1080p@30fps video. The architecture has been implemented on a Virtex-6 XC6VLX-550T with a 110 MHz system clock. When implemented as an ASIC, the number of parallel PE arrays could be reduced as the system clock is increased.

Reference
[1] B. Bross, et al., "High Efficiency Video Coding (HEVC) Text Specification Draft 9", document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), Oct. 2012.
[2] Gary J. Sullivan, et al., "Overview of the High Efficiency Video Coding (HEVC) Standard", IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648–1667, Dec. 2012.
[3] Cao Wei, et al., "A High-Performance Reconfigurable VLSI Architecture for VBSME in H.264", IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1338–1345, Aug. 2008.
[4] G.A. Ruiz and J.A. Michell, "An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC", Signal Processing: Image Communication, vol. 26, pp. 289–303, July 2011.
[5] J.-C. Tuan, et al., "On the data reuse and memory bandwidth analysis of full-search block-matching VLSI architecture", IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, 2002.


