MontiumC. The Montium design methodology to map DSP applications on the Montium TP is divided into three steps:

1. The high-level description of the DSP application is analyzed and computationally intensive DSP kernels are identified (a sketch of such a kernel follows this list).
2. The identified DSP kernels, or parts of them, are mapped on one or more Montium TPs available in the SoC. The DSP operations are programmed on the Montium TP using MontiumC.
3. Depending on the layout of the SoC in which the Montium processing tiles are applied, the tiles are configured for a particular DSP kernel or part of a kernel. Furthermore, the channels in the NoC between the processing tiles are configured.
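As an illustration of the first two steps, the sketch below shows the kind of computationally intensive kernel that would be identified in a DSP application and mapped onto a Montium TP. It is plain C, not actual MontiumC (whose syntax is not shown in this excerpt), and the Q15 fixed-point format is an assumption.

    /* Illustrative DSP kernel: a FIR filter of the kind identified in
     * step 1. Plain C for clarity; in step 2 this would be expressed in
     * MontiumC and mapped onto a Montium TP. Q15 coefficients assumed. */
    #include <stddef.h>

    #define TAPS 200

    /* Compute one output sample: y[n] = sum_k h[k] * x[n-k].
     * 'delay' holds the most recent TAPS input samples, newest first. */
    int fir_step(const int h[TAPS], int delay[TAPS], int x_in)
    {
        long acc = 0;

        /* Shift the delay line and insert the new input sample. */
        for (size_t k = TAPS - 1; k > 0; k--)
            delay[k] = delay[k - 1];
        delay[0] = x_in;

        /* Multiply-accumulate over all taps. */
        for (size_t k = 0; k < TAPS; k++)
            acc += (long)h[k] * delay[k];

        return (int)(acc >> 15); /* Q15 scaling */
    }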
11.3.1.3 Annabelle Heterogeneous System-on-Chip

In this section, the prototype Annabelle SoC, which is intended for digital radio broadcasting receivers (e.g., digital audio broadcasting, digital radio mondiale), is described according to the heterogeneous SoC template mentioned before. Figure 11.6 shows the overall architecture of the Annabelle SoC. The Annabelle SoC consists of an ARM926 GPP with a five-layer AMBA AHB, four Montium TPs, an NoC, a Viterbi decoder, two ADCs, two DDCs, a DMA controller, SRAM/SDRAM memory interfaces, and external bus interfaces.
The four Montium TPs and the NoC are arranged in a reconfigurable subsystem, labelled "reconfigurable fabric." The reconfigurable fabric is connected to the AHB bus and serves as a slave to the AMBA system. A configurable clock controller generates the clocks for the individual Montium TPs. Every individual Montium TP has its own adjustable clock and runs at its own speed. A prototype chip of the Annabelle SoC has been produced using the Atmel 130 nm CMOS process [8].
FIGURE 11.6 Block diagram of the Annabelle SoC. (The figure shows the ARM926 GPP, clock controller, DMA controller, IRQ controller, Viterbi decoder, two DDCs, two ADCs, SRAM/SDRAM and external bus interfaces, and the reconfigurable fabric of four Montium TPs with CCUs on a network-on-chip, all connected by the five-layer AMBA advanced high-performance bus.)

FIGURE 11.7 The Annabelle SoC reconfigurable fabric. (The figure shows four Montium TPs, each connected through a CCU to one of two NoC routers, and an AHB-NoC bridge with two queues and a control block between the NoC and the AHB port.)

The reconfigurable fabric that is integrated in the Annabelle SoC is shown in detail in Figure 11.7. The reconfigurable fabric acts as a
reconfigurable coprocessor for the ARM926 processor. Computationally intensive DSP algorithms are typically offloaded from the ARM926 processor and processed on the coarse-grained reconfigurable Montium TPs inside the reconfigurable fabric. The reconfigurable fabric contains four Montium TPs, which are connected via a CCU to a circuit-switched NoC. The reconfigurable fabric is connected to the AMBA system through an AHB-NoC bridge interface. Configurations, generated at design time, can be loaded onto the Montium TPs at run time. The reconfigurable fabric provides "block mode" and "streaming mode" computation services.
For ASIC synthesis, worst-case military conditions are assumed. In particular, the supply voltage is 1.1 V and the temperature is 125°C. Results obtained with the synthesis are as follows:

• The area of one Montium core is 3.5 mm², of which 0.2 mm² is for the CCU and 3.3 mm² is for the Montium TP (including memory).
• With Synopsys tooling, we estimated that the Montium TP, within the Annabelle ASIC realization, can implement an FIR filter at about 100 MHz or an FFT at 50 MHz. The worst-case clock frequency of the Annabelle chip is 25 MHz.
• With the Synopsys PrimePower tool, we estimated the energy consumption using placed and routed netlists. The following section provides some of the results.
TABLE 11.2
Dynamic Power Consumption of One Montium on Annabelle
(Energy in mW/MHz)

Module      FIR-5   FFT-512   FFT-288
Datapath    0.19    0.24      0.15
Memories    0.0     0.27      0.21
Sequencer   0.02    0.07      0.05
Decoders    0.0     0.0       0.0
CCU         0.02    0.02      0.02
Total       0.23    0.60      0.43

TABLE 11.3
Energy Comparison of Montium/ARM926

Algorithm   Montium (μJ)   ARM926 (μJ)   Ratio
FIR-5       0.243          —             —
FFT-112     0.357          9             25
FFT-176     0.616          16            26
FFT-256     0.707          14            20
FFT-288     1.001          23            23
FFT-512     1.563          30            19
FFT-1920    5.054          168           33
11.3.1.4 Average Power Consumption

To determine the average power consumption of the Annabelle as accurately as possible, we performed a number of power estimations on the placed and routed netlist using the Synopsys Power Compiler. Table 11.2 provides the dynamic power consumption in mW/MHz of various Montium blocks for three well-known DSP algorithms. These figures show that the overhead of the sequencer and decoder is low: <16% of the total dynamic power consumption. Finally, Table 11.3 compares the energy consumption of the Montium and the ARM926 on Annabelle. For the FIR-5 algorithm, the memory is not used.
11.3.1.5 Locality of Reference

As mentioned above, locality of reference is an important design parameter. One of the reasons for the excellent energy figures of the Montium is its use of locality of reference. To illustrate this, Table 11.4 gives the number of memory references local to the cores compared to the number of off-core communications. These figures are, as expected, algorithm dependent. Therefore, in this table, we chose three well-known algorithms in the streaming DSP application domain: a 1024p FFT, a 200 tap FIR filter, and a part of a Turbo decoder (the SISO algorithm [17]). The results show that, for these algorithms, 80%-99% of the memory references are local (within a tile).

TABLE 11.4
Internal and External Memory References per Execution of an Algorithm

Algorithm                      Internal   External   Ratio
1024p FFT                      51200      4096       12.5
200 tap FIR                    405        2          202.5
SISO algorithm (N softbits)    18·N       3·N        6
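For example, the 200 tap FIR makes 405 internal and 2 external references per execution, so 405/407 ≈ 99.5% of its references are local; the 1024p FFT reaches 51200/(51200 + 4096) ≈ 92.6%, and the SISO algorithm 18N/(18N + 3N) ≈ 86%. These figures are the source of the 80%-99% range quoted above.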
TABLE 11.5
Reconfiguration of Algorithms on the Montium

Algorithm       Change               Size         Cycles
1024p FFT       Scaling factors      ≤150 bit     ≤10
  to iFFT       Twiddle factors      16384 bit    512
200 tap FIR     Filter coefficients  ≤3200 bit    ≤80
11.3.1.6 Partial Dynamic Reconfiguration

One of the advantages of a multicore SoC organization is that each individual core can be reconfigured while the other cores remain operational. In the Montium, the configuration memory is organized as a RAM. This means that to reconfigure the Montium, the entire configuration memory need not be rewritten; only the parts that change are updated. Furthermore, because the Montium has a coarse-grained reconfigurable architecture, the configuration memory is relatively small: the Montium has a configuration size of only 2.6 kB. Table 11.5 gives some examples of reconfigurations.
To reconfigure a Montium from executing a 1024 point FFT to executing a 1024 point inverse FFT requires updating the scaling and twiddle factors. Updating these factors requires less than 522 clock cycles in total. Changing the coefficients of a 200 tap FIR filter requires less than 80 clock cycles.
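A minimal sketch of such a partial update is given below, assuming a memory-mapped configuration RAM written one 32-bit word per cycle (consistent with 16384 bits of twiddle factors in 512 cycles); the function name and offsets are hypothetical, not the actual Montium configuration map.

    /* Hypothetical partial-reconfiguration helper: only the configuration
     * words that change are rewritten; the rest of the 2.6 kB configuration
     * RAM is left untouched. Assumes one 32-bit word written per cycle. */
    #include <stdint.h>
    #include <stddef.h>

    void write_config(volatile uint32_t *cfg_ram, size_t word_offset,
                      const uint32_t *new_words, size_t n_words)
    {
        for (size_t i = 0; i < n_words; i++)
            cfg_ram[word_offset + i] = new_words[i]; /* one word per cycle */
    }

    /* FFT-to-iFFT change of Table 11.5 (offsets hypothetical):
     *   write_config(cfg, TWIDDLE_OFFSET, new_twiddles, 512);  // 16384 bits
     *   write_config(cfg, SCALING_OFFSET, new_scaling, 5);     // <=150 bits
     */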
11.3.2 Aspex Linedancer

The Linedancer [4] is an "associative" processor and an example of a homogeneous SoC. Associative processing is the property of instructions to execute only on those PEs where a certain value in their data register matches a value in the instruction. Associative processing is built around an intelligent memory concept: content-addressable memory (CAM). Unlike standard computer memory (random access memory, or RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a tag list of one or more storage addresses where the word was found. Each CAM line, which contains a word, can be seen as a processing element (PE), and each tag list element as a 1-bit condition register. Depending on this register, the aggregate associative processor can either instruct the PEs to continue processing on the indicated subset, or return the involved words one after another for further processing. Several implementations are possible, varying from bit-serial to word-parallel, but the latest implementations [4,5] can perform the involved lookups in parallel in a single clock cycle.
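The lookup semantics can be made concrete with a small software model; the sketch below is illustrative only, since a real CAM compares all lines simultaneously rather than in a loop.

    /* Software model of a CAM lookup: the caller supplies a data word and
     * receives a tag list of matching line addresses. In hardware (and in
     * the Linedancer) this search runs on all lines in a single clock
     * cycle; the loop here only models the semantics. */
    #include <stddef.h>
    #include <stdint.h>

    size_t cam_search(const uint32_t *lines, size_t n_lines,
                      uint32_t word, size_t *tags)
    {
        size_t n_tags = 0;
        for (size_t i = 0; i < n_lines; i++) {
            if (lines[i] == word)   /* match sets the PE's 1-bit condition */
                tags[n_tags++] = i; /* tag = address of the matching line */
        }
        return n_tags;
    }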
In general, the Linedancer belongs to the subclass of massively parallel SIMD architectures, typically with more than 512 processors. This SIMD subclass is perfectly suited to support data parallelism, for example, for signal, image, and video processing; text retrieval; and large databases. The associative functions furthermore allow the processor to function like an intelligent memory (CAM), permitting high-speed searching and data-dependent image processing operations (such as median filters and object recognition/labeling).

The so-called "ASProCore" of the Linedancer is designed around a very large number (up to 4,096) of simple PEs arranged in a line; see Figure 11.8. Application areas are diverse but have in common the simple processing of very large amounts of data, from samples in 1D streams to pixels in 2D or 3D images. To mention a few: software-defined radio (e.g., WiMAX), broadcast (video compression), medical imaging (3D reconstruction), and high-end printers, in particular for raster image processing (RIP).
FIGURE 11.8 The scalable architecture of the Linedancer. (The figure shows an on-chip RISC processor feeding a program over a common instruction bus to the associative string processing array (ASProCore) of thousands of PEs, PE 0 through PE 4,095, an inter-PE communication network cascadable over chips, and on-chip or off-chip memory.)
FIGURE 11.9 The architecture of the Linedancer's associative string processor (ASProCore). (The figure shows, for a single PE, the associative memory (CAM), extended memory (EXT), ALU array, primary data store (PDS), mask, inter-PE communication, and bulk I/O memory, with the left and right link ports (LLP, RLP) at either end of the 4,096-PE string.)
In the following sections, the associative processor (ASProCore) and the Linedancer family are introduced. At the end, we present the development toolchain and a brief conclusion on the Linedancer application domain.
11.3.2.1 ASProCore Architecture

Each PE has a 1-bit ALU, a 32-64 bit fully associative memory array, and 128 bit of extended memory. See Figure 11.9 for a detailed view of the ASProCore architecture. The processors are connected in a 1D network, actually a 4K-bit shift register, running between the indicated "left link port" (LLP) and "right link port" (RLP). The network allows data to be shared between PEs with minimum overhead. The ASProCore also has a separate word-serial, bit-parallel memory, the primary data store (PDS), for high-speed data input. The on-chip DMA engine automatically translates 2D and 3D images into the 1D array (passing them through via the PDS). The 1D architecture allows for linear scaling of performance, memory, and communication, provided the application is expressed in a scalable manner. The Linedancer also features a single or dual 32-bit RISC core (P1 and HD, respectively) for sequential processing and for controlling the ASProCore.
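The sketch below models neighbor communication over that 1D string as a single "shift right by one PE" step; the operation and its name are illustrative, as the actual ASProCore primitives are not shown in this excerpt.

    /* Model of one shift step over the 1D PE string: every PE takes the
     * value of its left neighbor; data enters at the left link port (LLP)
     * and leaves at the right link port (RLP). Illustrative only. */
    #define N_PES 4096

    void shift_right(int pe_reg[N_PES], int llp_in, int *rlp_out)
    {
        *rlp_out = pe_reg[N_PES - 1];      /* value leaving at the RLP */
        for (int i = N_PES - 1; i > 0; i--)
            pe_reg[i] = pe_reg[i - 1];     /* take left neighbor's value */
        pe_reg[0] = llp_in;                /* new data enters at the LLP */
    }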
11.3.2.2 Linedancer Hardware Architecture

The current Linedancers, the P1 and the HD, have been realized in a 0.13 μm CMOS process. Both have one or two 32-bit SPARC cores with 128 kB of internal program memory. System clock frequencies are 300, 350, or 400 MHz. The Linedancer-P1 integrates an associative processor (ASProCore, with 4K PEs), a single SPARC core with a 4 kB instruction cache, and a DMA controller capable of transferring 64 bit at 66 MHz over a PCI interface, as shown in Figure 11.10.

FIGURE 11.10 The Linedancer-P1 layout. (The figure shows the 32-bit RISC CPU with 128 kB program RAM and external program DRAM, the DMA engine with its PCI interface and external data DRAM, 128 kB data RAM, links to neighbor Linedancers, and the ASProCore with 4,096 processing elements and 1 Mbit of storage.)

It further hosts 128 kB of internal data memory. The chip consumes 3.5 W typical at 300 MHz. The Linedancer-HD integrates two associative processors (2 × 2K PEs), two SPARC cores, each with an 8 kB instruction cache and
4 kB data cache, four internal DMA engines, and an external data channel
capable of transferring 64 bit at 133 MHz over a PCI-X interface, as shown
in Figure 11.11. The ASProCore has been extended with a chordal ring
inter-PE communication network that allows for faster 2D- and 3D-image
processing. It further hosts four external DDR2 DRAM interfaces, eight
dedicated streaming data I/O ports (up to 3.2 GB/s), and 1088 kB internal
data memory. The chip consumes 4.5 W typical at 300 MHz.
11.3.2.3 Design Methodology

The software development environment for the Linedancer consists of a compiler, linker, and debugger. The Linedancer is programmed in C, with some parallel extensions to support the ASProCore processing array. The toolchain is based on the GNU compiler framework, with dedicated pre- and postprocessing tools to compile and optimise the parallel extensions to C.
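The flavor of such data-parallel code can be suggested in plain C; the sketch below models an associative update across all PEs and is purely illustrative, as the actual Aspex extension syntax is not shown in this excerpt.

    /* Model of an associative SIMD operation: every PE whose data register
     * matches 'key' executes the update. In hardware all 4,096 PEs do this
     * simultaneously; the loop only expresses the per-PE semantics. */
    #define N_PES 4096

    void add_where_equal(int data[N_PES], int key, int delta)
    {
        for (int pe = 0; pe < N_PES; pe++) {
            if (data[pe] == key)    /* associative match: condition bit set */
                data[pe] += delta;  /* only matching PEs execute */
        }
    }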
Associative SIMD processing adds an extra dimension to massively parallel processing, enabling new views on problem modeling and the subsequent implementation (for example, in searching/sorting and data-dependent image processing). The Linedancer's 1D architecture scales better than the 2D arrays used in multi-ALU architectures such as PACT's XPP [6] or Tilera's 2D multicore array [7]. Because of the large size of the array, however, power consumption is relatively high compared to the Montium processor, which prevents application in handheld devices.
FIGURE 11.11 The Linedancer-HD layout. (The figure shows two 32-bit RISC CPUs with program memories, the ASProCore with 4,096 processing elements and 1 Mbit of storage, an ASProCore controller, four DMA engines, internal data memory, four banks of external data DRAM, eight direct data interfaces for 3.2 GB/s streaming data I/O, PCI-X, GPIO, JTAG, and links to neighbor Linedancers.)
11.3.3 PACT-XPP

The eXtreme processing platform (XPP) is an example of a homogeneous array structure. It is a run-time reconfigurable, coarse-grained data processing architecture. The XPP provides parallel processing power for high-bandwidth data such as video and audio processing. The XPP targets streaming DSP applications in the multimedia and telecommunications domain [10,20].
11.3.3.1 Architecture

The XPP architecture is based on a hierarchical array of coarse-grained, adaptive computing elements, called processing array elements (PAEs). The PAEs are clustered in processing array clusters (PACs). All PAEs in the XPP architecture are connected through a packet-oriented communication network. Figure 11.12 shows the hierarchical structure of the XPP array and the PAEs clustered in a PAC.

Three kinds of PAEs are identified in the XPP array: the ALU-PAE, the RAM-PAE, and the FNC-PAE. The ALU-PAE contains a multiplier and is used for DSP operations. The RAM-PAE contains a RAM to store data. The FNC-PAE is a unique sequential VLIW-like processor core. The FNC-PAEs are dedicated to the control flow and sequential sections of applications. Every PAC contains ALU-PAEs, RAM-PAEs, and FNC-PAEs. The PAEs operate according to a data flow principle: a PAE starts processing data as soon as all required input packets are available. If a packet cannot be processed, the pipeline stalls until the packet is received.
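This firing rule can be sketched in C with hypothetical types: a PAE consumes one packet per input and produces a result only when all required inputs are present. The sketch illustrates the principle, not the XPP hardware interface.

    /* Model of the data flow firing rule for a two-input PAE. A packet
     * slot is either empty or holds a value; the PAE fires only when both
     * inputs are available and the output slot has been drained. */
    #include <stdbool.h>

    typedef struct { bool valid; int value; } packet_t;

    bool pae_try_fire(packet_t *in_a, packet_t *in_b, packet_t *out)
    {
        if (!in_a->valid || !in_b->valid || out->valid)
            return false;                       /* stall: wait for packets */
        out->value = in_a->value * in_b->value; /* e.g., an ALU-PAE multiply */
        out->valid = true;
        in_a->valid = false;                    /* input packets are consumed */
        in_b->valid = false;
        return true;
    }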
FIGURE 11.12 The structure of an XPP array composed of four PACs. (Each PAC contains ALU-PAEs, RAM-PAEs, FNC-PAEs, and I/O blocks, and is controlled by a configuration manager (CM); a supervising CM (SCM) is connected to the four CMs.) (From Baumgarte, V. et al., J. Supercomput., 26(2), 167, September 2003.)
Each PAC is controlled by a configuration manager (CM). The CM is responsible for writing configuration data into the configurable objects of the PAC. Multi-PAC XPP arrays contain additional CMs for concurrent configuration data handling, arranged in a hierarchical tree of CMs. The top CM, called the supervising CM (SCM), has an external interface (not shown in Figure 11.12) that connects the SCM to an external configuration memory.
11.3.3.2 Design Methodology

DSP algorithms are directly mapped onto the XPP array according to their data flow graphs. The flow graph nodes define the functionality and operations of the PAEs, whereas the edges define the connections between the PAEs. The XPP array is programmed using the native mapping language (NML); see [20]. In NML descriptions, the PAEs are explicitly allocated and the connections between the PAEs are specified. Optionally, the allocated PAEs are placed onto the XPP array. NML also includes statements to support configuration handling. Configuration handling is an explicit part of the application description.
A vectorizing C compiler is available to translate C functions into NML modules. The vectorizing compiler for the XPP array analyzes the code for data dependencies, vectorizes suitable code sections automatically, and generates highly parallel code for the XPP array. The vectorizing C compiler is typically used to program "regular" DSP operations that are mapped on the ALU-PAEs and RAM-PAEs of the XPP array. Furthermore, a coarse-grained parallelization into several FNC-PAE threads is very useful when "irregular" DSP operations exist in an application; this allows even irregular, control-dominated code to run in parallel on several FNC-PAEs. The FNC-PAE C compiler is similar to a conventional RISC compiler, extended with VLIW features to take advantage of ILP within the DSP algorithms.
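As an illustration (not taken from the XPP tool documentation), the kind of "regular" C function a vectorizing compiler handles well is a dependency-free loop such as the one below, whose iterations can be spread over ALU-PAEs and fed from RAM-PAEs.

    /* Illustrative "regular" DSP loop for a vectorizing compiler: no
     * loop-carried dependencies, so iterations can be mapped onto the
     * data flow array and executed in a pipelined fashion. */
    void scale_add(const short *a, const short *b, short *y,
                   int n, short gain)
    {
        for (int i = 0; i < n; i++)
            y[i] = (short)(gain * a[i] + b[i]);
    }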
11.3.4 Tilera

The Tile64 [7] is a tile processor based on the mesh architecture that was originally developed for the RAW machine [26]. The chip consists of a grid of processor tiles arranged in a network (see Figure 11.13), where each tile contains a GPP, a cache, and a nonblocking router that the tile uses to communicate with the other tiles on the chip.

The Tilera processor architecture incorporates a 2D array of homogeneous, general-purpose cores. Next to each processor there is a switch that connects
FIGURE 11.13 Tile64 processor. (The figure shows the grid of tiles, each containing a processor, caches, and a switch, with DDR and general I/O interfaces at the edges of the chip.)