Nicolescu/Model-Based Design for Embedded Systems 67842_C011 Finals Page 336 2009-10-2
MONTIUMC. The MONTIUM design methodology to map DSP applications on the MONTIUM TP is divided into three steps:

1. The high-level description of the DSP application is analyzed and computationally intensive DSP kernels are identified.
2. The identified DSP kernels, or parts of the DSP kernels, are mapped on one or multiple MONTIUM TPs that are available in a SoC. The DSP operations are programmed on the MONTIUM TP using MONTIUMC.
3. Depending on the layout of the SoC in which the MONTIUM processing tiles are applied, the MONTIUM processing tiles are configured for a particular DSP kernel or part of the DSP kernel. Furthermore, the channels in the NoC between the processing tiles are configured.
11.3.1.3 ANNABELLE Heterogeneous System-on-Chip
In this section, the prototype ANNABELLE SoC, which is intended for digital radio broadcasting receivers (e.g., digital audio broadcasting, digital radio mondiale), is described according to the heterogeneous SoC template mentioned before. Figure 11.6 shows the overall architecture of the ANNABELLE SoC. The ANNABELLE SoC consists of an ARM926 GPP with a five-layer AMBA AHB, four MONTIUM TPs, an NoC, a Viterbi decoder, two ADCs, two DDCs, a DMA controller, SRAM/SDRAM memory interfaces, and external bus interfaces.
The four MONTIUM TPs and the NoC are arranged in a reconfigurable subsystem, labelled "reconfigurable fabric." The reconfigurable fabric is connected to the AHB bus and serves as a slave to the AMBA system. A configurable clock controller generates the clocks for the individual MONTIUM TPs. Every individual MONTIUM TP has its own adjustable clock and runs at its own speed. A prototype chip of the ANNABELLE SoC has been produced using the Atmel 130 nm CMOS process [8].
FIGURE 11.6
Block diagram of the ANNABELLE SoC. (Figure: the ARM926 GPP, DMA controller, IRQ controller, Viterbi decoder, two DDCs, two ADCs, SRAM/SDRAM and external bus interfaces, and a clock controller around a 5-layer AMBA advanced high-performance bus; four MONTIUM TPs, each with a CCU, and a network-on-chip form the reconfigurable fabric.)

FIGURE 11.7
The ANNABELLE SoC reconfigurable fabric. (Figure: four MONTIUM TPs, each attached through its CCU to two NoC routers via ports 0 to 3; an AHB-NoC bridge with two queues, a control block, and an AHB port connects the fabric to the bus.)

The reconfigurable fabric that is integrated in the ANNABELLE SoC is shown in detail in Figure 11.7. The reconfigurable fabric acts as a
reconfigurable coprocessor for the ARM926 processor. Computationally intensive DSP algorithms are typically offloaded from the ARM926 processor and processed on the coarse-grained reconfigurable MONTIUM TPs inside the reconfigurable fabric. The reconfigurable fabric contains four MONTIUM TPs, which are connected via a CCU to a circuit-switched NoC. The reconfigurable fabric is connected to the AMBA system through an AHB-NoC bridge interface. Configurations, generated at design-time, can be loaded onto the MONTIUM TPs at run-time. The reconfigurable fabric provides "block mode" and "streaming mode" computation services.
For ASIC synthesis, worst-case military conditions are assumed. In particular, the supply voltage is 1.1 V and the temperature is 125°C. Results obtained with the synthesis are as follows:
• The area of one MONTIUM core is 3.5 mm², of which 0.2 mm² is for the CCU and 3.3 mm² is for the MONTIUM TP (including memory).
• With Synopsys tooling we estimated that the MONTIUM TP, within the ANNABELLE ASIC realization, can implement an FIR filter at about 100 MHz or an FFT at 50 MHz. The worst-case clock frequency of the ANNABELLE chip is 25 MHz.
• With the Synopsys PrimePower tool, we estimated the energy consumption using placed and routed netlists. The following section provides some of the results.
TABLE 11.2
Dynamic Power Consumption of One MONTIUM on ANNABELLE

                 Energy (mW/MHz)
Module       FIR-5   FFT-512   FFT-288
Datapath     0.19    0.24      0.15
Memories     0.0     0.27      0.21
Sequencer    0.02    0.07      0.05
Decoders     0.0     0.0       0.0
CCU          0.02    0.02      0.02
Total        0.23    0.60      0.43

TABLE 11.3
Energy Comparison of MONTIUM/ARM926

             MONTIUM   ARM926
Algorithm    (μJ)      (μJ)     Ratio
FIR-5        0.243     —        —
FFT-112      0.357     9        25
FFT-176      0.616     16       26
FFT-256      0.707     14       20
FFT-288      1.001     23       23
FFT-512      1.563     30       19
FFT-1920     5.054     168      33
11.3.1.4 Average Power Consumption
To determine the average power consumption of the ANNABELLE as accurately as possible, we performed a number of power estimations on the placed and routed netlist using the Synopsys Power Compiler. Table 11.2 provides the dynamic power consumption in mW/MHz of various MONTIUM blocks for three well-known DSP algorithms. These figures show that the overhead of the sequencer and decoder is low: <16% of the total dynamic power consumption. Finally, Table 11.3 compares the energy consumption of the MONTIUM and the ARM926 on ANNABELLE. For the FIR-5 algorithm the memory is not used.
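The ratio column in Table 11.3 is simply the ARM926 energy divided by the MONTIUM energy for the same algorithm. A quick sanity check of the published values (a sketch over the table's numbers, not output of the book's tooling):

```python
# Energy per execution in microjoules, copied from Table 11.3.
energy = {
    # algorithm: (MONTIUM uJ, ARM926 uJ, published ratio)
    "FFT-112":  (0.357, 9,   25),
    "FFT-176":  (0.616, 16,  26),
    "FFT-256":  (0.707, 14,  20),
    "FFT-288":  (1.001, 23,  23),
    "FFT-512":  (1.563, 30,  19),
    "FFT-1920": (5.054, 168, 33),
}

for name, (montium, arm, published) in energy.items():
    ratio = arm / montium
    # The table rounds the ratio to the nearest integer.
    assert round(ratio) == published, name
    print(f"{name}: ARM926/MONTIUM = {ratio:.1f}x")
```

All six rows reproduce the published ratios, confirming the roughly 19x to 33x energy advantage of the MONTIUM over the ARM926.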
11.3.1.5 Locality of Reference
As mentioned above, locality of reference is an important design parameter. One of the reasons for the excellent energy figures of the MONTIUM is the use of locality of reference. To illustrate this, Table 11.4 gives the number of memory references local to the cores compared to the number of off-core communications. These figures are, as expected, algorithm dependent. Therefore, in this table, we chose three well-known algorithms in the
TABLE 11.4
Internal and External Memory References per Execution of an Algorithm

                              Number of Memory References
Algorithm                     Internal   External   Ratio
1024p FFT                     51200      4096       12.5
200 tap FIR                   405        2          202.5
SISO algorithm (N softbits)   18·N       3·N        6

TABLE 11.5
Reconfiguration of Algorithms on the MONTIUM

Algorithm     Change                Size        # Cycles
1024p FFT     Scaling factors       ≤150 bit    ≤10
  to iFFT     Twiddle factors       16384 bit   512
200 tap FIR   Filter coefficients   ≤3200 bit   ≤80
streaming DSP application domain: a 1024p FFT, a 200 tap FIR filter, and a part of a Turbo decoder (SISO algorithm [17]). The results show that for these algorithms 80%–99% of the memory references are local (within a tile).
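The ratios in Table 11.4 follow directly from the reference counts, and the 80%–99% locality claim can be checked the same way (a sketch over the table's numbers; the SISO row is evaluated for a hypothetical N):

```python
# (internal references, external references) per execution, from Table 11.4.
N = 1024  # hypothetical number of softbits for the SISO row
refs = {
    "1024p FFT":   (51200, 4096),
    "200 tap FIR": (405, 2),
    "SISO":        (18 * N, 3 * N),
}

for name, (internal, external) in refs.items():
    ratio = internal / external
    local = internal / (internal + external)
    print(f"{name}: ratio {ratio:.1f}, {local:.1%} of references are local")

# The FFT is 51200/55296 ~ 92.6% local, the FIR 405/407 ~ 99.5%, and the
# SISO algorithm 18N/21N ~ 85.7%: all inside the quoted 80%-99% range.
```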
11.3.1.6 Partial Dynamic Reconfiguration
One of the advantages of a multicore SoC organization is that each individual core can be reconfigured while the other cores are operational. In the MONTIUM, the configuration memory is organized as a RAM. This means that to reconfigure the MONTIUM, the entire configuration memory need not be rewritten, but only the parts that have changed. Furthermore, because the MONTIUM has a coarse-grained reconfigurable architecture, the configuration memory is relatively small: the MONTIUM has a configuration size of only 2.6 kB. Table 11.5 gives some examples of reconfigurations.
To reconfigure a MONTIUM from executing a 1024 point FFT to executing a 1024 point inverse FFT requires updating the scaling and twiddle factors. Updating these factors requires less than 522 clock cycles in total. To change the coefficients of a 200 tap FIR filter requires less than 80 clock cycles.
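Because the configuration memory is a RAM, a reconfiguration only has to write the words that differ between the old and new configurations. A toy model of that idea (the word width and helper name are hypothetical; the real MONTIUM loader works at its own granularity):

```python
def reconfigure(config_mem, new_config):
    """Write only the configuration words that changed; return the write count."""
    writes = 0
    for addr, word in enumerate(new_config):
        if config_mem[addr] != word:
            config_mem[addr] = word
            writes += 1
    return writes

# 2.6 kB configuration memory modeled as 1300 16-bit words (hypothetical width).
old = [0] * 1300
new = list(old)
new[100:110] = range(1, 11)   # only 10 words differ, e.g. new scaling factors

assert reconfigure(old, new) == 10   # 10 writes instead of rewriting all 1300
assert old == new                    # memory now holds the new configuration
```

This is why switching from an FFT to an inverse FFT costs only the scaling- and twiddle-factor updates (≤522 cycles) rather than a full 2.6 kB rewrite.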
11.3.2 Aspex Linedancer
The Linedancer [4] is an "associative" processor and an example of a homogeneous SoC. Associative processing is the property of instructions to execute only on those PEs where a certain value in their data register matches a value in the instruction. Associative processing is built around an intelligent memory concept: content addressable memory (CAM). Unlike standard computer memory (random access memory, or RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a tag list of one or more storage addresses where the word was found. Each CAM line that contains a word can be seen as a processing element (PE), and each tag-list element as a 1-bit condition register. Depending on this register, the aggregate associative processor can either instruct the PEs to continue processing on the indicated subset, or return the involved words subsequently for further processing. Several implementations are possible, varying from bit-serial to word-parallel, but the latest implementations [4,5] can perform the involved lookups in parallel in a single clock cycle.
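The RAM/CAM contrast above can be made concrete with a minimal sketch (the class and method names are illustrative; in the Linedancer the comparisons happen in hardware, all lines in one clock cycle, while the loop here only models the semantics):

```python
class CAM:
    """Content addressable memory: look up by value, not by address."""

    def __init__(self, words):
        self.words = list(words)

    def match(self, value):
        """Return the tag list: every address whose stored word equals value.
        In hardware all lines are compared simultaneously."""
        return [addr for addr, word in enumerate(self.words) if word == value]

ram = ["cat", "dog", "cat", "fish"]
cam = CAM(ram)

assert ram[2] == "cat"             # RAM: address in, data out
assert cam.match("cat") == [0, 2]  # CAM: data in, list of addresses out
assert cam.match("bird") == []     # the word is not stored anywhere
```

Each entry of the returned tag list plays the role of the 1-bit condition register described above: it selects the subset of PEs on which subsequent instructions execute.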
In general, the Linedancer belongs to the subclass of massively parallel SIMD architectures, with typically more than 512 processors. This SIMD subclass is perfectly suited to support data parallelism, for example, for signal, image, and video processing; text retrieval; and large databases. The associative functions furthermore allow the processor to function like an intelligent memory (CAM), permitting high-speed searching and data-dependent image processing operations (such as median filters and object recognition/labeling).
The so-called "ASProCore" of the Linedancer is designed around a very large number—up to 4,096—of simple PEs arranged in a line; see Figure 11.8.
Application areas are diverse but have in common the simple processing of very large amounts of data, from samples in 1D streams to pixels in 2D or 3D images. To mention a few: software defined radio (e.g., WiMAX), broadcast (video compression), medical imaging (3D reconstruction), and high-end printers, in particular for raster image processing (RIP).
FIGURE 11.8
The scalable architecture of the Linedancer. (Figure: an on-chip RISC core feeds a common instruction bus to the associative string processing array (ASProCore) of thousands of PEs, PE 0 to PE 4,095, joined by an inter-PE communication network that is cascadable over chips; the array connects to on-chip or off-chip memory.)
FIGURE 11.9
The architecture of the Linedancer's associative string processor (ASProCore). (Figure: each of the 4,096 PEs combines a 64-bit associative memory (CAM), a 128-bit extended memory, a 1-bit ALU, and inter-PE communication, with mask and data paths, a bulk I/O memory (the PDS), and the LLP and RLP link ports at the ends of the line.)
In the following sections, the associative processor (ASProCore) and the
Linedancer family are introduced. At the end, we present the development
tool chain and a brief conclusion on the Linedancer application domain.
11.3.2.1 ASProCore Architecture
Each PE has a 1-bit ALU, a 32–64 bit fully associative memory array, and a 128 bit extended memory. See Figure 11.9 for a detailed view of the ASProCore architecture. The processors are connected in a 1D network, actually a 4K bit shift register, between the indicated "left link port" (LLP) and "right link port" (RLP). The network allows data to be shared between PEs with minimum overhead. The ASProCore also has a separate word-serial, bit-parallel memory, the primary data store (PDS), for high-speed data input. The on-chip DMA engine automatically translates 2D and 3D images into the 1D array (passed through via the PDS). The 1D architecture allows for linear scaling of performance, memory, and communication, provided the application is expressed in a scalable manner. The Linedancer also features a single or dual 32-bit RISC core (P1 and HD, respectively) for sequential processing and for controlling the ASProCore.
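The 1D network between the LLP and RLP behaves like a long shift register: in one network step every PE hands its value to its neighbor. A sketch of that step (the function name and fill value are illustrative only):

```python
def shift_right(pe_values, fill=0):
    """One network step: every PE passes its value to its right neighbor.
    The leftmost PE receives `fill` (in hardware, data entering via the LLP);
    the rightmost value leaves the array (via the RLP)."""
    return [fill] + pe_values[:-1]

pes = [10, 20, 30, 40]
assert shift_right(pes) == [0, 10, 20, 30]
# Chaining steps moves data further down the line; this is how neighboring
# PEs exchange pixels or samples with minimal overhead during 1D processing.
assert shift_right(shift_right(pes)) == [0, 0, 10, 20]
```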
11.3.2.2 Linedancer Hardware Architecture
The current Linedancers, the P1 and the HD, have been realized in a 0.13 μm CMOS process. Both have one or two 32-bit SPARC cores with 128 kB internal program memory. System clock frequencies are 300, 350, or 400 MHz. The Linedancer-P1 integrates an associative processor (ASProCore, with 4K PEs), a single SPARC core with a 4 kB instruction cache, and a DMA controller capable of transferring 64 bit at 66 MHz over a PCI interface, as shown in Figure 11.10. It further hosts 128 kB internal data memory. The chip consumes 3.5 W typical at 300 MHz.

FIGURE 11.10
The Linedancer-P1 layout. (Figure: a 32-bit RISC CPU with 128 kB program RAM and external program DRAM, a DMA engine with PCI interface, 128 kB data RAM with external data DRAM, and the ASProCore with 4,096 processing elements and 1 Mbit storage, linked to neighbor Linedancers.)

The Linedancer-HD integrates two associative processors (2 × 2K PEs), two SPARC cores, each with an 8 kB instruction cache and a 4 kB data cache, four internal DMA engines, and an external data channel capable of transferring 64 bit at 133 MHz over a PCI-X interface, as shown in Figure 11.11. The ASProCore has been extended with a chordal-ring inter-PE communication network that allows for faster 2D- and 3D-image processing. It further hosts four external DDR2 DRAM interfaces, eight dedicated streaming data I/O ports (up to 3.2 GB/s), and 1088 kB internal data memory. The chip consumes 4.5 W typical at 300 MHz.
11.3.2.3 Design Methodology
The software development environment for the Linedancer consists of a compiler, linker, and debugger. The Linedancer is programmed in C, with some parallel extensions to support the ASProCore processing array. The toolchain is based on the GNU compiler framework, with dedicated pre- and postprocessing tools to compile and optimise the parallel extensions to C.
Associative SIMD processing adds an extra dimension to massively parallel processing, enabling new views on problem modeling and the subsequent implementation (for example, in searching/sorting and data-dependent image processing). The Linedancer's 1D architecture scales better than the 2D arrays often used in multi-ALU designs such as PACT's XPP [6] or Tilera's 2D multicore array [7]. Because of the large size of the array, power consumption is relatively high compared to the MONTIUM processor, which prevents its application in handheld devices.
FIGURE 11.11
The Linedancer-HD layout. (Figure: two 32-bit RISC CPUs with program memories, an ASProCore controller, the V7 ASProCore with 4,096 processing elements and 1 Mbit storage, four DMA engines, internal data memory, eight direct data interfaces for 3.2 GB/s streaming data I/O, external data DRAM in 4 banks, PCI-X, GPIO, JTAG, and control, linked to neighbor Linedancers.)
11.3.3 PACT-XPP
The eXtreme processing platform (XPP) is an example of a homogeneous array structure. It is a run-time reconfigurable, coarse-grained data processing architecture. The XPP provides parallel processing power for high-bandwidth data such as video and audio processing. The XPP targets streaming DSP applications in the multimedia and telecommunications domain [10,20].

11.3.3.1 Architecture
The XPP architecture is based on a hierarchical array of coarse-grained, adaptive computing elements, called processing array elements (PAEs). The PAEs are clustered in processing array clusters (PACs). All PAEs in the XPP architecture are connected through a packet-oriented communication network. Figure 11.12 shows the hierarchical structure of the XPP array and the PAEs clustered in a PAC.
Three kinds of PAEs are identified in the XPP array: the "ALU-PAE," the "RAM-PAE," and the "FNC-PAE." The ALU-PAE contains a multiplier and is used for DSP operations. The RAM-PAE contains a RAM to store data. The FNC-PAE is a unique sequential VLIW-like processor core. The FNC-PAEs are dedicated to the control flow and sequential sections of applications. Every PAC contains ALU-PAEs, RAM-PAEs, and FNC-PAEs. The PAEs operate according to a data flow principle; a PAE starts processing data as soon as all required input packets are available. If a packet cannot be processed, the pipeline stalls until the packet is received.
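The data flow firing rule just described, a PAE starting only when every required input packet is available and stalling otherwise, can be sketched as a tiny simulation (a hypothetical queue-based model for illustration, not NML or actual XPP tooling):

```python
from collections import deque

class PAE:
    """Toy processing array element: fires only when every input has a packet."""

    def __init__(self, operation, num_inputs):
        self.operation = operation
        self.inputs = [deque() for _ in range(num_inputs)]

    def receive(self, port, packet):
        self.inputs[port].append(packet)

    def step(self):
        """Fire once if all input queues hold a packet; otherwise stall."""
        if all(self.inputs):  # an empty deque is falsy: a missing packet stalls us
            return self.operation(*(q.popleft() for q in self.inputs))
        return None  # pipeline stalls until the missing packet arrives

adder = PAE(lambda a, b: a + b, num_inputs=2)
adder.receive(0, 3)
assert adder.step() is None   # second operand not yet available: stall
adder.receive(1, 4)
assert adder.step() == 7      # both packets present: the PAE fires
```

Wiring the output of one such element to the input queue of another mirrors the packet-oriented network between PAEs: back-pressure propagates naturally, because a downstream element that has not consumed its packets simply never fires.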
FIGURE 11.12
The structure of an XPP array composed of four PACs. (Figure: each PAC combines ALU-PAEs, RAM-PAEs, FNC-PAEs, and I/O elements and has its own configuration manager (CM); a supervising CM (SCM) sits above the four CMs.) (From Baumgarte, V. et al., J. Supercomput., 26(2), 167, September 2003.)
Each PAC is controlled by a configuration manager (CM). The CM is responsible for writing configuration data into the configurable objects of the PAC. Multi-PAC XPP arrays contain additional CMs for concurrent configuration data handling, arranged in a hierarchical tree of CMs. The top CM, called the supervising CM (SCM), has an external interface, not shown in Figure 11.12, that connects it to an external configuration memory.
11.3.3.2 Design Methodology
DSP algorithms are directly mapped onto the XPP array according to their data flow graphs. The flow graph nodes define the functionality and operations of the PAEs, whereas the edges define the connections between the PAEs. The XPP array is programmed using the native mapping language (NML); see [20]. In NML descriptions, the PAEs are explicitly allocated and the connections between the PAEs are specified. Optionally, the allocated PAEs are placed onto the XPP array. NML also includes statements to support configuration handling. Configuration handling is an explicit part of the application description.
A vectorizing C compiler is available to translate C functions to NML modules. The vectorizing compiler for the XPP array analyzes the code for data dependencies, vectorizes those code sections automatically, and generates highly parallel code for the XPP array. The vectorizing C compiler is typically used to program "regular" DSP operations that are mapped on the ALU-PAEs and RAM-PAEs of the XPP array. Furthermore, a coarse-grained parallelization into several FNC-PAE threads is very useful when "irregular" DSP operations exist in an application. This allows running even irregular, control-dominated code in parallel on several FNC-PAEs. The FNC-PAE C compiler is similar to a conventional RISC compiler, extended with VLIW features to take advantage of ILP within the DSP algorithms.
11.3.4 Tilera
The Tile64 [7] is a TP based on the mesh architecture that was originally developed for the RAW machine [26]. The chip consists of a grid of processor tiles arranged in a network (see Figure 11.13), where each tile consists of a GPP, a cache, and a nonblocking router that the tile uses to communicate with the other tiles on the chip.

FIGURE 11.13
Tile64 processor. (Figure: a grid of tiles, each pairing a processor and caches with a switch, surrounded by DDR and general I/O interfaces.)

The Tilera processor architecture incorporates a 2D array of homogeneous, general-purpose cores. Next to each processor there is a switch that connects
the core to the iMesh on-chip network. The combination of a core and a switch forms the basic building block of the Tilera processor: the tile. Each core is a fully functional processor capable of running complete operating systems and off-the-shelf C code. Each core is optimized to provide a high performance/power ratio, running at speeds between 600 MHz and 1 GHz, with power consumption as low as 170 mW in a typical application. Each core supports standard processor features such as
• Full access to memory and I/O
• Virtual memory mapping and protection (MMU/TLB)
• Hierarchical cache with separate L1-I and L1-D
• Multilevel interrupt support
• Three-way VLIW pipeline to issue three instructions per cycle
The cache subsystem on each tile consists of a high-performance, two-level, nonblocking cache hierarchy. Each processor/tile has a split level 1 cache (L1 instruction and L1 data) and a level 2 cache, keeping the design fast and power efficient. When there is a miss in the level 2 cache of a specific processor, the level 2 caches of the other processors are searched for the data before external memory is consulted. In this way, a large level 3 cache is emulated.
This promotes on-chip access and avoids the bottleneck of off-chip global memory. Multicore coherent caching allows a page of shared memory, cached on a specific tile, to be accessed by other tiles via load/store references. Since one tile effectively prefetches for the others, this technique can yield significant performance improvements.
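The miss handling described above, checking the other tiles' level 2 caches before going off-chip, can be sketched as follows (the function and dictionary model are hypothetical; the real mechanism is Tilera's hardware-coherent caching over the iMesh):

```python
def load(addr, my_l2, other_l2s, external_memory):
    """Serve a load: local L2 first, then the other tiles' L2 caches
    (the 'emulated L3'), and only then off-chip external memory."""
    if addr in my_l2:
        return my_l2[addr], "local L2"
    for tile_l2 in other_l2s:
        if addr in tile_l2:
            my_l2[addr] = tile_l2[addr]   # cache locally for subsequent loads
            return tile_l2[addr], "remote L2"
    my_l2[addr] = external_memory[addr]
    return external_memory[addr], "external memory"

mem = {0x100: 42, 0x200: 7}
tile0_l2, tile1_l2 = {}, {0x100: 42}

assert load(0x100, tile0_l2, [tile1_l2], mem) == (42, "remote L2")
assert load(0x100, tile0_l2, [tile1_l2], mem) == (42, "local L2")
assert load(0x200, tile0_l2, [tile1_l2], mem) == (7, "external memory")
```

The second load of 0x100 hits locally because the first remote hit was cached on the way back, which is the prefetch-for-others effect the text mentions.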
To fully exploit the available compute power of large numbers of processors, a high-bandwidth, low-latency interconnect is essential. The network (iMesh) provides the high-speed data transfer needed to minimize system bottlenecks and to scale applications. iMesh consists of five distinct mesh networks: two networks are completely managed by hardware and are used to move data to and from the tiles and memory in the event of cache misses or DMA transfers. The three remaining networks are available for application use, enabling communication between cores and between cores and I/O devices. A number of high-level abstractions are supplied for accessing the hardware (e.g., socket-like streaming channels and message-passing interfaces). The iMesh network enables communication without interrupting applications running on the tiles. It facilitates data transfer between tiles, contains all of the control and datapath for each of the network connections, and implements buffering and flow control within all the networks.
11.3.4.1 Design Methodology
The TILE64 processor is programmable in ANSI standard C and C++. Tiles
can be grouped into clusters to apply the appropriate amount of processing
power to each application and parallelism can be explicitly specified.
11.4 Conclusion
In this chapter, we addressed reconfigurable multicore architectures for streaming DSP applications. Streaming DSP applications express computation as a data flow graph with streams of data items (the edges) flowing between computation kernels (the nodes). Typical examples of streaming DSP applications are wireless baseband processing, multimedia processing, medical image processing, and sensor processing. These application domains require flexible and energy-efficient architectures, which can be realized with a multicore architecture. The most important criteria for designing such a multicore architecture are predictability and composability, energy efficiency, programmability, and dependability. Two other important criteria are performance and flexibility. Different types of processing cores have been discussed, from ASICs and reconfigurable hardware to DSPs and GPPs. ASICs have high performance but suffer from poor flexibility, while DSPs and GPPs offer flexibility but modest performance. Reconfigurable hardware combines the best of both worlds. These different processing cores are, together with memory and I/O blocks, assembled into MP-SoCs. MP-SoCs can be classified into two groups: homogeneous and heterogeneous. In homogeneous MP-SoCs, multiple cores of a single type are combined, whereas in a heterogeneous MP-SoC, multiple cores of different types are combined.
We also discussed four different architectures: the MONTIUM/ANNABELLE SoC, the Aspex Linedancer, the PACT-XPP, and the Tilera processor. The MONTIUM, a coarse-grained, run-time reconfigurable core, has been used as one of the building blocks of the ANNABELLE SoC. The ANNABELLE SoC can be classified as a heterogeneous MP-SoC. The Aspex Linedancer is a homogeneous MP-SoC where a single instruction is executed by multiple processors simultaneously (SIMD). The PACT-XPP is an array processor where multiple ALUs are combined in a 2D structure. The Tilera processor is an example of a homogeneous MIMD MP-SoC.
References
1. The International Technology Roadmap for Semiconductors, ITRS Roadmap 2003. Website, 2003.
2. A coarse-grained reconfigurable architecture template and its compilation techniques. PhD thesis, Katholieke Universiteit Leuven, Leuven, Belgium, January 2005.
3. Nvidia G80, architecture and GPU analysis, 2007.
4. Aspex Semiconductor: Technology. Website, 2008.
5. Mimagic 6+ Enables Exciting Multimedia for Feature Phones. Website, 2008.
6. PACT, 2008.
7. Tilera Corporation, 2008.
8. Atmel Corporation. ATC13 Summary, 2007.
9. A. Banerjee, P.T. Wolkotte, R.D. Mullins, S.W. Moore, and G.J.M. Smit. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):319–329, March 2009.
10. V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP—A self-reconfigurable data processing architecture. Journal of Supercomputing, 26(2):167–184, September 2003.
11. M.D. van de Burgwal, G.J.M. Smit, G.K. Rauwerda, and P.M. Heysters. Hydra: An energy-efficient and reconfigurable network interface. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'06), Las Vegas, NV, pp. 171–177, June 2006.
12. G. Burns, P. Gruijters, J. Huisken, and A. van Wel. Reconfigurable accelerator enabling efficient SDR for low-cost consumer devices. In SDR Technical Forum, Orlando, FL, November 2003.
13. A.P. Chandrakasan, S. Sheng, and R.W. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473–484, April 1992.
14. W.J. Dally, U.J. Kapasi, B. Khailany, J.H. Ahn, and A. Das. Stream processors: Programmability and efficiency. Queue, 2(1):52–62, 2004.
15. European Telecommunication Standard Institute (ETSI). Broadband Radio Access Networks (BRAN); HIPERLAN Type 2; Physical (PHY) Layer, ETSI TS 101 475 v1.2.2 edition, February 2001.
16. Y. Guo. Mapping applications to a coarse-grained reconfigurable architecture. PhD thesis, University of Twente, Enschede, the Netherlands, September 2006.
17. P.M. Heysters, L.T. Smit, G.J.M. Smit, and P.J.M. Havinga. Max-log-MAP mapping on an FPFA. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), Las Vegas, NV, pp. 90–96, June 2002. CSREA Press, Las Vegas, NV.
18. P.M. Heysters. Coarse-grained reconfigurable processors – flexibility meets efficiency. PhD thesis, University of Twente, Enschede, the Netherlands, September 2004.
19. R.P. Kleihorst, A.A. Abbo, A. van der Avoird, M.J.R. Op de Beeck, L. Sevat, P. Wielage, R. van Veen, and H. van Herten. Xetal: A low-power high-performance smart camera processor. IEEE International Symposium on Circuits and Systems (ISCAS 2001), 5:215–218, 2001.
20. PACT XPP Technologies, 2007.
21. D.C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey et al. Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE Journal of Solid-State Circuits, 41(1):179–196, January 2006.
22. G.K. Rauwerda, P.M. Heysters, and G.J.M. Smit. Towards software defined radios using coarse-grained reconfigurable hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(1):3–13, January 2008.
23. Recore Systems, 2007.
24. G.J.M. Smit, A.B.J. Kokkeler, P.T. Wolkotte, and M.D. van de Burgwal. Multi-core architectures and streaming applications. In I. Mandoiu and A. Kennings (editors), Proceedings of the Tenth International Workshop on System-Level Interconnect Prediction (SLIP 2008), New York, pp. 35–42, April 2008. ACM Press, New York.
25. S.R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan et al. An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):29–41, January 2008.
26. E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim et al. Baring it all to software: Raw machines. Computer, 30(9):86–93, September 1997.
12
FPGA Platforms for Embedded Systems
Stephen Neuendorffer
CONTENTS
12.1 Introduction 351
12.2 Background 353
12.2.1 Processor Systems in FPGAs 353
12.2.2 FPGA Configuration and Reconfiguration 355
12.2.3 Partial Reconfiguration with Processors 358
12.2.4 Reusable FPGA Platforms for Embedded Systems 360
12.3 EDK Designs with Linux 361
12.3.1 Design Constraints 361
12.3.2 Device Trees 362
12.4 Introduction to Modular Partial Reconfiguration 363
12.5 EDK Designs with Partial Reconfiguration 364
12.5.1 Abstracting the Reconfigurable Socket 365
12.5.2 Interface Architecture 365
12.5.3 Direct Memory Access Interfaces 366
12.5.4 External Interfaces 368
12.5.5 Implementation Flow 369
12.6 Managing Partial Reconfiguration in Linux 370
12.7 Putting It All Together 372
12.8 Conclusion 375
References 377
12.1 Introduction
Increasingly, programmable logic (such as field programmable gate arrays [FPGAs]) is a critical part of low-power and high-performance signal processing systems. Typically, these systems also include a complex system architecture, along with control processors, digital signal processing (DSP) elements, and perhaps dedicated circuits. In some cases, it is economical to integrate these system components in ASIC technology. As a result, a wide variety of general-purpose or application-specific standard product (ASSP) system-on-chip (SOC) architectures are available in the market. From the perspective of a system designer, these architectures solve a large portion of the system design problem, typically providing application-specific I/O interfaces, an operating system for the control processor, processor application programming interfaces (APIs) for accessing dedicated circuits, or communicating with programmable elements such as DSP cores.
As FPGAs have become larger and more capable, it has become possi-
ble to integrate a large portion of the system architecture completely within
an FPGA, including control processors, communication buses, DSP process-
ing, memory, I/O interfaces, and application-specific circuits. For a system
designer, such a System-in-FPGA (SIF) architecture may result in better sys-
tem characteristics if an appropriate ASSP does not exist. At the same time,
designing using FPGAs eliminates the initial mask costs and process technol-
ogy risks associated with custom ASIC design, while still allowing a system
to be highly tuned to a particular application.
Unfortunately, designing a good SIF architecture from scratch and imple-
menting it successfully can still be a risky, time-consuming process. Given

that FPGAs only exist in fixed sizes, leveraging all the resources available
in a particular device can be challenging. This problem has become even
more acute given the heterogeneous nature of current FPGA architectures,
making it more important to trade off critical resources in favor of less
critical ones. Furthermore, most design is still performed at the register-
transfer level (RTL), with few mechanisms to capture interface requirements or
guarantee protocol compatibility. Constructing radically new architectures
typically involves significant code rewriting and under practical design pres-
sures is not an option, given the time required for system verification.
Model-based design is one approach to reducing this risk. By focusing
on capturing a designer’s intention and providing high-level design con-
structs that are close to a particular application domain, model-based design
can enable a designer to quickly implement algorithms, analyze trade-offs,
and explore different alternatives. By raising the level of abstraction, model-
based design techniques can enable a designer to focus on key system-level
design decisions, rather than low-level implementation details. This process,
often called “platform-based design” [10,16], enables higher level abstrac-
tions to be expressed in terms of lower level abstractions, which can be more
directly implemented.
Unfortunately, in order to provide higher level design abstractions, exist-
ing model-based design methodologies must still have access to robust basic
abstractions and design libraries. Of particular concern in FPGA systems is
the role of the control processor as more complex processor programs, such
as an operating system, are used. The low-level interfaces between the pro-
cessor and the rest of the system can be fragile, since the operating system
and hardware must coordinate to provide basic abstractions, such as pro-
cess scheduling, memory protection, and power management. Architecting,
debugging, and verifying this interaction tends to require a wide span of
skills and specialized knowledge and can become a critical design problem,
even when using traditional design techniques.
One solution to this problem is to separate the control processor
subsystem from the bulk of the system and provide it as a fixed part of
FPGA Platforms for Embedded Systems 353
the FPGA platform. This subsystem can remain simple while being capable
of configuring and reconfiguring the FPGA fabric, bootstrapping an operat-
ing system, and providing a basis for executing application-specific control
code. Historically, several architectures have provided such a platform with
the processor system implemented in ASIC technology coupled with pro-
grammable FPGA fabric, including the Triscend architecture [20], which was
later acquired by Xilinx. Although current FPGAs sometimes integrate hard
processor cores (such as in the Xilinx Virtex 2 Pro family), a complete proces-
sor subsystem is typically not provided.
This chapter describes the use of the partial reconfiguration (PR) capabil-
ities of some FPGAs to provide a complete processor-based platform using
existing general-purpose FPGAs. PR involves the reconfiguration of part of
an FPGA (a reconfigurable region) while another part of the FPGA (a static
region) remains active and operating. Using PR, the processor subsystem
can be implemented as a largely application-independent static region of the
FPGA, while the application-specific portion can be implemented in a recon-
figurable region. The processor subsystem can be verified and optimized
beforehand, combined with an operating system image and distributed as
a binary image. From the perspective of a designer or a model-based design
tool, the static region of the FPGA becomes part of the FPGA platform, while
the reconfigurable region can then be treated as any other FPGA, albeit with
some resources reserved.
To understand the requirements for designing such a platform, we will
first provide some background of how processors and PR are used to design
SIF architectures. Then, we will describe the currently available tools, par-
ticularly related to PR, for building a reusable platform. Lastly, we will
provide an in-depth design example showing how such a platform can be
constructed.
12.2 Background
12.2.1 Processor Systems in FPGAs
Processor-based systems are commonly constructed in FPGAs. An obvious
way to build such a system is to take the RTL used for an ASIC implementa-
tion and target the RTL toward the FPGA using logic synthesis. In most cases,
however, the resulting FPGA design is relatively inefficient (being both rela-
tively large in silicon area and slow). Recent studies suggest that direct FPGA
implementation may be around 40 times larger (in silicon area) and one-third
of the clock speed of a standard-cell design on small benchmark circuits [9].
Experience with emulating larger processor designs, such as the Sparc V9
core from the OpenSparc T1 [19] and the PowerPC 405 core, in FPGAs sug-
gest a slowdown of at least 20 times compared to ASIC implementations.
The differences arise largely because of the overhead of FPGA pro-
grammability, which requires many more transistors than an equivalent
ASIC implementation. However, whereas many ASIC processors have com-
plex architectures in order to meet high computation requirements, systems
designed for FPGAs tend to make use of FPGA parallelism to meet the bulk
of the computation requirements. Hence, only relatively simple control pro-
cessors are necessary in FPGA systems, when combined with application-
specific FPGA design. When a processor architecture can be tuned to match
the FPGA architecture, as is typically done with “soft-core” processors, such
as the Xilinx Microblaze, reasonable clock rates (∼100 MHz) can be achieved
even in small, relatively slow, cost-optimized Xilinx Spartan 3 FPGAs. Alter-
natively, somewhat higher clock rates (up to 500 MHz) and performance
can be achieved by incorporating the processor core as a “hard-core” in the
FPGA, as is done with PowerPC cores in Xilinx Virtex 4 FX FPGAs.
One advantage of a faster control processor is being able to effectively run
larger, more complex control programs. Operating systems are often used to
mitigate this complexity. An operating system not only provides access to
various resources in the system, but also enables multiple pieces of indepen-
dent code to effectively share those resources by providing locking, memory
allocation, file abstractions, and process scheduling. In addition, operating
systems are designed to be robust and stable where an application process
cannot corrupt the operating system or other processes, making it signifi-
cantly easier to design and debug large systems.
Such an architecture, which combines a simple control processor hosting
an operating system with a high-performance computational engine, is not
unique to FPGA-based systems. With the move toward multicore architec-
tures in embedded processing platforms, typically one processor core serves
the role of the control processor. This processor typically boots first, and is
responsible for configuring and managing the main computational engine(s),
which are typically programmable processors tuned for a particular appli-
cation domain, such as signal processing or networking. Even in platforms
where the computational engines are specialized and not programmable pro-
cessors at the instruction level, such as in low-power cell phone platforms,
some initialization and coordination of data transfer must still be performed.
The variety in the possible architectures can be seen in Figure 12.1, which
summarizes the architecture of several embedded processing platforms.
Platform           Application          Control Proc.     Data Proc.
IBM Cell           Media/computing      64-bit PPC        8 128-bit SIMD RISC
Nexperia PNX8526   Digital television   MIPS              1 VLIW and dedicated
Intel IXP2800      Network processing   XScale (ARMv5)    16 multithreaded RISC
TI OMAP2430        Cell phone handset   ARM 1136          dedicated
FIGURE 12.1
Summary of some existing embedded processing platforms with control
processors.
Regardless of the processor core architecture, the core must still be inte-
grated into a system in order to access peripherals and external memory.
Typically, most system peripherals and device interfaces are implemented in
the FPGA fabric, in order to provide the maximum amount of system flexi-
bility. For instance, the Xilinx embedded development kit (EDK) [24] enables
FPGA users to assemble existing processor and peripheral IP cores to design
a SIF architecture. Application-specific FPGA modules can be imported as
additional cores into EDK, or alternatively, the RTL generated by EDK can
be encapsulated as a blackbox inside a larger HDL design.
12.2.2 FPGA Configuration and Reconfiguration
FPGAs are designed primarily to implement arbitrary bit-oriented logic cir-
cuits. In order to do this, they consist primarily of “lookup tables” (LUTs)
for implementing the combinational logic of the circuit, “flip-flops” (FFs)
for implementing registers in the circuit, and programmable interconnect for
passing signals between other elements. Typically, pairs of LUTs and FFs are
grouped together with some additional combinational logic for efficiently
forming wide logic functions and arithmetic operations. The Xilinx Virtex 4
slice, which combines two LUTs and two FFs, is shown in Figure 12.2. In the
Virtex 4 architecture, four slices are grouped together with routing resources
in a single custom design called a configurable logic block (CLB). The layout
of FPGAs consists primarily of many tiles of the basic CLB, along with tiles
for other elements necessary for a working system, such as embedded
memory (BRAM), external IO pins, clock generation and distribution logic,
and even processor cores.
In order to implement a given logic circuit, the logic elements must be
configured. Typically, this involves setting the value in a large number of
individual SRAM configuration memory cells controlling the logic elements.
These configuration cells are often organized in a large shift chain, enabling
the configuration bitstream to be shifted in from an external source, such
as a nonvolatile PROM. This shift chain is illustrated in Figure 12.3, taken
from an early FPGA-related patent [5]. Although this arrangement enables
the FPGA configuration to be loaded relatively efficiently, changing any part
of the configuration requires loading a completely new bitstream.
In order to increase flexibility, additional logic is often added to the
configuration logic of FPGAs that enables portions of the FPGA config-
uration to be loaded independently. In Xilinx Virtex FPGAs, the config-
uration shift chain is broken into individually addressed “configuration
frames” [26]. The configuration logic contains a register, called the frame
address register (FAR), which routes configuration data to the correct
configuration frame. The configuration bitstream itself consists of “configu-
ration commands,” which can update the FAR and other registers in the con-
figuration logic, load configuration frames, or perform other configuration
operations. This architecture enables “partial reconfiguration” of the FPGA,
[Figure: slice schematic with two LUTs (F and G), two FF/LAT storage elements (inputs D, CE, CLK, SR, REV; output Q), and the MUXF5/MUXFX multiplexers driving the X/XQ, Y/YQ, F5, and FX outputs.]
FIGURE 12.2
Simplified architecture of Xilinx Virtex 4 slice [27]. The multiplexers in the
middle are primarily used to implement wide multiplexers from several
slices. (From Xilinx, Virtex-4 FPGA User Guide, ug070 v2.40 edition, April
2008. With permission.)
where some configuration frames are reconfigured while other portions
remain active.
In Virtex 4 FPGAs, the configuration frames themselves are organized
in columns along the North–South axis of the FPGA. Each configuration
frame is the height of 16 CLBs or 4 BRAM memory elements and matches
the height of the clock distribution tree. Hence, PR of large portions of the
FPGA is best done using rectangular regions that are a multiple of 16 CLBs in
that direction. In the East–West direction, the columns are narrow (requiring
many configuration frames to configure all of the LUTs in one CLB), which
enables the exact size of a reconfigurable region to be more finely controlled.
Note that in Virtex 2 and Virtex 2 Pro FPGAs, the configuration frames cross
the entire device in the North–South direction, making connectivity between
regions more difficult. Although PR is possible in these families, rather com-
plex architectures tend to be used [11].
[Figure: shift-chain schematic of basic cells loaded from an In input, clocked by phi1/phi2 under a hold signal, with pass devices driven by the configuration cells.]
FIGURE 12.3
Early FPGA configuration logic [5]. The phi1, phi2, and “hold” signals control
loading of data into the shift chain. Blocks marked “pass device” are con-
trolled by the configuration logic. (From Xilinx, Virtex-4 FPGA User Guide,
ug070 v2.40 edition, April 2008. With permission.)
Although the logic in a design can often be floorplanned to fit the
natural layout of the configuration frames, signal routing is often much more
problematic. For instance, the FPGA architecture may require certain exter-
nal I/O pins to be used for certain purposes, such as clock inputs. It may also
be difficult to floorplan a region containing exactly the right number of exter-
nal pins, while still maintaining a reasonable mix of other elements. These
difficulties can be reduced by allowing static signals to be routed through
reconfigured regions of the FPGA.
Implementing such “route-overs” requires capabilities both in the FPGA
architecture and in the design tools. The FPGA architecture must
support the ability to overwrite the configuration of routing resources with-
out causing active signals using those resources to glitch. This capability is
supported by Xilinx Virtex 2, Virtex 2 Pro, Virtex 4, and Virtex 5 FPGAs, but
not by lower cost Spartan 3 FPGAs. The design tools must have the comple-
mentary capability to generate bitstreams for reconfigurable regions where
route-overs use exactly the same set of configuration bits to route each sig-
nal. This capability is implemented in the Xilinx early access (EA) PR tools
using a “merge-based” process [17]. In this process, the static portion of the
design is placed and routed first and the routing resources used are stored in
a design database. This database, combined with floorplanning constraints,
is used to constrain the routing of reconfigurable modules to lie within the
boundaries of the reconfigurable region and avoid routing resources used
by route-overs. To generate a partial bitstream, the implementation of each
reconfigurable module is first merged with the implementation of the static
region, ensuring that any route-over uses the same signal routing as the static
design. Using this process, each reconfigurable module can be implemented
without the knowledge of the implementation of any other reconfigurable
module and configured independently, as long as every configuration frame
is guaranteed to contain information from the static design and at most one
reconfigurable region.
From the perspective of the configuration logic, the process of loading
a partial bitstream is handled in exactly the same way. However, from the
perspective of building systems, there are several key differences. First,
a partial bitstream never contains the configuration commands that are
normally present in a bitstream to trigger the initialization and power-on-
reset process of the FPGA, since issuing such commands would immediately
halt processing in the static region. As a result, a PR design must never rely
on the power-on-reset state of flip-flops for proper operation. Second,
although routing resources can be reconfigured without glitching in some
FPGA architectures, any signal that is sourced by a flip-flop or register that is
reconfigured will still glitch during reconfiguration. As a result, extra logic is
typically included to ensure that signals driven from the reconfigured region
into the static region are forced to a value during reconfiguration.
12.2.3 Partial Reconfiguration with Processors
The PR process itself can be initiated either through an external configura-
tion interface, such as Xilinx SelectMap interface or the joint test action group
(JTAG) boundary scan interface, or internally, through the internal configu-
ration access port (ICAP) [26]. The most convenient way to use the ICAP is by
using a processor, such as the Xilinx Microblaze processor or PowerPC hard
cores found in some FPGAs. A program running on the processor in the static
region of the FPGA can make decisions about when reconfiguration should
occur and can load an appropriate partial bitstream through the ICAP. When
used in this way, the combination of FPGA plus the static design capa-
ble of reconfiguration is often called a “self-reconfiguring platform” (SRP)
[1,17,22]. The basic architecture of an SRP is shown in Figure 12.4.
One example of how such a system might work is shown in Figure 12.5.
This system includes a large number of FPGA computational units in a mod-
ular rack-mounted system. Data arrives on the right of the figure and is pro-
cessed by FPGAs directly connected to A/D converters. Under control of
the control workstation, data is routed through a network switch to other
FPGA computational units for further processing and data reduction. The
processed data is stored or displayed by the control workstation. A similar
system is currently in use at the Allen Telescope Array, using racks of FPGA
boards to combine the results from a large number of radio telescopes [13,14].
In order to provide scalability and fault tolerance, each computational
unit performs self-checks when it is first powered on. Based on these checks,
the unit notifies a centralized server of its availability. When work is avail-
able, the centralized server distributes it to any available and unallocated
computational units. If any units fail (based on periodic internal checks, or
external verification of work results), the centralized server can decide not
[Figure: block diagram of a control processor and reconfigurable FPGA resource(s) inside the FPGA, linked by a control/data bus to common I/O interfaces (console, JTAG), an external memory interface, the ICAP interface, and system I/O.]
FIGURE 12.4
Basic architecture of an SRP. (From Xilinx, Virtex-4 FPGA User Guide, ug070
v2.40 edition, April 2008. With permission.)
[Figure: antenna inputs pass through A/D converters into FPGA units, each pairing a control processor with FPGA resources; the units connect through a network switch to a control workstation.]
FIGURE 12.5
A radio telescope system architecture based on FPGAs. (From Xilinx, Virtex-4
FPGA User Guide, ug070 v2.40 edition, April 2008. With permission.)
to assign additional work to the failed unit and schedule a replacement.
This management and coordination task is handled by distributed software
executing on the control workstation and the control processors in each
FPGA unit.
Another example, based on a software-defined radio, is shown in Figure
12.6. In this system, a large number of different communication protocols,
called “waveforms,” must be implemented in a system although only a small
number are active at any one time [21]. In Figure 12.6, waveforms are exe-
cuted primarily in the reconfigurable FPGA resources on the right. The con-
trol processor responds to events initiated by the user of the system through
the interfaces on the left, controls reconfiguration of the FPGA resources, and
manages the transfer of data between the radio and the other interfaces of the
system. When a connection is established, the correct waveform is selected
from a library of FPGA implementations and inserted into the system. This
type of system also enables a straightforward path toward supporting new
[Figure: audio/video and keypad interfaces and a control processor share a control/data bus with reconfigurable FPGA resources, which connect through A/D and D/A converters to the antenna.]
FIGURE 12.6
A software defined radio architecture based on FPGAs.
waveforms through any device that the processor has access to, including
a wireless network connection based on an existing waveform supporting
data traffic.
12.2.4 Reusable FPGA Platforms for Embedded Systems
Typically, the SRP concept is seen largely as a mechanism for enabling bet-
ter use of the reconfigurability of the FPGA. Such a system may consume
less power, cost less, or be more flexible than an equivalent system without
reconfiguration, since only the portion of the system that is active needs to be
loaded in the FPGA. In practice, however, these advantages are often diffi-
cult to realize, because of the complexity of the resulting system. Compared
with a processor, which is typically capable of switching between processes
in hundreds of cycles, reconfiguration of a large FPGA may take hundreds
of thousands of cycles. In order to leverage FPGA reconfigurability, systems
must be capable of accepting this latency. If a task needs to be resumed later,
its state must be saved and reloaded, adding not only additional latency but
also storage requirements. Even if multiple tasks can be time-shared, realiz-
ing a cost savings by fitting a design into a smaller FPGA is difficult since
only discrete sizes of FPGAs are available and there is some overhead in
using PR techniques.
The SRP concept can also be viewed as a means for enabling faster and
more robust system design. The processor is decoupled from the bulk of
the FPGA design, enabling it to be designed, verified, and optimized in a
working system independent from the FPGA design. Within a given appli-
cation domain (such as designs requiring DDR2 memory and Gigabit Ether-
net as basic infrastructure), the processor system can also be made generic
and leveraged across different designs. The processor also becomes centrally
involved in how the bulk of the FPGA is configured, enabling flexible pro-
gramming of the configuration process, rather than relying solely on fixed
configuration modes. As a result, new capabilities that previously required
an external processor, such as built-in self-test or network-enabled
processing resources, can be enabled.