Handbook of algorithms for physical design automation part 97 pdf

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C045 Finals Page 942 24-9-2008 #3
942 Handbook of Algorithms for Physical Design Automation
to map a circuit to an application specific integrated circuit (ASIC). The next chapter will describe
the physical design algorithms for FPGAs; this chapter sets the stage by describing the architecture
of FPGAs. Section 45.2 describes several programming technologies, Section 45.3 describes logic
block architectures, Section 45.4 describes routing architectures, and Sections 45.5 and 45.6 describe
embedded memories and embedded computation blocks.
45.2 PROGRAMMING TECHNOLOGIES
The circuit being implemented on an FPGA is stored in the FPGA using a set of configuration
bits. These bits can be constructed in various ways; this section describes static random access
memory (SRAM), Flash, and antifuse-based configuration bits. These schemes are all used in
contemporary commercial FPGAs; many FPGA vendors, such as Xilinx, Altera, and Lattice, use
SRAM-based configuration bits to control the programmable switches that configure routing and logic
[Altera05,Lattice05,Xilinx05a]. Actel produces both Flash and antifuse FPGA products [Actel05a].
QuickLogic uses antifuse technology in their products [Quick05]. Table 45.1 provides a comparison
among these three technologies; details on each are provided below. FPGAs based on emerging
technologies have also been described [Ferrera04,Dehon05], but because they are not commercially
available yet, they will not be discussed further here.
45.2.1 SRAM-BASED FPGAS
The most popular scheme to implement configuration bits is to use SRAM cells. SRAM technology
is fast, and allows for reprogrammability. In addition, SRAM bits can be implemented using standard
complementary metal-oxide-semiconductor (CMOS) processes, meaning FPGAs using SRAMs can
be implemented in leading-edge processes. Figure 45.1 shows a typical six-transistor SRAM memory
cell. It uses the data bit in both the true and complement forms to achieve fast read and write
time [Trimberger94]. Although a six-transistor cell is generally more stable because it is resistant
to state flipping owing to crosstalk or charge sharing [Betz99], four-transistor and five-transistor
SRAM cells are possible. Xilinx uses a five-transistor SRAM cell for their FPGAs [Trimberger94].
The main disadvantage of SRAM is its volatility. Data stored in SRAM cells is erased when the
power is turned off. Therefore, additional off-chip memory, like electrically erasable programmable
read-only memory (EEPROM), is necessary to store the configuration bits and program the FPGA at
power-up. This potentially causes security concerns, because designs can be copied by capturing the
external bit stream [Zeidman02]. To address this, some FPGA vendors, such as Altera and Lattice,
apply on-chip Flash memory to store the configuration bits, so the SRAM-based FPGA can be
programmed without external memory upon power-up. A second disadvantage of this technology
is that SRAM cells are susceptible to neutron-induced errors, also known as soft errors, which are
caused by neutrons, alpha particles, and cosmic or terrestrial radiation. These errors are common in
high-radiation environments, such as at high altitude or in space. Such errors do not permanently damage
the FPGA, but they may cause instability and functional failure in the system. The main strategies
to overcome these errors in SRAM-based FPGAs are triple redundancy, error-correcting or parity
codes, and redundancy in time.
TABLE 45.1
Comparison among SRAM, Antifuse, and Flash
Features                  SRAM    Flash    Antifuse
Volatile                  Yes     No       No
In-system programmable    Yes     Yes      No
Power consumption         High    Lower    Lower
Density                   High    High     High
IP security               No      Yes      Yes
Soft-error resistance     Low     High     High
FIGURE 45.1 Six-transistor SRAM cell. The program line is asserted during the configuration phase; the
load line and load line_bar load the values for Data and Data_bar, respectively, during the configuration phase.
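Of the strategies listed above, triple redundancy is the simplest to illustrate. The following is a minimal sketch, not a vendor implementation: it assumes three stored copies of a configuration word and a bitwise majority voter, and the word value and the flipped bit position are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies of a configuration word.

    Each output bit takes the value held by at least two of the copies,
    so a single upset in any one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


# Hypothetical 8-bit configuration word and three stored copies of it.
golden = 0b10110010
copies = [golden, golden ^ 0b00001000, golden]  # the second copy has one flipped bit

voted = majority_vote(*copies)
print(f"{voted:08b}")  # 10110010
```

The voter masks any single-copy upset per bit position; two copies upset in the same bit position would defeat it, which is one reason such schemes are often paired with periodic rewriting of the stored bits.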
45.2.2 FLASH-BASED FPGAS
Flash cells provide nonvolatile programmability while retaining the ability to reprogram the FPGAs.
Figure 45.2 illustrates the Flash switch used in Actel’s ProASIC3. In the Flash switch, two transistors
share the floating gate, which stores the programming data. The sensing transistor is used for writing
and verification of the floating gate voltage while the switching transistor is employed to configure
routing nets and logic. Flash-based FPGAs are more secure and consume less power than their SRAM
counterparts [Actel05a]. However, the manufacturing process for Flash is more complicated than
that of SRAM. As a result, Flash technology usually lags one to two process generations behind
SRAM technologies. Testing is also lengthy owing to the nature of Flash. Therefore, Flash-based
FPGAs have a slower time-to-market compared to the SRAM-based FPGAs.
FIGURE 45.2 Flash-based switch. A sensing transistor and a switching transistor share the floating gate;
labeled signals are the wordline, switch input, switch output, readback, and configure.
45.2.3 ANTIFUSE-BASED FPGAS

Antifuses can also be used to implement configuration bits [Actel05b]. An antifuse is a thin insulating
layer between conductors. Applying a high voltage alters the insulating layer, creating a
low-resistance path between the conductors. This alteration is irreversible.
Like Flash, antifuse technology is nonvolatile. The major disadvantage of antifuse FPGAs is their
one-time programmability. However, antifuse technology consumes less power and is more area-efficient
than SRAM and Flash.
45.3 LOGIC BLOCK ARCHITECTURES
Programmability is provided in an FPGA in two ways. Logic is implemented in configurable logic
blocks; these logic blocks are then connected to each other and to the I/O pads using a configurable
routing network [Rose93,Betz99]. This section focuses on logic blocks and the next section focuses
on the routing network.
45.3.1 LOOKUP-TABLES
Most FPGAs use lookup-tables (LUTs) as their basic logic element. A K-input LUT (K-LUT) is a
memory with 2^K bits, K address lines, and a single output line. Each K-LUT can be configured to
implement any function of K inputs by storing the truth table of the desired function in the 2^K storage
bits. Figure 45.3 shows an example of a 2-input LUT implemented using SRAM cells (antifuse and
Flash memory cells could also be used).
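The K-LUT described above can be modeled in a few lines. This is a minimal sketch rather than any vendor's implementation: the class name `LUT`, its methods, and the convention that the first input is the most significant address bit are all assumptions made for illustration.

```python
from itertools import product


class LUT:
    """Hypothetical model of a K-input LUT: 2**K storage bits and one output."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("a K-LUT needs exactly 2**K storage bits")
        self.k = k
        self.bits = list(truth_table)

    @classmethod
    def from_function(cls, k, fn):
        # Store the truth table of fn, enumerating inputs in binary counting
        # order (the first input is the most significant address bit).
        table = [int(bool(fn(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, table)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        # The inputs act as address lines selecting one storage bit.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.bits[address]


# A 2-input LUT programmed as an AND gate.
and2 = LUT.from_function(2, lambda a, b: a & b)
print(and2.bits)            # [0, 0, 0, 1]
print(and2.evaluate(1, 1))  # 1
```

Reprogramming the block amounts to replacing the contents of `bits`; the lookup path itself never changes.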
Early research has shown that K = 4 works well; this is used in most commercial FPGAs
[Rose90,Singh92]. Later work reconfirmed that K = 4 is a good choice for area, but that for
performance, K = 7 works well [Ahmed04]. In general, the parameter K has a significant impact
on the efficiency of the architecture. If K is too large, it may not be possible to completely fill each
logic block, while if K is too small, delay will suffer because more logic blocks will be needed along
the critical path of a circuit. Figure 45.4 shows how a 6-input function might be implemented with
two 4-LUTs; had a 6-LUT been used, only one LUT would be required.
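The trade-off between LUT count and logic depth can be sketched with a concrete function. The example below assumes a 6-input AND, chosen because it decomposes cleanly; two cascaded 4-LUTs hold only 32 storage bits in total, so they cannot realize every 6-input function. The helper `make_lut` and the choice to tie the unused fourth input of the second LUT to 0 are illustrative assumptions.

```python
from itertools import product


def make_lut(k, fn):
    """Return a callable K-LUT whose storage bits hold the truth table of fn."""
    table = [int(bool(fn(*v))) for v in product((0, 1), repeat=k)]

    def lut(*inputs):
        assert len(inputs) == k
        address = 0
        for bit in inputs:
            address = (address << 1) | bit
        return table[address]

    return lut


# Two logic levels: the first 4-LUT combines A..D, the second combines its
# output with E and F (its fourth input is unused and tied to 0).
lut_first = make_lut(4, lambda a, b, c, d: a & b & c & d)
lut_second = make_lut(4, lambda x, e, f, unused: x & e & f)


def and6_two_levels(a, b, c, d, e, f):
    return lut_second(lut_first(a, b, c, d), e, f, 0)


# One logic level: a single 6-LUT implements the same function directly.
and6_one_level = make_lut(6, lambda a, b, c, d, e, f: a & b & c & d & e & f)

matches = all(
    and6_two_levels(*v) == and6_one_level(*v) for v in product((0, 1), repeat=6)
)
print(matches)  # True
```

Both versions compute the same output, but the two-LUT version places two lookups in series on the path from inputs to output.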
Variations on the basic LUT architecture have been used. Figure 45.5 shows a logic block that
employs a fracturable LUT mask (FLM) [Lewis05]. A k, m-FLM can implement a single k-input
function or two functions, each with up to k − 1 inputs, which together use no more than k + m
distinct inputs. The architecture in Figure 45.5a is a 6,2-FLM. An extension of the FLM architecture,
called a shared LUT mask (SLM) architecture, is shown in Figure 45.5b. A k, m-SLM can implement
two identical functions of k inputs provided that the two functions share k − m inputs. The SLM
architecture does this through the sharing of LUT masks (the set of configuration bits that indicate
the function implemented by the LUT) so that both functions are the same but can have different
inputs. The logic block in the Altera Stratix II FPGA is based on a 6,2-SLM [Altera05].
FIGURE 45.3 Two-input LUT, shown unprogrammed and programmed as a two-input AND gate.
FIGURE 45.4 Implementing a 6-input function (inputs A–F) using two 4-LUTs (2 logic levels) or one
6-LUT (1 logic level).
FIGURE 45.5 Advanced logic block structures: (a) FLM and (b) SLM.
Lookup-tables are usually coupled with flip-flops, as shown in Figure 45.6. In this structure, a
configuration bit is used to control the state of the output multiplexer. Depending on the value of
this configuration bit, the output signal of the LUT can either be registered or unregistered. As in
Ref. [Betz99], we refer to the LUT and flip-flop as a basic logic element (BLE).
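The BLE structure can be sketched behaviorally as follows. This is an illustrative model only: the class `BLE`, its method names, and the use of a 2-input truth table (for brevity) are assumptions, not part of any published architecture description.

```python
class BLE:
    """Hypothetical BLE model: a LUT, a D flip-flop, and an output multiplexer."""

    def __init__(self, truth_table, registered):
        self.truth_table = list(truth_table)  # LUT storage bits
        self.registered = registered          # configuration bit for the output mux
        self.ff_state = 0                     # flip-flop contents

    def lut_output(self, inputs):
        address = 0
        for bit in inputs:
            address = (address << 1) | bit
        return self.truth_table[address]

    def clock_edge(self, inputs):
        # On a clock edge the flip-flop captures the current LUT output.
        self.ff_state = self.lut_output(inputs)

    def output(self, inputs):
        # The configuration bit selects the flip-flop or the raw LUT output.
        return self.ff_state if self.registered else self.lut_output(inputs)


xor_table = [0, 1, 1, 0]  # 2-input XOR

combinational = BLE(xor_table, registered=False)
print(combinational.output((1, 0)))  # 1

sequential = BLE(xor_table, registered=True)
print(sequential.output((1, 0)))     # 0, nothing has been clocked in yet
sequential.clock_edge((1, 0))
print(sequential.output((0, 0)))     # 1, the registered value persists
```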
45.3.1.1 Clusters
To increase speed and reduce area and compile time, larger logic blocks are preferred. However,
LUT complexity grows exponentially with the number of inputs [Rose93]. Clusters are logic blocks
of larger granularity, typically composed of multiple BLEs, internal cluster routing, and possibly
specialized internal cluster connections, such as carry and arithmetic chains [Marquardt00]. Within
a cluster, BLE inputs are typically connected to the cluster inputs and BLE outputs by a
multiplexer-based crossbar. This internal interconnect is generally faster than the general purpose
routing between blocks. Altera refers to clusters as logic array blocks (LABs), while Xilinx refers to
clusters as configurable logic blocks (CLBs).
FIGURE 45.6 LUT coupled with a flip-flop (BLE).
FIGURE 45.7 Basic BLE and basic cluster composed of identical BLEs (N BLEs, I cluster inputs,
N outputs, and a clock).
Figure 45.7 shows a typical cluster. The cluster architecture is described by these four parameters:
(1) K, the number of inputs to a LUT, (2) N, the number of BLEs in a cluster, (3) I, the number of
inputs to the cluster that connect to LUT inputs, and (4) M_clk, the number of clock inputs to a cluster
(most studies assume this is 1).
Increasing K or N increases the functionality of the cluster. This reduces the number of blocks
needed to implement circuits and the number of blocks on the critical path, but increases the size
of the block and makes the local cluster interconnect slower. Research has found that K = 4–6 and
N = 3–10 provide the best combined speed and area [Ahmed04].
The value of I is often smaller than K × N, because BLEs often share inputs or use the outputs
from BLEs within the cluster. Smaller values of I use smaller multiplexers in the crossbar, reducing
area, but overly small I values make some BLEs unusable. Research has found that 98 percent
utilization can be achieved when I = (K/2) × (N + 1) [Ahmed04].
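As a quick worked example of this relation, the sketch below compares the suggested number of cluster inputs with the K × N inputs a fully connected cluster would need. The function name `cluster_inputs` and the sample (K, N) pairs are illustrative; for combinations where the product is not an integer, some rounding would be needed, and the choice of rounding is not specified here.

```python
def cluster_inputs(k, n):
    """Cluster inputs from the relation I = (K/2) * (N + 1)."""
    return k * (n + 1) / 2


for k, n in [(4, 4), (4, 10), (6, 8)]:
    i = cluster_inputs(k, n)
    print(f"K={k} N={n}: I={i:g} versus K*N={k * n}")
# K=4 N=4: I=10 versus K*N=16
# K=4 N=10: I=22 versus K*N=40
# K=6 N=8: I=27 versus K*N=48
```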
45.3.1.2 Carry Chains
Carry chains are locally routed connections that aid in the efficient implementation of arithmetic
operations. They also can be used in the efficient implementation of logical operations, such as
parity and comparison. Fast carry chains are important because the critical path for these operations
is often through the carry.
Each 4-LUT in a BLE can be fractured to implement two 3-LUTs; this is sufficient to implement
both the sum and carry, given two input bits (a and b) and a carry input, as shown in Figure 45.8.
The carry out signal from one BLE would typically be connected to the carry in of an adjacent BLE
using a fast dedicated connection. The Z-input is used to break the carry chain before the first bit of
an addition.
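The sum and carry functions described above can be written as two 3-input truth tables and chained into a ripple-carry adder. This is a behavioral sketch only: the table layout, the function names, and the omission of the Z input (the carry into the first bit is simply set to 0) are simplifying assumptions.

```python
# Truth tables for the two 3-LUTs of one BLE, indexed by (a, b, carry_in).
SUM_LUT = [a ^ b ^ c for a in (0, 1) for b in (0, 1) for c in (0, 1)]
CARRY_LUT = [(a & b) | (a & c) | (b & c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]


def lookup(table, a, b, c):
    return table[(a << 2) | (b << 1) | c]


def ripple_add(x, y, width):
    """Add two width-bit numbers, one BLE per bit, passing the carry along the chain."""
    carry = 0  # the chain is broken before the first bit
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        result |= lookup(SUM_LUT, a, b, carry) << i
        carry = lookup(CARRY_LUT, a, b, carry)  # dedicated carry-out to the next BLE
    return result, carry


print(ripple_add(5, 3, 4))   # (8, 0)
print(ripple_add(15, 1, 4))  # (0, 1)
```

The loop makes the critical path visible: each bit's carry depends on the previous bit's carry, which is why a fast dedicated carry connection matters.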
More complex carry schemes have been described. In Ref. [Hauck00], carry chains based on
carry select, variable block, and Brent–Kung schemes are described; the Brent–Kung scheme is
shown to be 3.8 times faster than the simple ripple carry adder in Figure 45.8. Support for
carry-lookahead adders is included in the Actel Axcelerator device and in the Xilinx Virtex-II,
Virtex-II Pro, and Virtex-4 devices. Carry select capabilities are included in the Altera Stratix FPGAs.
The Altera Stratix-II contains two dedicated 1-bit adders in each logic block. Because high-fanin
arithmetic can cause routing congestion in a small area of the device, both Xilinx and Altera parts
support two independent carry chains in each cluster. This allows for narrower fanin logic, which
helps reduce routing congestion around the adders.
FIGURE 45.8 Carry chain connections to a 4-LUT (inputs a, b, Z, and carry in; outputs sum out and
carry out).
45.3.2 NON-LUT-BASED LOGIC BLOCKS
Not all FPGAs contain logic blocks based on LUTs. The Actel ProASIC3 logic blocks contain a set
of multiplexers, which allow for the implementation of 3-input combinational or sequential functions
in each logic block [Actel05a]. The QuickLogic Eclipse II logic cell contains two 6-input AND gates,
four 2-input AND gates, and seven two-to-one multiplexers [Quick05]. The use of universal logic
modules as FPGA logic blocks has also been proposed; these blocks can implement any function of
their inputs by applying input permutation and negation [Lin94]. Finally, programmable devices using
more coarse-grained logic blocks exist; these logic blocks are typically arithmetic/logic units and
are suitable for computationally intensive applications [Ebeling96,Goldstein00,Singh00,Mei03].

45.4 ROUTING ARCHITECTURES
Connections between logic blocks are implemented using fixed prefabricated metal tracks. These
tracks are arranged in channels; channels typically run vertically and horizontally, forming a grid
[Lemieux04a]. Although many academic studies have assumed that all channels contain the same
number of tracks [Betz99], many commercial architectures (such as those from Altera) contain more
tracks in each horizontal channel than each vertical channel. Figure 45.9 shows an FPGA with tracks
arranged in horizontal and vertical channels.
45.4.1 SEGMENTATION
Tracks within a channel can span one logic block or multiple logic blocks. Typically, not all tracks
within a channel will be of the same length. Several studies have investigated the optimum segment
length. In Ref. [Brown96], a heterogeneous routing architecture, in which some tracks span three
logic blocks, some span two logic blocks, and some span one logic block, is found to work well.
In Ref. [Betz99], it is shown that longer wires result in a more efficient architecture; the authors
suggest that a homogeneous architecture in which all tracks span either four or eight logic blocks
gives the most efficient FPGA.
FIGURE 45.9 Overall routing architecture: logic blocks, routing tracks in horizontal and vertical
channels, connection blocks, and switch blocks.
45.4.2 PROGRAMMABLE SWITCHES
The tracks are connected to each other and to the logic blocks using programmable switches. These
programmable switches can be buffered or unbuffered, as shown in Figure 45.10. Switches in
modern FPGAs are typically buffered, because unbuffered switches result in a quadratic increase
in delay for long connections. Buffered switches can be bidirectional, as shown in Figure 45.10b or
unidirectional, as shown in Figure 45.10c. Although many academic studies assume bidirectional
switches [Betz99], most modern FPGAs contain unidirectional switches [Lemieux04b]; these
switches allow for better delay optimization and result in a more dense routing fabric.
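The quadratic growth in delay for unbuffered switches can be illustrated with a first-order RC estimate. The sketch below assumes an Elmore delay model with identical segments, each contributing a series switch resistance `r` and a wire capacitance `c`; the buffer delay `t_buf` and all numeric values are hypothetical and chosen only to show the trend.

```python
def unbuffered_delay(n, r, c):
    """Elmore delay to the end of a chain of n unbuffered switches.

    The capacitance at node k is charged through k series resistances,
    so the total is r*c*(1 + 2 + ... + n) = r*c*n*(n + 1)/2.
    """
    return r * c * n * (n + 1) / 2


def buffered_delay(n, r, c, t_buf):
    """Delay when every switch is buffered: each stage is isolated from the next."""
    return n * (r * c + t_buf)


for n in (1, 2, 4, 8, 16):
    print(n, unbuffered_delay(n, 1.0, 1.0), buffered_delay(n, 1.0, 1.0, 0.5))
```

Under these assumptions the unbuffered delay grows with the square of the connection length while the buffered delay grows linearly, so buffering pays off once a connection passes through more than a few switches.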
45.4.3 SWITCH BLOCKS AND CONNECTION BLOCKS
Tracks are connected to each other using switch blocks, and to logic blocks using connection blocks.
Commercial FPGAs often contain combined switch blocks and connection blocks; however, for
clarity, this section will describe each separately.
FIGURE 45.10 Programmable switches: (a) unbuffered, (b) buffered bidirectional, and (c) buffered unidirectional.
FIGURE 45.11 Switch block patterns.
A switch block lies at the intersection of each horizontal and vertical channel, and can connect
each incident track to some number of other incident tracks. Academic work uses the notation F_s to
describe the number of outgoing tracks to which each incoming track can be connected [Rose93].
Most physical design algorithm studies assume F_s = 3; in this case, each incoming track can be
connected to one track on each of the other three sides of the switch block. The switch pattern
determines the F_s tracks to which each incoming track can be connected. Academic work has
proposed the three switch patterns in Figure 45.11. The disjoint pattern divides the routing fabric
into domains; if there are W tracks in each channel, there are W domains. This simplifies the
routing task, and results in an efficient layout. The universal pattern has been shown to support
the largest number of simultaneous connections through each switch block [Chang96], while the
Wilton block has been shown to result in good overall routability [Wilton97]. An extension of
the Wilton block to architectures with different segment lengths is described in Ref. [Masud99].
In Ref. [Sivaswamy05], it is proposed that some of the connections in a switch block should be
hardwired (nonprogrammable); this gives a 30 percent speedup, a slight reduction in area, and an
8 percent reduction in power.
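The domain property of the disjoint pattern can be checked with a small model. The sketch below assumes the usual reading of the disjoint pattern, in which track t on one side connects only to track t on each of the other three sides (so F_s = 3); the side names, function names, and the channel width W = 4 are illustrative assumptions. The universal and Wilton patterns are not modeled here.

```python
SIDES = ("left", "top", "right", "bottom")


def disjoint_switch_block(w):
    """Map each incoming (side, track) to the F_s = 3 tracks it can reach."""
    return {
        (side, t): [(other, t) for other in SIDES if other != side]
        for side in SIDES
        for t in range(w)
    }


def count_domains(connections):
    """Count groups of track ends that are mutually reachable through the block."""
    seen = set()
    domains = 0
    for start in connections:
        if start in seen:
            continue
        domains += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(connections[node])
    return domains


W = 4
block = disjoint_switch_block(W)
print(block[("left", 2)])    # [('top', 2), ('right', 2), ('bottom', 2)]
print(count_domains(block))  # 4
```

Because a signal entering on track t can only ever leave on track t, a router can treat each domain independently, which is the simplification mentioned above.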
Connection blocks are used to connect logic block pins to the routing tracks. Each logic block pin
can be connected to a subset of routing tracks in the neighboring channel. The quantity F_c indicates
the proportion of the tracks in each channel to which a pin can be connected. In Ref. [Betz99], it is
shown that F_c = 0.25–0.5 (depending on the type of switch block employed) works well.
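To make the meaning of F_c concrete, the sketch below builds one possible connection pattern for a channel of W tracks. Only the count of tracks per pin (F_c × W) follows from the definition above; the evenly spaced, per-pin staggered choice of tracks and the function name `connection_block` are assumptions made for illustration, not a pattern taken from the cited work.

```python
def connection_block(num_pins, w, fc):
    """Assign each pin a subset of about fc * w tracks, spread across the channel."""
    per_pin = max(1, round(fc * w))
    step = w / per_pin
    pattern = {}
    for pin in range(num_pins):
        # Start each pin at a different offset so pins do not all share tracks.
        pattern[pin] = sorted({int(pin + j * step) % w for j in range(per_pin)})
    return pattern


pattern = connection_block(num_pins=4, w=8, fc=0.5)
for pin, tracks in pattern.items():
    print(pin, tracks)
# 0 [0, 2, 4, 6]
# 1 [1, 3, 5, 7]
# 2 [0, 2, 4, 6]
# 3 [1, 3, 5, 7]
```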
45.4.4 BUS-BASED ROUTING ARCHITECTURES
FPGAs are often used to implement datapath-intensive circuits, in which many signals are part of
wide buses. Because each bit of a bus is connected in the same way, it has been suggested that a
datapath routing architecture, in which a single configuration bit controls multiple switches, will lead
to an improvement in FPGA density. In Ref. [Ye05], the architecture in Figure 45.12 is presented. In
this architecture, some of the tracks (the top four in Figure 45.12) are dedicated bus-based routing
tracks, and connections to them are controlled by a bus switch; a bus switch contains one switch for
each bit, all controlled by a single configuration cell. In this case, each bus (and each bus switch) is
4 bits wide. The lower tracks are regular bit-based routing tracks, which are connected to each other
and to the logic cells using standard connection and switch blocks, as described above. In Ref. [Ye05],
it is shown that a bus width of 4 works well, and that 40–50 percent of the tracks should be buses (with
the remainder being bit-based routing tracks). It is shown that this results in a density improvement
of 9.6 percent compared to a conventional architecture.
FIGURE 45.12 Bus-based routing architecture: a 4-bit bus and bit-based routing tracks connected to
LUTs (P = configuration cell).
45.4.5 PIPELINED INTERCONNECT ARCHITECTURES
In deep-submicron technologies, the delay of long wires can limit the clock speed of the circuit
implemented on an FPGA. To address this, several authors have proposed pipelined interconnect
architectures [Singh01a,Singh01b,Weaver04]. In these architectures, some of the interconnect
switches contain registers. This results in additional complexity for the router, however, because it
must now balance the number of registers on each path.
45.5 MEMORIES
Today, FPGAs are often used to implement entire systems. These systems often require storage.
Although it is possible to implement storage off-chip, on-chip storage has a number of advantages.
On-chip storage reduces system costs, allows for a wider, faster memory interface, and reduces I/O
demands on the FPGA.
There are two ways of implementing memory on FPGAs: embedded memory and distributed
memory. Embedded memory solutions offer a number of relatively large fixed dedicated memory
blocks on the FPGA. Distributed memory, on the other hand, uses small memories spread across the
entire FPGA chip, often implemented in unused logic elements.
45.5.1 EMBEDDED MEMORY

Most FPGAs contain embedded memory blocks (EMBs). EMBs are typically arranged in columns or
rows to simplify connections to logic and between other EMBs [Wilton99], as shown in Figure 45.13.
Altera’s Stratix and Stratix-II devices include three different sized EMBs: 512 bits, 4 Kbits, and
512 Kbits [Altera05]. Xilinx’s Virtex-4, Virtex-II, and Spartan series contain 18-Kbit EMBs
[Xilinx05a]. Actel’s ProASIC3 and ProASIC-Plus contain 4-Kbit and 2-Kbit EMBs, respectively
[Actel05].
Each EMB has a fixed number of bits, but its aspect ratio can be configured by the user. For
example, in the Stratix II architecture, a 4-Kbit EMB may be configured to act as memories with
aspect ratios of 4096 × 1, 2048 × 2, 1024 × 4, 512 × 8, 256 × 16, or 128 × 32. On many devices,
EMBs can be configured to act as a ROM, single-port RAM, or dual-port RAM. In addition, they
typically include parity bits, various enable/reset control signals, and have synchronous inputs with
synchronous or asynchronous outputs.
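The list of aspect ratios above follows directly from holding the number of bits fixed while doubling the data width. A minimal sketch, assuming power-of-two widths up to a maximum of 32; the function name `aspect_ratios` and the derived address-line count are illustrative additions.

```python
def aspect_ratios(total_bits, max_width):
    """List (depth, width) configurations of a fixed-size memory block."""
    ratios = []
    width = 1
    while width <= max_width:
        if total_bits % width == 0:
            ratios.append((total_bits // width, width))
        width *= 2
    return ratios


for depth, width in aspect_ratios(4096, 32):
    address_lines = (depth - 1).bit_length()
    print(f"{depth} x {width}: {address_lines} address lines, {width} data lines")
# 4096 x 1: 12 address lines, 1 data lines
# 2048 x 2: 11 address lines, 2 data lines
# 1024 x 4: 10 address lines, 4 data lines
# 512 x 8: 9 address lines, 8 data lines
# 256 x 16: 8 address lines, 16 data lines
# 128 x 32: 7 address lines, 32 data lines
```

Each doubling of the width removes one address line and doubles the number of data lines, so wider configurations place more demand on the data pins of the block while needing fewer address pins.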
Of particular importance is the interface between the memory and the logic. Figure 45.14 shows
one published scheme; in this architecture, each EMB connects to the logic through a memory-
logic interconnect block [Wilton99]. Figure 45.15 shows the contents of one of these memory-logic
FIGURE 45.13 Logic and memory in an FPGA (logic blocks and memory arrays).
FIGURE 45.14 Memory/logic interconnect architecture: memory blocks connect to the surrounding logic
blocks through memory/logic interconnect blocks.
