
move to other CLBs, overcrowding them instead. This process gradually moves nodes from overcrowded regions to empty regions. Care is taken not to cause thrashing, in which LUTs are moved back and forth between two clusters. Thrashing is avoided by keeping a history of CLB capacity violations: if thrashing has been occurring for a few moves, the relative cost of both CLBs involved in the thrashing is increased, so the extra LUT or register is moved to
a third CLB.
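As a rough illustration, such a history-based penalty might be implemented along the following lines; the cost-update rule, the weight constant, and the function names are assumptions for exposition, not the authors' exact formulation.

```python
# Illustrative sketch (not the authors' exact rule): penalize CLBs that keep
# violating capacity so an overflowing LUT eventually spills to a third CLB.
from collections import defaultdict

violation_history = defaultdict(int)   # CLB id -> number of past capacity violations
HISTORY_WEIGHT = 2.0                   # assumed tuning constant

def move_cost(clb, base_cost):
    """Cost of pushing an extra LUT/register into `clb`.

    The history term grows every time the CLB overflows, so two CLBs that
    keep trading the same LUT both become expensive and the LUT moves
    elsewhere instead of oscillating between them.
    """
    return base_cost + HISTORY_WEIGHT * violation_history[clb]

def record_violation(clb):
    violation_history[clb] += 1
```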
46.4.4 LINEAR DATAPATH PLACEMENT
Callahan et al. [29] presented GAMA, a linear-time simultaneous placement and mapping method
for LUT-based FPGAs. They focus only on datapaths composed of arrays of bitslices. The
basic idea is to preserve the datapath structure so that we can reduce the problem size by primarily
looking at a bitslice of the datapath. Once a bitslice is mapped and placed, other bitslices of the
datapath can be mapped and placed similarly on rows above or below the initial bitslice.
One of the goals in developing GAMA was to perform mapping and placement with little computational effort. To achieve linear time complexity, the authors limit the search space by considering only a subset of solutions, which means they might not produce an optimal solution. Because optimal mapping of directed acyclic graphs (DAGs) is NP-complete, GAMA first splits the circuit graph into a forest of trees before processing it in the mapping and placement steps. The tree covering algorithm does not directly handle cycles or nodes with multiple fanouts, and might duplicate nodes to reduce the number of trees. Each tree is compared to elements from a preexisting pattern library that contains compound modules such as the one shown in Figure 46.12. Dynamic programming is used to find the best cover in linear time. After the tree covering process, a postprocessing step is attempted to find opportunities for local optimization at the boundaries of the covered trees. Interested readers are referred to Ref. [29] for more details on the mapping process of GAMA.
Because the modules will form a bitslice datapath layout, the placement problem translates into
finding a linear ordering of the modules in the datapath. Wirelength minimization is the primary goal
during linear placement. The authors assume that the output of every module is available at its right
boundary. A tree is placed by recursively placing its left and right subtrees, and then placing the root
node to the right of the subtrees. The two subtrees are placed next to each other. Figure 46.13 shows an example of a tree placement. Because subtree t2 is wider, placing it to the right of subtree t1 will result in longer wirelength. Because the number of fanin nodes to the root of the tree is bounded, an exhaustive search for the right placement order of the subtrees is reasonable and would result in a linear-time algorithm.
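A minimal sketch of this recursive placement is shown below, assuming a simple Node structure and using the distance between each child's output and the root as a crude wirelength proxy; both are illustrative assumptions, not GAMA's actual data structures or cost function.

```python
from itertools import permutations

class Node:
    def __init__(self, name, width=1, children=()):
        self.name, self.width, self.children = name, width, list(children)

def place_tree(root):
    """Return a left-to-right ordering of modules for one bitslice row.

    Each subtree is placed recursively, and the root goes to the right of its
    subtrees. Because fanin is bounded, trying every order of the (few)
    subtrees keeps the overall algorithm linear time.
    """
    best_order, best_cost = None, None
    for perm in permutations(root.children):
        order = []
        for child in perm:
            order.extend(place_tree(child))
        order.append(root)
        cost = wirelength(order, root)
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order

def wirelength(order, root):
    # Wire from each child's output (its right boundary) to the root,
    # measured in module widths separating them -- a crude proxy.
    pos, x = {}, 0
    for m in order:
        pos[m] = x + m.width          # right boundary of module m
        x += m.width
    return sum(pos[root] - pos[c] for c in root.children)
```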
In addition to the local placement algorithm, Callahan et al. also attempt some global optimizations. The linear placement algorithm arranges modules within a tree, but all trees in the circuit
must also be globally placed. A greedy algorithm is used to place trees next to each other so that
[Figure: a pattern in the library, and the same library pattern found in the circuit graph.]
FIGURE 46.12 Example of a pattern in the tree covering library. (Based on Callahan, T. J. et al., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 123–132, 1998. With permission.)
[Figure: placements (a) and (b) of subtrees t1 and t2.]
FIGURE 46.13 Tree placement example. (Based on Callahan, T. J. et al., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 123–132, 1998. With permission.)
the length of the critical path in the circuit is minimized. Furthermore, after global and local placement is accomplished, individual modules are moved across tree boundaries to further optimize the placement.
Ababei and Bazargan [30] proposed a linear placement methodology for datapaths in a dynamically reconfigurable system in which datapaths corresponding to different basic blocks in a program are loaded, overwritten, and possibly reloaded on linear strips of an FPGA. They assume that the FPGA chip is divided into strips as shown in Figure 46.14. An expression tree corresponding to the computations in a basic block is placed entirely in one strip, getting its input values from the memory blocks on the two sides of the strip and writing the output of the expression to one of these memory blocks.
Depending on how frequently basic blocks are loaded and reloaded, three placement algorithms
are developed:
1. Static placement: This case is similar to the problem considered by Callahan et al. [29],
that is, each expression tree is given an empty FPGA strip to be placed on. The solution
proposed by Ababei tries to minimize critical path delay, congestion, and wirelength
FIGURE 46.14 FPGA divided into linear strips.

A basic block is a sequence of code, for example, written in the C language, with no jumps or function calls. A basic block, usually the body of a loop with many iterations, could be mapped to a coprocessor like an attached FPGA to perform computations faster. Data used by the basic block should be made accessible to the coprocessor, and the output of the computations should in turn be made accessible to the processor. This could be achieved either by streaming data from the processor to the FPGA and vice versa, or by providing direct memory access to the FPGA.
using a matrix bandwidth minimization formulation. The matrix bandwidth minimization
algorithm is covered in Section 47.3.2.1.
2. Dynamic placement with no module reuse: In this scenario, we assume that multiple basic blocks can be mapped to the same strip, either because a number of them run in parallel, or because there is a good chance that a mapped basic block will be invoked again in the future. The goal is to place the modules of a new expression on the empty regions between the modules of previous basic blocks, leaving the previously placed modules and their connections intact. As a result, the placement of the new basic block becomes a linear, noncontiguous placement problem with blockages being the modules from previous basic blocks.
3. Dynamic placement with module reuse: This scenario is similar to the previous one, except that we try to reuse a few modules and connections left over by previous basic blocks that are no longer active. Doing so saves reconfiguration time and results in better usage of the FPGA real estate. Finding the largest common subgraph between the old and the new expression trees helps us maximize the reuse of the modules that are already placed.
The authors propose a greedy solution for the second problem, that is, dynamic placement without module reuse. The algorithm works directly on expression trees. Modules are rank-ordered based on parameters such as the volume (sum of module widths) of their children subtrees and the latest arrival time on the critical path. The ordering of the nodes determines the linear order in which they should be placed on the noncontiguous space.
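A sketch of such a greedy noncontiguous placement is given below; the ranking key (subtree volume, then arrival time), the gap representation, and the first-fit choice are illustrative assumptions rather than the exact heuristic of Ref. [30].

```python
def rank_modules(modules):
    """Order modules of the new expression tree; heavier subtrees and
    timing-critical modules come first (weights are assumed, not from [30])."""
    return sorted(modules,
                  key=lambda m: (m["subtree_volume"], m["arrival_time"]),
                  reverse=True)

def place_noncontiguous(modules, free_gaps):
    """Place ranked modules into gaps between blockages (old modules).

    free_gaps: list of (start, length) intervals of empty columns in the strip.
    Returns {module name: start column}, or None if a module does not fit.
    """
    placement = {}
    gaps = sorted(free_gaps)                      # scan gaps left to right
    for m in rank_modules(modules):
        for i, (start, length) in enumerate(gaps):
            if m["width"] <= length:              # first gap that fits
                placement[m["name"]] = start
                gaps[i] = (start + m["width"], length - m["width"])
                break
        else:
            return None                           # strip cannot host this block
    return placement
```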
To solve the third problem, that is, dynamic placement with module reuse, first a linear ordering of modules is obtained using the previous two algorithms to minimize wirelength, congestion, and critical path delay. Then a maximum matching between the existing inactive modules and the linear ordering is sought such that the maximum number of modules is reused while perturbations to the linear ordering are kept at a minimum. The algorithm is then extended to apply to general graphs, not just trees. To achieve better reuse, a maximum common subgraph problem is solved to find the largest subset of modules and their connections that is shared between the expression graphs that are already placed and the graph of the new basic block.
46.4.5 VARIATION-AWARE PLACEMENT
Hutton et al. proposed the first statistical timing analysis placement method for FPGAs [31]. They consider both inter- and intradie process variations in their modeling, but do not model spatial correlations among within-die variables. In other words, local variations are modeled as independent random variables. In Ref. [31], the delay of a circuit element is modeled as a Gaussian variable that is a function of V_t and L_eff, each of which is broken into its global (systematic) and local (random) components. Block-based statistical timing analysis [33] is used to compute the timing criticality of nodes, which is used instead of TVPR's timing-cost component (see Equations 46.3 and 46.5). SSTA (statistical static timing analysis) is performed only at each temperature, not at every move.
In their experiments, they compare their statistical timing-based placement to TVPR and consider the effect of guard-banding and speed-binning. Guard-banding is achieved by adding k·σ to the delay of every element, where k is a user-defined factor such as 3, 4, or 5, and σ is the standard deviation of the element's delay. Timing yield considering speed-binning is computed during Monte Carlo simulations by assuming that chips are binned into fast, medium, and slow critical-path-delay bins. Their statistical placement shows yield improvements over TVPR in almost all combinations of guard-banding and speed-binning scenarios. In a follow-up work, Lin and He [34] show

Cheng et al. [32] show that by ignoring spatial correlations, we lose at least 14 percent in the accuracy of the estimated delay. The error in delay estimation accuracy is defined as the integration of the absolute error between the distributions obtained through Monte Carlo simulations and statistical sum and maximum computations of the circuit delay. See Section 46.3 of Ref. [32] for more details.
that combining statistical physical synthesis, statistical placement, and statistical routing results in significant yield improvements (from 50 failed chips per 10,000 chips to 5 failed chips in their experimental setup).
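To make the guard-banding and speed-binning ideas concrete, the following sketch samples per-element Gaussian delays along a critical path and bins each sampled "chip"; the bin thresholds, the guard-band factor, and the delay model are invented for illustration and are not the experimental setup of Ref. [31].

```python
import random

K_SIGMA = 3.0                 # guard-band factor k (3, 4, or 5 in the text)
BIN_EDGES = (1.0, 1.15)       # fast/medium cut-offs relative to nominal (assumed)

def guard_banded_delay(mu, sigma):
    """Deterministic sign-off delay of one element: nominal plus k*sigma."""
    return mu + K_SIGMA * sigma

def monte_carlo_bins(path_elements, trials=10000):
    """Sample per-element Gaussian delays, sum them along the critical path,
    and sort each sampled chip into a speed bin."""
    nominal = sum(mu for mu, _ in path_elements)
    bins = {"fast": 0, "medium": 0, "slow": 0}
    for _ in range(trials):
        d = sum(random.gauss(mu, sigma) for mu, sigma in path_elements)
        if d <= BIN_EDGES[0] * nominal:
            bins["fast"] += 1
        elif d <= BIN_EDGES[1] * nominal:
            bins["medium"] += 1
        else:
            bins["slow"] += 1
    return bins

# Toy path of three elements, each (nominal delay, sigma).
print(monte_carlo_bins([(1.0, 0.05), (2.0, 0.10), (1.5, 0.08)]))
```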
Cheng et al. [32] propose a placement method that tailors the placement to individual chips after the variation map for every chip is obtained. This preliminary work tries to answer the question of how much improvement can be gained by adapting the placement to an individual chip, given the exact map of its FPGA element delays. They show about 5.3 percent improvement on average in their experimental setup, although they do not address how the device parameter maps can be obtained in practice.
46.4.6 LOW POWER PLACEMENT
Low power FPGA placement and routing methods try to assign noncritical elements to low power resources on the FPGA. There have been many recent works targeting FPGA power minimization. We focus on only two efforts: one deals with the placement problem [35], and the other addresses dual voltage assignment to routes [36]; the latter is discussed in Section 46.5.4.
The authors in Ref. [35] consider an architecture that is divided into physical regions, each of
which can be independently power gated. To enable leakage power savings, designers must look into
two issues carefully:
1. Region granularity: They should determine the best granularity of the power gating regions. Too small a region would have high circuit overheads, both in terms of sleep transistors and the configuration bits that must control them. On the other hand, a finer granularity gives more control over which logic units can be shut down and could potentially harness more leakage savings.

2. Placement strategies: CAD developers should adopt placement strategies that constrain logic blocks with similar activity to the same regions. If all logic blocks placed in one region are going to be inactive for a long period of time, then the whole region can be power gated. However, architectural properties of the FPGA influence the effectiveness of the placement strategy. For example, if the FPGA architecture has carry chains that run in the vertical direction, then the placement algorithm must place modules in regions that are vertically aligned. Not doing so could harm performance significantly.
By constraining the placement of modules with similar power activity, we can achieve two
goals: power gate unused logic permanently, and power gate inactive modules for the duration of
their inactive period. In their experiments, they consider various sizes of the power gating regions
and also look into dynamic versus static powering down of unused/idle regions.
46.5 ROUTING
Versatile placement and routing [6] uses Dijkstra's algorithm (i.e., a maze router) to connect the terminals of a net. Its router is based on the negotiation-based algorithm PathFinder [37]. PathFinder first routes all nets independently using the shortest route for each path. As a result, some routing regions will become overcongested. Then, in an iterative process, nets are ripped up and rerouted to alleviate congestion. Nets that are not timing-critical take detours away from the congested regions, and nets that are timing critical are likely to take the same route as in the first round.
There is a possibility that two routing channels show a thrashing effect, that is, nets are ripped up from one channel and rerouted through the other, and then in the next iteration are ripped up from the second and rerouted through the first. To avoid this, VPR uses a history term that not only penalizes routing through a currently congested region but also uses congestion data from the recent history. The congestion cost of a channel is thus defined as its current resource (over-)usage plus a weighted sum of the congestion values from previous routing iterations.
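A minimal sketch of such a negotiated-congestion cost for one channel is shown below; the actual VPR/PathFinder cost has additional terms and scheduling details, and the constants here are assumptions.

```python
class ChannelCost:
    """Negotiated-congestion style cost for one routing channel.

    cost = base + present over-use + weighted history of past over-use,
    so two channels cannot keep trading the same net back and forth.
    """
    def __init__(self, capacity, base_cost=1.0, history_weight=0.5):
        self.capacity = capacity
        self.base_cost = base_cost
        self.history_weight = history_weight
        self.usage = 0          # nets currently routed through the channel
        self.history = 0.0      # accumulated over-use from past iterations

    def cost(self):
        overuse = max(0, self.usage - self.capacity)
        return self.base_cost + overuse + self.history_weight * self.history

    def end_iteration(self):
        # After each rip-up-and-reroute pass, fold current over-use into history.
        self.history += max(0, self.usage - self.capacity)
```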

[Figure: the expansion wavefront, and the re-expansion around the newly added wire.]

FIGURE 46.15 Local expansion of the wavefront. (Based on Betz, V. and Rose, J., Field-Programmable Logic and Applications (W. Luk, P. Y. Cheung, and M. Glesner, eds.), pp. 213–222, Springer-Verlag, Berlin, Germany, 1997. With permission.)
To route a multiterminal net, VPR uses the maze routing algorithm described in Chapter 23. After connecting two terminals of a k-terminal net, VPR's maze router starts a wave from all points on the wire connecting the two terminals. The wave is propagated until the next terminal is reached. The process is repeated k − 1 times. When a new terminal is reached, instead of restarting the wave from the new wiring tree from scratch, the maze routing algorithm starts a local wave from the new branch of wire that connected the new terminal to the rest of the tree. When the wavefront of the local wave gets as far out as the previous wavefront, the two waves are merged and expanded until a new terminal is reached. Figure 46.15 illustrates the process.
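The following grid-based sketch shows the key idea of seeding each new search from the entire routing tree built so far. It omits VPR's local re-expansion and timing-driven costs, uses plain breadth-first search, and assumes terminals sit on routable cells; it is an illustration, not VPR's router.

```python
from collections import deque

def route_net(grid_free, terminals):
    """Maze-route a k-terminal net on a grid.

    grid_free[r][c] is True if the cell is routable. Each new search starts
    its wavefront from every cell already in the routing tree, so branches to
    later terminals can sprout from anywhere on the existing route.
    """
    rows, cols = len(grid_free), len(grid_free[0])
    tree = {terminals[0]}
    for sink in terminals[1:]:
        prev = {cell: None for cell in tree}       # visited map + back-pointers
        frontier = deque(tree)                     # wave starts from whole tree
        while frontier:
            cur = frontier.popleft()
            if cur == sink:
                break
            r, c = cur
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                nr, nc = nxt
                if (0 <= nr < rows and 0 <= nc < cols
                        and grid_free[nr][nc] and nxt not in prev):
                    prev[nxt] = cur
                    frontier.append(nxt)
        if sink not in prev:
            return None                            # unroutable with current blockages
        cell = sink                                # back-trace and add new branch
        while cell is not None:
            tree.add(cell)
            cell = prev[cell]
    return tree
```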
46.5.1 HIERARCHICAL ROUTING
Chang et al. propose a hierarchical routing method for island-style FPGAs with a segmented routing architecture in Ref. [38] (Section 45.4.1). Because nets are routed simultaneously, the net-ordering problem at the detailed routing level is not an issue; in fact, global routing and detailed routing are performed at the same time in this approach. They model timing in their formulation as well, and estimate the delay of a route to be the number of programmable switches that it has to go through. This is a reasonable estimation because the delay of the switch points is much larger than that of the routing wires in a typical FPGA architecture. Each channel is divided into a number of subchannels, each subchannel corresponding to the set of segments of the same length within that channel.
After minimum spanning routing trees are generated, delay bounds are assigned to segments of the route, and then the problems of channel assignment and delay bound recalculation are solved hierarchically. Figure 46.16 shows an example of a hierarchical routing step, in which connection i is generated by a minimum spanning tree algorithm. The problem is divided into two subproblems, one containing pin1 and the other containing pin2. The cutline between the two regions contains a number of horizontal subchannels. The algorithm tries to decide on the subchannel through which this net is going to be routed. Once the subchannel is decided (see the right part of Figure 46.16), the routing problem can be broken into two smaller subproblems. While dividing the problem into smaller subproblems, the algorithm keeps updating the delay bounds on the nets and keeps an eye on the congestion.

To decide which subchannel j to use to route a routing segment i, the following cost function is used:

C_ij = C_ij^(1) + C_ij^(2) + C_ij^(3)    (46.6)
[Figure: connection i from pin1 to pin2 crosses a cutline between Region 1 and Region 2; subchannels ch1 through ch4 lie on the cutline, and after subchannel j is assigned the connection splits into pieces of length l_i1 and l_i2.]
FIGURE 46.16 Delay bound redistribution after a hierarchical routing step. (Based on Chang, Y.-W. et al., ACM Transactions on Design Automation of Electronic Systems, 5, 433–450, 2000. With permission.)
where C_ij^(1) is zero if connection i can reach subchannel j, and ∞ otherwise. Reachability can be determined by a breadth-first search on the connectivity graph. The second term intends to utilize the routing segments evenly according to the connection length and its delay bound:

C_ij^(2) = a · | l_i / U_i − L_j |    (46.7)
where
l_i is the Manhattan distance of connection i
U_i is the delay bound of the connection
L_j is the length of the routing segments in subchannel j
a > 0 is a constant
This term tries to maximize routing resource efficiency. For example, if a net has a delay bound U_i = 4 and Manhattan distance l_i = 8, it can be routed through four switches, which means the ideal routing segment length for this connection is 8/4 = 2. For a subchannel that contains routing segments of length 2, the cost function evaluates to zero, that is, a segment length of 2 is ideal for routing this net. On the other hand, if a subchannel with segment length 6 is considered, the cost function evaluates to 4, which means using segments of length 6 would be overkill for this net: its slack is high, and we do not have to waste our length-6 routing resources on it.
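The worked numbers above can be reproduced directly from Equation 46.7 (with a = 1):

```python
def subchannel_length_cost(l_i, U_i, L_j, a=1.0):
    """C2 term of Equation 46.6: prefer segments whose length matches l_i/U_i."""
    return a * abs(l_i / U_i - L_j)

# Worked example from the text (a = 1):
assert subchannel_length_cost(8, 4, 2) == 0.0   # length-2 segments are ideal
assert subchannel_length_cost(8, 4, 6) == 4.0   # length-6 segments are wasteful
```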
Cost component C_ij^(3) in Equation 46.6 is shown in Figure 46.17. Figure 46.17b shows a typical nontiming-driven routing cost, and Figure 46.17a shows the cost function used in Ref. [38]. The basic idea is to assign a lower cost to routes that are likely to use fewer bends. For example, in Figure 46.17a, if subchannel s3 is chosen, then chances are that when the subproblem of routing from a pin to s3 is being solved, more bends are introduced between the pin and s3. On the other hand, routing the net through s1 or s5 guarantees that the route from the subchannel to at least one pin will use no bends. Note that the cost of routing outside the bounding box of the net increases linearly to discourage detours, which in turn hurt the delay of a net.
After a net is divided into two subnets, the delay bound of the net is distributed between the two subnets based on their lengths. So, for example, in Figure 46.16, if the original delay bound of connection i was U_i, then U_i1 = [l_i1/(l_i1 + l_i2)] × U_i and U_i2 = [l_i2/(l_i1 + l_i2)] × U_i.
46.5.2 SAT-BASED ROUTING
Recent advances in SAT (satisfiability) solvers have encouraged researchers to formulate various problems as SAT problems and utilize the efficiency of these solvers. Nam et al. [39] formulated detailed routing on a fully segmented routing architecture (i.e., all routing segments
[Figure: the C_ij^(3) cost (a) and a typical nontiming cost (b) plotted against the x-coordinate of the subchannel used for routing; subchannels s1 through s5 lie along the cutline between the two pins.]
FIGURE 46.17 Cost function. (Based on Chang, Y.-W. et al., ACM Transactions on Design Automation of Electronic Systems, 5, 433–450, 2000. With permission.)

are of length 1) as a SAT problem. The basic idea is shown in Figure 46.18. Figure 46.18a shows an instance of a global routing problem that includes three nets, A, B, and C, and an FPGA with a channel width of three tracks. Figure 46.18b shows possible solutions for the routing of net A.
In a SAT problem, constraints are written in the form of conjunctive normal form (CNF) clauses. The CNF formulation of the constraints on net A is shown in Equation 46.8, where AH, BH, and CH are integer variables giving the horizontal track numbers assigned to nets A, B, and C, respectively, and AV is the vertical track number assigned to net A. The conditions on the first line enforce that a unique track number is assigned to A, the second line ensures that the switchbox constraints are met (here it is assumed that a subset switchbox is used), and the third line enforces that a valid track number is assigned to the vertical segment of net A. These conditions state the connectivity constraints for net A.
[Figure: (a) a global routing example with nets A, B, and C on a CLB array with a channel width of three; (b) possible routing solutions for net A between SRC and DST.]
FIGURE 46.18 SAT formulation of a detailed routing problem. (From Nam, G.-J., Sakallah, K. A., and Rutenbar, R. A., IEEE Trans. Comput. Aided Des. Integrated Circuits Syst., 21, 674, 2002. With permission.)
Conn(A) = [(AH ≡ 0) ∨ (AH ≡ 1) ∨ (AH ≡ 2)]
        ∧ [(AH = AV)]
        ∧ [(AV ≡ 0) ∨ (AV ≡ 1) ∨ (AV ≡ 2)]    (46.8)
To ensure that different nets do not share the same track number in a channel (exclusivity constraint), conditions like Equation 46.9 must be added to the problem:

Excl(H1) = (AH ≠ BH) ∧ (AH ≠ CH)    (46.9)
where H1 refers to the horizontal channel shown in Figure 46.18a. The routability problem of the example of Figure 46.18a can be formulated as in Equation 46.10:

Routable(X) = Conn(A) ∧ Conn(B) ∧ Conn(C) ∧ Excl(H1)    (46.10)
where X is a vector of the track variables AH, BH, CH, AV, BV, and CV. If Routable(X) is satisfiable, then a routing solution exists and can be derived from the values returned by the SAT solver. The authors extend the model so that doglegs can be defined too. Interested readers are referred to Ref. [39] for details.
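As a toy illustration of Equations 46.8 through 46.10, the connectivity and exclusivity constraints for the Figure 46.18 example can be checked by brute force over all track assignments; a real flow such as Ref. [39] would instead emit CNF clauses for a SAT solver, and Conn(B) and Conn(C) are omitted here for brevity.

```python
from itertools import product

TRACKS = range(3)   # channel width of three, as in Figure 46.18

def conn_A(AH, AV):
    # Equation 46.8: A uses some horizontal track, a subset switchbox forces
    # the vertical track to match, and the vertical track must also be valid.
    return AH in TRACKS and AH == AV and AV in TRACKS

def excl_H1(AH, BH, CH):
    # Equation 46.9: nets sharing horizontal channel 1 must use distinct tracks.
    return AH != BH and AH != CH

def routable():
    """Equation 46.10 by exhaustive search over all track assignments."""
    for AH, AV, BH, BV, CH, CV in product(TRACKS, repeat=6):
        if conn_A(AH, AV) and excl_H1(AH, BH, CH):
            # Conn(B) and Conn(C) would be added the same way as conn_A.
            return dict(AH=AH, AV=AV, BH=BH, BV=BV, CH=CH, CV=CV)
    return None          # unsatisfiable: no routing exists for this track count

print(routable())
```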
Even though detailed routing can be elegantly formulated as a SAT problem, in practice its application is limited. If a solution does not exist (i.e., when there are not enough tracks), the SAT solver would take a long time exploring all track assignment possibilities before returning a negative answer, that is, that Routable(X) is not satisfiable. Furthermore, even if a solution exists but the routing instance is difficult (e.g., when there are barely enough routing tracks to route the given problem instance), the SAT solver might take a long time. In practice, the SAT solver could be terminated if the time spent on the problem exceeds a prespecified limit; a timeout could mean either that the problem instance is difficult or that no routing solution exists for the given number of tracks.
46.5.3 GRAPH-BASED ROUTING
The FPGA global routing problem can be modeled as a graph matching problem in which branches of a routing tree are assigned (matched) to sets of routing segments in a multisegment architecture to estimate the number of channels required for detailed routing. Lin et al. propose a graph-based routing method in Ref. [40]. The input to the problem is a set of globally routed nets. The goal is to assign each straight segment of each net to a track in the channel that it is globally routed in, so that a lower bound on the required number of tracks is obtained for each channel. Interactions between channels are ignored in this work; as a result, the bound on the number of tracks needed for each channel is calculated in isolation. The actual number of tracks needed for the whole design might be larger depending on the switchbox architecture and the way horizontal and vertical channels interact.
They model the track assignment problem within one channel as a weighted matching problem. Straight segments of nets are called subnets (e.g., a net routed in the shape of an "L" is divided into two subnets). Within a channel, subnets belonging to a maximum clique C of overlapping subnets are assigned to tracks from a set of tracks H using a bipartite graph matching problem. Members of set C form the nodes on one side of the bipartite graph used in the matching problem, and the nodes on the other side of the matching graph are tracks in set H. The weights on the edges from subnets to
routing tracks are determined based on the track length utilization. The track utilization U_r(i_x, t) of a subnet i_x on track t is defined as

U_r(i_x, t) = len(i_x) / Σ_{1≤y≤k} len(s_y) + α/k    (46.11)

Refer to Ref. [41] for more discussions on finding cliques of overlapping net intervals and calculating lower bounds on
channel densities.
where
len(i_x) and len(s_y) are the respective lengths of the subnet i_x and the segment s_y
s_y is an FPGA routing segment in the track that i_x is globally routed in
k is the number of segments needed to route the subnet on that track
Note that the first and the last FPGA routing segments used in routing the subnet might be longer than what the subnet needs, and hence some of the track length would go underutilized. The algorithm tries to maximize routing segment utilization by matching a subnet to a track that has segments whose lengths and starting points match closely to those of the span of the subnet. This is achieved by maximizing the sum of track utilizations U_r(i_x, t) over all subnets. Parameter α in the equation above is used to enable simultaneous routability and timing optimization. They further extend the algorithm to consider timing as well as routability using an iterative process. After an initial routing, they distribute timing slacks to nets, and order channels based on how critical they are. A channel is critical if its density is the highest.
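A simplified sketch of the weighted matching step is shown below. It assumes every segment on a track has the same length (so Equation 46.11 collapses to a closed form), picks an arbitrary value for α, and relies on SciPy's assignment solver; none of these choices come from Ref. [40].

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

ALPHA = 0.5   # trade-off knob from Equation 46.11 (value assumed)

def utilization(subnet_len, seg_len):
    """Equation 46.11 under the simplifying assumption that every segment on
    the track has the same length seg_len."""
    k = math.ceil(subnet_len / seg_len)          # segments needed on this track
    return subnet_len / (k * seg_len) + ALPHA / k

def assign_tracks(subnet_lens, track_seg_lens):
    """Match subnets in a clique of overlapping subnets to tracks so that the
    summed utilization is maximized (one subnet per track)."""
    w = np.array([[utilization(s, t) for t in track_seg_lens]
                  for s in subnet_lens])
    rows, cols = linear_sum_assignment(w, maximize=True)
    return list(zip(rows, cols))                 # (subnet index, track index)

# Subnets of length 3 and 6; tracks built from length-2 and length-6 segments.
# The length-6 subnet is matched to the length-6 track, the short one elsewhere.
print(assign_tracks([3, 6], [2, 6]))
```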
46.5.4 LOW POWER ROUTING
The authors in Ref. [36] assume that all switches and connection boxes in a modified island-style FPGA are Vdd-programmable: an SRAM bit determines whether the driver of a particular switch or connection box operates at high or low Vdd. To avoid adding level converters, they enforce the constraint that no low-Vdd switch can drive a high-Vdd element. The result is that each routing tree can be mapped either fully to high-Vdd, fully to low-Vdd, or to high-Vdd from the source up to a point in the routing tree and then low-Vdd from that point to the sink. In terms of power consumption, it is desirable to map as many routing resources as possible to low-Vdd, as that consumes less power than high-Vdd. But because low-Vdd resources are slower, care must be taken not to slow down critical paths in the circuit.
They propose a heuristic sensitivity-based algorithm and a linear programming formulation for assigning voltage levels to programmable routing resources (switches and their associated buffers). The sensitivity-based method first calculates the power sensitivity ∂P/∂Vdd for each routing resource, that is, the power reduction obtained by changing high-Vdd to low-Vdd. The resource with the highest sensitivity is tried with low-Vdd. If the path containing the switch does not violate the timing constraint, then the switch and all its downstream routing resources are locked at low-Vdd. Otherwise, the switch is changed back to high-Vdd. The linear programming method tries to distribute path slacks among route segments such that the number of low-Vdd resources is maximized, subject to the constraint that no low-Vdd switch drives a high-Vdd one.
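A sketch of the sensitivity-based flow might look as follows; the Resource fields, the timing_ok callback, and the greedy loop structure are assumptions for illustration, not the exact algorithm of Ref. [36].

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    sensitivity: float            # power saved by switching to low-Vdd
    downstream: list = field(default_factory=list)
    vdd: str = "high"
    locked: bool = False

def assign_vdd(resources, timing_ok):
    """Greedy sensitivity-based Vdd assignment.

    timing_ok: callback that re-times the design and reports whether all
    paths still meet their constraints after the tentative change.
    """
    for r in sorted(resources, key=lambda r: r.sensitivity, reverse=True):
        if r.locked:
            continue
        r.vdd = "low"                               # tentative change
        if timing_ok():
            # Downstream switches must also be low-Vdd so that no low-Vdd
            # switch ever drives a high-Vdd element; lock them all.
            for d in [r] + list(r.downstream):
                d.vdd, d.locked = "low", True
        else:
            r.vdd = "high"                          # revert, path became too slow
    return resources

# Toy usage with a dummy timing check that always passes.
rs = [Resource("s1", 5.0), Resource("s2", 2.0)]
print([(r.name, r.vdd) for r in assign_vdd(rs, timing_ok=lambda: True)])
```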
46.5.5 OTHER ROUTING METHODS
In this subsection, we review miscellaneous routing methods such as pipeline routing, congestion-
driven routing, and statistical timing routing.
46.5.5.1 Pipeline Routing
Eguro and Hauck [42] propose a timing-driven pipeline-aware routing algorithm that reduces critical path delay. A pipeline-aware routing problem requires the connection from a source node to a sink node to pass through a certain number of pipeline registers, and each segment of the route (between source, sink, and registers) must satisfy delay constraints. The work by Eguro and Hauck adapts PathFinder [37]. When pipelining is considered, the problem becomes more difficult than a traditional routing problem because, as registers move along a route, the criticality of the routing segments changes. For example, suppose a net is to connect logic block A to logic block B through one register R. In the first routing iteration, R might be placed close to A, which makes the subroute A–R noncritical, but R–B would probably be critical. In the next iteration, R might move
closer to B, and hence the two subroutes might be considered critical and noncritical in successive
iterations.
To address this problem, the authors in Ref. [42] perform simultaneous wave-propagation maze routing searches, each assuming that the net has a distinct timing-criticality value. When the sink (or a register) is reached in the search process, the routing wave that best balances congestion and timing criticality is chosen. Interested readers are referred to Ref. [42] for more details.
46.5.5.2 Congestion-Driven Routing
Another work that deals with routability and congestion estimation is fGrep [43]. To estimate congestion, waves are started from a source node, and all possible paths are implicitly enumerated at every step of the wave propagation. The probability that the net passes through a particular routing element is the ratio of the total number of paths that pass through that routing element to the total number of paths that can route the net. Routing demand, or congestion, on a routing element is defined as the sum of these probabilities over all nets. Of course, performing full wave propagation for every net would be costly. As a trade-off, the authors trim the wave once it has passed a certain predetermined distance, which speeds up the estimation at the cost of accuracy. Another speedup technique used by the authors is to start waves from all terminals of a net and stop when two waves reach each other.
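The probability computation can be illustrated on a small grid if we restrict each net to its shortest (monotone) paths, which is a simplification of fGrep's full wave propagation with trimming; the lattice-path counting below is only meant to show the paths-through-element over total-paths ratio.

```python
from math import comb
from collections import defaultdict

def net_usage_probability(src, dst):
    """Probability that each grid cell is used by a net routed on any shortest
    (monotone) path from src to dst: paths through the cell / total paths."""
    (x0, y0), (x1, y1) = src, dst
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx, sy = (1 if x1 >= x0 else -1), (1 if y1 >= y0 else -1)
    total = comb(dx + dy, dx)                     # number of shortest paths
    prob = {}
    for i in range(dx + 1):
        for j in range(dy + 1):
            through = comb(i + j, i) * comb(dx - i + dy - j, dx - i)
            prob[(x0 + sx * i, y0 + sy * j)] = through / total
    return prob

def congestion_map(nets):
    """Routing demand on each cell = sum of per-net usage probabilities."""
    demand = defaultdict(float)
    for src, dst in nets:
        for cell, p in net_usage_probability(src, dst).items():
            demand[cell] += p
    return demand

# Two crossing two-terminal nets on a 3x3 grid.
print(congestion_map([((0, 0), (2, 2)), ((0, 2), (2, 0))]))
```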
46.5.5.3 Statistical Timing Routing
Statistical timing analysis has found its way into FPGA CAD tools in recent years. Sivaswamy
et al. [44] showed that using SSTA during the routing stage could greatly improve timing yield
over traditional static timing analysis methods with guard-banding. More specifically, in their
experimental setup they could reduce the yield loss from about 8 per 10,000 chips to about 1 per
10,000 chips. They considered inter- and intradie variations and modeled spatial correlations in their
statistical modeling of device parameters.
Matsumo et al. [45] proposed a reconfiguration methodology for yield enhancement in which multiple routing solutions are generated for a design and the one that yields the best timing for a
particular FPGA chip is loaded on that chip. This can be done by performing at-speed testing of an
individual FPGA chip using each of the n configurations that are generated and by picking the one
that yields the best clock speed. The advantage of this method compared to a method that requires
obtaining the delay map of all elements on the chip (e.g., the work by Cheng et al. [32]) is that
extensive tests are not required to determine which configuration yields the best timing results.
In the current version of their method, Matsumo et al. [45] fix the placement and explore only different routing solutions. In each configuration, they try to avoid routing each critical path through the same regions used by the other configurations; ideally, each configuration routes a critical path through a unique set of routing resources that are spatially far away from the paths in the other configurations. As a result, if a critical path in one configuration is slow due to process variations, chances are that the other configurations route the same path through regions that are faster, resulting in a faster clock frequency. Figure 46.19 shows three configurations with different routes for a critical path and the delay variation map of the switch matrix. Using the delay map in Figure 46.19, we can calculate the delay of the critical path in the first, second, and third configurations as 4.9, 4.5, and 5.1, respectively.
They ignore spatial correlations in their method; hence, they can analytically calculate the probability that a design fails its timing constraints given n configurations. The timing yield with n configurations, that is, the probability that at least one of the n configurations passes the timing test, is

Y_n(Target) = 1 − [1 − Y_1(Target)]^n    (46.12)
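A quick check of Equation 46.12 under an assumed single-configuration yield (the 99 percent figure below is invented for illustration):

```python
def yield_with_n_configs(y1, n):
    """Equation 46.12: a chip passes if at least one of n (independent)
    configurations meets the timing target."""
    return 1 - (1 - y1) ** n

# With an assumed single-configuration yield of 99 percent, three independent
# routing configurations push the yield to 1 - 0.01**3 = 99.9999 percent.
print(yield_with_n_configs(0.99, 3))
```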
