Handbook of algorithms for physical design automation part 91 pps

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C042 Finals Page 882 23-9-2008 #3
882 Handbook of Algorithms for Physical Design Automation
microprocessors are presented to illustrate how the basic techniques described in this chapter are
applied in practice.
42.1 METRICS FOR CLOCK NETWORK DESIGN
Unlike other signals, which carry data, the clock signal in edge-triggered circuits carries
timing information through its signal transitions (i.e., edges). Therefore, the metrics used in clock network
design are different from those for general signal net design; they are discussed in the remainder
of this section.
42.1.1 SKEW
Clock skew refers to the spatial variation in the arrival time of a clock transition. The clock skew
between two points i and j on a chip is defined as $t_i - t_j$, where $t_i$ and $t_j$ are the clock arrival times
at point i and point j, respectively. The clock skew of a chip is defined as the maximum clock
skew between any two clocked elements on the chip. In general, clock skew forces designers to be
conservative and use a longer clock period, that is, a lower clock frequency, for the design (unless both
the clock network and the circuit are specially designed to take advantage of clock skew as described
in Section 42.4). Therefore, clock networks with zero skew are most desirable. However, because of
static mismatches in the clock paths and clock loads, clock skew is nonzero in practice, and hence
skew minimization is always one of the most important objectives in clock network design. Skew
can be effectively minimized in both the physical design and circuit design stages. Skew minimization
approaches in the physical design stage are discussed in this chapter. Deskewing techniques in the circuit
design stage are illustrated by several examples in Chapter 43.
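The two definitions above can be made concrete with a minimal sketch. The function names and the arrival times below are hypothetical and chosen only for illustration; the chip skew is taken as the largest magnitude of the pairwise skew, which equals the latest arrival time minus the earliest.

```python
from itertools import combinations

def pairwise_skew(t_i, t_j):
    """Signed skew between two points, t_i - t_j."""
    return t_i - t_j

def chip_skew(arrival_times):
    """Maximum skew between any two clocked elements.

    The largest |t_i - t_j| over all pairs is simply max - min.
    """
    return max(arrival_times) - min(arrival_times)

def chip_skew_brute_force(arrival_times):
    """Same quantity computed directly from the pairwise definition."""
    return max(abs(pairwise_skew(a, b))
               for a, b in combinations(arrival_times, 2))

# Hypothetical arrival times in picoseconds for three clocked elements.
arrivals_ps = [102.0, 97.5, 110.0]
print(chip_skew(arrivals_ps))  # 12.5
```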
Jitter is another measure of the variation in the arrival time of a clock transition. Specifically, it
refers to the temporal variation of the clock period at a given point on the chip. Like skew, it is an
important metric of the quality of the clock signal because it also forces designers to be conservative
and use a longer clock period. The structure of the clock network has an insignificant effect on jitter. Jitter
is caused by delay variation in clock buffers due to power supply noise and temperature fluctuation,
the influence of substrate/power supply noise on the clock generator, capacitive coupling between clock
and adjacent signal wires, and the data-dependent nature of the load capacitance of latches/registers [1]. It is
more effectively minimized by the design of other components such as the power supply network and clock
generator. Therefore, it is typically not considered during clock network design.
42.1.2 TRANSITION TIME
The transition time is usually defined as the time for a signal to switch between 20 and 80 percent of
the supply voltage.* This corresponds to the rise time for the rising transition, and the fall time for
the falling transition. The reciprocal of the transition time is called the slew rate.†
Slow transitions could potentially cause large skew and jitter values in the presence of process
variations or noise. Transition times also need to be substantially less than the clock period to allow the
clock to achieve a rail-to-rail transition and to provide adequate noise immunity. Another motivation for
sharp transitions is that they limit the short-circuit power, which is roughly proportional to the input
transition time [2], in the clock network. However, to reduce transition time, larger or more buffers
are normally required, which would increase power consumption, layout congestion, and process
variations. In practice, transition times are bounded rather than minimized in clock network design.

* Definitions as the switching time between 10 and 90 percent, and between 30 and 70 percent, are also common.
† However, in common usage, the term slew rate is often used to mean transition time rather than its reciprocal.
Clock Network Design: Basics 883
42.1.3 PHASE DELAY
Phase delay (or latency) is defined as the maximum delay from the clock generator to any clock
terminal. It is important to realize that because the clock is a periodic signal, the absolute delay
from the clock generator to a clock terminal is not important. However, it has been observed that
the shorter the phase delay, the more robust the clock network will generally be [3]. Therefore, the
phase delay can be used as a simple albeit indirect criterion in clock network design.
42.1.4 AREA
The clock network is a huge structure driving a large number of widely distributed terminals. It
consists of a large number of wire segments, many of which are long and wide. Hence the clock
network utilizes a significant wire area. For example, in the design reported in Ref. [4], it consumes
3 percent of the total available area on metal layers 3 and 4. Moreover, because the clock network is
sensitive to noise, it is usually shielded and hence uses even more wire resources. In addition, typically,
a lot of possibly large buffers are inserted in the clock network. Those buffers could occupy a significant
device area. It is important to minimize both wire area and device area in clock network design.
42.1.5 POWER
Because of battery life concerns in portable electronic devices and heat dissipation problems in high-
performance ICs, power consumption has become a very important design consideration in recent years. The
clock signal switches twice every cycle. Whenever it switches, the huge capacitance associated
with the wires and devices of the clock network needs to be charged or discharged. Therefore,
clock distribution is a significant component of total power consumption. The clock distribution and
generation circuitry is known to consume up to 40 percent and 36 percent of the total power budget
of high-performance [4] and embedded [5] microprocessors, respectively. However, a significant
portion of the clock power is consumed in the input capacitance of the clocked elements [3,6].
Unless a large amount of local clock gating is done, as is typical in high-performance designs, this
portion of power cannot be reduced by modifying the clock network.
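As a rough illustration of why the charged and discharged capacitance matters, the sketch below uses the standard first-order dynamic power estimate P = C·V²·f for a net that makes one full charge/discharge cycle per clock period. This formula is not stated in the text above, and the function name and all numerical values are hypothetical.

```python
def clock_dynamic_power(cap_farads, vdd_volts, freq_hz):
    """First-order dynamic power of a net that charges and discharges
    its full capacitance once per clock period: P = C * V^2 * f."""
    return cap_farads * vdd_volts ** 2 * freq_hz

# Hypothetical example: 2 nF of total clock capacitance
# (wires plus clocked-element inputs), 1.0 V supply, 1 GHz clock.
wire_cap = 0.8e-9   # assumed wire and buffer capacitance
load_cap = 1.2e-9   # assumed input capacitance of clocked elements
total = clock_dynamic_power(wire_cap + load_cap, 1.0, 1e9)
load_only = clock_dynamic_power(load_cap, 1.0, 1e9)
print(f"total {total:.2f} W, of which load {load_only:.2f} W")
```

In this made-up example, the share attributed to the clocked-element inputs is unaffected by any change to the distribution network itself, which mirrors the point made in the text.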
42.1.6 SKEW SENSITIVITY TO PROCESS VARIATIONS
If the manufacturing process is ideal, a careful clock network design can eliminate any clock skew.
However, with reductions in the feature sizes of VLSI processes, manufacturing variations are becoming
increasingly significant. These variations are the major causes of clock skew in modern designs,
as designers usually can keep the systematic skew under nominal process parameters low. As a
design goal, it is important not only to minimize metrics such as the skew but also to minimize their
sensitivity to process variations.

42.2 CLOCK NETWORKS WITH TREE STRUCTURES
A common and simple approach to distribute the clock is to use a tree structure. The most basic tree
structure is the H-tree as shown in Figure 42.1, and it is obtained by recursively drawing H-shapes
at the leaf nodes. With enough recursions, the H-tree can distribute a clock from the center to within
an arbitrarily short distance of every point on the chip.
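The recursive construction can be sketched as follows. This is a minimal illustration, not a production generator: the function name `h_tree`, the `half` parameter (half the width of the current H), and the choice to halve the size at every level are assumptions made for the example. Each leaf is returned with its path length from the center so that the balance of the structure can be checked.

```python
def h_tree(cx, cy, half, levels, dist=0.0):
    """Recursively draw H-shapes centered at (cx, cy).

    Returns (segments, leaves), where each segment is a pair of
    endpoints and each leaf is (x, y, path_length_from_center).
    """
    if levels == 0:
        return [], [(cx, cy, dist)]
    # Horizontal bar of the H.
    segments = [((cx - half, cy), (cx + half, cy))]
    leaves = []
    for x in (cx - half, cx + half):
        # Vertical bar at each end of the horizontal bar.
        segments.append(((x, cy - half), (x, cy + half)))
        for y in (cy - half, cy + half):
            # Each tip of the H becomes the center of a smaller H.
            sub_segments, sub_leaves = h_tree(
                x, y, half / 2, levels - 1, dist + 2 * half)
            segments.extend(sub_segments)
            leaves.extend(sub_leaves)
    return segments, leaves

segments, leaves = h_tree(0.0, 0.0, 4.0, 3)
print(len(leaves))                     # 64 terminals
print({leaf[2] for leaf in leaves})    # a single path length: {14.0}
```

Three levels of recursion yield 4³ = 64 leaves, all at the same wire distance from the center.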
If all clock terminals have the same load and are arranged in a regular array as in Figure 42.1, and
if there is no process variation, the H-tree will have zero skew. However, the clock loads are almost
always irregularly arranged all over the chip. To handle the irregularity, algorithms that produce
generalized H-tree structures are presented in Sections 42.2.1 through 42.2.4. Wire sizing and buffer
insertion in clock trees are discussed in Section 42.2.5.
As a notational matter, we point out that Manhattan distances and rectilinear routing are assumed
throughout this chapter. However, for simplicity, nonrectilinear segments are drawn in most figures
(e.g., Figure 42.2). Each nonrectilinear segment can be replaced by a set of two (or more) rectilinear
segments in an actual implementation.
FIGURE 42.1 Clock network for 64 terminals with H-tree structure. (The figure marks the source, the terminals, and two points A and B.)
42.2.1 METHOD OF MEANS AND MEDIANS
Jackson et al. [7] proposed an algorithm called the method of means and medians (MMM) to construct
a clock tree for a set of arbitrarily distributed terminals. The algorithm takes a top-down recursive
approach, a recursive step of which is illustrated in Figure 42.2. In each step, the set of terminals is
partitioned according to either the x- or y-coordinate into two subsets about the median coordinate of
the set. Note that the numbers of terminals in the two subsets are equal if the number of terminals is
even, and differ by one otherwise. Then the center of mass (i.e., mean coordinate) of the entire set
is connected to the centers of mass of the two subsets. The partitioning direction at each recursive
level is determined by a one-level look-ahead technique in which both x-then-y partitioning and
y-then-x partitioning are attempted, and the one that minimizes skew between its current endpoints
is chosen. The clock trees for the subsets are recursively constructed until there is only one terminal
in each subset. The time complexity of MMM is O(n log n), where n is the number of terminals.
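A recursive step of this kind can be sketched as below. The sketch is deliberately simplified and is not the algorithm of Ref. [7] in full: it alternates the partitioning direction between x and y instead of using the one-level look-ahead, and it re-sorts at every level, so it does not achieve the O(n log n) bound. The names `centroid` and `mmm` are hypothetical.

```python
def centroid(points):
    """Center of mass (mean coordinate) of a set of points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def mmm(points, axis=0):
    """Top-down means-and-medians sketch.

    Returns a list of edges (parent_xy, child_xy). The split direction
    simply alternates (axis 0 = x, 1 = y); this replaces the look-ahead
    of the original method and is an assumption of this sketch.
    """
    if len(points) <= 1:
        return []
    ordered = sorted(points, key=lambda p: p[axis])
    mid = (len(ordered) + 1) // 2        # subset sizes differ by at most one
    root = centroid(points)
    edges = []
    for subset in (ordered[:mid], ordered[mid:]):
        edges.append((root, centroid(subset)))   # mean of set to mean of subset
        edges.extend(mmm(subset, 1 - axis))
    return edges

terminals = [(0, 0), (0, 2), (2, 0), (2, 2)]
for parent, child in mmm(terminals):
    print(parent, "->", child)
```

For n terminals the sketch produces 2(n − 1) edges, since every internal node of the binary tree connects to two children.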
42.2.2 GEOMETRIC MATCHING ALGORITHM
The geometric matching algorithm (GMA) proposed by Kahng et al. [8] solves the same problem
formulation as the MMM algorithm, but takes a bottom-up recursive approach. A geometric
matching of a set of k points is a set of ⌊k/2⌋ line segments connecting the points, with no two
line segments connecting to the same point. The cost of the geometric matching is the sum of the
lengths of its line segments. The GMA is illustrated in Figure 42.3. In each recursive step, a set of
k path-length-balanced subtrees is given. (At the beginning, each terminal is a subtree by itself.)
The subtrees are merged by finding a minimum-cost matching of their tapping points (i.e., roots) to
form ⌊k/2⌋ new subtrees. The tapping point of each new subtree is chosen to be the balance point
that minimizes the maximum difference in path lengths to the leaves of the subtree. The resulting
set of subtrees (including the ⌊k/2⌋ new ones and potentially one unmatched subtree when k is odd)
is recursively matched until a single path-length-balanced tree is obtained.
FIGURE 42.2 Recursive step of the MMM algorithm. The set is partitioned according to x-coordinate.
FIGURE 42.3 Recursive steps of the GMA algorithm. Seven terminals are merged into four subtrees in (a),
then two subtrees in (b), and finally one subtree in (c).

In some cases, it is impossible to find a balance point such that the path lengths to all leaves are
exactly the same. For example, in Figure 42.4a, if $l_1 + l < l_2$, then the best balance point is node A but
the path lengths to the leaves are still not completely balanced. For those cases, an H-flipping operation
as shown in Figure 42.4b can be applied to reduce the skew.
If an optimal matching algorithm for planar geometry is used, the time complexity of GMA is
$O(n^{2.5} \log n)$, where n is the number of terminals. Faster nonoptimal matching heuristics can also be
used to speed up the algorithm. It was experimentally shown in Ref. [8] that the trees generated by
GMA are better in wirelength and skew than those by MMM.
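The bottom-up flow can be sketched as follows, under several stated simplifications. A greedy closest-pair matching stands in for the optimal minimum-cost matching (one of the "faster nonoptimal heuristics" mentioned above), each subtree is summarized only by its root and its longest root-to-leaf path length, the connection between two roots is routed as a single L-shape, and H-flipping is not implemented. All function names are hypothetical.

```python
from itertools import combinations

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def point_along_l(a, b, dist):
    """Point at Manhattan distance `dist` from a on the L-shaped route
    a -> (b.x, a.y) -> b (horizontal first, then vertical)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dist <= abs(dx):
        return (a[0] + (dist if dx >= 0 else -dist), a[1])
    rest = dist - abs(dx)
    return (b[0], a[1] + (rest if dy >= 0 else -rest))

def merge(sub_a, sub_b):
    """Merge two subtrees (root, depth) at the path-length balance point."""
    (a, depth_a), (b, depth_b) = sub_a, sub_b
    d = manhattan(a, b)
    # Ideal balance: d_a + depth_a = (d - d_a) + depth_b, clamped to the wire.
    d_a = min(max((d + depth_b - depth_a) / 2, 0.0), d)
    tap = point_along_l(a, b, d_a)
    return (tap, max(d_a + depth_a, d - d_a + depth_b))

def gma(terminals):
    """Greedy-matching variant of the bottom-up merging scheme."""
    subtrees = [(t, 0.0) for t in terminals]
    while len(subtrees) > 1:
        pairs = sorted(
            combinations(range(len(subtrees)), 2),
            key=lambda ij: manhattan(subtrees[ij[0]][0], subtrees[ij[1]][0]))
        used, merged = set(), []
        for i, j in pairs:                 # greedy: closest unmatched pair first
            if i not in used and j not in used:
                used.update((i, j))
                merged.append(merge(subtrees[i], subtrees[j]))
        # Carry over the unmatched subtree when the count is odd.
        merged.extend(s for k, s in enumerate(subtrees) if k not in used)
        subtrees = merged
    return subtrees[0]

print(gma([(0, 0), (2, 0), (0, 4), (2, 4)]))  # ((1.0, 2.0), 3.0)
```

When the clamp in `merge` is active, the two sides cannot be balanced on the connecting wire, which corresponds to the situation the text describes for Figure 42.4a.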
42.2.3 EXACT ZERO-SKEW ALGORITHM
Both the MMM algorithm and GMA assume that the delay is linear in the path length, and then focus
on balancing path lengths. For high-performance designs with tight skew constraints, algorithms
based on more accurate delay models are desirable. Tsay [9] presented an algorithm that produces
clock trees with exact zero skew according to the Elmore delay model [10]. Like GMA, this algorithm
recursively merges subtrees in a bottom-up manner. However, it assumes that a tree topology, which
determines the pairing up of subtrees, is given. It addresses the problem of finding the tapping points
precisely so that the merged trees have zero skew.
Suppose two zero-skew subtrees are merged by a wire of length l as shown in Figure 42.5a.
The wire is divided by the tapping point into two segments of length xl and (1 − x)l, respectively.
By representing each subtree by a lumped delay model and each segment by a π-model, we can
transform the circuit into an equivalent RC tree as shown in Figure 42.5b.
FIGURE 42.4 H-flipping operation for further skew minimization.
FIGURE 42.5 Zero-skew merge of two subtrees.
To ensure that the delays from the tapping point to the leaf nodes of both subtrees are equal, it is required that

$$ r_1 \left( c_1/2 + C_1 \right) + t_1 = r_2 \left( c_2/2 + C_2 \right) + t_2 \tag{42.1} $$

Let α be the wire resistance per unit length and β be the wire capacitance per unit length. Then,
$r_1 = \alpha x l$, $r_2 = \alpha (1 - x) l$, $c_1 = \beta x l$, and $c_2 = \beta (1 - x) l$. Hence, after solving Equation 42.1, we
find the zero-skew condition to be

$$ x = \frac{(t_2 - t_1) + \alpha l \left( \beta l/2 + C_2 \right)}{\alpha l \left( \beta l + C_1 + C_2 \right)} $$

If $0 \le x \le 1$, the delay can be balanced by setting the tapping point somewhere
along the segment. On the other hand, if $x < 0$ or $x > 1$, the two subtrees are too much out
of balance and extra delay needs to be introduced through wire elongation, which is commonly done
by snaking. Without loss of generality, consider the case $x < 0$. For this case, the tapping point has
to be at the root of subtree 1 and the segment connecting subtree 1 to subtree 2 has to be elongated.
Assume the length of the elongated segment is $l'$. To balance the delay,

$$ t_1 = t_2 + \alpha l' \left( \beta l'/2 + C_2 \right) $$

or

$$ l' = \frac{\sqrt{(\alpha C_2)^2 + 2 \alpha \beta (t_1 - t_2)} - \alpha C_2}{\alpha \beta} $$

Similarly, for the case $x > 1$, the tapping point should be at the root of subtree 2, and

$$ l' = \frac{\sqrt{(\alpha C_1)^2 + 2 \alpha \beta (t_2 - t_1)} - \alpha C_1}{\alpha \beta} $$
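The merge rule, including both elongation cases, can be sketched in a few lines. The function name and argument order are hypothetical. The returned merged delay and merged capacitance (the sum of both subtree capacitances and the wire capacitance) are what a bottom-up recursion would need at the next level; they are an assumption of this sketch rather than something spelled out in the text above.

```python
import math

def zero_skew_merge(t1, C1, t2, C2, l, alpha, beta):
    """Zero-skew merge of two subtrees under the Elmore delay model.

    t1, t2: delays inside subtree 1 and 2; C1, C2: their capacitances;
    l: length of the connecting wire; alpha, beta: wire resistance and
    capacitance per unit length.
    Returns (len1, len2, merged_delay, merged_cap), where len1 and len2
    are the wire lengths from the tapping point to subtrees 1 and 2.
    """
    if l <= 0:
        raise ValueError("this sketch assumes a positive wire length l")
    x = ((t2 - t1) + alpha * l * (beta * l / 2 + C2)) / (
        alpha * l * (beta * l + C1 + C2))
    if 0 <= x <= 1:
        len1, len2 = x * l, (1 - x) * l
    elif x < 0:
        # Tap at the root of subtree 1; elongate the wire to subtree 2.
        len1 = 0.0
        len2 = (math.sqrt((alpha * C2) ** 2 + 2 * alpha * beta * (t1 - t2))
                - alpha * C2) / (alpha * beta)
    else:
        # Tap at the root of subtree 2; elongate the wire to subtree 1.
        len2 = 0.0
        len1 = (math.sqrt((alpha * C1) ** 2 + 2 * alpha * beta * (t2 - t1))
                - alpha * C1) / (alpha * beta)
    merged_delay = t1 + alpha * len1 * (beta * len1 / 2 + C1)
    merged_cap = C1 + C2 + beta * (len1 + len2)
    return len1, len2, merged_delay, merged_cap

# Symmetric subtrees: the tapping point falls at the middle of the wire.
print(zero_skew_merge(0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 1.0))  # (1.0, 1.0, 1.5, 4.0)
```

A quick sanity check for any result is to evaluate both sides of Equation 42.1 with the returned lengths and confirm that they agree.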
FIGURE 42.6 Two different ways to construct a zero-skew clock tree for terminals A–D. The connection EF
in (b) is much shorter than the one in (a).
42.2.4 DEFERRED MERGE EMBEDDING
In the exact zero-skew algorithm in Section 42.2.3, there are many possible ways to route the
connection between each pair of tapping points. As shown in Figure 42.6, the routing will determine
the location of the tapping point, and hence the wirelength of the connection at the next higher level.
In Ref. [9], it was suggested that a few possible wiring patterns (e.g., two one-bend connections)
may be constructed and the one which gives a shorter length at the next level is picked.
In general, the problem is to embed any given connection topology to create a zero-skew clock
tree while minimizing total wirelength. This problem can be solved in linear time by the deferred
merge embedding (DME) method independently proposed by Edahiro [11], Chao et al. [12], and
Boese and Kahng [13]. The DME algorithm consists of two phases. First, a bottom-up phase finds
a line segment called the merging segment, ms(v), to represent all possible placement locations for
each tapping point v. Then, a top-down phase resolves the exact location for each tapping point.
We use the example in Figure 42.6 to explain how to find the merging segments in the bottom-up
phase. The steps are illustrated in Figure 42.7. Consider the tapping point E. The distances $d_{AE}$ from
A to E and $d_{BE}$ from B to E that balance the delay according to some delay model are first computed.
The algorithm to compute the distances depends on the delay model used. For example, for the Elmore
delay model, Tsay's algorithm [9] can be applied. Then we set ms(E) to be the set of all points within
a distance $d_{AE}$ from A and within a distance $d_{BE}$ from B. ms(F) can be found similarly. Next, consider
the tapping point G. The least possible length of the connection between E and F is the minimum
distance between any point in ms(E) and any point in ms(F). Based on this length, we can compute
the distances $d_{EG}$ from E to G and $d_{FG}$ from F to G that balance the delay. Finally, we set ms(G) to
be the set of all points within a distance $d_{EG}$ from some point in ms(E) and within a distance $d_{FG}$
from some point in ms(F).
A Manhattan arc is defined to be a line segment, possibly of zero length, with slope +1 or −1.
A crucial observation is that all merging segments are Manhattan arcs. To prove this observation, first
notice that the merging segment of a terminal is a single point and thus a Manhattan arc. Consider
the merge of two subtrees rooted at X and Y to form a tree rooted at Z such that both ms(X) and
ms(Y) are Manhattan arcs. Let l be the minimum distance between any point in ms(X) and any point
in ms(Y); l is the least possible length of the connection between X and Y. To balance the delay, we
compute $d_{XZ}$ and $d_{YZ}$. There are two possible cases. The first case is $d_{XZ} + d_{YZ} = l$. Note that both
the region within a distance $d_{XZ}$ from ms(X) and the region within a distance $d_{YZ}$ from ms(Y) are
tilted rectangles. Moreover, the two rectangles touch each other because $d_{XZ} + d_{YZ} = l$. ms(Z) is
set to their intersection and hence is a Manhattan arc. The second case is $d_{XZ} + d_{YZ} > l$. In
this case, Z coincides with either X or Y, and the wire is elongated to balance the delay. Without loss of
generality, assume Z coincides with X. Then ms(Z) is set to all points in ms(X) that are also within
a distance $d_{YZ}$ from ms(Y). Hence it is also a Manhattan arc. By induction, therefore, all merging
segments must be Manhattan arcs. Because of this observation, each merging segment can be found
in constant time. The whole bottom-up phase requires linear time.
FIGURE 42.7 Construction of merging segments in the bottom-up phase of DME.
For the top-down phase, the locations of tapping points are fixed in a top-down manner as follows.
For the root r of the whole tree, its location is set to any point in ms(r). For any other tapping point
v, its location is set to any point in ms(v) that is within a distance $d_{vp}$ (determined in the bottom-up
phase) from the location of v's parent p. The top-down phase also takes linear time. Therefore, DME
is a linear time algorithm. It has been proved that for the linear (i.e., path length) delay model, DME
produces a zero-skew tree with optimal wirelength. However, it has also been shown that DME is not
optimal for the Elmore delay model [13].
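One merge of the bottom-up phase and one placement of the top-down phase can be sketched for the linear (path length) delay model. The sketch relies on a coordinate rotation that is an implementation choice of this example, not something prescribed by the text: with u = x + y and v = x − y, Manhattan distance becomes max(|Δu|, |Δv|), a Manhattan arc becomes an axis-parallel segment, and a tilted rectangle becomes an ordinary axis-aligned box. All names are hypothetical, and a merging segment is stored as (u_lo, u_hi, v_lo, v_hi).

```python
def to_uv(x, y):
    return (x + y, x - y)

def to_xy(u, v):
    return ((u + v) / 2, (u - v) / 2)

def leaf(x, y):
    """Merging segment of a terminal: a single point."""
    u, v = to_uv(x, y)
    return (u, u, v, v)

def expand(box, r):
    """All points within Manhattan distance r of the box (a tilted rectangle)."""
    return (box[0] - r, box[1] + r, box[2] - r, box[3] + r)

def distance(a, b):
    """Minimum Manhattan distance between two boxes."""
    gap_u = max(0, b[0] - a[1], a[0] - b[1])
    gap_v = max(0, b[2] - a[3], a[2] - b[3])
    return max(gap_u, gap_v)

def intersect(a, b):
    return (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))

def merge(ms_a, t_a, ms_b, t_b):
    """Bottom-up step: returns (ms, delay, d_a, d_b) for the parent."""
    l = distance(ms_a, ms_b)
    d_a = (l + t_b - t_a) / 2           # balance: d_a + t_a = d_b + t_b
    if d_a < 0:                          # parent sits on ms_a, wire to b elongated
        d_a, d_b = 0, t_a - t_b
    elif d_a > l:                        # parent sits on ms_b, wire to a elongated
        d_a, d_b = t_b - t_a, 0
    else:
        d_b = l - d_a
    ms = intersect(expand(ms_a, d_a), expand(ms_b, d_b))
    return ms, t_a + d_a, d_a, d_b

def place(ms, parent_uv):
    """Top-down step: the point of ms closest to the parent's location."""
    return (min(max(parent_uv[0], ms[0]), ms[1]),
            min(max(parent_uv[1], ms[2]), ms[3]))

ms, delay, d_a, d_b = merge(leaf(0, 0), 0, leaf(2, 2), 0)
print(to_xy(ms[0], ms[2]), to_xy(ms[1], ms[3]))  # arc from (0.0, 2.0) to (2.0, 0.0)
```

Choosing the closest point in `place` is one valid option; the text allows any point of ms(v) within the required distance of the parent.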
Instead of achieving zero skew, the DME algorithm can be extended to handle general skew
constraints. The extended DME algorithm has applications in clock skew scheduling (Section 42.4)
and process variation aware clock tree routing (Section 42.5).
42.2.5 WIRE WIDTH AND BUFFER CONSIDERATIONS IN CLOCK TREE
Wire resistance is a major concern for clock tree design in advanced processes. If a clock wire is long
and narrow, it will have a very significant resistance. Together with the significant capacitive load of
the clock wires and terminals, this implies that the clock signal will have a very long phase delay and
transition time. Note that this problem cannot be resolved merely by increasing the driving strength
(i.e., size) of the clock generator. Even though a strong clock generator can produce a sharp clock
signal at the source, the signal degrades rapidly as it is transmitted through the lossy clock wire.
One solution is to increase the width of the clock wires, as wire resistance is inversely proportional
to the wire width. Such a method requires a router that can handle wires of varying widths, and also
requires appropriate sizing of the clock drivers to meet the delay and transition time constraints under
an increased load for the stage.
Another solution is to insert buffers distributively in the clock tree: the basic concept is similar
to buffer insertion for signal lines, discussed elsewhere in this book. Buffers are effective in
maintaining the integrity of the clock signal by restoring degraded signals. Buffered clock trees generally
use a smaller clock generator and narrower wires, and hence consume less power and area [14,15].
However, buffer delay is more sensitive to process variations and power supply noise than wire delay.
Hence, buffered clock trees may have more skew and jitter. Moreover, clock tree design is typically
performed after placement so that clock terminals are fixed. Inserting the clock buffers into a placed
circuit may be difficult.
To reduce skew and skew sensitivity to process variations in buffered clock tree design, the
following guidelines are often followed:
• Buffered clock trees should have equal numbers of buffers in all source-to-sink paths
• At each buffered level, the buffers should have the same size
• At each buffered level, the buffers should have the same capacitive load and the same input
transition time (potentially by adjusting the width and length of the wires)
In practice, a mixed approach of wire width adjustment and buffer insertion is typically used [16].
For example, Restle et al. [3] presented the clock network design of six microprocessor chips. In all
these chips, the clock network consists of a series of buffered treelike networks driving a final set
of 16–64 sector buffers. Each sector buffer drives a tree network tunable by adjusting wire widths.
(The tunable trees finally drive a single clock grid, which is discussed in Section 42.3.1.)
42.3 CLOCK NETWORKS WITH NONTREE STRUCTURES
Although tree structures are relatively easy to design, a significant drawback associated with them
is that, in the presence of process variations, two physically nearby points that belong to different
regions of the clock tree may have a significant skew. For example, points A and B in Figure 42.1
may experience a large skew because the two paths from the source to them are distinct and may not
match well with each other. This kind of local skew is particularly troublesome, because physically
nearby registers are likely to be connected by a combinational path. Therefore, the significant skew
can easily cause a hold time violation, which is especially costly as it cannot be fixed by lowering
the clock frequency. In the following, several nontree structures are introduced. They are more
effective in reducing skew in a local region, but they consume more area and power.
42.3.1 GRID
A clock grid is a mesh of horizontal and vertical wires driven from the middle or edges. Typically, the
mesh is fine enough to deliver the clock signal to within a short distance of every clocked element.
The skew minimization approach of grids is fundamentally different from that of trees. Grids try to
equalize delay of different points by connecting them together, whereas trees try to balance delay of
different points by carefully matching the characteristics of different paths.
As the grid connects nearby points directly, it is very effective in reducing local skew. Moreover,
its design is not as sensitive to the placement details as a tree structure, which makes late design
changes easier. On the other hand, for a tree-structured network, if a late design change significantly
alters the locations of the clocked elements or the values of the load capacitances, an entirely new
tree topology may be required. The main disadvantage of grids is that they consume a large amount
of wire resources and power. In addition, grids may have significant systematic skew between the
points closest to the drivers and the points furthest away. This problem can be illustrated by the clock
network design of the 300 MHz Alpha 21164 processor [17], where the clock signal generated at the
center of the chip is distributed to the left and right banks of final clock drivers (Figure 42.8a), which
then drive a grid. It is clear from the simulation results in Figure 42.8b that the skews between points
near the left and right drivers and points further away are very significant (up to 90 ps). Therefore,
grids are rarely used by themselves. A balanced structure is usually employed to distribute the clock
globally to various places in the grid, as discussed in Section 42.3.3.
42.3.2 SPINE
The spine structure for clock distribution is shown in Figure 42.9. A clock spine is a long and
wide piece of wire running across the chip, which drives the clock signal through delay-matched
serpentine wires into each small group of clocked elements. This idea was first introduced by Lin
and Wong [18]. Typically, the clock signal is distributed from the clock generator to the spine by a
balanced buffered tree such that it arrives at many different points of the spine simultaneously. If the
load distribution induced by the serpentine wires on the spine is uniform, the spine has zero skew
everywhere. If the delays of the serpentine wires are perfectly matched, then the skew at the clocked
elements will also be zero.
Like grids, spines provide a stable structure that facilitates late design changes. Although this
structure does not make the clock as readily available as grids do, so that serpentine routing is required,
a serpentine is easy to design. To accommodate late design changes, each serpentine can be tuned
individually without affecting others. Moreover, clock gating is easy to incorporate as each
serpentine can be gated separately. However, a system with many clocked elements may require a
lot of serpentine routes, which cause high area and power consumption. Like trees, spines also may
have large local skews between nearby elements driven by different serpentines.
Intel has used the spine structure in its Pentium processors. Details can be found in Chapter 43.
FIGURE 42.8 Clock driver locations (a) and clock delay in Alpha 21164 (b). Panel (b) plots delay (ps) against X (mm) and Y (mm). (Courtesy of Hewlett-Packard
Company.)
FIGURE 42.9 Clock distribution by spines with serpentine routing.
42.3.3 HYBRID
The tree structure is good at minimizing skew globally, while the grid structure is effective in reducing
skew locally. To achieve low skew at both global and local levels, tree and grid can be combined to
form a hybrid structure. A practical approach is to use a balanced tree to distribute the clock signal
to a large number of points across the chip, and then a grid to connect these points together. As the
grid is driven at many points, the systematic skew problem of the grid is resolved. Moreover, as the tree
sinks are shorted by grid segments, the local skew problem of the tree is eliminated. In high-performance
design, the skew budget is too tight to be satisfied by either a pure tree or a pure grid approach. The
hybrid approach is a common alternative. In addition, like a grid, the hybrid network also provides
a stable structure that facilitates late design changes. The only drawback of this approach is that its power
and area costs are even higher than those of a pure grid approach.
FIGURE 42.10 Clock network with a global mesh driving local trees.

Many microprocessors have used a hybrid structure for clock distribution, and several of them
are discussed in Chapter 43. In particular, IBM has used the hybrid approach on a variety of
microprocessors including the Power4, PowerPC, and S/390 [3]. In the IBM designs, a primary buffered
H-tree drives 16–64 sector buffers arranged on the chip. Each sector buffer drives a smaller tree
network. Each tree can be tuned to accommodate nonuniform load capacitance by adjusting the wire
widths. Together, the tunable trees drive a global clock grid at up to 1024 points.
Su and Sapatnekar [19] proposed a different hybrid approach. In this mesh/tree approach, a
global zero-skew mesh is used to drive local zero-skew trees as shown in Figure 42.10. This idea
can be generalized to a multilevel structure in which each subtree sink at a certain level is driving
another mesh with four subtrees at the next lower level.
To construct a one-level mesh/tree clock network, the sinks are first divided into four groups and
a buffered tree is built for each group by any zero-skew tree construction algorithm (e.g., Tsay [9]).
Based on the delay and downstream capacitance of the four trees, a zero-skew mesh is then constructed
by adjusting the width of the eight mesh segments. Interestingly, they show that the problem of
minimizing the total segment area to achieve zero skew with respect to Elmore delay (by requiring
all four trees to meet a given target delay) can be formulated as a linear program of only four of the
segment width variables. A heuristic procedure is presented to iteratively set the target delay and
possibly elongate some segments until a feasible solution (with all segment widths within bounds)
is found. As a postprocessing step, wire width optimization under an accurate higher-order delay
metric is performed.
It is shown experimentally that clock networks produced by this hybrid mesh/tree approach are better in
skew, skew sensitivity, phase delay, and transition time than trees produced by Tsay's algorithm. They are
also better in skew, phase delay, and transition time, and similar in area, when compared with the IBM
structures discussed above.
42.4 CLOCK SKEW SCHEDULING
The clock skew scheduling technique makes use of intentional nonzero clock skews to optimize
the performance of synchronous systems. The basic idea is to use clock skews to balance the slack
difference between combinational paths instead of achieving zero-skew clock arrival times. This idea
was first proposed by Fishburn [20].
Before presenting the clock skew scheduling problem formulations, we first introduce the timing
constraints on clock signals. To avoid clock hazards, setup time constraints and hold time constraints
have to be satisfied by all source/destination register pairs in the system. Consider a pair of registers
