High Level Synthesis: from Algorithm to Digital Circuit – P18
158 P. Coussy et al.
Fig. 9.9 Operator area vs. sizing approaches: three multipliers (9×9, 9×8, 4×9) sized with (a) Max(8,4,3,9), 40 slices; (b) Max(in1, in2), 34 slices; (c) Best(in1, in2), 24 slices
9.3.2.5 Storage Element Optimization
Because there is currently no feedback loop in the design flow, register optimization has to be done during the design of the processing unit. Choosing whether an unconstrained variable (the user can fix the location of variables) is stored in a register or in a memory requires minimizing two contradictory cost criteria:
• The cost of a register is higher than the cost of a memory point.
• The cost of accessing data in a register is lower than the cost of accessing data in memory (because of the need to compute the address).
Two criteria are used to choose the storage location of a data:
• A variable whose lifetime is shorter than a locality threshold is stored in a register.
• The storage location depends on the class of the variable.
Data are classified into three categories:
• Temporary processing data (declared or undeclared).
• Constant data (read-only).
• Ageing data, which express the recursion of the algorithm to be synthesized through their assignment after use.
The optimal storage of a given data element depends on its declaration and its lifetime. It can be stored either in a memory bank of the memory unit (MEMU) or in a storage element of the processing unit (PU). The remaining difficulty lies in selecting an optimal locality threshold that minimizes the cost of the storage unit. The synthesis tool leaves the choice of the locality threshold value to the user. To help the designer, GAUT provides a histogram of the variable lifetimes, normalized by the utilization frequency, computed from the scheduled DFG.
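As an illustration, the lifetime histogram and the threshold-based choice can be sketched as follows. This is a minimal Python sketch, not GAUT's implementation; the input layout, a map from variable name to (birth date, death date, access count) taken from a scheduled DFG, is our own assumption:

```python
from collections import Counter

def lifetime_histogram(variables):
    """Histogram of variable lifetimes, normalized by utilization frequency.

    `variables` maps a variable name to (birth, death, n_accesses).
    Each lifetime is weighted by how often the variable is accessed, so
    heavily used short-lived variables stand out in the histogram.
    """
    hist = Counter()
    for name, (birth, death, n_accesses) in variables.items():
        hist[death - birth] += n_accesses
    total = sum(hist.values())
    return {lifetime: count / total for lifetime, count in sorted(hist.items())}

def storage_location(lifetime, threshold):
    """Variables shorter-lived than the locality threshold go to registers."""
    return "register" if lifetime < threshold else "memory"
```

The designer inspects the histogram for a natural break and picks the threshold there.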
The architecture of the processing unit is composed of a processing part, a memory part (i.e., the memory plane), and the associated control FSM (Fig. 9.1). The memory part of the datapath is based on a set of strong-semantic memories (FIFO, LIFO) and/or registers. Spatial adaptation is performed by interconnection logic that dispatches data from operators to storage elements and from storage elements to operators. Timing adaptation (data rates, different input/output data scheduling) is realized by the storage elements. Once the location of the data has been decided, the synthesis of the storage elements located in
9 GAUT: A High-Level Synthesis Tool for DSP Applications 159
Fig. 9.10 Four-step flow: RCG construction, binding, optimization, generation

Fig. 9.11 Resource compatibility graph: vertices a to f, with edges tagged R (Register), F (FIFO) or L (LIFO)
the PU is done. This design step takes as inputs the data lifetimes resulting from the scheduling step and the spatial information resulting from the binding step of the DFG. The spatial information is the source and destination of each data. First, we formalize both the timing relationships between data (based on data lifetimes) and the spatial information through a Resource Compatibility Graph (RCG). This formal model is then used to explore the design space. We refer to the timing relationships and spatial information together as communication constraints.
This synthesis task is based on a four-step flow: (1) Resource Compatibility Graph (RCG) construction, (2) storage resource binding, (3) architecture optimization and (4) VHDL RTL generation (see Fig. 9.10). During the first step of the component generation, a Resource Compatibility Graph is generated from the communication constraints. The analysis of this formal model allows both the binding of data to storage elements (queue, stack or register) and the sizing of each storage element. This first architecture is then optimized by merging storage elements that have non-overlapping usage time frames.
Formal model: In order to explore the design space of such a component, the first step consists in generating a Resource Compatibility Graph from the communication constraints. This RCG formally models the timing relationships between the data that have to be handled by the datapath architecture. The vertex set V = {v0, ..., vn} represents the data; the edge set E = {(vi, vj)} represents the compatibility between the data. A tag tij ∈ T is associated with each edge (vi, vj). This tag represents the compatibility type between the two data i and j, with T = {Register R, FIFO F, LIFO L} (e.g., Fig. 9.11).
In order to assign compatibility tags to edges, we need to identify the timing relationship that exists between two data. For this purpose we defined a set of rules based on the functional properties of each storage element (FIFO, LIFO, Register). The lifetime of a data a is defined by Γ(a) = [τmin(a), τmax(a)], where τmin(a) and τmax(a) are respectively the date of the write access of a into the storage element and the date of the last read access to a. τfirst(a) is the date of the first read access to a, and τRi(a) is the date of the i-th read access to a, with first ≤ i ≤ max.
Rule 1: Register compatibility
If (τmin(b) > τmax(a)) then we create a "Register" tagged edge.
Rule 2: FIFO compatibility
If [(τmin(b) > τmin(a)) and (τfirst(b) > τmax(a)) and (τmin(b) < τmax(a))] then we create a "FIFO" tagged edge.
Rule 3: LIFO compatibility
If [[(τmin(b) > τmin(a)) and (τfirst(a) > τmax(b))] or [(τRi(a) < τmin(b) < τmax(b) < τRi+1(a))]] then we create a "LIFO" tagged edge.
Rule 4: Otherwise, no edge – no compatibility.
An analysis of the communication constraints enables the RCG generation. The graph construction creates edges between data in chronological order of τmin. If n is the number of data to be handled, the graph may contain up to n(n − 1)/2 edges, i.e., O(n²).
Storage element binding: The second step consists in binding storage elements to data, using the timing relations modeled by the RCG. The aim is to identify and to bind as many FIFO or LIFO structures as possible on the RCG.
Theorem 1. If a is FIFO compatible with b and b is FIFO compatible with c, then a is transitively FIFO (or Register) compatible with c.
As a consequence of Theorem 1, a FIFO compatible data path PF is by construction equivalent to a FIFO compatibility clique (i.e., the data of the PF path can be stored in the same FIFO).
Theorem 2. If a is LIFO compatible with b and b is LIFO compatible with c, then a is transitively LIFO compatible with c.
As a consequence of Theorem 2, a LIFO compatible data path PL is by construction equivalent to a LIFO compatibility clique (i.e., the data of the PL path can be stored in the same LIFO).
Resource sizing: The size of a LIFO structure equals the maximum number of data stored by a LIFO compatible data path. We therefore identify the longest LIFO compatibility path PL in a LIFO compatibility tree; the number of vertices in PL then equals the maximum number of data that can be stored in the LIFO.
Fig. 9.12 A possible binding for the graph of Fig. 9.11: (a) the resulting hierarchical graph, where data a, b and f are merged in a three-stage FIFO (FIFO3), data c and e in a two-stage FIFO (FIFO2), and data d is bound to a register R; (b) the resulting constraints, i.e., the lifetimes of FIFO3 and FIFO2 over time
The size of a FIFO is the maximum number of data (of the considered path) stored at the same time in the structure. In fact, the aim is to count the maximum number of overlapping data (respecting the I/O constraints) in the selected path P. These sizes can easily be extracted from our formal model.
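The two sizing rules above can be sketched as a small event sweep. This is a hedged Python sketch under our own assumptions (paths as lists of data names, lifetimes as (τmin, τmax) pairs), not GAUT's implementation:

```python
def lifo_size(path):
    """A LIFO must hold, at worst, every data of its compatibility path."""
    return len(path)

def fifo_size(path, lifetimes):
    """Maximum number of data of `path` alive at the same time.

    `lifetimes` maps each data to its (tmin, tmax) interval; a sweep over
    write (+1) and last-read (-1) events counts the peak occupancy.
    """
    events = []
    for d in path:
        tmin, tmax = lifetimes[d]
        events.append((tmin, 1))   # data enters the FIFO
        events.append((tmax, -1))  # data leaves after its last read
    depth = peak = 0
    for _, delta in sorted(events):  # on ties, -1 sorts before +1
        depth += delta
        peak = max(peak, depth)
    return peak
```

Sorting the event tuples makes a departure at time t count before an arrival at the same t, so back-to-back reuse of a slot is not counted as overlap.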
Resource binding: Our greedy algorithm is based on user-provided metrics (minimal amount of data to use a FIFO or a LIFO, average use factor, FIFO/LIFO usage priority factor, ...) to bind as many FIFO or LIFO structures as possible on the RCG. A two-step flow is used: (1) identification of the best structure, (2) merging of all the concerned data into a hierarchical node.
Each node represents a storage element, as shown in Fig. 9.12a (e.g., data a, b and f are merged in a three-stage FIFO). The node is hierarchical because merging a set of data into a given node requires adding information that will be useful during the optimization step: the lifetime of the structure (i.e., the time interval during which the structure is used, e.g., Fig. 9.12b).
Let P = {v0, ..., vn} be a compatible data path:
• If P is a FIFO compatible path, the structure lifetime will be [τmin(v0), τmax(vn)].
• If P is a LIFO compatible path, the structure lifetime will be [τmin(v0), τmax(v0)].
Storage element optimization: The goal of this final task is to maximize storage resource usage, in order to optimize the resulting architecture by minimizing the number of storage elements and the number of structures to be controlled. To tackle this problem, we build a new hierarchical RCG from the merged nodes and their lifetimes. In order to avoid any conflict, the exploration algorithm of the optimization step only searches for Register compatibility paths between vertices of the same type. When two structures of the same type are Register compatible, they can be merged.
Let P = {v0, ..., vn} be a Register compatible data path:
• The lifetime of the resulting hierarchical merged structure will be [τmin(v0), τmax(v0)] ∪ ... ∪ [τmin(vn), τmax(vn)].
The algorithm is very similar to the one used during the binding step. When no more merging solution exists, the resulting graph is used to generate the RTL VHDL
Fig. 9.13 Optimization of the Fig. 9.11 graph: the register holding d is merged into the two-stage FIFO {c, e}
architecture. Figure 9.13 shows a possible architectural solution for the Resource Compatibility Graph presented in Fig. 9.11. Here, the resulting architecture consists of a three-stage FIFO that handles three data and a two-stage FIFO that handles three data: one memory location has been saved.
9.3.3 Memory Unit Synthesis
In this section, we present two major features of GAUT regarding the memory system. First, the data distribution and placement are formalized as a set of constraints for the synthesis. We introduce a formal model of the memory accesses and an accessibility criterion to enhance the scheduling step. Next, we propose a new strategy to implement signals described as ageing vectors in the algorithm. We formalize the maturing process and explain how it may generate memory conflicts over several iterations of the algorithm. The final compatibility graph indicates the set of valid mappings for every signal. Our scheduling algorithm exhibits a relatively low complexity, which allows us to tackle complex problems in a reasonable time.
9.3.3.1 Memory Constrained Scheduling
In our approach, the data flow graph (DFG) first generated from the algorithmic specification is parsed and a memory table is created. This memory table is completed by the designer, who can select the variable implementation (memory or register) and place the variable in the memory hierarchy (in which bank). The resulting table is the memory mapping that will be used in the synthesis; it lists all the data vertices of the DFG. The data distribution can be static or dynamic.
In the case of a static placement, the data remains at the same place during the whole execution. If the placement is dynamic, data can be transferred between different levels of the memory hierarchy. Thus, several data can share the same location in the circuit memory. The memory mapping file explicitly describes the data transfers that occur during the algorithm execution.
Direct Memory Access (DMA) directives will be added to the code to achieve these transfers. The definition of the memory architecture is performed in the first step of the overall design flow. To achieve this task, advanced compilers such as the Rice HPF compiler, Illinois Polaris or Stanford SUIF could be used [14]. Indeed, these compilers automatically perform data distribution across banks, determine
Fig. 9.14 Memory constraint graphs for signal samples x0–x3 (bank 1) and coefficients h0–h3 (bank 2)
which access goes to which bank, and then schedule accesses to avoid bank conflicts. The Data Transfer and Storage Exploration (DTSE) method from IMEC and the associated tools (ATOMIUM, ADOPT) are also a good means of determining a convenient data mapping [15].
We modified the original priority list (see Sect. 9.3.2.2) to take the memory constraint into account: an accessibility criterion is used to determine whether the data used by an operation are available, that is to say, whether the memory where they are stored is free. Operations are still listed according to the mobility and bit-width criteria, but all operations that do not match the accessibility criterion are removed. An operation that needs to access a busy memory will not be scheduled, whatever its priority level. Fictitious memory access operators are added (one access operator per access port of a memory). The memory is accessible only if one of its access operators is idle. Memory access operators are represented by tokens on the Memory Constraint Graph (MCG): there are as many tokens as access ports to the memory or bank. Figure 9.14 shows two MCGs, for signal samples x[0] to x[3] stored in bank 1 and coefficients h[0] to h[3] stored in bank 2 (in the case of a four-point convolution filter, for instance).
If a bank is being accessed, a token is placed on the corresponding data. Only one token is allowed for a single-port bank. Dotted edges indicate which following access will be the fastest: in the case of a DRAM, slower random accesses are indicated with plain edges and faster sequential accesses with dotted edges. Our scheduling algorithm always favors the fastest sequence of accesses whenever it has the choice.
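The accessibility criterion applied to the ready list can be sketched as follows. This is a hedged Python sketch under our own assumptions (operations as dicts with an `operands` list, a `bank_of` placement map, and a count of free access-operator tokens per bank), not GAUT's scheduler:

```python
from collections import Counter

def schedulable(op, bank_of, free_ports):
    """Accessibility criterion: every operand's bank must have an idle
    access operator (a free token) for the operation to be scheduled."""
    needed = Counter(bank_of[d] for d in op["operands"])
    return all(free_ports[b] >= n for b, n in needed.items())

def filter_ready_list(ready, bank_of, free_ports):
    """Operations stay ordered by their mobility/bit-width priority; those
    failing the accessibility criterion are simply removed for this cycle."""
    return [op for op in ready if schedulable(op, bank_of, free_ports)]
```

An operation reading two operands from the same single-port bank is thus rejected in the same way as one targeting a busy bank.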
9.3.3.2 Implementing Ageing Vector
Signals are the input and output flows of the application. A mono-dimensional signal x is a vector of size n if n values of x are needed to compute the result. Every cycle, a new value of x (x[n+1]) is sampled on the input, and the oldest value of x (x[0]) is discarded. We call x an ageing (or maturing) vector. Ageing vectors are stored in RAM. A straightforward way to implement the maturing of a vector in hardware is to always write its new value at the same address in memory, at the end of the vector in the case of a 1D signal for instance. Obviously, this requires shifting all the other values of the signal in memory to free the place for the new value. This shifting requires n reads and n writes, which is very time and power consuming. In GAUT, the new value is stored at the address of the oldest one in the
Fig. 9.15 Logical address evolution for signal x (L = 4). At iteration i, algorithm sample x[j] holds input sample x(i + j) and resides at logical address @x[j] = (j − i) mod 4:
Iteration 0: @x[0], @x[1], @x[2], @x[3] = 0, 1, 2, 3
Iteration 1: @x[0], @x[1], @x[2], @x[3] = 3, 0, 1, 2
Iteration 2: @x[0], @x[1], @x[2], @x[3] = 2, 3, 0, 1
Iteration 3: @x[0], @x[1], @x[2], @x[3] = 1, 2, 3, 0
Fig. 9.16 LAG, AG and USG for signal x. The LAG chains @x[0] → @x[1] → @x[2] → @x[3] with edge weights (1, 1) and closes with a back edge weighted (−3, 1); the AG carries edges @x[j]i → @x[j]i+1 weighted −1; the USG combines both, with edge weights (1, 1, 0) along the chain and (−3, 1, −1) on the back edge
vector. Only one write is needed. Obviously, address generation is more difficult in this case, because the addresses of the samples used by the algorithm change from one cycle to the next. Figure 9.15 represents the evolution of the addresses of a four-point (L = 4) signal x from one iteration to the next.
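The address evolution of Fig. 9.15 can be reproduced with a few lines. This is a hedged Python sketch of the addressing scheme only (the function names are our own), not the generated address-generator hardware:

```python
L = 4  # vector length, as in Fig. 9.15
K = 1  # ageing factor k

def address(j, iteration, length=L, k=K):
    """Logical address of algorithm sample x[j] at a given iteration:
    starting from @x[j] = j at iteration 0, the address decreases by k
    (modulo L) at every iteration, because the new sample overwrites
    the oldest one instead of shifting the whole vector."""
    return (j - k * iteration) % length

def address_table(iteration, length=L):
    """Addresses of x[0] .. x[L-1] at the given iteration (one Fig. 9.15 row)."""
    return [address(j, iteration, length) for j in range(length)]
```

Only one memory write per cycle is needed, at the price of this rotating address map.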
The methodology that we propose to support the synthesis of these complex logical address generators is based on three graphs (see Fig. 9.16). The logical address graph (LAG) traces the evolution of the logical addresses of a vector during the execution of one iteration of the algorithm. Each vertex corresponds to the logical address where a sample of signal x is accessed. Edges are weighted with two numbers. The first number, fij = (j − i) % L (% denotes the modulo), indicates how the logical address evolves between two successive accesses to vector x. The second number, gij, indicates the number of iterations between those two successive accesses.
To actually calculate the evolution of the logical addresses of x from one iteration to the other, we must take into account the ageing of vector x. We introduce the ageing factor k as the difference between the logical address of element x[j] at iteration o and its logical address at iteration o + 1, so that:
@x[j]i+1 = (@x[j]i − k) % L.
In our example, k = 1. The Ageing Graph (Fig. 9.16) is another representation of this equation. We finally combine the LAG and the ageing factor to get the Unified Sequences Graph (USG) (Fig. 9.16). A detailed definition of these three graphs may be found in [16].
By moving a token in the USG, and by adding to the current logical address of x the value of the weight fij minus the ageing factor k, we get the address sequence for x during the complete execution of the algorithm. The corresponding address generator is then generated.
If a pipelined architecture is synthesized, the ageing factor k is multiplied by the number of pipeline stages, and as many tokens as pipeline stages are placed and moved in the USG. Of course, as many memory locations as supplementary tokens in the USG must be added to guarantee data consistency. Concurrent accesses to elements of vector x may appear in a pipelined architecture. While moving tokens in the USG, a Concurrent Accesses Graph is constructed. This graph is finally colored to obtain the number of memory banks needed to support the access concurrency.
9.3.4 Communication and Interface Unit Synthesis
9.3.4.1 Latency Insensitive Systems
Systems on a chip (SoCs) are compositions of several sub-systems exchanging data. SoC sizes have increased to the point that an efficient and reliable interconnection strategy is necessary to combine sub-systems while preserving, at an acceptable design cost, the speed performance that current very deep sub-micron technologies allow [20]. This communication requirement can be satisfied by a LIS communication network between hardware components. The LIS methodology makes it possible to build functionally correct SoCs by (1) promoting intensive reuse of pre-developed components (IPs), (2) segmenting inter-component interconnects with relay stations to break critical paths and (3) making components robust to data stream latencies by encapsulating them in synchronization wrappers. These encapsulated blocks are called "patient processes". Patient processes [21] are a key element of the LIS theory. They are suspendable synchronous components (named pearls) encapsulated in a wrapper (named shell) whose function is to make them insensitive to the I/O latency and to drive their clock. The decision whether to drive the component's clock is implemented with combinatorial logic. The LIS approach relies on a simplifying, but restrictive, assumption: a component is activated only if all its inputs are valid and all its outputs are able to store a result produced at the next clock cycle. In practice, it is frequent that only a subset of the inputs and outputs is necessary to execute one step of computation in a synchronous block.
To limit the sensitivity of a patient process to a subset of the inputs and outputs, the authors of [22] suggest replacing the combinatorial logic that drives the clock by a Mealy-type FSM. This FSM tests the state of only the relevant inputs and outputs at each cycle and drives the component clock only when they are all ready. The major drawbacks of FSMs are their difficult synthesis and large silicon area when communication scenarios are long and complex, as in computation-intensive digital signal processing applications. To reduce the hardware cost, in [23] the static schedule of component activations is implemented with shift registers whose contents drive the component's clock. This approach relies on the hypothesis that there are no irregularities in the data streams: it is never necessary to randomly freeze the components.
9.3.4.2 Proposed Approach
Since (1) the LIS methodology lacks the ability to dynamically sense I/O subsets, (2) FSMs can become too large as the communication bandwidth grows, and (3) shift-register-based synchronization targets only extremely fast environments, we propose to encapsulate hardware components in a new synchronization wrapper model whose area is much smaller than that of the FSM-based wrappers, whose speed is enhanced (mostly thanks to the area reduction), and whose synthesizability is guaranteed whatever the communication schedule.
The solution we propose is functionally equivalent to the FSMs. It is a specific processor that cyclically reads and executes operations stored in a memory. We name it a "synchronization processor" (SP). Figure 9.1 shows the new synchronization wrapper structure with our SP.
The SP communicates with the LIS ports through FIFO-like signals. These signals are formally equivalent to the voidin/out and stopin/out signals of [19] and to the valid, ready and stall signals of [22]. The number of input and output ports is arbitrary. The SP drives the component's clock with the enable signal. The SP model is specified by a three-state FSM: a reset state at power-up, an operation-read state, and a free-run state. This FSM runs concurrently with the component and contains a datapath: it is a "concurrent FSM with datapath" (CFSMD). The operation format is the concatenation of an input mask, an output mask and a free-run cycle count. The masks specify respectively the input and output ports the FSM is sensitive to. The free-run cycle count represents the number of clock cycles the component can execute until the next synchronization point. To avoid unnecessary signals and save area, the memory is an asynchronous ROM (or SRAM with FPGAs) and its interface with the SP is reduced to two buses: the operation address and the operation word. The execution of the program is driven by an operation "read-counter" incremented modulo the memory size.
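The SP behavior described above can be summarized in a short behavioral model. This is a hedged Python sketch of the operation semantics (masks plus a free-run cycle count, modulo read-counter), under our own encoding of operations as tuples; it is not the RTL of the actual SP:

```python
class SynchronizationProcessor:
    """Behavioral sketch of the SP: it cycles through a program of
    (input_mask, output_mask, free_run_cycles) operations and asserts the
    component's clock-enable only when every masked port is ready."""

    def __init__(self, program):
        self.program = program
        self.pc = 0          # operation read-counter, modulo program size
        self.remaining = 0   # free-run cycles left for the current operation

    def step(self, inputs_valid, outputs_ready):
        """One clock cycle; returns the component's clock-enable."""
        if self.remaining:                       # free-run state
            self.remaining -= 1
            return True
        in_mask, out_mask, cycles = self.program[self.pc]
        ready = all(inputs_valid[i] for i in in_mask) and \
                all(outputs_ready[o] for o in out_mask)
        if not ready:
            return False                         # freeze the component
        self.pc = (self.pc + 1) % len(self.program)
        self.remaining = cycles - 1              # this step is the first cycle
        return True
```

Only the ports named in the masks are tested, which is exactly the sensitivity to an I/O subset that the plain LIS wrapper lacks.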
9.4 Experiments
Design synthesis results for Viterbi decoders are presented in this section. The results are based on the Virtex-E FPGA technology of the hardware prototyping platform that we used, which we present first.
9.4.1 The Hardware Platform
The Sundance platform [24] that we used as an experimental support is composed of the latest generation of C6x DSPs and Virtex FPGAs. Communications between the different functional blocks are implemented with high-throughput SDB links [24]. We have automated the generation of the communication interfaces for software and hardware components, which frees the user from designing them. At the hardware level, the communication between computing nodes is handled by four-phase handshaking protocols and decoupling FIFOs. The handshaking protocols synchronize computation with communication, and the FIFOs store data in order to overcome potential data flow irregularities. Handshaking protocols are used to communicate seamlessly either between hardware nodes or between hardware and software nodes, and they are automatically refined by the GAUT tool to fit the selected (SDB) inter-node platform communication interfaces (bus width, signal names, etc.). To complete the software code generation, platform-specific code has to be written to ensure the communication between processing elements. The communication drivers of the targeted platform are called inside the interface functions introduced in the macro-architecture model through an API mechanism. We provide a specific class for each type of link available on the platform.
9.4.2 Synthesis Results
The Viterbi algorithm is applicable to a variety of decoding and detection problems that can be modeled by a finite-state discrete-time Markov process, such as convolutional and trellis decoding in digital communications [25]. Based on the received symbols, the Viterbi algorithm estimates the most likely state sequence according to an optimization criterion, such as the a posteriori maximum likelihood criterion, through a trellis which generally represents the behavior of the encoder. The generic C description of the Viterbi algorithm allowed us to synthesize architectures using different values of the following functional parameters: number of states and throughput. A part of the synthesis results obtained is given in Fig. 9.17. For each generated architecture, the table presents the throughput constraint and the complexity of both the algorithm (number of operations) and the generated architecture (number of logic elements).
In the particular case of the DVB-DSNG Viterbi decoder (64 states), different throughput constraints (from 1 to 50 Mbps) have been tested. Figure 9.18 presents the synthesis results.
State number              8     16    32    64    128
Throughput (Mbps)         44    39    35    26    22
Number of operations      50    94    182   358   582
Number of logic elements  223   434   1130  2712  7051

Fig. 9.17 Synthesis results for different Viterbi decoders
