High-Level Synthesis: From Algorithm to Digital Circuit


96 M. Meredith
• Synthesize RTL that implements the SystemC semantics that were simulated
• Use the same testbench for high-level simulation and RTL simulation
The design can comprise a single module or multiple cooperating modules. In
the case of multiple modules, the high-level SystemC simulation ensures that the
modules are operating correctly individually and working together properly. This
simulation validates the algorithms, the protocol implementations at the interfaces,
and the interactions of the modules operating concurrently.
The modules can then be synthesized, and the resulting RTL can be verified
using the same testbench that was used at the high level. This is made possi-
ble by the mixed-mode scheduling described earlier in which the algorithm is
written as untimed SystemC while the interfaces are specified as cycle-accurate
SystemC. Multiple testbench configurations may be constructed to verify various
combinations of high-level modules and RTL modules.
[Figure: a single SystemC testbench used with both the C/C++ algorithm (with its sockets) and the RTL produced by Cynthesizer]
Cynthesizer incorporates a complete dependency management and process automation
system that automatically generates needed cosimulation wrappers and testbench
infrastructure to automate verification of multiple configurations of high-level and
RTL modules without any need to customize the testbench source code itself.
5.11 Conclusion
This chapter has outlined the synthesizable constructs of C++ and SystemC supported
by Forte Design Systems in its Cynthesizer product. It has described
specific techniques that can be used to encapsulate synthesizable communication
protocols in C++ classes for maximum reuse and techniques used to automati-
cally produce well-structured RTL for predictable timing closure. Finally, some of
5 High-Level SystemC Synthesis with Forte’s Cynthesizer 97


the user-visible mechanisms for controlling scheduling and the architecture of loop
implementation have been discussed, along with a brief discussion of the verification
automation incorporated in the Cynthesizer product.
Hopefully, this has enabled the reader to understand how SystemC synthesis with
Cynthesizer can be used to implement a broad range of functionality at multiple
abstraction levels and how the use of high-level C++ and SystemC constructs raises
the level of abstraction in hardware design.
Chapter 6
AutoPilot: A Platform-Based ESL
Synthesis System
Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong
Abstract The rapid increase of complexity in System-on-a-Chip design urges
the design community to raise the level of abstraction beyond RTL. Automated
behavior-level and system-level synthesis are naturally identified as next steps to
replace RTL synthesis and will greatly boost the adoption of electronic system-level
(ESL) design. High-level executable specifications, such as C, C++, or SystemC,
are also preferred for system-level verification and hardware/software co-design.
In this chapter we present a commercial platform-based ESL synthesis system,
named AutoPilot™, offered by AutoESL Design Technologies, Inc. AutoPilot is
based on the xPilot system originally developed at UCLA. It automatically generates
efficient RTL code from C, C++ or SystemC descriptions for a given system
platform and simultaneously optimizes logic, interconnects, performance, and power.
Preliminary experiments demonstrate very promising results for a wide range of
applications, including hardware synthesis, system-level design exploration, and
reconfigurable accelerated computing.
Keywords: ESL, Behavioral synthesis, Scheduling, Resource binding, Interface
synthesis
6.1 Introduction

The rapid increase of complexity in System-on-a-Chip (SoC) design urges the
design community to raise the level of abstraction beyond RTL. Electronic system-
level (ESL) design automation has been widely identified as the next productivity
boost for the semiconductor industry. However, the transition to ESL design will
not be as well accepted as the transition to RTL in the early 1990s without robust
synthesis technologies that automatically compile high-level functional descriptions
into optimized hardware architectures and implementations.
P. Coussy and A. Morawiec (eds.) High-Level Synthesis.
© Springer Science + Business Media B.V. 2008
100 Z. Zhang et al.
Despite the past failure of the first-generation behavioral synthesis technology
during the mid-1990s, we believe that behavior-level and system-level synthesis
and optimizations are now becoming imperative steps in EDA design flows for the
following reasons:
• Embedded processors are in almost every SoC: With the coexistence of micro-
processors, DSPs, memories and custom logic on a single chip, more software
elements are involved in the process of designing a modern embedded sys-
tem. It is natural to use C-based languages to program software for embedded
processors. Moreover, automated C-based synthesis allows the designer to
quickly experiment with different hardware/software boundaries and explore various
area/power/performance tradeoffs using a single functional specification.
• Huge silicon capacity requires a higher level of abstraction: Design abstraction is
one of the most effective methods for controlling rising complexity and improv-
ing design productivity. For example, the study from NEC [10] shows that a
1M-gate design typically requires about 300K lines of RTL code, clearly beyond
what can be handled by a human designer. However, the code density can be
improved by more than 7X when moved to the behavior level. This results in a
human-manageable 40K lines of behavioral description.

• Verification drives the acceptance of SystemC: Transaction-level modeling (TLM)
with SystemC [2] has become a very popular approach to system-level verifica-
tion [8]. Designers commonly use SystemC TLMs to describe virtual software/
hardware platforms, which serve three important purposes: early embedded
software development, architectural modeling and functional verification.
The wide availability of SystemC functional models directly drives the need
for SystemC-based synthesis solutions, which automatically generate RTL code
through a series of formal constructive transformations. This avoids the slow and
error-prone manual process and simplifies the design verification and debugging
effort.
• Accelerated computing or reconfigurable computing needs C/C++ based
compilation/synthesis to FPGAs: Recent advances in FPGAs have made recon-
figurable computing platforms feasible to accelerate many high-performance
computing (HPC) applications, such as image and video processing, financial
analytics, bioinformatics, and scientific computing applications.
Since HDLs are unfamiliar to most application software developers, it is essential
to provide a highly automated compilation/synthesis flow from C/C++ language
to FPGAs.
In this chapter we present a platform-based ESL synthesis system named
AutoPilot™, offered by AutoESL Design Technologies, Inc. AutoPilot is capable
of automatically generating efficient RTL code from an untimed or partially timed
C, C++, or SystemC description for the target hardware platform. It performs
platform-based behavioral and system synthesis, tightly integrates with a modern
leading-edge C/C++ compiler, and embodies a class of novel, near-optimal, and
highly-scalable synthesis algorithms.
The synthesis technology was originally developed in the UCLA xPilot system [5]
and has been licensed by AutoESL for commercialization. In its current stage,
AutoPilot exhibits the following key features and advantages:
• Unified C/C++/SystemC design flow: AutoPilot accepts three kinds of stan-
dard C-based design entries: C, C++ and SystemC. It also supports a variety
of abstraction models including pure untimed functional model, partially timed
transactional model, and fully timed behavioral or structural model. The broad
coverage of languages and abstraction models allows AutoPilot to target a
wide range of applications, including hardware synthesis, system-level design
exploration and high-performance reconfigurable computing.
• Utilization of state-of-the-art compiler technologies: AutoPilot incorporates a
leading-edge commercial-strength C/C++ compiler in the synthesis loop. Many
state-of-the-art compiler techniques (intra-procedural and inter-procedural) are
utilized to analyze, transform and aggressively optimize the input behaviors.
• Platform-based and implementation-aware synthesis: AutoPilot takes advantage
of the target platform information to carry out more informed synthesis and opti-
mization. The timing, area and power for the available computation resources
and communication interfaces are all characterized.
In addition, AutoPilot has tight integration with several downstream RTL
synthesis and physical synthesis tools to assure better quality-of-result and higher
degree of automation.
• Interconnect-centric and power-aware optimization: AutoPilot is able to generate
an optimized microarchitecture with consideration of the on-chip interconnects
at the high level and maximize both data locality and communication locality to
achieve faster timing and power closure. Furthermore, it can carry out aggressive
power optimization using fine-grain clock gating and power gating.
The remainder of this chapter is organized as follows: Sect. 6.2 presents an overview
of the AutoPilot design flow. Sections 6.3 and 6.4 briefly discuss the system front-
end and highlight the synthesis engine, respectively. The preliminary experimental
results are reported in Sect. 6.5.
6.2 Overall Design Flow
The overall design flow of the AutoPilot synthesis system is shown in Fig. 6.1.

AutoPilot accepts synthesizable C, C++, and/or SystemC as input and performs
four major steps to generate cycle-accurate RTL: compilation and elaboration,
advanced code transformation, core behavioral and communication synthesis,
and microarchitecture generation.
In the first step the behavioral description is parsed by a GCC-compatible front-end
compiler with extensions to handle bit-accurate integer data types. For
SystemC designs, elaboration will be invoked to extract processes, ports, channels,
and interconnection topologies and construct a detail-rich system-level synthesis
data model.
[Figure 6.1: the design specification (C/C++/SystemC, user constraints, timing/power/layout constraints) and platform models feed AutoPilot™ ESL synthesis, whose stages are compilation & elaboration, advanced code transformation, behavioral & communication synthesis and optimizations, and microarchitecture generation; the outputs are RTL SystemC and RTL HDLs for ASIC/FPGA implementation, with a common testbench used for simulation and verification]
Fig. 6.1 AutoPilot™ design flow
On top of the synthesis data model, AutoPilot applies a set of advanced code
transformations and analyses to optimize the input behavior, including traditional
compilation techniques such as constant propagation and dead code elimination, and
hardware-specific passes such as bitwidth analysis and optimization. The AutoPilot
front-end will be discussed in Sect. 6.3.
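
The two traditional passes named above, constant propagation and dead code elimination, can be illustrated on a toy example. The following Python sketch is not AutoPilot's implementation: the tuple-based instruction format and the function names `propagate_constants` and `eliminate_dead_code` are assumptions made purely for illustration, and every variable is assumed to be assigned exactly once.

```python
# Toy straight-line IR (hypothetical): each instruction is (dest, op, a, b).
# Operands are int literals or variable names; every dest is assigned once.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def propagate_constants(program):
    """Substitute known constants into operands and fold all-constant operations."""
    known = {}  # variable name -> constant value discovered so far
    result = []
    for dest, op, a, b in program:
        a = known.get(a, a) if isinstance(a, str) else a
        b = known.get(b, b) if isinstance(b, str) else b
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = OPS[op](a, b)
            result.append((dest, "const", known[dest], None))
        else:
            result.append((dest, op, a, b))
    return result


def eliminate_dead_code(program, live_outputs):
    """Keep only instructions whose results reach one of the live outputs."""
    live = set(live_outputs)
    kept = []
    for dest, op, a, b in reversed(program):
        if dest in live:
            kept.append((dest, op, a, b))
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
    kept.reverse()
    return kept


program = [
    ("t0", "add", 2, 3),       # foldable: both operands are constants
    ("t1", "mul", "t0", "x"),  # becomes 5 * x after propagation
    ("t2", "add", "t0", 4),    # foldable, but its result is never used
    ("y", "add", "t1", 1),
]
optimized = eliminate_dead_code(propagate_constants(program), live_outputs=["y"])
```

In a hardware context each surviving instruction eventually costs area, so removing `t0` and `t2` here corresponds to removing two adders from the datapath.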

The code transformation phase is followed by the core hardware synthesis phase.
AutoPilot performs platform-based synthesis and interconnect-centric optimizations
during scheduling and resource binding; these take into account the user-specified
frequency/latency/throughput/resource constraints and generate optimized microar-
chitectures. We shall discuss more details of the synthesis engine in Sect. 6.4.
At the back-end, AutoPilot outputs RTL VHDL/Verilog code together with con-
straint files (e.g., multicycle path constraints, physical location constraints, etc.) to
leverage the existing logic synthesis and physical design toolset for final imple-
mentation on either ASICs or FPGAs. It is worth noting that RTL SystemC code
is also generated, which can be directly compiled and simulated with the original
C/SystemC test bench to verify the correctness of the synthesized RTLs.
6.3 AutoPilot Front-End
In this section we discuss three major aspects of the AutoPilot front end, namely
language support, compiler optimizations, and platform modeling.
6.3.1 Language Coverage
6.3.1.1 C/C++ Support
AutoPilot has a broad coverage of the C and C++ language features. It provides
comprehensive support for most of the commonly-used data types, operators, struct/
class constructs, and control flow constructs. Due to the fundamental difference
between the memory models of software and hardware, AutoPilot currently disallows
the use of dynamic pointers, dynamic memory allocation, and function recursion.
Designers can fully control the data precisions of a C/C++ specification. AutoPilot
directly supports single and double precision floating-point types. In addition, it
adds the capability (compared to xPilot) to compile and synthesize bit-accurate
fixed-point data types, for which the standard C and C++ languages lack native support.
• Arbitrary-precision integer (APInt) data types: The user can specify that an integer
type's precision (bit width) is any number of bits up to eight million. For
example, int24 declares a 24-bit signed integer value. Constant values will be
zero or sign extended to the indicated bit width if necessary.
• Arbitrary-precision fixed point (APFixed) data types: AutoPilot provides a
synthesizable templatized C++ library, named APFixed, for the designer to describe
fixed-point math. The APFixed library implements the common arithmetic routines
via operator overloading and supports the standard quantization and saturation
modes.
• IEEE-754 standard single and double precision floating point data types are fully
supported in AutoPilot for FPGA platforms. Common floating-point math routines
(e.g., square root, exponentiation, logarithm, etc.) can also be synthesized.
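
To make the bit-accurate behavior described in the list above concrete, the following Python sketch models a 24-bit signed integer and a signed fixed-point value. It is only a numerical model under stated assumptions, not the APInt or APFixed API: the helper names `wrap_signed` and `to_fixed` are hypothetical, quantization is assumed to truncate toward negative infinity, and overflow is assumed to either saturate or wrap.

```python
import math


def wrap_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    # If the sign bit is set, the pattern represents a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


def to_fixed(x, total_bits, frac_bits, saturate=True):
    """Quantize real x to a signed fixed-point raw integer (hypothetical helper).

    The represented value is raw / 2**frac_bits. Quantization truncates
    toward negative infinity; overflow either saturates or wraps.
    """
    raw = math.floor(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if saturate:
        return max(lowest, min(highest, raw))
    return wrap_signed(raw, total_bits)


# A 24-bit signed value overflows from its maximum to its minimum.
int24_max = (1 << 23) - 1
overflowed = wrap_signed(int24_max + 1, 24)

# An 8-bit value with 4 fractional bits represents multiples of 1/16.
quantized = to_fixed(1.3, total_bits=8, frac_bits=4)       # 20, i.e. 1.25
saturated = to_fixed(100.0, total_bits=8, frac_bits=4)     # clamps to 127
wrapped = to_fixed(100.0, total_bits=8, frac_bits=4, saturate=False)
```

The choice between saturation and wrapping matters in hardware: wrapping costs nothing beyond dropping the upper bits, whereas saturation requires comparison and selection logic.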
6.3.1.2 SystemC Support
AutoPilot fully supports the OSCI synthesizable subset [1] for SystemC
synthesis.
Designers can make use of SystemC bit-accurate data types (i.e., sc_int/sc_uint,
sc_bigint/sc_biguint, and sc_fixed/sc_ufixed) to define the data precisions.
Multi-module hierarchical designs can be specified and synthesized with the SC_MODULE
constructs. Within each module, multiple concurrent processes can be declared with
the SC_METHOD and SC_CTHREAD constructs.
6.3.2 Advanced Code Transformations
A variety of compiler optimization techniques are applied to the behavioral description
code with the objective of reducing the code complexity, maximizing the data
locality, and exposing more parallelism. The following transformations and analyses
are particularly instrumental for AutoPilot hardware synthesis.
• Traditional optimizations such as constant propagation, dead code elimination,
and common subexpression elimination that avoid functional redundancy.
• Strength reductions that replace expensive operations (e.g., multiplications and
divisions) with simpler low-cost operations (e.g., shifts, additions and subtractions).
• Transformations such as if-conversion and tree height reduction that explicitly
expose fine-grain operator-level parallelism.
• Coarse-grain code restructuring by loop transformations such as loop unrolling,
loop flattening, loop fusion, etc.
• Analyses such as bitwidth analysis, alias analysis, and dependence analysis that
help to reduce the data widths and analyze the data and control dependences.
These transformations are either performed locally within the function bodies, or
applied interprocedurally across the function call hierarchy.
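
Two of the transformations listed above, strength reduction and tree height reduction, lend themselves to a small illustration. The Python sketch below is an assumption-laden toy rather than AutoPilot's pass: the function names are hypothetical, expressions are nested tuples, and the rebalancing is only valid because integer addition is associative.

```python
def shift_add_multiply(x, c):
    """Strength reduction: compute x * c for a constant c >= 0 with shifts and adds."""
    total, bit = 0, 0
    while c:
        if c & 1:
            total += x << bit  # one shifted copy of x per set bit of c
        c >>= 1
        bit += 1
    return total


def chain_sum(operands):
    """Left-to-right sum as written in source code: ((a + b) + c) + ..."""
    expr = operands[0]
    for operand in operands[1:]:
        expr = ("+", expr, operand)
    return expr


def balanced_sum(operands):
    """Tree height reduction: regroup the same sum as a balanced adder tree."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced_sum(operands[:mid]), balanced_sum(operands[mid:]))


def depth(expr):
    """Number of adders on the longest path, a proxy for combinational delay."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


def evaluate(expr, env):
    """Evaluate an expression tree given variable values in env."""
    if not isinstance(expr, tuple):
        return env[expr]
    return evaluate(expr[1], env) + evaluate(expr[2], env)


names = list("abcdefgh")
env = {name: index + 1 for index, name in enumerate(names)}
```

For eight operands the chained form has seven adders in series while the balanced form has three levels, with the same number of adders overall. This is the operator-level parallelism that a scheduler can then exploit.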
6.3.3 Platform Modeling
AutoPilot takes full advantage of the target platform information to carry out more
informed synthesis and optimization. The platform specification describes the avail-
abilities and characteristics of the important system building blocks, including the
on-chip computation resources and the selected communication interfaces.
Component pre-characterization is involved in the modeling process. Specifi-
cally, it characterizes the delay, area, and power for each type of hardware resource,
such as arithmetic units (e.g., adders and multipliers), memories (e.g., RAMs,
ROMs and register files), steering logic (multiplexors), and interface logic (e.g.,
FIFOs and bus interface adapters). The delay/area/power characteristic functions
are derived by varying the bit widths, number of input and output ports, pipeline
intervals and latencies, etc. To facilitate our interconnect-centric synthesis, the
heterogeneous resource distribution map and the distance-based wire delay lookup
tables are also constructed.
AutoPilot greatly extends the platform modeling capabilities in xPilot. It can support
advanced ASIC processes (e.g., TSMC 90 and 65 nm technologies), a wide range
of FPGA device families (e.g., Xilinx Virtex-4/Virtex-5, Altera Stratix II/Stratix
III) and various accelerated computing platforms (e.g., Nallatech [4] and XDI [3]
acceleration boards).
6.4 AutoPilot Hardware Synthesis Engine

This section highlights several important features of the AutoPilot synthesis engine,
including scheduling, resource binding, pipelining, and interface synthesis.
6.4.1 Scheduling
An efficient and versatile scheduler is implemented in the AutoPilot system to
exploit parallelism in the behavior-level design and determine the time at which
different computations and communications are performed. The core scheduling
algorithm is based on a mathematical programming formulation. It has significant
advantages over the prior approaches in two major aspects:
• Versatility: Our scheduler is able to model a rich set of scheduling constraints
(including cycle-time, latency, throughput, I/O timing, and resource
constraints) in the constraint system, and express
different performance metrics (such as worst-case and average-case latency)
in the objective function. Moreover, several important synthesis optimizations
such as operation chaining, structural pipelining, behavioral templates, slack
distribution, etc., are all naturally encoded in a single unified mathematical
framework.
• Efficiency and scalability: Our scheduler is highly efficient and scalable when
compared to other constraint-driven approaches. For instance, typical ILP
formulations use discrete 0–1 variables to model the assignment relationships
between operations and time steps. This requires many variables and complex
equations to express one scheduling constraint, since all feasible time steps must
be considered. In our formulation, variables directly represent operation execution
times and are independent of the final schedule latency. This leads to a much
more compact constraint system, and the mathematical programming model can
be efficiently solved in a few seconds for very complex designs, as evidenced by
the Xilinx MPEG-4 design (to be discussed in Sect. 6.5).
The first generation of our scheduler was based on the SDC-based scheduling
algorithm and the technical details are available in [7].
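
The idea of using one variable per operation, rather than one 0–1 variable per operation and time step, can be sketched with a system of difference constraints. The Python code below is a minimal illustration of that general idea and not the AutoPilot scheduler: the function name `sdc_schedule` and the constraint encoding are assumptions, resource constraints are ignored, and the earliest feasible schedule is found with a Bellman-Ford style longest-path relaxation instead of a mathematical programming solver.

```python
def sdc_schedule(ops, constraints):
    """Earliest start times satisfying difference constraints, or None if infeasible.

    Each constraint (u, v, w) means start[v] >= start[u] + w.
    A dependence with latency d is (producer, consumer, d); a maximum
    latency L between u and v, start[v] - start[u] <= L, is (v, u, -L).
    """
    start = {op: 0 for op in ops}  # one integer variable per operation
    for _ in range(len(ops)):
        changed = False
        for u, v, w in constraints:
            if start[u] + w > start[v]:
                start[v] = start[u] + w
                changed = True
        if not changed:
            return start
    return None  # still changing after len(ops) passes: contradictory constraints


ops = ["a", "b", "c", "d"]
dependences = [
    ("a", "c", 1),  # c consumes a's result one cycle later
    ("b", "c", 1),
    ("c", "d", 2),  # c is a two-cycle operation feeding d
]
schedule = sdc_schedule(ops, dependences)

# Requiring d to start at most 3 cycles after a is still feasible;
# tightening the bound to 2 cycles contradicts the dependences.
feasible = sdc_schedule(ops, dependences + [("d", "a", -3)])
infeasible = sdc_schedule(ops, dependences + [("d", "a", -2)])
```

The number of variables here equals the number of operations regardless of how long the schedule becomes, whereas a time-indexed 0–1 formulation needs a variable for every operation and candidate time step.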
6.4.2 Resource Binding

Resource binding determines the numbers of functional units and registers, and the
sharing among compatible operations and data transfers. It has a dramatic impact
on the final design quality, as it determines the interconnection network with wires
and steering logic.
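
As a simple illustration of sharing in resource binding, the sketch below assigns values to registers with the classical left-edge algorithm. This is a textbook technique substituted here for illustration and is not the binding algorithm used in AutoPilot: the function name `left_edge_bind` and the lifetime encoding are assumptions, and interconnect cost is ignored entirely.

```python
def left_edge_bind(lifetimes):
    """Bind values to registers so that values sharing a register never overlap.

    lifetimes maps a value name to (born, dies): the value is written at step
    `born` and last read at step `dies`. A register can be reused by a value
    born at or after the step in which its previous occupant dies.
    """
    register_free_at = []  # register index -> step at which it becomes free
    binding = {}
    # Visit values in order of increasing birth step (the "left edge").
    for value, (born, dies) in sorted(lifetimes.items(), key=lambda item: item[1]):
        for reg, free_at in enumerate(register_free_at):
            if free_at <= born:
                register_free_at[reg] = dies
                binding[value] = reg
                break
        else:
            register_free_at.append(dies)  # no free register: allocate a new one
            binding[value] = len(register_free_at) - 1
    return binding


lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
binding = left_edge_bind(lifetimes)
register_count = len(set(binding.values()))
```

Four values fit in two registers here. Minimizing register count alone can increase multiplexing, which is one reason the interconnect-aware objectives discussed in this section matter.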
AutoPilot is capable of providing optimized binding for various functional units
and memory blocks, such as integer and floating-point arithmetic units, transcen-
dental functions, black-box IP blocks, registers, register files, RAMs/ROMs, etc.
AutoPilot’s binding algorithm can also generate different microarchitectures. For
example, it has an option to generate a distributed register-file microarchitecture
(DRFM) to optimize both data and communication localities.
DRFM has a semi-regular structure which consists of one or multiple islands.
As illustrated in Fig. 6.2, each DRFM island contains a local register file (LRF),
a functional unit pool (FUP), and data-routing logic. The LRF serves as the local
storage in an island. Each register file allows a variable number of read ports but
only a fixed number (typically one) of write ports. The LRF stores the results
produced from the local computation units in FUP and provides data to both local FUP
and the external islands. By clustering LRF and FUP into a single island, we are
able to maximize both data/computation locality and communication locality. This
also helps us avoid, to a large extent, the centralized memory structures and global
communications which often become the bottlenecks limiting system efficiency in
performance, area, and power. To handle the necessary inter-island communications,
we use the data-routing logic to route data from the external islands.
[Figure 6.2: islands A to F; one island is shown in detail with its data-routing logic, local register file, FUP MUX, and functional unit pool]
Fig. 6.2 Distributed register-file microarchitecture
DRFM is a semi-regular microarchitecture. The configurations of the LRF, FUP
and the data-routing logic are application-specific. One important objective of
DRFM-based resource binding is to minimize the inter-island connections.
This will simplify the data-routing logic in each island and reduce the overall
complexity of the resulting datapath.
The technical details of the DRFM-based resource binding algorithm are avail-
able in [6].
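
The objective of minimizing inter-island connections can be made concrete with a small cost function. The Python sketch below only evaluates a given assignment and does not search for a good one; it is not the algorithm of [6], and the function name `inter_island_links` and the data layout are assumptions made for illustration.

```python
def inter_island_links(dataflow_edges, island_of):
    """Return the set of directed island-to-island connections a binding requires.

    dataflow_edges lists (producer, consumer) operation pairs; island_of maps
    each operation to the island whose functional unit pool executes it.
    """
    links = set()
    for producer, consumer in dataflow_edges:
        src, dst = island_of[producer], island_of[consumer]
        if src != dst:
            links.add((src, dst))  # data must cross the data-routing logic
    return links


edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

# Keeping producers and consumers together needs a single connection.
clustered = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B"}
# Scattering them across islands needs connections in both directions.
scattered = {"a": "A", "b": "B", "c": "A", "d": "B", "e": "A"}

clustered_links = inter_island_links(edges, clustered)
scattered_links = inter_island_links(edges, scattered)
```

A binding algorithm would compare candidate assignments using a cost of this kind, alongside the functional unit and register counts.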
6.4.3 Pipelining
AutoPilot’s synthesis engine (during scheduling, resource binding, and microar-
chitecture generation) supports several forms of pipelining to improve the system
performance.
