Nicolescu/Model-Based Design for Embedded Systems 67842_C007 Finals Page 186 2009-10-2
186 Model-Based Design for Embedded Systems
object request broker (HORBA) when the support of small-grain parallelism
is needed.
Our most recent developments in MultiFlex are mostly focused on the
support of the streaming programming model, as well as its interaction
with the client–server model. SMP subsystems are still of interest, and they
are becoming increasingly well supported commercially [14,21]. Moreover,
our focus is on data-intensive applications in multimedia and communications.
For these applications, we have concentrated primarily on the streaming and
client–server programming models, for which explicit communication-centric
approaches seem most appropriate.
This chapter introduces the MultiFlex framework, which is specialized in
supporting the streaming and client–server programming models. However,
we focus primarily on our recent streaming programming model and
mapping tools.
7.2.1 Iterative Mapping Flow
MultiFlex supports an iterative process, using initial mapping results to
guide the stepwise refinement and optimization of the application-to-
platform mapping. Different assignment and scheduling strategies can be
employed in this process.
An overview of the MultiFlex toolset, which supports the client–server
and streaming programming models, is given in Figure 7.2. The design
methodology requires three inputs:
(Figure 7.2 shows the client/server and streaming application specifications, the application constraints and profiling data, and the abstract platform specification feeding an intermediate representation (IR); the IR is mapped, transformed, and scheduled by static and dynamic streaming and client/server tools; component assembly targets the video, mobile, and multimedia platforms, with visualization and performance analysis closing the loop.)
FIGURE 7.2
MultiFlex toolset overview.
• The application specification—the application can be specified as a set
of communicating blocks; it can be programmed using the streaming
model or the client–server programming model semantics.
• Application-specific information (e.g., quality-of-service requirements,
measured or estimated execution characteristics of the application,
data I/O characteristics, etc.).
• The abstract platform specification—this information includes the
main characteristics of the target platform which will execute the
application.
An intermediate representation (IR) is used to express the high-level appli-
cation in a language-neutral form. It is translated automatically from one or
more user-level capture environments. The internal structure of the appli-
cation capture is highly inspired by the Fractal component model [23].
Although we have focused mostly on the IR-to-platform mapping stages, we
have experimented with graphical capture from a commercial toolset [7], and
a textual capture language similar to StreamIt [3] has also been experimented
with.
In the MultiFlex approach, the IR is mapped, transformed, and sched-
uled; finally the application is transformed into targeted code that can run
on the platform. There is a trade-off between flexibility and performance:
between what can be calculated and compiled statically and what must be
evaluated at runtime. As shown in Figure 7.2, our approach currently combines
both, allowing a certain degree of adaptive behavior while
making use of more powerful offline static tools when possible. Finally, the
MultiFlex visualization and performance analysis tools help to validate the
final results or to provide information for the improvement of the results
through further iterations.
7.2.2 Streaming Programming Model
As introduced above, the streaming programming model [1] has been
designed for use with data-dominated applications. In this computing
model, an application is organized into streams and computational kernels
to expose its inherent locality and concurrency. Streams represent the flow of
data, while kernels are computational tasks that manipulate and transform
the data. Many data-oriented applications can easily be seen as sequences of
transformations applied on a data stream. Examples of languages based on
the streaming computing model include Esterel [4], Lucid [5], StreamIt [3],
and Brook [2]. Frameworks for stream computing visualization are also
available (e.g., Ptolemy [6] and Simulink® [7]).
In essence, our streaming programming model is well suited to a
distributed-memory, parallel architecture (although mapping is possible
on shared-memory platforms), and favors an implementation using soft-
ware libraries invoked from the traditional sequential C language, rather
than proposing language extensions, or a completely new execution model.
The entry to the mapping tools uses an XML-based IR that describes the
application as a topology with semantic tags on tasks. During the mapping
process, the semantic information is used to generate the schedulers and all
the glue necessary to execute the tasks according to their firing conditions.
In summary, the objectives of the streaming design flow are:
• To refine the application mapping in an iterative process, rather than
having a one-way top-down code generation
• To support multiple streaming execution models and firing conditions
• To support both restricted synchronous data-flow and more dynamic
data-flow blocks
• To let the user control the mechanical transformations, rather than
making decisions on the user's behalf
We first present the mapping flow in Section 7.3; at the end of that section,
we give more details on the streaming programming model.

7.3 MultiFlex Streaming Mapping Flow
The MultiFlex technology includes support for a range of streaming pro-
gramming model variants. Streaming applications can be used alone or in
interoperation with client–server applications. The MultiFlex streaming tool
flow is illustrated in Figure 7.3. The different stages of this flow will be
described in the next sections.
The application mapping begins with the assignment of the application
blocks to the platform resources. The IR transformations consist mainly of
splitting and/or clustering the application blocks; they are performed for
optimization purposes (e.g., memory optimization). The transformations also
imply the insertion of communication mechanisms (e.g., FIFOs and local
buffers).
The scheduling defines the sharing of a processor between several blocks
of the application. Most of the IR mapping, transforming, and scheduling
is realized statically (at compilation time), rather than dynamically (at run-
time).
The methodology targets large-scale multicore platforms including a
uniform layered communication network based on STMicroelectronics’
network-on-chip (NoC) backbone infrastructure [18] and a small number
of H/W-based communication IPs for efficient data transfer (e.g., stream-
oriented DMAs or message-passing accelerators [9]). Although we consider
our methodology to be compatible with the integration of application-
specific hardware accelerators using high-level hardware synthesis, we are
not targeting such platforms currently.
(Figure 7.3 shows the application functional capture (filter core C functions and the filter dataflow), the application constraints, profiling data, user assignment directives, and the abstract platform specification (number and types of PEs, communication resources and topology, storage resources) feeding the intermediate representation (IR); the MpAssign and MpCompose tools, supported by the communication and H/W abstraction library and the abstract communication services, produce component assemblies for the target video, mobile, and multimedia platforms.)
FIGURE 7.3
MultiFlex tool flow for streaming applications.
7.3.1 Abstraction Levels
In the MultiFlex methodology, a data-dominated application is gradually
mapped on a multicore platform by passing through several abstractions:
• The application level—at this level, the application is organized as a
set of communicating blocks. The targeted architecture is completely
abstracted.
• The partitioning level—at this level the application blocks are grouped
in partitions; each partition will be executed on a PE of the target
architecture. PEs can be instruction-set programmable processors,
reconfigurable hardware or standard hardware.
• The communication level—at this level, the scheduling and the com-
munication mechanisms used on each processor between the different
blocks forming a partition are detailed.
• The target architecture level—at this level, the final code executed on
the targeted platforms is generated.
Table 7.2 summarizes the different abstractions, models, and tools provided
by MultiFlex in order to map complex data-oriented applications onto mul-
tiprocessor platforms.
TABLE 7.2
Abstraction, Models, and Tools in MultiFlex

Abstraction Level          Model                            Refinement Tool
Application level          Set of communicating blocks      Textual or graphical
                                                            front-end
Partition level            Set of communicating blocks      MpAssign
                           and directives to assign
                           blocks to processors
Communication level        Set of communicating blocks      MpCompose
                           and required
                           communication components
Target architecture level  Final code loaded and            Component-based
                           executed on the target           compilation back-end
                           platform
7.3.2 Application Functional Capture
The application is functionally captured as a set of communicating blocks.
A basic (or primitive) block consists of a behavior that implements a known
interface. The implementation part of the block uses streaming application
programming interface (API) calls to get input and output data buffers to
communicate with other tasks. Blocks are connected through communica-
tion channels (in short, channels) via their interfaces. The basic blocks can be
grouped in hierarchical blocks or composites.
The main types of basic blocks supported in the MultiFlex approach are
• Simple data-flow block: This type of block consumes and produces
tokens on all inputs and outputs, respectively, when executed. It is
launched when there is data available at all inputs, and there is sufficient
free space in downstream components for all outputs to write the results.
• Synchronous client–server block: This block needs to perform one or
many remote procedure calls before being able to push data to the
output interface. It must therefore be scheduled differently from the
simple data-flow block.
• Server block: This block can be executed once all the arguments of the
call are available. Often this type of block can be used to model a H/W
coprocessor.
• Delay memory: This type of block can be used to store a given number
of data tokens (an explicit state).
Figure 7.4 gives the graphical representation of a streaming application cap-
ture which interacts with a client–server application. Here, we focus mostly
on streaming applications.
(Figure 7.4 shows a composite containing blocks with synchronous dataflow semantics, a memory delay block (with its state, maximum number of elements, data type, and token rate), dataflow and application interfaces exposing the init, process, and end entry points, and blocks with synchronous client and server semantics offering methods such as int method1() and int method2().)
FIGURE 7.4
Application functional capture.
From the point of view of the application programmer, the first step is to
split the application into processing blocks with buffer-based I/O ports. User
code corresponding to the block behavior is written using the C language.
Using component structures, each block has its private state, and imple-
ments a constructor (init), a work section (process), and a destructor (end).
To obtain access to I/O port data buffers, the blocks have to use a prede-
fined API. A run-to-completion execution model is proposed as a compro-
mise between programming and mapping flexibility. The user can extend
the local schedulers to allow the local control of the components, based
on application-specific control interfaces. The dataflow graph may contain
blocks that use client–server semantics, with application-specific interfaces,
to perform remote object calls that can be dispatched to a pool of servers.
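The block structure described above can be sketched as follows. This is an illustrative Python model only: the actual user code is written in C against the MultiFlex streaming API, and all class, method, and field names here are assumptions, not the framework's real interface.

```python
class DataflowBlock:
    """Illustrative model of a simple data-flow block: private state,
    init/process/end entry points, and buffer-based I/O ports with a
    known token rate (run-to-completion: process() never blocks)."""

    def __init__(self, in_rate, out_rate):
        self.in_rate = in_rate      # tokens consumed per firing
        self.out_rate = out_rate    # max tokens produced per firing
        self.state = None

    def init(self):                 # constructor section
        self.state = {"iterations": 0}

    def ready(self, in_avail, out_free):
        # Firing rule for a simple data-flow block: enough input tokens
        # and enough downstream free space for the worst-case output.
        return in_avail >= self.in_rate and out_free >= self.out_rate

    def process(self, in_buf):      # work section, runs to completion
        self.state["iterations"] += 1
        return [2 * x for x in in_buf[: self.in_rate]]

    def end(self):                  # destructor section
        return self.state["iterations"]

blk = DataflowBlock(in_rate=4, out_rate=4)
blk.init()
out = blk.process([1, 2, 3, 4]) if blk.ready(4, 4) else []
```

The separation between `ready` (evaluated by the scheduler) and `process` (the user's work function) mirrors the run-to-completion compromise described above: the scheduler only calls `process` when the firing rule holds, so the work function itself never waits.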
7.3.3 Application Constraints
The following application constraints are used by the MultiFlex streaming
tools:
1. Block profiling information: for a given block, the average number of
clock cycles required to execute the block on a target processor
2. Communication volume: the size of the data exchanged on each channel
3. User assignment directives; three types of directives are supported by
the tool:
a. Assign a block to a specific processor
b. Assign two blocks to the same processor (can be any processor)
c. Assign two blocks to any two different processors
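The three directive types above could be represented as simple constraints checked against a candidate assignment. The encoding below is an assumption for illustration, not the tool's actual directive format.

```python
# Hypothetical encoding of the three MpAssign directive types.
directives = [
    ("pin", "B1", 0),          # a. assign block B1 to processor 0
    ("together", "B2", "B3"),  # b. B2 and B3 on the same processor
    ("apart", "B4", "B5"),     # c. B4 and B5 on different processors
]

def satisfies(assign, directives):
    """Check a block -> processor map against all directives."""
    for d in directives:
        kind = d[0]
        if kind == "pin" and assign[d[1]] != d[2]:
            return False
        if kind == "together" and assign[d[1]] != assign[d[2]]:
            return False
        if kind == "apart" and assign[d[1]] == assign[d[2]]:
            return False
    return True

ok = satisfies({"B1": 0, "B2": 1, "B3": 1, "B4": 0, "B5": 1}, directives)
bad = satisfies({"B1": 2, "B2": 1, "B3": 1, "B4": 0, "B5": 1}, directives)
```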
7.3.4 The High-Level Platform Specification
The high-level platform specification is an abstraction of the processing,
communication, and storage resources of the target platform. In the current
implementation, the information stored is as follows:
• Number and type of PEs.
• Program and data memory size constraints (for each programmable
PE).
• Information on the NoC topology. Our target platform uses the
STNoC, which is based on the “Spidergon” topology [18]. We include
the latency measures for single and multihop communication.
• Constraints on communication engines: Number of physical links
available for communication with the NoC.
7.3.5 Intermediate Format
MultiFlex relies on intermediate representations (IRs) to capture the applica-
tion, the constraints, and high-level platform descriptions. The topology of
the application—the block declaration and their connectivity—is expressed
using an XML-based intermediate format. It is also used to store task anno-
tations, such as the block execution semantics. Other block annotations are
used for the application profiling and block assignments. Edges are anno-
tated with the communication volume information.
The IR is designed to support the refinement of the application as it
is iteratively mapped to the platform. This implies supporting the mul-
tiple abstraction levels involved in the assignment and mapping process
described in the next sections.
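A minimal in-memory model of such an IR might carry the topology plus the annotations the tools need. The XML schema itself is not given in this chapter, so the field names below are assumptions chosen to match the annotations described above (semantic tags, profiling, communication volumes).

```python
# Hypothetical in-memory form of the XML-based IR: blocks carry a
# semantic tag (execution model) and profiling data; edges carry the
# communication volume; assignments are filled in during mapping.
ir = {
    "blocks": {
        "B1": {"semantics": "dataflow", "cycles": 70,  "pe": None},
        "B2": {"semantics": "dataflow", "cycles": 205, "pe": None},
        "B3": {"semantics": "server",   "cycles": 40,  "pe": None},
    },
    "edges": [
        {"src": "B1", "dst": "B2", "volume": 3136},
        {"src": "B2", "dst": "B3", "volume": 1280},
    ],
}

def predecessors(ir, block):
    """Blocks feeding `block` -- used by the list-based traversal,
    where a task is ready once all its predecessors are assigned."""
    return [e["src"] for e in ir["edges"] if e["dst"] == block]

preds = predecessors(ir, "B2")
```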
7.3.6 Model Assumptions and Distinctive Features
In this section, we provide more details about the streaming model. This
background information will help in explaining the mapping tools in the next
section.
The task specification includes the data type for each I/O port as well as
the maximum amount of data consumed or produced on these ports. This
information is an important characteristic of the application capture because
it is at the foundation of our streaming model: each task has a known com-
putation grain size. This means we know the amount of data required to
fire the process function of the task for a single iteration without starving
on input data, and we know the maximum amount of output data that can
be produced each time. This is a requirement for the nonblocking, or run-to-
completion execution of the task, which simplifies the scheduling and com-
munication infrastructure and reduces the system overhead. Finally, we can
quantify the computation requirements of each task for a single iteration.
The run-to-completion execution model allows dissociating the
scheduling of the tasks from the actual processing function, providing
clear scheduling points. Application developers focus on implementing and
optimizing the task functions (using the C language), and expressing the
functionality in a way that is natural for the application, without trying to
balance the task loads in the first place. This means each task can work on a
different data packet size and have different computation loads. The assign-
ment and scheduling of the tasks can be done in a separate phase (usually
performed later), allowing the exploration of mapping parameters, such
as the task assignment and the FIFO and buffer sizes, without changing
the functionality of the tasks: a basic principle enabling correct-by-construction
automated refinement.
The run-to-completion execution model is a compromise: it requires more
constrained programming but leads to higher flexibility in terms of mapping.
However, in certain cases we have no choice but to support multiple concurrent
execution contexts. We use cooperative threading to schedule special
tasks that use a mix of streaming and client–server constructs. Such tasks
are able to invoke remote services via client–server (DSOC) calls, including
synchronous methods (with return values) that cause the caller task to block,
waiting for an answer.
In addition, we are evaluating the pros and cons of supporting tasks with
unrestricted I/O and very fine-grain communication. To be able to eventu-
ally run several tasks of this nature on the same processor, we may need a
software kernel or make use of hardware threading if the underlying plat-
form provides it.
To be able to choose the correct scheduler to deploy on each PE, we have
introduced semantic tags, which describe the high-level behavior type of each
task. This information is stored in the IR. We have defined a small set of
task types, listed previously in Section 7.3.2. This allows a mix of execution
models and firing conditions, thus providing a rich programming environment.
Clear semantic tags ensure that the mapping tools can optimize the
scheduling and communications on each processor, rather than systematically
supporting all features and being designed for the worst case.
The nonblocking execution is only one characteristic of streaming com-
pared to our DSOC client–server message-passing programming model. As
opposed to DSOC, our streaming programming model does not provide data
marshaling (although, in principle, this could be integrated in the case of het-
erogeneous streaming subsystems).
When compared to asynchronous concurrent components, another dis-
tinction of the streaming model is the data-driven scheduling. In event-
based programming, asynchronous calls (of unknown size) can be generated
during the execution of a single reaction, and those must be queued. The
quantity of events may require complex triggering protocols to be defined
and implemented by the application programmer. This remains a well-known
drawback of event-based systems. With the data-flow approach, the
clear data-triggered execution semantics and the specification of I/O data
ports resolve the scheduling, memory management, and memory ownership
problems inherent in asynchronous remote method invocations.
Finally, another characteristic of our implementation of the streaming
programming model, which is also shared with our SMP and DSOC models,
is the fact that application code is reused “as is,” i.e., no source code trans-
formations are performed. We see two beneficial consequences of this com-
mon approach. In terms of debugging, it is an asset, since the programmer
can use a standard C source-level debugger, to verify the unmodified code
of the task core functions. The other main advantage is related to profiling.
Once again, it is relatively easy for an application engineer to understand
and optimize the task functions with a profiling report, because his source
code is untouched.
7.4 MultiFlex Streaming Mapping Tools
7.4.1 Task Assignment Tool
The main objective of the MpAssign tool (see Figure 7.5) is to assign applica-
tion blocks to processors while optimizing two objectives:
1. Balance the task load on all processors
2. Minimize the inter-processor communication load
(Figure 7.5 shows MpAssign taking the filter core C functions, the application constraints (profiling, communication volume), the user assignment directives, and the platform specification (number and types of PEs, communication resources and topology, storage resources), and producing an assignment of the application graph blocks B1–B5 to the processors PE1 and PE2.)
FIGURE 7.5
MpAssign tool.
The inter-processor communication cost is given by the data volume
exchanged between two processors, related to each task.
The tool receives as inputs the application capture, the application con-
straints, and the high-level platform specification. The output of the tool
is a set of assignment directives specifying which blocks are mapped on
each processor, the average load of each processor, and the cost for each
inter-processor communication. The lower portion of Figure 7.5 gives a
visual representation of the MpAssign output. The tool provides the visual
display of the resulting block assignments to processors.
The algorithm implemented in the MpAssign tool is inspired by
Marculescu's research [10] and is based on graph traversal approaches,
in which ready tasks with minimal cost are assigned iteratively.
The two main graph traversal approaches implemented in
MpAssign are
• The list-based approach, using mainly the breadth-first principle—a
task is ready if all its predecessors are assigned
• The path-based approach, using mainly the depth-first principle—a
task is ready if one predecessor is assigned and it is on the critical path
A cost estimator C(t, p) of assigning a task t to processor p is used. This cost
estimator is computed using the following equation:

C(t, p) = w1 * Cproc + w2 * Ccomm + w3 * Csucc   (7.1)

where
Cproc is the additional average processing cost required when the task t is
assigned to processor p
Ccomm is the communication cost required for the communication of task t
with the preceding tasks
Csucc represents a look-ahead cost concerning the successor tasks: the minimal
cost estimate of mapping a number of successor tasks (this assumes
state-space exploration for a predefined look-ahead depth)

Each weight wi is associated with one cost factor (Cproc, Ccomm, or Csucc)
and indicates the significance of that factor in the total cost C(t, p) as
compared with the other factors. The weights are set by the designer to
express their relative importance.
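Under Equation 7.1, a simplified list-based assignment loop might look as follows. This is a sketch under stated assumptions: the real MpAssign cost terms include NoC hop latencies and a look-ahead search (Csucc), which are omitted here, and all function and variable names are illustrative.

```python
def assign_list_based(blocks, edges, n_pe, w1=100, w2=100):
    """Greedy list scheduling: visit tasks in list (topological) order
    and place each one on the processor with minimal C(t, p)."""
    load = [0] * n_pe          # accumulated cycles per processor
    place = {}                 # block -> processor
    for name, cycles in blocks:
        best_pe, best_cost = None, None
        for p in range(n_pe):
            c_proc = load[p] + cycles          # processing cost term
            # Communication with already-assigned predecessors: volume
            # counts only when it crosses a processor boundary.
            c_comm = sum(v for (s, d, v) in edges
                         if d == name and s in place and place[s] != p)
            cost = w1 * c_proc + w2 * c_comm   # Equation 7.1 with w3 = 0
            if best_cost is None or cost < best_cost:
                best_pe, best_cost = p, cost
        place[name] = best_pe
        load[best_pe] += cycles
    return place, load

# Three blocks from the 3G transmitter chain, loads and volumes as in
# Figure 7.7 (order of traversal assumed topological).
blocks = [("tx_crc", 70), ("tx_vtc", 75), ("tx_rm", 40)]
edges = [("tx_crc", "tx_vtc", 260), ("tx_vtc", "tx_rm", 3136)]
place, load = assign_list_based(blocks, edges, n_pe=2)
```

With equal weights the communication term dominates for this small chain, so all three blocks end up on one processor; raising w1 relative to w2 (as in combination c1 of Section 7.5) pushes the heuristic toward splitting the load instead.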
7.4.2 Task Refinement and Communication Generation Tools
The main objective of the MpCompose tool (see Figure 7.6) is to generate one
application graph per PE, each graph containing the desired computation
blocks from the application, one local scheduler, and the required communi-
cation components. To perform this functionality, MpCompose requires the
following three inputs:
(Figure 7.6 shows MpCompose taking the application IR (with assignments), the filter core C functions, and the abstract communication services, and producing one graph per PE: each PE receives a scheduler shell with a control interface, local bindings implemented as buffers (e.g., LB1/2 between blocks B1 and B2 on PE1), and global bindings implemented as FIFOs (e.g., GB1/3 between B1 on PE1 and B3 on PE2).)
FIGURE 7.6
MpCompose tool.
• The application capture
• The platform description
• The set of directives, optionally generated by the MpAssign tool
The MpCompose tool relies on a library of abstract communication services
that provide different communication mechanisms that can be inserted in
the application graph. Three types of services are currently supported by
MpCompose:

1. Local bindings consisting mainly of a FIFO implemented with memory
buffers and enabling the intra-processor communication (e.g., block B1
is connected to block B2 via local buffer LB1/2).
2. Global binding FIFOs, which enable the inter-processor communication
(e.g., block B1 on PE1 communicates to block B3 on PE2 via external
buffers GB1/3).
3. A scheduler on each PE, which is configurable in terms of number and
types of blocks and which enables the sharing of a processor between
several application blocks.
A set of libraries is used to abstract part of the platform and to provide communication
and synchronization mechanisms (point-to-point communication,
semaphores, access to shared memory, access to I/O, etc.). The various FIFO
components have a default depth, but these are configuration values that can
be changed during the mapping. Since we support custom data types for I/O
port tokens, each element of a FIFO has a certain size that matches the data
type and maximum size specified in the intermediate format.
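The binding selection described above can be sketched as follows. This is a simplification: the actual MpCompose components also generate scheduler glue and hardware setup code, and the names and default depth below are assumptions.

```python
def make_binding(src_pe, dst_pe, token_type_size, max_tokens, depth=2):
    """Pick a local buffer or a global FIFO for one channel, sizing
    each FIFO element for the worst-case token burst of the channel
    (element size = data type size * maximum token count)."""
    elem_size = token_type_size * max_tokens   # bytes per FIFO element
    kind = "local_buffer" if src_pe == dst_pe else "global_fifo"
    return {"kind": kind, "elem_size": elem_size,
            "depth": depth, "bytes": elem_size * depth}

# Intra-processor channel: 16-bit tokens, up to 64 per firing.
lb = make_binding(src_pe=1, dst_pe=1, token_type_size=2, max_tokens=64)
# Inter-processor channel, with a deeper FIFO chosen during mapping.
gb = make_binding(src_pe=1, dst_pe=2, token_type_size=2, max_tokens=64,
                  depth=4)
```

The `depth` parameter models the configurable default FIFO depth mentioned above: it can be changed during mapping iterations without touching the task code.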
There is no global central controller: a local scheduler is created on each
processor. This component is the main controller and has access to the control
interface of all the components it is responsible for scheduling. The proper
control interface for each filter task is automatically added, based on the type
of filter specified in the application IR, and connected to the scheduler. The
implementations of the schedulers are partly generated, for example, the list
of filter tasks (a static list) and some setup code for the hardware communi-
cation accelerators are automatically created. The core scheduling function
can be pulled from a library or customized by the application programmer.
The output of MpCompose is a set of component descriptions, one for
each processor. From the point of view of the top-level component definitions,
these components are not connected together; however, communicating
processors use the platform-specific features to actually implement
the buffer-based communication at runtime. The set of independent component
definitions allows a monoprocessor component-based infrastructure to
be used for compilation.
7.4.3 Component Back-End Compilation
Starting from the set of processor graphs, the component back-end gener-
ates the targeted code that can run on the platform. MultiFlex tools currently
target the Fractal component model, and more specifically its C implemen-
tation [19]. Even though this toolset supports features such as a binding
controller and a life cycle manager to allow dynamic insertion–removal of
components in the graph at runtime, we are not currently using any of the
dynamic features of components, such as runtime elaboration, introspection,
etc., mainly for code size reasons. Nevertheless, we expect multimedia appli-
cation requirements to push toward this direction. Until then, we mainly use
the component model as a back-end to represent the software architecture to
be built on each processor. MpCompose generates one architecture (.fractal)
file describing the components and their topology for each CPU. The Fractal
tools will generate the required C glue code to bind components, to create
block instance structures and will compile all the code into an executable
for the specified processor by invoking the target cross-compiler. This build
process is invoked for each PE, thus producing a binary for each processor.
7.4.4 Runtime Support Components
The main services provided by the MultiFlex components at runtime are
scheduling and communication. The scheduler in fact controls both the com-
munication components and the application tasks.
The scheduler interleaves communication and processing at the block
level. For each input port, the scheduler checks whether data is available in the
local memory. If not, it checks the input FIFO; if the FIFO is not empty, the
scheduler orders it to perform the transfer into local memory. This is
typically done by some coprocessors such as DMA or specialized hardware
communication engines. While the transfer occurs, the scheduler can manage
other tasks. In the same manner, it can look for previously produced output
data ready to be transmitted from local memory to another processor, using
an output FIFO. Tasks with more dynamic (data-dependent) behaviors may
produce less data than their allowed maximum, including no data at all. If a
task is ready to execute, the scheduler simply calls its process function in the
same context. The user tasks make use of an API that is based on pointers,
thus we avoid data copies between the tasks and the local queues managed
by the scheduler.
In a nutshell, the run-to-completion model allows the scheduler to run
ready tasks and to manage the input and output data consumed or produced
by the tasks, while letting data transfers take place in parallel, thus overlapping
communication and processing without the need for threading. The tasks can
have different computation and communication costs: the mapping tools help
to balance the overall task load across processors, with the objective of keeping
the streaming fabric busy and the latency minimized.
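The scheduler loop described above can be sketched as a single-PE model. This is an illustrative simplification: on the real platform the FIFO-to-local-memory transfers are performed by DMA or hardware communication engines in parallel with processing, which is abstracted here as an immediate move, and all names are assumptions.

```python
def scheduler_step(tasks, in_fifo, local_in, outputs):
    """One pass of a run-to-completion scheduler: move data from each
    input FIFO into local memory when needed, then fire every ready
    task in the caller's context (no threading)."""
    fired = []
    for name, need in tasks:            # need = tokens per firing
        if len(local_in[name]) < need and in_fifo[name]:
            # Order a FIFO -> local-memory transfer (done by a DMA
            # engine on the platform; modeled here as an instant move).
            local_in[name].extend(in_fifo[name])
            in_fifo[name].clear()
        if len(local_in[name]) >= need:
            data = [local_in[name].pop(0) for _ in range(need)]
            outputs[name].append(sum(data))   # stand-in process()
            fired.append(name)
    return fired

tasks = [("f1", 2), ("f2", 3)]
in_fifo = {"f1": [1, 2], "f2": [5]}
local_in = {"f1": [], "f2": [4, 6]}
outputs = {"f1": [], "f2": []}
fired = scheduler_step(tasks, in_fifo, local_in, outputs)
```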
7.5 Experimental Results
7.5.1 3G Application Mapping Experiments
In this section, we present mapping results obtained with the MpAssign tool
on an application graph having the characteristics of a 3G WCDMA/FDD
base-station application from [13].
The block diagram of this application is presented in Figure 7.7 and
contains two main chains: transmitter (tx) and receiver (rx). The blocks
(Figure 7.7 shows the transmitter and receiver chains between the MAC and the radio interface. The annotated blocks and loads are tx_crc_0 and tx_crc_1 (70), tx_vtc_0 and tx_vtc_1 (75), tx_rm (40), tx_fi (80), tx_rfs (30), tx_si (80), and tx_sm (170) on the transmit side, and rx_rake (175), rx_rm (40), rx_rfa (30), rx_fi (80), rx_vtd_0 and rx_vtd_1 (205), and rx_crc_0 and rx_crc_1 (70) on the receive side; the edges are labeled a–d with the volumes given in the caption.)
FIGURE 7.7
3G application block diagram. The communication volumes are a = 260,
b = 3136, c = 1280, and d = 768.
are annotated with numbers that represent the estimated processing load,
while each edge has an estimated communication volume given in the figure
caption. These numbers are extracted from [13], where the computation cost
corresponds to a latency (in microseconds) for a PE to execute one iteration of
the corresponding functional block, while the edge cost corresponds to the
volume of data (in 16-bit words) transferred at each iteration between the
connected functional blocks.
A manual, static mapping of this application is presented in [13],
using a 2D mesh of 46 PEs, where each PE executes only one of the
functional blocks, some of which are duplicated to expose more
potential parallel processing. We use this example in this chapter mainly for
illustrative purposes, to show that MpAssign can be used to explore auto-
matically different mappings where, optionally, multiple functional blocks
can be mapped on the same PE to balance the processing load. To expose
more potential parallel processing, we create a set of functionally equivalent
application graphs of the above reference application in which we duplicate
the transmitter and receiver processing chains several times. In our experi-
ments, four versions have been explored:
• v1: 1 transmitter and 1 receiver (original reference application)
• v2: 2 transmitters and 2 receivers
• v3: 3 transmitters and 3 receivers
• v4: 4 transmitters and 4 receivers
Version v1 will be mapped on a 16-processor architecture (v1/16); version v2
on a 16-processor architecture (v2/16) and a 32-processor architecture (v2/32);
version v3 on a 32-processor architecture (v3/32) and a 48-processor
architecture (v3/48); and version v4 on a 48-processor architecture (v4/48).
This results in six different mapping configurations (v1/16, v2/16, v2/32,
v3/32, v3/48, v4/48) to explore.
For the experiments, we suppose that each PE can execute any of the
functional blocks, and that the NoC connecting all the PEs is the
STMicroelectronics Spidergon [18].
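Since the communication statistics below weigh each edge by the length of its NoC route, the hop count on the Spidergon matters. The following is a minimal sketch of that distance, assuming minimal routing over the ring and cross links; the function and its name are ours, not part of MpAssign.

```python
def spidergon_hops(src, dst, n):
    """Minimum hop count between two nodes of an n-node Spidergon NoC.

    Each node i is linked to its ring neighbours ((i - 1) mod n and
    (i + 1) mod n) and to the diametrically opposite node ((i + n/2) mod n).
    n must be even.
    """
    assert n % 2 == 0
    d = (dst - src) % n
    # clockwise walk, counterclockwise walk, or one cross hop
    # followed by the remaining ring walk
    return min(d, n - d, 1 + abs(d - n // 2))

# On a 16-node Spidergon, node 0 reaches node 8 in one cross hop,
# and node 5 via the cross link (0 -> 8) plus three ring hops.
```

For instance, `spidergon_hops(0, 5, 16)` evaluates to 4 under this model.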
As described in section “Task Assignment Tool,” our mapping heuris-
tic allows exploring different solutions in order to find a good compromise
between communication and PE load balancing. These different solutions
can be obtained by varying the parameters w1, w2, and w3 (see Equation 7.1).
A high value of w1 promotes solutions with good load balancing, while a
high value of w2 promotes solutions with minimal communications. The
parameter w3, which favors the selection based on an optimistic look-ahead
search, will be fixed at 100. For our experiments three combinations of w1
and w2 will be studied:
• c1 (w1 = 1000, w2 = 10): This weight combination tends to maximize
the load balancing.
• c2 (w1 = 100, w2 = 100): This weight combination tends to balance
load and communications.
• c3 (w1 = 10, w2 = 1000): This weight combination tends to minimize
the communications.
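To illustrate how such weight combinations steer an assignment heuristic, here is a toy greedy list-assignment. This is our own simplified stand-in, not the actual MpAssign algorithm or its Equation 7.1: each task, heaviest first, goes to the PE with the lowest weighted cost.

```python
def greedy_assign(tasks, edges, n_pe, w1, w2):
    """Toy stand-in for a weighted list-assignment heuristic.

    tasks: {name: load}; edges: {(a, b): volume}; returns {name: pe}.
    Each task is placed on the PE that minimises
        w1 * (load of that PE after placement)
      + w2 * (volume exchanged with tasks already placed on other PEs).
    """
    pe_load = [0] * n_pe
    placed = {}
    for t in sorted(tasks, key=tasks.get, reverse=True):  # heaviest first
        def cost(pe):
            comm = sum(v for (a, b), v in edges.items()
                       if (a == t and b in placed and placed[b] != pe)
                       or (b == t and a in placed and placed[a] != pe))
            return w1 * (pe_load[pe] + tasks[t]) + w2 * comm
        best = min(range(n_pe), key=cost)
        placed[t] = best
        pe_load[best] += tasks[t]
    return placed
```

With a large w1 the heuristic spreads load; with a large w2 it co-locates heavily communicating tasks, mirroring the c1/c3 extremes above.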
Each of the six configurations described above will be tested with these three
weight parameter combinations, which results in a total of 18 experiments.
For each experiment, we will extract the following statistics:
• Load variance (LV), given by Equation 7.2, where for each mapping
solution x, load(PE_i) is the sum of the task costs assigned to PE_i,
avgload is the average load defined by the sum of all task costs divided
by the number of PEs, and p is the number of PEs.

LV(x) = \sqrt{\frac{\sum_{i=0}^{p-1} \left( load(PE_i) - avgload \right)^2}{p}}    (7.2)

• Maximal load (ML), defined as max(load(PE_i)), where 0 ≤ i ≤ p − 1.
• Total communication (TC), given by the sum of each edge cost times
the NoC distance of the route related to that edge.
• Maximal communication (MC), the maximum communication cost
found between any two PEs.
The LV statistic gives an approximation of the quality of the load balancing.
The ML statistic is related to the lower bound of the application performance
after mapping, since the application throughput depends on the slowest PE
processing. The MC statistic also gives a lower bound on the application
performance, but this time with respect to the worst case of communication
contention (instead of with respect to processing, as for ML). Finally, the
TC indicator gives an approximation of the quality of the communication
mapping.
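The four statistics defined above can be computed directly from a mapping. The sketch below is ours (helper names included), with the NoC distance supplied by the caller as a function:

```python
from math import sqrt

def mapping_stats(tasks, edges, mapping, n_pe, hops):
    """Compute LV, ML, TC, and MC for one mapping solution.

    tasks: {task: load}; edges: {(a, b): volume}; mapping: {task: pe};
    hops(p, q): NoC distance between two PEs (supplied by the caller).
    """
    load = [0] * n_pe
    for t, pe in mapping.items():
        load[pe] += tasks[t]
    avg = sum(load) / n_pe
    lv = sqrt(sum((l - avg) ** 2 for l in load) / n_pe)   # Equation 7.2
    ml = max(load)
    # TC: each edge volume weighted by the NoC distance of its route
    tc = sum(v * hops(mapping[a], mapping[b]) for (a, b), v in edges.items())
    # MC: heaviest aggregate traffic between any single pair of PEs
    pair = {}
    for (a, b), v in edges.items():
        p, q = mapping[a], mapping[b]
        if p != q:
            pair[(p, q)] = pair.get((p, q), 0) + v
    mc = max(pair.values(), default=0)
    return lv, ml, tc, mc
```

A perfectly balanced two-PE mapping yields LV = 0, while TC and MC grow with the volumes routed between PEs.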
Figure 7.8 shows the resulting LV statistic of the different application con-
figurations and mapping weight combinations. The best results are given by
the mapping weight combination c1. This is predictable because c1 promotes
solutions with a good load balancing, which means a low LV value.
Figure 7.9 presents the resulting ML statistics of the different application
configurations and mapping weight combinations. Following the same logic
as with Figure 7.8, the best results here are given by the mapping weight
combination c1.
Figure 7.10 presents the resulting TC statistic of the different application
configurations and mapping weight combinations. This time, the best results
are given by the mapping weight combination c3. This is predictable because
c3 promotes solutions with low communication costs.
Figure 7.11 presents the resulting MC statistic for the different application
configurations and mapping weight combinations. Contrary to Figure 7.10,
[Bar chart, y-axis 0–400, one group of bars (c1, c2, c3) per configuration v1/16–v4/48.]
FIGURE 7.8
Load variance.
[Bar chart, y-axis 0–1500, one group of bars (c1, c2, c3) per configuration v1/16–v4/48.]
FIGURE 7.9
Maximal load (at any PE).
[Bar chart, y-axis 0–35000, one group of bars (c1, c2, c3) per configuration v1/16–v4/48.]
FIGURE 7.10
Total communications.
[Bar chart, y-axis 0–8000, one group of bars (c1, c2, c3) per configuration v1/16–v4/48.]
FIGURE 7.11
Maximal communication (through any PE).
the best results are given by the mapping weight combination c2. Since the
tool does not try to optimize this statistic, it appears that when optimizing
either for load balancing or for TC, the obtained solution may still exhibit
worst-case communication contention at a PE. This is one aspect of the
mapping heuristics that needs improvement.
These results show that the selection of the final task assignment solution
really depends on the target performance, architecture hardware budget,
and acceptable bandwidth for communications. Nevertheless, the MpAs-
sign tool, by generating an interesting subset of mapping solutions, allows
architects to concentrate on the more detailed and time consuming analysis,
rather than trying to find task assignment solutions. At this level, the various
costs remain estimates based on platform and application abstractions and
assumptions. For a candidate solution, the refinement can continue down to
the target architecture level.
7.5.2 Refinement and Simulation
For a given solution, MpAssign provides an output text file that contains
the task assignment directives, the resulting average load of each processor
and the cost for each inter-processor communication. The mapping results
are also available in a graphical representation. For the purpose of a simpler
display, we have created a mapping example of only eight processors
that is shown in Figure 7.12. We see that the dataflow source and sink blocks
have been assigned to dedicated I/O processors (the first and the last,
respectively), thus keeping the data-intensive tasks on the remaining six PEs.
The intent was to isolate on different processors the interesting task set that
we will want to profile later on, during the simulation.
Starting from the mapping solution presented in Figure 7.12, MpCom-
pose uses those task assignment directives to perform the software synthesis.
Final compilation is carried out by the component back-end.
[Figure annotations: geometric location corresponds to PE assignment (here PE2); block height is proportional to task load (here 80 μs); PE load variance: 50 (out of 275 average); connectivity between PEs (here from PE2 to PE6).]
FIGURE 7.12
Sample output of MpAssign for a single solution.
The next step is the execution of the application on an instrumented vir-
tual platform for performance analysis. For each processor, we obtain the
function profiling report. By adding up the time spent in the task process
functions versus the time spent in the scheduler or waiting for input data to
arrive, we obtain the effective processor utilization. Task code optimization
can be done orthogonally with the profiling report, in the same way as in a
monoprocessor flow. However, in the MultiFlex flow, the user can update
the IR with new task profiling information and rerun MpAssign to see how
this can influence the suggested task assignments.
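The utilization computation described above amounts to dividing useful task time by total time. A sketch follows, assuming a naming convention for the scheduler and wait entries of the profiling report; the convention and the sample report are ours, not the tool's.

```python
def effective_utilization(profile):
    """Effective utilisation of one PE from a profiling report.

    profile: {function: cumulative_time}. Entries whose names mark
    scheduler or input-wait time (an assumed naming convention)
    count as overhead; everything else is useful task processing.
    """
    overhead = sum(t for f, t in profile.items()
                   if f.startswith(('sched_', 'wait_')))
    total = sum(profile.values())
    return (total - overhead) / total if total else 0.0

# Hypothetical per-PE report: 960 time units in task process functions,
# 240 in the scheduler and waiting for input data.
report = {'tx_fi_process': 640, 'tx_rm_process': 320,
          'sched_loop': 120, 'wait_input_fifo': 120}
print(round(effective_utilization(report), 2))  # → 0.8
```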
The simulations on the virtual platform should additionally provide NoC
bandwidth and contention information, given by the instrumented links and
routers.
Our multicore virtual platform is currently under development, and an
accurate STNoC model is in the process of being integrated. Meanwhile,
a fast and functional interconnect implementation is used instead.
7.6 Conclusions
The increasing need for flexibility in multimedia SoCs for consumer
applications is leading to a new class of programmable, multiprocessor
solutions. The high computation and data bandwidth requirements of these
applications pose new challenges in the expression of the applications,
the platform architectures to support them and the application-to-platform
mapping tools.
In this chapter, we elaborated on these challenges and introduced the
MP-SoC platform mapping technologies under development at STMicroelec-
tronics called MultiFlex, with emphasis on assignment and scheduling of
streaming applications. The use of these tools, integrated in a design flow
that proposes a stepwise mapping refinement, was illustrated with a 3G
base-station application example.
While keeping the application code unmodified, we have seen how the
MultiFlex tools refine an IR of the application, based on information related
to the application properties, the high-level platform characteristics, as well
as user mapping constraints.
The MpAssign tool provides mapping solutions that minimize (1) the
communications on the NoC and (2) the processing Load Variance. By chang-
ing the weight factors, the user can direct the heuristics to favor different
classes of solutions.
For a chosen solution, the MpCompose tool provides the required intra-
and inter-processor communications and local task schedulers, by instantiat-
ing generic components with the proper specific configuration.
Finally, for each processor, a self-contained component description gen-
erated by MpCompose can be given to the component compilation back-end,
which takes care of implementing the low-level component glue code and
invoking the compiler for final compilation and linking, ready for simula-
tion and analysis.
This methodology supports a user-driven, iterative approach that auto-
mates the mechanical mapping transformations. Performance and profil-
ing information obtained from a given platform mapping iteration can be
exploited by the user and the mapping tools to guide the next optimization
cycle.
7.6.1 Outlook
We are currently looking at alternative approaches for the implementation
of the MpAssign tool, such as evolutionary algorithms, to cope with the
scalability problem of the list-based heuristics presented in this chapter.
In fact, if we want to add further optimization goals to the algorithm, such
as minimizing memory usage or power, it will become difficult to implement
an efficient list-based algorithm.
Other areas of research include looking at how we can support finer
grain streaming with scalar-based I/O (similar to StreamIt [3]), mixed with
our buffer-based dataflow approach. We are also evaluating the dynamic
elaboration features of the component framework, which could extend our
methodology for runtime application deployment.
References

1. G. De Micheli and L. Benini, Networks on Chip: Technology and Tools,
Morgan Kaufmann, San Francisco, CA, 2006.

2. I. Buck, Brook: A Streaming Programming Language, available online.

3. W. Thies, M. Karczmarek, and S.P. Amarasinghe, StreamIt: A language
for streaming applications, in Proceedings of the International Conference on
Compiler Construction, Grenoble, France, April 2002.

4. G. Berry, P. Couronne, and G. Gonthier, Synchronous Programming of
Reactive Systems: An Introduction to ESTEREL, Elsevier, Amsterdam, the
Netherlands, 1988, pp. 35–56.

5. W.W. Wadge and E.A. Ashcroft, Lucid, the Data-Flow Programming
Language, Academic Press, New York, 1985.

6. J.T. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt, Ptolemy: A framework
for simulating and prototyping heterogeneous systems, Journal of Computer
Simulation, special issue on "Simulation Software Development," 4, 155–182,
April 1994.

7. The MathWorks: Matlab and Simulink for technical computing, available
online.

8. Data marshalling, available online.

9. P. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, and D. Lyonnard,
A multi-processor SoC platform and tools for communications applications,
in Embedded Systems Handbook, CRC Press, Boca Raton, FL, 2004.

10. J. Hu and R. Marculescu, Energy-aware communication and task
scheduling for network-on-chip architectures under real-time constraints,
in Proceedings of DATE 2004, Paris, France.

11. M. Paganini, Nomadik®: A mobile multimedia application processor
platform, in Proceedings of ASP-DAC (Asia and South Pacific Design
Automation Conference), Yokohama, Japan, January 2007, pp. 749–750.

12. P.G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, D. Lyonnard,
O. Benny, B. Lavigueur, D. Lo, G. Beltrame, V. Gagné, and G. Nicolescu,
Parallel programming models for a multi-processor SoC platform applied
to networking and multimedia, IEEE Transactions on VLSI Journal, 14(7),
July 2006, 667–680.

13. D. Wiklund and D. Liu, Design, mapping, and simulations of a 3G
WCDMA/FDD base station using network on chip, in Proceedings of the Fifth
International Workshop on System-on-Chip for Real-Time Applications, Banff,
Canada, July 2005, pp. 252–256.

14. ARM Cortex-A9 MPCore, available online at …/products/CPUs/ARMCortex-A9_MPCore.html

15. D.R. Butenhof, Programming with POSIX Threads, Addison-Wesley,
Reading, MA, 1997.

16. R. Ben-Natan, CORBA: A Guide to Common Object Request Broker
Architecture, McGraw-Hill, New York, 1995.

17. Thuan L. Thai, Learning DCOM, O'Reilly, Sebastopol, CA, 1999.

18. M. Coppola, Spidergon STNoC: The Communication Infrastructure
for Multiprocessor Architecture, MPSoC 2008, available online.

19. M. Leclercq, O. Lobry, E. Özcan, J. Polakovic, and J.B. Stefani, THINK
C implementation of Fractal and its ADL tool-chain, 5th Fractal Workshop
at ECOOP 2006, Nantes, France, July 2006.

20. P. Paulin, Emerging Challenges for MPSoC Design, MPSoC 2006,
available online.

21. The MIPS32® 34K™ processor overview, available online at
http://www.mips.com/products/processors/32-64-bit-cores/mips32-34k/

22. P. Magarshack and P. Paulin, System-on-chip beyond the nanometer
wall, in Proceedings of Design Automation Conference, Anaheim, CA, 2003,
pp. 419–424.

23. E. Bruneton, T. Coupaye, and J.B. Stefani, The Fractal Component
Model, 2004, specification available online at …/specification/fractal-specification.pdf
8
Retargetable, Embedded Software Design Methodology for Multiprocessor-Embedded Systems

Soonhoi Ha
CONTENTS
8.1 Introduction 207
8.2 Related Work 209
8.3 Proposed Workflow of Embedded Software Development 211
8.4 Common Intermediate Code 212
8.4.1 Task Code 212
8.4.2 Architecture Information File 214
8.5 CIC Translator 215
8.5.1 Generic API Translation 216
8.5.2 HW-Interfacing Code Generation 217
8.5.3 OpenMP Translator 217
8.5.4 Scheduling Code Generation 218
8.6 Preliminary Experiments 220
8.6.1 Design Space Exploration 220
8.6.2 HW-Interfacing Code Generation 221
8.6.3 Scheduling Code Generation 223
8.6.4 Productivity Analysis 224
8.7 Conclusion 227
References 228
8.1 Introduction
This chapter is an updated version of the following paper: S. Kwon, Y. Kim,
W. Jeun, S. Ha, and Y. Paek, A retargetable parallel-programming framework
for MPSoC, ACM Transactions on Design Automation of Electronic Systems
(TODAES), Vol. 13, No. 3, Article 39, July 2008.

As semiconductor and communication technologies continue to improve,
we can build very powerful embedded hardware by integrating many
processing elements, so that a system with multiple processing elements
integrated in a single chip, called an MPSoC (multiprocessor system on
chip), is becoming popular. While extensive research has been performed
on the design methodology of MPSoC, most efforts have focused on the
hardware architecture. But the real bottleneck will be software design, as pre-
verified hardware platforms tend to be reused in platform-based designs.
Unlike application software running on a general purpose computing
system, embedded software is not easy to debug at run time. Furthermore,
software failure may not be tolerated in safety-critical applications. So the
correctness of the embedded software should be guaranteed at compile time.
Embedded software design is very challenging since it amounts to
parallel programming for nontrivial heterogeneous multiprocessors with
diverse communication architectures and design constraints such as hard-
ware cost, power, and timeliness. Two major models for parallel pro-
gramming are the message-passing and shared address-space models. In
the message-passing model, each processor has private memory and com-
municates with other processors via message-passing. To obtain high
performance, the programmer should optimize data distribution and data
movement carefully, which are very difficult tasks. The message-passing
interface (MPI) [1] is a de facto standard interface of this model. In the shared
address-space model, all processors share a memory and communicate data
through this shared memory. The OpenMP [2] is a de facto standard inter-
face of this model. It is mainly used for a symmetric multiprocessor (SMP)
machine. Because OpenMP makes it easy to write a parallel program, there is
work such as Sato et al. [3], Liu and Chaudhary [4], Hotta et al. [5], and Jeun
and Ha [6] that considers the OpenMP as a parallel programming model
on other parallel-processing platforms without shared address space such as
system-on-chips and clusters.
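The contrast between the two models can be sketched in a few lines. This is a toy illustration in plain Python threads, not MPI or OpenMP code: message passing moves a value through explicit send/receive operations on private state, while the shared address-space style guards one shared location with a lock.

```python
import threading, queue

q = queue.Queue()          # message passing: an explicit communication channel
shared = {'x': 0}          # shared address space: one memory cell...
lock = threading.Lock()    # ...guarded by a lock

def producer():
    q.put(21)                 # send a token from private state

def consumer(out):
    out.append(q.get() * 2)   # receive it explicitly and use it

def writer():
    with lock:                # synchronise access to the shared memory
        shared['x'] = 42

out = []
threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer, args=(out,)),
           threading.Thread(target=writer)]
for t in threads: t.start()
for t in threads: t.join()
print(out[0], shared['x'])  # → 42 42
```

In the first style, data distribution and movement are the programmer's explicit concern; in the second, correctness hinges on synchronization of the shared state, which is exactly the trade-off discussed above.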
While an MPI or OpenMP program is regarded as retargetable with
respect to the number of processors and processor type, we consider it as
not retargetable with respect to task partition and architecture change since
the programmer should manually optimize the parallel code considering the
specific target architecture and design constraints. If the task partition or
communication architecture is changed, significant coding effort is needed
to rewrite the optimized code. Another difficulty of programming with MPI
and OpenMP is that it is the programmer’s responsibility to confirm the satis-
faction of the design constraints, such as memory requirements and real-time
constraints, in the manually designed code.
The current practice of parallel-embedded software is multithreaded pro-
gramming with lock-based synchronization, considering all target specific
features. The same application should be rewritten if the target is changed.
Moreover, it is well known that debugging and testing a multithreaded pro-
gram is extremely difficult. Another effort of parallel programming is to use
a parallelizing compiler that creates a parallel program from a sequential C
code. But automatic parallelization of a C code has been successful only for
a limited class of applications after a long period of extensive research [7].
In order to increase the design productivity of embedded software, we
propose a novel methodology for embedded software design based on a
parallel programming model, called a common intermediate code (CIC). In
a CIC, the functional and data parallelism of application tasks are specified
independent of the target architecture and design constraints. Information on
the target architecture and the design constraints is separately described in
an xml-style file, called the architecture information file. Based on this informa-
tion, the programmer maps tasks to processing components either manually
or automatically. Then, the CIC translator automatically translates the task
codes in the CIC model into the final parallel code, following the partition-
ing decision. If a new partitioning decision is made, the programmer need
not modify the task codes, only the partitioning information. The CIC trans-
lator automatically generates the newly optimized code from the modified
architecture information file.
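As a rough illustration of this separation between task code and the architecture information file, the sketch below parses a hypothetical xml-style mapping description. The tags and attributes are invented for this example; the real CIC schema differs.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape of an architecture information file; the actual
# CIC format is not reproduced in the text, so this structure is ours.
ARCH_XML = """
<architecture>
  <processor id="0" type="ARM"/>
  <processor id="1" type="DSP"/>
  <mapping task="tx_fi" proc="0"/>
  <mapping task="rx_fi" proc="1"/>
</architecture>
"""

def read_mapping(xml_text):
    """Return {task: processor_type} from the architecture description.

    A translator could consume such a table to retarget the unchanged
    task codes whenever only the partitioning information changes.
    """
    root = ET.fromstring(xml_text)
    procs = {p.get('id'): p.get('type') for p in root.iter('processor')}
    return {m.get('task'): procs[m.get('proc')] for m in root.iter('mapping')}

print(read_mapping(ARCH_XML))  # → {'tx_fi': 'ARM', 'rx_fi': 'DSP'}
```

Changing only this file, and not the task codes, is what makes the approach retargetable with respect to partitioning decisions.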
Thus the proposed CIC programming model is truly retargetable with
respect to architecture change and partitioning decisions. Moreover, the CIC
translator alleviates the programmer’s burden to optimize the code for the
target architecture. If we develop the code manually, we have to redesign the
hardware-dependent part whenever hardware is changed because of hard-
ware upgrade or platform change. When the lifetime of an embedded sys-
tem is long, the maintenance of embedded software is very challenging since
there will be no old hardware when maintenance is required. In case the life-
time is too short, the hardware platform will change frequently. Automatic
code generation will remove such overhead of the software redesign. Thus
we increase the design productivity of parallel-embedded software through
the proposed methodology.
8.2 Related Work
Martin [8] emphasized the importance of a parallel programming model for
MPSoC to overcome the difficulty of concurrent programming. Conventional
MPI or OpenMP programming is not adequate since the program should
be made target specific for a message-passing or shared address-space
architecture. To be suitable for design space exploration, a programming
model needs to accommodate both styles of architecture. Recently Paulin
et al. [9] proposed the MultiFlex multiprocessor SoC programming environ-
ment, where two parallel programming models are supported, namely, dis-
tributed system object component (DSOC) and SMP models. The DSOC is a
message-passing model that supports heterogeneous distributed computing
while the SMP supports concurrent threads accessing the shared memory.
Nonetheless it is still the burden of the programmer to consider the target
architecture when programming the application; thus it is not fully retar-
getable. On the other hand, we propose here a fully retargetable program-
ming model.
To be retargetable, the interface code between tasks should be automat-
ically generated after a partitioning decision on the target architecture is
made. Since the interfacing between the processing units is one of the most
important factors that affect system performance, some research has focused
on this interfacing (including HW–SW components). Wolf et al. [10] defined
a task-transaction-level (TTL) interface for integrating HW–SW components.
In the logical model for TTL intertask communication, a task is connected
to a channel via a port, and it communicates with other tasks through chan-
nels by transferring tokens. In this model, tasks call target-independent TTL
interface functions on their ports to communicate with other tasks. If the TTL
interface functions are defined optimally for each target architecture, the pro-
gram becomes retargetable. This approach can be integrated in the proposed
framework.
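The logical TTL model described above, with tasks writing and reading tokens on ports connected to channels, can be sketched as follows. This is a toy illustration; the class and method names are ours, not the actual TTL API.

```python
from queue import Queue

class Channel:
    """Minimal token channel in the spirit of a TTL-style interface.

    Tasks call only write/read on their ports; how the transfer is
    realised on a given target (shared memory, NoC transactions, ...)
    stays hidden behind these target-independent calls.
    """
    def __init__(self, depth):
        self._q = Queue(maxsize=depth)

    def write(self, token):   # blocks when the channel is full
        self._q.put(token)

    def read(self):           # blocks when the channel is empty
        return self._q.get()

ch = Channel(depth=4)
for tok in range(3):
    ch.write(tok)             # producer task side
print([ch.read() for _ in range(3)])  # → [0, 1, 2]
```

Retargetability then reduces to providing an optimal implementation of these interface functions for each target architecture.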
For retargetable interface code generation, Jerraya et al. [11] proposed a
parallel programming model to abstract both HW and SW interfaces. They
defined three layers of SW architecture: hardware abstraction layer (HAL),
hardware-dependent software (HdS), and multithreaded application. To
interface between software and hardware, translation to application
programming interfaces (APIs) of different abstraction models should be
performed. This work is complementary to ours.
Compared with related work, the proposed approach has the following
characteristics that make it more suitable for an MPSoC architecture:
1. We specifically concentrate on the retargetability of the software devel-
opment framework and suggest CIC as a parallel programming model.
The main idea of CIC is the separation of the algorithm specification and
its implementation. CIC consists of two sections: the tasks codes and the
architecture information file. An application programmer writes the code for all
tasks considering the potential parallelism of the application itself, inde-
pendent of the target architecture. Based on the target architecture, we
determine how to exploit the parallelism in implementation.
2. We use different ways of specifying functional and data parallelism
(or loop parallelism). Data parallelism is usually implemented by an
array of homogeneous processors or a hardware accelerator, different
from functional parallelism. By considering different implementation
practices, we use different specification and optimization methods for
functional and data parallelism.
3. Also, we explicitly specify the potential use of a hardware accelerator
inside a task code using a pragma definition. If the use of a hardware
accelerator is decided after design space exploration, the task code will
be modified by a preprocessor, replacing the code segment contained
within the pragma section by the appropriate HW interfacing code. Oth-
erwise, the pragma definition will be ignored. Thus the use of hardware
accelerators can be determined without code rewriting, which makes
design space exploration easier.
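The pragma mechanism just described can be illustrated with a toy preprocessor that rewrites pragma-delimited segments. The pragma syntax and function names here are invented for illustration; the text does not give the actual CIC form.

```python
import re

# Hypothetical task-code fragment with a pragma-marked HW candidate.
SRC = """\
#pragma hw_candidate idct
for (i = 0; i < 64; i++) out[i] = idct_sw(in[i]);
#pragma end_hw_candidate
"""

def retarget(src, use_hw):
    """Replace each pragma-delimited segment with HW interfacing code,
    or strip the pragmas and keep the SW body when no accelerator is used."""
    pat = re.compile(
        r'#pragma hw_candidate (\w+)\n(.*?)#pragma end_hw_candidate\n',
        re.S)
    if use_hw:
        # emit a call into the (hypothetical) accelerator interface
        return pat.sub(lambda m: f'hw_{m.group(1)}_run(in, out);\n', src)
    return pat.sub(lambda m: m.group(2), src)  # keep the SW version

print(retarget(SRC, use_hw=True))  # emits the accelerator call
```

Either way the original task code is untouched, which is what lets the accelerator decision change during design space exploration without code rewriting.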