Model-Based Design for Embedded Systems- P26 potx

Nicolescu/Model-Based Design for Embedded Systems 67842_C008 Finals Page 216 2009-10-14
216 Model-Based Design for Embedded Systems
the necessary information needed for each translation step. Based on the
task-dependency information that tells how to connect the tasks, the
translator determines the number of intertask communication channels.
Based on the period and deadline information of tasks, the run-time sys-
tem is synthesized. With the memory map information of each processor,
the translator defines the shared variables in the shared region.
To support a new target architecture in the proposed workflow, we have
to add translation rules for the generic API to the translator, build a
target-specific OpenMP translator for data-parallel tasks, and apply the
generation rule for task-scheduling code tailored to the target OS. Each
step of the CIC translator is explained in this section.
8.5.1 Generic API Translation
Since the CIC task code uses generic APIs for target-independent specifi-
cation, the translation of generic APIs to target-dependent APIs is needed.
If the target processor has an OS installed, generic APIs are translated into
OS APIs; otherwise, they are translated into communication APIs that are
defined by directly accessing the hardware devices. We implement the OS
API library and communication API library, both optimized for each target
architecture.
For most generic APIs, API translation is achieved by simple redefinition
of the API function. Figure 8.6a shows an example where the translator
replaces the MQ_RECEIVE API with a “read_port” function for a target
processor with pthread support. The read_port function is defined using
pthread APIs and the memcpy C library function. However, some APIs
need additional treatment. For example, the READ API needs different
function prototypes depending on the target architecture, as illustrated in
Figure 8.6b. Maeng et al. [14] presented a rule-based translation technique
that is general enough to translate any API if the translation rule is defined
in a pattern-list file.

Generic API:
    MQ_RECEIVE(port_id, buf, size);
Translated code:
    int read_port(int channel_id, unsigned char *buf, int len) {
        ...
        pthread_mutex_lock(channel_mutex);
        ...
        memcpy(buf, channel->start, len);
        ...
        pthread_mutex_unlock(channel_mutex);
    }
(a)
Generic API:
    file = OPEN("input.dat", O_RDONLY);
    READ(file, data, 100);
    CLOSE(file);
Translated code (C standard I/O):
    #include <stdio.h>
    file = fopen("input.dat", "r");
    fread(data, 1, 100, file);
    fclose(file);
Translated code (POSIX file I/O):
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    file = open("input.dat", O_RDONLY);
    read(file, data, 100);
    close(file);
(b)
FIGURE 8.6
Examples of generic API translation: (a) MQ_RECEIVE operation, (b) READ
operation.
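
As an illustration of translation by redefinition, the following minimal
sketch maps the generic OPEN, READ, and CLOSE calls of Figure 8.6b onto
either POSIX file I/O or C standard I/O with preprocessor macros. The switch
TARGET_HAS_POSIX_IO, the type file_t, the RDONLY and OPEN_FAILED macros, and
the helper read_first_bytes() are hypothetical names introduced for this
sketch. It is not the CIC translator's actual mechanism, only one way such a
redefinition could look.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical switch: define TARGET_HAS_POSIX_IO when the target offers
   POSIX file I/O; otherwise the C standard I/O library is used. */
#ifdef TARGET_HAS_POSIX_IO
#include <fcntl.h>
#include <unistd.h>
typedef int file_t;
#define RDONLY             O_RDONLY
#define OPEN(path, flags)  open((path), (flags))
#define READ(f, buf, n)    ((long)read((f), (buf), (n)))
#define CLOSE(f)           close(f)
#define OPEN_FAILED(f)     ((f) < 0)
#else
typedef FILE *file_t;
#define RDONLY             "rb"
#define OPEN(path, flags)  fopen((path), (flags))
#define READ(f, buf, n)    ((long)fread((buf), 1, (n), (f)))
#define CLOSE(f)           fclose(f)
#define OPEN_FAILED(f)     ((f) == NULL)
#endif

/* Task code written only against the generic API.
   Returns the number of bytes read, or -1 if the file cannot be opened. */
long read_first_bytes(const char *path, unsigned char *buf, size_t n)
{
    assert(path != NULL && buf != NULL);
    file_t f = OPEN(path, RDONLY);
    if (OPEN_FAILED(f))
        return -1;
    long got = READ(f, buf, n);
    CLOSE(f);
    return got;
}
```

The task code in read_first_bytes() never names a target-specific call, so
retargeting only changes which branch of the macro definitions is compiled.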
8.5.2 HW-Interfacing Code Generation
If there is a code segment contained within a HW pragma section and its
translation rule exists in an architecture information file, the CIC translator
replaces the code segment with the HW-interfacing code, considering the
parameters of the HW accelerator and buffer variables that are defined in
the architecture section of the CIC. The translation rule of HW-interfacing
code for a specific HW is separately specified as a HW-interface library code.
Note that some HW accelerators work together with other HW IPs.
For example, a HW accelerator may notify the processor of its completion
through an interrupt; in this case an interrupt controller is needed. The CIC
translator generates a combination of the HW accelerator and interrupt con-
troller, as shown in the next section.
8.5.3 OpenMP Translator
If an OpenMP compiler is available for the target, task code with
OpenMP directives can be compiled directly. Otherwise, we need to
translate the task code with OpenMP directives into explicitly parallel
code. Note that we do not need a general OpenMP translator, since we use
OpenMP directives only to specify data-parallel CIC tasks. However, we have
to build a separate OpenMP translator for each target architecture in order
to achieve optimal performance.
For a distributed memory architecture, we developed an OpenMP trans-
lator that translates an OpenMP task code to the MPI codes using a minimal
subset of the MPI library for the following reasons: (1) MPI is a standard that
is easily ported to various software platforms. (2) Porting the MPI library is
much easier than modifying the OpenMP translator itself for the new target
architecture. Figure 8.7 shows the structure of the translated MPI program.
As shown in the figure, the translated code has a master–worker
structure: the master processor executes the entire code, while the worker
processors execute only the parallel region. When the master processor
reaches the parallel region, it broadcasts the shared data to the worker
processors. Then, all processors execute the parallel region concurrently.
At the end of the parallel loop, the master processor synchronizes all the
processors and collects the results from the worker processors. For
performance optimization, we have to minimize the amount of interprocessor
communication.
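
To make the master–worker structure concrete, the sketch below shows one
way the iterations of a parallel loop could be divided among the processors
before each of them works in the parallel region. The static block
partition, the names range_t, block_range(), and scale_slice(), and the
convention that rank 0 is the master are assumptions made for this
illustration; the source does not state how the translator distributes
iterations. In a translated program, the broadcast and the collection of
results around such a loop would be carried out with MPI calls, for example
MPI_Bcast and MPI_Send/MPI_Recv.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) assigned to one processor. */
typedef struct {
    int begin;
    int end;
} range_t;

/* Static block partition of n loop iterations over nprocs processors.
   The first (n % nprocs) ranks receive one extra iteration, so range sizes
   differ by at most one. Rank 0 is assumed to be the master, which also
   takes a share of the parallel region. */
range_t block_range(int n, int nprocs, int rank)
{
    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    int base  = n / nprocs;
    int extra = n % nprocs;
    range_t r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end   = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}

/* Work done by one processor inside the parallel region: each rank touches
   only its own slice of the shared array. */
void scale_slice(int *data, int n, int nprocs, int rank, int factor)
{
    range_t r = block_range(n, nprocs, rank);
    for (int i = r.begin; i < r.end; i++)
        data[i] *= factor;
}
```

With 10 iterations and 4 processors, the ranges are [0, 3), [3, 6), [6, 8),
and [8, 10), so no element is processed twice and none is skipped.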
[Figure 8.7: timeline with one master processor and three worker processors.
Recoverable labels: Initialize; Work alone; Parallel region start; BCast
share data; Work in parallel region; Send shared data (workers); Receive &
update (master); Parallel region end.]
FIGURE 8.7
The workflow of translated MPI codes. (From Kwon, S. et al., ACM Trans.
Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)
8.5.4 Scheduling Code Generation
The last step of the proposed CIC translator is to generate the
task-scheduling code for each processor core. Many tasks may be mapped to
each processor, with different real-time constraints and dependency
information. Recall that a task code is defined by three functions:
“{task name}_init(),” “{task name}_go(),” and “{task name}_wrapup().” The
generated scheduling code initializes the mapped tasks by calling
“{task name}_init()” and wraps them up after the scheduling loop finishes
its execution by calling “{task name}_wrapup().”
The main body of the scheduling code differs depending on whether an OS is
available for the target processor. If there is a POSIX-compliant OS, we
generate thread-based scheduling code, as shown in Figure 8.8a. A POSIX
thread is created for each task (lines 17 and 18) with an assigned priority
level if available. The thread, as shown in lines 3 to 5, executes the main
body of the task, “{task name}_go(),” and schedules itself based on its
timing constraints by calling the “sleep()” function. If the OS is not
POSIX-compliant, the CIC translator should be extended to generate the
OS-specific scheduling code.
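
The self-scheduling pattern of lines 3 to 5 in Figure 8.8a (run the task
body, then sleep until the next period) can be sketched in plain POSIX C as
follows. This is a minimal sketch, not the generated code: it swaps the
relative “sleep()” of the figure for clock_nanosleep() with an absolute
wake-up time, which avoids accumulating drift, and the names
periodic_loop(), add_ms(), demo_go(), and demo_runs are hypothetical. Error
handling (for example, a sleep interrupted by a signal) is omitted. In
thread-based scheduling code, such a loop would run inside the thread
started by pthread_create().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Advance an absolute timestamp by ms milliseconds, keeping tv_nsec
   normalized to the range [0, 1e9). */
void add_ms(struct timespec *t, long ms)
{
    t->tv_sec  += ms / 1000;
    t->tv_nsec += (ms % 1000) * 1000000L;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_sec  += 1;
        t->tv_nsec -= 1000000000L;
    }
}

/* Call go() once per period, `rounds` times in total. After each call the
   loop sleeps until the next absolute release time instead of sleeping for
   a computed remainder. */
void periodic_loop(void (*go)(void), long period_ms, int rounds)
{
    assert(go != NULL && period_ms > 0 && rounds >= 0);
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < rounds; i++) {
        go();                      /* task body, i.e. {task name}_go() */
        add_ms(&next, period_ms);  /* next release time */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}

/* Hypothetical task body used only to demonstrate the loop. */
int demo_runs = 0;
void demo_go(void) { demo_runs++; }
```

Calling periodic_loop(demo_go, 1, 3) runs the task body three times, one
millisecond apart.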
If there is no available OS for the target processor, the translator should
synthesize a run-time scheduler that schedules the mapped tasks. The CIC
translator generates a data structure for each task, containing the three
main functions of the task (“init(), go(), and wrapup()”). With this data
structure, a real-time scheduler is synthesized by the CIC translator.
Figure 8.8b shows the pseudocode of the generated scheduling code. The
generated scheduling code may be changed by replacing the function
“void scheduler()” or “int get_next_task()” to support another scheduling
algorithm.

 1. void *thread_task_0_func(void *argv) {
 2.   ...
 3.   task_0_go();
 4.   get_time(&time);
 5.   sleep(task_0->next_period - time);  // sleep for the remaining time
 6.   ...
 7. }
 8. int main() {
 9.   ...
10.   pthread_t thread_task_0;
11.   sched_param thread_task_0_param;
12.   ...
13.   thread_task_0_param.sched_priority = 0;
14.   pthread_attr_setschedparam(..., &thread_task_0_param);
15.   ...
16.   task_init();    /* {task_name}_init() functions are called */
17.   pthread_create(&thread_task_0,
18.       &thread_task_0_attr, thread_task_0_func, NULL);
19.   ...
20.   task_wrapup();  /* {task_name}_wrapup() functions are called */
21. }
(a)
    typedef struct {
        void (*init)();
        int (*go)();
        void (*wrapup)();
        int period, priority, ...;
    } task;
    task taskInfo[] = { {task1_init, task1_go, task1_wrapup, 100, 0}
                      , {task2_init, task2_go, task2_wrapup, 200, 0} };

    void scheduler() {
        while (all_task_done() == FALSE) {
            int taskId = get_next_task();
            taskInfo[taskId].go();
        }
    }

    int main() {
        init();       /* {task_name}_init() functions are called */
        scheduler();  /* scheduler code */
        wrapup();     /* {task_name}_wrapup() functions are called */
        return 0;
    }
(b)
FIGURE 8.8
Pseudocode of generated scheduling code: (a) if OS is available, and (b) if OS
is not available. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst.,
13, Article 39, July 2008. With permission.)
8.6 Preliminary Experiments
An embedded software development framework based on the proposed
methodology, named HOPES, is under development. While it allows the use
of any model for the initial specification, the current implementation uses
the PeaCE model, which is the model used in the PeaCE hardware–software
codesign environment for multimedia embedded systems design [15]. To verify
the viability of the proposed methodology, we built a virtual prototyping
system, based on the Carbon SoC Designer [16], that consists of multiple
arm926ej-s subsystems connected to each other through a shared bus, as shown
in Figure 8.9. The H.263 decoder depicted in Figure 8.3 is used for the
preliminary experiments.
8.6.1 Design Space Exploration
We specified the functional parallelism of the H.263 decoder with six tasks,
as shown in Figure 8.3, where each task is assigned an index. For data
parallelism, the data-parallel region of the motion compensation task is
specified with an OpenMP directive. In this experiment, we explored the
design space of parallelizing the algorithm, considering both functional and
data parallelism simultaneously. As is evident in Figure 8.3, tasks 1 to 3
can be executed in parallel; thus, they are mapped to multiple processors in
three configurations, as shown in Table 8.1. For example, in the second
configuration, task 1 is mapped to processor 1 and the other tasks are
mapped to processor 0.
[Figure 8.9: block diagram. Recoverable labels: two groups, each with
Arm926ej-s, Interrupt ctrl., Local mem., HW1, and HW2; plus HW3 and Shared
memory.]
FIGURE 8.9
The target architecture for preliminary experiments. (From Kwon, S. et
al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With
permission.)
TABLE 8.1
Task Mapping to Processors
                The Configuration of Task Mapping
Processor Id    1                   2                   3
0               Task 0, Task 1,     Task 0, Task 2,     Task 0, Task 3,
                Task 2, Task 3,     Task 3, Task 4,     Task 4, Task 5
                Task 4, Task 5      Task 5
1               N/A                 Task 1              Task 1
2               N/A                 N/A                 Task 2
Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39,
July 2008. With permission.
TABLE 8.2
Execution Cycles for Nine Configurations
The Number of Processors    The Configuration of Task Mapping
for Data-Parallelism        1              2              3
No OpenMP                   158,099,172    146,464,503    146,557,779
2                           167,119,458    152,753,214    153,127,710
4                           168,640,527    154,159,995    155,415,942
Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39,
July 2008. With permission.
For each configuration of task mapping, we parallelized task 4 using one,
two, and four processors. As a result, we prepared nine configurations
in total, as listed in Table 8.2. In the proposed framework, each
configuration is specified simply by changing the task-mapping information
in the architecture information file. The CIC translator generates the
executable C code automatically.
Table 8.2 shows the performance results for these nine configurations. For
functional parallelism, the best performance is obtained by using two
processors, as reported in the first row (the “No OpenMP” case). The H.263
decoder algorithm uses 4:1:1 format frames, so the computation for Y
macroblock decoding is about four times larger than that for the U and V
macroblocks. Therefore, macroblock decoding of U and V can be merged onto
one processor while macroblock decoding of Y runs on another processor. No
performance gain is obtained by exploiting data parallelism, because the
computation workload of motion compensation is not large enough to outweigh
the communication overhead incurred by parallel execution.
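
The comparison can be quantified directly from Table 8.2. The short sketch
below computes the relative change in execution cycles with respect to the
single-processor baseline (configuration 1, no OpenMP). The function name
cycle_change_percent() is hypothetical; only the cycle counts come from the
table.

```c
#include <assert.h>

/* Relative change in execution cycles versus a baseline, in percent.
   Negative values mean fewer cycles (faster) than the baseline. */
double cycle_change_percent(double baseline_cycles, double cycles)
{
    assert(baseline_cycles > 0.0);
    return 100.0 * (cycles - baseline_cycles) / baseline_cycles;
}
```

With the values from Table 8.2, cycle_change_percent(158099172, 146464503)
is about -7.4, so the two-processor functional mapping saves roughly 7.4% of
the cycles. By contrast, cycle_change_percent(158099172, 167119458) is about
+5.7, so data parallelism on two processors costs roughly 5.7% more cycles
than the baseline.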
8.6.2 HW-Interfacing Code Generation
Next, we accelerated the inverse discrete cosine transform (IDCT) code
segment in the macroblock decoding tasks (task 1 to task 3) with a HW
accelerator, as shown in Figure 8.10a. We use the RealView SoC designer to
model the entire system including the HW accelerator. Two kinds of IDCT
accelerator are used: one uses an interrupt signal for completion
notification, and the other uses polling to detect completion. The latter is
specified in the architecture section as illustrated in Figure 8.10b, where
the library name of the HW-interfacing code is set to IDCT_slave and its
base address to 0x2F000000.

    #pragma hardware IDCT (output.data, input.data) {
        /* code segments for IDCT */
    }
(a)
    <hardware>
      <name>IDCT</name>
      <protocol>IDCT_slave</protocol>
      <param>0x2F000000</param>
    </hardware>
(b)
    <hardware>
      <name>IDCT</name>
      <protocol>IDCT_interrupt</protocol>
      <param>0x2F000000</param>
    </hardware>
    <hardware>
      <name>IRQ_CONTROLLER</name>
      <protocol>irq_controller</protocol>
      <param>0xA801000</param>
    </hardware>
(c)
FIGURE 8.10
(a) Code segment wrapped with HW pragma and architecture section
information of IDCT, (b) when interrupt is not used, and (c) when interrupt
is used. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13,
Article 39, July 2008. With permission.)
Figure 8.11a shows the assigned address map of the IDCT accelerator and
Figure 8.11b shows the generated HW-interfacing code. This code is sub-
stituted for the code segment contained within a HW pragma section. In
Figure 8.11b, bold letters are changeable according to the parameters spec-
ified in a task code and in the architecture information file; they specify the
base address for the HW interface data structure and the input and output
port names of the associated CIC task.
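
The word indices used in Figure 8.11b follow from the byte offsets of
Figure 8.11a divided by the 4-byte register width; for example, offset 64 is
idct_base[16] and offset 192 is idct_base[48]. As a minimal sketch, the same
map can be written as a C structure so that the compiler checks the offsets.
The type idct_regs_t, its field names, and the helper idct_load_and_start()
are hypothetical and assume 32-bit registers; they are not part of the
generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register layout mirroring the address map of Figure 8.11a.
   Field names are hypothetical; each register is assumed 32 bits wide. */
typedef struct {
    volatile uint32_t semaphore;     /* offset   0, read : HW lock          */
    volatile uint32_t start;         /* offset   4, write: IDCT start       */
    volatile uint32_t complete;      /* offset   8, read : complete flag    */
    volatile uint32_t clear;         /* offset  12, write: IDCT clear       */
    uint32_t          reserved[12];  /* offsets 16..63, not in the map      */
    volatile uint32_t input[32];     /* offsets 64..191, write: input data  */
    volatile uint32_t output[32];    /* offsets 192..319, read: output data */
} idct_regs_t;

/* Compile-time checks that the struct matches the documented offsets. */
_Static_assert(offsetof(idct_regs_t, start)    == 4,   "start offset");
_Static_assert(offsetof(idct_regs_t, complete) == 8,   "complete offset");
_Static_assert(offsetof(idct_regs_t, clear)    == 12,  "clear offset");
_Static_assert(offsetof(idct_regs_t, input)    == 64,  "input offset");
_Static_assert(offsetof(idct_regs_t, output)   == 192, "output offset");
_Static_assert(sizeof(idct_regs_t)             == 320, "total size");

/* Copy one block of 32 words into the input window and raise the start
   signal, as lines 4 and 5 of Figure 8.11b do. */
void idct_load_and_start(idct_regs_t *regs, const uint32_t *in)
{
    assert(regs != NULL && in != NULL);
    for (int i = 0; i < 32; i++)
        regs->input[i] = in[i];
    regs->start = 1;
}
```

On the target, a pointer to this structure would be obtained by casting the
base address, for example (idct_regs_t *)0x2F000000. On a host machine, an
ordinary idct_regs_t variable can stand in for the device when testing.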
Note that the interfacing code uses polling at line 6 of Figure 8.11b. If we
use the accelerator with an interrupt, an interrupt controller is
additionally attached to the target platform, as shown in Figure 8.10c, with
information on the code library name, IRQ_CONTROLLER, and its base address,
0xA801000. The new IDCT accelerator has the same address map as the previous
one, except for the complete flag. The address of the complete flag
(address 8 in Figure 8.11a) is assigned to “interrupt clear.”

Address (Offset)    I/O Type    Comment
0                   Read        Semaphore
4                   Write       IDCT start
8                   Read        Complete flag
12                  Write       IDCT clear
64 to 191           Write       Input data
192 to 319          Read        Output data
(a)
1. int i;
2. volatile unsigned int *idct_base = (volatile unsigned int *) 0x2F000000;
3. while (idct_base[0] == 1);  // try to obtain hardware resource
4. for (i = 0; i < 32; i++) idct_base[i+16] = ((unsigned int *)(input.data))[i];
5. idct_base[1] = 1;           // send start signal to IDCT accelerator
6. while (idct_base[2] == 0);  // wait for completion of IDCT operation
7. for (i = 0; i < 32; i++) ((unsigned int *)(output.data))[i] = idct_base[i+48];
8. idct_base[3] = 1;           // clear and unlock hardware
(b)
FIGURE 8.11
(a) The address map of IDCT, and (b) its generated interfacing code. (From
Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July
2008. With permission.)
Figure 8.12a shows the generated interfacing code for the IDCT with
interrupt. Note that the interfacing code does not access the HW to check
the completion of the IDCT, but checks the variable “complete.” In the
generated code of the interrupt handler, this variable is set to 1
(Figure 8.12b). The initialization code for the interrupt controller
(“initDevices()”) is also generated and called in the “{task_name}_init()”
function.
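
The handshake between the interfacing code and the interrupt handler rests
on a single shared flag. The sketch below models it on a host machine: the
handler is an ordinary function that a test calls by hand, and the names
idct_begin(), idct_irq_handler(), and idct_finished() are hypothetical.
Unlike the pseudocode of Figure 8.12, the flag is declared
volatile sig_atomic_t; this is an assumption about what a production version
would need so that the compiler rereads the flag on every check.

```c
#include <assert.h>
#include <signal.h>

/* Completion flag shared between task code and the interrupt handler.
   volatile forces a fresh read on each check; sig_atomic_t keeps each
   access indivisible with respect to asynchronous interruption. */
static volatile sig_atomic_t complete = 0;

/* Task side: clear the flag before the accelerator is started. */
void idct_begin(void)
{
    complete = 0;
    /* a real driver would now write the input data and the start signal */
}

/* Handler side: in generated code this would run as the IRQ handler after
   the interrupt sources have been cleared. Here it is a plain function. */
void idct_irq_handler(void)
{
    complete = 1;
}

/* Task side: non-blocking form of the check `while (complete == 0);`. */
int idct_finished(void)
{
    return complete != 0;
}
```

The task code would call idct_begin(), start the accelerator, and then wait
until idct_finished() returns nonzero, which happens only after the handler
has run.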
8.6.3 Scheduling Code Generation
We generated the task-scheduling code of the H.263 decoder while changing
the working conditions, OS support, and scheduling policy. First, we used
the eCos real-time OS for arm926ej-s in the RealView SoC designer and
generated the scheduling code, the pseudocode of which is shown in
Figure 8.13. In the eCos function cyg_user_start(), each task is created as
a thread. The CIC translator generates the parameters needed for thread
creation, such as the stack variable and the stack size (the fifth and sixth
parameters of cyg_thread_create()). Moreover, we placed “{task_name}_go” in
a while loop inside the created thread (lines 10 to 14 of Figure 8.13). The
function {task_name}_init() is called in init_task().
Note that TE_main() is also created as a thread. TE_main() checks
whether execution of all tasks is finished, and calls “{task_name}_wrapup()”
in wrapup_task() before finishing the entire program.
    int complete;
    ...
    volatile unsigned int *idct_base = (volatile unsigned int *) 0x2F000000;
    while (idct_base[0] == 1);  // try to obtain hardware resource
    complete = 0;
    for (i = 0; i < 32; i++) idct_base[i+16] = ((unsigned int *)(input.data))[i];
    idct_base[1] = 1;           // send start signal to IDCT accelerator
    while (complete == 0);      // wait for completion of IDCT operation
    for (i = 0; i < 32; i++) ((unsigned int *)(output.data))[i] = idct_base[i+48];
    idct_base[3] = 1;           // clear and unlock hardware
(a)
    extern int complete;
    __irq void IRQ_Handler() {
        IRQ_CLEAR();            // interrupt clear of interrupt controller
        idct_base[2] = 1;       // interrupt clear of IDCT
        complete = 1;
    }
    void initDevices() {
        IRQ_INIT();             // initialization of interrupt controller
    }
(b)
FIGURE 8.12
(a) Interfacing code for the IDCT with interrupt, and (b) the interrupt handler
code. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article
39, July 2008. With permission.)
For a processor without OS support, the current CIC translator supports
two kinds of scheduling code: default and rate-monotonic scheduling (RMS).
The default scheduler simply maintains the execution frequency of the tasks
in proportion to the ratio of their periods. Figure 8.14a and b show the
pseudocode of the function get_next_task(), which is called in the function
scheduler() of Figure 8.8b, for the default and RMS schedulers,
respectively.
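
The two selection policies can be sketched in C as follows. This is one
possible reading of the pseudocode steps, not the generated code: the
structure sched_task_t, its fields, the global tick counter, and the
function names get_next_task_default() and get_next_task_rms() are
hypothetical, and how a task becomes “executable” is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task bookkeeping for the no-OS scheduler. */
typedef struct {
    int period;      /* task period, in scheduler time units        */
    int executable;  /* nonzero when the task is ready to run       */
    int time_count;  /* accumulated periods (default policy only)   */
    int last_run;    /* tick of the most recent selection           */
} sched_task_t;

static int tick = 0; /* incremented on every selection */

/* Default policy: among executable tasks choose the smallest time count;
   ties go to the task that has waited longest. The period is then added
   to the selected task's time count. Returns -1 if nothing is executable. */
int get_next_task_default(sched_task_t *t, int n)
{
    assert(t != NULL && n >= 0);
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!t[i].executable) continue;
        if (best < 0 || t[i].time_count < t[best].time_count ||
            (t[i].time_count == t[best].time_count &&
             t[i].last_run < t[best].last_run))
            best = i;
    }
    if (best >= 0) {
        t[best].time_count += t[best].period;
        t[best].last_run = ++tick;
    }
    return best;
}

/* RMS policy: among executable tasks choose the smallest period. */
int get_next_task_rms(sched_task_t *t, int n)
{
    assert(t != NULL && n >= 0);
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!t[i].executable) continue;
        if (best < 0 || t[i].period < t[best].period)
            best = i;
    }
    if (best >= 0)
        t[best].last_run = ++tick;
    return best;
}
```

With two always-executable tasks of periods 100 and 200, the default policy
selects the first task twice as often as the second, which is the period
ratio the text describes. The RMS policy always prefers the executable task
with the shortest period.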
8.6.4 Productivity Analysis
For the productivity analysis, we recorded the elapsed time to manually
modify the software (including debugging time) when we change the target
architecture and task mapping. Such manual modification was performed by
an expert programmer who is a PhD student.
For a fair comparison of automatic code generation and manual-coding
overhead, we made the following assumptions. First, the application task
codes are prepared and functionally verified. We chose an H.263 decoder as
the application code, which consists of six tasks, as illustrated in
Figure 8.3. Second, the simulation environment is completely prepared for
the initial configuration, as shown in Figure 8.15a. We chose the RealView
SoC designer as the target simulator and prepared two different kinds of HW
IPs for the IDCT function block. Third, the software environment for the
target system is prepared, which includes the run-time scheduler and the
target-dependent API library.

 1. void cyg_user_start(void) {
 2.   cyg_thread_create(taskInfo[0]->priority, TE_task_0,
 3.       (cyg_addrword_t)0, "TE_task_0", (void *)&TaskStk[0],
 4.       TASK_STK_SIZE-1, &handle[0], &thread[0]);
 5.   ...
 6.   init_task();
 7.   cyg_thread_resume(handle[0]);
 8.   ...
 9. }
10. void TE_task_0(cyg_addrword_t data) {
11.   while (!finished)
12.     if (this task is executable) taskInfo[0]->go();
13.     else cyg_thread_yield();
14. }
15. void TE_main(cyg_addrword_t data) {
16.   while (1)
17.     if (all_task_is_done()) {
18.       wrapup_task();
19.       exit(1);
20.     }
21. }
FIGURE 8.13
Pseudocode of an automatically generated scheduler for eCos. (From Kwon,
S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With
permission.)

    int get_next_task() {
        a. find executable tasks
        b. find the tasks that have the smallest value of time count
        c. select the task that has not been executed for the longest time
        d. add period to the time count of the selected task
        e. return the selected task id
    }
(a)
    int get_next_task() {
        a. find executable tasks
        b. select the task that has the smallest period
        c. update task information
        d. return the selected task id
    }
(b)
FIGURE 8.14
Pseudocode of “get_next_task()” without OS support: (a) default, and (b)
RMS scheduler. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst.,
13, Article 39, July 2008. With permission.)
