8
Code Optimization
• Optimization techniques for code efficiency
• Intrinsic C functions
• Parallel instructions
• Word-wide data access
• Software pipelining
In this chapter we illustrate several schemes that can be used to optimize and
drastically reduce the execution time of your code. These techniques include the
use of instructions in parallel, word-wide data, intrinsic functions, and software
pipelining.
8.1 INTRODUCTION
Begin at a workstation level; for example, use C code on a PC. While code written
in assembly (ASM) is processor-specific, C code can readily be ported from one plat-
form to another. However, optimized ASM code runs faster than C and requires
less memory space.
Before optimizing, make sure that the code is functional and yields correct
results. After optimization, code can be so reorganized and resequenced that it
becomes difficult to follow. Keep in mind that if a C-coded algorithm is functional
and its execution speed is satisfactory, there is no need to optimize further.
DSP Applications Using C and the TMS320C6x DSK. Rulph Chassaing
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-20754-3 (Hardback); 0-471-22112-0 (Electronic)

After testing the functionality of your C code, port it to the C6x platform.
A floating-point implementation can be modeled first, then converted to a fixed-
point implementation if desired. If the performance of the code is not adequate, use
different compiler options to enable software pipelining (discussed later), reduce
redundant loops, and so on. If the desired performance is still not achieved, you can
use loop unrolling to avoid branching overhead. This generally improves execution
speed but increases code size. You can also use word-wide optimization, loading/
accessing 32-bit word (int) data rather than 16-bit half-word (short) data, and
then processing the lower and upper 16-bit halves independently.
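The half-word splitting behind word-wide access can be sketched in portable C. This is a host-side illustration, not C6x code: the helper names (lo16, hi16, dotp_wordwide) are ours, and packing two 16-bit samples per 32-bit word mirrors what an LDW followed by MPY/MPYH achieves on the C6x.

```c
#include <stdint.h>

/* Extract the 16 LSBs and 16 MSBs of a 32-bit word as signed samples. */
int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Sum of products with one 32-bit access per array element pair:
   lower and upper halves are multiplied and accumulated separately. */
int dotp_wordwide(const uint32_t *a, const uint32_t *b, int nwords)
{
    int suml = 0, sumh = 0, i;
    for (i = 0; i < nwords; i++) {
        suml += lo16(a[i]) * lo16(b[i]);  /* products of lower halves */
        sumh += hi16(a[i]) * hi16(b[i]);  /* products of upper halves */
    }
    return suml + sumh;                   /* final combine outside loop */
}
```

With one word holding samples (1, 2) and the other (3, 4), the result is 1·3 + 2·4 = 11.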
If performance is still not satisfactory, you can rewrite the time-critical section of
the code in linear assembly, which can be optimized by the assembler optimizer. The
profiler can be used to determine the specific function(s) that need to be optimized
further.
The final optimization procedure that we discuss is a software pipelining
scheme to produce hand-coded ASM instructions [1,2]. It is important to follow the
procedure associated with software pipelining to obtain an efficient and optimized
code.
8.2 OPTIMIZATION STEPS
If the performance and results of your code are satisfactory after any particular step,
you are done.
1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate as well as the various optimization
levels.
3. Use the profiler to determine/identify the function(s) that may need to be
further optimized. Then convert these function(s) to linear ASM.
4. Optimize code in ASM.
8.2.1 Compiler Options
When the optimizer is invoked, the following steps are performed. A C-coded
program is first passed through a parser that performs preprocessing functions and
generates an intermediate file (.if), which becomes the input to the optimizer. The
optimizer generates an .opt file, which becomes the input to the code generator;
the code generator performs further optimizations and produces an ASM file.
The options:
1. –o0 optimizes the use of registers.
2. –o1 performs a local optimization in addition to optimizations performed by
the previous option: –o0.
3. –o2 performs a global optimization in addition to the optimizations per-
formed by the previous options: –o0 and –o1.
4. –o3 performs a file optimization in addition to the optimizations performed
by the three previous options: –o0, –o1, and –o2.
The options –o2 and –o3 attempt to perform software pipelining.
8.2.2 Intrinsic C Functions
There are a number of available C intrinsic functions that can be used to increase
the efficiency of code (see also Example 3.1):
1. int _mpy() has the equivalent ASM instruction MPY, which multiplies the
16 LSBs of one number by the 16 LSBs of another number.
2. int _mpyh() has the equivalent ASM instruction MPYH, which multiplies the
16 MSBs of one number by the 16 MSBs of another number.
3. int _mpylh() has the equivalent ASM instruction MPYLH, which multiplies
the 16 LSBs of one number by the 16 MSBs of another number.
4. int _mpyhl() has the equivalent ASM instruction MPYHL, which multiplies
the 16 MSBs of one number by the 16 LSBs of another number.
5. void _nassert(int) generates no code. It tells the compiler that the
expression declared with the assert function is true. This conveys information
to the compiler about the alignment of pointers and arrays, enabling valid
optimization schemes such as word-wide optimization.
6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits
of a double word, respectively (available on the C67x or C64x).
8.3 PROCEDURE FOR CODE OPTIMIZATION
1. Use instructions in parallel so that multiple functional units can be operated
within the same cycle.
2. Eliminate NOPs or delay slots, placing code where the NOPs are.
3. Unroll the loop to avoid overhead with branching.
4. Use word-wide data to access a 32-bit word (int) in lieu of a 16-bit half-word
(short).
5. Use software pipelining, illustrated in Section 8.5.
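Step 3 above, loop unrolling, can be sketched in portable C (a host-side illustration; the function names are ours). The unrolled version performs the same computation with half as many loop-end tests and branches, at the cost of a larger loop body:

```c
/* Straightforward (rolled) sum: one loop-end test per element. */
int sum_rolled(const short *x, int n)
{
    int s = 0, i;
    for (i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by 2: half the branch overhead; assumes n is even,
   as with the chapter's 200-element arrays. */
int sum_unrolled2(const short *x, int n)
{
    int s0 = 0, s1 = 0, i;
    for (i = 0; i < n; i += 2) {
        s0 += x[i];        /* even-index terms */
        s1 += x[i + 1];    /* odd-index terms  */
    }
    return s0 + s1;
}
```

Both functions return the same sum; only the loop overhead differs.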
8.4 PROGRAMMING EXAMPLES USING CODE OPTIMIZATION TECHNIQUES
Several examples are developed to illustrate various techniques to increase the effi-
ciency of code. Optimization using software pipelining is discussed in Section 8.5.
The dot product is used to illustrate the various optimization schemes. The dot
product of two arrays can be useful for many DSP algorithms, such as filtering
and correlation. The examples that follow assume that each array consists of 200
numbers. Several programming examples using mixed C and ASM code, which
provide necessary background, were given in Chapter 3.
Example 8.1: Sum of Products with Word-Wide Data Access for
Fixed-Point Implementation Using C Code (twosum)
Figure 8.1 shows the C code twosum.c, which obtains the sum of products of two
arrays accessing 32-bit word data. Each array consists of 200 numbers. Separate
sums of products of even and odd terms are calculated within the loop. Outside the
loop, the final summation of the even and odd terms is obtained.
For a floating-point implementation, the function and the variables sum, suml,
and sumh in Figure 8.1 are declared as float in lieu of int:
float dotp(float a[], float b[])
{
    float suml, sumh, sum;
    int i;
    ...
}
//twosum.c Sum of products with separate accumulation of even/odd terms
//with word-wide data for fixed-point implementation
int dotp(short a[], short b[])
{
    int suml, sumh, sum, i;
    suml = 0;
    sumh = 0;
    sum = 0;
    for (i = 0; i < 200; i += 2)
    {
        suml += a[i] * b[i];          //sum of products of even terms
        sumh += a[i + 1] * b[i + 1];  //sum of products of odd terms
    }
    sum = suml + sumh;                //final sum of even and odd terms
    return (sum);
}
FIGURE 8.1. C code for sum of products using word-wide data access for separate accumulation of even and odd sum-of-products terms (twosum.c).
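As a quick host-side check, the split even/odd accumulation of Figure 8.1 returns the same value as a straightforward dot product, since integer addition here is associative and commutative. The reference function dotp_ref is ours, added only for comparison:

```c
/* dotp() as in Figure 8.1 (twosum.c): even and odd terms accumulated
   separately, combined once outside the loop. */
int dotp(short a[], short b[])
{
    int suml = 0, sumh = 0, i;
    for (i = 0; i < 200; i += 2) {
        suml += a[i] * b[i];          /* even terms */
        sumh += a[i + 1] * b[i + 1];  /* odd terms  */
    }
    return suml + sumh;
}

/* Straightforward reference dot product over the same 200 elements. */
int dotp_ref(short a[], short b[])
{
    int sum = 0, i;
    for (i = 0; i < 200; i++)
        sum += a[i] * b[i];
    return sum;
}
```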
Example 8.2: Separate Sum of Products with C Intrinsic Functions
Using C Code (dotpintrinsic)
Figure 8.2 shows the C code dotpintrinsic.c to illustrate the separate sum of
products using two C intrinsic functions, _mpy and _mpyh, which have the
equivalent ASM instructions MPY and MPYH, respectively. Whereas the even and odd
sum of products are calculated within the loop, the final summation is taken outside
the loop and returned to the calling function.
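_mpy and _mpyh are C6x compiler intrinsics and do not exist on a host compiler. The sketch below emulates their documented behavior (16 LSBs × 16 LSBs, and 16 MSBs × 16 MSBs) so the loop of Figure 8.2 can be exercised off-target; the emulation functions emu_mpy and emu_mpyh are our own, not TI-provided.

```c
#include <stdint.h>

/* Host emulations of the C6x intrinsics used in dotpintrinsic.c:
   emu_mpy models _mpy/MPY (16 LSBs x 16 LSBs),
   emu_mpyh models _mpyh/MPYH (16 MSBs x 16 MSBs). */
int emu_mpy(int x, int y)  { return (int16_t)x * (int16_t)y; }
int emu_mpyh(int x, int y) { return (int16_t)(x >> 16) * (int16_t)(y >> 16); }

/* Two packed 16-bit samples per int, as in Example 8.2
   (100 words carry 200 samples). */
int dotp_packed(const int *a, const int *b, int nwords)
{
    int suml = 0, sumh = 0, i;
    for (i = 0; i < nwords; i++) {
        suml += emu_mpy(a[i], b[i]);   /* even terms (lower halves) */
        sumh += emu_mpyh(a[i], b[i]);  /* odd terms (upper halves)  */
    }
    return suml + sumh;
}
```

With one word packing (2, 5) against another packing (3, 7), the result is 2·3 + 5·7 = 41.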
Example 8.3: Sum of Products with Word-Wide Access for Fixed-Point
Implementation Using Linear ASM Code (twosumlasmfix.sa)
Figure 8.3 shows the linear ASM code twosumlasmfix.sa, which obtains two
separate sums of products for a fixed-point implementation using linear ASM code.
It is not necessary to specify either the functional units or NOPs. Furthermore, sym-
bolic names can be used for registers. The LDW instruction is used to load a 32-bit
word-wide data value (which must be word-aligned in memory when using LDW).
Lower and upper 16-bit products are calculated separately. The two ADD instruc-
tions accumulate separately the even and odd sum of products.
//dotpintrinsic.c Sum of products with C intrinsic functions using C
for (i = 0; i < 100; i++)
{
    suml = suml + _mpy(a[i], b[i]);
    sumh = sumh + _mpyh(a[i], b[i]);
}
return (suml + sumh);
FIGURE 8.2. Separate sum of products using C intrinsic functions (dotpintrinsic.c).
;twosumlasmfix.sa Sum of Products. Separate accum of even/odd terms
;With word-wide data for fixed-point implementation using linear ASM
loop: LDW *aptr++, ai ;32-bit word ai
LDW *bptr++, bi ;32-bit word bi
MPY ai, bi, prodl ;lower 16-bit product
MPYH ai, bi, prodh ;higher 16-bit product
ADD prodl, suml, suml ;accum even terms
ADD prodh, sumh, sumh ;accum odd terms
SUB count, 1, count ;decrement count
[count] B loop ;branch to loop
FIGURE 8.3. Separate sum of products using linear ASM code for fixed-point implementation (twosumlasmfix.sa).
Example 8.4: Sum of Products with Double-Word Load for Floating-Point
Implementation Using Linear ASM Code (twosumlasmfloat)
Figure 8.4 shows the linear ASM code twosumlasmfloat.sa to obtain two sepa-
rate sums of products for a floating-point implementation using linear ASM code.
The double-word load instruction LDDW loads a 64-bit data value and stores it in
a pair of registers. Each single-precision multiply instruction MPYSP performs a
32 × 32 multiplication. The lower and upper 32-bit products are accumulated
separately, yielding the even- and odd-term sums as 32-bit results.
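The same even/odd scheme carries over to float data. Below is a host-side C sketch in which ordinary float arithmetic stands in for LDDW, MPYSP, and ADDSP; the function name is ours, and pairing consecutive elements models the two registers filled by each 64-bit load:

```c
/* Float version of the split accumulation: consecutive element pairs
   are multiplied and accumulated in two running sums, modeling the
   LDDW + MPYSP + ADDSP pattern in ordinary C. Assumes n is even. */
float dotp_float2(const float *a, const float *b, int n)
{
    float suml = 0.0f, sumh = 0.0f;
    int i;
    for (i = 0; i < n; i += 2) {
        suml += a[i] * b[i];          /* lower word of each 64-bit load */
        sumh += a[i + 1] * b[i + 1];  /* upper word of each 64-bit load */
    }
    return suml + sumh;               /* combine outside the loop */
}
```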
Example 8.5: Dot Product with No Parallel Instructions for Fixed-Point
Implementation Using ASM Code (dotpnp)
Figure 8.5 shows the ASM code dotpnp.asm for the dot product with no instruc-
tions in parallel for a fixed-point implementation. A fixed-point implementation can
;twosumlasmfloat.sa Sum of products. Separate accum of even/odd terms
;Using double-word load LDDW for floating-point implementation
loop: LDDW *aptr++, ai1:ai0 ;64-bit word ai0 and ai1
LDDW *bptr++, bi1:bi0 ;64-bit word bi0 and bi1
MPYSP ai0, bi0, prodl ;lower 32-bit product
MPYSP ai1, bi1, prodh ;higher 32-bit product
ADDSP prodl, suml, suml ;accum 32-bit even terms
ADDSP prodh, sumh, sumh ;accum 32-bit odd terms
SUB count, 1, count ;decrement count
[count] B loop ;branch to loop
FIGURE 8.4. Separate sum of products with LDDW using linear ASM code for floating-point
implementation (twosumlasmfloat.sa).
;dotpnp.asm ASM Code with no-parallel instructions for fixed-point
MVK .S1 200, A1 ;count into A1
ZERO .L1 A7 ;init A7 for accum
LOOP LDH .D1 *A4++,A2 ;A2=16-bit data pointed by A4
LDH .D1 *A8++,A3 ;A3=16-bit data pointed by A8
NOP 4 ;4 delay slots for LDH
MPY .M1 A2,A3,A6 ;product in A6
NOP ;1 delay slot for MPY
ADD .L1 A6,A7,A7 ;accum in A7
SUB .S1 A1,1,A1 ;decrement count
[A1] B .S2 LOOP ;branch to LOOP
NOP 5 ;5 delay slots for B
FIGURE 8.5. ASM code with no parallel instructions for fixed-point implementation
(dotpnp.asm).
be performed with all C6x devices, whereas a floating-point implementation
requires a C67x platform such as the C6711 DSK.
The loop iterates 200 times. With a fixed-point implementation, each pointer
register A4 and A8 increments to point at the next half-word (16 bits) in each buffer,
whereas with a floating-point implementation, a pointer register increments the
pointer to the next 32-bit word. The load, multiply, and branch instructions must use
the .D, .M, and .S units, respectively; the add and subtract instructions can use any
unit (except .M). The instructions within the loop consume 16 cycles per iteration.
This yields 16 × 200 = 3200 cycles. Table 8.4 shows a summary of several
optimization schemes for both fixed- and floating-point implementations.
Example 8.6: Dot Product with Parallel Instructions for Fixed-Point
Implementation Using ASM Code (dotpp)
Figure 8.6 shows the ASM code dotpp.asm for the dot product with a fixed-point
implementation with instructions in parallel. With code in lieu of NOPs, the number
of NOPs is reduced.
The MPY instruction uses a cross-path (with .M1x) since the two operands are
from different register files or different paths. The instructions SUB and B are moved
up to fill some of the delay slots required by LDH. The branch instruction occurs
after the ADD instruction. Using parallel instructions, the instructions within the loop
now consume eight cycles per iteration, to yield 8 × 200 = 1600 cycles.
Example 8.7: Two Sums of Products with Word-Wide (32-bit) Data for
Fixed-Point Implementation Using ASM Code (twosumfix)
Figure 8.7 shows the ASM code twosumfix.asm, which calculates two separate
sums of products using word-wide access of data for a fixed-point implementation.
The loop count is initialized to 100 (not 200) since two sums of products are obtained
;dotpp.asm ASM Code with parallel instructions for fixed-point
MVK .S1 200, A1 ;count into A1
|| ZERO .L1 A7 ;init A7 for accum
LOOP LDH .D1 *A4++,A2 ;A2=16-bit data pointed by A4
|| LDH .D2 *B4++,B2 ;B2=16-bit data pointed by B4
SUB .S1 A1,1,A1 ;decrement count
[A1] B .S1 LOOP ;branch to LOOP (after ADD)
NOP 2 ;delay slots for LDH and B
MPY .M1x A2,B2,A6 ;product in A6
NOP ;1 delay slot for MPY
ADD .L1 A6,A7,A7 ;accum in A7,then branch
;branch occurs here
FIGURE 8.6. ASM code with parallel instructions for fixed-point implementation
(dotpp.asm).
per iteration. The instruction LDW loads a word (32 bits) of data. The multiply
instruction MPY forms the product of the lower 16 × 16 data, and MPYH forms the
product of the upper 16 × 16 data. The two ADD instructions accumulate separately
the even and odd sums of products. Note that an additional ADD instruction is
needed outside the loop to accumulate A7 and B7. The instructions within the loop
consume eight cycles, now over 100 iterations (not 200), to yield 8 × 100 = 800 cycles.
Example 8.8: Dot Product with No Parallel Instructions for Floating-Point
Implementation Using ASM Code (dotpnpfloat)
Figure 8.8 shows the ASM code dotpnpfloat.asm for the dot product with a
floating-point implementation using no instructions in parallel. The loop iterates
200 times. The single-precision floating-point instruction MPYSP performs a 32 × 32
multiply. Each MPYSP and ADDSP requires three delay slots. The instructions within
the loop consume a total of 18 cycles per iteration (without including three NOPs
associated with ADDSP). This yields a total of 18 × 200 = 3600 cycles. (See Table 8.4
for a summary of several optimization schemes for both fixed- and floating-point
implementations.)
Example 8.9: Dot Product with Parallel Instructions for Floating-Point
Implementation Using ASM Code (dotppfloat)
Figure 8.9 shows the ASM code dotppfloat.asm for the dot product with a
floating-point implementation using instructions in parallel. The loop iterates 200
;twosumfix.asm ASM code for two sums of products with word-wide data
;for fixed-point implementation
MVK .S1 100, A1 ;count/2 into A1
|| ZERO .L1 A7 ;init A7 for accum of even terms
|| ZERO .L2 B7 ;init B7 for accum of odd terms
LOOP LDW .D1 *A4++,A2 ;A2=32-bit data pointed by A4
|| LDW .D2 *B4++,B2 ;B2=32-bit data pointed by B4
SUB .S1 A1,1,A1 ;decrement count
[A1] B .S1 LOOP ;branch to LOOP (after ADD)
NOP 2 ;delay slots for both LDW and B
MPY .M1x A2,B2,A6 ;lower 16-bit product in A6
|| MPYH .M2x A2,B2,B6 ;upper 16-bit product in B6
NOP ;1 delay slot for MPY/MPYH
ADD .L1 A6,A7,A7 ;accum even terms in A7
|| ADD .L2 B6,B7,B7 ;accum odd terms in B7
;branch occurs here
FIGURE 8.7. ASM code for two sums of products with 32-bit data for fixed-point implementation (twosumfix.asm).