CS 704
Advanced Computer Architecture
Lecture 21
Instruction Level Parallelism
(Static Scheduling – Multiple Issue Processor)
Prof. Dr. M. Ashraf Chughtai
Today’s Topics
Recap: Static Scheduling and Branch Prediction
Static Multiple Issue: VLIW Approach
Detecting and enhancing loop-level parallelism
Software pipelining
Summary
MAC/VU-Advanced
Computer Architecture
Lecture 21 – Instruction Level
Parallelism-Static (2)
Recap: Static Scheduling
Last time we began discussing the static scheduling techniques used to exploit ILP in a pipelined datapath
We noticed that inserting stalls is the basic compiler approach used to avoid data and control hazards
Recap: Static Scheduling
However, because stalls degrade performance, the compiler schedules the instructions to avoid hazards and to reduce or eliminate stalls
Furthermore, we observed that loops are unrolled to enhance performance and reduce stalls
Recap: Static Scheduling
The number of stalls is further reduced when the unrolled loop is scheduled: each instruction is repeated once per unrolled iteration, using additional registers
Finally, we discussed the impact of static branch prediction on the performance of scheduled and unrolled loops
Recap: Static Scheduling
We also observed that, in a multiple-issue superscalar processor, static branch prediction reduced the misprediction rate more than dynamic branch prediction did
Here, the misprediction rate ranges from 4% to 15%
Today’s Discussion - Scheduling in VLIW processor
We know that Very Long Instruction Word (VLIW) processors issue multiple instructions per cycle using only static scheduling
Today we will extend our discussion of static scheduling as used in VLIW processors
Review of VLIW format
A VLIW contains a fixed number of instructions, say 4 to 16
A VLIW is formatted:
– either as one large instruction
– or as a fixed instruction packet with explicit parallelism among the instructions in the set
VLIW / EPIC Processor
Since the parallelism among instructions is explicit, VLIW is also referred to as Explicitly Parallel Instruction Computing (EPIC)
The compiler places operations into a wide template or packet, so the processor can initiate multiple instructions in one cycle
A packet may contain 64 – 128 bytes
Multiple-Issue overheads - VLIW Vs. Superscalar
In a superscalar processor, the overhead grows with the issue width
– For a two-issue processor the overhead is minimal
– For a four-issue processor the overhead is manageable
For VLIW, the overhead does not grow with the issue width
VLIW / EPIC Processor
The early VLIW machines were rigid in their instruction formats and required recompilation of programs for different versions of the hardware
Recent architectures introduce innovations that eliminate the need for recompilation, which in turn improves performance
VLIW / EPIC Processor …. Cont’d
Here, wider processors are used, which employ multiple independent functional units; and
The compiler does most of the work of finding and scheduling instructions for parallel execution
VLIW / EPIC …. Cont’d
The compiler schedules and packs multiple operations into one very long instruction word; and
The hardware simply issues the complete packet given to it by the compiler
Thus, the maximum issue rate is increased
Example VLIW Processor
Let us consider an example VLIW processor that can perform at most five operations in one cycle
These operations include:
– one integer operation
– two floating-point operations; and
– two memory reference operations
Example VLIW Processor ….. Cont’d
Here, we assume that:
– the instructions have a set of 16-bit to 24-bit fields for each unit, with an instruction length ranging from 112 to 168 bits; and
– to keep the functional units busy, there must be enough parallelism in the code sequence to fill the available operation slots
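The slot constraint of the example processor can be illustrated with a small Python sketch. The names `SLOT_LIMITS`, `fits_in_one_instruction` and `slot_utilization` are hypothetical and chosen only for this illustration; the limits themselves (one integer, two floating-point, two memory reference operations) are the ones listed for the example processor.

```python
from collections import Counter

# Per-cycle slot limits of the example VLIW processor from the slides.
# The dictionary layout and key names are assumptions of this sketch.
SLOT_LIMITS = {"int": 1, "fp": 2, "mem": 2}


def fits_in_one_instruction(op_kinds):
    """Return True if the operation kinds fit the slots of one VLIW instruction."""
    counts = Counter(op_kinds)
    return all(counts[kind] <= SLOT_LIMITS.get(kind, 0) for kind in counts)


def slot_utilization(op_kinds):
    """Fraction of the five operation slots that one instruction fills."""
    return len(op_kinds) / sum(SLOT_LIMITS.values())
```

For instance, two loads alone form a legal instruction but fill only two of the five slots, which is why the code sequence needs enough parallelism to keep the functional units busy.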
VLIW Loop unrolling Example
Now let us see how a loop is unrolled for execution on a multiple-issue VLIW processor
Here, if unrolling the loop generates straight-line code, then local scheduling techniques, which operate on a single basic block, can be used
Whereas …..
VLIW Loop unrolling Example
If parallelism is required across branches, then more complex global scheduling is used to uncover the parallelism
To explain these concepts, let us reconsider our earlier MIPS example, which adds a scalar to a vector; i.e.,
x[i] = x[i] + s
VLIW Loop unrolling Example
Loop:  L.D    F0, 0(R1)       ;F0 = array element
       ADD.D  F4, F0, F2      ;add scalar in F2
       S.D    F4, 0(R1)       ;store result
       DADDU  R1, R1, #-8     ;decrement pointer 8 bytes
       BNE    R1, R2, Loop    ;branch R1 != R2
In order to execute this code using multi-issue
VLIW, we may unroll the loop as many times as
necessary to eliminate any stalls, ignoring the
branch delay, if any
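As a high-level illustration of what unrolling does to x[i] = x[i] + s, here is a minimal Python sketch. The function names are hypothetical, the unroll factor of four is chosen only to keep the example short, and the sketch assumes the array length is a multiple of the unroll factor so that no clean-up loop is needed.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element per iteration, walking down from the last index."""
    x = list(x)
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Loop unrolled four times: four independent updates per loop-back branch.

    Assumption of this sketch: len(x) is a multiple of 4.
    """
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of the unroll factor")
    x = list(x)
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s          # copy 1 of the loop body
        x[i - 1] = x[i - 1] + s  # copy 2
        x[i - 2] = x[i - 2] + s  # copy 3
        x[i - 3] = x[i - 3] + s  # copy 4
        i -= 4                   # one pointer update and one branch per four elements
    return x
```

The four copies do not depend on one another, which is what gives the compiler independent operations to place in the slots of a wide instruction.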
Assumptions
Let us assume that:
– the compiler can generate long straight-line code using local scheduling to build up VLIW instructions
– the VLIW processor has sufficient registers and functional units to issue up to 5 instructions in one cycle; i.e., 15 registers versus 6 in superscalar
– the loop is unrolled to make seven copies of the body, which eliminates all stalls and avoids delays
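One plausible reading of the figure of 15 registers is that it counts the floating-point registers named in the unrolled schedule of this lecture: one load destination and one sum register per copy of the body, plus F2 for the scalar. The short Python sketch below checks that count; the variable and function names are hypothetical, and the interpretation itself is an assumption rather than something the slides state.

```python
# FP registers as they appear in the unrolled schedule of this lecture.
LOAD_REGS = ["F0", "F6", "F10", "F14", "F18", "F22", "F26"]  # L.D destinations
SUM_REGS = ["F4", "F8", "F12", "F16", "F20", "F24", "F28"]   # ADD.D results
SCALAR_REG = "F2"                                            # the scalar s


def fp_registers_used():
    """Distinct FP registers used by the seven copies (assumed interpretation)."""
    regs = set(LOAD_REGS) | set(SUM_REGS) | {SCALAR_REG}
    return sorted(regs, key=lambda name: int(name[1:]))
```

With seven copies this gives 2 × 7 + 1 = 15 distinct registers.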
Loop Unrolling for VLIW Processor
Clock  Operations issued in the cycle
1      LD F0,0(R1)      LD F6,-8(R1)
2      LD F10,-16(R1)   LD F14,-24(R1)
3      LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2
4      LD F26,-48(R1)   ADDD F12,F10,F2  ADDD F16,F14,F2
5      ADDD F20,F18,F2  ADDD F24,F22,F2
6      SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2
7      SD -16(R1),F12   SD -24(R1),F16
8      SD -32(R1),F20   SD -40(R1),F24   SUBI R1,R1,#48
Assumptions
The table shows the VLIW instructions into which the copies of the loop instructions are packed in the unrolled sequence
Here, we assume that R1 has been initialized to #48 for 7 iterations
[the memory locations are 8 bytes apart; the first value is at #48 and the 7th value is at #0]
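The address arithmetic behind this assumption can be checked with a few lines of Python. The function name and parameter names are hypothetical; the values 48, 7 and 8 are the ones stated above.

```python
def unrolled_addresses(r1=48, copies=7, stride=8):
    """Effective addresses offset(R1) touched by the unrolled copies.

    Copy k uses offset -stride * k, so its address is r1 - stride * k.
    """
    return [r1 - stride * k for k in range(copies)]
```

With R1 = 48 the seven copies touch addresses 48, 40, 32, 24, 16, 8 and 0, which matches the offsets 0, -8, ..., -48 used by the loads and stores.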
Explanation
Here, the loop has been unrolled for seven (7) iterations to avoid completely empty issue cycles and thus eliminate the stalls
Each VLIW instruction has slots for two (2) memory reference operations [L.D or S.D], two (2) FP operations [ADD.D] and one (1) integer operation [SUBI]
The multiple issues per cycle are depicted here, showing the type of operation in each instruction
Operation types in VLIW
Clock  Memory reference 1  Memory reference 2  FP operation 1    FP op. 2          Int. op/branch
1      LD F0,0(R1)         LD F6,-8(R1)
2      LD F10,-16(R1)      LD F14,-24(R1)
3      LD F18,-32(R1)      LD F22,-40(R1)      ADDD F4,F0,F2     ADDD F8,F6,F2
4      LD F26,-48(R1)                          ADDD F12,F10,F2   ADDD F16,F14,F2
5                                              ADDD F20,F18,F2   ADDD F24,F22,F2
6      SD 0(R1),F4         SD -8(R1),F8        ADDD F28,F26,F2
7      SD -16(R1),F12      SD -24(R1),F16
8      SD -32(R1),F20      SD -40(R1),F24                                          SUBI R1,R1,#48
Explanation
As our example VLIW processor can handle two memory operations per cycle, the two (2) L.D operations of the 1st and 2nd iterations, which have no dependence on each other, are issued in the 1st clock cycle; and
the two (2) L.D operations of the 3rd and 4th iterations, which also have no dependence, are issued in the 2nd clock cycle
Explanation .. Cont’d
Furthermore, our VLIW processor can handle two (2) FP operations per cycle; and
the two (2) ADD.D operations of the 1st and 2nd iterations depend on the L.D instructions of their respective iterations; and
the L.D instruction has a latency of 2 cycles; therefore, the two ADD.D instructions of the 1st and 2nd iterations are scheduled in the 3rd cycle to eliminate stalls (marked by yellow arrows on the slide)
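The placement rule described here can be sketched as a greedy slot-filling routine in Python. This is only an illustration under stated assumptions, not the compiler's actual algorithm: the names `schedule`, `LOAD_TO_ADD` and `ADD_TO_STORE` are hypothetical, the load-to-add distance of 2 cycles comes from the explanation above, and the add-to-store distance of 3 cycles is read off the table (ADD.D in cycle 3, its S.D in cycle 6). The integer operation and the branch are left out.

```python
# Slots per cycle for the two operation classes modelled in this sketch.
SLOTS = {"mem": 2, "fp": 2}
LOAD_TO_ADD = 2   # assumption: L.D in cycle c, dependent ADD.D no earlier than c + 2
ADD_TO_STORE = 3  # assumption read off the table: ADD.D in cycle 3, S.D in cycle 6


def schedule(copies=7):
    """Greedily place the L.D, ADD.D and S.D of each unrolled copy.

    Returns three lists with the issue cycle of each copy's load, add and store.
    """
    used = {}  # (cycle, kind) -> number of slots already taken

    def place(kind, earliest):
        cycle = earliest
        while used.get((cycle, kind), 0) >= SLOTS[kind]:
            cycle += 1  # slot class is full in this cycle, try the next one
        used[(cycle, kind)] = used.get((cycle, kind), 0) + 1
        return cycle

    loads = [place("mem", 1) for _ in range(copies)]
    adds = [place("fp", c + LOAD_TO_ADD) for c in loads]
    stores = [place("mem", c + ADD_TO_STORE) for c in adds]
    return loads, adds, stores
```

Under these assumptions the loads issue in cycles 1, 1, 2, 2, 3, 3, 4 and the adds in cycles 3, 3, 4, 4, 5, 5, 6, matching the table. The sketch places the seventh store (of F28), which the table does not show, in cycle 9.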