CS 704
Advanced Computer Architecture
Lecture 20
Instruction Level Parallelism
(Static Scheduling)
Prof. Dr. M. Ashraf Chughtai
Today’s Topics
Recap: Dynamic Scheduling in ILP
Software Approaches to exploit ILP
– Basic Compiler Techniques
– Loop unrolling and scheduling
– Static Branch Prediction
Summary
MAC/VU-Advanced Computer Architecture, Lecture 20 – Instruction Level Parallelism-Static (1)
Recap: Dynamic Scheduling
Our discussions in the last eight (8)
lectures have focused on the
hardware-based approaches to
exploiting parallelism among
instructions
The instructions in a basic block, i.e., a
straight-line code sequence without
branches, are executed in parallel by
using a pipelined datapath
Recap: Dynamic Scheduling
Here, we noticed that:
– The performance of a pipelined datapath
is limited by its structure and by data and
control dependences, as they lead to
structural, data, and control hazards
These hazards are removed by
introducing stalls
Recap: Dynamic Scheduling
The stalls degrade the performance of
a pipelined datapath by increasing the
CPI to more than 1
The number of stalls needed to overcome
hazards in a pipelined datapath is
reduced or eliminated by introducing
additional hardware and using
dynamic scheduling techniques
Recap: Dynamic Scheduling
The major hardware-based techniques studied so
far are summarized here:

Technique                                    Hazard-type stalls reduced
- Forwarding and bypassing                   Potential data hazard stalls
- Delayed branching and branch scheduling    Control hazard stalls
- Basic dynamic scheduling (scoreboarding)   Data hazard stalls from true dependences
Recap: Dynamic Scheduling
Technique                                    Hazard-type stalls reduced
- Dynamic scheduling with renaming           Stalls from data hazards due to
  (Tomasulo’s approach)                      antidependences and output dependences
- Dynamic branch prediction                  Control hazard stalls
- Speculation                                Data and control hazard stalls
- Multiple instruction issues per cycle      Ideal CPI
Introduction to Static Scheduling in ILP
Processors that issue multiple instructions
per cycle are rated as the high-performance
processors
These processors exist in a variety of
flavors, such as:
– Superscalar Processors
– VLIW processors
– Vector Processors
Introduction to Static Scheduling in ILP
The superscalar processors exploit ILP
using static as well as dynamic scheduling
approaches
The VLIW processors, on the other hand,
exploit ILP using static scheduling only
The dynamic scheduling in superscalar
processors has already been discussed in
detail;
And, the basics of static scheduling for
superscalar processors have been introduced
Introduction to Static Scheduling in ILP
In today’s lecture and in a few
following lectures, our focus will be the
detailed study of ILP exploitation through
static scheduling
The major software scheduling techniques
under discussion, to reduce the data and
control stalls, are as follows:
Introduction to Static Scheduling in ILP
Technique                        Hazard-type stalls reduced
- Basic compiler scheduling     Data hazard stalls
- Loop unrolling                Control hazard stalls
- Compiler dependence analysis  Ideal CPI, data hazard stalls
- Trace scheduling              Ideal CPI, data hazard stalls
- Compiler speculation          Ideal CPI, data and control hazard stalls
Basic Pipeline scheduling
In order to exploit ILP, we have to keep
a pipeline full with a sequence of unrelated
instructions which can be overlapped in
the pipeline
Thus, a dependent instruction in a
sequence must be separated from the
source instruction by a distance equal to
the latency of the source instruction
For example, …
Basic Pipeline scheduling
An FP ALU operation that uses the result
of an earlier FP ALU operation
– must be kept 3 cycles away from the
earlier instruction; and
An FP ALU operation that uses the
result of an earlier load-double-word
operation
– must be kept 1 cycle away from it
Basic Pipeline scheduling
For our further discussions we will assume the
following average latencies of the functional
units:
– Integer ALU operation latency = 0
– FP load latency to FP store = 0
  (here, the result of the load can be
  bypassed to the store without stalling)
– Integer load latency = 1; whereas
– FP ALU operation latency to FP store = 2
Basic Scheduling
A compiler performing scheduling to
exploit ILP in a program must take into
consideration the latencies of the functional
units in the pipeline
Let us see, with the help of our earlier
example of adding a scalar to a vector, how a
compiler can increase parallelism by
scheduling
Execution of a Simple Loop with
basic scheduling
Let us consider a simple loop:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + scalar
where a scalar is added to a vector in
1000 iterations; and
the body of each iteration is known to be
independent at compile time
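The loop’s key property can be made explicit with a short runnable sketch (a Python rendering of the slide’s C-style loop, not part of the lecture): every iteration reads and writes a different x[i], so there are no loop-carried dependences and the iterations may be overlapped.

```python
# Python sketch of the slide's loop: x[i] = x[i] + scalar.
# Each iteration touches a distinct element, so there are no
# loop-carried dependences between iterations.

def add_scalar_to_vector(x, scalar):
    # Walk the array from the last element down to the first,
    # mirroring the slide's "for (i=1000; i>0; i=i-1)" loop.
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + scalar
    return x
```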
MIPS code without scheduling
The MIPS code, without any scheduling,
looks like:

Loop:  L.D     F0, 0(R1)     ;F0 = array element
       ADD.D   F4, F0, F2    ;add scalar in F2
       S.D     F4, 0(R1)     ;store result
       DADDUI  R1, R1, #-8   ;decrement pointer by 8 bytes
       BNE     R1, R2, LOOP  ;branch if R1 != R2

Notice the data dependences in the ADD.D and S.D
instructions, which lead to data hazards, and the control
hazard due to the BNE instruction
Loop execution without Basic
Scheduling
Let us assume that the loop is implemented
using the standard five-stage pipeline with a
branch delay of one clock cycle
The functional units are fully pipelined
The functional units have the latencies
shown in the table
Stalls of FP ALU and Load Instructions
Here, the first column shows the type of the instruction
producing the result, the second column shows the type of
the consuming instruction, and the last column gives the
number of intervening clock cycles needed to avoid a stall

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Single loop execution without scheduling, showing the
stalls implied by the producer vis-a-vis consumer
latencies

Instructions                   Clock cycle
Loop:  L.D     F0, 0(R1)       1
       stall                   2    ;L.D followed by an FP ALU op: latency = 1
       ADD.D   F4, F0, F2      3
       stall                   4    ;FP ALU op followed by a store double:
       stall                   5    ;latency = 2
       S.D     F4, 0(R1)       6
       DADDUI  R1, R1, #-8     7
       stall                   8    ;integer ALU op followed by the branch: latency = 1
       BNE     R1, R2, LOOP    9
       stall                   10   ;branch delay: latency = 1

This code requires 10 clock cycles per iteration.
We can schedule the loop to reduce the stalls to 1
Single loop execution with compiler
scheduling

Instructions                   Clock cycle
Loop:  L.D     F0, 0(R1)       1
       DADDUI  R1, R1, #-8     2
       ADD.D   F4, F0, F2      3
       stall                   4
       BNE     R1, R2, LOOP    5    (delayed branch)
       S.D     F4, 8(R1)       6    (offset altered and interchanged
                                     with DADDUI)
Explanation
To schedule the delayed branch, the compiler
had to determine that it could swap the
DADDUI and S.D by changing the
destination address of the S.D instruction
You can see that the address 0(R1)
is replaced by 8(R1), as R1 has already been
decremented by DADDUI
Explanation … Cont’d
Note that the chain of dependent
instructions from the L.D to the ADD.D and
then from the ADD.D to the S.D determines the
clock-cycle count; which is
for this scheduled loop = 6
and for the unscheduled execution = 10
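The cycle counts of 6 and 10 can be cross-checked with a small sketch (a hypothetical helper, not from the lecture) that applies the latency table mechanically: a consumer must issue at least latency + 1 cycles after its producer, and an empty branch-delay slot costs one extra cycle. The integer-ALU-to-branch latency of 1 is taken from the comment on the unscheduled listing.

```python
# Sketch: count the execution cycles of a loop body by applying the
# lecture's producer/consumer latency table. One instruction issues
# per cycle; a consumer must issue at least (latency + 1) cycles
# after the instruction that produces its operand.

LATENCY = {
    ("FP_ALU", "FP_ALU"): 3,   # FP ALU op -> another FP ALU op
    ("FP_ALU", "STORE"): 2,    # FP ALU op -> store double
    ("LOAD", "FP_ALU"): 1,     # load double -> FP ALU op
    ("LOAD", "STORE"): 0,      # load double -> store double
    ("INT_ALU", "BRANCH"): 1,  # DADDUI -> BNE (from the listing's comment)
}

def cycles(body, empty_delay_slot):
    """body: list of (instruction type, index of producer or None)."""
    issue = []
    cycle = 0
    for itype, dep in body:
        cycle += 1                                    # next issue slot
        if dep is not None:
            gap = LATENCY.get((body[dep][0], itype), 0)
            cycle = max(cycle, issue[dep] + gap + 1)  # insert stalls
        issue.append(cycle)
    return cycle + (1 if empty_delay_slot else 0)

# Unscheduled loop: L.D, ADD.D, S.D, DADDUI, BNE (+ empty delay slot)
unscheduled = [("LOAD", None), ("FP_ALU", 0), ("STORE", 1),
               ("INT_ALU", None), ("BRANCH", 3)]
# Scheduled loop: L.D, DADDUI, ADD.D, BNE, S.D (S.D fills the delay slot)
scheduled = [("LOAD", None), ("INT_ALU", None), ("FP_ALU", 0),
             ("BRANCH", 1), ("STORE", 2)]

print(cycles(unscheduled, empty_delay_slot=True))   # 10
print(cycles(scheduled, empty_delay_slot=False))    # 6
```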
Explanation … Cont’d
In this example, one loop iteration, storing
back one array element, is completed
every 6 clock cycles
but the actual work of operating on the
array element takes only 3 clock cycles (load,
add, and store)
The remaining 3 clock cycles per iteration
are the loop overhead (to evaluate the
condition, stall, and branch); i.e., the loop
overhead is 100% in this example
Loop Unrolling
To eliminate or reduce the impact of the loop
overhead, here 3 clock cycles per loop, we
have to get more operations within the
loop relative to the number of overhead
instructions
A simple way to increase the number of
instructions per loop is to replicate
the loop body a number of times and
adjust the loop-termination code
This approach is known as loop unrolling
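As a preview, the idea can be sketched in Python (an assumed illustration, under the simplifying assumption that the trip count divides evenly by the unroll factor): unrolling the slide’s loop by 4 replicates the body four times, so the index update, test, and branch run only once per four elements.

```python
# Sketch: the x[i] = x[i] + scalar loop unrolled by a factor of 4.
# The loop-overhead code (index update, test, branch) now executes
# once per four array elements instead of once per element.

def add_scalar_unrolled(x, scalar):
    assert len(x) % 4 == 0          # assumed: trip count divisible by 4
    i = 0
    while i < len(x):               # loop test/branch runs len(x)/4 times
        # Replicated loop body: four independent element updates
        x[i] = x[i] + scalar
        x[i + 1] = x[i + 1] + scalar
        x[i + 2] = x[i + 2] + scalar
        x[i + 3] = x[i + 3] + scalar
        i += 4                      # adjusted loop-termination code
    return x
```

A real compiler must also generate cleanup code for trip counts that are not a multiple of the unroll factor; the assertion above simply sidesteps that case.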