Advanced Computer Architecture - Lecture 20: Instruction level parallelism

CS 704
Advanced Computer Architecture

Lecture 20
Instruction Level Parallelism
(Static Scheduling)

Prof. Dr. M. Ashraf Chughtai


Today’s Topics
Recap: Dynamic Scheduling in ILP
Software Approaches to exploit ILP
– Basic Compiler Techniques
– Loop unrolling and scheduling
– Static Branch Prediction
Summary
MAC/VU-Advanced
Computer Architecture

Lecture 20 – Instruction Level
Parallelism-Static (1)

2


Recap: Dynamic Scheduling
Our discussions in the last eight (8) lectures focused on hardware-based approaches to exploiting parallelism among instructions.
The instructions in a basic block, i.e., a straight-line code sequence without branches, are executed in parallel using a pipelined datapath.


Recap: Dynamic Scheduling
Here, we noticed that:
– The performance of a pipelined datapath is limited by its structure and by data and control dependences, as they lead to structural, data, and control hazards.
These hazards are avoided by introducing stalls.



Recap: Dynamic Scheduling
The stalls degrade the performance of a pipelined datapath by increasing the CPI to more than 1.
The number of stalls needed to overcome hazards in a pipelined datapath is reduced or eliminated by introducing additional hardware and using dynamic scheduling techniques.


Recap: Dynamic Scheduling
The major hardware-based techniques studied so far are summarized here:

Technique                              Hazard-type stalls reduced
- Forwarding and bypassing             Potential data hazard stalls
- Delayed branching and                Control hazard stalls
  branch scheduling
- Basic dynamic scheduling             Data hazard stalls from
  (scoreboarding)                      true dependences


Recap: Dynamic Scheduling

Technique                              Hazard-type stalls reduced
- Dynamic scheduling with renaming     Stalls from data hazards: from
  (Tomasulo's approach)                anti-dependences and from
                                       output dependences
- Dynamic branch prediction            Control hazard stalls
- Speculation                          Data and control hazard stalls
- Multiple instruction issues          Ideal CPI (reducing it
  per cycle                            below 1)


Introduction to Static Scheduling in ILP
Processors that issue multiple instructions per cycle are rated as high-performance processors.
These processors exist in a variety of flavors, such as:
– Superscalar processors
– VLIW processors
– Vector processors


Introduction to Static Scheduling in ILP
Superscalar processors exploit ILP using static as well as dynamic scheduling approaches.
VLIW processors, on the other hand, exploit ILP using static scheduling only.
Dynamic scheduling in superscalar processors has already been discussed in detail, and the basics of static scheduling for superscalar processors have been introduced.



Introduction to Static Scheduling in ILP
In today's lecture and in a few of the following lectures, our focus will be a detailed study of ILP exploitation through static scheduling.
The major software scheduling techniques under discussion, used to reduce data and control stalls, are as follows:


Introduction to Static Scheduling in ILP

Technique                              Hazard-type stalls reduced
- Basic compiler scheduling            Data hazard stalls
- Loop unrolling                       Control hazard stalls
- Compiler dependence analysis         Ideal CPI, data hazard stalls
- Trace scheduling                     Ideal CPI, data hazard stalls
- Compiler speculation                 Ideal CPI, data and control
                                       hazard stalls


Basic Pipeline Scheduling
In order to exploit ILP, we have to keep the pipeline full with a sequence of unrelated instructions that can be overlapped in the pipeline.
Thus, a dependent instruction in a sequence must be separated from its source instruction by a distance, in clock cycles, equal to the latency of that source instruction.
For example, …



Basic Pipeline Scheduling
An FP ALU operation that uses the result of an earlier FP ALU operation
– must be kept 3 cycles away from the earlier instruction; and
an FP ALU operation that uses the result of an earlier load double word operation
– must be kept 1 cycle away from it.




Basic Pipeline Scheduling
For our further discussions, we will assume the following average latencies of the functional units:
– Integer ALU operation latency = 0
– FP load latency to FP store = 0
  (here, the result of the load can be bypassed to the store without stalling)
– Integer load latency = 1; whereas
– FP ALU operation latency to FP store = 2



Basic Scheduling
A compiler performing scheduling to exploit ILP in a program must take into consideration the latencies of the functional units in the pipeline.
Let us see, with the help of our earlier example of adding a scalar to a vector, how a compiler can increase parallelism by scheduling.



Execution of a Simple Loop with Basic Scheduling
Let us consider a simple loop:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + scalar;

where a scalar is added to a vector over 1000 iterations, and the body of each iteration is independent, which is known at compile time.
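The loop above can be sketched directly in C. This is a minimal illustration, not from the lecture: the 0-based index adjustment and the pointer-walking variant (which mirrors the pointer decrement performed by the MIPS code) are my own framing for demonstration.

```c
#include <assert.h>

#define N 1000

/* The lecture's loop, adjusted for C's 0-based arrays:
   add a scalar to every element, walking from the end down. */
void add_scalar_indexed(double x[], double scalar) {
    for (int i = N; i > 0; i = i - 1)
        x[i - 1] = x[i - 1] + scalar;
}

/* The same computation in the pointer-decrement style the MIPS
   code uses: the pointer walks down by one double (8 bytes) per
   iteration, and the loop ends when it reaches the array start. */
void add_scalar_pointer(double x[], double scalar) {
    double *p = x + N;          /* one past the last element */
    double *limit = x;          /* plays the role of R2 */
    do {
        p = p - 1;              /* DADDUI R1, R1, #-8 */
        *p = *p + scalar;       /* L.D / ADD.D / S.D */
    } while (p != limit);       /* BNE R1, R2, LOOP */
}
```

Both versions compute the same result; the pointer form makes the loop-overhead operations (pointer update and end test) visible as explicit instructions, which is the cost the later slides quantify.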


MIPS Code Without Scheduling
The MIPS code, without any scheduling, looks like:

Loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add scalar in F2
      S.D    F4, 0(R1)      ; store result
      DADDUI R1, R1, #-8    ; decrement pointer by 8 bytes
      BNE    R1, R2, LOOP   ; branch if R1 != R2

Notice the data dependences in the ADD.D and S.D operations, which lead to data hazards, and the control hazard due to the BNE instruction.


Loop Execution Without Basic Scheduling
Let us assume that:
– the loop is implemented on the standard five-stage pipeline with a branch delay of one clock cycle;
– the functional units are fully pipelined; and
– the functional units have the latencies shown in the table.


Stalls of FP ALU and Load Instructions
Here, the first column shows the type of instruction producing a result;
the second column is the type of the consuming instruction; and
the last column is the number of intervening clock cycles needed to avoid a stall.

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0



Single Loop Execution Without Scheduling, Showing the Latencies and
Stalls Between Producing and Consuming Instructions

      Instructions              Clock cycle
Loop: L.D    F0, 0(R1)          1
      stall                     2    ; L.D followed by FP ALU op has latency = 1
      ADD.D  F4, F0, F2         3
      stall                     4    ; FP ALU op followed by store double
      stall                     5    ;   has latency = 2
      S.D    F4, 0(R1)          6
      DADDUI R1, R1, #-8        7
      stall                     8    ; DADDUI followed by BNE has latency = 1
      BNE    R1, R2, LOOP       9
      stall                     10   ; branch has a delay of 1

This code requires 10 clock cycles per iteration.
We can schedule the loop to reduce the stalls to 1.
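The 10-cycle count can be reproduced with a small stall calculator. This is a sketch under stated assumptions, not anything from the lecture itself: the instruction classes, the dependence encoding, and the one-instruction-per-cycle issue model are illustrative; the producer-to-consumer latencies come from the table above, plus a branch delay of 1 and a DADDUI-to-BNE latency of 1 as assumed on the slides.

```c
#include <assert.h>

/* Instruction classes used by the latency table. */
enum iclass { INT_ALU, FP_ALU, LOAD_D, STORE_D, BRANCH };

/* Intervening cycles needed between a producer and its consumer. */
static int latency(enum iclass prod, enum iclass cons) {
    if (prod == FP_ALU  && cons == FP_ALU)  return 3;
    if (prod == FP_ALU  && cons == STORE_D) return 2;
    if (prod == LOAD_D  && cons == FP_ALU)  return 1;
    if (prod == LOAD_D  && cons == STORE_D) return 0;
    if (prod == INT_ALU && cons == BRANCH)  return 1;  /* DADDUI -> BNE */
    return 0;
}

/* One instruction: its class and the index of the instruction that
   produces its operand (-1 if none within the loop body). */
struct instr { enum iclass cls; int dep; };

/* Issue each instruction in program order, one per cycle, stalling
   until its producer's result has aged by the required latency;
   the branch-delay slot is added at the end. */
static int cycles_per_iteration(const struct instr *code, int n, int branch_delay) {
    int issue[16];
    int cycle = 0;
    for (int i = 0; i < n; i++) {
        cycle += 1;                               /* next issue slot */
        if (code[i].dep >= 0) {
            int d = code[i].dep;
            int ready = issue[d] + 1 + latency(code[d].cls, code[i].cls);
            if (ready > cycle) cycle = ready;     /* insert stalls */
        }
        issue[i] = cycle;
    }
    return cycle + branch_delay;
}
```

Feeding it the unscheduled loop body (L.D, ADD.D depending on L.D, S.D depending on ADD.D, DADDUI, BNE depending on DADDUI) yields the same 10 cycles per iteration as the hand count above.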



Single Loop Execution with Compiler Scheduling

      Instructions              Clock cycle
Loop: L.D    F0, 0(R1)          1
      DADDUI R1, R1, #-8        2
      ADD.D  F4, F0, F2         3
      stall                     4
      BNE    R1, R2, LOOP       5    (delayed branch)
      S.D    F4, 8(R1)          6    (offset altered; S.D interchanged with DADDUI)



Explanation
To schedule the delayed branch, the compiler had to determine that it could swap the DADDUI and the S.D by changing the destination address of the S.D instruction.
You can see that the address 0(R1) is replaced by 8(R1), as R1 has already been decremented by DADDUI.




Explanation … Cont’d

Note that the chain of dependent instructions, from the L.D to the ADD.D and then from the ADD.D to the S.D, determines the clock cycle count, which is
– 6 for this scheduled loop, and
– 10 for the unscheduled execution.


Explanation … Cont’d
In this example, one loop iteration, operating on and storing back one array element, is completed every 6 clock cycles,
but the actual work of operating on the array element takes only 3 clock cycles (load, add, and store).
The remaining 3 clock cycles per iteration are loop overhead (evaluating the loop condition, the stall, and the branch); i.e., the loop overhead is 100% in this example.
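The overhead arithmetic can be stated as a tiny helper; this is a minimal sketch, with the 3-cycle work and 6-cycle iteration figures taken from the slide:

```c
#include <assert.h>

/* Loop overhead as a percentage of the useful work:
   (cycles per iteration - work cycles) / work cycles * 100. */
static int overhead_percent(int cycles_per_iter, int work_cycles) {
    int overhead = cycles_per_iter - work_cycles;
    return overhead * 100 / work_cycles;
}
```

With 6 cycles per iteration and 3 cycles of work, the overhead is 100%; the unscheduled loop, at 10 cycles for the same 3 cycles of work, carries 233% overhead.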



Loop Unrolling
To eliminate or reduce the impact of the loop overhead, here 3 clock cycles per iteration, we have to get more operations within the loop relative to the number of overhead instructions.
A simple way to increase the number of instructions per loop is to replicate the loop body a number of times, adjusting the loop-termination code accordingly.
This approach is known as loop unrolling.
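At the source level, the transformation looks like the sketch below. The unroll factor of 4 is my choice for illustration (the lecture has not fixed a factor yet), and it assumes the trip count is a multiple of 4; a general unroller would add a cleanup loop for the remainder.

```c
#include <assert.h>

#define N 1000

/* Original loop: one add per iteration, so the loop overhead
   (index update, test, branch) is paid N times. */
void add_scalar(double x[], double scalar) {
    for (int i = N; i > 0; i = i - 1)
        x[i - 1] = x[i - 1] + scalar;
}

/* Unrolled 4 times: four copies of the body per iteration, with the
   loop control stepping by 4, so the overhead is paid only N/4 times.
   Assumes N is a multiple of 4. */
void add_scalar_unrolled(double x[], double scalar) {
    for (int i = N; i > 0; i = i - 4) {
        x[i - 1] = x[i - 1] + scalar;
        x[i - 2] = x[i - 2] + scalar;
        x[i - 3] = x[i - 3] + scalar;
        x[i - 4] = x[i - 4] + scalar;
    }
}
```

Both functions compute the same result; the unrolled form simply amortizes the loop-control instructions over four element operations per trip around the loop.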

