CS 704
Advanced Computer Architecture
Lecture 19
Instruction Level Parallelism
(Limitations of ILP and Conclusion)
Prof. Dr. M. Ashraf Chughtai
Today's Topics
- Recap
- Limitations of ILP
  - Hardware model
  - Effects of branches and jumps
  - Finite registers
- Performance of Intel P6 micro-architecture-based processors (Pentium Pro, Pentium II, III, and IV)
- Thread-level parallelism
- Summary
MAC/VU-Advanced Computer Architecture
Lecture 19 – Instruction Level Parallelism-Dynamic (8)
Recap: ILP - Dynamic Scheduling
In the last few lectures we have been discussing the concepts and methodologies, introduced during the last decade, to design high-performance processors.
Our focus has been on hardware methods for instruction-level parallelism that execute multiple instructions in a pipelined datapath.
Recap: ILP - Dynamic Scheduling
These hardware techniques are referred to as dynamic scheduling techniques.
They are used to avoid structural, data, and control hazards and to minimize the number of stalls, in order to achieve better performance.
We have discussed dynamic scheduling in both the integer pipelined datapath and the floating-point pipelined datapath.
Recap: ILP - Dynamic Scheduling
We discussed scoreboarding and Tomasulo's algorithm as the basic concepts for dynamic scheduling in the integer and floating-point datapaths.
The structures implementing these concepts facilitate out-of-order execution, which removes name dependences and thus avoids data hazards without stalls.
Recap: ILP - Dynamic Scheduling
We also discussed branch-prediction techniques and different types of branch predictors, used to reduce the number of stalls due to control hazards.
The concept of multiple-instruction issue was discussed in detail.
This concept is used to reduce the CPI to less than one, thus enhancing the performance of the processor.
Recap: ILP - Dynamic Scheduling
Last time we talked about extensions to Tomasulo's structure that add hardware-based speculation.
Speculation allows the processor to assume that a branch is correctly predicted; instructions may therefore execute out of order but commit in order, once it is confirmed that the speculation was correct and no exceptions exist.
Today's Topics: ILP - Dynamic Scheduling
Today we will conclude our discussion of the dynamic scheduling techniques for instruction-level parallelism by introducing an ideal processor model to study:
- the limitations of ILP; and
- the implementation of these concepts in the Intel P6 micro-architecture
Limitations of ILP – The Ideal Processor
To understand the limitations of ILP, let us first define an ideal processor.
- An ideal processor is one that doesn't place artificial constraints on ILP; and
- the only limits in such a processor are those imposed by the actual data flows through either registers or memory
Assumptions for an Ideal Processor
An ideal processor is, therefore, one wherein:
a) all control dependencies, and
b) all but true data dependencies
are eliminated.
The control dependencies are eliminated by assuming that branch and jump predictions are perfect, i.e.,
Assumptions – Control Dependencies
i.e.,
- all conditional branches and jumps (including indirect jumps used for returns, etc.) are predicted perfectly; and
- the processor has perfect speculation and an unbounded buffer of instructions available for execution
Assumptions – Data Dependencies
All but true data dependencies are eliminated by assuming that:
a) An infinite number of virtual registers is available, facilitating:
- register renaming, thus avoiding WAW and WAR hazards; and
- simultaneous execution of an unlimited number of instructions
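The effect of renaming through virtual registers can be illustrated with a toy sketch (the renaming scheme, instruction sequence, and register names below are invented for illustration, not a model of any real processor): each destination is given a fresh virtual register, so only true (RAW) dependences survive.

```python
# Toy register renamer: maps each architectural destination register to a
# fresh virtual register, removing WAW/WAR name dependences. Each
# instruction is (dest_reg, [source_regs]); only RAW dependences remain.
def rename(instructions):
    mapping = {}        # architectural register -> current virtual register
    counter = 0
    renamed = []
    for dest, srcs in instructions:
        # Read sources through the current mapping: true RAW deps survive.
        new_srcs = [mapping.get(s, s) for s in srcs]
        # Allocate a fresh virtual register for the destination,
        # so later writers of the same register no longer conflict.
        counter += 1
        vreg = f"v{counter}"
        mapping[dest] = vreg
        renamed.append((vreg, new_srcs))
    return renamed

# WAR on r2 and WAW on r1 disappear after renaming; the RAW dependence
# of the third instruction on the second (through r2) is preserved.
prog = [("r1", ["r2", "r3"]),   # r1 = r2 + r3
        ("r2", ["r4", "r5"]),   # r2 = r4 + r5   (WAR on r2)
        ("r1", ["r2", "r6"])]   # r1 = r2 + r6   (WAW on r1, RAW on r2)
print(rename(prog))
```

After renaming, the first two instructions can execute simultaneously, since no name dependence links them any more.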
Assumptions – Data Dependencies
b) All memory addresses are known exactly, making it possible to move a load ahead of a store, provided that the two addresses are not identical
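The address check behind this assumption can be sketched as follows (a simplified illustration with invented names; real processors must handle unknown addresses and partial overlaps in hardware):

```python
# Toy memory disambiguation check: a load may be hoisted above an earlier
# store only when their byte ranges are known not to overlap.
def may_hoist_load(load_addr, load_size, store_addr, store_size):
    # The two accesses are independent iff one ends before the other begins.
    return (load_addr + load_size <= store_addr or
            store_addr + store_size <= load_addr)

# A 4-byte load at 0x100 may pass a 4-byte store at 0x200,
# but not a store that touches any of the same bytes.
print(may_hoist_load(0x100, 4, 0x200, 4))
print(may_hoist_load(0x100, 4, 0x102, 4))
```

The ideal processor assumes this test can always be answered exactly; a real processor often does not know the store address yet and must either wait or speculate.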
Ideal Hardware Model
Hence, combining these assumptions, an ideal processor:
– can issue an unlimited number of instructions, including loads and stores, in one cycle
– has functional units with latencies of one cycle, so a sequence of dependent instructions can issue on successive cycles
Ideal Hardware Model
– any instruction in the execution of a program can be scheduled on the cycle immediately following the execution of the predecessors on which it depends; and
– the last dynamically executed instruction in the program can be scheduled on the very first cycle
Performance of a Nearly Ideal Processor
Now let us examine the ILP in one of the most advanced superscalar processors, the Alpha 21264.
The Alpha 21264 has the following features:
– issues up to 4 instructions per cycle
– initiates execution of up to 6 instructions per cycle
– supports a large set of renaming registers (41 integer and 41 floating-point)
– uses a large tournament-type branch predictor
Performance of a Nearly Ideal Processor
In order to examine the ILP in this processor, a set of six (6) SPEC92 benchmarks (programs), compiled with the MIPS optimizing compiler, is run.
The features of these benchmarks are:
Performance of a Nearly Ideal Processor
Three of these benchmarks are floating-point benchmarks:
- fpppp
- doduc
- tomcatv
The integer programs are as follows:
– gcc
– espresso
– li
Performance of a Nearly Ideal Processor
Now let us look at the performance of the Alpha 21264 in terms of the average amount of parallelism, defined as the number of instruction issues per cycle, for these benchmarks.
Fig. 3.35
Performance of a Nearly Ideal Processor
Here, you can see that fpppp and tomcatv have extensive parallelism and hence high instruction-issue rates.
In doduc, by contrast, the parallelism does not occur in a simple loop as it does in fpppp and tomcatv.
The integer program li is a LISP interpreter that has many short dependences and so offers the lowest parallelism.
Performance of a Nearly Ideal Processor
Now let us discuss how the parameters that define the performance of a realizable processor limit the exploitable ILP.
The important parameters to be studied are:
– window size and issue count
– branch and jump predictors
– finite number of registers
Window Size and Issue Count
In dynamic scheduling, every pending instruction must look at every completing instruction for either of its operands.
A window in an ILP processor is defined as:
‘‘a set of instructions which is examined for simultaneous execution’’
Window Size and Issue Count
The start of the window is the earliest uncompleted instruction, and the last instruction in the window determines its size.
As each instruction in the window must be kept in the processor until its execution completes, the total window size is limited by the required storage, the number of comparisons, and a limited issue rate.
Window Size and Issue Count
The number of comparisons required every clock cycle is equal to the product:
maximum completion rate × window size × number of operands per instruction
For example, if
maximum completion rate = 6 instructions per cycle
window size = 80 instructions
number of operands per instruction = 2 operands
then
maximum comparisons required = 6 × 80 × 2 = 960
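The calculation above can be sketched directly (using the same figures from the example; the function name is just for illustration):

```python
# Operand comparisons needed per clock cycle in a dynamically scheduled
# window: every completing result must be checked against every operand
# of every instruction waiting in the window.
def comparisons_per_cycle(completion_rate, window_size, operands_per_instr):
    return completion_rate * window_size * operands_per_instr

print(comparisons_per_cycle(6, 80, 2))  # -> 960
```

The quadratic-looking growth (completion rate times window size) is why large windows are expensive: doubling the window to 160 instructions already demands 1920 comparisons per cycle.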
Window Size and Issue Count
In real processors, the maximum number of instructions that may issue, execute, and commit in the same clock cycle is smaller than the window size.
Now let us see the effect of restricting the window size on the number of instruction issues per cycle.