CS 704
Advanced Computer Architecture
Lecture 17
Instruction Level Parallelism
(High-Performance Instruction Delivery - Multiple Issue)
Prof. Dr. M. Ashraf Chughtai
Recap:
MAC/VU-Advanced
Computer Architecture
Lecture 17 – Instruction Level
Parallelism -Dynamic (6)
High-Performance Processors
Reducing Branch Penalties for High-Performance Processors
Branch Target Buffer
Integrated Instruction Fetch Units
Return Address Predictors
1: Branch Target Buffer
2: Integrated Instruction Fetch Units
Integrated Branch Prediction
Instruction Prefetch
Instruction memory access and buffering
2: Integrated Instruction Fetch Units
….. Cont’d
Integrated Branch Prediction
The branch predictor is part of the
instruction fetch unit, so its predictions
directly drive the fetch pipeline
2: Integrated Instruction Fetch Units
….. Cont’d
Instruction Prefetch
An instruction prefetch queue is part of
the IIFU
The queue holds multiple instructions and
can deliver more than one instruction per
cycle
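The queue behaviour described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class name `PrefetchQueue`, the capacity of 4, and the delivery width of 2 are hypothetical choices, not details from the lecture.

```python
from collections import deque


class PrefetchQueue:
    """Toy model of an instruction prefetch queue (hypothetical, for illustration)."""

    def __init__(self, capacity=8):
        self.capacity = capacity      # assumed queue size
        self.queue = deque()

    def prefetch(self, instructions):
        """Append fetched instructions until the queue is full; return how many fit."""
        accepted = 0
        for instr in instructions:
            if len(self.queue) >= self.capacity:
                break
            self.queue.append(instr)
            accepted += 1
        return accepted

    def deliver(self, width=2):
        """Remove and return up to `width` instructions, oldest first."""
        out = []
        while self.queue and len(out) < width:
            out.append(self.queue.popleft())
        return out


pq = PrefetchQueue(capacity=4)
accepted = pq.prefetch(["i0", "i1", "i2", "i3", "i4"])  # only four fit
first = pq.deliver(width=2)    # two instructions in one "cycle"
second = pq.deliver(width=2)
third = pq.deliver(width=2)    # queue is empty by now
```

Because the queue decouples fetch from issue, the issue stage can take two instructions in one cycle even if the fetch that supplied them happened earlier.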
2: Integrated Instruction Fetch Units
….. Cont’d
Instruction-Memory access and buffering
Fetching multiple instructions per clock
cycle may require accessing multiple
cache lines, which is a complex
operation
The IIFU handles this complexity and
hides the cost of crossing cache-block
boundaries
The IIFU also buffers instructions and
supplies them to the issue stage on demand
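To see when a multi-instruction fetch crosses a cache-block boundary, the short sketch below counts the blocks that a fetch group touches. The function name `blocks_touched`, the 4-byte instruction size, and the 32-byte block size are assumptions chosen for illustration.

```python
def blocks_touched(pc, n_instr, instr_bytes=4, block_bytes=32):
    """Number of cache blocks covered by n_instr sequential instructions at pc.

    instr_bytes and block_bytes are assumed values, not taken from the lecture.
    """
    first_block = pc // block_bytes
    last_byte = pc + n_instr * instr_bytes - 1
    last_block = last_byte // block_bytes
    return last_block - first_block + 1


aligned = blocks_touched(pc=0, n_instr=4)      # bytes 0..15, one block
straddle = blocks_touched(pc=24, n_instr=4)    # bytes 24..39, two blocks
pair_split = blocks_touched(pc=28, n_instr=2)  # bytes 28..35, two blocks
```

Whenever the result is greater than 1, the fetch unit needs more than one cache access (or a prefetched copy of the next block) to deliver the whole group in a single cycle.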
3: Return Address Predictors
The return-address predictor handles
indirect jumps, i.e., jumps whose target
address varies at run time
High-level language programs generate
such jumps for indirect procedure calls and
select or case statements; procedure
returns are also indirect jumps, and they are
the case this predictor targets
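A common way to predict procedure returns is a small stack of return addresses: push on a call, pop on a return. The sketch below assumes that scheme; the class name `ReturnAddressStack`, the depth, and the policy of discarding the oldest entry on overflow are illustrative assumptions rather than details given in the lecture.

```python
class ReturnAddressStack:
    """Toy return-address predictor built as a bounded stack (assumed design)."""

    def __init__(self, depth=8):
        self.depth = depth    # assumed number of entries
        self.stack = []

    def on_call(self, return_addr):
        """Record the address the matching return should jump to."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overflow: discard the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        """Predict the target of a return; None means no prediction is available."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
ras.on_call(0x100)
ras.on_call(0x200)
ras.on_call(0x300)            # overflows: the entry 0x100 is lost
p1 = ras.predict_return()     # innermost call returns first
p2 = ras.predict_return()
p3 = ras.predict_return()     # stack is empty
```

Calls and returns nest, so a stack matches their order; predictions fail only when the call depth exceeds the stack depth.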
Summary: Minimizing Control Hazard Penalties
Multiple Instruction-Issue Processors
All of the schemes described so far can at
best achieve 1 instruction/cycle
Two kinds of processor go beyond this limit
by issuing multiple instructions per cycle:
- Superscalar processors
- Very Long Instruction Word (VLIW)
processors
1: Superscalar
– Static Scheduling
– Dynamic Scheduling
Superscalar
Statically scheduled superscalar processors
use in-order execution
Dynamically scheduled ones use out-of-order
execution
The superscalar concept has been used in:
– IBM Power2
– Sun UltraSPARC
– Pentium III/4
– DEC Alpha
– HP 8000
2: Very Long Instruction Word (VLIW) Processors
2: VLIW / EPIC Machines
2: VLIW Processors
VLIW includes new features for:
- predication,
- rotating registers, and
- speculation, etc.
Typical implementations are:
– i860, Trimedia, Itanium
We will talk about statically scheduled
superscalar today and about compiling
for VLIW/EPIC later
Statically Scheduled Superscalar Processor
Statically Scheduled Superscalar Processor
Instruction Issue Process:
Issuing multiple instructions per cycle is a
complex process
During instruction fetch, the pipeline
receives all the instructions that could
potentially issue, called the issue packet (it
may hold, say, 1 to 4 instructions)
Example: Statically Scheduled Superscalar MIPS Processor
As an example, let us consider a superscalar MIPS
processor with the following issue capability:
Number of instructions issued per clock:
2 instructions - 1 FP operation and 1 integer operation
(Integer operations include loads/stores to integer
or FP registers, branches, and integer ALU operations)
Issuing two instructions per cycle requires
fetching and decoding 64 bits (two 32-bit
instructions) per clock cycle
Example:
… Cont’d
Fetching two instructions needs careful handling
of the cache,
because the pair may straddle a cache-block boundary:
the first instruction may be at the end of one
cache block
and the second at the beginning of the next
Hazard detection
Restricting each pair to one FP and one integer
instruction keeps the hazard checking simple.
Example:
… Cont’d
We simply have to check for hazards between
the two instructions in an issue packet
If a hazard exists, the simple solution is to
treat it as a structural hazard (issue only 1 of
them)
However, difficulties arise when the integer
instruction is an FP load, store, or move:
it may create contention for the FP register port, and
it creates a RAW hazard when the second instruction of
the pair depends on the first
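The pair check described above can be sketched as follows. This is a minimal illustration, assuming each instruction is described only by its class ("int" or "fp"), its destination register, and its source registers; the names `Instr` and `issue_count` are hypothetical. Hazards against older instructions already in the pipeline and FP register-port contention are deliberately not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    kind: str                    # "int" (ALU, load/store, branch) or "fp"
    dest: Optional[str] = None   # destination register, if any
    srcs: tuple = ()             # source registers


def issue_count(first, second):
    """Return how many instructions of the pair issue together (1 or 2)."""
    one_int_one_fp = {first.kind, second.kind} == {"int", "fp"}
    # RAW hazard inside the packet: the second reads what the first writes.
    raw = first.dest is not None and first.dest in second.srcs
    return 2 if one_int_one_fp and not raw else 1


load = Instr("int", "F0", ("R1",))             # e.g. L.D   F0,0(R1)
add_dep = Instr("fp", "F4", ("F0", "F2"))      # e.g. ADD.D F4,F0,F2
add_indep = Instr("fp", "F8", ("F6", "F2"))    # e.g. ADD.D F8,F6,F2
alu = Instr("int", "R1", ("R1",))              # e.g. an integer ALU operation

dependent_pair = issue_count(load, add_dep)      # FP load feeds the FP add
independent_pair = issue_count(load, add_indep)
two_integer = issue_count(alu, load)             # both need the integer slot
```

When the check fails, only the first instruction issues and the second waits for the next cycle, which is the "treat it as a structural hazard" policy from the slide.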
Example
Issuing: if placement is not a problem,
fetch and issue complete in three
steps:
1. Fetch two instructions from the cache
2. Determine whether 0, 1, or 2 instructions
can issue
3. Issue them to the correct functional unit
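The three steps can be traced with a small cycle-by-cycle sketch. It is a simplified model under stated assumptions: instructions are tuples of (class, destination, sources), the result latencies in `LATENCY` are invented for illustration, and an instruction issues only when all of its source registers are available. The function name `simulate_issue` is hypothetical.

```python
LATENCY = {"int": 1, "fp": 3}   # assumed result latencies in cycles


def simulate_issue(program):
    """Return the number of instructions issued in each cycle.

    program is a list of (kind, dest, srcs) tuples in program order.
    """
    ready_at = {}                 # register -> first cycle its value is usable
    pc, cycle, counts = 0, 0, []
    while pc < len(program):
        packet = program[pc:pc + 2]                    # step 1: fetch two
        issued = []
        for kind, dest, srcs in packet:                # step 2: 0, 1 or 2?
            operands_ready = all(ready_at.get(r, 0) <= cycle for r in srcs)
            slot_free = all(kind != k for k, _, _ in issued)
            if not (operands_ready and slot_free):
                break             # in-order issue: later instructions wait too
            issued.append((kind, dest, srcs))
            if dest is not None:
                ready_at[dest] = cycle + LATENCY[kind]
        counts.append(len(issued))                     # step 3: issue them
        pc += len(issued)
        cycle += 1
    return counts


program = [
    ("int", "F0", ("R1",)),        # FP load into F0
    ("fp", "F8", ("F6", "F2")),    # independent FP add
    ("fp", "F4", ("F0", "F2")),    # FP add that needs the loaded F0
    ("int", None, ("F4", "R1")),   # store of F4
]
counts = simulate_issue(program)   # [2, 1, 0, 0, 1]
```

In this trace the first cycle issues a full pair, the store then stalls for two cycles waiting on the FP add, and those stalls show up as cycles in which zero instructions issue.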
Example superscalar pipeline in operation
Let us see what the instructions look like when they go
through the pipeline in pairs
[Figure: pipeline timing diagram for paired instructions; only the EX and WB stage labels are recoverable]
Dynamic Scheduling in Superscalar Processors
1. Extending Tomasulo’s concept to support a
two-instruction-issue superscalar pipeline
Here, we do not want to issue instructions to
the reservation stations out of order, as this may
violate program semantics.
Further, to gain the full advantage of dynamic
scheduling, we remove the constraint of
issuing one FP and one integer instruction per
clock.
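A minimal sketch of in-order, two-wide issue to reservation stations is given below. It assumes that the only thing that can block issue is a full set of reservation stations for the required unit, since an instruction with pending operands simply waits in its station. The function name `issue_cycle` and the station counts are hypothetical.

```python
def issue_cycle(pending, free_slots, width=2):
    """Issue up to `width` instructions, in program order, in one cycle.

    pending:    list of unit names ("fp", "int", ...) in program order
    free_slots: dict mapping unit name -> free reservation stations
    Both arguments are updated in place; the issued unit names are returned.
    """
    issued = []
    while pending and len(issued) < width:
        unit = pending[0]
        if free_slots.get(unit, 0) == 0:
            break      # in-order issue: a blocked instruction blocks the rest
        free_slots[unit] -= 1
        issued.append(pending.pop(0))
    return issued


pending = ["fp", "fp", "int", "fp"]
free_slots = {"fp": 2, "int": 1}      # assumed station counts
cycle1 = issue_cycle(pending, free_slots)   # two FP instructions together
cycle2 = issue_cycle(pending, free_slots)   # int issues, next FP finds no station
cycle3 = issue_cycle(pending, free_slots)   # still no free FP station
```

The first cycle issues two FP instructions at once, which the statically scheduled example did not allow, while program order is preserved because issue stops at the first instruction that cannot get a station.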