CS 704
Advanced Computer Architecture
Lecture 12
Instruction Level Parallelism
(Introduction to the multi-cycle pipelined datapath)
Prof. Dr. M. Ashraf Chughtai
Today’s Topics
Recap: Pipelining Basics
Longer Pipelines – FP Instructions
Loop Level Parallelism
FP Loop Hazards
Summary
MAC/VU-Advanced
Computer Architecture
Lecture 12 –Instruction Level
Parallelism (1)
2
Recap: Pipelined datapath and control
In the previous lecture we reviewed the
pipelined datapath to understand the basics
of ILP: overlapping the execution of
instructions to enhance performance
Key components of the pipelined datapath
Performance enhancement due to pipelining:
– Pipelining improves instruction throughput
(bandwidth) but does not reduce the latency
of an individual instruction
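The bandwidth-versus-latency point can be checked with a little arithmetic. The sketch below assumes an ideal pipeline with no stalls; the function name and the sample numbers (1000 instructions, 5 stages, 1 ns cycle) are illustrative assumptions, not figures from the lecture.

```python
def pipeline_timing(n_instructions, n_stages, cycle_time_ns):
    """Ideal pipeline with no stalls (an assumption of this sketch)."""
    # Every instruction still passes through all stages, so the latency
    # of a single instruction is not reduced by pipelining.
    latency_ns = n_stages * cycle_time_ns
    # The first instruction needs n_stages cycles; after that, one
    # instruction completes per cycle.
    total_cycles = n_stages + (n_instructions - 1)
    pipelined_ns = total_cycles * cycle_time_ns
    # Without overlap, each instruction occupies the datapath for its
    # full latency.
    unpipelined_ns = n_instructions * latency_ns
    return latency_ns, pipelined_ns, unpipelined_ns


latency, pipelined, unpipelined = pipeline_timing(1000, 5, 1.0)
print(latency, pipelined, unpipelined)      # 5.0 1004.0 5000.0
print(round(unpipelined / pipelined, 2))    # 4.98
```

The speedup approaches the number of stages as the instruction count grows, while the per-instruction latency stays at 5 ns.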
Recap: Pipeline Hazards
Structural hazards
Recap: Pipeline Hazards
….. Cont’d
Data Hazards
Recap: Three Generic Data Hazards
Read After Write (RAW): true data dependence
– instr j tries to read an operand before instr i writes it;
i: add r1,r2,r3
j: sub r4,r1,r3
Recap: Three Generic Data Hazards
Write After Read (WAR): anti-dependence
– instr j tries to write an operand before instr i reads it;
i: sub r4,r1,r3
j: add r1,r2,r3
– Also called a name dependence; it can be removed by register renaming
Recap: Three Generic Data Hazards
Write After Write (WAW): output dependence
– instr j tries to write an operand before instr i writes it;
i: sub r1,r4,r3
j: add r1,r2,r3
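The three cases can be told apart mechanically by comparing the destination and source registers of the two instructions. The following is a minimal sketch that assumes the three-operand `op rd,rs,rt` form used in the examples; the helper names `parse` and `hazards` are hypothetical. It reports register dependences only; whether a dependence actually causes a stall depends on the pipeline.

```python
def parse(instr):
    """Split 'op rd,rs,rt' into (destination, [sources])."""
    _, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return regs[0], regs[1:]


def hazards(instr_i, instr_j):
    """Potential data hazards when instr_j follows instr_i."""
    dest_i, srcs_i = parse(instr_i)
    dest_j, srcs_j = parse(instr_j)
    found = set()
    if dest_i in srcs_j:      # j reads what i writes
        found.add("RAW")
    if dest_j in srcs_i:      # j writes what i reads
        found.add("WAR")
    if dest_i == dest_j:      # both write the same register
        found.add("WAW")
    return found


print(hazards("add r1,r2,r3", "sub r4,r1,r3"))  # {'RAW'}
print(hazards("sub r4,r1,r3", "add r1,r2,r3"))  # {'WAR'}
print(hazards("sub r1,r4,r3", "add r1,r2,r3"))  # {'WAW'}
```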
Recap: Pipeline Hazards
….. Cont’d
Control hazards
How to overcome Hazards?
Stall
Recap: How to remove Hazards?
Structural Hazards:
Multiple functional units
Data Hazards:
Forwarding or bypassing
Control Hazards:
Branch prediction, delayed branch
Instruction Level Parallelism
Processor performance can be improved by increasing:
– clock speed
– the number of instructions that can
execute in parallel, i.e., increasing
ILP
How to achieve Instruction Level Parallelism?
A superscalar processor:
- Pre-fetches and decodes instructions
- Starts several branch instruction streams
- Finally, discards all but the correct stream
Superscalar Design
MIPS Longer Pipelines – FP Instructions
For example, adding two FP numbers takes a
minimum of four steps, performed in the
following sequence:
Flow diagram of MIPS FP Adder
[Figure: flow diagram of the FP adder, p. 284]
Steps for FP Addition
Step 1: Compare the exponents of the two numbers;
shift the smaller number to the right until its
exponent matches the larger exponent
Step 2: Add the significands
Step 3: Normalize the sum: either shift right and
increment the exponent, or shift left and decrement
the exponent
Step 4: If there is no overflow or underflow, round the
significand to the appropriate number of bits
Stop if no further normalization is required;
otherwise go back to Step 3
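The four steps can be traced in a few lines of code. This is only a sketch under simplifying assumptions that are not from the lecture: decimal rather than binary significands, four significand digits, positive operands only (so the shift-left case of Step 3 never arises), one guard digit, round-half-up, and no overflow or underflow check. The name `fp_add` and the (significand, exponent) encoding are hypothetical.

```python
DIGITS = 4  # significand digits kept (assumption for this sketch)


def fp_add(sig_a, exp_a, sig_b, exp_b):
    """Add two positive decimal floats given as (significand, exponent).

    A significand is an integer with DIGITS digits, so (9999, 1)
    stands for 9.999 x 10^1.
    """
    # Step 1: compare exponents; shift the smaller number right
    # until its exponent matches the larger one.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    a = sig_a * 10                               # append one guard digit
    b = (sig_b * 10) // 10 ** (exp_a - exp_b)    # shifted-out digits are dropped
    # Step 2: add the significands.
    total, exp = a + b, exp_a
    while True:
        # Step 3: normalize (shift right, increment exponent).
        while total >= 10 ** (DIGITS + 1):
            total //= 10
            exp += 1
        # Step 4: round to DIGITS digits using the guard digit.
        sig = (total + 5) // 10
        if sig < 10 ** DIGITS:
            return sig, exp      # still normalized: stop
        total = sig * 10         # rounding carried out: back to Step 3


print(fp_add(9999, 1, 1610, -1))  # (1002, 2), i.e. 1.002 x 10^2
print(fp_add(9999, 0, 5000, -4))  # (1000, 1), i.e. 1.000 x 10^1
```

The second call shows why Step 4 may loop back to Step 3: 9.999 + 0.0005 rounds up to 10.00, which must be normalized again to 1.000 x 10^1.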
MIPS Longer Pipelines
…… Cont’d
- The latency of a functional unit is defined as:
the number of intervening cycles between an instruction
that produces a result and an instruction that uses the
result of that operation
- The initiation or repeat interval is defined as:
the number of cycles that must elapse between
issuing two operations of the same type
MIPS Longer Pipelines
…… Cont’d
Functional unit               Latency   Initiation (repeat) interval
Integer ALU                   0         1
Data Memory (Int / FP Load)   1         1
FP ADD                        3         1
FP / Integer Multiply         6         1
FP / Integer Divide           24        25
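To see what these two numbers mean in practice, the values can be put in a lookup table and queried. The sketch below is illustrative only: the dictionary keys and function names are hypothetical, and it assumes that a dependent instruction issued immediately after its producer stalls for exactly the latency.

```python
# (latency, initiation interval) per functional unit, taken from the
# latency / initiation interval values listed above
UNITS = {
    "int_alu":  (0, 1),
    "load":     (1, 1),
    "fp_add":   (3, 1),
    "multiply": (6, 1),
    "divide":   (24, 25),
}


def stalls_for_dependent_use(producer):
    """Stall cycles if the very next instruction uses the result."""
    latency, _ = UNITS[producer]
    return latency


def issue_cycles(unit, count):
    """Earliest issue cycles for back-to-back operations on one unit."""
    _, interval = UNITS[unit]
    return [1 + k * interval for k in range(count)]


print(stalls_for_dependent_use("fp_add"))  # 3
print(issue_cycles("fp_add", 3))           # [1, 2, 3]
print(issue_cycles("divide", 3))           # [1, 26, 51]
```

An initiation interval of 1 means a new operation can start every cycle, while an interval of 25 means a second divide must wait until the first has left the unit.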
Typical MIPS FP Pipeline
Let us consider a typical MIPS FP pipeline
with three un-pipelined FP functional units
[Figure: Fig. A.29 (page A-48)]
MIPS FP Pipeline with Pipelined FUs
The previous FP pipeline can be extended
by adding pipeline stages to the
functional units
[Figure: Fig. A.31 (page A-50)]
Working of extended FP Pipeline
Note that additional pipeline registers have
been inserted between the intervening stages,
e.g., A1/A2, A2/A3, …..
Furthermore, the ID/EX register must be
expanded to connect ID to the A1, M1, EX and
DIV functional units
Here, the FP divide unit is not pipelined; it
requires 24 clock cycles to complete
FP Pipeline Timing: Example
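A timing chart for the pipelined functional units can be generated with a short script. This sketch makes several assumptions that are not stated in the lecture: the FP adder has four stages (A1 to A4) and the multiplier seven (M1 to M7), chosen to be consistent with latencies of 3 and 6; the four instructions are independent; and one instruction issues per cycle with no stalls. The instruction mix and helper names are illustrative.

```python
# Stage sequence per instruction type (stage counts are assumptions)
STAGES = {
    "MUL.D": ["IF", "ID"] + [f"M{k}" for k in range(1, 8)] + ["MEM", "WB"],
    "ADD.D": ["IF", "ID"] + [f"A{k}" for k in range(1, 5)] + ["MEM", "WB"],
    "L.D":   ["IF", "ID", "EX", "MEM", "WB"],
    "S.D":   ["IF", "ID", "EX", "MEM", "WB"],
}


def timing(program):
    """Map each instruction to {cycle: stage}; instruction k issues in cycle k+1.

    Assumes independent instructions, so no stalls are modelled.
    """
    rows = []
    for k, op in enumerate(program):
        rows.append({k + 1 + s: stage for s, stage in enumerate(STAGES[op])})
    return rows


def render(program):
    """Format the timing as one text row per instruction."""
    rows = timing(program)
    last = max(max(row) for row in rows)
    lines = []
    for op, row in zip(program, rows):
        cells = [row.get(c, "").ljust(4) for c in range(1, last + 1)]
        lines.append(op.ljust(6) + "".join(cells).rstrip())
    return "\n".join(lines)


program = ["MUL.D", "ADD.D", "L.D", "S.D"]
rows = timing(program)
print(render(program))
```

Under these assumptions the instructions finish out of order: the load issued third writes back in cycle 7, before the multiply issued first, which reaches WB in cycle 11.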