CS 704
Advanced Computer Architecture
Lecture 11
Computer Hardware Design
(Pipeline and Instruction Level Parallelism)
Prof. Dr. M. Ashraf Chughtai
Today’s Topics
Recap Lecture 10
Structural Hazards
Data Hazards
Control Hazards
MAC/VU-Advanced
Computer Architecture
Lecture 11 –Computer Hardware
Design (5)
2
Recap: Lecture 10
Multi-cycle datapath versus pipelined
datapath
Key components of pipeline data path
Performance enhancement due to pipeline
Introduction to hazards in pipelined
datapath
Structural Hazards
An attempt to use the same resource in two
different ways at the same time, e.g.,
a single memory port accessed for both an
instruction fetch and a data read in the same
clock cycle, is a structural hazard
Example: next slide
Single Memory is a Structural Hazard
[Figure: pipeline diagram, instruction order vs. time (clock cycles), for Instr 1 (Load) through Instr 5; each instruction passes through Mem, Reg, ALU, Mem, Reg in successive cycles.]
Two memory read operations occur in the 4th cycle:
the Load instruction accesses memory to read data while the
4th instruction is fetched from the same memory
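The conflict described above can be checked mechanically. The following is a minimal sketch, assuming one instruction is issued per cycle, every instruction uses memory in its fetch stage, and only loads and stores use memory in their MEM stage. The function name `memory_conflicts` and the instruction kinds are hypothetical, chosen for illustration.

```python
# Stage order of the 5-stage pipeline used in this lecture.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(instrs):
    """Return the cycles in which a single-ported memory is requested twice.

    instrs: list of instruction kinds, one issued per cycle from cycle 1.
    Assumption of this sketch: only "load" and "store" access memory in
    their MEM stage; every instruction accesses memory in its IF stage.
    """
    users = {}  # cycle -> list of (instruction number, stage)
    for i, kind in enumerate(instrs):
        start = i + 1  # cycle in which this instruction is fetched
        users.setdefault(start, []).append((i + 1, "IF"))
        if kind in ("load", "store"):
            mem_cycle = start + STAGES.index("MEM")
            users.setdefault(mem_cycle, []).append((i + 1, "MEM"))
    # Keep only cycles with more than one request to the memory port.
    return {c: u for c, u in users.items() if len(u) > 1}

conflicts = memory_conflicts(["load", "alu", "alu", "alu", "alu"])
print(conflicts)  # {4: [(1, 'MEM'), (4, 'IF')]}
```

Under these assumptions the only double request is in cycle 4, between the data access of instruction 1 (the Load) and the fetch of instruction 4.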
Single Memory is a Structural Hazard
[Figure: the same pipeline diagram, instruction order vs. time (clock cycles), for Instr 1 (Load), Instr 2, Instr 3 and Instr 4, with a stall (bubble) that delays the fetch of Instr 4.]
Insert stall (bubble) to avoid memory
structural hazard
Structural Hazards
A structural hazard also exists when
the single write port of the register file is needed for two
WB operations in the same clock cycle
– this situation does not arise when every instruction uses all 5 pipeline stages
– but it may arise in a pipeline that mixes 4-stage and 5-stage
instructions
Explanation next ...
Pipelining the Load Instruction
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw    Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw             Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                      Ifetch   Reg/Dec  Exec     Mem      Wr
The five independent functional units in the pipeline
datapath are: Inst. Fetch, Dec/Reg. Rd, the ALU for Exec, Data
Mem, and the Register File’s write port for the Wr stage
The register file has separate read and write ports, so a
register read and a register write are allowed at the same time
Each functional unit is used once per instruction
The Four Stages of R-type
          Cycle 1  Cycle 2  Cycle 3  Cycle 4
Rtype     Ifetch   Reg/Dec  Exec     Wr
An R-type instruction does not access data memory,
so it takes only 4 clocks (4 stages) to
complete
Here, the ALU operates on the register
operands
The result is written into the register file during the Wr
stage
Pipelining the R-type and Load
Instruction
Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
Rtype     Ifetch   Reg/Dec  Exec     Wr
Rtype              Ifetch   Reg/Dec  Exec     Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
Rtype                                Ifetch   Reg/Dec  Exec     Wr
Rtype                                         Ifetch   Reg/Dec  Exec     Wr
Oops! We have a problem!
We have a pipeline conflict, i.e. a structural hazard:
– Two instructions try to write to the register file at the
same time!
– There is only one write port
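The write-port conflict can be reproduced with a small calculation. This is a sketch under stated assumptions: one instruction is issued per cycle, R-type instructions take 4 stages, loads take 5, and the register write is always the last stage. The names `write_port_conflicts` and `MIXED` are hypothetical.

```python
def write_cycles(instrs, stages_of):
    """Map each cycle to the instructions that write the register file in it."""
    writes = {}
    for i, kind in enumerate(instrs):
        issue = i + 1                           # cycle of the Ifetch stage
        wr_cycle = issue + stages_of[kind] - 1  # Wr is the last stage
        writes.setdefault(wr_cycle, []).append((i + 1, kind))
    return writes

def write_port_conflicts(instrs, stages_of):
    """Cycles in which more than one instruction needs the single write port."""
    writes = write_cycles(instrs, stages_of)
    return {c: w for c, w in writes.items() if len(w) > 1}

# Assumed stage counts: R-type skips Mem, Load uses all five stages.
MIXED = {"rtype": 4, "load": 5}
program = ["rtype", "rtype", "load", "rtype", "rtype"]

print(write_port_conflicts(program, MIXED))  # {7: [(3, 'load'), (4, 'rtype')]}
```

With this instruction sequence, the Load issued in cycle 3 and the R-type issued in cycle 4 both reach their write stage in cycle 7.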
Important Observation
Each functional unit can only be used once per
instruction
Each functional unit must be used at the same
stage for all instructions. The R-type and Load mix breaks this rule:
– Load uses Register File’s Write Port during its
5th stage
– R-type uses Register File’s Write Port during its
4th stage
Two possible solutions: next
Solution 1: Insert “Bubble” into the Pipeline
[Figure: pipeline diagram over Cycles 1 to 9 for a Load and several R-type instructions (stages Ifetch, Reg/Dec, Exec, Mem, Wr), with a pipeline bubble inserted to delay the later R-type instructions.]
Insert a “bubble” into the pipeline to prevent 2 writes in the
same cycle
– The control logic can be complex.
– An instruction fetch and issue opportunity is lost.
No instruction is started in Cycle 6!
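One simple way to model bubble insertion is to delay the issue of an instruction whenever its write cycle is already taken. The sketch below makes that assumption. It is a simplified model, not the control logic of the slide's diagram, and the instruction sequence is an assumed one, so the cycle that loses its issue slot differs from the slide. The name `issue_with_bubbles` is hypothetical.

```python
def issue_with_bubbles(instrs, stages_of):
    """Greedy issue: delay an instruction until its Wr cycle is free.

    Returns a list of (kind, issue cycle, write cycle) tuples.
    """
    busy_wr = set()  # cycles in which the write port is already reserved
    schedule = []
    cycle = 1
    for kind in instrs:
        # Bubble: skip issue slots while the write port would be contested.
        while cycle + stages_of[kind] - 1 in busy_wr:
            cycle += 1
        wr = cycle + stages_of[kind] - 1
        busy_wr.add(wr)
        schedule.append((kind, cycle, wr))
        cycle += 1
    return schedule

# Assumed stage counts and instruction sequence (illustrative only).
MIXED = {"rtype": 4, "load": 5}
schedule = issue_with_bubbles(["rtype", "rtype", "load", "rtype", "rtype"], MIXED)
for entry in schedule:
    print(entry)
```

In this model the R-type that follows the Load is issued in cycle 5 instead of cycle 4, so every write lands in a distinct cycle at the cost of one lost issue slot.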
Solution 2: Delay R-type’s Write by One
Cycle
Delay the R-type’s register write
by one cycle:
– Now R-type instructions also use the Reg File’s write port at Stage 5
– For R-type, the Mem stage is a NO-OP stage: nothing is done in it.
Stage     1        2        3        4        5
Rtype     Ifetch   Reg/Dec  Exec     Mem      Wr

Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
Rtype     Ifetch   Reg/Dec  Exec     Mem      Wr
Rtype              Ifetch   Reg/Dec  Exec     Mem      Wr
Load                        Ifetch   Reg/Dec  Exec     Mem      Wr
Rtype                                Ifetch   Reg/Dec  Exec     Mem      Wr
Rtype                                         Ifetch   Reg/Dec  Exec     Mem      Wr
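The effect of delaying the R-type write can be seen by comparing write cycles before and after the change. This is a minimal sketch, assuming one instruction is issued per cycle and the register write is the last stage; the names `wr_cycles`, `MIXED` and `DELAYED` are hypothetical.

```python
def wr_cycles(program, stages_of):
    """Write-stage cycle of each instruction, issuing one per cycle from cycle 1."""
    return [(i + 1) + stages_of[kind] - 1 for i, kind in enumerate(program)]

program = ["rtype", "rtype", "load", "rtype", "rtype"]

MIXED = {"rtype": 4, "load": 5}    # R-type writes in its 4th stage
DELAYED = {"rtype": 5, "load": 5}  # R-type write delayed to stage 5

mixed = wr_cycles(program, MIXED)
delayed = wr_cycles(program, DELAYED)

print(mixed)    # [4, 5, 7, 7, 8]  -> two writes in cycle 7
print(delayed)  # [5, 6, 7, 8, 9]  -> one write per cycle
```

Once every instruction writes in its 5th stage, the write cycles simply follow the issue order, so no two instructions can need the write port in the same cycle.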
Eliminating Structural Hazards?
Structural hazards can be eliminated or
minimized by either using the stall operation
or adding multiple functional units
[Figure: program flow vs. time for Load, 2nd Inst., 3rd Inst., 4th Inst. and 5th Inst., each passing through IFetch, Dcd, Exec, Mem, WB, with a stall inserted in the instruction stream.]
Example: Dual-port vs.
Single-port
Machine A: dual-ported memory
Machine B: single-ported memory, but its
pipelined implementation has a 1.05 times
faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
SpeedUpA = [Pipeline Depth / (1 + 0)] x (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = [Pipeline Depth / (1 + 0.4 x 1)] x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
Stalls degrade performance
Here is an example:
Suppose data reference instructions constitute 40% of the mix,
each data reference causes a 1-cycle stall,
and the processor with the structural hazard has a clock rate 1.05
times higher than the processor without the hazard
Average instruction time = CPI x clock cycle time
= (1 + 0.4 x 1) x (clock cycle time_ideal / 1.05)
= (1.4 / 1.05) x clock cycle time_ideal
= 1.33 x clock cycle time_ideal (approximately)
The processor without the structural hazard is
about 1.33 times faster
than the one with the structural hazard
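The arithmetic of this example can be reproduced in a few lines. This is a sketch that takes the ideal clock cycle time as 1.0 (an arbitrary unit) and uses the hypothetical helper name `average_instruction_time`.

```python
def average_instruction_time(ideal_cpi, stall_fraction, stall_cycles, cycle_time):
    """Average instruction time = (ideal CPI + stall cycles per instruction) x cycle time."""
    return (ideal_cpi + stall_fraction * stall_cycles) * cycle_time

# Processor without the structural hazard: no stalls, ideal cycle time (taken as 1.0).
t_without = average_instruction_time(1.0, 0.0, 1, 1.0)

# Processor with the hazard: 40% of instructions stall 1 cycle, clock 1.05x faster.
t_with = average_instruction_time(1.0, 0.4, 1, 1.0 / 1.05)

ratio = t_with / t_without
print(round(ratio, 2))  # 1.33
```

The ratio is 1.4 / 1.05, so the faster clock recovers only a small part of the time lost to stalls.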
Additional functional units increase cost
The memory structural hazard is removed by
using two cache memory units:
– Instruction memory
– Data memory
Two write ports in the register file allow a mix of 4-stage
and 5-stage instructions in the pipeline
Data Hazards
An attempt to use an item before it is ready; e.g.,
one sock of a pair is in the dryer and one is in the
washer; you can’t fold the pair until the sock in the
washer has gone through the dryer
An instruction depends on the result of a prior
instruction that is still in the pipeline
Data Hazards
Pipelining changes the relative timing of
instructions by overlapping their execution
This overlap introduces data and control
hazards
A data hazard occurs when the pipeline changes the order of
read/write accesses to operands vis-à-vis the order seen in
sequential execution, so that a data
dependency between instructions is violated
Let us consider an example ...
Example Data Hazard on R1
Add R1, R2, R3
Sub R4, R1, R3
And R6, R1, R7
Or  R8, R1, R9
Xor R10, R1, R11
Data Hazards: Dependencies backwards
in time are hazards
[Figure: pipeline diagram, instruction order vs. time (clock cycles). Stages: IF (Im), ID/RF (Reg), EX (ALU), MEM (Dm), WB (Reg). Instructions: Add R1,R2,R3; Sub R4,R1,R3; And R6,R1,R7; Or R8,R1,R9; Xor R10,R1,R11.]
Without stalls, Sub reads R1 3 clock cycles before Add's result is available
in the register file, And 2 cycles before, and Or 1 cycle before
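These distances can be derived from the instruction sequence itself. The sketch below rests on a hypothetical timing model with no forwarding: instruction i (counting from 0) is fetched in cycle i + 1, reads its registers in the ID/RF stage (cycle i + 2), and a value written in WB (cycle i + 5) is readable from the following cycle. The helper names `parse` and `cycles_too_early` are illustrative, and the sketch ignores any later instruction that overwrites the same register.

```python
PROGRAM = [
    "Add R1,R2,R3",
    "Sub R4,R1,R3",
    "And R6,R1,R7",
    "Or R8,R1,R9",
    "Xor R10,R1,R11",
]

def parse(line):
    """Split 'Add R1,R2,R3' into (opcode, destination, [sources])."""
    op, operands = line.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]

def cycles_too_early(program, producer=0):
    """For each later reader of the producer's destination register, return
    how many cycles before the value is readable it would read the register."""
    _, dest, _ = parse(program[producer])
    readable_from = (producer + 5) + 1  # cycle after the producer's WB
    result = {}
    for i in range(producer + 1, len(program)):
        op, _, sources = parse(program[i])
        if dest in sources:
            read_cycle = i + 2          # ID/RF stage of instruction i
            result[op] = max(0, readable_from - read_cycle)
    return result

early = cycles_too_early(PROGRAM)
print(early)  # {'Sub': 3, 'And': 2, 'Or': 1, 'Xor': 0}
```

Under this model the value for Sub is also the number of stall cycles needed before its register read, and Xor needs none.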
Data Hazard Solution #1 - Stall
Insert stall cycles after the IF of the
next (dependent) instruction, before its
register read
[Figure: pipeline diagram, instruction order vs. time (clock cycles): add r1,r2,r3 (IF, ID/RF, EX, MEM, WB) is followed by three stalls, then sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11.]
XOR: No Data Hazard here, as register is read after
being written
[Figure: pipeline diagram, instruction order vs. time (clock cycles), for add r1,r2,r3 (IF, ID/RF, EX, MEM, WB); sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11.]
Data Hazard Solution - Forwarding
“Forward” the result from one stage to another:
from the EX/MEM pipeline register to the ALU stage of sub, and from the
MEM/WB pipeline register to the ALU stage of and
[Figure: pipeline diagram, instruction order vs. time (clock cycles), for add r1,r2,r3 (IF, ID/RF, EX, MEM, WB); sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with forwarding paths into the ALU stages of sub and and.]
No forwarding is needed for or (it reads r1 in the cycle in which add writes it),
as the register is written in
the first half and read in
the second half of the cycle
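The forwarding choices on this slide depend only on how far the consumer is behind the producer. The sketch below encodes that rule, assuming the producer is an ALU instruction in the 5-stage pipeline and that the register file is written in the first half of a cycle and read in the second half. The function name `operand_source` is hypothetical.

```python
def operand_source(distance):
    """Where an ALU operand comes from, given how many instructions
    earlier its producer (an ALU instruction) was issued."""
    if distance == 1:
        return "forward from EX/MEM"
    if distance == 2:
        return "forward from MEM/WB"
    if distance == 3:
        # Producer is in WB while the consumer is in ID/RF.
        return "register file (written in first half, read in second half)"
    return "register file"

# Consumers of r1 after 'add r1,r2,r3', with their distance from the add.
consumers = {"sub": 1, "and": 2, "or": 3, "xor": 4}
sources = {name: operand_source(d) for name, d in consumers.items()}
for name, src in sources.items():
    print(f"{name}: {src}")
```

Only sub and and need a forwarding path; or relies on the split-cycle register file, and xor reads the register after it has been written.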
Forwarding (or Bypassing):
What about Loads?
[Figure: pipeline diagram over time (clock cycles) for lw r1,0(r2) followed by sub r4,r1,r3; stages IF (Im), ID/RF (Reg), EX (ALU), MEM (Dm), WB (Reg).]
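Dependencies backwards in time are hazards
In this case, forwarding alone cannot solve the hazard:
the loaded value is available only at the end of lw's MEM stage,
after sub's EX stage has already started
The instruction dependent on the
load must be delayed (stalled)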
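The difference between an ALU producer and a load can be expressed as a small stall calculation. This is a sketch under the usual 5-stage assumptions: an ALU result exists at the end of EX, a loaded value at the end of MEM, full forwarding is available, and the consumer needs the value at the start of its EX stage. The function name `stalls_with_forwarding` is hypothetical.

```python
def stalls_with_forwarding(producer_kind, distance):
    """Stall cycles for a consumer issued `distance` instructions after the
    producer, when the consumer uses the value in its EX stage.

    Cycles are counted from the producer's IF stage = cycle 1.
    """
    # Cycle at the end of which the value exists: EX for ALU, MEM for load.
    ready_end = {"alu": 3, "load": 4}[producer_kind]
    consumer_ex = 1 + distance + 2  # consumer's EX cycle without stalls
    return max(0, ready_end + 1 - consumer_ex)

print(stalls_with_forwarding("alu", 1))   # 0: add followed by sub
print(stalls_with_forwarding("load", 1))  # 1: lw followed by sub
print(stalls_with_forwarding("load", 2))  # 0: one independent instruction in between
```

Under these assumptions an instruction that immediately follows the load it depends on needs one stall cycle even with forwarding.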