CS 704
Advanced Computer Architecture
Lecture 10
Computer Hardware Design
(Pipeline Datapath and Control Design)
Prof. Dr. M. Ashraf Chughtai
Recap: Lecture 9
Single-cycle versus multi-cycle datapath
Key components of the multi-cycle datapath
Design and information flow in the multi-cycle datapath
Multi-cycle control unit design
Finite state machine-based control unit
Microprogram-based controller
MAC/VU-Advanced
Computer Architecture
Lecture 10 –Computer Hardware
Design (4)
What is pipelining?
Pipelining is a fundamental concept
It utilizes capabilities of the Datapath by
Pipelining is Natural!
Laundry Example!
Four loads: A, B, C, D
Four laundry operations: wash, dry, fold, and place into drawers
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers
Sequential Laundry
[Figure: timeline from 6 PM to 2 AM. Loads A, B, C and D are done one after another in task order; each load passes through four 30-minute steps, for sixteen 30-minute slots (8 hours) in total.]
Pipelined Laundry: Start work ASAP
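[Figure: timeline from 6 PM to 2 AM. Loads A, B, C and D overlap in task order: each load starts as soon as the washer is free, so all four loads finish after seven 30-minute slots.]
Pipelined laundry takes 3.5 hours for 4 loads!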
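The two laundry timings can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are hypothetical, and it assumes every step takes the same 30 minutes, as in the example.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load needs `stages` slots; each later load adds one slot."""
    return (stages + loads - 1) * stage_minutes


if __name__ == "__main__":
    seq = sequential_minutes(loads=4, stages=4, stage_minutes=30)
    pipe = pipelined_minutes(loads=4, stages=4, stage_minutes=30)
    print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h")
```

With four loads and four 30-minute steps this gives 8 hours sequentially and 3.5 hours pipelined.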
Features of Pipelined Processor
All the functional units operate independently
Multiple tasks operate simultaneously using different resources
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
Potential speedup = number of pipeline stages
Pipelining Lessons
Pipeline rate is limited by the slowest pipeline stage
Speedup is reduced by:
- Time to “fill” the pipeline and time to “drain” it
- Unbalanced lengths of pipeline stages
If the washer takes longer than the dryer, the dryer has to wait!
Stall for dependences
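These lessons can be made concrete with a small model. The sketch below assumes a lockstep pipeline in which every stage advances at the pace of the slowest stage; the function names and the 40-minute washer are hypothetical values chosen for illustration.

```python
def pipeline_time(stage_times, tasks):
    """Total time when all stages advance at the pace of the slowest stage."""
    cycle = max(stage_times)
    # len(stage_times) slots for the first task (fill), one more per later task.
    return (len(stage_times) + tasks - 1) * cycle


def speedup(stage_times, tasks):
    """Unpipelined time divided by pipelined time for the same workload."""
    sequential = sum(stage_times) * tasks
    return sequential / pipeline_time(stage_times, tasks)


if __name__ == "__main__":
    balanced = [30, 30, 30, 30]
    unbalanced = [40, 30, 30, 30]  # hypothetical slow washer
    print(f"balanced, 4 tasks:      {speedup(balanced, 4):.2f}")
    print(f"balanced, 1000 tasks:   {speedup(balanced, 1000):.2f}")
    print(f"unbalanced, 1000 tasks: {speedup(unbalanced, 1000):.2f}")
```

In this model, fill and drain time keeps the speedup below the number of stages for short workloads, and an unbalanced stage lowers it further even for long ones.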
Five Steps of Datapath
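Ifetch: instruction fetch
Reg/Dec: instruction decode and register read
Exec: execute or address calculation
Mem: memory access
Wr: register write-back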
Pipelined Processor Design
[Figure: five-stage pipelined datapath. Instruction Fetch: PC, Next PC, Inst. Mem, IR. ID/Register Read: Dcd Ctrl, Reg File, A, B. Execute/Address: Ex Ctrl, IRex, Exec, Equal, S. Memory Access (Memory Rd/Wrt): Mem Ctrl, IRmem, Data Mem, M. Write Back (Reg. Wrt): WB Ctrl, IRwb, Reg. File.]
Pipeline Control
Instruction Fetch (all instructions):
  IR <- Mem[PC]; PC <- PC + 4
ID/Reg. Rd (all instructions):
  A <- R[rs]; B <- R[rt]
Exe/Address:
  R-type: S <- A + B
  OR immediate: S <- A or ZX
  Load: S <- A + SX
  Store: S <- A + SX
  Branch: if Cond, PC <- PC + SX
Memory Rd/Wrt:
  Load: M <- Mem[S]
  Store: Mem[S] <- B
Reg. Wrt (WB):
  R-type: R[rd] <- S
  OR immediate: R[rt] <- S
  Load: R[rt] <- M
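The register transfers for each step can be traced in software. The sketch below is a minimal illustration, not the lecture's design: the dictionary encoding of instructions, the function name `execute`, and the use of register equality as the branch condition are all assumptions, and the load writes its result to rt as in MIPS I-format loads.

```python
def execute(instr, regs, mem, pc):
    """Run one instruction through the five register-transfer steps.

    `instr` is a hypothetical dict such as
    {"op": "lw", "rs": 1, "rt": 2, "imm": 4}; this encoding is for
    illustration only. Returns the updated PC.
    """
    op = instr["op"]
    rs, rt, rd = instr.get("rs", 0), instr.get("rt", 0), instr.get("rd", 0)
    imm = instr.get("imm", 0)

    # Instruction fetch: IR <- Mem[PC]; PC <- PC + 4
    pc += 4
    # ID / register read: A <- R[rs]; B <- R[rt]
    a, b = regs[rs], regs[rt]

    # Execute / address calculation
    if op == "add":
        s = a + b                  # S <- A + B
    elif op == "ori":
        s = a | (imm & 0xFFFF)     # S <- A or ZX (zero-extended immediate)
    elif op in ("lw", "sw"):
        s = a + imm                # S <- A + SX (imm is already a signed int)
    elif op == "beq":
        if a == b:                 # assumed condition: registers are equal
            pc += imm              # PC <- PC + SX
        return pc
    else:
        raise ValueError(f"unsupported op: {op}")

    # Memory read / write
    if op == "lw":
        m = mem[s]                 # M <- Mem[S]
    elif op == "sw":
        mem[s] = b                 # Mem[S] <- B

    # Register write-back
    if op == "add":
        regs[rd] = s               # R[rd] <- S
    elif op == "ori":
        regs[rt] = s               # R[rt] <- S
    elif op == "lw":
        regs[rt] = m               # load result goes to rt (MIPS convention)
    return pc


if __name__ == "__main__":
    regs = [0] * 8
    regs[1] = 100
    mem = {104: 7}
    program = [
        {"op": "lw", "rs": 1, "rt": 2, "imm": 4},   # R[2] <- Mem[104]
        {"op": "add", "rs": 2, "rt": 2, "rd": 3},   # R[3] <- R[2] + R[2]
        {"op": "sw", "rs": 1, "rt": 3, "imm": 8},   # Mem[108] <- R[3]
    ]
    pc = 0
    for instr in program:
        pc = execute(instr, regs, mem, pc)
    print(regs[2], regs[3], mem[108], pc)
```

The function runs one instruction at a time, so it models what each step does, not the overlap between instructions that the pipeline adds.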
Pipelined Registers Included
[Figure: the same five-stage datapath (Instruction Fetch, ID/Register Read, Execute/Address, Memory Access, Write Back) with pipeline registers included between the stages. Labels shown: PC, Next PC, Inst. Mem, IR, Dcd Ctrl, Reg File, A, B, Ex Ctrl, IRex, Exec, Equal, S, Mem Ctrl, IRmem, Data Mem, M, WB Ctrl, IRwb, Reg. File.]
Five Steps as Stages of Pipeline
Load   Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
       Ifetch    Reg/Dec   Exec      Mem       Wr
Multiple Cycle versus Pipeline: pipelining enhances performance
Multiple Cycle Implementation:
Load    cycles 1-5     Ifetch Reg Exec Mem Wr
Store   cycles 6-9     Ifetch Reg Exec Mem
Rtype   cycles 10-13   Ifetch Reg Exec Wr
Pipeline Implementation:
Load    cycles 1-5     Ifetch Reg Exec Mem Wr
Store   cycles 2-6     Ifetch Reg Exec Mem Wr
Rtype   cycles 3-7     Ifetch Reg Exec Mem Wr
3 Instructions program reconsidered
Load
Store
R-type (ADD)
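The cycle counts for this three-instruction program can be tallied directly. The sketch below is an illustration under stated assumptions: the per-instruction cycle counts (5, 4, 4) are read off the multiple-cycle timing diagram, the pipeline is assumed to have five stages with no stalls, and all names are hypothetical.

```python
# Cycles per instruction class in the multiple-cycle implementation,
# as read from the timing diagram.
MULTICYCLE_CYCLES = {"load": 5, "store": 4, "rtype": 4}
PIPELINE_STAGES = 5


def multicycle_cycles(program):
    """Instructions run back to back, each taking its own cycle count."""
    return sum(MULTICYCLE_CYCLES[op] for op in program)


def pipelined_cycles(program):
    """One instruction enters per cycle; the last needs a full pass to finish."""
    if not program:
        return 0
    return PIPELINE_STAGES + len(program) - 1


if __name__ == "__main__":
    program = ["load", "store", "rtype"]
    print(multicycle_cycles(program), pipelined_cycles(program))
```

For Load, Store, R-type this gives 13 cycles on the multiple-cycle machine and 7 cycles on the pipelined one.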
Example
The cycle time of a single-cycle machine is 45 ns, and that of the multi-cycle and pipelined machines is 10 ns; the average CPI due to the instruction mix on the multi-cycle machine is 4.6.
What is the execution time of a 100-instruction program on each type of machine?
Ans:
Single Cycle Machine
– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multi Cycle Machine
– 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
Pipelined machine
– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
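The same arithmetic can be written as one small function. This is a sketch with hypothetical names; it assumes a 100-instruction program, as the answers do, and treats the pipeline drain as extra cycles added to the total.

```python
def exec_time_ns(cycle_ns, cpi, instructions, extra_cycles=0):
    """Execution time = cycle time x (CPI x instruction count + extra cycles)."""
    return cycle_ns * (cpi * instructions + extra_cycles)


if __name__ == "__main__":
    n = 100  # instruction count used in the answers
    single = exec_time_ns(45, 1, n)
    multi = exec_time_ns(10, 4.6, n)
    piped = exec_time_ns(10, 1, n, extra_cycles=4)  # 4-cycle drain
    for name, t in [("single", single), ("multi", multi), ("pipelined", piped)]:
        print(f"{name}: {t:.0f} ns")
```

The multi-cycle result is formatted with `:.0f` because 4.6 is not exactly representable in binary floating point.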
Another Example
Consider a multi-cycle, unpipelined processor that requires 4 cycles for ALU and branch operations and 5 cycles for memory operations.
Assume the relative frequencies of these operations are 40%, 25% and 35% respectively, and the clock cycle is 1 ns.
In the pipelined implementation, clock skew and setup add 0.2 ns of overhead to the clock cycle.
Ignoring any latency impact, how much speedup does the pipelined processor give?
Solution
Unpipelined Processor:
Average Execution Time/Instruction = Clock Cycle x Average CPI
= 1 ns x [(0.40 + 0.25) x 4 + 0.35 x 5]
= 1 ns x (0.65 x 4 + 0.35 x 5)
= 1 ns x (2.60 + 1.75)
= 4.35 ns
Pipelined Processor:
Average Execution Time/Instruction = Clock Cycle + Overhead
= 1 ns + 0.2 ns
= 1.2 ns
Speedup = 4.35 / 1.2 = 3.625, about 3.6 times
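The solution can be reproduced numerically. The sketch below uses hypothetical names and assumes, as the solution does, that the pipelined processor completes one instruction per (lengthened) clock cycle.

```python
def average_cpi(mix):
    """Weighted average of cycles per instruction over (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


def pipeline_speedup(mix, clock_ns, overhead_ns):
    """Unpipelined time per instruction over pipelined time per instruction."""
    unpipelined_ns = clock_ns * average_cpi(mix)
    pipelined_ns = clock_ns + overhead_ns  # one instruction per cycle assumed
    return unpipelined_ns / pipelined_ns


if __name__ == "__main__":
    # (relative frequency, cycles): ALU, branch, memory
    mix = [(0.40, 4), (0.25, 4), (0.35, 5)]
    print(f"average CPI: {average_cpi(mix):.2f}")
    print(f"speedup:     {pipeline_speedup(mix, 1.0, 0.2):.3f}")
```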
Pipelined Execution Representation
Conventional Representation
- Helps show the program flow against time
(Time runs left to right; program flow runs top to bottom.)
1st Inst.  IFetch Dcd    Exec   Mem    WB
2nd Inst.         IFetch Dcd    Exec   Mem    WB
3rd Inst.                IFetch Dcd    Exec   Mem    WB
4th Inst.                       IFetch Dcd    Exec   Mem    WB
5th Inst.                              IFetch Dcd    Exec   Mem    WB
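The staircase layout of the conventional representation can be generated mechanically: instruction i is simply shifted right by i clock cycles. The sketch below is an illustration with hypothetical function names, assuming an ideal pipeline with no stalls.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def stage_at(instr_index, cycle, stages=STAGES):
    """Stage occupied by instruction `instr_index` in `cycle` (both 0-based)."""
    position = cycle - instr_index
    return stages[position] if 0 <= position < len(stages) else None


def pipeline_chart(n_instructions, stages=STAGES):
    """Return one text row per instruction, each shifted right by one cycle."""
    width = max(len(s) for s in stages) + 1
    rows = []
    for i in range(n_instructions):
        cells = [""] * i + list(stages)  # i empty cycles before the first stage
        row = f"Inst {i + 1}: " + "".join(c.ljust(width) for c in cells)
        rows.append(row.rstrip())
    return rows


if __name__ == "__main__":
    for row in pipeline_chart(5):
        print(row)
```

Reading a column of the printed chart from top to bottom shows which stage each instruction occupies in that clock cycle.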
Graphical Representation
[Figure: Instr 1 to Instr 5, in instruction order, against time in clock cycles CC1 to CC9. Each instruction uses I.Mem, Reg, ALU, D.Mem and Reg in five successive cycles and starts one cycle after the previous instruction.]
Why Pipeline?
Because the resources are there!
[Figure: Inst 0 to Inst 4, in instruction order, against time in clock cycles. Each instruction uses Im, Reg, ALU, Dm and Reg in successive cycles, so in a full pipeline each resource is busy with a different instruction.]
Can pipelining get us into trouble?
– Structural hazards
– Data hazards
– Control hazards
How do stalls degrade performance?
Pipelined CPI with stalls = Ideal CPI + Stall clock cycles per instruction
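How do stalls degrade performance? (Cont’d)
1. Speedup w.r.t. unpipelined:
Speedup = CPI unpipelined / (1 + stall cycles per instruction)
2. Speedup w.r.t. pipeline depth:
Speedup = pipeline depth / (1 + stall cycles per instruction)
(Both assume an ideal pipelined CPI of 1 and equal clock cycle times; the second also takes the unpipelined CPI to be equal to the pipeline depth.)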
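The stall formulas are easy to evaluate for sample numbers. The sketch below uses hypothetical function names and a hypothetical stall rate of 0.25 cycles per instruction; it assumes an ideal pipelined CPI of 1 and the same clock cycle for both machines.

```python
def pipelined_cpi(stall_cycles_per_instr, ideal_cpi=1.0):
    """Pipelined CPI with stalls = ideal CPI + stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def speedup_vs_unpipelined(cpi_unpipelined, stall_cycles_per_instr):
    """Speedup = unpipelined CPI / (1 + stall cycles per instruction)."""
    return cpi_unpipelined / pipelined_cpi(stall_cycles_per_instr)


def speedup_vs_depth(pipeline_depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return pipeline_depth / pipelined_cpi(stall_cycles_per_instr)


if __name__ == "__main__":
    stalls = 0.25  # hypothetical stall cycles per instruction
    print(f"pipelined CPI:            {pipelined_cpi(stalls)}")
    print(f"speedup, 5-stage:         {speedup_vs_depth(5, stalls)}")
    print(f"speedup, unpipelined 4.6: {speedup_vs_unpipelined(4.6, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction pulls it below that ideal.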
Summary
Multi-cycle datapath versus pipelined datapath
Key components of pipeline data path
Performance enhancement due to pipeline
Hazards in pipelined datapath