dce
2013
COMPUTER ARCHITECTURE
CSE Fall 2013
BK
TP.HCM
Faculty of Computer Science and
Engineering
Department of Computer Engineering
Vo Tan Phuong
Chapter 4.2
Pipelined Processor Design
Computer Architecture – Chapter 4.2
©Fall 2013, CS
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold
Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute steps; loads A, B, C, and D are each washed, dried, and folded completely before the next load starts]
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry
Pipelined Laundry: Start Load ASAP
[Figure: timeline from 6 PM to 9 PM in 30-minute steps; loads A, B, C, and D overlap, each starting 30 minutes after the previous one]
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)
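The laundry timings can be checked with a short calculation. This is a minimal sketch; the function names `sequential_minutes` and `pipelined_minutes` are illustrative and do not come from the slides.

```python
STAGE_MINUTES = 30   # wash, dry, and fold each take 30 minutes
NUM_STAGES = 3
NUM_LOADS = 4

def sequential_minutes(loads, stages, stage_minutes):
    # Each load finishes all of its stages before the next load starts.
    return loads * stages * stage_minutes

def pipelined_minutes(loads, stages, stage_minutes):
    # The first load needs `stages` steps; every later load finishes
    # one step after the load in front of it.
    return (stages + loads - 1) * stage_minutes

seq = sequential_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES)
pipe = pipelined_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES)
print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h, speedup: {seq / pipe}")
```

With the numbers from the slides this gives 6 hours, 3 hours, and a speedup of 2.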
Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining overlaps the execution of successive tasks
The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
[Figure: without pipelining, each task passes through stages 1, 2, …, k before the next task starts; with pipelining, successive tasks start one time unit apart and overlap]
Without Pipelining: one completion every k time units
With Pipelining: one completion every 1 time unit
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge …
All registers hold the results of previous stages simultaneously
The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages
[Figure: stages S1, S2, …, Sk between Input and Output, separated by registers that are all driven by a common Clock]
Clock period is determined by the maximum stage delay
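The behavior of clocked registers between combinational stages can be sketched in software. This is a simplified model under stated assumptions: each stage is a pure Python function, `None` stands for an empty slot, and the input is fed directly to the first stage. The name `make_pipeline` is hypothetical.

```python
def make_pipeline(stages):
    """Return a clock_edge(new_input) function for a list of stage functions."""
    regs = [None] * len(stages)   # regs[i] holds the registered output of stage i

    def clock_edge(new_input):
        # Every stage computes from the values held *before* the edge,
        # then all registers are updated together, as on a shared clock.
        inputs = [new_input] + regs[:-1]
        next_regs = [None if x is None else f(x) for f, x in zip(stages, inputs)]
        regs[:] = next_regs
        return regs[-1]           # value now visible at the pipeline output

    return clock_edge

# Three stages: add 1, double, subtract 3.
clock = make_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3])
outputs = [clock(x) for x in [1, 2, 3, None, None]]
print(outputs)
```

Three tasks leave a three-stage pipeline after 3 + 3 - 1 = 5 clock edges, and once the pipeline is full one result appears per edge.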
Pipeline Performance
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A pipeline can process n tasks in k + n – 1 cycles
k cycles are needed to complete the first task
n – 1 cycles are needed to complete the remaining n – 1 tasks
Ideal speedup of a k-stage pipeline over serial execution
$$S_k = \frac{\text{Serial execution in cycles}}{\text{Pipelined execution in cycles}} = \frac{nk}{k + n - 1}$$
$S_k \to k$ for large $n$
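The limit can be seen numerically. The sketch below evaluates the ideal speedup for a 5-stage pipeline and a growing number of tasks; the function names are illustrative only.

```python
def pipelined_cycles(n, k):
    # k cycles for the first task, then one more cycle per remaining task.
    return k + n - 1

def ideal_speedup(n, k):
    # Serial execution needs n * k cycles.
    return (n * k) / pipelined_cycles(n, k)

for n in (1, 10, 100, 10_000):
    print(n, round(ideal_speedup(n, 5), 3))
```

For a single task the speedup is 1, and it approaches k = 5 as n grows, since the k - 1 fill cycles are amortized over more tasks.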
MIPS Processor Pipeline
Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and jump/branch target address
3. EX: Execute operation or calculate load/store address
4. MEM: Memory access for load and store
5. WB: Write Back result to register
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which …
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution
Single-Cycle Clock = 200 + 150 + 200 + 200 + 150 = 900 ps
[Figure: two consecutive instructions, each passing through IF, Reg, ALU, MEM, Reg within one 900 ps clock cycle]
Single-Cycle versus Pipelined – cont’d
Pipelined clock cycle = max(200, 150) = 200 ps
[Figure: three overlapped instructions, each passing through IF, Reg, ALU, MEM, Reg, with one 200 ps clock cycle per stage]
CPI for pipelined execution = 1
One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution = 900 ps / 200 ps = 4.5
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (the number of pipeline stages)
because the pipeline stages are not balanced
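The numbers in this example can be reproduced with a few lines of Python. The stage delays come from the problem statement; the dictionary layout and the helper `speedup_for` (which also accounts for pipeline fill) are assumptions made for illustration.

```python
STAGE_DELAY_PS = {"IF": 200, "Reg read": 150, "ALU": 200, "MEM": 200, "Reg write": 150}

# Single-cycle: one clock must cover every stage in sequence.
single_cycle_clock = sum(STAGE_DELAY_PS.values())
# Pipelined: the clock only has to cover the slowest stage.
pipelined_clock = max(STAGE_DELAY_PS.values())
steady_state_speedup = single_cycle_clock / pipelined_clock

def speedup_for(n):
    """Speedup for n instructions, including the k - 1 cycles of pipeline fill."""
    k = len(STAGE_DELAY_PS)
    single = n * single_cycle_clock
    pipelined = (k + n - 1) * pipelined_clock
    return single / pipelined

print(single_cycle_clock, pipelined_clock, steady_state_speedup)
print(speedup_for(1), speedup_for(1000))
```

For a single instruction the pipelined processor is slightly slower (1000 ps versus 900 ps), which is consistent with pipelining improving throughput rather than latency.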
Pipeline Performance Summary
Pipelining doesn’t improve latency of a single instruction
However, it improves throughput of entire workload
Instructions are initiated and completed at a higher rate
In a k-stage pipeline, k instructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k
Unbalanced lengths of pipeline stages reduce speedup
Pipeline rate is limited by slowest pipeline stage
Also, time to fill and drain pipeline reduces speedup
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce pipeline register at end of each stage
[Figure: single-cycle datapath divided into five stages: IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Components: PC, Instruction Memory, Next PC logic (jump or branch target address; J, Beq, Bne, and zero inputs; PCSrc), Registers (RA, RB, RW, BusA, BusB, BusW), extender, ALU, Data Memory (Address, Data_in, Data_out). Control signals: RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]
Pipelined Datapath
Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers, register
file, and data memory (for store instruction)
[Figure: pipelined datapath with stages IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Pipeline register labels: PC, Instruction, NPC, NPC2, A, B, Imm, ALUout, D, WB Data]
Problem with Register Destination
Is there a problem with the register destination address?
The instruction in the ID stage is different from the one in the WB stage
[Figure: pipelined datapath as on the previous slide; the RW (destination) input of the register file is taken from the instruction currently in the ID stage]
Instruction in the WB stage is not writing to its destination register
but to the destination of a different instruction in the ID stage
Pipelining the Destination Register
Destination Register number should be pipelined
Destination register number is passed from ID to WB stage
The WB stage writes back data knowing the destination register
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which the destination register number is carried through the pipeline registers Rd2, Rd3, and Rd4; Rd4 drives the RW input of the register file]
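The effect of pipelining the destination register number can be sketched with a small simulation. The variable names `rd2`, `rd3`, and `rd4` mirror the register labels in the figure; the function `simulate` and the use of `None` for an empty slot are assumptions made for illustration.

```python
def simulate(dest_regs):
    """dest_regs[i] is the destination register of the i-th instruction.

    Returns one (rd_in_id, rd4) pair per cycle: the destination field of the
    instruction in ID and the pipelined number seen by the WB stage.
    """
    rd2 = rd3 = rd4 = None
    log = []
    # One instruction enters ID per cycle; three empty slots drain the pipeline.
    for rd_in_id in list(dest_regs) + [None] * 3:
        log.append((rd_in_id, rd4))
        # Clock edge: the register number moves one pipeline register forward.
        rd2, rd3, rd4 = rd_in_id, rd2, rd3
    return log

log = simulate(["$t6", "$s1", "$s4", "$t5"])
for cycle, (in_id, in_wb) in enumerate(log, start=1):
    print(cycle, in_id, in_wb)
```

In the fourth cycle the ID stage holds `$t5` while the WB stage sees `$t6`, so the write-back goes to the register of the instruction that produced the data, not to the one being decoded.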
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
Program Execution Order   Time (in cycles)
                     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $t6, 8($s5)      IM   Reg  ALU  DM   Reg
add $s1, $s2, $s3         IM   Reg  ALU  DM   Reg
ori $s4, $t3, 7                IM   Reg  ALU  DM   Reg
sub $t5, $s2, $t3                   IM   Reg  ALU  DM   Reg
sw  $s2, 10($t3)                         IM   Reg  ALU  DM
Instruction-Time Diagram
Instruction-Time Diagram shows:
Which instruction occupies which stage at each clock cycle
Instruction flow is pipelined over the 5 stages
Up to five instructions can be in the pipeline during the same cycle
Instruction Level Parallelism (ILP)
ALU instructions skip the MEM stage. Store instructions skip the WB stage
Instruction Order    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9   (Time)
lw  $t7, 8($s3)      IF   ID   EX   MEM  WB
lw  $t6, 8($s5)           IF   ID   EX   MEM  WB
ori $t4, $s3, 7                IF   ID   EX   –    WB
sub $s5, $s2, $t3                   IF   ID   EX   –    WB
sw  $s2, 10($s3)                         IF   ID   EX   MEM  –
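An instruction-time diagram of this kind can be generated mechanically: instruction i enters IF in cycle i + 1 and advances one stage per cycle. The sketch below is illustrative; the function `diagram` and the way skipped stages are passed in are assumptions, and an ASCII `-` marks a stage the instruction does not use.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instructions):
    """instructions: list of (text, skipped_stages) pairs, in program order."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i, (text, skipped) in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            # Instruction i occupies stage s during cycle i + s (0-based).
            cells[i + s] = "-" if stage in skipped else stage
        rows.append((text, cells))
    return rows

program = [
    ("lw  $t7, 8($s3)", set()),
    ("lw  $t6, 8($s5)", set()),
    ("ori $t4, $s3, 7", {"MEM"}),     # ALU instruction: no memory access
    ("sub $s5, $s2, $t3", {"MEM"}),
    ("sw  $s2, 10($s3)", {"WB"}),     # store: nothing to write back
]

for text, cells in diagram(program):
    print(f"{text:<20}" + "".join(f"{c:<5}" for c in cells))
```

Five instructions on a five-stage pipeline occupy 5 + 5 - 1 = 9 cycles, matching CC1 through CC9.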
Control Signals
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) annotated with the control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne, PCSrc, MemRead, MemWrite, and MemtoReg]
Same control signals used in the single-cycle datapath
Pipelined Control
[Figure: pipelined datapath with a Main & ALU Control unit in the ID stage (inputs Op and func); the control signals for the EX, MEM, and WB stages are carried in the pipeline registers]
Pass control signals along pipeline just like the data
Pipelined Control – Cont'd
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals
Instruction Decode and Register Read
Control signals are generated
RegDst is used in this stage
Execution Stage => ExtOp, ALUSrc, and ALUCtrl
Next PC uses J, Beq, Bne, and zero signals for branch control
Memory Stage => MemRead, MemWrite, and MemtoReg
Write Back Stage => RegWrite is used in this stage
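The stage-by-stage use of control signals can be sketched as follows. The signal values for `lw` and `sw` are taken from the Control Signals Summary slide; the grouping into per-stage dictionaries, the functions `decode`, `drop`, and `run`, and the use of `None` for a don't-care or an empty slot are assumptions. RegDst (used in ID) and the J, Beq, Bne signals are left out to keep the sketch short.

```python
def decode(op):
    """Control signals generated in ID, grouped by the stage that uses them."""
    table = {
        "lw": {"EX": {"ExtOp": 1, "ALUSrc": 1, "ALUCtrl": "ADD"},
               "MEM": {"MemRead": 1, "MemWrite": 0, "MemtoReg": 1},
               "WB": {"RegWrite": 1}},
        "sw": {"EX": {"ExtOp": 1, "ALUSrc": 1, "ALUCtrl": "ADD"},
               "MEM": {"MemRead": 0, "MemWrite": 1, "MemtoReg": None},
               "WB": {"RegWrite": 0}},
    }
    return table[op]

def drop(bundle, stage):
    # Signals already consumed by a stage need not be carried any further.
    return None if bundle is None else {k: v for k, v in bundle.items() if k != stage}

def run(ops):
    id_ex = ex_mem = mem_wb = None      # control fields of the pipeline registers
    trace = []
    for op in list(ops) + [None] * 3:   # three empty slots drain the pipeline
        ctrl = decode(op) if op else None
        trace.append({
            "EX": id_ex["EX"] if id_ex else None,
            "MEM": ex_mem["MEM"] if ex_mem else None,
            "WB": mem_wb["WB"] if mem_wb else None,
        })
        # Clock edge: control signals move along with the instruction.
        id_ex, ex_mem, mem_wb = ctrl, drop(id_ex, "EX"), drop(ex_mem, "MEM")
    return trace

for cycle, used in enumerate(run(["lw", "sw"]), start=1):
    print(cycle, used)
```

Each row shows which signals are active in EX, MEM, and WB during one cycle; the `lw` RegWrite only reaches the WB stage three cycles after it was generated in ID.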
Control Signals Summary
Decode Stage: RegDst
Execute Stage Control Signals: ALUSrc, ExtOp, J, Beq, Bne, ALUCtrl
Memory Stage Control Signals: MemRd, MemWr, MemReg
Write Back: RegWrite

Op      RegDst  ALUSrc  ExtOp   J   Beq  Bne  ALUCtrl  MemRd  MemWr  MemReg  RegWrite
R-Type  1=Rd    0=Reg   x       0   0    0    func     0      0      0       1
addi    0=Rt    1=Imm   1=sign  0   0    0    ADD      0      0      0       1
slti    0=Rt    1=Imm   1=sign  0   0    0    SLT      0      0      0       1
andi    0=Rt    1=Imm   0=zero  0   0    0    AND      0      0      0       1
ori     0=Rt    1=Imm   0=zero  0   0    0    OR       0      0      0       1
lw      0=Rt    1=Imm   1=sign  0   0    0    ADD      1      0      1       1
sw      x       1=Imm   1=sign  0   0    0    ADD      0      1      x       0
beq     x       0=Reg   x       0   1    0    SUB      0      0      x       0
bne     x       0=Reg   x       0   0    1    SUB      0      0      x       0
j       x       x       x       1   0    0    x        0      0      x       0