Computer Architecture, Vo Tan Phuong, Chapter 4.2: Pipelined Processor

dce
2013

COMPUTER ARCHITECTURE
CSE Fall 2013

BK
TP.HCM

Faculty of Computer Science and
Engineering
Department of Computer Engineering

Vo Tan Phuong

Chapter 4.2
Pipelined Processor Design

Computer Architecture – Chapter 4.2
©Fall 2013, CS

Presentation Outline

• Pipelining versus Serial Execution
• Pipelined Datapath and Control
• Pipeline Hazards
• Data Hazards and Forwarding
• Load Delay, Hazard Detection, and Stall
• Control Hazards
• Delayed Branch and Dynamic Branch Prediction


Pipelining Example

• Laundry example: three stages
  1. Wash dirty load of clothes
  2. Dry wet clothes
  3. Fold and put clothes into drawers
• Each stage takes 30 minutes to complete
• Four loads of clothes (A, B, C, D) to wash, dry, and fold


Sequential Laundry

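[Figure: timeline from 6 PM to 12 AM in 30-minute slots; loads A, B, C, and D are washed, dried, and folded one after another, each load occupying three consecutive slots]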

• Sequential laundry takes 6 hours for 4 loads (4 loads × 90 minutes)
• Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: timeline from 6 PM to 9 PM in 30-minute slots; loads A, B, C, and D start 30 minutes apart, so the washing, drying, and folding of different loads overlap]

• Pipelined laundry takes 3 hours for 4 loads
• Speedup factor is 2 for 4 loads
• Time to wash, dry, and fold one load is still the same (90 minutes)
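
The laundry numbers can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and it assumes every stage takes exactly the same time and that a new load starts as soon as the washer is free.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes all of its stages before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load needs `stages` slots; every later load adds one slot."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(loads=4, stages=3, stage_minutes=30)   # 360 minutes
pipe = pipelined_minutes(loads=4, stages=3, stage_minutes=30)   # 180 minutes
print(seq / 60, "hours sequential")
print(pipe / 60, "hours pipelined")
print("speedup:", seq / pipe)
```

With 4 loads the speedup is only 2, not 3, because the pipeline spends part of the time filling and draining.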


Serial Execution versus Pipelining
• Consider a task that can be divided into k subtasks
  - The k subtasks are executed on k different stages
  - Each subtask requires one time unit
  - The total execution time of the task is k time units
• Pipelining overlaps the execution of successive tasks
  - The k stages work in parallel on k different tasks
  - Tasks enter and leave the pipeline at a rate of one task per time unit

[Figure: without pipelining, each task passes through stages 1, 2, …, k before the next task starts, giving one completion every k time units; with pipelining, successive tasks are staggered by one stage, giving one completion every 1 time unit]

Synchronous Pipeline

• Uses clocked registers between stages
• Upon arrival of a clock edge, all registers simultaneously capture the results of the preceding stages
• The pipeline stages are combinational logic circuits
• It is desirable to have balanced stages
  - Approximately equal delay in all stages
• Clock period is determined by the maximum stage delay

[Figure: Input → Register → S1 → Register → S2 → Register → … → Sk → Register → Output, with all registers driven by a common Clock]
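
A small behavioral sketch may help to see what "all registers update on the same clock edge" means. Everything here is an assumption made for illustration: the function name `run_synchronous_pipeline`, the three toy stage functions, and the stage delays are not from the slides. `None` is used to model an empty pipeline slot.

```python
from typing import Callable, List, Optional, Tuple

Stage = Callable[[int], int]


def run_synchronous_pipeline(stages: List[Stage],
                             inputs: List[int]) -> Tuple[List[int], int]:
    """Push inputs through the stages; return (outputs, clock cycles used)."""
    k = len(stages)
    if k == 0:
        raise ValueError("need at least one stage")
    regs: List[Optional[int]] = [None] * k   # regs[i] latches the output of stage i
    pending = list(inputs)
    outputs: List[int] = []
    cycles = 0
    while len(outputs) < len(inputs):
        feed = pending.pop(0) if pending else None
        # Every stage computes from the values held BEFORE the clock edge ...
        nxt: List[Optional[int]] = []
        for i, stage in enumerate(stages):
            value_in = feed if i == 0 else regs[i - 1]
            nxt.append(stage(value_in) if value_in is not None else None)
        # ... and all registers are replaced together on the edge.
        regs = nxt
        cycles += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs, cycles


stage_delays_ps = [200, 150, 200]            # hypothetical combinational delays
clock_period_ps = max(stage_delays_ps)       # the slowest stage sets the clock
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, cycles = run_synchronous_pipeline(stages, [1, 2, 3, 4])
print(results, cycles, cycles * clock_period_ps)
```

In this sketch four inputs leave a three-stage pipeline after 3 + 4 - 1 = 6 clock cycles, and the 150 ps stage sits idle for part of every 200 ps cycle, which is why balanced stages are desirable.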

Pipeline Performance

• Let $t_i$ = time delay in stage $S_i$
• Clock cycle $t = \max(t_i)$ is the maximum stage delay
• Clock frequency $f = 1/t = 1/\max(t_i)$
• A pipeline can process n tasks in k + n – 1 cycles
  - k cycles are needed to complete the first task
  - n – 1 cycles are needed to complete the remaining n – 1 tasks
• Ideal speedup of a k-stage pipeline over serial execution:

$$S_k = \frac{\text{Serial execution in cycles}}{\text{Pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \text{ for large } n$$
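
To see how quickly the ideal speedup approaches k, the formula can be evaluated for a few values of n. The helper names below are chosen for this sketch only, and the model assumes one cycle per stage and no stalls.

```python
def pipelined_cycles(k: int, n: int) -> int:
    """k cycles for the first task, then one more cycle per remaining task."""
    return k + n - 1


def ideal_speedup(k: int, n: int) -> float:
    """Serial cycles (n * k) divided by pipelined cycles (k + n - 1)."""
    return (n * k) / pipelined_cycles(k, n)


# Speedup of a 5-stage pipeline for a growing number of tasks
for n in (1, 4, 100, 10_000):
    print(n, round(ideal_speedup(5, n), 3))
```

For a single task there is no gain at all (speedup 1), and the speedup only gets close to k once n is much larger than k.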


MIPS Processor Pipeline
• Five stages, one cycle per stage
  1. IF: Instruction Fetch from instruction memory
  2. ID: Instruction Decode, register read, and jump/branch (J/Br) target address
  3. EX: Execute operation or calculate load/store address
  4. MEM: Memory access for load and store
  5. WB: Write Back result to register


Single-Cycle vs Pipelined Performance
• Consider a 5-stage instruction execution in which:
  - Instruction fetch = ALU operation = data memory access = 200 ps
  - Register read = register write = 150 ps
• What is the clock cycle of the single-cycle processor?
• What is the clock cycle of the pipelined processor?
• What is the speedup factor of pipelined execution?
• Solution
  Single-cycle clock = 200 + 150 + 200 + 200 + 150 = 900 ps

[Figure: two consecutive instructions on the single-cycle processor, each passing through IF, Reg, ALU, MEM, Reg within one 900 ps cycle]

Single-Cycle versus Pipelined – cont’d
• Pipelined clock cycle = max(200, 150) = 200 ps

[Figure: three overlapped instructions, each passing through IF, Reg, ALU, MEM, Reg; a new instruction starts every 200 ps]

• CPI for pipelined execution = 1
  - One instruction completes each cycle (ignoring pipeline fill)
• Speedup of pipelined execution = 900 ps / 200 ps = 4.5
  - Instruction count and CPI are equal in both cases
• Speedup factor is less than 5 (the number of pipeline stages)
  - Because the pipeline stages are not balanced
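
The arithmetic of this example can be reproduced as follows. The delays are the ones given on the slide; the dictionary keys and the helper `total_time_ps` are names invented for this sketch, and the pipelined total assumes no stalls.

```python
# Stage delays from the example, in picoseconds
stage_delay_ps = {
    "IF": 200,
    "Reg read": 150,
    "ALU": 200,
    "MEM": 200,
    "Reg write": 150,
}

single_cycle_clock_ps = sum(stage_delay_ps.values())   # all stages in one cycle
pipelined_clock_ps = max(stage_delay_ps.values())      # slowest stage sets the clock
speedup = single_cycle_clock_ps / pipelined_clock_ps


def total_time_ps(n: int, pipelined: bool) -> int:
    """Time to run n instructions, counting pipeline fill when pipelined."""
    if pipelined:
        return (len(stage_delay_ps) + n - 1) * pipelined_clock_ps
    return n * single_cycle_clock_ps


print(single_cycle_clock_ps, pipelined_clock_ps, speedup)
print(total_time_ps(1000, pipelined=False) / total_time_ps(1000, pipelined=True))
```

The second ratio is slightly below 4.5 because the fill time of the pipeline is included; it approaches 4.5 as the instruction count grows.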


Pipeline Performance Summary

• Pipelining doesn’t improve the latency of a single instruction
• However, it improves the throughput of the entire workload
  - Instructions are initiated and completed at a higher rate
• In a k-stage pipeline, k instructions operate in parallel
  - Overlapped execution using multiple hardware resources
  - Potential speedup = number of pipeline stages k
• Pipeline rate is limited by the slowest pipeline stage
  - Unbalanced lengths of pipeline stages reduce speedup
  - Also, the time to fill and drain the pipeline reduces speedup


Next . . .

• Pipelining versus Serial Execution
• Pipelined Datapath and Control
• Pipeline Hazards
• Data Hazards and Forwarding
• Load Delay, Hazard Detection, and Stall
• Control Hazards
• Delayed Branch and Dynamic Branch Prediction


Single-Cycle Datapath

• Shown below is the single-cycle datapath
• How to pipeline this single-cycle datapath?
  Answer: Introduce a pipeline register at the end of each stage

[Figure: single-cycle datapath divided into five stages: IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Components: PC, Next PC logic (jump or branch target address), Instruction Memory, Registers (RA, RB, RW, BusA, BusB, BusW), ALU, Data Memory. Control signals: PCSrc, RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg, J, Beq, Bne]


Pipelined Datapath

• Pipeline registers are shown in green, including the PC
• The same clock edge updates all pipeline registers, the register file, and the data memory (for store instructions)

[Figure: pipelined datapath with the stages IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back, separated by pipeline registers. Register labels in the figure: PC, NPC, NPC2, Instruction, A, B, Imm, ALUout, D, WB Data]


Problem with Register Destination

• Is there a problem with the register destination address?
  - The instruction in the ID stage is different from the one in the WB stage
  - The instruction in the WB stage is not writing to its own destination register but to the destination of a different instruction in the ID stage

[Figure: the pipelined datapath with stages IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back]

Pipelining the Destination Register

• Destination register number should be pipelined
  - Destination register number is passed from the ID to the WB stage
  - The WB stage writes back data knowing the destination register

[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which the destination register number Rd is carried through the pipeline registers Rd2, Rd3, and Rd4 and reaches the RW input of the register file in the WB stage]
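
The effect of pipelining the destination register number can be traced with a short simulation. This is a sketch under stated assumptions: the names `rd2`, `rd3`, and `rd4` follow the labels in the figure, while the function name and the register numbers in the example (14, 17, 20, 13) are chosen only for illustration.

```python
from typing import List, Optional, Tuple


def trace_destination(dest_regs: List[int]) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (cycle, Rd seen in ID, Rd4 seen in WB) for every cycle.

    One instruction enters ID per cycle; three empty slots are appended
    at the end so that the last instruction can reach WB.
    """
    rd2: Optional[int] = None   # ID/EX pipeline register
    rd3: Optional[int] = None   # EX/MEM pipeline register
    rd4: Optional[int] = None   # MEM/WB pipeline register
    stream: List[Optional[int]] = list(dest_regs) + [None] * 3
    trace = []
    for cycle, rd_in_id in enumerate(stream, start=1):
        trace.append((cycle, rd_in_id, rd4))
        # Clock edge: the register number moves one stage forward
        rd2, rd3, rd4 = rd_in_id, rd2, rd3
    return trace


for cycle, rd_id, rd_wb in trace_destination([14, 17, 20, 13]):
    print(f"cycle {cycle}: ID sees Rd={rd_id}, WB writes Rd4={rd_wb}")
```

In cycle 4 of this trace the ID stage holds register 13 while the WB stage must write register 14. Using the Rd field from the ID stage at that moment would send the result to the wrong register, which is the problem the Rd2, Rd3, Rd4 chain avoids.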

Graphically Representing Pipelines

• Multiple instruction execution over multiple clock cycles
  - Instructions are listed in execution order from top to bottom
  - Clock cycles move from left to right
  - Figure shows the use of resources at each stage and each cycle

Program execution order (top to bottom) versus time (in cycles):

                      CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
  lw  $t6, 8($s5)     IM   Reg  ALU  DM   Reg
  add $s1, $s2, $s3        IM   Reg  ALU  DM   Reg
  ori $s4, $t3, 7               IM   Reg  ALU  DM   Reg
  sub $t5, $s2, $t3                  IM   Reg  ALU  DM   Reg
  sw  $s2, 10($t3)                        IM   Reg  ALU  DM


Instruction-Time Diagram

• Instruction-Time Diagram shows:
  - Which instruction occupies which stage at each clock cycle
• Instruction flow is pipelined over the 5 stages
• Up to five instructions can be in the pipeline during the same cycle: Instruction Level Parallelism (ILP)
• ALU instructions skip the MEM stage. Store instructions skip the WB stage

Instruction order (top to bottom) versus time:

                      CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
  lw  $t7, 8($s3)     IF   ID   EX   MEM  WB
  lw  $t6, 8($s5)          IF   ID   EX   MEM  WB
  ori $t4, $s3, 7               IF   ID   EX   –    WB
  sub $s5, $s2, $t3                  IF   ID   EX   –    WB
  sw  $s2, 10($s3)                        IF   ID   EX   MEM  –
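
A diagram of this kind can be generated mechanically. The sketch below is one possible way to do it; the function names and the three instruction kinds ("load", "alu", "store") are assumptions of this example, and it models an ideal pipeline with no stalls.

```python
from typing import List, Tuple

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Stages in which an instruction kind does no useful work (shown as "-")
SKIPPED = {"alu": {"MEM"}, "store": {"WB"}, "load": set()}


def stage_row(kind: str) -> List[str]:
    """Stage labels for one instruction, with skipped stages marked '-'."""
    return ["-" if s in SKIPPED[kind] else s for s in STAGES]


def diagram(program: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """One row per instruction; instruction i enters IF in cycle i + 1."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for i, (text, kind) in enumerate(program):
        cells = [""] * total_cycles
        for j, label in enumerate(stage_row(kind)):
            cells[i + j] = label
        rows.append((text, cells))
    return rows


program = [
    ("lw  $t7, 8($s3)", "load"),
    ("lw  $t6, 8($s5)", "load"),
    ("ori $t4, $s3, 7", "alu"),
    ("sub $s5, $s2, $t3", "alu"),
    ("sw  $s2, 10($s3)", "store"),
]
for text, cells in diagram(program):
    print(f"{text:<20}" + "".join(f"{c:<5}" for c in cells))
```

Five instructions on five stages occupy 5 + 5 - 1 = 9 cycles, matching the CC1 to CC9 range of the diagram.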


Control Signals

[Figure: pipelined datapath (IF, ID, EX, MEM, WB) annotated with its control signals: PCSrc, RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne, MemRead, MemWrite, MemtoReg]

Same control signals used in the single-cycle datapath

Pipelined Control

[Figure: pipelined datapath with a Main & ALU Control unit driven by the Op and func fields of the instruction. The generated signals (RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne, MemRead, MemWrite, MemtoReg) are stored in pipeline register fields labeled EX, MEM, and WB]

Pass control signals along the pipeline just like the data

Pipelined Control – Cont'd
• ID stage generates all the control signals
• Pipeline the control signals as the instruction moves
  - Extend the pipeline registers to include the control signals
• Each stage uses some of the control signals
  - Instruction Decode and Register Read
    - Control signals are generated
    - RegDst is used in this stage
  - Execution Stage => ExtOp, ALUSrc, and ALUCtrl
    - Next PC uses J, Beq, Bne, and zero signals for branch control
  - Memory Stage => MemRead, MemWrite, and MemtoReg
  - Write Back Stage => RegWrite is used in this stage
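
The idea of carrying control signals in the pipeline registers can be sketched as follows. The grouping of signals per stage follows the list above; the function name `split_control`, the dictionary layout, and the example values for a load instruction are assumptions made for illustration and are not taken from the slides.

```python
from typing import Dict, Tuple

ID_SIGNALS = ("RegDst",)
EX_SIGNALS = ("ExtOp", "ALUSrc", "ALUCtrl", "J", "Beq", "Bne")
MEM_SIGNALS = ("MemRead", "MemWrite", "MemtoReg")
WB_SIGNALS = ("RegWrite",)

Signals = Dict[str, object]


def split_control(signals: Signals) -> Tuple[Signals, Signals, Signals, Signals]:
    """Return the signals used in ID and the bundles stored in the
    ID/EX, EX/MEM, and MEM/WB pipeline registers.

    Each stage consumes its own signals and forwards only the rest.
    """
    used_in_id = {name: signals[name] for name in ID_SIGNALS}
    id_ex = {n: v for n, v in signals.items() if n not in ID_SIGNALS}
    ex_mem = {n: v for n, v in id_ex.items() if n not in EX_SIGNALS}
    mem_wb = {n: v for n, v in ex_mem.items() if n not in MEM_SIGNALS}
    return used_in_id, id_ex, ex_mem, mem_wb


# Hypothetical signal values for a load word instruction (assumed encoding)
lw_signals: Signals = {
    "RegDst": 0, "ExtOp": 1, "ALUSrc": 1, "ALUCtrl": "ADD",
    "J": 0, "Beq": 0, "Bne": 0,
    "MemRead": 1, "MemWrite": 0, "MemtoReg": 1,
    "RegWrite": 1,
}
used_in_id, id_ex, ex_mem, mem_wb = split_control(lw_signals)
print("ID uses:", used_in_id)
print("ID/EX carries:", sorted(id_ex))
print("EX/MEM carries:", sorted(ex_mem))
print("MEM/WB carries:", sorted(mem_wb))
```

The bundle shrinks from stage to stage, so the later pipeline registers need fewer control bits than the ID/EX register.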


Control Signals Summary

Decode
Stage

Execute Stage

Memory Stage

Write

Control Signals

Control Signals

Back

Op
RegDst ALUSrc ExtOp
R-Type

1=Rd

0=Reg

addi

0=Rt

slti