dce
2013

COMPUTER ARCHITECTURE
CSE Fall 2013

BK
TP.HCM

Faculty of Computer Science and
Engineering
Department of Computer Engineering

Vo Tan Phuong
Chapter 4.2
Pipelined Processor Design

Computer Architecture – Chapter 4.2
©Fall 2013, CS

Presentation Outline

• Pipelining versus Serial Execution
• Pipelined Datapath and Control
• Pipeline Hazards
• Data Hazards and Forwarding
• Load Delay, Hazard Detection, and Stall
• Control Hazards
• Delayed Branch and Dynamic Branch Prediction

Pipelining Example

• Laundry Example: Three Stages
  1. Wash dirty load of clothes
  2. Dry wet clothes
  3. Fold and put clothes into drawers
• Each stage takes 30 minutes to complete
• Four loads of clothes (A, B, C, D) to wash, dry, and fold

Sequential Laundry

[Figure: timeline from 6 PM to 12 AM in 30-minute slots; loads A, B, C, and D are washed, dried, and folded one after another]

• Sequential laundry takes 6 hours for 4 loads
• Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: timeline from 6 PM to 9 PM in 30-minute slots; loads A, B, C, and D overlap, each load starting as soon as the washer is free]

• Pipelined laundry takes 3 hours for 4 loads
• Speedup factor is 2 for 4 loads
• Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining

• Consider a task that can be divided into k subtasks
  - The k subtasks are executed on k different stages
  - Each subtask requires one time unit
  - The total execution time of the task is k time units
• Pipelining overlaps the execution of successive tasks
  - The k stages work in parallel on k different tasks
  - Tasks enter/leave the pipeline at the rate of one task per time unit

[Figure: stages 1, 2, …, k occupied over time, without and with pipelining]
Without Pipelining: one completion every k time units
With Pipelining: one completion every 1 time unit

Synchronous Pipeline

• Uses clocked registers between stages
• Upon arrival of a clock edge …
  - All registers hold the results of previous stages simultaneously
• The pipeline stages are combinational logic circuits
• It is desirable to have balanced stages
  - Approximately equal delay in all stages
• Clock period is determined by the maximum stage delay

[Figure: Input → Register → S1 → Register → S2 → Register → … → Sk → Register → Output, with all registers driven by the same Clock]

Pipeline Performance

• Let t_i = time delay in stage S_i
• Clock cycle t = max(t_i) is the maximum stage delay
• Clock frequency f = 1/t = 1/max(t_i)
• A pipeline can process n tasks in k + n – 1 cycles
  - k cycles are needed to complete the first task
  - n – 1 cycles are needed to complete the remaining n – 1 tasks
• Ideal speedup of a k-stage pipeline over serial execution

$$S_k = \frac{\text{serial execution in cycles}}{\text{pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \ \text{for large } n$$
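
The cycle counts above can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and not part of the course material. It evaluates the ideal speedup nk / (k + n – 1) for the laundry example (k = 3 stages, n = 4 loads) and shows how the speedup of a 5-stage pipeline approaches k as n grows.

```python
def serial_cycles(n: int, k: int) -> int:
    """Cycles to run n tasks one after another when each task needs k cycles."""
    return n * k


def pipelined_cycles(n: int, k: int) -> int:
    """Cycles on a k-stage pipeline: k for the first task, 1 for each remaining task."""
    return k + n - 1


def ideal_speedup(n: int, k: int) -> float:
    """Ideal speedup S_k = nk / (k + n - 1)."""
    return serial_cycles(n, k) / pipelined_cycles(n, k)


# Laundry example: 3 stages, 4 loads -> 12 time units versus 6 time units.
print(ideal_speedup(4, 3))  # 2.0

# For a 5-stage pipeline the speedup approaches 5 as the task count grows.
for n in (10, 1_000, 1_000_000):
    print(n, ideal_speedup(n, 5))
```

The sketch ignores hazards and stalls, so it gives an upper bound rather than a measured speedup.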

MIPS Processor Pipeline

• Five stages, one cycle per stage
  1. IF: Instruction Fetch from instruction memory
  2. ID: Instruction Decode, register read, and jump/branch (J/Br) address
  3. EX: Execute operation or calculate load/store address
  4. MEM: Memory access for load and store
  5. WB: Write Back result to register

Single-Cycle vs Pipelined Performance

• Consider a 5-stage instruction execution in which …
  - Instruction fetch = ALU operation = Data memory access = 200 ps
  - Register read = register write = 150 ps
• What is the clock cycle of the single-cycle processor?
• What is the clock cycle of the pipelined processor?
• What is the speedup factor of pipelined execution?
• Solution
  Single-Cycle Clock = 200 + 150 + 200 + 200 + 150 = 900 ps

[Figure: two consecutive instructions, each passing through IF, Reg, ALU, MEM, Reg within one 900 ps cycle]

Single-Cycle versus Pipelined – cont’d

• Pipelined clock cycle = max(200, 150) = 200 ps

[Figure: three overlapped instructions, each passing through IF, Reg, ALU, MEM, Reg with 200 ps per stage; each instruction starts one 200 ps cycle after the previous one]

• CPI for pipelined execution = 1
  - One instruction completes each cycle (ignoring pipeline fill)
• Speedup of pipelined execution = 900 ps / 200 ps = 4.5
  - Instruction count and CPI are equal in both cases
• Speedup factor is less than 5 (the number of pipeline stages)
  - Because the pipeline stages are not balanced
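
The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch: the dictionary keys are illustrative labels, and the balanced design at the end is a hypothetical case added to show why unbalanced stages keep the speedup below 5.

```python
# Stage delays in picoseconds, taken from the example above.
# The keys are illustrative labels, not names used in the slides.
STAGE_DELAY_PS = {
    "instruction fetch": 200,
    "register read": 150,
    "ALU operation": 200,
    "data memory access": 200,
    "register write": 150,
}


def single_cycle_clock(delays: dict) -> int:
    """One long cycle must cover all stages in sequence."""
    return sum(delays.values())


def pipelined_clock(delays: dict) -> int:
    """Each stage gets one cycle, so the slowest stage sets the clock period."""
    return max(delays.values())


def speedup(delays: dict) -> float:
    """Speedup when instruction count and CPI are the same in both designs."""
    return single_cycle_clock(delays) / pipelined_clock(delays)


print(single_cycle_clock(STAGE_DELAY_PS))  # 900
print(pipelined_clock(STAGE_DELAY_PS))     # 200
print(speedup(STAGE_DELAY_PS))             # 4.5

# Hypothetical balanced design: the same 900 ps split evenly over 5 stages.
balanced = {name: 180 for name in STAGE_DELAY_PS}
print(speedup(balanced))                   # 5.0
```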

Pipeline Performance Summary

• Pipelining doesn’t improve the latency of a single instruction
• However, it improves the throughput of the entire workload
  - Instructions are initiated and completed at a higher rate
• In a k-stage pipeline, k instructions operate in parallel
  - Overlapped execution using multiple hardware resources
  - Potential speedup = number of pipeline stages k
• Pipeline rate is limited by the slowest pipeline stage
  - Unbalanced lengths of pipeline stages reduce speedup
  - Also, time to fill and drain the pipeline reduces speedup

Single-Cycle Datapath

• Shown below is the single-cycle datapath
• How to pipeline this single-cycle datapath?
  Answer: Introduce pipeline register at end of each stage

[Figure: single-cycle datapath divided into five stages: IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Components: PC, Instruction Memory, Next PC, Registers, ALU, Data Memory. Control signals: PCSrc, J, Beq, Bne, RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]

Pipelined Datapath

• Pipeline registers are shown in green, including the PC
• Same clock edge updates all pipeline registers, register file, and data memory (for store instruction)

[Figure: pipelined datapath with stages IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Labels in the figure include PC, Instruction, NPC, NPC2, A, B, Imm, ALUout, D, and WB Data]

Problem with Register Destination

• Is there a problem with the register destination address?
  - The instruction in the ID stage is different from the one in the WB stage
  - The instruction in the WB stage is not writing to its destination register but to the destination of a different instruction in the ID stage

[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which the register file write address RW is taken from the instruction in the ID stage]

Pipelining the Destination Register

• Destination register number should be pipelined
  - Destination register number is passed from ID to WB stage
  - The WB stage writes back data knowing the destination register

[Figure: pipelined datapath (IF, ID, EX, MEM, WB) in which the destination register number travels through Rd2, Rd3, and Rd4 before reaching the RW input of the register file]
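
The effect of pipelining the destination register can be illustrated with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the course's implementation: Rd2, Rd3, and Rd4 are the labels from the figure, here assumed to be the destination fields held for the instruction in EX, MEM, and WB respectively. The function name and the register numbers are made up for illustration.

```python
from typing import Dict, List, Optional


def simulate(dest_regs: List[int], pipeline_rd: bool) -> Dict[int, Optional[int]]:
    """Return {instruction index: register number written in its WB stage}.

    pipeline_rd=False models the faulty datapath, where the write address RW
    comes from whatever instruction is currently in ID.
    pipeline_rd=True passes the destination through Rd2 -> Rd3 -> Rd4.
    """
    stages: List[Optional[int]] = [None] * 5  # instruction index in IF, ID, EX, MEM, WB
    rd2 = rd3 = rd4 = None                    # assumed: fields for EX, MEM, WB stages
    written: Dict[int, Optional[int]] = {}
    n = len(dest_regs)

    for cycle in range(n + 4):
        # Clock edge: each Rd field captures the value from the stage before it.
        in_id = stages[1]
        rd4, rd3, rd2 = rd3, rd2, (dest_regs[in_id] if in_id is not None else None)
        fetched = cycle if cycle < n else None
        stages = [fetched] + stages[:4]

        in_wb, in_id = stages[4], stages[1]
        if in_wb is not None:
            if pipeline_rd:
                written[in_wb] = rd4
            else:
                written[in_wb] = dest_regs[in_id] if in_id is not None else None
    return written


dests = [8, 9, 10, 11, 12]  # arbitrary destination register numbers
print(simulate(dests, pipeline_rd=False))  # instruction 0 writes register 11
print(simulate(dests, pipeline_rd=True))   # every instruction writes its own register
```

In the faulty case, instruction 0 reaches WB while instruction 3 is in ID, so it writes to the destination of instruction 3.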

Graphically Representing Pipelines

• Multiple instruction execution over multiple clock cycles
  - Instructions are listed in execution order from top to bottom
  - Clock cycles move from left to right
  - Figure shows the use of resources at each stage and each cycle

Program Execution Order   CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $t6, 8($s5)           IM   Reg  ALU  DM   Reg
add $s1, $s2, $s3              IM   Reg  ALU  DM   Reg
ori $s4, $t3, 7                     IM   Reg  ALU  DM   Reg
sub $t5, $s2, $t3                        IM   Reg  ALU  DM   Reg
sw  $s2, 10($t3)                              IM   Reg  ALU  DM

Instruction-Time Diagram

• Instruction-Time Diagram shows:
  - Which instruction occupies which stage at each clock cycle
• Instruction flow is pipelined over the 5 stages
• Up to five instructions can be in the pipeline during the same cycle: Instruction Level Parallelism (ILP)
• ALU instructions skip the MEM stage. Store instructions skip the WB stage

Instruction Order     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $t7, 8($s3)       IF   ID   EX   MEM  WB
lw  $t6, 8($s5)            IF   ID   EX   MEM  WB
ori $t4, $s3, 7                 IF   ID   EX   -    WB
sub $s5, $s2, $t3                    IF   ID   EX   -    WB
sw  $s2, 10($s3)                          IF   ID   EX   MEM  -

(- marks a stage the instruction passes through without using it)
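
An instruction-time diagram like this one can be generated mechanically: instruction i enters IF in cycle i + 1 and advances one stage per cycle. The sketch below is illustrative only; the function name and the "load" / "alu" / "store" tags are assumptions introduced here to mark which stage an instruction leaves unused.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def instruction_time_diagram(program):
    """Return one (text, cells) row per instruction; cells[c] is the stage in cycle c + 1."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for i, (text, kind) in enumerate(program):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            # ALU instructions do not use MEM; stores do not use WB.
            unused = (kind == "alu" and stage == "MEM") or (kind == "store" and stage == "WB")
            cells[i + s] = "-" if unused else stage
        rows.append((text, cells))
    return rows


# The five instructions from the diagram above, tagged by kind.
PROGRAM = [
    ("lw  $t7, 8($s3)", "load"),
    ("lw  $t6, 8($s5)", "load"),
    ("ori $t4, $s3, 7", "alu"),
    ("sub $s5, $s2, $t3", "alu"),
    ("sw  $s2, 10($s3)", "store"),
]

rows = instruction_time_diagram(PROGRAM)
print(" " * 20 + "".join(f"CC{c:<3}" for c in range(1, len(rows[0][1]) + 1)))
for text, cells in rows:
    print(f"{text:<20}" + "".join(f"{cell:<5}" for cell in cells))
```

Five instructions on five stages occupy 5 + 5 – 1 = 9 cycles, which matches CC1 through CC9 in the diagram.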

Control Signals

[Figure: pipelined datapath (IF, ID, EX, MEM, WB) annotated with the control signals PCSrc, J, Beq, Bne, RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, and MemtoReg]

Same control signals used in the single-cycle datapath

Pipelined Control

[Figure: pipelined datapath with a Main & ALU Control unit in the ID stage, driven by the Op and func fields. RegDst is used in ID; ExtOp, ALUSrc, ALUCtrl, J, Beq, and Bne are carried to EX; MemRead, MemWrite, and MemtoReg to MEM; RegWrite to WB]

Pass control signals along pipeline just like the data

Pipelined Control – Cont'd

• ID stage generates all the control signals
• Pipeline the control signals as the instruction moves
  - Extend the pipeline registers to include the control signals
• Each stage uses some of the control signals
• Instruction Decode and Register Read
  - Control signals are generated
  - RegDst is used in this stage
• Execution Stage => ExtOp, ALUSrc, and ALUCtrl
  - Next PC uses J, Beq, Bne, and zero signals for branch control
• Memory Stage => MemRead, MemWrite, and MemtoReg
• Write Back Stage => RegWrite is used in this stage
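
One way to picture "pipelining the control signals" is as a bundle that shrinks as the instruction advances, since each stage uses its own signals and forwards the rest. The Python sketch below is an illustration under assumptions, not the hardware design: the `advance` helper and the variable names `id_ex`, `ex_mem`, and `mem_wb` are hypothetical, and the `lw` values follow this chapter's control signal summary.

```python
# Signals grouped by the stage that uses them, as listed above.
ID_SIGNALS = ("RegDst",)
EX_SIGNALS = ("ExtOp", "ALUSrc", "ALUCtrl", "J", "Beq", "Bne")
MEM_SIGNALS = ("MemRead", "MemWrite", "MemtoReg")
WB_SIGNALS = ("RegWrite",)


def advance(control: dict, used_here: tuple) -> dict:
    """Drop the signals used by the current stage; the rest go to the next pipeline register."""
    return {name: value for name, value in control.items() if name not in used_here}


# Control word generated in ID for a load word (lw) instruction.
lw_control = {
    "RegDst": 0, "ExtOp": 1, "ALUSrc": 1, "ALUCtrl": "ADD",
    "J": 0, "Beq": 0, "Bne": 0,
    "MemRead": 1, "MemWrite": 0, "MemtoReg": 1,
    "RegWrite": 1,
}

id_ex = advance(lw_control, ID_SIGNALS)   # carried from ID into EX
ex_mem = advance(id_ex, EX_SIGNALS)       # carried from EX into MEM
mem_wb = advance(ex_mem, MEM_SIGNALS)     # carried from MEM into WB

print(sorted(id_ex))
print(sorted(ex_mem))
print(mem_wb)  # {'RegWrite': 1}
```

By the time `lw` reaches WB, only RegWrite is still needed, so the last pipeline register carries a single control bit in this model.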

Control Signals Summary

       | Decode | Execute                              | Memory               | Write Back
Op     | RegDst | ALUSrc  ExtOp   J  Beq  Bne  ALUCtrl | MemRd  MemWr  MemReg | RegWrite
R-Type | 1=Rd   | 0=Reg   x       0  0    0    func    | 0      0      0      | 1
addi   | 0=Rt   | 1=Imm   1=sign  0  0    0    ADD     | 0      0      0      | 1
slti   | 0=Rt   | 1=Imm   1=sign  0  0    0    SLT     | 0      0      0      | 1
andi   | 0=Rt   | 1=Imm   0=zero  0  0    0    AND     | 0      0      0      | 1
ori    | 0=Rt   | 1=Imm   0=zero  0  0    0    OR      | 0      0      0      | 1
lw     | 0=Rt   | 1=Imm   1=sign  0  0    0    ADD     | 1      0      1      | 1
sw     | x      | 1=Imm   1=sign  0  0    0    ADD     | 0      1      x      | 0
beq    | x      | 0=Reg   x       0  1    0    SUB     | 0      0      x      | 0
bne    | x      | 0=Reg   x       0  0    1    SUB     | 0      0      x      | 0
j      | x      | x       x       1  0    0    x       | 0      0      x      | 0

(x = don't care)
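
The summary table maps naturally onto a lookup structure keyed by operation. The sketch below is one possible encoding in Python, not the course's implementation: the names `FIELDS`, `CONTROL`, and `control` are illustrative, `None` stands for a don't-care (x) entry, and the R-Type ExtOp entry is read as don't-care.

```python
FIELDS = ("RegDst", "ALUSrc", "ExtOp", "J", "Beq", "Bne",
          "ALUCtrl", "MemRd", "MemWr", "MemReg", "RegWrite")

X = None  # don't care

# One row per operation, in the column order given by FIELDS.
CONTROL = {
    "R-Type": (1, 0, X, 0, 0, 0, "func", 0, 0, 0, 1),
    "addi":   (0, 1, 1, 0, 0, 0, "ADD",  0, 0, 0, 1),
    "slti":   (0, 1, 1, 0, 0, 0, "SLT",  0, 0, 0, 1),
    "andi":   (0, 1, 0, 0, 0, 0, "AND",  0, 0, 0, 1),
    "ori":    (0, 1, 0, 0, 0, 0, "OR",   0, 0, 0, 1),
    "lw":     (0, 1, 1, 0, 0, 0, "ADD",  1, 0, 1, 1),
    "sw":     (X, 1, 1, 0, 0, 0, "ADD",  0, 1, X, 0),
    "beq":    (X, 0, X, 0, 1, 0, "SUB",  0, 0, X, 0),
    "bne":    (X, 0, X, 0, 0, 1, "SUB",  0, 0, X, 0),
    "j":      (X, X, X, 1, 0, 0, X,      0, 0, X, 0),
}


def control(op: str) -> dict:
    """Return the control signals for one operation as a name -> value mapping."""
    return dict(zip(FIELDS, CONTROL[op]))


# Signals that change machine state are never don't-care in the table.
for op in CONTROL:
    signals = control(op)
    assert signals["RegWrite"] in (0, 1) and signals["MemWr"] in (0, 1)

writers = [op for op in CONTROL if control(op)["RegWrite"] == 1]
print(writers)
print(control("lw"))
```

The loop checks a property visible in the table: RegWrite and MemWr always have a definite value, because a wrong write would corrupt the register file or memory.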
