
Advanced Computer Architecture - Lecture 10: Computer hardware design


CS 704
Advanced Computer Architecture

Lecture 10
Computer Hardware Design
(Pipeline Datapath and Control Design)

Prof. Dr. M. Ashraf Chughtai


Recap: Lecture 9
Single cycle versus multi cycle datapath
Key components of multi cycle datapath
Design and information flow in multi cycle datapath
Multi cycle control unit design
Finite State Machine-based control unit
Microprogram-based controller

MAC/VU-Advanced
Computer Architecture

Lecture 10 –Computer Hardware
Design (4)

2


What is pipelining?
Pipelining is a fundamental concept
It utilizes capabilities of the Datapath by





Pipelining is Natural!
Laundry Example!
Four loads: A, B, C, D
Four laundry operations: wash, dry, fold, and place into drawers
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers


Sequential Laundry
[Figure: timeline from 6 PM to 2 AM, time across and task order down. Loads A, B, C, D run one after another; each load takes four 30-minute steps, 16 steps in all.]
Sequential laundry takes 8 hours for 4 loads.



Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to 2 AM, time across and task order down. Loads A, B, C, D start 30 minutes apart and overlap; seven 30-minute slots in all.]
Pipelined laundry takes 3.5 hours for 4 loads!
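
The two laundry timelines can be reproduced with a few lines of Python. This is only a sketch of the slide's example: the names `STEP_MIN`, `STEPS`, `LOADS` and `finish_times` are made up for illustration, and every step is assumed to take exactly 30 minutes.

```python
STEP_MIN = 30                               # every laundry step takes 30 minutes
STEPS = ["wash", "dry", "fold", "stash"]    # the four laundry operations
LOADS = ["A", "B", "C", "D"]                # the four loads

def finish_times(loads, pipelined):
    """Return the finish time of each load, in minutes after 6 PM."""
    per_load = len(STEPS) * STEP_MIN        # one load alone always needs 2 hours
    # Sequential: the next load starts when the previous one is done.
    # Pipelined: the next load starts as soon as the washer is free.
    gap = STEP_MIN if pipelined else per_load
    return {name: i * gap + per_load for i, name in enumerate(loads)}

sequential = finish_times(LOADS, pipelined=False)
pipelined = finish_times(LOADS, pipelined=True)

print(sequential["D"] / 60)   # 8.0 hours for all four loads
print(pipelined["D"] / 60)    # 3.5 hours for all four loads
print(sequential["A"], pipelined["A"])   # load A takes 120 minutes either way
```

Load A finishes after 120 minutes in both cases: the time for a single load is unchanged, only the time for the whole workload shrinks.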


Features of Pipelined Processor
All the functional units operate independently
Multiple tasks operate simultaneously using different resources
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
Potential speedup = number of pipe stages


Pipelining Lessons
Pipeline rate is limited by the slowest pipeline stage
Speedup is reduced by:
- Time to “fill” the pipeline and time to “drain” it
- Unbalanced lengths of pipe stages: if the washer takes longer than the dryer, the dryer has to wait!
Stall for dependences
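
A small Python sketch can put numbers on these lessons. It assumes a lock-step pipeline in which every slot lasts as long as the slowest stage; the function names and the 45-minute washer are illustrative assumptions, not values from the lecture.

```python
def pipelined_time(stage_times, n_tasks):
    """Total time in a lock-step pipeline: each slot lasts as long as the slowest stage."""
    slot = max(stage_times)
    # The first task needs one slot per stage (fill); each later task adds one slot.
    return (len(stage_times) + n_tasks - 1) * slot

def speedup(stage_times, n_tasks):
    """Speedup of the pipeline over running the tasks one after another."""
    sequential = n_tasks * sum(stage_times)
    return sequential / pipelined_time(stage_times, n_tasks)

balanced = [30, 30, 30, 30]      # wash, dry, fold, stash
slow_washer = [45, 30, 30, 30]   # assumed unbalanced case: washer is the slowest stage

for n in (4, 1000):
    print(n, round(speedup(balanced, n), 2), round(speedup(slow_washer, n), 2))
```

With only 4 loads the fill and drain time keeps the speedup well below the number of stages. With many loads the balanced pipeline approaches a speedup of 4, while the unbalanced one is capped at sum(stage times) / slowest stage = 135 / 45 = 3.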


Five Steps of Datapath
Ins. fetch (instruction fetch)
Dec/Reg (decode and register read)
Exec (execute)
Mem (memory access)
Wr (write back)



Pipelined Processor Design

[Figure: five-stage pipelined datapath. Instruction Fetch: PC, Next PC, Inst. Mem, IR. ID/Register Read: Reg File, Dcd Ctrl. Execute/Address: Exec, operands A and B, result S, Equal, Ex Ctrl, IRex. Memory Rd/Wrt: Data Mem, Mem Access, M, Mem Ctrl, IRmem. Write Back (Reg. Wrt): Reg. File, WB Ctrl, IRwb.]


Pipeline Control
Register transfers in each stage:

Instruction Fetch:
  IR <- Mem[PC]; PC <- PC + 4

ID/Reg. Rd:
  A <- R[rs]; B <- R[rt]

Exe/Address:
  S <- A + B           (register ALU operation)
  S <- A or ZX         (OR immediate)
  S <- A + SX          (load)
  S <- A + SX          (store)
  If Cond: PC <- PC + SX   (branch)

Memory Rd/Wrt:
  M <- Mem[S]          (load)
  Mem[S] <- B          (store)

Reg. Wrt (WB):
  R[rd] <- S           (register ALU operation)
  R[rt] <- S           (OR immediate)
  R[rd] <- M           (load)
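
The register transfers of this slide can be traced in software. The following Python sketch is only an illustration under simplifying assumptions: instructions arrive as already-decoded fields rather than as an IR word, the branch case is left out, and `sign_extend` and `step_through` are hypothetical helper names. It follows the slide's transfers literally, including the destination register named rd for a load.

```python
def sign_extend(imm16):
    """Interpret a 16-bit immediate as a signed value (SX on the slide)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def step_through(op, regs, mem, pc, rs=0, rt=0, rd=0, imm16=0):
    """Apply the five stages' register transfers for one instruction; return the new PC."""
    # Instruction Fetch: IR <- Mem[PC]; PC <- PC + 4 (fields are passed in pre-decoded)
    pc += 4
    # ID/Reg. Rd: A <- R[rs]; B <- R[rt]
    a, b = regs[rs], regs[rt]
    # Exe/Address: compute S
    if op == "add":
        s = a + b
    elif op == "ori":
        s = a | (imm16 & 0xFFFF)            # ZX: zero-extended immediate
    elif op in ("load", "store"):
        s = a + sign_extend(imm16)          # effective address
    else:
        raise ValueError(f"unsupported op: {op}")
    # Memory Rd/Wrt: M <- Mem[S] for a load, Mem[S] <- B for a store
    m = None
    if op == "load":
        m = mem[s]
    elif op == "store":
        mem[s] = b
    # Reg. Wrt (WB)
    if op == "add":
        regs[rd] = s
    elif op == "ori":
        regs[rt] = s
    elif op == "load":
        regs[rd] = m                        # destination named rd, as on the slide
    return pc

regs = [0] * 8
regs[1], regs[2] = 100, 7
mem = {96: 42}
pc = step_through("load", regs, mem, 0, rs=1, rd=3, imm16=0xFFFC)   # address 100 - 4
pc = step_through("add", regs, mem, pc, rs=2, rt=3, rd=4)           # 7 + 42
pc = step_through("store", regs, mem, pc, rs=1, rt=4, imm16=4)      # address 100 + 4
print(regs[3], regs[4], mem[104], pc)
```

Each instruction runs to completion here; in the pipelined datapath the same transfers happen, but for different instructions in the same clock cycle.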


Pipelined Registers Included

[Figure: the five-stage datapath redrawn with pipeline registers included. Stages: Instruction Fetch (PC, Next PC, Inst. Mem, IR), ID/Register Read (Reg File, Dcd Ctrl), Execute/Address (Exec, A, B, S, Equal, Ex Ctrl, IRex), Memory Rd/Wrt (Data Mem, Mem Access, M, Mem Ctrl, IRmem), Write Back (Reg. File, WB Ctrl, IRwb).]


Five Steps as Stages of Pipeline

Load:
Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Ifetch    Reg/Dec   Exec      Mem       Wr



Multiple Cycle versus Pipeline – Pipeline enhances performance
[Figure: clock (Clk) cycles 1 to 14 across the top]

Multiple Cycle Implementation (Load, Store, R-type run back to back):
Ifetch Reg Exec Mem Wr | Ifetch Reg Exec Mem | Ifetch Reg Exec Mem

Pipeline Implementation (a new instruction starts every cycle):
Load    Ifetch Reg    Exec   Mem    Wr
Store          Ifetch Reg    Exec   Mem    Wr
R-type                Ifetch Reg    Exec   Mem    Wr



Three-instruction program reconsidered
Load
Store
R-type (ADD)



Example
The cycle time of a single cycle machine is 45 ns, and that of the multi cycle and pipelined machines is 10 ns; the average CPI due to the instruction mix on the multi cycle machine is 4.6.
What is the execution time of a 100-instruction program on each type of machine?
Ans:
Single cycle machine
– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multi cycle machine
– 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
Pipelined machine
– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
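
The three results can be checked with a short Python sketch. The function names are made up for illustration; the 5 pipeline stages are taken from the five-step datapath of this lecture, so the drain is stages - 1 = 4 cycles.

```python
def single_cycle_ns(n_inst, cycle_ns=45):
    """Single cycle machine: one long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * n_inst

def multi_cycle_ns(n_inst, cycle_ns=10, cpi=4.6):
    """Multi cycle machine: short cycles, but several of them per instruction."""
    return cycle_ns * cpi * n_inst

def pipelined_ns(n_inst, cycle_ns=10, stages=5):
    """Pipelined machine: one instruction completes per cycle, plus stages - 1 drain cycles."""
    return cycle_ns * (n_inst + stages - 1)

for name, t in [("single cycle", single_cycle_ns(100)),
                ("multi cycle", multi_cycle_ns(100)),
                ("pipelined", pipelined_ns(100))]:
    print(f"{name}: {t:.0f} ns")
```

For longer programs the 4 drain cycles matter less and less, and the pipelined machine approaches 10 ns per instruction.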


Another Example
Consider a multicycle, unpipelined processor that requires 4 cycles for ALU and branch operations and 5 cycles for memory operations.
Assume the relative frequencies of these operations are 40%, 25% and 35% respectively, and the clock cycle is 1 ns.
In the pipelined implementation, clock skew and setup add 0.2 ns of overhead to the clock cycle.

Ignoring any latency impact, how much speedup does the pipelined processor give?


Solution
Unpipelined processor:
Average execution time per instruction = clock cycle x average CPI
= 1 ns x [(0.40 + 0.25) x 4 + 0.35 x 5]
= 1 ns x (0.65 x 4 + 0.35 x 5)
= 1 ns x (2.60 + 1.75)
= 4.35 ns

Pipelined processor:
Average execution time per instruction = clock cycle + overhead
= 1 ns + 0.2 ns
= 1.2 ns

Speedup = 4.35 / 1.2 = 3.625, about 3.6 times
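
The same calculation as a Python sketch, useful for trying other instruction mixes. The dictionary layout and the name `average_cpi` are illustrative choices; the numbers are those of the example.

```python
# Instruction mix of the example: (relative frequency, cycles on the unpipelined machine)
MIX = {
    "ALU":    (0.40, 4),
    "branch": (0.25, 4),
    "memory": (0.35, 5),
}
CLOCK_NS = 1.0       # unpipelined clock cycle
OVERHEAD_NS = 0.2    # added by clock skew and setup in the pipelined design

def average_cpi(mix):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(freq * cycles for freq, cycles in mix.values())

unpipelined_ns = CLOCK_NS * average_cpi(MIX)   # average time per instruction
pipelined_ns = CLOCK_NS + OVERHEAD_NS          # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns

print(f"{unpipelined_ns:.2f} ns, {pipelined_ns:.2f} ns, speedup {speedup:.3f}")
```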


Pipelined Execution Representation
Conventional Representation
- Helps show the program flow vis-a-vis time

Program flow (down) against time (across):
1st Inst.  IFetch Dcd    Exec   Mem    WB
2nd Inst.         IFetch Dcd    Exec   Mem    WB
3rd Inst.                IFetch Dcd    Exec   Mem    WB
4th Inst.                       IFetch Dcd    Exec   Mem    WB
5th Inst.                              IFetch Dcd    Exec   Mem    WB
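
The conventional representation is easy to generate. The sketch below is a minimal illustration for an ideal pipeline without stalls; `pipeline_chart` and `total_cycles` are hypothetical names, and the stage labels are those of the slide.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_chart(n_instructions, stages=STAGES):
    """Return one text row per instruction; instruction i starts i cycles late."""
    width = max(len(s) for s in stages) + 1      # column width for one clock cycle
    rows = []
    for i in range(n_instructions):
        cells = [""] * i + list(stages)          # i empty cycles, then the stages
        row = f"Inst {i + 1}: " + "".join(c.ljust(width) for c in cells)
        rows.append(row.rstrip())
    return rows

def total_cycles(n_instructions, stages=STAGES):
    """Clock cycles until the last instruction leaves the pipeline (no stalls)."""
    return len(stages) + n_instructions - 1

for row in pipeline_chart(5):
    print(row)
print("cycles:", total_cycles(5))
```

Five instructions in a five-stage pipeline occupy 9 clock cycles in total, instead of 25 if each instruction had to finish before the next one started.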


Graphical Representation

[Figure: pipeline diagram with time in clock cycles (CC1 to CC9) across and instruction order (Instr 1 to Instr 5) down. Each instruction passes through I.Mem, Reg, ALU, D.Mem, Reg in successive cycles, starting one cycle after the previous instruction.]


Why Pipeline?
Because the resources are there!
[Figure: pipeline diagram with time in clock cycles across and instruction order (Inst 0 to Inst 4) down. Each instruction uses Im, Reg, ALU, Dm, Reg in successive cycles, so the different resources are busy with different instructions at the same time.]


Can pipelining get us into trouble?
– Structural hazards
– Data hazards
– Control hazards




How do stalls degrade performance?
Pipelined CPI with stalls = Ideal CPI + stall clock cycles per instruction



How do stalls degrade performance?
(taking the ideal pipelined CPI as 1 and equal clock cycle times)

1. Speedup w.r.t. unpipelined = CPI unpipelined / (1 + stall cycles per instruction)

2. Speedup w.r.t. pipeline depth = pipeline depth / (1 + stall cycles per instruction)
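
These relations translate directly into code. The sketch below uses made-up function names and an assumed stall rate of 0.25 cycles per instruction purely as an illustration; as in the formulas, the ideal pipelined CPI is taken as 1.

```python
def pipelined_cpi(stall_cycles_per_inst, ideal_cpi=1.0):
    """Pipelined CPI with stalls = ideal CPI + stall clock cycles per instruction."""
    return ideal_cpi + stall_cycles_per_inst

def speedup_vs_unpipelined(cpi_unpipelined, stall_cycles_per_inst):
    """Speedup over the unpipelined machine, assuming equal clock cycle times."""
    return cpi_unpipelined / pipelined_cpi(stall_cycles_per_inst)

def speedup_vs_depth(pipeline_depth, stall_cycles_per_inst):
    """Same relation when the unpipelined CPI equals the pipeline depth."""
    return pipeline_depth / pipelined_cpi(stall_cycles_per_inst)

# Assumed example: a 5-stage pipeline that stalls 0.25 cycles per instruction on average
print(pipelined_cpi(0.25))         # 1.25
print(speedup_vs_depth(5, 0.25))   # 4.0 instead of the ideal 5
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction pushes it further below that ideal.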



Summary
Multi cycle datapath versus pipelined datapath
Key components of the pipelined datapath
Performance enhancement due to pipelining
Hazards in the pipelined datapath
