Advanced Computer Architecture - Lecture 15: Instruction level parallelism


CS 704
Advanced Computer Architecture

Lecture 15
Instruction Level Parallelism
(Dynamic Branch Prediction)

Prof. Dr. M. Ashraf Chughtai

Today's Topics
Recap - Lecture 14
Dynamic Branch Prediction
Branch Prediction Buffer
Examples of Branch Predictor
Summary

MAC/VU-Advanced
Computer Architecture

Lecture 15 – Instruction Level
Parallelism -Dynamic (4)


Recap: Lecture 14
Tomasulo's approach was developed for the IBM 360/91 to achieve high performance without special compilers
Here, the control and buffers are distributed with the Functional Units (FUs)

Registers in instructions are replaced by values or pointers to reservation stations (RS); i.e., the registers are renamed
Unlike the scoreboard, Tomasulo can have multiple loads outstanding

Recap: Lecture 14
These two properties allow an instruction that has a name dependence to be issued; e.g., MULT is issued even though it has a name dependence on register F2


Recap: Lecture 14
Tomasulo eliminates the WAR hazard: in this example, ADD.D writes its result in Cycle 11 even though DIV.D will not start execution until Cycle 16


Recap: Lecture 14
Tomasulo issues in-order and may execute out-of-order


Recap: Lecture 14

• Here, the integer instructions SUBI and BNEZ are executed out-of-order to evaluate the condition
• The prediction Branch-Taken is implemented by repeating the loop instructions as shown


Recap: Lecture 14
• The prediction Branch-Taken is implemented by two iterations of the code
• R1 has been initialized to 80


Recap: Lecture 14
• L.D is issued in the 6th clock cycle, prior to the condition evaluation (Predict Branch Taken)
• R1 is updated in Clock Cycle 6, by executing SUBI in Clock Cycle 5
• SUBI and BNEZ are issued in Clock Cycles 4 and 5 respectively
• F0 never sees the result

Recap: Lecture 14

• MUL1, issued in clock cycle 2, does not start execution until the write to F0 by L.D is complete, to avoid a RAW hazard


Recap: Lecture 14
• L.D 1 issues in cycle 1 and completes execution in cycle 9 (8 cycles the first time); it writes to F0 in cycle 10
• L.D 2, issued in cycle 6, completes execution in 4 cycles the second time
• So MUL1 will start in cycle 11, avoiding a RAW hazard
• SD1 will start execution on the completion of MUL1, to avoid a RAW hazard

• SUBI and BNEZ are issued in clock cycles 9 and 10 respectively
• SUBI completes execution in cycle 10 and updates R1 for the next iteration


Recap: Lecture 14
• MUL1 execution, started in cycle 11, completes in cycle 14 and writes its result to F4 in cycle 15
• SD1, issued in cycle 3, will start execution in cycle 16, avoiding a RAW hazard


Recap: Lecture 14
• MUL1 execution, started in cycle 11, completes in cycle 14 and writes its result to F4 in cycle 15
• SD1, issued in cycle 3, will start execution in cycle 16 and complete in cycle 18
• SUBI, issued in cycle 16, updates R1 for the next iteration in cycle 18


Recap: Lecture 14
• MUL2 execution, started in cycle 12, completes in cycle 15 and writes its result to F4 in cycle 16
• SD2, issued in cycle 8, starts execution in cycle 17, after MUL2 writes its result in cycle 16, to avoid a RAW hazard


Introduction to Dynamic Branch Prediction
In the last lecture, we considered a loop-based example to discuss Tomasulo’s approach to overcoming the WAW and WAR hazards
Here, we observed that a dynamically scheduled pipeline can yield high performance provided branches are predicted accurately


Branch History Table
If the prediction is wrong, then invert the prediction bit


1-bit Dynamic Branch Prediction
Problem:
- In a loop, a 1-bit BHT will cause two mispredictions in a row: one at the loop exit, and one on the first iteration when the loop is entered again
- A 1-bit predictor mispredicts at twice the rate that the branch is not taken
- Let us consider an example of a loop branch (For i=1 to 10); i.e., the branch is taken 9 times and not taken once

1-bit Dynamic Branch Prediction … Conclusion
Performance = ƒ(accuracy, cost of mispredictions)
The accuracy of the predictor is expected to match the taken-branch frequency, which in the previous example is 9 out of 10 (90%)
But the 1-bit predictor achieves only 8 out of 10 (80%)
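The 80% figure can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function name `simulate_one_bit`, the starting value of the prediction bit, and the number of repeated loop executions are all assumptions chosen for illustration.

```python
def simulate_one_bit(outcomes, initial_prediction=True):
    """Count mispredictions of a 1-bit predictor over a list of branch outcomes.

    outcomes: list of booleans, True = branch taken.
    initial_prediction: assumed starting value of the prediction bit.
    """
    prediction = initial_prediction
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
            prediction = not prediction  # invert the prediction bit on a miss
    return mispredictions


# One execution of the loop branch: taken 9 times, then not taken once.
loop = [True] * 9 + [False]

# Execute the loop repeatedly so the assumed starting bit stops mattering.
runs = 100
misses = simulate_one_bit(loop * runs)
accuracy = 1 - misses / (len(loop) * runs)
```

If the bit is already wrong on entry (`initial_prediction=False`, as it is after a previous loop exit), a single pass over `loop` gives 2 mispredictions out of 10 branches, which is the 8 out of 10 quoted above.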

2-bit Dynamic Branch Prediction
2 bits are used to encode 4 states in the system (a counter), say:
States 00 and 01 for Predict Not-Taken
States 10 and 11 for Predict Taken


2-bit Dynamic Branch Prediction
[State diagram: four states connected by transitions labeled T (branch taken) and NT (branch not taken). State 11 and State 10 predict Taken; State 01 and State 00 predict Not Taken]


2-bit Dynamic Branch Prediction
In a saturating counter implementation:
The 2-bit counter saturates at:
- 00 (Predict Not Taken) or
- 11 (Predict Taken)
The counter is incremented when a branch is taken and decremented when it is not taken; e.g.,
- 00 to 01 for Taken when predicted not taken
- 10 to 11 for Taken when predicted taken
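The increment and decrement rule can be written as a pair of small functions. This is only a sketch under stated assumptions: the names `update_counter` and `predict_taken` are hypothetical, and the state is held as an integer 0..3 standing for the bit patterns 00..11.

```python
def update_counter(state, taken):
    """Return the next state of a 2-bit saturating counter.

    state: integer 0..3 (binary 00..11); taken: True if the branch was taken.
    The counter stops at 3 when incremented and at 0 when decremented.
    """
    if taken:
        return min(state + 1, 0b11)
    return max(state - 1, 0b00)


def predict_taken(state):
    """States 10 and 11 predict taken; states 00 and 01 predict not taken."""
    return state >= 0b10


# The two transitions listed above, plus the two saturation cases.
examples = [
    update_counter(0b00, True),   # 00 to 01
    update_counter(0b10, True),   # 10 to 11
    update_counter(0b11, True),   # stays at 11
    update_counter(0b00, False),  # stays at 00
]
```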


2-bit Dynamic Branch Prediction
Here, when the counter is greater than or equal to ½ of its maximum value (>=10; i.e., states 10 and 11), the branch is predicted as taken;
otherwise (i.e., <10: states 01 and 00) the branch is predicted as untaken

2-bit Dynamic Branch Prediction
Let us try the example of loop For i=1,10
Iteration   P.S.   Branch      N.S.   Prediction
0           -      not Taken   11     Taken
1           11     Taken       11     Taken
2           11     Taken       11     Taken
:
9           11     Taken       11     Taken
10          11     Not taken   10     Taken
Prediction fails once only
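The rows of this table can be regenerated with a short trace. The sketch below assumes a saturating-counter update and a starting state of 11, matching the table; the function name `trace_two_bit` and the tuple layout of each row are illustrative choices, not something taken from the slides.

```python
def trace_two_bit(outcomes, state=0b11):
    """Trace a 2-bit saturating counter over a list of branch outcomes.

    Returns one tuple per branch:
    (iteration, present state, branch taken?, next state, prediction correct?)
    with the states formatted as 2-character bit strings.
    """
    rows = []
    for i, taken in enumerate(outcomes, start=1):
        predicted_taken = state >= 0b10  # states 10 and 11 predict taken
        next_state = min(state + 1, 0b11) if taken else max(state - 1, 0b00)
        rows.append((i, format(state, "02b"), taken,
                     format(next_state, "02b"), predicted_taken == taken))
        state = next_state
    return rows


# Loop branch for i = 1..10: taken 9 times, not taken on the last iteration.
loop = [True] * 9 + [False]
rows = trace_two_bit(loop)
misses = sum(1 for row in rows if not row[4])
```

Because the loop leaves the counter in state 10, which still predicts taken, executing the same loop again starts with a correct prediction, so the only miss per execution is the one at the loop exit.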



Branch Prediction Buffer (BPB) or BHT
Implementation
If the prediction is wrong, then the prediction bits are changed:
Predicted Taken: state changes 11 → 10
Predicted Not Taken: state changes 00 → 01