CS 704
Advanced Computer Architecture
Lecture 15
Instruction Level Parallelism
(Dynamic Branch Prediction)
Prof. Dr. M. Ashraf Chughtai
Today's Topics
Recap - Lecture 14
Dynamic Branch Prediction
Branch Prediction Buffer
Examples of Branch Predictor
Summary
MAC/VU-Advanced
Computer Architecture
Lecture 15 – Instruction Level
Parallelism -Dynamic (4)
2
Recap: Lecture 14
Tomasulo's Approach for IBM 360/91 to achieve
high Performance without special compilers
Here, the control and buffers are distributed
with Function Units (FU)
Registers in instructions are replaced by values
or by pointers to reservation stations (RS); i.e., the
registers are renamed
Unlike Scoreboard, Tomasulo can have multiple
loads outstanding
Recap: Lecture 14
These two properties allow an instruction with a name
dependence to be issued; e.g., MULT is issued although it
has a name dependence on register F2
Recap: Lecture 14
Tomasulo eliminates the WAR hazard: in this example,
ADD.D writes its result in cycle 11 even though
DIV.D does not start execution until cycle 16
Recap: Lecture 14
Tomasulo issues in order and may execute out of order
Recap: Lecture 14
• Here, the integer instructions SUBI and BNEZ are
executed out-of-order to evaluate the condition
• The prediction Branch-Taken is implemented by
repeating the loop instructions as shown
Recap: Lecture 14
• The prediction Branch-Taken is implemented by two iterations of the code
• R1 has been initialized to 80
Recap: Lecture 14
• L.D is issued in the 6th clock cycle, prior to the condition evaluation (Predict Branch Taken)
• R1 is updated in clock cycle 6, by executing SUBI in clock cycle 5
• SUBI and BNEZ are issued in clock cycles 4 and 5 respectively
• F0 never sees the result
Recap: Lecture 14
• MUL1, issued in clock cycle 2, does not start execution until the write to F0 by L.D is complete, to avoid the WAR hazard
Recap: Lecture 14
• L.D 1 issues in cycle 1 and completes execution in cycle 9 (8 clock cycles the first time); it writes to F0 in cycle 10
• L.D 2, issued in cycle 6, completes execution (4 clock cycles the second time)
• So MUL1 will start in cycle 11, avoiding the WAR hazard
• SD1 will start execution on the completion of MUL1 to avoid the WAW hazard
• SUBI and BNEZ are issued in clock cycles 9 and 10 respectively
• SUBI completes execution in cycle 10 and updates R1 for the next iteration
Recap: Lecture 14
• MUL1 execution, started in cycle 11, completes in cycle 14 and writes its result to F4 in cycle 15
• SD1, issued in cycle 3, will start execution in cycle 16, avoiding the WAR hazard
Recap: Lecture 14
• MUL1 execution, started in cycle 11, completes in cycle 14 and writes its result to F4 in cycle 15
• SD1, issued in cycle 3, will start execution in cycle 16 and completes in cycle 18
• SUBI, issued in cycle 16, updates R1 for the next iteration in cycle 18
Recap: Lecture 14
• MUL2 execution, started in cycle 12, completes in cycle 15 and writes its result to F4 in cycle 16
• SD2, issued in cycle 8, starts execution in cycle 17, after MUL2 writes its result in cycle 16, to avoid the WAR hazard
Introduction to Dynamic Branch Prediction
In the last lecture, we considered a loop-based example
to discuss how Tomasulo's approach overcomes the
WAW and WAR hazards
Here, we observed that a dynamically
scheduled pipeline can yield high
performance provided branches are
predicted accurately
Branch History Table
If the prediction is wrong, then invert the
prediction bit
1-bit Dynamic Branch Prediction
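Problem:
- In a loop, a 1-bit BHT causes two mispredictions per execution of the loop: one on the last iteration (the loop exit) and one on the first iteration of the next execution
- A 1-bit predictor therefore mispredicts at twice the rate at which the branch is not taken
- Let us consider an example of a loop branch (For i = 1 to 10); i.e., the branch is taken 9 times and not taken once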
1-bit Dynamic Branch Prediction … Conclusion
Performance = ƒ(accuracy, cost of mispredictions)
The accuracy of the predictor is expected to match the taken-branch frequency, which in the previous example is 9 out of 10 (90%)
But the 1-bit predictor achieves only 8 out of 10 (80%)
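The 80% figure can be checked with a small simulation. The following is a minimal Python sketch, not part of the original lecture; the function name simulate_one_bit and the choice of starting the prediction bit at "taken" are assumptions made for illustration.

```python
def simulate_one_bit(outcomes, initial_prediction=True):
    """Count mispredictions of a 1-bit predictor.

    outcomes: list of booleans, True meaning the branch was taken.
    The single prediction bit is inverted on every misprediction.
    """
    prediction = initial_prediction
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
            prediction = not prediction  # wrong prediction: invert the bit
    return mispredictions


# One execution of the loop: the branch is taken 9 times, then not taken once.
loop_outcomes = [True] * 9 + [False]

# First execution, with the bit initially set to "taken" (an assumption).
first = simulate_one_bit(loop_outcomes)

# Mispredictions added by the third back-to-back execution (steady state).
steady = simulate_one_bit(loop_outcomes * 3) - simulate_one_bit(loop_outcomes * 2)

print("first execution:", first, "misprediction(s)")
print("steady state:", steady, "mispredictions per 10 branches")
```

In steady state the predictor misses twice per execution of the loop, once on the exit and once on re-entry, which corresponds to 8 correct predictions out of 10.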
2-bit Dynamic Branch Prediction
Two bits are used to encode the 4 states of the
system (a counter), say:
States 00 and 01 for Predict Not-Taken
States 10 and 11 for Predict Taken
2-bit Dynamic Branch Prediction
[Figure: state-transition diagram of the 2-bit predictor with four states: 11 and 10 (Predict Taken), 01 and 00 (Predict Not Taken). Arcs are labeled T (branch taken) and NT (branch not taken).]
2-bit Dynamic Branch Prediction
In a saturating counter implementation, the 2-bit counter saturates at:
- 00 (Predict Not Taken) or
- 11 (Predict Taken)
The counter is incremented when a branch is
taken and decremented when it is not taken; e.g.,
- 00 to 01 for Taken when predicted not taken
- 10 to 11 for Taken when predicted taken
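The increment and decrement rule can be sketched in a few lines of Python. This is an illustrative sketch only; the function names update_counter and predict_taken are hypothetical, and the states are stored as the integers 0b00 to 0b11.

```python
def update_counter(state, taken):
    """2-bit saturating counter: count up on taken, down on not taken.

    The value is clamped so it never leaves the range 0b00..0b11.
    """
    if taken:
        return min(state + 1, 0b11)  # saturate at 11
    return max(state - 1, 0b00)      # saturate at 00


def predict_taken(state):
    """States 10 and 11 predict taken; states 00 and 01 predict not taken."""
    return state >= 0b10


# The two transitions listed above, plus the two saturation cases.
print(format(update_counter(0b00, True), "02b"))   # 00 -> 01
print(format(update_counter(0b10, True), "02b"))   # 10 -> 11
print(format(update_counter(0b11, True), "02b"))   # stays at 11
print(format(update_counter(0b00, False), "02b"))  # stays at 00
```

Clamping with min and max is one simple way to express saturation; a hardware implementation would use a 2-bit up/down counter instead.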
2-bit Dynamic Branch Prediction
Here, when the counter is greater than
or equal to ½ of its maximum value
(>= 10; i.e., states 10 and 11), the branch is
predicted as taken;
otherwise (i.e., < 10: states 01 and 00), the
branch is predicted as untaken
2-bit Dynamic Branch Prediction
Let us try the example of the loop For i = 1 to 10

Iteration   P.S.   Branch      N.S.   Prediction
0           -      Not taken   11     Taken
1           11     Taken       11     Taken
2           11     Taken       11     Taken
:
9           11     Taken       11     Taken
10          11     Not taken   10     Taken

(P.S. = present state, N.S. = next state)
The prediction fails only once
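The trace above can be reproduced with a short simulation. This is a minimal Python sketch under stated assumptions: the function name run_loop is hypothetical, the counter starts in state 11 as in the table, and the loop is executed twice in a row to show that re-entering the loop does not cost a second misprediction.

```python
def run_loop(state, outcomes):
    """Run a 2-bit saturating counter over a sequence of branch outcomes.

    Returns the final counter state and the number of mispredictions.
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 0b10          # states 10 and 11 predict taken
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update: up on taken, down on not taken.
        state = min(state + 1, 0b11) if taken else max(state - 1, 0b00)
    return state, mispredictions


# For i = 1 to 10: the loop branch is taken 9 times, then not taken once.
loop_outcomes = [True] * 9 + [False]

state_after_first, misses_first = run_loop(0b11, loop_outcomes)
state_after_second, misses_second = run_loop(state_after_first, loop_outcomes)

print(format(state_after_first, "02b"), misses_first)
print(format(state_after_second, "02b"), misses_second)
```

Each execution of the loop ends in state 10, which still predicts taken, so the first branch of the next execution is predicted correctly and only the loop exit is mispredicted.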
Branch Prediction Buffer (BPB) or BHT
Implementation
Branch Prediction Buffer (BPB) or BHT
Implementation
If the prediction is wrong, then the prediction bits are changed:
- Predicted taken: state changes 11 to 10
- Predicted not taken: state changes 00 to 01
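A branch prediction buffer can be modeled as a small table of 2-bit counters. The sketch below is illustrative only and rests on assumptions that are not stated in the slides: the class name BranchHistoryTable, a table of 16 entries, 4-byte instructions, and indexing by the low-order bits of the branch address.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters (hypothetical model)."""

    def __init__(self, entries=16, initial_state=0b00):
        self.entries = entries
        self.counters = [initial_state] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the two lowest bits are dropped
        # and the next low-order bits select the entry.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True when the entry predicts taken (states 10 and 11)."""
        return self.counters[self._index(pc)] >= 0b10

    def update(self, pc, taken):
        """Move the entry's counter toward the actual branch outcome."""
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 0b11) if taken else max(c - 1, 0b00)


# Predicted taken but actually not taken: state 11 changes to 10.
bht = BranchHistoryTable(entries=16, initial_state=0b11)
bht.update(0x40, taken=False)
print(format(bht.counters[0], "02b"), bht.predict(0x40))

# Predicted not taken but actually taken: state 00 changes to 01.
cold = BranchHistoryTable(entries=16, initial_state=0b00)
cold.update(0x40, taken=True)
print(format(cold.counters[0], "02b"), cold.predict(0x40))
```

In both cases a single misprediction changes the state but not the prediction; a second misprediction in the same direction is needed to flip it. Because the table in this sketch has no tags, two branches whose addresses map to the same entry (for example 0x40 and 0x80 here) share one counter.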