Advanced Computer Architecture - Lecture 16: Instruction level parallelism

CS 704
Advanced Computer Architecture

Lecture 16
Instruction Level Parallelism
(Dynamic Branch Prediction, Cont'd)

Prof. Dr. M. Ashraf Chughtai


Today's Topics
Recap
Correlating Branch Predictors
Tournament Predictor
High Performance Instruction Delivery – Branch Target Buffer
Summary

MAC/VU-Advanced Computer Architecture
Lecture 16 – Instruction Level Parallelism - Dynamic (5)



Recap: Dynamic Scheduling and Branch Prediction
- Static: rely on the software (compiler)
- Dynamic: hardware intensive approaches



Important Questions: Branch-Prediction Buffer

Q1: What is the impact of increasing the size of the branch-prediction buffer on two branches in a program?

A single predictor that serves a single branch is generally more accurate than the same predictor serving more than one branch instruction.
However, two given branches in a program are unlikely to share a single predictor entry in the first place.
Therefore, increasing the size of the prediction buffer does not have a significant effect on those two branches.



Q2: Branch-Prediction Buffer

How does sharing a predictor affect the misprediction rate?
This is explained with the help of the following example.
Consider two sequences of taken (T) and not-taken (NT) branch outcomes that share a 1-bit predictor, and identify the sequence that
a) reduces the misprediction rate
b) increases the misprediction rate


Example: Sequence 1

              P   B1   P   B2   P   B1   P   B2   P   B1   P   B2   P   B1   P   B2
Prediction    NT  T    T   NT   NT  T    T   NT   NT  T    T   NT   NT  T    T   NT
Correct?      -   No   -   No   -   No   -   No   -   No   -   No   -   No   -   No

Here, each column P shows the prediction held by the shared 1-bit predictor, and the columns B1 and B2 show the actual outcomes of branches B1 and B2
B1 is always TAKEN
B2 is always NOT-TAKEN


Example: Sequence 2

              P   B1   P   B2   P   B1   P   B2   P   B1   P   B2   P   B1   P   B2
Prediction    NT  T    T   NT   NT  NT   NT  T    T   T    T   NT   NT  NT   NT  T
Correct?      -   No   -   No   -   Yes  -   No   -   Yes  -   No   -   Yes  -   No



Example: Conclusion
Why does sharing a predictor increase the misprediction rate?

It is clear from the above example that if a predictor is shared by a set of branch instructions, the members of that set may change over the course of execution of a long program.

Hence, the branch history seen by the predictor changes, and the predictor is likely to mispredict more often
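
The two example sequences can be replayed in a few lines of Python. This is only a sketch under stated assumptions: the predictor starts in the NOT-TAKEN state, B1 and B2 strictly alternate, Sequence 1 has B1 always taken and B2 always not taken, and Sequence 2 has B1 alternating T, NT and B2 alternating NT, T (the pattern the prediction rows imply). The helper names `mispredictions` and `interleave` are made up for illustration.

```python
T, NT = True, False


def mispredictions(outcomes, initial=NT):
    """Count mispredictions of one 1-bit predictor over a list of outcomes."""
    state = initial              # assumed initial prediction: not taken
    misses = 0
    for taken in outcomes:
        if state != taken:       # prediction differs from the actual outcome
            misses += 1
        state = taken            # a 1-bit predictor remembers the last outcome
    return misses


def interleave(a, b):
    """Return a[0], b[0], a[1], b[1], ... (B1 and B2 executed alternately)."""
    return [x for pair in zip(a, b) for x in pair]


# Sequence 1: B1 always taken, B2 always not taken
b1_seq1, b2_seq1 = [T] * 4, [NT] * 4
# Sequence 2 (assumed pattern): both branches alternate, out of phase
b1_seq2, b2_seq2 = [T, NT, T, NT], [NT, T, NT, T]

shared_1 = mispredictions(interleave(b1_seq1, b2_seq1))
private_1 = mispredictions(b1_seq1) + mispredictions(b2_seq1)
shared_2 = mispredictions(interleave(b1_seq2, b2_seq2))
private_2 = mispredictions(b1_seq2) + mispredictions(b2_seq2)

print("Sequence 1: shared", shared_1, "of 8, private", private_1, "of 8")
print("Sequence 2: shared", shared_2, "of 8, private", private_2, "of 8")
```

Under these assumptions, sharing raises the misprediction count in Sequence 1 (8 instead of 1) and lowers it in Sequence 2 (5 instead of 7), so Sequence 1 is the case that increases the misprediction rate.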


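Correlating Branch Predictors Re-visited
We have observed that in the program segment

IF (d==0)
    d=1;
IF (d==1)
    d=2;

Branch b1 is taken for d!=0
Branch b2 is taken for d!=1

the behavior of b2 is correlated with that of b1: if b1 is not taken (d==0), then d is set to 1 and b2 is not taken either. A predictor that uses only the history of a single branch cannot capture this.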


Correlating Branch Predictors Re-visited
This problem may be resolved with a correlating branch predictor, which records whether each of the m most recently executed branches was taken or not taken, and uses this branch pattern to select the proper branch-history table for the current branch from among 2^m branch-history tables (of 1-, 2-, ... or n-bit predictors)


Correlating Branch Predictors
In general, an (m, n) predictor records the outcomes of the last m branches and uses them to select among 2^m history tables, each made of n-bit counters (i.e., 2^m n-bit predictors per branch entry)
A 2-bit BHT is regarded as a (0,2) correlating predictor
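
The (m, n) notation also fixes the storage cost: 2^m tables, each with one n-bit counter per branch-selected entry. A minimal sketch of that arithmetic follows; the function name `predictor_bits` is made up for illustration, and the 1,024-entry figure is assumed to count entries per table.

```python
def predictor_bits(m: int, n: int, entries_per_table: int) -> int:
    """Total storage of an (m, n) predictor in bits.

    2**m history tables, each holding `entries_per_table` counters of n bits.
    """
    return (2 ** m) * n * entries_per_table


# A plain 2-bit BHT is the (0,2) case: one table only.
print(predictor_bits(0, 2, 4096))   # 8192
# A (2,2) predictor with 1,024 branch-selected entries per table (assumption).
print(predictor_bits(2, 2, 1024))   # 8192
# A (2,2) predictor with 16 entries in each of its 4 tables.
print(predictor_bits(2, 2, 16))     # 128
```

Under that assumption, a 4,096-entry 2-bit BHT and a 1,024-entry (2,2) predictor use the same number of bits.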



Correlating Branch Predictor: Example
A 1-bit predictor with 1-bit correlation is written as a (1,1) predictor
Here, each branch has two (2^1) separate prediction bits (i.e., two 1-bit BHTs):

- one prediction bit is used if the last branch executed was not taken
- the other prediction bit is used if the last branch executed was taken


Correlating Branch Predictor: Example
The two prediction bits are denoted as:
(prediction used when the last branch was NT) / (prediction used when the last branch was T)
E.g., T/NT stands for:
the prediction is TAKEN if the last branch was NOT-TAKEN, and NOT-TAKEN if the last branch was TAKEN
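
To see what the second prediction bit buys, the sketch below runs the d==0 / d==1 code segment through a plain 1-bit predictor and through a (1,1) predictor. Several things here are assumptions rather than part of the slides: d alternates between 2 and 0 on successive executions, all prediction bits start as NOT-TAKEN, and the "last branch" before the first one counts as not taken. The function names are illustrative only.

```python
def branch_outcomes(d):
    """Outcomes of b1 and b2 for one pass through the segment (True = taken)."""
    b1_taken = d != 0        # b1 is taken when d != 0, skipping d = 1
    if d == 0:
        d = 1
    b2_taken = d != 1        # b2 is taken when d != 1, skipping d = 2
    return [("b1", b1_taken), ("b2", b2_taken)]


def simulate_one_bit(d_values):
    """Each branch has one private prediction bit, initially not taken."""
    pred = {"b1": False, "b2": False}
    misses = 0
    for d in d_values:
        for name, taken in branch_outcomes(d):
            misses += pred[name] != taken
            pred[name] = taken
    return misses


def simulate_one_one(d_values):
    """(1,1) predictor: two bits per branch, chosen by the last branch outcome."""
    # bits[name][0] is used when the last branch was NT, bits[name][1] when T
    bits = {"b1": [False, False], "b2": [False, False]}
    last_taken = False       # assumed history before the first branch
    misses = 0
    for d in d_values:
        for name, taken in branch_outcomes(d):
            sel = int(last_taken)
            misses += bits[name][sel] != taken
            bits[name][sel] = taken      # update only the selected bit
            last_taken = taken           # 1-bit global history
    return misses


d_values = [2, 0, 2, 0]
print(simulate_one_bit(d_values))   # 8: every branch is mispredicted
print(simulate_one_one(d_values))   # 2: only the first pass mispredicts
```

With these inputs the 1-bit predictor mispredicts all eight branches, while the (1,1) predictor mispredicts only the two branches of the first pass.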


Correlating Branch Predictor: Example
In an (m,n) predictor, the global history of the most recent m branches is recorded in an m-bit shift register
Here, each bit records whether a branch was taken or not taken

The branch-prediction buffer is indexed using the concatenation of the low-order bits of the branch address with the m-bit global history
A typical (2,2) correlating predictor is shown next


(2, 2) Correlating Branches Predictor
[Figure: a (2,2) branch-prediction buffer made of 2^2 = 4 predictors with 2 bits per branch and 16 entries each (entries 0-15, 16-31, 32-47, 48-63). The 4-bit branch address forms the lower part of the 6-bit address; the 2-bit global branch history forms the upper 2 bits of the 6-bit address. The selected entry supplies the prediction.]


(2, 2) Correlating Branches Predictor
Here, the buffer is drawn as a 2-dimensional object, with each entry 2 bits wide; in reality the entries are arranged linearly
The (2, 2) branch-prediction buffer uses the 2-bit global history to choose from among 4 predictors for each 4-bit branch address (i.e., among the 16 entries in each of the 4 predictors)




(2, 2) Correlating Branches Predictor
The behavior of recent branches selects among, say, four predictions for the next branch, and only the selected prediction is updated
Indexing is done by concatenating the 4 low-order bits of the branch address (word address) with the 2 global history bits to form a 6-bit address, which selects a 2-bit prediction from 64 entries (4 buffers of 16 entries each)
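
The indexing scheme can be sketched as a small Python class. This is an illustration under assumptions, not a description of any particular hardware: the 2-bit predictions are modelled as saturating counters (0 to 3, predict taken at 2 or above) that start at 0, the global history occupies the upper 2 bits of the index, and the most recent outcome is shifted into the low history bit. The class name `TwoTwoPredictor` is made up.

```python
class TwoTwoPredictor:
    """Sketch of a (2,2) correlating predictor with 64 two-bit entries."""

    def __init__(self):
        self.counters = [0] * 64    # 4 buffers x 16 entries, arranged linearly
        self.history = 0            # 2-bit global branch history

    def index(self, word_addr):
        # upper 2 bits: global history; lower 4 bits: branch word address
        return (self.history << 4) | (word_addr & 0xF)

    def predict(self, word_addr):
        return self.counters[self.index(word_addr)] >= 2   # True = taken

    def update(self, word_addr, taken):
        i = self.index(word_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into the 2-bit history register
        self.history = ((self.history << 1) | int(taken)) & 0b11


p = TwoTwoPredictor()
print(p.index(0x25))        # 5: history 00, low address bits 0101
p.update(0x25, True)
print(p.index(0x25))        # 21: history 01 selects the second buffer
```

Only the entry selected by the current history is updated, so the same branch address can hold a different prediction for each of the four history patterns.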


Comparison of (0,2) and (2,2) predictors
[Figure: bar chart of the frequency of mispredictions (vertical axis, 0% to 18%) for the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]