
Advanced Computer Architecture - Lecture 23: Instruction level parallelism


CS 704
Advanced Computer Architecture

Lecture 23
Instruction Level Parallelism
(Hardware Support at Compile Time)

Prof. Dr. M. Ashraf Chughtai


Today’s Topics
Recap
H/W Support at Compile Time
– Conditional/Predicated Instructions
– H/W based Compiler Speculation
Summary

MAC/VU-Advanced
Computer Architecture

Lecture 23 – Instruction Level
Parallelism-Static (4)

2


Recap: H/W and S/W Exploitation
We have studied both dynamic and static scheduling techniques
that exploit ILP, issuing one or more instructions per clock
cycle, to enhance processor performance.

The dynamic approaches rely on hardware modifications, as in
dynamically scheduled superscalar processors, whereas VLIW
processors rely on static scheduling by the compiler.



Recap …… Cont’d
Furthermore, enhancements to the pipeline structure help:
– Tomasulo’s pipeline helps to overcome structural and
data hazards, and
– Branch predictors minimize the stalls due to control hazards



Recap …… Cont’d
The static scheduling approaches
include
– Loop unrolling
– Software Pipelining
– Trace Scheduling
– Superblock Scheduling


Recap …… Cont’d
These techniques aim to increase ILP by exploiting a
processor that issues more than one instruction every cycle.
They give better performance when the behavior of the
branches can be predicted correctly at compile time.


Recap …… Cont’d
Otherwise, the parallelism cannot be completely exposed at
compile time.
This is due to the following two reasons:



Recap …… Cont’d
1. Control dependences limit the amount of parallelism that
can be exploited; and
2. Dependences between memory reference instructions can
prevent the code movement necessary to increase parallelism
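
The second reason can be illustrated with a small sketch. It is only an illustration under simple assumptions: memory is modelled as a Python dict, and the function names and addresses are hypothetical, not taken from the lecture. It shows that hoisting a load above a store is harmless when the two addresses differ, but changes the loaded value when they refer to the same location.

```python
# Illustrative model only: memory is a plain dict keyed by integer address.

def store_then_load(mem, store_addr, value, load_addr):
    """Original program order: the store executes before the load."""
    mem = dict(mem)              # work on a copy so calls do not interfere
    mem[store_addr] = value
    return mem[load_addr]

def load_then_store(mem, store_addr, value, load_addr):
    """Reordered code: the load has been moved above the store."""
    mem = dict(mem)
    loaded = mem[load_addr]
    mem[store_addr] = value
    return loaded

memory = {100: 1, 104: 2}

# Different addresses: both orders read the same value, so the move is safe.
same_when_disjoint = (store_then_load(memory, 100, 9, 104)
                      == load_then_store(memory, 100, 9, 104))

# Same address: the hoisted load misses the newly stored value.
same_when_aliased = (store_then_load(memory, 100, 9, 100)
                     == load_then_store(memory, 100, 9, 100))
```

When the compiler cannot prove at compile time that the two addresses differ, it has to assume the second case and keep the original order.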


Hardware Support for VLIW
These limitations, particularly for VLIW processors, can be
overcome by providing hardware support at compile time.
Today, we will introduce some techniques based on hardware
support to help:


Hardware Support for VLIW
– overcome these limitations; and
– expose more parallelism at compile time
The most commonly used of these techniques are:


Hardware Support for VLIW
1. Extension of the instruction set to include conditional or
predicated instructions (instructions whose execution is
based on a condition)
2. Hardware speculation, to enhance the ability of the compiler:



Hardware Support for VLIW
– to move code over branches, while preserving exception
behavior
– to reorder load/store instructions when no conflict is
suspected, though its absence is not certain


1: Instruction Set Extension
Conditional or predicated instructions in the extended
instruction set:
– allow the compiler to group instructions across branches
– eliminate branches
– convert control dependence into data dependence


Extension of Instruction Set
These approaches are equally useful for hardware-intensive
and software-intensive schemes, i.e., for dynamic as well as
static scheduling.
As discussed earlier, the IA-64 architecture includes predicate
registers to implement predicated instructions and improve
performance.



Conditional Instructions
Now let us discuss the concept behind introducing conditional
instructions into the instruction set.

A conditional instruction has an extra operand: a one-bit
predicate register.
A condition is evaluated as part of instruction execution to
set the value of the predicate register.



Conditional or Predicated Instructions

In HPL-PD from HP Labs, the value of the predicate register is
typically set by a compare-to-predicate (CMPP) operation:

p1 = CMPP <= r1, r2

Here the predicate register p1 is set if r1 <= r2.


Conditional or Predicated Instructions

If the condition is true (p1=1), the instruction is executed
normally.

If the condition is false (p1=0), execution continues as if
the instruction were a no-operation.



Conditional Instructions
Typical conditional instructions for pipelined processors are:
Conditional Move – CMOVZ R1, R2, R3
It moves the value from one register to another if the
condition is true, i.e., if the third operand, the predicate
register R3, is zero.

Such instructions are used to eliminate branch code sequences.


Conditional Instructions
Conditional ADD –
(R8) ADD R1, R2, R3
The addition R1 = R2 + R3 occurs only if the predicate
register R8 is 1.
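
The behaviour of the predicated ADD can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the register file is a dict, and the function name `predicated_add` and the sample register values are hypothetical.

```python
def predicated_add(regs, pred, dst, src1, src2):
    """Model of '(pred) ADD dst, src1, src2'.

    The sum is written to dst only when the predicate register holds 1;
    otherwise the register file is left unchanged (a no-operation).
    """
    if regs[pred] == 1:
        regs[dst] = regs[src1] + regs[src2]
    return regs

# R8 = 1: the addition takes effect.
taken = predicated_add({"R1": 0, "R2": 5, "R3": 6, "R8": 1},
                       "R8", "R1", "R2", "R3")

# R8 = 0: the instruction is nullified and R1 keeps its old value.
skipped = predicated_add({"R1": 0, "R2": 5, "R3": 6, "R8": 0},
                         "R8", "R1", "R2", "R3")
```

In both cases the instruction occupies an issue slot; only the register write depends on the predicate.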




Conditional Instructions
Conditional Load – LWC R1, 0(R2), R3
The load occurs unless the third operand, R3, is zero.
An LW instruction, or a short block of code, following a
branch can be converted to LWC and moved up into the second
issue slot, improving the execution time by several cycles.
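
A rough Python model of the conditional load is given here as a sketch, not as a definition of any real instruction set. Registers and memory are dicts, and the function name `lwc` and the addresses are hypothetical. The point it illustrates is that when the condition register is zero, memory is never accessed, so an address that would be invalid causes no error.

```python
def lwc(regs, mem, dst, offset, base, cond):
    """Model of 'LWC dst, offset(base), cond'.

    The load is performed unless the condition register is zero.
    When it is zero, memory is not touched and dst is unchanged.
    """
    if regs[cond] != 0:
        regs[dst] = mem[regs[base] + offset]
    return regs

mem = {0x1000: 42}

# R3 != 0: the load happens and R1 receives the word at 0(R2).
loaded = lwc({"R1": 0, "R2": 0x1000, "R3": 1}, mem, "R1", 0, "R2", "R3")

# R3 == 0: R2 holds an address absent from mem, but it is never
# dereferenced, so no KeyError is raised and R1 keeps its old value.
squashed = lwc({"R1": 7, "R2": 0xDEAD, "R3": 0}, mem, "R1", 0, "R2", "R3")
```

This is what makes it safe to move the load above the branch that originally guarded it.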


Conditional or Predicated Instructions: Example 1

Let us consider the conditional statement:
if (A==0) { S=T; }
i.e., the value of S is replaced by T if the value of A is zero.
Assume that registers R1, R2 and R3 hold the values of
A, S and T, respectively.


Example 1
The code to implement this conditional statement can be written as:

    BNEZ  R1, L       ; Skip the move if A (R1) != 0
    ADDU  R2, R3, R0  ; Else replace S (R2) by T (R3)
L:

The IF statement can be implemented with a conditional move as:

    CMOVZ R2, R3, R1  ; Move R3 to R2 if the third operand R1 = 0
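
The equivalence of the two sequences can be checked with a small Python sketch. The function names `with_branch` and `with_cmovz` are hypothetical, and the model only mirrors the register-level effect of the instructions. Python's conditional expression still branches inside the interpreter, so this shows the semantics, not the pipeline behaviour.

```python
def with_branch(a, s, t):
    """Model of: BNEZ R1, L / ADDU R2, R3, R0 / L:"""
    r1, r2, r3 = a, s, t
    if r1 != 0:          # BNEZ R1, L: branch taken, the move is skipped
        return r2
    r2 = r3 + 0          # ADDU R2, R3, R0 (R0 always reads as zero)
    return r2

def with_cmovz(a, s, t):
    """Model of: CMOVZ R2, R3, R1"""
    r1, r2, r3 = a, s, t
    # A single instruction selects the new value of R2 from the data in R1.
    r2 = r3 if r1 == 0 else r2
    return r2

# Both sequences leave the same value of S in R2 for any A.
results = [(with_branch(a, 10, 20), with_cmovz(a, 10, 20)) for a in (0, 1, -3)]
```

For A = 0 both return T (20); for any non-zero A both return the original S (10).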


Conditional or predicated Instructions
Notice that with the conditional instruction CMOVZ, the
outcome is determined by the contents of the third register
instead of by a branch on the condition,
i.e., the control dependence has been converted to a
data dependence.


Conditional or predicated Instructions
This transformation moves the point at which the dependence
is resolved in a pipelined processor.
In a pipelined processor, the dependence for a branch is
resolved near the front of the pipeline, whereas a conditional
instruction resolves the dependence where the register write
occurs.


Conditional or predicated Instructions
This transformation is also used for vector computers, where
it is called if-conversion.
If-conversion replaces conditional branches with predicated
operations.
For example, let us see the code generated for the following
two (2) if-then-else statements:
