CS 704
Advanced Computer Architecture
Lecture 14
Instruction Level Parallelism
(Dynamic Scheduling – Tomasulo’s Approach)
Prof. Dr. M. Ashraf Chughtai
Today's Topics
Recap - Lecture 13
Dynamic Scheduling
Tomasulo’s Approach
Summary
MAC/VU-Advanced
Computer Architecture
Lecture 14 – Instruction Level
Parallelism -Dynamic (3)
2
Recap: Summary
Instruction Level Parallelism can be
exploited in hardware or in software
SW parallelism: dependencies are
defined by the program and become
hazards if the HW cannot resolve them
HW exploitation of ILP works when a
dependence cannot be
determined at compile time
Recap …..Cont’d
Key idea of the scoreboard:
Allow instructions behind a stall to
proceed
This is accomplished by dividing
the ID stage into two parts:
– Issue: instructions are issued in order
– Read operands: operands are read out of order
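The split of the ID stage can be sketched in Python as follows. This is a toy model, not the CDC 6600 control logic: the three-instruction program, the `ready_at` table, the one-issue-per-cycle rule, and the function name `schedule` are all assumptions made for illustration, and structural hazards are ignored.

```python
# Toy model: instructions issue in program order, one per cycle,
# but each one reads its operands as soon as its sources are available.
# Each entry is (name, source registers); the names are illustrative only.
program = [
    ("DIVD", ["F0"]),   # F0 arrives late (e.g. from a slow load)
    ("ADDD", ["F2"]),   # independent of F0
    ("SUBD", ["F4"]),   # independent of F0
]

# Hypothetical cycle at which each source register becomes available.
ready_at = {"F0": 10, "F2": 0, "F4": 0}


def schedule(program, ready_at):
    """Return the issue cycle and the read-operands cycle of each instruction."""
    issue, read = {}, {}
    for cycle, (name, srcs) in enumerate(program, start=1):
        issue[name] = cycle                            # in-order issue
        operands_ready = max(ready_at[s] for s in srcs)
        # Operands are read no earlier than the cycle after issue,
        # and no earlier than the cycle the last source is available.
        read[name] = max(cycle + 1, operands_ready)
    return issue, read


issue, read = schedule(program, ready_at)
read_order = sorted(read, key=read.get)
print(issue)       # issue cycles follow program order
print(read_order)  # ADDD and SUBD read operands before the stalled DIVD
```

In this sketch DIVD issues first but reads its operands last, which is the behaviour the two-part ID stage is meant to allow.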
Recap …..Cont’d
Scoreboard (cont’d):
Structural and data hazards
are checked at the ID stage
It facilitates out-of-order
execution, which results in
out-of-order completion
Today's Topics
Recap - Lecture 13
Tomasulo’s Approach
Scoreboard Vs. Tomasulo’s
Approach
Summary
Another Dynamic Scheduling Approach:
Tomasulo's Algorithm
Introduced by Tomasulo for the
IBM 360/91 (1966), about 3 years
after the CDC 6600
Goal: high performance
without special compilers
Tomasulo's Algorithm Vs. Scoreboard
Differences between the IBM 360 and
CDC 6600 ISAs:
– IBM has only 2 register specifiers per
instruction vs. 3 in the CDC 6600
– IBM has 4 FP registers vs. 8 in the
CDC 6600
Tomasulo's Organization For Dynamic
Scheduling
Components of Tomasulo's Structure
FP Operation Queue
Instructions are sent from the instruction unit
into the instruction queue in FIFO order
FP Adder Reservation Station
FP Multiplier Reservation Station
The reservation stations hold the
operations, the actual operands, and the
information needed to resolve hazards
Components of Tomasulo's Structure
Load Buffers have three functions:
– Hold the components of the effective address
until it is computed
– Track outstanding loads waiting on
memory
– Hold the result of a completed load waiting
for the CDB
Components of Tomasulo's Structure
Store Buffers also have three
functions:
– Hold the components of the effective address
until it is computed
– Hold the destination memory address of
outstanding store instructions
– Hold the address and value of a store until
the memory unit is available
Components of Tomasulo's Structure
Common Data Bus (CDB): how it differs from a
normal bus
Normal data bus:
– Carries the data + destination (“go to” bus)
Common data bus:
– Carries the data + source (“come from” bus)
– Broadcasts each result to every unit waiting for it
– 64 bits of data + 4 bits of functional unit source
address
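The “come from” idea can be sketched in a few lines of Python. This is only an illustrative model: the class `Listener`, the function `broadcast`, and the tags `Mult1` and `Load2` are hypothetical names, and tags are shown as strings rather than as the 4-bit source address.

```python
class Listener:
    """Anything that snoops the bus: an RS operand slot, a register, a store buffer."""

    def __init__(self, name, waiting_for=None):
        self.name = name
        self.waiting_for = waiting_for  # tag of the producing unit, or None
        self.value = None

    def snoop(self, source_tag, data):
        # "Come from": the sender names itself, and each listener
        # decides locally whether this result is the one it is waiting for.
        if self.waiting_for == source_tag:
            self.value = data
            self.waiting_for = None


def broadcast(listeners, source_tag, data):
    """Put (source tag, data) on the bus; every listener sees it in the same step."""
    for listener in listeners:
        listener.snoop(source_tag, data)


listeners = [
    Listener("Add1.Vj", waiting_for="Mult1"),
    Listener("F4", waiting_for="Mult1"),
    Listener("Add2.Vk", waiting_for="Load2"),
]
broadcast(listeners, "Mult1", 6.0)
```

After the broadcast, both units waiting on `Mult1` hold 6.0, while the one waiting on `Load2` is unchanged. A “go to” bus would instead need the sender to know, and address, each destination separately.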
Sequence of operations
All the results from the FP units
or the Load unit are placed on the
Common Data Bus, which goes to
the FP register file as well as to
the RS and store buffers
Tomasulo's Algorithm Vs. Scoreboard
Control & buffers are distributed with the
Function Units (FU) in Tomasulo vs.
centralized in the scoreboard
– FU buffers are called “reservation
stations”; they hold pending operands
Tomasulo's Algorithm Vs. Scoreboard
Registers in instructions are replaced
by values or pointers to reservation
stations (RS)
This is called register renaming
– It avoids WAR and WAW hazards
There are more reservation stations than
registers, so the hardware can do
optimizations which compilers can’t
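A minimal sketch of renaming, in Python, shows why WAR and WAW hazards disappear. The four-instruction program, the function `rename`, and the `RSn` tags are assumptions for illustration. In particular the sketch hands out one fresh tag per instruction and never runs out of stations, which real hardware cannot do.

```python
def rename(program):
    """Replace source registers by the tag of the station that will produce them.

    A source that has no pending producer keeps its register name, meaning
    its value can be read from the register file at issue time.
    """
    producer = {}   # register -> tag of the station that will write it
    renamed = []
    for i, (op, dest, src1, src2) in enumerate(program):
        tag = f"RS{i}"                    # hypothetical: fresh station per instruction
        s1 = producer.get(src1, src1)
        s2 = producer.get(src2, src2)
        renamed.append((op, tag, s1, s2))
        producer[dest] = tag              # later readers of dest wait on this tag
    return renamed, producer


program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F6", "F0", "F8"),    # reads F8
    ("SUBD", "F8", "F10", "F14"),  # writes F8: WAR with ADDD
    ("MULD", "F6", "F10", "F8"),   # writes F6: WAW with ADDD
]
renamed, producer = rename(program)
for entry in renamed:
    print(entry)
```

After renaming, ADDD holds the old F8 as its own operand and writes to `RS1`, SUBD writes to `RS2`, and MULD writes to `RS3`. No two instructions share a destination name, and the only remaining links (`RS0` into ADDD, `RS2` into MULD) are true RAW dependences.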
Tomasulo's Algorithm Vs. Scoreboard
Tomasulo: results go to the FUs from the RS,
not through registers, over a Common Data Bus
that broadcasts results to all FUs
Scoreboard: results go to the FUs through the
registers
Components of Reservation Station
Op: Operation to perform in the unit
(e.g., add, sub, …)
Busy: Indicates that the reservation station or
FU is busy
Vj, Vk: Values of the source operands
– Store buffers have a V field holding
the value to be stored
Components of Reservation Station
Qj, Qk: Reservation stations producing the
source operands (the values to be
written)
Note: No ready flags as in the
scoreboard; Qj, Qk = 0 => operand ready
– Store buffers only have Qi, for the RS
producing the result
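The reservation station fields map naturally onto a small record. The sketch below is one possible Python rendering, not the 360/91 hardware layout: the class name, the methods `ready` and `capture`, and the use of small nonzero integers as station tags are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None     # Op: operation to perform
    vj: Optional[float] = None   # Vj, Vk: source operand values
    vk: Optional[float] = None
    qj: int = 0                  # Qj, Qk: tag of the producing station,
    qk: int = 0                  #         0 means the value is already in Vj/Vk

    def ready(self):
        """No separate ready flags: the station can execute when Qj = Qk = 0."""
        return self.busy and self.qj == 0 and self.qk == 0

    def capture(self, tag, value):
        """Take a broadcast result if this station is waiting on `tag`."""
        if tag == 0:             # assumption: 0 is reserved, never a real tag
            return
        if self.qj == tag:
            self.vj, self.qj = value, 0
        if self.qk == tag:
            self.vk, self.qk = value, 0


# An add whose second operand is still being produced by station 3.
rs = ReservationStation("Add1", busy=True, op="ADDD", vj=1.5, qk=3)
before = rs.ready()
rs.capture(3, 2.5)
after = rs.ready()
```

Here `before` is false because Qk still names station 3. Once the result tagged 3 is captured, Vk is filled, Qk returns to 0, and `after` is true.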