Advanced Computer Architecture - Lecture 37: Multiprocessors

CS 704
Advanced Computer Architecture

Lecture 37
Multiprocessors
(Performance and Synchronization)

Prof. Dr. M. Ashraf Chughtai


Today’s Topics
Recap
Performance of Multiprocessors with
– Symmetric Shared-Memory
– Distributed Shared Memory
Synchronization in Parallel Architecture
Conclusion

MAC/VU-Advanced
Computer Architecture

Lec. 37 Multiprocessor (4)




Recap: Cache Coherence Problem
So far we have discussed the sharing of caches for multiprocessing in the:
– symmetric shared-memory architecture
– distributed shared-memory architecture
We have studied the cache coherence problem in symmetric and distributed shared-memory multiprocessors, and have noticed that this problem is indeed performance-critical


Recap: Multiprocessor cache Coherence
Last time we also studied the cache coherence protocols, which use different techniques to track the sharing status of blocks and maintain coherence without degrading performance
These protocols are classified as:
– Snooping Protocols
– Directory-Based Protocols
These protocols are implemented using an FSM controller


Recap: Snooping Protocols
Snooping protocols employ write-invalidate and write-broadcast techniques
Here, each cached block of memory is in one of three states, and each cache tracks the state of its own blocks; and
the controller responds to read/write requests for a block, both from the processor and from the bus
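The three-state controller described above can be sketched as a transition table. This is a minimal illustration, assuming a write-invalidate protocol with write-back caches and ignoring replacement misses; the event names (`cpu_read`, `bus_write_miss`, and so on) and the action strings are hypothetical labels chosen for this sketch, not names from the lecture.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"


# (current state, event) -> (next state, action); events are assumed names.
TRANSITIONS = {
    (State.INVALID, "cpu_read"): (State.SHARED, "place read miss on bus"),
    (State.INVALID, "cpu_write"): (State.EXCLUSIVE, "place write miss on bus"),
    (State.SHARED, "cpu_read"): (State.SHARED, None),
    (State.SHARED, "cpu_write"): (State.EXCLUSIVE, "place write miss on bus"),
    (State.SHARED, "bus_read_miss"): (State.SHARED, None),
    (State.SHARED, "bus_write_miss"): (State.INVALID, None),
    (State.EXCLUSIVE, "cpu_read"): (State.EXCLUSIVE, None),
    (State.EXCLUSIVE, "cpu_write"): (State.EXCLUSIVE, None),
    (State.EXCLUSIVE, "bus_read_miss"): (State.SHARED, "write back block"),
    (State.EXCLUSIVE, "bus_write_miss"): (State.INVALID, "write back block"),
}


def snoop_transition(state, event):
    """Return (next_state, action) for one cache block.

    Events that do not concern the block in its current state
    (for example a bus miss seen while Invalid) leave it unchanged.
    """
    return TRANSITIONS.get((state, event), (state, None))
```

The same function serves both sides of the controller: `cpu_*` events come from the processor, `bus_*` events are snooped from the bus.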


Recap: Implementation Complications of snoopy protocols
The three states of the basic FSM are: Shared, Exclusive and Invalid
However, complications such as write races, interventions and invalidations have been observed in the implementation of snoopy protocols; and
to overcome these complications a number of variations of the FSM controller have been suggested
These variations are: the MESI Protocol, the Berkeley Protocol and the Illinois Protocol


Recap: Variations in snoopy protocols
These variations resulted in four-state FSM controllers
– The states of the MESI Protocol are: Modified, Exclusive, Shared and Invalid
– The states of the Berkeley Protocol are: Owned-Exclusive, Owned-Shared, Shared and Invalid; and
– The states of the Illinois Protocol are: Private Dirty, Private Clean, Shared and Invalid



Recap: Directory based Protocols
Larger multiprocessor systems employ distributed shared memory, i.e., a separate memory is provided per processor
Here, cache coherence is achieved using non-cached pages or a directory containing information for every block in memory
The directory-based protocol tracks the state of every block in every cache and finds which caches hold copies of the block, and whether each copy is dirty or clean
Similar to the snoopy protocol, the directory-based protocol is implemented by an FSM having three states: Shared, Uncached and Exclusive
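As an illustration of what the directory keeps per memory block, here is a minimal sketch of a directory entry together with the consistency rule implied by the three states. The class name `DirectoryEntry` and the method `is_consistent` are hypothetical; the lecture does not prescribe a data layout.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Directory record for one memory block (illustrative layout only)."""
    state: str = "Uncached"                      # "Uncached", "Shared" or "Exclusive"
    sharers: set = field(default_factory=set)    # processors holding a copy

    def is_consistent(self):
        """Check that the sharer-set agrees with the state."""
        if self.state == "Uncached":
            return len(self.sharers) == 0        # no cache holds the block
        if self.state == "Exclusive":
            return len(self.sharers) == 1        # exactly one owner, copy may be dirty
        if self.state == "Shared":
            return len(self.sharers) >= 1        # one or more clean copies
        return False
```

For example, `DirectoryEntry("Shared", {"P1", "P2"})` corresponds to a block that both processors have read and neither has modified.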




Recap: Directory Based Protocols
These protocols involve three kinds of processors or nodes, namely: local, home and remote nodes
– The local node originates the request
– The home node is where the memory location of an address resides
– A remote node holds a copy of a cache block, whether exclusive or shared
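The three roles can be made concrete with a small sketch. The mapping from address to home node below is an assumption made only for illustration (consecutive blocks interleaved across nodes, 64-byte blocks); the lecture does not say how memory is distributed, and the function names `home_node` and `roles` are hypothetical.

```python
def home_node(address, num_nodes, block_size=64):
    """Assumed mapping: consecutive memory blocks are interleaved across nodes."""
    return (address // block_size) % num_nodes


def roles(address, requester, holders, num_nodes, block_size=64):
    """Name the local, home and remote nodes for one request.

    requester: node id issuing the request (the local node)
    holders:   node ids whose caches currently hold the block
    """
    return {
        "local": requester,
        "home": home_node(address, num_nodes, block_size),
        "remote": sorted(h for h in holders if h != requester),
    }
```

Note that the roles are per request: the same node may be local for one access and home or remote for another, and the local node may also be the home node.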



Recap: Directory-based Protocol
The transactions are caused by messages such as read misses, write misses, invalidates and data fetch requests
These messages are sent to the directory to cause actions such as updating the directory state and satisfying requests
The controller tracks all copies of a memory block, and each message indicates an action that updates the sharing set
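The messages can be catalogued by who sends them and who receives them. The sketch below follows the usual textbook presentation of directory protocols; the `fetch/invalidate` message and the source/destination columns are not spelled out in these slides, so treat them as assumptions. `MESSAGES` and `sent_by` are hypothetical names.

```python
# name: (source, destination, contents); P = requesting processor, A = address
MESSAGES = {
    "read miss":        ("local cache", "home directory", "P, A"),
    "write miss":       ("local cache", "home directory", "P, A"),
    "invalidate":       ("home directory", "remote cache", "A"),
    "fetch":            ("home directory", "remote cache", "A"),
    "fetch/invalidate": ("home directory", "remote cache", "A"),
    "data value reply": ("home directory", "local cache", "data"),
    "data write-back":  ("remote cache", "home directory", "A, data"),
}


def sent_by(source):
    """List the message types that a given kind of node originates."""
    return sorted(name for name, (src, _, _) in MESSAGES.items() if src == source)
```

In the worked example later in the lecture these messages appear under the abbreviations RdMs, WrMs, Inval., Ftch, DaRp and WrBk.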


Example: Working of Finite State Machine Controller
We are now going to discuss the state transitions and the messages generated by the FSM controller in each state to implement the directory-based protocols.
We consider an example distributed shared-memory multiprocessor having two processors, P1 and P2, where each processor has its own cache, memory and directory


Example: Working of Finite State Machine Controller
Here, if the required data is not in any cache and is available only in the memory associated with the respective processor, then the state machine is said to be in the Uncached state; and
transitions to other states are caused by messages such as read miss, write miss, invalidate and data fetch request



Example: Dealing with read/write misses
                   | P1               | P2               | Bus                     | Directory          | Memory
Step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  |                         |                    |
P1: Read A1        |                  |                  |                         |                    |
P2: Read A1        |                  |                  |                         |                    |
P2: Write 20 to A1 |                  |                  |                         |                    |
P2: Write 40 to A2 |                  |                  |                         |                    |

A1 and A2 map to the same cache block




Example: Working of Finite State Machine Controller
Let us assume that initially the cache states are Uncached (i.e., the block of data is only in memory); and at the first step P1 writes 10 to address A1. Here the following three activities take place:
1. the bus action is a write miss, and processor P1 places the address A1 on the bus;
2. the data value reply message is sent to P1's cache controller, and P1 is inserted in the directory sharer-set {P1}; and


Example: Working of Finite State Machine Controller
3. the state transition from Uncached to Exclusive takes place. These operations are shown in the table below
                   | P1               | P2               | Bus                     | Directory          | Memory
Step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  | WrMs   P1    A1         | A1   Excl. {P1}    |
                   | Excl. A1   10    |                  | DaRp   P1    A1   0     |                    |
P1: Read A1        |                  |                  |                         |                    |
P2: Read A1        |                  |                  |                         |                    |
P2: Write 20 to A1 |                  |                  |                         |                    |
P2: Write 40 to A2 |                  |                  |                         |                    |




Example: Working of Finite State Machine Controller
At Step 2: P1 reads A1; a CPU read hit occurs, hence the FSM stays in the Exclusive state
                   | P1               | P2               | Bus                     | Directory          | Memory
Step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  | WrMs   P1    A1         | A1   Excl. {P1}    |
                   | Excl. A1   10    |                  | DaRp   P1    A1   0     |                    |
P1: Read A1        | Excl. A1   10    |                  |                         |                    |
P2: Read A1        |                  |                  |                         |                    |
P2: Write 20 to A1 |                  |                  |                         |                    |
P2: Write 40 to A2 |                  |                  |                         |                    |



Example: Working of Finite State Machine Controller
At Step 3: P2 reads A1
i) P2 does not hold the block, so a read miss is placed on the bus, and P2's controller moves to the Shared state
ii) P1 holds the block in the Exclusive state, so a fetch (remote read with write-back) is asserted, and P1's state changes from Exclusive to Shared; and
iii) the value 10 is written back to memory at address A1 and supplied to P2's cache at A1; the directory entry for A1 becomes Shared, and both P1 and P2 are inserted in the sharer-set {P1,P2}


Example: Working of FSM Controller
                   | P1               | P2               | Bus                     | Directory          | Memory
Step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  | WrMs   P1    A1         | A1   Excl. {P1}    |
                   | Excl. A1   10    |                  | DaRp   P1    A1   0     |                    |
P1: Read A1        | Excl. A1   10    |                  |                         |                    |
P2: Read A1        |                  | Shar. A1         | RdMs   P2    A1         |                    |
                   | Shar. A1   10    |                  | Ftch   P1    A1   10    |                    | 10
                   |                  | Shar. A1   10    | DaRp   P2    A1   10    | A1   Shar. {P1,P2} | 10
P2: Write 20 to A1 |                  |                  |                         |                    |
P2: Write 40 to A2 |                  |                  |                         |                    |

Write back: in response to the fetch, P1 writes the block back, so memory at A1 now holds 10
A1 and A2 map to the same cache block



Example: Working of Finite State Machine Controller
At Step 4: P2 writes 20 to A1
i) P1 sees a remote write to A1 (an invalidate message), so the state of its controller changes from Shared to Invalid
ii) P2 sees a CPU write, so it places a write miss on the bus, changes its state from Shared to Exclusive and writes the value 20 to A1
iii) The directory entry for A1 becomes Exclusive, with the sharer-set containing {P2}


Example: working of FSM controller
Step 4
                   | P1               | P2               | Bus                     | Directory          | Memory
Step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  | WrMs   P1    A1         | A1   Excl. {P1}    |
                   | Excl. A1   10    |                  | DaRp   P1    A1   0     |                    |
P1: Read A1        | Excl. A1   10    |                  |                         |                    |
P2: Read A1        |                  | Shar. A1         | RdMs   P2    A1         |                    |
                   | Shar. A1   10    |                  | Ftch   P1    A1   10    |                    | 10
                   |                  | Shar. A1   10    | DaRp   P2    A1   10    | A1   Shar. {P1,P2} | 10
P2: Write 20 to A1 |                  | Excl. A1   20    | WrMs   P2    A1         |                    | 10
                   | Inv.             |                  | Inval. P1    A1         | A1   Excl. {P2}    | 10
P2: Write 40 to A2 |                  |                  |                         |                    |

A1 and A2 map to the same cache block




Example: Working of Finite State Machine Controller
At Step 5: P2 writes 40 to A2
i) P2 holds A1 in the Exclusive state; because A2 maps to the same cache block, a P2 write miss at A2 occurs
ii) The directory entry for A2 moves to the Exclusive state and P2 is placed in its sharer-set {P2}
iii) P2's write-back of 20 to A1 completes; the directory entry for A1 returns to the Uncached state, its sharer-set is empty and the value 20 is placed in memory
iv) P2 remains in the Exclusive state, now with address A2 and value 40


Example: Working of FSM Controller (cont'd)

                   | P1               | P2               | Bus                     | Directory          | Memory
Step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  | WrMs   P1    A1         | A1   Excl. {P1}    |
                   | Excl. A1   10    |                  | DaRp   P1    A1   0     |                    |
P1: Read A1        | Excl. A1   10    |                  |                         |                    |
P2: Read A1        |                  | Shar. A1         | RdMs   P2    A1         |                    |
                   | Shar. A1   10    |                  | Ftch   P1    A1   10    |                    | 10
                   |                  | Shar. A1   10    | DaRp   P2    A1   10    | A1   Shar. {P1,P2} | 10
P2: Write 20 to A1 |                  | Excl. A1   20    | WrMs   P2    A1         |                    | 10
                   | Inv.             |                  | Inval. P1    A1         | A1   Excl. {P2}    | 10
P2: Write 40 to A2 |                  |                  | WrMs   P2    A2         | A2   Excl. {P2}    | 0
                   |                  |                  | WrBk   P2    A1   20    | A1   Unca. {}      | 20
                   |                  | Excl. A2   40    | DaRp   P2    A2   0     | A2   Excl. {P2}    | 0

A1 and A2 map to the same cache block
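The five steps of the example can be replayed with a small simulation. This is a minimal sketch under stated assumptions, not the lecture's controller: each cache holds a single block (so A1 and A2 conflict), memory starts at 0, the abbreviations WrMs, RdMs, Ftch, Inval, WrBk and DaRp are logged as bus actions, and the function names `make_system`, `access` and `run_example` are hypothetical. A write miss to a block that another cache owns exclusively is simplified to a fetch followed by invalidation, a case the example does not exercise.

```python
def make_system():
    """Two single-block caches, a directory and memory for A1 and A2."""
    return {
        "cache": {p: {"state": "Invalid", "addr": None, "value": None}
                  for p in ("P1", "P2")},
        "dir": {a: {"state": "Uncached", "sharers": set()} for a in ("A1", "A2")},
        "mem": {"A1": 0, "A2": 0},
        "bus": [],
    }


def access(machine, p, op, addr, value=None):
    """Perform a 'read' or 'write' by processor p and log the bus actions."""
    c, d = machine["cache"][p], machine["dir"][addr]
    has_copy = c["addr"] == addr and c["state"] != "Invalid"
    if has_copy and (op == "read" or c["state"] == "Exclusive"):
        if op == "write":
            c["value"] = value
        return c["value"]                          # hit: no messages
    machine["bus"].append(("RdMs" if op == "read" else "WrMs", p, addr))
    if not has_copy and c["state"] != "Invalid":   # conflicting block is replaced
        old = machine["dir"][c["addr"]]
        if c["state"] == "Exclusive":              # dirty block goes back to memory
            machine["bus"].append(("WrBk", p, c["addr"]))
            machine["mem"][c["addr"]] = c["value"]
        old["sharers"].discard(p)
        old["state"] = "Shared" if old["sharers"] else "Uncached"
    for q in sorted(d["sharers"] - {p}):           # other caches holding the block
        other = machine["cache"][q]
        if d["state"] == "Exclusive":              # owner supplies data, memory updated
            machine["bus"].append(("Ftch", q, addr))
            machine["mem"][addr] = other["value"]
        elif op == "write":
            machine["bus"].append(("Inval", q, addr))
        other["state"] = "Shared" if op == "read" else "Invalid"
    if not has_copy:
        machine["bus"].append(("DaRp", p, addr))
    if op == "read":
        d["state"], d["sharers"] = "Shared", d["sharers"] | {p}
        c.update(state="Shared", addr=addr, value=machine["mem"][addr])
    else:
        d["state"], d["sharers"] = "Exclusive", {p}
        c.update(state="Exclusive", addr=addr, value=value)
    return c["value"]


def run_example():
    """Replay the five steps of the lecture example and return the final system."""
    machine = make_system()
    access(machine, "P1", "write", "A1", 10)
    access(machine, "P1", "read", "A1")
    access(machine, "P2", "read", "A1")
    access(machine, "P2", "write", "A1", 20)
    access(machine, "P2", "write", "A2", 40)
    return machine
```

Calling `run_example()` and inspecting `machine["bus"]` gives the sequence of bus actions in the order listed in the table, and the final directory and memory contents can be compared against its last rows.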



Performance of Multiprocessors
Symmetric Shared-Memory Architecture
In a bus-based multiprocessor using an invalidation protocol, several phenomena combine to determine performance:
– Overall cache performance is a combination of the behavior of the uniprocessor cache miss traffic and the traffic caused by communication, which results in invalidations and subsequent cache misses
– Changing the processor count, cache size and block size affects these two components of the miss rate
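The two components named above can be written as a simple decomposition. The symbols are assumptions introduced here for illustration and do not come from the lecture: $m_{\text{uni}}$ is the miss rate a uniprocessor cache would see, $m_{\text{coh}}$ is the miss rate due to invalidations, $P$ is the processor count, $C$ the cache size and $B$ the block size.

```latex
\[
m_{\text{total}}(P, C, B) \;=\; m_{\text{uni}}(P, C, B) \;+\; m_{\text{coh}}(P, C, B)
\]
```

Written this way, the second point says that each of $P$, $C$ and $B$ can move the two terms by different amounts, so their effect on the total has to be examined term by term.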

