CS 704
Advanced Computer Architecture
Lecture 37
Multiprocessors
(Performance and Synchronization)
Prof. Dr. M. Ashraf Chughtai
Today’s Topics
Recap:
Performance of Multiprocessors with
– Symmetric Shared-Memory
– Distributed Shared Memory
Synchronization in Parallel Architecture
Conclusion
MAC/VU-Advanced
Computer Architecture
Lec. 37 Multiprocessor (4)
2
Recap: Cache Coherence Problem
So far we have discussed the sharing of
caches for multiprocessing in the:
symmetric shared-memory architecture
distributed shared-memory architecture
We have studied the cache coherence problem
in symmetric and distributed shared-memory
multiprocessors, and have noticed
that this problem is indeed performance-critical
Recap: Multiprocessor cache Coherence
Last time we also studied the cache
coherence protocols, which use different
techniques to track the sharing status and
maintain coherence without degrading
performance
These protocols are classified as:
Snooping Protocols
Directory-Based Protocols
These protocols are implemented using an
FSM controller
Recap: Snooping Protocols
Snooping protocols employ write-invalidate
and write-broadcast techniques
Here, each block of memory is in one of
three states, and each cached block tracks
its own state; and
the controller responds to read/write
requests for a memory block or cached
block, both from the processor and from
the bus
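The controller behaviour described above can be sketched as a small transition table. This is a minimal sketch of a three-state write-invalidate controller for a single cache block, assuming a write-back cache and ignoring replacement of the block by another address; the event names (`cpu_read`, `bus_write_miss`, and so on) and the helpers `next_state` and `run` are illustrative and do not come from the lecture.

```python
# Sketch of a three-state write-invalidate snooping controller for ONE block.
# Event names and the table layout are illustrative assumptions.
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

# (current state, event) -> (next state, action placed on the bus or None)
TRANSITIONS = {
    # requests coming from the local processor
    (INVALID, "cpu_read"): (SHARED, "read miss"),
    (INVALID, "cpu_write"): (EXCLUSIVE, "write miss"),
    (SHARED, "cpu_read"): (SHARED, None),            # read hit
    (SHARED, "cpu_write"): (EXCLUSIVE, "write miss"),
    (EXCLUSIVE, "cpu_read"): (EXCLUSIVE, None),      # read hit
    (EXCLUSIVE, "cpu_write"): (EXCLUSIVE, None),     # write hit
    # requests snooped from the bus (issued by another processor)
    (INVALID, "bus_read_miss"): (INVALID, None),
    (INVALID, "bus_write_miss"): (INVALID, None),
    (SHARED, "bus_read_miss"): (SHARED, None),
    (SHARED, "bus_write_miss"): (INVALID, None),
    (EXCLUSIVE, "bus_read_miss"): (SHARED, "write back"),
    (EXCLUSIVE, "bus_write_miss"): (INVALID, "write back"),
}


def next_state(state, event):
    """Return (next state, bus action) for one event."""
    return TRANSITIONS[(state, event)]


def run(events, state=INVALID):
    """Apply a sequence of events and record every step."""
    trace = []
    for event in events:
        state, action = next_state(state, event)
        trace.append((event, state, action))
    return trace


if __name__ == "__main__":
    for step in run(["cpu_write", "bus_read_miss", "bus_write_miss"]):
        print(step)
```

Keeping the transitions in one dictionary makes it easy to compare the basic protocol with the four-state variations, which mainly add rows and one extra state.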
Recap: Implementation Complications of snoopy protocols
The three states of the basic FSM are: Shared,
Exclusive and Invalid
However, complications such as write
races, interventions and invalidations have been
observed in the implementation of snoopy
protocols; and
to overcome these complications, a number of
variations of the FSM controller have been
suggested
These variations are: MESI Protocol, Berkeley
Protocol and Illinois Protocol
Recap: Variations in snoopy protocols
These variations result in a four (4) state
FSM controller
– The states of MESI Protocol are: Modified,
Exclusive, Shared and Invalid
– The states of Berkeley Protocol are: Owned-Exclusive,
Owned-Shared, Shared and Invalid; and of
– Illinois Protocol are: Private Dirty, Private
Clean, Shared and Invalid
Recap: Directory based Protocols
Larger multiprocessor systems employ
distributed shared memory, i.e., a separate
memory per processor is provided
Here, cache coherency is achieved
using non-cached pages or a directory
containing information for every block in
memory
Recap: Directory Based Protocol
The directory-based protocol tracks the state
of every block in every cache and finds the
caches holding copies of a block, whether dirty
or clean
Similar to the snoopy protocol, the
directory-based protocol is implemented
by an FSM having three states: Shared,
Uncached and Exclusive
Recap: Directory-based Protocol
Recap: Directory Based Protocols
These protocols involve three kinds of processors
or nodes, namely: local, home and remote
nodes
– Local node originates the request
– Home node is where the memory location
of an address resides
– Remote node holds a copy of a cache
block, whether exclusive or shared
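The three roles can be illustrated with a short sketch. It assumes, purely for illustration, that physical memory is split into equal contiguous ranges with one range per node, so that the home node follows from the address; the lecture does not state how addresses map to home nodes, and the names `home_node`, `roles` and `bytes_per_node` are hypothetical.

```python
# Sketch: classify nodes as local, home, or remote for one request.
# ASSUMPTION: node i owns addresses [i * bytes_per_node, (i + 1) * bytes_per_node).


def home_node(addr, bytes_per_node):
    """Node whose memory (and directory) holds this address."""
    return addr // bytes_per_node


def roles(requester, addr, sharers, bytes_per_node):
    """Return the role of each node involved in a request for addr."""
    return {
        "local": requester,                        # originates the request
        "home": home_node(addr, bytes_per_node),   # holds memory and directory entry
        "remote": sorted(n for n in sharers if n != requester),  # hold cached copies
    }


if __name__ == "__main__":
    # Node 0 requests address 0x2400 while nodes 1 and 2 hold copies.
    print(roles(0, 0x2400, {1, 2}, bytes_per_node=0x1000))
```

Note that one physical node can play more than one role: the requester may also be the home of the address, and the home may itself hold a cached copy.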
Recap: Directory-based Protocol
The transactions are caused by
messages such as: read misses, write
misses, invalidates or data fetch requests
These messages are sent to the directory
to cause actions such as updating the directory
state and satisfying requests
The controller tracks all copies of a memory
block, and indicates an action that updates the
sharing set
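The message handling described above can be sketched for a single directory entry. This is a minimal sketch using the three directory states named in the recap (Uncached, Shared, Exclusive); the class name `DirectoryEntry`, the message strings, and the simplification that a write-back always empties the sharer set are assumptions for illustration, not the lecture's specification.

```python
# Sketch of the directory-side FSM for ONE memory block.
# Message names and the return format are illustrative assumptions.


class DirectoryEntry:
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def handle(self, msg, proc):
        """Process one incoming message; return the messages sent in response."""
        out = []
        if msg == "read_miss":
            if self.state == "Exclusive":
                owner = next(iter(self.sharers))
                out.append(("fetch", owner))          # owner writes the block back
            out.append(("data_value_reply", proc))
            self.sharers.add(proc)
            self.state = "Shared"
        elif msg == "write_miss":
            if self.state == "Shared":
                for p in sorted(self.sharers - {proc}):
                    out.append(("invalidate", p))
            elif self.state == "Exclusive":
                owner = next(iter(self.sharers))
                out.append(("fetch_invalidate", owner))
            out.append(("data_value_reply", proc))
            self.sharers = {proc}
            self.state = "Exclusive"
        elif msg == "write_back":
            # Only meaningful in Exclusive: the owner returns the block.
            self.sharers = set()
            self.state = "Uncached"
        else:
            raise ValueError(f"unknown message: {msg}")
        return out


if __name__ == "__main__":
    entry = DirectoryEntry()
    for msg, proc in [("write_miss", "P1"), ("read_miss", "P2"),
                      ("write_miss", "P2"), ("write_back", "P2")]:
        print(msg, proc, "->", entry.handle(msg, proc), entry.state, sorted(entry.sharers))
```

Every branch ends by updating `sharers`, which mirrors the point that each action taken by the controller updates the sharing set.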
Example: Working of Finite State Machine Controller
We now discuss the state
transitions and the messages generated by the FSM
controller in each state to implement the
directory-based protocols.
We consider an example distributed shared-memory
multiprocessor having two
processors, P1 and P2, where each
processor has its own cache, memory and
directory
Example: Working of Finite State Machine Controller
Here, if
the required data is not in any cache and is
available only in the memory associated with the
respective processor, then
the state machine is said to be in the Uncached
state; and
transitions to other states are caused by
messages such as: read miss, write miss,
invalidate and data fetch request
Example: Dealing with read/write misses
Steps of the example (A1 and A2 map to the same cache block):
1. P1: Write 10 to A1
2. P1: Read A1
3. P2: Read A1
4. P2: Write 20 to A1
5. P2: Write 40 to A2
Columns tracked at each step:
– Processor 1 cache: State, Addr, Value
– Processor 2 cache: State, Addr, Value
– Interconnect (bus): Action, Proc., Addr, Value
– Directory: Addr, State, {Procs}
– Memory: Value
Example: Working of Finite State Machine Controller
Let us assume that initially the cache states
are Uncached (i.e., the block of data is only in memory);
at the first step P1 writes 10 to address A1,
and the following three activities take place
1. The bus action is a write miss: the
processor P1 places the address A1 on the
bus;
2. a data value reply message is sent to
P1's controller, and P1 is inserted in the directory
sharer-set {P1}; and
Example: Working of Finite State Machine Controller
3. the state transition from Uncached to
Exclusive takes place; these operations are
shown in the rows for step 1 below
step | P1 cache (State Addr Value) | P2 cache (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
Pending steps: P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2
Example: Working of Finite State Machine Controller
At Step 2: P1 reads A1; a CPU read hit
occurs, hence the FSM stays in the Exclusive
state
step | P1 cache (State Addr Value) | P2 cache (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
Pending steps: P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2
Example: Working of Finite State Machine Controller
At Step 3: P2 reads A1
i) a read miss is placed on the bus because P2 holds no
valid copy of A1; the state of the P2 controller
changes to Shared
ii) P1 being in the Exclusive state, a fetch (remote read
write-back) is asserted and the state of P1 changes from
Exclusive to Shared; and
iii) the value (10) is written back to memory at
address A1 and supplied to the P2 cache at A1 by a data
value reply; both P1 and P2 are inserted in the sharer-set
{P1,P2} and the directory state for A1 becomes Shared
Example: Working of FSM Controller
step | P1 cache (State Addr Value) | P2 cache (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
P2: Read A1 | - | Shar. A1 | RdMs P2 A1 | - | -
(cont.) | Shar. A1 10 | - | Ftch P1 A1 10 | A1 | 10
(cont.) | - | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
Pending steps: P2: Write 20 to A1; P2: Write 40 to A2
Note: the Ftch row is the write back of A1 by P1
A1 and A2 map to the same cache block
Example: Working of Finite State Machine Controller
At Step 4: P2 writes 20 to A1
i) P1 holds a shared copy of A1 and sees a remote
write (invalidate), so the state of its controller
changes from Shared to Invalid
ii) P2 sees a CPU write, so it places a write miss on the
bus, changes its state from Shared to
Exclusive and writes the value 20 to A1
iii) The directory entry for A1 becomes Exclusive with the
sharer-set containing {P2}
Example: working of FSM controller
Step 4
step | P1 cache (State Addr Value) | P2 cache (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
P2: Read A1 | - | Shar. A1 | RdMs P2 A1 | - | -
(cont.) | Shar. A1 10 | - | Ftch P1 A1 10 | A1 | 10
(cont.) | - | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 | - | Excl. A1 20 | WrMs P2 A1 | - | 10
(cont.) | Inv. | - | Inval. P1 A1 | A1 Excl. {P2} | 10
Pending steps: P2: Write 40 to A2
A1 and A2 map to the same cache block
Example: Working of Finite State Machine Controller
At Step 5: P2 writes 40 to A2
i) P2 holds A1 in the Exclusive state and A2 maps to the
same cache block, so a P2 write miss at A2 occurs
ii) The directory entry for A2 goes to the Exclusive state and
places P2 in the sharer-set {P2}
iii) P2's write-back of 20 at A1 completes; the directory entry
for A1 goes to the Uncached state, the sharer-set is empty
and the value 20 is placed in memory
iv) P2 remains in the Exclusive state, now with address A2
and value 40
Example (cont’d)
step | P1 cache (State Addr Value) | P2 cache (State Addr Value) | Interconnect (Action Proc. Addr Value) | Directory (Addr State {Procs}) | Memory (Value)
P1: Write 10 to A1 | - | - | WrMs P1 A1 | A1 Ex {P1} | -
(cont.) | Excl. A1 10 | - | DaRp P1 A1 0 | - | -
P1: Read A1 | Excl. A1 10 | - | - | - | -
P2: Read A1 | - | Shar. A1 | RdMs P2 A1 | - | -
(cont.) | Shar. A1 10 | - | Ftch P1 A1 10 | A1 | 10
(cont.) | - | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 | - | Excl. A1 20 | WrMs P2 A1 | - | 10
(cont.) | Inv. | - | Inval. P1 A1 | A1 Excl. {P2} | 10
P2: Write 40 to A2 | - | - | WrMs P2 A2 | A2 Excl. {P2} | 0
(cont.) | - | - | WrBk P2 A1 20 | A1 Unca. {} | 20
(cont.) | - | Excl. A2 40 | DaRp P2 A2 0 | A2 Excl. {P2} | 0
A1 and A2 map to the same cache block
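The five steps of the example can be replayed with a small simulation. This is a minimal sketch rather than the lecture's controller: it assumes one block per cache (so A1 and A2 conflict), messages that are applied instantly and in order, and uninitialised memory that reads as 0. The class `System` and its method names are hypothetical; the message abbreviations (WrMs, RdMs, Ftch, Inval., WrBk, DaRp) follow the table.

```python
# Sketch: replay the five-step trace on a tiny directory model.
# ASSUMPTIONS: one block per cache, instant in-order messages, memory defaults to 0.


class System:
    def __init__(self, procs):
        self.cache = {p: {"state": "Invalid", "addr": None, "value": None}
                      for p in procs}
        self.directory = {}  # addr -> [state, set of sharers]
        self.memory = {}     # addr -> value
        self.log = []        # (action, proc, addr) messages, in order

    def _dir(self, addr):
        return self.directory.setdefault(addr, ["Uncached", set()])

    def _evict(self, p):
        """Free p's single block before it takes a different address."""
        line = self.cache[p]
        if line["state"] == "Exclusive":      # dirty copy: write it back
            self.log.append(("WrBk", p, line["addr"]))
            self.memory[line["addr"]] = line["value"]
            self.directory[line["addr"]] = ["Uncached", set()]
        elif line["state"] == "Shared":       # clean copy: leave the sharer set
            entry = self._dir(line["addr"])
            entry[1].discard(p)
            if not entry[1]:
                entry[0] = "Uncached"

    def _fetch_from_owner(self, addr, invalidate):
        owner = next(iter(self._dir(addr)[1]))
        self.log.append(("Ftch", owner, addr))
        self.memory[addr] = self.cache[owner]["value"]
        self.cache[owner]["state"] = "Invalid" if invalidate else "Shared"

    def read(self, p, addr):
        line = self.cache[p]
        if line["addr"] == addr and line["state"] != "Invalid":
            return line["value"]              # read hit: no messages
        self.log.append(("RdMs", p, addr))
        self._evict(p)
        entry = self._dir(addr)
        if entry[0] == "Exclusive":
            self._fetch_from_owner(addr, invalidate=False)
        entry[0] = "Shared"
        entry[1].add(p)
        self.log.append(("DaRp", p, addr))
        value = self.memory.get(addr, 0)
        self.cache[p] = {"state": "Shared", "addr": addr, "value": value}
        return value

    def write(self, p, addr, value):
        line = self.cache[p]
        if line["addr"] == addr and line["state"] == "Exclusive":
            line["value"] = value             # write hit: no messages
            return
        self.log.append(("WrMs", p, addr))
        upgrade = line["addr"] == addr and line["state"] == "Shared"
        if not upgrade:
            self._evict(p)
        entry = self._dir(addr)
        if entry[0] == "Exclusive":
            self._fetch_from_owner(addr, invalidate=True)
        else:
            for q in sorted(entry[1] - {p}):
                self.log.append(("Inval.", q, addr))
                self.cache[q]["state"] = "Invalid"
        entry[0], entry[1] = "Exclusive", {p}
        if not upgrade:                       # an upgrade already holds the data
            self.log.append(("DaRp", p, addr))
        self.cache[p] = {"state": "Exclusive", "addr": addr, "value": value}


s = System(["P1", "P2"])
s.write("P1", "A1", 10)
s.read("P1", "A1")
s.read("P2", "A1")
s.write("P2", "A1", 20)
s.write("P2", "A2", 40)
for message in s.log:
    print(*message)
```

The printed message sequence can be compared row by row with the Interconnect column of the table, and `s.directory` and `s.memory` with its Directory and Memory columns.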
Performance of Multiprocessors
Symmetric Shared-Memory Architecture
In a bus-based multiprocessor using an invalidation
protocol, several phenomena combine to
determine performance:
– Overall cache performance is a combination of the
behavior of the uniprocessor cache-miss traffic
and the traffic caused by communication, i.e.,
invalidations and the subsequent cache misses
– Changing the processor count, cache size and block
size affects these two components of the miss rate
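The two components of the miss rate can be made concrete with a short calculation. This is a minimal sketch; the function name `miss_breakdown` and the counts in the usage example are made up for illustration and are not measurements from the lecture.

```python
# Sketch: split the overall miss rate into its uniprocessor and coherence parts.
# The counts used in the example below are hypothetical, not measured data.


def miss_breakdown(accesses, uniprocessor_misses, coherence_misses):
    """Return overall and per-component miss rates for one processor."""
    total = uniprocessor_misses + coherence_misses
    return {
        "miss_rate": total / accesses,
        "uniprocessor_rate": uniprocessor_misses / accesses,
        "coherence_rate": coherence_misses / accesses,
        # fraction of all misses that are caused by invalidations
        "coherence_share": coherence_misses / total if total else 0.0,
    }


if __name__ == "__main__":
    print(miss_breakdown(accesses=10_000, uniprocessor_misses=150, coherence_misses=50))
```

Recomputing the breakdown for different processor counts, cache sizes and block sizes shows which of the two components a given design change actually moves.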