Computer Architecture Slides, Group 3


Multiprocessor
Instructor: Tran Ngoc Thinh, Ph.D.
Group 3:
• 13070223 – Võ Thanh Biết
• 13070229 – Lưu Nguyễn Hoàng Hạnh
• 13070232 – Nguyễn Duy Hoàng
• 13070243 – Trần Duy Linh
• 13070244 – Nguyễn Thị Thúy Loan
• 13070251 – Phạm Ích Trí Nhân
• 13070258 – Nguyễn Anh Quốc
• 13070269 – Lê Thị Minh Thùy
• 12070558 – Cao Minh Vũ

Contents
Multiprocessor
• What is a multiprocessor system?
• Which category does it fall into in the Flynn classification?
• Synchronization: some techniques (spin lock, barrier) with their advantages and disadvantages; synchronization for large-scale multiprocessors
• Memory consistency: the relaxed consistency models
• Multithreading: how does multithreading improve the performance of a uniprocessor without superscalar? With superscalar?
Cache coherence problem in multicore systems
• Why cache coherence is needed on a multiprocessor
• Brief explanation of the directory-based protocol. Where is it most applicable?
• Explanation of the snoopy-based protocol. Where is it most applicable?
• Some popular protocols in modern processors
• What is the MESI protocol?

Sample

Multiprocessing
• What is a multiprocessor system?
  • Multiprocessing is a type of processing in which two or more processors work together to process more than one program simultaneously.
• Advantages of multiprocessor systems:
  • Reduced cost
  • Increased reliability
  • Increased throughput

Flynn classification
• Based on the notions of instruction streams and data streams
  • SISD (Single Instruction stream over a Single Data stream)
  • SIMD (Single Instruction stream over Multiple Data streams)
  • MISD (Multiple Instruction streams over a Single Data stream)
  • MIMD (Multiple Instruction streams over Multiple Data streams)

Synchronization
• Why synchronize?
  • We need to know when it is safe for different processes running on different processors to use shared data

Synchronization
P1
Lock(L)
Load sharedvar
Modify sharedvar
Store sharedvar
Release(L)

P2
Lock(L)
Load sharedvar
Modify sharedvar
Store sharedvar
Release(L)

Synchronization
• Hardware support for synchronization
  • Atomic instruction to fetch and update memory (atomic operation)
    • Atomic exchange: interchanges a value stored in a register with a value stored in a memory location representing a lock
    • Test-and-set: tests a value and sets it if the value passes the test
    • Fetch-and-increment: returns the value of a memory location and atomically increments it after the fetch is done
  • Atomic read and write for multiprocessors
    • load-linked (LL) and store-conditional (SC)
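The primitives listed above can be tried out from C through the C11 `<stdatomic.h>` interface. The following is a minimal sketch, not the deck's own code: the `demo_*` function names are hypothetical, and the compare-and-swap retry loop only illustrates the LL/SC style of "read, compute, try to store, retry on failure". How it is lowered to actual LL/SC instructions depends on the target architecture and compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomic exchange: store new_value into *loc and return the old contents. */
int demo_exchange(atomic_int *loc, int new_value)
{
    return atomic_exchange(loc, new_value);
}

/* Test-and-set: set the flag and report whether it was already set. */
bool demo_test_and_set(atomic_flag *flag)
{
    return atomic_flag_test_and_set(flag);
}

/* Fetch-and-increment: return the old value and add one, atomically. */
int demo_fetch_and_increment(atomic_int *loc)
{
    return atomic_fetch_add(loc, 1);
}

/* LL/SC-style retry loop (assumption: expressed with compare-and-swap).
 * On failure, `old` is refreshed with the current value and we retry. */
int demo_increment_with_retry(atomic_int *loc)
{
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;
    return old;
}
```

With a lock word that starts at 0, `demo_exchange(&lock, 1)` returning 0 means the caller acquired the lock; any other return value means it was already held.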

Synchronization
• Initial implementations
  • Semaphores
• Current trends
  • Spin locks
  • Condition variables
  • Read-write locks
  • Reference locks

Synchronization
• Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds
while (!acquire(lock))
    ;   /* spin */
/* some computation on shared data (critical section) */
release(lock);
• acquire is built on an atomic read-modify-write primitive

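Synchronization
• Spin locks (using test-and-set)
void spin_lock(spinlock_t *s)
{
    /* test_and_set returns nonzero while the lock is held by someone else */
    while (test_and_set(s) != 0)
        while (*s != 0)
            ;   /* spin on plain reads until the lock looks free, then retry */
}

void spin_unlock(spinlock_t *s)
{
    *s = 0;     /* clear the lock word itself, not the local pointer */
}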
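The lock above depends on a `test_and_set` primitive and a `spinlock_t` type that the slide leaves undefined. A self-contained sketch of the same test-and-test-and-set idea, assuming C11 atomics and using hypothetical `ttas_*` names, could look like this:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_int ttas_lock_t;   /* 0 = free, 1 = held */

/* One attempt: the atomic exchange plays the role of test-and-set. */
bool ttas_try_lock(ttas_lock_t *s)
{
    return atomic_exchange(s, 1) == 0;
}

void ttas_lock(ttas_lock_t *s)
{
    /* Outer loop: atomic read-modify-write, generates bus traffic. */
    while (atomic_exchange(s, 1) != 0)
        /* Inner loop: read-only spin, served from the local cache
         * until the holder's release invalidates the line. */
        while (atomic_load(s) != 0)
            ;
}

void ttas_unlock(ttas_lock_t *s)
{
    atomic_store(s, 0);
}
```

The inner read-only loop is the design point: waiting processors spin on their cached copy and only issue the expensive exchange once the lock appears free.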

Synchronization
• Synchronization for large-scale multiprocessors:
  • For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization.
• Problem:
  • Ex: 20 processors spin on a lock held by 1 processor, 50 cycles per bus transaction

    Read miss by all waiting processors to fetch the lock (20 x 50)     1000
    Write miss by the releasing processor, plus invalidates               50
    Read miss by all waiting processors (20 x 50)                       1000
    Write miss by all waiting processors:
      one successful lock (50) & invalidate all copies (19 x 50)        1000
    Total time for 1 processor to acquire & release the lock            3050

  • Each time one processor gets the lock, it drops out of the competition, so the average is 1525 cycles
  • 20 x 1525 = 30,500 (roughly 30,000) cycles for 20 processors to pass through the lock
  • The problem is contention for the lock and serialization of lock access: once the lock is free, all processors compete to see who gets it
• Solution:
  • spin lock with exponential back-off
  • queuing lock
Synchronization
• Barrier synchronization
  • A very common synchronization primitive
  • Wait until all threads have reached a point in the program before any are allowed to proceed further
• Uses two shared variables
  • A counter that counts how many have arrived
  • A flag that is set when the last processor arrives
repeat:
    computation;
    barrier();
    communication;
    barrier();

Synchronization
• Barrier synchronization
  • Simple barrier synchronization

lock(counterlock);
if (count == 0) release = 0;   /* First resets release */
count++;                       /* Count arrivals */
unlock(counterlock);
if (count == total) {          /* All arrived */
    count = 0;                 /* Reset counter */
    release = 1;               /* Release processes */
} else {                       /* Wait for more to come */
    spin(release == 1);        /* Wait for release to be 1 */
}
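The simple barrier above uses `lock`, `unlock` and `spin` helpers that the slide does not define. The sketch below is a different but closely related technique, a sense-reversing barrier written with C11 atomics. It is named plainly as a substitution because it also tolerates a fast thread re-entering the barrier before slow threads have left the previous one. The `barrier_t` type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* arrivals in the current episode */
    atomic_bool release;   /* flips each time the barrier opens */
    int         total;     /* number of participating threads */
} barrier_t;

void barrier_init(barrier_t *b, int total)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->release, false);
    b->total = total;
}

/* local_sense is per-thread state; each thread starts with false. */
void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;                 /* sense for this episode */
    if (atomic_fetch_add(&b->count, 1) + 1 == b->total) {
        atomic_store(&b->count, 0);               /* last arrival resets the counter */
        atomic_store(&b->release, *local_sense);  /* ...and opens the barrier */
    } else {
        while (atomic_load(&b->release) != *local_sense)
            ;                                     /* wait for the flag to flip */
    }
}
```

Each thread keeps its own `bool sense = false;` and calls `barrier_wait(&b, &sense)` at every barrier point; the counter is updated with fetch-and-increment instead of a separate lock.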

Synchronization
• Barrier with many processors
  • The counter has to be updated one processor at a time, which takes a long time
• Solution: use a combining tree of barriers
  • Example: using a binary tree
  • Pair up processors; each pair has its own barrier
    • E.g. at level 1, processors 0 and 1 synchronize on one barrier, processors 2 and 3 on another, etc.
  • At the next level, pair up the pairs
    • Processors 0 and 2 increment a count at level 2; processors 1 and 3 just wait for it to be released
    • At level 3, 0 and 4 increment the counter, while 1, 2, 3, 5, 6, and 7 just spin until this level 3 barrier is released
  • At the highest level, all processes spin and only a few "representatives" are counted
• Works well because each level is fast and there are few levels
  • Only 2 increments per level, log2(numProc) levels
  • For large numProc, 2*log2(numProc) is still reasonably small
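To put numbers on the comparison above, the following sketch counts serialized counter updates for a flat barrier (one per processor) and for a binary combining tree (2 per level). The function names are hypothetical, and the model counts only increments, ignoring spin traffic and cache effects.

```c
/* Levels in a binary combining tree: smallest L with 2^L >= num_proc.
 * Assumes num_proc <= 2^31 so the shift does not overflow. */
unsigned tree_levels(unsigned num_proc)
{
    unsigned levels = 0;
    while ((1u << levels) < num_proc)
        levels++;
    return levels;
}

/* Flat barrier: every processor increments the same counter in turn. */
unsigned flat_barrier_updates(unsigned num_proc)
{
    return num_proc;
}

/* Tree barrier: 2 increments per level along the path to the root. */
unsigned tree_barrier_updates(unsigned num_proc)
{
    return 2u * tree_levels(num_proc);
}
```

For 1024 processors, this model gives 1024 serialized updates for the flat barrier against 20 for the tree.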

Synchronization
• Contention even with test-and-test-and-set
  • Every write goes to many, many spinning processors
  • Making everybody test less often reduces contention for high-contention locks but hurts low-contention locks
• Solution: exponential back-off
  • If we have waited for a long time, the lock is probably high-contention
  • Every time we check and fail, double the time between checks
    • Fast low-contention locks (checks are frequent at first)
    • Scalable high-contention locks (checks are infrequent during long waits)
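The back-off rule above ("every time we check and fail, double the time between checks") can be sketched as follows. This is an illustration under stated assumptions: the delay is a simple busy-wait loop, the bounds `BACKOFF_MIN` and `BACKOFF_MAX` are arbitrary, and all names are hypothetical.

```c
#include <stdatomic.h>

static const unsigned BACKOFF_MIN = 1u;      /* first wait, in loop iterations */
static const unsigned BACKOFF_MAX = 1024u;   /* cap so waits stay bounded */

/* Double the delay after a failed attempt, up to the cap. */
unsigned next_delay(unsigned delay)
{
    unsigned doubled = delay * 2u;
    return doubled > BACKOFF_MAX ? BACKOFF_MAX : doubled;
}

/* Crude delay: spin locally without touching the lock word. */
static void pause_for(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

void backoff_lock(atomic_int *lock)
{
    unsigned delay = BACKOFF_MIN;
    while (atomic_exchange(lock, 1) != 0) {   /* failed check */
        pause_for(delay);                     /* stay off the bus for a while */
        delay = next_delay(delay);            /* wait twice as long next time */
    }
}

void backoff_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);
}
```

An uncontended lock is acquired on the first exchange with no delay at all, while a heavily contended lock quickly reaches the capped wait, which matches the two cases listed above.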

– Hardware support

Cache Coherence
• In a shared-memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand.
• One copy is in main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed as well.

==> Coherence ensures that reading a location returns the latest value written to that location

Cache Coherence
Conditions for coherency:
• P writes X (it becomes Xp), there are no other writes, and P reads X => P gets Xp
• Q writes X (it becomes Xq), there are no other writes, and P reads X => P gets Xq
• P writes X (Xp), then Q writes X (Xq); R, S, and T read X => they all get Xq

Snooping protocol
• Used in systems with a shared bus between the processors and the memory modules
• Relies on a common channel (or bus) connecting the processors to main memory
• This enables all cache controllers to observe (or snoop) the activities of all other processors and take appropriate action to prevent a processor from obtaining stale data

Cache Coherence
• Cache state bits: describe the state of every cache line (invalid, valid, shared, dirty…)
• Bus monitor: monitoring hardware that can independently update the state of cache lines
• Bus cache cycles: broadcast invalidates or updates. They may or may not be part of bus read/write cycles

Cache Coherence
Two possible solutions:
1. Update the copies in the caches of the other processors (Write-Update)
2. Invalidate the copies in the caches of the other processors (Write-Invalidate)
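The write-invalidate option can be illustrated with a toy model of one memory word cached by three processors on a shared bus. This is a simplified sketch, not a real protocol implementation: the state names borrow from the state bits mentioned earlier (invalid, shared, dirty), a cache line holds a single `int`, and the type and function names are hypothetical.

```c
enum { NCACHES = 3 };

typedef enum { INVALID, SHARED, DIRTY } line_state;

typedef struct {
    line_state state[NCACHES];   /* per-cache state of the one line */
    int        data[NCACHES];    /* per-cache copy of the word */
    int        memory;           /* main-memory copy */
} bus_system;

/* Processor p reads the word. */
int cache_read(bus_system *s, int p)
{
    if (s->state[p] == INVALID) {
        /* Read miss on the bus: a cache holding the line dirty
         * snoops the request and writes its value back first. */
        for (int q = 0; q < NCACHES; q++) {
            if (s->state[q] == DIRTY) {
                s->memory   = s->data[q];
                s->state[q] = SHARED;
            }
        }
        s->data[p]  = s->memory;
        s->state[p] = SHARED;
    }
    return s->data[p];
}

/* Processor p writes the word: broadcast an invalidate, then write locally. */
void cache_write(bus_system *s, int p, int value)
{
    for (int q = 0; q < NCACHES; q++)
        if (q != p)
            s->state[q] = INVALID;   /* other caches snoop and drop their copy */
    s->data[p]  = value;
    s->state[p] = DIRTY;
}
```

After `cache_write(&s, 0, 9)`, a read by any other processor misses, forces processor 0 to write back, and returns 9, which is the behavior the coherence conditions require.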

