Advanced Computer Architecture - Lecture 34: Multiprocessors

CS 704
Advanced Computer Architecture

Lecture 34
Multiprocessors
(Shared Memory Architectures)

Prof. Dr. M. Ashraf Chughtai


Today’s Topics
Recap
Parallel Processing
Parallel Processing Architectures
– Symmetric Shared Memory
– Distributed Shared Memory
Performance of Parallel Architectures
Summary

Recap
So far our focus has been on the
performance of single instruction stream
computers and on methodologies to enhance
the performance of such machines
We studied how
– Instruction Level Parallelism is exploited
among the instructions of a stream; and
– the control, data and memory dependencies
are resolved

Recap: ILP
These characteristics are realized through:
– Pipelining the datapath
– Superscalar Architecture
– Very Long Instruction Word (VLIW) Architecture
– Out-of-Order execution


Parallel Processing and Parallel
Architecture
However, further improvements in
performance may be achieved by exploiting
parallelism among multiple instruction
streams, using:
– Multithreading, i.e., a number of instruction
streams running on one CPU
– Multiprocessing, i.e., instruction streams
running on multiple CPUs, where each CPU
can itself be multithreaded

Parallel Computers Performance
Amdahl’s Law
Furthermore, while evaluating the
performance enhancement due to parallel
processing, two important challenges must
be taken into consideration:
– Limited parallelism available in programs
– High cost of communication
These limitations make it difficult to
achieve good speedup in any parallel
processor


Parallel Computers Performance
Amdahl’s Law
For example, if a portion of the program is
sequential, it limits the speedup; this can be
understood by the following example:

Example: What fraction of the original computation
can be sequential to achieve a speedup of 80 with
100 processors?
Answer: Amdahl’s law states that:

Speedup = 1 / [(Fraction enhanced / Speedup enhanced) + (1 – Fraction enhanced)]


Parallel Computers Performance
Here, the fraction enhanced is the fraction in
parallel; therefore, the speedup can be expressed as
80 = 1 / [Fraction parallel/100 + (1 – Fraction parallel)]
Simplifying the expression, we get
0.8 × Fraction parallel + 80 × (1 – Fraction parallel) = 1
80 – 79.2 × Fraction parallel = 1
Fraction parallel = (80 – 1)/79.2 = 0.9975
i.e., to achieve a speedup of 80 with 100 processors,
only 0.25% of the computation may be sequential!
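
This back-substitution is easy to check mechanically. Below is a minimal Python sketch (the function and variable names are our own, purely illustrative) that solves Amdahl’s law for the parallel fraction needed to reach a target speedup:

    def required_parallel_fraction(target_speedup, n_processors):
        """Solve Amdahl's law, speedup = 1 / (f/n + (1 - f)), for f."""
        # Rearranging gives f = (1 - 1/speedup) / (1 - 1/n)
        return (1 - 1 / target_speedup) / (1 - 1 / n_processors)

    f = required_parallel_fraction(80, 100)
    print(f"parallel fraction  = {f:.4f}")      # 0.9975
    print(f"sequential allowed = {1 - f:.4%}")  # about 0.25%

Running it reproduces the 0.9975 parallel fraction derived above.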

Parallel Computers Performance
The second major challenge in parallel
processing is the communication cost, which
involves the latency of remote access
Now let us consider another example to
explain the impact of communication cost
on the performance of parallel computers
Example: Consider an application running
on a 32-processor multiprocessor, where a
remote memory reference takes 400 nsec
to handle

Parallel Computers Performance
Assume the processor issues 2 instructions
per cycle when all memory references hit
locally, and the processor clock rate is
1 GHz. Find:
How much faster is the multiprocessor when
there is no communication than when 0.2%
of the instructions involve a remote access?
Solution: The effective CPI for the
multiprocessor with remote references is:
CPI = Base CPI +
Remote request rate × Remote access cost

Introduction to Parallel Processing
Substituting the values, we get:
CPI = [1/Base IPC] + 0.2% × remote access cost
    = [1/2] + 0.2% × (400 cycles)
    = 0.5 + 0.8 = 1.3
And the CPI without remote references
    = 1/Base IPC = 0.5
Hence, the multiprocessor with all local
references is 1.3/0.5 = 2.6 times faster than
the one in which 0.2% of the instructions
involve a remote access
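
The same arithmetic as a short Python sketch (variable names are illustrative, not from the lecture); at a 1 GHz clock one cycle takes 1 ns, so the 400 ns remote latency costs 400 cycles:

    base_ipc = 2.0       # instructions per cycle when all references are local
    remote_rate = 0.002  # 0.2% of instructions make a remote reference
    remote_cost = 400    # remote access latency in cycles (400 ns at 1 GHz)

    cpi_remote = 1 / base_ipc + remote_rate * remote_cost  # 0.5 + 0.8 = 1.3
    cpi_local = 1 / base_ipc                               # 0.5

    print(cpi_remote / cpi_local)  # 2.6x slowdown from remote accesses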

Introduction to Parallel Processing
Considering these limitations, let us
explore how improvement in computer
performance can be accomplished
using Parallel Processing Architecture
A parallel architecture is a collection of
processing elements that cooperate
and communicate to solve large
problems fast

Introduction to Parallel Processing
Parallel computers extend the traditional
computer architecture with a communication
architecture to achieve synchronization
between threads and consistency of the
data held in caches


Parallel Computer Categories
In 1966, Flynn proposed a simple categorization
of computers that is still valid today
This categorization forms the basis for
implementing the programming and
communication models for parallel computing
Flynn looked at the parallelism in the
instruction and data streams called for
by the instructions and proposed the following
four categories:

Parallel Computer Categories
SISD (Single Instruction Single Data)
– This category is the uniprocessor
SIMD (Single Instruction Multiple Data)
– The same instruction is executed by multiple
processors using different data streams
– Each processor has its own data memory
(i.e., there are multiple data memories) but a
single instruction memory and a single
control processor

Parallel Computer Categories
– Illiac-IV and CM-2 are typical examples of
the SIMD architecture, which offers:
Simple programming model
Low overhead
Flexibility
All custom integrated circuits
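
To make the SIMD idea concrete, here is a loose analogy in Python with NumPy (our own illustration, not how Illiac-IV or CM-2 were actually programmed): a single logical instruction is applied to many data elements at once, instead of looping over them one at a time:

    import numpy as np

    xs = list(range(8))
    ys = list(range(8))

    # SISD style: one instruction stream, one data item at a time
    zs = [x + y for x, y in zip(xs, ys)]

    # SIMD style: one (logical) add applied across whole arrays at once
    a = np.array(xs)
    b = np.array(ys)
    c = a + b

    print(zs)  # [0, 2, 4, 6, 8, 10, 12, 14]
    print(c)   # [ 0  2  4  6  8 10 12 14]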

MISD (Multiple Instruction Single Data)
– Multiple processors or consecutive functional
units work on a single data stream
– (However, no commercial multiprocessor of
this type has been built to date)


Parallel Computer Categories
MIMD (Multiple Instruction Multiple Data)
– Each processor fetches its own instructions and
operates on its own data
– Examples: Sun Enterprise 5000, Cray T3D, SGI
Origin. The characteristics of these machines are:
Flexibility: they can function as single-user
multiprocessors or as multi-programmed
multiprocessors running many programs
simultaneously
Use of off-the-shelf microprocessors

MIMD and Thread Level Parallelism
MIMD machines have multiple processors and
can be used in two ways:
– either each processor executes a different
process in a multi-programmed environment,
– or multiple processors execute a single
program, sharing the code and most of the
address space
In the latter case, where multiple processes
share code and data, the processes are
referred to as threads

MIMD and Thread Level Parallelism
Threads may be
– either large-scale independent
processes, such as independent
programs, running in a multi-programmed
fashion,
– or parallel iterations of a loop containing
thousands of instructions, automatically
generated by a compiler (see the sketch
below)
The parallelism in such threads is called
Thread Level Parallelism
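
As a hypothetical Python illustration of that programming model (names are ours; note that CPython’s global interpreter lock keeps these threads from truly running CPU-bound work in parallel, so the sketch shows only how a loop is split across threads sharing one address space):

    import threading

    def worker(chunk, results, idx):
        # Every thread runs the same code on its own slice of the data,
        # sharing the program's code and address space
        results[idx] = sum(x * x for x in chunk)

    data = list(range(100_000))
    n_threads = 4
    chunks = [data[i::n_threads] for i in range(n_threads)]
    results = [0] * n_threads

    threads = [threading.Thread(target=worker, args=(chunks[i], results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(sum(results))  # matches the sequential sum of squares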

MIMD Classification
Based on the memory organization and
interconnect strategy, the MIMD machines
are classified as:
– Centralized Shared Memory Architecture
– Distributed Memory Architecture


Centralized Shared-Memory
The Centralized Shared Memory design,
shown in the figure below, illustrates the
interconnection of the main memory and I/O
systems to the processor-cache subsystems
In small-scale designs, with fewer than a
dozen processor-cache subsystems, the
subsystems share the same physical
centralized memory connected by a bus; while
in larger designs, i.e., the designs with a few …

Centralized Shared-Memory
Architecture
[Figure: processor-cache subsystems connected by a bus to the shared main memory and I/O system]


Centralized Shared-Memory

… dozen processor-cache subsystems,
the single bus is replaced with multiple
buses, or even a switch is used
However, the key architectural property of
the Centralized Shared Memory design is
Uniform Memory Access (UMA);
i.e., the access time to all memory from all
the processors is the same

Centralized Shared-Memory
Furthermore, the single main memory has a
symmetric relationship to all the processors
These multiprocessors are therefore
referred to as Symmetric (Shared
Memory) Multiprocessors (SMPs)
This style of architecture is also sometimes
called Uniform Memory Access (UMA),
as it offers uniform access time to all the
memory from all the processors

Decentralized or Distributed Memory
The decentralized or distributed memory
design style of multiprocessor architecture
is shown here
It consists of a number of individual nodes,
each containing a processor, some memory,
some I/O, and an interface to an
interconnection network that connects all
the nodes
The individual nodes contain a small
number of processors, which may be ……