
8 PARALLEL COMPUTER ARCHITECTURES


CuuDuongThanCong.com


[Figure: two panels, (a) and (b). Labels: P (CPU), Shared memory.]

Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each
being analyzed by a different CPU.
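The partitioning in Fig. 8-1(b) can be sketched in a few lines of Python. This is only an illustration, not the book's method: the names `tile_bounds`, `count_bright`, and `analyze`, the 4 x 4 tiling, and the "count bright pixels" analysis are all assumptions chosen for the example. Threads stand in for the 16 CPUs because they all see the same `image` object, much as the CPUs in a multiprocessor see one shared memory.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_bounds(height, width, rows=4, cols=4):
    """Yield (r0, r1, c0, c1) for each tile of a rows x cols partition."""
    for i in range(rows):
        for j in range(cols):
            yield (i * height // rows, (i + 1) * height // rows,
                   j * width // cols, (j + 1) * width // cols)


def count_bright(image, bounds, threshold=128):
    """Hypothetical per-tile analysis: count pixels above the threshold."""
    r0, r1, c0, c1 = bounds
    return sum(1 for r in range(r0, r1) for c in range(c0, c1)
               if image[r][c] > threshold)


def analyze(image, workers=16):
    """Analyze every tile with its own worker; all workers share `image`."""
    tiles = list(tile_bounds(len(image), len(image[0])))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reads only its own section of the shared image.
        return sum(pool.map(lambda b: count_bright(image, b), tiles))
```

Because the tiles do not overlap and the workers only read, no locking is needed in this sketch.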


[Figure: two panels, (a) and (b). Labels: P (CPU), M (Private memory), Message-passing interconnection network.]

Figure 8-2. (a) A multicomputer with 16 CPUs, each with
its own private memory. (b) The bit-map image of Fig. 8-1
split up among the 16 memories.


[Figure: three panels, (a) through (c). Each shows Machine 1 and Machine 2, each with the layers Application, Language run-time system, Operating system, and Hardware, joined by a Shared memory.]

Figure 8-3. Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c)
The language runtime system.



[Figure: eight panels, (a) through (h).]

Figure 8-4. Various topologies. The heavy dots represent
switches. The CPUs and memories are not shown. (a) A star.
(b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid.
(f) A double torus. (g) A cube. (h) A 4D hypercube.
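The hypercube in panel (h) has a simple arithmetic description. In the usual labeling (assumed here, since the figure does not show node numbers), each of the 2^d switches gets a d-bit number and two switches are linked when their numbers differ in exactly one bit. The function names below are illustrative only.

```python
def hypercube_neighbors(node, dim):
    """Return the nodes directly linked to `node` in a dim-dimensional hypercube."""
    # Flipping each of the dim address bits in turn gives one neighbor per dimension.
    return [node ^ (1 << bit) for bit in range(dim)]


def hop_distance(a, b):
    """Minimum number of links between nodes a and b: the count of differing bits."""
    return bin(a ^ b).count("1")


def diameter(dim):
    """Longest shortest path in the hypercube, found by brute force."""
    nodes = range(2 ** dim)
    return max(hop_distance(a, b) for a in nodes for b in nodes)
```

For the 4D hypercube this gives four neighbors per switch and a diameter of 4, since at most four address bits can differ.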


[Figure: CPU 1 and CPU 2 attached to a grid of four-port switches A, B, C, and D, each with input ports and output ports. Labels on the packet in transit: Front of packet, Middle of packet, End of packet.]

Figure 8-5. An interconnection network in the form of a four-switch square grid. Only two of the CPUs are shown.


[Figure: three panels, (a) through (c). Each shows CPU 1, CPU 2, and four-port switches A, B, C, and D with input and output ports; the label "Entire packet" marks the packet's position in each panel.]

Figure 8-6. Store-and-forward packet switching.


[Figure: CPU 1, CPU 2, CPU 3, and CPU 4 attached to four-port switches A, B, C, and D. Labels: Input port, Output buffer.]

Figure 8-7. Deadlock in a circuit-switched interconnection network.


[Plot: Speedup (0 to 60) against Number of CPUs (0 to 60). Curves: Linear speedup, N-body problem, Awari, Skyline matrix inversion.]

Figure 8-8. Real programs achieve less than the perfect speedup indicated by the dotted line.


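[Figure: two panels. (a) With 1 CPU active, a program of running time T consists of an inherently sequential part (fraction f) and a potentially parallelizable part (fraction 1 – f). (b) With n CPUs active, the sequential part takes fT and the parallelizable part takes (1 – f)T/n.]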

Figure 8-9. (a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.
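The quantities labeled in Fig. 8-9 lead directly to a speedup formula: dividing the original time T by the new time fT + (1 – f)T/n gives n / (1 + (n – 1)f), the relation usually known as Amdahl's law. A minimal sketch follows; the function names are chosen for this example only.

```python
def parallel_time(T, f, n):
    """Running time on n CPUs when a fraction f of the work is inherently sequential."""
    return f * T + (1 - f) * T / n


def speedup(f, n):
    """Speedup T / parallel_time(T, f, n), simplified to n / (1 + (n - 1) f)."""
    return n / (1 + (n - 1) * f)


if __name__ == "__main__":
    # With 10% sequential code, extra CPUs help less and less.
    for n in (1, 4, 16, 64):
        print(n, round(speedup(0.10, n), 2))
```

As n grows, the speedup approaches 1/f, so even a small sequential fraction caps the benefit of adding CPUs.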


[Figure: four panels, (a) through (d). Labels: CPU, Bus.]

Figure 8-10. (a) A 4-CPU bus-based system. (b) A 16-CPU
bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.


[Figure: four panels, (a) through (d), showing processes P1 through P9. Labels: Process, Synchronization point, Work queue.]

Figure 8-11. Computational paradigms. (a) Pipeline. (b)
Phased computation. (c) Divide and conquer. (d) Replicated
worker.
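The replicated worker paradigm of panel (d) can be sketched with Python threads. This is a simplified illustration under stated assumptions: the task list is fixed up front (workers do not add new tasks), three workers are used to match P1 through P3 in the figure, and the name `replicated_workers` is invented for the example.

```python
import queue
import threading


def replicated_workers(tasks, work, num_workers=3):
    """Apply work(task) to every task, with identical workers sharing one queue."""
    work_queue = queue.Queue()
    for task in tasks:
        work_queue.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work_queue.get_nowait()
            except queue.Empty:
                return  # the queue is drained, so this worker stops
            value = work(task)
            with lock:  # protect the shared results table
                results[task] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return results
```

A call such as `replicated_workers(range(10), lambda x: x * x)` returns a table mapping each task to its result; which worker handled which task depends on timing and is deliberately left unspecified.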



Physical (hardware)   Logical (software)   Examples
Multiprocessor        Shared variables     Image processing as in Fig. 8-1
Multiprocessor        Message passing      Message passing simulated with buffers in memory
Multicomputer         Shared variables     DSM, Linda, Orca, etc. on an SP/2 or a PC network
Multicomputer         Message passing      PVM or MPI on an SP/2 or a network of PCs

Figure 8-12. Combinations of physical and logical sharing.


Instruction streams   Data streams   Name   Examples
1                     1              SISD   Classical Von Neumann machine
1                     Multiple       SIMD   Vector supercomputer, array processor
Multiple              1              MISD   Arguably none
Multiple              Multiple       MIMD   Multiprocessor, multicomputer

Figure 8-13. Flynn’s taxonomy of parallel computers.


Parallel computer architectures
  SISD (Von Neumann)
  SIMD
    Vector processor
    Array processor
  MISD (?)
  MIMD
    Multiprocessors (shared memory)
      UMA
        Bus
        Switched
      COMA
      NUMA
        CC-NUMA
        NC-NUMA
    Multicomputers (message passing)
      MPP
        Grid
        Hypercube
      COW

Figure 8-14. A taxonomy of parallel computers.


[Figure: Input vectors feeding a Vector ALU.]

Figure 8-15. A vector ALU.


Operation                Examples
Ai = f1 (Bi)             f1 = cosine, square root
Scalar = f2 (A)          f2 = sum, minimum
Ai = f3 (Bi, Ci)         f3 = add, subtract
Ai = f4 (scalar, Bi)     f4 = multiply Bi by a constant

Figure 8-16. Various combinations of vector and scalar operations.
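The four operation shapes of Fig. 8-16 can be written out with ordinary Python lists standing in for vector registers. This is only a functional sketch: the helper names are invented here, and a real vector unit would process the elements in hardware rather than in a loop.

```python
import math
import operator


def elementwise(f1, b):
    """Ai = f1(Bi): apply a one-operand function to every element."""
    return [f1(x) for x in b]


def reduce_to_scalar(f2, a):
    """Scalar = f2(A): collapse a whole vector to one value."""
    return f2(a)


def pairwise(f3, b, c):
    """Ai = f3(Bi, Ci): combine two vectors element by element."""
    return [f3(x, y) for x, y in zip(b, c)]


def scalar_vector(f4, scalar, b):
    """Ai = f4(scalar, Bi): combine one scalar with every element."""
    return [f4(scalar, x) for x in b]


if __name__ == "__main__":
    print(elementwise(math.sqrt, [1.0, 4.0, 9.0]))
    print(reduce_to_scalar(min, [3, 1, 2]))
    print(pairwise(operator.sub, [5, 7], [1, 2]))
    print(scalar_vector(operator.mul, 3, [1, 2, 3]))
```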


Step   Name                  Values
1      Fetch operands        1.082 × 10^12 − 9.212 × 10^11
2      Adjust exponent       1.082 × 10^12 − 0.9212 × 10^12
3      Execute subtraction   0.1608 × 10^12
4      Normalize result      1.608 × 10^11

Figure 8-17. Steps in a floating-point subtraction.
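The four steps of Fig. 8-17 can be reproduced with exact decimal arithmetic. The sketch below assumes operands are given as (mantissa, exponent) pairs in base 10, as in the figure; real hardware works in binary, and the function name `fp_subtract` is made up for this example.

```python
from decimal import Decimal


def fp_subtract(a, b):
    """Subtract b from a, where each is a (mantissa, exponent) pair in base 10."""
    # Step 1: fetch operands.
    (ma, ea), (mb, eb) = a, b
    # Step 2: adjust exponent so both operands use the larger exponent.
    if ea >= eb:
        mb, exp = mb.scaleb(eb - ea), ea
    else:
        ma, exp = ma.scaleb(ea - eb), eb
    # Step 3: execute the subtraction on the aligned mantissas.
    m = ma - mb
    # Step 4: normalize so that 1 <= |mantissa| < 10 (zero is left alone).
    while m != 0 and abs(m) < 1:
        m, exp = m.scaleb(1), exp - 1
    while abs(m) >= 10:
        m, exp = m.scaleb(-1), exp + 1
    return m, exp


if __name__ == "__main__":
    print(fp_subtract((Decimal("1.082"), 12), (Decimal("9.212"), 11)))
```

With the figure's operands, the aligned mantissas are 1.082 and 0.9212, their difference is 0.1608, and normalization shifts it to 1.608 with exponent 11.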


                                           Cycle
Step                1        2        3         4         5         6         7
Fetch operands      B1, C1   B2, C2   B3, C3    B4, C4    B5, C5    B6, C6    B7, C7
Adjust exponent              B1, C1   B2, C2    B3, C3    B4, C4    B5, C5    B6, C6
Execute operation                     B1 + C1   B2 + C2   B3 + C3   B4 + C4   B5 + C5
Normalize result                                B1 + C1   B2 + C2   B3 + C3   B4 + C4

Figure 8-18. A pipelined floating-point adder.
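The staggered pattern in the pipelined adder follows one rule: in cycle k, the stage at depth d (counting from 0) works on operand pair k − d. The sketch below regenerates the schedule from that rule; the stage names come from the figure, while `pipeline_schedule` is a name invented for this example.

```python
STAGES = ["Fetch operands", "Adjust exponent",
          "Execute operation", "Normalize result"]


def pipeline_schedule(num_pairs, num_cycles):
    """Map each stage to the operand-pair index it handles in each cycle.

    Entry i of a stage's list covers cycle i + 1; None means the stage is idle.
    """
    schedule = {}
    for depth, stage in enumerate(STAGES):
        row = []
        for cycle in range(1, num_cycles + 1):
            pair = cycle - depth  # a pair entered the pipeline `depth` cycles ago
            row.append(pair if 1 <= pair <= num_pairs else None)
        schedule[stage] = row
    return schedule


if __name__ == "__main__":
    for stage, row in pipeline_schedule(7, 7).items():
        print(f"{stage:18}", row)
```

Once the pipeline is full (from cycle 4 on), one normalized result emerges every cycle even though each individual addition still takes four cycles.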


[Figure: block diagram. Registers: A, 8 24-bit address registers; B, 64 24-bit holding registers for addresses; S, 8 64-bit scalar registers; T, 64 64-bit holding registers for scalars; 8 64-bit vector registers with 64 elements per register. Functional units: Address units (ADD, MUL); Scalar integer units (ADD, BOOLEAN, SHIFT, POP. COUNT); Scalar/vector floating-point units (ADD, MUL, RECIP.); Vector integer units (ADD, BOOLEAN, SHIFT).]

Figure 8-19. Registers and functional units of the Cray-1.


[Figure: four panels. (a) CPUs 1 and 2 write 100 and 200 to memory word x; CPUs 3 and 4 each read x twice. (b) through (d) each list one ordering of the events W100 and W200 and the values seen by the reads of CPUs 3 and 4 (R3, R4).]

Figure 8-20. (a) Two CPUs writing and two CPUs reading a
common memory word. (b) - (d) Three possible ways the two
writes and four reads might be interleaved in time.
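The interleavings in Fig. 8-20 can be enumerated mechanically. The sketch below assumes a single global order of events in which each CPU's own operations stay in program order, and it assumes the word x starts at 0 (the figure does not give an initial value). All names are illustrative.

```python
from collections import defaultdict

# Per-CPU programs from Fig. 8-20(a): two writers and two readers of word x.
PROGRAMS = {
    1: [("W", 100)],
    2: [("W", 200)],
    3: [("R", None), ("R", None)],
    4: [("R", None), ("R", None)],
}


def interleavings(programs):
    """Yield every merge of the per-CPU event lists that keeps each CPU's order."""
    if not any(programs.values()):
        yield []
        return
    for cpu, events in programs.items():
        if events:
            rest = dict(programs)
            rest[cpu] = events[1:]
            for tail in interleavings(rest):
                yield [(cpu, events[0])] + tail


def run(order, initial=0):
    """Replay one interleaving; return the values read by CPU 3 and CPU 4."""
    x = initial  # assumed starting value of the shared word
    reads = defaultdict(list)
    for cpu, (op, value) in order:
        if op == "W":
            x = value
        else:
            reads[cpu].append(x)
    return tuple(reads[3]), tuple(reads[4])


ORDERS = list(interleavings(PROGRAMS))
OUTCOMES = {run(order) for order in ORDERS}
```

There are 180 such orderings. None of them lets CPU 3 see 200 and then 100 while CPU 4 sees 100 and then 200, because that would require the two writes to occur in both orders at once.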



[Figure: timeline for CPU A, CPU B, and CPU C showing writes labeled 1A through 1F, 2A through 2D, and 3A through 3C. Labels: Write, Synchronization point, Time.]

Figure 8-21. Weakly consistent memory uses synchronization
operations to divide time into sequential epochs.



[Figure: three panels, (a) through (c). Labels: CPU, M, Shared memory, Private memory, Cache, Bus.]

Figure 8-22. Three bus-based multiprocessors. (a) Without
caching. (b) With caching. (c) With caching and private
memories.



Action       Local request               Remote request
Read miss    Fetch data from memory
Read hit     Use data from local cache
Write miss   Update data in memory
Write hit    Update cache and memory     Invalidate cache entry

Figure 8-23. The write through cache coherence protocol.
The empty boxes indicate that no action is taken.
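The write-through rules of Fig. 8-23 are small enough to model directly. The sketch below is an illustration under simplifying assumptions: memory is a plain dictionary, each cache line holds one value, and snooping is modeled by calling every peer cache on each write. The class and method names are invented for the example.

```python
class WriteThroughCache:
    """One CPU's cache on a shared bus, following the rules of Fig. 8-23."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory: address -> value
        self.lines = {}       # this cache's contents: address -> value
        self.peers = []       # the other caches snooping on the bus

    def read(self, addr):
        if addr not in self.lines:
            # Read miss: fetch data from memory.
            self.lines[addr] = self.memory[addr]
        # Read hit: use data from the local cache.
        return self.lines[addr]

    def write(self, addr, value):
        if addr in self.lines:
            # Write hit: update the cache (memory is updated just below).
            self.lines[addr] = value
        # Write hit or write miss: memory is always updated.
        self.memory[addr] = value
        for peer in self.peers:
            peer.snoop_write(addr)

    def snoop_write(self, addr):
        # Remote write to a word held here: invalidate the cache entry.
        self.lines.pop(addr, None)


def make_caches(memory, count):
    """Build `count` caches that snoop on one another."""
    caches = [WriteThroughCache(memory) for _ in range(count)]
    for cache in caches:
        cache.peers = [c for c in caches if c is not cache]
    return caches
```

Because every write reaches memory and stale copies are invalidated, a later read by any CPU misses and fetches the current value.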


[Figure: six panels, each showing CPU 1, CPU 2, and CPU 3 with their caches and a Memory on a common Bus.]
(a) CPU 1 reads block A: CPU 1 holds A, Exclusive.
(b) CPU 2 reads block A: CPU 1 and CPU 2 hold A, both Shared.
(c) CPU 2 writes block A: CPU 2 holds A, Modified.
(d) CPU 3 reads block A: CPU 2 and CPU 3 hold A, both Shared.
(e) CPU 2 writes block A: CPU 2 holds A, Modified.
(f) CPU 1 writes block A: CPU 1 holds A, Modified.

Figure 8-24. The MESI cache coherence protocol.
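The sequence of events in Fig. 8-24 can be replayed with a small state tracker. The sketch below follows only the state of block A in each cache (M, E, S, or I); data transfer and write-back to memory are noted in comments but not modeled, and the class name `MesiBlock` is an assumption made for this example.

```python
class MesiBlock:
    """Track the MESI state of one block in each CPU's cache."""

    def __init__(self, num_cpus):
        # Every cache starts without the block (Invalid).
        self.state = {cpu: "I" for cpu in range(1, num_cpus + 1)}

    def read(self, cpu):
        if self.state[cpu] != "I":
            return  # read hit: no state change
        holders = [c for c in self.state if c != cpu and self.state[c] != "I"]
        for c in holders:
            # A Modified holder would write the block back to memory first;
            # Exclusive and Modified holders both drop to Shared.
            self.state[c] = "S"
        self.state[cpu] = "S" if holders else "E"

    def write(self, cpu):
        for c in self.state:
            if c != cpu:
                # Other copies are invalidated (a Modified one is written back).
                self.state[c] = "I"
        self.state[cpu] = "M"


if __name__ == "__main__":
    block = MesiBlock(3)
    for action, cpu in [("read", 1), ("read", 2), ("write", 2),
                        ("read", 3), ("write", 2), ("write", 1)]:
        getattr(block, action)(cpu)
        print(f"CPU {cpu} {action}s block A ->", block.state)
```

Replaying the six steps of the figure yields Exclusive, then Shared/Shared, Modified, Shared/Shared, Modified, and finally Modified in CPU 1's cache.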

