8 PARALLEL COMPUTER ARCHITECTURES
CuuDuongThanCong.com
[Diagram: (a) CPUs, each labeled P, attached to a single shared memory; (b) the image sections, each labeled with its CPU P.]
Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.
[Diagram: (a) CPUs (P), each paired with a private memory (M), connected by a message-passing interconnection network; (b) the same arrangement with the image sections held in the memories.]
Figure 8-2. (a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. 8-1 split up among the 16 memories.
[Diagram: three panels, each showing Machine 1 and Machine 2 as a stack of layers (application, language run-time system, operating system, hardware), with the shared memory spanning the two machines at a different layer in each panel.]
Figure 8-3. Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.
[Diagram: eight topology panels, (a) through (h).]
Figure 8-4. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
[Diagram: four-port switches A, B, C, and D, each with input and output ports; CPU 1 and CPU 2 are attached to the grid, and a packet is shown in transit with its front, middle, and end labeled.]
Figure 8-5. An interconnection network in the form of a four-switch square grid. Only two of the CPUs are shown.
[Diagram: three panels, (a) to (c), each showing CPU 1, CPU 2, and four-port switches A, B, C, and D with input and output ports; each panel marks where the entire packet is held.]
Figure 8-6. Store-and-forward packet switching.
[Diagram: CPU 1 through CPU 4 attached to four-port switches A, B, C, and D, with input ports and output buffers labeled.]
Figure 8-7. Deadlock in a circuit-switched interconnection network.
[Plot: speedup (0 to 60) versus number of CPUs (0 to 60). Curves: linear speedup (dotted), N-body problem, Awari, and skyline matrix inversion.]
Figure 8-8. Real programs achieve less than the perfect speedup indicated by the dotted line.
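[Diagram: (a) with 1 CPU active, the inherently sequential part takes a fraction f of the total time T and the potentially parallelizable part takes the remaining 1 - f; (b) with n CPUs active, the sequential part still takes fT, while the parallelizable part takes (1 - f)T/n.]
Figure 8-9. (a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.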
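The times labeled in Fig. 8-9 imply a bound on speedup: if one CPU needs T, then n CPUs need fT + (1 - f)T/n, so the speedup is T divided by that sum. The following is a minimal sketch of that arithmetic; the function names are illustrative and do not come from the text.

```python
def parallel_time(T, f, n):
    """Run time on n CPUs: sequential part fT plus parallel part (1 - f)T/n."""
    return f * T + (1 - f) * T / n


def speedup(f, n):
    """Single-CPU time divided by n-CPU time; T cancels out of the ratio."""
    return 1 / (f + (1 - f) / n)


if __name__ == "__main__":
    # With 10% of the program inherently sequential, extra CPUs help less and less.
    for n in (1, 4, 16, 64):
        print(n, round(speedup(0.1, n), 2))
```

With f = 0 the speedup is exactly n (the linear case), and with f = 1 it stays at 1 no matter how many CPUs are added.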
[Diagram: four panels, (a) to (d); labels: CPU, Bus.]
Figure 8-10. (a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
[Diagram: four panels, (a) to (d), showing processes P1 through P9; labels: Work queue, Process, Synchronization point.]
Figure 8-11. Computational paradigms. (a) Pipeline. (b) Phased computation. (c) Divide and conquer. (d) Replicated worker.
Physical (hardware)   Logical (software)   Examples
Multiprocessor        Shared variables     Image processing as in Fig. 8-1
Multiprocessor        Message passing      Message passing simulated with buffers in memory
Multicomputer         Shared variables     DSM, Linda, Orca, etc. on an SP/2 or a PC network
Multicomputer         Message passing      PVM or MPI on an SP/2 or a network of PCs
Figure 8-12. Combinations of physical and logical sharing.
Instruction streams   Data streams   Name   Examples
1                     1              SISD   Classical Von Neumann machine
1                     Multiple       SIMD   Vector supercomputer, array processor
Multiple              1              MISD   Arguably none
Multiple              Multiple       MIMD   Multiprocessor, multicomputer
Figure 8-13. Flynn’s taxonomy of parallel computers.
Parallel computer architectures
    SISD (Von Neumann)
    SIMD
        Vector processor
        Array processor
    MISD (?)
    MIMD
        Multiprocessors (shared memory)
            UMA
                Bus
                Switched
            COMA
            NUMA
                CC-NUMA
                NC-NUMA
        Multicomputers (message passing)
            MPP
                Grid
                Hypercube
            COW
Figure 8-14. A taxonomy of parallel computers.
[Diagram: input vectors feeding a vector ALU.]
Figure 8-15. A vector ALU.
Operation               Examples
Ai = f1(Bi)             f1 = cosine, square root
Scalar = f2(A)          f2 = sum, minimum
Ai = f3(Bi, Ci)         f3 = add, subtract
Ai = f4(scalar, Bi)     f4 = multiply Bi by a constant
Figure 8-16. Various combinations of vector and scalar operations.
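The four operation shapes in Fig. 8-16 can be illustrated with ordinary Python lists standing in for vector registers. This is only a sketch of the shapes of the operations; the helper names are made up for illustration, and a real vector machine would apply them in hardware rather than in a loop.

```python
import math
import operator


def elementwise_unary(f1, b):
    """Ai = f1(Bi): apply f1 to every element of B."""
    return [f1(x) for x in b]


def reduce_to_scalar(f2, a):
    """Scalar = f2(A): collapse the whole vector A to one value."""
    return f2(a)


def elementwise_binary(f3, b, c):
    """Ai = f3(Bi, Ci): combine B and C element by element."""
    return [f3(x, y) for x, y in zip(b, c)]


def scalar_vector(f4, scalar, b):
    """Ai = f4(scalar, Bi): combine one scalar with every element of B."""
    return [f4(scalar, x) for x in b]


if __name__ == "__main__":
    print(elementwise_unary(math.sqrt, [1.0, 4.0, 9.0]))        # f1 = square root
    print(reduce_to_scalar(min, [3, 1, 2]))                     # f2 = minimum
    print(elementwise_binary(operator.sub, [5, 7], [1, 2]))     # f3 = subtract
    print(scalar_vector(operator.mul, 3, [1, 2, 3]))            # f4 = multiply by a constant
```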
Step   Name                  Values
1      Fetch operands        1.082 × 10^12 − 9.212 × 10^11
2      Adjust exponent       1.082 × 10^12 − 0.9212 × 10^12
3      Execute subtraction   0.1608 × 10^12
4      Normalize result      1.608 × 10^11
Figure 8-17. Steps in a floating-point subtraction.
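The four steps of Fig. 8-17 can be traced in a few lines of code. The sketch below assumes a decimal representation in which each operand is a (mantissa, exponent) pair and uses Python's Decimal type so the arithmetic is exact; the function name and representation are illustrative choices, not taken from the text.

```python
from decimal import Decimal


def fp_subtract(a, b):
    """Compute a - b, where each operand is a (mantissa, exponent) pair in base 10."""
    # Step 1: fetch operands.
    (ma, ea), (mb, eb) = a, b
    # Step 2: adjust exponents so both operands use the larger one.
    e = max(ea, eb)
    ma = ma / Decimal(10) ** (e - ea)
    mb = mb / Decimal(10) ** (e - eb)
    # Step 3: execute the subtraction on the aligned mantissas.
    m = ma - mb
    # Step 4: normalize so that 1 <= |mantissa| < 10 (a zero result is left alone).
    while m != 0 and abs(m) < 1:
        m, e = m * 10, e - 1
    while abs(m) >= 10:
        m, e = m / 10, e + 1
    return m, e


if __name__ == "__main__":
    # The operands from the figure: 1.082 x 10^12 minus 9.212 x 10^11.
    print(fp_subtract((Decimal("1.082"), 12), (Decimal("9.212"), 11)))
```

For the figure's operands, the aligned mantissas are 1.082 and 0.9212, their difference is 0.1608, and one normalization shift gives a mantissa of 1.608 with exponent 11.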
                    Cycle
Step                1        2        3         4         5         6         7
Fetch operands      B1, C1   B2, C2   B3, C3    B4, C4    B5, C5    B6, C6    B7, C7
Adjust exponent              B1, C1   B2, C2    B3, C3    B4, C4    B5, C5    B6, C6
Execute operation                     B1 + C1   B2 + C2   B3 + C3   B4 + C4   B5 + C5
Normalize result                                B1 + C1   B2 + C2   B3 + C3   B4 + C4
Figure 8-18. A pipelined floating-point adder.
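The occupancy pattern of the pipelined adder in Fig. 8-18 follows one rule: operand pair i enters the first stage in cycle i and moves one stage further each cycle. The following sketch generates that schedule; the stage names come from the figure, while the function name and output layout are illustrative.

```python
STAGES = ["Fetch operands", "Adjust exponent", "Execute operation", "Normalize result"]


def pipeline_schedule(num_cycles):
    """Map each stage to the operand-pair index it holds in each cycle (None if idle)."""
    schedule = {}
    for depth, stage in enumerate(STAGES):
        row = []
        for cycle in range(1, num_cycles + 1):
            pair = cycle - depth  # pair i reaches the stage at this depth in cycle i + depth
            row.append(pair if pair >= 1 else None)
        schedule[stage] = row
    return schedule


if __name__ == "__main__":
    for stage, row in pipeline_schedule(7).items():
        print(f"{stage:18}", " ".join("-" if p is None else str(p) for p in row))
```

Once the pipeline is full (cycle 4 onward), one result leaves the normalize stage every cycle, even though each individual addition still takes four cycles.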
[Diagram: Cray-1 registers and functional units.]
Registers:
    A   8 24-bit address registers
    B   64 24-bit holding registers for addresses
    S   8 64-bit scalar registers
    T   64 64-bit holding registers for scalars
        8 64-bit vector registers, 64 elements per register
Functional units:
    Address units                        ADD, MUL
    Scalar integer units                 ADD, BOOLEAN, SHIFT, POP. COUNT
    Scalar/vector floating-point units   ADD, MUL, RECIP.
    Vector integer units                 ADD, BOOLEAN, SHIFT
Figure 8-19. Registers and functional units of the Cray-1.
[Diagram: (a) CPU 1 writes 100 and CPU 2 writes 200 to memory word x; CPU 3 and CPU 4 each read x twice.]
(b)         (c)         (d)
W100        W100        W200
R3 = 100    W200        R4 = 200
W200        R3 = 200    W100
R4 = 200    R3 = 200    R3 = 100
R3 = 200    R4 = 200    R4 = 100
R4 = 200    R4 = 200    R3 = 100
Figure 8-20. (a) Two CPUs writing and two CPUs reading a common memory word. (b) - (d) Three possible ways the two writes and four reads might be interleaved in time.
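One way to check that an interleaving such as those in panels (b) to (d) of Fig. 8-20 is legal is to replay it against a single memory word and confirm that every read returns the most recent write. The sketch below does that; the event encoding and the function name are assumptions made for illustration, and the three orderings are transcribed from the figure's panels.

```python
def replay(events):
    """Replay (op, value) events on one memory word; True if every read sees the latest write."""
    x = None  # contents of the shared word before any write
    for op, value in events:
        if op == "W":
            x = value          # a write replaces the word
        elif x != value:       # "R3" or "R4": the read must return the current contents
            return False
    return True


ORDER_B = [("W", 100), ("R3", 100), ("W", 200), ("R4", 200), ("R3", 200), ("R4", 200)]
ORDER_C = [("W", 100), ("W", 200), ("R3", 200), ("R3", 200), ("R4", 200), ("R4", 200)]
ORDER_D = [("W", 200), ("R4", 200), ("W", 100), ("R3", 100), ("R4", 100), ("R3", 100)]

if __name__ == "__main__":
    for name, order in (("b", ORDER_B), ("c", ORDER_C), ("d", ORDER_D)):
        print(name, replay(order))
```

An ordering in which a CPU reads 100 after 200 has already overwritten it would fail this check, since no single memory word could have produced it.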
[Diagram: timelines for CPU A, CPU B, and CPU C along a time axis, showing writes labeled 1A to 1F, 2A to 2D, and 3A to 3C, with synchronization points marked.]
Figure 8-21. Weakly consistent memory uses synchronization operations to divide time into sequential epochs.
[Diagram: three panels, (a) to (c), each with CPUs and a shared memory on a common bus; labels: CPU, M (private memory), Cache, Shared memory, Bus.]
Figure 8-22. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
Action       Local request               Remote request
Read miss    Fetch data from memory
Read hit     Use data from local cache
Write miss   Update data in memory
Write hit    Update cache and memory     Invalidate cache entry
Figure 8-23. The write through cache coherence protocol. The empty boxes indicate that no action is taken.
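The actions in Fig. 8-23 can be exercised with a small simulation. The sketch below assumes that memory is a shared dictionary, that each cache snoops every write made by its peers, and that a remote write to a cached address invalidates the local entry; the class and function names are hypothetical.

```python
class WriteThroughCache:
    """One CPU's cache in a snooping, write-through system (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # shared dict: address -> value
        self.lines = {}        # this CPU's cached copies
        self.peers = []        # the other caches on the bus

    def read(self, addr):
        if addr not in self.lines:                # read miss: fetch data from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                   # read hit: use data from local cache

    def write(self, addr, value):
        self.memory[addr] = value                 # write miss or hit: memory is always updated
        if addr in self.lines:                    # write hit: update the cache as well
            self.lines[addr] = value
        for peer in self.peers:                   # peers snoop the write on the bus
            peer.lines.pop(addr, None)            # and invalidate their entry if they hold one


def make_system(num_cpus, memory):
    """Create num_cpus caches over one shared memory and connect them as peers."""
    caches = [WriteThroughCache(memory) for _ in range(num_cpus)]
    for cache in caches:
        cache.peers = [p for p in caches if p is not cache]
    return caches
```

Because every write goes to memory, a cache that has been invalidated simply misses on its next read and picks up the current value from memory.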
[Diagram: CPU 1, CPU 2, and CPU 3, each with a cache, share a bus with memory. The state of block A after each step:]
(a) CPU 1 reads block A:   Exclusive in CPU 1's cache
(b) CPU 2 reads block A:   Shared in the caches of CPU 1 and CPU 2
(c) CPU 2 writes block A:  Modified in CPU 2's cache
(d) CPU 3 reads block A:   Shared in the caches of CPU 2 and CPU 3
(e) CPU 2 writes block A:  Modified in CPU 2's cache
(f) CPU 1 writes block A:  Modified in CPU 1's cache
Figure 8-24. The MESI cache coherence protocol.
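The six steps of Fig. 8-24 can be replayed with a small state tracker for a single block. This is a simplified sketch: it follows only the per-cache MESI state (M, E, S, or I) and ignores the data transfers and write-backs a real protocol performs; the function names are illustrative.

```python
def mesi_read(states, cpu):
    """CPU reads the block; `states` maps CPU name -> 'M', 'E', 'S', or 'I'."""
    if states[cpu] != "I":
        return  # read hit: no state change
    holders = [c for c in states if c != cpu and states[c] != "I"]
    for c in holders:
        states[c] = "S"  # a Modified or Exclusive copy elsewhere drops to Shared
    states[cpu] = "S" if holders else "E"


def mesi_write(states, cpu):
    """CPU writes the block: every other copy is invalidated, the writer holds it Modified."""
    for c in states:
        if c != cpu:
            states[c] = "I"
    states[cpu] = "M"


def replay_figure():
    """Apply the figure's six steps in order and record the states after each one."""
    states = {"CPU 1": "I", "CPU 2": "I", "CPU 3": "I"}
    steps = [(mesi_read, "CPU 1"), (mesi_read, "CPU 2"), (mesi_write, "CPU 2"),
             (mesi_read, "CPU 3"), (mesi_write, "CPU 2"), (mesi_write, "CPU 1")]
    trace = []
    for action, cpu in steps:
        action(states, cpu)
        trace.append(dict(states))
    return trace


if __name__ == "__main__":
    for label, snapshot in zip("abcdef", replay_figure()):
        print(label, snapshot)
```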