8 PARALLEL COMPUTER ARCHITECTURES
[Figure: (a) sixteen CPUs (P) connected to a shared memory. (b) The image divided into 16 sections.]
Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.
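The shared-memory partitioning of Fig. 8-1 can be sketched with threads that all read one shared bit-map, each working on its own section. This is a minimal sketch under stated assumptions: the 4 x 4 grid, the image dimensions being divisible by 4, and the "analysis" (counting set pixels) are hypothetical stand-ins, and `parallel_analyze` and `analyze_section` are illustrative names. In CPython, threads share memory but do not execute Python bytecode in parallel, so the sketch shows the structure of the computation, not its speedup.

```python
import threading

# Hypothetical parameter: a GRID x GRID partition, 4 x 4 = 16 sections as in Fig. 8-1.
GRID = 4


def analyze_section(image, row0, col0, h, w):
    """Stand-in 'analysis' (an assumption): count the set pixels in one section."""
    return sum(image[r][c]
               for r in range(row0, row0 + h)
               for c in range(col0, col0 + w))


def parallel_analyze(image):
    """Give each of GRID*GRID threads one section of a single shared bit-map."""
    rows, cols = len(image), len(image[0])
    if rows % GRID or cols % GRID:
        raise ValueError("this sketch assumes the image splits evenly")
    h, w = rows // GRID, cols // GRID
    # One result slot per thread, so no two threads ever write the same slot.
    results = [0] * (GRID * GRID)

    def worker(k):
        i, j = divmod(k, GRID)  # section k sits at grid row i, column j
        results[k] = analyze_section(image, i * h, j * w, h, w)

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(GRID * GRID)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every section has been analyzed
    return results


if __name__ == "__main__":
    checkerboard = [[(r + c) % 2 for c in range(8)] for r in range(8)]
    counts = parallel_analyze(checkerboard)
    print(counts, sum(counts))
```

Because every thread reads the same `image` object, no data has to be copied or sent; on a multicomputer (Fig. 8-2) each section would instead have to be shipped to the private memory of the CPU that analyzes it.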
[Figure: (a) sixteen CPUs (P), each with a private memory (M), connected by a message-passing interconnection network. (b) The same network, with the image sections held in the private memories.]
Figure 8-2. (a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. 8-1 split up among the 16 memories.
[Figure: in each panel, Machine 1 and Machine 2 each consist of the layers Application, Language run-time system, Operating system, and Hardware, with a shared memory spanning both machines at the layer named in the caption.]
Figure 8-3. Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language run-time system.
Figure 8-4. Various topologies. The heavy dots represent
switches. The CPUs and memories are not shown. (a) A star.
(b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid.
(f) A double torus. (g) A cube. (h) A 4D hypercube.
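The hypercube of Fig. 8-4(h) is easy to describe in code if each of its 2^4 = 16 switches is given a 4-bit number; two switches are then linked exactly when their numbers differ in one bit. The sketch below assumes this numbering (a common convention, not something shown in the figure), and the function names are illustrative.

```python
def hypercube_neighbors(node, dim):
    """Switches linked to `node` in a dim-dimensional hypercube: flip one address bit."""
    return [node ^ (1 << b) for b in range(dim)]


def hops(a, b):
    """Minimum number of links between two switches = number of differing address bits."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    DIM = 4                      # the 4D hypercube of Fig. 8-4(h)
    n = 2 ** DIM                 # 16 switches
    # Each switch has DIM links; every link is counted from both ends.
    links = sum(len(hypercube_neighbors(v, DIM)) for v in range(n)) // 2
    diameter = max(hops(a, b) for a in range(n) for b in range(n))
    print(n, links, diameter)
    print(hypercube_neighbors(0b0000, DIM))
```

For the 4D case this gives 16 switches, 32 links, and a worst-case distance of 4 hops, so doubling the number of switches (going to 5D) adds only one link per switch and one hop to the longest path.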
[Figure: CPU 1 and CPU 2 attached to a square grid of four-port switches A, B, C, and D; a packet, marked front, middle, and end, passes through the switches' input and output ports.]
Figure 8-5. An interconnection network in the form of a four-switch square grid. Only two of the CPUs are shown.
[Figure: panels (a), (b), and (c) show an entire packet held at successive stages between CPU 1 and CPU 2 in a grid of four-port switches A, B, C, and D, with input and output ports marked.]
Figure 8-6. Store-and-forward packet switching.
[Figure: CPU 1 through CPU 4 attached to four-port switches A, B, C, and D, with input ports and output buffers marked.]
Figure 8-7. Deadlock in a circuit-switched interconnection network.
[Plot: speedup (0 to 60) versus number of CPUs (0 to 60) for the N-body problem, Awari, and skyline matrix inversion, with linear speedup shown for comparison.]
Figure 8-8. Real programs achieve less than the perfect speedup indicated by the dotted line.
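[Figure: (a) a program with run time T consists of an inherently sequential part, fraction f, during which 1 CPU is active, and a potentially parallelizable part, fraction 1 – f, during which n CPUs are active. (b) The sequential part still takes fT, while the parallelizable part takes (1 – f)T/n.]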
Figure 8-9. (a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.
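The times in Fig. 8-9 lead directly to a formula for speedup. One CPU needs $T$; $n$ CPUs need $fT + (1 - f)T/n$. Dividing, $T$ cancels and the speedup is $n / (1 + (n - 1)f)$, a result known as Amdahl's law. The sketch below assumes, as the figure does, that the parallelizable part divides perfectly among the $n$ CPUs with no communication overhead; `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(f, n):
    """Speedup on n CPUs when a fraction f of the one-CPU run time T is sequential.

    One CPU takes T; n CPUs take f*T + (1 - f)*T/n. T cancels in the ratio,
    leaving n / (1 + (n - 1)*f).
    """
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return n / (1 + (n - 1) * f)


if __name__ == "__main__":
    for f in (0.0, 0.01, 0.1):
        print(f, [round(amdahl_speedup(f, n), 1) for n in (1, 16, 64)])
```

Even a 10% sequential part (f = 0.1) holds 64 CPUs to a speedup of 64/7.3, under 9, and no number of CPUs can push the speedup past 1/f = 10. This is one reason the curves in Fig. 8-8 fall below the linear line.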
Figure 8-10. (a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
[Figure: four panels of processes (P1 to P9) illustrating the paradigms, with synchronization points in the phased computation and a work queue feeding the replicated workers.]
Figure 8-11. Computational paradigms. (a) Pipeline. (b)
Phased computation. (c) Divide and conquer. (d) Replicated
worker.
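The replicated-worker paradigm of Fig. 8-11(d) can be sketched as a pool of identical threads that repeatedly take the next task from a shared work queue until it is empty. This is a minimal sketch under assumptions: the task list is fixed before the workers start (if workers could add new tasks, an empty queue would no longer mean the job is finished), tasks are hashable, and `replicated_workers` and its parameters are hypothetical names.

```python
import queue
import threading


def replicated_workers(tasks, work, num_workers=4):
    """Run work(t) for every task t using identical workers and one shared queue.

    Assumes the task set is complete before the workers start.
    """
    todo = queue.Queue()
    for t in tasks:
        todo.put(t)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = todo.get_nowait()   # take the next task, if any remain
            except queue.Empty:
                return                  # queue drained: this worker stops
            r = work(t)
            with lock:                  # serialize updates to the shared results
                results[t] = r

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


if __name__ == "__main__":
    print(replicated_workers(range(10), lambda x: x * x))
```

Because each worker takes a new task only when it finishes the previous one, the load balances itself: a worker that draws short tasks simply processes more of them.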
Physical (hardware)   Logical (software)   Examples
Multiprocessor        Shared variables     Image processing as in Fig. 8-1
Multiprocessor        Message passing      Message passing simulated with buffers in memory
Multicomputer         Shared variables     DSM, Linda, Orca, etc. on an SP/2 or a PC network
Multicomputer         Message passing      PVM or MPI on an SP/2 or a network of PCs
Figure 8-12. Combinations of physical and logical sharing.
Instruction streams   Data streams   Name   Examples
1                     1              SISD   Classical von Neumann machine
1                     Multiple       SIMD   Vector supercomputer, array processor
Multiple              1              MISD   Arguably none
Multiple              Multiple       MIMD   Multiprocessor, multicomputer
Figure 8-13. Flynn’s taxonomy of parallel computers.
Parallel computer architectures
    SISD (von Neumann)
    SIMD
        Vector processor
        Array processor
    MISD
        ?
    MIMD
        Multiprocessors (shared memory)
            UMA
                Bus
                Switched
            COMA
            NUMA
                CC-NUMA
                NC-NUMA
        Multicomputers (message passing)
            MPP
                Grid
                Hypercube
            COW
Figure 8-14. A taxonomy of parallel computers.
[Figure: input vectors feeding a vector ALU.]
Figure 8-15. A vector ALU.
Operation                     Examples
$A_i = f_1(B_i)$              $f_1$ = cosine, square root
Scalar $= f_2(A)$             $f_2$ = sum, minimum
$A_i = f_3(B_i, C_i)$         $f_3$ = add, subtract
$A_i = f_4(scalar, B_i)$      $f_4$ = multiply $B_i$ by a constant
Figure 8-16. Various combinations of vector and scalar operations.
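The four operation types of Fig. 8-16 can be written out in Python to pin down their semantics: $f_1$ maps each element of one vector, $f_2$ reduces a vector to a scalar, $f_3$ combines two vectors element by element, and $f_4$ combines a scalar with each element. The names `f1` to `f4` follow the figure, but their signatures are assumptions made for this sketch. The list comprehensions run one element at a time; a vector ALU would instead apply the operation to all element positions in hardware, so this shows what is computed, not how fast.

```python
import math
import operator


def f1(B, op):
    """A_i = f1(B_i): apply a unary operation to every element."""
    return [op(b) for b in B]


def f2(A, op):
    """scalar = f2(A): reduce a whole vector to a single value."""
    return op(A)


def f3(B, C, op):
    """A_i = f3(B_i, C_i): combine two equal-length vectors element by element."""
    if len(B) != len(C):
        raise ValueError("vectors must have equal length")
    return [op(b, c) for b, c in zip(B, C)]


def f4(s, B, op):
    """A_i = f4(s, B_i): combine a scalar with every element."""
    return [op(s, b) for b in B]


if __name__ == "__main__":
    B, C = [1.0, 4.0, 9.0], [1.0, 2.0, 3.0]
    print(f1(B, math.sqrt))             # square root of each element
    print(f2(B, sum), f2(B, min))       # sum and minimum
    print(f3(B, C, operator.sub))       # element-wise subtraction
    print(f4(2.0, B, operator.mul))     # multiply B_i by a constant
```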