7.1 Gaussian Elimination
can be used to solve several linear systems with the same matrix A and different
right-hand side vectors b without repeating the elimination process.
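As an illustration (a minimal sketch, not code from the book): once the factorization $A = LU$ is available, each additional right-hand side costs only two triangular solves with $O(n^2)$ operations instead of a renewed $O(n^3)$ elimination. The sketch assumes C99 and the usual convention that L is unit lower triangular.

#include <stddef.h>

/* Solve A x = b via the precomputed factors: first L y = b by forward
 * substitution, then U x = y by backward substitution. */
void lu_solve(size_t n, double L[n][n], double U[n][n],
              const double b[n], double x[n], double y[n]) {
    for (size_t i = 0; i < n; i++) {        /* forward substitution */
        y[i] = b[i];
        for (size_t j = 0; j < i; j++)
            y[i] -= L[i][j] * y[j];         /* L has unit diagonal  */
    }
    for (size_t i = n; i-- > 0; ) {         /* backward substitution */
        x[i] = y[i];
        for (size_t j = i + 1; j < n; j++)
            x[i] -= U[i][j] * x[j];
        x[i] /= U[i][i];
    }
}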
7.1.1.2 Pivoting
Forward elimination and LU decomposition require the division by $a_{kk}^{(k)}$, and so these methods can only be applied when $a_{kk}^{(k)} \neq 0$. That is, even if $\det A \neq 0$ and the system $Ax = y$ is solvable, there does not need to exist a decomposition $A = LU$ when $a_{kk}^{(k)}$ is a zero element. However, for a solvable linear system, there exists a matrix resulting from permutations of rows of A for which an LU decomposition is possible, i.e., $BA = LU$ with a permutation matrix B describing the permutation of rows of A. The permutation of rows of A, if necessary, is included in the elimination process. In each elimination step, a pivot element is determined to substitute $a_{kk}^{(k)}$. A pivot element is needed when $a_{kk}^{(k)} = 0$, and also when $a_{kk}^{(k)}$ is very small, since this would induce a very large elimination factor, leading to imprecise computations. Pivoting strategies are used to find an appropriate pivot element. Typical strategies are column pivoting, row pivoting, and total pivoting.
Column pivoting considers the elements $a_{kk}^{(k)}, \ldots, a_{nk}^{(k)}$ of column k and determines the element $a_{rk}^{(k)}$, $k \le r \le n$, with the maximum absolute value. If $r \neq k$, the rows r and k of matrix $A^{(k)}$ and the values $b_k^{(k)}$ and $b_r^{(k)}$ of the vector $b^{(k)}$ are exchanged. Row pivoting determines a pivot element $a_{kr}^{(k)}$, $k \le r \le n$, within the elements $a_{kk}^{(k)}, \ldots, a_{kn}^{(k)}$ of row k of matrix $A^{(k)}$ with the maximum absolute value. If $r \neq k$, the columns k and r of $A^{(k)}$ are exchanged. This corresponds to an exchange of the enumeration of the unknowns $x_k$ and $x_r$ of vector x. Total pivoting determines the element with the maximum absolute value in the submatrix $\tilde{A}^{(k)} = (a_{ij}^{(k)})$, $k \le i, j \le n$, and exchanges columns and rows of $A^{(k)}$ depending on whether $i \neq k$ and $j \neq k$. In practice, row or column pivoting is used instead of total pivoting, since they have smaller computation time, and total pivoting may also destroy special matrix structures like banded structures.
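For column pivoting, the pivot search itself is a simple maximum scan over the lower part of column k. The following sketch (0-based indices; the function name is chosen here, not taken from the book) illustrates it:

#include <math.h>
#include <stddef.h>

/* Return the row index r, k <= r < n, of the element of column k with
 * the maximum absolute value in the current matrix a. */
size_t find_column_pivot(size_t n, double a[n][n], size_t k) {
    size_t r = k;
    for (size_t i = k + 1; i < n; i++)
        if (fabs(a[i][k]) > fabs(a[r][k]))
            r = i;
    return r;   /* rows r and k are exchanged afterwards if r != k */
}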
The implementation of pivoting avoids the actual exchange of rows or columns in memory and uses index vectors pointing to the current rows of the matrix. The indexed access to matrix elements is more expensive, but in total the indexed access is usually less expensive than moving entire rows in each elimination step. When supported by the programming language, a dynamic data storage in the form of separate vectors for the rows of the matrix, accessed through a vector pointing to the rows, may lead to more efficient implementations. The advantage is that matrix elements can still be accessed with a two-dimensional index expression, but the exchange of rows corresponds to a simple exchange of pointers.
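A minimal sketch of the index-vector technique (names and sizes are illustrative, not from the book): the vector perm maps logical to physical row numbers, so a row exchange touches two indices instead of 2n matrix elements.

#include <stddef.h>

enum { N = 8 };                  /* illustrative system size */
static double a[N][N];
static size_t perm[N];           /* perm[i]: physical row of logical row i */

/* Indexed access: logical element (i,j) lives in physical row perm[i]. */
#define A(i, j) (a[perm[(i)]][(j)])

static void init_perm(void) {
    for (size_t i = 0; i < N; i++)
        perm[i] = i;             /* identity permutation initially */
}

static void swap_rows(size_t r, size_t k) {
    size_t tmp = perm[r];        /* exchange of two indices only */
    perm[r] = perm[k];
    perm[k] = tmp;
}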
7.1.2 Parallel Row-Cyclic Implementation
A parallel implementation of the Gaussian elimination is based on a data distribution of matrix A and of the sequence of matrices $A^{(k)}$, $k = 2, \ldots, n$, which can be a row-oriented, a column-oriented, or a checkerboard distribution, see Sect. 3.4. In this section, we consider a row-oriented distribution.
From the structure of the matrices $A^{(k)}$ it can be seen that a blockwise row-oriented data distribution is not suitable because of load imbalances: For a blockwise row-oriented distribution, processor $P_q$, $1 \le q \le p$, owns the rows $(q-1) \cdot n/p + 1, \ldots, q \cdot n/p$, so that after the computation of $A^{(k)}$ with $k = q \cdot n/p + 1$ there is no computation left for this processor and it becomes idle. For a row-cyclic distribution, there is a better load balance, since processor $P_q$, $1 \le q \le p$, owns the rows $q, q+p, q+2p, \ldots$, i.e., it owns all rows i with $1 \le i \le n$ and $q = ((i-1) \bmod p) + 1$. The processors begin to become idle only after the first $n - p$ stages, which is acceptable for $p \ll n$. Thus, we consider a parallel implementation of the Gaussian elimination with a row-cyclic distribution of matrix A and column-oriented pivoting. One step of the forward elimination computing $A^{(k+1)}$ and $b^{(k+1)}$ for given $A^{(k)}$ and $b^{(k)}$ performs the following computation and communication phases:
1. Determination of the local pivot element: Each processor considers its local elements of column k in the rows $k, \ldots, n$ and determines the element (and its position) with the largest absolute value.
2. Determination of the global pivot element: The global pivot element is the
local pivot element which has the largest absolute value. A single-accumulation
operation with the maximum operation as reduction determines this global pivot
element. The root processor of this global communication operation sends the
result to all other processors.
3. Exchange of the pivot row: If k =r for a pivot element a
(k)
rk
,therowk owned by
processor P
q
and the pivot row r owned by processor P
q

have to be exchanged.

When q = q

, the exchange can be done locally by processor P
q
. When q = q

,
then communication with single transfer operations is required. The elements b
k
and b
r
are exchanged accordingly.
4. Distribution of the pivot row: Since the pivot row (now row k) is required by all processors for the local elimination operations, processor $P_q$ sends the elements $a_{kk}^{(k)}, \ldots, a_{kn}^{(k)}$ of row k and the element $b_k^{(k)}$ to all other processors.
5. Computation of the elimination factors: Each processor locally computes the elimination factors $l_{ik}$ for the rows i that it owns, according to Formula (7.2).

6. Computation of the matrix elements: Each processor locally computes the elements of $A^{(k+1)}$ and $b^{(k+1)}$ using its elements of $A^{(k)}$ and $b^{(k)}$ according to Formulas (7.3) and (7.4).
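Formulas (7.2)–(7.4) are not reproduced in this excerpt; they are the standard update rules of Gaussian elimination, $l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$, $a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik} a_{kj}^{(k)}$, and $b_i^{(k+1)} = b_i^{(k)} - l_{ik} b_k^{(k)}$. The following sketch of phases 5 and 6 uses 0-based indices and assumes that processor me (numbered from 0) owns row i iff i mod p = me; the function name and signature are illustrative, not the book's code.

void eliminate_local(int n, int k, int me, int p,
                     double a[n][n], double b[n],
                     const double pivot_row[n], double pivot_b) {
    for (int i = k + 1; i < n; i++) {
        if (i % p != me)
            continue;                        /* row owned by another processor */
        double l = a[i][k] / pivot_row[k];   /* elimination factor, Formula (7.2) */
        a[i][k] = l;                         /* keep l_ik in place (optional) */
        for (int j = k + 1; j < n; j++)
            a[i][j] -= l * pivot_row[j];     /* Formula (7.3) */
        b[i] -= l * pivot_b;                 /* Formula (7.4) */
    }
}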
The computation of the solution vector x in the backward substitution is inherently sequential, since the values $x_k$, $k = n, \ldots, 1$, depend on each other and are computed one after another. In step k, the processor $P_q$ owning row k computes the value $x_k$ according to Formula (7.5) and sends the value to all other processors by a single-broadcast operation.
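A minimal sketch of this backward substitution (0-based indices; processor k mod p, numbered from 0, is assumed to own row k, and the full arrays are assumed allocated everywhere with only the owned rows valid):

#include <mpi.h>

void back_substitute(int n, int p, int me,
                     double a[n][n], double b[n], double x[n]) {
    for (int k = n - 1; k >= 0; k--) {
        if (k % p == me) {                   /* owner of row k computes x_k */
            double sum = 0.0;
            for (int j = k + 1; j < n; j++)
                sum += a[k][j] * x[j];
            x[k] = (b[k] - sum) / a[k][k];   /* corresponds to Formula (7.5) */
        }
        MPI_Bcast(&x[k], 1, MPI_DOUBLE,      /* single-broadcast of x_k */
                  k % p, MPI_COMM_WORLD);
    }
}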
A program fragment implementing the computation phases 1–6 and the backward substitution is given in Fig. 7.2. The matrix A and the vector b are stored in a two-dimensional array a and a one-dimensional array b, respectively. Some of the local functions have already been introduced in the program in Fig. 7.1. The SPMD program uses the variable me to store the individual processor number. This processor number, the current value of k, and the pivot row are used to distinguish between the different computations of the single processors. The global variables n and p are the system size and the number of processors executing the parallel program.

Fig. 7.2 Program fragment with C notation and MPI operations for the Gaussian elimination with row-cyclic distribution

The parallel algorithm is implemented in the program in Fig. 7.2 in the following way:
1. Determination of the local pivot element: The function max_col_loc(a,k) determines the row index r of the element a[r][k] with the largest local absolute value in column k for the rows ≥ k. When a processor has no element of column k for rows ≥ k, the function returns −1.
2. Determination of the global pivot element: The global pivoting is performed by an MPI_Allreduce() operation, implementing a single-accumulation with a subsequent single-broadcast. The MPI reduction operation MPI_MAXLOC for the data type MPI_DOUBLE_INT, consisting of one double value and one integer value, is used. The MPI operations have been introduced in Sect. 5.2. The MPI_Allreduce() operation returns y with the pivot element in y.val and the processor owning the corresponding row in y.node. Thus, after this step all processors know the global pivot element and its owner for possible communication.
3. Exchange of the pivot row: Two cases are considered:
• If the owner of the pivot row is the processor also owning row k (i.e., k%p == y.node), the rows k and r are exchanged locally by this processor for r ≠ k. Row k is now the pivot row. The function copy_row(a,b,k,buf) copies the pivot row into the buffer buf, which is used for further communication.
• If different processors own the row k and the pivot row r, row k is sent to the processor y.node owning the pivot row with MPI_Send and MPI_Recv operations. Before the send operation, the function copy_row(a,b,k,buf) copies row k of array a and element k of array b into a common buffer buf so that only one communication operation needs to be applied. After the communication, the processor y.node finalizes its exchange with the pivot row. The function copy_exchange_row(a,b,r,buf,k) exchanges the row r (still the pivot row) and the buffer buf. The appropriate row index r is known from the former local determination of the pivot row. Now the former row k is the row r, and the buffer buf contains the pivot row.
Thus, in both cases the pivot row is stored in buffer buf.
4. Distribution of the pivot row: Processor y.node sends the buffer buf to all other processors by an MPI_Bcast() operation. For the case of the pivot row being owned by a different processor than the owner of row k, the content of buf is copied into row k by this processor using copy_back_row().
5. and 6. Computation of the elimination factors and the matrix elements: The computation of the elimination factors and of the new arrays a and b is done in parallel. Processor $P_q$ starts this computation with the first row i > k with i mod p = q.
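Since the program of Fig. 7.2 appears in the book only as a figure and is not reproduced in this excerpt, the following condensed sketch shows how the six phases can map onto MPI calls. It uses 0-based indices, declares the helper functions described above only as prototypes (their bodies are in Figs. 7.1 and 7.2), and is a plausible reconstruction rather than the book's exact code; buf is assumed to hold n+1 doubles (one row plus the b element).

#include <math.h>
#include <mpi.h>

int  max_col_loc(double **a, int k);
void exchange_row(double **a, double *b, int r, int k);
void copy_row(double **a, double *b, int k, double *buf);
void copy_exchange_row(double **a, double *b, int r, double *buf, int k);
void copy_back_row(double **a, double *b, int k, double *buf);

void elimination_step(double **a, double *b, double *buf,
                      int n, int p, int me, int k) {
    struct { double val; int node; } z, y;
    int r = max_col_loc(a, k);                 /* phase 1: local pivot row */
    z.val  = (r >= 0) ? fabs(a[r][k]) : -1.0;  /* -1.0: no local candidate */
    z.node = me;
    MPI_Allreduce(&z, &y, 1, MPI_DOUBLE_INT,   /* phase 2: global pivot    */
                  MPI_MAXLOC, MPI_COMM_WORLD);
    if (k % p == y.node) {                     /* pivot owner also owns row k */
        if (me == y.node) {
            if (r != k) exchange_row(a, b, r, k);
            copy_row(a, b, k, buf);            /* pivot row -> buf         */
        }
    } else {                                   /* phase 3: remote exchange */
        if (me == k % p) {
            copy_row(a, b, k, buf);            /* send row k away          */
            MPI_Send(buf, n + 1, MPI_DOUBLE, y.node, 0, MPI_COMM_WORLD);
        } else if (me == y.node) {
            MPI_Recv(buf, n + 1, MPI_DOUBLE, k % p, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            copy_exchange_row(a, b, r, buf, k); /* buf <-> local pivot row */
        }
    }
    MPI_Bcast(buf, n + 1, MPI_DOUBLE,          /* phase 4: pivot row to all */
              y.node, MPI_COMM_WORLD);
    if (me == k % p && me != y.node)
        copy_back_row(a, b, k, buf);           /* row k := pivot row       */
    for (int i = k + 1; i < n; i++) {          /* phases 5 and 6           */
        if (i % p != me) continue;
        double l = a[i][k] / buf[k];           /* Formula (7.2)            */
        for (int j = k + 1; j < n; j++)
            a[i][j] -= l * buf[j];             /* Formula (7.3)            */
        b[i] -= l * buf[n];                    /* Formula (7.4)            */
    }
}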
For a row-cyclic implementation of the Gaussian elimination, an alternative way of storing array a and vector b can be used. The alternative data structure consists of a one-dimensional array of pointers and n one-dimensional arrays of length n + 1, each containing one row of a and the corresponding element of b. The entries in the pointer array point to the row arrays. This storage scheme not only facilitates the exchange of rows but is also convenient for a distributed storage. For a distributed memory, each processor $P_q$ stores the entire array of pointers but only the rows i with i mod p = q; all other pointers are NULL pointers. Figure 7.3 illustrates this storage scheme for n = 8.

Fig. 7.3 Data structure for the Gaussian elimination with n = 8 and p = 4, showing the rows stored by processor $P_1$. Each row stores n + 1 elements consisting of one row of the matrix a and the corresponding element of b

The advantage of storing an element of b together with a is that the copy operation into a common buffer can be avoided. Also, the computation of the new values for a and b now requires only one loop with n + 1 iterations. This implementation variant is not shown in Fig. 7.2.
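A minimal sketch of this storage scheme (function names are illustrative): processor q allocates only its own rows, each with n+1 entries where entry n holds the b element; all other pointers stay NULL, and a row exchange is a pointer swap.

#include <stdlib.h>

double **alloc_rows(int n, int p, int q) {
    double **row = malloc(n * sizeof *row);
    for (int i = 0; i < n; i++)                  /* 0-based rows */
        row[i] = (i % p == q)
               ? malloc((n + 1) * sizeof **row)  /* row i plus b[i] in entry n */
               : NULL;                           /* rows of other processors  */
    return row;
}

void swap_row_ptrs(double **row, int r, int k) {
    double *tmp = row[r];                        /* exchange of pointers only */
    row[r] = row[k];
    row[k] = tmp;
}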
7.1.3 Parallel Implementation with Checkerboard Distribution
A parallel implementation using a block-cyclic checkerboard distribution for matrix A can be described with the parameterized data distribution introduced in Sect. 3.4. The parameterized data distribution is given by a distribution vector

$((p_1, b_1), (p_2, b_2))$   (7.7)

with a $p_1 \times p_2$ virtual processor mesh with $p_1$ rows, $p_2$ columns, and $p_1 \cdot p_2 = p$ processors. The numbers $b_1$ and $b_2$ are the sizes of a block of data with $b_1$ rows and $b_2$ columns. The function $G : P \to \mathbb{N}^2$ maps each processor to a unique position in the processor mesh. This leads to the definition of $p_1$ row groups

$R_q = \{Q \in P \mid G(Q) = (q, \cdot)\}$

with $|R_q| = p_2$ for $1 \le q \le p_1$ and $p_2$ column groups

$C_q = \{Q \in P \mid G(Q) = (\cdot, q)\}$

with $|C_q| = p_1$ for $1 \le q \le p_2$. The row groups as well as the column groups are a partition of the entire set of processors, i.e.,

$\bigcup_{q=1}^{p_1} R_q = \bigcup_{q=1}^{p_2} C_q = P$

and $R_q \cap R_{q'} = \emptyset = C_q \cap C_{q'}$ for $q \neq q'$. Row i of the matrix A is distributed across the local memories of the processors of only one row group, denoted Ro(i) in the following. This is the row group $R_k$ with $k = \left( \lfloor (i-1)/b_1 \rfloor \bmod p_1 \right) + 1$. Analogously, column j is distributed within one column group, denoted as Co(j), which is the column group $C_k$ with $k = \left( \lfloor (j-1)/b_2 \rfloor \bmod p_2 \right) + 1$.
Example For a matrix of size 12 × 12 (i.e., n = 12), p = 4 processors $\{P_1, P_2, P_3, P_4\}$, and distribution vector $((p_1, b_1), (p_2, b_2)) = ((2, 2), (2, 3))$, the virtual processor mesh has size 2 × 2 and the data blocks have size 2 × 3. There are two row groups and two column groups:

$R_1 = \{Q \in P \mid G(Q) = (1, j), j = 1, 2\}$,
$R_2 = \{Q \in P \mid G(Q) = (2, j), j = 1, 2\}$,
$C_1 = \{Q \in P \mid G(Q) = (j, 1), j = 1, 2\}$,
$C_2 = \{Q \in P \mid G(Q) = (j, 2), j = 1, 2\}$.

The distribution of matrix A is shown in Fig. 7.4. It can be seen that row 5 is distributed in row group $R_1$ and that column 7 is distributed in column group $C_1$.
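The group membership formulas translate directly into code. A small sketch (1-based indices as in the text; function names chosen here, not from the book) that reproduces the example:

/* Row group Ro(i) and column group Co(j) for distribution ((p1,b1),(p2,b2)). */
int row_group(int i, int b1, int p1) { return ((i - 1) / b1) % p1 + 1; }
int col_group(int j, int b2, int p2) { return ((j - 1) / b2) % p2 + 1; }

/* For the example n = 12, ((2,2),(2,3)):
 *   row_group(5, 2, 2) == 1   -> row 5 lies in row group R_1
 *   col_group(7, 3, 2) == 1   -> column 7 lies in column group C_1   */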
Using a checkerboard distribution with distribution vector (7.7), the computation of $A^{(k)}$ has the following implementation, which has a different communication pattern than the previous implementation. Figure 7.5 illustrates the communication and computation phases of the Gaussian elimination with checkerboard distribution.
Fig. 7.4 Illustration of a checkerboard distribution for a 12 × 12 matrix. The tuples denote the positions in the processor mesh of the processors owning the data blocks
Fig. 7.5 Computation phases of the Gaussian elimination with checkerboard distribution: (1) determination of the local pivot elements, (2) determination of the global pivot element, (3) exchange of the pivot row, (4) broadcast of the pivot row, (5) computation of the elimination factors, (5a) broadcast of the elimination factors, (6) computation of the matrix elements
1. Determination of the local pivot element: Since column k is distributed across the processors of column group Co(k), only these processors determine the element with the largest absolute value among their local elements of column k in rows ≥ k.
2. Determination of the global pivot element: The processors in group Co(k) perform a single-accumulation operation within this group, for which each processor in the group provides its local pivot element from phase 1. The reduction operation is the maximum operation, which also determines the index of the pivot row (and not the number of the owning processor as before). The root processor of the single-accumulation operation is the processor owning the element $a_{kk}^{(k)}$. After the single-accumulation, the root processor knows the pivot element $a_{rk}^{(k)}$ and its row index r. This information is sent to all other processors.
3. Exchange of the pivot row: The pivot row r containing the pivot element $a_{rk}^{(k)}$ is distributed across row group Ro(r). Row k is distributed across the row group Ro(k), which may be different from Ro(r). If Ro(r) = Ro(k), the processors of Ro(k) exchange the elements of the rows k and r locally within the columns they own. If Ro(r) ≠ Ro(k), each processor in Ro(k) sends its part of row k to the corresponding processor in Ro(r); this is the unique processor which belongs to the same column group.
4. Distribution of the pivot row: The pivot row is needed for the recalculation of
matrix A, but each processor needs only those elements with column indices for
which it owns elements. Therefore, each processor in Ro(r) performs a group-
oriented single-broadcast operation within its column group sending its part of
the pivot row to the other processors.
5. Computation of the elimination factors: The processors of column group Co(k) locally compute the elimination factors $l_{ik}$ for their elements i of column k according to Formula (7.2).

5a. Distribution of the elimination factors: The elimination factors $l_{ik}$ are needed by all processors in the row group Ro(i). Since the elements of row i are distributed across the row group Ro(i), each processor of column group Co(k) performs a group-oriented single-broadcast operation in its row group Ro(i) to broadcast its elimination factors $l_{ik}$ within this row group.
6. Computation of the matrix elements: Each processor locally computes the elements of $A^{(k+1)}$ and $b^{(k+1)}$ using its elements of $A^{(k)}$ and $b^{(k)}$ according to Formulas (7.3) and (7.4).
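In an MPI realization, the row and column groups can be represented by communicators obtained from MPI_Comm_split, so that the group-oriented broadcast and accumulation operations of phases 2, 4, and 5a become ordinary collective calls. A sketch, assuming the (hypothetical) row-major mapping G(me) = (me / p2, me % p2):

#include <mpi.h>

void build_mesh_comms(int p2, MPI_Comm *row_comm, MPI_Comm *col_comm) {
    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    int mrow = me / p2, mcol = me % p2;   /* position in the p1 x p2 mesh */
    MPI_Comm_split(MPI_COMM_WORLD, mrow, mcol, row_comm);  /* row group    */
    MPI_Comm_split(MPI_COMM_WORLD, mcol, mrow, col_comm);  /* column group */
}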
The backward substitution for computing the n elements of the result vector x is done in n consecutive steps, where each step consists of the following computations:
1. Each processor of the row group Ro(k) computes that part of the sum $\sum_{j=k+1}^{n} a_{kj}^{(n)} x_j$ which contains its local elements of row k.
2. The entire sum $\sum_{j=k+1}^{n} a_{kj}^{(n)} x_j$ is determined by the processors of row group Ro(k) by a group-oriented single-accumulation operation with the processor $P_q$ storing the element $a_{kk}^{(n)}$ as root. Addition is used as reduction operation.
3. Processor $P_q$ computes the value of $x_k$ according to Formula (7.5).
4. Processor $P_q$ sends the value of $x_k$ to all other processors by a single-broadcast operation.
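A sketch of one such step on the communicators from the previous sketch, executed by the processors of Ro(k); here owns_col(), a_loc, b_loc, akk_root (rank of the owner of $a_{kk}^{(n)}$ within row_comm), and owner_of_akk (its global rank) are assumed bookkeeping of the caller, not functions from the book:

double part = 0.0, sum;
for (int j = k + 1; j < n; j++)
    if (owns_col(j))                        /* local elements of row k      */
        part += a_loc[k][j] * x[j];         /* step 1: partial sum          */
MPI_Reduce(&part, &sum, 1, MPI_DOUBLE,      /* step 2: group accumulation   */
           MPI_SUM, akk_root, row_comm);
if (me == owner_of_akk)
    x[k] = (b_loc[k] - sum) / a_loc[k][k];  /* step 3: Formula (7.5)        */
MPI_Bcast(&x[k], 1, MPI_DOUBLE,             /* step 4: single-broadcast     */
          owner_of_akk, MPI_COMM_WORLD);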

A pseudocode for an SPMD program in C notation with MPI operations implementing the Gaussian elimination with checkerboard distribution of matrix A is given in Fig. 7.6. The computations correspond to those given in the pseudocode for the row-cyclic distribution in Fig. 7.2, but the pseudocode additionally uses several functions organizing the computations on the groups of processors. The functions Co(k) and Ro(k) denote the column or row group owning column k or row k, respectively. The function member(me,G) determines whether processor me belongs to group G. The function grp_leader() determines the first processor in a group. The functions Cop(q) and Rop(q) determine the column or row group, respectively, to which a processor q belongs. The function rank(q,G) returns the local processor number (rank) of a processor q in a group G.
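If each group is realized as an MPI communicator (MPI_COMM_NULL on non-members, e.g., from MPI_Comm_split with color MPI_UNDEFINED), two of these helpers map directly onto MPI calls; the variants below query the calling processor only and are an adaptation, not the book's signatures:

#include <mpi.h>

int member(MPI_Comm G) {        /* does the calling processor belong to G? */
    return G != MPI_COMM_NULL;
}

int rank_in(MPI_Comm G) {       /* local rank of the caller within G */
    int r;
    MPI_Comm_rank(G, &r);
    return r;
}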
1. Determination of the local pivot element: The determination of the local pivot element is performed only by the processors in column group Co(k).
2. Determination of the global pivot element: The global pivot element is again computed by an MPI_MAXLOC reduction operation, but in contrast to Fig. 7.2 the index of the row of the pivot element is calculated, and not the number of the processor owning the pivot element. The reason is that all processors which own a part of the pivot row need to know that some of their data belongs to the current pivot row; this information is used in further communication.
3. Exchange of the pivot row: For the exchange and distribution of the pivot row r, the cases Ro(k) == Ro(r) and Ro(k) != Ro(r) are distinguished.
• When the pivot row and the row k are stored by the same row group, each processor of this group exchanges its data elements of row k and row r locally using the function exchange_row_loc() and copies the elements of the pivot row (now row k) into the buffer buf using the function copy_row_loc(). Only the elements in column k or higher are considered.
• When the pivot row and the row k are stored by different row groups, communication is required for the exchange of the pivot row. The function compute_partner(Ro(r),me) computes the communication partner for the calling processor me, which is the processor q ∈ Ro(r) belonging to the same column group as me. The function compute_size(n,k,Ro(k)) computes the number of elements of the pivot row which are stored by the calling processor in columns greater than k; this number depends on the size of the row group Ro(k), the block size, and the position k. The same function is used later to determine the number of elimination factors to be communicated.
4. Distribution of the pivot row: For the distribution of the pivot row r, a processor takes part in a single-broadcast operation in its column group. The roots of the broadcast operations performed in parallel are the processors q ∈ Ro(r). The participants of a broadcast are the processors q′ ∈ Cop(q), either as root when q′ ∈ Ro(r) or as recipient otherwise.
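Hedged sketches of the two helpers named above; the book's actual implementations are part of Fig. 7.6 and may differ (0-based positions and a row-major mesh numbering are assumed here):

/* Partner of the caller: the processor of row group target_rg that lies
 * in the caller's column group mycol. */
int compute_partner(int target_rg, int mycol, int p2) {
    return target_rg * p2 + mycol;
}

/* Number of elements of a row stored locally in columns > k for a
 * block-cyclic column distribution with block size b2 over p2 members. */
int compute_size(int n, int k, int b2, int p2, int mycol) {
    int count = 0;
    for (int j = k + 1; j < n; j++)
        if ((j / b2) % p2 == mycol)          /* column j stored locally */
            count++;
    return count;   /* a closed-form expression could avoid the loop */
}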
Fig. 7.6 Program of the Gaussian elimination with checkerboard distribution
