
Yagle, A.E. “Fast Matrix Computations”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
10
Fast Matrix Computations

Andrew E. Yagle
University of Michigan

10.1 Introduction
10.2 Divide-and-Conquer Fast Matrix Multiplication
    Strassen Algorithm • Divide-and-Conquer • Arbitrary Precision Approximation (APA) Algorithms • Number Theoretic Transform (NTT) Based Algorithms
10.3 Wavelet-Based Matrix Sparsification
    Overview • The Wavelet Transform • Wavelet Representations of Integral Operators • Heuristic Interpretation of Wavelet Sparsification
References
10.1 Introduction
This chapter presents two major approaches to fast matrix multiplication. We restrict our attention
to matrix multiplication, excluding matrix addition and matrix inversion, since matrix addition
admits no fast algorithm structure (save for the obvious parallelization), and matrix inversion (i.e.,
solution of large linear systems of equations) is generally performed by iterative algorithms that
require repeated matrix-matrix or matrix-vector multiplications. Hence, matrix multiplication is
the real problem of interest.
We present two major approaches to fast matrix multiplication. The first is the divide-and-conquer strategy made possible by Strassen's [1] remarkable reformulation of non-commutative 2 × 2 matrix multiplication. We also present the APA (arbitrary precision approximation) algorithms, which improve on Strassen's result at the price of approximation, and a recent result that reformulates matrix multiplication as convolution and applies number theoretic transforms. The second approach is to use a wavelet basis to sparsify the representation of Calderon-Zygmund operators as matrices. Since electromagnetic Green's functions are Calderon-Zygmund operators, this has proven to be useful in solving integral equations in electromagnetics. The sparsified matrix representation is used in an iterative algorithm to solve the linear system of equations associated with the integral equations, greatly reducing the computation. We also present some new insights that make the wavelet-induced sparsification seem less mysterious.
10.2 Divide-and-Conquer Fast Matrix Multiplication
10.2.1 Strassen Algorithm
It is not obvious that there should be any way to perform matrix multiplication other than using the definition of matrix multiplication, for which multiplying two $N \times N$ matrices requires $N^3$ multiplications and additions ($N$ for each of the $N^2$ elements of the resulting matrix). However, in
1969 Strassen [1] made the remarkable observation that the product of two 2 × 2 matrices

$$
\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}
\begin{bmatrix} b_{1,1} & b_{1,2} \\ b_{2,1} & b_{2,2} \end{bmatrix}
=
\begin{bmatrix} c_{1,1} & c_{1,2} \\ c_{2,1} & c_{2,2} \end{bmatrix}
\tag{10.1}
$$
may be computed using only seven multiplications (fewer than the obvious eight), as
$$
\begin{aligned}
m_1 &= (a_{1,2} - a_{2,2})(b_{2,1} + b_{2,2}); \qquad & m_3 &= (a_{1,1} - a_{2,1})(b_{1,1} + b_{1,2}) \\
m_2 &= (a_{1,1} + a_{2,2})(b_{1,1} + b_{2,2}) & & \\
m_4 &= (a_{1,1} + a_{1,2})\,b_{2,2}; \qquad & m_7 &= (a_{2,1} + a_{2,2})\,b_{1,1} \\
m_5 &= a_{1,1}(b_{1,2} - b_{2,2}); \qquad & m_6 &= a_{2,2}(b_{2,1} - b_{1,1}) \\
c_{1,1} &= m_1 + m_2 - m_4 + m_6; \qquad & c_{1,2} &= m_4 + m_5 \\
c_{2,2} &= m_2 - m_3 + m_5 - m_7; \qquad & c_{2,1} &= m_6 + m_7
\end{aligned}
\tag{10.2}
$$
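As a quick sanity check, the following minimal Python sketch (ours, not from the chapter; the 1-based subscripts of (10.2) become 0-based list indices) verifies that the seven products reproduce all four entries of C:

```python
import random

def strassen_2x2(a, b):
    # The seven products of (10.2); a_{i,j} in the text is a[i-1][j-1] here.
    m1 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1])
    m2 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1])
    m3 = (a[0][0] - a[1][0]) * (b[0][0] + b[0][1])
    m4 = (a[0][0] + a[0][1]) * b[1][1]
    m5 = a[0][0] * (b[0][1] - b[1][1])
    m6 = a[1][1] * (b[1][0] - b[0][0])
    m7 = (a[1][0] + a[1][1]) * b[0][0]
    return [[m1 + m2 - m4 + m6, m4 + m5],
            [m6 + m7, m2 - m3 + m5 - m7]]

a = [[random.randint(-9, 9), random.randint(-9, 9)] for _ in range(2)]
b = [[random.randint(-9, 9), random.randint(-9, 9)] for _ in range(2)]
direct = [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
assert strassen_2x2(a, b) == direct   # all four entries agree
```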
A vital feature of (10.2) is that it is non-commutative, i.e., it does not depend on the commutative property of multiplication. This can be seen easily by noting that each of the $m_i$ is the product of a linear combination of the elements of A by a linear combination of the elements of B, in that order, so that it is never necessary to use, say, $a_{2,2}b_{2,1} = b_{2,1}a_{2,2}$. We note there exist commutative algorithms for 2 × 2 matrix multiplication that require even fewer operations, but they are of little practical use.
The significance of noncommutativity is that the noncommutative algorithm (10.2) may be applied as is to block matrices. That is, if the $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$ in (10.1) and (10.2) are replaced by block matrices, (10.2) is still true. Since matrix multiplication can be subdivided into block submatrix operations (i.e., (10.1) is still true if $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$ are replaced by block matrices), this immediately leads to a divide-and-conquer fast algorithm.
10.2.2 Divide-and-Conquer
To see this, consider the $2^n \times 2^n$ matrix multiplication AB = C, where A, B, C are all $2^n \times 2^n$ matrices. Using the usual definition, this requires $(2^n)^3 = 8^n$ multiplications and additions. But if A, B, C are subdivided into $2^{n-1} \times 2^{n-1}$ blocks $a_{i,j}$, $b_{i,j}$, $c_{i,j}$, then AB = C becomes (10.1), which can be implemented with (10.2), since (10.2) does not require the products of subblocks of A and B to commute. Thus the $2^n \times 2^n$ matrix multiplication AB = C can actually be implemented using only seven matrix multiplications of $2^{n-1} \times 2^{n-1}$ subblocks of A and B. And these subblock multiplications can in turn be broken down by using (10.2) to implement them as well. The end result is that the $2^n \times 2^n$ matrix multiplication AB = C can be implemented using only $7^n$ multiplications, instead of $8^n$.
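A minimal recursive sketch of this divide-and-conquer (our illustration, using numpy for the block arithmetic; a practical implementation would switch to ordinary multiplication below some crossover block size rather than recursing all the way to 1 × 1):

```python
import numpy as np

def strassen(A, B):
    # Multiply 2^n x 2^n matrices by recursive application of (10.2):
    # seven block products per level instead of eight, so 7^n scalar multiplies.
    N = A.shape[0]
    if N == 1:
        return A * B                  # 1 x 1 base case: a scalar product
    h = N // 2                        # split into four h x h blocks
    a11, a12, a21, a22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b11, b12, b21, b22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    m1 = strassen(a12 - a22, b21 + b22)
    m2 = strassen(a11 + a22, b11 + b22)
    m3 = strassen(a11 - a21, b11 + b12)
    m4 = strassen(a11 + a12, b22)
    m5 = strassen(a11, b12 - b22)
    m6 = strassen(a22, b21 - b11)
    m7 = strassen(a21 + a22, b11)
    return np.block([[m1 + m2 - m4 + m6, m4 + m5],
                     [m6 + m7, m2 - m3 + m5 - m7]])

A = np.random.randint(-5, 5, (8, 8))
B = np.random.randint(-5, 5, (8, 8))
assert np.array_equal(strassen(A, B), A @ B)
```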
The computational savings grow as the matrix size increases. For n = 5 (32 × 32 matrices) the savings is about 50%. For n = 12 (4096 × 4096 matrices) the savings is about 80%. The savings as a fraction can be made arbitrarily close to unity by taking sufficiently large matrices. Another way of looking at this is to note that $N \times N$ matrix multiplication requires $O(N^{\log_2 7}) = O(N^{2.807}) < N^3$ multiplications using Strassen.
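The quoted savings follow directly from the ratio $(7/8)^n$; a quick check (ours):

```python
# Strassen uses 7^n multiplications where the definition uses 8^n.
for n in (5, 12):
    print(f"N = {2**n:4d}: savings = {1 - (7/8)**n:.1%}")
# N =   32: savings = 48.7%
# N = 4096: savings = 79.9%
```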
Of course, we are not limited to subdividing into 2 × 2 = 4 subblocks. Fast non-commutative algorithms for 3 × 3 matrix multiplication requiring only $23 < 3^3 = 27$ multiplications were found by exhaustive search in [2] and [3]; 23 is now known to be optimal. Repeatedly subdividing AB = C into 3 × 3 = 9 subblocks computes a $3^n \times 3^n$ matrix multiplication in $23^n < 27^n$ multiplications; $N \times N$ matrix multiplication requires $O(N^{\log_3 23}) = O(N^{2.854})$ multiplications, so this is not quite as good as using (10.2). A fast noncommutative algorithm for 5 × 5 matrix multiplication requiring only $102 < 5^3 = 125$ multiplications was found in [4]; this also seems to be optimal. Using this algorithm, $N \times N$ matrix multiplication requires $O(N^{\log_5 102}) = O(N^{2.874})$ multiplications, so this is even worse. Of course, the idea is to write $N = 2^a 3^b 5^c$ for some a, b, c and subdivide into 2 × 2 = 4 subblocks a times, then subdivide into 3 × 3 = 9 subblocks b times, etc. The total number of multiplications is then $7^a\, 23^b\, 102^c < 8^a\, 27^b\, 125^c = N^3$.
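For a concrete instance (our numbers, not from the chapter): with $N = 60 = 2^2 \cdot 3 \cdot 5$, the mixed subdivision uses $7^2 \cdot 23 \cdot 102 = 114{,}954$ multiplications versus $60^3 = 216{,}000$ for the definition:

```python
# Mixed-radix subdivision count: N = 2^a * 3^b * 5^c needs
# 7^a * 23^b * 102^c multiplications, always less than N^3.
a, b, c = 2, 1, 1                     # N = 4 * 3 * 5 = 60
print(7**a * 23**b * 102**c)          # 114954
print((2**a * 3**b * 5**c) ** 3)      # 216000
```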
Note that we have not mentioned additions. Readers familiar with nesting fast convolution algorithms will know why; here we review why reducing multiplications is much more important than reducing additions when nesting algorithms. The reason is that at each nesting stage (reversing the divide-and-conquer to build up algorithms for multiplying large matrices from (10.2)), each scalar addition is replaced by a matrix addition (which requires $N^2$ additions for $N \times N$ matrices), and each scalar multiplication is replaced by a matrix multiplication (which requires $N^3$ multiplications and additions for $N \times N$ matrices). Although we are reducing $N^3$ to about $N^{2.8}$, it is clear that as we nest, each multiplication spawns more multiplications and additions than each addition does. So reducing the number of multiplications from eight to seven in (10.2) is well worth the extra additions incurred. In fact, the number of additions is also $O(N^{2.807})$.
The design of these base algorithms rests on the theory of bilinear and trilinear forms. The review paper [5] and book [6] of Pan are good introductions to this theory. We note that reducing the exponent of N in $N \times N$ matrix multiplication is an area of active research. This exponent has been reduced to below 2.5; a known lower bound is two. However, the resulting algorithms are too complicated to be useful.
10.2.3 Arbitrary Precision Approximation (APA) Algorithms
APA algorithms are noncommutative algorithms for 2 × 2 and 3 × 3 matrix multiplication that require even fewer multiplications than the Strassen-type algorithms, but at the price of requiring longer wordlengths. Proposed by Bini [7], the APA algorithm for multiplying two 2 × 2 matrices is as follows, where $\epsilon$ is a small nonzero parameter:
$$
\begin{aligned}
p_1 &= (a_{2,1} + \epsilon\, a_{1,2})(b_{2,1} + \epsilon\, b_{1,2}); \qquad &
p_2 &= (-a_{2,1} + \epsilon\, a_{1,1})(b_{1,1} + \epsilon\, b_{1,2}) \\
p_3 &= (a_{2,2} - \epsilon\, a_{1,2})(b_{2,1} + \epsilon\, b_{2,2}); \qquad &
p_4 &= a_{2,1}(b_{1,1} - b_{2,1}); \qquad p_5 = (a_{2,1} + a_{2,2})\,b_{2,1} \\
c_{1,1} &= (p_1 + p_2 + p_4)/\epsilon - \epsilon\,(a_{1,1} + a_{1,2})\,b_{1,2}; \qquad &
c_{2,1} &= p_4 + p_5 \\
c_{2,2} &= (p_1 + p_3 - p_5)/\epsilon - \epsilon\, a_{1,2}(b_{1,2} - b_{2,2})
\end{aligned}
\tag{10.3}
$$
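A small numerical sketch of (10.3) (ours; `eps` plays the role of $\epsilon$) shows that the recovered entries differ from the exact ones only by the dropped $O(\epsilon)$ second terms:

```python
def apa_2x2(a, b, eps):
    # Bini's five products from (10.3); dropping the O(eps) correction
    # terms leaves c11, c21, c22 accurate to O(eps).
    p1 = (a[1][0] + eps * a[0][1]) * (b[1][0] + eps * b[0][1])
    p2 = (-a[1][0] + eps * a[0][0]) * (b[0][0] + eps * b[0][1])
    p3 = (a[1][1] - eps * a[0][1]) * (b[1][0] + eps * b[1][1])
    p4 = a[1][0] * (b[0][0] - b[1][0])
    p5 = (a[1][0] + a[1][1]) * b[1][0]
    c11 = (p1 + p2 + p4) / eps        # = exact c11 + eps*(a11 + a12)*b12
    c21 = p4 + p5                     # exact
    c22 = (p1 + p3 - p5) / eps        # = exact c22 + eps*a12*(b12 - b22)
    return c11, c21, c22

print(apa_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1e-6))
# ~ (19.000018, 43, 49.999996); the exact product is [[19, 22], [43, 50]]
```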
If we now let $\epsilon \to 0$, the second terms in (10.3) become negligible next to the first terms, and so they need not be computed. Hence, three of the four elements of C = AB may be computed using only five multiplications. $c_{1,2}$ may be computed using a sixth multiplication, so that, in fact, two 2 × 2 matrices may be multiplied to arbitrary accuracy using only six multiplications. The APA 3 × 3 matrix multiplication algorithm requires 21 multiplications. Note that APA algorithms improve on the exact Strassen-type algorithms (6 < 7, 21 < 23).
The APA algorithms are often described as being numerically unstable, due to roundoff error as $\epsilon \to 0$. We believe that an electrical engineering perspective on these algorithms puts them in a light different from that of the mathematical perspective. In fixed-point implementation, the computation AB = C can be scaled to operations on integers, and the $p_i$ can be bounded. Then it is easy to set $\epsilon$ to a sufficiently small (negative) power of two to ensure that the second terms in (10.3) do not overlap the first terms, provided that the wordlength is long enough. Thus, the reputation for instability is undeserved. However, the requirement of large wordlengths to be multiplied seems also to have escaped notice; this may be a more serious problem in some architectures.
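This wordlength argument can be made concrete with a small integer sketch (ours, assuming non-negative data): with $\epsilon = 2^{-k}$, scaling $p_1, p_2, p_4$ of (10.3) by $2^{2k}$ keeps all arithmetic in exact integers, and their sum equals $2^k c_{1,1} + (a_{1,1} + a_{1,2})\,b_{1,2}$, so the desired entry and the "second term" occupy disjoint bit fields whenever the second term fits in $k$ bits:

```python
# eps = 2**-k; the scaled products q_i = 2**(2k) * p_i are exact integers,
# and q1 + q2 + q4 = (c11 << k) + (a11 + a12)*b12 -- two disjoint bit fields.
k = 16
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
q1 = ((a[1][0] << k) + a[0][1]) * ((b[1][0] << k) + b[0][1])
q2 = (-(a[1][0] << k) + a[0][0]) * ((b[0][0] << k) + b[0][1])
q4 = (a[1][0] << k) * ((b[0][0] - b[1][0]) << k)
s = q1 + q2 + q4
print(s >> k, s & ((1 << k) - 1))     # 19 18: c11 = 19, (a11+a12)*b12 = 18
```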
The divide-and-conquer and resulting nesting of APA algorithms work the same way as for the Strassen-type algorithms. $N \times N$ matrix multiplication using (10.3) requires $O(N^{\log_2 6}) = O(N^{2.585})$ multiplications, which improves on the $O(N^{2.807})$ multiplications using (10.2). But the wordlengths are longer.
A design methodology for fast matrix multiplication algorithms by grouping terms has been
proposed in a series of papers by Pan (see References [5] and [6]). While this has proven quite
fruitful, the methodology of grouping terms becomes somewhat ad hoc.
10.2.4 Number Theoretic Transform (NTT) Based Algorithms
An approach similar in flavor to the APA algorithms, but more flexible, has been taken recently in [8]. First, matrix multiplication is reformulated as a linear convolution, which can be implemented as the multiplication of two polynomials using the z-transform. Second, the variable z is scaled, producing a scaled convolution, which is then made cyclic. This aliases some quantities, but they are separated by a power of the scaling factor. Third, the scaled convolution is computed using pseudo-number-theoretic transforms. Finally, the various components of the product matrix are read off of the convolution, using the fact that the elements of the product matrix are bounded. This can be done without error if the scaling factor is sufficiently large.
This approach yields algorithms that require no more multiplications than APA for 2 × 2 and 3 × 3 matrices. The multiplicands are again sums of scaled matrix elements, as in APA. However, the design methodology is quite simple and straightforward, and the reason why the fast algorithm exists is now clear, unlike for the APA algorithms. Also, the integer computations inherent in this formulation make possible the engineering insights into APA noted above.
We reformulate the product of two $N \times N$ matrices as the linear convolution of a sequence of length $N^2$ and a sparse sequence of length $N^3 - N + 1$. This results in a sequence of length $N^3 + N^2 - N$, from which the elements of the product matrix may be obtained. For convenience, we write the linear convolution as the product of two polynomials. This result (of [8]) seems to be new, although a similar result is briefly noted in ([3], p. 197). Define
$$
\begin{gathered}
a_{i,j} = a_{i+jN}\,; \qquad b_{i,j} = b_{N-1-i+jN}\,; \qquad 0 \le i,j \le N-1 \\[4pt]
\left( \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i+jN}\, x^{i+jN} \right)
\left( \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} b_{N-1-i+jN}\, x^{N(N-1-i+jN)} \right)
= \sum_{i=0}^{N^3+N^2-N-1} c_i\, x^i \\[4pt]
c_{i,j} = c_{N^2-N+i+jN^2}\,; \qquad 0 \le i,j \le N-1
\end{gathered}
\tag{10.4}
$$
Note that the coefficients of all three polynomials are read off of the matrices A, B, C column-by-column (each column of B is reversed), and the result is noncommutative. For example, the 2 × 2 matrix multiplication (10.1) becomes

$$
\begin{aligned}
&\left( a_{1,1} + a_{2,1} x + a_{1,2} x^2 + a_{2,2} x^3 \right)
\left( b_{2,1} + b_{1,1} x^2 + b_{2,2} x^4 + b_{1,2} x^6 \right) \\
&\qquad = {*} + {*}\,x + c_{1,1} x^2 + c_{2,1} x^3 + {*}\,x^4 + {*}\,x^5 + c_{1,2} x^6 + c_{2,2} x^7 + {*}\,x^8 + {*}\,x^9 ,
\end{aligned}
\tag{10.5}
$$
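Here the $*$'s mark coefficients of the product polynomial that are not needed. The example (10.5) is easy to check numerically with an ordinary polynomial multiplication (our sketch, using numpy's `convolve`):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
a_seq = A.flatten(order='F')                 # columns of A: [a11 a21 a12 a22]
b_seq = np.zeros(7, dtype=int)               # sparse, length N^3 - N + 1 = 7
b_seq[::2] = B[::-1, :].flatten(order='F')   # reversed columns of B, spread N apart
c_seq = np.convolve(a_seq, b_seq)            # length N^3 + N^2 - N = 10
# c_{i,j} sits at position N^2 - N + i + j*N^2 = 2 + i + 4j (0-based i, j)
C = np.array([[c_seq[2 + i + 4 * j] for j in range(2)] for i in range(2)])
assert np.array_equal(C, A @ B)              # reads [[19, 22], [43, 50]]
```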