Cryptographic Algorithms on Reconfigurable Hardware, Part 5

5.2 Modular Addition Operation
[Figure: an array of full adders (FA) and half adders (HA) operating on the input bits A_i, B_i, C_i.]
Fig. 5.7. Carry Delayed Adder
combined; in other words, $S' = A + B$ and $S'' = A + B - n$ can be computed
at the same time. Then, we perform a sign detection to decide whether to
take $S'$ or $S''$ as the correct sum. We will review algorithms of this type when
we study modular multiplication algorithms.
5.2.1 Omura's Method
An efficient method for computing modular addition, which is especially useful
for multioperand modular addition, was proposed by Omura in [260].
Let $n < 2^k$. This method allows a temporary value to grow larger than $n$; however, it
is always kept less than $2^k$. Whenever it exceeds $2^k$, the carry-out is ignored
and a correction is performed. The correction factor is $m = 2^k - n$, which
is precomputed and saved in a register. Thus, Omura's method performs the
following steps given the integers $A, B < 2^k$ (but they can be larger than $n$).
1. First compute $S' = A + B$.
2. If there is a carry-out (of the $k$th bit), then $S = S' + m$, else $S = S'$.
The correctness of Omura's algorithm follows from the following observations:
• If there is no carry-out, then $S = A + B$ is returned. The sum $S$ is less
than $2^k$, but may be larger than $n$. In a future computation, it will be
brought below $n$ if necessary.
• If there is a carry-out, then we ignore the carry-out, which means we
compute
$$S' = A + B - 2^k.$$
The result, which needs to be reduced modulo $n$, is in effect reduced modulo
$2^k$. We correct the result by adding $m$ back to it, and thus compute
$$S = S' + m = A + B - 2^k + 2^k - n = A + B - n.$$
After all additions are completed, a final result is reduced modulo $n$ by using
the standard technique. As an example, let us assume $n = 39$. Thus, we have
$m = 2^6 - 39 = 25 = (011001)$. The modular addition of $A = 40$ and $B = 30$
is performed using Omura's method as follows:

    A          = 40 =  (101000)
    B          = 30 =  (011110)
    S' = A + B      = 1(000110)   Carry-out
    m               =  (011001)
    S = S' + m      =  (011111)   Correction

Thus, we obtain the result as $S = (011111) = 31$, which is equal to 70 (mod 39)
as required. On the other hand, the addition of $A = 23$ and $B = 26$ is performed
as

    A          = 23 =  (010111)
    B          = 26 =  (011010)
    S' = A + B      = 0(110001)   No carry-out
    S = S'          =  (110001)

This leaves the result as $S = (110001) = 49$, which is larger than the modulus
39. It will be reduced in a further step of the multioperand modular addition.

After all additions are completed, a final result that still exceeds $n$ can be
corrected by adding $m$ to it and ignoring the carry-out, which amounts to
subtracting $n$. For example, we correct the above result $S = (110001)$ as
follows:

    S          =  (110001)
    m          =  (011001)
    S + m      = 1(001010)
    S          =  (001010)

The result obtained is $S = (001010) = 10$, which is equal to 49 modulo 39, as
required.
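The two steps of Omura's method and the final correction can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `omura_add` and `omura_final_correct` are made up here, and the `while` loops are a precaution of the sketch (for arbitrary operands below $2^k$ a second carry-out can occur after the first correction), not something the text prescribes.

```python
def omura_add(a, b, n, k):
    """Add a and b (both < 2**k); the result is < 2**k and congruent to a + b mod n."""
    m = (1 << k) - n          # precomputed correction factor m = 2^k - n
    mask = (1 << k) - 1
    s = a + b
    while s >> k:             # carry-out of the k-th bit
        s = (s & mask) + m    # drop the carry, add the correction
    return s


def omura_final_correct(s, n, k):
    """Bring a k-bit result below n by adding m and dropping the carry-out."""
    m = (1 << k) - n
    mask = (1 << k) - 1
    while (s + m) >> k:       # a carry-out here means s >= n
        s = (s + m) & mask
    return s


print(omura_add(40, 30, 39, 6))                           # first example of the text
print(omura_final_correct(omura_add(23, 26, 39, 6), 39, 6))  # second example
```

With $n = 39$ and $k = 6$ the sketch reproduces the two worked examples: 40 + 30 yields 31, and 23 + 26 yields 49, which the final correction turns into 10.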
5.3 Modular Multiplication Operation
The modular multiplication problem is defined as the computation of $P = AB$
(mod $n$) given the integers $A$, $B$, and $n$. It is usually assumed that $A$ and $B$ are
nonnegative integers with $0 \le A, B < n$, i.e., they are the least nonnegative residues.
There are basically four approaches for computing the product $P$.
• Multiply and then divide.
• The steps of the multiplication and reduction are interleaved.
• Brickell's method.
• Montgomery's method.
The multiply-and-divide method first multiplies $A$ and $B$ to obtain the
$2k$-bit number
$$P' := AB.$$
Then, the result $P'$ is divided (reduced) by $n$ to obtain the $k$-bit number
$$P := P' \bmod n.$$
The result $P$ is a $k$-bit or $s$-word number.
The reduction is accomplished by dividing $P'$ by $n$; however, we are not
interested in the quotient; we only need the remainder. The steps of the division
algorithm can be somewhat simplified in order to speed up the process.

5.3.1 Standard Multiplication Algorithm
Let $A$ and $B$ be two $s$-digit ($s$-word) numbers expressed in radix $W$ as:
$$A = (A_{s-1} A_{s-2} \cdots A_0) = \sum_{i=0}^{s-1} A_i W^i,$$
$$B = (B_{s-1} B_{s-2} \cdots B_0) = \sum_{j=0}^{s-1} B_j W^j,$$
where the digits of $A$ and $B$ are in the range $[0, W - 1]$. In general $W$ can be
any positive number. For reconfigurable hardware implementations, we often
select $W = 2^w$ where $w$ is the word-size or granularity of the device, e.g.,
$w = 4$. The standard (pencil-and-paper) algorithm for multiplying $A$ and $B$
produces the partial products by multiplying a digit of the multiplier ($B$)
by the entire number $A$, and then summing these partial products to obtain
the final $2s$-word number $P'$. Let $P'_{ij}$ denote the (Carry, Sum) pair
produced from the product $A_i \cdot B_j$. For example, when $W = 10$, and $A_i = 7$
and $B_j = 8$, then $P'_{ij} = (5, 6)$. The $P'_{ij}$ pairs can be arranged in a table as
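                                A_3     A_2     A_1     A_0
                          ×     B_3     B_2     B_1     B_0
    ------------------------------------------------------
                                P'_03   P'_02   P'_01   P'_00
                        P'_13   P'_12   P'_11   P'_10
                P'_23   P'_22   P'_21   P'_20
    +   P'_33   P'_32   P'_31   P'_30
    ------------------------------------------------------
    P'_7  P'_6  P'_5    P'_4    P'_3    P'_2    P'_1    P'_0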
The last row denotes the total sum of the partial products, and represents the
product as a $2s$-word number. The standard algorithm for multiplication
essentially performs the above digit-by-digit multiplications and additions. In
order to save space, a single partial product variable $P'$ is used. The
initial value of the partial product is equal to zero; we then take a digit of $B$
and multiply by the entire number $A$, and add it to the partial product $P'$.
The partial product variable $P'$ contains the final product $A \cdot B$ at the end of
the computation. Algorithm 5.1 shows the standard procedure for computing
the product $A \cdot B$.
Algorithm 5.1 The Standard Multiplication Algorithm
Require: $A$, $B$.
Ensure: $P' = A \cdot B$.
1: Initially $P'_i := 0$ for all $i = 0, 1, \ldots, 2s - 1$.
2: for $i = 0$ to $s - 1$ do
3:   $C := 0$;
4:   for $j = 0$ to $s - 1$ do
5:     $(C, S) := P'_{i+j} + A_j \cdot B_i + C$;
6:     $P'_{i+j} := S$;
7:   end for
8:   $P'_{i+s} := C$;
9: end for
10: Return($P'_{2s-1} P'_{2s-2} \cdots P'_0$)
In the following, we show the steps of the computation of $A \cdot B = 348 \cdot 857$
using the standard algorithm.

    i  j  Step                        (C, S)   Partial P'
    0  0  P'_0 + A_0 B_0 + C          (0, *)   000000
          0 + 8·7 + 0                 (5, 6)   000006
       1  P'_1 + A_1 B_0 + C
          0 + 4·7 + 5                 (3, 3)   000036
       2  P'_2 + A_2 B_0 + C
          0 + 3·7 + 3                 (2, 4)   000436
                                               002436
    1  0  P'_1 + A_0 B_1 + C          (0, *)
          3 + 8·5 + 0                 (4, 3)   002436
       1  P'_2 + A_1 B_1 + C
          4 + 4·5 + 4                 (2, 8)   002836
       2  P'_3 + A_2 B_1 + C
          2 + 3·5 + 2                 (1, 9)   009836
                                               019836
    2  0  P'_2 + A_0 B_2 + C          (0, *)
          8 + 8·8 + 0                 (7, 2)   019236
       1  P'_3 + A_1 B_2 + C
          9 + 4·8 + 7                 (4, 8)   018236
       2  P'_4 + A_2 B_2 + C
          1 + 3·8 + 4                 (2, 9)   098236
                                               298236
In order to implement this algorithm, we need to be able to execute Step 5 of
Algorithm 5.1 as
$$(C, S) := P'_{i+j} + A_j \cdot B_i + C,$$
where the variables $P'_{i+j}$, $A_j$, $B_i$, $C$, and $S$ each hold a single-word, or a
$w$-bit number. This step is termed an inner-product operation, which is
common in many arithmetic and number-theoretic calculations. The
inner-product operation above requires that we multiply two $w$-bit numbers
and add this product to the previous 'carry', which is also a $w$-bit number, and
then add this result to the running partial product word $P'_{i+j}$. From these
three operations we obtain a $2w$-bit number since the maximum value is
$$2^w - 1 + (2^w - 1)(2^w - 1) + 2^w - 1 = 2^{2w} - 1.$$
Also, since the inner-product step is within the innermost loop, it needs to run
as fast as possible. Of course, the best thing is to have a single microprocessor
instruction for this computation; unfortunately, none of the currently available
microprocessors and signal processors offers such a luxury. A brief inspection
of the steps of this algorithm reveals that the total number of inner-product
steps is equal to $s^2$. Since $s = k/w$ and $w$ is a constant on a given computer,
the standard multiplication algorithm requires $O(k^2)$ bit operations in order
to multiply two $k$-bit numbers.
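The digit-by-digit procedure of Algorithm 5.1 can be tried out with a short Python sketch. The helper names (`to_digits`, `from_digits`, `standard_multiply`) are illustrative choices made here; digits are stored least significant first, and `divmod` plays the role of splitting the inner product into the (Carry, Sum) pair.

```python
def to_digits(x, base, s):
    """Split x into s digits in the given radix, least significant digit first."""
    return [(x // base**i) % base for i in range(s)]


def from_digits(digits, base):
    """Recombine least-significant-first digits into an integer."""
    return sum(d * base**i for i, d in enumerate(digits))


def standard_multiply(a_digits, b_digits, base):
    """Schoolbook product of two s-digit numbers, returned as 2s digits."""
    s = len(a_digits)
    assert len(b_digits) == s
    p = [0] * (2 * s)
    for i in range(s):
        carry = 0
        for j in range(s):
            # inner-product step: (C, S) := P'[i+j] + A[j] * B[i] + C
            carry, p[i + j] = divmod(p[i + j] + a_digits[j] * b_digits[i] + carry, base)
        p[i + s] = carry
    return p


product = standard_multiply(to_digits(348, 10, 3), to_digits(857, 10, 3), 10)
print(from_digits(product, 10))
```

For the decimal example of the text, the digit list returned is `[6, 3, 2, 8, 9, 2]`, i.e., 298236.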
5.3.2 Squaring is Easier
Squaring is an easier operation than multiplication since half of the
single-precision multiplications can be skipped. This is due to the fact that
$P'_{ij} = A_i \cdot A_j = P'_{ji}$.
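                                A_3      A_2      A_1      A_0
                          ×     A_3      A_2      A_1      A_0
    ----------------------------------------------------------
                                P'_03    P'_02    P'_01    P'_00
                       P'_13    P'_12    P'_11    P'_01
              P'_23    P'_22    P'_12    P'_02
    +  P'_33  P'_23    P'_13    P'_03
    ----------------------------------------------------------
                                2P'_03   2P'_02   2P'_01   P'_00
                       2P'_13   2P'_12   P'_11
              2P'_23   P'_22
    +  P'_33
    ----------------------------------------------------------
    P'_7  P'_6  P'_5   P'_4     P'_3     P'_2     P'_1     P'_0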
Thus, we can modify the standard multiplication procedure as shown in
Algorithm 5.2 to take advantage of this property of the squaring operation.
Algorithm 5.2 The Standard Squaring Algorithm
Require: $A$.
Ensure: $P' = A \cdot A$.
1: Initially $P'_i := 0$ for all $i = 0, 1, \ldots, 2s - 1$.
2: for $i = 0$ to $s - 1$ do
3:   $(C, S) := P'_{i+i} + A_i \cdot A_i$; $P'_{i+i} := S$;
4:   for $j = i + 1$ to $s - 1$ do
5:     $(C, S) := P'_{i+j} + 2 \cdot A_j \cdot A_i + C$;
6:     $P'_{i+j} := S$;
7:   end for
8:   $P'_{i+s} := C$;
9: end for
10: Return($P'_{2s-1} P'_{2s-2} \cdots P'_0$)
However, we warn the reader that the carry-sum pair produced by the operation
$$(C, S) := P'_{i+j} + 2 \cdot A_j \cdot A_i + C$$
in Step 5 of Algorithm 5.2 may need 1 bit more than the $2w$ bits that a pair of
single-precision ($w$-bit) words provides. Since
$$(2^w - 1) + 2(2^w - 1)(2^w - 1) + (2^w - 1) = 2^{2w+1} - 2^{w+1}$$
and
$$2^{2w} - 1 < 2^{2w+1} - 2^{w+1} < 2^{2w+1} - 1,$$
the carry-sum pair requires $2w + 1$ bits instead of $2w$ bits for its representation.
Thus, we need to accommodate this 'extra' bit during the execution of the
operations in Steps 5, 6, and 8 of Algorithm 5.2. The resolution of this carry
may depend on the way the carry bits are handled by the particular processor's
architecture. This issue, being rather implementation-dependent, will not be
discussed here.
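A small Python sketch of the squaring procedure follows. It is an illustration under assumptions made here, not the book's implementation: the name `standard_square` is hypothetical, the diagonal sum digit is stored explicitly, and the possibly oversized carry (the 'extra' bit discussed above) is handled by simply rippling it upward, which is easy with Python's unbounded integers but would need care in hardware.

```python
def standard_square(a_digits, base):
    """Square an s-digit number (least significant digit first); returns 2s digits."""
    s = len(a_digits)
    p = [0] * (2 * s)
    for i in range(s):
        # diagonal term A[i] * A[i] is added once
        carry, p[2 * i] = divmod(p[2 * i] + a_digits[i] * a_digits[i], base)
        for j in range(i + 1, s):
            # off-diagonal terms are doubled; the carry may exceed one digit
            carry, p[i + j] = divmod(p[i + j] + 2 * a_digits[j] * a_digits[i] + carry, base)
        # ripple the remaining carry upward (it can be larger than a single digit)
        pos = i + s
        while carry:
            carry, p[pos] = divmod(p[pos] + carry, base)
            pos += 1
    return p


digits_348 = [8, 4, 3]                     # 348, least significant digit first
square = standard_square(digits_348, 10)
print(sum(d * 10**i for i, d in enumerate(square)))
```

Compared with the multiplication sketch, the inner loop starts at `i + 1`, so roughly half of the single-precision products are skipped.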
5.3.3 Modular Reduction
The multiply-and-reduce modular multiplication algorithm first computes the
product $A \cdot B$ (or $A \cdot A$) using one of the multiplication algorithms given
above. The multiplication step is then followed by a division algorithm in
order to compute the remainder. However, as we have mentioned before, we
are not interested in the quotient; we only need the remainder. Therefore, the
steps of the division algorithm can be somewhat simplified in order to speed
up the process. The reduction step can be achieved by using one of the
well-known sequential division algorithms. In the rest of this subsection, we
describe the restoring and the nonrestoring division algorithms for computing
the remainder of $P'$ when divided by $n$, where $n$ is a general modulus.¹

Division is the most complex of the four basic arithmetic operations. First
of all, it has two results: the quotient and the remainder. Given a dividend
$P'$ and a divisor $n$, a quotient $Q$ and a remainder $R$ have to be calculated in
order to satisfy
$$P' = Q \cdot n + R \quad \text{with } R < n.$$
If $P'$ and $n$ are positive, then the quotient $Q$ and the remainder $R$ will be
nonnegative. The sequential division algorithm successively shifts and subtracts $n$
from $P'$ until a remainder $R$ with the property $0 \le R < n$ is found. However,
after a subtraction we may obtain a negative remainder. The restoring and
nonrestoring algorithms take different actions when a negative remainder is
obtained.
Restoring Division Algorithm
Let $R_i$ be the remainder obtained during the $i$th step of the division algorithm.
Since we are not interested in the quotient, we ignore the generation of the
bits of the quotient in the following algorithm. The procedure given below
first left-aligns the operands $P'$ and $n$. Since $P'$ is a $2k$-bit number and $n$ is a
$k$-bit number, the left alignment implies that $n$ is shifted $k$ bits to the left,
i.e., we start with $2^k n$. Furthermore, the initial value of $R$ is taken to be $P'$,
i.e., $R_0 = P'$. We then subtract the shifted $n$ from $P'$ to obtain $R_1$; if $R_1$ is
positive or zero, we continue to the next step. If it is negative, the remainder
is restored to its previous value, as is shown in Algorithm 5.3 below.

¹ Solinas proposed in [338] primes of a special form for which the
reduction step can be accomplished with high efficiency. However, the material
on Solinas' special primes is not covered in this book. The interested reader may
consult [37].
Algorithm 5.3 The Restoring Division Algorithm
Require: $P'$, $n$.
Ensure: $R = P' \bmod n$.
1: $R_0 := P'$;
2: $n := 2^k n$;
3: for $i = 1$ to $k$ do
4:   $R_i := R_{i-1} - n$;
5:   if $R_i < 0$ then
6:     $R_i := R_{i-1}$;
7:   end if
8:   $n := n/2$;
9: end for
10: Return($R_k$)
In Step 5 of Algorithm 5.3, we check the sign of the remainder; if it is
negative, the previous remainder is taken to be the new remainder, i.e., a
restore operation is performed. If the remainder $R_i$ is positive, it remains as
the new remainder, i.e., we do not restore. The restoring division algorithm
performs $k$ subtractions in order to reduce the $2k$-bit number $P'$ modulo the
$k$-bit number $n$. Thus, it takes much longer than the standard multiplication
algorithm, which requires $s^2 = (k/w)^2$ inner-product steps, where $w$ is the
word size or granularity being employed.
In the following, we give an example of the restoring division algorithm for
computing 3019 mod 53, where $3019 = (101111001011)_2$ and $53 = (110101)_2$.
The result is $51 = (110011)_2$.
    R_0    101111 001011
    n      110101              subtract
    -      000110              negative remainder
    R_1    101111 001011       restore
    n/2     11010 1            shift and subtract
    +       10100 1            positive remainder
    R_2     10100 101011       not restore
    n/2      1101 01           shift and subtract
    +        0111 01           positive remainder
    R_3      0111 011011       not restore
    n/2       110 101          shift and subtract
    +         000 110          positive remainder
    R_4       000 110011       not restore
    n/2        11 0101         shift
    n/2         1 10101        shift
    n/2           110101       shift and subtract
    -             000010       negative remainder
    R_5           110011       restore
    R             110011       final remainder
Also, before subtracting, we may check if the most significant bit of the
remainder is 1. In this case, we perform a subtraction. If it is zero, there is no
need to subtract since $n > R_i$. We shift $n$ until it is aligned with a nonzero
most significant bit of $R_i$. This way we are able to skip several subtract/restore
cycles. On average, $k/2$ subtractions are performed.
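The restoring scheme can be sketched in Python as follows. The function name `restoring_mod` is hypothetical, and one detail is an assumption of this sketch: it tries every alignment of $n$ from $2^k n$ down to $n$ itself ($k + 1$ trial subtractions), which is what the worked example for 3019 mod 53 does. It also assumes $P' < 2^{2k}$ and $2^{k-1} \le n < 2^k$.

```python
def restoring_mod(p, n, k):
    """Remainder of a 2k-bit p modulo a k-bit n using shift, subtract and restore."""
    shifted = n << k              # left-aligned modulus 2^k * n
    r = p
    for _ in range(k + 1):        # alignments 2^k * n, ..., 2 * n, n
        trial = r - shifted
        if trial >= 0:            # keep the result only if it is nonnegative
            r = trial             # otherwise "restore" by keeping the old remainder
        shifted >>= 1
    return r


print(restoring_mod(3019, 53, 6))
```

For the example of the text the call returns 51. Skipping alignments whose leading bit is zero, as described above, would only remove trial subtractions that are bound to be restored.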
Nonrestoring Division Algorithm
The nonrestoring division algorithm allows a negative remainder. In order to
correct the remainder, a subtraction or an addition is performed during the
next cycle, depending on whether the sign of the remainder is positive
or negative, respectively. This is based on the following observation: Suppose
$R_i = R_{i-1} - n < 0$; then the restoring algorithm assigns $R_i := R_{i-1}$ and
performs a subtraction with the shifted $n$, obtaining
$$R_{i+1} = R_i - n/2 = R_{i-1} - n/2.$$
However, if $R_i = R_{i-1} - n < 0$, then one can instead let $R_i$ remain negative
and add the shifted $n$ in the following cycle. Thus, one obtains
$$R_{i+1} = R_i + n/2 = (R_{i-1} - n) + n/2 = R_{i-1} - n/2,$$
which would be the same value. The steps of the nonrestoring algorithm,
which implements this observation, are given in Algorithm 5.4.
Note that the nonrestoring division algorithm requires a final restoration
cycle in which a negative remainder is corrected by adding the last value of $n$
back to it.
Algorithm 5.4 The Nonrestoring Division Algorithm
Require: $P'$, $n$.
Ensure: $R = P' \bmod n$.
$R_0 := P'$;
$n := 2^k n$;
for $i = 1$ to $k$ do
  if $R_{i-1} \ge 0$ then
    $R_i := R_{i-1} - n$;
  else
    $R_i := R_{i-1} + n$;
  end if
  $n := n/2$;
end for
if $R_k < 0$ then
  $R_k := R_k + n$;
end if
Return($R_k$)
In the following we compute 51 = 3019 mod 53 using the nonrestoring
division algorithm. Since the remainder is allowed to stay negative, we use 2's
complement coding to represent such numbers.

    R_0    0101111 001011
    n      0110101             subtract
    R_1    1111010             negative remainder
    n/2     011010 1           add
    R_2     010100 1           positive remainder
    n/2      01101 01          subtract
    R_3      00111 01          positive remainder
    n/2       0110 101         subtract
    R_4       0000 110         positive remainder
    n/2        011 0101
    n/2         01 10101
    n/2          0 110101      subtract
    R_5          1 111110      negative remainder
    n            0 110101      add (final restore)
    R            0 110011      final remainder
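A Python sketch of the nonrestoring scheme is given below. The name `nonrestoring_mod` is made up for this illustration, and two points are assumptions of the sketch rather than statements of the text: a remainder equal to zero is treated as nonnegative, and all $k + 1$ alignments of $n$ (from $2^k n$ down to $n$) are visited so that the final restoration adds the unshifted modulus.

```python
def nonrestoring_mod(p, n, k):
    """Remainder of a 2k-bit p modulo a k-bit n; negative remainders are tolerated."""
    shifted = n << k
    r = p
    last = shifted
    for _ in range(k + 1):                 # alignments 2^k * n, ..., 2 * n, n
        if r >= 0:
            r -= shifted                   # nonnegative remainder: subtract
        else:
            r += shifted                   # negative remainder: add instead of restoring
        last = shifted
        shifted >>= 1
    if r < 0:                              # final restoration cycle
        r += last                          # last == n at this point
    return r


print(nonrestoring_mod(3019, 53, 6))
```

Unlike the worked example, the sketch does not skip alignments, so its intermediate remainders differ, but it ends with the same value, 51.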
5.3.4 Interleaving Multiplication and Reduction
The interleaving algorithm is well known; the details of the method are
sketched in [27, 334]. Let $A_i$ and $B_i$ be the bits of the $k$-bit positive
integers $A$ and $B$, respectively. The product $P'$ can be written as
$$P' = A \cdot B = A \cdot \sum_{i=0}^{k-1} B_i 2^i = \sum_{i=0}^{k-1} (A \cdot B_i) 2^i$$
$$= 2(\cdots 2(2(0 + A \cdot B_{k-1}) + A \cdot B_{k-2}) + \cdots) + A \cdot B_0.$$
This formulation yields the shift-add multiplication algorithm. Notice that we
also reduce the partial product modulo $n$ at each step of Algorithm 5.5.
Algorithm 5.5 The Interleaving Multiplication Algorithm
Require: $A$, $B$, $n$.
Ensure: $P = A \cdot B \bmod n$.
1: $P := 0$;
2: for $i = 0$ to $k - 1$ do
3:   $P := 2P + A \cdot B_{k-1-i}$;
4:   $P := P \bmod n$;
5: end for
6: Return($P$)
Assuming that $A, B, P < n$, we have
$$P := 2P + A \cdot B_j \le 2(n - 1) + (n - 1) = 3n - 3.$$
Thus, the new $P$ will be in the range $0 \le P \le 3n - 3$, and at most 2
subtractions are needed to reduce $P$ to the range $0 \le P < n$. We can use the
following algorithm to bring $P$ back to this range:
    $P' := P - n$; If $P' \ge 0$ then $P := P'$
    $P' := P - n$; If $P' \ge 0$ then $P := P'$
The computation of $P$ requires $k$ steps; at each step we perform the following
operations:
• A left shift: $2P$
• A partial product generation: $A \cdot B_j$
• An addition: $P := 2P + A \cdot B_j$
• At most 2 subtractions:
    $P' := P - n$; If $P' \ge 0$ then $P := P'$
    $P' := P - n$; If $P' \ge 0$ then $P := P'$
The left shift operation is easily performed by wiring. The partial products,
on the other hand, are generated using an array of AND gates. The most
crucial operations are the addition and subtraction operations: they need to
be performed fast. We have the following avenues to explore:
• We can use the carry propagate adder, introducing $O(k)$ delay per step.
However, Omura's method can be used to avoid unnecessary subtractions:
    3a. $P := 2P$
    3b. If carry-out then $P := P + m$
    3c. $P := P + A \cdot B_j$
    3d. If carry-out then $P := P + m$
• We can use the carry save adder, introducing only $O(1)$ delay per step.
However, recall that the sign information is not immediately available in
the CSA. We need to perform fast sign detection in order to determine
whether the partial product needs to be reduced modulo $n$.
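The shift-add loop with at most two conditional subtractions per step can be sketched in Python as below. The function name `interleaved_modmul` is an illustrative choice, and the sketch assumes $0 \le A, B < n < 2^k$; it models the arithmetic only, not the adder structure.

```python
def interleaved_modmul(a, b, n, k):
    """A * B mod n by scanning the k bits of B from the most significant end."""
    p = 0
    for i in range(k):
        bit = (b >> (k - 1 - i)) & 1
        p = 2 * p + a * bit          # left shift plus partial product A * B_j
        for _ in range(2):           # at most two subtractions bring P below n
            trial = p - n
            if trial >= 0:
                p = trial
    return p


print(interleaved_modmul(47, 48, 50, 6))
```

Because $P < n$ holds at the start of every iteration, the shifted and updated value never exceeds $3n - 3$, which is why two trial subtractions are enough.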
5.3.5 Utilization of Carry Save Adders
In order to utilize the carry save adders in performing the modular
multiplication operations, we represent the numbers as the carry save pairs
$(C, S)$, where the value of the number is the sum $C + S$. The carry save adder method
of the interleaving algorithm is given in Algorithm 5.6.
Algorithm 5.6 The Carry-Save Interleaving Multiplication Algorithm
Require: $A$, $B$, $n$.
Ensure: $P = A \cdot B \bmod n$.
1: $(C, S) := (0, 0)$;
2: for $i = 0$ to $k - 1$ do
3:   $(C, S) := 2C + 2S + A \cdot B_{k-1-i}$;
4:   $(C', S') := C + S - n$;
5:   if SIGN $\ge 0$ then
6:     $(C, S) := (C', S')$;
7:   end if
8: end for
9: Return($C, S$)
The function SIGN gives the sign of the carry save number $C' + S'$. Since
the exact sign is available only when a full addition is performed, we calculate
an estimated sign with the SIGN function. A sign estimation algorithm was
introduced in [185]. Here, we briefly review this algorithm, which is based on
the addition of the most significant $t$ bits of $C$ and $S$ to estimate the sign of
$C + S$. For example, let $C = (011110)$ and $S = (001010)$; then the function
SIGN produces

    C = 011110
    S = 001010
    (t = 1) SIGN = 0
    (t = 2) SIGN = 01
    (t = 3) SIGN = 100
    (t = 4) SIGN = 1001
    (t = 5) SIGN = 10100
    (t = 6) SIGN = 101000.
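The table above can be reproduced by adding only the top $t$ bits of the two words. The following Python sketch does exactly that; the name `top_bits_sum` and the `width` parameter are assumptions of this illustration, and here $t$ counts the most significant bits that are added.

```python
def top_bits_sum(c, s, width, t):
    """Sum of the t most significant bits of two width-bit words c and s."""
    shift = width - t
    return (c >> shift) + (s >> shift)


c, s = 0b011110, 0b001010
for t in range(1, 7):
    # print the partial sum with t binary digits, as in the table
    print(t, format(top_bits_sum(c, s, 6, t), f"0{t}b"))
```

The leading bit of each partial sum serves as the estimate; only for $t = 6$ is the full sum $C + S$ formed.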
In the worst case the exact sign is produced after adding all $k$ bits. If the
exact sign of $C + S$ is computed, we can obtain the result of the multiplication
operation in the correct range $[0, n)$. If an estimation of the sign is used, then
we will prove that the range of the result becomes $[0, n + \Delta)$, where $\Delta$ depends
on the precision of the estimation. Furthermore, since the sign is used to decide
whether some multiple of $n$ should be subtracted from the partial product,
an error in the decision causes only an error of a multiple of $n$ in the partial
product, which is corrected later. We define the function $T(X)$ on an $n$-bit integer
$X$ as
$$T(X) = X - (X \bmod 2^t),$$
where $0 \le t \le n - 1$. In other words, $T$ replaces the least significant $t$
bits of $X$ with $t$ zeros. This implies
$$T(X) \le X < T(X) + 2^t.$$
We reduce the pair $(C, S)$ by performing the following operation $Q$ times:
I. $(\hat{C}, \hat{S}) := C + S - n$.
J. If $T(\hat{C}) + T(\hat{S}) \ge 0$ then set $C := \hat{C}$ and $S := \hat{S}$.
In Step J, the computation of the sign bit $R$ of $T(\hat{C}) + T(\hat{S})$ involves the $n - t$
most significant bits of $\hat{C}$ and $\hat{S}$. The above procedure reduces a carry-sum
pair from the range
$$0 \le C_0 + S_0 < (Q + 1)n + 2^t$$
to the range
$$0 \le C_R + S_R < n + 2^t,$$
where $(C_0, S_0)$ and $(C_R, S_R)$ respectively denote the initial and the final
carry-sum pair. Since the function $T$ always underestimates, the result is never
over-reduced, i.e.,
$$C_R + S_R \ge 0.$$
If the estimated sign in Step J is positive for all $Q$ iterations, then $QN$ is
subtracted from the initial pair; therefore
$$C_R + S_R = C_0 + S_0 - QN < n + 2^t.$$
If the estimated sign becomes negative in an iteration, it stays negative
thereafter to the last iteration. Thus, the condition
$$T(\hat{C}) + T(\hat{S}) < 0$$
in the last iteration of Step J implies that
$$T(\hat{C}) + T(\hat{S}) \le -2^t,$$
since $T(X)$ is always a multiple of $2^t$. Thus, we obtain the range of $\hat{C}$ and $\hat{S}$
as
$$T(\hat{C}) + T(\hat{S}) \le \hat{C} + \hat{S} < T(\hat{C}) + T(\hat{S}) + 2^{t+1}.$$
It follows from the above equations that
$$\hat{C} + \hat{S} < 2^{t+1} - 2^t = 2^t.$$
Since in Step I we perform $(\hat{C}, \hat{S}) := C + S - n$ and in the last iteration the
carry-sum pair is not reduced (because the estimated sign is negative), we
must have
$$C_R + S_R = \hat{C} + \hat{S} + n,$$
which implies
$$C_R + S_R < n + 2^t.$$
The modular reduction procedure described above subtracts $n$ from $(C, S)$ in
each of the $Q$ iterations. The procedure can be improved in speed by
subtracting $2^{k-j} n$ during iteration $j$, where $(Q + 1) \le 2^k$ and $j = 1, 2, 3, \ldots, k$.
For example, if $Q = 3$, then $k = 2$ can be used. Instead of subtracting $n$
three times, we first subtract $2N$ and then $n$. This observation is utilized in
Algorithm 5.7.

Algorithm 5.7 The Carry-Save Interleaving Multiplication Algorithm Revisited
Require: $A$, $B$, $n$.
Ensure: $P = A \cdot B \bmod n$.
1: Set $S^{(0)} := 0$ and $C^{(0)} := 0$.
2: for $i = 1$ to $k$ do
3:   $(C^{(i)}, S^{(i)}) := 2C^{(i-1)} + 2S^{(i-1)} + A_{k-i} B$;
4:   $(\hat{C}^{(i)}, \hat{S}^{(i)}) := C^{(i)} + S^{(i)} - 2n$;
5:   if $T(\hat{C}^{(i)}) + T(\hat{S}^{(i)}) \ge 0$ then
6:     $C^{(i)} := \hat{C}^{(i)}$ and $S^{(i)} := \hat{S}^{(i)}$;
7:   end if
8:   $(\hat{C}^{(i)}, \hat{S}^{(i)}) := C^{(i)} + S^{(i)} - n$;
9:   if $T(\hat{C}^{(i)}) + T(\hat{S}^{(i)}) \ge 0$ then
10:    $C^{(i)} := \hat{C}^{(i)}$ and $S^{(i)} := \hat{S}^{(i)}$;
11:  end if
12: end for
13: Return($C^{(k)}, S^{(k)}$)

The parameter $t$ controls the precision of estimation; the accuracy of the
estimation and the total amount of logic required to implement it decrease
as $t$ increases. After Step 7 of Algorithm 5.7, we have
$$C^{(i)} + S^{(i)} < n + 2^t,$$
which implies that after the next shift-add step the range of $C^{(i+1)} + S^{(i+1)}$
will be $[0, 3N + 2^{t+1})$. Assuming $Q = 3$, we have
$$3N + 2^{t+1} \le (Q + 1)n + 2^t = 4N + 2^t,$$
which implies $2^t \le n$, or $t \le n - 1$. The range of $C^{(i+1)} + S^{(i+1)}$ becomes
$$0 \le C^{(i+1)} + S^{(i+1)} < 3N + 2^{t+1} \le 3N + 2^n < 2^{n+2},$$
and after Step 4 of Algorithm 5.7, the range will be
$$-2^{n+1} < -2N \le \hat{C}^{(i+1)} + \hat{S}^{(i+1)} < N + 2^{t+1} < 2^{n+1}.$$
In order to contain the temporary results, we use $(n + 3)$-bit carry save adders
which can represent integers in the range $[-2^{n+2}, 2^{n+2})$. When $t = n - 1$,
the sign estimation technique checks 5 most significant bits of $\hat{C}^{(i)}$ and $\hat{S}^{(i)}$
from the bit locations $n - 2$ to $n + 3$. This algorithm produces a pair of
integers $(C, S) = (C^{(k)}, S^{(k)})$ such that $P = C + S$ is in the range $[0, 2N)$.
The final result in the correct range $[0, n)$ can be obtained by computing
$P = C + S$ and $\hat{P} = C + S - n$ using carry propagate adders. If $\hat{P} < 0$,
we have $P = \hat{P} + n < n$, and thus $P$ is in the correct range. Otherwise, we
choose $\hat{P}$ because $0 \le \hat{P} = P - n < 2^t \le n$ implies $\hat{P} \in [0, n)$. The steps of
the algorithm for computing $47 \cdot 48 \pmod{50}$ are illustrated in the following
figure. Here we have
    k = ⌊log2(50)⌋ + 1 = 6,
    A = 47 = (000101111),
    B = 48 = (000110000),
    n = 50 = (000110010),
    M = -n = (111001110).
The algorithm computes the final result
$$(C, S) = (010111000, 110000000) = (184, -128)$$
in $3k = 18$ clock cycles. The range of $C + S = 184 - 128 = 56$ is $[0, 2 \cdot 50)$.
The final result is found by computing $C + S = 56$ and $C + S - n = 6$, and
selecting the latter since it is positive.
    i  Step  C          S          Ĉ          Ŝ          T(Ĉ)+T(Ŝ)  R
    0        000000000  000000000  -          -          -          -
    1  2a    000000000  000110000  -          -          -          -
       2b    000000000  000110000  000100000  110101100  111000000  1
       2c    000000000  000110000  000000000  111111110  111100000  1
    2  2a    000000000  001100000  -          -          -          -
       2b    000000000  001100000  000000000  111111100  111100000  1
       2c    010000000  110101110  010000000  110101110  000100000  0
    3  2a    000100000  001101100  -          -          -          -
       2b    001011000  111010000  001011000  111010000  000000000  0
       2c    001011000  111010000  110110000  001000110  111100000  1
    4  2a    101100000  100100000  -          -          -          -
       2b    001000000  111011100  001000000  111011100  000000000  0
       2c    001000000  111011100  110011000  001010010  111000000  1
    5  2a    101100000  100001000  -          -          -          -
       2b    101100000  100001000  000010000  111110100  111100000  1
       2c    010010000  110100110  010010000  110100110  000100000  0
    6  2a    001000000  001011100  -          -          -          -
       2b    010111000  110000000  010111000  110000000  000100000  0
       2c    010111000  110000000  100010000  011110110  111100000  1
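The trace above can be replayed with a bit-level Python sketch of the carry-save interleaving method. The sketch makes several assumptions that are not spelled out in the text: each carry-save step is modeled as a single three-input adder (`csa`), all words are truncated to $k + 3$ bits in 2's complement, the sign estimate zeroes the low $t$ bits of both words before adding them, and the function names are invented here.

```python
def csa(x, y, z, mask):
    """One carry-save adder level: returns (carry word, sum word)."""
    s = x ^ y ^ z
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return c, s


def estimated_negative(c, s, t, width):
    """Sign bit of T(c) + T(s), where T clears the t least significant bits."""
    mask = (1 << width) - 1
    est = (((c >> t) << t) + ((s >> t) << t)) & mask
    return bool(est >> (width - 1))


def carry_save_modmul(a, b, n, k, t):
    """Returns a carry-save pair (C, S) with C + S congruent to a * b mod n."""
    width = k + 3
    mask = (1 << width) - 1
    c = s = 0
    for i in range(1, k + 1):
        addend = b if (a >> (k - i)) & 1 else 0
        c, s = csa((2 * c) & mask, (2 * s) & mask, addend, mask)    # shift and add
        for multiple in (2 * n, n):                                  # subtract 2n, then n
            c_hat, s_hat = csa(c, s, (-multiple) & mask, mask)
            if not estimated_negative(c_hat, s_hat, t, width):
                c, s = c_hat, s_hat
    return c, s


c, s = carry_save_modmul(47, 48, 50, k=6, t=5)
p = (c + s) & 0x1FF                  # carry propagate addition on 9 bits
print(format(c, "09b"), format(s, "09b"), p, p - 50 if p >= 50 else p)
```

Under these assumptions the sketch ends with the same pair as the trace, $(010111000, 110000000)$, whose 9-bit sum is 56; the final conditional subtraction of $n = 50$ gives 6.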
5.3.6 Brickell's Method
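This method is based on the use of a carry delayed integer introduced in
§5.1.6. Let $A$ be a carry delayed integer; then it can be written as
$$A = \sum_{i=0}^{k-1} (T_i + D_i) \cdot 2^i.$$
The product $P = AB$ can be computed by summing the terms:
$$(T_0 \cdot B + D_0 \cdot B) \cdot 2^0 +$$
$$(T_1 \cdot B + D_1 \cdot B) \cdot 2^1 +$$
$$(T_2 \cdot B + D_2 \cdot B) \cdot 2^2 +$$
$$\vdots$$
$$(T_{k-1} \cdot B + D_{k-1} \cdot B) \cdot 2^{k-1}$$
Since $D_0 = 0$, we rearrange to obtain
$$2^0 \cdot T_0 \cdot B + 2^1 \cdot D_1 \cdot B +$$
$$2^1 \cdot T_1 \cdot B + 2^2 \cdot D_2 \cdot B +$$
$$2^2 \cdot T_2 \cdot B + 2^3 \cdot D_3 \cdot B +$$
$$\vdots$$
$$2^{k-2} \cdot T_{k-2} \cdot B + 2^{k-1} \cdot D_{k-1} \cdot B +$$
$$2^{k-1} \cdot T_{k-1} \cdot B$$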
Also recall that either $T_i$ or $D_{i+1}$ is zero due to the property of the carry
delayed adder. Thus, each step requires a shift of $B$ and addition of at most
2 carry delayed integers:
• Either: $(P_d, P_t) := (P_d, P_t) + 2^i \cdot T_i \cdot B$
• Or: $(P_d, P_t) := (P_d, P_t) + 2^{i+1} \cdot D_{i+1} \cdot B$
After $k$ steps $P = (P_d, P_t)$ is obtained. In order to compute $P \pmod n$, we
perform reduction:
    If $P \ge 2^{k-1} \cdot n$ then $P := P - 2^{k-1} \cdot n$
    If $P \ge 2^{k-2} \cdot n$ then $P := P - 2^{k-2} \cdot n$
    If $P \ge 2^{k-3} \cdot n$ then $P := P - 2^{k-3} \cdot n$
    ...
    If $P \ge n$ then $P := P - n$
We can also reverse these steps to obtain:
    $P := T_{k-1} \cdot B \cdot 2^{k-1}$
    $P := P + T_{k-2} \cdot B \cdot 2^{k-2} + D_{k-1} \cdot B \cdot 2^{k-1}$
    $P := P + T_{k-3} \cdot B \cdot 2^{k-3} + D_{k-2} \cdot B \cdot 2^{k-2}$
    ...
    $P := P + T_1 \cdot B \cdot 2^1 + D_2 \cdot B \cdot 2^2$
    $P := P + T_0 \cdot B \cdot 2^0 + D_1 \cdot B \cdot 2^1$
Also, the multiplication steps can be interleaved with reduction steps. To
perform the reduction, the sign of $P - 2^i \cdot n$ needs to be determined (estimated).
Brickell's solution [33] is essentially a combination of the sign estimation
technique and Omura's method of correction. We allow enough bits for $P$, and
whenever $P$ exceeds $2^k$, add $m = 2^k - n$ to correct the result. 11 steps after
the multiplication procedure started, the algorithm starts subtracting
multiples of $n$. In the following, $P$ is a carry delayed integer of $k + 11$ bits, $m$ is
a binary integer of $k$ bits, and $t_1$ and $t_2$ are control bits, whose initial values are
$t_1 = t_2 = 0$.
1. Add the most significant 4 bits of $P$ and $m \cdot 2^{11}$.
2. If overflow is detected, then $t_2 = 1$ else $t_2 = 0$.
3. Add the most significant 4 bits of $P$ and the most significant 3 bits of
$m \cdot 2^{10}$.
4. If overflow is detected and $t_2 = 0$, then $t_1 = 1$ else $t_1 = 0$.
The multiplication and reduction steps of Brickell's algorithm are as follows:
    $B' := T_i \cdot B + 2 \cdot D_{i+1} \cdot B$
    $m' := t_2 \cdot m \cdot 2^{11} + t_1 \cdot m \cdot 2^{10}$
    $P := 2(P + B' + m')$
    $A := 2A$.

5.3.7 Montgomery's Method
In 1985, P. L. Montgomery introduced an efficient algorithm [238] for computing
$R = A \cdot B \bmod n$ where $A$, $B$, and $n$ are $k$-bit binary numbers. The
Montgomery reduction algorithm computes the resulting $k$-bit number $R$ without
performing a division by the modulus $n$. Via an ingenious representation of
the residue class modulo $n$, this algorithm replaces the division by $n$
with a division by a power of 2. This operation is easily accomplished on a
computer since the numbers are represented in binary form. Assuming the
modulus $n$ is a $k$-bit number, i.e., $2^{k-1} \le n < 2^k$, let $r$ be $2^k$. The
Montgomery reduction algorithm requires that $r$ and $n$ be relatively prime, i.e.,
$\gcd(r, n) = \gcd(2^k, n) = 1$. This requirement is satisfied if $n$ is odd. In the
following we summarize the basic idea behind the Montgomery reduction
algorithm.
Given an integer $A < n$, we define its $n$-residue with respect to $r$ as
$$\bar{A} = A \cdot r \bmod n.$$
It is straightforward to show that the set
$$\{\, i \cdot r \bmod n \mid 0 \le i \le n - 1 \,\}$$
is a complete residue system, i.e., it contains all numbers between 0 and $n - 1$.
Thus, there is a one-to-one correspondence between the numbers in the range
0 and $n - 1$ and the numbers in the above set. The Montgomery reduction
algorithm exploits this property by introducing a much faster multiplication
routine which computes the $n$-residue of the product of the two integers whose
$n$-residues are given. Given two $n$-residues $\bar{A}$ and $\bar{B}$, the Montgomery product
is defined as the $n$-residue
$$\bar{R} = \bar{A} \cdot \bar{B} \cdot r^{-1} \bmod n,$$
where $r^{-1}$ is the inverse of $r$ modulo $n$, i.e., it is the number with the property
$$r^{-1} \cdot r = 1 \pmod n.$$
The resulting number $\bar{R}$ is indeed the $n$-residue of the product
$$R = A \cdot B \bmod n$$
since
$$\bar{R} = \bar{A} \cdot \bar{B} \cdot r^{-1} \bmod n$$
$$= A \cdot r \cdot B \cdot r \cdot r^{-1} \bmod n$$
$$= A \cdot B \cdot r \bmod n.$$
In order to describe the Montgomery reduction algorithm, we need an
additional quantity, $n'$, which is the integer with the property
$$r \cdot r^{-1} - n \cdot n' = 1.$$
The integers $r^{-1}$ and $n'$ can both be computed by the extended Euclidean
algorithm [178]. The Montgomery product algorithm, which computes
$$u = \bar{A} \cdot \bar{B} \cdot r^{-1} \pmod n$$
given $\bar{A}$ and $\bar{B}$, is given in Algorithm 5.8 below.
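Algorithm 5.8 Montgomery Product
Require: $\bar{A}$, $\bar{B}$, $r$, $n$.
Ensure: $u$ = MonPro($\bar{A}, \bar{B}$) = $\bar{A} \cdot \bar{B} \cdot r^{-1} \pmod n$.
$t := \bar{A} \cdot \bar{B}$;
$m := t \cdot n' \bmod r$;
$u := (t + m \cdot n)/r$;
if $u \ge n$ then
  Return($u - n$)
else
  Return($u$)
end if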
The most important feature of the Montgomery product algorithm is that
the operations involved are multiplications modulo $r$ and divisions by $r$, both
of which are intrinsically fast operations since $r$ is a power of 2. Using MonPro,
Algorithm 5.9 computes the product of $A$ and $B$ modulo $n$,
provided that $n$ is odd.
Algorithm 5.9 Montgomery Modular Multiplication: Version I
Require: $A$, $B$, an odd number $n$.
Ensure: $x = A \cdot B \pmod n$.
1: Compute $n'$ using the extended Euclidean algorithm.
2: $\bar{A} := A \cdot r \bmod n$;
3: $\bar{B} := B \cdot r \bmod n$;
4: $\bar{x}$ := MonPro($\bar{A}, \bar{B}$);
5: $x$ := MonPro($\bar{x}$, 1);
6: Return($x$)
A better algorithm can be given by observing the property
$$\text{MonPro}(\bar{A}, B) = (A \cdot r) \cdot B \cdot r^{-1} = A \cdot B \pmod n,$$
which modifies Algorithm 5.9 as shown in Algorithm 5.10. However, the
preprocessing operations, especially the computation of $n'$, are rather
time-consuming.
Algorithm 5.10 Montgomery Modular Multiplication: Version II
Require: $A$, $B$, an odd number $n$.
Ensure: $x = A \cdot B \pmod{n}$.
1: Compute $n'$ using the extended Euclidean algorithm.
2: $\bar{A} := A \cdot r \bmod n$;
3: $x := \mathrm{MonPro}(\bar{A}, B)$;
4: Return($x$)
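The two versions can be compared side by side in a short Python sketch. The function names, the choice $r = 2^k$ with $k$ equal to the bit length of $n$, and the use of `pow(n, -1, r)` (Python 3.8+) in place of an explicit extended Euclidean routine are all assumptions made for illustration.

```python
def mon_pro(a_bar, b_bar, n, k, n_prime):
    """Montgomery product a_bar * b_bar * 2^(-k) mod n."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = (t * n_prime) & mask
    u = (t + m * n) >> k
    return u - n if u >= n else u


def _setup(n):
    """Pick r = 2**k > n and compute n' = -n^(-1) mod r (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    k = n.bit_length()
    r = 1 << k
    return k, r, (-pow(n, -1, r)) % r


def mont_mul_v1(a, b, n):
    """Version I: convert both operands, multiply, convert back."""
    k, r, n_prime = _setup(n)
    a_bar = (a * r) % n
    b_bar = (b * r) % n
    x_bar = mon_pro(a_bar, b_bar, n, k, n_prime)
    return mon_pro(x_bar, 1, n, k, n_prime)


def mont_mul_v2(a, b, n):
    """Version II: convert only one operand; one MonPro call suffices."""
    k, r, n_prime = _setup(n)
    a_bar = (a * r) % n
    return mon_pro(a_bar, b % n, n, k, n_prime)
```

Version II saves one conversion and one MonPro call, but both versions still pay for the conversion $A \cdot r \bmod n$, so neither is attractive for a single multiplication.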
Nevertheless, there is an efficient algorithm for computing the single
precision integer $n'_0$. The computation of $n'_0$ can be performed by a specialized
Euclidean algorithm instead of the general extended Euclidean algorithm.
Since $r = 2^{sw}$ and
$$r \cdot r^{-1} - n \cdot n' = 1,$$
we take modulo $2^w$ of both sides, and obtain
$$-n \cdot n' = 1 \pmod{2^w},$$
or, in other words,
$$n'_0 = -n_0^{-1} \pmod{2^w},$$
where $n'_0$ and $n_0^{-1}$ are the least significant words (the least significant $w$ bits)
of $n'$ and $n^{-1}$, respectively. In order to compute $-n_0^{-1} \pmod{2^w}$, we use
Algorithm 5.11 given below, which computes $x^{-1} \pmod{2^w}$ for a given odd $x$.
Algorithm 5.11 Specialized Modular Inverse
Require: an odd number $x$ and $w$.
Ensure: $y_w = x^{-1} \pmod{2^w}$.
1: $y_1 := 1$;
2: for $i = 2$ to $w$ do
3:   if $2^{i-1} < x \cdot y_{i-1} \pmod{2^i}$ then
4:     $y_i := y_{i-1} + 2^{i-1}$;
5:   else
6:     $y_i := y_{i-1}$;
7:   end if
8: end for
9: Return($y_w$)
The correctness of the algorithm follows from the observation that, at every
step $i$, we have
$$x \cdot y_i = 1 \pmod{2^i}.$$
This algorithm is very efficient, and uses single precision additions and
multiplications in order to compute $x^{-1}$. As an example, we compute $23^{-1} \pmod{64}$
using the above algorithm. Here we have $x = 23$, $w = 6$. The steps of the
algorithm, the temporary values, and the final inverse are shown below:
i    2^i    y_{i-1}    x · y_{i-1} (mod 2^i)    2^{i-1}    y_i
2    4      1          23 · 1 = 3               2          1 + 2 = 3
3    8      3          23 · 3 = 5               4          3 + 4 = 7
4    16     7          23 · 7 = 1               8          7
5    32     7          23 · 7 = 1               16         7
6    64     7          23 · 7 = 33              32         7 + 32 = 39
Thus, we compute $23^{-1} = 39 \pmod{64}$. This is indeed the correct value since
$$23 \cdot 39 = 14 \cdot 64 + 1 = 1 \pmod{64}.$$
Also, at every step $i$, we have $x \cdot y_i = 1 \pmod{2^i}$, as shown below:
i    x · y_i mod 2^i
1    23 · 1 = 1 mod 2
2    23 · 3 = 1 mod 4
3    23 · 7 = 1 mod 8
4    23 · 7 = 1 mod 16
5    23 · 7 = 1 mod 32
6    23 · 39 = 1 mod 64
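The specialized inverse and its use for the least significant word of $n'$ can be sketched in Python as below. The function names `inverse_mod_power_of_two` and `n0_prime` are hypothetical, chosen here for illustration; the loop follows Algorithm 5.11 step by step and keeps only the running value of $y$ rather than the whole sequence $y_1, \ldots, y_w$.

```python
def inverse_mod_power_of_two(x, w):
    """Return x^(-1) mod 2**w for odd x, following Algorithm 5.11."""
    if x % 2 == 0:
        raise ValueError("x must be odd")
    y = 1                                   # y_1 = 1, since x * 1 = 1 (mod 2)
    for i in range(2, w + 1):
        # x * y_{i-1} mod 2^i is either 1 or 1 + 2^(i-1); in the latter
        # case, adding 2^(i-1) to y restores x * y_i = 1 (mod 2^i).
        if (x * y) % (1 << i) > (1 << (i - 1)):
            y += 1 << (i - 1)
    return y


def n0_prime(n, w):
    """Return n'_0 = -n_0^(-1) mod 2**w, with n_0 the low w bits of odd n."""
    mask = (1 << w) - 1
    return (-inverse_mod_power_of_two(n & mask, w)) & mask
```

Calling `inverse_mod_power_of_two(23, 6)` reproduces the value 39 from the worked example, and each iteration touches only $w$-bit quantities.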
Montgomery Exponentiation
The Montgomery product algorithm is more suitable when several modular
multiplications with respect to the same modulus are needed. Such is the case
when one needs to compute a modular exponentiation, i.e., the computation
of $M^e \bmod n$. Using one of the addition chain algorithms given in §5.4, we
replace the exponentiation operation by a series of square and multiplication
operations modulo $n$. This is where the Montgomery product operation finds
its best use. In the following we summarize the modular exponentiation
operation which makes use of the Montgomery product function MonPro. The
exponentiation Algorithm 5.12 below uses the binary method.
Thus, we start with the ordinary residue $M$ and obtain its n-residue $\bar{M}$
using a division-like operation, which can be achieved, for example, by a series
of shift and subtract operations. Additionally, Steps 2 and 3 of Algorithm 5.12
require divisions. However, once the preprocessing has been completed, the
inner loop of the binary exponentiation method uses the Montgomery product
operation, which performs only multiplications modulo $2^k$ and divisions by $2^k$.
When the binary method finishes, we obtain the n-residue $\bar{x}$ of the quantity
$x = M^e \bmod n$. The ordinary residue number is obtained from the n-residue
by executing the MonPro function with arguments $\bar{x}$ and 1. This is easily
shown to be correct since
$$\bar{x} = x \cdot r \bmod n$$
immediately implies that
$$x = \bar{x} \cdot r^{-1} \bmod n = \bar{x} \cdot 1 \cdot r^{-1} \bmod n := \mathrm{MonPro}(\bar{x}, 1).$$
Algorithm 5.12 Montgomery Modular Exponentiation
Require: $M$, $e = (e_{k-1} e_{k-2} \cdots e_0)$, an odd number $n$.
Ensure: $x = M^e \pmod{n}$.
1: Compute $n'$ using the extended Euclidean algorithm.
2: $\bar{M} := M \cdot r \bmod n$;
3: $\bar{x} := 1 \cdot r \bmod n$;
4: for $i = k - 1$ down to 0 do
5:   $\bar{x} := \mathrm{MonPro}(\bar{x}, \bar{x})$;
6:   if $e_i = 1$ then
7:     $\bar{x} := \mathrm{MonPro}(\bar{M}, \bar{x})$;
8:   end if
9: end for
10: $x := \mathrm{MonPro}(\bar{x}, 1)$;
11: Return($x$)
The resulting algorithm is quite fast, as was demonstrated by many researchers
and engineers who have implemented it; for example, see [72, 200]. However,
this algorithm can be refined and made more efficient, particularly when the
numbers involved are multi-precision integers. For example, Dusse and Kaliski
[72] gave improved algorithms, including a simple and efficient method for
computing $n'$. We will describe these methods below.
An Example of Exponentiation
Here we show how to compute $x = 7^{10} \bmod 13$ using the Montgomery
exponentiation algorithm.
• Since $n = 13$, we take $r = 2^4 = 16 > n$.
• Computation of $n'$:
  Using the extended Euclidean algorithm, we determine that
  $16 \cdot 9 - 13 \cdot 11 = 1$; thus, $r^{-1} = 9$ and $n' = 11$.
• Computation of $\bar{M}$:
  Since $M = 7$, we have $\bar{M} := M \cdot r \pmod{n} = 7 \cdot 16 \pmod{13} = 8$.
• Computation of $\bar{x}$ for $x = 1$:
  We have $\bar{x} := x \cdot r \pmod{n} = 1 \cdot 16 \pmod{13} = 3$.
• Steps 5 and 7 of the ModExp routine:
  e_i    Step 5                Step 7
  1      MonPro(3, 3) = 3      MonPro(8, 3) = 8
  0      MonPro(8, 8) = 4
  1      MonPro(4, 4) = 1      MonPro(8, 1) = 7
  0      MonPro(7, 7) = 12
  o Computation of MonPro(3, 3) = 3:
      $t := 3 \cdot 3 = 9$
      $m := 9 \cdot 11 \pmod{16} = 3$
      $u := (9 + 3 \cdot 13)/16 = 48/16 = 3$
  o Computation of MonPro(8, 3) = 8:
      $t := 8 \cdot 3 = 24$
      $m := 24 \cdot 11 \pmod{16} = 8$
      $u := (24 + 8 \cdot 13)/16 = 128/16 = 8$
  o Computation of MonPro(8, 8) = 4:
      $t := 8 \cdot 8 = 64$
      $m := 64 \cdot 11 \pmod{16} = 0$
      $u := (64 + 0 \cdot 13)/16 = 64/16 = 4$
  o Computation of MonPro(4, 4) = 1:
      $t := 4 \cdot 4 = 16$
      $m := 16 \cdot 11 \pmod{16} = 0$
      $u := (16 + 0 \cdot 13)/16 = 16/16 = 1$
  o Computation of MonPro(8, 1) = 7:
      $t := 8 \cdot 1 = 8$
      $m := 8 \cdot 11 \pmod{16} = 8$
      $u := (8 + 8 \cdot 13)/16 = 112/16 = 7$
  o Computation of MonPro(7, 7) = 12:
      $t := 7 \cdot 7 = 49$
      $m := 49 \cdot 11 \pmod{16} = 11$
      $u := (49 + 11 \cdot 13)/16 = 192/16 = 12$
• Step 10 of the ModExp routine: $x = \mathrm{MonPro}(12, 1) = 4$
      $t := 12 \cdot 1 = 12$
      $m := 12 \cdot 11 \pmod{16} = 4$
      $u := (12 + 4 \cdot 13)/16 = 64/16 = 4$
Thus, we obtain $x = 4$ as the result of the operation $7^{10} \bmod 13$.
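The whole exponentiation, including the worked example above, can be checked with a short Python sketch. The names `mont_exp` and `mon_pro`, the default choice of $k$ as the bit length of $n$, and the use of `pow(n, -1, r)` (Python 3.8+) instead of an explicit extended Euclidean routine are assumptions made for illustration only.

```python
def mon_pro(a_bar, b_bar, n, k, n_prime):
    """Montgomery product a_bar * b_bar * 2^(-k) mod n."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = (t * n_prime) & mask
    u = (t + m * n) >> k
    return u - n if u >= n else u


def mont_exp(m, e, n, k=None):
    """Return m**e mod n for odd n, using the binary method with MonPro."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    if k is None:
        k = n.bit_length()              # guarantees r = 2**k > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r      # Step 1
    m_bar = (m * r) % n                 # Step 2: n-residue of M
    x_bar = r % n                       # Step 3: n-residue of 1
    # Steps 4-9: scan the exponent bits from most to least significant.
    for i in range(e.bit_length() - 1, -1, -1):
        x_bar = mon_pro(x_bar, x_bar, n, k, n_prime)
        if (e >> i) & 1:
            x_bar = mon_pro(m_bar, x_bar, n, k, n_prime)
    return mon_pro(x_bar, 1, n, k, n_prime)   # Step 10: back to ordinary residue
```

With `mont_exp(7, 10, 13)` the intermediate values of `x_bar` follow the table above (3, 8, 4, 1, 7, 12) and the final result is 4. Only Steps 2 and 3 use a true reduction modulo $n$; everything inside the loop is a mask and a shift.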
Hardware Implementation of the Montgomery Method
In the rest of this section, we introduce an efficient binary add-shift algorithm
for computing MonPro($A$, $B$), and then generalize it to the m-ary method.
We take $r = 2^k$, and assume that the number of bits in $A$ or $B$ is less than
$k$. Let $A = (A_{k-1} A_{k-2} \cdots A_0)$ be the binary representation of $A$. The above
product can be written as
$$2^{-k} \cdot (A_{k-1} A_{k-2} \cdots A_0) \cdot B = 2^{-k} \cdot \sum_{i=0}^{k-1} A_i \cdot 2^i \cdot B \pmod{n}.$$
The product $t = (A_0 + A_1 2 + \cdots + A_{k-1} 2^{k-1}) \cdot B$ can be computed by starting
from the most significant bit, and then proceeding to the least significant, as
follows:
1. $t := 0$
2. for $i = k - 1$ to 0
   2a. $t := 2 \cdot t$
   2b. $t := t + A_i \cdot B$
The shift factor $2^{-k}$ in $2^{-k} \cdot A \cdot B$ reverses the direction of summation. Since
$$2^{-k} \cdot (A_0 + A_1 2 + \cdots + A_{k-1} 2^{k-1}) = A_{k-1} 2^{-1} + A_{k-2} 2^{-2} + \cdots + A_0 2^{-k},$$
we start processing the bits of $A$ from the least significant, and obtain the
following binary add-shift algorithm to compute $t = A \cdot B \cdot 2^{-k}$, as shown in
Algorithm 5.13.
Algorithm 5.13 Add-and-Shift Montgomery Product
Require: $A$, $B$.
Ensure: $t = A \cdot B \cdot 2^{-k}$.
1: $t := 0$;
2: for $i = 0$ to $k - 1$ do
3:   $t := t + A_i \cdot B$;
4:   $t := t/2$;
5: end for
6: Return($t$)
Algorithm 5.13 computes the product $t = 2^{-k} \cdot A \cdot B$; however, we are
interested in computing $u = 2^{-k} \cdot A \cdot B \pmod{n}$. This can be achieved by
subtracting $n$ during every add-shift step, but there is a simpler way: We add
$n$ to $u$ if $u$ is odd, making the new $u$ an even number since $n$ is always odd. If $u$ is
even after the addition step, it is left untouched. Thus, $u$ will always be even
before the shift step, and we can compute
$$u := u \cdot 2^{-1} \pmod{n}$$
by shifting the even number $u$ to the right since $u = 2v$ implies
$$u := 2v \cdot 2^{-1} = v \pmod{n}.$$
The binary add-shift algorithm computes the product $u = A \cdot B \cdot 2^{-k} \pmod{n}$
as shown in Algorithm 5.14.
Algorithm 5.14 Binary Add-and-Shift Montgomery Product
Require: $A$, $B$, an odd number $n$.
Ensure: $u = A \cdot B \cdot 2^{-k} \pmod{n}$.
1: $u := 0$;
2: for $i = 0$ to $k - 1$ do
3:   $u := u + A_i \cdot B$;
4:   if $u$ is odd then
5:     $u := u + n$;
6:   end if
7:   $u := u/2$;
8: end for
9: Return($u$)
We reserve a $(k + 1)$-bit register for $u$ because if $u$ has $k$ bits at the beginning
of an add-shift step, the addition of $A_i \cdot B$ and $n$ (both of which are $k$-bit
numbers) increases its length to $k + 1$ bits. The right shift operation then
brings it back to $k$ bits. After $k$ add-shift steps, we subtract $n$ from $u$ if
$u \geq n$.
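A bit-serial Python sketch of the binary add-and-shift product is given below. The function name `mon_pro_add_shift` is an assumption introduced here, and the final conditional subtraction described in the paragraph above is folded into the return statement. The sketch assumes $A < 2^k$ and $B < n$, which keeps $u$ below $2n$ throughout the loop.

```python
def mon_pro_add_shift(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n, one bit of a per iteration.

    Assumes 0 <= a < 2**k and 0 <= b < n.
    """
    u = 0
    for i in range(k):
        if (a >> i) & 1:      # Step 3: add B when bit A_i is set
            u += b
        if u & 1:             # Steps 4-6: make u even by adding the odd n
            u += n
        u >>= 1               # Step 7: exact division by 2
    return u - n if u >= n else u
```

No multiplication and no division by $n$ appears in the loop, only additions and one-bit shifts, which is what makes the scheme attractive for hardware.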
Also note that Steps 3 and 4 of Algorithm 5.14 can be combined:
We can compute the least significant bit $u_0$ of $u$ before actually computing
the sum in Step 3. It is given as
$$u_0 := u_0 \oplus (A_i \cdot B_0).$$
Thus, we decide whether $u$ is odd prior to performing the full addition
operation $u := u + A_i B$. This is the most important property of Montgomery's
method. In contrast, the classical modular multiplication algorithms (e.g., the
interleaving method) compute the entire sum in order to decide whether a
reduction needs to be performed.
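The effect of predicting the parity can be illustrated with the following Python sketch, in which the least significant bit of the sum is obtained from single bits before any wide addition takes place. The function name `mon_pro_predicted` is hypothetical, and the single three-operand addition stands in for what would be one adder pass in hardware; this is an illustration of the idea, not a description of a particular circuit.

```python
def mon_pro_predicted(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n, predicting the parity of u.

    Assumes 0 <= a < 2**k and 0 <= b < n.
    """
    u = 0
    b0 = b & 1
    for i in range(k):
        a_i = (a >> i) & 1
        # Parity of u + a_i * b, computed from three single bits only.
        u0 = (u & 1) ^ (a_i & b0)
        # One combined addition: n is added exactly when the sum would be odd.
        u = (u + a_i * b + u0 * n) >> 1
    return u - n if u >= n else u
```

Because `u0` depends only on the low bits, the decision to add $n$ is available before the carry chain of the wide addition has resolved.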
5.3.8 High-Radix Interleaving Method
Since the speed of radix-2 multipliers is approaching its limits, the use of higher
radices is investigated. High-radix operations require fewer clock cycles;
however, the cycle time and the required area increase. Let $2^b$ be the radix.
The key operation in computing $P = AB \pmod{n}$ is the computation of an
inner-product step coupled with modular reduction, i.e., the computation of
$$P := 2^b \cdot P + A \cdot B_i - Q \cdot n,$$
where $P$ is the partial product and $B_i$ is the $i$th digit of $B$ in radix $2^b$.
The value of $Q$ determines the number of times the modulus $n$ is subtracted
from the partial product $P$ in order to reduce it modulo $n$. We compute $Q$
by dividing the current value of the partial product $P$ by $n$, which is then
multiplied by $n$ and subtracted from the partial product during the next
cycle. This implementation is illustrated in Fig. 5.8.
[Figure: block diagram with the multiplier B shifted left b bits per cycle, an accumulator, and a divide-by-n unit]
Fig. 5.8. High-Radix Interleaving Method
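The inner-product step can be sketched in software as follows. This is a simplified model under stated assumptions: the function name `high_radix_interleave_mul` and the parameter `radix_bits` (the $b$ of the text) are introduced here, and the quotient $Q$ is subtracted within the same iteration, whereas the hardware described above performs that subtraction during the next cycle.

```python
def high_radix_interleave_mul(a, b, n, radix_bits):
    """Return a * b mod n, consuming radix_bits bits of b per iteration."""
    mask = (1 << radix_bits) - 1
    # Number of radix-2^b digits of b (at least one, so b = 0 also works).
    num_digits = max(1, -(-b.bit_length() // radix_bits))
    p = 0
    for i in range(num_digits - 1, -1, -1):       # most significant digit first
        b_i = (b >> (i * radix_bits)) & mask
        p = (p << radix_bits) + a * b_i           # P := 2^b * P + A * B_i
        q = p // n                                # Q: how many times n fits
        p -= q * n                                # P := P - Q * n
    return p
```

Raising `radix_bits` reduces the number of iterations proportionally, at the price of a wider digit product `a * b_i` and a larger quotient `q` in every step, which mirrors the cycle-count versus cycle-time trade-off mentioned in the text.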
For radix 2, the partial product generation is performed using an array of
AND gates. The partial product generation is much more complex for higher
radices, e.g., Wallace trees and generalized counters need to be used. However,
the generation of the high-radix partial products does not greatly increase
cycle time since this computation can be easily pipelined. The most complicated