The Implementation and Analysis of the
ECDSA on the Motorola StarCore SC140 DSP
Primarily Targeting Portable Devices
by
Eric W. Smith
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Applied Science
In
Electrical and Computer Engineering
Waterloo, Ontario, Canada, 2002
© Eric W. Smith 2002
I hereby declare that I am the sole author of this thesis.
I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the
purpose of scholarly research.
I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other
means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly
research.
The University of Waterloo requires the signatures of all persons using or photocopying this thesis.
Please sign below, and state an address and date.
Abstract
The viability of the Elliptic Curve Digital Signature Algorithm (ECDSA) on portable devices is
important due to the growing wireless communications industry, which has inherent insecurities. The
StarCore SC140 DSP (SC140) targets portable devices, and therefore is a prime candidate to study the
viability of the ECDSA on such devices. The ECDSA was implemented on the SC140 using a Koblitz
curve over GF(2^{163}). The τ-adic representation of polynomials involved in the elliptic curve
point-multiplication is exploited to achieve superior performance. The ECDSA was implemented and
optimized in C and assembly, and verified in hardware. The performance of the C and assembly
implementations is analyzed and compared to previously published results. The ability of the compiler
to generate efficient cryptographic related code and the SC140 to perform efficient operations is
discussed. Numerous compiler optimization improvements that considerably enhance the performance
of the generated assembly are suggested. Coding guidelines that state simple measures to improve the
performance of the implementation and help to achieve efficient C and assembly are listed. Finally,
security issues concerning the implementation, with a focus on side-channel attacks (SCA), are
investigated, including estimated performance penalties due to adding resiliency. Two SCA
countermeasures specific to the implementation are also described. In summary, the implemented
ECDSA signature generation and verification processes require 4.43 ms and 8.63 ms, respectively,
when the SC140 operates at 300 MHz. Methods of optimizing the implementation to further reduce
execution times are also presented.
Acknowledgements
The author would like to thank his supervisor, Professor Catherine Gebotys, for her aid and direction
throughout the development of the thesis, as well as the use of computing resources and the StarCore
SC140 Software Development Platform (SDP). He would also like to thank friends and family for
their support, without which the completion of the thesis would not have been possible.
The author is extremely grateful for the financial support provided by the Natural Sciences
and Engineering Research Council of Canada (NSERC), Motorola, his supervisor and the Department of
Electrical and Computer Engineering at the University of Waterloo. This support was provided through
various scholarships, which allowed the author to focus more thoroughly on his research and studies.
Contents
1 Introduction
1.1 DSPs and Embedded Systems Security Requirements
1.2 Thesis Objective
1.3 Thesis Overview
2 Public-Key Cryptosystems and the StarCore SC140 DSP
2.1 Public-Key Cryptosystems
2.2 ECC Background
2.2.1 Comparison to Other Cryptographic Techniques
2.3 Digital Signature Schemes
2.4 StarCore SC140 DSP Processor Description
2.5 Previous Cryptographic and DSP Research
3 The ECDSA Algorithm and Implementation Philosophy
3.1 The ECDSA
3.2 Finite Field and Large Integer Arithmetic
3.2.1 Basic Operations
3.2.2 Finite Field Multiplication
3.2.3 Finite Field Squaring
3.2.4 Finite Field Inversion
3.2.5 Large Integer Operations
3.3 Elliptic Curve Arithmetic
3.3.1 Elliptic Curve Point Addition and Subtraction
3.3.2 Elliptic Curve Point Representation
3.3.3 Elliptic Curve Point-Multiplication
3.3.3.1 Non-Adjacent Format
3.3.3.2 Reduced TNAF Representation
3.3.3.3 TNAF Point-Multiplication
3.3.3.4 Width-w TNAF Representation
3.3.3.5 TNAFw Point-Multiplication
3.3.4 Simultaneous Multiple Point-Multiplication
3.4 Implementation and Integration Philosophy
4 Implementation Analysis and Performance Results
4.1 C Data Structures
4.2 Finite Field Operations
4.2.1 Finite Field Addition (c = a ⊕ b)
4.2.2 Finite Field Reduction (c = a mod f)
4.2.3 Finite Field Multiplication (c = a ⋅ b)
4.2.4 Finite Field Squaring (c = a^2)
4.2.5 Finite Field Inversion (c = a^{-1} mod f)
4.3 Large Integer Operations
4.3.1 Large Integer Addition and Subtraction (c = a + b; c = a - b)
4.3.2 Large Integer Multiplication (c = a ⋅ b)
4.3.3 Large Integer Division (c = a / b)
4.3.4 Large Integer Inversion (c = a^{-1} mod f)
4.4 Elliptic Curve Operations
4.4.1 TNAF Conversion (k_2 → k_{TNAF})
4.4.2 Partial Reduction - Partmod δ (k′ = k partmod δ)
4.4.3 TNAF Point-Multiplication (Q = k_{TNAF} ⋅ P)
4.4.4 TNAFw Conversion (k_2 → k_{TNAFw})
4.4.5 TNAFw Point-Multiplication (Q = k_{TNAFw} ⋅ P)
4.4.6 Simultaneous Multiple Point-Multiplication (R = k ⋅ P + l ⋅ Q)
5 Implementation Comparison and Coding Guidelines
5.1 Performance Comparison with Previous Published Results
5.1.1 Low-Level Performance Comparison
5.1.2 High-Level Performance Comparison
5.2 Guidelines for Writing Efficient C Code for Cryptographic Applications
5.3 Guidelines for Writing Efficient Assembly Code for Cryptographic Applications
5.4 Hand-Written and Compiler-Generated Assembly Comparison
5.4.1 Low-Level Performance Comparison
5.4.2 High-Level Performance Comparison
5.5 Memory Requirements Comparison
6 SC140 and Compiler Analysis for Cryptographic Applications
6.1 Analysis of the SC140 for Elliptic Curve Cryptographic Applications
6.1.1 SC140 Cryptographic Pros
6.1.2 SC140 Cryptographic Cons
6.2 Compiler Optimization Improvements
6.3 Compiler Anomalies
6.3.1 Compiler Anomaly A
6.3.2 Compiler Anomaly B
7 Side-Channel Attack Security Issues
7.1 Timing Attacks
7.2 Simple Power Attacks
7.3 Differential Power Analysis
7.4 SCA Countermeasures specific to Koblitz Curves and the SC140
7.4.1 Parallel Processing Countermeasure
7.4.2 Koblitz Curve Specific Countermeasure
8 Discussion and Conclusions
8.1 Thesis Summary
8.2 Limitations of the Research and Implementation
8.3 Conclusions
8.4 Future Work
Appendix A – Koblitz Curve Parameters
Bibliography
List of Acronyms
AAU Address Arithmetic Unit
AGU Address Generation Unit
AIA Almost Inverse Algorithm
ALU Arithmetic Logic Unit
ASL Arithmetic Shift Left (by one bit)
ASLL Arithmetic Shift Left (by multiple bits)
ASM Assembly Language Code
ASR Arithmetic Shift Right (by one bit)
ASRR Arithmetic Shift Right (by multiple bits)
BF Branch if False
BFU Bit Field Unit
BT Branch if True
CA Certificate Authority
CGA Compiler-Generated Assembly
CLB Count Leading Bits
CP Critical Path
DALU Data Arithmetic Logic Unit
DL Discrete Logarithm
DLP Discrete Logarithm Problem
DPA Differential Power Analysis
DSA Digital Signature Algorithm
DSP Digital Signal Processor
EC Elliptic Curve
ECC Elliptic Curve Cryptography
ECDLP Elliptic Curve Discrete Logarithm Problem
ECDSA Elliptic Curve Digital Signature Algorithm
EEA Extended Euclidean Algorithm
FF Finite Field
GUI Graphical User Interface
HWA Hand-Written Assembly
IDE Integrated Development Environment
IF Integer Factorization
IFA IF Always
IFF IF False
IFT IF True
JF Jump if False
JT Jump if True
LSL Logical Shift Left (by one bit)
LSLL Logical Shift Left (by multiple bits)
LSR Logical Shift Right (by one bit)
LSRR Logical Shift Right (by multiple bits)
LUT Look-Up Table
MAC Multiply and Accumulate
MIPS Million Instructions Per Second
NAF Non-Adjacent Format
NB Normal Basis
NIST National Institute of Standards and Technology
NOP No Operation
PB Polynomial Basis
PDA Personal Digital Assistant
RRK Random Rotation of Key
SCA Side Channel Attack
SC140 StarCore SC140 DSP
SPA Simple Power Attacks
SMPM Simultaneous Multiple Point-Multiplication
SRAM Static Random Access Memory
TA Timing Attack
TNAF τ-adic NAF
TNAFw Width-w TNAF
VLES Variable Length Execution Set
VLIW Very Long Instruction Word
XXX(A) XXX and XXXA instructions
List of Algorithms
Algorithm 3-1. ECDSA Signature Generation [30]
Algorithm 3-2. ECDSA Signature Verification [30]
Algorithm 3-3. Finite Field Reduction (c = a mod f) [19]
Algorithm 3-4. Finite Field Multiplication (c = a⋅b) [39]
Algorithm 3-5. Finite Field Squaring (c = a^2) [19]
Algorithm 3-6. Finite Field Inversion (b = a^{-1} mod f) [20]
Algorithm 3-7. Elliptic Curve Point Addition (P_3 = P_1 + P_2) [38]
Algorithm 3-8. TNAF Conversion (k_{TNAF} = r_0 + r_1⋅τ) [61]
Algorithm 3-9. Partmod δ Reduction (r_0 + r_1⋅τ := k_2 partmod δ) [61]
Algorithm 3-10. TNAF Point-Multiplication (Q = k_{TNAF}⋅P) [19]
Algorithm 3-11. TNAFw Conversion (k_{TNAFw} = r_0 + r_1⋅τ) [61]
Algorithm 3-12. TNAFw Point-Multiplication (Q = k_{TNAFw}⋅P) [61]
Algorithm 3-13. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q) [19]
Algorithm 4-1. Improved Finite Field Squaring (c = a^2)
Algorithm 4-2. Improved Finite Field Inversion (c = a^{-1} mod f)
Algorithm 4-3. Integer Coefficient to Binary Representation Conversion
Algorithm 7-1. TA Resistant TNAF Point-Multiplication (Q[0] = k_{TNAF}⋅P) [22]
Algorithm 7-2. Proposed DPA Resistant τ-adic Point-Multiplication
List of Tables
Table 2-1. Current Estimated Memory Requirement Comparison [6]
Table 3-1. Elliptic Curve Coordinate System Comparison [19]
Table 4-1. Finite Field Reduction Performance
Table 4-2. Single and Multiple Bit-Shifting Function Comparison
Table 4-3. Finite Field Multiplication Performance
Table 4-4. Finite Field Squaring Performance Comparison
Table 4-5. Finite Field Inversion Bit-Shift Distribution
Table 4-6. Finite Field Inversion Performance
Table 4-7. TNAF Point-Multiplication Performance
Table 4-8. TNAFw Point-Multiplication Performance Comparison
Table 4-9. Simultaneous Multiple Point-Multiplication Performance Comparison
Table 5-1. Estimated Finite Field Operation Cycle Count Comparison
Table 5-2. Estimated Elliptic Curve Operation Cycle Count Comparison
Table 5-3. Estimated Signature Generation and Verification Cycle Count Comparison
Table 5-4. Low-Level CGA and HWA Performance Comparison (input independent routines)
Table 5-5. Low-Level CGA and HWA Performance Comparison (input dependent routines)
Table 5-6. Computational Reduction of the Signature Generation Process due to HWA Routines
Table 5-7. High-Level CGA and HWA Performance Comparison
Table 5-8. Estimated Permanent Storage Requirements
Table 6-1. Assembly Symbolic Description
Table 7-1. Estimated TA Resistant Performance Penalties
Table 7-2. Estimated SPA Resistant Signature Generation Performance Penalty
Table 7-3. Estimated Sample Entropy and Overhead of Algorithm 7-2
1 Introduction
The ECDSA is a cryptographic tool that can provide security to systems when implemented correctly.
The algorithm defines a method of achieving data integrity, data origin authentication and non-
repudiation. The ability to efficiently implement the ECDSA forecasts its usefulness in the growing
wireless and wireline communications industries.
It is difficult to argue the usefulness of the ECDSA without proof that it can be efficiently
implemented on a wide range of target processors. Furthermore, it is difficult to convey the threat of
attackers to users who have not personally experienced a digital security breach, and such users
do not easily tolerate large computational delays caused by tasks they deem unimportant.
The purpose of the thesis is to study the performance of the ECDSA on the SC140. Analysis
of the implementation, the benefits of an assembly implementation, and the strength of the compiler to
produce efficient cryptographic applications are all included as part of the study. There have been
several documented implementations of the ECDSA on general-purpose and extremely resource
limited processors that are present on smart cards, but a limited number of implementations of the
ECDSA, or more generally Elliptic Curve Cryptography (ECC), on processors targeting portable
devices. The few documented implementations of ECC on DSPs have involved prime fields. ECC
using binary fields is also a viable option, which may be more attractive due to the numerous bit-
manipulating instructions common to DSPs.
Due to the decreased power consumption of DSPs relative to general-purpose processors, and
the limited battery lifespan of portable devices, DSPs are an excellent candidate for the primary
computing core of portable devices. Furthermore, due to the inherent insecurities of wireless
communications that threaten portable devices, the implementation of security measures is of utmost
importance. The performance of security measures, including the ECDSA, must be studied on such
devices. By studying the performance of the ECDSA on the SC140, the viability of both the algorithm
and the processor with respect to portable devices, and to security on such devices, can be determined.
1.1 DSPs and Embedded Systems Security Requirements
The employment of adequate security systems was overlooked during the incredible growth the
communications industry underwent over the past two decades. Systems were introduced without
adequate security measures in place. The combination of recent world events and the sudden decline in
the communications industry’s growth has led to the realization that a great deal of current network
security measures are inadequate.
Furthermore, the wireless communications industry is rapidly expanding, which increases the
demand on network security. The current trend in the communications industry is an increase in wireless
services as 3rd generation cellular systems become a reality. The services that cell phones, personal
digital assistants (PDAs) and other portable handheld devices provide are ever increasing. The new
services require more bandwidth and greater processing capabilities. Examples of the introduced
services are email and streaming multimedia.
As the communications industry expands, and more information is transmitted via wireless and
wireline connections, the inherent requirement for security measures increases. The SC140 targets
several communication applications that all require certain levels of security. It is therefore important
to study the SC140 to determine if it is a viable processor to implement the required security measures.
Handheld devices are powered by several different processing units, including DSPs. The
deployment of DSPs is widespread. They have lower power dissipation than general-purpose
processors, and are less costly than specialty processors. DSPs are currently present in network and
data communications, and several other devices throughout the communications industry. High-end
DSPs control network traffic on high-speed backbones, and will likely be deployed in future handheld
devices. Handheld devices are often part of extensive wireless networks that are naturally insecure,
and are extremely susceptible to security risks such as impersonation attacks.
Digital signatures allow tasks such as data integrity, data origin authentication and
non-repudiation to be performed. The importance of the integrity and origin of data is heightened in
wireless networks, which are much more susceptible to impersonation attacks and modification of
transmitted data because of the ease with which the transmission medium is accessed.
It is important to study security on DSPs because of their widespread deployment. If DSPs are
a viable target for implementation of security measures, security can be added to systems with simple
software add-ons or upgrades. The cost of adding the security related services is greatly decreased
because new hardware is not required. Furthermore, expensive processing units for cryptographic
applications are not required by new devices, maintaining their affordability.
1.2 Thesis Objective
The objective of the thesis is to study the performance of ECC, and more precisely the ECDSA, on a
DSP targeting portable devices. The ECDSA is implemented on the StarCore SC140 DSP. The
implementation is examined thoroughly, and optimized to improve its performance with respect to
execution time and code size. The compiler and associated optimizer are examined to determine if
efficient implementation is possible in the C programming language, or if assembly language coding is
required to achieve the necessary performance. The execution time of the implementation should be
comparable to current published results, and must result in acceptable delays that are unnoticeable to
the average user when the digital signature technique is utilized by a practical application. Furthermore,
while maintaining acceptable execution times, the code size of the compiled application must be
suitable for portable devices, where memory is an expensive and limited resource.
1.3 Thesis Overview
In chapter 2, a brief description of public-key cryptosystems, focusing on ECC, and the StarCore
SC140 DSP is presented. The ECDSA and the algorithms utilized to implement the required finite
field and elliptic curve operations are outlined in chapter 3, along with the implementation philosophy.
The implementation and performance analysis of the finite field and elliptic curve operations are
presented in chapter 4. Chapter 5 analyzes the performance of the implementation. A comparison of
the performance of the implementation with previously published results is presented. In addition, the
performance of the hand-written and compiler-generated assembly is contrasted. Several coding
guidelines to follow, which aid in the development of efficient assembly and C code, are included. To
conclude chapter 5, the memory requirements are presented and compared. An analysis of the SC140
and the associated compiler is presented in chapter 6. The analysis is based on implementing
cryptographic applications on the SC140, and the ability of the compiler to optimize cryptographic
related code. Security issues that arise due to side-channel attacks and methods of avoidance are
presented in chapter 7. Finally, chapter 8 presents a thesis summary, limitations of the study, a
conclusion, as well as future work to be done in this area of research.
2 Public-Key Cryptosystems and the StarCore SC140 DSP
This chapter introduces the concept of public-key cryptosystems, providing several examples. The
implemented public-key cryptosystem is explained and compared to alternatives, and a description of
the SC140 is included.
2.1 Public-Key Cryptosystems
Public-key cryptosystems were invented by Whitfield Diffie and Martin Hellman in 1976 [6]. They are
asymmetric cryptosystems, which are based on the concept of using different keys for the encryption
and decryption processes. For the cryptosystem to be useful, the two keys must appear unrelated, so
that the encryption key E, or public key, can be placed in the public domain without compromising the
decrypting key D. The decrypting key D is also known as the private or secret key.
Consider two entities, either people or computer nodes, that want to communicate. Each
entity individually develops a public and private key. The two keys are inverses of each other, as
described by Formula 2.1 [9]. In the formula, M is the message, and the functions D() and E()
represent encryption using the private and public keys respectively.
M = D(E(M)) = E(D(M)) (2.1)
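Formula 2.1 can be illustrated with a minimal sketch. It assumes textbook RSA (introduced later in this section) with deliberately tiny, insecure parameters; the values of p, q, e and d are illustrative choices and do not come from the thesis.

```python
# Toy RSA key pair. The numbers are textbook-sized and NOT secure;
# they are assumptions chosen only to illustrate Formula 2.1.
p, q = 61, 53
n = p * q        # public modulus (3233)
e = 17           # public exponent
d = 2753         # private exponent: (e * d) % lcm(p - 1, q - 1) == 1


def E(m):
    """Apply the public key to a message 0 <= m < n."""
    return pow(m, e, n)


def D(m):
    """Apply the private key to a message 0 <= m < n."""
    return pow(m, d, n)


M = 65
# Secrecy direction: public key first, private key second.
assert D(E(M)) == M
# Authentication direction: private key first, public key second.
assert E(D(M)) == M
```

Both orders recover M, which is the property that lets one key pair serve for secrecy and for authentication.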
A trusted third party is required by public-key cryptosystems. The trusted third party, also
known as a Certificate Authority (CA), is in charge of the storage and distribution of the domain
parameters and public keys of entities. Entities transmit their public key to a CA in a secure manner.
Public-key cryptosystems are capable of providing authentication, secrecy, or both between
communicating parties. Authentication is achieved by encrypting messages with one’s private key
before transmission. The target entity uses the public key of the sender, obtained from a CA, to decrypt
the transmitted message. Authentication is inherent when the message is successfully decrypted.
Secure communication is achieved by encrypting a message with the target’s public key. The
communication is secure because the target’s private key, which is only known by the target, is
required to decrypt the message. An authenticated and secure communication is achieved by first using
the target’s public key to encrypt a message, then by using one’s private key to encrypt the encrypted
message before transmission to the target. In this case, the target first authenticates the transmission by
decrypting the received message with the sender’s public key. Then the original message is revealed
by decrypting the authenticated message with their private key.
There are currently three secure and efficient public-key cryptosystems. They are based on
Integer Factorization (IF), the Discrete Logarithm (DL) and Elliptic Curves (EC) [6]. Each system is
based on a difficult mathematical problem, relative to their input size [6], which requires a great deal of
time to solve.
The most common public-key cryptosystem is RSA, named after its developers Rivest, Shamir
and Adleman [6]. It is an IF system based on large prime integers. The difficult problem associated
with the system is the factorization of large numbers. Both encryption and digital signature schemes
have been developed using RSA.
The Digital Signature Algorithm (DSA) is an example of a DL cryptosystem [6]. The DSA is a
digital signature scheme. DSA is based on the Discrete Logarithm Problem (DLP), as are all DL
cryptosystems. Encryption is also possible using the DLP, but is not commonly used due to the
associated large overhead.
Both encryption and digital signature schemes have been developed using ECC. It is based on
an extension of the DLP, aptly named the Elliptic Curve DLP (ECDLP). The ECDSA is an
example of an ECC scheme; it is a digital signature scheme very similar to the DSA. ECC is presently
the most promising public-key cryptosystem because of the high security-per-bit ratio it provides.
Further detail of the cryptosystem is given in the following section.
2.2 ECC Background
In 1985, Neal Koblitz and Victor Miller independently proposed the use of elliptic curves for a
public-key cryptosystem [2]. Both encryption and digital signature techniques have been developed using the
cryptosystem. The public-key cryptosystem is based on the manipulation of points on an elliptic curve,
defined modulo f, where P = (x, y) is a point on the curve. For cryptographic applications, each
coordinate belongs to a prime or binary finite field, defined by f. The generalized equation of an
elliptic curve for cryptography is presented as Formula 2.2.
y^2 + x \cdot y = x^3 + a \cdot x^2 + b \pmod{f}   (2.2)
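As a rough illustration of Formula 2.2, the sketch below tests whether a point satisfies the curve equation over a toy binary field. The field GF(2^4), the reduction polynomial f(x) = x^4 + x + 1, the coefficients a = b = 1 and the helper names are all assumptions made for this example; the thesis itself works over GF(2^163).

```python
M = 4            # toy extension degree (assumption; the thesis uses m = 163)
F = 0b10011      # assumed reduction polynomial f(x) = x^4 + x + 1
A, B = 1, 1      # assumed curve coefficients a and b


def gf_mul(u, v):
    """Multiply two GF(2^M) elements stored as bit masks, reducing modulo F."""
    result = 0
    while v:
        if v & 1:
            result ^= u          # addition in GF(2^M) is a bitwise XOR
        v >>= 1
        u <<= 1                  # multiply u by x
        if (u >> M) & 1:
            u ^= F               # reduce once the degree reaches M
    return result


def on_curve(x, y):
    """Return True if (x, y) satisfies y^2 + x*y = x^3 + a*x^2 + b."""
    x2 = gf_mul(x, x)
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(x2, x) ^ gf_mul(A, x2) ^ B
    return lhs == rhs


# Exhaustive search is only feasible because the toy field has 16 elements.
affine_points = [(x, y)
                 for x in range(1 << M)
                 for y in range(1 << M)
                 if on_curve(x, y)]
```

With these toy parameters the search finds the finite points of the curve; the point at infinity is not representable as an (x, y) pair and is handled separately in practice.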
Several parameters define an elliptic curve, including but not limited to, a and b. There are
classes of curves that are defined by specific sets of parameters. These classes have special properties
associated with them, making them more or less attractive for cryptographic applications. For
example, anomalous binary curves, more commonly known as Koblitz curves, are used and assumed in
the scope of the thesis. They have properties that allow for efficient point-multiplication that are
explained in §3.3.3.
Associated with elliptic curves are point addition, doubling, negating, subtracting and
multiplication operations. Each of the elliptic curve operations is defined by a sequence of finite field
operations, and is described in §3.3. Similar to standard mathematics, point-multiplication is based on
a series of point additions. Because point-multiplication is computationally expensive, algorithms
have been developed to improve its computation.
The ECDLP is the difficult mathematical problem of reversing the point-multiplication
operation. The problem is an analogue of the DLP, but in the elliptic curve domain. The ECDLP,
which is attempting to solve for k knowing P and Q, from the point-multiplication formula Q = k⋅P,
proves difficult. The problem is computationally expensive and cannot be solved in a reasonable
amount of time as long as the finite field associated with the curve is large enough. As expected, the
finite field size required to make the problem computationally infeasible is directly related to current
processing trends. As the average computing power of devices increases, larger finite fields are required
to maintain security levels [30].
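The relationship between point-multiplication and the ECDLP can be sketched on a toy curve. Everything below is an illustrative assumption rather than the thesis implementation: the field GF(2^4) with f(x) = x^4 + x + 1, the coefficients a = b = 1, the standard affine addition and doubling formulas for curves of the form of Formula 2.2, and a plain double-and-add multiplication. The exhaustive search in brute_force_log only succeeds because the group is tiny, which is exactly what a large field prevents.

```python
M, F = 4, 0b10011    # toy field GF(2^4), f(x) = x^4 + x + 1 (assumptions)
A, B = 1, 1          # assumed curve coefficients a and b


def gf_mul(u, v):
    """Multiply two GF(2^M) elements stored as bit masks, reducing modulo F."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        v >>= 1
        u <<= 1
        if (u >> M) & 1:
            u ^= F
    return result


def gf_inv(u):
    """Invert a nonzero element as u^(2^M - 2)."""
    result = 1
    for _ in range((1 << M) - 2):
        result = gf_mul(result, u)
    return result


def on_curve(x, y):
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def add(P, Q):
    """Affine point addition; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:        # Q = -P, since -(x, y) = (x, x + y)
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))            # doubling slope
        x3 = gf_mul(lam, lam) ^ lam ^ A
        y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
        return (x3, y3)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))           # chord slope
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def mul(k, P):
    """Binary double-and-add point-multiplication Q = k*P."""
    R = None
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R


def order(P):
    """Smallest n > 0 with n*P equal to the point at infinity."""
    n, R = 1, P
    while R is not None:
        R = add(R, P)
        n += 1
    return n


def brute_force_log(P, Q, limit=1 << (M + 1)):
    """Solve Q = k*P by trying k = 0, 1, 2, ... (feasible only for toy groups)."""
    R = None
    for k in range(limit):
        if R == Q:
            return k
        R = add(R, P)
    return None


points = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve(x, y)]
P = max(points, key=order)       # base point of largest order
Q = mul(11, P)
k = brute_force_log(P, Q)        # recovers the scalar by exhaustive search
```

The cost of brute_force_log grows with the group order, whereas mul needs a number of steps proportional only to the bit length of k; that asymmetry is the basis of the security argument above.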
Elliptic curves for cryptographic applications can be defined over prime or binary fields. In
general, prime fields tend to outperform binary fields because most processors are designed to favor the
execution of integer arithmetic, and are not number crunchers. However, binary fields were chosen for
implementation to determine how well they perform on the SC140, because it has an extended and less
computationally costly set of logic instructions compared to general-purpose processors. Binary finite
fields are assumed throughout the thesis unless otherwise stated.
The Polynomial Basis (PB) was selected to represent binary finite field elements. Unless
otherwise stated, the PB is used throughout the thesis. The use of an alternate basis, such as the
Normal Basis (NB), does have its benefits. For example, squaring field elements is simplified when
employing the NB, but other operations become more complex. The author believes that the drawbacks
of alternative representations outweigh the benefits for this implementation.
The binary finite field GF(2^m), where m = 163, is used and assumed throughout the project.
The field size of 163 bits was selected to provide the current acceptable security level [6]. When
implementing Koblitz curves, all of the elliptic curve parameters are fixed after the finite field size is
selected, except for C. The parameter C and its value are further explained in §3.3.3.2. The Koblitz
curve parameters used for the implementation, for the PB and GF(2^{163}), are listed in Appendix A.
Some general finite field terminology must be defined. The terms polynomial and finite field
element are used interchangeably throughout the thesis. The degree of a polynomial is the position of
the most significant coefficient (where the first coefficient position is position zero), and the Hamming
weight of a polynomial is the number of nonzero coefficients in its representation.
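These two definitions can be made concrete with a short sketch that stores a binary polynomial as an integer bit mask, where bit i holds the coefficient of x^i. The example polynomial x^163 + x^7 + x^6 + x^3 + 1 is the NIST reduction polynomial for GF(2^163); using it here is an assumption, since the parameters actually used are listed in Appendix A, which lies outside this section. The function names are illustrative.

```python
# Binary polynomial as a bit mask: bit i is the coefficient of x^i.
# Assumed example: the NIST reduction polynomial for GF(2^163).
f = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def degree(poly):
    """Position of the most significant nonzero coefficient (-1 for zero)."""
    return poly.bit_length() - 1


def hamming_weight(poly):
    """Number of nonzero coefficients."""
    return bin(poly).count("1")


f_degree = degree(f)
f_weight = hamming_weight(f)
```

A low Hamming weight in the reduction polynomial matters in practice because each nonzero term contributes one shift-and-XOR step to a reduction.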
In the following section, ECC is compared to two other public-key cryptosystems. The
positive and negative aspects of the cryptosystems are compared to give some background on why
ECC was selected for implementation over other possibilities.
2.2.1 Comparison to Other Cryptographic Techniques
ECC was selected as the public-key cryptosystem for implementation on the SC140 for several
reasons. The comparison below briefly develops and states the grounds for selecting ECC over the
other public-key cryptosystems. Further comparison of the public-key cryptosystems can be found in
[6] and [29].
There are both encryption and digital signature schemes associated with ECC, DLP and RSA.
The underlying mathematical problems associated with the cryptosystems are identical for both
encryption and digital signature schemes. Therefore, after one scheme is implemented, much of the
implementation can be reused with the other scheme. There are two types of digital signature
techniques, which are with and without appendices. The digital signature is logically combined with
the message in a technique without appendix, whereas the digital signature is appended to the message
in the digital signature with appendix technique.
RSA refers to both encryption and digital signature schemes. It includes a digital signature
scheme without appendix. ElGamal proposed digital signature and encryption schemes based on the
DLP [6]. Later, the DSA was developed, which is an improvement on the digital signature scheme
proposed by ElGamal. The ECDSA is the digital signature algorithm associated with ECC, and
encryption is simply referred to as ECC. The ECDSA and DSA are both digital signature schemes with
appendix. Digital signature schemes with and without appendices are explained in §2.3.
Assuming that an elliptic curve is selected that does not have negative security implications,
and that each public-key cryptosystem is implemented correctly without any security loopholes or
backdoors, the per-bit security of ECC is far superior to that of RSA and DSA. As stated in [6], and
depicted in Figure 2-1, the current acceptable security level is 10^12 MIPS years, leading to a 160-bit
modulus for ECC, and a 1024-bit modulus for both RSA and DSA [6]. In addition, the figure shows the
expected growth of the modulus size for each cryptosystem. The modulus sizes for RSA and DSA
grow exponentially against an exponential growth in MIPS years, whereas modulus sizes for ECC
experience approximately linear growth [6]. This makes ECC far more attractive for portable devices
with limited resources, assuming the execution times are similar. In the future, the exponential
growth of RSA and DSA modulus sizes is expected to result in an unacceptable amount of overhead.
Figure 2-1. Modulus Size Comparison of Public-Key Cryptosystems [2]
[Plot of modulus size (bits) against time to break cryptosystem (MIPS years), with curves for ECC and
for RSA and DSA, and the current acceptable security level marked.]
In general, the key size can be assumed identical to the modulus size for each system. The
total size of the system parameters and key pairs for RSA and DSA is much larger than for ECC.
They currently differ by a factor of roughly four, which will become larger as the current acceptable
security level increases. The encrypted message and signature sizes for RSA are much larger than with
ECC; only the signature size of DSA is the same as that of ECC. A comparison of the current estimated
sizes of parameters, keys, signatures and encrypted messages is presented in Table 2-1. With respect to
the table values, the signature sizes stated are for large messages, i.e. 2000 bits, and the original size
of the encrypted message is 100 bits.
Larger keys, parameters, signatures, and encrypted messages require more memory for
storage and more bandwidth to transmit, both of which are scarce resources when dealing with portable
devices. Moreover, even with non-portable devices, there is no reason to unnecessarily squander
resources. As depicted in Figure 2-1 and Table 2-1, ECC provides equivalent security levels, requiring
fewer resources than alternative public-key cryptosystems.
Table 2-1. Current Estimated Memory Requirement Comparison [6]
Public-Key Cryptosystem | System Parameters (bits) | Public Key (bits) | Private Key (bits) | Signature Size (bits) | Encrypted Message (bits)
RSA | N/A | 1088 | 2048 | 1024 | 1024
DLP (DSA, ElGamal) | 2208 | 1024 | 160 | 320 | 2048
ECC (ECDSA) | 481 | 161 | 160 | 320 | 321
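The factor-of-four difference in parameter and key storage can be checked directly from the Table 2-1 values. The short calculation below is only a worked example; it treats the RSA system parameters entry (N/A) as zero bits, which is an assumption.

```python
# Sizes in bits, taken from Table 2-1. The RSA "N/A" entry is assumed to be 0.
sizes = {
    "RSA": {"system_parameters": 0, "public_key": 1088, "private_key": 2048},
    "DSA": {"system_parameters": 2208, "public_key": 1024, "private_key": 160},
    "ECC": {"system_parameters": 481, "public_key": 161, "private_key": 160},
}

# Total storage for system parameters plus one key pair.
totals = {name: sum(fields.values()) for name, fields in sizes.items()}

# Ratio of each total to the ECC total.
ratios = {name: total / totals["ECC"] for name, total in totals.items()}

for name in sizes:
    print(f"{name}: {totals[name]} bits, {ratios[name]:.2f}x ECC")
```

Under that assumption the totals are 3136 bits for RSA, 3392 bits for DSA and 802 bits for ECC, giving ratios of about 3.9 and 4.2.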
When comparing cryptosystems, the computational overhead must also be investigated. Focusing
on the computational expenses within a single cryptosystem, DLP and ECC systems behave similarly
to each other, and opposite to RSA. The signature generation process for ECDSA and DSA is faster than
the verification process, whereas the verification of RSA signatures is less computationally expensive
than their generation. Decryption is slower than encryption using RSA, whereas the opposite is true for
ECC.
Overall, ECC requires less computational overhead. After all the techniques used to increase
the performance of each cryptosystem are implemented, ECC is reported to be ten times faster than
RSA and DSA [6]. Both the execution times and memory requirements of ECC are less than those
associated with RSA and DSA, making it superior to the alternatives.
2.3 Digital Signature Schemes
The concept of a digital signature is very powerful, but difficult to achieve. This section gives a brief
overview of the two types of digital signature schemes and their capabilities. Digital signatures are
designed to be similar to, and more compelling than, handwritten signatures, and they target digital data [30].
A digital signature is computed from the data being signed, M, and a private key known only to the signer.
Digital signatures are powerful because they provide data integrity, data origin authentication
and non-repudiation [30]. After data has been signed, all these services are achieved by the signature
verification process. There is no privacy associated with digital signatures. Transmitted data can be
easily intercepted and interpreted by eavesdroppers. To achieve confidentiality between
communicating parties, an encryption scheme must be employed.
Signatures that are verified with a sender’s public key guarantee the integrity of the
transmission because the signature is based on the original message. Messages cannot be intercepted
and modified because the signature on the message will not verify. Without the private key of the
sender, the correct signature cannot be computed. An entity cannot impersonate another because each
entity has a unique private key. The private key is known only by the owner, and is required to
compute the digital signature of each message. Lastly, an entity cannot deny knowledge of a message
containing their signature. The entity is the only one with knowledge of their private key, and therefore
is the only one able to compute the signature of a message.
There are two types of digital signature schemes: schemes with appendix and schemes without
appendix. In the case of a digital signature scheme without appendix, the digital signature is the only
data transmitted, and the original message is embedded within it. The signature verification
process recovers the original message. It is impossible to determine the original
message without signature verification. If the verification process fails, the receiver is left with a
garbled message, and the original message cannot be determined. In the case of a digital signature with
appendix, the digital signature is computed and concatenated onto the message. The message along
with the concatenated digital signature is transmitted. It is possible for an attacker to modify the
original message and concatenate an incorrect signature. In this case, the signature will not be verified
by the receiver. Therefore, the receiver will know the transmitted data has been modified. Since the
message and signature are separate in the digital signature with appendix scheme, the verification
process is technically optional, and is left up to the receiver.
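The flow of a signature scheme with appendix can be sketched in C as follows. This is only an illustration under stated assumptions: it uses textbook RSA with tiny hypothetical parameters (n = 3233, e = 17, d = 2753) and a stand-in hash, not the ECDSA described in §3.1, and the names toy_hash, toy_sign and toy_verify are invented here. It offers no security.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy textbook-RSA parameters (assumption): n = 61 * 53. NOT secure. */
enum { TOY_N = 3233, TOY_E = 17, TOY_D = 2753 };

/* Square-and-multiply modular exponentiation.
   All intermediate products stay below 2^24, so uint32_t suffices. */
uint32_t toy_modpow(uint32_t base, uint32_t exp, uint32_t mod)
{
    uint32_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1u)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Hypothetical stand-in for a real hash function; maps a message into [0, n). */
uint32_t toy_hash(const unsigned char *msg, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = (h * 31u + msg[i]) % TOY_N;
    return h;
}

/* Signer: the signature is computed from the message hash and the private key.
   It is transmitted alongside the unmodified message (the appendix). */
uint32_t toy_sign(const unsigned char *msg, size_t len)
{
    return toy_modpow(toy_hash(msg, len), TOY_D, TOY_N);
}

/* Receiver: recompute the hash of the received message and compare it with
   the value recovered from the signature using the public key. */
int toy_verify(const unsigned char *msg, size_t len, uint32_t sig)
{
    return toy_modpow(sig, TOY_E, TOY_N) == toy_hash(msg, len);
}
```

A receiver that calls toy_verify on a modified message with the original signature gets a mismatch, which is how the modification is detected. A receiver that skips the call can still read the message, which is why verification is optional in this type of scheme.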
The specific type of digital signature implemented, the ECDSA, is described in §3.1. Further
details of the implemented algorithm are provided, as well as a depiction of a digital signature scheme
with appendix.
2.4 StarCore SC140 DSP Processor Description
The StarCore SC140 DSP is a high performance processor that can be clocked at a maximum of 300
MHz [46]. With its many assets that include high performance and low power consumption, the
processor targets computationally intensive communication applications [46]. The SC140 has several
features that allow for efficient digital signal processing, which are also useful for cryptographic
applications. These features are examined in §6.1.1.
The SC140 targets a wide range of communication applications. Some examples of the target
markets include wireless Internet and multimedia, network and data communications, 3rd generation
wireless handset systems with wideband data services, wireless and wireline base stations and the
corresponding infrastructure [46].
The high-performance SC140 is designed to have a large data throughput of 4.8 GBytes/sec.
The processor uses a 32-bit unified program and data address space, which is byte addressable. It is
designed to allow significant parallelism. The SC140 can be equipped with a very large on-chip
zero-wait Static Random Access Memory (SRAM). The SRAM allows for efficient execution of
applications by reducing the cost of fetching instructions from memory. The cost of reads from and
writes to memory is reduced as well.
The Data Arithmetic Logic Unit (DALU) of the SC140 performs arithmetic and logical
instructions with four parallel Arithmetic Logic Units (ALUs). Each ALU has access to the sixteen
40-bit data registers that form the DALU register file. Each ALU contains a Multiply and Accumulate
(MAC) unit and a Bit-Field Unit (BFU). The MAC unit is capable of a multiplication of two 16-bit values and an
accumulate every clock cycle. The BFU contains a 40-bit bi-directional barrel shift register. It is
capable of single and multiple, arithmetic and logical shifts, as well as logical, bit-masking and
bit-extraction operations.
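The multiply-and-accumulate pattern that a MAC unit executes can be sketched in C as a dot product of 16-bit values. This is a minimal sketch: the function name mac_dot16 is hypothetical, int64_t merely stands in for the 40-bit accumulator, and saturation behaviour is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of 16-bit values: one multiply and one accumulate per element.
   The accumulator is wider than the 32-bit product so that sums of several
   products do not overflow (assumption: int64_t models the 40-bit register). */
int64_t mac_dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

Two products of the largest positive 16-bit value already exceed the range of a signed 32-bit integer, which illustrates why an accumulator wider than 32 bits is useful.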
The Address Generation Unit (AGU) of the SC140 performs address manipulation and limited
arithmetic instructions with two parallel Address Arithmetic Units (AAUs). It contains its own register
file and operates in parallel with the DALU. The register file consists of sixteen 32-bit registers.
The SC140 employs a Variable Length Execution Set (VLES), which allows the execution of
up to six instructions in a single clock cycle, fully utilizing the processing capabilities of the SC140.
The combination of instructions
allowed in a VLES is limited by a set of rules, but VLESs greatly reduce the overall code size.
Within a VLES, NOPs are assumed for unused processing units, eliminating the need to define an instruction for each processing
unit per clock cycle. Similar DSPs with parallel processing capabilities that do not have VLESs use a
fixed-size instruction group, referred to as a Very Long Instruction Word (VLIW). VLIWs lead to large
programs that have a low code density [44]. Large programs are inefficient and should be
avoided when dealing with portable devices with limited memory, bandwidth and power
resources.
The SC140 has zero-overhead hardware loops that can be highly beneficial when used
correctly. The hardware loops allow for up to four levels of nesting, and provide a means of reducing
repetitive code with a minimal execution cost. By reducing repetitive code, the code size of
applications can be decreased. Hardware loops are further explained in §6.1.1.
The SC140 also has several addressing modes that allow for efficient execution of
repetitive algorithms. The four addressing modes are register direct, address register
indirect, PC relative, and special [46]. The register direct, PC relative and special addressing modes are
general addressing techniques that are common to most processors.
The address register indirect addressing mode is the most interesting and beneficial addressing
method. It allows several techniques of addressing memory, including methods of modifying address
registers. Post-increment and post-decrement addressing can be specified. In each case, the address
register is modified by the memory access width defined by the instruction. There is also a
post-increment by offset addressing method, where the address register is modified by the memory access
width multiplied by the value of a control register. None of the post-modification addressing
methods incurs a cycle penalty. When properly used, these addressing modes lead to tight loops with
minimal wasted clock cycles when implementing repetitive algorithms.
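As an illustration of the kind of repetitive loop that benefits from post-increment addressing, the following C sketch adds two elements of GF(2^163), which amounts to a word-wise XOR. The function name gf163_add and the six-word layout are assumptions made for this example. Whether a given compiler maps the pointer increments to post-increment addressing and the loop to a hardware loop depends on the compiler and its optimization level.

```c
#include <stdint.h>

/* ceil(163 / 32) = 6 words hold one field element (layout is an assumption). */
#define GF163_WORDS 6

/* c = a + b in GF(2^163): addition of binary polynomials is a word-wise XOR. */
void gf163_add(uint32_t *c, const uint32_t *a, const uint32_t *b)
{
    for (int i = 0; i < GF163_WORDS; i++)
        *c++ = *a++ ^ *b++;  /* each pointer advances by one word per access */
}
```

Each pointer is advanced by the access width (one 32-bit word) as a side effect of the load or store, which mirrors the post-increment behaviour described above.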
Indirect addressing modes allow addressing by offsets. The offset can be another address
register, a short or long displacement, or a control register. For more addressing methods with
complete descriptions, refer to the SC140 DSP Core Reference Manual [46].
The power management control features of the SC140 further increase the processor’s
attractiveness for portable devices. The processor can be put into a wait or stop state, which are both
low power consumption modes. In these modes, the functionality of the processor greatly decreases
while waiting for an event to occur. The low power consumption modes conserve energy, on top of
that saved because of the processor’s low voltage operation.
There is a wide variety of software tools available to develop applications for the SC140,
including the Metrowerks CodeWarrior Integrated Development Environment (IDE). It provides an
environment in which applications for the SC140 can be written, edited, compiled, assembled, simulated,
analyzed, and tested in software and hardware. Projects can be developed with the IDE using C source code,
assembly source code, or both. The amount of parallelism, efficiency and size of the compiled
application is controlled by compiler optimization levels selected by the developer at compile time.
The optimization techniques used by the compiler include various levels of scheduling, pipelining,
bundling, global register allocation, as well as global and space optimization.
The IDE includes a fully functional source code browser and editor, which provide developers
with an easy-to-use graphical user interface. They allow the addition and removal of files from
projects, as well as providing a means of navigating a project’s source code, thus simplifying the
development of large and complex multi-file applications.
Projects can be simulated in software or executed in hardware using the IDE debugging tool.
Software simulation is slow for complex applications, whereas hardware debugging of projects is much
faster but requires the attachment of a development board to the host or a networked computer. All of
the debugging options are available independent of the simulation target. An important caveat is that
code simulated and verified in software is not guaranteed to execute identically in
hardware. Some code and functionality restrictions present in hardware are not implemented by the
software simulator.
Breakpoints can be set within the C and/or assembly code for debugging purposes. The
position of breakpoints is limited when debugging optimized code. To aid in the debugging process,
commands such as step into, step over, step out, and run to cursor can be issued when the execution
sequence is paused. Stack variables, memory addresses, and internal register values can all be viewed
and modified during debugging.
A very useful component of the IDE is the profiler. The profiler can be run during the
debugging process to record statistics describing the execution sequence. It records several useful
statistical values that aid in weighing the performance of an application and individual functions.
Values including function call counts, function cycle counts, function and descendant function cycle
counts, as well as minimum, maximum and average cycle counts are recorded for each C function
executed during the simulation. A function call tree is also recorded and useful during analysis.
An easy-to-use GUI associated with the profiler allows a developer to analyze each function
within an application. The developer can navigate the function call tree and investigate the
performance of each function, as well as view the performance and call counts of parent and
descendant functions.
The profiler is an excellent tool to use during analysis of the optimization process. The
performance of individual functions can be easily analyzed. All the C functions in a project can be
sorted by statistics including total call count, total function cycle count, total function and descendant
cycle count, average function cycle count, and average function and descendant cycle count to quickly
determine the functions that consume the most execution time, and therefore most likely require
optimizations.
The data recorded by the profiler is best viewed and analyzed with the IDE and profiler, but
can be exported to other formats including HTML, XML and a tab delimited file. The alternative
formats allow the sharing of data with colleagues, who can navigate and analyze the data on computers
that do not have the IDE tool installed.
2.5 Previous Cryptographic and DSP Research
There has been a significant amount of research done on ECC. Several resources, which are referenced
in §2.2.1, compare ECC with other public-key cryptosystems. Some papers present the theory behind
ECC and the ECDSA. Others state and develop various algorithms for implementing the required ECC
operations.
The papers that state and develop various algorithms, along with ones that implement and
compare the performance of the algorithms were thoroughly investigated before the implementation
process. References [19], [20], [38], [40] and [61] present important algorithms and/or performance
results that influenced the algorithms implemented in the thesis.
Most of the work in ECC related resources revolves around the elliptic curve point-
multiplication operation because it accounts for nearly the entire execution time of the encryption and
signature processes. A significant breakthrough is presented by Solinas in [61]. He presents a
technique of reducing the execution time of the point-multiplication operation on Koblitz curves. The
modified point-multiplication algorithms presented by Solinas that use the technique, which are
presented in §3.3 and implemented in §4.4, are shown to outperform other methods in [19].
The majority of recent ECC papers focus on SCAs, a new class of attacks that use
timing and power analysis to break cryptosystems. Kocher first described the attacks against RSA,
DSS and others in [33], and Coron later generalized the technique to include ECC [10]. Actual power
traces of elliptic curve point-multiplication were published in [13], illustrating resistance to power
analysis attacks on the SC140. However, prime fields, and not binary fields, were used in that
implementation. SCAs are examined in chapter 7, focusing on the implementation. The three types of
SCAs and countermeasures for each are presented, as well as some alternative techniques, developed
by the writer, that may foil SCAs.
Most cryptographic research to date, including both symmetric and asymmetric-key
cryptosystems, involves general-purpose processors. Only a minimal amount of research has been done
on DSP implementations of cryptography. For example, a very broad view of the secure
communication issues with respect to DSPs is presented in [12], implementation of AES on DSPs is
investigated in [64], and efficient implementations of 1024-bit RSA, 1024-bit DSA and ECDSA using
a 160-bit prime field are described in [26]. Finally, power attacks on the elliptic curve
point-multiplication operation using prime fields are investigated in [13]. Clearly, further cryptographic
research involving DSPs is required.