ARM System Developer’s Guide, part 9

15.1 Advanced DSP and SIMD Support in ARMv6
15.1.1 SIMD Arithmetic Operations
15.1.2 Packing Instructions
15.1.3 Complex Arithmetic Support
15.1.4 Saturation Instructions
15.1.5 Sum of Absolute Differences Instructions
15.1.6 Dual 16-Bit Multiply Instructions
15.1.7 Most Significant Word Multiplies
15.1.8 Cryptographic Multiplication Extensions
15.2 System and Multiprocessor Support Additions to ARMv6
15.2.1 Mixed-Endianness Support
15.2.2 Exception Processing
15.2.3 Multiprocessing Synchronization Primitives
15.3 ARMv6 Implementations
15.4 Future Technologies beyond ARMv6
15.4.1 TrustZone
15.4.2 Thumb-2
15.5 Summary
Chapter 15
The Future of the Architecture
John Rayfield
In October 1999, ARM began to consider the future direction of the architecture that would
eventually become ARMv6, first implemented in a new product called ARM1136J-S. By this
time, ARM already had designs for many different applications, and the future requirements
of each of those designs needed to be evaluated, as well as the new application areas for
which ARM would be used in the future.
As system-on-chip designs have become more sophisticated, ARM processors have
become the central processors in systems with multiple processing elements and subsystems.


In particular, the portable and mobile computing markets were introducing new software
and performance challenges for ARM. Areas that needed addressing were digital signal
processing (DSP) and video performance for portable devices, interworking mixed-endian
systems such as TCP/IP, and efficient synchronization in multiprocessing environments.
The challenge for ARM was to address all of these market requirements and yet maintain
its competitive advantage in computational efficiency (computing power per mW) as the
best in the industry.
This chapter describes the components within ARMv6 introduced by ARM to address
these market requirements, including enhanced DSP support and support for a
multiprocessing environment. The chapter also introduces the first high-performance ARMv6
implementations and, in addition to the ARMv6 technologies, one of ARM’s latest
technologies, TrustZone.
15.1 Advanced DSP and SIMD Support in ARMv6
Early in the ARMv6 project, ARM considered how to improve the DSP and media processing
capabilities of the architecture beyond the ARMv5E extensions described in Section 3.7. This
work was carried out very closely with the ARM1136J-S engineering team, which was in the
early stages of developing the microarchitecture for the product. SIMD (Single Instruction
Multiple Data) is a popular technique used to garner considerable data parallelism and is
particularly effective in very math-intensive routines that are commonly used in DSP, video
and graphics processing algorithms. SIMD is attractive for high code density and low power
since the number of instructions executed (and hence memory system accesses) is kept low.
The price for this efficiency is the reduced flexibility of having to compute things arranged
in certain blocked data patterns; this, however, works very well in many image and signal
processing algorithms.
Using the standard ARM design philosophy of computational efficiency with very low
power, ARM came up with a simple and elegant way of slicing up the existing ARM 32-bit
datapath into four 8-bit or two 16-bit slices. Unlike many existing SIMD architectures
that add separate datapaths for the SIMD operations, this method allows the SIMD to be
added to the base ARM architecture with very little extra hardware cost.
The ARMv6 architecture includes this “lightweight” SIMD approach that costs virtually
nothing in terms of extra complexity (gate count) and therefore power. At the same time the
new instructions can improve the processing throughput of some algorithms by up to two
times for 16-bit data or four times for 8-bit data. In common with most operations in the
ARM instruction set architecture, all of these new instructions are executed conditionally,
as described in Section 2.2.6.
You can find a full description of all ARMv6 instructions in the instruction set tables of
Appendix A.
15.1.1 SIMD Arithmetic Operations
Table 15.1 shows a summary of the 8-bit SIMD operations. Each byte result is formed
from the arithmetic operation on each of the corresponding byte slices through the source
operands.
The results of these 8-bit operations may require that up to 9 bits be represented, which
either causes a wraparound or a saturation to take place, depending on the particular
instruction used.
In addition to the 8-bit SIMD operations, there is an extensive range of dual 16-bit
operations, shown in Table 15.2. Each halfword (16-bit) result is formed from the arithmetic
operation on each of the corresponding 16-bit slices through the source operands.
The results may need 17 bits to be stored, in which case they either wrap around or,
with the saturating version of the instruction, are saturated to the 16-bit range
(signed or unsigned, depending on the instruction).
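The difference between the wrapping and saturating forms can be sketched in portable C. The following is a minimal reference model, not ARM's definition: the function names uadd8_model and uqadd8_model are hypothetical, and the GE flag updates performed by the real UADD8 are left out.

```c
#include <stdint.h>

/* Model of UADD8: four independent byte adds, each wrapping modulo 256. */
uint32_t uadd8_model(uint32_t rn, uint32_t rm)
{
    uint32_t rd = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        uint32_t sum = a + b;                   /* may need 9 bits */
        rd |= (sum & 0xFFu) << (8 * lane);      /* drop bit 8: wraparound */
    }
    return rd;
}

/* Model of UQADD8: same lanes, but each sum is clamped to 255. */
uint32_t uqadd8_model(uint32_t rn, uint32_t rm)
{
    uint32_t rd = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        uint32_t sum = a + b;
        if (sum > 0xFFu)
            sum = 0xFFu;                        /* saturate instead of wrapping */
        rd |= sum << (8 * lane);
    }
    return rd;
}
```

With Rn = 0xF0102030 and Rm = 0x20010203, the top lane computes 0xF0 + 0x20 = 0x110, so the wrapping model returns 0x10112233 and the saturating model returns 0xFF112233.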
Table 15.1 8-bit SIMD arithmetic operations.
Instruction                  Description
SADD8{<cond>} Rd, Rn, Rm     Signed 8-bit SIMD add
SSUB8{<cond>} Rd, Rn, Rm     Signed 8-bit SIMD subtract
UADD8{<cond>} Rd, Rn, Rm     Unsigned 8-bit SIMD add
USUB8{<cond>} Rd, Rn, Rm     Unsigned 8-bit SIMD subtract
QADD8{<cond>} Rd, Rn, Rm     Signed saturating 8-bit SIMD add
QSUB8{<cond>} Rd, Rn, Rm     Signed saturating 8-bit SIMD subtract
UQADD8{<cond>} Rd, Rn, Rm    Unsigned saturating 8-bit SIMD add
UQSUB8{<cond>} Rd, Rn, Rm    Unsigned saturating 8-bit SIMD subtract
Table 15.2 16-bit SIMD arithmetic operations.
Instruction                   Description
SADD16{<cond>} Rd, Rn, Rm     Signed add of the 16-bit pairs
SSUB16{<cond>} Rd, Rn, Rm     Signed subtract of the 16-bit pairs
UADD16{<cond>} Rd, Rn, Rm     Unsigned add of the 16-bit pairs
USUB16{<cond>} Rd, Rn, Rm     Unsigned subtract of the 16-bit pairs
QADD16{<cond>} Rd, Rn, Rm     Signed saturating add of the 16-bit pairs
QSUB16{<cond>} Rd, Rn, Rm     Signed saturating subtract of the 16-bit pairs
UQADD16{<cond>} Rd, Rn, Rm    Unsigned saturating add of the 16-bit pairs
UQSUB16{<cond>} Rd, Rn, Rm    Unsigned saturating subtract of the 16-bit pairs
Operands for the SIMD instructions are not always found in the correct order within the
source registers; to improve the efficiency of dealing with these situations, there are 16-bit
SIMD operations that perform swapping of the 16-bit words of one operand register. These
operations allow a great deal of flexibility in dealing with halfwords that may be aligned in
different ways in memory and are particularly useful when working with 16-bit complex
number pairs that are packed into 32-bit registers. There are signed, unsigned, saturating
signed, and saturating unsigned versions of these operations, as shown in Table 15.3.
The X in the instruction mnemonic signifies that the two halfwords in Rm are swapped
before the operations are applied so that operations like the following take place:
Rd[15:0] = Rn[15:0] - Rm[31:16]
Rd[31:16] = Rn[31:16] + Rm[15:0]
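The two assignments above correspond to the ADDSUBX form (upper add, lower subtract). A small C model makes the data movement explicit; this is a sketch under the assumption of the wrapping variant, the name saddsubx_model is hypothetical, and the GE flags are not modelled.

```c
#include <stdint.h>

/* Model of SADDSUBX Rd, Rn, Rm (wrapping form):
   Rd[31:16] = Rn[31:16] + Rm[15:0]
   Rd[15:0]  = Rn[15:0]  - Rm[31:16] */
uint32_t saddsubx_model(uint32_t rn, uint32_t rm)
{
    uint32_t rn_lo = rn & 0xFFFFu, rn_hi = rn >> 16;
    uint32_t rm_lo = rm & 0xFFFFu, rm_hi = rm >> 16;
    uint32_t hi = (rn_hi + rm_lo) & 0xFFFFu;    /* upper add, modulo 2^16 */
    uint32_t lo = (rn_lo - rm_hi) & 0xFFFFu;    /* lower subtract, modulo 2^16 */
    return (hi << 16) | lo;
}
```

If each register holds a complex number with the real part in the lower halfword and the imaginary part in the upper halfword, this single operation computes a + j*b: for a = 3 + 4j (0x00040003) and b = 1 + 2j (0x00020001) the model returns 0x00050001, that is, 1 + 5j.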
Table 15.3 16-bit SIMD arithmetic operations with swap.
Instruction                     Description
SADDSUBX{<cond>} Rd, Rn, Rm     Signed upper add, lower subtract, with swap of halfwords in Rm
UADDSUBX{<cond>} Rd, Rn, Rm     Unsigned upper add, lower subtract, with swap of halfwords in Rm
QADDSUBX{<cond>} Rd, Rn, Rm     Signed saturating upper add, lower subtract, with swap of halfwords in Rm
UQADDSUBX{<cond>} Rd, Rn, Rm    Unsigned saturating upper add, lower subtract, with swap of halfwords in Rm
SSUBADDX{<cond>} Rd, Rn, Rm     Signed upper subtract, lower add, with swap of halfwords in Rm
USUBADDX{<cond>} Rd, Rn, Rm     Unsigned upper subtract, lower add, with swap of halfwords in Rm
QSUBADDX{<cond>} Rd, Rn, Rm     Signed saturating upper subtract, lower add, with swap of halfwords in Rm
UQSUBADDX{<cond>} Rd, Rn, Rm    Unsigned saturating upper subtract, lower add, with swap of halfwords in Rm
The addition of the SIMD operations means there is now a need for some way of showing
an overflow or a carry from each SIMD slice through the datapath. The cpsr as originally
described in Section 2.2.5 is modified by adding four additional flag bits to represent each
8-bit slice of the datapath. The newly modified cpsr register with the GE bits is shown in
Figure 15.1 and Table 15.4. The functionality of each GE bit is that of a “greater than or
equal” flag for each slice through the datapath.
Operating systems already save the cpsr register on a context switch. Adding these bits
to the cpsr has little effect on OS support for the architecture.
In addition to basic arithmetic operations on the SIMD data slices, there is considerable
use for operations that allow the picking of individual data elements within the datapath and
forming new ensembles of these elements. A select instruction SEL can independently select
each eight-bit field from one source register Rn or another source register Rm, depending
on the associated GE flag.
Bits   31  30  29  28  27  26:25  24  23:20  19:16    15:10  9  8  7  6  5  4:0
Field  N   Z   C   V   Q   Res    J   Res    GE[3:0]  Res    E  A  I  F  T  mode
Figure 15.1 cpsr layout for ARMv6.
Table 15.4 cpsr fields for ARMv6.
Field    Use
N        Negative flag. Records bit 31 of the result of flag-setting operations.
Z        Zero flag. Records if the result of a flag-setting operation is zero.
C        Carry flag. Records unsigned overflow for addition, not-borrow for subtraction, and is also used by the shifting circuit. See Table A.3.
V        Overflow flag. Records signed overflows for flag-setting operations.
Q        Saturation flag. Certain saturating operations set this flag on saturation. See for example QADD in Appendix A (ARMv5E and above).
J        J = 1 indicates Java execution (must have T = 0). Use the BXJ instruction to change this bit (ARMv5J and above).
Res      These bits are reserved for future expansion. Software should preserve the values in these bits.
GE[3:0]  The SIMD greater-or-equal flags. See SADD in Appendix A (ARMv6).
E        Controls the data endianness. See SETEND in Appendix A (ARMv6).
A        A = 1 disables imprecise data aborts (ARMv6).
I        I = 1 disables IRQ interrupts.
F        F = 1 disables FIQ interrupts.
T        T = 1 indicates Thumb state. T = 0 indicates ARM state. Use the BX or BLX instructions to change this bit (ARMv4T and above).
mode     The current processor mode. See Table B.4.
SEL Rd, Rn, Rm
Rd[31:24] = GE[3] ? Rn[31:24] : Rm[31:24]
Rd[23:16] = GE[2] ? Rn[23:16] : Rm[23:16]
Rd[15:08] = GE[1] ? Rn[15:08] : Rm[15:08]
Rd[07:00] = GE[0] ? Rn[07:00] : Rm[07:00]
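To see how the GE flags and SEL cooperate, the sketch below models USUB8 (which sets GE[i] when byte lane i does not borrow) and SEL in C, then combines them into a byte-wise unsigned minimum. The function names are hypothetical and the GE flags are passed explicitly rather than living in the cpsr.

```c
#include <stdint.h>

/* Model of USUB8: per-byte unsigned subtract. GE[i] is set when lane i does
   not borrow, that is, when the Rn byte is greater than or equal to the Rm byte. */
uint32_t usub8_model(uint32_t rn, uint32_t rm, unsigned *ge)
{
    uint32_t rd = 0;
    *ge = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        if (a >= b)
            *ge |= 1u << lane;
        rd |= ((a - b) & 0xFFu) << (8 * lane);
    }
    return rd;
}

/* Model of SEL: lane i comes from Rn when GE[i] is set, otherwise from Rm. */
uint32_t sel_model(uint32_t rn, uint32_t rm, unsigned ge)
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 4; lane++)
        if (ge & (1u << lane))
            mask |= (uint32_t)0xFFu << (8 * lane);
    return (rn & mask) | (rm & ~mask);
}

/* Byte-wise unsigned minimum built from the two models. */
uint32_t umin8_model(uint32_t x, uint32_t y)
{
    unsigned ge;
    (void)usub8_model(x, y, &ge);   /* GE[i] = (x_i >= y_i) */
    return sel_model(y, x, ge);     /* take y where x >= y, otherwise x */
}
```

For x = 0x10FF0380 and y = 0x20010380, USUB8 sets GE = 0b0111 and the selected minimum is 0x10010380.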
These instructions, together with the other SIMD operations, can be used very effectively
to implement the core of the Viterbi algorithm, which is used extensively for symbol
recovery in communication systems. Since the Viterbi algorithm is essentially a statistical
maximum likelihood selection algorithm, it is also used in such areas as speech and
handwriting recognition engines. The core of Viterbi is an operation that is commonly known as
add-compare-select (ACS), and in fact many DSP processors have customized ACS instructions.
With its parallel (SIMD) add, subtract (which can be used to compare), and selection
instructions, ARMv6 can implement an extremely efficient add-compare-select:
UADD8 Rp1, Rs1, Rb1 ; path 1 = state 1 + branch 1 (metric update)
UADD8 Rp2, Rs2, Rb2 ; path 2 = state 2 + branch 2 (metric update)
USUB8 Rt, Rp1, Rp2 ; compare metrics - setting the SIMD flags
SEL Rd, Rp2, Rp1 ; choose best (smallest) metric
This kernel performs the ACS operation on four paths in parallel and takes a total
of 4 cycles on the ARM1136J-S. The same sequence coded for the ARMv5TE instruction
set must perform each of the operations serially, taking at least 16 cycles. Thus the
add-compare-select function is four times faster on ARM1136J-S for eight-bit metrics.
15.1.2 Packing Instructions
The ARMv6 architecture includes a new set of packing instructions, shown in Table 15.5,
that are used to construct new 32-bit packed data from pairs of 16-bit values in different
source registers. The second operand can be optionally shifted. Packing instructions are
particularly useful for pairing 16-bit values so that you can make use of the 16-bit SIMD
processing instructions described earlier.
Table 15.5 Packing instructions.
Instruction                                      Description
PKHTB{<cond>} Rd, Rn, Rm {, ASR #<shift_imm>}    Pack the top 16 bits of Rn with the bottom 16 bits of the shifted Rm into the destination Rd
PKHBT{<cond>} Rd, Rn, Rm {, LSL #<shift_imm>}    Pack the top 16 bits of the shifted Rm with the bottom 16 bits of Rn into the destination Rd
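The packing instructions of Table 15.5 can be modelled with a few masks and shifts. The C below is an illustrative sketch: pkhbt_model, pkhtb_model, and the asr32 helper are hypothetical names, and the shift amount is assumed to lie in the range 0 to 31.

```c
#include <stdint.h>

/* Arithmetic shift right of a 32-bit register value, n in 0..31 (helper). */
static uint32_t asr32(uint32_t x, unsigned n)
{
    uint32_t r = x >> n;
    if (x & 0x80000000u)
        r |= (uint32_t)~(0xFFFFFFFFu >> n);     /* replicate the sign bit */
    return r;
}

/* Model of PKHBT Rd, Rn, Rm, LSL #lsl: bottom half of Rn, top half of shifted Rm. */
uint32_t pkhbt_model(uint32_t rn, uint32_t rm, unsigned lsl)
{
    return (rn & 0x0000FFFFu) | ((rm << lsl) & 0xFFFF0000u);
}

/* Model of PKHTB Rd, Rn, Rm, ASR #asr: top half of Rn, bottom half of shifted Rm. */
uint32_t pkhtb_model(uint32_t rn, uint32_t rm, unsigned asr)
{
    return (rn & 0xFFFF0000u) | (asr32(rm, asr) & 0x0000FFFFu);
}
```

For example, pkhbt_model(0x1111AAAA, 0x0000BBBB, 16) combines two separately held halfwords into 0xBBBBAAAA, and pkhtb_model(0xCCCC1111, 0xDDDD2222, 16) keeps the two upper halfwords as 0xCCCCDDDD.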
15.1.3 Complex Arithmetic Support

Complex arithmetic is commonly used in communication signal processing, and in particular
in the implementations of transform algorithms such as the Fast Fourier Transform as
described in Chapter 8. Much of the implementation detail examined in that chapter concerns
the efficient implementation of the complex multiplication using ARMv4 or ARMv5E
instruction sets.
ARMv6 adds new multiply instructions to accelerate complex multiplication, shown in
Table 15.6. Both of these operations optionally swap the order of the two 16-bit halves of
source operand Rs if you specify the X suffix.
Table 15.6 Instructions to support 16-bit complex multiplication.
Instruction                    Description
SMUAD{X}{<cond>} Rd, Rm, Rs    Dual 16-bit signed multiply and add
SMUSD{X}{<cond>} Rd, Rm, Rs    Dual 16-bit signed multiply and subtract
Example 15.1
In this example Ra and Rb hold complex numbers with 16-bit coefficients packed with
their real parts in the lower half of a register and their imaginary parts in the upper half.
We multiply Ra and Rb to produce a new complex number Rc. The code assumes that the
16-bit values represent Q15 fractions. Here is the code for ARMv6:
SMUSD Rt, Ra, Rb ; real*real-imag*imag at Q30
SMUADX Rc, Ra, Rb ; real*imag+imag*real at Q30
QADD Rt, Rt, Rt ; convert to Q31 & saturate
QADD Rc, Rc, Rc ; convert to Q31 & saturate
PKHTB Rc, Rc, Rt, ASR #16 ; pack results
Compare this with an ARMv5TE implementation:
SMULBB Rc, Ra, Rb ; real*real
SMULTT Rt, Ra, Rb ; imag*imag
QSUB Rt, Rc, Rt ; real*real-imag*imag at Q30
SMULTB Rc, Ra, Rb ; imag*real
SMLABT Rc, Ra, Rb, Rc ; + real*imag at Q30
QADD Rt, Rt, Rt ; convert to Q31 & saturate
QADD Rc, Rc, Rc ; convert to Q31 & saturate
MOV Rc, Rc, LSR #16
MOV Rt, Rt, LSR #16
ORR Rt, Rt, Rc, LSL #16 ; pack results
There are 10 cycles for ARMv5E versus 5 cycles for ARMv6. Clearly with any algorithm
doing very intense complex maths, a two times improvement in performance can be gained
for the complex multiply. ■
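The five-instruction ARMv6 sequence of Example 15.1 can be checked against a C reference model. This is a sketch only: all function names are hypothetical, registers are represented as uint32_t bit patterns, and the Q flag is ignored.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h (helper). */
static int32_t sx16(uint32_t h)
{
    h &= 0xFFFFu;
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* SMUSD Rd, Rm, Rs: Rm.lo * Rs.lo - Rm.hi * Rs.hi */
static uint32_t smusd_model(uint32_t rm, uint32_t rs)
{
    return (uint32_t)(sx16(rm) * sx16(rs)) - (uint32_t)(sx16(rm >> 16) * sx16(rs >> 16));
}

/* SMUADX Rd, Rm, Rs: Rm.lo * Rs.hi + Rm.hi * Rs.lo (halves of Rs swapped) */
static uint32_t smuadx_model(uint32_t rm, uint32_t rs)
{
    return (uint32_t)(sx16(rm) * sx16(rs >> 16)) + (uint32_t)(sx16(rm >> 16) * sx16(rs));
}

/* QADD: signed 32-bit add that saturates instead of wrapping. */
static uint32_t qadd_model(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    if (~(a ^ b) & (a ^ sum) & 0x80000000u)     /* same-sign inputs, different-sign sum */
        return (a & 0x80000000u) ? 0x80000000u : 0x7FFFFFFFu;
    return sum;
}

/* Q15 complex multiply following the ARMv6 sequence step by step. */
uint32_t cmul_q15_model(uint32_t ra, uint32_t rb)
{
    uint32_t rt = smusd_model(ra, rb);          /* real part at Q30 */
    uint32_t rc = smuadx_model(ra, rb);         /* imaginary part at Q30 */
    rt = qadd_model(rt, rt);                    /* to Q31 */
    rc = qadd_model(rc, rc);                    /* to Q31 */
    return (rc & 0xFFFF0000u) | (rt >> 16);     /* PKHTB Rc, Rc, Rt, ASR #16 */
}
```

Multiplying 0.5 + 0.5j (0x40004000) by 0.5 - 0.5j (0xC0004000) gives 0.5 + 0j, and the model returns 0x00004000 accordingly. Note that the result is truncated to Q15, not rounded.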
15.1.4 Saturation Instructions
Saturating arithmetic was first addressed with the E extensions that were added to the
ARMv5TE architecture, which was introduced with the ARM966E and ARM946E products.
ARMv6 takes this further with individual and more flexible saturation instructions that can
operate on 32-bit words and 16-bit halfwords. In addition to these instructions, shown in
Table 15.7, there are the new saturating arithmetic SIMD operations that have already been
described in Section 15.1.1.
Table 15.7 Saturation instructions.
Instruction                             Description
SSAT Rd, #<BitPosition>, Rm,{<Shift>}   Signed 32-bit saturation at an arbitrary bit position. Shift can be an LSL or ASR.
SSAT16{<cond>} Rd, #<immed>, Rm         Signed dual 16-bit saturation at the same position in both halves.
USAT Rd, #<BitPosition>, Rm,{<Shift>}   Unsigned 32-bit saturation at an arbitrary bit position. Shift can be an LSL or ASR.
USAT16{<cond>} Rd, #<immed>, Rm         Unsigned dual 16-bit saturation at the same position in both halves.
Note that in the 32-bit versions of these saturation operations there is an optional
arithmetic shift of the source register Rm before saturation, allowing scaling to take place
in the same instruction.
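Saturation at an arbitrary bit position amounts to clamping against a power-of-two range. The C sketch below models the unshifted 32-bit forms; the names ssat_model and usat_model are hypothetical, the bit position is assumed to be 1 to 32 for the signed form and 0 to 31 for the unsigned form, and the Q flag is not modelled.

```c
#include <stdint.h>

/* Model of SSAT Rd, #bits, Rm: clamp to the signed range of a bits-wide field. */
int32_t ssat_model(int32_t value, unsigned bits)
{
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    if (value > max)
        return (int32_t)max;
    if (value < min)
        return (int32_t)min;
    return value;
}

/* Model of USAT Rd, #bits, Rm: clamp a signed input to 0 .. 2^bits - 1. */
int32_t usat_model(int32_t value, unsigned bits)
{
    int64_t max = ((int64_t)1 << bits) - 1;
    if (value < 0)
        return 0;
    if (value > max)
        return (int32_t)max;
    return value;
}
```

For instance, ssat_model(40000, 16) yields 32767 and usat_model(300, 8) yields 255, which is the usual clamp applied when converting processed samples back to 16-bit audio or 8-bit pixels.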
15.1.5 Sum of Absolute Differences Instructions

The two new instructions USAD8 and USADA8 are probably the most application-specific
in the ARMv6 architecture. They compute the sum of absolute differences
between eight-bit values and are particularly useful in motion video compression algorithms
such as MPEG or H.263, including motion estimation algorithms that measure motion by
comparing blocks using many sum-of-absolute-difference operations (see Figure 15.2).
Table 15.8 lists these instructions.
Table 15.8 Sum of absolute differences.
Instruction                       Description
USAD8{<cond>} Rd, Rm, Rs          Sum of absolute differences
USADA8{<cond>} Rd, Rm, Rs, Rn     Accumulated sum of absolute differences
To compare an N × N square at (x, y) in image $p_1$ with an N × N square $p_2$, we calculate
the accumulated sum of absolute differences:
$$a(x, y) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| p_1(x + i, y + j) - p_2(i, j) \right|$$
[Figure: the four byte lanes of Rm and Rs each feed an absdiff unit; the four results are summed together with Rn to produce Rd.]
Figure 15.2 Sum-of-absolute-differences operation.
To implement this using the new instructions, use the following sequence to compute the
sum-of-absolute differences of four pixels:
LDR p1,[p1Ptr],#4 ; load 4 pixels from p1
LDR p2,[p2Ptr],#4 ; load 4 pixels from p2
;load delay-slot
;load delay-slot
USADA8 acc, p1, p2, acc ; accumulate sum abs diff
There is a tremendous performance advantage for this algorithm over an ARMv5TE
implementation. There is a four times improvement in performance from the eight-bit
SIMD alone. Additionally the USADA8 operation includes the accumulation operation. The
USAD8 operation will typically be used to carry out the setup into the loop before there is an
existing accumulated value.
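A C reference model of the two instructions is useful for checking a motion-estimation kernel on a host machine. The sketch below is illustrative: usad8_model, usada8_model, and sad_words are hypothetical names, and pixels are assumed to be packed four to a 32-bit word.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of USADA8 Rd, Rm, Rs, Rn: Rn plus the sum of the four byte-wise
   absolute differences between Rm and Rs. */
uint32_t usada8_model(uint32_t rm, uint32_t rs, uint32_t rn)
{
    uint32_t acc = rn;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rm >> (8 * lane)) & 0xFFu;
        uint32_t b = (rs >> (8 * lane)) & 0xFFu;
        acc += (a > b) ? a - b : b - a;
    }
    return acc;
}

/* Model of USAD8 Rd, Rm, Rs: the same operation with no accumulator input. */
uint32_t usad8_model(uint32_t rm, uint32_t rs)
{
    return usada8_model(rm, rs, 0);
}

/* Sum of absolute differences over nwords packed words (hypothetical helper). */
uint32_t sad_words(const uint32_t *p1, const uint32_t *p2, size_t nwords)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < nwords; i++)
        acc = usada8_model(p1[i], p2[i], acc);
    return acc;
}
```

For the words 0x10203040 and 0x40302010 the four differences are 0x30, 0x10, 0x10, and 0x30, so usad8_model returns 128.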
15.1.6 Dual 16-Bit Multiply Instructions
ARMv5TE introduced considerable DSP performance to ARM, but ARMv6 takes this much
further. Implementations of ARMv6 (such as ARM1136J) have a dual 16 × 16 multiply
capability, which is comparable with many high-end dedicated DSP devices. Table 15.9 lists
these instructions.
Table 15.9 Dual 16-bit multiply operations.
Instruction                              Description
SMLAD{X}{<cond>} Rd, Rm, Rs, Rn          Dual signed multiply accumulate with 32-bit accumulation
SMLALD{X}{<cond>} RdLo, RdHi, Rm, Rs     Dual signed multiply accumulate with 64-bit accumulation
SMLSD{X}{<cond>} Rd, Rm, Rs, Rn          Dual signed multiply subtract with 32-bit accumulation
SMLSLD{X}{<cond>} RdLo, RdHi, Rm, Rs     Dual signed multiply subtract with 64-bit accumulation
We demonstrate the use of SMLAD as a signed dual multiply in a dot-product inner
loop:
MOV R0, #0 ; zero accumulator
Loop
LDMIA R2!,{R4,R5,R6,R7} ; load 8 16-bit data items
LDMIA R1!,{R8,R9,R10,R11} ; load 8 16-bit coefficients
SUBS R3,R3,#8 ; subtract 8 from the loop counter
SMLAD R0,R4,R8,R0 ; 2 multiply accumulates
SMLAD R0,R5,R9,R0
SMLAD R0,R6,R10,R0
SMLAD R0,R7,R11,R0
BGT Loop ; loop if more coefficients
This loop delivers eight 16 × 16 multiply accumulates in 10 cycles without using any
data-blocking techniques. If one set of the operands for the dot-product is held in registers, then
performance approaches the peak rate of two multiplies per cycle.
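The arithmetic performed by the inner loop can be expressed as a C reference model, which is handy for verifying an assembly implementation. This is a sketch with hypothetical names (smlad_model, dot_packed); it assumes two signed 16-bit values per 32-bit word and ignores the Q flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of h (helper). */
static int32_t sx16(uint32_t h)
{
    h &= 0xFFFFu;
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* Model of SMLAD Rd, Rm, Rs, Rn: Rd = Rn + Rm.lo * Rs.lo + Rm.hi * Rs.hi.
   The accumulation wraps modulo 2^32. */
uint32_t smlad_model(uint32_t rm, uint32_t rs, uint32_t rn)
{
    int32_t p_lo = sx16(rm) * sx16(rs);
    int32_t p_hi = sx16(rm >> 16) * sx16(rs >> 16);
    return rn + (uint32_t)p_lo + (uint32_t)p_hi;
}

/* Dot product of two arrays of packed halfword pairs. */
uint32_t dot_packed(const uint32_t *data, const uint32_t *coef, size_t nwords)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < nwords; i++)
        acc = smlad_model(data[i], coef[i], acc);   /* 2 multiply accumulates */
    return acc;
}
```

With data halfwords (1, 2, 3, -1) and coefficient halfwords (3, 4, 2, 5), the packed dot product is 3 + 8 + 6 - 5 = 12.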
15.1.7 Most Significant Word Multiplies
ARMv5TE added arithmetic operations that are used extensively in a very broad range
of DSP algorithms including control and communications and that were designed to use
the Q15 data format. However, in audio processing applications it is common for 16-bit
processing to be insufficient to describe the quality of the signals. Typically 32-bit values
are used in these cases, and ARMv6 adds some new multiply instructions that operate on
Q31 formatted values. (Recall that Q-format arithmetic is described in detail in Chapter 8.)

These new instructions are listed in Table 15.10.
Table 15.10 Most significant word multiplies.
Instruction                         Description
SMMLA{R}{<cond>} Rd, Rm, Rs, Rn     Signed 32 × 32 multiply with accumulation of the high 32 bits of the product to the 32-bit accumulator Rn
SMMLS{R}{<cond>} Rd, Rm, Rs, Rn     Signed 32 × 32 multiply subtracting from (Rn << 32) and then taking the high 32 bits of the result
SMMUL{R}{<cond>} Rd, Rm, Rs         Signed 32 × 32 multiply with upper 32 bits of product only
The optional {R} in the mnemonic allows the addition of the fixed constant 0x80000000
to the 64-bit product before producing the upper 32 bits. This allows for biased rounding
of the result.
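The effect of the rounding constant is easiest to see in a small C model. The sketch below uses hypothetical names (smmla_model, smmul_model) and represents registers as uint32_t bit patterns; it is not ARM's definition of the instructions.

```c
#include <stdint.h>

/* Sign-extend a 32-bit register value to 64 bits (helper). */
static int64_t sx32(uint32_t x)
{
    return (x & 0x80000000u) ? (int64_t)x - 4294967296LL : (int64_t)x;
}

/* Model of SMMLA{R} Rd, Rm, Rs, Rn: high word of (Rn << 32) + Rm * Rs,
   with 0x80000000 added first when rounding is requested. */
uint32_t smmla_model(uint32_t rm, uint32_t rs, uint32_t rn, int round)
{
    uint64_t acc = ((uint64_t)rn << 32) + (uint64_t)(sx32(rm) * sx32(rs));
    if (round)
        acc += 0x80000000u;
    return (uint32_t)(acc >> 32);
}

/* Model of SMMUL{R} Rd, Rm, Rs: the same operation with a zero accumulator. */
uint32_t smmul_model(uint32_t rm, uint32_t rs, int round)
{
    return smmla_model(rm, rs, 0, round);
}
```

Multiplying 3 by 0x40000000 gives the 64-bit product 0x00000000C0000000: the truncating form returns 0, while the rounding form returns 1. Keep in mind that the high word of a Q31 × Q31 product is a Q30 value, so 0.5 × 0.5 (0x40000000 squared) yields 0x10000000.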
15.1.8 Cryptographic Multiplication Extensions
In some cryptographic algorithms, very long multiplications are quite common. In order
to maximize their throughput, a new 64 + 32 × 32 → 64 multiply accumulate operation
has been added to complement the already existing 32 × 32 multiply operation UMULL
(see Table 15.11).
Here is an example of a very efficient 64-bit ×64-bit multiply using the new instructions:
; inputs: First 64-bit multiply operand in (RaHi,RaLo)
; Second 64-bit multiply operand in (RbHi, RbLo)
umull64x64
UMULL R0, R2, RaLo, RbLo
UMULL R1, R3, RaHi, RbLo
UMAAL R1, R2, RaLo, RbHi
UMAAL R2, R3, RaHi, RbHi
; output: 128-bit result in (R3, R2, R1, R0)

Table 15.11 Cryptographic multiply.
Instruction                           Description
UMAAL{<cond>} RdLo, RdHi, Rm, Rs      Special crypto multiply (RdHi : RdLo) = Rm ∗ Rs + RdHi + RdLo
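The 64-bit × 64-bit sequence can be traced in C by modelling UMULL and UMAAL directly. This is a sketch with hypothetical function names; the array out[] stands in for (R0, R1, R2, R3), least significant word first. The UMAAL sum can never overflow 64 bits because (2^32 - 1)^2 + 2(2^32 - 1) = 2^64 - 1.

```c
#include <stdint.h>

/* Model of UMULL RdLo, RdHi, Rm, Rs: (RdHi:RdLo) = Rm * Rs. */
static void umull_model(uint32_t *rd_lo, uint32_t *rd_hi, uint32_t rm, uint32_t rs)
{
    uint64_t r = (uint64_t)rm * rs;
    *rd_lo = (uint32_t)r;
    *rd_hi = (uint32_t)(r >> 32);
}

/* Model of UMAAL RdLo, RdHi, Rm, Rs: (RdHi:RdLo) = Rm * Rs + RdHi + RdLo. */
static void umaal_model(uint32_t *rd_lo, uint32_t *rd_hi, uint32_t rm, uint32_t rs)
{
    uint64_t r = (uint64_t)rm * rs + *rd_hi + *rd_lo;
    *rd_lo = (uint32_t)r;
    *rd_hi = (uint32_t)(r >> 32);
}

/* 64 x 64 -> 128-bit multiply using the same four steps as umull64x64. */
void mul64x64_model(uint64_t a, uint64_t b, uint32_t out[4])
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    umull_model(&out[0], &out[2], a_lo, b_lo);  /* UMULL R0, R2, RaLo, RbLo */
    umull_model(&out[1], &out[3], a_hi, b_lo);  /* UMULL R1, R3, RaHi, RbLo */
    umaal_model(&out[1], &out[2], a_lo, b_hi);  /* UMAAL R1, R2, RaLo, RbHi */
    umaal_model(&out[2], &out[3], a_hi, b_hi);  /* UMAAL R2, R3, RaHi, RbHi */
}
```

Squaring 0xFFFFFFFFFFFFFFFF produces 0xFFFFFFFFFFFFFFFE0000000000000001, so out[] ends up as {0x00000001, 0x00000000, 0xFFFFFFFE, 0xFFFFFFFF}.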
15.2 System and Multiprocessor Support Additions to ARMv6
As systems become more complicated, they incorporate multiple processors and processing
engines. These engines may share different views of memory and even use different
endiannesses (byte order). To support communication in these systems, ARMv6 adds support
for mixed-endian systems, fast exception processing, and new synchronization primitives.
15.2.1 Mixed-Endianness Support
Traditionally the ARM architecture has had a little-endian view of memory with a big-
endian mode that could be switched at reset. This big-endian mode sets the memory system
up as big-endian ordered instructions and data.
As mentioned in the introduction to this chapter, ARM has found its cores integrated
into very sophisticated system-on-chip devices dealing with mixed endianness, and often has
to deal with both little- and big-endian data in software. ARMv6 adds a new instruction
to set the data endianness for large code sequences (see Table 15.12), and also some
individual manipulation instructions to increase the efficiency of dealing with mixed-endian
environments.
The endian_specifier is either BE for big-endian or LE for little-endian. A program
would typically use SETEND when there is a considerable chunk of code that is carrying
out operations on data with a particular endianness. Figure 15.3 shows individual byte
manipulation instructions.
Table 15.12 Setting data-endianness operation.
Instruction                   Description
SETEND <endian_specifier>     Change the default data endianness based on the <endian_specifier> argument.
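The individual byte manipulation instructions of Figure 15.3 (REV, REV16, and REVSH) can be modelled in C with shifts and masks. The following is a sketch with hypothetical function names, intended as a host-side reference and not as a replacement for the instructions themselves.

```c
#include <stdint.h>

/* Model of REV: reverse the order of all four bytes in a 32-bit word. */
uint32_t rev_model(uint32_t rm)
{
    return (rm >> 24) | ((rm >> 8) & 0x0000FF00u) |
           ((rm << 8) & 0x00FF0000u) | (rm << 24);
}

/* Model of REV16: swap the two bytes within each halfword. */
uint32_t rev16_model(uint32_t rm)
{
    return ((rm >> 8) & 0x00FF00FFu) | ((rm << 8) & 0xFF00FF00u);
}

/* Model of REVSH: swap the bytes of the lower halfword, then sign-extend
   the 16-bit result to 32 bits. */
uint32_t revsh_model(uint32_t rm)
{
    uint32_t half = ((rm >> 8) & 0xFFu) | ((rm & 0xFFu) << 8);
    return (half & 0x8000u) ? (half | 0xFFFF0000u) : half;
}
```

For Rm = 0x11223344 the models return 0x44332211 (REV), 0x22114433 (REV16), and 0x00004433 (REVSH). A typical use is converting a big-endian network field after loading it on a little-endian configuration.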
Instruction                Description
REV{<cond>} Rd, Rm         Reverse order of all four bytes in a 32-bit word
                           Rm = B3 B2 B1 B0  ->  Rd = B0 B1 B2 B3
REV16{<cond>} Rd, Rm       Reverse order of byte pairs in upper and lower half
                           Rm = B3 B2 B1 B0  ->  Rd = B2 B3 B0 B1
REVSH{<cond>} Rd, Rm       Reverse byte order of the signed halfword
                           Rm = B3 B2 B1 B0  ->  Rd = S S B0 B1 (S = sign extension)
Figure 15.3 Reverse instructions in ARMv6.
15.2.2 Exception Processing
It is common for operating systems to save the return state of an interrupt or exception
on a stack. ARMv6 adds the instructions in Table 15.13 to improve the efficiency of this
operation, which can occur very frequently in interrupt/scheduler driven systems.
Table 15.13 Exception processing operations.
Instruction                         Description
SRS<addressing_mode>, #mode{!}      Save return state (lr and spsr) on the stack addressed by sp in the specified mode.
RFE <addressing_mode>, Rn{!}        Return from exception. Loads the pc and cpsr from the stack pointed to by Rn.
CPS<effect> <iflags> {,#<mode>}     Change processor state with interrupt enable or disable.
CPS #<mode>                         Change processor state only.
15.2.3 Multiprocessing Synchronization Primitives
As system-on-chip (SoC) architectures have become more sophisticated, ARM cores are
now often found in devices with many processing units that compete for shared resources.
The ARM architecture has always had the SWP instruction for implementing semaphores to
ensure consistency in such environments. As the SoC has become more complex, however,
certain aspects of SWP cause a performance bottleneck in some instances. Recall that SWP is
basically a “blocking” primitive that locks the external bus of the processor and uses most of
its bandwidth just to wait for a resource to be released. In this sense the SWP instruction is
considered “pessimistic”—no computation can continue until SWP returns with the freed
resource.
New instructions LDREX and STREX (load and store exclusive) were added to the
ARMv6 architecture to solve this problem. These instructions, listed in Table 15.14, are
very straightforward in use and are implemented by having a system monitor out in the
memory system. LDREX optimistically loads a value from memory into a register assuming
that nothing else will change the value in memory while we are working on it. STREX stores
a value back out to memory and returns an indication of whether the value in memory
was changed or not between the original LDREX operation and this store. In this way the
primitives are “optimistic”—you continue processing the data you loaded with LDREX
even though some external device may also be modifying the value. Only if a modification
actually took place externally is the value thrown away and reloaded.

The big difference for the system is that the processor no longer waits around on the
system bus for a semaphore to be free, and therefore leaves most of the system bus bandwidth
available for other processes or processors.
Table 15.14 Load and store exclusive operations.
Instruction                     Description
LDREX{<cond>} Rd, [Rn]          Load from address in Rn and set memory monitor
STREX{<cond>} Rd, Rm, [Rn]      Store to address in Rn and flag if successful in Rd (Rd = 0 if successful)
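The optimistic load, compute, conditional-store pattern can be illustrated in C using the C11 <stdatomic.h> compare-exchange operation as a stand-in. This is an analogy under stated assumptions, not the instructions themselves: compare-exchange checks the stored value, whereas the exclusive monitor tracks accesses to the address, and the function name atomic_add_optimistic is hypothetical. Compilers targeting ARMv6 and later commonly implement such loops with LDREX and STREX, but that mapping is a toolchain detail.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Optimistic read-modify-write: load the value, compute the update, then try
   to store it. The store fails, and the loop retries, only if the location
   changed in the meantime. */
uint32_t atomic_add_optimistic(_Atomic uint32_t *addr, uint32_t delta)
{
    uint32_t old = atomic_load(addr);           /* plays the role of LDREX */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
        /* plays the role of a failed STREX: old now holds the fresh value,
           so the update is recomputed on the next attempt */
    }
    return old + delta;
}
```

No bus lock is held between the load and the store, so other processors remain free to use the memory system while the update is being computed.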
15.3 ARMv6 Implementations
ARM completed development of ARM1136J in December 2002, and at this writing
consumer products are being designed with this core. The ARM1136J pipeline is the most
sophisticated ARM implementation to date. As shown in Figure 15.4, it has an eight-stage
pipeline with separate parallel pipelines for load/store and multiply/accumulate.
The parallel load/store unit (LSU) with hit-under-miss capability allows load and store
operations to be issued and execution to continue while the load or store is completing with
the slower memory system. By decoupling the execution pipeline from the completion of
loads or stores, the core can gain considerable extra performance since the memory system
is typically many times slower than the core speed. Hit-under-miss extends this decoupling
out to the L1-L2 memory interface so that an L1 cache miss can occur and an L2 transaction
can be completing while other L1 hits are still going on.
Another big change in microarchitecture is the move from virtually tagged caches to
physically tagged caches. Traditionally, ARM has used virtually tagged caches where the
MMU is between the caches and the outside L2 memory system. With ARMv6, this changes
so that the MMU is now between the core and the L1 caches, so that all cache memory
accesses are using physical (already translated) addresses. One of the big benefits of this
approach is considerably reduced cache flushing on context switches when the ARM is
running large operating systems. This reduced flushing will also reduce power consumption
in the end system since cache flushing directly implies more external memory accesses. In
some cases it is expected that this architectural change will deliver up to a 20% performance
improvement.
[Figure: pipeline diagram. Common decode pipeline: Fe1 (1st fetch stage), Fe2 (2nd fetch stage), De (instruction decode), Iss (register read and instruction issue). The execute stages Ex1, Ex2, Ex3 are split into three parallel pipelines. ALU pipeline: Sh (shifter operation), ALU (calculate writeback value), Sat (saturation), followed by WBex (base register writeback). Multiply pipeline: MAC1, MAC2, MAC3 (1st, 2nd, and 3rd multiply stages). Load/store pipeline: ADD (data address calculation), DC1 (first stage of data cache access), DC2 (second stage of data cache access), followed by WBls (writeback from LSU). Annotations: hit under miss; load miss waits.]
Figure 15.4 ARM1136J pipeline.
Source: ARM Limited, ARM 1136J, Technical Reference Manual, 2003.
15.4 Future Technologies beyond ARMv6
In 2003, ARM made further technology announcements including TrustZone and
Thumb-2. While these technologies are very new, at this writing, they are being included
in new microprocessor cores. The next sections briefly introduce these new technologies.
15.4.1 TrustZone
TrustZone is an architectural extension targeting the security of transactions that may be
carried out using consumer products such as cell phones and, in the future, perhaps online
transactions to download music or video for example. It was first introduced in October
2003 when ARM announced the ARM1176JZ-S.
The fundamental idea is that operating systems (even on embedded devices) are now
so complex that it is very hard to verify security and correctness in the software. The ARM
solution to this problem is to add new operating “states” to the architecture where only a
small verifiable software kernel will run, and this will provide services to the larger operating
system. The microprocessor core then takes a role in controlling system peripherals that
may be only available to the secure “state” through some new exported signals on the bus
interface. The system states are shown in Figure 15.5.
TrustZone is most useful in devices that will be carrying out content downloads such as
cell phones or other portable devices with network connections. Details of this architecture
are not public at the time of writing.
[Figure: block diagram of a Normal side and a Secure side, each divided into User and Privileged levels. Labels shown: App (several), S-App (two), Platform OS, Secure kernel, Monitor, fixed entry points, trusted code base.]
Figure 15.5 Modified security structure using TrustZone technology.
Source: Richard York, A New Foundation for CPU Systems Security: Security Extensions to
the ARM Architecture, 2003.
15.4.2 Thumb-2
Thumb-2 is an architectural extension designed to increase performance at high code
density. It allows for a blend of 32-bit ARM-like instructions with 16-bit Thumb
instructions. This combination enables you to have the code density benefits of Thumb with the
additional performance benefits of access to 32-bit instructions.
Thumb-2 was announced in October 2003 and will be implemented in the
ARM1156T2-S processor. Details of this architecture are not public at the time of writing.
15.5 Summary
The ARM architecture is not a static constant but is being developed and improved to suit the
applications required by today’s consumer devices. Although the ARMv5TE architecture
was very successful at adding some DSP support to the ARM, the ARMv6 architecture
extends the DSP support as well as adding support for large multiprocessor systems.
Table 15.15 shows how these new technologies map to different processor cores.
ARM still concentrates on one of its key benefits—code density—and has recently
announced the Thumb-2 extension to its popular Thumb architecture. The new focus on
security with TrustZone gives ARM a leading edge in this area.
Expect many more innovations over the years to come!
Table 15.15 Recently announced cores.
Processor core    Architecture version
ARM1136J-S        ARMv6J
ARM1156T2-S       ARMv6 + Thumb-2
ARM1176JZ-S       ARMv6J + TrustZone
A.1 Using This Appendix
A.2 Syntax
A.2.1 Optional Expressions
A.2.2 Register Names
A.2.3 Values Stored as Immediates
A.2.4 Condition Codes and Flags
A.2.5 Shift Operations
A.3 Alphabetical List of ARM and Thumb Instructions
A.4 ARM Assembler Quick Reference
A.4.1 ARM Assembler Variables
A.4.2 ARM Assembler Labels
A.4.3 ARM Assembler Expressions
A.4.4 ARM Assembler Directives
A.5 GNU Assembler Quick Reference
A.5.1 GNU Assembler Directives
Appendix A ARM and Thumb Assembler Instructions
This appendix lists the ARM and Thumb instructions available up to, and including, ARM
architecture ARMv6, which was just released at the time of writing. We list the operations
in alphabetical order for easy reference. Sections A.4 and A.5 give quick reference guides to
the ARM and GNU assemblers armasm and gas.
We have designed this appendix for practical programming use, both for writing
assembly code and for interpreting disassembly output. It is not intended as a definitive
architectural ARM reference. In particular, we do not list the exhaustive details of each
instruction bitmap encoding and behavior. For this level of detail, see the ARM Architecture
Reference Manual, edited by David Seal, published by Addison Wesley. We do give a
summary of ARM and Thumb instruction set encodings in Appendix B.
A.1 Using This Appendix
Each appendix entry begins by enumerating the available instruction formats for the given
instruction class. For example, the first entry for the instruction class ADD reads
1. ADD<cond>{S} Rd, Rn, #<rotated_immed> ARMv1
The fields <cond> and <rotated_immed> are two of a number of standard fields described
in Section A.2. Rd and Rn denote ARM registers. The instruction is only executed if the
condition <cond> is passed. Each entry also describes the action of the instruction if it is
executed.
The {S} denotes that you may apply an optional S suffix to the instruction. Finally,
the right-hand column specifies that the instruction is available from the listed ARM
architecture version onwards. Table A.1 shows the entries possible for this column.
Table A.1 Instruction types.
Type       Meaning
ARMvX      32-bit ARM instruction first appearing in ARM architecture version X
THUMBvX    16-bit Thumb instruction first appearing in Thumb architecture version X
MACRO      Assembler pseudoinstruction
Note that there is no direct correlation between the Thumb architecture number and
the ARM architecture number. The THUMBv1 architecture is used in ARMv4T processors;
the THUMBv2 architecture, in ARMv5T processors; and the THUMBv3 architecture, in
ARMv6 processors.
Each instruction definition is followed by a notes section describing restrictions on the
use of the instruction. When we make a statement such as “Rd must not be pc,” we mean
that the description of the function only applies when this condition holds. If you break
the condition, then the instruction may be unpredictable or have predictable effects that we
haven’t had space to describe here. Well-written programs should not need to break these
conditions.
A.2 Syntax
We use the following syntax and abbreviations throughout this appendix.
A.2.1 Optional Expressions

{<expr>} is an optional expression. For example, LDR{B} is shorthand for LDR or LDRB.

{<exp1>|<exp2>|…|<expN>}, including at least one “|” divider, is a list of expressions.
One of the listed expressions must appear. For example, LDR{B|H} is shorthand for
LDRB or LDRH. It does not include LDR. We would represent these three possibilities by
LDR{|B|H}.
A.2.2 Register Names

Rd, Rn, Rm, Rs, RdHi, RdLo represent ARM registers in the range r0 to r15.

Ld, Ln, Lm, Ls represent low-numbered ARM registers in the range r0 to r7.

Hd, Hn, Hm, Hs represent high-numbered ARM registers in the range r8 to r15.

Cd, Cn, Cm represent coprocessor registers in the range c0 to c15.

sp, lr, pc are names for r13, r14, r15, respectively.

Rn[a] denotes bit a of register Rn. Therefore Rn[a] = (Rn >> a) & 1.

Rn[a:b] denotes the (a + 1 − b)-bit value stored in bits a to b of Rn inclusive.

RdHi:RdLo represents the 64-bit value with high 32 bits RdHi and low 32 bits RdLo.
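The bit-field notation above can be sketched in C as follows. This is only an illustration of the notation; the function names reg_bit, reg_field, and reg_pair are hypothetical and do not come from the ARM tools.

```c
#include <assert.h>
#include <stdint.h>

/* Rn[a]: bit a of a 32-bit register value, with 0 <= a <= 31. */
uint32_t reg_bit(uint32_t rn, unsigned a)
{
    assert(a <= 31);
    return (rn >> a) & 1u;
}

/* Rn[a:b]: the (a + 1 - b)-bit value held in bits a down to b inclusive. */
uint32_t reg_field(uint32_t rn, unsigned a, unsigned b)
{
    assert(a <= 31 && b <= a);
    unsigned width = a + 1 - b;
    /* Build the mask without shifting by 32, which is undefined in C. */
    uint32_t mask = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
    return (rn >> b) & mask;
}

/* RdHi:RdLo: the 64-bit value with RdHi as the high word and RdLo as the low word. */
uint64_t reg_pair(uint32_t rd_hi, uint32_t rd_lo)
{
    return ((uint64_t)rd_hi << 32) | rd_lo;
}
```

For instance, reg_field(0x12345678, 15, 8) extracts the byte 0x56.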
A.2.3 Values Stored as Immediates

<immedN> is any unsigned N-bit immediate. For example, <immed8> represents any
integer in the range 0 to 255. <immed5>*4 represents any integer in the list 0, 4, 8, …, 124.

<addressN> is an address or label stored as a relative offset. The address must be in the
range pc − 2^N ≤ address < pc + 2^N. Here, pc is the address of the instruction plus
eight for ARM state, or the address of the instruction plus four for Thumb state. The
address must be four-byte aligned if the destination is an ARM instruction or two-byte
aligned if the destination is a Thumb instruction.

<A-B> represents any integer in the range A to B inclusive.

<rotated_immed> is any 32-bit immediate that can be represented as an eight-bit
unsigned value rotated right (or left) by an even number of bit positions. In
other words, <rotated_immed> = <immed8> ROR (2*<immed4>). For example, 0xff,
0x104, 0xe0000005, and 0x0bc00000 are possible values for <rotated_immed>. However,
0x101 and 0x102 are not. When you use a rotated immediate, <shifter_C> is
set according to Table A.3 (discussed in Section A.2.5). A nonzero rotate may cause
a change in the carry flag. For this reason, you can also specify the rotation explicitly,
using the assembly syntax <immed8>, 2*<immed4>.
A.2.4 Condition Codes and Flags

<cond> represents any of the standard ARM condition codes. Table A.2 shows the
possible values for <cond>.

<SignedOverflow> is a flag indicating that the result of an arithmetic operation
suffered from a signed overflow. For example, 0x7fffffff + 1 = 0x80000000 produces
a signed overflow because the sum of two positive 32-bit signed integers is a negative
32-bit signed integer. The V flag in the cpsr typically records signed overflows.

<UnsignedOverflow> is a flag indicating that the result of an arithmetic operation
suffered from an unsigned overflow. For example, 0xffffffff + 1 = 0 produces an
overflow in unsigned 32-bit arithmetic. The C flag in the cpsr typically records unsigned
overflows.

Table A.2 ARM condition mnemonics.
<cond>     Instruction is executed when                             cpsr condition
{|AL}      ALways                                                   TRUE
EQ         EQual (last result zero)                                 Z==1
NE         Not Equal (last result nonzero)                          Z==0
{CS|HS}    Carry Set, unsigned Higher or Same (following a compare) C==1
{CC|LO}    Carry Clear, unsigned LOwer (following a comparison)     C==0
MI         MInus (last result negative)                             N==1
PL         PLus (last result greater than or equal to zero)         N==0
VS         V flag Set (signed overflow on last result)              V==1
VC         V flag Clear (no signed overflow on last result)         V==0
HI         unsigned HIgher (following a comparison)                 C==1 && Z==0
LS         unsigned Lower or Same (following a comparison)          C==0 || Z==1
GE         signed Greater than or Equal                             N==V
LT         signed Less Than                                         N!=V
GT         signed Greater Than                                      N==V && Z==0
LE         signed Less than or Equal                                N!=V || Z==1
NV         NeVer (ARMv1 and ARMv2 only; DO NOT USE)                 FALSE

<NoUnsignedOverflow> is the same as 1 − <UnsignedOverflow>.

<Zero> is a flag indicating that the result of an arithmetic or logical operation is zero.
The Z flag in the cpsr typically records the zero condition.

<Negative> is a flag indicating that the result of an arithmetic or logical operation is
negative. In other words, <Negative> is bit 31 of the result. The N flag in the cpsr
typically records this condition.
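The flag definitions and the conditions of Table A.2 can be modeled in C as a minimal sketch. The type and function names (Flags, add_flags, cmp_flags, cond_passed) are hypothetical. The sketch assumes the usual ARM convention that a subtraction or compare sets C to <NoUnsignedOverflow>, that is, C is set when no borrow occurs. The obsolete NV code is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The four cpsr condition flags. */
typedef struct {
    bool n, z, c, v;
} Flags;

/* Flags produced by a flag-setting 32-bit addition a + b. */
Flags add_flags(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;                        /* wraps modulo 2^32 */
    Flags f;
    f.n = (sum >> 31) & 1u;                      /* <Negative>: bit 31 of the result */
    f.z = (sum == 0);                            /* <Zero> */
    f.c = (sum < a);                             /* <UnsignedOverflow>: carry out */
    /* <SignedOverflow>: operands share a sign that the result does not. */
    f.v = ((~(a ^ b) & (a ^ sum)) >> 31) & 1u;
    return f;
}

/* Flags produced by comparing a with b, that is, by computing a - b. */
Flags cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t diff = a - b;                       /* wraps modulo 2^32 */
    Flags f;
    f.n = (diff >> 31) & 1u;
    f.z = (diff == 0);
    f.c = (a >= b);                              /* <NoUnsignedOverflow>: no borrow */
    /* <SignedOverflow>: operands differ in sign and the result's sign differs from a. */
    f.v = (((a ^ b) & (a ^ diff)) >> 31) & 1u;
    return f;
}

typedef enum {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS, COND_VC,
    COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL
} Cond;

/* Evaluate a condition mnemonic against the flags, following Table A.2. */
bool cond_passed(Cond cond, Flags f)
{
    assert(cond <= COND_AL);
    switch (cond) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_CS: return f.c;
    case COND_CC: return !f.c;
    case COND_MI: return f.n;
    case COND_PL: return !f.n;
    case COND_VS: return f.v;
    case COND_VC: return !f.v;
    case COND_HI: return f.c && !f.z;
    case COND_LS: return !f.c || f.z;
    case COND_GE: return f.n == f.v;
    case COND_LT: return f.n != f.v;
    case COND_GT: return f.n == f.v && !f.z;
    case COND_LE: return f.n != f.v || f.z;
    case COND_AL: return true;
    }
    return false;
}
```

Comparing 0x80000000 with 1 shows why the signed and unsigned conditions are separate: the flags pass both LT (the first operand is the most negative signed integer) and HI (it is the larger unsigned value).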
A.2.5 Shift Operations


<imm_shift> represents a shift by an immediate-specified amount. The possible shifts
are LSL #<0-31>, LSR #<1-32>, ASR #<1-32>, ROR #<1-31>, and RRX. See Table A.3
for the actions of each shift.

<reg_shift> represents a shift by a register-specified amount. The possible shifts are
LSL Rs, LSR Rs, ASR Rs, and ROR Rs. Rs must not be pc. The bottom eight bits of Rs
are used as the shift value k in Table A.3. Bits Rs[31:8] are ignored.

<shift> is shorthand for <imm_shift> or <reg_shift>.

<shifted_Rm> is shorthand for the value of Rm after the specified shift has been applied.
See Table A.3.

×