3D Graphics with OpenGL ES and M3G

404 FIXED-POINT MATHEMATICS APPENDIX A
In these examples we showed the programs as a list of assembly instructions. It is not
possible to compile them into a working program without some modifications. Here is
an example of an inlined assembly routine that you can actually call from your C program
(using an ARM GCC compiler):
INLINE int mul_fixed_fixed( int a, int b )
{
    int result, tmp;
    __asm__ ( "smull %0,%1,%2,%3 \n\t"
              "mov %0,%0,lsr #16 \n\t"
              "orr %0,%0,%1,lsl #16 \n\t"
              : "=&r" (result), "=&r" (tmp)
              : "r" (a), "r" (b)
            );
    return result;
}
Here the compiler allocates the registers and binds the register holding result to
argument %0, tmp to %1, a to %2, and b to %3. For result and tmp, = means that the
register is going to be written to, and & indicates that this register cannot be used for
anything else inside this __asm__ statement. The first line performs a signed multiply of
a and b and stores the low 32 bits in result and the high 32 bits in tmp.
The second line shifts result right by 16 bits; the third line shifts tmp left by 16 bits and
combines it with result using a bitwise OR. The interested reader may want
to consult a more in-depth exposition on GCC inline assembly [S03, Bat].
Another compiler that is used a lot for mobile development is the ARM RVCT compiler.
It also handles the register allocation of the inline assembly. RVCT goes a step further
though: there is no need to specify registers and their constraints as they are automatically
handled by the compiler. Here is the previous example code in the inline assembler format
used by RVCT:


INLINE int mul_fixed_fixed( int a, int b )
{
    int result, tmp;
    __asm
    {
        smull result, tmp, a, b
        mov result, result, lsr #16
        orr result, result, tmp, lsl #16
    }
    return result;
}
For a list of supported instructions, check the ARM Instruction Set Quick Reference
Card [Arm].
A.3 FIXED-POINT METHODS IN JAVA
Fixed-point routines in Java work almost exactly as in C, except that you do not have to
struggle with the portability of 64-bit integers, because the long type in Java is always
64 bits. Also, since there is no #define nor an inline keyword in Java, you need
to figure out alternative means to get your code inlined. This is crucially important
because the method call overhead will otherwise eliminate any benefit that you get from
faster arithmetic. One way to be sure is to inline your code manually, and that is
what you will probably end up doing anyway, as soon as you need to go beyond the basic
16.16 format. Note that the standard javac compiler does not do any inlining; see
Appendix B for suggestions on other tools that may be able to do it.
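As a concrete illustration, here is what a 16.16 multiply looks like in Java (a sketch; the class and method names are ours), using a long intermediate exactly as in the C version:

```java
public final class FixedMath {
    public static final int ONE = 1 << 16;   // 1.0 in 16.16 format

    // Multiply two 16.16 fixed-point numbers. The 64-bit long
    // intermediate keeps all 32 fraction bits before the final shift.
    public static int mul(int a, int b) {
        return (int) (((long) a * (long) b) >> 16);
    }
}
```

In a real inner loop you would paste the expression ((long) a * b) >> 16 directly into the calling code, since javac will not inline the method for you.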
The benefit of using fixed-point in Java depends greatly on the Java virtual machine. The
benefit can be very large on VMs that leverage Jazelle (see Appendix B), or just-in-time
(JIT) or ahead-of-time (AOT) compilation, but very modest on traditional interpreters.
To give a ballpark estimate, a DOT4 done in fixed-point using 64-bit intermediate reso-
lution might be ten times faster than a pure float routine on a compiling VM, five times
faster on Jazelle, but only twice as fast on an interpreter.
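The DOT4 in question could be sketched like this (method name ours): four multiplies accumulated in a 64-bit long, with a single shift at the end instead of one per product:

```java
public final class FixedDot {
    // 4-element dot product of 16.16 fixed-point vectors.
    // Accumulating in 64 bits and shifting only once at the end
    // preserves full precision and saves three intermediate shifts.
    public static int dot4(int[] a, int[] b) {
        long acc = 0;
        for (int i = 0; i < 4; i++) {
            acc += (long) a[i] * (long) b[i];
        }
        return (int) (acc >> 16);
    }
}
```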

On a traditional interpreter, float is relatively efficient because it requires only one
bytecode for each addition, multiplication, or division. Fixed point, on the other hand,
takes extra bytecodes due to the bit-shifting. The constant per-bytecode overhead is very
large on a software interpreter.
On Jazelle, integer additions and multiplications get mapped to native machine instruc-
tions directly, whereas float operations require a function call. The extra bytecodes
are still there, however, taking their toll. Finally, a JIT/AOT compiler is looking at longer
sequences of bytecode and can probably combine the bit-shifts with other operations in
the compiled code, as we did in the previous section.
To conclude, using fixed-point arithmetic generally does pay off in Java, and even more
so with the increasing prevalence of Jazelle and JIT/AOT compilers. There is a caveat,
though: if you need to do a lot of divides, or need to convert between fixed and float
frequently, you may be better off just using floats and spending your optimization efforts
elsewhere. Divides are very slow regardless of the number format and the VM, and will
quickly dominate the execution time. Also, they are much slower in 64-bit integer than in
32-bit floating point!
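A 16.16 divide sketch (names ours) shows why: the numerator has to be widened to 64 bits before the shift, turning every fixed-point divide into a full long division:

```java
public final class FixedDiv {
    // Divide two 16.16 fixed-point numbers. Pre-shifting the
    // numerator requires 64-bit arithmetic, so this costs a full
    // long division, typically far slower than a float divide.
    public static int div(int a, int b) {
        return (int) (((long) a << 16) / b);
    }
}
```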
APPENDIX B
JAVA PERFORMANCE TUNING
Although M3G offers a lot of high-level functionality implemented in efficient native
code, it will not write your game for you. You need to create a lot of Java code yourself,
and that code will ultimately make or break your game, so it had better be good.
The principles of writing efficient code on the Java ME platform are much like on any
other platform. In order to choose the best data structures and algorithms, and to imple-
ment them in the most efficient way, you need to know the strengths and weaknesses of
your target architecture, programming language, and compiler. The problem compared
to native platforms is that there are more variables and unknowns: a multitude of differ-
ent VMs, using different acceleration techniques, running on different operating systems
and hardware. Hence, spending a lot of time optimizing your code on an emulator or just

one or two devices can easily do you more harm than good.
In this appendix we briefly describe the main causes of performance problems in Java ME,
and suggest some techniques to overcome them. This is not to be taken as final truth;
your mileage may vary, and the only way to be sure is to profile your application on the
devices that you are targeting. That said, we hope this will help you avoid the most obvious
performance traps and also better understand some of the decisions that we made when
designing M3G.
B.1 VIRTUAL MACHINES
The task of the Java Virtual Machine is to execute Java bytecode, just like a real, nonvirtual
CPU executes its native assembly language. The instruction set of the Java VM is in stark
contrast to that of any widely used embedded CPU, however.
To start with, bytecode instructions take their operands off the top of an internal operand
stack, whereas native instructions pick theirs from a fixed set of typically sixteen registers.
The arbitrary depth of the operand stack prevents it from being mapped to the regis-
ters in a straightforward manner. This increases the number of costly memory accesses
compared to native code. The stack-based architecture is very generic, allowing imple-
mentations on almost any imaginable processor, but it is also hard to map efficiently onto
a machine that is really based on registers.
Another complication is due to bytecode instructions having variable length, compared to
the fixed-length codewords of a RISC processor. This makes bytecode very compact: most
instructions require just one byte of memory, whereas native instructions are typically four
bytes each. The downside is that instruction fetching and decoding become more complex.
Furthermore, the bytecode instruction set is a very mixed bag, having instructions at widely
varying levels of abstraction. The bytecodes range from basic arithmetic and bitwise oper-
ations to things that are usually considered to be in the operating system’s domain, such
as memory allocation (new). Most of the bytecodes are easily mapped to native machine
instructions, except for having to deal with the operand stack, but some of the high-level
ones require complex subroutines and interfacing with the operating system. Adding into
the equation the facts that all memory accesses are type-checked and bounds-checked, that
memory must be garbage-collected, and so on, it becomes clear that designing an efficient
Java VM, while maintaining security and robustness, is a formidable task.
There are three basic approaches that virtual machines are taking to execute bytecode:
interpretation, just-in-time compilation, and ahead-of-time compilation. The predom-
inant approach in mobile devices is interpretation: bytecodes are fetched, decoded, and
translated into machine code one by one. Each bytecode instruction takes several machine
instructions to translate, so this method is obviously much slower than executing native
code. The slowdown used to be some two orders of magnitude in early implementations,
but has since then been reduced to a factor of 5–10, thanks to assembly-level optimiza-
tions in the interpreter loops.
The second approach is to compile (parts of) the program into machine code at runtime.
These just-in-time (JIT) compilers yield good results in long-running benchmarks, but
perform poorly when only limited time and memory are available for the compiler and
the compiled code. The memory problems are exacerbated by the fact that compiled code
can easily take five times as much space as bytecode. Moreover, runtime compilation will
necessarily delay, interrupt, or slow down the program execution. To minimize the dis-
turbance, JIT compilers are restricted to very basic and localized optimizations. In theory,
the availability of runtime profiling information should allow JIT compilers to produce
smaller and faster code than any static C compiler, but that would require a drastic increase
in the available memory, and the compilation time would still remain a problem for inter-
active applications. Today, we estimate well-written C code to outperform embedded JIT
compilers by a factor of 3–5.
The third option is to compile the program into native code already before it is run, typ-
ically at installation time. This ahead-of-time (AOT) tactic allows the compiler to apply
more aggressive optimizations than is feasible at runtime. On the other hand, the com-
piled code consumes significantly more memory than the original bytecode.
Any of these three approaches can be accelerated substantially with hardware support.
The seemingly obvious solution is to build a CPU that uses Java bytecode as its machine

language. This has been tried by numerous companies, including Nazomi, Zucotto, inSil-
icon, Octera, NanoAmp, and even Sun Microsystems themselves, but to our knowledge
all such attempts have failed either technically or commercially, or both. The less radical
approach of augmenting a conventional CPU design with Java acceleration seems to be
working better.
The Jazelle extension to ARM processors [Por05a] runs the most common bytecodes
directly on the CPU, and manages to pull that off at a negligible extra cost in terms of
silicon area. Although many bytecodes are still emulated in software, this yields perfor-
mance roughly equivalent to current embedded JIT compilers, but without the excessive
memory usage and annoying interruptions. The main weakness of Jazelle is that it must
execute each and every bytecode separately, whereas a compiler might be able to turn a
sequence of bytecodes into just one machine instruction.
Taking a slightly different approach to hardware acceleration, Jazelle RCT (Runtime
Compilation Target) [Por05b] augments the native ARM instruction set with additional
instructions that can be used by JIT and AOT compilers to speed up array bounds check-
ing and exception handling, for example. The extra instructions also help to reduce the
size of the compiled machine code almost to the level of the original bytecode.
As an application developer, you will encounter all these different types of virtual machines.
In terms of installed base, traditional interpreters still have the largest market share, but
Jazelle, JIT, and AOT are quickly catching up. According to the JBenchmark ACE results
database,1 most newer devices appear to be using one of these acceleration techniques.
Jazelle RCT had not yet been used in any mobile devices at the time of this writing, but
we expect it to be widely deployed over the next few years.
B.2 BYTECODE OPTIMIZATION
As we pointed out before, Java bytecode is less than a perfect match for modern embedded
RISC processors. Besides being stack-based and having instructions at wildly varying
1 www.jbenchmark.com/ace

levels of abstraction, it also lacks many features that native code can take advantage of,
at least when using assembly language. For instance, there are no bytecodes correspond-
ing to the kind of data-parallel (SIMD) instructions that are now commonplace also in
embedded CPUs and can greatly speed up many types of processing. To take another
example, there are no conditional (also known as predicated) instructions to provide a
faster alternative to short forward branches.
Most of the bytecode limitations can be attributed to the admirable goal of platform
independence, and are therefore acceptable. It is much harder to accept the notoriously
poor quality of the code that the javac compiler produces. In fact, you are better off
assuming that it does no optimization whatsoever. For instance, if you compute a con-
stant expression like 16*a/4 in your inner loop, rest assured that the entire expression
will be meticulously evaluated at every iteration—and of course using real multiplies and
divides rather than bit-shifts (as in a<<2).
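The fix is to do the hoisting and strength reduction yourself. A minimal sketch (names ours) of the 16*a/4 case: the constant expression is evaluated once, outside the loop, and as a shift rather than a multiply and a divide:

```java
public final class Hoist {
    // javac evaluates 16*a/4 literally on every iteration, so hoist
    // it out of the loop and strength-reduce it by hand:
    // 16*a/4 == 4*a == a << 2.
    public static int sum(int a, int n) {
        int scaled = a << 2;          // hoisted: 16*a/4
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += scaled;
        }
        return total;
    }
}
```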
The lack of optimization in javac is presumably because it trusts the virtual machine to
apply advanced optimization techniques at runtime. That may be a reasonable assump-
tion in the server environment, but not on mobile devices, where resources are scarce
and midlet start-up times must be minimized. Traditional interpreters and Jazelle take a
serious performance hit from badly optimized bytecode, but just-in-time and ahead-of-
time compilers are not immune, either. If the on-device compiler could trust javac to
inline trivial methods, eliminate constant expressions and common subexpressions, con-
vert power-of-two multiplications and divisions into bit-shifts, and so on, it could spend
more time on things that cannot be done at the bytecode level, such as register allocation
or eliminating array b ounds checking.
Given the limitations of javac, your best bet is to use other off-line compilers, bytecode
optimizers, and obfuscators such as GCJ,2 mBooster,3 DashO,4 ProGuard,5 Java Global
Optimizer,6 Bloat,7 or Soot.8 None of these tools is a superset of the others, so it might
make sense to use more than one on the same application.
B.3 GARBAGE COLLECTION
All objects, including arrays, are allocated from the Java heap using the new operator.
They are never explicitly deallocated; instead, the garbage collector (GC) automatically
reclaims any objects that are no longer referenced by the executing program.
2 gcc.gnu.org/java
3 www.innaworks.com
4 www.preemptive.com
5 proguard.sourceforge.net
6 www.garret.ru/~knizhnik/javago/ReadMe.htm
7 www.cs.purdue.edu/s3/projects/bloat/
8 www.sable.mcgill.ca/soot/
Automatic garbage collection eliminates masses of insidious bugs, but also bears
significant overhead. Explicit memory management using malloc and free has been
shown to be faster and require less physical memory. For example, in a study by Hertz and
Berger [HB05], the best-performing garbage collector degraded application performance
by 70% compared to an explicit memory manager, even when the application only used

half of the available memory. Performance of the garbage collector declined rapidly as
memory was running out. Thus, for best performance, you should leave some reasonable
percentage of the Java heap unused. More importantly, you should not create any garbage
while in the main loop, so as not to trigger the garbage collector in the first place.
Pitfall: There is no reliable way to find out how much memory your midlet
is consuming, or how much more it has available. The numbers you get from
Runtime.getRuntime().freeMemory() are not to be trusted, because you
may run out of native heap before you run out of Java heap, or vice versa, and because
the Java heap may be dynamically resized behind your back.
A common technique to avoid generating garbage is to allocate a set of objects and arrays
at the setup stage and then reuse them throughout your code. In other words, start off your
application by allocating all the objects that you are ever going to need, and then hold on
to them until you quit the midlet. Although this is not object-oriented and not very pretty,
it goes a long way toward eliminating the GC overhead—not all the way, though. There are
built-in methods that do not facilitate object reuse, forcing you to create a new instance
when you really only wanted to change some attribute. Even worse, there are built-in APIs
that allocate and release temporary objects internally without you ever knowing about it.
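A minimal sketch of the allocate-up-front pattern (the class is ours): all allocation happens in the constructor, and the per-frame path touches only preallocated arrays:

```java
public final class ParticleSystem {
    // Storage allocated once at setup and reused every frame,
    // so the main loop creates no garbage at all.
    private final float[] xs;
    private final float[] ys;

    public ParticleSystem(int capacity) {
        xs = new float[capacity];   // all allocation happens here
        ys = new float[capacity];
    }

    // Update in place: no 'new' anywhere on the per-frame path.
    public void update(float dt) {
        for (int i = 0; i < xs.length; i++) {
            xs[i] += dt;
            ys[i] += dt;
        }
    }

    public float x(int i) { return xs[i]; }
}
```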
Strings are particularly easy to trip on, because they are immutable in Java. Thus,
concatenating two strings creates a new String object simply because the existing ones
cannot be changed. If you need to deal with strings on a per-frame basis, for example
to update the player’s score, you need to be extra careful to avoid creating any garbage.
Perhaps the only way to be 100% sure is to revert to C-style coding and only use char
arrays.
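For example, a per-frame score display can be kept garbage-free by rendering the number into a char array that is allocated once (a sketch; names ours):

```java
public final class Score {
    // Fixed-size digit buffer allocated once; updating the score
    // rewrites it in place instead of creating new String objects.
    private final char[] digits = new char[8];

    // Render value into the buffer, right-aligned and zero-padded.
    public void set(int value) {
        for (int i = digits.length - 1; i >= 0; i--) {
            digits[i] = (char) ('0' + value % 10);
            value /= 10;
        }
    }

    public char[] chars() { return digits; }
}
```

The char array can then be handed to a bitmap-font drawing routine each frame without a single allocation.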
B.4 MEMORY ACCESSES
One of the most frequent complaints that C programmers have about Java is the lack of
direct memory access. Indeed, there are no pointers in the Java programming language,
and no bytecode instructions to read or write arbitrary memory locations. Instead, there
are only references to strongly typed objects that reside in the garbage-collected heap. You
do not know where in physical memory each particular object lies at any given time, nor
how many bytes it occupies. Furthermore, all memory accesses are type-checked, and in

case of arrays, also bounds-checked. These restrictions are an integral part of the Java
security model, and one of the reasons the platform is so widely deployed, but they also
rule out many optimizations that C programmers are used to.
As an example, consider a bitmap image stored in RGBA format at 32 bits per pixel. In C,
you would use a byte array, but still access the pixels as integers where necessary, to speed
up copying and some other operations. The lack of type-checking in C therefore allows
you to coalesce four consecutive memory accesses into one. Java does not give you that
flexibility: you need to choose either bytes or integers and stick to that. To take another
example, efficient floating-point processing on FPU-less devices requires custom routines
that operate directly on the integer bit patterns of float values, and that is something
you cannot do in Java. To illustrate, the following piece of C code computes the absolute
value of a float in just one machine instruction, but relies on pointer casting to do so:
float fabs(float a)
{
    int bits = *(int*)(&a);     // extract the bit pattern
    bits &= 0x7fffffff;         // clear the sign bit
    return *(float*)(&bits);    // cast back to float
}
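Java can only approximate this through Float.floatToIntBits and Float.intBitsToFloat (available in CLDC 1.1). The bit manipulation itself is the same, but each conversion is a method call rather than a free pointer cast, which is precisely the overhead the custom C routines avoid:

```java
public final class FloatBits {
    // Nearest Java counterpart of the pointer-cast trick: clear the
    // sign bit of the IEEE 754 bit pattern. Unlike the C version,
    // each conversion is a method call, not a free reinterpretation.
    public static float abs(float a) {
        int bits = Float.floatToIntBits(a);   // extract the bit pattern
        bits &= 0x7fffffff;                   // clear the sign bit
        return Float.intBitsToFloat(bits);    // back to float
    }
}
```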
Type-checking is not the only thing in Java that limits your choice of data structures and
algorithms. For example, if you want to build an aggregate object (such as an array of
structures) in C, you can either inline the component objects (the structures) or refer-
ence them with pointers; Java only gives you the latter option. Defining a cache-friendly
data structure where objects are aligned at, say, 16-byte boundaries is another thing that
you cannot do in Java. Moreover, you do not have the choice of quickly allocating local
variables from the CPU stack. Finally, the lack of pointer arithmetic forces you to follow
object references even when the target address could be computed without any memory
accesses.
Unlike type checking, array bounds checking does not limit your choice of data struc-
tures. It does impose a performance penalty, though, and the more dimensions you have

in the array, the higher the cost per access. Thus, you should always use a flat array, even
if the data is inherently multidimensional; for instance, a 4 × 4 matrix should be allo-
cated as a flat array of 16 elements. Advanced JIT/AOT compilers may be able to elim-
inate a range check if the array index can be proven to be within the correct range.
The compiler is more likely to come up with the proof if you use new int[100]
rather than new int[getCount()] to allocate an array, and index<100 instead
of index<getCount() to iterate over its elements. Do not let this complicate your
code too much, however, as this sort of optimization may be beyond the capabilities of
the current compilers.
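A 4 × 4 matrix stored flat might look like this (a sketch; names ours), with the row-major index computed by hand:

```java
public final class Matrix4 {
    // 4x4 matrix stored as a flat array of 16 elements: one bounds
    // check per access instead of two, and no second indirection
    // through a row object.
    private final float[] m = new float[16];

    public float get(int row, int col) {
        return m[row * 4 + col];    // flat row-major index
    }

    public void set(int row, int col, float v) {
        m[row * 4 + col] = v;
    }
}
```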
To minimize memory accesses in general, it is a good idea to use the built-in primitive
types such as int and float rather than objects. Also, the input parameters and local
variables of a method are likely to be faster than class variables or instance variables.
Finally, using System.arraycopy pays off almost universally: it amounts to a native
memcpy with some extra type-checking and range-checking up front. The savings can
be huge compared to doing the same checks for each element separately.
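A typical use (names ours): duplicating a whole array through System.arraycopy rather than an element-by-element loop:

```java
public final class Copy {
    // System.arraycopy amounts to a native memcpy with the type and
    // range checks done once up front, not once per element.
    public static int[] duplicate(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }
}
```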
B.5 METHOD CALLS
Method invocations in Java are more expensive and more restricted than function calls in
C or C++. The virtual machine must first look up the method from an internal symbol
table, and then check the type of each argument against the method signature. A C/C++
function call, on the other hand, requires very few machine instructions.
In general, private methods are faster to call than public or protected ones, and
stand a better chance of being inlined. Also, static methods are faster than instance
methods, and final methods are faster than those that can be re-implemented in
derived classes. synchronized methods are by far the slowest, and should be used
only when necessary. Depending on the VM, native methods can also bear high overhead,
particularly if large objects or arrays are passed to or from native code.
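A sketch of these guidelines applied to a hot-path helper (names ours): private and static, so there is no virtual dispatch and the VM has the best chance of inlining the call:

```java
public final class Modifiers {
    // Hot-path helper: private and static means no symbol lookup in
    // derived classes and no virtual dispatch, the cheapest kind of
    // call for the VM, and the most likely to be inlined.
    private static int square(int x) {
        return x * x;
    }

    public static int sumOfSquares(int a, int b) {
        return square(a) + square(b);
    }
}
```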
As a final note, code and data are strictly separated in Java. There is no way for a method
to read or write its own bytecode or that of any other method. There is also no way
to transfer program control to the data area, or in fact anywhere else than one of the

predefined method entry points. These restrictions are absolutely mandatory from the
security standpoint, but they have the unfortunate side-effect that any kind of runtime
code generation is prevented. In other words, you could not implement a JIT compiler
in Java!
