144 PERFORMANCE AND SCALABILITY CHAPTER 6
Table 6.1: Performance checklist.
Check item OK Answer Applicability
Do you use full-screen window surfaces? Yes ALL
Do you use glReadPixels? No ALL
Do you use eglCopyBuffers? No MOST
Do you use glCopyTex(Sub)Image2D? No MOST
Do you change texture data of existing texture? No ALL
Do you load textures during the rendering pass? No MOST
Do you use render-to-texture results during the same frame? No SOME
Do you clear the whole depth buffer at the start of a frame? Yes SOME
Do you use mipmapping? Yes ALL
Do you use vertex buffer objects? Yes ALL
Do you use texture compression? Yes SOME
Is any unnecessary state enabled? No ALL
Do you use auto mipmap generation or change filter modes? No SOME
Do you use perspective correction? No SOME (SW)
Do you use bilinear or trilinear filtering? No SOME (SW)
Do you use floating-point vertex data? No SOME
Table 6.2: Quality checklist.
Check item OK Answer Applicability
Do you use multisampling? Yes MOST (HW)
Do you use LINEAR_MIPMAP_NEAREST? Yes MOST (HW)
Do you have enough depth buffer bits? Yes ALL
Do you have enough color buffer bits? Yes ALL
Have you enabled perspective correction? Yes ALL
Table 6.3: Power usage checklist.
Check item OK Answer Applicability
Do you terminate EGL when the application is idling? Yes MOST (HW)
Do you track the focus and halt rendering if focus is lost? Yes ALL
Do you limit your frame rate? Yes ALL
Table 6.4: Portability checklist.
Check item OK Answer Applicability
Do you use writable static data? No SOME (OS)
Do you handle display layout changes? Yes SOME (OS)
Do you depend on pixmap surface support? No SOME
Do you use EGL from another thread than main? No SOME
Do you specify surface type when asking for a config? Yes MOST
Do you require exact number of samples for multi-sampling? No SOME
6.3 CHANGING AND QUERYING THE STATE
Modern rendering pipelines are one-way streets: data keeps flowing in, it gets buffered,
number-crunching occurs, and eventually some pixels come out. State changes and
dynamic state queries are operations that disturb this flow. In the worst case a client-server
roundtrip is required. For example, if the application wants to read back the contents of
the color buffer, the application (the “client”) has to stall until the graphics hardware
(the “server”) has processed all of the buffered primitives—and the buffers in modern
hardware, especially tile-based devices, can be very long. An example of an extreme state
change is modifying the contents of a texture map mid-frame as this may lead to internal
duplication of the image data by the underlying driver.
While having some state changes is unavoidable in any realistic application, you should
steer clear of dynamic state queries, if possible. Applications should shadow the relevant
state in their own code rather than query it from the graphics driver, e.g., the applica-
tion should know whether a particular light source is enabled or not. Dynamic queries
should only be utilized when keeping an up-to-date copy of the graphics driver’s state
is cumbersome, for example when combining application code with third-party middle-
ware libraries that communicate directly with the underlying OpenGL ES or M3G layers.
If for some reason dynamic state queries are absolutely needed, they should all be executed
together once per frame, so that only a single pipeline stall is generated.
Smaller state changes, such as operations that alter the transformation and lighting
pipeline or the fragment processing, affect the performance in various ways. Changing
state that is typically set only during initialization, such as the size of the viewport or
scissor rectangle, may cause a pipeline flush and may therefore be costly. State changes
and under-the-hood synchronization may also happen when an application uses different
APIs to access the same graphics resources. For example, you may be tempted to mix 2D
and 3D functionality provided by different APIs. This is more than likely to be extremely
slow, as the entire 3D pipeline may have to be completely flushed before the 2D operations
can take place and vice versa. The implementations of the graphics libraries may well come
from different vendors, and their interaction can therefore be nonoptimal. This is a sig-
nificant problem in the Java world, as the whole philosophy of Java programming is to be
able to mix and match different libraries.
6.3.1 OPTIMIZING STATE CHANGES
The rule of thumb for all state changes is to minimize the number of stalls created by
them. This means that changes should be grouped and executed together. An easy way
to do this is to group related state changes into “shaders” (we use the term here to indi-
cate a collection of distinct pieces of the rendering state, corresponding roughly with the
Appearance class of M3G), and to organize the rendering so that all objects sharing
a shader are rendered together. It is a good idea to expose this shader-based approach in
the artists’ modeling tools as well. If one lets the artists tweak attributes that can create
state changes, the end result is likely to be a scene where each object has slightly different
materials and fragment pipelines, and the application needs to do a large number of state
changes to render the objects. It is therefore better to just let the artists pick shaders from
a predefined list.
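To make the grouping concrete, here is a minimal sketch in C. The Renderable type and its fields are hypothetical illustrations, not part of OpenGL ES or M3G: sorting the render queue by shader id means each distinct state block is bound only once per frame.

```c
#include <stdlib.h>

/* Each renderable carries the id of the "shader" (state block) it
 * needs; sorting the queue by that id groups all objects that share
 * a state block together. */
typedef struct {
    int shader_id;   /* index into a table of predefined state blocks */
    int mesh_id;     /* hypothetical handle to the geometry */
} Renderable;

static int by_shader(const void *a, const void *b)
{
    return ((const Renderable *)a)->shader_id -
           ((const Renderable *)b)->shader_id;
}

/* Sort the queue and return how many state changes a renderer would
 * perform when walking it in order. */
int sort_and_count_state_changes(Renderable *queue, int n)
{
    int i, changes = 0, current = -1;
    qsort(queue, n, sizeof(Renderable), by_shader);
    for (i = 0; i < n; ++i) {
        if (queue[i].shader_id != current) {
            current = queue[i].shader_id;
            ++changes;    /* here a real renderer would bind the state */
        }
    }
    return changes;
}
```

With five objects spread over three shaders, the sorted queue triggers only three state changes instead of up to five.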
Also, it is important to be aware that the more complex a shader is, the slower it is likely
to be. Even though graphics hardware may perform some operations “for free” due to its
highly parallel nature, in a software implementation everything has an associated cost:
enabling texture mapping is going to take dozens of CPU cycles for every pixel rendered,
bilinear filtering of textures is considerably more expensive than point sampling, and
using blending or fog will definitely slow down a software renderer. For this reason, it
is crucial that the application disables all operations that are not going to have an impact
on the final rendered image. As an example, it is typical that applications draw over-
lay images after the 3D scene has been rendered. People often forget to disable the fog
operation when drawing the overlays as the fog usually does not affect objects placed at
the near clipping plane. However, the underlying rendering engine does not know this,
and has to perform the expensive fog computations for every pixel rendered. Disabling
the fog for the overlays in this case may have a significant performance impact.
In general, simplifying shaders is more important for software implementations of the
rendering pipeline, whereas keeping the number of state changes low is more important
for GPUs.
6.4 MODEL DATA
The way the vertex and triangle data of the 3D models is organized has a significant
impact on the rendering performance. Although the internal caching rules vary from
one rendering pipeline implementation to another, straightforward rules of thumb for
presentation of data exist: keep vertex and triangle data short and simple, and make as
few rendering calls as possible.
In addition to the layout and format of the vertex and triangle data used, where the data
is stored plays an important role. If it is stored in the client’s memory, the application has
more flexibility to modify the data dynamically. However, since the data is now transferred
from the client to the server during every render call, the server loses its opportunity
for optimizing and analyzing the data. On the other hand, when the mesh data is stored
by the server, it is possible to perform even expensive analysis of the data, as the cost
is amortized over multiple rendering operations. In general, one should always use such
server-stored buffer objects whenever provided by the rendering API. OpenGL ES supports
buffer objects from version 1.1 onward, and M3G implementations may support them in
a completely transparent fashion.
6.4.1 VERTEX DATA
Optimization of model data is an offline process that is best performed in the export-
ing pipeline of a modeling tool. The most important optimization that should be done is
vertex welding, that is, finding shared vertices and removing all but one of them. In a finely
tessellated grid each vertex is shared by six triangles. This means an effective vertices-
per-triangle ratio of 0.5. For many real-life meshes, ratios between 0.6 and 1.0 are obtained.
This is a major improvement over the naive approach of using three individual vertices
for each triangle, i.e., a ratio of 3.0. The fastest and easiest way for implementing welding
is to utilize a hash table where vertices are hashed based on their attributes, i.e., position,
normal, texture coordinates, and color.
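A minimal sketch of hash-based welding in C follows; the Vertex layout and table sizes are illustrative assumptions. Vertices are hashed over their raw attribute bytes and only exact duplicates are merged, so attributes should be quantized first if tolerance-based welding is desired.

```c
#include <string.h>

#define MAX_VERTS 1024
#define HASH_SIZE 257

/* Eight tightly packed floats: position, normal, texture coordinates. */
typedef struct { float px, py, pz, nx, ny, nz, u, v; } Vertex;

static unsigned hash_vertex(const Vertex *v)
{
    const unsigned char *p = (const unsigned char *)v;
    unsigned h = 2166136261u;              /* FNV-1a over the bytes */
    size_t i;
    for (i = 0; i < sizeof(Vertex); ++i)
        h = (h ^ p[i]) * 16777619u;
    return h % HASH_SIZE;
}

/* Weld 'in' (n <= MAX_VERTS vertices) into 'out'; remap[i] receives
 * the new index of input vertex i. Returns the welded vertex count. */
int weld(const Vertex *in, int n, Vertex *out, int *remap)
{
    int head[HASH_SIZE], next[MAX_VERTS];  /* chained hash buckets */
    int i, out_n = 0;
    for (i = 0; i < HASH_SIZE; ++i) head[i] = -1;
    for (i = 0; i < n; ++i) {
        unsigned h = hash_vertex(&in[i]);
        int j = head[h];
        while (j != -1 && memcmp(&out[j], &in[i], sizeof(Vertex)) != 0)
            j = next[j];
        if (j == -1) {                 /* first occurrence of this vertex */
            out[out_n] = in[i];
            next[out_n] = head[h];
            head[h] = out_n;
            j = out_n++;
        }
        remap[i] = j;
    }
    return out_n;
}
```

The remap table is then used to rewrite the triangle index array so it refers to the welded vertices.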
Any reasonably complex 3D scene will use large amounts of memory for storing its vertex
data. To reduce the consumption, one should always try to use the smallest data formats
possible, i.e., bytes and shorts instead of integers. Because quantization of floating-point
vertex coordinates into a smaller fixed-point representation may introduce artifacts and
gaps between objects, controlling the quantization should be made explicit in the modeling
and exporting pipeline. All interconnecting “scene” geometry could be represented with
a higher accuracy (16-bit coordinates), and all smaller and moving objects could be
expressed with lower accuracy (8-bit coordinates). For vertex positions this quantization
is typically done by scanning the axis-aligned bounding box of an object, re-scaling the
bounding [min,max] range for each axis into [−1, +1], and converting the resulting
values into signed fixed-point values. Vertex normals usually survive quantization into
8 bits per component rather well, whereas texture coordinates often require 16 bits
per component.
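The per-axis rescaling can be sketched as follows; the function names and the dequantization struct are illustrative, not from any API. The inverse scale and bias would typically be folded into the object's modelview matrix at load time.

```c
/* Quantize one position component into a signed 8-bit value: rescale
 * the bounding [lo, hi] range of the axis into [-1, +1] and convert
 * to [-127, 127], as described in the text. */
typedef struct { float scale, bias; } AxisDequant;

signed char quantize_axis(float v, float lo, float hi, AxisDequant *dq)
{
    float mid = 0.5f * (lo + hi);
    float ext = 0.5f * (hi - lo);
    float n = (ext > 0.0f) ? (v - mid) / ext : 0.0f;  /* now in [-1, 1] */
    dq->scale = ext / 127.0f;   /* dequantize with: v = q * scale + bias */
    dq->bias  = mid;
    if (n >  1.0f) n =  1.0f;
    if (n < -1.0f) n = -1.0f;
    /* round to nearest before the narrowing conversion */
    return (signed char)(n * 127.0f + (n >= 0.0f ? 0.5f : -0.5f));
}
```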
In general, one should always prefer integer formats over floating-point ones, as they are
likely to be processed faster by the transformation and lighting pipeline. Favoring small
formats has another advantage: when vertex data needs to be copied over to the render-
ing hardware, less memory bandwidth is needed to transfer smaller data elements. This
improves the performance of applications running on top of both hardware and software
renderers. Also, in order to increase cache-coherency, one should interleave vertex data if
possible. This means that all data of a single vertex is stored together in memory, followed
by all of the data of the next vertex, and so forth.
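An interleaved layout can be sketched as a C struct; the field choices below are illustrative assumptions (and the exact sizes depend on typical 4-byte alignment). The struct size doubles as the stride argument of the array-pointer calls, and each attribute pointer is the array base plus the field offset.

```c
#include <stddef.h>

/* All attributes of one vertex are adjacent in memory; small formats
 * (bytes for normals, shorts for texture coordinates) keep the stride
 * short and the vertex cache effective. */
typedef struct {
    float       pos[3];    /* 12 bytes at offset 0 */
    signed char normal[3]; /* quantized to 8 bits per component */
    signed char pad;       /* keeps the following shorts 2-byte aligned */
    short       uv[2];     /* 16-bit texture coordinates */
} InterleavedVertex;
```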
6.4.2 TRIANGLE DATA

An important offline optimization is ordering the triangle data in a coherent way so that
subsequent triangles share as many vertices as possible. Since we cannot know the exact
rules of the vertex caching algorithm used by the graphics driver, we need to come up with
a generally good ordering. This can be achieved by sorting the triangles so that they refer
to vertices that have been encountered recently. Once the triangles have been sorted in a
coherent fashion, the vertex indices are remapped and the vertex arrays are re-indexed to
match the order of referral. In other words, the first triangle should have the indices 0, 1,
and 2. Assuming the second triangle shares an edge with the first one, it will introduce
one new vertex, which in this scheme gets the index 3. The subsequent triangles then refer
to these vertices and introduce new vertices 4, 5, 6, and so forth.
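The re-indexing pass itself is simple; here is a sketch in C with illustrative names. Each vertex is assigned the next free index the first time a triangle refers to it, so a coherently sorted triangle list yields indices in near-sequential order.

```c
/* Rewrite 'indices' (3 * tri_count entries) in place so vertices are
 * numbered in order of first reference; order[old] receives the new
 * index of each referenced vertex (-1 if unreferenced). Returns the
 * number of distinct vertices seen. */
int reindex_first_use(int *indices, int tri_count, int *order, int vcount)
{
    int i, next = 0;
    for (i = 0; i < vcount; ++i) order[i] = -1;
    for (i = 0; i < 3 * tri_count; ++i) {
        if (order[indices[i]] == -1)
            order[indices[i]] = next++;   /* first use: assign next slot */
        indices[i] = order[indices[i]];
    }
    /* The caller then permutes the vertex array with 'order' so that
     * vertex data and indices agree again. */
    return next;
}
```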
The triangle index array can be expressed in several different formats: triangle lists, strips,
and fans. Strips and fans have the advantage that they use fewer indices per triangle than
triangle lists. However, you need to watch out that you do not create too many rendering
calls. You can “stitch” two disjoint strips together by replicating the last vertex of the first
strip and the first vertex of the second strip, which creates two degenerate triangles in
the middle. In general, using indexed rendering allows you to take full advantage of ver-
tex caching, and you should sort the triangles as described above. Whether triangle lists
or strips perform better depends on the implementation, and you should measure your
or strips perform better depends on the implementation, and you should measure your
platform to find out the winner.
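The stitching trick can be sketched as follows (illustrative names, not a library routine). Note that if the first strip has an odd number of indices, the winding of the second strip flips, and a third duplicated index is commonly inserted to compensate; that case is omitted here.

```c
/* Concatenate two triangle strips into one index array by repeating
 * the last index of strip A and the first index of strip B, which
 * inserts degenerate (zero-area) triangles between them. Returns the
 * combined index count (na + nb + 2). 'out' must have room for it. */
int stitch_strips(const int *a, int na, const int *b, int nb, int *out)
{
    int i, n = 0;
    for (i = 0; i < na; ++i) out[n++] = a[i];
    out[n++] = a[na - 1];   /* degenerate: repeat last index of A */
    out[n++] = b[0];        /* degenerate: repeat first index of B */
    for (i = 0; i < nb; ++i) out[n++] = b[i];
    return n;
}
```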
6.5 TRANSFORMATION PIPELINE
Because many embedded devices lack floating-point units, the transformation pipeline
can easily become the bottleneck as matrix manipulation operations need to be performed
using emulated floating-point operations. For this reason it is important to minimize the
number of times the matrix stack is modified. Also, expressing all object vertex data in
fixed point rather than floating point can produce savings, as a much simpler transfor-
mation pipeline can then be utilized.
6.5.1 OBJECT HIERARCHIES
When an artist models a 3D scene she typically expresses the world as a complex hierarchy
of nodes. Objects are not just collections of triangles. Instead, they have internal structure,
and often consist of multiple subobjects, each with its own materials, transformation
matrices and other attributes. This flexible approach makes a lot of sense when modeling
a world, but it is not an optimal presentation for the rendering pipeline, as unnecessary
matrix processing is likely to happen.
A better approach is to create a small piece of code that is executed when the data is
exported from the modeling tool. This code should find objects in the same hierarchy
sharing the same transformation matrices and shaders, and combine them together. The
code should also “flatten” static transformation hierarchies, i.e., premultiply hierarchical
transformations together. Also, if the scene contains a large number of replicated static
objects such as low-polygon count trees forming a forest or the kinds of props shown in
Figure 6.9, it makes sense to combine the objects into a single larger one by transforming
all of the objects into the same coordinate space.
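Flattening a static chain of transformations amounts to premultiplying the node-to-parent matrices once at export time, so the renderer loads a single matrix instead of walking the hierarchy. A sketch using column-major 4x4 matrices (the OpenGL convention; the function names are illustrative):

```c
/* out = a * b, column-major 4x4; 'out' must not alias 'a' or 'b'. */
void mat4_mul(const float *a, const float *b, float *out)
{
    int r, c, k;
    for (c = 0; c < 4; ++c)
        for (r = 0; r < 4; ++r) {
            float s = 0.0f;
            for (k = 0; k < 4; ++k)
                s += a[k * 4 + r] * b[c * 4 + k];
            out[c * 4 + r] = s;
        }
}

/* chain[0] is the root transform, chain[n-1] the leaf; the result
 * maps leaf space all the way to the root's coordinate space. */
void flatten_chain(float (*chain)[16], int n, float *out)
{
    float tmp[16];
    int i, j;
    for (j = 0; j < 16; ++j) out[j] = chain[0][j];
    for (i = 1; i < n; ++i) {
        mat4_mul(out, chain[i], tmp);
        for (j = 0; j < 16; ++j) out[j] = tmp[j];
    }
}
```

Composing two translations this way yields a single translation by the component-wise sum, as expected.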
6.5.2 RENDERING ORDER
The rendering order of objects has implications for the rendering performance. In gen-
eral, objects should be rendered in an approximate front-to-back order. The reason for
this is that the z-buffering algorithm used for hidden surface removal can quickly dis-
card covered fragments. If the occluding objects are rasterized first, many of the hidden
fragments require less processing. Modern GPUs often perform the depth buffering in
a hierarchical fashion, discarding hidden blocks of 4 × 4 or 8 × 8 pixels at a time. The
best practical way to exploit this early culling is to sort the objects of a scene in a coarse
fashion. Tile-based rendering architectures such as MBX of Imagination Technologies
and Mali of ARM buffer the scene geometry before the rasterization stage and are thus
able to perform the hidden surface removal efficiently regardless of the object order-
ing. However, other GPU architectures can benefit greatly if the objects are in a rough
front-to-back order.
Depth ordering is not the only important sorting criterion—the state changes should be
kept to a minimum as well. This suggests that one should first group objects based on
their materials and shaders, then render the groups in depth order.
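This two-level ordering is naturally expressed as a sort with a composite key; here is a sketch (the DrawItem type is an illustrative assumption). The shader id is the primary key, so state changes stay grouped, and view-space depth is the secondary key, so each group is drawn roughly front to back.

```c
#include <stdlib.h>

typedef struct {
    int   shader_id;  /* which state block the object needs */
    float depth;      /* distance from the camera; smaller = nearer */
} DrawItem;

static int cmp_draw(const void *pa, const void *pb)
{
    const DrawItem *a = (const DrawItem *)pa;
    const DrawItem *b = (const DrawItem *)pb;
    if (a->shader_id != b->shader_id)
        return a->shader_id - b->shader_id;  /* primary: state group */
    if (a->depth < b->depth) return -1;      /* secondary: near first */
    if (a->depth > b->depth) return 1;
    return 0;
}

void sort_draws(DrawItem *items, int n)
{
    qsort(items, n, sizeof(DrawItem), cmp_draw);
}
```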
Figure 6.9: Low-polygon in-game objects. (Images copyright © Digital Chocolate.)
Figure 6.10: Occlusion culling applied to a complex urban environment consisting of
thousands of buildings. Left: view frustum intersecting a city as seen from a third-person
view. Right: wireframe images of the camera’s view without (top) and with (bottom)
occlusion culling. Here culling reduces the number of objects rendered by a factor of one
hundred. (Image copyright © NVidia.)
6.5.3 CULLING
Conservative culling strategies are ones that reduce the number of rendered objects with-
out introducing any artifacts. Frustum culling is used to remove objects falling outside
the view frustum, and occlusion culling to discard objects hidden completely by others.
Frustum culling is best performed using conservatively computed bounding volumes for
objects. This can be further optimized by organizing the scene graph into a bounding
volume hierarchy and performing the culling using the hierarchy. Frustum culling is a
trivial optimization to implement, and should be used by any rendering application—
practically all scene graph engines support this, including all real-world M3G implemen-
tations. Occlusion culling algorithms, on the other hand, are complex, and often difficult
to implement (see Figure 6.10). Of the various different algorithms, two are particularly
suited for handheld 3D applications: precomputed Potentially Visible Sets (PVSs) and
portal rendering. Both have modest run-time CPU requirements [Air90, LG95].
When an application just has too much geometry to render, aggressive culling strate-
gies need to be employed. There are several different options for choosing which objects
are not rendered. Commonly used methods include distance-based culling where faraway
objects are discarded, and detail culling, where objects having small screen footprints after
projection are removed. Distance-based culling creates annoying popping artifacts which
are often reduced either by bringing the far clipping plane closer, by using fog effects to
mask the transition, or by using distance-based alpha blending to fade faraway objects
into full transparency. The popping can also be reduced by level-of-detail rendering, i.e.,
by switching to simplified versions of an object as its screen area shrinks.
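The distance-based alpha fade mentioned above reduces to a small helper; this is a sketch with illustrative names. The returned alpha would be fed into the object's blend factor, and objects past the far fade distance can be skipped entirely.

```c
/* Alpha in [0, 1] that fades an object out between fade_start and
 * fade_end, so it disappears smoothly instead of popping. */
float fade_alpha(float dist, float fade_start, float fade_end)
{
    if (dist <= fade_start) return 1.0f;  /* fully visible */
    if (dist >= fade_end)   return 0.0f;  /* fully faded: skip the draw */
    return (fade_end - dist) / (fade_end - fade_start);
}
```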
6.6 LIGHTING
The fixed-functionality lighting pipeline of OpenGL ES and M3G is fairly limited in its
capabilities, and it inherits the basic problems of the original OpenGL lighting
model. The fundamental problem is that it is vertex-based, and thus fine tessellation of
meshes is required for reducing the artifacts due to sparse lighting sampling. Also, the
lighting model used in the mobile APIs is somewhat simplified; some important aspects
such as properly modeled specular illumination have been omitted.
Driver implementations of the lighting pipeline are notoriously poor, and often very
slow except for a few hand-optimized fast paths. In practice a good bet is that a single
directional light will be properly accelerated, and more complex illumination has a good
chance of utilizing slower code paths. In any case the cost will increase at least linearly
with the number of lights, and the more complex lighting features you use, the slower
your application runs.
When the vertex lighting pipeline is utilized, you should always attempt to simplify its
workload. For example, prenormalizing vertex normals is likely to speed up the lighting
computations. In a similar fashion, you should avoid using truly homogeneous vertex
positions, i.e., those that have w components other than zero or one, as these require
a more complex lighting pipeline. Specular illumination computations of any kind are
rather expensive, so disabling them may increase the performance. The same advice
applies to distance attenuation: disabling it is likely to result in performance gains. How-
ever, if attenuating light sources are used, a potential optimization is completely disabling
faraway lights that contribute little or nothing to the illumination of an object. This can be
done using trivial bounding sphere overlap tests between the objects and the light sources.
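The trivial overlap test can be sketched as follows (illustrative names): treat the light's attenuation range as a sphere and skip the light entirely when it does not touch the object's bounding sphere.

```c
typedef struct { float x, y, z, radius; } Sphere;

/* Two spheres overlap when the distance between their centers is at
 * most the sum of their radii; compare squared values to avoid the
 * square root. */
int spheres_overlap(const Sphere *a, const Sphere *b)
{
    float dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
    float r  = a->radius + b->radius;
    return dx * dx + dy * dy + dz * dz <= r * r;
}
```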
6.6.1 PRECOMPUTED ILLUMINATION
The quality problems of the limited OpenGL lighting model will disappear once pro-
grammable shaders are supported, though even then you will pay the execution time
penalty of complex lighting models and of multiple light sources. However, with fixed-

functionality pipelines of OpenGL ES 1.x and M3G 1.x one should primarily utilize
texture-based and precomputed illumination, and try to minimize the application’s
reliance on the vertex-based lighting pipeline.
For static lighting, precomputed vertex-based illumination is a cheap and good option.
The lighting is computed only once as a part of the modeling phase, and the vertex illumi-
nation is exported along with the mesh. This may also reduce the memory consumption of
the meshes, as vertex normals do not need to be exported if dynamic lighting is omitted.
OpenGL ES supports a concept called color material tracking which allows changing a
material’s diffuse or ambient component separately for each vertex of a mesh. This allows
combining precomputed illumination with dynamic vertex-based lighting.
6.7 TEXTURES
Texturing plays an especially important role in mobile graphics, as it makes it possible to
push lighting computations from the vertex pipeline to the fragment pipeline. This reduces
the pressure to tessellate geometry. Also, it is more likely that the fragment pipeline is
accelerated; several commonly deployed hardware accelerators such as MBX Lite perform
the entire transformation and lighting pipeline on the CPU but have fast pixel-processing
hardware.
Software and hardware implementations of texture mapping have rather different per-
formance characteristics. A software implementation will take a serious performance hit
whenever linear blending between mipmap levels or texels is used. Also, disabling perspec-
tive correct texture interpolation may result in considerable speed-ups when a software
rasterizer is used. Mipmapping, on the other hand, is almost always a good idea, as it
makes texture caching more efficient for both software and hardware implementations.
It should be kept in mind that modifying texture data almost always has a significant
negative performance impact. Because rendering pipelines are generally deeply buffered,
there are two things that a driver may do when a texture is modified by the application.
Either the entire pipeline is flushed—this means that the client and the server cannot
execute in parallel, or the texture image and associated mipmap levels need to be duplicated.
In either case, the performance is degraded. The latter case also temporarily increases the

driver’s memory usage.
Multi-texturing should always be preferred over multi-pass rendering. There are several
good reasons for this. Z-fighting artifacts can be avoided this way, as the textures are
combined before the color buffer write is performed. Also, the number of render state
changes is reduced, and an expensive alpha blending pass is avoided altogether. Finally,
the number of draw calls is reduced by half.
6.7.1 TEXTURE STORAGE
Both OpenGL ES and M3G abstract out completely how the driver caches textures inter-
nally. However, the application still has some control over the data layout, and this may
have a huge impact on performance. Deciding the correct sizes for texture maps, and
combining smaller maps used together into a single larger texture can be significant opti-
mizations. The “correct size” is the one where the texture map looks good under typical
viewing conditions—in other words, one where the ratio between the texture’s texels
and the screen’s pixels approaches 1.0. Using a larger texture map is a waste of memory.
A smaller one just deteriorates the quality.
The idea of combining multiple textures into a single texture map is an important one,
and is often used when rendering fonts, animations, or light maps. Such texture atlases
are also commonly used for storing the different texture maps used by a complex object
(see Figure 6.11). This technique allows switching between texture maps without actually
performing a state change—only the texture coordinates of the object need to vary. Long
strings of text or complex objects using multiple textures can thus be rendered using a
single rendering call.
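Once a texture has been packed into an atlas, the mesh's texture coordinates must be remapped into the sub-rectangle it occupies. A sketch with illustrative names:

```c
/* Remap interleaved (u, v) pairs into the atlas sub-rectangle
 * (u0, v0)-(u1, v1), given in normalized atlas coordinates. The
 * original coordinates must stay inside [0, 1]; tiling (REPEAT)
 * does not survive atlasing without extra work. */
void remap_uvs_to_atlas(float *uv, int count,
                        float u0, float v0, float u1, float v1)
{
    int i;
    for (i = 0; i < count; ++i) {
        uv[2 * i + 0] = u0 + uv[2 * i + 0] * (u1 - u0);
        uv[2 * i + 1] = v0 + uv[2 * i + 1] * (v1 - v0);
    }
}
```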
Texture image data is probably the most significant consumer of memory in a graphics-
intensive application. As the memory capacity of a mobile device is still often rather limited,
it is important to pay attention to the texture formats and layouts used. Both OpenGL ES
and M3G provide support for compressed texture formats—although only via palettes
and vendor-specific extensions.
Nevertheless, compressed formats should be utilized whenever possible. Only in cases
where artifacts generated by the compression are visually disturbing, or when the texture

is often modified manually, should noncompressed formats be used. Even then, 16-bit
texture formats should be favored over 32-bit ones. Also, one should take advantage of
the intensity-only and alpha-only formats in cases where the texture data is monochrome.
In addition to saving valuable RAM, the use of compressed textures reduces the internal
memory bandwidth, which in turn is likely to improve the rendering performance.
Figure 6.11: An example of automatically packing textures into a texture atlas (refer to
Section 6.7.1). Image courtesy of Bruno Levy. (See the color plate.)