

Solaris™ Application Programming




Solaris Application Programming


Darryl Gove

Sun Microsystems Press

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City


Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.
The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
Sun Microsystems, Inc., has intellectual property rights relating to implementations of the technology described
in this publication. In particular, and without limitation, these intellectual property rights may include one or
more U.S. patents, foreign patents, or pending applications. Sun, Sun Microsystems, the Sun logo, J2ME, Solaris,
Java, Javadoc, NetBeans, and all Sun and Java based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries. UNIX is a registered trademark in
the United States and other countries, exclusively licensed through X/Open Company, Ltd.
THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. THIS PUBLICATION COULD INCLUDE
TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO


THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW EDITIONS OF THE
PUBLICATION. SUN MICROSYSTEMS, INC., MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE
PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS PUBLICATION AT ANY TIME.
The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special
sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact: U.S. Corporate and Government Sales, (800) 382-3419,
For sales outside the United States, please contact International Sales,
Visit us on the Web: www.prenhallprofessional.com.

Library of Congress Cataloging-in-Publication Data
Gove, Darryl.
Solaris application programming / Darryl Gove.
p. cm.
Includes index.
ISBN 978-0-13-813455-6 (hardcover : alk. paper)
1. Solaris (Computer file) 2. Operating systems (Computers) 3.
Application software—Development. 4. System design. I. Title.
QA76.76.O63G688 2007
005.4’32—dc22
2007043230
Copyright © 2008 Sun Microsystems, Inc.
4150 Network Circle, Santa Clara, California 95054 U.S.A.
All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system,
or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For
information regarding permissions, write to: Pearson Education, Inc., Rights and Contracts Department, 501
Boylston Street, Suite 900, Boston, MA 02116, Fax: (617) 671-3447.
ISBN-13: 978-0-13-813455-6
ISBN-10: 0-13-813455-3
Text printed in the United States on recycled paper at Courier in Westford, Massachusetts.

First printing, December 2007


Contents
Preface

PART I  Overview of the Processor

Chapter 1  The Generic Processor
1.1  Chapter Objectives
1.2  The Components of a Processor
1.3  Clock Speed
1.4  Out-of-Order Processors
1.5  Chip Multithreading
1.6  Execution Pipes
1.6.1  Instruction Latency
1.6.2  Load/Store Pipe
1.6.3  Integer Operation Pipe
1.6.4  Branch Pipe
1.6.5  Floating-Point Pipe
1.7  Caches
1.8  Interacting with the System
1.8.1  Bandwidth and Latency
1.8.2  System Buses
1.9  Virtual Memory
1.9.1  Overview
1.9.2  TLBs and Page Size
1.10  Indexing and Tagging of Memory
1.11  Instruction Set Architecture

Chapter 2  The SPARC Family
2.1  Chapter Objectives
2.2  The UltraSPARC Family
2.2.1  History of the SPARC Architecture
2.2.2  UltraSPARC Processors
2.3  The SPARC Instruction Set
2.3.1  A Guide to the SPARC Instruction Set
2.3.2  Integer Registers
2.3.3  Register Windows
2.3.4  Floating-Point Registers
2.4  32-bit and 64-bit Code
2.5  The UltraSPARC III Family of Processors
2.5.1  The Core of the CPU
2.5.2  Communicating with Memory
2.5.3  Prefetch
2.5.4  Blocking Load on Data Cache Misses
2.5.5  UltraSPARC III-Based Systems
2.5.6  Total Store Ordering
2.6  UltraSPARC T1
2.7  UltraSPARC T2
2.8  SPARC64 VI

Chapter 3  The x64 Family of Processors
3.1  Chapter Objectives
3.2  The x64 Family of Processors
3.3  The x86 Processor: CISC and RISC
3.4  Byte Ordering
3.5  Instruction Template
3.6  Registers
3.7  Instruction Set Extensions and Floating Point
3.8  Memory Ordering

PART II  Developer Tools

Chapter 4  Informational Tools
4.1  Chapter Objectives
4.2  Tools That Report System Configuration
4.2.1  Introduction
4.2.2  Reporting General System Information (prtdiag, prtconf, prtpicl, prtfru)
4.2.3  Enabling Virtual Processors (psrinfo and psradm)
4.2.4  Controlling the Use of Processors through Processor Sets or Binding (psrset and pbind)
4.2.5  Reporting Instruction Sets Supported by Hardware (isalist)
4.2.6  Reporting TLB Page Sizes Supported by Hardware (pagesize)
4.2.7  Reporting a Summary of SPARC Hardware Characteristics (fpversion)
4.3  Tools That Report Current System Status
4.3.1  Introduction
4.3.2  Reporting Virtual Memory Utilization (vmstat)
4.3.3  Reporting Swap File Usage (swap)
4.3.4  Reporting Process Resource Utilization (prstat)
4.3.5  Listing Processes (ps)
4.3.6  Locating the Process ID of an Application (pgrep)
4.3.7  Reporting Activity for All Processors (mpstat)
4.3.8  Reporting Kernel Statistics (kstat)
4.3.9  Generating a Report of System Activity (sar)
4.3.10  Reporting I/O Activity (iostat)
4.3.11  Reporting Network Activity (netstat)
4.3.12  The snoop command
4.3.13  Reporting Disk Space Utilization (df)
4.3.14  Reporting Disk Space Used by Files (du)
4.4  Process- and Processor-Specific Tools
4.4.1  Introduction
4.4.2  Timing Process Execution (time, timex, and ptime)
4.4.3  Reporting System-Wide Hardware Counter Activity (cpustat)
4.4.4  Reporting Hardware Performance Counter Activity for a Single Process (cputrack)
4.4.5  Reporting Bus Activity (busstat)
4.4.6  Reporting on Trap Activity (trapstat)
4.4.7  Reporting Virtual Memory Mapping Information for a Process (pmap)
4.4.8  Examining Command-Line Arguments Passed to a Process (pargs)
4.4.9  Reporting the Files Held Open by a Process (pfiles)
4.4.10  Examining the Current Stack of a Process (pstack)
4.4.11  Tracing Application Execution (truss)
4.4.12  Exploring User Code and Kernel Activity with dtrace
4.5  Information about Applications
4.5.1  Reporting Library Linkage (ldd)
4.5.2  Reporting the Type of Contents Held in a File (file)
4.5.3  Reporting Symbols in a File (nm)
4.5.4  Reporting Library Version Information (pvs)
4.5.5  Examining the Disassembly of an Application, Library, or Object (dis)
4.5.6  Reporting the Size of the Various Segments in an Application, Library, or Object (size)
4.5.7  Reporting Metadata Held in a File (dumpstabs, dwarfdump, elfdump, dump, and mcs)

Chapter 5  Using the Compiler
5.1  Chapter Objectives
5.2  Three Sets of Compiler Options
5.3  Using -xtarget=generic on x86
5.4  Optimization
5.4.1  Optimization Levels
5.4.2  Using the -O Optimization Flag
5.4.3  Using the -fast Compiler Flag
5.4.4  Specifying Architecture with -fast
5.4.5  Deconstructing -fast
5.4.6  Performance Optimizations in -fast (for the Sun Studio 12 Compiler)
5.5  Generating Debug Information
5.5.1  Debug Information Flags
5.5.2  Debug and Optimization
5.6  Selecting the Target Machine Type for an Application
5.6.1  Choosing between 32-bit and 64-bit Applications
5.6.2  The Generic Target
5.6.3  Specifying Cache Configuration Using the -xcache Flag
5.6.4  Specifying Code Scheduling Using the -xchip Flag
5.6.5  The -xarch Flag and -m32/-m64
5.7  Code Layout Optimizations
5.7.1  Introduction
5.7.2  Crossfile Optimization
5.7.3  Mapfiles
5.7.4  Profile Feedback
5.7.5  Link-Time Optimization
5.8  General Compiler Optimizations
5.8.1  Prefetch Instructions
5.8.2  Enabling Prefetch Generation (-xprefetch)
5.8.3  Controlling the Aggressiveness of Prefetch Insertion (-xprefetch_level)
5.8.4  Enabling Dependence Analysis (-xdepend)
5.8.5  Handling Misaligned Memory Accesses on SPARC (-xmemalign/-dalign)
5.8.6  Setting Page Size Using -xpagesize=<size>
5.9  Pointer Aliasing in C and C++
5.9.1  The Problem with Pointers
5.9.2  Diagnosing Aliasing Problems
5.9.3  Using Restricted Pointers in C and C++ to Reduce Aliasing Issues
5.9.4  Using the -xalias_level Flag to Specify the Degree of Pointer Aliasing
5.9.5  -xalias_level for C
5.9.6  -xalias_level=any in C
5.9.7  -xalias_level=basic in C
5.9.8  -xalias_level=weak in C
5.9.9  -xalias_level=layout in C
5.9.10  -xalias_level=strict in C
5.9.11  -xalias_level=std in C
5.9.12  -xalias_level=strong in C
5.9.13  -xalias_level in C++
5.9.14  -xalias_level=simple in C++
5.9.15  -xalias_level=compatible in C++
5.10  Other C- and C++-Specific Compiler Optimizations
5.10.1  Enabling the Recognition of Standard Library Routines (-xbuiltin)
5.11  Fortran-Specific Compiler Optimizations
5.11.1  Aligning Variables for Optimal Layout (-xpad)
5.11.2  Placing Local Variables on the Stack (-xstackvar)
5.12  Compiler Pragmas
5.12.1  Introduction
5.12.2  Specifying Alignment of Variables
5.12.3  Specifying a Function's Access to Global Data
5.12.4  Specifying That a Function Has No Side Effects
5.12.5  Specifying That a Function Is Infrequently Called
5.12.6  Specifying a Safe Degree of Pipelining for a Particular Loop
5.12.7  Specifying That a Loop Has No Memory Dependencies within a Single Iteration
5.12.8  Specifying the Degree of Loop Unrolling
5.13  Using Pragmas in C for Finer Aliasing Control
5.13.1  Asserting the Degree of Aliasing between Variables
5.13.2  Asserting That Variables Do Alias
5.13.3  Asserting Aliasing with Nonpointer Variables
5.13.4  Asserting That Variables Do Not Alias
5.13.5  Asserting No Aliasing with Nonpointer Variables
5.14  Compatibility with GCC

Chapter 6  Floating-Point Optimization
6.1  Chapter Objectives
6.2  Floating-Point Optimization Flags
6.2.1  Mathematical Optimizations in -fast
6.2.2  IEEE-754 and Floating Point
6.2.3  Vectorizing Floating-Point Computation (-xvector)
6.2.4  Vectorizing Computation Using SIMD Instructions (-xvector=simd) (x64 Only)
6.2.5  Subnormal Numbers
6.2.6  Flushing Subnormal Numbers to Zero (-fns)
6.2.7  Handling Values That Are Not-a-Number
6.2.8  Enabling Floating-Point Expression Simplification (-fsimple)
6.2.9  Elimination of Comparisons
6.2.10  Elimination of Unnecessary Calculation
6.2.11  Reordering of Calculations
6.2.12  Kahan Summation Formula
6.2.13  Hoisting of Divides
6.2.14  Honoring of Parentheses at Levels of Floating-Point Simplification
6.2.15  Effect of -fast on errno
6.2.16  Specifying Which Floating-Point Events Cause Traps (-ftrap)
6.2.17  The Floating-Point Exception Flags
6.2.18  Floating-Point Exceptions in C99
6.2.19  Using Inline Template Versions of Floating-Point Functions (-xlibmil)
6.2.20  Using the Optimized Math Library (-xlibmopt)
6.2.21  Do Not Promote Single-Precision Values to Double Precision (-fsingle for C)
6.2.22  Storing Floating-Point Constants in Single Precision (-xsfpconst for C)
6.3  Floating-Point Multiply Accumulate Instructions
6.4  Integer Math
6.4.1  Other Integer Math Opportunities
6.5  Floating-Point Parameter Passing with SPARC V8 Code

Chapter 7  Libraries and Linking
7.1  Introduction
7.2  Linking
7.2.1  Overview of Linking
7.2.2  Dynamic and Static Linking
7.2.3  Linking Libraries
7.2.4  Creating a Static Library
7.2.5  Creating a Dynamic Library
7.2.6  Specifying the Location of Libraries
7.2.7  Lazy Loading of Libraries
7.2.8  Initialization and Finalization Code in Libraries
7.2.9  Symbol Scoping
7.2.10  Library Interposition
7.2.11  Using the Debug Interface
7.2.12  Using the Audit Interface
7.3  Libraries of Interest
7.3.1  The C Runtime Library (libc and libc_psr)
7.3.2  Memory Management Libraries
7.3.3  libfast
7.3.4  The Performance Library
7.3.5  STLport4
7.4  Library Calls
7.4.1  Library Routines for Timing
7.4.2  Picking the Most Appropriate Library Routines
7.4.3  SIMD Instructions and the Media Library
7.4.4  Searching Arrays Using VIS Instructions

Chapter 8  Performance Profiling Tools
8.1  Introduction
8.2  The Sun Studio Performance Analyzer
8.3  Collecting Profiles
8.4  Compiling for the Performance Analyzer
8.5  Viewing Profiles Using the GUI
8.6  Caller–Callee Information
8.7  Using the Command-Line Tool for Performance Analysis
8.8  Interpreting Profiles
8.9  Interpreting Profiles from UltraSPARC III/IV Processors
8.10  Profiling Using Performance Counters
8.11  Interpreting Call Stacks
8.12  Generating Mapfiles
8.13  Generating Reports on Performance Using spot
8.14  Profiling Memory Access Patterns
8.15  er_kernel
8.16  Tail-Call Optimization and Debug
8.17  Gathering Profile Information Using gprof
8.18  Using tcov to Get Code Coverage Information
8.19  Using dtrace to Gather Profile and Coverage Information
8.20  Compiler Commentary

Chapter 9  Correctness and Debug
9.1  Introduction
9.2  Compile-Time Checking
9.2.1  Introduction
9.2.2  Compile-Time Checking for C Source Code
9.2.3  Checking of C Source Code Using lint
9.2.4  Source Processing Options Common to the C and C++ Compilers
9.2.5  C++
9.2.6  Fortran
9.3  Runtime Checking
9.3.1  Bounds Checking
9.3.2  watchmalloc
9.3.3  Debugging Options under Other mallocs
9.3.4  Runtime Array Bounds Checking in Fortran
9.3.5  Runtime Stack Overflow Checking
9.3.6  Memory Access Error Detection Using discover
9.4  Debugging Using dbx
9.4.1  Debug Compiler Flags
9.4.2  Debug and Optimization
9.4.3  Debug Information Format
9.4.4  Debug and OpenMP
9.4.5  Frame Pointer Optimization on x86
9.4.6  Running the Debugger on a Core File
9.4.7  Example of Debugging an Application
9.4.8  Running an Application under dbx
9.5  Locating Optimization Bugs Using ATS
9.6  Debugging Using mdb

PART III  Optimization

Chapter 10  Performance Counter Metrics
10.1  Chapter Objectives
10.2  Reading the Performance Counters
10.3  UltraSPARC III and UltraSPARC IV Performance Counters
10.3.1  Instructions and Cycles
10.3.2  Data Cache Events
10.3.3  Instruction Cache Events
10.3.4  Second-Level Cache Events
10.3.5  Cycles Lost to Cache Miss Events
10.3.6  Example of Cache Access Metrics
10.3.7  Synthetic Metrics for Latency
10.3.8  Synthetic Metrics for Memory Bandwidth Consumption
10.3.9  Prefetch Cache Events
10.3.10  Comparison of Performance Counters with and without Prefetch
10.3.11  Write Cache Events
10.3.12  Cycles Lost to Processor Stall Events
10.3.13  Branch Misprediction
10.3.14  Memory Controller Events
10.4  Performance Counters on the UltraSPARC IV and UltraSPARC IV+
10.4.1  Introduction
10.4.2  UltraSPARC IV+ L3 Cache
10.4.3  Memory Controller Events
10.5  Performance Counters on the UltraSPARC T1
10.5.1  Hardware Performance Counters
10.5.2  UltraSPARC T1 Cycle Budget
10.5.3  Performance Counters at the Core Level
10.5.4  Calculating System Bandwidth Consumption
10.6  UltraSPARC T2 Performance Counters
10.7  SPARC64 VI Performance Counters
10.8  Opteron Performance Counters
10.8.1  Introduction
10.8.2  Instructions
10.8.3  Instruction Cache Events
10.8.4  Data Cache Events
10.8.5  TLB Events
10.8.6  Branch Events
10.8.7  Stall Cycles

Chapter 11  Source Code Optimizations
11.1  Overview
11.2  Traditional Optimizations
11.2.1  Introduction
11.2.2  Loop Unrolling and Pipelining
11.2.3  Loop Peeling, Fusion, and Splitting
11.2.4  Loop Interchange and Tiling
11.2.5  Loop Invariant Hoisting
11.2.6  Common Subexpression Elimination
11.2.7  Strength Reduction
11.2.8  Function Cloning
11.3  Data Locality, Bandwidth, and Latency
11.3.1  Bandwidth
11.3.2  Integer Data
11.3.3  Storing Streams
11.3.4  Manual Prefetch
11.3.5  Latency
11.3.6  Copying and Moving Memory
11.4  Data Structures
11.4.1  Structure Reorganizing
11.4.2  Structure Prefetching
11.4.3  Considerations for Optimal Performance from Structures
11.4.4  Matrices and Accesses
11.4.5  Multiple Streams
11.5  Thrashing
11.5.1  Summary
11.5.2  Data TLB Performance Counter
11.6  Reads after Writes
11.7  Store Queue
11.7.1  Stalls
11.7.2  Detecting Store Queue Stalls
11.8  If Statements
11.8.1  Introduction
11.8.2  Conditional Moves
11.8.3  Misaligned Memory Accesses on SPARC Processors
11.9  File-Handling in 32-bit Applications
11.9.1  File Descriptor Limits
11.9.2  Handling Large Files in 32-bit Applications

PART IV  Threading and Throughput

Chapter 12  Multicore, Multiprocess, Multithread
12.1  Introduction
12.2  Processes, Threads, Processors, Cores, and CMT
12.3  Virtualization
12.4  Horizontal and Vertical Scaling
12.5  Parallelization
12.6  Scaling Using Multiple Processes
12.6.1  Multiple Processes
12.6.2  Multiple Cooperating Processes
12.6.3  Parallelism Using MPI
12.7  Multithreaded Applications
12.7.1  Parallelization Using Pthreads
12.7.2  Thread Local Storage
12.7.3  Mutexes
12.7.4  Using Atomic Operations
12.7.5  False Sharing
12.7.6  Memory Layout for a Threaded Application
12.8  Parallelizing Applications Using OpenMP
12.9  Using OpenMP Directives to Parallelize Loops
12.10  Using the OpenMP API
12.11  Parallel Sections
12.11.1  Setting Stack Sizes for OpenMP
12.12  Automatic Parallelization of Applications
12.13  Profiling Multithreaded Applications
12.14  Detecting Data Races in Multithreaded Applications
12.15  Debugging Multithreaded Code
12.16  Parallelizing a Serial Application
12.16.1  Example Application
12.16.2  Impact of Optimization on Serial Performance
12.16.3  Profiling the Serial Application
12.16.4  Unrolling the Critical Loop
12.16.5  Parallelizing Using Pthreads
12.16.6  Parallelization Using OpenMP
12.16.7  Auto-Parallelization
12.16.8  Load Balancing with OpenMP
12.16.9  Sharing Data Between Threads
12.16.10  Sharing Variables between Threads Using OpenMP

PART V  Concluding Remarks

Chapter 13  Performance Analysis
13.1  Introduction
13.2  Algorithms and Complexity
13.3  Tuning Serial Code
13.4  Exploring Parallelism
13.5  Optimizing for CMT Processors

Index




Preface
About This Book
This book is a guide to getting the best performance out of computers running the
Solaris operating system. The target audience is developers and software architects who are interested in using the tools that are available, as well as those who
are interested in squeezing the last drop of performance out of the system.
The book caters to those who are new to performance analysis and optimization, as well as those who are experienced in the area. To do this, the book starts
with an overview of processor fundamentals, before introducing the tools and getting into the details.
One of the things that distinguishes this book from others is that it is a practical guide. There are often two problems to overcome when doing development
work. The first problem is knowing the tools that are available. This book is written to cover the breadth of tools available today and to introduce the common uses
for them. The second problem is interpreting the output from the tools. This book
includes many examples of tool use and explains their output.
One trap this book aims to avoid is that of explaining how to manually do the
optimizations that the compiler performs automatically. The book’s focus is on identifying the problems using appropriate tools and solving the problems using the easiest approach. Sometimes, the solution is to use different compiler flags so that a
particular hot spot in the application is optimized away. Other times, the solution is
to change the code because the compiler is unable to perform the optimization; I
explain this with insight into why the compiler is unable to transform the code.


Goals and Assumptions
The goals of this book are as follows.

- Provide a comprehensive introduction to the components that influence processor performance.
- Introduce the tools that you can use for performance analysis and improvement, both those that ship with the operating system and those that ship with the compiler.
- Introduce the compiler and explain the optimizations that it supports to enable improved performance.
- Discuss the features of the SPARC and x64 families of processors and demonstrate how you can use these features to improve application performance.
- Talk about the possibilities of using multiple processors or threads to enable better performance, or more efficient use of computer resources.
The book assumes that the reader is comfortable with the C programming language. This language is used for most of the examples in the book. The book also
assumes a willingness to learn some of the lower-level details about the processors
and the instruction sets that the processors use. The book does not attempt to go
into the details of processor internals, but it does introduce some of the features of
modern processors that will have an effect on application performance.
The book assumes that the reader has access to the Sun Studio compiler and
tools. These tools are available as free downloads. Most of the examples come from
using Sun Studio 12, but any recent compiler version should yield similar results.
The compiler is typically installed in /opt/SUNWspro/bin/, and it is assumed
that this directory appears on the reader's path.
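For readers setting up a fresh environment, making that assumption hold is a one-time step. The sketch below uses Bourne-shell syntax and the default install location mentioned above; a nonstandard install would need a different directory.

```shell
# Prepend the default Sun Studio install directory to the search path
# (Bourne-shell syntax; adjust SUNSTUDIO_BIN for a nonstandard install).
SUNSTUDIO_BIN=/opt/SUNWspro/bin
PATH=$SUNSTUDIO_BIN:$PATH
export PATH

# The first path component should now be the Sun Studio bin directory,
# so cc, CC, f90, and the analyzer tools resolve to the Sun Studio ones.
echo "$PATH" | cut -d: -f1
```

Placing the directory first ensures the Sun Studio cc is found ahead of any other cc on the system.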
The book focuses on Solaris 10. Many of the tools discussed are also available in
prior versions. I note in the text when a tool has been introduced in a relatively
recent version of Solaris.
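A quick way to check which release a machine is running is to query the kernel revision; Solaris releases report as SunOS 5.x, so Solaris 10 identifies itself as 5.10.

```shell
# Print the SunOS kernel revision: "5.10" on Solaris 10, "5.9" on
# Solaris 9, and so on. Tools introduced in Solaris 10 may be absent
# on earlier revisions.
uname -r
```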

Chapter Overview
Part I—Overview of the Processor
Chapter 1—The Generic Processor
Chapter 2—The SPARC Family
Chapter 3—The x64 Family of Processors
Part II—Developer Tools
Chapter 4—Informational Tools
Chapter 5—Using the Compiler
Chapter 6—Floating-Point Optimization
Chapter 7—Libraries and Linking
Chapter 8—Performance Profiling Tools
Chapter 9—Correctness and Debug
Part III—Optimization
Chapter 10—Performance Counter Metrics
Chapter 11—Source Code Optimizations
Part IV—Threading and Throughput
Chapter 12—Multicore, Multiprocess, Multithread
Part V—Concluding Remarks
Chapter 13—Performance Analysis

Acknowledgments
A number of people contributed to the writing of this book. Ross Towle provided an
early outline for the chapter on multithreaded programming and provided comments on the final version of that text. Joel Williamson read the early drafts a
number of times and each time provided detailed comments and improvements.
My colleagues Boris Ivanovski, Karsten Gutheridge, John Henning, Miriam Blatt,
Linda Hsi, Peter Farkas, Greg Price, and Geetha Vallabhenini also read the
drafts at various stages and suggested refinements to the text. A particular debt
of thanks is due to John Henning, who provided many detailed improvements to
the text.
I’m particularly grateful to domain experts who took the time to read various
chapters and provide helpful feedback, including Rod Evans for his input on the
linker, Chris Quenelle for his assistance with the debugger, Brian Whitney for contributing comments and the names of some useful tools for the section on tools,
Brendan Gregg for his comments, Jian-Zhong Wang for reviewing the materials on
compilers and source code optimizations, Alex Liu for providing detailed comments
on the chapter on floating-point optimization, Marty Itzkowitz for comments on the
performance profiling and multithreading chapters, Yuan Lin, Ruud van der Pas,
Alfred Huang, and Nawal Copty for also providing comments on the chapter on
multithreading, Josh Simmons for commenting on MPI, David Weaver for insights
into the history of the SPARC processor, Richard Smith for reviewing the chapter
on x64 processors, and Richard Friedman for comments throughout the text.
A number of people made a huge difference to the process of getting this book
published, including Yvonne Prefontaine, Ahmed Zandi, and Ken Tracton. I’m particularly grateful for the help of Richard McDougall in guiding the project through
the final stages.
Special thanks are due to the Prentice Hall staff, including editor Greg Doench
and full-service production manager Julie Nahil. Thanks also to production project
manager Dmitri Korzh from Techne Group.
Most importantly, I would like to thank my family for their support and encouragement. Jenny, whose calm influence and practical suggestions have helped me
with the trickier aspects of the text; Aaron, whose great capacity for imaginatively
solving even the most mundane of problems has inspired me along the way; Timothy, whose enthusiastic sharing of the enjoyment of life is always uplifting; and
Emma, whose arrival as I completed this text has been a most wonderful gift.


PART I


Overview of the Processor

Chapter 1, The Generic Processor
Chapter 2, The SPARC Family
Chapter 3, The x64 Family of Processors



