Copyright 1998 by Charles Petzold
Author's Note

Visit my web site www.cpetzold.com for updated information regarding this book, including possible bug reports
and new code listings. You can address mail regarding problems in this book to Although
I'll also try to answer any easy questions you may have, I can't make any promises. I'm usually pretty busy, and my
cat refuses to learn the Windows API.
I'd like to thank everyone at Microsoft Press for another great job in putting together this book. I think this "10th
Anniversary Edition" of Programming Windows is the best edition yet. Many other people at Microsoft (including
some of the early developers of Microsoft Windows) also helped out when I was writing the earlier editions, and
these fine people are listed in those editions.
Thanks also to my family and friends, and in particular those more recent friends (you know who you are!) whose
support has made this book possible. To you this book is dedicated.
Charles Petzold
October 5, 1998
Chapter 1

Getting Started

This book shows you how to write programs that run under Microsoft Windows 98, Microsoft Windows NT 4.0,
and Windows NT 5.0. These programs are written in the C programming language and use the native Windows
application programming interfaces (APIs). As I'll discuss later in this chapter, this is not the only way to write
programs that run under Windows. However, it is important to understand the Windows APIs regardless of what
you eventually use to write your code.
As you probably know, Windows 98 is the latest incarnation of the graphical operating system that has become the
de facto standard for IBM-compatible personal computers built around 32-bit Intel microprocessors such as the 486
and Pentium. Windows NT is the industrial-strength version of Windows that runs on PC compatibles as well as
some RISC (reduced instruction set computing) workstations.
There are three prerequisites for using this book. First, you should be familiar with Windows 98 from a user's
perspective. You cannot hope to write applications for Windows without understanding its user interface. For this
reason, I suggest that you do your program development (as well as other work) on a Windows-based machine
using Windows applications.
Second, you should know C. If you don't know C, Windows programming is probably not a good place to start. I
recommend that you learn C in a character-mode environment such as that offered under the Windows 98 MS-DOS
Command Prompt window. Windows programming sometimes involves aspects of C that don't show up much in
character-mode programming; in those cases, I'll devote some discussion to them. But for the most part, you should
have a good working familiarity with the language, particularly with C structures and pointers. Some knowledge of
the standard C run-time library is helpful but not required.
Third, you should have installed on your machine a 32-bit C compiler and development environment suitable for
doing Windows programming. In this book, I'll be assuming that you're using Microsoft Visual C++ 6.0, which can
be purchased separately or as a part of the Visual Studio 6.0 package.
That's it. I'm not going to assume that you have any experience at all programming for a graphical user interface such
as Windows.
The Windows Environment

Windows hardly needs an introduction. Yet it's easy to forget the sea change that Windows brought to office and
home desktop computing. Windows had a bumpy ride in its early years and was hardly destined to conquer the
desktop market.
A History of Windows

Soon after the introduction of the IBM PC in the fall of 1981, it became evident that the predominant operating
system for the PC (and compatibles) would be MS-DOS, which originally stood for Microsoft Disk Operating
System. MS-DOS was a minimal operating system. For the user, MS-DOS provided a command-line interface to
commands such as DIR and TYPE and loaded application programs into memory for execution. For the application
programmer, MS-DOS offered little more than a set of function calls for doing file input/output (I/O). For other tasks
(in particular, writing text and sometimes graphics to the video display), applications accessed the hardware of the PC
directly.
Due to memory and hardware constraints, sophisticated graphical environments were slow in coming to small
computers. Apple Computer offered an alternative to character-mode environments when it released its ill-fated Lisa
in January 1983, and then set a standard for graphical environments with the Macintosh in January 1984. Despite the
Mac's declining market share, it is still considered the standard against which other graphical environments are
measured. All graphical environments, including the Macintosh and Windows, are indebted to the pioneering work
done at the Xerox Palo Alto Research Center (PARC) beginning in the mid-1970s.
Windows was announced by Microsoft Corporation in November 1983 (post-Lisa but pre-Macintosh) and was
released two years later in November 1985. Over the next two years, Microsoft Windows 1.0 was followed by
several updates to support the international market and to provide drivers for additional video displays and printers.
Windows 2.0 was released in November 1987. This version incorporated several changes to the user interface. The
most significant of these changes involved the use of overlapping windows rather than the "tiled" windows found in
Windows 1.0. Windows 2.0 also included enhancements to the keyboard and mouse interface, particularly for menus
and dialog boxes.
Up until this time, Windows required only an Intel 8086 or 8088 microprocessor running in "real mode" to access 1
megabyte (MB) of memory. Windows/386 (released shortly after Windows 2.0) used the "virtual 86" mode of the
Intel 386 microprocessor to window and multitask many DOS programs that directly accessed hardware. For
symmetry, Windows 2.1 was renamed Windows/286.
Windows 3.0 was introduced on May 22, 1990. The earlier Windows/286 and Windows/386 versions were merged
into one product with this release. The big change in Windows 3.0 was the support of the 16-bit protected-mode
operation of Intel's 286, 386, and 486 microprocessors. This gave Windows and Windows applications access to up
to 16 megabytes of memory. The Windows "shell" programs for running programs and maintaining files were
completely revamped. Windows 3.0 was the first version of Windows to gain a foothold in the home and the office.
Any history of Windows must also include a mention of OS/2, an alternative to DOS and Windows that was
originally developed by Microsoft in collaboration with IBM. OS/2 1.0 (character-mode only) ran on the Intel 286
(or later) microprocessors and was released in late 1987. The graphical Presentation Manager (PM) came about
with OS/2 1.1 in October 1988. PM was originally supposed to be a protected-mode version of Windows, but the
graphical API was changed to such a degree that it proved difficult for software manufacturers to support both
platforms.
By September 1990, conflicts between IBM and Microsoft reached a peak and required that the two companies go
their separate ways. IBM took over OS/2 and Microsoft made it clear that Windows was the center of their strategy
for operating systems. While OS/2 still has some fervent admirers, it has not nearly approached the popularity of
Windows.
Microsoft Windows version 3.1 was released in April 1992. Several significant features included the TrueType font
technology (which brought scalable outline fonts to Windows), multimedia (sound and music), Object Linking and
Embedding (OLE), and standardized common dialog boxes. Windows 3.1 ran only in protected mode and required
a 286 or 386 processor with at least 1 MB of memory.
Windows NT, introduced in July 1993, was the first version of Windows to support the 32-bit mode of the Intel
386, 486, and Pentium microprocessors. Programs that run under Windows NT have access to a 32-bit flat address
space and use a 32-bit instruction set. (I'll have more to say about address spaces a little later in this chapter.)
Windows NT was also designed to be portable to non-Intel processors, and it runs on several RISC-based
workstations.
Windows 95 was introduced in August 1995. Like Windows NT, Windows 95 also supported the 32-bit
programming mode of the Intel 386 and later microprocessors. Although it lacked some of the features of Windows
NT, such as high security and portability to RISC machines, Windows 95 had the advantage of requiring fewer
hardware resources.
Windows 98 was released in June 1998 and has a number of enhancements, including performance improvements,
better hardware support, and a closer integration with the Internet and the World Wide Web.
Aspects of Windows

Both Windows 98 and Windows NT are 32-bit preemptive multitasking and multithreading graphical operating
systems. Windows possesses a graphical user interface (GUI), sometimes also called a "visual interface" or "graphical
windowing environment." The concepts behind the GUI date from the mid-1970s with the work done at the Xerox
PARC for machines such as the Alto and the Star and for environments such as Smalltalk. This work was later
brought into the mainstream and popularized by Apple Computer and Microsoft. Although somewhat controversial
for a while, it is now quite obvious that the GUI is (in the words of Microsoft's Charles Simonyi) the single most
important "grand consensus" of the personal-computer industry.

All GUIs make use of graphics on a bitmapped video display. Graphics provides better utilization of screen real
estate, a visually rich environment for conveying information, and the possibility of a WYSIWYG (what you see is
what you get) video display of graphics and formatted text prepared for a printed document.
In earlier days, the video display was used solely to echo text that the user typed using the keyboard. In a graphical
user interface, the video display itself becomes a source of user input. The video display shows various graphical
objects in the form of icons and input devices such as buttons and scroll bars. Using the keyboard (or, more directly,
a pointing device such as a mouse), the user can directly manipulate these objects on the screen. Graphics objects
can be dragged, buttons can be pushed, and scroll bars can be scrolled.
The interaction between the user and a program thus becomes more intimate. Rather than the one-way cycle of
information from the keyboard to the program to the video display, the user directly interacts with the objects on the
display.
Users no longer expect to spend long periods of time learning how to use the computer or mastering a new program.
Windows helps because all applications have the same fundamental look and feel. The program occupies a window,
usually a rectangular area on the screen. Each window is identified by a caption bar. Most program functions are
initiated through the program's menus. A user can view the display of information too large to fit on a single screen by
using scroll bars. Some menu items invoke dialog boxes, into which the user enters additional information. One dialog
box in particular, that used to open a file, can be found in almost every large Windows program. This dialog box
looks the same (or nearly the same) in all of these Windows programs, and it is almost always invoked from the same
menu option.
Once you know how to use one Windows program, you're in a good position to easily learn another. The menus and
dialog boxes allow a user to experiment with a new program and explore its features. Most Windows programs have
both a keyboard interface and a mouse interface. Although most functions of Windows programs can be controlled
through the keyboard, using the mouse is often easier for many chores.
From the programmer's perspective, the consistent user interface results from using the routines built into Windows
for constructing menus and dialog boxes. All menus have the same keyboard and mouse interface because Windows,
rather than the application program, handles this job.
To facilitate the use of multiple programs, and the exchange of information among them, Windows supports
multitasking. Several Windows programs can be displayed and running at the same time. Each program occupies a
window on the screen. The user can move the windows around on the screen, change their sizes, switch between
different programs, and transfer data from one program to another. Because these windows look something like
papers on a desktop (in the days before the desk became dominated by the computer itself, of course), Windows is
sometimes said to use a "desktop metaphor" for the display of multiple programs.
Earlier versions of Windows used a system of multitasking called "nonpreemptive." This meant that Windows did not
use the system timer to slice processing time between the various programs running under the system. The programs
themselves had to voluntarily give up control so that other programs could run. Under Windows NT and Windows
98, multitasking is preemptive and programs themselves can split into multiple threads of execution that seem to run
concurrently.
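The nonpreemptive arrangement can be pictured with a toy model. The sketch below is only an illustration in portable C: TASK, TaskStep, and RunCooperatively are hypothetical names, not part of any Windows API. Each task does one unit of work and then returns, and returning is the only way the scheduler gets control back. A preemptive system would instead interrupt a running task from the system timer, which this sketch does not attempt to show.

```c
#include <stddef.h>

/* Toy model only: TASK, TaskStep, and RunCooperatively are hypothetical
   names and not part of any Windows API. */

typedef struct
{
     int remaining ;     /* units of work left to do */
     int completed ;     /* units of work finished so far */
}
TASK ;

/* Do one unit of work and then return, which is how the task gives
   control back. Returns nonzero if the task still has work left. */
static int TaskStep (TASK * task)
{
     if (task->remaining > 0)
     {
          task->remaining-- ;
          task->completed++ ;
     }
     return task->remaining > 0 ;
}

/* Round-robin over the tasks until none has work left. The scheduler
   regains control only when TaskStep returns; a task that never returned
   would stall every other task. Returns the number of passes made. */
int RunCooperatively (TASK * tasks, size_t count)
{
     int    passes = 0 ;
     int    busy ;
     size_t i ;

     do
     {
          busy = 0 ;
          for (i = 0 ; i < count ; i++)
               if (tasks[i].remaining > 0)
                    busy |= TaskStep (&tasks[i]) ;
          passes++ ;
     }
     while (busy) ;

     return passes ;
}
```

Calling RunCooperatively on three tasks with 3, 1, and 2 units of work finishes all of them in three passes, because every task hands control back after each unit.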
An operating system cannot implement multitasking without doing something about memory management. As new
programs are started up and old ones terminate, memory can become fragmented. The system must be able to
consolidate free memory space. This requires the system to move blocks of code and data in memory.
Even Windows 1.0, running on an 8088 microprocessor, was able to perform this type of memory management.
Under real-mode restrictions, this ability can only be regarded as an astonishing feat of software engineering. In
Windows 1.0, the 640-kilobyte (KB) memory limit of the PC's architecture was effectively stretched without
requiring any additional memory. But Microsoft didn't stop there: Windows 2.0 gave the Windows applications
access to expanded memory (EMS), and Windows 3.0 ran in protected mode to give Windows applications access
to up to 16 MB of extended memory. Windows NT and Windows 98 blow away these old limits by being
full-fledged 32-bit operating systems with flat memory space.
Programs running in Windows can share routines that are located in other files called "dynamic-link libraries."
Windows includes a mechanism to link the program with the routines in the dynamic-link libraries at run time.
Windows itself is basically a set of dynamic-link libraries.
Windows is a graphical interface, and Windows programs can make full use of graphics and formatted text on both
the video display and the printer. A graphical interface not only is more attractive in appearance but also can impart a
high level of information to the user.
Programs written for Windows do not directly access the hardware of graphics display devices such as the screen
and printer. Instead, Windows includes a graphics programming language (called the Graphics Device Interface, or
GDI) that allows the easy display of graphics and formatted text. Windows virtualizes display hardware. A program
written for Windows will run with any video board or any printer for which a Windows device driver is available. The
program does not need to determine what type of device is attached to the system.

Putting a device-independent graphics interface on the IBM PC was not an easy job for the developers of Windows.
The PC design was based on the principle of open architecture. Third-party hardware manufacturers were
encouraged to develop peripherals for the PC and have done so in great number. Although several standards have
emerged, conventional MS-DOS programs for the PC had to individually support many different hardware
configurations. It was fairly common for an MS-DOS word-processing program to be sold with one or two disks of
small files, each one supporting a particular printer. Windows programs do not require these drivers because the
support is part of Windows.
Dynamic Linking

Central to the workings of Windows is a concept known as "dynamic linking." Windows provides a wealth of
function calls that an application can take advantage of, mostly to implement its user interface and display text and
graphics on the video display. These functions are implemented in dynamic-link libraries, or DLLs. These are files
with the extension .DLL or sometimes .EXE, and they are mostly located in the \WINDOWS\SYSTEM
subdirectory under Windows 98 and the \WINNT\SYSTEM and \WINNT\SYSTEM32 subdirectories under
Windows NT.
In the early days, the great bulk of Windows was implemented in just three dynamic-link libraries. These represented
the three main subsystems of Windows, which were referred to as Kernel, User, and GDI. While the number of
subsystems has proliferated in recent versions of Windows, most function calls that a typical Windows program
makes will still fall in one of these three modules. Kernel (which is currently implemented by the 16-bit
KRNL386.EXE and the 32-bit KERNEL32.DLL) handles all the stuff that an operating system kernel traditionally
handles: memory management, file I/O, and tasking. User (implemented in the 16-bit USER.EXE and the 32-bit
USER32.DLL) refers to the user interface, and implements all the windowing logic. GDI (implemented in the 16-bit
GDI.EXE and the 32-bit GDI32.DLL) is the Graphics Device Interface, which allows a program to display text and
graphics on the screen and printer.
Windows 98 supports several thousand function calls that applications can use. Each function has a descriptive name,
such as CreateWindow. This function (as you might guess) creates a window for your program. All the Windows
functions that an application may use are declared in header files.
In your Windows program, you use the Windows function calls in generally the same way you use C library functions
such as strlen. The primary difference is that the machine code for C library functions is linked into your program
code, whereas the code for Windows functions is located outside of your program in the DLLs.
When you run a Windows program, it interfaces to Windows through a process called "dynamic linking." A
Windows .EXE file contains references to the various dynamic-link libraries it uses and the functions therein. When a
Windows program is loaded into memory, the calls in the program are resolved to point to the entries of the DLL
functions, which are also loaded into memory if not already there.
When you link a Windows program to produce an executable file, you must link with special "import libraries"
provided with your programming environment. These import libraries contain the dynamic-link library names and
reference information for all the Windows function calls. The linker uses this information to construct the table in the
.EXE file that Windows uses to resolve calls to Windows functions when loading the program.
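The idea of a table that is filled in at load time can be sketched with a toy model. Everything below is an assumption made for illustration: the EXPORT and IMPORT structures, the ResolveImports function, and the pretend library "TOY.DLL" are hypothetical and do not reflect the actual layout of a Windows .EXE file or the real loader. The sketch shows only the principle: the program records library and function names, and the addresses are looked up and stored before any call is made.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: every type and name below is hypothetical and does not
   reflect the actual Windows executable format or loader. */

typedef int (*FuncPtr) (int) ;

/* One exported function of a "library": a name and its entry point. */
typedef struct
{
     const char * name ;
     FuncPtr      entry ;
}
EXPORT ;

/* One import recorded in the "program": which library, which function,
   and a slot that the loader fills in with the resolved entry point. */
typedef struct
{
     const char * library ;
     const char * function ;
     FuncPtr      resolved ;
}
IMPORT ;

static int Twice (int n) { return 2 * n ; }
static int Negate (int n) { return -n ; }

/* The exports of a single pretend library named "TOY.DLL". */
static const EXPORT toyExports [] =
{
     { "Twice",  Twice  },
     { "Negate", Negate }
} ;

/* Look up each import by name and store its address in the import table.
   Returns the number of imports that could not be resolved. */
int ResolveImports (IMPORT * imports, size_t count)
{
     size_t i, j ;
     int    unresolved = 0 ;

     for (i = 0 ; i < count ; i++)
     {
          imports[i].resolved = NULL ;

          if (strcmp (imports[i].library, "TOY.DLL") != 0)
          {
               unresolved++ ;
               continue ;
          }
          for (j = 0 ; j < sizeof (toyExports) / sizeof (toyExports[0]) ; j++)
               if (strcmp (imports[i].function, toyExports[j].name) == 0)
                    imports[i].resolved = toyExports[j].entry ;

          if (imports[i].resolved == NULL)
               unresolved++ ;
     }
     return unresolved ;
}
```

After ResolveImports runs, the program calls through the resolved slot (for example, table[0].resolved (21)) without knowing where the library's code actually lives, which is the essence of what the import table makes possible.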
Windows Programming Options

To illustrate the various techniques of Windows programming, this book has lots of sample programs. These
programs are written in C and use the native Windows APIs. I think of this approach as "classical" Windows
programming. It is how we wrote programs for Windows 1.0 in 1985, and it remains a valid way of programming for
Windows today.
APIs and Memory Models

To a programmer, an operating system is defined by its API. An API encompasses all the function calls that an
application program can make of an operating system, as well as definitions of associated data types and structures.
In Windows, the API also implies a particular program architecture that we'll explore in the chapters ahead.
Generally, the Windows API has remained quite consistent since Windows 1.0. A Windows programmer with
experience in Windows 98 would find the source code for a Windows 1.0 program very familiar. One way the API
has changed has been in enhancements. Windows 1.0 supported fewer than 450 function calls; today there are
thousands.
The biggest change in the Windows API and its syntax came about during the switch from a 16-bit architecture to a
32-bit architecture. Versions 1.0 through 3.1 of Windows used the so-called segmented memory mode of the 16-bit
Intel 8086, 8088, and 286 microprocessors, a mode that was also supported for compatibility purposes in the 32-bit
Intel microprocessors beginning with the 386. The microprocessor register size in this mode was 16 bits, and hence
the C int data type was also 16 bits wide. In the segmented memory model, memory addresses were formed from
two components: a 16-bit segment pointer and a 16-bit offset pointer. From the programmer's perspective, this was
quite messy and involved differentiating between long, or far, pointers (which involved both a segment address and
an offset address) and short, or near, pointers (which involved an offset address with an assumed segment address).
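As a concrete illustration of the segment-and-offset scheme, here is a minimal sketch of how an 8086 in real mode combines the two 16-bit components into a 20-bit physical address: the segment is shifted left by four bits and the offset is added. The function name real_mode_address is hypothetical, chosen only for this example. In the protected mode of the 286, the segment value is instead a selector that refers to a descriptor table, which this sketch does not model.

```c
#include <stdint.h>

/* Hypothetical helper for illustration: the physical address that an 8086
   in real mode forms from a 16-bit segment and a 16-bit offset. */
uint32_t real_mode_address (uint16_t segment, uint16_t offset)
{
     /* Segment * 16 + offset, truncated to the 8086's 20 address lines. */
     return (((uint32_t) segment << 4) + offset) & 0xFFFFFu ;
}
```

For example, real_mode_address (0xB800, 0x0000) yields 0xB8000. The different pairs 0x1234:0x0010 and 0x1235:0x0000 both yield 0x12350, which hints at why comparing and doing arithmetic on far pointers was awkward.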
Beginning in Windows NT and Windows 95, Windows supported a 32-bit flat memory model using the 32-bit
modes of the Intel 386, 486, and Pentium processors. The C int data type was promoted to a 32-bit value.
Programs written for 32-bit versions of Windows use simple 32-bit pointer values that address a flat linear address
space.
The API for the 16-bit versions of Windows (Windows 1.0 through Windows 3.1) is now known as Win16. The
API for the 32-bit versions of Windows (Windows 95, Windows 98, and all versions of Windows NT) is now
known as Win32. Many function calls remained the same in the transition from Win16 to Win32, but some needed to
be enhanced. For example, graphics coordinate points changed from 16-bit values in Win16 to 32-bit values in
Win32. Also, some Win16 function calls returned a two-dimensional coordinate point packed in a 32-bit integer.
This was not possible in Win32, so new function calls were added that worked in a different way.
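To see why packed return values could not survive the move to 32-bit coordinates, consider this minimal sketch. The helper names pack_point16, packed_x, and packed_y are hypothetical and exist only for this example; they mimic the Win16 convention of placing the x-coordinate in the low word and the y-coordinate in the high word of a 32-bit value.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* Pack two 16-bit coordinates into one 32-bit value: x low, y high. */
uint32_t pack_point16 (int16_t x, int16_t y)
{
     return ((uint32_t) (uint16_t) y << 16) | (uint16_t) x ;
}

/* Recover the x-coordinate from the low word. */
int16_t packed_x (uint32_t packed)
{
     return (int16_t) (uint16_t) (packed & 0xFFFFu) ;
}

/* Recover the y-coordinate from the high word. */
int16_t packed_y (uint32_t packed)
{
     return (int16_t) (uint16_t) (packed >> 16) ;
}
```

Two 16-bit coordinates fill all 32 bits, so once each coordinate is itself 32 bits wide there is no room to return both in a single 32-bit integer. This is the situation behind Win32 functions such as MoveToEx, which reports the previous position through a pointer to a POINT structure rather than through the return value, as the Win16 MoveTo function did.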
All 32-bit versions of Windows support both the Win16 API to ensure compatibility with old applications and the
Win32 API to run new applications. Interestingly enough, this works differently in Windows NT than in Windows 95
and Windows 98. In Windows NT, Win16 function calls go through a translation layer and are converted to Win32
function calls that are then processed by the operating system. In Windows 95 and Windows 98, the process is
opposite that: Win32 function calls go through a translation layer and are converted to Win16 function calls to be
processed by the operating system.
At one time, there were two other Windows API sets (at least in name). Win32s ("s" for "subset") was an API that
allowed programmers to write 32-bit applications that ran under Windows 3.1. This API supported only 32-bit
versions of functions already supported by Win16. Also, the Windows 95 API was once called Win32c ("c" for
"compatibility"), but this term has been abandoned.
At this time, Windows NT and Windows 98 are both considered to support the Win32 API. However, each
operating system supports some features not supported by the other. Still, because the overlap is considerable, it's
possible to write programs that run under both systems. Also, it's widely assumed that the two products will be
merged at some time in the future.
Language Options

Using C and the native APIs is not the only way to write programs for Windows 98. However, this approach offers
you the best performance, the most power, and the greatest versatility in exploiting the features of Windows.
Executables are relatively small and don't require external libraries to run (except for the Windows DLLs themselves,
of course). Most importantly, becoming familiar with the API provides you with a deeper understanding of Windows
internals, regardless of how you eventually write applications for Windows.
Although I think that learning classical Windows programming is important for any Windows programmer, I don't
necessarily recommend using C and the API for every Windows application. Many programmers (particularly those
doing in-house corporate programming or those who do recreational programming at home) enjoy the ease of
development environments such as Microsoft Visual Basic or Borland Delphi (which incorporates an object-oriented
dialect of Pascal). These environments allow a programmer to focus on the user interface of an application and
associate code with user interface objects. To learn Visual Basic, you might want to consult some other Microsoft
Press books, such as Learn Visual Basic Now (1996), by Michael Halvorson.
Among professional programmers, particularly those who write commercial applications, Microsoft Visual C++ with
the Microsoft Foundation Class Library (MFC) has been a popular alternative in recent years. MFC encapsulates
many of the messier aspects of Windows programming in a collection of C++ classes. Jeff Prosise's Programming
Windows with MFC, Second Edition (Microsoft Press, 1999) provides tutorials on MFC.
Most recently, the popularity of the Internet and the World Wide Web has given a big boost to Sun Microsystems'
Java, the processor-independent language inspired by C++ and incorporating a toolkit for writing graphical
applications that will run on several operating system platforms. A good Microsoft Press book on Microsoft J++,
Microsoft's Java development tool, is Programming Visual J++ 6.0 (1998), by Stephen R. Davis.
Obviously, there's hardly any one right way to write applications for Windows. More than anything else, the nature of
the application itself should probably dictate the tools. But learning the Windows API gives you vital insights into the
workings of Windows that are essential regardless of what you end up using to actually do the coding. Windows is a
complex system. Putting a programming layer on top of the API doesn't eliminate the complexity; it merely hides it.
Sooner or later that complexity is going to jump out and bite you in the leg. Knowing the API gives you a better
chance at recovery.
Any software layer on top of the native Windows API necessarily restricts you to a subset of full functionality. You
might find, for example, that Visual Basic is ideal for your application except that it doesn't allow you to do one or
two essential chores. In that case, you'll have to use native API calls. The API defines the universe in which we as
Windows programmers exist. No approach can be more powerful or versatile than using this API directly.
MFC is particularly problematic. While it simplifies some jobs immensely (such as OLE), I often find myself wrestling
with other features (such as the Document/View architecture) to get them to work as I want. MFC has not been the
Windows programming panacea that many hoped for, and few people would characterize it as a model of good
object-oriented design. MFC programmers benefit greatly from understanding what's going on in class definitions
they use, and find themselves frequently consulting MFC source code. Understanding that source code is one of the
benefits of learning the Windows API.
The Programming Environment

In this book, I'll be assuming that you're running Microsoft Visual C++ 6.0, which comes in Standard, Professional,
and Enterprise editions. The less-expensive Standard edition is fine for doing the programs in this book. Visual C++
is also part of Visual Studio 6.0.
The Microsoft Visual C++ package includes more than the C compiler and other files and tools necessary to compile
and link Windows programs. It also includes the Visual C++ Developer Studio, an environment in which you can edit
your source code; interactively create resources such as icons and dialog boxes; and edit, compile, run, and debug
your programs.
If you're running Visual C++ 5.0, you might need to get updated header files and import libraries for Windows 98
and Windows NT 5.0. These are available at Microsoft's web site. Go to and
choose Downloads and then Platform SDK ("software development kit"). You'll be able to download and install the
updated files in directories of your choice. To direct the Microsoft Developer Studio to look in these directories,
choose Options from the Tools menu and then pick the Directories tab.
The msdn portion of the Microsoft URL above stands for Microsoft Developer Network. This is a program that
provides developers with frequently updated CD-ROMs containing much of what they need to be on the cutting
edge of Windows development. You'll probably want to investigate subscribing to MSDN and avoid frequent
downloading from Microsoft's web site.
API Documentation


This book is not a substitute for the official formal documentation of the Windows API. That documentation is no
longer published in printed form; it is available only via CD-ROM or the Internet.
When you install Visual C++ 6.0, you'll get an online help system that includes API documentation. You can get
updates to that documentation by subscribing to MSDN or by using Microsoft's Web-based online help system.
Start by linking to and select MSDN Library Online.
In Visual C++ 6.0, select the Contents item from the Help menu to invoke the MSDN window. The API
documentation is organized in a tree-structured hierarchy. Find the section labeled Platform SDK. All the
documentation I'll be citing in this book is from this section. I'll show the location of documentation using the nested
levels starting with Platform SDK separated by slashes. (I know the Platform SDK looks like a small obscure part of
the total wealth of MSDN knowledge, but I assure you that it's the essential core of Windows programming.) For
example, for documentation on how to use the mouse in your Windows programs, you can consult /Platform
SDK/User Interface Services/User Input/Mouse Input.
I mentioned before that much of Windows is divided into the Kernel, User, and GDI subsystems. The kernel
interfaces are in /Platform SDK/Windows Base Services, the user interface functions are in /Platform SDK/User
Interface Services, and GDI is documented in /Platform SDK/Graphics and Multimedia Services/GDI.
Your First Windows Program

Now it's time to do some coding. Let's begin by looking at a very short Windows program and, for comparison, a
short character-mode program. These will help us get oriented in using the development environment and going
through the mechanics of creating and compiling a program.
A Character-Mode Model

A favorite book among programmers is The C Programming Language (Prentice Hall, 1978 and 1988) by Brian
W. Kernighan and Dennis M. Ritchie, affectionately referred to as K&R. Chapter 1 of this book begins with a C
program that displays the words "hello, world."
Here's the program as it appeared on page 6 of the first edition of The C Programming Language:
main ()
{
     printf ("hello, world\n") ;
}

Yes, once upon a time C programmers used C run-time library functions such as printf without declaring them first.
But this is the '90s, and we like to give our compilers a fighting chance to flag errors in our code. Here's the revised
code from the second edition of K&R:
#include <stdio.h>

main ()
{
     printf ("hello, world\n") ;
}

This program still isn't really as small as it seems. It will certainly compile and run just fine, but many programmers
these days would prefer to explicitly indicate the return value of the main function, in which case ANSI C dictates
that the function actually returns a value:
#include <stdio.h>

int main ()
{
     printf ("hello, world\n") ;
     return 0 ;
}

We could make this even longer by including the arguments to main, but let's leave it at that with an include
statement, the program entry point, a call to a run-time library function, and a return statement.
The Windows Equivalent

The Windows equivalent to the "hello, world" program has exactly the same components as the character-mode
version. It has an include statement, a program entry point, a function call, and a return statement. Here's the
program:
/*
   HelloMsg.c -- Displays "Hello, Windows 98!" in a message box
                 (c) Charles Petzold, 1998
*/
#include <windows.h>

int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance,
                    PSTR szCmdLine, int iCmdShow)
{
     MessageBox (NULL, TEXT ("Hello, Windows 98!"), TEXT ("HelloMsg"), 0) ;
     return 0 ;
}

Before I begin dissecting this program, let's go through the mechanics of creating a program in the Visual C++
Developer Studio.
To begin, select New from the File menu. In the New dialog box, pick the Projects tab. Select Win32 Application.
In the Location field, select a subdirectory. In the Project Name field, type the name of the project, which in this case
is HelloMsg. This will be a subdirectory of the directory indicated in the Location field. The Create New Workspace
button should be checked. The Platforms section should indicate Win32. Choose OK.
A dialog box labeled Win32 Application - Step 1 Of 1 will appear. Indicate that you want to create an Empty
Project, and press the Finish button.
Select New from the File menu again. In the New dialog box, pick the Files tab. Select C++ Source File. The Add
To Project box should be checked, and HelloMsg should be indicated. Type HelloMsg.c in the File Name field.
Choose OK.
Now you can type in the HELLOMSG.C file shown above. Or you can select the Insert menu and the File As Text
option to copy the contents of HELLOMSG.C from the file on this book's companion CD-ROM.
Structurally, HELLOMSG.C is identical to the K&R "hello, world" program. The header file STDIO.H has been
replaced with WINDOWS.H, the entry point main has been replaced with WinMain, and the C run-time library
function printf has been replaced with the Windows API function MessageBox. However, there is much in the
program that is new, including several strange-looking uppercase identifiers.
Let's start at the top.
The Header Files

HELLOMSG.C begins with a preprocessor directive that you'll find at the top of virtually every Windows program
written in C:
#include <windows.h>

WINDOWS.H is a master include file that includes other Windows header files, some of which also include other
header files. The most important and most basic of these header files are:
• WINDEF.H: Basic type definitions.
• WINNT.H: Type definitions for Unicode support.
• WINBASE.H: Kernel functions.
• WINUSER.H: User interface functions.
• WINGDI.H: Graphics device interface functions.
These header files define all the Windows data types, function calls, data structures, and constant identifiers. They are
an important part of Windows documentation. You might find it convenient to use the Find In Files option from the
Edit menu in the Visual C++ Developer Studio to search through these header files. You can also open the header
files in the Developer Studio and examine them directly.
Program Entry Point

Just as the entry point to a C program is the function main, the entry point to a Windows program is WinMain,
which always appears like this:
int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance,
                    PSTR szCmdLine, int iCmdShow)

This entry point is documented in /Platform SDK/User Interface Services/Windowing/Windows/Window
Reference/Window Functions. It is declared in WINBASE.H like so (line breaks and all):

int
WINAPI
WinMain(
    HINSTANCE hInstance,
    HINSTANCE hPrevInstance,
    LPSTR lpCmdLine,
    int nShowCmd
    );

You'll notice I've made a couple of minor changes in HELLOMSG.C. The third parameter is defined as an LPSTR in
WINBASE.H, and I've made it a PSTR. These two data types are both defined in WINNT.H as pointers to
character strings. The LP prefix stands for "long pointer" and is an artifact of 16-bit Windows.
I've also changed two of the parameter names from the WinMain declaration; many Windows programs use a
system called "Hungarian notation" for naming variables. This system involves prefacing the variable name with a short
prefix that indicates the variable's data type. I'll discuss this concept more in Chapter 3. For now, just keep in mind
that the prefix i stands for int and sz stands for "string terminated with a zero."
The WinMain function is declared as returning an int. The WINAPI identifier is defined in WINDEF.H with the
statement:
#define WINAPI __stdcall

This statement specifies a calling convention that involves how machine code is generated to place function call
arguments on the stack. Most Windows function calls are declared as WINAPI.
The first parameter to WinMain is something called an "instance handle." In Windows programming, a handle is
simply a number that an application uses to identify something. In this case, the handle uniquely identifies the program.
It is required as an argument to some other Windows function calls. In early versions of Windows, when you ran the
same program concurrently more than once, you created multiple instances of that program. All instances of the
same application shared code and read-only memory (usually resources such as menu and dialog box templates). A
program could determine if other instances of itself were running by checking the hPrevInstance parameter. It could
then skip certain chores and move some data from the previous instance into its own data area.
In the 32-bit versions of Windows, this concept has been abandoned. The second parameter to WinMain is always
NULL (defined as 0).
The third parameter to WinMain is the command line used to run the program. Some Windows applications use this
to load a file into memory when the program is started. The fourth parameter to WinMain indicates how the program
should be initially displayed: either normally, maximized to fill the window, or minimized to be displayed in the task
list bar. We'll see how this parameter is used in Chapter 3.
The MessageBox Function

The MessageBox function is designed to display short messages. The little window that MessageBox displays is
actually considered to be a dialog box, although not one with a lot of versatility.
The first argument to MessageBox is normally a window handle. We'll see what this means in Chapter 3. The second
argument is the text string that appears in the body of the message box, and the third argument is the text string that
appears in the caption bar of the message box. In HELLOMSG.C, each of these text strings is enclosed in a TEXT
macro. You don't normally have to enclose all character strings in the TEXT macro, but it's a good idea if you want
to be ready to convert your programs to the Unicode character set. I'll discuss this in much more detail in Chapter 2.
The fourth argument to MessageBox can be a combination of constants beginning with the prefix MB_ that are
defined in WINUSER.H. You can pick one constant from the first set to indicate what buttons you wish to appear in
the dialog box:
#define MB_OK 0x00000000L
#define MB_OKCANCEL 0x00000001L
#define MB_ABORTRETRYIGNORE 0x00000002L
#define MB_YESNOCANCEL 0x00000003L
#define MB_YESNO 0x00000004L
#define MB_RETRYCANCEL 0x00000005L

When you set the fourth argument to 0 in HELLOMSG, only the OK button appears. You can use the C OR (|)
operator to combine one of the constants shown above with a constant that indicates which of the buttons is the
default:
#define MB_DEFBUTTON1 0x00000000L
#define MB_DEFBUTTON2 0x00000100L
#define MB_DEFBUTTON3 0x00000200L
#define MB_DEFBUTTON4 0x00000300L

You can also use a constant that indicates the appearance of an icon in the message box:
#define MB_ICONHAND 0x00000010L
#define MB_ICONQUESTION 0x00000020L
#define MB_ICONEXCLAMATION 0x00000030L
#define MB_ICONASTERISK 0x00000040L

Some of these icons have alternate names:
#define MB_ICONWARNING MB_ICONEXCLAMATION
#define MB_ICONERROR MB_ICONHAND
#define MB_ICONINFORMATION MB_ICONASTERISK
#define MB_ICONSTOP MB_ICONHAND

There are a few other MB_ constants, but you can consult the header file yourself or the documentation in /Platform
SDK/User Interface Services/Windowing/Dialog Boxes/Dialog Box Reference/Dialog Box Functions.
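To see how these groups of constants fit together, here is a minimal sketch in plain C. It copies a few of the values listed above into local macros so that it compiles without WINDOWS.H. The mask values mirror MB_TYPEMASK, MB_ICONMASK, and MB_DEFMASK from WINUSER.H, but the DEMO_ names and the helper functions are illustrative assumptions of mine, not part of the Windows API.

```c
/* Local copies of a few WINUSER.H values so the sketch builds without
   WINDOWS.H. The DEMO_ prefix marks them as illustrative stand-ins. */
#define DEMO_MB_OK             0x00000000L
#define DEMO_MB_YESNOCANCEL    0x00000003L
#define DEMO_MB_ICONQUESTION   0x00000020L
#define DEMO_MB_DEFBUTTON2     0x00000100L

#define DEMO_MB_TYPEMASK       0x0000000FL   /* button-set bits     */
#define DEMO_MB_ICONMASK       0x000000F0L   /* icon bits           */
#define DEMO_MB_DEFMASK        0x00000F00L   /* default-button bits */

/* Combine one button set, one icon, and one default button with OR. */
long demo_make_style (long buttons, long icon, long defbutton)
{
     return buttons | icon | defbutton ;
}

/* Each group occupies its own hex digit, so a mask recovers it. */
long demo_button_set (long style)
{
     return style & DEMO_MB_TYPEMASK ;
}

long demo_icon (long style)
{
     return style & DEMO_MB_ICONMASK ;
}

long demo_default_button (long style)
{
     return style & DEMO_MB_DEFMASK ;
}
```

In a real Windows program you would simply write the combination as the fourth argument, for example MB_YESNOCANCEL | MB_ICONQUESTION | MB_DEFBUTTON2. Because each group uses a separate hex digit, the constants never collide.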
In this program, the MessageBox function returns the value 1, but it's more proper to say that it returns IDOK, which
is defined in WINUSER.H as equaling 1. Depending on the other buttons present in the message box, the
MessageBox function can also return IDYES, IDNO, IDCANCEL, IDABORT, IDRETRY, or IDIGNORE.
Is this little Windows program really the equivalent of the K&R "hello, world" program? Well, you might think not
because the MessageBox function doesn't really have all the potential formatting power of the printf function in
"hello, world." But we'll see in the next chapter how to write a version of MessageBox that does printf-like
formatting.
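As a rough preview of the idea, the formatting half of such a function can be sketched in standard C. The helper below only builds the text in a caller-supplied buffer; the name format_message and its error convention are assumptions for illustration, and a Windows version would go on to pass the buffer to MessageBox.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats printf-style arguments into buf.
   Returns the number of characters written (not counting the
   terminating 0), or -1 on error or if the text did not fit. */
int format_message (char * buf, size_t size, const char * format, ...)
{
     va_list args ;
     int     n ;

     va_start (args, format) ;
     n = vsnprintf (buf, size, format, args) ;
     va_end (args) ;

     if (n < 0 || (size_t) n >= size)
          return -1 ;

     return n ;
}
```

The variable argument list is handed to vsnprintf unchanged, which is what lets a wrapper accept the same format strings as printf.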
Compile, Link, and Run

When you're ready to compile HELLOMSG, you can select Build Hellomsg.exe from the Build menu, or press F7,
or select the Build icon from the Build toolbar. (The appearance of this icon is shown in the Build menu. If the Build
toolbar is not currently displayed, you can choose Customize from the Tools menu and select the Toolbars tab. Pick
Build or Build MiniBar.)
Alternatively, you can select Execute Hellomsg.exe from the Build menu, or press Ctrl+F5, or click the Execute
Program icon (which looks like a red exclamation point) from the Build toolbar. You'll get a message box asking you
if you want to build the program.
As normal, during the compile stage, the compiler generates an .OBJ (object) file from the C source code file. During
the link stage, the linker combines the .OBJ file with .LIB (library) files to create the .EXE (executable) file. You can
see a list of these library files by selecting Settings from the Project menu and clicking the Link tab. In particular, you'll
notice KERNEL32.LIB, USER32.LIB, and GDI32.LIB. These are "import libraries" for the three major Windows
subsystems. They contain the dynamic-link library names and reference information that is bound into the .EXE file.
Windows uses this information to resolve calls from the program to functions in the KERNEL32.DLL,
USER32.DLL, and GDI32.DLL dynamic-link libraries.
In the Visual C++ Developer Studio, you can compile and link the program in different configurations. By default,
these are called Debug and Release. The executable files are stored in subdirectories of these names. In the Debug
configuration, information is added to the .EXE file that assists in debugging the program and in tracing through the
program source code.
If you prefer working on the command line, the companion CD-ROM contains .MAK (make) files for all the sample
programs. (You can tell the Developer Studio to generate make files by choosing Options from the Tools menu and
selecting the Build tab. There's a check box to check.) You'll need to run VCVARS32.BAT located in the BIN
subdirectory of the Developer Studio to set environment variables. To execute the make file from the command line,
change to the HELLOMSG directory and execute:
NMAKE /f HelloMsg.mak CFG="HelloMsg - Win32 Debug"

or
NMAKE /f HelloMsg.mak CFG="HelloMsg - Win32 Release"

You can then run the .EXE file from the command line by typing:

DEBUG\HELLOMSG

or
RELEASE\HELLOMSG

I have made one change to the default Debug configuration in the project files on the companion CD-ROM for this
book. In the Project Settings dialog box, after selecting the C/C++ tab, in the Preprocessor Definitions field I have
defined the identifier UNICODE. I'll have much more to say about this in the next chapter.
Chapter 2

An Introduction to Unicode

In the first chapter, I promised to elaborate on any aspects of C that you might not have encountered in conventional
character-mode programming but that play a part in Microsoft Windows. The subject of wide-character sets and
Unicode almost certainly qualifies in that respect.
Very simply, Unicode is an extension of ASCII character encoding. Rather than the 7 bits used to represent each
character in strict ASCII, or the 8 bits per character that have become common on computers, Unicode uses a full
16 bits for character encoding. This allows Unicode to represent all the letters, ideographs, and other symbols used in
all the written languages of the world that are likely to be used in computer communication. Unicode is intended
initially to supplement ASCII and, with any luck, eventually replace it. Considering that ASCII is one of the most
dominant standards in computing, this is certainly a tall order.
Unicode impacts every part of the computer industry, but perhaps most profoundly operating systems and
programming languages. In this respect, we are almost halfway there. Windows NT supports Unicode from the
ground up. (Unfortunately, Windows 98 includes only a small amount of Unicode support.) The C programming
language as formalized by ANSI inherently supports Unicode through its support of wide characters, which I'll
discuss in detail below.
Of course, as usual, we as programmers are confronted with much of the dirty work. I've tried to ease the load by
making all of the programs in this book "Unicode-ready." What this means exactly will become more apparent as I
discuss Unicode in this chapter.
A Brief History of Character Sets

It is uncertain when human beings began speaking, but writing seems to be about six thousand years old. Early writing
was pictographic in nature. Alphabets in which individual letters correspond to spoken sounds came about just three
thousand years ago. Although the various written languages of the world served fine for some time, several
nineteenth-century inventors saw a need for something more. When Samuel F. B. Morse developed the telegraph
between 1838 and 1854, he also devised a code to use with it. Each letter in the alphabet corresponded to a series
of short and long pulses (dots and dashes). There was no distinction between uppercase and lowercase letters, but
numbers and punctuation marks had their own codes.
Morse code was not the first instance of written language being represented by something other than drawn or
printed glyphs. Between 1821 and 1824, the young Louis Braille was inspired by a military system for writing and
reading messages at night to develop a code for embossing raised dots into paper for reading by the blind. Braille is
essentially a 6-bit code that encodes letters, common letter combinations, common words, and punctuation. A
special escape code indicates that the following letter code is to be interpreted as uppercase. A special shift code
allows subsequent letter codes to be interpreted as numbers.
Telex codes, including Baudot (named after a French engineer who died in 1903) and a code known as CCITT #2
(standardized in 1931), were 5-bit codes that included letter shifts and figure shifts.
American Standards

Early computer character codes evolved from the coding used on Hollerith ("do not fold, spindle, or mutilate") cards,
invented by Herman Hollerith and first used in the 1890 United States census. A 6-bit character code known as
BCDIC ("Binary-Coded Decimal Interchange Code") based on Hollerith coding was progressively extended to the
8-bit EBCDIC in the 1960s and remains the standard on IBM mainframes but nowhere else.
The American Standard Code for Information Interchange (ASCII) had its origins in the late 1950s and was finalized
in 1967. During the development of ASCII, there was considerable debate over whether the code should be 6, 7, or
8 bits wide. Reliability considerations seemed to mandate that no shift character be used, so ASCII couldn't be a
6-bit code. Cost ruled out the 8-bit version. (Bits were very expensive back then.) The final code had 26 lowercase
letters, 26 uppercase letters, 10 digits, 32 symbols, 33 control codes, and a space, for a total of 128 codes. ASCII
is currently documented in ANSI X3.4-1986, "Coded Character Sets - 7-Bit American National Standard Code for
Information Interchange (7-Bit ASCII)," published by the American National Standards Institute. Figure 2-1 shows
ASCII (for the zillionth time), very similar to how it appears in the ANSI document.
      0-   1-   2-   3-   4-   5-   6-   7-
-0    NUL  DLE  SP   0    @    P    `    p
-1    SOH  DC1  !    1    A    Q    a    q
-2    STX  DC2  "    2    B    R    b    r
-3    ETX  DC3  #    3    C    S    c    s
-4    EOT  DC4  $    4    D    T    d    t
-5    ENQ  NAK  %    5    E    U    e    u
-6    ACK  SYN  &    6    F    V    f    v
-7    BEL  ETB  '    7    G    W    g    w
-8    BS   CAN  (    8    H    X    h    x
-9    HT   EM   )    9    I    Y    i    y
-A    LF   SUB  *    :    J    Z    j    z
-B    VT   ESC  +    ;    K    [    k    {
-C    FF   FS   ,    <    L    \    l    |
-D    CR   GS   -    =    M    ]    m    }
-E    SO   RS   .    >    N    ^    n    ~
-F    SI   US   /    ?    O    _    o    DEL

Figure 2-1. The ASCII character set.
There are a lot of good things you can say about ASCII. The 26 letter codes are contiguous, for example. (This is
not the case with EBCDIC.) Uppercase letters can be converted to lowercase and back by flipping one bit. The
codes for the 10 digits are easily derived from the value of the digits. (In BCDIC, the code for the character "0"
followed the code for the character "9"!)
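These properties are easy to demonstrate in a few lines of C. The following is a small sketch that assumes the compiler's character set is ASCII; the function names are mine and are used only for illustration.

```c
/* Letters occupy two contiguous runs: 0x41-0x5A and 0x61-0x7A. */
int ascii_is_letter (int c)
{
     return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ;
}

/* Upper- and lowercase codes differ only in bit 5 (0x20), so
   flipping that one bit converts a letter to the other case. */
int ascii_flip_case (int c)
{
     return ascii_is_letter (c) ? (c ^ 0x20) : c ;
}

/* The digit codes 0x30-0x39 are contiguous and in numeric order,
   so subtracting the code for '0' yields the digit's value. */
int ascii_digit_value (int c)
{
     return (c >= '0' && c <= '9') ? (c - '0') : -1 ;
}
```

For example, 'A' is 0x41 and 'a' is 0x61; the two differ only by 0x20.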
Best of all, ASCII is a very dependable standard. No other standard is as prevalent or as ingrained in our keyboards,
video displays, system hardware, printers, font files, operating systems, and the Internet.
The World Beyond

The big problem with ASCII is indicated by the first word of the acronym. ASCII is truly an American standard, and
it isn't even good enough for other countries where English is spoken. Where is the British pound symbol (£), for
instance?
English uses the Latin (or Roman) alphabet. Among written languages that use the Latin alphabet, English is unusual in
that very few words require letters with accent marks (or "diacritics"). Even for those English words where diacritics
are traditionally proper, such as coöperate or résumé, the spellings without diacritics are perfectly acceptable.
But north and south of the United States and across the Atlantic are many countries and languages where diacritics
are much more common. These accent marks originally aided in adapting the Latin alphabet to the differences in
spoken sounds among these languages. Journey farther east or south of Western Europe, and you'll encounter
languages that don't use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and Russian (which uses the
Cyrillic alphabet). And if you travel even farther east, you'll discover the ideographic Han characters of Chinese,
which were also adopted in Japan and Korea.
The history of ASCII since 1967 is mostly a history of attempts to overcome its limitations and make it more
applicable to languages other than American English. In 1967, for example, the International Standards Organization
(ISO) recommended a variant of ASCII with codes 0x40, 0x5B, 0x5C, 0x5D, 0x7B, 0x7C, and 0x7D "reserved
for national use" and codes 0x5E, 0x60, and 0x7E labeled as "may be used for other graphical symbols when it is
necessary to have 8, 9, or 10 positions for national use." This is obviously not the best solution to internationalization
because there's no guarantee of consistency. But it indicates how desperate people were to successfully code
symbols necessary to various languages.
Extending ASCII

By the time the early small computers were being developed, the 8-bit byte had been firmly established. Thus, if a
byte were used to store characters, 128 additional characters could be invented to supplement ASCII. When the
original IBM PC was introduced in 1981, the video adapters included a ROM-based character set of 256
characters, which in itself was to become an important part of the IBM standard.
The original IBM extended character set included some accented characters and a lowercase Greek alphabet (useful
for mathematics notation), as well as some block-drawing and line-drawing characters. Additional characters were
also assigned to the code positions of the ASCII control characters, because the bulk of these control characters
were not required.
This IBM extended character set was burned into countless ROMs on video boards and in printers, and it was used
by numerous applications to decorate their character-mode displays. However, this character set did not include
enough accented letters for all Western European languages that used the Latin alphabet, and it was not quite
appropriate for Windows. Windows didn't need line-drawing characters because it had an entire graphics system.
In Windows 1.0 (released in November 1985), Microsoft didn't entirely abandon the IBM extended character set,
but it was relegated to secondary importance. The native Windows character set was called the "ANSI character
set" because it was based on a draft ANSI and ISO standard, which eventually became ANSI/ISO 8859-1-1987,
"American National Standard for Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets - Part 1:
Latin Alphabet No 1." This is also known more simply as "Latin 1."
The original version of the ANSI character set as printed in the Windows 1.0 Programmer's Reference is shown in
Figure 2-2.
     0-  1-  2-  3-  4-  5-  6-  7-  8-  9-  A-  B-  C-  D-  E-  F-
-0   *   *   SP  0   @   P   `   p   *   *       °   À   Ð   à   ð
-1   *   *   !   1   A   Q   a   q   *   *   ¡   ±   Á   Ñ   á   ñ
-2   *   *   "   2   B   R   b   r   *   *   ¢   ²   Â   Ò   â   ò
-3   *   *   #   3   C   S   c   s   *   *   £   ³   Ã   Ó   ã   ó
-4   *   *   $   4   D   T   d   t   *   *   ¤   ´   Ä   Ô   ä   ô
-5   *   *   %   5   E   U   e   u   *   *   ¥   µ   Å   Õ   å   õ
-6   *   *   &   6   F   V   f   v   *   *   ¦   ¶   Æ   Ö   æ   ö
-7   *   *   '   7   G   W   g   w   *   *   §   ·   Ç   □   ç   □
-8   *   *   (   8   H   X   h   x   *   *   ¨   ¸   È   Ø   è   ø
-9   *   *   )   9   I   Y   i   y   *   *   ©   ¹   É   Ù   é   ù
-A   *   *   *   :   J   Z   j   z   *   *   ª   º   Ê   Ú   ê   ú
-B   *   *   +   ;   K   [   k   {   *   *   «   »   Ë   Û   ë   û
-C   *   *   ,   <   L   \   l   |   *   *   ¬   ¼   Ì   Ü   ì   ü
-D   *   *   -   =   M   ]   m   }   *   *   -   ½   Í   Ý   í   ý
-E   *   *   .   >   N   ^   n   ~   *   *   ®   ¾   Î   Þ   î   þ
-F   *   *   /   ?   O   _   o   DEL *   *   ¯   ¿   Ï   ß   ï   ÿ
* - not applicable
Figure 2-2. The Windows ANSI character set (based on ANSI/ISO 8859-1).
The hollow rectangles indicate codes for which characters are not defined. This is close to how ANSI/ISO 8859-1
was ultimately defined. ANSI/ISO 8859-1 shows only graphic characters, not control characters, so it does not
define the DEL. In addition, code 0xA0 is defined as a nonbreaking space (which means that it's a space that
shouldn't be used to break a line when formatting), and code 0xAD is a soft hyphen (which means that it shouldn't be
displayed unless it's used to break a word at the end of a line). Also, ANSI/ISO 8859-1 defines code 0xD7 as a
multiplication sign (×) and code 0xF7 as a division sign (÷). Some fonts in Windows also define some of the characters from
0x80 through 0x9F, but these are not part of the ANSI/ISO 8859-1 standard.
MS-DOS 3.3 (released in April 1987) introduced the concept of code pages to IBM PC users, a concept that was
also carried over to Windows. A code page defines a mapping of character codes to characters. The original IBM
character set became known as code page 437, or "MS-DOS Latin US." Code page 850 is "MS-DOS Latin 1,"
which replaces some of the line-drawing characters with additional accented letters (but which is not the Latin 1
ISO/ANSI standard shown in Figure 2-2 above). Other code pages were defined for other languages. The lower
128 codes are always the same; the higher 128 codes depend on the language for which the code page is defined.
Under MS-DOS, if a user sets the PC's keyboard, video display, and printer to a specific code page and then
creates, edits, and prints documents on the PC, all will be well. Everything's consistent. However, if the user attempts
to exchange documents with another user using a different code page or to change the code page on the machine,
problems will result. Character codes are associated with the wrong characters. Applications can save code page
information with documents in an attempt to reduce problems, but this strategy involves some work in converting
between code pages.
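The mismatch is easy to show in code. The sketch below decodes a single byte under two different mappings and reports the result as a Unicode code point (Unicode is discussed later in this chapter). It covers only three entries of code page 437, taken from the published table for that code page; the names demo_decode, DEMO_CP437, and DEMO_LATIN1 are illustrative assumptions, and a return value of 0 means only that the byte is outside this tiny sample.

```c
enum demo_codepage { DEMO_CP437, DEMO_LATIN1 } ;

/* Map one byte to a Unicode code point under the given mapping. */
unsigned int demo_decode (unsigned char byte, enum demo_codepage cp)
{
     if (byte < 0x80)
          return byte ;                    /* lower 128 codes are the same */

     if (cp == DEMO_LATIN1)                /* Latin 1 graphic characters   */
          return byte >= 0xA0 ? byte : 0 ;

     switch (byte)                         /* a few code page 437 entries  */
     {
     case 0x82 : return 0x00E9 ;           /* e with acute accent          */
     case 0x9C : return 0x00A3 ;           /* pound sign                   */
     case 0xE9 : return 0x0398 ;           /* Greek capital theta          */
     default   : return 0 ;                /* not covered by this sketch   */
     }
}
```

The byte 0xE9 is an accented e in Latin 1 but a Greek theta in code page 437, while the accented e of code page 437 lives at 0x82. A document saved under one mapping and read under the other shows the wrong characters.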
Although code pages originally provided only additional characters of the Latin alphabet beyond the unaccented
characters, eventually code pages were devised where the higher 128 characters contained complete non-Latin
alphabets, such as Hebrew, Greek, and Cyrillic. Such variety makes code page mix-ups potentially worse, of course;
it's one thing if a few accented letters appear incorrect and quite another if an entire text is an incomprehensible
jumble.

Code pages proliferated beyond all reason. Just to keep everyone on their toes, the MS-DOS code page 855 for
Cyrillic is not the same as either the Windows code page 1251 for Cyrillic or the Macintosh code page 10007 for
Cyrillic. Code pages in each environment are modifications of the standard character set for the environment. IBM
OS/2 also supports a variety of EBCDIC code pages.
But wait. It gets worse.
Double-Byte Character Sets

So far we've been looking at character sets of 256 characters. But the ideographic symbols of Chinese, Japanese,
and Korean number about 21,000. How can these languages be accommodated while still maintaining some kind of
compatibility with ASCII?
The solution (if that's the right word for it) is the double-byte character set (DBCS). A DBCS starts off with 256
codes, just like ASCII. Like any well-behaved code page, the first 128 of these codes are ASCII. However, some
of the codes in the higher 128 are always followed by a second byte. The two bytes together (called a lead byte and
a trail byte) define a single character, usually a complex ideograph.
Although Chinese, Japanese, and Korean share many of the same ideographs, obviously the languages are different
and often the same ideograph in the three different languages will represent three different things. Windows supports
four different double-byte character sets: code page 932 (Japanese), 936 (Simplified Chinese), 949 (Korean), and
950 (Traditional Chinese). DBCS is supported in only the versions of Windows that are manufactured for these
countries.
The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that
some characters (in particular, the ASCII characters) are represented by 1 byte. This creates odd programming
problems. For example, the number of characters in a character string cannot be determined by the byte size of the
string. The string has to be parsed to determine its length, and each byte has to be examined to see if it's the lead byte
of a 2-byte character. If you have a pointer to a character somewhere in the middle of a DBCS string, what is the
address of the previous character in the string? The customary solution is to parse the string starting at the beginning
up to the pointer!
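Here is a minimal sketch of both chores in standard C. It assumes the lead-byte ranges of code page 932 (0x81 through 0x9F and 0xE0 through 0xFC); other double-byte code pages use different ranges, and a Windows program would normally ask the system rather than hard-code them. All the demo_ names are hypothetical.

```c
#include <stddef.h>

/* Assumed lead-byte ranges for code page 932 (Japanese). */
int demo_is_lead_byte (unsigned char b)
{
     return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC) ;
}

/* Count characters (not bytes) in a zero-terminated DBCS string. */
size_t demo_dbcs_length (const unsigned char * s)
{
     size_t count = 0 ;

     while (*s)
     {
          if (demo_is_lead_byte (*s) && s [1] != 0)
               s += 2 ;                  /* lead byte plus trail byte */
          else
               s += 1 ;                  /* single-byte character     */
          count++ ;
     }
     return count ;
}

/* Find the character before p by walking forward from the start.
   Returns start itself if p is already the first character. */
const unsigned char * demo_dbcs_prev (const unsigned char * start,
                                      const unsigned char * p)
{
     const unsigned char * prev = start ;
     const unsigned char * cur  = start ;

     while (cur < p && *cur)
     {
          prev = cur ;
          cur += (demo_is_lead_byte (*cur) && cur [1] != 0) ? 2 : 1 ;
     }
     return prev ;
}
```

Consider the bytes 0x41, 0x83, 0x41, 0x42: the string is 4 bytes long but holds only 3 characters, because 0x83 is a lead byte. Its trail byte, 0x41, has the same value as the letter A, which is why stepping backward one byte at a time can't be trusted.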
Unicode to the Rescue


The basic problem we have here is that the world's written languages simply cannot be represented by 256 8-bit
codes. The previous solutions involving code pages and DBCS have proven insufficient and awkward. What's the
real solution?
As programmers, we have experience with problems of this sort. If there are too many things to be represented by
8-bit values, we try wider values, perhaps 16-bit values. (Duh.) And that's the ridiculously simple concept behind
Unicode. Rather than the confusion of multiple 256-character code mappings or double-byte character sets that have
some 1-byte codes and some 2-byte codes, Unicode is a uniform 16-bit system, thus allowing the representation of
65,536 characters. This is sufficient for all the characters and ideographs in all the written languages of the world,
including a bunch of math, symbol, and dingbat collections.
Understanding the difference between Unicode and DBCS is essential. Unicode is said to use (particularly in the
context of the C programming language) "wide characters." Each character in Unicode is 16 bits wide rather than
8 bits wide. Eight-bit values have no meaning in Unicode. In contrast, in a double-byte character set we're still
dealing with 8-bit values. Some bytes define characters by themselves, and some bytes indicate that another byte is
necessary to completely define a character.
Whereas working with DBCS strings is quite messy, working with Unicode text is much like working with regular
text. You'll probably be pleased to learn that the first 128 Unicode characters (16-bit codes 0x0000 through
0x007F) are ASCII, while the second 128 Unicode characters (codes 0x0080 through 0x00FF) are the ISO
8859-1 extensions to ASCII. Various blocks of characters within Unicode are similarly based on existing standards.
This is to ease conversion. The Greek alphabet uses codes 0x0370 through 0x03FF, Cyrillic uses codes 0x0400
through 0x04FF, Armenian uses codes 0x0530 through 0x058F, and Hebrew uses codes 0x0590 through 0x05FF.
The ideographs of Chinese, Japanese, and Korean (referred to collectively as CJK) occupy codes 0x3000 through
0x9FFF.
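Because every character is a single 16-bit code, finding out which block a character belongs to is nothing more than a range check. The sketch below uses only the ranges quoted in the preceding paragraph; the function name is an assumption for illustration, and any code outside those ranges is simply reported as "other."

```c
/* Classify a 16-bit code using the block ranges quoted in the text. */
const char * demo_unicode_block (unsigned int code)
{
     if (code <= 0x007F)                    return "ASCII" ;
     if (code <= 0x00FF)                    return "ISO 8859-1 extensions" ;
     if (code >= 0x0370 && code <= 0x03FF)  return "Greek" ;
     if (code >= 0x0400 && code <= 0x04FF)  return "Cyrillic" ;
     if (code >= 0x0530 && code <= 0x058F)  return "Armenian" ;
     if (code >= 0x0590 && code <= 0x05FF)  return "Hebrew" ;
     if (code >= 0x3000 && code <= 0x9FFF)  return "CJK" ;

     return "other" ;                       /* not listed in this sketch */
}
```

No lead bytes, no code page, and no parsing from the start of the string are needed; the code alone identifies the character.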
The best thing about Unicode is that there's only one character set. There's simply no ambiguity. Unicode came about
through the cooperation of virtually every important company in the personal computer industry and is code-for-code
identical with the ISO 10646-1 standard. The essential reference for Unicode is The Unicode Standard, Version
2.0 (Addison-Wesley, 1996), an extraordinary book that reveals the richness and diversity of the world's written
languages in a way that few other documents have. In addition, the book provides the rationale and details behind the
development of Unicode.
Are there any drawbacks to Unicode? Sure. Unicode character strings occupy twice as much memory as ASCII
strings. (File compression helps a lot to reduce the disk space differential, however.) But perhaps the worst
drawback is that Unicode remains relatively unused just yet. As programmers, we have our work cut out for us.
Wide Characters and C

To a C programmer, the whole idea of 16-bit characters can certainly provoke uneasy chills. That a char is the same
width as a byte is one of the very few certainties of this life. Few programmers are aware that ANSI/ISO
9899-1990, the "American National Standard for Programming Languages C" (also known as "ANSI C") supports
character sets that require more than one byte per character through a concept called "wide characters." These wide
characters coexist nicely with normal and familiar characters.
ANSI C also supports multibyte character sets, such as those supported by the Chinese, Japanese, and Korean
versions of Windows. However, these multibyte character sets are treated as strings of single-byte values in which
some characters alter the meaning of successive characters. Multibyte character sets mostly impact the C run-time
library functions. In contrast, wide characters are uniformly wider than normal characters and involve some compiler
issues.
Wide characters aren't necessarily Unicode. Unicode is one possible wide-character encoding. However, because
the focus in this book is Windows rather than an abstract implementation of C, I will tend to speak of wide
characters and Unicode synonymously.
The char Data Type

Presumably, we are all quite familiar with defining and storing characters and character strings in our C programs by
using the char data type. But to facilitate an understanding of how C handles wide characters, let's first review normal
character definition as it might appear in a Win32 program.
The following statement defines and initializes a variable containing a single character:
char c = 'A' ;

The variable c requires 1 byte of storage and will be initialized with the hexadecimal value 0x41, which is the ASCII
code for the letter A.
You can define a pointer to a character string like so:
char * p ;


Because Windows is a 32-bit operating system, the pointer variable p requires 4 bytes of storage. You can also
initialize a pointer to a character string:
char * p = "Hello!" ;

The variable p still requires 4 bytes of storage as before. The character string is stored in static memory and uses 7
bytes of storage: the 6 bytes of the string plus a terminating 0.
You can also define an array of characters, like this:
char a[10] ;

In this case, the compiler reserves 10 bytes of storage for the array. The expression sizeof (a) will return 10. If the
array is global (that is, defined outside any function), you can initialize it with a statement like
this:
char a[] = "Hello!" ;

If you define this array as a local variable to a function, you can define it as a static variable, as follows (compilers that predate ANSI C require this):
static char a[] = "Hello!" ;

In either case, the string is stored in static program memory with a 0 appended at the end, thus requiring 7 bytes of
storage.
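The storage figures in this section can be checked with sizeof and strlen. The following is a minimal sketch; CHARSIZES and MeasureCharStorage are hypothetical names introduced only for this illustration. The pointer size is reported rather than assumed, because it is 4 bytes only when the program is compiled for a 32-bit target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical structure collecting the sizes discussed in the text. */
typedef struct
{
     size_t cbPointer ;   /* bytes used by a char pointer           */
     size_t cbArray ;     /* bytes used by char a[10]               */
     size_t cbString ;    /* bytes used by "Hello!" plus its zero   */
     size_t cchString ;   /* characters in "Hello!", excluding zero */
}
CHARSIZES ;

CHARSIZES MeasureCharStorage (void)
{
     char *      p = "Hello!" ;
     char        a[10] ;
     static char b[] = "Hello!" ;
     CHARSIZES   cs ;

     cs.cbPointer = sizeof (p) ;    /* 4 on a 32-bit target        */
     cs.cbArray   = sizeof (a) ;    /* always 10                   */
     cs.cbString  = sizeof (b) ;    /* 6 characters + terminator   */
     cs.cchString = strlen (b) ;    /* terminator is not counted   */

     return cs ;
}
```

Note that sizeof applied to the array b counts the terminating 0, whereas strlen does not, which is why the two results differ by one.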
Wider Characters

Nothing about Unicode or wide characters alters the meaning of the char data type in C. The char continues to
indicate 1 byte of storage, and sizeof (char) continues to return 1. In theory, a byte in C can be greater than 8 bits,
but for most of us, a byte (and hence a char) is 8 bits wide.
Wide characters in C are based on the wchar_t data type, which is defined in several header files, including
WCHAR.H, like so:
typedef unsigned short wchar_t ;


Thus, the wchar_t data type is the same as an unsigned short integer: 16 bits wide.
To define a variable containing a single wide character, use the following statement:
wchar_t c = 'A' ;

The variable c is the two-byte value 0x0041, which is the Unicode representation of the letter A. (However, because
Intel microprocessors store multibyte values with the least-significant bytes first, the bytes are actually stored in
memory in the sequence 0x41, 0x00. Keep this in mind if you examine memory storage of Unicode text.)
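To see the byte order for yourself, you can copy the bytes of a 16-bit code into a byte array and inspect them. This is a sketch under stated assumptions: CodeBytes and IsLittleEndian are hypothetical helper names, and unsigned short is assumed to be 2 bytes wide, as it is with the typedef shown above.

```c
#include <string.h>

/* Hypothetical helper: copies the two bytes of a 16-bit character
   code into bytes[], in the order they are stored in memory.
   Assumes unsigned short occupies 2 bytes. */
void CodeBytes (unsigned short code, unsigned char bytes[2])
{
     memcpy (bytes, &code, 2) ;
}

/* Returns nonzero if the least-significant byte is stored first,
   as on Intel microprocessors. */
int IsLittleEndian (void)
{
     unsigned short probe = 0x0001 ;
     unsigned char  first ;

     memcpy (&first, &probe, 1) ;

     return first == 0x01 ;
}
```

On an Intel machine, CodeBytes (0x0041, bytes) leaves 0x41 in bytes[0] and 0x00 in bytes[1]. On a processor that stores the most-significant byte first, the two values are swapped.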
You can also define an initialized pointer to a wide-character string:
wchar_t * p = L"Hello!" ;

Notice the capital L (for long) immediately preceding the first quotation mark. This indicates to the compiler that the
string is to be stored with wide characters, that is, with every character occupying 2 bytes. The pointer variable p
requires 4 bytes of storage, as usual, but the character string requires 14 bytes: 2 bytes for each character plus 2
bytes of zeros at the end.
Similarly, you can define an array of wide characters this way:
static wchar_t a[] = L"Hello!" ;

The string again requires 14 bytes of storage, and sizeof (a) will return 14. You can index the a array to get at the
individual characters. The value a[1] is the wide character 'e', or 0x0065.
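The array arithmetic can be verified without hard-coding the width of wchar_t. In this sketch, the three function names are hypothetical. The 14-byte figure holds where wchar_t is 16 bits wide, as it is under Windows; some other compilers define wchar_t as 32 bits, in which case the same array occupies 28 bytes.

```c
#include <stddef.h>
#include <wchar.h>

static wchar_t a[] = L"Hello!" ;

/* Total bytes in the array: 7 elements times sizeof (wchar_t),
   which is 14 when wchar_t is 2 bytes wide. */
size_t WideArrayBytes (void)
{
     return sizeof (a) ;
}

/* Number of elements, including the terminating zero: always 7. */
size_t WideArrayChars (void)
{
     return sizeof (a) / sizeof (a[0]) ;
}

/* Returns the wide character at index i (caller keeps i below 7). */
wchar_t WideCharAt (size_t i)
{
     return a[i] ;
}
```

Dividing sizeof (a) by sizeof (a[0]) is the usual way to get an element count that stays correct regardless of the character width.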
Although it looks more like a typo than anything else, that L preceding the first quotation mark is very important, and
there must not be space between the two symbols. Only with that L will the compiler know you want the string to be
stored with 2 bytes per character. Later on, when we look at wide-character strings in places other than variable
definitions, you'll encounter the L preceding the first quotation mark again. Fortunately, the C compiler will often give
you a warning or error message if you forget to include the L.
You can also use the L prefix in front of single character literals, as shown here, to indicate that they should be
interpreted as wide characters.
wchar_t c = L'A' ;


But it's usually not necessary. The C compiler will zero-extend the character anyway.
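A quick way to convince yourself of this is to compare the two forms. The sketch below uses a hypothetical function name and assumes a compiler whose ordinary and wide character sets agree on the letter A, which is the case for ASCII and Unicode.

```c
#include <wchar.h>

/* Hypothetical check: returns nonzero if a wide character initialized
   from 'A' equals one initialized from L'A'. */
int SameWideValue (void)
{
     wchar_t c1 = 'A' ;     /* char-sized literal, widened on assignment */
     wchar_t c2 = L'A' ;    /* wide-character literal                    */

     return c1 == c2 ;
}
```

The same shortcut does not apply to strings: without the L prefix, a string literal is stored with 1 byte per character.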
Wide-Character Library Functions

We all know how to find the length of a string. For example, if we have defined a pointer to a character string like so:
char * pc = "Hello!" ;

we can call
iLength = strlen (pc) ;

The variable iLength will be set equal to 6, the number of characters in the string.
Excellent! Now let's try defining a pointer to a string of wide characters:
wchar_t * pw = L"Hello!" ;

And now we call strlen again:
iLength = strlen (pw) ;