The New C Standard- P9

6.4.2.1 General
Coding Guidelines
The visual similarity of these letters is discussed elsewhere (see 792 character visual similarity).
795
There is no specific limit on the maximum length of an identifier.
Commentary
The standard does specify a minimum limit on the number of characters a translator must consider as
significant. Implementations are free to ignore characters once this limit is reached. The ignored characters
do not form part of another token. It is as if they did not appear in the source at all (see 282 internal
identifier significant characters and 283 external identifier significant characters).
C90
The C90 Standard does not explicitly state this fact.
Other Languages
Few languages place limits on the maximum length of an identifier that can appear in a source file. Like C,
some specify a lower limit on the number of characters that must be considered significant.
Coding Guidelines
Using a large number of characters in an identifier spelling has many potential benefits; for instance, it
provides the opportunity to supply a lot of information to readers, or to reduce dependencies on existing
reader knowledge by spelling words in full rather than using abbreviations. There are also potential costs;
for instance, long identifiers can cause visual layout problems in the source (requiring new-lines within an
expression in an attempt to keep the maximum line length within the bounds that can be viewed within a
fixed-width window), or increase the cognitive effort needed to visually scan source containing them.
The length of an identifier is not itself directly a coding guideline issue. However, length is indirectly
involved in many identifier memorability, confusability, and usability issues, which are discussed elsewhere
(see 792 identifier syntax).
Usage
The distribution of identifier lengths is given in Figure 792.7.
796
Each universal character name in an identifier shall designate a character whose encoding in ISO/IEC 10646
falls into one of the ranges specified in annex D.60)
Commentary
Using other UCNs results in undefined behavior (in some cases even using these UCNs can be a constraint
violation). These character encodings could be thought of as representing letters in the specified national
character set (see 816 UCNs not basic character set).
C90
Support for universal character names is new in C99.
Other Languages
The ISO/IEC 10646 standard is relatively new and languages are only just starting to include support for the
characters it specifies (see 28 ISO 10646). Java specifies a similar list of UCNs.
Common Implementations
A collating sequence may not be defined for these universal character names. In practice the lack of a defined
collating sequence is not an implementation problem, because a translator only ever needs to compare the
spelling of one identifier for equality with another identifier, which involves a simple character-by-character
comparison (the issue of the ordering of diacritics is handled by not allowing them to occur in an identifier).
Support for this functionality is new and the extent to which implementations are likely to check that
UCN values fall within the list given in annex D is not known.
June 24, 2009 v 1.2
Coding Guidelines
The intended purpose for supporting universal character names in identifiers is to reduce the developer effort
needed to comprehend source. Identifiers spelled in the developer’s native tongue are more immediately
recognizable (because of greater practice with those characters) and also have semantic associations that are
more readily brought to mind.
The ISO 10646 Standard does not specify which languages contain the characters it specifies (although it
does give names to some sets of characters that correspond to a language that contains them; see 28 ISO 10646).
The written forms of some human languages share common characters; for instance, the characters a through z
(and their uppercase forms) appear in many European orthographies (see 792 orthography). The following
discussion refers to using UCNs from more than one human language. This is to be taken to mean using UCNs
that are not part of the written
form of the native language of the developer (the case of developers having more than one native language
is not considered). For instance, the character a is used in both Swedish and German; the character å is
used in Swedish, but not German; the character ß is used in German but not Swedish. Both Swedish and
German developers would be familiar with the character a, but the character ß would be considered foreign
to a Swedish developer, and the character å foreign to the German.
Some coding guideline documents recommend against the use of UCNs. Their use within identifiers
can increase the portability cost of the source. The use of UCNs is an economic issue; the potential cost
of not permitting their use in identifiers needs to be compared against the potential portability benefits.
(Alternatively, the benefits of using UCNs could be compared against the possible portability costs.)
Given the purpose of using UCNs, is there any rationale for identifiers to contain characters from more
than one human language? As an English speaker, your author can imagine a developer wanting to use
an English word, or its common abbreviation, as a prefix or suffix to an identifier name. Perhaps an Urdu
speaker can imagine a similar usage with Urdu words. The issue is whether the use of characters in the same
identifier from different human languages has meaning to the developers who write and maintain the source.
Identifiers very rarely occur in isolation. Should all the identifiers in the same function, or even source
file, only contain UCNs that form the set of characters used by a single human language? Using characters
from different human languages when it is possible to use only characters from a single language, potentially
increases the cost of maintenance. Future maintainers are either going to have to be familiar with the
orthography and semantics of the two human languages used or spend additional time processing instances of
identifiers containing characters they are not familiar with. However, in some cases it might not be possible
to enforce a single human language rule. For instance, a third-party library may contain callable functions
whose spellings use characters from a human language different from that used in the source code that
contains calls to it.
Support for the use of UCNs in identifiers is new in C99 (and other computer languages) and at the time
of this writing there is almost no practical experience available on the sort of mistakes that developers make
with them.
797
The initial character shall not be a universal character name designating a digit.
Commentary
The terminal identifier-nondigit that appears in the syntax implies that the possible UCNs exclude the
digit characters (see 792 identifier syntax). Also the list given in annex D does not include the digit
characters. This means that an identifier containing a UCN designating a digit in any position results in
undefined behavior.
The syntax for constants does not support the use of UCNs (see 822 constant syntax). This sentence, in the
standard, reminds implementors that such usage could be supported in the future and that, while they may
support UCN digits within an identifier, it would not be a good idea to support them as the initial character.
Table 797.1: The Unicode digit encodings.

Encoding Range   Language                Encoding Range   Language
0030–0039        ISO Latin-1             0BE7–0BEF        Tamil (has no zero)
0660–0669        Arabic–Indic            0C66–0C6F        Telugu
06F0–06F9        Eastern Arabic–Indic    0CE6–0CEF        Kannada
0966–096F        Devanagari              0D66–0D6F        Malayalam
09E6–09EF        Bengali                 0E50–0E59        Thai
0A66–0A6F        Gurmukhi                0ED0–0ED9        Lao
0AE6–0AEF        Gujarati                FF10–FF19        Fullwidth digits
0B66–0B6F        Oriya
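
The ranges in Table 797.1 lend themselves to a simple lookup. The following is only a sketch of how a translator or checking tool might recognize a UCN that designates a digit; the names ucn_digit_ranges and is_ucn_digit are hypothetical, and the ranges are copied from the table rather than taken from annex D.

```c
#include <stdbool.h>
#include <stddef.h>

/* Digit ranges copied from Table 797.1 (illustrative helper, not part of the standard). */
static const struct { unsigned long first, last; } ucn_digit_ranges[] = {
    {0x0030, 0x0039}, {0x0660, 0x0669}, {0x06F0, 0x06F9}, {0x0966, 0x096F},
    {0x09E6, 0x09EF}, {0x0A66, 0x0A6F}, {0x0AE6, 0x0AEF}, {0x0B66, 0x0B6F},
    {0x0BE7, 0x0BEF}, {0x0C66, 0x0C6F}, {0x0CE6, 0x0CEF}, {0x0D66, 0x0D6F},
    {0x0E50, 0x0E59}, {0x0ED0, 0x0ED9}, {0xFF10, 0xFF19}
};

/* Returns true if code_point falls inside one of the digit ranges above. */
bool is_ucn_digit(unsigned long code_point)
{
    size_t count = sizeof ucn_digit_ranges / sizeof ucn_digit_ranges[0];

    for (size_t i = 0; i < count; i++)
        if (code_point >= ucn_digit_ranges[i].first &&
            code_point <= ucn_digit_ranges[i].last)
            return true;
    return false;
}
```

A checker built on this would reject an identifier whose first character makes is_ucn_digit return true, for instance 0x0AE6 (the Gujarati digit zero).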
C++
This requirement is implied by the terminal non-name used in the C++ syntax. Annex E of the C++ Standard
does not list any UCN digits in the list of supported UCN encodings.
Other Languages
Java has a similar requirement.
Coding Guidelines
The extent to which different cultural conventions support the use of a digit as the first character in an
identifier is not known to your author. At some future date the Committee may choose to support the writing
of integer constants using UCNs. If this happens, any identifiers that start with a UCN designating a digit
are liable to result in syntax violations. There does not appear to be a worthwhile benefit in a guideline
recommendation dealing with the case of an identifier beginning with a UCN designating a digit.

Example
    int \u1f00\u0ae6;   /* UCN digit \u0ae6 in a non-initial position */
    int \u0ae6;         /* UCN digit as the initial character */
798
An implementation may allow multibyte characters that are not part of the basic source character set to appear
in identifiers;
Commentary
Prior to C99 there was no standardized method of representing nonbasic source character set characters
in the source code. Support for multibyte characters in string literals and constants was specified in C90;
some implementations extended this usage to cover identifiers. They are now officially sanctioned to do this.
Support for the ISO 10646 Standard is new in C99 (see 28 ISO 10646). However, there are a number of existing
implementations that use a multibyte encoding scheme and this usage is likely to continue for many years.
The C committee recognized the importance of this usage and does not force developers to go down a UCN-only path.
The standard says nothing about the behavior of the __func__ reserved identifier in the case when a
function name is spelled using wide characters (see 810 __func__).
C90
This permission is new in C99.
C++
The C++ Standard does not explicitly contain this permission. However, translation phase 1 performs an
implementation-defined mapping of the source file characters (see 116 translation phase 1), and an
implementation may choose to support multibyte characters in identifiers via this route.
Other Languages
While other language standards may not mention multibyte characters, the problem they address is faced by
implementations of those languages. For this reason, it is to be expected that some implementations of other
languages will contain some form of support for multibyte characters.
Coding Guidelines
UCNs may be the preferred, C Standard, way of representing nonbasic character set characters in identifiers.
However, developers are at the mercy of editor support for how they enter and view characters that are not in
the basic source character set (see 815 universal character name syntax).
799
which characters and their correspondence to universal character names is implementation-defined.
Commentary
Various national bodies have defined standards for representing their national character sets in computer files.
While ISO 10646 is intended to provide a unified standard for all characters, it may be some time before
existing software is converted to use it (see 28 ISO 10646).
Common Implementations
It is common to find translators aimed at the Japanese market supporting JIS, shift-JIS, and EUC encodings
(see Table 243.3). These encodings use different numeric values than those given in ISO 10646 to represent
the same national character.
800
When preprocessing tokens are converted to tokens during translation phase 7, if a preprocessing token could
be converted to either a keyword or an identifier, it is converted to a keyword.
Commentary
The Committee could have created a separate name space for keywords and allowed developers to define
identifiers having the same spelling as a keyword. The complexity added to a translator by such a specification
would be significant (based on implementation experience for languages that support this functionality),
while a developer’s inability to define identifiers having these spellings was considered a relatively small
inconvenience.
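
The rule that a keyword spelling always wins can be pictured as a table lookup performed when a preprocessing token is converted to a token. This is a minimal sketch under stated assumptions, not a description of any particular translator: the names token_kind, keywords, and classify_pp_identifier are hypothetical, and only a handful of the C99 keywords are listed.

```c
#include <stddef.h>
#include <string.h>

enum token_kind { TOKEN_KEYWORD, TOKEN_IDENTIFIER };

/* A few C99 keywords; a real table would list all of them. */
static const char *const keywords[] = {
    "if", "int", "inline", "restrict", "return", "while", "_Bool"
};

/* Convert an identifier-like preprocessing token to a token kind:
 * if the spelling matches a keyword it is a keyword, never an identifier. */
enum token_kind classify_pp_identifier(const char *spelling)
{
    size_t count = sizeof keywords / sizeof keywords[0];

    for (size_t i = 0; i < count; i++)
        if (strcmp(spelling, keywords[i]) == 0)
            return TOKEN_KEYWORD;
    return TOKEN_IDENTIFIER;
}
```

Because the decision depends only on the spelling, no context is needed to disambiguate, which is the simplification the Committee chose over a separate keyword name space.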
C90
This wording is a simplification of the convoluted logic needed in the C90 Standard to deduce from a
constraint what C99 now says in semantics. The removal of this C90 constraint is not a change of behavior,
since it was not possible to write a program that violated it.
C90 6.1.2
Constraints
In translation phase 7 and 8, an identifier shall not consist of the same sequence of characters as a keyword.
Other Languages
Some languages allow keywords to be used as variable names (e.g., PL/1), using the context to disambiguate
intended use.
801
60) On systems in which linkers cannot accept extended characters, an encoding of the universal character
name may be used in forming valid external identifiers.
Commentary
This is really an implementation tip for translators. The standard defines behavior in terms of an abstract
machine that produces external output. The tip given in this footnote does not affect the conformance status
of an implementation that chooses to implement this functionality in another way. The only time such a
mapping might be visible is through the use of a symbolic execution-time debugging tool, or by having to
link against object files created by other translators.
C90
Extended characters were not available in C90, so the suggestion in this footnote does not apply (see 215
extended characters).
Other Languages
Issues involving third-party linkers are common to most language implementations that compile to machine
code. Some languages, for instance Java, define the characteristics of an implementation at translation
and execution time. The Java language specification goes to the extreme (compared to other languages) of
specifying the format of the generated object code file.
Common Implementations
There is a long-standing convention of prefixing externally visible identifier names with an underscore
character when information on them is written out to an object file. There is little experience available on
implementation issues involving UCNs, but many existing linkers do assume that identifiers are encoded
using 8-bit characters.
Coding Guidelines
The encoding of external identifiers only needs to be considered when interfacing to, or from code written in
another language. Cross-language interfacing is outside the scope of these coding guidelines.
802
For example, some otherwise unused character or sequence of characters may be used to encode the \u in a
universal character name.
Commentary
Some linkers may not support an occurrence of the backslash (\) character in an identifier name. One solution
to this problem is to create names that cannot be declared in the source code by the developer; for instance,
by deleting the \ characters and prefixing the name with a digit character.
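
The scheme just described (delete the backslashes, prefix a digit) could be sketched as follows. This is an illustration only; the function name encode_external_name and the choice of 9 as the prefix digit are assumptions, and no existing linker or translator is known to use this exact encoding.

```c
#include <stddef.h>
#include <string.h>

/*
 * Write an encoded form of name into out (capacity out_size, including the
 * terminating null). Names containing a backslash have every backslash
 * removed and a digit prefixed, giving a spelling that cannot be declared
 * in C source. Returns 0 on success, -1 if out is too small.
 */
int encode_external_name(const char *name, char *out, size_t out_size)
{
    size_t n = 0;

    if (strchr(name, '\\') != NULL) {
        if (out_size < 2)
            return -1;
        out[n++] = '9';            /* arbitrary digit prefix (assumption) */
    }
    for (; *name != '\0'; name++) {
        if (*name == '\\')
            continue;              /* drop the backslash of each UCN */
        if (n + 1 >= out_size)
            return -1;
        out[n++] = *name;
    }
    if (n >= out_size)
        return -1;
    out[n] = '\0';
    return 0;
}
```

With this sketch the source spelling \u30CE becomes 9u30CE, while a name without UCNs is passed through unchanged.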
Common Implementations
There are no standards for encoding of universal character names in object files. The requirement to support
this form of encoding is too new for it to be possible to say anything about common encodings.
803
Extended characters may produce a long external identifier.
Commentary
Here the word long does not have any special meaning. It simply suggests an identifier containing many
characters (see 282 internal identifier significant characters).
Implementation limits
804
As discussed in 5.2.4.1, an implementation may limit the number of significant initial characters in an identifier;
Commentary
This subclause lists a number of minimum translation limits (see 276 translation limits).
C90
The C90 Standard does not contain this observation.
C++
2.10p1
All characters are significant.20)
C identifiers that differ after the last significant character will cause a diagnostic to be generated by a C++
translator.
Annex B contains an informative list of possible implementation limits. However, “. . . these quantities
are only guidelines and do not determine compliance.”
805
the limit for an external name (an identifier that has external linkage) may be more restrictive than that for an
internal name (a macro name or an identifier that does not have external linkage).
Commentary
External identifiers have to be processed by a linker, which may not be under the control of the vendor of
the C implementation (see 283 external identifier significant characters). In theory, any tool that performs
the linking process falls within the remit of the C Committee. However, the Committee recognized that, in
practice, it is not always possible for translator vendors to supply their own linker. The limitations of
existing linkers needed to be factored into the limits specified in the standard.
Internal identifiers only need to be processed by the translator and the standard is in a strong position to
specify the behavior (see 282 internal identifier significant characters).
Other Languages

Most other language implementations face similar problems with linkers as C does. However, not all language
specifications explicitly deal with the issue (by specifying the behavior). The Java standard defines a complete
environment that handles all external linkages.
Coding Guidelines
What are the costs associated with a change to the linkage of an identifier during program maintenance, from
internal linkage to external linkage? (Experience shows that identifier linkage is rarely changed from external
to internal.)
In most cases implementations support a sufficiently large number of significant characters in an external
name that a change of identifier linkage makes no difference to its significant characters (i.e., the number
of characters it contains falls inside the implementation limit; see 283 external identifier significant
characters and 792 identifier number of characters). In those cases where a change of identifier
linkage results in some of its significant characters being ignored, the effect may be benign (there is no other
identifier defined with external linkage whose name is the same as the truncated name) or may result in undefined
behavior (the program defines two identifiers with external linkage with the same name; see 1818 external
linkage exactly one external definition).
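
Whether a linkage change is benign comes down to whether two distinct spellings become the same name once the external limit is applied. A minimal sketch of that check follows; the function name collide_when_truncated is hypothetical, and the limit is passed in as a parameter because it is implementation-defined.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * True if a and b are different spellings that become indistinguishable
 * when only the first `significant` characters are considered.
 */
bool collide_when_truncated(const char *a, const char *b, size_t significant)
{
    return strcmp(a, b) != 0 && strncmp(a, b, significant) == 0;
}
```

For instance, total_count_a and total_count_b collide under the C90 minimum of 6 significant characters for external identifiers, but not under the C99 minimum of 31.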
806
The number of significant characters in an identifier is implementation-defined.
Commentary

Subject to the minimum requirements specified in the standard (see 282 internal identifier significant
characters).
C++
2.10p1
All characters are significant.20)
References to the same C identifier, which differs after the last significant character, will cause a diagnostic
to be generated by a C++ translator.
There is also an informative annex which states:
Annex Bp2
Number of initial characters in an internal identifier or a macro name [1024]
Number of initial characters in an external identifier [1024]
Other Languages
Some languages require all characters in an identifier to be significant (e.g., Java, Snobol 4), while others
don’t (e.g., Cobol, Fortran).
Common Implementations
It is rare to find an implementation that does not meet the minimum limits specified in the standard. A
few translators treat all characters in an identifier as significant. Most have a limit of between 256 and 2,000 significant
characters. The POSIX standard requires that any language that binds to its API needs to support 14
significant characters in an external identifier.

Coding Guidelines
While the C90 minimum limits for the number of significant characters in an identifier might be considered
unacceptable by many developers, the C99 limits are sufficiently generous that few developers are likely to
complain.
Automatically generated C source sometimes relies on a large number of significant characters in an
identifier. This can occur because of the desire to simplify the implementation of the generator. Character
sequences at different offsets within an identifier might be reserved for different purposes. A predefined default
character sequence is used to pad the identifier spelling where necessary.
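
To make the generator scenario concrete, here is a sketch of a name builder that reserves fixed offsets for different fields and pads short fields with a default character. The layout (a g_ prefix, two 8-character fields, a 6-digit serial number), the padding character, and the names fill_field and make_generated_name are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

#define FIELD_WIDTH 8   /* assumed width of each fixed-offset field */

/* Copy src into a FIELD_WIDTH-character field, truncating or padding with '_'. */
static void fill_field(char field[FIELD_WIDTH + 1], const char *src)
{
    size_t len = strlen(src);

    if (len > FIELD_WIDTH)
        len = FIELD_WIDTH;
    memset(field, '_', FIELD_WIDTH);
    memcpy(field, src, len);
    field[FIELD_WIDTH] = '\0';
}

/* Build "g_<module><kind><serial>"; returns 0 on success, -1 if out is too small. */
int make_generated_name(char *out, size_t out_size,
                        const char *module, const char *kind, unsigned serial)
{
    char m[FIELD_WIDTH + 1], k[FIELD_WIDTH + 1];
    int n;

    fill_field(m, module);
    fill_field(k, kind);
    n = snprintf(out, out_size, "g_%s%s%06u", m, k, serial % 1000000u);
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Two names produced for the same module and kind differ only in their final six characters, so every one of the 24 characters has to be significant for them to remain distinct.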
As the following example shows, it is possible for a program’s behavior to change, both when the number
of significant characters is increased and when it is decreased.
    /*
     * Yes, C99 does specify 63 significant characters in an internal
     * identifier. But to keep this example within the page width
     * we have taken some liberties.
     */

    extern float _________1_________2_________3___bb;

    void f(void)
    {
        int _________1_________2_________3___ba;

        /*
         * If there are 34 significant characters, the following operand
         * will resolve to the locally declared object.
         *
         * If there are 35 significant characters, the following operand
         * will resolve to the globally declared object.
         */
        _________1_________2_________3___bb++;
    }

    void g(void)
    {
        int _________1_________2_________3___aa;

        /*
         * If there are 34 significant characters, the following operand
         * will resolve to the globally declared object.
         *
         * If there are 33 significant characters, the following operand
         * will resolve to the locally declared object.
         */
        _________1_________2_________3___bb++;
    }
The following issues need to be addressed:
• All references to the same identifier should use the same character sequence; that is, all characters are
intended to be significant. References to the same identifiers that differ in nonsignificant characters
need to be treated as faults.
• Within how many significant characters should different identifiers differ? Should identifiers be
required to differ within the minimum number of significant characters specified by the standard, or
can a greater number of characters be considered significant?
Readers do not always carefully check all characters in the spelling of an identifier. The contribution made by
characters occurring in different parts of an identifier will depend on the pattern of eye movements employed
by readers, which in turn may be affected by their reasons for reading the source, plus cultural factors (e.g.,
direction in which they read text in their native language, or the significance of word endings in their native
language). Characters occurring at both ends of an identifier are used by readers (at least native English- and
French-speaking ones) when quickly scanning text (see 770 reading kinds of, 792 identifiers Greek readers,
and 770 word reading individual).

[Figure 806.1 plot: % identical matches (logarithmic axis, 0.001 to 100) against number of significant
characters (6 to 50), with one series each for gcc, idsoftware, linux, and mozilla.]
Figure 806.1: Occurrence of unique identifiers whose significant characters match those of a different identifier
(as a percentage of all unique identifiers in a program), for various numbers of significant characters. Based on
the visible form of the .c files.

Cg 806.1
When performing similarity checks on identifiers, all characters shall be considered significant.
807
Any identifiers that differ in a significant character are different identifiers.
Commentary
In many cases different identifiers also denote different entities. In some cases they denote the same entity
(e.g., two different typedef names that are synonyms for the type int).
Other Languages

This statement is common to all languages (but it does not always mean that they necessarily denote different
entities).
Coding Guidelines
Identifiers that differ in a single significant character may be considered to be
• different identifiers by a translator, but considered to be the same identifier by some readers of the
source (because they fail to notice the difference).
• the same identifiers by a translator (because the difference occurs in a nonsignificant character), but
considered to be different identifiers by some readers of the source (because they treat all characters as
being significant).
• identifiers by both a translator and some readers of the source.
The possible reasons for readers making mistakes are discussed elsewhere (see 0 developer errors), as are the
guideline recommendations for reducing the probability that these developer mistakes become program faults
(see 792 identifier filtering spellings).
Example
    extern int e1;    /* ends in the digit one */
    extern long el;   /* ends in the lowercase letter l */
    extern int a_longer_more_meaningful_name;
    extern int a_longer_more_meeningful_name;
    extern int a_meaningful_more_longer_name;
808

If two identifiers differ only in nonsignificant characters, the behavior is undefined.
Commentary
While the obvious implementation strategy is to ignore the nonsignificant characters, the standard does not
require implementations to use this strategy. To speed up identifier lookup many implementations use a
hashed symbol table; the hash value for each identifier is computed from the sequence of characters it
contains. Computing this hash value as the characters are read in, to form an identifier, saves a second pass
over those same characters later. If nonsignificant characters were included in the original computed hash
value, a subsequent occurrence of that identifier in the source, differing in nonsignificant characters, would
result in a different hash value being calculated and a strong likelihood that the hash table lookup would fail.
Developers generally expect implementations to ignore nonsignificant characters. An implementation that
behaved differently because identifiers differed in nonsignificant characters might not be regarded as being
very user friendly. Highlighting misspellings that occur in nonsignificant characters is not always seen in a
positive light by some developers.
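
The hashing point can be illustrated with a hash that is computed as characters are read but stops accumulating at the significance limit. This is a sketch only: the function name hash_identifier and the multiplicative hash with constant 33 are assumptions chosen for brevity, not something the standard or any particular translator prescribes.

```c
#include <stddef.h>

/*
 * Hash an identifier spelling, folding in at most `significant` characters.
 * Characters past the limit do not affect the result, so spellings that
 * differ only in nonsignificant characters hash to the same value.
 */
unsigned long hash_identifier(const char *spelling, size_t significant)
{
    unsigned long h = 5381;

    for (size_t i = 0; spelling[i] != '\0' && i < significant; i++)
        h = h * 33 + (unsigned char)spelling[i];
    return h;
}
```

Had the loop continued past the limit, two spellings that the translator is allowed to treat as the same identifier would usually land in different hash chains, which is the lookup failure described above.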
C++
In C++ all characters are significant, thus this statement does not apply in C++.
Other Languages
Some languages specify that nonsignificant characters are ignored and have no effect on the program, while
others are silent on the subject.
Common Implementations
Most implementations simply ignore nonsignificant characters. They play no part in identifier lookup in
symbol tables.
Coding Guidelines
The coding guideline issues relating to the number of characters in an identifier that should be considered
significant are discussed elsewhere (see 792 identifier guideline significant characters).
809
Forward references: universal character names (6.4.3), macro replacement (6.10.3).
6.4.2.2 Predefined identifiers
Semantics
810
The identifier __func__ shall be implicitly declared by the translator as if, immediately following the opening
brace of each function definition, the declaration
    static const char __func__[] = "function-name";
appeared, where function-name is the name of the lexically-enclosing function.61)
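
Two consequences of this as-if declaration can be checked directly: __func__ is an array whose size is the length of the function name plus one, and its static storage duration means its address stays valid after the function returns. The function names below are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>
#include <string.h>

/* The array has one element per character of the name plus the terminating null. */
bool size_matches_name(void)
{
    return sizeof __func__ == strlen("size_matches_name") + 1;
}

/* Returning the address is valid because the array has static storage duration. */
const char *name_after_return(void)
{
    return __func__;
}
```

Because __func__ is an array object and not a string literal, writing "In " __func__ is a syntax error; the name has to be passed as an argument (e.g., to a %s conversion) instead.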
Commentary
Implicitly declaring __func__ immediately after the opening brace in a function definition means that
the first, developer-written declaration within that function can access it. Giving __func__ static storage
duration enables its address to be referred to outside the lifetime of the function that contains it (e.g., enabling
a call history to be displayed at some later stage of program execution). This is not a storage overhead
because space needs to be allocated for the string literal denoted by __func__. The const qualifier ensures
that any attempts to modify the value cause undefined behavior. The identifier __func__ has an array type,
and is not a string literal, so the string concatenation that occurs in translation phase 6 is not applicable
(see 135 translation phase 6).
This identifier is useful for providing execution trace information during program testing. Developers who
make use of UCNs may need to ensure that the library they use supports the character output required by
them:
    #include <stdio.h>

    void \u30CE(void)
    {
        printf("Just entered %s\n", __func__);
    }
The issue of wide characters in identifiers is discussed elsewhere (see 798 identifier multibyte character in).
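
The call history mentioned in the commentary can be kept by storing only pointers, since each __func__ array outlives the call that recorded it. The following is a sketch under stated assumptions: the names history, record_call, TRACE_ENTRY, was_called_recently, open_file, and read_block, and the buffer size of 4, are all invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HISTORY_SIZE 4   /* assumed number of calls remembered */

static const char *history[HISTORY_SIZE];
static size_t history_count;

/* Remember a function name; keeping just the pointer is safe because
 * __func__ has static storage duration. Old entries are overwritten. */
void record_call(const char *name)
{
    history[history_count % HISTORY_SIZE] = name;
    history_count++;
}

#define TRACE_ENTRY() record_call(__func__)

/* True if name is among the most recent HISTORY_SIZE recorded calls. */
bool was_called_recently(const char *name)
{
    size_t n = history_count < HISTORY_SIZE ? history_count : HISTORY_SIZE;

    for (size_t i = 0; i < n; i++)
        if (strcmp(history[i], name) == 0)
            return true;
    return false;
}

void open_file(void)  { TRACE_ENTRY(); }
void read_block(void) { TRACE_ENTRY(); }
```

After calls to open_file and read_block, the history can be inspected at any later point in execution, long after both functions have returned.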
Which function name is used when a function definition contains the inline function specifier? In:
    #include <stdio.h>

    /* static gives f internal linkage, so no separate external definition is needed */
    static inline void f(void)
    {
        printf("We are in %s\n", __func__);
    }

    int main(void)
    {
        f();
        printf("We are in %s\n", __func__);
    }
the call to f outputs the name of the function f, even if that function is inlined into main.
C90
Support for the identifier __func__ is new in C99.
C++
Support for the identifier __func__ is new in C99 and is not available in the C++ Standard.
Common Implementations
A translator only needs to declare __func__ if a reference to it occurs within a function. An obvious
storage saving optimization is to delay any declaration until such time as it is known to be required. Another
optimization is for the storage allocated for __func__ to exactly overlay that allocated to the string literal.
Allocating storage for a string literal and copying the characters to the separately allocated object it initializes
is not necessary when that object is defined using the const qualifier. gcc also supports the built-in form
__FUNCTION__.

Example
Debugging code in functions can provide useful information. But when there are lots of functions, the
quantity of useless information can be overwhelming. Controlling which functions are to output debugging
information by using conditional compilation requires that code be edited and the program rebuilt.
The names of functions can be used to dynamically control which functions are to output debugging
information. This control not only reduces the amount of information output, but can also reduce execution
time by orders of magnitude (output can be a resource-intense operation).
flookup.h
    #include <stddef.h>   /* NULL, used by D_func_trace */

    typedef struct f__rec {
        const char *func_name;
        _Bool enabled;
        struct f__rec *next;
    } func__list;

    extern _Bool func_lookup(func__list **, const char *);
    extern void change_enabled_setting(const char *, _Bool);

    /*
     * Use the name of the function to control whether debugging is
     * switched on/off. func_lookup is only called the first time this code
     * is executed, thereafter the value f___l->enabled can be used.
     */
    #define D_func_trace(func_name, code) { \
        static func__list *f___l = NULL; \
        if (f___l ? f___l->enabled : func_lookup(&f___l, func_name)) \
            {code} \
        }
flookup.c
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #include "flookup.h"

    /*
     * A fixed list of functions and their debug mode.
     * We could be more clever and make this a list which
     * could be added to as a program executes.
     */
    static struct {
        const char *func_name;
        _Bool enabled;
        func__list *traces_seen;
    } lookup_table[] = {
        {"abc", true, NULL},
        {NULL, false, NULL}
    };

    _Bool func_lookup(func__list **f_list, const char *f_name)
    {
        /*
         * Loop through lookup_table looking for a match against f_name.
         * If a match is found, add the new record to the traces_seen list
         * and return the value of enabled for that entry.
         */
        func__list *rec = malloc(sizeof *rec);

        if (rec == NULL)
            return false;      /* *f_list stays NULL, so the lookup is retried */
        rec->func_name = f_name;
        rec->enabled = false;  /* functions not in the table are not traced */
        rec->next = NULL;
        for (size_t i = 0; lookup_table[i].func_name != NULL; i++)
            if (strcmp(lookup_table[i].func_name, f_name) == 0) {
                rec->enabled = lookup_table[i].enabled;
                rec->next = lookup_table[i].traces_seen;
                lookup_table[i].traces_seen = rec;
                break;
            }
        *f_list = rec;
        return rec->enabled;
    }

    void change_enabled_setting(const char *f_name, _Bool new_enabled)
    {
        /*
         * Loop through lookup_table looking for a match against f_name.
         * If a match is found, loop over its traces_seen list setting
         * the enabled flag to new_enabled.
         *
         * This function can switch on/off the debugging output from
         * any registered function.
         */
        for (size_t i = 0; lookup_table[i].func_name != NULL; i++)
            if (strcmp(lookup_table[i].func_name, f_name) == 0) {
                lookup_table[i].enabled = new_enabled;
                for (func__list *p = lookup_table[i].traces_seen; p != NULL; p = p->next)
                    p->enabled = new_enabled;
            }
    }
811
This name is encoded as if the implicit declaration had been written in the source character set and then
translated into the execution character set as indicated in translation phase 5.
Commentary
Having the name appearing as if in translation phase 5 avoids any potential issues caused by macro names
defined with the spelling of keywords or the name __func__ (see 133 translation phase 5). It also enables a
translator to have an identifier name and type predefined internally, ready to be used when this reserved
identifier is encountered. Translation phase 5 is also where characters get converted to their corresponding
members in the execution character set, an essential requirement for spelling a function name. In many
implementations the function name written to the object file, or program image, is different from the one
appearing in the source (see 141 program image). This translation phase 5 requirement ensures that it is not
any modified name that is used.
Example
#include <stdio.h>

#define __func__ __CNUF__
#define __CNUF__ "g"

void f(void)
{
/*
 * The implicit declaration does not appear until after preprocessing.
 * So there is no declaration 'static const char __func__[] = "f";'
 * visible to the preprocessor (which would result in __func__ being
 * mapped to __CNUF__ and "f" rather than "g" being output).
 */
printf("Name of function is %s\n", __CNUF__);
}
812
EXAMPLE Consider the code fragment:

#include <stdio.h>
void myfunc(void)
{
        printf("%s\n", __func__);
        /* ... */
}
Each time the function is called, it will print to the standard output stream:
myfunc
Commentary
This assumes that the standard output stream is not closed (in which case the behavior would be undefined).
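The behavior described in this example can be sketched a little further. The following is a minimal illustration, assuming a C99 (or later) translator; the function names report_name, other_function, and names_differ are hypothetical and do not appear in the standard.

```c
#include <stdio.h>
#include <string.h>

/*
 * __func__ behaves as if the declaration
 *     static const char __func__[] = "function-name";
 * appeared at the start of each function body, so a pointer to it
 * remains valid after the function has returned.
 */
const char *report_name(void)
{
    printf("%s\n", __func__);   /* writes "report_name" to stdout */
    return __func__;
}

const char *other_function(void)
{
    return __func__;            /* a different array, spelling "other_function" */
}

/* Returns nonzero if the two functions see different names. */
int names_differ(void)
{
    return strcmp(report_name(), other_function()) != 0;
}
```

Each function sees its own implicitly declared array, which is why names_differ returns a nonzero value.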
813
Forward references: function definitions (6.9.1).
814
61) Since the name __func__ is reserved for any use by the implementation (7.1.3), if any other identifier is
explicitly declared using the name __func__, the behavior is undefined.
Commentary
The name is reserved because it begins with two underscores. The fact that the standard defines an interpretation
for this name in the identifier name space in block scope does not give any license to the developer to use
it in other name spaces or at file scope. This name is still reserved for use in other name spaces and scopes.
C90
Names beginning with two underscores were specified as reserved for any use by the C90 Standard. The
following program is likely to behave differently when translated and executed by a C99 implementation.

#include <stdio.h>

int main(void)
{
int __func__ = 1;

printf("%d\n", __func__);
}
C++
Names beginning with __ are reserved for use by a C++ implementation. This leaves the way open for a C++
implementation to use this name for some purpose.
6.4.3 Universal character names
815
universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad
hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
Commentary
It is intended that this syntax notation not be visible to the developer when reading or writing source code
that contains instances of this construct. That is, a universal-character-name aware editor displays the
ISO 10646 glyph representing the numeric value specified by the hex-quad sequence value. Without such
editor support, the whole rationale for adding these characters to C, allowing developers to read and write
identifiers in their own language, is voided.
[See 58 glyph.]
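The syntax above is simple enough to recognize by hand. The following is a minimal sketch, not how any particular translator does it; the function name parse_ucn and its interface are hypothetical, and the hexadecimal conversion assumes that the letters a through f have consecutive values in the execution character set (true of ASCII and EBCDIC, but not guaranteed by the standard).

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: if s starts with a universal-character-name,
 * store its numeric value in *value and return the number of source
 * characters it occupies (6 for \u hex-quad, 10 for \U hex-quad hex-quad).
 * Returns 0, leaving *value untouched, if s does not start with a
 * well-formed UCN. Only the syntax is checked, not the permitted ranges.
 */
size_t parse_ucn(const char *s, unsigned long *value)
{
    size_t digits, i;
    unsigned long v = 0;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        digits = 4;                 /* one hex-quad */
    else if (s[1] == 'U')
        digits = 8;                 /* two hex-quads */
    else
        return 0;

    for (i = 0; i < digits; i++) {
        unsigned char c = (unsigned char)s[2 + i];

        if (!isxdigit(c))           /* also stops at the terminating '\0' */
            return 0;
        if (isdigit(c))
            v = v * 16 + (unsigned long)(c - '0');
        else                        /* assumes a-f are consecutive */
            v = v * 16 + (unsigned long)(tolower(c) - 'a' + 10);
    }
    *value = v;
    return 2 + digits;
}
```

When calling this from C source, the test strings need an escaped backslash (for instance "\\u0123") so that the translator does not itself treat the sequence as a UCN.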
C90
Support for this syntactic category is new in C99.
Other Languages
Java calls this lexical construct a UnicodeInputCharacter (and does not support the \U form, only the \u one).
Coding Guidelines
It is difficult to imagine developers regularly using UCNs with an editor that does not display UCNs in
some graphical form. A guideline recommending the use of such an editor would not be telling developers
anything they did not already know.
A number of theories about how people recognize words have been proposed. One of the major issues yet
to be resolved is the extent to which readers make use of whole word recognition versus mapping character
sequences to sound (phonological coding). Support for UCNs increases the possibility that developers will
encounter unfamiliar characters in source code. The issue of developer performance in handling unfamiliar
characters is discussed elsewhere.
[See 792 models of word recognition; 792 reading characters unknown to reader.]
Example
#define foo(x)

void f(void)
{
foo("\\u0123") /* Does not contain a UCN. */
foo(\\u0123); /* Does contain a UCN. */
}
Constraints
816
A universal character name shall not specify a character whose short identifier is less than 00A0 other than
0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive. 62)
Commentary
The ISO 10646 Standard defines the ranges 0000 through 001F, and 007F through 009F, as the 8-bit control codes
(what it calls C0 and C1). Most of the UCNs with values less than 00A0 represent characters in the basic
source character set. The exceptions listed enumerate characters that are in the ASCII character set, but not
in the basic source character set. The ranges D800 through DBFF and DC00 through DFFF are known
as the surrogate ranges. The purpose of these ranges is to allow representation of rare characters in future
versions of the Unicode standard.
[See 28 ISO 10646.]
This constraint means that source files cannot contain the UCN equivalent for any members of the basic
source character set.
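The constraint can be expressed as a short range check. This is a minimal sketch; the function name ucn_value_permitted is hypothetical, and the check covers only the ranges named in the constraint (it makes no attempt to reject values beyond those assigned by ISO 10646).

```c
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if a UCN having the given numeric
 * value satisfies the constraint quoted above.
 */
bool ucn_value_permitted(unsigned long value)
{
    if (value >= 0xD800UL && value <= 0xDFFFUL)
        return false;               /* surrogate ranges */
    if (value < 0x00A0UL)           /* only $, @ and ` are permitted here */
        return value == 0x0024UL || value == 0x0040UL || value == 0x0060UL;
    return true;
}
```

For example, 0069 (the letter i) is rejected because it is a member of the basic source character set, while 0024 ($) is accepted.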
Rationale
UCNs are not permitted to designate characters from the basic source character set in order to permit fast
compilation times for C programs. For some real world programs, compilers spend a significant amount of
time merely scanning for the characters that end a quoted string, or end a comment, or end some other token.
Although, it is trivial for such loops in a compiler to be able to recognize UCNs, this can result in a surprising
amount of overhead.
A UCN is constrained not to specify a character short identifier in the range 0000 through 0020 or 007F through
009F inclusive for the same reason: this avoids allowing a UCN to designate the newline character. Since
different implementations use different control characters or sequences of control characters to represent
newline, UCNs are prohibited from representing any control character.
C++
2.2p2
If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F–0x9F (inclusive),
or if the universal character name designates a character in the basic source character set, then the program is
ill-formed.

The range of hexadecimal values that are not permitted in C++ is a subset of those that are not permitted in C.
This means that source which has been accepted by a conforming C translator will also be accepted by a
conforming C++ translator, but not the other way around.
Other Languages
Java has no such restrictions on the hexadecimal values.
Common Implementations
Support for UCNs is new in C99. It remains to be seen whether translator vendors decide to support any
UCN hexadecimal value as an extension.
Example
\u0069\u006E\u0074 glob; /* Constraint violation. */
Description
817
Universal character names may be used in identifiers, character constants, and string literals to designate
characters that are not in the basic character set.
Commentary
UCNs may also appear in comments. However, comments do not have a lexical structure to them. Inside a
comment, character sequences starting with \u are not treated as UCNs by a translator, although other tools
may choose to do so in this context. The mapping of UCNs in character constants and string literals to the
execution character set occurs in translation phase 5.
The constraint on the range of values that a UCN may take prevents them from being used to represent
keywords.
[See 816 UCNs not basic character set.]
C++
The C++ Standard also supports the use of universal character names in these contexts, but does not say in
words what it specifies in the syntax (although 2.2p2 comes close for identifiers).
Other Languages
In Java, UnicodeInputCharacters can represent any character and is mapped in lexical translation step
1. It is possible for every character in the source to appear in this form. The mapping only occurs once, so
\u005cu005a becomes \u005a, not Z (005c is the Unicode value for \ and 005a is the Unicode value
for Z).
Coding Guidelines
UCNs in character constants and string literals are used to represent characters that are output when a program
is executed, or in identifiers to provide more readable source code. In the former case it is possible that
UCNs from different natural languages will need to be represented. In the latter case it might be surprising if
source code contained UCNs from different languages. This usage is a complex one involving issues outside
of these coding guidelines (e.g., configuration management and customer requirements) and your author has
insufficient experience to know whether any guideline recommendations might be worthwhile.
Some of the coding guideline issues relating to the use of characters outside of the basic execution
character set are discussed elsewhere.
[See 238 multibyte character source contain.]
Example
#include <wchar.h>

int \u0386\u0401;
wchar_t *hello = L"\u05B0\u0901";
Semantics
818
The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as
specified by ISO/IEC 10646) is nnnnnnnn. 63)
Commentary
The standard specifies how UCNs are represented in source code. A development environment may choose to
provide, to developers, a visible representation of the UCN that matches the glyph with the corresponding
numeric value in ISO 10646. The ISO 10646 BNF syntax for short identifiers is:
{ U | u } [ {+}(xxxx | xxxxx | xxxxxx) | {-}xxxxxxxx ]
where x represents a hexadecimal digit.
Other Languages
Java does not support eight-digit universal character names.
Coding Guidelines
This form of UCN counts toward a greater number of significant characters in identifiers with external
linkage and therefore is not the preferred representation. However, the developer may not have any control
over the method used by an editor to represent UCNs. Given that characters from the majority of human
languages can be represented using four-digit short identifiers, eight-digit short identifiers are not likely to be
needed. If the development environment offers a choice of representations, use of four-digit short identifiers
is likely to result in more significant characters being retained in identifiers having external linkage.
[See 283 external identifier significant characters.]
819
Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is
nnnn (and whose eight-digit short identifier is 0000nnnn).
Commentary
It was possible to represent all of the characters specified by versions 1 and 2 of the Unicode-sponsored
character set using four-digit short identifiers. Version 3 introduced characters whose representation value
requires more than four digits.
Other Languages
Java only supports this form of four-digit universal character names.
820
62) The disallowed characters are the characters in the basic character set and the code positions reserved by
ISO/IEC 10646 for control characters, the character DELETE, and the S-zone (reserved for use by UTF-16).
Commentary
Requiring that characters in the basic character set not be represented using UCN notation helps guarantee
that existing tools (e.g., editors) continue to be able to process source files.
The control characters may have special meaning for some tools that process source files (e.g., a
communications program used for sending source down a serial link).
[See 215 basic character set.]
C++
The C++ Standard does not make this observation.
821
63) Short identifiers for characters were first specified in ISO/IEC 10646-1/AMD9:1997.
Commentary
This amendment appeared eight years after the first publication of the C Standard (which was made by ANSI
in 1989).
6.4.4 Constants
822
constant:
        integer-constant
        floating-constant
        enumeration-constant
        character-constant
Commentary
A constant differs from a constant-expression in that it consists of a single token. The term literal is
often used by developers to refer to what the C Standard calls a constant (technically the only literals C
contains are string literals). There is a more general usage of the term constant to mean something whose
value does not change. What the C Standard calls a constant-expression developers often shorten to constant.
[See 1322 constant expression syntax; 895 string literal syntax.]
C++
Footnote 21
21) The term “literal” generally designates, in this International Standard, those tokens that are called “constants”
in ISO C.

The C++ Standard also includes string-literal and boolean-literal in the list of literals, but it does
not include enumeration constants in the list of literals. However:
7.2p1
The identifiers in an enumerator-list are declared as constants, and can appear wherever constants are
required.

The C++ terminology more closely follows common developer terminology by using literal (a single token)
and constant (a sequence of operators and literals whose value can be evaluated at translation time). The value
of a literal is explicit in the sequence of characters making up its token. A constant may be made up of more
than one token or be an identifier. The operands in a constant have to be evaluated by the translator to obtain
its result value. C uses the more easily confused terminology of integer-constant (a single token) and
constant-expression (a sequence of operators, integer-constant and floating-constant whose
value can be evaluated at translation time).
Other Languages

Languages that support types not supported by C (for instance, sets) sometimes allow constants having
these types to be specified (e.g., in Pascal ['a', 'd'] represents a set containing two characters). Fortran
supports complex literal constants (e.g., (1.0, 2.0) represents the complex number 1.0 + 2.0i).
Many languages do not support (e.g., Java until version 1.5) some form of enumeration-constant.
Coding Guidelines
Constants are the mechanism by which numeric values are written into source code. The term constant is
used because the numeric values do not change during program execution (and are known at translation time;
although in some cases a person reading the source may only know that the value used will be one of a list of
possible values because the definition of a macro may be conditional on the setting of some translation time
option, for instance, -D).
[See 1931 macro object-like.]
The use of constants in source code creates a number of possible maintenance issues, including:
• A constant value, representing some quantity, often needs to occur in multiple locations within source
code. Searching for and replacing all occurrences of a particular numeric value in the code is an error
prone process. It is not possible, for instance, to know that all 15s occurring in the source code have
the same semantic association and some may need to remain unchanged. (Your author was once told
by a developer, whose source contained lots of 15s, that the UK government would never change
value-added tax from 15%; a few years later it changed to 17.5%.)
• On encountering a constant in the source, a reader usually needs to deduce its semantic association
(either in the application domain or its internal algorithmic function). While its semantics may be very
familiar to the author of the source, the association between value and semantics may not be so readily
made by later readers.
• A cognitive switch may need to be made because of the representation used for the constant (e.g.,
floating point, hexadecimal integer, or character constant).
[See 0 cognitive switch.]
One solution to these problems is to use an identifier to give a symbolic name (see footnote 822.1) to the
constant, and to use that symbolic name wherever the constant would have appeared in the source. Changes
to the value of the constant can then be made by a single modification to the definition of the identifier and a
well-chosen name can help readers make the appropriate semantic association. The creation of a symbolic
name provides two pieces of information:
1. The property represented by that symbolic name. For instance, the maximum value of a particular
type (INT_MAX), whether an implementation supports some feature (__STDC_IEC_559__), a means of
specifying some operation (SEEK_SET), or a way to obtain information (FE_OVERFLOW).
2. A method of operating on the symbolic name to access the property it represents. For instance,
arithmetic operations (INT_MAX), testing in a conditional preprocessing directive (__STDC_IEC_559__),
passing as an argument to a library function (SEEK_SET); passing as an argument to a library function,
possibly in combination with other symbolic names (FE_OVERFLOW).
[See 318 INT_MAX; 2015 __STDC_IEC_559__ macro.]
822.1 In some cases the linguistically more correct terminology would be iconic name.
Operating on symbolic names involves making use of representation information. (Assignment, or argument
passing, is the only time that representation might not be an issue.) The extent to which the use of
representation information will be considered acceptable will depend on the symbolic name. For instance,
FE_OVERFLOW appearing as the operand of a bitwise operator is to be expected, but its appearance as the
operand of an arithmetic operator would be suspicious.
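The different ways of operating on a symbolic name can be sketched as follows. This is only an illustration; all four function names are hypothetical, and the FE_ macros are tested with #ifdef because an implementation defines them only for the exceptions it supports.

```c
#include <fenv.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Arithmetic on a symbolic name: would a + b exceed INT_MAX? */
bool add_would_overflow(int a, int b)   /* assumes a >= 0 and b >= 0 */
{
    return a > INT_MAX - b;
}

/* Testing a symbolic name in a conditional preprocessing directive. */
bool claims_iec_559(void)
{
#ifdef __STDC_IEC_559__
    return true;
#else
    return false;
#endif
}

/* Passing a symbolic name as an argument to a library function. */
bool rewind_stream(FILE *stream)
{
    return fseek(stream, 0L, SEEK_SET) == 0;
}

/*
 * Combining symbolic names with a bitwise operator; the result is the
 * kind of value that would be passed to a function such as fetestexcept.
 */
int overflow_or_divbyzero_mask(void)
{
    int mask = 0;
#ifdef FE_OVERFLOW
    mask |= FE_OVERFLOW;
#endif
#ifdef FE_DIVBYZERO
    mask |= FE_DIVBYZERO;
#endif
    return mask;
}
```

Writing FE_OVERFLOW + FE_DIVBYZERO in the last function would probably produce the same value on many implementations, but it relies on representation details in a way that a reader would find suspicious.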
Developers rarely see the use of symbolic names as applying to all constants that occur in source
code. In some cases the following are claimed:
• The constants are sufficiently well known that there is no need to give them a name.
• The number of occurrences of particular constants is not sufficient to warrant creating a name for them.
• Operations involving some constant values occur so frequently that their semantic associations are
obvious to developers; for instance, assigning 0 or adding 1.
It is true that not all numeric values are meaningless to everybody. A few values are likely to be universally
known (at least to Earth-based developers). For instance, there are 60 seconds in a minute, 60 minutes in an
hour, and 24 hours in a day. The value 24 occurring in an expression involving time is likely to represent
hours in a day. Many values will only be well known to developers working within a given application
domain, such as atomic physics (e.g., the value 6.6261E-34). Between these extremes are other values; for
instance, 3.14159 will be instantly recognized by developers with a mathematics background. However,
developers without this background may need to think about what it represents. There is the possibility
that developers who have grown up surrounded by other mathematically oriented people will be completely
unaware that others do not recognize the obvious semantic association for this value.
A constant having a particular semantic association may only occur once in the source. However, the
issue is not how many times a constant having a particular semantic association occurs, but how many times
the particular constant value occurs. The same constant value can appear because of different semantic
associations. A search for a sequence of digits (a constant value) will locate all occurrences, irrespective of
semantic association.
While an argument can always be made for certain values being sufficiently well known that there is no
benefit in replacing them by identifiers, the effort/time taken in discussions on which values are sufficiently
well known to warrant standing on their own, instead of an identifier, is likely to be significantly greater than
the sum total of all the extra seconds taken to type the identifier.
The constant values 0 and 1 occur very frequently in source code (see Figure 825.1). Experience suggests
that the semantic associations tend to be that of assigning an initial value in the case of 0 and accessing a
preceding or following item in the case of 1. The coding guideline issues are discussed in the subsections
that deal with the different kinds of constants (e.g., integer, or floating).
What form of definition should a symbolic name denoting a constant value have? Possibilities include the
following:
• Macro names. These are seen by developers as being technically the same as constants in that they are
replaced by the numeric value of the constant during translation (there can also be an unvoiced bias
toward perceived efficiency here).
• Enumeration constants. The purpose of an enumerated type is to associate a list of constants with each
other. This is not to say the definition of an enumerated type containing a single enumeration constant
should not occur, but this usage would be unusual. Enumeration constants share the same unvoiced
developer bias as macro names: perceived efficiency.
• Objects initialized with the constant. This approach is advocated by some coding guideline documents
for C++. The extent to which this is because an object declared with the const qualifier really is
constant and a translator need not allocate storage for it, or because use of the preprocessor (often
called the C preprocessor, as if it were not also in C++) is frowned on in the C++ community, is left
to the reader to decide.
The enumeration constant versus macro name issue is discussed in detail elsewhere.
[See 517 enumeration set of named constants.]
What name to choose? The constant 6.6261E-34 illustrates another pitfall. Planck’s constant is almost
universally represented, within the physics community, using the letter h (a closely related constant is ħ,
the reduced Planck constant). A developer might be tempted to make use of this idiom to name the value,
perhaps even trying to find a way of using UCNs to obtain the appropriate h. The single letter h probably
gives no more information than the value. The name PLANCK_CONSTANT is self-evident. The developer
attitude that anybody who does not know what 6.6261E-34 represents has no business reading the source is
not very productive or helpful.
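The forms of definition a symbolic name can take (macro name, enumeration constant, initialized object) can be placed side by side. This is a minimal sketch; apart from PLANCK_CONSTANT, every identifier below is hypothetical and chosen only for illustration.

```c
/* Macro name: replaced by its value during preprocessing. */
#define PLANCK_CONSTANT 6.6261E-34      /* joule seconds */

/* Enumeration constants: in C these always have type int. */
enum {
    SECONDS_PER_MINUTE = 60,
    MINUTES_PER_HOUR   = 60,
    HOURS_PER_DAY      = 24
};

/*
 * Object initialized with the constant. In C (unlike C++) such an
 * object is not itself a constant expression.
 */
const double planck_constant = PLANCK_CONSTANT;

long seconds_in_days(long days)
{
    return days * HOURS_PER_DAY * MINUTES_PER_HOUR * SECONDS_PER_MINUTE;
}

double photon_energy(double frequency_hz)
{
    return PLANCK_CONSTANT * frequency_hz;
}
```

A reader of seconds_in_days does not need to guess why 24 and 60 appear, and a change to any of the values needs to be made in one place only.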
Table 822.1: Occurrence of different kinds of constants (as a percentage of all tokens). Based on the visible
form of the .c and .h files.

Kind of Constant       .c files    .h files
character-constant       0.16        0.06
integer-constant         6.70       20.79
floating-constant        0.02        0.20
string-literal           1.02        0.74
Constraints
823
The value of a constant shall be in the range of representable values for its type.
Commentary
This is something of a circular definition in that a constant’s value is also used to determine its type. The
lexical form of a constant is also a factor in determining which of a number of possible types it may take. An
unsuffixed constant that is too large to be represented in the type long long, or a suffixed constant that is
larger than the type with the greatest rank applicable to that suffix, violates this requirement (unless there is
some extended integer type supported by the implementation into whose range the value falls).
It can be argued that all floating constants are in range if the implementation supports ±∞.
There is a similar constraint for enumeration constants.
[See 824 constant type determined by form and value; 1440 enumeration constant representable in int.]
C++
The C++ Standard has equivalent wording covering integer-literals (2.13.1p3), character-literals
(2.13.2p3) and floating-literals (2.13.3p1). For enumeration-literals their type depends on the
context in which the question is asked:
7.2p4
Following the closing brace of an enum-specifier, each enumerator has the type of its enumeration. Prior to
the closing brace, the type of each enumerator is the type of its initializing value.
7.2p5
The underlying type of an enumeration is an integral type that can represent all the enumerator values defined in
the enumeration.
Other Languages
Most languages have a similar requirement, even those supporting a single integer or floating type.
Common Implementations
Some implementations use the string-to-integer conversions provided by the library, while others prefer the
flexibility (and fuller control of error recovery) afforded by specially written code. Parker [1074] describes the
minimal functionality required.
Example
char ch = '\0\0\0\0y';

float f_1 = 1e99999999999999999999999999999999999999999999999;
float f_2 = 0e99999999999999999999999999999999999999999999999;
float f_3 = 1e-99999999999999999999999999999999999999999999999; /* Approximately zero. */
float f_4 = 0e-99999999999999999999999999999999999999999999999; /* Exact zero. */

short s_1 = 9999999999999999999999999999999999999999999999999;
short s_2 = 99999999999999999999999 / 99999999999999999999999;
The integer constant 10000000000000000000L would violate this constraint on an implementation that
represented the type long long in 64 bits. The use of an L suffix precludes the constant being given the type
unsigned long long.
Semantics
824
Each constant has a type, determined by its form and value, as detailed later. (Wording following the
response to DR #298: Each constant shall have a type and the value of a constant shall be in the range of
representable values for its type.)
Commentary
Just as there are different floating and integer object types, the possible types that constants may have are not
limited to a single type.
It is a constraint violation for a constant to occur during translation phase 7 without a type.
The requirement that a constant be in the range of representable values for its type is a requirement on the
implementation.
The wording was changed by the response to DR #298.
[See 836 integer constant possible types; 136 translation phase 7; 841 integer constant no type.]
C++
2.13.1p2
The type of an integer literal depends on its form, value, and suffix.
2.13.3p1
The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float,
the suffixes l and L specify long double.

There are no similar statements for the other kinds of literals, although C++ does support suffixes on the
floating types. However, the syntactic form of string literals, character literals, and boolean literals determines
their type.
Coding Guidelines
The type of a constant, unlike object types, can vary between implementations. For instance, the integer
constant 40000 can have either the type int or long int. The suffix on the integer constant 40000u only
ensures that it has one of the listed unsigned integer types. The coding guideline issues associated with the
possibility that the type of a constant can vary between implementations are discussed elsewhere.
[See 835 integer constant type first in list.]
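One way of observing the type that a translator gives to a constant is sketched below. This goes beyond C99: it assumes a C11 translator, since it relies on _Generic, and the macro TYPE_NAME and both function names are hypothetical.

```c
/*
 * Maps an expression to the spelling of its type. Only the standard
 * integer types that an integer constant can have are listed; anything
 * else (e.g., an extended integer type) is reported as "other".
 */
#define TYPE_NAME(x) _Generic((x),                \
    int:                "int",                    \
    unsigned int:       "unsigned int",           \
    long:               "long",                   \
    unsigned long:      "unsigned long",          \
    long long:          "long long",              \
    unsigned long long: "unsigned long long",     \
    default:            "other")

/* "int" where INT_MAX is at least 40000, otherwise "long". */
const char *type_of_40000(void)
{
    return TYPE_NAME(40000);
}

/* "unsigned int" where UINT_MAX is at least 40000, otherwise "unsigned long". */
const char *type_of_40000u(void)
{
    return TYPE_NAME(40000u);
}
```

The results for small constants such as 1, 1u, 1L, and 1LL are the same everywhere; it is values close to the limits of int that give different answers on different implementations.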

6.4.4.1 Integer constants
825
integer-constant:
        decimal-constant integer-suffix_opt
        octal-constant integer-suffix_opt
        hexadecimal-constant integer-suffix_opt
decimal-constant:
        nonzero-digit
        decimal-constant digit
octal-constant:
        0
        octal-constant octal-digit
hexadecimal-constant:
        hexadecimal-prefix hexadecimal-digit
        hexadecimal-constant hexadecimal-digit
hexadecimal-prefix: one of
        0x 0X
nonzero-digit: one of
        1 2 3 4 5 6 7 8 9
octal-digit: one of
        0 1 2 3 4 5 6 7
hexadecimal-digit: one of
        0 1 2 3 4 5 6 7 8 9
        a b c d e f
        A B C D E F
integer-suffix:
        unsigned-suffix long-suffix_opt
        unsigned-suffix long-long-suffix
        long-suffix unsigned-suffix_opt
        long-long-suffix unsigned-suffix_opt
unsigned-suffix: one of
        u U
long-suffix: one of
        l L
long-long-suffix: one of
        ll LL
Commentary
Integer constants are created in translation phase 7 when the preprocessing tokens pp-number are converted
into tokens denoting various forms of constant. Integer-constants always denote nonnegative values. The
character sequence -1 consists of the two tokens {-} {1}, a constant expression.
[See 136 translation phase 7; 1322 constant expression syntax.]
An integer-suffix can be used to restrict the set of possible types the constant can have; it also specifies
the lowest rank an integer constant may have (which for ll or LL leaves few further possibilities). The U, or
u, suffix indicates that the integer constant is unsigned.
All translation time integer constants are nonnegative. The character sequence -1 consists of the token
sequence unary minus followed by the decimal-constant 1. Support for translation time negative constants
in the lexical grammar would create unjustified complexity by requiring lexers to disambiguate binary from
unary uses of operators in, for instance, X-1.
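Some consequences of this grammar can be seen in a few lines of code. This is a minimal sketch; the function names and the macro X are hypothetical and exist only to give each constant somewhere to appear.

```c
#include <limits.h>

/* A leading 0 makes 010 an octal-constant (eight), not ten. */
int octal_010(void) { return 010; }

/* The hexadecimal-prefix 0x makes 0x10 sixteen. */
int hex_0x10(void)  { return 0x10; }

/*
 * -1u is the unary minus operator applied to the unsigned constant 1u.
 * The result is reduced modulo UINT_MAX + 1, so the value is UINT_MAX
 * rather than a negative number.
 */
unsigned int minus_one_u(void) { return -1u; }

/*
 * After macro replacement X-1 is the three tokens 10, - and 1; the
 * minus here is a binary operator, not part of a constant.
 */
#define X 10
int x_minus_1(void) { return X-1; }
```

The third function shows why it matters that the minus sign is an operator: the type of the constant is decided first (unsigned int), and only then is the operator applied.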
C90
Support for long-long-suffix and the nonterminal hexadecimal-prefix is new in C99.
C++
The C++ syntax is identical to the C90 syntax.
Support for long-long-suffix and the nonterminal hexadecimal-prefix is not available in C++.
Common Implementations
Some implementations specify that the prefix 0b (or 0B) denotes an integer constant expressed in binary
notation. Over the years the C Committee received a number of requests for such a prefix to be added to
the C Standard. The Committee did not see sufficient utility for this prefix to be included in C99. The C
embedded systems TR specifies h and H to denote the types short frac or short accum, and one of k, K,
r, and R to denote a fixed-point type.
[See 18 Embedded C TR.]
The IBM ILE C compiler [627] supports a packed decimal data type. The suffix d or D may be used to
specify that a literal has this type. Microsoft C supports the suffixes i8, i16, i32, and i64 denoting integer
constants having the types byte (an extension), short, int, and __int64, respectively.
Other Languages
Although Ada supports integer constants having bases between 2 and 16 (e.g., 2#1101# is the binary
representation for 10#13#), few other languages support the use of suffixes. Ada also supports the use of
underscores within an integer-constant to make the value more readable.
Coding Guidelines
A study by Brysbaert [174] found that the time taken for a person to process an Arabic integer between 1 and
99 was a function of the logarithm of its magnitude, the frequency of the number (based on various estimates
of its frequency of occurrence in everyday life; see Dorogovtsev et al. [373] for measurements of numbers
appearing in web pages), and sometimes the number of syllables in the spoken form of the value. Subject
response times varied from approximately 300 ms for values close to zero, to approximately 550 ms for
values in the nineties.
Experience shows that the long-suffix l is often visually confused with the nonzero-digit 1 (see footnote 825.1).
Cg 825.1 If a long-suffix is required, only the form L shall be used.
Cg 825.2 If a long-long-suffix is required, only the form LL shall be used.
As previously pointed out, constants appearing in the visible form of the source often signify some quantity
with real world semantics attached to it. However, uses of the integer constants 0 and 1 in the visible source
often have no special semantics associated with their usage. They also represent a significant percentage of
the total number of integer constants in the source code (see Figure 825.1). The frequency of occurrence of
these values (most RISC processors dedicate a single register to permanently hold the value zero) comes
about through commonly seen program operations. These operations include: code to count the number of
occurrences of entities, or that contain loops, or index the previous or next element of an array (not that 0 or
1 could not also have similar semantic meaning to other constant values).
[See 822 constant syntax.]
A blanket requirement that all integer constants be represented in the visible source by symbolic names fails to take into account that a large percentage of the integer constants used in programs have no special meaning associated with them. In particular the integer constants 0 and 1 occur so often (see Figure 825.1) that having to justify why each of them need not be replaced by a symbolic name would have a high cost for an occasional benefit.

825.1 While the visual similarity between alphabetic letters has been experimentally measured, your author is not aware of any experiment that has measured the visual similarity of digits with letters.

v 1.2 June 24, 2009
6.4.4.1 Integer constants
825

[Figure 825.1 plot: panels "decimal-constant value" and "hexadecimal-constant value" (horizontal axis ticks 0, 2, 8, 32, 128, 512); vertical axis "Occurrences" (ticks 1 to 100,000).]
Figure 825.1: Number of integer constants having the lexical form of a decimal-constant (the literal 0 is also included in this set) and hexadecimal-constant that have a given value. Based on the visible form of the .c and .h files.

Rev 825.3 No integer constant, other than 0 and 1, shall appear in the visible source code, other than as the sole preprocessing token in the body of a macro definition or in an enumeration definition.
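One way of reading this guideline is sketched below. The names MAX_RETRIES, LINE_BUFFER_SIZE, count_nonzero, and retries_left are hypothetical, invented for this illustration; the sketch only shows the two permitted places for a constant other than 0 and 1, and the direct use of 0 and 1 elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* The constant is the sole preprocessing token in the macro body. */
#define MAX_RETRIES 5

/* A constant may also appear in an enumeration definition. */
enum buffer_limits { LINE_BUFFER_SIZE = 80 };

/* Counts the nonzero bytes in buf; 0 and 1 appear directly, as permitted. */
size_t count_nonzero(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            count += 1;
    }
    return count;
}

/* Returns how many attempts remain, never going below zero. */
int retries_left(int attempts_made)
{
    assert(attempts_made >= 0);
    return attempts_made >= MAX_RETRIES ? 0 : MAX_RETRIES - attempts_made;
}
```

A reader meeting MAX_RETRIES in an expression gets the real world meaning directly, while the loop counter and increment carry no such meaning and gain nothing from a name.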
Some developers are sloppy in the use of integer constants, using them where a floating constant was the appropriate type. The presence of a period makes it explicitly visible that a floating type is being used. The general issue of integer constant conversions is discussed elsewhere.
835.2 integer constant with suffix, not immediately converted
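A small sketch of the kind of slip described above, using hypothetical function names. In the first function the operands are integer constants, so the division is carried out in type int before the result is converted to double.

```c
#include <assert.h>

/* Integer constants: 1 / 2 is evaluated in type int and truncates to 0. */
double half_using_integer_constants(void) { return 1 / 2; }

/* Floating constants: the period makes the floating type visible. */
double half_using_floating_constants(void) { return 1.0 / 2.0; }

/* Checks that the two spellings really do produce different values. */
void check_half_examples(void)
{
    assert(half_using_integer_constants() == 0.0);
    assert(half_using_floating_constants() == 0.5);
}
```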
Example
The character sequence 123xyz is tokenized as {123xyz}, a pp-number. This is not a valid integer constant.
927 pp-number syntax
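The tokenization in this example can be observed with the # operator, which works on preprocessing tokens before any conversion to an integer constant is attempted. This is a sketch only; the macro SPELLING and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Turns the spelling of its argument into a string literal. */
#define SPELLING(token) #token

/* 123xyz is a single pp-number, so it is acceptable as a macro argument. */
const char *pp_number_spelling(void) { return SPELLING(123xyz); }

/* One token: six characters, with no white space inserted between parts. */
size_t pp_number_length(void) { return strlen(pp_number_spelling()); }

/* Checks the spelling produced by the preprocessor. */
void check_pp_number_example(void)
{
    assert(strcmp(pp_number_spelling(), "123xyz") == 0);
    assert(pp_number_length() == 6);
}

/* Enabling the following declaration causes a diagnostic, because the
   pp-number 123xyz cannot be converted to an integer constant. */
/* int bad = 123xyz; */
```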
Usage
integer constant usage
Having some forms of constant tokens (also see Figure 842.1) follow Benford’s law [584] would not be surprising because the significant digits of a set of values created by randomly sampling from a variety of different distributions converge to a logarithmic distribution (i.e., Benford’s law). [583] While the results for decimal-constant (see Figure 825.2) may appear to be a reasonable fit, applying a chi-squared test shows the fit to be remarkably poor ($\chi^2$ = 132,398). The first nonzero digit of hexadecimal-constants appears to be approximately evenly distributed.
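As a worked instance of the law referred to above, the predicted probabilities for base 10 and base 16 can be written out. The symbols $b$ (base), $N$ (number of constants measured), and $O_d$ (observed count for first digit $d$) are notation introduced here, not the author's. The text does not say which form of chi-squared statistic was computed; Pearson's form is assumed in the last line.

```latex
\begin{align*}
P_b(d) &= \log_b\!\left(1 + \frac{1}{d}\right), \qquad d = 1, \dots, b-1 \\
\sum_{d=1}^{b-1} P_b(d) &= \log_b \prod_{d=1}^{b-1} \frac{d+1}{d} = \log_b b = 1 \\
P_{10}(1) &= \log_{10} 2 \approx 0.301, \qquad P_{10}(9) = \log_{10}\tfrac{10}{9} \approx 0.046 \\
P_{16}(1) &= \log_{16} 2 = 0.25 \\
\chi^2 &= \sum_{d=1}^{b-1} \frac{\bigl(O_d - N\,P_b(d)\bigr)^2}{N\,P_b(d)}
\end{align*}
```

An evenly distributed first digit, as reported for hexadecimal-constants, would instead give each of the 15 nonzero digits a probability of 1/15, roughly 0.067, well below the 0.25 that Benford’s law predicts for the digit 1.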
Table 825.1: Occurrence of various kinds of integer-constants (as a percentage of all integer constants; note that zero is included in the decimal-constant count rather than the octal-constant count). Based on the visible form of the .c and .h files.

Kind of integer-constant    .c files    .h files
decimal-constant            64.1        17.8
hexadecimal-constant        35.8        82.1
octal-constant               0.1         0.2
[Figure 825.2 plot: horizontal axis "First non-zero digit" (1 to 9, A to F); vertical axis "Probability of appearing" (ticks 1, 10, 100); series "decimal" and "hexadecimal".]
Figure 825.2: Probability of a decimal-constant or hexadecimal-constant starting with a particular digit; based on .c files. Dotted lines are the probabilities predicted by Benford’s law (for values expressed in base 10 and base 16), i.e., $\log(1 + d^{-1})$, where d is the numeric value of the digit.
Table 825.2: Occurrence of various integer-suffix sequences (as a percentage of all integer-constants). Based on the visible form of the .c and .h files.

Suffix Character Sequence    .c files    .h files
none                         99.6850     99.5997
U/u                           0.0298      0.0198
L/l                           0.1378      0.2096
U/uL/ul                       0.1269      0.1625
Lu/lU                         0.0005      0.0001
LL/lL/ll                      0.0072      0.0022
ULL/uLl/ulL/Ull               0.0128      0.0061
LLU/lLu/LlU/llu               0.0000      0.0000
Table 825.3: Common token pairs involving integer-constants. Based on the visible form of the .c files.

Token Sequence                 % Occurrence of First Token    % Occurrence of Second Token
, integer-constant             42.9                           56.5
integer-constant ]              6.4                           44.4
integer-constant ,             58.2                           44.2
integer-constant ;             14.1                           12.1
integer-constant )             14.2                           11.7
integer-constant #              1.4                            9.1
= integer-constant             19.6                            9.0
[ integer-constant             39.3                            5.6
integer-constant }              1.2                            4.4
-v integer-constant            69.0                            4.1
( integer-constant              2.8                            3.4
== integer-constant            25.5                            2.0
return integer-constant        18.6                            1.9
+ integer-constant             33.7                            1.9
& integer-constant             30.6                            1.5
identifier integer-constant     0.3                            1.5
- integer-constant             44.0                            1.3
< integer-constant             40.0                            1.3
{ integer-constant              4.2                            1.2
A study by Pollmann and Jansen [1120] analyzed occurrences of related pairs of numerals (e.g., “two or three books”) in written (Dutch) text. They found that pairs of numerals often followed what they called ordering rules, which were (for the number pair x and y):
• x has to be smaller than y
• x or y has to be round (i.e., round numbers include the numbers 1 to 20 and the multiples of five)
• the difference between x and y has to be a favorite number. (These include: $10^n \times$ (1, 2, ½, or ¼) for any value of n.)
Description
826
An integer constant begins with a digit, but has no period or exponent part.
integer constant
Commentary
A restatement of information given in the Syntax clause.
827
It may have a prefix that specifies its base and a suffix that specifies its type.
Commentary
A suffix need not uniquely determine an integer constant’s type, only the lowest rank it may have. There is no suffix for specifying the type int, or any integer type with rank less than int (although implementations may provide these as an extension).
The base document did not specify any suffixes; they were introduced in C90.
1 base document
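The point that a suffix fixes only the lowest rank can be checked with a C11 generic selection. This is a sketch under the assumption of a C11 translator; the macro TYPE_NAME and the function names are hypothetical. The last check applies only where unsigned int cannot represent 4294967296, which is the common case but is not guaranteed.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Names the type the translator gave to the expression e. */
#define TYPE_NAME(e) _Generic((e),            \
    int: "int",                               \
    unsigned int: "unsigned int",             \
    long: "long",                             \
    unsigned long: "unsigned long",           \
    long long: "long long",                   \
    unsigned long long: "unsigned long long", \
    default: "other")

/* No suffix and a small value: the type is int. */
const char *type_of_unsuffixed(void) { return TYPE_NAME(1); }

/* The L suffix rules out int, so a small value has type long. */
const char *type_of_long_suffixed(void) { return TYPE_NAME(1L); }

/* The U suffix asks for at least unsigned int; a large value may get more. */
const char *type_of_large_u_suffixed(void) { return TYPE_NAME(4294967296U); }

/* Checks the types selected for a few suffixed constants. */
void check_suffix_types(void)
{
    assert(strcmp(TYPE_NAME(1U), "unsigned int") == 0);
    assert(strcmp(TYPE_NAME(1LL), "long long") == 0);
    if (UINT_MAX < 4294967296ULL) {
        /* The value does not fit, so a higher-ranked unsigned type is used. */
        assert(strcmp(type_of_large_u_suffixed(), "unsigned int") != 0);
    }
}
```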
Other Languages
A few other languages also support some kind of suffix, including C++, Fortran, and Java.
Coding Guidelines
Developers do not normally think in terms of an integer constant having a prefix. The term integer constant is often used to denote what the standard calls a decimal constant, which corresponds to the common case. When they occur in source, both octal and hexadecimal constants are usually referred to by these names, respectively. The benefits of educating developers to use the terminology decimal constant instead of integer constant are very unlikely to exceed the cost.
terminology integer constant
828
A decimal constant begins with a nonzero digit and consists of a sequence of decimal digits.
decimal constant
Commentary
A restatement of information given in the Syntax clause.
Coding Guidelines
The constant 0 is, technically, an octal constant. Some guideline documents use the term decimal constant in their wording, overlooking the fact that, technically, this excludes the value 0. The guidelines given in this book do not fall into this trap, but anybody who creates a modified version of them needs to watch out for it.
829
An octal constant consists of the prefix 0 optionally followed by a sequence of the digits 0 through 7 only.
octal constant
Commentary
A restatement of information given in the Syntax clause. An octal constant is a natural representation to use when the value held in a single byte needs to be displayed (or read in) and the number of output indicators (or input keys) is limited (only eight possibilities are needed). One instance is a freestanding environment where the output device can only represent digits. The users of such input/output devices tend to be technically literate.
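The effect of the 0 prefix on the value denoted by a digit sequence is easy to overlook, as the following sketch shows. The function names are hypothetical and serve only to label the constants.

```c
#include <assert.h>

/* The same two digits, read in three different bases. */
int decimal_one_zero(void) { return 10; }    /* decimal constant: ten */
int octal_one_zero(void)   { return 010; }   /* octal constant: eight */
int hex_one_zero(void)     { return 0x10; }  /* hexadecimal constant: sixteen */

/* Three octal digits are enough for every value of an 8-bit byte. */
int largest_eight_bit_value(void) { return 0377; }

/* Checks the values denoted by the constants above. */
void check_octal_examples(void)
{
    assert(octal_one_zero() == 8);
    assert(octal_one_zero() != decimal_one_zero());
    assert(largest_eight_bit_value() == 255);
}
```

A leading zero added to line up a column of decimal values therefore changes their meaning without any diagnostic being required.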
Other Languages
A few other languages (e.g., Java and Ada) support octal constants. Most do not.
Common Implementations
K&R C supported the use of the digits 8 and 9 in octal constants (support for this functionality was removed during the early evolution of C [1199], although some implementations continue to support it [610, 1094]). They represented the octal values 10 and 11 (i.e., decimal 8 and 9), respectively.
Coding Guidelines
Octal constants are rarely used (approximately 0.1% of all integer-constants, not counting the value 0). There seem to be a number of reasons why developers occasionally use octal constants: