Compilers: Principles, Techniques, and Tools
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman


Preface
This book is a descendant of Principles of Compiler Design by Alfred V. Aho
and Jeffrey D. Ullman. Like its ancestor, it is intended as a text for a first
course in compiler design. The emphasis is on solving problems universally
encountered in designing a language translator, regardless of the source or target machine.

Although few people are likely to build or even maintain a compiler for a
major programming language, the reader can profitably apply the ideas and
techniques discussed in this book to general software design. For example,
the string matching techniques for building lexical analyzers have also been
used in text editors, information retrieval systems, and pattern recognition
programs. Context-free grammars and syntax-directed definitions have been
used to build many little languages such as the typesetting and figure drawing
systems that produced this book. The techniques of code optimization have
been used in program verifiers and in programs that produce "structured"
programs from unstructured ones.

The major topics in compiler design are covered in depth. The first chapter
introduces the basic structure of a compiler and is essential to the rest of the
book.

Chapter 2 presents a translator from infix to postfix expressions, built using
some of the basic techniques described in this book. Many of the remaining
chapters amplify the material in Chapter 2.

Chapter 3 covers lexical analysis, regular expressions, finite-state machines,
and scanner-generator tools. The material in this chapter is broadly applicable
to text processing.

Chapter 4 covers the major parsing techniques in depth, ranging from the
recursive-descent methods that are suitable for hand implementation to the
computationally more intensive LR techniques that have been used in parser
generators.

Chapter 5 introduces the principal ideas in syntax-directed translation. This
chapter is used in the remainder of the book for both specifying and implementing translations.

Chapter 6 presents the main ideas for performing static semantic checking.
Type checking and unification are discussed in detail.



Chapter 7 discusses storage organizations used to support the run-time
environment of a program.

Chapter 8 begins with a discussion of intermediate languages and then
shows how common programming language constructs can be translated into
intermediate code.

Chapter 9 covers target code generation. Included are the basic "on-the-fly"
code generation methods, as well as optimal methods for generating code
for expressions. Peephole optimization and code-generator generators are also
covered.

Chapter 10 is a comprehensive treatment of code optimization. Data-flow
analysis methods are covered in detail, as well as the principal methods for
global optimization.

Chapter 11 discusses some pragmatic issues that arise in implementing a
compiler. Software engineering and testing are particularly important in compiler construction.

Chapter 12 presents case studies of compilers that have been constructed
using some of the techniques presented in this book.

Appendix A describes a simple language, a "subset" of Pascal, that can be
used as the basis of an implementation project.
The authors have taught both introductory and advanced courses, at the
undergraduate and graduate levels, from the material in this book at AT&T
Bell Laboratories, Columbia, Princeton, and Stanford.

An introductory compiler course might cover material from the following
sections of this book:

    introduction                    Chapter 1 and Sections 2.1-2.5
    lexical analysis                2.6, 3.1-3.4
    symbol tables                   2.7, 7.6
    parsing                         2.4, 4.1-4.4
    syntax-directed translation     2.5, 5.1-5.5
    type checking                   6.1-6.2
    run-time organization           7.1-7.3
    intermediate code generation    8.1-8.3
    code generation                 9.1-9.4
    code optimization               10.1-10.2

Information needed for a programming project like the one in Appendix A is
introduced in Chapter 2.
A course stressing tools in compiler construction might include the discussion of lexical analyzer generators in Section 3.5, of parser generators in Sections 4.8 and 4.9, of code-generator generators in Section 9.12, and material
on techniques for compiler construction from Chapter 11.

An advanced course might stress the algorithms used in lexical analyzer
generators and parser generators discussed in Chapters 3 and 4, the material
on type equivalence, overloading, polymorphism, and unification in Chapter
6, the material on run-time storage organization in Chapter 7, the pattern-directed code generation methods discussed in Chapter 9, and material on
code optimization from Chapter 10.

Exercises
As before, we rate exercises with stars. Exercises without stars test understanding of definitions, singly starred exercises are intended for more
advanced courses, and doubly starred exercises are food for thought.

Acknowledgments
At various stages in the writing of this book, a number of people have given
us invaluable comments on the manuscript. In this regard we owe a debt of
gratitude to Bill Appelbe, Nelson Beebe, Jon Bentley, Lois Bogess, Rodney
Farrow, Stu Feldman, Charles Fischer, Chris Fraser, Art Gittelman, Eric
Grosse, Dave Hanson, Fritz Henglein, Robert Henry, Gerard Holzmann,
Steve Johnson, Brian Kernighan, Ken Kubota, Daniel Lehmann, Dave MacQueen, Dianne Maki, Alan Martin, Doug McIlroy, Charles McLaughlin, John
Mitchell, Elliott Organick, Robert Paige, Phil Pfeiffer, Rob Pike, Kari-Jouko
Räihä, Dennis Ritchie, Sriram Sankar, Paul Stoecker, Bjarne Stroustrup, Tom
Szymanski, Kim Tracy, Peter Weinberger, Jennifer Widom, and Reinhard
Wilhelm.
This book was phototypeset by the authors using the excellent software
available on the UNIX system. The typesetting command read

    pic files | tbl | eqn | troff -ms

pic is Brian Kernighan's language for typesetting figures; we owe Brian a
special debt of gratitude for accommodating our special and extensive figure-drawing needs so cheerfully. tbl is Mike Lesk's language for laying out
tables. eqn is Brian Kernighan and Lorinda Cherry's language for typesetting
mathematics. troff is Joe Ossanna's program for formatting text for a phototypesetter, which in our case was a Mergenthaler Linotron 202/N. The ms
package of troff macros was written by Mike Lesk. In addition, we
managed the text using make due to Stu Feldman. Cross references within
the text were maintained using awk created by Al Aho, Brian Kernighan, and
Peter Weinberger, and sed created by Lee McMahon.
The authors would particularly like to acknowledge Patricia Solomon for
helping prepare the manuscript for photocomposition. Her cheerfulness and
expert typing were greatly appreciated. J. D. Ullman was supported by an
Einstein Fellowship of the Israeli Academy of Arts and Sciences during part of
the time in which this book was written. Finally, the authors would like to
thank AT&T Bell Laboratories for its support during the preparation of the
manuscript.

A.V.A., R.S., J.D.U.


Contents

Chapter 1  Introduction to Compiling
    1.1  Compilers
    1.2  Analysis of the source program
    1.3  The phases of a compiler
    1.4  Cousins of the compiler
    1.5  The grouping of phases
    1.6  Compiler-construction tools
    Bibliographic notes

Chapter 2  A Simple One-Pass Compiler
    2.1  Overview
    2.2  Syntax definition
    2.3  Syntax-directed translation
    2.4  Parsing
    2.5  A translator for simple expressions
    2.6  Lexical analysis
    2.7  Incorporating a symbol table
    2.8  Abstract stack machines
    2.9  Putting the techniques together
    Exercises
    Bibliographic notes

Chapter 3  Lexical Analysis
    3.1  The role of the lexical analyzer
    3.2  Input buffering
    3.3  Specification of tokens
    3.4  Recognition of tokens
    3.5  A language for specifying lexical analyzers
    3.6  Finite automata
    3.7  From a regular expression to an NFA
    3.8  Design of a lexical analyzer generator
    3.9  Optimization of DFA-based pattern matchers
    Exercises
    Bibliographic notes

Chapter 4  Syntax Analysis
    4.1  The role of the parser
    4.2  Context-free grammars
    4.3  Writing a grammar
    4.4  Top-down parsing
    4.5  Bottom-up parsing
    4.6  Operator-precedence parsing
    4.7  LR parsers
    4.8  Using ambiguous grammars
    4.9  Parser generators
    Exercises
    Bibliographic notes

Chapter 5  Syntax-Directed Translation
    5.1  Syntax-directed definitions
    5.2  Construction of syntax trees
    5.3  Bottom-up evaluation of S-attributed definitions
    5.4  L-attributed definitions
    5.5  Top-down translation
    5.6  Bottom-up evaluation of inherited attributes
    5.7  Recursive evaluators
    5.8  Space for attribute values at compile time
    5.9  Assigning space at compiler-construction time
    5.10 Analysis of syntax-directed definitions
    Exercises
    Bibliographic notes

Chapter 6  Type Checking
    6.1  Type systems
    6.2  Specification of a simple type checker
    6.3  Equivalence of type expressions
    6.4  Type conversions
    6.5  Overloading of functions and operators
    6.6  Polymorphic functions
    6.7  An algorithm for unification
    Exercises
    Bibliographic notes

Chapter 7  Run-Time Environments
    7.1  Source language issues
    7.2  Storage organization
    7.3  Storage-allocation strategies
    7.4  Access to nonlocal names
    7.5  Parameter passing
    7.6  Symbol tables
    7.7  Language facilities for dynamic storage allocation
    7.8  Dynamic storage allocation techniques
    7.9  Storage allocation in Fortran
    Exercises
    Bibliographic notes

Chapter 8  Intermediate Code Generation
    8.1  Intermediate languages
    8.2  Declarations
    8.3  Assignment statements
    8.4  Boolean expressions
    8.5  Case statements
    8.6  Backpatching
    8.7  Procedure calls
    Exercises
    Bibliographic notes

Chapter 9  Code Generation
    9.1  Issues in the design of a code generator
    9.2  The target machine
    9.3  Run-time storage management
    9.4  Basic blocks and flow graphs
    9.5  Next-use information
    9.6  A simple code generator
    9.7  Register allocation and assignment
    9.8  The dag representation of basic blocks
    9.9  Peephole optimization
    9.10 Generating code from dags
    9.11 Dynamic programming code-generation algorithm
    9.12 Code-generator generators
    Exercises
    Bibliographic notes

Chapter 10  Code Optimization
    10.1  Introduction
    10.2  The principal sources of optimization
    10.3  Optimization of basic blocks
    10.4  Loops in flow graphs
    10.5  Introduction to global data-flow analysis
    10.6  Iterative solution of data-flow equations
    10.7  Code-improving transformations
    10.8  Dealing with aliases
    10.9  Data-flow analysis of structured flow graphs
    10.10 Efficient data-flow algorithms
    10.11 A tool for data-flow analysis
    10.12 Estimation of types
    10.13 Symbolic debugging of optimized code
    Exercises
    Bibliographic notes

Chapter 11  Want to Write a Compiler?
    11.1  Planning a compiler
    11.2  Approaches to compiler development
    11.3  The compiler-development environment
    11.4  Testing and maintenance

Chapter 12  A Look at Some Compilers
    12.1  EQN, a preprocessor for typesetting mathematics
    12.2  Compilers for Pascal
    12.3  The C compilers
    12.4  The Fortran H compilers
    12.5  The Bliss/11 compiler
    12.6  Modula-2 optimizing compiler

Appendix A  A Programming Project
    A.1  Introduction
    A.2  A Pascal subset
    A.3  Program structure
    A.4  Lexical conventions
    A.5  Suggested exercises
    A.6  Evolution of the interpreter
    A.7  Extensions


CHAPTER 1

Introduction to Compiling

The principles and techniques of compiler writing are so pervasive that the
ideas found in this book will be used many times in the career of a computer
scientist. Compiler writing spans programming languages, machine architecture, language theory, algorithms, and software engineering. Fortunately, a
few basic compiler-writing techniques can be used to construct translators for
a wide variety of languages and machines. In this chapter, we introduce the
subject of compiling by describing the components of a compiler, the environment in which compilers do their job, and some software tools that make it
easier to build compilers.
1.1 COMPILERS

Simply stated, a compiler is a program that reads a program written in one
language - the source language - and translates it into an equivalent program
in another language - the target language (see Fig. 1.1). As an important part
of this translation process, the compiler reports to its user the presence of
errors in the source program.

    source program
          |
          v
      compiler  ---->  error messages
          |
          v
    target program

Fig. 1.1. A compiler.

At first glance, the variety of compilers may appear overwhelming. There
are thousands of source languages, ranging from traditional programming
languages such as Fortran and Pascal to specialized languages that have arisen
in virtually every area of computer application. Target languages are equally
as varied; a target language may be another programming language, or the
machine language of any computer between a microprocessor and a
supercomputer. Compilers are sometimes classified as single-pass,
multi-pass, load-and-go, debugging, or optimizing, depending on how they have been constructed or on what function they are supposed to perform. Despite this
apparent complexity, the basic tasks that any compiler must perform are
essentially the same. By understanding these tasks, we can construct compilers for a wide variety of source languages and target machines using the
same basic techniques.

Our knowledge about how to organize and write compilers has increased
vastly since the first compilers started to appear in the early 1950's.
It is difficult to give an exact date for the first compiler because initially a great deal of
experimentation and implementation was done independently by several
groups. Much of the early work on compiling dealt with the translation of
arithmetic formulas into machine code.

Throughout the 1950's, compilers were considered notoriously difficult programs to write. The first Fortran compiler, for example, took 18 staff-years
to implement (Backus et al. [1957]). We have since discovered systematic
techniques for handling many of the important tasks that occur during compilation. Good implementation languages, programming environments, and
software tools have also been developed. With these advances, a substantial
compiler can be implemented even as a student project in a one-semester
compiler-design course.

The Analysis-Synthesis Model of Compilation

There are two parts to compilation: analysis and synthesis. The analysis part
breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The synthesis part constructs the
desired target program from the intermediate representation. Of the two
parts, synthesis requires the most specialized techniques. We shall consider
analysis informally in Section 1.2 and outline the way target code is synthesized in a standard compiler in Section 1.3.

During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree. Often, a special
kind of tree called a syntax tree is used, in which each node represents an
operation and the children of a node represent the arguments of the operation.
For example, a syntax tree for an assignment statement is shown in Fig. 1.2.

                :=
               /  \
       position    +
                  / \
           initial   *
                    / \
                rate   60

Fig. 1.2. Syntax tree for position := initial + rate * 60.



Many software tools that manipulate source programs first perform some
kind of analysis. Some examples of such tools include:

Structure editors. A structure editor takes as input a sequence of commands to build a source program. The structure editor not only performs
the text-creation and modification functions of an ordinary text editor,
but it also analyzes the program text, putting an appropriate hierarchical
structure on the source program. Thus, the structure editor can perform
additional tasks that are useful in the preparation of programs. For
example, it can check that the input is correctly formed, can supply keywords automatically (e.g., when the user types while, the editor supplies
the matching do and reminds the user that a conditional must come
between them), and can jump from a begin or left parenthesis to its
matching end or right parenthesis. Further, the output of such an editor
is often similar to the output of the analysis phase of a compiler.

Pretty printers. A pretty printer analyzes a program and prints it in such
a way that the structure of the program becomes clearly visible. For
example, comments may appear in a special font, and statements may
appear with an amount of indentation proportional to the depth of their
nesting in the hierarchical organization of the statements.

Static checkers. A static checker reads a program, analyzes it, and
attempts to discover potential bugs without running the program. The
analysis portion is often similar to that found in optimizing compilers of
the type discussed in Chapter 10. For example, a static checker may
detect that parts of the source program can never be executed, or that a
certain variable might be used before being defined. In addition, it can
catch logical errors such as trying to use a real variable as a pointer,
employing the type-checking techniques discussed in Chapter 6.

Interpreters. Instead of producing a target program as a translation, an
interpreter performs the operations implied by the source program. For
an assignment statement, for example, an interpreter might build a tree
like Fig. 1.2, and then carry out the operations at the nodes as it "walks"
the tree. At the root it would discover it had an assignment to perform,
so it would call a routine to evaluate the expression on the right, and then
store the resulting value in the location associated with the identifier
position. At the right child of the root, the routine would discover it
had to compute the sum of two expressions. It would call itself recursively to compute the value of the expression rate * 60. It would then
add that value to the value of the variable initial; a sketch of such a
tree walk appears after this discussion.

Interpreters are frequently used to execute command languages, since
each operator executed in a command language is usually an invocation of
a complex routine such as an editor or compiler. Similarly, some "very
high-level" languages, like APL, are normally interpreted because there
are many things about the data, such as the size and shape of arrays, that
cannot be deduced at compile time.
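The tree walk just described can be made concrete in a few lines of C. This is a minimal sketch of our own, not the book's code: the node layout and the names Kind, Node, and eval are invented, and the variables are given arbitrary values so the program can run.

    #include <stdio.h>

    /* Node kinds for a tiny syntax tree; all names here are
       invented for this sketch. */
    enum Kind { NUM, VAR, ADD, MUL, ASSIGN };

    struct Node {
        enum Kind kind;
        double value;              /* NUM: the constant             */
        double *cell;              /* VAR, ASSIGN: storage location */
        struct Node *left, *right;
    };

    /* Evaluate a node by first evaluating its children: exactly
       the recursive walk described above. */
    double eval(struct Node *n) {
        switch (n->kind) {
        case NUM:    return n->value;
        case VAR:    return *n->cell;
        case ADD:    return eval(n->left) + eval(n->right);
        case MUL:    return eval(n->left) * eval(n->right);
        case ASSIGN: return *n->cell = eval(n->right);
        }
        return 0.0;
    }

    int main(void) {
        double position = 0.0, initial = 10.0, rate = 2.0;
        /* Tree for: position := initial + rate * 60 (Fig. 1.2) */
        struct Node n60   = { NUM, 60.0, 0, 0, 0 };
        struct Node nrate = { VAR, 0, &rate, 0, 0 };
        struct Node ninit = { VAR, 0, &initial, 0, 0 };
        struct Node mul   = { MUL, 0, 0, &nrate, &n60 };
        struct Node add   = { ADD, 0, 0, &ninit, &mul };
        struct Node asgn  = { ASSIGN, 0, &position, 0, &add };
        eval(&asgn);
        printf("position = %g\n", position);   /* prints 130 */
        return 0;
    }

At the root, eval stores into the cell for position; at the + and * nodes it calls itself recursively, just as the routine described above does.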
Traditionally, we think of a compiler as a program that translates a source
language like Fortran into the assembly or machine language of some computer. However, there are seemingly unrelated places where compiler technology is regularly used. The analysis portion in each of the following examples
is similar to that of a conventional compiler.

Text formatters. A text formatter takes input that is a stream of characters, most of which is text to be typeset, but some of which includes commands to indicate paragraphs, figures, or mathematical structures like
subscripts and superscripts. We mention some of the analysis done by
text formatters in the next section.

Silicon compilers. A silicon compiler has a source language that is similar
or identical to a conventional programming language. However, the variables of the language represent, not locations in memory, but logical signals (0 or 1) or groups of signals in a switching circuit. The output is a
circuit design in an appropriate language. See Johnson [1983], Ullman
[1984], or Trickey [1985] for a discussion of silicon compilation.

Query interpreters. A query interpreter translates a predicate containing
relational and boolean operators into commands to search a database for
records satisfying that predicate. (See Ullman [1982] or Date [1986].)


The Context of a Compiler

In addition to a compiler, several other programs may be required to create an
executable target program. A source program may be divided into modules
stored in separate files. The task of collecting the source program is sometimes entrusted to a distinct program, called a preprocessor. The preprocessor
may also expand shorthands, called macros, into source language statements.

Figure 1.3 shows a typical "compilation." The target program created by
the compiler may require further processing before it can be run. The compiler in Fig. 1.3 creates assembly code that is translated by an assembler into
machine code and then linked together with some library routines into the
code that actually runs on the machine.

We shall consider the components of a compiler in the next two sections;
the remaining programs in Fig. 1.3 are discussed in Section 1.4.

1.2 ANALYSIS OF THE SOURCE PROGRAM

In this section, we introduce analysis and illustrate its use in some text-formatting languages. The subject is treated in more detail in Chapters 2-4
and 6. In compiling, analysis consists of three phases:

1.  Linear analysis, in which the stream of characters making up the source
    program is read from left-to-right and grouped into tokens that are
    sequences of characters having a collective meaning.

    skeletal source program
            |
       preprocessor
            |
      source program
            |
        compiler
            |
    target assembly program
            |
        assembler
            |
    relocatable machine code
            |
    loader/link-editor  <--  library, relocatable object files
            |
    absolute machine code

Fig. 1.3. A language-processing system.

2.  Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections with collective meaning.

3.  Semantic analysis, in which certain checks are performed to ensure that
    the components of a program fit together meaningfully.

In a compiler, linear analysis is called lexical analysis or scanning. For example, in lexical analysis the characters in the assignment statement

    position := initial + rate * 60

would be grouped into the following tokens:

1. The identifier position.
2. The assignment symbol :=.
3. The identifier initial.
4. The plus sign.
5. The identifier rate.
6. The multiplication sign.
7. The number 60.

The blanks separating the characters of these tokens would normally be eliminated during lexical analysis.
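The following C fragment, a hypothetical sketch rather than the book's scanner, groups the characters of this statement into tokens and prints one per line; the token names and output format are invented, and the symbol-table entry a real lexical analyzer would make is omitted.

    #include <ctype.h>
    #include <stdio.h>

    int main(void) {
        const char *p = "position := initial + rate * 60";
        while (*p) {
            if (isspace((unsigned char)*p)) { p++; continue; }
            if (isalpha((unsigned char)*p)) {          /* identifier */
                const char *start = p;
                while (isalnum((unsigned char)*p)) p++;
                printf("id      %.*s\n", (int)(p - start), start);
            } else if (isdigit((unsigned char)*p)) {   /* number */
                const char *start = p;
                while (isdigit((unsigned char)*p)) p++;
                printf("number  %.*s\n", (int)(p - start), start);
            } else if (p[0] == ':' && p[1] == '=') {   /* assignment */
                printf("assign  :=\n");
                p += 2;
            } else {                                   /* one-char operator */
                printf("op      %c\n", *p);
                p++;
            }
        }
        return 0;
    }

Note that the blanks are simply skipped, as the preceding paragraph describes.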


Syntax Analysis

Hierarchical analysis is called parsing or syntax analysis. It involves grouping
the tokens of the source program into grammatical phrases that are used by
the compiler to synthesize output. Usually, the grammatical phrases of the
source program are represented by a parse tree such as the one shown in Fig.
1.4.

Fig. 1.4. Parse tree for position := initial + rate * 60.

In the expression initial + rate * 60, the phrase rate * 60 is a logical unit because the usual conventions of arithmetic expressions tell us that
multiplication is performed before addition. Because the expression
initial + rate is followed by a *, it is not grouped into a single phrase by
itself in Fig. 1.4.

The hierarchical structure of a program is usually expressed by recursive
rules. For example, we might have the following rules as part of the definition of expressions:

1.  Any identifier is an expression.
2.  Any number is an expression.
3.  If expression1 and expression2 are expressions, then so are

        expression1 + expression2
        expression1 * expression2
        ( expression1 )

Rules (1) and (2) are (nonrecursive) basis rules, while (3) defines expressions
in terms of operators applied to other expressions. Thus, by rule (1), initial and rate are expressions. By rule (2), 60 is an expression, while by
rule (3), we can first infer that rate * 60 is an expression and finally that
initial + rate * 60 is an expression.

Similarly, many languages define statements recursively by rules such as:
Similarly, many Ianguagei; define statements recursively by rules such as:



1.  If identifier1 is an identifier, and expression2 is an expression, then

        identifier1 := expression2

    is a statement.

2.  If expression1 is an expression and statement2 is a statement, then

        while ( expression1 ) do statement2
        if ( expression1 ) then statement2

    are statements.

The division between lexical and syntactic analysis is somewhat arbitrary.
We usually choose a division that simplifies the overall task of analysis. One
factor in determining the division is whether a source language construct is
inherently recursive or not. Lexical constructs do not require recursion, while
syntactic constructs often do. Context-free grammars are a formalization of
recursive rules that can be used to guide syntactic analysis. They are introduced in Chapter 2 and studied extensively in Chapter 4.

For example, recursion is not required to recognize identifiers, which are
typically strings of letters and digits beginning with a letter. We would normally recognize identifiers by a simple scan of the input stream, waiting until
a character that was neither a letter nor a digit was found, and then grouping
all the letters and digits found up to that point into an identifier token. The
characters so grouped are recorded in a table, called a symbol table, and
removed from the input so that processing of the next token can begin.

On the other hand, this kind of linear scan is not powerful enough to
analyze expressions or statements. For example, we cannot properly match
parentheses in expressions, or begin and end in statements, without putting
some kind of hierarchical or nesting structure on the input.
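To see how recursion supplies that structure, here is a small recursive-descent recognizer in C for the expression rules given earlier, with * binding tighter than +. It is a sketch of our own for illustration (Chapter 4 treats parsing systematically); the function names and the error handling are invented.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *p;          /* current input position */

    static void expr(void);

    static void error(const char *msg) {
        fprintf(stderr, "syntax error: %s at '%s'\n", msg, p);
        exit(1);
    }

    static void skip(void) { while (isspace((unsigned char)*p)) p++; }

    /* factor -> identifier | number | ( expr ) : rules (1)-(3) */
    static void factor(void) {
        skip();
        if (isalpha((unsigned char)*p)) {
            while (isalnum((unsigned char)*p)) p++;   /* identifier */
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;   /* number */
        } else if (*p == '(') {
            p++; expr(); skip();
            if (*p++ != ')') error("missing )");
        } else error("expected identifier, number, or (");
    }

    /* term -> factor { * factor } */
    static void term(void) {
        factor(); skip();
        while (*p == '*') { p++; factor(); skip(); }
    }

    /* expr -> term { + term } : this recursion is what a linear
       scan cannot provide. */
    static void expr(void) {
        term(); skip();
        while (*p == '+') { p++; term(); skip(); }
    }

    int main(void) {
        p = "initial + rate * 60";
        expr(); skip();
        puts(*p ? "trailing input" : "accepted");
        return 0;
    }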

(a)
                :=
               /  \
       position    +
                  / \
           initial   *
                    / \
                rate   60

(b)
                :=
               /  \
       position    +
                  / \
           initial   *
                    / \
                rate   inttoreal
                           |
                           60

Fig. 1.5. Semantic analysis inserts a conversion from integer to real.

The parse tree in Fig. 1.4 describes the syntactic structure of the input. A
more common internal representation of this syntactic structure is given by the
syntax tree in Fig. 1.5(a). A syntax tree is a compressed representation of the
parse tree in which the operators appear as the interior nodes, and the
operands of an operator are the children of the node for that operator. The
construction of trees such as the one in Fig. 1.5(a) is discussed in Section 5.2.


We shall take up in Chapter 2, and in more detail in Chapter 5, the subject of
syntax-directed translation, in which the compiler uses the hierarchical structure on the input to help generate the output.
Semantic Analysis

The semantic analysis phase checks the source program for semantic errors
and gathers type information for the subsequent code-generation phase. It
uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.

An important component of semantic analysis is type checking. Here the
compiler checks that each operator has operands that are permitted by the
source language specification. For example, many programming language
definitions require a compiler to report an error every time a real number is
used to index an array. However, the language specification may permit some
operand coercions, for example, when a binary arithmetic operator is applied
to an integer and real. In this case, the compiler may need to convert the
integer to a real. Type checking and semantic analysis are discussed in
Chapter 6.

Example 1.1. Inside a machine, the bit pattern representing an integer is generally different from the bit pattern for a real, even if the integer and the real
number happen to have the same value. Suppose, for example, that all identifiers in Fig. 1.5 have been declared to be reals and that 60 by itself is
assumed to be an integer. Type checking of Fig. 1.5(a) reveals that * is
applied to a real, rate, and an integer, 60. The general approach is to convert the integer into a real. This has been achieved in Fig. 1.5(b) by creating
an extra node for the operator inttoreal that explicitly converts an integer into
a real. Alternatively, since the operand of inttoreal is a constant, the compiler may instead replace the integer constant by an equivalent real constant.
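The insertion of inttoreal can be sketched as follows in C. This is an invented illustration, not the book's algorithm: the two-type system, the node layout, and the names check and coerce are all assumptions of the example.

    #include <stdio.h>
    #include <stdlib.h>

    enum Type { T_INT, T_REAL };
    enum Op   { NUM, VAR, ADD, MUL, INTTOREAL };

    struct Node {
        enum Op op;
        enum Type type;
        struct Node *left, *right;
    };

    static struct Node *mknode(enum Op op, enum Type t,
                               struct Node *l, struct Node *r) {
        struct Node *n = malloc(sizeof *n);
        n->op = op; n->type = t; n->left = l; n->right = r;
        return n;
    }

    /* Wrap an integer-typed node in an inttoreal node. */
    static struct Node *coerce(struct Node *n) {
        if (n->type == T_INT)
            return mknode(INTTOREAL, T_REAL, n, NULL);
        return n;
    }

    /* Type-check an arithmetic node bottom-up, inserting a
       conversion when the operand types disagree. */
    static void check(struct Node *n) {
        if (n->op != ADD && n->op != MUL) return;   /* leaves done */
        check(n->left); check(n->right);
        if (n->left->type != n->right->type) {      /* mixed int/real */
            n->left  = coerce(n->left);
            n->right = coerce(n->right);
        }
        n->type = n->left->type;
    }

    int main(void) {
        /* rate * 60, with rate declared real and 60 an integer */
        struct Node *n = mknode(MUL, T_REAL,
                                mknode(VAR, T_REAL, NULL, NULL),
                                mknode(NUM, T_INT,  NULL, NULL));
        check(n);
        printf("right child is %s\n",
               n->right->op == INTTOREAL ? "inttoreal(60)" : "60");
        return 0;
    }

Running check on the tree of Fig. 1.5(a) yields the tree of Fig. 1.5(b).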
Analysis in Text Formatters

It is useful to regard the input to a text formatter as specifying a hierarchy of
boxes: rectangular regions to be filled by some bit pattern, representing light and dark pixels to be printed by the output device. For example, the TeX
system (Knuth [1984a]) views its input this way.
Each character that is not part of a command represents a box containing the
bit pattern for that character in the appropriate font and size. Consecutive
characters not separated by "white space" (blanks or newline characters) are
grouped into words, consisting of a sequence of horizontally arranged boxes,
shown schematically in Fig. 1.6. The grouping of characters into words (or
commands) is the linear or lexical aspect of analysis in a text formatter.

Fig. 1.6. Grouping of characters and words into boxes.

Boxes in TeX may be built from smaller boxes by arbitrary horizontal and
vertical combinations. For example, the \hbox operator
groups a list of boxes by juxtaposing them horizontally, while the \vbox
operator similarly groups a list of boxes by vertical juxtaposition. Combining
the two yields arrangements of boxes such as the one shown in Fig. 1.7. Determining the
hierarchical arrangement of boxes implied by the input is part of syntax
analysis in TeX.

Fig. 1.7. Hierarchy of boxes in TeX.

As another example, the preprocessor EQN for mathematics (Kernighan
and Cherry [1975]), or the mathematical processor in TeX, builds mathematical expressions from operators like sub and sup for subscripts and superscripts. If EQN encounters an input text of the form

    BOX sub box

it shrinks the size of box and attaches it to BOX near the lower right corner,
as illustrated in Fig. 1.8. The sup operator similarly attaches box at the
upper right.

Fig. 1.8. Building the subscript structure in mathematical text.

These operators can be applied recursively, so, for example, the EQN text

    a sub {i sup 2}

results in a_{i^2}. Grouping the operators sub and sup into tokens is part of the
lexical analysis of EQN text. However, the syntactic structure of the text is
needed to determine the size and placement of a box.
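For comparison, the same recursive subscript-superscript structure can be written in LaTeX, where _ and ^ play the roles of sub and sup; this analogue is ours, not the book's.

    \documentclass{article}
    \begin{document}
    % The EQN input "a sub {i sup 2}" corresponds to:
    $a_{i^{2}}$
    \end{document}

As with EQN, the nesting of the braces, not the order of the tokens alone, determines the size and placement of each box.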
1.3 THE PHASES OF A COMPILER

Conceptually, a compiler operates in phases, each of which transforms the
source program from one representation to another. A typical decomposition
of a compiler is shown in Fig. 1.9. In practice, some of the phases may be
grouped together, as mentioned in Section 1.5, and the intermediate representations between the grouped phases need not be explicitly constructed.

    source program
          |
    lexical analyzer
          |
    syntax analyzer
          |
    semantic analyzer
          |
    intermediate code generator
          |
    code optimizer
          |
    code generator
          |
    target program

    (The symbol-table manager and the error handler interact
    with all six phases.)

Fig. 1.9. Phases of a compiler.

The first three phases, forming the bulk of the analysis portion of a compiler, were introduced in the last section. Two other activities, symbol-table
management and error handling, are shown interacting with the six phases of
lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. Informally, we shall also call
the symbol-table manager and the error handler "phases."



Symbol-Table Management

An essential function of a compiler is to record the identifiers used in the
source program and collect information about various attributes of each identifier. These attributes may provide information about the storage allocated
for an identifier, its type, its scope (where in the program it is valid), and, in
the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (e.g., by reference), and the type
returned, if any.

A symbol table is a data structure containing a record for each identifier,
with fields for the attributes of the identifier. The data structure allows us to
find the record for each identifier quickly and to store or retrieve data from
that record quickly. Symbol tables are discussed in Chapters 2 and 7.

When an identifier in the source program is detected by the lexical
analyzer, the identifier is entered into the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis.
For example, in a Pascal declaration like

    var position, initial, rate : real ;

the type real is not known when position, initial, and rate are seen by
the lexical analyzer.

The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways. For example, when
doing semantic analysis and intermediate code generation, we need to know
what the types of identifiers are, so we can check that the source program
uses them in valid ways, and so that we can generate the proper operations on
them. The code generator typically enters and uses detailed information about
the storage assigned to identifiers.
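As a sketch of this behavior, here is a toy symbol table in C: the lexical analyzer enters names before their types are known, and a later phase fills the type in. The linked-list layout and the name lookup_or_insert are assumptions of this example; real tables are organized for fast lookup (Chapters 2 and 7).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct Entry {
        char name[32];        /* the lexeme                */
        char type[16];        /* filled in by later phases */
        struct Entry *next;
    };

    static struct Entry *table = NULL;

    /* Return the entry for name, inserting one with an unknown
       type if it is not already present. */
    struct Entry *lookup_or_insert(const char *name) {
        struct Entry *e;
        for (e = table; e; e = e->next)
            if (strcmp(e->name, name) == 0) return e;
        e = malloc(sizeof *e);
        snprintf(e->name, sizeof e->name, "%s", name);
        strcpy(e->type, "unknown");
        e->next = table;
        table = e;
        return e;
    }

    int main(void) {
        /* The lexical analyzer enters the identifiers; the type
           is not yet known when they are first seen. */
        lookup_or_insert("position");
        lookup_or_insert("initial");
        lookup_or_insert("rate");
        /* Later, semantic analysis records the declared type. */
        strcpy(lookup_or_insert("rate")->type, "real");
        printf("rate: %s\n", lookup_or_insert("rate")->type);
        return 0;
    }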

Error Detection and Reporting

Each phase can encounter errors. However, after detecting an error, a phase
must somehow deal with that error, so that compilation can proceed, allowing
further errors in the source program to be detected. A compiler that stops
when it finds the first error is not as helpful as it could be.

The syntax and semantic analysis phases usually handle a large fraction of
the errors detectable by the compiler. The lexical phase can detect errors
where the characters remaining in the input do not form any token of the
language. Errors where the token stream violates the structure rules (syntax)
of the language are determined by the syntax analysis phase. During semantic
analysis the compiler tries to detect constructs that have the right syntactic
structure but no meaning to the operation involved, e.g., if we try to add two
identifiers, one of which is the name of an array, and the other the name of a
procedure. We discuss the handling of errors by each phase in the part of the
book devoted to that phase.


The Analysis Phases

As translation progresses, the compiler's internal representation of the source
program changes. We illustrate these representations by considering the
translation of the statement

    position := initial + rate * 60                        (1.1)

Figure 1.10 shows the representation of this statement after each phase.

The lexical analysis phase reads the characters in the source program and
groups them into a stream of tokens in which each token represents a logically
cohesive sequence of characters, such as an identifier, a keyword (if, while,
etc.), a punctuation character, or a multi-character operator like :=. The
character sequence forming a token is called the lexeme for the token.

Certain tokens will be augmented by a "lexical value." For example, when
an identifier like rate is found, the lexical analyzer not only generates a
token, say id, but also enters the lexeme rate into the symbol table, if it is
not already there. The lexical value associated with this occurrence of id
points to the symbol-table entry for rate.

In this section, we shall use id1, id2, and id3 for position, initial, and
rate, respectively, to emphasize that the internal representation of an identifier is different from the character sequence forming the identifier. The
representation of (1.1) after lexical analysis is therefore suggested by:

    id1 := id2 + id3 * 60                                  (1.2)

We should also make up tokens for the multi-character operator := and the
number 60 to reflect their internal representation, but we defer that until
Chapter 2. Lexical analysis is covered in detail in Chapter 3.

The second and third phases, syntax and semantic analysis, have also been
introduced in Section 1.2. Syntax analysis imposes a hierarchical structure on
the token stream, which we shall portray by syntax trees as in Fig. 1.11(a). A
typical data structure for the tree is shown in Fig. 1.11(b) in which an interior
node is a record with a field for the operator and two fields containing
pointers to the records for the left and right children. A leaf is a record with
two or more fields, one to identify the token at the leaf, and the others to
record information about the token. Additional information about language
constructs can be kept by adding more fields to the records for nodes. We
discuss syntax and semantic analysis in Chapters 4 and 6, respectively.

Intermediate Code Generation

After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program. We can think of this intermediate representation as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to
produce, and easy to translate into the target program.

The intermediate representation can have a variety of forms.

    position := initial + rate * 60
            |
            |  lexical analyzer
            v
    id1 := id2 + id3 * 60
            |
            |  syntax analyzer
            v
           :=
          /  \
       id1    +
             / \
          id2   *
               / \
            id3   60
            |
            |  semantic analyzer
            v
           :=
          /  \
       id1    +
             / \
          id2   *
               / \
            id3   inttoreal
                      |
                      60
            |
            |  intermediate code generator
            v
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1   := temp3
            |
            |  code optimizer
            v
    temp1 := id3 * 60.0
    id1   := id2 + temp1
            |
            |  code generator
            v
    MOVF  id3, R2
    MULF  #60.0, R2
    MOVF  id2, R1
    ADDF  R2, R1
    MOVF  R1, id1

    SYMBOL TABLE
      1  position  ...
      2  initial   ...
      3  rate      ...
      4            ...

Fig. 1.10. Translation of a statement.



Fig. 1.11. The data structure in (b) is for the tree in (a).
In Chapter 8, we consider an intermediate form called "three-address code," which is like
the assembly language for a machine in which every memory location can act
like a register. Three-address code consists of a sequence of instructions, each
of which has at most three operands. The source program in (1.1) might
appear in three-address code as

    temp1 := inttoreal(60)
    temp2 := id3 * temp1                                   (1.3)
    temp3 := id2 + temp2
    id1   := temp3

This intermediate form has several properties. First, each three-address
instruction has at most one operator in addition to the assignment. Thus,
when generating these instructions, the compiler has to decide on the order in
which operations are to be done; the multiplication precedes the addition in
the source program of (1.1). Second, the compiler must generate a temporary
name to hold the value computed by each instruction. Third, some "three-address" instructions have fewer than three operands, e.g., the first and last
instructions in (1.3).

In Chapter 8, we cover the principal intermediate representations used in
compilers. In general, these representations must do more than compute
expressions; they must also handle flow-of-control constructs and procedure
calls. Chapters 5 and 8 present algorithms for generating intermediate code
for typical programming language constructs.
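To suggest how such code is produced, here is a small C sketch of our own (not the book's generator) that emits one three-address instruction per operator during a postorder walk of a syntax tree, inventing a new temporary for each result; for brevity it uses the real constant 60.0 directly rather than emitting inttoreal.

    #include <stdio.h>

    enum Kind { LEAF, ADD, MUL };

    struct Node {
        enum Kind kind;
        const char *name;            /* LEAF: identifier or constant */
        struct Node *left, *right;
    };

    static int ntemps = 0;

    /* Return the name holding the node's value; for an interior
       node, emit an instruction computing it into a fresh temp. */
    static const char *gen(struct Node *n, char *buf) {
        if (n->kind == LEAF) return n->name;
        char lbuf[16], rbuf[16];
        const char *l = gen(n->left, lbuf);
        const char *r = gen(n->right, rbuf);
        sprintf(buf, "temp%d", ++ntemps);
        printf("%s := %s %c %s\n",
               buf, l, n->kind == ADD ? '+' : '*', r);
        return buf;
    }

    int main(void) {
        /* Tree for id2 + id3 * 60.0 */
        struct Node sixty = { LEAF, "60.0", 0, 0 };
        struct Node id3   = { LEAF, "id3",  0, 0 };
        struct Node id2   = { LEAF, "id2",  0, 0 };
        struct Node mul   = { MUL, 0, &id3, &sixty };
        struct Node add   = { ADD, 0, &id2, &mul };
        char buf[16];
        printf("id1 := %s\n", gen(&add, buf));
        return 0;
    }

Because the walk visits the * node before the + node, the multiplication is emitted first, exactly the ordering decision discussed above.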


Code Optimization

The code optimization phase attempts to improve the intermediate code, so
that faster-running machine code will result. Some optimizations are trivial.
For example, a natural algorithm generates the intermediate code (1.3), using
an instruction for each operator in the tree representation after semantic
analysis, even though there is a better way to perform the same calculation,
using the two instructions

    temp1 := id3 * 60.0                                    (1.4)
    id1   := id2 + temp1

There is nothing wrong with this simple algorithm, since the problem can be
fixed during the code-optimization phase. That is, the compiler can deduce
that the conversion of 60 from integer to real representation can be done once
and for all at compile time, so the inttoreal operation can be eliminated.
Besides, temp3 is used only once, to transmit its value to id1. It then
becomes safe to substitute id1 for temp3, whereupon the last statement of
(1.3) is not needed and the code of (1.4) results.

There is great variation in the amount of code optimization different compilers perform. In those that do the most, called "optimizing compilers," a
significant fraction of the time of the compiler is spent on this phase. However, there are simple optimizations that significantly improve the running
time of the target program without slowing down compilation too much.
Many of these are discussed in Chapter 9, while Chapter 10 gives the technology used by the most powerful optimizing compilers.

Code Generation

The final phase of the compiler is the generation of target code, consisting
normally of relocatable machine code or assembly code. Memory locations
are selected for each of the variables used by the program. Then, intermediate instructions are each translated into a sequence of machine instructions
that perform the same task. A crucial aspect is the assignment of variables to
registers.

For example, using registers 1 and 2, the translation of the code of (1.4)
might become

    MOVF  id3, R2
    MULF  #60.0, R2
    MOVF  id2, R1                                          (1.5)
    ADDF  R2, R1
    MOVF  R1, id1

The first and second operands of each instruction specify a source and destination, respectively. The F in each instruction tells us that instructions deal with
floating-point numbers. This code moves the contents of the address id3
into register 2, then multiplies it with the real constant 60.0. The # signifies
that 60.0 is to be treated as a constant. The third instruction moves id2 into
register 1 and adds to it the value previously computed in register 2. Finally,
the value in register 1 is moved into the address of id1, so the code implements the assignment in Fig. 1.10. Chapter 9 covers code generation.



1.4 COUSINS OF THE COMPILER

As we saw in Fig. 1.3, the input to a compiler may be produced by one or
more preprocessors, and further processing of the compiler's output may be
needed before running machine code is obtained. In this section, we discuss
the context in which a compiler typically operates.

Preprocessors

Preprocessors produce input to compilers. They may perform the following
functions:

1.  Macro processing. A preprocessor may allow a user to define macros that
    are shorthands for longer constructs.

2.  File inclusion. A preprocessor may include header files into the program
    text. For example, the C preprocessor causes the contents of the file
    <global.h> to replace the statement #include <global.h> when it
    processes a file containing this statement.

3.  "Rational" preprocessors. These processors augment older languages
    with more modern flow-of-control and data-structuring facilities. For
    example, such a preprocessor might provide the user with built-in macros
    for constructs like while-statements or if-statements, where none exist in
    the programming language itself.

4.  Language extensions. These processors attempt to add capabilities to the
    language by what amounts to built-in macros. For example, the language
    Equel (Stonebraker et al. [1976]) is a database query language embedded
    in C. Statements beginning with ## are taken by the preprocessor to be
    database-access statements, unrelated to C, and are translated into procedure calls on routines that perform the database access.
Macro processors deal with two kinds of statement: macro definition and
macro use. Definitions are normally indicated by some unique character or
keyword, like define or macro. They consist of a name for the macro
being defined and a body, forming its definition. Often, macro processors
permit formal parameters in their definition, that is, symbols to be replaced by
values (a "value" is a string of characters, in this context). The use of a
macro consists of naming the macro and supplying actual parameters, that is,
values for its formal parameters. The macro processor substitutes the actual
parameters for the formal parameters in the body of the macro; the
transformed body then replaces the macro use itself.
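The C preprocessor shows definition and use side by side. The macro below is our own illustration, not an example from the book; x is the formal parameter and 5 the actual parameter supplied at the point of use.

    #include <stdio.h>

    /* Macro definition: name SQUARE, formal parameter x. */
    #define SQUARE(x) ((x) * (x))

    int main(void) {
        /* Macro use: the actual parameter 5 is substituted for x,
           so the line below expands to ((5) * (5)). */
        printf("%d\n", SQUARE(5));
        return 0;
    }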

Example 1.2. The TeX typesetting system mentioned in Section 1.2 contains a
general macro facility. Macro definitions take the form

    \define <macro name> <template> {<body>}

A macro name is any string of letters preceded by a backslash. The template
is any string of characters, with strings of the form #1, #2, ..., #9
regarded as formal parameters. These symbols may also appear in the body,
any number of times. For example, the following macro defines a citation for
the Journal of the ACM.

    \define\JACM #1;#2;#3.
    {{\sl J. ACM} {\bf #1}:#2, pp. #3.}

The macro name is \JACM, and the template is "#1;#2;#3."; semicolons
separate the parameters and the last parameter is followed by a period. A use
of this macro must take the form of the template, except that arbitrary strings
may be substituted for the formal parameters.2 Thus, we may write

    \JACM 17;4;715-728.

and expect to see

    J. ACM 17:4, pp. 715-728.

The portion of the body {\sl J. ACM} calls for an italicized ("slanted") "J.
ACM". Expression {\bf #1} says that the first actual parameter is to be
made boldface; this parameter is intended to be the volume number.

TeX allows any punctuation or string of text to separate the volume, issue,
and page numbers in the definition of the \JACM macro. We could even have
used no punctuation at all, in which case TeX would take each actual parameter to be a single character or a string surrounded by { }.


Assemblers

Some compilers produce assembly code, as in (1.5), that is passed to an
assembler for further processing. Other compilers perform the job of the
assembler, producing relocatable machine code that can be passed directly to
the loader/link-editor. We assume the reader has some familiarity with what
an assembly language looks like and what an assembler does; here we shall
review the relationship between assembly and machine code.

Assembly code is a mnemonic version of machine code, in which names are
used instead of binary codes for operations, and names are also given to
memory addresses. A typical sequence of assembly instructions might be

    MOV a, R1
    ADD #2, R1
    MOV R1, b

This code moves the contents of the address a into register 1, then adds the
constant 2 to it, treating the contents of register 1 as a fixed-point number,
and finally stores the result in the location named by b.
2 Well, almost arbitrary strings, since a simple left-to-right scan of the macro use is made, and as
soon as a symbol matching the text following a #i symbol in the template is found, the preceding
string is deemed to match #i. Thus, if we tried to substitute ab;cd for #1, we would find that
only ab matched #1 and cd was matched to #2.

