Chapter 4
Lexical and Syntax
Analysis
ISBN 0-321-33025-0
Chapter 4 Topics
• Introduction
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-Up Parsing
Copyright © 2006 Addison-Wesley. All rights reserved. 1-2
Introduction
• Language implementation systems must analyze
source code, regardless of the specific
implementation approach
• Nearly all syntax analysis is based on a formal
description of the syntax of the source language
(BNF)
Using BNF to Describe Syntax
• Provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain
Syntax Analysis
• The syntax analysis portion of a language
processor nearly always consists of two parts:
– A low-level part called a lexical analyzer
(mathematically, a finite automaton based on a
regular grammar)
– A high-level part called a syntax analyzer, or
parser (mathematically, a push-down automaton
based on a context-free grammar, or BNF)
Reasons to Separate Lexical and Syntax
Analysis
• Simplicity - less complex approaches can be
used for lexical analysis; separating them
simplifies the parser
• Efficiency - separation allows optimization of
the lexical analyzer
• Portability - parts of the lexical analyzer may
not be portable, but the parser is always
portable
Lexical Analysis
• A lexical analyzer is a pattern matcher for
character strings
• A lexical analyzer is a “front-end” for the parser
• Identifies substrings of the source program that
belong together - lexemes
– Lexemes match a character pattern, which is
associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Example
sum = oldsum - value / 100;
Token          Lexeme
IDENT          sum
ASSIGN_OP      =
IDENT          oldsum
SUBTRACT_OP    -
IDENT          value
DIVISION_OP    /
INT_LIT        100
SEMICOLON      ;
Lexical Analysis (cont.)
• The lexical analyzer is usually a function that is
called by the parser when it needs the next token
• The lexical analysis process also:
– Includes skipping comments, tabs, newlines, and
blanks
– Inserts lexemes for user-defined names (strings,
identifiers, numbers) into the symbol table
– Saves source locations (file, line, column) for error
messages
– Detects and reports syntactic errors in tokens, such
as ill-formed floating-point literals, to the user
Pragmas
• Provide directives or hints to the compiler
• Directives:
– Turn various kinds of run-time checks on or off
– Turn certain code improvements on or off (performance vs
compilation speed)
– Turn performance profiling on or off
• Hints:
– Variable x is very heavily used (to keep it in a register)
– Subroutine S is not recursive (its storage may be statically
allocated)
– 32 bits of precision (instead of 64) suffice for floating-point
variable x
• Lexical analysis is often responsible for dealing with
pragmas
Lexical Analysis (cont.)
• Three main approaches to building a scanner:
1. Write a formal description of the tokens and use a
software tool that constructs lexical analyzers
given such a description
2. Design a state diagram that describes the token
patterns and write a program that implements the
diagram*
3. Design a state diagram that describes the token
patterns and hand-construct a table-driven
implementation of the state diagram
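The third approach can be sketched in C as follows. This is only an illustration under stated assumptions: the states, character classes, token codes, and the names classify and scan_token are hypothetical and chosen for a tiny scanner that recognizes just identifiers and integer literals. The state diagram lives entirely in the transition table; the driver loop never changes.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical character classes, states, and token codes. */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };
enum { S_START, S_IDENT, S_INT, S_ERROR, NUM_STATES };
enum { TOK_IDENT, TOK_INT, TOK_NONE };

/* transition[state][class] gives the next state; S_ERROR means
   "the current token cannot be extended". */
static const int transition[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IDENT, S_INT,   S_ERROR },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_ERROR },
    /* S_INT   */ { S_ERROR, S_INT,   S_ERROR },
    /* S_ERROR */ { S_ERROR, S_ERROR, S_ERROR },
};

static int classify(char c) {
    if (isalpha((unsigned char)c)) return C_LETTER;
    if (isdigit((unsigned char)c)) return C_DIGIT;
    return C_OTHER;   /* includes the terminating '\0' */
}

/* Scans one token starting at s. Stores the lexeme length in *len and
   returns the token code (TOK_NONE if no token starts at s). */
int scan_token(const char *s, size_t *len) {
    int state = S_START;
    size_t i = 0;
    for (;;) {
        int next = transition[state][classify(s[i])];
        if (next == S_ERROR) break;   /* token cannot be continued */
        state = next;
        i++;
    }
    *len = i;
    if (state == S_IDENT) return TOK_IDENT;
    if (state == S_INT)   return TOK_INT;
    return TOK_NONE;
}
```

Adding a new token kind in this style means adding rows and columns to the table rather than new control flow.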
The “longest possible token” rule
• The scanner returns to the parser only when the
next character cannot be used to continue the
current token
– The next character will generally need to be saved
for the next token
• In some cases, you may need to peek at more
than one character of look-ahead in order to
know whether to proceed
– In Pascal, when you have a 3 and you see a ‘.’
• do you proceed (in hopes of getting 3.14)? or
• do you stop (in fear of getting 3..5)?
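The Pascal case can be sketched in C as follows. The function name scan_number and the codes NUM_INT and NUM_REAL are hypothetical, and the sketch ignores exponents; it only shows that a '.' is consumed when the character after it is a digit, which takes two characters of look-ahead.

```c
#include <ctype.h>
#include <stddef.h>

enum { NUM_INT, NUM_REAL };

/* Scans a numeric literal at s (s[0] is assumed to be a digit).
   Stores the lexeme length in *len and returns NUM_INT or NUM_REAL.
   A '.' is consumed only when a digit follows it, so "3..5" yields
   the integer 3 and leaves ".." for the next token. */
int scan_number(const char *s, size_t *len) {
    size_t i = 0;
    while (isdigit((unsigned char)s[i])) i++;
    /* Two characters of look-ahead: s[i] and s[i + 1]. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;                                   /* consume the '.' */
        while (isdigit((unsigned char)s[i])) i++;
        *len = i;
        return NUM_REAL;
    }
    *len = i;
    return NUM_INT;
}
```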
The rule …
• In messier cases, you may not be able to get by
with any fixed amount of look-ahead. In Fortran,
for example, we have
DO 5 I = 1,25 loop
DO 5 I = 1.25 assignment
• Here, we need to remember we were in a
potentially final state, and save enough
information that we can back up to it, if we get
stuck later
State Diagram Design
• Suppose we need a lexical analyzer that only
recognizes program names, reserved words, and
integer literals
• A naïve state diagram would have a transition
from every state on every character in the source
language - such a diagram would be very large!
State Diagram Design (cont.)
• In many cases, transitions can be combined to
simplify the state diagram
– When recognizing an identifier, all uppercase and
lowercase letters are equivalent - use a
character class
– When recognizing an integer literal, all digits are
equivalent - use a digit class
– Reserved words and identifiers can be recognized
together (rather than having a part of the diagram
for each reserved word)
State Diagram Design (cont.)
• Convenient utility subprograms:
– getChar - gets the next character of input, puts
it in global variable nextChar, determines its
class and puts the class in global variable
charClass
– addChar - puts the character from nextChar into
the place the lexeme (global variable) is being
accumulated
– lookup - determines whether the string in
lexeme is a reserved word (returns a code)
State Diagram
Lexical Analysis - Implementation
int lex() {
  getChar();
  switch (charClass) {
    case LETTER:
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      return lookup(lexeme);
      break;
    …
Lexical Analysis - Implementation
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
  } /* End of switch */
} /* End of function lex() */
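A self-contained version of this scanner might look like the sketch below. Several pieces are assumptions rather than part of the slides: input comes from a string set with the hypothetical setInput, the token code values and the two reserved words are arbitrary, and an ungetChar helper is added so that the look-ahead character read at the end of one token is seen again by the next call to lex() instead of being skipped.

```c
#include <ctype.h>
#include <string.h>

/* Character classes and token codes (values are arbitrary). */
enum { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };
enum { INT_LIT = 10, IDENT = 11, FOR_CODE = 30, IF_CODE = 31,
       UNKNOWN_TOKEN = 98, EOF_TOKEN = 99 };

static const char *input = "";   /* hypothetical input source */
static int  charClass;
static char nextChar;
static char lexeme[100];
static int  lexLen;

void setInput(const char *text) { input = text; }

/* Gets the next character and determines its class. */
static void getChar(void) {
    nextChar = *input;
    if (nextChar == '\0') { charClass = END_OF_INPUT; return; }
    input++;
    if (isalpha((unsigned char)nextChar))      charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else                                       charClass = UNKNOWN;
}

/* Puts the look-ahead character back for the next lex() call. */
static void ungetChar(void) {
    if (charClass != END_OF_INPUT) input--;
}

/* Appends nextChar to the lexeme being accumulated. */
static void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* Returns a reserved-word code, or IDENT for any other name. */
static int lookup(const char *s) {
    if (strcmp(s, "for") == 0) return FOR_CODE;
    if (strcmp(s, "if") == 0)  return IF_CODE;
    return IDENT;
}

int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    do { getChar(); }                       /* skip white space */
    while (charClass == UNKNOWN && isspace((unsigned char)nextChar));
    switch (charClass) {
    case LETTER:
        addChar();
        getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar();
            getChar();
        }
        ungetChar();
        return lookup(lexeme);
    case DIGIT:
        addChar();
        getChar();
        while (charClass == DIGIT) {
            addChar();
            getChar();
        }
        ungetChar();
        return INT_LIT;
    case END_OF_INPUT:
        return EOF_TOKEN;
    default:
        addChar();
        return UNKNOWN_TOKEN;
    }
}
```

Reserved words and identifiers share the LETTER branch; only lookup distinguishes them, which is the simplification described on the state diagram design slides.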
A part of a Pascal scanner
• We read the characters one at a time with look-
ahead
• If it is one of the one-character tokens
{ ( ) [ ] < > , ; = + - }
we announce that token
• If it is a ‘.’, we look at the next character
– If that is a dot, we announce ‘..’
– Otherwise, we announce ‘.’ and reuse the look-
ahead
A part of …
• If it is a ‘<’, we look at the next character
– if that is a ‘=’ we announce ‘<=’
– otherwise, we announce ‘<’ and reuse the look-ahead,
etc.
• If it is a letter, we keep reading letters and digits
and maybe underscores until we can't anymore
– then we check to see if it is a reserved word
• If it is a digit, we keep reading until we find a
non-digit
– if that is not a ‘.’ we announce an integer
– otherwise, we keep looking for a real number
– if the character after the ‘.’ is not a digit, we announce
an integer and reuse the ‘.’ and the look-ahead
State Diagram
we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
    if that is a * we have a comment;
        we skip forward through the terminating *)
    otherwise
        we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token
if it is a . we look at the next character
    if that is a . we return ..
    otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
    if that is a = we return <=
    otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
    and maybe underscores until we can't anymore;
    then we check to see if it is a keyword
        if so we return the keyword
        otherwise we return an identifier
    in either case we reuse the character beyond the end of
    the token
if it is a digit we keep reading until we find a nondigit
    if that is not a .
        we return an integer and reuse the nondigit
    otherwise we keep looking for a real number
        if the character after the . is not a digit
            we return an integer and
            reuse the . and the look-ahead
etc.
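The punctuation cases of this outline can be sketched in C as below. The token codes and the name scan_punct are hypothetical, and only the three cases spelled out above ('.', '<', and '(') are handled; everything else is reported as T_OTHER. Instead of pushing a character back, the sketch simply reports how many characters it consumed, so an unused look-ahead character is left in place.

```c
#include <stddef.h>

enum { T_DOT, T_DOTDOT, T_LT, T_LE, T_LPAREN, T_COMMENT, T_OTHER };

/* Classifies the punctuation at s using one character of look-ahead.
   Stores the number of characters consumed in *len. */
int scan_punct(const char *s, size_t *len) {
    switch (s[0]) {
    case '.':
        if (s[1] == '.') { *len = 2; return T_DOTDOT; }
        *len = 1;                 /* the look-ahead is not consumed */
        return T_DOT;
    case '<':
        if (s[1] == '=') { *len = 2; return T_LE; }
        *len = 1;
        return T_LT;
    case '(':
        if (s[1] == '*') {        /* comment: skip through "*)" */
            size_t i = 2;
            while (s[i] != '\0' && !(s[i] == '*' && s[i + 1] == ')'))
                i++;
            /* An unterminated comment runs to the end of the input. */
            *len = (s[i] == '\0') ? i : i + 2;
            return T_COMMENT;
        }
        *len = 1;
        return T_LPAREN;
    default:
        *len = (s[0] == '\0') ? 0 : 1;
        return T_OTHER;
    }
}
```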
The Parsing Problem
• Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an
appropriate diagnostic message, and recover
quickly
– Produce the parse tree, or at least a trace of the
parse tree, for the program
The Parsing Problem (cont.)
• Two categories of parsers
– Top down - produce the parse tree, beginning at
the root
    • Order is that of a leftmost derivation
    • Traces the parse tree in preorder
– Bottom up - produce the parse tree, beginning at
the leaves
    • Order is that of the reverse of a rightmost derivation
• Parsers look only one token ahead in the input
The Set of Notational Conventions
• Terminal symbols - Lowercase letters at the
beginning of the alphabet (a, b, …)
• Nonterminal symbols - Uppercase letters at the
beginning of the alphabet (A, B, …)
• Terminals or nonterminals - Uppercase letters at
the end of the alphabet (W, X, Y, Z)
• Strings of terminals - Lowercase letters at the
end of the alphabet (w, x, y, z)
• Mixed strings (terminals and/or nonterminals) -
Lowercase Greek letters (α, β, δ, γ)
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xAα, the parser must
choose the correct A-rule to get the next
sentential form in the leftmost derivation, using
only the first token produced by A
• The most common top-down parsing
algorithms:
– Recursive descent - a coded implementation
– LL parsers - table-driven implementation (1st L
stands for left-to-right, 2nd L stands for leftmost
derivation)
The Parsing Problem (cont.)
• Bottom-up parsers
– Given a right sentential form, α, determine what
substring of α is the RHS of the rule in the
grammar that must be reduced to produce the
previous sentential form in the rightmost derivation
– The most common bottom-up parsing algorithms
are in the LR family
    • L stands for left-to-right, R stands for rightmost
    derivation
Example
• Consider the following grammar for a comma-
separated list of identifiers, terminated by a
semicolon
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
Top-down (left) and bottom-up parsing (right) of the
input string A, B, C;
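For the top-down direction, the id_list grammar can be coded with one C function per nonterminal, as sketched below. The input handling is an assumption made for illustration: an identifier is taken to be any run of letters, blanks are skipped, and the names match_id and parse_id_list are hypothetical.

```c
#include <ctype.h>

static const char *p;   /* hypothetical cursor into the input text */

static void skip_blanks(void) { while (*p == ' ') p++; }

/* Matches one identifier (a run of letters); returns 1 on success. */
static int match_id(void) {
    skip_blanks();
    if (!isalpha((unsigned char)*p)) return 0;
    while (isalpha((unsigned char)*p)) p++;
    return 1;
}

/* id_list_tail -> , id id_list_tail  |  ;
   The next character alone decides which rule applies. */
static int id_list_tail(void) {
    skip_blanks();
    if (*p == ';') { p++; return 1; }
    if (*p == ',') { p++; return match_id() && id_list_tail(); }
    return 0;
}

/* id_list -> id id_list_tail
   Returns 1 if text is exactly one valid list. */
int parse_id_list(const char *text) {
    p = text;
    if (!(match_id() && id_list_tail())) return 0;
    skip_blanks();
    return *p == '\0';
}
```

The order of the calls made while parsing A, B, C; follows a leftmost derivation, which is the order in which a top-down parser builds the tree.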
Recursive-Descent Parsing
• Recursive-Descent Process
– There is a subprogram for each nonterminal in
the grammar, which can parse sentences that can
be generated by that nonterminal
– EBNF is ideally suited for being the basis for a
recursive-descent parser, because EBNF
minimizes the number of nonterminals
• A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )
Recursive-Descent Parsing (cont.)
• Assume we have a lexical analyzer named
lex, which puts the next token code in
nextToken
• The coding process when there is only one
RHS:
– For each terminal symbol in the RHS, compare it
with the next input token; if they match,
continue, else there is an error
– For each nonterminal symbol in the RHS, call its
associated parsing subprogram
Function expr()
/* Function expr()
   Parses strings in the language
   generated by the rule:
   <expr> → <term> {(+ | -) <term>}
*/
void expr() {
  /* Parse the first term */
  term();
  …
Function expr() (cont.)
  /* As long as the next token is + or -, call
     lex() to get the next token, and parse the
     next term */
  while (nextToken == PLUS_CODE ||
         nextToken == MINUS_CODE) {
    lex();
    term();
  }
}
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS
requires an initial process to determine which
RHS it is to parse
– The correct RHS is chosen on the basis of the next
token of input
– The next token is compared with the first token
that can be generated by each RHS until a match is
found
– If no match is found, it is a syntax error
Function factor()
/* Parses strings in the language generated by
   the rule:
   <factor> -> id | (<expr>) */
void factor() {
  /* Determine which RHS */
  if (nextToken == ID_CODE)
    /* For the RHS id, just call lex */
    lex();
Function factor() (cont.)
  /* If the RHS is (<expr>) - call lex() to pass
     over the left parenthesis, call expr(), and
     check for the right parenthesis */
  else if (nextToken == LEFT_PAREN_CODE) {
    lex();
    expr();
    if (nextToken == RIGHT_PAREN_CODE)
      lex();
    else
      error();
  }
  else error(); /* Neither RHS matches */
}
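Putting expr(), term(), and factor() together gives a complete recognizer for the simple expression grammar. The sketch below fills in the parts the slides leave out, all of which are assumptions: a stand-in lex() that reads from a string and treats any run of letters as id, the extra codes MULT_CODE, DIV_CODE, END_CODE, and BAD_CODE, an error() that only counts errors, and a hypothetical parse() driver.

```c
#include <ctype.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *src;     /* hypothetical input cursor */
static int nextToken;
static int errorCount;

static void error(void) { errorCount++; }

/* Stand-in lexer: puts the next token code in nextToken. */
static void lex(void) {
    while (*src == ' ') src++;
    char c = *src;
    if (c == '\0') { nextToken = END_CODE; return; }
    if (isalpha((unsigned char)c)) {
        while (isalpha((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    src++;
    switch (c) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

static void expr(void);

/* <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE) {
        lex();
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    } else {
        error();               /* neither RHS matches */
    }
}

/* <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Returns 1 if text is a syntactically valid expression. */
int parse(const char *text) {
    src = text;
    errorCount = 0;
    lex();                     /* prime nextToken */
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

Each EBNF repetition {…} becomes a while loop, and each nonterminal on a right-hand side becomes a call, so the code can be read side by side with the grammar.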
The LL Grammar Class
• The Left Recursion Problem: If a grammar
has left recursion, either direct or indirect, it
cannot be the basis for a top-down parser
– A grammar can be modified to remove left
recursion
• Example: consider the following rule
A → A + B
– A recursive-descent parser subprogram for A
immediately calls itself to parse the first symbol
in its RHS …
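One common way to remove direct left recursion is to rewrite the rule as a repetition. The sketch below assumes, purely for illustration, that A also has a non-recursive alternative (A → A + B | B) and that B is the single terminal b; neither assumption comes from the slide. Under those assumptions the rule is equivalent to the EBNF rule A → B {+ B}, which can be coded with a loop instead of an immediate recursive call.

```c
/* Hypothetical stand-in: the nonterminal B is the single terminal 'b'. */
static const char *cur;

static int parseB(void) {
    if (*cur == 'b') { cur++; return 1; }
    return 0;
}

/* A -> B { + B }   (rewritten from the left-recursive A -> A + B | B).
   Coding A -> A + B directly would make parseA() call itself before
   consuming any input, so it would never terminate. */
static int parseA(void) {
    if (!parseB()) return 0;
    while (*cur == '+') {
        cur++;
        if (!parseB()) return 0;
    }
    return 1;
}

/* Returns 1 if text is derivable from A. */
int acceptsA(const char *text) {
    cur = text;
    return parseA() && *cur == '\0';
}
```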
Pairwise Disjointness Test
• The other characteristic of grammars that
disallows top-down parsing is the lack of
pairwise disjointness
– The inability to determine the correct RHS on the basis
of one token of lookahead
– FIRST(α) = {a | α ⇒* aβ} (If α ⇒* ε, ε ∈ FIRST(α))
• Pairwise Disjointness Test
– For each nonterminal, A, in the grammar that has
more than one RHS, for each pair of rules, A → α_i
and A → α_j, it must be true that:
FIRST(α_i) ∩ FIRST(α_j) = ∅
Example
• Example 1: A → aB | aAb
– The FIRST sets for the RHSs in these rules are {a}
and {a}, which are clearly not disjoint. So, these
rules fail the pairwise disjointness test
• Example 2: A → aB | bAb | c
– The FIRST sets for the RHSs of these rules are {a},
{b}, and {c}, which are clearly disjoint. Therefore,
these rules pass the pairwise disjointness test
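The test in these two examples can be checked mechanically. The sketch below is deliberately restricted: it assumes every RHS begins with a terminal written as a lowercase ASCII letter, as in both examples, so FIRST of an RHS is just that one letter. Computing FIRST for an RHS that begins with a nonterminal needs the whole grammar and is not attempted here. The names first_set and pairwise_disjoint are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* FIRST of an RHS that begins with a terminal (a lowercase letter),
   as a bit set over 'a'..'z'. Returns the empty set if the RHS does
   not begin with a terminal; this sketch does not handle that case. */
static unsigned first_set(const char *rhs) {
    if (islower((unsigned char)rhs[0])) return 1u << (rhs[0] - 'a');
    return 0;
}

/* Returns 1 if every pair of RHSs has disjoint FIRST sets. */
int pairwise_disjoint(const char *const rhs[], size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (first_set(rhs[i]) & first_set(rhs[j]))
                return 0;      /* FIRST sets intersect */
    return 1;
}
```

Applied to the RHSs of Example 1 (aB, aAb) the check fails, and applied to those of Example 2 (aB, bAb, c) it passes, matching the reasoning above.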
Left factoring
• This process can resolve a failure of the
pairwise disjointness test
• Example: consider the rules
<variable> → identifier | identifier [<expression>]
– The two rules can be replaced by
<variable> → identifier <new>
<new> → ε | [<expression>]
– or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)
Bottom-up Parsing
• The process of bottom-up parsing produces the
reverse of a rightmost derivation
• A bottom-up parser starts with the input
sentence and produces the sequence of
sentential forms from there until all that remains
is the start symbol
• In each step, the task of the bottom-up parser is
finding the correct RHS in a right sentential form
to reduce to get the previous right sentential
form in the derivation
Example
• Consider the following simple grammar of
arithmetic expressions
E → E + T | T
T → T * F | F
F → (E) | id
• The right sentential form E + T * id includes
three RHSs, E + T, T, and id. Only one of these
is the correct one to be rewritten
– If the RHS E + T were chosen to be rewritten in this
sentential form, the resulting sentential form would be
E * id. But E * id is not a legal right sentential form
for the given grammar
Definitions
• β is the handle of the right sentential form
γ = αβw if and only if S ⇒*_rm αAw ⇒_rm αβw
• β is a phrase of the right sentential form γ if
and only if S ⇒* γ = α_1 A α_2 ⇒+ α_1 β α_2
• β is a simple phrase of the right sentential form γ
if and only if S ⇒* γ = α_1 A α_2 ⇒ α_1 β α_2
Example: Parse Tree of Sentential Form E + T * id
• The phrases of the sentential form E + T * id
are E + T * id, T * id, and id
• The only simple phrase is id
• The handle of a rightmost sentential form is the
leftmost simple phrase
Example: Consider the string id + id * id
Sentential forms of the rightmost derivation; the number is
the reduction step, in bottom-up order, that produces the form:
E               (8)
E + T           (7)
E + T * F       (6)
E + T * id      (5)
E + F * id      (4)
E + id * id     (3)
T + id * id     (2)
F + id * id     (1)
id + id * id
Shift-Reduce Algorithms
• Reduce is the action of replacing the handle on
the top of the parse stack with its
corresponding LHS
• Shift is the action of moving the next token to
the top of the parse stack
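The two actions can be seen at work in a small table-driven parser for the E, T, F expression grammar used in the earlier bottom-up example. This is a sketch under stated assumptions, not something taken from the slides: the ACTION and GOTO tables are an SLR(1) table for that grammar written out by hand, the stack holds only parser states, and the names next_token and lr_parse are hypothetical. Any run of letters is treated as the token id.

```c
#include <ctype.h>

enum { T_ID, T_PLUS, T_STAR, T_LP, T_RP, T_END, T_BAD };
#define ACC 100

/* ACTION[state][token]: n > 0 shift and go to state n, n < 0 reduce by
   rule -n, ACC accept, 0 error.
   Rules: 1 E->E+T  2 E->T  3 T->T*F  4 T->F  5 F->(E)  6 F->id */
static const int ACTION[12][6] = {
    /*         id   +   *   (   )   $  */
    /*  0 */ {  5,  0,  0,  4,  0,  0 },
    /*  1 */ {  0,  6,  0,  0,  0, ACC },
    /*  2 */ {  0, -2,  7,  0, -2, -2 },
    /*  3 */ {  0, -4, -4,  0, -4, -4 },
    /*  4 */ {  5,  0,  0,  4,  0,  0 },
    /*  5 */ {  0, -6, -6,  0, -6, -6 },
    /*  6 */ {  5,  0,  0,  4,  0,  0 },
    /*  7 */ {  5,  0,  0,  4,  0,  0 },
    /*  8 */ {  0,  6,  0,  0, 11,  0 },
    /*  9 */ {  0, -1,  7,  0, -1, -1 },
    /* 10 */ {  0, -3, -3,  0, -3, -3 },
    /* 11 */ {  0, -5, -5,  0, -5, -5 },
};
/* GOTO[state][nonterminal] with E = 0, T = 1, F = 2; 0 means none. */
static const int GOTO[12][3] = {
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {8, 2, 3}, {0, 0, 0},
    {0, 9, 3}, {0, 0, 10}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};
static const int RHS_LEN[7] = {0, 3, 1, 3, 1, 3, 1};
static const int LHS[7]     = {0, 0, 0, 1, 1, 2, 2};

/* Returns the next token code and advances *p past it. */
static int next_token(const char **p) {
    while (**p == ' ') (*p)++;
    char c = **p;
    if (c == '\0') return T_END;
    if (isalpha((unsigned char)c)) {
        while (isalpha((unsigned char)**p)) (*p)++;
        return T_ID;
    }
    (*p)++;
    switch (c) {
    case '+': return T_PLUS;
    case '*': return T_STAR;
    case '(': return T_LP;
    case ')': return T_RP;
    default:  return T_BAD;
    }
}

/* Returns the number of reductions performed if text is accepted,
   or -1 on a syntax error. */
int lr_parse(const char *text) {
    int stack[256];
    int top = 0, reductions = 0;
    stack[0] = 0;
    int tok = next_token(&text);
    for (;;) {
        if (tok == T_BAD) return -1;
        int act = ACTION[stack[top]][tok];
        if (act == ACC) return reductions;
        if (act == 0) return -1;
        if (act > 0) {                  /* shift: push, read next token */
            if (top + 1 >= 256) return -1;
            stack[++top] = act;
            tok = next_token(&text);
        } else {                        /* reduce: pop RHS, push GOTO */
            int rule = -act;
            top -= RHS_LEN[rule];
            stack[top + 1] = GOTO[stack[top]][LHS[rule]];
            top++;
            reductions++;
        }
    }
}
```

For the input id + id * id this sketch performs eight reductions, the same eight steps numbered in the example above.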
LR Parsers
• Many different bottom-up parsing
algorithms have been devised. Most of these
are variations of a process called LR parsing
– L means it scans the input string left to right and
the R means it produces a rightmost derivation
• The original LR algorithm was designed by
Donald Knuth (1965). This algorithm is
sometimes called canonical LR
Advantages of LR parsers
• They will work for nearly all grammars that
describe programming languages
• They work on a larger class of grammars than
other bottom-up algorithms, but are as efficient
as any other bottom-up parser
• They can detect syntax errors as soon as it is
possible
• The LR class of grammars is a superset of the
class parsable by LL parsers