Lexical and syntax analysis


Chapter 4
Lexical and Syntax Analysis
ISBN 0-321-33025-0
Chapter 4 Topics
• Introduction
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-Up Parsing
Copyright © 2006 Addison-Wesley. All rights reserved. 1-2
Introduction
• Language implementation systems must analyze
source code, regardless of the specific
implementation approach
• Nearly all syntax analysis is based on a formal
description of the syntax of the source language
(BNF)
Using BNF to Describe Syntax
• Provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain
Syntax Analysis
• The syntax analysis portion of a language
processor nearly always consists of two parts:
– A low-level part called a lexical analyzer
(mathematically, a finite automaton based on a
regular grammar)
– A high-level part called a syntax analyzer, or
parser (mathematically, a push-down automaton
based on a context-free grammar, or BNF)
Reasons to Separate Lexical and Syntax Analysis
• Simplicity - less complex approaches can be
used for lexical analysis; separating them
simplifies the parser
• Efficiency - separation allows optimization of
the lexical analyzer
• Portability - parts of the lexical analyzer may
not be portable, but the parser always is
portable
Lexical Analysis
• A lexical analyzer is a pattern matcher for
character strings
• A lexical analyzer is a "front-end" for the parser
• It identifies substrings of the source program that
belong together - lexemes
– Lexemes match a character pattern, which is
associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Example
sum = oldsum - value / 100;

Token          Lexeme
IDENT          sum
ASSIGN_OP      =
IDENT          oldsum
SUBTRACT_OP    -
IDENT          value
DIVISION_OP    /
INT_LIT        100
SEMICOLON      ;
Lexical Analysis (cont.)
• The lexical analyzer is usually a function that is
called by the parser when it needs the next token
• The lexical analysis process also:
– Includes skipping comments, tabs, newlines, and
blanks
– Inserts lexemes for user-defined names (strings,
identifiers, numbers) into the symbol table
– Saves source locations (file, line, column) for error
messages
– Detects and reports syntactic errors in tokens, such
as ill-formed floating-point literals, to the user
Pragmas
• Provide directives or hints to the compiler
• Directives:
– Turn various kinds of run-time checks on or off
– Turn certain code improvements on or off (performance vs
compilation speed)
– Turn performance profiling on or off
• Hints:
– Variable x is very heavily used (to keep it in a register)
– Subroutine S is not recursive (its storage may be statically
allocated)
– 32 bits of precision (instead of 64) suffice for floating-point
variable x
• Lexical analysis is (often) responsible for dealing with
pragmas
Lexical Analysis (cont.)
• Three main approaches to building a scanner:
1. Write a formal description of the tokens and use a
software tool that constructs lexical analyzers
given such a description
2. Design a state diagram that describes the token
patterns and write a program that implements the
diagram*
3. Design a state diagram that describes the token
patterns and hand-construct a table-driven
implementation of the state diagram
The “longest possible token” rule
• The scanner returns to the parser only when the
next character cannot be used to continue the
current token
– The next character will generally need to be saved
for the next token
• In some cases, you may need to peek at more
than one character of look-ahead in order to
know whether to proceed
– In Pascal, when you have a 3 and you see a '.'
    • do you proceed (in hopes of getting 3.14)? or
    • do you stop (in fear of getting 3..5)?
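The Pascal case above can be sketched in C as follows. This is only an illustration under stated assumptions: the input is a NUL-terminated string, and the names scan_number, TOK_INT and TOK_REAL are hypothetical, not from the slides. The second character of look-ahead decides whether the '.' belongs to the number.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes for this sketch */
enum { TOK_INT, TOK_REAL };

/* Scans a number starting at s[0] (assumed to be a digit).
   Stores the lexeme length in *len and returns the token code. */
int scan_number(const char *s, size_t *len) {
    size_t i = 0;
    while (isdigit((unsigned char)s[i])) i++;
    /* First look-ahead character is '.'; the second one decides:
       a digit continues a real literal, anything else (such as the
       second dot of "..") ends the integer before the '.'. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i])) i++;
        *len = i;
        return TOK_REAL;
    }
    *len = i;
    return TOK_INT;
}
```

With this choice, 3.14 is scanned as one real literal, while 3..5 yields the integer 3 and leaves both dots in the input for the next token.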
The rule …
• In messier cases, you may not be able to get by
with any fixed amount of look-ahead. In Fortran,
for example, we have
DO 5 I = 1,25  loop
DO 5 I = 1.25  assignment
• Here, we need to remember we were in a
potentially final state, and save enough
information that we can back up to it, if we get
stuck later
State Diagram Design
• Suppose we need a lexical analyzer that only
recognizes program names, reserved words, and
integer literals
• A naïve state diagram would have a transition
from every state on every character in the source
language - such a diagram would be very large!
State Diagram Design (cont.)
• In many cases, transitions can be combined to
simplify the state diagram
– When recognizing an identifier, all uppercase and
lowercase letters are equivalent - use a
character class
– When recognizing an integer literal, all digits are
equivalent - use a digit class
– Reserved words and identifiers can be recognized
together (rather than having a part of the diagram
for each reserved word)
State Diagram Design (cont.)
• Convenient utility subprograms:
– getChar - gets the next character of input, puts
it in global variable nextChar, determines its
class and puts the class in global variable
charClass
– addChar - puts the character from nextChar into
the place the lexeme (global variable) is being
accumulated
– lookup - determines whether the string in
lexeme is a reserved word (returns a code)
State Diagram
Lexical Analysis - Implementation
int lex() {
  getChar();
  switch (charClass) {
    case LETTER:
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      return lookup(lexeme);
      break;
    …
Lexical Analysis - Implementation
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
  } /* End of switch */
} /* End of function lex() */
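A self-contained version of this lexer might look like the sketch below. Several things are assumptions made only for illustration: the input is held in a string, the reserved-word list is invented, and the names startLex, END, OTHER, RESERVED, UNKNOWN and EOF_TOK do not appear in the slides. One deliberate difference from the slide code: the look-ahead character is read once in startLex rather than at the top of every lex() call, so the character that ended one token is still available to start the next.

```c
#include <ctype.h>
#include <string.h>

/* Character classes and token codes (names assumed for this sketch) */
enum { LETTER, DIGIT, OTHER, END };
enum { IDENT = 10, INT_LIT, RESERVED, UNKNOWN, EOF_TOK };

const char *input;            /* hypothetical: source text held in memory */
int nextChar, charClass, lexLen;
char lexeme[64];

/* Get the next character of input and determine its class */
void getChar(void) {
    nextChar = (unsigned char)*input;
    if (nextChar == '\0') { charClass = END; return; }
    input++;
    if (isalpha(nextChar)) charClass = LETTER;
    else if (isdigit(nextChar)) charClass = DIGIT;
    else charClass = OTHER;
}

/* Append nextChar to the lexeme being accumulated */
void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = (char)nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* Return RESERVED for a reserved word, IDENT otherwise */
int lookup(const char *s) {
    static const char *words[] = { "begin", "end", "program" };
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++)
        if (strcmp(s, words[i]) == 0) return RESERVED;
    return IDENT;
}

/* Prime the look-ahead once, before the first call to lex() */
void startLex(const char *src) { input = src; getChar(); }

int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    while (charClass != END && isspace(nextChar)) getChar();
    switch (charClass) {
    case LETTER:
        do { addChar(); getChar(); }
        while (charClass == LETTER || charClass == DIGIT);
        return lookup(lexeme);
    case DIGIT:
        do { addChar(); getChar(); } while (charClass == DIGIT);
        return INT_LIT;
    case END:
        return EOF_TOK;
    default:                  /* any other single character */
        addChar();
        getChar();
        return UNKNOWN;
    }
}
```

A caller would invoke startLex on the source text once and then call lex() repeatedly until it returns EOF_TOK, reading the matched text from lexeme after each call.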
A part of a Pascal scanner
• We read the characters one at a time with look-ahead
• If it is one of the one-character tokens
{ ( ) [ ] < > , ; = + - }
we announce that token
• If it is a '.', we look at the next character
– If that is a dot, we announce '..'
– Otherwise, we announce '.' and reuse the look-
ahead
A part of …
• If it is a '<', we look at the next character
– if that is a '=' we announce '<='
– otherwise, we announce '<' and reuse the look-ahead,
etc.
• If it is a letter, we keep reading letters and digits
and maybe underscores until we can't anymore
– then we check to see if it is a reserved word
• If it is a digit, we keep reading until we find a
non-digit
– if that is not a '.' we announce an integer
– otherwise, we keep looking for a real number
– if the character after the '.' is not a digit, we announce
an integer and reuse the '.' and the look-ahead
State Diagram
we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
    if that is a * we have a comment;
        we skip forward through the terminating *)
    otherwise
        we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token
if it is a . we look at the next character
    if that is a . we return ..
    otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
    if that is a = we return <=
    otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
and maybe underscores until we can't anymore;
    then we check to see if it is a keyword
        if so we return the keyword
        otherwise we return an identifier
    in either case we reuse the character beyond the end of
    the token
if it is a digit we keep reading until we find a nondigit
    if that is not a .
        we return an integer and reuse the nondigit
    otherwise we keep looking for a real number
        if the character after the . is not a digit
            we return an integer and
            reuse the . and the look-ahead
etc.
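The punctuation cases of this outline can be sketched in C as below. It is a partial illustration, not the full scanner: letters and digits are lumped into a catch-all token, the input is assumed to be a NUL-terminated string, and scan_punct and the T_ token names are hypothetical.

```c
#include <string.h>

/* Hypothetical token codes for this sketch */
enum { T_LPAREN, T_DOT, T_DOTDOT, T_LT, T_LE, T_OTHER, T_EOF };

/* Scans one token starting at *p and advances *p past it. */
int scan_punct(const char **p) {
    const char *s = *p;
    int tok;
    for (;;) {
        /* skip initial white space */
        while (*s == ' ' || *s == '\t' || *s == '\n') s++;
        /* "(" followed by "*" starts a comment: skip through "*)" */
        if (s[0] == '(' && s[1] == '*') {
            const char *end = strstr(s + 2, "*)");
            s = end ? end + 2 : s + strlen(s);  /* unterminated: stop at end */
            continue;
        }
        break;
    }
    switch (*s) {
    case '\0': tok = T_EOF; break;
    case '(':  s++; tok = T_LPAREN; break;   /* look-ahead was not '*' */
    case '.':
        if (s[1] == '.') { s += 2; tok = T_DOTDOT; }
        else             { s++;    tok = T_DOT; }
        break;
    case '<':
        if (s[1] == '=') { s += 2; tok = T_LE; }
        else             { s++;    tok = T_LT; }
        break;
    default:   s++; tok = T_OTHER; break;    /* everything else, one char */
    }
    *p = s;
    return tok;
}
```

"Reusing the look-ahead" shows up here as simply not advancing past s[1] when it does not extend the current token.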
The Parsing Problem
• Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an
appropriate diagnostic message, and recover
quickly
– Produce the parse tree, or at least a trace of the
parse tree, for the program
The Parsing Problem (cont.)
• Two categories of parsers
– Top down - produce the parse tree, beginning at
the root
    • Order is that of a leftmost derivation
    • Traces the parse tree in preorder
– Bottom up - produce the parse tree, beginning at
the leaves
    • Order is that of the reverse of a rightmost derivation
• Parsers look only one token ahead in the input
The Set of Notational Conventions
• Terminal symbols - Lowercase letters at the
beginning of the alphabet (a, b, …)
• Nonterminal symbols - Uppercase letters at the
beginning of the alphabet (A, B, …)
• Terminals or nonterminals - Uppercase letters at
the end of the alphabet (W, X, Y, Z)
• Strings of terminals - Lowercase letters at the
end of the alphabet (w, x, y, z)
• Mixed strings (terminals and/or nonterminals) -
Lowercase Greek letters (α, β, δ, γ)
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xAα, the parser must
choose the correct A-rule to get the next
sentential form in the leftmost derivation, using
only the first token produced by A
• The most common top-down parsing
algorithms:
– Recursive descent - a coded implementation
– LL parsers - table-driven implementation (1st L
stands for left-to-right, 2nd L stands for leftmost
derivation)
The Parsing Problem (cont.)
• Bottom-up parsers
– Given a right sentential form, α, determine what
substring of α is the RHS of the rule in the
grammar that must be reduced to produce the
previous sentential form in the rightmost derivation
– The most common bottom-up parsing algorithms
are in the LR family
    • L stands for left-to-right, R stands for rightmost
    derivation
Example
• Consider the following grammar for a comma-
separated list of identifiers, terminated by a
semicolon
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
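A top-down recognizer for this grammar can be sketched with one C function per nonterminal. The sketch assumes identifiers are runs of letters and that blanks may separate tokens; the function names are hypothetical. Each function returns a pointer just past what it matched, or NULL on a syntax error.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip blanks between tokens */
const char *skip_blanks(const char *s) {
    while (*s == ' ') s++;
    return s;
}

/* Match one id (assumed: a run of letters) */
const char *match_id(const char *s) {
    s = skip_blanks(s);
    if (!isalpha((unsigned char)*s)) return NULL;
    while (isalpha((unsigned char)*s)) s++;
    return s;
}

/* id_list_tail → , id id_list_tail
   id_list_tail → ;
   The next token (',' or ';') selects the rule. */
const char *id_list_tail(const char *s) {
    s = skip_blanks(s);
    if (*s == ';') return s + 1;
    if (*s == ',') {
        s = match_id(s + 1);
        return s ? id_list_tail(s) : NULL;
    }
    return NULL;
}

/* id_list → id id_list_tail
   Returns 1 if the whole string is an id_list, 0 otherwise. */
int parse_id_list(const char *s) {
    s = match_id(s);
    if (s == NULL) return 0;
    s = id_list_tail(s);
    return s != NULL && *skip_blanks(s) == '\0';
}
```

For the input A, B, C; the calls follow the leftmost derivation: id_list matches A, then id_list_tail is entered three times, once per comma and once for the final semicolon.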
Top-down (left) and bottom-up (right) parsing of the
input string A, B, C;
Recursive-Descent Parsing
• Recursive-Descent Process
– There is a subprogram for each nonterminal in
the grammar, which can parse sentences that can
be generated by that nonterminal
– EBNF is ideally suited for being the basis for a
recursive-descent parser, because EBNF
minimizes the number of nonterminals
• A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )

Recursive-Descent Parsing (cont.)
• Assume we have a lexical analyzer named
lex, which puts the next token code in
nextToken
• The coding process when there is only one
RHS:
– For each terminal symbol in the RHS, compare it
with the next input token; if they match,
continue, else there is an error
– For each nonterminal symbol in the RHS, call its
associated parsing subprogram
Function expr()
/* Function expr()
   Parses strings in the language
   generated by the rule:
   <expr> → <term> {(+ | -) <term>}
*/
void expr() {
  /* Parse the first term */
  term();
  …
Function expr() (cont.)
  /* As long as the next token is + or -, call
     lex() to get the next token, and parse the
     next term */
  while (nextToken == PLUS_CODE ||
         nextToken == MINUS_CODE) {
    lex();
    term();
  }
}
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS
requires an initial process to determine which
RHS it is to parse
– The correct RHS is chosen on the basis of the next
token of input
– The next token is compared with the first token
that can be generated by each RHS until a match is
found

– If no match is found, it is a syntax error
Function factor()
/* Parses strings in the language generated by
   the rule:
   <factor> -> id | (<expr>) */
void factor() {
  /* Determine which RHS */
  if (nextToken == ID_CODE)
    /* For the RHS id, just call lex */
    lex();

Function factor() (cont.)
  /* If the RHS is (<expr>) - call lex() to pass
     over the left parenthesis, call expr(), and
     check for the right parenthesis */
  else if (nextToken == LEFT_PAREN_CODE) {
    lex();
    expr();
    if (nextToken == RIGHT_PAREN_CODE)
      lex();
    else
      error();
  }
  else error(); /* Neither RHS matches */
}
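Putting expr(), term() and factor() together gives a complete recognizer for the expression grammar. The sketch below fills in the pieces the slides leave out, all of which are assumptions: a minimal lex() that reads from a string, the token codes MULT_CODE, DIV_CODE, END_CODE and BAD_CODE, an error() that only counts errors, and a hypothetical driver named parse. The term() function is written by analogy with expr().

```c
#include <ctype.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

const char *src;     /* hypothetical: remaining input text */
int nextToken;       /* set by lex(), as in the slides */
int errorCount;      /* error() only counts; no recovery is attempted */

void error(void) { errorCount++; }

/* Minimal stand-in for the lexical analyzer */
void lex(void) {
    while (*src == ' ') src++;
    if (*src == '\0') { nextToken = END_CODE; return; }
    if (isalpha((unsigned char)*src)) {
        while (isalnum((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    switch (*src++) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

void expr(void);

/* <factor> → id | ( <expr> ) */
void factor(void) {
    if (nextToken == ID_CODE)
        lex();
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    } else
        error();
}

/* <term> → <factor> {(* | /) <factor>} */
void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> → <term> {(+ | -) <term>} */
void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Returns 1 if the whole string is an <expr>, 0 otherwise */
int parse(const char *text) {
    src = text;
    errorCount = 0;
    lex();              /* load the first token before parsing */
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

Note the convention every subprogram relies on: on entry nextToken already holds the first token of the construct, and on exit it holds the first token after it.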
The LL Grammar Class
• The Left Recursion Problem: If a grammar
has left recursion, either direct or indirect, it
cannot be the basis for a top-down parser
– A grammar can be modified to remove left
recursion
• Example: consider the following rule
A → A + B
– A recursive-descent parser subprogram for A
immediately calls itself to parse the first symbol
in its RHS …
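To make the problem and one possible fix concrete, the sketch below assumes the full rule is A → A + B | B with B → b (the second alternative and the rule for B are not in the slides; they are added so the example terminates). The left-recursive rule is replaced by the equivalent EBNF rule A → B {+ B}, which is coded as a loop. The name parse_A is hypothetical.

```c
/* Assumed full rule for this sketch:  A → A + B | B,  with B → b.
   Coding the first RHS directly would give
       void A(void) { A(); ... }
   which calls itself before consuming any input and never returns.
   The equivalent EBNF rule  A → B {+ B}  uses a loop instead. */

/* Returns the number of characters of s consumed as an A,
   or -1 on a syntax error. */
int parse_A(const char *s) {
    int i = 0;
    if (s[i] != 'b') return -1;     /* the first B */
    i++;
    while (s[i] == '+') {           /* {+ B} */
        i++;
        if (s[i] != 'b') return -1;
        i++;
    }
    return i;
}
```

Each pass through the loop consumes at least one character, so the function always terminates, which the direct translation of A → A + B cannot guarantee.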

Pairwise Disjointness Test
• The other characteristic of grammars that
disallows top-down parsing is the lack of
pairwise disjointness
– The inability to determine the correct RHS on the basis
of one token of lookahead
– FIRST(α) = {a | α ⇒* aβ} (If α ⇒* ε, ε ∈ FIRST(α))
• Pairwise Disjointness Test
– For each nonterminal, A, in the grammar that has
more than one RHS, for each pair of rules, A → αᵢ
and A → αⱼ, it must be true that:
FIRST(αᵢ) ∩ FIRST(αⱼ) = ∅
Example
• Example 1: A → aB | aAb
– The FIRST sets for the RHSs in these rules are {a}
and {a}, which are clearly not disjoint. So, these
rules fail the pairwise disjointness test
• Example 2: A → aB | bAb | c
– The FIRST sets for the RHSs of these rules are {a},
{b}, and {c}, which are clearly disjoint. Therefore,
these rules pass the pairwise disjointness test
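For rules like the two examples above, the test can be sketched in a few lines of C. The sketch makes a strong simplifying assumption: every RHS begins with a terminal (a lowercase letter), so FIRST of an RHS is just its first symbol. A general implementation would also have to expand leading nonterminals and handle ε. The function names are hypothetical.

```c
#include <ctype.h>

/* FIRST set of one RHS as a bit mask over 'a'..'z'.
   Assumption: the RHS starts with a terminal (a lowercase letter);
   any other RHS gets an empty set in this sketch. */
unsigned first_of(const char *rhs) {
    return islower((unsigned char)rhs[0]) ? 1u << (rhs[0] - 'a') : 0u;
}

/* Returns 1 if every pair of RHSs has disjoint FIRST sets, else 0 */
int pairwise_disjoint(const char *rhs[], int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (first_of(rhs[i]) & first_of(rhs[j]))
                return 0;           /* a shared first terminal */
    return 1;
}
```

Applied to the RHS lists {"aB", "aAb"} and {"aB", "bAb", "c"}, the function reports failure for the first and success for the second, matching Examples 1 and 2.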
Left factoring
• This process can resolve the problem of failing the
pairwise disjointness test
• Example: consider the rules
<variable> → identifier | identifier [<expression>]
– The two rules can be replaced by
<variable> → identifier <new>
<new> → ε | [<expression>]
– or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)
Bottom-up Parsing
• The process of bottom-up parsing produces the
reverse of a rightmost derivation
• A bottom-up parser starts with the input
sentence and produces the sequence of
sentential forms from there until all that remains
is the start symbol
• In each step, the task of the bottom-up parser is
finding the correct RHS in a right sentential form
to reduce to get the previous right sentential
form in the derivation
Example
• Consider the following simple grammar of
arithmetic expressions
E → E + T | T
T → T * F | F
F → (E) | id
• The right sentential form E + T * id includes
three RHSs, E + T, T, and id. Only one of these
is the correct one to be rewritten
– If the RHS E + T were chosen to be rewritten in this
sentential form, the resulting sentential form would be
E * id. But E * id is not a legal right sentential form
for the given grammar
Definitions
• β is the handle of the right sentential form
γ = αβw if and only if S ⇒*_rm αAw ⇒_rm αβw
(⇒_rm is a rightmost derivation step)
• β is a phrase of the right sentential form γ if
and only if S ⇒* γ = α₁Aα₂ ⇒+ α₁βα₂
• β is a simple phrase of the right sentential form
γ if and only if S ⇒* γ = α₁Aα₂ ⇒ α₁βα₂
Example: Parse Tree of Sentential Form E + T * id

        E
      / | \
     E  +  T
         / | \
        T  *  F
              |
              id

• The phrases of the sentential form E + T * id
are E + T * id, T * id, and id
• The only simple phrase is id
• The handle of a rightmost sentential form is the
leftmost simple phrase
Example: Consider the string id + id * id
(parse tree figure not reproduced; its nodes are numbered (1) through (8))
E             (8)
E + T         (7)
E + T * F     (6)
E + T * id

Shift-Reduce Algorithms
• Reduce is the action of replacing the handle on
the top of the parse stack with its
corresponding LHS
• Shift is the action of moving the next token to
the top of the parse stack
LR Parsers
• Many different bottom-up parsing
algorithms have been devised. Most of these
are variations of a process called LR parsing
– L means it scans the input string left to right and
the R means it produces a rightmost derivation
• The original LR algorithm was designed by
Donald Knuth (1965). This algorithm is
sometimes called canonical LR
Advantages of LR parsers
• They will work for nearly all grammars that
describe programming languages
• They work on a larger class of grammars than
other bottom-up algorithms, but are as efficient
as any other bottom-up parser
• They can detect syntax errors as soon as it is
possible
• The LR class of grammars is a superset of the
class parsable by LL parsers
