
Lecture slides for the Compilers course, chapter 2: design pattern visitor


LEXICAL ANALYSIS
Phung Hua Nguyen
University of Technology
2006
Outline
• Introduction to Lexical Analysis
• Token specification
– Language
– Regular Expressions (REs)
• Token recognition
– REs ⇒ NFA (Thompson’s construction, Algorithm 3.3)
– NFA ⇒ DFA (subset construction, Algorithm 3.2)
– DFA ⇒ minimal DFA (Algorithm 3.6)
• Programming
CSE - HCMUT Lexical Analysis 2
Introduction
• Read the input characters
• Produce as output a sequence of tokens
• Eliminate white space and comments
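[Figure: the source program feeds the lexical analyzer, which returns a token to the parser each time the parser issues "get next token"; both the lexical analyzer and the parser use the symbol table]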
Why separate lexical analysis from parsing?
• Simplify design
• Improve compiler efficiency
• Enhance compiler portability
Tokens, Patterns, Lexemes
Token      Sample lexemes           Informal description of pattern
const      const                    const
if         if                       if
relation   <, <=, ==, !=, >, >=     < or <= or == or != or > or >=
id         pi, count, x2            letter followed by letters or digits
num        3.14, 25, 6.02E3         any numeric constant
literal    “core dumped”            any characters between “ and ” except “
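The table above can be turned into a small tokenizer. The following is a minimal sketch, assuming Python's `re` module; the concrete regular expressions (in particular the one for `num`, the use of ASCII double quotes for `literal`, and the extra `ws` rule for white space) are illustrative assumptions, since the slide only describes the patterns informally.

```python
import re

# (token name, pattern) pairs; earlier entries win, so keywords come before id.
# All patterns are assumptions that approximate the informal descriptions.
TOKEN_SPEC = [
    ("const",    r"const\b"),
    ("if",       r"if\b"),
    ("relation", r"<=|==|!=|>=|<|>"),
    ("num",      r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("id",       r"[A-Za-z][A-Za-z0-9]*"),
    ("literal",  r'"[^"]*"'),
    ("ws",       r"\s+"),          # white space is recognized, then discarded
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]

def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input text."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            if m:
                if name != "ws":
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens

print(tokenize("if x2 >= 6.02E3"))
# [('if', 'if'), ('id', 'x2'), ('relation', '>='), ('num', '6.02E3')]
```

Each pair separates the token (the category handed to the parser) from the lexeme (the matched characters), which is the distinction the table draws.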
Outline
• Introduction √
• Token specification
– Language
– Regular Expressions (REs)
• Token recognition
– REs ⇒ NFA (Thompson’s construction, Algorithm 3.3)
– NFA ⇒ DFA (subset construction, Algorithm 3.2)
– DFA ⇒ minimal DFA (Algorithm 3.6)
• Programming
Alphabet, Strings and Languages
• Alphabet ∑: any finite set of symbols
– The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ,…}
– The binary alphabet {0,1}
– The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
– Length |s| of a string s: the number of symbols in s
– The empty string, denoted ε, has length |ε| = 0
• Language: any set of strings over ∑
– Its two special cases:
  • ∅: the empty set
  • {ε}: the set containing only the empty string
Examples of Languages
• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ,…}
– Vietnamese language
• ∑ = {0,1}
– A string is an instruction
– The set of Pentium instructions
• ∑ = the ASCII set
– A string is a program
– The set of C programs
Terms (Fig.3.7)
Term               Definition
prefix of s        a string obtained by removing 0 or more trailing symbols of s;
                   e.g. ban is a prefix of banana
suffix of s        a string formed by deleting 0 or more leading symbols of s;
                   e.g. na is a suffix of banana
substring of s     a string obtained by deleting a prefix and a suffix from s;
                   e.g. nan is a substring of banana
proper prefix,     any nonempty string x that is, respectively, a prefix, suffix
suffix or          or substring of s such that s ≠ x
substring of s
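The definitions in this table can be checked mechanically. The sketch below uses Python string slicing; the helper names `prefixes`, `suffixes`, `substrings` and `proper` are hypothetical and chosen only for illustration.

```python
def prefixes(s):
    """All strings obtained by removing 0 or more trailing symbols of s."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All strings obtained by deleting 0 or more leading symbols of s."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All strings obtained by deleting a prefix and a suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(candidates, s):
    """Keep only the nonempty candidates x with x != s."""
    return [x for x in candidates if x and x != s]

print("ban" in prefixes("banana"))     # True
print("na" in suffixes("banana"))      # True
print("nan" in substrings("banana"))   # True
print(proper(prefixes("ab"), "ab"))    # ['a']
```

Note that the empty string and s itself are both prefixes, suffixes and substrings of s; only the "proper" variants exclude them.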
String operations
• String concatenation
– If x and y are strings, xy is the string formed by appending y to x.
  E.g.: x = hom, y = nay ⇒ xy = homnay
– ε is the identity: εy = y; xε = x
• String exponentiation
– s^0 = ε
– s^i = s^{i-1}s for i > 0
  E.g. s = 01, s^0 = ε, s^2 = 0101, s^3 = 010101
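The two string operations can be illustrated directly with Python strings. This is a minimal sketch: the name `power` is hypothetical, and the empty Python string stands in for the empty string of the slides.

```python
EPSILON = ""   # the empty string, of length 0

def power(s, i):
    """Return s^i using the recursive definition: s^0 is empty, s^i = s^(i-1) s."""
    if i < 0:
        raise ValueError("exponent must be non-negative")
    if i == 0:
        return EPSILON
    return power(s, i - 1) + s

x, y = "hom", "nay"
print(x + y)                  # homnay
print(EPSILON + y == y)       # True: the empty string is the identity
print(power("01", 2))         # 0101
print(power("01", 3))         # 010101
```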
Language Operations (Fig 3.8)
Term                     Definition
union: L ∪ M             L ∪ M = { s | s ∈ L or s ∈ M }
concatenation: LM        LM = { st | s ∈ L and t ∈ M }
Kleene closure: L^*      L^* = L^0 ∪ L ∪ LL ∪ LLL ∪ …, where L^0 = {ε};
                         0 or more concatenations of L
positive closure: L^+    L^+ = L ∪ LL ∪ LLL ∪ …;
                         1 or more concatenations of L
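For finite languages these operations map onto Python sets. The sketch below is illustrative only: because the two closures are infinite in general, the hypothetical helpers `kleene_upto` and `positive_upto` cut the union off after n concatenations.

```python
from itertools import product

def union(L, M):
    return L | M

def concat(L, M):
    """LM = { st | s in L and t in M }"""
    return {s + t for s, t in product(L, M)}

def lang_power(L, i):
    """L^i, with L^0 containing only the empty string."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, n):
    """Finite approximation of the Kleene closure: L^0, L^1, ..., L^n combined."""
    out = set()
    for i in range(n + 1):
        out |= lang_power(L, i)
    return out

def positive_upto(L, n):
    """Finite approximation of the positive closure: L^1, ..., L^n combined."""
    out = set()
    for i in range(1, n + 1):
        out |= lang_power(L, i)
    return out

print(sorted(concat({"a", "b"}, {"0", "1"})))   # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene_upto({"01"}, 2)))           # ['', '01', '0101']
print(sorted(positive_upto({"01"}, 2)))         # ['01', '0101']
```

The only difference between the two closures is whether L^0 is included, so the empty string belongs to the positive closure only when it already belongs to L.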
Examples
• L = {A,B,…,Z,a,b,…,z}
• D = {0,1,…,9}
Example        Language
L ∪ D          letters and digits
LD             strings consisting of a letter followed by a digit
L^4            all four-letter strings
L^*            all strings of letters, including ε
L(L ∪ D)^*     all strings of letters and digits beginning with a letter
D^+            all strings of one or more digits
Regular Expressions (REs) over Alphabet ∑
• Inductive base:
1. ε is a RE, denoting the regular language (RL) {ε}
2. a ∈ ∑ is a RE, denoting the RL {a}
• Inductive step: Suppose r and s are REs, denoting the languages L(r) and L(s). Then
3. (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
4. (r)(s) is a RE, denoting the RL L(r)L(s)
5. (r)* is a RE, denoting the RL (L(r))*
6. (r) is a RE, denoting the RL L(r)
Precedence and Associativity
• Precedence:
– “*” has the highest precedence
– “concatenation” has the second highest precedence
– “|” has the lowest precedence
• Associativity:
– all are left-associative
E.g.: (a)|((b)*(c)) ≡ a|b*c
⇒ Unnecessary parentheses can be removed
Example
• ∑ = {a, b}
1. a|b denotes {a,b}
2. (a|b)(a|b) denotes {aa,ab,ba,bb}
3. a* denotes {∈,a,aa,aaa,aaaa,…}
4. (a|b)* denotes ?
5. a|a*b denotes ?
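One way to explore these examples is to enumerate short strings and keep those that a regular expression matches in full. The sketch below assumes Python's `re` module, whose `|`, `*` and parentheses behave like the operators defined here for these simple patterns; the helper `language_upto` is hypothetical. The last two calls are left without expected output so they can be used to check answers to items 4 and 5.

```python
import re
from itertools import product

def language_upto(pattern, alphabet, max_len):
    """List every string over `alphabet` of length <= max_len matched in full by `pattern`."""
    matches = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if re.fullmatch(pattern, s):
                matches.append(s)
    return matches

print(language_upto("a|b", "ab", 2))          # ['a', 'b']
print(language_upto("(a|b)(a|b)", "ab", 2))   # ['aa', 'ab', 'ba', 'bb']
print(language_upto("a*", "ab", 3))           # ['', 'a', 'aa', 'aaa']
print(language_upto("(a|b)*", "ab", 2))
print(language_upto("a|a*b", "ab", 3))
```

The same helper can confirm that dropping unnecessary parentheses does not change the language, for instance by comparing `(a)|((b)*(c))` with `a|b*c` over the alphabet `abc`.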
Notational Shorthands
• One or more instances +: r+ = rr*
– denotes the language (L(r))^+
– has the same precedence and associativity as *
• Zero or one instance ?: r? = r|∈
– denotes the language (L(r) ∪ {∈})
• Character classes
– [abc] denotes a|b|c
– [A-Z] denotes A|B|…|Z
– [a-zA-Z_][a-zA-Z0-9_]* denotes ?
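These shorthands are also available in Python's `re` module, so the identities can be spot-checked on sample strings. This is a sketch under that assumption; the helper `same_language` is hypothetical, and agreement on a finite sample is evidence rather than a proof of equivalence.

```python
import re

def same_language(p, q, samples):
    """True if patterns p and q agree (full match or not) on every sample string."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples)

samples = ["", "a", "aa", "aaa", "b", "ab"]
print(same_language(r"a+", r"aa*", samples))     # True: r+ = rr*
print(same_language(r"a?", r"(a|)", samples))    # True: r? = r|(empty string)
print(same_language(r"[abc]", r"a|b|c", ["a", "b", "c", "d", ""]))  # True

# Probe the last character-class example with a few candidate strings.
IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"
for s in ["count", "x2", "_tmp", "2x", ""]:
    print(repr(s), bool(re.fullmatch(IDENT, s)))
```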
Outline
• Introduction √
• Token specification √
– Language
– Regular Expressions (REs)
• Token recognition
– REs ⇒ NFA (Thompson’s construction, Algorithm 3.3)
– NFA ⇒ DFA (subset construction, Algorithm 3.2)
– DFA ⇒ minimal DFA (Algorithm 3.6)
• Programming
Overview
[Figure: RE → NFA (Algorithm 3.3) → DFA (Algorithm 3.2) → mDFA (Algorithm 3.6); a fourth arrow is labeled 3.5]
Nondeterministic finite automata
• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
– a finite set of states S
– a set of input symbols ∑
– a transition function move: S × ∑ → 2^S, mapping state-symbol pairs to sets of states
– a start state s_0
– a finite set of final or accepting states F
Transition graph

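[Figure: legend for transition graphs, showing how a state, a transition (an edge labeled a from state A to state B), the start state, and a final state are drawn]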
Transition table
State    Input symbol
         a        b
0        {0,1}    {0}
1        -        {2}
2        -        {3}
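A transition table like this one can be stored as a nested dictionary. The sketch below is one possible encoding, not a prescribed one: the names `TABLE` and `move` are assumptions, and a "-" entry is represented by simply omitting the key.

```python
# state -> {input symbol -> set of next states}; "-" entries are left out.
TABLE = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
}

def move(states, symbol):
    """Return every state reachable from some state in `states` on `symbol`."""
    result = set()
    for s in states:
        result |= TABLE.get(s, {}).get(symbol, set())
    return result

print(sorted(move({0}, "a")))      # [0, 1]
print(sorted(move({0, 1}, "b")))   # [0, 2]
print(sorted(move({1}, "a")))      # []
```

Because an entry is a set of states, a single lookup can return several successors, which is exactly what makes the automaton nondeterministic.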
Acceptance
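• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
[Figure: transition graph with states A and B and edges labeled 0 and 1]
Input 01010: A → B → A → B → A → B (edge labels 0, 1, 0, 1, 0)
Input 01011: A → B → A → B → A → ? (edge labels 0, 1, 0, 1, then error on the final 1)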
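The acceptance test can be sketched by tracking the set of states reachable after each input symbol. The edges below are read off the two traces (A goes to B on 0, B goes to A on 1); treating B as the accepting state is an assumption based on the first trace ending there, and the sketch ignores ε-transitions because this automaton has none.

```python
# Edges inferred from the traces: A --0--> B and B --1--> A.
EDGES = {("A", "0"): {"B"}, ("B", "1"): {"A"}}
START = "A"
ACCEPTING = {"B"}   # assumption: B is the accepting state

def accepts(x):
    """True iff some path from START to an accepting state spells out x."""
    current = {START}
    for symbol in x:
        current = {t for s in current for t in EDGES.get((s, symbol), set())}
        if not current:        # no edge to follow: the "error" case
            return False
    return bool(current & ACCEPTING)

print(accepts("01010"))   # True
print(accepts("01011"))   # False
```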
Deterministic finite automata
• A deterministic finite automaton (DFA) is a special case of NFA in which
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.
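Both conditions can be checked mechanically on a transition table. In this sketch the table is a nested dictionary from states to symbols to sets of next states, and the empty string is used as the label of an ε-transition; both choices, and the name `is_deterministic`, are assumptions made for illustration.

```python
EPSILON = ""   # label assumed here for epsilon-transitions

def is_deterministic(table):
    """Check the two DFA conditions on a state -> {symbol -> set of states} table."""
    for row in table.values():
        for symbol, targets in row.items():
            if symbol == EPSILON and targets:
                return False          # condition 1 violated
            if len(targets) > 1:
                return False          # condition 2 violated
    return True

nfa = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}}
dfa = {"A": {"0": {"B"}}, "B": {"1": {"A"}}}
print(is_deterministic(nfa))   # False: state 0 has two edges labeled a
print(is_deterministic(dfa))   # True
```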
Thompson’s construction of NFA
from REs

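• Guided by the syntactic structure of the RE r
• For ε: [Figure: start state i with an edge labeled ε to final state f]
• For a in ∑: [Figure: start state i with an edge labeled a to final state f]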
Thompson’s construction (cont’d)
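• Suppose N(s) and N(t) are NFAs for REs s and t
– For s|t: [Figure: a new start state i has ε-edges to the start states of N(s) and N(t); their final states have ε-edges to a new final state f]
– For st: [Figure: N(s) followed by N(t); the start state of N(s) becomes i, the final state of N(t) becomes f, and the final state of N(s) is merged with the start state of N(t)]
– For s*: [Figure: a new start state i has ε-edges to the start state of N(s) and to a new final state f; the final state of N(s) has ε-edges back to its own start state and to f]
– For (s), use N(s) itself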
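The construction cases can be sketched as small functions that build NFA fragments, each with one start state i and one final state f. Everything named here (`NFA`, `symbol`, `union`, `concat`, `star`, `accepts`, and the use of `None` as the ε label) is a hypothetical choice for illustration, and the RE (a|b)*abb is only a sample input.

```python
from itertools import count

EPS = None            # edge label standing for the empty string
_ids = count()        # source of fresh state numbers

class NFA:
    """A fragment with start state i, final state f, and (source, label, target) edges."""
    def __init__(self, i, f, edges):
        self.i, self.f, self.edges = i, f, edges

def symbol(a):
    """Base cases: i --a--> f (pass EPS for the empty-string case)."""
    i, f = next(_ids), next(_ids)
    return NFA(i, f, [(i, a, f)])

def union(s, t):
    """s|t: new i and f, joined to N(s) and N(t) by four EPS edges."""
    i, f = next(_ids), next(_ids)
    return NFA(i, f, s.edges + t.edges +
               [(i, EPS, s.i), (i, EPS, t.i), (s.f, EPS, f), (t.f, EPS, f)])

def concat(s, t):
    """st: merge the start state of N(t) into the final state of N(s)."""
    renamed = [(s.f if p == t.i else p, a, s.f if q == t.i else q)
               for p, a, q in t.edges]
    return NFA(s.i, t.f, s.edges + renamed)

def star(s):
    """s*: new i and f, with EPS edges to enter, skip, repeat and leave N(s)."""
    i, f = next(_ids), next(_ids)
    return NFA(i, f, s.edges +
               [(i, EPS, s.i), (i, EPS, f), (s.f, EPS, s.i), (s.f, EPS, f)])

def eps_closure(nfa, states):
    """All states reachable from `states` using EPS edges only."""
    stack, closure = list(states), set(states)
    while stack:
        p = stack.pop()
        for src, label, dst in nfa.edges:
            if src == p and label is EPS and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return closure

def accepts(nfa, x):
    """Simulate the NFA on x by tracking the set of current states."""
    current = eps_closure(nfa, {nfa.i})
    for ch in x:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return nfa.f in current

# Sample RE: (a|b)*abb
r = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                         symbol("a")), symbol("b")), symbol("b"))
print(accepts(r, "abb"))    # True
print(accepts(r, "aabb"))   # True
print(accepts(r, "ab"))     # False
```

Each helper mirrors one case of the construction, so the fragment for a larger RE is obtained by composing calls in the same shape as the RE's syntax tree.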




