Parsing using graphs

Parsing Needs
New Abstractions

11/23/2011 1

Problem
• Parsing of context-free languages
– active research topic from 60’s to 80’s
– a rich variety of parsing techniques is known
• general CFL parsing:
– Earley’s algorithm, Cocke-Younger-Kasami (CYK)
• deterministic parsing:
– SLL(k), LL(k), SLR(k), LR(k), LALR(k), LA(l)LR(k)..
• Problem: most of these techniques were invented by
automata theory people
– terminology is fairly obscure: leftmost derivations, rightmost
derivations, handles, viable prefixes, ….
– string rewriting is very clean but not intuitive for most PL people
– descriptions in compiler textbooks are obscure/erroneous
– connections between different parsing techniques are lost
• Question: is there an easier way of thinking about parsing
than in terms of strings and string rewriting?


New abstraction
• For any context-free grammar, construct a Grammar Flow
Graph (GFG)
– syntax: representation of grammar as a control-flow graph
– semantics: executable representation
• special kind of non-deterministic pushdown automaton
• Parsing problems
– become path problems in GFG
• Alphabet soup of grammar classes like LL(k), SLL(k), LR(k),
LALR(k), SLR(k) etc. can be viewed as choices along three
dimensions
– non-determinism: how many paths can we explore at a time?
• all (Earley), only one (LL), some (LR)
– look-ahead: how much do we know about future?
• solve fixpoint equations over sets
– context: how much do we remember about the past?
• procedure cloning


GFG example
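S → Aa | bAc | Bc | bBa
A → d
B → d

[Figure: GFG for this grammar. START-S has ε-edges to the entry nodes S→.Aa, S→.bAc, S→.Bc and S→.bBa; each production is a chain of nodes ending at an exit node with an ε-edge to END-S. The call nodes S→.Aa and S→b.Ac have ε-edges to START-A, and END-A has ε-edges to the return nodes S→A.a and S→bA.c; the calls to B are wired the same way through START-B and END-B.]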

GFG construction
For each non-terminal A, create nodes labeled START-A and END-A.

For each production in grammar, create a “procedure” and connect to
START and END nodes of LHS non-terminal as shown below.

A → ε:    START-A --ε--> A→. --ε--> END-A

A → bXY:  START-A --ε--> A→.bXY --b--> A→b.XY
          A→b.XY --ε--> START-X ……. END-X --ε--> A→bX.Y
          A→bX.Y --ε--> START-Y ……. END-Y --ε--> A→bXY.
          A→bXY. --ε--> END-A

Edges labeled ε: only at entry/exit of START-A and END-A nodes.
Fan-out: only at exit of START-A nodes and END-A nodes.
Terminal transition node: node whose outgoing edge is labeled with a terminal.
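The construction rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the grammar, the node-name strings, the `Edge` tuple and the function name `build_gfg` are all hypothetical choices, not part of the slides.

```python
from collections import namedtuple

EPS = "ε"  # label used for ε-edges (an assumed spelling)
Edge = namedtuple("Edge", "src label dst")

def build_gfg(grammar):
    """grammar maps each non-terminal to a list of right-hand sides (tuples of symbols)."""
    nodes, edges = set(), []
    for lhs, bodies in grammar.items():
        start, end = f"START-{lhs}", f"END-{lhs}"
        nodes |= {start, end}
        for body in bodies:
            # One node per dot position: A→.bXY, A→b.XY, ..., A→bXY.
            items = [f"{lhs}→{''.join(body[:i])}.{''.join(body[i:])}"
                     for i in range(len(body) + 1)]
            nodes.update(items)
            edges.append(Edge(start, EPS, items[0]))   # START-A to entry node
            edges.append(Edge(items[-1], EPS, end))    # exit node to END-A
            for i, sym in enumerate(body):
                if sym in grammar:                     # non-terminal: call and return edges
                    edges.append(Edge(items[i], EPS, f"START-{sym}"))
                    edges.append(Edge(f"END-{sym}", EPS, items[i + 1]))
                else:                                  # terminal: scan edge
                    edges.append(Edge(items[i], sym, items[i + 1]))
    return nodes, edges

# Running example: S → Aa | bAc | Bc | bBa, A → d, B → d
grammar = {
    "S": [("A", "a"), ("b", "A", "c"), ("B", "c"), ("b", "B", "a")],
    "A": [("d",)],
    "B": [("d",)],
}
nodes, edges = build_gfg(grammar)
print(len(nodes), len(edges))
```

Note that END-A gets an edge to every return node; matching a return to its call is left to the CFL-path rule of the automaton, not to the graph itself.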

Terminology

For a production A → bXY, the node kinds are:
Start node: START-A
Entry node: A→.bXY
Call node: A→b.XY
Return node: A→bX.Y
Exit node: A→bXY.
End node: END-A

[Figure: the GFG fragments for A → ε and A → bXY from the construction slide, annotated with these names.]

Non-deterministic GFG automaton
• Interpretation of GFG: NGA
– similar to NFA
• Rules:
– begin at START-S
– at START nodes, make non-deterministic choice
– at END nodes, must follow CFL path
• “return to the same procedure from which you made the call”
• CFL path from START to END corresponds to a leftmost derivation
• Label(path):
– sequence of terminal symbols labeling edges in path
– Label of CFL path from START to END is a word in language generated by CFG

S → Aa | bAc | Bc | bBa
A → d
B → d

Parsing problem
• Paths(l):
– set of paths with label l
– inverse relation of Label
• Parsing problem: given a grammar G and a string S,
– find all paths in GFG(G) that generate S, or
– demonstrate that there is no such path
• Parallel paths:
– P1 = START-S →+ A
– P2 = START-S →+ B
– Label(P1) = Label(P2)
– Equivalence relation on paths originating at START-S
• Ambiguous grammar
– two or more parallel paths START-S →+ END-S

S → Aa | bAc | Bc | bBa
A → d
B → d

Compressed paths


Addition to GFG
• We need to be able to talk about sentential forms, not just sentences
• Small modification to GFG:
– add transitions labeled with non-terminals at procedure calls
• Some paths will have edges labeled with non-terminals
– non-terminals that have not been “expanded out”

S → Aa | bAc | Bc | bBa
A → d
B → d

[Figure: GFG for this grammar with an extra edge labeled A or B from each call node to its return node.]

Compressed GFG paths
• More compact representation of GFG path
• Idea:
– collapse portion of path between start and end of a given procedure and replace with non-terminal
• Point: completed calls cannot affect further evolution of path so we need not store full path
• Edges going out of END nodes of procedures will never appear in compressed representation

[Figure: a path P1 from START in which the portion between START-P and END-P is replaced by a single edge labeled P.]
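A minimal sketch of path compression, assuming a GFG path is given as a list of events rather than as graph nodes. The event encoding (`"call"`, `"scan"`, `"return"`) and the function name `compress` are hypothetical, chosen only for this illustration.

```python
def compress(path):
    """Compress a GFG path given as events ("call", A), ("scan", t), ("return", A).

    Returns the label of the compressed path: terminals, plus one non-terminal
    for every completed call.
    """
    stack = []                        # compressed symbols, plus markers for open calls
    for kind, sym in path:
        if kind == "scan":
            stack.append(sym)
        elif kind == "call":
            stack.append(("open", sym))
        elif kind == "return":
            # Discard everything recorded inside the completed call ...
            while stack.pop() != ("open", sym):
                pass
            stack.append(sym)         # ... and replace it with the non-terminal
    # Calls that are still open contribute no symbol of their own.
    return [s for s in stack if not isinstance(s, tuple)]

# Path for the prefix "bd" through S → bAc in the running example
prefix_path = [("call", "S"), ("scan", "b"), ("call", "A"), ("scan", "d"), ("return", "A")]
print(compress(prefix_path))          # ['b', 'A']
```

Because a completed call is popped as a unit, the stack never grows with the depth of finished procedures, which is the point made in the slide.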


NFA for compressed paths
• Start from extended GFG
• Remove edges out of END nodes since these will never be in compressed path
• Each path in NFA corresponds to a compressed GFG path

S → Aa | bAc | Bc | bBa
A → d
B → d

[Figure: the extended GFG for this grammar with the edges out of END nodes removed.]

Following all paths:
Earley’s algorithm


Recall: NFA simulation
• Input string is processed left to
right, one symbol at a time
• Deterministic simulator keeps
track of all states NFA could be
in as the input is processed
• Simulation
– simulated state = subset of NFA
states
– if current simulated state is C and next input symbol is t, compute next simulated state N as follows:
• scanning: if state si ∈ C and NFA has transition si --t--> sj, add sj to N
• prediction: if state sj ∈ N and NFA has transition sj --ε--> sk, add sk to N
– initial simulated state = set of initial states of NFA closed with prediction rule
– (eg) {s0,s1,s4} --a--> {s2} --a--> {s2,s3,s7} ….
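The scanning and prediction rules can be sketched as follows. The NFA used here (for the language a*b) and the names `closure` and `simulate` are assumptions made for the example, not taken from the slides.

```python
def closure(states, eps):
    """Prediction rule: add every state reachable through ε-transitions."""
    result, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for nxt in eps.get(s, ()):
            if nxt not in result:
                result.add(nxt)
                todo.append(nxt)
    return result

def simulate(initial, delta, eps, accepting, text):
    """Deterministic simulation: track the set of states the NFA could be in."""
    current = closure(initial, eps)
    for t in text:
        # Scanning rule: follow every transition labeled t out of the current set.
        scanned = {dst for s in current for dst in delta.get((s, t), ())}
        current = closure(scanned, eps)
    return bool(current & accepting)

# Hypothetical NFA for a*b: s0 --a--> s0, s0 --ε--> s1, s1 --b--> s2
delta = {("s0", "a"): {"s0"}, ("s1", "b"): {"s2"}}
eps = {"s0": {"s1"}}
print(simulate({"s0"}, delta, eps, {"s2"}, "aab"))   # True
print(simulate({"s0"}, delta, eps, {"s2"}, "aba"))   # False
```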


Analog in GFG
• First cut: use exactly the same idea
– current state C, next state N, next input symbol is t
– scanning: if state si ∈ C and NFA has transition si --t--> sj, add sj to N
– prediction: if state sj ∈ N and NFA has transition sj --ε--> sk, add sk to N
• Problem: not clear how to make ε-transitions at return states like s18 and s12
• Solution: keep “return addresses” as in Earley

S → Aa | bAc | Bc | bBa
A → d
B → d

[Figure: GFG for this grammar with numbered nodes S0, S1, …, S19.]
{S0,S1,S4,S8,S13,S17,S11} --d--> {S12,S18, ?????}

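E → int | (E+E) | (E-E)
Input string: (9+6)

[Figure: Earley sets Σ0 to Σ5 for this input, drawn beside the GFG. Each entry pairs a GFG node with the index of the set in which its procedure was started: Σ0 holds <START-E,0>, <E→.(E+E),0>, <E→.(E-E),0>, <E→.int,0>; after "(", Σ1 holds <E→(.E+E),0>, <E→(.E-E),0>, <START-E,1> and the entry nodes tagged 1; after "9", Σ2 holds <E→int.,1>, <END-E,1>, <E→(E.+E),0>, <E→(E.-E),0>; after "+", Σ3 holds <E→(E+.E),0>, <START-E,3> and the entry nodes tagged 3; after "6", Σ4 holds <E→int.,3>, <END-E,3>, <E→(E+E.),0>; after ")", Σ5 holds <E→(E+E).,0>, <END-E,0>.]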

Earley parser and GFG states
• A given Σ set can contain multiple instances of the same GFG state.
• Example: S → aS | a
• Earley set Σi
– <S→a.S, i-1>
– <S→a., i-1>
– <S→.aS, i>
– <S→.a, i>
– <S→aS., i-2>
– <S→aS., i-3>
– ……
– <S→aS., 0>
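A compact Earley recognizer makes the role of the return address concrete. This is a sketch under assumptions: items are encoded as tuples `(lhs, rhs, dot, origin)`, START and END nodes are left implicit, and each set is closed by naive iteration rather than a worklist. The names `earley_sets` and `recognizes` are hypothetical.

```python
def earley_sets(grammar, start, text):
    """Return the list of Earley sets; an item is (lhs, rhs, dot, origin)."""
    sets = [set() for _ in range(len(text) + 1)]
    sets[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(len(text) + 1):
        changed = True
        while changed:                    # close set i under call and return steps
            changed = False
            for lhs, rhs, dot, origin in list(sets[i]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Call node: start every production of the called non-terminal here.
                    new = {(rhs[dot], body, 0, i) for body in grammar[rhs[dot]]}
                elif dot == len(rhs):
                    # Exit node: return to the callers recorded in set `origin`.
                    new = {(l, r, d + 1, o) for l, r, d, o in sets[origin]
                           if d < len(r) and r[d] == lhs}
                else:
                    continue              # scan node: handled below
                if not new <= sets[i]:
                    sets[i] |= new
                    changed = True
        if i < len(text):                 # scanning: consume the next input symbol
            sets[i + 1] = {(l, r, d + 1, o) for l, r, d, o in sets[i]
                           if d < len(r) and r[d] == text[i]}
    return sets

def recognizes(grammar, start, text):
    final = earley_sets(grammar, start, text)[-1]
    return any(l == start and d == len(r) and o == 0 for l, r, d, o in final)

# Example from the slide: S → aS | a on input aaaa
grammar = {"S": [("a", "S"), ("a",)]}
for item in sorted(earley_sets(grammar, "S", "aaaa")[4]):
    print(item)
```

The last set contains the same production S→aS. with origins 2, 1 and 0, which is the situation described above: one GFG state, several instances distinguished only by their return address.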


Earley’s parser and
ambiguous grammars
• If an Earley configuration can be added to a given Σ set by two or more configurations, grammar is ambiguous
• Example: substring between positions p and t can be derived from A in two different ways

[Figure: set Σt containing <X → α., p1> and <Y → β., p2>, both leading to the configuration <Z → γA.δ, p>.]

Look-ahead computation


Look-ahead computation
• Look-ahead at point p in GFG:
– first k symbols you might encounter on path starting at p
– k is a small integer that is given for entire grammar
• Subtle point:
– look-ahead may depend on path from START that you took to get to p
– (eg) 2-look-ahead at entry of N is different for red and blue calls
• Two approaches:
– context-independent look-ahead: first k symbols on paths starting at p
– context-dependent look-ahead: given a path C from START to p, what
are the first k symbols on any path starting at p that extends C
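S → xNab | yNbc
N → a | ε

[Figure: GFGs for S and N. The 2-look-aheads are {xa} and {ya,yb} at the entry nodes of the two S productions, and {aa,ab} and {ab,bc} at the entry nodes of N → a and N → ε.]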

FIRSTk sets
• FIRSTk(A): set of strings of length k or less
– If A ⇒* s where s is a terminal string of length k or less, s ∈ FIRSTk(A)
– If A ⇒* s where s is a string longer than k symbols, then k-prefix of s ∈ FIRSTk(A)
• Intuition:
– non-terminal A represents a set, which is the set of strings we can derive from it
– FIRSTk(A) is the set of k-prefixes of these strings
• Easy to extend FIRSTk to sequences of grammar symbols

S → xNab | yNbc
N → a | ε
FIRST2(N) = {a, ε}
FIRST2(Nab) = {aa, ab}

Useful string functions
• Concatenation: s + t
– (eg) xy + abc = xyabc
• k-prefix of string s: s_k
– (eg) (xyz)_2 = xy, (x)_2 = x, (ε)_2 = ε
• Composition of concatenation and k-prefix: s +k t
– defined as (s+t)_k
– (eg) x +2 yz = xy
– operation is associative
• Easy result: (s+t)_k = (s_k + t_k)_k = s_k +k t_k
• Operations can be extended to sets in the obvious way
– (eg) {a,bcd} +2 {ε,x,yz} = {a,ax,ay,bc}
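These operations map directly onto string slicing. In the sketch below ε is represented by the empty string and terminals by single characters; both are assumptions of the example, as are the function names.

```python
def prefix(s, k):
    """k-prefix of string s."""
    return s[:k]

def concat_k(s, t, k):
    """s +k t, defined as the k-prefix of s + t."""
    return prefix(s + t, k)

def concat_k_sets(xs, ys, k):
    """Extension of +k to sets of strings."""
    return {concat_k(x, y, k) for x in xs for y in ys}

print(concat_k("x", "yz", 2))                                     # xy
print(sorted(concat_k_sets({"a", "bcd"}, {"", "x", "yz"}, 2)))    # ['a', 'ax', 'ay', 'bc']
```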


FIRSTk
FIRSTk(ε) = {ε}
FIRSTk(t) = {t}
FIRSTk(A) = FIRSTk(X1X2…Xn) ∪ FIRSTk(Y1Y2…Ym) ∪ …   // rhs of productions of A
FIRSTk(X1X2…Xn) = FIRSTk(X1) +k FIRSTk(X2) +k … +k FIRSTk(Xn)

FIRSTk example

S → aAab | bAb
A → cAB | ε | a
B → ε

FIRST2(S) = FIRST2(aAab) ∪ FIRST2(bAb)
= ({a} +2 FIRST2(A) +2 {ab}) ∪ ({b} +2 FIRST2(A) +2 {b})
FIRST2(A) = FIRST2(cAB) ∪ {ε} ∪ {a} = ({c} +2 FIRST2(A) +2 FIRST2(B)) ∪ {ε} ∪ {a}
FIRST2(B) = {ε}

Solution:
FIRST2(A) = {ε,a,c,ca,cc}
FIRST2(B) = {ε}
FIRST2(S) = {aa,ac,bb,ba,bc}
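The FIRSTk equations can be solved by iterating from empty sets until nothing changes. The sketch below assumes single-character terminals, ε written as the empty string, and productions written as strings; the function names are hypothetical.

```python
def concat_k(xs, ys, k):
    """Set version of +k."""
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(symbols, first, k):
    """FIRSTk of a sequence of grammar symbols, given FIRSTk of each non-terminal."""
    result = {""}                                  # FIRSTk(ε) = {ε}
    for sym in symbols:
        # A terminal t contributes {t}; a non-terminal contributes its current set.
        result = concat_k(result, first.get(sym, {sym}), k)
    return result

def first_k(grammar, k):
    """Least solution of the FIRSTk equations by fixpoint iteration."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                new = first_k_seq(body, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# S → aAab | bAb, A → cAB | ε | a, B → ε
grammar = {"S": ["aAab", "bAb"], "A": ["cAB", "", "a"], "B": [""]}
for nt, strings in first_k(grammar, 2).items():
    print(nt, sorted(strings))
```

Starting from empty sets matters: a non-terminal with no derivation found so far contributes nothing, so the iteration converges to the least solution.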


Context-independent look-aheads
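[Figure: GFGs for S, A and B for the grammar of the FIRSTk example (S → aAab | bAb, A → cAB | ε | a, B → ε). The look-aheads {ab} and {b$} are shown after the two calls to A in S, and the unknowns Se = {$$}, Ae and Be at the END nodes.]

Compute FOLLOWk(A) sets: strings of length k that can be encountered
after you return from non-terminal A

Se = {$$}
Ae = (FIRST2({ab}) +2 Se) ∪ (FIRST2({b}) +2 Se) ∪ (FIRST2(B) +2 Ae)
Be = Ae
Solution: Se = {$$}, Ae = {ab,b$}, Be = {ab,b$}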

From these FOLLOW sets, we can now compute look-ahead at any GFG point.
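The FOLLOWk equations can be solved the same way as the FIRSTk equations. The sketch below is self-contained and assumes the grammar of the FIRSTk example (S → aAab | bAb, A → cAB | ε | a, B → ε), single-character terminals, and ε as the empty string; function names are hypothetical.

```python
def concat_k(xs, ys, k):
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(symbols, first, k):
    result = {""}
    for sym in symbols:
        result = concat_k(result, first.get(sym, {sym}), k)
    return result

def first_k(grammar, k):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                new = first_k_seq(body, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def follow_k(grammar, start, k, end="$"):
    """FOLLOWk sets, with the start symbol followed by k end markers."""
    first = first_k(grammar, k)
    follow = {nt: set() for nt in grammar}
    follow[start] = {end * k}
    changed = True
    while changed:
        changed = False
        for lhs, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue
                    # What follows this call: the rest of the body, then FOLLOWk(lhs).
                    rest = first_k_seq(body[i + 1:], first, k)
                    new = concat_k(rest, follow[lhs], k)
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

grammar = {"S": ["aAab", "bAb"], "A": ["cAB", "", "a"], "B": [""]}
for nt, strings in follow_k(grammar, "S", 2).items():
    print(nt, sorted(strings))
```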

Computing context-independent
look-ahead sets
• Algorithm:
– For each non-terminal A, compute FIRSTk(A)
• First k terminals you encounter on path START-A →+ END-A
– For each non-terminal A, compute FOLLOWk(A)
• First k terminals you encounter on path that extends a GFG path START →+ END-A
– Use the FIRSTk and FOLLOWk sets to compute the
look-ahead at any point of interest in GFG
• You can even compute FIRSTk and FOLLOWk
sets in one big iteration if you want.
• This computation is independent of the particular
parsing method used


Production cloning:
a way of implementing
context-dependence


Context-dependent look-ahead
• In running example,
– look-aheads for N for red call to N are disjoint
– look-aheads for N for blue call to N are disjoint
– context-independent look-ahead computation combines the look-aheads from all the call sites of N at the bottom of N and propagates them to the top
• Idea:
– compute look-aheads separately for each context
– keep track of context while parsing
⇒ we can get a more capable parser

S → xNab | yNbc
N → a | ε
Input string: xab$$

[Figure: GFGs for S and N with 2-look-aheads {xa}, {ya,yb} at the entries of S, {aa,ab}, {ab,bc} at the entries of N, and {ab}, {bc} after the red and blue calls to N.]

Tracking context by cloning
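• Grammar:
S → xNab | yNbc        S → xN1ab | yN2bc
N → a | ε              N1 → a | ε
                       N2 → a | ε
where N1 = [N,{ab}] and N2 = [N,{bc}]

[Figure: GFG of the cloned grammar. The 2-look-aheads at the entries of N1 = [N,{ab}] are {aa} and {ab}; at the entries of N2 = [N,{bc}] they are {ab} and {bc}.]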

General idea of cloning
• Cloning creates copies of productions
• Intuitively we would like to create a clone of a production for each of
its contexts and write-down look-ahead
– but set of contexts for a production is usually infinite
• Solution:
– create finite number of equivalence classes of contexts for a given
production
– create a clone for each equivalence class
– compute context-independent look-ahead
• Two cloning rules are important in practice
– k-look-ahead cloning: two contexts are in same equivalence class if
their k-look-aheads are identical (used in LL(k))
– reachability cloning: two contexts C1 and C2 are in same equivalence
class if the set of GFG nodes reachable by paths with label(C1) is equal
to set of GFG nodes reachable by paths with label(C2) (used in LR(0))
– LR(k) uses a combination of them


k-look-ahead cloning (intuitive idea)
[Figure: GFGs for S, A and B before and after k-look-ahead cloning. The two calls to A in S, with return look-aheads {ab} and {b$}, become calls to the clones [A,{ab}] and [A,{b$}]; clones such as [A,{da}], [A,{db}], [B,{ab}] and [B,{b$}] also appear. Other clones not shown.]

If there are |T| terminal symbols, you may end up with 2^(|T|^k) clones of a given production

k-look-ahead cloning
• G=(V,T,P,S): grammar, k: positive integer.
• Tk(G) is following grammar
– nonterminals: {[A,R] | A in V − T, and R ⊆ T^k}
– terminals: T
– start symbol: [S,{$^k}]
– rules: all rules of the form [A,R] → X1'X2'X3'...Xm' where for some rule A → X1X2X3...Xm in P
• Xi' = Xi if Xi is a terminal
• Xi' = [Xi, FIRSTk(Xi+1...Xm) +k R] when Xi is a non-terminal.
• Intuition:
– after this kind of cloning, k-look-aheads at the end of a procedure
are identical for all return edges
– so doing a context-independent look-ahead computation on the
transformed grammar does not tell you anything you did not
already know about k-look-aheads
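The transformation Tk(G) can be sketched as a worklist computation. One deliberate difference from the definition above: this sketch generates only the clones reachable from [S,{$^k}] instead of all pairs [A,R]. Single-character terminals, ε as the empty string, and all function names are assumptions of the example.

```python
def concat_k(xs, ys, k):
    return {(x + y)[:k] for x in xs for y in ys}

def first_k_seq(symbols, first, k):
    result = {""}
    for sym in symbols:
        result = concat_k(result, first.get(sym, {sym}), k)
    return result

def first_k(grammar, k):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                new = first_k_seq(body, first, k)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def clone_k(grammar, start, k, end="$"):
    """Reachable part of Tk(G); a clone [A,R] is the pair (A, frozenset(R))."""
    first = first_k(grammar, k)
    root = (start, frozenset({end * k}))
    rules, todo = {}, [root]
    while todo:
        clone = todo.pop()
        if clone in rules:
            continue
        nt, R = clone
        rules[clone] = []
        for body in grammar[nt]:
            new_body = []
            for i, sym in enumerate(body):
                if sym in grammar:
                    # Xi' = [Xi, FIRSTk(Xi+1...Xm) +k R]
                    ctx = concat_k(first_k_seq(body[i + 1:], first, k), R, k)
                    sym = (sym, frozenset(ctx))
                    todo.append(sym)
                new_body.append(sym)
            rules[clone].append(tuple(new_body))
    return root, rules

# S → xNab | yNbc, N → a | ε
grammar = {"S": ["xNab", "yNbc"], "N": ["a", ""]}
root, rules = clone_k(grammar, "S", 2)
for (nt, ctx), bodies in rules.items():
    print(nt, sorted(ctx), bodies)
```

On the running example this yields exactly the two clones [N,{ab}] and [N,{bc}] used on the cloning slide.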


LL(k) and SLL(k)


Intuition
• This class of grammars has the following
property:
– if s is a string in the language, then for any prefix
p of s, there is a unique path P from START such
that label(P) = p (modulo look-ahead)
• So we need to follow only one path through
GFG for a given input string, using look-
ahead to eliminate alternatives
• Roughly analogous to DFAs in the CFL world

LL(k) parsing
• Only one path can be followed by the parser
– so at procedure call for non-terminal N, we must know exactly which procedure (rule) to call
• Simple LL(k) parsing:
– make decision based on context-independent look-ahead of k symbols at entry point for N
• LL(k) parsing:
– use context-dependent look-ahead of k symbols
– procedure cloning technique converts LL(k) grammar into SLL(k) grammar

S → xNab | yNbc
N → a | ε
Grammar is LL(2) but not SLL(2)

[Figure: GFGs for S and N with 2-look-aheads {xa}, {ya,yb} at the entries of S and {aa,ab}, {ab,bc} at the entries of N.]

Parser
• Modify Earley parser to
– track compressed paths instead of full paths
• transitions labeled by non-terminals and terminals
– eliminate return addresses
• at the end of a production
– A → X1X2..Xn: pop n states off and make an A transition from the exposed state
– A → ε: make an A transition from current state
– use look-ahead to eliminate alternatives


E → int | (+ E E) | (- E E)
Input string: ( - ( + 8 9 ) 7 )

[Figure: trace of the modified Earley parser on this input, drawn beside the GFG. The states pass through START-E, E→(.- E E), E→(- .E E), START-E, E→(.+ E E), E→(+ .E E), E→int., END-E, E→(+ E .E), E→int., E→(+ E E.), E→(+ E E)., END-E, E→(- E .E), E→int., E→(- E E.), E→(- E E)., END-E.]

Many grammars are not LL(k)
• Grammar
– E → int | (E+E) | (E-E)
• Not clear which rule to apply until you see “+” or “-”
– this needs unbounded look-ahead, so grammar is not LL(k) for any k
• One solution:
– follow multiple paths till only one survives

LR(k),SLR(k),LALR(k)


LR grammars (informal)
• LR parsers permit limited non-determinism
– can follow more than one path but not all paths like Earley
• LR(0) condition: for any prefix of input, the corresponding fully extended compressed paths must have the same label
• Condition not true in general grammars: see example
– Consider string “da”
– For prefix “d”, there are two paths:
• red path
• blue path
– Labels of compressed paths:
• red path: “A”
• blue path: “B”
• We can use modified Earley parser for these grammars

S → Aa | bAc | Bc | bBa
A → d
B → d

E → int | (E+E) | (E-E)
Input string: (3+4)

[Figure: Earley sets Σ0 to Σ5 for this input, with return addresses (left), next to the corresponding states of the modified parser without return addresses (right), drawn beside the GFG.]

Parser for LR languages
• Use the modified Earley parser we used for LL grammars
– each Σ-state will have multiple items as in the original Earley parser since LR parsers follow multiple paths too
• Σ-states must follow a stack discipline for modified Earley parser to work
• Since we are following multiple paths, this might break down
– shift-reduce conflict: parallel compressed paths
• P1 to a scan node and P2 to an EXIT node (push/pop conflict)
– reduce-reduce conflict: parallel compressed paths
• P1 and P2 to different EXIT nodes (pop/pop conflict)
• If grammar does not have shift-reduce or reduce-reduce conflicts, we can use modified Earley parser and follow compressed paths while maintaining a stack discipline for Σ-states
• How do we determine whether grammar has shift-reduce or reduce-reduce conflicts?

Finding LR(0) conflicts
• Compute the DFA corresponding to the
compressed path NFA
• If conflicting states are in same DFA state,
grammar has an LR(0) conflict
S → Aa | bAc | Bc | bBa
A → d
B → d

[Figure: DFA for the compressed-path NFA of this grammar. The initial state contains S→.Aa, S→.bAc, S→.Bc, S→.bBa, A→.d, B→.d; the state after b contains S→b.Ac, S→b.Ba, A→.d, B→.d. From both, the edge d leads to the state {A→d., B→d.}, which has a reduce-reduce conflict.]
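The conflict check can be sketched with the textbook LR(0) item-set construction, used here as a stand-in for the subset construction on the compressed-path NFA. Items are encoded as `(lhs, rhs, dot)` tuples and no augmented start production is added; these choices and the function names are assumptions of the example.

```python
def closure(items, grammar):
    """Add the entry items of every non-terminal that appears right after a dot."""
    items, todo = set(items), list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for body in grammar[rhs[dot]]:
                item = (rhs[dot], body, 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return frozenset(items)

def lr0_states(grammar, start):
    """All DFA states reachable from the initial item set."""
    initial = closure({(start, body, 0) for body in grammar[start]}, grammar)
    states, todo = {initial}, [initial]
    while todo:
        state = todo.pop()
        for sym in {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}:
            kernel = {(l, r, d + 1) for l, r, d in state if d < len(r) and r[d] == sym}
            nxt = closure(kernel, grammar)
            if nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return states

def lr0_conflicts(grammar, start):
    conflicts = []
    for state in lr0_states(grammar, start):
        exits = [(l, r) for l, r, d in state if d == len(r)]           # EXIT nodes
        scans = [(l, r, d) for l, r, d in state
                 if d < len(r) and r[d] not in grammar]                # scan nodes
        if len(exits) > 1:
            conflicts.append(("reduce-reduce", state))
        if exits and scans:
            conflicts.append(("shift-reduce", state))
    return conflicts

# S → Aa | bAc | Bc | bBa, A → d, B → d
grammar = {"S": ["Aa", "bAc", "Bc", "bBa"], "A": ["d"], "B": ["d"]}
for kind, state in lr0_conflicts(grammar, "S"):
    print(kind, sorted(state))
```

For the running grammar the only conflict reported is the reduce-reduce conflict between A→d. and B→d., matching the figure.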

LR(0) automaton for expression grammar

[Figure: LR(0) automaton for E → int | (E+E) | (E-E). Its states are {E→.(E+E), E→.(E-E), E→.int}; {E→int.}; {E→(.E+E), E→(.E-E), E→.(E+E), E→.(E-E), E→.int}; {E→(E.+E), E→(E.-E)}; {E→(E+.E), E→.(E+E), E→.(E-E), E→.int}; {E→(E-.E), E→.(E+E), E→.(E-E), E→.int}; {E→(E+E.)}; {E→(E-E.)}; {E→(E+E).}; {E→(E-E).}, with edges labeled (, int, E, +, - and ).]

Parser for LR(0) languages
• Use the modified Earley parser we used for LL grammars
– each Σ-state will have multiple items as in the original Earley parser since LR parsers follow multiple paths too
• No need to keep track of GFG nodes within each Σ-state
– states in compressed path DFA correspond to possible Σ-states
– So modified Earley parser just pushes and pops DFA states

GFG path interpretation

• Let P1 and P2 be two GFG paths with identical labels
• Sufficient condition for labels of compressed paths to be equal:
– sequence of completed calls in P1 and P2 are identical
• Most of the action in LR parsers happens at EXIT nodes of productions

[Figure: two paths P1 and P2 from START, each passing through a completed call from START-P to END-P.]

LR(0) conflicts: GFG
[Figure: two pairs of parallel paths from START with the same terminal label t*. Reduce-reduce conflict: the paths end at Aexit and Bexit. Shift-reduce conflict: the paths end at Aexit and at a scan node B with outgoing edge u.]

• LR(0) conflicts (GFG definition):
– Shift-reduce conflict: there are parallel paths P1: START →+ Aexit and P2: START →+ scan-node
– Reduce-reduce conflict: there are parallel paths P1: START →+ Aexit and P2: START →+ Bexit
• Claim: Let G be an LR(0) grammar according to GFG definition.
– P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents
– P1 and P2 have the same label iff C(P1) and C(P2) have the same label

LR(0) conflicts: GFG
• Claim: Let G be an LR(0) grammar according to GFG definition.
– P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents
– P1 and P2 have the same label iff C(P1) and C(P2) have the same label
• This claim is not true if the paths do not end at SCAN or END nodes
– counterexample: in this LR(0) grammar, consider paths from START to nodes S → A.a and S → .Uc

S → Aa | Uc
U → Ab
A → ε

Example
• States with LR(0) conflicts
– (A→d. , B→d.)
• Conflicting context pairs
(i) path label: d
– C1: START, S→.Aa, A→.d, A→d.
– C2: START, S→.Bc, B→.d, B→d.
(ii) path label: bd
– C3: START, S→.bAc, S→b.Ac, A→.d, A→d.
– C4: START, S→.bBa, S→b.Ba, B→.d, B→d.
• So grammar is not LR(0)

S → Aa | bAc | Bc | bBa
A → d
B → d

LR(0) H&U
• A grammar G is LR(0) if
– its start symbol does not appear on the right side of any production, and
– for every viable prefix γ, whenever A → α. is a complete valid item for γ, then no other complete item nor any item with a terminal to the right of the dot is valid for γ.
• Comment:
– by this definition, the only other valid items that can occur together with A → α. are incomplete items with a non-terminal to the right of the dot of the form B → β.Cδ
– if First(C) contains a terminal t, it can be shown that an item of the form Y → .tλ is valid for γ, violating the LR(0) condition. Therefore, First(C) = {ε}. It can be shown that this means α = ε
– Example: this grammar is LR(0) (A → . and B → .Cd are valid items for viable prefix ε)
• S → B
• B → Cd
• C → A
• A → ε

Look-ahead in LR grammars
[Figure: the reduce-reduce and shift-reduce conflict diagrams: parallel paths from START with label t* ending at Aexit and Bexit, and at Aexit and a scan node B.]

• LR(k)
– for each pair of parallel paths to LR(0) conflicting states, k-look-ahead sets are disjoint
• SLR(k):
– if there is LR(0) conflict at nodes A and B, context-insensitive look-ahead sets of A and B are disjoint
• LALR(k): grammar is SLR(k) after reachability cloning

Example
• States with LR(0) conflicts
– (A→d. , B→d.)
• Conflicting context pairs
(i) path label: d
– C1: START, S→.Aa, A→.d, A→d.
– C2: START, S→.Bc, B→.d, B→d.
(ii) path label: bd
– C3: START, S→.bAc, S→b.Ac, A→.d, A→d.
– C4: START, S→.bBa, S→b.Ba, B→.d, B→d.
• Grammar is LR(1)
– Look-ahead for C1: {a}, look-ahead for C2: {c}
– Look-ahead for C3: {c}, look-ahead for C4: {a}

S → Aa | bAc | Bc | bBa
A → d
B → d

LR(1) automaton
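S → Aa | bAc | Bc | bBa
A → d
B → d

[Figure: LR(1) automaton for this grammar. The initial state contains S→.Aa,$  S→.bAc,$  S→.Bc,$  S→.bBa,$  A→.d,a  B→.d,c and goes on d to the state {A→d.,a  B→d.,c}. The state after b contains S→b.Ac,$  S→b.Ba,$  A→.d,c  B→.d,a and goes on d to the state {A→d.,c  B→d.,a}. The remaining states are S→A.a,$; S→Aa.,$; S→B.c,$; S→Bc.,$; S→bA.c,$; S→bAc.,$; S→bB.a,$; S→bBa.,$.]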

LALR look-ahead computation
• Key observation:
– each path START →+ s in deterministic LR(0) automaton
represents a set of contexts in the non-deterministic LR(0)
automaton
• each context in this set ends at one of the items in s
– in general, there will be multiple paths to state s in
deterministic LR(0) automaton
– so each state in LR(0) automaton represents a set of sets
of contexts
– in LALR, we merge the look-aheads for those contexts
• LALR = reachability cloning + SLR (Bermudez and Logothetis) + unions at some nodes (see the R → L. state in the diagram on the next page)

LALR(1) but not SLR(1)
S’ → S$
S → L=R | R
L → *R | id
R → L

FOLLOW(S) = { $ }
FOLLOW(R) = { =, $ }
FOLLOW(L) = { =, $ }

[Figure: LR(0) automaton for this grammar. The state reached on L from the initial state contains S → L.=R and R → L.; since = is in FOLLOW(R), this state has a shift-reduce conflict under SLR(1).]

LALR → SLR grammar
S’ → S$             S’ → S$
S → L=R | R         S → L1 = R2 | R1
L → *R | id         L1, L2, L3 → *R3 | id
R → L               R1 → L1
                    R2 → L2
                    R3 → L3

[Figure: the LR(0) automaton of the original grammar with each edge labeled by a non-terminal renamed to the corresponding clone: S, L1, R1 out of the initial state, L2 and R2 out of the state after =, and L3 and R3 out of the state after *.]

LR(0): Reachability cloning
• Motivation: NFA → DFA conversion for LR grammars
• Driven by compressed paths
• Need to verify that this cloning satisfies sanity condition even if grammar is not LR(0)
• Compressed contexts C1 and C2 of node A are in same equivalence class if
set of GFG nodes reachable by paths with label(C1)
=
set of GFG nodes reachable by paths with label(C2)

[Figure: contexts C1, C2 and C3 from START to nodes A and B. C1 and C2 will be in the same equivalence class. C3 is in a different class.]

Algorithm (need to write)
• G=(V,T,P,S):grammar
• R(G) is following grammar
– nonterminals: {[Ai] | A in V − T, 1 <= i <= n and there are n edges labeled A in compressed path DFA}
– terminals: T
– start symbol: [S]
– rules: all rules of the form [Ai] → X1'X2'X3'...Xm' where for some rule A → X1X2X3...Xm in P
• Xi' = Xi if Xi is a terminal
• Xi' = [Xi] when Xi is a non-terminal.

Cloning for LALR(1)

• Same condition as LR(0): reachability
cloning
• Extension to LA(k)LR(l):
– cloning is governed by LR(l)
– compute SLR(k) look-aheads
– LALR(k) is LA(k)LR(0)
– LR(k) is LA(k)LR(k)

Summary

• New abstraction for CFL parsing
– Grammar Flow Graph (GFG)
• Parsing problems become path problems in GFG
• Earley parser emerges as simple extension of NFA simulation
• Mechanisms
– control number of paths followed during parsing
– look-ahead:
• algorithm: solving set constraints
– context-dependent look-ahead
• algorithm: cloning
• SLL(k), LL(k), SLR(k), LR(k), LALR(k) grammars arise from
different choices of these mechanisms
• LL and LR parsers emerge as specializations of Earley parser


LR(0) ²DFA
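M0 = {E→.(E+E), E→.(E-E), E→.int}
M1 = {E→int.}
M2 = {E→(.E+E), E→(.E-E)}
M3 = {E→(E+.E)}
M4 = {E→(E.+E), E→(E.-E)}
M5 = {E→(E-.E)}
M6 = {E→(E+E.)}
M7 = {E→(E-E.)}
M8 = {E→(E+E).}
M9 = {E→(E-E).}

[Figure: LR(0) εDFA over these states, with edges labeled (, int, E, +, - and ), and ε-edges back to M0.]

Input string: ((2+3)-4)

Σ0:             <M0,0>
Σ1 (after "("): <M2,0> <M0,1>
Σ2 (after "("): <M2,1> <M0,2>
Σ3 (after "2"): <M1,2> <M4,1>
Σ4 (after "+"): <M3,1> <M0,4>
Σ5 (after "3"): <M1,4> <M6,1>
Σ6 (after ")"): <M8,1> <M4,0>
Σ7 (after "-"): <M5,0> <M0,7>
Σ8 (after "4"): <M1,7> <M7,0>
Σ9 (after ")"): <M9,0>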

LALR(1) example from G&J
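S’ → S #
S → A B c
A → a
B → b
B → ε

[Figure: LR(0) automaton for this grammar. The initial state contains S’→.S#, S→.ABc, A→.a; the state after A contains S→A.Bc, B→., B→.b; the remaining states are S’→S.#, S→AB.c, S→ABc., A→a. and B→b.]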

S → L=R | R
L → *R | id
R → L

[Figure: GFGs for S, L and R with END nodes Send, Lend and Rend; C marks a node in the S → L=R procedure and T a node in the L → *R procedure.]

Shift-reduce conflict occurs at states C and Rend
(conflicting paths are S→L→Lend→C and S→R→L→Lend→Rend)
1-look-ahead at C is =
Context-independent 1-look-ahead at Rend is {=,$} so grammar is not SLR(1).
LALR(1) figures out that for conflicting state, the calling context must be S→R.
Look-ahead at Rend is = for context S→L→T→R→L→Lend→Rend but there is no context S→* C parallel to this one.

LR(1)
FIRST(L) = FIRST(R) = {*, id}
Shift-reduce conflict: id $

[Figure: GFGs for S, L and R before and after procedure cloning. After cloning, L and R are split into the clones [L,{=}], [L,{$}], [R,{=}] and [R,{$}].]

LALR(1) look-aheads

[Figure: LALR(1) automaton for S’ → S$, S → (S) | ε. State T0: S’→.S$, S→.(S) [$], S→. [$]. State T1: S→(.S) [$,)], S→.(S) [)], S→. [)]. State T2: S→(S.) [$,)]. State T4: S→(S). [$,)]. State T5: S’→S.$. Edges: ( from T0 to T1 and from T1 to T1, S from T0 to T5 and from T1 to T2, ) from T2 to T4.]

• After reduction S → (S), parsing can resume either in state T0 or T1.
• LR parser stack tells you which one to resume from
• LALR(1) look-aheads in state T1 are interesting. Item S → (.S) gets look-ahead from item S → .(S) in state T0 as well as item S → (.S) from state T1.

Parsing techniques
• Our focus: techniques that perform breadth-first
traversal of GFG
– similar to online simulation of NFA
– input is read left to right one symbol at a time
– extend current GFG paths if possible, using symbol
• Three dimensions:
– non-determinism: how many paths can I follow at a given
time?
– look-ahead: how many symbols of look-ahead are known
at each point?
– context: how much context do we keep?
• this is implemented by procedure cloning, independent of look-
ahead


What we would like to show
• Obvious algorithm:
– follow all CFL-paths in GFG
– essentially a fancy transitive closure in GFG
– leads to Earley’s algorithm
– O(n^3) complexity
• O(n) algorithms: LL/LR/LALR,…
– preprocessing to compute look-ahead sets
– maintain compressed paths
– ensure that Earley sets can be manipulated as a stack

What we would like to show
(contd.)
• SLL(k) = no cloning + decision at procedure start
• LL(k) = k-look-ahead-cloning+ decision at
procedure start
• LA(l)LL(k) = l-look-ahead-cloning + context-
independent k-look-ahead + decision at procedure
start
• SLR(k) = no cloning + decision at procedure end
• LR(k) = k-lookahead-cloning + decision at
procedure end
• LALR(k) = reachability-cloning + decision at
procedure end

Computing context-independent look-ahead
• Intuition:
– simple inter-procedural backward dataflow analysis in GFG
– assume look-ahead at exit of GFG is {$^k}
– propagate look-ahead back through GFG to determine look-aheads at other points
• How do we propagate look-aheads through non-terminal calls?
– would like to avoid repeatedly analyzing procedure for each look-ahead set we want to propagate through it
– need to handle recursive calls
– ideally, we would have a function that tells us how a look-ahead set at the exit of a procedure gets propagated to its input

S → xNab | yNbc
N → a | ε

[Figure: GFGs for S and N annotated with 2-symbol look-aheads; {xa} and {ya,yb} at the entries of S.]

Every LL(1) grammar is an SLL(1) grammar

Let string generated by paths P and Q be SP and SQ

Cases:
- SP = a and SQ = a : grammar is neither LL(1) nor SLL(1)
- SP = a and SQ = b : grammar is LL(1) and SLL(1)
- SP = ε and SQ = ε : grammar is neither LL(1) nor SLL(1)
- SP = a and SQ = ε :
  - We show that there cannot be a context Ci for which the generated string for the complementary context Ci’ is a
  - Otherwise, for context Ci, 1-lookahead for choice P is a and 1-lookahead for choice Q is a, so the grammar is not LL(1).
  - Therefore, there is no context Ci for which the 1-lookahead for choice Q is a.
  - But this means that the context-independent 1-lookahead for choice Q cannot contain a.
  - Therefore the grammar is SLL(1).

[Figure: contexts C1 and C2 lead from START to node N (over edges x and y); choices P and Q leave N; the complementary contexts C1’ and C2’ lead on to END.]

LL(2) grammar that is not SLL(2)
- Consider the context-sensitive look-aheads at N.
- For context C1,
  2-lookahead for choice P is {aa}
  2-lookahead for choice Q is {ab}
- For context C2,
  2-lookahead for choice P is {ab}
  2-lookahead for choice Q is {bc}.
- Therefore, grammar is LL(2).
- Context-independent lookaheads:
  2-lookahead for choice P is {aa,ab}
  2-lookahead for choice Q is {ab,bc}.
- Since these two sets are not disjoint, the grammar is not SLL(2).
- Grammar:
  S → xNab
  S → yNbc
  N → a
  N → ε

[Figure: contexts C1 (edge x) and C2 (edge y) lead from START to node N; choice P scans a and choice Q is empty; the complementary contexts C1’ (a b) and C2’ (b c) lead on to END.]

Cloning for LR(k)
• From Sippu & Soisalon-Soininen
– replace each non-terminal A in the original grammar G with the set of all pairs of the form ([γ]k, A) where γ is a viable prefix of the $-augmented grammar G’
• [page 16] String γ1 is LR(k) equivalent to string γ2 if VALIDk(γ1) = VALIDk(γ2); i.e. exactly those items valid for γ2 are valid for γ1 and vice versa.
• An item [A → α.β, y] is LR(k)-valid for γ if
S ⇒rm* δAz ⇒rm δαβz = γβz and k:z = y

• Question:
– is this a finer equivalence class than LL(k)?

Sanity condition on
equivalence classes
• If C1 and C2 are two contexts for some node N and
– C1 = B1 + P
– C2 = B2 + P
– B1 and B2 are in the same equivalence class
⇒ C1 and C2 must be in the same equivalence class
• Can we come up with a general construction procedure for cloning, given a specification of the equivalence classes?

[Figure: two paths B1 and B2 from START meet and continue along a common path P to node N.]
