2. Summary
• Part I: Lexicalized Context-Free Grammars
– motivations and definition
– relation with other formalisms
• Part II: standard parsing
– TD techniques
– BU techniques
• Part III: novel algorithms
– BU enhanced
– TD enhanced
3. Lexicalized grammars
• each rule specialized for one or more
lexical items
• advantages over non-lexicalized
formalisms:
– express syntactic preferences that are
sensitive to lexical words
– control word selection
4. Syntactic preferences
• adjuncts
Workers [ dumped sacks ] into a bin
*Workers dumped [ sacks into a bin ]
• N-N compound
[ hydrogen ion ] exchange
*hydrogen [ ion exchange ]
5. Word selection
• lexical
Nora convened the meeting
?Nora convened the party
• semantics
Peggy solved two puzzles
?Peggy solved two goats
• world knowledge
Mary shelved some books
?Mary shelved some cooks
6. Lexicalized CFG
Motivations :
• study computational properties common
to generative formalisms used in
state-of-the-art real-world parsers
• develop parsing algorithms that can be
directly applied to these formalisms
9. Lexicalized CFG
Delexicalized nonterminals encode :
• word sense
N, V, ...
• grammatical features
number, tense, ...
• structural information
bar level, subcategorization state, ...
• other constraints
distribution, contextual features, ...
10. Lexicalized CFG
• productions have two forms :
– V[dump] → dumped
– VP[dump][sack]
→ VP[dump][sack] PP[into][bin]
• lexical elements in lhs inherited from rhs
11. Lexicalized CFG
• production is k-lexical :
k occurrences of lexical elements in rhs
– NP[bin] → Det[a] N[bin]
is 2-lexical
– VP[dump][sack]
→ VP[dump][sack] PP[into][bin]
is 4-lexical
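As a concrete illustration (not from the slides), the degree of lexicalization of a production can be computed from its right-hand-side symbols; the bracket notation follows the slides:

```python
def lexical_degree(rhs):
    """Count lexical-element occurrences in a production's right-hand
    side: each bracketed element like [dump] or [bin] counts once.
    A production with k such occurrences is k-lexical."""
    return sum(symbol.count("[") for symbol in rhs)

# NP[bin] -> Det[a] N[bin] is 2-lexical:
assert lexical_degree(["Det[a]", "N[bin]"]) == 2
# VP[dump][sack] -> VP[dump][sack] PP[into][bin] is 4-lexical:
assert lexical_degree(["VP[dump][sack]", "PP[into][bin]"]) == 4
```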
12. LCFG at work
• 2-lexical CFG
– Alshawi 1996
: Head Automata
– Eisner 1996
: Dependency Grammars
– Charniak 1997
: CFG
– Collins 1997
: generative model
13. LCFG at work
Probabilistic LCFG G is strongly equivalent
to probabilistic grammar G′ iff
• one-to-one mapping between derivations
• each direction is a homomorphism
• derivation probabilities are preserved
14. LCFG at work
From Charniak 1997 to 2-lex CFG :
NP^S[profits] → ADJ^NP[corporate] N^NP[profits]

Pr1(corporate | ADJ, NP, profits) ×
Pr1(profits | N, NP, profits) ×
Pr2(NP → ADJ N | NP, S, profits)
15. LCFG at work
From Collins 1997 (Model #2) to 2-lex CFG :
〈VP^S, {NP-C}, ∆left, ∆S〉[bought]
→ 〈N, ∆〉[IBM] 〈VP^S, {}, ∆left ⊕ ∆, ∆S ⊕ ∆〉[bought]

Prleft(NP, IBM | VP, S, bought, ∆left, {NP-C})
16. LCFG at work
Major Limitation :
Cannot capture relations involving
lexical items outside the actual constituent
(cf. history-based models)

[diagram: V[d0] → d0, NP[d1] → d1, and a
PP[d2][d3]; the model cannot look at d0
when computing PP attachment]
17. LCFG at work
• lexicalized context-free parsers that
are not LCFG :
– Magerman 1995
: Shift-Reduce+
– Ratnaparkhi 1997
: Shift-Reduce+
– Chelba & Jelinek 1998
: Shift-Reduce+
– Hermjakob & Mooney 1997
: LR
18. Related work
Other frameworks for the study of
lexicalized grammars :
• Carroll & Weir 1997 :
Stochastic Lexicalized Grammars;
emphasis on expressiveness
• Goodman 1997 :
Probabilistic Feature Grammars;
emphasis on parameter estimation
19. Summary
• Part I: Lexicalized Context-Free Grammars
– motivations and definition
– relation with other formalisms
• Part II: standard parsing
– TD techniques
– BU techniques
• Part III: novel algorithms
– BU enhanced
– TD enhanced
20. Standard Parsing
• standard parsing algorithms
(CKY, Earley, LC, ...) run on LCFG in time
O(|G| × |w|³)
• for 2-lex CFG (simplest case) |G| grows
with |V_D|³ × |V_T|² !!
Goal :
Get rid of the |V_T| factors
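For reference, the O(|G| × |w|³) behaviour comes from standard chart parsing; a minimal CKY recognizer sketch (toy CNF grammar encoding assumed, not from the talk) makes the three |w| factors and the grammar factor explicit:

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """Minimal CKY recognizer for a CFG in Chomsky normal form.

    lexical: dict word -> set of nonterminals (A -> word)
    binary:  dict (B, C) -> set of nonterminals (A -> B C)
    Three nested loops over string positions give the |w|^3 factor;
    scanning the applicable productions gives the |G| factor.
    """
    n = len(words)
    chart = defaultdict(set)              # (i, j) -> nonterminals over i..j
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= lexical.get(w, set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # split point
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((B, C), set())
    return start in chart[(0, n)]
```

For example, with the toy grammar `{"IBM": {"NP"}, "bought": {"V"}, "Lotus": {"NP"}}` and `{("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}`, the string "IBM bought Lotus" is accepted.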
21. Standard Parsing: TD
Result (to be refined) :
Algorithms satisfying the correct-prefix
property are “unlikely” to run on LCFG in
time independent of |V_T|
26. Standard Parsing: TD
Fact :
We can simulate a nondeterministic
FA M on w in time O(|M| × |w|)
Conjecture :
Fix a polynomial p.
We cannot simulate M on w in time
p(|w|) unless we spend exponential
time in precompiling M
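The O(|M| × |w|) simulation mentioned in the Fact is the standard subset-tracking run of the NFA over w; a minimal sketch (epsilon-free NFA, toy transition encoding assumed):

```python
def nfa_simulate(delta, start, finals, w):
    """Simulate a nondeterministic FA on w in O(|M| * |w|) time by
    maintaining the set of states reachable after each input symbol.

    delta: dict (state, symbol) -> set of next states
    """
    current = {start}
    for a in w:
        current = {q for s in current for q in delta.get((s, a), set())}
        if not current:               # no surviving computation
            return False
    return bool(current & finals)
```

Each of the |w| steps touches at most all |M| transitions, which gives the stated bound without any precompilation of M.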
27. Standard Parsing: TD
Assume our conjecture holds true
Result :
Off-line parsers with the correct-prefix property
cannot run in time O(p(|V_D|, |w|)),
for any polynomial p, unless we spend
exponential time in precompiling G
28. Standard Parsing: BU
Common practice in lexicalized grammar
parsing :
• select productions that are lexically
grounded in w
• parse BU with selected subset of G
Problem :
Algorithm removes |V_T| factors but
introduces new |w| factors !!
29. Standard Parsing: BU
Time charged :
• i, k, j ⇒ |w|³
• A, B, C ⇒ |V_D|³
• d1, d2 ⇒ |w|²

[diagram: A[d2] over i..j is built from B[d1]
(head at d1) over i..k and C[d2] (head at d2)
over k..j]

Running time is O(|V_D|³ × |w|⁵) !!
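The five |w| factors can be made visible in code; the sketch below is a toy closure loop over bilexical items, with a hypothetical `combine` callback standing in for the grammar (illustrative only, not the implementation used in the experiments):

```python
def naive_bilexical_close(items, combine):
    """One bottom-up closure over bilexical items (A, head, i, j).

    combine(B, d1, C, d2) returns (A, head) if some production
    A[..] -> B[d1] C[d2] applies, else None.  The basic step touches
    three string positions (i, k, j) and two head positions (d1, d2):
    five |w| factors, the source of the O(|V_D|^3 * |w|^5) bound.
    """
    changed = True
    while changed:
        changed = False
        for (B, d1, i, k) in list(items):
            for (C, d2, k2, j) in list(items):
                if k != k2:           # spans must be adjacent
                    continue
                result = combine(B, d1, C, d2)
                if result is None:
                    continue
                A, head = result
                if (A, head, i, j) not in items:
                    items.add((A, head, i, j))
                    changed = True
    return items
```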
30. Standard BU : Exhaustive
[log-log plot of parsing time vs. sentence length,
exhaustive parsing, BU naive: fitted curve
y = c·x^5.2019]
31. Standard BU : Pruning
[log-log plot of parsing time vs. sentence length,
with pruning, BU naive: fitted curve
y = c·x^3.8282]
32. Summary
• Part I: Lexicalized Context-Free Grammars
– motivations and definition
– relation with other formalisms
• Part II: standard parsing
– TD techniques
– BU techniques
• Part III: novel algorithms
– BU enhanced
– TD enhanced
33. BU enhanced
Result :
Parsing with 2-lex CFG in time
O(|V_D|³ × |w|⁴)
Remark :
Result transfers to models in Alshawi 1996,
Eisner 1996, Charniak 1997, Collins 1997
Remark :
Technique extends to improve parsing of
Lexicalized-Tree Adjoining Grammars
34. Algorithm #1
Basic step in naive BU :

[diagram: combine B[d1] over i..k with C[d2]
over k..j into A[d2] over i..j]

Idea :
Indices d1 and j can be processed
independently
35. Algorithm #1
• Step 1 :
pair B[d1] over i..k with the head d2 of a
constituent starting at k; d1 is discarded
⇒ partial A[d2] over i..k
• Step 2 :
extend the partial A[d2] over i..k with
C[d2] over k..j
⇒ A[d2] over i..j
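The two-step idea can be sketched as follows (toy item encodings and a hypothetical `combine` callback; the sketch ignores the label of the right constituent, which the full algorithm also checks):

```python
def step1(left_items, heads_at, combine):
    """Algorithm #1, step 1 (sketch): pair B[d1] over i..k with the
    head d2 of some constituent starting at k.  Index j is never
    consulted, and d1 is discarded once the pairing is done."""
    partials = set()
    for (B, d1, i, k) in left_items:
        for d2 in heads_at.get(k, ()):    # heads of constituents at k
            A = combine(B, d1, d2)
            if A is not None:
                partials.add((A, d2, i, k))
    return partials

def step2(partials, right_items):
    """Algorithm #1, step 2 (sketch): extend the partial A[d2] over
    i..k with a completed constituent headed at d2 spanning k..j.
    Only i, k, j and d2 are involved, so neither step touches five
    string positions at once: the bound drops to O(|w|^4)."""
    out = set()
    for (A, d2, i, k) in partials:
        for (_, d2r, kr, j) in right_items:
            if d2r == d2 and kr == k:
                out.add((A, d2, i, j))
    return out
```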
36. BU enhanced
Upper bound provided by Algorithm #1 :
O(|w|⁴)
Goal :
Can we go down to O(|w|³) ?
37. Spine
The spine of a parse tree is the path from
the root to the root’s head
S[buy]
├── S[buy]
│   ├── NP[IBM] → IBM
│   └── VP[buy]
│       ├── V[buy] → bought
│       └── NP[Lotus] → Lotus
└── AdvP[week] → last week

spine: S[buy] – S[buy] – VP[buy] – V[buy] – bought
38. Spine projection
The spine projection is the yield of the subtree
composed of the spine and all its sibling nodes
S[buy]
├── S[buy]
│   ├── NP[IBM] → IBM
│   └── VP[buy]
│       ├── V[buy] → bought
│       └── NP[Lotus] → Lotus
└── AdvP[week] → last week

spine projection: NP[IBM] bought NP[Lotus] AdvP[week]
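Spine projections are easy to compute on a toy tree representation; the code below is an illustrative sketch (not from the talk), assuming each node label carries its lexical head in brackets and the head child is the unique child sharing that head:

```python
import re

def head_of(label):
    """Extract the lexical head from a delexicalized label, 'VP[buy]' -> 'buy'."""
    m = re.search(r"\[([^\]]+)\]", label)
    return m.group(1) if m else None

def spine_projection(tree):
    """Left-to-right yield of the spine's sibling labels plus the head word.

    tree: (label, children); children are subtrees or terminal strings.
    Assumes (toy convention) that the head child is the unique child
    whose label carries the parent's lexical head.
    """
    label, children = tree
    h = head_of(label)
    if all(isinstance(c, str) for c in children):       # preterminal
        return [" ".join(children)]
    head_child = next(c for c in children if head_of(c[0]) == h)
    left, right, seen_head = [], [], False
    for c in children:
        if c is head_child:
            seen_head = True
        elif not seen_head:
            left.append(c[0])          # sibling label, left of spine
        else:
            right.append(c[0])         # sibling label, right of spine
    return left + spine_projection(head_child) + right
```

On the tree above this yields `NP[IBM] bought NP[Lotus] AdvP[week]`, the spine projection shown on the slide.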
39. Split Grammars
Split spine projections at head :
Problem :
how much information do we need to store
in order to construct new grammatical
spine projections from splits ?
40. Split Grammars
Fact :
Set of spine projections is a linear context-free language
Definition :
2-lex CFG is split if set of spine projections
is a regular language
Remark :
For split grammars, we can recombine
splits using finite information
41. Split Grammars
Non-split grammar :
S[d] → AdvP[a] S1[d]
S1[d] → S[d] AdvP[b]
…
• unbounded # of
dependencies
between left and
right dependents
of head
• linguistically
unattested and
unlikely
43. Split Grammars
Precompile grammar such that splits are
derived separately
[diagram: the parse of S[buy] is transformed so
that the left and right splits are derived
separately; on the right spine, r1[buy], r2[buy],
r3[buy] are states of a FA recognizing the spine
projections for "buy"]

r3[buy] is a split symbol
44. Split Grammars
• t : max # of states per spine automaton
• g : max # of split symbols per spine
automaton (g < t)
• m : # of delexicalized nonterminals that
are maximal projections
45. BU enhanced
Result :
Parsing with split 2-lexical CFG in time
O(t² g² m² |w|³)
Remark:
Models in Alshawi 1996, Charniak 1997
and Collins 1997 are not split
46. Algorithm #2
Idea :
• recognize left and
right splits separately
• collect head
dependents one
split at a time

[diagram: a constituent B[d] with head d is
assembled from its left and right splits, which
meet at a split symbol s]
49. Algorithm #2 : Exhaustive
[log-log plot of parsing time vs. sentence length,
exhaustive parsing: BU naive fits y = c·x^5.2019,
BU split fits y = c·x^3.328]
50. Algorithm #2 : Pruning
[log-log plot of parsing time vs. sentence length,
with pruning: BU naive fits y = c·x^3.8282,
BU split fits y = c·x^2.8179]
51. Related work
Cubic time algorithms for lexicalized
grammars :
• Sleator & Temperley 1991 :
Link Grammars
• Eisner 1997 :
Bilexical Grammars (improved by
transfer of Algorithm #2)
52. TD enhanced
Goal :
Introduce TD prediction for 2-lexical
CFG parsing, without |V_T| factors
Remark :
Must relax left-to-right parsing
(because of previous results)
53. TD enhanced
Result :
TD parsing with 2-lex CFG in time
O(|V_D|³ × |w|⁴)
Open :
O(|w|³) extension to split grammars
57. Algorithm #3
Item representation :
• i, j indicate extension
of A[d] partial analysis
• k indicates rightmost
possible position for
completion of A[d]
analysis

[diagram: A[d] spans i..j under root S;
completion must occur by position k]
58. Algorithm #3 : Prediction
• Step 1 :
find rightmost
subsequence
before k for some
A[d2] production
• Step 2 :
make Earley
prediction

[diagram: predicting A[d2] → … B[d1] … C[d2] …,
with the heads d1, d2 located as a rightmost
subsequence of the input before position k]
59. Conclusions
• standard parsing techniques are not
suitable for processing lexicalized
grammars
• novel algorithms have been introduced
using enhanced dynamic programming
• work to be done :
extension to history-based models
60. The End
Many thanks for helpful discussion to :
Jason Eisner, Mark-Jan Nederhof
Editor's notes
This is a longer version of my invited talk at IWPT00, including some slides not shown at the presentation (because of time restrictions). The following very short notes are a summary of additional comments (not appearing in the slides) that I have made during the presentation.
Work by Jason, Mark-Jan and myself presented at ACL99, NAACL00, TAG+5 plus work in progress. Many thanks to Jason and Mark-Jan for plenty of discussion during talk preparation.
Standard parsing techniques are those used by our community in the 80’s with unlexicalized formalisms, and still exploited at present with several language models.
Part II goes through a computational analysis of these techniques/strategies and shows that they do not perform efficiently with lexicalized formalisms.
Novel solutions are presented in Part III.
Development of lexicalized grammars has been strongly enhanced by availability of large tree banks, starting with the beginning of the 90’s.
Certain problems (see above) cannot be solved if the system does not use lexical information. Therefore extensive lexical relations have been introduced into the syntactic model. I am not going to dispute here whether the choice was a good one; things might evolve in the future toward more modular and explanatory systems.
This slide not shown at IWPT00.
These are standard examples taken from the existing literature.
Change the linguistic examples: use minimal pairs showing how syntactic structures switch with change of lexical element(s).
This slide not shown at IWPT00.
“Convene” shows a case of genuinely lexical factors.
Solve needs object with an actual state and a desired state, etc.
Abstract away from differences in language models that are not relevant for computational complexity analysis.
Standard CFG + lexical element percolation.
We call co-head the lexical head of a complement of the actual constituent.
The formalism captures relations among lexical elements. When weights are introduced, this translates into preferences.
I am trying to keep to a minimum # of definitions, these are the only symbols you need to cache to follow this talk.
The really new notion here is the set of delexicalized nonterminals.
This slide not shown at IWPT00.
No direct reference to lexical elements is allowed in a delexicalized nonterminal.
Among “other constraints”: Charniak uses constraint on father node label (to differentiate distributions); Collins uses gap feature (see GPSG) and other finite info about distribution of certain lexical elements.
Here is an important notion expressing the amount of lexicalization the grammar supports. We measure the complexity of the grammar on the basis of the arity of the represented lexical relations.
Remark: Bilexical grammar implies binary productions.
Remark: The degree of lexicalization k induces a proper hierarchy on the generated languages.
Here are some state-of-the-art language models that are strongly related to lexicalized CFG. Strongly related means there exists a linear-time, one-to-one mapping (actually, a homomorphism) between derivations that preserves derivation probabilities.
Main reason for the large development of bilexical models rests on the difficulty, given available annotated data, in estimating lexical relations of arity greater than 2.
Open: is Lafferty et al. 1992 probabilistic version of Link Grammars a 3-lexical CFG?
This slide not shown at IWPT00.
This slide not shown at IWPT00.
This slide not shown at IWPT00.
Multi-sets denote a particular state in the process of generating complements subcategorized for by the head. Delta's are finite functions denoting distributions of certain lexical elements; these can be merged.
NP is the direct object of the verb. In this case, taking the verbal head into consideration would help in computing the likelihood of PP-attachment.
Contrast with more general history based models.
I am not considering here models that are not based on CFG, as Brill’s transformation based model.
Computational limitations derived in Part II transfer to these models, since they are more powerful than 2-lexical CFG. Our results in Part III should however be extended.
Open: relation between 2-lex CFG and lexicalized version of DOP model.
Let us consider the least complex case, namely that of bilexical CFG. Grammar size grows with the square of alphabet size (|V_T|^2 is tight in real world cases, |V_D|^3 is not).
For the rest of this talk, our goal will be to develop parsing strategies running in time independent of |V_T| factors (in our case this will always result in sublinear time wrt the grammar).
We present a computational analysis of a family of parsing algorithms that use the so-called top-down prediction. These algorithms share the correct-prefix property (to be introduced in next slide).
The result, to be refined in the next slides, supports the empirically based common belief that BU strategies perform better than TD strategies in processing lexicalized formalisms based on context-free grammars.
The green productions are called here bridging productions. Note that this property is defined in combination with a strict left-to-right processing requirement. We will show that this combination of requirements is expensive to maintain.
G is some data structure representing the input grammar but no additional information that could be inferred from it.
Result says we cannot get rid of |V_T| factors when top-down prediction is involved. Mathematically: it does not matter how fast function f grows, it cannot be independent of V_T.
The proof shows that the correct-prefix property is not a local property, you must consider the whole grammar in order to test this condition.
We are allowed to extract additional information from the input grammar previous to parsing.
The on-line parsing result was easy to establish. Transferring this result to the off-line case is quite difficult, as it is difficult in general to prove non-trivial lower bounds for many problems in computer science.
The best we have been able to do is to provide evidence in support of the result, by relating it with a conjecture regarding FA.
When parsing with a nondeterministic FA, in order to remove any dependence on the size of M from the running time, we need to spend exponential time in determinizing M. No algorithm is known that parses in time linear in |w| and independent of M while requiring only polynomial-time precompilation.
What if we are prepared to spend more than linear time in processing w? Can we compensate for the lack of information on M, due to the polynomial bound on precompilation, by allowing more time in processing w? We conjecture the answer is no, due to the independence between M and w.
Similar to the on-line result, but relativized to our FA conjecture.
Standard CYK parsing of bilexical CFG shows an increase in time complexity of a factor of |w|^2, ascribed to the choice of productions that are grounded in the input string span under recognition.
How much does this affect us in real life? Here are some experiments that were run by Jason Eisner, on a bilexical CFG that was estimated from the Penn tree bank. I will say more later about the grammar, when we compare these results to the performance of our novel algorithms.
Exhaustive parsing has applications in EM algorithms.
We consider here a quite simple pruning strategy, using standard beam-search and some thresholding. The strategy has a speed-up of a factor between 10 to 20.
Note that the polynomial represented by the line gives a running time that is worse than the worst case we have with unlexicalized CFG.
This is an English text experiment. If we use input with a higher degree of noise, for instance a language with word-boundary uncertainty or a word-lattice input from some low-level speech recognizer, then I expect the ambiguity in the choice of head position to be even worse, resulting in an increased gap between the unlexicalized and the lexicalized cases.
Result directly transfers to the above language models; published implementations of parsers for these models are all O(|w|^5).
The naive method for Lexicalized Tree-Adjoining Grammar parsing is O(|w|^8), see Resnik 1992 and Schabes 1992. The result improves to O(|w|^7). The grammar constant involved is small and does not depend on the alphabet size.
Some predicate headed in d_2 is seeking a left dependent with head d_1.
In pairing the constituent headed in d_1 and extending from i to k with head d_2, we do not need to consider index j. After the pairing is done, d_1 can be discarded, and then we can come to the processing of index j, which we need in order to compute the extension of the final constituent (the combination of the two). Thus we can split the basic step into two sub-steps.
No more than 4 indices are involved at each step, thus complexity is O(|w|^4).
There are some optimizations that we can apply here (though they do not improve the asymptotic time complexity). We can apply Step 1 only in case d_2 is the head of a constituent starting at k and extending to the right. This amounts to saying that we pack together all constituents starting at k and with head at a given position, before applying Step 1. The unpacking is then done when applying Step 2.
The spine is the path along which the root head is percolated.
Let us consider only spines that correspond to maximal projections of some head.
The idea with spine projections is to look at possible dependents of a given head in their left-to-right order. (Again, assume maximal projections.)
Convention : dependents are all phrases that can combine with a given head; they divide up into complements and adjuncts.
Let us split spine projections at the head position (either before or after). If we combine left and right splits coming from different spine projections, how do we find out whether the combination is still a grammatical spine projection?
Linear context-free grammar: only one nonterminal in the right-hand side of each production. Derivation trees grow along a single spine.
When combining splits, we can use finite information related to the states of an FA that recognizes spine projections. We need one FA for each lexical head.
In a non-split grammar, we can establish an unbounded number of dependencies between left and right dependents of some head.
In the example, we alternate sentential modifiers to the left and the right, using a special extra symbol (S_1) to record the state in this process. In this way, each left modifier depends on some right modifier, and we obtain non-regular linear context-free languages of the form a^n # b^n.
I don’t know of any natural language that behaves in this way, and it seems extremely unlikely that such a language exists.
In contrast, split grammars can only establish a finite number of dependencies between left and right dependents of the head.
Here is a more realistic natural language example, where left or right attachment of the modifier depends only on the head (i.e., not on the number of modifiers attached at the opposite side). In this case we have zero dependencies between left and right dependents of the head.
For split grammars, we can precompile the grammar in such a way that the above transformation is established for each spine.
The new symbols on the right spine are related to states of a FA that recognizes spine projections for the head “buy”. Symbol r_3 is a split symbol: it marks the switch from derivation of right split to derivation of left split.
These are “small” grammar constants that do not depend on the alphabet size (only used in the next slide, do not cache them). Roughly speaking, the product t times m is of the same order as the size of the delexicalized nonterminals in the grammar. Constant g is quite small in practice, and equal to one for most of the lexical entries of the grammar.
Grammar constant here is of the same order as that in Algorithm #1 (we are not trading computational resources).
The above language models have enough power to express non-split grammars. If only split grammars are interesting for natural language modeling, then we should not have any loss of accuracy if we restrict the above models to this class, and this will result in an asymptotic improvement in time efficiency as stated by the result.
Each constituent is represented by the algorithm by means of its left and right splits. Splits are recognized separately, and each head collects its dependents one split at a time.
Here is a high-level view of how Algorithm #2 works.
We see how a left dependent is collected by some head d_2 that is recognizing its left split. Recognition of right splits proceeds symmetrically.
Note that each of the two steps only involves three input string indices.
We report a comparison between the naive CKY implementation and the O(n^3) Algorithm #2. We are running a split CFG that has been estimated from the Penn tree bank according to the model presented in Eisner 1998.
At 30-length sentences, we have a speed up of 19 for exhaustive parsing.
At 30-length sentences, we have a speed up of 5 when our pruning strategy is applied to both algorithms.
The first algorithm is defined for a non-lexicalized grammar model, but it has the nice property that, when used with a lexicalized version of the grammar, it still works in n^3 time.
These two algorithms work for dependency-like models; the techniques employed are quite different from the ones for phrase structures we have presented.
In the perspective of unlexicalized grammar formalisms, top-down prediction is effective in reducing the parsing search space, as compared to crude bottom-up parsing. Correct-prefix property is a desired filtering technique that we would like to have in a parsing algorithm for lexicalized CFG as well.
As we have seen, direct implementation of the correct-prefix property in a parser for bilexical CFG that works strictly left-to-right is very likely to result in an inefficient computation. So we have to give up something in the end.
We will be able to implement an even stronger version of the correct prefix property, by relaxing only the left-to-right condition.
We present a version of the algorithm that works for general bilexical grammars and extends the BU result.
Open: split grammar case.
We exploit a generalization of look-ahead. Recall that we require the existence of bridging productions (the green productions) associated with each partially or completely recognized constituent (yellow constituent). We restrict possible bridging productions to those that have unexpanded nonterminals with heads grounded in the portion of the string not yet analyzed, and in the same order.
In order to have running time independent of the alphabet size, we need to develop specific data structures for grammar and input string representation.
Productions with the same left-hand side have the sequences of lexical elements appearing in their right-hand sides reversed and stored into a trie.
We also employ standard deterministic FA constructions that represent the set of all subsequences appearing in a given string (when read right-to-left).
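The right-to-left subsequence lookup used in prediction can be sketched directly (an illustrative stand-in for the trie/FA machinery described above, not the actual data structures):

```python
def rightmost_embedding(seq, w, k):
    """Find the rightmost occurrence of seq as a subsequence of w[:k],
    scanning right-to-left and matching seq from its last element.
    This mirrors reading the reversed lexical sequences of a trie
    against the input.  Returns the matched positions in
    left-to-right order, or None if no embedding exists."""
    positions = []
    j = len(seq) - 1
    for i in range(k - 1, -1, -1):
        if j < 0:
            break
        if w[i] == seq[j]:        # greedy right-to-left match
            positions.append(i)
            j -= 1
    return None if j >= 0 else positions[::-1]
```

Greedy matching from the right yields the rightmost embedding, which is exactly what the prediction step needs in order to bound where an analysis can still be completed.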
As in the Earley algorithm or in chart-based algorithms, each item representing a partial analysis is associated with two indices corresponding to its extension in the input string. In addition, we keep a third index, indicating the rightmost position in the input beyond which the partial analysis cannot be extended without violating our version of the correct prefix property.
The algorithm can be viewed as an extension of the Earley algorithm, in which standard left-to-right phrase analysis is synchronized with right-to-left analysis for lexical item subsequences.
The prediction step of the algorithm works as follows. In order to expand a predicted left-hand side symbol A[d_2] that should not extend beyond position k, we read the trie associated with A[d_2] and our input string representation. This step locates all rightmost occurrences of subsequences of the input string that correspond to sequences of lexical elements appearing in the right-hand side of some A[d_2] production. Then we apply the standard prediction step as in the Earley algorithm, and predict the left-corner of the selected productions. Again, each item is associated with a third index beyond which the analysis cannot be expanded. This index is directly computed from the recognized subsequences.
Further work needs to be done in order to extend the presented results to more general history-based models.