1. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
Saif Mohammad, Univ. of Toronto, http://www.cs.toronto.edu/~smm
Ted Pedersen, Univ. of Minnesota, Duluth, http://www.d.umn.edu/~tpederse
6. WSD Tree: a decision tree for word sense disambiguation. Each internal node tests one binary feature (Feature 1 through Feature 4, branching on the values 0 and 1) and each leaf assigns a sense (SENSE 1 through SENSE 4).
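This node-test/leaf-label structure is the model used throughout the talk. A minimal sketch of how such a tree classifies an instance; the feature and sense names are the generic placeholders from the diagram, and the tree here is hand-built for illustration rather than learned from training data:

```python
# Minimal sketch of the pictured decision tree: each internal node tests one
# binary feature, each leaf names a sense.  Hand-built, not learned.

class Leaf:
    def __init__(self, sense):
        self.sense = sense

class Node:
    def __init__(self, feature, if_zero, if_one):
        self.feature = feature                    # binary feature tested at this node
        self.children = {0: if_zero, 1: if_one}   # subtree for each feature value

def classify(tree, instance):
    """Follow the feature tests from the root down to a leaf and return its sense."""
    while isinstance(tree, Node):
        tree = tree.children[instance.get(tree.feature, 0)]
    return tree.sense

# A tiny tree in the spirit of the diagram (placeholder feature/sense names).
tree = Node("Feature 1",
            Leaf("SENSE 1"),
            Node("Feature 2",
                 Leaf("SENSE 2"),
                 Node("Feature 4",
                      Leaf("SENSE 3"),
                      Leaf("SENSE 4"))))

print(classify(tree, {"Feature 1": 1, "Feature 2": 1, "Feature 4": 0}))  # SENSE 3
```

In the experiments the trees are learned from sense-tagged training data; the point here is only the shape of the classifier.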
15. Lexical Features

              Majority   Surface Form   Unigram   Bigram
  Sval-2       47.7%        49.3%        55.3%     55.1%
  Sval-1       56.3%        62.9%        66.9%     66.9%
  line         54.3%        54.3%        74.5%     72.9%
  hard         81.5%        81.5%        83.4%     89.5%
  serve        42.2%        44.2%        73.3%     72.1%
  interest     54.9%        64.0%        75.7%     79.9%
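The unigram and bigram features behind these numbers are binary indicators of whether a selected word or word pair occurs anywhere in the context of the target word (the notes later use "interest rate" and "rate" as cues for the financial sense of interest). A rough sketch with an invented toy vocabulary; how the real unigram and bigram sets were selected is not shown here:

```python
# Rough sketch of binary unigram/bigram features over the context of the
# target word.  The vocabularies below are invented for illustration.

def lexical_features(context_tokens, unigram_vocab, bigram_vocab):
    """1 if the unigram/bigram occurs anywhere in the context, else 0."""
    tokens = [t.lower() for t in context_tokens]
    bigrams = set(zip(tokens, tokens[1:]))
    feats = {f"uni({u})": int(u in tokens) for u in unigram_vocab}
    feats.update({f"bi({a} {b})": int((a, b) in bigrams) for a, b in bigram_vocab})
    return feats

# "rate" and ("interest", "rate") are cues for the financial sense of interest.
context = "the bank raised the interest rate again".split()
print(lexical_features(context,
                       unigram_vocab=["rate", "bank"],
                       bigram_vocab=[("interest", "rate")]))
# -> {'uni(rate)': 1, 'uni(bank)': 1, 'bi(interest rate)': 1}
```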
16. POS Features

              Sval-2   Sval-1   line    hard    serve   interest
  Majority    47.7%    56.3%    54.3%   81.5%   42.2%    54.9%
  P-2         47.1%    57.5%    54.9%   81.6%   60.3%    56.0%
  P-1         49.6%    59.2%    56.2%   82.1%   60.2%    62.7%
  P0          49.9%    60.3%    54.3%   81.6%   58.0%    64.0%
  P1          53.1%    63.9%    54.2%   81.6%   73.0%    65.3%
  P2          48.9%    59.9%    54.3%   81.7%   75.7%    62.3%
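P-2 through P2 are the part-of-speech tags of the words from two positions to the left of the target word through two positions to the right, with P0 the tag of the target word itself. A small sketch of extracting these features from an already POS-tagged sentence; the example sentence and tag set are illustrative:

```python
# Sketch of the P-2 .. P2 features: the POS tag of the word at each offset
# from the target word (P0 is the target itself).

def pos_window_features(tagged_sentence, target_index, window=2):
    feats = {}
    for offset in range(-window, window + 1):
        i = target_index + offset
        # Use a dummy tag when the window runs off the edge of the sentence.
        tag = tagged_sentence[i][1] if 0 <= i < len(tagged_sentence) else "NONE"
        feats[f"P{offset}"] = tag
    return feats

# Target word "line" (the cord sense) at index 3.
tagged = [("fasten", "VB"), ("the", "DT"), ("nylon", "NN"), ("line", "NN"), (".", ".")]
print(pos_window_features(tagged, target_index=3))
# -> {'P-2': 'DT', 'P-1': 'NN', 'P0': 'NN', 'P1': '.', 'P2': 'NONE'}
```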
17. Combining POS Features

              Majority   P0, P1   P-1, P0, P1   P-2, P-1, P0, P1, P2
  Sval-2       47.7%     54.3%       54.6%             54.6%
  Sval-1       56.3%     66.7%       68.0%             67.8%
  line         54.3%     54.1%       60.4%             62.3%
  hard         81.5%     81.9%       84.8%             86.2%
  serve        42.2%     60.2%       73.0%             75.7%
  interest     54.9%     70.5%       78.8%             80.6%
18. Parse Features

              Majority   Head Word   Parent Word   Phrase POS   Parent Phrase POS
  Sval-2       47.7%       51.7%        50.0%         52.9%           52.7%
  Sval-1       56.3%       64.3%        60.6%         58.5%           57.9%
  line         54.3%       54.7%        59.8%         54.3%           54.3%
  hard         81.5%       87.8%        84.5%         81.5%           81.7%
  serve        42.2%       47.4%        57.2%         41.4%           41.6%
  interest     54.9%       69.1%        67.8%         54.9%           54.9%
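The parse features are the head word of the phrase containing the target word, the head word of the parent phrase, and the labels of those two phrases. A sketch of assembling the four features once a parser has identified the relevant phrases; the dictionary representation and the attachment used in the example are assumptions for illustration:

```python
# Sketch of the four parse features for a target word, given the phrase that
# contains it and that phrase's parent.  Representation is a simplification.

def parse_features(phrase, parent_phrase):
    return {
        "HeadWord":        phrase["head"],           # head of the target's phrase
        "PhrasePOS":       phrase["label"],          # label of that phrase
        "ParentWord":      parent_phrase["head"],    # head of the parent phrase
        "ParentPhrasePOS": parent_phrase["label"],   # label of the parent phrase
    }

# "the hard work": target "hard" sits in an NP headed by "work"; suppose that
# NP is attached under a VP headed by "finished" (invented for the example).
# The head word "work" points to the "not easy, difficult" sense of hard.
np_phrase = {"label": "NP", "head": "work"}
vp_parent = {"label": "VP", "head": "finished"}
print(parse_features(np_phrase, vp_parent))
# -> {'HeadWord': 'work', 'PhrasePOS': 'NP',
#     'ParentWord': 'finished', 'ParentPhrasePOS': 'VP'}
```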
22. Best Combinations

  Data       Majority   Set 1             Set 2                  Base    Ours    Optimal   Best
  Sval-2      47.7%     Unigrams  55.3%   P-1,P0,P1     55.3%    43.6%   57.0%   67.9%    66.7%
  Sval-1      56.3%     Unigrams  66.9%   P-1,P0,P1     68.0%    57.6%   71.1%   78.0%    81.1%
  line        54.3%     Unigrams  74.5%   P-1,P0,P1     60.4%    55.1%   74.2%   82.0%    88.0%
  hard        81.5%     Bigrams   89.5%   Head, Parent  87.7%    86.1%   88.9%   91.3%    83.0%
  serve       42.2%     Unigrams  73.3%   P-1,P0,P1     73.0%    58.4%   81.6%   89.9%    83.0%
  interest    54.9%     Bigrams   79.9%   P-1,P0,P1     78.8%    67.6%   83.2%   90.1%    89.0%
26. Individual Word POS: Senseval-1

              All      Nouns    Verbs    Adj.
  Majority    56.3%    57.2%    56.9%    64.3%
  P-2         57.5%    58.2%    58.6%    64.0%
  P-1         59.2%    62.2%    58.2%    64.3%
  P0          60.3%    62.5%    58.2%    64.3%
  P1          63.9%    65.4%    64.4%    66.2%
  P2          59.9%    60.0%    60.8%    65.2%
27. Individual Word POS: Senseval-2

              All      Nouns    Verbs    Adj.
  Majority    47.7%    51.0%    39.7%    59.0%
  P-2         47.1%    51.9%    38.0%    57.9%
  P-1         49.6%    55.2%    40.2%    59.0%
  P0          49.9%    55.7%    40.6%    58.2%
  P1          53.1%    53.8%    49.1%    61.0%
  P2          48.9%    50.2%    43.2%    59.4%
28. Parse Features: Senseval-1

                   All      Nouns    Verbs    Adj.
  Majority         56.3%    57.2%    56.9%    64.3%
  Head Word        64.3%    70.9%    59.8%    66.9%
  Parent Word      60.6%    62.6%    60.3%    65.8%
  Phrase           58.5%    57.5%    57.2%    66.2%
  Parent Phrase    57.9%    58.1%    58.3%    66.2%
1. In the case of a neural network, for example, the learned model is essentially uninterpretable.
1. Bigrams and unigrams such as (interest rate) and (rate) suggest the financial sense of interest.
Notice the different tag sets to the right of turn; P0, P-2, and so on have analogous meanings. By combination I mean a single tree whose nodes may test any of the different POS features: P0 or P1 or P-2 and so on.
If we know the POS of certain words, pre-tagging those words can improve the overall quality of the automatic POS tagging. Note that when mistags occur we can no longer be confident of the tagging quality around the target word. We found many such mis-taggings of the head words in the Sval-1 and Sval-2 data (5% of head words had radical mistags, and 20% had mistags overall, radical and subtle). So we decided to find out why this was happening and, hopefully, do something about it.
We wanted to use the guaranteed pre-tagging to obtain higher-quality parsing. Head and parent words are marked in red, and all four of them suggest a particular sense of hard and line:
The hard work: the not easy, difficult sense
The hard surface: the not soft, physical sense
Fasten the line: the cord sense
Cross the line: the division sense
1. The Sval-1 (2-24) and Sval-2 (2-32) data were created such that target words with varying numbers of senses are represented. Sval-1 is annotated with senses from HECTOR, Sval-2 with senses from WordNet.
2. The interest data was created by Bruce and Wiebe from the Penn Treebank and the WSJ (ACL/DCI version), annotated with 6 senses from LDOCE.
3. The serve data was created by Leacock and Chodorow from the WSJ (1987-89) and the APHB corpus, annotated with four senses from WordNet.
4. The hard data was created by Leacock and Chodorow from the SJM corpus, annotated with three senses from WordNet.
5. The line data was created by Leacock et al. from the WSJ (1987-89) and the APHB corpus, annotated with 6 senses from WordNet.
The surface form does not do much better than the baseline. Unigrams and bigrams both perform notably well, especially considering that they are simple lexical features that are easy to capture.
A simple combination of POS features does almost as well as unigrams and bigrams, while using far fewer features. P0,P1 was found to be the most potent combination for Sval-1 and Sval-2. A larger context was found to be much more helpful for the line, hard, serve, and interest data than for the Sval data; we think this is because of their much larger amounts of training data.
The optimal ensemble is the upper bound on the accuracy achievable by an ensemble technique. A single tree with all the features might yield even better results, but we cannot say much about that; it is beyond the scope of this work.
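Concretely, the optimal ensemble counts an instance as correct whenever at least one of the two member classifiers labels it correctly, which is why it bounds any ensemble of the two; the redundancy mentioned in the next note can be read as the fraction both get right. A short sketch with made-up predictions:

```python
# Sketch of the two bounds, given gold senses and the per-instance predictions
# of two member classifiers.  All data below is invented.

def ensemble_bounds(gold, pred_a, pred_b):
    n = len(gold)
    both   = sum(g == a and g == b for g, a, b in zip(gold, pred_a, pred_b)) / n
    either = sum(g == a or g == b  for g, a, b in zip(gold, pred_a, pred_b)) / n
    return both, either   # redundancy, optimal-ensemble upper bound

gold = ["fin", "fin", "share", "fin", "share"]
lex  = ["fin", "fin", "fin",   "fin", "share"]   # say, the lexical-feature tree
syn  = ["fin", "att", "share", "fin", "share"]   # say, the POS-feature tree
both, either = ensemble_bounds(gold, lex, syn)
print(f"both correct: {both:.0%}   optimal ensemble: {either:.0%}")
# -> both correct: 60%   optimal ensemble: 100%
```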
Note the reasonable amount of redundancy (Base); that was expected. Note that the simple ensemble does slightly better than the individual features, though in the case of the line and hard data it does worse (we are not sure why); this suggests that a more powerful ensemble technique is desirable. Note the large amount of complementarity suggested by the optimal ensemble values, which are around the best results achieved so far. Combining simple lexical and syntactic features can give results close to the state of the art.
We see improvements over the baseline (not much is expected, since we are using just individual POS features). Interestingly, P1 is found to be best (we found this in all the data). Breaking the results down by the POS of the target word shows that verbs and adjectives do best with P1; verb-object relations are in effect being captured. Nouns are helped by POS tags on either side: subject-verb and verb-object relations, hence both sides help.
1. The results are similar to those for Sval-1.
The head word is found to be best. Verbs are usually the head themselves, and hence the head feature is not very useful for them. The parent word is found to do reasonably well.