Building an Inflectional Stemmer for Bulgarian

1. BulStem: Design and Evaluation of Inflectional Stemmer for Bulgarian
Preslav Nakov [email_address], EECS, University of California at Berkeley
Presented by: Svetlin Nakov [email_address], Sofia University

12. BulStem: Evaluation of Text Categorisation Accuracy (cont.)
Text classification accuracy: raw, stemming and lemmatisation.

                    STOP-WORDS KEPT                         STOP-WORDS REMOVED
LWF  GWF  LSA dim.  raw       stem 2:1  stem 3:1  lemma     raw       stem 2:1  stem 3:1  lemma
0    0    10        78.74%    88.98%    85.04%    84.25%    92.13%    95.28%    92.13%    96.85%
0    0    30        83.46%    86.61%    88.98%    84.25%    96.85%    99.21%    100.00%   99.21%
0    0    orig.     74.80%    89.76%    91.34%    85.83%    96.06%    96.06%    96.06%    98.43%
0    1    10        76.38%    89.76%    89.76%    81.10%    96.85%    97.64%    98.43%    96.85%
0    1    30        83.46%    89.76%    88.19%    85.83%    95.28%    97.64%    98.43%    98.43%
0    1    orig.     61.42%    87.40%    87.40%    85.04%    96.06%    95.28%    96.06%    98.43%
0    2    10        55.91%    61.42%    54.33%    65.35%    92.13%    91.34%    94.49%    93.70%
0    2    30        55.91%    69.29%    64.57%    71.65%    90.55%    95.28%    97.64%    96.85%
0    2    orig.     57.48%    68.50%    68.50%    72.44%    93.70%    93.70%    98.43%    98.43%
0    3    10        95.28%    98.43%    97.64%    99.21%    97.64%    98.43%    98.43%    99.21%
0    3    30        94.49%    100.00%   100.00%   99.21%    99.21%    100.00%   100.00%   100.00%
0    3    orig.     92.13%    98.43%    98.43%    96.85%    99.21%    100.00%   100.00%   100.00%
0    4    10        89.76%    83.46%    85.83%    80.31%    92.13%    96.85%    93.70%    93.70%
0    4    30        89.76%    96.06%    91.34%    95.28%    96.85%    98.43%    97.64%    100.00%
0    4    orig.     73.23%    89.76%    91.34%    83.46%    99.21%    97.64%    96.85%    97.64%
0    5    10        97.64%    98.43%    98.43%    99.21%    96.06%    98.43%    99.21%    99.21%
0    5    30        99.21%    100.00%   100.00%   100.00%   98.43%    100.00%   100.00%   100.00%
0    5    orig.     96.85%    100.00%   100.00%   99.21%    99.21%    100.00%   100.00%   100.00%
1    0    10        96.85%    95.28%    96.06%    96.85%    94.49%    98.43%    96.85%    97.64%
1    0    30        90.55%    97.64%    98.43%    96.85%    99.21%    100.00%   99.21%    99.21%
1    0    orig.     90.55%    94.49%    96.06%    95.28%    96.06%    96.85%    98.43%    99.21%
1    1    10        92.91%    96.85%    96.85%    96.85%    96.85%    98.43%    98.43%    97.64%
1    1    30        85.83%    91.34%    92.13%    92.13%    96.06%    96.06%    96.06%    98.43%
1    1    orig.     62.99%    85.04%    81.10%    90.55%    95.28%    91.34%    92.91%    96.85%
1    2    10        84.25%    89.76%    89.76%    88.19%    93.70%    95.28%    96.06%    96.06%
1    2    30        84.25%    91.34%    89.76%    88.98%    92.13%    99.21%    99.21%    97.64%
1    2    orig.     82.68%    93.70%    95.28%    92.13%    96.85%    99.21%    98.43%    98.43%
1    3    10        97.64%    99.21%    99.21%    99.21%    99.21%    99.21%    99.21%    99.21%
1    3    30        99.21%    99.21%    100.00%   100.00%   99.21%    99.21%    100.00%   100.00%
1    3    orig.     98.43%    100.00%   99.21%    99.21%    99.21%    100.00%   100.00%   100.00%
1    4    10        97.64%    96.85%    96.85%    96.85%    96.06%    97.64%    96.85%    96.85%
1    4    30        95.28%    96.85%    96.85%    96.85%    95.28%    98.43%    98.43%    97.64%
1    4    orig.     96.85%    96.85%    95.28%    96.85%    97.64%    97.64%    96.06%    97.64%
1    5    10        98.43%    99.21%    99.21%    99.21%    99.21%    99.21%    99.21%    99.21%
1    5    30        99.21%    99.21%    99.21%    100.00%   99.21%    99.21%    100.00%   99.21%
1    5    orig.     98.43%    100.00%   100.00%   100.00%   99.21%    100.00%   100.00%   100.00%
AVERAGE             86.33%    92.19%    91.73%    91.51%    96.46%    97.68%    97.86%    98.27%

13. BulStem: Evaluation of Text Categorisation Accuracy (cont.)
Text classification: stemming parameters evaluation (no stop-words)

LWF  GWF  SVD       1:1      1:2      1:5      1:10     1:20     2:1      2:2      2:5      2:10     3:1      3:2      3:3
0    0    10        96.85%   92.91%   92.91%   95.28%   93.70%   88.98%   93.70%   92.91%   92.91%   85.04%   92.91%   92.91%
0    0    30        99.21%   99.21%   98.43%   98.43%   99.21%   86.61%   93.70%   99.21%   99.21%   88.98%   100.00%  99.21%
0    0    orig.     96.06%   94.49%   95.28%   96.06%   96.06%   89.76%   85.83%   97.64%   98.43%   91.34%   95.28%   95.28%
0    1    10        98.43%   98.43%   98.43%   98.43%   98.43%   89.76%   94.49%   99.21%   98.43%   89.76%   97.64%   97.64%
0    1    30        96.85%   95.28%   96.06%   97.64%   97.64%   89.76%   94.49%   97.64%   97.64%   88.19%   96.85%   97.64%
0    1    orig.     93.70%   92.91%   94.49%   96.06%   96.06%   87.40%   81.10%   97.64%   98.43%   87.40%   96.06%   95.28%
0    2    10        92.13%   92.91%   95.28%   93.70%   94.49%   61.42%   71.65%   94.49%   92.13%   54.33%   93.70%   93.70%
0    2    30        88.98%   97.64%   96.85%   96.85%   97.64%   69.29%   81.89%   98.43%   97.64%   64.57%   98.43%   98.43%
0    2    orig.     90.55%   91.34%   94.49%   96.06%   98.43%   68.50%   74.80%   99.21%   96.85%   68.50%   98.43%   96.85%
0    3    10        98.43%   98.43%   98.43%   98.43%   98.43%   98.43%   98.43%   98.43%   98.43%   97.64%   98.43%   98.43%
0    3    30        100.00%  100.00%  99.21%   100.00%  100.00%  100.00%  100.00%  99.21%   100.00%  100.00%  100.00%  100.00%
0    3    orig.     100.00%  100.00%  100.00%  100.00%  100.00%  98.43%   99.21%   100.00%  100.00%  98.43%   100.00%  100.00%
0    4    10        92.91%   96.06%   96.06%   96.85%   92.91%   83.46%   94.49%   95.28%   94.49%   85.83%   92.13%   92.91%
0    4    30        98.43%   98.43%   98.43%   98.43%   98.43%   96.06%   97.64%   97.64%   96.85%   91.34%   99.21%   98.43%
0    4    orig.     96.85%   96.85%   96.85%   98.43%   98.43%   89.76%   91.34%   97.64%   96.85%   91.34%   97.64%   97.64%
0    5    10        99.21%   98.43%   99.21%   99.21%   99.21%   98.43%   98.43%   97.64%   98.43%   98.43%   98.43%   98.43%
0    5    30        100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  99.21%   100.00%  100.00%  100.00%  100.00%
0    5    orig.     100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
1    0    10        97.64%   97.64%   97.64%   97.64%   97.64%   95.28%   96.06%   97.64%   98.43%   96.06%   97.64%   96.06%
1    0    30        98.43%   99.21%   99.21%   100.00%  100.00%  97.64%   97.64%   98.43%   98.43%   98.43%   98.43%   98.43%
1    0    orig.     97.64%   96.85%   97.64%   96.85%   97.64%   94.49%   98.43%   96.85%   97.64%   96.06%   96.85%   96.06%
1    1    10        97.64%   98.43%   98.43%   98.43%   98.43%   96.85%   97.64%   97.64%   97.64%   96.85%   98.43%   98.43%
1    1    30        96.06%   95.28%   96.06%   96.06%   96.06%   91.34%   94.49%   96.06%   94.49%   92.13%   96.06%   96.85%
1    1    orig.     92.91%   90.55%   92.13%   91.34%   91.34%   85.04%   92.91%   90.55%   90.55%   81.10%   89.76%   89.76%
1    2    10        96.06%   95.28%   95.28%   96.06%   95.28%   89.76%   93.70%   96.06%   95.28%   89.76%   96.06%   94.49%
1    2    30        97.64%   99.21%   100.00%  99.21%   100.00%  91.34%   95.28%   100.00%  100.00%  89.76%   100.00%  100.00%
1    2    orig.     99.21%   97.64%   99.21%   97.64%   98.43%   93.70%   89.76%   99.21%   99.21%   95.28%   99.21%   99.21%
1    3    10        99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   98.43%
1    3    30        99.21%   99.21%   99.21%   99.21%   100.00%  99.21%   100.00%  99.21%   98.43%   100.00%  100.00%  99.21%
1    3    orig.     100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  99.21%   100.00%  100.00%
1    4    10        96.85%   98.43%   98.43%   98.43%   97.64%   96.85%   96.85%   96.85%   96.85%   96.85%   96.85%   96.85%
1    4    30        97.64%   97.64%   99.21%   98.43%   98.43%   96.85%   96.85%   98.43%   97.64%   96.85%   98.43%   97.64%
1    4    orig.     99.21%   97.64%   97.64%   97.64%   97.64%   96.85%   96.85%   96.85%   96.85%   95.28%   96.06%   96.85%
1    5    10        99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   99.21%
1    5    30        99.21%   99.21%   99.21%   99.21%   99.21%   99.21%   100.00%  99.21%   99.21%   99.21%   99.21%   99.21%
1    5    orig.     100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%  100.00%
AVERAGE (above)     97.29%   97.33%   97.73%   97.90%   97.92%   92.19%   94.34%   97.86%   97.66%   91.73%   97.68%   97.49%
ERROR (Table 1)     39.81%   40.48%   39.64%   36.02%   33.54%   27.66%   26.93%   25.98%   25.76%   22.58%   21.17%   23.46%
UNDER (Table 1)     11.95%   16.37%   16.17%   15.28%   13.41%   9.09%    9.00%    9.27%    10.40%   9.66%    10.89%   15.31%
OVER (Table 1)      27.86%   24.11%   23.47%   20.74%   20.13%   18.57%   17.93%   16.71%   15.36%   12.92%   10.28%   8.15%
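
The ERROR, UNDER and OVER rows of the slide 13 table appear to be related by ERROR = UNDER + OVER for every stemming parameter setting. The sketch below checks that arithmetic on the figures copied from those three rows, listed from setting 1:1 to 3:3, and reports the setting with the lowest ERROR. The function names and the rounding tolerance are illustrative assumptions, and reading UNDER and OVER as under- and over-stemming rates is also an assumption, since the "Table 1" they refer to is not part of this excerpt.

```python
# Figures copied from the OVER / UNDER / ERROR rows of the slide 13 table,
# ordered by stemming parameter setting from 1:1 to 3:3.
PARAMS = ["1:1", "1:2", "1:5", "1:10", "1:20", "2:1",
          "2:2", "2:5", "2:10", "3:1", "3:2", "3:3"]
OVER = [27.86, 24.11, 23.47, 20.74, 20.13, 18.57,
        17.93, 16.71, 15.36, 12.92, 10.28, 8.15]
UNDER = [11.95, 16.37, 16.17, 15.28, 13.41, 9.09,
         9.00, 9.27, 10.40, 9.66, 10.89, 15.31]
ERROR = [39.81, 40.48, 39.64, 36.02, 33.54, 27.66,
         26.93, 25.98, 25.76, 22.58, 21.17, 23.46]


def error_rows_consistent(over, under, error, tol=0.005):
    """Return one boolean per column: does over + under match error?

    tol is an assumed tolerance that absorbs two-decimal rounding
    and floating-point noise.
    """
    return [abs(o + u - e) <= tol for o, u, e in zip(over, under, error)]


def lowest_error_setting(params, error):
    """Return (setting, error) for the column with the smallest error."""
    i = min(range(len(error)), key=error.__getitem__)
    return params[i], error[i]


if __name__ == "__main__":
    print(all(error_rows_consistent(OVER, UNDER, ERROR)))
    print(lowest_error_setting(PARAMS, ERROR))
```

Note that the setting with the lowest ERROR need not be the one with the highest AVERAGE classification accuracy; the two rows measure different things and are best compared side by side.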