To compare the effects of using different lexicons for training and testing, we chose two test setups for the Xerox HMM tagger: one with the regular lexicon only, and a second with the extended lexicon, which is the same lexicon as used in the TreeTagger test.
(1) Regular training lexicon
| Corpus statistics | Training corpus | Test corpus |
|---|---|---|
| Tokens | 62860 | 13416 |
| Tags | 51 | 46 |
| Lexicon gaps | 1756 | 283 |
| Lexical errors | 543 | 65 |
| Ambiguity classes | 263 | 196 |
| Ambiguity rate | 1.69 | 1.67 |
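The statistics above (lexicon gaps, lexical errors, ambiguity classes, ambiguity rate) can in principle be computed from a lexicon and a gold-tagged corpus. The following is a minimal sketch; the data structures (a tag-set-valued lexicon, a list of word/tag pairs) and the convention of averaging the ambiguity rate over lexicalized tokens only are assumptions for illustration, not taken from the actual tools:

```python
def corpus_statistics(tagged_tokens, lexicon):
    """Compute tagger-evaluation corpus statistics.

    tagged_tokens: list of (word, gold_tag) pairs.
    lexicon: dict mapping word forms to their set of possible tags.
    (Both are hypothetical structures for illustration.)
    """
    tags = {tag for _, tag in tagged_tokens}
    # A lexicon gap: the word form is missing from the lexicon.
    gaps = sum(1 for w, _ in tagged_tokens if w not in lexicon)
    # A lexical error: the word form is listed, but its correct tag is
    # not among the tags the lexicon assigns to it.
    lexical_errors = sum(
        1 for w, t in tagged_tokens if w in lexicon and t not in lexicon[w]
    )
    # An ambiguity class is the set of tags a word form can take; the
    # ambiguity rate is the average size of that set per token (here
    # averaged over lexicalized tokens only -- one possible convention).
    classes = {frozenset(lexicon[w]) for w, _ in tagged_tokens if w in lexicon}
    known = [w for w, _ in tagged_tokens if w in lexicon]
    rate = sum(len(lexicon[w]) for w in known) / len(known)
    return {
        "tokens": len(tagged_tokens),
        "tags": len(tags),
        "lexicon gaps": gaps,
        "lexical errors": lexical_errors,
        "ambiguity classes": len(classes),
        "ambiguity rate": rate,
    }
```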
Error statistics (LE = lexical errors, DE = disambiguation errors)

| ambiguity | tokens | in % | correct | in % | LE | in % | DE | in % |
|---|---|---|---|---|---|---|---|---|
| 1 | 7978 | 59.5 | 7942 | 99.6 | 36 | 0.5 | - | - |
| 2 | 2663 | 19.9 | 2482 | 93.2 | 13 | 0.5 | 168 | 6.3 |
| 3 | 2078 | 15.5 | 2014 | 96.9 | 8 | 0.4 | 56 | 2.7 |
| 4 | 589 | 4.4 | 518 | 88.0 | 7 | 1.2 | 64 | 10.9 |
| 5 | 81 | 0.6 | 71 | 87.9 | 1 | 1.2 | 9 | 11.1 |
| 6 | 19 | 0.1 | 16 | 84.2 | 0 | - | 3 | 15.8 |
| 7 | 8 | 0.1 | 7 | 87.5 | 0 | - | 1 | 12.5 |
| total | 13416 | 100.0 | 13050 | 97.3 | 65 | 0.5 | 301 | 2.2 |
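The per-ambiguity breakdown above separates lexical errors (the correct tag is missing from the word's ambiguity class, so the tagger cannot choose it; note that tokens with ambiguity 1 can only fail this way) from disambiguation errors (the correct tag was available but not chosen). A sketch of how such a breakdown can be tabulated, with hypothetical data structures rather than the evaluation scripts actually used:

```python
from collections import defaultdict

def error_breakdown(results, lexicon):
    """Tabulate tagging results by ambiguity-class size.

    results: list of (word, gold_tag, tagger_tag) triples.
    lexicon: dict mapping word forms to tag sets.
    (Both are illustrative structures.)
    """
    rows = defaultdict(lambda: {"tokens": 0, "correct": 0, "LE": 0, "DE": 0})
    for word, gold, guessed in results:
        n = len(lexicon.get(word, ()))  # size of the ambiguity class
        row = rows[n]
        row["tokens"] += 1
        if guessed == gold:
            row["correct"] += 1
        elif gold not in lexicon.get(word, ()):
            # Lexical error: the correct tag is not listed for this word.
            row["LE"] += 1
        else:
            # Disambiguation error: wrong choice among the listed tags.
            row["DE"] += 1
    return dict(rows)
```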
Most frequent errors (by word form)

| number | word | correct tag | tagger tag |
|---|---|---|---|
| 13 | DM | NN | NE |
| 9 | Osthold | NE | ADJD |
| 6 | das | PDS | ART |
| 5 | werden | VAFIN | VAINF |
| 5 | Reich | NE | NN |
| 4 | haben | VAFIN | VAINF |
| 4 | dem | PRELS | ART |
| 4 | Um | KOUI | APPR |
| 4 | Deutschland | NE | NN |
Most frequent errors (by tags)

| number | correct tag | tagger tag |
|---|---|---|
| 55 | NN | NE |
| 39 | NE | NN |
| 28 | VVFIN | VVINF |
| 16 | VVFIN | VVPP |
| 12 | KON | ADV |
| 11 | ADJD | VVPP |
| 11 | ADJD | ADV |
| 10 | VVINF | VVFIN |
| 10 | NE | ADJD |
(2) Extended training lexicon
Even though the Xerox HMM tagger is based on the same basic lexicon as the TreeTagger (see 6.1.1), the ambiguity classes of the word forms are not always the same, as one might have expected.
The difference arises because the lexicon used internally by the TreeTagger omits marginal (i.e. very rare) readings of word forms and thus reduces the ambiguity classes of lexicalized word forms. For non-lexicalized word forms, however, the TreeTagger uses ambiguity classes containing up to 10 tags, whereas the largest ambiguity class of the tested Xerox HMM tagger contains 7.
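The omission of marginal readings described above can be sketched as follows; the relative-frequency threshold and the count-based lexicon are assumptions for illustration, not the TreeTagger's actual mechanism:

```python
def prune_ambiguity_classes(lexicon, min_rel_freq=0.01):
    """Drop marginal readings: a tag is removed from a word form's
    ambiguity class when its relative frequency for that form falls
    below min_rel_freq (threshold value is illustrative).

    lexicon: dict mapping word forms to {tag: count} dicts.
    Returns a dict mapping word forms to their pruned tag sets.
    """
    pruned = {}
    for word, tag_counts in lexicon.items():
        total = sum(tag_counts.values())
        kept = {t for t, c in tag_counts.items() if c / total >= min_rel_freq}
        # Never prune away all readings: fall back to the most frequent tag.
        pruned[word] = kept or {max(tag_counts, key=tag_counts.get)}
    return pruned
```

Pruning rare readings shrinks the ambiguity classes of lexicalized word forms, which is consistent with the smaller classes observed for the TreeTagger's internal lexicon.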
| Corpus statistics | Training corpus | Test corpus |
|---|---|---|
| Tokens | 62860 | 13416 |
| Tags | 51 | 46 |
| Lexicon gaps | 0 | 241 |
| Lexical errors | 0 | 49 |
| Ambiguity classes | 275 | 205 |
| Ambiguity rate | 1.64 | 1.69 |
Error statistics

| ambiguity | tokens | in % | correct | in % | LE | in % | DE | in % |
|---|---|---|---|---|---|---|---|---|
| 1 | 7962 | 59.4 | 7932 | 99.6 | 30 | 0.4 | - | - |
| 2 | 2660 | 19.8 | 2481 | 93.3 | 11 | 0.4 | 168 | 6.3 |
| 3 | 2082 | 15.5 | 2019 | 97.0 | 5 | 0.2 | 58 | 2.8 |
| 4 | 480 | 3.6 | 399 | 83.1 | 3 | 0.6 | 78 | 16.3 |
| 5 | 178 | 1.3 | 161 | 90.5 | 0 | - | 17 | 9.5 |
| 6 | 47 | 0.4 | 38 | 80.9 | 0 | - | 9 | 19.1 |
| 7 | 7 | 0.1 | 5 | 71.4 | 0 | - | 2 | 28.6 |
| total | 13416 | 100.0 | 13035 | 97.2 | 49 | 0.4 | 332 | 2.5 |
Most frequent errors (by word form)

| number | word | correct tag | tagger tag |
|---|---|---|---|
| 13 | DM | NN | NE |
| 9 | Osthold | NE | ADJD |
| 6 | das | PDS | ART |
| 5 | werden | VAFIN | VAINF |
| 5 | Reich | NE | NN |
| 4 | haben | VAFIN | VAINF |
| 4 | dem | PRELS | ART |
| 4 | Um | KOUI | APPR |
| 4 | Deutschland | NE | NN |
| 4 | Asher | NE | ADJA |
Most frequent errors (by tags)

| number | correct tag | tagger tag |
|---|---|---|
| 42 | NE | NN |
| 41 | NN | NE |
| 28 | VVFIN | VVINF |
| 20 | NE | ADV |
| 18 | NE | ADJD |
| 15 | VVFIN | VVPP |
| 12 | KON | ADV |
| 12 | ADJD | ADV |
| 11 | ADJD | VVPP |