The most frequent error in the standard test concerned common nouns (NN) and proper names (NE).
The table below shows some statistics about the frequency of NN and NE in the standard training and test corpora depending on the ambiguity of word forms with respect to NN and NE.
| training corpus | | test corpus | |
NN in ambiguity class | 14,606 | | 3,249 | |
unambiguous NN | 9,861 | (67.5 %) | 2,240 | (69.0 %) |
tagged as NN | | | 2,835 | (87.3 %) |
incorrect | | | 48 | (1.7 %) |
instead of NE | | | 39 | 81.3 % |
NE in ambiguity class | 3,621 | | 706 | |
unambiguous NE | 1,654 | (45.7 %) | 320 | (45.3 %) |
tagged as NE | | | 603 | (85.4 %) |
incorrect | | | 64 | (10.6 %) |
instead of NN | | | 55 | 85.9 % |
NN NE in ambiguity class | 1,912 | | 372 | |
tagged as NN | | | 94 | (25.3 %) |
incorrect | | | 29 | (30.9 %) |
instead of NE | | | 27 | 93.1 % |
tagged as NE | | | 273 | (73.4 %) |
incorrect | | | 55 | (20.2 %) |
instead of NN | | | 51 | 92.7 % |
tagged NN NE | | | 367 | (98.7 %) |
incorrect | | | 84 | (22.9 %) |
inverted NN-NE | | | 78 | 92.9 % |
These figures show that the overall error rate for common nouns (NN) is less than 2 %, whereas the automatically associated proper name tag (NE) is wrong in more than 10 % of all cases.
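The two rates quoted here follow directly from the test-corpus counts in the table. As a minimal sketch (the variable and function names are illustrative and not part of the original evaluation), they can be recomputed like this:

```python
# Test-corpus counts copied from the NN/NE table.
tagged = {"NN": 2835, "NE": 603}     # tokens the tagger labelled NN / NE
incorrect = {"NN": 48, "NE": 64}     # of these, labelled wrongly
confused = {"NN": 39, "NE": 55}      # wrong labels where the other noun tag was correct


def percent(part, whole):
    """Return part/whole as a percentage (unrounded)."""
    return 100.0 * part / whole


# Share of wrong labels among all tokens that received the tag.
error_rate = {tag: percent(incorrect[tag], tagged[tag]) for tag in tagged}
# Share of those wrong labels that are NN/NE confusions.
confusion_share = {tag: percent(confused[tag], incorrect[tag]) for tag in tagged}

for tag in tagged:
    print(f"{tag}: {error_rate[tag]:.1f} % wrong, "
          f"{confusion_share[tag]:.1f} % of these are NN/NE confusions")
```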
For both common nouns and proper names, the most frequent error is a confusion between the tags NN and NE (due to disambiguation errors as well as lexical errors). Thus, we should expect tagger accuracy to improve if we merge NE and NN and use a single tag NOUN instead.
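The relabelling itself is a simple tag mapping. The following sketch is an illustration only: the function names and the four-token example sentence are hypothetical, and it covers just the relabelling and scoring step, not the retraining of the tagger on the relabelled corpus. It shows why an NN/NE confusion no longer counts as an error once both tags are mapped to NOUN:

```python
NOUN_TAGS = {"NN", "NE"}  # the two noun tags to be merged


def merge_noun_tags(tagged_tokens):
    """Replace NN and NE by the single tag NOUN; leave all other tags untouched."""
    return [(word, "NOUN" if tag in NOUN_TAGS else tag)
            for word, tag in tagged_tokens]


def count_errors(gold, predicted):
    """Count positions where the predicted tag differs from the gold tag."""
    return sum(1 for (_, g), (_, p) in zip(gold, predicted) if g != p)


# Hypothetical example: the tagger confuses NE with NN on one token.
gold = [("Die", "ART"), ("Firma", "NN"), ("Osthold", "NE"), ("liefert", "VVFIN")]
predicted = [("Die", "ART"), ("Firma", "NN"), ("Osthold", "NN"), ("liefert", "VVFIN")]

errors_before = count_errors(gold, predicted)
errors_after = count_errors(merge_noun_tags(gold), merge_noun_tags(predicted))
print(errors_before, errors_after)
```

Errors that confuse a noun tag with a non-noun tag (for example NE vs. ADJD) are unaffected by this mapping.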
Test: Put NE and NN together in a single class (NOUN)
Corpus statistics | training corpus | test corpus |
Tokens | 62860 | 13416 |
Tags | 50 | 45 |
Lexical gaps | 1756 | 283 |
Lexicon errors | 355 | 49 |
Ambiguity classes | 242 | 181 |
Ambiguity rate | 1.66 | 1.65 |
Error statistics | ||||||||
ambiguity | tokens | in % | correct | in % | LE | in % | DE | in % |
1 | 8131 | 60.6 | 8110 | 99.7 | 21 | 0.3 | - | - |
2 | 2530 | 18.9 | 2385 | 94.3 | 12 | 0.5 | 133 | 5.3 |
3 | 2192 | 16.3 | 2119 | 96.7 | 13 | 0.6 | 60 | 2.7 |
4 | 493 | 3.7 | 449 | 91.1 | 2 | 0.4 | 42 | 8.5 |
5 | 62 | 0.5 | 59 | 95.2 | 1 | 1.6 | 2 | 3.2 |
6 | 8 | 0.1 | 7 | 87.5 | 0 | - | 1 | 12.5 |
total | 13416 | 100.0 | 13129 | 97.9 | 49 | 0.4 | 238 | 1.8 |
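The rows of the error statistics can be checked for internal consistency. The sketch below assumes that LE and DE stand for lexicon errors and disambiguation errors, so that every token is either correct or falls into one of these two error types; the names used in the code are illustrative.

```python
# Ambiguity class size -> (tokens, correct, LE, DE), copied from the error statistics.
# Assumption: LE = lexicon errors, DE = disambiguation errors.
ROWS = {
    1: (8131, 8110, 21, 0),
    2: (2530, 2385, 12, 133),
    3: (2192, 2119, 13, 60),
    4: (493, 449, 2, 42),
    5: (62, 59, 1, 2),
    6: (8, 7, 0, 1),
}


def accuracy(tokens, correct):
    """Percentage of correctly tagged tokens (unrounded)."""
    return 100.0 * correct / tokens


# Each row must add up: tokens = correct + LE + DE.
for size, (n, ok, le, de) in ROWS.items():
    assert n == ok + le + de, f"row {size} does not add up"
    print(f"ambiguity {size}: {accuracy(n, ok):.1f} % correct")

# Column sums give the 'total' row.
totals = [sum(column) for column in zip(*ROWS.values())]
tokens, correct, lexicon_errors, disambiguation_errors = totals
print(f"total: {tokens} tokens, {accuracy(tokens, correct):.1f} % correct")
```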
Most frequent errors (by word form) | |||
number | word | correct tag | tagger tag |
9 | Osthold | NOUN | ADJD |
6 | das | PDS | ART |
5 | werden | VAFIN | VAINF |
4 | dem | PRELS | ART |
4 | Um | KOUI | APPR |
4 | Reich | NOUN | ADJD |
Most frequent errors (by tags) | ||
number | correct tag | tagger tag |
30 | VVFIN | VVINF |
16 | VVFIN | VVPP |
15 | NOUN | ADJD |
14 | NOUN | ADV |
12 | NOUN | ADJA |
12 | KON | ADV |
11 | ADJD | VVPP |
11 | ADJD | ADV |
The results shown here should be compared with those reported in section 6.1.2. There we gave the figures for the standard situation, obtained with the Xerox HMM Tagger. Here we give the figures for the same tagger in a setting where common nouns and proper names are merged into a single "noun" class and are thus no longer tagged differently.
The following changes in the corpus and lexicon characteristics are evident: the corpus ambiguity rate is reduced for both the training and the test corpus (training corpus: from 1.69 in the standard test to 1.66 in the noun test; test corpus: from 1.67 to 1.65). Accordingly, the overall correctness rate increases from 97.3 % to 97.9 %. This result is expected: in section 6.1.2, disambiguation errors between NN and NE clearly account for a considerable part of the tagging errors. Other disambiguation errors now rank highest in frequency, namely the confusions between finite verbs and infinitives (VVFIN vs. VVINF) and between finite verbs and participles (VVFIN vs. VVPP). However, the problems encountered in tagging noun candidates are still not solved completely. The table of error frequencies by tags shows confusion pairs of the type NOUN vs. ADJD, NOUN vs. ADV and NOUN vs. ADJA. Recall that, for example, the form Osthold led to NE vs. ADJD errors in the test reported in section 6.1.2.
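The test-corpus ambiguity rate can be cross-checked against the error statistics for the noun test. The sketch below assumes that the ambiguity rate is the mean number of candidate tags per token, a definition that is not stated in this section; the function name is illustrative. Only the test corpus can be checked this way, because token counts per ambiguity class are given for the test corpus alone.

```python
# Tokens per ambiguity class size in the test corpus (noun test),
# copied from the "tokens" column of the error statistics.
TOKENS_BY_AMBIGUITY = {1: 8131, 2: 2530, 3: 2192, 4: 493, 5: 62, 6: 8}


def ambiguity_rate(tokens_by_ambiguity):
    """Mean number of candidate tags per token (assumed definition)."""
    total_tokens = sum(tokens_by_ambiguity.values())
    total_tags = sum(size * n for size, n in tokens_by_ambiguity.items())
    return total_tags / total_tokens


rate = ambiguity_rate(TOKENS_BY_AMBIGUITY)
print(f"{rate:.2f}")  # prints 1.65
```

Under this assumption the computed value agrees with the ambiguity rate of 1.65 listed for the test corpus in the corpus statistics.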
Nevertheless, according to our calculations, merging the NN and NE classes reduces the number of errors caused by the NN vs. NE confusion to 46 % of the original figure.
| training corpus | | test corpus | |
NOUN in ambiguity class | 16,320 | | 3,583 | |
unambiguous | 12,174 | (74.6 %) | 2,713 | (75.7 %) |
tagged as NOUN | | | 3,436 | (95.9 %) |
incorrect | | | 25 | (0.7 %) |