The standard tagset distinguishes modals, auxiliaries and full verbs. This is, however, a purely lexical distinction and does not reflect the actual syntactic function of the verb in a given sentence. The verb forms themselves are therefore not ambiguous with respect to these subclasses, but the subclassification should contribute to a higher accuracy in the disambiguation of infinitive and finite verb forms.
The following tables show errors concerning the confusion of VAINF/VAFIN, VMINF/VMFIN and VVINF/VVFIN in the standard test; V.INF and V.FIN stand for the three infinitive tags and the three finite tags, respectively. In German, all infinitives (except for ``sein'') are ambiguous with finite verb forms (1st and 3rd person plural present tense).
tokens | training corpus | test corpus | in %
V.INF in ambiguity class | 3,084 | 584 | -
tagged as V.INF | - | 263 | (45.0 %)
incorrect | - | 42 | (16.0 %)
instead of V.FIN | - | 40 | 95.2 %
V.FIN in ambiguity class | 9,087 | 1,817 | -
tagged as V.FIN | - | 947 | (52.1 %)
incorrect | - | 25 | (2.6 %)
instead of V.INF | - | 13 | 52.0 %
V.FIN V.INF in ambiguity class | 2,884 | 547 | -
tagged as V.INF | - | 247 | (45.2 %)
incorrect | - | 41 | (16.6 %)
instead of V.FIN | - | 40 | 97.6 %
tagged as V.FIN | - | 119 | (21.8 %)
incorrect | - | 13 | (10.1 %)
instead of V.INF | - | 13 | 100 %
tagged V.INF V.FIN | - | 366 | (66.9 %)
incorrect | - | 54 | (14.8 %)
inverted V.INF-V.FIN | - | 53 | 98.2 %
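The percentages in this table are nested: each one is relative to the count on the line above it. The following is a minimal Python sketch of that arithmetic, using the test corpus counts for the ambiguity class containing both V.FIN and V.INF. The helper name `share` and the variable names are ours, chosen for illustration only.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Test corpus counts taken from the table above.
in_class = 547      # tokens whose ambiguity class contains V.FIN and V.INF
tagged_inf = 247    # of these, tagged as V.INF
wrong_inf = 41      # V.INF tags that are incorrect
tagged_fin = 119    # of these, tagged as V.FIN
wrong_fin = 13      # V.FIN tags that are incorrect

# The summary rows are sums over the two tags.
tagged_both = tagged_inf + tagged_fin
wrong_both = wrong_inf + wrong_fin

print(share(tagged_inf, in_class))     # 45.2
print(share(wrong_inf, tagged_inf))    # 16.6
print(share(tagged_both, in_class))    # 66.9
print(share(wrong_both, tagged_both))  # 14.8
```

Percentages in parentheses are thus shares of the tokens counted one level up, while the unparenthesised ones break down the incorrect tags.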
Test: Do not distinguish VA-, VM- and VV- verb forms
Corpus statistics
statistic | training corpus | test corpus
Tokens | 62860 | 13416
Tags | 45 | 41
Lexicon gaps | 1756 | 283
Lexical errors | 542 | 65
Ambiguity classes | 251 | 189
Ambiguity rate | 1.69 | 1.67
Error statistics
ambiguity | tokens | in % | correct | in % | LE | in % | DE | in % |
1 | 7978 | 59.5 | 7942 | 99.6 | 36 | 0.5 | - | - |
2 | 2663 | 19.9 | 2474 | 92.9 | 13 | 0.5 | 176 | 6.6 |
3 | 2078 | 15.5 | 2013 | 96.9 | 8 | 0.4 | 57 | 2.7 |
4 | 589 | 4.4 | 518 | 88.0 | 7 | 1.2 | 64 | 10.9 |
5 | 81 | 0.6 | 71 | 87.9 | 1 | 1.2 | 9 | 11.1 |
6 | 19 | 0.1 | 16 | 84.2 | - | - | 3 | 15.8 |
7 | 8 | 0.1 | 7 | 87.5 | - | - | 1 | 12.5 |
total | 13416 | 100.0 | 13041 | 97.2 | 65 | 0.5 | 310 | 2.3 |
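As a consistency check on the error statistics, every token in a row should be counted exactly once as correct, LE or DE, and the rows should add up to the totals. The sketch below re-enters the counts from the table (with "-" cells as 0) and recomputes the totals; the tuple layout and the helper `pct` are assumptions made for this illustration.

```python
# (ambiguity, tokens, correct, LE, DE) for each row of the error statistics;
# cells shown as "-" in the table are entered as 0.
rows = [
    (1, 7978, 7942, 36, 0),
    (2, 2663, 2474, 13, 176),
    (3, 2078, 2013, 8, 57),
    (4, 589, 518, 7, 64),
    (5, 81, 71, 1, 9),
    (6, 19, 16, 0, 3),
    (7, 8, 7, 0, 1),
]

def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Each row must partition its tokens into correct, LE and DE.
for _, tokens, correct, le, de in rows:
    assert correct + le + de == tokens

total_tokens = sum(r[1] for r in rows)
total_correct = sum(r[2] for r in rows)
total_le = sum(r[3] for r in rows)
total_de = sum(r[4] for r in rows)

print(total_tokens, total_correct, total_le, total_de)
print(pct(total_correct, total_tokens))  # overall accuracy in %
```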
Most frequent errors (by word form)
number | word | correct tag | tagger tag |
13 | DM | NN | NE |
9 | Osthold | NE | ADJD |
6 | werden | VFIN | VINF |
6 | das | PDS | ART |
5 | Reich | NE | NN |
4 | haben | VFIN | VINF |
4 | dem | PRELS | ART |
4 | Um | KOUI | APPR |
4 | Deutschland | NE | NN |
Most frequent errors (by tags)
number | correct tag | tagger tag |
55 | NN | NE |
46 | VFIN | VINF |
39 | NE | NN |
16 | VFIN | VPP |
14 | ADJD | VPP |
12 | KON | ADV |
11 | VINF | VFIN |
11 | ADJD | ADV |
10 | NE | ADJD |
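Error lists of this kind can be derived by comparing the correct tag and the tagger tag token by token and counting the mismatches. The following Python sketch shows one way to do this with `collections.Counter`. The triples are toy data invented for illustration, not an excerpt from the test corpus, and this is not necessarily how the figures above were produced.

```python
from collections import Counter

# Toy (word, correct tag, tagger tag) triples; illustrative only.
tagged = [
    ("werden", "VFIN", "VINF"),
    ("haben", "VFIN", "VINF"),
    ("werden", "VFIN", "VINF"),
    ("DM", "NN", "NE"),
    ("das", "ART", "ART"),  # tagged correctly, so not an error
    ("Reich", "NE", "NN"),
]

# Keep only the tokens where the tagger disagrees with the correct tag.
errors = [(word, gold, pred) for word, gold, pred in tagged if gold != pred]

# Errors by word form: count identical (word, correct, tagger) triples.
by_word = Counter(errors)

# Errors by tags: ignore the word and count (correct, tagger) pairs.
by_tags = Counter((gold, pred) for _, gold, pred in errors)

for (word, gold, pred), n in by_word.most_common():
    print(n, word, gold, pred)
for (gold, pred), n in by_tags.most_common():
    print(n, gold, pred)
```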
The results of this test should again be interpreted against the figures given in section 6.1.2; to make the comparison easy, we have again concentrated on the Xerox HMM Tagger.
In this test, the number of tags annotated in both the training and the test corpus is lower than in 6.1.2. This is expected, because we have merged the tags ``VAFIN'' and ``VMFIN'' into the class ``VVFIN'', and we have merged the respective infinitive and participle tags analogously. The corpus ambiguity rates, however, remain the same: no verb form is ambiguous between the subclasses, so merging them does not change the number of tags per token.
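The merging step described above can be sketched as a simple tag mapping. The Python example below assumes that the merged classes are written VFIN, VINF and VPP, as in the error tables of this section, and that all other tags are left untouched. The function name `merge_verb_tag` and the toy lexicon are hypothetical and serve only as illustration.

```python
import re

# Assumption: the VA-, VM- and VV- variants of the finite, infinitive and
# participle tags collapse to VFIN, VINF and VPP; other tags stay unchanged.
_VERB_SUBCLASS = re.compile(r"^V[AMV](FIN|INF|PP)$")

def merge_verb_tag(tag: str) -> str:
    """Drop the A/M/V subclass letter from a verb tag, e.g. VAFIN -> VFIN."""
    return _VERB_SUBCLASS.sub(r"V\1", tag)

# Toy lexicon mapping word forms to their sets of possible tags.
lexicon = {
    "haben": {"VAFIN", "VAINF"},
    "gehen": {"VVFIN", "VVINF"},
    "Haus": {"NN"},
}

merged_lexicon = {
    word: {merge_verb_tag(tag) for tag in tags}
    for word, tags in lexicon.items()
}

# The number of tags per word form is unchanged by the merge.
for word in lexicon:
    print(word, len(lexicon[word]), len(merged_lexicon[word]))
```

In this toy lexicon every word form keeps the same number of possible tags after merging, which is why the ambiguity rate is unaffected while the size of the tagset shrinks.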
The figures in the error statistics have not changed substantially. The number of errors has increased slightly, but the changes do not appear to be significant.
The impact of merging the verb subclasses on the overall treatment of verbs in the tagging process seems to be rather small.
tokens | training corpus | test corpus | in %
VINF in ambiguity class | 3,084 | 584 | -
tagged as VINF | - | 271 | (46.4 %)
incorrect | - | 48 | (17.7 %)
instead of VFIN | - | 46 | 95.8 %
VFIN in ambiguity class | 9,087 | 1,817 | -
tagged as VFIN | - | 940 | (51.7 %)
incorrect | - | 23 | (2.5 %)
instead of VINF | - | 11 | 47.8 %
VFIN VINF in ambiguity class | 2,884 | 547 | -
tagged as VINF | - | 255 | (46.6 %)
incorrect | - | 47 | (18.4 %)
instead of VFIN | - | 46 | 97.9 %
tagged as VFIN | - | 111 | (20.3 %)
incorrect | - | 11 | (9.9 %)
instead of VINF | - | 11 | 100 %
tagged VINF VFIN | - | 366 | (66.9 %)
incorrect | - | 58 | (15.9 %)
inverted VINF-VFIN | - | 57 | 98.3 %
Without the distinction between VA-, VM- and VV- forms, there are slightly more errors within the confusion class VINF/VFIN (58 versus 54 in the standard test).