The following table shows, for each tagger model, the absolute number of errors (including lexical errors) and the overall accuracy on each test corpus.
Tests are run on every text file, including the training files; the figures of primary interest are therefore those for texts that were not part of the training corpus of the given HMM model.
HMM-1 was trained on mix texts, HMM-2 on news texts, and HMM-3 on tale texts.

| Corpus | Type | HMM-1 errors | HMM-1 accuracy | HMM-2 errors | HMM-2 accuracy | HMM-3 errors | HMM-3 accuracy |
|---|---|---|---|---|---|---|---|
| mix1 | mix | 136 | 98.12 % | 215 | 97.02 % | 264 | 96.34 % |
| mix2 | mix | 318 | 98.30 % | 408 | 97.82 % | 555 | 97.04 % |
| europa | mix | 108 | 98.52 % | 155 | 97.87 % | 202 | 97.23 % |
| spiegel | news | 305 | 97.40 % | 217 | 98.15 % | 346 | 97.05 % |
| taz | news | 181 | 97.61 % | 142 | 98.12 % | 246 | 96.75 % |
| welt | news | 119 | 98.04 % | 90 | 98.52 % | 156 | 97.43 % |
| spektrum | news | 152 | 97.33 % | 166 | 97.09 % | 188 | 96.70 % |
| andersen | tale | 251 | 97.87 % | 304 | 97.42 % | 163 | 98.62 % |
| bechstein | tale | 161 | 97.85 % | 163 | 97.82 % | 94 | 98.74 % |
| grimm | tale | 150 | 97.94 % | 154 | 97.89 % | 134 | 98.16 % |
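The table gives error counts and (rounded) accuracies but not the sizes of the test corpora. Where a corpus size is needed, it can be estimated from these two figures, since accuracy = 1 − errors/tokens. A back-of-the-envelope sketch in Python (the function name is ours, and the estimate inherits the rounding of the published accuracies):

```python
def approx_tokens(errors: int, accuracy_percent: float) -> int:
    """Estimate corpus size from an error count and a rounded
    accuracy, using accuracy = 1 - errors / tokens."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return round(errors / error_rate)

# mix1 row, HMM-1 column: 136 errors at 98.12 % accuracy
print(approx_tokens(136, 98.12))  # roughly 7,200 tokens
```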
With the chosen training and test corpora, the following tendencies can be observed; they would have to be confirmed or revised by further testing on other text genres (technical manuals, instructions, colloquial language, ...).
On the whole, the three configurations (training on sets TRAIN1, TRAIN2, TRAIN3) do not lead to large differences in tagging performance on the test material.
This may be because the differences we assumed between the chosen text types are in fact small, so that the statistical models trained on the three sets do not differ much. Tests with more distant text types (e.g. technical-manual style as opposed to newspaper style) might show clearer effects.
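As a rough check of this observation, per-genre averages can be computed directly from the table above. The following Python sketch (the grouping and averaging are ours; the figures are copied from the table) makes the in-domain advantage explicit:

```python
# Accuracy (%) per test corpus, copied from the table above:
# (corpus, genre, HMM-1, HMM-2, HMM-3)
RESULTS = [
    ("mix1", "mix", 98.12, 97.02, 96.34),
    ("mix2", "mix", 98.30, 97.82, 97.04),
    ("europa", "mix", 98.52, 97.87, 97.23),
    ("spiegel", "news", 97.40, 98.15, 97.05),
    ("taz", "news", 97.61, 98.12, 96.75),
    ("welt", "news", 98.04, 98.52, 97.43),
    ("spektrum", "news", 97.33, 97.09, 96.70),
    ("andersen", "tale", 97.87, 97.42, 98.62),
    ("bechstein", "tale", 97.85, 97.82, 98.74),
    ("grimm", "tale", 97.94, 97.89, 98.16),
]

MODELS = ("HMM-1 (mix)", "HMM-2 (news)", "HMM-3 (tale)")

# Group the accuracy triples by test genre.
by_genre = {}
for _, genre, *accs in RESULTS:
    by_genre.setdefault(genre, []).append(accs)

# Average each model's accuracy over the corpora of one genre.
for genre, rows in sorted(by_genre.items()):
    means = [sum(col) / len(col) for col in zip(*rows)]
    cells = ", ".join(f"{m}: {v:.2f} %" for m, v in zip(MODELS, means))
    print(f"{genre:>4}: {cells}")
```

On these averages each model comes out best on its own genre, but its margin over the other two models stays between roughly 0.4 and 1.4 percentage points, which is consistent with the overall picture described above.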
The impact of phenomena that are specific to a given text type might be tested separately, by means of training texts that are particularly geared to this question (and monitored accordingly).
In general, however, the test setup appears suitable for further tests on the impact of text types. Future tests should be based on a qualitative and quantitative description of the perceived differences between the test and the training texts. This description could, at least in part, be obtained with corpus query tools: for selected constructions, quantitative profiles of the test and training material could be established before the experiment is run, and the tagger's results on precisely these phenomena could then be checked.
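As a minimal sketch of what such a quantitative profile could look like, assuming a simple n-gram comparison (the tokenisation, the choice of bigrams, and the divergence measure are illustrative assumptions, not part of the setup described here), one could compare the relative frequencies of n-grams in training and test material; running the same code over sequences of POS tags instead of words would profile constructions rather than vocabulary:

```python
from collections import Counter

def ngram_profile(tokens, n=2):
    """Relative frequency of each token n-gram in a tokenised text."""
    grams = Counter(tuple(tokens[i:i + n])
                    for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_divergence(train_tokens, test_tokens, n=2):
    """Total variation distance between two n-gram profiles:
    0.0 means identical distributions, 1.0 means completely disjoint."""
    p = ngram_profile(train_tokens, n)
    q = ngram_profile(test_tokens, n)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0))
                     for k in set(p) | set(q))

# Toy example with invented sentences; real input would be the
# training and test corpora (words or, better, their POS tags).
train = "the old house stood at the edge of the wood".split()
test = "the old king lived at the edge of the wood".split()
print(f"bigram divergence: {profile_divergence(train, test):.3f}")  # 0.333
```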