The following table shows, for each tagger model, the absolute number of errors (including lexical errors) and the overall accuracy on each test corpus.
Tests are run on every text file, including the training files; the figures of primary interest are therefore those for texts that were not part of the training corpus of the given HMM model.
HMM-1 was trained on mix texts, HMM-2 on news texts, and HMM-3 on tale texts.

| Corpus | Type | HMM-1 errors | HMM-1 accuracy | HMM-2 errors | HMM-2 accuracy | HMM-3 errors | HMM-3 accuracy |
|---|---|---|---|---|---|---|---|
| mix1 | mix | 136 | 98.12 % | 215 | 97.02 % | 264 | 96.34 % |
| mix2 | mix | 318 | 98.30 % | 408 | 97.82 % | 555 | 97.04 % |
| europa | mix | 108 | 98.52 % | 155 | 97.87 % | 202 | 97.23 % |
| spiegel | news | 305 | 97.40 % | 217 | 98.15 % | 346 | 97.05 % |
| taz | news | 181 | 97.61 % | 142 | 98.12 % | 246 | 96.75 % |
| welt | news | 119 | 98.04 % | 90 | 98.52 % | 156 | 97.43 % |
| spektrum | news | 152 | 97.33 % | 166 | 97.09 % | 188 | 96.70 % |
| andersen | tale | 251 | 97.87 % | 304 | 97.42 % | 163 | 98.62 % |
| bechstein | tale | 161 | 97.85 % | 163 | 97.82 % | 94 | 98.74 % |
| grimm | tale | 150 | 97.94 % | 154 | 97.89 % | 134 | 98.16 % |
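The table gives error counts and (rounded) accuracies but not the sizes of the test corpora. Where a corpus size is needed, it can be estimated from these two figures, since accuracy = 1 − errors/tokens. A back-of-the-envelope sketch in Python (the function name is ours, and the estimate inherits the rounding of the published accuracies):

```python
def approx_tokens(errors: int, accuracy_percent: float) -> int:
    """Estimate corpus size from an error count and a rounded
    accuracy, using accuracy = 1 - errors / tokens."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return round(errors / error_rate)

# mix1 row, HMM-1 column: 136 errors at 98.12 % accuracy
print(approx_tokens(136, 98.12))  # roughly 7,200 tokens
```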
With the chosen training and test corpora, the following tendencies can be observed; they would have to be confirmed or revised by further testing on other text genres (technical manuals, instructions, colloquial language, ...).
On the whole, the three configurations (training on sets TRAIN1, TRAIN2, TRAIN3) do not lead to large differences in tagging performance on the test material.
This may be because the differences we assumed between the chosen text types are in fact small, so that the statistical models trained on the three sets do not differ much. Tests with more distant text types (e.g. technical-manual style as opposed to newspaper style) might show clearer effects.
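As a rough check of this observation, per-genre averages can be computed directly from the table above. The following Python sketch (the grouping and averaging are ours; the figures are copied from the table) makes the in-domain advantage explicit:

```python
# Accuracy (%) per test corpus, copied from the table above:
# (corpus, genre, HMM-1, HMM-2, HMM-3)
RESULTS = [
    ("mix1", "mix", 98.12, 97.02, 96.34),
    ("mix2", "mix", 98.30, 97.82, 97.04),
    ("europa", "mix", 98.52, 97.87, 97.23),
    ("spiegel", "news", 97.40, 98.15, 97.05),
    ("taz", "news", 97.61, 98.12, 96.75),
    ("welt", "news", 98.04, 98.52, 97.43),
    ("spektrum", "news", 97.33, 97.09, 96.70),
    ("andersen", "tale", 97.87, 97.42, 98.62),
    ("bechstein", "tale", 97.85, 97.82, 98.74),
    ("grimm", "tale", 97.94, 97.89, 98.16),
]

MODELS = ("HMM-1 (mix)", "HMM-2 (news)", "HMM-3 (tale)")

# Group the accuracy triples by test genre.
by_genre = {}
for _, genre, *accs in RESULTS:
    by_genre.setdefault(genre, []).append(accs)

# Average each model's accuracy over the corpora of one genre.
for genre, rows in sorted(by_genre.items()):
    means = [sum(col) / len(col) for col in zip(*rows)]
    cells = ", ".join(f"{m}: {v:.2f} %" for m, v in zip(MODELS, means))
    print(f"{genre:>4}: {cells}")
```

On these averages each model comes out best on its own genre, but its margin over the other two models stays between roughly 0.4 and 1.4 percentage points, which is consistent with the overall picture described above.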
The impact of phenomena that are specific to a given text type might be tested separately, by means of training texts that are particularly geared to this question (and monitored accordingly).
In general, however, the test setup appears suitable for further tests on the impact of text types. Future tests should be based on a qualitative and quantitative description of the perceived differences between the test and the training texts. This description could, at least in part, be obtained with corpus query tools: for selected constructions, quantitative profiles of the test and training material could be established before the experiment is run, and the tagger's results on precisely these phenomena could then be checked.
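As a minimal sketch of what such a quantitative profile could look like, assuming a simple n-gram comparison (the tokenisation, the choice of bigrams, and the divergence measure are illustrative assumptions, not part of the setup described here), one could compare the relative frequencies of n-grams in training and test material; running the same code over sequences of POS tags instead of words would profile constructions rather than vocabulary:

```python
from collections import Counter

def ngram_profile(tokens, n=2):
    """Relative frequency of each token n-gram in a tokenised text."""
    grams = Counter(tuple(tokens[i:i + n])
                    for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_divergence(train_tokens, test_tokens, n=2):
    """Total variation distance between two n-gram profiles:
    0.0 means identical distributions, 1.0 means completely disjoint."""
    p = ngram_profile(train_tokens, n)
    q = ngram_profile(test_tokens, n)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0))
                     for k in set(p) | set(q))

# Toy example with invented sentences; real input would be the
# training and test corpora (words or, better, their POS tags).
train = "the old house stood at the edge of the wood".split()
test = "the old king lived at the edge of the wood".split()
print(f"bigram divergence: {profile_divergence(train, test):.3f}")  # 0.333
```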