In 5.3 we defined experiments dealing with further tagset variations (e.g. PRF vs. PPER, PTKVZ vs. ADJD), and we ran these tests as well. However, it turned out that the corpus material did not contain enough occurrences of the relevant tags to produce results that allow reliable interpretation.
Since we consider our test and training corpora representative of an average corpus, we might conclude that these phenomena matter less for overall tagger accuracy. Still, the accuracy for the tags concerned should be evaluated with appropriate test data.
This is a general problem. For example, a 60,000-word manually annotated training corpus for German does not contain a single instance of an ambiguity much discussed in the linguistic literature, namely that between a relative pronoun and a demonstrative pronoun in an extraposition construction: das Buch, das ich gelesen habe 'the book that I read' (relative) vs. das Buch, ja, das habe ich gelesen 'the book, well, that one I have read' (demonstrative).
One could of course introduce such a phenomenon into a training corpus; indeed, some of the analyzed test material did contain examples of it. However, it is not clear what impact the introduction of such ``hard phenomena'' would have, given that their frequency distribution seems very difficult to determine when small-scale training corpora are used. From this point of view, one could argue that a certain portion of phenomena is always absent from tagger training material and therefore cannot be analyzed correctly when these taggers are applied to test material. This speaks in favour of the hypothesis that there is an upper limit to the precision and correctness rate of automatic tagging, a limit that depends in a non-trivial way on the size of the training corpus.
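Whether a corpus contains enough instances of a given tag to support reliable per-tag evaluation can be checked mechanically before running such experiments. The sketch below (the helper name and threshold are our own illustrative choices, and the toy corpus stands in for real training material) counts per-tag frequencies over (word, tag) pairs and flags tags that fall below a minimum count:

```python
from collections import Counter

# Hypothetical helper: flag tags whose occurrence counts in a tagged
# corpus are too low for reliable per-tag accuracy figures. Tag names
# follow the STTS tagset; the corpus here is a toy stand-in.

def sparse_tags(tagged_corpus, threshold=50):
    """Return {tag: count} for tags occurring fewer than `threshold` times."""
    counts = Counter(tag for _, tag in tagged_corpus)
    return {tag: n for tag, n in counts.items() if n < threshold}

corpus = [("das", "ART"), ("Buch", "NN"), ("das", "PRELS"),
          ("ich", "PPER"), ("gelesen", "VVPP"), ("habe", "VAFIN")]

# Every tag occurs only once in this toy corpus, so with threshold=2
# all six tags are flagged as too sparse to evaluate.
print(sparse_tags(corpus, threshold=2))
```

A reasonable threshold would of course be much higher in practice; the point is only that sparsity of the tags under study can be verified directly on the training material.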