news | = newspaper, journal |
tale | = fairy tale (old-fashioned German) |
mix | = collection of newspaper texts, technical reports, manuals, ... |
Corpus | Type | Tokens | AW | AW in % | AR | NL | NL in % | LE | LE in % |
mix1 (1) | mix | 7216 | 2426 | 33.6 | 1.55 | 139 | 1.9 | 25 | 0.4 |
mix2 (1) | mix | 18739 | 6394 | 34.1 | 1.59 | 512 | 2.7 | 48 | 0.3 |
europa (1) | mix | 7293 | 2602 | 35.7 | 1.63 | 56 | 0.8 | 9 | 0.1 |
spiegel (2) | news | 11719 | 4256 | 36.3 | 1.65 | 390 | 3.3 | 50 | 0.4 |
taz (2) | news | 7562 | 2774 | 36.7 | 1.64 | 209 | 2.8 | 20 | 0.3 |
welt (2) | news | 6080 | 2187 | 36.0 | 1.63 | 138 | 2.3 | 22 | 0.4 |
spektrum | news | 5701 | 2106 | 36.9 | 1.64 | 152 | 2.7 | 12 | 0.2 |
andersen (3) | tale | 11774 | 4057 | 34.5 | 1.59 | 28 | 0.2 | 21 | 0.2 |
bechstein (3) | tale | 7479 | 2565 | 34.3 | 1.58 | 17 | 0.2 | 9 | 0.1 |
grimm | tale | 7284 | 2413 | 33.1 | 1.55 | 38 | 0.5 | 17 | 0.2 |
(1) TRAIN1 | mix | 33248 | 11422 | 34.3 | 1.59 | 707 | 2.1 | 82 | 0.3 |
(2) TRAIN2 | news | 25361 | 9217 | 36.3 | 1.64 | 737 | 2.9 | 92 | 0.4 |
(3) TRAIN3 | tale | 19253 | 6622 | 34.3 | 1.59 | 45 | 0.2 | 30 | 0.2 |
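
The TRAIN rows appear to be the column-wise sums of the corpora carrying the same marker (1), (2) or (3), and the percentage columns appear to be taken relative to the token count. The following minimal sketch recomputes them under exactly those two assumptions. The names `CORPORA`, `GROUPS`, `combine` and `percent_of_tokens` are hypothetical and introduced only for illustration; recomputed percentages can differ from the printed ones in the last digit, depending on rounding.

```python
# Counts copied from the table above: (Tokens, AW, NL, LE).
CORPORA = {
    "mix1": (7216, 2426, 139, 25),
    "mix2": (18739, 6394, 512, 48),
    "europa": (7293, 2602, 56, 9),
    "spiegel": (11719, 4256, 390, 50),
    "taz": (7562, 2774, 209, 20),
    "welt": (6080, 2187, 138, 22),
    "andersen": (11774, 4057, 28, 21),
    "bechstein": (7479, 2565, 17, 9),
}

# Assumption: a training corpus consists of the corpora sharing its marker.
GROUPS = {
    "TRAIN1": ["mix1", "mix2", "europa"],       # marker (1)
    "TRAIN2": ["spiegel", "taz", "welt"],       # marker (2)
    "TRAIN3": ["andersen", "bechstein"],        # marker (3)
}


def combine(names):
    """Sum the count columns of the named corpora, column by column."""
    return tuple(sum(column) for column in zip(*(CORPORA[n] for n in names)))


def percent_of_tokens(count, tokens):
    """Share of `count` in `tokens`, in percent, rounded to one decimal."""
    return round(100.0 * count / tokens, 1)


if __name__ == "__main__":
    for train, members in GROUPS.items():
        tokens, aw, nl, le = combine(members)
        print(train, tokens, aw, percent_of_tokens(aw, tokens), nl, le)
```

The sums reproduce the Tokens, AW, NL and LE counts of the three TRAIN rows, which supports reading the markers as a membership list.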
The lexicon and the guesser together contain 53 different tags and 245 different ambiguity classes.
None of the training corpora covers all ambiguity classes, and only TRAIN1 covers all tags. The coverage is shown in the table below.
Corpus | corpus tags | in % | ambiguity classes | in % | resulting tagger model |
TRAIN1 | 53 | 100.00 % | 215 | 87.76 % | HMM-1 |
TRAIN2 | 52 | 98.11 % | 194 | 79.18 % | HMM-2 |
TRAIN3 | 52 | 98.11 % | 177 | 72.24 % | HMM-3 |
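
The coverage percentages follow from dividing the number of tags or ambiguity classes seen in a corpus by the totals stated for the lexicon and guesser (53 and 245). A minimal sketch of that arithmetic is given here; the names `COVERED` and `coverage` are hypothetical and not taken from the original work.

```python
# Totals stated in the text for lexicon and guesser.
TOTAL_TAGS = 53
TOTAL_CLASSES = 245

# Counts copied from the coverage table: (tags covered, classes covered).
COVERED = {
    "TRAIN1": (53, 215),
    "TRAIN2": (52, 194),
    "TRAIN3": (52, 177),
}


def coverage(covered, total):
    """Coverage in percent, rounded to two decimals as in the table."""
    return round(100.0 * covered / total, 2)


if __name__ == "__main__":
    for corpus, (tags, classes) in COVERED.items():
        print(
            f"{corpus}: tags {coverage(tags, TOTAL_TAGS):.2f} %, "
            f"classes {coverage(classes, TOTAL_CLASSES):.2f} %"
        )
```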