
Corpus statistics

Corpus        Type   Tokens     AW   in %    AR     NL   in %    LE   in %
mix1 (1)      mix      7216   2426   33.6   1.55   139    1.9    25    0.4
mix2 (1)      mix     18739   6394   34.1   1.59   512    2.7    48    0.3
europa (1)    mix      7293   2602   35.7   1.63    56    0.8     9    0.1
spiegel (2)   news    11719   4256   36.3   1.65   390    3.3    50    0.4
taz (2)       news     7562   2774   36.7   1.64   209    2.8    20    0.3
welt (2)      news     6080   2187   36.0   1.63   138    2.3    22    0.4
spektrum      news     5701   2106   36.9   1.64   152    2.7    12    0.2
andersen (3)  tale    11774   4057   34.5   1.59    28    0.2    21    0.2
bechstein (3) tale     7479   2565   34.3   1.58    17    0.2     9    0.1
grimm         tale     7284   2413   33.1   1.55    38    0.5    17    0.2
(1) TRAIN1    mix     33248  11422   34.3   1.59   707    2.1    82    0.3
(2) TRAIN2    news    25361   9217   36.3   1.64   737    2.9    92    0.4
(3) TRAIN3    tale    19253   6622   34.3   1.59    45    0.2    30    0.2
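
The TRAIN rows appear to be the column-wise sums of the corpora carrying the same marker (1), (2) or (3). The following is a minimal sketch of that check, not part of the original evaluation. It assumes that each count in the table is followed by its percentage (so that a cell such as 2426 33.6 is a count plus a percentage), and the names CORPORA, TRAINING_SETS and aggregate are hypothetical, introduced only for illustration.

```python
# Per-corpus counts as read from the table: (tokens, AW, NL, LE).
# Assumption: every count column is followed by a separate percentage
# column, which is left out here.
CORPORA = {
    "mix1":      (7216, 2426, 139, 25),
    "mix2":      (18739, 6394, 512, 48),
    "europa":    (7293, 2602, 56, 9),
    "spiegel":   (11719, 4256, 390, 50),
    "taz":       (7562, 2774, 209, 20),
    "welt":      (6080, 2187, 138, 22),
    "andersen":  (11774, 4057, 28, 21),
    "bechstein": (7479, 2565, 17, 9),
}

# Assumption: the markers (1), (2), (3) denote membership in TRAIN1..TRAIN3.
TRAINING_SETS = {
    "TRAIN1": ["mix1", "mix2", "europa"],
    "TRAIN2": ["spiegel", "taz", "welt"],
    "TRAIN3": ["andersen", "bechstein"],
}


def aggregate(names):
    """Sum the (tokens, AW, NL, LE) counts over the named corpora."""
    rows = [CORPORA[name] for name in names]
    return tuple(sum(column) for column in zip(*rows))


totals = {name: aggregate(members) for name, members in TRAINING_SETS.items()}

for name, (tokens, aw, nl, le) in totals.items():
    print(f"{name}: tokens={tokens} AW={aw} NL={nl} LE={le}")
```

Under these assumptions the sums reproduce the counts in the three TRAIN rows; spektrum and grimm carry no marker and are therefore not part of any training set.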

The lexicon and the guesser contain 53 different tags and 245 different ambiguity classes.

The training corpora do not cover all of them. The coverage is shown in the table below.

Corpus   tags             ambiguity classes   resulting tagger model
TRAIN1   53   100.00 %    215   87.76 %       HMM-1
TRAIN2   52    98.11 %    194   79.18 %       HMM-2
TRAIN3   52    98.11 %    177   72.24 %       HMM-3
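
The coverage percentages can be reproduced by dividing the number of tags (or ambiguity classes) seen in a training corpus by the totals stated above, 53 tags and 245 ambiguity classes. A short sketch of that arithmetic follows; the names TOTAL_TAGS, TOTAL_CLASSES, SEEN and coverage are hypothetical and used only for illustration.

```python
# Totals in lexicon and guesser, as stated in the text.
TOTAL_TAGS = 53
TOTAL_CLASSES = 245

# (tags seen, ambiguity classes seen) per training corpus, from the table.
SEEN = {
    "TRAIN1": (53, 215),
    "TRAIN2": (52, 194),
    "TRAIN3": (52, 177),
}


def coverage(seen, total):
    """Return the covered share in percent, rounded to two decimals."""
    return round(100.0 * seen / total, 2)


for corpus, (tags, classes) in SEEN.items():
    tag_cov = coverage(tags, TOTAL_TAGS)
    class_cov = coverage(classes, TOTAL_CLASSES)
    print(f"{corpus}: tags {tag_cov:.2f} %, ambiguity classes {class_cov:.2f} %")
```

For example, TRAIN1 covers 215 of the 245 ambiguity classes, which gives 87.76 %.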

