To compare the effects of using different lexicons for training and testing, we chose two test setups for the Xerox HMM tagger: one with the regular lexicon only, and a second with the extended lexicon, which is the same lexicon as used in the TreeTagger test.
(1) Regular training lexicon
| Corpus statistics | Training corpus | Test corpus |
|---|---|---|
| Tokens | 62860 | 13416 |
| Tags | 51 | 46 |
| Lexicon gaps | 1756 | 283 |
| Lexical errors | 543 | 65 |
| Ambiguity classes | 263 | 196 |
| Ambiguity rate | 1.69 | 1.67 |
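The statistics above (lexicon gaps, lexical errors, ambiguity classes, ambiguity rate) can in principle be computed from a lexicon and a gold-tagged corpus. The following is a minimal sketch; the data structures (a tag-set-valued lexicon, a list of word/tag pairs) and the convention of averaging the ambiguity rate over lexicalized tokens only are assumptions for illustration, not taken from the actual tools:

```python
def corpus_statistics(tagged_tokens, lexicon):
    """Compute tagger-evaluation corpus statistics.

    tagged_tokens: list of (word, gold_tag) pairs.
    lexicon: dict mapping word forms to their set of possible tags.
    (Both are hypothetical structures for illustration.)
    """
    tags = {tag for _, tag in tagged_tokens}
    # A lexicon gap: the word form is missing from the lexicon.
    gaps = sum(1 for w, _ in tagged_tokens if w not in lexicon)
    # A lexical error: the word form is listed, but its correct tag is
    # not among the tags the lexicon assigns to it.
    lexical_errors = sum(
        1 for w, t in tagged_tokens if w in lexicon and t not in lexicon[w]
    )
    # An ambiguity class is the set of tags a word form can take; the
    # ambiguity rate is the average size of that set per token (here
    # averaged over lexicalized tokens only -- one possible convention).
    classes = {frozenset(lexicon[w]) for w, _ in tagged_tokens if w in lexicon}
    known = [w for w, _ in tagged_tokens if w in lexicon]
    rate = sum(len(lexicon[w]) for w in known) / len(known)
    return {
        "tokens": len(tagged_tokens),
        "tags": len(tags),
        "lexicon gaps": gaps,
        "lexical errors": lexical_errors,
        "ambiguity classes": len(classes),
        "ambiguity rate": rate,
    }
```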
Error statistics (LE = lexical errors, DE = disambiguation errors)

| ambiguity | tokens | in % | correct | in % | LE | in % | DE | in % |
|---|---|---|---|---|---|---|---|---|
| 1 | 7978 | 59.5 | 7942 | 99.6 | 36 | 0.5 | - | - |
| 2 | 2663 | 19.9 | 2482 | 93.2 | 13 | 0.5 | 168 | 6.3 |
| 3 | 2078 | 15.5 | 2014 | 96.9 | 8 | 0.4 | 56 | 2.7 |
| 4 | 589 | 4.4 | 518 | 88.0 | 7 | 1.2 | 64 | 10.9 |
| 5 | 81 | 0.6 | 71 | 87.9 | 1 | 1.2 | 9 | 11.1 |
| 6 | 19 | 0.1 | 16 | 84.2 | 0 | - | 3 | 15.8 |
| 7 | 8 | 0.1 | 7 | 87.5 | 0 | - | 1 | 12.5 |
| total | 13416 | 100.0 | 13050 | 97.3 | 65 | 0.5 | 301 | 2.2 |
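The per-ambiguity breakdown above separates lexical errors (the correct tag is missing from the word's ambiguity class, so the tagger cannot choose it; note that tokens with ambiguity 1 can only fail this way) from disambiguation errors (the correct tag was available but not chosen). A sketch of how such a breakdown can be tabulated, with hypothetical data structures rather than the evaluation scripts actually used:

```python
from collections import defaultdict

def error_breakdown(results, lexicon):
    """Tabulate tagging results by ambiguity-class size.

    results: list of (word, gold_tag, tagger_tag) triples.
    lexicon: dict mapping word forms to tag sets.
    (Both are illustrative structures.)
    """
    rows = defaultdict(lambda: {"tokens": 0, "correct": 0, "LE": 0, "DE": 0})
    for word, gold, guessed in results:
        n = len(lexicon.get(word, ()))  # size of the ambiguity class
        row = rows[n]
        row["tokens"] += 1
        if guessed == gold:
            row["correct"] += 1
        elif gold not in lexicon.get(word, ()):
            # Lexical error: the correct tag is not listed for this word.
            row["LE"] += 1
        else:
            # Disambiguation error: wrong choice among the listed tags.
            row["DE"] += 1
    return dict(rows)
```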
Most frequent errors (by word form)

| number | word | correct tag | tagger tag |
|---|---|---|---|
| 13 | DM | NN | NE |
| 9 | Osthold | NE | ADJD |
| 6 | das | PDS | ART |
| 5 | werden | VAFIN | VAINF |
| 5 | Reich | NE | NN |
| 4 | haben | VAFIN | VAINF |
| 4 | dem | PRELS | ART |
| 4 | Um | KOUI | APPR |
| 4 | Deutschland | NE | NN |
Most frequent errors (by tags)

| number | correct tag | tagger tag |
|---|---|---|
| 55 | NN | NE |
| 39 | NE | NN |
| 28 | VVFIN | VVINF |
| 16 | VVFIN | VVPP |
| 12 | KON | ADV |
| 11 | ADJD | VVPP |
| 11 | ADJD | ADV |
| 10 | VVINF | VVFIN |
| 10 | NE | ADJD |
(2) Extended training lexicon
Even though the Xerox HMM tagger is based on the same basic lexicon as the TreeTagger (see 6.1.1), the ambiguity classes of the word forms are not always the same, as one might have expected.
The difference arises because the lexicon used internally by the TreeTagger omits marginal (i.e. very rare) readings of word forms and thus reduces the ambiguity classes of lexicalized word forms. For non-lexicalized word forms, however, the TreeTagger uses ambiguity classes containing up to 10 tags, whereas the largest ambiguity class of the tested Xerox HMM tagger contains 7.
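The omission of marginal readings described above can be sketched as follows; the relative-frequency threshold and the count-based lexicon are assumptions for illustration, not the TreeTagger's actual mechanism:

```python
def prune_ambiguity_classes(lexicon, min_rel_freq=0.01):
    """Drop marginal readings: a tag is removed from a word form's
    ambiguity class when its relative frequency for that form falls
    below min_rel_freq (threshold value is illustrative).

    lexicon: dict mapping word forms to {tag: count} dicts.
    Returns a dict mapping word forms to their pruned tag sets.
    """
    pruned = {}
    for word, tag_counts in lexicon.items():
        total = sum(tag_counts.values())
        kept = {t for t, c in tag_counts.items() if c / total >= min_rel_freq}
        # Never prune away all readings: fall back to the most frequent tag.
        pruned[word] = kept or {max(tag_counts, key=tag_counts.get)}
    return pruned
```

Pruning rare readings shrinks the ambiguity classes of lexicalized word forms, which is consistent with the smaller classes observed for the TreeTagger's internal lexicon.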
| Corpus statistics | Training corpus | Test corpus |
|---|---|---|
| Tokens | 62860 | 13416 |
| Tags | 51 | 46 |
| Lexicon gaps | 0 | 241 |
| Lexical errors | 0 | 49 |
| Ambiguity classes | 275 | 205 |
| Ambiguity rate | 1.64 | 1.69 |
Error statistics

| ambiguity | tokens | in % | correct | in % | LE | in % | DE | in % |
|---|---|---|---|---|---|---|---|---|
| 1 | 7962 | 59.4 | 7932 | 99.6 | 30 | 0.4 | - | - |
| 2 | 2660 | 19.8 | 2481 | 93.3 | 11 | 0.4 | 168 | 6.3 |
| 3 | 2082 | 15.5 | 2019 | 97.0 | 5 | 0.2 | 58 | 2.8 |
| 4 | 480 | 3.6 | 399 | 83.1 | 3 | 0.6 | 78 | 16.3 |
| 5 | 178 | 1.3 | 161 | 90.5 | 0 | - | 17 | 9.5 |
| 6 | 47 | 0.4 | 38 | 80.9 | 0 | - | 9 | 19.1 |
| 7 | 7 | 0.1 | 5 | 71.4 | 0 | - | 2 | 28.6 |
| total | 13416 | 100.0 | 13035 | 97.2 | 49 | 0.4 | 332 | 2.5 |
Most frequent errors (by word form)

| number | word | correct tag | tagger tag |
|---|---|---|---|
| 13 | DM | NN | NE |
| 9 | Osthold | NE | ADJD |
| 6 | das | PDS | ART |
| 5 | werden | VAFIN | VAINF |
| 5 | Reich | NE | NN |
| 4 | haben | VAFIN | VAINF |
| 4 | dem | PRELS | ART |
| 4 | Um | KOUI | APPR |
| 4 | Deutschland | NE | NN |
| 4 | Asher | NE | ADJA |
Most frequent errors (by tags)

| number | correct tag | tagger tag |
|---|---|---|
| 42 | NE | NN |
| 41 | NN | NE |
| 28 | VVFIN | VVINF |
| 20 | NE | ADV |
| 18 | NE | ADJD |
| 15 | VVFIN | VVPP |
| 12 | KON | ADV |
| 12 | ADJD | ADV |
| 11 | ADJD | VVPP |