The Xerox tagger is based on Hidden Markov Models (HMMs). It applies the Baum-Welch (or Forward-Backward) algorithm for training and the Viterbi algorithm for tagging. As input, the tagger requires a finite-state lexicon that maps surface word forms to their possible tags, and a finite-state guesser that assigns tags to word forms not found in the lexicon.
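To make the tagging step concrete, the following is a minimal sketch of Viterbi decoding for such an HMM. The function name and the data layout (dictionaries of initial, transition and emission probabilities) are illustrative, not the Xerox tagger's actual interface; where the real tagger would consult the finite-state guesser for unknown word forms, the sketch simply falls back to a small floor probability.

```python
import math

def viterbi(words, tags, init, trans, emit):
    """Return the most probable tag sequence for `words` under an HMM."""
    LOW = math.log(1e-12)  # floor for unseen events; the real tagger would
                           # consult the finite-state guesser instead

    def lp(d, key):
        # log probability with a floor, to avoid underflow and log(0)
        p = d.get(key, 0.0)
        return math.log(p) if p > 0 else LOW

    # best[i][t]: log probability of the best tag path ending in t at position i
    best = [{t: lp(init, t) + lp(emit, (t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # pick the predecessor tag maximizing path score times P(t | prev)
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans, (p, t)))
            best[i][t] = (best[i - 1][prev] + lp(trans, (prev, t))
                          + lp(emit, (t, words[i])))
            back[i][t] = prev
    # trace back-pointers from the best final tag
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    path.reverse()
    return path
```

The dynamic program runs in time linear in sentence length and quadratic in tagset size, which is why tagset size matters for efficiency as well as accuracy.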
The Xerox HMM Tagger [Cutting et al. 1992] was designed to be trained on untagged corpora, but a more recent version [Wilkens, Kupiec 1995] allows initialization with untagged and/or tagged data. However, it turned out (cf. [Schiller 1996]) that when enough tagged training material is available, the best tagging results are obtained by simply initializing the tagger on the tagged corpus, with no further training.
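Initializing on a tagged corpus amounts to estimating the HMM parameters by relative frequency from the annotated data, rather than re-estimating them with Baum-Welch. The following sketch illustrates this under the same parameter layout as above; the function name and data format are assumptions for illustration, and smoothing of unseen events is omitted.

```python
from collections import Counter

def init_from_tagged(sentences):
    """Relative-frequency estimates of HMM parameters from tagged data.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    Returns (init, trans, emit) dictionaries as used by the Viterbi
    sketch above.
    """
    start, pair, emit_c, out_c, tag_c = (
        Counter(), Counter(), Counter(), Counter(), Counter())
    for sent in sentences:
        start[sent[0][1]] += 1          # sentence-initial tag
        for word, tag in sent:
            emit_c[(tag, word)] += 1    # tag emits word
            tag_c[tag] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            pair[(t1, t2)] += 1         # tag bigram
            out_c[t1] += 1              # transitions leaving t1
    init = {t: c / len(sentences) for t, c in start.items()}
    trans = {(t1, t2): c / out_c[t1] for (t1, t2), c in pair.items()}
    emit = {(t, w): c / tag_c[t] for (t, w), c in emit_c.items()}
    return init, trans, emit
```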
For our experiments we chose, among the available taggers, those which can easily be adapted to a specific language or tagset. This makes it possible to use a uniform lexicon and tagset for all the different methods and to compare the results (cf. 5.1), and it is necessary for the tests on tagset variations (cf. 5.3).