Background

The EAGLES work group on computational lexicons has produced recommendations for the morphosyntactic classification of word forms. These have been presented in the EAGLES document [Monachini, Calzolari 1994] in the form of an inventory of labels for word forms. From this inventory, formalized specifications have been derived for French, German and Italian and, in slightly less complete form, for English.

The proposals made in [Monachini, Calzolari 1994] are intended to be applicable to different European languages and to be independent of any particular NLP application. The language-specific documents contain typed linguistic specifications which are likewise not geared towards any single application.

However, corpus tagging is one of the main applications of the kind of morphosyntactic specifications which the EAGLES lexicon group has produced, and it has evidently influenced their development. The synopsis step which led to the proposals published in [Monachini, Calzolari 1994] was based on a review of potential predecessors and sources of input which included tagsets for corpus annotation, and the linguistic specifications for French, Italian and German (referred to in the following as ELM-FR, ELM-IT and ELM-DE, respectively) are influenced to some extent by experience from tagset development. Moreover, the constant interaction with the EAGLES Work Group on the linguistic annotation of text corpora has contributed to this orientation and has substantially shaped the language-specific ELM incarnations.

The work described in this report shows that the EAGLES-based ELM-DE specifications can indeed be used to derive a tagset which is practically usable for the tagging of German and which leads to acceptable results. Moreover, we were able to follow part of the history of the tagset and to evaluate the impact of the modifications introduced.

ELM-DE has two sources: the results of [Monachini, Calzolari 1994], and a corpus tagset developed jointly by the universities of Tübingen and Stuttgart since 1993/94. This tagset, now referred to as STTS (Stuttgart-Tübingen Tag Set; see [Schiller et al 1995]), is the one mainly used in the tests described in this report. A mapping between STTS and ELM-DE exists, and automatic tools are available which perform it. Where we have analyzed the impact of tagset changes on the tagging results, the modifications tested were influenced by differences between STTS and its predecessor, the IMS-TUE tagset (also jointly developed by Stuttgart and Tübingen).

The work reported here was carried out at Rank Xerox Research Centre (RXRC), Grenoble, France, and at the Institut für maschinelle Sprachverarbeitung, Computerlinguistik, of Universität Stuttgart (STR). The text material and tagsets all concern the German language.
