The EAGLES work group on computational lexicons has produced
recommendations for the morphosyntactic classification of word forms.
These are presented in the EAGLES document
[Monachini, Calzolari 1994] as an inventory of labels for word forms.
From this inventory, formalized specifications have been derived for
French, German and Italian and, in slightly less complete form, for
English.
The proposals made in [Monachini, Calzolari 1994] are intended to be applicable to different European languages and to be independent of any particular NLP application. The language-specific documents contain typed linguistic specifications which are likewise not geared towards any single application.
However, corpus tagging is one of the main applications of
the kind of morphosyntactic specifications which the EAGLES lexicon
group has produced, and it has clearly influenced their
development. The synopsis step which led to the proposals published in
[Monachini, Calzolari 1994] was based on a review of potential
predecessors and sources of input which included tagsets for corpus
annotation, and the linguistic specifications for French, Italian and
German (referred to below as ELM-FR, ELM-IT and ELM-DE, respectively)
are influenced to some extent by experience from tagset development.
Moreover, the constant interaction with the EAGLES Work Group on the
linguistic annotation of text corpora has contributed to
this orientation and has done much to shape the
language-specific ELM incarnations.
The work described in this report shows that the EAGLES-based ELM-DE specifications can indeed be used to derive a tagset which is usable in practice for the tagging of German and which leads to acceptable results. Moreover, we were able to trace part of the history of the tagset and to evaluate the impact of the modifications introduced.
ELM-DE has two sources: the results of [Monachini, Calzolari 1994]
and a corpus tagset developed jointly by the universities of
Tübingen and Stuttgart since 1993/94. This tagset, now referred to as
STTS (Stuttgart-Tübingen Tag Set; see [Schiller et al 1995]), is the
one mainly used in the tests described in this paper. A mapping between
STTS and ELM-DE exists, and automatic tools are available which perform
it. Where we have analyzed the impact of tagset changes on the tagging
results, the modifications tested were influenced by differences
between STTS and its predecessor, the IMS-TUE tagset (also jointly
developed by Stuttgart and Tübingen).
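
To illustrate what a mapping between corpus tags and typed morphosyntactic specifications might look like, the following is a minimal sketch in Python. It is not the mapping tool mentioned above. The STTS tag names (NN, NE, ADJA, ADJD, VVFIN, ART) are taken from STTS, but the attribute names and values on the right-hand side, as well as the function names `map_tag` and `map_tagged_tokens`, are hypothetical and chosen only for illustration; they do not reproduce the actual ELM-DE specification.

```python
from typing import Dict, List, Tuple

# STTS tag -> feature bundle.
# ASSUMPTION: the attribute names and values below are illustrative only
# and are not taken from the ELM-DE specification.
STTS_TO_FEATURES: Dict[str, Dict[str, str]] = {
    "NN": {"cat": "noun", "type": "common"},
    "NE": {"cat": "noun", "type": "proper"},
    "ADJA": {"cat": "adjective", "use": "attributive"},
    "ADJD": {"cat": "adjective", "use": "predicative_or_adverbial"},
    "VVFIN": {"cat": "verb", "type": "full", "finiteness": "finite"},
    "ART": {"cat": "article"},
}


def map_tag(tag: str) -> Dict[str, str]:
    """Return a copy of the feature bundle for one STTS tag."""
    try:
        return dict(STTS_TO_FEATURES[tag])
    except KeyError:
        raise ValueError(f"no mapping defined for tag {tag!r}") from None


def map_tagged_tokens(
    tokens: List[Tuple[str, str]]
) -> List[Tuple[str, Dict[str, str]]]:
    """Map a list of (word form, STTS tag) pairs to (word form, features)."""
    return [(word, map_tag(tag)) for word, tag in tokens]


if __name__ == "__main__":
    sentence = [("Das", "ART"), ("alte", "ADJA"), ("Haus", "NN")]
    for word, features in map_tagged_tokens(sentence):
        print(word, features)
```

A table-driven design of this kind keeps the correspondence declarative, so a change to the tagset only requires editing the table. Unknown tags raise an error rather than being passed through silently, which makes gaps in the mapping visible.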
The work reported here was carried out at Rank Xerox Research Centre (RXRC), Grenoble, France, and at the Institut für maschinelle Sprachverarbeitung, Computerlinguistik, of Universität Stuttgart (STR). The text material and tagsets all concern the German language.