The EAGLES proposal has been analysed from two points of view: that of a corpus and that of a French lexicon. The results have led to a first proposal for encoding a lexicon to be used as a basis for the specific application of corpus tagging within the MULTEXT project.
The sections on corpora describe the application of the model proposed by EAGLES to a French corpus, taking as a reference point the tagset developed and used at IBM France Scientific Center, in particular by the speech group. For ease of use, this tagset will be referred to as the IBMF tagset. Despite this focus on a particular tagset, general problems will also be mentioned, even where they raise no real question for French (for example, it is uncontroversial that nouns do not bear case information in French).
Tagsets such as the IBMF tagset have a certain bias since they are used for the very specific purpose of predicting the exact part-of-speech of words in a corpus; in other words, they are used for modelling the language at the morphological level, whereas a lexicon tagset would be developed for describing the language. For example, this tagset does not cover the full set of features for verbs (mood, tense, etc.), for two reasons:
For similar reasons, several corpus-oriented tagsets for the same language might differ considerably, depending on the goal pursued (e.g. speech recognition or terminology identification) or on the type of language modelling used (e.g. stochastic vs. rule-based models). The reader should therefore treat this contribution for French as only one possible application, not as an attempt to describe a universal solution.
The features are listed in the following order:
When only a part of the tag name is relevant to the specific attribute-value pair considered, this part will be put in bold font in the Tag column (e.g. in tag SUBSFS, the part SUBS applies to substantive, the F to feminine and the final S to singular).
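The correspondence between parts of a tag name and attribute-value pairs can be illustrated with a minimal sketch. Only the tag SUBSFS and its three readings (substantive, feminine, singular) come from the text. The attribute names `category`, `gender` and `number`, the segment tables and the function `decompose` are assumptions made for illustration and do not describe the actual layout of the IBMF tagset.

```python
# Hypothetical segment tables. Only SUBS, F and S (as in SUBSFS) are taken
# from the text; the attribute names are illustrative assumptions.
CATEGORY = {"SUBS": ("category", "substantive")}
GENDER = {"F": ("gender", "feminine")}
NUMBER = {"S": ("number", "singular")}


def decompose(tag):
    """Split a tag such as SUBSFS into attribute-value pairs.

    Assumes a category prefix followed by one gender letter and one
    number letter, which is a simplification made for this sketch.
    """
    for prefix, category_pair in CATEGORY.items():
        if tag.startswith(prefix):
            rest = tag[len(prefix):]
            break
    else:
        raise ValueError(f"unknown category in tag {tag!r}")

    if len(rest) != 2 or rest[0] not in GENDER or rest[1] not in NUMBER:
        raise ValueError(f"cannot decompose the remainder of tag {tag!r}")

    return dict([category_pair, GENDER[rest[0]], NUMBER[rest[1]]])


if __name__ == "__main__":
    print(decompose("SUBSFS"))
```

Each entry of the returned dictionary corresponds to one attribute-value pair, that is, to the part of the tag name that would be shown in bold in the Tag column.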