Application to French


Preliminary Recommendations

The EAGLES proposal has been analysed from two points of view: that of a corpus and that of a French lexicon. The results have led to a first proposal for encoding a lexicon, to be used as a basis for the specific application of corpus tagging within the MULTEXT project.

The sections on corpora describe the application of the EAGLES proposed model to a French corpus, taking as a reference point the tagset developed and used at IBM France Scientific Center, in particular by the speech group. For ease of use, this tagset will be referred to as the IBMF tagset. Despite this focus on a particular tagset, general problems will also be mentioned, even where they are uncontroversial for French (for example, everybody will agree that nouns do not bear case information in French).

Tagsets such as the IBMF tagset have a certain bias since they are used for the very specific purpose of predicting the exact part-of-speech of words in a corpus; in other words, they are used for modelling the language at the morphological level, whereas a lexicon tagset would be developed for describing the language. For example, this tagset does not cover the full set of features for verbs (mood, tense, etc.), for two reasons:

The graphic form of the verb will help, if necessary, to determine the missing features;
The addition of tags with these specific features would not improve the language model's essential capability: that of correctly predicting other tags.

For similar reasons, several corpus-oriented tagsets for the same language might differ considerably, depending on the goal pursued (e.g. speech recognition, terminology identification, etc.) or on the type of language modelling used (e.g. stochastic vs. rule-based models, etc.). Thus, the reader should be careful and consider this contribution for French only as one possible application, not as an attempt to describe a universal solution.

The features are listed in the following order:

The different EAGLES attributes/values applicable to the IBMF tagset;
The EAGLES features that are not applicable to the tagset or to French; and
The specific items for which there is no attribute/value in EAGLES (those which are not language-specific but rather tagset-specific and therefore do not need a new EAGLES attribute).

When only a part of the tag name is relevant to the specific attribute-value pair considered, this part will be put in bold font in the Tag column (e.g. in tag SUBSFS, the part SUBS applies to substantive, F to feminine and the final S to singular).
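The way a tag name breaks down into attribute-value pairs can be illustrated with a minimal sketch. It assumes that the example tag SUBSFS splits into the segments SUBS, F and S, and that the layout is a category prefix followed by one gender letter and one number letter. That layout is inferred from this single example and is not a description of the full IBMF tagset. The function name `decode_tag`, the lookup tables and the attribute names are hypothetical.

```python
# Hypothetical lookup tables: only the segments named in the text are covered.
CATEGORY = {"SUBS": ("category", "substantive")}
GENDER = {"F": ("gender", "feminine")}
NUMBER = {"S": ("number", "singular")}


def decode_tag(tag):
    """Split a tag into (segment, attribute, value) triples.

    Assumes the layout <category prefix><gender letter><number letter>,
    which is inferred from the single example SUBSFS.
    """
    # Find the category prefix that the tag starts with.
    for prefix, (attr, value) in CATEGORY.items():
        if tag.startswith(prefix):
            rest = tag[len(prefix):]
            break
    else:
        raise ValueError(f"unknown category in tag {tag!r}")

    # The remainder must be exactly one known gender and one known number letter.
    if len(rest) != 2 or rest[0] not in GENDER or rest[1] not in NUMBER:
        raise ValueError(f"unknown gender/number segment in tag {tag!r}")

    return [
        (prefix, attr, value),
        (rest[0], *GENDER[rest[0]]),
        (rest[1], *NUMBER[rest[1]]),
    ]


if __name__ == "__main__":
    for segment, attribute, value in decode_tag("SUBSFS"):
        print(f"{segment}: {attribute} = {value}")
```

Each triple pairs the relevant part of the tag name with the attribute-value pair it expresses, which is the information the bold segment conveys in the Tag column. Tags or segments outside the single documented example raise an error rather than being guessed.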
