Preliminary Recommendations
The interdependence between lexicon and corpus is
an important aspect for any activity aiming at
creating lexicons and/or tagsets to be shared by and made
available to the community.
The background motivation for this
was essentially the view of corpus tagging as just one
of the possible applications of a computational lexicon,
which has to be seen, in a more neutral context, as an
application-independent set of lexicon specifications.
Corpus tagging is in fact the
first obvious application of a computational lexicon and
cannot be developed on an independent basis: both the
lexicon specialists and the corpus specialists feel that it is
important to reconcile their two views.
The difference in perspective between the lexicon specification area
and that of corpus annotation can be seen at the level of
terminology:
- The terms feature and feature set are preferred when talking about lexicon
descriptions;
- The terms tag and tagset are preferred to refer to the information associated
with words in context, i.e. in corpus annotation.
For the sake of reusability, lexical descriptions should be (as
far as possible) independent of specific applications, and should
aim at a general description of each language.
The actual corpus tags depend on at least the
following:
- The lexicon features; and
- The capabilities of state-of-the-art taggers
to disambiguate between different lexicon descriptions or different
types of homography present in different languages.
Therefore,
morphosyntax can be encoded in a lexicon with fine granularity, while
a set of corpus tags usually reflects broader categories.
Corpus tags are, in fact,
developed for each language with a particular
application in mind, that of producing a corpus tagged for part of
speech (and possibly other morphosyntactic information) by means of
automatic disambiguation.
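To make the contrast between fine-grained lexical descriptions and broad corpus tags concrete, the short Python sketch below collapses full morphosyntactic descriptions into coarser tags. The feature strings, tag labels and collapse rules are invented for illustration only; they do not reproduce the MULTEXT encoding or any other actual lexicon or tagset.

    # A minimal sketch of collapsing fine-grained lexical descriptions into
    # broader corpus tags; all feature names and tag labels are hypothetical.

    def collapse(lexical_description: str) -> str:
        """Map a full morphosyntactic description to a coarser corpus tag."""
        features = dict(f.split("=") for f in lexical_description.split(","))
        pos = features["pos"]
        if pos == "verb":
            # Drop mood, person and number; keep only category and tense,
            # since e.g. indicative vs. subjunctive is hard to disambiguate.
            return "V-" + features.get("tense", "x").upper()
        if pos == "noun":
            # Keep number, drop gender.
            return "N-" + features.get("number", "x").upper()
        return pos.upper()

    # Two lexicon entries differing only in mood collapse to the same tag.
    print(collapse("pos=verb,mood=indicative,tense=present,person=3,number=sg"))
    print(collapse("pos=verb,mood=subjunctive,tense=present,person=3,number=sg"))
    print(collapse("pos=noun,gender=fem,number=pl"))
    # -> V-PRESENT, V-PRESENT, N-PL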
It would be ideal to tag a corpus with the lexical descriptions
themselves. However, it is well known that this is considerably
beyond the capabilities of state-of-the-art tagging techniques.
Corpus tags are, therefore, to be seen as underspecified versions of the
lexical descriptions. There are two reasons why we may want (or need) to
underspecify corpus tags:
- Experience shows that some distinctions are difficult to make
automatically with a high rate of accuracy.
For example, in some languages, the disambiguation of indicative
present and subjunctive present in a corpus is extremely difficult by
automatic means.
- In order to train a tagger, we typically need statistical tables (based on
co-occurrences of tags). If we have a large tagset, we need a very
large corpus to train the disambiguator, in order to observe rare
co-occurrences. For example, in the proposal for French
presented in the MULTEXT
document (Bel et al., 1995),
there are 249 different lexical descriptions, but only 74 collapsed
corpus tags. Experience (Church, UPenn Treebank, IBM France,
etc.) shows that the tagset should be under 100 tags. In fact, the Penn
Treebank project
collapsed many tags compared to the original Brown tagset, and obtained
better results (the effect of tagset size on training data is illustrated
in the sketch below).
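The effect of tagset size on the amount of training material needed can be illustrated with a back-of-the-envelope calculation, sketched in Python below. It takes the MULTEXT figures quoted above (249 lexical descriptions versus 74 collapsed corpus tags) and assumes a simple bigram tagger, where one transition probability must be estimated per ordered tag pair; the calculation is only an illustration of the sparsity argument, not part of the MULTEXT proposal itself.

    # The number of tag-pair transitions a bigram tagger must estimate grows
    # quadratically with the size of the tagset, so rare co-occurrences
    # become much harder to observe in a hand-tagged training corpus.

    def bigram_transitions(tagset_size: int) -> int:
        """Possible ordered tag pairs for a bigram model."""
        return tagset_size * tagset_size

    for size in (74, 249):
        print(f"{size:>4} tags -> {bigram_transitions(size):>7,} possible tag bigrams")

    #   74 tags ->   5,476 possible tag bigrams
    #  249 tags ->  62,001 possible tag bigrams
    # Roughly an 11-fold increase in transitions to estimate from the same
    # kind of training data.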
Two further observations are relevant to the relation
between lexicon specifications and corpus tags.
- Tag classes sometimes differ in nature from
lexical descriptions. For example,
classes for punctuation are needed, and certain types of
semantic, pragmatic, or lexical information may be present in
the tags (e.g. the days of the week).
- Furthermore, decisions on tag collapsing are often language-dependent,
and it may therefore not be appropriate to have completely
identical tagsets across languages. Certain language-specific
peculiarities must be preserved (e.g. if a particular distinction can
easily be maintained by an automatic tagger, it may be useful to keep it).