Recommendations
As has become evident in the preceding sections, an essential part
of any corpus annotation project is detailed documentation of
the annotation scheme employed. (For syntactic annotation, an
annotation scheme is also called a parsing scheme.)
Without documentation provided by its originators, an annotated
corpus is extremely difficult for other users to apply to their
own research tasks. Decisions taken in developing the annotation
scheme, as well as in applying it, should be well documented,
so that future users can apply the scheme in a manner consistent
with the originators' practice, and so that the new application
is in turn internally consistent.
The documentation should therefore include at least the following
classes of information:
- What layers of annotation (in terms of layers (a)-(h) above) have been undertaken. Each
of these layers covers a wide area of possible annotation, so the
documentation should also specify which phenomena in particular
the annotation scheme marks within each layer.
- What is the set of annotation devices used (e.g. brackets, labels).
- What are the meanings of these devices (e.g. Ns =
singular noun phrase; etc.). In the documentation of many existing
schemes, all that is presented is a list of the labels used, together
with a short gloss for each symbol (e.g. VP -- Verb Phrase). As has
been shown, a gloss of this kind
is not sufficient to describe how a label is used in an annotated corpus.
Each symbol should be described and illustrated with one or more
examples. (A toy illustration of how bracket-and-label devices
combine is sketched after this list.)
- What are the conventions for applying the
annotation devices to texts. A parsing scheme (or `grammatical
representation'; see Voutilainen 1994) is more than the two preceding
points: it also includes the set of guidelines or conventions by which
the annotation symbols are to be applied to text sentences, such
that (ideally) two different annotators, applying the scheme
manually to the same sentence, would agree on the analysis.
In this sense, a detailed annotation scheme is a
guarantee of consistency. A parsing scheme may include reference
to a lexicon, to a grammar or to a reference corpus of annotated
sentences. In practice it is very difficult for a parsing scheme
to achieve total coverage and total explicitness for a corpus, and
few annotation projects have come close to this ideal.
The nearest approach to it is the highly detailed parsing scheme
provided by Sampson (1995) for the SUSANNE Corpus.
Sampson's book discusses the various decisions taken in the
development and application of the SUSANNE annotation
scheme, and provides examples of the cases in which such
decisions must be taken. In this respect, Sampson's book is, so
far, a unique achievement.
- What is the measurable quality of the annotation.
This includes:
- to what extent the corpus has been checked
- accuracy rate
- consistency rate
These measures of annotation quality will depend
mainly on how the corpus is annotated. An automatic annotation
requires accuracy figures, usually given in terms of
recall and/or precision: a recall of less than 100%
indicates that some appropriate readings have been discarded,
while a precision of less than 100% indicates that superfluous
readings remain in the output in the form of system ambiguities (see Voutilainen et al. 1992 for discussion; a toy calculation is sketched after this list).
For a manual or manually post-edited annotation, a consistency
rate should be given. Different annotators can be given a certain
proportion of overlapping material, and their annotations of this
overlap can then be compared to produce consistency figures. The
method of comparison should itself be documented (one possible
method is sketched after this list).
The extent to which the corpus has been checked partly overlaps
with consistency checking, but is also relevant for a large
automatically annotated corpus. In some cases, automatically annotated
corpora are manually checked after annotation, and any modifications
made to the automatic annotation should be documented.
- Specificity -- how detailed or shallow the analysis is. To a
certain extent, the specificity of the analysis is shown by the
layers of annotation that have been applied. However, more detailed
documentation may be necessary to make clear the depth to which
an annotation is undertaken: for example, some aspects of a deep or
logical grammar may be included in an annotation while others are not
marked (e.g. marking of `logical subject/object', but no marking of
`traces').
- Ambiguity -- to what extent and in what respects
disambiguation (of machine-generated ambiguities)
has been carried out. During the annotation of a corpus, all ambiguities
may be resolved, or ambiguous structures may be left in the markup.
The resolution of problematic ambiguities should be documented, as should
any ambiguities left in the corpus.
- Incompleteness -- to what extent and in what respects the
annotation at any particular layer is incomplete. At a given layer,
certain markings may be omitted by the annotation scheme,
for ease of automated annotation or because of the intended purpose of
the resource. This information should also be included in the
documentation.
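To illustrate the points on annotation devices and their meanings above, the following is a minimal sketch, in Python, of how a labelled bracketing might be read into a tree structure. The notation and the labels S, Ns, V and P are invented for illustration and are not prescribed by these guidelines.

    # A minimal sketch: reading a labelled bracketing into a nested
    # structure. The notation and labels (S, Ns, V, P) are illustrative
    # only; a real scheme defines its own inventory of devices.

    def tokenize(text):
        """Split a labelled bracketing into '[', ']', labels and words."""
        return text.replace("[", " [ ").replace("]", " ] ").split()

    def parse(tokens, i=0):
        """Parse one constituent starting at position i.
        Returns (tree, next_i); a tree is (label, children), and each
        child is either a nested tree or a word string."""
        assert tokens[i] == "[", "a constituent must open with '['"
        label = tokens[i + 1]          # the label follows the bracket
        children, i = [], i + 2
        while tokens[i] != "]":
            if tokens[i] == "[":
                child, i = parse(tokens, i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1

    tree, _ = parse(tokenize("[S [Ns the cat] [V sat] [P on [Ns the mat]]]"))
    print(tree)
    # ('S', [('Ns', ['the', 'cat']), ('V', ['sat']),
    #        ('P', ['on', ('Ns', ['the', 'mat'])])])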
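The recall and precision figures discussed under the quality point above can be illustrated with a toy calculation over per-token readings. The tokens and tags below are invented; real figures would be computed against a benchmark annotation (cf. Voutilainen et al. 1992).

    # Recall below 100% means appropriate readings were discarded;
    # precision below 100% means superfluous readings survive as
    # unresolved system ambiguities. All data here are invented.

    gold = {                        # the appropriate reading(s) per token
        "that": {"CONJ"},
        "round": {"ADJ"},
        "table": {"N"},
    }
    output = {                      # readings left by the system
        "that": {"CONJ", "PRON"},   # ambiguity left unresolved
        "round": {"ADJ"},
        "table": {"V"},             # appropriate reading discarded
    }

    retained = sum(len(gold[t] & output[t]) for t in gold)
    recall = retained / sum(len(r) for r in gold.values())
    precision = retained / sum(len(r) for r in output.values())
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
    # recall = 0.67 ('table' lost its appropriate reading N)
    # precision = 0.50 (PRON and V are superfluous)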
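Similarly, the consistency rate can be illustrated by one deliberately simple comparison method: two annotators label the same overlap sample, and agreement is the proportion of identically labelled tokens. The labels are again invented, and a real project would also document how disagreements over bracketing, not just over labels, are counted.

    # One possible comparison method for the consistency rate.
    annotator_a = ["Ns", "V", "P", "Ns", "Ns"]
    annotator_b = ["Ns", "V", "P", "Np", "Ns"]   # one disagreement

    agreed = sum(a == b for a, b in zip(annotator_a, annotator_b))
    print(f"consistency = {agreed / len(annotator_a):.0%}")   # 80%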
In the future, a further component of documentation will
presumably be necessary. Assuming that the EAGLES guidelines are
eventually accepted as a standard for syntactic annotation, it will
be highly desirable to state to what extent and in what respects a
given annotation scheme conforms to that standard. We reiterate,
however, that at the present stage the guidelines put forward in
this document are highly provisional.