An annotation scheme which confined itself to the recommended category labels above would give a sparse and incomplete representation of the syntactic form of sentences in a corpus. For many purposes and for many languages, other major constituent labels, such as Auxiliary, Determiner Phrase, or (in English) Genitive Phrase will be found to be necessary. We will say no more of these here.
It is common practice in most existing schemes to further subcategorise sentences. [S] can be used for all sentences, or may be restricted to simple declarative sentences or imperatives, and questions with declarative word order. Other labels may be introduced, e.g. [SI] may be used to mark sentences in which the verb is imperative; [SQ] to mark questions, including such constructions as `yes/no' questions, or `tag questions'. Other descriptors may be required in order to mark sentences of kinds which are included in other sentences, e.g. `direct quotations' or `interpolated sentences'. Whenever the need for such refinements arises, they can be introduced into the annotation scheme. These additions should of course be documented.
Clause was introduced as a recommended category above. Clauses may be further subclassified traditionally by syntactic properties (such as whether the verb is finite or non-finite, or the clause verbless), or by functional properties (e.g. whether an embedded clause is nominal, adverbial, relative or comparative; see borderline categories). Further possibilities for subcategorisation may be language-specific. For example, in English it is common practice to categorise clauses according to their introductory subordinator, e.g. that-clauses, or wh-clauses.
At this level of analysis, the most useful distinctions are the purely syntactic ones of finite, non-finite, and verbless. The functional properties mentioned above are more dependent on the semantic role of the clause in question, and will therefore be dealt with later. These syntactic distinctions may be further subdivided: e.g. under non-finite one can distinguish between infinitive clauses, gerundival or participial clauses, and past participial clauses.
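The three-way syntactic distinction can be sketched as a small classifier over POS-tagged words. This is a minimal illustration only: the tag names (VFIN, VINF, VING, VPP) are hypothetical and not taken from any scheme discussed here, and the function assumes that the clause's own words, not those of embedded clauses, are passed in.

```python
# Hypothetical verb tags; a real tagset would define its own inventory.
FINITE_TAGS = {"VFIN"}
NON_FINITE_TAGS = {
    "VINF": "infinitive",
    "VING": "gerundival or participial",
    "VPP": "past participial",
}


def classify_clause(tagged_words):
    """Return (class, subclass) for a clause given as (word, tag) pairs.

    A clause containing any finite verb is finite, even if it also
    contains non-finite forms (e.g. an auxiliary plus a participle).
    """
    tags = [tag for _, tag in tagged_words]
    if any(tag in FINITE_TAGS for tag in tags):
        return ("finite", None)
    for tag in tags:
        if tag in NON_FINITE_TAGS:
            return ("non-finite", NON_FINITE_TAGS[tag])
    return ("verbless", None)
```

Checking for a finite verb first means that a sequence such as auxiliary plus past participle is classed as finite rather than past participial.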
Phrases may be further subcategorised to show syntactic features such as gender, person and number. Thus a Noun Phrase may be marked as singular, masculine, etc. However, as morpho-syntactic tagging currently tends to be a preliminary to syntactic annotation, this type of information can often be derived from the POS-tag of the head of the phrase. The grammatical features of the head of the constituent may therefore be percolated up to the highest node of that constituent.
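In 54 and 55, the subcategorisation features of the Noun filles in 54 (3rd Person Plural Feminine) are percolated up to the NP node in 55. In 56 and 57, the coordinate Noun Phrase in 57 is marked 3rd Person Masculine Plural, a value resolved from both conjuncts rather than copied from a single head: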
(54) | [NP Les filles_N3PlFem NP] [VP ont écrit [NP les lettres NP] VP] |
(55) | [NP Les filles NP-3PlFem] [VP ont écrit [NP les lettres NP] VP] |
(56) | [NP[NP Le garçon_N3MascSg NP] et [NP la fille_N3FemSg NP] NP] ... |
(57) | [NP[NP Le garçon_N3MascSg NP] et [NP la fille_N3FemSg NP] NP-3MascPl] ... |
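The percolation shown in 54 to 57 can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the function names are hypothetical, the tag shapes are read off the examples above (N, then person, then gender and number in either order), and the resolution rules for coordination (plural number, masculine if any conjunct is masculine, lowest person number wins) are assumptions that suit French but are not stated in the text.

```python
GENDERS = ("Masc", "Fem")
NUMBERS = ("Sg", "Pl")


def parse_noun_tag(tag):
    """Split a noun tag such as 'N3PlFem' or 'N3MascSg' into features."""
    if len(tag) < 2 or tag[0] != "N" or not tag[1].isdigit():
        raise ValueError(f"not a noun tag: {tag!r}")
    rest = tag[2:]
    gender = next((g for g in GENDERS if g in rest), None)
    number = next((n for n in NUMBERS if n in rest), None)
    if gender is None or number is None:
        raise ValueError(f"missing gender or number in {tag!r}")
    return {"person": int(tag[1]), "gender": gender, "number": number}


def percolate(head_tag):
    """Copy the head noun's features onto the NP label, as in 55."""
    parse_noun_tag(head_tag)  # validate only; the features are copied as-is
    return "NP-" + head_tag[1:]


def coordinate(head_tags):
    """Resolve the features of a coordinate NP from its conjuncts, as in 57."""
    feats = [parse_noun_tag(tag) for tag in head_tags]
    person = min(f["person"] for f in feats)  # assumed: 1st outranks 2nd outranks 3rd
    gender = "Masc" if any(f["gender"] == "Masc" for f in feats) else "Fem"
    return f"NP-{person}{gender}Pl"  # assumed: coordination is always plural
```

With these definitions, percolate("N3PlFem") gives the label used in 55, and coordinate(["N3MascSg", "N3FemSg"]) gives the label used in 57.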
Optionally, syntactic functions can be assigned to constituents. At the rank of sentence or clause, the Lexicon/Syntax SubGroup mark only one grammatical function, namely +/-Subject, but other grammatical functions may be derived from the combination of several other syntactic features plus the property +/-Subject. For annotated corpora, we propose Subject, Object, Indirect Object (if it applies in a language) and Adjunct. If used, these labels may be hyphenated to the constituent label, as in 58:
(58) | [S [NP-Subj John NP-Subj] [VP gave [NP-Obj a book NP-Obj] [PP-IndObj to [NP Mary NP] PP-IndObj] [PP-Adjct on [NP Monday NP] PP-Adjct] VP] . S] |
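To show how hyphenated function labels of this kind might be read back from an annotated corpus, the following is a minimal sketch of a reader for the labelled bracketing used in 58. The function names and the dictionary layout are hypothetical. The reader assumes that each constituent opens with "[LABEL" and closes with "LABEL]", and it checks only that the category before the hyphen matches, since a closing label may carry more information than the opening one (as in 55).

```python
import re

# An opening bracket with its label, a label with its closing bracket, or a word.
TOKEN = re.compile(r"\[[^\s\[\]]+|[^\s\[\]]+\]|[^\s\[\]]+")


def parse(text):
    """Parse labelled bracketing into a list of nested dicts."""
    root = {"label": None, "function": None, "children": []}
    stack = [root]
    for tok in TOKEN.findall(text):
        if tok.startswith("["):
            label, _, function = tok[1:].partition("-")
            node = {"label": label, "function": function or None, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif tok.endswith("]"):
            if len(stack) == 1:
                raise ValueError(f"unmatched closing label {tok!r}")
            node = stack.pop()
            if tok[:-1].partition("-")[0] != node["label"]:
                raise ValueError(f"{tok!r} does not close [{node['label']}")
        else:
            stack[-1]["children"].append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed constituent: " + stack[-1]["label"])
    return root["children"]


def words(node):
    """Return the words dominated by a node, in order."""
    out = []
    for child in node["children"]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


def functions(node):
    """List (function, category, text) for every function-marked constituent."""
    found = []
    for child in node["children"]:
        if isinstance(child, dict):
            if child["function"]:
                found.append((child["function"], child["label"], " ".join(words(child))))
            found.extend(functions(child))
    return found
```

Applied to the bracketing in 58 (without the example number and the surrounding bars), functions(parse(...)[0]) lists the Subj, Obj, IndObj and Adjct constituents in sentence order, each with its category and the words it covers.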