We must, for practical as well as definitional reasons, restrict our attention to corpora considered as collections of texts or textual samples of language. Texts are linear; syntactic structures, on the other hand, are often represented in two-dimensional terms, especially as tree structures or, in greater detail, as tree structures whose nodes are sets of attributes and values. As far as syntactic annotation is concerned, we are interested only in how these two- or multi-dimensional structures are represented in relation to the linearity of texts.
Two linear formats are in general use for storing, inputting and outputting text data: horizontal and vertical. A syntactically annotated text can be represented in either format without changing the nature of the annotation, and converting from horizontal to vertical format, or vice versa, is a relatively trivial operation if undertaken automatically. From the user's point of view, however, the difference between the two formats is certainly not trivial, as it may make the difference between an intelligible and an unintelligible presentation. We will use examples from some corpora to illustrate this.
The first example is from the Associated Press Corpus with Lancaster skeleton parsing annotation. The sentence in (2) can be represented in a horizontal format, as in table 1.
(2) The door, which was equipped with neither bell nor knocker, was blistered and distained.
The labelled bracketed analysis can be represented in a vertical format, as in table 2. The original sentence is in the first column, the part-of-speech tags in the second, and the brackets and labels constituting the syntactic annotation appear in the third column.
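The conversion between the two formats can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used for any of the corpora mentioned: it assumes a simplified bracketing in which every token is separated by whitespace, words are written as `word_TAG`, an opening bracket is written `[LABEL` and a closing bracket `LABEL]`. The function names and the tags in the example are hypothetical and chosen only for illustration.

```python
def horizontal_to_vertical(bracketed):
    """Turn a horizontal labelled bracketing into (word, tag, brackets) rows.

    Assumed notation (a simplification): tokens are whitespace-separated,
    words look like word_TAG, brackets look like [LABEL and LABEL].
    """
    rows = []
    pending = []  # opening brackets waiting for the next word
    for token in bracketed.split():
        if token.startswith("["):
            pending.append(token)
        elif token.endswith("]"):
            if not rows:
                raise ValueError("closing bracket before any word")
            rows[-1][2].append(token)  # closing brackets attach to the last word
        else:
            word, sep, tag = token.rpartition("_")
            if not sep:
                raise ValueError(f"token without a tag: {token!r}")
            rows.append([word, tag, pending])
            pending = []
    return [(word, tag, " ".join(brackets)) for word, tag, brackets in rows]


def vertical_to_horizontal(rows):
    """Rebuild the horizontal bracketing from (word, tag, brackets) rows."""
    out = []
    for word, tag, brackets in rows:
        labels = brackets.split()
        out.extend(label for label in labels if label.startswith("["))
        out.append(f"{word}_{tag}")
        out.extend(label for label in labels if label.endswith("]"))
    return " ".join(out)


# Illustrative input only; the tags are placeholders, not corpus data.
horizontal = "[S [N The_AT door_NN1 N] [V was_VBDZ blistered_VVN V] S]"
rows = horizontal_to_vertical(horizontal)
for word, tag, brackets in rows:
    print(f"{word:<12}{tag:<8}{brackets}")
```

Each row carries the word form, its tag, and the brackets that open before or close after that word, so under these assumptions the round trip through `vertical_to_horizontal` reproduces the original string. Real formats need extra care, for instance with punctuation tokens that themselves contain brackets.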
Table 3 is an example in horizontal format from the IBM Paris Treebank (Langé 1994).
The horizontal format is more compact, and is easier to read so long as the amount of syntactic information interspersed with the words is not too dense. The vertical format is more convenient and more readable if there is too much syntactic information to be conveniently shown in the horizontal format. Moreover, the vertical format lends itself to a number of parallel fields of information, so that (for example) the actual orthographic text (as a sequence of word forms and punctuation marks) can be separated out from the sequence of morphosyntactic tags, and both of these separated from the representation of a phrase structure tree. Other fields may contain corpus location references, or deep syntactic information (such as ellipsis) held in a separate field alongside the surface syntactic information. Table 4 is an example from the SUSANNE corpus (Sampson 1995), which gives an impression of the various aligned information types that can be given. The columns (i.e. fields) contain the following information:
The field that indicates the structure of the sentence can be made more graphically explicit by the use of indentation. The example from TOSCA in table 5 illustrates this. The first level is Utterance, the second level comprises NP, VP and PP, and so on. (This indented format is in fact an intermediate structure, the final output being represented as a tree on the screen.)
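The idea of showing depth by indentation can be sketched as follows. This is only an illustration of the principle, not the TOSCA format itself: the tree is assumed to be a nested `(label, children)` tuple with plain strings as leaves, and the function name, the tree and its labels are hypothetical.

```python
def indent_tree(node, depth=0, step=2):
    """Return the lines of an indented display, one node per line.

    A node is either a string (a word) or a (label, children) pair;
    each level of embedding adds `step` spaces of indentation.
    """
    pad = " " * (depth * step)
    if isinstance(node, str):
        return [pad + node]
    label, children = node
    lines = [pad + label]
    for child in children:
        lines.extend(indent_tree(child, depth + 1, step))
    return lines


# Illustrative tree only; not taken from any corpus.
tree = ("Utterance", [("NP", ["The", "door"]), ("VP", ["was", "blistered"])])
lines = indent_tree(tree)
print("\n".join(lines))
```

The top-level label starts at the left margin, its immediate constituents are indented one step, and the words they dominate one step further, so the depth of each node can be read directly from its horizontal position.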