Only one attribute is considered obligatory, namely the major word category, or part of speech:
| 1. N [noun] | 2. V [verb] | 3. AJ [adjective] |
| 4. PD [pronoun/determiner] | 5. AT [article] | 6. AV [adverb] |
| 7. AP [adposition] | 8. C [conjunction] | 9. NU [numeral] |
| 10. I [interjection] | 11. U [unique/unassigned] | 12. R [residual] |
| 13. PU [punctuation] | | |
Of these, the last three values need explanation.
The unique value (U) is applied to categories with a unique or very small membership, such as negative particle, which are 'unassigned' to any of the standard part-of-speech categories. The label 'unique' cannot always be applied strictly, since (for example) Greek has three negative particles.
The residual value (R) is assigned to classes of textword which lie outside the traditionally accepted range of grammatical classes, although they occur quite commonly in many texts and very commonly in some. Examples include foreign words and mathematical formulae. It can be argued that these are on the fringes of the grammar or lexicon of the language in which the text is written. Nevertheless, they need to be tagged.
Punctuation marks (PU) are (perhaps surprisingly) treated here as a part of morphosyntactic annotation, because punctuation marks are very commonly tagged and treated as equivalent to words for the purposes of automatic tag assignment.
The symbols used to represent the major categories (above) and the attributes and values of other categories (below) will be used later for a method of language-neutral representation called the Intermediate Tagset.
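As an illustration only, the thirteen major-category symbols listed above can be held in a simple lookup table. The sketch below is in Python and is not part of the recommendations. The names `MAJOR_CATEGORIES` and `unknown_tags` are hypothetical, and so are the example sentence and its hand-assigned tags. It does not implement the Intermediate Tagset, which also covers the attributes and values of other categories.

```python
# Major-category symbols and their labels, copied from the table above.
MAJOR_CATEGORIES = {
    "N": "noun",
    "V": "verb",
    "AJ": "adjective",
    "PD": "pronoun/determiner",
    "AT": "article",
    "AV": "adverb",
    "AP": "adposition",
    "C": "conjunction",
    "NU": "numeral",
    "I": "interjection",
    "U": "unique/unassigned",
    "R": "residual",
    "PU": "punctuation",
}


def unknown_tags(tagged_tokens):
    """Return the (textword, symbol) pairs whose symbol is not a major category."""
    return [(word, tag) for word, tag in tagged_tokens
            if tag not in MAJOR_CATEGORIES]


# Hypothetical hand-tagged sentence (illustrative only). The final full stop
# is tagged PU, i.e. it is treated like any other textword.
example = [("The", "AT"), ("dog", "N"), ("barked", "V"), (".", "PU")]

for word, tag in example:
    print(f"{word}\t{tag}\t{MAJOR_CATEGORIES[tag]}")
```

Because the major category is the one obligatory attribute, a check such as `unknown_tags(example)` is a natural first validation step for tagged text: an empty result means that every textword, punctuation included, carries one of the thirteen symbols.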