The Constraint Grammar framework (Karlsson et al., 1995) has been combined with Two-level Morphology to produce a system for the syntactic analysis of unrestricted text. The most comprehensive implementation is ENGCG (for the analysis of written English), but versions are currently being developed for Finnish, Swedish, Danish, German, Basque and French.
The Constraint Grammar framework differs from the syntactically annotated corpora under study in this section in three respects:
The analysis is carried out at word level: every text word receives one or more morphosyntactic analyses, consisting of:
Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:
The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.
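The tokenisation step can be sketched roughly as follows. This is only an illustration of the three operations described above; the inventories of multiword units and enclitics, and the function name, are invented for this sketch and are far smaller than anything a real tokeniser would use:

```python
import re

# Illustrative inventories; a real tokeniser's lists are far larger.
MULTIWORD_UNITS = {("in", "spite", "of"), ("as", "well", "as")}
ENCLITICS = {"n't": ("not",), "'s": ("'s",), "'re": ("are",)}

def tokenise(text):
    """Split text into grammatical words: separate punctuation,
    join multiword units, and split enclitic forms."""
    # Separate punctuation marks from word forms.
    raw = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    tokens = []
    for tok in raw:
        # Split enclitic forms into grammatical words (e.g. don't -> do + not).
        for clitic, expansion in ENCLITICS.items():
            if tok.endswith(clitic) and len(tok) > len(clitic):
                tokens.append(tok[:-len(clitic)])
                tokens.extend(expansion)
                break
        else:
            tokens.append(tok)
    # Join multiword units into single grammatical words.
    out, i = [], 0
    while i < len(tokens):
        for length in (3, 2):
            unit = tuple(t.lower() for t in tokens[i:i + length])
            if unit in MULTIWORD_UNITS:
                out.append("_".join(tokens[i:i + length]))
                i += length
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For instance, `tokenise("They don't work in spite of it.")` separates the final full stop, splits don't into do and not, and joins in spite of into a single token.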
This process begins with lexical analysis, based on a large lexicon that includes all inflected and the central derived word forms. The lexical analyser assigns every possible morphological analysis to each word found in the lexicon; the remaining words are analysed by the guesser, a heuristic rule-based module. The guesser's rules are governed mainly by word shape, and if none of them applies, a nominal analysis is assigned.
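The lookup order just described (lexicon first, then shape-based guesser rules, then a default nominal reading) can be sketched as below. The lexicon entries, tag strings and guesser rules are invented for illustration and do not reflect the actual ENGCG lexicon:

```python
# Illustrative lexicon: each word maps to its (lemma, tags) readings.
LEXICON = {
    "the": [("the", "DET")],
    "dog": [("dog", "N NOM SG")],
    "barks": [("bark", "V PRES SG3"), ("bark", "N NOM PL")],
}

# Guesser rules keyed on word shape (here, simple suffix patterns).
GUESSER_RULES = [
    ("ly", [("?", "ADV")]),
    ("ed", [("?", "V PAST"), ("?", "A")]),
    ("s",  [("?", "N NOM PL"), ("?", "V PRES SG3")]),
]

def analyse(word):
    """Return all morphological readings for a word: lexicon lookup
    first, then shape-based guesser rules, then a nominal default."""
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    for suffix, readings in GUESSER_RULES:
        if word.lower().endswith(suffix):
            return readings
    # No guesser rule applies: assign a nominal analysis.
    return [("?", "N NOM SG")]
```

Note that a word such as barks comes out of this stage with more than one reading; resolving that ambiguity is the job of the constraints applied next.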
At this stage the rule-based Constraint Grammar parser resolves some of the ambiguities. The constraints are, in essence, partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar, for example, contains about 1,200 grammar-based constraints plus 200 heuristic constraints.
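The character of such a constraint can be illustrated with a toy disambiguation step. The single constraint below ("discard a verb reading immediately after a determiner") is a deliberately simplified stand-in, not one of ENGCG's actual constraints; the data layout is likewise invented:

```python
# A toy Constraint Grammar disambiguation step. Constraints remove
# contextually impossible readings from a word's set of alternatives;
# they never remove the word itself, and never its last reading.

def apply_constraints(cohorts):
    """cohorts: list of (word, readings) pairs, where each reading is
    a (lemma, tags) tuple. Returns the list with readings pruned."""
    for i, (word, readings) in enumerate(cohorts):
        # Illustrative constraint: after a determiner, discard verb readings.
        if i > 0 and any("DET" in tags for _, tags in cohorts[i - 1][1]):
            survivors = [r for r in readings if "V" not in r[1].split()]
            if survivors:  # keep at least one reading
                cohorts[i] = (word, survivors)
    return cohorts
```

Applied to the barks, where barks is ambiguous between a verb and a plural noun, the constraint leaves only the noun reading, since the preceding determiner rules out a finite verb in that position.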
All possible syntactic tags are then introduced for each word; in some cases this means that more than ten alternatives are given for a single morphological reading.
Finally, the parser consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, similar in form to the rules used at the morphological disambiguation stage.
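The two syntactic steps (introducing all possible function tags, then pruning them with context constraints) can be sketched together. The tag inventory and the single pruning constraint below are invented for illustration; ENGCG's actual tag set and constraints are described around tables 3.4 and 3.5:

```python
# Illustrative inventory: which syntactic function tags each word
# class can in principle carry. Invented for this sketch.
FUNCTION_TAGS = {
    "N":    ["@SUBJ", "@OBJ", "@<P"],   # noun functions
    "DET":  ["@DN>"],                    # determiner of a noun to the right
    "V":    ["@+FMAINV"],                # finite main verb
    "PREP": ["@ADVL"],                   # preposition heading an adverbial
}

def introduce_tags(cohorts):
    """Attach every possible syntactic tag to each morphological
    reading, producing (lemma, tags, function) triples."""
    tagged = []
    for word, readings in cohorts:
        expanded = []
        for lemma, tags in readings:
            pos = tags.split()[0]
            for fn in FUNCTION_TAGS.get(pos, ["@NPHR"]):
                expanded.append((lemma, tags, fn))
        tagged.append((word, expanded))
    return tagged

def prune(tagged):
    """One illustrative syntactic constraint: a noun immediately after
    a preposition keeps only its @<P (preposition complement) tag."""
    for i, (word, readings) in enumerate(tagged):
        prev_is_prep = i > 0 and any("PREP" in r[1] for r in tagged[i - 1][1])
        if prev_is_prep:
            survivors = [r for r in readings if r[2] == "@<P"]
            if survivors:
                tagged[i] = (word, survivors)
    return tagged
```

Run over by Karlsson, the first step gives the noun three candidate function tags, and the pruning constraint leaves only @<P, the preposition-complement reading discussed below.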
The English version of the Constraint Grammar marks the syntactic functions shown in table 3.4:
As mentioned above, the syntactic tags are distinguished by the use of the `@' sign. The analysis is partially dependency-based: as can be seen in table 3.5, dependency relations are indicated by left and right angle brackets, showing that a word depends on another word to either the right or the left. In the example below, Karlsson is marked `@<P', meaning that it is the complement of a preposition found earlier in the sentence (in this case the preposition by).
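The direction encoded in these tags can be made explicit with a small helper. The tag strings follow the convention just described, but the function itself is only an illustration, not part of the Constraint Grammar system:

```python
# Decode the head direction encoded in a CG-style syntactic tag:
# a '<' after the '@' points at a head to the left (e.g. @<P),
# a trailing '>' points at a head to the right (e.g. @DN>).

def head_direction(tag):
    """Return 'left', 'right', or None for a syntactic function tag."""
    body = tag.lstrip("@")
    if body.startswith("<"):
        return "left"
    if body.endswith(">"):
        return "right"
    return None  # tag carries no explicit dependency direction
```

So `@<P' points left (Karlsson's governing preposition precedes it), while a determiner tag such as `@DN>' points right, at the noun it modifies.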