Italian and German newspaper texts have been morphosyntactically
annotated according to the EAGLES
Italian (EAGLES, 1996d) and German (EAGLES, 1996a) specifications.
Table 7 summarises the distribution of texts in the German
corpus; the Italian corpus
is structured
in a very similar way.
Topic | Approximate size |
Economy | ca. 17,000 wordforms |
Politics | ca. 14,000 wordforms |
Culture | ca. 18,000 wordforms |
Sports | ca. 9,000 wordforms |
Local Events | ca. 8,500 wordforms |
Total | ca. 66,500 wordforms |
Table 7: EAGLES/ELSNET reference corpus for German
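The figures in Table 7 can be checked, and turned into per-topic shares, with a few lines of Python. This is only an illustrative sketch: the dictionary name and the `shares` helper are hypothetical, and the counts are the approximate ("ca.") values from the table, so the percentages are approximate as well.

```python
# Approximate wordform counts per topic, copied from Table 7.
german_corpus = {
    "Economy": 17_000,
    "Politics": 14_000,
    "Culture": 18_000,
    "Sports": 9_000,
    "Local Events": 8_500,
}

# Sum of the five topic rows; should match the "Total" row of the table.
total = sum(german_corpus.values())


def shares(counts):
    """Return each topic's share of the corpus as a percentage (one decimal)."""
    whole = sum(counts.values())
    return {topic: round(100 * n / whole, 1) for topic, n in counts.items()}


if __name__ == "__main__":
    print(f"Total: ca. {total:,} wordforms")
    for topic, pct in shares(german_corpus).items():
        print(f"{topic:<13} {pct:5.1f} %")
```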
The texts were prepared as follows:
- The texts were automatically tagged and then manually
corrected (for German: automatically pretagged at level (b), fully manually
tagged at level (c); see below for levels);
- The overlap in the manual work (two or more linguists working on
the same text without knowing it) is some 10-20 per cent of the textual
material;
- The material was cross-checked by extraction of all wordforms tagged
the same way and by plausibility checks against morphology
systems, as well as by manual checking on closed class items.
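Two of the checks above lend themselves to a small sketch: extracting all wordforms that received the same tag, and comparing two annotators' tags on an overlapping text. The Python below is an assumption about how such checks could be scripted, not a description of the tools actually used. The function names are hypothetical, the input is assumed to be a list of (wordform, tag) pairs, and the tags in the usage example are illustrative only.

```python
from collections import defaultdict


def group_by_tag(tagged):
    """Collect, for each tag, the set of wordforms that received it.

    Reading through one group at a time makes outliers easy to spot,
    especially for closed-class tags with few members.
    """
    groups = defaultdict(set)
    for wordform, tag in tagged:
        groups[tag].add(wordform)
    return dict(groups)


def disagreements(first, second):
    """List the tokens on which two annotations of the same text differ.

    Each result is (position, wordform, tag_in_first, tag_in_second).
    Both annotations are assumed to cover the same token sequence.
    """
    if len(first) != len(second):
        raise ValueError("annotations must cover the same tokens")
    return [
        (i, wordform, tag_a, tag_b)
        for i, ((wordform, tag_a), (_, tag_b)) in enumerate(zip(first, second))
        if tag_a != tag_b
    ]
```

For example, with `a = [("Politik", "NN"), ("ist", "VAFIN")]` and `b = [("Politik", "NN"), ("ist", "VVFIN")]`, `disagreements(a, b)` reports the single token `ist` at position 1, and `group_by_tag(a)` maps `"NN"` to `{"Politik"}`.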
The material will be made available in the following forms:
- Level (b): STTS tags (DE)
- Level (b): ELM-DE attribute/value pair annotation (DE)
- Level (c): ELM-DE attribute/value pair annotation (DE)
- Level (c): ILC-Pisa tagset (IT)
- Level (c): ELM-IT attribute/value pair annotation (IT)
The following is a sample of the German text, in fully-fledged ELM
feature structure annotation (cf. level (c) above):
Mexiko [ pos=noun & type=prop ]
: [ pos=resid & type=punct & punct-t=c-final ]
Die [ pos=art ]
" [ pos=resid & type=punct & punct-t=non-c-final ]
Praesidentenmache [ pos=noun & type=com ]
" [ pos=resid & type=punct & punct-t=non-c-final ]
. [ pos=resid & type=punct & punct-t=c-final ]
Mexikanische [ pos=adj & use=attr ]
Politik [ pos=noun & type=com ]
ist [ pos=verb & type=aux & fin=fin & ( vm-f=ind | vm-f=konj )]
Almachie [ pos=noun & type=com ]
und [ pos=conj & type=coord ]
Tradition [ pos=noun & type=com ]
. [ pos=resid & type=punct & punct-t=c-final ]
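The line format of the sample is regular enough to be read mechanically. The Python sketch below parses one annotated line into a wordform and a mapping from attributes to their possible values. It rests on assumptions drawn only from the sample: one token per line, the wordform contains no whitespace, conjuncts are separated by `&`, and a parenthesised disjunction with `|` lists alternative values. The function name `parse_line` is hypothetical, and the real format may allow constructs not covered here.

```python
import re

# wordform, whitespace, then a bracketed feature description
LINE_RE = re.compile(r"^(\S+)\s+\[(.*)\]\s*$")


def parse_line(line):
    """Split one annotated line into (wordform, {attribute: [values]}).

    An attribute with several values comes from a disjunction such as
    '( vm-f=ind | vm-f=konj )', i.e. the annotation leaves it ambiguous.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an annotated token line: {line!r}")
    wordform, body = match.groups()
    features = {}
    for conjunct in body.split("&"):
        # A conjunct is either 'attr=value' or '( attr=v1 | attr=v2 )'.
        alternatives = conjunct.strip().strip("()").split("|")
        for pair in alternatives:
            attribute, sep, value = pair.strip().partition("=")
            if not sep:
                raise ValueError(f"expected attr=value, got {pair!r}")
            features.setdefault(attribute, []).append(value)
    return wordform, features


if __name__ == "__main__":
    line = "ist [ pos=verb & type=aux & fin=fin & ( vm-f=ind | vm-f=konj )]"
    print(parse_line(line))
```

Storing every value as a list keeps unambiguous and ambiguous attributes in the same shape, at the cost of not distinguishing a disjunction from an attribute repeated across conjuncts; the sample contains no such repetition.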