In the following, we briefly describe the text corpora used at STR
and RXRC for the work described here. The STR corpora were used for
tagger and tagset evaluation; the RXRC corpora were used for text type evaluation.
- STR: manually tagged corpus of 75,000 tokens
    - 50,000 from Frankfurter Rundschau
      This corpus is the ELSNET/EAGLES DE reference corpus, and is composed
      of news stories about economy (12,500), politics (12,500), culture (12,500),
      local events (6,250) and sports (6,250).
    - 25,000 from Stuttgarter Zeitung
      This corpus is STR internal.
- RXRC: automatically tagged and manually corrected corpus of 90,000 tokens
    - 20,000 reference corpus from SfS Tübingen
    - 30,000 news (Spiegel, Welt, taz, spektrum)
    - 26,000 fairy tales (Andersen, Bechstein, Grimm)
    - 14,000 other (reports, guidelines, ...)
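
As a quick consistency check on the figures above, the following minimal Python sketch adds up the token counts for each corpus and computes the share of each part. The counts are copied from the lists above; the names `STR_SOURCES`, `FR_TOPICS`, `RXRC_PARTS` and `shares` are hypothetical and chosen only for this illustration.

```python
# Token counts copied from the corpus description above.
# All identifiers are illustrative and not part of any tool used at STR or RXRC.

STR_SOURCES = {
    "Frankfurter Rundschau": 50_000,
    "Stuttgarter Zeitung": 25_000,
}

# Topic breakdown of the Frankfurter Rundschau part.
FR_TOPICS = {
    "economy": 12_500,
    "politics": 12_500,
    "culture": 12_500,
    "local events": 6_250,
    "sports": 6_250,
}

RXRC_PARTS = {
    "reference corpus (SfS Tübingen)": 20_000,
    "news": 30_000,
    "fairy tales": 26_000,
    "other": 14_000,
}


def shares(parts):
    """Return each part's share of the total token count, in percent."""
    total = sum(parts.values())
    return {name: round(100 * count / total, 1) for name, count in parts.items()}


if __name__ == "__main__":
    for label, parts in [
        ("STR", STR_SOURCES),
        ("Frankfurter Rundschau topics", FR_TOPICS),
        ("RXRC", RXRC_PARTS),
    ]:
        print(f"{label}: {sum(parts.values()):,} tokens")
        for name, pct in shares(parts).items():
            print(f"  {name}: {pct}%")
```

The sub-corpus sizes add up to the stated totals: 75,000 tokens for STR, 50,000 for the Frankfurter Rundschau part, and 90,000 for RXRC.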