Much of this section is drawn from the report of the EAGLES Corpus Subgroup on Spoken Language Corpora. Most general corpora include an element of transcribed spoken language -- broadly or even orthographically transcribed. A substantial quantity of spoken data, particularly impromptu recordings of ordinary people talking together, is regarded as one of the richest sources of insights into language.
There are corpora that consist only of spoken material, just as some restricted corpora contain only written material. There are also some quite different corpora (sometimes called speech corpora to point up the distinction), which are compiled by researchers into the intricacies of phonetics.
The crucial distinction is between a corpus of spoken language that is suitable to be run in parallel with a corpus of written language, to provide general evidence of the grammar, lexis, phraseology and style of the language; and a corpus of spoken language, often called a speech corpus, which is put together to further the research of the speech community into the nature of phonetic substance.
A sequence of conferences and seminars (see, for example, Leech et al. (eds), 1995) has clarified the issues and shown that the interests of the two groups overlap; given the present direction of research and the development of technology, they are overlapping more and more. However, much speech corpus material is too specialised for inclusion in a general corpus, and in the EAGLES typology of corpora such collections would be called special corpora.
In practical terms, most corpus providers see the need to include some transcribed spoken language, or at least to make provision for including such material at a later stage. Hence this paper briefly presents the issues that should be kept in mind -- that is, which aspects of spoken texts need to be accommodated in the preparation of a corpus of written material.