The main differences reviewed so far between the corpus linguistics community and the speech community in their approach to corpora containing spoken materials can be summarized in the following table:
 | Corpus linguistics | Speech research |
Materials | Unprepared, unelicited speech | Controlled, elicited speech |
Scope | Discourse, dialogue | Utterance |
Recordings | Natural environment | Controlled environment |
Transcription | Orthographic enriched (transcription) | Phonetic and orthographic aligned with the speech signal (labelling) |
Oriented towards | Symbolic, categorical representation | Speech signal, temporal representation |
A discussion of other differences between collections of written and spoken data can be found in the chapter devoted to corpus design in the EAGLES Handbook on Spoken Language Systems (EAGLES Spoken Language Working Group, 1995). Seven main differences are outlined there, having to do with the following aspects:
Biber (1988) and Halliday (1989) offer a more in-depth discussion of the differences between speaking and writing from a linguistic perspective.
As discussed in the next section, there has recently been a tendency towards integrating the needs of both communities, especially because the notion of a speech database used in speech research has gradually been enlarged to encompass the large collections of more natural data that are characteristic of work in corpus linguistics. However, one should not forget the differences that stem from the historical development of the two fields, which has led to an emphasis on elicited spoken language in the speech research community and on unelicited speech in corpus linguistics (Sinclair, 1993:68, 1994, 1996).