Differences

The main differences reviewed so far between the corpus linguistics community and the speech community in their approach to corpora containing spoken material can be summarized in the following table:

Aspect	Corpus linguistics	Speech research
Materials	Unprepared, unelicited speech	Controlled, elicited speech
Scope	Discourse, dialogue	Utterance
Recordings	Natural environment	Controlled environment
Transcription	Orthographic enriched (transcription)	Phonetic and orthographic, aligned with the speech signal (labelling)
Oriented towards	Symbolic, categorical representation	Speech signal, temporal representation
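
The contrast in the last two rows can be made concrete with a small data-structure sketch. This is only an illustration under stated assumptions: the class names `Token` and `Label`, the field names, and the `"word"` annotation value are hypothetical and do not correspond to any format used by either community.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Symbolic, categorical unit: a word form plus an annotation (hypothetical)."""
    form: str
    annotation: str


@dataclass
class Label:
    """Signal-oriented unit: a symbol anchored to the recording in time (hypothetical)."""
    symbol: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def labels_to_tokens(labels):
    """Discard the time alignment and keep only the symbolic sequence."""
    # The "word" annotation is a placeholder value chosen for this sketch.
    return [Token(form=label.symbol, annotation="word") for label in labels]


def total_duration(labels):
    """Sum the durations of all labelled segments, in seconds."""
    return sum(label.end - label.start for label in labels)


# Two labelled segments with made-up boundaries.
labels = [Label("hello", 0.0, 0.5), Label("world", 0.5, 1.25)]
tokens = labels_to_tokens(labels)
```

Going from labels to tokens loses the temporal information, whereas the reverse direction cannot be computed from the symbols alone; it requires access to the speech signal.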

A discussion of other differences between collections of written and spoken data can be found in the chapter devoted to corpus design in the EAGLES Handbook on Spoken Language Systems (EAGLES Spoken Language Working Group, 1995). Seven main differences are outlined there, having to do with the following aspects:

The durable character of text as opposed to the transient nature of speech, which must be recorded in some form before it can be studied.
The different production times involved in writing and speaking.
The different nature of the error correction processes in writing and in speaking: collections of written texts do not reflect the editing process, whereas transcriptions of unprepared speech reflect the interruptions, hesitations, repetitions and self-repairs made by the speaker.
The variations in the spoken versions of orthographically identical word forms as opposed to the invariant nature of their written representation.
The discrete nature of written text as opposed to the continuous character of speech, which requires the development of segmentation tools for the latter.
The size and storage requirements for written and spoken corpora.
The categorical information present in written text and the lack of categorical information in the speech signal.
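
The difference in storage requirements can be illustrated with a rough back-of-the-envelope calculation. All figures below are assumptions made for this sketch and are not taken from the Handbook: uncompressed 16 kHz, 16-bit, mono audio; a speaking rate of 150 words per minute; and an average of 6 bytes per word of plain text (including the separating space).

```python
def pcm_bytes(duration_s, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Size in bytes of uncompressed PCM audio (assumed recording format)."""
    return duration_s * sample_rate_hz * bytes_per_sample * channels


def text_bytes(n_words, avg_bytes_per_word=6):
    """Approximate size in bytes of a plain-text transcript (assumed average)."""
    return n_words * avg_bytes_per_word


# One hour of speech at an assumed 150 words per minute.
hour_audio = pcm_bytes(3600)        # 115,200,000 bytes
hour_text = text_bytes(150 * 60)    # 54,000 bytes
ratio = hour_audio / hour_text      # audio is roughly 2,000 times larger
```

Under these assumptions, an hour of recorded speech occupies about three orders of magnitude more storage than its orthographic transcription.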

Biber (1988) and Halliday (1989) contain a more in-depth discussion of differences between speaking and writing from a linguistic perspective.

As discussed in the next section, there has recently been a tendency towards integrating the needs of both communities, especially because the notion of speech database used in speech research has gradually been enlarged to encompass the large collections of more natural data that are characteristic of work in corpus linguistics. However, one should not forget the differences arising from the historical development of the two fields, which have led to an emphasis on elicited spoken language in the speech research community and on unelicited speech in corpus linguistics (Sinclair, 1993:68, 1994, 1996).
