
The corpus linguistics community

 

Traditional work on spoken language in corpus linguistics starts by deriving an orthographic transcription from recordings of large stretches of speech. This transcription is then enriched using annotation systems that aim to capture the important events in speech production that conventional spelling does not adequately represent, especially when speech is spontaneous or when two or more speakers interact. Grammatical information such as parts of speech (tagging) and syntactic structure (parsing) can also be added to support descriptive linguistic work.
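
The layering described above (orthographic transcription first, then event annotation, then grammatical tagging) can be pictured as a small data structure. The following is only an illustrative sketch: the class names, the event label, the toy lexicon and the tag labels are all hypothetical and do not correspond to any particular annotation scheme or corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    """One orthographically transcribed lexical unit."""
    word: str
    pos: Optional[str] = None  # part-of-speech tag, filled in by a later pass


@dataclass
class Utterance:
    """A stretch of speech by one speaker, with layered annotation."""
    speaker: str
    tokens: List[Token] = field(default_factory=list)
    events: List[str] = field(default_factory=list)  # e.g. pauses, overlaps


def transcribe(speaker: str, text: str) -> Utterance:
    """Layer 1: orthographic transcription, split into lexical units."""
    return Utterance(speaker, [Token(w) for w in text.split()])


def tag(utt: Utterance, lexicon: Dict[str, str]) -> None:
    """Layer 3: attach tags from a toy lookup table (hypothetical labels)."""
    for tok in utt.tokens:
        tok.pos = lexicon.get(tok.word.lower(), "UNK")


utt = transcribe("A", "well I think so")
# Layer 2: an event that conventional spelling does not capture.
utt.events.append("pause after 'well'")
tag(utt, {"well": "INTJ", "i": "PRON", "think": "VERB", "so": "ADV"})
print([(t.word, t.pos) for t in utt.tokens])
```

Each layer is added on top of the orthographic one without altering it, which mirrors the enrichment process outlined in the paragraph above.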

The main aim is to acquire large amounts of data reflecting the natural use of language. Emphasis is therefore usually placed on the naturalness and spontaneity of the recording, avoiding experimentally controlled situations in which the speaker is constrained to utter a number of previously prepared short sequences. For the same reason, words are transcribed as lexical units, and the phonetic details of their realization are not usually taken into account. In certain studies prosodic information is added in symbolic form, but the systematic use of a phonetic transcription system such as the IPA (International Phonetic Alphabet) is not common in this kind of study. This also implies that the recorded speech signal is accessed only during the transcription phase and that subsequent work takes place at the level of the symbolic representation.

Corpora collected for the purposes described above and containing orthographic or phonetic/phonemic transcriptions are sometimes called spoken corpora (Sinclair, 1994, 1996). Sinclair provides a useful definition summarizing their main features:

A spoken language corpus is a corpus consisting of recordings of speech which are accessible in computer readable form, and which are transcribed orthographically, or into a recognised phonetic or phonemic notation (Sinclair, 1996:28)

