In section 2.3 transcription and representation practices for spoken texts are
reviewed, paying special attention to the NERC and TEI proposals. A survey of
events represented and encoded in spoken texts (2.3.1) shows that a substantial number
of phenomena can be of interest to different types of research. However, it seems necessary to
consider a minimal set of events to be encoded according to the TEI-compliant Corpus Encoding Standard (CES)
proposed for EAGLES (Ide, 1996). The present document is concerned only with the events
themselves; the encoding of the International Phonetic Alphabet, of the transcription, and of the
linguistic annotation of speech will be presented as part of CES. Proposals for the encoding
of spoken texts within the TEI initiative can also be found in Johansson (1995a, b).
As a starting point, it should be noted that there are important differences between
the transcription of read text - when the original written source is available - and the
transcription of spontaneous speech.
These differences are reviewed in detail in the EAGLES Handbook on
Spoken Language Systems (EAGLES Spoken Language Working Group, 1995) and can be summarized in the
following points:
- The planning process of spontaneous speech is reflected in several types of disfluencies which
do not normally occur in read speech, increasing the difficulty of the transcription process and
the complexity of the representation. Most of the events usually transcribed in connection
with these disfluencies are presented in section 2.3.
- The criteria for defining utterances are not clear-cut in spontaneous speech, either in monologues
or in conversations.
- In the case of dialogues, interruptions and overlapping speech add further complexity to the
representation.
Similar problems in the transcription of speech are mentioned by Johansson (1995b), who adds
a further dimension: since speech is generally addressed to a limited audience in a private
setting, adequate knowledge of the context and the situation is needed for a correct
understanding.
Despite the difficulties involved in the transcription of unprepared speech, it should be
possible to define a minimal common set of events to be encoded in the transcription of different
types of spoken texts.
In section 2.4 the structural elements considered in the TEI Guidelines have been defined;
they are listed again here for the reader's convenience:
- Utterance
- Pause
- Vocal
- Kinesic
- Event (non-vocalised, non-communicative)
- Writing
- Shift
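By way of illustration only, the fragment below sketches how some of these elements might appear in a TEI-style transcription. It is not taken from the TEI Guidelines themselves; the attribute names and values follow the conventions of Sperberg-McQueen and Burnard (Eds.) (1994) as far as possible but should be treated as indicative (the <vocal> and <event> elements are illustrated later in this section):

  <!-- an utterance containing a timed pause -->
  <u who=A>come in <pause dur="2 secs"> sit down</u>
  <!-- a communicative but non-vocal gesture -->
  <kinesic who=B desc="nods head">
  <!-- written material displayed during the interaction -->
  <writing who=A type="slide">Figures for 1995</writing>
  <!-- a marked change in loudness, signalled at its start and end -->
  <u who=B><shift feature=loud new=f>I can hear you<shift feature=loud new=normal></u>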
The EAGLES Handbook on Spoken Language Systems (EAGLES Spoken Language Working Group, 1995) considers
a set
of non-linguistic phenomena that should be annotated when transcribing a speech corpus:
- Omissions in read text
- Verbal deletions and corrections
- Word fragments
- Unintelligible words
- Hesitations and filled pauses
- Non-speech acoustic events
  - Produced by the speaker
  - Produced by other speakers or environmental noises
- Simultaneous speech
- Speaking turns
A comparison of these recommendations shows that there are elements common to both
proposals, which could therefore form part of the minimal set of elements to be encoded.
These elements are the following:
- Vocal semi-lexical events
- Included in this category are filled or voiced pauses and hesitations. As will be proposed
in the next section, it is convenient to keep a list of standardized spellings for these phenomena,
using, when possible, the conventional orthographic forms which appear in reference dictionaries for a given
language.
- Vocal non-lexical events
- This category includes burps, clicks, smacks, coughs, giggles, laughs, sneezes, sobs, yawns,
heavy breathing and all the non-speech acoustic events produced by the speaker.
The inventory of such events is open-ended, and a description of the event is therefore used in the annotation.
- Non-vocalised non-communicative events
- This includes all the extraneous noises produced by other speakers or those which result
from the recording environment, such as doors slamming, telephones ringing, etc. The annotation is, as
in the previous category, a written description of the event.
Note that the first two categories correspond to those subsumed under the tag <vocal> in the
TEI, while the third corresponds to <event>.
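As a purely indicative sketch, the three categories could be marked along the following lines in a TEI-style transcription; the desc values are free-text descriptions, and the exact attribute inventory depends on the version of the Guidelines being followed:

  <!-- vocal semi-lexical event: a filled pause or hesitation -->
  <u who=A>well <vocal desc="filled pause (ehm)"> I suppose so</u>
  <!-- vocal non-lexical event produced by the speaker -->
  <vocal who=A desc="laugh">
  <!-- non-vocalised, non-communicative event from the recording environment -->
  <event desc="telephone rings">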
The transcription of spoken interactions where more than one speaker is involved also requires
the consideration of the following elements:
- Speaker identity
- In the TEI encoding this information is
indicated in the header within the `profile description'
<profileDesc> element, which has a `participant' <partics> sub-element containing a
series of `person' <person> elements. Among the attributes of <person> there is
one - named `id' - coding the identity of the speaker. Within the text, each utterance can have
a `who' attribute whose value corresponds to the
identity of the speaker coded in the `id' attribute (Sperberg-McQueen and Burnard (Eds.), 1994;
Johansson, 1995a). Other simplified forms of encoding can be found, but, in any case, this is
a necessary element in the transcription of spoken interactions.
- Speaking turns, indicating a change of speaker
- Changes of speaker can be
coded in the TEI by means of changes in the value of the `who'
attribute, and appear to be the basis for the definition of utterances. Independently of the
mechanisms that can be used, this is essential information in the transcription of
conversations.
- Simultaneous speech or overlapping
- Proposals for marking this phenomenon are found in the TEI (see 2.4) as part of
the strategies for encoding simultaneous events. Although other ways of representing speech
overlapping can be found, this is again an important element in the transcription of the
type of spoken material discussed here (a sketch combining these three elements is given after this list).
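A minimal sketch combining these three elements in a TEI-style transcription is given below. The speaker codes A and B are assumed to have been declared as `id' values in the participant description of the header; the `trans' attribute, which records the nature of the transition from the preceding utterance, is shown here only as a simple device for noting overlap, the full TEI mechanism for aligning simultaneous events being the one referred to in 2.4:

  <!-- A and B identify speakers declared in the header -->
  <u who=A>so what do you think</u>
  <!-- a change in the value of `who' marks a new speaking turn -->
  <u who=B trans=smooth>I think it is a good idea</u>
  <!-- the next turn begins before the previous one has finished -->
  <u who=A trans=overlap>so do I</u>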
A third group of elements to be transcribed is related to the performance of the speaker.
The advisability of including them in transcriptions is discussed in the EAGLES Handbook
on Spoken Language Systems, where three different types of phenomena are identified:
- Omissions in read text
- Where a written script exists, it may be advisable to mark the words or segments omitted by
the reader as such in the transcription.
- Self-repairs
- In spontaneous speech, the planning process is sometimes evidenced by the presence of
self-repair phenomena used by the speaker to
correct speech production errors `on-line' (see Cutler
(Ed.), 1982 and Fromkin (Ed.), 1973, 1980 for a psycholinguistic approach to the topic). They
might be explicitly indicated by the speaker
(using, for example, forms such as `I mean')
or they might be implicit; in other cases they might involve restarts or repetitions.
In read speech, too, it is possible to find
corrections of errors detected by the reader in the course of the reading. Such phenomena
should not be omitted in a transcription.
- Word fragments
- Word fragments are one or more sounds belonging to a word which the speaker does not fully
pronounce at a first attempt, and which are repeated when the speaker succeeds in producing the
complete word. In some systems they are marked by a hyphen (e.g. `fli- flights'), while in
others a star is used (e.g. `fli* flights'). It also seems appropriate to indicate these hesitations
in the transcription (a possible rendering of these phenomena is sketched after this list).
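None of the proposals reviewed here fixes a single notation for these phenomena. The fragment below is only one possible TEI-style rendering of the last two of them (marking omissions presupposes access to the written script); the general-purpose <del> element is borrowed for the purpose, and its `type' value is purely illustrative:

  <!-- self-repair: the repaired material is retained but marked, and the editing phrase is transcribed as spoken -->
  <u who=A><del type="repair">we leave on Monday</del> I mean on Tuesday</u>
  <!-- word fragment: the incomplete first attempt is kept, using the hyphen convention mentioned above -->
  <u who=A>the fli- flights were all cancelled</u>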
Moreover, the encoding of spoken texts should include documentation of the difficulties
encountered during the transcription process.
The NERC proposals mention `guessed' and
`unintelligible' fragments, while the SpeechDat conventions include a notational device for
partially or totally unintelligible words. It also seems appropriate to provide means for
noting the transcriber's uncertainties:
- Unintelligible fragments
- Fragments, words or parts of words which are not intelligible to the transcriber
should be indicated. A distinction between `guessed' or `uncertain' and `unintelligible'
fragments can be made if necessary (a possible notation is sketched below).
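As an indicative sketch only, the TEI elements <unclear> (for a passage the transcriber can only guess at) and <gap> (for material that cannot be transcribed at all) could serve this purpose; the attribute names and values below are illustrative rather than prescriptive:

  <!-- the transcriber can guess the words but is not certain of them -->
  <u who=A>we arrived at <unclear reason="background noise">half past ten</unclear></u>
  <!-- a totally unintelligible stretch is left untranscribed and documented -->
  <u who=B>and then <gap reason="unintelligible" extent="two words"> we left</u>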
Finally, the encoding of utterances - defined as a stretch of speech usually preceded and followed by a pause
or by a change of speaker - should be considered. We have already recommended the marking of
changes of speaker, and in section 5.2.2, devoted to prosody, it is also proposed that
pauses should be part of the elements to be encoded. This implies that utterances are
necessarily encoded, since they are delimited by these elements.
An important point which has to be considered is the usability of the TEI recommendations from the
point of view of the transcriber. Sinclair (1995) and Chafe (1995) discuss this issue, which is also mentioned
by the EAGLES Spoken Language Working Group. As a general rule, a balance between the advantages offered by the
TEI, the aims of the corpus and the demands imposed on the transcriber should be sought. The
distinction put forward by Sinclair (1995:107) between conformity and compatibility
with TEI is useful in clarifying the debate. In fact, the need to develop conversion software between
a user-friendly system of transcription and the TEI encoding scheme was one of the recommendations
arising from the EAGLES Workshop on `Issues in Corpus Work' organized by the Text Corpora Working
Group in Madrid in January 1996.