As defined in 2.5.1, the orthographic representation of the text corresponds to a
representation of the speaker's utterances using the standard spelling of a given language (i.e., a transliteration).
This level of representation is thus common to spoken and written corpora, and consequently conventions for
orthographic representation have been developed both in corpus linguistics and
in speech research.
Three representative proposals will be reviewed here and will form the basis
of a set of recommendations: the NERC conventions, the SpeechDat guidelines and the EAGLES
Spoken Language Working Group recommendations.
Within the tradition of corpus linguistics, the NERC initiative has adopted the conventions for orthographic transcription
proposed by French (1992:3ff). They are mainly intended for the transcription of the spoken
materials present in the type of reference corpora considered within the project.
These recommendations can be summarized for English as follows:
- The words spoken are represented in accordance with standard
orthographic conventions;
- The only contractions used are those accepted as standard in the Oxford
English Dictionary;
- Sentence boundaries are marked by a full stop and capital letter;
- Commas are not used within sentences;
- Direct quoted speech or quotations from written texts are placed in
single quotation marks;
- Apostrophes are used in accordance with standard conventions in
possessives and in contractions.
These conventions can be compared with the ones developed by Boves & den Os (1995) and
adopted for the transcription of the SpeechDat spoken corpora in different languages (more
information on SpeechDat can be found at URL http://www.icp.grenet.fr/SpeechDat/home.html).
They are based on the ones used by the LDC/ARPA (Linguistic Data Consortium/Advanced Research
Projects Agency) for the production of the ATIS (Air Travel Information System) corpus, and are
specifically conceived for the transcription of a corpus aimed at training and assessing speech
recognition systems over the telephone. Other proposals also oriented towards the transcription
of speech corpora for phonetic research and speech technology have been developed, for example,
within the German VERBMOBIL project (Kohler et al., 1994; Hess et al., 1995; more information on
the project can be found at URL http://www.dfki.uni-sb.de/verbmobil/overview-us.html
and at URL http://www.ims.uni-stuttgart.de/projekte/verbmobil/index-en.html),
for the transcription of the HCRC Map Task corpus (Anderson et al., 1991; more information
at URL http://www.cogsci.ed.ac.uk/elsnet/Resources/Map-Task/mt_corpus.html)
or for the transcription of spontaneous spoken dialogues (Fink et al., 1995).
The most relevant SpeechDat conventions for the purpose of the present recommendations
are summarised below (Boves & den Os, 1995):
- Normal lexical items will be represented by their spellings in the normal way.
- It is recommended to choose a standard dictionary for each language and to use the
spelling forms which appear there. It is also recommended to maintain a lexicon of the spelling
forms used in the transcription. This lexicon also contains the forms chosen as the standard
for words or expressions which can be spelt in more than one way.
- It is possible to include a very restricted number of markings for regular
variations in pronunciation, provided that they are documented and no more than two or three
regular variations are indicated.
- Abbreviations should be represented by their full orthographic forms, unless
they are spoken in their abbreviated form.
- Exceptions are abbreviations which do not have non-abbreviated forms.
- Number sequences (flight numbers, times, dates, aircraft types, money
amounts, etc.) will be spelled out to reflect what was said.
- If digits have alternate pronunciation forms the transcription should
accurately reflect the form actually pronounced.
- If a speaker pronounces letters, acronyms or abbreviations as a word, for
example "British Rail" for BR, then these should be spelled out as words.
- No punctuation will be provided in the transcription other than those
symbols used for special transcription purposes.
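Some of these conventions lend themselves to a simple automatic check. The following is a minimal sketch in Python, not part of the SpeechDat guidelines: it flags digits and punctuation in a transcribed line. The function name `check_line` and the square-bracket marker syntax are assumptions made for illustration only; the actual special transcription symbols are defined by each project.

```python
import re

# Assumption: special transcription symbols are written in square brackets,
# e.g. "[noise]". This syntax is hypothetical and must be adapted per project.
MARKER = re.compile(r"\[[^\]]*\]")
DIGIT = re.compile(r"\d")
# Ordinary punctuation marks; apostrophes and hyphens are left alone
# because they can belong to the standard spelling of a word.
PUNCT = re.compile(r'[.,;:!?"()]')


def check_line(text):
    """Return a list of convention problems found in one transcribed line."""
    stripped = MARKER.sub(" ", text)  # markers are exempt from the checks
    problems = []
    if DIGIT.search(stripped):
        problems.append("digits should be spelled out as pronounced")
    if PUNCT.search(stripped):
        problems.append("punctuation is not allowed outside markers")
    return problems
```

A line such as "flight two oh seven to boston" yields no problems, whereas "flight 207 to Boston." is flagged twice. Such a check can only detect violations; deciding how a number was actually pronounced remains the task of the transcriber.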
Recommendations for the orthographic representation are also provided
in the chapter devoted to corpus representation of the EAGLES Handbook on Spoken Language
Systems (EAGLES Spoken Language Working Group, 1995). The following conventions are discussed:
- Reduced word forms
  - It is recommended to use reduced word forms as they appear in a standard dictionary.
  - If necessary, other reduced forms not existing in the dictionary can be used.
  - The use of reduced forms is recommended if they occur frequently and if they involve
    syllable deletion.
- Dialect forms
  - Dialect forms have to be marked in the transcription.
- Numbers
  - Numbers are transliterated as words.
- Abbreviations and spelled words
  - Full forms of abbreviations are used in orthographic transcriptions.
  - Abbreviations spoken as words are also transliterated as words.
  - Spelling has to be indicated in transcriptions.
- Interjections
  - They should be indicated according to the standard spelling found in the dictionary.
The general philosophy behind the proposals put forward by the Spoken Language Working Group
is that standard spelling should be used as much as possible and that all non-standard forms used
in the transcription should be clearly documented. It is also proposed to generate a list of words
and word forms, so that
the graphemic forms of the words can be converted to phonemes by means of computerised
grapheme-to-phoneme conversion. The result of this is a list of citation forms, also called
canonical forms. These forms indicate the pronunciation of words when spoken in isolation.
(EAGLES Spoken Language Working Group, 1995)
The consistent use of standard spelling forms thus makes it possible to
link levels S1 and S2 previously described in 2.5.1.
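The link between the word list and the citation forms can be sketched as follows. This is an illustrative Python example only: a plain dictionary lookup stands in for a real grapheme-to-phoneme converter, and the names `build_word_list`, `citation_forms` and `TOY_LEXICON`, as well as the SAMPA-like entries, are assumptions rather than part of the EAGLES proposal.

```python
def build_word_list(transcripts):
    """Collect the distinct orthographic word forms, sorted alphabetically."""
    words = set()
    for line in transcripts:
        words.update(line.lower().split())
    return sorted(words)


# Toy pronunciation lexicon in a SAMPA-like notation.
# The entries are illustrative only and not taken from any real resource.
TOY_LEXICON = {"the": "D @", "cat": "k { t", "sat": "s { t"}


def citation_forms(word_list, lexicon):
    """Look up each word form; return (forms found, word forms still missing)."""
    found = {w: lexicon[w] for w in word_list if w in lexicon}
    missing = [w for w in word_list if w not in lexicon]
    return found, missing


words = build_word_list(["the cat sat", "The dog sat"])
found, missing = citation_forms(words, TOY_LEXICON)
```

The list of missing word forms shows which entries still need a pronunciation. The lookup only works if each word is always spelled in the same way, which is why consistent standard spelling matters.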
Taking into account these three proposals, a set of general recommendations for the orthographic
transcription of spoken materials - either read or spontaneous - can be proposed:
- Use conventional spelling forms as they appear in a standard dictionary. This also applies to contractions,
  reduced word forms, apostrophes, dialect forms, interjections and vocalised semi-lexical events (see 2.3.1).
  - This implies selecting a standard dictionary for each language; in some languages there
    are dictionaries produced by the relevant normative body (for example the Diccionario
    de la Lengua Española from the Real Academia Española), while in others there are
    dictionaries which are traditionally considered as reference works, such as the Oxford Dictionary
    for English or the Robert for French.
- If more than one orthographic form is possible or if non-standard spellings or spelling variations are necessary,
  maintain a lexicon of the spelling forms used in the transcription.
  - The purpose behind this recommendation is to help transcribers to maintain consistency and
    to provide accurate documentation. Moreover, if a full list of the spelled forms is
    created, it is possible to automatically generate the phonemic citation forms of level S2 (see 2.5.1).
    The creation of a list of the spelling forms used in the transcription in the case of variations
    in word form, spelling variants and semi-lexical phenomena is also part of the TEI recommendations
    for transcription practices.
- Represent numbers, abbreviations, acronyms and spelled words in full orthographic form as pronounced by the speaker.
  - The aim of this recommendation is to accurately reflect in the transcription the actual utterances
    of the speaker. Numbers are always transliterated as words, as are abbreviations and acronyms; however,
    if one of these latter forms is spelled out by the speaker, it should be transcribed as such.
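The lexicon of spelling forms mentioned in the second recommendation can be kept as a simple mapping from each accepted variant to the form chosen as standard. The following Python sketch illustrates the idea; the names `SPELLING_LEXICON`, `normalise` and `variants_used` and the example entries are hypothetical and do not come from any of the proposals reviewed here.

```python
# Hypothetical project lexicon: each variant maps to the chosen standard form.
SPELLING_LEXICON = {
    "okay": "okay",
    "ok": "okay",
    "e-mail": "e-mail",
    "email": "e-mail",
}


def normalise(line, lexicon):
    """Replace every known variant in a line by its standard spelling form."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in line.split())


def variants_used(lines, lexicon):
    """List the non-standard variants that occur, for documentation purposes."""
    used = set()
    for line in lines:
        for tok in line.lower().split():
            if tok in lexicon and lexicon[tok] != tok:
                used.add(tok)
    return sorted(used)


lines = ["ok send me an email", "okay I will"]
normalised = [normalise(line, SPELLING_LEXICON) for line in lines]
report = variants_used(lines, SPELLING_LEXICON)
```

Here the first line becomes "okay send me an e-mail", and the report lists "email" and "ok" as variants found in the transcripts. Words that are not in the lexicon are left unchanged, so the mapping only needs to cover forms with more than one possible spelling.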
These recommendations are of a very general nature and constitute basic principles to be applied to
the transcription of spoken materials. One aspect which would need a more in-depth discussion is
punctuation. The NERC proposal suggests marking sentence boundaries with a full stop
and a capital letter and avoids using commas within sentences, while the SpeechDat recommendations
suggest using no punctuation at all. One should be aware that in spontaneous speech the delimitation
of units such as sentences is not a trivial matter, since a combination of syntactic, semantic,
pragmatic and prosodic criteria is required (see, for example, Schuetze-Coburn, 1991), and for this
reason introducing punctuation in an orthographic transcription can sometimes be a difficult and
controversial activity.