The categorical levels of transcription presented so far, as used by the speech and the corpus linguistic communities, clearly differ, except that both systems include an orthographic representation of the spoken text. Labelling of speech is a process that starts with the low-level units and ends at the highest ones, while the transcription of spoken corpora proceeds in the opposite direction. The suggestion of the Spoken Texts subgroup is that the two systems can be related at the lexical level. Since French's Level II recommended by NERC contains words transcribed in orthographic form and the proposed level S2 (see below) in speech transcription consists of a phonemic representation of words in their citation form, it should not be too difficult to relate the two types of representation. The level at which words are phonemically transcribed in their citation form can become, as Barry & Fourcin (1992:8) point out, the `mediator' between the signal and the lexicon. The role of lexical databases in the automatic transcription of speech corpora has been explored, among others, by the research group at IRIT in Toulouse (see, for example, de Ginestel et al., 1993), and more information on spoken lexica can be found in the corresponding chapter of the EAGLES Handbook on Spoken Language Systems (EAGLES Spoken Language Working Group, 1995).
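The proposed mediation between the orthographic and phonemic word levels can be sketched as a simple lexicon lookup. This is only an illustrative sketch, not any system described in the sources: the SAMPA-style citation forms and the lexicon contents below are invented examples.

```python
# Hypothetical sketch: relating an orthographic word-level transcription
# (NERC Level II) to a citation-form phonemic transcription (level S2)
# through a pronunciation lexicon. The lexicon entries are invented
# examples in SAMPA-style notation, not data from any real lexical database.

LEXICON = {
    "the": "D@",
    "cat": "k{t",
    "sat": "s{t",
}

def to_citation_phonemes(orthographic_words):
    """Map each orthographic word to its citation-form phonemic string.

    Words missing from the lexicon are flagged rather than guessed,
    since citation forms must come from the lexical database itself.
    """
    return [LEXICON.get(w.lower(), f"<unknown:{w}>") for w in orthographic_words]

print(to_citation_phonemes(["The", "cat", "sat"]))
# prints: ['D@', 'k{t', 's{t']
```

In such a scheme the lexicon, not any rule component, is the sole authority on citation forms, which is what allows the phonemic level to act as the `mediator' between signal-level labels and the orthographic text.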
For the purpose of a symbolic transcription of spoken language, three levels of representation and labelling have been identified within the EAGLES Spoken Texts subgroup:
Moreover, it should not be forgotten that it is a fundamental recommendation of NERC (Sinclair, 1993:70), also adopted in this document, that a digitized version of every sample of recorded speech be included as a component of a corpus.