Next: Definitions Up: Corpus Typology Previous: Author

Introduction

Electronically-held corpora are new things, and little has yet emerged of a consensus as to what counts as a corpus and how corpora should be classified. The most important areas are:

the minimum conditions for any collection of language to be considered as a corpus;
the separation of corpora which record a language in ordinary use from corpora which record more specialised kinds of language behavior.

Both of these are contentious areas. If the profession finds that the criteria recommended here are adequate for current needs, then considerable progress will have been made, for there are many collections of language called corpora which do not meet these conditions, and there are some corpora available which record special and artificial language behavior, but do not point this out to the undoctrinated.

Furthermore, the discipline of corpus linguistics is developing rapidly and norms and assumptions are revised at frequent intervals. Categories have to be particularly flexible to meet such unstable conditions.

Hence the classifications in this paper go as far as is prudent at the present time. They offer a sound and resonably replicable way of classifying corpora, with clearly delimited categories wherever possible, and informed suggestions elsewhere. The paper has been reviewed by many experts in the field, who are in broad agreement that to present a more rigorous classification would be intellectually unsound and would be ignored by the majority of workers in the field. The present paper has a chance of acceptance because it raises the relevant issues and offers usable classifications.

For nearly twenty of those thirty years, the original targets of the Brown corpus=1, ; (KuceraFrancis1967) were taken to be the standard:

(a): one million words
(b): divided roughly evenly into genres
(c): 500 samples
(d): 2000 words in each
(e): written published sources

This is still a much used reference point, although the circumstances that led the Brown designers to make those choices are quite unlike those of today.

It is more helpful to extrapolate from the original design the principles that lay behind the specific decisions:

(a): The corpus should be as large as could possibly be envisaged with the technology of the time. Brown's one million words was just that and its appearance was like a miracle -- so many words at one's command. However, by the mid-seventies, the targets had gone up by an order of magnitude, the Birmingham Collection of English Text, for example, ending up with 20 million words in 1985. Now, in the mid-nineties, they are up another order of magnitude, with the Bank of English, for example (see below), being close to 200 million words in length.
(b): It should include samples from a broad range of material in order to attain some sort of representativeness.
(c): There should be an intermediate classification into genres between the corpus in total and the individual samples.
(d): The samples should be of an even size.
(e): The corpus as a whole should have a declared provenance.

Point (d) above -- that samples should be of an even size -- is controversial and will not be adopted in these proposals -- see below for further consideration. The restriction of the Brown corpus to written material -- still frequently copied in later work -- is regarded as unfortunate for a model although understandable in its historical context. Indeed, the first alternative models to the Brown were European collections of transcribed speech, such as the Edinburgh-Birmingham corpus of the early sixties=1, ; (JonesSinclair1974) For the importance of spoken corpora and their special contribution to corpus work, see below. It is noticeable that there is still considerable reluctance among corpus designers to include spoken material; at the planning stage of the Network of European Reference Corpora (NERC) in 1990 it was almost abandoned and there are signs now in the EU that spoken and written corpora may be developed separately and that the confusion between corpora for speech and corpora for language has returned.

The early corpus designers worked with slow computers (from our viewpoint) that were oriented towards numeric processing, and whose software had great difficulty with characters. Texts were assembled as large trays of cards, and retrieval programs were done on an overnight batch basis. All material had to be laboriously keyboarded on crude input devices.

In the last decade, there has been an unprecedented revolution in the availability of text in machine readable form, the emergence of a new written form -- e-mail -- which only exists in that form, and the invention of scanners to aid the input of certain types of text material. The processing speed of machines and the amount of storage has risen dramatically and costs have fallen as dramatically, so that modest PC users can have access to substantial corpora, while major users manipulate hundreds of millions of words on line. The balance of problems has switched from bottlenecks in acquiring corpus material to handling floods of it from a variety of unco-ordinated sources. In anticipation of this change, the notion of monitor corpora is under development to reconceptualise a corpus as a flow of data rather than an unchanging archive (see below section 8).

Next: Definitions Up: Corpus Typology Previous: Author