
Quantity

The default value of Quantity is large. A corpus is assumed to contain a large number of words. The whole point of assembling a corpus is to gather data in quantity. The size of corpora continues to increase rapidly, and it would not be sensible to recommend any set of figures. Furthermore, the advent of monitor corpora (see below) changes the basis of size calculation from a total amount to a rate of flow. The size of a corpus is simply the sum of the sizes of its components. Questions of size are best dealt with by reference to a component.
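
The two notions of size in this paragraph, a total that is the sum of the components and a rate of flow for a monitor corpus, can be illustrated with a minimal sketch. The component names and word counts are invented for illustration and are not recommended figures; `Component`, `corpus_size` and `flow_rate` are hypothetical helpers, not part of any corpus toolkit.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One component of a corpus (names and counts are illustrative only)."""
    name: str
    words: int  # running words in this component


def corpus_size(components):
    """Size of a corpus as the sum of the sizes of its components."""
    return sum(c.words for c in components)


def flow_rate(words_added, days):
    """Size of a monitor corpus expressed as a rate: words per day."""
    if days <= 0:
        raise ValueError("days must be positive")
    return words_added / days


# Hypothetical components; the figures are invented, not recommendations.
components = [
    Component("newspapers", 5_000_000),
    Component("local radio", 400_000),
    Component("private correspondence", 20_000),
]

total = corpus_size(components)                  # total number of words
rate = flow_rate(words_added=700_000, days=7)    # words per day
```

Keeping the count on each component rather than only on the corpus reflects the point that questions of size are best asked component by component.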

In practice, the size of a component tends to reflect the ease or difficulty of acquiring the material. In turn, this factor may be loosely related to the availability of the material to the public and therefore to its relative importance as influential language, as against material which is difficult to get, perhaps because it is of small circulation. Such a relationship, however, does not hold with reference to the spoken language, where the most influential and pervasive material is informal and impromptu conversation, which is not normally recorded.

A more practical correlation to pursue is that between the size of a component and the number of people who are exposed to it. Since millions of people read the newspapers and listen to the radio, it should be easy to acquire large quantities of this type of data, and this will be assumed. Local radio reaches fewer, but still hundreds of thousands, and so do magazines and best-selling paperback books. Speeches to large rallies reach thousands of people, as do leaflets and notices.

The figures are merely typical. Some leaflets are printed in millions by governments and advertisers, and some magazines and radio programmes have very modest circulations and audiences. But note that if a speech at a rally is repeated on television, or a magazine article is reprinted in the newspapers, its intended audience is still the original one; what happens to it afterwards does not affect its linguistic constitution.

It could be argued, however, that its very linguistic constitution is the factor that caused it to be transferred to another medium; or that the speaker/writer was attempting to achieve the wider publicity and contrived his/her language accordingly, and the text is thus more appropriately classified at its final destination. In practice, it frequently happens that a text is transferred from one medium to another and such a text may well be in a corpus twice.

Lower circulation material, involving hundreds of people, is to be found in workplace and institutional documentation, and in the spoken medium in lectures, talks and some sermons. When the audience can be numbered in tens we are down to documents with a circulation list and seminars. Individuals can be identified and in some cases the role of speaker/writer can change. At these levels, written material is not easy to identify, but there is of course private correspondence, with a readership of one or two only. The spoken medium shows a lot of variety here, with all kinds of discussions, interviews, meetings and conversations involving very small numbers of participants. E-mail tends to be used in fairly small groups, but more and more circular material is appearing.
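
The audience bands described above (millions, hundreds of thousands, thousands, hundreds, tens, and a handful of individuals) can be sketched as a simple lookup. The numeric thresholds are assumptions chosen for illustration, since the text gives only orders of magnitude, and `audience_band` is a hypothetical helper.

```python
# Lower bound of each band, largest first. The thresholds are assumptions;
# the text names only rough orders of magnitude.
BANDS = [
    (1_000_000, "millions"),              # newspapers, radio
    (100_000, "hundreds of thousands"),   # local radio, magazines, paperbacks
    (1_000, "thousands"),                 # rally speeches, leaflets, notices
    (100, "hundreds"),                    # workplace documents, lectures
    (10, "tens"),                         # circulation lists, seminars
    (1, "fewer than ten"),                # private letters, conversations
]


def audience_band(audience_size):
    """Return the band label for an approximate audience size."""
    if audience_size < 1:
        raise ValueError("audience size must be at least 1")
    for lower_bound, label in BANDS:
        if audience_size >= lower_bound:
            return label


# Example: classify a few texts by their intended (original) audience.
examples = {
    "national newspaper": audience_band(2_500_000),
    "departmental memo": audience_band(250),
    "private letter": audience_band(2),
}
```

The band is assigned from the intended audience, following the argument that later republication does not change a text's linguistic constitution.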

Two points emerge from this discussion. One is that forms of the spoken language are composed, given to mass audiences, and made available in electronic form. This contrasts with the view, often expressed in corpus circles, that spoken language is difficult and expensive to obtain. This point will be elaborated later.

The other point is that a classification of texts by approximate audience size is worth further consideration as a way of quantifying the size default. It is open to various criticisms, particularly that it merely makes a virtue of necessity. Other, more suitable but equally realistic measures are solicited, so that there may be general guidance on the very first question asked by a novice in corpus linguistics: how big a corpus do I need?


