Topic is one of the most controversial areas of text typology, and no existing external classification seems to be satisfactory. Topic can be analysed in several different ways, and the practice so far in many corpus designs has been to erect a makeshift and broad-mesh framework, within which the texts are disposed into undefined or inadequately defined categories. That is to say, an extensive categorisation is imposed, representing topic as capable of being dealt with by essentially external criteria. However, we believe that topic should be conceptualised principally as an internal matter, to do with things like the vocabulary choices in a text, rather than an external matter, where the universe is endlessly chopped up into subcategories. Because of the importance of this point, we anticipate our position here and illustrate the line of argument that is developing in the handling of the area. We make the following claim:
In the classification of topic, the internal evidence is primary.
That is to say, it will lead to a better classification of topic if the internal evidence, such as the vocabulary clustering, is developed first of all, and the external evidence is added at a stage of greater detail.
This matter is taken up in detail later in this report. For the present, we put forward some notes towards substantiating the above claim. Let us first of all consider the topic variation in documents and conversations. A newspaper covers a wide range of topics, with some stability from issue to issue, but also a great deal of variation. Unless they are hand-annotated (and this report envisages corpora that are much too large for hand-annotation), the electronic versions of most newspaper texts are not explicit about topic.
This position excludes cases where newspapers coming into a corpus, say on a compact disk, are pre-analysed by experts and divided into their constituent stories, reports, news items etc., to each of which is assigned a topic or a small list of key words. Such texts are hybrids between printed and electronic material. Newspapers in their printed form are usually divided into sections, e.g. the sports section. This is a kind of self-classification, for which the term used in this report is reflexive. If we make use of self-classification, it might be helpful, for topic consistency, to see the structure of a newspaper in two dimensions -- the issue and the contents -- as in table 1. The vertical dimension of table 1 (one section followed from issue to issue) is likely to show more consistency of topic than the horizontal (the different sections of a single issue).
|         | news | sports | women | science |
| issue 1 |      |        |       |         |
| issue 2 |      |        |       |         |
| issue 3 |      |        |       |         |
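The expectation about the two dimensions of table 1 can be tried out on a very small scale. What follows is a minimal sketch in Python, not a procedure taken from this report: the texts are invented, only three of the sections are used for brevity, and the helper functions `word_counts`, `cosine` and `mean_similarity` are hypothetical names. Cosine similarity over raw word counts is assumed here as one crude stand-in for internal vocabulary evidence.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy corpus laid out like table 1: issues[issue][section] -> text.
# All texts are invented for illustration only.
issues = {
    "issue 1": {
        "news": "minister defends budget before parliament vote",
        "sports": "team wins cup final after late goal",
        "science": "study finds new planet orbiting distant star",
    },
    "issue 2": {
        "news": "parliament rejects budget as minister resigns",
        "sports": "late goal sends team into cup final",
        "science": "telescope study maps planet near distant star",
    },
    "issue 3": {
        "news": "new minister presents revised budget to parliament",
        "sports": "team loses cup final despite early goal",
        "science": "new study doubts planet orbiting the star",
    },
}


def word_counts(text):
    """Bag-of-words vector: lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_similarity(groups):
    """Average cosine similarity over all pairs of texts within each group."""
    scores = [
        cosine(word_counts(x), word_counts(y))
        for group in groups
        for x, y in combinations(group, 2)
    ]
    return sum(scores) / len(scores)


sections = ["news", "sports", "science"]

# Vertical: one section followed down its column, from issue to issue.
columns = [[issue[s] for issue in issues.values()] for s in sections]
# Horizontal: the different sections along the row of a single issue.
rows = [[issue[s] for s in sections] for issue in issues.values()]

vertical = mean_similarity(columns)
horizontal = mean_similarity(rows)

print(f"vertical   (same section, different issues): {vertical:.2f}")
print(f"horizontal (same issue, different sections): {horizontal:.2f}")
```

On this invented data the vertical figure is the higher of the two, since each section keeps returning to a small shared vocabulary. Real newspaper text would need proper tokenisation and far more material before any such comparison could be relied on.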
Similarly, a novel covers a variety of topics, but without even the segmentation of a newspaper. It comes in chapters, usually substantial, and there is no requirement on the author to have consistency of topic within a chapter. Popular novels (spy stories and the like) frequently divide each chapter into a number of short episodes, each on a different topic.
An impromptu conversation is seamless, and moves from topic to topic normally in a series of steps that specifically obscure the changing of topic. The procedure is very subtle (Hazadiah, 1993), and there is no immediate prospect of automating the analysis of discourse in such a way that it would be capable of topic segmentation.
We have, then, identified three major sources of language corpus data for which external topic analysis is likely to be problematic and unhelpful. In the case of some classes of document the relationship to topic may be simpler (specialist magazines, formal reports, etc.), but even when what is written seems to be confined to a restricted topic area, it would be foolish to be optimistic.
Hence, although general classification systems have been worked out (Dewey Decimal Classification, etc. -- see Internal criteria), they are not reflected simply in text structure. No doubt very short texts, and very short stretches of text -- fragments -- will be so classifiable, but not the majority of documents or conversations. The reasons are obvious and need only be mentioned here. One is that many communication events have as their origin the need to mix topics -- to talk about the influence of X on Y, etc. One answer to this might be multiple classification, but the likely result of taking that step would be a network of criss-crossing topics that would be useless for practical purposes.
Another reason is that many communicative events have a structure that is quite insensitive to topic -- as for example a daily newspaper or news broadcast. The topics are whatever is felt to be newsworthy; their sequencing reflects their perceived importance, with occasional moves towards coherent grouping, and some deliberate contrasts.
Yet another reason for the unpredictable nature of topic is the social need for maintaining interest and attention by refocusing the topic frequently in conversation. This feature, combined with the social requirement to obscure changes in topic, makes any simple idea of topic unlikely to be usable in classification, whether external or internal. As we shall see below, the likely internal topic-related patterns are not just a list of one-word subject labels.
The Subject typology given in the appendix is culled from as many published sources as we could find, following the NERC study, and it shows the unsuitability of trying to build on received practice (because it is so inconsistent) or of trying to arrange a hierarchy of simple topic labels. Consider the memoirs of a retired medical missionary who had an important collection of military paintings, particularly canvases showing the details of early ordnance and regimental uniform; who delighted in the languages and buildings that he had met in his travels; and who paid close attention to the level of scientific sophistication in the agriculture of the regions, and to the problems of distance from major centres. Such a document, which is by no means fantastical, would be listed under more than a dozen of the topics given in the appendix.
The discussion of topic is returned to later in this report (see Internal criteria), in particular the question of how a means of analysis might be developed that treats topic as an internal property of texts. For the primary purpose of this paper -- to offer a workable classification scheme for today -- topic has to be replaced by several lists of external criteria.
The sectionalisation of newspapers has already been identified as reflecting some aspects of topic; to this we can add the stated subject-matter of magazines and periodicals, which in the case of the more specialised ones is helpful. Also we can use the fact that a society institutionalises a number of topic-related classifications; of particular value to text typology are lists of recognised professions and educational courses. These are matters of social policy and expediency at a particular stage of development, and have quite a different status from general attempts to classify the cosmos; they will vary from culture to culture and language to language, but they constitute a set of parameters of classification that has scientific validity and is workable in average corpus projects.
The self-classification of reports, textbooks, etc., has already been mentioned as a valuable feature of the reflexivity of language. A typology based on such criteria will be untidy, but they contribute to a defensible assignment of topic that will suffice until it can be replaced with something better.