Topic is the lexical aspect of internal analysis of a text. Externally, the problem of classification is that there are too many possible methods, and no agreement or stability, within societies or across them, that can be built upon. There are semantic classifications, such as Roget's famous Thesaurus (Roget, 1962); there are bibliographical ones, such as the Dewey Decimal Classification. Educational systems divide knowledge up into a myriad of confusing hierarchies, and keep changing their minds. Hierarchies and other organisations of terminology can be used as topic identifiers; terminologists organise term banks on cross-cutting principles; and organisations like ISO play a part in the attempt to stabilise terminology, and thus to provide a means of organising conceptual knowledge.
It is thus a gross oversimplification to attempt to produce a single model of topic that embraces everything that can be written or spoken about, neatly arranged in discrete boxes, such that each text can be placed in one box, with only a small percentage of overlap or doubt.
In Roget's Thesaurus (Roget, 1962), the English language is primarily divided into six global categories: `Abstract Relations', `Space', `the Material World', `Intellect', `Volition' and `Affections'. From here, there are sub-categories which further divide the world into more manageable areas; for example, the category `Affections' divides into `Affections Generally', `Personal', `Sympathetic', `Moral' and `Religious'. At this level we may already find difficulty in accepting a fundamental difference between, for example, moral and religious affections, especially if one's moral values are fully embedded in religion. The boundaries become even more vague the further we separate language into smaller branches. Under the sub-categories come further groupings of topics. Following the category `Affections' and, say, the sub-category `Personal' we find the groupings `Passive', `Discriminative', `Prospective', `Contemplative' and `Extrinsic'. These further subdivide, finally, into the 1,000 topics by which we can conveniently categorise language.
At this level of categorisation, the boundaries between the proposed topics become very hard to distinguish. The amount of cross-reference also becomes immense. Following the same branch of `Affections - Personal' further down, under `Passive' we find such topics as `Joy', `Suffering', `Pleasurableness', `Painfulness', `Content', `Discontent' and so on. On this level, however, how can we justifiably differentiate between an affection which comes under the topic `Affections - Personal - Prospective - Dislike' and one which is `Affections - Sympathetic - Social - Hatred'; or between language which is classed as `Communication of Ideas - Means of Communicating Ideas - natural - conventional - letter' and language under the topic `Communication of Ideas - Means of Communicating Ideas - written language - correspondence'? And is it even desirable, in the classification of language into topic, to do so?
The boundaries between the topics are ultimately blurred, and we would argue that the classification of topic for corpora is best done at a higher level, with a few broad categories of topic which would alter according to the language data included.
There are numerous ways of classifying texts according to topic. Each corpus project has its own policies and criteria for classification, which is indeed the underlying impetus for this paper: to offer common guidelines for the classification of texts. The fact that there are so many different approaches to the classification of text through topic, and that different classificatory topics are identified by different groups, indicates that existing classifications are not reliable. They do not come from the language, and they do not come from a generally agreed analysis. However they are arrived at, they are subjective, and the subjective categorisation of language is bound to lead to diversities in the categories established, since the resulting typology is only one view of language, among many with equal claims to be the basis of a typology.
The NERC report offers a summary of the classification systems used by major corpus projects in Europe. Updated tables are attached as appendices. There are 35 categories in the report's subject typology. Extensive though this may seem, this is only a consolidated summary of the common categories of the corpus projects. There are many different `topics' by which texts are classified in different corpora. Some classificatory systems go into more detail than others. In the Danish Corpus we find an extensive list of topics, from `transportation' to `music', `business' to `environment' (see Norling-Christensen, 1995, Annex 1). There is even a separate topic for the `EU'. In this kind of classification system it will be difficult to classify a text under just one of the 66 topic areas offered; most texts will fall into more than one topic. For example, `law', `crime' and `society' are all listed as separate topics, even though they are inextricably linked.
As is explained in the NERC Report:
Generally applied rather than corpus-specific texts have been distinguished. (Calzolari et al., 1995)

Even though this table shows the `generally-applied' categories of topic, there is great variation between classification techniques. Looking at the list of subject categories proposed, we find `Leisure' as a separate category from, say, `Sport'; the category `Science' and then, separately, `Physics', `Biology', `Chemistry'; `Finance' and `Economy'. This would cause problems in the practice of labelling texts under such topic headings. The boundaries between them are by no means fixed nor clear. It would be down to one person, or a group of people, to decide where to place most of the texts, and this is not a satisfactory procedure, either from the point of view of resources or from the need for replicability of decisions.
It is recommended here that topic be determined through internal criteria to provide linguistic justification for the resulting categorisation of texts through topic. Guidelines will be based on results from objective computer-assisted analysis of the texts to be included in the corpus.
Here we will review work which has been done to date on the analysis of linguistic features of texts with computer software. This will give a preview of the procedures by which we can select texts and maintain corpora in the future. Some of the main research in this area has been done by Phillips and his work is reviewed here since it intimates the kind of analysis that is likely to be used in order to reinstate topic as a useful segment of corpus typology.
Phillips (1983) offers a rationale for the determination of the topic of a text through an objective, quantitative distributional methodology. What we here are referring to as `topic' in the classification of texts for the purposes of the present typology, Phillips refers to as `aboutness', i.e. `the psychological perception of subject-matter'. He claims that the `aboutness' of a text is due to the global patternings in the text, or the text's `macrostructure'. In his thesis he analyses the macrostructure of texts by computational means, so that the results are derived from the text itself and not from external structures. He emphasises what we would urge here: that in any statement about the subject, or `aboutness', of a text, the basis must be an objective analysis of linguistic features. It would be a mistake at this stage of our awareness to try to map language directly onto external structures. As we have seen in the NERC report, this proves impossible in practice.
In order to say something on the `aboutness' of a text, the human reader has to understand that text. This opens the doors to subjective interpretation of the kind that we wish to dispel from the classification of texts for inclusion in corpora. An objective investigation of the linguistic features which make up the aboutness of a text is a real option now with the advent of computer corpora and sophisticated software for analysis. With this kind of objective analysis, Phillips points out that not only do all the results come from the text itself without outside interference, but that the procedure can be repeated for any number of language units (which, for Phillips, are chapters in a book, but could as well be any length of text that is long enough to show clusterings) for similar analysis.
Phillips further explains that the aboutness of a text cannot depend directly on the details of linguistic form, such as will be found in a lexico-grammatical analysis of it, since a reader will invariably be able to summarise a text and say what it is about without direct reference to the specific lexical items used in that text. Ultimately, however, the understanding of the text and evocation of topic can only be somehow derived from the actual text itself. The aboutness of a text would appear to be separate from its specific representation in language but at the same time it is something evoked in our consciousness only through the symbolism of language. Phillips therefore concludes that we need to look towards the global patternings in a text and it is through the analysis of these global patternings that we gain insight into the aboutness of the text. This global patterning is the macrostructure of the text and it is on this higher level, at the level of the macrostructure, that the text is analysed by sophisticated software for information concerning aboutness.
It is important, therefore, for the purposes of this report, to determine firstly the linguistic features which are relevant to the concept of topic or subject, and secondly to discuss the possibilities for the analysis of such features. Analyses at the level of macrostructure have also been carried out successfully in the Aviator project, which was a computational continuation of the results of Phillips' work.
Phillips claims that:
[...] in any discussion of aboutness, `situation' is an important notion and [that] it is likely that different relations between texts and their contexts could form the basis for a text typology.

This Firthian objective firmly situates text within a higher level of constraint, i.e. the context of situation. Both texts and contexts of situation are far too often classified through external means and designated categorisations which reflect, more than anything, a subjective interpretation and classification of our environment. If the `aboutness' of a text is dependent on its relation to its context, we must determine the relevant features and analyse these features objectively, through computer-assisted analysis, to avoid unnecessary distortions and impositions.
Phillips argues that whilst the aboutness of a text is not directly dependent on linguistic representation, language concepts are ultimately constructed by lexical items or associations of lexical items. It is his contention that the aboutness of a text can be determined by patterns on a higher level of analysis. The patterns in the text are what he refers to as the text's macrostructure and it is from this that we can gain insight into features of aboutness:
I contend that with all semantic, syntactic and lexical markers of a science text neutralised, the meaning of the text is not exhausted but that a distributional analysis of what remains will reveal the presence of global patternings, which I call macrostructure.
To analyse the macrostructure of a text, we look at the collocational associations in the text. We then must determine which are characteristic of a certain style, topic or genre. Phillips' methodology is a distributional analysis of texts (since it is a distributional analysis we must therefore also advocate the inclusion of full texts into the corpus so that all linguistic features are represented). The first step in Phillips' methodology is to identify the frequency of occurrence of each node-collocate pair within a range of four words on either side of the node. This is calculated by taking each word in the corpus as node and the words within the span of four words on either side as collocates. The frequencies of the collocates are then collated to the level of the lemma (although the justification for this decision might be debatable today, with the greater understanding of corpus patterns that we have and the much greater computer power that we can bring to arithmetic tasks). Particular associations between lemmas are identified across the chapters (here, of science books) and the difference in the macrostructure of each assessed. In effect, the macrostructure is defined through the analysis of the patterns of the intercollocation of collocates.
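This first counting step can be sketched as follows. This is a minimal illustration, not Phillips' actual software: the toy tokens are invented, and tokenisation and lemmatisation (which his method applies before counting) are elided.

```python
from collections import Counter

def collocate_pairs(tokens, span=4):
    """Count node-collocate co-occurrences within a window of
    `span` words on either side of each node."""
    pairs = Counter()
    for i, node in enumerate(tokens):
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        for collocate in window:
            pairs[(node, collocate)] += 1
    return pairs

# Toy text standing in for a chapter of a science book.
tokens = "the cell membrane surrounds the cell and the cell divides".split()
pairs = collocate_pairs(tokens, span=4)
```

With a symmetric window the counts are symmetric too: `pairs[("cell", "membrane")]` equals `pairs[("membrane", "cell")]`, and each frequency is a candidate input to the lemma-level collation described above.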
Phillips employs Ward's method of cluster analysis to observe attractions between lemmas. He extends the association to include associations which regularly have at least two lemmas in common. Once a cut-off has been set (so that all the data does not merge into one cluster), each node is assigned a `similarity coefficient' based on its collocational environment. Nodes with similar `similarity coefficients' are merged and the process repeated for the new cluster. We thereby analyse the similarity of networks of lemmas which form the macrostructure of the text. For two networks to be similar, they must both contain at least two of the same lemmas, and at least one lemma must be a member of the nuclear set of nodes in each network.
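The clustering stage can be suggested by a deliberately simplified sketch. This is not Ward's method as Phillips used it; it stands in for it with a plain agglomerative step over cosine similarities of collocational environments, and keeps his requirement that mergeable networks share at least two lemmas. All node names and frequencies below are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two collocate-frequency Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def merge_step(environments, cutoff=0.5):
    """One agglomerative step: merge the most similar pair of nodes
    whose environments share at least two collocates."""
    best, best_pair = cutoff, None
    nodes = list(environments)
    for i, m in enumerate(nodes):
        for n in nodes[i + 1:]:
            if len(set(environments[m]) & set(environments[n])) < 2:
                continue  # networks must have two lemmas in common
            sim = cosine(environments[m], environments[n])
            if sim > best:
                best, best_pair = sim, (m, n)
    if best_pair is None:
        return environments
    m, n = best_pair
    merged = dict(environments)
    merged[m + "+" + n] = environments[m] + environments[n]
    del merged[m], merged[n]
    return merged

envs = {
    "electron": Counter({"charge": 4, "orbit": 3, "mass": 2}),
    "proton":   Counter({"charge": 3, "mass": 3, "nucleus": 2}),
    "poem":     Counter({"verse": 5, "rhyme": 2}),
}
clusters = merge_step(envs)
```

Here `electron` and `proton` share the collocates `charge` and `mass` and merge into one cluster, while `poem` stays apart; repeating the step on the merged environments would grow the clusters until the cut-off halts the process.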
The analysis proved the existence of patterns of association within the chapters. Phillips claims the two major findings of his study to be firstly ``the discovery of the existence of syntagmatic lexical sets''; and secondly that ``the sets identified in the analysis are meaningful'' if we interpret the syntagmatic organisation of words into lexical sets established by the analysis as the conceptual concerns of the text.
This analysis gives rise to the notion of lexical macrostructure and stresses the importance of the intercollocation of collocates. A semantic interpretation of the lexical evidence leads us to a notion of what the text is about.
Phillips' work formed the basis for the part of the AVIATOR project coordinated by Renouf which aimed to develop software for monitoring changes in language patterns as corpora develop over time. There were two main goals of the project, firstly to identify new word occurrences, collocations and word-combinations; and secondly, and most relevant to the present purpose, to identify clusters of words in a text that to some extent reflect its conceptual content. Since the aim was to monitor change, among other things, there was a constant stream of language data (from The Times national newspaper) over the three-year duration of the project.
The software developed as part of the Aviator project gives an example of the way in which the topic of a text can be retrieved automatically from the text itself. The methodology was much the same as Phillips'. First there was established a stop-word list of the most frequent words in the corpus, whose patterns of occurrence appeared to be stable over several years. These words were then overlooked in favour of those that seemed more responsive to their general environment. A collocate bank of the remaining words was created, holding frequency information.
Clustering software then processed the most frequent nodes to create clusters. From these clusters we can then see the aboutness of a text. The usefulness of the machine analysis was evaluated by comparison with the results from humans given the task of summarising the same texts or saying what they were about. The software thus created lists of keywords and clusters of keywords which are indicative of the topic of the text.
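The overall shape of this pipeline can be sketched in a few lines. This is only a frequency-based caricature of the Aviator procedure: the toy text, the parameter values and the use of raw frequency as the stop-word criterion are our assumptions (the real system also tracked stability of patterns over several years of data).

```python
from collections import Counter

def keyword_candidates(tokens, n_stop=2, n_keywords=2):
    """Sketch: treat the corpus's most frequent words as a stop list,
    then rank the remaining words by frequency as keyword candidates."""
    freq = Counter(tokens)
    stop = {w for w, _ in freq.most_common(n_stop)}
    content = Counter({w: c for w, c in freq.items() if w not in stop})
    return [w for w, _ in content.most_common(n_keywords)]

tokens = ("the rise of inflation means the level of prices rises as the "
          "value of the currency falls and inflation erodes the savings "
          "of the public while prices climb").split()
keywords = keyword_candidates(tokens)
```

On this toy text the grammatical words `the` and `of` land on the stop list, and the surviving high-frequency words (`inflation`, `prices`) are exactly the kind of keyword cluster that indicates what the text is about.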
On the basis of the results of the Aviator project, it seems feasible that the topic of texts can be determined through analysis of linguistic criteria with software such as that used by Phillips (1983) and Renouf et al. (1993). A typology of texts according to the topics emerging from such analyses could then be established.
It is not foreseen, however, that the types of topic identified through this methodology ever be of finite number, or that orderly hierarchies will emerge. Rather, the relation between one text and another is best visualised as a pattern of overlap, and the clusters of collocates as a sophisticated extension of the familiar `keywords', on which most of today's content queries are based. The model of language underlying this research is dynamic and the categories of analysis are not preordained; so no existing list of possible topics can be precisely relevant; no doubt useful classificatory patterns will emerge when a large number of texts have been processed.
Since the NERC report it has been regularly recommended that corpus building should have in mind the eventual goal of adding the time-change dimension and moving towards the condition of monitor corpora (see the companion report on corpus typology). Such a move will require the development of new software tools, which will have to be fully automatic, very fast and sensitive to quite small changes in the language structure. If Phillips' computerised aboutness allows us to represent topic without human intervention, it will be possible to control additions to corpora with a new accuracy; there could also be important byproducts in information retrieval and knowledge retrieval.
The topic of a text -- what it is about -- is probably unique to the text. However, this is not in practice very helpful, since for a variety of purposes people need a means of associating texts together according to topic, and of differentiating them on the same lines.
This has been recognised in practice for many years, and it is now commonplace to classify texts using a crude system of keywords. Authors are routinely asked to supply with a manuscript a small number of words (including short phrases) under which a text can be indexed. It is more than likely that these are words and phrases that appear prominently in the text itself, and the practice thus constitutes a rough-and-ready method of analysis with some reference to internal patterns.
Related to the keywords method is a much more sophisticated method of classification -- the summary or abstract, also routinely supplied by authors. A summary is supposed to be a short text that brings out the main message of the full text; as such it is likely to use language which is characteristic of the full text, though it is by no means restricted to that language, and frequently uses words of more generalised reference in order to summarise efficiently.
In the days of corpora of unlimited size, including complete texts of widely varying lengths, it is necessary that internal criteria be formal and automatic in nature. Human abstracting services are expensive and slow compared with machine processing speeds. Hence we recommend that operations such as abstracting and providing keywords be developed for the internal classification of topic.
Keywords provide a set of labels or titles indicating what is inside a text, and how it relates to other texts. Abstracting gives indications of the kind of sentences and arguments that are to be found inside. These are clearly the kind of thing that people want, and find reasonably easy to understand, though they are not readily related by machine to the original texts.
However, automatic abstracting is the aim of a number of research projects, and one type in particular is of interest -- where the machine picks out and fits together sentences of the original text in order to make the abstract (Hoey, 1991: 113-4, 142, 160). In such systems the relationship between the text and its summary is simple and explicit.
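A minimal sketch of such a sentence-extraction abstract follows. The scoring rule (count of keyword hits per sentence) and the example sentences are our own simplifying assumptions, not the method of any of the projects cited; the point is only that the summary is built verbatim from sentences of the original, so the text-summary relationship stays simple and explicit.

```python
def extract_abstract(sentences, keywords, n=1):
    """Score each sentence by how many of the text's keywords it
    contains; return the top n sentences in their original order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(k in sentences[i].lower() for k in keywords),
    )
    return [sentences[i] for i in sorted(ranked[:n])]

sentences = [
    "Corpora grow quickly.",
    "Topic classification of corpora needs internal criteria.",
    "The weather was fine.",
]
abstract = extract_abstract(sentences, ["corpora", "topic", "criteria"])
```

Because the chosen sentences are returned in document order, the abstract reads as a (drastically) shortened version of the original rather than a paraphrase.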
Automatic selection of keywords is another obvious step, and here Phillips' notion of aboutness is of considerable practical importance. In principle, using aboutness could lead to a hierarchy of classification of texts that would at its `highest' level produce acceptable keywords, and then a pyramid of more and more detailed lexical analysis. Texts could be compared automatically and the overlap between them could be presented in terms that would be easy for a human to understand.
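The comparison step at the top of such a hierarchy is straightforward to express. This is a sketch under the assumption that each text has already been reduced to a keyword set (by whatever aboutness analysis); the example keyword lists are invented.

```python
def topic_overlap(keywords_a, keywords_b):
    """Present the overlap and differences between two texts'
    keyword sets in a form a human reader can evaluate."""
    a, b = set(keywords_a), set(keywords_b)
    return {
        "shared": sorted(a & b),   # the overlap in topic
        "only_a": sorted(a - b),   # what differentiates text A
        "only_b": sorted(b - a),   # what differentiates text B
    }

overlap = topic_overlap(["inflation", "prices", "banks"],
                        ["inflation", "trade", "prices"])
```

Descending the pyramid would simply mean repeating the comparison with larger, more detailed keyword sets drawn from lower levels of the lexical analysis.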
The combination of the Aboutness Pyramid and an automatic abstract of a desired granularity should give the best available tools for navigating corpora and other archives. Although neither of these tools is yet fully available, research is vigorous in both areas. Aboutness has the added benefit that it is language-independent; it is not yet clear whether any language-independent abstracting systems will give good enough results to be adopted.
The main limitation of the application of Phillips' ideas is the minimum length of text that will contain sufficient instances of the important words to allow the clustering techniques to work effectively and to depress the effect of other dimensions of patterning, which are apparent locally but do not repeat often enough to be included in the specification of topic. Each text has its own distributional properties, and it is not possible at present to determine a suitable general figure. By way of guidance, it can be noted that Phillips gets satisfactory results from considering each chapter in a scientific textbook as a separate text, and Aviator was able to work with a variety of more popular texts of a few thousand words in length. Hence the prognostication is that if a corpus is divided down to very short entries, like newspaper reports for example, the aboutness technique will not have enough evidence to go on; however, such tiny texts will surely be classifiable much more directly if it is felt necessary.
Phillips' work relates primarily to the specialised language of science and technology, where we assume there is less variety in the vocabulary than in general text. Parallel to his work is that of Yang (1986), who devised methods for the detection of technical terms in open text. Since the strategy involves the distribution of terms across texts, it is eminently suitable for application to corpora, and although as published it focuses on the individual terms, it can by extension be used to classify texts according to the distribution of technical terms within them. More recent relevant work on automatic term recognition which can be likewise used as a basis for text classification is due to Frantzi & Ananiadou (1996) and Lauriston (1996), the latter offering a substantial bibliography.
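The distributional intuition behind such term detection can be illustrated with a simple contrast measure. This is emphatically not Yang's published method (nor Frantzi & Ananiadou's or Lauriston's); it is a tf-idf-style sketch of the underlying idea that a technical term is frequent within one text but confined to few texts, using invented toy texts.

```python
import math
from collections import Counter

def term_candidates(texts, threshold=1.0):
    """Score each word by its frequency in a text weighted against the
    number of texts it occurs in; keep scores above a threshold."""
    docs = [Counter(t.split()) for t in texts]
    n = len(docs)
    df = Counter()                      # document frequency per word
    for d in docs:
        df.update(d.keys())
    candidates = []
    for i, d in enumerate(docs):
        for w, c in d.items():
            score = c * math.log(n / df[w])   # 0 for corpus-wide words
            if score > threshold:
                candidates.append((i, w, round(score, 2)))
    return candidates

cands = term_candidates([
    "the phoneme inventory shows each phoneme contrast",
    "the market rose and the index fell",
    "the recipe uses flour and the oven",
])
```

A word like `phoneme`, concentrated in a single text, scores highly, whereas `the`, present in every text, scores zero; aggregating such scores per text is one way the distribution of technical terms could be turned into a classification of the texts themselves.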
To summarise this discussion, we can look forward to being able to analyse texts in such a way that their topics will be retrieved automatically. This involves greater sophistication in the development of internal criteria than is widely available at the present time, but research towards it should be vigorously pursued. Two texts processed for aboutness can be compared on at least two levels of generalisation above the sentences of the texts themselves: those keywords or clusters that occur in both form an overlap in topic that can be evaluated by the investigator, and those that differentiate the texts provide further helpful guidance. The Aboutness Pyramid is potentially a fully practical device for indexing texts in a corpus and comparing any two texts.
In the meantime, it is possible to use external criteria in order to arrive at classifications of subject-matter which, precisely because of their correlation with public and social distinctions, may well be superior for some purposes to the inherently less tidy internal picture of topic.