The earliest electronic corpora were designed using external criteria -- reference to institutionalised types of text, or features of the nonlinguistic environment or society in which the texts occurred. More recently, some internal criteria -- differentiating features of the language of the texts -- have been offered by researchers. This work suggests that a thorough classification of texts, an adequate typology, will eventually consist of a balanced combination of the two types of criteria.
Many internal and external criteria reflect each other. A text with a high average sentence length, for example, is more likely to come from one kind of book or magazine than from another. For some parameters of classification the external evidence is primary -- whether a text is printed in a book or a newspaper, whether it is in written or spoken form, etc. No matter what kind of language is found in the document or transcript, these real-world distinctions are clear-cut, and relate to the experience of end-users of the typology, who will not all be experts in linguistics.
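As an illustration of an internal criterion of the kind mentioned above, average sentence length can be computed directly from the text. The following is a minimal sketch only: the function name, the naive sentence splitter and the use of whitespace-separated tokens as `words' are all assumptions made for the example, not part of any established corpus procedure.

```python
import re


def average_sentence_length(text: str) -> float:
    """Mean number of whitespace-separated tokens per sentence.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Real texts (abbreviations, quotations) need a better splitter.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

A figure of this kind says nothing by itself about where a text was published; it only becomes useful when set beside the external record of the text.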
So if it turns out that some texts which are classified separately from an external perspective share some of the same linguistic features, that is not in itself an adequate reason for classifying them together, though it raises an interesting question. Much research on language variety, genre etc. has worked on the assumption that external distinctions are reflected internally, and indeed it is ultimately that assumption that makes external criteria relevant at all -- for if they had no influence on the language used, they would be of no interest to linguists or others in the language industries.
Perhaps the major indication of whether a corpus parameter should be expressed through internal or external criteria is whether the parameter itself is defined intensively or extensively. This is a basic distinction in lexicography; an intensive criterion is expressed by something like a generalisation which is shared by all the members of a group, whereas an extensive criterion is merely a list of the members. Where the group is a fairly small and clearly delimited one, like `quality newspapers' in the UK, there is little to choose between listing them and phrasing a generalisation that will identify just the same newspapers. But when there is a large number of members of a group -- and possibly an indefinitely large number -- then the techniques of extensive definition do not work.
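The contrast between the two kinds of definition can be sketched in code. Everything in the sketch is hypothetical: the publication titles are placeholders, and the `intensive' rule (a newspaper whose average sentence length exceeds a threshold) is invented purely to show the shape of a generalisation, not offered as a real criterion for quality newspapers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    title: str
    medium: str              # external fact, e.g. "newspaper"
    avg_sentence_len: float  # internal measure taken from the text


# Extensive definition: the group is simply a list of its members.
QUALITY_PAPERS = frozenset({"Paper A", "Paper B"})  # placeholder titles


def is_quality_extensive(pub: Publication) -> bool:
    return pub.title in QUALITY_PAPERS


# Intensive definition: a generalisation shared by all members.
# The threshold is an arbitrary, illustrative value.
def is_quality_intensive(pub: Publication, threshold: float = 20.0) -> bool:
    return pub.medium == "newspaper" and pub.avg_sentence_len >= threshold


listed = Publication("Paper A", "newspaper", 24.0)
unlisted = Publication("Paper Z", "newspaper", 23.0)
```

For the listed publication the two definitions agree. For the unlisted one only the intensive definition can give an answer without the list being revised, which is the difficulty that arises once the group has no known limit.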
Hence, where we find extensive listing without a principled limit or known dimension, as in the classification of topic by external criteria through a list of possible topics, there is a strong case to be made for abandoning this approach and looking instead for a generalised means of classifying the texts using internal criteria and intensive definition.
There is a halfway house, where the classification is based on what the text says about itself. If on the title page of a text it is stated that the text is a `manual of lithography', it is fairly safe to classify it as a manual, and to record its topic as lithography. Such a classification is called reflexive.
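A reflexive classification of this kind can be sketched as a simple pattern match on the title. The list of text-type words, the `X of Y' pattern and the function name are assumptions made for the example; a real title page would need far more careful handling.

```python
import re
from typing import Optional, Tuple

# Hypothetical, deliberately short list of text-type words.
TEXT_TYPES = ("manual", "handbook", "dictionary", "history")

_PATTERN = re.compile(
    r"\b(" + "|".join(TEXT_TYPES) + r")\s+of\s+(.+)", re.IGNORECASE
)


def reflexive_classification(title: str) -> Optional[Tuple[str, str]]:
    """Return (text type, topic) as the title itself states them, or None."""
    match = _PATTERN.search(title)
    if match is None:
        return None
    text_type = match.group(1).lower()
    topic = match.group(2).strip().rstrip(".").lower()
    # The claim is recorded as made; it is not verified against the text.
    return text_type, topic
```

Called on a title such as `A Manual of Lithography', the function records the type as manual and the topic as lithography; a title that makes no such statement about itself yields no classification.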
In a perfect world these claims would have to be verified with respect to the criteria for manuals, and the place of lithography in relation to other arts and crafts, and the printing trades. For the time being we accept the classification at face value; we do not define a manual carefully, and we do not attempt to divide the world around lithography into a coherent set of labels.