The following classification has been worked out to include most of the information about a text that is likely to be:
It is by no means exhaustive; indeed it tends in the other direction. In practical terms, even a corpus that is very large by today's standards will be unwieldy to use if a large number of classification parameters are applied.
The reason is that, in a typological analysis, each binary parameter doubles the number of eventual components of the corpus -- sets of texts which all have the same characteristics (see the report on corpus typology). Each component has to contain several texts in order to be reliable; so if there are even only 5 binary parameters, there will be 2^5 components, i.e. 32. If the average size of a component is 0.5 million words, then the corpus must be at least 16 million words in size.
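The arithmetic above can be checked with a short sketch. The function names are illustrative only, and the calculation assumes, as the paragraph does, that every parameter is binary and that all components have the same average size.

```python
def component_count(binary_parameters: int) -> int:
    """Number of corpus components when every parameter is binary."""
    return 2 ** binary_parameters


def minimum_corpus_size(binary_parameters: int, average_component_words: int) -> int:
    """Smallest corpus (in words) that fills every component to the average size."""
    return component_count(binary_parameters) * average_component_words


if __name__ == "__main__":
    # 0.5 million words per component, the figure used in the text.
    average = 500_000
    for k in range(1, 8):
        words = minimum_corpus_size(k, average)
        print(f"{k} binary parameters -> {component_count(k)} components, "
              f"at least {words:,} words")
```

With 5 parameters this gives 32 components and 16,000,000 words; each further parameter doubles both figures, which is why the number of parameters has to be kept small.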
The question of relevance is specific to each text, and these categories are not intended to be made mandatory; for example, the identification of the author of a text is relevant towards the literary end of the spectrum but probably not at the bureaucratic end. Nevertheless, a text which seems to be composed in American English, but by a British author and publisher, may require a specific probe into its origins before it can be placed in a corpus without misleading users (this can happen, for example, if a book is first published in the USA, then reprinted in a UK edition from the original plates).
Also, it may in some cases be quite impossible to determine the author, or to chart accurately all the people who have made some contribution to the composition of the text. Scholars are specially trained in this work and may spend a great deal of time establishing the authorship of a single text -- for literary reasons, perhaps, or forensic ones. So, in the normal use of this typology, it is expected that assignment to the authorship category or to any other will be done immediately or not at all.
The typology can be elaborated for the requirements of a particular application. Entries which were not made in the original establishment of the corpus can be added, and additional parameters can be introduced alongside those advocated here.
The study of internal parameters is much less advanced than that of external parameters. The main internal parameters are thought to be:
Sometimes an aspect of the classification is reflexive, in that the text itself contains a statement of its origin, its audience, etc. If this statement is used as the basis for external classification, it is recommended that (R) be placed after the classification. A declaration in the text of a document or transcription is not necessarily accepted as correct, but should always be recorded, and any discrepancy between the reflexive classification and the one chosen for the typology should be justified.
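One possible way of recording a reflexive classification is sketched below. It is an assumption made for illustration, not part of the typology: the class name, field names and example values are all hypothetical. The sketch keeps the text's own declaration alongside the chosen value, appends (R) only when the chosen value is taken from that declaration, and flags a discrepancy that has no justification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Classification:
    """One entry in a text's typology record (hypothetical layout)."""
    parameter: str                  # e.g. "audience" or "origin"
    value: str                      # the value chosen for the typology
    declared: Optional[str] = None  # what the text says about itself, if anything
    justification: str = ""         # why value differs from declared, if it does

    @property
    def is_reflexive(self) -> bool:
        # Reflexive: the chosen value is based on the text's own declaration.
        return self.declared is not None and self.declared == self.value

    def label(self) -> str:
        # (R) is placed after a classification that rests on the declaration.
        return f"{self.value} (R)" if self.is_reflexive else self.value

    def has_unjustified_discrepancy(self) -> bool:
        # The declaration is always recorded; a mismatch must be justified.
        return (
            self.declared is not None
            and self.declared != self.value
            and not self.justification.strip()
        )


# Hypothetical entries: one taken from the text's declaration, one overriding it.
audience = Classification("audience", "specialists", declared="specialists")
origin = Classification("origin", "British English", declared="American English")

print(audience.label())
print(origin.label(), origin.has_unjustified_discrepancy())
```

Here the second entry would be flagged until a justification is supplied, for instance a note on the publishing history of the text.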