The context within which a corpus text was produced or received is arguably at least as significant as any of its intrinsic linguistic properties, if indeed the two can be entirely distinguished. In large, supposedly balanced corpora such as the BNC, it is all the more important to be able to identify with confidence such information as the mode of production, publication or reception; the type or genre of writing or speech a text contains; the socioeconomic factors or qualities pertaining to its producers or recipients; and so on. Even in smaller or more narrowly focussed corpora, such variables, and a clear identification of the domain which they are intended to typify, are of major importance for comparative work.
At the very least, a corpus text must indicate its provenance (i.e. the original material from which it has been taken) with sufficient accuracy that the source can be located and checked against its corpus version. Clear bibliographic identification is readily provided for printed or published works, but for spoken material a whole host of ancillary features, such as place and time of recording, demographic characteristics of speakers and hearers, social context and setting, are of equal potential value to the analyst, although there may be little consensus as to how such features should be summarized. Because electronic versions of a non-electronic original are inevitably subject to some form of distortion or translation, it is also important to document clearly the editorial procedures and conventions which have been adopted, as already mentioned. Where interpretative categories or descriptive taxonomies have been applied, for example in the definition of text types or genres, these must also be documented and defined if the user is to make full use of the material.
In earlier times, it was customary to provide all such information in a reference manual, if at all. With the definition of the TEI Header it becomes possible to present all such material in an integrated form, together with the corpus itself. This greatly facilitates both the automatic validation of the accuracy and consistency with which such documentation is provided, and the development of more human-readable and informative software access to the contents of a corpus.
The TEI Header has four major parts, derived originally from the International Standard Bibliographic Description (ISBD). The term computer file is used here, as in ISBD, to refer to any computer-held object, and is, for our purposes, equivalent to a TEI text, as defined above.
The scope of this article does not permit exhaustive discussion of all features of the TEI Header likely to be of relevance to corpus builders or users. Dunlop 1995 and Romary 1997 are useful case-studies, describing its use in the construction of the BNC, and in the implementation of a distributed server for linguistic resources, respectively. As the latter suggests, the TEI Header has had a particularly significant impact in the developing world of the electronic library. Our discussion focusses solely on the ways in which header components may be associated with particular texts, or parts of them, and on how information is factored out between the individual headers and a corpus header.
It will rarely be the case that a corpus uses more than one reference or segmentation scheme. However, it will often be the case that a corpus is constructed using more than one editorial policy or sampling procedure, and it is almost invariably the case that each corpus text has a different source or a particular combination of text-descriptive features or topics.
To cater for this variety, the TEI scheme allows for contextual information to be defined at a number of different levels. Information relating either to all texts, or potentially to any number of texts within a corpus, should be held in the overall corpus header. Information relating either to the whole of a single text, or potentially to any of its subdivisions, should be held in a single text header. Information is typically held in the form of elements which have a specific type and whose names end with the letters decl (for declaration). Examples include <editorialDecl> for editorial policies, <classDecl> for text classification schemes, and so on.
The following rules define how such declarations apply:
As a simple example, here is the outline of a corpus in which editorial policy E1 has been applied to texts T1 and T3, while policy E2 applies only to text T2:
    <teiCorpus>
      <teiHeader>
        <!-- ... -->
        <editorialDecl id=E1> ... </editorialDecl>
        <editorialDecl id=E3> ... </editorialDecl>
        <!-- ... -->
      </teiHeader>
      <tei.2 id=T1>
        <teiHeader> <!-- no editorial declaration supplied --> </teiHeader>
        <text decls=E1> ... </text>
      </tei.2>
      <tei.2 id=T2>
        <teiHeader>
          <editorialDecl id=E2> ... </editorialDecl>
        </teiHeader>
        <text> ... </text>
      </tei.2>
      <tei.2 id=T3>
        <teiHeader> <!-- no editorial declaration supplied --> </teiHeader>
        <text decls=E1> ... </text>
      </tei.2>
    </teiCorpus>

The same method may be applied at the next level down, with the decls attribute being specified on divn-class elements, provided that all the possible declarations are specified within a single header.
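The association between texts and declarations in the outline above can be worked through programmatically. The following is a minimal Python sketch rather than part of the TEI scheme itself. It assumes the outline has been rewritten as well-formed XML with quoted attribute values, and it assumes a simple order of precedence: an explicit decls reference is used first, then a declaration in the text's own header, then a corpus-level declaration if there is exactly one. The function name resolve_editorial_decl and the placeholder policy descriptions are invented for illustration.

```python
import xml.etree.ElementTree as ET

# XML rendering of the outline (quoted attribute values are an assumption;
# the policy descriptions are placeholders).
CORPUS = """
<teiCorpus>
  <teiHeader>
    <editorialDecl id="E1">policy one</editorialDecl>
    <editorialDecl id="E3">policy three</editorialDecl>
  </teiHeader>
  <tei.2 id="T1">
    <teiHeader/>
    <text decls="E1">...</text>
  </tei.2>
  <tei.2 id="T2">
    <teiHeader><editorialDecl id="E2">policy two</editorialDecl></teiHeader>
    <text>...</text>
  </tei.2>
  <tei.2 id="T3">
    <teiHeader/>
    <text decls="E1">...</text>
  </tei.2>
</teiCorpus>
"""


def resolve_editorial_decl(corpus_xml):
    """Map each text identifier to the editorial declaration that governs it.

    Assumed precedence (hypothetical, for illustration only):
    explicit decls reference, then the text's own header, then a sole
    corpus-level declaration; otherwise None.
    """
    root = ET.fromstring(corpus_xml)
    # Declarations held in the corpus header (the direct child of the root).
    corpus_ids = [d.get("id") for d in root.find("teiHeader").iter("editorialDecl")]
    resolved = {}
    for tei in root.iter("tei.2"):
        # Declarations held in this text's own header.
        local_ids = [d.get("id") for d in tei.find("teiHeader").iter("editorialDecl")]
        text = tei.find("text")
        # decls may list several identifiers; keep those naming editorial policies.
        refs = [r for r in text.get("decls", "").split()
                if r in corpus_ids or r in local_ids]
        if refs:
            chosen = refs[0]
        elif local_ids:
            chosen = local_ids[0]
        elif len(corpus_ids) == 1:
            chosen = corpus_ids[0]
        else:
            chosen = None
        resolved[tei.get("id")] = chosen
    return resolved


if __name__ == "__main__":
    for text_id, decl_id in resolve_editorial_decl(CORPUS).items():
        print(text_id, decl_id)
```

Because the corpus header in this outline holds more than one editorial declaration, a text without its own declaration has to name the one it uses, which is the role the decls attribute plays here.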
A similar method may be used to associate complex text-descriptive information with a given text (though not with part of a text). Corpus texts are generally selected in order to represent particular classifications, or text types, but the taxonomies from which those classifications come are widely divergent across different corpora. For this reason, the TEI scheme allows a wide variety of methods for identifying the classification assigned to a particular text, and also encourages the declaration of the classification itself. The latter is recorded within a special element, the <classDecl>, located within the encoding description. It will generally define a complex taxonomy of categories, by means of which individual texts may be classified across multiple dimensions.
To record the classification of a particular text, a distinct <textClass> element is provided within the profile description part of the header. This may contain one or more of the following
Despite its apparent complexity, a classificatory mechanism of this kind has several advantages over the kind of fixed classification scheme implied by simply assigning each text a fixed code, chiefly as regards flexibility and extensibility. As new ways of grouping texts are identified, new codes can be added. Cross-classification is built into the system, rather than being an inconvenience. More accurate and better targeted enquiries can be posed in terms of the markup. Above all, because the classification scheme is expressed in the same way as all the other encoding in the corpus, the same enquiry system can be used for both.
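To illustrate how a markup-based classification supports cross-classified enquiries, here is a minimal Python sketch. It rests on several assumptions not stated in the text above: that the headers are available as well-formed XML, that categories are declared as <category> elements grouped into <taxonomy> elements within the <classDecl>, and that each text's <textClass> holds a <catRef> element whose target attribute lists category identifiers. The category identifiers (written, spoken, fiction, news) and the function name texts_in are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: two taxonomies (medium and genre) declared once in the
# corpus header, and three texts classified against both of them.
CORPUS = """
<teiCorpus>
  <teiHeader>
    <classDecl>
      <taxonomy id="medium">
        <category id="written"/><category id="spoken"/>
      </taxonomy>
      <taxonomy id="genre">
        <category id="fiction"/><category id="news"/>
      </taxonomy>
    </classDecl>
  </teiHeader>
  <tei.2 id="T1">
    <teiHeader><textClass><catRef target="written fiction"/></textClass></teiHeader>
  </tei.2>
  <tei.2 id="T2">
    <teiHeader><textClass><catRef target="written news"/></textClass></teiHeader>
  </tei.2>
  <tei.2 id="T3">
    <teiHeader><textClass><catRef target="spoken news"/></textClass></teiHeader>
  </tei.2>
</teiCorpus>
"""


def texts_in(corpus_xml, *categories):
    """Return identifiers of texts assigned to every one of the given categories."""
    root = ET.fromstring(corpus_xml)
    # Only categories declared in the corpus header may be queried.
    declared = {c.get("id") for c in root.iter("category")}
    unknown = set(categories) - declared
    if unknown:
        raise ValueError("undeclared categories: " + ", ".join(sorted(unknown)))
    wanted = set(categories)
    hits = []
    for tei in root.iter("tei.2"):
        assigned = set()
        for ref in tei.iter("catRef"):
            assigned.update(ref.get("target", "").split())
        if wanted <= assigned:  # the text carries all requested categories
            hits.append(tei.get("id"))
    return hits


if __name__ == "__main__":
    print(texts_in(CORPUS, "news"))
    print(texts_in(CORPUS, "written", "news"))
```

Adding a further dimension of classification in this sketch means declaring another taxonomy and extending the target lists; the query function itself does not change.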
Finally, we discuss briefly the methods available for the classification of units of a text more finely grained than the complete text. These are of particular importance for transcriptions of spoken language, in which it is often necessary to distinguish, for example, the speech of women and men, or speech produced by speakers of different socio-economic groups. Here the key concept is the provision of means by which information about individual speakers can be recorded once and for all in the header of the texts in which they speak. For each speaker, a set of elements defining a range of such variables as age, social class and sex can be defined in a <participant> element. The identifier of the participant is then used as the value for a who attribute supplied on each <u> element enclosing an utterance by the participant concerned. To select utterances by speakers according to specified participant criteria, the equivalent of a relational join between utterance and participant must be performed, using the value of this identifier.
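The join between utterance and participant described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the transcription is assumed to be well-formed XML, the child elements <sex> and <age> inside <participant> are simplified stand-ins for whatever variables a corpus actually records, and the function name utterances_by is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription: speaker details are recorded once in the header,
# and each utterance points to its speaker through the who attribute.
TEXT = """
<tei.2 id="S1">
  <teiHeader>
    <participant id="P1"><sex>f</sex><age>34</age></participant>
    <participant id="P2"><sex>m</sex><age>61</age></participant>
  </teiHeader>
  <text>
    <u who="P1">Morning.</u>
    <u who="P2">Morning, how are you?</u>
    <u who="P1">Fine, thanks.</u>
  </text>
</tei.2>
"""


def utterances_by(text_xml, **criteria):
    """Return the utterances whose speaker matches all the given criteria."""
    root = ET.fromstring(text_xml)
    # The "participant table": identifier -> {variable name: value}.
    participants = {
        p.get("id"): {child.tag: child.text for child in p}
        for p in root.iter("participant")
    }
    selected = []
    for u in root.iter("u"):
        # The join: look up the speaker record by the value of who.
        speaker = participants.get(u.get("who"), {})
        if all(speaker.get(name) == value for name, value in criteria.items()):
            selected.append(u.text)
    return selected


if __name__ == "__main__":
    print(utterances_by(TEXT, sex="f"))
```

Building the participant lookup once per text keeps the selection a single pass over the utterances, which matters when a transcription contains many thousands of <u> elements.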
The same method may be applied to select speech within given social contexts or settings, given the existence in the header of a <settingDesc> element defining the various contexts in which speech is recorded, which can be referenced by the decls attribute attached to an element enclosing all speech recorded in a particular setting.
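Selection by setting can be sketched in the same way. The Python example below assumes well-formed XML in which each context is described by its own <settingDesc> carrying an identifier, and in which a <div> element encloses all the speech recorded in one setting. The <setting> and <locale> children, the identifiers SET1 and SET2, and the function name utterances_in_setting are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical recording session covering two settings.
SESSION = """
<tei.2 id="S1">
  <teiHeader>
    <settingDesc id="SET1"><setting><locale>kitchen</locale></setting></settingDesc>
    <settingDesc id="SET2"><setting><locale>office</locale></setting></settingDesc>
  </teiHeader>
  <text>
    <div decls="SET1"><u who="P1">Tea?</u><u who="P2">Please.</u></div>
    <div decls="SET2"><u who="P1">Is the report ready?</u></div>
  </text>
</tei.2>
"""


def utterances_in_setting(text_xml, locale):
    """Return utterances enclosed by a division recorded in the given locale."""
    root = ET.fromstring(text_xml)
    # Identifiers of the setting descriptions that match the requested locale.
    setting_ids = {
        sd.get("id")
        for sd in root.iter("settingDesc")
        if sd.findtext(".//locale") == locale
    }
    selected = []
    for div in root.iter("div"):
        # decls may reference several declarations; any matching one qualifies.
        if setting_ids & set(div.get("decls", "").split()):
            selected.extend(u.text for u in div.iter("u"))
    return selected


if __name__ == "__main__":
    print(utterances_in_setting(SESSION, "kitchen"))
```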