1 Introduction

Since their first publication in 1994, the Recommendations of the Text Encoding Initiative (Sperberg-McQueen and Burnard, 1994) have influenced a very wide variety of electronic resources, serving as an important reference point even for projects which have not adopted them. Some indication of the breadth and variety of the community of TEI users is given by the TEI applications web page (TEI 1997). However, of the many research fields in which the Guidelines have been influential, that of language corpora stands out, particularly in a European context. Their basic notions and content underlie the structure and design of both the British National Corpus (BNC 1994) and the EAGLES Corpus Encoding Standard (Ide 1997).

This paper reviews the requirements of those building and distributing language corpora today, and the relevant parts of the TEI Recommendations, in an attempt to show where the latter can most usefully be applied to meet the needs of the former, and also to assess where modification or development of the TEI Guidelines might be beneficial in the light of experience. For complete coverage of the encoding schemes referred to here, the appropriate reference guides cited in the bibliography should be consulted. This document also assumes some acquaintance with the principles of SGML encoding schemes in general and the TEI in particular: for an introduction to the TEI, see Burnard 1996; for a simple introduction to SGML, see TEI 1994; for a complete one, see Goldfarb 1990.

A language corpus, for our purposes, may be defined as a body of naturally occurring language data assembled for some specific purpose. Typically, the purpose will be to facilitate automatic linguistic analysis, either of the corpus itself, or of some other material for which the corpus is intended to provide a comparative basis, but there are many cases of corpora which are constructed simply because of the inherent interest or importance of the language data which they contain. Historical language corpora, such as the Corpus of Old English or the Corpus of Historical Spanish, are a special case of this type.

Whatever their intended application, however, corpora are easily distinguishable from simple assemblages of texts or electronic collections, in that the components of a corpus are intended to be used together as a single unit, most if not all of the time. For this to be feasible, at least the following are prerequisites:

When describing the components of written texts (other than words), it is necessary to indicate the boundaries of chapters, sections, paragraphs, sentences, etc., and the specialized roles of headings, lists, notes, citations, captions, references, etc. Many of these components serve a dual function: they mark a particular type of discourse within the text, but they also serve to identify locations within it, forming the basis of a reference system which may be used to localize occurrences of tokens within a specific context. In the same way, for spoken texts, indications of the beginnings and ends of individual utterances are essential, as is an indication of the speaker of each. The general issue of text structure is discussed in more detail in section 2 below.
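The dual function described above can be illustrated with a minimal sketch, assuming a small hand-made sample in XML rather than SGML so that it can be parsed with the Python standard library. The element names (text, div, p, s), the n attributes, the sample sentences, and the function name locate are all illustrative assumptions, not taken from any particular corpus. The sketch shows how the same structural markup that delimits sections, paragraphs and sentences can also yield a reference for each occurrence of a token.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text: the element names and numbering are
# illustrative assumptions, not the markup of any actual corpus.
SAMPLE = """
<text>
  <div n="1">
    <head>Arrival</head>
    <p n="1"><s n="1">The ship came in.</s> <s n="2">The crew slept.</s></p>
    <p n="2"><s n="1">The ship left at dawn.</s></p>
  </div>
  <div n="2">
    <p n="1"><s n="1">No ship was seen again.</s></p>
  </div>
</text>
"""


def locate(xml_text, token):
    """Return one 'div.p.s' reference per occurrence of token."""
    root = ET.fromstring(xml_text.strip())
    target = token.lower()
    hits = []
    for div in root.iter("div"):
        for p in div.iter("p"):
            for s in p.iter("s"):
                # Crude tokenization: split on whitespace, drop punctuation.
                words = [w.strip(".,;:!?").lower() for w in (s.text or "").split()]
                for _ in range(words.count(target)):
                    hits.append(f"{div.get('n')}.{p.get('n')}.{s.get('n')}")
    return hits


references = locate(SAMPLE, "ship")
print(references)
```

Here the section, paragraph and sentence numbers are combined into a single reference string for each hit. A spoken text could be treated analogously by replacing the structural units with utterances and recording the speaker of each, although that case is not covered by this sketch.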

In either kind of text, it is helpful to include editorial information about the status of the electronic text itself (for example to mark corrections or conjectures by the transcriber or editor): transcription is not an exact science, even for printed materials, and still less so for spoken texts. Even where entirely automatic procedures have been adopted, subsequent users of corpora need to be informed of the nature of the algorithms and other procedures applied. These issues are discussed in more detail in section 3 below.

Finally, it is essential to record descriptive information about the social or cultural context in which the text was produced or classified. Such meta-information may often be of crucial importance to the corpus analyst, whether or not a contrastive study (for example between the speech of men and women, or between texts aimed at the young and the old) is involved. We discuss this in section 5 below.