2 Encoding the structure of a corpus

This section addresses some particular aspects of the general problem implied by the need to impose a single consistent structure on the potentially very large and diverse components making up a language corpus. We begin by considering the definition of a corpus text in section 2.1, before discussing the components making it up (section 2.2), and conclude with a brief discussion of the referencing schemes that structural markup facilitates in section 2.3.

2.1 What is a text?

A corpus is typically composed of samples taken from other larger entities. The samples may or may not be `complete', under some notion of completeness appropriate to the kind of material in hand; a `complete' sample may be referred to as a text. It should however be stressed that such concepts are far from unproblematic. For example, when considering newspaper materials (which form the major components of many contemporary corpora), there is no obvious answer to the question ``what is a text?''. It might be a single issue of a newspaper, or a single story within an issue. It might be a collection of different stories from different issues on a common theme (for example, the sports pages), or a series of issues grouped arbitrarily by date or sequence number. It might be a subdivision of an issue (for example, the regionally variant pages common in Italian newspapers), or a sequence of them (for example, a serialized story running across a sequence of issues). It might be a single paragraph in a collection of paragraphs of snippets, or a single advertisement within a classified section.

The arbitrariness with which the term `text' must be defined becomes increasingly apparent as the coverage required of a corpus begins to extend. Leaflets, handbills, packaging, and advertisements form textual units of one kind, while unpublished letters, notes, essays, memoranda and reports present units of another. Advertising in particular pushes the boundary of what may be considered as text into the field of multimedia, while the collection of transcribed speech immediately requires the definition of what units of spoken discourse approximate to the function `text' in written discourse.

A text is regarded by the TEI simply as a distinct object carrying its own header. In this respect, the TEI `text' seems very similar to the notion of a conventional printed work, with its title page or other bibliographic identification. However, it is perfectly legitimate (and indeed, necessary) to use the TEI <text> element to mark the boundaries of other textual objects, such as those listed above. The distinguishing feature of a TEI text is simply that it can be regarded as in some sense complete, and that it has an associated distinct set of descriptive information.

This does not, of course, preclude the possibility that collections of such objects may equally be treated as texts. Pursuing the analogy with conventional printed works, we may regard a collection of TEI <text> elements, which functions both as a single object (the collection) and as many (the constituents), in the same way as we would an anthology, or the collected works of an author. The TEI <group> element is provided for exactly this purpose: it allows the body of a text to be composed of a hierarchically organized collection of texts (or other groups). Note, however, that one TEI <text> element cannot be nested directly within another: as the production rules given below show, an intervening <group> is required.

The grouping of texts in this way, where each text shares a common bibliographic identity, should be distinguished from the grouping of texts to form a corpus. A group of texts forming an organic unity (such as an anthology) is not the same as a group of texts combined to form a corpus but previously existing as independent entities. This is reflected in the way that each of the component texts of a corpus carries its own bibliographic identification, represented by a <teiHeader> element.

The <group> element may however be very useful in the definition of `subcorpora' within a large corpus, provided of course that all the components of the subcorpora concerned can be made to share a common header. We return to this question below in discussing the TEI Header's use of declarable and declaring elements.

The overall structure of a TEI-conformant corpus may thus be summarized by the following set of production rules (simplifying somewhat):

teiCorpus -> header, tei+
tei       -> header, text
text      -> front?, (body|group), back?
group     -> front?, (text|group)+, back?
body      -> components*, (div|div1)*
div       -> components*, div*
div1      -> components*, div2*
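As an illustration only, the upper levels of these rules can be checked mechanically. The following Python sketch uses the standard xml.etree.ElementTree module and treats each rule as a regular expression over the names of an element's children. The element names (header, tei, and so on) are those of the simplified rules above rather than of the full TEI DTD, and the function name check_structure and the two sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models as regular expressions over the sequence of child element
# names, each name followed by one space. These follow the simplified
# production rules, not the full TEI DTD (an assumption of this sketch).
CONTENT_MODELS = {
    "teiCorpus": r"header (tei )+",
    "tei": r"header text ",
    "text": r"(front )?(body |group )(back )?",
    "group": r"(front )?(text |group )+(back )?",
}


def check_structure(element):
    """Return a list of messages, one per element violating its content model."""
    problems = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            problems.append(
                f"<{element.tag}> has invalid content: [{children.strip()}]"
            )
    # Elements without a model (body, front, back, ...) are not checked
    # themselves, but their descendants still are.
    for child in element:
        problems.extend(check_structure(child))
    return problems


# Hypothetical corpus: one simple text and one composite text using <group>.
GOOD = """<teiCorpus><header/>
  <tei><header/><text><body><p>A story.</p></body></text></tei>
  <tei><header/><text><group>
    <text><body><p>One.</p></body></text>
    <text><body><p>Two.</p></body></text>
  </group></text></tei>
</teiCorpus>"""

# Hypothetical invalid case: a text nested directly inside a text.
BAD = "<tei><header/><text><text><body/></text></text></tei>"

print(check_structure(ET.fromstring(GOOD)))
print(check_structure(ET.fromstring(BAD)))
```

Under these rules the first document produces no messages, while the second is rejected because its outer text contains a text with no intervening group.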

2.2 Components

Within this overall structure, different kinds of text will be made up of different components. For written corpora, typically, components will be such items as headings, lists, paragraphs etc. Spoken corpora, by contrast, are typically composed of individual utterances or speaker-turns, noises, paralinguistic and other events etc. In analysed corpora, either or both types of component will typically be sub-divided into linguistic segments such as sentences, clauses, phrases, words, or morphemes. In aligned corpora, any or all of these components existing in different texts may be linked or associated in some way.

The particular tags proposed by the TEI for these components are not discussed in detail here since they are well documented elsewhere. It is worth stressing however that the modular nature of the TEI architecture allows for the construction of specific document type definitions (DTDs) appropriate to particular types of text, so that, for example, elements such as <u> (for spoken utterance) can appear only in transcribed speech, and, equally importantly, <p> (for paragraph) cannot. For corpora such as the BNC which mix texts of different kinds, a special DTD must be constructed to enforce these constraints, since the pure TEI approach is to err on the side of permissiveness. The means by which such special DTDs are constructed in ways conformant with the TEI architecture is a technical topic beyond our present scope: the definitive account is provided in chapter 3 of the TEI Guidelines, and the topic is also addressed by introductory articles such as Sperberg-McQueen and Burnard 1995 and TEI 1995.

Whichever version of the TEI DTD is in force, however, some form of segmentation intermediate between high level structural units such as chapters, utterances, or paragraphs and low level tokens (words, morphemes etc.) will be essential to provide a uniform reference system that can identify units within the corpus.

To facilitate this, the TEI provides two distinct neutral segmentation elements, <s> and <seg>; a third (<ab> for anonymous block) has recently been proposed, but is not yet part of the Guidelines. The distinction between <s> and <seg> is that the former may not self-nest, while the latter may. In either case, the semantics associated with the segmentation are deliberately left unstated, since they will vary greatly between applications. It is however customary to use <s> for orthographically-defined sentence-like units, used to divide the whole of a corpus end-to-end, and to use <seg> for more detailed linguistically motivated segmentation within this, for example for such units as phrases, or clauses. Since, however, <seg> elements may self-nest to any desired depth, they may also be used to tag individual words, or morphemes, etc. We discuss in section 4.1 below some of the ways these segments may be used to categorise individual components of a corpus.
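The difference in nesting behaviour between the two elements can be made concrete with a short Python sketch, again using the standard xml.etree.ElementTree module. The function names, the sample markup, and the values given to the type attribute are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def nested_s_elements(root):
    """Return every <s> element that has another <s> somewhere inside it."""
    return [s for s in root.iter("s") if any(d is not s for d in s.iter("s"))]


def seg_depth(element):
    """Return the deepest level of <seg> nesting at or below `element`."""
    deepest_child = max((seg_depth(child) for child in element), default=0)
    return deepest_child + (1 if element.tag == "seg" else 0)


# Hypothetical paragraph: <s> divides it end-to-end, <seg> nests inside <s>.
SAMPLE = (
    "<p><s><seg type='clause'><seg type='phrase'>The corpus</seg> "
    "<seg type='phrase'>is large</seg></seg>.</s><s>It grows.</s></p>"
)

# Hypothetical invalid markup: one <s> placed inside another.
BAD = "<p><s>Outer <s>inner</s></s></p>"

sample_root = ET.fromstring(SAMPLE)
print(len(nested_s_elements(sample_root)))         # count of self-nested <s>
print(seg_depth(sample_root))                      # depth of <seg> nesting
print(len(nested_s_elements(ET.fromstring(BAD))))  # the outer <s> is reported
```

A check of this kind would normally be performed by a validating parser against the DTD; the sketch merely shows what the constraint amounts to.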

2.3 Reference schemes

Once the structural components of a corpus have been identified, and its hierarchic organization decided upon, then it becomes possible to label any part of it in terms of that organization. The TEI scheme makes this possible by defining global attributes id and n which may be used to supply a unique identifier and a non-unique name or number respectively for any element. Each text may be given an identifier, as may every component within it. The corpus designer must choose how such identifiers are to be allocated, and where they will be most useful.
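The different status of the two attributes can be illustrated by a brief Python sketch: values of id must be unique throughout a document, whereas values of n may freely recur. The function name duplicate_ids and the sample markup are hypothetical, and an SGML parser would ordinarily report a duplicated id itself.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def duplicate_ids(root):
    """Return, in sorted order, the id values borne by more than one element."""
    counts = Counter(
        element.get("id") for element in root.iter() if element.get("id") is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


# Hypothetical sample: n="1" recurs legitimately, but id="D1" is duplicated.
SAMPLE = (
    "<text id='T1'>"
    "<div id='D1' n='1'><p n='1'>First.</p></div>"
    "<div id='D1' n='2'><p n='1'>Second.</p></div>"
    "</text>"
)

print(duplicate_ids(ET.fromstring(SAMPLE)))
```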

For example, it might be considered helpful to assign to sentence 12 of section 23 of part 4 of text 45 an explicit label built from those four numbers. Such a label could indeed be supplied as value for the n attribute on every sentence in a corpus. Although attractive, such a scheme has two disadvantages: firstly, for simple-minded processing, this requires that all sentences appear at the same hierarchic level (a sentence appearing in a non-sectioned part would presumably be given an identifier like 45.4.12); and secondly it repeats redundant information already available to an SGML-aware processor, which must be able to navigate the SGML hierarchy, and can thus always provide answers to such queries as ``What is the identifier of my parent?'' or ``Find the third child of the element with identifier X''.
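The redundancy can be demonstrated with a small Python sketch in which a dotted label, on the pattern of 45.4.12, is derived from the position of an element in the tree rather than stored on it, and in which the two queries just mentioned are answered by navigation. The sample markup, the id values T45 and D23, and the function names are assumptions made for the purpose of illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: text 45, part 4, section 23, sentences 11 to 13.
SAMPLE = """<text n="45" id="T45">
  <div n="4"><div n="23" id="D23">
    <s n="11">Eleventh.</s><s n="12">Twelfth.</s><s n="13">Thirteenth.</s>
  </div></div>
</text>"""


def parent_map(root):
    """Map each element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def derived_label(element, parents):
    """Join the n values of an element and its ancestors, outermost first."""
    parts = []
    while element is not None:
        parts.append(element.get("n", "?"))
        element = parents.get(element)
    return ".".join(reversed(parts))


def find_by_id(root, identifier):
    """Return the first element whose id attribute equals `identifier`."""
    return next((e for e in root.iter() if e.get("id") == identifier), None)


root = ET.fromstring(SAMPLE)
parents = parent_map(root)
section = find_by_id(root, "D23")

second = list(section)[1]
third = list(section)[2]          # the third child of the element with id D23

print(derived_label(second, parents))  # label computed from the hierarchy
print(parents[third].get("id"))        # the identifier of the third child's parent
```

Since the label can be computed on demand, storing it on every sentence adds nothing that the hierarchy does not already supply.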

An additional practical constraint in the case of large corpora is that SGML-aware systems may be limited in the number of unique identifiers which they can handle within a single document: it is therefore best to restrict their use if at all possible (see further Dunlop 1995).

It has frequently been noted that SGML lacks the kind of scoping rules for name space familiar in computer science, and there have also been several proposals aimed at rectifying this, beyond the scope of our present discussion. For the present, it seems that corpus builders must either make do with small corpora in which the range of identifiers can be global without inconvenience, or adopt their own conventions. In the case of the British National Corpus, with its six and a quarter million segments and 4,124 texts, the method chosen was to number each segment uniquely within a text, using the TEI's n attribute, and to give each distinct text a unique (but meaningless) three-character identifier, supplied as the value of the TEI id attribute. This means that any segment can be uniquely identified by the combination of text-id plus segment number, but that only 4,124 SGML ids are required.
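A reference scheme of this two-part kind might be resolved along the following lines. This is a sketch only, not an account of the software actually used with the BNC: the text identifiers A01 and B7K, the enclosing corpus element, and the function names are hypothetical, and segments are assumed to be tagged as <s> elements bearing an n attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-text corpus; segment numbers restart within each text.
CORPUS = """<corpus>
  <text id="A01"><s n="1">First text, first segment.</s><s n="2">First text, second segment.</s></text>
  <text id="B7K"><s n="1">Second text, first segment.</s></text>
</corpus>"""


def index_texts(root):
    """Index texts by their unique id; this is the only corpus-wide id table."""
    return {text.get("id"): text for text in root.iter("text")}


def resolve(texts, text_id, segment_number):
    """Find a segment from a (text id, segment number) pair, or return None."""
    text = texts.get(text_id)
    if text is None:
        return None
    wanted = str(segment_number)
    return next((s for s in text.iter("s") if s.get("n") == wanted), None)


texts = index_texts(ET.fromstring(CORPUS))
segment = resolve(texts, "A01", 2)
print(segment.text)
print(resolve(texts, "B7K", 2))  # no such segment in that text
```

The table of identifiers grows with the number of texts, not with the number of segments, which is the point of the scheme.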

In the present author's opinion, identifiers should not in general be overloaded with categorical information. That is to say, an identifier should not be expected to convey any information about the object it identifies other than which one it is. Categorization should be performed by more fluid and powerful mechanisms such as those discussed in section 5 below, and not by the accidental form of an identifying code.

The purpose of the reference scheme in an electronic corpus is not only to provide a way of locating individual occurrences within their original source. The ability inherent in SGML to associate document-wide identifiers with any single component of a document also opens up a range of exciting multimedia possibilities. Addressable elements within a corpus (at any level) can be the object of any of the various kinds of linkage discussed in this paper and elsewhere, for example to align audio and transcript. To be addressable, an SGML object needs either to bear its own unique identifier, or to be placed at a known (navigable) location within the document tree. A well-designed reference scheme can greatly facilitate the navigability of a large corpus.