In creating and tagging corpora, particularly large ones assembled from many sources, many editorial and encoding compromises are necessary. In particular, the kind of detailed text-critical attention possible for a smaller literary text may not be judged appropriate. Nevertheless, users of a tagged corpus will not thank the encoder if arbitrary editorial changes have been silently introduced, with no indication of where, or with what regularity. Such corpora can actively mislead the unwary or partially informed user.
A conscientious corpus builder should therefore take care to consider making explicit in the markup at least the following kinds of intervention:
As a simple example of the use of these facilities, consider the following hypothetical case. In transcribing a spoken English text, a word that sounds like `skuzzy' is encountered, which the transcriber does not recognize as one way of pronouncing the common abbreviation `SCSI' (small computer system interface), which is also pronounced as a series of spelled-out letters, `ess - see - ess - eye'. The transcriber T1 may simply encode his or her own uncertainty by a tag such as
<unclear extent="two syllables" desc="sounds like skuzzy (if there is such a word)">
or even (if ultra-cautious):
<gap extent="two syllables" cause="unrecognizable word">
Alternatively, where transcription policy is to include as much as possible, the transcriber may allow for the possibility of `skuzzy' as a lexical item. Assuming that the `correct' spelling for this lexical item is in fact `SCSI', the transcriber may tag it in one of the following ways:
<sic>skuzzy</sic>
<sic corr="SCSI">skuzzy</sic>
<corr sic="skuzzy">SCSI</corr>
<corr>SCSI</corr>
The first of these would be appropriate where no attempt at correction is being made, but the encoder wishes to signal some doubt about the authenticity of the term so tagged. The second would be appropriate where a correction policy has been defined, but the primary focus of the transcription is to present exactly what the transcriber heard. The third reverses these priorities, and would be appropriate where corrected orthography is of higher priority than authenticity. The last would be appropriate where clearly defined orthographic principles are in existence, and where the original form is of little or no interest.
The same principles apply to the treatment of apparent typographic error in printed originals, and the same mechanism may be used as a simple way of handling the problem of normalizing regional or other variant forms. For example, in modern British English, contracted forms such as `isn't' exhibit considerable regional variation, with forms such as `isnae', `int' or `ain't' being quite orthographically acceptable in certain contexts. An encoder might thus choose any of the following to represent the Scots form `isnae':
<reg>isn't</reg>
<reg orig="isnae">isn't</reg>
<orig reg="isn't">isnae</orig>
<orig>isnae</orig>
Which of these variant encodings is appropriate depends on the intentions and policies of the encoder: these should be stated in the appropriate part of the encoding description in the Header (see further section 5).
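The practical consequence of these choices can be illustrated with a minimal sketch in Python. It assumes that each tagged span is a simple, unnested element written with at most one double-quoted attribute, as in the examples above; the function name `reading` and the labels `original` and `edited` are hypothetical and are not part of the TEI scheme. A real corpus would call for a proper SGML parser rather than a regular expression.

```python
import re

# For each element: the attribute that carries the alternative reading,
# and whether the element content is the original or the edited form.
# (Hypothetical table for illustration only.)
_PAIRS = {
    "sic": ("corr", "original"),
    "corr": ("sic", "edited"),
    "orig": ("reg", "original"),
    "reg": ("orig", "edited"),
}

# Matches an unnested element with zero or one double-quoted attribute.
_ELEMENT = re.compile(
    r'<(?P<tag>sic|corr|orig|reg)(?:\s+(?P<attr>\w+)="(?P<value>[^"]*)")?>'
    r'(?P<content>[^<]*)</(?P=tag)>'
)


def reading(text, want):
    """Replace each tagged span by its 'original' or 'edited' reading."""
    if want not in ("original", "edited"):
        raise ValueError("want must be 'original' or 'edited'")

    def choose(match):
        alt_attr, content_kind = _PAIRS[match.group("tag")]
        # Use the element content if it already is the wanted reading,
        # or if the encoder recorded no alternative in the attribute.
        if content_kind == want or match.group("attr") != alt_attr:
            return match.group("content")
        return match.group("value")

    return _ELEMENT.sub(choose, text)


sample = 'a <sic corr="SCSI">skuzzy</sic> disk'
print(reading(sample, "original"))
print(reading(sample, "edited"))
```

Where the encoder has recorded only one form, as in `<corr>SCSI</corr>` or `<orig>isnae</orig>`, the sketch can only return that form whichever reading is requested, which is exactly the loss of information that the choice of encoding implies.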
The <gap> element may also be used to indicate where non-linguistic (or linguistically intractable) material such as symbols, diagrams, or tables has been omitted because the effort involved in a more detailed transcription (using the specific TEI elements <figure> or <table>) is not considered worthwhile:
<gap desc="diagram">
It is also useful where material has been omitted for sampling reasons, so as to alert the user to the dangers of using such partial transcriptions for analysis of text-grammar features:
<div type=chapter>
<gap extent="100 sentences" cause="sampling strategy">
<s>This is not the first sentence in this chapter
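A user of such a corpus might check for omissions before analysis. The following is a minimal sketch in Python, assuming that <gap> is written as an empty element with double-quoted attribute values as in the examples above; the function name `list_gaps` is hypothetical.

```python
import re

# An empty <gap ...> tag and its attribute="value" pairs.
_GAP = re.compile(r"<gap\b([^>]*)>")
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def list_gaps(text):
    """Return one dict of attributes for each <gap> tag found in text."""
    return [dict(_ATTR.findall(m.group(1))) for m in _GAP.finditer(text)]


sample = (
    '<div type=chapter> '
    '<gap extent="100 sentences" cause="sampling strategy"> '
    '<s>This is not the first sentence in this chapter'
)
for gap in list_gaps(sample):
    print(gap.get("extent"), "omitted:", gap.get("cause"))
```

A non-empty result signals that counts of sentence-initial or chapter-initial features drawn from the text should be treated with caution.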
The TEI elements mentioned above all function very well for simple cases where the components to be tagged do not conflict with the single hierarchy required by basic SGML. Where this is not the case, one of the `multiple hierarchy workarounds' discussed in Barnard et al. (1995) and elsewhere must be used.
As noted above, the specific editorial policy adopted in a corpus text should be documented in the appropriate section of its header. The elements concerned are repeatable, and a mechanism is defined by which particular corpus texts, or components thereof, can specify which of possibly several options apply to them (see further section 5).