4 Analytic encoding

A corpus may simply consist of sequences of orthographic words and punctuation, sometimes known as plain text. However, texts are not just sequences of words; they have many other features worthy of attention and analysis. As we have seen, some of these will correspond with structural features or components, and will thus necessarily be tagged or made explicit in the markup. In this section we consider the addition of tags for textual features that are (in principle at least) recognizable only by someone who understands the text.

Corpus-builders do not in general have the leisure to read and manually tag the majority of their materials; detailed distinctions must therefore be made either automatically or not at all (and the markup should make explicit which was the case!). The TEI scheme, however, is intended to be useful to both close and distant readers of a text, and hence supports a very wide range of options for detailed tagging, only some of which we can comment on here.

In the simplest case, a corpus builder may be able reliably to encode only the visually salient features of a written text such as its use of italic font or emphasis. A single element, <hi> for highlighted, is provided for this purpose, and may be used to mark all visually distinct phrases or words not already tagged for their structural properties. (The nature of the highlighting or rendition, if this is considered important, may be conveyed by the rend global attribute, which behaves in a similar way to the lang attribute discussed in section 4.1 below; this attribute can therefore be attached at the highest appropriate structural element and inherited as required.)

At a later stage, or following the development of suitably intelligent tools, it may be possible to review the elements which have been marked as visually highlighted, and assign a more specific interpretive textual function to them: again, the TEI provides a choice of tags for this purpose. Examples of the range of textual functions of this kind pre-defined by the TEI scheme include quotation, foreign words, linguistic emphasis, mention rather than use, titles, technical terms, glosses, etc.

The performance of such tools as morpho-syntactic taggers may occasionally be improved by pre-identification of these, and of other kinds of textual features which are not normally highlighted, such as names, addresses, dates, measures, etc. It remains debatable whether effort is better spent on improving the ability of such tools to handle arbitrary text, or on improving the performance of pre-tagging tools. In any case, the TEI provides a number of tags which can be used to make explicit that particular words or phrases have been recognized as (for example) dates, times, personal names, abbreviations, titles, measures, numbers, technical terms, etc. The majority of these also allow for the addition of normalized values in the form of attributes, so that (for example) all references to the same individual by different names may be linked together, dates, times, and quantities may be normalized, abbreviations expanded, etc. For details, any introduction to the TEI scheme should be consulted.

The process of encoding or tagging a corpus is therefore best regarded as the process of making explicit a set of more or less interpretive judgments about the material of which it is composed. Where the corpus is made up of reasonably well understood material (such as contemporary linguistic usage), it is reasonably easy to distinguish such interpretive judgments from apparently objective assertions about its structural properties, and hence convenient to represent them in a formally distinct way. Where corpora are made up of less well understood materials (for example, in ancient scripts or languages), the distinction between structural and analytic properties becomes less easy to maintain. Just as, in some models of cognition at least, a text triggers meaning but does not embody it, so a text triggers multiple encodings, each of equal formal validity, if not utility.

Linguistic annotation of almost any kind may be attached to components at any level from the whole text to individual words or morphemes. At its simplest, such annotation allows the analyst to distinguish between orthographically similar sequences (for example, whether the word `Pat' at the beginning of a sentence is a proper name, a verb, or an adjective), and to group orthographically dissimilar ones (such as the negatives `not' and `-n't'). In the same way, it may be convenient to specify explicitly the base or lemmatized version of a word as an alternative for its inflected forms (for example, to show that `is', `was', `being', etc. are all forms of the same verb), or to regularize variant orthographic forms (for example, to indicate in a historical text that `morrow', `morwe' and `morrowe' are all forms of the same token). More complex annotation will use similar methods to capture one or more syntactic or morphological analyses, or to represent such matters as the thematic or discourse structure of a text.

Corpus work in general requires a modular approach in which basic text structures are overlaid with a variety of such annotations. These may be conceptualized as operating as a series of layers, or levels (as in the Multext model), or as a complex network of descriptive pointers (as in the NSL tools approach). The distinction is of some importance, because of the hierarchic nature of SGML and all approaches based upon it, such as the TEI.

We propose here a simple three-way classification of the kinds of analytic encoding commonly encountered. Firstly, and probably most commonly, we find analytically encoded corpora which apply some form of categorization to one or more of their constituent components, by associating them with some pre-defined descriptive category. Part-of-speech tagging, as widely practiced in corpus linguistics, is an obvious case; but essentially the same mechanism is used when, say, discourse functions are associated with stretches of a spoken text.

Secondly, we may group together all the forms of analytic encoding concerned with the identification of non-structural units, or higher-level components which have not been formally identified in the text. This class of analysis, which we call here clustering, may be distinguished from the basic structural markup of a corpus by virtue of the fact that the units typically identified by it are generally not `well-behaved' with respect to the standard document hierarchy.

Thirdly we group under the title correspondence the whole range of analytic encodings concerned with the identification of associations between one component of a text and another, in the same or a different text. As special cases, we may cite translation equivalence, which is particularly important in the expanding field of multilingual comparable corpora, or anaphoric reference, which is of equal importance in text understanding systems.

The TEI proposals in support of each of these requirements are briefly described in the following three sections.

4.1 Categorization

As noted above, many linguistic features are also inherent to the structure and organization of the text, indeed inseparable from it. A common requirement therefore is to associate an interpretive category with one or more elements at some level of the hierarchy.

In the TEI scheme the categorization of an element may be supplied in a number of different ways. It may be implied by the presence of information in the header associated with the element in question (see further section 5). It may be inherited from a parent element occurrence, or explicitly assigned by an appropriate attribute. The latter case is the more widely used, but we begin by discussing some aspects of the former.

If we say that a text is a newspaper or a novel, it is self-evident that journalistic or novelistic properties respectively are inherited by all the components making up that text. In the same way, any major structural division of a TEI tagged text can specify a value which is understood to apply to all elements within it. Two attributes in particular are provided to characterise texts in this way: the type attribute, and the lang attribute.

In SGML terms, the type attribute has a declared value of CDATA and a default of #CURRENT. This means firstly that type may have any value, and secondly that whatever value is given will become the default for all subsequent elements of the same kind. As an example, consider a newspaper section composed of small ads:

<div1 type="adSection">
<head>For sale</head>
<div2 type=ad>
<p>Large French chest available....
<div2>
<p>Pair of skis, one careful owner...

In this example, the second advertisement does not need to specify its type: it defaults to that of the preceding element of the same kind, i.e. the preceding <div2>.
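The effect of a #CURRENT default can be sketched in a few lines of Python. This is an illustration only, not part of the TEI scheme: the function name resolve_current and the list of (element name, attributes) pairs standing in for a parser's output are assumptions made here, and the sketch simply tracks the most recent value separately for each element type.

```python
# Minimal sketch of SGML "#CURRENT" defaulting: an unspecified attribute takes
# the value most recently specified on an element of the same kind.
# The (name, attrs) pairs are a hypothetical stand-in for a parser's output.

def resolve_current(elements, attr="type"):
    """Return (name, value) pairs with #CURRENT defaults filled in."""
    last_seen = {}   # element name -> most recently specified value
    resolved = []
    for name, attrs in elements:
        if attr in attrs:
            last_seen[name] = attrs[attr]
        resolved.append((name, last_seen.get(name)))
    return resolved


stream = [
    ("div1", {"type": "adSection"}),
    ("div2", {"type": "ad"}),   # first advertisement: type given explicitly
    ("div2", {}),               # second advertisement: type omitted
]
resolved = resolve_current(stream)
print(resolved)
```

The second <div2> is reported with the type ad although it carries no attribute of its own, which is the behaviour described above.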

A natural requirement not currently supported by the TEI Guidelines is for the user to be able to control the legal values of the type attribute. This is particularly important given the tendency within the Guidelines to use the type attribute as a way of extending the semantics of the markup in areas where no more precise tag exists. This could be done by using a parameter entity to specify the declared value for the type attribute on each element for which user-control was needed. For example, if the declared value for the type attribute on DIV were specified as %divType;, with the parameter entity divType defaulting to CDATA, then the user could override this (for example, to say that type must be either PINK or GREEN) by supplying a parameter entity redefinition like the following in the document's DTD subset:

 <!ENTITY % divType "(pink|green)">

Regrettably perhaps, the situation is not quite as simple as this in the current Guidelines, for technical reasons relating to the way the TEI class system is defined which are beyond the scope of this paper. A better solution therefore would be for the user to modify the TEI DTD to allow for more exact tagging of typed <div>s, by introducing new `sugared' divs, such as <ad> and <letter>. (The term sugared or flavoured is used here as an indication that <ad> should be regarded as syntactic sugar for the perhaps less palatable but semantically equivalent formulation <div type=ad>.) In the same way, the TEI predefines a number of `sugared' elements to represent basic linguistic segmentation of the sentence, clause, phrase, word, morpheme type: <cl> is equivalent to <seg type=clause>, <m> to <seg type=morpheme>, etc.

Attribute values may also be specified, not as current, but as inherited. This is a TEI specific notion, not directly enforced by current SGML processors, though it is likely to be included in later revisions of the SGML standard. It operates in much the same way as the example discussed above, but with the modification that defaulted values are inherited not from the preceding element of the same kind, but from the parent element. Its most obvious application in the TEI scheme is in the use of the lang attribute. This global attribute can be used to specify the language and writing system applicable to a given part of a TEI document at any hierarchic level. In a corpus composed of material in different languages it will usually be more convenient to specify the language at a fairly high level. For example:

<div type=section lang=FRA>
<head>Section française</head>
<s id=S1>Cette phrase est en français.</s>
<s id=S3>Celle-ci également.</s></div>
<div type=section lang=ENG><head>English Section</head>
<s id=S2>This sentence is in English.</s>
<s id=S4>As is this one.</s>
<s id=S5 lang=FRA>Celle-ci est en français.</s>
<s id=S6>This one is not.</s></div>

A TEI conformant application is required to assume that the sentences S2, S4, and S6 are in English, while S1 and S3 are in French, even though this is not explicitly stated in the markup, because the default value of their lang attribute is defined as inherited, and they therefore inherit the value of this attribute from their parent <div> elements. Sentence S5 explicitly overrides this default assumption, by supplying a value for its lang attribute, but S6 reverts to type, even though it follows S5. Of course, a non-TEI compliant, or SGML-unaware, application may have difficulty in implementing these requirements, since access to the document tree is necessary in order to determine the parent of a given element. This cost must be weighed against the advantage, in terms of reduced data preparation cost and markup complexity.
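The tree walk that inheritance requires can be sketched as follows. This is an illustration under stated assumptions rather than a TEI-conformant processor: the example is re-expressed in XML syntax (quoted attribute values, explicit end tags) so that Python's standard xml.etree.ElementTree parser can read it, a <body> wrapper is added only to give the fragment a single root, and the function name effective_lang is our own.

```python
import xml.etree.ElementTree as ET

# XML-style rendering of the example above; <body> is an added wrapper.
DOC = """
<body>
  <div type="section" lang="FRA">
    <head>Section française</head>
    <s id="S1">Cette phrase est en français.</s>
    <s id="S3">Celle-ci également.</s>
  </div>
  <div type="section" lang="ENG">
    <head>English Section</head>
    <s id="S2">This sentence is in English.</s>
    <s id="S4">As is this one.</s>
    <s id="S5" lang="FRA">Celle-ci est en français.</s>
    <s id="S6">This one is not.</s>
  </div>
</body>
"""


def effective_lang(element, inherited=None, out=None):
    """Map each identified element to its own lang value or, failing that, its parent's."""
    out = {} if out is None else out
    lang = element.get("lang", inherited)    # explicit value overrides the inherited one
    if element.get("id") is not None:
        out[element.get("id")] = lang
    for child in element:
        effective_lang(child, lang, out)     # children inherit from this element
    return out


langs = effective_lang(ET.fromstring(DOC))
print(langs)
```

Because the inherited value is passed down from parent to child rather than carried along the sequence of sentences, S6 comes out as ENG even though the sentence immediately before it is marked FRA.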

Validation of the codes used as values for the lang attribute is specified exactly in the TEI Guidelines, and therefore is only briefly summarized here. Essentially, the value used must identify a special purpose <language> element defined within the header, which contains a descriptive name for the language itself, and a reference to the writing system employed, which may in turn be documented by a formal writing system declaration. It is important to note that in the TEI scheme the lang attribute specifies a language and writing system pair (for example, ``Greek using beta-code notation'') rather than solely a natural language as it does in the CES. In the next section, we discuss in more detail a variety of methods by which the code used to categorize other SGML elements may be validated.

4.2 Validation of categories

The TEI scheme offers an unusually rich variety of ways in which an encoder can make possible both the automatic validation and the documentation of the particular analytic codes or categorizations embedded in a document, for example to identify a particular linguistic category such as part of speech code, or discourse function.

Some kinds of validation can be carried out automatically by an SGML aware system: for example, where an explicit set of declared values is provided. Where an attribute of declared value IDREF is used, similarly, the parser will check to see that some other element using the supplied value as its identifier exists in the document. This is a feature quite heavily used in the TEI scheme, since it enables both validation and documentation. For example, when the TEI additional tag set for analysis is enabled, several new global attributes become available, including one, ana, of type IDREFS. Its value is intended to be one or more codes identifying the analysis applied to the element on which it appears. These codes in turn are defined in some (presumably smaller) set of elements elsewhere in the document, typically the header.

As an example, consider the phrase `analysed corpora', which might be tagged as follows:

<w ana=VVD>analysed</w>
<w ana=NN2>corpora</w>

Morpho-syntactic analyses of this kind are relatively commonplace and well understood, so that (in this particular case) the encoder may feel that no further documentation or validation of the codes VVD or NN2 is needed. Indeed, if the encoder had chosen to use the type attribute to categorize each <w> element, no further validation would be possible, since the TEI scheme does not currently validate values of the type attribute, as noted above. Since however, the ana attribute has been used, and since this mechanism is intended to support rather more ambitious and complex forms of categorization, we assume that the encoder in this case wishes to do rather more than simply associate an opaque or undefined code with each <w> element.

As a first step, the encoder is required to provide somewhere in the same SGML document an element bearing the identifiers specified. This could, of course, be any type of element, but TEI recommended practice is to use an element in which the meaning or function of the code can be documented. The simplest way of achieving this is with an <interp> element, as follows:

<interp id=VVD value="past tense adjectival form of lexical verb">
<interp id=NN2 value="plural form of common noun">
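The check that an SGML parser performs on IDREFS values of this kind amounts to the following sketch. It is a simplified illustration, not a substitute for a validating parser: the function name undefined_codes, the plain Python structures holding the tagged words and the <interp> definitions, and the third token with its code XX0 are all hypothetical, introduced only to show a failing case.

```python
# Codes defined by <interp> elements, keyed by identifier.
interps = {
    "VVD": "past tense adjectival form of lexical verb",
    "NN2": "plural form of common noun",
}

# (word, value of the ana attribute); ana may hold several space-separated codes.
tokens = [
    ("analysed", "VVD"),
    ("corpora", "NN2"),
    ("texts", "NN2 XX0"),   # hypothetical token using an undefined code
]


def undefined_codes(tokens, interps):
    """Return, in order of first use, the ana codes that no interp defines."""
    missing = []
    for _word, ana in tokens:
        for code in ana.split():
            if code not in interps and code not in missing:
                missing.append(code)
    return missing


print(undefined_codes(tokens, interps))
```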

In many cases, the string used to identify the particular category to which some component of a corpus text is assigned will be an arbitrary code rather than a recognizable name. Corpus builders tend however to choose codes which have some significance, to a human reader if not to a computer. For example, in the BNC, the code NN1 is assigned to tokens classified as singular common noun, while the code NN2 is assigned to tokens classified as plural common noun. From a formal point of view, this code might be better regarded as combining a word class code (NN) with a number indicator (1 or 2); however, from the encoding point of view, the codes NN1 and NN2 are not decomposable.

To represent explicitly this implicit hierarchy of terms, the TEI scheme also allows for interpretative categories to be grouped hierarchically, using the <interpGrp> element. Thus, one could construct a typology of word class codes along the following lines:

<interpGrp id=NN value="common noun">
<interp id=NN1 value="singular common noun">
<interp id=NN2 value="plural common noun">
</interpGrp>

The hierarchy could obviously be extended by nesting groups of the same kind. We might for example mark the grouping of common (NN) and proper (NP) nouns in the following way:

<interpGrp value="nominal">
<interpGrp id=NN>
<interp id=NN1 value="singular common noun">
<interp id=NN2 value="plural common noun">
</interpGrp>
<interpGrp id=NP>
<interp id=NP1 value="singular proper noun">
<interp id=NP2 value="plural proper noun">
</interpGrp>
</interpGrp>
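One use of such a nested typology is to recover the groups to which a given code belongs. The following sketch assumes an XML-style rendering of the hierarchy (explicit end tags, empty <interp> elements) so that Python's xml.etree.ElementTree can parse it; the function name enclosing_groups is hypothetical, and a group is labelled by its id where it has one and by its value otherwise.

```python
import xml.etree.ElementTree as ET

TYPOLOGY = ET.fromstring("""
<interpGrp value="nominal">
  <interpGrp id="NN">
    <interp id="NN1" value="singular common noun"/>
    <interp id="NN2" value="plural common noun"/>
  </interpGrp>
  <interpGrp id="NP">
    <interp id="NP1" value="singular proper noun"/>
    <interp id="NP2" value="plural proper noun"/>
  </interpGrp>
</interpGrp>
""")


def enclosing_groups(node, code, path=()):
    """Return the labels of the groups enclosing the interp with this id, outermost first."""
    if node.tag == "interp":
        return list(path) if node.get("id") == code else None
    label = node.get("id") or node.get("value")
    for child in node:
        found = enclosing_groups(child, code, path + (label,))
        if found is not None:
            return found
    return None   # code not defined anywhere below this node


print(enclosing_groups(TYPOLOGY, "NP2"))
```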

Alternatively, and with more delicacy, one could unbundle the linguistic interpretations entirely by regarding them as a set of typed feature structures. The feature structure notation developed by the TEI is of considerable richness and delicacy, and permits the representation of a very wide range of structures, not limited to the simple hierarchies provided by the SGML mechanisms outlined so far. A full description is beyond the scope of the present paper, but the following brief indication may be helpful. (The interested reader is referred to the relevant chapter of the TEI Guidelines, or to Langendoen and Simons 1995, which provides a useful introduction.)

A feature, in this scheme, is defined as a pair, comprising a name and a value. The latter may be one of a defined set of value types, including boolean (plus or minus), numeric, string (an unclosed set of values), symbol (one of a defined set), a feature structure, or a reference to one. A feature structure is a named combination of such features, ordered or unordered.

For example, in the preceding analysis, we have identified the features number and proper, with values singular or plural, and plus or minus respectively. (The decision as to the appropriate domain for a value is inevitably arbitrary: we have here chosen to regard number as being a symbolic value to allow for the possibility of additional values such as dual or uncountable). These features may be combined to form feature structures corresponding to the codes given above as follows:

<fs id=NP1>
<f name=class><sym value=noun></f>
<f name=number><sym value=singular></f>
<f name=proper><plus></f></fs>
<fs id=NP2>
<f name=class><sym value=noun></f>
<f name=number><sym value=plural></f>
<f name=proper><plus></f></fs>

To reduce the redundancy of this representation, one may specify the features making up a given feature structure by reference. This requires that the features to be used are first specified independently of the structures in which they are to be combined, using a construct known as a feature library, represented by a <fLib> element, each one being given a unique identifier, as follows:

<fLib>
<f name=class id=FCN><sym value=noun></f>
<f name=number id=FN1><sym value=singular></f>
<f name=number id=FN2><sym value=plural></f>
<f name=proper id=FPP><plus></f>
<f name=proper id=FPM><minus></f>
</fLib>

Each of the feature structures attested can now be represented by reference to these underlying primitives, using the feats attribute, as follows:

<fs id=NN1 feats="FCN FPM FN1">
<fs id=NN2 feats="FCN FPM FN2">
<fs id=NP1 feats="FCN FPP FN1">
<fs id=NP2 feats="FCN FPP FN2">

It should be apparent how this approach permits an SGML aware processor to identify automatically linguistic analyses where features such as number or properness are marked, independently of the actual category code (the NN1 or NP2) used to mark the analysis. In addition, of course, the use of the SGML ID/IDREF mechanism allows for simple validation of the codes used. For more sophisticated validation, for example to ensure that the feature properness cannot be both plus and minus in the same analysis, the TEI specifies an additional declarative mechanism, known as a feature system declaration: the scope of this is however beyond the present discussion.
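A minimal sketch of such feature-based retrieval follows. The dictionaries stand in for the feature library and the feature structures that point into it, the boolean values True and False stand in for <plus> and <minus>, and the function names expand and codes_with are our own; none of this is prescribed by the TEI.

```python
# Feature library: identifier -> (feature name, value).
FEATURES = {
    "FCN": ("class", "noun"),
    "FN1": ("number", "singular"),
    "FN2": ("number", "plural"),
    "FPP": ("proper", True),    # <plus>
    "FPM": ("proper", False),   # <minus>
}

# Feature structures: category code -> value of the feats attribute.
STRUCTURES = {
    "NN1": "FCN FPM FN1",
    "NN2": "FCN FPM FN2",
    "NP1": "FCN FPP FN1",
    "NP2": "FCN FPP FN2",
}


def expand(code):
    """Dereference the feats pointers of one feature structure."""
    return dict(FEATURES[ref] for ref in STRUCTURES[code].split())


def codes_with(name, value):
    """Return the category codes whose analysis has the given feature value."""
    return sorted(code for code in STRUCTURES if expand(code).get(name) == value)


print(codes_with("number", "plural"))
print(codes_with("proper", True))
```

A query for plural analyses thus retrieves both NN2 and NP2 without any knowledge of how the category codes themselves are spelled.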

4.3 Clustering

We have already discussed the facilities offered by the TEI for hierarchic clustering of corpus components. However, there is a very important need also to identify units which do not fit well within the hierarchic framework. For example, in a discourse analysis, the particular phases of a conversation may include discontinuous sequences that cut across the simple structure defined by the markup of individual utterances; in a written text, similarly, there may be stylistic features forming discontinuous sequences across the chapter-paragraph-sentence-word hierarchy.

As a simple example, we consider speech as it is commonly represented in writing. Direct speech, marked in the TEI scheme by the <q> element, is almost invariably interrupted by a reporting clause, as in the following example:

<q>You put it in the safe,</q> Quill reminded him, 
<q>and it is certainly still there</q>.

The problem here is that the two parts of Quill's speech should be encoded as a single unit in some sense, particularly if some form of linguistic segmentation is to be introduced. In an earlier version of the Guidelines, a special purpose tag <in.quot> was suggested in order to sweep this problem under the carpet by tagging the reporting clause itself:

<q>You put it in the safe, <in.quot>Quill reminded him,
</in.quot>and it is certainly still there</q>.

This solution, though at first sight attractive, was not adopted, largely because it appeared to be too specific to the case of reporting clauses, which might better be handled by simply using an appropriately typed <seg> element. Moreover, in any realistic sampling of text we would expect to find reporting or interrupting clauses of considerably more internal complexity than this simple example.

Suppose that a simple linguistic segmentation is introduced into this text. The natural way to do this would be to regard the whole of the sentence as a single <s> element:

<s><q>You put it in the safe,</q> Quill reminded him, 
<q>and it is certainly still there</q></s>.

If our segmentation is based on finite clauses rather than orthographic convention, we would however prefer to encode this sentence as follows:

<q><s>You put it in the safe,</s></q>
<s>Quill reminded him,</s>
<q><s>and it is certainly still there</s></q>.

However, a sentence in which the quoted material spanned two finite clauses would give us a headache. If, for example, the sentence began:

<q>You put it</q>, Quill reminded him, <q>in the safe</q>

our segmentation would look like this:

<q><s id=s1a>You put it,</s></q>
<s id=s2>Quill reminded him,</s>
<q><s id=s1b>in the safe</s></q>

and we would be faced with the problem of making explicit that the sentences with identifier s1a and s1b have more in common with each other than either does with that with identifier s2.

(Note that in the general case we can assume neither that <s> will be subordinate to <q> nor the reverse; in either case we will have problems maintaining both hierarchies in parallel.)

Two solutions to the class of discontinuous segmentation problems of which this forms an example are proposed in the TEI Guidelines. The first is to use special purpose linking attributes next and prev, either alone or in combination with a part attribute. The part attribute indicates that the element bearing it is incomplete in some sense (its default value is no); the other two are used to point to the associated fragment by specifying its identifier, as follows:

<q><s id=s1a next=s1b part=y>You put it,</s></q>
<s id=s2>Quill reminded him,</s>
<q><s id=s1b prev=s1a part=y>in the safe</s></q>
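An application wishing to reassemble the fragmented sentence has only to follow the next pointers. The following is a sketch under simplifying assumptions: the segments are held in a Python dictionary keyed by identifier in document order, fragments are joined with a single space, and the function name reassemble is hypothetical.

```python
# Segments keyed by identifier, in document order, with their linking attributes.
segments = {
    "s1a": {"text": "You put it,", "next": "s1b"},
    "s2": {"text": "Quill reminded him,"},
    "s1b": {"text": "in the safe", "prev": "s1a"},
}


def reassemble(segments):
    """Join fragments linked by next, starting from each segment that has no prev."""
    units = []
    for seg_id, seg in segments.items():
        if "prev" in seg:
            continue                      # a continuation, reached via its predecessor
        parts, current, seen = [], seg_id, set()
        while current in segments and current not in seen:   # guard against bad or cyclic pointers
            seen.add(current)
            parts.append(segments[current]["text"])
            current = segments[current].get("next")
        units.append(" ".join(parts))
    return units


print(reassemble(segments))
```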

The second solution is to use a free standing `virtual' element to link the two elements. The TEI provides for this purpose either a generic link element <link>, or the more specifically defined <join> element:

<q><s id=s1a>You put it,</s></q>
<s id=s2>Quill reminded him,</s>
<q><s id=s1b>in the safe</s></q>

<join targets="s1a s1b" result="s">
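With this approach the linking information is held apart from the segments themselves, so that a processor must resolve each <join> against a table of identifiers. A minimal sketch, in which the function name apply_joins and the (targets, result) pairs representing <join> elements are assumptions of ours:

```python
def apply_joins(texts, joins):
    """Return one (result, text) pair per join, built from the texts of its targets."""
    virtual = []
    for targets, result in joins:
        parts = [texts[target] for target in targets.split()]
        virtual.append((result, " ".join(parts)))
    return virtual


texts = {"s1a": "You put it,", "s2": "Quill reminded him,", "s1b": "in the safe"}
joins = [("s1a s1b", "s")]   # corresponds to <join targets="s1a s1b" result="s">
virtual = apply_joins(texts, joins)
print(virtual)
```

The marked-up segments are left untouched; the virtual <s> exists only in the output of the resolution step.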

Though elegant in its generalizability, the virtual element solution has not been as widely adopted as the use of additional attributes, perhaps partly because of the lack of definition within the Guidelines as to the intended location of this and similar `out of line' pointing elements, but also because of the additional complexity of programming which their use implies. However, as more sophisticated SGML tools become more widely deployed in the field, it is likely that this kind of solution will be more widely preferred.

A related, though different, choice of solutions is offered to another problem arising from the discontinuities inherent to naturally occurring language: what to do with footnotes or other material which can interrupt the `natural' linguistic flow, such as `pull-quotes' or captions. When a footnote is encountered in a conventionally printed text, there will usually be a reference of some kind embedded in the running text, with the body of the note itself printed elsewhere. Document formatters such as LaTeX handle this situation by simply embedding the full body of the note at the point of insertion, and then handling the formatting as a separate exercise. Following this method, a TEI encoded text can also embed a <note> element at any point in a text, as in this example:

The original text <note id="n1">Made up for the purpose by Wombat and Grillparzer (1996)</note> has been shown to be spurious.

A simple segmentation of the text must now indicate that the words `The original text ' and ` has been shown to be spurious.' form a single segment. For convenience of linguistic annotation, it is often therefore preferable to move such notes out of line, and to encode the note reference explicitly using a <ref> or empty <ptr> element:

The original text <ptr target=n1> has been shown to be spurious.<note id="n1">Made up for the purpose by Wombat and Grillparzer (1996)</note>

As with the <join> example above, the exact location of such relocated elements is a matter of editorial convention and convenience.
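The relocation itself is easily mechanized for simple cases. The sketch below uses a regular expression rather than an SGML parser, so it assumes that notes are not nested, that each carries a quoted id attribute, and that relocated notes are simply appended to the end of the passage; the function name move_notes_out_of_line is hypothetical.

```python
import re

# Matches one non-nested <note> with a quoted id attribute.
NOTE = re.compile(r'<note id="([^"]+)">.*?</note>', re.DOTALL)


def move_notes_out_of_line(text):
    """Replace each embedded note by an empty <ptr> and append the notes at the end."""
    notes = []

    def to_pointer(match):
        notes.append(match.group(0))                 # keep the whole note element
        return "<ptr target=%s>" % match.group(1)    # leave a pointer to it in the text

    body = NOTE.sub(to_pointer, text)
    return body + "".join(notes)


inline = ('The original text <note id="n1">Made up for the purpose by '
          'Wombat and Grillparzer (1996)</note> has been shown to be spurious.')
print(move_notes_out_of_line(inline))
```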

4.4 Correspondence

In a sentence like `Not there they won't', the pronouns, deixis, and ellipsis refer to concepts which are (probably) more fully expressed elsewhere in the text. Identifying those concepts is often important for natural language understanding systems and for machine translation. Substantial work has been carried out on procedures to identify and automatically link such anaphoric features to their antecedents, though opinions are divided as to whether or not the output from such procedures should be recorded in corpora. On the assumption that at least some people will wish to record such decisions in their encoding, rather than simply generate them on the fly, we discuss here how the TEI Guidelines support the encoding of alignment and correspondence.

A particular case of this requirement is the automatic alignment of sentences or words in parallel multilingual corpora, for example in order to aid the comparison of the contexts in which a given term from language A is mapped into one of a number of possible equivalents in language B, or simply to investigate all of its possible equivalences.

Whichever kind of alignment is being carried out, if it is to be made explicit, the encoder faces the same choice. If element P1 is to be aligned with element P2, either there must be a link from P1 to P2 (or vice versa), or some third element T must specify the association between P2 and P1. The TEI scheme supports both methods: here, first, is an example of anaphoric reference, using the pointer method:

<title id=shirley>Shirley</title>, which made its 
Friday night debut only a month ago, was not listed on <name id=NBC>NBC</name>'s new schedule, 
although  <seg id=s1 corresp=NBC>the network</seg> 
says <seg id=s2 corresp=shirley>the show</seg> still 
is being considered.

Here is the same technique used to align two sentences from a parallel corpus:

<s lang=FRA id=ALRTP1 corresp=ROTP1>
  Longtemps je me couchais de bonne heure.</s>
<s lang=ENG corresp=ALRTP1 id=ROTP1>
  For a long time I used to go to bed early.</s>

For both of these, we might alternatively use the TEI <corresp> element (itself a specialized case of the general purpose <link> element mentioned above) as follows:

<corresp type=anaphor targets="NBC s1">
<corresp type=anaphor targets="shirley s2">
<corresp type=translation targets="ALRTP1 ROTP1">
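Free-standing links of this kind are naturally read as symmetric: each target corresponds to every other target named in the same element. A sketch of how an application might index them follows, with the function name correspondence_index and the (type, targets) pairs representing the <corresp> elements being our own assumptions; identifiers are written as they appear on the elements pointed at.

```python
from collections import defaultdict

# One (type, targets) pair per <corresp> element.
links = [
    ("anaphor", "NBC s1"),
    ("anaphor", "shirley s2"),
    ("translation", "ALRTP1 ROTP1"),
]


def correspondence_index(links):
    """Map each identifier to the set of (link type, other identifier) pairs it takes part in."""
    index = defaultdict(set)
    for link_type, targets in links:
        ids = targets.split()
        for source in ids:
            for other in ids:
                if source != other:
                    index[source].add((link_type, other))
    return dict(index)


index = correspondence_index(links)
print(index["ROTP1"])
```

Looking up either member of a pair yields the other, so the English sentence leads to its French original just as readily as the reverse.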

The same method might also be used to encode other forms of alignment, for example between parts of a transcription and the corresponding digital audio sequences, as in this example:

<xptr id="XD1-4" doc="tape1" from="foreign counter 1" to="foreign counter 4">
<corresp type=transcription targets="ALRTP1 XD1-4">

The notation suggested here for passing parameters to whichever device it is that will locate the actual audio sequence is hypothetical (the TEI allows for this by means of the foreign keyword), but otherwise this example is pure TEI. The <xptr> element is a form of hypertext link defined in the TEI scheme to enable a document to refer to components of other `documents' or digital objects; its exact syntax is not of relevance here (the interested reader is referred to the brief tutorial at Burnard 1997) but the pragmatics of the above declaration might be summarized as ``attach the unique identifier `XD1-4' to the stretch of data starting at the location `counter 1' and ending at the location `counter 4' within the object named `tape1'''. This stretch of sound is then said to correspond with the sentence ALRTP1, as a transcription of it.

As with all forms of encoding, the ease with which the result of an alignment can be represented should not encourage us to believe that actually performing the alignment is a trivial exercise. All that the TEI offers us here is the ability to make explicit the result of the exercise, in a comparatively clear and perspicuous manner. But that at least helps us state clearly what the goal of our alignment software should be, even if the algorithms for achieving it remain stubbornly non-trivial.