3 How to mark up a corpus

A full description of the BNC mark up scheme is beyond the scope of this paper, and is in any case available in the documentation supplied with the corpus and elsewhere. In this paper I would like to focus on the way in which the anticipated uses of the corpus conditioned the mark up scheme actually applied.

It has often been said of general purpose DTDs such as the TEI (which was being developed symbiotically with the CDIF scheme used in the BNC) that they allow the user too much flexibility. In practice, we found that the richly descriptive aspects of the TEI scheme were of least interest to our potential users. For purposes of linguistic analysis, the immense variety of objects in a fully marked up text, with all their fascinating problems of rendering and interpretation, is of less importance than a reliable and regular structural breakdown into segments and words. This was an unpalatable lesson for academics with a fondness for the rugosities of real language, but an important one. The scale of the BNC simply did not permit us to lovingly mark up every detail of the text, distinguishing sharply every list, foreign word, editorial intervention, or proper name. Instead we had to be sure that headings, paragraphs, and major text divisions were reliably and consistently captured in an immense variety of materials. For purposes of linguistic analysis, segmentation at the sentence and word level was crucial but, fortunately, automatic. By comparison with other, more literary oriented, TEI texts, the tagging of the BNC is thus rather sparse, despite its 150 million SGML tags.

The basic structural mark up of both written and spoken texts may be summarized as follows. Each of the 4124 documents or text samples making up the corpus is represented by a single <bncDoc> element, containing a header, and either a <text> (for written texts) or an <stext> (for spoken texts) element. The header element contains detailed and richly structured metadata supplying a variety of contextual information about the document (its title, source, encoding, etc., as defined by the TEI): as noted above, headers were automatically generated from information managed within a relational database. A spoken text is divided into utterances, possibly interspersed with nonlinguistic elements such as events, possibly grouped into divisions to mark breaks in conversations. A written text is divided into paragraphs, possibly also grouped into hierarchically numbered divisions. Below the level of the paragraph or utterance, all texts are composed of <s> elements, marking the automatic linguistic segmentation carried out at Lancaster, and each of these is divided into <w> (word) or <c> (punctuation) elements, each bearing a POS (part of speech) annotation attribute.

Considerable discussion went on at the start of the project as to the best method of encoding this automatically-generated information. There are about sixty different possible POS codes, each representing a linguistic category such as singular noun, adverb of a particular type, etc. The codes are automatically allocated to each word by CLAWS, a sophisticated language-processing system developed at the University of Lancaster, and widely recognized as a mature product in the field of Natural Language Processing.

For approximately 4.7 per cent of the words in the corpus, CLAWS was unable to decide between two possible taggings with sufficient likelihood of success. In such cases, a two-value word-class code, known as a portmanteau tag, is applied. For example, the portmanteau tag VVD-VVN means that the word may be either a past tense verb (VVD), or a past participle (VVN). We did not make any attempt to represent this ambiguity in the SGML coding, though at a later stage of linguistic analysis, perhaps based on the TEI feature structure mechanism, this might be possible. Without manual intervention, the CLAWS system has an overall error-rate of approximately 1.7 per cent, excluding punctuation marks. Given the size of the corpus, there was no opportunity to undertake post-editing to correct annotation errors before the first release of the corpus.
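Although the SGML coding itself does not represent the ambiguity, an application reading the corpus can recover it from the code string. The following is a minimal sketch in Python, assuming only that a portmanteau tag joins its two candidate codes with a hyphen, as in VVD-VVN; the function names are hypothetical and form no part of CLAWS or of the BNC software.

```python
def candidate_tags(pos_code: str) -> list[str]:
    """Split a POS code into its candidate word-class codes.

    A plain code such as "NN1" yields one candidate; a portmanteau
    code such as "VVD-VVN" yields two. (Assumption: the hyphen is
    used only as the portmanteau separator.)
    """
    return pos_code.split("-")


def is_portmanteau(pos_code: str) -> bool:
    """True when CLAWS left two possible taggings for the word."""
    return len(candidate_tags(pos_code)) > 1
```

A retrieval tool could use `candidate_tags` to let a search for VVD also match words tagged VVD-VVN, should that behaviour be wanted.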

Since then two successor projects have been completed by the Lancaster team, resulting in the availability of a much improved new version. The first step was to manually check a 2 percent sample from the whole corpus, using a much richer and more delicate set of codes. This corrected sample was then used to improve and extend the CLAWS tagging procedures, essentially by expanding its knowledge of common English phrasal sequences, before re-running the automatic procedure over the whole corpus.

Further details of the CLAWS tagging procedure and the linguistic concepts underlying it are available in a number of research publications from the Lancaster team and in a useful summary book (Garside et al. 1997); the present paper focusses only on the encoding issues associated with its use.

As with many other morpho-syntactic taggers, the tokens annotated by CLAWS do not always correspond with single orthographic words. For example, the word `won't' is regarded as two tokens by CLAWS: `wo' (verbal auxiliary) and `n't' (negation marker); similarly possessive forms such as `Queen's' are regarded as two tokens. Further to confuse matters, some common prepositional phrases such as `in spite of' are regarded as a single token, as are foreign phrases such as `annus horribilis'. (This last phrase appears over 30 times in the BNC, as a consequence of the Queen's speech to Parliament in 1993.)

A second range of problems centres on the semantics of such annotations. There is some controversy amongst linguists about whether or not POS codes of this kind should be decomposable: that is, whether the encoding should make explicit that (for example) NN1 and NN2 have something in common (their noun-ness) which (say) VVX lacks. The TEI, of course, has a great deal to suggest on the subject, and proposes a very powerful SGML tagset for encoding such feature systems. To keep our options open, and also for ease of conversion from the data format output by CLAWS (which was already in existence, and had been for many years), we began by representing the code simply as an entity reference following the token to which it applied. Thus:

The&AT0 Queen&NP0's&POS annus horribilis&NN1

This option, we felt, would enable us to defer to a later stage exactly what the replacement for each entity reference should be: it might be nothing at all, for those uninterested in POS information, or a string, or a pointer indicating a more complex expansion of the TEI kind. The problem with this representation, however, is that it relies on an ad hoc interpretive rule (of the kind which SGML is specifically designed to preclude the need for) to indicate, for example, that the code AT0 belongs to the word `The', rather than to the word `Queen'. In fact this is not encoding the truth of the situation: we have here a string of word-annotation pairs. A more truthful annotation might be:


A further possibility is to use an attribute value, for either the Form or the Code: thus

  <form code=AT0>The</form>
or, equivalently,
  <code form=The>AT0</code>

From the SGML point of view these are equivalent. From the application point of view, the notion of a text composed of strings of POS codes, with embedded forms seems somehow less appealing than the reverse, which is what we eventually chose: our example being tagged as follows:

<w AT0>The <w NP0>Queen<w POS>'s <w NN1>annus horribilis

The decision to use an often deprecated form of tag minimization for the POS annotation was forced upon us largely by economic considerations. A fully normalized form, with attribute name and end-tags included on each of the 100 million words, would have more than doubled the size of the corpus. Data storage costs continue to plummet, but the difference between 2 Gb and 4 Gb remains significant!
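The relationship between the minimized and the fully normalized forms can be illustrated with a short Python sketch. It reads the minimized markup as a sequence of form and code pairs and writes each token out again with an explicit attribute name and end-tag. The attribute name `pos` and the function names are assumptions made for illustration only, and a regular expression is a convenience here, not a substitute for a real SGML parser.

```python
import re

# One minimized <w CODE> or <c CODE> tag plus the text up to the next tag.
TOKEN = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")

SAMPLE = "<w AT0>The <w NP0>Queen<w POS>'s <w NN1>annus horribilis"


def parse_tokens(markup: str) -> list[tuple[str, str]]:
    """Return (form, code) pairs, with surrounding whitespace removed."""
    return [(m.group(3).strip(), m.group(2)) for m in TOKEN.finditer(markup)]


def normalise(markup: str, attr: str = "pos") -> str:
    """Expand each token to a form with attribute name and end-tag.

    The attribute name is hypothetical; whitespace inside each token
    is kept exactly as it appears in the input.
    """
    def expand(m: re.Match) -> str:
        element, code, form = m.group(1), m.group(2), m.group(3)
        return f'<{element} {attr}="{code}">{form}</{element}>'

    return TOKEN.sub(expand, markup)


if __name__ == "__main__":
    print(parse_tokens(SAMPLE))
    expanded = normalise(SAMPLE)
    print(expanded)
    print(len(SAMPLE), "characters minimized,", len(expanded), "normalized")
```

How much the markup grows depends on the length of the attribute name chosen and on the average length of the word forms, so the figures printed for this one phrase should not be read as an estimate for the corpus as a whole.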

A second major set of encoding problems arose from the inclusion in the corpus of ten million words of transcribed speech, half of it recorded in pre-defined situations (lectures, broadcasts, consultations, etc.), and the other half recorded by a demographically sampled set of volunteers, willing to tape their own everyday work and leisure time conversation.

Speech is transcribed using normal orthographic conventions, rather than attempting a full phonemic transcript, which would have been beyond the project's limited resources. Even so, the markup has to be very rich in order to capture the process of speaker interaction: who is speaking, and how, and where they are interrupted. Significant non-verbal events such as pauses or changes in voice quality are also marked up using appropriate empty elements, which bear descriptive attributes. Here is an example of the start of one such conversation, as encoded in CDIF:

<u who=D00011>
<s n=00011>
<event desc="radio on"><w PNP><pause dur=34>You
<w VVD>got<w TO0>ta <unclear><w NN1>Radio
<w CRD>Two <w PRP>with <w DT0>that <c PUN>.
<s n=00012>
<pause dur=6><w AJ0>Bloody <w NN1>pirate
<w NN1>station <w VM0>would<w XX0>n't
<w PNP>you <c PUN>?

The basic unit is the utterance, marked as an <u> element, with an attribute who specifying the speaker, where this is known. This attribute targets an element in the header for the text, which carries important background information about the speaker, for example their gender, age, social background, inter-relationship etc. Where speakers interrupt each other, as they usually do, a system of alignment pointers, simplified from that defined by the TEI, is used. This requires that all points of overlap are identified in a <timeLine> element prefixed to each text, component points (<when> elements) of which are then pointed to from synchronous moments within the transcribed speech, represented as <ptr> elements. Pausing is marked, using a <pause> element, with an indication of its length if this seems abnormal. Gaps in the transcription, caused either by inaudibility or the need to anonymize the material, are marked using the <unclear> or <gap> elements as appropriate. Truncated forms of words, caused by interruption or false-starts, are also marked, using the <trunc> element.
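To show how an application might pick these elements out of the encoded conversation quoted earlier, here is a minimal Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, the regular expressions cover only the constructs that appear in that one example, and no attempt is made to handle <ptr>, <gap> or <trunc>.

```python
import re

# The opening of the conversation quoted in the text, in CDIF form.
TRANSCRIPT = """<u who=D00011>
<s n=00011>
<event desc="radio on"><w PNP><pause dur=34>You
<w VVD>got<w TO0>ta <unclear><w NN1>Radio
<w CRD>Two <w PRP>with <w DT0>that <c PUN>.
<s n=00012>
<pause dur=6><w AJ0>Bloody <w NN1>pirate
<w NN1>station <w VM0>would<w XX0>n't
<w PNP>you <c PUN>?
"""


def speaker(markup: str):
    """Return the value of the who attribute on <u>, or None if absent."""
    match = re.search(r"<u who=(\w+)>", markup)
    return match.group(1) if match else None


def pause_durations(markup: str) -> list[int]:
    """Return the dur value of every <pause> element, in document order."""
    return [int(d) for d in re.findall(r"<pause dur=(\d+)>", markup)]


def events(markup: str) -> list[str]:
    """Return the desc value of every <event> element."""
    return re.findall(r'<event desc="([^"]*)">', markup)


def tokens(markup: str) -> list[tuple[str, str]]:
    """Return (code, form) pairs for every <w> and <c> element."""
    # Remove every tag except <w ...> and <c ...>, so that an empty
    # element such as <pause> does not separate a tag from its form.
    stripped = re.sub(r"<(?![wc] )[^>]*>", "", markup)
    found = re.findall(r"<[wc] ([A-Z0-9-]+)>([^<]*)", stripped)
    return [(code, form.strip()) for code, form in found]
```

The value returned by `speaker` is the identifier that would then be looked up in the header of the text to find the background information recorded for that participant.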

A semi-rigorous form of normalization is applied to the spelling of non-conventional forms such as `innit' or `lorra'; the principle adopted was to spell such forms in the way that they typically appear in general dictionaries. Similar methods are used to normalize such features of spoken language as filled pauses, semi-lexicalized items such as `um', `err', etc. Some light punctuation was also added, motivated chiefly by the desire to make the transcriptions comprehensible to a reader, by marking (for example) questions, possessives, and sentence boundaries in the conventional way.

Paralinguistic features affecting particular stretches of speech, such as shouting or laughing, are marked using the <shift> element to delimit changes in voice quality. Non-verbal sounds such as coughing or yawning, and non-speech events such as traffic noise are also marked, using the <vocal> and <event> elements respectively; in both cases, a closed list of values for the desc attribute is used to specify the phenomenon concerned. It should however be emphasized that the aim was to transcribe as clearly and economically as possible rather than to represent all the subtleties of the audio recording.

The metadata provided by the header element, mentioned above, is of particular importance in any electronic text, but especially so in a large corpus. Earlier corpora have tended to provide all such documentation (if at all) as a separate collection of reference manuals, rather than as an integral part of the corpus, with obvious concomitant problems of maintainability and consistency. In SGML, particularly the TEI header, we felt that we had a powerful mechanism for integrating data and metadata, which we used to the full: each component text of the BNC carries a full header, structured according to TEI recommendations, and containing a full bibliographic description of it, and of its source, as well as specific details of its encoding, revision status, etc. A corpus header, containing information common to all texts, is also provided: this includes full descriptions of the corpus creation methodology, and the various codes used within individual text headers, such as those for text classification.

A particular problem arises with large general purpose corpora like the BNC, the components of which can be cross-classified in many different ways. Earlier corpora have tended to simplify this, for example, by organizing the corpora into groups of texts of a particular type: all newspaper texts together, all novels together, etc. A typical BNC text, however, can be classified in many different ways (medium, level, region, etc.). The solution we adopted was to include in the header of each text a single <catRef> element carrying an IDREFS-valued attribute, which targeted each of the descriptive categories applicable to the text.

For example, the header of a text of written author type 2 (multiple authorship), written medium type 4 (miscellaneous unpublished), and written domain type 3 (applied sciences) will contain an element like the following:

<catref target="wriaty2 wrimed4 wridom3">

The values wriaty2, wrimed4, etc. here each reference a <category> element in the corpus header, containing a definition for the classification intended. The full set of descriptive categories used is thus controlled and can be guaranteed uniform across the whole corpus, while at the same time permitting us to mix and combine descriptive categories within each text as appropriate.
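The way such an IDREFS value is resolved can be sketched in a few lines of Python. The three definitions below are taken from the example just given; the dictionary, which stands in for the <category> elements of the corpus header, and the function name are assumptions made for illustration.

```python
# Stand-in for the <category> elements declared in the corpus header.
CATEGORIES = {
    "wriaty2": "written author type 2 (multiple authorship)",
    "wrimed4": "written medium type 4 (miscellaneous unpublished)",
    "wridom3": "written domain type 3 (applied sciences)",
}


def resolve_catref(target: str, categories: dict[str, str]) -> list[str]:
    """Look up each identifier in a whitespace-separated IDREFS value.

    Raises KeyError if any identifier has no definition, which mirrors
    the guarantee that only declared categories may be used.
    """
    ids = target.split()
    missing = [i for i in ids if i not in categories]
    if missing:
        raise KeyError(f"undeclared categories: {', '.join(missing)}")
    return [categories[i] for i in ids]


if __name__ == "__main__":
    for definition in resolve_catref("wriaty2 wrimed4 wridom3", CATEGORIES):
        print(definition)
```

Because every text points into the same set of definitions, selecting (say) all texts of a given domain reduces to testing whether the corresponding identifier occurs in the target value.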

A similar method was used to link very detailed participant descriptions (stored in the header) with utterances attributed to them in the spoken part of the corpus.

In retrospect, had we all known as much about SGML at the start of the project as we did by the end of it, we would have made much more impressive progress, and perhaps delivered a better product. Needless effort went into converting from one format to another, which might have been better spent on gathering more reliable contextual information, for example. We also spent a long time devising ways of representing complex information about (for example) relationships between the speakers, which in the event was not reliably available for more than a handful of cases. The data representation we produced was thus rather more sophisticated and complex than the material included perhaps warranted.