2 How to build a corpus

The building of large-scale corpora of text for use in linguistic analysis pre-dates the technical feasibility of such resources in digital form by several centuries. The Oxford English Dictionary, for example, may be regarded as the product of an immense corpus of citation slips, collected and collated in handwritten form over a period of decades during the last century. However, the term corpus is most typically used nowadays to refer to a collection of linguistic data gathered for some specific analytic purpose, with a strong presupposition that it will be stored, managed, and analysed in digital form. The grandfather of linguistic corpora of this type is the one-million-word Brown corpus, created at Brown University in the early 1960s, using methods still relevant today. Linguists and linguistics thrive on controversy, and the dignifying of corpus-based approaches to the subject into a recognized academic discipline has attracted its fair share. Nevertheless, certainly in Europe, and increasingly in North America, corpus-based linguistics is widely perceived as central to many aspects of research into the nature and functioning of human language, with applications in fields as diverse as lexicography, natural language processing, machine translation, and language learning. The maturity of the field may also be inferred from the increasing number of general introductory textbooks: see for example McEnery and Wilson 1996, Biber et al. 1998, Kennedy 1998, or, for a general introduction produced with particular reference to the British National Corpus (BNC), Aston and Burnard 1998.

Many of the best-known language corpora were created within an academic context, where the constraints affecting quality control, budgets, and deadlines tend to differ slightly from those associated with commercial production environments. The BNC project was, by contrast, a joint academic-industrial project, in which both academic and industrial partners learned a little more of their colleagues' perspectives by means of an enforced collaboration. In crude terms, the academic partners learned to cut their coat according to the cloth available, while the industrial partners learned that there were more complex things in life than boilersuits.

The British National Corpus (BNC) is a collection of over 4000 different text samples, of all kinds, both written and spoken, containing in all six and a quarter million sentences, and over 100 million words of current British English. Work on building it began in 1991, and was completed in 1994. The project was funded by the Science and Engineering Research Council (now EPSRC) and the Department of Trade and Industry under the Joint Framework for Information Technology (JFIT) programme. The project was carried out by a consortium led by Oxford University Press, of which the other members were the major dictionary publishers Addison-Wesley Longman and Chambers-Harrap, academic research centres at Oxford University Computing Services and Lancaster University's Centre for Computer Research on the English Language, and the British Library's Research and Innovation Centre.

Organizationally, the tasks of designing and building the corpus were split across a number of technical work groups on which each member of the consortium was represented. Task Group A concerned itself with basic issues of corpus design: what principles should inform the selection of texts for inclusion in the corpus, what target proportions should be set for different text types, and so forth. Task Group B focussed on one key issue in corpus construction, the establishment of acceptable procedures for rights clearance and permissions to include material in the corpus. This might have been the subject of a major research project in its own right: in practice, the output from the task group was a standard agreement, in some sense a precedent-setting document for other European corpus-builders.

Task Group C concerned itself with technical details of encoding and text processing; these are discussed in more detail below. Task Group D concerned itself with corpus enrichment and analysis. In practice, the distinction between the two turned out to be largely the distinction between the creation of the corpus and the creation of specific software to make use of it. Since the latter task was not possible until the end of the project, by which time there were no funds left to do it, it is unsurprising that little was actually accomplished in this group within the time of the original BNC project.

SGML played a major part in the BNC project: as an interchange medium between the various data-providers; as a target application-independent format; and as the vehicle for expression of metadata and linguistic interpretations encoded within the corpus. From the start of the project, it was recognized that SGML offered the only sure foundation for long-term storage and distribution of the data; only during its progress did the importance of using it also as an exchange medium between the various partners emerge. The importance of SGML as an application-independent encoding format is also only now becoming apparent, as a wide range of applications for it begins to be realized.

The scale and variety of data to be included meant that an industrial-style production line environment had to be defined: this was dubbed the BNC sausage machine by Jeremy Clear, the project manager at the time, and may be summarized as follows:

A wide literature now exists on corpus design methodologies, which this paper will not attempt to summarize, although the experience of designing and creating the BNC has contributed greatly to it (see in particular Atkins et al. 1992). A corpus which, like the BNC, aims to represent all the varieties of the English language cannot simply be assembled opportunistically by collecting as much electronic material as its budget will permit, although a project with a defined budget and timescale inevitably finds that design principles sometimes have to be sacrificed to pragmatic considerations. Neither can a corpus aiming to represent the full variety of contemporary English proceed on a purely statistical basis: a statistically balanced random sample of language producers will be unlikely to include (for example) many journalists or media personalities, while a statistically balanced random sample of language reception is unlikely to include much apart from popular journalism. As a compromise, the project adopted a stratified sampling procedure, in which the range of texts to be sampled was pre-defined, and target proportions were then agreed on for each.

In the spoken part of the corpus, ten per cent of the whole, a balance was struck between material gathered on a statistical basis (i.e. recruited from a demographically balanced sample of language producers) and material gathered from a pre-defined set of speech situations or contexts. A moment's reflection should show that this dual practice was necessary to ensure that the corpus included examples of both common and uncommon types of language. Equally, in the written parts of the corpus, published and unpublished material, covering a wide range of topics, registers, levels etc., were all represented. From high-brow novels and text books to pulp fiction and journalism, by way of school essays, office memoranda, email discussion lists, and paper-bags, our aim was to ensure that every form of written language was to be found in the corpus, to a greater or lesser extent.
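The stratified procedure described above can be illustrated with a minimal sketch. It assumes only the two top-level strata mentioned in the text (written and spoken, in a 90:10 ratio of a nominal 100 million words); the function names, the candidate lists, and the rule of drawing texts at random within a stratum until its target is met are illustrative assumptions, not a description of the procedure the project actually followed.

```python
import random

def allocate_targets(total_words, proportions):
    """Split a total word budget across strata according to target proportions."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("target proportions must sum to 1")
    return {name: round(total_words * share) for name, share in proportions.items()}

def sample_stratum(candidates, target_words, rng):
    """Draw candidate texts at random from one stratum until its target is met.

    `candidates` is a list of (text_id, word_count) pairs; both the pairs and
    the stopping rule are hypothetical.
    """
    pool = list(candidates)
    rng.shuffle(pool)
    chosen, words = [], 0
    for text_id, length in pool:
        if words >= target_words:
            break
        chosen.append(text_id)
        words += length
    return chosen, words

# Top-level split taken from the text: spoken material is ten per cent of the whole.
targets = allocate_targets(100_000_000, {"written": 0.9, "spoken": 0.1})
```

In a fuller version each stratum would itself be subdivided (by medium, domain, or speech context, for example), with `allocate_targets` applied again at each level.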

As noted above, data capture for the whole project was carried out by the three publishers in the BNC consortium (OUP, Longman and Chambers). Three sources of electronic data were envisaged at the start of the project: existing electronic text, OCR from printed text, and keyed-in text. It soon became apparent that the first source would be less useful than anticipated, since either the material was encoded in formats too difficult to unscramble consistently, or the texts available did not match the stipulated design criteria. Scanning and keying text brought lesser problems of their own, of which probably the worst was training keyboarders and scanners at different places to be consistent under tight time constraints. In the case of spoken data, keyboarding was the only option from the start, and proved to be very expensive and time-consuming, in part because of the very high standards set for data capture. Transcribing spoken language with attention to such features as overlap (where one speaker interrupts another), and enforcing consistency in the representation of non-lexical or semi-lexical phenomena, are major technical problems, rarely attempted on the scale of the BNC material, which finally included ten million words of naturally occurring speech, recorded in all sorts of environments.

For a variety of reasons, the three data suppliers all used their own internal markup systems for data capture, which then had to be centrally converted and corrected to the project encoding standard. Had this standard, the Corpus Document Interchange Format, or CDIF, been available at the start of the project, the need for conversion would have been lessened, but not that for validation. CDIF, like many other TEI-conformant DTDs, allows for considerable variation in actual encoding practice, largely because of the very widely different text types that it has to accommodate. To help ease the burden on data suppliers, the tags available were classified according to their perceived usefulness and applicability. Some (such as headings, chapter or other division breaks, and paragraphs) were designated "required" parts of any CDIF document; when such features occur in a text, they must be marked up. Others (such as sub-divisions within the text, lists, poems, and notes about editorial correction) were "recommended", and should be marked up if at all possible. Finally, some tags were considered "optional": for example dates, proper names, and citations, which are easily identifiable. The process of format conversion and SGML validation was automated as far as possible (fortunately for us, the sgmls parser became available early on during the project): these constituted the 'syntactic' check. Where time permitted, we also carried out a 'semantic' check to determine whether material which should have been tagged had in fact been marked up, though it was of course impossible to carry out a full proof-reading exercise. Materials which exceeded an agreed threshold of errors, either syntactic or semantic, were returned to the data capture agency for correction or replacement.
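The three-way classification of tags, and the kind of question the 'semantic' check asks, might be sketched as follows. This is an illustration only: the element names are hypothetical stand-ins for the categories named above rather than the actual CDIF tag set, the list of features present in a text is assumed to come from a human checker, and the regular expression is a crude substitute for a real SGML parser such as sgmls.

```python
import re

# Hypothetical element names standing in for the categories named in the text;
# the real CDIF tag set is not reproduced here.
TAG_LEVELS = {
    "head": "required", "div": "required", "p": "required",
    "list": "recommended", "poem": "recommended", "note": "recommended",
    "date": "optional", "name": "optional", "cit": "optional",
}

# Matches the name of a start tag; end tags (</p>) are deliberately not matched.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.]*)")

def tags_used(markup):
    """Return the set of element names that open at least once in the markup."""
    return {match.group(1).lower() for match in START_TAG.finditer(markup)}

def semantic_check(markup, features_present):
    """List features seen in the source text but left untagged.

    `features_present` is the set of features a checker has observed in the
    original; optional features are never reported as missing.
    """
    used = tags_used(markup)
    report = {"required": [], "recommended": []}
    for feature in sorted(features_present):
        level = TAG_LEVELS.get(feature)
        if level in report and feature not in used:
            report[level].append(feature)
    return report

sample = "<div><head>Title</head><p>Some text.</p></div>"
missing = semantic_check(sample, {"head", "p", "list", "date"})
```

Counting the entries in such a report per text would give one possible basis for an error threshold of the kind mentioned above, though how the project actually computed its threshold is not stated.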

The many thousands of files and versions of files involved as texts passed through the production line were managed by a relational database system, which also handled routine archiving and backup. This database also held all of the bibliographic and other metadata associated with each text, from which the TEI headers eventually added to each text were generated. (A useful summary of the information recorded in each header is provided in Dunlop 1995.)
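A minimal sketch of this arrangement, using SQLite, is given below. The table layout, column names, stage labels, and the much-reduced header are all assumptions made for illustration; the schema of the project's own database is not described here, and the header shown is not a complete or valid TEI header.

```python
import sqlite3
from xml.sax.saxutils import escape

def open_tracking_db():
    """Create an in-memory database with one row per text and one per version."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE texts (
            text_id  TEXT PRIMARY KEY,
            title    TEXT NOT NULL,
            supplier TEXT NOT NULL
        );
        CREATE TABLE versions (
            text_id TEXT NOT NULL REFERENCES texts(text_id),
            version INTEGER NOT NULL,
            stage   TEXT NOT NULL,
            PRIMARY KEY (text_id, version)
        );
    """)
    return db

def record_version(db, text_id, stage):
    """Register a new version of a text at the given (hypothetical) stage."""
    row = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM versions WHERE text_id = ?",
        (text_id,),
    ).fetchone()
    version = row[0] + 1
    db.execute("INSERT INTO versions VALUES (?, ?, ?)", (text_id, version, stage))
    return version

def current_stage(db, text_id):
    """Return the stage of the latest version, or None if the text is unknown."""
    row = db.execute(
        "SELECT stage FROM versions WHERE text_id = ? ORDER BY version DESC LIMIT 1",
        (text_id,),
    ).fetchone()
    return row[0] if row else None

def header_for(db, text_id):
    """Generate a much-reduced header from the stored metadata."""
    title, supplier = db.execute(
        "SELECT title, supplier FROM texts WHERE text_id = ?", (text_id,)
    ).fetchone()
    return (
        "<teiHeader><fileDesc>"
        f"<titleStmt><title>{escape(title)}</title></titleStmt>"
        f"<sourceDesc><p>Supplied by {escape(supplier)}</p></sourceDesc>"
        "</fileDesc></teiHeader>"
    )
```

Keeping the metadata in one place and generating headers from it, rather than editing headers by hand, means that a correction to a bibliographic record need only be made once.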

The project was funded for a total of four years, of which the first was devoted to agreeing and defining in full operational detail the procedures summarized above. By the end of the 5th quarter (March 1992), 10 percent of the corpus had been received at OUCS and procedures for handling it were in place. A small sample (2 million words) had been processed and sent on to Lancaster for the next stage of processing. The rate at which texts were received and processed at OUCS fluctuated somewhat during the course of the project, but ramped up steadily towards its end.

The following table shows the approximate number of words (in millions) received at OUCS, converted to the project standard, and received back from Lancaster in annotated form, for each quarter (parenthesized figures indicate 'bounced' texts, that is, material which had to be returned because it did not pass the QA procedures discussed above):

Quarter   Received   Validated   Annotated
6         2          4           -
7         6          4           -
8         5 (1)      8           6
9         6 (2)      14          13
10        14 (3)     11          5
11        12 (2)     13          8
12        25         16          17
13        25         32          22
14        3          8           30
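As a rough check on these figures, the quarterly values can be accumulated into running totals. The sketch below uses the figures as read from the table above, with bounced material kept in a separate column; since the table begins at quarter 6, the totals cover quarters 6 to 14 only, and whether the bounced figures are included in or additional to the received figures is not stated, so no assumption is made either way.

```python
from itertools import accumulate

# (quarter, received, bounced, validated, annotated), in millions of words,
# as read from the table above; None marks a quarter with no figure given.
QUARTERS = [
    (6, 2, 0, 4, None),
    (7, 6, 0, 4, None),
    (8, 5, 1, 8, 6),
    (9, 6, 2, 14, 13),
    (10, 14, 3, 11, 5),
    (11, 12, 2, 13, 8),
    (12, 25, 0, 16, 17),
    (13, 25, 0, 32, 22),
    (14, 3, 0, 8, 30),
]

def running_totals(rows, column):
    """Cumulative sum of one column, treating missing figures as zero."""
    return list(accumulate((row[column] or 0) for row in rows))

received = running_totals(QUARTERS, 1)
bounced = running_totals(QUARTERS, 2)
validated = running_totals(QUARTERS, 3)
annotated = running_totals(QUARTERS, 4)

for (quarter, *_), r, v, a in zip(QUARTERS, received, validated, annotated):
    print(f"Q{quarter:<3} received {r:>3}  validated {v:>3}  annotated {a:>3}")
```

On this reading, each column ends at roughly 100 million words, which is consistent with the overall size of the corpus given earlier.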