Using SGML for Linguistic Analysis: the case of the BNC
Lou Burnard
http://users.ox.ac.uk/~lou/
Lou Burnard is Manager of the Humanities Computing Unit at Oxford University Computing Services, and European Editor of the Text Encoding Initiative. He was educated at Oxford University, where he has worked in humanistic applications of information technology since the seventies.
The British National Corpus (BNC) is a rather large SGML document, comprising some 4124 samples taken from a rich variety of contemporary British English texts of every kind, written and printed, famous and obscure, learned and ignorant, spoken and written. Each of its hundred million words and six and a quarter million sentences is tagged explicitly in SGML and carries an automatically-generated linguistic analysis. Each sample carries a TEI-conformant header, containing detailed contextual and descriptive information, as well as more conventional SGML mark-up. The corpus was created over a four year period by a consortium of leading dictionary publishers and academic research centres in the UK, with substantial funding from the British Department of Trade and Industry, the Science and Engineering Research Council, and the British Library. On publication, it was made freely available under licence within the European Union, where it is increasingly used in linguistic research and lexicography, in applications ranging from the construction of state of the art language-recognition systems, to the teaching of English as a second language. This paper describes how the corpus was constructed, and gives an overview of some of the SGML encoding issues raised during the process. A brief description of the special purpose SGML aware retrieval system developed to analyse the corpus and its current status is also provided.
How to build a corpus
The building of large-scale corpora of text for use in linguistic analysis pre-dates the technical feasibility of such resources in digital form by several centuries. For introductions to the field, see McEnery and Wilson 1996, Biber et al 1998, Kennedy 1998, or, for a general introduction produced with particular reference to the British National Corpus (BNC), Aston and Burnard 1998. Many of the most well-known language corpora were created within an academic context, where slightly different constraints tend to affect quality control, budgets, and deadlines than those associated with commercial production environments. The BNC project was, by contrast, a joint academic-industrial project, in which both academic and industrial partners learned a little more of their colleagues' perspectives by means of an enforced collaboration. In crude terms, if the academic partners learned to cut their coat according to the cloth available, the industrial partners learned that there were more complex things in life than boilersuits.
The British National Corpus (BNC) is a collection of over 4000 different text samples, of all kinds, both written and spoken, containing in all six and a quarter million sentences, and over 100 million words of current British English. Work on building it began in 1991, and was completed in 1994. The project was funded by the Science and Engineering Research Council (now EPSRC) and the Department of Trade and Industry under the Joint Framework for Information Technology (JFIT) programme. The project was carried out by a consortium led by Oxford University Press, of which the other members are the major dictionary publishers Addison-Wesley Longman and Chambers-Harrap; academic research centres at Oxford University Computing Services and Lancaster University's Centre for Computer Research on the English Language; and the British Library's Research and Innovation Centre.
Organizationally, the tasks of designing and building the corpus were split across a number of technical work groups on which each member of the consortium was represented. Task Group A concerned itself with basic issues of corpus design: what principles should inform the selection of texts for inclusion in the corpus, what target proportions should be set for different text types, and so forth. Task Group B focussed on one key issue in corpus construction, the establishment of acceptable procedures for rights clearance and permissions to include material in the corpus. This might have been the subject of a major research project in its own right: in practice, the output from the task group was a standard agreement, in some sense a precedent-setting document for other European corpus-builders. Task Group C concerned itself with technical details of encoding and text processing; these are discussed in more detail below. Task Group D concerned itself with corpus enrichment and analysis. In practice, the distinction between the two turned out to be largely the distinction between the creation of the corpus and of specific software to make use of it. Since the latter task was not possible until the end of the project, by which time there were no funds left to do it, it is unsurprising that little was actually accomplished in this group within the time of the original BNC project.
SGML played a major part in the BNC project: as an interchange medium between the various data-providers; as a target application-independent format; and as the vehicle for expression of metadata and linguistic interpretations encoded within the corpus. From the start of the project, it was recognized that SGML offered the only sure foundation for long term storage and distribution of the data; only during its progress did the importance of using it also as an exchange medium between the various partners emerge.
The importance of SGML as an application independent encoding format is also only now becoming apparent, as a wide range of applications for it begin to be realized.
The scale and variety of data to be included meant that an industrial-style production line environment had to be defined: this was dubbed the BNC sausage machine by Jeremy Clear, the project manager at the time, and may be summarized as follows:
data capture: each of the three commercial partners selected and prepared material to a different defined format, reflecting to some extent the diverse nature of materials for which they were primarily responsible;
primary check and conversion: OUCS checked each text against its data capture format, automatically converted it to project standard format, and made an accession record for it in the project database;
linguistic annotation: valid SGML texts were passed to Lancaster for automatic addition of word class tagging and linguistic segmentation, using the CLAWS software discussed further below;
text cataloguing and final checking: lexically annotated texts were run through a final conversion at OUCS; a detailed TEI header was generated from the project database and the text itself added to the corpus.
A wide literature now exists on corpus design methodologies, which this paper will not attempt to summarize although the experience of designing and creating the BNC has contributed greatly to it (see in particular Atkins et al 1992). A corpus which, like the BNC, aims to represent all the varieties of the English language cannot simply be assembled opportunistically by collecting as much electronic material as its budget will permit, although a project with a defined budget and timescale inevitably finds design principles sometimes have to be sacrificed to pragmatic considerations.
Neither can a corpus aiming to represent the full variety of contemporary English proceed on a purely statistical basis: a statistically balanced random sampling of language producers will be unlikely to include (for example) many journalists or media personalities, while a statistically balanced random sample of language reception is unlikely to include much apart from popular journalism. As a compromise, the project adopted a stratified sampling procedure, in which the range of texts to be sampled was pre-defined, and target proportions were then agreed on for each. In the spoken part of the corpus, ten per cent of the whole, a balance was struck between material gathered on a statistical basis (i.e. recruited from a demographically-balanced sample of language producers) and material gathered from a pre-defined set of speech situations or contexts. A moment's reflection should show that this dual practice was necessary to ensure that the corpus included examples of both common and uncommon types of language. Equally, in the written parts of the corpus, published and unpublished material, of a wide range of topics, registers, levels etc., were all represented. From high-brow novels and text books to pulp fiction and journalism, by way of school essays, office memoranda, email discussion lists, and paper-bags, our aim was to ensure that every form of written language is to be found in the corpus, to a greater or lesser extent.
As noted above, data capture for the whole project was carried out by the three publishers in the BNC consortium (OUP, Longman and Chambers). Three sources of electronic data were envisaged at the start of the project: existing electronic text, OCR from printed text, and keyed-in text. It soon became apparent that the first source would be less useful than anticipated since either the material was encoded in formats too difficult to unscramble consistently, or the texts available did not match the stipulated design criteria.
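The stratified sampling procedure described earlier amounts to fixing target proportions per stratum and deriving word quotas from them. The following is a minimal sketch only: the 10 per cent spoken share and the even split of the spoken material come from this paper, while the function name and stratum labels are hypothetical and do not reflect any software used by the project.

```python
from fractions import Fraction


def allocate_quotas(total_words, proportions):
    """Turn target proportions per stratum into word quotas.

    `proportions` maps a stratum label to its share of `total_words`;
    the shares must sum to exactly 1.
    """
    if sum(proportions.values()) != 1:
        raise ValueError("target proportions must sum to 1")
    return {stratum: int(total_words * share)
            for stratum, share in proportions.items()}


# Top-level split: the spoken part is ten per cent of the whole.
top_level = allocate_quotas(
    100_000_000,
    {"written": Fraction(9, 10), "spoken": Fraction(1, 10)},
)

# Spoken split: half demographically sampled, half from pre-defined
# situations. The labels are illustrative, not project terminology.
spoken = allocate_quotas(
    top_level["spoken"],
    {"demographic": Fraction(1, 2), "predefined_situations": Fraction(1, 2)},
)
```

Exact fractions are used rather than floats so that the check on the proportions cannot fail through rounding.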
Scanning and keying text brought lesser problems of their own, of which probably the worst was training keyboarders and scanners at different places to be consistent under tight time constraints. In the case of spoken data, keyboarding was the only option from the start, and proved to be very expensive and time-consuming, in part because of the very high standards set for data capture. Transcribing spoken language with attention to such features as overlap (where one speaker interrupts another), and enforcing consistency in the representation of non-lexical or semi-lexical phenomena are major technical problems, rarely attempted on the scale of the BNC material, which finally included ten million words of naturally occurring speech, recorded in all sorts of environments.
For a variety of reasons, the three data suppliers all used their own internal markup systems for data capture, which then had to be centrally converted and corrected to the project encoding standard. Had this standard, the Corpus Document Interchange Format, or CDIF, been available at the start of the project, the need for conversion would have been lessened, but not that for validation. CDIF, like many other TEI-conformant DTDs, allows for considerable variation in actual encoding practice, largely because of the very widely different text types that it has to accommodate. To help ease the burden on data suppliers, the tags available were classified according to their perceived usefulness and applicability. Some (such as headings, chapter or other division breaks, and paragraphs) were designated "required" parts of any CDIF document; when such features occur in a text, they must be marked up. Others (such as sub-divisions within the text, lists, poems, and notes about editorial correction) were "recommended", and should be marked up if at all possible. Finally, some tags were considered "optional": dates, proper names and citations which are easily identifiable.
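The three-way classification of tags lends itself to a simple check: a required feature left unmarked is an error, a recommended one merely a warning, and an optional one is ignored. The sketch below is illustrative only. The tag names are hypothetical stand-ins for the feature classes listed above, not the actual CDIF element names, and the function is not part of any project software.

```python
# Hypothetical tag names standing in for the feature classes in the text.
REQUIRED = {"head", "div", "p"}          # headings, division breaks, paragraphs
RECOMMENDED = {"list", "poem", "note"}   # lists, poems, editorial notes
OPTIONAL = {"date", "name", "cit"}       # dates, proper names, citations


def check_markup(features_present, tags_used):
    """Report features that occur in a text but were not marked up.

    `features_present` is the set of features a checker saw in the source;
    `tags_used` is the set of tags actually found in the encoded text.
    """
    missing = features_present - tags_used
    return {
        "errors": sorted(missing & REQUIRED),
        "warnings": sorted(missing & RECOMMENDED),
    }


report = check_markup({"head", "p", "list", "date"}, {"p"})
```

Here the unmarked heading is reported as an error, the unmarked list as a warning, and the unmarked date is passed over because that class is optional.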
The process of format conversion and SGML validation was automated as far as possible (fortunately for us, the sgmls parser became available early on during the project): these constituted the syntactic check. Where time permitted, we also carried out a semantic check to determine whether material which should have been tagged had in fact been marked up, though it was of course impossible to carry out a full proof reading exercise. Materials which exceeded an agreed threshold of errors, either syntactic or semantic, were returned to the data capture agency, for correction or replacement. The many thousands of files and versions of files involved as texts passed through the production line were tracked by a relational database system, which also managed routine archiving and backup. This database also held all of the bibliographic and other metadata associated with each text, from which the TEI headers eventually added to each text were generated. (A useful summary of the information recorded in each header is provided in Dunlop 1995).
The project was funded for a total of four years, of which the first was devoted to agreeing and defining in full operational detail the procedures summarized above. By the end of the 5th quarter (March 1992), 10 percent of the corpus had been received at OUCS and procedures for handling it were in place. A small sample (2 million words) had been processed and sent on to Lancaster for the next stage of processing. The rate at which texts were received and processed at OUCS fluctuated somewhat during the course of the project, but ramped up steadily towards its end.
The following table shows the approximate number of words (in millions) received at OUCS, converted to the project standard, and received back from Lancaster in annotated form, for each quarter (parenthesized figures indicate bounced texts, that is, material which had to be returned because it did not pass the QA procedures discussed above):
Quarter   Received   Validated   Annotated
6         2          4           -
7         6          4           -
8         5 (1)      8           6
9         6 (2)      14          13
10        14 (3)     11          5
11        12 (2)     13          8
12        25         16          17
13        25         32          22
14        3          8           30
How to mark up a corpus
A full description of the BNC mark up scheme is beyond the scope of this paper, and is in any case available in the documentation supplied with the corpus and elsewhere. In this paper I would like to focus on the way in which the anticipated uses of the corpus conditioned the mark up scheme actually applied. It has often been said of general purpose DTDs such as the TEI (which was being developed symbiotically with the CDIF scheme used in the BNC) that they allow the user too much flexibility. In practice, we found that the richly descriptive aspects of the TEI scheme were of least interest to our potential users. For purposes of linguistic analysis, the immense variety of objects in a fully marked up text, with all their fascinating problems of rendering and interpretation, are of less importance than a reliable and regular structural breakdown, into segments and words. This was an unpalatable lesson for academics with a fondness for the rugosities of real language, but an important one. The scale of the BNC simply did not permit us to lovingly mark up every detail of the text, distinguishing sharply every list, foreign word, editorial intervention, or proper name. Instead we had to be sure that headings, paragraphs, and major text divisions were reliably and consistently captured in an immense variety of materials. For purposes of linguistic analysis, segmentation at the sentence and word level was crucial but, fortunately, automatic. By comparison with other, more literary oriented, TEI texts, the tagging of the BNC is thus rather sparse, despite its 150 million SGML tags.
The basic structural mark up of both written and spoken texts may be summarized as follows. Each of the 4124 documents or text samples making up the corpus is represented by a single <bncDoc> element, containing a header, and either a <text> (for written texts) or an <stext> (for spoken texts) element.
The header element contains detailed and richly structured metadata supplying a variety of contextual information about the document (its title, source, encoding, etc., as defined by the TEI): as noted above, headers were automatically generated from information managed within a relational database. A spoken text is divided into utterances, possibly interspersed with nonlinguistic elements such as events, possibly grouped into divisions to mark breaks in conversations. A written text is divided into paragraphs, possibly also grouped into hierarchically numbered divisions. Below the level of the paragraph or utterance, all texts are composed of <s> elements, marking the automatic linguistic segmentation carried out at Lancaster, and each of these is divided into <w> (word) or <c> (punctuation) elements, each bearing a POS (part of speech) annotation attribute. Considerable discussion went on at the start of the project as to the best method of encoding this automatically-generated information. There are about sixty different possible POS codes, each representing a linguistic category, for example as a singular noun, adverb of a particular type, etc. The codes are automatically allocated to each word by CLAWS, a sophisticated language-processing system developed at the University of Lancaster, and widely recognized as a mature product in the field of Natural Language Processing. For approximately 4.7 per cent of the words in the corpus, CLAWS was unable to decide between two possible taggings with sufficient likelihood of success. In such cases, a two-value word-class code, known as a portmanteau tag is applied. For example, the portmanteau tag VVD-VVN means that the word may be either a past tense verb (VVD), or a past participle (VVN). We did not make any attempt to represent this ambiguity in the SGML coding, though at a later stage of linguistic analysis, perhaps based on the TEI feature structure mechanism, this might be possible. 
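Although the ambiguity of a portmanteau tag is not represented in the SGML coding itself, an application can recover it from the code string. The sketch below assumes, on the strength of the VVD-VVN example alone, that a portmanteau code is two simple codes joined by a hyphen and that no simple code contains a hyphen; the function names are hypothetical.

```python
def alternatives(pos_code):
    """Split a POS code into its candidate word classes.

    A simple code yields a one-element tuple; a portmanteau code such as
    'VVD-VVN' yields both candidates in the order given.
    """
    return tuple(pos_code.split("-"))


def is_portmanteau(pos_code):
    """True when the tagger left two candidate word classes."""
    return len(alternatives(pos_code)) == 2


def could_be(pos_code, word_class):
    """True when `word_class` is one of the candidate readings."""
    return word_class in alternatives(pos_code)
```

A query for past participles could then use `could_be(code, "VVN")` to include the undecided cases as well as the unambiguous ones, or test the code for equality to exclude them.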
Without manual intervention, the CLAWS system has an overall error-rate of approximately 1.7%, excluding punctuation marks. Given the size of the corpus, there was no opportunity to undertake post-editing to correct annotation errors before the first release of the corpus. Since then two successor projects have been completed by the Lancaster team, resulting in the availability of a much improved new version. The first step was to manually check a 2 percent sample from the whole corpus, using a much richer and more delicate set of c existence, and had been for many years), we began by representing the code simply as an entity reference following the token to which it applied, thus: This option, we felt, would enable us to defer to a later stage exactly what the replacement for each entity reference should be: it might be nothing at all, for those uninterested in POS information, or a string, or a pointer indicating a more complex expansion of the TEI kind. The problem with this representation however, is that it relies on an ad hoc interpretive rule (of the kind which SGML is specifically designed to preclude the need for) to indicate, for example, that the code AT0 belongs to the word The, rather than to the word Queen. In fact this is not encoding the truth of the situation: we have here a string of word-annotation pairs. A more truthful annotation might be:
The
AT0
A further possibility is to use an attribute value, for either the Form or the Code: thus the code AT0 could be carried as an attribute of an element containing the form The or, equivalently, the form The could be carried as an attribute of an element containing the code AT0. From the SGML point of view these are equivalent. From the application point of view, the notion of a text composed of strings of POS codes, with embedded forms, seems somehow less appealing than the reverse, which is what we eventually chose for our example phrase, The Queen's annus horribilis. The decision to use an often deprecated form of tag minimization for the POS annotation was forced upon us largely by economic considerations. A fully normalized form, with attribute name and end-tags included on each of the 100 million words, would have more than doubled the size of the corpus. Data storage costs continue to plummet, but the difference between 2 Gb and 4 Gb remains significant!
A second major set of encoding problems arose from the inclusion in the corpus of ten million words of transcribed speech, half of it recorded in pre-defined situations (lectures, broadcasts, consultations etc), and the other half recorded by a demographically sampled set of volunteers, willing to tape their own every day work and leisure time conversation. Speech is transcribed using normal orthographic conventions, rather than attempting a full phonemic transcript, which would have been beyond the project's limited resources. Even so, the markup has to be very rich in order to capture the process of speaker interaction: who is speaking, and how, and where they are interrupted. Significant non-verbal events such as pauses or changes in voice quality are also marked up using appropriate empty elements, which bear descriptive attributes. Here is an example of the start of one such conversation, as encoded in CDIF:
You gotta Radio Two with that . Bloody pirate station wouldn't you ?
The basic unit is the utterance, marked as an <u> element, with an attribute who specifying the speaker, where this is known.
This attribute targets an element in the header for the text, which carries important background information about the speaker, for example their gender, age, social background, inter-relationship etc. Where speakers interrupt each other, as they usually do, a system of alignment pointers, simplified from that defined by the TEI, is used. This requires that all points of overlap are identified in a <timeLine> element prefixed to each text, component points (<when> elements) of which are then pointed to from synchronous moments within the transcribed speech, represented as <ptr> elements. Pausing is marked, using a <pause> element, with an indication of its length if this seems abnormal. Gaps in the transcription, caused either by inaudibility or the need to anonymize the material, are marked using the <unclear> or <gap> elements as appropriate. Truncated forms of words, caused by interruption or false-starts, are also marked, using the <trunc> element. A semi-rigorous form of normalization is applied to the spelling of non-conventional forms such as innit or lorra; the principle adopted was to spell such forms in the way that they typically appear in general dictionaries. Similar methods are used to normalize such features of spoken language as filled pauses and semi-lexicalized items such as um, err, etc. Some light punctuation was also added, motivated chiefly by the desire to make the transcriptions comprehensible to a reader, by marking (for example) questions, possessives, and sentence boundaries in the conventional way. Paralinguistic features affecting particular stretches of speech, such as shouting or laughing, are marked using the <shift> element to delimit changes in voice quality. Non-verbal sounds such as coughing or yawning, and non-speech events such as traffic noise are also marked, using the <vocal> and <event> elements respectively; in both cases, a closed list of values for the desc attribute is used to specify the phenomenon concerned.
It should however be emphasized that the aim was to transcribe as clearly and economically as possible rather than to represent all the subtleties of the audio recording.
The metadata provided by the header element, mentioned above, is of particular importance in any electronic text, but especially so in a large corpus. Earlier corpora have tended to provide all such documentation (if at all) as a separate collection of reference manuals, rather than as an integral part of the corpus, with obvious concomitant problems of maintainability and consistency. In SGML, particularly the TEI header, we felt that we had a powerful mechanism for integrating data and metadata, which we used to the full: each component text of the BNC carries a full header, structured according to TEI recommendations, and containing a full bibliographic description of it, and of its source, as well as specific details of its encoding, revision status, etc. A corpus header, containing information common to all texts, is also provided: this includes full descriptions of the corpus creation methodology, and the various codes used within individual text headers, such as those for text classification.
A particular problem arises with large general purpose corpora like the BNC, the components of which can be cross-classified in many different ways. Earlier corpora have tended to simplify this, for example, by organizing the corpora into groups of texts of a particular type: all newspaper texts together, all novels together, etc. A typical BNC text however can be classified in many different ways (medium, level, region, etc.). The solution we adopted was to include in the header of each text a single <catRef> element carrying an IDREFS-valued attribute, which targeted each of the descriptive categories applicable to the text.
For example, the header of a text of written author type 2 (multiple authorship), written medium type 4 (miscellaneous unpublished), and written domain type 3 (applied sciences) will contain a <catRef> element whose attribute lists the corresponding identifiers. The values wriaty2, wrimed4 etc. each reference a <category> element in the corpus header, containing a definition for the classification intended. The full set of descriptive categories used is thus controlled and can be guaranteed uniform across the whole corpus, while at the same time permitting us to mix and combine descriptive categories within each text as appropriate. A similar method was used to link very detailed participant descriptions (stored in the header) with utterances attributed to them in the spoken part of the corpus.
In retrospect, had we all known as much about SGML at the start of the project as we did by the end of it, we would have made much more impressive progress, and perhaps delivered a better product. Needless effort went into converting from one format to another, which might have been better spent on gathering more reliable contextual information, for example. We also spent a long time devising ways of representing complex information about (for example) relationships between the speakers which in the event was not reliably available for more than a handful of cases. The data representation we produced was thus rather more sophisticated and complex than the material included perhaps warranted.
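The classification mechanism described in this section, an IDREFS-valued attribute whose members each point at a category definition in the corpus header, can be sketched as a simple lookup. This is an illustration under stated assumptions rather than a description of any BNC software: the identifiers wriaty2 and wrimed4 appear in the text, wridom3 is assumed by analogy for the domain category, and the dictionary and function name are hypothetical.

```python
# Stand-in for the <category> definitions held in the corpus header.
# "wridom3" is an assumed identifier, formed by analogy with the other two.
CATEGORIES = {
    "wriaty2": "written author type 2: multiple authorship",
    "wrimed4": "written medium type 4: miscellaneous unpublished",
    "wridom3": "written domain type 3: applied sciences",
}


def resolve_cat_ref(target, categories):
    """Resolve a whitespace-separated IDREFS value to category definitions.

    Raises KeyError if any identifier has no definition, which mirrors the
    guarantee that only controlled categories may be referenced.
    """
    ids = target.split()
    unknown = [i for i in ids if i not in categories]
    if unknown:
        raise KeyError("undefined category id(s): " + ", ".join(unknown))
    return [categories[i] for i in ids]


definitions = resolve_cat_ref("wriaty2 wrimed4 wridom3", CATEGORIES)
```

Because every text points into one shared set of definitions, selecting a sub-corpus reduces to testing for membership of an identifier in each text's list.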
How to analyse a corpus
Linguistic analysis, particularly of large and diversely organized corpora, is not the same as text retrieval. While some of the application needs of the BNC user community might be met by standard SGML browsers or text database systems, many are not. The typical user of the BNC is interested in its contents as raw material for analysis, not as material to be searched for particular words or references. There is a correspondingly greater emphasis on statistical output, on ways of patterning and reordering result sets, as well as a need to support more complex kinds of enquiry than are usual in text-retrieval products. To meet some of these needs, the BNC is now delivered with a purpose-written SGML Aware Retrieval Application (SARA), developed at Oxford.
From the start of the BNC project in 1990, it had always tacitly been assumed that some kind of retrieval software would need to be delivered along with the corpus. The original project proposal talks of “simple processing tools” and an informal specification for an “information search and retrieval processor” was also drawn up by the UCREL team early on. In the event, the need to complete delivery of the corpus on time (or at least, not too late), meant that development of any such software beyond that needed for the immediate needs of the project was increasingly deferred. It was argued that the lack of such software might be only transient, since the corpus was to be delivered in SGML form, tools for which were already becoming widely available, as a result of the widespread adoption of this standard both within the language engineering research community and elsewhere. However, a major stated goal of the project was to make the corpus available and usable as widely as possible, that is, not just at a low cost, but also within as wide a variety of environments as possible.
It seemed to us that the potential user community for large scale corpora like the BNC extended as far beyond the Natural Language Processing research community as it did beyond the immediate needs of commercial lexicographers, although it was largely on behalf of these groups that the project had originally been funded and largely therefore these groups which had determined the manner in which it should be delivered. It seemed to us that the software needs of some of the potential users of the BNC would be only partially met by the generic SGML software available in late 1994 (and to a large extent still today). The choice lay amongst highly specialized, but high performance, application development tool kits which given sufficient expertise could be customized to suit the needs of niche markets in NLP or lexicography, but which were somewhat beyond the needs, comprehension, or indeed purse, of the person in the street; generic SGML browse and display engines, designed originally for electronic publication or delivery over the web, often with very attractive and user-friendly interfaces but generally unable to handle the full complexity and scale of the BNC; or simple concordancing tools which were equally unable to take advantage of the added value we had so painfully put into the encoding and organization of the corpus. Moreover existing software was either very expensive (being aimed at large scale electronic publishing environments), or free, but requiring considerable technical expertise for anything beyond the most trivial of applications. As discussed further below, the scale and complexity of the BNC (with its 100 million tagged words, six and a quarter million sentences, and 4124 interlinked texts) seemed likely to stretch the capacity of most simple text-based concordancers available at that time.
We were fortunate enough to obtain funding, initially from the British Library R & D Department, and subsequently from the British Academy, to produce a software package which might go some way to fill the gaps identified. Development of the system was carried out by Tony Dodd, with valuable input from members of the original BNC Consortium, and from early users of the software. The system is called SARA, for SGML-Aware Retrieval Application, to make explicit that although aware of the SGML markup present in the corpus, it is not a native SGML database. In this respect, however, it is no better or worse than a number of other current software packages.
The SARA system
The SARA system was designed for client-server mode operation, typically in a distributed computing environment, where one or more work-stations or personal computers are used to access a central server over a network. This is, of course, the kind of environment which is most widely current in academic (and other) computing milieux today. The success of the World Wide Web, which uses an identical design philosophy, is vivid testimony to the effectiveness of this approach. The system has four chief components:
the indexing program, which generates an index of tokens from an SGML marked-up text;
the server program, which accepts messages in the Corpus Query Language (see below) and returns results from the SGML text;
the SARA protocol, a formally defined set of message types which determines legal interactions between the client and server programs; this protocol makes use of a high-level query language known as CQL (for Corpus Query Language);
one or more client programs, with which a user interacts in any appropriate platform-specific way, and which communicate with the server program using the protocol.
The SARA index
Computationally, the best-understood method of accessing a text the size and complexity of the BNC is to use an index file, in which search terms are associated with their location in the main text file, and into which rapid access can be obtained using hashing techniques. Such methods have been employed for decades in mainstream information retrieval systems, with the consequence that the advantages and disadvantages of the various ways of implementing the underlying technology are well known and very stable. The SARA index is a conventional index of this type.
Entries in the index are created by the indexing program, using the SGML markup to determine how the input text is to be tokenized. The tokens indexed include the content of every <w> or <c> element, together with the part of speech code allocated to it by the CLAWS program. For example, there will be one entry in the index for lead as a noun, and another for lead tagged as a verb. The index is not case-sensitive, so occurrences of Lead may appear in either entry. The tokenization is entirely dependent on that carried out by CLAWS, which accounts for the presence of a few oddities in the index where CLAWS failed to segment sentences entirely.
The SGML tags (other than those for individual tokens) themselves are also indexed, as are their attribute values. For example, there is an entry in the index for every <text> start- and end-tag, and for every <head> start- and end-tag, etc. This makes it possible to search for words appearing within the scope of a particular SGML element type. For some very frequent element-types (notably <s> and <p>) whose locations are particularly important when delimiting the context of a hit, additional secondary indexes called accelerator files are maintained. The index supplied with the first version of the BNC occupies 33,000 files and 2.5 gigabytes of disk space, i.e. slightly more than the size of the text itself.
Building the index is a complex and computationally expensive process, requiring either much larger amounts of disk space or several intermediate sort/merge phases. This was one reason for delivering the completed index together with the corpus itself on the first release of the BNC, even though development of the client software was not at that stage complete. More compact indexing would have been possible with the use of data compression, at the expense of some increase in complexity; in practice, the indexing algorithm used provides equally good retrieval times for any kind of query, independent of the size of the corpus indexed. The index included on the published CDs necessarily assumes that the server accessing it has certain hardware characteristics (in particular, word length and byte addressing order). To cater for machines for which these assumptions are incorrect, a localization program is now included with the software. This can either make a once-and-for-all modification to the index or be used by the server to make the necessary modifications on the fly. The indexer program is intended to operate on generic SGML texts, that is, not just on the particular set of tags defined for use in the BNC. However, we have not yet attempted to use it for corpora using other DTDs, and there are some features of its behaviour which assume that the DTD in use is (like that of the BNC) more or less TEI-compatible. For example, it requires that texts have a TEI header, that they are decomposed into <s>-like elements, and that each token to be indexed be explicitly tagged as such.
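The byte-order part of the localization step mentioned above can be illustrated with a short Python sketch. It assumes, purely for illustration, that an index file is a flat sequence of unsigned 32-bit integers; the actual SARA file layout is not described in this paper, and the function name `swap_byte_order` is hypothetical.

```python
import struct

def swap_byte_order(data, source_order=">", target_order="<"):
    """Re-encode a buffer of unsigned 32-bit integers in another byte order.

    ">" is big-endian and "<" is little-endian in the notation of the struct module.
    The flat 32-bit layout is an assumption made for this sketch.
    """
    if len(data) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    count = len(data) // 4
    values = struct.unpack(f"{source_order}{count}I", data)
    return struct.pack(f"{target_order}{count}I", *values)

big_endian = struct.pack(">3I", 1, 258, 65536)
little_endian = swap_byte_order(big_endian)
print(struct.unpack("<3I", little_endian))
```

The same function could be applied once to a whole file, or to each block as it is read, which corresponds loosely to the two modes of use described above.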
The SARA server The SARA server program was written originally in the ANSI C language, using BSD sockets to implement network connexions, with a view to making it as portable as possible. The current version, release 930, has been implemented on several different flavours of the Unix operating system, including Solaris, Digital Unix, and Linux, which appear to be the most popular variations. The software is delivered with detailed installation and localization instructions, and can be downloaded freely from the BNC's web site (see http://info.ox.ac.uk/bnc/sara.html), though it is not yet of much interest to anyone other than BNC licensees, since the indexer program is not yet included with it. The server has several distinct functions, amongst which the following are probably the most important: it allows registered users to log on or off and to change their passwords; it implements the key functions required of the Corpus Query Language, in particular looking for tokens in the index, solving a query, supplying bibliographic information about a text, displaying some or all of a text at a given location, and thinning or filtering the result set from a query; it handles all housekeeping, allowing concurrent access by several different users. The server listens on a specified socket for login calls from a client. When such a call is received, the server tries to create a process to accept further data packets. If it succeeds, the client is logged on and set-up messages are exchanged which define, for example, the names and characteristics of SGML elements in the server's database. Following this, the client sends queries in the Corpus Query Language, and receives data packets containing solutions to them. Once a connexion has been established in this way, the server expects to receive regular messages from the client, and will time out if it does not. The client can also request the server to interrupt certain transactions prematurely.
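The time-out behaviour described above can be modelled without any networking code. The following Python sketch is an assumption-laden illustration only: the class name `Session`, the 60-second limit, and the injectable clock are all hypothetical, and nothing here reflects the actual SARA server source.

```python
import time

class Session:
    """Track one logged-on client and when it was last heard from."""

    def __init__(self, user, timeout_seconds, clock=time.monotonic):
        self.user = user
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self.last_heard = clock()

    def touch(self):
        """Record that a message (a query or a keep-alive) has just arrived."""
        self.last_heard = self._clock()

    def expired(self):
        """Return True once the client has been silent for longer than the timeout."""
        return self._clock() - self.last_heard > self.timeout_seconds

# A fake clock makes the behaviour easy to follow without waiting in real time.
now = [0.0]
session = Session("guest", timeout_seconds=60, clock=lambda: now[0])
now[0] = 30.0
session.touch()             # a message arrives after 30 seconds
now[0] = 80.0
print(session.expired())    # 50 seconds of silence since the last message
now[0] = 100.0
print(session.expired())    # 70 seconds of silence since the last message
```

A server built along these lines would call `touch` whenever a packet arrives and would close any session for which `expired` returns true.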
The Corpus Query Language The Corpus Query Language (CQL) is a fairly typical Boolean-style retrieval language, with a number of additional features particularly useful for corpus work. It is emphatically not intended for human use. Like many other such languages, its syntax is designed for convenience of machine processing, rather than elegance or perspicuity. Only a brief summary of its functionality is given here. A query is made up of one or more atomic queries. An atomic query may be one of the following: a word or punctuation character; a wildcard character, which will match any single term; an L-word, that is a combination of word and part-of-speech code, such as CAN=NN1 (i.e. can as a singular noun); a phrase, which is decomposed into a search for consecutive terms irrespective of punctuation; a regular expression; an SGML query, that is, a search for a start- or end-tag, possibly including attribute name-value pairs; an existing (named) solution set (names are allocated to queries by the server); any CQL query enclosed in parentheses. The following unary operators are currently implemented in CQL: case: the dollar operator ($) makes the query which is its operand case-sensitive; header: the commercial-at operator (@) makes the query which is its operand search within headers as well as in the bodies of texts (it thus assumes that a TEI-conformant DTD is in use); optionality: the ? operator matches zero or one solutions to the query which is its operand; it makes no sense unless the query is combined with another. A CQL expression containing more than one query may use the following binary operators: concatenation: two queries written in sequence match occasions where a solution to the first query is directly followed by a solution to the second;
disjunction: the term query1|query2 matches anything that is a solution to either query1 or query2; join: the term query1*query2 matches anything that is a solution to query1 followed by a solution to query2 within the current scope; the term query1#query2 matches anything that is a solution to query1 either followed or preceded by a solution to query2 within the current scope. When queries are joined, the scope of the expression may be defined in one of the following ways: SGML element: a join query followed by the / operator and an SGML query matches cases where the joined query is satisfied within the scope of the SGML query; number: a join query followed by the / operator and a number matches cases where the joined query is satisfied within the number of words specified. If no scope is supplied for a join query, the default scope is a single <bncDoc> element, i.e. a single text in the corpus.
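The two join operators, with a numeric scope, can be sketched as operations on sorted lists of word positions. The following Python fragment is illustrative only: the function names are hypothetical, and reading "within the number of words specified" as a difference between word positions is an assumption rather than a statement of how the server evaluates a join.

```python
from bisect import bisect_left, bisect_right

def join_ordered(first, second, span):
    """Sketch of query1*query2/span: pairs where `second` follows `first`.

    `first` and `second` are sorted lists of word positions; a pair matches when
    0 < b - a <= span. Reading the scope as a position difference is an assumption.
    """
    pairs = []
    for a in first:
        lo = bisect_right(second, a)          # first position strictly after a
        hi = bisect_right(second, a + span)   # one past the last position within the span
        pairs.extend((a, b) for b in second[lo:hi])
    return pairs

def join_unordered(first, second, span):
    """Sketch of query1#query2/span: `second` may precede or follow `first`."""
    pairs = []
    for a in first:
        lo = bisect_left(second, a - span)
        hi = bisect_right(second, a + span)
        pairs.extend((a, b) for b in second[lo:hi] if b != a)
    return pairs

fork = [4, 20, 57]       # hypothetical positions of solutions to query1
knife = [2, 6, 55, 90]   # hypothetical positions of solutions to query2
print(join_ordered(fork, knife, 3))
print(join_unordered(fork, knife, 3))
```

Because both lists are sorted, each lookup is a binary search rather than a scan, which is one reason a positional index suits this kind of query.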
SARA client programs The standard SARA installation includes a very rudimentary client program called solve, for Unix. This provides a command line interface at which CQL expressions can be typed for evaluation, returning result sets on the standard Unix output channel, for piping to a formatter of the user's choice, or display at a terminal. This client is provided mainly for debugging purposes, and also as a model of how to construct such software. A web client, written in Perl, has recently been developed at the University of Zurich; a simplified version of it is currently used at the BNC online service, and it will also be included in the next release of the SARA software. The SARA client program which has been most extensively developed and used runs in the Microsoft Windows environment, and it is this which forms the subject of the remainder of this paper. In designing the Windows client, we attempted to retain as much as possible of the basic functionality of the CQL protocol, while at the same time making the package easy to use for the novice. We also recognized that we could not implement all of the features which corpus specialists would require at the same time as providing a simple enough interface to attract corpus novices. In retrospect, there are several features and functions we would have liked to add (of which some are discussed below); but no doubt, had we done so, there would be several aspects of the user interface we would now be equally dissatisfied with. The SARA client follows standard Microsoft Windows application guidelines, and is written in Microsoft C++, using the standard object classes and libraries. It thus looks very similar to any other Windows application, with the same conventions for window management, buttons, menus, etc. It runs under any version of Windows more recent than 3.0, and there are both 16 and 32 bit versions.
A TCP/IP stack (such as Winsock) to implement connexion to the server is essential, and a colour screen highly desirable. The software uses only small amounts of disk or memory, except when downloading or sorting result sets containing very many (more than a few hundred) or very long (more than 1Kb) hits. The Windows client allows the user to: search the word index and check what tokens it contains; define, save, re-use, or modify a query (effectively, a CQL expression to be evaluated); view, sort, save, or print all or some of the results returned by a query; configure and manipulate the display of results in a variety of ways; view contextual and bibliographic data for any one text; combine simple queries to form a complex one, using a visual interface. A brief description of each of these functions is given below; more information is available from the built-in help file and from the BNC Handbook (Aston and Burnard 1998).
Types of Query The Windows client distinguishes five types of query, and allows for their combination as a complex query. The basic query types are: word query: this searches the SARA word index, either by stem (right hand truncation only is performed) or by pattern (see below). All index entries matching the string entered are returned, and the user can then select all or some of them for dispatch to the server as CQL queries against the corpus; phrase query: a phrase query behaves superficially like a word query, in that it searches for occurrences of a particular word or phrase. It differs in that it can be case-sensitive, can search text headers as well as bodies, can include punctuation, and is aware of the tokenization rules used by the CLAWS tagger. A phrase query can also include a wild card character to match any word in a phrase; pattern query: a pattern query allows for queries using a simple subset of UNIX-style regular expressions, for example to find variant spellings of a word. Some limitations on the kind of pattern which can usefully be searched for are imposed by the nature of the index: for example, left hand truncation of the search term always implies a scan through the entire index, and is therefore not allowed; POS query: a part of speech (POS) query carries out a word query, further restricted by a given POS code or codes, for example to find occurrences of lead tagged as a noun. It should be stressed that this is only feasible for a specified word, since the POS code is only a secondary key in the SARA word index; it is not possible to search for (say) all nouns with the current system; SGML query: an SGML query carries out a search for a given SGML tag in the corpus, optionally qualified by particular combinations of attribute values, for example to find all occurrences of <event> elements in which the desc attribute has the value laughing or laughter.
It is particularly useful when restricting searches to texts of a particular type, since text type information is typically carried by SGML attributes in the BNC. One or more of the above types of query may be combined to form a complex query, using the special purpose Query Builder visual interface, in which the parts of a complex query are represented by nodes of various types. A Query Builder query always has at least two nodes: one, the scope node, defines the context within which a complex query is to be evaluated. This may be expressed either as an SGML element, or as a span of some number of words. The other nodes are known as content nodes, and correspond with the simple queries from which the complex query is built. Content nodes may be linked together horizontally, to indicate alternation, or vertically, to indicate concatenation. In the latter case, different arc types are drawn, to indicate whether the terms are to be satisfied in either order, in one order only, or directly, i.e. with no intervening terms. Query Builder thus enables one to solve queries such as “find the word fork followed by the word knife as a noun, within the scope of a single <u> element”. It can be used to find occurrences of the words anyhow or anyway directly following laughter at the start of a sentence, to constrain searches to texts of particular types or contexts, and so forth. For completeness, the Windows client also allows the skilled (or adventurous) user to type a CQL expression directly: this is the only form of simple query which is not permitted within the Query Builder interface.
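The asymmetry between right hand and left hand truncation noted for word and pattern queries can be seen in a small Python sketch. It uses a sorted list of word forms in place of the hashed SARA index, which is an assumption made only to keep the example short; the function names are hypothetical.

```python
from bisect import bisect_left

def stem_lookup(sorted_forms, stem):
    """Return every indexed form beginning with `stem` (right hand truncation)."""
    start = bisect_left(sorted_forms, stem)
    matches = []
    for form in sorted_forms[start:]:
        if not form.startswith(stem):
            break              # sorted order guarantees no later form can match
        matches.append(form)
    return matches

def suffix_lookup(sorted_forms, ending):
    """Left hand truncation: the sort order gives no help, so every entry is scanned."""
    return [form for form in sorted_forms if form.endswith(ending)]

forms = sorted(["lead", "leaden", "leader", "leading", "leaf", "mislead", "plead"])
print(stem_lookup(forms, "lead"))
print(suffix_lookup(forms, "lead"))
```

The first function touches only the matching entries plus one; the second must visit every entry, which is impractical for an index the size of the BNC's.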
Display and manipulation of queries By whatever method it is posed, any SARA query returns its results in the same way. Results may be displayed in either line mode or page mode, i.e. in a conventional KWIC display, or one result at a time. The amount of context returned for each result is specified as a maximum number of characters, within which a whole sentence or paragraph will usually be displayed. Results can be displayed in one of four different formats: plain, a text-only display which effectively ignores and suppresses all markup; POS, in which individual words are colour-coded according to their part of speech and a user-defined colour scheme; SGML, in which all SGML encoding in the original is displayed uninterpreted; custom, in which the SGML encoding is interpreted according to a simple user-supplied specification. It will often be the case that the number of results found for a query is unmanageably large. To handle this, the SARA client offers the following facilities. A global limit is defined on the number of results to be returned. When this limit is exceeded, the user can choose to over-ride the limit temporarily for this result set, specifying how many solutions are required, discarding any surplus from the end of the result set; to discard all but the first solution in each text; or to take a random sample of specified size from the available solutions. When the last of these is repeated for a given large result set, it will return a different random sample each time. Once downloaded to the client, a set of results may be manipulated in a number of ways. It may be sorted according to the keyword which defined the query, by varying extents of the left or right context for this keyword, or by combinations of these keys. Sorting can be carried out either by the orthographic form, in case-insensitive manner, or by the POS code of words. This enables the user to group together all occurrences of a word in which it is followed by a particular POS code, for example.
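The sorting options just described can be sketched in a few lines of Python. The record layout used here (a dictionary with `left`, `keyword` and `right` lists of word and POS pairs) and the function name `sort_hits` are hypothetical; they are chosen for illustration and do not describe the client's internal data structures.

```python
def sort_hits(hits, key="right", by_pos=False):
    """Sort concordance hits by keyword, left context, or right context.

    Each hit is a dict with "left", "keyword" and "right" entries, each a list of
    (word, pos_code) pairs; this record layout is hypothetical.
    """
    field = 1 if by_pos else 0

    def sort_key(hit):
        tokens = hit[key]
        if key == "left":
            tokens = tokens[::-1]      # the word nearest the keyword is compared first
        # Lower-casing gives the case-insensitive ordering described in the text.
        return [token[field].lower() for token in tokens]

    return sorted(hits, key=sort_key)

hits = [
    {"left": [("the", "AT0")], "keyword": [("lead", "NN1")], "right": [("zinc", "NN1")]},
    {"left": [("to", "TO0")], "keyword": [("lead", "VVI")], "right": [("astray", "AV0")]},
    {"left": [("will", "VM0")], "keyword": [("lead", "VVI")], "right": [("Them", "PNP")]},
]
for hit in sort_hits(hits, key="right", by_pos=True):
    print(hit["keyword"][0][0], hit["right"][0])
```

Sorting on the POS code of the right context groups together hits in which the keyword is followed by the same part of speech, whatever the word itself.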
It is also possible to scroll through a result set, manually identifying particular solutions for inclusion or exclusion, or to thin it automatically in the same way as when the limit on the number of solutions is exceeded. A result set may simply be printed out, or saved to a file in SGML format, for later processing by some SGML-aware formatter or further processor. Named bookmarks may be associated with particular solutions (as in other Windows applications) to facilitate their rapid recovery. The queries generating a result set, together with any associated thinning of it, any bookmarks, and any additional documentary comment, can all be saved together as named queries on the client, which can then be reactivated as required.
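The three automatic thinning options can be expressed compactly. In the following Python sketch a solution is represented as a pair of text identifier and position, which is an assumed representation; the function names, and the choice to return a random sample in corpus order, are likewise assumptions made for illustration.

```python
import random

def first_n(hits, n):
    """Keep the first n solutions and discard the surplus from the end."""
    return hits[:n]

def one_per_text(hits):
    """Keep only the first solution found in each text."""
    seen = set()
    kept = []
    for text_id, position in hits:
        if text_id not in seen:
            seen.add(text_id)
            kept.append((text_id, position))
    return kept

def random_sample(hits, size, rng=None):
    """Take a random sample of the given size, returned in corpus order."""
    rng = rng or random.Random()
    chosen = rng.sample(range(len(hits)), min(size, len(hits)))
    return [hits[i] for i in sorted(chosen)]

# Hypothetical solutions: (text identifier, word position within that text).
hits = [("A00", 12), ("A00", 340), ("A01", 7), ("B1C", 55), ("B1C", 56)]
print(first_n(hits, 2))
print(one_per_text(hits))
print(random_sample(hits, 3))   # a different sample is likely on each call
```

Passing a seeded `random.Random` instance makes a sample repeatable, which may be useful when a saved query needs to reproduce the same thinned result set.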
Additional features of the client The main bibliographic information about each text from which a given concordance line has been extracted can be displayed with a single mouse click. It is also possible to browse directly the whole of the text and its associated header, which is presented as a hierarchic menu, reflecting its SGML structure. The user can either start from the position where a hit was found, expanding or contracting the elements surrounding it, or start from the root of the document tree, and move down to it. A limited range of statistical features is provided. Word frequencies and z-scores are provided for word-form lookups, and there is a useful collocation option which enables one to calculate the absolute and relative frequencies with which a specified term co-occurs within a specified number of words of the current query focus.
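The collocation option can be approximated by counting, for each occurrence of the query focus, whether the specified term appears within the given window. The Python sketch below is a hedged illustration: the function name is hypothetical, and defining the relative frequency as a proportion of focus occurrences is an assumption, since the paper does not say how the client computes it.

```python
def collocation_frequency(tokens, focus_positions, collocate, window):
    """Count how often `collocate` occurs within `window` words of a query focus.

    Returns (absolute, relative): the number of focus positions with the collocate
    nearby, and that number as a proportion of all focus positions. Defining the
    relative frequency this way is an assumption of the sketch.
    """
    collocate = collocate.lower()
    absolute = 0
    for position in focus_positions:
        start = max(0, position - window)
        end = min(len(tokens), position + window + 1)
        # Words on either side of the focus, excluding the focus itself.
        neighbours = tokens[start:position] + tokens[position + 1:end]
        if any(word.lower() == collocate for word in neighbours):
            absolute += 1
    relative = absolute / len(focus_positions) if focus_positions else 0.0
    return absolute, relative

tokens = "he put the knife and fork down then picked up the fork and a spoon".split()
focus = [i for i, word in enumerate(tokens) if word == "fork"]
print(collocation_frequency(tokens, focus, "knife", 3))
```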
Limitations of the current system and future plans As noted above, the current client lacks some facilities which are widely used in particular fields of corpus-based research. This is particularly true of statistical information. There is no facility for the automatic generation of collocate lists, or for any of the other, more sophisticated forms of statistical analysis now widely used. Neither is there any form of linguistic knowledge built into the system (other than the POS tagging): there is no lemmatized index, or lemmatizing component, though clearly it would be desirable to add one. For those sufficiently technically minded, or motivated, the construction of such facilities (whether using SGML-aware tools or not) is relatively straightforward; the problem is that no simple interface or hook exists to build them into the current Windows client. Similarly, it is not possible to define, save and re-use subcorpora, except by saving and re-using the queries which define them. The SARA client can address only the whole of the SARA index, which indexes the whole of the BNC. This is a design issue, which has yet to be addressed. If queries become very complex, involving manipulation of many very large result streams, they may exceed the limits of what can be handled by the server. This has not yet arisen in practice, however. A more common complaint about the current system is that it cannot be used to search for patterns of POS codes, independently of the particular word forms to which they are attached. This is fundamentally an indexing problem, which may be addressed in the next major release of the system. The performance problems associated with queries containing very high frequency words are derived from the same problem, and may be addressed in the same way. And again, it is a trivial exercise for a competent programmer to write special purpose code which will search for such patterns across the whole of the BNC.
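The kind of special purpose code alluded to above might look like the following Python sketch, which scans tagged text directly for a sequence of POS codes. It assumes, for illustration only, that each token appears as a start-tag carrying its code followed by the word (for example `<w NN1>lead `), with end-tags omitted; the sample sentence, the regular expression, and the function names are all hypothetical.

```python
import re

# Assumed token shape: "<w NN1>lead " or "<c PUN>." with end-tags omitted.
TOKEN = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")

def read_tokens(sgml):
    """Return (word, pos_code) pairs for every <w> and <c> element in `sgml`."""
    return [(m.group(3).strip(), m.group(2)) for m in TOKEN.finditer(sgml)]

def find_pos_pattern(tokens, pattern):
    """Yield each run of words whose POS codes match `pattern` exactly."""
    width = len(pattern)
    if width == 0:
        return
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if [code for _, code in window] == list(pattern):
            yield [word for word, _ in window]

sample = "<s n=1><w AT0>The <w AJ0>old <w NN1>lead <w NN1>pipe <w VVD>burst<c PUN>."
tokens = read_tokens(sample)
print(list(find_pos_pattern(tokens, ["AJ0", "NN1"])))
```

A scan of this kind reads every token in the corpus, so it trades the speed of an indexed lookup for independence from the way the index happens to be keyed.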
Despite these limitations, and despite performance problems and difficulties of access, the system has attracted great enthusiasm when tested and demonstrated, perhaps owing largely to the intrinsic interest of the BNC data itself. Since mid-1997, we have been providing a free online service using the client as a part of the British Library's Initiatives for Access programme. This service allows anyone with access to the World Wide Web to search the BNC at no charge. Using any Web browser and a simple query form, restricted searches can be carried out via a CGI script accessing the SARA server directly. Alternatively, users can download and install their own copy of the Windows client software, and use it to access the same server. At the time of writing, this full query service is available free of charge for a limited trial period, after which an annual registration fee is charged. A second updated and corrected version of the Corpus is due for release in 1998. Up to date information about the project is available from the project website at http://info.ox.ac.uk/bnc.
Aston, G. and Burnard, L. 1998. The BNC Handbook. Edinburgh: Edinburgh University Press.
Atkins, S., Clear, J. and Ostler, N. 1992. Corpus design criteria. Literary and linguistic computing 7: 1-16.
Biber, D., Conrad, S. and Reppen, R. 1998. Corpus linguistics: investigating language structure and use. Cambridge: Cambridge University Press.
Dunlop, D. 1995. Practical considerations in the use of TEI headers in large corpora. In Ide, N. and Veronis, J. (eds) 1995. Text Encoding Initiative: background and context. Kluwer.
Garside, R., Leech, G. and McEnery, T. 1997. Corpus annotation: linguistic information from computer text corpora. Harlow: Addison-Wesley Longman.
Kennedy, G. 1998. An introduction to corpus linguistics. Harlow: Addison-Wesley Longman.
McEnery, A. and Wilson, A. 1996. Corpus linguistics. Edinburgh: Edinburgh University Press.