4 How to analyse a corpus

Linguistic analysis, particularly of large and diversely organized corpora, is not the same as text retrieval. While some of the application needs of the BNC user community might be met by standard SGML browsers or text database systems, many are not. The typical user of the BNC is interested in its contents as raw material for analysis, not as material to be searched for particular words or references. There is a correspondingly greater emphasis on statistical output and on ways of patterning and reordering result sets, as well as a need to support more complex kinds of enquiry than are usual in text-retrieval products. To meet some of these needs, the BNC is now delivered with a purpose-written SGML-Aware Retrieval Application (SARA), developed at Oxford.

From the start of the BNC project in 1990, it had always been tacitly assumed that some kind of retrieval software would need to be delivered along with the corpus. The original project proposal talks of ``simple processing tools'', and an informal specification for an ``information search and retrieval processor'' was also drawn up by the UCREL team early on. In the event, the need to complete delivery of the corpus on time (or at least, not too late) meant that development of any software beyond what the project itself immediately required was increasingly deferred. It was argued that the lack of such software might be only transient, since the corpus was to be delivered in SGML form, and tools for SGML were already becoming widely available as a result of the widespread adoption of this standard both within the language engineering research community and elsewhere.

However, a major stated goal of the project was to make the corpus available and usable as widely as possible, that is, not just at a low cost, but also within as wide a variety of environments as possible. It seemed to us that the potential user community for large scale corpora like the BNC extended well beyond both the Natural Language Processing research community and commercial lexicographers, although it was largely on behalf of these two groups that the project had originally been funded, and therefore largely these groups that had determined the manner in which the corpus should be delivered.

It seemed to us that the software needs of some of the potential users of the BNC would be only partially met by the generic SGML software available in late 1994 (and to a large extent still today). The choice lay among three kinds of tool: highly specialized, high performance application development tool kits, which given sufficient expertise could be customized to suit the needs of niche markets in NLP or lexicography, but which were somewhat beyond the needs, comprehension, or indeed purse, of the person in the street; generic SGML browse and display engines, designed originally for electronic publication or delivery over the web, often with very attractive and user-friendly interfaces but generally unable to handle the full complexity and scale of the BNC; and simple concordancing tools, which were equally unable to take advantage of the added value we had so painfully put into the encoding and organization of the corpus. Moreover, existing software was either very expensive (being aimed at large scale electronic publishing environments) or free but demanding of considerable technical expertise for anything beyond the most trivial of applications. As discussed further below, the scale and complexity of the BNC (with its 100 million tagged words, six and a quarter million sentences, and 4124 interlinked texts) seemed likely to stretch the capacity of most simple text-based concordancers available at that time.

We were fortunate enough to obtain funding, initially from the British Library R&D Department, and subsequently from the British Academy, to produce a software package which might go some way towards filling the gaps identified. Development of the system was carried out by Tony Dodd, with valuable input from members of the original BNC Consortium, and from early users of the software. The system is called SARA, for SGML-Aware Retrieval Application, to make explicit that although it is aware of the SGML markup present in the corpus, it is not a native SGML database. In this respect, however, it is no better or worse than a number of other current software packages.