Using SGML for Linguistic
Analysis: the case of the BNC
Lou Burnard
http://users.ox.ac.uk/~lou/
Lou Burnard is
Manager of the Humanities Computing Unit at Oxford University Computing
Services, and European Editor of the Text Encoding Initiative. He was
educated at Oxford University, where he has worked in humanistic
applications of information technology since the seventies.
The British National Corpus (BNC) is a rather large SGML document,
comprising some 4124 samples taken from a rich variety of contemporary
British English texts of every kind, written and printed, famous and
obscure, learned and ignorant, spoken and written. Each of its hundred
million words and six and a quarter million sentences is tagged
explicitly in SGML and carries an automatically-generated linguistic
analysis. Each sample carries a TEI-conformant header, containing
detailed contextual and descriptive information, as well as more
conventional SGML mark-up.
The corpus was created over a four year period by a consortium of
leading dictionary publishers and academic research centres in the UK,
with substantial funding from the British Department of Trade and
Industry, the Science and Engineering Research Council, and the British
Library. On publication, it was made freely available under licence
within the European Union, where it is increasingly used in linguistic
research and lexicography, in applications ranging from the construction
of state of the art language-recognition systems, to the teaching of
English as a second language.
This paper describes how the corpus was constructed, and gives an
overview of some of the SGML encoding issues raised during the process.
A brief description of the special purpose SGML aware retrieval system
developed to analyse the corpus and its current status is also provided.
How to build a corpus
The building of large-scale corpora of text for use in linguistic
analysis pre-dates the technical feasibility of such resources in
digital form by several centuries. For introductions to the field, see
McEnery and Wilson 1996, Biber et al 1998, Kennedy 1998,
or, for a general introduction produced with particular reference to the
British National Corpus (BNC), Aston and
Burnard 1998.
Many of the most well-known language corpora were created within an
academic context, where slightly different constraints tend to affect
quality control, budgets, and deadlines than those associated with
commercial production environments. The BNC project was, by contrast, a
joint academic-industrial project, in which both academic and industrial
partners learned a little more of their colleagues' perspectives by
means of an enforced collaboration. In crude terms, the academic
partners learned to cut their coat according to the cloth available, while the
industrial partners learned that there were more complex things in life
than boilersuits.
The British National Corpus (BNC) is a collection of over 4000
different text samples, of all kinds, both written and spoken,
containing in all six and a quarter million sentences, and over 100
million words of current British English. Work on building it began in
1991, and was completed in 1994. The project was funded by the Science
and Engineering Research Council (now EPSRC) and the Department of Trade and
Industry under the Joint Framework for Information Technology (JFIT)
programme. The project was carried out by a consortium led by Oxford
University Press, of which the other members are major dictionary
publishers Addison-Wesley Longman and Chambers-Harrap; academic research
centres at Oxford University Computing Services, Lancaster University's
Centre for Computer Research on the English Language, and the British
Library's Research and Innovation Centre.
Organizationally, the tasks of designing and building the corpus
were split across a number of technical work groups on which each member
of the consortium was represented. Task Group A concerned itself with
basic issues of corpus design — what principles should inform the
selection of texts for inclusion in the corpus — what target
proportions should be set for different text types and so forth. Task
Group B focussed on one key issue in corpus construction, the
establishment of acceptable procedures for rights clearance and
permissions to include material in the corpus. This might have been the
subject of a major research project in its own right: in practice, the
output from the task group was a standard agreement, in some sense a
precedent-setting document for other European corpus-builders.
Task Group C concerned itself with technical details of encoding and
text processing; these are discussed in more detail below. Task Group D
concerned itself with corpus enrichment and analysis. In practice, the
distinction between the two turned out to be largely the distinction
between the creation of the corpus and of specific software to make use
of it. Since the latter task was not possible until the end of the
project, by which time there were no funds left to do it, it is unsurprising
that little was actually accomplished in this group within the time of
the original BNC project.
SGML played a major part in the BNC project: as an interchange
medium between the various data-providers; as a target
application-independent format; and as the vehicle for expression of
metadata and linguistic interpretations encoded within the corpus. From
the start of the project, it was recognized that SGML offered the only
sure foundation for long term storage and distribution of the data; only
during its progress did the importance of using it also as an exchange
medium between the various partners emerge. The importance of SGML as an
application independent encoding format is also only now becoming
apparent, as a wide range of applications for it begin to be realized.
The scale and variety of data to be included meant that an industrial-style
production line environment had to be defined: this was dubbed the
BNC sausage machine by Jeremy Clear, the project manager at the time,
and may be summarized as follows:
- data capture: each of the three commercial partners selected and prepared
material to a different defined format, reflecting to some extent the
diverse nature of materials for which they were primarily responsible;
- primary check and conversion: OUCS checked each text against its data
capture format, automatically converted it to project standard format, and
made an accession record for it in the project database;
- linguistic annotation: valid SGML texts were passed to Lancaster for
automatic addition of word class tagging and linguistic segmentation, using
the CLAWS software discussed further below;
- text cataloguing and final checking: lexically annotated texts were run
through a final conversion at OUCS; a detailed TEI header was generated from
the project database and the text itself added to the corpus.
A wide literature now exists on corpus design methodologies, which
this paper will not attempt to summarize although the experience of
designing and creating the BNC has contributed greatly to it (see in
particular Atkins et al 1992). A corpus
which, like the BNC, aims to represent all the varieties of the English
language cannot simply be assembled opportunistically by collecting as
much electronic material as its budget will permit, although a project
with a defined budget and timescale inevitably finds design principles
sometimes have to be sacrificed to pragmatic considerations. Neither can
a corpus aiming to represent the full variety of contemporary English
proceed on a purely statistical basis: a statistically balanced random
sampling of language producers will be unlikely to include (for example)
many journalists or media personalities, while a statistically balanced
random sample of language reception is unlikely to include much apart
from popular journalism. As a compromise, the project adopted a
stratified sampling procedure, in which the range of texts to be sampled
was pre-defined, and target proportions were then agreed on for each.
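The stratified procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the candidate pools, and the word counts are hypothetical, and only the 90/10 written/spoken split is taken from the paper.

```python
import random

def stratified_sample(texts_by_stratum, target_percent, total_words, seed=0):
    """Pick texts at random within each stratum until its word quota is met.

    texts_by_stratum: {stratum: [(text_id, word_count), ...]}
    target_percent:   {stratum: whole-number percentage of the corpus}
    """
    rng = random.Random(seed)
    chosen = {}
    for stratum, percent in target_percent.items():
        quota = total_words * percent // 100
        pool = list(texts_by_stratum[stratum])
        rng.shuffle(pool)                 # random selection inside the stratum
        picked, words = [], 0
        for text_id, count in pool:
            if words >= quota:
                break
            picked.append(text_id)
            words += count
        chosen[stratum] = picked
    return chosen

# Hypothetical pools: 50 candidate texts of 1,000 words in each stratum.
pools = {
    "written": [(f"w{i}", 1000) for i in range(50)],
    "spoken": [(f"s{i}", 1000) for i in range(50)],
}
sample = stratified_sample(pools, {"written": 90, "spoken": 10}, total_words=20_000)
print({stratum: len(ids) for stratum, ids in sample.items()})
```

The point of the design is that the proportions are fixed in advance and chance operates only within each stratum, which is what distinguishes this from a purely statistical sample.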
In the spoken part of the corpus, ten per cent of the whole, a
balance was struck between material gathered on a statistical basis
(i.e. recruited from a demographically-balanced sample of language
producers) and from material gathered from a pre-defined set of speech
situations or contexts. A moment's reflection should show that this dual
practice was necessary to ensure that the corpus included examples of
both common and uncommon types of language. Equally, in the written
parts of the corpus, published and unpublished material, of a wide range
of topics, registers, levels etc., were all represented. From high-brow
novels and text books to pulp fiction and journalism, by way of school
essays, office memoranda, email discussion lists, and paper-bags, our
aim was to ensure that every form of written language is to be found in
the corpus, to a greater or lesser extent.
As noted above, data capture for the whole project was carried out
by the three publishers in the BNC consortium (OUP, Longman and
Chambers). Three sources of electronic data were envisaged at the start
of the project: existing electronic text, OCR from printed text, and
keyed-in text. It soon became apparent that the first source would be
less useful than anticipated since either the material was encoded in
formats too difficult to unscramble consistently, or the texts available
did not match the stipulated design criteria. Scanning and keying text
brought lesser problems of their own, of which probably the worst was
training keyboarders and scanners at different places to be consistent
under tight time constraints. In the case of spoken data, keyboarding
was the only option from the start, and proved to be very expensive and
time-consuming, in part because of the very high standards set for data
capture. Transcribing spoken language with attention to such features as
overlap (where one speaker interrupts another), and enforcing
consistency in the representation of non lexical or semi-lexical
phenomena are major technical problems, rarely attempted on the scale of
the BNC material, which finally included ten million words of naturally
occurring speech, recorded in all sorts of environments.
For a variety of reasons, the three data suppliers all used their
own internal markup systems for data capture which then had to be
centrally converted and corrected to the project encoding standard. Had
this standard, the Corpus Document Interchange Format, or CDIF, been
available at the start of the project, the need for conversion would
have been lessened, but not that for validation. CDIF, like many other
TEI-conformant dtds, allows for considerable variation in actual
encoding practice, largely because of the very widely different text
types that it has to accommodate. To help ease the burden on data
suppliers, the tags available were classified according to their
perceived usefulness and applicability. Some, such as headings,
chapter or other division breaks, and paragraphs, were designated
"required" parts of any CDIF document; when such features
occur in a text, they must be marked up. Others, such as
sub-divisions within the text, lists, poems, and notes about editorial
correction, were "recommended", and should be marked up if
at all possible. Finally, some tags were considered
"optional": dates, proper names and citations which
are easily identifiable. The process of format conversion and SGML
validation was automated as far as possible (fortunately for us, the
sgmls parser became available early on during the project): these
constituted the syntactic check. Where time
permitted, we also carried out a semantic check to
determine whether material which should have been tagged had in fact
been marked up, though it was of course impossible to carry out a full
proof reading exercise. Materials which exceeded an agreed threshold
of errors, either syntactic or semantic, were returned to the data
capture agency for correction or replacement.
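As a rough sketch of how the tag tiers and the acceptance threshold fit together, consider the following Python fragment. The feature labels are informal stand-ins for the CDIF tags, and `max_errors` is a hypothetical figure: the paper does not state the threshold actually agreed.

```python
# Tiers follow the classification described above; the labels are informal.
TAG_TIERS = {
    "required": {"heading", "division", "paragraph"},
    "recommended": {"subdivision", "list", "poem", "editorial-note"},
    "optional": {"date", "proper-name", "citation"},
}

def missing_required(features_present, features_tagged):
    """Required features that occur in the text but were not marked up."""
    untagged = features_present - features_tagged
    return sorted(untagged & TAG_TIERS["required"])

def accept(syntax_errors, semantic_errors, max_errors=5):
    """Return False when a text should go back to the data capture agency.

    max_errors is an assumed threshold, used here only for illustration.
    """
    return syntax_errors + semantic_errors <= max_errors

gaps = missing_required({"heading", "list", "paragraph"}, {"paragraph"})
print(gaps, accept(syntax_errors=0, semantic_errors=len(gaps)))
```

Only the required tier counts as a semantic error in this sketch; an untagged list is tolerated because lists were merely recommended.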
The many thousands of files and versions of files
involved as texts passed through the production line were managed by a
relational database system, which also handled routine archiving and
backup. This database also held all of the bibliographic and other
metadata associated with each text, from which the TEI headers
eventually added to each text were generated. (A useful summary of the
information recorded in each header is provided in Dunlop
1995).
The project was funded for a total of four years, of which the first
was devoted to agreeing and defining in full operational detail the
procedures summarized above. By the end of the 5th quarter (March
1992), 10 percent of the corpus had been received at OUCS and procedures
for handling it were in place. A small sample (2 million words) had been
processed and sent on to Lancaster for the next stage of processing. The
rate at which texts were received and processed at OUCS fluctuated
somewhat during the course of the project, but ramped up steadily
towards its end.
The following table shows the approximate number of words (in
millions) received at OUCS, converted to the project standard, and
received back from Lancaster in annotated form, for each quarter
(parenthesized figures indicate
bounced texts — material which had to be returned
because it did not pass the QA procedures discussed above):
| Quarter | Received | Validated | Annotated |
| 6 | 2 | 4 | - |
| 7 | 6 | 4 | - |
| 8 | 5 (1) | 8 | 6 |
| 9 | 6 (2) | 14 | 13 |
| 10 | 14 (3) | 11 | 5 |
| 11 | 12 (2) | 13 | 8 |
| 12 | 25 | 16 | 17 |
| 13 | 25 | 32 | 22 |
| 14 | 3 | 8 | 30 |
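The quarterly figures are easier to read as running totals. The short Python sketch below transcribes the table (treating "-" as zero and holding the bounced figures in their own column) and accumulates each column.

```python
from itertools import accumulate

# (quarter, received, bounced, validated, annotated), in millions of words.
ROWS = [
    (6, 2, 0, 4, 0),
    (7, 6, 0, 4, 0),
    (8, 5, 1, 8, 6),
    (9, 6, 2, 14, 13),
    (10, 14, 3, 11, 5),
    (11, 12, 2, 13, 8),
    (12, 25, 0, 16, 17),
    (13, 25, 0, 32, 22),
    (14, 3, 0, 8, 30),
]

def running_totals(rows):
    """Cumulative (received, bounced, validated, annotated) after each quarter."""
    columns = list(zip(*rows))[1:]                  # drop the quarter column
    totals = [list(accumulate(column)) for column in columns]
    return list(zip(*totals))

totals = running_totals(ROWS)
for (quarter, *_), row in zip(ROWS, totals):
    print(quarter, row)
```

The validated total ends up higher than the received total because the table starts at quarter 6: the material already received by March 1992 is validated within the period shown but was received before it.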
How to mark up a corpus
A full description of the BNC mark up scheme is beyond the scope of
this paper, and is in any case available in the documentation supplied
with the corpus and elsewhere. In this paper I would like to focus on
the way in which the anticipated uses of the corpus conditioned the mark
up scheme actually applied.
It has often been said of general purpose dtds such as the TEI
(which was being developed symbiotically with the CDIF scheme used in
the BNC) that they allow the user too much flexibility. In practice, we
found that the richly descriptive aspects of the TEI scheme were of
least interest to our potential users. For purpose of linguistic
analysis, the immense variety of objects in a fully marked up text, with
all their fascinating problems of rendering and interpretation, are of
less importance than a reliable and regular structural breakdown, into
segments and words. This was an unpalatable lesson for academics with a
fondness for the rugosities of real language, but an important one. The
scale of the BNC simply did not permit us to lovingly mark up every
detail of the text — distinguishing sharply every list, foreign
word, editorial intervention, or proper name. Instead we had to be sure
that headings, paragraphs, and major text divisions were reliably and
consistently captured in an immense variety of materials. For purposes
of linguistic analysis, segmentation at the sentence and word level was
crucial but, fortunately, automatic. By comparison with other, more
literary oriented, TEI texts, the tagging of the BNC is thus rather
sparse, despite its 150 million SGML tags.
The basic structural mark up of both written and spoken texts may
be summarized as follows. Each of the 4124 documents or text samples
making up the corpus is represented by a single <bncDoc>
element, containing a header, and either a <text> (for written
texts) or an <stext> (for spoken texts) element. The header
element contains detailed and richly structured metadata supplying a
variety of contextual information about the document (its title, source,
encoding, etc., as defined by the TEI): as noted above, headers were
automatically generated from information managed within a relational
database. A spoken text is divided into utterances, possibly
interspersed with nonlinguistic elements such as events, possibly
grouped into divisions to mark breaks in conversations. A written text
is divided into paragraphs, possibly also grouped into hierarchically
numbered divisions. Below the level of the paragraph or utterance, all
texts are composed of <s> elements, marking the automatic
linguistic segmentation carried out at Lancaster, and each of these is
divided into <w> (word) or <c> (punctuation) elements,
each bearing a POS (part of speech) annotation attribute.
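The containment rules just described can be written down as a small table and checked mechanically. The Python sketch below is a simplification, not the CDIF DTD: the names `header` and `div` are assumed, and the real content models are richer.

```python
# Simplified content model; "header" and "div" are assumed element names.
ALLOWED_CHILDREN = {
    "bncDoc": {"header", "text", "stext"},
    "text": {"div", "p"},
    "stext": {"div", "u", "event"},
    "div": {"div", "p", "u", "event"},
    "p": {"s"},
    "u": {"s"},
    "s": {"w", "c"},
}

def violations(node, path=""):
    """node is (name, [children]); leaf elements have an empty child list."""
    name, children = node
    here = f"{path}/{name}"
    found = []
    for child in children:
        if child[0] not in ALLOWED_CHILDREN.get(name, set()):
            found.append(f"{child[0]} not allowed in {here}")
        found.extend(violations(child, here))
    return found

written = ("bncDoc", [("header", []),
                      ("text", [("p", [("s", [("w", []), ("w", []), ("c", [])])])])])
broken = ("bncDoc", [("header", []), ("text", [("p", [("w", [])])])])
print(violations(written))
print(violations(broken))
```

In the project itself this job was done by an SGML parser against the DTD; the sketch only makes the shape of the hierarchy explicit.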
Considerable discussion went on at the start of the project as to
the best method of encoding this automatically-generated information.
There are about sixty different possible POS codes, each representing a
linguistic category, for example singular noun, adverb of a
particular type, etc. The codes are automatically allocated to each word
by CLAWS, a sophisticated language-processing system developed at the
University of Lancaster, and widely recognized as a mature product in
the field of Natural Language Processing.
For approximately 4.7 per cent of the words in the corpus, CLAWS was
unable to decide between two possible taggings with sufficient
likelihood of success. In such cases, a two-value word-class code, known
as a portmanteau tag, is applied. For example, the portmanteau tag
VVD-VVN means that the word may be either a past tense verb (VVD), or a
past participle (VVN).
We did not make any attempt to represent this ambiguity in the SGML
coding, though at a later stage of linguistic analysis, perhaps based on
the TEI feature structure mechanism, this might be possible. Without
manual intervention, the CLAWS system has an overall error-rate of
approximately 1.7%, excluding punctuation marks. Given the size of the
corpus, there was no opportunity to undertake post-editing to correct
annotation errors before the first release of the corpus.
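A consumer of the corpus can still recover both readings of a portmanteau tag, and the percentages quoted translate into sizeable absolute numbers. The sketch below assumes that simple codes never contain a hyphen, and treats the corpus as exactly 100 million words, so the counts are rough orders of magnitude only (the error rate quoted also excludes punctuation).

```python
def candidate_tags(code):
    """Split a portmanteau tag into its alternatives; simple tags pass through."""
    return code.split("-")

CORPUS_WORDS = 100_000_000                  # round figure used in the paper

# Integer arithmetic keeps the estimates exact: 4.7% and 1.7% respectively.
ambiguous_words = CORPUS_WORDS * 47 // 1000
mistagged_words = CORPUS_WORDS * 17 // 1000

print(candidate_tags("VVD-VVN"), candidate_tags("VVD"))
print(ambiguous_words, mistagged_words)
```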
Since then two successor projects have been completed by the
Lancaster team, resulting in the availability of a much improved new
version. The first step was to manually check a 2 percent sample from
the whole corpus, using a much richer and more delicate set of c[...]

[...] existence, and had been for many years), we began by representing the
code simply as an entity reference following the token to which it
applied, thus:
This option, we felt, would enable us to defer to a later stage
exactly what the replacement for each entity reference should be: it
might be nothing at all, for those uninterested in POS information, or a
string, or a pointer indicating a more complex expansion of the TEI
kind. The problem with this representation however, is that it relies on
an ad hoc interpretive rule (of the kind which SGML is specifically
designed to preclude the need for) to indicate, for example, that the
code AT0 belongs to the word The,
rather than to the word Queen. In
fact this is not encoding the truth of the situation: we have here a
string of word-annotation pairs. A more truthful annotation might be:
At0
A further possibility is to use an attribute value, for either the
Form or the Code: thus
The
or, equivalently,
AT0
From the SGML point of view these are equivalent. From the
application point of view, the notion of a text composed of strings of
POS codes, with embedded forms seems somehow less appealing than the
reverse, which is what we eventually chose: our example being tagged as
follows:
The Queen's annus horribilis
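To make the contrast between the first and the final representation concrete, here is a small Python sketch that renders the same word-annotation pairs both ways and reads the minimized form back. It is an illustration only: the code AT0 for The comes from the text, the code shown for Queen is assumed, and the exact spelling of the markup is inferred from the description (a `<w>` start-tag carrying a bare attribute value, with no end-tag).

```python
import re

# (form, code) pairs; AT0 is given in the text, NN1 is assumed for illustration.
PAIRS = [("The", "AT0"), ("Queen", "NN1")]

def as_entity_refs(pairs):
    """First attempt: the code follows the token as an entity reference."""
    return " ".join(f"{form}&{code};" for form, code in pairs)

def as_minimized(pairs):
    """Chosen form: a start-tag holding the bare code, end-tag omitted."""
    return " ".join(f"<w {code}>{form}" for form, code in pairs)

def parse_minimized(text):
    """Recover (form, code) pairs from the minimized form."""
    found = re.findall(r"<w ([A-Z0-9-]+)>([^<]*)", text)
    return [(form.strip(), code) for code, form in found]

print(as_entity_refs(PAIRS))
print(as_minimized(PAIRS))
print(parse_minimized(as_minimized(PAIRS)))
```

In the entity-reference version nothing in the markup itself says which word a code belongs to; in the element version the pairing is explicit, which is why the round trip needs no interpretive rule.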
The decision to use an often deprecated form of tag minimization for
the POS annotation was forced upon us largely by economic
considerations. A fully normalized form, with attribute name and
end-tags included on each of the 100 million words would have more than
doubled the size of the corpus. Data storage costs continue to plummet,
but the difference between 2 Gb and 4 Gb remains significant!
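The storage argument can be checked on a small scale. The sketch below compares a minimized and a fully normalized rendering of one phrase; the attribute name `pos` and all codes other than AT0 are assumptions, so the ratio it prints indicates the order of magnitude rather than reproducing the corpus-wide figure.

```python
# Forms keep their trailing spaces; codes other than AT0 are illustrative.
PHRASE = [("The ", "AT0"), ("Queen", "NN1"), ("'s ", "POS"),
          ("annus ", "NN1"), ("horribilis", "NN1")]

def minimized(pairs):
    """Bare attribute value, no end-tag."""
    return "".join(f"<w {code}>{form}" for form, code in pairs)

def normalized(pairs):
    """Attribute name, quoted value and end-tag on every word."""
    return "".join(f'<w pos="{code}">{form}</w>' for form, code in pairs)

small = len(minimized(PHRASE))
full = len(normalized(PHRASE))
print(small, full, round(full / small, 2))
```

The overhead per word is fixed, so the ratio grows as the average word gets shorter; a longer attribute name would push it higher still.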
A second major set of encoding problems arose from the inclusion in
the corpus of ten million words of transcribed speech, half of it
recorded in pre-defined situations (lectures, broadcasts, consultations
etc), and the other half recorded by a demographically sampled set of
volunteers, willing to tape their own every day work and leisure time
conversation.
Speech is transcribed using normal orthographic conventions, rather
than attempting a full phonemic transcript, which would have been beyond
the project's limited resources. Even so, the markup has to be very
rich in order to capture the process of speaker interaction — who
is speaking, and how, and where they are interrupted. Significant
non-verbal events such as pauses or changes in voice quality are also
marked up using appropriate empty elements, which bear descriptive
attributes. Here is an example of the start of one such conversation,
as encoded in CDIF:
You
gotta Radio
Two with that .
Bloody pirate
station wouldn't
you ?
The basic unit is the utterance, marked as an <u> element,
with an attribute who specifying the speaker, where this
is known. This attribute targets an element in the header for the text,
which carries important background information about the speaker, for
example their gender, age, social background, inter-relationship etc.
Where speakers interrupt each other, as they usually do, a system of
alignment pointers, simplified from that defined by the TEI, is used.
This requires that all points of overlap are identified in a <timeLine>
element prefixed to each text, component points (<when>
elements) of which are then pointed to from synchronous moments within
the transcribed speech, represented as <ptr> elements. Pausing
is marked, using a <pause> element, with an indication of its
length if this seems abnormal. Gaps in the transcription, caused either
by inaudibility or the need to anonymize the material, are marked using
the <unclear> or <gap> elements as appropriate.
Truncated forms of words, caused by interruption or false-starts, are
also marked, using the <trunc> element.
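The pointer mechanism can be modelled very simply: each utterance carries markers naming points on the timeline, and two utterances overlap wherever they name the same point. In the Python sketch below the speaker identifiers, the point identifiers, and the position of the overlap are all invented for illustration; only the words and the element names come from the paper.

```python
from collections import defaultdict

# (speaker id, content); content mixes words with ("ptr", point id) markers.
UTTERANCES = [
    ("PS001", ["You", "gotta", ("ptr", "P1"), "Radio", "Two",
               ("ptr", "P2"), "with", "that"]),
    ("PS002", [("ptr", "P1"), "Bloody", "pirate",
               ("ptr", "P2"), "station", "wouldn't", "you"]),
]

def speakers_at_points(utterances):
    """Map each timeline point to the speakers whose utterances reference it."""
    at_point = defaultdict(list)
    for who, content in utterances:
        for item in content:
            if isinstance(item, tuple) and item[0] == "ptr":
                at_point[item[1]].append(who)
    return dict(at_point)

def overlapping_points(utterances):
    """Keep only the points shared by more than one speaker."""
    return {point: who for point, who in speakers_at_points(utterances).items()
            if len(set(who)) > 1}

print(overlapping_points(UTTERANCES))
```

Reading the result, the stretch between P1 and P2 is spoken simultaneously by both participants.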
A semi-rigorous form of normalization is applied to the spelling of
non-conventional forms such as innit or lorra; the principle adopted was to
spell such forms in the way that they typically appear in general
dictionaries. Similar methods are used to normalize such features of
spoken language as filled pauses, semi-lexicalized items such as
um, err, etc. Some light punctuation
was also added, motivated chiefly by the desire to make the
transcriptions comprehensible to a reader, by marking (for example)
questions, possessives, and sentence boundaries in the conventional way.
Paralinguistic features affecting particular stretches of speech,
such as shouting or laughing, are marked using the <shift>
element to delimit changes in voice quality. Non-verbal sounds such as
coughing or yawning, and non-speech events such as traffic noise are
also marked, using the <vocal> and <event> elements
respectively; in both cases, a closed list of values for the desc
attribute is used to specify the phenomenon concerned. It should however
be emphasized that the aim was to
transcribe as clearly and economically as possible rather than to
represent all the subtleties of the audio recording.
The metadata provided by the header element, mentioned above, is of
particular importance in any electronic text, but especially so in a
large corpus. Earlier corpora have tended to provide all such
documentation (if at all) as a separate collection of reference manuals,
rather than as an integral part of the corpus, with obvious concomitant
problems of maintainability and consistency. In SGML, particularly the
TEI header, we felt that we had a powerful mechanism for integrating
data and metadata, which we used to the full: each component text of the
BNC carries a full header, structured according to TEI recommendations,
and containing a full bibliographic description of it, and of its
source, as well as specific details of its encoding, revision status,
etc. A corpus header, containing information common to all texts, is
also provided: this includes full descriptions of the corpus creation
methodology, and the various codes used within individual text headers,
such as those for text classification.
A particular problem arises with large general purpose corpora like
the BNC, the components of which can be cross-classified in many
different ways. Earlier corpora have tended to simplify this, for
example, by organizing the corpora into groups of texts of a particular
type — all newspaper texts together, all novels together, etc. A
typical BNC text however can be classified in many different ways
(medium, level, region, etc.). The solution we adopted was to include
in the header of each text a single <catRef> element carrying an
IDREFS-valued attribute, which targeted each of the descriptive
categories applicable to the text.
For example, the header of a text of written author type 2 (multiple
authorship), written medium type 4 (miscellaneous unpublished), and
written domain type 3 (applied sciences) will contain an element like the
following:
<catRef target="wriaty2 wrimed4 wridom3">
The values wriaty2, wrimed4, etc. here each reference a <category>
element in the corpus header, containing a definition for the
classification intended. The full set of descriptive categories used is
thus controlled and can be guaranteed uniform across the whole corpus,
while at the same time permitting us to mix and combine descriptive
categories within each text as appropriate.
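Resolving such a reference list is a simple lookup, sketched below in Python. The identifiers wriaty2 and wrimed4 appear in the text; wridom3 is inferred from the same naming pattern, and the dictionary stands in for the `<category>` definitions held in the corpus header.

```python
# Stand-in for the <category> definitions in the corpus header.
CATEGORIES = {
    "wriaty2": "written author type: multiple authorship",
    "wrimed4": "written medium: miscellaneous unpublished",
    "wridom3": "written domain: applied sciences",   # identifier assumed
}

def resolve(idrefs):
    """Expand a space-separated IDREFS value into category definitions."""
    refs = idrefs.split()
    missing = [ref for ref in refs if ref not in CATEGORIES]
    if missing:
        # An SGML parser would reject an IDREF with no matching ID.
        raise KeyError(f"undefined category: {', '.join(missing)}")
    return [CATEGORIES[ref] for ref in refs]

print(resolve("wriaty2 wrimed4 wridom3"))
```

Because every text points into one shared list of definitions, a misspelt or invented category is caught at validation time instead of silently creating a new class of text.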
A similar method was used to link very detailed participant
descriptions (stored in the header) with utterances attributed to them
in the spoken part of the corpus.
In retrospect, had we all known as much about SGML at the start of
the project as we did by the end of it, we would have made much more
impressive progress, and perhaps delivered a better product. Needless
effort went into converting from one format to another, which might have
been better spent on gathering more reliable contextual information for
example. We also spent a long time devising ways of representing complex
information about (for example) relationships between the speakers which
in the event was not reliably available for more than a handful of
cases. The data representation we produced was thus rather more
sophisticated and complex than the material included perhaps warranted.
How to analyse a corpus
Linguistic analysis, particularly of large and diversely organized
corpora, is not the same as text retrieval. While some of the
application needs of the BNC user community might be met by standard
SGML browsers or text database systems, many are not. The typical user
of the BNC is interested in its contents as raw material for analysis,
not as material to be searched for particular words or references. There
is a correspondingly greater emphasis on statistical output, on ways of
patterning and reordering result sets, as well as a need to support more
complex kinds of enquiry than are usual in text-retrieval products. To
meet some of these needs, the BNC is now delivered with a
purpose-written SGML Aware Retrieval Application (SARA), developed at
Oxford.
From the start of the BNC project in 1990, it had always tacitly
been assumed that some kind of retrieval software would need to be
delivered along with the corpus. The original project proposal talks of
“simple processing tools” and an informal specification for an
“information search and retrieval processor” was also drawn up by the
UCREL team early on. In the event, the need to complete delivery of the
corpus on time (or at least, not too late), meant that development of
any such software beyond that needed for the immediate needs of the
project was increasingly deferred. It was argued that the lack of such
software might be only transient, since the corpus was to be delivered
in SGML form, tools for which were already becoming widely available, as
a result of the widespread adoption of this standard both within the
language engineering research community and elsewhere.
However, a major stated goal of the project was to make the corpus
available and usable as widely as possible, that is, not just at a low
cost, but also within as wide a variety of environments as possible. It
seemed to us that the potential user community for large scale corpora
like the BNC extended as far beyond the Natural Language
Processing research community as it did beyond the immediate needs of
commercial lexicographers, although it was largely on behalf of these
groups that the project had originally been funded and largely therefore
these groups which had determined the manner in which it should be
delivered.
It seemed to us that the software needs of some of the potential users
of the BNC would be only partially met by the generic SGML software
available in late 1994 (and to a large extent still today). The choice
lay amongst highly specialized, but high performance, application
development tool kits which given sufficient expertise could be
customized to suit the needs of niche markets in NLP or lexicography,
but which were somewhat beyond the needs, comprehension, or indeed
purse, of the person in the street; generic SGML browse and display
engines, designed originally for electronic publication or delivery over
the web, often with very attractive and user-friendly interfaces but
generally unable to handle the full complexity and scale of the BNC; or
simple concordancing tools which were equally unable to take advantage
of the added value we had so painfully put into the encoding and
organization of the corpus. Moreover existing software was either very
expensive (being aimed at large scale electronic publishing
environments), or free, but requiring considerable technical expertise
for anything beyond the most trivial of applications. As discussed
further below, the scale and complexity of the BNC (with its 100 million
tagged words, six and a quarter million sentences, and 4124 interlinked
texts) seemed likely to stretch the capacity of most simple text-based
concordancers available at that time.
We were fortunate enough to obtain funding, initially from the British
Library R & D Department, and subsequently from the British Academy,
to produce a software package which might go some way to fill the gaps
identified. Development of the system was carried out by Tony Dodd, with
valuable input from members of the original BNC Consortium, and from
early users of the software. The system is called SARA, for SGML-Aware
Retrieval Application, to make explicit that, although aware of the SGML
markup present in the corpus, it is not a native SGML database. In this
respect, however, it is no better or worse than a number of other current
software packages.
The SARA system
The SARA system was designed for client-server mode operation, typically
in a distributed computing environment, where one or more work-stations
or personal computers are
used to access a central server over a network. This is, of course, the
kind of environment which is most widely current in academic (and other)
computing milieux today. The success of the World Wide Web, which uses
an identical design philosophy, is vivid testimony to the effectiveness
of this approach.
The system has four chief components:
- the indexing program, which generates an index of tokens from an
SGML marked-up text;
- the server program, which accepts messages in the Corpus Query
Language (see below) and returns results from the SGML text;
- the SARA protocol, a formally defined set of message types which
determines legal interactions between the client and server programs;
this protocol makes use of a high-level query language known as CQL (for
Corpus Query Language);
- one or more client programs, with which a user interacts in any
appropriate platform-specific way, and which communicate with the server
program using the protocol.
The SARA index
Computationally, the best-understood method of accessing a text the size
and complexity of the BNC is to use an index file, in which search terms
are associated with their location in the main text file, and into which
rapid access can be obtained using hashing techniques. Such methods have
been employed for decades in mainstream information retrieval systems,
with the consequence that the advantages and disadvantages of the
various ways of implementing the underlying technology are well known
and very stable.
The SARA index is a conventional index of this type. Entries in the
index are created by the indexing program, using the SGML markup to
determine how the input text is to be tokenized. The tokens indexed
include the content of every <w> or <c> element,
together with the part of speech code allocated to it by the CLAWS
program. For example, there will be one entry in the index for
lead as a noun, and another for lead
tagged as a verb. The index is not case-sensitive, so occurrences of
Lead may appear in either entry. The
tokenization is entirely dependent on that carried out by CLAWS, which
accounts for the presence of a few oddities in the index where CLAWS
failed to segment sentences entirely.
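The keying of index entries by word and part of speech can be sketched as follows. This is a minimal illustration, not SARA's actual file format: the function name build_index and the sample tokens are hypothetical, and an in-memory dictionary stands in for the hashed index file. The POS codes shown are used only as examples.

```python
from collections import defaultdict

def build_index(tokens):
    """Map (lower-cased word, POS code) to the list of token positions."""
    index = defaultdict(list)
    for position, (word, pos) in enumerate(tokens):
        # Folding case means "Lead" and "lead" share an entry,
        # but noun and verb readings stay in separate entries.
        index[(word.lower(), pos)].append(position)
    return index

# Hypothetical tagged tokens, as (word, POS code) pairs.
tokens = [("Lead", "NN1"), ("is", "VBZ"), ("heavy", "AJ0"),
          ("They", "PNP"), ("lead", "VVB"), ("Lead", "VVB")]
index = build_index(tokens)
noun_hits = index[("lead", "NN1")]
verb_hits = index[("lead", "VVB")]
```

Here the capitalized form Lead contributes to both the noun entry and the verb entry, as described above.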
The SGML tags (other than those for individual tokens) themselves are
also indexed, as are their attribute values. For example, there is an
entry in the index for every <text> start- and end-tag, and for
every <head> start- and end-tag, etc. This makes it possible to
search for words appearing within the scope of a particular SGML element
type. For some very frequent element-types (notably <s> and
<p> ) whose locations are particularly important when
delimiting the context of a hit, additional secondary indexes called
accelerator files are maintained.
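One way of using indexed tag locations to restrict a search to the scope of an element is sketched below. We assume, purely for illustration, that each element occurrence is available as a (start, end) pair of token positions, sorted and non-overlapping; the names within_element, head_spans and word_hits are hypothetical and the positions are invented.

```python
from bisect import bisect_right

def within_element(hit, spans):
    """True if token position `hit` lies inside one of the (start, end) spans.

    `spans` must be sorted by start position and must not overlap.
    """
    starts = [start for start, _ in spans]
    # Find the last span that starts at or before the hit.
    i = bisect_right(starts, hit) - 1
    return i >= 0 and hit <= spans[i][1]

head_spans = [(0, 3), (120, 124)]   # hypothetical <head> start/end positions
word_hits = [2, 57, 121, 125]       # hypothetical positions of a search word
hits_in_head = [h for h in word_hits if within_element(h, head_spans)]
```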
The index supplied with the first version of the BNC occupies 33,000
files and 2.5 gigabytes of disk space, i.e. slightly more than the size
of the text itself. Building the index is a complex and computationally
expensive process, requiring either much larger amounts of disk space or
several intermediate sort/merge phases. This was one reason for
delivering the completed index together with the corpus itself on the
first release of the BNC, even though development of the client software
was not at that stage complete. More compact indexing would have been
possible with the use of data compression, at the expense of some
increase in complexity: in practice, the indexing algorithm used
provides equally good retrieval times for any kind of query, independent
of the size of the corpus indexed. The index included on the published
CDs necessarily assumes that the server accessing it has certain
hardware characteristics (in particular, word length and byte addressing
order). To cater for machines for which these assumptions are
incorrect, a localization program is now included with the software.
This can either make a once-and-for-all modification to the index or be
used by the server to make the necessary modifications on the fly.
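The kind of conversion such a localization step performs can be illustrated with a byte-order swap. The layout of the SARA index files is not described here, so the sketch below simply assumes a flat sequence of 32-bit unsigned integers; the function name localize and its parameters are hypothetical.

```python
import struct

def localize(raw, count, source_order="<", target_order=">"):
    """Re-encode `count` 32-bit unsigned integers from one byte order to another.

    "<" is little-endian and ">" is big-endian in struct format strings.
    """
    values = struct.unpack(f"{source_order}{count}I", raw)
    return struct.pack(f"{target_order}{count}I", *values)

# Three hypothetical index offsets written on a little-endian machine.
little = struct.pack("<3I", 1, 256, 65536)
big = localize(little, 3)
```

Applying the same function to a whole file corresponds to the once-only modification; applying it to each block as it is read corresponds to conversion on the fly.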
The indexer program is intended to operate on generic SGML texts, that
is, not just on the particular set of tags defined for use in the BNC.
However, we have not yet attempted to use it for corpora using other
DTDs, and there are some features of its behaviour which assume that the
DTD in use is (like that of the BNC) more or less TEI-compatible. For
example, it requires that texts have a TEI header, that they are
decomposed into <s>-like elements, and that each token to be indexed is
explicitly tagged as such.
The SARA server
The SARA server program was written originally in the ANSI C language,
using BSD sockets to implement network connexions, with a view to making
it as portable as possible. The current version, release 930, has been
implemented on several different flavours of the Unix operating system,
including Solaris, Digital Unix, and Linux, which appear to be the most
popular variations. The software is delivered with detailed installation
and localization instructions, and can be downloaded freely from the
BNC's web site (see http://info.ox.ac.uk/bnc/sara.html),
though it is not yet of much interest to anyone other than BNC
licensees, since the indexer program is not yet included with it.
The server has several distinct functions, amongst which the following
are probably the most important:
- it allows registered users to log on or off and to change their
passwords;
- it implements the key functions required of the Corpus Query
Language, in particular:
  - looking for tokens in the index;
  - solving a query;
  - supplying bibliographic information about a text;
  - displaying some or all of a text at a given location;
  - thinning or filtering the result set from a query;
- it handles all housekeeping, allowing concurrent access by
several different users.
The server listens on a specified socket for login calls from a client.
When such a call is received, the server tries to create a process to
accept further data packets. If it succeeds, the client is logged on
and set-up messages are exchanged which define, for example, the names
and characteristics of SGML elements in the server's database. Following
this, the client sends queries in the Corpus Query Language, and
receives data packets containing solutions to them. Once a connexion has
been established in this way, the server expects to receive regular
messages from the client, and will time out if it does not. The client
can also request the server to interrupt certain transactions
prematurely.
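The time-out behaviour described above can be modelled in a few lines. This is a sketch only: the message names (QUERY, LOGOFF) and reply strings are invented for illustration and are not those of the SARA protocol, and the clock is passed in explicitly so that the example is deterministic.

```python
class Session:
    """Tracks one logged-on client and times out when it falls silent."""

    def __init__(self, timeout_seconds, now):
        self.timeout_seconds = timeout_seconds
        self.last_seen = now
        self.open = True

    def receive(self, message, now):
        """Handle one client message and return a reply string."""
        # A closed session, or one silent for too long, accepts nothing more.
        if not self.open or now - self.last_seen > self.timeout_seconds:
            self.open = False
            return "TIMEOUT"
        self.last_seen = now
        if message == "LOGOFF":
            self.open = False
            return "BYE"
        return "OK " + message

session = Session(timeout_seconds=60, now=0)
first_reply = session.receive("QUERY lead", now=10)
late_reply = session.receive("QUERY knife", now=100)
```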
The Corpus Query Language
The Corpus Query Language (CQL) is a fairly typical Boolean style
retrieval language, with a number of additional features particularly
useful for corpus work. It is emphatically not intended for human use.
Like many other such languages, its syntax is designed for convenience
of machine processing, rather than elegance or perspicuousness. A brief
summary of its functionality only is given here.
A query is made up of one or more atomic queries. An atomic query may be
one of the following:
- a word or punctuation character;
- a wildcard character, which will match any single term;
- an L-word, that is a combination of word and
part-of-speech code, such as CAN=NN1 (i.e. can
as a singular noun);
- a phrase, which is decomposed into a search for consecutive terms
irrespective of punctuation;
- a regular expression;
- an SGML query, that is, a search for a start- or end-tag,
possibly including attribute name-value pairs;
- an existing (named) solution set (names are allocated to queries
by the server);
- any CQL query enclosed in parentheses.
The following unary operators are currently implemented in CQL:
case
The dollar operator makes the query which is its operand
case-sensitive;
header
The commercial-at operator makes the query which is its operand
search within headers as well as in the bodies of texts (it thus assumes
that a TEI-conformant dtd is in use);
optionality
The ? operator matches zero or one solutions to the query which
is its operand; it makes no sense unless the query is combined with
another.
A CQL expression containing more than one query may use the following
binary operators:
concatenation
Two queries written in sequence match occasions where a solution
to the first query is directly followed by a solution to the second.
disjunction
The term query1|query2 matches anything that is a
solution to either query1 or query2.
join
The term query1*query2 matches anything that is a
solution to query1 followed by a solution to query2
within the current scope; the term query1#query2
matches anything that is a solution to query1 either
followed or preceded by a solution to query2 within the
current scope.
When queries are joined, the scope of the expression may be defined in
one of the following ways:
SGML element
A join query followed by the / operator and an SGML query
matches cases where the joined query is satisfied within the scope of
the SGML query.
number
A join query followed by the / operator and a number matches
cases where the joined query is satisfied within the number of words
specified.
If no scope is supplied for a join query, the default scope is a single
<bncDoc> element, i.e. a single text in the corpus.
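The effect of the two join operators with a numeric scope can be sketched as follows. This is not the server's algorithm: we assume, for illustration, that each solution is a single token position and that "within the number of words specified" means a difference of at most that many positions. The function name join and the sample positions are hypothetical.

```python
def join(first, second, scope, ordered=True):
    """Pair positions from two solution lists lying within `scope` words.

    ordered=True mimics query1*query2 (first must precede second);
    ordered=False mimics query1#query2 (either order is accepted).
    """
    pairs = []
    for a in first:
        for b in second:
            distance = b - a
            if ordered and 0 < distance <= scope:
                pairs.append((a, b))
            elif not ordered and 0 < abs(distance) <= scope:
                pairs.append((a, b))
    return pairs

fork = [10, 40]         # hypothetical positions of one word
knife = [12, 38, 90]    # hypothetical positions of another
ordered_hits = join(fork, knife, scope=5)
either_hits = join(fork, knife, scope=5, ordered=False)
```

The nested loop is quadratic and is used here only for clarity; with sorted position lists a merge-style scan would do the same work in linear time.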
SARA client programs
The standard SARA installation includes a very rudimentary client
program called solve, for Unix. This provides a command line interface
at which CQL expressions can be typed for evaluation, returning result
sets on the standard Unix output channel, for piping to a formatter of
the user's choice, or display at a terminal. This client is provided
mainly for debugging purposes, and also as a model of how to construct
such software.
A web client, written in Perl, has recently been developed at the
University of Zurich, a simplified version of which is currently used
at the BNC online service, and which will also be included in the next
release of the SARA software.
The SARA client program which has been most extensively developed
and used runs in the Microsoft Windows environment, and it is this which
forms the subject of the remainder of this paper.
In designing the Windows client, we attempted to retain as much as
possible of the basic functionality of the CQL protocol, while at the
same time making the package easy to use for the novice. We also
recognized that we could not implement all of the features which corpus
specialists would require at the same time as providing a simple enough
interface to attract corpus novices. In retrospect, there are several
features and functions we would have liked to add (of which some are
discussed below); but no doubt, had we done so, there would be several
aspects of the user interface we would now be equally dissatisfied with.
The SARA client follows standard Microsoft Windows application
guidelines, and is written in Microsoft C++, using the standard object
classes and libraries. It thus looks very similar to any other Windows
application, with the same conventions for window management, buttons,
menus, etc. It runs under any version of Windows more recent than 3.0,
and there are both 16 and 32 bit versions. A TCP/IP stack (such as
Winsock) to implement connexion to the server is essential, and a colour
screen highly desirable. The software uses only small amounts of disk or
memory, except when downloading or sorting result sets containing very
many (more than a few hundred) or very long (more than 1Kb) hits.
The Windows client allows the user to:
- search the word index and check what tokens it contains;
- define, save, re-use, or modify a query
(effectively, a CQL expression to be evaluated);
- view, sort, save, or print all or some of the
results returned by a query;
- configure and manipulate the display of results in a variety of
ways;
- view contextual and bibliographic data for any one text;
- combine simple queries to form a complex one, using a visual
interface.
A brief description of each of these functions is given below; more
information is available from the built-in help file and from the
BNC Handbook (Aston and Burnard 1998).
Types of Query
The Windows client distinguishes five types of query, and allows for
their combination as a complex query. The basic query types are:
word query
A word query searches the SARA word index, either by stem (right hand
truncation only is performed) or by pattern
(see below). All index-entries matching the string entered are returned,
and the user can then select all or some of them for dispatch to the
server as CQL queries against the corpus.
phrase query
A phrase query behaves superficially like a word query, in that
it searches for occurrences of a particular word or phrase. It differs
in that it can be case-sensitive, can search text headers as well as
bodies, can include punctuation, and is aware of the tokenization rules
used by the CLAWS tagger. A phrase query can also include a wildcard
character to match any word in a phrase.
pattern query
A pattern query allows for queries using a simple subset of
UNIX-style regular expressions, for example to find variant spellings of
a word. Some limitations on the kind of pattern which can usefully be
searched for are imposed by the nature of the index: for example, left
hand truncation of the search term always implies a scan through the
entire index, and is therefore not allowed.
POS query
A part of speech (POS) query carries out a word query, further
restricted by a given POS code or codes, for example to find occurrences
of lead tagged as a noun. It should be stressed
that this is only feasible for a specified word, since the POS code is
only a secondary key in the SARA word index: it is not possible
to search for (say) all nouns with the current system.
SGML query
An SGML query carries out a search for a given SGML tag in the
corpus, optionally qualified by particular combinations of attribute
values, for example to find all occurrences of
<event> elements in which the desc attribute has the value
laughing or laughter. It
is particularly useful when restricting searches to texts of a
particular type, since text type information is typically carried by
SGML attributes in the BNC.
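The asymmetry between right-hand and left-hand truncation noted above for word and pattern queries can be illustrated as follows. The sketch assumes, as a stand-in for the real index organization, that index entries are held as a sorted list searched by bisection; the function names and sample entries are hypothetical.

```python
from bisect import bisect_left

def stem_lookup(sorted_entries, stem):
    """Return all entries beginning with `stem` (right-hand truncation)."""
    # Entries sharing a prefix are adjacent, so jump to the first
    # candidate and stop at the first entry that no longer matches.
    start = bisect_left(sorted_entries, stem)
    matches = []
    for entry in sorted_entries[start:]:
        if not entry.startswith(stem):
            break
        matches.append(entry)
    return matches

def suffix_lookup(sorted_entries, ending):
    """Left-hand truncation: matches are scattered, so scan every entry."""
    return [entry for entry in sorted_entries if entry.endswith(ending)]

entries = sorted(["lead", "leader", "leading", "leaf", "mislead", "plead"])
by_stem = stem_lookup(entries, "lead")
by_ending = suffix_lookup(entries, "lead")
```

On an index the size of the BNC's, the second function would have to visit every entry, which is why the client disallows such patterns.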
One or more of the above types of query may be combined to form a
complex query, using the special purpose Query Builder visual interface,
in which the parts of a complex query are represented by
nodes of various types. A Query Builder query
always has at least two nodes: one, the
scope node, defines the
context within which a complex query is to be
evaluated. This may be expressed either as an SGML element, or as a span
of some number of words. The other nodes are known as content
nodes, and correspond with the simple queries from which the
complex query is built. Content nodes may be linked together
horizontally, to indicate alternation, or vertically to indicate
concatenation. In the latter case, different arc types are drawn, to
indicate whether the terms are to be satisfied in either order, in one
order only, or directly, i.e. with no intervening terms.
Query Builder thus enables one to solve queries such as “find the
word fork followed by the word knife
as a noun, within the scope of a single <u> element”. It can
be used to find occurrences of the words anyhow or anyway directly
following laughter at the start of a sentence; to constrain searches to
texts of particular types, or contexts; and so forth.
For completeness, the Windows client also allows the skilled (or
adventurous) user to type a CQL expression directly: this is the only
form of simple query which is not permitted within the Query Builder
interface.
Display and manipulation of queries
By whatever method it is posed, any SARA query returns its results in
the same way. Results may be displayed in either line mode or page mode,
i.e. in a conventional KWIC display, or one result at a time. The amount
of context returned for each result is specified as a maximum number of
characters, within which a whole sentence or paragraph will usually be
displayed. Results can be displayed in one of four different formats:
plain
text-only display which effectively ignores and suppresses all
markup;
POS
individual words are colour-coded according to their part of
speech and a user-defined colour scheme;
SGML
all SGML encoding in the original is displayed uninterpreted;
custom
the SGML encoding is interpreted according to a simple
user-supplied specification.
It will often be the case that the number of results found for a query
is unmanageably large. To handle this, the SARA client offers the
following facilities. A global limit is defined on the number of results
to be returned. When this limit is exceeded, the user can choose
- to over-ride the limit temporarily for this result set,
specifying how many solutions are required, discarding any surplus from
the end of the result set;
- to discard all but the first solution in each text;
- to take a random sample of specified size from the available
solutions.
When the last of these is repeated for a given large result set, it will
return a different random sample each time.
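The three thinning options can be sketched as follows. Solutions are represented here as (text identifier, position) pairs; this representation, the function names and the sample identifiers are assumptions made for illustration only.

```python
import random

def truncate(solutions, limit):
    """Keep the first `limit` solutions, discarding the surplus from the end."""
    return solutions[:limit]

def first_per_text(solutions):
    """Keep only the first solution found in each text."""
    seen = set()
    kept = []
    for text_id, position in solutions:
        if text_id not in seen:
            seen.add(text_id)
            kept.append((text_id, position))
    return kept

def random_sample(solutions, size, rng=random):
    """Take a random sample of `size` solutions, preserving corpus order."""
    chosen = sorted(rng.sample(range(len(solutions)), size))
    return [solutions[i] for i in chosen]

solutions = [("A01", 5), ("A01", 80), ("B12", 3), ("C07", 9), ("C07", 41)]
truncated = truncate(solutions, 2)
one_each = first_per_text(solutions)
sample = random_sample(solutions, 2)
```

Because random_sample draws afresh on each call, repeating it on the same result set will generally give a different selection.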
Once downloaded to the client, a set of results may be manipulated in a
number of ways. It may be sorted according to the keyword which defined
the query, by varying extents of the left or right context for this
keyword, or by combinations of these keys. Sorting can be carried out
either by the orthographic form, in case-insensitive manner, or by the
POS code of words. This enables the user to group together all
occurrences of a word in which it is followed by a particular POS code,
for example. It is also possible to scroll through a result set,
manually identifying particular solutions for inclusion or exclusion, or
to thin it automatically in the same way as when the limit on the number
of solutions is exceeded.
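Sorting a concordance by right context, in case-insensitive manner, might look like the following sketch. Each hit is represented as a (left context, keyword, right context) triple of word lists; this representation and the name sort_by_right_context are hypothetical.

```python
def sort_by_right_context(hits, words=1):
    """Sort concordance lines by the first `words` tokens after the keyword."""
    def key(hit):
        left, keyword, right = hit
        # Lower-casing the key makes the ordering case-insensitive.
        return [w.lower() for w in right[:words]]
    return sorted(hits, key=key)

hits = [
    (["a", "pencil"], "lead", ["was", "broken"]),
    (["will"], "lead", ["The", "way"]),
    (["they"], "lead", ["busy", "lives"]),
]
ordered = sort_by_right_context(hits)
first_words = [right[0] for _, _, right in ordered]
```

Sorting by POS code instead would only require the key function to return the codes of the context words rather than their orthographic forms.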
A result set may simply be printed out, or saved to a file in SGML
format, for later processing by some SGML-aware formatter or further
processor. Named bookmarks may be associated
with particular solutions (as in other Windows applications) to
facilitate their rapid recovery. The queries generating a result set,
together with any associated thinning of it, any bookmarks, and any
additional documentary comment, can all be saved together as named
queries on the client, which can then be reactivated as required.
Additional features of the client
The main bibliographic information about each text from which a given
concordance line has been extracted can be displayed with a single mouse
click. It is also possible to browse directly the whole of the text and
its associated header, which is presented as a hierarchic menu,
reflecting its SGML structure. The user can either start from the
position where a hit was found, expanding or contracting the elements
surrounding it, or start from the root of the document tree, and move
down to it.
A limited range of statistical features is provided. Word frequencies
and z-scores are provided for word-form lookups, and there is a useful
collocation option which enables one to calculate the absolute and
relative frequencies with which a specified term co-occurs within a
specified number of words of the current query focus.
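The collocation count can be sketched as below. The formula used by the client is not given here, so the sketch makes an assumption: the relative frequency is taken to be the proportion of focus hits that have the specified term within the window. The function name collocation and the sample sentence are invented for illustration.

```python
def collocation(tokens, focus_positions, term, window):
    """Count how often `term` occurs within `window` words of a focus hit.

    Returns (absolute count, count divided by the number of focus hits).
    """
    term = term.lower()
    count = 0
    for position in focus_positions:
        lo = max(0, position - window)
        hi = position + window + 1
        # Words on either side of the focus, excluding the focus itself.
        neighbours = tokens[lo:position] + tokens[position + 1:hi]
        if term in (w.lower() for w in neighbours):
            count += 1
    relative = count / len(focus_positions) if focus_positions else 0.0
    return count, relative

tokens = "the knife and fork lay by the fork on the table".split()
focus = [i for i, w in enumerate(tokens) if w == "fork"]
absolute, relative = collocation(tokens, focus, "knife", window=2)
```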
Limitations of the current system and future plans
As noted above, the current client lacks some facilities which are
widely used in particular fields of corpus-based research. This is
particularly true of statistical information. There is no facility for
the automatic generation of collocate lists, or for any of the more
sophisticated forms of statistical analysis now widely used.
Neither is there any form of linguistic knowledge built into the system
(other than the POS tagging): there is no lemmatized index, or
lemmatizing component, though clearly it would be desirable to add one.
For those sufficiently technically minded, or motivated, the
construction of such facilities (whether using SGML-aware tools or not)
is relatively straightforward; the problem is that no simple interface
or hook exists to build them into the current Windows client.
Similarly, it is not possible to define, save and re-use subcorpora,
except by saving and re-using the queries which define them. The SARA
client can address only the whole of the SARA index, which indexes the
whole of the BNC. This is a design issue, which has yet to be addressed.
If queries become very complex, involving manipulation of many very
large result streams, they may exceed the limits of what can be handled
by the server. This has not yet arisen in practice, however.
A more common complaint about the current system is that it cannot be
used to search for patterns of POS codes, independently of the
particular word forms to which they are attached. This is fundamentally
an indexing problem, which may be addressed in the next major release of
the system. The performance problems associated with queries containing
very high frequency words are derived from the same problem, and may be
addressed in the same way. And again, it is a trivial exercise for a
competent programmer to write special purpose code which will search for
such patterns across the whole of the BNC.
Despite these limitations, and despite performance problems and
difficulties of access, the system has attracted great enthusiasm
when tested and demonstrated, perhaps owing largely to the intrinsic
interest of the BNC data itself. Since mid-1997, we have been providing a free
online service using the client as a part of the British Library's
Initiatives for Access programme. This service allows
anyone with access to the World Wide Web to search the BNC at no charge.
Using any Web browser and a simple query form, restricted searches can
be carried out via a CGI script accessing the SARA server directly.
Alternatively, the user can download and install their own copy of the
Windows client software, and use it to access the same server. At the
time of writing, this full query service is available free of charge for
a limited trial period, after which an annual registration fee is
charged.
A second updated and corrected version of the Corpus is due for
release in 1998. Up to date information about the project is available
from the project website at http://info.ox.ac.uk/bnc.
Aston, G. and Burnard, L. 1998. The BNC Handbook. Edinburgh: Edinburgh
University Press.
Atkins, S., Clear, J. and Ostler, N. 1992. Corpus design criteria.
Literary and linguistic computing 7: 1-16.
Biber, D., Conrad, S. and Reppen, R. 1998. Corpus linguistics:
investigating language structure and use. Cambridge: Cambridge
University Press.
Dunlop, D. 1995. Practical considerations in the use of TEI headers in
large corpora. In Ide, N. and Veronis, J. (eds), Text Encoding
Initiative: background and context. Kluwer.
Garside, R., Leech, G. and McEnery, T. 1997. Corpus annotation:
linguistic information from computer text corpora. Harlow:
Addison-Wesley Longman.
Kennedy, G. 1998. An introduction to corpus linguistics. Harlow:
Addison-Wesley Longman.
McEnery, A. and Wilson, A. 1996. Corpus linguistics. Edinburgh:
Edinburgh University Press.