5 The SARA system

The SARA system was designed to operate in client-server mode, typically in a distributed computing environment, where one or more workstations or personal computers are used to access a central server over a network. This is, of course, the kind of environment which is most widely current in academic (and other) computing milieux today. The success of the World Wide Web, which uses an identical design philosophy, is vivid testimony to the effectiveness of this approach.

The system has four chief components:

5.1 The SARA index

Computationally, the best-understood method of accessing a text the size and complexity of the BNC is to use an index file, in which search terms are associated with their location in the main text file, and into which rapid access can be obtained using hashing techniques. Such methods have been employed for decades in mainstream information retrieval systems, with the consequence that the advantages and disadvantages of the various ways of implementing the underlying technology are well known and very stable.

The SARA index is a conventional index of this type. Entries in the index are created by the indexing program, using the SGML markup to determine how the input text is to be tokenized. The tokens indexed include the content of every <w> or <c> element, together with the part of speech code allocated to it by the CLAWS program. For example, there will be one entry in the index for `lead' as a noun, and another for `lead' tagged as a verb. The index is not case-sensitive, so occurrences of `Lead' may appear in either entry. The tokenization is entirely dependent on that carried out by CLAWS, which accounts for the presence of a few oddities in the index where CLAWS failed to segment sentences entirely.
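As an illustration only, the following Python sketch shows how an index of this kind can keep separate entries for each combination of word form and part-of-speech code while ignoring case. It is not the SARA indexer: the function name `build_index`, the input shape (a list of word and code pairs) and the use of simple token positions as locations are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(tokens):
    """Map (lower-cased word form, POS code) to the positions where it occurs.

    `tokens` is an iterable of (word, pos_code) pairs in text order. Both the
    input shape and the position scheme are assumptions made for this sketch.
    """
    index = defaultdict(list)
    for position, (word, pos_code) in enumerate(tokens):
        # Folding case means `Lead' and `lead' share an entry when their
        # POS codes agree, and are kept apart when the codes differ.
        index[(word.lower(), pos_code)].append(position)
    return dict(index)

# Illustrative tokens; the POS codes are examples only.
tokens = [("Lead", "NN1"), ("pipes", "NN2"), ("lead", "VVB"),
          ("the", "AT0"), ("lead", "NN1")]
index = build_index(tokens)
```

Here the noun readings of `lead' (positions 0 and 4) end up in one entry and the verb reading (position 2) in another.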

The SGML tags (other than those for individual tokens) themselves are also indexed, as are their attribute values. For example, there is an entry in the index for every <text> start- and end-tag, and for every <head> start- and end-tag, etc. This makes it possible to search for words appearing within the scope of a particular SGML element type. For some very frequent element-types (notably <s> and <p>) whose locations are particularly important when delimiting the context of a hit, additional secondary indexes called accelerator files are maintained.

The index supplied with the first version of the BNC occupies 33,000 files and 2.5 gigabytes of disk space, i.e. slightly more than the size of the text itself. Building the index is a complex and computationally expensive process, requiring either much larger amounts of disk space or several intermediate sort/merge phases. This was one reason for delivering the completed index together with the corpus itself on the first release of the BNC, even though development of the client software was not at that stage complete. More compact indexing would have been possible with the use of data compression, at the expense of some increase in complexity: in practice, the indexing algorithm used provides equally good retrieval times for any kind of query, independent of the size of the corpus indexed. The index included on the published CDs necessarily assumes that the server accessing it has certain hardware characteristics (in particular, word length and byte addressing order). To cater for machines for which these assumptions are incorrect, a localization program is now included with the software. This can either make a once-and-for-all modification to the index or be used by the server to make the necessary modifications `on the fly'.
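The byte-order side of such a localization step can be sketched as follows. This is a minimal illustration, assuming an index made up purely of fixed-size binary words; the actual layout of the SARA index files is not described here, and the function name `swap_byte_order` is hypothetical.

```python
import struct

def swap_byte_order(data: bytes, word_size: int = 4) -> bytes:
    """Reverse the bytes of each fixed-size word in `data`.

    Assumes the data consists only of words of `word_size` bytes, which is
    a simplification made for this sketch.
    """
    if len(data) % word_size:
        raise ValueError("data length is not a multiple of the word size")
    return b"".join(
        data[i:i + word_size][::-1] for i in range(0, len(data), word_size)
    )

# Three 32-bit unsigned integers written in big-endian order...
big_endian = struct.pack(">3I", 1, 258, 65536)
# ...converted so that a little-endian machine can read them directly.
little_endian = swap_byte_order(big_endian)
```

The same routine could be applied once to a whole file, or to each block as it is read, which corresponds to the two modes of use mentioned above.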

The indexer program is intended to operate on generic SGML texts, that is, not just on the particular set of tags defined for use in the BNC. However, we have not yet attempted to use it for corpora using other DTDs, and there are some features of its behaviour which assume that the DTD in use is (like that of the BNC) more or less TEI-compatible. For example, it requires that texts have a TEI header, that they are decomposed into <s>-like elements, and that each token to be indexed be explicitly tagged as such.

5.2 The SARA server

The SARA server program was written originally in the ANSI C language, using BSD sockets to implement network connexions, with a view to making it as portable as possible. The current version, release 930, has been implemented on several different flavours of the Unix operating system, including Solaris, Digital Unix, and Linux, which appear to be the most popular variations. The software is delivered with detailed installation and localization instructions, and can be downloaded freely from the BNC's web site (see http://info.ox.ac.uk/bnc/sara.html), though it is not yet of much interest to anyone other than BNC licensees, since the indexer program is not yet included with it.

The server has several distinct functions, amongst which the following are probably the most important:

The server listens on a specified socket for login calls from a client. When such a call is received, the server tries to create a process to accept further data packets. If it succeeds, the client is logged on and set-up messages are exchanged which define, for example, the names and characteristics of SGML elements in the server's database. Following this, the client sends queries in the Corpus Query Language, and receives data packets containing solutions to them. Once a connexion has been established in this way, the server expects to receive regular messages from the client, and will time out if it does not. The client can also request the server to interrupt certain transactions prematurely.
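The time-out behaviour can be pictured with a small sketch. It models only the bookkeeping (when a client was last heard from), not the network layer or the real SARA protocol; the class name `Session`, its methods and the 30-second limit are all hypothetical.

```python
class Session:
    """Tracks when a logged-on client was last heard from (illustrative only)."""

    def __init__(self, timeout_seconds, now):
        self.timeout_seconds = timeout_seconds
        self.last_heard = now
        self.open = True

    def on_message(self, now):
        # Any message from the client, including a bare keep-alive,
        # resets the time-out clock while the session is still open.
        if self.open:
            self.last_heard = now

    def check(self, now):
        # Close the session if the client has been silent for too long.
        if self.open and now - self.last_heard > self.timeout_seconds:
            self.open = False
        return self.open

session = Session(timeout_seconds=30, now=0)
session.on_message(now=20)
still_open_at_45 = session.check(now=45)   # 25 seconds of silence
still_open_at_51 = session.check(now=51)   # 31 seconds of silence
```

Passing the current time in explicitly keeps the sketch deterministic; a real server would read a clock instead.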

5.3 The Corpus Query Language

The Corpus Query Language (CQL) is a fairly typical Boolean-style retrieval language, with a number of additional features particularly useful for corpus work. It is emphatically not intended for human use. Like many other such languages, its syntax is designed for convenience of machine processing, rather than elegance or perspicuousness. Only a brief summary of its functionality is given here.

A query is made up of one or more atomic queries. An atomic query may be one of the following:

The following unary operators are currently implemented in CQL:

A CQL expression containing more than one query may use the following binary operators:

When queries are joined, the scope of the expression may be defined in one of the following ways:

If no scope is supplied for a join query, the default scope is a single <bncDoc> element, i.e. a single text in the corpus.

5.4 SARA client programs

The standard SARA installation includes a very rudimentary client program called solve, for Unix. This provides a command line interface at which CQL expressions can be typed for evaluation, returning result sets on the standard Unix output channel, for piping to a formatter of the user's choice, or display at a terminal. This client is provided mainly for debugging purposes, and also as a model of how to construct such software.

A web client, written in Perl, has recently been developed at the University of Zurich, a simplified version of which is currently used at the BNC online service, and which will also be included in the next release of the SARA software.

The SARA client program which has been most extensively developed and used runs in the Microsoft Windows environment, and it is this which forms the subject of the remainder of this paper.

In designing the Windows client, we tried to retain as much as possible of the basic functionality of the CQL protocol, while at the same time making the package easy to use for the novice. We also recognized that we could not implement all of the features which corpus specialists would require while also providing an interface simple enough to attract corpus novices. In retrospect, there are several features and functions we would have liked to add (some of which are discussed below); but no doubt, had we done so, there would be several aspects of the user interface we would now be equally dissatisfied with.

The SARA client follows standard Microsoft Windows application guidelines, and is written in Microsoft C++, using the standard object classes and libraries. It thus looks very similar to any other Windows application, with the same conventions for window management, buttons, menus, etc. It runs under any version of Windows more recent than 3.0, and there are both 16-bit and 32-bit versions. A TCP/IP stack (such as Winsock) to implement connexion to the server is essential, and a colour screen highly desirable. The software uses only small amounts of disk or memory, except when downloading or sorting result sets containing very many (more than a few hundred) or very long (more than 1Kb) hits.

The Windows client allows the user to:

A brief description of each of these functions is given below; more information is available from the built-in help file and from the BNC Handbook.

5.4.1 Types of Query

The Windows client distinguishes five types of query, and allows for their combination as a complex query. The basic query types are:

One or more of the above types of query may be combined to form a complex query, using the special-purpose Query Builder visual interface, in which the parts of a complex query are represented by nodes of various types. A Query Builder query always has at least two nodes: one, the scope node, defines the context within which a complex query is to be evaluated. This may be expressed either as an SGML element, or as a span of some number of words. The other nodes are known as content nodes, and correspond with the simple queries from which the complex query is built. Content nodes may be linked together horizontally, to indicate alternation, or vertically, to indicate concatenation. In the latter case, different arc types are drawn, to indicate whether the terms are to be satisfied in either order, in one order only, or directly, i.e. with no intervening terms.

Query Builder thus enables one to solve queries such as ``find the word `fork' followed by the word `knife' as a noun, within the scope of a single <u> element''. It can be used to find occurrences of the words `anyhow' or `anyway' directly following laughter at the start of a sentence; to constrain searches to texts of particular types, or contexts, and so forth.
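To make the first of these examples concrete, the sketch below evaluates a `one order only' link between two content nodes inside a scope. It is an illustration of the idea rather than of how SARA evaluates queries: the scope is modelled as a list of units (standing in for <u> elements), each a list of word and POS-code pairs, and the function name `followed_by` and the sample data are assumptions.

```python
def followed_by(units, first, second, second_pos=None):
    """Find `first` followed later by `second` within the same unit.

    Returns (unit_number, index_of_first, index_of_second) triples. Matching
    on word forms is case-insensitive; `second_pos`, if given, restricts the
    second term to one POS code.
    """
    hits = []
    for unit_no, unit in enumerate(units):
        for i, (word_i, _) in enumerate(unit):
            if word_i.lower() != first:
                continue
            # Only positions after i are examined, so the order is fixed,
            # and the search never crosses the boundary of the unit.
            for j in range(i + 1, len(unit)):
                word_j, pos_j = unit[j]
                if word_j.lower() == second and second_pos in (None, pos_j):
                    hits.append((unit_no, i, j))
    return hits

# Four illustrative units; the POS codes are examples only.
units = [
    [("a", "AT0"), ("fork", "NN1"), ("and", "CJC"), ("knife", "NN1")],
    [("knife", "NN1"), ("then", "AV0"), ("fork", "NN1")],
    [("fork", "NN1")],
    [("knife", "NN1")],
]
hits = followed_by(units, "fork", "knife", second_pos="NN1")
```

Only the first unit satisfies the query: in the second the terms occur in the wrong order, and in the last two they fall in different units and so lie outside a common scope.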

For completeness, the Windows client also allows the skilled (or adventurous) user to type a CQL expression directly: this is the only form of simple query which is not permitted within the Query Builder interface.

5.4.2 Display and manipulation of queries

By whatever method it is posed, any SARA query returns its results in the same way. Results may be displayed in either line mode or page mode, i.e. in a conventional KWIC display, or one result at a time. The amount of context returned for each result is specified as a maximum number of characters, within which a whole sentence or paragraph will usually be displayed. Results can be displayed in one of four different formats:

It will often be the case that the number of results found for a query is unmanageably large. To handle this, the SARA client offers the following facilities. A global limit is defined on the number of results to be returned. When this limit is exceeded, the user can choose

When the last of these is repeated for a given large result set, it will return a different random sample each time.
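Random thinning of this kind can be sketched in a few lines. The example assumes that a result set is simply a list, and that the sample should be returned in its original order; neither assumption is taken from the SARA implementation, and the name `thin_random` is hypothetical.

```python
import random

def thin_random(results, limit, rng=None):
    """Return at most `limit` results, chosen at random, in original order."""
    rng = rng if rng is not None else random.Random()
    if len(results) <= limit:
        return list(results)
    # Sample positions rather than items so that the original order
    # can be restored by sorting the chosen positions.
    chosen = sorted(rng.sample(range(len(results)), limit))
    return [results[i] for i in chosen]

results = [f"hit-{n}" for n in range(1000)]
sample = thin_random(results, 10)
```

Because no fixed seed is used by default, repeated calls on the same result set will in general return different samples; passing a seeded `random.Random` instance makes a sample reproducible.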

Once downloaded to the client, a set of results may be manipulated in a number of ways. It may be sorted according to the keyword which defined the query, by varying extents of the left or right context for this keyword, or by combinations of these keys. Sorting can be carried out either by the orthographic form, in case-insensitive manner, or by the POS code of words. This enables the user to group together all occurrences of a word in which it is followed by a particular POS code, for example. It is also possible to scroll through a result set, manually identifying particular solutions for inclusion or exclusion, or to thin it automatically in the same way as when the limit on the number of solutions is exceeded.
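The sorting options can be illustrated with a small sort-key function. This is a sketch under stated assumptions: each hit is represented as a dictionary holding its keyword and its left and right contexts as lists of word and POS-code pairs, a representation invented for the example, as is the name `sort_key`.

```python
def sort_key(hit, side="right", span=1, by_pos=False):
    """Build a sort key from the keyword or from `span` words of context."""
    if side == "keyword":
        tokens = [hit["keyword"]]
    elif side == "left":
        # Read the left context outwards from the keyword, nearest word first.
        tokens = hit["left"][::-1][:span]
    elif side == "right":
        tokens = hit["right"][:span]
    else:
        raise ValueError("side must be 'keyword', 'left' or 'right'")
    # Either compare POS codes, or compare word forms ignoring case.
    return tuple(pos if by_pos else word.lower() for word, pos in tokens)

# Three illustrative hits for `lead'; the POS codes are examples only.
hits = [
    {"id": 1, "left": [("the", "AT0")], "keyword": ("lead", "NN1"),
     "right": [("Singer", "NN1")]},
    {"id": 2, "left": [("will", "VM0")], "keyword": ("lead", "VVI"),
     "right": [("to", "PRP")]},
    {"id": 3, "left": [("old", "AJ0")], "keyword": ("Lead", "NN1"),
     "right": [("pipes", "NN2")]},
]
by_right_word = [h["id"] for h in sorted(hits, key=sort_key)]
by_right_pos = [h["id"] for h in sorted(hits, key=lambda h: sort_key(h, by_pos=True))]
```

Sorting on the POS code of the first word to the right brings together all hits in which the keyword is followed by the same code, which is the grouping described above. Combinations of keys can be obtained by concatenating the tuples returned for each key.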

A result set may simply be printed out, or saved to a file in SGML format, for later processing by some SGML-aware formatter or further processor. Named bookmarks may be associated with particular solutions (as in other Windows applications) to facilitate their rapid recovery. The queries generating a result set, together with any associated thinning of it, any bookmarks, and any additional documentary comment, can all be saved together as named queries on the client, which can then be reactivated as required.

5.4.3 Additional features of the client

The main bibliographic information about each text from which a given concordance line has been extracted can be displayed with a single mouse click. It is also possible to browse directly the whole of the text and its associated header, which is presented as a hierarchic menu, reflecting its SGML structure. The user can either start from the position where a hit was found, expanding or contracting the elements surrounding it, or start from the root of the document tree, and move down to it.

A limited range of statistical features is provided. Word frequencies and z-scores are given for word-form lookups, and there is a useful collocation option which enables one to calculate the absolute and relative frequencies with which a specified term co-occurs within a specified number of words of the current query focus.
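For readers unfamiliar with collocation z-scores, the sketch below uses one commonly cited formulation; whether the SARA client computes its scores in exactly this way is not stated here, so the formula should be read as an assumption. Writing O for the observed number of co-occurrences, F_n and F_c for the corpus frequencies of the node and the collocate, N for the corpus size in words, and S for the total number of word positions examined around each node occurrence, it takes p = F_c / (N - F_n), E = p * F_n * S, and z = (O - E) / sqrt(E * (1 - p)).

```python
import math

def collocation_z_score(observed, node_freq, collocate_freq, corpus_size, span):
    """z-score of a collocate within `span` word positions of a node word.

    Uses an assumed, commonly cited formulation (see the text above); all
    parameter names are illustrative.
    """
    # Chance of meeting the collocate at any position not held by the node.
    p = collocate_freq / (corpus_size - node_freq)
    # Co-occurrences expected if the collocate were scattered at random.
    expected = p * node_freq * span
    return (observed - expected) / math.sqrt(expected * (1 - p))

# Invented figures, chosen only to keep the arithmetic easy to follow.
z = collocation_z_score(observed=11, node_freq=100, collocate_freq=1000,
                        corpus_size=1_000_100, span=10)
```

With these invented figures one co-occurrence would be expected by chance, so eleven observed co-occurrences give a z-score of about 10.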