6 Limitations of the current system and future plans

As noted above, the current client lacks some facilities which are widely used in particular fields of corpus-based research. This is particularly true of statistical information. There is no facility for the automatic generation of collocate lists, or for any of the more sophisticated forms of statistical analysis now widely used. Neither is there any form of linguistic knowledge built into the system (other than the POS tagging): there is no lemmatized index, or lemmatizing component, though clearly it would be desirable to add one. For those sufficiently technically minded, or motivated, the construction of such facilities (whether using SGML-aware tools or not) is relatively straightforward; the problem is that no simple interface or hook exists to build them into the current Windows client.
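
As an illustration of how little code a stand-alone collocate list requires, the following minimal sketch counts the words occurring within a fixed window of a node word. It operates on a plain list of tokens, not on the SARA index or the SGML-encoded corpus; the function name `collocates` and its `window` parameter are assumptions made for this example, and raw co-occurrence counts stand in for the statistical measures a real tool would compute.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens either side of `node`.

    Hypothetical helper for illustration only: it is not part of SARA.
    Matching is case-insensitive; the node position itself is not counted.
    """
    node = node.lower()
    lowered = [t.lower() for t in tokens]
    counts = Counter()
    for i, tok in enumerate(lowered):
        if tok != node:
            continue
        # Clamp the window to the bounds of the token list.
        start = max(0, i - window)
        end = min(len(lowered), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[lowered[j]] += 1
    return counts


if __name__ == "__main__":
    sample = "the strong tea and the strong coffee".split()
    # List the collocates of "strong" within one word either side.
    for word, freq in collocates(sample, "strong", window=1).most_common():
        print(word, freq)
```

A usable facility would go on to compare these counts with each word's overall frequency in the corpus, which is where an index-level hook into the server would be needed.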

Similarly, it is not possible to define, save and re-use subcorpora, except by saving and re-using the queries which define them. The SARA client can address only the whole of the SARA index, which indexes the whole of the BNC. This is a design issue, which has yet to be addressed. If queries become very complex, involving manipulation of many very large result streams, they may exceed the limits of what can be handled by the server. This has not yet arisen in practice, however.

A more common complaint about the current system is that it cannot be used to search for patterns of POS codes, independently of the particular word forms to which they are attached. This is fundamentally an indexing problem, which may be addressed in the next major release of the system. The performance problems associated with queries containing very high frequency words are derived from the same problem, and may be addressed in the same way. And again, it is a trivial exercise for a competent programmer to write special purpose code which will search for such patterns across the whole of the BNC.
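
The kind of special-purpose code meant here can be sketched as follows. The sketch assumes the corpus text has already been read into a list of (word, POS code) pairs; the function name `find_pos_pattern` and this input format are assumptions made for the example, not part of SARA, and the scan is a simple linear pass over the tokens, not an indexed search.

```python
def find_pos_pattern(tagged, pattern):
    """Return every run of words whose POS codes match `pattern` in order.

    `tagged` is a list of (word, pos) pairs and `pattern` a list of POS
    codes. Hypothetical helper for illustration only.
    """
    n = len(pattern)
    if n == 0:
        return []
    hits = []
    # Slide a window of len(pattern) tokens across the text.
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(pos == want for (_, pos), want in zip(window, pattern)):
            hits.append([word for word, _ in window])
    return hits


if __name__ == "__main__":
    # Illustrative tags: AT0 article, AJ0 adjective, NN1 singular noun.
    text = [("the", "AT0"), ("old", "AJ0"), ("house", "NN1"),
            ("had", "VHD"), ("a", "AT0"), ("red", "AJ0"), ("door", "NN1")]
    for match in find_pos_pattern(text, ["AJ0", "NN1"]):
        print(" ".join(match))
```

Because every token is visited, the running time grows with the size of the corpus; this is the cost which an index organized by POS code would avoid.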

Despite these limitations, and despite performance problems and difficulties of access, the system has attracted great enthusiasm when tested and demonstrated, perhaps owing largely to the intrinsic interest of the BNC data itself. Since mid-1997, we have been providing a free online service using the client as a part of the British Library's Initiatives for Access programme. This service allows anyone with access to the World Wide Web to search the BNC at no charge. Using any Web browser and a simple query form, restricted searches can be carried out via a CGI script accessing the SARA server directly. Alternatively, the user can download and install their own copy of the Windows client software, and use it to access the same server. At the time of writing, this full query service is available free of charge for a limited trial period, after which an annual registration fee is charged.

A second, updated and corrected version of the Corpus is due for release in 1998. Up-to-date information about the project is available from the project website at http://info.ox.ac.uk/bnc.