Techniques for Evaluation of Language Corpora: a report from the front

Lou Burnard (OUCS) and Tony McEnery (Lancaster University)

Published in The ELRA Newsletter, vol 2 no 4, December 1997

This brief report describes work we are currently undertaking for ELRA to develop guidelines for the validation of corpus encoding. Until recently, such guidelines would have been meaningless, since almost every new corpus was developed using a new encoding scheme. Today, however, with corpus encoding slowly converging upon the use of SGML and the availability of detailed recommendations such as those of the TEI and EAGLES, the task is not merely possible but also necessary.

The necessity for such guidelines is best understood if we glance sideways at what has happened in other areas where products with a wide market have been developed. For example, the need for validation with respect to software, or other consumer products with defined outputs and defined goals, is relatively uncontentious, and is indeed the subject of important ongoing work in standardization (e.g. the EAGLES extensions to ISO 9126). The development and application of encoding standards for language corpora seem, however, to be at an earlier stage. Although the applicability of a corpus resource is likely to be far greater than the uses originally envisaged for it, and indeed may often be unpredictable, corpus developers are only slowly beginning to see how this unpredictability makes more urgent the need for agreed and well-defined practices in encoding. For the producer of a corpus, validation may simply be a form of quality control, akin to traditional proof reading; for the user of a corpus, validation should provide a rapid and explicit account of what a corpus contains, and hence its likely usefulness in a given task.

Our view is that validation procedures for language corpora should thus concern themselves chiefly with the relationship between what is actually present in a corpus, and what is claimed for it. Their primary goal should be to establish that a corpus is accurately and completely described by its associated documentation, and secondarily to assess whether the features present conform with reasonable user expectations, i.e. whether they are "fit for use".

With this in mind, we are working with a tripartite description of validation:

To identify those needs, we began our work by attempting to define an appropriate analytic framework for the validation of language corpora. Our first approach was to derive this empirically by examination of a large sample of existing corpora and their documentation, and by user survey. Indeed, you may well have seen and answered one of the web questionnaires that we have distributed over the past few months. (If so, then may we thank you once again!) Our examination of the data allowed us to compare the features proposed by several related standards with actual user requirements as solicited by questionnaire, and with actual user practice as demonstrated in a wide sample of corpora.

At the heart of our work is a cross tabulation of three sets of features: those recommended by European standards (EAGLES in particular), those specified by users, and finally the actual features found in the sample corpora. In doing this we are arriving at a view of where any 'reality gaps' are emerging: for example, where current corpus encoding standards do not encode features felt to be essential by the user community, or where corpus builders are not encoding corpora in line with current standards and best practice.
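By way of illustration only, a cross tabulation of this kind might be sketched in Python as follows. This is not the project's own tooling: the function names, the column labels and the feature names are all invented for the example, and a real comparison would draw its three feature sets from the standards, the questionnaire returns and the sampled corpora.

```python
from typing import Dict, List, Set


def cross_tabulate(
    recommended: Set[str], requested: Set[str], observed: Set[str]
) -> Dict[str, Dict[str, bool]]:
    """Build one row per feature, with one boolean column per source."""
    features = sorted(recommended | requested | observed)
    return {
        feature: {
            "standard": feature in recommended,  # recommended by a standard
            "users": feature in requested,       # asked for by users
            "corpora": feature in observed,      # actually found in corpora
        }
        for feature in features
    }


def reality_gaps(table: Dict[str, Dict[str, bool]]) -> Dict[str, List[str]]:
    """List features where standards, user needs and practice disagree."""
    gaps: Dict[str, List[str]] = {
        "wanted_but_not_in_standard": [],
        "in_standard_but_not_encoded": [],
    }
    for feature, row in table.items():
        if row["users"] and not row["standard"]:
            gaps["wanted_but_not_in_standard"].append(feature)
        if row["standard"] and not row["corpora"]:
            gaps["in_standard_but_not_encoded"].append(feature)
    return gaps


# Hypothetical feature names, invented purely for illustration.
recommended = {"paragraph", "sentence", "part_of_speech"}
requested = {"sentence", "part_of_speech", "speaker_turn"}
observed = {"paragraph", "sentence"}

table = cross_tabulate(recommended, requested, observed)
gaps = reality_gaps(table)
```

With these invented inputs, "speaker_turn" is reported as wanted by users but absent from the standard, and "part_of_speech" as recommended by the standard but not encoded in the sampled corpora.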

Obviously, in order to carry out such a study we have had to select a range of corpora from the many that are currently available. Our sampling procedure aimed to maximise variability in such features as language, delicacy/method of mark-up, commercial interest, size, topic etc. Attention was paid to a range of features, including technical characteristics (delivery media, physical encoding etc.) and documentary characteristics (usability and accuracy of documentation), as well as inherent linguistic properties made explicit in the corpora.

Following this overall survey, we will proceed to define a staged series of validation procedures:

  1. those concerned with detecting the presence of a given feature;
  2. those concerned with identifying the syntactic correctness and consistency of the feature's representation;
  3. those concerned with semantic correctness, i.e. whether the feature is correctly stated to be present in a given context.
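The three stages listed above can be made more concrete with a rough Python sketch. It rests on several assumptions that are ours alone: that the feature in question is an SGML element, that the text is fully normalized (every start tag has an explicit end tag, with no tag minimization and no empty elements), and that a small hand-checked sample is available against which semantic correctness can be judged. The function names and the sample text are hypothetical, and a real validation would use an SGML parser and the corpus DTD rather than regular expressions.

```python
import re
from typing import List

# Matches a start or end tag; assumes normalized tagging (see above).
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")


def feature_present(text: str, element: str) -> bool:
    """Stage 1: does the claimed element occur at all?"""
    return any(
        m.group(2) == element and not m.group(1) for m in TAG.finditer(text)
    )


def well_nested(text: str) -> bool:
    """Stage 2: is every start tag closed, and in the right order?"""
    stack: List[str] = []
    for m in TAG.finditer(text):
        closing, name = m.group(1), m.group(2)
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack


def semantic_agreement(text: str, element: str, gold: List[str]) -> float:
    """Stage 3: share of items on which the mark-up and a hand-checked
    sample (gold) agree, compared position by position."""
    if not gold:
        raise ValueError("a hand-checked sample is required")
    name = re.escape(element)
    pattern = rf"<{name}\b[^>]*>(.*?)</{name}\s*>"
    found = re.findall(pattern, text, flags=re.S)
    hits = sum(1 for a, b in zip(found, gold) if a.strip() == b.strip())
    return hits / max(len(found), len(gold))


# Hypothetical sample text with paragraph (p) and sentence (s) elements.
sample = "<p><s>Hello there.</s><s>How are you?</s></p>"

present = feature_present(sample, "s")
nested = well_nested(sample)
agreement = semantic_agreement(sample, "s", ["Hello there.", "How are you?"])
```

The first two checks can be run mechanically over a whole corpus; the third depends on a sample that has been verified by hand, which is where automation is likely to be hardest.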

Work on each of these is currently progressing. We are assessing the degree to which these three levels of procedure may be automated, and providing informal descriptions of the various tools available to perform such automatic validation at each level.

Our results so far seem to indicate that (with a few notable exceptions) current standards are somewhat in advance of current practice, and are also falling somewhat short of user expectations. This suggests to us that development of better and more exacting validation tools should be given a high priority.

Reports from this project will be made available via ELRA in the near future. In the meantime, we would be very interested in your comments or feedback: please contact either of us at the addresses given. Draft versions of the project reports are available from the URL

Lou Burnard (
Tony McEnery (