ELRA Written Resources Validation Panel

Proposed Workplan from OUCS

Lou Burnard

At our initial meeting I was asked to address three main areas: a check list of features, a validation methodology, and a validation manual.

The first of these is concerned with what is to be validated; the second with how.

A number of other actions are mentioned in the Minutes of the meeting (notably, to provide information on SGML tools, and to assess ways of using TEI headers to represent information in a corpus header). These are not addressed in this document.

Check list of features (What)

This work package will identify a set of features which are likely to be of maximal utility to the widest range of ELRA customers, while at the same time being realistically achievable. By feature, we mean any linguistic or paralinguistic aspect of the corpus or its components which has been deliberately identified and distinguished by the corpus creator. For example, a corpus in which spoken texts are distinguished from written texts may be said to mark the feature speech; a corpus in which sentence boundaries are identified may be said to mark the feature sentence; and so on.

In choosing amongst the possibly infinite number of such features which might be identified, two criteria are of particular importance: a feature must be definable, and it must be useful.

By definable, I mean that the feature must be clearly and consistently defined, ideally with reference to some external taxonomy such as the EAGLES application of TEI, or some other standard. It should be possible in this way to ensure that corpora using different application standards, but with common features, can be meaningfully interchanged and re-used.

By useful, I mean that the feature must actually be recognized and used in a sizeable number of corpus-based studies or applications. The intention here is to identify features which have established themselves in the field as being of practical use to more than a handful of researchers, while at the same time retaining the flexibility to add new features in response to particular research or development needs.

It should be noted that features, as defined here, have nothing to do with SGML markup: a non-SGML corpus will certainly have features marked up in some way, and an SGML corpus may not make explicit in its markup every feature one might desire. The exercise of defining the features which make up the checklist must therefore be carried out on a wide variety of corpora, both SGML and non-SGML; here is the question of overlap to which I referred earlier.

The workplan has the following stages:

1. Identification of a reasonable number of corpora (up to ten), as far as possible varied with respect to such aspects as language, depth or delicacy of markup, commercial interest, size, topic, design etc. Access to existing European-funded resources (TELRI, ECI, MulText, Parole etc.) will be of great assistance here, but other corpora will also be considered where available.

2. Identification of the features encoded by each of the selected set of corpora, and of the intersection of those feature lists.

3. Mapping of the set of features on to existing standard taxonomies such as EAGLES/TEI where possible.

For each corpus, a study of the documentation will be needed, together with possibly some software development (for example to count tags or codes used). The cost of this will vary, depending on the ease with which corpora are available, and the usability of their documentation.
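As an indication of the kind of software development meant here, the following is a minimal sketch in Python of a tag counter for SGML corpora, together with the intersection of the resulting feature lists. It is an illustration under stated assumptions, not a proposed tool: the function names and the two sample texts are invented for the purpose, element names are assumed to be case-insensitive, and a regular expression stands in for a real SGML parser (it would miscount tags hidden by minimization or inside marked sections). Non-SGML corpora would need a different pattern for each encoding scheme.

```python
import re
from collections import Counter

# Matches the name of an SGML-style start tag such as <s> or <div type="x">.
# End tags (</s>), comments and declarations (<!...) are not matched.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.\-]*)")


def count_tags(text):
    """Return a Counter of start-tag names, folded to lower case."""
    return Counter(name.lower() for name in START_TAG.findall(text))


def shared_features(tag_counts_by_corpus):
    """Return the set of tag names that occur in every corpus."""
    tag_sets = [set(counts) for counts in tag_counts_by_corpus.values()]
    return set.intersection(*tag_sets) if tag_sets else set()


# Hypothetical miniature samples, invented purely for illustration.
samples = {
    "corpus_a": "<text><p><s>One.</s> <s>Two.</s></p></text>",
    "corpus_b": '<TEXT><u who="A"><s>Hello.</s></u></TEXT>',
}

counts = {name: count_tags(text) for name, text in samples.items()}
common = shared_features(counts)

for name, tally in counts.items():
    print(name, dict(tally))
print("marked in every corpus:", sorted(common))
```

Tag names are of course only a first approximation to features: two corpora may mark the same feature with different tags, which is why the mapping on to a standard taxonomy in stage 3 remains a matter for human judgement.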

Validation Methodology (How)

This workpackage will identify appropriate strategies that can be adopted in order to carry out automatic, semi-automatic, or manual validation of resources. It will do so in terms of the features defined in the preceding ("What") workpackage, by specifying for each one what validation methods are appropriate (or possible) to determine

For SGML corpora, certainly the first, and probably the second, are always automatically verifiable, provided that the requisite components (DTD, stylesheets etc.) are provided and correct. Even for SGML corpora, however, this check must be carried out and is (indeed) an essential part of the validation process. For non-SGML corpora, by contrast, it may be very difficult indeed to define appropriate syntactic checks, particularly if the encoding procedure used is arbitrary or not formally verifiable. A variety of tools will be needed (their definition being, as I take it, a major output from Wolfgang's task description). In my view, for both SGML and non-SGML resources, semantic checks must always be carried out, at best, semi-automatically. A stochastic approach (e.g. spot checking against an original) is probably the only satisfactory method of determining whether (a) all the things which are marked as feature X are in fact Xs and (b) whether all the Xs are actually marked as such!
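The spot-checking idea can be sketched as follows. This is a minimal illustration in Python, not a recommended procedure: the function names, the sample size and the invented judgements are all assumptions, and the standard error uses the simple binomial approximation, which is unreliable for very small samples or proportions close to 0 or 1. The human checker still supplies every judgement; the code only draws the sample and summarizes the result.

```python
import math
import random


def draw_sample(items, size, seed=0):
    """Draw a reproducible random sample (without replacement) for manual checking."""
    pool = list(items)
    rng = random.Random(seed)
    return rng.sample(pool, min(size, len(pool)))


def estimate(judgements):
    """Summarize a list of True/False judgements.

    Returns (proportion judged True, binomial standard error).
    """
    n = len(judgements)
    if n == 0:
        raise ValueError("no judgements supplied")
    p = sum(judgements) / n
    return p, math.sqrt(p * (1 - p) / n)


# Question (a): of the items marked as feature X, how many really are Xs?
# Hypothetical identifiers for 500 marked items; 20 are drawn for checking.
marked_items = ["x%03d" % i for i in range(500)]
to_check = draw_sample(marked_items, 20)

# Invented outcome of the manual check: 18 of the 20 judged correct.
judgements = [True] * 18 + [False] * 2
proportion, std_error = estimate(judgements)
print("sampled %d items; %.2f judged correct (s.e. %.3f)"
      % (len(to_check), proportion, std_error))
```

Question (b) can be approached in the same way by sampling passages from the original rather than marked items, and recording for each X found by the checker whether it is marked in the corpus.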

The output from this workpackage will include a discussion of the key strategic issues mentioned above, and recommendations arising from them. It will also, more concretely, propose a standardized form in which the results of such an evaluation can be presented, based on the TEI Header. Sample evaluations, in the proposed format, will also be carried out for at least three of the corpora, at least one of them being in non-SGML format.

Validation Manual

This will be a summary or reference guide, possibly of several hundred pages, involving input from other members of the panel. OUCS is willing to take on the job of editing and preparing camera-ready and electronic versions of the text, for redistribution from the ELRA web site, if this is acceptable to the other members of the panel.

LB, 3 Oct 96