This document forms the chief deliverable for Work Package 3 of the ELRA contract for validation of language corpora. It discusses the theoretical basis underlying our approach to the formal validation of language corpora, and makes some recommendations about relevant techniques and practices which may be of assistance in performing such evaluations, and documenting their results. Particular attention is paid to the specific case of morpho-syntactically annotated corpora.
Some confusion exists about the terminology associated with linguistically annotated corpora. This is partly because the term tagset is used differently by two different communities. For the traditional corpus linguist, a tagset is the set of possible values used to annotate a text explicitly with a linguistic analysis; for example, the CLAWS tagset comprises a set of values such as NN1, VVD, etc., each of which has a specific significance (singular common noun, verb past tense, etc.). For the mark-up specialist, however, the term tagset refers to any kind of annotation, in particular the collection of SGML tags corresponding to the elements defined in a particular DTD: for example, the TEI defines a number of tagsets, each containing definitions for specific SGML elements and attributes.
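The corpus-linguistic sense of tagset can be pictured as a simple enumeration of legal values, each paired with its significance. The sketch below illustrates this with a handful of CLAWS-style values; the subset chosen, and the check itself, are ours for exposition only:

```python
# A tagset in the corpus-linguistic sense: an enumeration of the
# values which may legally be attached to tokens, each with a gloss.
# Only a small illustrative subset of CLAWS-style values is shown.
CLAWS_SUBSET = {
    "NN1": "singular common noun",
    "NN2": "plural common noun",
    "VVD": "past tense of lexical verb",
    "AT0": "article",
}

def is_legal_tag(tag):
    """A tag is legal only if it belongs to the enumerated tagset."""
    return tag in CLAWS_SUBSET

print(is_legal_tag("NN1"))   # True: a known value
print(is_legal_tag("XYZ"))   # False: not part of the tagset
```

On this view, validating an annotated text against the tagset amounts to no more than checking membership in the enumerated set.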
Both usages reflect the fact that all markup introduced into a text is identical, at some level of analysis, in the sense that it serves to record or assert an association between stretches of text and values taken from some externally defined set of interpretations. However most people seem to categorize an analysis such as ``this is a paragraph'' differently from the formally equivalent judgement ``this is a noun''. The former judgement is said to be `structural' and the latter `interpretative'. This kind of categorization also underlies the notion of `level' of annotation as exemplified by (inter alia) the Corpus Encoding Specification (Ide 1998), where the distinction is further justified by the observation that the addition of so-called `structural' markup is generally easier to automate than that of `interpretative' markup, since the latter (almost) invariably requires human judgement and knowledge, while the former rarely does. Particularly in the case of textual markup, interpretative judgements tend to be more controversial than structural ones, if only because the latter relate to aspects of a text which are accepted as intrinsic to its substance by the community of text readers. Structural interpretations form part of the `contracts of literacy' (Snow and Ninio, 1986) which form the precondition of a text's recognition as meaningful by the members of a particular community of readers.
For purposes of validation, however, the distinction seems unhelpful. All markup introduced into a corpus should be validated in the same way, and the validity of the corpus overall is equally affected by each type of markup used. Nevertheless, we have subdivided our discussion into two parts, reflecting the division currently made by most practitioners between structural and interpretative markup, a division which is consequently reflected in actual practice. Structural markup is most generally to be validated with reference to an abstract model of textual components and features which is either entirely intuitive and `common sense' based, or defined in terms of some consensus-based model such as that of the TEI, restated as an SGML DTD. Interpretative markup may be similarly theory-free (see, for example, Leech 1993), but it is more customary to define it with reference to some explicitly stated analytic model, and hence to facilitate both automatic validation of the corpus itself (to check that it is valid in its own terms) and comparison of two corpora using different markup schemes derived from a common abstract model.
In section 2 we discuss the process by which the structural markup defined for a given corpus may be validated. The formal mechanism used for this purpose is an SGML document type definition. In section 3 we discuss in more detail one particular kind of interpretative markup: that which seeks to make explicit morpho-syntactic analysis of a text. We present here an SGML scheme for the formal expression of an abstract model that may be used to validate such analyses both internally and externally. Finally, in section 4 we suggest some ways in which the result of either validation exercise may be formally documented. We begin, however, by describing the model of formal validation which underlies both descriptions. (For a more detailed discussion of the principles adumbrated here, see Sperberg-McQueen and Burnard 1995).
We begin by positing the existence of textual features or abstractions, instances of which are predicated at various positions within a document. The function of markup is to indicate unambiguously the presence of instances of such features. For example, a document may contain instances of the feature `segment', whose presence might be signalled by such markup conventions as:
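The particular conventions shown below are invented for exposition; any of them would serve equally well:

```sgml
<seg>a stretch of text</seg>      an SGML start-tag and end-tag pair
[seg a stretch of text]           bracketed markup tokens
{S a stretch of text}             some other arbitrary delimiters
```

In each case the markup serves both to delimit the stretch of text concerned and to name the feature whose presence is being asserted.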
As noted above, the presence and scope of a feature such as `singular noun' may be predicated in exactly the same way.
We further assume that it is possible to define a grammar for such markup symbols: that is, a grammar which defines which combinations of such symbols in a document are to be regarded as legal. Such grammars generally have regard only to the markup language itself, rather than its extension to the underlying feature set. A markup grammar may simply enumerate all legal markup tokens, or merely specify an algorithm for the identification of markup tokens with no consideration of which tokens might be permitted. A more complex grammar (such as SGML) may also be used, enabling the formulation of contextual rules such as ``the tag X is only legal within the scope of the component identified by tag Y'' in addition to these kinds of rules. Note however that legality is still defined here in terms of syntax: only informal legislation can determine whether the content of an SGML element is `correct' with reference to some semantic model. Publications such as the TEI Guidelines typically extend the syntactic definitions embodied in their DTDs by more or less detailed discussion of the intended semantics of elements, but rarely provide a formally verifiable abstract model of such semantics, nor is it entirely clear what such a model might resemble. Nevertheless, throughout our discussion we will use the term feature (and derivatives) to refer to components of such a model, and the term tag (and derivatives) to refer to components of the markup system used to assert their existence.
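A contextual rule of the kind just described (``tag X is only legal within the scope of tag Y'') can be checked with nothing more than a stack of currently open elements. The sketch below assumes a stream of (event, tag-name) pairs rather than any real parser interface, and its rule set and tag names are invented:

```python
# A minimal contextual validator. For each tag we record the set of
# parent tags within whose scope it may legally occur (None = any).
# Both the rules and the tag names are invented for illustration.
CONTEXT_RULES = {
    "w": {"s"},     # a word is only legal inside a sentence
    "s": {"p"},     # a sentence is only legal inside a paragraph
    "p": None,      # a paragraph may occur anywhere
}

def validate(events):
    """events: a sequence of ('open'|'close', tag) pairs.
    Returns a list of violation messages (empty if all rules hold)."""
    stack, errors = [], []
    for action, tag in events:
        if action == "open":
            allowed = CONTEXT_RULES.get(tag)
            parent = stack[-1] if stack else None
            if allowed is not None and parent not in allowed:
                errors.append("%s not legal inside %s" % (tag, parent))
            stack.append(tag)
        else:  # close
            if not stack or stack.pop() != tag:
                errors.append("mismatched close tag: %s" % tag)
    if stack:
        errors.append("unclosed tags: %s" % ", ".join(stack))
    return errors

ok = [("open", "p"), ("open", "s"), ("open", "w"),
      ("close", "w"), ("close", "s"), ("close", "p")]
bad = [("open", "w"), ("close", "w")]
print(validate(ok))    # []
print(validate(bad))   # one violation: w opened outside any s
```

Note that, exactly as observed above, such a check establishes only syntactic legality: it says nothing about whether the stretches of text so tagged are in fact words or sentences.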
This distinction seems to us crucial to the feasibility of validation: ``A corpus is a collection of utterances, and therefore a sample of actual linguistic behaviour. However, even if we do not believe that the distinction between competence and performance is valid, a corpus is not itself the behaviour, but a record of this behaviour'' (Stubbs, 1996). The function of the markup in the corpus is to make explicit, and hence accessible to comparative study, the recording process for both structural and interpretative encoding in a corpus text. Without this, neither comparative studies of different corpora, nor any assessment of the validity of the corpus `record' with respect to what it `records' will be possible.
We define the process of validation as comprising three stages: first, identification of the textual features held to be present in the corpus; second, verification that the tagging of the corpus conforms to the grammar of the markup scheme employed; and third, verification that the tagging correctly asserts the presence of the features so identified.
Taking these in reverse order, it is clear that, in the general case, the last two of these stages are automatable only to the extent that an abstract model can be formally specified for both the feature system itself and for the intended correspondence between that and the tagging employed. We present in section 5.1 below one such abstract model, the EAGLES Guidelines for morpho-syntactic annotation (Leech and Wilson, 1994), re-expressed as a TEI-conformant feature system, against which any other set of morpho-syntactic annotations using the same representation may be validated, without necessarily having to conform to the EAGLES model. We also discuss the somewhat simpler abstract model proposed by EAGLES itself in section 3.2 below.
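The kind of validation envisaged here can be sketched as checking concrete annotations, and the definitions of a tagset, against an abstract feature system. The features, values and tag definitions below are invented, and are far simpler than the EAGLES or TEI feature systems under discussion:

```python
# An abstract feature system: each feature declares its legal values.
# Both the features and the values here are invented for exposition.
FEATURE_SYSTEM = {
    "category": {"noun", "verb", "adjective"},
    "number":   {"singular", "plural"},
    "tense":    {"past", "present"},
}

def validate_annotation(annotation):
    """Check one bundle of feature-value pairs against the model."""
    errors = []
    for feature, value in annotation.items():
        if feature not in FEATURE_SYSTEM:
            errors.append("unknown feature: %s" % feature)
        elif value not in FEATURE_SYSTEM[feature]:
            errors.append("illegal value %s for %s" % (value, feature))
    return errors

# A tagset is then validated by checking the feature bundle assigned
# to each of its tags; the tag definitions below are again invented.
TAG_DEFINITIONS = {
    "NN1": {"category": "noun", "number": "singular"},
    "VVD": {"category": "verb", "tense": "past"},
}

def validate_tagset(tag_definitions):
    """Return, for each non-conforming tag, its list of violations."""
    return {tag: validate_annotation(features)
            for tag, features in tag_definitions.items()
            if validate_annotation(features)}

print(validate_tagset(TAG_DEFINITIONS))         # {} : every tag conforms
print(validate_annotation({"category": "gerund"}))  # an illegal value
```

Two tagsets whose definitions are both expressed against the same feature system can be compared in precisely this way, without either needing to share the other's tag names.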
Equally clearly, however, neither the third nor the first of the stages above can in principle be automated, since both depend on a human judgement to the effect that such and such a feature is in fact present, whether or not it is signalled by the tagging in a text. Such text-comprehension abilities still seem to be somewhat beyond the state of the art in NLP, despite some advances.
The second of the three stages above is however automatable, to the extent that the tagging syntax of the document is fully specified. In an SGML context, this implies the existence of a DTD against which candidate documents can be verified using an SGML parser. For other forms of markup, validation may involve other forms of verification, some of which may be intimately tied to the behaviour of particular application software. For example, a document marked up in RTF or LaTeX may be considered valid so long as Microsoft Word or LaTeX does not reject it, irrespective of its output. Technical documentation will often specify what markup should be found in a document: where the markup syntax is arbitrary or application-specific, clearly special-purpose software must be developed to validate it.
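By way of illustration of such special-purpose validation, the sketch below checks a hypothetical one-word-per-line (`vertical') corpus format, in which each line holds a token, a tab, and a tag; both the format and the tag pattern are invented:

```python
import re

# Validator for a hypothetical vertical format: one token per line,
# TOKEN <tab> TAG, where a TAG is (by our invented convention) two to
# four upper-case letters optionally followed by a single digit.
LINE_PATTERN = re.compile(r"^[^\t]+\t[A-Z]{2,4}[0-9]?$")

def validate_vertical(text):
    """Return (line-number, line) pairs for every malformed line."""
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if line and not LINE_PATTERN.match(line)]

sample = "the\tAT0\ncat\tNN1\nsat\nmat\tNN1"
print(validate_vertical(sample))   # [(3, 'sat')]: line 3 lacks a tag field
```

Such a validator establishes only that the document conforms to its declared tagging syntax; as with an SGML parse against a DTD, it says nothing about whether the tags assigned are the right ones.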