In this section we discuss the possibilities for automatic or semi-automatic validation of one particular form of interpretative markup: that which seeks to mark up the result of a morphosyntactic analysis.
Whatever form of markup is employed, morphosyntactic tagging is usually supplied at the level of individual tokens in a text and is thus usually self-evident. Even in the absence of any documentation, it is generally likely to be a simple matter to extract from a document all the unique tokens constituting the markup, and also to identify the lexemes to which they are attached, as was done, for example, by Garside and McEnery 1993. In this example, annotations were separated from words by underscore characters. Other schemes place the markup and lexeme in separate `fields', or on alternate lines within the text proper. In SGML documents, annotations may be represented as attribute values, or as distinct elements, and the association between lexical item and annotation may be made by means of a pointer or link.
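As a rough illustration of the extraction step just described, the following Python sketch collects the unique annotations in an underscore-delimited text together with the word forms to which each is attached. The function name `collect_tags`, the sample sentence and the tags in it are invented for illustration; they are not taken from Garside and McEnery 1993.

```python
from collections import defaultdict

def collect_tags(text, sep="_"):
    """Map each annotation to the set of word forms it is attached to."""
    tag_to_words = defaultdict(set)
    for token in text.split():
        # Split on the last separator, so words containing "_" still work.
        word, found, tag = token.rpartition(sep)
        if found and word and tag:
            tag_to_words[tag].add(word.lower())
    return dict(tag_to_words)

# Invented sample in a word_TAG layout (assumption: whitespace-separated tokens).
sample = "The_AT0 cats_NN2 sat_VVD on_PRP the_AT0 mat_NN1"
inventory = collect_tags(sample)
for tag in sorted(inventory):
    print(tag, sorted(inventory[tag]))
```

A listing of this kind gives the inventory of markup tokens and their associated lexical items; it says nothing about what each token means.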
It will be rather less easy (in the absence of documentation) to determine what feature or combination of features each markup token is intended to represent. The list of all markup tokens, together with an index of their occurrences, and the associated lexical item, might be collated with an annotated corpus in which the same lexical items are associated with annotations whose feature equivalences are known, thus providing a kind of latter-day Rosetta Stone for the purpose. Such a process is hardly likely to be easily automated. This is one good reason for insisting on the availability of such documentation, preferably in a form which can be readily mapped to agreed standards.
Such mapping requires the predefinition of an agreed set of morphosyntactic features, independent of markup. Such a set is provided in the context of several western European languages (such as Danish, English, French, German, Greek and Spanish) by the EAGLES morphosyntactic annotation guidelines (Leech and Wilson, 1994), which we have therefore adopted as a test case for our recommendations. The procedures described here and the conclusions we reach would be equally applicable to any other set of guidelines. However, as the EAGLES guidelines have been published on the basis of a wide-ranging review of corpus builders, recommendations derived from them are likely both to reflect current practice and, potentially, to have a wide impact on it.
The EAGLES recommendations have a dual focus: as well as providing an abstract model of the feature sets against which any particular combination of the features tagged in some corpus may be validated, the Recommendations specify explicitly a subset of `recommended' features which it is assumed should always be marked. Validation at this level thus becomes a matter of simply checking that the recommended features are in fact present: in the terms we introduced in section 1.2 above, validation that the tagging is not only syntactically correct, but also complete.
EAGLES provides an `intermediate representation' for the encoding of feature sets. This operates as follows:
Here are some examples of complete intermediate representations for nouns:
This representation provides a convenient means of facilitating validation against a standard list of features. By comparing intermediate representations from the corpus with the representation of the master list of features, it may easily be ascertained what features and values are or are not represented. Even where the intermediate representation is not used, a mapping list can still be produced showing for each corpus tag the EAGLES feature which it encodes. This latter kind of list is also essential for non-EAGLES-conformant corpora and, on a smaller scale, for any additional optional features used within the EAGLES remit. In section 5.2 we present examples of mapping lists for a non-EAGLES-conformant tagset (in this case, Lancaster University's CLAWS C7 tagset as used in the part-of-speech annotation of the British National Corpus).
Two problems arise, however, when attempting such mappings. First, the tagset under consideration may under-specify with relation to the EAGLES master list, that is, some annotation may map onto more than one feature combination. For example, the CLAWS 7 tagset uses the tag VV0 to denote any form of a regular present tense verb other than the third person singular, thus blurring the distinction between the imperative, first person singular, second person singular and first, second or third person plural.
The opposite situation, where the tagset over-specifies, is also possible, particularly where the boundary between morphosyntax and semantics is blurred: here the tagset makes distinctions between sets of features regarded as equivalent by EAGLES. For example, CLAWS includes a `Noun of Style' tag (NNB) to mark English honorifics such as `Mr', `Dame', `Professor' etc. for which no equivalent feature is identified by EAGLES, and which therefore cannot be distinguished from other parts of proper names.
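Both problems can be detected mechanically once a mapping list exists. The following Python sketch assumes a mapping list held as a dictionary from corpus tag to the feature combinations it may encode. The feature names and values, and the reduced mapping itself, are hypothetical placeholders and do not reproduce the EAGLES master list or the mapping lists of section 5.2.

```python
from collections import defaultdict

# Hypothetical mapping list: corpus tag -> feature combinations it may encode.
MAPPING = {
    "VV0": [
        frozenset({("class", "verb"), ("mood", "imperative")}),
        frozenset({("class", "verb"), ("person", "first"), ("number", "singular")}),
    ],
    "NNB": [frozenset({("class", "noun"), ("type", "proper")})],
    "NP": [frozenset({("class", "noun"), ("type", "proper")})],
}

def under_specified(mapping):
    """Tags that map onto more than one feature combination."""
    return sorted(tag for tag, combos in mapping.items() if len(combos) > 1)

def over_specified(mapping):
    """Groups of tags that the feature list cannot tell apart."""
    by_combo = defaultdict(set)
    for tag, combos in mapping.items():
        for combo in combos:
            by_combo[combo].add(tag)
    return sorted(sorted(tags) for tags in by_combo.values() if len(tags) > 1)

print(under_specified(MAPPING))
print(over_specified(MAPPING))
```

With this toy mapping, VV0 is reported as under-specified, and NNB and NP are reported as a group which the feature list treats as equivalent.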
It should be noted that EAGLES does allow for arbitrary extensions to cover language-specific features. However, to stay with the previous example, honorifics are to be found in most European languages, and hence to treat them as language-specific is not appropriate. Extensibility of the basic features and their sub-categorizations will clearly be essential to any general purpose representation scheme for feature systems, and some such systems may require something more complex than a simple two-level categorization of this kind. EAGLES, itself the product of a consensus amongst corpus analysts at a particular point in time, was designed with the changing needs and practices of that community in mind. It is anticipated that revisions to both the list of recommended features and the sets of features they summarize will occur steadily, particularly as the field of application extends beyond the relatively frequently studied Western European languages.
In the general case, what is needed is a representation scheme which maximizes the flexibility of the annotation scheme without compromising the need to validate instances of its use. We discuss such a scheme in the next section.
A more powerful and discriminating representation is provided by the TEI tagset for feature structure analysis. This has two parts: a set of tags for the direct representation of feature structures, which can be linked to instances of textual objects so analysed, and a set of tags for documenting the feature system itself, that is, the constraints, allowable feature-value pairs etc. which are to be regarded as valid in a given analysis.
The feature system representation is defined in chapter 26 of the TEI Guidelines; Langendoen and Simons 1995 provides a useful introduction. A feature, in this scheme, is defined as a pair, comprising a name and a value. The latter may be one of a defined set of value types, including Boolean (plus or minus), numeric, string (an unclosed set of values), symbol (one of a defined set), a feature structure, or a reference to one. A feature structure is a named combination of such features, ordered or unordered.
For example, in an analysis of nouns, we might identify the features number and proper, with values singular or plural, and plus or minus respectively. (The decision as to the appropriate domain for a value is inevitably arbitrary: we have here chosen to regard number as being a symbolic value to allow for the possibility of additional values such as dual or uncountable). These features may be combined to form feature structures corresponding to part-of-speech annotations such as NP1 or NP2 as follows:
<fs id=NP1>
  <f name=class><sym value=noun></f>
  <f name=number><sym value=singular></f>
  <f name=proper><plus></f>
</fs>
<fs id=NP2>
  <f name=class><sym value=noun></f>
  <f name=number><sym value=plural></f>
  <f name=proper><plus></f>
</fs>
To reduce the redundancy of this representation, one may specify the individual features making up a given feature structure by reference. This requires that the features to be used are first specified independently of the structures in which they are to be combined, using a construct known as a feature library, represented by a <fLib> element, each one being given a unique identifier, as follows:
<fLib>
  <f name=class id=FCN><sym value=noun></f>
  <f name=number id=FN1><sym value=singular></f>
  <f name=number id=FN2><sym value=plural></f>
  <f name=proper id=FPP><plus></f>
  <f name=proper id=FPM><minus></f>
</fLib>
Each of the feature structures attested can now be represented by reference to these underlying primitives, using the feats attribute, as follows:
<fs id=NN1 feats="FCN FPM FN1"></fs>
<fs id=NN2 feats="FCN FPM FN2"></fs>
<fs id=NP1 feats="FCN FPP FN1"></fs>
<fs id=NP2 feats="FCN FPP FN2"></fs>
It should be apparent how this approach permits an SGML-aware processor to identify automatically those linguistic analyses in which features such as number or properness are marked, independently of the actual category code (the NN1 or NP2) used to mark the analysis. In addition, of course, the use of the SGML ID/IDREF mechanism allows for simple validation of the codes used. For more sophisticated validation, for example to ensure that the feature properness cannot be both plus and minus in the same analysis, the TEI specifies an additional declarative mechanism, known as a feature system declaration (FSD).
Full details of the FSD are provided in chapter 26 of the TEI Guidelines; its relevance for our present purposes is that it provides a mechanism, intermediate in constraining power between a full document type definition (which requires that all possible annotations or tags be specified in advance) and the kind of limited validation possible with the EAGLES mapping list. A fully elaborated feature system declaration for the EAGLES morphosyntactic classification scheme is presented in section 5.1 below. This more general solution makes possible a form of internal validation, whereby the contents of the corpus are validated against feature lists produced specifically for that corpus, or where the feature list used is a superset or subset of the EAGLES feature list, without losing the ability to validate that part of the feature set which does coincide with EAGLES' recommendations.
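The two kinds of check mentioned here can be sketched procedurally in Python. This is only an approximation, under stated assumptions, of what an SGML parser and an FSD provide declaratively: the feature library is transcribed by hand into a dictionary, and the function names `resolve` and `tags_with` are invented for the example.

```python
# Feature library transcribed by hand: identifier -> (feature name, value).
FEATURE_LIBRARY = {
    "FCN": ("class", "noun"),
    "FN1": ("number", "singular"),
    "FN2": ("number", "plural"),
    "FPP": ("proper", "plus"),
    "FPM": ("proper", "minus"),
}

# Feature structures expressed by reference, as in a feats attribute.
STRUCTURES = {
    "NN1": "FCN FPM FN1",
    "NN2": "FCN FPM FN2",
    "NP1": "FCN FPP FN1",
    "NP2": "FCN FPP FN2",
}

def resolve(feats, library=FEATURE_LIBRARY):
    """Expand a feats string into a name -> value dict, raising on errors."""
    structure = {}
    for ref in feats.split():
        if ref not in library:
            # Roughly what ID/IDREF validation would catch.
            raise ValueError(f"unknown feature identifier: {ref}")
        name, value = library[ref]
        if name in structure and structure[name] != value:
            # The kind of constraint the text assigns to an FSD.
            raise ValueError(f"conflicting values for feature {name!r}")
        structure[name] = value
    return structure

def tags_with(name, value, structures=STRUCTURES):
    """Category codes whose analysis carries the given feature value."""
    return sorted(
        tag for tag, feats in structures.items()
        if resolve(feats).get(name) == value
    )

print(tags_with("number", "plural"))
```

Calling `resolve("FCN FPP FPM")` raises an error because properness would be both plus and minus, and `resolve("FCN FXX")` raises one because FXX is not declared in the library.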
Returning for the moment to the utility of the original EAGLES report for validation, as a first step for languages covered by the report, corpus designers would be foolish to ignore the relevance of the EAGLES obligatory and recommended features, since these now form an agreed cross-linguistic EU standard. Any internal validation should thus be regarded as secondary to an EAGLES validation. Adoption of a feature-based system for validation makes possible the application of identical validation techniques in either case.
The process of deriving a feature set from documentation is also a convenient way of checking the thoroughness and consistency of the documentation itself. Anomalies such as the presence of undocumented tags in the corpus, or the presence of unused or `phantom' features in the documentation are often only found by such a process.
The former are easily handled by rectifying the documentation, but the latter are slightly more problematic. Phantom features may occur for any of three reasons:
Clearly, the most serious case is that of (3): here the annotation does not validate against the intended features and needs to be rectified. Such deficiency, at least at the EAGLES obligatory and recommended levels, should be immediately evident when the corpus annotation used is checked against the feature list. In the case of (2), only the documentation needs correcting. In the case of (1), the matter should simply be documented, for the information of corpus users. Phantom tags can be introduced as the result of typographic errors; the use of an automatic system for introduction of tags and their automatic validation against the agreed corpus tagset entirely does away with this form of error.
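The comparison between documentation and corpus amounts to two set differences, as the following Python sketch suggests. It assumes that the documented tagset and the tags attested in the corpus have already been extracted as simple lists; the function name `compare_tagsets` and the sample tags are illustrative only.

```python
def compare_tagsets(documented, attested):
    """Report tags missing from the documentation and tags never used."""
    documented, attested = set(documented), set(attested)
    return {
        # Present in the corpus but absent from the documentation.
        "undocumented": sorted(attested - documented),
        # Present in the documentation but never used in the corpus.
        "phantom": sorted(documented - attested),
    }

report = compare_tagsets(
    documented=["NN1", "NN2", "NP1", "NP2"],
    attested=["NN1", "NN2", "NP1", "NN3"],
)
print(report)
```

Such a report identifies the anomalies but cannot say which of the possible causes applies in a given case; that judgement remains a manual one.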
The aim of this level of validation is to ensure that the form of tags is consistent. Specifically, it should check that:
We use the phrase `lexical item' above to indicate that the tokens to which annotation is attached need not correspond with orthographic words. Although many commonly used annotation schemes for English do in fact attempt to make this correspondence, it is unnecessary where a single formalism such as SGML or something of equivalent power is used to represent both structure and analysis.
Thus, the CLAWS scheme uses a special form of annotation known as `ditto' tags to indicate that the annotation for one token applies also to another. For example, the English conjunction `so that' should properly be regarded as a single conjunction, although it is orthographically represented as two tokens. Early versions of CLAWS tagged this phrase as so_CS21 that_CS22 or, using the equivalent SGML formalism, as
<w CS21>so <w CS22>that
The actual annotation for conjunction is CS, the following digit 2 indicates the number of tokens to which it is to be attached, and the final 1 and 2 indicate the number of this token within the sequence. A more natural approach would be to revise the tokenization rules so that the token so that might be treated as a single unit, tagging it as
<w CS2>so that
Uncoupling the annotation structure from the orthographic structure also enables a consistent approach to be taken for the case where the morphosyntactic units to be tagged are smaller than orthographic words.
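Where ditto tags of the CS21, CS22 kind are retained, their internal consistency can be checked mechanically. The Python sketch below rests on an assumption which is not stated in the CLAWS documentation cited here, namely that a ditto tag is a run of capital letters followed by exactly two digits (sequence length, then position) and that no ordinary tag has that shape. The function name `check_ditto` is invented for the example.

```python
import re

# Assumed shape of a ditto tag: letters, sequence length (2-9), position (1-9).
DITTO = re.compile(r"^([A-Z]+)([2-9])([1-9])$")

def check_ditto(tags):
    """Return the positions at which a ditto-tag sequence is broken."""
    errors = []
    expected = None  # (base, length, next position) while inside a sequence
    for pos, tag in enumerate(tags):
        m = DITTO.match(tag)
        part = (m.group(1), int(m.group(2)), int(m.group(3))) if m else None
        if expected is not None and part != expected:
            errors.append(pos)  # sequence interrupted or misnumbered
            expected = None
            if part is None or part[2] != 1:
                continue  # no new sequence starts at this token
        elif expected is None and part is not None and part[2] != 1:
            errors.append(pos)  # sequence starts in the middle
            continue
        if part is not None:
            base, length, index = part
            expected = (base, length, index + 1) if index < length else None
    if expected is not None:
        errors.append(len(tags))  # input ended inside a sequence
    return errors

print(check_ditto(["CS21", "CS22", "NN1"]))
print(check_ditto(["CS21", "NN1"]))
```

The first call reports no errors; the second reports position 1, where the expected CS22 is missing.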
We recommend above that a single annotation be attached to each lexical token, recognizing that in production systems it may be necessary to retain deliberately ambiguous or polyvalent annotations to avoid incorrect deterministic disambiguation. Such exceptions to the ``one word, one tag'' rule should be clearly documented to aid validation; ideally each possible combination of multiple annotations can be represented as a distinct choice within the feature set. The FSD notation recommended below supports this possibility.
The majority of these tasks can be achieved using a series of procedures aided by simple Unix tools such as awk and grep. Checking SGML requires an SGML parser, and a number of these are available. As part of this workpackage, we reviewed the SGML validation that had been undertaken on the corpora covered in the WP2 review. For the most part, the results (summarized in section 5.3 below) indicate that as yet only a few corpus builders are taking advantage of the availability of tools such as SGML parsers to validate formally-defined markup schemes.
This is unsurprising, given the fact that such schemes have only begun to gain wide acceptance in the last few years. However, it does seem strange that the topic of validation is rarely touched on in the extant literature concerning corpus design and construction; where it is, the topic appears to relate almost exclusively to the statistical validity of a given sample as representative of some aspect of language (see for example Clear 1992, Atkins et al. 1990). Corpora such as the LOB and Brown have been so exhaustively studied and analysed that it would be surprising if such errors as they contain had not come to light; where they have, however, corpus designers and builders seem to have been uninterested in their status or implications. A plausible reason for this is that it is only with the advent of really large corpora, often produced by automatic or semi-automatic methods of data capture such as optical character recognition or as a by-product of electronic typesetting, that questions of accuracy and authenticity have arisen.
As stated above, an accurate assessment of the semantic validity of any markup in a corpus is an inherently intractable problem. Where the function of the markup is to assert the existence of a human interpretation of the data, it is probably the case that this can only be validated manually, although some control over variability may be gained by applying heuristics to assess semantic conformance to a pre-established norm. For example, if we know the statistical distribution of specific nouns, verbs etc. in a general corpus like the BNC, then we may be able to check future corpora against these distributions. However, this is clearly a rough and ready process.
Let us turn to considering hand validation. Even where human checking occurs, a validation cannot be considered 100% accurate, since frequently there is scope for error or genuine disagreement, even within a single set of guidelines (see for example Baker 1997). One check that could be automated would be to see whether an assigned tag is allowed for a given word, by checking the word's entry in a lexicon. However, this only makes sense when (a) a lexicon has been used to tag the text and (b) manual correction has taken place; otherwise we can already be sure that the tag is permissible, unless there is something very seriously wrong with the operation of the tagging program. Limitations on this method of checking are (a) the fact that often a suffix list, etc., rather than an exhaustive lexicon, is used for tag assignment and (b) the presence of new tags, i.e., permissible and correct tags added by human annotators because a new contextual reading is missing from the lexicon.
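One very simple form such a heuristic might take is sketched below in Python: the relative frequency of each tag in a new corpus is compared with that in a reference corpus, and tags whose share differs by more than a chosen tolerance are flagged. The tolerance of 0.05, the function names and the toy data are arbitrary assumptions made for the example, and a serious comparison would need a proper statistical test. Both tag lists are assumed to be non-empty.

```python
from collections import Counter

def relative_freq(tags):
    """Share of the corpus taken by each tag (tags must be non-empty)."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def flag_deviations(reference_tags, new_tags, tolerance=0.05):
    """Tags whose relative frequency differs from the reference by more
    than the tolerance, with the signed difference rounded to 3 places."""
    ref = relative_freq(reference_tags)
    new = relative_freq(new_tags)
    flagged = {}
    for tag in sorted(set(ref) | set(new)):
        diff = new.get(tag, 0.0) - ref.get(tag, 0.0)
        if abs(diff) > tolerance:
            flagged[tag] = round(diff, 3)
    return flagged

# Toy data standing in for tag streams read from two corpora.
reference = ["NN1"] * 50 + ["VVD"] * 50
candidate = ["NN1"] * 80 + ["VVD"] * 20
print(flag_deviations(reference, candidate))
```

A flagged tag indicates only that the new corpus differs from the norm; whether the difference reflects tagging error or a genuine property of the texts still has to be judged by hand.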
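The lexicon check described above might be sketched in Python as follows. The lexicon is assumed to be a dictionary from lower-cased word form to the set of tags permitted for it; its contents here, like the function name `disallowed_tags`, are invented. Words absent from the lexicon are skipped rather than reported, which reflects limitation (a).

```python
def disallowed_tags(tagged_tokens, lexicon):
    """Return (position, word, tag) for tags the lexicon does not allow."""
    problems = []
    for position, (word, tag) in enumerate(tagged_tokens):
        allowed = lexicon.get(word.lower())
        if allowed is None:
            continue  # word not in the lexicon: nothing can be concluded
        if tag not in allowed:
            problems.append((position, word, tag))
    return problems

# Hypothetical lexicon fragment and hand-corrected text.
LEXICON = {"run": {"NN1", "VV0"}, "the": {"AT0"}}
TEXT = [("The", "AT0"), ("run", "VVD"), ("ended", "VVD")]
print(disallowed_tags(TEXT, LEXICON))
```

Each item reported is a candidate for review only, since by limitation (b) it may be a correct tag which an annotator has added for a reading missing from the lexicon.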
In addition to the strictly morphosyntactic analysis discussed so far, the EAGLES Guidelines also envisage two generic forms of syntactic analysis: phrase structure and dependency. Phrase structure grammars require the ability to model well-balanced trees in a markup language, while structural dependency grammar requires the ability to describe directed acyclic graphs.
Both abilities are intrinsic to the SGML abstract model, and the tasks of first representing, and then validating the correctness of, such structures are thus comparatively trivial. Furthermore, it is clear that the fundamental problems of semantic validation are the same whether analyses are attached to high level structural units such as those identified by syntactic analysis or to lower level word-like tokens.
The generality of the SGML model leads to its being suitable for the tagging of a semantically highly diverse set of textual features. For example, the TEI recommendations propose that SGML tagging be applied to mark inter alia the following features:
While there is no doubt that an SGML encoding can cope with all of these forms of analysis individually, the difficulty of distinguishing them in combination rapidly increases, particularly if they are all located in the same data stream. There is an increasing tendency therefore towards so-called `out-of-line' annotation, in which potentially many, possibly contradictory, annotations or analytic interpretations are stored independently of the text itself, but linked to it by means of hypertext pointers. Similar techniques are required for the alignment of the structural components of multilingual or multimedia corpora.
Such techniques have much to recommend them, but place additional constraints on the ease with which the semantic and syntactic correctness of any one analysis can be validated. As well as checking that the analysis is internally consistent, it must be possible to check that the targets of each link are correctly specified. This may be difficult, if a non-portable or non-robust method has been used to specify them, or impossible entirely if the corpus text has been changed. Reliable standards for the specification of robust and application-independent linking mechanisms (e.g. HyTime, XLL) have a degree of acceptance within the computing sector, but are not yet widely accepted or understood within the community of corpus creators. An obvious exception to this generalization is in the special case of multilingual or multimedia aligned corpora where such mechanisms are essential.
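The link check can be illustrated with a deliberately simplified Python sketch. It does not use HyTime or XLL: it assumes that each token in the text carries a plain identifier and that each out-of-line annotation records the identifier of its target together with a copy of the word form it was made against. All names and data are hypothetical.

```python
def check_links(annotations, tokens):
    """Report annotations whose target is missing or has changed.

    annotations: list of dicts with 'target', 'expected' and 'tag' keys.
    tokens: dict mapping token identifier -> current word form.
    """
    problems = []
    for ann in annotations:
        target = ann["target"]
        if target not in tokens:
            problems.append(("missing", target))
        elif tokens[target] != ann["expected"]:
            problems.append(("changed", target))
    return problems

# Toy text and stand-off annotations.
tokens = {"w1": "so", "w2": "that", "w3": "we"}
annotations = [
    {"target": "w1", "expected": "so", "tag": "CS21"},
    {"target": "w2", "expected": "than", "tag": "CS22"},
    {"target": "w9", "expected": "go", "tag": "VV0"},
]
problems = check_links(annotations, tokens)
print(problems)
```

Keeping a copy of the expected word form with each annotation is one possible design choice; it allows a changed text to be detected, at the cost of some redundancy.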
We have restricted ourselves primarily to morphosyntax and syntax, partly because these are the most widely encountered forms of annotation and are also the only ones for which, at present, EAGLES guidelines exist. Other forms of annotation are sparser and more diverse, with insufficient examples of each type to make generally acceptable recommendations, even where consensus exists as to the scope or application of such analyses. This situation is likely to change over time, and consideration should be given on a rolling basis to validation procedures as the application of annotation types and the development of standards proceeds.
With this said, it is likely that many of the issues for validation of, say, pragmatic annotation, will be similar to those for morphosyntax. While the precise details of the scope of annotations and the interpretative nature of the schemes may differ, basic issues such as idiosyncratic v. widely accepted annotation schemes and questions of rigid v. fluid analysis schemes will most likely remain the same. So future work on the validation of such further annotations will be able to refer to this document for guidance, if not a complete solution.