Language corpora are made by combining whole texts or extracts from pre-existing documents, usually according to some specific design criteria. The structure of the corpus itself may thus be described (and hence marked up) at two levels: internal, relating to the way the parts of the corpus fit together, and external, relating to compositional features of the originals. This distinction holds good whether the corpus under consideration is a fixed document or a dynamic or `monitor' corpus; in the latter case, as well as generally dictating the use of whole texts rather than extracts, the internal design criteria may be further extended to include such topics as the rate at which new documents enter the corpus, the criteria for determining that they should be discarded from it, etc.
The internal structural features of a corpus are largely self-evident, and require little validation: common practice requires only that individual text fragments be clearly delimited, and that each be associated with an appropriate level of description or metadata. In the TEI model, the former constitutes the text proper, and the latter its header. In older corpora, it was common practice to provide such metadata (if at all) as a separate documentary component, with only an informal association between the two, often depending on such artifices as file-naming conventions or sequencing to identify descriptive features of each component. The TEI model uses the power of SGML (in particular, its hierarchic structure and the consequent ability to specify property inheritance) to build more sophisticated structures. (For an account of some of these, see the discussion in e.g. Chapter 23 of the TEI Guidelines.)
The scope of the external features to be found marked up in language corpora varies greatly, depending both on the diverse nature of the materials they include and the diversity of applications envisaged for them. In large corpora, economic considerations alone preclude any attempt at modelling in the markup the full diversity of structures which a detailed textual feature analysis might indicate as possible: in the earliest corpora, for example the Lancaster/Oslo/Bergen corpus, even such basic organizational features as paragraphs or subheadings are rarely distinguished as such. Even today, the corpus designer is always forced to make pragmatic decisions about which structural features will have sufficient usefulness in the intended applications to warrant the expense of identifying them consistently and correctly. For many purposes, division into discrete segments, corresponding with identifiable locations in the original source, is adequate. For other purposes (for example, the study of discourse-related phenomena or text-grammar) a richer approach will be desirable.
Standards such as the CES provide a rich set of feature descriptions from which the corpus builder can select, together with specific tagging rules about how the presence of selected features can be made explicit. There is, however, considerable (and understandable) reluctance to make recommendations about which particular selections are appropriate or mandatory, since this will inevitably depend on the intended application for the corpus.
To validate such corpora, therefore, a necessary first step is to identify the intentions of the designer. A corpus which does not mark up paragraph divisions is not necessarily less valid or useful than one which does; a corpus which claims to mark such divisions but which does so inconsistently or inaccurately is. Unfortunately, as WP2 demonstrates, it is often hard for corpus builders to specify their intentions in this respect, and harder still for the validator to determine the extent to which these intentions have been carried out. Documentation and the provision of a DTD go some way to simplifying the task, as further discussed below.
As noted above, the extent to which the syntactic consistency of the structural markup in a corpus can be validated depends on the extent to which that markup uses a formally verifiable syntax. The great merit of SGML as a markup language is precisely that it makes this automatic verification simply a matter of defining an appropriate grammar (a document type definition) and checking the corpus against it. The most widely used software for this purpose is currently the freely available SGML parser SP, particularly its DOS incarnation NSGMLS [SP]. With the growing take-up of SGML and of its simplified version XML, the number and sophistication of such systems are likely to increase greatly.
SP and similar programs typically perform a number of other functions on a document, but for validation purposes, the key functions may be summarized as follows:
The output from an SGML parser is thus typically either simply confirmation that the document does in fact conform to the DTD, or a list of instances where it does not conform. At the risk of stating the obvious, it should be emphasized that a corpus which does not conform to its DTD, or which lacks a DTD, cannot be validated, no matter how closely its markup appears to be modelled on that of the SGML standard. The notion `SGML-like' or `unvalidated SGML' is not a helpful one in this context.
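The two possible outcomes described above (confirmation, or a list of non-conforming instances) can be illustrated with a minimal sketch in Python. It is not an SGML parser and does not use SP: the table CONTENT_MODEL and the functions check and validate are hypothetical names introduced here, and the table is a much-simplified stand-in for a DTD, recording only which elements may appear inside which. A real DTD also constrains order, repetition, and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified stand-in for a DTD: each element name maps to
# the set of child element names it may contain.
CONTENT_MODEL = {
    "corpus": {"text"},
    "text": {"header", "body"},
    "header": {"title"},
    "title": set(),
    "body": {"head", "p"},
    "head": set(),
    "p": set(),
}


def check(element, model, path=""):
    """Return one message per place where the document breaks the model."""
    here = f"{path}/{element.tag}"
    if element.tag not in model:
        return [f"{here}: element not declared"]
    errors = []
    for child in element:
        if child.tag not in model[element.tag]:
            errors.append(f"{here}: <{child.tag}> not allowed here")
        else:
            errors.extend(check(child, model, here))
    return errors


def validate(document, model=CONTENT_MODEL):
    """Return an empty list if the document conforms, else a list of errors."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as exc:
        # A document that is not even well-formed cannot be checked further.
        return [f"not well-formed: {exc}"]
    return check(root, model)
```

An empty result corresponds to the parser's simple confirmation of conformance; anything else is the list of instances to be corrected before the corpus can be called valid.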
For corpora which do not use SGML markup, validation will require the provision of some DTD-like set of formal rules, and the production of some parser-like software to check them against the corpus itself. Such procedures are eminently feasible, and for simple markup schemes may be considered preferable to the expense of converting the markup to true SGML. For a variety of reasons not necessary to summarize, we do not recommend this approach: in the long run, the use of a widely accepted standardized markup language should always be less expensive than the maintenance of an idiosyncratic or application-limited scheme.
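Although the approach is not recommended, its feasibility for a simple scheme is easily shown. The sketch below assumes an invented line-oriented markup in which every line begins with a one-letter code (T for a text title, H for a heading, P for a paragraph) followed by a space; the scheme, the rule table FOLLOWS, and the function check_lines are all hypothetical and do not describe any existing corpus. The rule table plays the part of the DTD, and the function that of the parser.

```python
# Hypothetical rule table: for each line code, the codes allowed to follow it.
# "START" and "END" are pseudo-codes marking the two ends of the corpus.
FOLLOWS = {
    "START": {"T"},
    "T": {"H", "P"},
    "H": {"P"},
    "P": {"P", "H", "T", "END"},
}


def check_lines(lines, rules=FOLLOWS):
    """Return (line_number, message) pairs, one for every rule violation."""
    errors = []
    previous = "START"
    for number, line in enumerate(lines, start=1):
        code = line[:1]
        if code not in rules or line[1:2] != " ":
            errors.append((number, "malformed line"))
            continue  # keep the last good code as context
        if code not in rules[previous]:
            errors.append((number, f"{code} may not follow {previous}"))
        previous = code
    if "END" not in rules[previous]:
        errors.append((len(lines) + 1, f"corpus may not end after {previous}"))
    return errors
```

Even in so small a case, the rule table and the checker are both bespoke artefacts which must be documented and maintained, which is the cost that a standardized markup language avoids.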
The list of questions to which an SGML parser will provide answers given in the previous section falls some way short of what we would like to know before deciding that a given corpus is suitable for our purposes in the general case. In particular, a parser cannot tell us
To a large extent, however, these are limitations inherent in the whole markup enterprise; they also touch on fundamental problems of naming and ontology which have exercised philosophers since the time of Aristotle, and for which it would be unreasonable to expect immediate answers. Nevertheless, it is possible to make some pragmatic observations, additional to those provided in section 3.3.1 below concerning the semantic validation of analytic tagging.
Although not formally presented as such, pre-defined feature lists such as those provided by the TEI and CES may be regarded as constituting a kind of abstract model for the structural components of texts. They thus provide a useful reference point against which the validator may check both that the objects tagged as representing some feature appear to conform with the definitions supplied there, and conversely that no features conformant with those definitions are present but untagged or tagged inappropriately. This remains however an entirely manual process.
Few corpora are small enough to permit the luxury of a close reading, and so in the general case this kind of manual validation can only be done by sampling. Typical procedures are thus to inspect some random sample of the corpus for the presence of specific tagged features, for example, the paragraph boundaries or headings. Provided that the location of these samples within the original documents is known, an attempt can then be made to assess the accuracy with which the tagging of structural features has been carried out across the corpus with respect to the original source. In the absence of an original source, such accuracy can be assessed only in statistical terms, for example by comparing the distribution of certain tagged features in the sample with their distribution across the whole, where a `correct' distribution can be hypothesized on the basis of a priori reasoning (e.g. the number of paragraphs per text of a given type should be reasonably stable) or by applying other statistically derived heuristics.
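Both procedures can be sketched in a few lines of Python. The function names, the fixed random seed, and the threshold of two standard deviations are assumptions made for illustration only; they are one possible heuristic, not a recommended standard. The input to flag_outliers is assumed to be a table giving, for each text of a single type, the number of paragraph tags it contains.

```python
import random
import statistics


def sample_texts(text_ids, k, seed=0):
    """Draw a reproducible random sample of text identifiers for inspection."""
    rng = random.Random(seed)
    return rng.sample(sorted(text_ids), min(k, len(text_ids)))


def flag_outliers(counts, threshold=2.0):
    """Return ids whose tag count is unusually far from the mean.

    `counts` maps a text id to the number of tags of one kind (for example
    paragraphs) found in that text. A text is flagged when its count lies
    more than `threshold` sample standard deviations from the mean.
    """
    values = list(counts.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []
    return sorted(t for t, n in counts.items()
                  if abs(n - mean) / spread > threshold)
```

A flagged text is not thereby shown to be wrongly tagged; it is merely a candidate for the manual inspection described above.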