3 Findings: corpora

Clearly, we would not expect to find any use of SGML or adherence to the CES Guidelines in the earliest corpora studied here. The LOB, Brown, Helsinki and Spoken English Corpora all naturally use idiosyncratic encoding systems; what is of more interest is the high degree of overlap, both among these early corpora and between them and the others, in the features they do choose to mark up, irrespective of the particular syntactic conventions they apply. This suggests that automatically converting their conventions to TEI-conformant encoding would be quite trivial (although bringing them into conformance with the CES requirements might require the addition of some information not readily available). Considering how widely used these corpora are at present, and have been for some time past, such a preliminary mapping at least would seem to be well worth undertaking.
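
To illustrate the kind of preliminary mapping meant here, the following is a minimal sketch only. The legacy markers `^P` and `^H` are invented for the example and are not the actual conventions of LOB, Brown, Helsinki or the Spoken English Corpus; the function names are likewise hypothetical. The point is simply that, where the same features are marked, conversion reduces to a table lookup.

```python
from xml.sax.saxutils import escape

# Hypothetical legacy markers mapped to TEI-style element names.
# These markers are invented for illustration; they are NOT the real
# conventions of any corpus discussed in this section.
LEGACY_TO_TEI = {
    "^P": "p",     # assumed paragraph marker
    "^H": "head",  # assumed heading marker
}


def convert_line(line):
    """Wrap one legacy-coded line in the corresponding TEI-style element."""
    for marker, element in LEGACY_TO_TEI.items():
        if line.startswith(marker):
            text = escape(line[len(marker):].strip())
            return f"<{element}>{text}</{element}>"
    # Lines with no recognised marker are passed through as escaped text.
    return escape(line.strip())


def convert(text):
    """Convert every non-blank line of a legacy-coded text."""
    return "\n".join(convert_line(l) for l in text.splitlines() if l.strip())
```

A real conversion would also have to supply the header information that CES requires, which is exactly the part that cannot be recovered from the text alone.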

Of more concern is the extent of variability in the encoding of the modern corpora. A good number of the corpora which we reviewed might reasonably be regarded as TEI-conformant (BNC, CRATER, PAROLE, MULTEXT for example), many of them specifically adhering to CES Guidelines. However, others have a far less systematic approach to encoding matters. Neither the Penn Treebank nor the MUC corpora claim to conform to TEI recommendations, nor even to SGML syntactic correctness. The ICE corpus meets some of the requirements of TEI, but omits various elements required in the header, and has only recently begun to require formal SGML validation of its contributors. The TELRI corpus (or at least the `Plato' subset of it which its designers suggested we examine) appears to encode different languages in different ways, with little agreement amongst its co-operating groups even about whether or not such simple features as paragraph or sentence boundaries should be tagged. Some groups are in a position both to articulate and to enforce validation criteria (for example, ``paragraphs should be tagged using the P tag'', ``corpus documents should use syntactically valid SGML conforming to a specific DTD'') but many apparently are not. In such a situation, corpus interchange and integration will continue to be a dispiriting uphill task.
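
Validation criteria of the kind quoted above can be enforced mechanically once they are articulated. The sketch below is an assumption-laden illustration, not a description of any project's actual procedure: it uses Python's standard XML parser as a stand-in for an SGML parser, it tests well-formedness only (not validity against a DTD), and the function name `check_document` and the lower-case element name `p` are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_document(source, required=("p",)):
    """Return a list of problems found; an empty list means the checks passed.

    Assumption: the document is XML-like. Only well-formedness and the
    presence of required elements are checked, not DTD validity.
    """
    try:
        root = ET.fromstring(source)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    # Collect every element name that occurs anywhere in the document.
    present = {el.tag for el in root.iter()}
    return [f"no <{tag}> element found" for tag in required if tag not in present]
```

For example, `check_document("<text><s>One.</s></text>")` reports a missing `p` element, while a document with an unclosed tag is rejected before any element check is made.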

Even those modern corpora which may be described as TEI-conformant may take different positions on whether or not a given textual feature should actually be made explicit in their encoding. For example, the BNC makes explicit in its markup the location and nature of any material (for example a picture or table) which has been omitted from a text. In the CRATER corpus (and others) such material is silently omitted, even though there may be a clear reference to it within the text. A similar `silent correction' policy is used by PAROLE.

On the specific issue of documentation, we also found great variation amongst the corpora. Gathering precise information about how particular text features have in fact been encoded in a corpus can be time-consuming as well as difficult. At least for corpora which claim to be TEI-conformant, there is a readily available public description of how the encoding scheme should function, while for those which conform to CES Guidelines, there is an additional (and equally easily found) set of rules as to how the TEI scheme should be applied. With this to hand, it should be relatively easy to determine how well the corpus builder has followed the standard, particularly if any deviations from it have been correctly documented, for example in the corpus header.

Turning to corpora which use their own idiosyncratic schemes, the situation is in general disappointing. Sometimes documentation takes the form of a published article, sometimes it is available on the net, and sometimes it can be obtained only by detective work. This might be understandable for older corpora, but really cannot be excused in more recently created corpora, whose builders have had ready access to several decades' experience of both the necessity for accurate contextual information or documentation and the readiest means of supplying it together with a text.