2 The Sample Corpora

In selecting corpora to review, we tried to include both recently constructed and well-established corpora, and to sample a variety of languages and text modes. We restricted ourselves, however, to corpora likely to be readily accessible or of major interest to European corpus users. On the basis of these criteria, the following corpora were selected for review:
Table 1: Corpora investigated

Corpus                                             Language(s)                                  Mode   Annotation  Date
BNC (British National Corpus)                      British English                              Sp Wr  P           1994
BRO (Brown Corpus)                                 US English                                   Wr     P           1960
CRA (Corpus Resources and Terminology Extraction)  English, French, Spanish                     Wr     P A         1995
ENP (English/Norwegian Parallel Corpus)            English, Norwegian                           Wr     A           1996
HEL (Helsinki Diachronic Corpus)                   Historical English                           Wr     .           1994
ICE (International Corpus of English)              Geographical varieties of English            Wr     P           1990
LAM (Lampeter Corpus)                              Historical English                           Wr     .           1997
LPC (Lancaster Parsed Corpus)                      British English                              Wr     P T         1991
SEC (Lancaster IBM Spoken English Corpus)          British English                              Sp     P           1986
LOB (Lancaster Oslo Bergen Corpus)                 British English                              Wr     P           1960
LLC (London Lund Corpus)                           British English                              Sp     .           1976
MUC (Message Understanding Conference Corpus)      American English                             Wr     .           1992
MUL (Multext)                                      Nine European languages                      Wr     P           1996
MUE (Multext East)                                 Six East European languages                  Sp Wr  P           1997
MUS (Multext Sweden)                               Swedish                                      Sp Wr  P           1997
PAR (Parole)                                       European languages                           Wr     P           1997
PEN (Penn Treebank)                                American English                             Wr     P T         1995
SPC (Speech Presentation Corpus)                   British English                              Wr     P           1996
TEL (TELRI Plato Parallel Corpus)                  8 East European languages; English; Chinese         P A         1997
UAM Madrid Spoken Corpus                           Spanish                                      Sp     .           1992

This list gives a good range of corpora produced over the last thirty years, containing speech (Sp), writing (Wr), or a mixture of the two. The corpora include a wide range of European languages (Parole and Multext, for example, cover all EU official languages), and they represent work undertaken throughout Western Europe, Eastern Europe and the USA. A high proportion of these corpora were also available in annotated form, with some kind of morpho-syntactic or other analysis; this is indicated in Table 1 by the symbols P (part-of-speech codes), A (aligned corpora), and T (treebanked).
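The legend used in Table 1 can be read as a small record structure. The following is a minimal sketch only, assuming a hypothetical CorpusEntry record and helper functions that are not part of the original study; it transcribes a handful of rows from Table 1 (not the full list) and shows how the mode and annotation symbols might be queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One row of Table 1 (hypothetical record type, for illustration only)."""
    code: str
    language: str
    modes: frozenset        # subset of {"Sp", "Wr"}
    annotations: frozenset  # subset of {"P", "A", "T"}; empty if none is listed
    year: int


# A few rows transcribed from Table 1; the remaining rows are omitted here.
SAMPLE = [
    CorpusEntry("BNC", "British English", frozenset({"Sp", "Wr"}), frozenset({"P"}), 1994),
    CorpusEntry("CRA", "English, French, Spanish", frozenset({"Wr"}), frozenset({"P", "A"}), 1995),
    CorpusEntry("HEL", "Historical English", frozenset({"Wr"}), frozenset(), 1994),
    CorpusEntry("LPC", "British English", frozenset({"Wr"}), frozenset({"P", "T"}), 1991),
    CorpusEntry("SEC", "British English", frozenset({"Sp"}), frozenset({"P"}), 1986),
    CorpusEntry("PEN", "American English", frozenset({"Wr"}), frozenset({"P", "T"}), 1995),
]


def with_annotation(entries, symbol):
    """Return the codes of corpora carrying the given annotation symbol."""
    return [e.code for e in entries if symbol in e.annotations]


def with_mode(entries, mode):
    """Return the codes of corpora that include the given text mode."""
    return [e.code for e in entries if mode in e.modes]


if __name__ == "__main__":
    print("Treebanked:", with_annotation(SAMPLE, "T"))
    print("Aligned:", with_annotation(SAMPLE, "A"))
    print("Containing speech:", with_mode(SAMPLE, "Sp"))
```

Representing the two symbol columns as sets rather than strings keeps a combined entry such as "P T" queryable by either symbol.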

For each of these corpora we reviewed a range of manuals and other documentation; in some cases we also examined the corpus texts themselves. The objective was to identify the encoding practices actually adopted for each corpus, with respect both to text features and to annotations. Where there was no manual to refer to, we contacted the corpus builders directly. In this way, we were able to collate the information needed to develop a profile for each corpus.

To facilitate comparison amongst them, we had originally planned simply to list the union of all features marked up (actually and potentially) in all our sample corpora. However, a closer examination of available encoding standards suggested that we might do better to use one of these as the baseline for our comparisons.

For our purposes, the Corpus Encoding Standard (CES), defined by EAGLES, was of most relevance. This standard defines a number of SGML document type definitions (DTDs), which are derived from the set of recommendations produced by the international Text Encoding Initiative (TEI). In examining the corpora selected, we found few (if any) textual features for which a tag was not available from this source. It therefore seemed appropriate to use this standard as the yardstick against which to compare the practice of each corpus.
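The comparison against a baseline can be pictured as a simple cross-tabulation. The sketch below is illustrative only: the feature labels, the corpus names (corpus_a, corpus_b) and both function names are hypothetical placeholders, and none of them is taken from the CES DTDs or from the profiles actually compiled for this review. It shows the two questions such a comparison answers: which baseline features each corpus marks up, and which marked-up features fall outside the baseline.

```python
def cross_tabulate(baseline, profiles):
    """Map each baseline feature to {corpus: True/False} (marked up or not)."""
    return {
        feature: {corpus: feature in features for corpus, features in profiles.items()}
        for feature in baseline
    }


def outside_baseline(baseline, profiles):
    """Return features marked up in some corpus but absent from the baseline."""
    union_of_features = set().union(*profiles.values())
    return sorted(union_of_features - set(baseline))


# Hypothetical baseline feature list (placeholder labels, not CES element names).
BASELINE = ["paragraph", "sentence", "heading", "quotation"]

# Hypothetical corpus profiles: the set of features each corpus marks up.
PROFILES = {
    "corpus_a": {"paragraph", "heading"},
    "corpus_b": {"paragraph", "sentence", "page_break"},
}

if __name__ == "__main__":
    table = cross_tabulate(BASELINE, PROFILES)
    for feature, row in table.items():
        marks = "  ".join("x" if present else "." for present in row.values())
        print(f"{feature:<12} {marks}")
    print("Not covered by baseline:", outside_baseline(BASELINE, PROFILES))
```

Using a fixed baseline rather than the plain union of all observed features keeps the rows of the comparison stable as further corpora are added; anything a corpus marks up beyond the baseline is reported separately.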

As an indication of the delicacy of the analysis carried out, we identified nearly a hundred features in all. The principal groupings tabulated were:

The results of this cross-comparison are given in detail in Tables 2 and 3 below, and summarized in the next section.

We performed a similar cross-tabulation for the subset of our sample corpora to which morpho-syntactic analysis of some kind had been applied. This analysis, given in section 11 below, demonstrates the applicability of the EAGLES recommendations for morpho-syntactic analysis across a range of existing analysed corpora.