In selecting corpora to review, we tried to include both recently constructed and well-established corpora, and to sample a variety of languages and text modes. We restricted ourselves, however, to corpora likely to be readily accessible or of major interest to European corpus users. On the basis of these criteria, the following corpora were selected for review:
|Corpus||Language(s)||Mode||Annotation||Date|
|BNC (British National Corpus)||British English||Sp Wr||P||1994|
|BRO (Brown Corpus)||US English||Wr||P||1960|
|CRA (Corpus Resources and Terminology Extraction)||English, French, Spanish||Wr||P A||1995|
|ENP (English/Norwegian Parallel Corpus)||English, Norwegian||Wr||A||1996|
|HEL (Helsinki Diachronic Corpus)||Historical English||Wr||.||1994|
|ICE (International Corpus of English)||Geographical varieties of English||Wr||P||1990|
|LAM (Lampeter Corpus)||Historical English||Wr||.||1997|
|LPC (Lancaster Parsed Corpus)||British English||Wr||P T||1991|
|SEC (Lancaster IBM Spoken English Corpus)||British English||Sp||P||1986|
|LOB (Lancaster Oslo Bergen Corpus)||British English||Wr||P||1960|
|LLC (London Lund Corpus)||British English||Sp||.||1976|
|MUC (Message Understanding Conference Corpus)||American English||Wr||.||1992|
|MUL (Multext)||Nine European languages||Wr||P||1996|
|MUE (Multext East)||Six East European languages||Sp Wr||P||1997|
|MUS (Multext Sweden)||Swedish||Sp Wr||P||1997|
|PAR (Parole)||European languages||Wr||P||1997|
|PEN (Penn Treebank)||American English||Wr||P T||1995|
|SPC (Speech Presentation Corpus)||British English||Wr||P||1996|
|TEL (TELRI Plato Parallel Corpus)||8 East European languages; English; Chinese||P||A||1997|
|UAM Madrid Spoken Corpus||Spanish||Sp||.||1992|
This list gives a good range of corpora produced between 1960 and 1997, containing speech (Sp), writing (Wr), or a mixture of the two. The corpora include a wide range of European languages (Parole and Multext, for example, cover all EU official languages), and they represent work undertaken throughout Western Europe, Eastern Europe and the USA. A high proportion of these corpora were also available in an annotated form which included some form of morpho-syntactic or other analysis, indicated above by the symbols P (part-of-speech codes), A (aligned corpora), and T (treebanked); a dot marks corpora with none of these.
For each of these corpora we reviewed a range of manuals and other documentation; in some cases we also examined the corpus texts themselves. The objective was to identify the encoding practices actually adopted for each corpus, with respect both to text features and to annotations. Where there was no manual to refer to, we contacted the corpus builders directly. In this way, we were able to collate the information needed to develop a profile for each corpus.
To facilitate comparison among these profiles, we had originally planned simply to list the union of all features marked up (actually and potentially) in all our sample corpora. However, a closer examination of available encoding standards suggested that we might do better to use one of these as the baseline for our comparisons.
For our purposes, the Corpus Encoding Standard (CES), defined by EAGLES, was of most relevance. This standard defines a number of SGML document type definitions (DTDs), which are derived from the set of recommendations produced by the international Text Encoding Initiative (TEI). In examining the corpora selected, we found few (if any) textual features for which a tag was not available from this source. It therefore seemed appropriate to use this standard as the yardstick against which to compare their respective practices.
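The comparison described above can be pictured as a simple cross-tabulation of corpus profiles against a baseline feature list. The following is only a minimal sketch under stated assumptions: the feature labels, the corpus names (`CorpusA`, `CorpusB`) and the helper functions are hypothetical and are not taken from the CES, the TEI, or any of the reviewed corpora.

```python
# Hypothetical baseline feature list; NOT the actual CES tag set.
BASELINE_FEATURES = ["paragraph", "sentence", "heading", "quotation", "foreign_word"]

# Hypothetical profiles: the set of features each corpus marks up.
profiles = {
    "CorpusA": {"paragraph", "sentence", "heading"},
    "CorpusB": {"paragraph", "quotation", "speaker_turn"},
}


def cross_tabulate(profiles, baseline):
    """Return {corpus: {feature: bool}} for every feature in the baseline."""
    return {
        name: {feature: feature in marked for feature in baseline}
        for name, marked in profiles.items()
    }


def uncovered_features(profiles, baseline):
    """List features marked up in some corpus that the baseline lacks."""
    union = set().union(*profiles.values())
    return sorted(union - set(baseline))


table = cross_tabulate(profiles, BASELINE_FEATURES)
extras = uncovered_features(profiles, BASELINE_FEATURES)

# Print one row per corpus, with "x" where the feature is marked up.
for name, row in table.items():
    cells = ["x" if present else "." for present in row.values()]
    print(name, " ".join(cells))
print("Not covered by baseline:", extras)
```

In this sketch, a short list of uncovered features would correspond to the observation that few, if any, textual features lacked a tag in the chosen standard.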
As an indication of the delicacy of the analysis carried out, we identified nearly a hundred features in all. The principal groupings tabulated were:
We performed a similar cross-tabulation for the subset of our sample corpora to which morpho-syntactic analysis of some kind had been applied. This analysis, given in section 11 below, demonstrates the applicability of the EAGLES recommendations for morpho-syntactic analysis across a range of existing analysed corpora.