Designing and Building a TEI-conformant Corpus of Historical English

Lou Burnard (Humanities Computing Unit, Oxford University),

Claudia Claridge and Rainer Sigmund (REAL Centre, University of Chemnitz)

This paper describes the background and development of a new corpus of Historical English, covering the period 1640 to 1740 and a wide range of discourse types. As well as describing the corpus itself and its intended applications, we will focus on the procedure by which the corpus has been constructed and made conformant with the TEI Recommendations, and the consequences of those procedures for re-usability and accessibility of the resulting resource.

The Lampeter Corpus Project

The Lampeter Corpus Project began in 1991, when Prof. Dr. Josef Schmied and Eva Hertel were at Bayreuth University, and moved with them to Chemnitz in 1993. It has been funded by the Deutsche Forschungsgemeinschaft (DFG), the German Research Association, since 1994. Travel grants from the Deutscher Akademischer Austauschdienst (DAAD), the German Academic Exchange Service, made possible research collaboration with the English Department at Helsinki University and the Department of Linguistics & Modern English Language at Lancaster University on questions of corpus compilation and annotation; they have also supported collaboration with the Humanities Computing Unit at Oxford University on questions of text encoding and archiving.

As visiting professor at the University of Wales, Lampeter in 1991, Prof. Schmied learned of the resources available in the University's Founders' Library and worked out an arrangement with the library staff for their collaboration in the building of a corpus of Early Modern English. The main objective of the project was to fill a gap in the availability of historical corpora: specifically, the lack of balanced corpora made up of complete texts for text-linguistic and stylistic analysis. The time-span selected for the corpus was 1640 to 1740, as this period appeared to be of particular interest in terms of both language change and historical developments.

The corpus mirrors a century that was crucial in the standardisation process of British English as we know it today, and provides a stretch of time long enough to permit investigations into questions of language change. The sampling procedure allows the researcher to observe change across three generations within comparable discourse types.

The beginning of the English Civil War in 1642 marked the beginning of a new era in English history, one which was to create new power structures in society, the economy, politics and religion. The battles of this and later times, however, were fought not only with arms but also with words, and the sharpest weapons used on the battlegrounds of public opinion were contemporary tracts and pamphlets.

Structure and Design of the Corpus

In designing the corpus an attempt was made to meet the needs of both linguists and historians. The Corpus is not a randomly selected collection of extracts, but instead a balanced corpus of complete texts, chosen according to specific criteria which may be summarized as follows.

Sampling criteria

Texts were selected for inclusion in the corpus principally by date of publication and by topic domain. The 120 distinct texts making up the corpus of circa 1.2 million words are spread evenly across these two dimensions, with two texts being selected for each combination of decade and domain. For purposes of dating, the decade within which each text first appeared was selected. For purposes of topic classification, each text was assigned to one of the following subjective classifications:

In each case, the complete text was transcribed, including dedications, prefaces, postscripts, etc. but excluding illustrations, figures, tables etc. Texts vary somewhat in length, a few being as short as 3,000 or as long as 20,000 words, but most are between 10,000 and 15,000 words long. In selecting titles, a conscious effort was made to select each author only once, and to exclude major literary figures. Wherever possible, the first edition of a text was chosen; later editions were used only when they were known to have been revised by their author. In no case were modern editions transcribed.
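The arithmetic behind the sampling design can be checked with a short sketch. It assumes that the period 1640 to 1740 is divided into ten decades (1640s to 1730s); the number of domains is then derived from the figures given above rather than asserted. The domain labels are placeholders, not the project's own category names.

```python
from itertools import product

# Assumption: ten decades, 1640s through 1730s, cover the period 1640 to 1740.
DECADES = list(range(1640, 1740, 10))
TEXTS_PER_CELL = 2   # two texts per decade/domain, as stated in the text
TOTAL_TEXTS = 120    # total number of texts, as stated in the text

# The number of domains follows from the figures above.
n_domains = TOTAL_TEXTS // (len(DECADES) * TEXTS_PER_CELL)

# Placeholder labels only; these are not the project's category names.
DOMAINS = [f"domain_{i}" for i in range(1, n_domains + 1)]


def sampling_grid():
    """Return a dict mapping each (decade, domain) cell to its quota of texts."""
    return {cell: TEXTS_PER_CELL for cell in product(DECADES, DOMAINS)}


grid = sampling_grid()
print(len(grid), "cells,", sum(grid.values()), "texts")
```

Under these assumptions the grid has 60 cells, and filling each with two texts accounts for all 120 texts.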

Text encoding

The corpus is marked up in SGML, following closely the principles of the Text Encoding Initiative (TEI), as exemplified in other European corpus building ventures, notably the British National Corpus. Each text forms a distinct document, with its own header, but the corpus is also treated as a single document, with its own header. The headers provide specific detailed information about the context within which each text was produced, for example: information on authors (name, age, sex, place of residence, education, social status, political affiliation), printers/publishers, place and date of printing, publication format, text characteristics, bibliographical references.

In the body of the texts, we decided to mark up only the basic structure of each text (though this was often complex enough), and within that, to mark up with particular attention the many changes of typeface and presentation which are typical of texts of this period. This seems at first glance somewhat contrary to the basic "interpretive" spirit of the TEI encoding scheme, but economic, pragmatic, and even theoretical arguments in favour of this approach prevailed in our discussions. It is the underlying function of the presentational variation which our analysis of these texts hopes to determine: to mark it up before making that analysis would therefore be somewhat premature, even if we had the leisure and ability to reach a firm conclusion in each case.

At first, we had hoped to use the simple TEI Lite DTD. However, it soon became apparent that this contained several elements we did not need, and also lacked some features we did need, a few of which were not even defined in the whole TEI scheme. We therefore applied the extension mechanism defined in the TEI Guidelines to define a small number of additional elements and remove some redundant ones. The TEI Pizza Chef program was used to generate a Lampeter-specific view of the TEI DTD, which we then used for all subsequent validation and processing of the corpus texts.

Building the Corpus

A glance at a few sample pages from the Corpus rapidly reveals how far removed this kind of material is from that which typically makes up the bulk of modern English language corpora. Paragraphs, headings, and lists all appear here, as do highlighted phrases and a wide range of other recognizable typographic features, but with rather different function and significance. The kinds of discourse represented here are also widely different, most notably legal, political and religious polemic, which have a vitality and variety quite different from their equivalents today.

With the rather limited tools at its disposal initially, the project began simply by retyping the texts from xerox copies using a proprietary word processor. As the need for markup became clearer, and the kinds of markup to be used were clarified, a set of macros was defined which inserted SGML tags (for the most part derived from the TEI scheme) at appropriate points in the document.
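The macro approach can be illustrated with a minimal sketch. The project's actual macros and word processor are not described here, so the typing conventions below (underscores for italic passages, carets for small capitals) and the `rend` values are invented for illustration; only the general idea of replacing typed delimiters with tag pairs is taken from the text.

```python
import re

# Hypothetical typing conventions: the transcriber wraps italic passages
# in underscores and small-capital passages in carets.
CONVENTIONS = [
    (re.compile(r"_(.+?)_"), r'<hi rend="it">\1</hi>'),
    (re.compile(r"\^(.+?)\^"), r'<hi rend="sc">\1</hi>'),
]


def insert_tags(line: str) -> str:
    """Replace each typed delimiter pair with the corresponding tag pair."""
    for pattern, replacement in CONVENTIONS:
        line = pattern.sub(replacement, line)
    return line


print(insert_tags("The _Parliament_ of ^England^ assembled."))
```

A substitution of this kind inserts tags quickly but cannot check that they are correctly placed.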

The project at this stage had no Document Type Definition (DTD), however, and consequently no easy way of checking that tags had been correctly inserted. After some iterations, a DTD was agreed in early 1997, and all the texts were automatically converted to use its tags. Around this time, new SGML-aware tools became available to the project: notably James Clark's sgmls parser, and SoftQuad's Author/Editor word processor. During 1997 we were therefore able to re-edit the entire corpus in two passes: in the first checking the texts for syntactic validity against the Lampeter DTD, and in the second checking the semantic validity of the tagging. The first pass was carried out using the emacs editor with SGML extensions developed by Lennart Staflin; the second using Author/Editor. We found the emacs solution more appropriate for texts which were not yet syntactically valid SGML and for which many global changes were needed, and Author/Editor more appropriate for detailed work on a valid document.
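The kind of error that goes undetected without a parser can be shown with a simplified sketch. This is not a substitute for validation against a DTD with sgmls: it only checks that tags are known and properly nested, and it assumes that every element has an explicit end tag, which SGML does not require. The set of allowed element names is a hypothetical subset chosen for the example.

```python
import re

# Matches a start or end tag and captures the slash (if any) and the name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)[^>]*>")

# Hypothetical subset of element names, for illustration only.
ALLOWED = {"div", "head", "p", "hi"}


def check_nesting(text, allowed=ALLOWED):
    """Return a list of problems found in the tag structure of `text`."""
    problems, stack = [], []
    for match in TAG.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in allowed:
            problems.append(f"unknown element <{name}>")
        elif not closing:
            stack.append(name)          # remember the open element
        elif not stack or stack[-1] != name:
            problems.append(f"unexpected </{name}>")
        else:
            stack.pop()                 # matching end tag found
    problems.extend(f"unclosed <{name}>" for name in stack)
    return problems


print(check_nesting('<div><head>A</head><p>Some <hi rend="it">text</hi></p></div>'))
print(check_nesting("<p>Some <hi>text</p>"))
```

The first call reports no problems; the second reports the misplaced end tag and the two elements left open.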

Using the Corpus

With completion of the valid SGML tagging of the corpus, our first goal, a valid TEI-conformant corpus with detailed structural markup, is achieved. The next stage will be to attempt a detailed linguistic analysis of the texts, resulting in an enriched morpho-syntactic markup of each component word. After some initial experiments with the CLAWS system used by the British National Corpus project, we have been advised that the Xerox part of speech tagger is likely to provide better results for material as far removed from modern English as ours, since its training period is considerably shorter. Experiments with this and other systems will be reported on at the conference.

Our next goal will be to make the corpus in its SGML format available as a free-standing textual resource, via the Oxford Text Archive and the International Computer Archive of Modern and Medieval English (ICAME). Use of SGML as the basic encoding format allows us to imagine a full range of access possibilities for the many potential users of the corpus. It can be browsed, searched or consulted over the web, either by means of a down-translation into HTML, or directly using an XML browser. Users can download full texts, or collections of extracts, in SGML/XML, RTF, or HTML formats. We also envisage publication of the whole corpus, together with supporting materials, on a single compact disk bundled with appropriate search software. At the conference we will report on some experiments in using the SARA software, originally developed for the British National Corpus, for this purpose.
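A down-translation into HTML can be sketched as a recursive renaming of elements. The example assumes that the text is available as well-formed XML, and the mappings from corpus elements and `rend` values to HTML elements are hypothetical choices made for this illustration, as is the sample passage; they do not describe the project's actual conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from corpus elements and rend values to HTML.
ELEMENT_MAP = {"div": "div", "head": "h2", "p": "p"}
REND_MAP = {"it": "i", "sc": "span"}


def to_html(element):
    """Recursively copy an element tree, renaming elements for HTML."""
    if element.tag == "hi":
        # Highlighted phrases are mapped according to their rend value.
        tag = REND_MAP.get(element.get("rend"), "em")
    else:
        # Unmapped elements fall back to a neutral span.
        tag = ELEMENT_MAP.get(element.tag, "span")
    html = ET.Element(tag)
    html.text, html.tail = element.text, element.tail
    for child in element:
        html.append(to_html(child))
    return html


# Invented sample passage, not taken from the corpus.
source = '<div><head>To the Reader</head><p>A <hi rend="it">plain</hi> account.</p></div>'
result = ET.tostring(to_html(ET.fromstring(source)), encoding="unicode")
print(result)
```

Because the source markup records presentation rather than interpretation, a mapping of this kind can be changed freely without touching the corpus texts themselves.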