Computers & Texts No. 11

March 1996

The Chadwyck-Healey English Poetry Full-Text Database

Stephen Reimer
Department of English
University of Alberta

This review of Chadwyck-Healey's English Poetry Full-Text Database is based in part upon a portion of a paper, 'A Textbase of Fifteenth-Century English Poetry and the Index of Middle English Verse', presented at the International Medieval Congress in Leeds, July 1995. It is presented in this issue together with a reply from the managing editor at Chadwyck-Healey Ltd.

I am currently working on a reassessment of the canon of John Lydgate, the disciple of Geoffrey Chaucer. Towards that end, and with permission from the Early English Text Society to use their editions, I am creating a textbase of fifteenth-century English verse: this is incomplete but will eventually include every work ever attributed to Lydgate, and, for purposes of comparison, the works of his contemporaries, immediate predecessors and successors. The Chadwyck-Healey English Poetry Full-Text Database has appeared, then, at a very opportune time for me: as a complete transcription of English verse from Beowulf through the nineteenth century, it promised to be all that I needed and more. What follows, then, is a very idiosyncratic and narrowly focused 'review' of the Chadwyck-Healey product, primarily to explain why their collection of electronic texts will not serve my purposes. My observations, however, will also have a more general applicability, since many of the limitations which I have encountered are the result of methodological choices affecting the whole collection.

Let me first, however, assert my gratitude that Chadwyck-Healey undertook such a mammoth project. Despite the problems which I have encountered, and the inadequacy of the product in its current state for my research, this is a good thing that Chadwyck-Healey has done. If they can fix some of the limitations in the product, it will be a profoundly useful tool for all those involved in the computer analysis of English poetry.

Transcription

Turning to the texts of the Database, I encountered many errors of transcription in my examination of several hundred poetic texts of the fourteenth and fifteenth centuries. There were especially great numbers of errors in the copying of Gothic or Blackletter fonts. One might not expect such fonts to have been encountered frequently in the process of transcribing the texts, except that the collection relies rather heavily upon incunabular editions. Two of the procedural choices made by the producers of the Database come into play here: on the one hand, there is the choice to try to avoid the costs of proofreading the Database by employing a sort of 'double blind' method of input whereby two separate transcriptions of the original were keyed into computers by non-English-speaking typists and then the two transcriptions compared by computer. The assumption was made that both typists were likely to make transcription errors but rarely the same transcription errors, and so wherever the two texts agreed with each other it was assumed that the text was correctly transcribed. According to the Chadwyck-Healey promotional literature, they thus achieved 'better than 99.95% accuracy' even on 'sixteenth century gothic broadsides' (The English Poetry Full-Text Database Newsletter no. 3 (Dec. 1992): 3). I will not dispute percentages, though my experience would belie this claim. More to the point, it is a simple fact that a floriated 'I' with a stroke through its centre, a common enough character in a Gothic font, is easily mistaken for an 'E'; as a result, there is, in fact, a high probability that more than one non-English-speaking typist would make the same mistake of reading 'E' for such an 'I.' The hypothesis that 'double blind' input obviates proofreading seems quite dubious: at any rate, I offer as a simple observation that the Database needs a thorough proofreading.
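The weakness of the comparison step can be illustrated with a minimal sketch in Python. It is not Chadwyck-Healey's actual procedure, whose details are not given here; the function name `flag_disagreements` and the sample strings are hypothetical. The sketch only shows that a character-by-character comparison of two keyings flags a reading when the typists differ, and passes silently any misreading which they share.

```python
def flag_disagreements(keying_a: str, keying_b: str) -> list[int]:
    """Return the positions at which two independent keyings differ.

    Simplifying assumption: both keyings have the same length, so no
    alignment step is needed before comparing them.
    """
    return [i for i, (a, b) in enumerate(zip(keying_a, keying_b)) if a != b]


# Hypothetical sample: the printed text begins with a floriated 'I'.
source = "In the begynnyng"
typist_a = "En the begynnyng"  # misreads the floriated 'I' as 'E'
typist_b = "En the begynnyng"  # makes the same misreading
typist_c = "In the begynnyng"  # reads the character correctly

# The shared misreading produces no disagreement, so nothing is flagged.
shared_error = flag_disagreements(typist_a, typist_b)

# Only when the typists differ is position 0 flagged for review.
independent_error = flag_disagreements(typist_a, typist_c)

print(shared_error, independent_error)
```

The method catches errors only in so far as they are independent; a glyph which is systematically confusable produces correlated errors, and those are exactly the ones that agreement between keyings cannot reveal.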

The second 'procedural choice' which contributes especially to this 'Gothic' problem is the choice to use incunabular editions, and this is part of a larger procedural question about the selection of editions to be transcribed. Obviously in a project of this magnitude there were a variety of constraints upon selection, not least questions of the availability of printed texts and the negotiating of copyright permissions, and the editors made (as they freely admit in their promotional material) some hard choices so as to make the Database as complete as possible. Again, for this they deserve praise, but at the same time the quality of the product is fundamentally compromised by its heavy reliance upon antique printings instead of modern editions.

To consider a single example from my period: all seven of the poems attributed to Stephen Hawes are included in the Database, but six of these seven are transcribed from sixteenth-century prints rather than from the EETS's Minor Poems of Stephen Hawes. Thus the corpus of Hawes's poetry is offered in one modern edition and six antique printings, and the inconsistencies between modern and sixteenth-century editorial methods will complicate any computer-aided study of Hawes. Indeed, such inconsistencies seem to undermine the very purpose for creating such a collection of electronic texts: in order to study the works of the author (as opposed to the consequences of how well that author has been edited), one wants a collection of texts which have been produced using consistent editorial procedures. One wants all the non-authorial 'variables' to be, at least, 'controlled' if not eliminated. The inconsistencies in editorial method in the various texts included in this Database will make this collection more or less useless for comparative studies.

Editions

In terms of the coverage of the collection, the producers claim to have included (with some general, and clearly indicated, exceptions - such as verse dramas, or poems which have never been printed) all of the poems listed in The New Cambridge Bibliography of English Literature (NCBEL) down to the year 1900. On this score I found two large problems: one is that, while the NCBEL may, indeed, be the best single available list of all English poems, it is decidedly dated and in need of substantial revision. Further, I question the wisdom of insisting on having a single 'master' list at all: why not use the best available bibliographies (plural)? Again, taking an example from my own field, surely the Index of Middle English Verse (IMEV) with its 1967 Supplement, together with the (to date) nine volumes of the revised Manual of the Writings in Middle English constitute a much more authoritative listing of Middle English poetic texts than does the NCBEL.

Secondly, even allowing that the NCBEL was the best possible choice, I would dispute their claim that their coverage is complete. In fact, quite a large number of items relevant to my research were not to be found. The absence of The Court of Sapience (IMEV 3406 and 168) is a huge gap in the fifteenth-century section of the Database, and the poem is available in both a good recent edition by Ruth Harvey and in a poorer but now public-domain nineteenth-century edition by Spindler, so one is left wondering how it could have been missed. Lydgate's Lives of Ss. Edmund and Fremund (IMEV 3440) is also not included, despite the fact that it appeared in a now public-domain nineteenth-century anthology of verse legends edited by Carl Horstmann, an anthology which was used for a number of other texts in the Database: again, somehow Edmund was overlooked. Lydgate's Cartae versificatae (IMEV 1513), printed in Thomas Arnold's Memorials of St. Edmund's Abbey, is not to be found. The Isle of Ladies (IMEV 3947), an important and lovely piece of Chaucerian apocrypha, available in at least three recent editions, is missing. The Bannatyne Manuscript has twice been fully transcribed, and it constitutes an important anthology (running to four printed volumes) of late medieval and early renaissance verse; why did the editors include the carols but none of the other texts from the Bannatyne Manuscript? Furnivall's EETS edition of poems from the Vernon MS is not included; several other EETS volumes of 'miscellaneous' Middle English verse are not included. The Database includes the complete set of texts from four of the five volumes in an Oxford University / Columbia University series of Middle English verse anthologies edited by Carleton Brown and Rossell Hope Robbins: why is the fifth (Brown's Religious Lyrics of the XIV and XV Centuries) excluded?

Besides such failures of inclusion, and this is just a portion of the gaps which I discovered, there are also some very significant errors of attribution which will easily mislead the unwary. For instance, included among the works of Geoffrey Chaucer, according to Chadwyck-Healey, one finds 'O Mossie Quince' (IMEV 2524), 'Complaint to my Mortal Foe' (IMEV 231), 'Complaint to my Lode-Star' (IMEV 2626), and 'The Describing of a Fair Lady' (IMEV 1300): Chaucer scholars around the world will be surprised and Chaucer students around the world will be confused. Despite all of V. J. Scattergood's excellent work establishing the canon of Thomas Clanvowe, 'The Cuckoo and the Nightingale' (IMEV 3361, also known as 'The Book of Cupid') appears here under 'Anon.' Again, this is only a sample of the errors of attribution which I found among the texts in my research area, but it leaves me concerned that this 21st-century tool is mired in 19th-century scholarship.

Searching the database

Turning from the collection of texts, the Database also includes software to help in locating a particular poem or passage: one can search by author's name, poem title, poem first line, or by any string of characters which might appear anywhere in the text; one can also limit the range of one's search by specifying a historical period or by limiting the search to one of the five CD-ROMs. One can also browse through alphabetically-ordered lists of authors, titles, and first lines; while searching within these lists one can specify the first several characters of one's search term in order to 'zoom' to the relevant place in the alphabetical list. I would recommend that these lists be re-indexed with punctuation ignored: a first line beginning 'Lord, in charity' now, because of the comma, precedes a first line beginning 'Lord above,' which seems to me to be counter-intuitive and can cause a user to fail to find something which is, in fact, there.
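The re-indexing suggested above amounts to sorting on a key from which punctuation has been removed. The following Python sketch is one possible form of such a key; the name `index_key` is hypothetical, and the collation rules which the Database actually applies are not documented here, so no attempt is made to reproduce its present ordering.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def index_key(first_line: str) -> str:
    """Build a sort key that ignores punctuation, case, and extra spaces."""
    cleaned = first_line.translate(_STRIP_PUNCTUATION).lower()
    return " ".join(cleaned.split())


first_lines = ["Lord, in charity", "Lord above"]

# With punctuation ignored, 'above' sorts before 'in', as a user would expect.
reindexed = sorted(first_lines, key=index_key)
print(reindexed)
```

A useful side effect is that a line keyed with or without its comma lands at the same place in the list, so a user need not guess how the source anthology punctuated it.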

The ability to search the entire textbase for a string of characters is a useful feature if one is attempting to find the source of a quotation or, as fifteenth-century scholars are wont to do, attempting to find the origin of some stanza which was copied onto the flyleaf of a manuscript. But given the orthographic variability of all pre-nineteenth-century texts, some sophisticated 'fuzzy' algorithms would be useful. The search engine does include some wild-card capabilities, so that one can search for a string like 'l(ou)v(e)' and find 'lov,' 'luv,' 'love,' and so on, but this is as close as the program gets to any sort of 'fuzzy' matching. I failed initially to find a number of texts which were, in fact, included; I was making my best guess at the spelling (and using the 'wildcards' to cover the alternatives that occurred to me), but my guess did not happen to match the spelling of the phrase as found in the particular anthology from which the poem was transcribed.
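One simple kind of 'fuzzy' matching folds common spelling variants into a single key before comparison, so that the user need not anticipate each variant with wildcards. The Python sketch below is an illustration under stated assumptions: the folding rules (u/v, i/j/y, doubled letters, final -e) are deliberately crude and are not offered as an adequate model of Middle English orthography, and the names `spelling_key` and `find_variants` and the small vocabulary are hypothetical.

```python
import re


def spelling_key(word: str) -> str:
    """Fold a word into a rough key that ignores some common variation."""
    key = word.lower()
    # Treat u/v and i/j/y as interchangeable (illustrative rules only).
    key = key.replace("v", "u").replace("j", "i").replace("y", "i")
    # Collapse doubled letters, then drop a final -e.
    key = re.sub(r"(.)\1+", r"\1", key)
    key = re.sub(r"e$", "", key)
    return key


def find_variants(query: str, vocabulary: list[str]) -> list[str]:
    """Return every vocabulary word whose key matches the query's key."""
    target = spelling_key(query)
    return [word for word in vocabulary if spelling_key(word) == target]


# Hypothetical word list standing in for the Database's index of forms.
vocabulary = ["loue", "love", "lou", "luf", "lord", "knyght", "knight", "knighte"]

print(find_variants("love", vocabulary))
print(find_variants("knight", vocabulary))
```

A key of this kind trades precision for recall: it will occasionally conflate distinct words, which is usually an acceptable price when the aim is to locate a passage rather than to count forms.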

There is a bibliography, and it comes as a printed book as well as a file on the CD-ROMs, but one cannot 'zoom' to a section of the bibliography in the same way that one can zoom to a particular section in the browse windows, nor can one use the bibliography to search for a text by its printed source. It might be useful, especially when one cannot remember an author or title but remembers that it was a poem in such and such an anthology, to be able to search for a poem by its printed source. More generally, there is much more bibliographical information which could have been encoded with the texts and made available for searches, and this would have profoundly increased the usefulness of the product for research: for instance, in Middle English studies the IMEV reference numbers are especially important as a standard means of identifying texts.
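What searching by printed source or by IMEV number would require is, at bottom, a pair of additional indexes over the bibliographical records. The Python sketch below is a hypothetical illustration: the record layout `PoemRecord` and the short source labels are assumptions, and the three sample entries simply reuse titles and IMEV numbers mentioned earlier in this review.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PoemRecord:
    title: str
    imev: int    # Index of Middle English Verse reference number
    source: str  # short label for the printed source (hypothetical format)


records = [
    PoemRecord("Lives of Ss. Edmund and Fremund", 3440,
               "Horstmann, anthology of verse legends"),
    PoemRecord("Cartae versificatae", 1513,
               "Arnold, Memorials of St. Edmund's Abbey"),
    PoemRecord("The Isle of Ladies", 3947, "edition not specified"),
]

# Index 1: look a text up directly by its IMEV number.
by_imev = {record.imev: record for record in records}

# Index 2: list every text transcribed from a given printed source.
by_source = defaultdict(list)
for record in records:
    by_source[record.source].append(record)

print(by_imev[3440].title)
print([r.title for r in by_source["Arnold, Memorials of St. Edmund's Abbey"]])
```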

Displaying texts

When one has found the text for which one is searching, the poem is displayed in a window. In this 'full text' window, there is a scroll bar on the right side to move forward and backwards; however, the range of the scroll bar is not limited to the text which has been found, but will move you through the entire contents of the CD-ROM. Moving the scroll tab to the top of the scroll bar takes you to the beginning of the contents of the disk - and note that the contents of the disk are in no particular order, so this can never be particularly useful - and the bottom of the scroll bar represents the end of the disk. Further, there is no way, because of the way that the scroll bar has been designed, to get directly to the last line of the poem which one has called up: the bottom of the scroll bar does not represent the end of the poem, but the end of the CD-ROM. Thus, in order to move to the end of the poem one must go through the text one page at a time with the 'PgDn' key. For a short lyric, this is no great trouble, but if you want to see a passage which comes at or towards the end of Lydgate's Fall of Princes (nine books, constituting three large volumes in the EETS edition), having to page down from the beginning of the poem becomes a Herculean task. Computers should make it easier, not harder, to 'access' texts in 'non-linear' ways. In particular, the scroll bar should be limited to the present text.
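Limiting the scroll bar to the present text is a matter of mapping the position of the scroll tab onto the range of lines occupied by the poem rather than onto the whole disk. A minimal Python sketch, assuming that the program knows the first and last line numbers of the poem on the disk (the function name and the line numbers are hypothetical):

```python
def scroll_target(fraction: float, poem_start: int, poem_end: int) -> int:
    """Map a scroll-tab position (0.0 = top, 1.0 = bottom) to a line
    number lying within the current poem only."""
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range positions
    return poem_start + round(fraction * (poem_end - poem_start))


# Hypothetical layout: the poem occupies lines 5000 to 41000 of the disk.
first_line = scroll_target(0.0, 5000, 41000)
midpoint = scroll_target(0.5, 5000, 41000)
last_line = scroll_target(1.0, 5000, 41000)

print(first_line, midpoint, last_line)
```

With such a mapping, dragging the tab to the bottom of the bar reaches the last line of the poem in one movement, however long the poem may be.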

Having found the text for which one is searching and viewed it on the screen, one can copy the entire text or some portion of it onto one's own diskette by means of the program's 'save' command. One can choose either to leave in or to strip out the SGML/TEI tags in the original file; either way, the copy process is slow, but stripping out the SGML tags is especially time-consuming. On a machine which normally copies a large file in a matter of seconds, The Pilgrimage of the Life of Man, just under 790K, took 35 minutes. Further, while the copying process is underway, one must sit staring at the Windows hour-glass rather than being allowed to continue with other searches: could the file not be dumped quickly to some spooler so that the user could get back to work while waiting for the 'save' operation to be completed? Or could the 'save' algorithm be rewritten and optimized for speed? At the very least, have the program offer an honest estimate of the time required to perform a 'save' so that one can go off and do something else while the computer takes its time.
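The 'spooler' suggested above corresponds to handing the save operation to a background worker so that the user interface remains free. The Python sketch below shows the general idea only; it makes no claim about how the Database software is built, and the function name `save_text`, the simple tag pattern, and the two sample lines with their line tags are illustrative assumptions.

```python
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Naive pattern for a markup tag; real SGML would need a proper parser.
TAG = re.compile(r"<[^>]*>")


def save_text(text: str, destination: Path, strip_tags: bool) -> int:
    """Write the text to disk, optionally without tags; return its length."""
    if strip_tags:
        text = TAG.sub("", text)
    destination.write_text(text, encoding="utf-8")
    return len(text)


sample = "<l>Whan that Aprill</l>\n<l>with his shoures soote</l>\n"
target = Path(tempfile.mkdtemp()) / "poem.txt"

with ThreadPoolExecutor(max_workers=1) as spooler:
    # The save runs in the background; this call returns immediately.
    pending = spooler.submit(save_text, sample, target, True)
    # ... the user could carry on with other searches here ...
    chars_written = pending.result()  # collect the result when convenient

print(chars_written, target)
```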

Character entities

There is also a problem in the handling of special characters when one downloads files from the Database CD-ROMs for use in other applications. If one copies a file with the SGML tags left in, the special characters will be represented by SGML-style character names, like '&euml;' for lower case 'e' with an umlaut. However, when one requests the SGML tags to be stripped out, these special characters are translated into single, high-bit characters: for instance, the SGML-type character '&THORN;' is changed to ASCII character 222, which is the upper case thorn in most Windows fonts. Generally that means that the file can be used in, for instance, Windows 'Write.' But there are two problems: first, the Poetry Database does not actually use 'normal' Windows fonts, but uses two separate customized fonts which alter significantly the usual Windows character set; thus a poem downloaded from the CD-ROMs will not display correctly unless one possesses and loads the Chadwyck-Healey custom fonts, and this limits the portability of these files (Chadwyck-Healey does give permission to download the fonts, but with the proviso that they are only to be used for the display of Chadwyck-Healey poetry files). The use of a proprietary character set - indeed, the preservation of macrons and other diacritical marks in English language texts - will complicate any attempt to use these texts for textual analysis. Secondly, and even more seriously, there seems to be a bug in the algorithm which translates SGML character names into high-bit characters with the result that all lower case special characters are mapped to their upper case equivalents: both the SGML-type character '&THORN;' and '&thorn;' are changed to ASCII character 222, and every thorn in the file thus becomes an upper case thorn, every yogh an upper case yogh, and every vowel with a diacritical mark an upper case vowel with a diacritical mark. Thus, even with the Chadwyck-Healey fonts in a Windows word-processor, one finds words throughout the file like 'cryÉd' for 'cryéd.' This is a nuisance to fix in a file of any but the very shortest length.
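The translation of character names into single characters depends on the names being treated as case-sensitive, since the upper and lower case forms are distinguished only by the case of the name. The Python sketch below shows a case-preserving translation. It uses Python's standard table of HTML entity names as a stand-in for whatever entity sets the Database employs (an assumption; that table has no entry for yogh, for example), and the function name `expand_entities` is hypothetical. That a case-insensitive lookup lies behind the behaviour described above is a conjecture, not something established here.

```python
import re
from html.entities import name2codepoint

# Matches a character name such as &thorn; or &Eacute;.
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text: str) -> str:
    """Replace each known character name with its single character,
    keeping upper and lower case names distinct."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        codepoint = name2codepoint.get(name)  # lookup is case-sensitive
        if codepoint is None:
            return match.group(0)  # leave unknown names untouched
        return chr(codepoint)

    return ENTITY.sub(replace, text)


upper_thorn = expand_entities("&THORN;")
lower_thorn = expand_entities("&thorn;")
word = expand_entities("cry&eacute;d")

print(ord(upper_thorn), ord(lower_thorn), word)
```

Because the lookup never alters the case of the name, '&thorn;' and '&THORN;' arrive at different characters, and 'cry&eacute;d' keeps its lower case vowel.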

Conclusion

In conclusion, I want again to praise Chadwyck-Healey for what they have achieved and for daring to dream great dreams. The Database as it stands, however, does not serve my purposes well, and does not, with its current faults and limitations, serve any of its users as well as it could.

Computers & Texts 11 (1996), 13. Not to be republished in any form without the author's permission.

HTML Author: Michael Fraser (mike.fraser@oucs.ox.ac.uk)
Document Created: 25 April 1996
Document Modified: 27 April 1996

The URL of this document is http://info.ox.ac.uk/ctitext/publish/comtxt/ct11/reimer.html