Computers & Texts No. 11
March 1996

A Response to Dr Reimer's review of The English Poetry Full-Text Database

Stephen Pocock
Managing Editor and Head of Data Conversion
Chadwyck-Healey Ltd

Dr Reimer is grateful that Chadwyck-Healey have undertaken such a 'mammoth project' as the English Poetry Full-Text Database, but he finds a number of faults with the database when he tries to use it for his own research, which focuses on fifteenth-century English verse. Dr Reimer criticises the overall editorial strategy, the detailed choice of texts, the level of transcription accuracy, and the software functionality. I would like to reply to Dr Reimer's criticisms because some of them stem from an understandable but, I would argue, unrealistic desire for a database of ideal texts and others arise from apparent misunderstandings about how the software for accessing the database works.

Editorial considerations

The editorial policy pursued by the creators of the database has been defended in print before by members of the project's Editorial Board (see Professor Daniel Karlin on the selection of editions in The English Poetry Full-Text Database Newsletter, issue 1 (December 1991), Professor Derek Brewer on the choice of early texts in Newsletter, issue 2 (May 1992) and Professor Howard Erskine-Hill on the editorial policy in practice in Newsletter, issue 4 (September 1993)). However, it is worthwhile to reiterate some of the points made in these pieces. With regard to overall editorial strategy, essentially what Dr Reimer is asking is that English Poetry should have edited all English verse up to 1900 on consistent principles. This is obviously absurd. For the great majority of texts in English Poetry there was little or no choice about what edition to use. Such texts inevitably vary from period to period and author to author. The one printing of a minor eighteenth-century author's poem, untouched by later editing, may well be inconsistent in editorial accidentals with a similarly treated nineteenth-century poem, and even more so with a sixteenth-century poem. We could never have attempted to remove the editorial inconsistencies that existed between such texts. The elimination of 'non-authorial variables' of which Dr Reimer writes is an impossible ideal; the best that can be achieved is to present a faithful version of the texts that people actually read at the time, and for the great majority of texts in the database, this is what we have done.

Authors whose output has been subject to significant editorial attention present more complex problems, and it may be added that one of the revelations of the database is how many of the major poets in English have only just begun to be adequately edited. When we needed to go to a later edition to complete a poet's accepted canon satisfactorily, we did so. Copyright questions were an irritation, and permission was refused for Chadwyck-Healey to use some texts that it would much have liked to include (for example The Riverside Chaucer (1988) and Karina Williamson's Poetical Works of Christopher Smart (1980-)), but copyright availability was not a primary consideration in deciding the most suitable text to include. Dr Reimer might also reflect that, had we consistently favoured recent editions, he might have been breaking the law by downloading the 790kb poem (an entire large volume) that he complains took him over half an hour to generate (how long did it take the original scribe to copy or editor to edit?). He might have considered a modern edition a 'better' text from his perspective, but he would not have been allowed to do any more than look at it.

Dr Reimer notes that he considers his review 'idiosyncratic and narrowly-focused' but states that he believes his criticisms are general. However, fourteenth- and fifteenth-century texts do offer special difficulties distinct from the bibliographic problems that surround texts later than 1500. These difficulties are particularly apparent when considering the question of which edition provides the best core text. In the case of Chaucer earlier texts are obviously inferior. We have used Skeat's edition faute de mieux. So too with Langland. But any edition, however good, will embody controversy that is, as Dr Reimer will know, too extensive and too specialised to be more than mentioned here. For many fifteenth-century texts we have often - but not always - been able to use Early English Text Society editions. But as specialists will know, these themselves are very variable and some are over 100 years old. Some texts have had to be taken from late-nineteenth-century learned journals. It is greatly to be regretted that three items by or attributed at some stage to Lydgate (The Court of Sapience, Lives of Ss. Edmund and Fremund and Cartae versificatae) were missed. It is doubly unfortunate that we were not made aware of this sooner, since Chadwyck-Healey has just released a revised version of the database that made good the omissions of which it was aware and enhanced the database software at no charge to its customers. We regularly urge all users of the database to notify us of errors and omissions.

With regard to other allegedly missing items, Dr Reimer is mistaken. Our coverage of anthologies or collections of verse is, of necessity, selective and we have never claimed otherwise. However, the Bannatyne MS carols are not all that have been included from that source. Bellenden's Benner of Pietie also appears and there is some material in the Bannatyne that is included in the database from alternative MS sources, for example the poems of Alexander Scott. The Vernon MS material is also not restricted to carols. The text of the Early English Text Society edition of the minor poems edited by Horstmann (1892) does appear in the database. The Early English Text Society edition of Joseph of Arimathie edited by Skeat (1871) from the Vernon MS is also in the database. Dr Reimer appears to have made a mistake in his reference to a non-existent Carleton Brown volume, Religious Lyrics of the XIV and XV Centuries. We have included the two volumes published by Oxford, Religious Lyrics of the XIVth Century (1924) and Religious Lyrics of the XVth Century (1939). We have also included Robbins' Historical Poems of the XIVth and XVth Centuries. Perhaps the missing item that Dr Reimer refers to is Secular Lyrics of the XIVth and XVth Centuries edited by R H Robbins? This certainly is not included in the database. It is possible that its omission reflects the fact that the texts it contains are included from other sources; I am not at present sure and must retrace some editorial steps to try to clarify. The Isle of Ladies, despite appearing in the Index of Middle English Verse, does not appear to be mentioned in NCBEL, which is why it does not appear in the database.

The misattributed Chaucer items are all flagged in the database as of uncertain attribution. The Editorial Board early on took the view that it would be impractical and probably undesirable to review all attributions in the database in the light of current scholarship. To change attributions would have violated our intention of being faithful to our source texts, led to an unquantifiable amount of bibliographical work, and given rise to an enormous number of potentially contentious questions. Bibliographical work of this scope and detail was outside our remit and, we felt, best left to scholars specialising in the field, such as Dr Reimer.

The criticism of the selection of Stephen Hawes texts also seems misdirected. We have not been wilfully negligent, as Dr Reimer suggests, in mixing old and new editions. The database includes four books from the early sixteenth century, that is, texts exactly contemporary with their author, to provide the core of the canon. For The Pastime of Pleasure we could not use an early printing, so we took instead an early twentieth-century EETS edition which is itself a 'literal reprint of the earliest complete copy (1517)'. Nothing has been taken from the EETS's Minor Poems of Stephen Hawes, an edition that was not published until 1974 and is not referred to in NCBEL. However, our researchers did check the recent EETS edition to confirm that it added nothing to Hawes' canon as represented by the texts we had keyed.

As to the other fundamental criticism, our use of the New Cambridge Bibliography of English Literature (NCBEL), this is really a simple issue of what was practicable and of not allowing the non-existent best to be the enemy of the good - a principle which also governed our choice of the most suitable text. We needed a comprehensive bibliography. There is only one in existence, NCBEL. There are of course hundreds of more limited bibliographies of very variable quality (even the admirable and well-known Index of Middle English Verse that Dr Reimer recommends can hardly be counted as error free), but if the project were to be completed in reasonable time at a price that was at least affordable by libraries - for the investment of both time and money in the database, unsupported by any institution, has been very considerable - then there had to be concentration on this one modern and, in principle, comprehensive guide. NCBEL's faults are familiar to practising scholars but it has the great strength of being a resource known to all in the field and used by all in the field, and thus a practical and clearly-defined basis for this undertaking. It might be argued that we should have waited for the proposed revised NCBEL before commencing work, but at that time the revised NCBEL was a long way off and today looks unlikely to be completed for several years. And after the New NCBEL do we await the New New NCBEL?

Accuracy

Turning to Dr Reimer's criticisms of our accuracy, I am frankly puzzled by his unsubstantiated attribution of multiple errors of transcription. It would be a miracle if there were none, but has Dr Reimer really compared the texts of several hundreds of fourteenth- and fifteenth-century poems in the database with the sources the database shows that it used? He gives no examples, so it is impossible for us to check. Everybody who deals with these texts knows that a fifteenth-century scribe rarely copies the same text twice in exactly the same spelling. (Chaucer's texts in University of Cambridge Gg IV. 27 and Oxford Bodleian Rawlinson Poet. 163 offer some very clear examples. See Derek Brewer 'The Grain of the Text' in Acts of Interpretation, ed. Mary J Carruthers and Elizabeth D Kirk, Norman, Oklahoma, 1982, 199-28; also The English Poetry Full-Text Database Newsletter, issue 2 (May 1992).) Is it possible that Dr Reimer has consulted different sources from those used by the database?

Dr Reimer seems to suggest that we used sixteenth-century texts as a cheap or easy option. But where sixteenth-century texts (oddly called incunabula by Dr Reimer) have been used it has been following the principle of presenting texts contemporary with their authors or simply due to the unavailability of an alternative. Using black letter texts is difficult, demanding and expensive. We did not do it more than we felt was necessary. For the record, Dr Reimer is quite wrong in his summary of our method of production: our keyers are all English speakers; the 99.95% accuracy level is the minimum that the keyers guarantee to us and all texts are sampled to ensure that they better this minimum; all texts that fall below around 99.97-99.98% accuracy are proofread; all black letter texts in the database have been 100% proofed.
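To give a sense of scale, the arithmetic behind such a threshold can be sketched as follows. The sketch assumes that accuracy is counted per keyed character, which the paragraph above does not state; the function name and the 10,000-character sample size are purely illustrative.

```python
from decimal import Decimal


def max_errors(characters: int, accuracy_percent: str) -> Decimal:
    """Upper bound on mis-keyed characters permitted at a given accuracy level.

    Assumption: accuracy is measured per character. Decimal is used so that
    figures such as 99.95 are handled exactly rather than as binary floats.
    """
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return error_rate * characters


# Guaranteed minimum of 99.95% on a hypothetical 10,000-character sample.
print(max_errors(10_000, "99.95"))
# The same sample at the stricter 99.98% level.
print(max_errors(10_000, "99.98"))
```

On this assumption the guaranteed minimum permits at most five wrong characters in every 10,000 keyed, and the stricter level at most two.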

Sorting results

On functionality issues, Dr Reimer appears to be confused about what English Poetry can and cannot do and about the way in which it does those things that it can do.

Ignoring punctuation in title/first line browse lists might produce 'intuitively' acceptable results in some cases but not in all. The question of phrase sorting is a very difficult one, particularly if you are dealing with 180,000 phrases. Our routines for sorting have been carefully thought out and are sophisticated. Unfortunately, intuition is not a rule that a machine can follow. In any case, what is the intuitive order for the following titles taken from English Poetry?

Dream
The DREAM Dream.
DREAM!
A Dream. (RECORDED AS FAITHFULLY AS POSSIBLE)
THE DREAM. A FRAGMENT.
The DREAM. To Sir Charles Dumcomb from the Country.
THE DREAM. XLIII
A DREAM. AFTER READING DANTE'S EPISODE OF PAULO AND FRANCESCA.
The Dream: Imitated from Propertius
DREAM AND DAWN
DREAM BABIES
DREAM-COME-TRUE
"THE DREAM DIVINE"
THE DREAM IS - WHICH?
THE DREAM IS BROKEN.

It would be nice to have the luxury of reviewing the sort order and manually moving things that an intelligent eye felt were not placed as 'intuitively' as they could be, but then it would be impossible to have the feature Dr Reimer praises that enables you to 'zoom' to a particular alphabetical section of the browse window; this relies on logic rather than intuition.
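The dependence of 'zoom' on a fixed rule can be illustrated with a minimal sketch. It is not the sorting routine used in English Poetry: the rule shown (fold case, treat punctuation as a space, drop a leading article) and the names sort_key, build_index and zoom are assumptions made for the example. The titles are a few of those listed above.

```python
import bisect
import re

TITLES = [
    "Dream",
    "DREAM!",
    "THE DREAM. A FRAGMENT.",
    "DREAM AND DAWN",
    "DREAM-COME-TRUE",
    "THE DREAM IS BROKEN.",
]

ARTICLES = ("a ", "an ", "the ")


def sort_key(title: str) -> str:
    """Hypothetical rule: fold case, treat punctuation as a space, drop a leading article."""
    folded = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    for article in ARTICLES:
        if folded.startswith(article):
            return folded[len(article):]
    return folded


def build_index(titles):
    """Sort titles by key; equal keys are broken by the raw title, an arbitrary but fixed choice."""
    ordered = sorted(titles, key=lambda t: (sort_key(t), t))
    return [sort_key(t) for t in ordered], ordered


def zoom(keys, prefix: str) -> int:
    """Binary search for the first entry at or after the typed prefix."""
    return bisect.bisect_left(keys, sort_key(prefix))


keys, ordered = build_index(TITLES)
for title in ordered:
    print(title)
print("zoom target:", ordered[zoom(keys, "dream c")])
```

The binary search in zoom is only valid because every entry sits exactly where the rule puts it; moving one title by hand would break it. Note also that 'Dream' and 'DREAM!' reduce to the same key, so some further rule, here an arbitrary one, is still needed to order them.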

Subsequently he complains that a similar 'zoom' feature is not available to move to particular texts in the full-text display. I can only conclude that he has not made use of the Table of Contents, which lists authors alphabetically disc by disc and works alphabetically under authors. The Table of Contents allows instant access to the full-text at the author level or the document level or the poem level, or the canto level ... indeed at any named subdivision of the text there is an immediate link into the full-text, and all subdivisions are nested hierarchically, making the Table of Contents for a whole disc quick and easy to navigate. Using the Table of Contents Dr Reimer could have got to any one of over fifty named subsections in the last book of Lydgate's 'Fall of Princes' in seconds, including the final 'Lenvoye'. If all he had wanted to do was get to the end, he could simply have clicked the next poem button in the tool-bar and scrolled up a few lines. He is also mistaken in claiming the database is in no particular order. The texts on each disc are in the same order as the Table of Contents, that is alphabetical by author/title. Does anyone have a better order?

Special characters and searching the database

With regard to special character sets, I know there are many scholars who would strongly disagree with Dr Reimer's recommendation that we should not retain 'macrons' on letters (by which I take it he means the bars or tilde-like characters commonly used to indicate omission of a following nasal). Indeed we went to the trouble and expense of commissioning a special character set for English Poetry precisely because those scholars whom we consulted considered it essential to retain them. These characters and all other special characters and accent characters do save and print correctly (although it is possible that they did not on the particular software release Dr Reimer was testing). The special Chadwyck-Healey font is available while English Poetry is running for printing and viewing texts. This font differs from the standard Windows fonts only in having these few extra characters, which in turn are found only in a few early texts. There is really no question of there being limited portability of downloaded files.

The complaint about the difficulty of finding texts because of variant spellings is fair enough. For texts of this period it is difficult; that is part of the problem of studying texts of this period. The keyword browse index, which lists all words in the database, provides a good guide to the variants that might exist, and, as Dr Reimer admits, judicious use of the wild cards and Boolean operators makes the job easier. I do not honestly think that any 'fuzzy matching' algorithm that could be used across the whole range of material in English Poetry would help very much. It is hard to imagine coming up with a search algorithm that could do better than an expert in the orthography of a particular period, such as Dr Reimer, could manage. It would certainly slow things down enormously and would inevitably retrieve many more items than the user would require while perhaps missing some classes that it should retrieve. 'Why not have a fuzzy logic feature that does what I need?' is easy to ask for but, like intuitive browse lists, not easy to provide.
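How a wild card and a Boolean OR work against a keyword index can be sketched in a few lines. This is an illustration only: the index contents and poem numbers are invented, the shell-style wild cards (* and ?) are an assumption rather than the syntax of English Poetry, and the names expand and search_any are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical slice of a keyword index: word -> numbers of the poems containing it.
KEYWORD_INDEX = {
    "knight": {1, 4},
    "knyght": {2},
    "knyghte": {2, 3},
    "knights": {4},
    "night": {5},
    "nyght": {3, 5},
}


def expand(pattern: str) -> list[str]:
    """List the index words that match a shell-style wild card pattern."""
    return sorted(w for w in KEYWORD_INDEX if fnmatchcase(w, pattern))


def search_any(*patterns: str) -> set[int]:
    """Boolean OR: poems containing a word that matches any of the patterns."""
    hits: set[int] = set()
    for pattern in patterns:
        for word in expand(pattern):
            hits |= KEYWORD_INDEX[word]
    return hits


print(expand("kn?ght*"))      # the variants the user can inspect before searching
print(search_any("kn?ght*"))  # poems containing any of those variants
```

The point of the sketch is that the user, not the machine, decides which pattern describes the variants of interest: 'kn?ght*' deliberately excludes 'night' and 'nyght', a judgement a general fuzzy matcher could not be relied on to make.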

The impression that all that English Poetry is capable of doing is searching for a string of characters across a mass of texts is a major misrepresentation that needs correcting. The coding in English Poetry goes down to the level of individual lines of verse, epigraphs, side notes, typestyle, indentation, rhyme, etc., enabling the intellectual and physical structure of the documents to be captured and hence searched. Searches can be made for data elements, such as arguments, as well as words. Searches can be restricted to particular data elements or to particular periods or to particular authors or to particular works, or to complex combinations of any of these. This is a range of possibilities that amounts to far more than just 'searching the entire textbase for a string of characters'.
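The difference between a string search and a search over coded elements can be shown with a small sketch. The record layout, the sample records and the search function below are all hypothetical and make no claim about how English Poetry stores or queries its data; they simply show restrictions by element, author and period being combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    author: str
    work: str
    year: int
    kind: str  # e.g. "line", "epigraph", "argument"
    text: str


# Invented sample records, for illustration only.
ELEMENTS = [
    Element("Author A", "Work One", 1450, "line", "the knyght rode forth"),
    Element("Author A", "Work One", 1450, "argument", "of a knyght and his quest"),
    Element("Author B", "Work Two", 1790, "line", "a dream of night"),
    Element("Author B", "Work Two", 1790, "epigraph", "forth from the dream"),
]


def search(elements, word=None, kind=None, author=None, period=None):
    """Return elements meeting every restriction supplied; None means no restriction."""
    results = []
    for el in elements:
        if word is not None and word.lower() not in el.text.lower().split():
            continue
        if kind is not None and el.kind != kind:
            continue
        if author is not None and el.author != author:
            continue
        if period is not None and not (period[0] <= el.year <= period[1]):
            continue
        results.append(el)
    return results


# A word anywhere, then the same word restricted to arguments only.
print(len(search(ELEMENTS, word="knyght")))
print(search(ELEMENTS, word="knyght", kind="argument"))
# An element type restricted to a period, with no word specified at all.
print(search(ELEMENTS, kind="epigraph", period=(1700, 1800)))
```

The last query has no search string at all, which is the sense in which data elements, as well as words, can be searched for.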

Conclusion

Speaking for the Chadwyck-Healey editorial team, I welcome Dr Reimer's detailed look at what he admits is a limited subset of the database. We are glad that he considers English Poetry to be potentially 'a profoundly useful tool'. We are sorry that he has not found it as useful as he could have done and hope that he may be able to make better use of it in the future. With regard to the reported shortcomings in the data, no-one condones error and we welcome correction. We are grateful to have omissions pointed out but the difficulty of achieving complete accuracy in every particular cannot be overrated. Dr Reimer himself confuses Thomas Clanvowe with John Clanvowe while referring to Scattergood who made the distinction - which is still not absolutely certain. There are and always will be mistakes and omissions in the English Poetry Database. In any database of this scale that is inevitable. Paradoxically, the effectiveness of the software in accessing the data and displaying the results makes such mistakes quite easy to spot once the database exists. While acknowledging that Dr Reimer has undoubtedly spotted some genuine errors and omissions, I cannot help thinking that overall his view of the data content and quality of English Poetry has been coloured by the difficulties he appears to have experienced in making effective use of the extensive functionality that the software provides.



Computers & Texts 11 (1996), 16. Not to be republished in any form without the author's permission.

HTML Author: Michael Fraser (mike.fraser@oucs.ox.ac.uk)
Document Created: 25 April 1996
Document Modified: 27 April 1996

The URL of this document is http://info.ox.ac.uk/ctitext/publish/comtxt/ct11/pocock.html