The Cataloguing of Western Medieval Manuscripts in the Bodleian Library: a TEI approach with an appendix describing a TEI-conformant manuscript description

Lou Burnard, Richard Gartner, and Peter Kidd

August 1997

There are three main series of published catalogues of the western manuscripts at the Bodleian Library: the so-called Quarto catalogues, published between 1845 and 1900, in quarto format, which cover the major collections acquired (for the most part) in the seventeenth and eighteenth centuries (Coxe 1845-1900); the Summary Catalogue, published between 1895 and 1953, which covers the manuscripts acquired from 1602 to 1915, except those already described in the Quarto catalogues (Madan 1895-1953); and the Summary Catalogue of Post-Medieval Western Manuscripts, published in 1991, which covers most post-medieval manuscripts acquired between 1916 and 1975 (Clapinson and Rogers 1991). In January 1996 the Library began a four-year project, funded by the Higher Education Funding Council for England (HEFCE), under the Non-Formula Funding Specialised Research Collections initiative, the purpose of which is to make available descriptions of the medieval western manuscripts acquired by the Library since 1916, for which no full published catalogue yet exists. For more detailed bibliographical information on the catalogues of western manuscripts at the Bodleian, see

Broadly speaking, this project focusses on two rather different tasks: firstly, the cataloguing of the manuscripts themselves, and secondly, the publishing of the resultant catalogue descriptions in both printed and electronic form. Each of these two main strands of work presents its own challenges, while the second, in particular, involves breaking new ground at the Library.

This paper reports on some of the problems addressed by the project, primarily from the bibliographic point of view, together with the technical approaches we have adopted for their resolution. Our basic approach has been to build on existing work as far as possible, while at the same time seeking to develop a system adequate to the Bodleian's arguably rather specialist needs. In that spirit, we have developed a set of extensions to the Text Encoding Initiative (TEI) proposals for general purpose text encoding (Sperberg-McQueen and Burnard 1994), tailored to the needs of manuscript cataloguers. A detailed appendix to the paper documents this set of extensions, as they are currently formulated.

It should be emphasized that the proposals here presented are very much work in progress, intended to promote discussion and review by other similar projects, and to clarify our own thinking on the issues. They should not be regarded in any sense as a final or fully articulated set of recommendations. The Bodleian authorities have not yet made a final decision about which SGML solution to its cataloguing needs will be adopted; and in addition to the exploration of the TEI, experiments are underway to explore the uses to which the Encoded Archival Description (EAD) DTD can be put.

The cataloguing of manuscripts

The material falling within the orbit of the project is diverse. It includes manuscripts written in most of the major languages and scripts current in Europe in the medieval period (excluding Greek), represented by fragments as well as complete codices, and ranging in date from the ninth to the early sixteenth century. A parallel project to compile and publish descriptions of Greek manuscripts acquired this century is also under way, but has not yet reached the stage of developing automated procedures.

Some of these manuscripts are well-known and extensively published, whilst others have never even been mentioned in print. Most fall somewhere between these two extremes: the majority have been described in one or more unpublished typescript catalogues prepared by successive generations of Bodleian staff, notably the loose-leaf Green Folders of catalogue descriptions, and van Dijk's Handlist of the Latin Liturgical Manuscripts (1957-60), and have thus been made known to readers who visit or otherwise contact the Library. Almost all the illuminated manuscripts were described, albeit very briefly, in the three volumes of Illuminated Manuscripts in the Bodleian Library (Pächt and Alexander 1966-73); while many others have been described in greater or lesser detail elsewhere. However, the only group of post-1916 accessions to have been systematically catalogued to modern standards remains the manuscripts from the collection of J. P. R. Lyell, of which a detailed catalogue was published in 1971 (de la Mare 1971).

One of the first questions to be addressed was the actual cataloguing methods to be employed. Although the 1971 Lyell catalogue was widely considered exemplary, we nevertheless thought it right to re-examine questions of content and format in the light of developments during the last quarter of a century. During this period, cataloguing, of both existing collections and new accessions, has steadily continued, but the pressure of other duties has meant that no catalogue of any group of manuscripts had yet been brought to the point of publication. We thus had the opportunity to re-think, in some cases from first principles, what information we should be aiming to provide in the new catalogue. As with any limited-term project, the crux of the task was to find an acceptable compromise between the ideal and the realistic: the ideal might be an extremely detailed catalogue, embodying new research, and with a large number of reproductions, but achieving such a goal might well be unrealistic, given the limited resources at our disposal. On the other hand, if we set ourselves the much more modest aim of producing only a basic summary description of the manuscripts, prompt completion would be achieved—but the end product would, in all probability, fail to meet most needs of its intended audience. Our aim was to find an acceptable compromise by providing information at different levels of detail.

At the most fundamental level, a catalogue may be little more than an inventory, simply informing potentially interested readers of the existence of the manuscripts in a given collection, and providing a shelfmark or other reference number to allow them to be located. To satisfy this requirement a very brief Checklist description of over 550 manuscripts has been compiled, and will be made available electronically: this will act as a finding aid to give readers a general idea of what material might be of use to their studies, and will be a starting-point from which to pursue their enquiries. The checklist contains information under the following headings: shelfmark; former shelfmark; author(s); title(s); language(s); date(s) of origin; country(/ies) of origin; town(s) of origin; provenance; select bibliography. It is interesting to note that this list of headings, which was substantially defined before the Studley Priory meeting was held, already includes most of the First level categories of information discussed at that meeting. Of the topics not included, some are implicit or easily obtained: for example, the presence or absence of illuminations in a manuscript is implied by the presence or absence of a reference to the catalogue of Illuminated Manuscripts in the Bodleian Library under the Select Bibliography heading, while identifiable scribes or artists are always listed as the first item under the Provenance heading. It is hoped that the Checklist's bibliography will point users to widely-available published sources for most or all of the other categories of information discussed at Studley.

At the opposite extreme, a catalogue may contain so much detail, so clearly expressed, that students are able to glean the information they need about manuscripts from the catalogue, without recourse to the manuscripts themselves. A good catalogue will inform the user which manuscripts do not need to be consulted, as well as which ones might reward further study. This clearly has benefits not only in terms of saving the student time and other resources, but also in terms of the preservation of unique and often fragile materials. In addition to the Checklist, therefore, detailed catalogue descriptions are being prepared, and these will be made available in stages, both in printed and online form. It is our full intention that these descriptions will be linked to digitized images.

A middle way between these extremes is to be tested in the foreseeable future. As stated above, many of the manuscripts covered by the Project have existing unpublished catalogue descriptions in typescript, prepared over the course of the past several decades. While not always meeting today's demanding standards, these descriptions were prepared by the Library's professional staff, and contain a wealth of unpublished information. It is therefore planned that these descriptions will be entered onto the online system, being checked for accuracy in the process, but otherwise with minimal alteration or expansion, to serve as yet another level of finding aid and information provision.

Cataloguing methods and standards

The preparation of a new catalogue involves the resolution of two potentially conflicting forces: provision of information for the ever-developing needs and interests of the scholarly (and, increasingly, the not-so-scholarly) community, as reflected in the evolving methods employed in a variety of catalogues of other collections; and in-house styles, conventions, and methods, which cannot lightly be altered or abandoned.

There is no common standard for the cataloguing of medieval manuscripts, although various countries have each begun to form their own general consensus about cataloguing methods, often as a result of a major cataloguing effort or project. In the USA and UK there has been a tendency in recent years to follow the format and conventions developed by Neil Ker in his pioneering Medieval Manuscripts in British Libraries (1969-92); once the user has become familiar with its conventions, this format has the great advantages of clarity, precision, and concision (see, for example, Shailor 1984, 1987, 1992; Dutschke 1989; Ferrari & Rouse 1991; Light 1995). In the introduction to the first volume of this work (pp. vii-xiii), Ker discusses some of these conventions in a list of sixteen of the points covered by his catalogue descriptions, and few modern catalogues with any pretensions to completeness neglect to include the features on this list (see Appendix 000). The Bodleian catalogue descriptions follow his format in its general outline, by providing, as a general heading, the Shelfmark, Contents, Language, and Place and Date of Origin; this is followed by a detailed description of the Contents, Decoration, Physical Description, Binding, Provenance, and Bibliography; but some of these (such as decoration) are treated in considerably greater detail than by Ker.

Approaches to Automation

The automation of manuscript cataloguing has to be able to handle descriptions at all the varying levels of detail and complexity discussed above. The interface for displaying these descriptions has to allow the user to search for manuscripts via a number of different paths, based on searchable and scrollable indexes (e.g. authors and texts, scribes and artists, owners and donors, iconographic subjects, etc.), and free-text searching; it has to be easily navigable, and visually acceptable to those who are familiar only with conventional printed media; it has to enable downloading and printing of catalogue entries; it should not be software-specific or require high-specification hardware; and it needs to co-exist with catalogues of other types of material, including post-medieval items.

Various automation options were examined over the course of a year, with these aims in mind: we examined proprietary systems such as Cairs, and experimented with the production of our own relational databases using the FoxPro package. The optimum solution which we have so far identified, however, is SGML, and this has formed the basis of most of our work during the project: for our collection-level, and minimal item-level, descriptions we have been using the Encoded Archival Description (EAD) (Library of Congress 1996) and for detailed records we have extended the Text Encoding Initiative (TEI) (Sperberg-McQueen and Burnard 1994) to improve its handling of manuscript descriptive information (metadata).

The EAD had reached its alpha version when we began encoding our finding aids, and it has readily proved itself suitable for providing the information which we had traditionally included in our printed versions of collection descriptions. It does not, however, provide enough specific elements at the item level to allow the marking up of catalogue records for individual manuscripts in as much depth as had been used in the most recent printed Bodleian catalogues. This can be circumvented by using the generic ODD (Other Descriptive Data) element, but this represents something of an evasion. It was therefore decided to use the EAD for information from collection level down to a minimal item level description, and then link from an EAD entry to a corresponding TEI record, in which much greater detail could be encoded: the EAD has several ways of linking to external files, of which the simplest is to use an entity reference. We plan to employ the same user interface for both DTDs: the user will not have to know which one applies at any given point.

Extending the TEI

Initially we considered designing our own DTD for our detailed, item-level descriptions of manuscripts, but soon rejected this approach in favour of using an extended version of the TEI. Several reasons prompted this decision: the TEI is a robust, well-tested DTD, already used for manuscript transcriptions, and using it obviates much of the basic groundwork involved in setting up a new DTD from scratch. It is readily extensible, with well-established mechanisms for incorporating new elements and entities, and for rewriting those already present. The ease with which metadata, texts, and images can be linked in a single TEI file also gives the possibility of records acting as the basis of a wider range of applications, including electronic editions or image databases.

The TEI already includes a small number of elements designed specifically for the encoding and description of manuscripts, particularly hand and handShift, which are used to record information on distinct scribes or handwriting styles. Other parts of the manuscript catalogue record are already catered for by the TEI's standard tagsets: shelfmarks, for instance, are covered by the idno element, which has a type attribute that can be used to distinguish present from past shelfmarks; dates are covered by the standard date element; and bibliographies attached to records can be represented by the listBibl, biblStruct or bibl elements. The languages in which a manuscript is written are covered, as for any transcribed text, by the global lang attribute.
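
As a rough illustration of how the type attribute on idno lets software separate present from past shelfmarks, the following Python sketch groups shelfmarks by type. It is only a sketch under stated assumptions: the wrapper element `record`, the attribute values `current` and `former`, and the shelfmarks themselves are invented placeholders, and the sample is written as well-formed XML rather than the SGML used by the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical record fragment. The <record> wrapper, the type values
# "current" and "former", and the shelfmarks are illustrative assumptions;
# only idno, its type attribute, and date are named in the text above.
SAMPLE = """
<record>
  <idno type="current">MS. Example 1</idno>
  <idno type="former">MS. Example a. 1</idno>
  <date>s. xv</date>
</record>
"""

def shelfmarks_by_type(xml_text):
    """Group the text of each idno element under its type attribute."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for idno in root.iter("idno"):
        # Fall back to a neutral label when no type attribute is given.
        kind = idno.get("type", "unspecified")
        grouped.setdefault(kind, []).append((idno.text or "").strip())
    return grouped

marks = shelfmarks_by_type(SAMPLE)
```

A display routine could then show the current shelfmark as a heading and list any former shelfmarks beneath it.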

These elements tend to prove inadequate for detailed manuscript metadata. Of Ker's sixteen points, only the first (Date) is readily available within the TEI, and one other (Script) is partially covered by the hand element. This paucity of manuscript-specific elements has caused some problems in the past for the compilers of electronic editions: the Canterbury Tales Project, for instance, has tried to rectify this by the incorporation of extensions to the DTD specific to the project. For the detailed cataloguing of the Bodleian's manuscripts, it was decided to produce a set of extensions, using the facilities for modification provided within the TEI DTD, that were planned to have as generic an application as possible.

The most radical extension we have designed to the TEI DTD is the incorporation of a new element, the mssStmt, as a child of the sourceDesc within the TEI header. This acts as a container for an extensive set of sub-elements designed specifically to encode the types of catalogue data which we already incorporated into our printed catalogues. Figure 1 shows the overall structure of this element, which is more fully described in section below.

[Figure 1: Overall structure of the proposed Manuscript Statement]

The term manuscript can refer to a wide variety of physical objects: for example, a number of quite disparate items may be bound together and re-foliated, so requiring descriptions for an item as a whole as well as each of its components. The mssStmt may be repeated and nested to allow close modelling of these often complicated structures: in the case of the above example, one mssStmt may be used to describe features common to all parts of the manuscript, and further mssStmt elements within this parent may cover features applicable to each component. Subsidiary mssStmts need only record those features by which a component differs from the whole item of which it forms a part: if nothing is noted, those values declared in the parent mssStmt are inherited by its children.
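
The inheritance rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the project's software: the child element names `binding` and `script`, the use of an `n` attribute to label components, and the sample content are all assumptions made for the example, and the sample is written as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested description: the outer mssStmt describes the whole
# volume, the inner ones its components. Element names other than mssStmt,
# the n attribute, and all values are illustrative assumptions.
SAMPLE = """
<mssStmt>
  <binding>Bound in one volume</binding>
  <script>Gothic bookhand</script>
  <mssStmt n="A">
    <script>Cursive hand</script>
  </mssStmt>
  <mssStmt n="B"/>
</mssStmt>
"""

def resolve(stmt, inherited=None):
    """Return (label, features) pairs for a mssStmt and its descendants.

    Each component starts from a copy of its parent's features and
    overrides only those it records itself.
    """
    features = dict(inherited or {})
    for child in stmt:
        if child.tag != "mssStmt":
            features[child.tag] = (child.text or "").strip()
    results = [(stmt.get("n", "whole"), features)]
    for child in stmt:
        if child.tag == "mssStmt":
            results.extend(resolve(child, features))
    return results

resolved = dict(resolve(ET.fromstring(SAMPLE)))
```

Here component A overrides the script but inherits the binding, while component B, which records nothing of its own, inherits both values from the parent.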

The mssStmt contains most of the new elements added to the TEI. Its major constituents are as follows:

A container element for up to seven sub-elements, covering the decoration of a manuscript: these include an overview, a description of miniatures, of historiated and decorated initials, borders, minor decoration, and an element for attributions and commentary.

A repeatable element used to record the physical dimensions of ruled, written, pricked, or leaf areas. A type attribute states which kind of dimension is being described, and the element contains sub-elements for width, height, depth and a free text description.

Records the number of leaves in the fly leaves and main text block of an item, and a description of each type of material present in these sections, including a recording of damage, and (for paper) of watermarks.

A repeatable element containing foliation information for an item, including the period, medium (pencil, ink, etc.), and type of numerals.

Collation is recorded by a quireformula element, which models the structure of the collation itself, and an evidence element, which is used to record markings (such as catchwords or quire signatures), and other evidence relating to the collation.

Further elements provide for descriptions of the script used (with links via a hand attribute to information on scribes and handwriting styles enumerated in the handList element within the TEI header's profileDesc), of rubrication, of the binding, and of secundo folio.

Provenance information for an entire manuscript or a constituent part is contained within a repeatable provenance element. If more than one of these is present, a containing listProvenance element can be used to group them together (on the model of bibl and listBibl in the standard TEI guidelines).

In addition to the new mssStmt element, some further modifications have been made to the TEI DTD to incorporate important information that can be used in both the catalogue description and transcribed text. The bibl element has been extended to include elements to mark up the name of a repository, the place of origin of a manuscript and the collection to which it belongs. New phrase-level elements incipit, explicit, colophon, and heading have been added to allow the mark-up of the corresponding features within the header or main text. These are particularly useful for the creation of indices of incipits etc. A new phrase-level element, iconTerm, is used, primarily within the decoration sub-elements, to describe iconographic subjects, and includes an optional alphanumeric Iconclass code (ICONCLASS Research and Development Group 1997). A summary sub-element is available for inclusion in all div elements to incorporate an abstract of their contents compiled by the cataloguer.

An additional global range attribute is defined, which can be used to specify the physical span of pages or folios covered by a given element. The summary element, for instance, uses this attribute to indicate the span of folios represented by the division, without the need to explicitly mark them in the text with milestone tags. In the description of collation, this attribute can be used to record the range of folios covered by a single quire sequence, or a larger grouping of these sequences. Within the miniatures sub-element of decoration, it is used to indicate the position of each miniature by its folio reference.
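
To make the idea of a folio span concrete, the sketch below shows how software might interpret a range value and test whether a given folio falls within it. The paper does not specify the syntax of the attribute value, so the form assumed here (`START-END`, with each end a folio number followed by `r` or `v`, e.g. `1r-12v`) is purely a hypothetical convention for the purposes of illustration.

```python
import re

# Assumed folio syntax: a number followed by r (recto) or v (verso).
FOLIO = re.compile(r"^(\d+)([rv])$")

def parse_folio(ref):
    """Turn a reference such as '12v' into a sortable (number, side) pair."""
    match = FOLIO.match(ref.strip())
    if match is None:
        raise ValueError(f"unrecognised folio reference: {ref!r}")
    number, side = match.groups()
    # Recto sorts before verso on the same leaf.
    return int(number), 0 if side == "r" else 1

def parse_range(value):
    """Split 'START-END' into two folio pairs; a single folio spans itself."""
    start, _, end = value.partition("-")
    first = parse_folio(start)
    last = parse_folio(end) if end else first
    if last < first:
        raise ValueError(f"range ends before it starts: {value!r}")
    return first, last

def covers(value, ref):
    """Report whether the folio ref lies within the span given by value."""
    first, last = parse_range(value)
    return first <= parse_folio(ref) <= last
```

With such a helper, an interface could answer questions like "which quire contains folio 8r?" by testing each quire's range in turn.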

In section below, we supply a fuller technical description of these extensions, including examples of their proposed use.

Using the extensions

Within the Bodleian, we have been marking up catalogue records directly in SGML format using SoftQuad's SGML authoring package, Author/Editor. A blank sample record is used by the cataloguer as a template — each important section within the template is marked by an identifying number, which corresponds to an entry in a detailed cataloguing manual designed specifically for this project. The manual provides the cataloguer with information on what is required within a given element, how it should be expressed, and what attributes are used. In practice, we have found that marking up a record directly into TEI takes little longer than producing a version in a standard word-processor. The bulk of the time spent producing a record in fact is taken up by the bibliographic analysis of the item, rather than the encoding of cataloguing details in TEI format.

Once a record is complete, the cataloguer validates it, exports it from Author/Editor's proprietary format into SGML, and moves it to a specified directory on our server. Here it is processed in batch mode by a script which loads it onto our in-house WWW system.

The user interface

The design and implementation of a user-interface for our manuscript system is proving the most complex and time-consuming part of the whole endeavour. We decided fairly early on to attempt to design our own WWW interface, instead of using an SGML browser such as SoftQuad's Panorama. Several factors told against the Panorama approach:

It requires users to download and install the free version of the software, which is more difficult than using a frames-compatible browser alone.

It is slow to load a new SGML file: not only has the file itself to be downloaded, but so have all the DTD files, navigators, and style sheets, with which the SGML file is then processed. It is quite possible to wait several minutes while an extensive file and its accompaniments are downloaded and processed. Using two DTDs, as we are planning to do, would only exacerbate this problem.

It has only crude searching and indexing facilities, much less sophisticated than those we hope to achieve.

We have been designing our own in-house WWW interface to allow the browsing of both EAD and TEI records: it also offers users the facility to search the full text of entries, or given indexes, by keyword, and to browse alphabetically through the indexes themselves. It aims to make our records easily accessible via any frames-compatible Web browser, without the need to install any specialized software. This frames-based application is based on Tcl scripts, and uses Open Text's PAT software for searching and browsing.

The interface allows the user to browse up and down the hierarchies of an EAD file, displaying information relating to the level currently being viewed, and to move down to any lower level present. In addition, it can carry out keyword searches either on the full text of catalogue entries, or on a number of given indexes: full Boolean searching is available here, and the user has a choice of searching across all collections, or a single one only. The user may also browse a number of dynamically-created indexes (such as personal name, geographic name etc.), which can contain multiple levels of description.
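
The kind of Boolean keyword searching described here can be illustrated with a small inverted index. The following Python sketch is a generic illustration only: it does not represent how Open Text's PAT works internally, and the record identifiers and texts are invented for the example.

```python
from collections import defaultdict

# Invented sample entries standing in for catalogue descriptions.
RECORDS = {
    "rec1": "psalter with calendar and litany",
    "rec2": "book of hours with calendar",
    "rec3": "sermons in latin",
}

def build_index(records):
    """Map each lower-cased word to the set of records containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return index

def search_and(index, *terms):
    """Records containing every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Records containing at least one term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))

def search_not(index, include, exclude):
    """Records containing one term but not another."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())

index = build_index(RECORDS)
```

Restricting a search to a single collection would, in a sketch like this, amount to building the index from that collection's records only.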

The link from an EAD to a TEI record is invisible to the user: it appears as a further hierarchical level below an item description in the EAD record. The same frames interface is used to display the TEI record, reformatted to HTML: the user can choose to browse a record's basic details, contents, decoration, physical description, provenance or its attached bibliography.
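
The reformatting of a record into HTML sections might look something like the following sketch. It is written in Python rather than the Tcl used by the project, and rests on several assumptions: the sample is well-formed XML, only the decoration and provenance elements named earlier are handled, and the sample text and the choice of HTML headings are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed mapping from element names to the section headings shown to users.
HEADINGS = {"decoration": "Decoration", "provenance": "Provenance"}

# Invented sample content for illustration only.
SAMPLE = """
<mssStmt>
  <decoration>One historiated initial &amp; minor penwork.</decoration>
  <provenance>Given to the Library by a private donor.</provenance>
</mssStmt>
"""

def to_html(xml_text):
    """Render each recognised child element as a heading and a paragraph."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        heading = HEADINGS.get(child.tag)
        if heading is None:
            continue  # Elements without a display rule are skipped.
        # Flatten any nested mark-up to text and normalise white space.
        body = " ".join("".join(child.itertext()).split())
        parts.append(f"<h2>{escape(heading)}</h2>\n<p>{escape(body)}</p>")
    return "\n".join(parts)

page = to_html(SAMPLE)
```

Because the display rules live in a single table, adding a further section (contents or bibliography, say) would mean adding one entry rather than rewriting the conversion.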

Future Plans

The Checklist of over 500 manuscripts described above is to be made available online as soon as is practically possible; a prototype interface is currently available. Whatever its weaknesses, it is to be hoped that the provision of minimal information will be of more use than the provision of no information; any faults and omissions can be rectified with relative ease in due course. Once the automated system is fully operational, the newly encoded versions of the old Typescript descriptions and the new detailed catalogue entries may be added to the online catalogue one by one, as each is completed, thus making the new information available on the internet at the earliest possible moment. The Checklist can also be used to indicate which manuscripts are in the process of being catalogued, and therefore provide users with an indication of which ones will be appearing as detailed descriptions in the near future. It is also hoped that by providing the Checklist and Typescript descriptions online, new awareness and interest in the manuscripts covered will be generated, and thus the cataloguing effort will benefit from feedback, which can be incorporated into the new detailed descriptions.

It is planned that the printed version of the detailed catalogue descriptions will initially be made available in a series of fascicules: rather than wait until the completion of the entire Project, it is thought that it will be more beneficial to make groups of catalogue entries available in printed form as and when they are completed. Thus, the collection of about twenty-five medieval illuminated manuscripts collected by T. R. Buchanan was chosen as the first group to be tackled, since it has a certain homogeneity of content and provenance, and has provided a suitable test-group with which to develop the cataloguing methods and the automated system; this will be followed by the larger group of liturgical manuscripts from all other sources; and so on. Once all the manuscripts are published in this form, it may be desirable to reprint the descriptions as a single volume, with addenda and corrigenda, and cumulative indexes.

The user interface is likely to be subject to major revision once XML (eXtensible Markup Language) becomes established and new WWW browsers are able to view XML marked-up texts directly. Instead of converting to HTML, the interface will rely on stylesheets based on XML-conformant DSSSL (Document Style Semantics and Specification Language), which should prove faster, more elegant, and thus easier to maintain than the current, complicated Tcl-based scripts. It is hoped to incorporate digitised images into the system shortly: both the EAD and TEI provide facilities to link to image files, and it should prove relatively simple to include in-line images and pointers to external files. In-line images may be useful for collation diagrams, for instance, while links to external files would allow us to use our catalogue records as the core of a digital archive of manuscript images. The first such links may possibly be to the high-resolution images produced by the Celtic Manuscripts Project (Oxford University 1997), currently underway at Oxford.


SGML has proved a useful medium for encoding information about manuscripts at both the collection and item level: its hierarchical functionality is ideal for expressing the intellectual structure of a collection, and its combination of flexibility and rigour makes it suitable for a detailed item-level description. Our experience so far is that these features have made it far easier to implement an SGML-based solution than a complicated relational database equivalent. The TEI itself has proved a solid basis for a cataloguing standard, its modular structure and easy extensibility paying dividends when it comes to building up a set of elements for manuscript metadata, and in providing a structure in which to place them.

The Bodleian's archivists and cataloguers have been able to adopt SGML as their cataloguing medium with very little difficulty, and can now easily encode directly into SGML without the need for an intervening interface (such as a database form). A sophisticated authoring package such as Author/Editor can make the process of encoding much faster, by use of macros for instance, and allows fast and easy navigation of a complex document.

If a well-thought-out interface is provided, the system's users themselves need know nothing of SGML or the structures of the DTDs used. The WWW has proved a convenient and powerful medium for dissemination of SGML-based metadata: the conversion from SGML to HTML is easy to do, using any common scripting language, and powerful SGML-aware software, such as Open Text's PAT, can provide performance to match any conventional database. There are, unfortunately, few suitable turnkey SGML systems which can do the same, but, for those without the resources to design their own interface, Panorama provides an acceptable medium for making records available on the Internet, albeit with the provisos noted above.

The cataloguing Project at the Bodleian Library described above has depended for its success on co-operation and consultation on a number of levels. At the local level, every catalogue entry is scrutinised in detail by the medieval specialists on the Bodleian's permanent staff, and benefits enormously from the dialogue that results from their comments and contributions. Similarly, every aspect of the cataloguing method has been (or is still in the process of being) discussed with SGML specialists, so that they may better understand the needs of the medievalist scholar, and so that the medievalists involved in the Project may better understand the possibilities and limitations of the SGML system being developed.

At the national and international level it is sincerely hoped that the cataloguing and encoding methods developed at the Bodleian will be discussed, commented upon, and constructively criticised by the participants in the MASTER Project, and others besides, so that the final solutions reached can be as widely applicable as possible, and bring us a significant step closer to meeting the needs of our readers, and providing an aid to research that will be of benefit for decades to come.

As noted above, the work presented here is an on-going project, and we are conscious of several things in the present description which we propose to change. Although our chief goal at this stage has been to test the adequacy of our provisions only for the cataloguing of a part of the Bodleian's collection of medieval Western manuscripts, we hope that it will also provide at least a useful first attempt at the problem of describing other types of handwritten resources, ranging from clay tablets and classical graffiti to modern notes and spoiled papers. However, even in the limited domain in which the system is currently employed, it should be regarded only as a preliminary sketch.

For the most up to date version of the TEI extensions used in this project, and related discussion, the interested reader is recommended to consult the Project's web pages at

