Minutes of The Studley Manuscript Encoding Meeting Peter Robinson Distributed by Peter Robinson

email message of 12 Nov 1996

Dear everyone

Herewith, the minutes of the meeting. Please check what I have said you said and let me know any amendments. When everyone is satisfied with this, I should like to HTML this and put it up on a friendly Web site. You can also read an excellent shorter report on the proceedings from Lou Burnard, at http://users.ox.ac.uk/~lou/reports/9611stud.htm.

The Studley Manuscript Encoding Meeting
Background to the meeting

Peter Robinson opened the meeting, and welcomed all participants. He outlined the background to the meeting: the likely onset of massive digital imaging initiatives, where whole libraries of manuscripts might be photographed digitally and the images distributed on the internet, created a need for some agreed system of manuscript description in electronic form to accompany the images. Otherwise, we are going very soon to have the situation where there are many images available, but it is difficult to find any one image of any one manuscript. Many libraries and archives have already begun to tackle this problem, or are about to tackle it, and there is a danger of much wasted effort and many incompatible electronic manuscript records without some co-ordinated initiative. The award by the Mellon Foundation of a grant to the Electronic Access to Medieval Manuscripts project (EAMM), headed by Hope Mayo, signals an American-based effort to circumvent these problems. Robinson met Mayo in Kalamazoo in May, and again in Princeton in July, and this meeting was arranged, firstly, as a step towards a joint European/US initiative and, secondly, as an attempt to devise a basis for an agreed format for manuscript descriptions.

Hope Mayo, the director of the EAMM project, then spoke about the background to this project. It is sponsored jointly by the Hill Monastic Manuscript Library (HMML), Saint John's University, Collegeville, Minnesota, in association with the Vatican Film Library, Saint Louis University, Missouri. The scale of the project, and the need for it, can be measured by the microfilm holdings of the HMML. This has microfilms of some 100,000 codices. For most of these codices, there is no published catalogue, and the library has only a 5 x 8 inventory card for each microfilm, which might contain just one typed line describing each separate text in the manuscript. The HMML began collecting microfilms in the 60s, and its microfilms cover manuscripts mostly in Austria, Malta and Ethiopia. The Vatican Film Library has microfilms of some 60 to 70,000 codices, all from the Vatican library, and began collecting microfilms in the 50s. The catalogue records for these are often themselves on microfilm, with copies of the Vatican library's own index cards.

The project is concentrating on catalogue records only, and is not planning to do any imaging. The Mellon Foundation is funding another project, the Digital Scriptorium at Berkeley, which is to do manuscript imaging. Two associates of this project, Merrilee Proffitt and Daniel Pitti, were at the meeting.

The project is aiming to establish content guidelines for both a 'core-level' description and a fuller 'detailed' description. A 'detailed' description should comprehend all a competent cataloguer might want to say about a manuscript. A 'core' description could point at a detailed description, be contained within it, or point to an image etc. It should provide 'first-level' search capability, both to retrieve manuscript records by the most commonly-used search mechanisms and to provide sufficient information to enable the user to decide whether to explore this record further (typically by retrieving a detailed description). [The meeting later decided to use the term 'first-level' rather than 'core', in part to distinguish these abbreviated and specialized manuscript records from the more generic Dublin Core records.] The intellectual content of both kinds of records should not be bounded by any specific format: software, hardware, or encoding.

The project intends to create these in both MARC and SGML (Standard Generalized Markup Language) format, to achieve the maximum utility, flexibility, and compatibility with existing bibliographic tools. MARC has software and hardware systems maintained by national public libraries, with millions of existing records. The volume of data in MARC already guarantees that it will be maintained, and better schemes for interchangeability of data between MARC and other formats are being developed. A large number of manuscript records are already encoded in MARC, but these are very inconsistently encoded. For example, in the case of a record of a microfilm of a manuscript (very common in the HMML and VFL materials) there is no standard method of designating the repository and shelfmark of the original manuscript, or of indicating that the object being described is a microfilm of the manuscript and not the manuscript itself. SGML offers highly flexible encoding for new and existing materials, and SGML encoding schemes in library and archive contexts have very rapidly found many users and advocates (thus, the Text Encoding Initiative [TEI] and Encoded Archival Description [EAD] schemes). The development by OCLC of an SGML Document Type Definition (DTD) for MARC, and the embedding of MARC information in SGML encoding, suggest there is considerable scope for interplay between the two methods.

Library presentations

The rest of the Saturday was taken up with presentations from libraries of the work they have in hand, or are about to commence, in the making of machine-readable manuscript descriptions.

The British Library

The first to present (at their request!) was the team from the British Library: Michelle Brown, Rachel Stockdale and Richard Masters. Michelle Brown began by outlining the tradition of scholarly cataloguing at the British Library, from 1753 to the present. Over that period, the foundation collections have been augmented by material ranging from papyri of 300 BC to the contents of seven garden sheds owned by Tony Benn. The library must respond to anyone in need of its resources. Every year 17,000 readers consult 70,000 items in the library; its staff answer 28,000 queries; and the library now has a web service. It is possibly the largest research collection in the world and is an 'unexcavated treasure house'. Catalogues dating from the late eighteenth century are often sparse in detail. There are some indexes of names, while the handwritten card catalogue still remains the only comprehensive system for finding any item in the library. However, the library has made a considerable effort towards machine-readable records covering all its materials, devising its own MARC format specifications in 1989. These records provide indexes of person, place, and other names, governed by name authority files.

Michelle Brown is currently engaged in the making of a new 'shelf catalogue' of the illuminated manuscripts in the British Library. This involves examining every manuscript book on the shelves in the stack area of the library, and making a record for all pre-1200 materials, all illuminated manuscripts dating before 1600, and any later hand-written materials that are important in themselves or relate to the history of earlier manuscripts. Each manuscript is described by the cataloguer in a word-processor file using a library-devised structured paragraph format, amenable to later processing: this format provides for enumeration of contents and illuminations. It is estimated that this will require checking of some 35,000 medieval or related manuscripts in total, and that around 85 per cent of the manuscripts in the library are illuminated and will require recording in this manner. Some five hundred records have been so far made, with a further 1500 records ready for checking. Each manuscript takes between two and ten minutes to catalogue in this manner, and the scheduled move of the Library to St Pancras dictates that the whole survey should be finished by Easter 1997.

Rachel Stockdale, head of cataloguing at the British Library, then gave a wider picture of the Library's cataloguing policy and activities. The library has a vast quantity of automated data in very basic format, with its only structure being presentational markup for bold type, indentation, etc. These records were prepared for camera ready copy, in order to bring published catalogues up to date, with all manuscripts acquired since 1956, and a total of some 20,000 volumes in all, so described.

The Library makes two kinds of records. The first is a longer descriptive, narrative-like prose record, made in Microsoft Word 5 using style sheets to define paragraph content and character styles within each paragraph to give further structure. Thus, the first element in the first paragraph is always the manuscript number and this is bold face, with the language of the manuscript appearing in italic. The Library also makes shorter 'index' MARC format records pointing at these longer records, and automatic searching needs to work over both types of records. Advanced Revelation software is used for entry of these short records into the database with automatic translation of these from the database into MARC. The British Library MARC format is now rather distant from the UK MARC standard, and so their records are only easily usable in the Library's own OPAC and it is difficult to import the records into the RLIN catalogue. The Library has now decided to concentrate on the making of the longer, detailed, records, as earlier described by Michelle Brown.

The Library is now engaged on a massive retro-conversion task of its pre-1957 accession catalogues, ready for mounting on the OPAC. This is done by optical character reading of the existing printed catalogues with the data parsed into fields on the basis of their type face. The resulting machine-readable records are then printed out for proofing. Some one million records are to be converted in a two-year project. It is proposed to make the Library's OPAC the route for making the records available on the internet. Stockdale commented that the time absorbed in continual conversion and re-conversion of data into different formats for different applications left no time for improvement of the records themselves: data churning on a grand scale.

Richard Masters described his work on an SGML DTD for illuminated manuscripts. He has designed a DTD with elements for all areas of British Library detailed descriptions. This is a 'tight' DTD tuned for British Library cataloguing needs. Some elements (place, name, date etc.) are modelled on TEI elements. There is potential to use the TEI 'reg' attribute on elements to give a standard form of names to enable authority control. Lists can be generated for each element.
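The use of a 'reg' attribute for authority control can be sketched as follows. This is a minimal illustration and not the British Library DTD: the element names, the sample record and the helper authority_list are hypothetical, and the fragment is written as well-formed XML so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical record fragment: element names and content are invented.
SAMPLE = """
<record>
  <name reg="Cotton, Robert">Sir Robert Cotton</name>
  <name reg="Cotton, Robert">R. Cotton</name>
  <place reg="London">Londinium</place>
  <name>Unknown scribe</name>
</record>
"""

def authority_list(xml_text, element):
    """Map each regularized form to the variant spellings found in the record."""
    root = ET.fromstring(xml_text)
    forms = defaultdict(list)
    for node in root.iter(element):
        text = (node.text or "").strip()
        # Fall back to the text itself when no 'reg' attribute is given.
        key = node.get("reg", text)
        forms[key].append(text)
    return dict(forms)

names = authority_list(SAMPLE, "name")
places = authority_list(SAMPLE, "place")
```

Grouping by the regularized form is what allows a list to be generated for each element, with variant spellings collected under one heading.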

Detailed manuscript records are made in Word, as earlier described by Brown and Stockdale. These records are then converted into ASCII files, converted into Author/Editor and then tagged. It is planned to devise a 'shelf' system to enable shelf cataloguers to enter SGML directly at the shelf. Open Text's PAT search engine was previously used for searching the completed SGML records; the Library now uses LiveLink from Open Text to provide a Web-based SGML-aware searcher. Panorama can be used to view the retrieved SGML, or convert it to HTML; LiveLink can also convert the SGML to HTML.

In discussion, Dominik Wujastyk commented that the British Library projects currently had to provide a variety of outputs from a variety of inputs. It would clearly be more efficient for there to be one input only, and the SGML-based system promises to give this.

For further information on the British Library, see http://portico.bl.uk/.

The Arnamagnaean Collection, Copenhagen and Reykjavik

Matthew Driscoll then described the Arnamagnaean collection of manuscripts, in Copenhagen and Reykjavik. This collection is based on 2500 mss left by Arni Magnusson, the great Icelandic bibliophile and manuscript collector, to the University of Copenhagen in the early eighteenth century. Following Icelandic independence in 1944 the newly-independent nation asked for the return of these manuscripts. In 1965 the Danish government agreed to return many manuscripts. The 2827 manuscripts in the collection in 1965 were divided according to type by committee. In 1971 the first manuscript was taken back to Iceland, of a total of 1666 to be returned. A counterpart to the Arnamagnaean Institute in Copenhagen was established in Reykjavik to hold the returned manuscripts. Another 141 manuscripts were returned to Iceland from the Royal Library. 452 of the manuscripts in the Arnamagnaean collection are vellum, 50 of these not Icelandic, 100 are fragments, and there are a total of 26,000 vellum leaves in the collection. Many manuscripts are paper, and modern: the result of a tradition of manuscript copying which continued into this century. A further 16,000 manuscripts are in the National Library in Reykjavik and others are in Stockholm, Uppsala, the British Library, the Bodleian, and in private collections in Iceland. There are excellent printed catalogues of the Arnamagnaean collections, and printed catalogues (of variable quality) of the other main library collections. Some of the best catalogues, dating from the 1950s in some cases, are in typescript with few copies in existence. None of these catalogues are in machine-readable form, though some word-processed files exist. The catalogues are in different languages, and cover some 20,000 manuscripts in all. A general machine-readable catalogue for all Icelandic manuscripts is feasible, and highly desirable. We have a considerable body of information about the copyists and provenance of the manuscripts (many of which have nicknames). Even in the medieval period, we can identify actual scribes by name and most manuscripts can be dated and localized. A 'greatest hits' CD-ROM of images from some 20 to 30 of the best-known manuscripts is planned.

l'Institut de recherche et d'histoire des textes (IRHT), Paris

Elisabeth Lalou gave a brief history of the IRHT. This was founded as a government-funded CNRS laboratory in 1937. At first, it gathered microfilms to constitute an ideal library for text editors and so contains 53,000 microfilms of manuscripts. Not all these manuscripts are literary, and they include Latin, Greek, and Arabic as well as French language materials, and a considerable body of archival materials. IRHT began making machine-readable manuscript records in the late 70s, and now uses the MEDIUM database, begun in the early 80s. At first, this was on a mainframe, but in 1992 it moved to personal computers in the institute running Oracle. It also moved from a hierarchical system to a relational database. MEDIUM is complexly structured, with 'sub-bases' for archival, liturgical and (planned) musical segments. At first, it was desired to put enormous amounts of information in the database and its structure was made very complex to cope with this. Much of this power is not used.

IRHT plans to scan some 95,000 slides from 6000 French manuscripts in municipal libraries and put these on the web; similarly, project Initial is to cover 20,000 illuminated manuscripts. A separate database for illuminated mss has been made for this, linked to MEDIUM. This contains complexly structured iconographic information (based on Garnet's typology) and links to images. A Taurus database is used for this and 3000 slides have so far been digitized. It is also planned to modernize and simplify MEDIUM, with more complex information held in sub-bases rather than in the master database.

The Vatican Library

Ambrogio Piazzone described the extent of the Vatican Library's holdings. It has some 150,000 manuscripts, of which 75,000 are archival, 75,000 rare books, some 150,000 photos and many maps. Around 70,000 of the manuscripts are Latin, some 5000 are Greek, 9000 are Arabic and 800 are Hebrew. Many of these items are extremely precious.

The library is very short of resources, and thus of the 70,000 Latin manuscripts only 9,000 are fully catalogued. The rest have a summary entry only in a handwritten inventory, with many having no entry at all. There is a card catalogue, of which there is a copy in the Vatican Film Library. This was made in the 30s and is not a complete catalogue of the library but rather an index by author and work. Various printed catalogues dealing with segments of the collection do exist. The highest priority is to give scholars access to manuscripts, and then to provide descriptions. The Vatican wants to give world-wide access to its collections using new technology. It aims to give access to all via computer searching. MARC format is used for the printed books in its collections, and the library has attempted to use MARC format for manuscripts: 100 manuscripts have been so catalogued. MARC, in its present form, was found to be inadequate. For example, the cataloguers had to put both the incipit and explicit in the MARC 'added titles' field, with the first characters of the field specifying whether it is an incipit or explicit. There is a need for a common MARC system adapted for medieval manuscripts.
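The workaround described above, where one field carries both kinds of text and a leading marker distinguishes them, can be sketched as follows. The markers INC: and EXP: and the function names are hypothetical: the minutes do not record the actual characters the Vatican cataloguers used, and no real MARC tag is implied.

```python
# Hypothetical markers; the real prefix characters are not given in the minutes.
PREFIXES = {"incipit": "INC:", "explicit": "EXP:"}

def encode_added_title(kind, text):
    """Store an incipit or explicit in one free-text field with a leading marker."""
    return f"{PREFIXES[kind]} {text}"

def decode_added_title(value):
    """Recover (kind, text) from a marked field; kind is None if no marker matches."""
    for kind, prefix in PREFIXES.items():
        if value.startswith(prefix):
            return kind, value[len(prefix):].strip()
    return None, value

fields = [
    encode_added_title("incipit", "In principio erat verbum"),
    encode_added_title("explicit", "Explicit liber primus"),
]
decoded = [decode_added_title(f) for f in fields]
```

The sketch also suggests why the approach is fragile: any search tool must know the local marker convention before it can tell an incipit from an ordinary added title.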

The Vatican Library is giving online access to manuscripts via an IBM project, described in a paper on the internet. The aims of this project include capture adequate for scholarly uses. The project has now photographed some 20,000 page images in sixty manuscripts. These are so far available only to a few sites but will soon be available to all. The images are distributed in lower resolution form. Next year, the project aims to distribute these freely on the web with MARC-based descriptors, and so needs to find a MARC format to support this.

National Library of the Czech Republic, Prague

Adolf Knoll, deputy director of this library (formerly the Prague University Library), described the large digitization program under way at this library. The project was developed under the Unesco 'Memory of the World' program, and began in 1992 as one of two pilot projects. This Unesco program aims to preserve the world's documentary heritage. In March 1996 a meeting of Unesco and EU officials issued certain recommendations based on this pilot, concerning the use of HTML to enable Internet publication, the inclusion of basic information about the original and provision of a technical record for each image.

The program began by scanning high quality colour slides, and used special software designed for manuscripts for this. However, the slides were expensive to make and earlier this year the project decided to use a digital camera instead. In March/April 1996 the library bought a Kodak 2048 by 3072 camera, and can now make one hundred very high quality page images a day. The project chose to use HTML format for data records giving access to images. The aim is to make each image behave like a document, with each image linked to an HTML file and each file linked to the book at the top level. The images themselves are of different quality, with each associated HTML file identifying the record with the original object. Each record also gives AACR2-compatible information, to expedite access to each record: thus, each gives the shelf mark, the library, and the document owner. The image files must also be manageable: this is achieved by compression techniques. A system of access by subscription is planned, alongside the existing published CD-ROMs. The next step is to build digital archives, with important data flagged (perhaps by something like the Dublin Core system) for retrieval by search engines.
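The arrangement described above, one HTML file per image carrying basic identifying information and a link back to the book, can be sketched as follows. The file names, field labels, sample values and the function image_page are hypothetical; this is not the Prague project's actual record structure.

```python
from html import escape

def image_page(image_file, book_page, shelf_mark, library, owner):
    """Build a small HTML page that wraps one image and identifies its original."""
    rows = [("Shelf mark", shelf_mark), ("Library", library), ("Owner", owner)]
    details = "\n".join(
        f"<li>{escape(label)}: {escape(value)}</li>" for label, value in rows
    )
    return (
        "<html><body>\n"
        f'<p><a href="{escape(book_page, quote=True)}">Back to the book</a></p>\n'
        f'<img src="{escape(image_file, quote=True)}">\n'
        f"<ul>\n{details}\n</ul>\n"
        "</body></html>\n"
    )

# Invented sample values, for illustration only.
page = image_page("f001r.jpg", "book.htm", "MS X.1", "Example Library", "Example Owner")
```

Because every image has its own small page, each image can be found and cited as a document in its own right while still leading back to the book-level record.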

For further information on this project, see http://www.nkp.cz/externi/digit/Structure_Proposal/.

The Wellcome Institute for the History of Medicine

Dominik Wujastyk, Associate Curator of manuscripts and printed books in the Wellcome Institute, set its holdings in the context of other large Sanskrit collections in this country and abroad. It is one of three large collections in this country, with the other two being in Oxford and Cambridge, and has some 6 to 7000 manuscripts in its holdings. Sanskrit manuscripts range in date from 300 AD to 1800 AD, and indeed (as with Icelandic manuscripts) copies are still being made: libraries in India without photocopiers will arrange for material to be copied for you, and Dominik cited a charge of £8 for a copy of a 200 folio manuscript. Older Sanskrit manuscripts are written on palm leaf, the dried leaves of palm trees. From 1200 on paper becomes very common. Binding is rudimentary, with pages often just tied together with string, and rarely sewn. Sometimes the pages are sandwiched between carved plates of wood. Around fifteen per cent of manuscripts give considerable information concerning scribes, authors, their lineage, dates, locations, teacher, owners, etc. There are some 30,000 manuscripts in Great Britain alone, of which half have finding aids, and similar numbers in Germany and the US. There are 11 volumes of catalogues of Sanskrit manuscripts in Germany. These numbers are, of course, dwarfed by the numbers of Sanskrit manuscripts in India: there are five institutions in Poona which each have in excess of 100,000 manuscripts. One estimate puts the number at 7 million Sanskrit manuscripts worldwide; another puts the number at 30 million. A Sanskrit manuscript is rather perishable: the average life of a manuscript is 200 years in India. Most manuscripts came to the UK in 1830 and are nearing the end of their life. It is estimated that some three hundred manuscripts are disappearing every day. In 1868 the government of India started giving money to fund manuscript collections; these are the ancestors of the great modern collections.

In 1968 a Madras-based enterprise began a 'catalogus catalogorum' which is now up to P. This does not include material published after 1968. This cataloguing is funded by a government-backed scheme, which gives very basic and very inadequate information. Some catalogues put everything else in an appendix to the funded catalogue. To be usable in India a catalogue must be printed. Even very well-equipped scholars in India cannot get access to the Internet; indeed often one cannot even get a bulb for a microfilm reader.

Besides its Sanskrit materials, Dominik explained that the Institute had materials in 42 languages: a half million printed books, many western manuscripts, an archives collection, and an iconographic department with 100,000 paintings, some available on video disk. Textual descriptions of materials from this department are available on the internet. The Institute has a connection with a library in Madras, with 100,000 books available in microfilm and MARC record format. It is also engaged in a Columbia cataloguing project of 1000 manuscripts, in handwritten form, and an initiative to catalogue all 30,000 Sanskrit manuscripts in the US, in digital form. There is also a Delhi-based microfilming project, which is aiming to gather microfilms of all they can and is considering digitization of the microfilm.

The National Library of the Netherlands, The Hague

Anne Korteweg, Curator of medieval manuscripts of the Royal Library, described its holdings: some 6000 manuscripts and 100,000 letters. The first catalogues from this library were written in Latin, and the manuscripts have been renumbered several times. The cataloguers use the database InMagic for data input and access, and hold information about the provenance, title, and bibliography for each manuscript. An inventory of illuminated manuscripts in the Royal Library has now been completed and is now being extended to all illuminated manuscripts in Holland. The cataloguers are paying particular attention to the iconography of the manuscripts, and use ICONCLASS for this purpose: this provides very powerful retrieval mechanisms. InMagic is making an interface to the Internet, with the prospect of these records being easily available over the Web, and the project is committed to InMagic until this work is finished. She described the Dutch national manuscript description format: this is an implementation of PICA, the northern European MARC-like format for machine-readable bibliographic records. See also http://www.konbib.nl/kb/100hoogte/hh-en.html.

Deutsches Dokumentationszentrum fuer Kunstgeschichte, Marburg

Thomas Brandt of this centre outlined its work, best known to the world through its microfiche 'Marburger Index' and latterly its impressive CD-ROM publications. The centre began work in the early 1980s, making machine-readable catalogues. It has now catalogued some 150,000 images, including around 2000 manuscripts catalogued as art historical objects. The centre uses the same ICONCLASS software as does the project earlier described by Anne Korteweg.

The centre has now been asked by the Deutsche Forschungsgemeinschaft to develop a manuscript cataloguing workstation in collaboration with the state libraries. Up to now, most catalogue records in Germany are in printed books, and are very elaborate in detail and structure. The proposed workstation is to give access to authority files, e.g. the Library of Congress authority files. It is intended that it will use ICONCLASS as the iconographic authority file. Compatibility with an agreed international format is obviously highly desirable.

The centre is used to working with the very detailed MIDAS database. This has more than 2000 fields and is highly structured, with complex links to authority files. MIDAS fields can be nested any way you want, and can include images of the structure of the object with links to different records. It could cope with manuscripts which consist of separate booklets and could document the structure of the manuscripts. The centre's researchers have found that MIDAS can do all that is done in highly elaborate German printed catalogues. All this is bound into MIDAS, with the power to customize it so that the user only sees what is relevant to his/her work. For manuscripts this may be twenty-five special fields plus other general fields. It currently provides no fields for incipit/explicit, but these are all that appear to be missing among standard catalogue fields.

Thus, MIDAS is likely to be the system of choice for data entry and storage in Germany. It is easy to export information from MIDAS into ASCII and thence to SGML or MARC format. Brandt also outlined the Discus project. In this, many museums are contributing data to a central database, all using the MIDAS system, with the centre taking all the information and mounting it on the networks. There is considerable money and government pressure behind MIDAS. The system is available outside Germany also.

The Bodleian Library, Oxford

Peter Kidd and Richard Gartner of the Bodleian described the system of machine-readable catalogue records they are developing, in collaboration with Lou Burnard, for the making of records for the Bodleian's manuscript holdings. At first, they sought to find some written statement of the library's manuscript cataloguing rules. In fact, they could find none, though they did find many examples, many of them superbly-detailed and structured, of manuscript catalogue records in the Bodleian. Also, the Ker catalogues (N. R. Ker) are becoming a de facto standard.

Their aim was to devise an ideal format of catalogue entry, capable of expressing all that one might wish in a manuscript catalogue record. They first considered using a relational database for this, but then chose the SGML implementation of the Text Encoding Initiative (TEI) instead: much of the preparatory work had already been done, and a DTD already available. However, the TEI guidelines do not provide the specialized structures necessary for manuscript description, and need to be extended to furnish these. The TEI DTDs, however, are specifically designed so that new elements can be added, or existing ones modified. Thus, in consultation with Lou Burnard, a new element mssstmt was created for the TEI header. This element contains a decoration element, itself containing various nested sub-elements; the mssstmt also contains physdesc and provenance subelements. Attributes have been added to existing TEI elements (thus: quote type=incipit is legal) and further new elements (e.g. iconsub, mssbibl, heraldry) created.
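The shape of the extension can be sketched as follows. The element names mssstmt, decoration, physdesc, provenance and heraldry, and the attribute form quote type=incipit, come from the description above; the nesting details, the placement of the quote element and all sample content are assumptions, and the fragment is written as well-formed XML rather than SGML so that Python's standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Element names follow the minutes; nesting and content are illustrative guesses.
SAMPLE = """
<mssstmt>
  <physdesc>Parchment, 120 leaves</physdesc>
  <decoration>
    <heraldry>Arms on the first leaf</heraldry>
  </decoration>
  <provenance>Given to the library in the seventeenth century</provenance>
  <quote type="incipit">In principio erat verbum</quote>
</mssstmt>
"""

root = ET.fromstring(SAMPLE)

# Direct children of the new header element, in document order.
children = [child.tag for child in root]

# Retrieve incipits by attribute value, the kind of search the typing allows.
incipits = [q.text for q in root.iter("quote") if q.get("type") == "incipit"]

# Sub-elements nested inside the decoration element.
decoration_parts = [child.tag for child in root.find("decoration")]
```

Typing an existing element with an attribute, rather than inventing a new element for every kind of quotation, keeps the extension small while still letting software pick out incipits directly.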

The SGML editor Author/Editor is used to output the SGML. This is then linked to EAD descriptions, all indexed with PAT and presented to the WWW by on-the-fly translation into HTML. Some twenty-five manuscripts have now been described using this system, with many more to come. See also http://www.bodley.ox.ac.uk/mss/.

Sunday 3 November: Existing manuscript-related digital activities

The next part of the meeting (actually, commencing late on the Saturday afternoon) was taken up with surveying some existing initiatives relating to the formation of machine-readable manuscript records.

The Canterbury Tales Project

Late on Saturday, Peter Robinson briefly surveyed the history and scope of this project, of which he is joint general editor with Professor Norman Blake of the University of Sheffield. The survey was necessarily brief as his notebook computer failed to communicate with the CD-ROM drive carrying the CD-ROM of The Wife of Bath's Prologue, which he intended to demonstrate. This CD-ROM shows the linking of complex manuscript descriptions (prepared by Dan Mosser of Virginia Polytechnic Institute and State University) with full transcriptions of each text in each manuscript, images of each page in the manuscript, full word-by-word collations of every word in every manuscript and rather massive lemmatized spelling databases. It is therefore something of a tour-de-force of SGML in its TEI incarnation, and demonstrates the capacity of SGML/TEI to cope with the most complex document structures. It demonstrates also the maturity of the medium, in that all this is presented by a standard SGML browser package (EBT's DynaText), on two different computer platforms, without any need for complex programming. It shows too that the concept of linking manuscript descriptions with transcriptions, images, collations, and much other analytic matter, within a single published electronic resource, is not a distant ideal but a practical reality. An introduction to the project may be seen at http://www.shef.ac.uk/uni/projects/ctp/.

The Celtic manuscripts digitization project

David Cooper of the Oxford Libraries Automation Service then described this project, which aims to create a complete digital image record of some one hundred Irish and Welsh manuscripts, most of them in Oxford libraries. The project has links with the Irish CURIA project, which is transcribing some Irish manuscripts. So far, some 6500 pages have been captured by digital photography direct from the manuscripts themselves. The eventual aim is for Internet distribution.

The target of the digitization is a level of detail corresponding to viewing the manuscript in reasonable detail with a simple hand-lens. This equates to a resolution of between 300 and 600 dpi against the original (thus: an image size of between 3000 and 6000 by 2400 and 4800 pixels for a ten inch by eight inch page) at 24 bit full colour. Experience shows that we need to work at the upper end of this range, giving file sizes commonly of one hundred megabytes uncompressed. A fortunate consequence of the very high resolution of these images is that compression is extremely efficient, and JPEG commonly yields compression ratios of up to 30 to one without visible loss, so reducing 100 MB files to 3 MB. One such image was shown by Cooper, and it was indeed stunning: one could count the individual strands of hair in a twelfth-century thread mending a tear in a manuscript. So far, in excess of 500 gigabytes of images have been captured, and even with the excellent computing facilities at Oxford this quantity of data causes difficulties. The uncompressed images are sent to the university's hierarchical file server, an IBM mass-storage system: the sheer quantity of data being transferred sometimes saturates the network bandwidth. Pointers to the compressed JPEG files are assembled into HTML documents, and Cooper noted the need to keep information about the digital image in addition to the image itself. In the discussion that followed (and which took place at other times in the meeting) he emphasized that for full-colour manuscript photography at this level, digital photography is actually cheaper than traditional film-based photography. One could cost digital photography at this level at around £15 to £20 per image. This compares very well with the cost of making a single 5 x 4 colour transparency, typically costed at £30 an image by British libraries. And, of course, the digital image has other advantages over the film image.
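The figures above can be checked with a little arithmetic. The sketch below assumes a ten inch by eight inch page, 3 bytes per pixel for 24 bit colour, and decimal megabytes (one million bytes); the function names are illustrative.

```python
def pixel_dimensions(height_in, width_in, dpi):
    """Pixel size of a page captured at the given resolution."""
    return round(height_in * dpi), round(width_in * dpi)

def uncompressed_megabytes(height_px, width_px, bytes_per_pixel=3):
    """Raw image size in decimal megabytes (24 bit colour is 3 bytes per pixel)."""
    return height_px * width_px * bytes_per_pixel / 1_000_000

low = pixel_dimensions(10, 8, 300)    # lower end of the quoted range
high = pixel_dimensions(10, 8, 600)   # upper end of the quoted range

low_mb = uncompressed_megabytes(*low)
high_mb = uncompressed_megabytes(*high)

# JPEG at the quoted ratio of 30 to one.
high_jpeg_mb = high_mb / 30
```

At 600 dpi the raw size comes to roughly 86 MB, in line with the 'one hundred megabytes' quoted, and a 30 to one ratio brings it down to about 3 MB.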

IBM Digital Library Projects

Peter Elliot and Uschi Reber described the thinking behind the establishment of the IBM Digital Library initiative. This initiative is driven by access to digital images, and the projects focus on cultural heritage, e.g. the Vatican Library. A key element is the development of watermarking for protection of intellectual property, and this was demonstrated to the meeting. Clever watermarking, by subtle alteration of pixels, leaves the image clearly marked: it gives the impression of a 'see-through' watermark stamped on the image, which remains perfectly usable beneath the watermark. A variant of this is invisible watermarking, which cannot be seen but embeds in the image, in encrypted form, information about who supplied the image to whom. Other information, as desired, could be included in the image: for example, one could embed URLs for metadata. Other IBM projects, e.g. the Lutherhalle museum and the Archive of the Indias projects, were described. This latter project makes clever use of image enhancement for difficult-to-read handwritten materials, removing 'bleedthrough' for example. IBM are also involved in numerous projects which do not highlight imaging in the same manner. For example: an Institute for Scientific Information project, where a user may browse abstracts then order the original; a Marist College project for multimedia collaborative learning development (at: http://www.newdeal.marist.edu); a Case Western library collections services project, with a very sophisticated rights management system. Case Western is also digitizing musical scores, with high quality colour images. Behind all these projects lies a declared aim, BOTH to increase rights protection AND to improve access, through rights management and access control schemes.

IBM sees its strength in the provision of robust, mission-critical systems handling vast quantities of data, on the petabyte scale (a petabyte is a thousand terabytes; a terabyte is a thousand gigabytes; and so on). IBM's digital software library is a collection of software and hardware services and tools to help us make our own digital libraries.

Encoding Strategies

The next segment of the meeting considered different possible encoding systems for the making of machine-readable catalogue records. The long-established MARC and newer SGML-based EAD and TEI encoding schemes were described.

MARC

Larry Creider outlined the possibilities and difficulties of MARC encoding of manuscript catalogue records. Specific changes are necessary and possible within the MARC cataloguing rules for manuscript materials, laid down in chapter 4 of AACR2. His own work is at the University of Pennsylvania. This is a private university, with a research library of three million volumes and 1000 codex manuscripts, and with large archival collections. A brief-entry manuscript catalogue was published in 1965. A new catalogue is required and a grant has been given to produce an on-line catalogue. At present, the only guide to cataloguing manuscripts is chapter 4 of AACR2 and this is inadequate. Whatever system is developed has to be RLIN compatible. Manuscript scholars also want a much closer analysis of manuscript matters than the book-based AACR2 rules provide. However, there is a need to keep manuscript and book records working together. Since 1994 he has developed new draft guidelines for manuscript cataloguing in MARC and has entered records for 250 manuscripts in RLIN.

He stressed the adage that museums and special collections do not adopt standards, they adapt them. MARC is not good at indicating relations between items (e.g. a manuscript containing various items). There is a need to indicate shelf mark and a need to use some ingenious methods to cope with (for example) incipits and explicits: the 740 field can be used to hold them. Above all, agreement is needed on what is to be captured and where it is to be put. MARC is collaborative in nature and therefore a good base for standards development. (In later discussion, Lou Burnard stressed that SGML was also a highly collaborative environment.) Creider suggested that a separate catalogue entry be made for EACH work in a manuscript, as well as for the whole manuscript itself, and that these should be linked. A full manuscript record should include the preferred citation form. It is necessary to have the manuscript repository listed, with shelf mark, and necessary to have incipits recorded where available. The library cataloguers responsible for making the record should be listed and this could be put in a MARC note field (510 field).

EAD: Encoded Archival Description

Daniel Pitti, the chief architect of the EAD scheme, explained the thinking behind the EAD implementation of SGML. EAD has been designed for archival materials of all kinds, and not specifically for medieval manuscripts. As such, it has found rapid favour in many archives, and in a very short space of time many thousands of archival objects have been described in EAD encoding.

The top level of an EAD description is a functional subset of a TEI header element. Following this top level comes a front matter section, which can be used to give an electronic title page for the object. Then follows the findaid element, containing an archdesc to describe the collection of materials. This can be accompanied by an add element, an adjunct for holding other descriptive data and other elements which might be helpful for other users (a bibliography, arrangement of files, etc.). The did element contains a brief descriptive identification of the archival unit itself, including the following elements:

origination: an agent associated with the creation of the object. An agent can be 'active' (writer) or 'passive' (recipient).
physdesc: physical description of the object, containing elements for dimensions, extent, etc.
note: anything you want to put in
repository: where this object is
unitdate: date of creation of the thing
unitid: unique identifier for it
unitloc: location within the repository
unittitle: title of the object

Pitti observed that any number of did elements, containing all these, may be nested one within another, to any level, with did elements contained within one another 'inheriting' the properties of the containing did elements. Thus: if one declares for the higher-level did that its repository is the Bancroft Library, then all did elements contained within this also have this repository, unless it is specifically declared otherwise. He noted that he would like to see an element giving a summary of the scope and content for items.
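The inheritance rule Pitti described can be modelled in a few lines. This is a minimal sketch of the rule as reported here, not of any EAD software: the class and method names are hypothetical, the element names (repository, unittitle, unitdate) come from the list above, and the titles and the second repository are invented placeholders.

```python
# A toy model of nested did elements that inherit properties from their
# containers. "Did" and "get" are hypothetical names used only for illustration.

class Did:
    def __init__(self, parent=None, **properties):
        self.parent = parent          # the containing did, if any
        self.properties = properties  # values declared on this did itself

    def get(self, name):
        """Look up a property here, then in each containing did in turn."""
        node = self
        while node is not None:
            if name in node.properties:
                return node.properties[name]
            node = node.parent
        return None  # declared nowhere in the chain

collection = Did(repository="Bancroft Library", unittitle="Example collection")
item = Did(parent=collection, unittitle="Example letter")
elsewhere = Did(parent=collection, repository="Another repository")

print(item.get("repository"))       # inherited from the containing did
print(elsewhere.get("repository"))  # specifically declared otherwise
```

A property declared on the inner did always takes precedence, which matches the "unless it is specifically declared otherwise" condition in the minutes.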

Back at the level describing the whole collection EAD has further elements useful for administrative and retrieval purposes:

coredesc: administrative information (information on acquisition, access, and other versions; the custodial history of the body of material; the preferred form of citation; and processing information)
biography: who has been making the collection? its history?
odd: other descriptive information
dsc: description of subordinate components

Other elements at this level give information about controlled access (who has access to the whole collection, or to what parts of it) and about the content and scope of the collection.

TEI (The Text Encoding Initiative)

Lou Burnard, the European Editor of the TEI, focussed in his account of the TEI on the teiHeader element. This is designed to give guidance to the perplexed and to help the expert, by providing structured information about the encoded object itself. Thus, it includes elements designed to give support to librarians and to corpus builders, by provision of metadata useful for their purposes: what is it? who made it? what principles guided its making? how was it made?

Burnard gave a rapid account of the teiHeader, outlining the 'mandatory' elements in the TEI scheme: the header itself, which must contain a fileDesc, which itself must contain a mandatory title element. He suggested that a mssStmt element be placed within the sourceDesc (this is the system the Bodleian are using). A listBibl element can be used to give a long list of bibliographic items, where we have multiple items within a manuscript. He described how a header lets us state editorial policy, the languages used, and the classifications of text, including the scheme of the classification. A revisionDesc element can be used for internal documentation. He suggested we need 'application profiles' on the Z39.50 model, to customize TEI for particular user communities. Finally, TEI headers can be independent of text: they are thus usable as a 'stand-alone' catalogue record for any object, including non-textual objects.
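As an illustration of a 'stand-alone' header, the sketch below builds a minimal one with Python's standard xml.etree.ElementTree module. Several things here are assumptions rather than anything agreed at the meeting: in the TEI scheme the mandatory title sits inside a titleStmt, and a fileDesc also carries a publicationStmt and a sourceDesc, so those are included; mssStmt is only the element proposed above, and giving it a single p child is our guess at a content model; the function name and all values are invented placeholders.

```python
import xml.etree.ElementTree as ET

def standalone_header(title, repository, shelfmark):
    """Build a minimal header usable without any accompanying text.
    Hypothetical helper: the mssStmt content shown is an assumption."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # The mandatory title, held in the title statement.
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Unpublished catalogue record"

    # The proposed manuscript statement, placed within the source description.
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    mss = ET.SubElement(source_desc, "mssStmt")
    ET.SubElement(mss, "p").text = f"{repository}, {shelfmark}"
    return header

header = standalone_header("Example miscellany", "Example Library", "MS Example 1")
print(ET.tostring(header, encoding="unicode"))
```

Whatever content model is eventually agreed for the manuscript statement would replace the placeholder paragraph.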

The Dublin Core

Jennifer Trant first outlined the objectives of the Arts and Humanities Data Service (AHDS), within which she is based. This is funded by JISC (Joint Information Systems Committee: a UK-based and publicly-funded initiative devised to foster the use of information technologies in the higher education community). The aims of AHDS are to collect, preserve, and describe digital research data and to keep them available for reuse. This digital data may be derived from any medium (sound, image, etc. as well as text). AHDS seeks the interoperability of all the digital data, from whatever source: this can be achieved by the provision of common metadata protocols and discovery tools which work with these common protocols.

She outlined the Dublin Core scheme, which is an attempt to provide just such a common metadata protocol. The Dublin Core scheme was initiated at a conference in Dublin, Ohio, in March 1995, under the sponsorship of OCLC. The scheme is designed to help retrieval of 'document-like objects' (which could be images, or other non-text objects) by providing a simple set of thirteen categories which can be used to describe any object. The categories are: author/creator; title; date; publisher; other agent; object type; format; subject/description; relationships; source; language; coverage; identifier. The Dublin Core scheme was given further sophistication in a conference at Warwick in April 1996, which agreed the 'Warwick Framework'. The concept of the Warwick Framework is that the metadata should be grouped into 'packages', with each package giving access to domain-specific information. Thus, an archaeological implementation of the Warwick Framework might define 'coverage' in terms of Ordnance Survey grid-references for each excavation. [????]
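To make the idea of a fixed, simple category set concrete, here is a small sketch that checks a record against the thirteen categories as listed above. The category names are taken from the minutes; the function name, the sample record and its values are hypothetical.

```python
# The thirteen categories as listed in the minutes.
DUBLIN_CORE_CATEGORIES = (
    "author/creator", "title", "date", "publisher", "other agent",
    "object type", "format", "subject/description", "relationships",
    "source", "language", "coverage", "identifier",
)

def unknown_categories(record):
    """Return the keys of `record` that are not among the thirteen categories."""
    return sorted(set(record) - set(DUBLIN_CORE_CATEGORIES))

# Hypothetical record for a digital image of a manuscript page.
record = {
    "title": "Example manuscript, folio 1r",
    "object type": "image",
    "format": "JPEG",
    "identifier": "example-0001",
    "shelf mark": "MS Example 1",  # not one of the thirteen categories
}
print(unknown_categories(record))
```

Domain-specific information such as the shelf mark is the kind of data that, on the Warwick model, would travel in its own package rather than in the generic set.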

Finally, Trant showed some models of metadata, ranging from core generic data to very specialist information. The Dublin Core seeks to identify the generic information common to all packages.

Elements for manuscript description

The meeting then (at 12.00 noon on Sunday) turned to its main business: the agreement of a common set of elements for manuscript description, and further discrimination of just what elements should be 'mandatory' in any description, and which might form the basis of a 'first-level' description, suitable for abbreviated catalogue records. With a break for lunch, this occupied the rest of the meeting, up to its close at 5 pm.

Discussion ranged from the highly focussed (how do I encode provenance information written on the flyleaf?) to the very general (what are the aims of an encoding scheme?). Much time was spent establishing just what we were trying to do. In the course of this, we decided the following:

a. the term 'core' as in 'core descriptive elements for manuscripts' is misleading. Since the advent of the 'Dublin Core' the term implies that a 'core' group for manuscripts might be the same as a 'core' group of descriptive elements of other materials. We felt this unlikely, and did not want to prejudge the issue. Further, we wished to make clear that an abbreviated record had two major functions: it should provide a route to a more detailed record, where one exists, and it should 'stand alone', in cases where there is no more detailed record. To serve both functions, we thought it must be designed for maximum retrieval capacity, and thus have a clearly articulated structure, and it must permit maximum useful information to be packed into a short space. As such, what we sought was only a core in the few cases where the short description was embedded in a fuller description. In most cases, the short description would actually act as a pointer to a longer description, or be all the description there is. We agreed that the term 'first-level description' expressed this better than 'core description'.

b. we could not establish the elements for a 'first-level description' until we had established ALL the elements necessary for ANY manuscript description. Only then would we be able to look over all we had and decide which elements provided the detail a scholar, cataloguer, or other reader would find useful when making or searching for a 'first-level' record. Thus, the meeting (after lunch, and with an interval for coffee) concentrated on deciding all the elements necessary for any manuscript description. This process was helped greatly when, at a key moment, Jennifer Trant produced a taxonomy of the agreed elements, dividing them into categories covering 'creation', 'physical description', etc. Once we had seen and refined this full list of elements, deciding which elements should be mandatory in all descriptions, and which should be recommended as present in a first-level description, was astonishingly easy (helped no doubt by the lateness of the time and delegate exhaustion), and unanimous. Here is the list of categories determined in the meeting. [M] stands for mandatory; [MA] for mandatory when applicable; [1] for first-level description.

[M] Repository name
[M] Shelf mark/Identifier
Previous owner
Previous shelf mark/identifier
[1] Date(s) of production
[1] Place(s) of production
Scribe(s)
Hand(s)
Artist(s)
Physical Description
    [1] Dimensions
    [1] Extent
    [1] Materials
    [1] Illumination: yes or no
    Format
    Binding
    Collation
Contents
    [1] Author. subcategory: role=author/translator/commentator/other
    [1] Title. subcategory: type=transcribed/supplied by cataloguer/source unspecified
    [1] Incipit (must be supplied when 'title' not available in first-level description)
    Explicit
    Colophon
    Subject. subcategory: scheme=specified classificatory scheme
    Language/writing system
    Iconography
    Status (incomplete/imperfect)
    Number (in manuscript sequence)
    Location (folio number, etc.)
    Layout (in columns, written in margin, etc. Or in physical description?)
References to associated information
Reproductions (must be included if the object described is a reproduction of the manuscript, not the manuscript itself)
    [MA] format
    [MA] repository name
    [MA] shelf mark/identifier
    Date of reproduction
    Rights information
Cataloguer
Sources consulted
Date

There is much here which needs much more definition. The relationship between the description of the whole manuscript and each item within it is uncertainly defined here: the taxonomy here given suggests that 'creation' and 'physical description' are applicable to the whole manuscript while 'contents' is applicable to each item within it. In fact, in many manuscripts this division does not apply: each item within the manuscript may have its own creation and physical characteristics.
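The mandatory and first-level markings in the list above amount to a small set of rules that software could check. The sketch below is one possible reading of those rules, under stated assumptions: a record is modelled as a Python dictionary, all key names (repository, shelfmark, reproduction and so on) and function names are hypothetical, and the sample values are invented. It does not address the unresolved question of whole-manuscript versus per-item description.

```python
# Hypothetical key names standing in for the elements agreed at the meeting.
MANDATORY = ("repository", "shelfmark")                      # marked [M]
FIRST_LEVEL = ("date_of_production", "place_of_production",  # marked [1]
               "dimensions", "extent", "materials", "illuminated",
               "author", "title", "incipit")
REPRODUCTION_MANDATORY = ("format", "repository", "shelfmark")  # marked [MA]

def check_first_level(record):
    """Return a list of rule violations for a first-level description."""
    problems = []
    for key in MANDATORY:
        if not record.get(key):
            problems.append(f"missing mandatory element: {key}")
    # The incipit must be supplied when no title is available.
    if not record.get("title") and not record.get("incipit"):
        problems.append("incipit must be supplied when no title is available")
    # [MA] elements apply only when the object described is a reproduction.
    reproduction = record.get("reproduction")
    if reproduction is not None:
        for key in REPRODUCTION_MANDATORY:
            if not reproduction.get(key):
                problems.append(f"missing reproduction element: {key}")
    return problems

def absent_first_level(record):
    """Return the recommended first-level elements the record does not carry."""
    return [key for key in FIRST_LEVEL if key not in record]

record = {"repository": "Example Library", "shelfmark": "MS Example 1",
          "incipit": "Example incipit"}
print(check_first_level(record))
print(absent_first_level(record))
```

Keeping the hard rules separate from the recommended elements reflects the distinction drawn in the meeting between what is mandatory in every description and what a first-level description should ideally contain.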

What next

Peter Robinson concluded the meeting by explaining the likely next steps, and the shape of collaboration with the Mellon EAMM project. Central to this collaboration would be a successful EU bid, in the next libraries call. The call is to go out on 15 December, with proposals due in on 15 March. A project might take the following form:

1. Two meetings to establish and review an encoding standard. There might be other smaller technical group meetings, that would actually draft the standard and pass that to the fuller meetings for agreement.
2. Libraries agreeing to encode manuscript catalogue records in the form agreed by the standard, and submitting those to a central agency for mounting on the Internet (or mounting them themselves).
3. Establishing a central manuscript record archive, which could itself hold the records or point to holdings in other institutions.
4. Development of applications to ease the task of data-input and validation into the agreed format.

Since the meeting, it has become apparent that the time-scale of the EU bid would mean that a collaborative project could not begin until rather late in 1997. Of course, such a bid might not succeed. It is important that the goodwill and momentum we have established be continued, and in the next week some ways this might be done will be explored. One possibility is a book, containing articles and other materials prepared by members of this group and others, focussing on the making of machine-readable catalogue records. This should also include detailed instances of possible encoding strategies. Such a book would be a very useful input into both the EAMM project and into any EU bid.

The organizers of the meeting, Peter Robinson and Hope Mayo, would like to thank all who attended for their many positive contributions, which made this weekend so useful and productive.

Oxford, 15 November 1996.