Oxford University

Oxford University Computing Services

Draft Recommendations for TEI Digital Facsimiles



This document outlines a set of recommendations for using the TEI encoding scheme, with the MASTER extensions, to represent the following distinct kinds of object:


In each of the above cases, there is an additional need to record appropriate metadata, combining detailed cataloguing information about the manuscript or print source represented with technical metadata relating to its digital representation. These recommendations do not address the scope and content of either kind of metadata; our focus is on where such metadata should be located within the overall TEI structure.

Our recommendations attempt to address the following particular areas of concern:

As far as possible, our recommendations conform to the Text Encoding in Libraries: Guidelines for Best Encoding Practices, produced by the Digital Library Federation working party on XML and TEI in Digital Libraries in July 1999. Attention has also been paid to existing practice in creation of such resources, derived from largely anecdotal evidence. There is however scope for a more exhaustive survey of current practice in this area, which we have not yet been able to undertake.

See the bottom of this file for a complete sample record, illustrating the recommended form of a file used to encode a digital facsimile.

1. Case 1: transcription

The content of a transcription should be marked up as a single <TEI.2> element using the standard TEI elements <text>, <body>, <div>, etc. from the TEI core tag sets. Additional elements from the additional tagsets for physical description or for text critical editions may also be appropriate. A transcription will typically be taken from a single source; when it is not, the encoder may choose to combine transcriptions into a single entity using either the <group> or (where this is inappropriate) the <teiCorpus> element.

In a transcription, metadata relating to the source itself should be recorded within the <sourceDesc> element of the appropriate <teiHeader> element. If such metadata is of non-trivial scope, and relates to a manuscript source, it should be recorded using the MASTER <msDescription> element embedded within the <sourceDesc> element.

If the source being transcribed includes significant illustrative material, this should be marked at the appropriate location within the transcript, using the standard TEI <figure> element. For example:

 <!-- example to be supplied -->
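A minimal sketch of such a placeholder, assuming an illustration that falls between two paragraphs of the transcribed text. The paragraph contents are stand-ins invented for illustration, and the <figure> is wrapped in a <p> only because current versions of the DTD require it (see note 2):

```xml
<p>... transcribed text preceding the illustration ...</p>
<p>
  <!-- marks the point at which the illustration occurs in the source -->
  <figure/>
</p>
<p>... transcribed text following the illustration ...</p>
```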

Note that this element does not require there to be any associated digital version of the illustration, although encoders will often wish to include one. Technically, such a <figure> represents a transclusion: that is, it functions within the transcript simply as a placeholder for the image.

For accessibility purposes, it is good practice always to include a <figDesc> element within the body of a <figure> which supplies descriptive text for use when the image itself cannot be displayed. This element can also be used to supply descriptive or topical metadata about the content of the image, where this is not available from the TEI Header. For example:

 <!-- examples to be supplied -->
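As a hedged illustration, a <figure> carrying descriptive text might look like the following. The wording of the description is hypothetical and does not describe any particular manuscript:

```xml
<p>
  <figure>
    <!-- shown or read aloud when the image itself cannot be displayed -->
    <figDesc>Historiated initial showing a seated scribe at a writing
      desk.</figDesc>
  </figure>
</p>
```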

Where digital versions of such transcluded figures are available, they form part of the transcript. The mechanics for including non-SGML data such as digital images within an SGML or XML document are discussed further below (3. Techniques for referencing images). Note that any technical metadata relating to such images should be included in the <encodingDesc> of the associated TEI Header.

An illustration may contain text, such as a heading, or even a distinct text not forming part of the text in which the illustration appears. Headings should be encoded using the <head> element within the <figure>; nested text should be encoded using the <text> element within the <figure>. For example:

 <!-- examples to be supplied -->
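One possible encoding of an illustration with both a heading and a nested text is sketched below. The heading, description and inscription are invented placeholders; only the element names come from the recommendations above:

```xml
<p>
  <figure>
    <!-- heading or caption belonging to the illustration -->
    <head>The Knight</head>
    <figDesc>Portrait of a mounted knight holding an inscribed scroll.</figDesc>
    <!-- a distinct text appearing inside the illustration -->
    <text>
      <body>
        <p>... text of the inscription on the scroll ...</p>
      </body>
    </text>
  </figure>
</p>
```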

The <pb> or other appropriate milestone element (e.g. <cb>) should be used to mark reference points within a transcription.
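A short sketch of milestone elements marking reference points in a two-column transcription follows. The folio numbers, column labels and identifiers (T.4r, T.4v) are assumptions made for the sake of the example:

```xml
<div type="chapter" n="1">
  <!-- start of folio 4 recto -->
  <pb n="fol 4r" id="T.4r"/>
  <cb n="a"/>
  <p>... text in the first column of folio 4 recto ...</p>
  <cb n="b"/>
  <p>... text in the second column of folio 4 recto ...</p>
  <!-- start of folio 4 verso -->
  <pb n="fol 4v" id="T.4v"/>
  <p>... text continuing on folio 4 verso ...</p>
</div>
```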

For further complete examples, see the MASTER Reference Guide.

2. Case 2: digital facsimile

A digital facsimile should be marked up as a <TEI.2> element in its own right, since it is a distinct object from the manuscript. The TEI structural tags (<body>, <div>, etc.) may be used if desired to mark the internal organization of the manuscript. If, as is often the case, the boundaries of structural units such as chapters or texts do not coincide exactly with page boundaries, this may not be possible, or may require some special treatment in the encoding (see www.tei-c.org/Guidelines/NH.html for some suggestions).

Each distinct image making up the facsimile should be encoded as a <figure> element, arranged in the normal reading sequence of the facsimile [2]. For example:

 <!-- examples to be supplied -->
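The arrangement might be sketched as follows, with one <figure> per page image in reading order. The entity names fol1r and fol1v are hypothetical and would need to be declared as described in section 3; each <figure> sits in a <p> only because current versions of the DTD require it (see note 2):

```xml
<body>
  <div>
    <!-- first page image in reading order -->
    <p>
      <figure entity="fol1r">
        <figDesc>[Image of fol 1 recto]</figDesc>
      </figure>
    </p>
    <!-- second page image -->
    <p>
      <figure entity="fol1v">
        <figDesc>[Image of fol 1 verso]</figDesc>
      </figure>
    </p>
  </div>
</body>
```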

As noted above, it is good practice always to include a <figDesc> element within the body of a <figure> for accessibility purposes. In the case of a digital facsimile, the <figDesc> element should contain only a standard text such as ‘[Image of fol 15 recto]’; any descriptive metadata should be collected together into the appropriate part of the Header.

Where a number of alternative versions of a page image are available, for example at different resolutions, recommended practice is not to combine the alternatives into a single facsimile. Selection of the appropriate resolution is a rendering issue which should not affect the encoding of the document. Alternatively, where it is thought appropriate, <figure> elements may be self-nested to show that one image logically contains others. For example:

<figure id="F1">
   <figDesc>[Image of folio 1]</figDesc>
   <figure id="F1a">
       <figDesc>[Folio 1: Detail of upper part]</figDesc>
   </figure>
   <figure id="F1b">
       <figDesc>[Folio 1: Detail of lower part]</figDesc>
   </figure>
</figure>

Note that any elements containing text will be assumed to be part of the facsimile rather than its source.

The <pb> or other appropriate milestone element (e.g. <cb>) should be used to mark reference points within a set of facsimiles, just as it is within a transcribed text. If the <pb> elements are also given identifiers, they may be used to align transcription and facsimile pages by standoff markup, as discussed further below (4. Aligning transcription and facsimile).

As with a transcription, metadata relating to the manuscript or printed source should appear in the <sourceDesc> of the associated TEI Header, within an <msDescription> if the images were taken from a manuscript, or within a <bibl> if the images were taken from a published item [3].

3. Techniques for referencing images

The traditional SGML/XML method for handling non-SGML (or unparsed) data such as a digital image within an SGML/XML document is to declare the file containing the image as an external entity and then reference that entity [4]. The following entity declarations (for example) define entities called image1 and image2, associating each with a URL, and specifying for each that it uses a non-SGML notation called JPG:

<!ENTITY image1 SYSTEM
   "http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg" NDATA JPG>
<!ENTITY image2 SYSTEM
   "http://www.bodley.ox.ac.uk/canterbury-tales/gp2.jpg" NDATA JPG>

The names used for such entities have no particular significance, but should be unique within the document. Several methods may be used to identify the external entity itself (the URLs in the above example), including operating system filenames; for reliability and portability however, we recommend the use of a non-qualified URL. The notation name ("JPG" in the above example) must be defined in the associated DTD.
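To show where such declarations sit, the sketch below places them in the internal DTD subset of a document, together with a notation declaration. The system identifier "image/jpeg" is an assumption chosen for illustration; if the DTD in use already declares a notation named JPG, the <!NOTATION> line would be a duplicate and should be omitted:

```xml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- notation declaration; the identifier "image/jpeg" is an assumption -->
  <!NOTATION JPG SYSTEM "image/jpeg">
  <!-- unparsed external entities, one per page image -->
  <!ENTITY image1 SYSTEM
     "http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg" NDATA JPG>
  <!ENTITY image2 SYSTEM
     "http://www.bodley.ox.ac.uk/canterbury-tales/gp2.jpg" NDATA JPG>
]>
```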

As noted above, the <figure> element is used to indicate the point in a document at which an image is to be transcluded. In the standard TEI scheme, it carries an entity attribute, the value of which should be a previously-declared entity. Thus, the point in the document at which the first of the two images declared above should appear will be marked as follows:

<figure entity="image1"/>

Though well established and entirely practical, this method has two distinct disadvantages:

One simple solution would be to add a url attribute to the <figure> element, as an alternative to the existing entity attribute, which would enable the association to be done directly, like this:

<figure url="http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg"/>

As well as being entirely compatible with XML Schema, this approach clearly simplifies the task of preparing documents, though at the expense of increasing the difficulty of maintaining them, particularly if the entities involved are referenced from many places.

These problems are not, of course, uniquely a matter of concern to those producing digital facsimiles. We think that the simple solution proposed is worth further investigation as an alternative to the continued use of the entity-based approach outlined above. If it turns out to be generally acceptable, we intend to propose it for inclusion in a future revision of the TEI scheme.

Other possible methods would be to use XLink or XInclude; however, we have not yet investigated these options.

4. Aligning transcription and facsimile

The TEI Guidelines propose a number of methods for aligning parts of an SGML or XML document (see www.tei-c.org/Guidelines/SA). These may be briefly summarised as follows:

  1. use the corresp attribute to assert that a <pb> (or other reference point) in the transcription corresponds with a <pb> in the facsimile (or the reverse);
  2. alternatively, use a stand-off <link> element to assert the association between the two <pb> elements;
  3. for cross-document linking, an intermediate <xptr> may be used as the target.

The following example demonstrates this technique:

     <!-- The transcript document (entity name t1234) -->
<!-- ... -->
<pb n="fol 4" id="M1234.4" corresp="XF1234.4"/>
<!-- ... -->
<xptr id="XF1234.4" entity="f1234" from="id(M1234.4)"/>
<!-- ... -->
     <!-- The facsimile document (entity name f1234) -->
<!-- ... -->
<pb n="fol 4" id="M1234.4" corresp="XT1234.4"/>
<!-- ... -->
<xptr id="XT1234.4" entity="t1234" from="id(M1234.4)"/>
<!-- ... -->

We begin by assigning an ID value to each of the <pb> elements in both transcript and facsimile documents. (Any element could be used for this purpose, of course, but since the purpose of the <pb> element is to provide a reference system common to the two views of the source materials, it seems the most appropriate.) The above example shows a page break which is identified as M1234.4 in both transcript and facsimile, though this is not essential to the method: the page break could have a different identifier in each of the two documents [6].

The corresp attribute asserts a correspondence between the element carrying it and the element whose identifier it supplies. In the example, its value is that of an <xptr> element in the current document (XF1234.4 in the transcript, XT1234.4 in the facsimile). This is necessary because corresp attributes can only be used to point to elements within the current document; if, as here, we wish to point at a different document, an intermediate x-pointer must be supplied.
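By way of contrast, if the transcript and the facsimile were held within a single document (an arrangement assumed here purely for illustration), corresp could point at the other <pb> directly and no intermediate pointer would be needed. The identifiers T1234.4 and F1234.4 are hypothetical; they differ because ID values must be unique within one document:

```xml
<!-- transcript portion of the document -->
<pb n="fol 4" id="T1234.4" corresp="F1234.4"/>
<!-- ... -->
<!-- facsimile portion of the same document -->
<pb n="fol 4" id="F1234.4" corresp="T1234.4"/>
<!-- ... -->
```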

The <xptr> elements specify the target for the correspondence in each case by supplying the entity name for the other document (using the entity attribute) and the location of the target within that document (using an expression in TEI extended pointer syntax on the from attribute).

A variation on this method would be to use standoff markup as follows:

     <!-- The transcript document (entity name t1234) -->
<!-- ... -->
<pb n="fol 4" id="M1234.4"/>
<!-- ... -->
<!-- ... -->

     <!-- The facsimile document (entity name f1234) -->
<!-- ... -->
<pb n="fol 4" id="M1234.4"/>
<!-- ... -->

    <!-- the link section (can be in either entity or elsewhere) -->
<xptr id="XF1234.4" entity="f1234" from="id(M1234.4)"/>
<xptr id="XT1234.4" entity="t1234" from="id(M1234.4)"/>
<!-- ... -->
<link type="correspondence" targets="XF1234.4 XT1234.4"/>

These methods both suffer from the following minor disadvantages:

The verbosity of this approach could be reduced considerably if it were not necessary to use the intermediate <xptr> element. Indeed, the majority of existing digital projects we have examined seem not even to have considered using this mechanism. The favoured approach seems to have been simply to re-purpose the <pb> element to mean something like ‘pointer to external image of current page’, adding an extra attribute to identify the page image [7]. Such an undocumented semantic shift in standard TEI elements is not to be recommended.

In the long term, the best approach would be to use something more easily mapped to (or identical with) the W3C's XLink syntax. Unfortunately, at the time of writing, the XLink specification is still only a candidate recommendation (www.w3.org/TR/xlink/), and has not been implemented in any major browser. Moreover, the TEI work group responsible for co-ordinating it with the TEI extended pointer syntax has yet to be formed. Nevertheless, it seems clear that adding something like an xlink:href attribute as an alternative to the linking attributes corresp and targets would be highly desirable.
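Purely as a sketch of what such an attribute might look like, and not as part of the current TEI scheme, a <pb> in the transcript could point straight at its counterpart in the facsimile. The file name f1234.xml is hypothetical, and the xlink attributes would first have to be added to the DTD before the document could be validated:

```xml
<!-- hypothetical: xlink attributes are not defined on <pb> in the TEI DTD -->
<pb n="fol 4" id="M1234.4"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="simple"
    xlink:href="f1234.xml#M1234.4"/>
```

With a direct reference of this kind, neither the intermediate <xptr> nor the corresp attribute would be needed for the cross-document case.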

5. Sample record

<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.names.dates "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "msDesc.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "msDesc.dtd">
<!ENTITY image1 SYSTEM "http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg">
<!ENTITY image2 SYSTEM "http://www.bodley.ox.ac.uk/canterbury-tales/gp2.jpg">
]>
<TEI.2>
<teiHeader type="text" status="new">
 <fileDesc>
  <titleStmt>
   <title>MS Rawlinson poet. 149: a digital facsimile</title>
  </titleStmt>
  <publicationStmt>
   <publisher>The MASTER consortium</publisher>
  </publicationStmt>
  <sourceDesc>
   <msDescription status="uni">
    <msIdentifier>
     <settlement>Oxford</settlement>
     <repository>Bodleian Library</repository>

1. We do not address here the special case of a print source consisting only of manuscript catalogue records (which may also contain images of the manuscripts). Such works are treated in the same way as transcriptions: they will contain optional prose with embedded <msDescription> elements.

2. The current versions of the TEI DTD require that <figure> elements appear within a <p> element; this is a bug which should be fixed in the next (November 2001) version.

3. Not sure what to do about images scanned from microfilm: presumably a published microfilm has its own metadata, but it seems perverse not to use an <msDescription>.

4. For a useful summary of the rules, see www.xml.com/pub/a/98/08/xmlqna2.html#ENTDECL

5. It is not yet clear whether the XInclude proposals currently under debate will provide a reliable alternative mechanism.

6. ID values must be unique within a single document, but can be duplicated across documents.

7. cite some examples here

Date: (revised 7 Sept 2001)  Author: Richard Gartner and Lou Burnard (revised Lou Burnard) .
© Oxford University Computing Services.