Oxford University Computing Services
Draft Recommendations for TEI Digital Facsimiles
This document outlines a set of recommendations for using the TEI encoding scheme, with the MASTER extensions, to represent the following distinct kinds of object:
In each of the above cases, there is an additional need to record appropriate metadata, combining detailed cataloguing information about the manuscript or print source represented with technical metadata relating to its digital representation. These recommendations do not address the scope and content of either kind of metadata; our focus is on where such metadata should be located within the overall TEI structure.
Our recommendations attempt to address the following particular areas of concern:
As far as possible, our recommendations conform to the Text Encoding in Libraries: Guidelines for Best Encoding Practices, produced by the Digital Library Federation working party on XML and TEI in Digital Libraries in July 1999. Attention has also been paid to existing practice in creation of such resources, derived from largely anecdotal evidence. There is however scope for a more exhaustive survey of current practice in this area, which we have not yet been able to undertake.
See the bottom of this file for a complete sample record, illustrating the recommended form of a file used to encode a digital facsimile.
The content of a transcription should be marked up as a single
<TEI.2> element using the
standard TEI elements <text>, <body>,
<div>, etc. from the TEI core tag sets. Additional
elements from the additional tagsets for physical description or
for text critical editions may also be appropriate. A
transcription will typically be taken from a single source;
when it is not, the encoder may choose to combine transcriptions
into a single entity using either the <group> or (where
this is inappropriate) the <teiCorpus> element.
In a transcription, metadata relating to the source itself
should be recorded within the <sourceDesc> element of the
appropriate <teiHeader> element. If such metadata is of
non-trivial scope, and relates to a manuscript source, it should be recorded using the MASTER
<msDescription> element embedded within the <sourceDesc> element.
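By way of illustration, a source description of this kind might be sketched as follows. The settlement and repository are taken from the sample record at the end of this document; the shelfmark given in <idno> is inferred from that record's title and should be treated as an assumption, as should the choice of child elements shown.

```xml
<sourceDesc>
  <msDescription status="uni">
    <msIdentifier>
      <settlement>Oxford</settlement>
      <repository>Bodleian Library</repository>
      <!-- assumed shelfmark, inferred from the sample record's title -->
      <idno>MS Rawlinson poet. 149</idno>
    </msIdentifier>
    <!-- further MASTER elements describing the manuscript would follow here -->
  </msDescription>
</sourceDesc>
```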
If the source being transcribed includes significant
illustrative material, this should be marked at the appropriate
location within the transcript, using the standard
TEI <figure> element. For example:
<!-- example to be supplied -->
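Until the authors' own example is available, a minimal sketch of such a placeholder might look like this. The identifier and the wording of the description are hypothetical, and the enclosing <p> reflects the current DTD restriction mentioned in note 2.

```xml
<p>
  <!-- hypothetical illustration: no digital image is referenced -->
  <figure id="ill1">
    <figDesc>Full-page miniature facing the opening of the text</figDesc>
  </figure>
</p>
```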
Note that this element does not
require there to be any associated digital version of the
illustration, although encoders will often wish to include
one. Technically, such a <figure> represents a
transclusion: that is, it functions within the
transcript simply as a placeholder for the image.
For accessibility purposes, it is good practice always to include
a <figDesc> element within the body of a <figure>
which supplies descriptive text for use when the image itself
cannot be displayed. This element can also be used to supply descriptive
or topical metadata about the content of the image, where this is
not available from the TEI Header. For example:
<!-- examples to be supplied -->
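As a sketch only, a <figDesc> carrying both fallback text and topical information might be encoded as follows. The entity name image1 is the one declared in the section on referencing images below; the description itself is invented for illustration.

```xml
<figure entity="image1">
  <!-- hypothetical description, used when the image cannot be displayed -->
  <figDesc>Pen drawing of a seated scribe at a sloping desk, holding
    a quill and a knife; marginal, lower right of the page</figDesc>
</figure>
```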
Where digital versions of such transcluded figures are
available, they form part of the transcript. The mechanics for
including non-SGML data such as digital images within an SGML or
XML document are discussed further below (3. Techniques for referencing images). Note that any technical metadata
relating to such images should be included in the
<encodingDesc> of the associated TEI Header.
An illustration may contain text, such as a heading, or even a
distinct text not forming part of the text in which the
illustration appears. Headings should be encoded using the
<head> element within the <figure>; nested text
should be encoded using the <text> element within the
<figure>. For example:
<!-- examples to be supplied -->
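A minimal sketch of a figure with both a heading and a nested text might look like this. The identifier, heading, description and inscription are all hypothetical, and the order of child elements follows our reading of the standard TEI content model for <figure>.

```xml
<figure id="ill2">
  <!-- heading belonging to the illustration itself -->
  <head>The Wheel of Fortune</head>
  <figDesc>Diagram of a wheel with four crowned figures</figDesc>
  <!-- distinct text carried by the illustration -->
  <text>
    <body>
      <p>Regnabo, regno, regnavi, sum sine regno</p>
    </body>
  </text>
</figure>
```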
The <pb> or other appropriate milestone element
(e.g. <cb>) should be used to mark reference points within
a transcription.
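For instance, page and column boundaries might be marked as in the following sketch; the values of the n attribute and the text shown are placeholders rather than a recommended numbering scheme.

```xml
<!-- start of folio 15 recto -->
<pb n="15r"/>
<p>First lines of the text, written in the first column
<!-- start of the second column on the same page -->
<cb n="b"/>
continuation of the same paragraph in the second column</p>
```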
For further, complete examples, see the MASTER Reference Guide.
A digital facsimile should be marked up as a <TEI.2> element
in its own right, since it is a distinct object from the
manuscript. The TEI structural tags (<body>, <div>,
etc.) may be used if desired to mark the internal organization of
the manuscript. If, as is often the case, the boundaries of
structural units such as chapters or texts do not coincide
exactly with page boundaries, this may not be possible, or may
require some special treatment in the encoding (see
www.tei-c.org/Guidelines/NH.html for some
suggestions).
Each distinct image making up the facsimile should be encoded
as a <figure> element, arranged in the normal reading sequence of the
facsimile.2
For example:
<!-- examples to be supplied -->
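A minimal sketch of the body of such a facsimile, assuming two page images declared as the entities image1 and image2 (as in the section on referencing images below), might be as follows. The folio numbers are hypothetical, and each <figure> is wrapped in a <p> only because of the DTD restriction described in note 2.

```xml
<text>
  <body>
    <div>
      <!-- one figure per page image, in reading order -->
      <pb n="1r"/>
      <p><figure entity="image1">
        <figDesc>[Image of fol 1 recto]</figDesc>
      </figure></p>
      <pb n="1v"/>
      <p><figure entity="image2">
        <figDesc>[Image of fol 1 verso]</figDesc>
      </figure></p>
    </div>
  </body>
</text>
```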
As noted above, it is good practice always to include
a <figDesc> element within the body of a <figure>
for accessibility purposes. In the case of a digital facsimile,
the <figDesc> element should only contain a standard text such
as ‘[Image of fol 15 recto]’; any descriptive metadata should
be collected together into the appropriate part of the
Header.
Where a number of alternative versions of a page image are
available, for example at different resolutions, recommended
practice is not to combine the alternatives into a single
facsimile. Selection of the appropriate resolution is a rendering
issue which should not affect the encoding of the
document. Alternatively, where it is thought appropriate,
<figure> elements may be self-nested to
show that one image logically contains others. For example:
<figure id="F1">
  <figDesc>[Image of folio 1]</figDesc>
  <figure id="F1a">
    <figDesc>[Folio 1: Detail of upper part]</figDesc>
  </figure>
  <figure id="F1b">
    <figDesc>[Folio 1: Detail of lower part]</figDesc>
  </figure>
</figure>
Note that any elements containing text will be assumed to be part of the facsimile rather than its source.
The <pb> or other appropriate milestone element
(e.g. <cb>) should be used to mark reference points within
a set of facsimiles, as it is inside a transcribed text. If,
further, the <pb> elements are given identifiers, they may
be used to align transcription and facsimile pages by
standoff markup, as further discussed below (4. Aligning transcription and facsimile).
As with a transcription, metadata relating to the manuscript or printed
source should appear in the <sourceDesc> of the
associated TEI Header, within a <msDescription> if the
images were taken from a manuscript, or within a <bibl> if
the images were taken from a published item3.
The traditional SGML/XML method for handling non-SGML (or
unparsed) data such as a
digital image within an
SGML/XML document is to declare the file containing the image as an
external entity and then reference that entity.4 The
following entity declarations (for example) define entities
called image1 and image2,
associating each with a URL, and specifying for each that it
uses a non-SGML notation called JPG:
<!ENTITY image1 SYSTEM
"http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg" NDATA JPG>
<!ENTITY image2 SYSTEM
"http://www.bodley.ox.ac.uk/canterbury-tales/gp2.jpg" NDATA JPG>
The names used for such entities have no particular significance, but should be unique within the document. Several methods may be used to identify the external entity itself (the URLs in the above example), including operating system filenames; for reliability and portability, however, we recommend the use of a non-qualified URL. The notation name ("JPG" in the above example) must be defined in the associated DTD.
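A notation declaration of the kind required might be sketched as follows. The system identifier shown is an assumption made for illustration; the DTD in use may already declare a suitable notation under a different name, in which case that name should be used in the entity declarations instead.

```xml
<!-- hypothetical declaration: associates the notation name JPG
     with JPEG image data -->
<!NOTATION JPG SYSTEM "image/jpeg">
```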
As noted above, the <figure> element is used to indicate the
point in a document at which an image is to be transcluded. In
the standard TEI scheme, it carries an entity
attribute, the value of which should be a previously-declared
entity. Thus, the point in the document at which the first of
the two images declared above should appear will be marked as
follows:
<figure entity="image1"/>
Though well established and entirely practical, this method has two distinct disadvantages:
One simple solution would be to add a url attribute
to the <figure> element, as an alternative to the existing
entity attribute, which would enable the association
to be done directly, like this:
<figure url="http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg"/>
As well as being entirely compatible with XML Schema, this approach clearly simplifies the task of preparing documents, though at the expense of increasing the difficulty of maintaining them, particularly if the entities involved are referenced from many places.
These problems are not, of course, uniquely a matter of concern to those producing digital facsimiles. We think that the simple solution proposed is worth further investigation as an alternative to the continued use of the entity-based approach outlined above. If it turns out to be generally acceptable, we intend to propose it for inclusion in a future revision of the TEI scheme.
Other possible methods would be to use XLink or XInclude; however, we have not yet investigated these options.
The TEI Guidelines propose a number of methods for aligning parts of an SGML or XML document (see www.tei-c.org/Guidelines/SA). These may be briefly summarised as follows:
use the corresp attribute to indicate that a <pb> (or other reference point) in the transcription
corresponds with a <pb> in the facsimile (or the reverse);
use a <link> element to assert the association
between the two <pb> elements; an <xptr> may be used as the
target.
The following example demonstrates this technique:
<!-- The transcript document (entity name t1234) -->
<TEI.2>
<!-- ... -->
<pb n="fol 4" id="M1234.4" corresp="XF1234.4"/>
<!-- ... -->
<xptr id="XF1234.4" entity="f1234" from="id(M1234.4)"/>
<!-- ... -->
</TEI.2>
<!-- The facsimile document (entity name f1234) -->
<TEI.2>
<!-- ... -->
<pb n="fol 4" id="M1234.4" corresp="XT1234.4"/>
<!-- ... -->
<xptr id="XT1234.4" entity="t1234" from="id(M1234.4)"/>
<!-- ... -->
</TEI.2>
We begin by assigning an ID value to each of the <pb> elements
in both transcript and facsimile documents. (Any element could be
used for this purpose, of course, but since the purpose of the
<pb> element is to provide a reference system common to
the two views of the source materials it seems most
appropriate). The above example shows a page break which is
identified as M1234.4 in both transcript and facsimile, though
this is not essential to the method: the page break could have
a different identifier in the two documents6.
The corresp attribute asserts a correspondence between
the element carrying it and the element whose identifier it
supplies. In the example, its value is that of an <xptr>
element in the current document (XF1234.4 in the transcript,
XT1234.4 in the facsimile). This is necessary because
corresp attributes can only be used to point to
elements within the current document; if, as here, we wish to
point at a different document, an intermediate x-pointer must be
supplied.
The <xptr> elements specify the target for the
correspondence in each case by supplying the entity name for the
other document (using the entity attribute) and the
location of the target within that document (using an expression
in TEI extended pointer syntax on the from attribute).
A variation on this method would be to use standoff markup as follows:
<!-- The transcript document (entity name t1234) -->
<TEI.2>
<!-- ... -->
<pb n="fol 4" id="M1234.4"/>
<!-- ... -->
<!-- ... -->
</TEI.2>
<!-- The facsimile document (entity name f1234) -->
<TEI.2>
<!-- ... -->
<pb n="fol 4" id="M1234.4"/>
<!-- ... -->
</TEI.2>
<!-- the link section (can be in either entity or elsewhere) -->
<xptr id="XF1234.4" entity="f1234" from="id(M1234.4)"/>
<xptr id="XT1234.4" entity="t1234" from="id(M1234.4)"/>
<!-- ... -->
<link type="correspondence" targets="XF1234.4 XT1234.4"/>
These methods both suffer from the following minor disadvantages:
The verbosity of this approach could be reduced considerably if
it were not necessary to use the intermediate <xptr>
element. Indeed, the majority of existing digital projects we
have examined seem not even to have considered using this mechanism.
The favoured approach seems to have been simply to re-purpose the
<pb> element to mean something like ‘pointer to external
image of current page’, adding an extra attribute to identify
the page image7. Such an undocumented semantic shift in standard TEI
elements is not to be recommended.
In the long term, the best approach would be to use something more easily mapped to (or identical with) the W3C's XLink syntax. Unfortunately, at the time of writing, the XLink specification is still only a candidate recommendation (www.w3.org/TR/xlink/), and has not been implemented in any major browser. Moreover, the TEI work group responsible for co-ordinating it with the TEI extended pointer syntax has yet to be formed. Nevertheless, it seems clear that adding something like an xlink:href attribute as an alternative to the linking attributes corresp and targets would be highly desirable.
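Purely as a sketch of what such an attribute might look like, the page break from the transcript in the earlier example could point directly at its counterpart in the facsimile. The xlink:href attribute is not part of the current TEI scheme, and the file name f1234.xml is hypothetical; only the XLink namespace URI is taken from the W3C specification.

```xml
<!-- hypothetical: not valid against the current TEI DTD -->
<pb n="fol 4" id="M1234.4"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:href="f1234.xml#M1234.4"/>
```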
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.names.dates "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "msDesc.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "msDesc.dtd">
<!ENTITY image1 SYSTEM "http://www.bodley.ox.ac.uk/canterbury-tales/gpl.jpg">
<!ENTITY image2 SYSTEM "http://www.bodley.ox.ac.uk/canterbury-tales/gp2.jpg">
]>
<TEI.2>
  <teiHeader type="text" status="new">
    <fileDesc>
      <titleStmt>
        <title>MS Rawlinson poet. 149: a digital facsimile</title>
      </titleStmt>
      <publicationStmt>
        <publisher>The MASTER consortium</publisher>
      </publicationStmt>
      <sourceDesc>
        <msDescription status="uni">
          <msIdentifier>
            <settlement>Oxford</settlement>
            <repository>Bodleian Library</repository>
1. We do not address here the special case of a print source
consisting only of manuscript catalogue records (which may also
contain images of the manuscripts). Such works are treated in the
same way as transcriptions: they will contain optional prose with
embedded <msDescription> elements.
2. The current versions of the TEI DTD require that
<figure> elements appear within a <p> element; this
is a bug which should be fixed in the next (November 2001) version.
3. Not sure what to do about images scanned from microfilm: presumably a published microfilm has its own metadata, but it seems perverse not to use an <msDescription>.
4. For a useful summary of the rules, see www.xml.com/pub/a/98/08/xmlqna2.html#ENTDECL
5. It is not yet clear whether the XInclude proposals currently under debate will provide a reliable alternative mechanism.
6. ID values must be unique within a single document, but can be duplicated across documents.
7. cite some examples here