A Syntax for Dublin Core Metadata

Recommendations from the Second Metadata Workshop


Lou Burnard, Eric Miller, Liam Quin, and C.M. Sperberg-McQueen


This document summarizes a set of recommendations concerning the representation of metadata, derived from discussion within the syntax working group which met at the second Metadata Workshop, held at Warwick University in April 1996. The discussion begun in Warwick has been continued electronically by the current authors, and this paper presents both the recommendations agreed on by the syntax working group in Warwick and some further developments for which the authors alone are responsible.

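In brief, the syntax working group recommended that simple metadata be embedded in HTML documents by means of <meta> elements, that metadata consumers recognize references from the HTML <head> element to external metadata records, and that an SGML DTD be prepared for Dublin Core metadata records.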

Discussions in Warwick also led to an informal demonstration of how SGML could be used as the mechanism for encoding the containers and metadata packages foreseen in the Warwick Framework. A sample DTD for such packages is given in section 4 The Warwick Framework DTD fragment.

1 Requirements

The following criteria were advanced as desirable features in whatever syntax is to be used in defining a standard format for metadata:

The following functional requirements were identified for the syntax:

The ability to carry out down-translation to the proposed scheme from existing metadata schemes (specifically richer formats such as MARC, TEI, IAFA templates) was assumed.

There was no discussion of how the following additional requirements might be achieved, though there was a general feeling that they were all highly desirable:

The problem of grouping, inheritance, and their meaning is discussed further in a paper by C. M. Sperberg-McQueen, "On Information Factoring in Dublin Metadata Records," which is accessible on the World-Wide Web at http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html.

2 Summary of proposals

This section presents the various possible approaches discussed at the workshop, whether actively recommended by the syntax working group or not. A fuller treatment of some of them is also presented in a paper written by Eric Miller following the first Metadata Workshop (see Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/publications/metadata/issues.html).

2.1 The minimal effort approach

The syntax working group recommended that authors, publishers, and site managers be encouraged to provide metadata in HTML documents by means of HTML <meta> elements embedded in their documents. More elaborate metadata can be provided if the metadata records are external to the HTML document, as described below, but for information providers with limited ambitions, the method described here is recommended.

The assumption here is that existing browsers and search engines cannot be expected to accommodate any variation from current practice. Any additional features must be transparent to existing software and authoring practices.

The <meta> element of HTML2 should be used, with the name and content attributes set to the metadata element's name and value respectively.

Example:

 
<html>
<head>
<title>On the pulse of the morning</title>
<meta name="title" content="On the pulse of the morning">
<meta name="publisher" content="University of Virginia Electronic Text Center">
<meta name="otheragent:transcriber"
      content="University of Virginia Electronic Text Center">
<meta name='date(ISO)' content="1993-01-23">
<meta name="objectType" content="poem">
<meta name="form" content="1 ASCII file">
<meta name="form(IMT)" content="text/ASCII">
<meta name="source"
      content="Newspaper stories and oral performance of text at the
       Presidential inauguration of Bill Clinton">
<meta name="language(ISO 639)" content="en">
 ...
</head>
<body>
<h1>On the pulse of the morning</h1>
 ...

Advantages: No change is needed to existing browsers or search engines. Any set of attribute-value pairs can be represented.

Disadvantages: No constraint can be imposed on the semantics of the attribute names used, and name clashes may occur. Other, possibly inconsistent, conventions are already established for use of the <meta> elements by other agents. This could however be overcome by using a prefix such as "DC:", e.g.

 
<meta name='DC:date(ISO)' content="1993-01-23">
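
As an illustration only, a metadata consumer might pick out the prefixed elements along the following lines. This is a minimal sketch using Python's standard html.parser module; the class and function names are hypothetical, and the case-insensitive treatment of the "DC:" prefix is an assumption rather than anything agreed by the working group.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect <meta> elements whose name carries the "DC:" prefix."""

    PREFIX = "dc:"  # assumption: the prefix is matched case-insensitively

    def __init__(self):
        super().__init__()
        self.pairs = []  # (name without prefix, content), in document order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content") or ""
        if name.lower().startswith(self.PREFIX):
            self.pairs.append((name[len(self.PREFIX):], content))


def extract_dublin_core(html_text):
    """Return the prefixed name/content pairs found in an HTML document."""
    parser = DublinCoreMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


sample = """<html><head>
<title>On the pulse of the morning</title>
<meta name="DC:title" content="On the pulse of the morning">
<meta name='DC:date(ISO)' content="1993-01-23">
<meta name="generator" content="some other tool">
</head><body></body></html>"""

for name, value in extract_dublin_core(sample):
    print(name, "=", value)
```

Elements without the prefix (here the hypothetical "generator" entry) are simply skipped, which is how the prefix avoids clashes with other conventions for <meta>.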

The order of <meta> elements within the <head> element is not significant, and elements cannot be grouped, though a sufficiently determined imagination might conceive of something like the following:

 
<meta name='DC:groupStart' content='group number 42'>
   <meta name='DC:something' content="something else">
   <!-- more metas here -->
<meta name='DC:groupEnd' content= 'group number 42'>

Miller's paper, referred to above, also suggests prefixing a group of <meta> tags which together make up a metadata description of this kind with a particular labelling <meta> tag such as

 
<meta name='citation' content="Dublin Core">

Without additional attributes such as source and type, considerable `overloading' of the attribute values is necessary to contain all the information available in the Dublin core. Even in this trivial example, it has been necessary to introduce some arbitrary syntax (the use of the colon and parentheses) to distinguish parts of the name attribute.

Furthermore, attribute values are limited in length by the value of LITLEN (1024 according to the official SGML declaration for HTML2), or by other arbitrary limits imposed by particular browsers. A literal cannot contain any tags which a browser might recognize, so another syntax must be invented if subfields of Dublin core elements are required.
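
The cost of this overloading can be seen in the code a consumer needs in order to take the names apart again. The following is a sketch only: it assumes the ad hoc convention of the example above (a colon before the type, parentheses around the scheme, and an optional "DC:" prefix), and the names QualifiedName and split_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class QualifiedName(NamedTuple):
    element: str
    type: Optional[str]
    scheme: Optional[str]


# Assumed convention: [DC:]element[:type][(scheme)]
_NAME_PATTERN = re.compile(
    r"""^\s*(?:DC:)?                 # optional prefix
        (?P<element>[^:()\s]+)       # element name, e.g. date
        (?::(?P<type>[^()]+?))?      # optional type after a colon
        (?:\((?P<scheme>[^)]*)\))?   # optional scheme in parentheses
        \s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def split_name(name):
    """Split an overloaded <meta> name into element, type, and scheme."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized metadata name: {name!r}")
    return QualifiedName(match["element"], match["type"], match["scheme"])


for raw in ("title", "otheragent:transcriber", "date(ISO)", "language(ISO 639)"):
    print(raw, "->", split_name(raw))
```

Every consumer has to agree on this private grammar, which is precisely the arbitrary syntax the paragraph above objects to.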

2.2 Keeping the metadata at arm's length

For more complex metadata records, an unstructured series of <meta> elements will not suffice; the syntax working group recommended, therefore, that metadata consumers recognize references to external metadata from within the HTML <head> element.

This approach involves keeping the metadata in a distinct document. Because the metadata is independent of the form of the data proper, free-standing metadata can describe with equal facility documents in HTML, ASCII, SGML, PDF, or proprietary formats, as well as images, sound files, maps, etc. Clear endorsement of free-standing metadata, and the construction of metadata catalogs, is thus important for ensuring that metadata is usable for objects on the net which are not also objects on the WWW.

Two variants of the encoding syntax for metadata were discussed at the meeting: in the first, the metadata document uses existing HTML elements. In the second it uses some other syntax better suited to the requirements listed above. At the Warwick meeting, the workgroup agreed that this syntax should be expressed using an SGML DTD, and this is the approach which has been followed below. However, there is no reason why some other syntax that meets the functional requirements outlined above could not be invented for this purpose.

A one-way linkage in HTML documents, for example, is effected using the <link> element:

 
<html>
<head>
<link rel='metadata'
      href='pulse.meta'>
</head>
 ...
</html>

Separating the metadata from the document makes it easy for existing browsers and search engines to ignore it if they wish, while those which are Dublin core-aware can access and process it effectively with no additional cost. On the other hand, there may be significant additional costs in ensuring that metadata and data are kept in step and consistent.
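
By way of illustration, a Dublin core-aware consumer might locate the external record roughly as follows. This is a sketch under stated assumptions: the names MetadataLinkFinder and find_metadata_urls are hypothetical, the base URL at example.org is invented for the demonstration, and relative references are assumed to resolve against the address of the HTML document.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetadataLinkFinder(HTMLParser):
    """Collect the href of every <link> whose rel includes 'metadata'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "metadata" in rel_values and href:
            self.hrefs.append(href)


def find_metadata_urls(html_text, base_url):
    """Return absolute addresses of the metadata records a page points to."""
    finder = MetadataLinkFinder()
    finder.feed(html_text)
    finder.close()
    return [urljoin(base_url, href) for href in finder.hrefs]


page = """<html>
<head>
<link rel='metadata'
      href='pulse.meta'>
</head>
</html>"""

# The base URL below is a made-up example, not a real location.
print(find_metadata_urls(page, "http://example.org/poems/pulse.html"))
```

Software that knows nothing of the convention never looks for the link and so pays no cost for it.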

The next two sections discuss what exactly might be the contents of the object referenced by pulse.meta.

2.2.1 Mapping Dublin Core elements to HTML

The attribute-value-class triples needed for the Dublin core can be mapped on to any appropriate HTML element. At the meeting, the <DL> element was suggested, as in the following example:

 
<html>
<head><title>Metadata for the Nice Pome</title></head>
<body>
<dl>
<dt>title</dt>
<dd>On the pulse of the morning</dd>
<dt>publisher</dt>
<dd>University of Virginia Electronic Text Center</dd>
<dt>otheragent:transcriber</dt>
<dd>University of Virginia Electronic Text Center</dd>
<dt>date:created/ISO</dt>
<dd>1993-01-23</dd>
<dt>objectType</dt>
<dd>poem</dd>
<dt>form</dt>
<dd>1 ASCII file</dd>
<dt>form/IMT</dt>
<dd>text/ASCII</dd>
<dt>source</dt>
<dd>Newspaper stories and oral performance of text at the Presidential
inauguration of Bill Clinton</dd>
<dt>language/ISO 639</dt>
<dd>en</dd>
</dl>
</body>
</html>

Advantages: Metadata is cleanly separated from the data. Problems consequent on using attribute values to represent element content are no longer a concern and more powerful structuring abilities (e.g. nesting, repetition) are potentially available.

Disadvantages: Almost anything can go into a metadata description. (Unenforceable) conventions need to be established about how the metadata descriptions are to be mapped to HTML elements. It is not clear, for example, how to represent the SOURCE and TYPE attributes of the Dublin Core without extending HTML2.

This suggested approach did not gain much support from the syntax working group and is not recommended.

2.2.2 Using a Dublin-Core specific syntax

The syntax working group recommended the preparation of an SGML DTD for Dublin-Core metadata records; one such DTD is described below in section 3 The Dublin Core DTD fragment.

The Dublin DTD defines specific elements for the 13 core elements, each of which bears attributes for type and scheme.

Using this syntax, the above example might look like this:

 
<!DOCTYPE dublinCore PUBLIC '-//OCLC//DTD Dublin core v.1//EN'>
<dublinCore>
  <title>On the Pulse of Morning</title>
  <author>Maya Angelou</author>
  <publisher>University of Virginia Electronic Text Center</publisher>
  <otherAgent type='transcriber'>University of Virginia Electronic Text
   Center</otherAgent>
  <date type='created' scheme='ISO'>1993-01-23</date>
  <objectType>poem</objectType>
  <form>1 ASCII file</form>
  <form scheme='IMT'>text/ASCII</form>
  <source>Newspaper stories and oral performance of text at the
    Presidential inauguration of Bill Clinton</source>
  <language scheme='ISO 639'>en</language>
</dublinCore>
 

Advantages: The syntax makes explicit the semantics of each Dublin core element. Distinct attributes can be defined for scheme and type. Element content could include other tags if subfields are required.

Disadvantages: Only Dublin core elements are provided (but there is an extension field). Discrete packages of metadata cannot be identified and the semantics of repeated elements are not specified.
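
To suggest how down-translation from an existing scheme into this syntax might be done, the sketch below writes a list of simple records out as a <dublinCore> element. It is an illustration only: the function names are hypothetical, the input format (element, value, attributes) is an assumption, and numeric character references are used for escaping so that the output does not depend on entity declarations the DTD fragment does not contain.

```python
# Element names permitted inside <dublinCore> by the DTD fragment in section 3.
DUBLIN_CORE_ELEMENTS = {
    "title", "author", "otherAgent", "publisher", "date", "subject",
    "objectType", "form", "identifier", "relation", "source", "language",
    "coverage", "metadata",
}


def sgml_escape(text, quote=False):
    """Escape markup characters using numeric character references."""
    text = text.replace("&", "&#38;").replace("<", "&#60;")
    if quote:
        text = text.replace('"', "&#34;")
    return text


def to_dublin_core(records):
    """Serialize (element, value, attributes) records as a <dublinCore> element."""
    lines = ["<dublinCore>"]
    for element, value, attributes in records:
        if element not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin core element: {element!r}")
        attribute_text = "".join(
            f' {key}="{sgml_escape(val, quote=True)}"'
            for key, val in attributes.items()
        )
        lines.append(f"  <{element}{attribute_text}>{sgml_escape(value)}</{element}>")
    lines.append("</dublinCore>")
    return "\n".join(lines)


print(to_dublin_core([
    ("title", "On the Pulse of Morning", {}),
    ("date", "1993-01-23", {"type": "created", "scheme": "ISO"}),
    ("form", "text/ASCII", {"scheme": "IMT"}),
]))
```

Rejecting unknown element names at this stage mirrors the constraint the DTD itself would impose on a validating parser.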

2.3 The Warwick Framework DTD

At the workshop, the authors suggested applying SGML not only to the encoding of Dublin-Core records but also to the creation of metadata packages and containers, as defined in the architectural proposals for the Warwick Framework. This section summarizes the relevant points.

The Warwick Framework DTD builds on the notion of discrete packages of metadata elements discussed at the Warwick Workshop. One such package might contain Dublin core elements; others might contain specialised elements appropriate to other kinds of metadata, or references to other components using other (possibly non-SGML) notations.

No specific package types additional to the Dublin core were discussed in any detail, though it seems likely that other groups will wish to define them. This can be done relatively easily by defining an additional DTD fragment (along the lines of that discussed below). Alternatively, new package types can be created by using a generic package type called a <package>, composed of typed <metadata> or nestable <metaGroup> elements. This may be easier to define (and avoids possible namespace clashes). Full details of these are given below in section 4 The Warwick Framework DTD fragment.

This approach was explored in some detail by the authors in order to demonstrate that the additional syntax and functionality required by the `container' approach could be supported directly by SGML, with no need to invent a new syntax and consequently additional ad hoc software.

A collection of packages of the same or different types makes up a <container> element. This could be linked to from an HTML document in the same way as in the preceding examples (using a <link> element in the HTML document), or form a part of a multipart MIME message along with the document itself. An example might look like the following:

 
 
<!DOCTYPE container PUBLIC '-//OCLC//DTD Warwick Framework Demo v.1//EN'>
<container>
 
<dublinCore>
  <!-- etc. as in preceding example -->
</dublinCore>
 
<package URI='hdl:oclc:repository/tc'
         name='OCLC Standard Terms and Conditions / Set FPC45'
         version='1.0'>
 <metadata type='permit'>www.oclc.org
 <metadata type='permit'>rsch.oclc.org
 <metadata type='permit'>lou@vax.ox.ac.uk
 <metadata type='deny'>dev.oclc.org
 <metadata type='inquiries'>stu@oclc.org
</package>
 
</container>

3 The Dublin Core DTD fragment

This DTD fragment, prepared by the authors, is a slightly simplified version of that proposed in the paper by Miller cited above. It defines the following metadata elements, one for each of the components of the Dublin Core, as defined at the first Metadata Workshop:

title
The name of the object, if it has one.
author
Names of the persons and organizations primarily responsible for the intellectual content of the resource. Encode one name per element.
otherAgent
Other person(s) and/or organization(s) who have made a significant contribution to the resource. The value of this element should follow the guidelines for the <author> element. The author and publisher elements are semantically equivalent to instances of this element with the values "author" and "publisher" for their type attributes respectively.
publisher
The agent or agency responsible for making the resource available. The value of this element should follow the guidelines for the <author> element.
date
The date of publication in any format (as indicated by the scheme attribute).
subject
The field of knowledge to which the resource belongs, typically indicated as a series of keywords, possibly taken from a controlled vocabulary as indicated by the scheme attribute.
objectType
The abstract category of the resource, such as article, image, dictionary, etc.
form
The particular data representation of the resource. Typically this will be an Internet Media Type (formerly known as MIME content type).
identifier
String or number used to uniquely identify this resource, for example a URN, or identification number used by some other scheme.
relation
Relationship of this resource to another resource. This element should specify what the relationship is (using the type attribute).
source
Objects, either electronic or printed, from which this resource was derived. This is a special case of the <relation> element.
language
The natural language(s) of the resource. When more than one <language> element is specified, it indicates that more than one language is used to a significant degree in the work. No inference should be made about the relative proportions of language content based on the order of appearance of such elements.
coverage
The spatial extent and/or temporal duration characteristic of the resource, e.g. "19th Century France".

These elements all share the following attributes:

type
optionally identifies a subcategorization of the metadata
scheme
identifies the domain or naming scheme from which categorizations are taken

No closed set of values is defined for either of these attributes in the present proposal, though some suggestive examples are to be found in Miller's paper cited above.

These elements and attributes are formally defined as follows:

 
<!ENTITY % a.global '
          type               CDATA               #IMPLIED
          scheme             CDATA               "uncontrolled"'>
 
<!ELEMENT title         - O  (#PCDATA)                          >
<!ATTLIST title              %a.global                          >
<!ELEMENT author        - O  (#PCDATA)                          >
<!ATTLIST author             %a.global                          >
<!ELEMENT otherAgent    - O  (#PCDATA)                          >
<!ATTLIST otherAgent         %a.global                          >
<!ELEMENT publisher     - O  (#PCDATA)                          >
<!ATTLIST publisher          %a.global                          >
<!ELEMENT date          - O  (#PCDATA)                          >
<!ATTLIST date               %a.global                          >
<!ELEMENT subject       - O  (#PCDATA)                          >
<!ATTLIST subject            %a.global                          >
<!ELEMENT objectType    - O  (#PCDATA)                          >
<!ATTLIST objectType         %a.global                          >
<!ELEMENT form          - O  (#PCDATA)                          >
<!ATTLIST form               %a.global                          >
<!ELEMENT identifier    - O  (#PCDATA)                          >
<!ATTLIST identifier         %a.global                          >
<!ELEMENT relation      - O  (#PCDATA)                          >
<!ATTLIST relation           %a.global                          >
<!ELEMENT source        - O  (#PCDATA)                          >
<!ATTLIST source             %a.global                          >
<!ELEMENT language      - O  (#PCDATA)                          >
<!ATTLIST language           %a.global                          >
<!ELEMENT coverage      - O  (#PCDATA)                          >
<!ATTLIST coverage           %a.global                          >
<!ELEMENT metadata      - O  (#PCDATA)                          >
<!ATTLIST metadata           %a.global                          >

The <metadata> element is described in more detail below, in section 4 The Warwick Framework DTD fragment.

Any number of any of the above elements may be grouped together to form a single Dublin core metadata description. Such a description is contained by a single <dublinCore> element, which bears an attribute version to indicate its version status.

This element is formally defined as follows:

 
<!ELEMENT dublinCore    - O  (title
                             | author
                             | otherAgent
                             | publisher
                             | date
                             | subject
                             | objectType
                             | form
                             | identifier
                             | relation
                             | source
                             | language
                             | coverage
                             | metadata)*                       >
<!ATTLIST dublinCore
          version            CDATA               #IMPLIED       >

Several alternative methods have been proposed for defining the scheme and type attributes for various elements, in order to combine the virtues of a controlled vocabulary with the flexibility of an uncontrolled vocabulary:

When this version of this document was prepared, no final consensus had been reached.

4 The Warwick Framework DTD fragment

This DTD, prepared by the authors, is intended to support the following three objectives:

4.1 Containers and packages

A document conforming to this DTD is represented by a <container> element. Each <container> element consists of a sequence of one or more of the following package-level elements:

dublinCore
contains one or more of the metadata elements defined in the Dublin core
package
contains one or more generic metadata or metagroup elements
packageRef
a reference to some other package

Other package-level elements may be defined at a later date: to facilitate this, the contents of the <container> element are defined indirectly using a parameter entity (see DTD below).

Package-level elements all share the following attributes:

name
human-readable name for the element
URI
Uniform Resource Identifier referencing the element
version
version number or name

Of these attributes, name is required, while the other two are optional. All three take CDATA values.

These elements and attributes are formally defined as follows:

 
<!ENTITY % packageType 'package | dublinCore | packageRef'      >
 
<!ELEMENT container     - O  (%packageType)*                    >
<!ELEMENT package       - O  (metadata | metaGroup |
                             %packageType)*                     >
<!ATTLIST package
          name               CDATA               #REQUIRED
          URI                CDATA               #IMPLIED
          version            CDATA               #IMPLIED       >
<!ELEMENT packageRef    - O  EMPTY                              >
<!ATTLIST packageRef
          URI                CDATA               #IMPLIED
          name               CDATA               #REQUIRED
          version            CDATA               #IMPLIED       >
 

Note that a <package> may also contain nested <package>, <dublinCore> or <packageRef> elements. This allows considerable flexibility in structuring metadata.
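
The nesting can be pictured with a small in-memory model. The sketch below is illustrative only: the class names are hypothetical, the <dublinCore>, <metadata>, and <metaGroup> members are left out for brevity, and the packageRef named "shared definitions" is invented for the demonstration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class PackageRef:
    """Stands for <packageRef>: a pointer to a package held elsewhere."""
    name: str                      # #REQUIRED in the DTD
    uri: Optional[str] = None      # #IMPLIED
    version: Optional[str] = None  # #IMPLIED


@dataclass
class Package:
    """Stands for <package>; nested package-level members are allowed."""
    name: str
    uri: Optional[str] = None
    version: Optional[str] = None
    members: List[Union["Package", PackageRef]] = field(default_factory=list)


def walk(item, depth=0):
    """Yield (depth, name) for a package-level item and everything nested in it."""
    yield depth, item.name
    if isinstance(item, Package):
        for member in item.members:
            yield from walk(member, depth + 1)


terms = Package(
    name="OCLC Standard Terms and Conditions / Set FPC45",
    uri="hdl:oclc:repository/tc",
    version="1.0",
    members=[PackageRef(name="shared definitions")],  # hypothetical reference
)

for depth, name in walk(terms):
    print("  " * depth + name)
```

Because name is the only required attribute, it is the only field without a default in either class.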

4.2 Package components

The components of the <dublinCore> element were defined above in section 3 The Dublin Core DTD fragment. The <package> element may contain a sequence of any number of the following sub-elements:

metadata
contains a single piece of metadata
metaGroup
contains a group of <metadata> or nested <metaGroup> elements

The above elements all share the following attributes:

type
categorizes the metadata or metagroup
scheme
references the authority or taxonomy within which this piece of metadata is defined
show
indicates whether this metadata is visible (show), invisible (noshow), or inherits visibility from its immediate parent element (inherit).
sortKey
optionally supplies a sort key or other normalized version of this piece of metadata.
index
indicates whether this metadata should be indexed (index), not indexed (noindex), or should be treated in the same way as its immediate parent element (asparent).

These elements and attributes are formally defined as follows:

 
<!ELEMENT metaGroup     - O  (#PCDATA | metadata | metaGroup)*  >
<!ELEMENT metadata      - O  (#PCDATA | metadata)*              >
<!ATTLIST metadata
          type               CDATA               #REQUIRED
          scheme             CDATA               'uncontrolled'
          show               (show
                             | noshow
                             | inherit)          inherit
          sortkey            CDATA               #IMPLIED
          index              (index
                             | noindex
                             | asparent)         asparent       >
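
The defaults inherit and asparent imply that an application must look to the enclosing element to decide whether a given piece of metadata is shown or indexed. One possible resolution is sketched below. It is an assumption-laden illustration: the MetaNode class and resolve function are hypothetical, and the choice to treat top-level elements as visible and indexed when nothing says otherwise is ours, since the DTD does not specify it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaNode:
    """A <metadata> or <metaGroup> element reduced to the attributes of interest."""
    type: str
    show: str = "inherit"     # "show", "noshow", or "inherit"
    index: str = "asparent"   # "index", "noindex", or "asparent"
    children: List["MetaNode"] = field(default_factory=list)


def resolve(node, parent_show="show", parent_index="index"):
    """Yield (type, visible, indexed) for a node and all of its descendants.

    Assumption: at the top level, where there is no parent to consult,
    metadata is treated as visible and indexed.
    """
    show = parent_show if node.show == "inherit" else node.show
    index = parent_index if node.index == "asparent" else node.index
    yield node.type, show == "show", index == "index"
    for child in node.children:
        yield from resolve(child, show, index)


group = MetaNode(
    "access",
    show="noshow",
    children=[
        MetaNode("permit"),                                  # takes both settings from the group
        MetaNode("inquiries", show="show", index="noindex"), # overrides both
    ],
)

for entry in resolve(group):
    print(entry)
```

Passing the resolved values down, rather than the raw attribute values, lets a setting made on an outer group reach elements nested several levels below it.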