A Syntax for Dublin Core Metadata: Recommendations from the Second Metadata Workshop
Lou Burnard, Eric Miller, Liam Quin, and C.M. Sperberg-McQueen

Unpublished draft circulated to Workshop attendees for revision.

Not to be quoted or cited without permission.

No pre-existing source for this document exists.

19 Apr 96, draft, C. M. Sperberg-McQueen: Revised to clarify status of each bit
17 Apr 96, draft, Eric Miller: Revised following comments from LB, MSM
16 Apr 96, draft, Lou Burnard: Revised following comments from MSM, EM
10 Apr 96, draft, Lou Burnard: Revised after discussion with LQ
Easter Sunday, 1996, draft, Lou Burnard: Initial draft

This document summarizes a set of recommendations concerning the representation of metadata, derived from discussion within the syntax working group which met at the second Metadata Workshop, held at Warwick University in April 1996. The discussion begun in Warwick has been continued electronically by the current authors, and this paper presents both the recommendations agreed on by the syntax working group in Warwick and some further developments for which the authors alone are responsible.

In brief, the syntax working group recommended: that recommendations be made showing how to use the HTML meta element for Dublin-Core metadata (as described below); and that a standard canonical syntax be defined for Dublin-Core metadata, using SGML syntax. The working group defined no DTD, but a possible DTD devised by the authors is given below under "The Dublin Core DTD fragment". Discussions in Warwick also led to an informal demonstration of how SGML could be used as the mechanism for encoding the containers and metadata packages foreseen in the Warwick Framework. A sample DTD for such packages is given under "The Warwick Framework DTD fragment".

Requirements

The following criteria were advanced as desirable features in whatever syntax is to be used in defining a standard format for metadata: ability to express optional, repeatable sequences; verifiability, robustness; compactness; uniformity; ability to express qualifiers; repeatability of groups; machine processability and human comprehensibility; internationalizability; simplicity; formality.

The following functional requirements were identified for the syntax: it should be simple enough for authors to create their own metadata (either directly or via a suitable interface); it must be able to represent at least the elements of the Dublin core; it must support extension by users, with varying degrees of formality and control, so that users can extend the Dublin core; it must be able to express attribute-value-scheme triplets (since the metadata elements are all expressed as such); and it must be able to express optionality and repeatability.

The ability to carry out down-translation to the proposed scheme from existing metadata schemes (specifically richer formats such as MARC, TEI, IAFA templates) was assumed.

There was no discussion of how the following additional requirements might be achieved, though there was a general feeling that they were all highly desirable: mechanisms (such as a name registry) for constraining the range of some or all possible attribute names and values; mechanisms for grouping sub-parts of a specification, e.g. to express the object/instance distinction; and mechanisms for specifying the meaning of repeated sub-parts of a specification, e.g. whether two form components should be regarded as exclusive (copy in one of two forms) or additive (copy in both forms).

The problem of grouping, inheritance, and their meaning is discussed further in a paper by C. M. Sperberg-McQueen, On Information Factoring in Dublin Metadata Records, which is accessible on the World-Wide Web at http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html.

Summary of proposals

This section presents the various possible approaches discussed at the workshop, whether actively recommended by the syntax working group or not. A fuller treatment of some of them is also presented in a paper written by Eric Miller following the first Metadata Workshop (see Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/publications/metadata/issues.html).

The minimal effort approach

The syntax working group recommended that authors, publishers, and site managers be encouraged to provide metadata in HTML documents by means of HTML meta elements embedded in their documents. More elaborate metadata can be provided if the metadata records are external to the HTML document, as described below, but for information providers with limited ambitions, the method described here is recommended.

The assumption here is that existing browsers and search engines cannot be expected to accommodate any variation from current practice. Any additional features must be transparent to existing software and authoring practices.

The meta element of HTML2 should be used, with name and content attributes set to the metadata element's name and value respectively.

Example: an HTML document titled "On the pulse of the morning". [Example markup not preserved in this copy.]
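As an informal illustration of the recommendation above, the following Python sketch renders name/value pairs as HTML meta elements. It is not taken from the workshop materials: the function name `render_meta_elements` and the sample record are assumptions made for illustration only, with values borrowed from the examples elsewhere in this paper.

```python
from html import escape


def render_meta_elements(pairs):
    """Render (name, value) pairs as HTML meta elements, one per line.

    Hypothetical helper: the name attribute carries the metadata
    element's name and the content attribute carries its value.
    """
    lines = []
    for name, value in pairs:
        # quote=True also escapes double quotes, so the value cannot
        # terminate the attribute early.
        lines.append('<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(value, quote=True)))
    return "\n".join(lines)


# Sample record (illustrative values only).
record = [
    ("title", "On the pulse of the morning"),
    ("publisher", "University of Virginia Electronic Text Center"),
    ("language", "en"),
]
print(render_meta_elements(record))
```

The output is intended to be pasted into the head element of an HTML document; existing browsers simply ignore meta elements they do not understand.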

Advantages: No change is needed to existing browsers or search engines. Any set of attribute-value pairs can be represented.

Disadvantages: No constraint can be imposed on the semantics of the attribute names used, and name clashes may occur. Other, possibly inconsistent, conventions are already established for use of the meta elements by other agents. This could, however, be overcome by using a prefix such as DC: in the name attribute.

The order of meta elements within the head element is not significant, and elements cannot be grouped, though a sufficiently determined imagination might conceive of a convention that encodes such structure in the name attribute itself. [Example markup not preserved in this copy.]

Miller's paper, referred to above, also suggests prefixing a group of meta tags which together make up a metadata description of this kind with a particular labelling meta tag.

Without additional attributes such as source and type, considerable overloading of the attribute values is necessary to contain all the information available in the Dublin core. Even in this trivial example, it has been necessary to introduce some arbitrary syntax (the use of the colon and parentheses) to distinguish parts of the name attribute.
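To make the prefix convention concrete, here is a minimal Python sketch of how a Dublin-Core-aware consumer might pick out prefixed meta elements while ignoring others. The class name `DublinCoreMetaCollector` and the sample document are hypothetical; only the idea of a `DC:` prefix comes from the text above.

```python
from html.parser import HTMLParser


class DublinCoreMetaCollector(HTMLParser):
    """Collect meta elements whose name starts with a given prefix."""

    def __init__(self, prefix="DC:"):
        super().__init__()
        self.prefix = prefix
        self.found = []  # list of (name without prefix, content)

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith(self.prefix):
            content = attributes.get("content") or ""
            self.found.append((name[len(self.prefix):], content))


# Hypothetical document: two prefixed elements and one unrelated one.
sample = (
    '<html><head><title>On the pulse of the morning</title>'
    '<meta name="DC:title" content="On the pulse of the morning">'
    '<meta name="keywords" content="poetry">'
    '<meta name="DC:language" content="en">'
    '</head><body></body></html>'
)
collector = DublinCoreMetaCollector()
collector.feed(sample)
print(collector.found)
# [('title', 'On the pulse of the morning'), ('language', 'en')]
```

The unprefixed `keywords` element is skipped, which is how the prefix avoids clashes with conventions established by other agents.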

Furthermore, attribute values are limited in length by the value of LITLEN (1024 according to the official SGML declaration for HTML2), or by other arbitrary limits imposed by particular browsers. A literal cannot contain any tags which a browser might recognize, so another syntax must be invented if subfields of Dublin core elements are required.

Keeping the metadata at arms-length

For more complex metadata records, an unstructured series of meta elements will not suffice; the syntax working group recommended, therefore, that metadata consumers recognize references to external metadata from within the HTML head element.

This approach involves keeping the metadata in a distinct document. Because the metadata is independent of the form of the data proper, free-standing metadata can document with equal facility documents in HTML, ASCII, SGML, PDF, or proprietary formats, images, sound files, maps, etc. Clear endorsement of free-standing metadata, and the construction of metadata catalogs, is thus important for ensuring that metadata is usable for objects on the net which are not also objects on WWW.

Two variants of the encoding syntax for metadata were discussed at the meeting: in the first, the metadata document uses existing HTML elements. In the second it uses some other syntax better suited to the requirements listed above. At the Warwick meeting, the workgroup agreed that this syntax should be expressed using an SGML DTD, and this is the approach which has been followed below. However, there is no reason why some other syntax that meets the functional requirements outlined above could not be invented for this purpose.

A one-way linkage in HTML documents, for example, is effected using the link element:

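[Example markup not preserved in this copy: a link element in the document head referencing the external metadata object pulse.meta.]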
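The following Python sketch suggests how a metadata consumer might recognize such a reference. The `rel` value used here (`meta`) is purely an assumption, since the relationship name used in the original example is not preserved; the class name `MetadataLinkFinder` is likewise hypothetical. Only the target `pulse.meta` comes from this paper.

```python
from html.parser import HTMLParser


class MetadataLinkFinder(HTMLParser):
    """Collect href values of link elements with a given rel value."""

    def __init__(self, rel="meta"):  # "meta" is an assumed rel value
        super().__init__()
        self.rel = rel
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values; compare in lower case.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if self.rel in rel_values and href:
            self.hrefs.append(href)


sample = (
    '<head><title>On the pulse of the morning</title>'
    '<link rel="meta" href="pulse.meta"></head>'
)
finder = MetadataLinkFinder()
finder.feed(sample)
print(finder.hrefs)  # ['pulse.meta']
```

A browser or search engine that does not know the convention simply never looks for the link, which is the sense in which the external metadata costs it nothing.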

Separating the metadata from the document makes it easy for existing browsers and search engines to ignore it if they wish, while those which are Dublin core-aware can access and process it effectively with no additional cost. On the other hand, there may be significant additional costs in ensuring that metadata and data are kept in step and consistent.

The next two sections discuss what exactly might be the contents of the object referenced by pulse.meta.

Mapping Dublin Core elements to HTML

The attribute-value-class triples needed for the Dublin core can be mapped on to any appropriate HTML element. At the meeting, the DL element was suggested, as in the following example (HTML markup not preserved in this copy; terms and definitions are shown as rendered):

Metadata for the Nice Pome

title
    On the pulse of the morning
publisher
    University of Virginia Electronic Text Center
otheragent:transcriber
    University of Virginia Electronic Text Center
date:created/ISO
    1993-01-23
objectType
    poem
form
    1 ASCII file
form/IMT
    text/ASCII
source
    Newspaper stories and oral performance of text at the Presidential inauguration of Bill Clinton
language/ISO 639
    en
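The terms in the example above appear to follow a pattern of the form element, optionally followed by a colon and a type, optionally followed by a slash and a scheme. That reading is an inference from the example, not a rule stated by the working group. Under that assumption, a consumer could split a term into its three parts roughly as follows (the names `Label` and `parse_label` are hypothetical):

```python
from typing import NamedTuple, Optional


class Label(NamedTuple):
    element: str
    subtype: Optional[str]  # the Dublin core "type" qualifier, if any
    scheme: Optional[str]


def parse_label(label: str) -> Label:
    """Split 'element[:type][/scheme]' into its parts (assumed pattern)."""
    # Split off the scheme first: everything after the first "/".
    head, slash, scheme = label.partition("/")
    # Then split the remainder into element and type at the first ":".
    element, colon, subtype = head.partition(":")
    return Label(
        element.strip(),
        subtype.strip() if colon else None,
        scheme.strip() if slash else None,
    )


for term in ["title", "otheragent:transcriber", "date:created/ISO",
             "form/IMT", "language/ISO 639"]:
    print(parse_label(term))
```

The need for such ad hoc parsing illustrates the disadvantage noted below: the conventions are unenforceable, and nothing in HTML itself distinguishes the type from the scheme.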

Advantages: Metadata is cleanly separated from the data. Problems consequent on using attribute values to represent element content are no longer a concern and more powerful structuring abilities (e.g. nesting, repetition) are potentially available.

Disadvantages: Almost anything can go into a metadata description. (Unenforceable) conventions need to be established about how the metadata descriptions are to be mapped to HTML elements. It is not clear how, for example, to represent the SOURCE and TYPE attributes of the Dublin Core without extending HTML2.

This suggested approach did not gain much support from the syntax working group and is not recommended.

Using a Dublin-Core specific syntax

The syntax working group recommended the preparation of an SGML DTD for Dublin-Core metadata records; one such DTD is described below under "The Dublin Core DTD fragment".

The Dublin DTD defines specific elements for the 13 core elements, each of which bears attributes for type and scheme.

Using this syntax, the above example might look like this (element markup not preserved in this copy; only the element content survives):

On the Pulse of Morning
Maya Angelou
University of Virginia Electronic Text Center
University of Virginia Electronic Text Center
1993-01-23
poem
1 ASCII file
text/ASCII
Newspaper stories and oral performance of text at the Presidential inauguration of Bill Clinton
en
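As a rough sketch of what a record in a Dublin-Core specific syntax could look like, the Python function below serializes fields as tagged elements carrying optional type and scheme attributes inside a dublinCore wrapper. The element names, the version value `draft`, the function name `render_dublin_core`, and the HTML-style character escaping are all assumptions made for illustration; the authoritative element and attribute declarations are those of the DTD itself.

```python
from html import escape


def render_dublin_core(fields, version="draft"):
    """Serialize (element, value, type, scheme) tuples as tagged text.

    Hypothetical helper: type and scheme may be None, in which case
    the corresponding attribute is omitted.
    """
    lines = ['<dublinCore version="{}">'.format(escape(version, quote=True))]
    for element, value, subtype, scheme in fields:
        attrs = ""
        if subtype:
            attrs += ' type="{}"'.format(escape(subtype, quote=True))
        if scheme:
            attrs += ' scheme="{}"'.format(escape(scheme, quote=True))
        lines.append("  <{0}{1}>{2}</{0}>".format(
            element, attrs, escape(value, quote=False)))
    lines.append("</dublinCore>")
    return "\n".join(lines)


# Illustrative values borrowed from the example in this paper.
record = [
    ("title", "On the Pulse of Morning", None, None),
    ("author", "Maya Angelou", None, None),
    ("date", "1993-01-23", "created", "ISO"),
    ("form", "text/ASCII", None, "IMT"),
    ("language", "en", None, "ISO 639"),
]
print(render_dublin_core(record))
```

Compared with the DL mapping, the type and scheme no longer have to be packed into the term by punctuation; each has its own attribute.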

Advantages: The syntax makes explicit the semantics of each Dublin core element. Distinct attributes can be defined for scheme and type. Element content could include other tags if subfields are required.

Disadvantages: Only Dublin core elements are provided (but there is an extension field). Discrete packages of metadata cannot be identified and the semantics of repeated elements are not specified.

The Warwick Framework DTD

At the workshop, the authors suggested applying SGML not only to the encoding of Dublin-Core records but also to the creation of metadata packages and containers, as defined in the architectural proposals for the Warwick Framework. This section summarizes the relevant points.

The Warwick Framework DTD builds on the notion of discrete packages of metadata elements discussed at the Warwick Workshop. One such package might contain Dublin core elements; others might contain specialised elements appropriate to other kinds of metadata, or references to other components using other (possibly non-SGML) notations.

No specific package types additional to the Dublin core were discussed in any detail, though it seems likely that other groups will wish to define them. This can be done relatively easily by defining an additional DTD fragment (along the lines of that discussed below). Alternatively, new package types can also be created by using a generic package type called a package, composed of typed metaData or nestable metaGroup elements. This may be easier to define (and avoids possible namespace clashes). Full details of these are given below under "The Warwick Framework DTD fragment".

This approach was explored in some detail by the authors in order to demonstrate that the additional syntax and functionality required by the container approach could be supported directly by SGML, with no need to invent a new syntax and consequently additional ad hoc software.

A collection of packages of the same or different types makes up a container element. This could be linked to from an HTML document in the same way as in the preceding examples (using a link element in the HTML document), or form a part of a multipart MIME message along with the document itself. An example might look like the following (markup not preserved in this copy; surviving content: www.oclc.org, rsch.oclc.org, lou@vax.ox.ac.uk, dev.oclc.org, stu@oclc.org).

The Dublin Core DTD fragment

This DTD fragment, prepared by the authors, is a slightly simplified version of that proposed in the paper by Miller cited above. It defines the following metadata elements, one for each of the components of the Dublin Core, as defined at the first Metadata Workshop:

title
    The name of the object, if it has one.
author
    Names of the persons and organizations primarily responsible for the intellectual content of the resource. Encode one name per element.
otheragent
    Other person(s) and/or organization(s) who have made a significant contribution to the resource. The value of this element should follow the guidelines for the author element. The author and publisher elements are semantically equivalent to instances of this element with the values author and publisher for their type attributes respectively.
publisher
    The agent or agency responsible for making the resource available. The value of this element should follow the guidelines for the author element.
date
    The date of publication in any format (as indicated by the scheme attribute).
subject
    The field of knowledge to which the resource belongs, typically indicated as a series of keywords, possibly taken from a controlled vocabulary as indicated by the scheme attribute.
objectType
    The abstract category of the resource, such as article, image, dictionary, etc.
form
    The particular data representation of the resource. Typically this will be an Internet Media Type (formerly known as MIME content type).
identifier
    String or number used to uniquely identify this resource, for example a URN, or identification number used by some other scheme.
relation
    Relationship of this resource to another resource. This element should specify what the relationship is (using the type attribute).
source
    Objects, either electronic or printed, from which this resource was derived. This is a special case of the relation element.
language
    The natural language(s) of the resource. When more than one language element is specified, it indicates that more than one language is used to a significant degree in the work. No inference should be made about the relative proportions of language content based on the order of appearance of such elements.
coverage
    The spatial extent and/or temporal duration characteristic of the resource, e.g. "19th Century France".

These elements all share the following attributes:

type
    optionally identifies a subcategorization of the metadata
scheme
    identifies the domain or naming scheme from which categorizations are taken

No closed set of values is defined for either of these attributes in the present proposal, though some suggestive examples are to be found in Miller's paper cited above.

These elements and attributes are formally defined as follows: [DTD declarations not preserved in this copy.] The metadata element is described in more detail below, under "Package components".

Any number of any of the above elements may be grouped together to form a single Dublin core metadata description. Such a description is contained by a single dublinCore element, which bears an attribute version to indicate its version status.

This element is formally defined as follows: [DTD declaration not preserved in this copy.]

Several alternative methods have been proposed for defining the scheme and type attributes for various elements, in order to combine the virtues of a controlled vocabulary with the flexibility of an uncontrolled vocabulary: unrestricted values (as shown above), with suggested values given in the written documentation; values restricted to particular schemes and types, with provision for simple modification of the DTD to enable other types or schemes; or provision of both a scheme and an otherScheme attribute, the first using a controlled vocabulary, the second accepting unrestricted values. When this version of this document was prepared, no final consensus had been reached.

The Warwick Framework DTD fragment

This DTD, prepared by the authors, is intended to support the following three objectives: support for the Dublin core DTD as defined above; support for structured generic metadata elements; and a framework within which these two can coexist.

Containers and packages

A document conforming to this DTD is represented by a container element. Each container element consists of a sequence of one or more of the following package-level elements:

dublinCore
    contains one or more of the metadata elements defined in the Dublin core
package
    contains one or more generic metadata or metagroup elements
packageRef
    a reference to some other package

Other package-level elements may be defined at a later date: to facilitate this, the contents of the container element are defined indirectly using a parameter entity (see DTD below).

Package-level elements all share the following attributes: a human-readable name for the element; a Universal Resource Indicator (?) referencing the element; and a version number or name.

Of these attributes, name is required, while the other two are optional. All three have CDATA content.

These elements and attributes are formally defined as follows: [DTD declarations not preserved in this copy.]
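As an informal illustration of the container structure described in this section, the following Python sketch models the three package-level elements as data classes. It is not a substitute for the DTD: the class names, the attribute name `uri`, and the helper `count_package_level` are assumptions, while the required name, the optional version, and the rule that a container holds one or more package-level elements follow the prose above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class PackageRef:
    """A reference to some other package."""
    name: str                      # required, per the text above
    uri: Optional[str] = None      # attribute name is an assumption
    version: Optional[str] = None


@dataclass
class DublinCorePackage:
    """Holds Dublin core elements as (element, value) pairs."""
    name: str
    elements: List[tuple] = field(default_factory=list)
    uri: Optional[str] = None
    version: Optional[str] = None


@dataclass
class Package:
    """Generic package; may also nest other package-level elements."""
    name: str
    nested: List[Union["Package", DublinCorePackage, PackageRef]] = field(
        default_factory=list)
    uri: Optional[str] = None
    version: Optional[str] = None


PackageLevel = Union[Package, DublinCorePackage, PackageRef]


@dataclass
class Container:
    packages: List[PackageLevel]

    def __post_init__(self):
        # A container consists of one or more package-level elements.
        if not self.packages:
            raise ValueError("a container needs at least one package")


def count_package_level(items):
    """Count package-level elements, descending into nested packages."""
    total = 0
    for item in items:
        total += 1
        if isinstance(item, Package):
            total += count_package_level(item.nested)
    return total


container = Container([
    DublinCorePackage(name="core", elements=[("language", "en")]),
    Package(name="local", nested=[PackageRef(name="ref", uri="pulse.meta")]),
])
print(count_package_level(container.packages))  # 3
```

The recursion in `count_package_level` reflects the nesting that the generic package element permits.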

Note that a package may also contain nested package, dublinCore or packageRef elements. This allows considerable flexibility in structuring metadata.

Package components

The components of the dublinCore element were defined above under "The Dublin Core DTD fragment". The package element may contain a sequence of any number of the following sub-elements:

metaData
    contains a single piece of metadata
metaGroup
    contains a group of metadata or nested metaGroup elements

The above elements all share attributes that: categorize the metadata or metagroup; reference the authority or taxonomy within which this piece of metadata is defined; indicate whether this metadata is visible (show), invisible (noshow), or inherits visibility from its immediate parent element (inherit); optionally supply a sort key or other normalized version of this piece of metadata; and indicate whether this metadata should be indexed (index), not indexed (noindex), or should be treated in the same way as its immediate parent element (asparent).
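The inheritance behaviour of the visibility and indexing attributes can be sketched in Python as a simple top-down walk. The attribute names `visibility` and `indexing`, their defaults, and the values assumed at the root (show, index) are hypothetical; only the value sets show/noshow/inherit and index/noindex/asparent come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaNode:
    """A metaData or metaGroup element (hypothetical model)."""
    label: str
    visibility: str = "inherit"   # show | noshow | inherit
    indexing: str = "asparent"    # index | noindex | asparent
    children: List["MetaNode"] = field(default_factory=list)


def resolve(node, parent_visibility="show", parent_indexing="index",
            out=None):
    """Return {label: (visibility, indexing)} with inheritance applied.

    The root-level defaults (show, index) are an assumption.
    """
    if out is None:
        out = {}
    # "inherit" and "asparent" take the immediate parent's effective value.
    visibility = (parent_visibility if node.visibility == "inherit"
                  else node.visibility)
    indexing = (parent_indexing if node.indexing == "asparent"
                else node.indexing)
    out[node.label] = (visibility, indexing)
    for child in node.children:
        resolve(child, visibility, indexing, out)
    return out


tree = MetaNode("group", visibility="noshow", indexing="index", children=[
    MetaNode("a"),                                       # inherits both
    MetaNode("b", visibility="show", indexing="noindex"),  # overrides both
])
print(resolve(tree))
# {'group': ('noshow', 'index'), 'a': ('noshow', 'index'), 'b': ('show', 'noindex')}
```

Because each node passes its effective values down, a setting on a metaGroup applies to every descendant that does not override it.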

These elements and attributes are formally defined as follows: [DTD declarations not preserved in this copy.]