Computers & Texts No. 15
August 1997

Resource Discovery Metadata for Electronic Texts and Linguistic Corpora

Michael Popham
Oxford Text Archive
ota@oucs.ox.ac.uk

A workshop convened by the Oxford Text Archive focused on identifying the metadata essential to finding electronic texts of interest to those working in the fields of literary and linguistic studies, encompassing texts of every type and period. It worked with a broad definition of what might constitute a text in order to consider various forms of text collection, linguistic corpora, and other works.

Significant Problems and Potential Solutions

Arguably, this workshop should have encountered few challenges when evaluating the Dublin Core against its communities' resource discovery needs. The Dublin Core was initially envisaged as metadata for document-like objects, and there has been substantial work 'mapping' between the Dublin Core metadata format and the two text documentation standards on which the group focused its attention (MARC and the Text Encoding Initiative's Header). Despite this, the group identified two significant challenges, which tempered the emerging consensus that the Dublin Core provided a reasonable basis for resource discovery.

A key issue is how much information (metadata) is actually required for any given resource. The consensus was that the more information that could be fed back to a user in response to an enquiry, the easier it would be for that person to identify the resources likely to be of interest.

Variety of users' resource discovery requirements

The workshop focused initially on the needs of literary and linguistic scholars, but rejected early on the possibility of considering the disciplines in any uniform way. The problem is compounded because texts (whether electronic or not) are frequently of interest to scholars working across the humanities and other disciplines, who therefore represent an extremely broad range of resource discovery requirements.

The group did feel, however, that Warwick Framework-style packaging of more detailed and specialist documentation (the Warwick Framework being a container architecture for aggregating sets of metadata) offered a reasonable mechanism for satisfying the resource discovery requirements of diverse user communities. Currently, such a model is employed by academics working with conventional library catalogues to discover paper-based texts. The catalogue provides basic search facilities for author/title, keyword, and subject. The initial enquiry can then be followed by browsing the complete library catalogue record (if available, e.g. online) and/or consulting a copy of the work itself. With this in mind, it was felt that the basic information necessary for the successful discovery of non-electronic resources in literary and linguistic studies would also be sufficient for discovering electronic texts, and that the Dublin Core made a good starting point for satisfying these basic information requirements.

Scope: collection vs. item level description

The problem is easily stated though not easily addressed. In an anthology of verse or the collected works of an individual playwright, should the metadata relate only to description at the collection level, or should each individual work (or even section - e.g. chapter, verse, act, scene) within a collection also have its own descriptive metadata? If the latter, then in certain circumstances (e.g. a collection of works by the same author), perhaps certain metadata could be inherited from the collection-level description by each of the works that constituted the collection? Similarly, the collection-level metadata description should perhaps be sufficient to convey basic information about each of the individual works within the collection (but would this be feasible in the case of, say, an anthology of 500 poems produced by different authors?). These issues are of even greater concern when considering large-scale literary or linguistic corpora, which may contain many thousands of individual texts. The concept of scope also raised a number of related issues, such as the possible requirement to identify discrete resources (e.g. a number of specific texts within a corpus, a specific act within a play), and the need to know whether or not a resource was static or dynamic (i.e. liable to change), as knowing such information might aid initial resource discovery when searching across large volumes of material.

Here the problems seemed more difficult to resolve, and it was later agreed at a meeting of workshop convenors that they were likely to be addressed by individual service or information providers who would weigh up their users' resource discovery needs against the size of their collections and the costs and redundancy entailed in their item-level description.
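The idea of item-level records inheriting from a collection-level description can be illustrated with a minimal Python sketch. It is not a method proposed by the workshop: the `Record` class, the `effective_metadata` function, and the sample values are all hypothetical, and the rule that an item-level element simply overrides the collection-level one is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """A metadata record; `fields` maps an element name to its values."""
    fields: Dict[str, List[str]] = field(default_factory=dict)


def effective_metadata(collection: Record, item: Record) -> Dict[str, List[str]]:
    """Merge item-level metadata over collection-level defaults.

    Assumed rule: an element present on the item overrides the
    collection's value; any element the item omits is inherited.
    """
    merged = {name: list(values) for name, values in collection.fields.items()}
    for name, values in item.fields.items():
        merged[name] = list(values)
    return merged


# Hypothetical collection: the collected works of a single playwright.
collected_plays = Record({
    "Creator": ["Example Playwright"],
    "Title": ["Collected Plays"],
    "Language": ["en"],
})
# The item-level record states only what differs from the collection.
first_play = Record({"Title": ["An Example Play"]})

resolved = effective_metadata(collected_plays, first_play)
print(resolved["Title"])    # item-level value
print(resolved["Creator"])  # inherited from the collection
```

A simple override rule of this kind suits a single-author collection; for an anthology by many authors, most elements would have to be stated at item level, which is the cost and redundancy trade-off described above.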

Recommendations Regarding the Dublin Core

Problematic elements and element usage

Subject and Description presented difficulties with purely literary texts (for example, there are many potential keywords for Shakespeare's play 'Hamlet', but a text about the play might require only a handful of subject keywords), though none were envisaged for linguistic resources.

The relationship between 'Source' and 'Relation' was considered to be confused, and the group felt unsure about where best to express the relations familiar to those studying literary materials (e.g. an adaptation by X of Y's translation of a work by Z).

Type was considered useful but not essential and presented problems as the group was sceptical about the usefulness of the proposed Dublin Core object types. It recommended instead the use of one of the many existing controlled vocabulary lists, such as those used by conventional library cataloguing staff to describe genres of literary resources.

Element qualifiers

The group argued that these were necessary for the Dublin Core 'Title', 'Creator', 'Contributor', 'Date', and 'Identifier' elements. With regard to 'Date' it argued for a controlled list of types allowing for: date of original creation of a work, the publication date of the relevant printed edition of that work, and the release date of the electronic version of the printed edition.
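As a rough illustration of what a controlled list of 'Date' types might look like in practice, the following Python sketch attaches one of the three types named above to each date value. The names `DateType`, `QualifiedDate`, and `dates_of_type`, and the sample years, are hypothetical; they are not part of the Dublin Core or of the workshop's recommendations.

```python
from enum import Enum
from typing import List, NamedTuple


class DateType(Enum):
    """Controlled list of date types (the three suggested by the group)."""
    CREATED = "original creation of the work"
    PRINT_PUBLISHED = "publication of the relevant printed edition"
    ELECTRONIC_RELEASED = "release of the electronic version"


class QualifiedDate(NamedTuple):
    date_type: DateType
    value: str  # kept as plain text; this sketch does no date parsing


def dates_of_type(dates: List[QualifiedDate], wanted: DateType) -> List[str]:
    """Return the values of all dates carrying the requested qualifier."""
    return [d.value for d in dates if d.date_type is wanted]


# Illustrative values only, for an imaginary electronic edition.
example_dates = [
    QualifiedDate(DateType.CREATED, "1850"),
    QualifiedDate(DateType.PRINT_PUBLISHED, "1923"),
    QualifiedDate(DateType.ELECTRONIC_RELEASED, "1997"),
]

print(dates_of_type(example_dates, DateType.ELECTRONIC_RELEASED))
```

Without the qualifier, a search on 'Date' alone could not distinguish a text written in one year from an edition or electronic release issued in another.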

Element schemes

The group felt that there was a particular requirement for the 'Identifier' element to indicate which scheme was being used to identify a resource.
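One way to picture this requirement is to store each identifier together with the name of its scheme, as in the Python sketch below. The `Identifier` type, the `find_identifier` helper, the scheme labels, and the sample values are all assumptions made for illustration, not a syntax defined by the Dublin Core.

```python
from typing import List, NamedTuple, Optional


class Identifier(NamedTuple):
    scheme: str  # the identifying scheme in use (labels here are hypothetical)
    value: str   # the identifier itself, as a string


def find_identifier(identifiers: List[Identifier], scheme: str) -> Optional[str]:
    """Return the first identifier value recorded under `scheme`, if any."""
    for ident in identifiers:
        if ident.scheme.lower() == scheme.lower():
            return ident.value
    return None


# Illustrative identifiers for a single imaginary resource.
example_identifiers = [
    Identifier("URL", "http://example.org/texts/0001"),
    Identifier("LocalShelfmark", "TEXT-0001"),
]

print(find_identifier(example_identifiers, "url"))
print(find_identifier(example_identifiers, "ISBN"))  # no such scheme recorded
```

Recording the scheme explicitly lets a retrieval system decide how to interpret or resolve the value, rather than guessing from its form.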

Implementation issues

The group's discussions pinpointed three key implementation issues:

Participants

Those who attended and contributed to the workshop were: Jean Anderson, STELLA Project Manager (Glasgow University); Lou Burnard, Manager of OUCS' Humanities Computing Unit, Co-editor of the TEI Guidelines, founder of the OTA (Oxford University); Michael Day and Andy Powell (UKOLN); Dr Claire Warwick, Resource Development Officer for the British National Corpus, and also representing Oxford University's English Faculty; Paul Miller, Collections Manager for the Archaeology Data Service (York University); Peter Robinson, Senior Research Fellow for the Institute for Electronic Library Research (De Montfort University); Harold Short, Director of the Centre for Computing in the Humanities; John Bradley, Senior Analyst at the CCH (King's College London); Clive Souter, Lecturer and Deputy Director of the Centre for Computer Analysis of Language and Speech (Leeds University); and the staff of the Oxford Text Archive: Michael Popham (Head of the OTA), Alan Morrison (OTA Information Officer), and Jakob Fix (OTA Computing Officer).

The Dublin Core Metadata Set home page is at http://www.oclc.org:5046/research/dublin_core/.

A full version of this paper is available from http://ota.ox.ac.uk




Computers & Texts 15 (1997), p.15 Not to be republished in any form without the author's permission.

HTML Author: Sarah Porter
Document Created: 8 September 1997

The URL of this document is http://info.ox.ac.uk/ctitext/publish/comtxt/ct15/popham.html