Oxford University Computing Services
Towards a TEI version of the Meter corpus: preliminary report
This document suggests a way of converting the Meter corpus, as distributed on CD ROM at the Lancaster CL-2001 conference, into a TEI conformant structure. It is a report of work in progress, and has no normative standing. The initial approach taken has been to identify the path of least resistance towards making the Meter corpus parse against the existing TEI DTD. No attempt has yet been made to define a specialised view of that DTD more appropriate to the Meter corpus, desirable though that objective is.
The first problem in building any TEI corpus is deciding what its overall structure should be. The TEI has rather specific notions of how electronic documents should be constructed, offering a range of choices in support of them. Making the right choice for a particular corpus is not always easy, not so much because the distinctions are imprecise as because it is often hard to think through their implications. I will try to justify my suggestions below.
First observation: the Meter corpus is not made up of complete newspapers, or even categories of stories. It is composed of stories selected from those available under a specific strapline, in various places on a certain day in one of two categories. A particular combination of strapline, day and category constitutes a major structural unit, and it is probably more appropriate to combine all stories with a common strapline than to combine all those with a common date.
Second observation: such metadata as is available (and there is very little) applies at the level of particular sources (e.g. date and title of source newspaper). There is much less to be said about individual stories (just page numbers, as far as I can see). So we can use the TEI Header at quite a high level.
Third observation: most structural information about the corpus as currently delivered is embedded within the file structure used to store it, rather than explicitly within its markup. Conversion to any form of SGML or XML markup, but in particular to TEI, makes this policy at best unnecessary; at worst, it infringes the TEI commandment "Thou shalt have no other encoding scheme besides me".
Consider, for example, the file path
meter_corpus/newspapers/annotated/showbiz/07.01.00/zeta/zeta144_mirror.sgml
which shows that story number 144 comes from the Daily Mirror, has the strapline ‘Zeta’, date 07.01.00, and is in the showbiz category. (Unfortunately, the story numbers are not unique across the collection, so they cannot be used as identifiers.)
I propose to make part of this hierarchy explicit within the XML markup structure, thus making it independent of the storage medium, and also simplifying a wider range of analyses. I do not however consider that the "showbiz/courts" distinction need be represented within the structure: it is simply an attribute of the stories concerned and should be represented along with other metadata. I also propose to ignore the distinction between annotated and original, since I am concerned here only with encoding the annotated form of the corpus.
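One way of recording the showbiz/courts distinction as metadata is sketched below, using the standard TEI header elements for classification. This is an illustration under assumptions rather than part of the proposal: the taxonomy identifier meter.cat, the category identifiers and the descriptions are all hypothetical, and it assumes that every story in a given header's scope belongs to the same category.

```xml
<!-- In the corpus header: declare the two categories once (identifiers are hypothetical) -->
<encodingDesc>
  <classDecl>
    <taxonomy id="meter.cat">
      <category id="showbiz"><catDesc>Show business stories</catDesc></category>
      <category id="courts"><catDesc>Court reporting</catDesc></category>
    </taxonomy>
  </classDecl>
</encodingDesc>

<!-- In the header of an individual text: point at the category that applies -->
<profileDesc>
  <textClass>
    <catRef target="showbiz"/>
  </textClass>
</profileDesc>
```

Declaring the categories once and pointing at them keeps the category out of the element hierarchy while still making it available to any analysis that reads the header.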
I propose to map individual stories to TEI <div> elements and straplines to TEI <text> elements. All <text>s on a given date are regarded as constituting a <group>, which is the <body> of a (higher-level) <text>. So the example cited would fit into a structure like this:
<TEIcorpus.2>
  <teiHeader>
    <!-- metadata relating to the whole meter corpus -->
  </teiHeader>
  <TEI.2>
    <teiHeader>
      <!-- metadata relating to this sampling-day of texts -->
    </teiHeader>
    <text id="07.01.00">
      <body>
        <group>
          <text id="zeta"><!-- contains all stories for this strapline -->
            <div type="pa">
              <!-- PA story under this strapline -->
            </div>
            <div type="story" n="Mirror" id="T144">
            </div>
            <!-- other divs here for other stories with same strapline -->
          </text>
          <!-- other texts here for other straplines on same date -->
        </group>
      </body>
    </text>
  </TEI.2>
  <!-- other TEI.2 elements here for other dates -->
</TEIcorpus.2>
An alternative would be to use nested <div> elements throughout, within a single <TEI.2>:
<TEI.2>
  <teiHeader>
    <!-- metadata relating to the whole corpus -->
  </teiHeader>
  <text>
    <div type="day" id="07.01.00">
      <div type="strap" id="zeta"><!-- contains all stories for this strapline -->
        <div type="pa">
          <!-- PA story under this strapline -->
        </div>
        <div type="story" n="Mirror" id="T144">
        </div>
        <!-- other divs here for other stories with same strapline -->
      </div>
      <!-- other divs here for other straplines on same date -->
    </div>
    <!-- other divs here for other dates -->
  </text>
</TEI.2>
However, from my reading of the corpus design, it makes sense to regard all the stories for a given strapline on a given date as a unitary item, complete in some sense. The key distinction, in TEI terms, between <div> and <text> is that the latter is complete in a way that the former is not. Hence my preference for the first of these two possible mappings.
The TEI is unusual amongst encoding standards in that it has fairly detailed and explicit minimum requirements for the metadata which should be supplied for a conformant text. This takes the form of a <teiHeader> element which should be attached to each <TEI.2> element, and (in the case of a corpus) additionally to the whole corpus. The corpus header documents information specific to the whole corpus (and which does not therefore need to be repeated in each of its constituent parts), but otherwise conforms to the same minimum structure. The following indicates the minimal content of a TEI header for the Meter corpus:
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>The Meter Corpus</title>
    </titleStmt>
    <publicationStmt>
      <!-- publication information about the corpus -->
    </publicationStmt>
    <sourceDesc>
      <!-- bibliographic information about the sources from which the corpus was derived -->
    </sourceDesc>
  </fileDesc>
</teiHeader>
A good header should make ancillary documentation unnecessary. The above structure is expandable to include all manner of other information, ranging from classification definitions and code tables, to change logs and alternative titles. In the example below, I have used the following additional features of the TEI Header:
The body of each story consists of an optional title, followed by a page number and a series of paragraphs. As mentioned previously, I think segmenting paragraphs into smaller sentence-like units may be desirable, but is not essential. (The advantage of doing so is the finer granularity with which one can reference parts of the corpus.)
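A story body along these lines might be sketched as follows. The <head>, <pb>, <p> and <s> elements are standard TEI; the page number, the n values and the commented-out content are placeholders, not data taken from the corpus.

```xml
<div type="story" n="Mirror" id="T144">
  <head><!-- optional title of the story --></head>
  <pb n="5"/><!-- page number in the source newspaper (placeholder value) -->
  <p>
    <s n="1"><!-- first sentence-like unit --></s>
    <s n="2"><!-- second sentence-like unit --></s>
  </p>
  <!-- further paragraphs -->
</div>
```

If sentence-level segmentation is not wanted, the <s> elements can simply be dropped, leaving the text directly inside each <p>.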
Within a story, stretches of text may be characterized as `verbatim', `rewrite' or `new' and optionally linked to a PA source. Other than that, no markup beyond word separation is currently envisaged (but see further below).
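One possible encoding of these stretches, offered only as a sketch, uses the general-purpose <seg> element with a type attribute. The link to the PA source is shown here with the corresp attribute, which assumes that the TEI additional tag set for linking is enabled; the target identifiers PA1.s3 and PA1.s4 are hypothetical.

```xml
<p>
  <s n="1">
    <seg type="verbatim" corresp="PA1.s3"><!-- words copied from the PA source --></seg>
    <seg type="rewrite" corresp="PA1.s4"><!-- words paraphrasing the PA source --></seg>
    <seg type="new"><!-- words with no PA counterpart --></seg>
  </s>
</p>
```

Pointing at identifiers in the PA text, rather than repeating its wording, keeps each newspaper story compact and lets the link be followed in either direction.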
I propose to handle PA stories in basically the same way, but they present an additional problem in that they are often split across several discrete files. The simplest way of handling that is to treat each file as a (further) nested <div> element within the story-level <div>. However, it is probably important to distinguish between files which are continuations of a previous one and files which are substitute versions: this could be done using the type attribute on the relevant <div>. In the long run I think I would prefer to combine the files of a single story into a single <div> of type PA, and to combine any alternate versions into a distinct <div> of type PA-SUB.
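The simpler of these two approaches, one nested <div> per file, might look like the sketch below. The outer type="pa" value follows the outline given earlier; the inner type values (initial, continuation, substitute) and the n values are hypothetical names chosen for illustration.

```xml
<div type="pa">
  <div type="initial" n="1"><!-- contents of the first file --></div>
  <div type="continuation" n="2"><!-- contents of a file continuing the previous one --></div>
  <div type="substitute" n="3"><!-- contents of a file replacing an earlier version --></div>
</div>
```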
To simplify transition, I suggest that a driver file be created, which will embed the contents of each file at the appropriate point. I am in the process of creating such a file to test out the design.
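One conventional way of building such a driver file is with external entities, as in the sketch below. It rests on several assumptions: that each story file has first been converted to a well-formed fragment, that the DTD is available locally as tei2.dtd, and that the entity name zeta144.mirror is acceptable (it is invented here). The enclosing elements are elided, and the declarations that select TEI tag sets are omitted.

```xml
<!DOCTYPE TEIcorpus.2 SYSTEM "tei2.dtd" [
  <!-- one external entity per converted story file (entity name is hypothetical) -->
  <!ENTITY zeta144.mirror SYSTEM
    "meter_corpus/newspapers/annotated/showbiz/07.01.00/zeta/zeta144_mirror.sgml">
]>
<TEIcorpus.2>
  <!-- corpus header and enclosing elements as in the outline given earlier -->
  <div type="story" n="Mirror" id="T144">
    &zeta144.mirror;<!-- the parser embeds the file's contents at this point -->
  </div>
  <!-- further story divs, each referencing its own entity -->
</TEIcorpus.2>
```

The attraction of this arrangement is that the existing files stay where they are on disk, while the hierarchy they currently encode in directory names is stated once, explicitly, in the driver.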