Oxford University Computing Services
Report on electronic dissertation workshop: DTDs and the usage of new XML-technologies for electronic theses and dissertations, Humboldt-University, Berlin, May 10-12 2000
From the 10th to 12th May 2000, I attended a workshop at Humboldt-University, Berlin, on the topic of XML DTDs for dissertations. It was part of an ongoing international effort, including a UNESCO workshop organised in November 1999. About 20 people came, from Germany, France, Portugal, Scandinavia, the UK and the USA. Planned participants from India and Chile did not materialize. I was invited as an expert in LaTeX to XML conversion, but most of the other participants were best classified as digital librarians.
Suzanne Dobratz and her electronic dissertation group at Humboldt-University made us very welcome, and organised a pleasant stay in Berlin, which showed us remarkably warm weather, fascinating historical surroundings, and good food.
The head of the computing center at Humboldt-University, Peter Schirmbacher, welcomed us, and explained the relationship between this workshop, and the UNESCO initiative to establish international standards for electronic dissertations. He then handed over to Suzanne Dobratz, who outlined her plans for the 3 days. Her hopes were
We had to decide whether it was desirable and/or possible to have a common DTD, or whether (at a minimum) different academic disciplines would need their own. We also need to have ideas on
Pieter Diepold, Humboldt-University, described and demonstrated the UNESCO clearing house site, which let interested parties record details of their work and expertise. The inevitable questions were raised about privacy, and how to edit information. Within its limits, the site looked like a useful concept.
Anthony Atkins, Virginia Tech, talked about the project in his institution, part of the NDLTD work. They have a special thesis SGML DTD (http://etd.vt.edu/etd-ml/index.htm) written by Neill Kipp. A very few students had tried using it, but in practice student input is limited to a web form for collecting metadata. Problems with electronic dissertations included:
Matthias Schulz, Humboldt-University, talked about the DiML SGML DTD for dissertations, developed locally. It uses CALS for tables, and is an adaptation of the Virginia Tech DTD, having the approximate size of HTML 4. It has its own metadata markup in the front matter, but they are considering Dublin Core. Interesting features include a bibliographical portion of the DTD modelled on BibTeX, and special support for the German `affidavit' in a dissertation. About 100 dissertations have been encoded (by project staff) so far, with the usual problems:
Phil Potter, University of Iowa, outlined the plans for his institution. They are developing a DTD for a pilot project which started in 1999; student participation is optional, but XML is the only option if they do submit electronically. They hope that standard WP packages like Word will develop into good XML authoring tools. Special concerns include a worry about the permanence of non-ASCII data such as bitmap images, and students are told to rely on text only in their explanations. Experience so far has been with Word conversion, using RTF and Majix, with 3 students.
Tuija Sonkilla, Technical University of Helsinki, described Finnish projects, including the most advanced at Oulu. The HUT work is based on
Paul Schaffner, University of Michigan, said that they have converted about 20 dissertations by hand, and are considering a new trial this year. They foresee issues of:
A discussion session ensued, covering topics like
Viviane Bouletreau, Université de Lyon, started the second day by covering the work done at Lyon in the Social Sciences. Their `Cyberthèses' SGML-based project, in collaboration with Montreal, and the University Press, has been running for several years. They use free software where possible, currently processing with Omnimark, and a separate database for metadata. The DTD is TEI Lite, which they appreciate for its `corpus' ability to cover a group of texts with their own headers. The <teiHeader> is standardized, but the body of the text allows free rein for variable presentation, using the `type' attribute for <div>. In some areas (eg verse and drama) they use deep TEI markup, but in others (eg math) they use bitmap images or TeX markup. They plan to look at musical notation soon.
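The split between a fixed header and a freely typed body can be sketched as follows. This is a minimal illustration only: the fragment is a hypothetical, much simplified TEI Lite-style document written as well-formed XML so that the Python standard library can parse it (the Lyon project itself works in SGML), and the `type` values are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified TEI Lite-style fragment; not taken from a real thesis.
SAMPLE = """
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An example thesis</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter"><head>Introduction</head><p>...</p></div>
      <div type="chapter"><head>Method</head>
        <div type="section"><head>Corpus</head><p>...</p></div>
      </div>
      <div type="appendix"><head>Tables</head><p>...</p></div>
    </body>
  </text>
</TEI.2>
"""


def header_title(xml_text):
    """Read the title from the standardized header part of the document."""
    root = ET.fromstring(xml_text)
    return root.findtext("teiHeader/fileDesc/titleStmt/title")


def div_types(xml_text):
    """Count how often each value of the 'type' attribute is used on <div>."""
    root = ET.fromstring(xml_text)
    return Counter(div.get("type", "untyped") for div in root.iter("div"))


if __name__ == "__main__":
    print(header_title(SAMPLE))
    print(dict(div_types(SAMPLE)))
```

The header can be read with a fixed path, whereas the body has to be inspected to discover which `type` values a given thesis happens to use.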
For delivery, Lyon convert to HTML from SGML, and convert the teiHeader to formats like MARC. For input, they convert from Word, Star Office, Wordperfect, etc, via RTF and Omnimark scripts (conversion can take from 2 hours to 2 weeks). Metadata is gathered using a web form. In the future, they plan courses for students on how to use the stylesheets currently used by library staff. About 130 theses a year need to go through the system, and they hope to cover this with a half-time job. Since January 2000, electronic submission (PDF) is compulsory, and the paper copy will be printed by the library; PDF will be kept to preserve pagination. This started another discussion about citation and referencing issues, and mention was made of Xlink/TEI linking as possible solutions.
Håvard Fosseng, Oslo University, talked about their work on theses in political science. An ISO 12083-based SGML DTD is available, and they find conversion from Word (via RTF, using Balise) to be relatively straightforward in the given subject area. As in other projects, a Web form is used for gathering metadata, controlled by librarians (they will need to link it with BIBSYS, the library cataloguing system in Norway). A subset of the metadata corresponds to Dublin Core.
Oslo do not want to put a lot of work into `deep' conversion or much intervention; their current setup generates parseable files. They have not yet decided whether to store metadata in the document or outside in a database. They are looking at Oracle as a possible integrated storage system for the whole thing.
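The idea that only a subset of the web-form metadata corresponds to Dublin Core can be sketched as a simple field mapping. The form field names below are hypothetical (the Oslo form was not shown in detail); the target names are standard Dublin Core elements, but which form field maps to which element is an assumption made for illustration.

```python
# Hypothetical form field names mapped to Dublin Core element names.
# The choice of mapping is an illustrative assumption, not the Oslo scheme.
FORM_TO_DC = {
    "author": "creator",
    "thesis_title": "title",
    "abstract": "description",
    "keywords": "subject",
    "submitted": "date",
    "language": "language",
}


def to_dublin_core(form_record):
    """Split a form record into a Dublin Core part and a local remainder."""
    dc, local = {}, {}
    for field, value in form_record.items():
        if field in FORM_TO_DC:
            dc[FORM_TO_DC[field]] = value
        else:
            # Fields with no Dublin Core counterpart stay in a local record.
            local[field] = value
    return dc, local


if __name__ == "__main__":
    record = {
        "author": "A. Student",
        "thesis_title": "An example thesis",
        "submitted": "2000-05-10",
        "supervisor": "B. Professor",   # hypothetical local-only field
    }
    dc, local = to_dublin_core(record)
    print(dc)
    print(local)
```

Keeping the remainder in a separate record leaves open the question of whether the metadata ends up inside the document or in an external database.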
Suzanne Dobratz, Humboldt-University, returned to give a review of multi-media DTDs. While many people hide things in graphics files, she stressed that higher-level semantic markup (as we have in MathML and CML) was far preferable. Useful DTD projects included:
Christof Steinbeck, Jena, gave a clear and helpful talk describing the features and limitations of CML (Chemical Markup Language). CML 1.0 is clearly described in a formal article published in 1999 by Peter Murray-Rust and Henry Rzepa, and covers reactions, 2D structures, 3D structures, annotations and crystallographic/stereoscopic display. The XML element set under a root <molecule> is quite small, consisting of a set of data types (including arrays), a small number of chemical tags, and some convenience wrappers. While there are not many tags, there are lots of attributes.
In practice, CML will likely be one part of a chemist's toolkit (they may also use PNG, SVG, BioML, Xlink, MathML etc when writing), and it is not a universal panacea yet (if ever). It is likely that chemists will build new high-level languages (like a SchemaML for coding NMR data).
CML input and output is supported by a variety of free software, but the commercial products (like ChemDraw) have not been involved, since it does not cover enough of chemistry for them. In discussion, this point was emphasized, that CML is doing a good job so far as it goes, but its scope is not universal.
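To give a feel for the "few tags, many attributes" style, here is a minimal sketch that reads a small molecule description. The fragment is only loosely modelled on CML 1.0 conventions: the element names and the `builtin` attribute values are assumptions for illustration and have not been checked against the published DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative fragment in a CML 1.0-like style; names are assumptions,
# not validated against the CML DTD.
SAMPLE = """
<molecule id="water">
  <atomArray>
    <atom id="a1"><string builtin="elementType">O</string></atom>
    <atom id="a2"><string builtin="elementType">H</string></atom>
    <atom id="a3"><string builtin="elementType">H</string></atom>
  </atomArray>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
    </bond>
    <bond id="b2">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a3</string>
      <string builtin="order">1</string>
    </bond>
  </bondArray>
</molecule>
"""


def element_counts(xml_text):
    """Count atoms per chemical element symbol."""
    root = ET.fromstring(xml_text)
    return Counter(
        s.text
        for atom in root.iter("atom")
        for s in atom.findall("string[@builtin='elementType']")
    )


def bonds(xml_text):
    """Return each bond as a tuple of the atom ids it connects."""
    root = ET.fromstring(xml_text)
    return [
        tuple(s.text for s in bond.findall("string[@builtin='atomRef']"))
        for bond in root.iter("bond")
    ]


if __name__ == "__main__":
    print(dict(element_counts(SAMPLE)))
    print(bonds(SAMPLE))
```

Most of the chemistry here is carried by generic data-type elements qualified by an attribute, which is why the tag set can stay small.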
Per Åkerlund, Swedish University of Agricultural Sciences, talked about their work using XML for data exchange, archiving and publication. They have a project (EPSILON) dealing with dissertations; it aims to convert old material from 1990 in abstract form only, and full text from 1997. At some future date they plan pure electronic publication, with paper only on demand. The technical proposal includes:
Martin Hess, Frankfurt University, provided an account of a complex system in their digital library project, based around automatic structure derivation from plain text. Using scanned and OCRed texts, they generate tables of contents and front matter. A simple DTD describes the pages, and output is either page-by-page HTML or PostScript. Web forms are used to drive the system, in which PostScript of the full dissertation is the primary input. The PS is converted to TIFF, which is then fed to an OCR engine.
Matthias Schulz, Humboldt-University, started a discussion about modular DTDs, and described the TEI pizza model. The majority view seemed to be that metadata must be captured and stored separately from the text of a dissertation, since it represents a different sort of information. Even though metadata may be derived from the text, it has its own life, and may need updating, while the text remains untouched. Talking about this led to discussion of interoperability of metadata, and the lack of universal identifiers.
A somewhat heated argument carried on throughout these sessions, about whether structured text was actually worth having for retrieval purposes, as opposed to brute-force free-text search. It was hard to get examples of retrieval which would either use structure, or return anything except the full document.
Kerstin Zimmerman, Oldenburg University, gave concrete details of Dissertation Online, a 2-year-old cataloguing and dissemination project in the physics field. This has European scope, but the coverage is variable, ranging from full text and metadata, through abstracts, down to simple name and title summaries. Formats include PDF, DOC, RTF, TeX, etc. Oldenburg runs a service gathering data about theses, using RDF. The basis is an open-source brokering gatherer system (Harvest) to present a single interface to users. At present, the system has details of 1475 dissertations, of which about 250 are full text. There was some detailed information about the SOIF record format which Harvest uses to exchange information.
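As a rough illustration of what such a record looks like, the sketch below writes and reads a record in the SOIF layout as generally documented for Harvest: an `@TEMPLATE { URL` line, then `Name{size}:` lines where the size is the byte length of the value and a tab precedes the value, then a closing brace. The attribute names and the URL are hypothetical, and the code is a simplified reading of the format rather than a full implementation.

```python
import re

# Matches "Name{size}:<tab>" at the start of an attribute line.
_ATTR = re.compile(rb"([^{\n]+)\{(\d+)\}:\t")


def to_soif(url, attributes, template="FILE"):
    """Serialize a dict of attributes as one SOIF-style record."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # size counts bytes, not characters
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


def parse_soif(text):
    """Parse one SOIF-style record into (template, url, attributes)."""
    raw = text.encode("utf-8")
    header, _, rest = raw.partition(b"\n")
    match = re.match(rb"@(\S+) \{ (.*)", header)
    template, url = match.group(1).decode(), match.group(2).decode()
    attributes, pos = {}, 0
    while True:
        match = _ATTR.match(rest, pos)
        if not match:
            break  # reached the closing brace
        size = int(match.group(2))
        start = match.end()
        # Read exactly 'size' bytes, so values may contain newlines.
        attributes[match.group(1).decode()] = rest[start:start + size].decode("utf-8")
        pos = start + size + 1  # skip the newline that ends the value
    return template, url, attributes


if __name__ == "__main__":
    record = to_soif(
        "http://example.org/thesis/1",          # hypothetical URL
        {"Title": "An example thesis", "Author": "A. Student"},
    )
    print(record)
    print(parse_soif(record))
```

Because each value carries its own length, a reader does not need to escape or guess at line breaks inside abstracts and similar long fields.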
In the last session of Day 2, I covered the issue of LaTeX to XML conversion, which will be very important in some disciplines. I showed that the problem is hard, but not insoluble, and best solved using TeX itself. I talked in some detail about the TeX4ht system, and showed that it could perform LaTeX to MathML conversion properly. My conclusion was that this sort of system really needed authorial input, and was not likely to be completely suitable for production `sausage-machine' conversion. When converting complex material like mathematics, it is often not clear what the right result is.
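A small example of why the right result is not always clear: the TeX source `f(x)` may mean a function applied to x, or f multiplied by x, and presentation MathML distinguishes the two only through an invisible operator. The sketch below builds both encodings by hand with the Python standard library. It is not TeX4ht output, merely an illustration of a choice that needs authorial input.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
FUNCTION_APPLICATION = "\u2061"  # invisible "apply function" operator
INVISIBLE_TIMES = "\u2062"       # invisible multiplication operator


def encode_f_of_x(operator):
    """Encode the TeX source f(x), placing 'operator' between f and (x)."""
    math = ET.Element("math", xmlns=MATHML_NS)
    row = ET.SubElement(math, "mrow")
    ET.SubElement(row, "mi").text = "f"
    ET.SubElement(row, "mo").text = operator
    inner = ET.SubElement(row, "mrow")
    ET.SubElement(inner, "mo").text = "("
    ET.SubElement(inner, "mi").text = "x"
    ET.SubElement(inner, "mo").text = ")"
    return ET.tostring(math, encoding="unicode")


if __name__ == "__main__":
    as_function = encode_f_of_x(FUNCTION_APPLICATION)
    as_product = encode_f_of_x(INVISIBLE_TIMES)
    # Both render alike, but the markup records different meanings.
    print(as_function != as_product)
```

A converter working from the TeX alone has no reliable way to pick between the two, which is one reason a fully automatic pipeline is hard to trust for mathematics.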
Uwe Müller, Humboldt-University, started the day by talking about Humboldt's Word to SGML conversion setup; it is not XML yet, because the DTD is not (yet) XML-compliant. They use Word templates, and sophisticated authoring macros. Students must produce PDF files themselves, and provide English translations of titles and abstracts, and the library prints from the PDF. The computing center converts to SGML and HTML, and keeps the PDF. The library prepares metadata.
The Word template system enforces some structure, although it cannot model the whole DTD. The macros help the author with, eg, entering bibliographies, and they do some internal monkeying to record page numbers during the SGML conversion. This is based on the (no longer sold) Microsoft SGML Author and deals with cross-references, character templates, special characters, etc; math is converted to pictures. Graphical features in Word are not converted. The expected problems surface:
In discussion of conversion, it was clear that Word to XML was the big problem, and I (at least) questioned whether it was an appeasement strategy doomed to failure. Others had more faith that it had to be made to work.
Suzanne Dobratz demonstrated the Humboldt document server, and talked about the technicalities of secure digital signatures, which caused some more interesting discussion of copyright and security in general, and the issue of whether or how dissertation archives could be converted to new technologies in the future.
In the final sessions, we moved on to metadata details. Nuno Freire, Portuguese National Library, talked about the issues, and then Thorsten Bahne, University of Duisburg Dissertation Online Project, presented concrete proposals for dissertation metadata based on Dublin Core. This caused a long and detailed discussion item by item; the most controversial were names, addresses, dates and thesis types. There was a perhaps surprising fuzziness about the precise semantics of supposedly mandatory core elements, but there was also an argument that almost all the information should be regarded as simple clues for a researcher trying to locate a thesis.
I found the subject of electronic dissertations quite compelling, obviously sharing problems and solutions with many other applications. The workshop started out with the perhaps unrealistic aim of getting agreement about standard XML DTDs and metadata for dissertations, but I do not think we really advanced very far in those directions.
My personal conclusions after the workshop were as follows: