Portable Documents

UK TeX Users Group: "Portable documents: Acrobat, SGML and TeX"

Bridewell Theatre, London

19 Jan 95

This joint meeting of the UK TeX Users Group and the BCS Electronic Publishing Specialist Group attracted a large and mixed audience of academics, TeX hackers, publishers, and software developers, with representatives from most UK organizations active in the field of electronic publishing and document management. I was expecting rather more disagreement about the relative merits of the various approaches now available for the creation of portable documents; in the event, the path of SGML-based righteousness, with appropriate concessions to the practical merits of PostScript-based systems, appeared to command a consensus.

First of the seven speakers was David Brailsford from Nottingham University, who described Adobe's Acrobat as "a de facto industry standard". His presentation of exactly how the various components of this product worked together, and could be made to interact with both LaTeX and SGML, was very clear and refreshingly free of hype. The choice of PDF (which is effectively a searchable and structured form of PostScript, in which logical structure and hypertextual links are preserved along with the imaging information) as an archival format was a pragmatic one for journals such as EPodd, where fidelity to every detail of presentation was crucial. The availability of a free Acrobat reader was also a plus point. He characterized the difficulties of mapping the logical links of a LaTeX or SGML document on to the physical links instantiated in a PDF document as a classic case of the importance of "late binding", and revealed the open secret that Adobe's free PDF reader would soon be upgraded to recognise and act on HTML-style anchors. A demonstration of the Acrobat-based electronic journal project CAJUN is already available online at http://quill.cs.nott.ac.uk.

David Barron, from Southampton, gave an excellent overview of what exactly is implied by the phrase "portable document". Documents are not files, but compound objects, combining text, images, and time-based media. There is a growing awareness that electronic resources should be regarded as virtual documents, repositories of information from which many different actual documents may be generated. These developments all make "portability" (defined as the ability to render documents -- with varying degrees of visual fidelity -- in different hardware or software environments) very difficult. Portability was of crucial importance, not only for publishers wishing to distribute in the electronic medium, and not only for specific user communities wishing to pool information, but also for all of us. Information available only in a non-portable electronic form was information at the mercy of technological change. He cited as portability success stories the widespread use of PostScript and LaTeX as a distribution medium by the research community, referring to the Physics preprint library at Los Alamos as a case where this had now become the normal method of publication. By contrast, the success of the World Wide Web seemed to be partly due to its use of a single markup language (HTML) which effectively takes rendering concerns entirely out of the hands of authors. From the archival point of view, however, none of the available standards seemed a natural winner: hypertext was still too immature a technology, and there were still many intractable problems in handling multiple fonts and character sets. Professor Barron concluded with a brief summary of the merits of SGML as providing a formal, verifiable and portable definition for a document's structure, mentioning in passing that Southampton are developing a TEI-based document archive with conversion tools going in both directions between SGML and RTF, and between SGML and LaTeX. Looking to the future, he saw the IBM/Apple OpenDoc architecture as offering the promise of genuinely portable dynamic documents, which could be archived in an SGML form once static.

The third speaker of the morning, Jonathan Fine, began by insisting that the spaces between words were almost as important as the words themselves. I felt that he wasted rather a lot of his time on this point, as he did later in explaining how to pronounce "TeX" (surely unnecessary for this audience), before finally describing a product he is developing called "Simsim" (Arabic for sesame; "Sesame" itself, we learned, is a trademark of British Petroleum). This appears to be a set of TeX macros for formatting SGML documents directly, using components of the ESIS to drive the formatter, but I did not come away with any clear sense of how his approach differed from that already fairly widely used elsewhere.

Peter Flynn, from University College Cork, did his usual excellent job of introducing the Wondrous Web World, focussing inevitably on some of its shortcomings from the wider SGML perspective, while holding out the promise that there is a real awareness of the need to address them. What the Web does best, in addition to storage and display of portable documents, is to provide ways of hypertextually linking them. Its success raises important and difficult issues about the nature of publishing in the electronic age: who should control the content and appearance of documents -- the user, the browser vendor, or the originator? Publishing on the Web also raises a whole range of fundamental and so far unresolved problems in the area of intellectual property rights, despite the availability of effective authentication and charging mechanisms. He highlighted some well-known "attitude" problems -- not only are most existing HTML documents invalid, but no-one really cares -- and concluded that the availability of better browsers, capable of handling more sophisticated DTDs, needed to be combined with better training of the Web community for these to be resolved.

The three remaining presentations, we were told after a somewhat spartan lunch, would focus on the real world, which seemed a little harsh on the previous speakers. Geeti Granger from John Wiley described the effect on a hard-pressed production department of going over to the use of SGML in the creation of an eight-volume Chemical Encyclopaedia. Her main conclusion appeared to be that it had necessitated more managerial involvement than anticipated, largely because of the increased complexity of the production process. She attributed this partly to the need for document analysis, proper data flow procedures, progress reports etc., though why these should be a consequence of using SGML I did not fully understand. More persuasively, she reported the difficulty the project had had in finding SGML-aware suppliers, in designing a DTD in advance of the material it described, in agreeing on an appropriate level of encoding, and in getting good quality typeset output.

Martin Kay, from Elsevier, described in some detail the rationale and operation of the Computer Aided Production system used for Elsevier's extensive stable of academic journals. Authors are encouraged to submit material in a variety of electronic forms, including LaTeX, for which Elsevier provide a generic style sheet. Other formats are converted and edited using an in-house SGML-aware system (apparently implemented in WordPerfect 5, though I may have misheard this). This uses their own DTD, based on Majour, with extensions for maths, which seemed to be a major source of difficulty. Documents will be archived in SGML or PDF in something called an electronic warehouse, of which no details were vouchsafed. Both PDF and SGML were seen as entirely appropriate formats for online journals, CD-ROM and other forms of electronic delivery. The advantages of SGML lay in its independence of the vagaries of technological development, and its greater potential. However, potential benefits always had to be weighed against current costs; like any other business, Elsevier was not interested in experimentation for its own sake.

The last speaker was Michael Popham, formerly of the SGML Project at Exeter, and now of the CTI Centre for Textual Studies at Oxford. His presentation did a fairly thorough demolition job on the popular notion that there is still not much SGML-aware software in the world, starting with a useful overview of the SGML context -- the ways in which SGML tools might fit into particular parts of an enterprise -- and then listing a number of key products organized by category. It was nice to hear the names of so many real SGML products (auto-taggers, authoring aids, page layout systems, transformation tools, document management systems, browsers and parsers) being aired, after a long day obsessed by Acrobat and LaTeX. He concluded with a useful list of places where up-to-date product information can be found, and a reminder that the field is rapidly expanding, with new tools appearing all the time.

The day concluded with an informal panel session, onto which I was press-ganged. This effectively prevented me from taking notes, but also gave me the chance to promote the recently published DynaText version of the TEI Guidelines, which I did shamelessly. I also remember Malcolm Clark asking, tongue firmly in cheek, why everyone couldn't just use Word, and I was agreeably surprised by the number of people in the audience who were able to tell him the answer, in no uncertain terms. Other topics addressed included auto-tagging, whether maths and formulae should be encoded descriptively or presentationally, whether Microsoft will still be around in the next century, and whether we would ever learn how to format documents for electronic presentation as well as we could on paper.