1 What's the beef?

This audience probably does not need to be reminded what SGML is (Standard Generalized Markup Language: the international standard for structured document interchange, ISO 8879 (1986) if you've forgotten). However, it might be helpful to remind you of what exactly the Web is, particularly now that URLs and cyberspace have become an established part of journalism, junk capitalism, and other components of life as we know it. The best definition of the World Wide Web I have come across was from Dave Raggett, who pointed out that the Web has exactly three components:

Given the immense success of the World Wide Web, it is not unreasonable to ask what more anyone could reasonably require. As they say, If it ain't broke, why fix it?. I'd like to begin by rehearsing some of the things that have proved to be wrong with that simple architecture.

First, the use of existing protocols. This has always seemed to me one of the greatest strengths of the Web's original design. By allowing from the start for an object to specify, either directly or by implication, that a client intended to deal with it must be able to launch some specific application, the Web has always been extensible in new and unpredictable ways. It makes it attractive for vendors wishing to make sure we all use package X without at the same time preventing us from glorying in the CGI craziness provided by hacker Y. When the Web was first designed, it was for many, simply a way of conveniently integrating many existing TCP/IP-based tools (hence, I suspect, the name 'Mosaic'). Then, as it evolved into the great docuverse imagined by Ted Nelson and other hyperheroes, the HTML protocol began to seem the most important. Unfortunately, in at least one respect, this protocol is Broken As Designed: its object granularity is fixed and file-based (necessitating large amounts of clever cache-management to maintain acceptable performance in an era of shrinking band-width). While immense ingenuity has gone into implementing such necessary components of the docuverse as authentication, encryption, synchronous transmission of sound and video, etc., the results are inevitably complex, ad hoc, and in a permanent state of evolution. This may not, of course, be entirely a bad thing (as a system design principle, for example, it seems to have served the natural world pretty well) but it makes life difficult for those concerned with the longer term view of our emerging global information systems.

The URL naming system has been so successful that I am a little shy of making any criticism of it. Nevertheless, I am not the first to wonder whether it might not be improvable. The number of broken links in the universe, and the difficulties constantly encountered in keeping them unbroken; the difficulty of identifying objects on the web other than by a fragile addressing scheme, tied to specific instantiations of objects rather than an abstract name; the impossibility of reliably identifying higher level groups of objects; all point to something fundamentally wrong. If we think of the web as a kind of library, it is one in which books have no ISBNs, no bibliographic control, no accession numbers, and no agreed set of subject-level descriptors. It took several centuries for those necessary mechanisms to evolve for print-based information-carriers: it is depressing to think that none of that expertise seems to be carried forward into non-print based media.

Lastly, what is wrong with HTML? Well, rather a lot, if we compare it with other general purpose document type definitions. At the risk of reminding you of something rather obvious, the HTML dtd tries to cater for the immense and glorious variety of structures that exist in electronic resources by taking the line of least resistance, and pretending that documents have no structure at all. Compare for example the following two declarations:

<!ELEMENT Book - - ((Title, TitleAbbrev?)?, BookInfo?, ToC?, LoT*, Preface*,
                (((%chapter.gp;)+, Reference*) | Part+ | Reference+ |
                Article+), (%appendix.gp;)*, Glossary?, Bibliography?,
                (%index.gp;)*, LoT*, ToC? ) +(%ubiq.gp;) >
<!ENTITY % html.content "HEAD, BODY">
<!ELEMENT HTML O O  (%html.content)>
<!ENTITY % body.content "(%heading | %text | %block | HR | ADDRESS)*">
<!ELEMENT BODY O O  %body.content>

The first, from the DocBook dtd, makes explicit that books potentially contain a number of subcomponents, each of which is distinguishable, and has a proper place. The second, from the HTML 2.0 dtd, states that the body of an HTML document contains just about anything in just about any order. (I have often wondered why HTML did not simply use ANY as content model for the body and have done). There is a place, of course, for such content models (particularly in DTDs such as the TEI, where an unpredictable richness of element types is available), but their downside in the HTML world should not be forgotten.

HTML's permissiveness makes it difficult or impossible to do many of the things for which we go to the trouble of making information digitally accessible. Specifically, it is hard to:

This last difficulty highlights a further major case of drawbacks resulting from the nature of the HTML document type definition: it is semantically impoverished, and it is presentation-oriented. By semantically impoverished, I do not simply mean that HTML lacks any way of distinguishing say personal names and institutional names, or even names at all; indeed it provides no way of marking up any kind of textual object other than headings, lists, and (arguably) paragraphs. By presentation-oriented , I mean that HTML compensates for this serious lack only by allowing for an increasingly complex range of ways of specifying the way that a span of text should be rendered, rather than any way of specifying what kind of an object the span is. The relationship between what an object is, and how it is rendered, has exercised much theoretical debate, which I will not rehearse here, but one key fact remains: all SGML systems are predicated on the assumption that markup is introduced in ordered to distinguish semantic categories of various kinds, the meaning of which are rarely limited to how they should be rendered. On the contrary, the assumption is that they may be rendered in many different ways by different applications. This is hard, or impossible, with HTML.

This focus on bold and italic, on headings, and bulletted items, would matter less if HTML were extensible (or if its host environment allowed for its substitution by a more expressive dtd). It would also matter less if HTML were even adequate as a data format for large-scale commercial publishing. But neither of these is the case. If we compare even the best of HTML tools with even the worst of generic SGML tools, we note that the hardwiring of the HTML tool to a particular set of tags (with or without proprietary extensions) make it impossible for the user to extend the tool's functionality in any way. By separating out formatting and structuring issues, even the humblest of SGML tools allows the user to retain complete control over the data.

The advent of HTML stylesheets appears to address this limitation, by extending the choice of formatting options available to HTML tools in a number of useful ways. However, the stylesheet mechanism as so far defined lacks several aspects of output control typically supported by generic SGML tools. It cannot for example be used as means of re-ordering the components of a document, or of selecting parts of it in some application-specific manner -- both of which are perfectly reasonable requirements in mature technical publishing environments, and both of which are easily achieved by current generic SGML document processing systems.