If SGML-to-HTML servers are complex, expensive, and CPU-intensive database applications which only large corporations can afford, and hybrid clients like Panorama provide only half the functionality needed, at twice the cost, we clearly need to see a new breed of software before we can deliver on some of the promises of the world of structured documents. However sophisticated our servers, existing user agents will remain unable to take full advantage of the potential richness of the SGML documents already existing in the world, still less those which are being created, so long as they persist in regarding anything beyond HTML as outside their preserve.
What is required for future web user agents to be able to receive and process any SGML object in the way that they are currently able only to handle HTML? There are two halves to the answer. Firstly, servers providing SGML objects need to deliver along with them some kind of wrapper indicating the document's structural description (DTD) and stylesheet information defining its rendition. Secondly, clients must be able to unpack this package of information correctly, delegating the actual processing of the document to appropriate subcomponents responsible for parsing, constructing, and rendering it. This approach necessitates the creation of a number of specifications, key elements of which are listed below:
specifications for the packaging of SGML objects and fragments: this work is currently being undertaken by a technical committee of the SGML Open consortium of vendors;
specifications for the transmission of SGML entities: the SGML Open Catalog mechanism goes a long way to meeting this need, though its interoperability with MIME-based mechanisms remains unclear;
specifications for document structuring and rendering;
specifications for document link semantics based on HyTime;
a simplified version of SGML.
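The division of labour described above, in which the client unpacks a package and hands the document to separate parsing, constructing, and rendering subcomponents, can be sketched as follows. This is only an illustration: the SgmlPackage class, its field names, and the three stage functions are hypothetical, and each stage is a placeholder standing in for a real component.

```python
from dataclasses import dataclass


@dataclass
class SgmlPackage:
    """Hypothetical wrapper delivered by a server along with an SGML object."""
    dtd_id: str         # identifies the document's structural description (DTD)
    stylesheet_id: str  # identifies the stylesheet defining its rendition
    body: str           # the SGML object itself


def parse(body: str, dtd_id: str) -> dict:
    # Placeholder: a real parser would check the body against the named DTD.
    return {"dtd": dtd_id, "source": body.strip()}


def construct(parsed: dict) -> dict:
    # Placeholder: a real component would build a tree from the parser output.
    return {"root": parsed["source"], "dtd": parsed["dtd"]}


def render(tree: dict, stylesheet_id: str) -> str:
    # Placeholder: a real renderer would apply the named stylesheet to the tree.
    return f"[{stylesheet_id}] {tree['root']}"


def handle(package: SgmlPackage) -> str:
    """Unpack the package and delegate each step to its own subcomponent."""
    parsed = parse(package.body, package.dtd_id)
    tree = construct(parsed)
    return render(tree, package.stylesheet_id)


result = handle(SgmlPackage("example-dtd", "example-style", " <p>Hello</p> "))
```

Because the user agent itself only unpacks and delegates, any one of the three stages could in principle be replaced without touching the others.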
Work on specifying all these components in the context of the Web is already well advanced, as a result of substantial discussion and serious work within the appropriate expert communities, most notably within working groups of the W3C. For an overview of the current situation, see http://www.w3.org/pub/WWW/MarkUp/SGML. I will conclude with a few remarks on two of these only: those concerned with document rendering, and the need for a simplified SGML.
At present, each SGML browser has its own proprietary SGML stylesheet language. No content provider could reasonably be expected to design and supply a different stylesheet for every possible target. Some kind of generic stylesheet mechanism is thus clearly essential. Two candidates for this mechanism currently present themselves: the Cascading Style Sheets (CSS) mechanism, and the ISO standard Document Style Semantics and Specification Language (DSSSL). The advantage of the latter is not simply that it has emerged from the standards community after nearly a decade of very hard work, nor that real implementations of it are now freely available. It is simply that CSS does not have enough power for the job.
As currently defined, Cascading Style Sheets lack the concept of a parse tree, which is essential to correct processing of an SGML document. Consequently:
you cannot take an element (a chapter title perhaps) from one part of the tree for re-use in another (say, a page header);
you cannot treat all sibling elements (say all but the first paragraph in a division) in a particular way;
you cannot treat elements differently dependent on their context (for example headings of a figure as opposed to headings of a chapter).
Because it has no programming language features, a CSS style sheet lacks decision structures, modularization, variables, and any way of doing arithmetic calculation. As a way of improving the way that HTML texts are rendered on screen (provided that they are in Western alphabets), it is adequate, but as a generic solution to the problem of rendering SGML documents, it falls well short.
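The three operations listed above all presuppose access to a parse tree. A minimal sketch of what such access makes possible is given here, using Python's xml.etree.ElementTree as a stand-in for an SGML parser (an assumption made for convenience: the sample is written in XML syntax, and its element names are invented for illustration).

```python
import xml.etree.ElementTree as ET

# Invented sample document; XML syntax stands in for SGML here.
DOC = """<book>
<chapter><title>Origins</title>
<p>First paragraph.</p><p>Second paragraph.</p><p>Third paragraph.</p>
<figure><title>A diagram</title></figure>
</chapter>
</book>"""

root = ET.fromstring(DOC)
chapter = root.find("chapter")

# 1. Re-use: take the chapter title from one part of the tree
#    for use elsewhere, for example in a page header.
page_header = chapter.find("title").text

# 2. Siblings: select all but the first paragraph of the division.
paragraphs = chapter.findall("p")
indented = [p.text for p in paragraphs[1:]]

# 3. Context: the same element name is treated differently
#    depending on its parent element.
parent_of = {child: parent for parent in root.iter() for child in parent}


def heading_style(title: ET.Element) -> str:
    """Return a (hypothetical) style name chosen by the title's context."""
    return "caption" if parent_of[title].tag == "figure" else "chapter-heading"


styles = [heading_style(t) for t in root.iter("title")]
```

Each step depends on navigating the tree (finding a descendant, enumerating siblings, looking up a parent) together with ordinary variables and conditionals.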
The key advantage of DSSSL lies in its modular design. It integrates three key components:
a language for querying SGML documents;
a language for specifying transformations from one SGML document into another;
a language for associating formatting characteristics with an SGML document.
These components interact as shown in figure 3 below:
Figure Three
A full description of DSSSL is beyond the scope of this paper: a good description of the DSSSL-Online subset (from which the above figure is taken), and a number of other tutorials are freely available from James Clark's DSSSL pages (at http://www.jclark.com/dsssl) and elsewhere. Its key features for the present argument are as follows:
it incorporates document transformation as a distinct exercise from document rendering;
the rendering component retains access to all parts of the SGML input;
free software tools implementing key parts of the specification are already available.
Consequently, anything which can be expressed in the SGML definitions underlying a document repository can be used in the creation of the view of it which a particular client sees. A user agent with a suitable DSSSL specification can handle whatever SGML structures are obtained from a true SGML server, reordering, selecting, combining, and rendering SGML elements according to a formally complete specification.
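The separation of transformation from rendering can be illustrated with a short sketch. It is written in Python rather than in DSSSL itself, so it imitates only the division of labour and not the language; the element names, the transform and render functions, and the FORMATS table are all hypothetical, and XML syntax again stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Invented source document in which the heading comes last.
SOURCE = (
    "<entry><body>Text of the entry.</body>"
    "<note>Internal only.</note><head>Heading</head></entry>"
)


def transform(entry: ET.Element) -> ET.Element:
    """Build a new tree: heading first, body second, notes left out."""
    view = ET.Element("view")
    for tag in ("head", "body"):
        for element in entry.findall(tag):
            copy = ET.SubElement(view, tag)
            copy.text = element.text
    return view


# Formatting characteristics, keyed by element name (hypothetical choices).
FORMATS = {"head": str.upper, "body": str}


def render(view: ET.Element) -> str:
    """Apply the formatting associated with each element of the view."""
    return "\n".join(FORMATS[element.tag](element.text) for element in view)


output = render(transform(ET.fromstring(SOURCE)))
```

Reordering and selection happen entirely in transform, while render only associates formatting with whatever tree it is given, so either step can be changed without affecting the other.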
The big question in all this remains: if SGML is so great, why has it not taken over the world already? Amongst (varyingly sensible) answers to this which I won't pursue further are the argument that it has, at least as far as serious document management is concerned; the argument that taking over the world is not the object of the exercise since SGML vendors and advocates are culpably uninterested in developing software for the common man or woman; and the argument that there is an inherent contradiction between the goals of SGML and the goals of the politico-industrial-military complex which currently runs the data processing industry. However, the question requires an answer, and perhaps the development of XML will provide it.
XML (eXtensible Markup Language) is a new activity of the W3C SGML working group; its specification is due to see the light of day at the end of 1996, with a targeted implementation date of March 1997. Its goal is to define a leaner, simpler subset of the SGML metalanguage, better suited to use on the Internet, able to support a wide variety of applications, and with a concise formal design. A set of design principles (available at http://www.textuality.com/sgml-erb/dd-1996-0001.html) spells out what is meant by "leaner and simpler", but is a little less clear on what is meant by a "subset of SGML". Over the last three or four months, a select group of about fifty SGML experts have been debating, with all the vigour and obsessive attention to detail so characteristic of the breed, exactly which parts of the SGML elephant should be cast to the wolves following the sledge on its way towards the promised land of XML implementability.
Amongst topics which have been discussed I list only a few to give some flavour of the radical nature of what is being proposed:
prohibition of variant concrete syntaxes;
rationalization of the rules about where whitespace and record boundaries are significant;
abolition of most optional SGML features;
abolition of most minimization conventions;
abolition of the requirement that a DTD be present for every kind of processing;
mandatory support for wide character sets such as Unicode.
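Two of the proposals above, processing without a DTD and support for wide character sets, can be illustrated with a short sketch. It relies on Python's xml.etree.ElementTree, which accepts well-formed documents that carry no DTD; the sample document and its element names are invented, and nothing here should be read as a statement of what the XML specification will finally contain.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE declaration and no DTD: every element is explicitly closed
# and every attribute value is quoted, with no minimization.
DOC = '<greeting lang="el"><word>καλημέρα</word><word>hello</word></greeting>'

root = ET.fromstring(DOC)

# The structure is recovered from the markup alone.
language = root.get("lang")
words = [word.text for word in root.findall("word")]
```

Because nothing is minimized, the parser needs no declarations to decide where each element begins and ends, and non-Latin text passes through unchanged.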
A smaller subset of the group has also been voting on about a hundred specific aspects of ISO 8879 which need to be dropped, revised, or retained, to support these objectives. This electronic electoral process (carried out over the web, needless to say) was completed at the start of October, and the XML Editorial Review Board will presumably be spending the next few months either reconciling their own decisions with the views expressed, or coming up with some pretty convincing ideas as to why they have not followed them. Publication of a complete XML specification early in the new year will, it is hoped, remove the last obstacle to the emergence of a new breed of truly SGML-aware user agents on the web, able to take full advantage of the true potential of the information revolution that began ten years ago with the publication of ISO 8879.