VISIT REPORT: L. Burnard
TO: School of Oriental and African Studies, London
FOR: Knowledge Warehouse Conference, 30 March 1988

A company called Mandarin Communications has for the last year been running the Knowledge Warehouse project, a pilot investigation into the feasibility of archiving publishers' typesetting tapes as a quasi-commercial, semi-philanthropic venture. Not the least curious aspect of this project is its hybrid nature, reflected in its original funding (the British Library, the Department of Trade and Industry and the Publishers Association are its primary 'angels'). Despite the billing for this one day conference organised by Mandarin at SOAS ("To review the next steps in establishing the National Electronic Archive"), I did not come away with the impression that the interests of all three parties were being (or were likely to be) equally well-served by the proposed Archive, or Warehouse.

Robin Williamson, a director of Mandarin and originator of the Knowledge Warehouse scheme when it was run by a company called Publishers Database Ltd, opened proceedings with a description of what the project had achieved in its first year and what its prospects were for the future. He made clear at the outset that the Warehouse was intended to hold only "knowledge works" and, at least initially, only works which were also published in conventional form. The value (or rather, in a much used phrase, "added value") of archiving such works in electronic form lay in the structural information embedded within them (as distinct from the "creative value" of the authors' words) and in the consequential ease with which they could be uniformly indexed, and also combined and integrated to create a knowledge resource from which new works could be produced. Archiving of such texts was also (at least notionally) of importance to bodies such as the British Library, given its responsibility to conserve "the National heritage in all its forms".

Phase 1 of the Warehouse project, now completed, had studied and reported on the main legal, commercial and technical difficulties in setting up such an archive. Over a hundred texts in a wide variety of formats from various publishers, but all in the same subject area (maritime law), had been archived. A commercially viable "theoretical product" had been successfully designed. Ways of indexing and cataloguing the material in the Archive had been defined: these would include the usual bibliographic data, detailed commercial and technical information about rights in and format of the text, and also free text descriptive data derived from publishers' blurbs, contents pages and printed indexes to the work (if any). The archive would be financed by one-off deposit fees and by recurrent fees for access to the archive and (even) its index. Royalties (not less than the cost of an equivalent print copy) would be levied and passed back to depositors for each text actually extracted from the archive. A distinction was made between entrepreneurial and research usage of the archive; the chief difference seemed to be that royalties in the case of entrepreneurial access would be negotiable between the depositor and the user, rather than fixed. Another distinction seemed to be that research access would be restricted to "authorised users" (an ominous phrase which appeared to mean just the "library community").
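To give an idea of the shape of such a catalogue entry, here is a hypothetical record of my own devising (the field names and their arrangement are invented for illustration, not taken from the project's documentation):

    Bibliographic: author, title, publisher, ISBN, date of deposit
    Commercial:    rights holder; royalty per issue (not less than print price);
                   entrepreneurial terms negotiable with the depositor
    Technical:     typesetting system, markup conventions, size and medium of deposit
    Descriptive:   free text taken from the blurb, contents pages and printed index (if any)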

It was claimed that this first phase had achieved a satisfactory balance in reconciling the divergent interests of publishers, librarians and whatever community the DTI is supposed to represent, and it was stated that an independent and non-profit review body called the Archive Trust was being set up to maintain it. No further details of the powers or function of this body were given, other than its charitable status and its membership. Less nebulous achievements included the demonstration of the commercial viability of the project and its transparent devotion to the safeguarding of the publishers' interests.

Phase 1 had also demonstrated the practical advantages of standardisation in descriptive markup and of optical media. Williamson concluded his presentation with a ten year growth plan: based on the PA's figures, he predicted that the Warehouse would need to be handling up to 15,000 works per year by the end of its third year to achieve 10% of the industry's output. In the first year there would be a significant gap (£0.75 million) between the cost of archiving and the income generated by deposit and issue fees, which would need to be subsidised. By the tenth year, however, the Archive would be self-supporting.
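For scale (my own back-of-envelope calculation, not a figure quoted at the meeting), the 10% target implies a view of the size of the industry's total output:

    15,000 works per year = 10% of output
    => assumed output of some 150,000 new titles per year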

In discussion of this first presentation a bogeyman called "legal deposit" emerged, which Williamson was clearly anxious to stress was not in any sense on the agenda. He was also keen to assure us that non-print works would eventually find their way into the Archive. Some of the far from trivial technical problems of archiving rapidly changing databases (such as the French telephone directory) were mentioned, though these were clearly not on the agenda either.

After coffee, Clive Bradley, chief executive of the Publishers Association, took the opportunity afforded by his chairing of the next session to launch into a brief tirade against the current copyright bill which he asserted was "whittling away at the concept of a fair return on investment in favour of an extension of consumers' rights". Investment in something called the "information chain", he warned us, was in danger of drying up as a result. Whether appalled at the notion that any extension to our rights might be being proposed by the present government, or delighted by this transparent appeal to entrepreneurial cupidity, we were then all treated to two homilies on the economics of electronic publishing, from Patrick Gibbins (Archetype Systems) and from Richard Gray (Page Bros).

Gibbins' talk had very little to do with archiving. He began by stressing the diversity of electronic publications and the consequent difficulty of making secure predictions about their marketplace. Nevertheless he had some reassuring messages for the entrepreneurs in his audience. One was that no evidence had been found that "electronic delivery" eroded conventional print sales, but rather the reverse. Another was that the value of the information conveyed electronically was generally higher than that of the delivery system - and hence it was possible to adopt premium pricing policies. On the other hand, the seventeen million PCs in the home have such lamentably poor text handling capabilities that the electronic publisher has to put a lot of effort into making the product attractive (or, indeed, legible). This is at least possible when aiming at the PC marketplace (unlike when selling online services); he cited Microsoft's Bookshelf CD-ROM as a well designed and well integrated product. He also told quite a good (and not entirely irrelevant) joke about a chicken and a pig. Struck by the plight of some starving beggars, the chicken suggests joint action in the form of bacon and eggs. The pig points out that this represents considerably more commitment on its part than on that of the chicken. (Collapse of feathered party)

Gray's paper concerned the cost benefits of good (i.e. generic) electronic markup. Good markup reduced editorial costs, reduced the likelihood of creating an unusable archive and facilitated spinoff products both now and in the future. He gave some simple examples to make clear what he meant by generic markup (a sketch in the same spirit appears below), made some rude remarks about SGML ("too complicated"), TeX ("only suitable for maths") and ASPIC ("out of date"), and then recommended that publishers "brew their own". Most of what he had to say was, however, eminently practical: for example, that the use of generic coding involved an extra cost in keyboarding, and that the typographic industry was not currently expected to produce archivable products - whence the widespread practice of "stripping in" final corrections and the usual absence of detailed security provisions. He was confident that printers would adapt to the new requirements, provided they were not imposed as rigid standards. In one unsuccessful attempt to ginger up the atmosphere, I asked whether he didn't think that generic markup was more properly an editorial than a typographic function; in another, someone else asked whether he didn't agree that Desk Top Publishing had put back the cause of generic markup by a decade. By this time however, lunch was looming larger than such well trodden controversies in most of the 40 delegates' minds.
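To make Gray's distinction concrete, here is a sketch of my own (the codes are invented for illustration, not taken from his paper). Typographic coding records how a heading should look; generic coding records what it is, leaving its appearance to be decided whenever the text is next used:

    Typographic:  [bold][14pt]1. Maritime Liens[roman][10pt]
    Generic:      <chapter><title>Maritime Liens</title>

The generic form survives any change of house style, and a program can find every chapter title in the archive without guessing which changes of type size mark headings.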

The pragmatic tone was continued after lunch by Mike Buckingham, of Elsevier IRCS, the consultant responsible for the technical evaluation carried out during the Knowledge Warehouse first phase. 103 works from eleven different publishers had been archived; 11% of these had been originated on word processing equipment of some sort. He outlined the difficulties entailed: typesetters don't keep files together on a text-by-text basis; some word processors and typesetting systems store text in formats compared with which (say) even Wordstar internal files look reasonable. His experience indicated that operating the Knowledge Warehouse in the ways so far outlined would not introduce any particularly new technical problems, but that its difficulties (and presumably costs) would be much alleviated by a greater awareness within the publishing industry of the problems of handling electronic texts. He particularly stressed the importance of descriptive markup for "value-added" applications - it was no use tagging them all as "italics" if foreign language words, animal species and plant species were to be indexed distinctly. His view was that publishers wished to shift the onus for doing this sort of markup back to authors, pointing out that the AAP's Guidelines were aimed specifically at authors, rather than typesetters. He saw DTP systems as essentially ways of producing superior copy from tagged manuscript for proofing purposes.
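Buckingham's italics example is easily illustrated. In a hypothetical fragment of my own (the tag names are invented, in the spirit of the AAP Guidelines):

    Typographic:  the lion (<i>Panthera leo</i>) is called <i>simba</i> in Swahili
    Descriptive:  the lion (<species>Panthera leo</species>) is called <foreign>simba</foreign> in Swahili

From the second form an indexing program can list species names and foreign words separately; from the first it can only list italicised phrases.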

After all this, Marcia Taylor provided what seemed to me a breath of fresh air in her description of the history and current range of operations of the ESRC Data Archive at the University of Essex. This archive has managed to exist on a variety of short term grants for over twenty years, meeting the archival needs of one particular research community with exemplary thoroughness and without compromising any academic principles. They hold numeric data only, and of a particular type, of course, so it is that much easier to catalogue and integrate; and perhaps also that much easier to do so freely and to distribute it freely. Much of the data is deposited by ESRC grant holders, and much by local or national government departments, who use the Archive as a convenient go-between for dealing with the academic research community. Much also comes from similar archives in other countries: the spirit of international co-operation still survives in academia if nowhere else. The Data Archive does not own its collections, but is licensed to distribute them along lines very similar to those of our very own Text Archive (not surprising, since their "User Undertaking" form was the model for ours). They acquire three or four hundred new datasets annually, and are now sufficiently confident to reject any which are inadequately documented. They handle about three thousand requests for material or information annually. All of their data is converted to and stored in a standard format; it is made available to academic users free of charge, and to commercial users on payment of a royalty. As for "value-adding", they have recently started combining statistical tables in specialist areas to form new resources, such as the Rural Areas Database, and have also produced some experimental educational packages such as 2500 tables for the BBC Domesday disk, or sets of census data for use in schools. Marcia's low-key delivery may have masked somewhat the challenge which the demonstrable success of the Data Archive presents to the commercial orthodoxy characterising all the other speakers.

Neville Cusworth of Butterworths restated that orthodoxy with a fairly crushing series of financial models. As responsible electronic publishers wishing to exploit the Archive (when it came about), what would be the costs entailed in creating a CD-ROM product from it? How much should be allowed for design of the product, data conversion, software development (or licensing), creation of trial product, indexing, mastering, distribution, demonstration, obtaining market feedback, production of documentation? Mr Cusworth (or his research assistant) had successfully obtained from somewhere figures for all of these and more, on the basis of which he was able to present a series of marketing scenarios for low, high and medium volume sales. It seems that to break even in the first year a CD-ROM priced at £800 with a profit margin of 20% will need a total sale of 1,000. A product priced at £10,000 with the same margin, however, and achieving only 30 subscriptions will give a loss in the first year but prove "a real goer" in years two and three (considering 30 subscriptions a "real goer" in any circumstances is probably what is known as creative accounting). Speculation apart, it seems clear that current pricing for CD-ROM is based largely on the model of online access: i.e. it is expensive. Books in Print costs £750 a year and Harrap's Multilingual Dictionary £595.
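Cusworth did not show his workings, but the first scenario presumably runs along these lines (my reconstruction; the cost figure is inferred from his stated price and margin, not quoted):

    1,000 copies x £800          = £800,000 gross revenue
    20% margin on that revenue   = £160,000 profit
    implied first-year costs     = £640,000 (design, conversion, mastering, marketing and the rest)

On these assumptions, every sale short of the thousand bites directly into a rather thin cushion.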

Dr Maurice Line (British Library) had the difficult task of putting the case for the archival function. He reminded us that most countries accepted the necessity of recording and archiving all their cultural artefacts for scholarly use, both internally and internationally, from which the need for something he rather equivocally called a "comprehensive deposit" of electronic materials seemed to follow. The British Library clearly had a responsibility for archiving such materials, especially where they did not exist in any other form. He hoped that the Knowledge Warehouse was a step towards the goal of true national archiving: certainly no other country had anything comparable. He raised some interesting problems which had not been addressed by previous speakers: how would archival copies be refreshed if they were not re-published? How could book structures be encouraged to resemble more closely the structures appropriate for online access? On what basis would the 10% sample of new books archived in the Warehouse's first three years be selected? He agreed that it was hard to see how libraries could avoid being charged for access to the Warehouse, but pointed out that those most likely to want to use it were from the arts and the humanities, fields in which the sorts of funding assumed by the legal profession simply did not exist. How could abuses of the system be prevented - or even identified? How would the boundary between the recombination of existing knowledge wares and old fashioned plagiarism be defined? When works went out of copyright (or their publishers out of business) would the Warehouse simply take over their royalties? He concluded by repeating that the Warehouse would necessarily continue on a voluntary basis and that "the eyes of the world were upon us".

A final discussion, chaired by Sir Harry Hookway, picked up some of these points, and spelled out more clearly to the librarians present exactly how the scheme would operate. It would not be possible to extract less than a complete book from the system, and charging would be on a book by book basis. As to deposits, it was repeated that there was no possibility of "legal deposit" being imposed on the publishers or the Warehouse; and that the materials chosen for deposit would be "well-structured" - which I took to mean that only works for which some subsequent re-use could be identified would be archived. I found this all very dispiriting: despite the lip-service being paid to the notion of a national electronic archive, it was clear that Knowledge Warehouse's priorities were all to do with the establishment of a viable entrepreneurial archive. This seems a pity, since in the absence of proper commitment to the archiving of non-print works by institutions such as the British Library or the P.R.O. (or at any rate of funding to support it), the coming electronic age may be unique in leaving no trace of itself for its successors, if any.