Along with several thousand other punters I received an invitation to the first series of "Sequent Lectures" among the usual spring tide of junk mail; as did several hundred others, I decided it might be worth a visit, since it was free and featured most of the dbms I know and love. The event was held in the unspeakably awful Novotel, which is largely composed of multi storey car park, and had three parts. In part one, an earnest American salesman explained very slowly why Sequent machines are so fast (and reliable and wonderful and cheap). This I found mildly informative, never having paid the company much attention before; the parallel architecture (lots of stripped down VAXes hanging off a superfast bus) sounds remarkably sensible, providing that you can take advantage of it. To do this properly, however, you clearly need some pretty smart programmers. My jargon detector popped its dial on the phrase "We have architected that [i.e. Oracle's use of background processes] on the Sequent" and failed to function for the rest of the morning.
Part two came after coffee and comprised three two-stage sessions held in parallel (another case of Sequent architecting). These were supposed to be for Sequent's favourite software vendors to endorse the message by explaining how they'd taken advantage of the wonderful box in their implementations: predictably they turned out to be fairly low key sales pitches with only token gestures in this direction. Products featured were Ingres, Oracle, Informix, Unify and The Office Smith. I missed all of Unify (it's a supposedly high performance TP-type system hosted under UNIX), Oracle (heard it all before) and also all of the technical seminar on parallel programming (from which however I did steal a programming guide). The Ingres speaker seemed proudest about Ingres' Visual Programming (Trademark) and its "state of the art" query optimiser. It exploits the Sequent architecture by running front and back end processes on different processors, as does Oracle, I assume. The Informix speaker was proudest about their 4GL; he did however announce the new Informix-TURBO which can be used to beef up multi-user large scale UNIX implementations (not however noticeably using parallel programming techniques) and also DATASHEET ADD-OM with which you can make your Informix-SQL database look just like Lotus 1-2-3. There's progress. Office Smith turns out to be a fairly drab hierarchic text indexing system for UNIX boxes only. The speaker clearly felt rather defensive about this ("relational is just a small subset of database technology") and rightly so. It uses B-trees and compresses index terms rather like those speed-writing adverts (KN Y RD THS?); one thing in its favour is that it was designed to be bilingual, emanating as it does from the Canadian Government.
The main event of the day was Dr Rob Wilmott's Vision of the Future, an inspirational little number, featuring lots of graphs showing sharply divergent lines with labels such as "Shipped MIPs", "Price/Performance Learning Curve" (and only 3 spelling mistakes) etc etc. Fortunately for the innumerate, the lessons Dr Wilmott wished us to take home were not too difficult: (1) lots of small machines is better value for money than one biggie (2) progress is being impeded by fuddy-duddy conservatism and the deadweight of current software investment (3) OSI standards are a Good Thing, and are Happening. Likewise, UNIX, C etc. These messages were all dressed up rather fetchingly with the usual sort of stuff about the imminent collapse of the British non-manufacturing industry and the appalling levels of ignorance in British management. To fill the latter gap, our Rob has -surprise- started a new management consultancy called OASIS which will help you "go multi-vendor" and transform your software productivity before the astonished eyes of the competition breathing down your neck. Question time provoked an unexpected smear on government collaborative ventures, and (with reference to whether IBM would ever get involved in parallel architectures), quite the best mixed metaphor of 1987, so far, viz "Once the benchmarks are on the table, you will see all Hell break loose".
A nominal lunch was provided, after which I trekked across London to visit the British Library's Research & Development Division, deep in the heart of Soho. It is possible that they would be willing to fund a one year research post here to assess the actual and potential uses of machine readable texts, which would also help keep the Text Archive on its feet. I spoke to Terry Cannon, who was encouraging.
The purpose of this expedition was to tie up loose ends left over from Bodley's long standing investigation of the suitability of the Memex text-searching engine as a means of providing online access to the pre-1920 and similar catalogues. This investigation began with a British Library R&D Grant in 1983, at which time Memex was hosted by the Library's own PDP-11; for a variety of technical reasons this proved inadequate to the task, and the project was temporarily dropped in 1985. In 1986 Memex set up a marketing agreement with Gould which proved to be a distinct improvement both commercially, in that there are now several installed systems running on Gould minis, and technically in that they now have a demonstrable version of the Bodleian's catalogue. Hearing this, Geoff Neate arranged a two day trip to Memex's East Kilbride offices, and kindly asked me to accompany him. In the event, although Memex were still unable to demonstrate a true working version of the catalogue, the visit proved well worthwhile.
We were first given a detailed account of the company's current state and market prospects, which look much healthier as a result of the one year agreement with Gould. The company now employs 18 staff at East Kilbride, and seven at the Edinburgh research lab. There are nearly twenty pilot systems now installed, and some of these were described in some detail. They included the usual unspecifiable Defence and Police applications, and some fairly boring ones like a database of all the telexes received at Peat Marwick Mitchell's New York office, but also some rather more imaginative systems such as 20 Mb of script summaries maintained by TV-am, which could be searched for visually interesting snippets such as President Reagan picking his nose on camera etc. In the commercial world MEMEX's speed both in search and in set-up time makes it a natural for companies wishing to scan the 'Commerce Business Daily' - an electronically published list of all US Government jobs currently up for tender - or even (I suppose) the body of case law maintained by Context Legal Systems. There are no other library applications however, which is largely attributed to librarians' lack of desire to step outside the approach favoured by the British Library.
Development of the product continues; the exclusivity of the Gould arrangement has now lapsed, which means that development is now concentrating on the DEC and DEC OEM marketplace. One (very interesting) current version of Memex is a single board that plugs into the Q BUS on a microVAX2 running VMS and costs about 5000 pounds; similar boards are available for bigger machines with prices up to 20,000. Because the device uses a standard VME BUS, it can be configured into a wide range of hardware; one other possibility clearly under consideration was the SUN workstation.
The current system operates in a way quite similar to conventional indexing systems. The text is regarded as a flat file of hierarchically organised structural units (document, chapter, paragraph, sentence for example) which are composed of tokens all of a single type. Conversion of text to "infobase" (sic) involves the creation of an index of non-numeric tokens (the "vocabulary") which maps the external form of each such token to a unique symbol. The text is stored in a compressed form by replacing each token by this symbol, which may be up to 3 bytes long. Capitalisation, whether or not the token is a word-separator and whether or not it is a number are all indicated by flag bits. Tokens recognised as numbers are converted to fixed- or floating-point form and excluded from the vocabulary.
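The encoding just described can be illustrated with a minimal sketch. It is not Memex's actual format: the flag values, the tokenising rule and the function names are all assumptions made for illustration, and only an initial capital is recorded.

```python
import re

# Hypothetical flag bits; the real Memex flag layout is not documented here.
CAP, SEP, NUM = 1, 2, 4

# Numbers, runs of ASCII letters, and everything else (treated as separators).
TOKEN_RE = re.compile(r"(?P<num>\d+(?:\.\d+)?)|(?P<word>[A-Za-z]+)|(?P<sep>[^A-Za-z\d]+)")

def encode(text, vocab):
    """Replace each token by (flags, payload); vocab maps token -> code."""
    out = []
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if m.lastgroup == "num":
            # Numbers are stored as values and kept out of the vocabulary.
            out.append((NUM, float(tok) if "." in tok else int(tok)))
        elif m.lastgroup == "word":
            flags = CAP if tok[0].isupper() else 0
            out.append((flags, vocab.setdefault(tok.lower(), len(vocab))))
        else:
            out.append((SEP, vocab.setdefault(tok, len(vocab))))
    return out

def decode(encoded, vocab):
    """Rebuild the text from the encoded stream and the vocabulary."""
    inverse = {code: tok for tok, code in vocab.items()}
    parts = []
    for flags, payload in encoded:
        if flags & NUM:
            parts.append(str(payload))
        else:
            tok = inverse[payload]
            parts.append(tok.capitalize() if flags & CAP else tok)
    return "".join(parts)
```

Encoding "Printed in London, 1987" with an empty vocabulary leaves five entries in the vocabulary (three words and two separators) and none for the number.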
No occurrence index is maintained. Searches are carried out by first scanning the vocabulary for the required terms (so zero-hit searches are very rapid indeed!) to extract the corresponding codes; then delegating the search for these codes to the Memex board (this has a throughput of around 0.4 Mb/sec, or - since it is operating on compressed text - effectively about 200,000 words/second). Hit records (i.e. addresses within the file) must then be decoded for display, or may be retained for further (set) operations. In the version of Memex available on Gould (though not that now implemented on VAX) inspection for proximity matching also has to be put off to this post-processing stage, as it does with CAFS.
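The quoted figures imply roughly two bytes of compressed text per word (0.4 Mb/sec divided by 200,000 words/sec). The two-stage search can be sketched as below; this is a software illustration only, in which a plain loop stands in for the Memex board and the function name and data layout are assumptions.

```python
def search(terms, vocab, records):
    """Return (record_no, token_no) addresses where any of the terms occurs.

    vocab maps lower-case token -> code; records is a list of code lists.
    """
    # Stage 1: vocabulary scan. A term absent from the vocabulary cannot
    # occur in the text, so a zero-hit search never touches the file.
    codes = {vocab[t.lower()] for t in terms if t.lower() in vocab}
    if not codes:
        return []
    # Stage 2: linear scan of the compressed text (the hardware's job).
    hits = []
    for r, record in enumerate(records):
        for p, code in enumerate(record):
            if code in codes:
                hits.append((r, p))
    return hits
```

The returned addresses correspond to the hit records that must then be decoded for display or retained for set operations.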
Unlike CAFS, the MEMEX hardware does not support any sort of fuzzy matching: all search terms must be stated explicitly. The availability of the vocabulary file goes a long way to counteracting this inconvenience and it is possible to add a 'reversed' vocabulary file so that words ending with particular strings can easily be identified; obviously the full generality of the facilities available with CAFS fuzzy matching is still not catered for however. If the number of search terms exceeds the number of search channels available (8, cp. CAFS 16), the query optimiser will initiate more than one scan through the file transparently to the user, rather than rejecting the search as CAFS currently does.
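Both ideas in this paragraph are easy to sketch. The code below assumes the reversed vocabulary is simply a sorted list of reversed words, and that the optimiser splits search codes into batches of eight; neither detail is confirmed by Memex's documentation, and all names are hypothetical.

```python
import bisect

def build_reversed(words):
    """Sorted list of reversed words: a suffix becomes a prefix."""
    return sorted(w[::-1] for w in words)

def words_ending_with(rev_vocab, suffix):
    """Find every vocabulary word ending in suffix by binary search."""
    key = suffix[::-1]
    i = bisect.bisect_left(rev_vocab, key)
    found = []
    while i < len(rev_vocab) and rev_vocab[i].startswith(key):
        found.append(rev_vocab[i][::-1])
        i += 1
    return found

def plan_scans(codes, channels=8):
    """Split search codes into one batch per pass through the file."""
    codes = list(codes)
    return [codes[i:i + channels] for i in range(0, len(codes), channels)]
```

A search on twenty terms would thus need three passes (8, 8 and 4 codes), where a 16-channel device that rejects oversized queries would refuse it.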
For very large files, a signature file can also be maintained to optimise performance by allowing for focussed searching, in much the same way as the Advanced CAFS Option. With all these options in place, the amount of filestore space saved by the compression becomes rather less significant; detailed figures calculated for one of the Bodleian files only (DAFZ) show that although the original raw data file (16.6 Mb) was reduced to 12.6, the amount of space needed for ancillary indexes etc brought the total filestore requirement for this file up to 21.9 Mb; the CAFS searchable form of the same file was 23 Mb. Compression is still a very effective way of speeding up the search process, simply by reducing the amount of data to be scanned, of course.
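The format of the Memex signature file was not described to us, so the following is only a generic sketch of how such a file can focus a search: each block of text gets a small bit pattern summarising the codes it contains, and blocks whose pattern cannot match are skipped. The width of 64 bits and all names are assumptions.

```python
WIDTH = 64  # assumed signature width in bits

def signature(codes):
    """OR together one bit per code; distinct codes may share a bit."""
    sig = 0
    for c in codes:
        sig |= 1 << (c % WIDTH)
    return sig

def candidate_blocks(block_sigs, query_codes):
    """Blocks whose signature contains every query bit (may over-select)."""
    q = signature(query_codes)
    return [i for i, s in enumerate(block_sigs) if (s & q) == q]

def focussed_search(blocks, block_sigs, query_codes):
    """Scan only the candidate blocks, discarding false positives."""
    wanted = set(query_codes)
    return [i for i in candidate_blocks(block_sigs, query_codes)
            if wanted <= set(blocks[i])]
```

The signatures cost extra filestore, which is the trade-off visible in the DAFZ figures above: 12.6 Mb of compressed text but 21.9 Mb in total.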
The other possible drawback of storing text in compressed form - updating problems - is obviated to a large extent by the provision of an online screen editor which operates on the "infobase" directly. We were not able to see this in action, but from its description in the documentation it seems more than adequate for most uses.
As currently packaged the system does not support multiple indexes nor any other way of categorising tokens within an index, except insofar as numbers are specially treated. The sort of precision made possible by CAFS SIF features is thus entirely lacking. To search for "London" in a title rather than "London" in an imprint, we had to resort to the rather counter-intuitive process of specifying that "London" must precede the word "Imprint" in the record; to search for books printed in Tunbridge Wells, one would similarly have to search for "Imprint" and "Tunbridge Wells" in that order and within 3 words. Aside from their reliance on the existence of the tokens "Imprint" (etc) within the record, neither procedure worked entirely satisfactorily in the Bodleian data, which contains multiple bibliographical entities within one record.
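The Tunbridge Wells workaround amounts to an ordered proximity match, which can be sketched as follows. This is an illustration in ordinary software, not the Memex query language; the function name, the reading of "within 3 words" and the sample records are assumptions.

```python
def ordered_within(words, first, phrase, max_gap=3):
    """Find (i, j) where `first` is at word i and `phrase` starts at word j,
    with i < j <= i + max_gap. Matching ignores case."""
    words = [w.lower() for w in words]
    first = first.lower()
    phrase = [p.lower() for p in phrase]
    last_start = len(words) - len(phrase)
    hits = []
    for i, w in enumerate(words):
        if w != first:
            continue
        for j in range(i + 1, min(i + max_gap, last_start) + 1):
            if words[j:j + len(phrase)] == phrase:
                hits.append((i, j))
    return hits

# Invented sample record, for illustration only.
record = "Title A guide to London Imprint Tunbridge Wells 1850".split()
print(ordered_within(record, "Imprint", ["Tunbridge", "Wells"]))
```

A record in which "Tunbridge Wells" appears only in the title, before "Imprint", returns no hits. With several bibliographical entities in one record, however, an "Imprint" from one entity can sit within range of a title from the next, which is one way the procedure can go wrong.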
Post-processing facilities in the software demonstrated were quite impressive: the user can combine results of searches, mark particular hits as significant, narrow or broaden the search focus, re-run previous searches, interrogate a history file etc etc. The query language used is also reasonably comprehensive, though its syntax would present some problems to users not previously exposed to such notions as "exclusive or" or "proximity match" or "regular expression"; it would be quite easy to hide all of this as a CALL-level interface to the search engine is also provided, which is directly accessible from C programs.
Documentation provided consists of a programming manual and a descriptive user guide, which is reasonably accessible. (Though it does include the following benumbing sentence: "The NOT operator is existential and cannot be interpreted as an 'outwith' operator in the case of proximity".)
The staff at MEMEX were very helpful, not just in their willingness to explicate sentences of this type, but also in the readiness to let us take over one of their Gould machines for a day's tinkering. Unfortunately, the transfer of the pre-1920 catalogue had not been done properly, several of the records being incomplete and the numeric fields being incorrectly translated, so it remained difficult to make an accurate assessment of the system's performance. However, so far as we could tell, one complete scan through all 12 'infobases' into which the pre-1920 catalogue is currently divided, assuming that the tokens to be searched for exist in every file, would take around 5 minutes. This compares favourably with the current CAFS guesstimate for the same operation, which is around ten minutes. We carried out rough timings for a range of searches against one of the files; these are detailed in Geoff Neate's report.
Testimony to the ease with which text can be converted to a Memex "infobase" was provided by the Carte Papers, a collection of 17th century documents which we brought with us on a magnetic tape, and were able to search (on the micro-VAX) within a few hours.
We also learned something of the company's future plans. Of most interest here was something called the "Vorlich machine" currently being designed at their Edinburgh research laboratory. This device will use the kind of pattern recognition algorithms built into the current generation of image and voice recognition systems to tokenise free text by hardware, thus doing away with the need for the current encode/decode software.
As yet, Memex do not have a system which we could consider as an off the shelf user text searching product. Neither have they actually demonstrated to us all of the claimed potential of their current product as a library searching system-builder. Nevertheless, the company now has a secure financial basis from which to engage in the sort of primary research needed to make one, together with a great deal of expertise. Their switching to DEC hardware with or without the UNIX environment to host the system also makes them very attractive in the academic context. If hardware assisted text searching engines do become commonplace in the next few years, as they show every sign of doing, Memex must have a bright future.
Westfield College campus begins increasingly to resemble the set for some grimy Channel 4 documentary on the state of British Education. The exteriors of its gracious 19th century buildings are suffering a rash of desperate fly-posting while their bare interiors remain un-redecorated and unloved. For this conference, the ruins have put on a pretence of being inhabited still, which somehow makes them all the more depressing. In an ambitious moment twenty years ago, Westfield erected a functionalist science block, derelict for the last few years since it lost its science department; for this occasion it has been unlocked and its ground floor heating switched on. Ghosts lurk in the corridors, however. Elsewhere, in what was once a library, there are still a few comfortable chairs and a non-stop coffee machine, but all the bookshelves are bare.
Maybe the atmosphere affected my judgement, or maybe it's just that it had a hard act to follow, but I found this second conference less exciting than the first one. There was the same extraordinarily broad-based constituency of delegates, from secondary school teachers to academic researchers, as well as a significant European presence (except for the French who were conspicuously absent): the attendance list includes nearly 500 people. There was also the same abundance of material: around 250 papers crammed into two days of parallel sessions. Considerable effort had been made to group papers on a common theme into the same session, which encouraged more detailed and informed discussion but discouraged the serendipity I had enjoyed at the previous year's event. The distributed nature of Westfield's surviving lecture rooms also made it very difficult for butterflies like myself, once stuck in a group of rather limp papers on the applications of Knowledge Based Systems in secondary education, to escape to the parallel session on "Recent advances in historical demography" which was clearly where the real action was going on.
There were two plenary sessions, of which I attended only the first, which was a "keynote address" style lecture by Roderick Floud. Prof Floud has been somewhat of a pioneer amongst computing historians, having published an article advocating the use of electronic card readers in 1973. His lecture was enthusiastic but decently sober about the micro revolution, stressing that new tools did not mean new methods. In the future, he was confident that data input methods would remain a central problem, however advanced the technology. He described what he called a "prompting data input program" that had been developed for use in capturing US Army pension records and demonstrated the ease with which data could be manipulated by a typical cheap micro dbms/spreadsheet package (REFLEX, no less) and concluded with a plea for historians to fight against the "mythology of computing".
As aforesaid, I made the mistake of choosing the wrong session from the four parallel workshops offered next, from which I gained nothing but a nice new acronym (MITSI - the Man In The Street Interface). The third paper in this group was the best: it was from a Portuguese scholar who had developed an expert system for handling about 2000 depositions of "sinners" as recorded in 17th - 18th century ecclesiastical court records. Unfortunately Carvalho's English was not up to the task of explaining a great deal of its inner workings, though the principles seemed clear enough.
I had no choice in the next set of four: whatever the rival attractions of "Urban and regional studies" (quite a bit), "Higher Education Seminar" (rather less) or "Prosopographical studies" (rather more), I had to attend the workshop on "Relational database method", if only because I was giving the first paper in it. This (a rapid introduction to SQL and the relational model using D. Greenstein's Philadelphia data as example) had to be boiled down from about 2 hours worth of overheads to a very fast 30 minutes, but it seemed to go down reasonably well. Phil Hartland (RHBNC) then gave an unusually clear and jargon-free exposition of the virtues of SSADM in managing large projects: two intriguing examples he mentioned were a projected history of the music hall and also a database about music in the 18th century. Michael Gervers from Toronto (one of the few non-Europeans present) reported on his Pauline conversion to ORACLE in much the same terms as last year: he has now produced some quite interesting results about changes in the landholding status of Mediaeval textile workers.
Next day, I arrived in time for the last part of an informal workshop on data standardisation chaired by Manfred Thaller, which appeared to be making very little progress: someone was pleading for a set of 'ethical guidelines'. After coffee, I plumped for the session on "Problems of multiple record linkage", thus missing the intellectual ("Recent advances in historical psephology"), the exotic ("Schools Education Seminar") and the ineluctable ("Academic wordprocessing" - a dizzying combination of Tex, Latex and Tustep). My chosen session began with Arno Kitts' (Southampton) solid exposition of the historical and methodological problems involved in accurately linking together Portuguese names as they appear in 19th and 20th century passport lists, electoral rolls, cemetery lists etc. The object of the exercise is to determine patterns of emigration: calculating for example the rates of return migration. The linkage procedure should be completely automatic (he asserted) to avoid subjectivity, but necessarily involved dictionary lookup for some more widely varying name forms. None of these problems seemed to worry the next speaker, our very own A. Rodriguez, whose record linkage problems were virtually non-existent: her data consisting of some 8000 records of birth, marriages and deaths in all of which surname, forenames, and father's names are all obligingly present. Even SIR could cope with data as simple as this: all that was necessary was a massive sort on the names, followed by a forty line piece of procedural gibberish to insert links between records with the same namestring present, written for her by the obliging D. Doulton of Southampton, centre of SIR fanaticism in the known universe. The last speaker, Ariane Mirabdobaghli (LSE) was using Ingres to link 18th century parish and tax records: it was not at all clear how, which is a pity.
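The sort-then-link procedure used on the Rodriguez data is simple enough to sketch. What follows is not the SIR code, which I did not see, but a minimal illustration of the same idea; the field names and the rule that a namestring is surname, forenames and father's name folded to lower case are assumptions.

```python
from itertools import groupby

def namestring(rec):
    """Linkage key: surname, forenames and father's name, case-folded."""
    return (rec["surname"].lower(), rec["forenames"].lower(), rec["father"].lower())

def link_records(records):
    """Sort on the namestring, then link records sharing the same key.

    Returns a list of linked groups, each a list of record ids.
    """
    ordered = sorted(records, key=namestring)
    links = []
    for _, group in groupby(ordered, key=namestring):
        ids = [rec["id"] for rec in group]
        if len(ids) > 1:
            links.append(ids)
    return links
```

Exact matching of this kind works only because every field is present and consistently spelled; the variable name forms in the Kitts data would need the dictionary lookup he described before any such sort could link them.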
The remainder of the conference consisted of five parallel sessions of five "research reports" each, spaced out so as to permit session hopping. I managed to catch Dunk (sic) and Rahtz (sic) on strategies for gravestone recordings (a flatteringly familiar exercise in conceptual modelling); Dolezalek (Frankfurt) on ways of reconstructing manuscript stemma (an intriguing, if apparently hopeless text); Nault (Montreal) on an enormous historical demography project at Quebec (births and marriages of every individual between 1608 and 1729) - being stuck with a Cyber 70 they had to write their own dbms, but seem to be doing quite well with it; and finally, Lamm (Birkbeck) who has been let into the MOD's secret archive of first world war soldiers' records with an Epson portable. He is using this to extract a minute random sample of about 8000 records, about thirty variables (height, age, length of service etc) from the attestation papers, personal correspondence, war records, pension and medical books etc etc here stored away on some 64,000 feet of shelfspace. I found it rather depressing that this numerically recoded set of SPSS data would probably be all that remained of this archive by the time it was made public in 1995, the rest - already damaged by fire - having long since crumbled to dust. But my friend from the Public Record Office seemed quite relieved at the prospect.
The AHC (as I suppose we shall have to call it) now has a formal constitution and its own magazine. The enthusiasm generated at last year's conference continues to thrive. But I hope that next year, when it is planned to organise a smaller national conference on more focussed topics, I shall be able to report more substantial fruits from it.
CATH 87 (as it will no doubt come to be known) was an unusual event in several respects. For one thing, as Nigel Gardner (CTISS) pointed out in his introductory remarks, it approximated to that perfection proposed by David Lodge, a conference with no formal papers. For another, instead of published proceedings at some vague time in the future, all delegates were presented at registration time with a collection of essays by various hands covering most of the topics addressed by the conference, now published by Ellis Horwood as "Information Technology in the Humanities", edited by S. Rahtz.
Another unusual aspect of the proceedings, at least from my cloistered viewpoint, was that just as many of the 100+ delegates came from Polytechnics and other institutions in the 'public sector' of higher ed, as came from Universities and similar bastions of privilege. This burgeoning of interest may have something to do with the coming into existence of a working party on IT in the Humanities (public sector only) sponsored by the CNAA. This working party is chaired by David Miall and is currently conducting a survey, planning a workshop on the theme this autumn and aims to set up a clearing house of some description.
There were in fact two formal papers: one at the start, from the charismatic Richard Ennals, and one at the end, from the even more charismatic (because French) Jean-Claude Gardin. Ennals, who is now at Kingston CFE, was inspirational (at least in intent) on the importance of the humanities and their under-rated powers which, he insisted, could be made more effective still by the appropriate use of computers. AI, the 'technology of thought', might provide a way of bridging the gap between the "two cultures" (Ennals is clearly a child of the sixties); the absence of Theory from the humanities might be a strength; Piaget's beneficial influence on primary school teaching needed to be carried through into the secondary system; logical positivists were a lot more 'dehumanized' than computers; rules (as in expert systems) could be descriptive rather than delimiting; input from the Humanities was needed because of the complexity of the problems to be tackled. These and similar ideas served to illuminate, fitfully, Ennals' final proposition of "computational politics" - that software engineers could profitably learn from social engineers. This highly seductive notion relied on what (I suspect) is a purely metaphorical similarity between the transition from single CPU to parallel architectures on the one hand, and the transcending of solipsism in the modern democratic state on the other. It was a bravura performance.
In between the two formal papers, there were six parallel workshop sessions, each on specific topics, and also three introductory tutorial sessions. The organisers of the workshops had been briefed to stimulate discussion and argument rather than simply read out papers, which for the most part they did. The range of topics covered was impressive, as was the concentration of expertise. I attended workshops on Concordances (P. King from Birmingham), Programming (Wendy Hall from Southampton), Art History (Dave Guppy and Will Vaughan from UCL), Classics (Kevin O'Connell from Exeter), Linguistics (L. Davidson from Leeds) and Literature (Tom Corns from Bangor), thus missing inter alia S. Rahtz on Archaeology, R. Trainor on History, G. Davies on CALL, J. MacGregor on Theology, A. Pearce on Music and P. Salotti on Databases.
I found the Concordances Workshop rather disappointing, though it did stimulate much discussion. King was anxious to demonstrate his own concordance generator which runs on an Amstrad word-processor, though he did bring out several useful applications for its output (fairly rudimentary KWIC lists) in teaching non-native speakers of English to identify patterns in contemporary usage. There was much argument about the normative effect of such exercises. Several people enquired about micro-OCP.
The Programming Workshop was equally ready to tackle fundamental issues. Wendy Hall quoted Dijkstra on the malignant effect of BASIC to great effect and also (clearly) took a quiet pleasure in the total absence of any evidence that teaching programming was a good way of training people to reason logically. Dave de Roure advocated LISP; Sebastian Rahtz Icon. Several people pointed out that the programming environment probably mattered more in determining the ease with which a language was acquired than the language itself; there was some agreement that the difficulty of structured languages might in fact be no bad thing. A gentleman from IBM endeared himself greatly to me by asserting that (a) any programming skills acquired at universities were totally useless in a commercial context and (b) it would be a lot more use to teach people how to organise and structure their data properly.
After dinner (bearable) we were rewarded for our persistence in trekking half a mile through pouring rain by a postprandial entertainment from Jon Nicholl of Exeter's Education Department. This consisted of demonstrations of three applications ('authorizations'?) of the LINKS program, a simple expert systems shell for use in comprehensive schools. One recreated a detective story apparently familiar to every former B.Ed student; one (written by a ten year old) impersonated a mediaeval physician; one had to do with Devonian placenames. The second was the most fun; the subtext of the presentation was that teaching project work in this way could actually be a lot more fun as well as getting across some interesting principles of abstraction. Though I wondered whether hierarchic tree structures might not turn out to be just as mentally crippling as BASIC.
Dave Guppy opened the Art History Workshop with a sceptical survey of possible computer applications, including image processing, storage problems, indexing problems etc etc. For him Art History was about fundamentally difficult and affective aspects of experience. Will Vaughan tried to redress the balance by pointing to the possibilities of new storage media as means of manipulating images, but had to agree that there were still very few real applications outside museums. As case study Guppy provided us with two very nice pictures of naked ladies, one by Giotto and the other by Titian, together with commentary by a distinguished art historian called Freedberg and the workshop eventually cohered in a long discussion about how a computer could possibly have assisted in his analysis. (Not a lot, it transpires.)
The Classics Workshop was somewhat of a misnomer and also nearly floored completely by an uncooperative AT. Fortunately Kevin O'Connell was too much of a professional to let this seriously impair his presentation of how Nichol's LINKS could also be used to represent the plot of Antigone, though it did slow down somewhat his description of an expert system (programmed in micro Prolog) based on the "Roman World" flash cards which are apparently now widely used to teach classics (if 'widely' is the right word). The claim was that a model of the inter-relationships recorded on Latin inscriptions from Lugdunum could be adequately represented and easily manipulated using micro Prolog; I remain unconvinced.
Of those I attended, the Linguistics Workshop probably adhered closest to the organisers' brief, perhaps because Leeds is one of the few places where computing is taught as an essential component of the Linguistics course. Davidson described in some detail the various parts of this teaching, plotted against two axes which he saw as mutually exclusive, viz the type or amount of purely computational skill needed and direct relevance of the skill acquired to the academic subject. He raised a number of pedagogically important issues, notably that current research in linguistics seems to be depending more and more on computational models which owe little or nothing to formal linguistics (which did not use to be the case). One prime case is the 'simulated annealing' parsing project at Leeds which uses a purely stochastic model; another is the need for socio-linguists to employ purely sociological data, such as census returns. Most of the discussion centred on what actually gets taught. Leeds' BA students apparently thrive on a three day intensive course covering the rudiments of CMS and OCP together; there was little support (presumably as a result of bitter experience) for my view that general courses on operating systems were better left to computing centre staff.
Tom Corns began the Literature workshop by asserting simply that literature was very difficult for humans, let alone computers, because of the complexity and subtlety of readers' responses to it (which was one of the strengths of the case according to Ennals). Perhaps more significantly, (and certainly more memorably), he remarked that literary criticism remained "totally innocent of computer-aided achievements", despite the fact that the subject itself was still alive and well. Stylistics, which had once seemed to offer the computer an entree, had been effectively killed off by the likes of Fish on theoretical grounds, while the McCabe/Eagleton radico-deconstructionist-feminist axis had no time for the "toys for the boys" technological ethos. But as all good critics (and readers of Kipling) know, ignoring the technology of your day simply marginalises your discipline. The bulk of his presentation therefore concentrated on immediate strategies to raise the level of awareness of computational possibilities amongst the current crop of students. The discipline had always required high standards of presentation and well organised bodies of data; the word processor, the database, and even the concordance were all highly effective means to those ends, if they had no more theoretically seductive claims on students' time. In the future of course, there would be other possibilities; amongst these he adumbrated the possibilities of an Old English CALL system, and something called "advanced study aids", by which (I think) he (or rather Margarette Smith who shared the honours of this presentation) meant hypertext systems, incorporating a user-modelling component.
The proceedings were wound up by Prof Jean-Claude Gardin's formal paper which (I regret to say) I did not fully understand, largely because of its use of mathematical formulae to express types of inferential methods and other habits of mind alien to my anglo-saxon soul, but which I suspect would have been congenial to Fish. Gardin has been eminent in the sphere of interpreting archaeological finds and other cultural manifestations for the last thirty years but (he said comfortingly) the only progress he could detect had been the recognition that there could be no independent standards to describe such objects: there are as many descriptions as there are research goals. Like Ennals, he saw interesting opportunities in AI systems not just because they are well funded, though that should not be ignored, but because they paralleled his current research strategy. A given set of semiological components (representation systems) can be acted on by different processing components to reach different conclusions, according to different goals; in the same way, a given set of facts and rules may be acted on by an inference engine to construct a knowledge based system. The recursiveness of deconstructive criticism was exemplified at some length: Jakobson & Levi-Strauss' study supposedly saying all there was to be said of Baudelaire's "Les Chats" had stimulated 28 critical responses, which they had dutifully included in a revised edition, und so weiter. He also felt the need to preserve 'bilinguism', that is to present their results in ways appropriate to (their expectations of) their readers' likely expectations.
If Ennals began this conference by assuring us that the humanities had something to offer the world, then Gardin closed it by reminding us that whatever that might be it was not certainty, and that scientistic rigour was as out of place in the humanities as we had always hoped. In between, we had had ample opportunity to see what the technology could do and how it could be shaped to our ends, provided of course we could determine what those might be. I have already remarked on various unusual aspects of this conference; perhaps not the least significant of these was a general willingness to confront and discuss quite fundamental issues at a non-trivial level.
This was the first of a series of planned meetings with chosen suppliers. Information Dimensions laid on a fairly detailed presentation of the various components of BASIS, followed by a presentation, and much argument. The level of detail was nicely judged for the audience (for a change) and members of the Working Party were clearly impressed. Lynne Brindley at Aston was reported to be making a video course on BASIS, which sounded interesting.
The status of other contenders on the shortlist was briefly reviewed after lunch. Assassin (after further discussion) was scratched from the list: it lacked interfaces, was comparatively poor in facilities and was only available on a few machines. INFO/DB+ had not responded to their "last chance" appeal for information and was therefore also scratched. J.Duke had not yet responded on the state of MIMER. Two further presentations from manufacturers were arranged: Status on 5th October, Cairs on the 11th.
The first and worst snowstorm of 1987 hit the Eastern seaboard of the United States as delegates to this historic gathering were making their various ways there. Everyone consequently having a travel horror story to tell, proceedings began (and continued) in an atmosphere of gritty resolution against adversity - which was probably just as well, since the working hours were long (8 a.m. till 10 p.m. with occasional short breaks for refreshment), the topics of discussion were not trivial (just how do you get 32 different and highly individual delegates to agree on anything?) and the accommodation spartan (overheated cupboards in the attic of Alumnae House). On the other hand, the organisers (chiefly Nancy Ide of Vassar and Michael Sperberg-McQueen of U of Illinois at Chicago) had put a great deal of effort into organising the workshop beforehand, and even more into maintaining some structure throughout the event, which together with the evident good will of all participants contributed a great deal to its unusually successful conclusion.
The workshop was funded by the American National Endowment for the Humanities (NEH) as an earnest of its expanding interest in facilitating computer aided research in the Humanities, and sponsored by an impressive list of learned societies (ACH, ACL, ACM/SIGIR, ADE, AHA, ALLC, APA, MLA ...). Its purpose was to find some way of defining a consistent scheme for the encoding of textual data for use in humanistic research. To that end a small committee of the ACH had already thrashed out a proposed framework and agreed a list of delegates without whose participation (or at least recognition) no such scheme could stand a chance of survival. Most of the major European text archives were represented (Oxford, Bar-Ilan, Nancy, Oslo, Tubingen, Louvain, Pisa) and many important North American research centres, including both those purely academic (Provo, Toronto, Pennsylvania) and those with some commercial affiliation (AAP, BELLCORE, IBM, OCLC). Three observers from the NEH attended the meeting, at which the ALLC, ACL and ACH were all actively represented.
Proceedings consisted almost entirely of energetic discussion, the full details of which are beyond the scope of this report. Two delegates remarked on the fact that they had previously attended equally prestigious gatherings with remarkably similar purposes, which had in both cases come to naught. Apart from the presentation of the Working Committee's proposed structure for text encoding guidelines by its principal author (Sperberg-McQueen), there were no formal presentations, although several position papers had been circulated before the meeting.
The first session, appropriately enough, attempted to reach a consensus on the scope and nature of the proposed guidelines. One major area of discussion was whether the guidelines were to be prescriptive - setting out what features should be encoded and how - or descriptive - setting out how those features that had been encoded could be described in a neutral way. A completely open syntax would provide no guidance for those capturing new texts, but a completely prescriptive set of rules would render 90% of existing encoded texts useless. The audience for the Guidelines was relevant here: many of those present clearly regarded secondary use of existing machine readable texts (either from such sources as typesetting tapes or from established archives) as the norm, where others were more concerned about new projects, and the provision of 'guides for the perplexed'. The consensus of this first and wide-ranging discussion was that some kind of 'meta-language' should be defined, capable both of formulating a recommended standard encoding scheme and of describing existing commonly used schemes, almost by way of illustration. It was also agreed that the recommended scheme would be usable as an interchange format, with no implied necessity for retrospective conversions, though it was also pointed out, perhaps somewhat prematurely, that the existence of the Guidelines might be a powerful argument in persuading funding agencies to support such retrospective conversions.
During a very crowded afternoon session, eleven speakers were allowed ten minutes each to describe their own archives and point of view. Stig Johansson (Oslo) described the work of ICAME, the International Computer Archive of Modern English, an organisation with much experience in documenting and distributing large corpora of machine readable text; Randall Jones (Provo) talked a little about the Brigham Young "Wordcruncher" program; Jacques Dendien (Paris) gave a valuable cautionary tale in the shape of a brief history of the development of the Tresor de la Langue Francaise database; Paul Tombeur (Louvain) described briefly the CETEDOC Mediaeval Latin database and stressed the importance of traditional scholarly virtues of fidelity to source and accuracy of description; Robert Kraft (Pennsylvania) talked of the ongoing work of creating both the Thesaurus Linguae Graecae CD-ROM and its Latin counterpart, together with his own concern, ways of representing textual variation in a compact and comprehensible way; Yaacov Choueka (Bar-Ilan) gave a masterly presentation of the problems inherent in handling 100 million words of online Hebrew; I tried briefly to give an idea of the sheer chaos which is the Text Archive in the absence of any standards; Antonio Zampolli (Pisa) described the evolution of the various corpora of Italian and Latin texts (totalling some 80 million words) at Pisa, all of which are encoded and lemmatised to a consistent standard, touching also on the development of the 'linguistic database'; David Barnard (Queens, Ontario) gave a brief introduction to SGML, claiming that it was more than adequate to the tasks so far outlined for a text encoding standard; Frank Tompa (Waterloo) described briefly the Century Dictionary Project (a sort of poor-man's machine-readable OED), stressing how capturing the layout and typography of such works was usually adequate to capturing the underlying structure, provided that the encoding could subsequently be extended and modified; and finally Carol Risher (Association of American Publishers) described the process by which the AAP's SGML Guidelines had been created, which had involved industry-wide co-operation, massive funding (the standard had been drafted by an external consultancy and had cost $450,000 so far) and continued testing, modification and publicity.
After all this, a third and rather meandering discussion session was given over to the question of whether it was meaningful or politically sound to distinguish 'levels' of encoding, as the Working Party had proposed. For some, the use of levels implied a possibly invidious distinction between 'recommended' and 'optional' (where 'recommended' implied 'obligatory for funding support' and 'optional' implied 'not worth the effort'), while for others it implied a possibly unimportant distinction between 'automatically verifiable or capturable' and 'requiring scholarly effort to perceive'. After a straw vote, it was agreed that the guidelines should not make proposals for minimal encoding standards, but rather propose various taggable items under various categories ('boxes'), yet to be defined.
The last session of the first day was held in the library of Alumnae House in a slightly more relaxed atmosphere (i.e. wine was served, to the relief of all) and concerned itself largely with organisational matters. Several people pointed out the advantages of starting small and getting bigger (particularly given the fact that some areas of encoding were still a matter of scholarly dispute), while others stressed the need for an overall framework. It was agreed that a small steering committee should meet as soon as possible to set up a committee structure within which more technical discussion could take place and to seek ways of funding this. (This steering committee, which has two members from each of the ALLC, ACH and the ACL, will meet in December in Pisa.)
After an initial summary of the previous day's proceedings, the second day was taken up almost entirely with discussion of the scope and content of the Guidelines, and the nature of its proposed 'meta-language', concluding with the drafting of a set of recommendations. As a point of departure, Stig Johansson provided a helpful set of statements: the scope of the Guidelines should be 'pieces of extended natural discourse' rather than wordlists, concordances or linguistic surveys. After some discussion, it was agreed that monolingual dictionaries (the major interest of several of those present) should also be covered by the Guidelines. Their purpose should be to facilitate use of such texts in research more widely than by an individual project, rather than to aid conversion of texts to (or from) printed form. Standardisation was needed in three distinct areas: documentation, representation and interpretation. In each case the guidelines should propose what should be included together with indications of how it should be expressed. The end product would not be a formal standard but a style manual, and its production would necessitate working groups in different subject areas.
There was some inconclusive discussion of whether or not SGML provided an appropriate syntax for the definition of the proposed description of encoding schemes; although no-one was able to propose an alternative to the use of SGML, neither were many of the delegates confident enough in their knowledge to criticise (or defend) it, except at a fairly superficial level. There was some consideration of a simplified 'keyboarding' syntax which could be mapped to SGML, though this clearly had little to do with the standard as such. One criticism made of SGML was the difficulty of supporting more than one hierarchy of structural information within a document, though it was claimed that the optional 'concur' feature would support this; another was that the SGML notion of an exhaustive set of document type definitions was fairly inimical to scholarly research, but that these need not be used. On the other hand, several delegates were strongly attracted to the notion of document-type definitions as a means of guiding the perplexed. One telling argument in favour of an SGML-style syntax was its extensibility.
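By way of illustration only, a 'keyboarding' syntax of the kind mentioned above might be mapped to SGML-style tags along the following lines. Nothing at the meeting specified such a syntax: the '@tag' shorthand, the function name to_sgml and the default 'p' element are all assumptions invented for this sketch, and Python is used purely for convenience.

```python
import re

# Hypothetical shorthand: a line of the form "@tag some text" stands for
# <tag>some text</tag>. This convention is an assumption made for the
# sketch; it is not taken from SGML or from any proposal at the meeting.
SHORTHAND = re.compile(r"^@(\w+)\s+(.*)$")


def escape_text(text):
    """Escape the characters that would otherwise be read as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def to_sgml(lines, default_tag="p"):
    """Convert shorthand lines to SGML-style tagged lines.

    Lines without an explicit "@tag" prefix are wrapped in default_tag;
    blank lines are skipped.
    """
    tagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = SHORTHAND.match(line)
        if match:
            tag, text = match.group(1), match.group(2)
        else:
            tag, text = default_tag, line
        tagged.append(f"<{tag}>{escape_text(text)}</{tag}>")
    return tagged


if __name__ == "__main__":
    for tagged_line in to_sgml(["@title Les Chats", "", "Cats & dogs"]):
        print(tagged_line)
```

The point of such a mapping is that the typist sees only the shorthand, while anything exchanged between sites is in the fuller tagged form; the shorthand itself remains a local convenience and no part of the standard.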
The end product of the meeting was the following text, confected in full session with Michael Sperberg-McQueen at the keyboard, projecting the wordprocessor screen on the wall. Much argument about vocabulary and word-order went into its production, but the final product was universally accepted.
Afterwards, there was a general sensation that the real work had at last been defined. A final and fairly desultory evening session kicked around the question of whether or how a North American Text Archive could be established; somewhat to my surprise, a number of very flattering things were said about the way the Oxford Text Archive is run, and it was agreed that funding and support for that style of approach should be sought, though in no very precise manner.
In conclusion, I felt that this working party had indeed established something, if only in bringing together and actually getting serious discussion out of a very broad-based but surprisingly interdependent constituency of research workers. In the past, there has been almost as much lip-service paid to the notion of a universally-agreed standard for text encoding as there has been to that of an international directory of machine readable texts; with this workshop, for the first time, a significant proportion of those without whose support all such notions must founder had actually devoted their undivided attention to the issue and the problems it raises for at least two days. The organisers (and participants!) deserve every credit for this achievement, still more for having orchestrated a productive kind of consensus from which something more tangible may well eventually emerge.
Oxford University Computing Service
30 November 1987