1. The DCI200 Field Trial runs from March 1 to May 1. Participating sites represented at this meeting were NCC (souping up Filetab with DCI), Gateshead MBC (still working on their alternative Querymaster), the Inland Revenue (the Advanced CAFS Option uses DCI), Logica (hotly denying press reports that VME/Rapport was no longer supported let alone under development), Glasgow University and OUCS. A creep from ICL Training also attended to see whether anyone had any stealable ideas for a two day DCI course to eke out the cursory gesture made towards teaching it in E-CAFS.
2. New facilities in DCI200 are as follows:
(a) the basic product
(b) as an extension (probably separately charged)
DCI200 is fully forwards compatible with DCI100, except that COPY is added to the list of reserved words.
3. No-one seemed to have many ideas about how these wonderful new facilities would be used. The IR man did a good job on selling ACO. D Fildes told us about Glasgow's DCI applications - their GEAC library system is snapshotted to a DCI searchable file routinely. Their enthusiasm for CAFS/DCI echoes our own; it was very nice to hear someone else enthuse about the appropriateness of DCI in the university environment for a change.
4. Field Trial logistics were described with P. Harrison's usual thoroughness. As part of this deal, we also get to participate in the DDS-750 Field Trial, almost worth doing for the documentation alone, which is approx 6 inches thick and (it must be said) a distinct improvement on all DDS documentation to date.
5. Lunch was up to usual Beaumont standards. I stuck to the salad and cheese, though I gather the little bowls of rice and curried pork were quite acceptable. The lemon souffle was predictably nasty.
Ten sites had taken part in the field trial, of which eight had actually managed to do something. These were New Zealand Post Office (actually a bank), South African Post Office (ditto), ICL Sweden, ICL Training, Gateshead MBC, OUCS, Glasgow Univ. and Logica. Only NZPO, OUCS and SAPO had managed to test IDMS+DCI, but NZPO had hammered it. The application, involving reconciliation of approx 3.5 million customer records, took one hour on IDMS searching alone. Syndicating the search to ten CAFS engines in parallel meant that even with 100% hit rate (there's a customer record on every page) the job took about 10 minutes.
GUCS reported awful performance degradation when going to DCI-200 from DCI-100; this was attributed to 8.12 rather than DCI however, and was currently being investigated by VME Support.
ICL Sweden has done some work integrating DCI with strategic products; they also made useful criticisms of the manual and described a technique for invoking DCI searches invisibly from database procedures.
Gateshead's Querymaster clone was now running 300 queries a week against a 500 Mb ISAM file, and also against various other odd files. They added fuel to the case for providing a CAFS independent CAFS-KEYED-READ. They also have a library catalogue application, like the Berkshire one.
There was some fairly desultory argument about whether DCI/IDMS wouldn't do better to return just database keys, instead of insisting on returning a complete database record. Also about whether or not it should get at pointers. It was desultory because all development on DCI is now frozen, pending the arrival of the new CAFS hardware. Whatever facilities this has, they will be accessible through DCI.
The IDMS/DDS bits of DCI are to be marketed as a separate product called DCI-PLUS, at an extra charge. This is largely to avoid a rerun of the QM250 debacle, I assume. General release is end of July, and is tied in with DDS-750.
Over lunch I heard all about Glasgow's visit from the Computer Board. Ooops.
Historians have been using computers for a good many years, but there has been no formal forum for discussion of techniques and results (apart from the sidelines of straight History conferences and the sidelines of the ALLC); the eventual purpose of this conference was to set up an international association for computing historians. But it was also a first opportunity for nearly 300 historians of all periods, disciplines and nationalities to get together and find out who else had been reinventing the wheel, and to see if anyone had yet put 4 wheels on a cart. Southampton was well represented by Computer Studies (me), Education (Mark Farley), Computing Service (Dave Doulton), Statistics (Ian Diamond) and History (Frank Colson, Arno Kitts and Elisabeth Reis).
Peter Denley (Westfield College) and Deian Hopkin (University College of Wales, Aberystwyth) managed to create an amazingly efficient and friendly conference, at which there was free coffee on tap all day, a generous supply of space and equipment for demonstration, working microphones, a daily bulletin and good vegetarian food (my special thanks for the garlic in the pasta on the Friday lunch). The conference consisted of, in parallel, a) a dense set of plenary talk sessions, b) ongoing presentations by computer manufacturers and c) demonstrations by individuals. This report discusses firstly the papers, then the demonstrations, and finally draws some conclusions.
From 2pm on Friday to 4pm on Sunday, a vast battery of speakers were paraded in front of us in 20 minute slots; roughly speaking, the first day was historical periods, the second day was methods for research and the third day was teaching and publication. In general, many of the papers dealt rather more with history than with computing, so I will not comment on them, but the following made some effect on me (in order of presentation):
The hardware and software demonstrations consisted of stalls from manufacturers, and individuals showing their own stuff. Of the commercial people, IBM showed graphics (not very specific), Triumph Adler and Victor had machines (I felt sorry for them), Research Machines had some nice Nimbuses, which were used for a Prolog demonstration, Ashton Tate had a flibbertigibbet with dBase and Framework, while the scholarly Oxford University Press showed the Nota Bene academic word-processing package on a battered IBM; if I were buying a package for myself, I would get Nota Bene.
The Hull Domesday material was around on Acorn machines, but I didn't really look at it; various people gave demos of databases over JANET. Of the other displays, I found the following of interest: In general, the demonstrations provided welcome relief from the papers about numbers and were more relevant to teaching than the research papers.
I was depressed by the predominance of quantitative historians; too little distinction was made between quantitative studies and statistics, and their implementation on computers -- for many, the two were inextricably linked. It seemed to me that much of the teaching of statistics would have been better done with pencil and paper. As one might have expected, the details of the technology, and the practical problems of hardware, operating systems, languages, databases and statistics packages overwhelmed the important general issues which seem to me to be as follows:
It was an exhausting weekend; I got drunk once, and heard two life stories; I met some interesting people, heard some gossip, wished the source of SPSS & SAS had been lost, and came away hoping that the new Association would flourish.
This conference aimed to bring together as many as possible of those currently using computers as primary research tools or as teaching aids in the historical disciplines, with a view to establishing a new learned society, the AHC, with a wide ranging membership including universities, research bodies, polytechnics, and local government bodies responsible for secondary education. This catholicism may be one reason for the evident success of the event: there were nearly 300 delegates, including several from Germany, France and Italy, and ten sessions each of four or five short papers, of extraordinary richness and variety, with only a few duds. There was also a more or less permanent but changing exhibition room featuring assorted micro manufacturers and software publishers as well as lots of online demos, mostly on the ubiquitous BBC micro or via JANET at the exhibitor's home base. I was alas unable to show off the wonders of CAFS due to the continued absence of weekend working on the 2988.
There were too many papers to summarise individually (it is probable that some sort of publication will emerge eventually) so this report simply describes overall trends and memorabilia.
Predictably, there were large numbers of home grown quasi-DBMSs on display, ranging in sophistication all the way from data entry systems in BBC BASIC tailored to a particular type of historical document up to the all-singing all-dancing current market-leader, the German CLIO package. I had previously met this in Gottingen; it is still being re-written in C but now has satisfied customers in Austria and France as well, and is stimulating interest here. Others mentioned included -yes- Famulus77 which got blush-making praise from a Nottingham group and Oracle which was mentioned by several speakers as the obvious choice for dbms, despite the presence of a strong SIR-fancying contingent. In fact the conference revealed a very healthy eclecticism as far as software is concerned, both Prolog supporters and one renegade APL fanatic being given if not equal time at least equal respect.
Aside from methodological manifestos, some real historical research was reported, largely by groups of economic historians. In this category, I found Turner (using multi-dimensional scaling to analyse 19th c House of Commons voting practice - which rings a bell) and Riehl (using predictive mathematical models to analyse the emergence of the Greek city state) particularly interesting. At quite the other end of the educational spectrum, there were sessions devoted to methods of introducing computational methods into the undergraduate syllabus, and to novel applications of computers in the secondary and below classroom. These were very interesting, and oddly complementary. One speaker compared the university teacher addressing the post-micro generation to a hunter-gatherer trying to teach neolithic man how to increase agricultural productivity - a simile which seemed to strike several chords. One novel CALL application is about to be marketed by Longmans: it is a role-playing game in which children are introduced to decision making procedures and the role of chance in historical events, by simulating the Palestine agreement of 1947. It seemed a good way of teaching people to "think historically" - an activity which the charismatic Richard Ennals (who jetted in to chair one session and then jetted out again) assured us was worth big bucks in pushing back the frontiers of AI.
I noted two major trends:
(1) The biggest problem area is still data capture. There is now a widespread recognition of the need to capture original sources and to integrate them to some extent with a relational dbms. Few seem to doubt that relational is right and hierarchic has had it.
(2) There is a growing awareness of the possibilities for IKBS. Nevertheless for most people, the topic of computational history MLE (is more or less equal to) quantitative analyses.
No particular theme had been specified for this year's ALLC conference (one had last year, in Nice, but no-one took any notice of it). Vague attempts had been made to clump together related papers, the chief effect of which was that anyone interested in OCP-style software couldn't find out anything about database style software, and anyone not interested in literary statistics had absolutely nothing to do for most of one day. There were three invited speakers, as well as three days of parallel sessions, and two major social events clearly calculated to impress foreign delegates. Much of what transpired was well up to expectation; in the 200+ delegates there were only a few new faces amongst the ALLC diehards, and most of the issues discussed had a more than familiar ring to them. The accommodation at UEA was also no worse than usual, though the food was remarkably nasty.
Leaving mercifully aside the more tedious papers, I noted with interest the following:-
(1) Christian Delcourt (Liege) presented an algorithm for partitioning lines of verse mathematically in order to identify their component structures automatically. I didn't understand the maths (and the French wasn't easy) but the results were impressive, and novel.
(2) B. van Halteren (Nijmegen) presented the Linguistic Data Base (LDB) - this is a really natty query processor for accessing structured linguistic corpora in terms of the structure. It forms one end of the TOSCA project, which I have come across several times before; this time he gave enough details of its query language and programming language to do more than whet the appetite. We could have it for free if only (a) we had a spare vax (b) we had an analysed corpus to put into it.
(3) S. Rahtz (Southampton) had been scheduled to coincide with S. Hockey (OUCS), an excessively shabby trick on the part of the organisers. Despite poor attendance, he gave a competent account of the vicissitudes of computerizing the Protestant Cemetery in Rome, and even proceeded to some highly dubious speculations about the implications of funerary inscriptions.
(4) J. Simpson (NOED) gave the orthodox version of the current state of the computerised OED project, thus incidentally making nearly everything else described at this conference seem fairly toytown in size, scope and significance. It is nice to learn that the word 'database', first recorded in 1964, will enter NOED before it is printed in 1989, also that G. Gonnet's Algol68-like query language for interrogating dictionary definitions is called GOEDEL (Glamorous OED Enquiry Language). He remarked that "a lot of science fiction has been written about the NOED project" and then revealed that semantic labelling was considered easier than syntactic.
(5) B. Rossiter (Durham) nearly made me fall out of my chair by asserting, after a thoroughly admirable exposition of how he'd used entity-relationship modelling to design his database, that no software existed capable of supporting ER structures properly, so they'd used SPIRES instead. The project is dealing with the full text of English statute law, available from HMSO for a song it seems. Over lunch I broke the news to him about DDS and CAFS; he admitted their choice was largely determined by what was actually available at NUMAC.
(6) Tony Kenny (Balliol) summarised his work in statistical stylistics and was also chief lion at the subsequent round table discussion on "whither computation stylistics?". The discussion turned out to be unusually interesting, if inconclusive, while his paper was exhaustive, if exhausting. It made eminently reasonable distinctions between what made sense in the field (distinguishing texts in terms of parameters that could be shown to be internally consistent - cf Delcourt) and what did not (postulations about undefinable entities such as 'all the works Aristotle might have written'). He compared statistical techniques to aerial photography, showing the wood rather than the trees, and concluded with a summary of his next book, which uses clustering techniques (Pearson correlation coefficients in particular) to discriminate the Pauline and non-Pauline bits of the Greek New Testament on the basis of their usage of different parts of speech.
(7) John Burrowes (Newcastle) also has a book coming out. I suspect his was the most interesting paper at the conference. It summarised his work so far on the analysis of Jane Austen's high-frequency vocabulary in sections of her novels categorised as dialogue, narrative and 'rendered thoughts'. Both these categorisations and the vocabulary counted are carefully hand pruned to avoid both ambiguity and polysemy (which is why he's been at it for five years). The interesting thing is that the results actually add something to an appreciation of the novels and are used to make critically significant judgements about stages in Austen's development as a novelist. His statistics are based on Pearson and Spearman correlations, presented in scattergram form; he is now threatening to go for multidimensional scaling.
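For readers who have not met the two correlation measures mentioned in (6) and (7), here is a minimal sketch of how they might be computed for a pair of word-frequency profiles. It is not taken from either speaker's work: the function names and the sample rates are hypothetical, and ties are handled with average ranks, which is one common convention among several.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical rates per 1,000 words for five common words in two sections.
dialogue = [31.0, 28.0, 12.0, 9.0, 4.0]
narrative = [40.0, 22.0, 15.0, 6.0, 5.0]
print(pearson(dialogue, narrative), spearman(dialogue, narrative))
```

Pearson responds to the actual sizes of the rates, Spearman only to their order, so the two can disagree when one or two very common words dominate the counts.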
As usual at these gatherings there was a certain amount of political manoeuvering in evidence. It transpired that Nancy Ide (Chairman of the Association for Computing in the Humanities) is planning an international workshop on standardisation of machine readable texts. I put forward the proposal that the Text Archive deserved more funds to whatever sympathetic ear came within reach, and was told on several occasions to think BIG.
Hearing that I was hoping to attend the second University of Waterloo conference on the new OED, Ian Lancashire, driving force behind Toronto University's thriving Centre for Computing in the Humanities (CCH), kindly invited me to give a seminar there. This being too good an opportunity to miss, as there are several projects at Toronto of considerable interest, I arrived in Toronto (on a special -bilingual- cheap flight via Montreal) a few days before the OED Conference proper and visited...
The Dictionary of Old English which flourishes in several rooms on the 14th floor of the magnificent Robarts Library, where I saw some very flashy Xerox workstations given to the project together with a small VAX configuration. The project has a single programmer who has to develop all the software for editing and page make-up of the dictionary entries, which have now reached the letter D. They were astonished to hear that we could offer online searching of substantial portions of the corpus, even if we could not display Old English letters. Their interface is pleasantly similar to the desk-tops on which the terminals sit, (i.e. cluttered) and just to be on the safe side the project also maintains (in several hundred drinking chocolate cartons) numerous bits of paper derived from the corpus we know and love. Ashley Amos, sole surviving editor of the dictionary, managed to track down some obscure charters which a user here had been unable to find in our copy and was generally encouraging.
At University of Toronto Computing Services (UTCS), I inspected the bookshop (which is splendid) and the computer shop (likewise; hottest new property reportedly the Amiga 1040 which is selling like hot cakes, or muffins as the Canadians unaccountably call them). I was not shown their Cray, nor indeed their IBM 4361, but instead was introduced by John Bradley, apparently the resident humanities computing boffin, to TACT - a new interactive text-searching system he is developing to run on Toronto's ubiquitous (see below) IBM-XTs - and by Lidio Presutti (his sidekick) to MTAS, an OCP look-alike of which I was given a copy, also for use on IBM-XTs. Time did not permit me to discover much about the way the centre is organised, other than the fact that they have recently started charging their users real money (well, dollars and cents anyway) for computing resources consumed, with predictably dire consequences for anyone not funded by the Defence Dept or similar.
Nevertheless, Humanities Computing is set to thrive at Toronto, largely as a result of Ian Lancashire's "partnership" with IBM-Canada. This involves the setting up of four rooms full of XTs and staff to support them over three years, all paid for by Big Blue, which gets no more in return than lots of academic credibility and three years worth of humanities graduates convinced that all computers should run PC-DOS. Any software etc. developed will be placed in the public domain. One of the four centres was on the verge of opening its doors: it had 24 XTs on a token ring with an AT as file server and three printers. The XTs were set up in such a way that they could only be booted from a supplied disk, which could not be removed from drive A. They were also bolted to the floor, despite Canadians' proverbial honesty. Students will use these to prepare machine-readable texts, using EDLIN or NotaBene (WordPerfect is not regarded as highly as it is here), to be processed using MTAS and maybe TACT. Other software to be made available includes the Duke Toolkit and the usual clutch of concordance packages, Kermit, network mail etc. as well as some public domain text-jiggling utilities designed to whet if not satisfy the literary appetite. Students will apparently be expected to become reasonably proficient in not just PC-DOS but also VM-CMS and UNIX as well, which seems a bit steep. Conspicuously absent was any whisper of what databases are for. There is rumoured to be a Masscomp somewhere in the English Dept but I never saw it.
I gave my seminar in the Centre for Mediaeval Studies (where the second of the four IBM rooms was still under construction); I had been billed to talk about the KDEM but instead waxed lyrical on the necessity for the Text Archive, the problems of representing and processing text properly and the wonders of CAFS to a gratifyingly large (c. 36, including the Director of UTCS, I later learned) audience, most of which survived till the end.
The next day, being Saturday, I spent at Niagara Falls, of which the Canadian end is unquestionably the better, and truly spectacular. I was startled by a bright red London bus (used for touristic purposes) and resisted the temptation to have my photo taken going over in a barrel, though I did go into the tunnels behind the Falls which command a magnificent view of their derriere.
Back in Toronto, I lunched with Michael Gervers, who runs the Documents of Essex England Data Set (DEEDS) project, more or less on his own with some Government assistance in the form of temporary (YOP-type) staff. The project involves the indexing of a massive collection of mediaeval records from Essex (England) and is the only real database project I came across at the University. It started off using an awful DBMS package which sounds like a Canadian version of IMS, but is now going through the traumas of conversion to Oracle, at present on a huge AT (with a 40 Mb disc AND a Bernoulli box), though it will be moving to the UTCS IBM system shortly. The cost of Oracle for this system appears to have been met from the IBM 'partnership', although what other users it will have in the absence of any local knowledge of how to exploit or support it is less clear.
I travelled to Kitchener, the nearest large town to the University of Waterloo, by train in the company of Willard McCarty who works with Ian Lancashire in running the CCH, and Abigail Young, who works on the Records of Early English Drama (REED) project also at Toronto. She had been largely instrumental in depositing in the Text Archive that proportion of the published corpus of REED texts which was still available on floppy disk, so I was very pleased to meet her.
And so to Advances in Lexicology (not a word to be found in OED -yet) which was the second annual conference held at Waterloo's Centre for the New Oxford English Dictionary and was generally felt to be a distinct improvement on its predecessor. Twelve papers were given over three days to about 150 delegates, roughly equally divided in their allegiances between lexicography, computer science and artificial intelligence. One reception, many coffee breaks and two fairly spartan lunches were laid on, during all of which there was much animated discussion. The best joke of the conference was probably Dana Scott's collection of howlers, of which I recall only "AI is when the farmer does it to the cow instead of the bull" which manages to combine innuendo with syntactic ambiguity.
The keynote address by Howard Webber (Houghton Mifflin) was the only one of the papers not (yet) available in printed form; like many keynote addresses it sounded rather as if he had made it up on the plane from several old after dinner speeches. However, it got out of the way all that necessary stuff about the role of dictionaries as a sort of "Language Command Central" (his phrase), the unease with which lexicographers had regarded the machine, the difference between machine-readable dictionaries and lexical databases and the transition from the former to the latter, while also dropping a few hints about where the 'American Heritage' dictionary project was now going in its co-operation with Brown University (nowhere in particular, as far as I could tell, other than the preparation of a new 50 million word corpus).
Manfred Gehrke (Siemens AG) tackled head-on the computational difficulties of providing rapid access to a realistically large lexicon. The method described, using morphemes rather than 'words' as primary keys has several attractive features (like the comparatively smaller number - and size - of such keys), though is perhaps more appropriate to highly agglutinative languages such as German. The fact that morphemes have meanings which the compounds derived from them usually employ is also of particular importance in German. Even so segmentation can cause problems: "Madchen handelsschule" is a girls business college, but "Madchenhandels schule" is a white slavery school.
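The segmentation problem is easy to reproduce in miniature. The sketch below is not Gehrke's method, only a toy illustration: it enumerates every way of cutting a compound into units found in a small lexicon, and the lexicon itself (which mixes a morpheme-like unit with lexicalised compounds, and drops the umlaut as above) is entirely hypothetical.

```python
def segmentations(word, lexicon):
    """Return every way of splitting `word` into a sequence of lexicon units."""
    if not word:
        return [[]]  # exactly one way to split the empty string: no units
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Keep this head and try every segmentation of what is left.
            for rest in segmentations(word[end:], lexicon):
                results.append([head] + rest)
    return results

# Hypothetical toy lexicon with glosses, chosen to show the ambiguity.
LEXICON = {
    "madchen": "girl(s)",
    "handelsschule": "business college",
    "madchenhandels": "white slavery",
    "schule": "school",
}

for reading in segmentations("madchenhandelsschule", LEXICON):
    print(" + ".join(f"{unit} ({LEXICON[unit]})" for unit in reading))
```

Both readings come back, and nothing in the string itself says which is meant; any real system needs some further preference (frequency, context, or a grammar of compounding) to choose between them.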
Mark Aronoff (SUNY) and Roy Byrd (IBM) gave a rather dubious account of the role of etymology and word length in English word formation. A dictionary of high frequency affix lists was extracted from the top end of the Kucera-Francis word list, and another unspecified 3/4 million word list. This was then enhanced with some fairly simple etymological information from Webster's 7th (i.e. did the affix enter the language from a Germanic language or a Romance one). Any complications (such as words which were imported into English from French, but came into French from Germanic) were rigorously disregarded, as was the distinction between words which were formed within the English language and those which were borrowed -as it were- fully formed. Much statistical jiggery-pokery was then employed to determine how syllable-length and etymology accounted for the productivity of various affixes, and much wonder expressed at the apparent ease with which native speakers keep their neologisms racially pure. But the results, as Mike Lesk pointed out, would have been equally consistent with a simple phonemic explanation: (predominantly longer) Latinate suffixes naturally sound better on (generally Latinate) polysyllabic verbalisations, while (often short) German endings go best with (mostly Saxon) little words.
Walter and Sally Sedelow (Univ of Arkansas) have been in the field of computational linguistics almost since it began; their paper, which was mostly given by Walter, thus had a tendency to historical reminiscence not quite germane to the issue, while employing terminology and a style, the clauses of which were embedded well beyond the capacity of most intelligences not endowed with a 640 Mb hardware stack, not unlike some really nasty exercises in automatic parsing, and consequently seemed to go on for a great deal of time without ever getting very far. This was a pity, because its subject (the adequacy and usability of Roget's Thesaurus as a knowledge representation language) is important and interesting. A mathematical model of the thesaurus (derived from work by Robert Bryan at San Francisco State) was presented and various other research reports adduced in support of the claim that the associations represented in Roget are actually significant. The skeptical might say that this was rather unsurprising; however anyone who can (apparently in all seriousness) express the concept in the following way (quoted from the abstract) clearly cannot be ignored, if only on stylistic grounds:
"The paper ends...with a statement to the effect that any assertions that the Thesaurus is a poor representation of English semantic organization would be ill founded and, given depth of analysis, would have to be regarded as counterfactual"
Judy Kegl (Princeton), Beth Levin (MIT) and Sue Atkins (Collins) gave probably the meatiest of the papers at the conference, and coincidentally no doubt the only one co-written by a real lexicographer (Atkins). It includes much analysis of the different ways in which two language learners' dictionaries (LDOCE and OALDCE) attempt to convey the intimate relationship between the various senses of English verbs and their complement-taking properties (or case structure). Even such apparently obvious characteristics of the verb as transitivity are not always predictable by simple transformational rules such as "If the verb takes an object then it can be used passively" (e.g. "The shoes fit", "The shoes fit me" but not "I am fitted by the shoes"), but there is no self-evident place to put such facts about the verb "fit" in a dictionary. Consequently dictionaries differ: "cost" for example is intransitive according to OALDCE, and transitive according to LDOCE. The paper also contains much that is refreshingly sensible about the nature of specialised dictionaries (such as learners' dictionaries) and the distinction between them and the sort of immensely complex linguistic snapshot to which some computational linguists expect all lexicons to aspire. The sort of knowledge needed for the latter, though indispensable to the former, must be processed and combined in a way appropriate to particular users. Detailed assessment of the way certain notoriously difficult verbs are presented in OALDCE and LDOCE is used to present inconsistencies inherent in the absence of any generally agreed meta-language for verbal descriptions, a point which recurred in other papers. The strength of this paper is the combination of the structuring capabilities offered by theoretical linguistics with the reductive classificatory skills of lexicography, which it both demonstrates and advocates.
Thomas Ahlswede (Illinois Inst Tech) reported on the continuing saga of the parsing of Webster's 7th, first initiated by John Olney et al in 1968. 'Parsing' here means the recognition and extraction of semantic information from the text of a dictionary definition which can then be stored and manipulated within a lexical database. It is analogous to (but even more ambitious than) attempts to extract similar semantic structures from free text. Dictionary definitions provide implicit information about relationships between words, not just taxonomic (an x is a sort of y) but also part/whole relationships. But a simple syntactic analysis of the text of a definition is rarely adequate to the task of understanding it; a detailed lexicon containing information about each word likely to be encountered by the parser is evidently necessary. For Webster's 7th, some of this information (but not all) can be extracted from the entries themselves, while some of it is already available in the existing parser's lexicon of about 10,000 entries. This process was later dubbed "dictionary hermeneutics" by Graeme Hirst. How much easier it might have been if the dictionary structure had been initially captured in a meaningful way (as was the OED) is an embarrassing question which no-one had the poor taste to ask.
Nicoletta Calzolari (Pisa) described an equally ambitious but rather more practical project now under way under Zampolli's charismatic aegis: the construction of a bilingual lexical database (LDB) system by linking together existing monolingual LDBs, the linkage being provided by machine readable bilingual dictionaries. Combining monolingual and bilingual dictionaries, which typically differ in the degree of discrimination felt necessary for individual word senses, should lead to a much richer integrated system. The dictionaries to be used include Zingarelli, Garzanti, Collins Italian/English, LDOCE and OALDCE. No complex supra-linguistic model is envisaged, simply the ability to discriminate senses when going in either direction between two languages. Such old chestnuts as the fact that Eskimos have 99 different words for 'snow' and Arabs none at all were not on the agenda: the project is not really concerned with semantics, but aims rather to provide a useful tool for translators and others using existing dictionaries.
The final session of the second day comprised summaries of the current state of play of the NOED Project itself, as viewed by firstly Tim Benbow and John Simpson (OUP) and secondly Frank Tompa and Gaston Gonnet (Waterloo), all of whom were sporting Oxford Dictionary Ties to mark the occasion. Benbow reported that the dictionary's 21,000 pages had now been rendered machine readable, using an estimated 500 million keystrokes, with an error rate of around 4 in 10,000; this was being proof read and corrected by ICC to bring the residual error rate down to less than 1 in 250,000 characters. The data is held at Oxford and manipulated in a SQL database under VM/CMS. Rick Kazman's parser would be used to convert the ICC mark-up to SGML, and other software developed in house mostly by IBM secondees (e.g. a text editor called LEXX) will be used by the lexicographers to complete the integration of the dictionary and the supplements. Some wholesale changes will be made (notably Murray's method of representing pronunciation will be replaced by IPA) at the same time as automatic integration is carried out; some (about 4000) new words/senses will also be added to bring the early parts of the supplement up to date (This is the responsibility of John Simpson's NEWS project). Publication of the new integrated dictionary (the Book) is planned for spring 1989. It will have 22 volumes and cost about 1500. Publication of a CD-ROM version of the OED alone (without supplements) is also planned, probably for late 1987, mainly as a means of testing the market for electronic forms of the dictionary, and providing information for the database design work going on at Waterloo. It is hoped to set up a unit of lexicographers in Washington which, together with the NEWS team, will ensure that the dictionary, or rather its eventual electronic form, will be kept up to date on both sides of the Atlantic.
At Waterloo several very interesting pieces of software have been developed, which were described by Gaston Gonnet and Frank Tompa. While waiting for the ICC data to reach them, they had been experimenting with a smaller dictionary of Trinidadian English which had successfully demonstrated the generality of their approach. The software used comprises (1) INR/lsim - a parser-generator and parser for context free grammars (2) PAT - a fast string searcher and (3) GOEDEL, the "Glamorous OED Enquiry Language". INR/lsim (no-one seems to know what this is short for) resembles in philosophy the parser-generator developed for Project TOSCA at Nijmegen, though I never got the opportunity to ask Tompa whether he'd heard of this. Maybe it's just the only way of solving the problem properly. It has been used by Kazman among others to convert the ICC mark-up to SGML, and to convert the OALDCE tape as first supplied to the Text Archive into a similar SGML markup. PAT (written by Gonnet who has made quite a study of text searching algorithms, I discovered) stores indexes to a text in a Patricia tree, a form of condensed binary tree new to me, though apparently to be found in Knuth if you know where to look. PAT is very fast but, at present, very simple minded. GOEDEL is a more sophisticated system, still under development, the most crucial element of which is not so much its current Algol-like syntax as the fact that its basic datatype is a parse tree (again like the Dutch system). This solves all manner of nasty data management problems and bridges the gap between DBMS and Text Processing systems in a way at least as natty as CAFS and probably more so. The user specifies a parse tree for the text to be returned and can impose selectional restraints using SQL-like conditions.
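For readers to whom the Patricia tree is equally new, the general idea of indexing a text by a condensed tree can be sketched as follows. This is only an illustration, not Gonnet's PAT: it branches on characters rather than bits, indexes only the strings starting at word boundaries, truncates each indexed string to max_len characters (so longer search patterns will not be found), and keeps a list of positions at every node, which a real implementation would avoid. The names Node, insert, build_index and search are invented for the example.

```python
class Node:
    """One node of a character-level condensed (radix) tree."""

    def __init__(self):
        self.children = {}    # first character -> (edge label, child Node)
        self.positions = []   # text offsets of all indexed strings passing through


def insert(root, text, start, max_len=32):
    """Index the string beginning at text[start], truncated to max_len characters."""
    key = text[start:start + max_len]
    node = root
    node.positions.append(start)
    while key:
        first = key[0]
        if first not in node.children:
            # No edge starts with this character: hang a new leaf here.
            leaf = Node()
            leaf.positions.append(start)
            node.children[first] = (key, leaf)
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the remaining key.
        n = 0
        while n < len(label) and n < len(key) and label[n] == key[n]:
            n += 1
        if n < len(label):
            # The key diverges part-way along the edge: split the edge at n.
            mid = Node()
            mid.positions = list(child.positions)
            mid.children[label[n]] = (label[n:], child)
            node.children[first] = (label[:n], mid)
            child = mid
        child.positions.append(start)
        key = key[n:]
        node = child


def build_index(text, max_len=32):
    """Index every position in text at which a word begins."""
    root = Node()
    for i, ch in enumerate(text):
        if ch.isalnum() and (i == 0 or not text[i - 1].isalnum()):
            insert(root, text, i, max_len)
    return root


def search(root, pattern):
    """Return the sorted offsets of indexed strings that begin with pattern."""
    node = root
    rest = pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        m = min(len(label), len(rest))
        if label[:m] != rest[:m]:
            return []
        rest = rest[m:]
        node = child
    return node.positions and sorted(node.positions)


sample = "the cat sat on the mat; the catalogue"
index = build_index(sample)
print(search(index, "the"))
print(search(index, "the cat"))
```

The point of the condensation is that a search follows at most one edge per distinct branching point in the pattern, so its cost depends on the length of the pattern and not on the size of the text, which is presumably where the speed comes from.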
Peter Davies (described as an 'Independent Lexicographer') read out his paper from the conference proceedings in a dry monotone well suited to its contents, which contained rather few conclusions derived from some fairly specious data. He had tagged the most frequent word forms in the American Heritage Corpus with the century of their introduction to English and the language from which they derived. (Like Aronoff and Byrd he was uninterested in the fact that this corpus neither distinguishes homographs nor associates inflections of the same lemma.) The results presented were raw percentages ("In the top 500 types, 75% are native") with no attempt to adjust for the known skewness of vocabulary distribution irrespective of origin.
Alexander Nakhimovsky (Colgate) is much obsessed with time, more specifically with how language reflects "common-sense reasoning" about the nature of time. He is one of the "Meaning-Text" school of Soviet theoretical linguists. To understand why answering "I have a 12.30 class" constitutes refusal of a lunch invitation requires not just knowledge of social convention, but also of the probable durations of lunches and classes. English verbs are not marked for perfect as opposed to imperfect so that "Alice read her book quickly" could have two quite different meanings (either as a process or as an event). Knowledge of the duration of events is a linguistic phenomenon because many words cannot be understood without knowing the duration typically associated with them - not just obvious words like 'lunch' or 'nap', but also 'holiday' (which cannot be in minutes but is usually not in years) or 'insomnia' (which cannot be measured in minutes nor go on in the afternoon). It is apparent that the units of time appropriate to common sense reasoning vary with the duration of the event, as does their precision. (Thus '5 minutes' means somewhere between 1 and 10, but '3 minutes' usually means just that). To make up for the absence of a perfective/imperfective aspectual difference, English relies on an opposition Nakhimovsky calls telic/atelic, which has to do with the narrative within which the verb appears, so far as I understand (or have understood) it.
Fabrizio Sebastiani (Pisa) presented a more conventional view of the role of the lexicon in describing QWERTY, a knowledge-based system designed to 'understand' technical writing. It operates by mapping parse trees representing syntactic structures onto semantic structures represented in the knowledge representation language KL-MAGMA. The fact that technical writing is mostly composed of 'paradigmatic' sentences, from which such nasty problems as mood, aspect and temporal position are conspicuously absent, was touched on but not insisted upon: unfortunately Sebastiani did not allow himself enough time to make clear exactly what was original in the system nor how effective the design was in practice.
Graeme Hirst (Toronto) closed the conference on a controversial note which also managed to bring together the two sides of lexicology, if only in debate. His paper stated with agreeably frank partisanship why dictionaries should list case structure, that is, specifications of the valid roles associated with any particular verb, together with any semantic restrictions on the values that may fill those roles and any preposition or other syntactic marker specific to them. Case structures may be found in one guise or another in almost every theory of language or cognition, from Chomsky to Charniak, and in a weak form are almost present already in the 'verb-patterns' specified in some language learning dictionaries. Hirst's most telling argument in favour of his proposal was that if lexicographers did not supply such information then computational linguists would, and would certainly do a worse job of it. The most telling counter argument was that, at present, no-one has a clear idea of exactly what cases (roles) exist nor is there any agreement on how to define them. A less telling counter-argument, which to my mind unnecessarily dominated the subsequent heated discussion, was the economics of adding this information to the already overcrowded pages of a dictionary; when pressed, Hirst said that he thought it would be more use to more dictionary users than etymology if something had to be sacrificed.
After the conference proper, I visited the NOED Centre itself, where I met Donna Lee Berg, the librarian on the project, and acquired offprints of some technical reports on the software being developed there by Gonnet. I also watched some confused lexicographers struggling with GOEDEL and, while being very impressed by the power of this system, was glad to notice that there are 18 months labelled "development of user interface" set aside in the planning of the project which have not yet begun.
Back in Toronto, I found Ian Lancashire very busy preparing a long term plan for funding humanities computing beyond the end of his IBM partnership. This entails the preparation of a detailed report of all the activity currently going on at the six or seven universities in Ontario which is to be presented to the Ontario government with a request for funding very shortly. I managed to distract him sufficiently to discuss briefly his slightly different views of the functions of a text archive. He wishes to see texts distributed as freely as public domain software, the role of the Archive being simply one from which an original non-corrupted version can always be retrieved, and the only restriction on the user being simply not to redistribute materials for profit. To that end, all texts encoded at Toronto (and there will be many, since preparation of a specified text forms a part of students' course work) will be prepared to a common standard from non-copyright texts, such as facsimiles of early printed books. Whether this is practical in our case, where many of our texts are prepared from modern editions or texts otherwise still in copyright, is unclear. It is certainly something we should consider when capturing texts in the future however. I would also like to give some thought to the possibility of making some of our other texts more freely available (i.e. copyable).
CHArt - Computers in the History of Art - is a special interest group organised by Prof. Will Vaughan at UCL and Dr Antony Hamber at Birkbeck, with a burgeoning membership (about 150 attended this conference) drawn rather more from the major national museums than from academic departments. I attended its inaugural meeting nearly two years ago mostly out of idle curiosity; I was invited to this, its second annual conference, I suspect largely on the strength of my performance at Westfield (historians of art seeming to overlap a little with historians in general) on condition that I explain what databases were in words of one syllable, preferably employing lots of pictures.
The conference was a two day event, with mornings given over to formal presentations and afternoons to a number of parallel demonstration sessions. In between was a very pleasant reception featuring memorable dim sum. All around was the wealth of the National Gallery; definitely among my favourite conference venues to date. I opened the first day's formal sessions (which all concerned cataloguing/database applications), using as my main example a page from the Gallery's Catalogue written (I later learned) by the distinguished old buffer who had formally welcomed us into said gallery's hallowed portals not five minutes earlier. Fortunately he'd left by the time I started to get personal. Colum Hourihane from the Courtauld, where the only computer-assisted art historical cataloguing of any importance is actually going on, then gave a very impressive resume of every known method of iconographical classification. He'd found eight different methods used to categorise the subjects of images, of which the best appeared to be ICONCLASS, as used by, yes, the Witt Library at the Courtauld. His paper, when written up, should become a standard reference on the subject.
After coffee in an adjoining room of old masters, Jeanette Towey (described as 'a researcher' and evidently not a sales person) gave a work-person-like introduction to what word-processors are, how they differ from typewriters etc. etc. She advocated Nota Bene, having used that and WordStar, but had never tried WordPerfect nor heard of SGML, page description languages or - mirabile dictu - TeX. Gertrude Prescott from the Wellcome Institute and her 'data processing consultant' (whose name I forgot to write down) then described their current prototype cataloguing system for the Wellcome's immense collection of prints, using dBase III+. It was rather depressing to see that although they were starting from scratch - much of the collection never having been catalogued in any way - their data analysis was very rudimentary. It seemed to me to be over-reliant on dBase III's tendency to sweep anything difficult under the carpet into a "MEMO" field, of which they had about eight in one record. No doubt they will learn better from the example of their neighbours at the Witt Library.
After lunch, there were various demonstrations, of Nota Bene (which I avoided) and of STIPPLE, our old friend from the pigsty, which does not appear to have changed much and which I am now close to thinking I understand. ERROS Computing is still in business, but does not appear to have gained any new customers since the last report, some 18 months ago, nor indeed to have expanded its standard demo at all. Another demonstration, of somewhat dubious relevance to Art History, was being given by a Dr Alick Elitthorn from a private charity called PRIME (no relation to any manufacturer) which has something to do with the analysis of dreams. Its chief point of interest was that it used STATUS on a PC AT, of which I have long heard but never actually seen. The software costs £2000; by dint of sitting on my hands I prevented myself from taking a security copy of it immediately.
Day Two, which was supposed to be on visual rather than historical aspects of the subject, was opened by a Mr Duncan Davies (formerly with ICI, now retired) who gave what was reported to have been a magnificent overview of the rise of western civilisation. Owing to the caprices of British Rail, I missed much of this, arriving only in time for the Reformation, from which time, according to Dr Davies, may be dated the end of the period during which written communication had constituted the intellectual power base. With the rise of universal education came the stress on words and numbers as the only fit means of communication, the discouragement of the most able from visual forms of expression and our consequent inability to say anything intelligent about visual images. The second great invention of humanity will be the pictorial equivalent of the phonetic alphabet and if anyone had any ideas on how it could be done, would they please telephone Dr Davies on 01-341-2421. The visual content of his talk, which my summary does not attempt to include, was, of course, the better part. Terry Genin had the difficult task of following this, but persevered, remarking that he would normally be on playground duty rather than addressing a gathering of this sort. He has developed some fairly straightforward courseware involving image and colour manipulation on RM380Zs as a means of teaching art history in secondary schools, but the bulk of his talk was a plea for the possibilities of interactive video to be more widely recognised in that context (which seems to me to be a political rather than an art historical question), rather than just as a means of selling Domesday Book, of which he had several (unspecified) criticisms.
After coffee, Andrew Walter (IBM Research) gave a rapid canter through the York Minster Computer Graphics project. This is somewhat of a tour de force in CAD; it consists of a model of the York Minster, sufficiently detailed for views to be plotted from every angle both inside and outside. A video of the resulting tour was on display throughout the conference; each frame took about three hours CPU time on an IBM 3430, so interaction was impossible. The presentation included samples of the high level graphics language in which the Minster views were specified (primitives such as cylinder, sphere, cube etc. are combined in a procedural way) which was interesting though how much sense it made to the majority of the audience I can only guess. Wire frame drawing with dotted in-fill was presented as a more promising way of getting interactive processor speeds; the problems of including perspective in the views were also touched on.
David Saunders (National Gallery) described an ingenious application of image processing techniques. The problems of colour changes in 16th century paint are fairly well known (Uccello didn't actually paint blue grass, it's just that the yellow wash he put over it has gone transparent); more modern pigments also change over time. Usually the only way of telling what has happened is when a part of the painting has been protected from light, e.g. behind the frame. By storing carefully controlled digitised images of the painting and then comparing them after a five year gap, the NG hopes to identify more precisely what types of material are most at risk and what factors cause most damage. The equipment (which was also demonstrated in the afternoon) includes an HP 9836 frame store and a special digitising camera. Several art historical applications of image processing techniques were also given in what was, rather unexpectedly, the most stimulating paper of the conference.
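The comparison step is easy enough to picture in outline. The following is a minimal sketch only, and makes no claim to be the National Gallery's method: it assumes two images of identical dimensions, already registered and colour-calibrated, each held as a list of rows of (red, green, blue) values. The function names and the threshold of 10 are invented for the illustration.

```python
def _check_same_shape(before, after):
    """Raise ValueError unless both images are non-empty and the same size."""
    if not before or len(before) != len(after):
        raise ValueError("images must be non-empty and have the same height")
    for row_b, row_a in zip(before, after):
        if not row_b or len(row_b) != len(row_a):
            raise ValueError("image rows must be non-empty and the same width")


def mean_change(before, after):
    """Return the mean (after - before) per colour channel over all pixels."""
    _check_same_shape(before, after)
    totals = [0, 0, 0]
    count = 0
    for row_b, row_a in zip(before, after):
        for pixel_b, pixel_a in zip(row_b, row_a):
            for channel in range(3):
                totals[channel] += pixel_a[channel] - pixel_b[channel]
            count += 1
    return tuple(total / count for total in totals)


def changed_pixels(before, after, threshold=10):
    """List (row, column) of pixels where any channel moved by more than threshold."""
    _check_same_shape(before, after)
    changed = []
    for r, (row_b, row_a) in enumerate(zip(before, after)):
        for c, (pixel_b, pixel_a) in enumerate(zip(row_b, row_a)):
            if any(abs(pixel_a[k] - pixel_b[k]) > threshold for k in range(3)):
                changed.append((r, c))
    return changed


# Two tiny one-row "paintings": the second pixel has gained red and lost blue.
earlier = [[(100, 150, 50), (100, 150, 50)]]
later = [[(100, 150, 50), (120, 150, 30)]]
print(mean_change(earlier, later))
print(changed_pixels(earlier, later))
```

The hard part in practice is presumably not the subtraction but the "carefully controlled" capture: unless lighting, camera response and registration are the same on both occasions, the differences measured are those of the equipment and not of the paint.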
Finally, two ladies from the Weeg Computing Centre at the University of Iowa described their videodisc retrieval project. A library of about 18,000 colour slides had been stored (in analogue form) on video disc, and a simple text DBMS (called Info-Text) used to index them. The system was designed for use by faculty members wishing to collect together appropriate illustrative material. In the classroom, images can be projected in the same way as conventional slides; the quality of the images (we were assured) was "better than might be expected"; it looked reasonable on the standard video monitors available at the National Gallery. Images are catalogued according to nineteen different categories (date, provenance, size etc.); no formal iconographic indexing was used. Apart from the obvious advantages of being tougher and cheaper to maintain, one great attraction of the system was seen to be its integration of indexing and displaying comparable and contrasting treatments of equivalent subjects.
The conference closed with a plenary discussion. This focussed at first on the difference between the words "analogue" and "digital", rambled off into ill-informed speculation about the possibility of automatic subject-recognition and was brought to heel by a plea for more information about what sort of database system was worth buying, and whether or not art historians should be expected to get their brains dirty trying to design them. My views on all these topics being fairly predictable, I shall not summarise them here.