Conference on Encoding and Corpora

A visit to Oslo

University of Oslo

14-16 Nov 1994

I was invited to give a number of talks at the University of Oslo as part of a small conference organized by the new inter-faculty Text Laboratory set up there, in collaboration with the Department of English and American Studies, but with visiting guests from other linguistics departments at Oslo, and from the Universities of Jyväskylä (Finland) and Lund (Sweden). The emphasis was on corpus linguistics and encoding; between twenty and thirty staff and research students attended over the three days of the conference.

Willard McCarty from the University of Toronto's Centre for Computing in the Humanities began the first day with a detailed presentation of his forthcoming electronic edition of Ovid's Metamorphoses, which continues to be a fascinating example of just how far the humanities scholar can go with an ad hoc encoding scheme. I then gave the usual rapid canter through the TEI Guidelines, their milieu and architecture, which gave rise to some quite useful discussion before we broke for a substantial lunch. In the afternoon, Willard and I spent some time in the Text Laboratory, trying to install the very first BNC starter set (in my case) and checking email (in his). The Lab has a large Unix fileserver (some kind of DEC machine, since it runs Ultrix), and a room full of Windows PCs and Macs connected to it via Ethernet. We saw no-one else trying to use the equipment while we were there, but the Lab has only just begun operations.

On day two of the conference, Willard gave a talk which began promisingly by outlining the history of concordancing and concordances, from the Middle Ages onwards, but then became an overview of the features of TACT, which did little to improve my opinion of the design of that loose baggy monster of a concordance program. I then gave the usual rapid canter through the BNC, which aroused considerable interest. There were several intelligent questions about the design and construction of the corpus, and the accuracy of its linguistic tagging. I was also able to do my bit for the European Union by pointing out that a "no" vote in Norway might make it more difficult for us to distribute copies of the BNC there (the day before I arrived the Swedish referendum had confirmed Swedish membership; while we were there, rival campaigns on either side of the Norwegian referendum were in full swing).

During the afternoon, Willard and I were (independently) ensconced in offices to act as consultants for a couple of hours: I spent most of my time reassuring a lady from the German department that the TEI really could handle very simple encodings as well as complex ones, and rehearsing with her the TEI solutions to the usual corpus-encoding problems. Oslo is collaborating with Finnish and Swedish linguistic researchers in the development of a set of bilingual corpora (English-Finnish, English-Norwegian, and English-Swedish), so I also spent some time discussing and reviewing the project's proposed usage of the TEI Guidelines. Bergen and Oslo have developed a procedure for automatically aligning parallel texts in English and Norwegian, which appears to work reasonably well, perhaps because the languages are not so dissimilar. I rather doubt whether automatic alignment of English and Finnish will be as easy, but the Finns seemed quite cheerful about the prospect. In the evening we were taken out for a traditional Norwegian Christmas dinner, comprising rotten fish, old potatoes, and boiled smoked sheep's head, washed down with lots of akvavit: not as nasty as it sounds, but twice as filling.

The final day of the conference began with an excellent talk by Doug Biber, from Northern Arizona University, describing the use of factor analysis in the identification of register within a large corpus of materials in three languages (English, Korean and Somali). Biber's use of statistics is persuasive and undogmatic; the basic method was outlined in his book on speech and writing (1988) but its application to cross-linguistic (or diachronic) corpora is new and provoked considerable discussion.

This was followed by my swan song at the conference, a real seat-of-the-pants nail-biting event, being my first ever attempt to describe and then demonstrate the BNC retrieval software running (on Willard's laptop) live and in real time. As a result of careful pre-selection and late-night rehearsal, I'm relieved to say that the software did not crash once, though my ability to control the laptop's trackball in public was frankly pitiful. SARA herself attracted favourable reaction, in particular because of the system design. Interest was expressed in the idea of extending her functionality to cope with the display and searching of parallel TEI-encoded corpora: not a task I think we will be undertaking ourselves in the near future.

This was a relaxed but far from vacuous three days, with ample opportunity for discussion and debate in pleasant surroundings. Sincere thanks are due to my host, Stig Johansson, and his department for arranging it and funding my participation.