PALC97: Practical Applications of Language Corpora

University of Lódz

April 12-14 1997

It's somewhat of a cliché to describe Lódz (pronounced, approximately, Wootch) as the Manchester of Poland: this doesn't mean so much that it is blessed by an excellent football team, as that it is cursed with a major industrial past. At the turn of the century, it was a rich city, built on cloth, and with one of the most prosperous middle class communities east of Berlin. The centre of the town, and the University quarter, still boasts a number of the fine houses they built, some of them now being carefully restored; others still hidden beneath the grey dust and neglect that seems to live over the whole of this region of Europe. But then, of course, came the thirties, and the appropriation of Poland by invaders, first from the West, and then from the East. Lódz, I learned from my guidebook, has the dubious honour of having been the first city in modern times to establish a ghetto, and the ghost of that absence still haunts the place. Curiously, for there are now no Jews to be seen here, it's the star of David which is now daubed on walls, in contexts where elsewhere one might find the swastika, in association with the swirling iconography of urban rage imitated from the inner cities of further west.

The teaching of English is a major growth area in Polish universities; for every student learning Russian or German, there are ten wishing to learn the language of McDonald's and Marks and Spencer, whose emblems now dominate the centre of Warsaw as well as that of High Wycombe. The British Council appears to be playing a major role in satisfying this demand, as witness its funding of this conference, and a number of other educational programmes aimed at secondary and tertiary English language teaching. An unusual and interesting aspect of these programmes is the recognition that access to (and study of) language corpora is of particular importance if there is to be a systematic improvement in the quality of English spoken (and taught) by Poles. Hence the organization of this well-attended international conference.

The four-day event was hosted jointly by the British Council and the University of Lódz Institute of English Studies, represented respectively by Susan Maingay and Barbara Lewandowska, with assistance from James Melia, and took place in the University's recently constructed and comfortable conference centre. Each day began with a brisk walk to the University's council chamber for a plenary session, held beneath the stern depicted gaze of assorted dignitaries in funny hats, followed by parallel sessions at the conference centre, combining project reports and research papers from a wide spread of corpus applications and interests. Evenings were given over to relaxation, discussion, and the opportunity to sample some excellent Polish hospitality.

Plenary sessions

The conference began on a high note, with an excellent lecture by Professor Michael Hoey (Liverpool) on the nature of the corpus linguistics enterprise, the questions it asked, and the answers it might provide. Asserting that corpora could be used to provide evidence of lexical patterns, of semantic prosodies, of syntactic patterns, of colligation, and even of text grammars, Hoey proceeded to discuss striking examples from each of these categories, of which I will summarize here only what he called "the drinking problem principle". If someone has difficulty in drinking, it will generally not be referred to as a drinking problem because the more common collocational sense is inappropriate. In the same way, corpus evidence, rather surprisingly, demonstrates that following a possessive adjective (my, our, his, etc.) the plural form "reasons" is always preferred to the singular "reason" when the intended sense is "cause" (our reasons for doing this...) rather than "rationality" (to lose one's reason).

The second plenary speaker was Tony McEnery (Lancaster Univ) introducing the notion of what he called multimedia corpora. He stressed the need to introduce visual information as a context for understanding verbal material, and showed us a few pages from the corpus of children's writing and drawings currently being created at Lancaster, which will be distributed freely over the web.

The final plenary speaker was Patrick Hanks (OUP), who gave a bravura demonstration of the problems that corpora give lexicographers. What exactly is it that lexicographers do when they go through the lines of a concordance assigning each one to some sense or another of a word? And how on earth do they do it? Hanks has published several papers on this, and worked with some of the best computational names in the business (Atkins, Fillmore...), but he's honest enough to say he still has no definitive answers. His presentation focussed on a few interesting examples: the word "baked", for example, which seems to require certain lexical classes (not just edible foodstuffs, but specific categories of them) and the word .

Parallel sessions

In the nature of things, I couldn't attend all of these. The exigencies of time and space meant that I had to choose to miss presentations from, inter alia: Bengt Altenberg (Lund), Wieslaw Babik (Kraków), Michael Barlow (Rice), Simon Botley (Lancaster), Igor Burkhanov (Rzeszów), Doug Coleman (Toledo), Martha Jones (Nottingham), Dorothy Kenny (Dublin), Bernhard Kettemann (Graz), Przemyslaw Kaszubski (Poznan), Anne Lawson (Birmingham), Barbara Lewandowska-Tomaszczyk (Lódz), Belinda Maia (Oporto), Michal Pawica (Kraków), Margherita Ulrych (Trieste), and Maciej Widawski (Gdansk), whose names I list to give at least some indication of the geographical spread of participants. What follows by way of review should therefore be regarded only as a sample of the concerns raised and materials discussed -- though not, I hope, too unrepresentative a one.

Stig Johansson (Oslo) reported on the progress of the English-Norwegian parallel corpus project, now expanding to include up to a hundred texts in other European languages, notably German, Dutch, and Portuguese. The well thought out design of this corpus allows comparison both between texts translated from language A into language B, and the reverse, although it is not always easy to find sufficient texts to do this (there are far more English-Norwegian translations, for example, than the reverse, simply because it is hard to find comparable texts). In a separate evening session, he described some of the software developed for the project, in particular the automatic alignment procedure developed at Bergen by Knut Hofland and a Windows retrieval package developed by Jarle Ebeling at Oslo.

Michael Rundell (Longman) gave a pleasant presentation about corpus evidence for the British fondness for understatement, in particular the phrase "not exactly" and similar ironic uses. It's probably not too controversial to say that this was not exactly unfamiliar to those who had heard him speak at TALC last year, but none the worse for that.

Guy Aston (Forlì) contrasted the pedagogic usefulness of large corpora such as Cobuild or the BNC with that of small specialized corpora drawn from specific text types, such as a 14,000 word "hepatitis corpus" in use at Forlì. Small corpora are more easily managed by the language learner, and their lexis can be studied in extenso; learners can use them to practise their inductive powers, hypothesizing lexical, colligational or collocational patterns, which may or may not be confirmed by examination of large reference corpora. In this respect, it is possible to get the best of both worlds.

Akua Anokye (Toledo) described some interpretative problems in analyzing the transcriptions of Afro-American folk narratives recorded on aluminum disk by Hurston, Lomax et al in the late twenties and now stored in the Library of Congress. She had transcribed some of these recordings, using her own scheme, and presented a largely impressionistic account of the interplay between their phonological and contextual features.

Sylvia Shaw (Middlesex) described how corpora had influenced the production and format of the third edition of Longman's Dictionary of Contemporary English. This had included both the use of frequency information, derived from large corpora, and particular attention to typical language learner errors, derived from corpora of language learners' production. Thus the student can be advised, for example, of the range of things to which words such as beautiful are typically applied by native speakers, which is much smaller than that used by language learners, and given advice on how to choose between near synonyms such as error and mistake.

Raphael Salkie (Brighton) quoted a number of French writers' opinions about the differences between French and English, intended to help translators as rules of thumb (e.g. ). His paper reported some interesting work on the extent to which these perceptions were borne out by corpus evidence, and gave a brief overview of the Intersect project.

Chris Tribble (Lancaster, Reading, Warsaw), picking up Guy Aston's paper on the benefits of small corpora, suggested that for classroom use, small corpora were of more use than large as well as being more accessible. His paper reported on some experiments using Microsoft's Encarta as a language resource (as well as a source of factual information), noting that the type of language it contains is very similar to that which language learners are typically required to produce: brief factual articles.

Oliver Mason (né Jakobs) from Birmingham presented what was in many ways an exemplary research report on some very interesting work he has been doing on identifying the size of collocation spans statistically, by calculating the type-token ratio of the words appearing in each position to the left and to the right of the keyword. The results are striking: different node words exhibit markedly different patterns of influence on their neighbours, giving a visual hint of the extent to which they construct fixed phrases, for which he proposed the term lexical gravity.
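
For readers who want to see the general idea in concrete terms, here is a minimal sketch of a per-position type-token ratio. It is my own illustration and not Mason's actual procedure: the function name positional_ttr, the whitespace tokenization, the span of two words, and the toy text are all assumptions made for the example.

```python
from collections import defaultdict


def positional_ttr(tokens, node, span=3):
    """Type-token ratio of the words found at each offset from `node`.

    Hypothetical helper for illustration only; a low ratio at an offset
    means few distinct words occur there, i.e. the node word constrains
    that position strongly.
    """
    seen = defaultdict(list)  # offset -> words observed at that offset
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                seen[offset].append(tokens[j])
    # distinct words (types) divided by total words (tokens), per offset
    return {off: len(set(words)) / len(words)
            for off, words in sorted(seen.items())}


# Toy text, invented for this example and whitespace-tokenized.
text = ("he has a drinking problem . she has a drinking problem . "
        "they had a drinking problem . we saw a serious problem")
tokens = text.split()

ratios = positional_ttr(tokens, "problem", span=2)
for offset, ttr in ratios.items():
    print(f"{offset:+d}: {ttr:.2f}")
```

On this toy text the ratio stays well below 1 to the left of the node and reaches 1 two words to its right, which is the kind of asymmetry the paper was concerned with; a real study would of course need a large corpus and a principled way of deciding where the span ends.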

Sylvia Scheur (Poznan) discussed several aspects of her research into the pronunciation difficulties faced by Polish-speaking learners of English. She had recorded 17 Polish language-learners reading the same English texts at the start of their course and a year later, and was transcribing these phonetically (using the SAMPA writing system for the International Phonetic Alphabet, developed by John Wells). Students were also asked to assess their own performance, which produced some interesting comments about their perceptions of English prosody.

I gave that paper about the BNC and SARA again, spiced up somewhat for this audience by the addition of hints about forthcoming availability of the corpus outside the EU, and also with the first ever live demonstration of the sampler corpus.

Philip King (Birmingham) gave an overview of the Lingua multilingual parallel concordancing project, now in its second phase of existence, with a particular focus on some of the pedagogic software being developed at Birmingham for its exploitation, its use in generating course material for student use, and the ability to browse and search parallel corpora. It would be interesting to compare the methods and results of this project with those of the ENPC described by Stig Johansson, but no-one had the temerity to do so, in public at least.

Social events

Like other academic conferences this one was oiled by a couple of very pleasant evenings, drinking, dining, and discussing. Particularly memorable was an evening concert of baroque music by Telemann and Handel played on original instruments, followed by a splendid buffet dinner. This was held in one of the aforementioned bourgeois palaces and much appreciated by all. After a couple of glasses of very drinkable Hungarian wine, McEnery, Kettemann and I were able successfully to impress all the help we needed to make TALC 98 (hopefully, to be held in Oxford next July) a reality.

I also took the opportunity of a free Sunday at the end of the conference to visit Arkadia: this is an ornamental garden full of picturesque gothic ruins, classical statues, and the like, originally laid out by the local aristocracy in the 18th century, and now a pleasant enough place for a Sunday afternoon stroll. Getting there involved a detailed and educational study of Polish regional railways and bus services, to say nothing of the refreshment room at Lowicz railway station (which I won't).