Computers & Texts 14: Fix

Computers & Texts No. 14

April 1997

Computer-Aided Processing of Old German Texts

University of Würzburg, 4-6 March 1997

Jakob Fix
Oxford Text Archive
jakob.fix@oucs.ox.ac.uk

MAVAT (Maschinelle Verarbeitung altdeutscher Texte) is an international conference whose objective is to investigate and promote the use of computers in Old German philology and linguistic research. It was founded in 1972 on the initiative of two German linguists, Hugo Moser and Wilfried Lenders. The fifth MAVAT conference took place this year in Würzburg, nine years after the previous one. Although, or perhaps because, the conference's topic is rather specific, participants came from all over the world: Austria, Germany, Japan, Sweden, Switzerland, the United States, and the UK.

The organisers at Würzburg University's Department of German Language decided to focus on three topics during this conference: 1) the use and influence of New Media for Old German philology; 2) the use and importance of corpora for the subject; 3) lexicography in general. Besides these foci, there was also interest in legal issues connected with the use and creation of copyright material.

The opening speech, given by Norbert Richard Wolf, one of the organisers, caused some amusement by comparing the research results of computing humanists with the virtual Japanese pop star Kyoko Date. The last MAVAT conference in 1988, he recalled, was dominated by a concern for basic technical issues such as disk sizes and processor speed. This conference, he hoped, would address the higher-level problems which researchers continue to face.

Language and Corpora

The first presentation was given by Lou Burnard (Oxford University) on the British National Corpus (BNC). Lou outlined how this huge project was sponsored, how the texts were selected, and how the corpus was created, and he described the various stages of production. He also drew attention to how copyright issues were solved by the use of a standard agreement drawn up with the co-operation of the Society of Authors, the Publishers Association, and other interested parties. Finally, he showed some screenshots to illustrate how the BNC can be used. Although the BNC generated a significant amount of interest, some participants were disappointed to learn that the Corpus is currently available only within the European Union.

Randall Jones (Brigham Young University, Provo) discussed his experiences of creating a corpus of colloquial German in (then West and East) Germany, Austria, and Switzerland. He and others conducted interviews, which were then transcribed, in settlements ranging from villages to large cities. Some of the problems involved became apparent in the discussion afterwards, when Swiss participants claimed that there is no such thing as a Swiss Standard German, and a participant from Northern Germany was able to refute another of Jones's results. These problems seemed to be caused mainly by the fact that not all interviewees were really 'natives', so that results were sometimes imprecise.

Thomas Klein (Bonn University) talked about the possibilities of using computers to make lemmatised indices of Middle High German (MHG) texts searchable, and to link exceptions to the relevant paragraph(s) in an MHG grammar book. If such an index exists for a text, it can be searched for word classes, inflection (if applicable), the normalised form of a word, rhymes, and so on. In the subsequent discussion, questions focused on the choice of grammar book (Paul/Wiehl/Grosse) and on the idiosyncratic markup used.
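To make the idea of a searchable lemmatised index concrete, here is a minimal sketch in Python. It is not the system presented in the talk: the `Token` record, its field names, the word-class labels, and the lemma assignments for the sample line (the well-known opening of the Nibelungenlied) are all assumptions made for illustration, and the `grammar_ref` field is a hypothetical placeholder for a link to a grammar paragraph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    line: int              # line number in the edited text
    form: str              # word form as it appears in the text
    lemma: str             # normalised dictionary form (illustrative)
    pos: str               # word class (labels are an assumption)
    grammar_ref: str = ""  # hypothetical pointer to a grammar paragraph


def build_index(tokens):
    """Group tokens by lemma and by word class for lookup."""
    by_lemma = defaultdict(list)
    by_pos = defaultdict(list)
    for tok in tokens:
        by_lemma[tok.lemma].append(tok)
        by_pos[tok.pos].append(tok)
    return by_lemma, by_pos


# Illustrative annotation of a single line; not taken from any real index.
tokens = [
    Token(1, "Uns", "wir", "PRON"),
    Token(1, "ist", "sîn", "VERB"),
    Token(1, "in", "in", "PREP"),
    Token(1, "alten", "alt", "ADJ"),
    Token(1, "mæren", "mære", "NOUN"),
    Token(1, "wunders", "wunder", "NOUN"),
    Token(1, "vil", "vil", "ADV"),
    Token(1, "geseit", "sagen", "VERB", grammar_ref="(hypothetical)"),
]

by_lemma, by_pos = build_index(tokens)
nouns = [tok.form for tok in by_pos["NOUN"]]
forms_of_sagen = [tok.form for tok in by_lemma["sagen"]]
```

Searching by normalised form then amounts to a dictionary lookup, so a contracted form such as "geseit" is found under its lemma even though the two strings differ.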

Wolfgang Klein (Institut für deutsche Sprache (IDS), Mannheim) spoke about the techniques and software used at the IDS to transcribe taped conversations. The IDS has developed DIDA ('Diskursdatenbank', or discourse database), which allows for the continuous entry of conversation with an unlimited number of speakers, and for the addition of general and speaker-specific comments. The software is used to create the corpora which form the basis of COSMAS (Corpus Storage, Maintenance, and Access System).

The theoretical underpinnings of corpus research at the IDS and of COSMAS were presented the next day by Robert Neumann, manager of the IDS's IT department. He introduced the IDS's concept of the 'virtual corpus', which enables users of COSMAS to select a number of corpora and thus to create an individual corpus for their specific needs. A demonstration is available at http://www.ids-mannheim.de/ldv/cosmas/
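The 'virtual corpus' idea, a user-defined selection over existing corpora rather than a new copy of the texts, can be sketched as follows. This is only an illustration under stated assumptions: the `VirtualCorpus` class, its methods, and the tiny sample archive are invented here and say nothing about how COSMAS is actually implemented.

```python
from itertools import chain


class VirtualCorpus:
    """A view over several named corpora; the texts themselves are not copied."""

    def __init__(self, archive, names):
        missing = [name for name in names if name not in archive]
        if missing:
            raise KeyError(f"unknown corpora: {missing}")
        self._archive = archive
        self.names = list(names)

    def documents(self):
        # Walk the selected corpora in the order the user chose them.
        return chain.from_iterable(self._archive[name] for name in self.names)

    def search(self, word):
        # Case-insensitive whole-word match on whitespace-separated tokens.
        word = word.lower()
        return [doc for doc in self.documents() if word in doc.lower().split()]


# Invented sample data standing in for real corpora.
archive = {
    "newspapers": ["Der Zug kommt heute später", "Das Wetter bleibt schön"],
    "spoken": ["ja das Wetter ist gut", "nein heute nicht"],
    "literary": ["Es war einmal ein König"],
}

vc = VirtualCorpus(archive, ["newspapers", "spoken"])
hits = vc.search("wetter")
```

Because the virtual corpus only stores the names of its members, two researchers can define different selections over the same archive without duplicating any text.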

Klaus Schmidt (Bowling Green State University) and Horst P. Pütz (Kiel University) presented what is probably 'the largest electronic text archive of medieval German literature' currently available (about 100 texts, 1 million lines of text, and 8 million individual words). It is fully lemmatised and can be queried over the Internet through thesaurus-like conceptual categories, as the speakers demonstrated during their presentation. Further details are available at http://www.bgsu.edu/departments/greal/MHDBDB.html

Jakob Fix completed the proceedings of the first day with a presentation on how to use SGML, and the TEI Guidelines in particular, to encode dialect dictionaries. He demonstrated how, once the text is encoded, the dictionary can be displayed using a WWW browser or a dedicated SGML browser, and how it can be searched using OpenText, an SGML-aware text search and retrieval application. Two particular problems were highlighted: how to get the printed book into electronic form, and how to display the unusual phonetic transcription, Teuthonista.
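As a rough illustration of what structured dictionary encoding makes possible, the sketch below parses two entries marked up with element names from the TEI dictionary tag set (entry, form, orth, gramGrp, pos, sense, def) and looks up a headword. Several things here are assumptions: the sample uses XML syntax so that Python's standard library can parse it, whereas the talk concerned SGML; the two entries are invented examples, not taken from the dictionary discussed; and the `lookup` function is a hypothetical helper, unrelated to OpenText.

```python
import xml.etree.ElementTree as ET

# Invented sample entries using TEI dictionary element names.
SAMPLE = """
<div>
  <entry>
    <form><orth>Erdapfel</orth></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <sense><def>potato</def></sense>
  </entry>
  <entry>
    <form><orth>Paradeiser</orth></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <sense><def>tomato</def></sense>
  </entry>
</div>
"""


def lookup(root, headword):
    """Return the definitions of the first entry whose headword matches."""
    for entry in root.iter("entry"):
        if entry.findtext("form/orth") == headword:
            return [d.text for d in entry.findall("sense/def")]
    return []


root = ET.fromstring(SAMPLE.strip())
headwords = [e.findtext("form/orth") for e in root.iter("entry")]
definitions = lookup(root, "Paradeiser")
```

Because headword, word class, and definition are separate elements, a search can be restricted to one of them, which a plain full-text search over the printed page cannot do.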

Texts and Archives

The second day started with two presentations on existing text centres, or archives. Catherine Tousignant (University of Virginia) introduced the Electronic Text Center (ETC) at the University of Virginia Library. She was followed by Alan Morrison, who talked about the Oxford Text Archive (OTA). Both talks were of interest to the audience, as a German text repository does not yet exist. The two institutions take quite different approaches. The ETC is integrated with, and fully dependent on, the University library, whereas the OTA is part of the Oxford University Computing Services and is jointly funded by Oxford University and the Joint Information Systems Committee (JISC). Both arrangements have advantages and drawbacks. The American institution also draws heavily on graduate students for many tasks, a possibility that has yet to be explored at the OTA.

In the evening, participants were invited to a public lecture by Michael Sperberg-McQueen. His eloquent lecture, given in German, was titled 'Die Hochzeit der Philologie mit dem Merkur: Philologische Datenverarbeitung' (The Marriage of Philology and Mercury: Philological Computing). Michael gave an overview of four text models. The first, which he called the 'linear text', regards a text as a string of characters. The second, the 'rectangular text', treats a text as words which consist of characters and have characteristics and functions. The third, the 'text cake with command bits', is text with embedded layout information, as in RTF or LaTeX. The last, the 'tree structure text model with element types', is, as one would expect from an SGML expert, an abstraction of SGML. In his account it offers by far the most possibilities for precisely defining a text grammar or structure, which can then be used for analysis or presentation of the text.

The last day saw many parallel presentations to choose from. In the morning, there was a talk by Ralf Plate and Ute Recker (University of Trier) on how they use computers in preparing a Middle High German dictionary. The main goal presented (and, apparently, demonstrated in a later workshop) was to replace the traditional, huge collection of index cards with an electronic collection of lemmatised texts, the latter produced by an interactive program based on TUSTEP. Further details on this project are available from http://www.uni-trier.de/uni/fb9394/germanis.htm

Heinz Korten and Michael Prinz (University of Regensburg) presented an impressive prototype of multimedia place name books. These books contain much information about the origin of place names, their connection with other names, detailed descriptions of where they occur, and so on. This mass of information lends itself to a multimedia presentation. The project shown, in fact supporting material for the speakers' Master's theses, was produced as a Toolbook application on CD-ROM. Although only sample data had been entered, the application's functionality was obvious: there were click-sensitive maps, entries from the printed place name books, spoken audio samples from each place, and a note pad function, and the entire corpus of material could be searched.

Ingrid Lemberg (Academy of Heidelberg) spoke about another dictionary project, the German Legal Dictionary, whose goal is to record all medieval legal sources in dictionary form. For that purpose the project uses FAUST, a database system originally designed for library systems and customised for the dictionary. Ingrid described how the dictionary is created and how the system has helped to eliminate errors and make the whole process smoother.

The 'father' of TUSTEP, Wilhelm Ott (University of Tuebingen), gave a short introduction to the current state of TUSTEP, followed by a number of presentations demonstrating how the software can be used. For example, TUSTEP can be (and is) used to produce critical editions. The system is able to collate several versions of a manuscript, to select one as the main version, and to give all changes, omissions or additions in footnotes. However, this process requires some editing of configuration files for every task executed, and the syntax of TUSTEP commands takes some getting used to. Further details about TUSTEP may be found at http://www.uni-tuebingen.de/zdv/zrlinfo/tustep-des.html

Wolfram Schneider-Lastin (University of Basle) showed an extension to TUSTEP which makes data input much easier. He has developed an equivalent of the 'forms' found in database packages for the DOS version of TUSTEP, including drop-down menus, restriction of entries, automatic error checking, etc.
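The collation step mentioned in connection with critical editions (compare versions, choose one as the base, report changes, omissions and additions as notes) can be illustrated with a short Python sketch. This is not TUSTEP and does not use its syntax: it relies on the standard library's `difflib.SequenceMatcher`, the `collate` function and the note format are assumptions, and the two variant readings labelled B and C are invented for the example.

```python
import difflib


def collate(base, witness, siglum):
    """Return apparatus-style notes on how `witness` differs from `base`."""
    base_words = base.split()
    wit_words = witness.split()
    matcher = difflib.SequenceMatcher(a=base_words, b=wit_words, autojunk=False)
    notes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        lemma = " ".join(base_words[i1:i2])    # reading of the base text
        reading = " ".join(wit_words[j1:j2])   # reading of the witness
        if tag == "replace":
            notes.append(f"{lemma}] {reading} {siglum}")
        elif tag == "delete":
            notes.append(f"{lemma}] om. {siglum}")
        elif tag == "insert":
            anchor = base_words[i1 - 1] if i1 > 0 else "(start)"
            notes.append(f"after {anchor}] add. {reading} {siglum}")
    return notes


base = "uns ist in alten mæren wunders vil geseit"
# Invented variant readings, for illustration only.
witness_b = "uns ist in alten mæren wunder vil geseit"
witness_c = "uns ist in mæren wunders vil geseit"

apparatus = collate(base, witness_b, "B") + collate(base, witness_c, "C")
```

Comparing word by word rather than character by character keeps the notes aligned with the units an editor would cite. A real collation would also need to normalise spelling and handle transpositions, which this sketch does not attempt.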

Conclusion

At the end of this conference, it became apparent that there is little or no consensus on appropriate standards amongst scholars working with Old German texts. Each researcher or group develops its own programs (perhaps based on SGML or TUSTEP), resulting in much duplication of effort. Outside this conference, information exchange between scholars working in this area appears to be rather sparse. In his closing speech, Norbert R. Wolf concluded that the need for standards was recognised by the majority of those present; that text repositories which offer reliable German texts for research are desirable; and that corpora are an important field of research. The next MAVAT will take place in 2002 when, one hopes, scholars will be even further advanced in their application of new technologies to the study of German texts.


Computers & Texts 14 (1997), 15. Not to be republished in any form without the author's permission.

HTML Author: Michael Fraser (mike.fraser@oucs.ox.ac.uk)
Document Created: 24 May 1997
Document Modified: 8 October 1997

The URL of this document is http://info.ox.ac.uk/ctitext/publish/comtxt/ct14/fix.html