After brief initial background presentations from F. De Bruine
and Sergei Perschke, placing the Eurotra project in its political
and historical contexts respectively, Nino Varile (CEC)
gave a brief overview of the last three decades of what is now
known as NLP. He stressed chiefly the way in which idiosyncratic
procedural systems based on Transformational Grammar had given
way to declarative systems based on lexical unification theories.
He argued that such systems, being inherently more robust, would
speed up progress in MT systems during the 90s, as would the
notion of reusable modular lexical and processing resources. The
object should be systems of high quality; the days of what Nino
disparagingly referred to as "do-it-yourself" NLP were over.
Multi-functional resources, shareable between projects, would be
the norm.
Roberto Cencione (CEC) then introduced the main
business of the day: initial reports from four feasibility
studies commissioned by the CEC, each of which had been briefed
to investigate a distinct aspect of Eurotra II. Unlike Eurotra I,
this would be a kind of NLP workbench, modular,
formalism-independent and capable of evolving to meet new
requirements, but
currently solidly based in industry standards (POSIX, X/OPEN,
WINDOWS, NFS, SQL, SGML...). Each of the four studies had been
asked to assess the current prototype, consider existing relevant
formalisms, and specify new tools, formalisms or strategies as
appropriate. Each had involved collaboration between academic and
industrial partners: final reports are not due until July, but
initial versions of all were presented at the meeting. Cencione
also highlighted Eurotra's drift into professionalism (my
phrase): until 1987 all R&D had been in the hands of academics;
during the last two years a central software team had taken this
over. The next phase, until 1994, would be characterised by
turnkey projects carried out under contract. After 1994,
cost-sharing projects would become the norm.
Jörg Schütz (Inst for Applied Information Science,
Saarbrücken) picked up the theme of open modular
architecture. By contrast with Eurotra I, the new system takes an
object-oriented approach. He presented the various layers of the
architecture, from man-machine interface down to database
storage, by way of user agent and object manager. The latter
handles interactions with lexical resources and rules; the former
oversees a number of co-operative software agents or
"toolboxes", for example for text-handling. A need was
identified for a formalism-independent lexical interface
representation language. A speaker from CAP-Gemini gave some
further detail of the MMI agent: it would have a distinctive
Eurotra look and feel, but would be built on X/Windows. As
for system control -- if all else failed, there would always be
Unix.
Steve Pullman of SRI International began by noting
that the linguistic formalism must be usable for general NLP, in
multilingual and monolingual contexts alike, as well as for
MT. Other desiderata were that it should be declarative,
reversible (i.e. usable for generation as well as analysis) and
theory independent. It should have an easily implemented core, to
which equally monotonic and declarative extensions could be
hooked, and should use typed feature logic. There was some
discussion of the interfaces between the Language Analysis agent
(LA) and the user agent's virtual machine (VM) on the one hand,
and between LA and the Text Handling agent (TH) on the other.
LA/VM was well defined: primitive functions included Parse,
Generate, Refine (i.e. further transform the output from
Generate) and Transfer (i.e. translate), in each case with
appropriate parameters such as language or grammar. LA/TH was
rather more fuzzy, with some linguistic functions such as
morphological analysis being done by TH rather than LA.
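To make the shape of the LA/VM interface concrete, here is a minimal Python sketch of how the four primitives might chain together. Only the names Parse, Generate, Refine and Transfer, and the idea of language and grammar parameters, come from the presentation; the signatures, the `Params` bundle, the order of calls in `translate` and the `EchoAgent` stand-in are all assumptions made for illustration, not anything specified by the study.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Params:
    """Hypothetical parameter bundle; the talk mentioned only language and grammar."""
    language: str
    grammar: str = "default"


class LanguageAnalysisAgent(Protocol):
    """The four LA/VM primitives named in the talk; every signature is assumed."""

    def parse(self, text: str, params: Params) -> object: ...
    def generate(self, structure: object, params: Params) -> str: ...
    def refine(self, text: str, params: Params) -> str: ...
    def transfer(self, structure: object, source: Params, target: Params) -> object: ...


class EchoAgent:
    """Toy stand-in that does no linguistics at all; it only shows the call flow."""

    def parse(self, text, params):
        return {"lang": params.language, "tokens": text.split()}

    def transfer(self, structure, source, target):
        return {"lang": target.language, "tokens": structure["tokens"]}

    def generate(self, structure, params):
        return " ".join(structure["tokens"])

    def refine(self, text, params):
        return text.strip()


def translate(agent: LanguageAnalysisAgent, text: str, source: Params, target: Params) -> str:
    """One assumed ordering of the primitives: Parse, Transfer, Generate, Refine."""
    analysis = agent.parse(text, source)
    transferred = agent.transfer(analysis, source, target)
    draft = agent.generate(transferred, target)
    return agent.refine(draft, target)


result = translate(EchoAgent(), " the cat sleeps ", Params("en"), Params("fr"))
```

Because the interface is declared as a protocol, any agent supplying these four methods could be substituted, which is one way of reading the requirement that the formalism be theory independent.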
Christian Devillers of SEMA reviewed the text-handling
design study. This component interfaces the Linguistic
Analysis system with real texts, both during input and
generation. The study had involved a brief survey of existing
office document handling systems, SGML systems, and systems used
within the literary and linguistic computing paradigm. A
simple SGML DTD for texts passing across the TH/LA interface had
been defined (EDIF - Eurotra Document Interchange Format).
Recognising that TH tools would probably be of wide interest
outside Eurotra, EDIF has been designed with an eye on TEI
conformance.
The TH component's main function is to translate between a formatted document and whatever linguistic structures are used for input to (or output from) the LA component. At present, LA requires input of single sentences, with no nested quotations etc., as well as some quite detailed morphological analysis, and that is therefore what TH must produce. Devillers stressed that the segmentation performed by TH was determined entirely by the LA: if this were enhanced for example to deal with paragraphs, then that would be passed across the interface. Some rendition features of the input text are passed through to LA; the majority however are filtered out and stored somewhere unspecified, so that they can be re-associated with the output text.
This presentation was followed by a few desultory questions about other related CEC-funded projects and about the range of material anticipated for translation. I spoke very briefly about the TEI in response to a prod from Cencione. It felt like a very long morning (no coffee break) by the time we all went, thankfully, to lunch in what is unquestionably one of the biggest and best office canteens in the world.
Ulrich Heid (University of Stuttgart) presented some
initial results from a rather different feasibility study, on
the reusability of lexical resources. Like Varile, he stressed
the economic argument in favour of re-using resources, which
follows from the imperative need to size up NLP projects. A toy
system can demonstrate anything: you can only demonstrate what
is really feasible with a realistically
sized system. Reusability might mean simply re-cycling of a
resource prepared for some other purpose, or it might mean
designing resources with multiple applications in mind from the
start.
Most of the presentation dealt with questions specific to
re-usable machine-readable dictionaries (MRDs). Heid touched briefly
on the existence of a number of related projects and initiatives
(e.g. Acquilex, Genelex, Multilex, which are concerned with
acquisition, formal description and integration of MRDs
respectively). MRDs contain vastly larger amounts of information
than electronic lexica, but they are not available for many
languages and their underlying structures are not explicit. The
study group's approach to unification of lexicographic
information across different MRDs was to try to define some
primitive level of description, expressed in a typed feature
logic, corresponding with the linguistic phenomena which the
dictionaries purported to describe.
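As a rough illustration of what merging lexicographic information by unification involves, the following Python sketch unifies two feature structures represented as nested dictionaries. It is untyped, so it is simpler than the typed feature logic the study group had in mind, and the sample entries and feature names (`lemma`, `cat`, `subcat`) are invented for the example rather than taken from any of the dictionaries discussed.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None when a feature carries
    conflicting values in the two inputs.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy, so neither input is modified
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # a clash anywhere makes the whole unification fail
                result[feature] = merged
            else:
                result[feature] = value
        return result
    # Atomic values (or an atom against a dict) unify only if they are equal.
    return a if a == b else None


# Two hypothetical entries for the same word, from different dictionaries.
entry_a = {"lemma": "give", "cat": "verb", "subcat": {"obj": "np"}}
entry_b = {"lemma": "give", "subcat": {"obj": "np", "iobj": "pp"}}

merged = unify(entry_a, entry_b)         # compatible: the information is pooled
clash = unify(entry_a, {"cat": "noun"})  # incompatible category: returns None
```

The point of the sketch is that compatible descriptions combine monotonically, while contradictory ones are rejected rather than silently overwritten, which is what makes a common primitive level of description useful when pooling several dictionaries.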
This talk provoked some disagreement from the floor, in the shape of Wolf Paperote (Munster), who asserted on behalf of corpus linguists everywhere (there were none present) that MRDs were a lot less useful than corpora as a source of linguistic information. Since parsing corpora was marginally easier than parsing MRDs, and much cheaper, he asked whether the money would not be better spent on the former. To judge from the icy silence that greeted this remark, his was a minority view.
Cencione rounded off the day by describing the procedure to be adopted for the call for tenders, as set out in ET9. Tenders were invited for two distinct projects: first, the implementation of a Eurotra II development environment, as described by the four study papers; second, the provision of maintenance and software support facilities for all Eurotra Project researchers (currently 17 sites located in 12 countries). The contracts would run to 1993, with the possibility of extensions under the new Linguistic Research Engineering (LRE) programme for a further two years from 1994.
Nearly a hundred different companies and institutions had expressed interest in tendering. The formal invitation to tender would be published towards the end of May; the deadline for bids would be the end of July; contractor/s would be chosen by late September with a view to concluding contractual negotiations by the end of the year, and starting work early in January 1992. Contractors could organise the work as they deemed fit, bid for one or both projects, subcontract work etc, but the CEC would contract with only one member of a consortium, who should moreover be responsible for at least half of the work on the project. Software developed under ET9-1 would remain the property of the CEC, and must be shareable with any future research projects funded by the Commission. The CEC's estimate of the costs for both projects over two years was around 30 man/years, with two thirds of the approximately 15 man/years for the first year being allocated to ET9-1.
At the risk of stating the obvious, I would like to stress the importance of the CEC's Linguistic Research Engineering project to the future of the TEI, and not just because of the amounts of money involved (several millions of ECU over the next few years). Over the next few weeks I will be writing up an assessment of the TH study, as part of the OUCS/SEMA consultancy project. Any input or comments from the TEI perspective would be most useful. I see three chief areas of overlapping interest, briefly summarised below.
There is a lot of effort and money going into initiatives to standardise lexical resources such as MRDs, with which TEI is already involved by virtue of overlapping membership (Nicoletta Calzolari is, of course, a significant contributor to ET-7); however, it was clear from talking to Heid that closer collaboration would be both possible and welcome. As a first step I have requested copies of ET-7's detailed working papers, several of which include surveys of existing encoding schemes and recommendations for standardisation which should be brought to the attention of the relevant WGs. I think the new WGs on lexical resources and on terminology in particular should be encouraged to build on this European work rather than go their own way.
Someone competent to judge the issues within the AI1 should be asked to assess the linguistic formalism of ET6-1 and consider ways of representing it using TEI style feature sets. In my report on ET6 I would like to suggest that LA should be able to output the results of its analysis in a TEI conformant way: it would be nice to have some specific arguments and examples to support this, but I am not confident of my competence to produce them.
Despite the general lack of enthusiasm within the Eurotra project for corpus linguistics, it seems to me that some of the tools developed as part of TH may prove to be of particular interest to several TEI projects. TH will (for example) have to develop ways of automatically detecting and tagging sentences and morphological structure in the full range of European languages in SGML. If properly designed and implemented, such tools would be of great general applicability. The CEC's policy as regards making such tools freely available to the research community, at least within Europe, sounds distinctly encouraging, as does their declared intention of working within an open Unix environment.
The following documents were made available at the workshop.