Coreference is a linguistic phenomenon in which different expressions refer to the same entity. For example, in the sentence "Thomas said he shaved himself", "he" and "himself" may both refer to Thomas. Such referential relationships often vary depending on sentence structure, meaning, and context. Understanding coreference across languages is a key challenge for theoretical, experimental, and computational linguistics.
While coreference has been studied in many languages, little attention has been paid to Cantonese, which exhibits complex coreference behaviour. Existing theoretical work on Cantonese coreference is rare, and none to date provides a comprehensive analysis with formally explicit constraints capturing the interplay across all relevant levels of grammatical architecture. In natural language processing, Cantonese remains under-resourced in terms of both data scale and diversity. Therefore, to shed further light on cross-linguistic variation in coreference phenomena, our project takes Cantonese as its main object language.
We will integrate theoretical, experimental, corpus-based, and computational approaches into a holistic investigation. Our theoretical analysis will be conducted within Lexical-Functional Grammar (LFG), a formally explicit grammatical framework. What makes LFG especially useful for this project is its ability to handle different levels of linguistic analysis simultaneously, from sentence structure to meaning and discourse. LFG works well with Glue Semantics, which helps us understand how different pieces of meaning fit together in a sentence, and with Discourse Representation Theory, which focuses on how information is carried across sentences in a discourse. Our research will also take into account existing analyses conducted in other linguistic frameworks, in particular Minimalism, in the spirit of facilitating cross-framework dialogue and understanding.
On the computational side, we will develop grammar fragments for Cantonese using the Xerox Linguistic Environment (XLE), a tool that enables linguistic constraints to be tested computationally and computational grammar resources to be created. In the current AI-driven landscape, handcrafted computational grammars grounded in well-defined linguistic theories have often been overlooked. However, machine-learning models, including large language models (LLMs), require vast amounts of training data, which are difficult and expensive to collect and annotate. Broad-coverage handcrafted grammars could, in principle, generate high-quality, well-annotated data for training and fine-tuning these models.
More broadly, our project aims to inform the development of the linguistic theory of coreference resolution by adducing solid empirical evidence. As part of our project, we will engage with the wider computational linguistics community via the Parallel Grammar Consortium to explore the future potential of handcrafted grammars in the AI era, bridging the current gap between theoretical and computational linguistics.