Computers and Corpora

* The what and why of corpora
* English corpora: the story so far
* What do corpora tell us about language?
* Designing and building a corpus: the British National Corpus

What is a corpus?

* How do we find out what words mean?

- algorithm
- authority
- usage

* Corpus linguistics re-centres the last of these

What is a corpus?

1. The body of a man or animal. (Cf. corpse.) Formerly frequent; now only humorous or grotesque. ...
2. Phys. A structure of a special character or function in the animal body, as corpus callosum, ...
3. A body or complete collection of writings or the like; the whole body of literature on any subject. ...
4. The body of written or spoken material upon which a linguistic analysis is based ...
5. The body or material substance of anything; principal, as opposed to interest or income.

What is a corpus?

“a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair, 1994)

* linguistically-motivated selection
* representative intention
* distinct from collection or archive

Why make corpora?

* to get a more comprehensive view of language in use
* to improve on individual intuition
* total accountability vs individual salience

What can a corpus show us?

* lexis, and patterns of lexis

- is starting always replaceable by beginning?
- is it only time that is immemorial?

* syntax, and patterns of syntax

- start to do vs. start doing
- I hope that vs. I hope to

What can a corpus show us?

* idiolects and speech communities

- male and female, old and young
- preferred and deprecated usages

* discourse and rhetoric

- code switching and topic change
- turn taking and negotiation

Some major English corpora

* The Brown School

- Brown (1964): American English
- Lancaster-Oslo-Bergen (1980): British
- Kolhapur (1988): Indian
- Macquarie (1988): Australian
- Wellington (1993): New Zealand
- International Corpus of English (1992)

* 1 million words each
* Specific text types sampled

Some major English corpora

* The Birmingham School

- Cobuild and the Bank of English

* Up to 300 million words
* Designed to monitor usage, initially for lexicographic use

Some major English corpora

* The British National Corpus
* 100 million words
* Designed to sample a wide variety of text types, both spoken and written
* Richly encoded with POS and structural information

Other varieties of corpus

* spoken language

- London-Lund (1990)
- Trains, PhoneBook (1995-6)

* children's and learners' language

- CHILDES (1990)

* genre and topic specific

- Map Task (1991)

Other varieties of corpus

* Diachronic corpora

- Helsinki corpus (1993)

* Multilingual corpora

- Parallel corpora

Using corpora

* natural language processing
* language teaching
* contrastive studies
* describing language

Natural Language Processing

* corpora provide

- statistical evidence
- realistic testbeds

* for such NLP tools as

- spell checkers
- automatic indexers
- human interfaces
- cross-linguistic retrieval systems

Language Teaching

Exploratory use of a corpus can:

* complement dictionaries and grammars

- a stone's throw from here
- what's a whammy?

* reduce dependence on the individual teacher
* extend cultural knowledge

Contrastive studies

* American and British English
* Translation corpora
* Comparable corpora

- FLOB and LOB (1993)
- ARCHER

* Variation in speech and writing

Describing language

* tools
* theoretical axioms
* the idiom principle
* collocation and colligation
* semantic prosody

Corpus analysis tools

* concordances
* frequency lists
* collocation tables
* software

- WordCruncher, WordSmith
- SARA, OpenText, Corpus Bench
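
The first two tools on this slide can be sketched in a few lines. This is a minimal illustration only, assuming a crude regex tokenizer and an invented sample text; the names `tokenize`, `frequency_list` and `concordance` are hypothetical and do not describe how any of the packages listed above work.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens, n=5):
    """Return the n most frequent word forms with their counts."""
    return Counter(tokens).most_common(n)

def concordance(tokens, node, width=3):
    """Return key-word-in-context lines: up to `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Invented sample text, for illustration only (not drawn from any real corpus).
SAMPLE = (
    "The rain set in early. Panic set in after the crash. "
    "The house is a stone's throw from here."
)

tokens = tokenize(SAMPLE)
for line in concordance(tokens, "set"):
    print(line)
print(frequency_list(tokens, 3))
```

Note that this sketch ignores sentence boundaries, so context windows run across full stops; a real concordancer would normally let the user choose.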

A corpus linguist's manifesto

* Linguistics is essentially a social science, and an applied science
* Language should be studied in actual, attested, authentic instances of use, not as intuitive, invented, isolated sentences
* The unit of study must be whole texts
* Text and text types must be studied comparatively across text corpora
* Linguistics is concerned with the study of meaning: form and meaning are inseparable.
* There is no boundary between lexis and grammar: lexis and grammar are interdependent
* Much language use is routine
* Language in use transmits the culture
* Saussurian dualisms are misconceived

The Idiom Principle

* words are co-selected, not individually chosen
* hence the importance of collocation
* cf. the open-choice principle


* lexicalization may be arbitrary

- of course vs maybe

* 'fixed' phrases vary

- set on fire vs set fire to

* high frequency words have distinct patterns of usage rather than distinct senses

- back
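
One way to put a number on co-selection is an association score for word pairs. The sketch below uses pointwise mutual information (PMI) over adjacent pairs; PMI is chosen here purely for illustration and is only one of several measures in use. The function name `bigram_pmi` and the token sequence are invented.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information for each adjacent word pair.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts in the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented token sequence, for illustration only.
tokens = "of course he set fire to it and of course she set it on fire".split()
scores = bigram_pmi(tokens)

# Pairs that recur together score higher than pairs of equally common words
# that co-occur only once.
print(scores[("of", "course")] > scores[("set", "fire")])
```

On very small samples PMI rewards pairs of rare words, so scores like these are usually combined with a minimum frequency threshold.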

Semantic prosody

* for example:

- only bad things set in
- causing work is bad, but providing work is good

* stylistics often assumes a norm; corpus studies help define that norm

Designing and Building a Corpus

* Statistical properties of language
* Composition and typology
* Encoding and annotation
* Transcription

Word Frequencies

Statistical properties

* The top 50 words in any corpus will be much the same
* The top 50 words will account for roughly 40 to 50% of the running words (tokens) in the corpus
* The rate at which new hapax legomena appear is (apparently) constant
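
Claims of this kind can be checked against any tokenized corpus. The following is a minimal sketch, assuming the corpus is already a list of word tokens; the helper names `top_k_coverage` and `hapax_legomena` are hypothetical and the token list is an invented toy example, far too small to show the real proportions.

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all running words accounted for by the k most frequent types."""
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(tokens)

def hapax_legomena(tokens):
    """Word types that occur exactly once, in alphabetical order."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c == 1)

# Invented toy token list, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the cat".split()

print(top_k_coverage(tokens, 1))   # share of tokens taken by the single commonest type
print(hapax_legomena(tokens))      # types seen only once
```

Running `hapax_legomena` on successively larger slices of a corpus gives a rough picture of how quickly new once-only words keep arriving.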

Statistical consequences

* The mean frequency is very different from the median
* Small samples will be seriously skewed
* Statistics assuming a normal distribution should be regarded with deep suspicion
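
The gap between mean and median is easy to see even on a toy frequency profile. This sketch assumes an invented token list shaped to mimic the skew of real word frequencies (one very common word, many rare ones); `type_frequency_summary` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean, median

def type_frequency_summary(tokens):
    """Summarise how often each word type occurs."""
    freqs = list(Counter(tokens).values())
    return {"types": len(freqs), "mean": mean(freqs), "median": median(freqs)}

# Invented, deliberately skewed token list, for illustration only.
tokens = "the the the the the the the the of of of a cat sat on mat".split()

summary = type_frequency_summary(tokens)
print(summary)
```

Here a single frequent type pulls the mean well above the median, which is why averages alone say little about a typical word.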

Composition and typology

* representativeness

- production or reception?
- sampling frame?

* typologies

- internal or external criteria
- field, tenor, mode

* the BNC compromise

Encoding and annotation

* part-of-speech classification
* lemmatization
* word senses
* syntactic roles
* pragmatic annotation
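
As a rough picture of what the first two layers add, the sketch below stores each token with a part-of-speech tag and a lemma. The `Token` class and helper functions are hypothetical, the sentence is invented, and the tags are hand-assigned in the style of the CLAWS C5 tagset; treat them as illustrative rather than as real corpus output.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (hand-assigned, illustrative)
    lemma: str  # dictionary headword the form belongs to

# Invented, hand-annotated sentence, for illustration only.
SENTENCE = [
    Token("They", "PNP", "they"),
    Token("started", "VVD", "start"),
    Token("running", "VVG", "run"),
    Token("and", "CJC", "and"),
    Token("she", "PNP", "she"),
    Token("starts", "VVZ", "start"),
    Token("to", "TO0", "to"),
    Token("run", "VVI", "run"),
]

def lemma_frequencies(tokens):
    """Count occurrences per lemma, so inflected forms are pooled."""
    return Counter(t.lemma for t in tokens)

def forms_of(tokens, lemma):
    """List the surface forms attested for one lemma, in text order."""
    return [t.form for t in tokens if t.lemma == lemma]

print(lemma_frequencies(SENTENCE)["start"])
print(forms_of(SENTENCE, "run"))
```

With lemmas in place, a query for start can retrieve started and starts together, and the tags make it possible to separate patterns such as start to do from start doing.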

Transcribing speech

an acute form of a general problem
* all encoding is interpretative
* orthographic transcription
* prosodic features
* paralinguistic features
* phonemic or phonetic data

The future

* blurring of traditional linguistic categories
* multimedia corpora
* language pedagogy
* political linguistics
* corpus methods in literary studies