Computers and Corpora
* The what and why of corpora
* English corpora: the story so far
* What do corpora tell us about language?
* Designing and building a corpus: the British National Corpus
What is a corpus?
* How do we find out what words mean?
- algorithm
- authority
- usage
* Corpus linguistics re-centres the last of these
What is a corpus?
1. The body of a man or animal. (Cf. corpse.) Formerly
frequent; now only humorous or grotesque. ... 2. Phys. A structure of a special
character or function in the animal body, as corpus callosum, ... 3. A body
or complete collection of writings or the like; the whole body of literature
on any subject. ... 4. The body of written or spoken material upon which
a linguistic analysis is based. ... 5. The body or material substance of
anything; principal, as opposed to interest or income.
What is a corpus?
“a collection of pieces of language, selected and ordered
according to explicit linguistic criteria in order to be used as a sample
of the language” (Sinclair, 1994)
* linguistically-motivated selection
* representative intention
* distinct from collection or archive
Why make corpora?
* to get a more comprehensive view of language in use
* to improve on individual intuition
* total accountability vs individual salience
What can a corpus show us?
* lexis, and patterns of lexis
- is starting always replaceable by beginning?
- is it only time that is immemorial?
* syntax, and patterns of syntax
- start to do vs. start doing
- I hope that vs. I hope to
What can a corpus show us?
* idiolects and speech communities
- male and female, old and young
- preferred and deprecated usages
* discourse and rhetoric
- code switching and topic change
- turn taking and negotiation
Some major English corpora
* The Brown School
- Brown (1964): American English
- Lancaster-Oslo-Bergen (1980): British
- Kolhapur (1988): Indian
- Macquarie (1988): Australian
- Wellington (1993): New Zealand
- International Corpus of English (1992)
* 1 million words each
* Specific text types sampled
Some major English corpora
* The Birmingham School
- Cobuild and the Bank of English
* Up to 300 million words
* Designed to monitor usage, initially for lexicographic use
Some major English corpora
* The British National Corpus
* 100 million words
* Designed to sample a wide variety of text types, both spoken and written
* Richly encoded with POS and structural information
Other varieties of corpus
* spoken language
- London-Lund (1990)
- Trains, PhoneBook (1995-6)
* children's and learners' language
- CHILDES (1990)
* genre and topic specific
- Map Task (1991)
Other varieties of corpus
* Diachronic corpora
- Helsinki corpus (1993)
* Multilingual corpora
- ECI and PAROLE
- Parallel corpora
Using corpora
* natural language processing
* language teaching
* contrastive studies
* describing language
Natural Language Processing
* corpora provide
- statistical evidence
- realistic testbeds
* for such NLP tools as
- spell checkers
- automatic indexers
- human interfaces
- cross-linguistic retrieval systems
Language Teaching
Exploratory use of a corpus can
* complement dictionaries and grammars
- a stone's throw from here
- what's a whammy?
* reduce dependence on the individual teacher
* extend cultural knowledge
Contrastive studies
* American and British English
* Translation corpora
* Comparable corpora
- FLOB and LOB (1993)
- Archer
* Variation in speech and writing
Describing language
* tools
* theoretical axioms
* the idiom principle
* collocation and colligation
* semantic prosody
Corpus analysis tools
* concordances
* frequency lists
* collocation tables
* software
- Word Cruncher, Wordsmith
- SARA, OpenText, Corpus Bench
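To make the first two tool types concrete, here is a minimal Python sketch of a frequency list and a key-word-in-context (KWIC) concordance. It does not reproduce how any of the packages named above work; the function names, the crude tokenizer, and the two-sentence sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens, n=10):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

def concordance(tokens, node, width=3):
    """Return (left context, node, right context) for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, standing in for a real corpus file.
sample = ("The rain set in early. A stone's throw from here the rot set in, "
          "and then the rain began again.")
tokens = tokenize(sample)

print(frequency_list(tokens, 5))
for left, node, right in concordance(tokens, "set"):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20} [{node}] {right}")
```

Aligning the node word in a central column is what lets recurrent patterns to its left and right stand out when the lines are sorted.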
A corpus linguist's manifesto
* Linguistics is essentially a social science, and an
applied science
* Language should be studied in actual, attested, authentic instances of
use, not as intuitive, invented, isolated sentences
* The unit of study must be whole texts
* Text and text types must be studied comparatively across text corpora
* Linguistics is concerned with the study of meaning: form and meaning
are inseparable.
* There is no boundary between lexis and grammar: lexis and grammar are
interdependent
* Much language use is routine
* Language in use transmits the culture
* Saussurian dualisms are misconceived
The Idiom Principle
* words are co-selected, not individually chosen
* hence the importance of collocation
* cf. the open-choice principle
Collocation
* lexicalization may be arbitrary
- of course vs maybe
* 'fixed' phrases vary
- set on fire vs set fire to
* high frequency words have distinct patterns of usage rather than
distinct senses
- back
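A collocation table of the kind such observations rest on can be sketched in a few lines of Python. The example counts the words found within a small window around a node word and scores one pair with pointwise mutual information (PMI). PMI is only one of several association measures in common use, and the window size, function names, and toy sentence here are assumptions for illustration.

```python
import math
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def pmi(tokens, node, collocate, span=2):
    """Window-based pointwise mutual information (log base 2)."""
    n = len(tokens)
    freq = Counter(tokens)
    joint = collocates(tokens, node, span)[collocate]
    if joint == 0:
        return float("-inf")
    # Co-occurrences expected by chance, given 2 * span window slots per node.
    expected = freq[node] * freq[collocate] * 2 * span / n
    return math.log2(joint / expected)

# Invented toy text built around the "set fire to" / "set on fire" pair.
tokens = ("they set fire to the barn and set the house on fire "
          "then set fire to the shed").split()

table = collocates(tokens, "set")
print(table.most_common(3))
print(round(pmi(tokens, "set", "fire"), 3))
```

On a text this small the scores mean little; the point is only the shape of the calculation, which compares observed co-occurrence with what the individual word frequencies would predict.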
Semantic prosody
* for example:
- only bad things set in
- causing work is bad, but providing work is good
* stylistics often assumes a norm; corpus studies help define that
norm
Designing and Building a Corpus
* Statistical properties of language
* Composition and typology
* Encoding and annotation
* Transcription
Word Frequencies
Statistical properties
* The top 50 words in any corpus will be much the same
* The top 50 words will account for a large share of the corpus size (typically around 40 to 50% of running words in English)
* The rate at which new hapax legomena appear is (apparently) constant
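Claims of this kind are easy to check against one's own data. The sketch below, in Python, computes the share of running words covered by the n most frequent word types and tracks the number of hapax legomena as the text grows. The function names and the toy sentence are assumptions; a real test would need a corpus of at least many thousands of words.

```python
from collections import Counter

def top_n_coverage(tokens, n=50):
    """Fraction of all running words accounted for by the n most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

def hapax_growth(tokens, step):
    """Number of hapax legomena (types seen exactly once) after every `step` tokens."""
    counts = Counter()
    hapaxes = 0
    growth = []
    for i, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1      # a new type is a hapax until it recurs
        elif counts[tok] == 2:
            hapaxes -= 1      # second sighting: no longer a hapax
        if i % step == 0:
            growth.append(hapaxes)
    return growth

# Invented toy text; far too small to show the real distribution.
tokens = "the cat sat on the mat and the dog sat on the log".split()

print(top_n_coverage(tokens, n=3))
print(hapax_growth(tokens, step=4))
```

Plotting the output of `hapax_growth` against corpus size for a large text is one way to see whether new hapaxes really do keep arriving at a steady rate.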
Statistical consequences
* The mean frequency is very different from the median
* Small samples will be seriously skewed
* Statistics assuming a normal distribution should be regarded with deep
suspicion
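The gap between mean and median can be seen directly by summarising the frequencies of the word types in a text. This is a minimal Python sketch using only the standard library; the function name and the toy sentence are assumptions for illustration, and the skew is far more extreme in a real corpus.

```python
from collections import Counter
from statistics import mean, median

def frequency_summary(tokens):
    """Summarise how often each word type occurs in the token list."""
    counts = list(Counter(tokens).values())
    return {
        "types": len(counts),
        "mean": mean(counts),
        "median": median(counts),
        "max": max(counts),
    }

# Invented toy text: a few frequent words and many that occur once.
tokens = "the cat sat on the mat and the dog sat on the log".split()

summary = frequency_summary(tokens)
print(summary)
```

Because a handful of very frequent types pull the mean upwards while most types occur once or twice, the median stays low, which is why tests that assume a symmetric, bell-shaped distribution fit word frequencies poorly.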
Composition and typology
* representativeness
- production or reception?
- sampling frame?
* typologies
- internal or external criteria
- field, tenor, mode
* the BNC compromise
Encoding and annotation
* part-of-speech classification
* lemmatization
* word senses
* syntactic roles
* pragmatic annotation
Transcribing speech
an acute form of a general problem
* all encoding is interpretative
* orthographic transcription
* prosodic features
* paralinguistic features
* phonemic or phonetic data
The future
* blurring of traditional linguistic categories
* multimedia corpora
* language pedagogy
* political linguistics
* corpus methods in literary studies