Developing Linguistic Corpora:
a Guide to Good Practice

AHDS: Arts and Humanities Data Service. http://www.ahds.ac.uk/.

AHRC: Arts and Humanities Research Council. http://www.ahrc.ac.uk/.

Allen, J., and Core, M. 1997. Draft of DAMSL: Dialog Act Markup in Several Layers. http://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/.

Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM). http://www.comp.leeds.ac.uk/amalgam/amalgam/amalghome.htm.

BAAL: British Association for Applied Linguistics. BAAL Recommendations on Good Practice in Applied Linguistics. http://www.baal.org.uk/goodprac.htm.

Baker, J. P. 1997. Consistency and accuracy in correcting automatically tagged data. In Corpus annotation: Linguistic information from computer text corpora, eds. Roger Garside, G. Leech and A. McEnery, 243-250. London: Longman.

Baker, P., Hardie, A., McEnery, A., Xiao, R., Bontcheva, K., Cunningham, H., Gaizauskas, R., Hamza, O., Maynard, D., Tablan, V., Ursu, C., Jayaram, B., and Leisher, M. 2004. Corpus linguistics and South Asian languages: Corpus creation and tool development. Literary and Linguistic Computing 19:509-524.

Biber, D., Johansson, S., Leech, G., Conrad, S., and Finegan, E. 1999. Longman grammar of spoken and written English. Harlow: Pearson Education.

Burnard, L. 1995. Users' reference guide to the British National Corpus. Oxford: Oxford University Computing Services.

Burnard, L. 1999. Using SGML for linguistic analysis: the case of the BNC. In Markup languages theory and practice, 31-51. Cambridge, Mass: MIT Press.

Burnard, L., and Dodd, T. 2003. Xara: an XML aware tool for corpus searching. http://www.oucs.ox.ac.uk/rts/xaira/Talks/cl2003.html.

Carletta, J. 1996. Assessing agreement on classification tasks: the Kappa statistic. Computational Linguistics 22.

Carletta, J., McKelvie, D., and Isard, A. 2002. Supporting linguistic annotation using XML and stylesheets. In Corpus linguistics: readings in a widening discipline, eds. G. Sampson and D. McCarthy. London & New York: Continuum Interpretations.

CLAWS part-of-speech tagger for English. UCREL. http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/.

Clear, J. 1992. Corpus sampling. In New directions in English language corpora, ed. G Leitner, 21-31. Berlin: Mouton de Gruyter.

COLT: The Bergen Corpus of London Teenage Language. Department of English, University of Bergen. http://torvald.aksis.uib.no/colt/.

Cook, G. 1995. Theoretical issues: transcribing the untranscribable. In Spoken English on Computer, eds. G. Leech, G. Myers and J. Thomas, 35-53. Harlow: Longman.

Dunlop, D. 1995. Practical considerations in the use of TEI headers in large corpora. In Text encoding initiative: background and context, eds. Nancy Ide and Jean Veronis, 242. Dordrecht; London: Kluwer Academic.

Edwards, J. 1993. Principles and contrasting systems of discourse transcription. In Talking Data: Transcription and coding in discourse research, eds. J. Edwards and M. Lampert, 3-32. Hillsdale, NJ: Lawrence Erlbaum Associates.

Edwards, J., and Lampert, M. 1993. Talking Data: Transcription and Coding in Discourse Research. Hillsdale, NJ: Lawrence Erlbaum Associates.

Garside, R., Leech, G. N., and McEnery, T. 1997. Corpus annotation: linguistic information from computer text corpora. London: Longman.

GATE - general architecture for text engineering. http://gate.ac.uk.

Gibaldi, J. 1998. MLA Style manual and Guide to Scholarly Publishing. New York: Modern Language Association.

Gibbon, D., Moore, R., and Winski, R. 1998. Handbook of standards and resources for spoken language systems. Vol. 1: spoken language systems and corpus design. Berlin: Mouton de Gruyter.

Gillam, R. 2003. Unicode demystified. Boston: Addison-Wesley.

Goundry, N. 2001. Why Unicode won't work on the Internet: Linguistic, political, and technical limitations. http://www.hastingsresearch.com/net/04-unicode-limitations.shtml.

Granger, S. 1998. Learner English on computer. London: Longman.

Granger, S., Hung, J., and Petch-Tyson, S. eds. 2002. Computer learner corpora, second language acquisition, and foreign language teaching. Amsterdam: John Benjamins.

Grice, M., Leech, G., Weisser, M., and Wilson, A. 2000. Representation and annotation of dialogue. In Handbook of multimodal and spoken dialogue systems: Resources, terminology and product evaluation, eds. D. Gibbon, I. Mertins and R. K. Moore, 1-101. Boston: Kluwer.

Halliday, M. 1993. Quantitative studies and probabilities in grammar. In Data, description, discourse, ed. Michael Hoey, 1-25. London: Harper Collins.

Halteren, H. v. ed. 1999. Syntactic wordclass tagging. Text, speech, and language technology; 9. Dordrecht; Boston: Kluwer Academic Publishers.

Hirst, D. 1991. Intonation models: towards a third generation. In Actes du XIIeme Congres International des Sciences phonetiques. 19-24 aout 1991. Aix-en-Provence, France, 305-310. Aix-en-Provence: Universite de Provence, Service des Publications.

Hofland, K., and Johansson, S. 1982. Word frequencies in British and American English. London: Longman.

Hofland, K. c. 1999. ICAME CD-ROM. HIT Centre, University of Bergen. http://www.hit.uib.no/icame/cd.

Ide, N. 1996. Corpus encoding standard. Version 1.5. Expert Advisory Group on Language Engineering Standards (EAGLES). http://www.cs.vassar.edu/CES/.

James, G., Davison, R., Cheung, A., and Deerwester, S. 1994. English in computer science: a corpus-based lexical analysis. Hong Kong: Hong Kong University of Science and Technology and Longman Asia.

Johansson, S., Atwell, E., Garside, R., and Leech, G. 1986. The tagged LOB corpus: Users' manual. Norwegian Computing Centre for the Humanities. http://khnt.hit.uib.no/icame/manuals/lobman/INDEX.HTM.

Johansson, S. 1995. The approach of the Text Encoding Initiative to the encoding of spoken discourse. In Spoken English on Computer, eds. G. Leech, G. Myers and J. Thomas, 82-98. Harlow: Longman.

Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. 1995. Constraint grammar: a language-independent system for parsing unrestricted text. Berlin & New York: Mouton de Gruyter.

Kipp, M. Anvil. http://www.dfki.uni-sb.de/~kipp/anvil/.

Knowles, G., Williams, B., and Taylor, L. 1996. A corpus of formal British English speech: the Lancaster/IBM Spoken English Corpus. London: Longman.

Korpela, J. 2001. A tutorial on character code issues. http://www.cs.tut.fi/~jkorpela/chars.html.

Lamport, L. 1986. LaTeX: a document preparation system. Reading, Mass.: Addison-Wesley.

Leech, G., and Wilson, A. 1994. EAGLES morphosyntactic annotation. EAGLES report EAGSCSG/IR-T3.1. Pisa: Istituto di Linguistica Computazionale.

Leech, G., Barnett, R., and Kahrel, P. 1995a. Guidelines for the standardization of syntactic annotation of corpora. In EAGLES Document EAG-TCWG-SASG/1.8.

Leech, G., Myers, G., and Thomas, J. eds. 1995b. Spoken English on computer. Harlow: Longman.

Leech, G., and Wilson, A. 1999. Standards for Tagsets. In Syntactic Wordclass Tagging, ed. Hans van Halteren, 55–80. Dordrecht: Kluwer Academic.

Leech, G., and Weisser, M. 2003. Generic Speech Act Annotation for Task-Oriented Dialogue. In Proceedings of the Corpus Linguistics 2003 Conference, eds. D. Archer, P. Rayson, A. Wilson and A. McEnery. Lancaster: UCREL Technical Papers.

Lickley, R. HCRC Disfluency coding manual. http://www.ling.ed.ac.uk/~robin/maptask/disfluency-coding.html.

Marcus, M., Santorini, B., and Marcinkiewicz, M. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19:313-330.

Mengel, A., Dybkjaer, L., Garrido, J. M., Heid, U., Klein, M., Pirrelli, V., Poesio, M., Quazza, S., Schiffrin, A., and Soria, C. 2000. MATE Deliverable D 2.1. MATE Dialogue Annotation Guidelines. http://www.andreasmengel.de/pubs/mdag.pdf.

Meyer, C. 2002. English Corpus Linguistics. Cambridge: Cambridge University Press.

MICASE: Michigan Corpus of Academic Spoken English. http://www.hti.umich.edu/m/micase/.

Morton, A. 1986. Once. A test of authorship based on words which are not repeated in the sample. Literary and Linguistic Computing 1:1-8.

Pickering, B., Williams, B., and Knowles, G. 1996. Analysis of transcriber differences in the SEC. In Working with Speech, eds. G. Knowles, A. Wichmann and P. Alderson. London: Longman.

Perez-Parent, M. 2002. Collection, handling, and analysis of classroom recordings data: using the original acoustic signal as the primary source of evidence. Reading Working Papers in Linguistics 6:245-254. http://www.rdg.ac.uk/app_ling/wp6/perezparent.pdf.

Pierrehumbert, J. 1980. The phonology and phonetics of English intonation. MIT.

Roach, P., and Arnfield, S. 1995. Linking prosodic transcription to the time dimension. In Spoken English on Computer, eds. G. Leech, G. Myers and J. Thomas, 149-160. Harlow: Longman.

Roe, P. 1977. The notion of difficulty in scientific text. University of Birmingham.

Sampson, G. 1995. English for the computer: the SUSANNE corpus and analytic scheme. Oxford: Clarendon Press.

Scott, M. WordSmith Tools. http://www.lexically.net/wordsmith/.

Searle, S. J. Unicode revisited. http://tronweb.super-nova.co.jp/unicoderevisited.html.

Searle, S. J. 1999. A brief history of character codes in North America, Europe, and East Asia. http://tronweb.super-nova.co.jp/characcodehist.html.

Semino, E., and Short, M. 2003. Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Narratives. London: Routledge.

Short, M., Semino, E., and Culpeper, J. 1996. Using a corpus for stylistics research: speech and thought presentation. In Using corpora for language research, eds. J. Thomas and M. Short, 110-131. London: Longman.

Sinclair, J. 1982. Reflections on computer corpora in English language research. In Computer corpora in English language research, ed. Stig Johansson, 1-6. Bergen.

Sinclair, J. 1989. Corpus creation. In Language, learning and community, eds. C Candlin and T McNamara, 25-33: NCELTR Macquarie University.

Sinclair, J. ed. 1990. Collins Cobuild English grammar. London: Collins.

Sinclair, J. 1991. Corpus, concordance, collocation: Describing English language. Oxford: Oxford University Press.

Sinclair, J. 1995. From theory to practice. In Spoken English on Computer, eds. G. Leech, G. Myers and J. Thomas, 99-112. Harlow: Longman.

Sinclair, J. 2001. Preface. In Small corpus studies and ELT, eds. Mohsen Ghadessy, Alex Henry and Robert L. Roseberry, vii-xv. Amsterdam/Philadelphia: John Benjamins.

Sinclair, J. 2003. Corpora for lexicography. In A practical guide to lexicography, ed. P. van Sterkenburg. Amsterdam: John Benjamins.

Sinclair, J. 2004. Intuition and annotation - the discussion continues. In Advances in corpus linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23), Göteborg 22-26 May 2002, eds. Karin Aijmer and Bengt Altenberg, 39-59. Amsterdam/New York: Rodopi. http://www.ingentaconnect.com/content/rodopi/lang/2004/00000049/00000001/art00003.

Smith, A. 2004. Preservation. In A companion to Digital Humanities, eds. S. Schreibman, R. Siemens and J. Unsworth, 576-591. Oxford: Blackwell.

Sperberg-McQueen, C. M., and Burnard, L. 1994. Guidelines for electronic text encoding and interchange (TEI P3). Chicago & Oxford: ACH-ALLC-ACL Text Encoding Initiative.

Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor, P., Martin, R., Van Ess-Dykema, C., and Meteer, M. 2000. Dialogue act modelling for automatic tagging and recognition of conversational speech. Computational Linguistics 26:339-373.

Tapanainen, P., and Voutilainen, A. 1994. Tagging accurately - don't guess if you know. In Proceedings of ANLP '94, 47-52. Stuttgart.

Thompson, H., Anderson, A., and Bader, M. 1995. Publishing a spoken and written corpus on CD-ROM: the HCRC Map Task experience. In Spoken English on Computer, eds. G. Leech, G. Myers and J. Thomas, 168-182. Harlow: Longman.

Tognini-Bonelli, E. 2001. Corpus linguistics at work: Studies in corpus linguistics, v. 6. Amsterdam: John Benjamins.

UCREL: University Centre for Computer Corpus Research on Language. http://www.comp.lancs.ac.uk/ucrel/.

Unicode Consortium. 2003. The Unicode standard, Version 4.0. London: Addison-Wesley. http://www.unicode.org/versions/Unicode4.0.0/.

van den Heuvel, H., Boves, L., and Sanders, E. 2000. Validation of content and quality of existing SLR: overview and methodology. http://www.spex.nl/validationcentre/d11v21.doc.

Voutilainen, A., and Järvinen, T. 1995. Specifying a shallow grammatical representation for parsing purposes. In Proceedings from the 7th Conference of the European Chapter of the Association for Computational Linguistics, 210-214: Association for Computational Linguistics.

Wells, J. C., Barry, W., Grice, M., Fourcin, A., and Gibbon, D. 1992. Standard computer-compatible transcription. Esprit project 2589 (SAM). In Doc. no. SAM-UCL-037. London: Phonetics and Linguistics Department, UCL.

Whistler, K. Why Unicode will work on the Internet. http://slashdot.org/features/01/06/06/0132203.shtml.

Working Group on Romanization Systems. United Nations Group of Experts on Geographical Names (UNGEGN). http://www.eki.ee/wgrs/.

Zipf, G. K. 1935. The psychobiology of language. New York: Houghton Mifflin.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.