
Developing Linguistic Corpora:
a Guide to Good Practice

Martin Wynne

A linguistic corpus is a collection of texts which have been selected and brought together so that language can be studied on the computer. Today, corpus linguistics offers some of the most powerful new procedures for the analysis of language, and the impact of this dynamic and expanding sub-discipline is making itself felt in many areas of language study.

In this volume, a selection of leading experts in various key areas of corpus construction offer advice in a readable and largely non-technical style to help the reader to ensure that their corpus is well designed and fit for the intended purpose.

This Guide is aimed at those who are at some stage of building a linguistic corpus. Little or no knowledge of corpus linguistics or computational procedures is assumed, although it is hoped that more advanced users will also find the guidelines here useful. It also has relevance for those who are not building a corpus, but who need to know something about the issues involved in the design of corpora in order to choose between available resources and to help draw conclusions from their analysis.

Increasing numbers of researchers are seeing the potential benefits of using an electronic corpus as a source of empirical language data for their research. Until now, where could they find out how to build a corpus? There is a great deal of useful information available which covers principles of corpus design and development, but it is dispersed in handbooks, reports, monographs, journal articles and sometimes only in the heads of experienced practitioners. This Guide is an attempt to draw together the experience of corpus builders into a single source, as a starting point for obtaining advice and guidance on good practice in this field. It aims to bring together some key elements of the experience gained, over many decades, by leading practitioners in the field and to make it available to those developing corpora today.

The modest aim of this Guide is to take readers through the basic first steps involved in creating a corpus of language data in electronic form for the purpose of linguistic research. While some technical issues are covered, this Guide does not aim to offer the latest information on digitisation techniques. Rather, the emphasis is on the principles, and readers are invited to refer to other sources, such as the AHDS information papers, for up-to-date advice on technologies. In addition to the first chapter on the principles of corpus design, Professor Sinclair has also provided a more practical guide to building a corpus, which is added as an appendix to the Guide. This should help the user through some of the more specific decisions that are likely to be involved in building a corpus.

Alert readers will see that there are areas where the authors are not in accord with each other. It is for the reader to weigh up the advantages of each approach for his own particular project, and to decide which course to follow. This Guide does not aim to synthesize the advice offered by the various practitioners into a single approach to creating corpora. The information on good practice which is sampled here comes from a variety of sources, reflecting different research goals, intellectual traditions and theoretical orientations. The individual authors were asked to state their opinion on what they think is the best way to deal with the relevant aspects of developing a corpus, and neither the authors nor the editor have tried to hide the differences in approaches which inevitably exist. It is anticipated that readers of this document will have differing backgrounds, will have very diverse aims and objectives, will be dealing with a variety of different languages and varieties, and that one single approach would not fit them all.

I would like to thank the authors of this volume for their goodwill and their support of this venture, and for their patience through the long period it has taken to bring the Guide to publication. I would also like to acknowledge the extremely helpful advice and editorial work of my colleague Ylva Berglund, which has improved many aspects of this Guide.

© Martin Wynne 2004. The right of Martin Wynne to be identified as the Author of this Work has been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.