Developing Linguistic Corpora: a Guide to Good Practice

Archiving, distribution and preservation

Martin Wynne, University of Oxford
© Martin Wynne 2004

1. Introduction

Once you have created your corpus, what happens next? This chapter attempts to explain how good planning can ensure that, for as long as possible into the future, a corpus is useful and usable for a wide range of potential users.

Usually the creation of the corpus was not an end in itself, but was conceived as part of a research project, and it is only when the corpus building has finished that the real work begins. But the corpus is likely to be of potential value to many more researchers outside of the corpus creator's research group, so it is also advisable to plan to make sure that other users can make use of it too. Ensuring the initial and ongoing availability and usefulness of the corpus is the subject of this chapter.

It is not recommended that you start to address these issues only at the end of the corpus building project. A successful project to create a digital resource will usually have planned for the entire life-cycle of the resource, including what happens after the resource is created.

At the planning stage, it is important to ask whether, under the project plan, the corpus is likely still to be available and usable in one, or ten, or twenty years' time. Potential risks to its future viability include termination of funding, changes in staff or management, changes in technical infrastructure, obsolescence of the technologies associated with the resource and changes in standards. It is possible to be specific and say that it is certain that, at some point, the project funding will end, some of the staff will leave, the computers will be replaced, the servers will be upgraded, the software used to access the corpus will change, the interests and priorities of the staff involved will change and they will eventually get different jobs or retire.

To ensure ongoing availability and usability of the resource, it is desirable to remove reliance on particular individuals, institutional arrangements or technologies. This can only really be effectively managed in the context of an archive which is a trusted repository and which has a long-term access and preservation strategy for its collections.

The following sections attempt to cover some of the important issues which it is useful to consider at the planning stage of the corpus building project.

2. Planning for the future

Stop developing the corpus!

The first thing to say here may appear obvious, but it is sometimes necessary to remind corpus builders to stop developing the corpus. While it is important to achieve as low a rate of errors as possible, there is a danger of excessive perfectionism, which can lead to a situation in which the corpus is never finished, preventing its use and reuse. There may be similar problems if a corpus is made available, but then repeatedly revised, preventing the comparison or replication of results based on its analysis.

It is of course possible to conceive of a corpus which changes in a principled and useful way. For example, a monitor corpus is repeatedly updated with new texts and is constructed in such a way that language change over time can be analysed. For a dynamic resource of this type to be useful, it needs to develop in a managed, predictable and well-documented fashion, and in a way which is transparent to the users.

The corpus creator may plan to add annotations to the text. It is also likely that a well-constructed resource which is made available will have annotation added to it by other researchers. It is good practice to release a version of the corpus without annotation, however, for several reasons. Firstly, there are likely to be many users who do not wish to use the annotation, or indeed who use tools which find it difficult to process a corpus with certain types of annotations. Secondly, the annotation process may involve changing the text in some ways, such as changing the word tokenisation, or removing certain elements. The latter can happen deliberately, or accidentally, and may not be easy to detect. It is therefore important that an original version of the corpus be available for reference purposes.

Delays in finishing a corpus can be caused by checking and correcting errors in the text and markup. If it is possible to have a clear idea from the start of the realistic level of quality which is required, and an accurate means of measuring it, then it is much easier to know when the acceptable level has been reached. Do bear in mind that this may have to be 'good enough' rather than 'perfect'. While it is tempting to attempt to create a corpus which is perfect and a thing of beauty, the important thing is for the corpus to be 'fit for purpose'. It is also worth bearing in mind that most of the techniques of corpus analysis require the identification of repeated patterns. While errors will skew results and may, if serious, hide certain important patterns, you may also be able to rely on a tendency for repeated patterns to shine through despite a certain error rate. In any case, the extent of quality control checks should be documented.
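
One way to make the measurement of quality concrete is to proofread a random sample of tokens and estimate the error rate of the whole corpus from it. The following Python sketch is illustrative only: the function names, the use of a Wilson score interval and the roughly 95% confidence level are assumptions made for the example, not recommendations taken from this guide.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Return an approximate confidence interval for the true error rate.

    errors      -- number of erroneous tokens found in the checked sample
    sample_size -- number of tokens checked
    z           -- normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")
    p = errors / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size * sample_size)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def meets_target(errors: int, sample_size: int, target_rate: float) -> bool:
    """True if even the upper bound of the interval is within the target rate."""
    _, upper = wilson_interval(errors, sample_size)
    return upper <= target_rate
```

For example, if 5 errors are found in a checked sample of 1,000 tokens, meets_target(5, 1000, 0.02) reports whether the corpus can be said, with reasonable confidence, to have an error rate of no more than 2%. The target itself is a project decision and should be recorded in the documentation.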

It will also be easier to stop if your project plan is scalable. If your workplan requires everything to be dependent on a final processing stage which can only take place if all previous stages are completed 100% successfully, then there is a high risk of failure. At best, the corpus building process may drag on for a long time beyond the projected end date, with all the problems associated with carrying on without the necessary funding and support. If on the other hand, the project has been designed with a more robust and scalable plan, then there is a much greater chance of successful completion of the project. Such a plan might involve complete production of sub-sections at various stages, with a design that will still work if less than 100% of the texts are successfully collected and processed.

What are my rights and responsibilities?

Corpora are usually made of texts written by different people, and the authors or owners of these texts have intellectual property rights. In addition, the fact that intellectual work has gone into the sampling, selection, markup and annotation of texts means that corpus creators have rights over the corpus as a collection. The project to create the corpus will probably have a funder, and the work will usually be done within an academic institution which may claim ownership over the products of research. Several people will have been involved. The rights of these stakeholders can potentially restrict the use, reuse, sharing and long-term preservation of the corpus.

The relevant laws in the UK forbid the copying of published materials without the permission of the rights holder. The fact that a text is available freely on the web does not mean that it is 'in the public domain' or that you can put it in your corpus. On the contrary, publication on the web confers the right of ownership on the creator, and makes copying illegal, even if this is only for your private use. In practice such rights and prohibitions need to be tested in court, and it is usually the case that the corpus developer has to assess the probability of being sued rather than being able to obtain a clear statement of the legal position regarding the use of a text in a corpus. It may be that the increased visibility of a widely distributed corpus increases the likelihood of legal action in defence of copyright. In any case, it is advisable for these issues to be explored and clarified at the planning stage of the project, to ensure that you do not spend time constructing a corpus which cannot then be used legally.

Any agreements which were entered into with funders, copyright holders, publishers, data developers, archives, research assistants and other stakeholders need to be considered and documented. As an example of a responsibility to a funder, if your corpus development project is funded by the Arts and Humanities Research Council (AHRC) in the UK, you will normally be expected to deposit the completed resource with the Arts and Humanities Data Service (AHDS). Measures need to be taken to make sure that the documentation of these issues will continue to be available, preferably in an electronic form which is associated with the corpus. Ethical considerations may be relevant, especially if your corpus is the product of linguistic field-work. It may be useful to conduct a stakeholder analysis, an established business management technique. This analysis would attempt to consider the points of view of the various parties who have an interest in the corpus. It can be useful to highlight potential conflicts, in legal and ethical questions, and may help the development of a plan to ensure that the necessary steps are taken.

It is also useful to document any ways in which the rights associated with any of the materials are going to change. Are some texts likely to come out of copyright soon? If so, which ones and when? Are your rights in some materials likely to expire? For example, have you made use of journal texts or images which you only have rights to for a fixed period of time? These issues need to be discussed with an archivist, and any relevant information included in the metadata. It is likely that future changes in the legal status of the corpus texts can only be dealt with effectively by an archive with the relevant procedures in place.
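
Questions of this kind are easier to answer later if the relevant dates are recorded in a structured form alongside the corpus. The following is a minimal Python sketch; the record fields, identifiers and function name are hypothetical, and an archive will normally have its own metadata schema which should be used in preference.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RightsRecord:
    """Hypothetical rights metadata for one text in the corpus."""
    text_id: str
    rights_holder: str
    licence_expires: Optional[date] = None    # None: no fixed term recorded
    copyright_expires: Optional[date] = None  # None: not known


def changes_before(records: List[RightsRecord], cutoff: date) -> List[str]:
    """Describe every recorded change in rights status on or before the cutoff date."""
    notes = []
    for rec in records:
        if rec.licence_expires and rec.licence_expires <= cutoff:
            notes.append(
                f"{rec.text_id}: licence from {rec.rights_holder} "
                f"expires {rec.licence_expires.isoformat()}"
            )
        if rec.copyright_expires and rec.copyright_expires <= cutoff:
            notes.append(
                f"{rec.text_id}: copyright expires {rec.copyright_expires.isoformat()}"
            )
    return notes
```

A list of such records can be queried for any future date to see which texts will need attention by then.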

How is the corpus stored?

First, it is necessary to have some backup procedures during the data collection and data development stage of your corpus building project. While your own ad hoc procedures can be useful for providing extra copies and having them easily to hand, it may be best to make use of professional backup facilities such as those which should be offered by the computing service at your institution.

Once the corpus is completed, then it is necessary to archive it. It is perhaps useful to explain here the distinction which is usually made by information professionals between backup and archiving. Backup means taking a periodic copy of a file store. Archiving means the transfer of information of public value into a separate repository where it is to be held indefinitely, or for an agreed period of time. It is likely that you will need backup solutions during the lifetime of your project, and you will need to find an archiving solution when the resource is completed. It is however useful to plan the archiving from the start, so it is a good idea to talk to the archivists and make sure that the resource can be provided in an appropriate format, and also so that you can include the time and effort necessary for depositing the corpus in the archive in the project workplan.

In terms of the technical solutions for backup and archiving, there are important issues to do with media, location, metadata and management. Storage media are susceptible to breakdown and loss of data. The possibilities of fire, theft and damage need to be considered. It is necessary to consider how the media and files are labelled, and how the documentation is associated with the relevant resource. These technical issues are not covered in detail here, as they are subject to constant change due to technical innovation, development of standards and changes in practice. It is best to consult the AHDS, or other information professionals, for up-to-date advice which takes into account the latest developments.
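
Whatever media are chosen, one widely used way to detect silent loss or corruption of data is to record a checksum for every file and to re-verify the checksums periodically. The following Python sketch assumes that the corpus is held as a directory of files; the function names and the choice of SHA-256 are illustrative assumptions rather than AHDS recommendations.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(corpus_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to corpus_dir) to its checksum."""
    return {
        path.relative_to(corpus_dir).as_posix(): file_checksum(path)
        for path in sorted(corpus_dir.rglob("*"))
        if path.is_file()
    }


def verify_manifest(corpus_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose contents have changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = corpus_dir / rel_path
        if not target.is_file() or file_checksum(target) != expected:
            problems.append(rel_path)
    return problems
```

The manifest would be stored with the documentation, and verify_manifest run against each copy of the corpus; an empty result means that every listed file is present and unchanged.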

Where is the corpus archived?

You are likely to need to store the data locally during the data development phase, and you will undoubtedly want to continue to do this so that you can use it. However you may opt to pass on the job of archiving, cataloguing, distributing and preserving your corpus to an organisation which offers professional archival services, such as the AHDS.

The fact that the corpus is archived elsewhere does not mean you lose rights over your resource. An archive will not normally acquire any exclusive rights over the corpus. The creator and other rights holders do not lose any of their rights. The normal arrangement is for the resource creator to retain ownership, and to grant the archive permission to keep a copy, and, possibly, to distribute the resource. The arrangement should be non-exclusive, meaning that it does not prevent the corpus creator from depositing the corpus elsewhere, and it should be possible to dissolve the agreement. You should check the licensing agreement for these and other issues which are relevant to you if you deposit your corpus in an archive. It would also normally be necessary to take a look at the terms under which users may be able to download the corpus, and check that these do not come into conflict with any of your rights or responsibilities.

As long as the agreement is non-exclusive, you can continue to distribute the corpus yourself, develop it and exploit it in other ways.

Who will have access to the corpus?

There are several factors which sometimes influence corpus builders not to make resources more widely available. Some are listed below:

to avoid copyright and other rights issues;
to ensure that the creator has the first, or even exclusive, opportunity to exploit the resource and publish research or further resources based on it;
to retain the option to sell the rights on a commercial basis;
because of the danger of uncontrolled commercial exploitation or pirating;
because it is too much trouble to administer distribution.

Avoiding legal issues

It should be noted that the first reason above is not a sound one from the legal point of view. As noted above, copying texts and putting them in a corpus can constitute a breach of copyright, whether or not the corpus is then distributed.

Getting the first chance to use the data

While it may be desirable for the creator to have the first opportunity to publish results based on the corpus, it is also desirable that any published results be replicable, which means that the corpus on which the research is based needs to be made available to other researchers. In any case, the creator will normally have a head start over other researchers, with a research agenda already in place and underway as soon as the corpus is completed. Delaying the deposit of the corpus in an archive runs the risk of the data becoming corrupted, or of versions of the resource becoming confused. In some cases delay leads to the deposit never happening, as priorities and circumstances change.

Releasing the corpus commercially

If commercial exploitation of the corpus is an option, the creator must weigh up the options. While a commercial deal may please your employer, and bring some financial reward, there are some good arguments for open access. The more widely available the corpus is, the more widely known it is, and the more publicity the creator will receive. A community of researchers who work on the corpus will come into being, creating a higher profile for research based on the resource, including your own. Feedback will be obtained on the usefulness of the resource, and errors can be corrected. Others are more likely to share their resources with you if you share yours. Funders are more likely to give you more funding if you have a good record of ensuring that resources which you have created are properly archived and distributed. The funders generally perceive better value for money in creating resources that are reusable. Failure in this respect could seriously weaken a proposal for further funding. Further project funding may be more lucrative and prestigious than what can be obtained from commercial exploitation of the data. In any case, commercial publication and open access are not necessarily mutually exclusive. It may, for example, be possible to sell copies of a corpus bundled with access software, while also making the raw corpus data freely available.

Concern about unrestricted access and piracy

Concern about piracy is not a good reason not to deposit a corpus. It is likely to be easier to control access and defend the rights of stakeholders if the corpus is distributed through an archive. A reliable archive will have a rights management policy, and have the means to take action to defend rights that are violated. The corpus creator is unlikely to want to get involved in these issues, even with local institutional support.

It's all too much trouble

It is not necessarily as much trouble as you might think. It should be noted that the AHDS normally offers a free service to academics in the UK to archive, catalogue, distribute and preserve corpora, and so the expense and work of administering access does not need to be borne by the corpus builder or their institution.

Open access: conclusion

It is for the corpus developer to weigh up these issues and decide whether they want to enlist the help of an archive to distribute the corpus. In the short term they may be able to manage distribution of a resource themselves, but this is unlikely to be viable in the long term. If the situation regarding access is not clearly defined and well-documented then this could seriously affect the future viability of the resource. The developer could thus fail to meet the expectations of their funders, users and other stakeholders. Managing access and dealing with rights issues can be time consuming and complex.

In the event that it is not possible to distribute the resource for some valid reason, it is still good practice to deposit a copy of the corpus in an archive for long-term preservation purposes. Such an arrangement can normally be negotiated with the AHDS.

How will users find the corpus?

Depositing your corpus in a trusted archive should help ensure that best practice is followed in ensuring the security, availability and long-term preservation of the corpus. It should also help users to find the resource, since an effective archive will make its catalogue records visible to potential users. They will participate in sharing of resource descriptions, through open archives initiatives and institutional and subject portal projects. Such initiatives are currently growing in importance. One of particular relevance to the field of corpus linguistics is the Open Language Archives Community (OLAC, http://www.language-archives.org/). All of the major archives of language resources have come together in OLAC in order to enable users to go to one place to search for corpora and other resources held in different archives and repositories. The creation of this community is also helping the development of standards in the description of resources.

Many more initiatives within institutions and different communities to share information about resources are likely to appear in the coming years, in the shape of portals, virtual learning and research environments, institutional archives and online library and information systems. These are all likely to be built on the open standards which are used by archives and other trusted repositories. Depositing your resource with an archive means that they will catalogue your resource according to appropriate standards and thus make it possible for the existence and availability of the corpus to be discovered via these mechanisms.

What file format should my corpus text files be in for archiving?

One piece of important general advice on file formats for digital preservation is to avoid tie-ins to proprietary formats. If your corpus is made up of files in a format for a commercial word-processing program, such as Microsoft Word, then they cannot be processed by most corpus analysis tools. What is more, the format may not be supported indefinitely into the future, and there will come a time when users won't be able to read the files any more. XML is usually considered to be a more appropriate file format for long-term preservation, because it is an open international standard defined by the World Wide Web Consortium (W3C), it is not tied to particular applications or platforms and it uses Unicode (another open standard) for encoding the text.

However, it should not be thought that simply saving files as XML is a panacea for all archiving and preservation problems. It is perfectly possible to use XML to make a corpus which is in an appropriate form for long-term preservation, but it is also very easy to make a corpus using XML which is NOT viable in the near, let alone distant, future. Simply automatically converting a file from a word-processing format to XML does not magically make it into a good electronic resource. Recommendations of preferred file formats, encoding schemes and software options can obscure more important factors. Open standards like XML are preferred because they make it possible to encode the intellectual content of the resource and the metadata in a consistent and unambiguous way. While there are reasons why XML, and Unicode, are desirable, and likely to become more firmly entrenched and widely used for language corpora, it is often trivial to migrate from other formats and standards, including proprietary ones, as long as good practice has been followed in the creation of the electronic text in whatever format. There should be no short-term problems with converting a text file created and edited in MS Word in which the various relevant textual phenomena have been dealt with in a principled and consistent way.
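
A basic check which can be run before deposit is that every file decodes as Unicode and parses as well-formed XML. The Python sketch below uses only the standard library and assumes that UTF-8 has been chosen as the encoding; the function name is hypothetical. Passing such a check says nothing about whether the markup is principled and consistent, which is the more important question.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def check_xml_bytes(data: bytes) -> Optional[str]:
    """Return None if data is UTF-8 and well-formed XML, else a short description of the problem."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"not valid UTF-8: {exc.reason} at byte {exc.start}"
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return f"not well-formed XML: {exc}"
    return None
```

The function would be applied to the raw bytes of each corpus file, for example those returned by pathlib.Path.read_bytes(), and any non-empty result recorded and investigated.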

There is however a particular issue with text corpora, which means that the type of text encoding is especially important. To use a corpus, the text needs to be searchable, preferably with generic tools. This means that binary encoding formats, such as PDF, RTF and Word are inappropriate, and 'plain text' or Unicode (with or without markup) are preferable. There is unfortunately a conflict here between the needs of corpus linguists and those working in the archiving, preservation and digital library worlds. The latter are generally more concerned with ensuring that the content of text documents and other types of data are preserved, sometimes including the 'look and feel' of a text, rather than preserving the 'searchability' of the texts. For this reason, proposals from the digital preservation professionals for an open standard for PDF for preservation purposes (PDF-Archive), or any other kind of binary format, are not appropriate for language corpora (see http://www.aiim.org/standards.asp?ID=25013). Indeed, it would be a hindrance to linguists hoping to use electronic archives as the basis for research if the archives were to adopt binary formats for preservation.
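
To illustrate what searchability with generic tools means in practice, a few lines of general-purpose code are enough to produce a simple keyword-in-context listing from plain Unicode text. This is a sketch only; the function name, the context width and the output layout are assumptions made for the example.

```python
import re
from typing import List


def kwic(text: str, keyword: str, width: int = 30) -> List[str]:
    """Return keyword-in-context lines for each whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines
```

Nothing comparable is possible with a binary format without first extracting the text, and that extraction step is a further point at which information can be lost.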

While it may be convenient to use one file format for all stages of the life-cycle of a corpus, it may well be the case that the best preservation formats are not the best formats for the data development stage, or for using with the relevant analysis tools. In this case, a separate preservation version of the resource may be created. But it is important to bear this in mind while developing the corpus and to make sure that information necessary for accompanying the preservation version is not lost. For electronic text, this means avoiding the insertion of annotation or processing instructions in such a way that the original text and its structure are not recoverable. In the case of audio data, this means capturing, storing and depositing the best possible quality, in an uncompressed audio stream, and then converting to a more convenient lower quality, compressed sound for analysis and distribution, if necessary.
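
For electronic text, one simple safeguard is to test that the original text can be regenerated from the annotated version. The Python sketch below assumes inline XML annotation in which tags only wrap the text and never alter it; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_annotation(annotated_xml: str) -> str:
    """Return the character data of an XML document with all tags removed."""
    root = ET.fromstring(annotated_xml)
    return "".join(root.itertext())


def original_is_recoverable(original_text: str, annotated_xml: str) -> bool:
    """True if removing the markup gives back exactly the original text."""
    return strip_annotation(annotated_xml) == original_text
```

A check of this kind will catch accidental changes such as a tokeniser inserting a space before punctuation, but it cannot show that structural information, such as paragraph or page breaks held in the original markup, has been preserved.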

A further discussion of issues in digital preservation of electronic resources in humanities disciplines can be found in Smith (2004).

3. Conclusion

Unambiguous, rigorous, consistent and well-documented practices in data development are usually more important than the technologies used. There are preferred options for file formats, encoding, markup, annotation and documentation, but these will change over time. For the latest recommendations, consult the Arts and Humanities Data Service (http://www.ahds.ac.uk/) at the planning stage of your project, and build into your workplan adequate time and resources for the preparation of the corpus for distribution, archiving and preservation.

The general advice here is for conformance to open standards in corpus creation and documentation, but it is acknowledged that there is more than one way to do this. It is hoped that these are the messages of this entire guide.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
