Several approaches might be taken to establishing an analytic framework within which language corpora can be validated. One might, for example, decide on a priori grounds that a multi-purpose resource such as a language corpus can only be validated with respect to a particular application; in that case, a different analytic framework would be needed for each corpus/application pair. The approach taken here, however, has been to determine empirically the set of textual features which current corpus users appear to agree should be captured (encoded), whether to maximize the re-usability of the resource or for other reasons.
This document describes how we set about gathering such evidence, and what the results of our analysis indicated. In a subsequent deliverable, we will assess the implications of these results for the automation of appropriate validation criteria for language corpora.
Data was collected from three distinct sources: