<?xml version="1.0"?> 
<?xml-stylesheet type="text/css" href="teixlite.css"?>
<!DOCTYPE text PUBLIC "-//TEI//DTD TEI Lite XML ver. 1//EN"
"/home/lou/TEI/web/Software/tei-emacs/xml/dtds/tei/teixlite.dtd" []>
<text>
<body>

<div><head>Introduction</head>

<p>This report is a first draft version of deliverable D1.1b for the
    validation unit contract ELTA/0209/VAL-1, and is intended to be
    read as a parallel report to deliverable D1.1a which describes the
    validation of lexica. The theoretical framework underlying this
    work was presented in <ptr target="WP2"/> and <ptr target="WP3"/>; the present document assumes an
    understanding of the issues discussed in those reports.
   </p>
<p>As noted in those reports, and elsewhere, the word <term>validation</term> is
    generally used (synonymously with <term>evaluation</term>) to denote the
    process by which the usefulness or applicability of a given
    resource for a specific purpose may be determined. It may well be
    that a resource suitable for one purpose is of no use for another;
    this affects in particular resources such as language corpora
    which typically have many users, and may have applications often
    entirely unanticipated by their original creators. Rather than
    attempt to define validation procedures appropriate to many
    different applications or scenarios, the approach taken here is to
    define objective criteria by which data essential to such an
    evaluation can be assembled, by reference to a check list of
    identifiable characteristics. The intended outcome from a
    validation process should thus be an accurate description of a
    corpus, using standardized terminology; it is up to the individual
    user to determine the extent to which a corpus is fit for a
    given purpose.
</p>
<p>We use the word <term>corpus</term> throughout as a short form for
    <term>written language resource</term>, recognizing however that
    many corpora include both spoken and written materials. There is consequently
    some overlap between  the procedures and requirements described here and those defined
    in <ptr target="dspex"/>.</p>

<p>The bulk of this report consists of a checklist of features and
    properties which the evaluator should investigate in the resource
    under consideration. In some cases, specific recommendations are
    made; for the most part, however, the only requirement is that the
    evaluation should consistently attempt to ascertain whether or
    not information about the feature concerned is available, and if
    so, what it is.
   </p>

<p>Evaluation can (and probably should) be performed both as a part of
    the creation of a resource, and subsequently by an independent
    assessor. As noted above, the creator of a resource may have a
    different set of objectives from subsequent users of it, which
    suggests that it is probably a good idea not to rely solely on
    evaluations carried out during resource production. On the other
    hand, only the creator of a resource may be able to supply some of
    the less obviously apparent information required for a complete
    description of a resource. The two stages are thus complementary,
    and wherever possible both should be attempted.</p>

<p>An appendix to the report contains a summary checklist of questions
   which may be used to guide the evaluation process.</p>

</div>
<div><head>Scope of Corpus Evaluation </head>
<p>The model proposed here derives from that underlying most pre-existing
    corpus standardization work (most notably <ref>TEI</ref> and its applications in
    <ref>EAGLES</ref> and <ref>CES</ref>). Being descriptive,
   these standards largely address issues of corpus
   <term>content</term>, which may be thought of as the identifiable
   intellectual content of a resource, considered independently from
   any particular physical instantiation of it. Since ELRA is
   additionally concerned with the deposit and distribution of
   physical corpora, we have found it necessary to extend the
   descriptive model to include consideration of <term>corpus
    components</term>, that is, the constituents and status of the carrier
   media used for a specific instance of the written resource.
 </p>
<p>Under this latter heading we consider such aspects as the media on
    which the corpus is or may be stored; where
    alternate versions exist, which of these is to be regarded as
    canonical; what components make up the complete resource (disks,
    manuals, documentation, etc.); and how they are to be unpacked or
    installed.</p>
<p>When considering the content of a corpus, it is convenient to distinguish:
<list>
<item>metadata:  descriptive information about the corpus
      contents and included with it</item>
<item>structural components identified within the corpus</item>
<item>interpretive annotation provided with the corpus</item>
    </list>
   </p>


<p>The distinction between <soCalled>structural</soCalled> and
   <soCalled>interpretive</soCalled> reflects the fact that most
   people seem to categorize an analysis such as <q>this is a
   paragraph</q> (<soCalled>structural</soCalled>) differently from
   the formally equivalent judgement <q>this is a noun</q>
   (<soCalled>interpretive</soCalled>). A similar distinction
   underlies the notion of <soCalled>level</soCalled> of annotation as
   exemplified by (inter alia) the Corpus Encoding Specification <ref
   target="ide98">(Ide 1998)</ref>, where the distinction is further
   justified by the observation that the addition of
   <soCalled>structural</soCalled> markup is generally easier to
   automate than that of <soCalled>interpretive</soCalled> markup,
   since the latter (almost) invariably requires human judgement and
   knowledge, while the former rarely does. Particularly in the case
   of textual markup, interpretive judgements tend to be more
   controversial than structural ones, if only because the latter
   relate to aspects of a text which are accepted as intrinsic to its
   substance by the community of text readers. Structural
   interpretations form part of the <soCalled>contracts of
   literacy</soCalled> (<ref target="sno86">Snow and Ninio,
   1986</ref>) which form the precondition of a text's recognition as
   meaningful by the members of a particular community of readers.</p>

<p>Markup, whether introduced into a corpus to identify structural or
   interpretive components, can only properly be validated with
   reference to an abstract model of some kind. For structural
   features, the model will comprise  textual components and features
   which may be either entirely intuitive and <soCalled>common
   sense</soCalled> based, or expressed in terms of some consensus-based
   model such as that of the TEI. Interpretive markup may be similarly theory-free (see, for
   example, <ref  target="lee93">Leech 1993</ref>), but it
   is more customary to define it with reference to some explicitly
   stated analytic model, and hence to facilitate both automatic
   validation of the corpus itself (to check that it is valid in its
   own terms) and comparison of two corpora using different markup
   schemes derived from a common abstract model.</p>
<p>At the technical level, structural markup is generally embedded
   within the corpus, while interpretive markup need not be. There is an
   increasing use of <soCalled>stand off</soCalled> markup, which
   represents particular kinds of annotation as distinct documents
   comprising pointers into a corpus and associated annotations of
   various kinds. Such annotation maps require careful validation with
   special-purpose software.</p>
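<p>As a rough illustration only, a stand-off annotation document might
   resemble the following sketch. The element and attribute names
   (<gi>standoff</gi>, <gi>ann</gi>, <ident>corpus</ident>,
   <ident>target</ident>) and the file name are hypothetical and are not
   taken from any published scheme. The point is simply that each
   annotation points into the corpus rather than being embedded in it,
   so that a validator must check that every pointer resolves.
<eg><![CDATA[<!-- hypothetical stand-off annotation file -->
<standoff corpus="corpus.xml">
  <!-- each target is assumed to name the id of a token in corpus.xml -->
  <ann target="w1" type="pos" value="DET"/>
  <ann target="w2" type="pos" value="NOUN"/>
</standoff>]]></eg></p>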

<!--
<p>Although these three overlap considerably, it is both convenient
    and conventional to distinguish them. In each case, in the
    appropriate section below, we 
<item>a brief description of the kind of information
    which may be supplied
, a recommendation as to whether or not this
    information should be required
, how it should be identified   </p>
-->
  </div>
<div id="D1man"><head>Corpus Components</head>
<!--
<p>This document describes how to validate a copy of some corpus. This
    section itemizes the various physical objects which should be
    delivered as constituting the object to be validated.
   </p>
-->
<p>Corpora may be instantiated using a variety of media (disks, CD, DVD,
    tape, etc.) or online. It is essential that the validator can
    ascertain whether or not the media delivered form a complete and
    integral copy of the corpus. To make best use of a corpus, it is
    also generally necessary to have access to supporting metadata,
    code books, explanatory documentation, licence agreements, etc. As
    far as possible all such supporting material should be included in
    digital form along with the corpus itself. Where it is supplied in
    paper form, this should also be clearly indicated. </p>

<p>To facilitate this essential step of the validation, we propose
    that each corpus should be accompanied by a <term>corpus
    manifest</term>. This is a list of all the physical components
    which should be assessed together, and which together constitute
    the corpus and the support environment needed to make best use of
    it, indicating for each one:
<list>
<item>its type (paper document, computer file, audio or video recording, etc.)</item>
<item>its carrier (computer file name and location, document title etc.)</item>
<item>its status (integral part of corpus, descriptive metadata,
       associated annotation, documentation, etc.) </item>
<item>for digital components, the storage format (character encoding,
      binary format, record structure, etc.)
     </item>    </list>
   </p>
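<p>One possible shape for such a manifest is sketched below. No
   manifest format is prescribed here, so every element and attribute
   name in the sketch (<gi>manifest</gi>, <gi>component</gi>,
   <gi>carrier</gi>, <gi>format</gi>) and every file name is an
   assumption made purely for illustration.
<eg><![CDATA[<!-- hypothetical corpus manifest: one entry per physical component -->
<manifest corpus="Example Corpus">
  <component type="computer file" status="integral part of corpus">
    <carrier>texts/sample001.xml</carrier>
    <format encoding="UTF-8">XML</format>
  </component>
  <component type="computer file" status="documentation">
    <carrier>doc/manual.html</carrier>
    <format encoding="ISO-8859-1">HTML</format>
  </component>
  <component type="paper document" status="documentation">
    <carrier>Example Corpus Users' Guide</carrier>
  </component>
</manifest>]]></eg></p>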

<p>Where more than one version of a given component is supplied, their
    status should be clearly indicated. For example, it is sometimes
    convenient to include different versions of a computer file using
    different levels of encoding, or different sets of conventions. In
    such a case, the corpus creator should select one such version as
    primary and indicate that the others are alternates, so as to
    maintain corpus integrity.
   </p>

<p>Where a corpus consists of very many computer files they will
    generally be organized into a hierarchic structure of some kind,
    and may be supplied in compressed form. Any software needed to
    decompress or to make accessible the corpus must also be specified
    in the corpus manifest.</p>

<p>The digital version of certain supporting documents
    (e.g. word-processed files, spreadsheets, databases, audio or
    video files etc.) may be supplied in non-standard or proprietary
    formats which may require the use of specific software. As far as
    possible all such dependencies on external software should be
    avoided: the primary components of a corpus must be processable on
    any platform, using generic standard interfaces and open
    architectures.</p>

<p>The following formats are recommended for processing of the most
    commonly used media types:
<list type="gloss">
<label>text files</label><item>XML or SGML conforming to a standard or
      supplied DTD or schema</item>
<label>audio</label><item>MP3, WAV</item>
<label>video</label><item>MPEG, QuickTime</item>
<label>image files</label><item>PNG, JPG</item>
</list>
 </p>

<p>Where XML is used for text files, the text should be
    encoded in UTF-16, UTF-8 or (exceptionally) one of the ISO
    8859 character sets, as indicated in the encoding attribute of the
    XML declaration. If different files use different encodings, each
    should contain such a declaration. Where SGML is used, an
    appropriate SGML declaration defining the character encoding
    employed <emph>must</emph> be included in the corpus manifest.
   </p>
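<p>For instance, a file encoded in UTF-8 and another encoded in ISO
   8859-1 would begin with the following XML declarations
   respectively, each appearing at the very start of its own file:
<eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>

<?xml version="1.0" encoding="ISO-8859-1"?>]]></eg></p>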

<p>Where XML or SGML is used for text files, a DTD or Schema
    <emph>must</emph> be supplied or referenced, against which the
    text files can be validated. For written corpus resources, the
    most appropriate DTD is likely to be an application of the TEI,
    such as the XCES, but any adequately documented vocabulary may be
    employed. Both the DTD and its associated documentation
    should be supplied and listed in the corpus manifest.</p>
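<p>A text file might reference its DTD by means of a document type
   declaration such as the following, in which the system identifier
   <ident>dtds/teixlite.dtd</ident> is an assumed location relative to
   the corpus root, chosen only for illustration:
<eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE text SYSTEM "dtds/teixlite.dtd">
<text>
  <body>
    <p>Sample text.</p>
  </body>
</text>]]></eg>
   A relative reference of this kind keeps the corpus self-contained,
   whereas an absolute path on the creator's own machine would prevent
   validation elsewhere.</p>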

<p>Text files supplied as alternates to corpus components such as
    documentation may use any appropriate format, though open formats
    such as HTML or XML are to be preferred above proprietary word
    processor or database formats. In particular, where database or
    spreadsheet data is used to document some part of a corpus, a
    printout of the material should be supplied and the data should be
    exported into some open format such as XML or comma-delimited
    files.  </p>

  </div>

<div><head>Metadata</head>
<p>By "metadata" we mean...
   </p>

<p>Multiple <term>levels</term> of metadata may be associated with a corpus. For
    example, some information may relate to the corpus as
    a whole (for example, its title, the purpose for which it was
    created, its distributor, etc); other information may relate only
    to individual components of it (for example, the bibliographic
    description of an individual source text), or to groups of such
    components (for example, a taxonomic classification).</p>
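<p>In TEI terms, these levels might be expressed by giving the corpus
   as a whole one header and each component text a header of its own.
   The following skeleton assumes the TEI P4 element names
   <gi>teiCorpus.2</gi> and <gi>TEI.2</gi>; the header contents are
   omitted and shown only as comments.
<eg><![CDATA[<teiCorpus.2>
  <teiHeader type="corpus">
    <!-- metadata applying to the corpus as a whole -->
  </teiHeader>
  <TEI.2>
    <teiHeader type="text">
      <!-- metadata applying to this component text only -->
    </teiHeader>
    <text>
      <!-- the text itself -->
    </text>
  </TEI.2>
  <!-- further TEI.2 elements, one per component text -->
</teiCorpus.2>]]></eg></p>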

<p>In the following lists, we have supplied the TEI/XCES element
    corresponding with the topic in question. This is not meant to
    imply that only corpora conforming to TEI/XCES standards can be
    validated, of course, but rather to add precision to the topics
    addressed. That said, obviously a corpus which is conformant to
    those standards will be much simpler to validate than one which
    supplies equivalent information in some other manner.
   </p>

<div><head>Corpus identification</head>
<p>Under this heading we group information that identifies the corpus,
     and specifies the agencies responsible for its creation and
     distribution.
<list>
<item>name of corpus (<gi>titleStmt/title</gi>): required</item>
<item>producer (<gi>titleStmt/respStmt</gi>): required. The agency (individuals,
	research group, "principal investigator", company, institution etc.) responsible for the
	intellectual content of the corpus should be specified. This
	may also include information about any funding body or sponsor
	involved in producing the corpus.</item>
<item>distributor (<gi>publicationStmt</gi>): required. The agency
	(individual, research group, company, institution etc)
	responsible for making copies of the corpus available. The
	following information should be provided:
<list>
<item>name of agency (<gi>publisher</gi>, <gi>distributor</gi>)</item>
<item>contact details (postal address, email, telephone, fax) (<gi>pubPlace</gi>)</item>
<item>date first made available by this agency (<gi>date</gi>)</item>
<item>any specific identifier (e.g. a URN) used for the published
	  version (<gi>idno</gi>)</item>
<item>availability (<gi>availability</gi>): a note summarizing any restrictions on
	  availability, e.g. where the corpus may not be distributed
	  in some geographic zones, or for some specific purposes, or
	  only under some specific licensing conditions.</item>
	</list></item></list></p>

<p>If a corpus is made available by more than one agency, this should
	 be indicated, and the information above supplied for at least
	 one of them.  If the intention is to make the corpus
	 available only from ELRA, however, then the above details
	 need not be supplied.
 </p>
<p>If specific licensing conditions apply to the corpus, a copy of
	 the licence or other agreement must be included in the corpus
	 manifest (<ptr target="D1man"/>).</p>
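<p>The identification details listed in this section might be recorded
   in a TEI header along the following lines. This is a sketch only:
   all names, the date, and the identifier are invented for
   illustration, and the other components of the file description are
   omitted.
<eg><![CDATA[<titleStmt>
  <title>Example Corpus of Written English</title>
  <respStmt>
    <resp>compiled by</resp>
    <name>Example Research Group</name>
  </respStmt>
</titleStmt>
<publicationStmt>
  <distributor>Example Distribution Agency</distributor>
  <pubPlace>Example City</pubPlace>
  <date>2000</date>
  <idno type="URN">urn:example:corpus-1</idno>
  <availability>
    <p>Available for research purposes only, under licence.</p>
  </availability>
</publicationStmt>]]></eg></p>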

  </div>
<div><head>Corpus derivation</head>
<p>Under this heading we group information that describes the sources
     sampled in creating the corpus.
    </p>
<p>Written language resources may be derived from any of the
     following:
<list>
<item>originally printed materials such as books, newspapers and pamphlets</item>
<item>unpublished handwritten or "born-digital" materials</item>
<item>web pages or other digitally distributed materials</item>
<item>recorded or broadcast speech or video</item>
     </list>
    </p>
<p>A brief description of each different source used in building a
     corpus should be supplied. This may take the form of a full TEI
     <gi>sourceDesc</gi> attached to the relevant corpus
     component, or it may be supplied in ancillary printed
     documentation, but its presence is essential. In a language
     corpus, samples are taken out of their context; the description
     of their source both restores that context and enables a degree
     of independent verification that the sample correctly represents
     the original.</p> 

<p>For conventionally printed and published material, a standard
     bibliographic description should be supplied or referenced, using
     the usual conventions (author, title, publisher, date, ISBN,
     etc.), and using a standard format such as TEI, BibTeX, etc. For
     other kinds of material, different data is appropriate: for
     example, in transcripts of spoken data it is customary to supply
     information about each speaker, and the context in which the
     speech interaction occurs. See further the detailed
     specifications in <ptr target="dspex"/>, which should be closely
     followed in this case.
    </p>
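<p>For a printed source, such a description might take the following
   form. The entry is borrowed from the reference list of this report
   purely as sample data; it does not imply that the work concerned
   was sampled in any corpus.
<eg><![CDATA[<sourceDesc>
  <bibl>
    <author>Stubbs, M.</author>
    <title level="m">Text and Corpus Analysis</title>
    <publisher>Blackwell</publisher>
    <date>1996</date>
  </bibl>
</sourceDesc>]]></eg></p>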

<p>Information about the size of each sample should be given.</p>

   </div>

<div><head>Corpus encoding</head>

<p>Under this heading we group descriptive information relating to the way in
     which the source documents from which the corpus was derived
     have been processed and managed:</p>
<list>
<item>Project goals and research agenda (<gi>projectDesc</gi>)</item>
<item>Sampling principles and methods employed
      (<gi>samplingDecl</gi>)</item>
<item>Editorial principles and practices
      (<gi>editorialDecl</gi>)</item>
<item>XML or SGML tagging used (<gi>tagsDecl</gi>)</item>
<item>Reference scheme applied (<gi>refsDecl</gi>)</item>
<item>Classification scheme applied (<gi>classDecl</gi>)</item></list>
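<p>In a TEI header, the six topics just listed are grouped within the
   <gi>encodingDesc</gi> element. The skeleton below is a sketch in
   which the prose and comments are placeholders only, standing in for
   the real statements a corpus creator would supply.
<eg><![CDATA[<encodingDesc>
  <projectDesc>
    <p>Statement of project goals and research agenda.</p>
  </projectDesc>
  <samplingDecl>
    <p>Statement of sampling principles and methods.</p>
  </samplingDecl>
  <editorialDecl>
    <p>Statement of editorial principles and practices.</p>
  </editorialDecl>
  <tagsDecl>
    <!-- one tagUsage element for each element type used -->
  </tagsDecl>
  <refsDecl>
    <p>Description of the reference scheme.</p>
  </refsDecl>
  <classDecl>
    <!-- one or more taxonomy elements -->
  </classDecl>
</encodingDesc>]]></eg></p>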

<p>The project goals and research agenda associated with the creation
    of a corpus should be explicitly stated. </p> 

<p>Where a corpus has been made (as is usually the case) by selecting
    parts of pre-existing materials, the principles on which the
    sampling was performed should be explicitly stated. For example:
    what is the relationship between size of sample and size of
    original? Were all samples taken from the beginning, middle, or
    end of texts? How were samples chosen?
   </p>

<p>By editorial principles and practices we mean the practices
followed when transforming the original source into digital form. For
textual resources, this will typically include such topics as the
following, each of which may conveniently be given as a separate
paragraph.

<list type="gloss">
<label>correction </label><item>how and under what circumstances corrections have been made in
the text.</item>
<label>normalization</label><item>the extent to which the original source has been regularized or
normalized.</item>
<label>segmentation</label><item>how the text has been segmented, for example into
sentences, tone-units, graphemic strata,
       etc.</item>
<label>quotation</label><item>what has been done
with quotation marks in the original &mdash; have
they been retained or replaced by entity references, are opening and
closing quotes distinguished, etc. </item>
<label>hyphenation</label><item>what has been done with hyphens (especially end-of-line
hyphens)  in the original &mdash; have they been retained, replaced by
entity references, etc.</item>
<label>interpretation</label><item>what analytic or
       interpretive information has been added to the text</item></list></p>
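<p>Following the one-paragraph-per-topic convention just described, an
   editorial declaration might be sketched as follows. The practices
   stated are invented for illustration and do not describe any
   particular corpus.
<eg><![CDATA[<editorialDecl>
  <p>Correction: obvious typesetting errors have been corrected
     without comment.</p>
  <p>Normalization: original spelling has been retained throughout.</p>
  <p>Segmentation: the text has been segmented into sentences.</p>
  <p>Quotation: quotation marks have been retained; opening and
     closing quotes are not distinguished.</p>
  <p>Hyphenation: end-of-line hyphens have been removed and the
     word rejoined.</p>
  <p>Interpretation: a part-of-speech code has been added to
     each word.</p>
</editorialDecl>]]></eg></p>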

<p>There is no requirement that <emph>all</emph> of the above be
    formally documented and defined. For validation purposes, however,
    it is necessary to identify whether or not information is
    <emph>available</emph> under each such heading, so that the end
    user for whom a particular category may or may not be significant
    can make an informed judgment of the usefulness to them of the
    corpus.
</p>

<p>Where a resource has been marked up in XML or SGML, or some other
    formal language, the markup
    scheme used must be documented in full, unless it is an
    application of some publicly defined markup vocabulary such as
    TEI, CES, DocBook, etc. Markup in languages other than XML or SGML is not generally
    recommended. </p>
<p>For XML or SGML not conforming to a publicly
    available scheme, the following should be available, and listed in the
    corpus manifest:
<list>
<item>a copy in electronic form of a DTD or XML Schema which can be used to validate each
      resource supplied </item>
<item>a document providing definitions for each element used in the
      DTD (The TEI element definitions may be used as a model, but any
      equivalent description may be used) </item>
<item>any additional information needed to correctly process and
      interpret the markup scheme</item>
    </list>
   </p>
<p>For XML or SGML which does conform to a publicly available
    scheme, the following information should be supplied:
<list>
<item>Name of the scheme</item>
<item>How the scheme has been customized (i.e., for TEI, which modules are used)</item>
<item>Any modification files</item>
    </list>
This may be conveniently provided in the form of a driver file which
    can be used to validate the resource.</p>
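<p>A driver file of this kind might look like the following sketch,
   which assumes the TEI P4 conventions for selecting tag sets by means
   of parameter entities. The choice of tag sets is illustrative, and
   the modification file names <ident>mycorpus.ent</ident> and
   <ident>mycorpus.dtd</ident> are hypothetical.
<eg><![CDATA[<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
  <!-- select the XML version of the DTD -->
  <!ENTITY % TEI.XML      "INCLUDE">
  <!-- base tag set -->
  <!ENTITY % TEI.prose    "INCLUDE">
  <!-- additional tag sets -->
  <!ENTITY % TEI.corpus   "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!-- local modification files (hypothetical names) -->
  <!ENTITY % TEI.extensions.ent SYSTEM "mycorpus.ent">
  <!ENTITY % TEI.extensions.dtd SYSTEM "mycorpus.dtd">
]>
<!-- the TEI.2 document element follows here -->
]]></eg></p>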
<p>For schemes permitting user modification or extension (such as the
    TEI), documentation of the additional or modified elements
    must also be provided, as for a non-public DTD.
   </p>

<p>Finally, for documents in XML or SGML, it is useful to provide a
    list of the elements actually marked up in the resource,
    indicating how often each one is used, which can be used to
    validate the coverage of the category of information marked up
    within the corpus. Such a list can then be compared with one
    generated automatically during validation of the corpus in order
    to confirm the integrity of the resource. The TEI <gi>tagsDecl</gi>
    element is useful for this purpose.
   </p>
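<p>A tagging declaration of this kind might be sketched as follows,
   with one <gi>tagUsage</gi> element for each element type. The
   counts and descriptions are invented for illustration; in practice
   the <ident>occurs</ident> values would be compared with counts
   produced by the validating software.
<eg><![CDATA[<tagsDecl>
  <tagUsage gi="div" occurs="12">Marks each chapter.</tagUsage>
  <tagUsage gi="p" occurs="1204">Marks each paragraph.</tagUsage>
  <tagUsage gi="s" occurs="5310">Marks each orthographic sentence.</tagUsage>
</tagsDecl>]]></eg></p>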
<p>By <term>reference scheme</term> we mean...
   </p>
<p>By <term>classification scheme</term> we mean...

   </p>
<p>A classification scheme may be defined externally or
    internally. Where it is defined internally, a structure like the
    TEI <gi>taxonomy</gi> element may be used. Where particular parts
    of a corpus use codes to define the classification assigned to
    them, these codes <emph>must</emph> be defined somewhere, either
    in ancillary documentation, or more formally.
   </p>
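<p>An internally defined classification scheme, and its use by an
   individual text, might be sketched as follows. The taxonomy, its
   category codes and their descriptions are invented for
   illustration. The first fragment belongs in the corpus header and
   the second in the header of the text being classified.
<eg><![CDATA[<classDecl>
  <taxonomy id="medium">
    <category id="bk"><catDesc>Books</catDesc></category>
    <category id="np"><catDesc>Newspapers</catDesc></category>
  </taxonomy>
</classDecl>

<textClass>
  <!-- this text is assigned to the category defined with id="np" -->
  <catRef target="np"/>
</textClass>]]></eg></p>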




</div>

<div><head></head>
<p></p>
   </div>

</div><!-- end of metadata -->

<div><head>Validation of Structural Information</head>
<p></p>
  </div>
  <div><head>Validation of  Information about Interpretive Annotation</head>
<p></p>
  </div>

</body>


<back>
<div><head>Validation Checklist</head>
<p>
   </p>
  </div>

<div><head>References</head>
<listBibl>
<bibl id="atk92"><author>Atkins, S., Clear J. and Ostler, N.</author>
(1992). <title level="a">Corpus design criteria</title>
<title level="s">Literary and Linguistic Computing</title> 7:1, 1-16.</bibl><bibl id="bak97"><author>Baker, J.P. </author> (1997) <title level="a">Consistency
and accuracy in correcting automatically tagged data</title> in
<editor role="editor">Garside, R., Leech, G. and McEnery, A.P.</editor><title level="m">Corpus Annotation</title><publisher>Addison Wesley Longman</publisher><date>1997</date></bibl><bibl id="cle92"><author>Clear, J.H.</author> (1992) <title level="a">Corpus sampling</title>
in <editor role="editor">Leitner, G.</editor><title level="m">New directions in
English language corpora</title><publisher>Mouton de Gruyter</publisher><date>1992</date></bibl><bibl id="gar93"><author>Garside, R.G. and McEnery, A.M. </author>
(1993). <title level="a">Treebanking: the compilation of a corpus of
skeleton parsed sentences</title>. In: <editor role="editor">E. Black, R. Garside and G.Leech</editor>, <title level="m">Statistically Driven Computer Grammars of
English: The IBM-Lancaster Approach </title>. Amsterdam: Rodopi.</bibl><bibl id="ide95"><editor role="editor">Ide, N.  and Veronis, J.</editor> (1995) <title level="m">Text Encoding Initiative: background and context</title>
<publisher>Kluwer</publisher> <date>1995</date><idno type="isbn">0-7923-3704-2</idno></bibl><bibl id="ide98">Ide, Nancy (coordinator) (1998) 
<title level="a">Corpus Encoding Specification</title> (forthcoming, in
<title level="m">Proceedings of the First International Conference on
Language Resources and Evaluation</title>); see also URL
<ref>http://www.cs.vassar.edu/CES</ref></bibl><bibl id="lan95"><author>Langendoen, T.L. and Simons G.</author> (1995) <title level="a">Rationale for the TEI Recommendations
for  Feature-structure Markup</title> (in <ref target="ide95">Ide and
Veronis 1995</ref>) </bibl><bibl id="lee93"><author>Leech, G.</author> (1993). <title level="a">Corpus
Annotation Systems</title>. <title level="s">Literary and Linguistic
Computing</title>, 8(4) pp. 275--281.</bibl><bibl id="lee94"><author>Leech, G. and Wilson, A.</author> (1994).
<title level="m">EAGLES Morphosyntactic Annotation. EAGLES Report
EAG-CSG/IR-T3.1.</title>. Pisa: Istituto di Linguistica Computazionale.</bibl><bibl id="nel96"><author>Nelson, G.</author> (1996). <title level="a">Markup
systems</title>. In: S. Greenbaum (ed.), <title level="m">Comparing
English Worldwide: The International Corpus of English</title>, pp.
36--53. Oxford: Clarendon Press.</bibl><bibl id="sno86"><author>Snow, C. and Ninio, A.</author> (1986).
<title level="a">The Contracts of Literacy: What Children Learn from
Reading Books</title>. In: W. Teale and E. Sulzby (eds.),
<title level="m">Emergent Literacy </title>, pp. 116-138. New Jersey:
Ablex.</bibl><bibl id="spe95"><editor role="editor">Sperberg McQueen, C.M. and Burnard, L.</editor> (1995)
<title level="a">The design of the TEI Encoding Scheme</title> (in <ref target="ide95">Ide and
Veronis 1995</ref>) </bibl><bibl id="stu96"><author>Stubbs, M. </author> (1996) <title level="m">Text
and  Corpus
      Analysis</title><publisher>Blackwell</publisher></bibl><bibl
     id="sp"><author>Clark, James</author> (1998) <title level="m">SP: An SGML
       system</title> [software]. Available from URL
      <ref>http://www.jclark.com/sp/</ref></bibl></listBibl>
<bibl id="dspex"></bibl>
<bibl id="WP2"><author>Burnard, Lou, and Tony McEnery, with Paul Baker
    and Andrew Wilson</author><title>Validation of Linguistic Corpora</title>
<date>April 1998</date></bibl>
<bibl id="WP3"><author>Burnard, Lou, and Tony McEnery, with Paul Baker
    and Andrew Wilson</author><title>An analytic framework for the validation of
       language corpora</title><date>December 1997</date></bibl>
  </div>
</back>
</text>