Auditing catalogue quality by random sampling

5. The pilot study

5.1 Background and catalogue profile

The University of Bath kindly agreed to host a pilot of the technique outlined above. The aim of the pilot was less to discover the incidence of errors in the catalogue than to investigate whether the right errors are being measured, while evaluating the practicality of the proposed technique, especially the time needed.

The Library & Learning Centre at the University of Bath has approximately 400,000 books and 2,000 serial subscriptions, supporting study in a range of disciplines with a bias towards the sciences. There are only small collections of non-book materials, principally maps, printed music, videocassettes and microform material, although there is an increasing number of CD-ROMs. The collection is classified using the Universal Decimal Classification with some local modifications.

The computerised catalogue at Bath has its origins in an experiment on the effectiveness of short records, in which 45,000 catalogue card entries were converted to machine-readable form (Bryant, Venner & Line 1972). The records were deliberately created according to the principle of 'minimal data, maximal access', including only title proper, surnames and initials, date, edition, class and book numbers. Subtitles were included when necessary to distinguish items. Edition statements were given in the abbreviated form '4ed' and all accents and apostrophes were omitted. Those few ISBNs borne at this time were recorded, as was the language of the item when it was not English, but these fields were not displayed. The only subject access was by classification number, perhaps because it was believed that most searches were for known items. It is clear that these records are of a significantly lower quality than the rest of the catalogue, but their accuracy must be evaluated on their own terms.

At the start of the 1980s, when the library was a member of the SWALCAP (South West Academic Libraries Co-operative Automation Project) consortium, these short-entry records were converted to UKMARC format. Records dating from the library's membership of SWALCAP may have been created elsewhere. In 1985 SWALCAP's system was replaced with URICA and all subsequent cataloguing has been performed in-house. Unfortunately, all 248 fields were lost during the latter conversion, although some of these have been systematically reinstated; the 260 field was also affected, with copyright dates of the form 'c1980' corrupted to 's1980'.

The next change was to Unicorn in 1998, when the creation date of existing records was reset to 25 June 1998, the date of the changeover. The University's catalogue is now shared with that of the University of the West of England (UWE), although difficulties in merging holdings records have prevented a union catalogue of serials.

As a result of these changes, there are three distinct types of record in the catalogue. First, there are the original minimal records created in Bath. Second, since 1982 records have been created to the second level of description in AACR, except that physical description is omitted (although some records created in the last year have included pagination) and no subject headings are assigned. These will be called standard records. Finally, records created by UWE include pagination and, for non-fiction, in-house subject headings; items held by both Bath and UWE therefore have these enhanced records.

Records are upgraded and revised if and only if further copies of the item are bought or classmarks are changed, but the number of acquisitions since the introduction of second-level cataloguing means that short-entry records constitute only a minority of the catalogue. Unfortunately, it was impossible to discern the cataloguing level of any particular record except by individual inspection, so no calculations could be made of the proportion of each type of record in the database.

The library subscribed to the British Library's Name Authority List until 1998, using it to verify names of corporate bodies and occasionally of personal authors. The subscription has not been renewed since the installation of Unicorn and less authority work is done. Many personal name headings, and all of those from the original short-entry catalogue, were generated directly from the statement of responsibility. It was therefore possible to check only for their presence or absence rather than their establishment in an AACR2-compliant form.

Although Unicorn supports diacritics, they are usually neglected in transcription and so their omission was not counted as an error. Items in some non-English languages are catalogued by freelance staff. The language field 041 seems to have been inconsistently applied so no check of this field was made. Similarly, personal names occasionally appear as subjects in the 600 field but these were ignored. The author lacked the expertise to evaluate the construction of classmarks. Each item in multivolume sets receives an individual record, risking reproducing an error in each.

5.2 The first (convenience) sample

Generating a true random sample from the cataloguing system unfortunately proved impossible. Instead, a convenience sample was taken consisting of 305 unique records for all books issued at Bath during the weekend of 11 and 12 March 2000. A dozen theses, short loan offprints and course packs were excluded from the sample. For each item in the sample, the full MARC record was printed on a separate sheet, with room for annotation, and the records were arranged in classmark order. An example is included as Appendix 1.

The audit took place during the university vacation, on 5, 6 and 7 June 2000, to minimise inconvenience to readers. It was assumed that the twelve weeks since the sample was created would be sufficient for the majority of books to have been returned, although there was a risk that books worth borrowing in March would be borrowed again. The combination of the low proportion of items on loan during the vacation (estimated by Bath as around 2%) and the difficulty of recovering items from students at home led to the decision not to issue recall notices.

It took the author 17 hours to locate and check 305 records and the associated items five positions to the left, an average of a little under 18 records per hour. There were 13 items for which all copies were on loan (over 4% of the sample, but sufficiently few to justify not recalling them) and 3 items were recorded as in the library but could not be found after two searches. One item in Cyrillic script could not be checked; records for four items in Chinese were checked only against a previous cataloguer's pencilled transcription on the title page. Thus 288 records were found and checked.

5.2.1 Bias in the sample

The potential for bias introduced by this convenience sample is illustrative of the problems encountered in designing a fair test and the invalidity of generalising from such a sample. The method of constructing the sample altered the population: instead of including all catalogue records, it excluded reference-only material and special collections which cannot be borrowed, and was in fact a sample of holdings rather than records. Consequently, items held in multiple copies are more likely to be included. This problem can be solved by removing all but one instance of any records which occur more than once (in fact there were none).
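The de-duplication step described above can be sketched as follows. This is a minimal illustration only: the record identifiers, barcodes and the function name `unique_records` are hypothetical and do not come from the Bath system.

```python
def unique_records(issued_items):
    """Reduce a list of issued copies to one entry per catalogue record.

    issued_items: iterable of (record_id, copy_barcode) pairs, one per loan.
    Returns record ids in order of first appearance, each listed once.
    """
    seen = set()
    unique = []
    for record_id, _barcode in issued_items:
        if record_id not in seen:
            seen.add(record_id)
            unique.append(record_id)
    return unique


# Hypothetical weekend loans: two copies of the same title were issued.
loans = [("rec-001", "b1"), ("rec-002", "b2"), ("rec-001", "b3")]
print(unique_records(loans))
```

De-duplication stops a multi-copy title being counted twice, but it does not remove the higher chance that such a title enters the sample in the first place.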

It is plausible that a book with an inaccurate catalogue record is less likely to be found on the shelves and therefore less likely to be borrowed; therefore a book which has been borrowed is likely to have relatively better cataloguing. This factor leads to an underestimation of the error rate in the catalogue.

It is also plausible that the frequency with which a book is borrowed is inversely related to its age and so newer books are over-represented in the sample. Furthermore, at least in an academic library, readers will borrow several books on related topics and so the sample will contain clusters of books with similar classmarks. Of course, such clusters will sometimes occur in a random sample - the principle is that it is entirely unpredictable, and a random sample consisting of the first 200 books in the library is just as likely as any other - but this clustering by classmark is, unfortunately, predictable.

5.3 Results of the catalogue-to-collection test on the first sample

The 288 records consisted of 30 at minimal level, 243 standard records and 15 enhanced records. Of these, 189 were free from error in the fields checked. (To summarise, the expected fields in minimal records were ISBN, title proper, personal authors' surnames and initials, date, edition and classmark; in standard records, ISBN, title, statement of responsibility, publisher, date and place of publication, edition, author headings, series and series number; and in enhanced records, all of the above plus pagination and, for non-fiction, UWE subject headings. Some standard records included language codes and 22 included pagination.)

This leaves 99 records with at least one error, or 34.4% of those checked. As proposed in section 4.2.5, assuming first that each of the 17 records which were not checked contains no errors at all, and then that each contains one error apiece, the figure for the sample of 305 records is between 32.5% and 38.0%. It is not appropriate to calculate a margin of error for this estimate because the sample was not random.
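The bounding argument can be written out as a short calculation. The sketch below assumes only the counts given in the text (99 erroneous records, 288 checked, 305 sampled); the function name `error_rate_bounds` is introduced here for illustration.

```python
def error_rate_bounds(records_with_errors, checked, sampled):
    """Return (observed, lower, upper) proportions of erroneous records.

    observed: rate among the records actually checked.
    lower:    rate over the whole sample if every unchecked record is error-free.
    upper:    rate over the whole sample if every unchecked record has an error.
    """
    unchecked = sampled - checked
    observed = records_with_errors / checked
    lower = records_with_errors / sampled
    upper = (records_with_errors + unchecked) / sampled
    return observed, lower, upper


observed, lower, upper = error_rate_bounds(99, 288, 305)
print(f"observed {observed:.1%}, bounds {lower:.1%} to {upper:.1%}")
```

The bounds describe only the effect of the unchecked records; they say nothing about sampling error, which cannot be estimated for a non-random sample.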

This figure seems high, but many of these errors could be considered minor. The subjective distinction between major and minor errors proposed in section 4.6.2 gives only 22 major errors; assuming these arise from 22 distinct records, only 7.6% of those records checked contain a major error.

It is important to stress that the non-random sampling means that these rates may not reflect the true rate in the catalogue; they may do so, but it is impossible to say. Nonetheless, the results for each field are given in the table and discussed below because the variety of errors encountered is of interest when evaluating the procedure for collecting data. Note that the first row of the table gives the number of records in which at least one field was incorrect or missing (or both).

Field                   Acceptable   Incorrect or incomplete   Missing   Total   % errors (to 1 decimal place)
All records             189          90                        24        288     34.4
Title                   263          25                        0         288     8.7
Material description    not checked                                      -       -
Statement of resp.      250          6                         2         258     3.1
Author heading(s)       283          0                         5         288     1.7
Edition                 45           2                         3         50      10.0
Physical description    37           1                         1         39      5.1
Imprint                 226          59                        3         288     21.5
Series                  72           7                         7         86      16.3
Classmark               288          0                         0         288     0.0
Subject heading(s)      12           0                         1         13      7.7
Language                not checked                                      -       -
Genre / category        not checked                                      -       -
Location / branch ID    not checked                                      -       -

Table 2. Summary results of the catalogue-to-collection test on the first sample
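For the individual fields, the final column of Table 2 is the sum of the 'incorrect or incomplete' and 'missing' counts divided by the number of records in which the field was expected. A minimal sketch of that calculation, using a few rows of the table as input (the function name `percent_errors` is an assumption made for illustration):

```python
def percent_errors(incorrect, missing, total):
    """Percentage of expected fields that were incorrect, incomplete or missing."""
    return round(100 * (incorrect + missing) / total, 1)


# (incorrect or incomplete, missing, total) taken from Table 2
table2_rows = {
    "Title": (25, 0, 288),
    "Statement of resp.": (6, 2, 258),
    "Edition": (2, 3, 50),
    "Imprint": (59, 3, 288),
    "Series": (7, 7, 86),
}

for field, counts in table2_rows.items():
    print(f"{field:<20} {percent_errors(*counts):>5}")
```

The 'All records' row cannot be computed this way, because a single record may have both an incorrect and a missing field and so be counted in both columns.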

5.3.1 Title

All 288 records had a title field and 25 records had incomplete or incorrect data. There were only 3 instances of spelling mistakes ('behaviour' instead of the US spelling 'behavior', 'milliennial' and 'regressison') and 1 record with an incorrect filing indicator (filing under the article for a French title, 'La galère'). Other errors likely to affect access were the omission of a subtitle, the omission of a volume title (probably a 248 field lost with the system conversion) and the use of a volume title in place of a title for the set (which was present nowhere in the record).

The majority of other errors involved incorrect capitalisation (11, of which 8 were the capitalisation of a word following an initial article, a rule dropped in later editions of AACR2, and 2 occurred in non-English-language material). Other minor errors were 2 missing closing brackets, 2 sets of unnecessary quotation marks, 2 ellipses which should have been replaced by dashes and a case where a subtitle was marked by a colon instead of the subfield indicator $b.

5.3.2 Statement of responsibility

There were 258 records checked for accurate statements of responsibility (the 30 minimal records had statements generated from the headings field) of which 6 had incomplete or incorrect statements and 2 had missing statements.

Only 3 of these could conceivably cause confusion: the first when the statement of responsibility incorrectly appeared in the edition statement, the second when the editor of a set was omitted despite appearing on the title page and the third when a textbook with the name Vogel in the title included that name in its statement of responsibility but excluded the current author, both the inclusion and the exclusion contradicting the title page information.

Other errors were the omission of 'by', the omission of 'and associates', an intrusive 'and' between authors and the unnecessary capital in '(Ed.)'.

5.3.3 Author headings

All 288 records had at least one author heading. 4 were missing corporate authors. The record for an edition of a book with a previous author named Wills, included in the title but no longer present on the title page, had no heading for Wills.

As explained in section 5.1, it was not possible to check authority work.

5.3.4 Edition

50 records should have had edition statements, of which 2 were incomplete or incorrect and 3 were missing. One error was the use of '2nd ed.' instead of 'SI metric ed.' and the other was that noted above in which a statement of responsibility appeared in the edition statement. The idiosyncratic use of '4ed' for '4th ed.' in minimal-level records was accepted even though this is normally incorrect.

5.3.5 Physical description

The only records to include physical description were those from UWE and a handful of recent records from Bath, 39 in total. Only pagination was included. One UWE record omitted it, which was considered an error; 1 other had incorrect pagination. It is possible that some Bath records should have included pagination but omitted it.

5.3.6 Imprint

The imprint field (place of publication, name of publisher and date of publication) was checked in all 288 records. 59 of these had incomplete or incorrect information and 3 had no imprint data. These erroneous records divided into 32 at 'standard' level and 30 at 'minimal' level.

The specification of minimal records includes only date of publication. 26 had no problem other than corruption of the copyright date ('s1980' for 'c1980'), 3 had missing dates (recorded as a completely missing imprint because no other information was expected) and 1 had a wrong date.

Among the standard-level records there was a variety of significant errors. 6 omitted any place of publication and 2 included a US place but no UK place although one was present in the item. One item checked had a record attached to a different edition of the book, so every part of its imprint data was incorrect. There was 1 record with a completely wrong place, 2 with wrong dates and 4 with the wrong publisher (by chance, places of publication here were correct as they were usually London or Oxford). 2 records had keyboarding errors in the name of the publisher. One record had a very minor error, recording the place as 'Ithaca and London' instead of the AACR2-prescribed 'Ithaca ; London'.

The large number of records (59) with incorrect information was largely a result of corrupted copyright dates. Ignoring these, only 16 incorrect records and 3 missing dates remain.

5.3.7 Series

There were 86 records which should have carried series information. Of these, 7 lacked a series statement altogether and a further 7 had incomplete or incorrect statements: one subseries was omitted and one promoted to a series; there was a keyboarding error ('Macmillian'), an omitted series numbering and the omission of 'no.' from a series numbering.

3 of these 7 records had a 490 field with just the series numbering and an 840 field with the complete series statement; this was recorded as an error in each case.

5.3.8 Classmark

All 288 records had classmarks and all appeared to be correct. It is possible that the classmarks of the 3 items which could not be located were incorrect either in their records or on their spines.

5.3.9 Subject headings

The subject headings are given in the 670 field, or the 671 field for a place name. They are given in a post-coordinate form, that is, simple headings such as 'Fiction' and 'Politics' are provided for readers to combine for more complex queries.

Only the 14 records from UWE were expected to contain subject headings - 13 records excluding one for fiction. Of these 13, one had no subject heading and this was recorded as an error.

5.3.10 Other errors

Some errors were noticed during the pilot which were not encompassed by the checklist. These include keyboarding errors in notes, the absence of a note explaining an added entry not mentioned in the statement of responsibility and the omission of the leading character in ISBNs (not SBNs). A conference was given an added entry under 710 not 711 and a name-as-subject was recorded in 700 rather than 600.

5.3.11 Variations over time

Although it was not possible to divide the sample by cataloguing date, a rough idea of the variation in quality over time could be obtained by looking at the error rates for minimal, standard and enhanced records separately. The widespread corruption of dates has been ignored in this section because it skewed the results for minimal-level records (all of which had some kind of error in the date).

With this proviso, among 30 minimal-level records were 3 with errors (5 errors in total), or 10%. There were 221 standard records without pagination, 54 of which had errors (68 errors in total), or 24.4%. Of 22 standard records with pagination, which were among the most recent, 6 records (or 27.3%) contained 10 errors between them. Finally, 15 UWE-enhanced records included 3 with one error apiece, or 20%.

It is hard to know what to deduce from these results. The minimal-level records probably have so few errors only because of their simplicity. It is disturbing that recent records created at Bath appear to have more errors, but without an adequate breakdown of the 221 standard records created over fifteen years there is no way to tell whether these are particularly bad or whether they are an improvement over, say, records from five years ago.

Full printouts for every record were available in the pilot, but if this were not the case then it would be necessary to record the year of cataloguing on the worksheet.

5.4 Results of the collection-to-catalogue test on the first sample

Only a few items (16) could not be located so it was decided that the 288 which could be found would suffice for the collection-to-catalogue test. Every one of the items five to the left was found on the catalogue and no duplicates were discovered. One problem was that the original 'mini-catalogue' included separate records for each of multiple copies of a book, in which cases it was necessary to use accession numbers for identification (these were also checked to avoid mistakenly identifying duplicates).

5.5 The second (systematic) sample

Although the first sample was not random and therefore its conclusions could not be extrapolated to the whole collection, it served as a useful pilot of the audit technique and of the worksheet. To gauge the extent of the underestimation of errors, a second, smaller sample was taken. The 100% intactness rate from the collection-to-catalogue test validated systematic sampling of the shelves, starting at a column randomly selected from the first twenty and taking the rightmost book on the top shelf of every twentieth column thereafter. This excludes items on loan so, by the argument in section 5.2.1, should overestimate errors; the proportion of items on loan had been estimated as around 2%, so this bias would be small.
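The selection of shelf columns can be sketched as below. The total number of columns at Bath is not given in the text, so the figure of 1,580 is purely hypothetical, and the function name `systematic_columns` is likewise an assumption.

```python
import random


def systematic_columns(total_columns, interval=20, rng=random):
    """Pick a random starting column among the first `interval` columns,
    then every `interval`-th column after it (columns numbered from 1)."""
    start = rng.randint(1, interval)  # inclusive at both ends
    return list(range(start, total_columns + 1, interval))


# Hypothetical library with 1,580 shelf columns (not a figure from the pilot).
columns = systematic_columns(1580, interval=20)
print(columns[:5], "...", len(columns), "columns selected")
```

At each selected column the auditor would take the rightmost book on the top shelf. Only the starting point is random, so every column has the same chance of selection, but neighbouring columns can never appear together in one sample.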

79 items were checked in two-and-a-half hours, an average of just over 30 per hour; note that the collection-to-catalogue test was not performed, so only one item was handled per record and this is consistent with the rate achieved on the first sample. The items had 5 minimal, 70 standard (3 with pagination) and 4 enhanced records. By chance, no records had corrupted dates.

There were 30 errors from 27 records, of which 9 were judged major errors on the same criteria as before. 27 out of 79 records translates to 34.2% with a 10.5% margin of error at the 95% confidence level (the error range is so high because the sample was small). 9 out of 79 translates to 11.4% with a 7.0% margin of error at the 95% confidence level. Thus the incidence of errors was similar to that found by the first, non-random sample, but a slightly greater proportion were major errors. The results for each field are given below, followed by a list of the errors found, which were characteristic of those found in the first sample.
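The quoted margins of error are consistent with the usual normal approximation for a sample proportion, z * sqrt(p(1 - p) / n) with z = 1.96 for 95% confidence. The text does not state which formula was used, so the sketch below is an assumption that happens to reproduce the figures; it applies no finite population correction.

```python
import math


def margin_of_error(count, n, z=1.96):
    """Half-width of the normal-approximation confidence interval
    for a proportion count/n (z = 1.96 corresponds to 95% confidence)."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)


for label, count in (("records with any error", 27), ("records with a major error", 9)):
    p = count / 79
    moe = margin_of_error(count, 79)
    print(f"{label}: {p:.1%} +/- {moe:.1%}")
```

With 27 and 9 erroneous records out of 79 this gives margins of 10.5% and 7.0%, matching the values above. Because the margin shrinks only with the square root of the sample size, halving it would need roughly four times as many items.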

Field                   Acceptable   Incorrect or incomplete   Missing   Total   % errors (to 1 decimal place)
All records             52           20                        10        79      34.2
Title                   72           7                         0         79      8.9
Material description    not checked                                      -       -
Statement of resp.      71           3                         0         74      4.1
Author heading(s)       75           2                         2         79      5.1
Edition                 12           0                         0         12      0.0
Physical description    7            0                         0         7       0.0
Imprint                 66           8                         5         79      16.5
Series                  26           0                         3         29      10.3
Classmark               79           0                         0         79      0.0
Subject heading(s)      4            0                         0         4       0.0
Language                not checked                                      -       -
Genre / category        not checked                                      -       -
Location / branch ID    not checked                                      -       -

Table 3. Summary results of the test on the second sample

Code   Error
T2     initial 'The' omitted
T2     unnecessary 'The' at beginning (incidentally implying incorrect capitalisation of next word)
T2     'Brainstorm' for 'Brainstem'
T2     'Harrap'S' for 'Harrap's'
T2     'Historical' capitalised
T2     subtitle incorrectly recorded as series
T2     missing colon before subtitle
A2     unnecessary 'and'
A2     'by' omitted, 'Maxwell' for 'Maxfield'
A2     'Putzgers' for 'Putzger' [German genitive in title]
H2     'Putzgers' for 'Putzger' [German genitive in title]
H2     'Internatonal' in conference heading
H3     corporate author omitted
H3     corporate author omitted
I2     no place
I2     no place
I2     UK place omitted
I2     UK place omitted
I2     '0xford' with initial zero
I2     Cambridge, Mass. taken as UK place
I2     US publishing details, not UK
I2     incorrect date
I3     no date (minimal record)
I3     no date (minimal record)
I3     no date (minimal record)
I3     no date (minimal record)
I3     no date (minimal record)
S3     series omitted
S3     series omitted
S3     series omitted

Table 4. List of errors found in the test on the second sample

