Auditing catalogue quality by random sampling

4. Commentary on the proposed technique

4.1 The scope of the audit

This technique gives an overall figure for the number of records containing at least one error, either a keyboarding error of some kind, incomplete data, or an omitted field. The quality of the catalogue is evaluated in the library's own terms, that is, counting deviations only from the library's standard, which may not coincide with an external standard such as third-level description in AACR2. At the extreme, a collection of juvenile fiction and a research-level university collection might be catalogued to very different standards. It is helpful for this to have a profile of the library's cataloguing and collection policy, as envisaged by Trickey (1999), describing the nature of the collection, past cataloguing regimes and reclassifications, and salient local policies. Known idiosyncrasies should be tolerated in the same way that a misspelling is not a cause for demerit when marked by [sic] or [i.e., ...].

The intended use of the results should be considered before they are collected, because this may affect the detail and quantity of data needed and the method of collection. The tolerable error rate should also be decided before auditing, since beyond a certain level the utility of correcting catalogue errors becomes marginal. It may suffice to know only that this threshold has been exceeded, without knowing the actual proportion of errors or their types, and for this, methods of sequential analysis (DiCarlo & Maxfield 1988) can save time.

The audit is intended as an overview of the whole catalogue but it can easily be adapted to concentrate on a specific part of the collection. This is especially useful when a date is associated with the creation or modification of a catalogue record, in which case the audit can be used to monitor current cataloguing by sampling only those records generated in, say, the last month. Variations over time can also demonstrate recent improvements in quality, or help decide whether it is worth upgrading older, minimal-level records from retrospective conversion.

4.2 Choosing the sample

The set of catalogue records to be evaluated is called the population. In many cases, the population will be all catalogue records, but sometimes it will be of interest to evaluate only part of the collection, perhaps one perceived as more prone to cataloguer error; in these cases the population needs to be clearly delimited. Libraries whose catalogues cover more than one physical site, such as public library services with a central library and many branch libraries, must decide whether it is necessary to check an item at the site chosen at random or whether it is permissible to check a copy held at a convenient location.

Records for items on order which have not yet been processed should be removed from the population. This should not be done by eliminating all records to which no copies are attached, however, because such 'ghost' records, other than orders, are to be considered errors. Similarly, it is desirable to remove records for items which are being bound or repaired, or marked as missing.

It may be necessary to print only a minimum of bibliographic information, e.g. control number, title, first author entry and shelfmark, but if at all practical, the full bibliographic record should be printed, in MARC format if this is used. A printout in shelf order saves a great deal of time.

There are three types of sample: the probability or random sample, the convenience sample in which records are chosen according to availability rather than at random and the purposive sample in which records known or suspected to be inaccurate are chosen. Convenience and purposive samples are not strictly statistically valid, but are sometimes necessary when a random sample is unfeasible. Random samples are susceptible to a mathematical analysis, and divide into three types, the simple random sample, the systematic sample and the stratified sample.

4.2.1 The simple random sample

A simple random sample consists of items selected from the population of catalogue records in such a way that every record has an equal non-zero probability of being selected. With a sufficiently large population - more than a thousand items - the difference between sampling with and without replacement is negligible, that is, it does not matter if items selected are removed from the shelf.

Simple random sampling guarantees a sample free from bias, but is often impractical. A computerised library system should, in theory, be capable of producing a simple random sample from the catalogue. One method is to generate random numbers using a computer program or printed random number tables and use these to select records by control number, bearing in mind the caveats about the application of random numbers in Drott (1969).
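As a minimal sketch of the computer-generated approach, assuming the control numbers can be exported from the library system as a list, the standard Python function random.sample draws records without replacement so that every record is equally likely to be chosen. The function name simple_random_sample and the example run of control numbers are hypothetical.

```python
import random

def simple_random_sample(control_numbers, sample_size, seed=None):
    """Draw sample_size distinct control numbers, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # Sorting is only a convenience for working through a printout in order.
    return sorted(rng.sample(list(control_numbers), sample_size))

# Hypothetical catalogue whose control numbers run from 100000 to 104999.
catalogue = range(100000, 105000)
sample = simple_random_sample(catalogue, 451, seed=1)
print(len(sample), sample[:5])
```

Recording the seed alongside the results allows the same sample to be regenerated if the printout is lost.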

4.2.2 The systematic sample

An alternative is the systematic sample in which records are chosen from the population at a fixed interval. Care needs to be taken when using a systematic sample, as any repeated structure in the population can lead to distortion: for example, the width of a book might be correlated with the height of the shelf, so sampling every sixth shelf will get either all reports or all encyclopaedias; taking every seventh day's cataloguing might be biased if staff tend to be more accurate at the start of the working week. If, however, there is no periodic structure then a systematic sample is an effective random sample.

The required interval, or sampling fraction, can be determined by dividing the size of the population (estimated if necessary) by the sample size (see section 4.2.4). It is standard practice to choose a random number smaller than the sampling fraction and count this many records before starting, as otherwise the first record in the population would always be included.
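The procedure can be sketched as follows, assuming the population is available as a list in some fixed order. The function name systematic_sample and the example figures are hypothetical; the interval is rounded down, so the sample may come out slightly larger than requested.

```python
import random

def systematic_sample(records, sample_size, seed=None):
    """Take every k-th record after a random start, where k is the interval."""
    interval = max(1, len(records) // sample_size)  # population / sample size
    rng = random.Random(seed)
    start = rng.randrange(interval)  # records to skip first: 0 .. interval - 1
    return records[start::interval], interval, start

# Hypothetical population of 5000 records, aiming for about 451 of them.
records = list(range(1, 5001))
chosen, interval, start = systematic_sample(records, 451, seed=1)
print(interval, start, len(chosen))
```

Because the starting point is random, the first record in the population is included only some of the time rather than always.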

A systematic sample can easily be generated if there is any kind of sequential control numbering, provided that there are no gaps in the run of numbers (gaps arise, for example, if material has been withdrawn and its control numbers not reused). If the number picked no longer corresponds to any item then another number should be drawn, independently and at random, until one is found which has been used. Selection can be performed either by computer or by hand with the aid of a printout. The use of a printout can be simplified by taking a sampling fraction equal to the number of records on a page and selecting records a fixed distance down each page, although Bookstein (1974) points out that if pages contain different numbers of records, especially if the last page is not full, then bias will be introduced.
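The redrawing rule for unused numbers can be sketched as below. This is an illustration only: the function name draw_used_numbers is hypothetical, the set of numbers still in use is assumed to be obtainable from the library system, and repeated draws of the same number are also redrawn so that no record is checked twice.

```python
import random

def draw_used_numbers(lowest, highest, in_use, sample_size, seed=None):
    """Draw random control numbers, redrawing any that are unused or repeated."""
    available = sum(1 for n in in_use if lowest <= n <= highest)
    if sample_size > available:
        raise ValueError("sample size exceeds the number of records in use")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < sample_size:
        candidate = rng.randint(lowest, highest)  # both ends inclusive
        if candidate in in_use:  # a gap (withdrawn record) is simply redrawn
            chosen.add(candidate)
    return sorted(chosen)

# Hypothetical run 1..2000 in which every seventh number has been withdrawn.
in_use = {n for n in range(1, 2001) if n % 7 != 0}
picked = draw_used_numbers(1, 2000, in_use, 50, seed=1)
print(picked[:10])
```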

A systematic sample from the shelves is a popular technique but it restricts the population to those items not on loan. For a non-lending collection, or a library in which few items are on loan (perhaps during a closed period), it is justifiable to equate the whole collection with those items present in the library. Hence a systematic sample can be produced by taking an item from each shelf, or each bay of shelving. Books, especially, vary greatly in width, so the sampling interval should be of a fixed number of items rather than a fixed distance along each shelf as with the latter technique the probability of selection is proportional to an item's thickness.

Systematic sampling by classmark is legitimate because if few or no items are absent from the library then it is equivalent to systematic sampling from the shelves. It may be difficult to guarantee a fixed interval between classmarks, depending on the classification scheme (for example, choosing the fortieth item in classes K, L and M respectively is invalid because the classes are unlikely to be of equal sizes), so selection from a printout in classmark order is safest.

4.2.3 The stratified sample

It is sometimes convenient to partition the population into disjoint subpopulations such as books and non-book material, or books and serials, or adult and children's collections, or different subjects, and to sample each; this also allows results to be collected separately for each subpopulation, focusing on known trouble spots. A stratified random sample is a set of samples taken from separate subpopulations (even if they do not comprise the whole population) and it is valid to pool the results to generate statistics for the union of these subpopulations provided there is no overlap.
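One way to pool results, sketched here under an assumption the text does not spell out, is to weight each subpopulation's error rate by the size of that subpopulation, so that a heavily sampled small stratum does not dominate the overall figure. The function name pooled_error_rate and all figures are hypothetical.

```python
def pooled_error_rate(strata):
    """Pool error rates from non-overlapping strata, weighted by stratum size.

    strata: list of (population_size, sample_size, records_with_errors).
    """
    total_population = sum(size for size, _, _ in strata)
    weighted = sum(size * (errors / n) for size, n, errors in strata)
    return weighted / total_population

strata = [
    (40000, 300, 30),  # books: 30 of 300 sampled records had an error
    (10000, 200, 40),  # non-book material: 40 of 200 sampled records
]
rate = pooled_error_rate(strata)
print(round(rate, 3))
```

Simply adding the two samples together would give 70 errors in 500 records (14%), overstating the rate because non-book material is sampled more heavily here than its share of the collection.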

Other examples of subpopulations of interest include reference material, audiovisual material, material in languages other than English, local history, government reports and other grey literature; records created by outsourced cataloguing, or retrospective conversion, or downloaded from a co-operative cataloguing system, or created in-house; all records excluding serials, or records created before or after a certain date.

It is sometimes difficult to identify a subpopulation clearly, so some approximation will be necessary (e.g. to find Russian-language material, sample the classmarks for Russian literature and exclude items within that range which are translations or critical works in English). If the computer system cannot restrict the sample to the subpopulation of interest then it is acceptable to sample the whole population and ignore irrelevant records, but this will obviously require a larger sample size.

4.2.4 Sample size

There are two distinct situations in choosing the size of the sample. The first is when constraints of time or staffing or other practical considerations limit the number of items that can be sampled to around 300 or fewer. In this case, the sample size should be fixed in advance and the margin of error calculated subsequent to the audit (see section 4.6). Otherwise, the required level of precision in the estimate should be chosen and used to calculate the minimum adequate sample size. The better the required estimate, the larger the necessary sample size.

Formulas for sample size can be found in statistics textbooks, often with confusing differences in terminology; alternatively, ready-reckoner tables are available (e.g. Drott 1969). Some formulas include the size of the population, but if the population is sufficiently large that less than 10% of records are to be sampled then the required sample size can be considered independent of the size of the population, and even if this is not the case the effect is a slight overestimate of sample size. It is important to note that 'one cannot compensate for bias with increased precision' (Bookstein 1974, p. 129): increasing the sample size will not validate a biased design.

The formula recommended is (Hernon 1994, p. 175):

    p x (1-p) x z²
n = --------------
          E²

Simply plug in values for p, z and E; the sample size is n, rounded up to the next whole number. Notice that the sample size is independent of the size of the population but does depend on the estimated proportion of errors. (Hernon gives references for other formulas; Clark (1984) gives a more sophisticated discussion.)

z is a standard normal score, that is, the number such that 95% (say) of a standard normal distribution falls within z of the mean (between -z and +z). The percentages here are confidence levels for the result: a confidence level of 95% means that about 95 out of 100 samples can be expected to produce results within the specified margin of error of the true proportion. Values of z can be found in standard statistical tables; typical values are, for 90%, z = 1.65; for 95%, z = 1.96; and for 99%, z = 2.58.
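Where tables are not to hand, z can be computed. The sketch below assumes Python 3.8 or later, whose standard statistics module provides NormalDist, and treats the confidence level as two-sided, which is the case the typical values quoted above correspond to. The function name z_score is hypothetical.

```python
from statistics import NormalDist

def z_score(confidence):
    """Two-sided z: the central `confidence` share of a standard normal
    distribution lies between -z and +z."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_score(level), 3))
```

The computed value for 90% is approximately 1.645, which printed tables usually round up to 1.65.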

p is the proportion of errors in the population, expressed as a number between 0 and 1 by dividing a percentage by 100 (for example, an error rate of 7% translates to p = 0.07). Inconveniently, this proportion is precisely what is to be determined, so p must be estimated either by intuition or by taking a very small presample of records (perhaps a dozen). Alternatively, it can be shown that the maximum sample size occurs when p = 0.5, which value guarantees a safe sample size, but in the present context p is likely to be considerably smaller so this conservative estimate is wasteful.

E is the tolerable margin of error in the result, similarly expressed as a number between 0 and 1. If the error should not exceed 3%, so that the result given by the sample is no less than 3% below and no more than 3% above the true proportion (at least, as often as provided for by the choice of confidence level), then E should be set at 0.03. Lowering the tolerance increases the sample size and vice versa.

For example, if a rough estimate for the proportion of errors is 12% and the acceptable margin of error is 3% either way, with 95% confidence (that is, a 1-in-20 chance of the true value lying outside these limits), the requisite numbers are p = 0.12, E = 0.03 and z = 1.96. Putting these in the formula above gives n = 450.747..., so the minimum sample size is 451 items. For comparison, the most conservative estimate, when p = 0.5, is 1068.
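The calculation can be checked with a few lines of code. This is a direct transcription of the formula in this section; only the function name sample_size is an invention.

```python
import math

def sample_size(p, E, z=1.96):
    """Minimum n = p(1-p)z^2 / E^2, rounded up to a whole record."""
    return math.ceil(p * (1 - p) * z ** 2 / E ** 2)

n_estimated = sample_size(0.12, 0.03)    # worked example: 451
n_conservative = sample_size(0.5, 0.03)  # most conservative case: 1068
print(n_estimated, n_conservative)
```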

4.2.5 Problems in collecting the sample

A random sample of any of the above three types may be impractical. One convenient alternative is to take a sequence of records forming an exhaustive sample. The validity of this technique depends crucially on the population being reasonably homogeneous. For example, a run of the first five hundred books from a certain call number is likely to be unrepresentative because records may vary in complexity across subjects, but the first five hundred books catalogued after a certain date are acceptable unless the books were grouped in some way before cataloguing.

When an item can neither be located on the shelf nor recalled for checking, it can be marked as missing, but this reduces the sample size. Similarly, if some subpopulation such as items on order needs to be ignored but cannot be extracted from the sample, then when an item from that subpopulation occurs, should auditors ignore it or should they substitute another item from the desired subpopulation?

The problem in the first case is that the characteristics of items which are presumed missing may be different from those which are on the shelf or on loan. One possibility is to record the number of items unchecked and to calculate the proportion of errors when their records are assumed all to be correct and when they are assumed all to contain one error. The real figure must lie between these two extremes, but they may be so far apart as to render the estimate useless.
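The two extremes can be computed as in the following sketch. The function name error_rate_bounds and the figures are hypothetical; the unchecked items are counted as part of the sample in both cases.

```python
def error_rate_bounds(checked, with_errors, unchecked):
    """Bounds on the proportion of records with errors when some items
    in the sample could not be checked."""
    total = checked + unchecked
    lowest = with_errors / total                 # unchecked records all correct
    highest = (with_errors + unchecked) / total  # unchecked records all in error
    return lowest, highest

low, high = error_rate_bounds(checked=430, with_errors=52, unchecked=21)
print(round(low, 3), round(high, 3))
```

The gap between the two bounds is exactly the proportion of the sample left unchecked, which shows how quickly the estimate loses value as that proportion grows.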

It is not acceptable to substitute an item drawn for convenience, such as the next item on the shelf, because this may introduce a systematic bias. The ideal procedure is to draw another item at random. However, if there are multiple, identical copies of the missing item then it is acceptable to check the catalogue record against another copy. There is a possibility that two different editions of a book, say, will be attached incorrectly to the same record, but the probability of this may be assumed to be slight.

4.3 The catalogue-to-collection test for accuracy

Reference should be made to the catalogue profile to decide if any fields can be omitted from the checklist. There is space at the bottom of the worksheet for miscellaneous issues such as the absence of a uniform title or inappropriate subject headings. These should not be recorded as errors unless auditors are specifically asked to look for them, but they will be useful when the catalogue is corrected. More than one error in a field should be recorded as a single error on the worksheet so that the incidence of errors in that field is not exaggerated.

It is best to decide before auditing on the status of aberrant spacing, punctuation, capitalisation, diacritics and the spelling out of numbers and symbols such as ampersands. Typographical errors can be much more serious in online catalogues than in card catalogues because human filers can often compensate for a minor error whereas computers can seldom even recognise the error. Inconsistent punctuation is a particular problem with computer filing: there may be a great displacement between 'cooperation' and 'co-operation'. Incorrect capitalisation, by contrast, has no effect on access at all (except in such rare ambiguous cases as 'A History Of Reading') and diacritics do not affect filing order in English (although they do in other languages, e.g. German umlauted letters expanded by appending e or Scandinavian languages which place å at the end of the alphabet).

Punctuation between fields is prescribed in AACR2, following the practice set down in the various ISBDs, although in UKMARC such punctuation is implied by subfield markers and so should be correct. AACR2 leaves most punctuation within fields to the discretion of the cataloguer. Exceptions include the use of square brackets for information supplied by the cataloguer and the replacement of ellipses (dots marking omission: …) with dashes. Another exception to consider is the correct placement of apostrophes, although many systems ignore apostrophes for keyword access.

The checklist can be used simply as a tally chart, but the value of the audit is increased by recording errors for later correction, rather than just the number of errors. This can be achieved by completing a separate worksheet for each item including its control number. Either the worksheet can be used in the form provided or the printouts can leave space for recording errors next to the catalogue record, using the suggested codes (e.g. T2 for an incorrect or incomplete title).

4.3.1 Control number

The control number recorded should be the accession number or some other number that uniquely identifies an item, so ISBNs are unsuitable because there could be multiple copies of a book. Most cataloguing software will automatically validate ISBNs but they are sometimes attached to the wrong item, especially a paperback rather than hardback edition of a book, so if the ISBN is an access point then it may also be considered worth checking.

4.3.2 No holdings

The sample should be taken from the population of all records in the database rather than the (usually larger) population of all holdings. Occasionally, an item is withdrawn but its record remains in the catalogue with no holdings attached. These 'ghost' records frustrate readers and reduce the overall quality of the catalogue so their presence should be noted.

4.3.3 Title

The test is simplified by unifying title proper and subtitles as 'title' even though this prevents recording of such recondite errors as including an alternative title as a subtitle instead of title proper. Title is a mandatory field in most computerised library systems but the option of recording it as missing is retained in case this is not universal. The absence of parallel titles can be recorded here if desired.

Keyboarding errors in any part of the title will usually affect access, but not if the word in question is a stopword such as 'from' or 'with' (unless it occurs at the start of a title, which would still not frustrate a keyword search).

4.3.4 Material description

The choice of material description can be subjective but it is probably easier to find consensus on the application of this field than other fields such as subject headings, so clearly wrong entries can be recorded as such. Naturally, the omission of material description when it is expected and any errors in spelling should both be recorded.

4.3.5 Statement of responsibility and author heading(s)

The inclusion of separate checks of statement of responsibility and author headings is intended to discriminate between simple errors in transcription and errors in establishing a correct heading which may differ from the form transcribed. Of course, a few items legitimately lack either or both fields.

Romero (1994) found that errors in the statement of responsibility were often the addition of words such as 'by' not present on the title-page. A mistranscribed name might be copied from the statement of responsibility to an incorrect heading, so that the error is counted twice, but this does not seem misleading.

A record may have more than one author heading or a mixture of personal and corporate authors. General guidance is to record at most one error per field, but an alternative is to verify only main entries, if these can be distinguished (for example by inspecting 1xx fields rather than 7xx fields in MARC).

4.3.6 Edition

In most cases it will be straightforward to check that edition statements are present and correct. There is sometimes difficulty in distinguishing true editions from impressions or reprintings and the correctness of the record may be a matter for individual judgement. Incorrect abbreviation was the largest source of error in edition statements in the study by Romero (1994).

4.3.7 Physical description

Physical description, also called collation or extent, is laborious to check, especially the height of an item, multiple pagination sequences and the presence or otherwise of plates and other enriching characteristics. Since the physical description of books is often considered of little importance for readers, a library may decide that it is only worth checking for non-book items, but it may be relevant if pagination decides the location of a pamphlet or if height dictates whether an item counts as outsize.

4.3.8 Imprint

The three fields (place of publication, publisher and date of publication) that make up the publisher's imprint are considered as one for simplicity, a common practice in the literature. An incorrect entry for publisher will often entail an incorrect place of publication being recorded and often an incorrect date.

Intner (1989) and Romero (1994) found misapplication of AACR2 in this area to be a widespread problem, although the only implication for access of an incorrect imprint is the inability to limit a search by date, which will be impeded by a mistyped date or the date of a reprint.

4.3.9 Series

In most library systems, the series is an access point, even if it is little used. One objection to assuming that errors in the series field will affect access is that not all series are traced. UKMARC has separate fields for recording series according to whether they are traced (440 or 840) or not (490), and in principle the choice of MARC field should be noted. However, widespread computerisation of cataloguing has introduced a tendency to trace more series than before, including publishers' series, so this objection is less forceful.

Subseries need no special treatment. For items bearing more than one series, an error in any of them is sufficient to count as an error and at most one error should be recorded. No distinction is made between series statements in 440 or 490 and series headings in 840, all of which should be verified.

4.3.10 Classmark

The classmark is not intended to be checked for appropriateness, although errors in its construction should be recorded if identified. If the classmarks on the item and its record do not correspond then it will prove difficult to locate the item for checking, so the number of items with incorrect classmarks may be underestimated and missing items overestimated. It is sensible to check for the absence of collection designators such as REF, JUN or a marker of oversize books.

4.3.11 Subject heading(s)

It is intended that subject headings be marked as incorrect only if they are improperly formed or inadmissible (in the scheme of subject headings in use), or if they contain a keyboarding error. Again, an error in any one of multiple headings, and at most one error, should be recorded. The absence of a useful subject heading is a matter of opinion and 'missing' should be marked only if there are none at all yet some were expected (for example, fiction often has no subject headings).

Subject headings and Dewey or Library of Congress classmarks can be compared with cataloguing-in-publication data if provided, or against an appropriate national bibliography, but neither is an entirely satisfactory substitute for comparison with a subject authority file. Outdated headings reduce catalogue quality by introducing inconsistencies, but their detection may require an experienced cataloguer.

4.3.12 Genre / category

This field is included for fiction collections which are broadly categorised. It is expected that the choice will be of a few reasonably discrete entries in this field. Therefore it makes sense to note an inappropriate entry as incorrect in a way which is less justifiable for subject headings.

4.3.13 Location / branch ID

In public libraries, fiction stock is often rotated around branches to keep collections fresh, while academic libraries switch items between long and short loan. The location may be coded in the classmark, making a separate check superfluous; as with classmarks, the item may not be found for checking if the information is wrong.

This field permits a partial check on information on holdings, which is slightly beyond the scope of the audit.

4.3.14 MARC fields

The checklist gives textual descriptions of fields rather than MARC codings since some libraries may not use MARC or staff familiar with the format may not be available. However, there is a correspondence between the descriptions and ranges of UKMARC fields, as follows:

Checklist field                UKMARC fields
Title                          245 $a, $b, 246, 248
Material description           008 $p, 245 $z
Statement of responsibility    245 $d, $e, $f
Author heading(s)              100, 110, 111, 700, 710, 711
Edition                        250
Physical description           300
Imprint                        260
Series                         440, 490, 840, 890
Classmark                      050, 080, 081, 082
Subject heading(s)             6xx
Language                       008 $l, 041
Genre / category               655
Location / branch ID           [no specific field]

Table 1. MARC equivalents for the fields in the checklist

If MARC is used then additional checks of tags, indicators and subfields may be desirable. For example, indexing is affected by incorrect filing indicators in fields 245 or 440 or missing subfields for articles and prepositions in names. The material description and language fields should be consistent with their coding in the 008 field. Obviously, different versions of MARC will produce different problems: in MARC21, for example, $b and $c do not imply the ISBD punctuation for subtitles and statement of responsibility and it may accidentally be omitted.

4.3.15 Other fields

Notes fields have not been included in the checklist even though a substantial number of errors may occur in notes. They have been excluded precisely because their diversity, multiplicity and length introduce enough errors to skew the results. However, this policy may be mistaken, and examples such as contents notes (especially in music) may be sufficiently important to reinstate a check on notes.

There is also no attempt to verify the information in fixed fields (in particular, the 008 field in MARC). It was felt that where these were of interest to readers, the contents would be duplicated elsewhere in the record. Moreover, it is laborious for humans to check fixed fields and many systems already validate their contents automatically.

4.4 The collection-to-catalogue test for intactness

There are different degrees of accuracy but each item is either present or absent from the catalogue. The test for intactness, the percentage of items that have catalogue records, is correspondingly simple. It suffices to take a random sample of items and check whether each one is represented in the catalogue, the quickest way being to search by a control number such as an ISBN. The random sample is created by examining each item five positions to the left of items in the sample generated for the catalogue-to-collection test.

Inaccuracies in the catalogue record can be noted for correction but are irrelevant to this test. The only reason for relating this test to the catalogue-to-collection test is to save time by performing the two procedures in tandem. While it is not necessary to follow Kiger & Wise (1996) in advocating a fresh sample for this test, it is certainly invalid to halve the sample size by checking the cataloguing of items selected for the collection-to-catalogue test and combining the results with those from the catalogue-to-collection test. This practice would double the effect of any bias in the original sample on top of the unavoidable bias of neglecting books on loan.

The population here is not all items owned by the library but all items on the shelves, thus excluding those on loan, in processing or awaiting shelving. In principle, there is an inevitable bias if the library loans material, but it is reasonable to assume that no items outside the library are unrecorded because such items should be detected at the point of issue. Even if this were not the case, items on loan are more likely to have catalogue records because unrecorded items are less likely to be found by a reader. So this inherent bias is justifiable because it is known that the test will underestimate intactness. In contrast, there is a slight overestimate from items held in multiple copies, such as scientific textbooks in a university library, which increase the chance that the item five to the left will be the same as that from the catalogue-to-collection test, therefore certainly present in the catalogue.

4.4.1 Problems determining the sample

The key to the collection-to-catalogue test is finding a way to select items which may be missing from the catalogue. The choice of five positions to the left is entirely arbitrary, and 'left' is to be interpreted broadly but consistently. If there are fewer than five items to the left then counting should continue at the right-hand end of the shelf above, or of the bottom shelf of the preceding bay. If there is no bay to the left then the leftmost item can be substituted, with only insignificant bias. 'Behind' can be substituted for 'left' with items stored front-to-back, such as vinyl records or large-format children's books. Misfiling should be ignored; items piled on their sides or otherwise disordered can be ignored, or treated as if shelved correctly.
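The counting rule can be illustrated with a small sketch, assuming the shelves are represented as lists of items in shelving order, so that stepping left past the start of one shelf continues at the right-hand end of the shelf before it. The function name item_five_left and the item labels are hypothetical.

```python
from itertools import chain

def item_five_left(shelves, item, offset=5):
    """Return the item `offset` places to the left of `item`.

    shelves: lists of items, each in left-to-right order, first shelf first.
    """
    run = list(chain.from_iterable(shelves))  # one continuous left-to-right run
    position = run.index(item)
    # If fewer than `offset` items remain to the left, use the leftmost item.
    return run[max(position - offset, 0)]

shelves = [["A1", "A2", "A3"], ["B1", "B2", "B3", "B4"], ["C1", "C2"]]
print(item_five_left(shelves, "C1"), item_five_left(shelves, "B2"))
```

In this example the item five to the left of C1 is A3, reached by counting back across the whole of the middle shelf, while B2 has only four items to its left and so the leftmost item, A1, is substituted.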

This method of selection does not respect subpopulations: for example, if books and videocassettes are interfiled but only videocassettes are of interest, it is likely that the sample derived will include books. In these circumstances the rule should become 'the first appropriate item to the left'.

If the item in the catalogue-to-collection test was not on the shelf for some reason (missing, on loan or a ghost record) then counting should start from the position where it would have been shelved, or, in the case of classmarks which do not determine unique shelfmarks, the leftmost such position. It may prove simpler to ignore unavailable items provided the reduction in sample size is small.

A subtle difficulty arises when two items in the original sample are five places or fewer apart because removing the item corresponding to the one on the right affects which item corresponds to the one on the left. The best remedies are to check items in strict classmark order or to replace items immediately after checking rather than dealing with them in batches. However, the bias introduced by ignoring the problem entirely is negligible unless there is some special structure to the shelf arrangement, which would probably invalidate the sampling procedure in any case.

If there are multiple copies of a work and the sample does not distinguish between records and holdings then a consistent tie-break procedure is necessary, for example, choosing the leftmost item to minimise the chance that the item five to its left is the same. Policy when a volume from a multivolume set is encountered should similarly be consistent, whether counting a set as one item or treating each volume individually.

4.4.2 Duplicates

The collection-to-catalogue test has been combined with a rudimentary search for duplicate records, since duplicates or near-duplicates may well be found when searching by different access points. Only the existence of duplicates needs to be recorded for the audit, but it is sensible to note their location for later removal.

Confirming that two records really are duplicates requires a careful check of edition and publication statements and may be possible only by reference to the shelves. It is important to check the location, especially if multiple copies of an item may be held in different sites. If duplicates are thought to be a major problem then it is more efficient to go straight to the substantial literature on their detection.

4.5 Staffing issues

Any member of staff can locate items in the library, recall those on loan (provided they have access to the reservation system) and perform the collection-to-catalogue test. It is entirely possible for more than one member of staff to work on the sample separately and concurrently. However, comparison of items with records and confirmation of duplication require professional staff, presumably cataloguers: although the need for value judgements has been kept to a minimum, familiarity with cataloguing rules is still necessary. This raises the issue of consistent practice between staff, which could be verified in a small presample. If suitable staff are not available at all sites then either staff or the sample will need to be moved.

When planning staff allocation, time should be allowed for the printing and distribution of a sample, the tests themselves, the return of recalled items and so on. An academic library can perform the test in the summer vacation but there are fewer quiet periods in other libraries and more recalls might be necessary. That said, it is not actually necessary to close the library during the audit: unless the sample makes up a very large proportion of the collection, it will not be affected by items being borrowed or returned. Even additions and alterations to the catalogue are unlikely to make a significant difference, although strictly speaking the results will be valid only for the date on which the sample was drawn.

4.6 Compiling the results

The table in the worksheet is suitable for displaying the totals. Alternatively, the data can be input into a spreadsheet or database capable of simple manipulations, either in full detail (allowing flexible interpretation) or just as a tally of errors in each field. The number of uncatalogued items found can simply be totalled. The principal figure to calculate is the proportion of records with at least one error.

The number of records with incorrect data, rather than the absolute number of errors, is usually simpler and clearer to provide, and more appropriate because the population comprised all catalogue records rather than all fields in all records. It is not clear whether errors are independent of each other or whether the presence of one error in a record makes another more likely, but the proportion of records with more than one error is not greatly informative. Other statistics such as the mean number of errors per record, or per incorrect record, may be derived as desired.
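As an illustration only, the statistics named above could be derived from a tally along the following lines. The `CheckedRecord` structure and the `summarise` function are hypothetical names introduced for this sketch; the worksheet itself does not prescribe any particular data layout.

```python
from dataclasses import dataclass


@dataclass
class CheckedRecord:
    """One sampled catalogue record (hypothetical layout, not the worksheet's)."""
    record_id: str
    error_count: int  # number of errors found in this record


def summarise(records):
    """Return the principal audit figures for a list of CheckedRecord."""
    n = len(records)
    if n == 0:
        raise ValueError("sample is empty")
    incorrect = [r for r in records if r.error_count > 0]
    total_errors = sum(r.error_count for r in records)
    return {
        "sample_size": n,
        "records_with_errors": len(incorrect),
        # The principal figure: proportion of records with at least one error.
        "proportion_with_errors": len(incorrect) / n,
        "mean_errors_per_record": total_errors / n,
        "mean_errors_per_incorrect_record": (
            total_errors / len(incorrect) if incorrect else 0.0
        ),
    }


# Invented data, purely to show the call.
sample = [
    CheckedRecord("a", 0),
    CheckedRecord("b", 2),
    CheckedRecord("c", 1),
    CheckedRecord("d", 0),
]
summary = summarise(sample)
print(summary["proportion_with_errors"])
```

The denominator throughout is the number of records rather than the number of fields, in keeping with the population from which the sample was drawn.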

4.6.1 Margin of error

Statistics should always be quoted with the relevant margin of error. This may have been chosen when the sample size was determined, but if the sample size was fixed by practical constraints then it will be necessary to calculate the error at this stage.

The margin of error E is given by this formula (multiply it by 100 for a percentage):

\[ E = z \sqrt{\frac{p(1 - p)}{n}} \]

This formula is simply a rearrangement of that for sample size: z and n are as in section 4.2.4 and p is the proportion of errors found in the sample. Upper and lower limits for the estimate can be found from p by adding and subtracting the margin of error. For example, with n = 451, z = 1.96 as at the end of section 4.2.4 and the incidence of errors in the sample found to be p = 0.07 or 7%, the margin of error is 2.4%, i.e. the true proportion lies somewhere between 4.6% and 9.4%. If the size of the collection is known then it can be multiplied by the proportion of incorrect records, and by the margin of error, to give an indication of the absolute number of records affected.
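The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated: the function names are chosen here for illustration, and the collection size of 100,000 records is an invented figure, not one taken from the text.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Margin of error E = z * sqrt(p * (1 - p) / n), as a proportion."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return z * math.sqrt(p * (1 - p) / n)


def limits(p, n, z=1.96):
    """Lower and upper limits for the estimate, clipped to the range 0..1."""
    e = margin_of_error(p, n, z)
    return max(0.0, p - e), min(1.0, p + e)


p, n = 0.07, 451                 # figures from the worked example
e = margin_of_error(p, n)
low, high = limits(p, n)
print(f"margin of error: {e:.1%}")            # 2.4%
print(f"limits: {low:.1%} to {high:.1%}")     # 4.6% to 9.4%

# Scaling to absolute numbers; the collection size is hypothetical.
collection_size = 100_000
print(f"records affected: roughly {round(low * collection_size)} "
      f"to {round(high * collection_size)}")
```

Quoting the absolute range rather than a single figure keeps the uncertainty visible to whoever acts on the result.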

Technically, it is the upper and lower limits for the proportion of errors which are random, as they depend on the results of a random sample; with a confidence level of 95%, that proportion of samples will give limits which enclose the true proportion. It is much simpler to state that there is a 95% chance that the true proportion lies somewhere between 4.6% and 9.4%. (This is technically misleading because the true proportion is usually considered not random but fixed and unknown.)
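The point that the limits, not the true proportion, are random can be seen in a small simulation. The sketch below assumes a fixed true proportion of 7% and draws records independently (that is, it ignores the finite size of the catalogue); the function names, the number of trials and the seed are arbitrary choices made for illustration.

```python
import math
import random


def margin_of_error(p, n, z=1.96):
    """Margin of error E = z * sqrt(p * (1 - p) / n), as a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def coverage(true_p, n, trials, seed=0, z=1.96):
    """Fraction of simulated samples whose limits enclose true_p."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    hits = 0
    for _ in range(trials):
        # Each sampled record is in error with probability true_p.
        errors = sum(rng.random() < true_p for _ in range(n))
        p_hat = errors / n
        e = margin_of_error(p_hat, n, z)
        if p_hat - e <= true_p <= p_hat + e:
            hits += 1
    return hits / trials


rate = coverage(true_p=0.07, n=451, trials=2000)
print(f"{rate:.1%} of simulated samples gave limits enclosing the true proportion")
```

The true proportion never changes between trials; only the limits move, and the fraction of trials in which they enclose it should be close to the chosen confidence level.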

4.6.2 Major and minor errors

Not all keyboarding errors or missing fields have the same effect on the utility of the catalogue. Considerations of the gravity of errors have deliberately been excluded from the checklist for simplicity, but a broad classification of errors is still possible. A simple division into major errors, which affect access, and minor errors, which do not, is proposed. Words with miskeyings other than capitals, punctuation or diacritics in any field, omitted words in titles and series, and missing author headings and series all greatly obstruct access, as do incorrect classmarks. These can be considered major errors, with all others relegated to the status of minor errors. A valid objection is that the checklist does not ask for even this level of detail to be gathered, so the division is only possible when the full record has been printed.
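If errors have been recorded in enough detail, the proposed division might be applied mechanically, as in the following sketch. The error labels are hypothetical names invented here to stand for the categories listed above, and grading a record by its most serious error is an assumption of the sketch rather than a rule stated in the text.

```python
# Hypothetical labels for the error types described above as major.
MAJOR_KINDS = {
    "miskeyed_word",           # miskeying other than capitals, punctuation, diacritics
    "omitted_title_word",
    "omitted_series_word",
    "missing_author_heading",
    "missing_series",
    "incorrect_classmark",
}


def error_severity(kind):
    """Classify a single error label as 'major' or 'minor'."""
    return "major" if kind in MAJOR_KINDS else "minor"


def record_severity(kinds):
    """Grade a record by its most serious error (an assumption of this sketch)."""
    if not kinds:
        return "correct"
    if any(error_severity(k) == "major" for k in kinds):
        return "major"
    return "minor"


print(record_severity(["capitalisation", "incorrect_classmark"]))  # major
print(record_severity(["punctuation"]))                            # minor
print(record_severity([]))                                         # correct
```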

The distinction is not perfect: series are used less than other access points, and keyboarding errors can decrease recall by affecting filing order. Other operations in an integrated library system can be affected by keyboarding errors, for example when an item is ordered twice because the first order record cannot be found. Conversely, very high-frequency words such as 'English' or 'history' can contain a keyboarding error without necessarily frustrating access, unless used to restrict a search.

A variation on this theme is to follow Cahn (1994) in distinguishing 'redundant' errors where the correct form appears elsewhere in the record and 'unique' errors where it does not and access is lost completely.


Owen Massey McKnight <owen.mcknight@worc.ox.ac.uk>