Fiji RHD Mortality Study: Probabilistic Record-Linkage Procedure

Thank you for accessing the publicly available materials of the Fiji RHD Mortality Study.

Overview of Procedure

These scripts run a probabilistic record-linkage procedure in Stata. Note that they were not written by a specialist programmer. Broadly, the process mirrors standard record-linkage efforts. Details of the methods are described in Parks T et al. (2015) Rheumatic Heart Disease-Attributable Mortality at Ages 5–69 Years in Fiji: A Five-Year, National, Population-Based Record-Linkage Cohort Study. PLoS Negl Trop Dis 9(9): e0004033. When using the procedure, please cite this paper.

In each file, instructions for specifying arguments and examples are given at the top.

Step 1. Cleaning

First, the records are cleaned and identifiers standardised. The first script (recclean_v1_0.ado) is designed to facilitate the cleaning and standardisation of names (either individual or a parent's). Other identifiers such as the birth date will need to be cleaned using other commands.

Names are reconfigured into a format in which the last component of the last name is placed in a variable called name1 and up to three subsequent components of the name are placed in variables name2 to name4 (the total number stored could be increased). Non-alphabetical characters are removed. Records containing names referring to babies (e.g. "baby of", line 137) and dummy records (e.g. "clinic", line 173) are deleted. Spellings of common names are standardised; an example is shown for Mohammed (line 255). Dummy names and titles are removed (e.g. "xx" or "senior", line 287). All of these steps need to be customised to suit your dataset.
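
The cleaning steps above can be sketched outside Stata as follows. This is a minimal Python illustration, not the code in recclean_v1_0.ado: the function name clean_name, the spelling variants in SPELLINGS and the rule that the remaining components keep their original order after name1 are all assumptions made for the example.

```python
import re

# Example tokens taken from the text above; extend these for your own data.
DUMMY_TOKENS = {"xx", "senior"}
# Hypothetical spelling variants mapped to one standard form.
SPELLINGS = {"mohammad": "mohammed", "mohamed": "mohammed"}


def clean_name(raw, max_parts=4):
    """Return a dict with name1..name4, or None if the record should be dropped."""
    text = raw.lower()
    # Drop records that refer to babies or to dummy entries.
    if "baby of" in text or "clinic" in text:
        return None
    # Remove every character that is not a letter or a space, then split.
    parts = re.sub(r"[^a-z ]", "", text).split()
    parts = [SPELLINGS.get(p, p) for p in parts if p not in DUMMY_TOKENS]
    if not parts:
        return None
    # Assumption: the last component goes first, the rest keep their order.
    ordered = [parts[-1]] + parts[:-1]
    ordered = ordered[:max_parts] + [""] * (max_parts - len(ordered))
    return {f"name{i + 1}": ordered[i] for i in range(max_parts)}
```

For instance, clean_name("Abdul Mohamed Singh, Senior") would place "singh" in name1, "abdul" in name2 and "mohammed" in name3, leaving name4 empty.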

Step 2. Blocking

Second, potentially matchable records are paired by components of the names and a birth year range; recblock_v1_0.ado is designed to facilitate this. Records with a missing birth year are compared with all possible matches across the birth year range. The Stata strgroup command is used to find records that share at least five of the first six characters of the first or last name. A similar procedure can be carried out using the first two letters only. Because of the large number of potential matches generated, output files are limited to 2 GB.
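
To make the blocking rule concrete, here is a rough Python sketch. It is not how recblock_v1_0.ado or strgroup works internally: the record layout (dicts with id, first, last and birth_year), the two-year window and the use of an edit distance of at most one on the six-character prefix as a stand-in for "at least five of the first six characters" are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def prefix_similar(a, b, length=6, max_edits=1):
    """True when the leading characters differ by at most max_edits edits."""
    return edit_distance(a[:length], b[:length]) <= max_edits


def block_pairs(index_records, search_records, year_window=2):
    """Yield (index_id, search_id) pairs worth comparing in detail."""
    for rec_i in index_records:
        for rec_s in search_records:
            yi, ys = rec_i["birth_year"], rec_s["birth_year"]
            # A missing birth year (None) is compared with every candidate.
            if yi is not None and ys is not None and abs(yi - ys) > year_window:
                continue
            if (prefix_similar(rec_i["first"], rec_s["first"])
                    or prefix_similar(rec_i["last"], rec_s["last"])):
                yield rec_i["id"], rec_s["id"]
```

The nested loop compares every index record with every search record, which is only practical for small files; it is shown here for clarity rather than speed.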

Step 3. Matching

Third, specified identifiers in blocked records are compared using recmatch_v1_0.ado. If several output files have been generated by recblock because of a large number of possible matches, this command will need to be run on each. The script can perform comparisons of the following common identifiers: the individual's name; a parent's name (e.g. father's); birth year, or birth date and birth year combined; death year, or death date and death year combined; a numeric identifier (e.g. national security/health number); categorical information on locality (e.g. division, subdivision, area, zone); gender; and ethnicity. Broadly, identifiers are categorised into exact match, partial match, nonmatch and missing. However, these categorisations have been tuned to our dataset and should be validated in yours.

To avoid cross-comparison of names, the four components of the individual's name (and three components of a parent's name) are "reordered" until the best fit is identified. For example, "Mohammed Abdul Singh" would be reordered to match "Abdul Mohammed Singh". Components that have been joined will be separated. For example, "Abdulprasad" will match "Abdul Prasad". Partial matches also include shortened names and changes to the spelling.
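
One simple way to picture the reordering is to try every ordering of one name's components and keep the best-scoring one. The Python sketch below does this; it is an illustration only, and the function names and the scoring (1 for an exact match, 0.5 when one component is a shortened form of the other) are assumptions rather than the rules used in recmatch_v1_0.ado. Joined components and spelling changes are not handled here.

```python
from itertools import permutations


def component_score(a, b):
    """1.0 for an exact match, 0.5 when one is a prefix of the other, else 0."""
    if a and a == b:
        return 1.0
    if len(a) >= 3 and len(b) >= 3 and (a.startswith(b) or b.startswith(a)):
        return 0.5
    return 0.0


def best_alignment(names_a, names_b):
    """Try every ordering of names_b and return (best score, best ordering)."""
    width = max(len(names_a), len(names_b))
    # Pad the shorter name with empty components so both have equal length.
    a = list(names_a) + [""] * (width - len(names_a))
    b = list(names_b) + [""] * (width - len(names_b))
    best = (-1.0, tuple(b))
    for perm in permutations(b):
        score = sum(component_score(x, y) for x, y in zip(a, perm))
        if score > best[0]:
            best = (score, perm)
    return best
```

With four components there are only 24 orderings to try, so an exhaustive search is cheap for each record pair.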

Dates are compared to allow for exact matches, matches within three days and matches within seven days, although these cutoffs could be modified. In addition, cross-comparisons between the day and month of birth and typographical errors such as "13" instead of "31" are catered for. If numeric identifiers have been entered by hand (and so may contain typos), a difference of up to one digit can be allowed for.
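
The date and numeric-identifier rules can be expressed compactly, as in the Python sketch below. The category labels, the function names and the order in which the checks are applied are assumptions made for illustration; the script's own categories may differ.

```python
from datetime import date


def compare_dates(a, b):
    """Classify two dates; either may be None when missing."""
    if a is None or b is None:
        return "missing"
    gap = abs((a - b).days)
    if gap == 0:
        return "exact"
    if gap <= 3:
        return "within_3_days"
    if gap <= 7:
        return "within_7_days"
    if a.year == b.year:
        # Day and month entered the wrong way round, e.g. 4 March vs 3 April.
        if (a.day, a.month) == (b.month, b.day):
            return "day_month_swapped"
        # Digits of the day transposed, e.g. "13" entered instead of "31".
        if a.month == b.month and f"{a.day:02d}" == f"{b.day:02d}"[::-1]:
            return "day_digits_transposed"
    return "nonmatch"


def ids_agree(a, b, max_diff=1):
    """True when two equal-length ID strings differ in at most max_diff digits."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_diff
```

Here compare_dates(date(2000, 1, 13), date(2000, 1, 31)) falls into the transposed-digits category, and ids_agree("1234567", "1234557") is true because only one digit differs.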

Step 4. Probabilities

Fourth, match (m) and nonmatch (u) probabilities are calculated for the identifiers using recprob_v1_0.ado. Then, using information on the prior probability, the posterior probability that a record pair refers to the same individual is calculated. Initially, all possible blocks based on combinations of the available identifiers are defined. The ratio of UIDs in the search and index records is checked and only blocks with a ratio between 1.33 and 1:20 containing over 1000 record pairs are analysed. Next, crude initial probabilities are defined and the Expectation-Maximisation (EM) algorithm is run. The m and u probabilities are combined to give the match weight score. The best match weight across all blocks is taken and multiplied by the priors to give the posterior probability.
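
For readers unfamiliar with match weights, the Python sketch below shows one common formulation (the Fellegi-Sunter log2 likelihood ratio) and a Bayesian update in odds form. It is an assumption that recprob_v1_0.ado uses exactly this arithmetic; the function names, the simple agree/disagree coding and the treatment of missing values as carrying no evidence are illustrative choices, and the EM estimation of m and u is not shown.

```python
import math


def match_weight(agreements, m, u):
    """Sum log2 likelihood ratios over identifiers.

    agreements maps identifier -> True (agree), False (disagree) or None (missing).
    m and u map identifier -> probability of agreement among matches / nonmatches.
    """
    total = 0.0
    for key, agrees in agreements.items():
        if agrees is None:
            continue  # assumption: missing values contribute no evidence
        if agrees:
            total += math.log2(m[key] / u[key])
        else:
            total += math.log2((1 - m[key]) / (1 - u[key]))
    return total


def posterior_probability(weight, prior):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * 2 ** weight
    return post_odds / (1 + post_odds)
```

With hypothetical values m = 0.9 and u = 0.1 for name, and m = 0.8 and u = 0.2 for birth year, agreement on both gives a likelihood ratio of 36; combined with a prior of 0.01 this yields a posterior of 36/135, or about 0.27.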

Licence

These files are distributed under the GNU General Public Licence. Accordingly, there is no warranty for their use. At present they are not actively maintained and technical support cannot be offered. However, suggestions for improvements are welcome. Please contact Dr Tom Parks (tomparks@well.ox.ac.uk) for all matters relating to their use.