Historical Record Linkage
Charini Nanayakkara, Peter Christen, and Thilina Ranbaduge
ANU Research School of Computer Science
Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains need to be linked to allow advanced analytics. A popular type of data used in such a context are historical registries containing birth, death, and marriage certificates. Individually, such data sets however limit the types of studies that can be conducted.
Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees are available it is possible to, for example, investigate how education, health, mobility, and employment influence the lives of people over two or even more generations.
Historical record linkage involves the linkage of records from historical censuses, as well as birth, death, and marriage certificates, to construct longitudinal data sets about a population. This problem has been studied in the past two decades by researchers working in different domains.
The linkage of historical records is challenging because of data quality issues and because often there are no ground truth data available. Unsupervised techniques need to be employed, which generally are based on similarity graphs generated by comparing individual records.
This page is dedicated to presenting the novel research conducted by us, to address the many challenges and limitations in the area of historical record linkage and record linkage in general.
We proposed two graph based clustering algorithms for record linkage, which incorporate constraints implied by time differences (known as temporal constraints). For example, when we attempt to find groups of siblings in a dataset, we would refrain from linking two records with a five month difference between birth dates, because we know that biologically it is impossible for the same mother to have two children five months apart. In previous work related to record linkage, such constraints were not considered. We have shown with our experiemnts that superior linkage results can be achieved with our proposed temporal clustering approaches. The clustering algorithms we proposed are as follows:
Star clustering first aims to find the nodes that best represent a cluster, where these are the nodes that have the highest average similarity to their neighbouring nodes and also the highest number of neighbours in a similarity graph G. Cluster centers are identified iteratively, such that an unassigned node becomes a cluster center, and all its neighbours are assigned to the corresponding cluster. This process can result in overlapping clusters which are resolved by assigning a node to the cluster where it has the highest average similarity to other nodes in the cluster. The following paper outlines the star clustering algorithm in detail:
Robust graph-based clustering is based on the categorisation of pairwise links according to their strength; strong, normal and weak-high (Saeedi et al. 2018). Initially, connected components are created using a subset of link strengths, which are referred to as base clusters. Subsequently, these base clusters are iteratively merged, where the pairwise cluster similarity needs to be greater than a user defined threshold. The following paper outlines the robust graph-based clustering algorithm in detail:
We have also implemented a baseline connected component clustering algorithm for comparison purposes.
Connected component clustering considers all the pairwise links in graph G with a similarity greater than a user defined similarity threshold, and then finds the connected components in the resulting graph. A connected component is a set of nodes in G that are directly or indirectly connected via edges.
Our clustering approaches can be run with or without temporal constraints. However, our experimental evaluation has indicated that the linkage results improve when temporal constraints are applied.
The robustness of record linkage evaluation measures is of high importance, since linkage techniques are assessed based on these. Linkage quality is traditionally evaluated based on the classified versus the true matching links using measures such as precision and recall. Both these measures however are based on the evaluation of links between individual records (record pairs) rather than clusters of records.
However, in a group linkage context these traditional evaluation measures are not suitable for evaluating groups of linked records because they evaluate the quality of individually linked record pairs rather than the quality of records grouped into clusters.
Novel evaluation measure for group record linkage To address the problem of the lack of suitable linkage evaluation measures for group based record linkage, we propose a novel method for evaluating the quality of the clusters generated in a record linkage process, which classifies records (rather than links) according to how correctly they have been generated when compared to the ground truth clusters.
In our evaluation approach, we classify each record in a database into one of seven categories, based on how they have been clustered by an algorithm, with respect to the ground truth. The seven categories are described below.
Category | Description |
---|---|
Correct singleton | These are the records which appear as singletons in both the ground truth data and the predicted clusters |
Wrongly grouped singleton | These are the records which appear as singletons in the ground truth but were assigned to a group of records in the prediction |
Missed group member | These are the records which appear in a group in the ground truth, but were assigned as a singleton in the prediction |
Exact group match | These are the records contained in a predicted cluster that exactly matches a ground truth cluster, where the size of the cluster is larger than one. An exact match occurs when each record in the predicted cluster appears in a ground truth cluster and vice versa |
Majority group match | A majority group match occurs when at least 50% of the records in a predicted cluster (a group, not a singleton) come from a single ground truth cluster |
Minority group match | A minority group match occurs when less than 50% of the records in a predicted cluster (a group, not a singleton) come from a single ground truth cluster |
Wrongly assigned member | These are all the records from a ground truth cluster (a group, not a singleton) which appear in a predicted cluster (a group) different to the majority or minority group match |
Our proposed linkage evaluation method provides unambiguous results while providing insight to how the records themselves were classified; not the links between records.
Further, it provides more detailed information regarding which aspects of a clustering algorithm is better or worse (such as whether they are more suited for predicting singletons or groups correctly).
The Python source code for the evaluation program is available here: evaluation.tar.gz (111.3 KB)
Isle of Skye dataset We used a Scottish dataset that covers the population of the Isle of Skye over the period from 1861 to 1901 for conducting experimental evaluation. This dataset consists of 17,613 birth certificates, each containing personal details about a baby and its parents such as their names, addresses, occupations, and the birth date of the baby. This dataset has been linked semi-manually by demographers using Microsoft Excel and Access programmes to assist with sorting, searching, and querying records, and therefore ground truth is available for conducting linkage quality evaluation. The ground truth clusters were validated by demographers using census, marriage, and death certificates. For more information regarding the Isle of Skye dataset, please refer to the following paper:
Description of similarity graph G | Link to similarity graph G |
---|---|
Attribute combination: parents' names, their addresses, occupations, and marriage dates Attribute weighting: no | all-attr-no-weights.csv (7.6 MB) |
Attribute combination: parents' names, their addresses, occupations, and marriage dates Attribute weighting: yes | all-attr-weights.csv (7.4 MB) |
Attribute combination: parents' names and their addresses Attribute weighting: no | parent-name-address-no-weights.csv (86.9 MB) |
Attribute combination: parents' names and their addresses Attribute weighting: yes | parent-name-address-weights.csv (92.1 MB) |
Attribute combination: parents' names only Attribute weighting: no | parent-name-no-weights.csv (129.9 MB) |
Attribute combination: parents' names only Attribute weighting: yes | parent-name-weights.csv (149.2 MB) |
- The ground truth file corresponding to our dataset is available here: ground-truth.csv (186.6 KB)
- The temporal plusibility file corresponding to our dataset is available here: temporal-plausibility.csv.gz (1.3 MB)
- Common utility program used with clustering and evaluation codes: system-utils.py (16.5 KB)