Multi-Party Privacy-Preserving Record Linkage
Thilina Ranbaduge, Dinusha Vatsalan, and Peter Christen
ANU Research School of Computer Science
This research is funded by the Australian Research Council under Discovery Project DP130101801.
- Home
- Parameter settings
and data masking - Private blocking
- Private matching
and classification - Evaluation
Welcome to MERLIN, the online Multi-party privacy-presErving Record LINkage demo. PPRL, the problem of identifying matching sets of records from multiple databases that correspond to the same real-world entity without revealing any sensitive information about these records, is increasingly being required in many real-world applications. Examples include public health surveillance systems, crime and fraud detection, and national security applications. This tools provides a demonstration of multi-party PPRL techniques that have been developed by the research group of PPRL at the Research School of Computer Science, ANU.
The pipeline of PPRL is illustrated in the below figure. In step 1, the parties agree on the parameter settings and the masking functions to be used for the linkage. The searching space is reduced in step 2 by employing a private blocking function that results in a set of candidate records which will be securely compared and classified in the third step. Finally, the performance of linkage is evaluated in terms of complexity (or scalability), linkage quality, and privacy. For more information about the pipeline and the process of PPRL click here.
This tool follows the PPRL pipeline and represents each of the steps in different tabs. To begin the PPRL process, click on the Parameter settings & data masking tab. On this tab, you will be able to set the number of parties and their datasets to be linked using our tool, as well as the data masking settings. The masking function is limited to Bloom filter encoding in this demo (as all our multi-party PPRL techniques are based on the Bloom filter encoding method).
Once the original records have been transformed into masked records, on the Private blocking tab you can select one of several private blocking functions, and set the parameters specific to the blocking function. At the end of this step, the masked records (Bloom filters) will be blocked into different groups using the blocking functions according to their similarities. A view option allows you to preview the statistics of the generated blocks.
These blocked records are then used on the next Private matching & classification tab, where one of several techniques is used to match and classify the candidate record sets resulting from blocks into matches and non-matches. Parameters need to be provided that are specific to the matching technique used on this tab. A sample of matches and non-matches classified can be previewed for each of the private matching and classification technique.
On the final tab, Evaluation, you can view the performance results of these techniques using several evaluation measures, including Reduction ratio, Pairs completeness, Pairs quality, Precision, Recall, F-measure, and Probability of suspicion and disclosure risk measures.
We simulated communication between the parties in these multi-party PPRL protocols by creating a directory for each party and writing the communicated data into a file in the directory. This demo is executed on a 64-bit Intel Core i7 (3.40GHz), 16 GBytes of main memory computer with Ubuntu 14.04 OS platform.
An overview of PPRL problem and it's challenges, and a detailed description of our proposed multi-party PPRL techniques can be found in the following publications:
- Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage, Thilina Ranbaduge, Dinusha Vatsalan, and Peter Christen, Springer PAKDD 2015, Ho Chi Minh City, Vietnam. (PDF, 382 KBytes)
- Tree Based Scalable Indexing for Multi-Party Privacy-Preserving Record Linkage, Thilina Ranbaduge, Peter Christen, and Dinusha Vatsalan, CRPIT AusDM 2014, Brisbane, Australia. (PDF, 754 KBytes)
- Scalable Privacy-Preserving Record Linkage for Multiple Databases, Dinusha Vatsalan and Peter Christen, CIKM 2014, Shanghai, China. (PDF, 460 KBytes)
- Challenges for Privacy Preservation in Data Integration, Peter Christen, Dinusha Vatsalan, and Vassilios S Verykios, ACM JDIQ 2014. (PDF, 43.8 KBytes)
- An Evaluation Framework for Privacy-Preserving Record Linkage, Dinusha Vatsalan, Peter Christen, Christine M. O'Keefe, and Vassilios S Verykios, JPC 2014. (PDF, 43.8 KBytes)
- A taxonomy of privacy-preserving record linkage techniques, Dinusha Vatsalan, Peter Christen, and Vassilios S Verykios, Elsevier JIS 2014. (PDF, 1.1 MBytes)
Generate sets of Bloom filters for the number of parties participate in the record linkage process.
Please select number of parties and provide the relavant attributes and parameters values.
Alternatively, load some examples:
Name | value |
---|---|
Data set Name | |
Data set size per party | |
Corruption level | |
Number of parties (3, 5, 7) |
Set of parameters for the Bloom filter generation
Parameter Name | parameter value |
---|---|
Q-gram length | |
Bloom Filter length | (e.g 500, 1000, 1030) |
Number of hash functions | (e.g 20, 30, 40) |
Selecting the attributes for using in blocking & linking
Attribute Name | Usage |
---|---|
First Name | |
Last Name | |
Suburb | |
Postcode |
Perform private blocking for multi-party PPRL.
Warning : Blocking attributes are not selected.
Indexing technique |
Parameter settings to generate blocks with Single-bit tree indexing
Parameter name | Parameter value |
---|---|
Minimum Block size | (e.g 10, 20, 50) |
Maximum Block size | (e.g 20, 25, 60) |
Parameter settings to generate blocks with Phonetic indexing
Attribute name | Encoding method |
---|
Parameter settings to generate blocks with Standard canopy clustering based indexing
Parameter settings for Multi-bit Bloom filter splitting
Parameter name | Parameter value |
---|---|
Minimum Mini-block size | (e.g 5, 10, 15) |
Maximum Mini-block size | (e.g 10, 20, 30) |
Bit selection threshold | |
Maximum node degree |
Parameter settings for Standard Canopy clustering
Parameter name | Parameter value |
---|---|
Tight similarity threshold | (e.g 0.9) |
Loose similarity threshold | (e.g 0.8) |
Maximum merge size | (e.g 25, 50, 75) |
Parameter settings to generate blocks with Hierarchical canopy clustering based indexing
Parameter settings for Multi-bit Blooom filter splitting
Parameter name | Parameter value |
---|---|
Minimum Mini-block size | (e.g 5, 10, 15) |
Maximum Mini-block size | (e.g 10, 20, 30) |
Bit selection threshold | |
Maximum node degree |
Parameter settings for Standard Canopy clustering
Parameter name | Parameter value |
---|---|
Tight similarity threshold | (e.g 0.9) |
Loose similarity threshold | (e.g 0.8) |
Maximum merge size | (e.g 25, 50, 75) |
Perform private comparison and classification for multi-party PPRL.
Warning : Linking attributes are not selected.
Private matching & classification technique: |
Private approximate matching and classification using Bloom filters - proposed by Dinusha Vatsalan and Peter Christen, ANU, 2014. Paper
Minimum similarity threshold | (between 0.0 and 1.0) |
Length filtering |
Yes
No |
Include disclosure risk calculation (this might take longer) |
Yes
No |
Private exact matching and classification using Bloom filters - proposed by Lai et al., 2006. Paper
Include disclosure risk calculation (this might take longer) |
Yes
No |
Private approximate matching and classification with efficient communication patterns using Counting Bloom filters - Vatsalan et al., 2015 - TBC.
Minimum similarity threshold | (between 0.0 and 1.0) |
Communication pattern |
SEQ
RBR NAI |
Include disclosure risk calculation (this might take longer) |
Yes
No |
Evaluate performance of multi-party PPRL techniques.