Multi-Party PPRL Demo - DMM

Welcome to MERLIN, the online Multi-party privacy-presErving Record LINkage demo. Privacy-preserving record linkage (PPRL) is the problem of identifying sets of records from multiple databases that correspond to the same real-world entity without revealing any sensitive information about these records. It is increasingly required in real-world applications such as public health surveillance, crime and fraud detection, and national security. This tool demonstrates multi-party PPRL techniques developed by the PPRL research group at the Research School of Computer Science, ANU.

The PPRL pipeline is illustrated in the figure below. In step 1, the parties agree on the parameter settings and the masking functions to be used for the linkage. In step 2, a private blocking function reduces the search space, producing a set of candidate records that are securely compared and classified in step 3. Finally, the linkage is evaluated in terms of complexity (or scalability), linkage quality, and privacy. For more information about the pipeline and the process of PPRL click here.

[Figure: the PPRL pipeline]

This tool follows the PPRL pipeline and represents each of the steps in different tabs. To begin the PPRL process, click on the Parameter settings & data masking tab. On this tab, you will be able to set the number of parties and their datasets to be linked using our tool, as well as the data masking settings. The masking function is limited to Bloom filter encoding in this demo (as all our multi-party PPRL techniques are based on the Bloom filter encoding method).
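To make the masking step concrete, here is a minimal sketch of Bloom filter encoding of a single attribute value. It is not the demo's implementation: the function names, the choice of SHA-1 and SHA-256 as the two base hashes, the double-hashing scheme, and the unpadded q-grams are all assumptions made for illustration. The three parameters correspond to the q-gram length, Bloom filter length, and number of hash functions that the demo asks for.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into overlapping q-grams (no padding; an assumption)."""
    value = value.lower().strip()
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def bloom_encode(value, bf_len=1000, num_hash=30, q=2):
    """Return the set of bit positions set to 1 for one attribute value.

    Hypothetical helper: each q-gram is mapped to num_hash positions with
    double hashing, position_i = (h1 + i * h2) mod bf_len.
    """
    bits = set()
    for gram in qgrams(value, q):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hash):
            bits.add((h1 + i * h2) % bf_len)
    return bits


bf_a = bloom_encode("peter")
bf_b = bloom_encode("pete")
```

Because every q-gram of "pete" also occurs in "peter", the positions set for "pete" are a subset of those set for "peter", which is what lets similar strings produce similar Bloom filters.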

Once the original records have been transformed into masked records, on the Private blocking tab you can select one of several private blocking functions, and set the parameters specific to the blocking function. At the end of this step, the masked records (Bloom filters) will be blocked into different groups using the blocking functions according to their similarities. A view option allows you to preview the statistics of the generated blocks.
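One of the blocking options offered on that tab is standard canopy clustering, configured with a tight and a loose similarity threshold. The sketch below shows the generic single-database canopy idea over Bloom filters represented as sets of 1-bit positions. It is an illustration under assumptions, not the demo's multi-party protocol: the function names, the use of the Dice coefficient, and the deterministic choice of the first remaining record as the canopy centre are all hypothetical.

```python
def dice(a, b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def canopy_blocks(records, loose=0.8, tight=0.9):
    """Group record ids into overlapping blocks (canopies).

    records: dict mapping record id -> set of 1-bit positions.
    Records at least `loose`-similar to the centre join its block;
    records at least `tight`-similar are removed from further consideration.
    """
    remaining = list(records)  # keeps insertion order
    blocks = []
    while remaining:
        centre = remaining[0]  # assumption: first remaining record is the centre
        sims = {rid: dice(records[centre], records[rid]) for rid in remaining}
        blocks.append([rid for rid in remaining
                       if rid == centre or sims[rid] >= loose])
        remaining = [rid for rid in remaining
                     if rid != centre and sims[rid] < tight]
    return blocks


example = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 4},
    "c": {1, 2, 3, 5},
    "d": {10, 11, 12, 13},
}
blocks = canopy_blocks(example, loose=0.7, tight=0.9)
```

A record that falls between the two thresholds, such as "c" here, stays in the pool and can appear in more than one block, which is how canopy clustering trades extra candidate comparisons for fewer missed matches.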

These blocked records are then used on the next tab, Private matching & classification, where one of several techniques matches and classifies the candidate record sets from the blocks into matches and non-matches. Parameters specific to the chosen matching technique need to be provided on this tab. A sample of the classified matches and non-matches can be previewed for each private matching and classification technique.
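As an illustration of threshold-based classification, the sketch below computes a multi-party Dice coefficient over one Bloom filter per party and compares it with a minimum similarity threshold. This page does not state which similarity function the demo uses, so the Dice coefficient, the function names, and the default threshold are assumptions. The computation is also done in the clear here, whereas a real PPRL protocol would compute it securely across the parties.

```python
def dice_similarity(bloom_filters):
    """Dice coefficient of p Bloom filters given as sets of 1-bit positions.

    Computed as p * (number of positions set in all filters)
    divided by the total number of 1-bits over all filters.
    """
    p = len(bloom_filters)
    total_ones = sum(len(bf) for bf in bloom_filters)
    if p < 2 or total_ones == 0:
        return 0.0
    common = set.intersection(*bloom_filters)
    return p * len(common) / total_ones


def classify(bloom_filters, min_sim=0.8):
    """Label a candidate record set by comparing its similarity to a threshold."""
    return "match" if dice_similarity(bloom_filters) >= min_sim else "non-match"


identical = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}]
different = [{1, 2}, {2, 3}, {2, 4}]
labels = (classify(identical), classify(different))
```

With three parties, a candidate set scores 1.0 only when all three Bloom filters are identical, so raising the number of parties makes a given threshold harder to reach.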

On the final tab, Evaluation, you can view the performance results of these techniques using several evaluation measures, including Reduction ratio, Pairs completeness, Pairs quality, Precision, Recall, F-measure, and Probability of suspicion and disclosure risk measures.
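The blocking and linkage quality measures named above have standard definitions, sketched below under the usual conventions. The function and argument names are hypothetical, and the privacy measures (probability of suspicion and disclosure risk) are not covered.

```python
def blocking_measures(num_candidates, true_in_candidates,
                      true_total, all_comparisons):
    """Return (reduction ratio, pairs completeness, pairs quality).

    num_candidates:     candidate record sets produced by blocking
    true_in_candidates: true matches among those candidates
    true_total:         true matches in the full comparison space
    all_comparisons:    comparisons required without blocking
    """
    rr = 1.0 - num_candidates / all_comparisons
    pc = true_in_candidates / true_total
    pq = true_in_candidates / num_candidates
    return rr, pc, pq


def linkage_measures(tp, fp, fn):
    """Return (precision, recall, F-measure) from classification counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


rr, pc, pq = blocking_measures(100, 40, 50, 1000)
precision, recall, f_measure = linkage_measures(tp=8, fp=2, fn=2)
```

In this example blocking removes 90% of the comparisons while keeping 80% of the true matches, and 40% of the remaining candidates are true matches.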

We simulated communication between the parties in these multi-party PPRL protocols by creating a directory for each party and writing the communicated data into a file in that directory. The demo runs on a computer with a 64-bit Intel Core i7 (3.40 GHz) processor and 16 GBytes of main memory, running Ubuntu 14.04.
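The file-based simulation of communication can be pictured with the following minimal sketch. The directory layout, the JSON file format, and the function and party names are assumptions for illustration; the demo's actual file format is not described here.

```python
import json
import tempfile
from pathlib import Path


def send(base_dir, receiver, message_name, payload):
    """Simulate sending by writing the payload into the receiver's directory."""
    party_dir = Path(base_dir) / receiver
    party_dir.mkdir(parents=True, exist_ok=True)
    path = party_dir / f"{message_name}.json"
    path.write_text(json.dumps(payload))
    return path


def receive(base_dir, receiver, message_name):
    """Simulate receiving by reading the file from the receiver's directory."""
    path = Path(base_dir) / receiver / f"{message_name}.json"
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as base:
    message = {"from": "party_1", "bits": [3, 17, 42]}
    send(base, "party_2", "partial_result", message)
    received = receive(base, "party_2", "partial_result")
```

Writing messages to files also makes it easy to measure how much data each party would have had to transmit, since the file sizes are directly observable.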

An overview of the PPRL problem and its challenges, and a detailed description of our proposed multi-party PPRL techniques, can be found in the following publications:

Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage, Thilina Ranbaduge, Dinusha Vatsalan, and Peter Christen, Springer PAKDD 2015, Ho Chi Minh City, Vietnam. (PDF, 382 KBytes)
Tree Based Scalable Indexing for Multi-Party Privacy-Preserving Record Linkage, Thilina Ranbaduge, Peter Christen, and Dinusha Vatsalan, CRPIT AusDM 2014, Brisbane, Australia. (PDF, 754 KBytes)
Scalable Privacy-Preserving Record Linkage for Multiple Databases, Dinusha Vatsalan and Peter Christen, CIKM 2014, Shanghai, China. (PDF, 460 KBytes)
Challenges for Privacy Preservation in Data Integration, Peter Christen, Dinusha Vatsalan, and Vassilios S Verykios, ACM JDIQ 2014. (PDF, 43.8 KBytes)
An Evaluation Framework for Privacy-Preserving Record Linkage, Dinusha Vatsalan, Peter Christen, Christine M. O'Keefe, and Vassilios S Verykios, JPC 2014. (PDF, 43.8 KBytes)
A taxonomy of privacy-preserving record linkage techniques, Dinusha Vatsalan, Peter Christen, and Vassilios S Verykios, Elsevier JIS 2014. (PDF, 1.1 MBytes)

Generate sets of Bloom filters for the parties participating in the record linkage process.

Please select the number of parties and provide the relevant attribute and parameter values.

Alternatively, load some examples:

Name	Value
Data set Name
Data set size per party
Corruption level
Number of parties (3, 5, 7)

Set of parameters for the Bloom filter generation

Parameter name	Parameter value
Q-gram length
Bloom filter length	(e.g. 500, 1000, 1030)
Number of hash functions	(e.g. 20, 30, 40)

Select the attributes to use in blocking & linking

Attribute Name	Usage
First Name
Last Name
Suburb
Postcode


Perform private blocking for multi-party PPRL.


Indexing technique

Parameter settings to generate blocks with Single-bit tree indexing

Parameter name	Parameter value
Minimum block size	(e.g. 10, 20, 50)
Maximum block size	(e.g. 20, 25, 60)

Parameter settings to generate blocks with Phonetic indexing

Attribute name	Encoding method

Parameter settings to generate blocks with Standard canopy clustering based indexing

Parameter settings for Multi-bit Bloom filter splitting

Parameter name	Parameter value
Minimum mini-block size	(e.g. 5, 10, 15)
Maximum mini-block size	(e.g. 10, 20, 30)
Bit selection threshold
Maximum node degree

Parameter settings for Standard Canopy clustering

Parameter name	Parameter value
Tight similarity threshold	(e.g. 0.9)
Loose similarity threshold	(e.g. 0.8)
Maximum merge size	(e.g. 25, 50, 75)

Parameter settings to generate blocks with Hierarchical canopy clustering based indexing

Parameter settings for Multi-bit Bloom filter splitting

Parameter name	Parameter value
Minimum mini-block size	(e.g. 5, 10, 15)
Maximum mini-block size	(e.g. 10, 20, 30)
Bit selection threshold
Maximum node degree

Parameter settings for Standard Canopy clustering

Parameter name	Parameter value
Tight similarity threshold	(e.g. 0.9)
Loose similarity threshold	(e.g. 0.8)
Maximum merge size	(e.g. 25, 50, 75)


Minimum similarity threshold	(between 0.0 and 1.0)
Length filtering	Yes No
Include disclosure risk calculation (this might take longer)	Yes No

Minimum similarity threshold	(between 0.0 and 1.0)
Communication pattern	SEQ RBR NAI
Include disclosure risk calculation (this might take longer)	Yes No

Data Mining & Matching

Multi-Party Privacy-Preserving Record Linkage