ANU Online Personal Data Generator and Corruptor (GeCo)
Download: geco-data-generator-corruptor.tar.gz (1.9 MB)
Khoi-Nguyen Tran, Dinusha Vatsalan, and Peter Christen
ANU Research School of Computer Science
Welcome to the online GeCo demo. This tool allows the generation of various types of personal data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways.
To begin the data generation process, click on the Generate Data Set tab. On this tab, you will be able to customize the types and names of attributes to be generated, as well as their various parameter settings. You can also specify the number of original records to be generated (maximum number in this demo is limited to 9,999 records). A preview option allows you to view a sample of the generated data.
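As a rough illustration of what the Generate Data Set tab configures, the following Python sketch builds records from a small attribute specification. It is not the GeCo code or its API: the functions `make_attributes` and `generate_records`, the attribute names, the salary range, and the short in-line value lists (standing in for look-up files) are all hypothetical. Only the 9,999-record limit is taken from the description above.

```python
import random

# Hypothetical stand-ins for look-up files of attribute values.
GIVEN_NAMES = ["emma", "liam", "olivia", "noah"]
SURNAMES = ["smith", "jones", "nguyen", "brown"]

MAX_RECORDS = 9999  # limit stated for the online demo


def make_attributes(rng):
    """Map each (hypothetical) attribute name to a function drawing one value."""
    return {
        "given_name": lambda: rng.choice(GIVEN_NAMES),
        "surname": lambda: rng.choice(SURNAMES),
        "salary": lambda: round(rng.uniform(20000.0, 150000.0), 2),
    }


def generate_records(num_records, seed=None):
    """Generate num_records original records as dictionaries."""
    if not 1 <= num_records <= MAX_RECORDS:
        raise ValueError("num_records must be between 1 and 9999")
    rng = random.Random(seed)
    attributes = make_attributes(rng)
    records = []
    for rec_id in range(num_records):
        record = {"rec_id": rec_id}
        for name, draw in attributes.items():
            record[name] = draw()
        records.append(record)
    return records


if __name__ == "__main__":
    # Preview a sample of ten generated records.
    for record in generate_records(10, seed=42):
        print(record)
```

Passing a fixed `seed` makes the sample reproducible, which is convenient when previewing a data set before generating the full number of records.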
Once the original records have been generated, you can use the Corrupt Data Set tab to specify various corruption functions and the probabilities with which they are applied to the different generated attributes. You can also set the number of corrupted (or duplicate) records to create (again, the maximum is limited to 9,999 records). If you do not want to add duplicate records, simply set their number to 0.
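The idea of applying corruption functions with given probabilities can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the GeCo implementation: the two corruption functions, the probability tables, and the names `make_duplicate` and `corrupt_data_set` are hypothetical, and each duplicate here is derived from a uniformly chosen original record.

```python
import random


def delete_char(value, rng):
    """Remove one randomly chosen character (no-op on an empty string)."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    return value[:pos] + value[pos + 1:]


def swap_adjacent(value, rng):
    """Swap two neighbouring characters (no-op on strings shorter than 2)."""
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


# Hypothetical settings: probability of choosing each attribute, and for
# each attribute the corruption functions with their probabilities.
ATTRIBUTE_PROBS = {"given_name": 0.6, "surname": 0.4}
CORRUPTORS = {
    "given_name": [(delete_char, 0.5), (swap_adjacent, 0.5)],
    "surname": [(delete_char, 1.0)],
}


def make_duplicate(record, rng, num_modifications=1):
    """Return a corrupted copy of record; the original is left unchanged."""
    duplicate = dict(record)
    attrs = list(ATTRIBUTE_PROBS)
    weights = [ATTRIBUTE_PROBS[a] for a in attrs]
    for _ in range(num_modifications):
        attr = rng.choices(attrs, weights=weights, k=1)[0]
        funcs, func_weights = zip(*CORRUPTORS[attr])
        func = rng.choices(funcs, weights=func_weights, k=1)[0]
        duplicate[attr] = func(duplicate[attr], rng)
    return duplicate


def corrupt_data_set(records, num_duplicates, seed=None):
    """Create num_duplicates corrupted records (0 yields an empty list)."""
    rng = random.Random(seed)
    if num_duplicates == 0 or not records:
        return []
    return [make_duplicate(rng.choice(records), rng)
            for _ in range(num_duplicates)]
```

Selecting first the attribute and then the function, each by weighted random choice, is one simple way to honour both sets of probabilities; other schemes are possible.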
For a detailed description of the data generator code underlying this demo, the types of attributes that can be generated, and the format of the various look-up files used by the generator, please consult the following manual (PDF, 264 KBytes):
Flexible data generator manual (pdf)
The Python source code and look-up files, plus documentation, are available here:
geco-data-generator-corruptor.tar.gz (1.9 MBytes)
If you would like to contribute attribute generation functions (short Python programs) or look-up files for attribute generation or corruption, please contact one of the authors.
Please name your attributes, select their type and parameters.
Alternatively, load some examples:
Number of original records to be generated (1-9999):
(Leave name blank to ignore attribute.)
Attribute Type | Attribute Name | Attribute Parameters |
---|---|---|
(Show 10 samples) (Go to corrupt data set)
To generate a data set for corruption, please click the 'Generate Data Set' tab.
Upload your own data set to corrupt.
Header (1st row assumed to contain header.)
Reupload needed when (un)checking box.
The grey attributes do not have corruption functions applied to them. Please ensure that the probabilities sum to 1.0 at each level: across the attributes, across the functions applied to each attribute, and across the parameters of each function.
Number of duplicate records (0-9999): (0 for no corrupted records.)
Number of duplicates per record:
Distribution of duplicate records:
Maximum modifications per attribute:
Number of modifications per record:
Attribute Name | Attribute Probability | Corrupt Function |
---|---|---|
Sum of Attribute Probabilities does not equal 1.0.
Sum of Function Choice Probabilities does not equal 1.0.
Sum of Function Parameter Probabilities does not equal 1.0.
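The three checks behind these messages can be sketched in Python as below. This is an illustrative sketch only: the function names, the layout of the settings dictionaries, and the tolerance are assumptions, not part of GeCo. A small tolerance is used because sums of floating-point probabilities rarely equal 1.0 exactly.

```python
import math


def sums_to_one(probabilities, tolerance=1e-9):
    """Return True if the probabilities sum to 1.0 within a tolerance."""
    return math.isclose(math.fsum(probabilities), 1.0, abs_tol=tolerance)


def validate_settings(attribute_probs, function_probs, parameter_probs):
    """Check the three probability levels and return a list of error messages.

    Hypothetical layout of the arguments:
      attribute_probs: {attribute name: probability}
      function_probs:  {attribute name: [probability per function]}
      parameter_probs: {(attribute name, function name): [probability per parameter]}
    """
    errors = []
    if not sums_to_one(attribute_probs.values()):
        errors.append("Sum of attribute probabilities does not equal 1.0.")
    for attr, probs in function_probs.items():
        if not sums_to_one(probs):
            errors.append(
                f"Sum of function choice probabilities for '{attr}' "
                "does not equal 1.0.")
    for (attr, func), probs in parameter_probs.items():
        if not sums_to_one(probs):
            errors.append(
                f"Sum of function parameter probabilities for '{func}' "
                f"on '{attr}' does not equal 1.0.")
    return errors
```

An empty list returned by `validate_settings` means all three levels are consistent.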
(Show 10 samples)
The list of look-up files is available below. We will add support for uploading your own data sets in future revisions of the GeCo software. For now, if you would like to use our software with your own data sets, please download a copy of GeCo.