A while back I was asked if I could carry out a contact de-duplication exercise based on samples from the actual files to be de-duplicated. My initial answer was no. Clearly, checking the level of duplication on a sample will not give the same results as checking the complete files. Given two identical files (100% duplicated) of 50 records each, and taking a random sample of 5 records from each, the chance of getting two identical samples is over two million to one against (1 in C(50,5) = 1 in 2,118,760)! My second, more thoughtful answer was yes, but it would be fiddly (more so than just using the complete files). Still, I was curious enough to work out how it would be done.
First, the samples would have to be randomly selected. Selecting data by any specific dimension (for example, a specific time period) would risk biasing the sample and giving results that are not valid for the complete file. Secondly, the file sizes and sample sizes need to be large: large enough to provide a statistically significant answer. I'll test this with two original files of 1M records each and take 10% samples.
For the case where the two files are identical (100% duplicated), intuitively I would expect the samples to be 10% duplicated: given a random sample A of 100K records, each record in a random sample B has a 10% chance of also being in sample A. If the files are, for example, only 50% duplicated, I'd expect the chance (and hence the level of duplication between the samples) to drop by half, to 5%. So my guess is:
Duplicates in Source Files = Duplicates in Sample / (Sample Size / File Size)²
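As a quick illustration (the duplicate count here is purely hypothetical): with 1M-record files and 100K samples the sampling fraction is 0.1, so finding 10,000 duplicates between the two samples would give an estimate of 10,000 / (0.1 × 0.1) = 10,000 × 100 = 1,000,000 duplicates in the original files.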
To check this I'll run some tests. First, create a table populated with 2M unique IDs to act as the source for the sample files, then two tables to hold the sample files themselves. I'm using a pre-populated integers table containing sequential row numbers (1, 2 … N) to do this.

    create table #2M (id int)

    insert into #2M
    select row_num
    from dbo.integers_large
    where row_num <= 2000000

    create table #file1 (id int)
    create table #file2 (id int)

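If a pre-populated integers table such as dbo.integers_large isn't to hand, the 2M sequential IDs could instead be generated on the fly with something along these lines (a sketch; it assumes the cross join of sys.all_objects with itself returns at least 2M rows, which it normally does):

    -- alternative to dbo.integers_large: generate 2M sequential ids on the fly
    insert into #2M
    select id
    from (
        select row_number() over (order by (select null)) id
        from sys.all_objects a
        cross join sys.all_objects b
    ) n
    where id <= 2000000
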
Then create the two sample files. The following parameters define the original file size (1M), the sample size (100K) and the overlap offset that controls the degree of duplication between the original files (0 = identical files, 1M = no duplication).

    declare @file_size int = 1000000
    declare @sample_size int = 100000
    declare @overlap_offset int = 0

Then generate the sample files. The inner query generates an original file of @file_size records and the outer query takes a random sample of @sample_size records from it. The overlap (the amount of duplication between the original files) can be varied using @overlap_offset in the second file's query.

    insert into #file1
    select id
    from (
        select id, row_number() over (order by checksum(NEWID())) row_num
        from #2M
        where id between 1 and @file_size
    ) a
    where row_num between 1 and @sample_size

    insert into #file2
    select id
    from (
        select id, row_number() over (order by checksum(NEWID())) row_num
        from #2M
        where id between 1 + @overlap_offset and @file_size + @overlap_offset
    ) a
    where row_num between 1 and @sample_size

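Before counting duplicates it's worth confirming the samples came out as expected; a quick sanity check (my addition, not part of the original script) could be:

    -- each sample should hold exactly @sample_size distinct ids,
    -- since the sampling is without replacement from distinct source ids
    select count(*) rows_in_sample, count(distinct id) distinct_ids from #file1
    select count(*) rows_in_sample, count(distinct id) distinct_ids from #file2
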
Finally check the number of duplicates between the two sample files.

    select count(*)
    from #file1 f1
    join #file2 f2 on f1.id = f2.id

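The scaling in the formula can also be applied directly in the same query. A sketch (my addition, assuming @file_size and @sample_size are still in scope from the declarations above):

    -- scale the sample duplicate count back up to file level:
    -- sample duplicates / (sampling fraction squared)
    select count(*) / ( (@sample_size / cast(@file_size as float))
                      * (@sample_size / cast(@file_size as float)) ) estimated_file_duplicates
    from #file1 f1
    join #file2 f2 on f1.id = f2.id
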
Running this for various levels of duplication shows that the formula provides a fair estimate of the level of duplication in the original files based on random sampling.
Overlap Offset | Actual File Duplicates | Sample Duplicates | Estimated File Duplicates |
---|---|---|---|
0 | 1000000 | 10031 | 1003100 |
100000 | 900000 | 9097 | 909700 |
200000 | 800000 | 8137 | 813700 |
300000 | 700000 | 6942 | 694200 |
400000 | 600000 | 6042 | 604200 |
500000 | 500000 | 4998 | 499800 |
600000 | 400000 | 4100 | 410000 |
700000 | 300000 | 3034 | 303400 |
800000 | 200000 | 2076 | 207600 |
900000 | 100000 | 1012 | 101200 |
1000000 | 0 | 0 | 0 |
The formula can be extended to handle original files of different sizes. For example, take two original files of 250,000 and 750,000 records and samples of 100,000 from each.
Duplicates in Source Files = Duplicates in Sample / ((Sample Size / File Size 1) * (Sample Size / File Size 2))
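Expressed in SQL, the adjusted estimate would look something like the sketch below (my addition; @file_size1 and @file_size2 are hypothetical names rather than variables from the original script, and @sample_size is assumed to still be in scope):

    -- estimate for source files of different sizes: divide the sample
    -- duplicate count by the product of the two sampling fractions
    declare @file_size1 int = 250000
    declare @file_size2 int = 750000

    select count(*) / ( (@sample_size / cast(@file_size1 as float))
                      * (@sample_size / cast(@file_size2 as float)) ) estimated_file_duplicates
    from #file1 f1
    join #file2 f2 on f1.id = f2.id
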
As the table below shows, this also works. So a solution is possible, but the fact remains that the effort needed to generate the necessary random samples probably outweighs any benefit of de-duplicating a smaller data set.
Overlap Offset | Actual File Duplicates | Sample Duplicates | Estimated File Duplicates |
---|---|---|---|
0 | 250000 | 3323 | 249225 |
50000 | 200000 | 2718 | 203850 |
100000 | 150000 | 2049 | 153675 |
150000 | 100000 | 1321 | 99075 |
200000 | 50000 | 655 | 49125 |
250000 | 0 | 0 | 0 |