De-duping based on Sample Files

A while back I was asked if I could carry out a contact de-duplication exercise based on samples from the actual files to be de-duplicated. My initial answer was no. Clearly, checking the level of de-duplication on a sample will not give the same results as for the complete files. Given two identical files (100% duplicated) of 50 records, and taking a random sample of 5 records from each, the chance of getting two identical samples is over two million to one against! My second, more thoughtful answer was yes, but it would be fiddly (more so than just using the complete files). Still, I was curious enough to work out how it would be done.

First, the samples would have to be randomly selected. Selecting data by any specific dimension (for example a specific time period) would risk prejudicing the sample and giving results that are invalid for the complete file. Secondly, the file sizes and sample sizes need to be large – large enough to provide a statistically significant answer. I’ll test this with two original files of 1M records each and take 10% samples.

For the case where the two files are identical (100% duplicated), intuitively I would expect the samples to be 10% duplicated: given a random sample A of 100K records, each record in a random sample B has a 10% chance of being in sample A. If the files are, for example, only 50% duplicated, I’d expect the chance (and hence the level of duplication in the samples) to drop by half to 5%. So my guess is:

Duplicates in Source Files = Duplicates in Sample / (Sample Size / File Size)²
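
As a quick sanity check of the arithmetic (with made-up numbers): if a run against 1M-record files and 100K samples found 5,000 duplicates between the samples, the estimate would be 5,000 / (100,000 / 1,000,000)² = 5,000 / 0.01 = 500,000, i.e. roughly half of each original file duplicated.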

To check this I’ll run some tests. First, create a table and populate it with 2M unique IDs that I’ll use as the source for the sample files, and then create two tables for these sample files. I’m using a pre-populated integers table with sequential row numbers (1, 2 … N) to do this.

create table #2M 
(id int)

insert 
into 	#2M 
select	row_num 
from 	dbo.integers_large
where	row_num <= 2000000

create table #file1
(id int)

create table #file2
(id int)


Then create the two sample files. The following parameters define the original file size (1M), the sample size (100K) and the offset that controls the degree of duplication between the original files (0 = identical files, 1M = no duplication).

declare @file_size int = 1000000
declare @sample_size int = 100000
declare @overlap_offset int = 0


Then generate the sample files. The inner query generates the original file of @file_size records and the outer query takes a random sample of @sample_size records from it. The overlap (the amount of duplication between the original files) can be varied using @overlap_offset in the second file’s query.

-- randomise the order of the first original file (ids 1 to @file_size) and keep the first @sample_size rows
insert into #file1
select id from 
(
select id, row_number() over (order by checksum(NEWID())) row_num from #2M
where id between 1 and @file_size
) a
where row_num between 1 and @sample_size


-- as above, but with the second original file shifted by @overlap_offset to control the overlap
insert into #file2
select id from 
(
select id, row_number() over (order by checksum(NEWID())) row_num from #2M
where id between 1 + @overlap_offset and @file_size + @overlap_offset
) a
where row_num between 1 and @sample_size


Finally check the number of duplicates between the two sample files.

select count(*)
from #file1 f1 join #file2 f2 on f1.id = f2.id
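
For convenience the count and the estimate can be produced together. This is a minimal sketch, assuming the script runs as a single batch so that @file_size and @sample_size are still in scope:

-- apply the formula above: sample duplicates / (sample size / file size)^2
select	count(*) as sample_duplicates,
	count(*) / power(cast(@sample_size as float) / @file_size, 2) as estimated_file_duplicates
from 	#file1 f1 join #file2 f2 on f1.id = f2.id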


Running this for various levels of duplication shows that the formula provides a fair estimate of the level of duplication in the original files based on random sampling.

Overlap Offset  Duplicates  Sample Duplicates  Estimated File Duplication
0               1000000     10031              1003100
100000          900000      9097               909700
200000          800000      8137               813700
300000          700000      6942               694200
400000          600000      6042               604200
500000          500000      4998               499800
600000          400000      4100               410000
700000          300000      3034               303400
800000          200000      2076               207600
900000          100000      1012               101200
1000000         0           0                  0

The formula can be extended to handle original files of different sizes. For example, taking two original files of 250,000 and 750,000 records and samples of 100,000 from each:

Duplicates in Source Files = Duplicates in Sample / ((Sample Size / File Size 1) * (Sample Size / File Size 2))
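
To test this, the sample generation needs a separate size parameter for each original file. The sketch below is one way to do it, using hypothetical @file_size1 and @file_size2 parameters in place of the single @file_size; run it as a new batch in the same session so the declarations don’t clash with the earlier ones:

declare @file_size1 int = 250000
declare @file_size2 int = 750000
declare @sample_size int = 100000
declare @overlap_offset int = 0

-- clear the samples from the earlier runs
truncate table #file1
truncate table #file2

-- random sample of @sample_size ids from the first original file (ids 1 to @file_size1)
insert into #file1
select id from 
(
select id, row_number() over (order by checksum(NEWID())) row_num from #2M
where id between 1 and @file_size1
) a
where row_num between 1 and @sample_size

-- random sample of @sample_size ids from the second original file, shifted by @overlap_offset
insert into #file2
select id from 
(
select id, row_number() over (order by checksum(NEWID())) row_num from #2M
where id between 1 + @overlap_offset and @file_size2 + @overlap_offset
) a
where row_num between 1 and @sample_size

-- apply the extended formula to the sample overlap
select	count(*) / ((cast(@sample_size as float) / @file_size1) * (cast(@sample_size as float) / @file_size2)) as estimated_file_duplicates
from 	#file1 f1 join #file2 f2 on f1.id = f2.id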

This also works, as shown below. So a solution is possible, but the fact remains that the effort needed to generate the necessary random samples probably outweighs any benefit of de-duping on a smaller data set.

Overlap Offset  Duplicates  Sample Duplicates  Estimated File Duplication
0               250000      3323               249225
50000           200000      2718               203850
100000          150000      2049               153675
150000          100000      1321               99075
200000          50000       655                49125
250000          0           0                  0