A while back I was asked if I could carry out a contact de-duplication exercise based on samples from the actual files to be de-duplicated. My initial answer was no. Clearly, checking the level of duplication on a sample will not give the same results as checking the complete files. Given two identical files (100% duplicated) of 50 records each, and taking a random sample of 5 records from each, the chance of getting two identical samples is over two million to one against (1 in C(50,5) = 1 in 2,118,760)! My second, more thoughtful answer was yes, but it would be fiddly (more so than just using the complete files). Still, I was curious enough to work out how it would be done.
First, the samples would have to be randomly selected. Selecting data by any specific dimension (for example, a specific time period) would risk biasing the sample and giving results that are not valid for the complete file. Secondly, the file sizes and sample sizes need to be large: large enough to provide a statistically significant answer. I'll test this with two original files of 1M records each and take 10% samples.
For the case where the two files are identical (100% duplicated), intuitively I would expect the samples to be 10% duplicated: given a random sample A of 100K records, each record in a random sample B has a 10% chance of also being in sample A. If the files are, for example, only 50% duplicated, I'd expect the chance (and hence the level of duplication between the samples) to drop by half, to 5%. So my guess is:
Duplicates in Source Files = Duplicates in Sample / (Sample Size / File Size)²
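As a quick illustration (the duplicate count here is purely hypothetical): with 1M-record files and 100K samples the sampling fraction is 0.1, so finding 10,000 duplicates between the two samples would give an estimate of 10,000 / (0.1 × 0.1) = 10,000 × 100 = 1,000,000 duplicates in the original files.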
To check this I'll run some tests. First, create a table populated with 2M unique IDs to act as the source for the sample files, then two tables to hold the sample files themselves. I'm using a pre-populated integers table containing sequential row numbers (1, 2 … N) to do this.

    create table #2M (id int)

    insert into #2M
    select row_num
    from dbo.integers_large
    where row_num <= 2000000

    create table #file1 (id int)
    create table #file2 (id int)

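If a pre-populated integers table such as dbo.integers_large isn't to hand, the 2M sequential IDs could instead be generated on the fly with something along these lines (a sketch; it assumes the cross join of sys.all_objects with itself returns at least 2M rows, which it normally does):

    -- alternative to dbo.integers_large: generate 2M sequential ids on the fly
    insert into #2M
    select id
    from (
        select row_number() over (order by (select null)) id
        from sys.all_objects a
        cross join sys.all_objects b
    ) n
    where id <= 2000000
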
Then create the two sample files. The following parameters define the original file size (1M), the sample size (100K) and the overlap offset that controls the degree of duplication between the original files (0 = identical files, 1M = no duplication).

    declare @file_size int = 1000000
    declare @sample_size int = 100000
    declare @overlap_offset int = 0

Then generate the sample files. The inner query generates an original file of @file_size records and the outer query takes a random sample of @sample_size records from it. The overlap (the amount of duplication between the original files) can be varied using @overlap_offset in the second file's query.

    insert into #file1
    select id
    from (
        select id, row_number() over (order by checksum(NEWID())) row_num
        from #2M
        where id between 1 and @file_size
    ) a
    where row_num between 1 and @sample_size

    insert into #file2
    select id
    from (
        select id, row_number() over (order by checksum(NEWID())) row_num
        from #2M
        where id between 1 + @overlap_offset and @file_size + @overlap_offset
    ) a
    where row_num between 1 and @sample_size

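Before counting duplicates it's worth confirming the samples came out as expected; a quick sanity check (my addition, not part of the original script) could be:

    -- each sample should hold exactly @sample_size distinct ids,
    -- since the sampling is without replacement from distinct source ids
    select count(*) rows_in_sample, count(distinct id) distinct_ids from #file1
    select count(*) rows_in_sample, count(distinct id) distinct_ids from #file2
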
Finally check the number of duplicates between the two sample files.

    select count(*)
    from #file1 f1
    join #file2 f2 on f1.id = f2.id

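The scaling in the formula can also be applied directly in the same query. A sketch (my addition, assuming @file_size and @sample_size are still in scope from the declarations above):

    -- scale the sample duplicate count back up to file level:
    -- sample duplicates / (sampling fraction squared)
    select count(*) / ( (@sample_size / cast(@file_size as float))
                      * (@sample_size / cast(@file_size as float)) ) estimated_file_duplicates
    from #file1 f1
    join #file2 f2 on f1.id = f2.id
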
Running this for various levels of duplication shows that the formula provides a fair estimate of the level of duplication in the original files based on random sampling.
Overlap Offset | Actual File Duplicates | Sample Duplicates | Estimated File Duplicates |
---|---|---|---|
0 | 1000000 | 10031 | 1003100 |
100000 | 900000 | 9097 | 909700 |
200000 | 800000 | 8137 | 813700 |
300000 | 700000 | 6942 | 694200 |
400000 | 600000 | 6042 | 604200 |
500000 | 500000 | 4998 | 499800 |
600000 | 400000 | 4100 | 410000 |
700000 | 300000 | 3034 | 303400 |
800000 | 200000 | 2076 | 207600 |
900000 | 100000 | 1012 | 101200 |
1000000 | 0 | 0 | 0 |
The formula can be extended to handle original files of different sizes. For example, take two original files of 250,000 and 750,000 records and samples of 100,000 from each.
Duplicates in Source Files = Duplicates in Sample / ((Sample Size / File Size 1) * (Sample Size / File Size 2))
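Expressed in SQL, the adjusted estimate would look something like the sketch below (my addition; @file_size1 and @file_size2 are hypothetical names rather than variables from the original script, and @sample_size is assumed to still be in scope):

    -- estimate for source files of different sizes: divide the sample
    -- duplicate count by the product of the two sampling fractions
    declare @file_size1 int = 250000
    declare @file_size2 int = 750000

    select count(*) / ( (@sample_size / cast(@file_size1 as float))
                      * (@sample_size / cast(@file_size2 as float)) ) estimated_file_duplicates
    from #file1 f1
    join #file2 f2 on f1.id = f2.id
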
As the table below shows, this also works. So a solution is possible, but the fact remains that the effort needed to generate the necessary random samples probably outweighs any benefit of de-duplicating a smaller data set.
Overlap Offset | Actual File Duplicates | Sample Duplicates | Estimated File Duplicates |
---|---|---|---|
0 | 250000 | 3323 | 249225 |
50000 | 200000 | 2718 | 203850 |
100000 | 150000 | 2049 | 153675 |
150000 | 100000 | 1321 | 99075 |
200000 | 50000 | 655 | 49125 |
250000 | 0 | 0 | 0 |