A while back I was asked if I could carry out a contact de-duplication exercise based on samples from the actual files to be de-duplicated. My initial answer was no. Clearly checking the level of de-duplication on a sample will not give the same results as for the complete files. Given two identical files (100% duplicated) of 50 records and taking a random sample of 5 records from each the chance of getting two identical samples is a million to one against! My second more thoughtful answer was yes but it would be fiddly (more so than just using the complete files). Still I was curious enough to work out how it would be done. Continue reading “De-duping based on Sample Files”