I recently needed to test for similarity between several sets of time series data – specifically response curves and attrition curves – and find the most similar data sets. I’m in no way an expert on curve fitting but using the squared deviation method seemed to work quite well and lends itself to a nice implementation in SQL so thought I’d share it here.
Given two sets of values over the same time intervals the squared deviation between them is the sum of the squares of differences in those values. The lower the deviation the more similar the curves. If we have an arbitary number of data sets we can easily write a query to perform this calculation to compare every set with every other set and pick the most similar pairs. Assume we have four data sets of three points.
drop table data_sets create table data_sets ( set_name varchar(50) , interval int , value int ) insert into data_sets values ('Set A', 1, 10) insert into data_sets values ('Set A', 2, 9) insert into data_sets values ('Set A', 3, 1) insert into data_sets values ('Set B', 1, 8) insert into data_sets values ('Set B', 2, 4) insert into data_sets values ('Set B', 3, 2) insert into data_sets values ('Set C', 1, 10) insert into data_sets values ('Set C', 2, 6) insert into data_sets values ('Set C', 3, 4) insert into data_sets values ('Set D', 1, 9) insert into data_sets values ('Set D', 2, 8) insert into data_sets values ('Set D', 3, 2)
We can the self join this table on the interval column and calculate square of the difference between the value at each interval for all possible pairs of datasets.
select t1.set_name sample_1 , t2.set_name sample_2 , sum(power(t1.value - t2.value, 2)) squared_deviation from data_sets t1 join data_sets t2 on t1.interval = t2.interval group by t1.set_name , t2.set_name order by sum(power(t1.value - t2.value, 2))
This gives us 16 records (4 x 4 datasets). This includes comparing sets with themselves and the complement of every pair (A to B and B to A) however which are redundant. These can be eliminted by adding and t1.set_name < t2.set_name to the join. This is useful technique anytime your comparing tables with themselves such as when performing an internal de-duplication. So the final results suggest that the most similar data sets are A and D and the most dissimilar are A and B.
sample_1 | sample_2 | squared_deviation |
---|---|---|
Set A | Set D | 3 |
Set C | Set D | 9 |
Set B | Set C | 12 |
Set B | Set D | 17 |
Set A | Set C | 18 |
Set A | Set B | 30 |