Comparing Response and Attrition Curves in SQL

I recently needed to test for similarity between several sets of time series data – specifically response curves and attrition curves – and find the most similar data sets. I’m in no way an expert on curve fitting but using the squared deviation method seemed to work quite well and lends itself to a nice implementation in SQL so thought I’d share it here.

Given two sets of values over the same time intervals the squared deviation between them is the sum of the squares of differences in those values. The lower the deviation the more similar the curves. If we have an arbitary number of data sets we can easily write a query to perform this calculation to compare every set with every other set and pick the most similar pairs. Assume we have four data sets of three points.

drop table data_sets

create table data_sets
(
	set_name varchar(50)
,	interval int
,	value int
)


insert into data_sets values ('Set A', 1, 10)
insert into data_sets values ('Set A', 2, 9)
insert into data_sets values ('Set A', 3, 1)
insert into data_sets values ('Set B', 1, 8)
insert into data_sets values ('Set B', 2, 4)
insert into data_sets values ('Set B', 3, 2)
insert into data_sets values ('Set C', 1, 10)
insert into data_sets values ('Set C', 2, 6)
insert into data_sets values ('Set C', 3, 4)
insert into data_sets values ('Set D', 1, 9)
insert into data_sets values ('Set D', 2, 8)
insert into data_sets values ('Set D', 3, 2)

 

We can the self join this table on the interval column and calculate square of the difference between the value at each interval for all possible pairs of datasets.

select	t1.set_name sample_1
,	t2.set_name sample_2
,	sum(power(t1.value - t2.value, 2)) squared_deviation
from	data_sets t1
	join data_sets t2
		on t1.interval = t2.interval
group by t1.set_name
,	t2.set_name
order by sum(power(t1.value - t2.value, 2))

 

This gives us 16 records (4 x 4 datasets). This includes comparing sets with themselves and the complement of every pair (A to B and B to A) however which are redundant. These can be eliminted by adding and t1.set_name < t2.set_name to the join. This is useful technique anytime your comparing tables with themselves such as when performing an internal de-duplication. So the final results suggest that the most similar data sets are A and D and the most dissimilar are A and B.

sample_1sample_2squared_deviation
Set ASet D3
Set CSet D9
Set BSet C12
Set BSet D17
Set ASet C18
Set ASet B30