A common ask (if you’ll excuse the pun) in selection briefs is some kind of randomisation; as for a test mailing or for AB test. This is not a difficult job but is easy to get wrong – and when it does go wrong it can have serious consequences.
One CRM I worked with implemented a function to provide the user with a “random” selection of N contacts from some larger selection set. It did this based on the distibution of contact IDs. This was certainly not random but was difficult to spot if the distribution of IDs in a selection was fairly even. A later change to the system imports procedure however meant that the contact ID sequence would often jump by one hundred everytime an import was run. This meant that contact IDs were much more spread out for newer contacts than for older ones and any random selections would actually include mostly older contacts. Since the age of a supporter is often the biggest predictor of it’s behavior this rendered many test mailings that had been run before the issue was spotted at best redundant, at worst misleading.
Anyway a simple approach to getting a truly random selection is to attach a random value to each row in your selection, sort by this value and take the top N rows from the result. In SQL we can generate a random value using the function:
select checksum(newID())
Here newID() generates a 32 byte ID which is always unique. It is not random however, but if we apply checksum() to the result this will generate a hash value of the ID. This is as close to random as makes no odds for our purpose. Order the selection data by this value and add a top clause and the job’s done.
select top 1000 selection_columns from selection_table order by checksum(newID())
We can very easily perform an even random split using the NTILE windowing function on the same random value. Here we split a selection into three evenly sized random portions.
select ntile (3) over (order by checksum(newID())) part_number , selection_columns from selection_table
This is pretty quick but does mean every time we want to change the details of the split we have to update the code. It also does not allow uneven splits (different proportions in each split). An alternative is to maintain a table that holds the split details and get the users to maintain that.
create table splits ( test_name varchar(50) --the name of the test , split_count int --the number of splits in the test , split_number int --the number of the split , split_desc varchar(50) --the name of the split ) insert into splits values ('Value Test', 3, 1, 'Low Ask') insert into splits values ('Value Test', 3, 2, 'Medium Ask') insert into splits values ('Value Test', 3, 3, 'High Ask')
There’s obviously some normalisation that could happen here but to keep the example simple I’ve kept the test and splits in the same table where they should really be separated. If we add a row number to our selection again based on our random value.
select selection_columns , ROW_NUMBER() over (order by checksum(newID()) rand_seq from selection_table
We can join to the splits table thus:
select s.selection_columns, s.split_desc from selection_table s join splits st on (s.rand_seq % st.split_count) + 1 = st.split_number
Here (sub.rand_seq % st.split_count) + 1 gives the modulus of the row number by the number of splits (3) and adds one so every row has either the number 1, 2 or 3. We know that the row number is sequential (no gaps) and is randomly assigned to each row so every row has the same chance of being 1, 2 or 3. Join this to the splits table on split_number and the file is divided into 3 equal, random samples. Now you can vary the splits by updating the table.
I’ll discuss approaches for creating uneven splits and splits by absolute numbers in an update to this article another time.