Friday, May 22, 2015

Data Engineering Design Pattern - Divide and conquer to speed up Data transformation process

Divide and conquer to speed up Data transformation process
----------------------------------------------------------------------
Every once in a while we face a situation where an ETL process finishes by hour X, while the reports have to be delivered by hour X-2 or so. So we have to go back and tune our ETL pipeline to win back those hours.

And more often than not we already have our bases covered (good table design, check; latest stats, check; good query plan, check). What if everything is fine and we still have to bring down the ETL time?

People will always say that you have to use a Hadoop/MapReduce solution for such a scenario, but what if you have to manage with only the resources that are available to you? In such a situation the most common thing to do is to divide the problem into smaller pieces, and the way I do it is by logically partitioning the data transformation layer.

Here is an example which will help you understand better:


Let's say we have a basic website visits table with billions of records in it, and we are running an aggregation on it to generate a daily metric. Below is our simplistic transform query, which takes 2 hours to run, but we have to finish the job in half an hour:

Original Query:
Select user_agent, count(1) as total_visits, sum(case when behavior='click' then 1 else 0 end) as total_clicks from website_visitors where visit_day=to_date('20150501','YYYYMMDD') group by user_agent;

To divide the problem into smaller pieces we can use any built-in hashing function provided by the database. For example, in the case of Oracle we can use ORA_HASH, and in the case of Redshift we can use the MD5 or CHECKSUM function, although CHECKSUM gives better performance.

Well, the modified query in Oracle will look like this:

Select user_agent, count(1) as total_visits, sum(case when behavior='click' then 1 else 0 end) as total_clicks from website_visitors where visit_day=to_date('20150501','YYYYMMDD') and ora_hash(user_agent,7)=0 group by user_agent;
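For comparison, here is how the same split could be sketched on Redshift using CHECKSUM and MOD (this is just a sketch; it assumes the same table and column names as the Oracle example, and a plain date literal instead of Oracle's to_date):

select user_agent, count(1) as total_visits, sum(case when behavior='click' then 1 else 0 end) as total_clicks
from website_visitors
where visit_day='2015-05-01'
and mod(abs(checksum(user_agent)), 8)=0   -- bucket 0 of 8
group by user_agent;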

What the modified query is essentially doing is looking at only 1/8th of the data, and if everything is fine it will run in about 1/8th of the original time (in practice it usually takes a bit more than that, but it will still be much better than 2 hours).
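Of course, to cover the whole day we have to run all eight bucket queries (ora_hash(user_agent,7) = 0 through 7), ideally in parallel from separate sessions, and then stitch the pieces back together. One way to do that (a sketch, assuming a staging table named daily_visit_metrics_stg that we create ourselves) looks like this:

-- Run one of these per bucket, each from its own session, so the eight buckets execute in parallel.
insert into daily_visit_metrics_stg (user_agent, total_visits, total_clicks)
select user_agent, count(1), sum(case when behavior='click' then 1 else 0 end)
from website_visitors
where visit_day=to_date('20150501','YYYYMMDD')
and ora_hash(user_agent,7)=0   -- change 0 to 1..7 for the other buckets
group by user_agent;

-- Once all eight buckets have loaded, the daily metric is simply everything in the staging table;
-- each user_agent hashes to exactly one bucket, so no re-aggregation or de-duplication is needed.
select user_agent, total_visits, total_clicks from daily_visit_metrics_stg;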

Now, for this to work effectively, we have to pick a column that has a good distribution, and we also need to ensure that the logical partitions we are creating don't generate duplicate metrics. In our example, hashing on the same column we group by (user_agent) guarantees that each user_agent lands in exactly one bucket, so the slices can be combined without double counting.
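A quick way to sanity-check the distribution (again, just a sketch against the same example table) is to count the rows that fall into each bucket and make sure none of them is disproportionately large:

select ora_hash(user_agent,7) as bucket, count(1) as rows_in_bucket
from website_visitors
where visit_day=to_date('20150501','YYYYMMDD')
group by ora_hash(user_agent,7)
order by bucket;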

And that's all there is to it :) In the next article, I will talk about how to speed up the loads using divide and conquer.