BUSINESS OBJECTIVES

The client is a cloud-based data analytics company focusing on operations, security, and BI use cases. It provides log management and analytics services that leverage machine-generated big data to deliver real-time IT insights.

SOLUTION

Our client captures marketing data from multiple sources, and it was difficult to consolidate the same set of data from all of those sources in one place. The immediate challenges included:

Extracting marketing data from one of the sources via its API. The API has no concept of pagination for collecting all the data in one run. Instead, the data has to be collected through several internal loops, and the API does not expose all the records until those loops have run. This was complex to achieve in ETL tools and reduced overall performance. In addition, one of the subjects required multiple iterations.
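The looping described above can be sketched in Python. This is only a minimal illustration, not the client's actual code: the `fetch` callable, the `offset` variable, and the "empty batch means done" stopping rule are all assumptions, since the source API and its request parameters are not named here.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
# Hypothetical fetcher: (subject, offset) -> one batch of records.
# Assumption: the source returns an empty batch once nothing is left.
Fetcher = Callable[[str, int], List[Record]]


def collect_subject(fetch: Fetcher, subject: str,
                    max_requests: int = 10_000) -> List[Record]:
    """Call the endpoint repeatedly, advancing a loop variable ourselves,
    until the source stops returning records for this subject."""
    records: List[Record] = []
    offset = 0
    for _ in range(max_requests):
        batch = fetch(subject, offset)
        if not batch:
            return records
        records.extend(batch)
        offset += len(batch)
    # Guard against a source that never signals completion.
    raise RuntimeError(
        f"{subject}: no empty batch after {max_requests} requests")


def collect_all(fetch: Fetcher,
                subjects: List[str]) -> Dict[str, List[Record]]:
    """Outer loop over subjects, inner loop over requests per subject."""
    return {subject: collect_subject(fetch, subject) for subject in subjects}
```

Passing the fetcher in as a parameter keeps the loop logic separate from the HTTP call, so the loop can be exercised against a stub before it is pointed at the real endpoint.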

 

Together, these issues made processing time-consuming and infeasible. End-to-end automation, the use of variables, and multiple nested loops made the standard solution complex, which in turn complicated error handling.

SUCCESS CRITERIA & BUSINESS VALUE

The standard solution was to use a POST request in Matillion to obtain the ‘Access Token’, which was then used with the endpoint to send the GET request that extracts the data. As mentioned above, there was no concept of pagination, so the endpoint had to be looped through using variables until all the records were extracted. This made the standard approach infeasible. Instead, we used the Python Script component in Matillion, which held the Python code with all the loops and variables and kept the rest of the mapping simple. For failure recovery, we overlapped the data extraction time periods. We then used Change Data Capture (CDC) in the ETL to filter out the duplicate records produced by the time period overlap. This solution produced the best results, and both processing time and complexity were reduced to negligible levels.
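The overlap-and-filter idea can be illustrated with a short Python sketch. It is a simplified stand-in under stated assumptions, not Matillion's CDC component: the function names, the `id` key field, and the one-hour overlap in the usage note are hypothetical, and the filter here only compares keys, so it would not detect records that changed between runs.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def extraction_window(last_run_end: datetime, now: datetime,
                      overlap: timedelta) -> Tuple[datetime, datetime]:
    """Start the next extraction slightly before the previous run ended,
    so records missed by a failed or partial run are requested again."""
    return last_run_end - overlap, now


def drop_already_loaded(extracted: Iterable[Record],
                        loaded_keys: Iterable[object],
                        key: str = "id") -> List[Record]:
    """Keep only records whose key has not been loaded before.
    Assumption: each record carries a unique key field."""
    seen = set(loaded_keys)
    fresh: List[Record] = []
    for record in extracted:
        record_key = record[key]
        if record_key not in seen:
            seen.add(record_key)
            fresh.append(record)
    return fresh
```

For example, with a previous run that ended at midnight and a one-hour overlap, `extraction_window` returns a start time of 23:00 on the previous day, and `drop_already_loaded` then discards the records from that hour that are already in the target.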