The amount of data is increasing every day, which can slow down the process of loading data from different sources into your Data Warehouse. Besides the delays, sluggish data loads can be unstable and cause errors. Most ETL tools are scalable, but is that enough? Does the way ETL jobs and workflows are developed make any difference? It does, because these tools are only scalable if used properly.
How to speed up loading processes?
There are different ways to speed up loading processes – optimizing the existing workflows using ETL tools, or implementing the same process on a high-performance technology. We can do both – get in touch.
Here is an example of what we have done to optimize a time-consuming loading process.
A workflow implemented in Informatica PowerCenter for processing monthly transactional data (about 50 million records) ran for over 20 hours every month. This was a huge problem: the workflow produces the ground data for all the reports and dashboards in the company, and it caused trouble for numerous business users who needed their reports as soon as possible. What can be done to make it faster?
We analyzed the existing ETL mappings and workflow and found the issues causing the slowdown. After that we modified the loading process using the same technology and system resources, preserving the accuracy of the data. After this modification, the duration of the workflow was reduced to 4.5 hours. Isn't that amazing? How is it possible? The most important thing is to detect the bottlenecks: is it reading data from the source, transforming the data, or writing to the target database? It might not be just one of those.
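Although our tuning here happened inside Informatica PowerCenter, the idea of isolating read, transform, and write times is general. As a minimal sketch (in Spark/Scala, the stack we turn to later; the connection details and the aggregation are hypothetical stand-ins), each stage can be timed separately to see where the hours actually go:

```scala
import org.apache.spark.sql.SparkSession

object BottleneckCheck {
  // Run a stage and report how long it took.
  def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bottleneck-check").getOrCreate()

    // Stage 1: read from the source (hypothetical JDBC source).
    // count() forces the lazy read so its cost is measured here, not later.
    val source = timed("read") {
      val df = spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//src-host:1521/SRC") // hypothetical
        .option("dbtable", "transactions")                      // hypothetical
        .load()
        .cache()
      df.count()
      df
    }

    // Stage 2: transform only, no write yet (hypothetical aggregation).
    val transformed = timed("transform") {
      val df = source.groupBy("account_id").sum("amount").cache()
      df.count()
      df
    }

    // Stage 3: write to the target warehouse (hypothetical target).
    timed("write") {
      transformed.write.format("jdbc")
        .option("url", "jdbc:postgresql://dwh-host/edw") // hypothetical
        .option("dbtable", "monthly_totals")             // hypothetical
        .mode("overwrite")
        .save()
    }

    spark.stop()
  }
}
```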
The next step is fixing the bottlenecks: deciding which mapping transformations to use, how to set buffer sizes and memory properties, whether or not to use advanced optimization options, and so on. There is no cookbook for this, because many different variables affect the performance of an ETL process. Use a joiner or a lookup transformation? How should the joiner inputs be set? Sort the data before aggregating? Write an override query? These questions never have the same answer for two different ETL mappings. It takes great knowledge and experience to find the right answer.
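To make one of those trade-offs concrete in Spark terms (a hedged sketch; the tables, columns, and connection strings below are hypothetical): the lookup-versus-joiner question becomes broadcast join versus shuffle join, and an override query becomes a pushdown query that filters at the source instead of inside the ETL engine:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-choices").getOrCreate()

// "Override query": push the filter down to the source database instead of
// reading the whole table and filtering in the ETL engine.
val transactions = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//src-host:1521/SRC")              // hypothetical
  .option("query", "SELECT * FROM transactions WHERE period = 202401") // hypothetical
  .load()

val accounts = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//src-host:1521/SRC") // hypothetical
  .option("dbtable", "accounts")                          // hypothetical
  .load()

// Lookup-style join: broadcast the small dimension so each executor joins
// locally, avoiding a shuffle of the 50-million-row fact table.
val enrichedSmall = transactions.join(broadcast(accounts), Seq("account_id"))

// Joiner-style join: for two large inputs, let Spark shuffle and sort both
// sides (sort-merge join); broadcasting would not fit in memory.
val enrichedLarge = transactions.join(accounts, Seq("account_id"))
```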
Image 1: Architecture of Spark integrated with source system and Enterprise Data Warehouse
Is it possible to make it even faster?
Of course, by using a different technology that offers a higher-performance engine.
Big Data technologies are becoming crucial for analyzing data. Among the many benefits of these technologies, high-performance data processing is one of the most important.
Using the Spark engine for large-scale data processing, we managed to reduce the workflow duration to 30 minutes, about 40 times faster than the initial 20-hour workflow. The most exciting challenge was translating the Informatica PowerCenter logic to Spark, two totally different technologies, while keeping the data output identical to the original. Spark offers several libraries for different purposes, such as ETL processing, machine learning, data analysis and visualization, and the great thing about Spark is that we can combine them according to our needs. In this case we used Scala with the SQL library for optimal performance.
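For a sense of what such a Spark job can look like, here is a minimal sketch combining the Scala DataFrame API with Spark SQL (the real transformation logic is more involved; the source and target URLs, table names, and the aggregation shown are hypothetical stand-ins):

```scala
import org.apache.spark.sql.SparkSession

object MonthlyLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("monthly-load").getOrCreate()

    // Read ~50M monthly transaction records in parallel from the source
    // system (hypothetical URL, table, and partitioning column).
    val transactions = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//src-host:1521/SRC")
      .option("dbtable", "transactions")
      .option("partitionColumn", "txn_id")
      .option("lowerBound", "1")
      .option("upperBound", "50000000")
      .option("numPartitions", "32")
      .load()

    // Express the transformation in Spark SQL, which maps naturally onto
    // the SQL-style logic of the original Informatica mappings.
    transactions.createOrReplaceTempView("transactions")
    val monthly = spark.sql(
      """SELECT account_id, period,
        |       SUM(amount) AS total_amount,
        |       COUNT(*)    AS txn_count
        |FROM transactions
        |GROUP BY account_id, period""".stripMargin)

    // Write the result to the Enterprise Data Warehouse (hypothetical target).
    monthly.write.format("jdbc")
      .option("url", "jdbc:postgresql://dwh-host/edw")
      .option("dbtable", "monthly_summary")
      .mode("append")
      .save()

    spark.stop()
  }
}
```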
Image 2: Performance improvement using Spark for one period of data
This example shows our capacity to deliver better-performing results across different technologies while maintaining data quality – the product of deep knowledge and a desire for new challenges.
Need to speed things up? Feel free to contact us!
If you have any questions, we are just a click away.