Need for speed (in the world of data)
The amount of data grows every day, which can slow down loading data from different sources into your Data Warehouse. Besides the delays, sluggish data loads can be unstable and cause errors. Most ETL tools are scalable, but is that enough? Does the way ETL jobs and workflows are developed make any difference? It does, because these tools are scalable only if used properly.
How to speed up loading processes?
There are different ways to speed up loading processes: optimizing the existing workflows within the ETL tool, or implementing the same process on a high-performance technology. We can do both.
Here is an example of what we have done to optimize a time-consuming loading process.
A workflow implemented in Informatica PowerCenter for processing monthly transactional data (about 50 million records) ran for over 20 hours every month. It caused trouble for numerous business users who needed their reports as soon as possible. This was a huge problem, since this is the ground data for generating all the reports and dashboards in the company. What can be done to make it faster?
We analyzed the existing ETL mappings and workflow and found the issues causing the slowdown. We then modified the loading process using the same technology and system resources, preserving the accuracy of the data. After this modification, the duration of the workflow was reduced to 4.5 hours. Isn't that amazing? How is it possible? The most important thing is to detect the bottlenecks: is it reading data from the source, transforming the data, or writing to the target database? It might not be just one of these.
The next step is removing the bottlenecks: deciding which mapping transformations to use, how to set buffer sizes and memory properties, whether to use advanced optimization options, and so on. There is no cookbook for this, because many different variables affect the performance of an ETL process. Use a joiner or a lookup transformation? How should the joiner inputs be set? Sort the data before aggregating? Write an override query? These questions never have the same answer for two different ETL mappings, and it takes great knowledge and experience to find the right one.
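These trade-offs are tool-specific, but the same kind of decision appears in any data-processing engine. As an illustration only (not the actual mapping described above, and with hypothetical paths and column names), the joiner-vs-lookup question corresponds in Spark roughly to choosing between a shuffle join and a broadcast join, which can be requested explicitly:

```scala
// Sketch: the joiner-vs-lookup trade-off expressed in Spark terms.
// A lookup over a small reference table is analogous to a broadcast join
// (the small side is shipped to every executor); a joiner over two large
// inputs is analogous to a shuffle join. All names here are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategies {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-strategies").getOrCreate()

    val facts    = spark.read.parquet("/data/facts")      // large input
    val smallDim = spark.read.parquet("/data/dim_small")  // small reference table

    // Broadcast ("lookup-like"): avoids shuffling the large side.
    val viaBroadcast = facts.join(broadcast(smallDim), "dim_id")

    // Default join ("joiner-like"): Spark may shuffle both sides by key.
    val viaShuffle = facts.join(smallDim, "dim_id")

    viaBroadcast.explain() // inspect the physical plan to confirm the strategy

    spark.stop()
  }
}
```

As in the ETL tool, neither strategy is always right: broadcasting pays off only when one side comfortably fits in executor memory.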
Image 1: Architecture of Spark integrated with source system and Enterprise Data Warehouse
Is it possible to make it even faster?
Of course, by using a different technology that offers a higher-performance engine.
Big Data technologies are becoming crucial for analyzing data. Among the many benefits of using them, high-performance data processing is one of the most important.
Using the Spark engine for large-scale data processing, we managed to reduce the workflow duration to 30 minutes, over 60 times faster than the initial workflow. Translating the Informatica PowerCenter code to Spark, two totally different technologies, while maintaining the original data output was the most exciting challenge. Spark offers several libraries for different purposes, such as ETL processing, machine learning, data analysis and visualization, and the great thing about Spark is that we can combine them according to our needs. In this case we used the Scala API and Spark SQL for optimal performance.
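To give a flavor of what such a job looks like, here is a minimal sketch of a Spark ETL job combining the Scala API with Spark SQL. It is not the production workflow described above; the paths, format, and column names are hypothetical placeholders:

```scala
// Minimal Spark ETL sketch: read a monthly extract, aggregate with Spark SQL,
// write the result for downstream reporting. All paths, formats, and column
// names are assumptions for illustration, not the actual production job.
import org.apache.spark.sql.SparkSession

object MonthlyLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("monthly-transactions-load")
      .getOrCreate()

    // Read the monthly transactional extract (CSV is an assumption).
    val transactions = spark.read
      .option("header", "true")
      .csv("/data/in/transactions_monthly.csv")
    transactions.createOrReplaceTempView("transactions")

    // Express the aggregation in Spark SQL; Catalyst plans the join and
    // aggregation strategies that would be hand-tuned in a classic ETL tool.
    val summary = spark.sql(
      """SELECT account_id,
        |       SUM(amount) AS total_amount,
        |       COUNT(*)    AS tx_count
        |FROM transactions
        |GROUP BY account_id""".stripMargin)

    // Write the result (Parquet as an assumption) for reporting to consume.
    summary.write.mode("overwrite").parquet("/data/out/monthly_summary")

    spark.stop()
  }
}
```

A job like this is typically packaged with sbt and submitted with `spark-submit`, which lets the same code scale from a local test run to the full cluster.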
Image 2: Performance improvement using Spark for one period of data
This example shows our capacity, built on deep knowledge and an appetite for new challenges, to achieve better-performing results across different technologies while maintaining the quality of the data.
Need to speed things up? Feel free to contact us!