Working on an ambitious Data Lake implementation project as part of a digital transformation, several members of our Data Engineering team spent most of 2019 in Sweden.
We found ourselves in a country with a different culture and exceptionally pleasant people, where work is done thoroughly and with a focus on digital transformation and modernization in every field of business: Agile, data-driven, containers, AI, microservices – everything modern and up-to-the-minute.
The first thing we noticed when we arrived was FIKA – a Swedish “ritual” of friendly socializing over coffee and wonderful sweets and cakes. It is a great way to rest your grey cells or get some good advice from a colleague on another team.
That’s why we decided to tell you this story about building a Data Lake the sustainable, FIKA way and how we did it.
Data Driven Architecture
After the first few presentations and getting to know the client, it was obvious that they had invested a lot in their modernization strategy and that everything had been thoroughly planned. Before we arrived, a data-driven strategy had already been built as part of a digital transformation program that contained several interesting and challenging goals accompanied by modern AI products.
The focus was not only on delivering AI products and services, but also on building a concrete, self-sustaining data-driven architecture that relies on the Data Lake concept with fully organized Data Governance.
Hortonworks and Data Lake
The greenfield Data Lake project is based on the popular Hortonworks platforms (HDP, HDF). The Data Lake platform needed to be integrated with the other components of the data-driven architecture, such as Self-Service BI, a DWH appliance supported by the Data Vault methodology, and Data Integration services. Since a new microservice architecture based on Kubernetes was being built, it was also necessary to integrate data through an Enterprise Service Bus (ESB). It is an interesting and challenging data-driven architecture that encompasses the essential components for both offensive and defensive data strategies.
Data Ingestion Patterns
Prior to the implementation, and respecting the architectural requirements, we designed the data ingestion process (the process of loading data into the Data Lake) in order to select the appropriate technological solution. We identified eight different patterns, described each of them, and wrote a recommendation on when to use which pattern. We had to provide a continuous data ingestion process for legacy source databases, but also onboard new applications and systems (microservices) through the ESB integration platform using Kafka. This means we had to integrate legacy systems and a new service architecture into one Data Lake system. Unfortunately, we were not able to use a log-based CDC solution, which would have been ideal for the large number of legacy databases and would have shortened the data loading process into the Data Lake. Instead, we found a very good alternative – the DBImport tool, based on the Hadoop ecosystem, which we will explain later.
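To make the batch pattern concrete, here is a minimal sketch of a full-load ingestion from a relational source into the raw zone using Spark’s JDBC reader. It is a simplified illustration of the kind of work DBImport automates for us, not the tool itself; the connection details, table, and target names are placeholders, and the JDBC driver must be available on the Spark classpath.

```python
# Minimal sketch of the full-load batch ingestion pattern: read a source table
# over JDBC and land it in the raw zone as a Hive table. All names below are
# placeholders for illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("batch_ingest_customers")
    .enableHiveSupport()
    .getOrCreate()
)

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//source-db.example.com:1521/ORCL")  # placeholder
    .option("dbtable", "CRM.CUSTOMERS")                                    # placeholder
    .option("user", "ingest_user")
    .option("password", "secret")
    .option("fetchsize", 10000)
    .load()
)

# Full reload into the raw zone; incremental and merge patterns build on this.
source_df.write.mode("overwrite").saveAsTable("raw.customers")
```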
Data Catalog
After the source systems were prioritized, the Data Catalog was built, containing all the information necessary for the data ingestion process, such as the keys for each table, object types, data formats, descriptions of objects and attributes, a risk assessment for each field, classifications for the legal department, and the unavoidable requirements for anonymization of personal data.
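To illustrate, a single catalog entry carried metadata along these lines; the class and field names below are ours, invented for this sketch, and not the client’s actual schema.

```python
# Minimal sketch of the kind of metadata one Data Catalog entry carried per
# table. Field names and example values are invented for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    source_system: str                # e.g. "CRM"
    table_name: str                   # e.g. "CUSTOMERS"
    primary_key: List[str]            # keys used for incremental loads and merges
    data_format: str                  # e.g. "oracle", "csv", "avro"
    description: str                  # business description of the object
    risk_level: str                   # risk assessment for legal requirements
    legal_classification: str         # e.g. "PII", "financial", "public"
    anonymize_fields: List[str] = field(default_factory=list)  # fields to anonymize

entry = CatalogEntry(
    source_system="CRM",
    table_name="CUSTOMERS",
    primary_key=["CUSTOMER_ID"],
    data_format="oracle",
    description="Customer master data",
    risk_level="high",
    legal_classification="PII",
    anonymize_fields=["EMAIL", "PHONE_NUMBER"],
)
```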
We automatically tagged all the key information we collected in the Apache Atlas data catalog, so that we could define rule-based access policies and search for data. Atlas is a data governance tool for the HDP platform that provides data lineage, data classification, and a business glossary. We definitely wanted to avoid a data swamp, which is typical of first-generation Data Lake solutions that do not have adequate data classification. It is important to note that the enterprise Data Catalog solution in the overall architecture is based on the Informatica platform (Enterprise Data Catalog), which integrates with Apache Atlas.
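As a rough idea of what automatic tagging can look like, here is a minimal sketch that attaches a classification to a Hive table through the Atlas v2 REST API. The host, credentials, table, and classification names are placeholders, and a kerberized cluster would authenticate with SPNEGO rather than basic auth; our actual tagging jobs were driven by the Data Catalog rather than hard-coded like this.

```python
# Minimal sketch: tag a Hive table entity in Apache Atlas with a "PII"
# classification via the Atlas v2 REST API. Host, credentials, and names
# are placeholders.
import requests

ATLAS_URL = "https://atlas.example.com:21443/api/atlas/v2"  # placeholder host
AUTH = ("atlas_admin", "secret")                            # placeholder credentials

def tag_hive_table(qualified_name: str, classification: str) -> None:
    # Look up the table entity by its unique qualifiedName attribute
    resp = requests.get(
        f"{ATLAS_URL}/entity/uniqueAttribute/type/hive_table",
        params={"attr:qualifiedName": qualified_name},
        auth=AUTH,
    )
    resp.raise_for_status()
    guid = resp.json()["entity"]["guid"]

    # Attach the classification (tag) to the entity
    resp = requests.post(
        f"{ATLAS_URL}/entity/guid/{guid}/classifications",
        json=[{"typeName": classification}],
        auth=AUTH,
    )
    resp.raise_for_status()

tag_hive_table("raw.customers@datalake", "PII")  # placeholder table and tag
```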
Thanks to the implementation of the Data Catalog, we were able to reconcile the requirements of the legal department (GDPR, PII, etc.) with the need for data democratization (AI, ML, product-oriented development). One of the key challenges was to allow unrestricted data collection regardless of purpose, while at the same time limiting analytical products to clearly defined data retention depending on the purpose and type of data. Combining the Apache Atlas and Apache Ranger tools solves this challenge.
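For instance, once tables are tagged in Atlas, access can be restricted with a tag-based policy in Apache Ranger. The sketch below creates such a policy through Ranger’s public REST API; the service, group, and host names are placeholders, and the exact payload fields may differ between Ranger versions, so treat it as an illustration of the approach rather than a ready-made call.

```python
# Minimal sketch: create a tag-based Ranger policy that lets only the
# "data_governance" group select from objects tagged "PII". Names are
# placeholders and the payload shape may need adjusting per Ranger version.
import requests

RANGER_URL = "https://ranger.example.com:6182"  # placeholder host
AUTH = ("ranger_admin", "secret")               # placeholder credentials

policy = {
    "service": "datalake_tag",                  # tag-based service (placeholder name)
    "name": "PII_read_restricted",
    "isEnabled": True,
    "resources": {"tag": {"values": ["PII"], "isExclude": False, "isRecursive": False}},
    "policyItems": [
        {
            "groups": ["data_governance"],
            "accesses": [{"type": "hive:select", "isAllowed": True}],
            "delegateAdmin": False,
        }
    ],
}

resp = requests.post(f"{RANGER_URL}/service/public/v2/api/policy", json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy:", resp.json().get("id"))
```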
Deep Dive into Technology
When we joined the project, pretty much all of the fundamental technologies to be used for the Data Lake were already defined, and they were all technologies we had extensive experience with. The entire data lake system was based on the Hortonworks HDP platform, and one of our first major tasks was to install the platform on the cluster of servers. We had done that many times before, so there were no unexpected problems, except for the organization-specific rules and restrictions that we needed to adjust to. Most of these rules and restrictions were, of course, related to security. We will talk about that a bit more later in the text.
After the platform had been installed, we had two more important components to install that were not part of the HDP platform: DBImport and Airflow. DBImport, which we wrote about in one of our previous blogs, was used for the daily load of new data from source systems into the data lake, while Airflow was used as the workflow scheduler for the DBImport pipelines as well as for the other pipelines we had in the system.
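To give a feel for the scheduling side, here is a minimal Airflow DAG sketch in the 1.10-era syntax we used at the time. The task commands and names are placeholders for illustration, not DBImport’s actual CLI or our production DAGs.

```python
# Minimal sketch of a daily ingestion DAG (Airflow 1.10-era syntax).
# Commands and names are placeholders for illustration only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data_engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

dag = DAG(
    dag_id="daily_source_ingestion",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 2 * * *",   # every night at 02:00
    catchup=False,
)

ingest_customers = BashOperator(
    task_id="ingest_customers",
    bash_command="echo 'trigger import of source_db.customers'",  # placeholder command
    dag=dag,
)

ingest_orders = BashOperator(
    task_id="ingest_orders",
    bash_command="echo 'trigger import of source_db.orders'",     # placeholder command
    dag=dag,
)

ingest_customers >> ingest_orders
```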
Besides DBImport, which we used for daily loads, we integrated some source systems in real time through the integration platform and Kafka. Source systems sent real-time data to the integration platform, which wrote it to Kafka in the data lake. With the data in Kafka, the data lake team could pick it up, move it through the data lake zones, and in some cases write the transformed data back to Kafka.
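For the real-time path, a Spark Structured Streaming job is one way to land Kafka events in the raw zone. The sketch below shows the general shape; the broker, topic, and HDFS paths are placeholders, and the real pipelines also parsed and validated the payloads.

```python
# Minimal sketch: consume events from a Kafka topic with Spark Structured
# Streaming and land them in the raw zone on HDFS. Broker, topic, and paths
# are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("esb_events_to_raw_zone")
    .enableHiveSupport()
    .getOrCreate()
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:6667")   # placeholder broker
    .option("subscribe", "esb.customer.events")          # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value",
                "timestamp")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/raw/esb/customer_events")            # placeholder path
    .option("checkpointLocation", "/checkpoints/customer_events")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```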
After data landed in the raw zone, it would move through transformation pipelines to the other zones of the data lake. Depending on the type and nature of the data, whether it was real-time or batch-processed, we would use different technologies to transform it. Since the data was primarily stored in Hive, we used a lot of HQL scripts to transform it. Beyond that, since we love Spark for its power and reliability, we wanted to use it for data processing wherever it made sense. As a result, Spark played one of the most important roles in data processing inside the data lake.
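As an example of such a transformation, here is a minimal sketch of a batch step from the raw zone to a curated zone, expressed as HQL executed through Spark. The database, table, and column names are invented for illustration; the real pipelines also included data quality checks and error handling.

```python
# Minimal sketch: batch transformation from the raw zone to a curated zone,
# expressed as HQL run through Spark. All names are invented for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("raw_to_curated_customers")
    .enableHiveSupport()
    .getOrCreate()
)

load_date = "2019-06-01"  # in practice passed in by the scheduler

spark.sql(f"""
    INSERT OVERWRITE TABLE curated.customers
    PARTITION (load_date = '{load_date}')
    SELECT
        customer_id,
        TRIM(UPPER(country_code))      AS country_code,
        CAST(created_at AS TIMESTAMP)  AS created_at,
        sha2(email, 256)               AS email_hash   -- pseudonymize email (PII)
    FROM raw.customers
    WHERE load_date = '{load_date}'
""")
```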
Security
After the Data Lake was installed and the basic cluster configuration was set up, it was time to start with the security installation and configuration. In Sweden, there are strict rules and regulations that must be followed, especially when working with the personal data of our client’s customers. Some of the regulations were a bit outdated when it comes to modern data storage and processing technologies, so first we needed to find a middle ground with the legal team. It took us some time to reach agreements and translate the many regulations into appropriate security requirements.
All rules and regulations were tracked using a security catalog: a document in which all requirements were listed and grouped depending on whether they related to the platform or to a procedure that needed to be established. The requirements covered different security areas, such as access control, logging, data storage, configuration, documentation and so on. After we implemented a requirement, we had to write a detailed description of how it had been done. The idea behind this was to give better insight to the security officers supervising the entire process. We also used the catalog to track progress and as the agenda for our weekly security briefing with our security officer.
Here is an example of how strict regulations gave us a headache but also helped us improve tremendously. When we arrived in Stockholm, the first things we were given were a smart card and a laptop. Smart cards, in addition to their regular use for entering the premises, were also used as a second authentication factor for logging in to your laptop. More importantly for us, acquiring a Kerberos ticket also had to be done using a smart card, which brought plenty of issues. For those not familiar with the term, Kerberos is an authentication protocol used for secure communication between nodes or users in a network, and it is the most commonly used authentication protocol in Hadoop Data Lakes. Despite having plenty of prior experience and being prepared to handle Kerberos, the newly discovered issues proved to be a challenge, but one we were able to solve successfully. In the process we learned a lot, and it prepared us for future challenges.
In the End
The Swedes’ strong attitude towards the environment carries over to their work environment. Everything must be sustainable, applicable, and acceptable, from the way of communicating to implementation and maintenance. Accordingly, we adapted to that mindset and, in addition to implementing the Data Lake system, we added the mechanisms and key milestones of a sustainable Data Lake, like the almond paste inside a Swedish semla, so as to prepare it for migration to the Cloudera Data Platform (CDP) and the Cloud. To implement a Data Lake successfully, it is necessary to focus on the sustainability of the system at an early stage, which means adapting people and processes, i.e. having adequate Data Governance. Therefore, the focus should not be exclusively on technology, but also on Data Governance and Data Privacy (Legal and Security), as well as on the collaboration of all participants through the Data Catalog.
We hope that the Data Lake will continue to be successfully built, maintained, and migrated to the new Cloudera platform in the land of FIKA.
Authors:
Ivan Dundović
Matija Maljić
Darko Benšić
Converge all of your data sources into a polyglot platform capable of providing every insight about your business! Get in touch!