Making data ingestion easier with DBImport
Big data (or simply data) and Hadoop have been around for many years now. In order to keep the Hadoop clusters happy, we are supposed to feed them with data. When it comes to feeding Hadoop with data from relational databases, Apache Sqoop is a quite common tool to use. But if you have ever worked with Sqoop, you know how messy it can become to maintain all these pipelines, schedule them or change table definitions when something changes on the source side. That’s where DBImport comes to rescue.
DBImport has been developed by Berry Österlund, a Hadoop architect and system developer from our partner company Middlecon based in Stockholm. It is completely open-source and Berry is constantly improving it. With a growing number of users, he’ll soon need help from the open-source community to continue developing it.
What is DBImport and how does it work?
DBImport is an open-source ingestion tool that uses Sqoop or Spark to ingest data from the relational databases into Hive database in the Hadoop cluster. The best way to imagine what DBImport brings on top of Sqoop is to think about the things we need to do in the following situations if we use only Sqoop for data import:
- After an initial load is done, users want more data and they want it loaded on an hourly or daily basis
- To keep everything in order and to make your life easier, you create a bunch of bash scripts and schedule cron jobs that eventually become too hard to maintain
- You already have enough things to worry about and just when you thought you finally managed to keep the imports running fine for a couple of days, someone changes one column on the source side and everything falls apart
There’s quite a lot of work ahead of you if you recognize yourself in the situations described above. Well, luckily there is DBImport that can solve all these problems for you.
First data load with DBImport
DBImport can load data from many relational databases including Oracle, MySQL, DB2 or PostgreSQL. One of the nicest features in DBImport is that it will handle table schemas for you – not only before the initial load but even if the column definition or description changes in the later runs. The only thing you need to worry about is to give DBImport a valid connection string.
Wow, that’s already quite a big improvement in the quality of the data engineer’s life. The user needs new data and the only thing that we need to worry about is to find the database connection string, schema and the table name; DBImport will do the rest for us.
When loading data into Hadoop, we usually want it to be loaded regularly (e.g. hourly or daily), not only once. Sometimes we have a transactional table on the source side and we want to load newly inserted rows. But sometimes we have more complicated situations like master tables where all CRUD operations happen. DBImport provides us with different import type options that cover the aforementioned scenarios and a lot more.
We can divide DBImport import types into two basic categories: full and incremental import.
Full import reads the entire source table and loads it into Hive. But it is further divided into full import subtypes:
- Truncate and insert
- Full merge
- Full merge with history audit
Truncate and insert is the simplest one that will truncate data in Hive and load all data from the source into Hive every time the import is run.
The most interesting one is full merge with history audit that will load all data from source into Hive staging table, and then do the merge command to recognize what changed in the source after the previous load. As a final result, we will have a table in Hive with a current state of the data in the source, along with a history table in Hive that will hold all operations (inserts, updates or deletes) that happened in the source.
Incremental import types ingest only new data from the source to a Hive by keeping track of value in incremental integer or timestamp column. There are also different incremental import subtypes:
- Merge with history audit
- Oracle Flashback
Inserting newly arrived data into Hive, incremental insert import is fairly simple. Merge and merge with history audit do the same thing as in full merge imports. The only difference is that in incremental mode merge is done only on new data.
These import types provide us with a wide range of options and standardized solutions for our data ingestion jobs. It is worth mentioning that DBImport has export possibilities too, so you can move data in full or incremental mode from Hive table into a relational database table.
So far we’ve described cool basic features of DBImport for data ingestion that solve a lot of everyday problems data engineers go through. But you must be wondering how did DBImport solve the problem of scheduling all those jobs to run periodically. The answer is – Apache Airflow, an extremely popular workflow management platform. We can use all features of DBImport from the command line, and we can even schedule our DBImport imports with cron jobs. But that’s not something we want to do because that’s something we are running away from. J
After we have configured a new import in DBImport that we want to schedule to run daily, DBImport will provide a command that will generate an Airflow DAG for us. And, what’s more, it will even put it in the appropriate DAG directory, so that we wouldn’t have to worry about a thing.
Newly created DAG will show up in Airflow so that we can completely control, start/stop and monitor loads.
That seems quite nicer and a lot more user-friendly than maintaining cron jobs.
Multi cluster import
DBImport is being constantly improved and lately, we’ve seen some cool new features being implemented. Since DBImport is used in big enterprises, users have a lot of interesting requests and they push DBImport, (actually they push Berry) to improve DBImport constantly.
Another feature that is useful and sometimes mandatory for big enterprises is copying data to more than one cluster. DBImport solves this problem by providing Multi Cluster Import options that will make sure data is properly copied to all Hadoop environments in the organization. What is important to point out is that by using DBImport, data will be pulled out of the source system only once, which will greatly minimize the load on the source system. For big systems that is quite an important requirement.
We all know how that story goes, ignore data governance and you will turn your data lake into a data swamp. DBImport doesn’t want you to do that and that’s why it integrates with data governance tool Apache Atlas. Because of that, we can see the entire data lineage from source to Hadoop with much more detailed information about the source than what Sqoop provides.
Here we described some of the most important features that DBImport brings into the Hadoop world and how it can help us with our data ingestion scenarios.
We at CROZ are running DBImport heavily at customers with big Hadoop clusters, loading thousands of tables every day and we love it.
What’s next for DBImport?
When working with DBImport, users do most of the setup through inserting and updating DBImport database. It’s very simple but it would be nicer to have a web UI through which we could do all that. CROZ will contribute to DBImport by developing a frontend for DBImport in form of web application and make it even easier to use. The frontend will be developed during this summer so expect it to be ready when you come back from your summer vacation. J
In the meantime, take a look at DBImport documentation for more details and guidelines.
Download it from GitHub, install it and start ingesting data into Hadoop with ease!
Whether you’re building a new IT system or transforming an existing one, we are your experienced partner! Get in touch if you need help.