Tag: DATA ENGINEERING
Cloud Pak for Data: a solution for multi-cloud, OpenShift containers, data virtualization and integration, DataOps and AI pipelines.
Imagine having thousands of processes in your data warehouse handling millions of records. How do you know which ones are critical? Which processes consume the most CPU or memory? Which ones consume and produce large amounts of data? How do you find out? The answer to all these questions is an SLA.
Documenting doesn’t have to be a chore. It can be fast and easy, yet extremely effective and useful. If team members communicate and make a note of their work on a daily basis, the team is more productive, no longer dependent on irreplaceable individuals, and headed for success.
R is a free software environment for statistical computing that also supports a tremendous range of graphs, maps, tables and other visualizations. R is also an extraordinarily easy and accessible tool when it comes to building visualizations.
Deequ is a great tool for exploratory data analysis as well as for in-depth data quality evaluation.
Here at CROZ Data Engineering Team, we are excited to use Deequ in our data processing pipeline and are looking
DBImport is an open-source ingestion tool that uses Sqoop or Spark to ingest data from relational databases into a Hive database in a Hadoop cluster.
Partitioning improves query performance on large tables: queries can access partitions in parallel, data deletion and data loading are faster, manageability is better, and there are many other benefits.
If you want to feel accepted, understood and motivated to take on new challenges, apply for Summer Accelerator, a student internship program at CROZ.
Delta Lake provides great features and solves some of the biggest issues that come with a data lake. On top of that, it is easy to use! Keep reading this post for some useful tips and tricks.