In the past, data was often treated as a byproduct of a business activity or process: it was stored in a system and rarely used for business improvement.
Today, by contrast, organizations increasingly recognize data as a valuable asset that enables business performance optimization and better decision-making. Consequently, organizations invest considerable effort and resources in collecting, storing, and analyzing data.
However, if that data is not accurate and consistent, all this effort can still result in bad business decisions.
According to a Gartner article published in July 2021, “Every year, poor data quality costs organizations an average of $12.9 million” [1].
No matter how hard you try to put your data to work, your best-laid plans come to nothing if the data you process is unreliable.
Although business processes differ from organization to organization, most organizations face the same challenges in data management. Below we summarize the most important terms for mastering the data challenge.
Data governance
Knowing where your data lives and who has access to it is fundamental to understanding its impact on your business. Data governance is the process of managing the availability, usability, integrity, and security of the data in enterprise systems. Data governance rules are based on internal data standards and policies. Effective data governance ensures that data is consistent, reliable, and not misused.
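To make this concrete, here is a minimal, hypothetical sketch of how such governance rules might look in code: a dataset policy that records the owner, the classification, and the roles allowed to read it. The dataset names, roles, and the DatasetPolicy/can_read helpers are illustrative assumptions, not a reference to any specific tool.

```python
from dataclasses import dataclass

# Hypothetical governance policy for a single dataset: who owns it,
# how it is classified, and which roles may read it.
@dataclass
class DatasetPolicy:
    name: str
    owner: str
    classification: str          # e.g. "public", "internal", "confidential"
    allowed_reader_roles: set[str]

POLICIES = {
    "customer_orders": DatasetPolicy(
        name="customer_orders",
        owner="sales-data-team",
        classification="confidential",
        allowed_reader_roles={"analyst", "data_engineer"},
    ),
}

def can_read(role: str, dataset: str) -> bool:
    """Return True if the role is allowed to read the dataset."""
    policy = POLICIES.get(dataset)
    return policy is not None and role in policy.allowed_reader_roles

if __name__ == "__main__":
    print(can_read("analyst", "customer_orders"))    # True
    print(can_read("marketing", "customer_orders"))  # False
```

In a real organization these policies live in a governance platform rather than in code, but the idea is the same: rules are defined once, centrally, and enforced wherever the data is accessed.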
Data catalog and metadata management
One of the main data governance initiatives is metadata management, which describes informational assets and helps turn data into an enterprise asset. Metadata is much more than just “data about data” or a definition of what the data identifies. Because data processes are increasingly complex, metadata management is both technical and business oriented. Such an approach helps you understand the format and structure of the data, but it also defines business rules, data sharing rules, and data quality rules so the data can be used appropriately.
Data catalogs help data users collect, organize, and enrich metadata to support data governance. Today, most integration platforms integrate with a data catalog, which builds an indexed inventory of available data assets that includes data lineage information, search functions, and collaboration tools. Many users also rely on data catalogs in self-service analytics to navigate data more easily and understand its business context.
Used appropriately, a data catalog leads to better data usage and contributes to operational efficiency and cost savings.
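As a rough illustration of what a catalog entry can hold, the sketch below combines technical metadata (schema), business metadata (owner, description, tags), and simple lineage information, together with a keyword search over the inventory. The CatalogEntry structure and field names are assumptions made for this example; real catalogs expose far richer models.

```python
from dataclasses import dataclass, field

# Hypothetical catalog entry combining technical and business metadata.
@dataclass
class CatalogEntry:
    name: str
    schema: dict                                          # column name -> type
    owner: str
    description: str
    upstream_sources: list = field(default_factory=list)  # simple lineage info
    tags: list = field(default_factory=list)

CATALOG: list[CatalogEntry] = []

def register(entry: CatalogEntry) -> None:
    CATALOG.append(entry)

def search(keyword: str) -> list[CatalogEntry]:
    """A very small 'indexed inventory': match the keyword in name, description or tags."""
    keyword = keyword.lower()
    return [
        e for e in CATALOG
        if keyword in e.name.lower()
        or keyword in e.description.lower()
        or any(keyword in t.lower() for t in e.tags)
    ]

register(CatalogEntry(
    name="customer_orders",
    schema={"order_id": "int", "customer_id": "int", "amount": "decimal"},
    owner="sales-data-team",
    description="Daily snapshot of confirmed customer orders.",
    upstream_sources=["crm.orders", "erp.invoices"],
    tags=["sales", "orders"],
))

print([e.name for e in search("orders")])  # ['customer_orders']
```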
Successful data-driven companies typically have a person responsible for the governance and utilization of data within the company – a chief data officer, or CDO for short. CDOs oversee a range of data-related functions to ensure that data is used to its fullest potential.
However, a well-structured data governance department is not enough – to draw conclusions from data, that data must be accurate.
Data quality and data observability
For accurate conclusions, data should be correct, unambiguous, consistent, and complete. At the beginning of the data processing workflow, data has not yet been cleaned or processed. The first step towards ensuring data accuracy is to transform the data into a usable form and apply data quality rules before the data is analyzed. If the data contains errors, they can be fixed before the data enters the next stage of processing.
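A minimal sketch of such quality rules, assuming a small set of illustrative fields and thresholds: each record is checked for completeness, an accepted value range, and consistent coding before it moves on.

```python
# Minimal sketch of data quality rules applied before analysis.
# The rules and field names are illustrative assumptions, not a fixed standard.
records = [
    {"order_id": 1, "amount": 120.0, "currency": "EUR"},
    {"order_id": 2, "amount": -5.0,  "currency": "EUR"},   # violates range rule
    {"order_id": 3, "amount": None,  "currency": "usd"},   # incomplete + wrong coding
]

def is_valid(record: dict) -> bool:
    """Correct, unambiguous, consistent, complete - expressed as simple rules."""
    return (
        record["amount"] is not None               # completeness
        and record["amount"] >= 0                  # accepted range
        and record["currency"] in {"EUR", "USD"}   # consistent coding
    )

clean = [r for r in records if is_valid(r)]
rejected = [r for r in records if not is_valid(r)]
print(f"{len(clean)} clean record(s), {len(rejected)} sent back for correction")
```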
To understand the health and condition of the data in your system, you can answer a series of questions that define the five pillars of data observability [2]; a sketch of such checks follows the list.
- Freshness – when was the last time the data was generated?
- Distribution – is the data within an accepted range, properly formatted and complete?
- Volume – has all the data arrived?
- Schema – what is the schema, and how has it changed?
- Lineage – for a given data asset, what are the upstream sources and downstream assets that are impacted by it?
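The sketch below shows how four of these pillars (freshness, volume, schema, and a simple distribution check) could be evaluated for a single table, assuming its statistics have already been collected from the warehouse; lineage is usually tracked through the data catalog rather than computed this way. The thresholds and field names are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical observability checks for one table; the expected schema,
# row count, and freshness threshold would normally come from monitoring config.
EXPECTED_SCHEMA = {"order_id": "int", "amount": "decimal", "created_at": "timestamp"}
EXPECTED_MIN_ROWS = 1_000
MAX_AGE = timedelta(hours=24)

def observe(table_stats: dict) -> dict:
    """table_stats is assumed to be collected from the warehouse's metadata."""
    now = datetime.now(timezone.utc)
    return {
        "freshness_ok": now - table_stats["last_loaded_at"] <= MAX_AGE,
        "volume_ok": table_stats["row_count"] >= EXPECTED_MIN_ROWS,
        "schema_ok": table_stats["schema"] == EXPECTED_SCHEMA,
        "distribution_ok": table_stats["null_amount_ratio"] <= 0.01,
    }

stats = {
    "last_loaded_at": datetime.now(timezone.utc) - timedelta(hours=2),
    "row_count": 15_423,
    "schema": EXPECTED_SCHEMA,
    "null_amount_ratio": 0.002,
}
print(observe(stats))  # all checks True for this example
```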
At the end of the data processing workflow, we need to validate and verify the results. That includes confirming that the data satisfies the specified criteria and checking for accuracy and inconsistencies after a data migration is done.
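As an illustration of such verification, the sketch below compares row counts and an order-independent content checksum between the source and target copies of a table; in practice the rows would come from two database queries rather than in-memory lists.

```python
import hashlib

# Minimal sketch of post-migration verification: compare row counts and a
# content checksum between source and target copies of the same table.
source_rows = [("1", "Alice", "120.00"), ("2", "Bob", "75.50")]
target_rows = [("1", "Alice", "120.00"), ("2", "Bob", "75.50")]

def table_checksum(rows) -> str:
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so the comparison is order-independent
        digest.update("|".join(row).encode("utf-8"))
    return digest.hexdigest()

assert len(source_rows) == len(target_rows), "row counts differ"
assert table_checksum(source_rows) == table_checksum(target_rows), "content differs"
print("migration verified: counts and checksums match")
```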
DataOps
DataOps uses a set of practices, technologies, and steps to automate the design, deployment, and management of data processing workflows. The core of the process is the data pipeline – a series of stages the data goes through, starting with its extraction from various data sources and ending with its consumption. DataOps orchestrates and automates this pipeline to ensure a consistent and reliable data flow by leveraging CI/CD (continuous integration, continuous delivery) practices. This approach improves data quality across the data pipeline so that the resulting analytics can be trusted.
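A minimal sketch of such a pipeline, with placeholder extract, transform, and load stages and one unit test of the kind a CI/CD pipeline would run on every change; the stage logic and field names are invented for this example.

```python
# Minimal sketch of a data pipeline that DataOps would automate. The stage
# functions are placeholders; in a real setup each stage would be a job run
# by an orchestrator and the test below would run in the CI/CD pipeline.
def extract() -> list[dict]:
    return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 75.5}]

def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "amount_with_vat": round(r["amount"] * 1.25, 2)} for r in rows]

def load(rows: list[dict]) -> int:
    # stand-in for writing to a warehouse table; returns the number of rows written
    return len(rows)

def run_pipeline() -> int:
    return load(transform(extract()))

def test_transform_adds_vat():
    """A unit test CI would run on every change to the pipeline code."""
    out = transform([{"order_id": 1, "amount": 100.0}])
    assert out[0]["amount_with_vat"] == 125.0

if __name__ == "__main__":
    test_transform_adds_vat()
    print(f"pipeline loaded {run_pipeline()} rows")
```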
Data mesh
Data mesh is an approach to creating a data architecture using a domain-oriented, self-serve design. A data mesh supports distributed, domain-specific data consumers and treats data as a product, with each domain handling its own data processing workflows.
The concept becomes clearer by looking at the four basic principles of data mesh.
Domain-driven development
Data mesh is based on decentralizing and distributing responsibility to the people who are closest to the data, in order to support both the development and the trustworthiness of that data. The challenge is to determine what constitutes a domain within an organization's business and which employees belong to it. From a data perspective, a domain can be defined by grouping source systems that form coherent business units.
Data as a product
The product in this case sits at the intersection of users, business, and technology. Users define their needs and the problem they want solved, and the business can provide insights into an appropriate solution. Data product owners need to understand the problems that domain business users face and develop solutions that meet those business needs. These solutions use data to achieve a goal and are called data products.
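One way to picture a data product is as a small contract the owning domain publishes alongside the data: a name, an owner, a schema, a freshness SLA, and a quality check that consumers can rely on. The DataProduct structure below is a hypothetical sketch, not a standard interface.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical "data product" contract: the domain team that owns the data
# publishes it together with its schema, an owner, an SLA, and a quality check,
# so consumers can use it without knowing the internal processing.
@dataclass
class DataProduct:
    name: str
    owner: str                                   # domain team responsible
    schema: dict                                 # column -> type
    freshness_sla_hours: int
    quality_check: Callable[[list[dict]], bool]

def orders_quality_check(rows: list[dict]) -> bool:
    return all(r["amount"] is not None and r["amount"] >= 0 for r in rows)

customer_orders = DataProduct(
    name="customer_orders",
    owner="sales-domain-team",
    schema={"order_id": "int", "amount": "decimal"},
    freshness_sla_hours=24,
    quality_check=orders_quality_check,
)

sample = [{"order_id": 1, "amount": 99.9}]
print(customer_orders.quality_check(sample))  # True
```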
Self-serve
Data mesh advocates an underlying infrastructure for data products that can be easily accessed by the various domains in an organization. Domain members should not have to worry about the complexity of the data architecture – they should be able to focus solely on creating data as a product.
Federated governance
Federated governance means that a central governing body only provides guidelines for quality and security, while the main responsibility for quality and security lies within the domains. Data mesh governance embraces the dynamic nature of data: it allows domains to choose the structures that are most suitable for their data products.
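As a sketch of how this split of responsibility might look, the central body could publish a few shared guideline checks while each domain adds rules specific to its own products; all names and rules below are illustrative assumptions.

```python
# Sketch of federated governance: the central body ships a small set of
# shared guidelines as reusable checks, while each domain adds rules that
# fit its own data products.
def has_owner(product: dict) -> bool:
    return bool(product.get("owner"))

def has_classification(product: dict) -> bool:
    return product.get("classification") in {"public", "internal", "confidential"}

CENTRAL_GUIDELINES = [has_owner, has_classification]

# Domain-specific rule chosen by the sales domain for its own product.
def sales_amounts_non_negative(product: dict) -> bool:
    return all(r["amount"] >= 0 for r in product["sample_rows"])

def evaluate(product: dict, domain_rules) -> bool:
    return all(check(product) for check in CENTRAL_GUIDELINES + list(domain_rules))

orders_product = {
    "name": "customer_orders",
    "owner": "sales-domain-team",
    "classification": "internal",
    "sample_rows": [{"amount": 10.0}, {"amount": 0.0}],
}
print(evaluate(orders_product, [sales_amounts_non_negative]))  # True
```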
The four principles of data mesh address significant issues that have long troubled data and analytics applications, so there is real value in thinking about them. For a detailed elaboration of data mesh concepts, along with an explanation of all its principles and applications, read the article by its originator, Zhamak Dehghani [4].
Technologies and methodologies change over time and evolve with user needs, and the architectural approach can be centralized or decentralized. Regardless of these changes, one thing is certain: a profitable data platform is built on a well-arranged data governance system, has an automated data pipeline, and tends to have products defined by data owners. And at the beginning and the end of everything stands data quality – if we process correct, unambiguous, and consistent data, we can expect reliable data-driven decisions. Put your data to use with the CROZ experts!