Cloud-based machine learning powered anonymization solution for DATEV
CROZ AI helped implement a reliable anonymization solution that will help DATEV leverage its data to produce new business value.
“Thanks to this highly effective solution by CROZ AI, we took the first step towards making decisions based on insights gleaned from our data, without compromising on our commitment to privacy, while keeping our users happy and safe and providing them with the best service. And after all, it was lovely working with such a committed, responsive and professional team!”
DATEV is one of the largest software solution providers in Germany. The company, based in Nürnberg, employs more than 8,000 people, has over 450,000 customers all around Europe and a turnover of over one billion euros annually. It offers software solutions designed to meet the needs of tax consultants, lawyers, auditors, small and medium-sized enterprises, municipalities, and founders. The solutions, known for their reliability, timeliness, data protection, and data security, are a popular choice among professionals in these fields. By using DATEV’s software, they can ensure that they are meeting all of their requirements and operating at the highest standards.
To provide the best possible service to its clients, DATEV must apply best practices to ensure user satisfaction and extract valuable insights that enable it to improve existing products and develop new solutions. Because the company holds a vast amount of data, it is challenging for human workers to extract and use the relevant information effectively. CROZ AI offers AI-powered solutions to assist DATEV in maximizing the value of its data.
As DATEV operates in the European Union, it must adhere to the General Data Protection Regulation (GDPR). The raw, original data contains sensitive information and must therefore be deleted, so DATEV needs to anonymize any data that could potentially be used in the future. For this reason, we had to implement a robust, reliable and, above all, audit-proof anonymization solution before DATEV could leverage its data to produce new business value.
Anonymization is a complex process that involves several steps, including identifying words that need to be anonymized, assigning them appropriate labels, and replacing them with words that preserve the sentiment and value of the data while also complying with regulations. This can be a challenging task, especially when the available training data is relatively limited. Human workers are still often used for this task, but manual anonymization is expensive and does not scale easily.
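The replacement step described above can be illustrated with a minimal sketch. It assumes that entity spans have already been identified and labelled. The `EntitySpan` type, the label names, and the surrogate values are hypothetical and chosen only for illustration; they are not taken from the DATEV project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # e.g. "PERSON" or "LOCATION" (hypothetical label set)


# Hypothetical surrogate values, one per label.
SURROGATES = {"PERSON": "Alex Example", "LOCATION": "Sample City"}


def anonymize(text: str, spans: list[EntitySpan]) -> str:
    """Replace every labelled span with a surrogate for its label."""
    result = text
    # Work from right to left so that earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        # Fall back to a bracketed label when no surrogate is defined.
        surrogate = SURROGATES.get(span.label, f"[{span.label}]")
        result = result[:span.start] + surrogate + result[span.end:]
    return result


text = "Maria Schmidt from Nürnberg was very happy with the update."
spans = [EntitySpan(0, 13, "PERSON"), EntitySpan(19, 27, "LOCATION")]
print(anonymize(text, spans))
```

Because the surrounding words are left untouched, the sentence keeps its sentiment while the identifying details are replaced.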
To address this challenge, we implemented a transformer-based named-entity-recognition model using AWS as the cloud environment. This type of model is highly configurable and provides state-of-the-art results with modest training resources. It can be easily adapted to a variety of use cases, which paves the way for downstream tasks such as sentiment analysis and topic extraction.
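Named-entity-recognition models of this kind typically emit one tag per token, which then has to be grouped into entities. The sketch below assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The tagging scheme actually used in the project is not stated, so the scheme, the function name `bio_to_spans`, and the tags are illustrative assumptions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity tuples."""
    entities = []
    current_label, current_tokens = None, []

    def flush():
        # Store the open entity, if any, and reset the state.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes any open entity;
            # the stray token itself is ignored.
            flush()
    flush()
    return entities


tokens = ["Maria", "Schmidt", "lives", "in", "Nürnberg"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```

The resulting entity list is what a later replacement step would consume.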
We leveraged various open-source libraries and platforms such as Pandas, PyTorch, Hugging Face, and MLflow to provide a solution best suited to our client’s needs. Key aspects of our implementation include:
- 1. Custom methods to process and prepare datasets – enabling us to easily customize datasets in the future if new requirements arise
- 2. Machine learning libraries used to create custom named-entity-recognition models – flexible and reusable, while remaining transparent and under full control
- 3. MLflow used both as a platform and as a set of concepts that help us create and manage our solution in line with MLOps best practices
- 4. A fully cloud-based solution utilizing the full potential of modern cloud services such as AWS paired with the Databricks platform
- 5. Crucial MLOps processes such as labelling jobs, human-in-the-loop model validation, and automatic model retraining
There are several major features we believe to be essential for a reliable and usable AI anonymization system:
- 1. The machine learning model, which is the core component of every AI system, must be configurable and replaceable, meaning that the client can switch to another model without modifying the whole system
- 2. The machine learning model must be resilient to data drift, meaning it can be easily and automatically retrained on new data, or on any additional compatible data source, if the need arises
- 3. The training and evaluation process must be straightforward, well-documented, and reproducible so that no time is wasted on redundant and repetitive work
- 4. The system must provide an option for human workers to intervene, meaning that random samples, or samples that the model labelled with a low confidence score, can be forwarded for additional evaluation
- 5. The solution must incorporate well-integrated annotation workflows so that new datasets and additional labelling can be done with ease
- 6. Everything must be implemented in the Databricks and AWS ecosystem, guaranteeing stability and scalability
At CROZ AI, we prioritize good practices in machine learning engineering and operations, and our solutions always adhere to these requirements. The only exception is that our solutions are not limited to cloud-based development, as we also support on-premises solutions.
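Two of the requirements listed above, a replaceable model and human intervention for random or low-confidence samples, can be sketched in a few lines. This is an illustration under stated assumptions and not the project's implementation. The `NerModel` interface, the toy `KeywordModel`, and the `threshold` and `sample_rate` values are all hypothetical.

```python
import random
from typing import Protocol


class NerModel(Protocol):
    """Any model exposing predict() can be plugged into the pipeline."""

    def predict(self, text: str) -> list[tuple[str, str, float]]:
        """Return (entity_text, label, confidence) triples."""
        ...


class KeywordModel:
    """Toy stand-in for a real NER model; used only for illustration."""

    def __init__(self, lexicon: dict[str, tuple[str, float]]):
        self.lexicon = lexicon

    def predict(self, text: str) -> list[tuple[str, str, float]]:
        return [(word, label, conf)
                for word, (label, conf) in self.lexicon.items()
                if word in text]


def needs_review(predictions, threshold=0.8, sample_rate=0.05, rng=random):
    """Flag a sample if any prediction is low-confidence,
    or if it is picked by random spot-checking."""
    if any(conf < threshold for _, _, conf in predictions):
        return True
    return rng.random() < sample_rate


def route(texts, model: NerModel, **kwargs):
    """Split texts into an auto-accepted list and a human-review queue."""
    accepted, review = [], []
    for text in texts:
        preds = model.predict(text)
        (review if needs_review(preds, **kwargs) else accepted).append(text)
    return accepted, review


model = KeywordModel({"Maria": ("PERSON", 0.97), "Nürnberg": ("LOCATION", 0.55)})
accepted, review = route(["Maria called.", "Office in Nürnberg."],
                         model, sample_rate=0.0)
print(accepted, review)
```

Since `route` depends only on the `predict` method, swapping in a different model does not require changes to the rest of the pipeline.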
Our anonymization solution for DATEV served as an extensive proof-of-concept for a robust, scalable, and reliable anonymization pipeline. We successfully showed that a cloud-based anonymization solution has great potential and should be considered as a candidate for a production system. Our team also provided a codebase and a set of good practices that can be reused in future machine learning projects and will surely become an important asset for DATEV.
Get in touch
Want to hear more about our services and projects? Feel free to contact us.