Challenge
CROZ AI helped implement a reliable anonymization solution that will help DATEV leverage its data to produce new business value. “Thanks to this highly effective solution by CROZ AI, we took the first step towards making decisions based on insights gleaned from our data, without compromising on our commitment to privacy, while keeping our users happy and safe and providing them with the best service. And after all, it was lovely working with such a committed, responsive and professional team!”
-Dr. Jonas Rende, Lead UX/CX Technologies
DATEV is one of the largest software solution providers in Germany. The company, based in Nürnberg, employs more than 8,000 people, has over 450,000 customers across Europe, and has an annual turnover of over one billion euros. It offers software solutions designed to meet the needs of tax consultants, lawyers, auditors, small and medium-sized enterprises, municipalities, and founders. The solutions, known for their reliability, timeliness, data protection, and data security, are a popular choice among professionals in these fields. By using DATEV’s software, these professionals can ensure that they meet all of their requirements and operate to the highest standards.
As DATEV operates in the European Union, it must adhere to the GDPR. The raw, original data contains sensitive information and must therefore be deleted, so DATEV needs to anonymize any data that it may want to use in the future. For this reason, we had to implement a robust, reliable, and above all audit-proof anonymization solution before DATEV could leverage its data to produce new business value. Anonymization is a complex process that involves several steps: identifying the words that need to be anonymized, assigning them appropriate labels, and replacing them with words that preserve the sentiment and value of the data while also complying with regulations.
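The replacement step described above can be illustrated with a minimal Python sketch. It assumes the entities have already been identified and labeled as character spans. The `Entity` class, the label names, and the `SURROGATES` table are hypothetical and chosen only for illustration; they are not taken from the DATEV project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label name, e.g. "PERSON" or "CITY"


# Hypothetical surrogate values per label (illustration only).
SURROGATES = {"PERSON": "Alex Example", "CITY": "Sampletown"}


def anonymize(text, entities, surrogates=SURROGATES):
    """Replace each labeled span with a surrogate for its label."""
    result = text
    # Work from right to left so that earlier offsets stay valid
    # even when a surrogate has a different length than the original.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        # Fall back to a bracketed label if no surrogate is configured.
        replacement = surrogates.get(ent.label, f"[{ent.label}]")
        result = result[:ent.start] + replacement + result[ent.end:]
    return result


text = "Anna Schmidt from Hamburg was very happy with the support."
entities = [Entity(0, 12, "PERSON"), Entity(18, 25, "CITY")]
print(anonymize(text, entities))
```

Replacing a name with another plausible name, rather than deleting it, keeps the sentence readable, which is one way the sentiment and value of the text can survive anonymization.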
Solution
To address this challenge, we implemented a transformer-based named-entity-recognition model, using AWS as the cloud environment. This type of model is highly configurable and provides state-of-the-art results with modest training resources. It can be easily adapted to a variety of use cases, which paves the way for downstream tasks such as sentiment analysis and topic extraction. We leveraged various open-source libraries and platforms, such as pandas, PyTorch, Hugging Face, and MLflow, to provide a solution best suited to our client’s needs.
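Named-entity-recognition models commonly predict one tag per token, and a post-processing step groups those tags into entity spans. The sketch below shows that grouping in plain Python, assuming the widely used BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The tagging scheme, the function name `merge_bio`, and the labels are assumptions for illustration; the source does not state which scheme the project used.

```python
def merge_bio(tokens, tags):
    """Group per-token BIO tags into a list of (label, text) entities."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            current_label, current_tokens = None, []
        elif tag.startswith("I-") and tag[2:] == current_label:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "B-" tag, or an "I-" tag that does not match the open
            # entity: close the open entity and start a new one.
            flush()
            current_label, current_tokens = tag[2:], [token]
    flush()
    return spans


tokens = ["Anna", "Schmidt", "lives", "in", "Hamburg"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(merge_bio(tokens, tags))
```

The resulting spans are what a downstream replacement step would consume.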
We believe several major features are essential for a reliable and usable AI anonymization system: a configurable and replaceable machine learning model, resistance to data drift, a straightforward and reproducible training and evaluation process, an option for human intervention, well-integrated annotation workflows, an implementation entirely within the Databricks and AWS ecosystem, and adherence to best practices in machine learning engineering and operations.
Our anonymization solution for DATEV served as an extensive proof of concept for a robust, scalable, and reliable anonymization pipeline. We successfully showed that a cloud-based anonymization solution has great potential and should be considered as a candidate for a production system. Our team also provided a codebase and a set of good practices that can be applied and reused in future machine learning projects and should become an important asset for DATEV.