4 Min reading time

Data Alchemy – Crafting AI with Synthetic Data Generation

20. 11. 2024

Howdy AI friends,

In the last issue, we investigated fine-tuning LLMs. This month, we’re diving into the delightful world of synthetic data generation. These two topics are like the perfect pairing—think butter and honey, or fresh tagliatelle and Bolognese sauce.

These two topics go hand-in-hand because they both aim to improve LLM performance and to keep the output consistent with the context in which the AI solution is used. For example, in the DATEV data anonymization use case, the two techniques were used in complementary ways while differing in some fundamental aspects:

– Fine-tuning involves taking a pre-trained LLM and further training it on a specific dataset to improve its performance on a particular task. It’s like taking a generalist and specializing them.

– Synthetic data generation involves creating artificial data that mimics real-world data. It can be handy when real-world data is scarce, expensive, or sensitive.

For the DATEV use case, synthetic data was generated to address the issue of data privacy in fine-tuning. Yet synthetic data can also be used to expand training datasets and make a model more robust, to generate edge cases and rare scenarios that increase data diversity, and to produce data where collecting and labeling real-world examples would come with exorbitant costs (see this IBM Research blog for an overview).
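To make this concrete, here is a minimal sketch of how privacy-safe records could be generated with an LLM. It assumes the OpenAI Python SDK; the prompt, model name, and record fields are invented for illustration and are not the actual DATEV setup.

```python
# A minimal sketch of LLM-based synthetic data generation, assuming the OpenAI
# Python SDK. The prompt, model name, and record fields are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Return a JSON object with a single key 'records' containing {n} fictional "
    "German client records. Each record needs the fields 'name', 'address', and "
    "'iban'. The data must be entirely invented and not refer to real persons."
)

def generate_synthetic_records(n: int = 5) -> list[dict]:
    """Ask the model for n invented records to use as privacy-safe training data."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model could be swapped in here
        messages=[{"role": "user", "content": PROMPT.format(n=n)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["records"]

if __name__ == "__main__":
    for record in generate_synthetic_records(3):
        print(record)
```

Records like these can then be mixed into a fine-tuning dataset in place of sensitive originals.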

Now, there are a few ingredients to keep an eye on when it comes to synthetic data. I recently tuned into a scrumptious episode of AI Stories hosted by Neil Leiser, where Loubna Ben Allal shared some insightful points from building Cosmopedia. One of the key takeaways? Maintaining diversity in your synthetic data can be as challenging as making sure your pasta doesn’t overcook. As projects grow, it is essential to curate diverse prompts that cover various topics and to minimize duplicate outputs. Loubna illustrates the point with her team’s goal of not spending compute on generating billions of textbooks only to discard most of them because they are too similar.
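To illustrate the deduplication side of that takeaway, here is a toy filter that drops near-duplicate synthetic samples using Jaccard similarity over word 3-grams. It is only a sketch; Cosmopedia’s actual curation pipeline is far more sophisticated.

```python
# A toy near-duplicate filter for synthetic samples: Jaccard similarity over
# word 3-grams. This only illustrates why deduplication matters before spending
# compute on fine-tuning; real pipelines use more robust techniques.

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(samples: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a sample only if it is not too similar to anything already kept."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for sample in samples:
        s = shingles(sample)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(sample)
            kept_shingles.append(s)
    return kept

# Prints the first and third sample; the second is dropped as a near-duplicate.
print(deduplicate([
    "A textbook chapter about linear regression and its assumptions.",
    "A textbook chapter about linear regression and its assumptions!",
    "A short story that teaches recursion through a baking recipe.",
]))
```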

AI A La Carte: Synthetic Data Generation

Synthetic data generation for code generation and reasoning

A hot area for synthetic data generation is code generation and code reasoning, both real game-changers for developers. Synthetic data is instrumental in improving pre-trained models on niche programming languages. When asked about non-mainstream and legacy code, pre-trained LLMs with coding capabilities like StarCoder, GPT-4, Llama, or Codestral may still perform poorly. That is where synthetic data generation comes into play, as in the fascinating story of the IBM Granite code generation model and its fine-tuning for the COBOL programming language using InstructLab (see this video from THINK 2024 with Dario Gil at this link, starting at minute 26:00). It is an exciting approach that could be applied to other programming languages currently not supported by code assistants on the market.
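As a rough illustration of that seed-and-expand idea, the sketch below turns a few hand-written COBOL question/answer pairs into a prompt that asks a teacher model for more pairs. The seed examples and the template are invented for illustration; InstructLab’s taxonomy-driven pipeline is considerably more elaborate.

```python
# A rough sketch of the seed-and-expand idea behind synthetic instruction data
# for a niche language: a handful of hand-written COBOL Q&A pairs are turned
# into a prompt that asks a "teacher" LLM for more pairs. The seeds and the
# template are invented for illustration.

SEED_EXAMPLES = [
    {
        "question": "How do I declare a numeric variable in COBOL?",
        "answer": "01 WS-COUNTER PIC 9(4) VALUE ZERO.",
    },
    {
        "question": "How do I loop ten times in COBOL?",
        "answer": "PERFORM VARYING WS-I FROM 1 BY 1 UNTIL WS-I > 10\n    ...\nEND-PERFORM.",
    },
]

def build_expansion_prompt(seeds: list[dict], n_new: int = 20) -> str:
    """Build a teacher prompt that asks for n_new additional Q&A pairs."""
    shots = "\n\n".join(f"Q: {s['question']}\nA: {s['answer']}" for s in seeds)
    return (
        "You are an expert COBOL instructor. Below are example question/answer "
        f"pairs. Generate {n_new} new, diverse pairs in the same format, "
        "covering different COBOL features.\n\n" + shots
    )

print(build_expansion_prompt(SEED_EXAMPLES, n_new=5))
```

The teacher model’s answers would then be filtered and used as instruction-tuning data for the student model.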

A very captivating use case is code reasoning combined with algorithmic reasoning. Together with my colleague Hrvoje Simik, I explored this technique in response to a client request for support of a niche programming language; he explains the details in this blog. In a nutshell, we developed a meta-model approach that first leverages the algorithmic capabilities of the LLM to break down the task and generate a generic solution. This solution can be seamlessly expressed in Python, a language in which most of the best-performing LLMs are well-versed. Second, the process creates a structured JSON format that is then turned into statements in the niche programming language: Python expressions are parsed into a tree structure (an AST) and then re-emitted as target-language code. Let me know what you think about this approach!
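For a flavor of that translation step, here is a stripped-down sketch that parses a Python arithmetic expression into an AST and re-emits it in a hypothetical target syntax. The operator spellings are made up; the real meta-model described in Hrvoje’s blog covers much more than arithmetic.

```python
# A stripped-down illustration of the "Python AST as an intermediate
# representation" step: parse an arithmetic expression produced in Python,
# walk the tree, and re-emit it in a (hypothetical) target syntax.
import ast

# Hypothetical operator spellings of the niche target language.
TARGET_OPS = {ast.Add: "PLUS", ast.Sub: "MINUS", ast.Mult: "TIMES", ast.Div: "DIV"}

def to_target(node: ast.AST) -> str:
    """Recursively convert a Python expression AST into target-language text."""
    if isinstance(node, ast.Expression):
        return to_target(node.body)
    if isinstance(node, ast.BinOp):
        op = TARGET_OPS[type(node.op)]
        return f"({to_target(node.left)} {op} {to_target(node.right)})"
    if isinstance(node, ast.Name):
        return node.id.upper()
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise NotImplementedError(f"Unsupported node: {ast.dump(node)}")

# Example: an expression generated in Python, translated via its AST.
python_expr = "(net_amount + fee) * tax_rate"
tree = ast.parse(python_expr, mode="eval")
print(to_target(tree))  # ((NET_AMOUNT PLUS FEE) TIMES TAX_RATE)
```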


Business considerations behind synthetic data generation

While synthetic data generation presents a powerful tool in the AI arsenal, it’s essential to recognize that it’s an advanced technique. Before embarking on a synthetic data generation project, organizations should carefully consider several factors:

– Data quality and realism: Ensuring that the synthetic data accurately reflects real-world data is crucial. Poor-quality synthetic data can lead to biased and inaccurate models. Poor-quality datasets are like stale bread—nobody wants that!

– Computational cost: Just like whipping up an elaborate meal can strain your wallet, generating high-quality synthetic data can be computationally expensive, especially for large datasets and complex models.

– Model bias and ethical implications: Beware! Just as a dish can be spoiled by too much salt, synthetic data can inadvertently perpetuate biases lurking in the training data.

An AI Business Innovation Sprint can help organizations navigate these complexities by providing expert guidance and support. The general idea is to engage both business and technical mindsets in the ideation and strategic evaluation of AI use cases. For example, this November my colleagues in Zagreb are organizing an AI Innovation Sprint Lab in collaboration with IBM. If you are curious about the format, check this link. Past editions kick-started the AI journeys of banks such as PBZ in Croatia and machinery manufacturers like AGCO in the USA. The goal is to help you pick the right tools and techniques for your specific AI journey, making sure your projects sizzle with innovative ideas. So, let’s get cooking!
