News > Modernizing Legacy ETL with Databricks: AI-Assisted ETL Transformation

16 Min reading time

Modernizing Legacy ETL with Databricks: AI-Assisted ETL Transformation

20. 05. 2026

Overview

Automated ETL migration is only the start. Learn how LLM agents help validate, improve, and prepare generated pipelines for production.

In the first blog of this series, we explained the strategic roadmap of ETL modernization, moving from legacy data warehouses like Informatica and Oracle toward modern Lakehouse architectures on Databricks.

In the second blog, we took a deeper technical look at Databricks Lakebridge, focusing on how it can automate large parts of the ETL migration process. While this automation significantly accelerates the initial conversion step, the resulting code is still not production-ready and requires non-trivial manual validation, testing, optimization, and refinement to ensure correctness and consistency with legacy behavior.

In this final blog of the series, we focus on the step after conversion: validating the generated code and moving one step closer to production-grade pipelines. We will demonstrate how we used our internal LLM agent, Jarvis, to accelerate and improve this final phase of the migration by analyzing transpiled PySpark and Spark SQL code, identifying syntax and logical issues, suggesting improvements, and helping engineers close the gap between generated outputs and production-ready pipelines more efficiently.

Minimalist Single-Agent Approach

We focused on a minimalist approach that requires little setup while significantly improving the final validation phase. Our approach uses a single-agent setup structured into four sequential steps, as shown in the diagram below.

AI assisted etl transformation through phases

This approach combines lightweight tooling, prompt-guided LLM reasoning, and human validation, focusing on maximizing automation where possible while maintaining control and accuracy through targeted human oversight. Following a simplified end-to-end pattern, the system takes transpiled PySpark or Spark SQL scripts along with domain context as input, processes them through a guided reasoning layer, and produces a structured validation report that is subsequently reviewed and used as the backbone for final validation and refinement by engineers.

LLM Reasoning Layer

At the core of the system is the LLM reasoning layer, where the agent performs sequential analysis steps. The LLM agent processes transpiled Python scripts one by one with provided general domain context.

ETL and Bussiness Logic Understanding

The agent first performs a high-level analysis of the full script, analyzing ETL flow structure, object names, and transformation patterns and summarizing the meaning of the ETL and business logic into a structured, human-readable semantic representation to provide better understanding and support for subsequent LLM reasoning stages and human review.

Component-Level Analysis

The script is then processed component by component, based on Lakebridge marked component code blocks. For each component, the agent uses a deterministic, component-based lookup mechanism to retrieve validation knowledge for each Informatica component type. This design follows a tier-based retrieval structure, separating lightweight issue identification from deeper remediation context.

In the first tier, based on the Informatica component type (e.g., Joiner, Lookup, Source Qualifier, Target), the agent retrieves a predefined Known Issues Document using a deterministic lookup function. This document contains a concise, curated checklist of known issues (validation rules and common failure patterns) specific to that component type. These known issue patterns are grouped in the document into four main categories:

Syntax Issues – compilation or runtime-breaking errors (e.g., invalid expressions to watch for, incorrect brackets, unsupported Informatica-specific code)
Simple Logic Issues – straightforward transformation mistakes (e.g., specific incorrect mappings or filters patterns)
Complex Logic Issues – semantic or data correctness issues (e.g., incorrect joins, inconsistent transformations patterns)
Optimization Issues – performance inefficiencies or suboptimal transformation patterns ( e.g., use of Python functions when PySpark logic is available)

Each issue pattern is defined with short description, as reference prompt based on which the agent will identify possible issues without taking too much context token size.

Example:

# Joiner_Known_Issues.md
## Syntax Issues
- replace invalid reference {{('param_name')}} with '{param_name}' -> JOINER-SYN-003
- mark Informatica/Oracle specific variable reference like @TableName for manual validation -> JOINER-SYN-006 
- ensure every opening bracket has a matching closing bracket -> JOINER-SYN-002
- remember that dbutils library is also available on Serverless compute -> JOINER-SYN-001
## Simple Logic Issues
- look out for joins on not expected table columns -> JOINER-LOG-002
- do not join on hash columns derived from mismatched sources -> JOINER-LOG-004
## Complex Logic Issues
- prevent cartesian joins by ensuring proper join conditions exist -> JOINER-CMP-003
## Optimization Issues
- use broadcast joins for small reference datasets -> JOINER-OPT-001
- avoid Python UDFs in join conditions -> JOINER-OPT-004
- remove unnecessary repartition before joins -> JOINER-OPT-003

In the second tier, each issue that was identified by the agent in the first tier is linked to a Detailed Issue Reference Document which containts all issues for a specific component type that are grouped on issue type and ordered by reference number.

For example reference id JOINER-CMP-003 means we can search the Detailed Issue Reference Document for an issue that matches properties:

Component type: Joiner
Issue type: Complex Logic Issue
Reference number: 003

This extended document provides deeper context on each of the linked issues with concrete code examples illustrating the issue, detailed explanation of root cause and behavior, structured remediation guidance and best-practice resolution patterns.

This tier separation allows the system to first identify potential issues using lightweight descriptions, and then use deeper contextual knowledge from the second tier to analyze each issue with a more complete context and generate a more precise, context-aware remediation solution.

Structured Output

After processing all known issues for a component, the agent continues the analysis using its general PySpark and Spark SQL knowledge and transformation logic to identify potential unknown issues that are not covered in the Known Issues Documents. Any such findings are then described, categorized using the same issue taxonomy, and accompanied by a suggested remediation solution based on best-practice Spark and PySpark patterns.

In the end, for each script, the agent produces a structured findings report that consolidates all issues, ranks them by component type and category, and includes both issue descriptions, code and recommended solutions in a consistent, review-ready format.

Human Review and Validation

This LLM produced findings report serves as the primary artifact for script-level validation and review. For the current migration scope and representative set of scripts, only input and expected output datasets are available at the script level, not at the individual component level. As a result, validation can only be performed at the full script level.

Human engineers use this report as the primary validation reference, while simultaneously reviewing each identified issue and its suggested remediation to assess whether the findings are accurate, relevant, and correctly resolved. This includes comparing script-level inputs and expected outputs, as well as evaluating the logical correctness of the proposed fixes.

Feedback Loop

All confirmed and rejected findings by human reviewers are documented during this process. Human reviewers document for each issue:

Accuracy Assessment: Whether the issue was correctly identified (true positive) or incorrectly flagged (false positive)
Relevance & Importance: The significance of the issue in terms of its impact on code quality, security, performance, or maintainability
Remediation Validation: Whether the suggested remediation approach was appropriate and effective
Final Resolution: The actual remediation solution implemented for confirmed issues, including any adjustments made to the original suggestion

Then with LLM assistance, to better structure the feedback information, both valid and invalid issues are incorporated into the Known Issues Documents and Detailed Issue Reference Documents grouped by component type with also refactoring exisiting issues based on above listed assessment. This provides deeper context to the agent about what to look out for and what to ignore, which further refines future analysis. This creates a continuous feedback loop that progressively improves detection accuracy, issue coverage, and the quality of remediation guidance over time.

Evaluation

With the LLM-driven validation workflow defined, the next step is to evaluate how effectively it performs in practice when applied to transpiled PySpark and Spark SQL scripts. We aim to answer questions such as:

Does the workflow accurately identify syntax, simple logic, and complex logic issues?
How reliable and relevant are the suggested remediations?
To what extent does the human-in-the-loop validation confirm or challenge the agent’s findings?
Where does the approach perform well, and where does it require refinement?

This chapter presents a structured evaluation of the proposed approach, examining its strengths, limitations, and practical implications for validating and modernizing data pipelines in Databricks environments.

Improvements: Areas where LLM shines

In our evaluation, the LLM agent proved most useful in four areas: detecting issues more consistently, helping engineers understand complex ETL logic faster, suggesting remediations, and identifying optimization opportunities.

Enhanced Issue Detection and Consistency

Reviewing code manually line by line is tiring and a slow process. Eventually we lose focus and overlook details that will cause us to revisit the same sections of code multiple times, increasing review time and still leaving a risk of subtle issues slipping through into later stages of validation or production. Our agent does not get tired or overlooks details. Each line is checked in a consistent way, focusing on specific component type, making it easier to catch issues, especially in large and more complex scripts.

This leads to faster and more reliable issue detection early in the review process, reducing the need for repeated re-analysis and follow-up reviews.
Additionally, with the implementation of well-defined set of known patterns and standard issues for each transpiled Informatica component type, we further increase detection accuracy and consistency and improve the overall review process.

Faster Understanding of ETL and Business Logic

When reviewing complex logic manually, it can take a decent amount of time to first understand the underlying business requirements it is designed to satisfy. In our approach we leverage the agent to explain what the transpiled code is doing, translating PySpark operations back to the original business logic. This accelerates and makes it easier to validate that the transformation meets business requirements, ultimately improving both the speed and completeness of manual validation by engineers through a faster and more accurate understanding of the underlying logic.

Improving Remediation Suggestions

When issues are identified the solution can mostly be really simple with just slightly changing the syntax to make it work. But it can also be quite complex, taking significant amount of time to find the right logic to use. By leveraging our internal agent, it provided us with a best practice remediation suggestions that are generated based on a combination of inferred business logic, general PySpark/Python knowledge and previously validated solutions, turning what would be a lengthy debugging session into a quick code change.
At the same time, these suggestions should not be treated as fully authoritative. Due to the non-deterministic nature of LLM agents outputs and the complexity of ETL transformations, every proposed fix must be validated by engineers to ensure it aligns with expected business outcomes and data behavior. However, by incorporating a structured feedback loop, where validated and rejected suggestions are fed back into the system, we observed a noticeable improvement in the quality and relevance of recommendations over time, particularly in later migration iterations.

Optimization Recommendations

Beyond issue detection, the agent is able to analyze PySpark and Spark SQL code from a performance perspective and highlight potential optimization opportunities. This includes suggesting more efficient join strategies, identifying unnecessary or expensive transformations, and recommending caching where intermediate results are reused across multiple steps.
These recommendations are particularly valuable in the context of ETL modernization, where legacy logic is often translated in a direct but not necessarily optimal way. The agent helps bridge this gap by introducing Spark-specific optimization awareness that is not always obvious during automated transpilation.
Be aware that these optimization suggestions are advisory and must be validated by engineers before implementation.

Limitations: What to Look Out For

While the LLM-assisted workflow brings clear benefits, it also introduces limitations that need to be actively managed, especially around false positives, context handling, and scalability.

False Positives

Even though in most cases our internal agent correctly flagged mistakes in code, its outputs are not fully deterministic. In some cases, it flagged patterns that are not actual problems or misinterpreted context-specific logic, particularly when processing larger scripts with complex ETL transformations. This is primarily due to the available context window reaching its limits and business rules that are implicit or not explicitly encoded in the code or supporting documentation.

This makes human validation a critical part of the workflow. Every finding must be reviewed against actual requirements and expected behavior before any fix is applied. The agent should therefore be treated as a decision-support tool rather than a fully reliable validator, and its suggestions should never be applied without verification.

Overload of Information, Missed Issues, and Scope Trade-offs

In some cases agent will generate a large amount of findings in a single analysis, not all of which is relevant or accurate. Also it may miss relevant issues that will later be flaged in the human review and validation steps. This behavior is typically influenced by prompt design, the amount of code being processed, and the overall context management strategy.

In our approach, we reduced this behaviour by breaking scripts into component-level units so the agent can focus on smaller, well-defined parts of the pipeline, while still providing script-level ETL and business context to preserve overall understanding. We also added prompt constraints and response limits to reduce excessive or unfocused output.

This design improves focus, reliability, and consistency of the analysis. However, like most design choices, it introduces trade-offs. While we gain more stable and precise results, this approach may reduce visibility of broader, cross-component relationships and higher-level logic across the full pipeline.

Processing Speed and Scalability

From a scalability perspective, our single-agent, component-level approach is simple and works well for representative scripts, as it enables precise analysis while preserving business context. Processing speed is still much faster than manual review, but it is limited by both the single-agent design and context constraints compared to a multi-agent, distributed setup, which makes scaling to large numbers of ETL jobs more challenging.

While moving toward a multi-agent architecture could significantly improve scalability and throughput, it would also introduce substantial architectural complexity, which is out of scope for this evaluation and is discussed further in the Future Next Steps section.

Results

In our evaluation, the LLM-assisted workflow helped ease the initial review and analysis phase by surfacing relevant issues and suggesting possible fixes, allowing engineers to focus more on validating logic rather than analyzing code line by line.

The approach improved overall validation by making issue detection faster and more detailed, while steadily improving consistency and reliability through feedback over time. It also provided suggested fixes that often matched previously successful patterns, helping us reach correct solutions more quickly. Overall, this resulted in a more structured review process with fewer missed issues early on, along with a clear audit trail of findings and resolutions that strengthened engineering decisions.

That said, the approach is still limited by challenges in determinism, context handling, and scalability, which makes human oversight and final testing and validation essential to ensure accuracy and alignment with business expectations.

The next section builds on these findings and explores how this foundation can be extended into a more scalable and automated architecture.

Future Next Steps: Automation and Scaling with Multi Agentic Approach

Next steps for automation and scaling in etl transformation

Although our approach proved to be a viable solution for our migration scope, based on our broader experience with AI, we see clear opportunities to further improve it, particularly by automating more of the workflow and scaling it into a multi-agent setup.

At its core, this means moving beyond a single, sequential LLM validation workflow toward a dependency-driven multi-agent system where responsibilities are distributed across specialized agents. This allows large-scale parallel validation, reduces context overload, and makes the overall workflow more efficient as complexity grows.

However, before this system can operate, the full migration Assessment phase must first be completed. This phase serves as the prerequisite for a multi-agent setup by establishing the dependency graph, execution order, and orchestration constraints of the ETL workflows, and providing the essential domain context required for accurate validation, analysis, and coordination.

Our improved approach is shown in the diagram below:

The foundation of this approach is a dependency-aware Orchestration agent that continuously evaluates the ETL dependency graph and determines which scripts are eligible for validation. Its responsibility is intentionally limited to creating validation branches (validation processes) only for scripts whose dependencies are already completed. Once a validation branch is created, the remaining workflow progresses autonomously through Git-driven events, where branch creation, issue creation, merge requests, and issue status changes automatically trigger the next corresponding agents in the pipeline.

Within each validation process, specialized agents focus on clearly defined responsibilities. The ETL Documentation Agent creates its own ETL_documentation branch and analyzes the individual script currently under validation together with its business and dependency context to generate standardized ETL documentation. It then merges it into the main validation branch with optional Human review and additional change request.

After that the Issue Recognition Agent reads the script component-by-component and evaluates each component individually. For each component, it creates a separate Git Issue if it either matches any entry in the Known Issues Document or when the Issue Recognition Agent, based on its general knowledge of Python and Spark, determines that there is a potential issue or misimplementation. In the created Git Issue description, agent references component code block, all matching issue patterns and their categories (categorization according to the issue taxonomy), ETL documentation, and contextual domain information.

This design allows Analysis Agents to run independently and in parallel, each processing a single Git Issue created by the Issue Recognition Agent. Each Analysis Agent moves the Git Issue into “In Analysis” status, reviews all associated description context, including code, the Detailed Issue Reference Document, and relevant ETL/domain information.
Using this input plus general knowledge of Python and Spark, the agent classifies each issue as valid or invalid, Valid issues are paired with a recommended remediation based on already verified solutions from Detailed Issue Reference Document and general best-practice Spark and PySpark knwoledge, while invalid issues are explicitly marked as rejected with described justification. In the end each Git Issue is moved to “Ready for Testing” status, with containing a structured report that consolidates all issues, ranks them by validity, component and category, and presents each with relevant context and recommended solution in a consistent, review-ready format.

Once all Git Issues are moved to “Ready for Testing” status a dedicated Testing Agent then validates the script against predefined input and expected output datasets. The agent iteratively runs and for each failed attempt creates a seperate testing branch upon which it applies remediation strategies derived from the Git Issues matching same component type and continuously re-executes the pipeline until the generated outputs fully match the expected results.
It then creates a git merge request towards the main validation branch where deterministic and previously validated fixes may be automatically approved, while more complex and uncertain changes in code or repeatedly failing scenarios escalate to human reviewers for manual intervention after defined retry thresholds are exceeded.

Once testing is successfully completed, the Documentation/Feedback Agent consolidates the knowledge generated throughout the validation process. Its primary responsibility is not validating the script itself, but continuously improving the system’s future decision-making capabilities. The agent reviews all resolved component Git Issues, remediation outcomes, testing conclusions, and human feedback to update the Known Issues documents and Detailed Issue Reference documents. This includes adding newly validated remediation patterns, refining inaccurate recommendations, removing obsolete issue mappings, and improving contextual references for future migrations. In this way, every successfully validated script incrementally strengthens the quality and accuracy of future automated validations.

To support coordination and traceability, the framework relies heavily on Git-based workflows. Branch creation, issue management, merge requests, comments and approvals become the communication mechanism between agents while simultaneously maintaining full auditability of the validation process.

In this model, the role of engineers evolves from performing repetitive validation tasks toward supervising and governing the orchestration ecosystem itself. Human reviewers primarily focus on architectural oversight, escalation handling, and approval of final remediation outcomes, while the agents automate the majority of operational documentation, issue analysis, testing, and remediation activities across the migration workflow.

Outro

Across this series, we looked at how organizations can modernize ETL systems by transitioning away from legacy warehouse platforms toward Lakehouse architectures like Databricks. We then shown how a large portion of the ETL migration phase can be automated using Databricks Lakebridge, which rapidly converts existing ETL logic into PySpark and Spark SQL, while still requiring a dedicated validation step to ensure the generated code is logically correct, aligned with expected business behavior, and ready for production deployment.

Building on that foundation, we focused on strengthening the validation step by introducing our internal LLM agent, Jarvis. It was used to analyze transpiled PySpark and Spark SQL code, detect syntax and logical issues, propose improvements, and support engineers in systematically closing the gap between automatically generated code and production-grade pipelines.
We also suggested future next steps for our approach, where it evolves into a multi-agent, dependency-driven system that enables parallel validation, reduces context overload, and improves end-to-end pipeline reasoning.

Taken together, this Blog series highlights how more modern approaches to ETL modernization are shifting the focus toward accelerating the ETL migration, while maintaining a careful balance between automation and human validation to ensure correctness, reliability, and production readiness of the final data pipelines.

News

Monatliche News

Kontakt

Falls Sie Fragen haben, sind wir nur einen Klick entfernt.

Kontaktieren Sie uns

Vereinbaren Sie einen Termin mit einem Experten

AI and Data