AirflowRCA agent: root cause analysis using AI

29. 05. 2026

Overview

Discover how the AirflowRCA agent uses AI to analyze Airflow failures, identify root causes, automate notifications, and reduce incident resolution time.

A while back our team was working on a project that required supporting the data load jobs in the early morning hours. The task was to monitor Airflow jobs and react to any failures. That included reading the tasks’ logs, analysing them and eventually fixing them. A common cause for failures was when the network connection to a source database stopped working, or when the source DBA decided to change something related to the connection, or our account, without notifying us. In that case we would need to contact the responsible DBA team via email and ask them to fix the issue.

Even though we were ready to tackle job failures early in the morning, there was still a gap from 5pm of the day before to 4am of that day that was left unmonitored.

In case of job failure early in the night, the failed job would still have to wait for 4am for someone to notify the responsible team or fix the issue, increasing lead time to resolution.

This is where the idea of an LLM powered AI agent for Airflow log analysis came to mind. The goal was simple: reduce the lead time from job failure to resolution, while reducing the need for manual intervention.

Our problem was defined by 3 questions:

How can we react to DAG failures as soon as possible?
How can we analyse the logs of failed tasks?
How can we notify the responsible team or fix the issue?

The key feature of Airflow that we used to react to DAG failures was the on_failure_callback mechanism. In each DAG, we set up the callback to call our Python method. Once the DAG failed and the callback was triggered, we would fetch the Airflow logs for each failed task using the Airflow API. After some log parsing, we extracted the relevant parts and prepared them for analysis.

Along with the callback, we implemented our AI agent using the Anthropic SDK and connected it to our internal AI service called Jarvis. If you are interested in running your own LLM, check out this blog on Jarvis and self-hosted enterprise AI platforms.

The AI agent had a carefully designed prompt which was used together with the parsed error logs to analyse them. Once the analysis was done, we would receive a detailed report of what went wrong and what the next steps are. The process uses a combination of deterministic and AI reasoning. Instead of letting AI decide everything, we chose to implement as much logic as possible in a deterministic way, including how the Airflow logs are fetched and what should be done next depending on the AI analysis of the logs. AI is a powerful tool, but that doesn’t mean every simple task should be implemented to leverage AI.

Notifying the responsible teams and fixing the failures

The analysis itself is not worth much if someone must read it before reacting. That is why we developed a few simple tools that allowed our AI agent to send emails to dedicated teams and to raise new issues on GitLab. We also implemented a mechanism to check if a ticket for a specific DAG run was already created, so we don’t spam the DBA team with unnecessary emails.

The next step would be to determine if the root cause for failure is something simple that a DAG restart would fix or if custom development is needed.

In both cases, by implementing a few safeguards like idempotent DAGs and merge request reviews, the AirflowRCA agent can drastically reduce the time it takes from DAG failures to resolution and ready fixes.

Conclusion

The integration of the AirflowRCA agent into our platform didn’t only help us achieve our goal of reducing lead time from failure to resolution, but it also ensured all DAG failures are analysed and categorised in the same way. All failures are now systematically logged inside our GitLab project, together with comments from fixing attempts. That way we can always look at an issue and find out what happened and how we resolved it.

The cherry on top is that now our processing jobs are monitored 24/7, while our team can get an extra few hours of sleep each day 😊

If you think the AirflowRCA agent can help optimise your monitoring and support process, feel free to reach out and discuss ideas with me!