Innovating at speed: BMW’s generative AI solution for cloud incident analysis

nimda March 5, 2025

0 18 12 minutes read

Innovating at speed: BMW’s generative AI solution for cloud incident analysis

This post was co-authored with Johann Wildgruber, Dr. Jens Kohl, Thilo Bindel, and Luisa-Sophie Gloger from BMW Group.

The BMW Group—headquartered in Munich, Germany—is a vehicle manufacturer with more than 154,000 employees, and 30 production and assembly facilities worldwide as well as research and development locations across 17 countries. Today, the BMW Group (BMW) is the world’s leading manufacturer of premium automobiles and motorcycles, and provider of premium financial and mobility services.

BMW Connected Company is a division within BMW responsible for developing and operating premium digital services for BMW’s connected fleet, which currently numbers more than 23 million vehicles worldwide. These digital services are used by many BMW vehicle owners daily; for example, to lock or open car doors remotely using an app on their phone, to start window defrost remotely, to buy navigation map updates from the car’s menu, or to listen to music streamed over the internet in their car.

In this post, we explain how BMW uses generative AI technology on AWS to help run these digital services with high availability. Specifically, BMW uses Amazon Bedrock Agents to make remediating (partial) service outages quicker by speeding up the otherwise cumbersome and time-consuming process of root cause analysis (RCA). The fully automated RCA agent correctly identifies the right root cause for most cases (measured at 85%), and helps engineers in terms of system understanding and real-time insights in their cases. This performance was further validated during the proof of concept, where employing the RCA agent on representative use cases clearly demonstrates the benefits of this solution, allowing BMW to achieve significantly lower diagnosis times.

The challenges of root cause analysis

Digital services are often implemented by chaining multiple software components together; components that might be built and run by different teams. For example, consider the service of remotely opening and locking vehicle doors. There might be a development team building and running the iOS app, another team for the Android app, a team building and running the backend-for-frontend used by both the iOS and Android app, and so on. Moreover, these teams might be geographically dispersed and run their workloads in different locations and regions; many hosted on AWS, some elsewhere.

Now consider a (fictitious) scenario where reports come in from car owners complaining that remotely locking doors with the app no longer works. Is the iOS app responsible for the outage, or the backend-for-frontend? Did a firewall rule change somewhere? Did an internal TLS certificate expire? Is the MQTT system experiencing delays? Was there an inadvertent breaking change in recent API changes? When did they actually deploy that? Or was the database password for the central subscription service rotated again?

It can be difficult to determine the root cause of issues in situations like this. It requires checking many systems and teams, many of which might be failing, because they’re interdependent. Developers need to reason about the system architecture, form hypotheses, and follow the chain of components until they have located the one that is the culprit. They often have to backtrack and reassess their hypotheses, and pursue the investigation in another chain of components.

Understanding the challenges in such complex systems highlights the need for a robust and efficient approach to root cause analysis. With this context in mind, let’s explore how BMW and AWS collaborated to develop a solution using Amazon Bedrock Agents to streamline and enhance the RCA process.

Solution overview

At a high level, the solution uses an Amazon Bedrock agent to do automated RCA. This agent has several custom-built tools at its disposal to do its job. These tools, implemented by AWS Lambda functions, use services like Amazon CloudWatch and AWS CloudTrail to analyze system logs and metrics. The following diagram illustrates the solution architecture.

When an incident occurs, an on-call engineer gives a description of the issue at hand to the Amazon Bedrock agent. The agent will then start investigating for the root cause of the issue, using its tools to do tasks that the on-call engineer would otherwise do manually, such as searching through logs. Based on the clues it uncovers, the agent proposes several likely hypotheses to the on-call engineer. The engineer can then resolve the issue, or give pointers to the agent to direct the investigation further. In the following section, we take a closer look at the tools the agent uses.

Amazon Bedrock agent tools

The Amazon Bedrock agent’s effectiveness in performing RCA lies in its ability to seamlessly integrate with custom tools. These tools, designed as Lambda functions, use AWS services like CloudWatch and CloudTrail to automate tasks that are typically manual and time-intensive for engineers. By organizing its capabilities into specialized tools, the Amazon Bedrock agent makes sure that RCA is both efficient and precise.

Architecture Tool

The Architecture Tool uses C4 diagrams to provide a comprehensive view of the system’s architecture. These diagrams, enhanced through Structurizr, give the agent a hierarchical understanding of component relationships, dependencies, and workflows. This allows the agent to target the most relevant areas during its RCA process, effectively narrowing down potential causes of failure based on how different systems interact.

For instance, if an issue affects a specific service, the Architecture Tool can identify upstream or downstream dependencies and suggest hypotheses focused on those systems. This accelerates diagnostics by enabling the agent to reason contextually about the architecture instead of blindly searching through logs or metrics.

Logs Tool

The Logs Tool uses CloudWatch Logs Insights to analyze log data in real time. By searching for patterns, errors, or anomalies, as well as comparing the trend to the previous period, it helps the agent pinpoint issues related to specific events, such as failed authentications or system crashes.

For example, in a scenario involving database access failures, the Logs Tool might identify a new spike in the number of error messages such as “FATAL: password authentication failed” compared to the previous hour. This insight allows the agent to quickly associate the failure with potential root causes, such as an improperly rotated database password.

Metrics Tool

The Metrics Tool provides the agent with real-time insights into the system’s health by monitoring key metrics through CloudWatch. This tool identifies statistical anomalies in critical performance indicators such as latency, error rates, resource utilization, or unusual spikes in usage patterns, which can often signal potential issues or deviations from normal behavior.

For instance, in a Kubernetes memory overload scenario, the Metrics Tool might detect a sharp increase in memory consumption or unusual resource allocation prior to the failure. By surfacing CloudWatch metric alarms for such anomalies, the tool enables the agent to prioritize hypotheses related to resource mismanagement, misconfigured thresholds, or unexpected system load, guiding the investigation more effectively toward resolving the issue.

Infrastructure Tool

The Infrastructure Tool uses CloudTrail data to analyze critical control-plane events, such as configuration changes, security group updates, or API calls. This tool is particularly effective in identifying misconfigurations or breaking changes that might trigger cascading failures.

Consider a case where a security group ingress rule is inadvertently removed, causing connectivity issues between services. The Infrastructure Tool can detect and correlate this event with the reported incident, providing the agent with actionable insights to guide its RCA process.

By combining these tools, the Amazon Bedrock agent mimics the step-by-step reasoning of an experienced engineer while executing tasks at machine speed. The modular nature of the tools allows for flexibility and customization, making sure that RCA is tailored to the unique needs of BMW’s complex, multi-regional cloud infrastructure.

In the next section, we discuss how these tools work together within the agent’s workflow.

Amazon Bedrock agents: The ReAct framework in action

At the heart of BMW’s rapid RCA lies the ReAct (Reasoning and Action) agent framework, an innovative approach that dynamically combines logical reasoning with task execution. By integrating ReAct with Amazon Bedrock, BMW gains a flexible solution for diagnosing and resolving complex cloud-based incidents. Unlike traditional methods, which rely on predefined workflows, ReAct agents use real-time inputs and iterative decision-making to adapt to the specific circumstances of an incident.

The ReAct agent in BMW’s RCA solution uses a structured yet adaptive workflow to diagnose and resolve issues. First, it interprets the textual description of an incident (for example, “Vehicle doors cannot be locked via the app”) to identify which parts of the system are most likely impacted. Guided by the ReAct framework’s iterative reasoning, the agent then gathers evidence by calling specialized tools, using data centrally aggregated in a cross-account observability setup. By continuously reevaluating the results of each tool invocation, the agent zeros in on potential causes—whether an expired certificate, a revoked firewall rule, or a spike in traffic—until it isolates the root cause. The following diagram illustrates this workflow.

The ReAct framework offers the following benefits:

Dynamic and adaptive – The ReAct agent tailors its approach to the specific incident, rather than a one-size-fits-all methodology. This adaptability is especially critical in BMW’s multi-regional, multi-service architecture.
Efficient tool utilization – By reasoning about which tools to invoke and when, the ReAct agent minimizes redundant queries, providing faster diagnostics without overloading AWS services like CloudWatch or CloudTrail.
Human-like reasoning – The ReAct agent mimics the logical thought process of a seasoned engineer, iteratively exploring hypotheses until it identifies the root cause. This capability bridges the gap between automation and human expertise.

By employing Amazon Bedrock ReAct agents, significantly lower diagnosis times are achieved. These agents not only enhance operational efficiency but also empower engineers to focus on strategic improvements rather than labor-intensive diagnostics.

Case study: Root cause analysis “Unlocking vehicles via the iOS app”

To illustrate the power of Amazon Bedrock agents in action, let us explore a possible real-world scenario involving the interplay between BMW’s connected fleet and the digital services running in the cloud backend.

We deliberately change the security group for the central networking account in a test environment. This has the effect that requests from the fleet are (correctly) blocked by the changed security group and do not reach the services hosted in the backend. Hence, a test user cannot lock or unlock her vehicle door remotely.

Incident details

BMW engineers received a report from a tester indicating the remote lock/unlock functionality on the mobile app does not work.

This report raised immediate questions: was the issue in the app itself, the backend-for-frontend service, or deeper within the system, such as in the MQTT connectivity or authentication mechanisms?

How the ReAct agent addresses the problem

The problem is described to the Amazon Bedrock ReAct agent: “Users of the iOS app cannot unlock car doors remotely.” The agent immediately begins its analysis:

The agent begins by understanding the overall system architecture, calling the Architecture Tool. The outputs of the architecture tool reveal that the iOS app, like the Android app, is connected to a backend-for-frontend API, and that the backend-for-frontend API itself is connected to several other internal APIs, such as the Remote Vehicle Management API. The Remote Vehicle Management API is responsible for sending commands to cars by using MQTT messaging.
The agent uses the other tools at its disposal in a targeted way: it scans the logs, metrics, and control plane activities of only those components that are involved in remotely unlocking car doors: iOS app remote logs, backend-for-frontend API logs, and so on. The agent finds several clues:
1. Anomalous logs that indicate connectivity issues (network timeouts).
2. A sharp decrease in the number of successful invocations of the Remote Vehicle Management API.
3. Control plane activities: several security groups in the central networking account hosted on the testing environment were changed.
Based on those findings, the agent infers and defines several hypotheses and presents these to the user, ordered by their likelihood. In this case, the first hypothesis is the actual root cause: a security group was inadvertently changed in the central networking account, which meant that network traffic between the backend-for-frontend and the Remote Vehicle Management API was now blocked. The agent correctly correlated logs (“fetch timeout error”), metrics (decrease in invocations) and control plane changes (security group ingress rule removed) to come to this conclusion.
If the on-call engineer wants further information, they can now ask follow-up questions to the agent, or instruct the agent to investigate elsewhere as well.

The entire process—from incident detection to resolution—took minutes, compared to the hours it could have taken with traditional RCA methods. The ReAct agent’s ability to dynamically reason, access cross-account observability data, and iterate on its hypotheses alleviated the need for tedious manual investigations.

Conclusion

By using Amazon Bedrock ReAct agents, BMW has shown how to improve its approach to root cause analysis, turning a complex and manual process into an efficient, automated workflow. The tools integrated within the ReAct framework significantly narrow down potential reasoning space, and enable dynamic hypotheses generation and targeted diagnostics, mimicking the reasoning process of seasoned engineers while operating at machine speed. This innovation has reduced the time required to identify and resolve service disruptions, further enhancing the reliability of BMW’s connected services and improving the experience for millions of customers worldwide.

The solution has demonstrated measurable success, with the agent identifying root causes in 85% of test cases and providing detailed insights in the remainder, greatly expediting engineers’ investigations. By lowering the barrier to entry for junior engineers, it has enabled less-experienced team members to diagnose issues effectively, maintaining reliability and scalability across BMW’s operations.

Incorporating generative AI into RCA processes showcases the transformative potential of AI in modern cloud-based operations. The ability to adapt dynamically, reason contextually, and handle complex, multi-regional infrastructures makes Amazon Bedrock Agents a game changer for organizations aiming to maintain high availability in their digital services.

As BMW continues to expand its connected fleet and digital offerings, the adoption of generative AI-driven solutions like Amazon Bedrock will play an important role in maintaining operational excellence and delivering seamless experiences to customers. By following BMW’s example, your organization can also benefit from Amazon Bedrock Agents for root cause analysis to enhance service reliability.

Get started by exploring Amazon Bedrock Agents to optimize your incident diagnostics or use CloudWatch Logs Insights to identify anomalies in your system logs. If you want a hands-on introduction to creating your own Amazon Bedrock agents—complete with code examples and best practices—check out the following GitHub repo. These tools are setting a new industry standard for efficient RCA and operational excellence.

About the Authors

Johann Wildgruber is a transformation lead reliability engineer at BMW Group, working currently to set up an observability platform to strengthen the reliability of ConnectedDrive services. Johann has several years of experience as a product owner in operating and developing large and complex cloud solutions. He is interested in applying new technologies and methods in software development.

Dr. Jens Kohl is a technology leader and builder with over 13 years of experience at the BMW Group. He is responsible for shaping the architecture and continuous optimization of the Connected Vehicle cloud backend. Jens has been leading software development and machine learning teams with a focus on embedded, distributed systems and machine learning for more than 10 years.

Thilo Bindel is leading the Offboard Reliability & Data Engineering team at BMW Group. He is responsible for defining and implementing strategies to ensure reliability, availability, and maintainability of BMW’s backend services in the Connected Vehicle domain. His goal is to establish reliability and data engineering best practices consistently across the organization and to position the BMW Group as a leader in data-driven observability within the automotive industry and beyond.

Luisa-Sophie Gloger is a Data Scientist at the BMW Group with a focus on Machine Learning. As a lead developer within the Connected Company’s Connected AI platform team, she enjoys helping teams to improve their products and workflows with Generative AI. She also has a background in working on Natural Language processing (NLP) and a degree in psychology.

Tanrajbir Takher is a Data Scientist at AWS’s Generative AI Innovation Center, where he works with enterprise customers to implement high-impact generative AI solutions. Prior to AWS, he led research for new products at a computer vision unicorn and founded an early generative AI startup.

Otto Kruse is a Principal Solutions Developer within AWS Industries – Prototyping and Customer Engineering (PACE), a multi-disciplinary team dedicated to helping large companies utilize the potential of the AWS cloud by exploring and implementing innovative ideas. Otto focuses on application development and security.

Huong Vu is a Data Scientist at AWS Generative AI Innovation Centre. She drives projects to deliver generative-AI applications for enterprise customers from a diverse range of industries. Prior to AWS, she worked on improving NLP models for Alexa shopping assistant both on the Amazon.com website and on Echo devices.

Aishwarya is a Senior Customer Solutions Manager with AWS Automotive. She is passionate about solving business problems using Generative AI and cloud-based technologies.

Satyam Saxena is an Applied Science Manager at AWS Generative AI Innovation Center team. He leads Generative AI customer engagements, driving innovative ML/AI initiatives from ideation to production with over a decade of experience in machine learning and data science. His research interests include deep learning, computer vision, NLP, recommender systems, and generative AI.

Kim Robins, a Senior AI Strategist at AWS’s Generative AI Innovation Center, leverages his extensive artificial intelligence and machine learning expertise to help organizations develop innovative products and refine their AI strategies, driving tangible business value.