Self-healing Software is no Longer a Pipedream

The idea of self-healing software has existed for almost along as software has existed. However, the deterministic nature of traditional software made it impossible to create self-healing systems that could adapt to unpredictable real-world situations. However, thanks to AI and the fact that it's non-deterministic, we can now design systems that can adapt to unpredictable situations, even when the primary system is deterministic. At Sentrix.ai, we've been actively building a new generation of software that adds a self-healing layer.

What this delivers is an improved customer experience where users see fewer errors, while the business experiences fewer support cases and lower costs.

What we're seeing in practice is that designing a self-healing system is a lot easier than it used to be, yet far from trivial. There are still considerable challenges and tradeoffs.

As with any new architectural pattern, there's still a lot to learn. We're excited to share what we've learned so far with you.

In system design, we like to think back to great engineering approaches from the past. In this instance, the great automotive engineers in the 1980s, and specifically focus on the difference in approach between German engineering and Japanese engineering. The Germans focused on precision. Every part worked at its maximum efficiency. This makes for excellent driving experiences in BMWs and Porsches. On the other hand, Japanese engineers focused on reliability. The system was designed where each part could function as designed while tolerating variance in dependent parts. This led to Toyota's and Honda's legendary reliability.

Traditional software methods borrow from both approaches, yet tilted in the direction of precision. Ultimately, if the system is designed to be reliable, it makes certain assumptions about its dependencies such as an APIs shape and behavioral characteristics. And these assumptions are compiled into the code.

Thanks to the non-deterministic nature of AI, we're seeing a new approach to software design. Many dependencies are now non-deterministic. And even deterministic dependencies are going through rapid changes. A good example of such a dependency is an agentic CLI. Agentic CLIs have deterministic APIs and non-deterministic output. An interesting mix, but let's focus on the API.

Traditionally, we would expect a CLI and the interface it exposes to be stable. Often a CLI that is designed to be called by other applications exposes its functionality via a daemon, where the CLI is effectively a client that talks to the daemon. So, our integration happens via the daemon and its API, which would have been stable as recently as few years ago. However, this assumption of stability is no longer valid in the age of AI as most applications are undergoing rapid iteration, often multiple times a week and occasionally more than once per day.

So how do we handle a situation where the API we're calling is deterministic, but unstable?

One AI-native approach is to shorten the break-fix cycle. In this approach, we'd attach an AI coding agent to our logging and telemetry systems, then allow the AI to propose a solution and open a PR as soon as a bug is found. But that's not what we're focusing on here. A more dynamic approach is to automate the process of detecting and adapting to changes in the API. This can be achieved by only bringing AI into the process when a failure occurs. For example, if the API calls behaves as expected we can avoid AI all together.

However, if we experience a error, we can then classify the error to determine if it's recoverable via AI, such as if the API surface area has changed, a new flag has been added, etc. Essentially, if AI is able to interrogate the error, the API, and the current documentation, we can add a new type of dynamic retry where the AI calls the API, recovers from the error, and logs the issue such that a permaent patch can be applied in a future release. What this yields is a self-healing system that can adapt to rapid changes in the API. It also delivers a better customer experience and fewer customer support issues.

Before you say this looks a lot like tool calling. It does. Sort of. But tool calling is purely dynamic. In this situation, we're building a deterministic system that can adapt to the changing nature of the CLI. The system is default deterministic and only leverages AI when an error has occurred. The non-determinism is a temporary state giving engineers time to patch the underlying code with zero impact on customers.

It's important to note that this approach is not without tradeoffs. Injecting AI into a deterministic system increases the software's attack surface. It can also introduce latency as well as introduce prompt injection vulnerabilities. However, the benefits of improved customer experience and reduced support costs often outweigh these risks. But it's up to you to decide if this is the right approach for your application.

Self-healing Software is no Longer a Pipedream

Akbar Ahmed

Related Articles

Rethinking Scrum Team Size in the Age of AI

The 5 AI Adoption Archetypes in Engineering Teams

The Pull Request is Dead

Ready to orchestrate your AI workforce?