Decoding software failure: challenges and implications
Businesses are increasingly reliant on software applications to deliver revenue-generating services and power their internal operations. In 2024, the ability to ship code fast has proven to be a competitive advantage in many verticals. Unfortunately, there are consequences to moving fast: software quality and platform reliability are the most frequent casualties.
We should never expect software to be perfect, but in my 20 years of building large-scale applications, this is the most concerned I've been. Engineering teams have to worry about the quality of the code they ship and the dozens of libraries they inherit. 80-90% of a typical software application was not developed in-house, and the vast majority of that inherited code is open source.
All of this code has inherent flaws that lead to faults that can cause failures under the wrong conditions.
Software Reliability = customer satisfaction and cost management
Worst case, these failures lead to customer-impacting outages that hurt revenue and customer satisfaction. These are the severity 1 and severity 2 incidents we remember in our sleep. They are bad and get all the headlines - but they ultimately get fixed.
As engineers, we face two other categories of issues on a more frequent basis.
Ghosts in the Machine
Customer-reported (or internally reported) issues that aren't widespread but consume valuable hours of engineering time every week to reproduce, troubleshoot, and hopefully fix. Across 152 engineering teams, we found that on average 30%-40% of engineering time is spent on this unplanned work.
Here's how it most frequently plays out -
- The infrastructure team looks at the logs to see if it's something they've seen before. If not, they use the available information (e.g., the service name) to try and figure out where the ticket should go. Due to the maze of microservices, it's difficult to know who to assign it to, and reassignment frequently occurs.
- The assigned software engineer requests some additional logs - if available.
- QA tries to reproduce the failure - but they frequently can't.
- In most cases, there is not enough information to get to root cause. The ticket ages. Another unsolved mystery.
- Engineers grow reluctant to take on more of these phantom tickets.
Sweeping Failure Under the Rug
Another category is more silent.
These failures lie dormant, below some threshold, until avoiding them is no longer an option. The cloud buys us time. Redundancy kicks in, requests are retried, more resources are allocated. Cloud bills grow. These hidden failures are not free.
For example, if a container crashes, Kubernetes spins up another one. You rarely find out why it crashed, or spend the time to figure out how to stop it from happening again. The same goes for non-fatal errors and latency blips. If a service is blocked waiting on a slow query, Kubernetes may launch more replicas to service the growing queue of requests.
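To make the point concrete, here is a minimal sketch that surfaces the restart counts and crash reasons that otherwise disappear when the cluster quietly spins up replacements. It assumes the official `kubernetes` Python client and a reachable cluster; the `production` namespace is a placeholder.

```python
# Sketch: list containers that have restarted and why, so hidden crashes
# stop getting swept under the rug. Assumes the official `kubernetes`
# Python client and cluster credentials; namespace is illustrative.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("production").items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count > 0:
            # last_state.terminated carries the reason for the previous crash
            # (e.g. OOMKilled, Error) - the detail that usually goes unexamined.
            term = cs.last_state.terminated
            reason = term.reason if term else "unknown"
            print(f"{pod.metadata.name}/{cs.name}: "
                  f"{cs.restart_count} restarts, last reason: {reason}")
```

Run periodically (or wired into an alert), a report like this turns "Kubernetes restarted it, we're fine" into a list of failures that still need a root cause.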
Cloud infrastructure has been the biggest enabler of reliability over the last decade, but it has also become the seemingly infinite rug that issues get swept under.
When we ran our own data centers, running out of memory or taxing the CPU usually meant you had to purchase, configure, and rack a new server. Some of you may remember having to think about power, space, and cooling (PSC) demands. Those challenges are largely gone for most organizations. But adding more "hardware" has become the easy button.
Data Overload
While observability and application performance monitoring (APM) have gained mass adoption over the last decade, most engineering teams still struggle to detect and pinpoint hidden failures.
Troubleshooting still requires tremendous context, expertise, experience, and time. As a result, the majority of issues end up on the same people's desks. You can probably name them. They usually have 15-20 years of experience and know where the proverbial bodies are buried.
For a while, vendors declared that we had a last-mile problem. The AIOps category was created as a way to make "sense" of the data that was already being collected. By and large, this technology category did not work as advertised. Teams faced more alerts and more noise.
In parallel, the three pillars of observability (metrics, logs, traces) grew to six (TEMPLE), illustrated in the short sketch after this list -
- Traces: Usually intended to mean distributed traces, which attempt to follow a request from cradle to grave.
- Events: Refer to change events such as configuration changes.
- Metrics: Timestamped numbers that describe throughput, latency, resource usage, or some application-specific counter.
- Profiles: Snapshots of a system's execution at specific points in time, allowing for detailed analysis of resource usage such as CPU or memory.
- Logs: Messages generated by the system components, containing valuable information for understanding behavior, diagnosing issues, and auditing activities.
- Errors: Any explicit unexpected or abnormal conditions or behaviors detected within the system.
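For readers who haven't instrumented a service, here is a minimal sketch of how a few of these signals (traces, metrics, logs, errors) get emitted using the OpenTelemetry Python API. The service and counter names are illustrative, and without an SDK configured these calls are no-ops.

```python
# Sketch: emitting traces, metrics, logs, and errors from one code path.
# Uses the OpenTelemetry Python API; "checkout-service" and
# "orders_processed" are made-up names for illustration.
import logging
from opentelemetry import trace, metrics

tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")
orders_processed = meter.create_counter(
    "orders_processed", description="Application-specific counter (a metric)")
log = logging.getLogger("checkout-service")

def process_order(order_id: str) -> None:
    # Trace: follow this request from cradle to grave.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        try:
            # ... business logic would go here ...
            orders_processed.add(1, {"status": "ok"})   # Metric
            log.info("order %s processed", order_id)    # Log
        except Exception as exc:
            span.record_exception(exc)                  # Error attached to the trace
            log.exception("order %s failed", order_id)
            raise
```

Each signal answers a different question, which is exactly why the bill grows so quickly when all six are collected for every service by default.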
Unfortunately, 90% of this data is doing nothing but costing the engineering team money. We've met teams who are spending more on observability than on cloud infrastructure - and are still unable to get software failure under control.
A Better Way...
It's time to start from first principles. We're building a better way. We'd love to hear how your team is tackling these challenges.