Five Categories of Failure
Failure categories are a simplified approach to understanding distributed systems failures and what to do about them. Without failure categories, failure mode analysis and system design operate off a list of thousands of specific failures, which pushes teams toward one-off approaches to failure detection and mitigation. Failure categories can help you design a few common approaches that detect and mitigate a wider range of failures.
Traditional Failure Analysis
Even simple distributed systems are extremely complex. A single transaction may use hundreds of computers and many networks. Distributed systems need DNS names, SSL certificates, a myriad of security credentials, layers of software and layers of networked devices connecting everything together. Any of these components can fail and impact an application. There are lots and lots and lots of ways for distributed systems to fail.
Organizations that have been building and operating distributed systems for any length of time have long lists of failure modes and what to do about them. As part of launching a new AWS service, each service team reviews ~100s of different failures in the context of their new service. As an organization, Amazon tracks historical failures in its Correction of Error (COE) database across its lifetime of operations. The COE process has identified 1,000s of specific types of failure. The problem is, 1,000s of types of failure is too many to build mitigations for.
When teams design for failure, most focus on their own favorite list of failures out of the 1,000s of possible failures. Then they design solutions to prevent or mitigate those failures without always considering the rest. This approach is better than nothing, but it often leads to resilience with gaps.
Other organizations may pursue failure mode analysis and cost-tradeoff ROI analysis as if resilience risks can be closed out in a mathematical way. They can't: the risk math for systems as complex as a typical distributed system doesn't close. You can spend an unending amount of time identifying, classifying, and testing risks and still experience prolonged failures. Some investment in threat modeling and risk analysis is reasonable, but organizations that need to up-level their resilience are better off focusing their energy on building a few well-tested mitigations that cover a wide range of risks within each category of failure. Getting this right still takes considerable investment, but it yields tangible protection. Never-ending risk-management practices tend to produce more paperwork than protection.
Categories of Failure
Categories of failure is a simplified approach to designing resilient distributed systems. Instead of designing for specific failures, you can implement detection and mitigation strategies that broadly address the 100s to 1000s of specific types of failure within a given category. Designing reusable detection and mitigation mechanisms leads to fewer, more reliable mitigations compared to a one-off approach.
This article will introduce you to the five categories of failure. Each category has different properties, and different kinds of detection and mitigation strategies apply, as detailed below.
There are five categories of failure, starting with the service layer of the application. Failures can originate within a service in three ways. Service Code & Config: bugs or bad deployments at the service layer can cause a service to fail. Service Data & State: ephemeral request data, cached state, or durably stored data can experience corruption or interact with software bugs in ways that cause a service to fail. Service Infrastructure: the physical hosts or network devices underneath a service can fail. The remaining two categories originate outside the service. Clients can cause service failures by sending corrupted data or poison pill requests, and a collective of clients can overwhelm a service and cause it to fail. Dependencies are other services an application depends on; a failure of a database dependency, for example, can cause a service to fail. We will look at each of these categories in turn, starting with the one that tends to get the most attention.
1 - Service Infrastructure
Distributed systems applications typically use many redundant application instances running on redundant infrastructure. Things like load balancer devices and distributed systems platforms like Kubernetes are built to handle host-level infrastructure failures. Heartbeat health checks from the load balancer or external controller ping each host to detect hard failures. If a host fails, it is taken out of service and client requests are routed among the remaining healthy hosts. This same approach is used for network devices. This approach works great for hard-down failures, but standard heartbeat health checkers aren't good at detecting partial or intermittent failures, which become more common as a system grows in scale & complexity.
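As a minimal sketch of the heartbeat pattern (the /health endpoint, host addresses, and 2-second timeout are illustrative assumptions, not any particular load balancer's implementation):

```python
import urllib.request

HEALTH_CHECK_TIMEOUT_SECONDS = 2

def is_healthy(host: str) -> bool:
    """Heartbeat check: a host counts as healthy only if /health answers quickly with a 200."""
    try:
        with urllib.request.urlopen(f"http://{host}/health",
                                    timeout=HEALTH_CHECK_TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, DNS failure: treat the host as hard-down.
        return False

def refresh_routing_table(all_hosts: list[str]) -> list[str]:
    """Return only the hosts that pass the heartbeat; requests are routed among these."""
    return [host for host in all_hosts if is_healthy(host)]

# Hypothetical fleet behind a load balancer.
healthy_hosts = refresh_routing_table(["10.0.1.11", "10.0.1.12", "10.0.2.13"])
```

Notice the check is binary: a host that answers 200 while failing 5% of real requests looks perfectly healthy to this kind of probe, which is exactly the partial-failure gap described above.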
Scope of Failure
Service Infrastructure failures should be thought of as a set of nested boxes. A host is one such box, and it can fail independently without affecting the other hosts. However, the load balancer that routes requests to each of those hosts can also fail. The load balancer is like a box that encompasses all of the host boxes behind it. If that load balancer box experiences a failure, it can affect all of the host boxes within it: clients won't be able to reach the hosts. Similarly, the datacenter is another box that wraps around the load balancer and all of the host boxes. If a datacenter fails, all of the hosts and load balancers in that datacenter fail too. This nested box concept can be useful for identifying potential large-scale failures that can impact many or all of your application instances or application clients. In the diagram below, a failure of one datacenter would impact three out of six total application instances.
Probability of Failure
Server hardware failure rates average about 5% annually. Said another way, if your service runs across 100 hosts, you can expect about 5 host failures in a given year. If you are operating in the Cloud, virtual instance fault rates are in line with that traditional average but vary by instance type; 5% is still a good place to start if you want to build some predictions about the failure rate you might see. Using the example above, if it takes 3 minutes to remove an unhealthy host, you can expect a total of 3 * 5 = 15 minutes per year of having a single unhealthy host in service.
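As a back-of-the-envelope sketch using the illustrative numbers above (5% annual host failure rate, 3 minutes to remove an unhealthy host):

```python
fleet_size = 100                        # hosts behind the load balancer
annual_host_failure_rate = 0.05         # ~5% of hosts fail per year
minutes_to_remove_unhealthy_host = 3    # detection + removal time per failure

expected_failures_per_year = fleet_size * annual_host_failure_rate               # 5 failures
expected_unhealthy_minutes = expected_failures_per_year * minutes_to_remove_unhealthy_host  # 15 minutes

print(f"{expected_failures_per_year:.0f} host failures, "
      f"~{expected_unhealthy_minutes:.0f} unhealthy-host minutes per year")
```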
Network devices, specifically routers and switches from reputable suppliers, tend to fail at less than 1% annually. A single network device failure, however, can impact many hosts. The configuration of the network also determines whether a few hosts experience a total outage when a network device fails, or whether many hosts experience a partial failure. Network design as it relates to distributed systems is a topic for another blog post, but the takeaway here is that even though the network device failure rate is predictable, the scope and type of impact varies with the network design, making the probability of failure of a network device much harder to reason about when planning application-level mitigations.
Failure Detection & Diagnosis
Infrastructure failures can show up as hard-down failures, as intermittent or partial failures, or as a reachability issue preventing a client from talking to a server or preventing a server from talking to another server. Ideal infrastructure monitoring can quickly identify the specific device or network path causing the problem.
Hard-down failures are caused by things like loss of power. Regular heartbeat pings of every physical device are a good way to detect and diagnose hard-down device failures for hosts and network devices.
Intermittent failures can be caused by things like overheating devices leading to physical processing errors. Diagnosing intermittent hardware failures is difficult. If your application emits error rate and request-processing latency metrics and each host handles ~100 requests per minute, then you can use host-level metrics to identify a single host that is having trouble, even if the failure shows up as something as small as a ~5% error rate. You can compare the error rates of each host to see if any one host shows anomalously high error rates compared to its peers. If you don't have that kind of request volume, it may be very difficult or impossible to identify a single problematic host.
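A sketch of that peer comparison, assuming you already collect a per-host error rate over a recent window (the 2% tolerance is an illustrative choice, not a recommendation):

```python
from statistics import mean

def find_outlier_hosts(error_rates: dict[str, float], tolerance: float = 0.02) -> list[str]:
    """Flag hosts whose error rate sits well above the fleet average.

    error_rates maps host -> fraction of failed requests in the last window.
    The fixed tolerance keeps ordinary noise from flagging the whole fleet.
    """
    fleet_average = mean(error_rates.values())
    return [host for host, rate in error_rates.items() if rate > fleet_average + tolerance]

# One host at ~5% errors stands out against peers near 0.5%.
print(find_outlier_hosts({"host-a": 0.005, "host-b": 0.004, "host-c": 0.051}))
```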
Reachability issues are failures that prevent a client request from reaching a service. Reachability issues caused by network device or host failures are very difficult to detect, but detection is possible with knowledge of the network design. Client-side metrics and synthetic request monitoring can be used to detect failures along all network paths.
The unifying theme of infrastructure failures is that the metrics must be designed to identify a particular piece of physical infrastructure, or a cluster of devices that share common infrastructure, like a rack or a datacenter.
Failure Mitigation
As a service owner, you can control the scope of failure for infrastructure-type failures based on where you place your application instances. For example, you could place all of your application instances on a single physical host. If that single host fails, all of your redundant instances fail with it. As an alternative, you can put each application instance on a separate host, so a single host failure only impacts a single application instance. You can further reduce the probability of correlated failure of application instances by putting your hosts into different physical racks, connected to different top-of-rack switches, and then putting the racks into different independent datacenters.
In cloud environments like AWS, using independent infrastructure means distributing application instances across many different Availability Zones or Regions. AWS provides EC2 placement groups, which give you some control over whether instances run near each other, possibly in the same datacenter and rack, or far away in different datacenters and racks. By designing application instance placement with a clear idea of the scope of infrastructure failure, you can ensure your application has plenty of remaining capacity when infrastructure failures occur.
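For example, a spread placement group asks EC2 to put each instance on distinct underlying hardware. A hedged boto3 sketch (the group name, AMI ID, and instance type are placeholders; your networking, counts, and Region will differ):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A "spread" placement group places each instance on distinct racks with their own
# network and power, reducing correlated host failures.
ec2.create_placement_group(GroupName="app-spread-group", Strategy="spread")

# Launch instances into the group; ImageId and InstanceType are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.large",
    MinCount=3,
    MaxCount=3,
    Placement={"GroupName": "app-spread-group"},
)
```

Spreading instances across subnets in separate Availability Zones extends the same idea from racks to datacenters.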
There isn’t a way to prevent infrastructure failures. Hardware fails like the sun rises. If you’re using typical infrastructure, tooling and network protocols, hard-down type failures are probably handled well for your applications. However, large scale failures of datacenters, or more complex partial failures that lead to elevated error rates, high latency or packet loss are things most applications aren’t designed to detect and mitigate automatically. Most teams aren’t aware that the health checks they are using to detect a failed host won’t detect and mitigate intermittent failures.
If you need to hit the highest level of resilience and are sensitive to elevated error rates, you’ll need to invest more in building your own detection and mitigation systems. We’ll deep dive on that problem in another blog post.
2 - Service Code & Config
For most organizations, bugs in software, problems in code deployment, misconfigurations, or expiry of certificates & credentials are the most common source of application failures. Software bugs and misconfigurations can either show up immediately after deployment or they might show up some time later. As an example, a new security credential might be deployed without issue and then expire 5 days later, causing a failure.
Scope of Failure
The scope of failure is determined by the scope of the change. For example, a change made to a single application instance may only impact that single instance. However, a change to a database schema can impact every application instance that depends on that schema. Because a change in one service can show up as a failure in another service, it can be difficult to identify the change that caused the failure. You can design your system to isolate change-related failures even when they show up in services other than the one directly changed, but it takes careful design of networking and cross-service dependencies.
Probability of Failure
Probability of failure usually correlates with the overall change rate for the application. As you increase the rate of deployment, you are more likely to suffer a deployment-related failure. The exact failure rate per deployment is unique to an organization and the complexity of the changes it is making. Failures can be reduced, but not entirely prevented, with good pre-production testing. Probability of failure may also be reduced by limiting changes; however, this often results in more complex and higher-risk changes. Many organizations today have decided that frequent, less risky changes are better than infrequent, riskier changes, but this trade-off should be made based on the unique needs of a specific application.
Failure Detection & Diagnosis
Ideal metrics clearly identify the specific change or configuration that caused the problem. A common strategy is to pair changes with a suite of functionality tests that run every time an application changes. However, teams rarely achieve this ideal, for a few reasons. First, it is nearly impossible to devise a set of tests that catches every change-related problem. Second, changes in one part of the system can manifest in another part of the system, and it is usually impractical to run a full test suite on the entire system for every change. Full system testing could take many minutes or even hours to complete, which leads to an impractically long time to diagnose the issue. Don't get me wrong, you should run tests after every change, just don't pretend this will allow you to immediately detect every change-related failure in your applications.
You can improve your ability to detect change-related failures by building logical change partitions into your application. If you run in multiple datacenters, then using a datacenter as a logical change domain is a common and effective method for limiting the scope of change. To detect change-related failures, you'll want application error rate and latency metrics that aggregate at the datacenter/Availability Zone level, or at whatever level maps to your logical change partitions. Now if you make a change within a single datacenter, you can detect a problem at the datacenter level, whatever the cause. You can mitigate the event by shifting client workloads away from the impaired datacenter. Then you can root cause and repair the change-related failure, taking the time you need, without impacting your clients.
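A sketch of detection at the change-partition level, assuming error rates are already aggregated per zone; the 5% threshold and the shift_traffic_away hook are hypothetical stand-ins for your own alarms and load balancer or DNS weighting mechanism:

```python
ERROR_RATE_THRESHOLD = 0.05   # illustrative: alarm when a zone exceeds 5% errors

def shift_traffic_away(zone: str) -> None:
    """Hypothetical hook: set the load balancer / DNS weight for this zone to zero."""
    print(f"shifting client traffic away from {zone}")

def find_impaired_zones(zone_error_rates: dict[str, float]) -> list[str]:
    """Return the zones whose aggregated error rate exceeds the threshold."""
    return [zone for zone, rate in zone_error_rates.items() if rate > ERROR_RATE_THRESHOLD]

def mitigate(zone_error_rates: dict[str, float]) -> None:
    """Shift traffic away from any impaired zone, whatever the underlying cause."""
    for zone in find_impaired_zones(zone_error_rates):
        shift_traffic_away(zone)

# A bad deployment confined to us-east-1a shows up in that zone's aggregate alone.
mitigate({"us-east-1a": 0.31, "us-east-1b": 0.002, "us-east-1c": 0.001})
```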
Failure Mitigation
Deploying incrementally, to just a few instances at a time, can limit the scope of failure and ensure you have redundant healthy application instances to serve your clients when a deployment failure occurs. The same is true of configuration and certificates. You can use many SSL certificates for the same name, deployed to different application instances, each with a different expiry time. If an expiry occurs, it will only cause a subset of your application instances to fail. To fully mitigate a failure, you'll need to pair incremental deployments with client retry logic or the ability to detect and remove problematic application instances.
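A sketch of the incremental idea, with hypothetical deploy_to_host, passes_health_and_canary_checks, and remove_from_service hooks standing in for your deployment tooling:

```python
import time

def deploy_to_host(host: str) -> None:
    """Hypothetical: push the new build or config to a single host."""
    print(f"deploying to {host}")

def passes_health_and_canary_checks(host: str) -> bool:
    """Hypothetical: health checks plus a few canary requests against one host."""
    return True

def remove_from_service(hosts: list[str]) -> None:
    """Hypothetical: take the impaired batch out of the load balancer."""
    print(f"removing {hosts} from service")

def rolling_deploy(hosts: list[str], batch_size: int = 2, bake_seconds: int = 60) -> bool:
    """Deploy in small batches and halt the rollout as soon as a batch looks unhealthy."""
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy_to_host(host)
        time.sleep(bake_seconds)           # bake time before judging the batch
        if not all(passes_health_and_canary_checks(h) for h in batch):
            remove_from_service(batch)     # only the bad batch is impacted
            return False                   # stop the rollout instead of spreading the failure
    return True
```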
Rollback capability is another common and useful method to mitigate Code & Config failures. However, rollbacks face two challenges. First, it is sometimes difficult to correlate a failure to a specific change, and if you can't identify the change, you can't roll it back. Second, rollback operations can be complex and time consuming; rollbacks sometimes take minutes or hours to complete. Applications targeting recovery times of <5 minutes should consider a different primary mitigation strategy, like incremental deployments and removal of impaired hosts from service. You can read more about rollback alternatives here.
3 - Service State & Data
This category of failure is a candidate for most difficult to mitigate. State & Data failures tend to be less common than other categories, but when they do show up, they often cause a large scope of impact for a long period of time.
Applications run code on compute infrastructure, interconnected with clients and services over networks, to process a workload. That workload is composed of lots of different state: the ephemeral TCP connection state between the client and the server, the request/response data exchanged between the client, the server, and any other servers hosting application dependencies, the application memory on each host, and the durably stored data on disks and in databases.
Durably stored data for distributed systems usually has to deal with CAP Theorem constraints to ensure data is redundantly stored across hosts and meets an application's data consistency requirements. Design trade-off choices in durably or redundantly stored data directly impact data staleness or data availability when failures occur. The expanded PACELC Theorem explains that these choices also impact application throughput and latency at all other times.
Poison pills and data corruption are examples of State & Data failures that can impact your application. As an example, a poison pill request from a client could cause an application instance to crash when it attempts to read the request data. If that same poison pill shows up in many client requests, then impact could spread to all application instances.
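Defensive request handling is the usual first line of protection against poison pills. A minimal sketch, assuming JSON request bodies and a hypothetical process function for the real business logic:

```python
import json
import logging

logger = logging.getLogger("request-handler")

def process(request: dict) -> dict:
    """Hypothetical: the real business logic lives here."""
    return {"status": 200, "echo": request["payload"]}

def handle_request(raw_body: bytes) -> dict:
    """Process one request without letting a poison pill crash the whole instance."""
    try:
        request = json.loads(raw_body)
        return process(request)
    except (ValueError, KeyError) as err:
        # Log enough detail to find the poison pill later, then fail only this request.
        logger.error("rejecting malformed request %r (%s)", raw_body[:256], err)
        return {"status": 400, "error": "malformed request"}

# A request that would otherwise raise becomes a single 400 instead of a crash loop.
print(handle_request(b'{"wrong_key": 1}'))
```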
Scope of Failure
For most applications, the potential scope of failure covers the entire application. Sharded or federated applications are the exception, because they segment clients and client data into independent application stacks. Everything else shares common data and state by design. State and data failures can also show up in a time-bomb fashion, uncorrelated with any obvious event. For example, imagine an application teetering at the limits of its configured Java stack memory; then, once the client request pattern changes, every application instance suddenly hits the limit. Or imagine a datastore using a tinyint ID, where the last record exhausts the available ID space and no new records can be created. These kinds of failures can cause a whole application to fail 100% of requests.
Probability of Failure
State & Data failures are the least common of the five categories, but they are some of the most difficult to identify and mitigate, leading to long downtimes. As an example, corrupted data in a very large database may require hours or even days of data processing to repair. These failures can have the largest impact, because the scope of impact is often felt application-wide and across all clients. Fortunately, layers of defensive parameter validation and corruption checks in application code and interprocess communication protocols reduce (but don't eliminate) the probability of State & Data failures.
Failure Detection & Diagnosis
State & Data issues are the most difficult of the five categories to diagnose. Ideal metrics would identify the specific bit of problematic state, but most State & Data failures are one-offs, and you're unlikely to have specific monitoring in place that produces an automated alert pointing at the problematic state. Detailed logging is the next best thing. Detailed logs describing the function call and the data being processed just prior to a crash or an error are usually how these failures are diagnosed. Diagnosis usually requires experts to manually dig through logs, and additional logging may need to be added on the fly, extending the time to diagnose further.
These issues take a long time to resolve because of the time-consuming diagnostic process. Sharding data into separate partitions can help you narrow the search, since only one partition is likely to be affected at a time. But recovering the impaired shard still requires the same time-consuming diagnostics.
Failure Mitigation
Engineering hyper-vigilance in parameter validation, use of hashing or CRCs to detect data corruption, and careful consideration of how IDs are generated and used can prevent State & Data failures. Things like solar-flare bit flips can't really be prevented, but diligent design can detect them and prevent cascading issues. However, this approach leads to an unending amount of one-off solutions. Hyper-vigilance is another way of saying "I'm relying on the good intentions of my software engineers". You still gotta do it, but if you need categorical protection, then data/state and service sharding is the only way to reduce the potential scope of failure when State & Data failures occur. Marc Brooker, a Sr. Principal Engineer at AWS, once said, "If you're running the same code, processing the same data, then the same failure will occur" - the only guaranteed mitigation is to split your application so all instances aren't processing the same data.
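A minimal sketch of that split, using a stable hash to pin each customer to one of a fixed number of shards (the shard count and key choice are illustrative):

```python
import hashlib

SHARD_COUNT = 8   # each shard is an independent application stack with its own data

def shard_for(customer_id: str) -> int:
    """Stable hash-based assignment of a customer to one shard."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARD_COUNT

# A poison record or data bug now takes down at most one shard's worth of customers.
print(shard_for("customer-1234"))
```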
4 - Clients
Client-originated failures are caused by clients and include things like poison pill requests, or an overwhelming amount of requests from a multitude of clients. Under-capacity failures of any kind can be thought of as client-originated failures, because the mitigations are similar to what you would need to deal with an unexpected surge in client requests.
Scope of Failure
The potential scope of failure for Client failures is commonly the whole application, unless the application uses a form of customer sharding. Said another way, any part of the application that processes a customer request is subject to client-originated failures.
Probability of Failure
The probability of failure roughly correlates with the type of client the application is serving. For example, public-facing services are more exposed to malicious clients that might mount DDoS attacks to overwhelm a service, whereas internal services are generally not exposed to such threats. Multi-tenant systems with lots of clients are more likely to experience a client-originated failure than a single-tenant application.
Failure Detection & Diagnosis
Per-client metrics are ideal for diagnosing client-sourced failures. If you have a very large number of clients (>100), standard metrics tools can be problematic: visual graphs just don't work with 100s or 1000s of lines, and some monitoring systems may not be able to handle that many metric dimensions. Something like CloudWatch high-cardinality metrics can help here; however, these metrics work best for sortable, top-n type queries. For example, you can use a built-in metric sort to identify the single client sending you the most requests or generating the highest error rate. This can help you very quickly identify and isolate or block the problematic client. This approach won't work as well if the problematic client is breaking your application with a single request. For those issues, you'll probably have to resort to log diving to find the single request causing the problem. If you have a large number of clients (>100) causing issues, high-cardinality metrics may help you see the problem, but it becomes difficult to separate problematic clients from regular ones. And even if you could, can your isolation or blocking mitigation handle >100 entries?
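The top-n idea itself is simple. A sketch that counts requests per client over a window and surfaces the busiest callers (the client IDs are made up):

```python
from collections import Counter

def top_talkers(client_ids_in_window: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Count requests per client ID over a window and return the n busiest clients."""
    return Counter(client_ids_in_window).most_common(n)

# A skewed client stands out immediately in a sortable top-n view.
print(top_talkers(["client-a", "client-b", "client-a", "client-a", "client-c"]))
```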
Failure Mitigation
Mitigations to client-related failures fall into three categories: 1/ Block specific customer requests with things like throttles, firewall rules, and request validation. 2/ Reroute or tarpit specific customer requests to a null route or to a separate application stack. 3/ Scale up or scale out the service. If you have control over the clients, you may also mitigate the failure by coordinating with the clients.
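For the first category, a per-client token bucket is a common throttle shape. A minimal sketch (the rate and burst numbers are illustrative, and a real implementation would need eviction and thread safety):

```python
import time

class TokenBucket:
    """Per-client token bucket: reject a client's requests once it exceeds its allowed rate."""

    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should respond with a throttle error (e.g. HTTP 429)

buckets: dict[str, TokenBucket] = {}

def allow_request(client_id: str) -> bool:
    """One bucket per client ID, created on first sight."""
    bucket = buckets.setdefault(client_id, TokenBucket(rate_per_second=10, burst=20))
    return bucket.allow()
```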
5 - Dependencies
Service dependencies are other applications, often black-box services, that your application depends on. If your application runs in the cloud, an example might be DynamoDB used as a durable store for application data. A problem in DynamoDB can cause your application to fail.
Scope of Failure
The scope of failure for dependency-related failures often maps to specific application functionality and implementation details. For example, a read-only operation that uses application-local data is unlikely to be impacted by a downstream dependency, whereas a write/update operation probably needs to make a call to a database service.
It is common for dependency failures to have a whole-application scope of failure. Federated or sharded applications that use fully independent instances of their dependencies for each application stack can limit the scope of failure to just one stack. However, even these applications may end up using the same DNS or CDN service for all of their stacks. Most applications have a number of these kinds of "Global" scope services.
Probability of Failure
The probability of a dependency-related failure increases as a service takes on more dependencies. Probability is also affected by the reliability of the dependencies the application is using. Getting good reliability data for most SaaS services is difficult, and it is similarly difficult to assess the underlying resilience designs of 3rd-party services. If you truly need to know whether a service will meet your particular resilience requirements, you may need to build and manage the needed functionality yourself rather than relying on unknowable 3rd-party service resilience.
Failure Detection & Diagnosis
Ideal metrics provide error rate and latency data for each application-dependency pair. If you know something about the infrastructure design of the dependency, then more granular metrics can help. For example, you can record error rate and latency metrics for every call from a particular app to DynamoDB. The design of DynamoDB means that different tables and different accounts can experience errors while others do not, so table-level metrics can help you identify the particular table causing issues. In practice, this level of granularity isn't useful unless you have some kind of mitigation plan you can enact. For example, if you use redundant tables in different Regions, a table-level metric could allow you to shift traffic away from the problematic table, even when all of the other tables in that Region are ok.
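A sketch of per-dependency instrumentation as a thin wrapper around every outbound call; in production the in-memory dicts would be a metrics client (e.g. CloudWatch), and the dependency names here are illustrative:

```python
import time
from collections import defaultdict

# (dependency, outcome) -> count, plus latency samples per dependency.
call_counts: dict[tuple[str, str], int] = defaultdict(int)
latencies_ms: dict[str, list[float]] = defaultdict(list)

def instrumented_call(dependency: str, fn, *args, **kwargs):
    """Record error rate and latency for each application-dependency pair (or per table)."""
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
        call_counts[(dependency, "success")] += 1
        return result
    except Exception:
        call_counts[(dependency, "error")] += 1
        raise
    finally:
        latencies_ms[dependency].append((time.monotonic() - start) * 1000)

# Example at table-level granularity for a hypothetical DynamoDB table object:
# instrumented_call("dynamodb:orders-table", table.get_item, Key={"id": "123"})
```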
Failure Mitigation
Failure mitigations here are more specific to the application's functions and overall service design. In some cases it is possible to make use of redundant dependencies; in other cases it may be possible to invest more engineering time into building additional functionality into your application so there's no need for the dependency. Building retry logic or caching into an application can reduce the impact of some types of dependency failures.
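A sketch of those last two ideas together: jittered retries for transient dependency errors, with a stale cached value as a fallback when the dependency stays down. The fetch callable and cache shape are illustrative assumptions:

```python
import random
import time

def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
    """Retry a flaky dependency call with capped, jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Jitter keeps a fleet of retrying clients from hammering the dependency in lockstep.
            time.sleep(min(2.0, base_delay * (2 ** attempt)) * random.random())

cache: dict[str, str] = {}

def get_config(key: str, fetch) -> str:
    """Serve a stale cached value if the dependency is down, rather than failing the request."""
    try:
        cache[key] = call_with_retries(lambda: fetch(key))
    except Exception:
        if key not in cache:
            raise             # no fallback available; surface the dependency failure
    return cache[key]
```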
Conclusion
Organizations can use these five categories of failure to simplify decision making for resilience investment decisions. For example, it may be obvious that very little investment is needed to protect an internal application from Client originated failures because it won’t be exposed to external malicious bad actors. That same application may be so critical to the business that it must recover in minutes from infrastructure failures, including datacenter level infrastructure failures. A categorical approach to failure mitigation can help engineering teams quickly align on the most important areas of focus for any particular application.
Thinking categorically about failure can help engineering teams with mitigation reuse. My favorite example of mitigation reuse centers on the Code & Config and Infrastructure categories. Companies commonly make use of simple health checks to detect and mitigate host failures. Many also have software rollback capability, sometimes even fully automated rollback capability. Software rollbacks and infrastructure recovery are treated as separate mitigations in most organizations. With thoughtful engineering, host-level failure mitigation can be reused to mitigate both Infrastructure failures and Code & Config failures. Instead of using simple health checks, adding a wider range of error-rate detection, canaries, and functional testing within the health check can detect and mitigate problems caused by Code & Config issues by removing hosts when an application instance shows signs of impairment.
How to safely use host removal to mitigate Code & Config failures is discussed at a high level in this article. For now, I'll leave you with the idea that fewer well-exercised, well-understood mitigations lead to better resilience than a plethora of one-off mitigations that are rarely used and may not work when you need them. Categories of failure can help you do just that.