Defining resiliency in energy and software

The word “resiliency” is all the rage right now.

The notion of resiliency is uniquely applicable in a systems context. Specifically, it is a desirable feature of any system that is made up of many moving parts that operate in a distributed, coordinated fashion.

In such systems, failure is an inevitability that must be planned for. Whether you’re talking about an electricity grid or a network of software services, the study and construction of distributed systems necessarily entail having to worry about component failure. Planning for failure and designing systems to be able to mitigate its impact is at the core of resiliency.

To study the implications of this mindset, let’s take a look at how resiliency is defined in the electricity generation, transmission, and distribution industry.

Setting the stage

Within the US energy sector, a debate is raging among industry analysts, regulators, and vendors over the degree to which renewable energy resources could wreak havoc on our electricity grid. On one side, proponents of the status quo of subsidized fossil fuels and centralized power generation are sowing fear that intermittent solar and wind generation will cause brownouts and systemic failures. Countering this narrative is a growing body of research and field evidence indicating that the distributed nature of renewables, and particularly the one-two punch of solar-plus-storage, will make the grid more resilient to systemic failures.

The terms reliability and resiliency are also all the rage in the software industry. In fact, if you squint a little and think abstractly about the systems involved, recent trends in software architecture look strikingly similar to those in the renewable and distributed energy resource space.

Fig. 1. Centralized vs distributed architectures of power and data generation, transmission, storage, and consumption.

The last couple of decades saw similar arcs in the trajectories of SaaS-/IoT-era software stacks and renewable energy resources. Instead of relying on big, centralized resources, we’re seeing more use of distributed resources. Analogous to the onslaught of microservices and smart devices, the future of the energy grid lies in energy harvested from solar panels or freed up by demand response providers, and stored in batteries, cars, or even water heaters.

Defining reliability and resiliency

In my experience in the software industry, “resiliency” is one of those whizbang words that are fun to throw around but remain generally ill-defined. Often, executives use reliability and resiliency interchangeably to describe effort spent paying down technical debt.

What is said: "We're going to focus on reliability this quarter."
What is meant: "I'm getting flak from customers/investors about our app not working, so I want you to fix bugs, reduce latencies, and increase success rates, potentially at the cost of timely feature development."

By comparison, these terms are very precisely defined and measured in the electric power industry.

For the electric sector, reliability can be defined as the ability of the power system to deliver electricity in the quantity and with the quality demanded by users. (…) Reliability means that lights are always on in a consistent manner.

Aaron Clark-Ginsberg, What’s the Difference between Reliability and Resilience

In this light, reliability is binary along the time dimension: at any given moment, your thing either works under a given set of conditions or it doesn’t. These conditions are typically defined in a service-level agreement (SLA), which a service is charged with adhering to over time.

Fig. 2. Screenshot from Stripe’s system status dashboard, which is used to signal whether or not their systems are functioning properly.
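
To make the binary view of reliability concrete, here is a minimal Python sketch. It assumes a hypothetical SLA with just two terms (a latency ceiling and a target availability) and a list of periodic health checks; the names Sla, is_up, availability, and meets_sla are mine for illustration, not taken from any real SLA or monitoring tool. Each check is judged as simply up or down, and reliability is the fraction of checks that were up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sla:
    """Hypothetical SLA terms, invented for this illustration."""
    max_latency_ms: float       # slower responses count as "down"
    target_availability: float  # e.g. 0.999 means "up 99.9% of the time"


# One health check: (request succeeded?, observed latency in milliseconds)
Check = Tuple[bool, float]


def is_up(succeeded: bool, latency_ms: float, sla: Sla) -> bool:
    # Binary verdict: the service either met the conditions or it did not.
    return succeeded and latency_ms <= sla.max_latency_ms


def availability(checks: List[Check], sla: Sla) -> float:
    # Fraction of checks during which the service counted as "up".
    if not checks:
        raise ValueError("need at least one check")
    up = sum(1 for ok, latency in checks if is_up(ok, latency, sla))
    return up / len(checks)


def meets_sla(checks: List[Check], sla: Sla) -> bool:
    return availability(checks, sla) >= sla.target_availability
```

Note what this view leaves out: a request that succeeds slowly and a request that fails outright are both recorded as "down", with nothing in between.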

Resilience is more complicated.

Resilience, stemming from the root resilio, meaning to leap or spring back, is concerned with the ability of a system to recover and, in some cases, transform from adversity.

Clark-Ginsberg’s report goes on to say that “resilience operates from a systems perspective, understanding incidents as a complex process occurring at the intersection of natural and human forces across multiple scales, evolving and changing over time.”

Reliability and resiliency, while related, are fundamentally different attributes. Resilience involves the gray area of partial failure, as in the case of a rolling brownout or a broken widget on an otherwise functional web page. It implies thinking of a service as a system of constituent components, with too many moving parts to be reasonably characterized using a simple “does it work or not” rubric.
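
The broken-widget case can be sketched in a few lines of Python. This is an illustration under assumptions, not a recipe: the page is modeled as a dictionary of hypothetical widget render functions, and render_page and FALLBACK are names invented here. A failing widget is replaced with a placeholder instead of taking the whole page down, and the overall status has three values rather than two.

```python
from typing import Callable, Dict, List, Tuple

# Placeholder shown in place of a widget that failed to render (hypothetical).
FALLBACK = "[unavailable]"


def render_page(
    widgets: Dict[str, Callable[[], str]],
) -> Tuple[Dict[str, str], List[str], str]:
    """Render every widget, isolating failures so one cannot break the rest."""
    rendered: Dict[str, str] = {}
    failed: List[str] = []
    for name, render in widgets.items():
        try:
            rendered[name] = render()
        except Exception:
            # Contain the failure: substitute a placeholder and keep going.
            rendered[name] = FALLBACK
            failed.append(name)

    # Three-valued status: the gray area sits between the two extremes.
    if not failed:
        status = "operational"
    elif len(failed) < len(widgets):
        status = "degraded"
    else:
        status = "down"
    return rendered, failed, status
```

Calling render_page with one healthy widget and one that raises an exception returns a "degraded" status, a page that still contains the healthy widget, and a list naming the widget that needs attention.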

We aren’t so different, you and I

Given the nature of renewable energy and software systems, it’s no coincidence that both can be characterized as distributed systems or that both lend themselves to discussions of resiliency.

In both cases, intermittent and composable resources require thinking about a service as a distributed system. Part of distributed systems theory and practice is the notion that failure is inevitable, and thus the topic of being resilient to failure is paramount.

Thanks to Oren Schetrit and Berk Demir for reading and providing feedback on drafts of this essay.