In Praise of Control Planes, or: Why You Need a Place to Stand
A control plane is a coordinator. It’s the part of a system that decides what should happen, while other parts (the workers, the data plane, the things that do the actual work) carry those decisions out. Boss and workers. Conductor and orchestra. Thermostat and furnace. This is not a profound architectural insight. It is, frankly, one of the oldest ideas in engineering.
So why write a post about it?
Because I’ve been writing recently about CI orchestration and why bash isn’t a build system, and I kept reaching for the phrase “control plane” without ever explaining what I mean by it. More importantly, I think there’s value in naming a pattern this basic, because unnamed patterns are invisible patterns, and invisible patterns are the ones teams accidentally violate until something breaks.
A note on terminology: I’m using “control plane” more loosely than a networking engineer would. In networking, the term has a precise meaning (I’ll get to that). I’m using it as a label for the general boss/worker separation (deciding what should happen vs. making it happen) which shows up across many domains. Not everyone would use the phrase the way I do. That’s fine. The pattern is what matters.
The thing I want to flag before we go further: the opposite of a control plane is not simplicity. It is implicit control. When you don’t have a coordinator, the decisions still get made: by accident, by convention, by whoever happens to be awake. The coordination still happens. It’s just happening in ways you can’t see, can’t audit, and can’t change without touching everything.
The cybernetician W. Ross Ashby formalized this intuition in what he called the Law of Requisite Variety: a control system must be at least as complex as the system it’s controlling1. A thermostat can be simple because a room’s temperature is a single variable. A self-driving car needs a staggeringly complex controller because the road is a staggeringly complex environment. You cannot regulate a system with a regulator that is simpler than the system itself. “Only variety can absorb variety,” as Ashby put it. As Lorin Hochstein observed recently2, this means our engineering solutions to problems necessarily add complexity. The question is never “can we avoid the complexity?” but “where should it live?” A control plane is an answer to that question. It is additional complexity, deployed deliberately, in a place where it can do the most good.
What Is a Control Plane?
The term comes from networking, where it has a precise meaning: the control plane decides what to do with traffic, and the data plane moves the packets3. A router’s control plane builds the routing table; the data plane forwards packets according to that table. They are separate concerns, often running on separate hardware, and the entire modern internet depends on this separation working correctly.
But the pattern is far older and more general. Any time you separate deciding what should happen from making it happen, you have a control plane. The foreman on a construction site. Air traffic control. A restaurant’s ticket system. The pattern predates computers entirely. We just didn’t have a name for it until the networking people gave us one.
The defining characteristic: the control plane has a model of the whole system. It knows what’s happening everywhere (or tries to), makes decisions based on global state, and communicates those decisions to the components that carry them out. The data plane, by contrast, knows only about its own work. It does what it’s told. Local, fast, and dumb in the best possible way4. The intelligence lives above it.
When You Don’t Need One
Before I sell you on this, let me tell you when to ignore everything I’m about to say.
If you’re running a single service on a single box, you don’t need a control plane. You are the control plane. You SSH in, you check things, you restart processes, you deploy by running a script. This is fine. This is good, even. The overhead of a formal coordinator for a system that fits in one person’s head is real overhead with no corresponding benefit. If your deploy process is `ssh prod && git pull && systemctl restart myapp`, you should not be running Kubernetes. You should be enjoying your simple life.
The pattern becomes valuable when the system outgrows one person’s ability to hold it all in their head. When the number of components exceeds what you can monitor by checking each one. When the number of interactions exceeds what you can predict by thinking really hard. When the blast radius of a mistake exceeds what you can fix by SSHing into one box.
There’s also a long, uncomfortable middle period where the system is too big for ad-hoc management but too small to justify a full Kubernetes deployment, and this is where most teams live for longer than they’d like to admit. This is where lighter-weight control planes (Nomad, Consul, even a well-structured set of Terraform modules and a good dashboard) earn their keep. You don’t have to go from “I SSH into the box” to “I have a service mesh” in one step. The intermediate steps are valid and under-discussed.
That threshold (the one where you need something) is lower than most people think. But it’s not zero, and I won’t pretend it is.
The Separation You Already Depend On
You are already using control planes. You just might not have noticed.
DNS is a control plane for name resolution (or rather, the management layer above it is; the pedantic among you will note that DNS resolution is more “lookup” than “control,” and you’re right, but the management of what those lookups return is pure control plane). Your application says “I need to talk to api.example.com” and does not care which IP that resolves to, or whether the answer changed ten minutes ago because an engineer drained a data center on the other side of the planet.
One record change. Zero application changes. All traffic moves.
Your database’s query planner is an interesting case. It’s closer to a compiler than a traditional control plane (it makes a one-time decision per query rather than managing ongoing state) but it embodies the same separation. You write `SELECT * FROM orders WHERE customer_id = 42`, and the planner decides whether to use the index, do a sequential scan, or something involving a bitmap heap scan that you couldn’t explain if asked. You express intent. The planner decides execution.
Same query. The planner decides how.
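Here’s a toy of that decision in Haskell (my hammer of choice), just to make the shape visible. The plan names are real enough, but the selectivity thresholds are invented; a real planner works from table statistics and a cost model, not two magic numbers.

```haskell
-- A toy illustration of "you express intent, the planner decides execution."
-- The selectivity thresholds are made up; real planners use cost models
-- built from table statistics.
data Plan = IndexScan | BitmapHeapScan | SeqScan
  deriving Show

choosePlan :: Bool -> Double -> Plan
choosePlan hasIndex selectivity
  | not hasIndex        = SeqScan          -- no index: nothing to do but read the heap
  | selectivity < 0.01  = IndexScan        -- a handful of matching rows
  | selectivity < 0.20  = BitmapHeapScan   -- a moderate fraction of the table
  | otherwise           = SeqScan          -- cheaper to read everything
```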
Kubernetes is the most visible modern example5. You declare the desired state of your system (three replicas, a load balancer, a persistent volume) and Kubernetes reconciles reality with your declaration. If a node dies, it reschedules. If you change the desired state, it converges. You describe the world you want, and the machinery tries to make it so.
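The machinery is less mystical than it sounds. At its core is a reconciliation loop: observe, compare, act, repeat. Here’s a minimal sketch; `observe` and `scaleTo` are hypothetical stand-ins for the calls a real controller would make against the API server.

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (forever, when)

newtype Desired = Desired { desiredReplicas :: Int }

-- Observe actual state, compare it to desired state, act on the difference,
-- repeat forever. `observe` and `scaleTo` are placeholders for real API calls.
reconcileLoop :: Desired -> IO Int -> (Int -> IO ()) -> IO ()
reconcileLoop desired observe scaleTo = forever $ do
  actual <- observe
  let want = desiredReplicas desired
  when (actual /= want) (scaleTo want)
  threadDelay 5000000  -- re-check every five seconds
```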
A load balancer is a control plane for traffic distribution. Your clients don’t choose which backend to talk to. The load balancer chooses for them, based on health checks and capacity. When a backend dies, the failover is invisible. The complexity that would otherwise live in every client is absorbed into one place6.
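If you want to see how little the clients have to know, here’s a toy of the choice the load balancer makes for them. The types are invented; a real load balancer also weighs capacity, latency, and connection counts, but the point is that none of that logic lives in the client.

```haskell
-- A toy backend chooser: the client never makes this decision; the load
-- balancer does, using health checks it is already running. Types invented.
data Backend = Backend { address :: String, healthy :: Bool }

chooseBackend :: Int -> [Backend] -> Maybe Backend
chooseBackend counter backends =
  case filter healthy backends of
    []    -> Nothing                                       -- nothing healthy: fail fast
    alive -> Just (alive !! (counter `mod` length alive))  -- round-robin over healthy backends
```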
A service mesh (Istio, Linkerd, Consul Connect) is a control plane for service-to-service communication. It decides who can talk to whom, how retries work, where circuit breakers trip. The services just make HTTP calls. They don’t know about mTLS or rate limiting. The control plane knows. The sidecar proxy enforces.
One toggle. Four services. Zero code changes.
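As a sketch of one of those behaviors, here’s a toy circuit breaker of the kind a sidecar proxy runs on the service’s behalf. The thresholds are invented and real implementations have more states and more nuance, but the service code never sees any of it.

```haskell
-- Toy circuit breaker: after enough consecutive failures, stop forwarding
-- calls until a cool-down elapses. Thresholds are invented for illustration.
data Breaker = Closed Int   -- consecutive failures so far
             | Open Int     -- time (in seconds) at which we allow a trial call

maxFailures, coolDown :: Int
maxFailures = 5
coolDown    = 30

allowRequest :: Int -> Breaker -> Bool
allowRequest _   (Closed _) = True
allowRequest now (Open t)   = now >= t   -- half-open: let one trial through

-- Update the breaker from an observed call outcome (True = success).
observe :: Int -> Bool -> Breaker -> Breaker
observe _   True  (Closed _) = Closed 0
observe now False (Closed n)
  | n + 1 >= maxFailures     = Open (now + coolDown)
  | otherwise                = Closed (n + 1)
observe now ok (Open t)
  | now < t    = Open t                   -- still cooling down
  | ok         = Closed 0                 -- trial succeeded: close the breaker
  | otherwise  = Open (now + coolDown)    -- trial failed: stay open longer
```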
Durable execution frameworks (Temporal7, Restate, Azure Durable Functions) apply the pattern to code execution itself. Full disclosure: I wrote the Haskell SDK for Temporal8, so I’m biased.
The idea: your workflow code runs as normal functions, but the framework intercepts every side effect and records it to a durable event log. If the worker crashes, the framework replays the log on a new worker and resumes where you left off. Your code doesn’t know it crashed. The control plane manages execution state, retries, timeouts, and audit trails. You write straight-line code. The framework provides durability as a property of the environment, not something you bolt on with try/catch blocks and a dead letter queue.
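If “replay” sounds like magic, here’s the entire trick in toy form. This is emphatically not the hs-temporal-sdk API or Temporal’s actual protocol, just the shape of the idea: every side effect goes through a step function that either replays a recorded result from the log or runs the effect and records it.

```haskell
import Data.IORef

-- (history still to replay, events recorded so far)
type EventLog = IORef ([String], [String])

-- Replay-or-execute: if recorded history remains, consume it instead of
-- re-running the effect; otherwise run the effect and record its result.
step :: EventLog -> IO String -> IO String
step logRef effect = do
  (history, recorded) <- readIORef logRef
  case history of
    (prior : rest) -> do
      writeIORef logRef (rest, recorded)
      pure prior
    [] -> do
      result <- effect
      writeIORef logRef ([], recorded ++ [result])
      pure result

-- The workflow itself is straight-line code; durability is a property of the
-- environment it runs in. The activities here are hypothetical placeholders.
workflow :: EventLog -> IO String
workflow logRef = do
  orderId <- step logRef (pure "order-123")
  step logRef (pure ("charged:" ++ orderId))
```

Run the workflow against an empty log and it executes for real; run it against a recorded history and it fast-forwards to wherever it crashed.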
And, as I’ve been arguing for two posts now, a CI orchestrator is a control plane for build infrastructure. It decides what runs where, in what order, with what resources. The build scripts just run commands. The orchestrator handles scheduling, artifacts, caching, observability, failure recovery.
The Six Virtues
The reason I keep coming back to this pattern is that it solves the same six problems everywhere it appears. These aren’t incidental benefits. They’re structural consequences of the separation itself.
1. Observability
A control plane gives you a place to stand and look at the whole system.
Without one, understanding the state of your infrastructure requires interrogating each component individually. You SSH into boxes. You read log files. You piece together what happened from fragments, the way an archaeologist reconstructs a civilization from pottery shards and guesswork. At one point, roughly 40% of our incidents sent no alert at all. We discovered we were in an incident when vendors notified us, or when an engineer happened to check metrics and realized problems had existed for a while, or, worst case, when customers wrote in. One engineer reflected on an incident: “Should have been more obvious that we weren’t processing records for over 12 hours.” Some weren’t even incidents at all, but rather long-standing degraded states that no one noticed until someone happened to look. The information existed, technically: in logs, in metrics dashboards someone could have checked, in processes quietly failing. Assembling it into a coherent picture was itself the incident.
With a control plane, the global view is the default. You don’t go looking for this information. The control plane has it, because having it is its entire reason for existing. The answer to “what is currently deployed?” is one API call, not an archaeological expedition.
2. Policy Enforcement
A control plane gives you a place to say “no.”
Every organization has policies. “We don’t deploy on Fridays.” “All services must have health checks.” “Container images must come from the internal registry.” These are fine as sentences in a wiki. They are less fine as your enforcement mechanism, because a sentence in a wiki is a suggestion, and suggestions have a half-life that’s approximately “until the next person who hasn’t read the wiki joins the team.”
A control plane turns suggestions into gates. Kubernetes admission controllers reject pods that don’t meet standards. Your CI orchestrator refuses builds that reference unapproved images. The service mesh denies traffic between unauthorized services. Things that don’t pass the gate don’t happen, regardless of whether the person pushing the change has read the wiki.
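Concretely, a gate is just a check that runs on every request and can’t be skipped. Here’s a minimal sketch of the internal-registry policy; the types and the registry name are illustrative, not the real Kubernetes admission API.

```haskell
import Data.List (isPrefixOf)

newtype PodSpec = PodSpec { images :: [String] }

data Verdict = Admit | Deny String
  deriving Show

-- The policy lives in one place and is evaluated on every request: a wiki
-- page can be skipped, a gate cannot. Registry name is illustrative.
admitPod :: PodSpec -> Verdict
admitPod spec
  | all internal (images spec) = Admit
  | otherwise                  = Deny "images must come from registry.internal/"
  where
    internal img = "registry.internal/" `isPrefixOf` img
```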
3. Consistency
Go look at how your services handle retries. Really, go look. I’ll wait.
If you’re like most organizations I’ve worked with, you’ll find: exponential backoff in Service A (written by someone who read the Google SRE book), a fixed 500ms delay in Service B (written by someone in a hurry), and no retry logic at all in Service C (written by someone who was confident the network would never fail). Timeouts vary. Error handling varies. Logging formats vary. Each service is a sovereign nation with its own laws.
A control plane collapses this. The service mesh configures retries and timeouts once, in one place, for everything. The CI orchestrator sets resource limits. The load balancer defines health checks.
I’ll be honest about the limits: most organizations of any size are polyglot. You have services in Go and Python and TypeScript and maybe something in Rust that one person wrote and everyone is afraid to touch. “Consistency” here doesn’t mean “everything works identically.” It means the cross-cutting concerns (retries, timeouts, circuit breaking, observability) are applied at the infrastructure layer rather than reimplemented in each language’s idiom by each team. Lamport showed us decades ago that ordering and coordination in distributed systems is the hard part9; a control plane is where that hard part lives. The individual services still behave differently. But the operational envelope around them is the same, and that’s the consistency that matters at 3am.
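For a sense of what gets centralized, here’s roughly the policy a mesh or orchestrator applies once, on behalf of every service. It’s a sketch; the attempt count and base delay are illustrative defaults, not anyone’s production numbers.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)

-- Exponential backoff: wait base, 2*base, 4*base, ... between attempts.
-- Defined once at the infrastructure layer instead of once per service.
retryWithBackoff :: Int -> Int -> IO a -> IO (Either SomeException a)
retryWithBackoff maxAttempts baseDelayMicros action = go 1
  where
    go attempt = do
      result <- try action
      case result of
        Right a -> pure (Right a)
        Left err
          | attempt >= maxAttempts -> pure (Left err)
          | otherwise -> do
              threadDelay (baseDelayMicros * 2 ^ (attempt - 1))
              go (attempt + 1)
```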
4. Graceful Degradation
A control plane gives you a mechanism for handling failure that isn’t “everything breaks and nobody knows why.”
A node dies. Without a coordinator, the services on it simply stop existing. Dependent services fail, or hang, or retry forever on a timeout chosen by someone who is no longer at the company. The blast radius is unknown because nobody has a model of the dependency graph6.
With a control plane, the system responds. Kubernetes reschedules pods. The load balancer removes the unhealthy backend. The CI orchestrator retries on a different machine. The service mesh opens a circuit breaker. None of this requires a human to wake up. The system is not self-aware, but it is self-correcting, which for infrastructure purposes is close enough.
5. Failure Decoupling
This is subtler than graceful degradation, and I think it’s the virtue people understand least until they’ve lived through an incident without it.
The control plane and the data plane can fail independently. This goes in both directions.
If the data plane fails, the control plane is still running. It can observe the failure, route around it, reschedule work, page the humans. This is why Kubernetes can lose a node and recover without intervention: the API server and scheduler are still up, they notice, they act10.
The inverse is equally important. If the control plane goes down, the data plane keeps running. Your pods don’t stop serving traffic because the Kubernetes API server is unavailable. Your DNS records don’t evaporate because the management console is offline. The last known good configuration persists. The system is frozen (you can’t make changes or respond to new failures) but it’s frozen in a working state, which is different from “everything stopped.”
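The data plane’s side of that bargain is simple to sketch: keep polling the control plane for fresh configuration, but never let a failed poll stop you from serving with the configuration you already have. The function names here are hypothetical.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)
import Control.Monad (forever)
import Data.IORef

-- If fetching fresh configuration fails, fall back to the last configuration
-- that was fetched successfully. The data plane freezes; it does not stop.
dataPlane :: IO cfg -> (cfg -> IO ()) -> cfg -> IO ()
dataPlane fetchConfig serveWith initial = do
  current <- newIORef initial
  forever $ do
    refreshed <- try fetchConfig
    case refreshed of
      Right cfg                 -> writeIORef current cfg
      Left (_ :: SomeException) -> pure ()    -- control plane unreachable
    readIORef current >>= serveWith           -- traffic keeps flowing either way
    threadDelay 1000000
```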
Without this separation, failure is total. At Mercury, where I work, we had a vivid example: legacy tooling in our admin console for debugging and retrying failed queued jobs. The tools worked fine normally. The problem was that when a queued job took down the system (database overload, say), the admin console went down with it, because the console and the job execution shared the same infrastructure. You couldn’t log in. You couldn’t run the repair queries. The instrument you needed to diagnose and fix the failure was itself a casualty of the failure. The control plane and the data plane were the same plane, which meant they crashed together, which meant the thing you needed most during an incident was the thing the incident had taken from you.
The Admin Console Problem: try clicking “kill bad job” after the DB goes down. You can’t. That’s the problem.
The control plane pattern doesn’t eliminate failure. It partitions it. And partitioned failure is the difference between “we had a degraded hour” and “we had an outage,” which is the difference between a line in a weekly summary and a postmortem that spawns three follow-up postmortems, each with its own action items, none of which will be completed.
6. Evolvability
Here is a question that will tell you a lot about your infrastructure: how hard is it to change one thing?
Want to add mTLS between services? Without a coordinator, you touch every service. Want to change the retry policy? Touch every service. Want to migrate from one cloud provider to another? I am sorry. The migration will take years, because the provider-specific details are smeared across every service like butter on toast that has been dropped face-down.
The control plane is the seam. You change the coordinator; the workers don’t know and don’t care. Change your DNS provider without touching an application. Swap CI orchestrators without rewriting your build logic, because the build logic was never coupled to the orchestrator in the first place. Even something like switching service meshes, which I won’t pretend is trivial, becomes a problem scoped to the infrastructure layer rather than one that requires touching every service.
What You Pay For It
This is not free.
Complexity. Ashby’s Law cuts both ways. Yes, you need a controller as complex as the system you’re controlling, but that controller is itself a system you now have to operate. Service meshes add latency and failure modes. Even a “simple” CI orchestrator like Buildkite requires managing agents and debugging interactions between the control plane and your scripts. (Kubernetes gets a reputation for complexity that I think is increasingly undeserved if you’re on a managed offering; a single engineer can run EKS or GKE comfortably. The complexity discourse is largely a holdover from the era of self-hosted clusters.) Regardless: you have traded one kind of complexity (every component managing itself) for another kind (a centralized coordinator managing everything). You haven’t reduced the total variety in the system1. You’ve concentrated it in a place where it can be reasoned about. The argument is that concentrated complexity is more tractable than distributed complexity, but it is not zero.
The single point of failure. The data plane keeps running when the control plane goes down. This is true and important. But “frozen in the last known good state” is not the same as “healthy.” You can’t deploy. You can’t respond to new failures. The longer the control plane is down, the wider the gap between what the system is doing and what it should be doing10. This requires its own mitigations: HA deployments, graceful degradation of the control plane itself, runbooks for when it’s unavailable.
Indirection. The control plane adds a layer between you and what’s actually happening. When something goes wrong, you now need to understand both the control plane’s model of the world and reality, and figure out where they diverge. “Kubernetes thinks this pod is running but it’s actually in a crash loop” is a category of problem that doesn’t exist without Kubernetes. The abstraction helps, until it doesn’t, and when it doesn’t, you need to understand both layers.
The learning curve. Every control plane has its own concepts, vocabulary, and failure modes. Terraform’s state management. Buildkite’s dynamic pipeline generation. Even Kubernetes, genuinely simpler than its reputation suggests on managed providers, requires internalizing a new set of abstractions. The payoff comes later. The cost comes now. This is the fundamental tension of infrastructure investment.
The Lesson
Every mature infrastructure domain has independently converged on this pattern4. Networking did it with BGP and SDN11. Container orchestration did it with Kubernetes5. Service communication did it with service meshes. CI did it with build orchestrators. Database systems did it with query planners. Even hardware did it: your CPU’s out-of-order execution engine is a control plane for instruction scheduling12.
They all arrived at the same answer because they all faced the same problem: components that need coordination, where the coordination logic is too important to distribute across the components themselves. The components should do their work. Something else should decide what work to do, when, where, and with what resources.
The Build Systems à la Carte paper13 that I referenced in the previous post puts it precisely: “the part of the build system responsible for scheduling tasks in the dependency order (a ‘scheduler’) can be cleanly separated from the part responsible for deciding whether a key needs to be rebuilt (a ‘rebuilder’).” Every real build system is a specific combination of the two, and, crucially, “these choices turn out to be orthogonal.” The scheduler is the control plane. The rebuilder is local policy. Two independent axes of variation, cleanly separated. The paper’s authors weren’t trying to describe control planes. They were doing formal PL research. But the decomposition they found is the same one that shows up in networking, orchestration, databases, and CI.
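The separation is visible right in the paper’s types. Here’s a rough sketch of its Haskell formulation (simplified, not a verbatim quotation): a Rebuilder is per-key local policy, and a Scheduler is what turns that local policy into a whole build system.

```haskell
{-# LANGUAGE RankNTypes, ConstraintKinds #-}
import Control.Monad.State (MonadState)  -- from the mtl package

-- A task: given a way to fetch dependencies, produce a value.
newtype Task c k v = Task { run :: forall f. c f => (k -> f v) -> f v }

type Tasks c k v = k -> Maybe (Task c k v)

-- A store pairs persistent build information @i@ with the key/value mapping.
-- (The paper keeps the store abstract; this is a simplification.)
data Store i k v = Store { info :: i, values :: k -> v }

-- A complete build system.
type Build c i k v = Tasks c k v -> k -> Store i k v -> Store i k v

-- Local policy: is this key out of date, and how should it be rebuilt?
type Rebuilder c ir k v = k -> v -> Task c k v -> Task (MonadState ir) k v

-- Global coordination: given local policy, decide what to build and in what
-- order. The scheduler is the control plane; the rebuilder is local policy.
type Scheduler c i ir k v = Rebuilder c ir k v -> Build c i k v
```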
It doesn’t have to be a product. Some of the best control planes I’ve seen were built in-house by teams who understood their domain well enough to know what decisions needed to be centralized. The pattern is “have a thing that knows what’s happening across the whole system and makes decisions based on that knowledge.” If you have that, off-the-shelf or custom, you have a control plane. If you don’t, every component is making local decisions without global context, and the failure modes get harder to diagnose the longer you wait.
The next time you’re wiring up ad-hoc coordination between components (writing the retry logic, the health check, the deployment script that shells into three boxes in sequence), consider whether what you’re actually building is a control plane, just without the name. If it is, it might be worth treating it like one.
Footnotes
1. W. Ross Ashby. An Introduction to Cybernetics. Chapman & Hall, 1956. Chapter 11 introduces the Law of Requisite Variety: “only variety can absorb variety.” The implication for infrastructure: you cannot build a simple controller for a complex system. The controller must match the system’s complexity, which means the question is never whether to have complexity, but where to concentrate it.
2. Lorin Hochstein. “Ashby taught us we have to fight fire with fire.” Surfing Complexity, January 31, 2026. Available at https://surfingcomplexity.blog/2026/01/31/ashby-taught-us-we-have-to-fight-fire-with-fire/. Hochstein connects Ashby’s Law to David Wheeler’s observation about indirection: “We can solve any problem by introducing an extra level of indirection,” except for the problem of too many levels of indirection. Engineering solutions to problems necessarily add complexity. The question is where to deploy it.
3. The control plane / data plane distinction is formalized in RFC 6192. The IETF’s original framing: the control plane handles “the signaling and routing protocol machines” while the data plane handles “the forwarding of transit traffic.”
4. Joseph L. Hellerstein, Yixin Diao, Sujay Parekh, Dawn M. Tilbury. Feedback Control of Computing Systems. Wiley-IEEE Press, 2004. The reconciliation loop at the heart of every control plane (observe current state, compare to desired state, act) is a feedback control loop. This book provides the theoretical grounding.
5. Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes. “Borg, Omega, and Kubernetes.” ACM Queue, Vol. 14, No. 1, pp. 70-93 (2016). Available at https://queue.acm.org/detail.cfm?id=2898444. Describes the evolution of Google’s cluster management control planes across three generations.
6. Armando Fox, Eric Brewer. “Harvest, Yield, and Scalable Tolerant Systems.” Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), IEEE, 1999. Available at https://ieeexplore.ieee.org/document/798396. Introduces the harvest/yield framework for reasoning about graceful degradation: a control plane can choose to reduce completeness rather than refuse requests entirely.
7. Temporal is a durable execution platform that manages workflow state, retries, timeouts, and visibility. See the Temporal documentation.
8. The Haskell Temporal SDK is available at github.com/MercuryTechnologies/hs-temporal-sdk. I wrote it because I wanted durable execution with strong types and I was tired of pretending that stringly-typed workflow definitions were acceptable.
9. Leslie Lamport. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM, Vol. 21, No. 7, pp. 558-565 (1978). Available at https://dl.acm.org/doi/10.1145/359545.359563. The foundational paper on logical ordering in distributed systems. Relevant because coordination (the job of a control plane) is fundamentally about ordering.
10. Peter Bailis, Kyle Kingsbury. “The Network is Reliable.” ACM Queue, Vol. 12, No. 7, pp. 20-32 (2014). Available at https://queue.acm.org/detail.cfm?id=2655736. A catalog of real-world network partition events. Relevant context for why control planes must be designed to tolerate the data plane disappearing.
11. Software-Defined Networking made the control plane / data plane separation its organizing principle. The OpenFlow protocol (McKeown et al., “OpenFlow: Enabling Innovation in Campus Networks,” ACM SIGCOMM CCR, 2008) explicitly decouples the forwarding decision from the forwarding action.
12. Tomasulo’s algorithm (1967) implements out-of-order execution by separating the scheduling of instructions from the execution of instructions. Your CPU has been doing control plane separation since before most of us were born.
13. Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones. “Build Systems à la Carte.” Proc. ACM Program. Lang., Vol. 2, No. ICFP, Article 79 (September 2018). Available at https://doi.org/10.1145/3236774.