No, Really, Bash Is Not Enough: Why Large-Scale CI Needs an Orchestrator
In my previous post, I talked about why GitHub Actions is slowly hollowing out your engineering team, and I mentioned in passing that bash is not a build system. A number of people wrote in to disagree. Some were polite. Some were not. One person suggested I simply didn’t know how to write bash, which, fair, nobody really knows how to write bash[^1], we just accumulate coping mechanisms and call it expertise.
But the most common response was some variant of: “I’ve been running CI with a Makefile and some shell scripts for years. It works fine.”
I want to take this seriously, because it deserves to be taken seriously. Then I want to explain why it stops being true, and who this conversation is actually for.
A Note on Audience
I write for teams at a certain level of organizational maturity. If you’re a solo developer, or a team of three shipping a Rails CRUD app, your CI needs are simple and your Makefile is probably fine and I am not talking to you. Go outside. Touch grass. Revel in the lightness of your build system. I envy you.
I am talking to teams where CI is a load-bearing piece of infrastructure. Teams where 20 or 50 or 200 engineers push code daily. Teams where a broken CI pipeline doesn’t mean one person waits a few extra minutes; it means a queue of pull requests backs up, a deploy window gets missed, and product timelines slip. Teams where CI time is measured in engineering-hours-lost-per-week and has a line item on somebody’s OKRs.
If that’s not you, most of this post will feel like overkill, because it is. But if it is you, and you’re still running a pile of bash scripts held together with && and good intentions, I’d like to have a conversation.
A few common responses to the previous post deserve direct engagement, so let me address them before we go further.
“Just use a Makefile and shell scripts. Keep it simple.” Yes. For simple things. The whole point of this post is about what happens when your needs outgrow “simple,” which they will if your organization is growing, and which may have already happened if you’re being honest with yourself. “Keep it simple” is excellent advice right up until the thing you’re doing isn’t simple anymore, at which point it becomes “keep it simple by pretending complexity doesn’t exist,” which is a different and worse kind of advice.
“If your CI is anything more than running a script, you’re overcomplicating it.” This is true for small projects and flatly wrong for large ones. A CI pipeline that builds Docker images, runs tests across multiple services, generates coverage reports, validates infrastructure-as-code, and deploys to staging is not “overcomplicated.” It’s the actual job. Telling a team with a monorepo and 15 microservices to “just run a script” is like telling someone with a warehouse to “just use a closet.”
“I orchestrated a move from CircleCI to GitHub Actions precisely because we couldn’t express a performant pipeline in their model.” Fair! Genuinely. CircleCI 2.0’s model had real limitations, and I’m not going to pretend otherwise (I was there for v1, and the v1-to-v2 migration was rough). But the argument in this post isn’t “use CircleCI.” It’s that you need some orchestrator, and “bash scripts” is not one.
“The log viewer? So what, just download the file.” I mean, sure. You can also debug your code by reading a hex dump of the binary. The question is whether you should have to, and what it says about the tool that you do. “Just download the file” is not a defense of the tool. It’s a coping mechanism described in the language of a feature.
“GitHub Actions is pretty good in the ways that truly matter.” I have enormous respect for this take because it’s measured and reasonable and comes from real experience. I think we disagree mostly about where the threshold of “truly matters” is. If your team is 5-10 people with a straightforward build, I probably agree with you. If your team is larger, or your build is complex, I think the things that “don’t truly matter” start to compound in ways that are hard to see until you’re drowning in them.
Now. Let me tell you about bash.
Part I: The Concession
Bash Is, in Fact, Good at Some Things
Let me be honest about what bash gets right, because if I’m not, the rest of this post is just a straw man wearing a #!/bin/bash shebang, and you’ll stop reading, and you should.
Bash is universal. It is on every Linux box, every Mac, every CI runner, every Docker image that isn’t built from scratch. It requires no installation, no runtime, no package manager. It is already there, the way gravity is already there. You don’t have to decide to use bash. You have to decide not to.
Bash is excellent for linear workflows. “Check out code, install dependencies, run tests, report results.” Four commands. Maybe ten lines with some error handling. This is what bash was born to do, and it does it well, and if this is all your CI needs to do, you should use bash and you should stop reading this post and go do something fulfilling with your afternoon.
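To be concrete, this is the version of CI I have no argument with. The commands are stand-ins for whatever your project actually runs; the point is the shape, and the fact that the same file behaves the same on a laptop as on a runner:

```bash
#!/usr/bin/env bash
# ci.sh -- the linear pipeline described above. The npm commands are
# placeholders; checkout is assumed to have happened already (by you,
# or by whatever thin CI wrapper invokes this script).
set -euo pipefail

echo "--- Installing dependencies"
npm ci

echo "--- Running tests"
npm test

echo "--- Reporting"
./report-results.sh
```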
Bash is testable locally. You can run ./ci.sh on your laptop and get roughly the same result as you’d get in CI. This is a genuine, meaningful advantage over CI systems where the pipeline definition is entangled with the platform to the point where local execution requires an emulator, a prayer, and a Docker-in-Docker setup that makes your laptop fan sound like a jet engine preparing for takeoff. Several commenters rightly pointed out that making your CI “just call a script” keeps you CI-agnostic: you can migrate between systems by changing the thin wrapper, not the logic. This is correct and good advice. The Makefile-as-entrypoint pattern is sound. I’m not arguing against it. I’m arguing about what happens inside that entrypoint when the demands on it grow beyond “run these commands in order.”
Bash is also legible to almost every engineer alive. You don’t need to learn a DSL. You don’t need to understand a plugin system. The .sh file is right there and it does what it says, top to bottom, and when it breaks, the error is usually within ten lines of where you’re looking. This is no small thing. Many CI systems have failed precisely because they asked too much of people who just wanted to run their tests and go home.
I grant all of this freely. Bash is a fine tool. The question is not whether bash is good. The question is whether bash is sufficient, and the answer depends entirely on what you’re asking it to do. At a certain scale, you are asking it to be something it was never designed to be, and it will comply, because bash always complies, and that compliance is exactly the problem.
Part II: The Reckoning
The Memory of CircleCI v1
I was an early employee at CircleCI. In the first version of the product, we did something that seems obvious in retrospect but was, at the time, a small act of mercy: we would automatically detect Ruby test suites, split them by timing data, and distribute them across multiple concurrent agents[^2].
You didn’t configure this. You didn’t write a script to do it. You pushed your Rails app with its spec/ directory, and CircleCI would look at it and say “ah, I know what this is,” and then it would divide your test suite across however many containers your plan afforded you, weighted by historical execution time so the splits were roughly even. Your 20-minute test suite ran in 5 minutes. You didn’t have to think about it. Thinking about it was our job. We had thought about it extensively. We had the thousand-yard stares to prove it.
One commenter noted that CI tools going from specialist to generalist was a good thing, that encoding knowledge of the web framework into the CI system was a dead end. They’re right that the industry moved that direction, and I understand why. But I think something was lost in the transition. The specialist model had a virtue: it could do things for you that the generalist model requires you to do yourself. The generalist model is more flexible, but it shifts the burden of orchestration onto the user. And many users respond to that burden by reaching for bash.
Try doing what CircleCI v1 did in bash.
No, really. Think about what’s involved. You need to collect timing data from previous runs. You need to store it somewhere durable. You need a splitting algorithm that accounts for the fact that some tests take 30 seconds and some take 30 milliseconds. You need to distribute the resulting test groups across multiple execution environments. You need to collect results from all of them. You need to determine overall pass/fail status from the union of those results. You need to handle the case where one of the agents dies mid-run. You need to surface which specific tests failed across which specific agents, in a way that a human being can look at and understand, ideally without developing a migraine.
That’s not a bash script. That’s a distributed system. And if you build it in bash, you have built a distributed system in a language that doesn’t have data structures, doesn’t have a type system, doesn’t have error handling beyond set -e (which is itself a carnival of surprising behavior[^1]), and considers the string “0” to be a success and everything else to be a philosophical question. You have summoned something. It will not fit neatly back into its container. It has opinions now.
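To make the gap concrete, here is a sketch of only the easiest item on that list, the splitting itself. It assumes a timings.txt of tab-separated durations and test paths that something else has already collected, stored somewhere durable, and kept fresh; the file name and format are my own invention for illustration.

```bash
#!/usr/bin/env bash
# Greedy longest-first splitting of test files into N buckets by historical
# timing. Assumes timings.txt exists with lines of "<seconds>\t<test path>",
# which means something else (not shown) already collects per-test timing,
# stores it durably, and keeps it fresh. Distribution to agents, result
# collection, dead-agent handling: also not shown, because this is bash.
set -euo pipefail

buckets="${1:?usage: split.sh <number of buckets>}"
declare -a totals files
for ((i = 0; i < buckets; i++)); do totals[i]=0; files[i]=""; done

# Longest tests first, each assigned to the currently lightest bucket.
while IFS=$'\t' read -r secs path; do
  lightest=0
  for ((i = 1; i < buckets; i++)); do
    if (( $(echo "${totals[i]} < ${totals[lightest]}" | bc -l) )); then
      lightest=$i
    fi
  done
  totals[lightest]=$(echo "${totals[lightest]} + $secs" | bc -l)
  files[lightest]+="$path"$'\n'
done < <(sort -t$'\t' -k1,1 -rn timings.txt)

for ((i = 0; i < buckets; i++)); do
  printf '%s' "${files[i]}" > "bucket-$i.txt"
done
```

That was the fun part. Every other item on the list is the part the comments wave away.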
The Combinatorial Sprawl
Here is a partial list of things a mature engineering organization might need from its CI on a single pull request:
- Run the backend test suite
- Run the frontend test suite
- Run integration tests against a spun-up stack of services
- Build the production Docker image
- Build a profiling-instrumented variant of the same image
- Generate code coverage reports and enforce coverage thresholds
- Build the JavaScript frontend with production optimizations
- Build the JavaScript frontend again with source maps for error tracking
- Run static analysis and linting
- Check for dependency vulnerabilities
- Generate API documentation from source
- Run database migration checks against a test database
- Validate Kubernetes manifests or Terraform plans
- Build platform-specific binaries (linux/amd64, linux/arm64, darwin/amd64)
- Run performance regression tests against known benchmarks
This is not hypothetical. This is Tuesday at a mid-stage startup with 40 engineers. At a larger company, triple it. At a company with a monorepo, add “figure out which of the above to even bother running based on what changed,” which is itself a research problem.
Now. Some of these can run in parallel. Some must run sequentially (you can’t run integration tests until the Docker image is built). Some share dependencies (the frontend build produces artifacts the integration tests need). Some are conditional (only run the Terraform validation if infra/ files changed). Some need different machine sizes (the performance benchmarks need a large instance with no noisy neighbors; the linting can run on a potato).
In a proper CI system with an orchestrator, you express this as a pipeline: a directed acyclic graph of steps with dependencies, resource requirements, and conditional execution. The orchestrator handles scheduling, parallelism, artifact passing, and failure propagation. It knows that if the Docker build fails, there’s no point running the integration tests that depend on it. It knows that the linting job and the unit tests can run simultaneously. It knows that the performance benchmarks need to be queued for a specific agent pool. It knows these things because you told it, in a language designed for telling it, and it believed you, and it acted accordingly.
In bash, you express this as… what, exactly?
A series of & operators and wait calls? A hand-rolled dependency graph using file-based locks? A script that shells out to other scripts that shell out to other scripts, each one a small prayer that the previous one left the filesystem in the expected state?
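For the record, the first of those options looks roughly like this; the script names are invented, and the shape is the important part:

```bash
set -uo pipefail

declare -A pids
./build-image.sh    & pids[build-image]=$!
./run-unit-tests.sh & pids[unit-tests]=$!
./run-lint.sh       & pids[lint]=$!

failed=0
for job in "${!pids[@]}"; do
  if ! wait "${pids[$job]}"; then
    echo "FAILED: $job" >&2
    failed=1
  fi
done

# Integration tests depend on the image, so they wait for everything above --
# which also serializes them behind jobs they don't actually depend on. There
# is no artifact passing, no isolation, no resource limits, no retries, and no
# visibility beyond whatever lands in this interleaved log.
if (( failed == 0 )); then
  ./run-integration-tests.sh || failed=1
fi
exit "$failed"
```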
I have seen all of these. Each one was written by someone who was solving a real problem, and each one became the problem. They all start the same way: “it’s just a few parallel jobs.” They all end the same way: someone is paging through 1200 lines of shell at 11pm trying to understand why the coverage report job ate the integration test’s Docker socket. The scripts did not plan this. They are not malevolent. They simply have no mechanism by which to prevent it. They are innocent in the way that a river is innocent when it floods your basement. The river was just doing river things. Your basement was in the way.
What the Research Actually Says
There is, it turns out, a rigorous academic treatment of this question. In 2018, Mokhov, Mitchell, and Peyton Jones published “Build Systems à la Carte”[^3], a paper that decomposes every build system into two orthogonal choices: a scheduler (what order to build things in) and a rebuilder (how to decide whether something needs rebuilding). The paper opens with the observation that build systems are “awesome, terrifying, and unloved,” which is a phrase I wish I had written. They then proceed to formalize every build system you’ve ever used (and several you haven’t) as a point in a two-dimensional design space.
The scheduler axis has three options: topological (compute the full dependency order statically, like Make), restarting (start building in some order and abort when you hit an unbuilt dependency, like Excel’s recalculation engine and Bazel), and suspending (recursively build dependencies on demand, like Shake). The rebuilder axis has four: dirty bits (like Make’s file modification times), verifying traces (record hashes of dependencies and check whether they’ve changed, like Shake), constructive traces (store the actual build results keyed by dependency hashes, enabling cloud caching, like Bazel), and deep constructive traces (like Nix, which hashes only terminal inputs).
This gives you a 3×4 matrix. Eight of the twelve cells are occupied by existing, real-world build systems. The remaining four are all viable designs, and the paper constructs one of them (Cloud Shake, combining suspending scheduling with constructive traces) to demonstrate the framework’s predictive power.
What does this have to do with bash?
The paper also defines the simplest possible correct build system, which they call busy. It rebuilds everything from scratch every time. It has no memory between builds, no concept of what changed, and no mechanism to avoid redundant work. The authors note that it “will therefore busily recompute the same keys again and again if they have multiple dependents.” This is, almost exactly, what a bash CI script does. Your bash script is busy. It has achieved the theoretical minimum of build system sophistication.
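In code, the distance between busy and the next rung up looks like this; the scripts and paths are hypothetical, and -nt is bash’s “file is newer than” test:

```bash
# The busy cell: no memory of the previous run, so everything reruns, always.
./generate-protos.sh
./compile.sh
./run-tests.sh

# The smallest step up: a hand-rolled dirty bit on modification times, which
# is the rebuilder Make gives you for free, per target, with the dependency
# graph attached.
if [ ! -f build/api.pb.go ] || [ proto/api.proto -nt build/api.pb.go ]; then
  ./generate-protos.sh
fi
```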
Every improvement over busy requires choosing a scheduler and a rebuilder, and those choices have consequences. If you want to avoid rebuilding things that haven’t changed (the paper calls this minimality: “a build system is minimal if it executes tasks at most once per build and only if they transitively depend on inputs that changed since the previous build”), you need persistent build information. If you want to handle dynamic dependencies (where the set of things a step depends on isn’t known until you start running it), you need either a restarting or suspending scheduler. If you want to share build results across machines (cloud caching), you need constructive traces.
None of these are things you add to a bash script. They are architectural decisions with formal properties and correctness constraints. The paper gives a precise definition of what it means for a build system to be correct: after a build completes, recomputing any task using the final store values must produce the same result as what’s already stored. Bash scripts have no mechanism to verify or enforce this property. They simply run commands and hope.
Now, in fairness to the bash partisans, the paper also provides ammunition for the other side. It demonstrates that the simplest entry in the taxonomy, Make (topological scheduling + dirty bits), is both minimal and correct with very little machinery. If your dependencies are static and your notion of “changed” is file modification times, Make is optimal. A Makefile is a build system in the formal sense: it has a scheduler, a rebuilder, and persistent build information, and it satisfies the correctness definition. So when someone says “just use Make,” they are not wrong, exactly; they are proposing a specific and legitimate point in the design space. The paper simply makes clear what they’re giving up: dynamic dependencies, early cutoff (skipping downstream rebuilds when a rebuild produces the same output), cloud caching, and everything else that lives in the cells Make doesn’t occupy.
The real takeaway is not that bash is bad. It’s that the design space of build systems has structure, and that structure has been studied, and that the properties you care about (minimality, correctness, support for dynamic dependencies, cloud caching, early cutoff) correspond to specific architectural choices that live at a level of abstraction bash cannot express. When you write a build pipeline in bash, you are either implementing one of the twelve cells in the Mokhov-Mitchell-Jones matrix (poorly, by hand, with strings and exit codes), or you are living in the busy cell and rebuilding everything every time. There is no third option. There has never been a third option. The paper, if anything, is too kind about this.
The Isolation Problem, or: Why Your Builds Haunt Each Other
Here is the thing about running multiple jobs on the same machine without an orchestrator: they share everything.
They share the filesystem. They share the process table. They share the network ports. They share /tmp. They share the Docker daemon and its build cache. They share the DNS resolver and the loopback interface and the kernel’s entropy pool and every other ambient resource that you don’t think about until two things try to use it at the same time.
This creates a category of failures that I think of as spectral contamination. Build A runs and, as a side effect, writes a file to a shared cache directory, or starts a service on port 5432, or leaves a Docker container running that’s bound to a port the next build needs. Build B runs next, or concurrently, and it encounters these residues from a different context. Its tests pass when they should fail, or fail when they should pass, or pass but produce subtly wrong output that won’t be detected until production.
These failures are not reproducible. They depend on timing. They depend on which other builds happened to be running on the same machine at the same moment. They appear and disappear like condensation on a cold window. You cannot debug them by running the build again, because running the build again produces different timing, different cohabitation, a different séance.
And here’s the thing that really curdles the blood: it’s not just your scripts that need to be well-behaved.
Every tool in your build pipeline (every third-party dependency, every CLI you invoke) has to operate hermetically. Your linter needs to not write to a global cache that another concurrent linting run is also writing to. Your test database driver needs to not assume it’s the only thing talking to Postgres. Your package manager needs to not corrupt its state when two instances run simultaneously. And then there’s Docker, whose build cache is a particular delight under concurrent access, with well-documented issues ranging from disk space leaks that survive docker system prune[^4] to locked cache references that fail builds nondeterministically[^5] to outright BuildKit crashes from concurrent map writes[^6].
You do not control these tools. You did not write them. Their authors may or may not have considered the case where two copies of their software run concurrently on the same filesystem with the same user. Many of them did not, because it’s a weird thing to do, except in CI, where it’s the only thing anyone does. npm’s cache, for instance, is built on a library that claims to be “lockless, high-concurrency” and “immune to corruption”[^7], and yet concurrent npm install processes sharing a cache directory produce EINTEGRITY errors and install failures in practice[^8][^9]. Yarn Classic doesn’t even pretend: it provides a --mutex flag specifically because concurrent installs are unsafe without explicit serialization[^10].
An orchestrator solves this by not asking the question. Each job gets its own container, its own filesystem, its own network namespace. It doesn’t matter if npm install isn’t concurrency-safe, because there is no concurrency. There is one npm install, in one container, alone with its thoughts and its node_modules. The tool doesn’t need to be hermetic because the environment is hermetic. You have moved the responsibility from “every tool in the ecosystem must be well-behaved” to “the orchestrator must provide isolation,” and unlike the ecosystem, the orchestrator is one piece of software that you can evaluate and trust.
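If you want a picture of what the orchestrator is doing on your behalf for each job, it is roughly this, minus the scheduling, queueing, artifact handling, log capture, and cleanup-when-the-machine-dies; the image, repository variable, and commands are stand-ins:

```bash
#!/usr/bin/env bash
# One job, one disposable universe. REPO_URL, the node:20 image, and the
# npm commands are placeholders.
set -euo pipefail

job_workdir="$(mktemp -d)"
trap 'rm -rf "$job_workdir"' EXIT

git clone --depth 1 "$REPO_URL" "$job_workdir"

# --rm: the container's filesystem vanishes on exit. The container also gets
# its own process table and network namespace by default, so its Postgres on
# 5432 is not your Postgres on 5432. --memory/--cpus scope resource use.
docker run --rm \
  --memory 4g --cpus 2 \
  -v "$job_workdir":/work -w /work \
  node:20 \
  bash -c 'npm ci && npm test'
```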
Bash provides none of this. Bash doesn’t even know that isolation is a concept. Bash is a hallway with no doors, and every process walks through it freely, and they all leave things on the floor, and eventually someone trips over something that was left there by a process that finished twenty minutes ago and has already gone home.
The Language Safety Analogy
There’s a useful comparison here, and I want to draw it carefully because analogies about programming languages tend to generate more heat than light. But I think there are actually three relevant points on the spectrum, not two.
Writing CI in bash is like writing systems software in C. It is powerful. It is flexible. It gives you complete control. And it will let you do anything, including and especially things you should not do, and it will not warn you, and it will not stop you, and when the undefined behavior manifests at 3am on a Saturday, it will not explain itself. It will simply have done the wrong thing, silently, and moved on with its life.
A proper CI orchestrator is, I think, best compared not to Rust but to a language with a managed runtime, something like Go or Java or C#. The orchestrator is the runtime system. It manages resources on your behalf. It allocates and deallocates execution environments. It handles the lifecycle of jobs the way a GC handles the lifecycle of objects: you create them, you use them, and when they’re done, the runtime cleans them up, and you don’t have to think about what “cleaning up” entails. You don’t have to remember to free your /tmp directory. You don’t have to remember to kill your background processes. The runtime does it, because that’s what a runtime is for.
You can still write bugs. The runtime doesn’t prevent logic errors. But it prevents resource management errors: the use-after-free, the double-free, the leaked file descriptor. In CI terms: the orphaned Postgres, the stale Docker container, the cache directory that belongs to a build that finished an hour ago but whose residue still haunts the machine.
There’s a Rust analogy too, if you want it. Some orchestrators go further and provide compile-time-ish guarantees: the DAG won’t let you reference an artifact that hasn’t been produced by a prior step. The schema validation catches your malformed pipeline definition before it runs. This is the borrow checker of CI: annoying when it stops you, invaluable when it stops you from doing something you didn’t realize was wrong.
In C, you can write correct code. Millions of lines of correct C exist in the world. But the correctness depends entirely on the discipline of the programmer, and the correctness of every library the programmer links against, and nobody ever makes a mistake, forever. This is the world bash CI lives in. Not because bash programmers are bad, but because bash, like C, is a language that has chosen not to help you with certain categories of problems, and those categories turn out to be exactly the ones that matter most when you’re running builds at scale.
“Just Have the LLM Write It”
This was my favorite response. Someone suggested that the whole “bash doesn’t scale” argument is obsolete now because you can just ask an LLM to write your CI scripts. Claude is good at bash. ChatGPT is good at bash. So even if you can’t write the elaborate parallelized resource-aware hermetic build script, the machine can, and therefore the problem is solved.
I find this argument fascinating in the way I find a beautifully optimized function that solves the wrong problem fascinating. Every part of it is technically correct and none of it helps.
Yes, LLMs can write fluent bash. They can write better bash than most humans, in the narrow sense that they remember to quote their variables and use [[ ]] instead of [ ] and set their error flags properly. If the problem were “I don’t know bash syntax well enough,” then LLMs would indeed be the solution. But that was never the problem.
Now, can you cobble together something that sort of works? Sure. Linux gives you the pieces. You can write bash that invokes unshare for namespace isolation. You can configure cgroups for memory limits. You can SSH into a fleet of dumb boxes and scatter jobs across them with parallel or a for loop. You can duct-tape Docker and tmux and jq and curl into something that, if you squint, looks like an orchestrator. An LLM could probably generate all of this for you in an afternoon, and it would probably even work, for a while, on a good day, when the wind is right.
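And to be concrete about what “the pieces” means, the commands below are real and do roughly what that paragraph says. Everything that would make them a CI system is not shown, because it does not fit in a snippet, which is rather the point; the job script and host list are hypothetical:

```bash
# One job in fresh mount, PID, UTS, and network namespaces. (There is now no
# usable network inside the namespace; wiring up an interface is your problem,
# which is roughly where the afternoon goes.)
sudo unshare --fork --pid --mount-proc --mount --uts --net ./run-one-job.sh

# "Scatter jobs across a fleet of dumb boxes": distribute three jobs over the
# hosts in hosts.txt, assuming run-one-job.sh already exists on each of them.
parallel --sshloginfile hosts.txt ./run-one-job.sh ::: job-a job-b job-c
```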
But what you’ve built is a bespoke CI system. You’ve assembled it from parts, the way you might assemble a car from parts. And the question is not “can a talented person (or machine) assemble a car from parts?” The question is “should you?” when there are teams of engineers (at Buildkite, at GitLab, at Dagger, at a dozen other companies) who spend 40 hours a week, every week, thinking about how to make CI reliable, observable, and fast. Those teams have encountered the race conditions you haven’t hit yet. They’ve handled the edge cases your afternoon prototype doesn’t know exist. They’ve built the dashboards, the retry logic, the graceful draining, the artifact management, the cgroup integration, the log aggregation. Not because any one of those features is hard in isolation, but because making them all work together, reliably, at scale, across thousands of different workloads, is the kind of problem that takes years of accumulated institutional knowledge to get right.
An LLM can generate text that invokes unshare(2). It cannot give you the five years of operational experience that teaches you when namespace isolation isn’t enough and you need a full VM boundary instead. It can generate a cgroup configuration. It cannot give you the battle-tested heuristic for what memory limit to set when the job’s peak usage varies by 3x depending on the test suite’s random seed. These are not text problems. They are systems problems, and systems problems are solved by systems, not by scripts, no matter how eloquently the scripts are written.
There’s a deeper issue, too. The LLM doesn’t know what else is running on your machine. It doesn’t know that your frontend build is going to start a dev server on port 3000 at the same time your integration tests expect port 3000 to be available. It doesn’t know that your two parallel docker build invocations are going to fight over the build cache[^4]. It can’t know these things, because they are emergent properties of the runtime environment, not properties of the script itself. The script is correct in isolation. The script is beautiful in isolation. The script will still break when it meets another copy of itself running concurrently, because the failure mode is not in the text. It’s in the world. It’s in the filesystem and the process table and the port space, and no amount of better text will fix a problem that exists outside the text.
This is, in some sense, the strongest version of my entire argument. If a literally superhuman bash author still can’t solve the problem (not because it can’t write the syntax, but because the problem isn’t syntactic), then the answer is not “write better bash.” The answer is “use a system designed for this.” You would not hand-roll your own database because an LLM can write fluent SQL. You would not build your own load balancer because an LLM can write fluent nginx configs. The tool exists. People have built it. The hard-won knowledge of how CI fails at scale is encoded in the tool. Use the tool.
OOM and the Domino Theory
A test suite consumes too much memory. On a developer’s laptop, this means the test runner gets killed by the OOM killer, the developer swears, and they fix the leak or add more RAM. The blast radius is one person’s afternoon. They may have lost a browser tab. They will recover.
In a CI system without isolation, the blast radius is everything.
The Linux OOM killer does not simply kill the process that caused the memory exhaustion. It scans all running processes, computes a “badness score” for each based on memory footprint and other heuristics, and kills whichever process scores highest[^11]. That process may have nothing to do with the one that actually ate all the memory. The OOM killer has a documented history of taking out SSH daemons (locking operators out of the machine during the very incident they need to investigate), database processes, and web servers, simply because they had large memory footprints, not because they caused the problem[^12][^13]. As one kernel developer put it: “the kernel can never know which processes are important.”[^14]
On a shared CI machine, this means a single memory-hungry test can kill processes belonging to entirely unrelated builds. The OOM killer doesn’t know or care that the Postgres process it just terminated belongs to a different team’s integration test suite. It saw a large process and it acted. The collateral damage is silent and immediate: builds fail with inexplicable errors, or worse, proceed with corrupted state because a dependency they were relying on vanished mid-execution.
With an orchestrator, each job runs in its own cgroup with memory limits. When a job exceeds its allocation, the OOM kill is scoped to that cgroup and cannot affect processes outside it[^15]. The orchestrator reports a clean failure: “Job X was killed for exceeding its memory limit.” You know what happened. You know where to look. The failure is local and comprehensible. You fix the test. You move on.
In bash, on a shared machine, the OOM kill manifests as a process that simply stops existing, mid-output, with no explanation beyond a terse Killed in the terminal (if you’re lucky) and total silence (if you’re not). The other jobs on the machine may also be dead. Or alive but corrupted, running in an environment where shared resources have been partially reclaimed out from under them. You won’t know until you look, and by the time you look, more jobs have started, and the contamination has spread. The incident has developed a history. The history has footnotes. The footnotes have footnotes.
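The scoping itself is not exotic, for what it’s worth; it is reachable from a shell, and it is roughly what an orchestrator’s runner arranges for every job. A sketch, with a placeholder test script and some hand-waving about privileges:

```bash
# Run the suite in a transient, cgroup-backed scope with a hard memory cap.
# If it exceeds 2G, the OOM kill is confined to this scope's cgroup; processes
# outside it are untouched, and the kill is attributable to this unit.
# (Shown via the system manager for simplicity; privileges and user-manager
# delegation details vary by distro.)
sudo systemd-run --scope -p MemoryMax=2G -p MemorySwapMax=0 ./run-test-suite.sh
```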
The Port Problem, and Other Shared Resources
Ports are a finite shared resource on a machine. Your test suite needs Postgres on 5432. Your other test suite also needs Postgres on 5432. In isolation, this is fine; each gets its own network namespace, its own port space. Without isolation, you have a race condition, and the resolution of that race condition is that whoever starts second gets a cryptic “address already in use” error and fails.
The same applies to:
- Temp directories. Two builds write to /tmp/build-output. One overwrites the other’s artifacts. Neither knows. Both proceed confidently into the future with the wrong files.
- Docker images and containers. Two builds run docker-compose up with the same project name. Docker will create multiple network instances with the same name instead of sharing one, leaving the services unable to communicate[^16]. The containers have formed a social bond that exists entirely outside the awareness of the humans who created them, except the bond is broken, and neither side knows why.
- File locks. One build acquires a lock on a shared resource. The other build waits, or times out, or ignores the lock (because who puts proper file locking in a bash CI script? Be honest.)
- GPU devices. If you’re doing ML work, GPU access is zero-sum. Without an orchestrator allocating GPUs to jobs, two builds will fight over /dev/nvidia0 like seagulls over a chip, and with roughly the same dignity.
- Package manager caches. Two concurrent npm install runs sharing a cache directory will produce EINTEGRITY errors[^8]. Yarn Classic will deadlock unless you pass --mutex[^10]. The cache, like many things in this field, was designed for a world where one thing happens at a time.
Each of these is solvable in isolation. You can use random ports, unique temp directories, namespaced Docker projects. And a disciplined team will do all of these things. But here’s the rub: it’s not just your team’s discipline that matters. Every third-party tool in your pipeline also has to be well-behaved. The Docker CLI has to not corrupt its own state when invoked in parallel (it will, under certain conditions[^4][^5]). The discipline has to be perfect, across every tool, written by every author, in every version, forever. The orchestrator doesn’t require this. The orchestrator gives each job its own universe and the question of hermetic behavior never arises, because there is nothing to be non-hermetic with.
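Spelled out, that per-job discipline looks something like this, in every build script, on every machine, forever; the identifiers and the compose setup are hypothetical:

```bash
#!/usr/bin/env bash
set -euo pipefail

job_id="${CI_JOB_ID:-$$-$RANDOM}"

# Unique workspace, and a trap that also tears down our compose project --
# assuming nothing kills this shell hard enough to skip the trap.
workdir="$(mktemp -d "/tmp/build-${job_id}-XXXXXX")"
export COMPOSE_PROJECT_NAME="ci-${job_id}"
trap 'docker-compose down -v >/dev/null 2>&1 || true; rm -rf "$workdir"' EXIT

# A port that is probably free (and still racy between picking it and binding
# it), assuming the compose file reads PGPORT for its published port.
export PGPORT="$(shuf -i 20000-60000 -n 1)"

# Per-job package manager cache, so concurrent npm runs stop sharing one.
export npm_config_cache="$workdir/npm-cache"

docker-compose up -d
./run-integration-tests.sh
```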
The Control Plane You Didn’t Know You Needed
When I say “orchestrator,” I’m not talking about something exotic. I’m not asking you to deploy Kubernetes for your CI. I’m talking about the thing that answers basic operational questions:
- What is running right now?
- What ran in the last hour, and what was the result of each?
- How long is each step taking, and is it getting slower over time?
- Which machines have capacity, and which are overloaded?
- If I need to stop everything because we discovered a security issue in a dependency, how do I do that?
- If a machine dies, what was running on it, and does it need to be retried?
In a bash-based CI, the answer to all of these questions is “check the logs, I guess.” There is no dashboard. There is no API. There is no concept of the fleet as a whole. Each machine is an island, running its scripts in solitary ignorance of the others, like monks in separate cells who have each independently concluded they are the only monk. If you want to know what’s happening across your CI, you have to SSH into each box and look around, which is less “observability” and more “archaeology.”
A control plane gives you visibility. It gives you the ability to drain a machine (finish current jobs, accept no new ones) when you need to do maintenance. It gives you the ability to prioritize certain pipelines over others. It gives you automatic retry with backoff for transient failures. It gives you the ability to say “this job has been running for 90 minutes, which is three times its usual duration, something is wrong, kill it.” It gives you historical data: this pipeline used to take 8 minutes and now it takes 22, here’s when it changed, here’s the commit that caused it, here’s the person who needs to be gently informed that their new test is downloading a 4GB model from Hugging Face on every run.
These are not luxury features. These are operational necessities for any CI system running more than a handful of builds per day. Without them, you’re flying blind. You’re operating a distributed system with no observability, no control, and no way to answer the question “is everything okay?” without manually inspecting each component. This is less a CI system and more a series of faith-based assertions about the state of the world.
The Argument from Simplicity (Steelmanned, Then Addressed)
“But an orchestrator is complex,” the objection goes. “Bash is simple. I understand bash. I don’t need the overhead.”
I want to treat this argument with the respect it deserves, because it’s not wrong exactly. It’s locally correct. Bash is simpler than any orchestrator. A 50-line shell script is easier to understand than a Buildkite pipeline definition plus agent configuration plus plugin system. If your needs are served by a 50-line shell script, an orchestrator is overhead. Genuine, real, unnecessary overhead. I am not in the business of selling people tools they don’t need.
But here’s the thing: the simplicity of bash is the simplicity of a blank page. It’s simple because it provides nothing. The complexity doesn’t disappear when you choose bash; it moves from the tool into the scripts you write with the tool. The orchestrator is complex so that your pipeline definitions can be simple. Bash is simple so that your operational life can be complex.
This is, more or less, the same argument we had about web frameworks twenty years ago. “Why use Rails when I can write a CGI script in Perl?” Because the CGI script in Perl is simple until it isn’t, and when it stops being simple, you have no foundation to stand on. You have Perl. And /tmp. And a growing unease that you can’t quite articulate but that manifests as a reluctance to modify the deployment script without first checking whether anyone is currently deploying.
The orchestrator gives you a vocabulary for expressing the things you actually care about: dependencies between steps, resource requirements, parallelism constraints, failure handling, artifact management, caching policies. In bash, you have to invent this vocabulary yourself, from scratch, every time. Your vocabulary will be different from the next person’s. Neither of you will have documentation. Six months from now, both vocabularies will be wrong in different ways, and the person who has to reconcile them will not thank either of you.
The crossover point (where bash stops being simpler and starts being a liability) is different for every team. But if you’re honest with yourself, you probably know which side of it you’re on. If you’ve ever said “don’t touch the CI scripts” to a new hire, you’re past it. If you’ve ever debugged a CI failure that only happens when two specific builds run at the same time, you’re past it. If you’ve ever written sleep 30 # wait for the other build to finish in a CI script, you are so far past it that the crossover point is a speck on the horizon behind you, and I am sorry, and I have been there, and it does get better.
So What Do You Use?
I’m not here to sell you a specific product. I recommended Buildkite in the previous post because I’ve used it and like it, not because I have any financial relationship with them. (Someone on HN helpfully confirmed it wasn’t an ad, at least not one Buildkite commissioned. He’s right. Nobody is paying me for any of this. I am simply a man with opinions about CI and an internet connection, which is a dangerous combination.) I’m also not claiming Buildkite does everything I’ve described in this post; no single product does all of it perfectly. What I’m arguing for is a category: systems that provide orchestration, isolation, and a control plane. Systems that treat CI as a distributed computing problem, which it is, rather than a shell scripting problem, which it is not.
Whether that’s Buildkite, or Dagger, or Tekton, or Concourse, or even GitLab CI (which, for all its faults, at least has a real runner architecture with isolation), the point is the same: you need something between “a pile of bash scripts” and “a full Kubernetes deployment for CI” that understands what a build pipeline is and provides the primitives for expressing one. Evaluate them on their merits. Some are better at isolation, some at dynamic pipelines, some at observability. None of them are bash scripts, and that’s the point.
The orchestrator is not overhead. It’s infrastructure. The same way a database is not overhead for storing data, and a load balancer is not overhead for serving traffic. You can store data in flat files and serve traffic from a single box. It works. For a while. For certain definitions of “works” that exclude “I need to understand what’s happening” and “I need it to keep working when the system is under load” and “I need to not get paged at 2am because two builds tried to use the same port.”
Your CI is a distributed system whether you acknowledge it or not. The only question is whether you operate it with tools designed for the purpose, or with bash and the sincere hope that nothing goes wrong at the same time as anything else. Hope is not a strategy. It is, at best, a coping mechanism, and the things it must cope with are patient, and many, and they have learned to share a machine in ways you have not anticipated. You will meet them eventually. You might as well have an orchestrator when you do.
Footnotes
[^1]: The behavior of set -e (errexit) is so convoluted that it has its own FAQ entry on Greg’s Wiki. The Oils/YSH project puts it bluntly: “POSIX shell has fundamental problems with error handling. With set -e you’re damned if you do and damned if you don’t.” Among the highlights: local var=$(failing_command) silently succeeds, functions called in conditional context (if myfunc; then) have errexit recursively disabled inside them, and behavior varies across bash versions because the POSIX spec is ambiguous. See also the Oils error handling documentation and David Rothlis’s catalog of set -e perils.

[^2]: CircleCI’s automatic test balancing was announced in March 2015. The system parsed JUnit XML output to collect per-test timing data and used it to distribute tests across containers by execution time. See also “Intelligent CI/CD with CircleCI: Test Splitting” for the history of how this evolved from v1 to v2.

[^3]: Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones. “Build Systems à la Carte.” Proc. ACM Program. Lang., Vol. 2, No. ICFP, Article 79 (September 2018). https://doi.org/10.1145/3236774.

[^4]: Multiple parallel docker build runs using BuildKit can leak disk space that cannot be recovered by docker system prune. The root cause is cache locking issues in BuildKit’s local source handler. See moby/moby #46136, fixed in PR #47523.

[^5]: Concurrent builds with --import-cache produce intermittent “ref … locked: unavailable” failures. See moby/buildkit #1491.

[^6]: BuildKit crashes with a Go “concurrent map write” panic when exporting many images in parallel, e.g. via docker buildx bake. The crash likelihood increases with the number of parallel targets. See moby/buildkit #2041.

[^7]: cacache claims “lockless, high-concurrency cache access” and “immune to corruption, partial writes, process races.” See the cacache README. In practice, the guarantee is that reads are integrity-verified; corrupted data triggers EINTEGRITY errors or re-fetches rather than silent corruption. But concurrent writes do cause failures.

[^8]: npm/npm #2500, “concurrent npm install problems,” originally reported in the context of Jenkins running parallel Node.js builds on the same machine.

[^9]: npm/npm #5948, “Running multiple npm install causing random errors when writing to npm cache directory,” triggered by 16 concurrent Maven threads each invoking npm install.

[^10]: Yarn Classic provides --mutex file and --mutex network flags specifically because concurrent installs are unsafe. See the Yarn Classic CLI docs and yarnpkg/yarn #2146.

[^11]: The OOM killer’s process selection algorithm is in mm/oom_kill.c. The kernel source comment: “We don’t have to be perfect here, we just have to be good.” The badness score is based on memory footprint, runtime, privileges, and the user-tunable oom_score_adj. See the kernel OOM documentation.

[^12]: LWN.net, “Taming the OOM killer”: “This may be pretty annoying for users who may have wanted a different process to be killed.”

[^13]: LWN.net, “Improving the OOM killer.” Christoph Lameter: “starting the OOM killer wrecks the system by killing off useful processes.” Andrea Arcangeli warned against attempts to make the OOM killer perfect, since it is unlikely to ever get there.

[^14]: From the LKML OOM discussion: “The kernel can never know which processes are important. Just an admin (a man or management software) knows.”

[^15]: cgroup v2’s memory.max scopes OOM kills to the cgroup: “If the OOM killer is invoked in a cgroup, it’s not going to kill any tasks outside of this cgroup.” See the kernel cgroup v2 documentation and LWN.net, “Teaching the OOM killer about control groups.”

[^16]: docker/compose #3013, “Concurrent calls to docker-compose with different services creates disjoint network for each service.” Two simultaneous docker-compose up calls with the same project name create multiple network instances instead of sharing one, leaving services unable to communicate.