Making Haskell Talk to PostgreSQL Without Suffering
How to eliminate the three performance taxes that make database queries slow: encoding overhead, round-trip latency, and N+1 query patterns.
Every web application is, at its core, a machine for turning user intent into database round-trips. The number of those round-trips, and the cost of each one, determines whether your service responds in 11 milliseconds or 1.2 seconds. That gap is the difference between software that feels instant and software that feels broken.
If you build software, you already know this. The N+1 query problem is probably the most widely known performance pitfall in all of web development. It has its own Wikipedia article1. Every ORM tutorial warns you about it. Every senior engineer has war stories, usually involving production downtime and a slowly dawning realization that those innocent-looking loops were actually a ritual summoning of little demons that inflicted suffering on the database. And yet it persists, not because people don’t understand it, but because the traditional solutions all demand constant vigilance. You can write eager-loading joins by hand. You can restructure your data access layer around batch-friendly patterns. But solving database access pattern problems requires you to think about query patterns at every call site, and the moment someone writes a clean little helper function that does a single fetch (the kind of function that is obviously correct in isolation) the N+1 problem creeps back in through the cracks of composition.
Meanwhile, two other performance problems sit underneath the N+1 problem, less famous but just as costly: the encoding overhead of text-based wire protocols, and the latency cost of synchronous round-trips (although this is typically how N+1 suffering surfaces in practice). These three taxes compound each other, and addressing only one still leaves the other two extracting their toll.
I work at Mercury, where our Haskell codebase had 2,112 tables as of a few months ago. Many of our data access patterns date back six or more years, to when we were orders of magnitude smaller. Queries that were perfectly fine for a few hundred users making a few thousand requests a day are now executing at a scale their original authors didn’t need to worry about yet. Nobody wrote bad code2. They wrote correct code for the system they had. The system changed by virtue of success, and the queries simply didn’t keep up.
Meanwhile, as an SRE-type fellow, I have a small cage filled with monkeys living in the back of my skull. They serve as a captive audience, and when they see nonperformant code in production, they reenact the opening scene of 2001: A Space Odyssey.
But instead of a sleek black monolith, they are shrieking at a database trace that shows a waterfall of sequential single-row fetches inside a for loop.
The monkeys are throwing bones at the walls of my cranium, screeching and howling at the incomprehensible: each response parsed from text into an intermediate representation and then immediately parsed again into the record it should have been all along. Each round-trip paying the full cost of a network hop to a database that finished processing the query in microseconds and then sat idle, waiting for the next one.
This occurs regularly enough that I decided to spend my remaining two brain cells during my parental leave fixing it. (To be clear: we haven’t adopted these solutions at Mercury yet. This is the kind of deep-dive engineering that fascinates me but rarely fits into a product roadmap, so I’m sitting here at 5am writing this whilst the baby sleeps since my brain doesn’t know how to turn off.)
The trouble is that at a company with millions of lines of code, you need real solutions that work with real constraints. I can’t (and don’t want to) rewrite millions of lines of logic that heavily use our database libraries, persistent and esqueleto, but building drop-in replacements and easy tooling that make the existing code faster without anyone noticing the change is, surprisingly, more tractable. This led to two efforts:
The first, persistent-postgresql-ng: what if I could replace persistent’s PostgreSQL backend with one that speaks the binary wire protocol and pipelines everything automatically, but keeps every existing API call working exactly the same?
The second effort, sofetch, focuses on the N+1 problem: automatic batching and deduplication at the application layer.
This post is about how they work, how they compose, and why the combination produces results that neither achieves alone. The code is Haskell, the libraries are Haskell, but the three underlying problems (encoding overhead, synchronous round-trips, and N+1 query patterns) exist in every language and every ORM. If you’re not a Haskeller, the techniques described here can still apply, to some extent; you’ll just be reaching for different tools to implement them.
Three taxes levied upon every query
When your application speaks to PostgreSQL, three performance costs are incurred: encoding overhead, round-trip latency, and query count inefficiency. Most developers accept these costs as inevitable.
The encoding tax. Every value you send to or receive from PostgreSQL must be serialized and deserialized. There are two ways to do this: the text protocol and the binary protocol. The text protocol converts every value to a human-readable string: your Int64 becomes the ASCII string "42", your UTCTime becomes "2026-02-19 14:30:00+00". The binary protocol sends the machine representation directly with a smidge of framing: 8 bytes for that integer, 8 bytes for that timestamp, etc.
The standard Haskell PostgreSQL ecosystem (postgresql-simple, and by extension persistent-postgresql) uses the text protocol exclusively. This means every field of every row of every query pays a serialize-to-text and parse-from-text cost, even for types like integers and timestamps that have compact, fixed-size binary representations.
The persistent ecosystem then adds a second layer of overhead on top of this. Each value decoded from the wire is first stuffed into a PersistValue tagged union (essentially a sum type with constructors like PersistInt64, PersistText, PersistByteString, etc.), then collected into a linked list, and then pattern-matched again to extract each field’s payload into your actual Haskell record. This intermediate representation exists so that persistent can be backend-agnostic (the same fromPersistValue code works for PostgreSQL, SQLite, and MySQL) but the cost is higher than consumers of the library may realize.
For a 10-column entity, a single row produces roughly 10 PersistValue heap objects, 10 list cons cells, 9 intermediate Either Text (a -> b -> ...) values from the applicative fromPersistValues chain, and then 10 more Either values from per-field fromPersistValue calls. That’s around 49 intermediate heap objects, roughly 560 bytes of pure overhead, that exist solely as intermediate computation that will be garbage collected in short order. For a 100k-row result set, you’ve just generated ~56MB of short-lived allocations whose only purpose is to shuttle data between two representations that your application code never sees. (We’ll come back to why this allocation volume matters at a systems level; it’s not just about the time spent allocating or using less memory in isolation, but about what it does to GHC’s garbage collector under concurrent load.)
The round-trip tax. Every query you issue is a synchronous request-response pair. Your application sends the SQL, then blocks, waiting for the server to parse the query, execute it, serialize the results, and send them back over the network. Only after the response arrives can the next query be sent.
This matters because network latency is not zero. Even within a single datacenter, a round-trip typically takes 1-2ms. That doesn’t sound like much, but it’s per query. If you issue 100 queries sequentially, and each round-trip costs 2ms, you’ve spent 200ms doing nothing: your CPU is idle, the database is idle between queries, and the only thing working is the speed of light through fiber optic cable. The database could have been processing query #2 while the response to query #1 was still in transit, but the synchronous protocol doesn’t allow this. Each query must wait its turn.
Each query treats the database connection like an express checkout lane where you’re only allowed to buy one single item per trip through the line. You buy an apple, walk out to your car, put the apple in the trunk, walk back in, buy a second apple. The cashier recognizes you. You recognize the cashier. Nobody is happy. But the ORM says this is the only way to make a fruit salad. Regardless, the approach takes 100× longer.
The query count tax. The N+1 problem. You know what it is. But it’s worth spelling out why it’s so hard to avoid in practice, because that’s the problem these libraries actually solve. Each function in your codebase fetches the data it needs, and each function in isolation looks perfectly reasonable. getUser fetches one user. renderPost calls getUser for the author. renderPage calls renderPost for each of 50 posts. Fifty queries materialize from composition.
The standard advice is to fix this by restructuring: write a getUsersBatch function, collect all the author IDs up front, do one big WHERE id IN (...) query. And that works… until someone adds a new feature that calls getUser from a different code path, or a new team member writes a perfectly reasonable function that doesn’t know about the batch-fetching convention. In practice, threading data from child functions up to a parent that knows how to do a batched call, after you’ve already written code that performs the access internally, is just a huge pain in the ass to rework. The N+1 problem isn’t hard to fix in any one place. It’s hard to fix everywhere, all the time, forever. It requires a level of global awareness about query patterns that doesn’t compose the way code composes. Every new function is a chance for it to come back.
These three costs compound multiplicatively: the query count multiplies the per-query cost, and the per-query cost is the sum of encoding overhead and round-trip time, so total latency is roughly (number of queries) × (encoding overhead per query + round-trip time). Reducing any one of these helps, but the other two are still dragging you down. Eliminating the N+1 problem turns 100 queries into 2, but if those 2 queries are still synchronous and text-encoded, you’ve only addressed one dimension. The real gains come from addressing all three at once.
persistent-postgresql-ng attacks the first two. sofetch attacks the third. Together, they make all three largely irrelevant.
The broader landscape
Before getting into the Haskell-specific implementations: these three problems are universal and well known, and what I’m outlining has plenty of prior art in the broader technical ecosystem. Surveying that prior art clarifies both what’s been solved elsewhere and what persistent-postgresql-ng and sofetch contribute.
Within Haskell: Hasql
I should acknowledge the elephant in the room. Hasql has been doing binary protocol, direct decoding, and prepared statements for years, and doing them well. If you’re a Haskeller reading this and thinking “doesn’t Hasql already solve the encoding problem?”; yes, it largely does. Hasql was an important proof that the Haskell PostgreSQL ecosystem didn’t have to be shackled to the text protocol, and persistent-postgresql-ng learned from its example. In fact, the binary decoding library I’m using was written by Nikita Volkov, who wrote Hasql, so there you go.
So why not just use Hasql? Because I already have enough major system migrations on my plate, frankly.
Mercury has north of a thousand tables and millions of lines of persistent and esqueleto code. You don’t rewrite that. You don’t even start rewriting that. The migration cost would dwarf any performance gain, and the risk of introducing subtle behavioral differences across a codebase that handles actual money is not a risk any reasonable person takes. The pragmatic move is to make persistent fast, not to replace it without further justification.
Beyond the migration itself, Hasql has its own downsides that have kept me from feeling it’s viable, even in my own side projects.
The first is boilerplate. Hasql requires you to manually define a Statement for every query: specifying the SQL text, the parameter encoder, and the result decoder as explicit, composed values. For a simple SELECT * FROM users WHERE id = $1, you’re writing an encoder for the parameter, a decoder for every column of the result row, a row-to-record mapping, and a result-set-to-collection mapping. This is correct and explicit and composable, and it is also a lot of code per query compared to persistent’s approach of generating everything from the entity definition. For a large application like Mercury with thousands of entity types, the cumulative boilerplate is substantial. persistent-postgresql-ng gives us the same binary protocol and direct decoding with the same TH-generated entity definitions we’re already using.
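To make the boilerplate point concrete, here’s roughly what a single trivial lookup looks like in Hasql. This is a sketch from memory, so treat the exact combinator names as approximate; the `users` table with `id int8` and `name text` is assumed for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Int        (Int64)
import           Data.Text       (Text)
import qualified Hasql.Decoders  as D
import qualified Hasql.Encoders  as E
import qualified Hasql.Statement as S

-- One query, four pieces: the SQL text, a parameter encoder, a row decoder,
-- and a result-shape decoder. Multiply by every query in the application.
userNameById :: S.Statement Int64 (Maybe (Int64, Text))
userNameById =
  S.Statement
    "SELECT id, name FROM users WHERE id = $1"
    (E.param (E.nonNullable E.int8))
    (D.rowMaybe ((,) <$> D.column (D.nonNullable D.int8)
                     <*> D.column (D.nonNullable D.text)))
    True  -- use a prepared statement
```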
The second is instrumentation. Hasql takes an extremely-principled stance on its interfaces: the connection type is opaque, the session execution pipeline is sealed, and there are limited extension points for middleware. This is a defensible design choice (it prevents misuse and makes the library easier to reason about), but it makes certain operational concerns difficult. I care about seeing what my systems are doing in production. I wrote most of the OpenTelemetry code for Haskell that is used by nearly every commercial Haskell shop that I’ve come across, and I want every query to produce a span with the SQL text, parameters, timing, and error status, wired into the same trace context as the rest of the request. With persistent-postgresql-ng, I control the backend and can instrument at whatever granularity I need. With Hasql, you’re working against the abstraction to get this kind of visibility. The hasql-opentelemetry package exists but is limited by what Hasql’s sealed interfaces expose. For a production service where observability is non-negotiable, this is a real friction point.
Frankly, with regards to persistent and esqueleto: I was tempted to take a narrower path. We could have forked persistent and esqueleto into PostgreSQL-specific versions for Mercury, stripped out the backend-agnostic abstractions, hard-coded the binary protocol and pipelining, and shipped something that was fast and correct for our exact stack. It would have been less work in some regards, and it would have solved our problem.
But persistent and esqueleto have a large user base, and most of those users are hitting the same performance problems we are, even if they don’t realize it. The text protocol overhead, the PersistValue indirection, and the synchronous round-trips aren’t Mercury-specific problems. They’re ecosystem-wide problems that happen to be most visible at Mercury’s scale. If I’m going to do the work of building a fast PostgreSQL backend, I’d rather do it in a way that raises the bar for the community, not just for one company’s fork. That’s why the direct codec work is an RFC to persistent core, not a Mercury-internal patch, and it’s why persistent-postgresql-ng is a drop-in replacement for the community package, not a private fork with our connection pooling baked in.
If you’re starting a greenfield project with a small schema and want maximum speed and correctness, hasql is excellent; it’s one of the best-designed Haskell database libraries. If you have an existing persistent codebase (and there are a lot of them), or you need the operational flexibility that comes with controlling your own backend, persistent-postgresql-ng gives you the same performance characteristics without leaving the persistent ecosystem.
The “single perfect query” approach
There’s another serious approach worth calling out: push as much shaping as possible into one query and have the database return exactly the final response shape. Systems like Hasura do this well for GraphQL workloads; they compile a nested query into SQL that can aggregate, join, and emit JSON in one trip.
When it works, it’s elegant. You avoid N+1 by construction, and you get one deterministic statement to optimize.
But in many application stacks this comes with tradeoffs:
- Response caching gets trickier because the cache key is effectively the full query shape + variables, and tiny field-selection differences can fragment cache hit rates.
- Building massive nested JSON objects inside PostgreSQL can become its own complexity and performance tax, especially once response shapes get wide and conditional.
- Query readability and debuggability can degrade fast; one “perfect query” can become a very imperfect thing to maintain.
- Some people hate GraphQL for assorted reasons, so spending your social capital to adopt it is risky if it goes poorly.
- In any case, without Hasura or similar, GraphQL’s model struggles deeply with making N+1 patterns a first class citizen of how you write code.
So I treat this as a powerful technique, not a universal answer. For some systems it’s the right center of gravity; for others, composable fetching plus batching / pipelining is a better long-term fit.
Back to the techniques at hand
Binary protocol and direct decoding. This is the one where Haskell has historically been behind. Most PostgreSQL drivers in other languages already use the binary wire protocol by default, or offer it as an option. Python’s asyncpg is the gold standard here; it uses the binary protocol throughout, decodes directly into Python objects, and is significantly faster than psycopg2’s text-based approach3. In the JVM world, JDBC drivers typically use binary format for prepared statements. Node.js’s node-postgres uses text by default but supports binary through plugins. Go’s pgx uses binary protocol natively. The Haskell ecosystem’s reliance on text protocol via postgresql-libpq is unusual among modern language ecosystems; it’s an artifact of postgresql-simple’s early design choices propagating through the dependency tree, and reticence about inflicting interface changes on consumers of the libraries in question.
Pipelining. This is less common but gaining traction. PostgreSQL added explicit pipeline mode support in libpq 14 (PostgreSQL 14, released 2021), and drivers are still catching up. Python’s asyncpg supports pipelining through its async design. Node.js’s node-postgres doesn’t support pipelining natively yet. Go’s pgx added pipeline mode support. JDBC doesn’t expose it directly, though connection pools with statement batching achieve a similar effect. In most languages, if you want automatic pipelining without thinking about it, you either need an async driver that pipelines by default, or you need to use explicit batch/pipeline APIs. There’s not any real equivalent of the lazy-reply trick I’ll describe below (that I’m aware of), because most languages lack Haskell’s laziness.
N+1 batching. This is where the landscape is richest. Facebook’s Haxl4 pioneered this pattern in Haskell. In fact, my sofetch library is really a reformulation of it that solves the things that bug me about it. Facebook also built DataLoader for JavaScript, which is widely used in the GraphQL ecosystem. It batches and deduplicates fetches within a single tick of the event loop, which is conceptually similar to sofetch’s applicative rounds. GraphQL servers in most languages have DataLoader implementations or equivalents. Outside GraphQL, though, adoption is spottier. Ruby on Rails has includes and eager_load for preloading associations, but these require explicit annotation at the query site. Elixir’s Ecto has preload, which is similar. Python’s SQLAlchemy has eager loading strategies (joinedload, subqueryload, selectinload), each with different tradeoffs. Go has dataloaden for code-generating DataLoader instances, although it seems to be dead at this point.
As we see, batching solutions exist in most ecosystems, but they’re typically either tied to a specific framework (GraphQL DataLoaders), require explicit opt-in at each query site (Rails includes, SQLAlchemy eager loading), or require significant code generation boilerplate (Go’s dataloaden). sofetch’s contribution is making this work transparently through Haskell’s Applicative typeclass (the type system itself tells the library which fetches are independent) but if you’re in another language, a DataLoader is generally the closest equivalent.
Regardless of your language, the diagnostic framework is the same. If your database-backed service is slow, ask three questions in order: Are you paying an encoding tax you don’t need to? Are you paying a round-trip tax? Are you issuing more queries than necessary? The rest of this post addresses all three for the Haskell persistent ecosystem specifically, but the thinking is portable everywhere.

One more practical rule I care about a lot: bound your queries. Unbounded reads over high-cardinality datasets are one of the fastest ways to turn a healthy service into an incident. Pagination (offset or cursor), selective projection, and explicit limits are still foundational, even when you’re also using batching and pipelining.
The binary protocol
As mentioned, PostgreSQL has a binary wire protocol5. It has had one for a very long time. Almost nobody in the Haskell ecosystem uses it aside from hasql-enjoyers.
persistent-postgresql-ng does. When it sends an Int64, it sends 8 bytes, the same representation GHC already has in memory. When it receives a UTCTime, it reads 8 bytes of microseconds-since-epoch and constructs the value directly. In practice, this means we avoid text parsing, intermediate representations, and shoehorning things into PersistValue.
To understand why this matters so much, consider what happens when you read an integer column under the text protocol. PostgreSQL converts the internal 8-byte integer to an ASCII string like "12345". That string travels over the wire. postgresql-simple receives it, allocates a ByteString, and calls a text-to-integer parser. The parser walks the string character by character, accumulating the result. Then persistent wraps the parsed integer in a PersistInt64 constructor, conses it onto a linked list, and later pattern-matches it back out in fromPersistValue. The binary protocol skips all of this: the 8 bytes arrive, they’re read directly into an Int64, done.
Why zero allocation matters: a brief tour of GHC’s heap
To understand why “zero-allocation decode” is worth the engineering effort, you need to understand what allocation actually costs in GHC, and it’s more than you might think.
Every heap object in GHC has a header: a pointer to an info table that describes the object’s type, layout, and entry code. For a constructor like Right (PersistInt64 42), that’s not one object, it’s three: the Right constructor (header + pointer to payload), the PersistInt64 constructor (header + pointer to payload), and the Int64# unboxed value (though this one may be inlined into the PersistInt64 depending on optimization). Each constructor is a minimum of 2 words (16 bytes on 64-bit) for the header and one pointer. A cons cell (:) in a linked list is 3 words: header, pointer to head, pointer to tail.
So when persistent decodes a 10-column row through PersistValue, you get: 10 PersistValue constructors (each wrapping a payload), 10 list cons cells to hold them in [PersistValue], a nil at the end, and then the applicative fromPersistValues chain produces 9 intermediate Either Text (a -> b -> c -> ...) values (one per field application), plus 10 more Either values from per-field fromPersistValue calls. Each Either is another 2-3 word heap object. The partial applications in the Eithers also allocate closures. We’re looking at roughly 49 heap objects per row (not counting the actual payload values) that exist for a few microseconds and then become garbage.
Now here’s the trick: you can represent the same computation without allocating any of these intermediate objects, using continuation-passing style (CPS).
Consider the standard Either-based approach to parsing a field:
```haskell
fromPersistValue :: PersistValue -> Either Text Int64
fromPersistValue (PersistInt64 x) = Right x
fromPersistValue _ = Left "expected Int64"
```
Every call allocates a Right or Left constructor. The caller pattern-matches on it, extracts the value, and discards the wrapper. The wrapper’s only purpose was to signal success or failure; it carried one bit of information in a 16-byte heap object.
The CPS alternative eliminates this entirely by passing the continuation (what to do with the value) directly into the decoder:
```haskell
decodeInt64 :: FieldRunner env
decodeInt64 onSuccess onFailure bytes =
  onSuccess (readInt64 bytes)
```
On success, we call onSuccess directly with the decoded value. No Right constructor is allocated, no Either is pattern-matched. On failure, we call onFailure with the error. The branch that determines success or failure is a function call, not an ADT constructor followed by a pattern match. In GHC core, this compiles down to a conditional jump, the same thing a C if statement produces.
This is sometimes called Church encoding: representing a data type not as a tagged union but as a function that accepts one continuation per constructor. Either a b Church-encoded is forall r. (a -> r) -> (b -> r) -> r. The Church-encoded version contains exactly the same information, but instead of allocating a heap object and then immediately destructuring it, it directly invokes the appropriate branch. For code where the Either is always immediately consumed (as in a row decoder that processes field after field), Church encoding eliminates 100% of the intermediate Either allocations at zero semantic cost.
persistent-postgresql-ng’s RowReader approach chains these CPS decoders together. Each field decoder takes an onSuccess continuation that feeds the decoded value into the next field’s decoder, which feeds into the next, and so on, until the final continuation constructs the complete record. The entire 10-column row decode is one chain of function calls with no intermediate heap objects. GHC’s simplifier inlines the continuations, and the result is a tight loop that reads bytes from the wire format and writes fields directly into the record constructor, the same code a C programmer would write by hand, but generated from the entity definition by Template Haskell.
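Here’s a toy, self-contained version of that chaining. The names (CpsDecoder, decodeUserRow) are hypothetical rather than the real RowReader/FieldRunner API, the “wire format” is one ByteString per column, and text-digit parsing stands in for the real binary reads; the point is the shape of the continuation chain.

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}

import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BC
import           Data.Int              (Int64)
import           Data.Text             (Text)
import qualified Data.Text.Encoding    as TE

-- A field decoder takes success and failure continuations instead of returning
-- Either; success and failure become direct function calls, not constructors.
type CpsDecoder a = forall r. (a -> r) -> (Text -> r) -> ByteString -> r

decodeInt64 :: CpsDecoder Int64
decodeInt64 onSuccess onFailure bytes =
  case BC.readInteger bytes of
    Just (n, rest) | BC.null rest -> onSuccess (fromInteger n)
    _                             -> onFailure "expected Int64"

decodeText :: CpsDecoder Text
decodeText onSuccess _ bytes = onSuccess (TE.decodeUtf8 bytes)

data User = User { userId :: Int64, userName :: Text } deriving Show

-- Chain the field decoders: each success continuation feeds the next decoder,
-- and the final continuation applies the record constructor. No intermediate
-- Either is ever allocated along the way.
decodeUserRow :: [ByteString] -> (User -> r) -> (Text -> r) -> r
decodeUserRow [idB, nameB] onSuccess onFailure =
  decodeInt64
    (\uid -> decodeText (\name -> onSuccess (User uid name)) onFailure nameB)
    onFailure
    idB
decodeUserRow _ _ onFailure = onFailure "unexpected column count"
```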
Beyond latency: GC pressure and mechanical sympathy
The 49 intermediate objects per row might seem like a micro-optimization. Who cares about a few hundred bytes? But this analysis changes dramatically at scale, in a superlinear fashion.
GHC’s garbage collector is generational and (by default) copying. Nursery collections6 are fast, proportional to the amount of live data in the nursery, not the total nursery size. Short-lived allocations that die before the next GC are essentially free because the copying collector never touches them. But “short-lived” means “dead before the next nursery collection,” and in a web server processing multiple requests concurrently, the nursery on each capability is shared by all of that capability’s green threads. A request that allocates heavily fills the nursery faster, triggering more frequent collections that pause all threads on that capability.
The real killer for latency-sensitive systems is major (generation 1) collections. These are stop-the-world: every Haskell thread on every core pauses while the GC walks the entire old-generation heap. As Channable documented in their excellent writeup on Haskell memory management7, this creates a quadratic slowdown: doubling the live data roughly doubles both the number of major GCs and the time each one takes. In a server handling concurrent requests, one thread with a large live set penalizes every other thread through stop-the-world pauses.
This means that reducing allocation isn’t just about making individual requests faster; it’s about mechanical sympathy with the runtime. Every intermediate PersistValue, every Either Text a, every cons cell contributes to GC pressure that degrades tail latency and throughput across the entire service. A request that decodes a 10,000-row result set through the PersistValue path generates roughly 490,000 intermediate heap objects. Most die quickly and are collected cheaply in the nursery, but the allocation rate itself drives nursery collection frequency, and any objects that happen to survive across a nursery boundary get promoted to the old generation where they contribute to expensive major GC pauses.
The zero-allocation decode path doesn’t just make that one request 2-3× faster. It reduces the GC pressure that request imposes on every other concurrent request. In a production service handling hundreds of requests per second, this is the difference between consistent 5ms p99 latency and occasional 50ms spikes when a major GC happens to coincide with your request. It’s the difference between a service that scales predictably with load and one that develops mysterious latency cliffs when traffic increases.
This is why I consider the CPS decode path a systems-level optimization, not a micro-benchmark curiosity. The benchmarks show 2-3× on individual operations; the production impact on tail latency and throughput consistency can be significantly larger.
The column type dispatch (checking which PostgreSQL OID each column has and selecting the right decoder) happens once per result set via a prepare/run split, not once per row. To understand why this matters: when PostgreSQL sends you a result set, each column has an OID (object identifier) that tells you its type: int8, timestamptz, text, etc. A naive decoder would check this OID for every column of every row, branching on the type to select the right decoding function. For a 10-column table returning 10,000 rows, that’s 100,000 type dispatches.
persistent-postgresql-ng’s FromRow class exposes a prepareRow method that runs prepareField for each column once, producing a RowDecoder that captures the resolved FieldRunners in a closure. The per-row loop then calls only runField: no OID dispatch, no vector lookup, no branching on column types. After GHC specialization, the closures inline and the row-reading loop collapses to a straight-line sequence of concrete decoder calls with no polymorphism left at runtime. Those 100,000 type dispatches become 10.
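Here’s a toy illustration of that prepare/run split, with hypothetical names rather than the actual FromRow/prepareRow API: resolve each column’s decoder from its OID once, then apply the captured closures to every row with no dispatch left in the loop.

```haskell
import           Data.Bits          (shiftL, (.|.))
import           Data.ByteString    (ByteString)
import qualified Data.ByteString    as BS
import           Data.Int           (Int64)
import           Data.Text          (Text)
import qualified Data.Text.Encoding as TE

type Oid = Int

data Cell = CellInt64 Int64 | CellText Text
  deriving Show

-- Naive approach: this dispatch runs for every cell, i.e. rows × columns times.
decodeCellNaive :: Oid -> ByteString -> Cell
decodeCellNaive 20  bytes = CellInt64 (int64BE bytes)      -- int8
decodeCellNaive 25  bytes = CellText (TE.decodeUtf8 bytes) -- text
decodeCellNaive oid _     = error ("unsupported oid " <> show oid)

-- Prepare/run split: resolve the decoder once per column...
prepareColumn :: Oid -> (ByteString -> Cell)
prepareColumn 20  = CellInt64 . int64BE
prepareColumn 25  = CellText . TE.decodeUtf8
prepareColumn oid = error ("unsupported oid " <> show oid)

-- ...then the per-row loop only applies the captured closures.
decodeRows :: [Oid] -> [[ByteString]] -> [[Cell]]
decodeRows oids rows =
  let decoders = map prepareColumn oids   -- OID dispatch happens here, once
  in  map (zipWith ($) decoders) rows

-- Big-endian read of an 8-byte int8 column in the binary format.
int64BE :: ByteString -> Int64
int64BE = BS.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0
```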
The result, even on localhost where the network is free:
| Benchmark | persistent-postgresql | persistent-postgresql-ng | Speedup |
|---|---|---|---|
| get ×100 | 4.7ms | 1.7ms | 2.8× |
| insertMany ×1000 | 14.1ms | 5.3ms | 2.7× |
| upsert ×100 | 12.7ms | 8.9ms | 1.4× |
| mixed DML ×100 then select | 29.9ms | 14.6ms | 2.0× |
| delete ×100 then select | 7.5ms | 4.5ms | 1.7× |
These numbers are purely from the binary protocol and zero-allocation decode path. Pipelining isn’t even contributing yet; at localhost latency, there’s nothing to pipeline. This is the tax you pay today for text encoding and PersistValue, and it applies to every query your application runs. And remember, this isn’t just about the latency of the individual request. Every avoided allocation is GC pressure that isn’t imposed on every other concurrent request in your service.
The encode side is symmetric. Consider how bulk inserts typically work: the library generates INSERT INTO t VALUES (?,?,...), (?,?,...), ... with one bind parameter per value. If you’re inserting 100 rows with 10 columns each, that’s a SQL string with 1,000 bind parameters, and crucially, a different SQL string than if you were inserting 99 rows or 101 rows. PostgreSQL’s prepared statement cache keys on the SQL text, so every different batch size generates a new plan.
persistent-postgresql-ng uses INSERT ... SELECT * FROM UNNEST($1::type[], $2::type[], ...) instead. Each column is sent as a single array parameter, so the SQL template is fixed regardless of batch size: 10 rows or 10,000 rows, same query string, same prepared statement, same cached plan. The database parses and plans the query once and reuses it forever.
The same principle applies to WHERE IN clauses. WHERE id IN (?,?,?,...) becomes WHERE id = ANY($1) with a single array parameter. Again: fixed SQL, one cached plan, no matter how many values you’re checking against.
Automatic pipelining
The standard persistent PostgreSQL backend executes every operation synchronously. insert sends INSERT ... RETURNING id, waits for the server to respond with the generated key, then returns it. delete sends DELETE, waits for the server to acknowledge, then returns (). If you run 100 deletes, you perform 100 round-trips, and the actual wall-clock time is dominated not by the database doing work but by network latency. The database might spend 0.1ms executing each delete, but each round-trip costs 2ms, so 95% of your time is spent waiting for packets to traverse the wire.
PostgreSQL’s wire protocol actually supports pipelining: sending multiple queries without waiting for responses, then reading the responses in order. This is the same concept as HTTP pipelining or Redis pipelining. The idea is simple: instead of send-wait-send-wait-send-wait, you send-send-send-wait-read-read-read. The database processes queries as fast as they arrive, and the responses queue up for you to read whenever you’re ready.
Here’s what sequential queries look like as a trace waterfall. Each query waits for the previous response before it can be sent; notice how most of the timeline is idle rather than doing useful work:
Now compare the pipelined version. All queries are sent in a single flush, and the database processes them back-to-back while the responses stream back:
The total database work is identical. The difference is pure dead time eliminated: time where neither the application nor the database was doing anything useful, just waiting for packets to traverse the wire.
persistent-postgresql-ng enters libpq’s pipeline mode at connection time and never leaves it. Operations fall into two categories based on a simple question: does the caller need the result right now?
Fire-and-forget operations (delete, update, replace, deleteWhere, updateWhere) are operations where the caller doesn’t use the return value (it’s ()). These send the query into the output buffer, increment a pending-result counter, and return immediately. These operations eliminate round-trip latency entirely. The results accumulate on the server’s side until something actually needs to read from the connection: a subsequent read operation, or a transaction commit. This is safe because if any of these operations fail, the error will surface when the next read occurs (or at commit time), which is the same point where you’d discover the error in a transaction anyway.
Lazy-reply operations (get, getBy, insert, count, exists) are trickier. The caller does need a result: get returns Maybe record, insert returns a generated key. But the caller doesn’t necessarily need that result immediately. They might call get 100 times in a loop before actually inspecting any of the returned values.
To understand how persistent-postgresql-ng exploits this, you need to understand what “laziness” means in Haskell, because it’s unlike anything in most mainstream languages.
In most languages, when you call a function, it runs immediately and gives you the result. In Haskell, evaluation is lazy by default: when you call a function, the runtime doesn’t actually compute the result; it creates a “thunk,” which is essentially a note that says “here’s how to compute this value when someone actually needs it.” The value is only computed when your code tries to inspect it (pattern-match on it, print it, use it in arithmetic, etc.). If nothing ever looks at the value, the computation never runs.
This is normally a feature for pure computation; it lets you work with infinite data structures, avoid unnecessary work, and compose programs more flexibly. But it becomes especially powerful when combined with IO, because it lets you decouple when a side-effecting operation is initiated from when its result is consumed.
This is where unsafeInterleaveIO comes in, borrowed from a technique used by Hedis, the well-known Haskell Redis client. At connection time, an infinite lazy list of server replies is created using unsafeInterleaveIO. Each get eagerly writes its SELECT into the output buffer and pops an unevaluated thunk from this list via atomicModifyIORef. The actual network read happens only when the caller inspects the result value. The thunk is a promise that says “when you need this value, I’ll go read it from the server.” If you accumulate 100 such promises before forcing any of them, all 100 queries have already been sent, and forcing the first thunk reads all 100 responses from the server’s already-buffered output.
This is the part that makes some people nervous about unsafeInterleaveIO. And they should be; it has “unsafe” in the name for a reason. Normally, Haskell’s type system guarantees that IO actions execute in the order you write them: read a file, then process the contents, then write the output. unsafeInterleaveIO breaks this guarantee by wrapping an IO action in a lazy thunk, so the side effect happens whenever the runtime gets around to evaluating the result, not when the action appears in your code. In the general case, this is genuinely dangerous: if two unsafeInterleaveIO thunks both read from the same file handle, the order they execute in depends on which value your program pattern-matches on first, which might change based on optimization level, GHC version, or the phase of the moon. Your program becomes nondeterministic about when side effects occur, which is exactly what Haskell’s IO type was designed to prevent.
So why is it safe here? Because the replies from a PostgreSQL pipeline are strictly ordered (response 1 is always followed by response 2, which is always followed by response 3) and the lazy list is constructed so that thunk N is created inside thunk N-1’s body. Forcing any thunk forces all preceding thunks first, which means replies are always consumed in pipeline order regardless of which result the caller looks at first. The data structure makes the only possible execution order the correct one.
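Here is a stripped-down sketch of that construction. readNextReply stands in for “parse one pipelined response off the socket” and Reply for a decoded server response; the real backend threads this through libpq’s pipeline mode, but the shape is the same.

```haskell
import Data.IORef       (IORef, atomicModifyIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

data Reply = Reply String deriving Show   -- stand-in for a decoded server response

-- An infinite list of replies in which every cons cell is a deferred read.
-- Thunk N lives inside thunk N-1's tail, so forcing reply N first forces
-- (and memoizes) replies 1..N-1, keeping consumption in pipeline order.
lazyReplies :: IO Reply -> IO [Reply]
lazyReplies readNextReply = go
  where
    go = unsafeInterleaveIO $ do
      r  <- readNextReply   -- runs only the first time this cell is forced
      rs <- go              -- the tail is itself another deferred thunk
      pure (r : rs)

-- Each pipelined query pops one reply thunk without forcing it. The non-strict
-- atomicModifyIORef leaves both the stored tail and the returned head
-- unevaluated; the socket read happens only when the caller inspects the Reply.
popReply :: IORef [Reply] -> IO Reply
popReply ref = atomicModifyIORef ref (\rs -> (tail rs, head rs))
```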
If you call mapM get over 100 keys before looking at any of the results, all 100 queries are sent in a single flush, and the results are read sequentially from the server’s already-buffered responses. This is invisible at the API level. mapM get keys is the same code you write today. It just runs significantly faster the instant there is any network between you and your database:
| Benchmark | Pipeline | Sequential | Speedup |
|---|---|---|---|
| get ×100 (2ms RTT) | 11ms | 310ms | 28× |
| insert ×100 (2ms RTT) | 13ms | 314ms | 24× |
| upsert ×100 (2ms RTT) | 13ms | 321ms | 25× |
| insertMany ×1000 (2ms RTT) | 8.6ms | 31.0ms | 3.6× |
| get ×100 (10ms RTT) | 50ms | 1.19s | 24× |
| insert ×100 (10ms RTT) | 41ms | 1.20s | 29× |
The speedup scales linearly with latency, and the reason is straightforward. At 10ms RTT, 100 sequential round-trips cost a minimum of 1000ms; that’s one full second of pure waiting, independent of how fast the database actually processes the queries. The pipeline pays one RTT for the initial flush and then reads all 100 results from the server’s already-buffered responses. The database was processing queries the entire time the first responses were in transit, so by the time the client starts reading, most or all of the work is done.
At 2ms RTT (a nearby datacenter, the boring normal case for most production deployments) you get a 24-28× improvement for free. At 0ms RTT (localhost), pipelining provides no benefit because there’s no latency to hide. This is why localhost benchmarks can be misleading: they show you the encoding cost but hide the round-trip cost, which in production is usually the dominant factor.
The two sources of improvement are additive and independent:
| Source | Typical gain at 0ms RTT | Typical gain at 2ms RTT |
|---|---|---|
| Binary protocol (encode/decode) | 1.2-2.8× | 1.2-2.8× |
| UNNEST bulk insert | 2.3× | 3.6× |
| Fire-and-forget DML pipelining | 1.0× | 20-24× |
| Hedis-style lazy pipelining | 1.0× | 24-28× |
The binary protocol provides a constant factor improvement regardless of latency; it’s simply doing less work per byte. Pipelining provides a latency-proportional factor; the higher the latency, the more time you were wasting on sequential round-trips, and the more pipelining saves. At any non-zero network distance (which is to say, in any production deployment that isn’t running the database on the same box) the pipelining dominates. The combined best case is 28× at a mere 2ms RTT.
To be clear: these gains come from doing exactly what you were already doing, with no code changes beyond the import and pool setup. The same mapM get keys that produced 100 sequential synchronous text-protocol round-trips now produces 100 pipelined binary-protocol queries that execute in a fraction of the time.
Automatic batching
So we’ve made each query cheaper (binary protocol, zero-allocation decoding) and we’ve made the network carry many queries in one trip (automatic pipelining). But we’re still issuing the same number of queries. If your page renders 50 posts with authors and comment counts, that’s still 50 SELECT ... WHERE id = ? queries hitting the database, even if they’re pipelined.
Pipelining turns 50 round-trips into 1, which is transformative. But the database is still parsing (or retrieving a cached prepared query), planning, and executing 50 separate queries when one would do. We can go further: turn 50 queries into 1 query.
This is solving the N+1 problem proper, and as I said at the top, everyone knows about it. The traditional solutions are well-understood too: eager loading (fetch related data in a JOIN), batch loading (collect IDs up front, do one WHERE IN query), or DataLoader-style request batching. The problem with all of these is that they require the caller to know about the access pattern. You can’t just write getUser :: Key User -> m User and compose it freely; you have to write getUsersBatch :: [Key User] -> m [User] and restructure every call site to collect keys and call the batch version. This works, but it’s fragile. It requires discipline at every call site, it doesn’t compose across module boundaries without explicit coordination, and it breaks the moment someone writes an innocent helper that does a single fetch.
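To make the fragility concrete, here’s roughly what that manual discipline looks like in plain persistent, reusing the Post/User names from the earlier example; Html and renderPostHtml are hypothetical placeholders.

```haskell
import           Control.Monad.IO.Class (MonadIO)
import           Data.Map.Strict        (Map)
import qualified Data.Map.Strict        as Map
import           Database.Persist
import           Database.Persist.Sql   (SqlPersistT)

-- Collect all the author keys up front, fetch them in one IN query, and thread
-- the results back down by hand. Every call site has to know to do this.
getUsersBatch :: MonadIO m => [Key User] -> SqlPersistT m (Map (Key User) User)
getUsersBatch ids = do
  users <- selectList [UserId <-. ids] []
  pure (Map.fromList [(entityKey e, entityVal e) | e <- users])

renderPage :: MonadIO m => [Post] -> SqlPersistT m [Html]
renderPage posts = do
  authors <- getUsersBatch (map postAuthorId posts)
  pure
    [ renderPostHtml post author
    | post <- posts
    , Just author <- [Map.lookup (postAuthorId post) authors]
    ]
```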
sofetch is a library I wrote to solve this problem in environments where side effects are constrained. The lineage is Facebook’s Haxl, but the motivation is slightly different. In most production Haskell services, database transactions form a boundary: inside a transaction, you should be doing database work, not making HTTP calls, writing to disk, or firing off messages to a different queueing system. sofetch is designed for this world; it gives you a structured way to express data fetching as a pure description of what you need, separate from the effectful execution. You describe your data sources declaratively, the library batches and deduplicates, and the actual database calls happen in controlled batches rather than scattered throughout your business logic.
And it’s not just for database interactions. The same model works for batched HTTP fetches, service-to-service calls, and other key/value-shaped IO. If a dependency can be expressed as “batch these keys, return these results”, it can fit in the same framework.
The key insight: you keep writing the simple single-fetch version (fetch (UserById uid)), and the library figures out at runtime which fetches can be batched together. This eliminates manual key collection and call-site restructuring (beyond changing the type signature and using fetch instead of an inline query). No discipline required at the point of use.
The mechanism by which we achieve this is Applicative composition. When multiple fetch calls are combined with <*> (or ApplicativeDo-desugared independent binds), sofetch knows they’re independent and can be grouped into a single batch. You get the ergonomics of single-item fetching with the performance of batch fetching.
You define key types that describe what you’re looking for:
```haskell
newtype UserById = UserById (Key User)
  deriving stock (Eq, Ord, Show, Generic)
  deriving anyclass (Hashable)

instance FetchKey UserById where
  type Result UserById = User
```
And a DataSource instance that knows how to batch-fetch a group of them:
```haskell
instance DataSource AppM UserById where
  batchFetch keys = do
    let ids = [uid | UserById uid <- toList keys]
    users <- runDB $ selectList [UserId <-. ids] []
    pure $ HM.fromList
      [(UserById (entityKey e), entityVal e) | e <- users]
```
Now your rendering code uses fetch instead of direct database calls:
```haskell
renderPosts :: (MonadFetch m n, DataSource m UserById) => [Post] -> n [Html]
renderPosts posts =
  fetchThrough (UserById . postAuthorId) posts
    <&> map (\(post, author) -> renderPostCard post author)
```
Fifty posts, one SELECT ... WHERE id = ANY(...) query. And this works across function boundaries; if renderPostCard internally fetches comment counts, and a sidebar widget fetches some of the same authors, sofetch merges all the fetches in a given round automatically. Functions that were written independently, with no knowledge of each other, still get optimal batching when composed. This is the critical difference from manual batching: you don’t need to know what other functions are fetching in order to benefit from batching with them.
How does sofetch know which fetches to batch? This is the part that’s Haskell-specific in a deep way, but the underlying idea is simple enough to explain to anyone who’s written async code.
In most languages, when you write code that fetches data, you have two patterns: independent fetches and dependent fetches. Independent fetches don’t need each other’s results; fetching a user’s profile and fetching their notification count can happen in parallel. Dependent fetches do: you can’t fetch a user’s posts until you know which user you’re looking at.
In JavaScript, you’d express this distinction explicitly: Promise.all([getProfile(id), getNotifications(id)]) for independent, and getUser(id).then(user => getPosts(user.id)) for dependent. The programmer has to choose the right combinator.
Haskell has a type-level version of this distinction built into the language. Applicative composition (the <*> operator) combines computations whose results don’t depend on each other. Monad composition (the >>= operator, “bind”) combines computations where the second needs the first’s result. These aren’t just conventions; they’re different interfaces with different types, and the compiler knows which one your code is using. A library can look at the structure of a computation and see “these five fetches are applicative (independent), then there’s a bind (dependency boundary), then three more applicative fetches” without the programmer annotating anything.
sofetch exploits this. It collects all fetch calls within an applicative “round” (all the independent fetches that don’t depend on each other) and batches them into a single database query. When it hits a monadic bind, it knows the next fetch depends on a previous result, so it starts a new round.
Haskell has a compiler extension called ApplicativeDo that makes this ergonomic: you write what looks like normal sequential code, and the compiler’s desugarer automatically figures out which lines are independent of each other and uses <*> for those. You don’t have to manually call Promise.all or restructure your code; the compiler does the dependency analysis for you, and sofetch uses that analysis to batch.
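A sketch of what that looks like at a call site, extending the snippets above; CommentCountByPost and renderSummary are hypothetical here (a second key type defined exactly like UserById, and a pure rendering function).

```haskell
{-# LANGUAGE ApplicativeDo #-}

-- Ordinary-looking do-notation: the two fetches don't depend on each other's
-- results, so ApplicativeDo desugars them to <*>, and sofetch groups them into
-- the same round as every other fetch issued while the page renders.
renderPostSummary
  :: (MonadFetch m n, DataSource m UserById, DataSource m CommentCountByPost)
  => Entity Post -> n Html
renderPostSummary (Entity postKey post) = do
  author       <- fetch (UserById (postAuthorId post))
  commentCount <- fetch (CommentCountByPost postKey)
  pure (renderSummary post author commentCount)
```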
Within a round, three things happen: keys for the same data source are grouped into one batchFetch call, keys for different data sources run concurrently, and duplicate keys are deduplicated. The same key appearing in multiple call sites produces only one fetch, with all callers sharing the result via an IVar (a write-once mutable variable).
Here’s that entire process, from expression tree to executed queries, for a concrete example — rendering three posts that each fetch an author and a comment count:
A monadic bind introduces a round boundary. So the total number of rounds equals the sequential depth of your data dependencies, not the total number of fetches. A four-level deep rendering tree (authors → posts → comments → comment authors) produces four rounds, four SQL queries, regardless of how many entities are at each level:
Four queries total. Not 25+. The functions never coordinate with each other. They don’t know they’re being composed. sofetch handles it.
Here’s why I find this so much more satisfying than the traditional solutions: manual batching requires you to fight against composition. You have to know, at the top level, every piece of data that every nested function will need, collect all the keys, batch-fetch them, and thread the results down. This is the “extreme discipline” approach; it works, but it’s the opposite of how you want to write code. You want to write small, reusable functions that each fetch what they need, and compose them freely.
sofetch makes that work. You write renderComment as if it’s the only function in the world, fetching the comment’s author with a single fetch. Then you compose 200 of them and sofetch issues one query. The discipline isn’t at the call site anymore; it’s in the library, once, and every call site benefits automatically.
All three together
These three techniques don’t just add; they multiply. To see why, it helps to trace through a concrete example and understand where time is being spent at each stage.
Consider a page that renders 50 posts with authors and comment counts. Here’s what the trace waterfall looks like under the old regime; each line is a query, and the time is dominated by sequential round-trips:
And here’s the same page with all three optimizations applied:
Same data. Same application logic. Two queries instead of a hundred, pipelined so they execute concurrently, binary-encoded so decoding is 2-3× faster. Let’s walk through each stage of the improvement.
Under the old regime:
- 50 author queries + 50 comment-count queries = 100 queries
- Each query uses the text protocol: parsing overhead on both sides
- Each query is a synchronous round-trip: at 1ms each way, 100 × 2ms = 200ms of network dead time
- Total: ~200ms+ wall clock
The bottleneck here isn’t the database; it probably processes each query in microseconds. The bottleneck is 100 sequential round-trips, each one a synchronous wait.
With sofetch alone (batching, but still text protocol, still synchronous):
- 1 author query + 1 comment-count query = 2 queries
- 2 synchronous round-trips: 4ms of network time
- Total: ~10-15ms
This is already a massive win: we’ve eliminated 98% of the queries. The remaining time is actual query execution, result encoding, and decoding. But we can still do better: those 2 queries are still using the text protocol, and they’re still synchronous (the second waits for the first).
Now add persistent-postgresql-ng:
- 1 author query + 1 comment-count query = 2 queries
- Binary protocol: faster decode, no PersistValue allocation
- Pipelined: both queries sent in one flush, 1 effective round-trip
- = ANY($1) instead of IN (?,?,?,...): fixed SQL template, prepared statement reuse
- Total: ~3-5ms
From 200ms to 5ms. The integration point is the DataSource instance: sofetch’s batchFetch calls persistent’s selectList, and if you’re using persistent-postgresql-ng, that call automatically benefits from everything:
```haskell
instance DataSource AppM UserById where
  batchFetch keys = do
    let ids = [uid | UserById uid <- toList keys]
    -- This selectList call:
    --   Uses binary protocol (persistent-postgresql-ng)
    --   Sends WHERE id = ANY($1) instead of WHERE id IN (?,?,?,...)
    --   Decodes directly to User records, zero PersistValue allocation
    --   Benefits from pipelining if other queries are in flight
    users <- runDB $ selectList [UserId <-. ids] []
    pure $ HM.fromList
      [(UserById (entityKey e), entityVal e) | e <- users]
```
If your batchFetch issues multiple queries internally, persistent-postgresql-ng’s pipelining ensures they’re sent together. sofetch ensures you’re issuing the minimum number of batchFetch calls. The binary protocol ensures each call pays the minimum encoding cost. All three optimizations are complementary; each one helps even when the other two are already doing their work.
Buzzkill: this isn’t ready for public use
As I mentioned earlier in the article, I’m trying to solve this problem for the persistent ecosystem more broadly. The direct decoding mechanism that bypasses PersistValue can’t live solely in persistent-postgresql-ng; we have to land changes in persistent core to make it work. I’ve written an RFC proposing it as a backend-agnostic extension to persistent core. The design introduces RowReader, FieldDecode, and FromRow, all parameterized by an opaque backend-specific env type, so that the same TH-generated FromRow instance works for PostgreSQL, SQLite, MySQL, MongoDB, and Redis without modification.
I feel that this work also matters beyond performance. PersistValue’s fixed set of constructors has become a straitjacket. JSON and JSONB lose their distinction, shoved into PersistByteString. UUIDs round-trip through hex text in PersistLiteralEscaped. Composite types have no reasonable representation at all. Every time a backend wants to support a new type faithfully, the choice is either a breaking change to PersistValue (which ripples through every backend and every PersistField instance) or a lossy encoding into an existing constructor.
With FieldDecode/FieldEncode, a backend package can add support for any new type by publishing a new instance, without touching persistent core, without coordinating with other backends, and without breaking downstream users. A PostgreSQL backend could add FieldDecode PgRowEnv HStore or FieldDecode PgRowEnv TSVector on the day it ships, and users who don’t reference these types are completely unaffected.
The proposal follows esqueleto’s migration pattern: direct-path API lives in .Experimental modules with identical function names and strictly more constrained type signatures. You switch by changing one import. All existing code continues to compile unchanged.
If this sounds like something you’d want for your backend, the RFC is open for feedback.
Final thoughts
The cost of adoption is deliberately asymmetric between the two libraries, because the problems they solve are different.
persistent-postgresql-ng is a drop-in replacement. Swap the dependency, change the pool creation function, and your existing persistent code runs faster; binary protocol and pipelining are on by default. For the direct decode path (bypassing PersistValue), change one import to .Experimental and the additional type constraints select the zero-allocation path automatically. If your entity has a FromRow instance (generated by TH), you get the fast path. If it doesn’t, you get the standard PersistValue path, same behavior as before, no breakage.
sofetch asks more of you: you model your data access patterns as key types and data source instances. This is real work: for each kind of fetch in your application, you write a newtype for the key, a FetchKey instance declaring the result type, and a DataSource instance implementing the batch-fetch logic. In exchange, you get several things beyond performance. Your data access becomes declarative and testable: swap in MockFetch with canned data, no database needed. It becomes observable; TracedFetch gives you round-by-round instrumentation, and sofetch-otel adds OpenTelemetry support. And perhaps most importantly, the DataSource class is parameterized by your monad, so if it has access to a connection pool, config, or whatever else, your data sources do too. Missing instances are compile-time errors, not runtime Dynamic-lookup failures.
Neither library asks you to meaningfully restructure your application logic. The whole point (the reason I built these) is that you should be able to write the obvious, compositional code (mapM get, traverse fetch, small functions that each do one thing) and have the libraries handle performance. The traditional approach to database performance is “think carefully about every query at every call site.” These libraries replace that with “write the obvious code and let the infrastructure optimize it.” The discipline moves from the application programmer to the library, where it only has to be right once.
The screaming monkeys in the back of my mind are quieter now. Not silent; there are still thousands of tables and six years of accumulated data access patterns, and I will still flinch when I open a trace and see something that could be better. But building tools that let us fix things incrementally, without rewriting the world, and that give new code the fast path by default, is remarkably satisfying. That’s the dream for any of my best projects: a ratchet that only turns in the right direction.
PostgreSQL’s text protocol is still the default in every major Haskell library. The N+1 problem still exists in every for loop that hasn’t been given the right abstraction to work with. Synchronous round-trips still happen in most codebases. But we can fix all of that; these tools exist so those problems don’t have to persist.
persistent-postgresql-ng is being prepared for release alongside the RFC for direct codec support in persistent core. sofetch is available on Hackage today. If you’re interested in either, find me on GitHub (@iand675) or reach out directly.
References
- Hasql - A strong binary-protocol-first PostgreSQL library for Haskell
- asyncpg - Python’s gold standard for binary PostgreSQL communication
- DataLoader - A practical N+1 mitigation tool in GraphQL-heavy stacks
- pgx - A high-performance PostgreSQL driver for Go
- Hedis pipelining technique - The Redis-side technique that inspired the lazy pipelining shape here
Footnotes
1. The Wikipedia article on the N+1 query problem is a reminder that this issue has been documented for years and still shows up everywhere.
2. Well, some folks might have. But in this specific case, the engineers who wrote those queries were doing their job: shipping features for a startup that needed to grow. The problems only revealed themselves later, when production traffic patterns changed.
3. The asyncpg vs psycopg2 performance comparison shows binary protocol improvements of 2-3x for simple queries and up to 10x for complex data types.
4. Facebook’s Haxl, the original inspiration for sofetch’s design. Their approach to batching requests still holds up well.
6. I cannot fathom why the people who named these things ended up settling on terminology that sounds like poor-taste dead baby jokes. Unfortunate convergence of names, I suppose? Anyways, see GHC’s garbage collection internals.
7. Channable’s writeup on Haskell memory management covers GC behavior under real production loads in detail.