The Production Outage — Where the Fix Was Worse Than the Bug
One missing property line.
Two production outages.
The hot fix between them made the second one inevitable.
This is the story of an outage where every reasonable next step made things worse — and where the eventual resolution was a hard “no” to the reflex answer “just raise the pool size.” It’s also the companion piece to the previous post on Spring Boot concurrency and connection pools. That one taught the mental model: a transaction holds a connection; the pool is capped; slow I/O inside transactions is fatal. This one is what happens when you try to escape that model by giving each service a bigger pool. The database has its own ceiling, and it’s often lower than you think.
If you skip to one section, skip to the arithmetic in section 5. That’s the single formula every backend team on a shared database should be doing at capacity-planning time, and almost none are.
1. The setup
The backend is a Spring Boot monolith split across four cooperating services on Heroku: api (the main HTTP surface), engine (the async job runner), messaging (outbound integrations), and integration (inbound webhooks). All four talk to the same Postgres. Each runs on multiple dynos — small Heroku process containers with their own JVM and their own Hikari connection pool.
In the environment where this incident happened:
- Postgres max_connections = 250: the hard ceiling the database will accept before refusing new connections with “too many clients”.
- Two dynos total: one web for api, one worker for engine. messaging and integration run on the same two dynos as sidecars in this environment (a smaller tier than the higher-headroom environment we run for other customers).
- spring.datasource.hikari.maximum-pool-size was set explicitly for dev, e2e, and the larger production environment. For this production environment specifically, no override existed.
That last line is the whole bug, and the article’s first surprise is what it actually means.
2. The first outage: apps refused to start
The visible symptom was pedestrian and terrifying at the same time. During a normal deploy, the new dynos wouldn’t come up. Spring Boot startup would proceed through bean registration, hit the Hikari initialisation phase, and throw:
com.zaxxer.hikari.pool.HikariPool$PoolInitializationException:
Failed to initialize pool: FATAL: sorry, too many clients already
too many clients means Postgres turned the connection away during session startup because every connection slot was already taken: the database was saturated. Every new dyno startup made it slightly worse. The rolling deploy stalled. Customers saw 5xx from api because the old dyno had already been drained.
The first hypothesis was the natural one: “one of the workloads is misbehaving and leaking connections.” Twenty minutes later, that hypothesis was dead. Nothing was leaking. What was happening was more subtle.
To see it, you have to understand a Heroku deploy transition.
3. The trace: deploys are the moment of peak connection demand
The Heroku deploy model is “boot the new dynos before killing the old ones.” For a brief window — usually 30-60 seconds while the new dynos become healthy — both generations exist simultaneously. Every connection the old dynos are holding, plus every connection the new dynos are opening to become ready, both count against max_connections. Deploy time isn’t just a rolling swap. It’s the environment’s connection-demand peak.
Which suddenly meant the actual question wasn’t “how many connections does the app use in steady state?” It was “how many connections does the app use during the ~60-second overlap window?”
To answer that in the moment, we needed to know the effective pool size on each service. And the ugly discovery here was: nobody knew what the effective pool size actually was in this environment.
The larger production environment’s application-*.properties set the pool explicitly (70 for api, lower values for others). The dev and e2e environments did the same. This production environment’s application-*.properties, though — for reasons lost to a git-blame in 2024 — did not.
The base application.properties, sitting quietly at the root of the config tree, had spring.datasource.hikari.maximum-pool-size=10 set explicitly. That value dated to an early-development-era commit when the “sensible default” it was written to protect against was a local developer accidentally exhausting a tiny local Postgres. It had never been revisited.
Spring’s profile-specific properties override individual keys, not whole sections. Every environment that overrode the pool inherited that override, so nobody in review ever noticed the base value. This environment’s profile didn’t override, so it silently inherited 10.
Two properties files, one line missing from one of them, and the entire environment has been running at ten connections per service for months. Nobody ever wrote “10” while thinking about this environment. Ten was a default someone else chose for a different reason years earlier — and every environment except this one had a profile-level override sitting on top of it hiding that fact.
We had been running a pool of 10 for months.
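The key-by-key inheritance is easy to reproduce outside Spring. The sketch below is an analogy only: it uses java.util.Properties with a defaults chain to stand in for the base file plus a profile file, and it is not Spring’s actual property-resolution code. The class name and the two example profiles are hypothetical; the key and the values 10 and 70 come from the story above.

```java
import java.util.Properties;

// Analogy only: a Properties defaults chain standing in for "base file + profile file".
final class ProfileOverrideDemo {

    static final String POOL_KEY = "spring.datasource.hikari.maximum-pool-size";

    /** The base file: sets the pool size once, long ago. */
    static Properties baseFile() {
        Properties base = new Properties();
        base.setProperty(POOL_KEY, "10");
        return base;
    }

    /** Layers a profile on top of the base; lookups fall through to the base key by key. */
    static Properties layered(Properties base, Properties profileOverrides) {
        Properties effective = new Properties(base); // base becomes the fallback
        effective.putAll(profileOverrides);          // profile wins only for keys it sets
        return effective;
    }

    public static void main(String[] args) {
        Properties largeProd = new Properties();
        largeProd.setProperty(POOL_KEY, "70");       // this profile overrides the key

        Properties smallProd = new Properties();     // this profile never mentions it

        System.out.println(layered(baseFile(), largeProd).getProperty(POOL_KEY)); // 70
        System.out.println(layered(baseFile(), smallProd).getProperty(POOL_KEY)); // 10
    }
}
```

Nothing in the second profile is wrong when read on its own, which is why a review of that file alone would never surface the inherited 10.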
4. The hot fix: six minutes to write, and a lie in every direction
The pattern was now visible: pool=10 × 2 dynos = 20 connections per app, four apps means ~80 connections at steady state, and deploy overlap briefly doubles that to ~160. Still under 250 in principle, but the older dynos were also fighting to keep connections open on longer-lived JDBC operations, and Hikari was holding on to idle connections (with minimumIdle unset, each pool stays filled to its maximum; keepAliveTime, where set, only keeps those idle connections alive rather than releasing them). In practice, the environment was hovering at ~90% of max_connections continuously, and the deploy pushed it over.
The hot fix took six minutes to write. Set spring.datasource.hikari.maximum-pool-size explicitly per service in the production profile. Redeploy.
Apps started. Traffic resumed. The team lead sent the “we’re back” Slack message. It was pre-lunch, morale was high, we’d caught it fast.
We had also, without realising it, made the next deploy a bigger problem.
The number we’d picked for the pool wasn’t chosen against a total budget. It was chosen against the previous value in a spirit of “we clearly weren’t running with enough headroom, let’s give ourselves more.” Nobody sat down and did the arithmetic across every service and every dyno. And when the next deploy landed — six hours later, a routine PR — the outage came back, worse.
By the time of that next deploy, both the running dynos and the incoming ones were configured with the raised pool of 25. Four apps × two dynos × pool of 25 gives every generation of dynos an upper limit of about 200 connections. During the ~60-second overlap window, both generations can hold up to that upper limit simultaneously. Peak demand: up to ~400 connections. The database’s ceiling: 250.
too many clients came back, this time not because the app had too few connections available, but because it had asked for too many.
Same error class. Opposite root cause.
5. The arithmetic every backend team should be doing
If you take one thing from this article, take this formula. It’s not new. It’s not clever. It is the thing that got skipped, and the thing that gets skipped in almost every production incident of this shape I’ve seen.
For any Postgres-backed backend:
(apps × dynos_per_app × pool_size) + deploy_overlap ≤ max_connections
Where:
- apps is the number of services that share this database. For us: four.
- dynos_per_app is how many process instances of each app run concurrently. For us: two (one web, one worker, shared across apps).
- pool_size is the effective Hikari maximum-pool-size per app process.
- deploy_overlap is the additional connections held by old-generation dynos while new-generation dynos are starting. Under a rolling deploy, this can be up to another steady-state total’s worth for a 30-60 second window; the worst case is when old dynos haven’t drained yet while new dynos are eagerly opening connections. In practice it’s often less than a full doubling, but the bound is what matters for the arithmetic.
- max_connections is what Postgres will actually allow. Not what it’s tuned for; what it will actually accept. Query with SHOW max_connections; in psql. (A small note: Postgres reserves a few connections for the superuser for admin access, so the effective ceiling is a few below the announced max_connections. Don’t plan for the whole number.)
One important nuance about the pool_size term. This is HikariCP’s maximumPoolSize: the upper bound the pool will grow to under load. Under quiet traffic, actual open connections are lower only if minimumIdle has been set below the maximum, in which case Hikari retires idle connections down to minimumIdle after idleTimeout. By default minimumIdle equals maximumPoolSize, so the pool behaves as fixed-size and stays full. For capacity planning, either way, you always compute against the max. Under load, or under a sudden burst, Hikari will grow to it, and it will hit whichever ceiling comes first: the pool’s, or the database’s.
For our production environment, worked example:
Steady state: 4 apps × 2 dynos × 10 pool = 80 connections
Deploy overlap: another 80 briefly on top = 160 connections at peak
Available: 250
That’s 64% of the ceiling — comfortable, and consistent with what monitoring showed us when we finally started charting it.
After the hot fix, with pool raised to 25:
Steady state: 4 apps × 2 dynos × 25 pool = 200 connections
Deploy overlap: another 200 briefly = 400 connections at peak
Available: 250
400 > 250. The second deploy was going to trip on this the moment it happened. There was no way it wouldn’t.
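The two worked examples are small enough to keep as an executable check. This is a minimal sketch, assuming the worst-case full doubling during deploy overlap; the class and method names are hypothetical, and the reserved-slot count of 3 used in main is an assumption (it matches the stock Postgres default for superuser_reserved_connections, but you should check your own database).

```java
// Minimal connection-budget check; assumes deploy overlap doubles steady-state demand.
final class ConnectionBudget {

    /** Upper bound on connections held by one generation of dynos. */
    static int steadyState(int apps, int dynosPerApp, int poolSize) {
        return apps * dynosPerApp * poolSize;
    }

    /** Worst case during a deploy: old and new generations both at their maximum. */
    static int deployPeak(int apps, int dynosPerApp, int poolSize) {
        return 2 * steadyState(apps, dynosPerApp, poolSize);
    }

    /** True when the deploy peak fits under the ceiling minus reserved slots. */
    static boolean fits(int apps, int dynosPerApp, int poolSize,
                        int maxConnections, int reservedSlots) {
        return deployPeak(apps, dynosPerApp, poolSize) <= maxConnections - reservedSlots;
    }

    public static void main(String[] args) {
        int reserved = 3; // assumption: superuser-reserved slots on this database
        System.out.println(deployPeak(4, 2, 10) + " fits=" + fits(4, 2, 10, 250, reserved));
        System.out.println(deployPeak(4, 2, 25) + " fits=" + fits(4, 2, 25, 250, reserved));
    }
}
```

Run against the numbers above, the first line reports a peak of 160 that fits and the second a peak of 400 that does not. Wiring a check like this into CI for each environment’s configuration is one way to make sure the arithmetic gets redone whenever a pool size changes.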
If you don’t have the arithmetic written down for your environments, you don’t know your capacity. You have opinions about your capacity. Those opinions are wrong more often than right on shared databases.
6. The paradox: you cannot copy production’s pool size to a smaller environment
The reflex when someone tells you “the pool is too small” is “look at what production uses and copy that.” The larger production environment we run — with higher-tier customers and a higher-tier Postgres — has pool sizes tuned against its own max_connections. What makes 70 comfortable there is that its database has substantial headroom. Copying that number to a smaller environment copies the “70” without copying the “substantial headroom” it depends on. The math that worked in one place presupposes an environment the other one doesn’t have.
Worse, on the smaller environment, there is essentially no “just raise the pool” answer at all.
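Take the deploy-overlap-doubling assumption and put the arithmetic against 250:
(4 × 2 × pool) × 2 (deploy overlap) ≤ 250
16 × pool ≤ 250
pool ≤ ~15.6
Round it down: on that specific database, no service can have a Hikari pool larger than about 15 without risking too many clients at deploy time, and that figure leaves nothing for the superuser-reserved slots or for anything else that connects. The 25 we had just deployed was well past it. And fifteen connections with zero margin is not a comfortable budget for a service whose transactions hold connections across slow I/O, let alone under bursts.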
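Solving the inequality for the pool size is a single integer division, which makes it easy to recompute per environment instead of deriving it by hand. A minimal sketch, assuming the full doubling during deploy overlap and an equal pool size for every service; the names are hypothetical. For 250 connections, four apps and two dynos per app, the doubled multiplier is 16, so the division yields 15.

```java
// Largest equal per-process pool that still fits under the ceiling at deploy peak.
final class MaxSafePool {

    /**
     * Solves (apps * dynosPerApp * pool) * 2 <= maxConnections - reservedSlots for pool,
     * rounding down because a pool size must be a whole number.
     */
    static int maxPoolSize(int maxConnections, int reservedSlots, int apps, int dynosPerApp) {
        int usable = maxConnections - reservedSlots;
        int slotsPerPoolUnit = apps * dynosPerApp * 2; // x2: old and new generation overlap
        return usable / slotsPerPoolUnit;
    }

    public static void main(String[] args) {
        // 250 / 16 = 15 (integer division), with no slots held back
        System.out.println(maxPoolSize(250, 0, 4, 2));
        // holding back 3 slots (an assumed reserve) still gives 247 / 16 = 15
        System.out.println(maxPoolSize(250, 3, 4, 2));
    }
}
```

The result is an upper bound with no margin, not a recommendation: anything else that connects to the database (migrations, ad-hoc psql sessions, monitoring) comes out of the same number.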
So the article’s title isn’t rhetorical. There genuinely is no right pool number here. “Raise it” fails at the ceiling. “Keep it low” fails at real load. You can shuffle the value around and pick the least bad option, but there is no value at which the arithmetic is comfortable. The problem is not the pool size. The problem is the connection budget.
Which means the fix must live somewhere other than the pool.
7. Making the effective pool size visible everywhere
Before we could argue about which fix belonged where, we needed something more basic: to see the effective pool size across every service in every environment. If a team can’t tell you what pool their production is running with in the current five seconds, they can’t be part of any solution — they don’t have the data.
The naive move is to log the property at startup:
env.getProperty("spring.datasource.hikari.maximum-pool-size")
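This is a trap, and a close relative of the one the outage was hiding inside. A library’s built-in default is not a property. If no properties file sets the key, Environment has no entry for it: the property is “absent,” not “defaulted to 10,” even though 10 is exactly what HikariCP falls back to. In our incident the 10 happened to be written down in the base file, so the lookup would have found it; remove that line and the pool would still be 10 while the log printed null. Logging the property tells you what someone wrote, not what Hikari is actually using. The one case where logging matters most, the silent-default case, is exactly the case this approach can’t see.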
The correct move is to stop asking the configuration text and ask the live bean:
HikariDataSource hikari = dataSource.unwrap(HikariDataSource.class);
int effectivePoolSize = hikari.getMaximumPoolSize();
getMaximumPoolSize() returns whatever Hikari resolved to — from a property file, from the library’s default, from a @Bean override, from anywhere. It is by definition the effective value. That’s the number you want in your logs.
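The difference between “what was written” and “what is in effect” can be shown without Spring or Hikari on the classpath. In the sketch below, StandInPool is a hypothetical stand-in for HikariDataSource, given the same built-in default of 10, and bind is a simplified imitation of property binding, not Spring Boot’s real binder.

```java
import java.util.Properties;

// Shows why reading the configuration text misses a library's built-in default.
final class EffectiveValueDemo {

    static final String POOL_KEY = "spring.datasource.hikari.maximum-pool-size";

    /** Hypothetical stand-in for a pool class that carries its own default. */
    static final class StandInPool {
        private int maximumPoolSize = 10; // built-in default, never written in any file

        void setMaximumPoolSize(int value) { this.maximumPoolSize = value; }

        int getMaximumPoolSize() { return maximumPoolSize; }
    }

    /** Simplified binding: the setter is only called when the key is present. */
    static StandInPool bind(Properties config) {
        StandInPool pool = new StandInPool();
        String raw = config.getProperty(POOL_KEY);
        if (raw != null) {
            pool.setMaximumPoolSize(Integer.parseInt(raw));
        }
        return pool;
    }

    public static void main(String[] args) {
        Properties nothingConfigured = new Properties();
        System.out.println(nothingConfigured.getProperty(POOL_KEY));         // null
        System.out.println(bind(nothingConfigured).getMaximumPoolSize());    // 10
    }
}
```

The configuration lookup reports nothing while the live object reports 10, which is the gap that asking the bean closes.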
That leaves three implementation subtleties, each of which turned out to be a real design decision, not a formality.
First, when to log. ContextRefreshedEvent fires early and can fire multiple times (context hierarchies, manual refreshes). ApplicationReadyEvent fires once, after every bean is fully built and the app is officially ready to serve. That’s the natural place.
Second, how to reach the bean without breaking startup on any service. ObjectProvider<DataSource> is the defensive-injection idiom: iterate zero, one, or many DataSource beans without ever throwing NoSuchBeanDefinitionException on a service that has none. And DataSource.isWrapperFor(HikariDataSource.class) before unwrap(...) skips proxied-but-non-Hikari data sources silently.
Third, where to put the class so every service picks it up. This is where the design choice matters most.
The obvious answer is “put a @Component in the shared module.” That relies on component-scanning. In a real polyrepo or a multi-service monorepo, each service’s @ComponentScan is likely different: some scan com.company broadly, others use explicit allowlists. A plain @Component in a shared module gets picked up by services that scan broadly and silently skipped by services with allowlists — with no error to warn you. Which is precisely the “works in three services, invisible in the fourth” silent-failure this article is about.
The right pattern is Spring Boot’s auto-configuration mechanism, which does not go through component scanning at all:
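import com.zaxxer.hikari.HikariDataSource;
import java.sql.SQLException;
import java.util.Arrays;
import javax.sql.DataSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.ObjectProvider;
import org.springframework.boot.autoconfigure.AutoConfiguration;
import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.ApplicationListener;
import org.springframework.context.annotation.Bean;
import org.springframework.core.env.Environment;

@AutoConfiguration
@ConditionalOnClass(HikariDataSource.class)
public class DataSourcePoolLoggingAutoConfiguration {

    private static final Logger log =
            LoggerFactory.getLogger(DataSourcePoolLoggingAutoConfiguration.class);

    // Runs once the app is ready; visits zero, one, or many DataSource beans.
    @Bean
    public ApplicationListener<ApplicationReadyEvent> logEffectiveHikariPoolSize(
            ObjectProvider<DataSource> dataSources, Environment environment) {
        return event -> dataSources.forEach(ds -> logPoolSize(ds, environment));
    }

    private void logPoolSize(DataSource dataSource, Environment environment) {
        try {
            // Skip data sources that are not backed by Hikari.
            if (!dataSource.isWrapperFor(HikariDataSource.class)) return;
            HikariDataSource hikari = dataSource.unwrap(HikariDataSource.class);
            log.atInfo()
                    .addKeyValue("hikari.poolName", hikari.getPoolName())
                    .addKeyValue("hikari.maxPoolSize", hikari.getMaximumPoolSize())
                    .addKeyValue("activeProfiles", Arrays.toString(environment.getActiveProfiles()))
                    .log("Effective Hikari pool [{}] maximum-pool-size={} for active profiles {}",
                            hikari.getPoolName(), hikari.getMaximumPoolSize(),
                            Arrays.toString(environment.getActiveProfiles()));
        } catch (SQLException e) {
            log.error("Failed to read the effective Hikari pool size at startup", e);
        }
    }
}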
Plus a one-line registration file — META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports — that lists the class. Spring Boot discovers auto-configurations by reading that file, regardless of any service’s @ComponentScan configuration. Every service that depends on the shared module logs its effective pool size at startup, automatically, forever.
The subtleties in one paragraph so you can revisit later without re-reading: @AutoConfiguration is the classpath-registered variant of @Configuration that Spring Boot discovers without scanning. @ConditionalOnClass(HikariDataSource.class) activates the config only where Hikari is present, without throwing NoClassDefFoundError on services where it isn’t (Spring Boot evaluates the condition via bytecode inspection, not classloading). ObjectProvider<DataSource> handles zero-or-many injection safely. isWrapperFor + unwrap drills through Spring’s proxy layers to reach the real HikariDataSource. log.atInfo().addKeyValue(...) emits structured fields into whatever log platform the service uses, so “the effective pool sizes across every service in every environment” becomes a queryable dashboard instead of a text-grep exercise.
That’s the observability tooling. It doesn’t fix the outage. It just makes it possible to have a real discussion about the outage.
8. Why the actual fix isn’t in the pool at all
The concurrency primer’s spine bears repeating here:
Every @Transactional method holds a connection for the entire duration of the method call. Anything else the method does while it holds that connection (HTTP calls, file uploads, external OCR, sleep) is time the connection is checked out and unavailable to anyone else.
If a service’s connections are being held (meaning: transactions are open, doing slow I/O, waiting on external calls) then the effective concurrency of the service is bounded by the pool. Raising the pool makes more parallel holds possible. But every hold, across every service, still has to fit under the database’s ceiling. On a small database, that ceiling is not far above the sum of the pools. Raising the pool gets you a little more; then the database slams the door.
The lever that keeps working is not making more connections available. It is making each connection held for less time.
Concretely, and in order of impact for real applications:
- Move slow I/O out of the transaction window. Do the external HTTP call, the file download, the OCR before opening the transaction, or after closing it. Keep the @Transactional block scoped to the writes only. A method that holds a connection for five seconds while OCR runs, refactored to hold it for fifty milliseconds while it just writes the OCR result, needs one hundredth of the connection capacity to serve the same throughput.
- Move the whole workflow off the request thread. Even before you shorten the DB window, if the request thread is what’s blocking on the five-second call, that request thread is also unavailable to serve other HTTP requests. Use @Async (with an explicit bounded executor), so the request thread returns immediately and the slow work happens on a pool designed for it.
- Cap the async pool. If a burst of 500 events would trigger 500 concurrent slow workflows, you don’t want 500 in flight: you’d need 500 connections. A bounded executor with a handful of workers and a queue holds bursty demand at the queue level, without borrowing any connections.
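The capping behaviour in the last item can be demonstrated with the JDK alone. This is a minimal sketch, not the service’s actual executor configuration: the class name, the worker count of 4, and the 1 ms sleep standing in for a slow workflow are all assumptions. It submits a burst of tasks to a fixed-size ThreadPoolExecutor with a bounded queue and reports the highest number of tasks that were ever running at once.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// A burst of work held at the queue: in-flight work never exceeds the worker count.
final class BoundedBurstDemo {

    /** Submits `tasks` simulated workflows and returns the peak number in flight. */
    static int maxInFlight(int tasks, int workers, int queueCapacity) {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        // core == max == workers, so the executor never grows past `workers` threads.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(1); // stands in for the slow workflow
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                }
            });
        }
        executor.shutdown();
        try {
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("burst did not drain in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 500 events, 4 workers, queue large enough to absorb the whole burst.
        System.out.println("peak in flight: " + maxInFlight(500, 4, 500));
    }
}
```

The queue capacity is the design decision hiding here. With the default rejection policy, a burst larger than workers plus queue capacity makes execute throw RejectedExecutionException, so the capacity and the rejection handler need to be chosen deliberately rather than left to chance.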
The concurrency primer covers each of these in more depth. The relevant part for this article is that all three of them reduce the connection budget the service consumes. And reducing the budget is the only lever that keeps working on a small-database environment where the pool ceiling can’t safely rise.
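The “one hundredth of the connection capacity” claim follows from a standard back-of-the-envelope rule (Little’s law): connections in use at any moment are roughly the arrival rate multiplied by the time each connection is held. A minimal sketch, with a request rate of 20 per second chosen purely for illustration; the names are hypothetical.

```java
// Rough estimate of connections in use: arrival rate x hold time.
final class HoldTimeBudget {

    /** Connections needed on average, rounded up to a whole connection. */
    static long connectionsNeeded(long requestsPerSecond, long holdMillis) {
        return (requestsPerSecond * holdMillis + 999) / 1000;
    }

    public static void main(String[] args) {
        long rate = 20; // assumption: illustrative request rate, not a measured figure
        System.out.println(connectionsNeeded(rate, 5_000)); // OCR inside the transaction
        System.out.println(connectionsNeeded(rate, 50));    // transaction covers the write only
    }
}
```

At that rate the five-second hold needs about 100 connections and the fifty-millisecond hold needs about 1, the same factor of a hundred at any throughput. It is an average, so real pools still need margin for bursts.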
The lesson at the top of this section is worth stating one more time in blunt form: on a shared database, connection budget is not a per-service question, it’s a per-database question. The four services share a database. The database has a hard ceiling. Every service’s pool comes out of the same pot. The productive question isn’t “what pool size should service X use” — it’s “what’s the total budget for this database, and how do we split it while each service still functions”. That question has a very different answer, and it’s often “nobody can have as much as they’d want; every service has to work harder to hold connections for less time.”
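One way to make the per-database question concrete is to start from the ceiling and divide downward, rather than picking per-service numbers and adding up. The sketch below is one possible approach under stated assumptions, not the split we actually used: the weights are hypothetical, 3 reserved slots is an assumed reserve, and the deploy peak is assumed to be a full doubling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Top-down split of one database's connection budget across the services sharing it.
final class BudgetSplit {

    /** Hypothetical relative weights; a real split would come from measured demand. */
    static Map<String, Integer> exampleWeights() {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("api", 3);
        weights.put("engine", 2);
        weights.put("messaging", 1);
        weights.put("integration", 1);
        return weights;
    }

    /** Returns a per-process pool size for each service, rounding down at every step. */
    static Map<String, Integer> poolSizes(int maxConnections, int reservedSlots,
                                          int dynosPerApp, Map<String, Integer> weights) {
        int perGeneration = (maxConnections - reservedSlots) / 2; // leave room for overlap
        int totalWeight = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Integer> pools = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            int appShare = perGeneration * entry.getValue() / totalWeight;
            pools.put(entry.getKey(), appShare / dynosPerApp);
        }
        return pools;
    }

    /** Worst-case connections if every pool is full in both generations. */
    static int deployPeak(Map<String, Integer> pools, int dynosPerApp) {
        int perGeneration = pools.values().stream().mapToInt(Integer::intValue).sum() * dynosPerApp;
        return 2 * perGeneration;
    }

    public static void main(String[] args) {
        Map<String, Integer> pools = poolSizes(250, 3, 2, exampleWeights());
        System.out.println(pools + " peak=" + deployPeak(pools, 2));
    }
}
```

Because every step rounds down, the total can never exceed the ceiling, whatever weights are chosen. What the split cannot do is make the total larger; it only makes visible who is giving up connections to whom.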
9. What I’d tell another engineer
The reflex when a service can’t get a connection is “raise the pool.” On a database with lots of headroom that works. On a shared database with a real ceiling, which is most non-toy production environments, it works only for a while. The next deploy, or the next traffic burst, hits the ceiling from the other side, and the same error message shows up meaning the opposite thing. If you see too many clients for the first time and reach for the pool as a knob, you’re likely trading one outage for a bigger one a few hours later.
Before touching anything, do the arithmetic across every service that shares the database. Add up steady-state demand, double it for deploy overlap, compare it to max_connections. If the number is close, no per-service pool tweak will save you — the fix has to reduce time-under-transaction somewhere, or the database itself has to grow, or a service has to move to its own database. Those are the three real levers.
And on the observability side: “nobody knows what pool size we’re actually running” is not a soft problem, it’s the hard problem underneath the outage. If you cannot answer “what’s the effective pool size for every service in every environment right now” in less than thirty seconds, the outage is going to happen and you’re going to be surprised by it. Building a small auto-configuration that logs the effective values at ApplicationReadyEvent — reading the live bean, not the Environment — took an afternoon and permanently removed “we didn’t know” from every future incident of this shape.
One missing property line, two production outages, and a discipline that starts with the arithmetic. If your team has the arithmetic written down and refreshed every quarter, you probably aren’t reading this article to learn something. If you don’t — that’s this weekend’s PR. It’ll cost you an afternoon and save you the next outage.



