A few years ago, as Field CTO for an enterprise customer, I was pulled into a rescue effort that started the way these stories usually start: pain, urgency, and a narrative that felt convenient.

The application hit a bootstorm—150,000+ users slamming it in a short window—and then the predictable second-order effect: every day after that, more tickets piled up. Instability. Session timeouts. Intermittent failures. The kind of symptoms that turn a service into a rumor.

Inside the organization, the explanation had already crystallized into the classic line:

“The application scales infinitely. Infra is the constraint.”

I’m not allergic to that claim. Sometimes it’s true. Capacity walls exist. Storage, network, noisy neighbors, kernel behavior—real constraints show up under load.

But that phrase is also a comfortable hiding place. Because once you accept it, the next step is almost always the same: buy more runway. More VMs, more pools, more spend, more moving parts. And before you know it, infrastructure becomes the default lever, until the organization runs out of budget, patience, or both.

Here’s the security angle people miss: when you scale around a defect, you don’t just waste money—you expand attack surface, increase blast radius, and normalize fragility. Availability is not separate from security. It is security.

This time, “infra couldn’t give more runway” meant the footprint had already exploded into hundreds of VMs, and the ask was trending toward hundreds more. Meanwhile the service was still failing where it mattered: at the edge of user experience. Sessions dying. Requests hanging. Timeouts that made the whole system feel haunted.

That’s the moment I like to slow things down. Not to be philosophical, but because at that scale, guessing is expensive. And “add capacity” is one of the most expensive guesses you can make.

In my world, availability and cost explosions are security outcomes—because they widen blast radius and create the conditions for failure to repeat. So I took the lead on the diagnostic track and pushed for something unglamorous: instrumentation.

We deployed an APM. Nothing exotic. Just enough visibility to stop operating off symptoms and start seeing the shape of the failure: where time was being spent, what paths were stalling, and which dependencies were acting like quiet choke points.
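
The point of that instrumentation was simple: turn "it feels slow" into numbers per dependency call. As a toy illustration of the idea (not the APM the team used; the function and path names here are made up), even a stdlib-only timing wrapper makes a stalled dependency show up as data instead of folklore:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("callpath")

def timed(path_name: str):
    """Record wall-clock time per dependency call, success or failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                log.info("%s took %.3fs", path_name, time.monotonic() - start)
        return wrapper
    return decorator

@timed("session.lookup")   # hypothetical call-path name
def lookup_session(user_id: str):
    time.sleep(0.05)       # stand-in for the real SQL round trip
    return {"user": user_id}

lookup_session("u-123")
```

A real APM does this automatically across every tier; the principle is the same: measure where the time goes before deciding what to buy.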

And we found it.

There was a dependency call present in essentially every request path. Not some rare corner case. Not something you’d only hit under special conditions. It was part of the application’s breathing.

That call attempted to query a SQL endpoint that didn’t exist.

Not “was overloaded.” Not “was slow.” It wasn’t there. Wrong host, stale config, a dead reference that survived long enough to become “normal.”

From there, the chain reaction was almost boring in its predictability:

  • The application would attempt to reach this non-existent SQL server.
  • The network stack would wait, because timeouts are how systems politely fail while consuming your resources.
  • That waiting multiplied across threads, workers, and nodes.
  • Under bootstorm conditions, queues backed up immediately.
  • Latency ballooned.
  • Sessions expired.
  • Retries cascaded.
  • Autoscaling saw “load” and did what it’s designed to do: add capacity.

And that’s the trap: a failure mode that looks like load.

It fools humans and automation at the same time. You add nodes, it gets worse. You add more, it gets worse again. Because the bottleneck is not capacity. The bottleneck is a systemic stall poisoning every request path.
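
Here is that stall in miniature, as a hedged sketch rather than the customer's code. It assumes an address that is not routable from your network (so the connection attempt just waits), and the host, port, pool size, and timeouts are illustrative:

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: a stale address where nothing answers. Assuming the
# address is black-holed, connect() waits for the full timeout before failing.
DEAD_SQL_ENDPOINT = ("10.255.255.1", 1433)

def handle_request(connect_timeout_s: float) -> str:
    """A request path that touches the dead dependency on every call."""
    start = time.monotonic()
    try:
        with socket.create_connection(DEAD_SQL_ENDPOINT, timeout=connect_timeout_s):
            pass
    except OSError:
        pass  # the request fails "politely", after holding a worker for the full timeout
    return f"worker held for {time.monotonic() - start:.1f}s"

# 10 workers, a burst of 20 requests. With a 5s connect timeout the burst costs
# ~10s of pure waiting; with a 30s default, the same math is 60s of dead time
# per burst. Under a bootstorm, the bursts never stop arriving.
if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=10) as pool:
        for line in pool.map(handle_request, [5.0] * 20):
            print(line)
```

Nothing in that sketch is CPU-bound or storage-bound. Every worker is simply parked, which is exactly why adding capacity buys nothing.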

We fixed that single defect.

The footprint collapsed from hundreds of VMs to less than a dozen.

No miracle tuning. No re-architecture. Just removing a dependency call that never should have shipped, paired with timeout behavior that never should have been allowed to dominate the entire request path.

That was the moment the “infra constraint” narrative evaporated.

Here’s what made it land at the executive layer: the business was already halfway into approving a major infrastructure expansion—an expensive all-flash storage build-out—because that’s what the symptoms were “asking” for.

Once the defect was corrected, that expansion stopped making sense.

The customer avoided over $1M in storage spend they never needed. Time to resolution: less than three days, versus the months-to-quarters timeline the storage build-out would have required. And beyond the dollars, the team got something harder to quantify: relief from the operational toil of keeping a broken system upright.

Why this is Security Engineering

Some people still think security starts at identity and ends at encryption. That’s a narrow slice of the truth.

Availability is a security property.
Reliability is a security property.
Change discipline is a security property.

A single unvalidated dependency plus poor timeout behavior can take down your service as effectively as an attacker can. It can also drain budget, burn teams, and trap the organization in permanent incident response. In practice, it becomes an internal denial-of-service: the system attacks itself.

And when the failure isn’t attributable—when you can’t trace it, replay it, and force the organization to learn—you get the worst kind of risk:

Risk that repeats.

The fix isn’t better people. It’s better defaults.

I don’t believe in cultures that rely on hero engineers to catch every edge case. That’s not a strategy. That’s gambling with payroll and reputation.

The fix is making the safe path the easy path.

If I were writing the building code after that incident, it would be simple:

Guardrails that prevent silent failure from shipping

  • Dependency validation gate: if a release references an endpoint, it must resolve and respond in an environment that resembles production.
  • Deliberate timeout budgets: short enough to protect the system, consistent enough to reason about, aligned across tiers.
  • Circuit breakers + bulkheads: one dependency should never be allowed to poison every request path.
  • Retry discipline: bounded retries, backoff, jitter, and clear rules for when not to retry (see the sketch after this list).
  • Canary + automated rollback: tie releases to SLOs/error budgets—don’t “watch it closely.” Make rollback the default when thresholds trip.
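
As a concrete reference point for the timeout-budget, circuit-breaker, and retry bullets above, here is a minimal Python sketch. It assumes a generic dependency callable that accepts a `timeout` argument; the class, thresholds, and names are illustrative, not a production implementation (in practice you would reach for a service-mesh policy or a resilience library):

```python
import random
import time

class CircuitOpen(Exception):
    """Raised when the breaker sheds load instead of queueing behind a dead dependency."""

class CircuitBreaker:
    """Minimal single-threaded breaker: open after N consecutive failures, probe after a cool-down."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a single probe once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.consecutive_failures = 0
            self.opened_at = None
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def call_with_discipline(dependency, breaker, attempts=3, base_delay_s=0.2, attempt_timeout_s=2.0):
    """Bounded retries with a per-attempt timeout budget and full-jitter backoff."""
    last_exc = None
    for attempt in range(attempts):
        if not breaker.allow():
            raise CircuitOpen("dependency unhealthy; failing fast instead of holding a worker")
        try:
            result = dependency(timeout=attempt_timeout_s)  # every attempt gets an explicit budget
            breaker.record(ok=True)
            return result
        except (TimeoutError, ConnectionError) as exc:  # retry only what is plausibly transient
            breaker.record(ok=False)
            last_exc = exc
            # Exponential backoff with full jitter so retries across nodes don't synchronize.
            time.sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
    raise last_exc

def dead_sql_call(timeout: float):
    """Stand-in for the real dependency: always times out."""
    raise TimeoutError(f"no response within {timeout}s")

if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=3, cooldown_s=10.0)
    try:
        call_with_discipline(dead_sql_call, breaker)
    except TimeoutError as exc:
        print(f"gave up after bounded retries: {exc}")
```

The exact numbers matter less than the shape: every call has a budget, every retry has a ceiling, and a dependency that keeps failing gets cut off instead of quietly consuming the whole worker pool.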

Traceability that turns chaos into learning

  • Every change is attributable: linked to a PR/ticket with a clear “why,” not just “what.”
  • Clear ownership boundaries: dependency maps, SLOs, runbooks—owned, current, and enforced.
  • Postmortems that produce controls: owners + deadlines + verification, not narratives that fade by next sprint.

None of this is glamorous. That’s the point.

Security engineering, at its best, is not theater. It’s operational correctness under pressure, with guardrails that keep velocity high and prevent the organization from writing million-dollar checks for runway it never needed.

The pendulum swings between “move fast” and “lock it down.” But fundamentals don’t care about your methodology. Systems fail the same way regardless of what you call your process.

The only question is whether you’ve built the instrumentation and the discipline to see it before your customers do, and before “more capacity” becomes the default answer to a problem that was never capacity in the first place.