A bricked storage array, a 2+4 SLA that technically performed, and a technician asking about lunch while executives circled. We learned that risk transfer is an illusion when your blood is on the floor.
January 2026 · Stefano Schotten
The contract was honored. The business still bled.
My case manager called me from the customer site. I could hear the tension before he said a word.
“The VPs are pacing. Four of them, maybe five. They’re all just… standing around IT, watching.”
The storage controllers had been bricked since Saturday. It was now Monday, approaching noon. The manufacturer’s technician had arrived within SLA, opened his backpack, and asked the Infrastructure Manager: “Do you have the console cable?”
Silence.
The manager didn’t have it. Not on him. The company owned dozens of DB-9 serials somewhere—“bottom drawer,” probably—but not in that room, not at that moment. The technician made a call. “A cab is bringing one. Less than an hour.”
Then—my case manager still on the line, narrating this in disbelief—the technician looked up, surrounded by executives whose P&L was hemorrhaging by the minute, and asked like it was a regular Tuesday: “Well, nothing to do. Where’s a good place for lunch around here?”
My case manager paused. “The MD is going to call you.”
He did. Asked me if this was a joke.
I heard him out. Told him I’d see what I could do. There was nothing to do. The manufacturer was covered. The customer had the best SLA money could buy. The technician was hungry. The cab arrived a few minutes late. Everybody waited while the storage guy finished his meal.
The Setup
This was a customer with on-premises infrastructure—their data center, their MEP (mechanical, electrical, and plumbing) team, their maintenance windows. A few days earlier, they’d warned us about scheduled electrical work over the weekend. “Don’t worry if things go offline.” We didn’t. Zabbix went into maintenance mode.
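For context, “maintenance mode” here is literal: Zabbix lets you suppress alerting for a defined window through its JSON-RPC API, which is what we did ahead of the announced electrical work. Here is a minimal sketch of that call, assuming an API token and a host group for the customer’s site (the endpoint, token, and group ID are placeholders; parameter names also shift between Zabbix versions, e.g. groupids in older releases vs. groups in newer ones):

```python
import time

import requests  # third-party HTTP client: pip install requests

ZABBIX_URL = "https://zabbix.example.internal/api_jsonrpc.php"  # placeholder endpoint
API_TOKEN = "replace-with-an-api-token"                         # placeholder token

now = int(time.time())
window_s = 48 * 3600  # cover the whole electrical-works weekend

payload = {
    "jsonrpc": "2.0",
    "method": "maintenance.create",
    "params": {
        "name": "Customer DC electrical works (weekend)",
        "active_since": now,
        "active_till": now + window_s,
        "groupids": ["42"],  # host group for the customer's site (placeholder ID)
        "timeperiods": [{"timeperiod_type": 0, "period": window_s}],  # 0 = one-time window
    },
    "auth": API_TOKEN,
    "id": 1,
}

resp = requests.post(ZABBIX_URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```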
What we didn’t know: their UPS was fed single-phase, taken phase-to-phase off a delta supply. The MEP team, for whom data center operations wasn’t core business, didn’t fully understand the topology. After the maintenance on Saturday morning, only one phase came back up.
The UPS went haywire. Flapping between battery and mains for a few thousand cycles. By the time it stabilized, both storage controllers—no hard power-off switch—were bricked.
We had a field engineer nearby. Dispatched immediately. Thirty minutes later: “Yeah, it’s really bricked.” Time to call support.
The customer had a DR site. Storage replicated to another building. But compute capacity on the secondary side had never matched the primary’s—budget approval entropy, justified by “it never happened before.” We’ll unpack Theoretical DR in a future piece; for now, know that entropy often wears a budget approval mask.
So the business would bleed either way. The question was how long.
The Risk Framework
As a CISSP, I carry a simple axiom about identified risk: you have four options. Accept, mitigate, avoid, or transfer. Pick one, execute it fully or at least palliatively, narrow your blast radius and impact depth. Move on.
The customer had chosen transfer. They’d purchased the manufacturer’s premium SLA: 2+4. Two hours for on-site diagnosis, four hours to resolution. Six hours total felt acceptable in their risk matrix.
Here’s the thing: the ticket wasn’t opened until Monday morning. The UPS failed Saturday. The controllers bricked Saturday. But the MEP team thought it was an IT problem. IT thought MEP had it handled. The weekend passed. You know the saying—everybody’s job is nobody’s job.
So when the VPs started pacing at 9 AM Monday, the SLA clock had barely started. The contract would perform exactly as written. The technician arrived within two hours. He had the replacement controllers. The resolution would have hit the four-hour window—if not for a cable nobody thought to stage.
What the risk matrix didn’t capture: the operational reality of a 48-hour head start that nobody took. The executives circling. The phone calls escalating. The MD demanding answers from a vendor who was, technically, doing everything right—on a clock that started two days late.
The map said “risk transferred.” The territory said “your blood, your floor.”
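If you want that map-versus-territory gap in numbers, it’s one subtraction. A minimal sketch, with timestamps loosely reconstructed from the story rather than pulled from any ticket system:

```python
from datetime import datetime, timedelta

# Reconstructed timestamps for illustration only; the exact hours are mine, not the ticket's.
controllers_bricked = datetime(2026, 1, 10, 8, 0)   # Saturday morning, after the electrical work
ticket_opened       = datetime(2026, 1, 12, 9, 0)   # Monday, when the VPs started pacing
sla_onsite          = timedelta(hours=2)            # the "2" in 2+4: on-site diagnosis
sla_resolution      = timedelta(hours=4)            # the "+4": resolution

contractual_exposure = sla_onsite + sla_resolution          # what the risk matrix priced
ownership_gap        = ticket_opened - controllers_bricked  # the weekend nobody owned
experienced_outage   = ownership_gap + contractual_exposure # before the cable and the lunch break

print(f"SLA clock           : {contractual_exposure}")
print(f"Nobody owned it for : {ownership_gap}")
print(f"Business experienced: {experienced_outage} at a minimum")
```

The contract never promised six hours of outage; it promised six hours once somebody started the clock.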
The Lessons
This incident crystallized three things we’d been learning the hard way across multiple customers:
1. Risk transfer is a legal concept, not an operational one.
The SLA protected the manufacturer. It didn’t protect the customer’s Monday—or the Saturday and Sunday that preceded it. When the controllers bricked, nobody owned the problem. MEP thought it was IT. IT thought MEP had it covered. The ticket opened 48 hours late. The contract said “six hours from ticket.” The business experienced sixty. You can transfer liability. You cannot transfer pain.
2. Dependencies you don’t control are dependencies that will fail you.
A DB-9 serial cable. Maybe ten dollars. The customer owned dozens. But ownership isn’t availability. The cable existed on the asset register. It didn’t exist in the room where it was needed. The map showed inventory. The territory showed empty hands and a communication gap.
3. Extended warranties create parallel work, not relief.
Here’s a question I started asking after incidents like this: when did an extended warranty actually remove operational burden from your team? Your people still have to be there. Still have to watch. Still have to escalate. Still have to wait. The manufacturer’s technician isn’t a replacement for your operations—he’s an addition to them. That’s not a multiplier. That’s overhead with a receipt.
What We Built
We made a decision at AMTI: if we’re the ones bleeding, we can’t outsource the tourniquet.
We developed local inventory at our facilities. Spare parts staged for over 95% of failure scenarios. Our field teams carried what they needed. No cabs. No “bottom drawer.” No lunch breaks while customers hemorrhaged.
The RMA process flipped. Instead of opening tickets during the crisis, we triggered manufacturer replacements in the postmortem—after the customer was already back online. The vendor became our restocking mechanism, not our recovery path.
Then came standardization. We built an internal component matrix: Connectrix cards, SFPs, disks, enclosures, processors, chassis—servers to JBODs. Everything cataloged. Everything interchangeable. Everything on hand.
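To make “everything cataloged” concrete, here is a toy sketch of the shape that matrix takes: failure scenarios mapped to the interchangeable parts that resolve them, checked against what is actually staged on the shelf. The part names, scenarios, and quantities are illustrative, not our real catalog:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """A failure mode and the part types needed to resolve it on-site."""
    name: str
    required_parts: frozenset

# What is physically staged at the local facility (illustrative quantities).
stock = {"sfp_10g": 12, "sas_disk_1_2tb": 30, "storage_controller": 2, "psu_2kw": 4}

scenarios = [
    Scenario("SFP failure", frozenset({"sfp_10g"})),
    Scenario("Disk failure", frozenset({"sas_disk_1_2tb"})),
    Scenario("Storage controller failure", frozenset({"storage_controller"})),
    Scenario("PSU failure", frozenset({"psu_2kw"})),
    Scenario("Backplane failure", frozenset({"backplane"})),  # the kind of gap the matrix exposes
]

covered = [s for s in scenarios if all(stock.get(p, 0) > 0 for p in s.required_parts)]
print(f"Local stock covers {len(covered) / len(scenarios):.0%} of modeled scenarios")
for s in scenarios:
    if s not in covered:
        missing = sorted(s.required_parts - set(stock))
        print(f"  gap: {s.name} -> missing {missing}")
```

The useful output isn’t the percentage; it’s the gap list, the scenarios where you would still be waiting on a cab.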
After a few years of data, we realized something uncomfortable: we didn’t need extended warranties anymore. We were paying 20-40% premiums for SLAs that, in practice, created friction and risk rather than reducing them. We stopped.
The Calcifying
This wasn’t just cost optimization. It was a strategic position.
Local inventory meant we controlled time-to-recovery. Standardization meant we controlled complexity. Dropping extended warranties meant we stopped paying for the illusion of transferred risk.
The business outcomes followed: faster recovery, lower total cost, fewer dependencies on external actors who—however competent, however contractually compliant—would never care about our customers’ Monday mornings the way we did.
We’d stumbled into something that felt like sovereignty.
The Uncomfortable Truth
I don’t blame the technician. He was hungry. He’d done his job—showed up on time, brought the parts. The contract didn’t say anything about console cables or executive anxiety or skipping lunch.
I don’t blame the customer. They’d bought the best coverage available. They’d done what the risk frameworks told them to do: identify, assess, transfer.
The failure wasn’t in the people. The failure was in believing that a signed contract could substitute for operational readiness. That a 2+4 SLA meant six hours when nobody owned the first forty-eight. That transferring risk on paper transferred it in reality.
Twenty years in data centers taught me this: the map is not the territory. Your SLA is a map. Your inventory spreadsheet is a map. Your DR runbook is a map.
The territory is Monday morning, controllers bricked, executives pacing, and a technician asking where to get lunch.
When there’s blood in the cold aisle, it’s yours. Never theirs, and never their lawyer’s.
Stefano Schotten