It was 2017. We had just deployed an additional ScaleIO cluster to handle the onboarding of a new customer with hundreds of VMs. Eight nodes, each with 40 Gbps at the backend. Beautiful. Efficient. The whole rack was a work of art—Dell R740s with MD1220 expansions, bezels removed so you could see all those drives blinking in perfect synchronization.
The cluster had been deployed less than two weeks earlier. I had told the customer to “burn it.”
I’ve written before about how we learned that “one-size-fits-all” isn’t the way to go with hyperconverged infrastructure. This story is about something else. This is about what happens when you trust your playbooks more than your people—and what happens when you fix that.
The Setup
Architecture reviewed. Blood spilled with the customer, then cleared. Promises made, my name and reputation on the line. I felt safe.
The day was memorable because I was with the Dell/EMC VP of Engineered Systems at an event we were sponsoring. Front row. Smiling. Networking. My phone rang. I couldn’t answer. It rang again.
The customer. His batch jobs were getting I/O timeouts.
I told him I’d investigate and call him back. Then I called our Ops Manager: “Please, fix it.”
I returned to the VP, channeling Buddha’s wisdom throughout my entire nervous system.
“Filho Feio Não Tem Pai”
There’s a Brazilian saying: “Ugly child has no father.” It’s a less polite way of saying what you already know—success has many fathers, failure is an orphan.
Five minutes. Ten minutes. Nothing moved.
Another call from the customer. I excused myself from the front-row conversation and retreated to the back. The event had our Support Manager in the field too, so back at the office, we only had the Ops Manager and the DC technicians. The RCA was being conducted not by the people who designed the system, but by playbook runners.
Improvised war room: a five-person conference call, one of them inside the data center. We started reconstructing the timeline in reverse. Something went wrong “15 minutes ago.” What happened 15 minutes ago?
One technician defended himself: “I wasn’t even here. I was in the break room.”
The Truth
What happened was this: That same technician, coming back from the break room, decided to go inside the DC. Most likely to admire the new ScaleIO cluster. Maybe take a picture. It was beautiful, after all.
Looking at the rack, he noticed one of the servers wasn’t perfectly flush with the others. With the bezel removed, all the drives were visible, so he decided to give it a gentle push to align it properly.
The thing is, the power button on a Dell R740 sits right next to the rail interface.
He turned off one node of the cluster.
And then ScaleIO did exactly what it was designed to do: self-heal. The distributed storage architecture detected a failed node and initiated an immediate, uncapped rebuild across the remaining seven nodes. Forty gigabits per second, times seven, all hammering the storage mesh simultaneously. Tail latency for write operations went through the roof.
The customer’s batch jobs timed out because our “fault-tolerant” system was busy being fault-tolerant.
The Resolution
Yellow smile emoji: “We’ll talk about that later.”
I called Compliance—you remember them from the article on governance frameworks? Then called the customer: “Just a few more minutes, it’ll be back. We need to wait for the rebalance.” Our compliance required three replicas of every block. It was midday. No maintenance window available. We had to wait for the bleeding to stop.
The next day, I walked into the office. Everyone had a rehearsed version of what happened. Honestly, I wasn’t interested in what happened—I already knew from yesterday.
If we need to point the finger at someone, here I am. It’s on me. But not because I want to be the hero absorbing all the blame—I’ve written before about why that’s a trap. The technician didn’t fail. The system failed. And as the person responsible for the system, my job wasn’t to punish—it was to fix the gaps that let good intentions cascade into an outage.
The Lessons
This incident exposed five systemic failures that had nothing to do with the technician and everything to do with how we ran operations:
1. Documentation was a filing cabinet, not a knowledge system.
Our compliance documents and runbooks lived in a shared folder. Not indexed. Not searchable. Six Sigma Black Belts loved that structure in the previous decade—hierarchically organized folders and subfolders with no tags or rational index. The standard answer was always: “It’s easy, you just have to read and keep following the levels down.”
When your playbook runners can’t find the playbook, you don’t have playbooks. You have organized chaos (not that I believe that’s a real thing, but it’s one way to see it).
2. Access control had authentication without intent.
Our premises access was “simple AAA”: Authentication, Authorization, and Accounting. But intent was never captured. A DC technician was properly authorized to be there. He had clearance to open a rack and touch the hardware. Nothing in the system asked: “Why are you touching this right now?”
Authorization without intent is just a gate that lets problems walk through.
3. Out-of-band management was a promise, not a practice.
The cluster went up. IPMI and Redfish management existed only as “a promise”: an open ticket that would be handled “when everything calms down.” The irony of rushing to production without the tools that let you recover from production failures was not lost on me. Eventually.
4. Accountability was policy, not enforcement.
We had accountability as a written concept. Sign-in logs. Badge records. But there was no correlation between “who was in the DC” and “what changed in the infrastructure.” The technician could accurately say he “wasn’t there”—because from a systems perspective, his presence was invisible until we reconstructed it manually.
5. The system had no guardrails on its own behavior.
A system went live, working flawlessly, with no fencing around its failure behavior. Rebuild thresholds weren’t defined. Things go south; that’s part of the entropy. Our self-healing architecture was configured to heal at maximum speed, consequences be damned. We gave it the keys to the kingdom and no speed limit.
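The missing guardrail is easy to state: cap rebuild traffic so that self-healing leaves headroom for client I/O, and accept a longer re-protection window in exchange. Here is a back-of-envelope sketch in Python of that tradeoff. The numbers are purely illustrative, not measurements from the incident, and the sketch is deliberately product-agnostic rather than ScaleIO’s actual throttling configuration.

```python
# Illustrative back-of-envelope, not a real throttling config.
# All numbers are hypothetical, not measurements from the incident.

def rebuild_hours(data_to_rebuild_tb: float, per_node_gbps: float,
                  surviving_nodes: int, rebuild_share: float) -> float:
    """Hours to re-protect the data if rebuild traffic is capped to a
    share of each surviving node's backend bandwidth."""
    rebuild_gbps = per_node_gbps * surviving_nodes * rebuild_share
    rebuild_tb_per_hour = rebuild_gbps / 8 / 1000 * 3600  # Gbps -> TB/h
    return data_to_rebuild_tb / rebuild_tb_per_hour

# Assume one failed node leaves 20 TB of blocks to re-replicate.
data_tb = 20.0
for share in (1.0, 0.5, 0.2):  # uncapped, half, one fifth of the backend
    hours = rebuild_hours(data_tb, per_node_gbps=40,
                          surviving_nodes=7, rebuild_share=share)
    print(f"rebuild capped at {share:.0%} of backend: "
          f"~{hours * 60:.0f} min to re-protect, "
          f"{1 - share:.0%} of backend left for client I/O")
```

Even a crude cap like this turns “heal at any cost” into an explicit policy decision that someone has to own.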
The Action Plan: Trias Politica for the Data Center
We needed standards and enforcement. I went back to Montesquieu’s “trias politica,” the theory of separated, self-regulating powers from The Spirit of the Laws. I’d studied Economic Science in college, so the framework resonated. We defined clear boundaries between Compliance, Engineering, and Operations, with high agency in risk mitigation. A few years later, that framework became my compass for some very hard decisions.
First: Ticket-gated access.
Nobody could enter a data center premise without an open ticket that triggered an SOP or playbook requiring physical intervention. People pushed back: “What about things we only notice when we’re inside the computing room?”
Answer: “Let’s double down on telemetry. Human opinion will be the tie-breaker.”
Second: Tandem access with double credentials.
Technicians could only access premises in pairs, except during emergencies. All access required dual authorization.
Third: OOB management before production.
We adjusted our procedures first, then our systems. Out-of-band management became the first thing that had to be soldered onto the Ops Mesh, not a ticket waiting for “when things calm down.” No IPMI/Redfish integration, no production traffic. The irony of that ScaleIO cluster going live without the tools to recover it remotely wasn’t going to repeat itself.
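As an illustration of what that gate can look like, here is a minimal Python sketch that asks a node’s BMC for its power state over Redfish before the node is allowed to carry production traffic. The host, credentials, and the “first system in the collection” shortcut are placeholders; real BMCs typically want session authentication and proper certificate verification.

```python
# Minimal sketch: no Redfish answer from the BMC, no production traffic.
# Host and credentials are placeholders, not values from our environment.
import requests

def power_state(bmc_host: str, user: str, password: str) -> str:
    base = f"https://{bmc_host}/redfish/v1"
    auth = (user, password)
    # Self-signed BMC certificates are common; verify properly in production.
    systems = requests.get(f"{base}/Systems", auth=auth,
                           verify=False, timeout=10).json()
    # Follow the first system in the collection (the exact path varies by
    # vendor, e.g. /Systems/System.Embedded.1 on iDRAC).
    member = systems["Members"][0]["@odata.id"]
    system = requests.get(f"https://{bmc_host}{member}", auth=auth,
                          verify=False, timeout=10).json()
    return system.get("PowerState", "Unknown")

if __name__ == "__main__":
    print(power_state("10.0.0.42", "oob-readonly", "example-password"))
```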
The Calcifying
When we entered the ISO 27001 process, we set the bar high. What made a huge difference was moving away from tribal knowledge toward structured paths.
Fast-forwarding: we realized we couldn’t keep that many valves and bottlenecks just to see what was going on. We had something like fifteen different systems to stitch together before we had a mature snapshot of “what’s happening.” A Siemens Profinet-based system to check phase current here. A Netbotz temperature probe there. All speaking SNMP, making our Grafana board look like a Christmas canvas: a lot of green, some red, shades of orange everywhere. A lot of noise. A signal that was hard to reason about.
Here’s the thing nobody asked during the war room: why didn’t the NOC see the rebuild happening?
Because the new cluster wasn’t in their dashboards yet. There was an open ticket for the NOC team to add the newly joined ScaleIO cluster to the “single pane of glass.” Another task waiting for “when things calm down.” Another vital signal buried in an ocean of beautiful dashboards that showed everything except what was actually breaking.
We had fifteen systems feeding Grafana. Green lights, red lights, shades of orange everywhere. But the one system actively bleeding out? Invisible. Not because we lacked observability—because we had so much of it that one more integration felt like a low priority.
In the end, we had to develop our own solution.
Our Autotask PSA could trigger APIs, so we built an endpoint in our stack that integrated with our centralized BMS access control system. When a ticket requiring physical intervention was dispatched, the system generated a one-time passcode specific to that ticket. To enter the DC, the technician needed two factors: their biometric signature and that ticket-specific passcode.
Their personal access code still worked; we weren’t going to lock people out during real emergencies. But access using a personal code without a dispatched ticket automatically created a non-conformity and triggered Compliance. The ticket had to specify the mission: troubleshoot memory on room #3, rack #10, server #8. All of it registered, as ISO required.
We didn’t build a gate. We built a paper trail that writes itself.
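For concreteness, this is roughly the shape of that flow, sketched as a small Flask service with an in-memory store. The endpoint names, payload fields, and the door-controller callback are hypothetical; the real integration sat between the Autotask PSA webhooks and the BMS access-control API.

```python
# A minimal sketch of the ticket-gated access flow, not the production code.
# Endpoint names, payload fields, and the in-memory store are illustrative.
import secrets
from flask import Flask, jsonify, request

app = Flask(__name__)
open_tickets: dict[str, dict] = {}   # otp -> ticket metadata
non_conformities: list[dict] = []    # fed to Compliance in the real system

@app.post("/tickets/dispatched")
def ticket_dispatched():
    """Called when the PSA dispatches a ticket requiring hands-on work."""
    ticket = request.get_json()
    otp = f"{secrets.randbelow(10**6):06d}"  # ticket-specific one-time passcode
    open_tickets[otp] = {
        "ticket_id": ticket["ticket_id"],
        "mission": ticket["mission"],  # e.g. "memory, room 3, rack 10, server 8"
    }
    # In the real flow, the OTP was pushed to the BMS access controller here.
    return jsonify({"otp": otp}), 201

@app.post("/access/validate")
def access_validate():
    """Called by the door controller after the biometric check passes."""
    event = request.get_json()
    otp = event.get("otp")
    if otp in open_tickets:
        return jsonify({"granted": True, **open_tickets.pop(otp)})
    # Personal code without a dispatched ticket: the door still opens for
    # emergencies, but a non-conformity is raised automatically.
    non_conformities.append({"who": event["badge_id"], "when": event["timestamp"]})
    return jsonify({"granted": True, "non_conformity": True})

if __name__ == "__main__":
    app.run(port=8080)
```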
What We Achieved
I’ll confess: lights-out operations wasn’t on my radar when we started. I just wanted accountability and clear RCA. The tandem access requirement was about having a witness, not about automation.
But in the end, it became peace of mind.
After a few months, the culture changed. Alerts were handled in a mature, prepared way. People stopped rushing into the cold aisle with parts in hand. We’d built a safe environment by practicing one thing: discipline.
The Uncomfortable Truth
Honestly, I don’t trust the public statistics on IT outages caused by human error. I’ve seen too many LUNs run out of space due to lack of maintenance, which is really a lack of observability. I’ve seen deployment teams skip configuring the email server on the BMC because “it requires TLS,” or because “I’ll have to ask the security team to open port 587 and that’ll take time.” And that time is forever.
When balance sheets were done by hand, before Excel, the exposure surface for human error was far wider and troubleshooting was far harder. The hyperscale era showed us the same improvements are necessary in infrastructure. Servers no longer have “planet” or “Star Wars” names. They’re now a fungible swarm, far removed from “human-level” control.
Lights-out is a destination, not a starting point. You can’t automate your way out of operational immaturity. And every “self-healing” system is only as good as the constraints you put around it.
The technician wasn’t malicious. He wasn’t incompetent. He cared enough about the infrastructure to notice something was slightly off and tried to fix it. That instinct is exactly what you want in your team.
What failed wasn’t him. What failed was a system that let good intentions cause cascading failures because we hadn’t built the guardrails to channel those intentions safely.
Twenty years in data centers taught me this: The playbook is not the operation. The people are the operation. And if your people can cause an outage by admiring your hardware, your operational model has a gap that no amount of distributed storage can fill.
About the Author
Stefano Schotten spent two decades building AMTI from a startup to a cloud service provider with 2,000+ consecutive days of uptime. He’s now working on URE, tackling the $244B cloud waste problem with economics-driven governance. He still removes bezels to admire the drives. He just doesn’t push anymore.