Most security programs are built around preventing bad things from happening. That’s necessary but insufficient. At AMTI, where I served as CTO and led infrastructure security for a multi-tenant cloud serving customers from single-VM deployments to enterprise DRaaS contracts spanning hundreds of miles of metro fiber, I learned that mature security is about resilience: the capacity to detect, contain, and recover faster than adversaries can escalate.
The Visibility Problem at Scale
Operating a cloud service provider on your own ASN creates a specific governance challenge: you’re the abuse contact, but in a GDPR-compliant architecture, you have no visibility into customer data. Encrypted traffic is opaque by design. This constraint forced architectural discipline: we couldn’t inspect our way to security, so we had to instrument our way there.
Starting in 2017, I implemented automated threat intelligence by integrating Shodan scans against our IP ranges with our ticketing system. Exposed services triggered immediate customer notifications. The operational overhead was significant (anyone who’s run abuse@ for a cloud provider knows the toil), but the alternative was reactive incident response after compromise.
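A minimal sketch of the shape of that integration, using the official shodan Python client; `open_ticket()` is a hypothetical stand-in for our ticketing API, and the CIDRs and port list are illustrative, not our production rules.

```python
import shodan

# Hypothetical stand-in for the ticketing system's API.
def open_ticket(summary: str, body: str) -> None:
    print(f"[TICKET] {summary}\n{body}")

API_KEY = "YOUR-SHODAN-API-KEY"
OUR_RANGES = ["203.0.113.0/24", "198.51.100.0/24"]        # illustrative TEST-NET CIDRs
RISKY_PORTS = {23: "telnet", 445: "smb", 3389: "rdp", 5900: "vnc"}

api = shodan.Shodan(API_KEY)

for cidr in OUR_RANGES:
    # The net: filter requires a Shodan plan with search-filter access.
    for match in api.search_cursor(f"net:{cidr}"):
        port = match.get("port")
        if port in RISKY_PORTS:
            ip = match.get("ip_str")
            open_ticket(
                summary=f"Exposed {RISKY_PORTS[port]} on {ip}:{port}",
                body=(
                    f"Shodan observed {match.get('product', 'an unknown service')} "
                    f"on {ip}:{port}. Notify the tenant and track remediation."
                ),
            )
```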
Tenant Security Is Platform Security
In shared infrastructure, the boundary between customer risk and platform risk dissolves. A ransomware-infected VM doesn’t stay contained. It generates anomalous traffic patterns, degrades storage performance, and eventually becomes a support ticket or an outage.
I drove the development of behavioral detection capabilities based on a simple but effective heuristic: when a customer workload exhibited sustained high-entropy small writes to storage, memory saturation, and single-core CPU pinning simultaneously, we had an early ransomware indicator. This telemetry-based approach gave us detection capability without violating data privacy constraints: we observed behavior, not content.
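In code, the heuristic is little more than a conjunction over a sliding window. The sketch below is illustrative: `TelemetrySample` is a hypothetical shape for the hypervisor-level counters we collected, and the thresholds are placeholders for values tuned per platform.

```python
from dataclasses import dataclass

@dataclass
class TelemetrySample:
    """One interval of hypervisor-level telemetry for a single VM (hypothetical shape)."""
    write_entropy: float           # mean Shannon entropy of written blocks, 0.0-8.0 bits/byte
    write_size_p50: int            # median write size in bytes
    mem_utilization: float         # fraction of allocated guest memory in use
    max_core_utilization: float    # busiest vCPU, 0.0-1.0
    mean_core_utilization: float   # average across vCPUs, 0.0-1.0

def looks_like_ransomware(window: list[TelemetrySample], min_consecutive: int = 5) -> bool:
    """Flag a VM when all three behaviors co-occur for a sustained run of samples."""
    run = 0
    for s in window:
        high_entropy_small_writes = s.write_entropy > 7.5 and s.write_size_p50 < 64 * 1024
        memory_saturated = s.mem_utilization > 0.90
        single_core_pinned = s.max_core_utilization > 0.95 and s.mean_core_utilization < 0.50
        if high_entropy_small_writes and memory_saturated and single_core_pinned:
            run += 1
        else:
            run = 0            # require consecutive samples, not an overall count
        if run >= min_consecutive:
            return True
    return False
```

The point of the conjunction is precision: any one of these signals alone (a backup job, a compile farm, a memory leak) is routine; all three at once, sustained, almost never is.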
But detection without authority has limits. For some customers, there wasn’t much we could do beyond isolating the workload; that boundary was explicit in our SLA. Even with MTTD below one minute, our team sometimes could only monitor out-of-band telemetry and escalate. Trying to reach a customer at 3am to tell them their production environment was encrypting itself created friction. Some appreciated the call. Others questioned why we couldn’t just “fix it.” Still others worried we were snooping into their operations.
And some made a business out of it.
One customer with hundreds of VMs ran a multi-tenant SaaS offering with an internet-facing vulnerability. Infections ran wild. We averaged three ransomware incidents per day from this single customer. It escalated to me because my team’s KPIs were bleeding. I reached out to the customer’s Director and heard: “The thing is, it’s a source of revenue for us. I charge my customers every time I have to rebuild the server.”
He received a contract termination a few days later.
My risk mitigation mental model was clear: this is a risk I can’t accept. We’re not a trauma ER. We’re not going to be complicit in someone else’s exploitation model. That decision cost us revenue. It protected our platform, our team, and our other customers. This is how we built an operational model where only 8% of our tickets were customer-related; the rest was predictable entropy, routine maintenance, and automated flows.
That tension is inherent in multi-tenant security: you can build detection that’s fast and privacy-respecting, but response authority lives with the customer. The operational safety rail was clear SLA language, a documented escalation path, and the willingness to fire customers who weaponize your infrastructure. Not heroics.
The Strategic Lesson
The shift from security to resilience isn’t about tools. It’s about accepting that prevention eventually fails and designing systems, processes, and teams around rapid detection and containment. At scale, that requires:
- Architectural constraints that enable observability without compromising privacy
- Automated response that reduces time-to-containment below adversary dwell time (sketched after this list)
- Cross-functional alignment between security, operations, and customer success
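The second point is where MTTD stays meaningful. Below is a sketch of what that containment path can look like: a handler that runs the moment detection fires, isolates only where the SLA grants us that authority, and escalates everywhere else. The helper functions are hypothetical stand-ins for the hypervisor, ticketing, and paging integrations.

```python
from enum import Enum

class ContainmentAuthority(Enum):
    """What the tenant's SLA allows the provider to do unilaterally (hypothetical model)."""
    ISOLATE_NETWORK = "isolate_network"
    NOTIFY_ONLY = "notify_only"

# Hypothetical integration points; in production these would call the hypervisor,
# the ticketing system, and the on-call paging service.
def isolate_vm_network(vm_id: str) -> None:
    print(f"[HYPERVISOR] moving {vm_id} to quarantine VLAN")

def open_ticket(vm_id: str, detail: str) -> str:
    print(f"[TICKET] {vm_id}: {detail}")
    return "TCK-0001"

def notify_tenant(vm_id: str, ticket_id: str) -> None:
    print(f"[EMAIL] tenant contact notified for {vm_id} ({ticket_id})")

def page_oncall(ticket_id: str) -> None:
    print(f"[PAGE] on-call engineer paged for {ticket_id}")

def handle_ransomware_alert(vm_id: str, authority: ContainmentAuthority, detail: str) -> None:
    """Run containment the moment detection fires, before any human is in the loop."""
    ticket_id = open_ticket(vm_id, detail)
    if authority is ContainmentAuthority.ISOLATE_NETWORK:
        # SLA grants unilateral network isolation: contain first, then escalate.
        isolate_vm_network(vm_id)
    # Either way, the tenant and the on-call engineer hear about it immediately.
    notify_tenant(vm_id, ticket_id)
    page_oncall(ticket_id)
```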
But owning the P&L forces you to confront the economics of defense.
By 2023, the human burden of incident response was climbing, not because our defenses weakened, but because attacker sophistication accelerated. Bad actors adopted AI-assisted tooling and broadened their targeting: no longer just going after user-space entry points, but probing application-level vulnerabilities systematically. The attack surface grew faster than manual testing could cover.
We caught this shift early because we had ticket sanity. Our operational discipline kept noise low enough to spot pattern changes without ML. When your baseline is clean, signal surfaces. When we ran the numbers and projections on coverage versus cost, the conclusion was clear: we couldn’t sustain the depth of traditional pentesting across an expanding threat landscape.
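The projection itself was back-of-envelope arithmetic. The figures below are illustrative, not our actual numbers, but they show the shape of the trade-off: per-engagement depth can’t keep coverage from decaying as the asset count grows, while a flat-cost continuous platform covers everything, shallowly, every cycle.

```python
# Back-of-envelope coverage-versus-cost projection (illustrative numbers only).
assets = 400                            # externally reachable services across tenants
manual_cost_per_engagement = 15_000     # one deep pentest engagement (USD)
manual_assets_per_engagement = 20       # realistic scope of a deep engagement
engagements_per_year = 4

manual_annual_cost = manual_cost_per_engagement * engagements_per_year
manual_coverage = (manual_assets_per_engagement * engagements_per_year) / assets

automated_annual_cost = 30_000          # flat platform subscription (illustrative)
automated_coverage = 1.0                # every asset, every cycle, at lower depth

print(f"manual:    ${manual_annual_cost:,}/yr for {manual_coverage:.0%} coverage")
print(f"automated: ${automated_annual_cost:,}/yr for {automated_coverage:.0%} coverage")
# manual:    $60,000/yr for 20% coverage
# automated: $30,000/yr for 100% coverage
```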
So we adapted. I led adoption of Vonahi’s vPentest platform (vpentest.io), partnering with them during their Kaseya transition, to shift from deep-but-narrow to wide-and-continuous testing. Automated penetration testing sacrificed some depth for breadth, but it matched the new reality: attackers weren’t going deep on one vector; they were scanning wide for any opening. Our defense had to mirror the threat.
But the business strategy made it sustainable.
We were already carrying the operational expense of daily mitigation — the toil was a sunk cost bleeding margin. So we flipped the model: we offered scheduled network pentests to our cloud tenants at no additional charge. For customers used to classic CSPs — where the provider profits from high usage and incident response billables — this was a differentiator. We weren’t monetizing their pain; we were reducing it.
The upstream benefits compounded. We updated our SLAs to formalize the visibility these assessments required, which gave us clearer responsibility boundaries over workloads we’d previously had to treat as black boxes. And the majority of customers, once they saw the findings, started paying us to remediate — not because we forced the upsell, but because trust was already established.
What began as an operational trade-off became a value-added business strategy: lower mitigation toil, stronger customer retention, clearer SLA boundaries, and a new revenue stream built on proactive security rather than reactive firefighting.
That’s the insight I keep returning to: security strategy follows threat economics. When attackers automate, defenders must automate. When attack surface expands, coverage must expand. The best security leaders aren’t the ones with the most sophisticated tools. They’re the ones who adapt fastest when the landscape shifts.
The New Reality
The cyber attack surface suffers from entropy, drift, and speed. It expands in directions you didn’t architect and at a pace that outstrips periodic assessment.
I saw this firsthand years ago when one of our BaaS customers received a critical finding in a security assessment. The red team had traced a path from our marketing webpage (a simple “Create a Support Ticket” link) to a third-party ticketing system, the kind every company uses: ServiceNow, Zendesk, or similar. That third-party server had an outbound SSH port open. Our customer’s security posture was dinged for a vulnerability three hops removed from anything they controlled.
This is the reality of modern attack surface: it’s not what you built, it’s everything you touch.
ISO 27001 expects penetration testing as part of its vulnerability management controls. You can satisfy that by checking a box and writing an SOP that says “scan web attack surface once per year.” You’ll get certified. But certification is the floor, not the ceiling. Compliance frameworks exist for good reason. They establish baseline controls and create accountability structures. I’m not dismissive of governance; I’ve led teams through ISO 27001, SOC 2, and GDPR audits. The discipline matters. What doesn’t work is treating the audit cycle as your security strategy. In the real world, where attackers probe continuously, where your supply chain extends into dozens of SaaS vendors, where a marketing link can become an attack vector, annual testing is a compliance artifact, not a defense posture.
The lesson is architectural: if security isn’t a prerequisite in your design and operations, you’re building a leaky bucket. You can patch holes faster than water escapes, or you can build a vessel that holds. The second approach scales. The first guarantees that your security team spends its time on remediation instead of resilience.
No Happy Ending, Just Safe Travel
There’s no such thing as a secure operation. I stopped believing in that destination years ago.
Cybersecurity faces a threat population of nearly eight billion potential bad actors: every human with internet access, plus the automated systems they deploy. You don’t defeat that. You don’t outrun it. You build an organization that travels safely through it.
The difference between security awareness and a safety posture is operational discipline. Awareness means you know the threats exist. Posture means you’ve structured your architecture, your processes, and your team to absorb impact, detect anomalies, and adapt continuously. One is a mindset. The other is a capability.
That’s the transformation I’ve spent two decades building: not security programs that promise protection, but resilient organizations that assume breach and optimize for recovery. There’s no happy ending. There’s just safe travel, and the operational excellence that makes it possible.