GPU, Cloud & Data Center

“The lab is where you learn. Production is where you prove it.”

Research rarely happens in uncontrolled conditions. It happens under discipline: controlled variables, measured baselines, and experiments that isolate causality.

This lab is designed to make results repeatable—and production less surprising.

This page is the index of my research program: pillars, lab methods, and the equipment used to generate repeatable results. For applied multi-cloud cost work, see Atlas; for write-ups, see Articles.

Research Pillars

Compute & Workload Performance

Training throughput, GPU utilization, NCCL tradeoffs, latency/jitter, virtualization overhead, scheduling effects (Slurm, Kubernetes).
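
As a minimal sketch of how this kind of measurement gets wired up, the snippet below samples per-GPU utilization, power, clocks, and temperature with nvidia-smi and logs timestamped rows for later time alignment against training logs. The sampling interval, duration, and output path are arbitrary placeholders.

```python
# Minimal sketch: sample GPU utilization and power draw during a training run.
# Assumes nvidia-smi is on PATH; interval, duration, and output path are placeholders.
import csv
import subprocess
import time

QUERY = "utilization.gpu,power.draw,clocks.sm,temperature.gpu"

def sample_gpus():
    """Return one row per GPU: [util %, power W, SM clock MHz, temp C]."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.split(", ") for line in out.strip().splitlines()]

def log_run(path="gpu_samples.csv", interval_s=1.0, duration_s=60.0):
    """Write timestamped samples so they can be time-aligned with training logs."""
    deadline = time.time() + duration_s
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "gpu", "util_pct", "power_w", "sm_mhz", "temp_c"])
        while time.time() < deadline:
            ts = time.time()
            for idx, row in enumerate(sample_gpus()):
                writer.writerow([f"{ts:.3f}", idx] + row)
            time.sleep(interval_s)

if __name__ == "__main__":
    log_run()
```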

Thermal & Cooling Engineering (Liquid + Air)

Loop design, radiator efficiency, fan/pump control, hotspot mapping, airflow characterization (CFM), CFM-per-watt efficiency, thermal drift, throttling behavior, ambient sensitivity, compute-to-thermal efficiency ratio.
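
For a sense of the arithmetic behind the airflow metrics above, here is a small sketch using the standard sensible-heat relation Q = ṁ·cp·ΔT at typical air density. The CFM, temperatures, and node power in the example are illustrative, not lab measurements.

```python
# Minimal sketch of airflow-side thermal arithmetic (air at ~1.2 kg/m^3, cp ~1005 J/kg-K);
# the example numbers are illustrative, not lab measurements.
CFM_TO_M3S = 0.000471947   # 1 CFM in m^3/s
AIR_DENSITY = 1.2          # kg/m^3, roughly sea level at 20 C
AIR_CP = 1005.0            # J/(kg*K)

def heat_removed_w(airflow_cfm: float, delta_t_c: float) -> float:
    """Sensible heat carried away by the airstream, in watts: Q = m_dot * cp * dT."""
    m_dot = airflow_cfm * CFM_TO_M3S * AIR_DENSITY  # kg/s
    return m_dot * AIR_CP * delta_t_c

def cfm_per_watt(airflow_cfm: float, node_power_w: float) -> float:
    """Airflow budget per watt of load: a quick proxy for how airflow-constrained a node is."""
    return airflow_cfm / node_power_w

if __name__ == "__main__":
    cfm, inlet_c, outlet_c, node_w = 180.0, 24.0, 38.0, 1200.0
    print(f"heat removed by air: {heat_removed_w(cfm, outlet_c - inlet_c):.0f} W")
    print(f"CFM per watt:        {cfm_per_watt(cfm, node_w):.3f}")
```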

Power & Electrical Engineering

Power transients/load steps, PSU behavior, UPS interaction, power quality (sag/ripple/noise), power capping strategies, perf/watt, rack-level power efficiency, stability under burst loads, brownouts.
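
A minimal sketch of how a power-capping sweep turns into perf/watt data points: it assumes nvidia-smi is available (setting a cap usually needs root) and takes the benchmark as a caller-supplied run_workload callable, which is a placeholder for whatever workload is under test.

```python
# Minimal sketch of a power-capping sweep for perf/watt curves. run_workload is an
# assumed callable supplied by the experiment; it should return throughput (e.g. samples/s).
import subprocess
from typing import Callable, Dict, Iterable

def set_power_cap(gpu_index: int, watts: int) -> None:
    """Apply a software power cap to one GPU via nvidia-smi (typically requires root)."""
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

def perf_per_watt_sweep(run_workload: Callable[[], float],
                        gpu_index: int = 0,
                        caps: Iterable[int] = (350, 300, 250, 200)) -> Dict[int, float]:
    """Return {cap_watts: throughput / cap_watts}. A fuller experiment would integrate
    measured power.draw over the run instead of dividing by the cap."""
    results = {}
    for cap in caps:
        set_power_cap(gpu_index, cap)
        results[cap] = run_workload() / cap
    return results
```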

I/O, Storage & Data Path

NVMe behavior under load, dataset streaming, ingest pipelines, page cache behavior, PCIe contention, storage-induced jitter, NVMe-oF, HPC storage, shared volumes, data sovereignty, fault-domain storage architecture.
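
The sketch below shows the shape of a storage-jitter probe: time fixed-size sequential reads from a dataset file and report the tail. The file path and block size are placeholders, and page cache state is exactly the kind of variable such an experiment has to control.

```python
# Minimal sketch of a read-latency jitter probe for a dataset file. The path and block
# size are placeholders; page cache effects dominate unless the cache is controlled.
import statistics
import time

def read_latencies(path: str, block_bytes: int = 1 << 20, max_reads: int = 2048):
    """Return per-read latencies in microseconds for sequential block reads."""
    latencies = []
    with open(path, "rb", buffering=0) as f:   # unbuffered at the Python layer
        for _ in range(max_reads):
            t0 = time.perf_counter()
            chunk = f.read(block_bytes)
            latencies.append((time.perf_counter() - t0) * 1e6)
            if len(chunk) < block_bytes:       # end of file
                break
    return latencies

def summarize(latencies):
    lat = sorted(latencies)
    return {
        "reads": len(lat),
        "p50_us": statistics.median(lat),
        "p99_us": lat[int(0.99 * (len(lat) - 1))],
        "max_us": lat[-1],
    }

if __name__ == "__main__":
    print(summarize(read_latencies("/path/to/dataset.shard")))  # hypothetical path
```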

Network & Interconnect

Distributed training sensitivity, congestion and microbursts, topology impacts, RDMA (RoCE/IB when applicable), horizontal scaling, automated security perimeters, mesh/fabric architectures for high throughput, load-balancing algorithms.
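
As an illustration of the measurement shape (not the lab's actual fabric tooling), the sketch below times UDP round trips against a loopback echo and reports median, spread, and worst case; a real interconnect run would point the probe at another host over the fabric under test.

```python
# Minimal sketch of a latency/jitter probe: UDP echo on loopback, timed from the client.
# Host, port, and probe count are arbitrary; real runs target a remote host on the fabric.
import socket
import statistics
import threading
import time

HOST, PORT, N_PROBES = "127.0.0.1", 9999, 200

def echo_server(stop: threading.Event) -> None:
    """Reflect every datagram back to its sender until asked to stop."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as srv:
        srv.bind((HOST, PORT))
        srv.settimeout(0.2)
        while not stop.is_set():
            try:
                data, addr = srv.recvfrom(64)
                srv.sendto(data, addr)
            except socket.timeout:
                continue

def probe():
    """Return per-packet round-trip times in microseconds."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as cli:
        cli.settimeout(1.0)
        for i in range(N_PROBES):
            t0 = time.perf_counter()
            cli.sendto(str(i).encode(), (HOST, PORT))
            cli.recvfrom(64)
            rtts.append((time.perf_counter() - t0) * 1e6)
            time.sleep(0.005)
    return rtts

if __name__ == "__main__":
    stop = threading.Event()
    threading.Thread(target=echo_server, args=(stop,), daemon=True).start()
    time.sleep(0.1)  # let the server bind before probing
    rtts = probe()
    stop.set()
    print(f"p50 {statistics.median(rtts):.1f} us, "
          f"jitter (stdev) {statistics.stdev(rtts):.1f} us, max {max(rtts):.1f} us")
```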

Observability & Measurement Science

Instrumentation planning, sensor validation, sampling strategy, time alignment, baselines, confidence intervals, repeatability, experiment hygiene.
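
A minimal sketch of the repeatability arithmetic: summarize repeated runs of the same benchmark with a mean and an approximate 95% confidence interval (normal approximation; small run counts deserve a t-based interval instead). The sample values are illustrative.

```python
# Minimal sketch: mean and approximate 95% CI over repeated benchmark runs.
# The 1.96 factor is the normal approximation; the sample values are illustrative.
import math
import statistics

def summarize_runs(samples):
    """Return (mean, half_width) of an approximate 95% CI for repeated measurements."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))  # standard error of the mean
    return mean, 1.96 * sem

if __name__ == "__main__":
    runs = [2412.0, 2398.5, 2421.3, 2405.1, 2416.8]  # throughput (samples/s), five runs
    mean, half = summarize_runs(runs)
    print(f"{mean:.1f} ± {half:.1f} samples/s (95% CI, n={len(runs)})")
```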

Platform Automation & Fleet Operations

Auto-inventory (Redfish-first; IPMI fallback), config drift detection, golden images, reproducible provisioning, Infrastructure as Code (IaC), firmware/driver compliance, zero-touch onboarding.
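
A sketch of the Redfish-first, IPMI-fallback idea, assuming a reachable BMC, basic-auth credentials, and ipmitool on PATH; the fields pulled and the blanket fall-back-on-any-error behavior are simplifications, not the fleet tooling itself.

```python
# Minimal sketch of Redfish-first inventory with an IPMI fallback. BMC address and
# credentials are assumptions; production code would verify TLS and walk the full
# Systems collection rather than taking the first member.
import subprocess
import requests

def redfish_inventory(bmc: str, user: str, password: str) -> dict:
    """Query the standard Redfish Systems collection and return the first system resource."""
    auth, verify = (user, password), False  # isolated OOB network here; verify TLS in production
    systems = requests.get(f"https://{bmc}/redfish/v1/Systems",
                           auth=auth, verify=verify, timeout=10).json()
    first = systems["Members"][0]["@odata.id"]
    return requests.get(f"https://{bmc}{first}", auth=auth, verify=verify, timeout=10).json()

def ipmi_fru(bmc: str, user: str, password: str) -> str:
    """Fallback: read FRU data over IPMI with ipmitool."""
    return subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", user, "-P", password, "fru"],
        check=True, capture_output=True, text=True,
    ).stdout

def inventory(bmc: str, user: str, password: str) -> dict:
    try:
        sys_res = redfish_inventory(bmc, user, password)
        return {"source": "redfish",
                "model": sys_res.get("Model"),
                "serial": sys_res.get("SerialNumber"),
                "bios": sys_res.get("BiosVersion")}
    except Exception:
        return {"source": "ipmi", "raw_fru": ipmi_fru(bmc, user, password)}
```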

Reliability & Failure Engineering

Controlled fault injection, thermal/power stress tests, degradation over time, MTBF signals, rollback criteria, postmortems, brownout survivability.

Economics & Capacity (FinOps-style)

Cost per training hour, energy sensitivity, utilization vs waste, capex/opex tradeoffs, constraints-to-dollars narratives, rack efficiency (kW per useful work), throughput per kW, throughput per CFM.
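
The ratios above reduce to simple arithmetic; the sketch below shows two of them with illustrative placeholder inputs rather than lab or customer figures.

```python
# Minimal sketch of two efficiency ratios; all inputs are illustrative placeholders.
def cost_per_training_hour(power_kw: float, energy_cost_per_kwh: float,
                           amortized_capex_per_hour: float) -> float:
    """Energy cost plus amortized hardware cost for one wall-clock hour of training."""
    return power_kw * energy_cost_per_kwh + amortized_capex_per_hour

def throughput_per_kw(samples_per_s: float, power_kw: float) -> float:
    """Useful work per kilowatt: the rack-efficiency number capacity models consume."""
    return samples_per_s / power_kw

if __name__ == "__main__":
    # Example: a 1.2 kW node at $0.15/kWh, $0.90/h amortized capex, 2,400 samples/s.
    print(f"$/training-hour:  {cost_per_training_hour(1.2, 0.15, 0.90):.2f}")
    print(f"samples/s per kW: {throughput_per_kw(2400, 1.2):.0f}")
```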

Current Equipment

  • GPUs: 4× NVIDIA GPUs with NVLink (2×2 training/inference rig)
  • CPUs: Intel and AMD platforms (Threadripper, Core i9-14900, and Xeon-class systems)
  • Switching: Arista and Dell datacenter switches (up to 100 Gbps)
  • NICs/CNAs: Mellanox CNAs (50 Gbps, RDMA-capable)
  • Interconnect: NVLink; PCIe topology and contention under test
  • Cooling: rack-level closed-loop liquid cooling, plus controlled airflow
  • Power: wye-fed PSU setup; UPS interaction and load-step behavior under test
  • Containment: lab-scale hot aisle containment for repeatable thermal experiments
  • Instrumentation: multiple probes/sensors for thermal, airflow, and power measurements
  • OOB management: Dell iDRAC (Redfish-first; IPMI fallback) on an isolated OOB network

Current Projects

Atlas

Atlas is a single pane of glass for multi-cloud cost visibility and planning—drilldowns by provider, region, and product, with signals that support forecasting, capacity planning, and contract decisions.

Atlas is designed to be autonomous on the data side: it ingests standardized FinOps-style cost and usage exports, then applies a pre-processing layer that lets teams map their own services and offers into consistent categories for reporting and analysis.
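
To make the pre-processing idea concrete, here is a hypothetical sketch: normalize rows from a FinOps-style cost and usage export into user-defined categories. The column names and mapping table are invented for illustration and are not Atlas's actual schema.

```python
# Minimal sketch of service-to-category mapping over a FinOps-style export.
# Column names ("provider", "service", "billed_cost") and the mapping table are hypothetical.
from collections import defaultdict

# User-maintained mapping from (provider, service) to a reporting category.
CATEGORY_MAP = {
    ("aws", "AmazonEC2"): "compute",
    ("aws", "AmazonS3"): "storage",
    ("azure", "Virtual Machines"): "compute",
    ("gcp", "Compute Engine"): "compute",
}

def categorize(rows):
    """Aggregate billed cost per category; unmapped services land in 'uncategorized'."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["provider"].lower(), row["service"])
        totals[CATEGORY_MAP.get(key, "uncategorized")] += float(row["billed_cost"])
    return dict(totals)

if __name__ == "__main__":
    sample = [
        {"provider": "AWS", "service": "AmazonEC2", "billed_cost": "412.07"},
        {"provider": "AWS", "service": "AmazonS3", "billed_cost": "58.90"},
        {"provider": "GCP", "service": "Compute Engine", "billed_cost": "131.44"},
    ]
    print(categorize(sample))
```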

The demo uses masked, synthetic data and does not represent any real company.