GPU, Cloud & Data Center
“The lab is where you learn. Production is where you prove it.”
Useful results rarely come from uncontrolled conditions. They come from discipline: controlled variables, measured baselines, and experiments that isolate causality.
This lab is designed to make results repeatable—and production less surprising.
This page is the index of my research program: pillars, lab methods, and the equipment used to generate repeatable results. For applied multi-cloud cost work, see Atlas; for write-ups, see Articles.
Research Pillars
Compute & Workload Performance
Training throughput, GPU utilization, NCCL tradeoffs, latency/jitter, virtualization overhead, scheduling effects, Slurm, Kubernetes.
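As a concrete sketch of the utilization side of this pillar: per-GPU utilization can be sampled with `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits` and summarized per rig. The parser and the sample values below are illustrative, not lab data.

```python
# Sketch: summarize per-GPU utilization from nvidia-smi CSV output.
# Assumes the output shape of:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits

def parse_gpu_utilization(csv_text: str) -> dict[int, int]:
    """Map GPU index -> utilization percent from nounits CSV lines."""
    util = {}
    for line in csv_text.strip().splitlines():
        idx, pct = (field.strip() for field in line.split(","))
        util[int(idx)] = int(pct)
    return util

# Illustrative sample standing in for live nvidia-smi output:
sample = """0, 97
1, 95
2, 12
3, 0"""

util = parse_gpu_utilization(sample)
mean_util = sum(util.values()) / len(util)
# A large spread across a 4-GPU rig often points at data-loader
# starvation or collective-communication imbalance.
print(f"mean utilization: {mean_util:.1f}%")
```

Sampling this on a fixed interval, with timestamps, is what makes the scheduling and NCCL comparisons above repeatable.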
Thermal & Cooling Engineering (Liquid + Air)
Loop design, radiator efficiency, fan/pump control, hotspot mapping, airflow characterization (CFM), CFM-per-watt efficiency, thermal drift, throttling behavior, ambient sensitivity, compute-to-thermal efficiency ratio.
Power & Electrical Engineering
Power transients/load steps, PSU behavior, UPS interaction, power quality (sag/ripple/noise), power capping strategies, perf/watt, rack-level power efficiency, stability under burst loads, brownouts.
I/O, Storage & Data Path
NVMe behavior under load, dataset streaming, ingest pipelines, page cache behavior, PCIe contention, storage-induced jitter, NVMe-oF, HPC storage, shared volumes, data sovereignty, fault-domain storage architecture.
Network & Interconnect
Distributed training sensitivity, congestion and microbursts, topology impacts, RDMA (RoCE/IB when applicable), horizontal scaling, automated security perimeters, mesh/fabric architectures for high throughput, load-balancing algorithms.
Observability & Measurement Science
Instrumentation plan, sensor validation, sampling, time alignment, baselines, confidence intervals, repeatability, experiment hygiene.
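A minimal sketch of the repeatability discipline above: report a mean with a 95% interval over repeated runs rather than a single number. This uses the normal approximation for brevity (a t-interval would be stricter at n=5); the throughput values are illustrative.

```python
import statistics

def mean_ci95(samples: list[float]) -> tuple[float, float]:
    """Mean and 95% half-width via the normal approximation (1.96 * SE)."""
    m = statistics.mean(samples)
    half = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return m, half

# Hypothetical images/sec across 5 repeated runs of the same benchmark:
runs = [412.3, 409.8, 415.1, 411.0, 413.6]
m, half = mean_ci95(runs)
print(f"throughput: {m:.1f} ± {half:.1f} images/sec")
```

If the interval is wider than the effect being measured, the experiment needs more repeats or tighter controls before any conclusion is drawn.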
Platform Automation & Fleet Operations
Auto-inventory (Redfish-first; IPMI fallback), config drift detection, golden images, reproducible provisioning, Infrastructure as Code (IaC), firmware/driver compliance, zero-touch onboarding.
Reliability & Failure Engineering
Controlled fault injection, thermal/power stress tests, degradation over time, MTBF signals, rollback criteria, postmortems, brownout survivability.
Economics & Capacity (FinOps-style)
Cost per training hour, energy sensitivity, utilization vs waste, capex/opex tradeoffs, constraints-to-dollars narratives, rack efficiency (kW per useful work), throughput per kW, throughput per CFM.
Current Equipment
- GPUs: 4× NVIDIA GPUs with NVLink (2×2 training/inference rig)
- CPUs: Intel and AMD platforms (Threadripper, Core i9-14900, and Xeon-class systems)
- Switching: Arista and Dell datacenter switches (up to 100 Gbps)
- NICs/CNAs: Mellanox CNAs (50 Gbps, RDMA-capable)
- Interconnect: NVLink; PCIe topology and contention under test
- Cooling: rack-level closed-loop liquid cooling, plus controlled airflow
- Power: wye-fed PSU setup; UPS interaction and load-step behavior under test
- Containment: lab-scale hot aisle containment for repeatable thermal experiments
- Instrumentation: multiple probes/sensors for thermal, airflow, and power measurements
- OOB management: Dell iDRAC (Redfish-first; IPMI fallback) on an isolated OOB network
Current Projects
Atlas
Atlas is a single pane of glass for multi-cloud cost visibility and planning—drilldowns by provider, region, and product, with signals that support forecasting, capacity planning, and contract decisions.
Atlas is designed to be autonomous on the data side: it ingests standardized FinOps-style cost and usage exports, then applies a pre-processing layer that lets teams map their own services and offers into consistent categories for reporting and analysis.
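The mapping layer can be sketched as a normalization pass over raw cost rows: provider service names are folded into team-defined categories, and anything unmapped is surfaced for triage rather than silently dropped. The service names, categories, and costs below are hypothetical, not Atlas internals.

```python
# Illustrative sketch of the pre-processing/mapping idea described above.
# Real exports carry many more columns; only service and cost matter here.

CATEGORY_MAP = {
    "Amazon Elastic Compute Cloud": "Compute",
    "Azure Virtual Machines": "Compute",
    "Amazon Simple Storage Service": "Storage",
}

def categorize(rows: list[dict]) -> dict[str, float]:
    """Sum cost per mapped category; unmapped services land in 'Unmapped'."""
    totals: dict[str, float] = {}
    for row in rows:
        category = CATEGORY_MAP.get(row["service"], "Unmapped")
        totals[category] = totals.get(category, 0.0) + row["cost"]
    return totals

rows = [
    {"service": "Amazon Elastic Compute Cloud", "cost": 120.0},
    {"service": "Azure Virtual Machines", "cost": 80.0},
    {"service": "SomeNicheService", "cost": 5.0},
]
print(categorize(rows))
```

Keeping the map as team-editable data, rather than code, is what lets each team project the same export into its own reporting categories.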
The demo uses masked, synthetic data and is not related to any company.
- Prototype: /atlas/
- Technical stack: Project Atlas: Technical Stack