Personal experiments in performance engineering and distributed systems, 2025 — tackling AI’s very specific bottlenecks.
By Stefano Schotten — AI Infrastructure & Performance Engineering

ABOUT

URE started as a personal project, rooted in my work as a performance engineer and in the conviction that computing can be both faster and leaner when you truly understand what's under the hood.

The idea is simple: put the right data in the right place, align usage with resources, and connect both to economics. Done well, this creates systems that are not only blazing fast, but also sustainable and accountable.

Today, URE is a workbench of experiments and visualizations — exploring smarter ways to approach cloud usage, performance, and cost.

VISION

Cloud usage doesn’t need to be a black box.

With better insight and governance, engineers and finance teams can work together instead of against each other.

URE explores how Usage • Resources • Economics can become one connected system, showing that efficiency and performance aren’t opposites — they’re two sides of the same design principle.

FIELD NOTES
UPCOMING ARTIFACTS
GPU Communication Performance — NVIDIA NVLink, NCCL over RDMA and default TCP

This research prototype examines how modern Blackwell systems communicate—within a node and across nodes. Inside the box, GB200-class platforms pair GPUs with next-gen NVLink and NVSwitch fabrics; at rack scale, NVL72 stitches dozens of GPUs into a single high-bandwidth domain. The goal is to make that fabric’s character visible, not to crown a winner.

We contrast those on-node fabrics with “standard TCP over Ethernet” paths in distributed clusters. While TCP is ubiquitous and easy to operate, it introduces kernel/CPU overheads and higher tail latency compared to RDMA transports, especially for collective communication (NCCL).

Method plan — evaluate, compare, and publish results for: (1) intra-node NVLink vs PCIe, and (2) inter-node NCCL over RDMA (RoCEv2) at 50 Gbps vs TCP at 50 Gbps. We’ll capture throughput, latency distributions (p50/p99/p99.9), communication/compute overlap, and sustained GPU utilization across LLM inference and training phases.
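
As a sketch of how the inter-node leg might be measured (assuming a PyTorch/NCCL environment launched with torchrun; the message sizes and iteration counts are placeholders, and NCCL_IB_DISABLE is the usual NCCL toggle between the IB/RoCE and plain-socket transports):

    # allreduce_bench.py: sketch; launch on each node, e.g.
    #   torchrun --nnodes=2 --nproc-per-node=8 allreduce_bench.py
    # Run once with NCCL_IB_DISABLE=0 (RoCEv2) and once with NCCL_IB_DISABLE=1
    # (TCP sockets) to compare the two transports on the same hardware.
    import os, time
    import torch
    import torch.distributed as dist

    def bench(numel, iters=200, warmup=20):
        t = torch.ones(numel, dtype=torch.float16, device="cuda")
        for _ in range(warmup):
            dist.all_reduce(t)
        times = []
        for _ in range(iters):
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            dist.all_reduce(t)
            torch.cuda.synchronize()
            times.append(time.perf_counter() - t0)
        times.sort()
        world = dist.get_world_size()
        size_bytes = t.numel() * t.element_size()
        # ring all-reduce moves ~2*(n-1)/n of the payload per rank ("bus bandwidth")
        busbw = 2 * (world - 1) / world * size_bytes / (sum(times) / iters) / 1e9
        return times[iters // 2], times[int(iters * 0.99)], busbw

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        for mib in (1, 16, 256):                          # placeholder message sizes
            p50, p99, bw = bench(mib * 1024 * 1024 // 2)  # fp16 element count
            if dist.get_rank() == 0:
                print(f"{mib:4d} MiB  p50 {p50*1e3:7.3f} ms  p99 {p99*1e3:7.3f} ms  busbw {bw:6.1f} GB/s")
        dist.destroy_process_group()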

The objective is to quantify the practical gap between a Blackwell NVLink/NVSwitch domain and conventional Ethernet paths—and to show how disciplined performance engineering narrows that gap in real deployments.

Atlas Preview — Capacity & cost map


An interactive D3/Observable prototype built on FOCUS data, designed to make cloud usage and cost transparent. The stack uses Kafka (or Redpanda, for simplicity) to ingest up to one million events per second, treated as a write-ahead log (WAL), feeding a TimescaleDB/PostgreSQL backbone. Hasura exposes the aggregated data over GraphQL for insight gathering, while a pandas pipeline handles projections and seasonal forecasting across multiple zones and providers.
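
To illustrate the forecasting leg, a minimal pandas sketch over a FOCUS-style export; the column names follow the FOCUS spec (ChargePeriodStart, BilledCost, RegionId), but the seasonal-naive model, horizon, and grouping are assumptions for illustration rather than the project's actual pipeline:

    # focus_forecast.py: minimal sketch, assuming a FOCUS-style cost export
    # already aggregated in TimescaleDB and pulled into a DataFrame.
    import pandas as pd

    def daily_cost(df: pd.DataFrame) -> pd.DataFrame:
        # Column names follow the FOCUS spec; adjust to your export.
        df["ChargePeriodStart"] = pd.to_datetime(df["ChargePeriodStart"], utc=True)
        return (df.set_index("ChargePeriodStart")
                  .groupby("RegionId")["BilledCost"]
                  .resample("1D").sum()
                  .reset_index())

    def naive_seasonal_forecast(daily: pd.DataFrame, horizon_days: int = 28) -> pd.DataFrame:
        """Project each region forward by repeating the trailing 7-day pattern,
        scaled by the recent trend; a deliberately simple seasonal-naive model."""
        out = []
        for region, g in daily.groupby("RegionId"):
            g = g.sort_values("ChargePeriodStart")
            last_week = g["BilledCost"].tail(7).to_numpy()
            trend = g["BilledCost"].tail(28).mean() / max(g["BilledCost"].tail(56).head(28).mean(), 1e-9)
            start = g["ChargePeriodStart"].iloc[-1] + pd.Timedelta(days=1)
            idx = pd.date_range(start, periods=horizon_days, freq="1D")
            values = [last_week[i % len(last_week)] * trend for i in range(horizon_days)]
            out.append(pd.DataFrame({"RegionId": region,
                                     "ChargePeriodStart": idx,
                                     "ProjectedCost": values}))
        return pd.concat(out, ignore_index=True)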

The result is a practical compass for Operations Engineers — showing where to place data and applications for maximum performance and minimum cost. Work is ongoing to deliver the most effective “human-view” frontend.

Memory-Shared Container Mesh

A prototype exploring an alternative to traditional TCP-based microservices: containers exchange data through low-level memory sharing, bypassing the network stack entirely. The architecture enables near-zero-latency communication and drastically reduces serialization overhead, while preserving container boundaries and isolation.

The stack combines a shared-memory fabric with process-level coordination, exposing high-bandwidth data paths for workloads that demand microsecond responsiveness. Early results show throughput and efficiency gains well beyond what service meshes or gRPC pipelines can deliver.
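
A toy sketch of that data path, assuming two containers that share an IPC namespace or a common /dev/shm mount (for example, Docker's --ipc=shareable / --ipc=container:<name>); the segment name, payload shape, and the absence of locking are all simplifications:

    # producer.py: writes a frame into a named POSIX shared-memory segment.
    # Works across containers only if they see the same /dev/shm (shared IPC
    # namespace or a shared tmpfs volume); coordination/locking is elided here.
    from multiprocessing import shared_memory
    import numpy as np

    SEG_NAME = "mesh_ring0"        # placeholder segment name
    FRAME_SHAPE = (1024, 1024)     # placeholder payload: ~4 MiB of float32

    shm = shared_memory.SharedMemory(name=SEG_NAME, create=True,
                                     size=int(np.prod(FRAME_SHAPE)) * 4)
    frame = np.ndarray(FRAME_SHAPE, dtype=np.float32, buffer=shm.buf)
    frame[:] = np.random.rand(*FRAME_SHAPE)   # "send" without sockets or copies

    # consumer.py: attaches to the same segment by name and reads in place.
    # shm = shared_memory.SharedMemory(name=SEG_NAME)
    # view = np.ndarray(FRAME_SHAPE, dtype=np.float32, buffer=shm.buf)
    # ... process view with zero serialization, then shm.close()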

The goal is to give Operations Engineers a new lens on distributed systems design — one where containers collaborate at hardware speed, unlocking performance once reserved for tightly-coupled monoliths.

HPC & ML Serving Scheduler

A prototype for orchestrating scarce, high-value GPU fleets in the era of LLMs. Using Redfish telemetry, the system tracks hardware states, fleet utilization, and load profiles to dynamically adjust queues, KV caches, and scheduling strategies.
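
As an illustration, a small polling sketch against a Redfish service; the BMC address, credentials, and chassis ID are placeholders, and which resources are exposed (the classic Power/Thermal schemas used here, or the newer ThermalSubsystem/EnvironmentMetrics) varies by vendor and firmware:

    # redfish_poll.py: hedged sketch reading power draw and temperatures from
    # a BMC's Redfish service; resource layout differs between platforms.
    import requests

    BMC = "https://bmc-node01.example"   # placeholder BMC address
    AUTH = ("metrics-user", "secret")    # placeholder credentials
    CHASSIS = "1"                        # placeholder chassis ID

    def get(path):
        r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=5)
        r.raise_for_status()
        return r.json()

    power = get(f"/redfish/v1/Chassis/{CHASSIS}/Power")
    watts = power["PowerControl"][0].get("PowerConsumedWatts")

    thermal = get(f"/redfish/v1/Chassis/{CHASSIS}/Thermal")
    temps = {t["Name"]: t.get("ReadingCelsius") for t in thermal["Temperatures"]}
    hottest = max((v for v in temps.values() if v is not None), default=float("nan"))

    print(f"node power: {watts} W, hottest sensor: {hottest} °C")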

The central challenge: inference and training compete for the same accelerators.

During peak hours — GPUs are tuned for inference. The priority is low-latency serving at multi-million-user scale, with scheduling optimized around prompt throughput and memory-efficient KV cache management. Parallel file systems like Lustre or BeeGFS aren’t required here — most data is already loaded into memory, so the focus shifts to smart queuing and GPU allocation.

During off-peak windows — the same GPUs pivot to distributed training. This is when Lustre-class parallel file systems and RDMA fabrics become critical: streaming large datasets, checkpointing models, and synchronizing gradients across nodes. Techniques like AllReduce over InfiniBand and overlapping communication with computation ensure GPUs remain fully saturated.
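
The switching logic itself can be sketched as a small policy function; the peak window, thresholds, and telemetry fields below are invented for illustration, not the scheduler's actual interface:

    # mode_policy.py: hypothetical sketch of the day/night GPU mode switch.
    # The peak window, thresholds, and snapshot fields are placeholders.
    from dataclasses import dataclass
    from datetime import datetime, time

    @dataclass
    class FleetSnapshot:
        inference_qps: float        # current serving load
        pending_training_jobs: int  # queued distributed-training jobs
        gpu_utilization: float      # fleet-wide, 0.0-1.0 (e.g. from DCGM/Redfish)

    PEAK_START, PEAK_END = time(7, 0), time(22, 0)   # placeholder peak window

    def peak_qps_estimate() -> float:
        return 50_000.0             # placeholder; derive from history in practice

    def desired_mode(now: datetime, snap: FleetSnapshot) -> str:
        in_peak = PEAK_START <= now.time() < PEAK_END
        if in_peak:
            return "INFERENCE"      # latency-first: queues + KV-cache tuning
        if snap.inference_qps > 0.2 * peak_qps_estimate():
            return "INFERENCE"      # off-peak, but demand hasn't tailed off yet
        if snap.pending_training_jobs > 0:
            return "TRAINING"       # drain serving, hand GPUs to the trainers
        return "INFERENCE"          # idle default: stay ready to serve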

By adapting to demand cycles, the scheduler extracts maximum performance from scarce top-end accelerators — ensuring organizations get the inference capacity users expect during the day and the training throughput researchers need at night.

NFS over RDMA with Multistream

After nearly a decade of tuning high-performance storage and distributed systems, I am now exploring how modern kernel implementations of NFS 4.1 with multistream support over RDMA change the game compared to traditional NFS mounts.

Classic NFS mounts operate over TCP, where throughput and latency are bounded by the network stack and serialized calls. Under heavy load, this introduces queue buildup, head-of-line blocking, and tail-latency spikes that limit effective resource utilization.

NFSoRDMA with kernel multistream removes these bottlenecks by allowing parallel data channels directly over RDMA. Each stream runs independently, eliminating single-flow choke points and dramatically reducing long-tail latency. Metadata and data I/O can progress concurrently, keeping GPUs and CPUs fed without stalls.
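
For concreteness, a mount sketch wrapped in Python; it assumes a kernel and server where proto=rdma (RPC-over-RDMA on port 20049) can be combined with nconnect-style multistream, which is not universal, and the export path and stream count are placeholders:

    # mount_nfsordma.py: hedged sketch; verify that your kernel and NFS server
    # support proto=rdma together with nconnect before relying on this combination.
    import subprocess

    SERVER_EXPORT = "fs01:/datasets"   # placeholder export
    MOUNTPOINT = "/mnt/datasets"
    OPTS = ",".join([
        "vers=4.1",        # NFSv4.1 sessions
        "proto=rdma",      # RPC-over-RDMA transport (conventionally port 20049)
        "port=20049",
        "nconnect=8",      # multiple transport streams per mount (placeholder count)
        "hard", "noatime",
    ])

    subprocess.run(["mount", "-t", "nfs", "-o", OPTS, SERVER_EXPORT, MOUNTPOINT],
                   check=True)

    # Baseline for comparison: the same command with "proto=tcp" and nconnect
    # removed, then watch tail latency under concurrent readers.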

The result is not just higher raw throughput but more predictable performance curves, better fleet utilization, and higher application-level efficiency. By making storage less of a bottleneck, NFSoRDMA with multistream lets systems push their GPUs, CPUs, and interconnects closer to full capacity.

s@ure.us

URE — A performance engineering project for smarter cloud usage

Performance engineering and smarter cloud usage

URE is an ongoing project exploring how to align usage, resources, and economics for faster, leaner, more sustainable computing. It began as a founder-driven passion project and continues as a living workbench of ideas and experiments.