Personal experiments in performance engineering and distributed systems, 2025 — tackling AI’s very specific bottlenecks.
By Stefano Schotten — AI Infrastructure & Performance Engineering

ABOUT

URE started as a personal project, rooted in my work as a performance engineer and in the conviction that computing can be both faster and leaner when you truly understand what's under the hood.

The idea is simple: put the right data in the right place, align usage with resources, and connect both to economics. Done well, this creates systems that are not only blazing fast, but also sustainable and accountable.

Today, URE is a workbench of experiments and visualizations — exploring smarter ways to approach cloud usage, performance, and cost.

VISION

Cloud usage doesn’t need to be a black box.

With better insight and governance, engineers and finance teams can work together instead of against each other.

URE explores how Usage • Resources • Economics can become one connected system, showing that efficiency and performance aren’t opposites — they’re two sides of the same design principle.

FIELD NOTES
UPCOMING ARTIFACTS
GPU Communication Performance — NVIDIA NVLink, NCCL over RDMA and default TCP

This research prototype examines how modern Blackwell systems communicate—within a node and across nodes. Inside the box, GB200-class platforms pair GPUs with next-gen NVLink and NVSwitch fabrics; at rack scale, NVL72 stitches dozens of GPUs into a single high-bandwidth domain. The goal is to make that fabric’s character visible, not to crown a winner.

We contrast those on-node fabrics with “standard TCP over Ethernet” paths in distributed clusters. While TCP is ubiquitous and easy to operate, it introduces kernel/CPU overheads and higher tail latency compared to RDMA transports, especially for collective communication (NCCL).

Method plan — evaluate, compare, and publish results for: (1) intra-node NVLink vs PCIe, and (2) inter-node NCCL over RDMA (RoCEv2) at 50 Gbps vs TCP at 50 Gbps. We’ll capture throughput, latency distributions (p50/p99/p99.9), communication/compute overlap, and sustained GPU utilization across LLM inference and training phases.
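
As a sketch of how the inter-node leg might be measured (assuming a PyTorch/NCCL environment launched with torchrun; the message sizes and iteration counts are placeholders, and NCCL_IB_DISABLE is the usual NCCL toggle between the IB/RoCE and plain-socket transports):

    # allreduce_bench.py: sketch; launch on each node, e.g.
    #   torchrun --nnodes=2 --nproc-per-node=8 allreduce_bench.py
    # Run once with NCCL_IB_DISABLE=0 (RoCEv2) and once with NCCL_IB_DISABLE=1
    # (TCP sockets) to compare the two transports on the same hardware.
    import os, time
    import torch
    import torch.distributed as dist

    def bench(numel, iters=200, warmup=20):
        t = torch.ones(numel, dtype=torch.float16, device="cuda")
        for _ in range(warmup):
            dist.all_reduce(t)
        times = []
        for _ in range(iters):
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            dist.all_reduce(t)
            torch.cuda.synchronize()
            times.append(time.perf_counter() - t0)
        times.sort()
        world = dist.get_world_size()
        size_bytes = t.numel() * t.element_size()
        # ring all-reduce moves ~2*(n-1)/n of the payload per rank ("bus bandwidth")
        busbw = 2 * (world - 1) / world * size_bytes / (sum(times) / iters) / 1e9
        return times[iters // 2], times[int(iters * 0.99)], busbw

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        for mib in (1, 16, 256):                          # placeholder message sizes
            p50, p99, bw = bench(mib * 1024 * 1024 // 2)  # fp16 element count
            if dist.get_rank() == 0:
                print(f"{mib:4d} MiB  p50 {p50*1e3:7.3f} ms  p99 {p99*1e3:7.3f} ms  busbw {bw:6.1f} GB/s")
        dist.destroy_process_group()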

The objective is to quantify the practical gap between a Blackwell NVLink/NVSwitch domain and conventional Ethernet paths—and to show how disciplined performance engineering narrows that gap in real deployments.

Atlas Preview — Capacity & cost map


An interactive D3/Observable prototype built on FOCUS data, designed to make cloud usage and cost transparent. The stack uses Kafka (or Redpanda, for simplicity) to ingest up to one million events per second, treated as a write-ahead log (WAL), feeding a TimescaleDB/PostgreSQL backbone. Hasura exposes the aggregated data over GraphQL for insight gathering, while a pandas pipeline handles projections and seasonal forecasting across multiple zones and providers.
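
To illustrate the forecasting leg, a minimal pandas sketch over a FOCUS-style export; the column names follow the FOCUS spec (ChargePeriodStart, BilledCost, RegionId), but the seasonal-naive model, horizon, and grouping are assumptions for illustration rather than the project's actual pipeline:

    # focus_forecast.py: minimal sketch, assuming a FOCUS-style cost export
    # already aggregated in TimescaleDB and pulled into a DataFrame.
    import pandas as pd

    def daily_cost(df: pd.DataFrame) -> pd.DataFrame:
        # Column names follow the FOCUS spec; adjust to your export.
        df["ChargePeriodStart"] = pd.to_datetime(df["ChargePeriodStart"], utc=True)
        return (df.set_index("ChargePeriodStart")
                  .groupby("RegionId")["BilledCost"]
                  .resample("1D").sum()
                  .reset_index())

    def naive_seasonal_forecast(daily: pd.DataFrame, horizon_days: int = 28) -> pd.DataFrame:
        """Project each region forward by repeating the trailing 7-day pattern,
        scaled by the recent trend; a deliberately simple seasonal-naive model."""
        out = []
        for region, g in daily.groupby("RegionId"):
            g = g.sort_values("ChargePeriodStart")
            last_week = g["BilledCost"].tail(7).to_numpy()
            trend = g["BilledCost"].tail(28).mean() / max(g["BilledCost"].tail(56).head(28).mean(), 1e-9)
            start = g["ChargePeriodStart"].iloc[-1] + pd.Timedelta(days=1)
            idx = pd.date_range(start, periods=horizon_days, freq="1D")
            values = [last_week[i % len(last_week)] * trend for i in range(horizon_days)]
            out.append(pd.DataFrame({"RegionId": region,
                                     "ChargePeriodStart": idx,
                                     "ProjectedCost": values}))
        return pd.concat(out, ignore_index=True)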

The result is a practical compass for Operations Engineers — showing where to place data and applications for maximum performance and minimum cost. Work is ongoing to deliver the most effective “human-view” frontend.

Memory-Shared Container Mesh

A prototype exploring an alternative to traditional TCP-based microservices: containers exchange data through low-level memory sharing, bypassing the network stack entirely. The architecture enables near-zero-latency communication and drastically reduces serialization overhead, while preserving container boundaries and isolation.

The stack combines a shared-memory fabric with process-level coordination, exposing high-bandwidth data paths for workloads that demand microsecond responsiveness. Early results show throughput and efficiency gains well beyond what service meshes or gRPC pipelines can deliver.
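
A toy sketch of that data path, assuming two containers that share an IPC namespace or a common /dev/shm mount (for example, Docker's --ipc=shareable / --ipc=container:<name>); the segment name, payload shape, and the absence of locking are all simplifications:

    # producer.py: writes a frame into a named POSIX shared-memory segment.
    # Works across containers only if they see the same /dev/shm (shared IPC
    # namespace or a shared tmpfs volume); coordination/locking is elided here.
    from multiprocessing import shared_memory
    import numpy as np

    SEG_NAME = "mesh_ring0"        # placeholder segment name
    FRAME_SHAPE = (1024, 1024)     # placeholder payload: ~4 MiB of float32

    shm = shared_memory.SharedMemory(name=SEG_NAME, create=True,
                                     size=int(np.prod(FRAME_SHAPE)) * 4)
    frame = np.ndarray(FRAME_SHAPE, dtype=np.float32, buffer=shm.buf)
    frame[:] = np.random.rand(*FRAME_SHAPE)   # "send" without sockets or copies

    # consumer.py: attaches to the same segment by name and reads in place.
    # shm = shared_memory.SharedMemory(name=SEG_NAME)
    # view = np.ndarray(FRAME_SHAPE, dtype=np.float32, buffer=shm.buf)
    # ... process view with zero serialization, then shm.close()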

The goal is to give Operations Engineers a new lens on distributed systems design — one where containers collaborate at hardware speed, unlocking performance once reserved for tightly-coupled monoliths.

HPC & ML Serving Scheduler

A prototype for orchestrating scarce, high-value GPU fleets in the era of LLMs. Using Redfish telemetry, the system tracks hardware states, fleet utilization, and load profiles to dynamically adjust queues, KV caches, and scheduling strategies.
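
As an illustration, a small polling sketch against a Redfish service; the BMC address, credentials, and chassis ID are placeholders, and which resources are exposed (the classic Power/Thermal schemas used here, or the newer ThermalSubsystem/EnvironmentMetrics) varies by vendor and firmware:

    # redfish_poll.py: hedged sketch reading power draw and temperatures from
    # a BMC's Redfish service; resource layout differs between platforms.
    import requests

    BMC = "https://bmc-node01.example"   # placeholder BMC address
    AUTH = ("metrics-user", "secret")    # placeholder credentials
    CHASSIS = "1"                        # placeholder chassis ID

    def get(path):
        r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=5)
        r.raise_for_status()
        return r.json()

    power = get(f"/redfish/v1/Chassis/{CHASSIS}/Power")
    watts = power["PowerControl"][0].get("PowerConsumedWatts")

    thermal = get(f"/redfish/v1/Chassis/{CHASSIS}/Thermal")
    temps = {t["Name"]: t.get("ReadingCelsius") for t in thermal["Temperatures"]}
    hottest = max((v for v in temps.values() if v is not None), default=float("nan"))

    print(f"node power: {watts} W, hottest sensor: {hottest} °C")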

The central challenge: inference and training compete for the same accelerators.

During peak hours — GPUs are tuned for inference. The priority is low-latency serving at multi-million-user scale, with scheduling optimized around prompt throughput and memory-efficient KV cache management. Parallel file systems like Lustre or BeeGFS aren’t required here — most data is already loaded into memory, so the focus shifts to smart queuing and GPU allocation.

During off-peak windows — the same GPUs pivot to distributed training. This is when Lustre-class parallel file systems and RDMA fabrics become critical: streaming large datasets, checkpointing models, and synchronizing gradients across nodes. Techniques like AllReduce over InfiniBand and overlapping communication with computation ensure GPUs remain fully saturated.
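
The switching logic itself can be sketched as a small policy function; the peak window, thresholds, and telemetry fields below are invented for illustration, not the scheduler's actual interface:

    # mode_policy.py: hypothetical sketch of the day/night GPU mode switch.
    # The peak window, thresholds, and snapshot fields are placeholders.
    from dataclasses import dataclass
    from datetime import datetime, time

    @dataclass
    class FleetSnapshot:
        inference_qps: float        # current serving load
        pending_training_jobs: int  # queued distributed-training jobs
        gpu_utilization: float      # fleet-wide, 0.0-1.0 (e.g. from DCGM/Redfish)

    PEAK_START, PEAK_END = time(7, 0), time(22, 0)   # placeholder peak window

    def peak_qps_estimate() -> float:
        return 50_000.0             # placeholder; derive from history in practice

    def desired_mode(now: datetime, snap: FleetSnapshot) -> str:
        in_peak = PEAK_START <= now.time() < PEAK_END
        if in_peak:
            return "INFERENCE"      # latency-first: queues + KV-cache tuning
        if snap.inference_qps > 0.2 * peak_qps_estimate():
            return "INFERENCE"      # off-peak, but demand hasn't tailed off yet
        if snap.pending_training_jobs > 0:
            return "TRAINING"       # drain serving, hand GPUs to the trainers
        return "INFERENCE"          # idle default: stay ready to serve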

By adapting to demand cycles, the scheduler extracts maximum performance from scarce top-end accelerators — ensuring organizations get the inference capacity users expect during the day and the training throughput researchers need at night.

NFS over RDMA with Multistream

After nearly a decade of tuning high-performance storage and distributed systems, I am now exploring how modern kernel implementations of NFS 4.1 with multistream support over RDMA change the game compared to traditional NFS mounts.

Classic NFS mounts operate over TCP, where throughput and latency are bounded by the network stack and serialized calls. Under heavy load, this introduces queue buildup, head-of-line blocking, and tail-latency spikes that limit effective resource utilization.

NFSoRDMA with kernel multistream removes these bottlenecks by allowing parallel data channels directly over RDMA. Each stream runs independently, eliminating single-flow choke points and dramatically reducing long-tail latency. Metadata and data I/O can progress concurrently, keeping GPUs and CPUs fed without stalls.
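
For concreteness, a mount sketch wrapped in Python; it assumes a kernel and server where proto=rdma (RPC-over-RDMA on port 20049) can be combined with nconnect-style multistream, which is not universal, and the export path and stream count are placeholders:

    # mount_nfsordma.py: hedged sketch; verify that your kernel and NFS server
    # support proto=rdma together with nconnect before relying on this combination.
    import subprocess

    SERVER_EXPORT = "fs01:/datasets"   # placeholder export
    MOUNTPOINT = "/mnt/datasets"
    OPTS = ",".join([
        "vers=4.1",        # NFSv4.1 sessions
        "proto=rdma",      # RPC-over-RDMA transport (conventionally port 20049)
        "port=20049",
        "nconnect=8",      # multiple transport streams per mount (placeholder count)
        "hard", "noatime",
    ])

    subprocess.run(["mount", "-t", "nfs", "-o", OPTS, SERVER_EXPORT, MOUNTPOINT],
                   check=True)

    # Baseline for comparison: the same command with "proto=tcp" and nconnect
    # removed, then watch tail latency under concurrent readers.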

The result is not just higher raw throughput but more predictable performance curves, better fleet utilization, and higher application-level efficiency. By making storage less of a bottleneck, NFSoRDMA with multistream lets systems push their GPUs, CPUs, and interconnects closer to full capacity.

s@ure.us

URE — A performance engineering project for smarter cloud usage

Performance engineering and smarter cloud usage

URE is an ongoing project exploring how to align usage, resources, and economics for faster, leaner, more sustainable computing. It began as a founder-driven passion project and continues as a living workbench of ideas and experiments.