A few weeks ago we hit a production issue on a cloud environment — one XCP-ng host was showing IOPS contention caused by a single guest VM. The classic noisy-neighbor problem on shared storage. The diagnostic path was obvious: cross-reference the dom0 guest list with iostat on the host, find the VM hammering the disk, and work the problem from there. Straightforward correlation — the kind of thing an experienced operator resolves in fifteen minutes with two terminal windows.
Since I was deep in AIOps work at the time, I decided to throw an agent at it. Live production telemetry, multiple data sources needing cross-reference, structured parsing of streaming output — this was the perfect case for AI-assisted triage.
The agent lasted about ten seconds before it was useless.
Not because the model couldn’t reason about IOPS contention and VM scheduling. It absolutely could. The problem was the data volume. The host was in production — iostat alone was generating hundreds of lines per second, guest state was cycling through dozens of VMs, and the moment I plugged the agent into the live stream it was drinking from a firehose. The context drifted on the first pass. By the time the agent had the iostat snapshot and the guest mapping in the same window, the earlier data had already been buried under the next wave of telemetry. It couldn’t hold two data points long enough to correlate them.
I solved it by hand in twelve minutes. Crossref, isolate, remediate. But the experience lodged — not because the agent failed, but because the failure had nothing to do with intelligence. The model had the reasoning capability. It just couldn’t see its own findings through the noise.
It reminded me of a customer whose Catalyst core was generating 37,000 events per second into their log aggregator — different layer, same physics. When the pipe is that wide, everything downstream drowns. But that’s a story for another day.
Every Command Starts From Zero
When an LLM agent runs commands on a remote host through SSH, every invocation is a clean-room. ssh host "dmesg | grep -i error" opens a TCP connection, negotiates keys, authenticates, spawns a shell, runs the command, dumps the full output into the agent’s context, and tears everything down. The next command does it again. cd /opt/app in command one doesn’t exist by command two.
For a single check, nobody notices. For a full triage across a twenty-node GPU cluster, the math gets ugly. Eight commands per host is 160 handshakes: ~250 ms of handshake overhead each on LAN, ~800 ms over the internet. That’s 40–128 seconds of pure SSH ceremony before a byte of diagnostic work happens. At fleet scale — ten thousand hosts, eight checks each — handshake overhead alone accounts for roughly 17.8 hours.
But the latency isn’t the real problem.
Every byte of output lands in the agent’s context window unfiltered. The agent can’t preview it, can’t measure it before reading, can’t decide “4,327 lines of warnings? Let me just read the last 20.” Raw SSH doesn’t give it that choice. A single journalctl -p warning on a busy production host pushed 1.1 MB of text into the context in my benchmark — consumed as input tokens before the agent could evaluate whether any of it was signal.
Over fifty commands, that’s roughly 900 KB of raw output in the context window. Most of it noise. All of it displacing previous findings, the current plan, tool definitions, and the agent’s actual reasoning.
Context Drift
I’ve started calling this context drift — the progressive degradation of an agent’s reasoning quality as its working context fills with low-signal content.
It’s not hallucination. Hallucination is the model generating false information from its training distribution. Context drift is the model losing access to correct information it already gathered because noise has pushed it out of effective attention.
Think of a mechanic diagnosing an intermittent misfire across an eight-cylinder engine. Compression test on cylinder one. Spark plug inspection on two. Injector flow rates on three. Each result gets pinned to the corkboard behind the bench — visible, organized, referenceable. Now imagine that after every test, someone dumps a box of random parts onto the workbench. Wrenches, gaskets, bolts, shop manuals opened to the wrong chapter. By the fifth dump she can’t find her compression gauge. By the eighth she can’t see the corkboard at all. The mechanic didn’t get worse at diagnosis. The workspace got sabotaged.
That’s what unfiltered SSH output does to a context window.
The difference between an agent that synthesizes findings across fifty commands into a coherent root cause and one that loses the thread after fifteen isn’t model quality. It’s plumbing.
Metadata First
The fix is architectural, not incremental. You don’t solve context drift by buying a bigger context window — you stop the noise from entering in the first place.
I built hauntty around one principle: the agent should know what happened before it decides to read the output. It’s open source, published on GitHub — because tooling that solves infrastructure problems shouldn’t live behind a gate.
A single Go binary, self-deploying. hauntty connect user@host SCPs itself to the remote, starts a daemon, and forwards a Unix socket over the SSH tunnel. One connection, reused for everything. The daemon holds persistent PTY sessions — real bash shells where cd, export, environment variables, and shell functions survive across commands. No more disposable shells.
When a command finishes, the agent gets structured metadata:
```
seq: 5
rc: 0
stdout_lines: 4327
stderr_lines: 0
cwd: /var/log
elapsed_s: 0.3
```
stdout_lines: 4327. Now the agent decides. Re-run with | tail -20. Or grep -c 'Failed' to get just a count. Or skip entirely — stdout_lines: 0 means the check passed, nothing to read.
The output still exists, captured per-command in separate files on the remote host, readable at any time with offset and limit pagination. But it only enters the context when the agent explicitly requests it — and only the slice it needs.
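The agent-side policy this enables is small enough to sketch. A hypothetical decision function in Go (the struct and thresholds are illustrative, not hauntty's actual wire format):

```go
package main

import "fmt"

// CommandMeta mirrors the per-command metadata shown above.
// Illustrative struct; field names are assumptions.
type CommandMeta struct {
	Seq         int
	RC          int
	StdoutLines int
	StderrLines int
	CWD         string
	ElapsedS    float64
}

// decideNext sketches the policy: read nothing on silence, read
// everything when the output is small, page the tail when it is not.
func decideNext(m CommandMeta) string {
	switch {
	case m.RC == 0 && m.StdoutLines == 0 && m.StderrLines == 0:
		return "skip" // check passed, nothing to read
	case m.StdoutLines <= 50:
		return "read_all" // small enough to ingest whole
	default:
		return fmt.Sprintf("read_tail offset=%d limit=20", m.StdoutLines-20)
	}
}

func main() {
	m := CommandMeta{Seq: 5, RC: 0, StdoutLines: 4327, CWD: "/var/log", ElapsedS: 0.3}
	fmt.Println(decideNext(m)) // read_tail offset=4307 limit=20
}
```

The point is that the decision runs over a few dozen bytes of metadata, not over 4,327 lines of output that have already been paid for as input tokens.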
On the same eight-command health check: raw SSH consumed 1.1 MB. hauntty consumed 13 KB — metadata plus forty-four lines the agent chose to read. Ninety-nine percent less context for the same diagnostic coverage.
What the Kernel Tells You
Process monitoring is where the interesting engineering lived.
Most shell wrappers determine command completion by matching regex against output — looking for prompt strings, timing heuristics, specific patterns. This breaks the way you’d expect: commands that print prompt-like strings, pipelines where the prompt flashes between stages, interactive programs that wait for input with no visible output marker.
hauntty reads /proc directly. /proc/<pid>/stat for process state — running, sleeping, disk wait, zombie. /proc/<pid>/wchan for the kernel wait channel — the exact kernel function the process is blocked on. /proc/<pid>/io for I/O counters.
The classification rules are deterministic:
- Bash on `n_tty_read`, no children → done (waiting for next command)
- Child on `n_tty_read` → waiting_input (interactive prompt)
- Child in state D → io_wait
- Child with CPU delta > 0 → running
- Bash zombie → zombie (needs recovery)
No heuristics. No guessing. The kernel already knows what every process is doing — /proc is the API.
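The reads themselves are a few lines each. A sketch of the mechanism, assuming Linux and with the function names invented for illustration (the real rules also need a CPU-tick delta between samples to separate running from idle; this sketch defaults any live child to running):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// procState returns the single-letter state (R, S, D, Z, ...) from
// /proc/<pid>/stat. The comm field may contain spaces and parens, so
// parse from the last ')' rather than splitting on whitespace.
func procState(pid int) (byte, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	s := string(b)
	i := strings.LastIndexByte(s, ')')
	if i < 0 || i+2 >= len(s) {
		return 0, fmt.Errorf("malformed stat")
	}
	return s[i+2], nil // the state letter follows ") "
}

// wchan returns the kernel function the process is blocked in ("" if runnable).
func wchan(pid int) (string, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/wchan", pid))
	if err != nil {
		return "", err
	}
	w := strings.TrimSpace(string(b))
	if w == "0" {
		w = ""
	}
	return w, nil
}

// classify applies the deterministic rules above to one bash session.
func classify(bashState byte, bashWchan string, children []int) string {
	if bashState == 'Z' {
		return "zombie"
	}
	if bashWchan == "n_tty_read" && len(children) == 0 {
		return "done"
	}
	for _, pid := range children {
		st, _ := procState(pid)
		w, _ := wchan(pid)
		if st == 'D' {
			return "io_wait"
		}
		if w == "n_tty_read" {
			return "waiting_input"
		}
	}
	return "running"
}

func main() {
	if st, err := procState(os.Getpid()); err == nil {
		fmt.Printf("self: state=%c\n", st)
	}
	fmt.Println(classify('S', "n_tty_read", nil)) // done
}
```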
Monitoring runs in two phases. Phase one polls the return code file every 50 ms for one second — catches fast commands like echo, cat, grep without /proc overhead. Phase two kicks in for anything still running: /proc sampling at one-second intervals, streaming CPU percentage, I/O bytes, and elapsed time back to the client. This two-phase approach matters operationally because the majority of diagnostic commands — the ones an agent fires off dozens of during a triage pass — complete in under a second, and you don’t want to pay /proc scanning overhead for every cat /proc/uptime.
There’s a subtlety with pipelines that took a few iterations to get right. When cmd_a | cmd_b executes, there’s a moment between cmd_a completing and cmd_b consuming the final pipe buffer where bash appears idle — n_tty_read, no children. Without debouncing, this triggers a false “done” classification mid-pipeline. The fix is a two-tick window: the classification has to hold for two consecutive one-second samples before it’s reported. Long enough to absorb pipeline transitions, short enough to not meaningfully delay real completion detection.
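The two-tick rule is just a debounce over the classification stream. A minimal sketch (the type and method are illustrative, not hauntty's internals):

```go
package main

import "fmt"

// debouncer reports a classification only after it has held for
// `window` consecutive samples: the two-tick rule.
type debouncer struct {
	window int
	last   string
	streak int
}

// observe records one sample and returns (class, true) once the same
// classification has held for window consecutive ticks.
func (d *debouncer) observe(class string) (string, bool) {
	if class == d.last {
		d.streak++
	} else {
		d.last, d.streak = class, 1
	}
	if d.streak >= d.window {
		return class, true
	}
	return "", false
}

func main() {
	d := &debouncer{window: 2}
	// Mid-pipeline flicker: bash looks idle for a single tick between stages.
	for _, s := range []string{"running", "done", "running", "done", "done"} {
		if c, ok := d.observe(s); ok {
			fmt.Println("reported:", c) // fires only on the second consecutive "done"
		}
	}
}
```

The one-tick “done” between pipeline stages never reaches the client; only the classification that survives two samples does.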
Shell state preservation follows the same design-follows-physics approach. The usual trick for persistent wrappers is bash -c "command" or sourcing a script — both create subshells where cd and export don’t propagate back to the parent. hauntty injects a function into the running bash instance that uses exec 3>&1 4>&2 to save file descriptors, redirects stdout and stderr to per-command capture files, runs the command with eval in the same process, captures the return code, and restores the original descriptors. No subshell. cd in command one persists to command fifty. If something destroys the wrapper — an exec bash or an exit — hauntty detects the loss and re-injects it. The bash process survives; only the function needs restoring.
The Numbers
Two servers, two tasks each, forty total commands. One server at sub-millisecond latency, one at 50 ms.
| | Raw SSH | hauntty |
|---|---|---|
| Wall clock (40 commands) | 21.7 s | 7.7 s |
| Context consumed | 160.6 KB | 18.6 KB |
| SSH handshakes | 40 | 0 after connect |
2.8x wall clock improvement. 89% context reduction in aggregate, 99% on the high-output health check task. The gains scale with two things: network latency (more handshake overhead eliminated) and output volume (more noise prevented from entering context). The worst case — fast LAN, minimal output — still shows 2x from eliminated handshakes alone.
What’s Still Broken
v0.1.0 has three security gaps I want to be explicit about.
The daemon socket has no authentication beyond Unix file permissions. SSH secures the tunnel, and the socket file is owned by the session user, but any process running as that user on the remote host can connect and execute commands. On single-tenant infrastructure — which is where I run it — SSH access already implies trust. On a shared host, this is a privilege escalation path. Socket auth is the top priority for v0.2.
Command text persists as plaintext on disk. If a command contains secrets — an API key in an environment variable, a password in a connection string — it’s written in the clear to the session directory. Needs either in-memory-only passing or encrypted-at-rest storage.
The --yes flag for auto-answering interactive prompts answers everything affirmatively without discrimination. It should whitelist known-safe patterns and round-trip unknowns back to the agent for decision.
None of these are architectural — the fixes are scoped and straightforward. But I’d rather ship with documented gaps than undocumented ones.
Context drift isn’t a model problem. It’s an infrastructure problem. Finite capacity, competing demands, noise displacing signal. The fix isn’t a bigger context window. It’s a cleaner one — where the default output is metadata, not a dump, and the agent reads only what it decides matters.
Physics doesn’t care whether the system is a cooling loop or a context window.
hauntty is open source, Apache 2.0. Single static Go binary, two dependencies, self-deploying.