VRAM Swap, Two Weeks In: Multithreading and Killing the Deadlock

The first nbd-vram post went further than I expected. It hit the front page of Hacker News, and a lot of the comments on that thread were sharp, specific feedback about where the design was weak.

So I spent the next two weeks on it. The two biggest things: the daemon is now multi-threaded, and the deadlock that used to hard-freeze the whole machine under heavy swap pressure is gone. This post covers what changed and the refreshed numbers. The questions and objections from the thread get their own post on Monday.

What Changed

The daemon is multi-threaded now. The original was single-threaded: one connection, one cuMemcpy per request. That is exactly the userspace round-trip bottleneck several commenters flagged. It is now a thread pool: one worker and one nbd connection per core, each with its own non-blocking CUDA stream, async copies, and a pinned host buffer (cuMemAllocHost) so the driver skips its staging copy. The kernel's nbd multi-connection support spreads requests across the connections.

Single-stream sequential throughput barely moves, because one request in flight is still one request in flight. But concurrent 4K IOPS scale about 4x, from roughly 77k single-threaded to ~312k with connections fanned across cores. Scaling flattens right at physical core count: SMT and extra connections buy almost nothing past that.

The freeze is gone. This is the bug I actually cared about. Under sustained pressure the single-threaded daemon would saturate and the box would hard-freeze. It is the classic swap-over-NBD deadlock: servicing a swap write needs an allocation, and at zero free RAM that allocation recurses into reclaim, which is itself waiting on the write to finish. mlockall does not fix this on its own: it keeps the daemon's own pages resident, but a fresh allocation made while servicing a request can still fall into reclaim.

The fix is prctl(PR_SET_IO_FLUSHER), which sets PF_MEMALLOC_NOIO and PF_LOCAL_THROTTLE so the daemon's own allocations never recurse back into the I/O path. It is the same thing nfs-ganesha and libfuse do. Add OOMScoreAdjust=-1000 so the kernel never picks the daemon to kill, and mlockall(MCL_CURRENT | MCL_FUTURE) so its pages are never the ones being swapped. The machine now fills the entire 7 GB of VRAM swap at zero free RAM and the desktop stays responsive.
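To make the three protections concrete, here is a rough sketch of the same calls made from Python through ctypes. It is not the daemon's code (the daemon is not written in Python), and the names harden_process and _libc_call are made up for illustration. The numeric constants are the Linux values from linux/prctl.h and sys/mman.h. OOMScoreAdjust= is a systemd unit setting. Writing /proc/self/oom_score_adj by hand is assumed here as a stand-in for it. Each step needs privileges (CAP_SYS_RESOURCE for the prctl), so the sketch reports failures instead of raising.

```python
import ctypes
import errno

PR_SET_IO_FLUSHER = 57          # <linux/prctl.h>, Linux 5.6+, needs CAP_SYS_RESOURCE
MCL_CURRENT, MCL_FUTURE = 1, 2  # <sys/mman.h> on Linux


def _libc_call(func, *args):
    """Call a libc function; return None on success, else the errno name."""
    ctypes.set_errno(0)
    if func(*args) == 0:
        return None
    code = ctypes.get_errno()
    return errno.errorcode.get(code, str(code))


def harden_process(lock_memory=True, adjust_oom=True):
    """Apply the protections to the calling process.

    Returns a dict mapping each attempted step to None (success)
    or a short error string (failure or unsupported platform).
    """
    results = {}
    try:
        libc = ctypes.CDLL(None, use_errno=True)
    except (OSError, TypeError):
        libc = None
    prctl = getattr(libc, "prctl", None)
    mlockall = getattr(libc, "mlockall", None)

    # 1. Mark the process as an I/O flusher so its allocations
    #    do not recurse into the I/O path during reclaim.
    if prctl is None:
        results["io_flusher"] = "unsupported"
    else:
        zero = ctypes.c_ulong(0)
        results["io_flusher"] = _libc_call(
            prctl, PR_SET_IO_FLUSHER, ctypes.c_ulong(1), zero, zero, zero
        )

    # 2. Ask the OOM killer never to pick this process
    #    (by-hand stand-in for systemd's OOMScoreAdjust=-1000).
    if adjust_oom:
        try:
            with open("/proc/self/oom_score_adj", "w") as f:
                f.write("-1000")
            results["oom_score_adj"] = None
        except OSError as exc:
            results["oom_score_adj"] = errno.errorcode.get(exc.errno, str(exc.errno))

    # 3. Pin current and future pages so they are never swapped out.
    if lock_memory:
        if mlockall is None:
            results["mlockall"] = "unsupported"
        else:
            results["mlockall"] = _libc_call(mlockall, MCL_CURRENT | MCL_FUTURE)

    return results
```

Run unprivileged, harden_process() will most likely report EPERM for the first step, which is a reminder that the daemon has to start with the right capabilities for any of this to take effect.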

The installer sizes it for you. No more guessing at VRAM_SETUP_SIZE_MB. The installer queries nvidia-smi for total VRAM and whether the card drives a display, then recommends an allocation: leave ~1 GB of headroom on an offload-only card, ~3 GB on a card that also drives the desktop. At runtime the daemon backs off in 512 MiB steps if the requested size will not allocate, so it grabs as much as it can even with a compositor already loaded.
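The sizing rule and the runtime back-off are simple enough to sketch. This is an illustration of the logic as described above, not the installer's actual code: the function names are hypothetical, "~1 GB" and "~3 GB" are taken as exactly 1024 and 3072 MiB, and try_alloc is a stand-in for whatever call actually reserves VRAM.

```python
MIB_PER_GB = 1024
STEP_MIB = 512  # back-off step described in the post


def recommend_size_mib(total_vram_mib, drives_display):
    """Total VRAM minus headroom: ~3 GB if the card drives a display, else ~1 GB."""
    headroom_mib = (3 if drives_display else 1) * MIB_PER_GB
    return max(total_vram_mib - headroom_mib, 0)


def allocate_with_backoff(requested_mib, try_alloc, step_mib=STEP_MIB):
    """Try the requested size, shrinking by step_mib until try_alloc succeeds.

    try_alloc(size_mib) -> bool is a hypothetical allocator callback.
    Returns the size that allocated, or 0 if nothing did.
    """
    size = requested_mib
    while size > 0:
        if try_alloc(size):
            return size
        size -= step_mib
    return 0


if __name__ == "__main__":
    want = recommend_size_mib(8192, drives_display=False)
    # Pretend a compositor has already taken some VRAM, so only 6000 MiB fits.
    got = allocate_with_backoff(want, lambda mib: mib <= 6000)
    print(want, got)
```

For an 8192 MiB offload-only card the rule gives 7168 MiB, which is consistent with the 7 GB swap size used throughout this post.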

Updated Benchmarks

These supersede the numbers in the original post. Same rig: RTX 3070 Laptop (8 GB), Ryzen 9 5900HX, Linux 6.17 on Pop!_OS, NVMe on PCIe 4.0 behind dm-crypt. Every one of these ships as a script in benchmarks/ so you can reproduce them.

Per-operation latency - ioping, 4K, sporadic reads

Device	Min	Avg	Max
NVMe	115 us	8.7 ms	10.1 ms
VRAM (nbd)	90 us	257 us	437 us

This is still the one that matters. ~34x lower average latency. NVMe is physically capable of microseconds, but APST puts it to sleep between sporadic faults and it wakes cold almost every time, paying a multi-millisecond penalty. VRAM has no power states. At 8.7 ms per fault you feel the stutter; at 257 us you do not.
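The ~34x figure is just the ratio of the two averages in the table. A quick check, using only the numbers above (the helper name is made up for illustration):

```python
def latency_ratio(slow_us, fast_us):
    """How many times lower the fast latency is than the slow one."""
    return slow_us / fast_us


nvme_avg_us = 8.7 * 1000  # 8.7 ms from the table, in microseconds
vram_avg_us = 257.0       # 257 us from the table

if __name__ == "__main__":
    print(f"{latency_ratio(nvme_avg_us, vram_avg_us):.1f}x")  # 33.9x
```

The same helper on the Min column (115 us vs 90 us) gives about 1.3x, which is the point of the paragraph above: the drive is not slow when it is awake, it is slow when it has to wake up.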

Concurrent 4K IOPS - fio, numjobs=16

Device	IOPS	Bandwidth
NVMe	240k	936 MiB/s
VRAM (nbd)	312k	1219 MiB/s

This is where multithreading shows up: ~77k single-threaded became ~312k. The NVMe number here is a hot, throttled drive under sustained load. A cool drive hits ~860k, so NVMe still wins parallel IOPS outright when it has thermal headroom.
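The IOPS and bandwidth columns are two views of the same measurement: bandwidth is IOPS times the 4 KiB block size. A small sketch of that conversion and of the speedup quoted above (the function name is illustrative, and the IOPS inputs are the rounded table values, so the result can be off by a MiB/s or so from what fio printed):

```python
BLOCK_BYTES = 4096   # 4K blocks, as in the fio runs
MIB = 1024 * 1024


def iops_to_mib_s(iops, block_bytes=BLOCK_BYTES):
    """Convert an IOPS figure to MiB/s for a fixed block size."""
    return iops * block_bytes / MIB


if __name__ == "__main__":
    print(iops_to_mib_s(312_000))    # 1218.75, table shows 1219 MiB/s
    print(iops_to_mib_s(240_000))    # 937.5, table shows 936 MiB/s
    print(round(312_000 / 77_000, 2))  # 4.05, the "about 4x" scaling
```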

Single-stream 4K IOPS - fio, one job

Device	Read IOPS	Write IOPS	Bandwidth
NVMe	59k	59k	229 MiB/s
VRAM (nbd)	42k	42k	165 MiB/s

NVMe leads, thread count is irrelevant with one request in flight, and both devices are far past what sporadic swap actually demands.
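With exactly one request in flight, IOPS is simply the reciprocal of the per-request round-trip time, which is why extra threads cannot help here. A back-of-the-envelope sketch, assuming the single-job run really does keep the queue depth at 1 (the helper name is made up):

```python
def per_request_us(iops_at_qd1):
    """Mean time per request in microseconds when only one is in flight."""
    return 1_000_000 / iops_at_qd1


if __name__ == "__main__":
    print(round(per_request_us(42_000), 1))  # VRAM (nbd): 23.8 us per request
    print(round(per_request_us(59_000), 1))  # NVMe: 16.9 us per request
```

The only way to raise this number is to shorten the round trip itself. Adding workers raises throughput only when there are more requests in flight to hand to them.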

Sequential throughput - dd, O_DIRECT

Device	Write	Read
NVMe	2.8 GB/s	3.1 GB/s
VRAM (nbd)	1.9 GB/s	2.7 GB/s

NVMe's home turf, and not what swap does. The kernel moves random 4K pages, not sequential megabyte streams. ~2 GB/s is plenty.

Surviving heavy pressure

The whole 7 GB partition fills at zero free RAM and the machine stays responsive. The single-threaded build hard-froze here. That is the change I am happiest about.

It is still not an NVMe killer and I will not pretend otherwise. NVMe wins raw throughput and parallel IOPS. Where this genuinely wins is latency on the sporadic single-page faults that make a desktop feel laggy, and it is free, low-wear swap out of VRAM that was sitting idle anyway.

The repo is nbd-vram. On Monday I am putting up a FAQ that works through the Hacker News thread directly: gaming, the deadlock, why CUDA instead of a BAR1 mapping, and the prior art people pointed me at.