Tags: Linux, CUDA, NVIDIA, LLM

Fitting an LLM Into VRAM That Isn't There

Sean Lobjoit · 3 min read

Last week I used VRAM as OS swap space. This week I went the other direction.

I had coding models I wanted to run locally that were just over my 8 GB VRAM limit. CUDA Unified Memory seemed like the right tool. Its UVM driver manages 4 KB pages across VRAM and system RAM transparently, migrating hot pages into VRAM and spilling cold ones out to RAM. If you hook cudaMalloc to redirect large allocations to cudaMallocManaged, the model loads fully onto the GPU and overflows into RAM without the application knowing.

That's what llm-fit does. It also inflates the reported free VRAM so Ollama assigns all layers to GPU rather than splitting to CPU.
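To see why the reported free VRAM matters, here is a toy model of a layer-split decision in Python. It is not Ollama's actual scheduler: the function name, the per-layer size, and the memory figures are all hypothetical, chosen only to show how an inflated free-memory figure pushes every layer onto the GPU.

```python
def layers_on_gpu(total_layers, layer_bytes, reported_free_bytes):
    """Toy split: place as many whole layers on the GPU as the reported free VRAM allows."""
    return min(total_layers, reported_free_bytes // layer_bytes)


GB = 10**9

# Hypothetical model: 10 layers of 1 GB each on a card reporting 8 GB free.
honest = layers_on_gpu(10, 1 * GB, 8 * GB)

# Same model when a hook reports more free VRAM than physically exists.
inflated = layers_on_gpu(10, 1 * GB, 16 * GB)

print(f"honest report:   {honest}/10 layers on GPU")
print(f"inflated report: {inflated}/10 layers on GPU")
```

In this toy version the layers that no longer fit are still "on GPU" as far as the application is concerned; where their pages physically live is left to the UVM driver.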

What Actually Happened

| Model | Overflow | Without llm-fit | With llm-fit |
| --- | --- | --- | --- |
| llama3.1:8b-instruct-q8_0 | 293 MB | 27/33 layers GPU, ~10 tok/s | 33/33 GPU, ~0.2 tok/s |
| qwen2.5-coder:14b | ~1 GB | 36/49 layers GPU, ~10 tok/s | 49/49 GPU, ~0.2 tok/s |
| qwen3-coder:30b-a3b | ~9 GB | 18/49 layers GPU | hangs on load |

The numbers tell the story. For small overflows, all layers land on the GPU, but throughput falls from ~10 tok/s to ~0.2 tok/s in both cases measured, far slower than Ollama's default CPU layer split. At ~9 GB of overflow the model does not load at all.

Why

PCIe bandwidth caps page migration at around 15 GB/s. System RAM runs at around 50 GB/s. On every forward pass, the UVM driver services page faults over PCIe for whichever weights aren't in VRAM. That's slower than letting Ollama just read those layers from RAM directly with its built-in CPU split.
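A rough back-of-the-envelope version of that comparison, in Python. It uses the two bandwidth figures above and the overflow sizes from the results table, and it assumes (my simplification, not a measurement) that exactly the overflow is moved once per forward pass and that sizes are decimal MB/GB.

```python
PCIE_BYTES_PER_S = 15e9  # ~15 GB/s page migration, figure from the text
RAM_BYTES_PER_S = 50e9   # ~50 GB/s system RAM, figure from the text


def transfer_ms(num_bytes, bytes_per_s):
    """Milliseconds needed to move num_bytes at the given bandwidth."""
    return num_bytes / bytes_per_s * 1000.0


# Overflow sizes from the results table, treated as decimal units (assumption).
overflows = {
    "llama3.1:8b-instruct-q8_0": 293e6,
    "qwen2.5-coder:14b": 1e9,
    "qwen3-coder:30b-a3b": 9e9,
}

for model, overflow in overflows.items():
    pcie = transfer_ms(overflow, PCIE_BYTES_PER_S)
    ram = transfer_ms(overflow, RAM_BYTES_PER_S)
    print(f"{model}: {pcie:.1f} ms over PCIe vs {ram:.1f} ms from RAM per pass")
```

These are transfer-time floors only. They leave out page-fault handling and any extra traffic from pages being evicted and brought back, so they do not by themselves account for the measured ~0.2 tok/s.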

The sweet spot, if one exists, is models within about 200-300MB of the VRAM limit. At that range, all layers stay on GPU and the overhead is manageable. Beyond that, you're fighting the bandwidth ceiling on every token.

Large models are a different category entirely. cudaMallocManaged returns successfully at 9 GB of overflow, but tensor loading stalls indefinitely. I tried reduced context windows, different load timeouts, and llama.cpp's --mmap path. None of them changed the outcome.

One Interesting Wrinkle

Ollama bundles its own CUDA runtime loaded with RTLD_LOCAL, which hides symbols from the standard RTLD_NEXT lookup that LD_PRELOAD hooks rely on. llm-fit falls back to explicit path resolution against Ollama's bundled libraries when the standard lookup fails. Worth knowing if you're doing anything similar.
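The fallback order can be sketched in Python with ctypes: try the process-wide namespace first, then open candidate library files by explicit path and look the symbol up on that handle. This is an analogue for illustration, not llm-fit's code; the function name resolve_symbol and the library path in the usage lines are hypothetical placeholders, and the sketch assumes Linux.

```python
import ctypes
import os


def resolve_symbol(name, candidate_paths):
    """Return (source, function) for `name`, or None if it cannot be found."""
    # Step 1: look in the process-wide namespace. A library that was opened
    # with RTLD_LOCAL does not export its symbols here, so this can miss.
    try:
        global_ns = ctypes.CDLL(None)
        return ("global", getattr(global_ns, name))
    except (AttributeError, OSError, TypeError):
        pass

    # Step 2: open each candidate file directly and query its own handle.
    # A lookup on an explicit handle works regardless of RTLD_LOCAL.
    for path in candidate_paths:
        if not os.path.isfile(path):
            continue
        try:
            lib = ctypes.CDLL(path)
            return (path, getattr(lib, name))
        except (AttributeError, OSError):
            continue

    return None


# Hypothetical placeholder path; substitute wherever the bundled runtime lives.
found = resolve_symbol("cudaMalloc", ["/path/to/bundled/libcudart.so"])
print("cudaMalloc:", found[0] if found else "not found")
```

The point of the second step is that a symbol lookup against an explicit library handle does not depend on how some other part of the process loaded that library.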

What Actually Works for Big Models

For large MoE models, ktransformers is the better answer. It routes only the active experts through VRAM per token rather than managing the full weight set, so the bandwidth constraint works in its favour rather than against it.

The Code

llm-fit is MIT licensed and the exploration is fully documented. If you want to test it against your own setup, or poke at the UVM approach yourself:

git clone https://github.com/c0dejedi/llm-fit
cd llm-fit
sudo ./install.sh

The experiment didn't pan out the way I hoped, but the RTLD_LOCAL workaround and the bandwidth ceiling are worth knowing about if you're working in this space. Feel free to check out llm-fit and share it with anyone running into the same wall.