
Running Models Locally: What Ollama and vLLM Skip, and What I Learned Keeping One Warm on a Mac

Ollama and vLLM make local inference look like two commands. But the interesting part is everything underneath — device selection, CPU fallbacks, keeping the model warm, caching. I built a local Telugu voice-clone TTS on an Apple Silicon Mac (no Ollama, just raw PyTorch and transformers), and it taught me the operational reality those two commands hide.

Dhileep Kumar · Jun 10, 2026 · 7 min read


For two years the answer to "where does the model run?" was always "someone else's GPU." That's changing. Open-weight models are good enough for real work, consumer hardware has enough memory to hold them, and the tooling has gotten boring in the best way. Running a capable model on hardware you control is no longer a science project, and when you self-host today, the choice usually narrows to two tools at opposite ends of the same spectrum: Ollama for getting started in one command, vLLM for serving many users fast.

Both are excellent, and I'll give you the honest version of when to reach for each. But I want to spend most of this post one level below the pitch, because that's where I've actually lived. I built a local Telugu voice-clone text-to-speech system on the IndicF5 model. It reads mixed Telugu-and-English tech articles aloud in my own voice, fully offline on my Apple Silicon Mac, wired up by hand with PyTorch and Hugging Face transformers. It's not an LLM, but as an inference workload it hits every operational concern a local LLM does: pick a device, hold the model in memory, don't crash on unsupported ops, keep it warm, cache what you can. That's the part Ollama abstracts away, and the part worth understanding before you trust a local model with anything real.

Why run a model on your own hardware

Cloud APIs are convenient and, for plenty of use cases, the right call. But local inference buys you four things an API never will, and for some workloads those things are decisive:

Cost: a local model costs nothing per token. If you generate a lot of them, the API bill is a recurring tax that local hardware pays off once.
Privacy: nothing leaves your machine. For my TTS, the reference clip is a recording of my own voice — exactly the kind of data I'd rather never upload anywhere.
Control: no network hop, no rate limits, no surprise deprecations. The model you tested is the model you ship, frozen until you choose to change it.
Offline: it runs on a plane, in a locked-down network, or on a laptop with the Wi-Fi off. The model is just a file on disk — in my case about 1.5 GB, downloaded once on first run.

That last point is not rhetorical. My whole system — model in memory, audio cached on disk, a small local server, a browser extension, and a menu-bar reader — runs with no cloud round-trip at all. Once the weights are down, you can pull the network cable.

Ollama and vLLM: pick for your load, not your ego

Ollama is the on-ramp. It treats models the way Docker treats images: you pull one by name and run it, and it handles quantization, memory, and GPU acceleration underneath. There's effectively no configuration. vLLM is what you graduate to when one user becomes hundreds: a serving engine built for high-throughput production inference. Its headline trick, PagedAttention, manages the attention key-value cache in GPU memory the way an operating system pages virtual memory, so it can batch many concurrent requests without wasting memory.

The rule of thumb is simple. If you're prototyping, building a desktop app, or serving only yourself, use Ollama. If you're standing up an inference service that many users or agents hit at once, use vLLM. Plenty of teams run both — Ollama on the laptop, vLLM on the server. Running vLLM for a single-user desktop app is overkill; running Ollama behind a high-traffic service will fall over. The mistake isn't picking the wrong tool, it's picking for the load you wish you had instead of the load you have.

Here's the catch nobody puts in the two-command README: those tools are easy precisely because someone already solved the hard parts for the models they support. The moment you step off that path — a model Ollama doesn't package, a device the happy path doesn't assume, an op your backend doesn't implement — you're doing the work by hand. That's where I ended up.

What "local" actually costs: device selection and fallbacks

The first real decision in any local inference setup is which device runs the math. My selection order is the standard one: prefer CUDA if there's an NVIDIA GPU, then Apple's MPS backend, then CPU as a last resort. On my Mac that resolves to MPS — Metal Performance Shaders, Apple Silicon's GPU compute path for PyTorch. And here's the thing the glossy benchmarks skip: MPS does not implement every operation. Hit an unsupported one and, by default, PyTorch throws instead of running. So the very first line of the program isn't model code at all — it's telling PyTorch to fall back to CPU for the ops MPS can't do, so the process degrades gracefully instead of crashing.
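That preference order can be sketched as a small function. This is a minimal sketch, not the project's actual code: choose_device and detect_device are hypothetical names, and the only PyTorch calls assumed are torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends exist (hypothetical helper, needs torch)."""
    # Imported lazily so choose_device can be used and tested without torch,
    # and so environment variables can still be set before torch loads.
    import torch

    return choose_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )


# The decision logic on its own, with the availability flags passed in by hand.
print(choose_device(cuda_available=False, mps_available=True))  # prints "mps"
```

Keeping the decision separate from the probing makes the order easy to check on a machine that has none of the accelerators installed.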

There's a second landmine specific to this model. IndicF5 wraps part of its pipeline in torch.compile, and on a recent PyTorch that breaks while tracing the audio front-end. Disabling Dynamo turns compile into a no-op: you lose the compile speedup, but you get a model that actually runs. Two environment variables, set before anything else imports torch:

python

import os

# Both variables must be set before torch is imported anywhere in the process.
# Let unsupported MPS ops fall back to CPU instead of crashing (Apple Silicon).
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
# IndicF5 wraps its vocoder in torch.compile(); on recent torch this breaks while
# tracing torchaudio's mel filterbank. Disabling Dynamo makes compile a no-op (eager).
os.environ.setdefault("TORCHDYNAMO_DISABLE", "1")

This is the tax Ollama pays for you. When you type ollama run, someone upstream already decided the device, already handled the fallbacks, already made sure the model loads on your hardware. Doing it by hand, I had a third surprise waiting: the published checkpoint was saved while wrapped in torch.compile, so every weight key carried an ._orig_mod. prefix. Load it as-is and the keys don't match: the transformer stays random-initialized and you get pure noise, with no error to tell you why. The fix is to load the weights yourself and strip the prefix so the keys line up:

python

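from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# REPO_ID and model are defined earlier in the script.
# Download (or reuse the cached copy of) the checkpoint and read its tensors.
sd = load_file(hf_hub_download(REPO_ID, "model.safetensors"))
# Drop the segment torch.compile added so keys match the uncompiled model.
remapped = {k.replace("._orig_mod.", "."): v for k, v in sd.items()}
# strict=False reports mismatches instead of raising, so check them explicitly.
res = model.load_state_dict(remapped, strict=False)
n_miss, n_unexp = len(res.missing_keys), len(res.unexpected_keys)
if n_miss or n_unexp:
    print(f"WARNING: weight load mismatch: {n_miss} missing, {n_unexp} unexpected")
else:
    print("Weights loaded cleanly (0 missing, 0 unexpected).")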

That last log line ("Weights loaded cleanly (0 missing, 0 unexpected).") is the sentence that tells me the model is actually the model, and not 1.5 GB of random numbers politely producing static. When you run inference locally, silent failure is the failure mode you have to design against. A cloud API errors loudly; a mis-loaded local model just sounds wrong.
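One way to make that check harder to ignore is to raise instead of print. The sketch below works on plain key names so it runs without PyTorch. strip_compile_prefix and require_clean_load are hypothetical helpers, not part of IndicF5 or PyTorch, and the key names are made up for illustration.

```python
COMPILE_MARKER = "._orig_mod."  # segment that torch.compile wrappers add to key names


def strip_compile_prefix(state_dict: dict) -> dict:
    """Return a copy of state_dict with the compile marker removed from every key."""
    return {key.replace(COMPILE_MARKER, "."): value for key, value in state_dict.items()}


def require_clean_load(missing_keys, unexpected_keys) -> None:
    """Raise if any key failed to match, so a bad load cannot pass silently."""
    if missing_keys or unexpected_keys:
        raise RuntimeError(
            f"weight load mismatch: {len(missing_keys)} missing, "
            f"{len(unexpected_keys)} unexpected"
        )


# Toy example: one saved key, and the key name the model expects.
saved = {"transformer._orig_mod.layer0.weight": 1.0}
expected = {"transformer.layer0.weight"}

remapped = strip_compile_prefix(saved)
missing = expected - set(remapped)
unexpected = set(remapped) - expected
require_clean_load(missing, unexpected)  # passes: the keys line up after remapping
```

With a real model, the two arguments would be the missing_keys and unexpected_keys that load_state_dict returns. Whether to raise or only warn is a design choice; raising suits a server that should refuse to start with bad weights.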

The two-command demo is real, but it only works because someone already fought device selection, fallbacks, and a silent weight-loading bug on your behalf. Step off the paved path and those fights are yours — and the scariest ones don't throw an exception, they just quietly produce garbage.

The speed nobody quotes you: ~5x slower than real time, and why caching wins

Here's the honest number. In one uncached run on my Mac, synthesizing 5.8 seconds of audio took 30.6 seconds on MPS — roughly five times slower than real time. That's a single observed run from my logs, not a controlled benchmark, and I haven't re-run it under controlled conditions. But it's the right order of magnitude to plan around, and it reframes the whole problem: on consumer hardware, a cold generation is slow enough that you architect the system around never paying for the same generation twice.
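The arithmetic behind "roughly five times" is a real-time factor: compute seconds divided by audio seconds. A small sketch, where the 10-minute article is a hypothetical example and the assumption that the factor stays constant for longer text is mine, not something measured:

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    return compute_seconds / audio_seconds


# The single observed run: 30.6 s of compute for 5.8 s of audio.
rtf = real_time_factor(compute_seconds=30.6, audio_seconds=5.8)
print(f"real-time factor: {rtf:.2f}x")  # prints "real-time factor: 5.28x"

# Rough planning estimate for a hypothetical 10-minute article, assuming the
# factor observed on one short clip holds for longer text (unverified).
article_audio_seconds = 10 * 60
cold_minutes = rtf * article_audio_seconds / 60
print(f"estimated cold synthesis: {cold_minutes:.0f} minutes")  # prints 53 minutes
```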

So the system is built around two ideas: keep the model warm, and cache aggressively. Loading 1.5 GB of weights and moving them onto the GPU is expensive, so you pay that once — the model lives inside a long-running local FastAPI process, and requests hit an already-loaded model instead of spawning a fresh Python process per call. Ollama does the same thing under the hood, keeping a model resident so your second prompt doesn't re-pay the load cost. If you're wiring up inference yourself, a warm server process is the single highest-leverage decision you'll make.
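The load-once pattern can be sketched without a web framework. This is an illustration rather than the project's server: load_model is a hypothetical stand-in for the expensive weight load, and functools.lru_cache plays the role of the long-running process that holds the model.

```python
import functools
import time

LOAD_CALLS = 0  # counts how often the expensive load actually runs


def load_model() -> dict:
    """Hypothetical stand-in for loading weights and moving them to the GPU."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    time.sleep(0.05)  # simulate a slow load
    return {"name": "stand-in-model"}


@functools.lru_cache(maxsize=1)
def get_model() -> dict:
    """Load on first use, then return the same resident object on every call."""
    return load_model()


def handle_request(text: str) -> str:
    """Every request reuses the warm model instead of reloading it."""
    model = get_model()
    return f"{model['name']} synthesized {len(text)} characters"


print(handle_request("first request pays the load cost"))
print(handle_request("second request does not"))
print(f"loads performed: {LOAD_CALLS}")  # prints "loads performed: 1"
```

In a server, the same effect usually comes from loading the model once at application startup and keeping a reference to it for the life of the process.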

Then caching. The server splits text at sentence boundaries and hashes each chunk with SHA-256, keyed together with a voice identifier, then writes the generated audio to disk under that key. Ask for the same sentence in the same voice again and it's a disk read instead of a 30-second generation. The key deliberately embeds the reference-clip identity too, so the moment I change my voice sample the cache invalidates itself — no stale audio in a new voice. A slow model plus a good cache is a fast system for anything you read more than once.
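A minimal sketch of that cache, using only the standard library. The exact key layout is an assumption (the post only says the hash covers the text chunk, a voice identifier, and the reference-clip identity), and split_sentences, cache_key, synthesize_cached, and fake_synth are hypothetical names.

```python
import hashlib
import re
import tempfile
from pathlib import Path
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def cache_key(chunk: str, voice_id: str, ref_clip_id: str) -> str:
    """SHA-256 over voice, reference-clip identity, and the text chunk."""
    # A separator that cannot appear in normal text keeps the fields unambiguous.
    payload = "\x1f".join([voice_id, ref_clip_id, chunk]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def synthesize_cached(chunk: str, voice_id: str, ref_clip_id: str,
                      synth: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return cached audio if present; otherwise generate it and store it."""
    path = cache_dir / f"{cache_key(chunk, voice_id, ref_clip_id)}.wav"
    if path.exists():
        return path.read_bytes()  # cache hit: a disk read, no model call
    audio = synth(chunk)
    path.write_bytes(audio)
    return audio


calls = []  # records every chunk that reached the (fake) model


def fake_synth(chunk: str) -> bytes:
    """Stand-in for the slow model call."""
    calls.append(chunk)
    return chunk.encode("utf-8")


cache_dir = Path(tempfile.mkdtemp())
for sentence in split_sentences("Hello there. Hello there. How are you?"):
    synthesize_cached(sentence, "my-voice", "clip-v1", fake_synth, cache_dir)
print(f"model calls: {len(calls)}")  # prints "model calls: 2"
```

Because the reference-clip identity is part of the hashed payload, changing "clip-v1" to another value produces different keys, so old audio is simply never looked up again.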

One more operational gotcha that maps straight onto local LLMs: I serialize every inference call behind a lock. The model and MPS are not safe to call concurrently, so a naive "just handle two requests at once" server corrupts state or crashes. This is exactly the problem vLLM's PagedAttention and continuous batching exist to solve properly. My single-user answer — one lock, one request at a time — is fine for a desktop reader and completely wrong for a service. Which is the whole Ollama-versus-vLLM decision restated in code: concurrency is the line between the two.
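The one-lock approach looks roughly like this. It is a sketch under stated assumptions: run_inference is a hypothetical stand-in for the real model call, and the counters exist only to show that calls never overlap.

```python
import threading
import time

_inference_lock = threading.Lock()  # serializes every call into the model
_counter_lock = threading.Lock()    # protects the bookkeeping below
active = 0
max_active = 0


def run_inference(text: str) -> str:
    """Hypothetical stand-in for a model call that is unsafe to run concurrently."""
    global active, max_active
    with _counter_lock:
        active += 1
        max_active = max(max_active, active)
    time.sleep(0.01)  # simulate work on the device
    with _counter_lock:
        active -= 1
    return text.upper()


def synthesize(text: str) -> str:
    """Public entry point: at most one request is inside the model at a time."""
    with _inference_lock:
        return run_inference(text)


threads = [threading.Thread(target=synthesize, args=(f"request {i}",)) for i in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(f"peak concurrent inference calls: {max_active}")  # prints 1
```

The cost is visible in the same code: eight requests take eight times as long as one, which is acceptable for a single reader and not for a shared service.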

Where people actually get stuck

The barrier to running your own model used to be expertise; now the on-ramp is a download and a command. But the ditches on either side of it haven't moved. From building one by hand, these are the ones that got me or would have:

Underestimating memory. If a model spills from VRAM to system RAM or disk, inference crawls. Check the quantized size against your hardware before you download tens of gigabytes.
Assuming your device implements everything. Apple's MPS doesn't cover every op; without a CPU fallback the process just dies. Whatever your backend, know its blind spots before you ship.
Trusting silence. A mis-loaded local model doesn't error — it produces plausible-looking garbage. Log a loud, explicit "loaded cleanly" check so you know the weights actually landed.
Not keeping the model warm. Re-loading gigabytes of weights per request turns a slow model into an unusable one. Keep it resident in a long-running process.
Confusing the two tools' jobs. Ollama behind a high-traffic service falls over; vLLM for a single-user app is overkill. And if you go fully hand-rolled like I did, remember concurrency is a feature you have to build, not one you get for free.
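For the memory check in the first item, a back-of-the-envelope estimate is often enough. The sketch below is an assumption-laden lower bound: it counts weights only (no activations, no KV cache), the 7-billion-parameter model is a hypothetical example, and the 0.75 headroom fraction is an arbitrary illustrative choice.

```python
def weight_size_gb(num_parameters: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


def fits_in_memory(num_parameters: float, bits_per_weight: float,
                   memory_gb: float, headroom: float = 0.75) -> bool:
    """True if the weights fit within a fraction of memory (headroom is a guess)."""
    return weight_size_gb(num_parameters, bits_per_weight) <= memory_gb * headroom


# Hypothetical 7-billion-parameter model on a machine with 16 GB of memory.
print(weight_size_gb(7e9, 16))      # 16-bit weights: 14.0 GB
print(weight_size_gb(7e9, 4))       # 4-bit quantized: 3.5 GB
print(fits_in_memory(7e9, 16, 16))  # False: 14.0 GB exceeds the 12 GB budget
print(fits_in_memory(7e9, 4, 16))   # True
```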

Start with Ollama and a small model on whatever hardware you already have, get a feel for what open models can and can't do, and reach for vLLM the day you genuinely need to serve more than yourself. But if you ever step off the paved road — a model nobody packaged, a device the happy path didn't assume — none of the pitch survives contact and all of this operational reality shows up at once. It's slower than the demo, it fails quietly instead of loudly, and it's still one of the quietly radical things you can build: a model running on your own machine — private, free per token, offline, yours.
