
Running LLMs Locally: Ollama vs vLLM in 2026

Open models are good enough now that running one on your own hardware is a real choice, not a hobby. The decision usually comes down to two tools — Ollama for ease, vLLM for throughput. Here’s how to pick and run.

Dhileep Kumar · 7 min read

For two years, the answer to “where does the model run?” was always “someone else’s GPU.” That’s changing. Open-weight models like Llama, Qwen, and Mistral are now good enough for real work, consumer GPUs have enough memory to hold them, and the tooling has gotten boring in the best way. Running a capable model on hardware you control is no longer a science project.

When developers self-host today, the choice almost always narrows to two tools: Ollama and vLLM. They sit at opposite ends of the same spectrum — one optimized for getting started in a single command, the other for serving many users fast. Knowing which problem you have tells you which to reach for.

Why run a model on your own hardware

Cloud APIs are convenient and, for plenty of use cases, the right call. But local inference buys you things an API never will, and for some teams those things are decisive.

  • Cost. A local model has no per-token fee; you pay for the hardware once, plus electricity. If you’re generating millions of tokens a day for classification, extraction, or drafting, the API bill is a recurring charge that a one-time hardware purchase can replace.
  • Privacy. Nothing leaves your machine. For regulated data, internal documents, or anything you can’t legally ship to a third party, this is the whole ballgame.
  • Latency and control. No network hop, no rate limits, no surprise deprecations. The model you tested is the model you ship, frozen until you choose to change it.
  • Offline. It runs on a plane, in a locked-down network, or on an edge device with no connectivity. The model is just a file on disk.
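
To make the cost argument concrete, here is a back-of-the-envelope break-even sketch. Every input is a placeholder assumption (hardware price, API price per million tokens, daily volume), and the estimate ignores electricity and maintenance, so treat the result as a rough order of magnitude rather than a forecast.

```shell
# Rough break-even estimate: days until a one-time hardware purchase
# costs less than paying an API per token. All inputs are hypothetical.

# breakeven_days HARDWARE_USD TOKENS_PER_DAY API_USD_PER_MILLION_TOKENS
breakeven_days() {
  awk -v hw="$1" -v tpd="$2" -v price="$3" 'BEGIN {
    daily_api_cost = tpd / 1000000 * price   # what the API would charge per day
    printf "%.0f\n", hw / daily_api_cost     # days to recoup the hardware
  }'
}

# Made-up figures: a $2000 GPU, 5M tokens/day, $1 per million tokens.
breakeven_days 2000 5000000 1   # prints 400
```

Plug in your own volume and the price of the API you would otherwise use; the lower your daily token count, the longer the hardware takes to pay for itself.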

Ollama: Docker for models

Ollama is the on-ramp. It treats models the way Docker treats images: you pull one by name and run it, and Ollama fetches a pre-quantized build and handles memory and GPU acceleration underneath. There’s effectively no configuration. From install to a running chat model is two commands.

bash
# Install, then pull and run a model. That's the whole setup.
ollama pull llama3.3
ollama run llama3.3 "Explain MCP in one sentence."

# Ollama also serves an OpenAI-compatible API on localhost:11434,
# so most existing client code points at it with almost no changes.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3","messages":[{"role":"user","content":"hi"}]}'

vLLM: built for throughput

vLLM is what you graduate to when one user becomes hundreds. It’s a serving engine built for high-throughput, production inference, and its headline trick — PagedAttention — manages GPU memory the way an operating system pages RAM, cutting waste and letting it batch many concurrent requests. Where Ollama is built to be easy, vLLM is built to be fast under load.

bash
# Serve an open model with an OpenAI-compatible endpoint.
pip install vllm
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

# Now point any OpenAI client at http://localhost:8000/v1
# and vLLM batches concurrent requests automatically.
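
Because both servers expose an OpenAI-compatible endpoint, one small client can target either by changing the base URL. The sketch below assumes the ports and model names used above; the helper names (`chat_url`, `chat_body`, `chat`) are made up for illustration, and the body builder does no JSON escaping.

```shell
# chat_url BASE_URL -> full chat-completions endpoint (trailing slash tolerated)
chat_url() {
  printf '%s/v1/chat/completions\n' "${1%/}"
}

# chat_body MODEL PROMPT -> minimal JSON request body.
# No JSON escaping is done here, so keep prompts free of quotes and
# backslashes, or build the body with a tool such as jq.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' "$1" "$2"
}

# chat BASE_URL MODEL PROMPT -> sends the request, prints the raw JSON response
chat() {
  curl -sS "$(chat_url "$1")" \
    -H "Content-Type: application/json" \
    -d "$(chat_body "$2" "$3")"
}

# Usage (the corresponding server must be running):
#   chat http://localhost:11434 llama3.3 "hi"                   # Ollama
#   chat http://localhost:8000 Qwen/Qwen2.5-7B-Instruct "hi"    # vLLM
#
# To see batching at work, fire several requests at once:
#   for i in 1 2 3 4; do chat http://localhost:8000 Qwen/Qwen2.5-7B-Instruct "hi" & done; wait
```

Keeping the base URL as the only moving part is what makes the laptop-to-server move cheap: the calling code stays the same.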

The rule of thumb is simple: if you’re prototyping, building a desktop app, or serving only yourself, use Ollama. If you’re standing up an inference service that many users or agents hit at once, use vLLM. Plenty of teams use both — Ollama on the laptop, vLLM on the server.

Picking your model and hardware

  • 8GB VRAM: comfortable for 7B-8B models like Llama 3.x 8B or Qwen2.5 7B at 4-bit. Genuinely useful for coding help, extraction, and chat.
  • 24GB VRAM: the practical floor for 30B-class models at 4-bit, and a sweet spot for quality-per-dollar on a single card.
  • 40GB+ VRAM: needed for 70B models at reasonable quality unless you quantize hard. This is where serious local setups live.
  • Quantization is your lever. 4-bit (Q4_K_M) cuts weight memory to roughly a third of FP16 with minimal quality loss: a 70B model that wants about 140GB at FP16 fits in roughly 40GB. Reach for it before buying more hardware.
  • Match the model to the job. A 7B coding model often beats a 70B generalist at writing code. Bigger isn’t automatically better; the right specialist usually wins.
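
The sizing tiers above follow from simple arithmetic: weight memory is roughly parameters times bits per weight, divided by eight. Here is a minimal sketch of that estimate. The 4.8 bits per weight used for Q4_K_M is an approximate average, and the 20% headroom for KV cache and runtime overhead is an assumption, not a measured figure.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate weight memory in GB (10^9 bytes). Excludes KV cache and overhead.
weights_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

# fits_in_vram VRAM_GB PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints "yes" if the weights plus an assumed 20% headroom fit, else "no".
fits_in_vram() {
  awk -v vram="$1" -v p="$2" -v bits="$3" 'BEGIN {
    need = p * bits / 8 * 1.2
    if (need <= vram) print "yes"; else print "no"
  }'
}

weights_gb 70 16        # FP16: prints 140.0
weights_gb 70 4.8       # ~Q4_K_M: prints 42.0
fits_in_vram 8 8 4.8    # 8B at 4-bit on an 8GB card: prints yes
fits_in_vram 24 70 4.8  # 70B at 4-bit on a 24GB card: prints no
```

Real files vary by architecture and quantization scheme, so compare the estimate with the actual download size before committing to a model.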

The interesting shift isn’t that local models got smart enough. It’s that the tooling got dumb enough to use — pull a name, run a command, and the GPU stops being someone else’s problem.

Where people get stuck

  • Underestimating VRAM. If the model spills to system RAM or disk, inference crawls. Check the quantized size against your card before downloading 40GB.
  • Confusing the two tools’ jobs. Running vLLM for a single-user desktop app is overkill; putting Ollama behind a high-traffic service invites it to buckle under concurrent load. Pick for your load.
  • Ignoring context length. Long contexts eat memory fast — a model that fits at 4K tokens may not at 32K. Budget for the context you actually use.
  • Forgetting it’s still a server. A local endpoint with no auth on a shared network is an open door. Lock it down like any other service.
  • Chasing the newest model every week. The leaderboard churns constantly; pick a solid model, build on it, and upgrade deliberately rather than every Friday.
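
The context-length point can be estimated ahead of time. For a standard transformer, the KV cache stores one key and one value vector per layer per token, so its size grows linearly with context. The sketch below assumes an 8B-class architecture (32 layers, 8 KV heads, head dimension 128) with 16-bit cache values; check your model's config for the real numbers.

```shell
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS [BYTES_PER_VALUE]
# Approximate KV cache size in GiB for a single sequence.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v t="$4" -v b="${5:-2}" 'BEGIN {
    bytes = 2 * l * h * d * b * t   # 2 = one key plus one value per layer
    printf "%.2f\n", bytes / (1024 * 1024 * 1024)
  }'
}

# Assumed 8B-class shape: 32 layers, 8 KV heads, head dim 128, FP16 cache.
kv_cache_gib 32 8 128 4096    # 4K context: prints 0.50
kv_cache_gib 32 8 128 32768   # 32K context: prints 4.00
```

That is per sequence: a server handling several long conversations at once needs that much cache for each of them, on top of the weights.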

The barrier to running your own model used to be expertise; now it’s a download and a command. Start with Ollama and a 7B model on whatever GPU you already have, get a feel for what open models can and can’t do, and reach for vLLM the day you need to serve more than yourself. The model running on your own machine — private, free per token, yours — is one of the quietly radical things you can build right now.
