Fine-Tuning a Small Model with LoRA

Prompting and RAG cover most needs, but sometimes you need the model itself to change. LoRA made fine-tuning cheap enough to do on a single GPU — here’s when it’s worth it and how it actually works.

Dhileep Kumar · 7 min read

Most of the time, the right way to make a model do what you want is to ask it better — a sharper prompt, a few examples, some retrieved context. Prompting and RAG are cheaper, faster, and easier to change than touching the model itself. But there’s a class of problem they can’t fix: when you need the model to reliably produce a specific format, hold a consistent voice, or master a narrow domain that no amount of context quite teaches. That’s when you fine-tune.

Fine-tuning used to mean retraining billions of parameters on a cluster you couldn’t afford. LoRA changed that. Short for Low-Rank Adaptation, it made customizing a model cheap enough to do on a single consumer GPU in an afternoon — and it’s why fine-tuning is back on the table for normal teams in 2026. Here’s the honest version of when it helps and what’s actually happening under the hood.

When to fine-tune — and when not to

The most expensive fine-tuning mistake is doing it at all when a prompt would have worked. Reach for it only after prompting and RAG have genuinely failed, and be clear about which problem you’re solving.

  • Fine-tune for form, not facts. Tone, format, and style are what fine-tuning teaches well — a consistent JSON shape, a brand voice, a domain’s phrasing. Knowledge that changes belongs in RAG, not in the weights.
  • Fine-tune to replace a giant prompt. If you ship the same 2,000-token instruction block on every call, you can often bake that behavior into a small model and pay for it once instead of on every request.
  • Fine-tune to shrink the model. A fine-tuned 7B model can beat a prompted 70B one at a narrow task, and it is cheaper, faster, and easier to keep private. Specialization lets you trade model size for focus.
  • Don’t fine-tune for fresh or factual knowledge. Anything that updates — prices, docs, today’s data — goes stale in the weights the moment training ends. That’s RAG’s job.
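
The "replace a giant prompt" case is easy to sanity-check with arithmetic. Here is a minimal sketch, assuming the fine-tuned model costs the same per token as the one it replaces and ignoring hosting costs. The function name, the training cost, and the price are all hypothetical numbers chosen for illustration, not real pricing.

```python
import math

def break_even_requests(training_cost_usd, instruction_tokens, price_per_million_tokens_usd):
    """Requests after which the saved prompt tokens have paid for training."""
    # Dollars saved per one million requests once the instructions are baked in.
    saved_per_million_requests = instruction_tokens * price_per_million_tokens_usd
    return math.ceil(training_cost_usd * 1_000_000 / saved_per_million_requests)

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
requests = break_even_requests(
    training_cost_usd=50,
    instruction_tokens=2_000,
    price_per_million_tokens_usd=1.0,
)
print(f"Break-even after {requests:,} requests")
```

If your traffic is well below the break-even point, the giant prompt is probably still the cheaper option.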

What LoRA actually does

Full fine-tuning updates every weight in the model, which is why it’s so expensive: a 7B model means 7 billion numbers to adjust and store, plus gradients and optimizer state for each of them. LoRA’s insight is that you don’t have to. Instead of editing the original weights, it freezes them and trains a tiny pair of low-rank matrices alongside each weight matrix you choose to adapt, usually the attention projections. Their product is a small “adapter” update that is added to the frozen weight and nudges the model’s behavior. You might train a few million parameters instead of several billion.
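
To make the mechanics concrete, here is a toy sketch of one LoRA-adapted linear layer in plain Python. It is not how PEFT implements it. The helper names, the tiny dimensions, and the 4096-wide projection used in the count are assumptions for illustration. The zero initialization of B follows the usual LoRA convention, so the adapter starts as a no-op.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x). W is frozen; only A and B would train."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [1.0] * d_in

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, r)
print(y_lora == y_base)  # the untrained adapter leaves the output unchanged

# Assumed 4096 x 4096 projection: full update vs. rank-8 adapter.
full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, 8)
print(full, low_rank, full // low_rank)
```

For that assumed projection, the rank-8 pair holds 256 times fewer parameters than the matrix it adapts.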

The payoff is practical. The base model stays untouched, so you can keep many adapters for many tasks and swap them in and out like plugins. Training fits in far less memory because gradients and optimizer state exist only for the adapter, which is what lets it run on one GPU. And QLoRA goes further: it quantizes the frozen base model to 4-bit, so the weights of even a 70B model shrink to roughly 35 GB and can fit on a single high-memory card while you train the adapter on top.
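
A back-of-the-envelope estimate shows why 4-bit quantization matters. This sketch counts weight storage only and deliberately ignores activations, the adapter, optimizer state, and quantization metadata, so real usage will be higher. The function name is made up for this example.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{four_bit:.1f} GB")
```

Under these assumptions a 70B model drops from about 140 GB of weights to about 35 GB, which is the difference between needing several GPUs and fitting on one large card.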

Fine-tuning in practice

The modern stack makes this almost anticlimactic. With Hugging Face’s PEFT library you wrap a base model, declare a LoRA config, and train on your dataset — usually a few hundred to a few thousand example pairs of input and the output you want. The shape looks like this:

python
# Wrap a base model with a LoRA adapter and train it.
# pip install peft transformers datasets
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# "base-model-7b" is a placeholder; substitute a real model ID or local path.
model = AutoModelForCausalLM.from_pretrained("base-model-7b")

config = LoraConfig(
    r=8,                 # rank: bigger = more capacity, more memory
    lora_alpha=16,       # adapter update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # layers to adapt; names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# Illustrative output (exact numbers and formatting depend on the base model):
# trainable params: 4.2M || all params: 7.0B || 0.06%
# ...then train as usual on your input/output pairs.

The printed parameter count is the whole story: you’re training 0.06% of the model and leaving the rest frozen. When you’re done, the adapter is a small file, megabytes rather than gigabytes, that you can version, share, and load on top of the base model whenever you need that behavior.
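
The 4.2M and 0.06% figures can be reproduced by hand. This is a sketch under stated assumptions: a 7B model with 32 layers and a hidden size of 4096, where q_proj and v_proj are both square 4096 x 4096 matrices. Those shapes are typical of some 7B architectures but are not taken from any specific model here, and the function name is hypothetical.

```python
def lora_trainable_params(hidden_size, num_layers, modules_per_layer, r):
    """Total adapter parameters when every adapted matrix is hidden x hidden."""
    per_module = r * hidden_size + hidden_size * r  # A (r x d) plus B (d x r)
    return num_layers * modules_per_layer * per_module

# Assumed model shape; adjust for your actual base model.
trainable = lora_trainable_params(hidden_size=4096, num_layers=32,
                                  modules_per_layer=2, r=8)
total = 7_000_000_000
share_pct = 100 * trainable / total

# Adapter file size if stored at 2 bytes (16-bit) or 4 bytes (32-bit) per parameter.
size_16bit_mb = trainable * 2 / 1e6
size_32bit_mb = trainable * 4 / 1e6

print(f"trainable: {trainable:,} ({share_pct:.2f}% of {total:,})")
print(f"adapter size: ~{size_16bit_mb:.1f} MB (16-bit), ~{size_32bit_mb:.1f} MB (32-bit)")
```

Doubling r doubles both the trainable count and the file size, which is a quick way to see what a larger rank costs.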

Fine-tuning isn’t teaching the model new facts; it’s teaching it new habits. Reach for it when you’ve been writing the same instructions over and over, wishing the model would just remember them.

Where people go wrong

  • Too little data, or dirty data. A few hundred clean, consistent examples beat ten thousand noisy ones. The model learns exactly what you show it — including your mistakes.
  • Fine-tuning what a prompt could fix. If you haven’t exhausted prompting and few-shot examples, you’re spending training cycles on a problem that didn’t need them.
  • Chasing knowledge instead of behavior. Baking facts into weights feels powerful and ages terribly. If the answer can change, it shouldn’t live in the model.
  • Overfitting the rank. A bigger r isn’t automatically better: with a small dataset, too much adapter capacity can memorize your examples instead of generalizing. Start small (r=8 or 16) and grow it only if you must.
  • No eval to prove it worked. Fine-tune, then measure against a held-out set. “It feels better” is how you ship a regression you can’t see.
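
The last point is cheap to act on. Here is a minimal sketch of a held-out evaluation for a format-focused fine-tune. It assumes the task is to emit a JSON object with a "sentiment" key. The schema, the metric names, and the sample outputs are invented for illustration, and in practice the outputs would come from running each model on the held-out inputs.

```python
import json

REQUIRED_KEYS = ("sentiment",)  # hypothetical schema for this example

def parse_object(text):
    """Return the parsed JSON object, or None if it is invalid or missing keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and all(key in obj for key in REQUIRED_KEYS):
        return obj
    return None

def format_pass_rate(outputs):
    """Fraction of outputs that are well-formed objects with the required keys."""
    return sum(parse_object(o) is not None for o in outputs) / len(outputs)

def exact_match_rate(outputs, references):
    """Fraction of outputs whose parsed object equals the reference object."""
    hits = 0
    for out, ref in zip(outputs, references):
        parsed = parse_object(out)
        if parsed is not None and parsed == json.loads(ref):
            hits += 1
    return hits / len(references)

# Made-up held-out references and model outputs.
references = ['{"sentiment": "positive"}', '{"sentiment": "negative"}',
              '{"sentiment": "neutral"}', '{"sentiment": "positive"}']
base_outputs = ['The sentiment is positive.', '{"sentiment": "negative"}',
                '{"label": "neutral"}', '{"sentiment": "negative"}']
tuned_outputs = ['{"sentiment": "positive"}', '{"sentiment": "negative"}',
                 '{"sentiment": "neutral"}', '{"sentiment": "negative"}']

for name, outputs in [("base", base_outputs), ("tuned", tuned_outputs)]:
    print(name, format_pass_rate(outputs), exact_match_rate(outputs, references))
```

Tracking format and correctness separately matters: a fine-tune can fix the JSON shape while leaving the answers no better, and a single blended score would hide that.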

Is it worth it?

For most features, no — and that’s the right default. Prompting and RAG are reversible, debuggable, and cheap, and you should exhaust them first. But when you find yourself fighting the same formatting battle on every call, or paying for a huge model to do one narrow thing, a LoRA adapter turns that recurring cost into a one-time investment.

The barriers that kept fine-tuning out of reach (money, hardware, expertise) are mostly gone. What’s left is judgment: knowing it’s for habits, not facts, and reaching for it only when prompting has genuinely run out of room. Get that call right and a few megabytes of adapter can do, on your narrow task, what a much larger prompted model couldn’t.
