I Built an AI Agent That Doesn't Forget Me (Without Fine-Tuning)

SelfMind wraps Claude in a persistent memory + RAG loop: it decides what's worth remembering after each turn and writes it to SQLite, so a fresh session still knows you. Here's the real architecture, in ~637 lines with one dependency, and the design calls I made instead of fine-tuning.

Dhileep Kumar | Jun 11, 2026 | 7 min read

Talk to most AI agents twice and you notice the quiet disappointment: the second conversation starts from zero. Whatever you told it yesterday, your name, your project, the decision you made together, is gone. The model is brilliant and amnesiac, because by default a language model has no memory at all. Each call is a blank slate with a context window, and when the window closes, everything in it disappears.

I got tired of re-introducing myself to my own assistant, so I built one that doesn't forget. It's called SelfMind: a Python REPL that wraps Claude in a persistent memory loop, decides what's worth remembering after every turn, and writes it to a SQLite file on disk. Restart the process, start a brand-new session, and it still recalls what it learned. This post is the concept and the actual build, warts and design decisions included. It's an early prototype (version 0.1.0), the whole thing is roughly 637 lines including the README, and it leans on exactly one dependency: the Anthropic SDK.

First, why the amnesia happens at all: it isn't a bug, it's the architecture. Four properties of the problem tell you exactly what a memory layer has to solve:

The model is stateless. It holds no information between calls. Anything it knows in a conversation lives only in the prompt you send that turn.
The context window is finite. Even big windows fill up, and stuffing everything in is slow and expensive. You can't just append forever.
Not everything is worth keeping. Most of what's said is noise. Memory isn't recording everything, it's choosing what to keep and what to throw away.
Retrieval has to be cheap. When the agent needs a past fact, it has to find it fast, without re-reading the entire history. "Find a relevant fact fast" is a search problem, and search is what most of SelfMind's retrieval code actually is.

The one decision that shaped everything: no weight training

The obvious way to make a model "learn" is to fine-tune it on your data. I deliberately didn't. The README names the reasons out loud: catastrophic forgetting (fine-tune on today's chat and you can quietly wreck yesterday's capability), reward design (what exactly are you optimizing?), and compute (you're not retraining a frontier model on a laptop). My argument is that a persistent memory plus retrieval loop delivers most of what people actually mean by "self-learning", namely continual adaptation to one user, and it runs on a laptop today.

So the "learning" in SelfMind is not new weights. It's a growing SQLite table. Claude stays frozen and does the reasoning; the memory store is the part that changes over time. That reframing is the whole project, and it's an honest framing of what self-learning realistically means for someone who isn't a research lab.

Fine-tuning was the tempting answer and the wrong one. The learning lives in a SQLite file, not in the weights. Claude reasons; the store remembers.

The four-step loop, and the second Claude call

Every turn runs the same cycle in agent.py: retrieve, prompt, answer, learn. The first three are a fairly standard RAG loop. The fourth is the interesting one, and it's a second model call.

RETRIEVE the top relevant memories for the user's message (top_k=6 by default) from the SQLite store.
PROMPT Claude with those memories injected as their own system block.
ANSWER by streaming Claude's reply back to the terminal.
LEARN: a second Claude call reads the exchange back and extracts durable facts as schema-constrained JSON, which get deduped and written to SQLite.
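A minimal sketch of the PROMPT step, assuming the retrieved memories arrive as plain strings. The names here (`build_system`, the header wording) are hypothetical rather than taken from the repo; the `cache_control` marker follows Anthropic's prompt-caching convention for system text blocks, placed on the stable persona block so the volatile memory block sits after it.

```python
def build_system(persona: str, memories: list[str]) -> list[dict]:
    """Return system blocks: stable persona first, volatile memories second."""
    blocks = [{
        "type": "text",
        "text": persona,
        # Cache breakpoint on the stable block only (assumed placement).
        "cache_control": {"type": "ephemeral"},
    }]
    if memories:
        # Hypothetical header text; the real prompt wording is not shown here.
        lines = "\n".join(f"- {m}" for m in memories)
        blocks.append({
            "type": "text",
            "text": "Known facts about the user:\n" + lines,
        })
    return blocks


system = build_system(
    "You are SelfMind, a helpful assistant.",
    ["The user prefers concise answers."],
)
```

The resulting list can be passed as the `system` argument of the chat call. Keeping the memory block last means a change in retrieved memories does not alter the cached prefix.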

Here's the seam where answering ends and learning begins. After the stream finishes, the exchange is handed straight to a reflection call:

python

reply_parts = []
with self.client.messages.stream(
    model=self.config.chat_model,
    max_tokens=self.config.max_tokens,
    thinking={"type": "adaptive"},   # chat = reasoning
    system=system,
    messages=messages,
) as stream:
    for text in stream.text_stream:
        reply_parts.append(text)
        print(text, end="", flush=True)
reply = "".join(reply_parts)
self.history.append({"role": "assistant", "content": reply})

new = self._learn(user_text, reply)   # <- the self-learning step
return reply, new

One deliberate detail: the two Claude calls use opposite thinking settings. The chat call uses adaptive thinking because answering you well is a reasoning task. The extraction call uses disabled thinking, because pulling facts out of a transcript into a fixed schema is mechanical, and reasoning tokens there would just be latency and spend for nothing.

"Decide what's worth keeping" sounds soft, but I forced it into structure. The extractor is constrained by a JSON schema via structured outputs, so the reply is always valid JSON. That means the code never has to defensively parse free text; it trusts json.loads() with only a minimal decode-error fallback. Each memory comes out typed as fact, preference, correction, or context. The extractor's system prompt also carries an explicit list of what NOT to store: one-off task details, transient state and greetings, the assistant's own suggestions, and anything the user didn't actually assert. It forces each memory into a third-person standalone sentence, like "The user prefers ...", so a retrieved fact still makes sense with no surrounding conversation. And the whole call is wrapped so a failed reflection never breaks the chat; it just prints a skip notice and returns an empty list:

python

try:
    resp = self.client.messages.create(
        model=self.config.memory_model,
        max_tokens=self.config.memory_max_tokens,
        thinking={"type": "disabled"},   # mechanical, no reasoning
        system=_EXTRACTOR_SYSTEM,
        output_config={"format": {"type": "json_schema",
                                  "schema": _MEMORY_SCHEMA}},
        messages=[{"role": "user", "content": exchange}],
    )
except Exception as exc:   # a failed reflection never breaks the chat
    print(f"[memory extraction skipped: {exc}]")
    return []
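The schema and the parsing step aren't shown above, so here is a hedged sketch of what they could look like. The schema shape (a top-level `memories` array of `text` and `kind` objects) and the helper name `parse_memories` are assumptions for illustration; only the four kinds and the json.loads() with decode-error fallback come from the description.

```python
import json

# Hypothetical shape for the memory schema; the real one may differ.
MEMORY_SCHEMA = {
    "type": "object",
    "properties": {
        "memories": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "kind": {
                        "type": "string",
                        "enum": ["fact", "preference", "correction", "context"],
                    },
                },
                "required": ["text", "kind"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["memories"],
    "additionalProperties": False,
}


def parse_memories(raw: str) -> list[dict]:
    """Parse the extractor's JSON reply; return [] on a decode error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = data.get("memories") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []
    # Keep only entries that carry a non-empty text string.
    return [
        m for m in items
        if isinstance(m, dict)
        and isinstance(m.get("text"), str)
        and m["text"].strip()
    ]
```

Because the schema does the heavy lifting, the parser stays small: one decode guard and a shape check, with anything unexpected collapsing to an empty list.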

Retrieval is pure-Python BM25 (on purpose)

Here's the part that surprises people: there's no vector database. Retrieval is Okapi BM25, hand-written in pure Python (k1=1.5, b=0.75), no numpy, no torch, no embedding service. That's why it runs on any Python install with a single dependency, and why the README stresses it was tested on Python 3.14.

python

k1, b = 1.5, 0.75
for mem, d in zip(memories, docs):
    tf = Counter(d)
    dl = len(d)
    score = 0.0
    for term in q_terms:
        if term not in tf:
            continue
        idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
        freq = tf[term]
        score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * dl / avgdl))
    if score > 0:
        scored.append((mem, score))
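The excerpt above leaves out the setup around the scoring loop. The following is a self-contained sketch of the same Okapi BM25 formula, under a few assumptions: memories are plain strings, the tokenizer is a simple lowercase alphanumeric split, and query terms are deduplicated. The actual tokenizer and row type in SelfMind may differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(memories: list[str], query: str, k: int = 6) -> list[tuple[str, float]]:
    """Score memories against the query with BM25 and return the top k."""
    docs = [tokenize(m) for m in memories]
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0  # guard against all-empty docs
    # Document frequency: number of memories containing each term.
    df = Counter(term for d in docs for term in set(d))
    q_terms = set(tokenize(query))

    k1, b = 1.5, 0.75
    scored = []
    for mem, d in zip(memories, docs):
        tf = Counter(d)
        dl = len(d)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            freq = tf[term]
            score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * dl / avgdl))
        if score > 0:
            scored.append((mem, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


memories = [
    "The user drives a red car.",
    "The user prefers tea over coffee.",
    "The user is building a Python REPL.",
]
top = rank(memories, "which tea do I like?", k=2)
```

Only memories that share at least one token with the query get a positive score, so unrelated rows are dropped rather than padded into the top k.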

I'm not pretending BM25 is the best retriever. It's lexical, not semantic: it matches shared words, so a query for "automobile" won't find a memory about your "car". That's the known limitation, and I chose it anyway because it removes an entire class of dependency and gets the loop working end to end. The trick is that rank(memories, query, k) is the single seam: to go semantic later, you store a vector per memory row and rank by cosine behind the same function signature, and the agent, store, and CLI don't change at all. The README documents both a local sentence-transformers path and a hosted Voyage AI path for exactly that swap.
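To make the swap concrete, here is a rough sketch of a cosine ranker that keeps a `rank(memories, query, k)` call shape. It assumes each memory is a `(text, vector)` pair with the vector already stored, and that an `embed` callable turns the query into a vector. Both `make_rank` and `toy_embed` are hypothetical stand-ins; a real version would call an embedding model instead of the toy lookup.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def make_rank(embed: Callable[[str], Vector]):
    """Build a rank(memories, query, k) function backed by `embed`."""
    def rank(memories, query, k=6):
        query_vec = embed(query)
        scored = [(text, cosine(vec, query_vec)) for text, vec in memories]
        scored = [pair for pair in scored if pair[1] > 0]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
    return rank


def toy_embed(text: str) -> Vector:
    # Toy stand-in for an embedding model: two hand-made "topics".
    vehicle = {"car", "automobile"}
    words = set(text.lower().replace(".", "").split())
    return [1.0, 0.0] if words & vehicle else [0.0, 1.0]


texts = ["The user drives a car.", "The user likes tea."]
memories = [(t, toy_embed(t)) for t in texts]
rank = make_rank(toy_embed)
top = rank(memories, "automobile", k=1)
```

With the toy embedding, the "automobile" query lands on the car memory, which is exactly the case the lexical ranker misses. Callers still see the same function shape, so nothing upstream of the ranker needs to know which scorer is in use.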

The unglamorous parts that actually matter

The store-and-recall loop is the easy 80%. The design decisions that make it not-annoying live in the boring 20%:

Dedupe so the store doesn't rot. New candidates go through two stages: an exact lowercased string match, then Jaccard token-set overlap. At or above a 0.7 threshold a candidate is treated as a near-duplicate and skipped, so twelve rephrasings of the same fact don't become twelve rows.
Prompt-cache-aware system prompt. The system prompt is two blocks: a stable persona first (cacheable) and the volatile retrieved-memory block second. A code comment calls this out. Most naive RAG loops interleave the two and quietly forfeit caching.
Explicit memory control. CLI commands expose the store: /remember to force-save, /forget <id> to delete, /memories to list, /stats to inspect. Memory without a delete button is a liability, so /forget is a first-class command, not an afterthought.
Offline affordance. /remember and /memories work with no API key at all, because storing or listing a fact needs no Claude call. You can seed or inspect memory without spending a cent.
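The two-stage dedupe described in the first item can be sketched in a few lines. This is a minimal illustration, not the repo's code: the helper names (`jaccard`, `is_duplicate`) and the tokenizer are assumptions, while the exact lowercased match, the Jaccard token-set overlap, and the 0.7 threshold come from the description.

```python
import re


def _tokens(text: str) -> set[str]:
    # Assumed tokenizer: lowercase runs of letters and digits.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A intersect B| / |A union B|."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.7) -> bool:
    normalized = candidate.strip().lower()
    for text in existing:
        # Stage 1: exact match after lowercasing.
        if text.strip().lower() == normalized:
            return True
        # Stage 2: near-duplicate at or above the Jaccard threshold.
        if jaccard(candidate, text) >= threshold:
            return True
    return False


stored = ["The user prefers dark mode in the editor."]
print(is_duplicate("The user prefers dark mode in their editor.", stored))
print(is_duplicate("The user lives in Chennai.", stored))
```

In this example the rephrased sentence shares 7 of 8 distinct tokens with the stored one (0.875), so it is skipped, while the Chennai sentence overlaps on only 3 of 9 and is kept.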

Everything is centralized in a config dataclass: top_k=6 memories injected per turn, history_turns=12 in-context turns, max_tokens=16000 for chat, memory_max_tokens=1024 for extraction, dedupe_threshold=0.7. The default database lives at ~/.selfmind/memory.db (one memories table: id, text, kind, source, created_at), and both models are env-overridable. The chat core and extractor both default to Claude Opus 4.8, but you can point the extractor at a cheaper Haiku model since it's just doing mechanical JSON.
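As a rough picture of how those pieces fit together, here is a sketch of a config dataclass and the single memories table. The field values and column names come from the paragraph above; the column types, the `created_at` default, and the helper names (`open_store`, `add_memory`) are assumptions. Model settings are left out because their identifiers aren't given here.

```python
import sqlite3
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Config:
    db_path: Path = Path.home() / ".selfmind" / "memory.db"
    top_k: int = 6
    history_turns: int = 12
    max_tokens: int = 16000
    memory_max_tokens: int = 1024
    dedupe_threshold: float = 0.7


# Column types and the timestamp default are assumed, not taken from the repo.
SCHEMA = """
CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    text TEXT NOT NULL,
    kind TEXT NOT NULL,
    source TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
)
"""


def open_store(path) -> sqlite3.Connection:
    """Open (and create if needed) the memory database."""
    if str(path) != ":memory:":
        Path(path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(path))
    conn.execute(SCHEMA)
    return conn


def add_memory(conn: sqlite3.Connection, text: str, kind: str, source: str = "chat") -> int:
    cur = conn.execute(
        "INSERT INTO memories (text, kind, source) VALUES (?, ?, ?)",
        (text, kind, source),
    )
    conn.commit()
    return cur.lastrowid


conn = open_store(":memory:")  # in-memory database for the demo
row_id = add_memory(conn, "The user prefers tea over coffee.", "preference")
```

Passing `Config().db_path` instead of `":memory:"` would persist the rows to disk, which is what lets a fresh process read back what an earlier one wrote.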

What I can and can't claim

In the spirit of not overselling: I haven't benchmarked this. There are zero measured metrics in the repo, no retrieval hit-rate numbers, no timings, no extraction precision or recall. The numbers here are configuration constants and BM25 hyperparameters, not results, and the design rationale is documented in code and README comments, not proven by a test suite. This is version 0.1.0, an early prototype. The obvious next step is evals: does the extractor store the durable facts and skip the transient ones, is 0.7 the right Jaccard threshold, and how much does retrieval actually improve if I swap BM25 for embeddings? Those are the experiments the code's own design is begging for.

But the core claim holds because it's structural rather than empirical: memories persist across process restarts, so a fresh session recalls what earlier ones taught it. An agent without memory is a demo; an agent with a persistent store is starting to feel like a colleague. The gap between them isn't a bigger model, it's the unglamorous discipline of choosing what to keep, deduping it, letting the user delete it, and finding it again when it matters. Build that loop, even in ~600 lines, and the second conversation finally picks up where the first left off.
