The LeepCast Blog
Stories worth your attention
No hype, no filler — reporting and analysis on AI, software, hardware, and security, written for the people who build things.
Reasoning Models: When to Pay a Model to Think
A new class of models thinks before it answers — burning extra time and tokens to reason through hard problems. Sometimes that’s transformative. Often it’s an expensive way to answer a question a fast model already nails.
Getting Real Work Out of AI Coding Agents
Most people use Claude Code and Cursor like fancy autocomplete and wonder why the magic runs out. The teams getting real leverage treat them as agents you delegate whole tasks to — and that’s a different skill.
Prompt Engineering That Actually Works
People keep declaring prompt engineering dead. Meanwhile it’s still the cheapest, fastest lever you have on output quality — if you know the handful of techniques that actually move the needle.
Choosing a Vector Database in 2026
Every RAG app and AI feature with memory needs somewhere to store embeddings and search them fast. There are a dozen vector databases now — but the real choice is simpler than the marketing makes it look.
Streaming LLM Responses: Real-Time UX for AI Apps
Waiting ten seconds for a full AI response feels broken. Streaming the answer token by token, the moment it starts generating, is the difference between an app that feels slow and one that feels alive. Here’s how.
Semantic Caching: Cut Your LLM Bill Without Hurting Quality
Teams are quietly killing AI features — not because they don’t work, but because the token bill doesn’t justify them. Semantic caching is the fix: serve a cached answer when someone asks the same thing in different words.
Multi-Agent Orchestration: Coordinating Specialist LLM Agents
One agent doing everything turns into a confused generalist with a bloated prompt. The fix that’s everywhere in 2026 is orchestration — a coordinator routing work to focused specialist agents. Here’s how to build one.
GraphRAG: When Vector Search Isn’t Enough
Vector RAG finds paragraphs that look similar to your question. But some questions are about how things connect — and for those you need a graph. Here’s when plain retrieval breaks, and how GraphRAG fixes it.
Guardrails for LLM Apps: Stopping Prompt Injection and Bad Output
An LLM app takes untrusted text in and sends model-generated text out — two open doors. Guardrails are the checks on both sides that keep prompt injection, leaked data, and bad output from reaching anyone.
LLM Observability: Tracing What Your AI Does in Production
You shipped the LLM feature, the demo worked, and now it’s a black box serving real users. Observability is how you see what your AI is actually doing — before a silent quality drop becomes a support queue.
Giving Your AI Agent Memory
An agent that forgets everything the moment a session ends isn’t an assistant — it’s a very smart goldfish. Memory is the discipline that turns a stateless model into something that actually knows you.
Fine-Tuning a Small Model with LoRA
Prompting and RAG cover most needs, but sometimes you need the model itself to change. LoRA made fine-tuning cheap enough to do on a single GPU — here’s when it’s worth it and how it actually works.
Structured Outputs: Getting Reliable JSON and Tool Calls from LLMs
Ask a model for JSON and it’ll often hand you prose, a markdown fence, and a trailing apology. For anything automated, “often valid” is a bug. Here’s how to make an LLM return data your code can actually trust.
Context Engineering: Managing What Your LLM Actually Sees
Prompt engineering was about choosing your words. Context engineering is about everything else in the window — what you put in, what you leave out, and in what order. It’s the discipline that separates an LLM demo from an LLM product.
Building a RAG Pipeline That Actually Works
Bolting a vector database onto an LLM gives you a demo. Getting it to answer real questions over real documents is an engineering problem — chunking, retrieval, reranking, and knowing when not to retrieve at all. Here’s the pipeline that survives production.
How to Evaluate LLM Apps: Evals That Catch Failures Before Production
You can’t assertEquals a language model. That’s why teams ship LLM features blind and find the regressions in production. Evals are the missing discipline — here’s how to build ones that actually catch failures.
Running LLMs Locally: Ollama vs vLLM in 2026
Open models are good enough now that running one on your own hardware is a real choice, not a hobby. The decision usually comes down to two tools — Ollama for ease, vLLM for throughput. Here’s how to pick and run.
Model Context Protocol Explained: Build Your First MCP Server
Every AI tool used to reinvent its own integrations. The Model Context Protocol turned that M×N mess into a standard — and in eighteen months it became the USB-C of AI apps. Here’s what it is and how to ship a server.
Raspberry Pi AI Projects for Beginners
Most people meet AI through a text box, which hides the most interesting part: making a model do something in the physical world. A Raspberry Pi is the cheapest, friendliest way to cross that line — and its limits are the lesson.
Deploying LLM Apps on GKE, Step by Step
There’s a wide, quiet gap between an LLM app that works on your laptop and one that survives real users on Kubernetes. GKE closes a lot of it — but only if you know which parts it solves and which it leaves to you.
AI API Gateway Architecture, Explained
Every team that ships more than one LLM feature ends up building the same box in front of the model — usually by accident. Here’s what an AI gateway actually does, a reference design, and the mistakes that bite a year later.
Building AI Agents Using Java + Spring Boot
Everyone reaches for Python to build agents. But an agent is mostly plumbing (a supervised loop around a model), and Spring Boot has been quietly excellent at plumbing for more than a decade. Here’s how to build one where your data already lives.
AI Interview Prep Tools That Actually Work
Most AI job tools overpromise. Interview prep is the rare corner where they quietly overdeliver — if you use them to rehearse, and ignore the ones that promise to take the interview for you.
Can AI Build a Resume Better Than Humans?
Hand the same career history to a person and a model and you get two very different resumes. The honest answer to which is better isn’t one or the other — it’s what happens when you stop treating it as a contest.
The Best AI Tools for Job Hunting in 2026
Every week there’s a new AI tool promising to land you a job. Most are thin wrappers around a chatbot. Here’s how to tell the few that matter from the noise — organized by the job you’re actually trying to get done.
How I Use AI to Apply to 100 Jobs Automatically
I built an AI pipeline that scans, scores, and tailors applications across hundreds of postings. The twist: the goal was never to apply to more jobs — it was to apply to far fewer, and mean it.
The agents are here: what autonomous coding means for engineers
A new wave of AI agents can plan, write, and ship code end to end. We dig into what actually works today — and what's still hype.
Why everyone is rewriting their tooling in Rust
From bundlers to linters, the JS ecosystem is going native. Here's the performance story behind the migration.
The new silicon: a closer look at on-device AI chips
NPUs are landing in everything from phones to laptops. What they can do, and why it matters for privacy.
Supply chain attacks are evolving — here's how teams respond
A practical playbook for hardening your dependency graph without bringing shipping to a grinding halt.