On-device AI, from the inside: I ran a voice-clone model on my Mac's silicon

NPUs and Apple's MPS backend promise private, local AI. Here's what it actually cost me — ~5x-slower-than-real-time synthesis, a mandatory lock, CPU fallbacks, and a weight-loading bug that turns a model into pure noise — building an offline Telugu voice clone.

Dhileep Kumar · Jun 4, 2026 · 6 min read

Every explainer on on-device AI ends at the same place: the NPU is efficient, your data stays local, and the future is hybrid. All true. But none of it tells you what it actually feels like to make a real neural network run on the silicon already in your machine. I spent a few weeks doing exactly that — building a Telugu text-to-speech system that clones my own voice and runs entirely on my Apple Silicon Mac, no cloud round-trip, ever. This is the part the explainers skip: the ops that don't exist, the lock you have to hold, and the price in wall-clock seconds you pay for keeping the data on your desk.

The project is called indic-tts. It loads a third-party flow-matching model (ai4bharat's IndicF5) once into memory and synthesizes 24 kHz speech from a single ~15-second reference clip of my voice — zero-shot, meaning no training required to imitate a new speaker. It reads mixed Telugu/English tech prose aloud through a Chrome extension and a macOS menu-bar app. Everything below is measured or read straight out of that repo. Where a number is a single observation rather than a benchmark, I say so.

What an NPU actually is (and what "on-device" really means here)

A CPU is a generalist. A GPU is a parallel workhorse built for graphics that happens to be excellent at the dense matrix multiplications neural networks live on. An NPU — the neural processing unit now shipping in phones and "AI PC" laptops — is neither: it's purpose-built to do that one narrow thing at very low numerical precision (8-bit integers, sometimes 4), because networks tolerate it and low precision means far less energy per operation. That's the textbook version, and it's correct.

On an Apple Silicon Mac the story is subtler than "it runs on the NPU." What you actually target through PyTorch is MPS, the backend built on Apple's Metal Performance Shaders, and it runs on the GPU; the Neural Engine is reached through Core ML, not through PyTorch. And the first thing you learn is that MPS does not implement every operation a model needs. My very first line of environment setup is an admission of that:

python

import os

# Set both variables before torch is imported so they take effect.
# Let unsupported MPS ops fall back to CPU instead of crashing (Apple Silicon).
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
# IndicF5 wraps its vocoder in torch.compile(); on recent torch this breaks while
# tracing torchaudio's mel filterbank. Disabling Dynamo makes compile a no-op (eager).
os.environ.setdefault("TORCHDYNAMO_DISABLE", "1")

Without PYTORCH_ENABLE_MPS_FALLBACK=1, an op the backend doesn't support doesn't degrade gracefully — it throws, and inference dies. With it, unsupported operations quietly execute on the CPU instead. So "on-device inference" is not one homogeneous chip humming along; it's a scheduler shuttling tensors between the accelerator and the CPU, and the ops that fall back are pure overhead. This is the reality behind the marketing TOPS number: the theoretical throughput assumes the whole graph lands on the fast path, and real models rarely do.

The price of keeping it local: ~5x slower than real-time

Here is the one measured timing number I have, and I want to be precise about what it is. On device=mps, my run log recorded: "Done in 30.6s. Saved: test.wav (5.8s of audio)." That's roughly 5.3x slower than real-time to synthesize: 30.6 seconds of compute for 5.8 seconds of speech, on a first, uncached generation. It is a single observed line from one run, not an averaged benchmark; I haven't re-run it under controlled conditions. But it's honest, and it's the shape of the trade-off nobody quotes: the on-device tax is real and you feel it.
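For anyone who wants to reproduce the arithmetic, here is a minimal sketch of the real-time-factor calculation. The function name is mine, not something from the repo; the two inputs are the figures from that single log line.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio (above 1.0 means slower than real-time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


# Figures from the single logged run: 30.6 s of compute for 5.8 s of audio.
rtf = real_time_factor(30.6, 5.8)
print(f"{rtf:.1f}x slower than real-time")
```

A factor below 1.0 would mean the model generates faster than the audio plays, which is the threshold at which streaming stops needing a head start.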

On-device inference isn't one chip humming along; it's a scheduler shuttling tensors between the accelerator and the CPU, and every op the accelerator can't do is pure tax you pay in wall-clock seconds.

That ~5x figure is also why the architecture around the model matters more than the model. If a cold generation costs 30 seconds, you cannot afford to pay it twice for the same sentence. So the server hashes every chunk of text with SHA-256 and caches the resulting audio to disk. A re-read of anything you've heard before is a disk read, not a re-synthesis. The cache key deliberately embeds a reference-clip identity and a voice tag, so when I swap the reference or the fine-tuned voice, the whole cache invalidates itself instead of serving stale audio in the wrong voice.
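A cache key along those lines might look like the sketch below. This is an illustration under assumptions, not the repo's code: the names `cache_key`, `cache_path`, `ref_id`, and `voice_tag`, the field separator, and the `.wav` file layout are all mine.

```python
import hashlib
from pathlib import Path


def cache_key(text: str, ref_id: str, voice_tag: str) -> str:
    """SHA-256 over the chunk text plus the voice identity that produced it."""
    # Assumed separator: the ASCII unit separator, which is unlikely to occur
    # in prose, keeps the three fields from bleeding into each other.
    payload = "\x1f".join((voice_tag, ref_id, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cache_path(cache_dir: Path, text: str, ref_id: str, voice_tag: str) -> Path:
    """Where the audio for this chunk would live on disk (hypothetical layout)."""
    return cache_dir / f"{cache_key(text, ref_id, voice_tag)}.wav"
```

Because the reference identity and voice tag are part of the hashed payload, changing either one produces different keys for every chunk, so old files are simply never looked up again.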

MPS is not concurrency-safe, so inference is single-file

This one surprised me and it's the detail I'd most want another builder to know. You cannot just throw threads at a local model to hide the latency. The model and the MPS backend are not concurrency-safe: fire two generations at once and you get corruption or a crash, not a speedup. So every generation is serialized behind a single threading.Lock. Concurrency, on this kind of on-device target, has to happen at a coarser grain than the model call.
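The pattern is small enough to sketch. Below, `synthesize_serialized` and `fake_infer` are hypothetical names; `fake_infer` stands in for the real model call and only counts how many calls overlap.

```python
import threading
import time

_infer_lock = threading.Lock()


def synthesize_serialized(text, infer):
    """Run infer(text) with at most one call in flight at a time."""
    with _infer_lock:
        return infer(text)


# Stand-in for the model call: tracks how many invocations overlap.
stats = {"active": 0, "peak": 0}


def fake_infer(text):
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # pretend to do work
    stats["active"] -= 1
    return f"audio for {text!r}"


threads = [
    threading.Thread(target=synthesize_serialized, args=(f"sentence {i}", fake_infer))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent calls:", stats["peak"])
```

Four threads ask for audio at once, but the lock admits them one by one, so the stand-in never sees more than a single call in progress.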

The way you actually make it feel responsive isn't parallel inference — it's pipelining at the sentence level. The FastAPI server splits text at sentence boundaries (capped at 320 characters per chunk), and the clients stream: play sentence i while sentence i+1 is generating. To stop the audio "stumbling" at each seam, adjacent chunks are joined with a 0.07-second equal-power (cosine/sine) crossfade and a small gap. The listener perceives continuous speech; under the hood it's a strict one-at-a-time queue with a lock around the expensive part.
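An equal-power crossfade can be sketched in a few lines. This is my illustration over plain lists of float samples, not the server's code: the function name and signature are assumptions, and the small gap between chunks is left out. The outgoing chunk is scaled by a cosine and the incoming one by a sine, so the squares of the two gains always sum to 1.

```python
import math


def equal_power_crossfade(a, b, sample_rate=24_000, fade_seconds=0.07):
    """Join two sample lists, overlapping them by fade_seconds with cos/sin gains."""
    n = min(round(sample_rate * fade_seconds), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    start = len(a) - n  # index in `a` where the overlap begins
    mixed = []
    for i in range(n):
        t = (i + 0.5) / n  # position inside the fade, strictly between 0 and 1
        fade_out = math.cos(t * math.pi / 2)  # gain on the outgoing chunk
        fade_in = math.sin(t * math.pi / 2)   # gain on the incoming chunk
        mixed.append(a[start + i] * fade_out + b[i] * fade_in)
    return list(a[:start]) + mixed + list(b[n:])
```

At 24 kHz a 0.07-second fade overlaps 1,680 samples, so the joined audio is that much shorter than the two chunks laid end to end.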

Serialize the model call: one threading.Lock, no exceptions; MPS won't tolerate concurrent inference.
Cache aggressively: SHA-256 per chunk to disk, because a cold generation is ~5x slower than real-time and you never want to pay it twice.
Pipeline at the boundary: split at sentences, play chunk i while i+1 generates, crossfade the seams (0.07s equal-power).
Keep the model warm: load it once into a long-lived FastAPI process; the load and first-run cost is not something you want per request.
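The split-and-pipeline idea can be sketched in one process. Everything here is an assumption for illustration: the real system splits on the server and streams to separate clients, and its exact splitting rule is not shown in this post. `split_chunks` and `stream` are hypothetical names, and `synthesize` and `play` are whatever callables you pass in.

```python
import re
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS = 320  # per-chunk cap mentioned above


def split_chunks(text, max_chars=MAX_CHARS):
    """One chunk per sentence; sentences over the cap are wrapped at a space."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for s in sentences:
        while len(s) > max_chars:
            cut = s.rfind(" ", 0, max_chars + 1)
            if cut <= 0:  # no space to break on: hard cut at the cap
                cut = max_chars
            chunks.append(s[:cut].strip())
            s = s[cut:].strip()
        if s:
            chunks.append(s)
    return chunks


def stream(chunks, synthesize, play):
    """Play chunk i while chunk i+1 is being synthesized."""
    if not chunks:
        return
    # A single worker keeps generation strictly one-at-a-time.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize, chunks[0])
        for nxt in chunks[1:]:
            audio = pending.result()
            pending = pool.submit(synthesize, nxt)  # start generating i+1
            play(audio)                             # while chunk i plays
        play(pending.result())
```

With a single worker the only overlap is between playback and the next generation, which is the one kind of concurrency this setup can afford.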

The load-bearing bug: a torch.compile prefix that turns your model into noise

The most instructive failure had nothing to do with chips and everything to do with how these models get shipped. IndicF5's published checkpoint was saved while the model was wrapped by torch.compile. That wrapper leaves a fingerprint: every single weight key carries an extra ._orig_mod. segment in its name. Load that checkpoint into an un-wrapped model and the key names don't match. And here's the trap: the loader doesn't error. It silently matches nothing, leaves the transformer at its random initialization, and hands you back a model that runs perfectly and produces pure noise.
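To see why nothing complains, it helps to look at the key sets without PyTorch in the way. In the sketch below the key names are made up for illustration (they are not IndicF5's real parameter names) and `diff_keys` is a hypothetical helper. It mirrors what a non-strict `load_state_dict` reports: the mismatches come back as lists instead of raising.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) the way a non-strict state-dict load reports them."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # model expects, file lacks
    unexpected = sorted(checkpoint_keys - model_keys)  # file has, model ignores
    return missing, unexpected


# Hypothetical key names, for illustration only.
model_keys = ["transformer.proj.weight", "transformer.proj.bias"]
ckpt_keys = ["transformer._orig_mod.proj.weight", "transformer._orig_mod.proj.bias"]

missing, unexpected = diff_keys(model_keys, ckpt_keys)
print(f"{len(missing)} missing, {len(unexpected)} unexpected")
```

Every key lands in both lists at once: the model sees nothing it recognizes, the file contains nothing the model asked for, and unless someone reads those lists the load looks like a success.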

The fix is to strip the prefix so the keys line up, then assert loudly that they did:

python

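from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# REPO_ID and model are defined earlier in the loader.
sd = load_file(hf_hub_download(REPO_ID, "model.safetensors"))
# Drop the segment torch.compile added so the keys match the un-wrapped model.
remapped = {k.replace("._orig_mod.", "."): v for k, v in sd.items()}
# strict=False returns the mismatches instead of raising, so check them explicitly.
res = model.load_state_dict(remapped, strict=False)
n_miss, n_unexp = len(res.missing_keys), len(res.unexpected_keys)
if n_miss or n_unexp:
    print(f"WARNING: weight load mismatch - {n_miss} missing, {n_unexp} unexpected")
else:
    print("Weights loaded cleanly (0 missing, 0 unexpected).")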

After the strip, the log reads "Weights loaded cleanly (0 missing, 0 unexpected)." That line is the difference between a working voice and noise. It's also a lesson about local AI in general: when you leave the safety of a hosted API and load raw weights yourself, silent mismatches become your problem. On top of this I had to monkey-patch two more things to get the model to load at all on modern libraries: transformers now builds models on a "meta" device that IndicF5's initializer can't tolerate, and torchaudio 2.11 moved its file loader onto a new backend that the model's WAV reader chokes on. None of that is exotic; it's the ordinary friction of running someone else's checkpoint on your own hardware, and it's exactly what the polished explainers omit.

The part that isn't about silicon at all: the text has to be readable first

There's a myth that on-device intelligence is just about the model and the chip. Half my effort went somewhere else entirely. IndicF5 silently skips or mangles Latin-script English words, and my source material is Telugu prose full of English tech jargon — model, GPU, GHz, 92.5%, Rs1500. Fed raw, it's unreadable. So I hand-built a Telugu normalizer from scratch: an English-to-Telugu term dictionary of roughly 200 jargon entries, a rule-based Latin-to-Telugu fallback transliterator for words not in the dictionary, an integer-to-Telugu-words converter that respects the crore/lakh grouping, and handlers for percentages, currency, and units.
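The crore/lakh grouping is the piece that differs most from Western number handling, so here is a minimal sketch of just that step. It is not the normalizer from the repo: the function names are mine, and English labels stand in where the real converter would emit Telugu number words.

```python
def indian_groups(n: int):
    """Split a non-negative integer into (crore, lakh, thousand, hundred, rest)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    crore, n = divmod(n, 10_000_000)  # 1 crore = 1,00,00,000
    lakh, n = divmod(n, 100_000)      # 1 lakh  = 1,00,000
    thousand, n = divmod(n, 1_000)
    hundred, rest = divmod(n, 100)
    return crore, lakh, thousand, hundred, rest


def describe(n: int) -> str:
    """Spell out the grouping with placeholder English labels."""
    names = ("crore", "lakh", "thousand", "hundred")
    *groups, rest = indian_groups(n)
    parts = [f"{value} {name}" for value, name in zip(groups, names) if value]
    if rest or not parts:
        parts.append(str(rest))
    return " ".join(parts)


print(describe(12_345_678))
```

A Western grouper would read the same digits as 12 million 345 thousand 678, which is why the conversion has to happen before the text reaches the model.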

It even goes down to the level of individual glyphs. The base model mispronounces certain hard Telugu conjuncts, so for specific words I hard-code a respelling that inserts a zero-width non-joiner (U+200C) after the virama to break an unwanted conjunct, including, fittingly, the Telugu word for "failure." That's the texture of real on-device work: not a clean box labeled "AI," but a text front-end you have to build by hand before the model can even do its job. The chip is the easy part; making it understand your input is where the weeks go.
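Mechanically, the respelling is a one-character insertion. The sketch below is an assumption about the shape of it, not the repo's table: `break_conjuncts`, `respell`, and the `RESPELL` set are hypothetical, and the single entry is a bare three-code-point cluster (LA, virama, YA) rather than any particular word from my list.

```python
VIRAMA = "\u0C4D"  # TELUGU SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def break_conjuncts(word: str) -> str:
    """Insert a ZWNJ after each virama that is followed by another character."""
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if ch == VIRAMA and i + 1 < len(word) and word[i + 1] != ZWNJ:
            out.append(ZWNJ)
    return "".join(out)


# Hypothetical allow-list: only words known to be mispronounced get respelled.
RESPELL = {"\u0C32\u0C4D\u0C2F"}  # LA + VIRAMA + YA


def respell(text: str) -> str:
    """Apply the respelling to listed words and leave everything else untouched."""
    return " ".join(break_conjuncts(w) if w in RESPELL else w for w in text.split(" "))
```

Restricting the change to an allow-list matters, because most conjuncts are pronounced correctly and breaking them everywhere would trade one class of errors for another.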

So what's the on-device story, really?

The privacy dividend is genuine and it's the reason I built this locally at all: the reference clip of my own voice, and every document I have it read, never leave the machine. There is no server that logs the audio, no training pipeline it silently feeds. For a voice — arguably the most personal biometric you have — that structural guarantee is the entire point. The hardware really does let you decouple the smart feature from the surveillance.

But the honest version has a bill attached. On-device means ~5x-slower-than-real-time generation, a mandatory lock because the backend can't do two things at once, ops that quietly fall back to the CPU, and a pile of glue code — prefix stripping, monkey-patches, a hand-written transliterator — that exists only because you're running raw weights on hardware they weren't packaged for. All of that is invisible in a TOPS number. The silicon in your pocket is real and it's capable. What the explainers won't tell you is how much software you have to write around it before "runs on-device" turns into something a person would actually want to use.
