Shipping a Jarvis-like voice interface without the Jarvis-sized bill: lessons learned

How we gave a 60-tool security agent a real voice, kept it as smart as it is in text, and paid commodity prices instead of realtime-voice-API prices.

Speech in, speech out, three function calls in between is a solved problem. Microphone, transcription, a prompt, text-to-speech. That part is commodity, so we won't dwell on it.

The hard part is voicing an agent that thinks. Ours does. The Transilience security agent runs on Claude Sonnet over the Vercel AI SDK, and before it answers you it reasons for several seconds and fans out across 60+ tools:

REST calls to run scans and pull findings
document skills that produce PDFs and slide decks
dashboard, scheduling, and file actions

One spoken question can set off a chain of tool calls and thousands of tokens of hidden reasoning. So the question was never how to make it talk. It was how to voice something that thinks this hard without ten seconds of silence, and without sounding like a machine reading a log. Here's what we learned.

We didn't change the agent. We listened to it.

The agent stays exactly as it is in text: same model, prompt, personality, tools, reasoning. Voice is a layer that listens in on what the AI SDK is already streaming and decides what to say out loud. There's a lot to listen to:

the thinking: "pull their IAM findings, then cross-reference the last scan"
the tool calls: runApp, readFileContent, getSessionDetail, a document skill
the answer itself

Voicing the agent means turning that stream of internal machinery into calm narration that sounds like someone working on your problem.
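As a rough illustration of that listening layer, here is a minimal sketch. The part shapes are simplified stand-ins, not the AI SDK's real types (its part names and fields vary by version), and `StreamPart`, `VoiceEvent`, and `routeForVoice` are hypothetical names invented for this example.

```typescript
// Simplified, hypothetical stand-ins for what the agent streams.
// The real AI SDK part types and field names differ between versions.
type StreamPart =
  | { type: "reasoning"; text: string }
  | { type: "tool-call"; toolName: string }
  | { type: "text"; text: string };

// What the voice layer does with each part (hypothetical).
type VoiceEvent =
  | { kind: "narrate"; hint: string } // hand to the narrator model
  | { kind: "speak"; text: string }; // feed toward the answer distiller

// Thinking and tool calls become narration hints; answer text goes to speech.
function routeForVoice(part: StreamPart): VoiceEvent {
  switch (part.type) {
    case "reasoning":
      return { kind: "narrate", hint: `thinking: ${part.text}` };
    case "tool-call":
      return { kind: "narrate", hint: `calling tool: ${part.toolName}` };
    case "text":
      return { kind: "speak", text: part.text };
  }
}

const sampleParts: StreamPart[] = [
  { type: "reasoning", text: "pull their IAM findings" },
  { type: "tool-call", toolName: "getSessionDetail" },
  { type: "text", text: "No critical findings." },
];
const sampleEvents = sampleParts.map(routeForVoice);
```

The point of the split is that nothing here touches the agent: the router only reads parts that were going to be streamed anyway.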

A second model, running in parallel, just to sound human

You can't ask the main agent to narrate itself; that slows it down and pollutes its reasoning. So a smaller, faster model (Claude Haiku) runs alongside it with one job: watch what the big agent is doing and say one short, human line about it. We call it the Voice Director:

"Pulling your IAM findings now." "Cross-referencing that against the last scan."

Not a canned "please wait." We shipped canned filler first, it sounded like hold music, and we deleted it the same week. Each line is one breath, capped to about sixteen words, with a hard limit per turn so it narrates instead of rambling. If it can't say something honest in time, it stays quiet. It runs fire-and-forget: never in the agent's path, never blocking a tool call, never touching the answer.
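Those rules (one short line, a per-turn cap, silence on a miss, never blocking) can be sketched roughly as below. This is an illustration under assumptions, not the production Voice Director: `Narrator` stands in for the small-model call, the per-turn limit of 4 and the 800 ms deadline are made-up values, and truncating at sixteen words is a simplification of "capped to about sixteen words".

```typescript
// Hypothetical narrator: in practice this would be a call to a small model.
type Narrator = (hint: string) => Promise<string>;

const MAX_WORDS = 16; // "about sixteen words" per line
const MAX_LINES_PER_TURN = 4; // assumed value; the post only says "a hard limit"
const DEADLINE_MS = 800; // assumed value for "in time"

// Resolve to null if the promise is too slow or fails: errors become silence.
function withDeadline<T>(p: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      () => { clearTimeout(timer); resolve(null); },
    );
  });
}

function capWords(line: string, max: number): string {
  return line.trim().split(/\s+/).filter(Boolean).slice(0, max).join(" ");
}

class VoiceDirector {
  private reserved = 0;
  private readonly narrator: Narrator;
  private readonly say: (line: string) => void;
  private readonly deadlineMs: number;

  constructor(narrator: Narrator, say: (line: string) => void, deadlineMs = DEADLINE_MS) {
    this.narrator = narrator;
    this.say = say;
    this.deadlineMs = deadlineMs;
  }

  startTurn(): void {
    this.reserved = 0;
  }

  // Fire-and-forget: callers do not await this, so the agent never waits on it.
  async observe(hint: string): Promise<void> {
    if (this.reserved >= MAX_LINES_PER_TURN) return;
    this.reserved++; // reserve the slot before awaiting so parallel calls respect the cap
    const line = await withDeadline(this.narrator(hint), this.deadlineMs);
    const capped = line ? capWords(line, MAX_WORDS) : "";
    if (capped) this.say(capped);
    else this.reserved--; // stayed quiet: give the slot back
  }
}
```

A caller in the agent loop would write something like `void director.observe("calling tool: runApp")` and move on without awaiting the result.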

Saying the answer like a person, not reading it like a file

Reading the raw answer aloud was awful. Real answers are rich markdown: bullets, links, file names, UUIDs, tables. Out loud, that becomes a run-on that recites identifiers like a license plate.

So the voice never gets the raw answer. The screen shows the full version; the voice gets a distillation, split in two:

Opening: the moment the first sentences land, Haiku distills them into one line. The agent leads with the bottom line, so that opening line is the answer. You hear the verdict almost immediately.
Remainder: distilled separately, told what was already said so it doesn't repeat, and kicked off mid-stream so it's ready the instant the opening finishes. No gap.
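The two-part split can be sketched as follows. Treat it as one reading of the description rather than the actual implementation: `Distill` stands in for the Haiku call, the two-sentence opening and the naive sentence detector are assumptions, and the remainder is kicked off here as soon as the text ends, so it runs while the opening is still being spoken. Error handling is omitted.

```typescript
// Hypothetical distiller: in practice a small-model call. `alreadySaid`
// tells it what the listener has heard so it does not repeat itself.
type Distill = (text: string, alreadySaid?: string) => Promise<string>;

const OPENING_SENTENCES = 2; // assumed; the post says "the first sentences"

// Index just past the nth sentence end, or -1 if the buffer is not there yet.
// Naive on purpose: punctuation must be followed by whitespace, so "3.5" and
// "report.pdf" do not count as sentence ends.
function openingEnd(buffer: string, n: number): number {
  const re = /[.!?](?=\s)/g;
  let count = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(buffer)) !== null) {
    count++;
    if (count === n) return m.index + 1;
  }
  return -1;
}

async function speakAnswer(
  deltas: AsyncIterable<string>,
  distill: Distill,
  say: (line: string) => Promise<void>,
): Promise<void> {
  let buffer = "";
  let cut = -1;
  let openingLine: Promise<string> | null = null;
  let openingSpoken: Promise<void> | null = null;

  for await (const delta of deltas) {
    buffer += delta;
    if (openingLine === null) {
      cut = openingEnd(buffer, OPENING_SENTENCES);
      if (cut !== -1) {
        // Opening: distill and start speaking while text is still arriving.
        openingLine = distill(buffer.slice(0, cut));
        openingSpoken = openingLine.then(say);
      }
    }
  }

  if (openingLine === null || openingSpoken === null) {
    // Short answer that never reached the boundary: speak it in one go.
    if (buffer.trim()) await say(await distill(buffer));
    return;
  }

  // Remainder: starts now, overlapping with the opening's audio.
  const rest = buffer.slice(cut).trim();
  const remainder: Promise<string> = rest
    ? openingLine.then((said) => distill(rest, said))
    : Promise.resolve("");
  await openingSpoken;
  const line = await remainder;
  if (line) await say(line);
}
```

Because the remainder's distillation overlaps with the opening's playback, the second line is usually ready by the time the first one finishes.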

The AI SDK plumbing that made it real-time

The natural way to feed answer text into speech on the Vercel AI SDK is result.textStream. We tried that first, and the pipeline stalled until the model had finished the entire answer; the SDK buffers that view when you're also consuming the UI stream. The fix: tap the raw onChunk text deltas, push them into our own queue, and feed that to speech. Now audio starts at the first token instead of the last. Small change, hard to find, and it's the difference between feeling alive and feeling like it's buffering.
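The "own queue" part of that fix can be as small as the sketch below: a push-based async queue that the chunk callback writes into and the speech side reads from. `TextQueue` is a hypothetical name, and the wiring in the trailing comment is an assumption about how it would attach to the SDK's onChunk callback; the exact chunk field names depend on the AI SDK version.

```typescript
// Minimal push-based async queue: the producer pushes, the consumer iterates.
class TextQueue implements AsyncIterable<string> {
  private items: string[] = [];
  private waiters: Array<(r: IteratorResult<string>) => void> = [];
  private closed = false;

  push(text: string): void {
    if (this.closed) return;
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: text, done: false }); // hand straight to a waiting reader
    else this.items.push(text); // otherwise buffer until someone reads
  }

  close(): void {
    this.closed = true;
    for (const w of this.waiters.splice(0)) w({ value: undefined, done: true });
  }

  [Symbol.asyncIterator](): AsyncIterator<string> {
    return {
      next: (): Promise<IteratorResult<string>> => {
        if (this.items.length > 0) {
          return Promise.resolve({ value: this.items.shift() as string, done: false });
        }
        if (this.closed) return Promise.resolve({ value: undefined, done: true });
        return new Promise<IteratorResult<string>>((resolve) => {
          this.waiters.push(resolve);
        });
      },
    };
  }
}

// Assumed wiring (not executed here; field names vary by SDK version):
//   const queue = new TextQueue();
//   streamText({
//     ...,
//     onChunk({ chunk }) {
//       if (chunk.type === "text-delta") queue.push(chunk.text); // `textDelta` in older versions
//     },
//     onFinish() { queue.close(); },
//   });
//   for await (const delta of queue) { /* feed text-to-speech */ }
```

The reader gets each delta as soon as it is pushed, independent of how the UI stream is being consumed.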

The bill: we only pay for audio at the edges

A realtime voice API like OpenAI's gpt-realtime bills in audio tokens and makes one model do everything inside that session: listen, reason, call tools, talk. Audio output runs about $64 per million tokens, metered at one token per 50 ms, so you pay premium rates for every second it talks plus every reasoning token and tool result riding along. For us the orchestration is the expensive part, and a realtime API would run all of it at audio-session prices.
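To make that metering concrete, here is the arithmetic as a small sketch, using only the two figures quoted above ($64 per million audio output tokens, one token per 50 ms). `realtimeAudioOutUsd` is a hypothetical helper, and it covers spoken output only, not the reasoning tokens and tool results that ride along in the same session.

```typescript
const USD_PER_MILLION_AUDIO_OUT = 64; // figure quoted in the text
const MS_PER_AUDIO_TOKEN = 50; // figure quoted in the text

// Cost of the model's spoken output alone, for a given duration.
function realtimeAudioOutUsd(seconds: number): number {
  const tokens = (seconds * 1000) / MS_PER_AUDIO_TOKEN; // 20 tokens per second
  return (tokens * USD_PER_MILLION_AUDIO_OUT) / 1e6;
}

// One minute of speech is 1,200 tokens.
const usdPerSpokenMinute = realtimeAudioOutUsd(60);
```

That works out to roughly 7.7 cents per minute of speech before any reasoning or tool traffic is counted.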

So we unbundled it. The deep reasoning and tool orchestration run on a plain text model (Sonnet, $3 / $15 per million input / output text tokens), exactly as they do for the text product, at no premium. Speech only touches the edges:

Piece	Job	List price
Deepgram Flux	speech to text	~$0.0077 / min
Claude Haiku	narrate and distill	$1 / $5 per M tokens (input / output), fed a few hundred tokens per turn
ElevenLabs Flash v2.5	text to speech	~$0.05 to $0.12 / 1K chars

And since the voice speaks only the distilled answer, even the text-to-speech bill is a fraction of reading the whole thing. The reasoning stays at text prices; voice adds a few cents at the rim instead of a premium on the whole turn.
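A back-of-the-envelope version of "a few cents at the rim", as a sketch: the prices are the list prices from the table, while the usage numbers in `exampleTurn` are invented for illustration and are not measurements from the product. `edgeCostUsd` and `TurnUsage` are hypothetical names.

```typescript
// List prices copied from the table above; check current vendor pricing.
const STT_USD_PER_MIN = 0.0077;
const HAIKU_USD_PER_M_INPUT = 1;
const HAIKU_USD_PER_M_OUTPUT = 5;

interface TurnUsage {
  sttMinutes: number; // length of the user's spoken question
  haikuInputTokens: number;
  haikuOutputTokens: number;
  ttsChars: number; // characters actually spoken (the distilled answer)
  ttsUsdPer1kChars: number; // 0.05 to 0.12 per the table
}

// Voice-only cost of one turn; the Sonnet reasoning cost is billed separately as text.
function edgeCostUsd(u: TurnUsage): number {
  const stt = u.sttMinutes * STT_USD_PER_MIN;
  const haiku =
    (u.haikuInputTokens * HAIKU_USD_PER_M_INPUT +
      u.haikuOutputTokens * HAIKU_USD_PER_M_OUTPUT) / 1e6;
  const tts = (u.ttsChars / 1000) * u.ttsUsdPer1kChars;
  return stt + haiku + tts;
}

// Hypothetical turn: 15 s question, 600 in / 100 out Haiku tokens, 300 spoken characters.
const exampleTurn = { sttMinutes: 0.25, haikuInputTokens: 600, haikuOutputTokens: 100, ttsChars: 300 };
const lowEstimate = edgeCostUsd({ ...exampleTurn, ttsUsdPer1kChars: 0.05 });
const highEstimate = edgeCostUsd({ ...exampleTurn, ttsUsdPer1kChars: 0.12 });
```

Under these made-up inputs the voice layer comes to roughly $0.018 to $0.039 per turn, with text-to-speech as the largest share, which is why speaking only the distilled answer matters.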

One agent, thinking as hard as ever, now out loud

What we're proud of isn't speech recognition or a nice voice. Anyone can buy those. It's that an agent doing real work, reasoning deeply, orchestrating dozens of tools, generating documents, now talks you through it like a sharp colleague: what it's checking, what it found, what to do about it. In real time, no dead air, no robot.

We didn't wrap a voice API. We taught a thinking agent to think out loud.