Why we ripped cloud voice out of our robots: from Deepgram + OpenAI + ElevenLabs to on-device

Like most teams, we built our first robot voice agent the fastest way possible: all cloud, all Python. Deepgram for speech-to-text, OpenAI for the LLM, ElevenLabs for text-to-speech. A few hundred lines of glue and we had a robot that listened, thought, and talked.

In a quiet room, on good Wi-Fi, it demoed beautifully.

Then we put it on a robot that had to work in the real world — and every assumption behind that stack fell apart. This is the honest story of what broke, why cloud-first is the wrong foundation for physical AI, and what we're building instead.

The cloud-first stack we started with

mic → Deepgram (STT) → OpenAI (LLM) → ElevenLabs (TTS) → speaker
         ↑ network        ↑ network        ↑ network

Three SaaS APIs, three network round-trips, one Python event loop holding it together. It's the path of least resistance, and for a prototype it's the right call. For a product that ships on a robot, it's a trap.

Everything wrong with cloud-first voice on a robot

Latency is three network round-trips deep. Every turn pays for the trip to Deepgram, then OpenAI, then ElevenLabs, one after another. Even on a good connection that's well over a second before the robot starts talking, and humans read a pause that long as "it's broken." Voice lives or dies on time to first audio (the gap between the user going quiet and the robot making a sound), and you've handed that number to three vendors and your network.
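To make the serialization point concrete, here is a minimal back-of-the-envelope sketch. The `Stage` type and every millisecond value in it are hypothetical placeholders, not measurements from our stack. It also assumes processing time stays the same when the network is removed, which will not hold on real hardware, so it isolates only the network term.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    network_rtt_ms: float  # round-trip to the vendor (placeholder value)
    processing_ms: float   # time until the stage emits usable output (placeholder)


def time_to_first_audio_ms(stages: List[Stage]) -> float:
    """Serialized pipeline: each stage waits for the previous one, so delays add."""
    return sum(s.network_rtt_ms + s.processing_ms for s in stages)


def without_network(stages: List[Stage]) -> List[Stage]:
    """Same stages with the network term zeroed out (processing left unchanged)."""
    return [Stage(s.name, 0.0, s.processing_ms) for s in stages]


# Illustrative numbers only. Replace them with your own measurements.
CLOUD = [
    Stage("stt", network_rtt_ms=80, processing_ms=300),
    Stage("llm", network_rtt_ms=80, processing_ms=400),
    Stage("tts", network_rtt_ms=80, processing_ms=250),
]

if __name__ == "__main__":
    print("cloud:", time_to_first_audio_ms(CLOUD), "ms")
    print("no network:", time_to_first_audio_ms(without_network(CLOUD)), "ms")
```

The useful part is the shape of the sum rather than the totals: the network term appears once per vendor, and a slow or dropped connection inflates all three at once.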

No connection, no robot. Robots, vehicles, drones, and field hardware operate where Wi-Fi is flaky, congested, or absent. A voice agent that hard-depends on three cloud APIs simply stops working the moment connectivity does — exactly when you need it most.

Your users' audio leaves the device. Every utterance, which can include names, proprietary processes, and medical or in-vehicle context, gets streamed to third-party servers. For automotive, defense, healthcare, and most serious enterprise buyers, that's a non-starter on privacy and compliance grounds. It turns voice from a feature into a liability.

Cost scales with every word. Per-minute STT, per-token LLM, per-character TTS — multiplied across a fleet running all day. The bill that's invisible at prototype scale becomes a line item that kills the unit economics at deployment scale.
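A rough way to see how the three pricing units stack up across a fleet is sketched below. The `Rates` and `Conversation` types and all the numbers are hypothetical. Plug in the prices from your own vendor contracts and the usage from your own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    stt_per_minute: float     # USD per minute of audio transcribed
    llm_per_1k_tokens: float  # USD per 1,000 LLM tokens (input + output)
    tts_per_1k_chars: float   # USD per 1,000 characters synthesized


@dataclass(frozen=True)
class Conversation:
    audio_minutes: float
    llm_tokens: int
    tts_chars: int


def conversation_cost(rates: Rates, convo: Conversation) -> float:
    """Sum the three metered components for one conversation."""
    return (
        rates.stt_per_minute * convo.audio_minutes
        + rates.llm_per_1k_tokens * convo.llm_tokens / 1000
        + rates.tts_per_1k_chars * convo.tts_chars / 1000
    )


def fleet_cost_per_day(rates: Rates, convo: Conversation,
                       robots: int, conversations_per_robot: int) -> float:
    """Scale one typical conversation across every robot for a day."""
    return conversation_cost(rates, convo) * robots * conversations_per_robot


if __name__ == "__main__":
    # Placeholder rates and usage, chosen only to make the arithmetic visible.
    rates = Rates(stt_per_minute=1.0, llm_per_1k_tokens=2.0, tts_per_1k_chars=4.0)
    typical = Conversation(audio_minutes=2.0, llm_tokens=500, tts_chars=250)
    print(fleet_cost_per_day(rates, typical, robots=10, conversations_per_robot=5))
```

None of the three terms has a fixed ceiling: each grows linearly with usage, which is why the bill tracks fleet size and talk time rather than hardware you have already paid for.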

You're renting your core loop. Rate limits, deprecations, price changes, and outages on three vendors are all now your robot's problem. You don't control the most latency-critical path in your product.

Python in the hot path. Real-time audio and deterministic latency don't love a Python event loop and the GIL, and Python glue doesn't drop cleanly into the C++ control stack robots already run.

Where we are now (and why it's still not enough)

So we pulled the whole pipeline on-device. STT moved to whisper.cpp, TTS to local engines like Piper and Kokoro, and the LLM runs locally too — we talk to it over a localhost HTTP call to an on-device inference server. No network hops, no third-party vendors, no audio or text leaving the machine. On the privacy and offline axes, we're already there.

But we'll be honest about the gap that's left, and it's architectural. Even fully local, this is still independent components stitched across process boundaries — three engines and an HTTP loopback in the critical path. Every localhost round-trip and serialization step is latency you're paying for, and the interruption problem (a human talking over the robot) still requires a control loop those separate components don't share. On-device but stitched-together isn't the destination — it's a checkpoint. The next step is collapsing it into one integrated C++ runtime.

What we're building

The thesis behind EdgeAI: a robot's voice agent shouldn't be a pile of SaaS calls and glue. It should be one on-device C++ stack designed from day one around the four things that actually matter:

End-to-end on-device — local STT, a small language model for reasoning, and local TTS, so nothing leaves the device and there's no network in the critical path.
Streaming by design — partial transcripts feed the model and tokens feed TTS as they're produced, so the agent starts speaking while it's still thinking.
Barge-in as a first-class operation — VAD, generation, and playback in one runtime, so a human speaking instantly cancels playback and reopens listening.
C++ on the hot path — deterministic and low-overhead, built to drop into the control stack robots already run.
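The streaming and barge-in ideas above can be sketched in a few lines. This is an illustration of the control flow only, written in Python for brevity. It is not the EdgeAI runtime, which per the list above would live in C++. The names `sentence_chunks`, `speak`, and `play_chunk` are hypothetical, and the barge-in flag stands in for whatever the VAD would set.

```python
import threading
from typing import Callable, Iterable, Iterator, List

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into sentence-sized chunks, emitted as each completes."""
    buf: List[str] = []
    for tok in tokens:
        buf.append(tok)
        if tok.rstrip().endswith(SENTENCE_END):
            yield "".join(buf).strip()
            buf = []
    tail = "".join(buf).strip()
    if tail:  # flush whatever is left when generation ends
        yield tail


def speak(tokens: Iterable[str],
          play_chunk: Callable[[str], None],
          barge_in: threading.Event) -> List[str]:
    """Play chunks as they become available and stop once barge_in is set.

    Returns the chunks that were actually played.
    """
    played: List[str] = []
    chunks = sentence_chunks(tokens)
    while not barge_in.is_set():      # check before pulling more tokens
        chunk = next(chunks, None)
        if chunk is None or barge_in.is_set():
            break                     # stream finished, or user spoke meanwhile
        play_chunk(chunk)             # stand-in for TTS synthesis plus playback
        played.append(chunk)
    return played


if __name__ == "__main__":
    flag = threading.Event()
    stream = iter(["Sure", ".", " Moving", " to", " the", " dock", "."])
    print(speak(stream, print, flag))
```

Because the token stream is consumed lazily, the first sentence can be spoken before the model has finished generating, and once the flag is set no further tokens are pulled. The flag here is only checked between chunks. A real runtime would check it at audio-buffer granularity so playback stops mid-sentence, which is much easier when VAD, generation, and playback share one process.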

We're building this in the open and will publish numbers as we go — no vaporware benchmarks here.

Cloud-first vs on-device: the tradeoff that decided it for us

	Cloud-first (Deepgram + OpenAI + ElevenLabs)	On-device (where we're headed)
Latency floor	3 network round-trips, vendor-dependent	No network in the critical path
Works offline	No	Yes
Data leaves device	Yes (audio + text to 3 vendors)	No
Cost model	Per-minute / per-token / per-char, forever	Compute you already own
Control of the loop	Rented across 3 vendors	Yours
Time to prototype	Fast ✅	Slower, but it's the foundation

Cloud-first wins the prototype. On-device wins the product. For physical AI, only one of those matters.

Measure it yourself

If you're weighing the same move, benchmark it honestly on your target hardware (e.g. a Jetson Orin):

Measure time to first audio (mic-stops to first-audio-out) on your cloud stack across 50 utterances, then again with the network throttled or dropped.
Track end-to-end latency, cost per conversation, and barge-in stop latency.
Repeat with STT, the LLM, and TTS pulled on-device.
Compare the distributions, not just averages — voice quality lives in the slow tail.
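For the last step, a small helper like the one below is enough to compare tails instead of means. It is a minimal sketch using the nearest-rank percentile definition. The function names are ours, and the latency samples are assumed to be in milliseconds, collected however your harness records them.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Report the median alongside the slow tail."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Replace with your own per-utterance latencies in milliseconds.
    latencies_ms = [float(x) for x in range(50, 0, -1)]
    print(summarize(latencies_ms))
```

With only 50 utterances, p99 is just the single worst sample, so treat it as a smoke test and collect more runs before drawing conclusions about the tail.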

Following along

We're documenting the rebuild from cloud-first to fully on-device as we go. If you're putting voice on a robot, vehicle, or edge device and recognize this pain, try EdgeAI for free or reach out. We'd love to compare notes.