API Reference
API documentation for EdgeAI TTS and STT servers.
TTS Server
The EdgeAI TTS server provides real-time text-to-speech synthesis over HTTP. Send text, get audio played through local speakers or returned as raw audio data. Supports multiple voices, streaming input, and sub-100ms time-to-first-byte.
Authentication
Most TTS endpoints require a Bearer token. Include your API key in the Authorization header. Keys must have the tts scope.
curl -X POST http://localhost:9999/ \
-H "Authorization: Bearer re_live_your_api_key" \
-d "Hello from EdgeAI"Public endpoints (no auth required): GET /voices and GET /metrics. All other endpoints require authentication.
Base URL
http://localhost:9999
The default port is 9999. The server listens on 127.0.0.1 (localhost only).
POST / Auth Required
Synthesize text and play it through the device speakers immediately. Text is queued and played in order.
Request
| Content-Type | text/plain |
| Body | Plain text string to speak |
| Query params | speaker_id, length_scale, noise_scale, noise_w_scale (see Query Parameters) |
Response
200 OK. Returns OK as plain text once the text is queued for synthesis.
# Speak a sentence through device speakers
curl -X POST "http://localhost:9999/" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "Welcome to EdgeAI. On-device AI, no cloud required."
# With custom speed (length_scale=1.2, 20% slower)
curl -X POST "http://localhost:9999/?length_scale=1.2" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "This will be spoken more slowly."POST /stream Auth Required
Buffer text chunks and speak them as complete sentences. Audio plays when sentence-ending punctuation is detected (. ! ? : or newline). Ideal for streaming LLM output token-by-token.
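For illustration, the buffering rule can be sketched client-side in Python. This is a sketch of the documented behavior, not the server's actual code; `feed` and `SENTENCE_END` are our names, and the punctuation set matches the triggers listed above.

```python
# Sentence-boundary buffering as /stream performs it, reimplemented
# client-side for illustration. Chunks accumulate until one of the
# documented sentence-ending characters (. ! ? : or newline) arrives.
SENTENCE_END = set(".!?:\n")

def feed(buffer: list, chunk: str) -> list:
    """Append a chunk; return any complete sentences ready to speak."""
    buffer.append(chunk)
    text = "".join(buffer)
    # Find the last sentence-ending character, if any
    cut = max(text.rfind(c) for c in SENTENCE_END)
    if cut == -1:
        return []                       # nothing complete yet, keep buffering
    ready, rest = text[:cut + 1], text[cut + 1:]
    buffer.clear()
    buffer.append(rest)                 # keep the unfinished tail
    return [ready]
```

Feeding `"The weather today is"` returns nothing; feeding `" sunny and warm."` completes the sentence and releases the whole buffered text, mirroring the curl example below.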
Request
| Content-Type | text/plain |
| Body | Text chunk (partial sentence, word, or token) |
| Query params | Same as POST / |
Response
200 OK. Returns Buffered as plain text.
# Simulate streaming LLM output
curl -X POST "http://localhost:9999/stream" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "The weather today is"
curl -X POST "http://localhost:9999/stream" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d " sunny and warm." # <-- Sentence ends, audio plays
# Flush any remaining buffered text
curl -X POST "http://localhost:9999/flush" \
-H "Authorization: Bearer YOUR_API_KEY"POST /flush Auth Required
Speak any text remaining in the stream buffer. Call this after your last /stream request to ensure trailing text that doesn't end with punctuation is spoken.
Response
200 OK. Returns OK as plain text.
curl -X POST "http://localhost:9999/flush" \
-H "Authorization: Bearer YOUR_API_KEY"POST /cancel Auth Required
Immediately stop audio playback and clear all queued and buffered text. Use this to interrupt speech when the user starts talking (barge-in).
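A webhook consumer can implement barge-in by mapping STT events to TTS control calls. A minimal sketch, assuming the TTS server's default port 9999; the helper name `action_for` is ours, not part of the EdgeAI API:

```python
# Barge-in sketch: when the STT server reports speech_start, the right
# move is to POST /cancel so playback stops and queued text is dropped.
from typing import Optional

def action_for(event: dict) -> Optional[str]:
    """Return the TTS endpoint to call for an STT webhook event, if any."""
    if event.get("event") == "speech_start":
        # The user started talking: interrupt speech immediately
        return "http://localhost:9999/cancel"
    return None

# In a webhook handler you would then do, roughly:
#   url = action_for(payload)
#   if url:
#       requests.post(url, headers={"Authorization": "Bearer YOUR_API_KEY"})
```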
Response
200 OK. Returns Cancelled as plain text.
curl -X POST "http://localhost:9999/cancel" \
-H "Authorization: Bearer YOUR_API_KEY"POST /synthesize Auth Required
Synthesize text and return raw audio data in the response body. Does not play through speakers — use this to capture audio for recording, streaming to a client, or further processing.
Request
| Content-Type | text/plain |
| Body | Plain text string to synthesize |
| Query params | Same as POST / |
Response
200 OK
| Content-Type | audio/x-raw |
| Format | 32-bit float, little-endian (f32le), mono |
| Sample Rate | 22,050 Hz (Piper) or 24,000 Hz (Kokoro) |
# Save raw audio to file
curl -X POST "http://localhost:9999/synthesize" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "Hello from EdgeAI" -o output.raw
# Convert to WAV with ffmpeg (Piper voices, 22050 Hz)
ffmpeg -f f32le -ar 22050 -ac 1 -i output.raw output.wav
# Convert to WAV (Kokoro voices, 24000 Hz)
ffmpeg -f f32le -ar 24000 -ac 1 -i output.raw output.wav
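If ffmpeg is not available, the same conversion can be done with the Python standard library. A sketch, not part of EdgeAI: the function name `f32le_to_wav` is ours, and `array("f")` assumes a little-endian host (which is where the f32le byte order matches the native float layout).

```python
# Convert the raw f32le output of /synthesize to a 16-bit PCM WAV
# using only the Python standard library. Pass 24_000 for Kokoro voices.
import array
import io
import wave

def f32le_to_wav(raw: bytes, sample_rate: int = 22_050) -> bytes:
    samples = array.array("f")          # 32-bit native floats
    samples.frombytes(raw)
    # Clamp to [-1, 1] and scale to 16-bit signed integers
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)               # mono, as documented
        w.setsampwidth(2)               # 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())
    return buf.getvalue()
```

Usage: `open("output.wav", "wb").write(f32le_to_wav(open("output.raw", "rb").read()))`.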
# Pipe directly to playback (macOS)
curl -s -X POST "http://localhost:9999/synthesize" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "Streaming audio" | play -t f32 -r 22050 -c 1 -GET /voicesPublic
List all available voice models installed on the server. Returns a JSON array of voice names that can be used with POST /loadVoice.
Response
200 OK. JSON array of voice name strings, sorted alphabetically.
curl http://localhost:9999/voices
# Example response:
# ["en_US-hannah-medium","en_US-ryan-high"]POST /loadVoice Auth Required
Hot-swap the active voice model at runtime. The new voice is loaded asynchronously — subsequent synthesis requests will use it once loading completes. Use GET /voices to see available options.
Request
| Body | Voice name (no path, no .onnx extension) |
Response
200 OK. Returns Voice loading: <voice_name> as plain text.
# Switch to a different voice
curl -X POST "http://localhost:9999/loadVoice" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "en_US-hannah-medium"
# Verify the voice list
curl http://localhost:9999/voices
GET /metrics Public
Retrieve performance metrics for monitoring. Includes per-request timings and aggregate statistics.
curl http://localhost:9999/metrics | jq .
# Example response:
{
  "last_request": {
    "ttfb_ms": 85.3,
    "synthesis_ms": 420.1,
    "audio_ms": 1200.0,
    "rtf": 0.35,
    "character_count": 42,
    "cancelled": false
  },
  "aggregate": {
    "total_requests": 157,
    "total_cancelled": 2,
    "total_characters": 8432,
    "avg_ttfb_ms": 92.5,
    "avg_rtf": 0.38
  }
}
Fields
| Field | Description |
|---|---|
| ttfb_ms | Time to first audio byte (milliseconds) |
| synthesis_ms | Total synthesis time (milliseconds) |
| audio_ms | Duration of generated audio (milliseconds) |
| rtf | Real-time factor (synthesis_ms / audio_ms). Below 1.0 = faster than real-time |
| character_count | Number of characters in the request |
| cancelled | Whether the request was interrupted via /cancel |
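As a quick sanity check, the rtf definition can be verified against the example response above:

```python
# rtf = synthesis_ms / audio_ms, per the field definitions.
# Values taken from the example /metrics response.
synthesis_ms = 420.1
audio_ms = 1200.0
rtf = synthesis_ms / audio_ms
print(round(rtf, 2))  # 0.35 — matches last_request.rtf; below 1.0 means faster than real-time
```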
Query Parameters
These optional query parameters can be appended to POST /, POST /stream, and POST /synthesize.
| Parameter | Type | Default | Description |
|---|---|---|---|
| speaker_id | int | 0 | Speaker index for multi-speaker models |
| length_scale | float | 1.0 | Speed control. 0.5 = 2x faster, 2.0 = 2x slower |
| noise_scale | float | 0.667 | Synthesis noise level (Piper only) |
| noise_w_scale | float | 0.8 | Phoneme length variation (Piper only) |
# Speak faster with a specific speaker
curl -X POST "http://localhost:9999/?speaker_id=3&length_scale=0.8" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "This is fast speech from speaker three."Errors & Status Codes
| Code | Meaning | Body |
|---|---|---|
| 200 | Success | Varies by endpoint |
| 400 | Bad Request | {"error":"Voice name required"} |
| 401 | Unauthorized | {"error":"missing_authorization","detail":"Bearer token required"} |
| 403 | Forbidden | {"error":"insufficient_scope","detail":"Key does not have tts scope"} |
| 429 | Rate Limited | {"error":"rate_limit_exceeded"} |
| 500 | Server Error | Synthesis failed |
CLI Reference
Command-line arguments for starting the TTS server.
Usage
tts_server [voices_dir] [default_voice] [espeak_data] [port] [--api-key <key>]
All positional arguments are optional when running from the standard install layout.
Arguments
| Argument | Required | Description |
|---|---|---|
| [voices_dir] | no | Directory containing .onnx voice models. Auto-detected at <exe>/../models/tts/ |
| [default_voice] | no | Voice name (without .onnx). Auto-selects en_US-ryan-high if present, otherwise first available |
| [espeak_data] | no | Path to espeak-ng-data. Auto-detected next to executable |
| [port] | no | HTTP server port (default: 9999) |
| --api-key <key> | no | API key for device authorization. Required on first run; cached locally for subsequent runs. |
CLI Usage
# Zero-arg start (auto-detects paths from install layout)
~/.edgeai/tts/tts_server --api-key YOUR_API_KEY
# Or override specific arguments
~/.edgeai/tts/tts_server /path/to/voices en_US-ryan-high 9999 --api-key YOUR_API_KEY
# Or use the launcher (reads API key from ~/.edgeai/device_key)
~/.edgeai/bin/edgeai-tts
STT Server
The EdgeAI STT server provides real-time speech-to-text transcription from a local microphone. Control listening via HTTP, receive transcription events via webhooks. Powered by a quantized Whisper model with voice activity detection (VAD).
Authentication
STT endpoints require a Bearer token with stt scope. Include your API key in the Authorization header.
curl -X POST http://localhost:8888/start \
-H "Authorization: Bearer re_live_your_api_key"Base URL
http://localhost:8888
The port is configured with the -p / --port flag. The server listens on 127.0.0.1 (localhost only).
POST /start Auth Required
Begin listening for speech. The server starts capturing audio from the microphone, applies VAD to detect speech segments, and transcribes them. Results are delivered asynchronously via webhook events.
Response
200OK{"status": "listening"}# Start listening
curl -X POST "http://localhost:8888/start" \
-H "Authorization: Bearer YOUR_API_KEY"
# Response: {"status":"listening"}POST /stop Auth Required
Stop listening and finalize any in-progress transcription. The server stops capturing audio and processes any remaining speech buffer. A final transcription event is delivered via webhook if there is pending audio.
Response
200OK{"status": "processing"}# Stop listening
curl -X POST "http://localhost:8888/stop" \
-H "Authorization: Bearer YOUR_API_KEY"
# Response: {"status":"processing"}Webhook Events
Transcription results and lifecycle events are delivered asynchronously to a webhook URL configured with the -e / --endpoint flag. The server sends HTTP POST requests with a JSON payload.
Webhook Request
| Method | POST |
| Content-Type | application/json |
| User-Agent | EdgeAI-Voice/0.1 |
Payload Format
{
  "event": "transcription",
  "timestamp": 1708934523000,
  "text": "Hello, how can I help you today?"
}
Event Types
| Event | Text | Description |
|---|---|---|
| speech_start | empty | User started speaking (VAD detected voice) |
| partial | partial text | Interim transcription result (may change) |
| transcription | final text | Final transcription of a speech segment |
| speech_end | empty | User stopped speaking (silence detected) |
| cancelled | empty | Transcription was cancelled (e.g. via /stop) |
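A typical consumer treats partial events as overwriting an interim display line and transcription as committing it. A minimal sketch; `Display` is a hypothetical stand-in for whatever UI or state you are updating:

```python
# Handling partial vs. final events: partials replace the live line
# (they may change), transcription commits it as a finished utterance.
class Display:
    def __init__(self):
        self.live = ""          # current interim text (may still change)
        self.committed = []     # finalized utterances

    def handle(self, event: dict) -> None:
        kind = event["event"]
        if kind == "partial":
            self.live = event["text"]        # replace, don't append
        elif kind == "transcription":
            self.committed.append(event["text"])
            self.live = ""                   # segment finished
        elif kind in ("speech_end", "cancelled"):
            self.live = ""                   # nothing pending anymore
```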
Example: Receiving Webhooks
# 1. Start a simple webhook receiver (in another terminal)
python3 -c "
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
class Handler(BaseHTTPRequestHandler):
def do_POST(self):
body = json.loads(self.rfile.read(int(self.headers['Content-Length'])))
print(f"[{body['event']}] {body.get('text', '')}")
self.send_response(200)
self.end_headers()
HTTPServer(('127.0.0.1', 8080), Handler).serve_forever()
"
# 2. Start the STT server with webhook
~/.edgeai/bin/edgeai-stt --endpoint http://127.0.0.1:8080/events
# 3. Start listening
curl -X POST http://localhost:8888/start \
-H "Authorization: Bearer YOUR_API_KEY"
# 4. Speak into your microphone... events arrive at your webhook:
# [speech_start]
# [partial] hello
# [partial] hello how are
# [transcription] Hello, how are you?
# [speech_end]
Errors & Status Codes
| Code | Meaning | Body |
|---|---|---|
| 200 | Success | {"status":"listening"} or {"status":"processing"} |
| 401 | Unauthorized | {"error":"missing_authorization","detail":"Bearer token required"} |
| 403 | Forbidden | {"error":"insufficient_scope","detail":"Key does not have stt scope"} |
| 429 | Rate Limited | {"error":"rate_limit_exceeded"} |
Operating Modes
The STT server supports three operating modes.
HTTP-Controlled Mode
Set a port with -p 8888. The server starts idle and waits for /start and /stop commands. Transcription events are printed to stdout. Best for push-to-talk UIs where your application polls or reads console output.
edgeai-stt --port 8888
Continuous Mode
Omit the -p flag. The server immediately begins listening and transcribing continuously. No HTTP control available. Best for always-on voice input applications like kiosks or accessibility tools.
edgeai-stt
Agent-in-the-Loop Mode
Add -e to forward transcription events to an agent or application via webhook. Combine with -p for HTTP-controlled agent workflows (the agent calls /start and /stop to manage turn-taking), or omit -p for continuous hands-free agent input. This is how the EdgeAI Agents SDK voice agent connects to the STT server.
# Agent-controlled: agent calls /start and /stop to manage turns
edgeai-stt --port 8888 \
--endpoint http://127.0.0.1:8080/agent
# Continuous + agent: always listening, agent receives all speech
edgeai-stt --endpoint http://127.0.0.1:8080/agentCLI Reference
Command-line flags for configuring the STT server.
| Flag | Default | Description |
|---|---|---|
| -m, --model <path> | required | Path to STT model file (Whisper GGML) |
| -v, --vad-model <path> | none | Path to VAD model (recommended for accuracy) |
| -e, --endpoint <url> | none | Webhook URL for transcription events |
| -p, --port <num> | none | HTTP control port (omit for continuous mode) |
| -l, --language <code> | auto | Language code (e.g. en, fr, de) |
| -t, --threads <num> | auto | Number of CPU threads for inference |
| -d, --device <idx> | -1 | Audio input device index (-1 = auto-detect) |
| -g, --gpu | off | Enable GPU acceleration (Metal on macOS, CUDA on Linux) |
| -k, --api-key <key> | none | API key for device authorization |
CLI Usage
# Start STT with all options
~/.edgeai/stt/bin/stt_server \
--model ~/.edgeai/models/stt/edgeai-small-q8_0.bin \
--vad-model ~/.edgeai/models/stt/edgeai-vad-v0.0.1.bin \
--language en \
--port 8888 \
--endpoint http://127.0.0.1:8080/agent \
--gpu \
--api-key YOUR_API_KEY
# Or use the launcher (reads config from ~/.edgeai/)
~/.edgeai/bin/edgeai-stt
Examples
Real-world integration patterns using Python and FastAPI.
STT: Webhook Receiver
Receive transcription events from the STT server via webhooks. When a final transcription arrives, forward it to an LLM for a response.
from fastapi import FastAPI
from openai import OpenAI
import os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] == "transcription":
        print("User said:", event["text"])
        response = llm.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": event["text"]}],
        )
        print("LLM:", response.choices[0].message.content)
    return {"ok": True}
STT: Push to Talk
A push-to-talk pattern where POST /start begins listening on a button press, and POST /stop is sent automatically when the user stops speaking.
from fastapi import FastAPI
from openai import OpenAI
import requests, os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

@app.post("/push-to-talk")
def push_to_talk():
    # Start listening when the button is pressed
    requests.post("http://localhost:8888/start", headers=HEADERS)
    return {"status": "listening"}

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] == "speech_end":
        # Stop listening when the user stops speaking
        requests.post("http://localhost:8888/stop", headers=HEADERS)
    if event["event"] == "transcription":
        print("User said:", event["text"])
        response = llm.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": event["text"]}],
        )
        print("LLM:", response.choices[0].message.content)
    return {"ok": True}
TTS: Stream LLM Responses to Speech
Stream tokens from an LLM directly to the TTS server for real-time speech synthesis. Each token is sent to /stream as it arrives, and /flush is called at the end to play any remaining buffered audio.
from openai import OpenAI
import requests, os

llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

# Stream LLM response token-by-token to the TTS server
stream = llm.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph"}],
    stream=True,
)
for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        # /stream expects a plain-text body, not JSON
        requests.post("http://localhost:9999/stream", data=token.encode("utf-8"), headers=HEADERS)

# Signal end-of-stream so TTS flushes remaining audio
requests.post("http://localhost:9999/flush", headers=HEADERS)
End-to-End Voice Agent
A complete voice agent that receives STT transcriptions, sends them to an LLM, and streams the response to TTS — all in one FastAPI app.
from fastapi import FastAPI
from openai import OpenAI
import requests, os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] != "transcription":
        return {"ok": True}
    # Send transcription to LLM and stream the response to TTS
    stream = llm.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": event["text"]}],
        stream=True,
    )
    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            # /stream expects a plain-text body, not JSON
            requests.post("http://localhost:9999/stream", data=token.encode("utf-8"), headers=HEADERS)
    requests.post("http://localhost:9999/flush", headers=HEADERS)
    return {"ok": True}