API Reference
API documentation for EdgeAI TTS and STT servers.
TTS Server
The EdgeAI TTS server provides real-time text-to-speech synthesis over HTTP. Send text, get audio played through local speakers or returned as raw audio data. Supports multiple voices, streaming input, and sub-100ms time-to-first-byte.
Authentication
Most TTS endpoints require a Bearer token. Include your API key in the Authorization header. Keys must have the tts scope.
curl -X POST http://localhost:9999/ \
-H "Authorization: Bearer re_live_your_api_key" \
-d "Hello from EdgeAI"Public endpoints (no auth required): GET /voices and GET /metrics. All other endpoints require authentication.
Base URL
http://localhost:9999
The default port is 9999. The server listens on 127.0.0.1 (localhost only).
POST / Auth Required
Synthesize text and play it through the device speakers immediately. Text is queued and played in order.
Request
| Content-Type | text/plain |
| Body | Plain text string to speak |
| Query params | speaker_id, length_scale, noise_scale, noise_w_scale (see Query Parameters) |
Response
200 OK. Returns OK as plain text once the text is queued for synthesis.
# Speak a sentence through device speakers
curl -X POST "http://localhost:9999/" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "Welcome to EdgeAI. On-device AI, no cloud required."
# With custom speed (length_scale=1.2, 20% slower)
curl -X POST "http://localhost:9999/?length_scale=1.2" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "This will be spoken more slowly."POST /stream Auth Required
Buffer text chunks and speak them as complete sentences. Audio plays when sentence-ending punctuation is detected (. ! ? : or newline). Ideal for streaming LLM output token-by-token.
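For illustration, the buffering rule can be sketched client-side in Python. This is a sketch of the documented behavior, not the server's actual code; `feed` and `SENTENCE_END` are our names, and the punctuation set matches the triggers listed above.

```python
# Sentence-boundary buffering as /stream performs it, reimplemented
# client-side for illustration. Chunks accumulate until one of the
# documented sentence-ending characters (. ! ? : or newline) arrives.
SENTENCE_END = set(".!?:\n")

def feed(buffer: list, chunk: str) -> list:
    """Append a chunk; return any complete sentences ready to speak."""
    buffer.append(chunk)
    text = "".join(buffer)
    # Find the last sentence-ending character, if any
    cut = max(text.rfind(c) for c in SENTENCE_END)
    if cut == -1:
        return []                       # nothing complete yet, keep buffering
    ready, rest = text[:cut + 1], text[cut + 1:]
    buffer.clear()
    buffer.append(rest)                 # keep the unfinished tail
    return [ready]
```

Feeding `"The weather today is"` returns nothing; feeding `" sunny and warm."` completes the sentence and releases the whole buffered text, mirroring the curl example below.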
Request
| Content-Type | text/plain |
| Body | Text chunk (partial sentence, word, or token) |
| Query params | Same as POST / |
Response
200 OK. Returns Buffered as plain text.
# Simulate streaming LLM output
curl -X POST "http://localhost:9999/stream" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "The weather today is"
curl -X POST "http://localhost:9999/stream" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d " sunny and warm." # <-- Sentence ends, audio plays
# Flush any remaining buffered text
curl -X POST "http://localhost:9999/flush" \
-H "Authorization: Bearer YOUR_API_KEY"POST /flush Auth Required
Speak any text remaining in the stream buffer. Call this after your last /stream request to ensure trailing text that doesn't end with punctuation is spoken.
Response
200 OK. Returns OK as plain text.
curl -X POST "http://localhost:9999/flush" \
-H "Authorization: Bearer YOUR_API_KEY"POST /cancel Auth Required
Immediately stop audio playback and clear all queued and buffered text. Use this to interrupt speech when the user starts talking (barge-in).
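A webhook consumer can implement barge-in by mapping STT events to TTS control calls. A minimal sketch, assuming the TTS server's default port 9999; the helper name `action_for` is ours, not part of the EdgeAI API:

```python
# Barge-in sketch: when the STT server reports speech_start, the right
# move is to POST /cancel so playback stops and queued text is dropped.
from typing import Optional

def action_for(event: dict) -> Optional[str]:
    """Return the TTS endpoint to call for an STT webhook event, if any."""
    if event.get("event") == "speech_start":
        # The user started talking: interrupt speech immediately
        return "http://localhost:9999/cancel"
    return None

# In a webhook handler you would then do, roughly:
#   url = action_for(payload)
#   if url:
#       requests.post(url, headers={"Authorization": "Bearer YOUR_API_KEY"})
```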
Response
200 OK. Returns Cancelled as plain text.
curl -X POST "http://localhost:9999/cancel" \
-H "Authorization: Bearer YOUR_API_KEY"POST /synthesize Auth Required
Synthesize text and return raw audio data in the response body. Does not play through speakers — use this to capture audio for recording, streaming to a client, or further processing.
Request
| Content-Type | text/plain |
| Body | Plain text string to synthesize |
| Query params | Same as POST / |
Response
200 OK
| Content-Type | audio/x-raw |
| Format | 32-bit float, little-endian (f32le), mono |
| Sample Rate | 22,050 Hz (Piper) or 24,000 Hz (Kokoro) |
# Save raw audio to file
curl -X POST "http://localhost:9999/synthesize" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "Hello from EdgeAI" -o output.raw
# Convert to WAV with ffmpeg (Piper voices, 22050 Hz)
ffmpeg -f f32le -ar 22050 -ac 1 -i output.raw output.wav
# Convert to WAV (Kokoro voices, 24000 Hz)
ffmpeg -f f32le -ar 24000 -ac 1 -i output.raw output.wav
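If ffmpeg is not available, the same conversion can be done with the Python standard library. A sketch, not part of EdgeAI: the function name `f32le_to_wav` is ours, and `array("f")` assumes a little-endian host (which is where the f32le byte order matches the native float layout).

```python
# Convert the raw f32le output of /synthesize to a 16-bit PCM WAV
# using only the Python standard library. Pass 24_000 for Kokoro voices.
import array
import io
import wave

def f32le_to_wav(raw: bytes, sample_rate: int = 22_050) -> bytes:
    samples = array.array("f")          # 32-bit native floats
    samples.frombytes(raw)
    # Clamp to [-1, 1] and scale to 16-bit signed integers
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)               # mono, as documented
        w.setsampwidth(2)               # 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())
    return buf.getvalue()
```

Usage: `open("output.wav", "wb").write(f32le_to_wav(open("output.raw", "rb").read()))`.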
# Pipe directly to playback (macOS)
curl -s -X POST "http://localhost:9999/synthesize" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "Streaming audio" | play -t f32 -r 22050 -c 1 -GET /voicesPublic
List all available voice models installed on the server. Returns a JSON array of voice names that can be used with POST /loadVoice.
Response
200 OK. JSON array of voice name strings, sorted alphabetically.
curl http://localhost:9999/voices
# Example response:
# ["en_US-hannah-medium","en_US-ryan-high"]POST /loadVoice Auth Required
Hot-swap the active voice model at runtime. The new voice is loaded asynchronously — subsequent synthesis requests will use it once loading completes. Use GET /voices to see available options.
Request
| Body | Voice name (no path, no .onnx extension) |
Response
200 OK. Returns Voice loading: <voice_name> as plain text.
# Switch to a different voice
curl -X POST "http://localhost:9999/loadVoice" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "en_US-hannah-medium"
# Verify the voice list
curl http://localhost:9999/voices
GET /metrics Public
Retrieve performance metrics for monitoring. Includes per-request timings and aggregate statistics.
curl http://localhost:9999/metrics | jq .
# Example response:
{
  "last_request": {
    "ttfb_ms": 85.3,
    "synthesis_ms": 420.1,
    "audio_ms": 1200.0,
    "rtf": 0.35,
    "character_count": 42,
    "cancelled": false
  },
  "aggregate": {
    "total_requests": 157,
    "total_cancelled": 2,
    "total_characters": 8432,
    "avg_ttfb_ms": 92.5,
    "avg_rtf": 0.38
  }
}
Fields
| Field | Description |
|---|---|
| ttfb_ms | Time to first audio byte (milliseconds) |
| synthesis_ms | Total synthesis time (milliseconds) |
| audio_ms | Duration of generated audio (milliseconds) |
| rtf | Real-time factor (synthesis_ms / audio_ms). Below 1.0 = faster than real-time |
| character_count | Number of characters in the request |
| cancelled | Whether the request was interrupted via /cancel |
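As a quick sanity check, the rtf definition can be verified against the example response above:

```python
# rtf = synthesis_ms / audio_ms, per the field definitions.
# Values taken from the example /metrics response.
synthesis_ms = 420.1
audio_ms = 1200.0
rtf = synthesis_ms / audio_ms
print(round(rtf, 2))  # 0.35 — matches last_request.rtf; below 1.0 means faster than real-time
```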
Query Parameters
These optional query parameters can be appended to POST /, POST /stream, and POST /synthesize.
| Parameter | Type | Default | Description |
|---|---|---|---|
| speaker_id | int | 0 | Speaker index for multi-speaker models |
| length_scale | float | 1.0 | Speed control. 0.5 = 2x faster, 2.0 = 2x slower |
| noise_scale | float | 0.667 | Synthesis noise level (Piper only) |
| noise_w_scale | float | 0.8 | Phoneme length variation (Piper only) |
# Speak faster with a specific speaker
curl -X POST "http://localhost:9999/?speaker_id=3&length_scale=0.8" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d "This is fast speech from speaker three."Errors & Status Codes
| Code | Meaning | Body |
|---|---|---|
| 200 | Success | Varies by endpoint |
| 400 | Bad Request | {"error":"Voice name required"} |
| 401 | Unauthorized | {"error":"missing_authorization","detail":"Bearer token required"} |
| 403 | Forbidden | {"error":"insufficient_scope","detail":"Key does not have tts scope"} |
| 429 | Rate Limited | {"error":"rate_limit_exceeded"} |
| 500 | Server Error | Synthesis failed |
CLI Reference
Command-line arguments for starting the TTS server.
Usage
tts_server [voices_dir] [default_voice] [espeak_data] [port] [--api-key <key>]
All positional arguments are optional when running from the standard install layout.
Arguments
| Argument | Required | Description |
|---|---|---|
| [voices_dir] | no | Directory containing .onnx voice models. Auto-detected at <exe>/../models/tts/ |
| [default_voice] | no | Voice name (without .onnx). Auto-selects en_US-ryan-high if present, otherwise first available |
| [espeak_data] | no | Path to espeak-ng-data. Auto-detected next to executable |
| [port] | no | HTTP server port (default: 9999) |
| --api-key <key> | no | API key for device authorization. Required on first run; cached locally for subsequent runs. |
CLI Usage
# Zero-arg start (auto-detects paths from install layout)
~/.edgeai/tts/tts_server --api-key YOUR_API_KEY
# Or override specific arguments
~/.edgeai/tts/tts_server /path/to/voices en_US-ryan-high 9999 --api-key YOUR_API_KEY
# Or use the launcher (reads API key from ~/.edgeai/device_key)
~/.edgeai/bin/edgeai-tts
STT Server
The EdgeAI STT server provides real-time speech-to-text transcription from a local microphone. Control listening via HTTP, receive transcription events via webhooks. Powered by a quantized Whisper model with voice activity detection (VAD).
Authentication
STT endpoints require a Bearer token with stt scope. Include your API key in the Authorization header.
curl -X POST http://localhost:8888/start \
-H "Authorization: Bearer re_live_your_api_key"Base URL
http://localhost:8888
The port is configured with the -p / --port flag. The server listens on 127.0.0.1 (localhost only).
POST /start Auth Required
Begin listening for speech. The server starts capturing audio from the microphone, applies VAD to detect speech segments, and transcribes them. Results are delivered asynchronously via webhook events.
Response
200OK{"status": "listening"}# Start listening
curl -X POST "http://localhost:8888/start" \
-H "Authorization: Bearer YOUR_API_KEY"
# Response: {"status":"listening"}POST /stop Auth Required
Stop listening and finalize any in-progress transcription. The server stops capturing audio and processes any remaining speech buffer. A final transcription event is delivered via webhook if there is pending audio.
Response
200OK{"status": "processing"}# Stop listening
curl -X POST "http://localhost:8888/stop" \
-H "Authorization: Bearer YOUR_API_KEY"
# Response: {"status":"processing"}Webhook Events
Transcription results and lifecycle events are delivered asynchronously to a webhook URL configured with the -e / --endpoint flag. The server sends HTTP POST requests with a JSON payload.
Webhook Request
| Method | POST |
| Content-Type | application/json |
| User-Agent | EdgeAI-Voice/0.1 |
Payload Format
{
  "event": "transcription",
  "timestamp": 1708934523000,
  "text": "Hello, how can I help you today?"
}
Event Types
| Event | Text | Description |
|---|---|---|
| speech_start | empty | User started speaking (VAD detected voice) |
| partial | partial text | Interim transcription result (may change) |
| transcription | final text | Final transcription of a speech segment |
| speech_end | empty | User stopped speaking (silence detected) |
| cancelled | empty | Transcription was cancelled (e.g. via /stop) |
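A typical consumer treats partial events as overwriting an interim display line and transcription as committing it. A minimal sketch; `Display` is a hypothetical stand-in for whatever UI or state you are updating:

```python
# Handling partial vs. final events: partials replace the live line
# (they may change), transcription commits it as a finished utterance.
class Display:
    def __init__(self):
        self.live = ""          # current interim text (may still change)
        self.committed = []     # finalized utterances

    def handle(self, event: dict) -> None:
        kind = event["event"]
        if kind == "partial":
            self.live = event["text"]        # replace, don't append
        elif kind == "transcription":
            self.committed.append(event["text"])
            self.live = ""                   # segment finished
        elif kind in ("speech_end", "cancelled"):
            self.live = ""                   # nothing pending anymore
```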
Example: Receiving Webhooks
# 1. Start a simple webhook receiver (in another terminal)
python3 -c "
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
class Handler(BaseHTTPRequestHandler):
def do_POST(self):
body = json.loads(self.rfile.read(int(self.headers['Content-Length'])))
print(f"[{body['event']}] {body.get('text', '')}")
self.send_response(200)
self.end_headers()
HTTPServer(('127.0.0.1', 8080), Handler).serve_forever()
"
# 2. Start the STT server with webhook
~/.edgeai/bin/edgeai-stt --endpoint http://127.0.0.1:8080/events
# 3. Start listening
curl -X POST http://localhost:8888/start \
-H "Authorization: Bearer YOUR_API_KEY"
# 4. Speak into your microphone... events arrive at your webhook:
# [speech_start]
# [partial] hello
# [partial] hello how are
# [transcription] Hello, how are you?
# [speech_end]
Errors & Status Codes
| Code | Meaning | Body |
|---|---|---|
| 200 | Success | {"status":"listening"} or {"status":"processing"} |
| 401 | Unauthorized | {"error":"missing_authorization","detail":"Bearer token required"} |
| 403 | Forbidden | {"error":"insufficient_scope","detail":"Key does not have stt scope"} |
| 429 | Rate Limited | {"error":"rate_limit_exceeded"} |
Operating Modes
The STT server supports three operating modes.
HTTP-Controlled Mode
Set a port with -p 8888. The server starts idle and waits for /start and /stop commands. Transcription events are printed to stdout. Best for push-to-talk UIs where your application polls or reads console output.
edgeai-stt --port 8888
Continuous Mode
Omit the -p flag. The server immediately begins listening and transcribing continuously. No HTTP control available. Best for always-on voice input applications like kiosks or accessibility tools.
edgeai-stt
Agent-in-the-Loop Mode
Add -e to forward transcription events to an agent or application via webhook. Combine with -p for HTTP-controlled agent workflows (the agent calls /start and /stop to manage turn-taking), or omit -p for continuous hands-free agent input. This is how the EdgeAI Agents SDK voice agent connects to the STT server.
# Agent-controlled: agent calls /start and /stop to manage turns
edgeai-stt --port 8888 \
--endpoint http://127.0.0.1:8080/agent
# Continuous + agent: always listening, agent receives all speech
edgeai-stt --endpoint http://127.0.0.1:8080/agentCLI Reference
Command-line flags for configuring the STT server.
| Flag | Default | Description |
|---|---|---|
| -m, --model <path> | required | Path to STT model file (Whisper GGML) |
| -v, --vad-model <path> | none | Path to VAD model (recommended for accuracy) |
| -e, --endpoint <url> | none | Webhook URL for transcription events |
| -p, --port <num> | none | HTTP control port (omit for continuous mode) |
| -l, --language <code> | auto | Language code (e.g. en, fr, de) |
| -t, --threads <num> | auto | Number of CPU threads for inference |
| -d, --device <idx> | -1 | Audio input device index (-1 = auto-detect) |
| -g, --gpu | off | Enable GPU acceleration (Metal on macOS, CUDA on Linux) |
| -k, --api-key <key> | none | API key for device authorization |
CLI Usage
# Start STT with all options
~/.edgeai/stt/bin/stt_server \
--model ~/.edgeai/models/stt/edgeai-small-q8_0.bin \
--vad-model ~/.edgeai/models/stt/edgeai-vad-v0.0.1.bin \
--language en \
--port 8888 \
--endpoint http://127.0.0.1:8080/agent \
--gpu \
--api-key YOUR_API_KEY
# Or use the launcher (reads config from ~/.edgeai/)
~/.edgeai/bin/edgeai-stt
Examples
Real-world integration patterns using Python and FastAPI.
STT: Webhook Receiver
Receive transcription events from the STT server via webhooks. When a final transcription arrives, forward it to an LLM for a response.
from fastapi import FastAPI
from openai import OpenAI
import os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] == "transcription":
        print("User said:", event["text"])
        response = llm.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": event["text"]}],
        )
        print("LLM:", response.choices[0].message.content)
    return {"ok": True}
STT: Push to Talk
A push-to-talk pattern where POST /start begins listening on a button press, and POST /stop is sent automatically when the user stops speaking.
from fastapi import FastAPI
from openai import OpenAI
import requests, os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

@app.post("/push-to-talk")
def push_to_talk():
    # Start listening when the button is pressed
    requests.post("http://localhost:8888/start", headers=HEADERS)
    return {"status": "listening"}

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] == "speech_end":
        # Stop listening when the user stops speaking
        requests.post("http://localhost:8888/stop", headers=HEADERS)
    if event["event"] == "transcription":
        print("User said:", event["text"])
        response = llm.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": event["text"]}],
        )
        print("LLM:", response.choices[0].message.content)
    return {"ok": True}
TTS: Stream LLM Responses to Speech
Stream tokens from an LLM directly to the TTS server for real-time speech synthesis. Each token is sent to /stream as it arrives, and /flush is called at the end to play any remaining buffered audio.
from openai import OpenAI
import requests, os

llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

# Stream LLM response token-by-token to the TTS server
stream = llm.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph"}],
    stream=True,
)
for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        # /stream expects a plain-text body, not JSON
        requests.post("http://localhost:9999/stream", data=token.encode("utf-8"), headers=HEADERS)

# Signal end-of-stream so TTS flushes remaining audio
requests.post("http://localhost:9999/flush", headers=HEADERS)
End-to-End Voice Agent
A complete voice agent that receives STT transcriptions, sends them to an LLM, and streams the response to TTS — all in one FastAPI app.
from fastapi import FastAPI
from openai import OpenAI
import requests, os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] != "transcription":
        return {"ok": True}
    # Send transcription to LLM and stream the response to TTS
    stream = llm.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": event["text"]}],
        stream=True,
    )
    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            # /stream expects a plain-text body, not JSON
            requests.post("http://localhost:9999/stream", data=token.encode("utf-8"), headers=HEADERS)
    requests.post("http://localhost:9999/flush", headers=HEADERS)
    return {"ok": True}