API Reference

API documentation for EdgeAI TTS and STT servers.

TTS Server

The EdgeAI TTS server provides real-time text-to-speech synthesis over HTTP. Send text, get audio played through local speakers or returned as raw audio data. Supports multiple voices, streaming input, and sub-100ms time-to-first-byte.

Authentication

Most TTS endpoints require a Bearer token. Include your API key in the Authorization header. Keys must have the tts scope.

curl -X POST http://localhost:9999/ \
  -H "Authorization: Bearer re_live_your_api_key" \
  -d "Hello from EdgeAI"

Public endpoints (no auth required): GET /voices and GET /metrics. All other endpoints require authentication.

Base URL

http://localhost:9999

Default port is 9999. The server listens on 127.0.0.1 (localhost only).

POST / Auth Required

Synthesize text and play it through the device speakers immediately. Text is queued and played in order.

Request

Content-Type: text/plain
Body: Plain text string to speak
Query params: speaker_id, length_scale, noise_scale, noise_w_scale (see Query Parameters)

Response

200 OK

Returns OK as plain text once the text is queued for synthesis.

# Speak a sentence through device speakers
curl -X POST "http://localhost:9999/" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "Welcome to EdgeAI. On-device AI, no cloud required."

# With custom speed (1.2x slower)
curl -X POST "http://localhost:9999/?length_scale=1.2" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "This will be spoken more slowly."

POST /stream Auth Required

Buffer text chunks and speak them as complete sentences. Audio plays when sentence-ending punctuation is detected (. ! ? : or newline). Ideal for streaming LLM output token-by-token.

Request

Content-Type: text/plain
Body: Text chunk (partial sentence, word, or token)
Query params: Same as POST /

Response

200 OK

Returns Buffered as plain text.

# Simulate streaming LLM output
curl -X POST "http://localhost:9999/stream" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "The weather today is"

curl -X POST "http://localhost:9999/stream" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d " sunny and warm."   # <-- Sentence ends, audio plays

# Flush any remaining buffered text
curl -X POST "http://localhost:9999/flush" \
  -H "Authorization: Bearer YOUR_API_KEY"
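The server's sentence buffering can be mirrored client-side, which is handy for testing or for batching tokens before sending them. Below is a minimal sketch of the detection logic using the documented terminator set (. ! ? : and newline); it is an illustration of the behavior described above, not the server's actual implementation:

```python
class SentenceBuffer:
    """Accumulate text chunks; emit complete sentences when a terminator appears."""

    TERMINATORS = set(".!?:\n")

    def __init__(self):
        self.buf = ""

    def feed(self, chunk: str) -> list:
        """Add a chunk; return any complete sentences now ready to speak."""
        self.buf += chunk
        sentences = []
        while True:
            idx = next((i for i, ch in enumerate(self.buf)
                        if ch in self.TERMINATORS), None)
            if idx is None:
                break
            sentence = self.buf[:idx + 1].strip()
            self.buf = self.buf[idx + 1:]
            if sentence:
                sentences.append(sentence)
        return sentences

    def flush(self) -> str:
        """Return and clear whatever is left (mirrors POST /flush)."""
        leftover, self.buf = self.buf.strip(), ""
        return leftover
```

Feeding "The weather today is" and then " sunny and warm." produces one complete sentence on the second call, matching the curl sequence above.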

POST /flush Auth Required

Speak any text remaining in the stream buffer. Call this after your last /stream request to ensure trailing text that doesn't end with punctuation is spoken.

Response

200 OK

Returns OK as plain text.

curl -X POST "http://localhost:9999/flush" \
  -H "Authorization: Bearer YOUR_API_KEY"

POST /cancel Auth Required

Immediately stop audio playback and clear all queued and buffered text. Use this to interrupt speech when the user starts talking (barge-in).

Response

200 OK

Returns Cancelled as plain text.

curl -X POST "http://localhost:9999/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY"
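The barge-in wiring can stay small: watch the STT webhook events (documented under Webhook Events on this page) and cancel TTS playback the moment speech starts. A hedged sketch — the default HTTP call is illustrative, and `YOUR_API_KEY` is a placeholder:

```python
import urllib.request

TTS_CANCEL_URL = "http://localhost:9999/cancel"

def _default_cancel():
    # POST /cancel with an empty body to stop playback immediately
    req = urllib.request.Request(
        TTS_CANCEL_URL,
        data=b"",
        method="POST",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
    urllib.request.urlopen(req)

def handle_barge_in(event: dict, cancel=_default_cancel) -> bool:
    """Cancel TTS playback when the STT server reports speech_start."""
    if event.get("event") != "speech_start":
        return False
    cancel()
    return True
```

Passing a custom `cancel` callable makes the logic easy to unit-test without a running server.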

POST /synthesize Auth Required

Synthesize text and return raw audio data in the response body. Does not play through speakers — use this to capture audio for recording, streaming to a client, or further processing.

Request

Content-Type: text/plain
Body: Plain text string to synthesize
Query params: Same as POST /

Response

200 OK
Content-Type: audio/x-raw
Format: 32-bit float, little-endian (f32le), mono
Sample Rate: 22,050 Hz (Piper) or 24,000 Hz (Kokoro)
# Save raw audio to file
curl -X POST "http://localhost:9999/synthesize" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "Hello from EdgeAI" -o output.raw

# Convert to WAV with ffmpeg (Piper voices, 22050 Hz)
ffmpeg -f f32le -ar 22050 -ac 1 -i output.raw output.wav

# Convert to WAV (Kokoro voices, 24000 Hz)
ffmpeg -f f32le -ar 24000 -ac 1 -i output.raw output.wav

# Pipe directly to playback (requires SoX's `play` command)
curl -s -X POST "http://localhost:9999/synthesize" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "Streaming audio" | play -t f32 -r 22050 -c 1 -
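If you'd rather not shell out to ffmpeg, the raw f32le stream converts to a standard 16-bit WAV using only the Python standard library. A sketch — the sample rate must match your voice model, per the table above:

```python
import io
import struct
import wave

def f32le_to_wav(raw: bytes, sample_rate: int = 22050) -> bytes:
    """Convert raw mono f32le samples (as returned by /synthesize) to WAV bytes."""
    count = len(raw) // 4
    samples = struct.unpack("<%df" % count, raw[: count * 4])
    # Clamp to [-1, 1] and scale to 16-bit signed PCM
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return out.getvalue()
```

For example, read `output.raw` from the curl command above and write `f32le_to_wav(raw, 22050)` to an `.wav` file for a Piper voice, or pass 24000 for Kokoro.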

GET /voices Public

List all available voice models installed on the server. Returns a JSON array of voice names that can be used with POST /loadVoice.

Response

200 OK

JSON array of voice name strings, sorted alphabetically.

curl http://localhost:9999/voices

# Example response:
# ["en_US-hannah-medium","en_US-ryan-high"]

POST /loadVoice Auth Required

Hot-swap the active voice model at runtime. The new voice is loaded asynchronously — subsequent synthesis requests will use it once loading completes. Use GET /voices to see available options.

Request

Body: Voice name (no path, no .onnx extension)

Response

200 OK

Returns Voice loading: <voice_name> as plain text.

# Switch to a different voice
curl -X POST "http://localhost:9999/loadVoice" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "en_US-hannah-medium"

# Verify the voice list
curl http://localhost:9999/voices
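Since /loadVoice expects a bare voice name (no path, no .onnx extension), applications that track voices as model file paths may want a small normalization helper before calling the endpoint. A convenience sketch, not part of the server API:

```python
import os

def normalize_voice_name(name: str) -> str:
    """Strip any directory path and .onnx suffix so the name suits POST /loadVoice."""
    base = os.path.basename(name.strip())
    if base.endswith(".onnx"):
        base = base[: -len(".onnx")]
    return base
```

This turns a model path like `/path/to/voices/en_US-hannah-medium.onnx` into the `en_US-hannah-medium` form the endpoint expects.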

GET /metrics Public

Retrieve performance metrics for monitoring. Includes per-request timings and aggregate statistics.

curl http://localhost:9999/metrics | jq .

# Example response:
{
  "last_request": {
    "ttfb_ms": 85.3,
    "synthesis_ms": 420.1,
    "audio_ms": 1200.0,
    "rtf": 0.35,
    "character_count": 42,
    "cancelled": false
  },
  "aggregate": {
    "total_requests": 157,
    "total_cancelled": 2,
    "total_characters": 8432,
    "avg_ttfb_ms": 92.5,
    "avg_rtf": 0.38
  }
}

Fields

Field            Description
ttfb_ms          Time to first audio byte (milliseconds)
synthesis_ms     Total synthesis time (milliseconds)
audio_ms         Duration of generated audio (milliseconds)
rtf              Real-time factor (synthesis_ms / audio_ms). Below 1.0 = faster than real-time
character_count  Number of characters in the request
cancelled        Whether the request was interrupted via /cancel
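The rtf relationship can be checked directly against a /metrics payload, e.g. to alert when synthesis falls behind real time. A small helper — field names come from the example response above; the budget threshold is an assumption you would tune:

```python
def synthesis_keeping_up(metrics: dict, rtf_budget: float = 1.0) -> bool:
    """True when the average real-time factor is inside budget.

    rtf = synthesis_ms / audio_ms, so values below 1.0 mean audio is
    generated faster than it plays back.
    """
    return metrics["aggregate"]["avg_rtf"] < rtf_budget
```

With the example response above (avg_rtf of 0.38), this returns True for the default budget of 1.0.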

Query Parameters

These optional query parameters can be appended to POST /, POST /stream, and POST /synthesize.

Parameter      Type   Default  Description
speaker_id     int    0        Speaker index for multi-speaker models
length_scale   float  1.0      Speed control. 0.5 = 2x faster, 2.0 = 2x slower
noise_scale    float  0.667    Synthesis noise level (Piper only)
noise_w_scale  float  0.8      Phoneme length variation (Piper only)
# Speak faster with a specific speaker
curl -X POST "http://localhost:9999/?speaker_id=3&length_scale=0.8" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "This is fast speech from speaker three."
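When calling these endpoints from Python, a mistyped parameter name may be silently ignored rather than rejected. A sketch of a validated query builder — parameter names and defaults come from the table above, while the validation itself is this example's own addition:

```python
TTS_DEFAULTS = {
    "speaker_id": 0,
    "length_scale": 1.0,
    "noise_scale": 0.667,
    "noise_w_scale": 0.8,
}

def tts_query_params(**overrides) -> dict:
    """Return only the non-default TTS parameters, rejecting unknown names."""
    unknown = set(overrides) - set(TTS_DEFAULTS)
    if unknown:
        raise ValueError("unknown TTS parameter(s): %s" % ", ".join(sorted(unknown)))
    return {k: v for k, v in overrides.items() if v != TTS_DEFAULTS[k]}
```

The returned dict can be passed straight to `requests.post(..., params=...)` so only explicitly changed values appear in the URL.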

Errors & Status Codes

Code  Meaning       Body
200   Success       Varies by endpoint
400   Bad Request   {"error":"Voice name required"}
401   Unauthorized  {"error":"missing_authorization","detail":"Bearer token required"}
403   Forbidden     {"error":"insufficient_scope","detail":"Key does not have tts scope"}
429   Rate Limited  {"error":"rate_limit_exceeded"}
500   Server Error  Synthesis failed

CLI Reference

Command-line arguments for starting the TTS server.

Usage

tts_server [voices_dir] [default_voice] [espeak_data] [port] [--api-key <key>]

All positional arguments are optional when running from the standard install layout.

Arguments

Argument         Required  Description
[voices_dir]     no        Directory containing .onnx voice models. Auto-detected at <exe>/../models/tts/
[default_voice]  no        Voice name (without .onnx). Auto-selects en_US-ryan-high if present, otherwise first available
[espeak_data]    no        Path to espeak-ng-data. Auto-detected next to executable
[port]           no        HTTP server port (default: 9999)
--api-key <key>  no        API key for device authorization. Required on first run; cached locally for subsequent runs.

CLI Usage

# Zero-arg start (auto-detects paths from install layout)
~/.edgeai/tts/tts_server --api-key YOUR_API_KEY

# Or override specific arguments
~/.edgeai/tts/tts_server /path/to/voices en_US-ryan-high 9999 --api-key YOUR_API_KEY

# Or use the launcher (reads API key from ~/.edgeai/device_key)
~/.edgeai/bin/edgeai-tts

STT Server

The EdgeAI STT server provides real-time speech-to-text transcription from a local microphone. Control listening via HTTP, receive transcription events via webhooks. Powered by a quantized Whisper model with voice activity detection (VAD).

Authentication

STT endpoints require a Bearer token with stt scope. Include your API key in the Authorization header.

curl -X POST http://localhost:8888/start \
  -H "Authorization: Bearer re_live_your_api_key"

Base URL

http://localhost:8888

Port is configured with the -p / --port flag. The server listens on 127.0.0.1 (localhost only).

POST /start Auth Required

Begin listening for speech. The server starts capturing audio from the microphone, applies VAD to detect speech segments, and transcribes them. Results are delivered asynchronously via webhook events.

Response

200 OK
{"status": "listening"}
# Start listening
curl -X POST "http://localhost:8888/start" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Response: {"status":"listening"}

POST /stop Auth Required

Stop listening and finalize any in-progress transcription. The server stops capturing audio and processes any remaining speech buffer. A final transcription event is delivered via webhook if there is pending audio.

Response

200 OK
{"status": "processing"}
# Stop listening
curl -X POST "http://localhost:8888/stop" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Response: {"status":"processing"}

Webhook Events

Transcription results and lifecycle events are delivered asynchronously to a webhook URL configured with the -e / --endpoint flag. The server sends HTTP POST requests with a JSON payload.

Webhook Request

Method: POST
Content-Type: application/json
User-Agent: EdgeAI-Voice/0.1

Payload Format

{
  "event": "transcription",
  "timestamp": 1708934523000,
  "text": "Hello, how can I help you today?"
}
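The timestamp appears to be epoch milliseconds, as in the example above. A small parser that splits a webhook payload into typed values — a convenience sketch, not part of the server:

```python
from datetime import datetime, timezone

def parse_webhook_event(payload: dict):
    """Split a webhook payload into (event, UTC datetime, text)."""
    when = datetime.fromtimestamp(payload["timestamp"] / 1000, tz=timezone.utc)
    return payload["event"], when, payload.get("text", "")
```

Events without a text field (such as speech_start and speech_end) come back with an empty string.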

Event Types

Event          Text          Description
speech_start   empty         User started speaking (VAD detected voice)
partial        partial text  Interim transcription result (may change)
transcription  final text    Final transcription of a speech segment
speech_end     empty         User stopped speaking (silence detected)
cancelled      empty         Transcription was cancelled (e.g. via /stop)

Example: Receiving Webhooks

# 1. Start a simple webhook receiver (in another terminal)
python3 -c '
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        print("[{}] {}".format(body["event"], body.get("text", "")))
        self.send_response(200)
        self.end_headers()

HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
'

# 2. Start the STT server with webhook
~/.edgeai/bin/edgeai-stt --endpoint http://127.0.0.1:8080/events

# 3. Start listening
curl -X POST http://localhost:8888/start \
  -H "Authorization: Bearer YOUR_API_KEY"

# 4. Speak into your microphone... events arrive at your webhook:
#   [speech_start]
#   [partial] hello
#   [partial] hello how are
#   [transcription] Hello, how are you?
#   [speech_end]

Errors & Status Codes

Code  Meaning       Body
200   Success       {"status":"listening"} or {"status":"processing"}
401   Unauthorized  {"error":"missing_authorization","detail":"Bearer token required"}
403   Forbidden     {"error":"insufficient_scope","detail":"Key does not have stt scope"}
429   Rate Limited  {"error":"rate_limit_exceeded"}

Operating Modes

The STT server supports three operating modes.

HTTP-Controlled Mode

Set a port with -p 8888. The server starts idle and waits for /start / /stop commands. Transcription events are printed to stdout. Best for push-to-talk UIs where your application polls or reads console output.

edgeai-stt --port 8888

Continuous Mode

Omit the -p flag. The server immediately begins listening and transcribing continuously. No HTTP control available. Best for always-on voice input applications like kiosks or accessibility tools.

edgeai-stt

Agent-in-the-Loop Mode

Add -e to forward transcription events to an agent or application via webhook. Combine with -p for HTTP-controlled agent workflows (the agent calls /start and /stop to manage turn-taking), or omit -p for continuous hands-free agent input. This is how the EdgeAI Agents SDK voice agent connects to the STT server.

# Agent-controlled: agent calls /start and /stop to manage turns
edgeai-stt --port 8888 \
  --endpoint http://127.0.0.1:8080/agent

# Continuous + agent: always listening, agent receives all speech
edgeai-stt --endpoint http://127.0.0.1:8080/agent

CLI Reference

Command-line flags for configuring the STT server.

Flag                    Default   Description
-m, --model <path>      required  Path to STT model file (Whisper GGML)
-v, --vad-model <path>  none      Path to VAD model (recommended for accuracy)
-e, --endpoint <url>    none      Webhook URL for transcription events
-p, --port <num>        none      HTTP control port (omit for continuous mode)
-l, --language <code>   auto      Language code (e.g. en, fr, de)
-t, --threads <num>     auto      Number of CPU threads for inference
-d, --device <idx>      -1        Audio input device index (-1 = auto-detect)
-g, --gpu               off       Enable GPU acceleration (Metal on macOS, CUDA on Linux)
-k, --api-key <key>     none      API key for device authorization

CLI Usage

# Start STT with all options
~/.edgeai/stt/bin/stt_server \
  --model ~/.edgeai/models/stt/edgeai-small-q8_0.bin \
  --vad-model ~/.edgeai/models/stt/edgeai-vad-v0.0.1.bin \
  --language en \
  --port 8888 \
  --endpoint http://127.0.0.1:8080/agent \
  --gpu \
  --api-key YOUR_API_KEY

# Or use the launcher (reads config from ~/.edgeai/)
~/.edgeai/bin/edgeai-stt

Examples

Real-world integration patterns using Python and FastAPI.

STT: Webhook Receiver

Receive transcription events from the STT server via webhooks. When a final transcription arrives, forward it to an LLM for a response.

from fastapi import FastAPI
from openai import OpenAI
import os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] == "transcription":
        print("User said:", event["text"])

        response = llm.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": event["text"]}],
        )
        print("LLM:", response.choices[0].message.content)

    return {"ok": True}

STT: Push to Talk

A push-to-talk pattern where POST /start begins listening on a button press, and POST /stop is sent automatically when the user stops speaking.

from fastapi import FastAPI
from openai import OpenAI
import requests, os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

@app.post("/push-to-talk")
def push_to_talk():
    # Start listening when the button is pressed
    requests.post("http://localhost:8888/start", headers=HEADERS)
    return {"status": "listening"}

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] == "speech_end":
        # Stop listening when the user stops speaking
        requests.post("http://localhost:8888/stop", headers=HEADERS)

    if event["event"] == "transcription":
        print("User said:", event["text"])

        response = llm.chat.completions.create(
            model="openai/gpt-4o-mini",
            messages=[{"role": "user", "content": event["text"]}],
        )
        print("LLM:", response.choices[0].message.content)

    return {"ok": True}

TTS: Stream LLM Responses to Speech

Stream tokens from an LLM directly to the TTS server for real-time speech synthesis. Each token is sent to /stream as it arrives, and /flush is called at the end to play any remaining buffered audio.

from openai import OpenAI
import requests, os

llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

# Stream LLM response token-by-token to the TTS server
stream = llm.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph"}],
    stream=True,
)

for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        requests.post("http://localhost:9999/stream", data=token.encode("utf-8"), headers=HEADERS)

# Signal end-of-stream so TTS flushes remaining audio
requests.post("http://localhost:9999/flush", headers=HEADERS)

End-to-End Voice Agent

A complete voice agent that receives STT transcriptions, sends them to an LLM, and streams the response to TTS — all in one FastAPI app.

from fastapi import FastAPI
from openai import OpenAI
import requests, os

app = FastAPI()
llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
HEADERS = {"Authorization": "Bearer " + os.environ["EDGEAI_API_KEY"]}

@app.post("/webhook")
def webhook(event: dict):
    if event["event"] != "transcription":
        return {"ok": True}

    # Send transcription to LLM and stream the response to TTS
    stream = llm.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": event["text"]}],
        stream=True,
    )

    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            requests.post("http://localhost:9999/stream", data=token.encode("utf-8"), headers=HEADERS)

    requests.post("http://localhost:9999/flush", headers=HEADERS)
    return {"ok": True}