Provider Capabilities
Feature support by provider and tier. A request that uses a modifier the selected provider, tier, or voice does not support is rejected with HTTP 400.
Capability Matrix
| Provider | Tier | SSML | Markup | Speed | Multi-Speaker | Model Selection | Prompt | Bitrate Config |
|---|---|---|---|---|---|---|---|---|
| Google[1] | Premium | Conditional | Conditional | Conditional | No | No | No | Yes |
| Google[2] | Ultra | Yes | No | No | Conditional | Yes | Yes | Yes |
| Polly | Premium | Yes | No | No | No | No | No | No |
| Polly | Ultra | Yes | No | No | No | No | No | No |
| Kokoro | Premium | No | No | Yes | No | No | No | Yes |
[1] SSML, markup, and speed vary by voice family — check GET /api/v1/voices.
[2] Multi-speaker and text limits vary by selected model — see Google Ultra Model Details.
Models By Tier
These are the public voice families you will see in GET /api/v1/voices, grouped by provider and tier.
| Provider | Tier | Models |
|---|---|---|
| Google | Premium | Casual, Chirp-HD, Chirp3-HD, Neural2, News, Polyglot, Standard, Studio, Wavenet |
| Google | Ultra | Gemini 2.5 Flash TTS, Gemini 2.5 Pro TTS, Gemini 2.5 Flash Lite Preview TTS, Gemini 3.1 Flash TTS Preview |
| Polly | Premium | Generative, Neural, Standard |
| Polly | Ultra | Long-Form |
| Kokoro | Premium | Kokoro |
Looking for supported languages and live voice availability? Browse the Voice Library for a visual view, or use GET /api/v1/voices for the full API response. For per-voice capability discovery (streaming, formats, limits, speed, prompt), use GET /api/v1/voices/{voice_id}.
Text And Prompt Limits
The limits below show the maximum accepted text for each provider and tier, plus prompt limits where prompt is supported. Google Ultra limits vary by selected model.
| Provider | Tier | Max Text (bytes) | Streaming Max (bytes) | Prompt Limits (bytes) |
|---|---|---|---|---|
| Google | Premium | 500,000 | 5,000 | Not supported |
| Google | Ultra | Varies by model | Varies by model | Varies by model |
| Polly | Premium | 100,000 | 3,000 | Not supported |
| Polly | Ultra | 100,000 | 3,000 | Not supported |
| Kokoro | Premium | 5,000 | 5,000 | Not supported |
When prompt is used, chars_charged includes both text and prompt bytes.
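The limits above can be checked client-side before a request is sent. The following is a minimal sketch, not the service's own validation: it assumes the limits are counted in UTF-8 encoded bytes, and the helper names (`byte_len`, `fits_limit`, `estimated_chars_charged`) are hypothetical. Google Ultra is left out because its limits depend on the selected model.

```python
# Limits copied from the table above, in bytes.
# Google Ultra is omitted because its limits vary by model.
TEXT_LIMITS = {
    ("google", "premium"): {"max_text": 500_000, "stream_max": 5_000},
    ("polly", "premium"): {"max_text": 100_000, "stream_max": 3_000},
    ("polly", "ultra"): {"max_text": 100_000, "stream_max": 3_000},
    ("kokoro", "premium"): {"max_text": 5_000, "stream_max": 5_000},
}


def byte_len(value: str) -> int:
    """Length in bytes. Assumption: the service counts UTF-8 bytes."""
    return len(value.encode("utf-8"))


def fits_limit(provider: str, tier: str, text: str, streaming: bool = False) -> bool:
    """Return True if the text is within the documented limit."""
    limits = TEXT_LIMITS[(provider, tier)]
    cap = limits["stream_max"] if streaming else limits["max_text"]
    return byte_len(text) <= cap


def estimated_chars_charged(text: str, prompt: str = "") -> int:
    """Estimate chars_charged as text bytes plus prompt bytes."""
    return byte_len(text) + byte_len(prompt)
```

Note that non-ASCII text takes more bytes than characters, so a string that looks short can still exceed a streaming limit.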
Google Ultra Model Details
Google Ultra limits vary by selected model. Use the model field in your request to select a specific model.
| Model | Max Text | Stream Max | Prompt Max | Combined Max | Multi-Speaker |
|---|---|---|---|---|---|
| Flash TTS (gemini-2.5-flash-tts) | 4,000 | 4,000 | 4,000 | 8,000 | Yes |
| Pro TTS (gemini-2.5-pro-tts) | 4,000 | 4,000 | 4,000 | 8,000 | Yes |
| Flash Lite Preview (gemini-2.5-flash-lite-preview-tts) | 300 | 300 | 300 | 300 | No |
| 3.1 Flash TTS Preview (gemini-3.1-flash-tts-preview) | 4,000 | 4,000 | 4,000 | 8,000 | Yes |
Flash Lite Preview is a reduced-capacity preview model best suited for very short text. Use Flash TTS or Pro TTS for longer content.
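As an illustration of how the per-model limits interact, the sketch below checks a planned request against the table above. It is a client-side approximation under stated assumptions: the table does not name its units, so the values are treated as UTF-8 bytes to match the Text And Prompt Limits table, and `ModelLimits` and `check_request` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    max_text: int
    stream_max: int
    prompt_max: int
    combined_max: int
    multi_speaker: bool


# Values copied from the Google Ultra model table above.
GOOGLE_ULTRA_MODELS = {
    "gemini-2.5-flash-tts": ModelLimits(4000, 4000, 4000, 8000, True),
    "gemini-2.5-pro-tts": ModelLimits(4000, 4000, 4000, 8000, True),
    "gemini-2.5-flash-lite-preview-tts": ModelLimits(300, 300, 300, 300, False),
    "gemini-3.1-flash-tts-preview": ModelLimits(4000, 4000, 4000, 8000, True),
}


def check_request(model, text, prompt="", multi_speaker=False, streaming=False):
    """Return a list of problems; an empty list means the request fits."""
    limits = GOOGLE_ULTRA_MODELS[model]
    # Assumption: limits are counted in UTF-8 bytes.
    text_len = len(text.encode("utf-8"))
    prompt_len = len(prompt.encode("utf-8"))
    problems = []
    text_cap = limits.stream_max if streaming else limits.max_text
    if text_len > text_cap:
        problems.append("text exceeds limit")
    if prompt_len > limits.prompt_max:
        problems.append("prompt exceeds limit")
    if text_len + prompt_len > limits.combined_max:
        problems.append("text plus prompt exceeds combined limit")
    if multi_speaker and not limits.multi_speaker:
        problems.append("model does not support multi-speaker")
    return problems
```

For Flash Lite Preview the combined limit equals the text limit, so any prompt reduces the room left for text.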
Voice-Level Capabilities
The matrix above covers provider/tier-level features. For voice-specific capabilities, use GET /api/v1/voices. Each voice includes:
| Field | Description |
|---|---|
| supports_ssml | Whether this voice supports SSML input format |
| supports_markup | Whether this voice supports markup input format |
| supports_multispeaker | Whether this voice supports multi-speaker mode |
| model_type | Voice tier: "premium" or "ultra" |
| language | Voice language code (e.g., en-US) |
| provider | Voice provider (e.g., google, polly, kokoro) |
| available_models | Available ultra model IDs for this voice (ultra voices with model selection only) |
Always check voice-level fields before sending optional modifiers. See GET /api/v1/voices for the full response shape.
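The check described above might look like the following sketch. It takes one voice record, assumed to be a dictionary carrying the fields listed in the table, and reports which requested features the voice lacks. The function name, the feature labels ("ssml", "markup", "multispeaker"), and the sample voice are hypothetical; the real response shape may differ.

```python
# Maps a feature label (hypothetical) to the voice field documented above.
FEATURE_FIELDS = {
    "ssml": "supports_ssml",
    "markup": "supports_markup",
    "multispeaker": "supports_multispeaker",
}


def unsupported_features(voice, wanted, model=None):
    """Return the requested features (and model) this voice does not support."""
    missing = [
        feature
        for feature in sorted(wanted)
        if not voice.get(FEATURE_FIELDS[feature], False)
    ]
    # available_models is only present for ultra voices with model selection.
    if model is not None and model not in (voice.get("available_models") or []):
        missing.append(f"model:{model}")
    return missing


# Illustrative voice record, not real API output.
example_voice = {
    "provider": "google",
    "model_type": "premium",
    "language": "en-US",
    "supports_ssml": True,
    "supports_markup": False,
    "supports_multispeaker": False,
}
print(unsupported_features(example_voice, {"ssml", "markup"}))
```

Running this check before each request avoids a round trip that would end in a 400 response.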
Output Formats
The async API (POST /api/v1/tts) supports three output formats, identical across all providers. Set output_format in your request (default: wav).
| Format | Sample Rates | Bitrate Range | Default Bitrate | Notes |
|---|---|---|---|---|
| wav | 8k–48k | — | — | Uncompressed PCM. Largest files, no bitrate setting. |
| mp3 | 8k–48k | 32–320 kbps | 128 kbps | Widest playback compatibility. |
| ogg_opus | 8k, 12k, 16k, 24k, 48k | 6–320 kbps | 64 kbps | Best quality-to-size ratio. |
Bitrate is configurable for Google and Kokoro (output_bitrate_kbps). Other providers manage bitrate internally. Use sample_rate_hertz to control sample rate for any format.
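The format rules above can be encoded as a small validator that builds the output-related request fields. This is a sketch under assumptions: "8k–48k" is read as an inclusive range in Hz, a bitrate request for a provider that manages bitrate internally is treated as an error, and `build_output_options` is a hypothetical helper. Only the field names `output_format`, `sample_rate_hertz`, and `output_bitrate_kbps` come from the text.

```python
# Sample rates in Hz and bitrate ranges in kbps, from the table above.
# Assumption: "8k-48k" means any rate from 8000 to 48000 Hz inclusive.
OUTPUT_FORMATS = {
    "wav": {"rates": range(8_000, 48_001), "bitrate": None},
    "mp3": {"rates": range(8_000, 48_001), "bitrate": (32, 320)},
    "ogg_opus": {"rates": {8_000, 12_000, 16_000, 24_000, 48_000}, "bitrate": (6, 320)},
}
BITRATE_PROVIDERS = {"google", "kokoro"}


def build_output_options(provider, output_format="wav",
                         sample_rate_hertz=None, output_bitrate_kbps=None):
    """Validate output settings and return the fields to add to a request."""
    spec = OUTPUT_FORMATS.get(output_format)
    if spec is None:
        raise ValueError(f"unknown output_format: {output_format}")
    options = {"output_format": output_format}
    if sample_rate_hertz is not None:
        if sample_rate_hertz not in spec["rates"]:
            raise ValueError(f"{output_format} does not support {sample_rate_hertz} Hz")
        options["sample_rate_hertz"] = sample_rate_hertz
    if output_bitrate_kbps is not None:
        if provider not in BITRATE_PROVIDERS:
            raise ValueError(f"bitrate is not configurable for {provider}")
        if spec["bitrate"] is None:
            raise ValueError(f"{output_format} has no bitrate setting")
        low, high = spec["bitrate"]
        if not low <= output_bitrate_kbps <= high:
            raise ValueError(f"bitrate must be {low}-{high} kbps")
        options["output_bitrate_kbps"] = output_bitrate_kbps
    return options
```

Omitted fields are left out of the result so that the service defaults (wav, 128 kbps for mp3, 64 kbps for ogg_opus) apply.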
Streaming Delivery
Short-form generation supports streaming delivery. In the web app, use Stream Preview in the advanced settings on /tts. For the public API, use POST /api/v1/tts/stream.
Every streaming request creates a normal job with the same billing and storage behavior as async generation. If matching audio is already available, the service returns the completed result immediately instead of opening a new live stream.
Streaming is short-form only, and the maximum text length varies by provider. See the Streaming Max column in the Text And Prompt Limits table.
| Provider | Tier | Family | Streaming | Default Transport | Supported Formats (API v1) | Notes |
|---|---|---|---|---|---|---|
| Google | Premium | Chirp3HD | Yes | ogg_opus | ogg_opus (24k/48k), wav (24k/48k), pcm (24k/48k), mulaw (8k), alaw (8k) | Max speaking rate: 2.0 |
| Google | Premium | ChirpHD | Yes | ogg_opus | ogg_opus (24k/48k), wav (24k/48k), pcm (24k/48k), mulaw (8k), alaw (8k) | Max speaking rate: 2.0 |
| Google | Premium | Wavenet | No | — | — | — |
| Google | Premium | Neural2 | No | — | — | — |
| Google | Premium | Studio | No | — | — | — |
| Google | Premium | Standard | No | — | — | — |
| Google | Premium | Casual | No | — | — | — |
| Google | Premium | Polyglot | No | — | — | — |
| Google | Premium | News | No | — | — | — |
| Google | Ultra | Gemini TTS | Yes | wav | wav (24k), pcm (24k) | Speaking rate not supported |
| Polly | Premium | Standard | Yes | ogg_opus | ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k) | — |
| Polly | Premium | Neural | Yes | ogg_opus | ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k) | — |
| Polly | Premium | Generative | Yes | ogg_opus | ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k) | — |
| Polly | Ultra | Long-Form | Yes | ogg_opus | ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k) | — |
| Kokoro | Premium | Kokoro | Yes | ogg_opus | ogg_opus (24k), mp3 (24k), wav (24k), pcm (24k) | — |
API v1 callers can request any supported format from the matrix above via output_format and sample_rate_hertz. If omitted, the default stream format is used. The /tts web app exposes only the browser-playable subset (ogg_opus, mp3, wav). Bitrate selection is not available in stream mode. The saved artifact uses the same streamed format. See API v1 Streaming for request/response details.
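A client could mirror the streaming matrix to reject unsupported combinations before calling POST /api/v1/tts/stream. The sketch below is illustrative only: it reads "24k" as 24000 Hz and "8k–24k" as an inclusive range, uses the family names exactly as they appear in the matrix, and `stream_options` is a hypothetical helper rather than part of the API.

```python
# Supported streaming formats and sample rates (Hz), copied from the matrix.
# Assumption: "8k-24k" means any rate from 8000 to 24000 Hz inclusive.
CHIRP = {"ogg_opus": {24_000, 48_000}, "wav": {24_000, 48_000},
         "pcm": {24_000, 48_000}, "mulaw": {8_000}, "alaw": {8_000}}
GEMINI = {"wav": {24_000}, "pcm": {24_000}}
POLLY = {"ogg_opus": {48_000}, "mp3": range(8_000, 24_001),
         "wav": {8_000, 16_000}, "ogg_vorbis": range(8_000, 24_001),
         "pcm": {8_000, 16_000}, "mulaw": {8_000}, "alaw": {8_000}}
KOKORO = {"ogg_opus": {24_000}, "mp3": {24_000}, "wav": {24_000}, "pcm": {24_000}}

# Families missing from this map (Wavenet, Neural2, ...) do not stream.
STREAM_FORMATS = {
    ("google", "Chirp3HD"): CHIRP,
    ("google", "ChirpHD"): CHIRP,
    ("google", "Gemini TTS"): GEMINI,
    ("polly", "Standard"): POLLY,
    ("polly", "Neural"): POLLY,
    ("polly", "Generative"): POLLY,
    ("polly", "Long-Form"): POLLY,
    ("kokoro", "Kokoro"): KOKORO,
}


def stream_options(provider, family, output_format=None, sample_rate_hertz=None):
    """Validate a streaming format choice and return the request fields."""
    formats = STREAM_FORMATS.get((provider, family))
    if formats is None:
        raise ValueError(f"{provider}/{family} does not support streaming")
    if output_format is None:
        if sample_rate_hertz is not None:
            # Policy of this sketch only, not a documented API rule.
            raise ValueError("set output_format when choosing a sample rate")
        return {}  # the service falls back to the default stream format
    if output_format not in formats:
        raise ValueError(f"{output_format} is not streamable for {family}")
    options = {"output_format": output_format}
    if sample_rate_hertz is not None:
        if sample_rate_hertz not in formats[output_format]:
            raise ValueError(f"{sample_rate_hertz} Hz is not supported")
        options["sample_rate_hertz"] = sample_rate_hertz
    return options
```

Because bitrate selection is not available in stream mode, the helper deliberately has no bitrate parameter.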
© 2026 AI TTS Microservice. All rights reserved.