Provider Capabilities

Feature support by provider and tier. Requests that include an unsupported modifier are rejected with HTTP 400.

Capability Matrix

Provider	Tier	SSML	Markup	Speed	Multi-Speaker	Model Selection	Prompt	Bitrate Config
Google^[1]	Premium	Conditional	Conditional	Conditional	No	No	No	Yes
Google^[2]	Ultra	Yes	No	No	Conditional	Yes	Yes	Yes
Polly	Premium	Yes	No	No	No	No	No	No
Polly	Ultra	Yes	No	No	No	No	No	No
Kokoro	Premium	No	No	Yes	No	No	No	Yes

^[1] SSML, markup, and speed support vary by voice family; check GET /api/v1/voices.

^[2] Multi-speaker support and text limits vary by selected model; see Google Ultra Model Details.
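Because an unsupported modifier fails with a 400, a client can screen a request against the matrix before sending it. The sketch below transcribes the table into a Python dictionary. The feature keys (ssml, speed, and so on) and the classify_features helper are hypothetical labels for this example, not API field names, and "Conditional" entries still have to be confirmed per voice.

```python
# Provider/tier capability matrix, transcribed from the table above.
# Feature keys are illustrative labels, not request field names.
YES, NO, COND = "yes", "no", "conditional"

_POLLY = {"ssml": YES, "markup": NO, "speed": NO, "multi_speaker": NO,
          "model_selection": NO, "prompt": NO, "bitrate": NO}

CAPABILITIES = {
    ("google", "premium"): {"ssml": COND, "markup": COND, "speed": COND,
                            "multi_speaker": NO, "model_selection": NO,
                            "prompt": NO, "bitrate": YES},
    ("google", "ultra"): {"ssml": YES, "markup": NO, "speed": NO,
                          "multi_speaker": COND, "model_selection": YES,
                          "prompt": YES, "bitrate": YES},
    ("polly", "premium"): _POLLY,
    ("polly", "ultra"): _POLLY,
    ("kokoro", "premium"): {"ssml": NO, "markup": NO, "speed": YES,
                            "multi_speaker": NO, "model_selection": NO,
                            "prompt": NO, "bitrate": YES},
}


def classify_features(provider, tier, features):
    """Split requested features into (blocked, needs_voice_check).

    blocked: the matrix says "No", so the request would return 400.
    needs_voice_check: "Conditional", so confirm via GET /api/v1/voices.
    """
    caps = CAPABILITIES[(provider.lower(), tier.lower())]
    blocked = [f for f in features if caps[f] == NO]
    conditional = [f for f in features if caps[f] == COND]
    return blocked, conditional


blocked, check = classify_features("Google", "Premium", ["ssml", "prompt"])
print(blocked, check)  # ['prompt'] ['ssml']
```

The second list is only a hint: the voice-level fields returned by GET /api/v1/voices remain the authority for conditional features.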

Models By Tier

These are the public voice families you will see in GET /api/v1/voices, grouped by provider and tier.

Provider	Tier	Models
Google	Premium	Casual, Chirp-HD, Chirp3-HD, Neural2, News, Polyglot, Standard, Studio, Wavenet
Google	Ultra	Gemini 2.5 Flash TTS, Gemini 2.5 Pro TTS, Gemini 2.5 Flash Lite Preview TTS, Gemini 3.1 Flash TTS Preview
Polly	Premium	Generative, Neural, Standard
Polly	Ultra	Long-Form
Kokoro	Premium	Kokoro

Looking for supported languages and live voice availability? Browse the Voice Library for a visual view, or use GET /api/v1/voices for the full API response. For per-voice capability discovery (streaming, formats, limits, speed, prompt), use GET /api/v1/voices/{voice_id}.

Text And Prompt Limits

The limits below show the maximum accepted text for each provider and tier, plus prompt limits where prompts are supported. Google Ultra limits vary by selected model.

Provider	Tier	Max Text (bytes)	Streaming Max (bytes)	Prompt Limits (bytes)
Google	Premium	500,000	5,000	Not supported
Google	Ultra	Varies by model	Varies by model	Varies by model
Polly	Premium	100,000	3,000	Not supported
Polly	Ultra	100,000	3,000	Not supported
Kokoro	Premium	5,000	5,000	Not supported

When a prompt is used, chars_charged includes both the text bytes and the prompt bytes.
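Limits are measured in bytes, not characters, so text with multi-byte characters reaches the cap sooner. The sketch below checks text against the non-Ultra rows of the table and estimates chars_charged. It assumes the service counts UTF-8 encoded bytes; check_text and estimated_chars_charged are hypothetical helper names.

```python
# Byte limits from the Text And Prompt Limits table.
# Google Ultra is omitted: its limits depend on the selected model.
LIMITS = {
    ("google", "premium"): {"async": 500_000, "stream": 5_000},
    ("polly", "premium"): {"async": 100_000, "stream": 3_000},
    ("polly", "ultra"): {"async": 100_000, "stream": 3_000},
    ("kokoro", "premium"): {"async": 5_000, "stream": 5_000},
}


def byte_len(s):
    # Assumption: limits and charges count UTF-8 encoded bytes.
    return len(s.encode("utf-8"))


def check_text(provider, tier, text, streaming=False):
    """Return the text size in bytes, or raise if it exceeds the limit."""
    mode = "stream" if streaming else "async"
    limit = LIMITS[(provider, tier)][mode]
    size = byte_len(text)
    if size > limit:
        raise ValueError(
            f"{size} bytes exceeds the {mode} limit of {limit} for {provider}/{tier}")
    return size


def estimated_chars_charged(text, prompt=None):
    # chars_charged covers text bytes plus prompt bytes when a prompt is sent.
    return byte_len(text) + (byte_len(prompt) if prompt else 0)


print(check_text("polly", "premium", "a" * 3_000, streaming=True))  # 3000
print(estimated_chars_charged("héllo", prompt="calm"))  # 10
```

Note that "héllo" counts as 6 bytes, not 5 characters, because "é" takes two bytes in UTF-8.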

Google Ultra Model Details

Google Ultra limits vary by selected model. Use the model field in your request to select a specific model.

Model	Max Text (bytes)	Stream Max (bytes)	Prompt Max (bytes)	Combined Max (bytes)	Multi-Speaker
Flash TTS (gemini-2.5-flash-tts)	4,000	4,000	4,000	8,000	Yes
Pro TTS (gemini-2.5-pro-tts)	4,000	4,000	4,000	8,000	Yes
Flash Lite Preview (gemini-2.5-flash-lite-preview-tts)	300	300	300	300	No
3.1 Flash TTS Preview (gemini-3.1-flash-tts-preview)	4,000	4,000	4,000	8,000	Yes

Flash Lite Preview is a reduced-capacity preview model best suited for very short text. Use Flash TTS or Pro TTS for longer content.
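The Combined Max column means a request can respect the text and prompt caps individually and still be rejected. A minimal sketch of a per-model pre-check, assuming Combined Max applies to text bytes plus prompt bytes and that all limits count UTF-8 bytes; ultra_problems is a hypothetical helper, while the model IDs are the ones listed above.

```python
# Per-model limits from the Google Ultra Model Details table, in bytes.
_STD = {"text": 4_000, "stream": 4_000, "prompt": 4_000, "combined": 8_000,
        "multi_speaker": True}

ULTRA_MODELS = {
    "gemini-2.5-flash-tts": _STD,
    "gemini-2.5-pro-tts": _STD,
    "gemini-3.1-flash-tts-preview": _STD,
    "gemini-2.5-flash-lite-preview-tts": {"text": 300, "stream": 300, "prompt": 300,
                                          "combined": 300, "multi_speaker": False},
}


def ultra_problems(model, text, prompt="", streaming=False, multi_speaker=False):
    """List every limit a Google Ultra request would violate (empty if none)."""
    lim = ULTRA_MODELS[model]
    text_b = len(text.encode("utf-8"))
    prompt_b = len(prompt.encode("utf-8"))
    text_cap = lim["stream"] if streaming else lim["text"]
    problems = []
    if text_b > text_cap:
        problems.append(f"text {text_b} > {text_cap}")
    if prompt_b > lim["prompt"]:
        problems.append(f"prompt {prompt_b} > {lim['prompt']}")
    # Assumption: Combined Max applies to text bytes + prompt bytes.
    if text_b + prompt_b > lim["combined"]:
        problems.append(f"combined {text_b + prompt_b} > {lim['combined']}")
    if multi_speaker and not lim["multi_speaker"]:
        problems.append(f"{model} does not support multi-speaker")
    return problems


print(ultra_problems("gemini-2.5-flash-lite-preview-tts", "a" * 200, prompt="b" * 200))
# ['combined 400 > 300']
```

Collecting every violation instead of stopping at the first makes it easier to decide whether to shorten the text, shorten the prompt, or switch to Flash TTS or Pro TTS.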

Voice-Level Capabilities

The matrix above covers provider/tier-level features. For voice-specific capabilities, use GET /api/v1/voices. Each voice includes:

Field	Description
supports_ssml	Whether this voice supports SSML input format
supports_markup	Whether this voice supports markup input format
supports_multispeaker	Whether this voice supports multi-speaker mode
model_type	Voice tier: "premium" or "ultra"
language	Voice language code (e.g., en-US)
provider	Voice provider (e.g., google, polly, kokoro)
available_models	Available ultra model IDs for this voice (ultra voices with model selection only)

Always check voice-level fields before sending optional modifiers. See GET /api/v1/voices for the full response shape.
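A sketch of that check against a single voice object. The voice field names come from the table above and model is the documented request field; input_format and multi_speaker are hypothetical request options used only for illustration, and the sample voice values are made up.

```python
def voice_problems(voice, input_format="text", multi_speaker=False, model=None):
    """Check optional modifiers against one voice object from GET /api/v1/voices.

    input_format and multi_speaker are hypothetical request options; the
    supports_* and available_models fields follow the documented response.
    """
    problems = []
    if input_format == "ssml" and not voice.get("supports_ssml", False):
        problems.append("voice does not support SSML")
    if input_format == "markup" and not voice.get("supports_markup", False):
        problems.append("voice does not support markup")
    if multi_speaker and not voice.get("supports_multispeaker", False):
        problems.append("voice does not support multi-speaker mode")
    if model is not None and model not in voice.get("available_models", []):
        problems.append(f"model {model!r} is not available for this voice")
    return problems


# Illustrative voice object using the documented field names; values are made up.
voice = {
    "provider": "google",
    "model_type": "ultra",
    "language": "en-US",
    "supports_ssml": True,
    "supports_markup": False,
    "supports_multispeaker": True,
    "available_models": ["gemini-2.5-flash-tts", "gemini-2.5-pro-tts"],
}
print(voice_problems(voice, input_format="markup", model="gemini-2.5-pro-tts"))
# ['voice does not support markup']
```

Missing fields are treated as "not supported", which errs on the side of not sending a modifier the service might reject.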

Output Formats

The async API (POST /api/v1/tts) supports three output formats, identical across all providers. Set output_format in your request (default: wav).

Format	Sample Rates	Bitrate Range	Default Bitrate	Notes
wav	8k–48k	—	—	Uncompressed PCM. Largest files, no bitrate setting.
mp3	8k–48k	32–320 kbps	128 kbps	Widest playback compatibility.
ogg_opus	8k, 12k, 16k, 24k, 48k	6–320 kbps	64 kbps	Best quality-to-size ratio.

Bitrate is configurable only for Google and Kokoro, via output_bitrate_kbps; Polly manages bitrate internally. Bitrate applies to mp3 and ogg_opus only, since wav has no bitrate setting. Use sample_rate_hertz to control the sample rate for any format.
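The format, sample-rate, and bitrate rules above can be combined into one validation step before calling POST /api/v1/tts. A minimal sketch: output_fields is a hypothetical helper, and it assumes any integer rate inside the 8k–48k range is acceptable for wav and mp3, which the service may narrow to specific values.

```python
# Output format rules from the Output Formats table (async API).
FORMATS = {
    "wav": {"rates": range(8_000, 48_001), "bitrate": None},
    "mp3": {"rates": range(8_000, 48_001), "bitrate": (32, 320)},
    "ogg_opus": {"rates": {8_000, 12_000, 16_000, 24_000, 48_000}, "bitrate": (6, 320)},
}
BITRATE_PROVIDERS = {"google", "kokoro"}  # Polly manages bitrate internally


def output_fields(provider, output_format="wav", sample_rate_hertz=None,
                  output_bitrate_kbps=None):
    """Build the output-related fields of a POST /api/v1/tts body, or raise."""
    spec = FORMATS[output_format]
    fields = {"output_format": output_format}
    if sample_rate_hertz is not None:
        # Assumption: any integer rate inside an 8k-48k range is accepted.
        if sample_rate_hertz not in spec["rates"]:
            raise ValueError(f"{sample_rate_hertz} Hz not valid for {output_format}")
        fields["sample_rate_hertz"] = sample_rate_hertz
    if output_bitrate_kbps is not None:
        if provider not in BITRATE_PROVIDERS:
            raise ValueError(f"bitrate is not configurable for {provider}")
        if spec["bitrate"] is None:
            raise ValueError(f"{output_format} has no bitrate setting")
        low, high = spec["bitrate"]
        if not low <= output_bitrate_kbps <= high:
            raise ValueError(f"bitrate must be {low}-{high} kbps for {output_format}")
        fields["output_bitrate_kbps"] = output_bitrate_kbps
    return fields


print(output_fields("google", "ogg_opus", sample_rate_hertz=24_000, output_bitrate_kbps=96))
# {'output_format': 'ogg_opus', 'sample_rate_hertz': 24000, 'output_bitrate_kbps': 96}
```

When output_bitrate_kbps is left out, the table's defaults apply (128 kbps for mp3, 64 kbps for ogg_opus), so the helper simply omits the field.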

Streaming Delivery

Short-form generation supports streaming delivery. In the web app, use Stream Preview in the advanced settings on /tts. For the public API, use POST /api/v1/tts/stream.

Every streaming request creates a normal job with the same billing and storage behavior as async generation. If matching audio is already available, the service returns the completed result immediately instead of opening a new live stream. Streaming is short-form only: the maximum text length varies by provider and tier, as listed in the Streaming Max column of the Text And Prompt Limits table.

Provider	Tier	Family	Streaming	Default Transport	Supported Formats (API v1)	Notes
Google	Premium	Chirp3-HD	Yes	ogg_opus	ogg_opus (24k/48k), wav (24k/48k), pcm (24k/48k), mulaw (8k), alaw (8k)	Max speaking rate: 2.0
Google	Premium	Chirp-HD	Yes	ogg_opus	ogg_opus (24k/48k), wav (24k/48k), pcm (24k/48k), mulaw (8k), alaw (8k)	Max speaking rate: 2.0
Google	Premium	Wavenet	No	—	—	—
Google	Premium	Neural2	No	—	—	—
Google	Premium	Studio	No	—	—	—
Google	Premium	Standard	No	—	—	—
Google	Premium	Casual	No	—	—	—
Google	Premium	Polyglot	No	—	—	—
Google	Premium	News	No	—	—	—
Google	Ultra	Gemini TTS	Yes	wav	wav (24k), pcm (24k)	Speaking rate not supported
Polly	Premium	Standard	Yes	ogg_opus	ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k)	—
Polly	Premium	Neural	Yes	ogg_opus	ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k)	—
Polly	Premium	Generative	Yes	ogg_opus	ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k)	—
Polly	Ultra	Long-Form	Yes	ogg_opus	ogg_opus (48k), mp3 (8k–24k), wav (8k/16k), ogg_vorbis (8k–24k), pcm (8k/16k), mulaw (8k), alaw (8k)	—
Kokoro	Premium	Kokoro	Yes	ogg_opus	ogg_opus (24k), mp3 (24k), wav (24k), pcm (24k)	—

API v1 callers can request any format and sample rate listed for the voice family in the table above via output_format and sample_rate_hertz. If these are omitted, the family's default transport format is used. The /tts web app exposes only the browser-playable subset (ogg_opus, mp3, wav). Bitrate selection is not available in stream mode. The saved artifact is stored in the same format that was streamed. See API v1 Streaming for request and response details.
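The streaming table can also be encoded as data so that format and sample-rate choices are checked before calling POST /api/v1/tts/stream. In this sketch, stream_fields is a hypothetical helper, family keys follow the names in Models By Tier, and a range such as "8k–24k" is assumed to accept any integer rate inside it.

```python
# Streaming formats per voice family, transcribed from the table above.
# Each entry: (default transport, {format: allowed sample rates in Hz}).
# Assumption: a range such as "8k-24k" accepts any integer rate inside it.
CHIRP = {"ogg_opus": {24_000, 48_000}, "wav": {24_000, 48_000},
         "pcm": {24_000, 48_000}, "mulaw": {8_000}, "alaw": {8_000}}
POLLY = {"ogg_opus": {48_000}, "mp3": range(8_000, 24_001),
         "wav": {8_000, 16_000}, "ogg_vorbis": range(8_000, 24_001),
         "pcm": {8_000, 16_000}, "mulaw": {8_000}, "alaw": {8_000}}

STREAMING = {
    ("google", "Chirp3-HD"): ("ogg_opus", CHIRP),
    ("google", "Chirp-HD"): ("ogg_opus", CHIRP),
    ("google", "Gemini TTS"): ("wav", {"wav": {24_000}, "pcm": {24_000}}),
    ("polly", "Standard"): ("ogg_opus", POLLY),
    ("polly", "Neural"): ("ogg_opus", POLLY),
    ("polly", "Generative"): ("ogg_opus", POLLY),
    ("polly", "Long-Form"): ("ogg_opus", POLLY),
    ("kokoro", "Kokoro"): ("ogg_opus", {"ogg_opus": {24_000}, "mp3": {24_000},
                                        "wav": {24_000}, "pcm": {24_000}}),
}


def stream_fields(provider, family, output_format=None, sample_rate_hertz=None,
                  output_bitrate_kbps=None):
    """Validate stream options for POST /api/v1/tts/stream and return them."""
    if (provider, family) not in STREAMING:
        raise ValueError(f"{provider} {family} voices do not stream")
    if output_bitrate_kbps is not None:
        raise ValueError("bitrate selection is not available in stream mode")
    default_format, formats = STREAMING[(provider, family)]
    fmt = output_format or default_format
    if fmt not in formats:
        raise ValueError(f"{fmt} is not a streaming format for {family}")
    if sample_rate_hertz is not None and sample_rate_hertz not in formats[fmt]:
        raise ValueError(f"{sample_rate_hertz} Hz is not available for {fmt}")
    # Omitted fields are left out so the service applies its own defaults.
    fields = {}
    if output_format is not None:
        fields["output_format"] = output_format
    if sample_rate_hertz is not None:
        fields["sample_rate_hertz"] = sample_rate_hertz
    return fields


print(stream_fields("polly", "Neural", "mp3", 22_050))
# {'output_format': 'mp3', 'sample_rate_hertz': 22050}
```

A sample rate passed without a format is validated against the family's default transport, mirroring how the service falls back to that format.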
