Google Cloud TTS vs Amazon Polly vs Kokoro: A Practical Comparison

Google Cloud TTS, Amazon Polly, and Kokoro are three distinct approaches to text-to-speech. Google offers the widest language coverage and the most voice families. Polly provides dedicated long-form narration voices. Kokoro delivers natural-sounding speech at lower cost with a simpler model. This comparison covers what each provider actually offers today — voice families, capabilities, format support, and where each one fits best.

Provider Overview

Google Cloud TTS

Google offers the largest voice catalog of the three, spanning 90+ languages across multiple voice families. In AI TTS Microservice, Google voices are split across two tiers:

Premium tier — includes Chirp3-HD, Chirp-HD, Studio, WaveNet, Neural2, Standard, Casual, News, and Polyglot families. These cover the broadest range of languages and use cases.
Ultra tier — includes Gemini models (2.5 Flash, 2.5 Pro, Flash Lite Preview, 3.1 Flash Preview). These are the most expressive voices, supporting multi-speaker dialogue and system prompts for guiding delivery style.

Google's strength is breadth: more languages, more voice families, and more configuration options than the other two providers.

Amazon Polly

Polly focuses on reliability and narration quality. In AI TTS Microservice, Polly voices are also split across two tiers:

Premium tier — includes Generative, Neural, and Standard families. Good for notifications, UI prompts, and general-purpose speech.
Ultra tier — includes Long-Form voices, specifically designed for extended narration like audiobooks, courses, and long articles. These voices maintain consistent quality over thousands of words.

Polly's strengths are long-form narration and SSML support across all voice families.

Kokoro

Kokoro is an open-source TTS model that delivers natural-sounding speech with a simpler architecture. In AI TTS Microservice, Kokoro voices are classified under the Premium tier.

Kokoro's strength is natural delivery at lower cost, with speaking rate control. It supports fewer languages than Google or Polly, but the voices it does offer sound notably natural for English content.

Capability Comparison

SSML support

SSML (Speech Synthesis Markup Language) lets you control pronunciation, pauses, emphasis, and number formatting. Support varies:

Google Premium — conditional. Varies by voice family (some families support it, others don't).
Google Ultra (Gemini) — yes.
Polly (all families) — yes.
Kokoro — no.

If SSML control is critical for your use case (technical content, specific pronunciation requirements), Polly offers the most consistent support. See our SSML guide for practical examples.
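One way to cope with mixed SSML support is to send markup only where it is accepted and fall back to plain text elsewhere. The Python sketch below illustrates that idea. The tier labels (google-premium and so on) and the function name are hypothetical, not identifiers from any provider API, and the "conditional" case is treated conservatively as unsupported.

```python
import xml.etree.ElementTree as ET

# SSML support per provider tier, as listed above.
# The keys are illustrative labels, not real API identifiers.
SSML_SUPPORT = {
    "google-premium": "conditional",  # varies by voice family
    "google-ultra": "yes",
    "polly": "yes",
    "kokoro": "no",
}


def prepare_input(ssml: str, tier: str) -> str:
    """Return the SSML unchanged when the tier supports it, else plain text."""
    if SSML_SUPPORT[tier] == "yes":
        return ssml
    # Conservative fallback: drop all markup and keep only the spoken text.
    root = ET.fromstring(ssml)
    return " ".join("".join(root.itertext()).split())


ssml = (
    '<speak>Order <say-as interpret-as="digits">4512</say-as> shipped.'
    '<break time="300ms"/> Thanks!</speak>'
)
print(prepare_input(ssml, "polly"))   # markup passed through untouched
print(prepare_input(ssml, "kokoro"))  # Order 4512 shipped. Thanks!
```

Stripping tags loses the pronunciation hints, so for content that depends on them it is usually better to pick a voice that accepts SSML than to rely on the fallback.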

Speaking rate control

Google Premium — conditional. Varies by voice family.
Google Ultra — no.
Polly — no (use SSML prosody rate instead).
Kokoro — yes (0.25x to 4x).
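
The two routes to a slower or faster voice can be sketched side by side. This is a minimal Python illustration, not the service's actual request format: the helper names are made up, and the Polly helper assumes the percentage form of the SSML prosody rate attribute, where 100% is the default speed. Check the provider documentation for the range each voice family accepts.

```python
from xml.sax.saxutils import escape

# Kokoro's supported multiplier range, as stated above.
KOKORO_MIN_RATE = 0.25
KOKORO_MAX_RATE = 4.0


def kokoro_rate(rate: float) -> float:
    """Clamp a requested multiplier into Kokoro's 0.25x-4x range."""
    return max(KOKORO_MIN_RATE, min(KOKORO_MAX_RATE, rate))


def polly_rate_ssml(text: str, rate: float) -> str:
    """Express a multiplier as an SSML prosody percentage (1.0 -> 100%)."""
    percent = round(rate * 100)
    # escape() protects characters such as & and < inside the SSML body.
    return f'<speak><prosody rate="{percent}%">{escape(text)}</prosody></speak>'


print(kokoro_rate(6.0))  # 4.0
print(polly_rate_ssml("Terms & conditions apply.", 0.9))
```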

Multi-speaker dialogue

Google Premium — no.
Google Ultra — yes (model-dependent; Flash, Pro, and 3.1 Flash support it; Flash Lite does not).
Polly — no.
Kokoro — no.

If you need two voices in a single generation (dialogue, interviews, conversational content), Google Ultra is currently the only option, and only on models other than Flash Lite.

Streaming delivery

Streaming lets you hear audio as it's generated rather than waiting for the full file. Support varies:

Google Premium — only Chirp3-HD and Chirp-HD families. Other families (WaveNet, Neural2, Studio, Standard, etc.) do not support streaming.
Google Ultra — yes (all Gemini models).
Polly — yes (all families).
Kokoro — yes.

For more on streaming delivery and when to use it, see our streaming guide.

System prompts

System prompts let you guide the voice's delivery style with natural language instructions (e.g., "speak calmly and slowly, as if reading a bedtime story"):

Google Ultra (Gemini) — yes.
All others — no.
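
The capability lists above can be collapsed into a small lookup table for shortlisting providers by required features. The sketch below is one possible encoding in Python. The tier labels, field names, and the candidates helper are all hypothetical, and "conditional" stands for any answer that depends on the voice family or model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Each field holds "yes", "no", or "conditional".
    ssml: str
    rate: str
    multi_speaker: str
    streaming: str
    system_prompt: str


# Values transcribed from the lists above; keys are illustrative labels.
MATRIX = {
    "google-premium": Capabilities(
        ssml="conditional", rate="conditional", multi_speaker="no",
        streaming="conditional", system_prompt="no"),
    "google-ultra": Capabilities(
        ssml="yes", rate="no", multi_speaker="conditional",
        streaming="yes", system_prompt="yes"),
    "polly": Capabilities(
        ssml="yes", rate="no", multi_speaker="no",
        streaming="yes", system_prompt="no"),
    "kokoro": Capabilities(
        ssml="no", rate="yes", multi_speaker="no",
        streaming="yes", system_prompt="no"),
}


def candidates(*features: str, allow_conditional: bool = True) -> list[str]:
    """List the tiers that offer every requested feature."""
    ok = {"yes", "conditional"} if allow_conditional else {"yes"}
    return [
        tier for tier, caps in MATRIX.items()
        if all(getattr(caps, feature) in ok for feature in features)
    ]


print(candidates("ssml", "streaming"))
print(candidates("ssml", "streaming", allow_conditional=False))
print(candidates("system_prompt"))
```

A "conditional" match is only a starting point: the specific voice family or model still has to be checked before committing to it.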

Text Limits

Maximum text per request varies significantly:

Google Premium — up to 500,000 bytes (async), 5,000 bytes (stream)
Google Ultra — typically 4,000 bytes text + 4,000 bytes prompt (varies by model; Flash Lite is limited to 300 bytes)
Polly (all) — 100,000 bytes (async), 3,000 bytes (stream)
Kokoro — 5,000 bytes (async and stream)

For long-form content (articles, chapters, course modules), Google Premium and Polly offer the highest limits. Kokoro and Google Ultra are better suited to shorter content.
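
Because the limits are expressed in bytes rather than characters, accented or non-Latin text reaches them sooner than its character count suggests. The Python sketch below measures text as UTF-8 (an assumption, since the encoding is not stated above) and greedily packs whole sentences into chunks that fit a given limit. The tier labels and function names are hypothetical, and Google Ultra is left out because its limit varies by model.

```python
import re

# Per-request limits in bytes, taken from the list above.
LIMITS = {
    "google-premium": {"async": 500_000, "stream": 5_000},
    "polly": {"async": 100_000, "stream": 3_000},
    "kokoro": {"async": 5_000, "stream": 5_000},
}


def byte_len(text: str) -> int:
    """Size of the text in bytes, assuming UTF-8 encoding."""
    return len(text.encode("utf-8"))


def split_for_limit(text: str, limit: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` bytes."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if byte_len(sentence) > limit:
            raise ValueError("a single sentence exceeds the byte limit")
        candidate = f"{current} {sentence}" if current else sentence
        if byte_len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


text = "Café opens at nine. It closes at five. Weekends are busier."
print(len("Café"), byte_len("Café"))  # 4 characters, 5 bytes
print(split_for_limit(text, 40))
print(len(split_for_limit(text, LIMITS["polly"]["stream"])))  # fits in one request
```

Splitting at sentence boundaries keeps each chunk natural to listen to, at the cost of leaving some headroom unused in every request.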

Output Formats

All three providers support the core async formats: WAV, MP3, and OGG Opus. Streaming format support varies — see the provider capabilities page for the full matrix of formats and sample rates per provider.

For a deeper comparison of audio formats and when to use each, see our format comparison.

When to Use Each Provider

Choose Google when:

You need broad language coverage (90+ languages)
You want the most expressive voices available (Gemini/Ultra tier)
You need multi-speaker dialogue generation
You need very long text support (up to 500,000 bytes per async request on the Premium tier)
You want system prompts to guide delivery style

Choose Polly when:

You're producing long-form narration (audiobooks, courses, articles)
You need consistent SSML support across all voice families
You want streaming delivery with any voice family
You need reliable, production-proven voices for sustained listening

Choose Kokoro when:

Natural-sounding English speech is the priority
Cost efficiency matters (lower per-character pricing)
You want direct speaking rate control (0.25x–4x)
Your content is short-form (up to 5,000 bytes per request)
You don't need SSML or multi-speaker

Using Multiple Providers Together

These providers aren't mutually exclusive. AI TTS Microservice exposes all three through the same API, with the same authentication, request format, and billing. Pick the best provider for each specific use case:

Kokoro for high-volume short notifications (cost-effective, natural)
Polly Long-Form for course narration (designed for sustained listening)
Google Gemini for expressive dialogue or prompted delivery (most flexible)
Google Chirp3-HD for multilingual content (broadest language support with streaming)

The voice ID prefix tells you which provider you're using (google:, polly:, kokoro:), but everything else — the endpoint, the response format, the billing — stays the same.
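
Since the provider is encoded in the voice ID prefix, client code can recover it with a simple split, for example to apply provider-specific text limits before sending a request. A minimal Python sketch follows. Only the three prefixes come from the text above; the helper name and the voice names after the colon are invented for illustration.

```python
KNOWN_PROVIDERS = {"google", "polly", "kokoro"}


def provider_of(voice_id: str) -> str:
    """Return the provider prefix of a voice ID such as 'polly:<voice>'."""
    prefix, sep, rest = voice_id.partition(":")
    if not sep or not rest or prefix not in KNOWN_PROVIDERS:
        raise ValueError(f"unrecognized voice ID: {voice_id!r}")
    return prefix


# The voice names after the prefix are placeholders, not real voices.
for voice_id in ("google:example-voice", "polly:example-voice", "kokoro:example-voice"):
    print(voice_id, "->", provider_of(voice_id))
```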

How to Compare Voices Yourself

The best way to evaluate is to listen. The voice gallery lets you browse and hear pre-generated samples from all three providers in one place — no account required. Filter by provider, language, or tier to narrow down candidates, then sign in to generate audio with your own text to hear how each voice handles your specific content.

Try it: Browse the voice gallery to compare Google, Polly, and Kokoro voices side by side. See the provider capabilities page for the full technical matrix.