
A Practical Guide to SSML in Text-to-Speech

AI TTS Microservice Team · 6 min read

Speech Synthesis Markup Language (SSML) lets you control how a TTS engine speaks your text: where it pauses, what it emphasises, and how it pronounces abbreviations and numbers. But SSML support is not universal. It varies by provider, by tier, and in some cases by individual voice family, and each provider that does support SSML accepts a different subset of tags. This guide covers what SSML is, which tags are available, and which providers and voice families support them.

What SSML Does

Without SSML, a TTS engine reads your text using its best guess about pacing, emphasis, and pronunciation. That's fine for simple sentences, but it falls apart with domain-specific terms, acronyms, phone numbers, dates, or any content where delivery matters.

SSML is an XML-based markup: you wrap your text in tags that give the engine explicit instructions. Every SSML document has a single <speak> root element:

<speak>
  Your text with SSML tags goes here.
</speak>
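Because SSML is XML, characters such as &, < and > in the raw text need to be escaped before the text is wrapped in <speak>. The following is a minimal sketch using only the Python standard library; the helper name to_ssml is hypothetical and not part of any provider SDK.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Escape XML special characters and wrap the text in a <speak> root."""
    # escape() replaces &, < and > with their XML entities.
    return f"<speak>{escape(text)}</speak>"


print(to_ssml("Terms & conditions apply for orders < $5."))
# <speak>Terms &amp; conditions apply for orders &lt; $5.</speak>
```

Only escape the plain-text parts. Tags you add deliberately, such as <break/>, must be inserted after escaping or they will be escaped too.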

Common SSML Tags

Here are the most useful tags for practical TTS work. Note that not all providers support all of these — see the provider-specific sections below for what works where.

<break> — Insert a Pause

Control silence between words or sentences. Useful for pacing, dramatic effect, or separating list items.

<speak>
  Welcome to the demo.<break time="500ms"/>
  Let's get started.
</speak>

The time attribute accepts milliseconds (ms) or seconds (s). You can also use strength with values like weak, medium, strong, or x-strong.
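For list-style content, the pause can be generated rather than typed by hand. This sketch joins items with a fixed <break> between them; join_with_breaks and its default of 400 ms are illustrative assumptions, not recommended values.

```python
from xml.sax.saxutils import escape


def join_with_breaks(items: list[str], pause_ms: int = 400) -> str:
    """Join list items into one SSML document with a pause between items."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape each item first, then insert the break tags between them.
    body = pause.join(escape(item) for item in items)
    return f"<speak>{body}</speak>"


print(join_with_breaks(["Milk", "Eggs", "Bread"], pause_ms=300))
```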

<say-as> — Control Interpretation

Tell the engine how to interpret ambiguous content like numbers, dates, or abbreviations.

<speak>
  Your order number is <say-as interpret-as="characters">AB123</say-as>.
  The total is <say-as interpret-as="currency" language="en-US">$42.50</say-as>.
  Call us at <say-as interpret-as="telephone">+1-800-555-0199</say-as>.
</speak>

Common interpret-as values: characters, cardinal, ordinal, telephone, date, currency, unit.
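When order numbers or phone numbers come from your own data, a small helper keeps the markup consistent. This is a sketch under the assumption that you assemble SSML as strings; say_as is a hypothetical name, and it does not check that the target voice accepts the given interpret-as value.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in a <say-as> element with the given interpret-as value."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


order_id = "AB123"
print(f"<speak>Your order number is {say_as(order_id, 'characters')}.</speak>")
```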

<prosody> — Adjust Rate, Pitch, and Volume

Fine-tune how the voice delivers specific sections.

<speak>
  <prosody rate="slow" pitch="-2st">
    This part is slower and lower pitched.
  </prosody>
  <prosody volume="loud">
    And this part is louder.
  </prosody>
</speak>

Rate values: x-slow, slow, medium, fast, x-fast, or a percentage like 80%. Pitch values: x-low, low, medium, high, x-high, or semitones like +2st or -3st.

Important: Prosody support varies. Google Studio voices don't support the pitch attribute. On Polly, prosody is only partially supported on Neural, Long-Form, and Generative voices — only rate is reliable across all Polly voice types.
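One way to cope with the pitch restriction is to build <prosody> from keyword arguments and drop pitch for voices that do not accept it. In this sketch the prosody function and its allow_pitch flag are hypothetical; you would decide the flag's value yourself from the voice family.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None, volume=None, allow_pitch=True):
    """Wrap text in <prosody>, omitting pitch when the voice cannot use it."""
    attrs = {
        "rate": rate,
        "pitch": pitch if allow_pitch else None,
        "volume": volume,
    }
    rendered = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    if not rendered:
        # No usable attributes left: return the escaped text without a wrapper.
        return escape(text)
    return f"<prosody{rendered}>{escape(text)}</prosody>"


# For example, a Google Studio voice: keep the rate, drop the pitch.
print(prosody("Slower, please.", rate="slow", pitch="-2st", allow_pitch=False))
```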

<emphasis> — Stress a Word or Phrase

<speak>
  This is <emphasis level="strong">extremely</emphasis> important.
</speak>

Levels: reduced, moderate, strong.

Availability is limited: Supported on Google Standard, WaveNet, Neural2, and Ultra (Gemini) voices. Not supported on Google Studio, Google Chirp3-HD, or any Polly voice type (Neural, Long-Form, Generative).

<phoneme> — Override Pronunciation

Force a specific pronunciation using the International Phonetic Alphabet (IPA) or X-SAMPA.

<speak>
  The word <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
  is pronounced differently in different regions.
</speak>

Supported on Google (all SSML-capable families) and on Polly (all voice types, though only partially on Generative).

<sub> — Substitution

Replace displayed text with a spoken alternative. Useful for abbreviations and symbols.

<speak>
  <sub alias="World Wide Web Consortium">W3C</sub> publishes the SSML spec.
</speak>
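If the same abbreviations recur across many scripts, the substitutions can be applied from a lookup table. The sketch below is one possible approach; apply_aliases is a hypothetical helper, and it assumes every key starts and ends with a letter or digit so that word-boundary matching works.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text: str, aliases: dict[str, str]) -> str:
    """Escape text and wrap each known abbreviation in a <sub> element."""
    if not aliases:
        return escape(text)
    # Longest keys first so that longer abbreviations win over shorter ones.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    parts, last = [], 0
    for match in pattern.finditer(text):
        term = match.group(0)
        parts.append(escape(text[last:match.start()]))
        parts.append(f"<sub alias={quoteattr(aliases[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


aliases = {"W3C": "World Wide Web Consortium"}
print(f"<speak>{apply_aliases('W3C publishes the SSML spec.', aliases)}</speak>")
```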

<p> and <s> — Paragraphs and Sentences

Explicitly mark paragraph and sentence boundaries for more natural pacing.

<speak>
  <p>
    <s>This is the first sentence.</s>
    <s>This is the second.</s>
  </p>
  <p>
    <s>New paragraph starts here.</s>
  </p>
</speak>
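If your content is already split into paragraphs and sentences, the markup can be generated from that structure. A minimal sketch, assuming the input is a list of paragraphs where each paragraph is a list of sentence strings; paragraphs_to_ssml is a hypothetical name.

```python
from xml.sax.saxutils import escape


def paragraphs_to_ssml(paragraphs: list[list[str]]) -> str:
    """Render nested paragraph/sentence lists as <p> and <s> elements."""
    blocks = []
    for sentences in paragraphs:
        inner = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
        blocks.append(f"<p>{inner}</p>")
    return f"<speak>{''.join(blocks)}</speak>"


print(paragraphs_to_ssml([
    ["This is the first sentence.", "This is the second."],
    ["New paragraph starts here."],
]))
```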

SSML Support by Provider

Tag support is where providers diverge. Each provider supports a different subset of SSML tags, and within a provider, support varies by voice type.

Google Cloud TTS

Google's SSML support varies significantly by voice family:

Standard, WaveNet, Neural2

The broadest SSML support among Google voice families. These accept a wide subset of SSML tags (see Google's SSML reference for details): <speak>, <break>, <say-as>, <prosody>, <emphasis>, <phoneme>, <sub>, <p>, <s>, <mark>, <voice>, <lang>, <audio>, <par>, <seq>, <media>, <desc>.

Studio

Supports SSML but with restrictions. The following are not supported on Studio voices:

  • <mark>
  • <emphasis>
  • <lang>
  • <prosody pitch="..."> — the pitch attribute specifically is unsupported, though rate and volume work

Chirp3-HD

SSML support was recently added (Preview). Chirp3-HD accepts a more limited subset of tags: <speak>, <say-as>, <p>, <s>, <phoneme>, <sub>, <break>, <audio>, <prosody>, <voice>.

Notably absent from Chirp3-HD: <emphasis>, <mark>, <lang>, <par>, <seq>, <media>. Unsupported tags are silently ignored rather than causing errors.

Casual, Chirp-HD, News, Polyglot

These Premium voice families do not support SSML. Sending SSML to these voices will return a 400 error.

Google Ultra (Gemini)

Broad SSML support — the same tags as Standard/WaveNet/Neural2. Note that Google Ultra is short-form only — long-form generation is not available on this tier.
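The Google tag lists above can be turned into a quick pre-flight check. The sketch below parses an SSML string with the standard library and reports tags that fall outside the list for a given voice family. The tag sets are transcribed from this guide rather than fetched from Google, the family keys are made-up labels, and the check looks only at tag names, so it will not catch the unsupported pitch attribute on Studio voices.

```python
import xml.etree.ElementTree as ET

# Tag lists as described in this guide (an assumption, not an official source).
BASE_TAGS = frozenset({
    "speak", "break", "say-as", "prosody", "emphasis", "phoneme", "sub",
    "p", "s", "mark", "voice", "lang", "audio", "par", "seq", "media", "desc",
})
GOOGLE_TAGS = {
    "standard": BASE_TAGS,
    "wavenet": BASE_TAGS,
    "neural2": BASE_TAGS,
    "ultra": BASE_TAGS,
    "studio": BASE_TAGS - {"mark", "emphasis", "lang"},
    "chirp3-hd": frozenset({
        "speak", "say-as", "p", "s", "phoneme", "sub",
        "break", "audio", "prosody", "voice",
    }),
}


def unsupported_tags(ssml: str, family: str) -> list[str]:
    """Return the sorted tag names in `ssml` that `family` does not list."""
    allowed = GOOGLE_TAGS[family]
    root = ET.fromstring(ssml)  # raises ET.ParseError on malformed SSML
    used = {element.tag for element in root.iter()}
    return sorted(used - allowed)


ssml = '<speak>This is <emphasis level="strong">very</emphasis> important.</speak>'
print(unsupported_tags(ssml, "studio"))    # ['emphasis']
print(unsupported_tags(ssml, "standard"))  # []
```

Because Chirp3-HD silently ignores unsupported tags, a check like this is the only warning you would get for that family.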

Amazon Polly

Polly supports SSML across all voice types, but with important limitations depending on the engine:

  • Standard voices — broadest support, including all standard SSML tags plus Amazon-specific extensions like <amazon:effect name="drc">.
  • Neural voices — most standard tags work, but <emphasis> is not available. <prosody> is only partially supported (rate works, but pitch and volume may be limited). <say-as> is partially supported.
  • Long-Form voices — similar to Neural. <emphasis> is not available. <prosody> is partial.
  • Generative voices — the most limited. <emphasis> is not available. <prosody> is partial. <mark> and <phoneme> are only partially supported.

Tags supported across all Polly voice types: <speak>, <break>, <lang>, <p>, <s>, <sub>, <w>.

Kokoro

No SSML support. Kokoro accepts plain text input only. Any SSML tags sent to a Kokoro voice will be rejected. Use speaking rate adjustment instead for pacing control.
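If you keep scripts in SSML but sometimes need a plain-text voice such as Kokoro, the markup can be stripped before sending. This is a rough sketch using the standard library; ssml_to_plain_text is a hypothetical helper. It keeps only the text content, so pauses are lost and a <sub> element contributes its written text, not its alias.

```python
import xml.etree.ElementTree as ET


def ssml_to_plain_text(ssml: str) -> str:
    """Drop all SSML tags and return the remaining text, whitespace-normalised."""
    root = ET.fromstring(ssml)  # raises ET.ParseError on malformed SSML
    text = "".join(root.itertext())
    # Collapse the runs of whitespace left behind where tags were removed.
    return " ".join(text.split())


print(ssml_to_plain_text('<speak>Hello <break time="300ms"/> world.</speak>'))
# Hello world.
```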

SSML Support Summary

Provider | Voice Type                       | SSML          | Key Limitations
Google   | Standard, WaveNet, Neural2       | Yes           | Broadest Google support. See SSML reference.
Google   | Studio                           | Yes           | No emphasis, mark, lang, or prosody pitch.
Google   | Chirp3-HD                        | Yes (Preview) | Limited subset. No emphasis, mark, lang, par, seq, media.
Google   | Casual, Chirp-HD, News, Polyglot | No            | Not supported.
Google   | Ultra (Gemini)                   | Yes           | Same as Standard/WaveNet/Neural2. Short-form only.
Polly    | Standard                         | Yes           | Broadest Polly support. See Polly SSML reference.
Polly    | Neural                           | Yes           | No emphasis. Partial prosody and say-as.
Polly    | Long-Form                        | Yes           | No emphasis. Partial prosody.
Polly    | Generative                       | Yes           | No emphasis. Partial prosody, mark, phoneme.
Kokoro   | All                              | No            | Plain text only.

Sending SSML

You can send SSML directly from the generation UI by switching the input format to SSML — no API integration needed. For programmatic access, set the format field to "ssml" in your API request:

curl -X POST https://aitts.theproductivepixel.com/api/v1/tts \
  -H "Authorization: Bearer tts_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "<speak>Hello <break time=\"300ms\"/> world.</speak>",
    "voice_id": "google:en-US-Standard-A",
    "format": "ssml"
  }'

The same request in Python:

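import requests

# Send an SSML request. The single-quoted Python string avoids having to
# escape the double quotes inside the SSML attributes.
response = requests.post(
    "https://aitts.theproductivepixel.com/api/v1/tts",
    headers={"Authorization": "Bearer tts_YOUR_KEY"},
    json={
        "text": '<speak>Hello <break time="300ms"/> world.</speak>',
        "voice_id": "google:en-US-Standard-A",
        "format": "ssml",
    },
    timeout=30,  # do not hang indefinitely on network problems
)
# Raise on 4xx/5xx, for example the 400 returned when a voice rejects SSML.
response.raise_for_status()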

Checking SSML Support Per Voice

Since SSML support varies at the voice level, you can check programmatically using the voices endpoint. Each voice in the response includes a supports_ssml boolean:

curl "https://aitts.theproductivepixel.com/api/v1/voices?provider=google" \
  -H "Authorization: Bearer tts_YOUR_KEY"

In the response, each voice object includes:

{
  "voice_id": "google:en-US-Standard-A",
  "provider": "google",
  "supports_ssml": true,
  "supports_markup": false,
  ...
}

Always check supports_ssml before sending SSML input to a voice. Sending SSML to a voice that doesn't support it returns a 400 error.
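Once the voice list is parsed, filtering on the flag is a one-liner. The sketch below assumes you already hold a list of voice objects shaped like the example above; how that list is nested in the full response body is not shown here, and the second voice ID in the sample data is made up for illustration.

```python
def ssml_capable_voice_ids(voices: list[dict]) -> list[str]:
    """Return the voice_id of every voice whose supports_ssml flag is true."""
    # .get() treats a missing flag as "not supported" rather than raising.
    return [voice["voice_id"] for voice in voices if voice.get("supports_ssml")]


sample_voices = [
    {"voice_id": "google:en-US-Standard-A", "provider": "google", "supports_ssml": True},
    {"voice_id": "kokoro:example-voice", "provider": "kokoro", "supports_ssml": False},
]
print(ssml_capable_voice_ids(sample_voices))
# ['google:en-US-Standard-A']
```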

Practical Tips

  • Always wrap in <speak> — it's required as the root element.
  • Check voice support first — use GET /api/v1/voices to verify supports_ssml before sending SSML.
  • Stick to the safe subset: <break>, <say-as>, <sub>, <p>, and <s> work across almost all SSML-capable voices. Start there.
  • Be careful with <emphasis> — it's not supported on Polly Neural/Long-Form/Generative, Google Studio, or Google Chirp3-HD. If you need cross-provider compatibility, avoid it.
  • Test with short clips — iterate on SSML with short text before committing to long-form generation.
  • Mind the byte limits — SSML tags count toward the text byte limit. Google Premium allows up to 500,000 bytes for long-form; Polly allows 100,000. Google Ultra is 4,000 bytes (short-form only).
  • Kokoro alternative — if you need a Kokoro voice, use plain text with speaking rate adjustment instead of SSML. Kokoro supports speed control on the Premium tier.
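The byte-limit tip can be checked before a request is sent. This sketch uses the limits quoted above and assumes the size is measured on the UTF-8 encoding of the full SSML string, tags included; the tier keys and the function name fits_byte_limit are hypothetical.

```python
# Limits as quoted in this guide; the dictionary keys are made-up labels.
BYTE_LIMITS = {
    "google-premium": 500_000,
    "polly": 100_000,
    "google-ultra": 4_000,
}


def fits_byte_limit(ssml: str, tier: str) -> bool:
    """Check the UTF-8 size of the SSML (tags included) against the tier limit."""
    # Bytes, not characters: non-ASCII text takes more than one byte each.
    return len(ssml.encode("utf-8")) <= BYTE_LIMITS[tier]


ssml = '<speak>Hello <break time="300ms"/> world.</speak>'
print(len(ssml.encode("utf-8")), fits_byte_limit(ssml, "google-ultra"))
```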

Reference: See the provider capabilities page for the full support matrix, or check the API docs for endpoint details. For official provider documentation, see Google Cloud TTS SSML reference and Amazon Polly supported SSML tags.