What Is Streaming Text to Speech and When Should You Use It?

AI TTS Microservice Team · 8 min read

AI TTS Microservice supports streaming delivery for short-form text to speech. Instead of waiting for the full audio file to be generated and returned, you can start hearing audio sooner — as chunks arrive — while still getting a durable completed result at the end. Streaming is available in the web app and through the public API, across supported providers and compatible voice selections.

What Streaming Delivery Means

The standard async model works like this: submit your text, wait for the full audio to be generated, then download or play the completed file. That's reliable and works well for batch workflows, but you don't hear anything until the entire job finishes.

Streaming delivery changes the timing. When you choose streaming, audio begins playing as it's generated — you hear the first words while the rest of the text is still being processed. The key thing to understand is that streaming is a delivery mode, not a separate product. The same voices, the same billing, the same quality. The only difference is when you start hearing the audio.

Streaming applies to short-form generation. Text limits are the same as async short-form and vary by provider — see the Text and Prompt Limits table for current limits per provider, including the Streaming Max column.

Streaming in the Web App

On the TTS generation page, you'll find a Delivery Mode control in the advanced settings. It offers two options: Async (the default) and Stream.

When you select Stream and hit Generate, audio begins playing in your browser as chunks arrive from the provider. You hear the output building instead of waiting for a progress bar to complete.

If the current voice, provider, or browser combination doesn't support streaming, the Stream option is automatically disabled with an explanation of why. This means you don't need to memorize which combinations work — the UI tells you.

After streaming finishes, the completed audio appears in your library and generation history just like any async job. You can play it again, download it, share it, or add it to a collection.

Streaming via the API

For API users, streaming is available through a dedicated endpoint that follows the same authentication, billing, and request format as the standard generation endpoint (see our API tutorial for the full async flow).

The streaming flow is two steps:

Step 1: Create the streaming job

Send a POST request to /api/v1/tts/stream. The request body is the same as /api/v1/tts, with one addition: you can optionally specify output_format to choose the stream format. If omitted, the provider's default format is used. See the provider capabilities page for supported formats per provider.

curl -X POST https://aitts.theproductivepixel.com/api/v1/tts/stream \
  -H "Authorization: Bearer tts_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Streaming lets you hear audio as it is generated.",
    "voice_id": "google:en-US-Chirp3HD-Charon",
    "output_format": "ogg_opus"
  }'

The response comes back immediately with HTTP 202:

{
  "success": true,
  "data": {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "pending",
    "poll_url": "/api/v1/tts/550e8400-e29b-41d4-a716-446655440000",
    "audio_endpoint": "/api/v1/tts/550e8400-e29b-41d4-a716-446655440000/audio",
    "stream_url": "https://...",
    "stream_url_expires_at": "2026-04-18T20:05:00.000Z",
    "transport_format": "ogg_opus",
    "transport_mime_type": "audio/ogg; codecs=opus",
    "transport_sample_rate_hertz": 48000,
    "chars_charged": 49
  }
}

The response includes transport_format, transport_mime_type, and transport_sample_rate_hertz, which tell your client exactly what audio format to expect from the stream.

Important: The stream_url is short-lived and single-use. Open it promptly after receiving the response. The stream_url_expires_at field tells you exactly when it expires.

Step 2: Open the stream

Send a GET request to the stream_url:

curl -N "STREAM_URL_FROM_RESPONSE" --output streamed-audio.ogg

Audio arrives as chunked data in the format specified by transport_format in the response. Your client can begin playback as chunks arrive, or buffer the full stream and save it.
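For illustration, here is one way to consume the chunked body in Python. The consume_stream helper is our own sketch, and the requests usage in the comment is one possible client, not a prescribed one.

```python
import io

def consume_stream(chunks, sink, on_chunk=None):
    """Write audio chunks to sink as they arrive; return total bytes written.

    chunks is any iterable of bytes (e.g. response.iter_content(...));
    on_chunk, if given, is called per chunk so playback can start early.
    """
    total = 0
    for chunk in chunks:
        if not chunk:  # skip keep-alive empty chunks
            continue
        sink.write(chunk)
        total += len(chunk)
        if on_chunk:
            on_chunk(chunk)
    return total

# Sketch of use with requests (stream_url comes from the create response):
#
#   import requests
#   with requests.get(stream_url, stream=True) as resp, open("out.ogg", "wb") as f:
#       consume_stream(resp.iter_content(chunk_size=4096), f)
```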

Step 3: Access the durable result

After the stream completes, the durable audio file is available through the normal job and audio endpoints — just like an async job:

curl https://aitts.theproductivepixel.com/api/v1/tts/JOB_ID/audio \
  -H "Authorization: Bearer tts_YOUR_KEY"

Idempotency keys work the same way as the standard endpoint. If you retry a request with the same idempotency key, you'll get the same job metadata back — though the stream URL is not replayed, since it's one-time use.

Format and the Saved Artifact

In streaming mode, the format you choose (or the provider's default if you omit output_format) is used for both the live stream and the saved file. There is no hidden conversion step — what streams is what gets saved.

For example, if you request ogg_opus, the live stream delivers OGG Opus chunks and the durable artifact saved to your library is also OGG Opus. If you request wav, the stream delivers WAV (PCM wrapped in a RIFF header) and the saved file is WAV.

A few constraints to be aware of in stream mode:

  • Bitrate selection is not available — output_bitrate_kbps is rejected in stream mode. Format and sample rate are the controls you have.
  • Not all formats are available on all providers — for example, Google Ultra (Gemini) streaming supports only wav and pcm at 24000 Hz. See the provider capabilities page for the full matrix of supported formats, sample rates, and providers.
  • Speaking rate is constrained — Google streaming limits speaking rate to 2.0 max (async allows up to 4.0). Exceeding the limit returns an error suggesting you use async delivery instead.
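A client can pre-validate a request against these constraints before sending it. The limits below come straight from the bullets above; the speaking_rate parameter name and the "google:" prefix check on voice_id (which matches the voice ID shown in the earlier example) are assumptions for illustration.

```python
def validate_stream_request(body: dict) -> list[str]:
    """Return a list of problems with a stream-mode request body, [] if OK.

    Encodes the stream-mode constraints listed above. Treats voice_ids
    starting with "google:" as Google voices (an assumption about naming).
    """
    problems = []
    if "output_bitrate_kbps" in body:
        problems.append("output_bitrate_kbps is rejected in stream mode")
    voice_id = body.get("voice_id", "")
    rate = body.get("speaking_rate")
    if voice_id.startswith("google:") and rate is not None and rate > 2.0:
        problems.append("Google streaming limits speaking_rate to 2.0; use async for higher rates")
    return problems
```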

For a deeper comparison of audio formats and when to use each, see our audio format comparison.

Cache Hits and Repeat Requests

If you request streaming for text you've already generated with the same voice and settings, the service may return the completed result immediately instead of opening a new live stream. In this case, the response won't include a stream_url — you'll get the finished audio right away through the normal completion path.

This is actually faster than streaming, since the audio is already available. The response still includes the job_id, poll_url, and audio_endpoint so your client can retrieve the result through the same flow.

Cache behavior is deterministic: the same input text, voice, and settings produce the same cache key. If the cached audio is still available, you skip the generation step entirely.
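Client code can branch on the presence of stream_url to handle both the live-stream and cache-hit paths uniformly. A minimal sketch, using the field names from the create response shown earlier:

```python
def choose_playback_source(data: dict) -> tuple[str, str]:
    """Decide how to get audio from a create-stream response.

    Returns ("stream", url) when a live stream was opened, or
    ("download", audio_endpoint) on a cache hit, where the audio
    already exists and no stream_url is returned.
    """
    if data.get("stream_url"):
        return ("stream", data["stream_url"])
    return ("download", data["audio_endpoint"])
```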

When to Use Streaming

Streaming delivery is most useful when reducing perceived wait time matters — when a user or end customer is actively waiting to hear the audio. Some concrete scenarios:

  • Interactive voice previews — let users hear how a voice sounds with their text immediately, without waiting for the full file. Useful in voice selection workflows, content editors, and preview interfaces.
  • Web and mobile app playback — if your app generates audio on demand for end users, streaming reduces the gap between "request" and "first sound." Users perceive the experience as faster even though the total generation time is similar.
  • Chatbot and assistant responses — when generating spoken responses in a conversational interface, streaming lets the user start hearing the reply sooner, reducing perceived wait time between turns.
  • Live demos and presentations — generate audio on the fly during a demo or sales call. Streaming means the audience hears the result building in front of them instead of watching a loading spinner.
  • Prototyping and rapid iteration — when you're testing different voices, speeds, or phrasings, streaming gives you faster feedback loops. You can hear whether the output sounds right within seconds instead of waiting for each full generation.
  • Accessibility on demand — serving audio alternatives for text content where a user requests an audio version and expects to start hearing it promptly.
  • Kiosk and exhibit audio — interactive installations where a visitor triggers audio and expects immediate playback. Streaming reduces the silence between the trigger and the first sound.

When to Use Async

Async delivery remains the default and is the better choice for workflows where immediate playback isn't the goal:

  • Batch voiceover production — generating narration for multiple videos, podcast episodes, or audio files at once. Submit all your scripts, do other work, and collect the results when they're ready.
  • E-learning and course narration — producing audio for course modules where you'll edit, review, and publish the files later. Async lets you queue an entire course and process it in the background.
  • CMS-triggered audio generation — automatically generating audio versions of articles or documentation when content is published. The audio doesn't need to play immediately — it just needs to be ready when a reader wants it.
  • Accessibility compliance at scale — generating audio alternatives for large content libraries as a background process. Pair with webhooks to get notified when each file is ready.
  • API-driven content pipelines — automated workflows where audio generation is one step in a larger pipeline. Submit the job, receive a webhook when it completes, and pass the result to the next stage.
  • Long-form content — async supports both short-form and long-form generation. If your text exceeds short-form limits, async is the path to use.
  • Background processing — any workflow where you don't need to play the audio right now. Async is simpler to integrate (one endpoint, poll or webhook) and doesn't require handling a streaming connection.

Both modes cost the same. The credit charge is based on the text and voice, not the delivery mode.

What Happens If You Disconnect

If you close the browser tab or your API client disconnects while audio is still streaming, the generation process continues in the background. The durable audio file is still saved and the job completes normally.

For web app users, the completed audio will appear in your library and history. For API users, the durable result remains available through the normal job status and audio endpoints. No credits are lost — the job finalizes regardless of whether the client stayed connected for the full stream.

This means streaming is safe to use even on unreliable connections. The worst case is that you miss part of the live playback, but the completed file is always there.
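After a disconnect, a client can poll the job's poll_url until it reaches a terminal state and then fetch the durable audio. A sketch with a pluggable fetch function; the response above only shows a "pending" status, so the other status names here are assumptions.

```python
import time

def poll_until_terminal(fetch_status, interval_s=1.0, max_attempts=30):
    """Poll fetch_status() until the job leaves a non-terminal state.

    fetch_status is any callable returning the job's status payload,
    e.g. a GET on the poll_url. Treating anything other than "pending"
    or "processing" as terminal is an assumption; only "pending" appears
    in the example response above.
    """
    for _ in range(max_attempts):
        payload = fetch_status()
        if payload.get("status") not in ("pending", "processing"):
            return payload
        time.sleep(interval_s)
    raise TimeoutError("job did not reach a terminal state in time")
```

Once the job reports completion, download the durable file from the audio endpoint exactly as in Step 3.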


Try it: Open the TTS generation page, select a voice, switch to Stream in the delivery mode settings, and hit Generate. For API integration, see the API documentation and the provider capabilities page for the current streaming support matrix and text limits.