
How to Connect AI Agents to Text-to-Speech with MCP

AI TTS Microservice Team · 9 min read

The Model Context Protocol (MCP) is an open standard that lets AI assistants — Claude, Cursor, Windsurf, and others — call external tools directly. Instead of copying text between your AI assistant and a TTS dashboard, you can generate speech, manage audio files, create shareable links, and organize your library through natural conversation. AI TTS Microservice exposes 34 MCP tools covering generation, voices, jobs, shares, library, and storage.

What MCP Is and Why It Matters for TTS

MCP is a protocol that standardizes how AI assistants discover and call external tools. Think of it as a universal plug that lets any compatible AI client talk to any compatible service — without custom integrations for each pair.

For text-to-speech, this means your AI assistant can:

  • Generate audio from text you're working on — without leaving your editor or chat
  • Search voices by language, provider, or tier to find the right fit
  • Check job status, retrieve audio links, and manage your library
  • Create shareable links, access codes, and QR codes for generated audio
  • Estimate costs before generating, check usage, and manage storage

All of this happens through the same conversational interface you already use for coding, writing, or research. No context switching, no separate dashboard tabs.

Two Ways to Connect

AI TTS Microservice supports both MCP connection methods:

1. Local stdio (recommended for desktop AI clients)

The @theproductivepixel/aittsm npm package runs as a local process on your machine. Your AI client launches it, communicates over stdin/stdout, and the package proxies requests to the TTS API. This is the standard pattern for Claude Desktop, Cursor, and Windsurf.

Install globally or run via npx — no build step required:

npm install -g @theproductivepixel/aittsm

2. Remote HTTP (for server-side agents)

For agents running on servers — automated pipelines, custom bots, or backend integrations — the MCP endpoint is available at POST /api/v1/mcp. It uses stateless Streamable HTTP transport: each request is independent, no session tracking required. Authentication uses the same Bearer token as the REST API.

Setting Up Claude Desktop

Add this to your Claude Desktop MCP configuration file:

{
  "mcpServers": {
    "aitts": {
      "command": "npx",
      "args": ["@theproductivepixel/aittsm"],
      "env": {
        "AITTSM_API_KEY": "tts_your_api_key_here"
      }
    }
  }
}

Setting Up Cursor or Windsurf

Cursor and Windsurf use a slightly different config shape — without the outer mcpServers wrapper:

{
  "aitts": {
    "command": "npx",
    "args": ["@theproductivepixel/aittsm"],
    "env": {
      "AITTSM_API_KEY": "tts_your_api_key_here"
    }
  }
}

Replace tts_your_api_key_here with your actual API key from the Dashboard (API tab). Restart your AI client, and you'll see the TTS tools available in the tool list.

What You Can Do: 34 Tools Across 5 Groups

The MCP server exposes the full TTS API surface as discrete tools. Each tool maps to a specific capability with its own permission scope and rate limit bucket.

Voice and Generation (4 tools)

The core workflow: find a voice, generate audio, check status, get the download link.

  • search_voices — filter by language, provider, or model type. Returns voice IDs, capabilities, and metadata.
  • generate_speech — submit text with a voice ID. Supports single-speaker and multi-speaker modes, async and streaming delivery, custom output formats, and idempotency keys. Options like speaking rate and SSML are available where the selected voice family supports them.
  • get_job_status — check whether a generation job is pending, completed, or failed.
  • get_audio_link — get a signed download URL for completed audio.
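
The four tools above chain into one pipeline. Here is a sketch of that workflow as a sequence of MCP tool-call arguments; the tool names come from the list above, but the argument names (language, provider, text, voice_id, output_format, job_id) are assumptions, so rely on the schemas your client discovers at startup.

```python
def tool_call(name: str, arguments: dict) -> dict:
    """JSON-RPC params object for an MCP tools/call request."""
    return {"name": name, "arguments": arguments}

# find a voice -> generate -> poll -> fetch the signed URL
steps = [
    tool_call("search_voices", {"language": "en", "provider": "google"}),
    tool_call("generate_speech", {"text": "Hello, world.",
                                  "voice_id": "<voice-id-from-search>",
                                  "output_format": "mp3"}),
    tool_call("get_job_status", {"job_id": "<job-id-from-generate>"}),
    tool_call("get_audio_link", {"job_id": "<job-id-from-generate>"}),
]
```

In practice your AI assistant performs this chaining for you, feeding the voice ID and job ID from one result into the next call.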

Jobs (4 tools)

Manage your generation history:

  • list_jobs — paginated list with filters (status, provider, date range).
  • get_job_text — retrieve the original input text of a completed job.
  • update_job_metadata — add or change tags and collection assignments.
  • delete_job_audio — remove stored audio for a specific job.

Shares (8 tools)

Create and manage shareable audio links — the same sharing features available in the web app (see our sharing guide for the full picture):

  • create_share — create a public, password-protected, or access-code-gated link. Supports snapshot and live modes.
  • list_shares, get_share — browse and inspect your active shares.
  • update_share, revoke_share, bulk_revoke_shares — modify settings or disable access.
  • toggle_share_permanent — control whether shared audio persists indefinitely.
  • update_track_order — reorder tracks in a playlist share.

Access Codes and QR (6 tools)

Distribute controlled access to shared audio:

  • create_access_codes — generate unique codes with optional expiry. Raw codes are returned once only.
  • list_access_codes, update_access_code, delete_access_code — manage individual codes.
  • export_access_codes — CSV-compatible export for bulk distribution.
  • get_qr_code — generate a QR code image (SVG or PNG) for any share link.
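
A common pattern is pairing a code batch with a QR image for the same share. The tool names are from the list above; the argument names (share_id, count, expires_in_days, format) are assumptions.

```python
def access_code_batch(share_id: str, count: int, expires_in_days: int) -> dict:
    # Raw codes are returned once only, so persist the response immediately
    # (or use export_access_codes for a CSV-compatible dump).
    return {"name": "create_access_codes",
            "arguments": {"share_id": share_id, "count": count,
                          "expires_in_days": expires_in_days}}

def qr_for_share(share_id: str, fmt: str = "svg") -> dict:
    # fmt is "svg" or "png" per the get_qr_code description above.
    return {"name": "get_qr_code",
            "arguments": {"share_id": share_id, "format": fmt}}

calls = [access_code_batch("share_abc", 25, 7), qr_for_share("share_abc")]
```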

Library and Storage (12 tools)

Organize your audio library and monitor usage:

  • list_collections, manage_collection — create, rename, or delete audio collections.
  • list_tags — see all tags with usage counts.
  • create_bookmark, list_bookmarks, delete_bookmark — save and organize shared audio from others.
  • manage_bookmark_collection — organize bookmarks into collections.
  • get_storage, list_storage_items, bulk_delete_storage — monitor and manage stored audio files.
  • estimate_cost — check how much a generation would cost before spending credits.
  • get_usage — view your current credit balance and API usage.
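
The estimate_cost and get_usage pair enables an "estimate before you spend" gate, which an agent can apply before any batch job. This is a sketch of that approval logic; the numeric fields (cost, balance) are assumptions about what those tools return.

```python
def should_generate(estimated_cost: float, balance: float,
                    budget_cap: float) -> bool:
    """Approve generation only if both the balance and a per-job cap allow it."""
    return estimated_cost <= balance and estimated_cost <= budget_cap

# Example: a 0.8-credit job against a 10-credit balance and a 5-credit cap.
approved = should_generate(0.8, 10.0, 5.0)
```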

Streaming Delivery via MCP

The generate_speech tool supports a delivery_mode parameter. Set it to "stream" to receive a one-time stream URL instead of waiting for async completion. This is useful when you want to hear audio immediately — the stream URL delivers chunked audio as it's generated.

If you've previously generated the same text with the same voice and settings, the tool may return the cached result instantly — no live stream needed. For more on how streaming works, see our streaming delivery guide.
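
A streaming request is just a generate_speech call with the delivery_mode parameter described above; the other argument names (text, voice_id) are assumptions.

```python
def streaming_generate(text: str, voice_id: str) -> dict:
    return {"name": "generate_speech",
            "arguments": {"text": text, "voice_id": voice_id,
                          # "stream" returns a one-time stream URL;
                          # async delivery instead creates a job to poll.
                          "delivery_mode": "stream"}}

call = streaming_generate("Morning briefing for May 2nd...", "<voice-id>")
```

Note that a cache hit (same text, voice, and settings as a previous job) may short-circuit this entirely and return the finished audio at once.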

Multi-Speaker Generation

The generate_speech tool supports multi-speaker mode for dialogue and conversational content. Set speaker_type: "multi" and provide voice_id_speaker_1 and voice_id_speaker_2 instead of a single voice_id.

This works naturally in a conversational AI context: "Generate this dialogue between two speakers — use a male Google voice for speaker 1 and a female Polly voice for speaker 2."
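
Under the hood, that request is a single generate_speech call shaped roughly like this. The speaker_type, voice_id_speaker_1, and voice_id_speaker_2 parameters are named above; the text argument name and the dialogue formatting are assumptions.

```python
def dialogue_generate(text: str, voice_1: str, voice_2: str) -> dict:
    return {"name": "generate_speech",
            "arguments": {"text": text,
                          "speaker_type": "multi",
                          "voice_id_speaker_1": voice_1,
                          "voice_id_speaker_2": voice_2}}

call = dialogue_generate("Speaker 1: Ready?\nSpeaker 2: Let's go.",
                         "<google-male-voice>", "<polly-female-voice>")
```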

Permissions

Each tool requires a specific permission on your API key. By default, new keys are created with four permissions: tts:generate, tts:status, usage:read, and voices:list. To use the full set of 34 tools — including jobs, shares, library, and storage management — you'll need to add the relevant permissions from the Dashboard when creating or editing your key.

Rate limits are per-account, split into two buckets: read operations (voice search, job status, listing) have higher limits, while generate operations (speech generation, share creation, deletion) have lower limits.

What This Looks Like in Practice

The real power of MCP is that it turns multi-step workflows into single conversations. Here are some examples of what you can ask your AI assistant to do:

Content creation without leaving your editor

You're writing a blog post in Cursor. When you're done, you ask: "Generate an audio version of this article using a calm English female voice, save it as MP3, tag it 'blog-audio', and add it to my 'Website Content' collection." The agent searches voices, picks one that fits, generates the audio, organizes it, and gives you the download link — all without opening a browser.

Research-to-audio pipeline

If your AI client supports web browsing or summarization, you can chain those abilities with TTS tools: "Search for the latest developments in quantum computing, summarize the top 5 findings, then generate an audio briefing of each summary as a separate track. Create a password-protected playlist I can listen to on my commute." The agent handles the research and writing, then uses the MCP tools to generate 5 audio files, bundle them into a playlist share with a password, and hand you the link.

Learning material on demand

If your AI client can generate written content, you can go straight to audio: "Write a 10-part summary of the Roman Empire — each part covering a different era — then generate audio for each using a clear, calm voice. Put them in a collection called 'Roman History' and share the playlist with an access code that expires in 7 days." The agent writes the content, then uses the MCP tools to generate audio, organize it into a collection, and set up controlled distribution with expiring access codes.

Voice auditions at scale

"Search for all English male voices from Google and Polly. Generate a 30-second sample of this paragraph with the top 5 results. Show me the links so I can compare." Instead of clicking through a gallery one by one, you get parallel samples in seconds.

Daily audio briefings

"Take this morning's meeting notes, generate an audio summary I can listen to while walking, and stream it to me now." The agent generates with streaming delivery — you start hearing audio within seconds while the rest is still being processed.

Batch production for teams

"Here are 20 product descriptions. Generate audio for each one using the same voice, tag them all 'product-audio-q2', put them in the 'E-commerce' collection, and give me a cost estimate before you start." The agent estimates first, you approve, then it generates the batch and organizes everything.

Library maintenance

"Show me my storage usage. List all audio older than 30 days that isn't in any collection. Delete the ones tagged 'draft'." The agent audits your library and cleans up — no manual browsing through pages of files.

Error Handling

MCP tool errors are returned as structured JSON with an error code, message, and HTTP status. Common codes you might encounter:

  • VALIDATION_ERROR — malformed input (missing required fields, invalid values).
  • INVALID_VOICE — voice ID not found or not available for the requested configuration.
  • INSUFFICIENT_CREDITS — not enough balance. The message tells you how much you need.
  • STREAM_NOT_SUPPORTED — the selected voice doesn't support streaming delivery.

Your AI assistant will typically interpret these errors and suggest corrections — for example, recommending a different voice if the one you asked for doesn't support streaming.
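
If you are wrapping these tools in your own server-side agent, the structured error codes make recovery logic straightforward. This sketch maps the codes listed above to a suggested next step; the response shape (a dict with a "code" field) is an assumption.

```python
# Suggested recovery per documented error code; anything unrecognized
# should surface the server's message to the user unchanged.
RECOVERY_HINTS = {
    "VALIDATION_ERROR": "fix the malformed field and resend",
    "INVALID_VOICE": "re-run search_voices and pick another voice",
    "INSUFFICIENT_CREDITS": "top up credits or shorten the text",
    "STREAM_NOT_SUPPORTED": "retry with async delivery instead of streaming",
}

def suggest_fix(error: dict) -> str:
    return RECOVERY_HINTS.get(error.get("code", ""),
                              "report the error message to the user")

hint = suggest_fix({"code": "STREAM_NOT_SUPPORTED", "message": "..."})
```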

Getting Started

  1. Get an API key — sign in and create one from the Dashboard (API tab). Enable the permissions you need for your workflow.
  2. Add the MCP config — paste the configuration block into your AI client's MCP settings file (Claude Desktop uses mcpServers wrapper; Cursor/Windsurf use the flat format shown above).
  3. Restart your AI client — the tools appear in the tool list immediately.
  4. Start talking — ask your AI assistant to search voices, generate speech, or manage your library.

The npm package is MIT-licensed and requires Node 18 or later. Full documentation is at aitts.theproductivepixel.com/docs/mcp.


Try it: Install with npm install -g @theproductivepixel/aittsm, add the config to your AI client, and ask it to generate speech. See the MCP documentation for the full tool reference, or browse the API reference for the remote HTTP endpoint details.