Can I generate audio in different formats like MP3 or WAV?

Yes. Using the `tts_bytes` tool, you can specify the `output_format_container` as 'mp3', 'wav', or 'raw', and configure the sample rate and encoding to match your needs.
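
As a rough illustration, the arguments for a `tts_bytes` call might be assembled as shown here. Only `output_format_container` and its three values come from the answer above. The other field names (`transcript`, `output_format_sample_rate`, `output_format_encoding`), the default sample rate and encoding, and the helper `build_tts_args` are assumptions made for this sketch, not the connector's documented schema.

```python
import json

# Container values named in the answer above.
ALLOWED_CONTAINERS = {"mp3", "wav", "raw"}


def build_tts_args(text, container="wav", sample_rate=44100, encoding="pcm_s16le"):
    """Build a hypothetical argument payload for the `tts_bytes` tool."""
    if container not in ALLOWED_CONTAINERS:
        raise ValueError(f"unsupported container: {container!r}")
    return {
        "transcript": text,                        # assumed field name
        "output_format_container": container,      # named in the source
        "output_format_sample_rate": sample_rate,  # assumed field name
        "output_format_encoding": encoding,        # assumed field name
    }


if __name__ == "__main__":
    args = build_tts_args("Hello from Cartesia.", container="mp3")
    print(json.dumps(args, indent=2))
```

Validating the container before the call keeps a typo such as 'mp4' from reaching the tool at all.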

How do I transcribe an existing audio file to text?

Use the `stt_batch` tool. Provide the base64-encoded audio file, the model (e.g., 'ink-whisper'), and the language code to receive a full transcription.
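
A minimal sketch of preparing the input for `stt_batch`, assuming the tool accepts the audio as a base64 string. The field names (`file_base64`, `model`, `language`) and the helper `build_stt_args` are hypothetical; only the tool name and the 'ink-whisper' model come from the answer above.

```python
import base64


def build_stt_args(audio_bytes, model="ink-whisper", language="en"):
    """Build a hypothetical argument payload for the `stt_batch` tool."""
    if not audio_bytes:
        raise ValueError("audio_bytes is empty")
    return {
        # Base64 turns raw bytes into ASCII text that is safe to embed in JSON.
        "file_base64": base64.b64encode(audio_bytes).decode("ascii"),  # assumed field name
        "model": model,        # 'ink-whisper' is the example given in the source
        "language": language,  # assumed field name, e.g. 'en'
    }


if __name__ == "__main__":
    # Stand-in bytes; in practice read them with open(path, "rb").read().
    args = build_stt_args(b"example audio bytes")
    print(args["model"], args["language"], len(args["file_base64"]))
```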

Is it possible to clone a voice using this integration?

Absolutely. The `clone_voice` tool lets you create a new voice model by uploading a short, base64-encoded audio clip of about 5 seconds.
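
Before uploading, it can help to check that a clip is long enough. The sketch below does this for WAV input using only the Python standard library. The 5-second threshold mirrors the approximate figure above and is not a documented hard limit. The field names (`name`, `clip_base64`) and all helper functions are assumptions for illustration.

```python
import base64
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the duration of an in-memory WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def build_clone_args(wav_bytes, name, min_seconds=5.0):
    """Build a hypothetical argument payload for the `clone_voice` tool."""
    duration = wav_duration_seconds(wav_bytes)
    if duration < min_seconds:
        raise ValueError(f"clip is {duration:.1f}s, expected at least {min_seconds:.1f}s")
    return {
        "name": name,                                                # assumed field name
        "clip_base64": base64.b64encode(wav_bytes).decode("ascii"),  # assumed field name
    }


def make_silent_wav(seconds, rate=16000):
    """Create a silent 16-bit mono WAV clip, used here only as sample input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as clip:
        clip.setnchannels(1)
        clip.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        clip.setframerate(rate)
        clip.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_silent_wav(5)
    args = build_clone_args(sample, name="demo-voice")
    print(args["name"], wav_duration_seconds(sample))
```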

Cartesia (Voice AI) MCP Connector for Claude

A+

Generate lifelike AI voices, clone speech, and transcribe audio with Cartesia's state-of-the-art Sonic models directly from your AI agent.

20 tools | Official | Updated Jun 28, 2026 | Official Vinkius Partner


Connect Cartesia to your AI agent to unlock high-performance voice synthesis and speech recognition. Cartesia's Sonic models offer industry-leading low latency and quality for real-time applications.

What you can do

Text-to-Speech (TTS) — Generate high-fidelity audio bytes or stream via SSE using models like Sonic 3.5 and Sonic 3.
Speech-to-Text (STT) — Transcribe audio files into text using the Ink Whisper model with multi-language support.
Voice Cloning — Create custom voice models from as little as 5 seconds of audio input.
Voice Management — List, retrieve, and update voices, or use the Voice Changer to transform existing audio.
Pronunciation Control — Manage custom pronunciation dictionaries for specialized terminology or accents.
Agent Orchestration — List and manage AI agents and monitor call logs and usage credits.

How it works

1. Subscribe to this server
2. Enter your Cartesia API key
3. Start generating audio or transcribing speech from Claude, Cursor, or any MCP-compatible client

Who is this for?

Developers — integrate real-time voice synthesis into applications without managing complex infrastructure.
Content Creators — automate voiceovers and audio localization using high-quality cloned voices.
Product Teams — build conversational AI agents that sound human and respond with sub-second latency.

text-to-speech, speech-to-text, voice-synthesis, low-latency, ai-voice, audio-streaming

Related Connectors

Fathom MCP

20 tools Official

Privacy-first website analytics — track visitors, monitor real-time traffic, and manage sites and events directly from your AI agent.

A+

Exa MCP

10 tools Official

Find exactly the web content you need with semantic search that understands context and returns high-quality curated results.

A+

Pixabay MCP

10 tools Official

Search and retrieve royalty-free stock images, vectors, illustrations, and videos via AI directly from Pixabay.

A+

AlisQI MCP

10 tools Official

Quality management orchestration — manage analysis sets, results, and QMS data via AI.

A+