Eleven v3

Eleven v3 is the flagship text-to-speech and voice-cloning model from ElevenLabs, representing a step-change in expressiveness over the earlier Multilingual v2 generation through improved handling of emotional inflection, pacing, breath, and laughter. It is accessible through the ElevenLabs API, ElevenLabs Studio, and a growing set of creator-platform integrations, including Riverside, Spotify dubbing partnerships, and embedded audiobook and podcast workflows. As of April 2026, Eleven v3 is widely characterized as the highest-quality commercially available text-to-speech model by expressiveness and naturalness, covering both TTS and voice-cloning workflows from a single model surface.

At a glance

  • Lab: ElevenLabs
  • Released: 2024 to 2025 (v3 generation, following Multilingual v2; iterative releases through the period)
  • Modality: Audio (text-to-speech, voice cloning, dubbing)
  • Open weights: No (closed)
  • Output formats: MP3, WAV, PCM, OGG; sample rates from 22,050 Hz to 44,100 Hz depending on tier
  • Languages supported: 32 languages including English, Spanish, French, German, Mandarin, Japanese, Hindi, Portuguese, Italian, Polish, Arabic, and others
  • Pricing: Free tier (10,000 characters per month); Starter ($5/month, 30,000 characters); Creator ($22/month, 100,000 characters); Pro ($99/month, 500,000 characters); Scale ($330/month, 2,000,000 characters); Business ($1,320/month, 10,000,000 characters); Enterprise (custom). Additional characters purchasable at per-tier rates.
  • Distribution channels: ElevenLabs API (https://elevenlabs.io), ElevenLabs web app, ElevenLabs Studio, Conversational AI platform, Riverside integration, Spotify dubbing workflows, third-party audiobook and content-creation platform integrations

Origins

ElevenLabs was founded in 2022 by Mati Staniszewski and Piotr Dabkowski, former senior engineers at Palantir Technologies and Google respectively, with headquarters in London and operations in Warsaw. The founding thesis centered on building AI voice synthesis with substantially higher emotional expressiveness and multilingual capability than existing text-to-speech offerings, which at the time were dominated by cloud-platform TTS products with limited voice naturalness.

The version lineage progressed through several named releases. Eleven v1 was the initial public model. Turbo v1 and Turbo v2 addressed latency for real-time deployment. Multilingual v1 extended language coverage. Multilingual v2, released in 2023, became the principal commercial product for 2023 and 2024, supporting 29 languages. Eleven v3 followed as the next-generation release, with improved expressiveness in emotional contexts and more reliable handling of paralinguistic features, including breath, laughter, and pacing, that make synthesized speech sound substantially less synthetic.

The commercial trajectory during this period was strong. ElevenLabs achieved broad adoption across audiobook narration, podcast production, content creator tooling, and enterprise media localization. The company reached approximately $330 million annualized revenue at the end of 2025, raised a $500 million Series D in February 2026 at an $11 billion valuation led by Sequoia Capital, and began IPO-track preparation under CEO Staniszewski. Revenue was split approximately 50-50 between enterprise and consumer customers as of December 2025.

Capabilities

Eleven v3's primary capability is text-to-speech synthesis with substantially improved expressiveness over the Multilingual v2 generation. The model handles emotional inflection in ways that prior versions did not: a sentence written to sound joyful, anxious, or sad produces synthesized speech with inflection patterns, pacing, and timbre appropriate to the sentiment, rather than flat delivery at consistent pitch and speed. Breath placement, laughter, pauses, and other paralinguistic features that make speech sound natural are generated more reliably by v3 than by its predecessors. Industry coverage and creator-community evaluations consistently described the v2-to-v3 transition as the most significant expressiveness improvement in the product's history.

Voice cloning is the second major capability. Eleven v3 supports two cloning modes. Instant cloning generates a usable voice from a short recorded sample, typically under one minute of audio, making it practical for rapid workflows. Professional voice cloning uses longer recording sessions to produce higher-fidelity clones with more consistent timbre and expressiveness across extended narration. Both modes are governed by ElevenLabs platform policies, including voice-actor consent verification requirements and the company's anti-abuse measures for preventing unauthorized voice replication.
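As a rough illustration of the instant-cloning workflow, the sketch below assembles (but does not send) a multipart upload request for a short voice sample. The endpoint path (`/v1/voices/add`), header name, and field names are assumptions modeled on ElevenLabs's public API conventions; verify them against the current API reference before use.

```python
# Sketch of an instant voice clone request against the ElevenLabs API.
# Endpoint path and field names are assumptions; check the live API docs.

def build_clone_request(api_key: str, voice_name: str, sample_paths: list[str]) -> dict:
    """Assemble the pieces of a multipart instant-clone request."""
    return {
        "url": "https://api.elevenlabs.io/v1/voices/add",
        "headers": {"xi-api-key": api_key},           # per-account API key
        "data": {"name": voice_name},                 # display name for the clone
        "files": [("files", path) for path in sample_paths],  # short audio samples
    }

# A single sample of under a minute is typically enough for instant cloning.
req = build_clone_request("YOUR_KEY", "narrator-demo", ["sample1.mp3"])
```

The returned pieces would be passed to an HTTP client as a multipart POST; professional cloning follows the same shape with longer recording sessions.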

Multi-speaker dialogue synthesis allows generation of audio content with multiple distinct speaker voices in a single workflow, without manually stitching separate generations together. This is useful for podcast production, audiobook narration with multiple characters, and enterprise content localization requiring consistent character voices across long productions.

Dubbing is a workflow product built on the Eleven v3 synthesis engine. The ElevenLabs Dubbing product takes a source audio or video file, translates speech into a target language, and synthesizes the translated speech in the source speaker's voice characteristics, allowing content creators and enterprise media customers to localize long-form content into up to 32 languages while preserving original voice identity. Spotify's podcast dubbing partnership, which has translated selected podcasts into multiple languages in the original host's voice, is a high-profile commercial use of this capability.

Voice design, introduced with the v3 generation, allows users to generate a novel synthetic voice through text description or interactive controls adjusting parameters such as gender, age, and accent, rather than cloning from a recorded sample. Voice designs are stored in ElevenLabs's Voice Library and reused across generations.

The Conversational AI platform extends Eleven v3 into interactive voice-agent applications, combining TTS synthesis with speech recognition and language model inference for real-time use cases including customer-service automation and AI companion applications.

Benchmarks and standing

Voice synthesis benchmarking is less standardized than text-generation benchmarking. There is no widely accepted equivalent to LMArena or GPQA Diamond for TTS. The principal evaluation approach is Mean Opinion Score (MOS), in which panels rate generated speech on naturalness, expressiveness, and voice quality. MOS comparisons are subjective and depend on evaluator panel composition, test material, and reference conditions; cross-lab comparisons are not reliable unless conducted by independent parties using standardized methodology.
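Mechanically, MOS is just the mean of listener ratings on a fixed scale (typically 1 to 5), and the panel-dependence noted above is why a confidence interval matters when comparing systems. A minimal sketch, using a normal-approximation interval:

```python
import math
import statistics

def mos(ratings: list[float]) -> tuple[float, float]:
    """Mean Opinion Score plus a 95% confidence half-width
    (normal approximation). Ratings run 1 (bad) to 5 (excellent)."""
    mean = statistics.fmean(ratings)
    # Sample standard deviation scaled by sqrt(n) gives the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

score, ci = mos([4, 5, 4, 3, 5, 4, 4, 5])  # -> 4.25 ± 0.49
```

Two systems are meaningfully distinguishable only when their intervals do not overlap, which is rarely the case in small cross-lab panels.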

ElevenLabs's commercial standing provides an indirect indicator. The company's reported $330 million annualized revenue, its position as the highest-valued AI voice-synthesis company at $11 billion, and widespread characterization in industry coverage as the leading commercial voice-synthesis platform globally all reflect a product-market position that benchmark numbers alone cannot capture. Creator-community side-by-side evaluations through 2024 and 2025 consistently placed Eleven v3 ahead of OpenAI's TTS API and cloud-platform offerings (Azure Speech, Google Cloud TTS) on expressiveness and naturalness, with closer comparisons on specific dimensions to Cartesia and Hume EVI.

The benchmark gap across peer offerings has been narrowing. Cartesia's Sonic model targets low-latency streaming deployment and has received strong evaluations for real-time voice-agent use cases. Hume EVI incorporates prosody modeling that rivals Eleven v3 on emotional expressiveness in some evaluations. OpenAI's TTS and GPT-4o voice mode leverage the OpenAI platform integration advantage. None of these peers combines ElevenLabs's voice-cloning depth, language coverage, and dubbing workflow capability in a single commercial offering as of April 2026.

Benchmark leadership in voice synthesis is point-in-time, and evaluations vary by test material, language, and speaker style. Numbers and characterizations here reflect publicly available community evaluations and industry coverage as of April 2026.

Access and pricing

Eleven v3 is available through several access channels, all served from the ElevenLabs platform at https://elevenlabs.io.

The ElevenLabs API provides programmatic TTS synthesis, voice-cloning management, and dubbing workflow access. Output format options include MP3, WAV, PCM, and OGG; the API supports streaming for low-latency applications. ElevenLabs Studio is the long-form production interface for audiobook narration and podcast production, with a document-level view for assigning voices to text segments and managing multi-speaker projects.
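The synthesis call above can be sketched as follows. This builds (but does not send) the request; the `model_id` value `"eleven_v3"`, the `output_format` string, and the streaming-path convention are assumptions based on ElevenLabs's published API patterns, so confirm exact identifiers in the current API reference.

```python
import json

# Sketch of a text-to-speech request against the ElevenLabs API.
# model_id and output_format values are assumptions; verify in the API docs.

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(api_key: str, voice_id: str, text: str,
                      output_format: str = "mp3_44100_128") -> dict:
    """Assemble URL, headers, and JSON body for a synthesis call.
    Appending "/stream" to the path enables chunked low-latency streaming."""
    return {
        "url": f"{API_BASE}/text-to-speech/{voice_id}?output_format={output_format}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "body": json.dumps({
            "text": text,
            "model_id": "eleven_v3",  # assumed v3 model identifier
        }),
    }

req = build_tts_request("YOUR_KEY", "some-voice-id", "Hello there!")
```

POSTing the assembled request returns raw audio bytes in the requested format, which can be written straight to a file or piped to a player.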

Subscription tiers as of April 2026: Free (10,000 characters/month, non-commercial); Starter ($5/month, 30,000 characters); Creator ($22/month, 100,000 characters, commercial license); Pro ($99/month, 500,000 characters); Scale ($330/month, 2 million characters); Business ($1,320/month, 10 million characters); Enterprise (custom). Professional voice cloning is available from Creator tier and above.
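Given a projected monthly character volume, the tier table above reduces to a simple lookup. This sketch picks the cheapest tier whose included quota covers the budget, ignoring per-character overage pricing (which varies by tier):

```python
# Cheapest subscription tier covering a monthly character budget,
# per the April 2026 tier table. Overage add-on pricing is ignored.

TIERS = [  # (name, USD/month, included characters), sorted by price
    ("Free", 0, 10_000),
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
    ("Scale", 330, 2_000_000),
    ("Business", 1_320, 10_000_000),
]

def cheapest_tier(chars_per_month: int) -> str:
    """Lowest-priced tier whose included quota covers the budget."""
    for name, _price, quota in TIERS:
        if chars_per_month <= quota:
            return name
    return "Enterprise"  # custom pricing beyond the Business quota

# e.g. a 90,000-character monthly workload lands on the Creator tier.
```

Note that the Free tier's quota is non-commercial, so commercial work starts at Starter regardless of volume.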

Creator-platform integrations include Riverside (podcast recording and transcript-based editing), Spotify podcast dubbing workflows, and additional audiobook-production and video-production tool integrations.

Comparison

The peer set for Eleven v3 in the voice-synthesis category as of April 2026:

  • OpenAI TTS and GPT-4o Voice. OpenAI's hosted TTS endpoint and the real-time voice mode built into GPT-4o. OpenAI's TTS API offers a smaller voice library and simpler cloning capability, but benefits from tight integration with OpenAI's language models. GPT-4o Voice Mode competes with ElevenLabs Conversational AI on real-time interaction. Whisper, OpenAI's speech-to-text model, is complementary to Eleven v3 rather than a competitor.
  • Cartesia Sonic. Cartesia's Sonic model targets low-latency streaming TTS for voice-agent and real-time applications. Cartesia has received strong evaluations for latency performance and voice naturalness in conversational contexts. Sonic does not offer voice-cloning depth or dubbing workflow capability at ElevenLabs's level. Cartesia is a smaller company with a narrower product surface.
  • Hume AI EVI. Hume's Empathic Voice Interface incorporates prosody modeling that attempts to match synthesized speech expressiveness to conversational context, with listener emotional-response modeling as a design criterion. EVI competes with Eleven v3 on expressiveness benchmarks in some evaluations, particularly for emotionally varied conversational content. Hume's product is positioned specifically for conversational AI rather than general TTS or dubbing.
  • Resemble AI. A voice-cloning and TTS provider that preceded ElevenLabs's commercial rise and competes on voice-cloning depth. Resemble AI offers a broader range of audio manipulation features, including deepfake-detection tooling. On overall expressiveness and language coverage, industry coverage places Eleven v3 ahead as of 2025 to 2026.
  • Microsoft Azure Neural TTS. Microsoft's cloud-platform voice synthesis service, with a very large voice library and broad language coverage. The expressiveness gap relative to Eleven v3 has been cited consistently in creator-community evaluations; Azure's primary advantage is enterprise-platform integration and compliance infrastructure.
  • Google Cloud Text-to-Speech. Google's cloud TTS service competing primarily on language breadth and platform integration. Evaluated by the creator community as behind Eleven v3 on expressiveness for English and major European languages.

Voice synthesis is distinct from music generation: Suno and Udio generate complete songs and are not competitors in this category. Eleven v3 is similarly distinct from speech-to-text: Whisper and ElevenLabs Scribe handle transcription, the inverse workflow.

Outlook

Open questions shaping Eleven v3's trajectory through 2026 and beyond:

  • Eleven v4 timing and capability. ElevenLabs's model release cadence has accelerated over time. Whether Eleven v4 targets a specific capability gap (lower latency, expanded language coverage, better acoustic realism in challenging conditions, or voice-creation capability) and when it ships will determine how quickly the competitive field can close on v3's expressiveness advantages.
  • Agentic voice integration. The Conversational AI platform positions ElevenLabs in the broader market for voice-based AI agents, which involves competing with OpenAI, Google, and purpose-built conversational AI platforms. Whether ElevenLabs can maintain voice-quality differentiation while adding the language-model and tool-calling infrastructure necessary for full agentic-voice applications is a key strategic question.
  • Open-source competitive pressure. Community forks of Coqui TTS, XTTS v2, and OpenVoice from MyShell AI offer free self-hosted TTS and voice cloning at improving quality levels. How quickly open-weights alternatives close the expressiveness gap will affect ElevenLabs's pricing leverage, particularly in developer and prosumer segments.
  • Regulatory pressure on voice cloning. Legislative activity in the United States (biometric voice data protections, AI disclosure requirements), the European Union (AI Act synthetic media provisions), and other jurisdictions may impose consent-verification or usage restrictions that affect voice-cloning product design and liability.
  • Dubbing market competitive dynamics. HeyGen, Deepdub, and cloud-platform TTS providers are competing in AI dubbing for long-form video and podcast content. ElevenLabs's early-mover position via the Spotify partnership and audiobook-localization market may be difficult to maintain as larger platforms compete on enterprise relationships.
  • IPO timing. Staniszewski publicly indicated IPO-track preparation following the February 2026 Series D. The timing and structure of a public offering will affect capital availability and the competitive investment ElevenLabs can sustain in model research.

About the author
Nextomoro

AI Research Lab Intelligence

nextomoro tracks progress for AI research labs, models, and what's next.
