Whisper
Whisper is OpenAI's automatic speech recognition (ASR) system, released in September 2022 under the MIT license. It transcribes audio in approximately 100 languages and translates speech from 99 of those languages into English. It is available as open-weights software on GitHub and Hugging Face, and as a hosted inference endpoint through the OpenAI API. As of April 2026, Whisper remains the most widely deployed speech-to-text model globally, embedded in thousands of third-party applications, developer workflows, and transcription products.
At a glance
- Lab: OpenAI
- Released: September 2022 (v1); large-v2: December 2022; large-v3: November 2023; large-v3-turbo: October 2024
- Modality: Audio (speech-to-text transcription and speech translation)
- Open weights: Yes (MIT license)
- Languages supported: Approximately 100 (transcription); 99 source languages translated to English
- Audio input length: 30 seconds per chunk natively; longer audio is segmented automatically by most inference wrappers
- Pricing: Free for self-hosted deployments; OpenAI hosted API at $0.006 per minute of audio (as of April 2026)
- Distribution channels: GitHub (https://github.com/openai/whisper), Hugging Face, OpenAI API (api.openai.com/v1/audio/transcriptions), Replicate, fal.ai, Together AI, and embedded in a large number of third-party transcription products
Origins
OpenAI released Whisper v1 in September 2022, publishing model weights and a research paper simultaneously. The model was trained on 680,000 hours of multilingual and multitask supervised data collected from the web, a corpus substantially larger and more diverse than anything previously used for open-weights ASR. The central design choice was breadth: Whisper was not optimized for a single domain or language but trained across conditions, accents, noise environments, and approximately 100 languages.
The MIT license applied to Whisper was notable at the time. OpenAI's other prominent models, including GPT-3 and DALL-E, were closed weights. Releasing Whisper under a permissive commercial license signaled a different approach for that product line, with large downstream effects on the commercial transcription market.
The version sequence reflects iterative improvement on the same architecture. Whisper large-v2 (December 2022) reduced word error rates across several benchmark languages. Whisper large-v3 (November 2023) continued accuracy improvements with additional training data and architecture refinements. Whisper large-v3-turbo (October 2024) addressed latency: it uses a pruned decoder that runs roughly four times faster than large-v3 while retaining most of its accuracy, making real-time deployment more practical.
The structural impact on the speech-recognition market was significant. Startups building proprietary ASR models shifted to Whisper-based products. Enterprise buyers who had been paying for Nuance or Google STT licenses found that self-hosted Whisper could match their vendor's accuracy at a fraction of the cost. Products such as Otter.ai, Descript, and dozens of smaller transcription services moved to Whisper as a backend or alongside fine-tuned variants. The standalone speech-recognition vendor market contracted sharply in the two years following the release.
Capabilities
Whisper's primary capabilities are automatic speech recognition, speech translation, language identification, and robust handling of diverse audio conditions.
Speech recognition is the model's core task: converting audio input to text transcripts. Whisper operates on 30-second audio chunks encoded as log-mel spectrograms, which are processed by a transformer encoder-decoder architecture. The model outputs text with punctuation and capitalization. Timestamps can be generated at the word or segment level, enabling subtitle generation and audio-aligned transcripts. For audio longer than 30 seconds, inference wrappers segment the audio and stitch outputs together, with most implementations handling this automatically.
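A minimal transcription sketch using the reference openai-whisper Python package (the file path meeting.mp3 is a placeholder):

```python
# pip install openai-whisper  (also requires ffmpeg on the system PATH)
import whisper

model = whisper.load_model("large-v3")  # or "tiny", "base", "small", "medium"

# transcribe() segments audio longer than 30 seconds internally and
# stitches the chunk outputs back together.
result = model.transcribe("meeting.mp3")  # placeholder path
print(result["text"])

# Segment-level timestamps come back alongside the text, suitable for subtitles.
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text']}")
```

In recent versions of the reference package, passing word_timestamps=True to transcribe() enables the word-level alignment mentioned above.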
Speech translation is a capability distinct from transcription: Whisper can take audio in any of 99 supported source languages and produce English text output. This makes it possible to transcribe and translate a non-English recording in a single inference pass, with no separate translation step. Translation quality is sufficient for comprehension, though it does not match dedicated neural machine translation systems on formal or technical text.
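In the reference package, translation is the same call with a different task flag. A sketch assuming a French-language recording (interview_fr.mp3 is a placeholder path):

```python
import whisper

model = whisper.load_model("large-v3")

# task="translate" produces English text regardless of the source language;
# the default, task="transcribe", keeps the output in the spoken language.
result = model.transcribe("interview_fr.mp3", task="translate")  # placeholder path
print(result["text"])  # English text for the French audio
```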
Language identification is a byproduct of the model's multitask training. Whisper identifies the language of an audio segment before transcribing, which means the correct language model is applied automatically without the caller needing to specify the language. This is significant for applications that process diverse audio inputs.
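The lower-level API exposes this step directly. A sketch following the pattern in the openai/whisper README (clip.mp3 is a placeholder path):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the model's native 30-second window.
audio = whisper.load_audio("clip.mp3")  # placeholder path
audio = whisper.pad_or_trim(audio)

# Compute the log-mel spectrogram the encoder expects; n_mels differs
# between large-v3 (128 bins) and earlier variants (80 bins).
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect_language scores one chunk against all supported languages.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```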
Noise robustness and accent tolerance are among Whisper's most consistently cited practical advantages. The training corpus included audio from diverse environments, speakers, and recording conditions, which means the model does not degrade as sharply as narrowly trained ASR systems when applied to telephone audio, conference calls with multiple speakers, accented speech, or recordings with background noise.
Model sizes span five variants: tiny, base, small, medium, and large (with large-v2, large-v3, and large-v3-turbo sub-variants). The smallest variants (tiny, base) run on modest consumer hardware and process audio faster than real-time, at the cost of higher word error rates on difficult material. The large variants achieve the lowest word error rates but require GPU memory and compute proportional to their size. This range makes Whisper deployable across a wide set of hardware contexts, from embedded applications to server-side batch transcription.
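A sketch of hardware-conditional variant selection; the VRAM threshold and size choices below are illustrative assumptions, not official guidance:

```python
import torch
import whisper

# Illustrative policy: small models for CPU-only machines, the largest
# model only when GPU memory comfortably allows it.
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    size = "large-v3" if vram_gb >= 10 else "medium"  # threshold is an assumption
    device = "cuda"
else:
    size = "base"  # tiny/base run faster than real time on modest CPUs
    device = "cpu"

model = whisper.load_model(size, device=device)
```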
Real-time deployment requires additional infrastructure because the native model processes fixed 30-second chunks rather than streaming audio continuously. Derivative implementations, including faster-whisper (CTranslate2-optimized inference), whisper.cpp (a C++ port), and WhisperX (word-level alignment and speaker diarization), have made streaming and low-latency deployment practical and are now standard components of production transcription pipelines.
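As an example of this derivative tooling, a sketch using faster-whisper (call.wav is a placeholder path):

```python
# pip install faster-whisper  (CTranslate2-based reimplementation)
from faster_whisper import WhisperModel

# float16 on GPU; compute_type="int8" is the usual CPU-only choice.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments, convenient for long audio.
segments, info = model.transcribe("call.wav", beam_size=5)  # placeholder path
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```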
Benchmarks and standing
The standard measure of speech recognition accuracy is word error rate (WER) on held-out test sets. LibriSpeech, built from audiobook recordings, is the primary English benchmark, split into clean and other conditions. Whisper large-v3 achieves a WER of 2.7% on LibriSpeech test-clean, close to the point where further improvement on that dataset is generally considered marginal. On test-other, the more difficult split, Whisper large-v3 reaches a WER of 5.2%.
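WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference transcript. A small worked example using the third-party jiwer library (an illustrative choice; any edit-distance implementation gives the same number):

```python
# pip install jiwer  (a common third-party WER implementation)
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions ("jumps" -> "jumped", "the" -> "a") over nine
# reference words: WER = 2 / 9 ≈ 0.222.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```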
For multilingual evaluation, FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a standard benchmark covering 102 languages. Whisper large-v3 achieves a mean WER of 13.9% across the FLEURS language set. Common Voice, Mozilla's crowd-sourced multilingual speech dataset, provides another multilingual benchmark, where Whisper large-v3 similarly performs competitively across high-resource and medium-resource languages, with larger gaps on very low-resource languages.
These numbers are a snapshot from the 2023 to 2024 timeframe, and Whisper is no longer the leader on every benchmark axis. Google's Universal Speech Model (USM), Meta's Massively Multilingual Speech (MMS), and commercial services including ElevenLabs Scribe and Deepgram have equaled or surpassed Whisper on specific benchmark sets, particularly for lower-resource languages, domain-adapted transcription, and speaker diarization. AssemblyAI's hosted models achieve lower WER than off-the-shelf Whisper on English-language tasks where they have been fine-tuned on domain-specific data.
Despite these benchmark developments, Whisper remains the de facto deployment standard. The combination of open weights, MIT licensing, broad language coverage, strong accuracy on mainstream use cases, and a large ecosystem of derivative tools means that the practical adoption gap between Whisper and its benchmark competitors is larger than the accuracy numbers suggest. For most production transcription use cases, Whisper large-v3 or large-v3-turbo is the default starting point.
Benchmark leadership in speech recognition is point-in-time, and the methods, datasets, and scoring conditions vary across evaluations. Numbers cited here reflect publicly reported evaluations as of April 2026.
Access and pricing
Whisper is available through three primary channels, with additional distribution through third-party platforms.
The GitHub repository at https://github.com/openai/whisper is the canonical open-weights release: it provides model weights for all five size variants and the Python inference code, all under the MIT license for commercial and non-commercial use. Hugging Face hosts the same weights at https://huggingface.co/openai, integrated with the Transformers library, which is the standard path for ML pipelines.
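A sketch of the Transformers path (podcast.mp3 is a placeholder path; chunk_length_s handles audio longer than Whisper's native 30-second window):

```python
# pip install transformers torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # segment long-form audio into Whisper's native window
)

result = asr("podcast.mp3", return_timestamps=True)  # placeholder path
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```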
The OpenAI hosted API at https://api.openai.com/v1/audio/transcriptions provides a managed inference endpoint at $0.006 per minute of audio (April 2026). A companion endpoint at /v1/audio/translations handles speech translation. Third-party inference platforms including Replicate, fal.ai, and Together AI host Whisper with comparable or lower per-minute pricing for teams that need managed deployment outside the OpenAI ecosystem.
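A sketch of the hosted-API path using the official OpenAI Python SDK, assuming OPENAI_API_KEY is set in the environment (meeting.mp3 is a placeholder path):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:  # placeholder path
    transcript = client.audio.transcriptions.create(
        model="whisper-1",       # hosted Whisper model identifier
        file=audio_file,
        response_format="text",  # also "json", "srt", "verbose_json", "vtt"
    )
print(transcript)

# The companion translation endpoint takes the same file parameter:
# client.audio.translations.create(model="whisper-1", file=audio_file)
```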
Comparison
The peer set for Whisper in April 2026 comprises the leading speech-to-text systems available through open-weights releases and commercial APIs:
- ElevenLabs Scribe. ElevenLabs' transcription model, positioned as matching or exceeding Whisper large-v3 on several accuracy benchmarks, with speaker diarization and low-latency streaming. Available through the ElevenLabs API. Whisper's open-weights MIT license is the primary differentiator: Scribe is closed, and self-hosted deployment is not available.
- AssemblyAI. A commercial transcription API with models fine-tuned for domains including medical, legal, and call-center audio. AssemblyAI achieves lower WER than off-the-shelf Whisper on English tasks in its trained domains. No open-weights offering.
- Deepgram. A commercial speech-to-text API notable for low-latency streaming transcription. Deepgram's Nova models target real-time applications including voice agents and live captioning, a use case where Whisper's 30-second chunk architecture requires wrapping infrastructure. Closed weights, API-based pricing.
- Google Universal Speech Model (USM). Google's large-scale speech model, trained on 12 million hours of audio across 300 languages, underpinning Google Cloud Speech-to-Text. Competitive with or ahead of Whisper large-v3 on lower-resource languages. Not open weights.
- Meta Massively Multilingual Speech (MMS). Meta's speech model family, covering over 1,100 languages with open weights under a CC-BY-NC 4.0 license. MMS is frequently the better choice for very low-resource languages outside Whisper's scope. The non-commercial license limits substitutability in commercial deployments.
Whisper's distinctive position in this peer set is defined by three factors: open weights under a permissive commercial license, broad language coverage at high accuracy, and an ecosystem of optimized derivative implementations that have extended its deployment range beyond what the base OpenAI release supports.
Outlook
Several open questions shape Whisper's trajectory in 2026 and beyond:
- Successor cadence. No Whisper v4 has been announced as of April 2026. The large-v3-turbo release in October 2024 was the most recent update, focused on latency reduction rather than accuracy advancement. Whether OpenAI views the speech-recognition problem as sufficiently solved for the open-weights line, or plans a Whisper v4 with different architecture or training scope, is not public.
- Relationship to multimodal audio. GPT-4o and its successors incorporate native audio understanding that subsumes some Whisper use cases, particularly for voice-interface applications where the model needs to hear and respond. As multimodal models capable of real-time speech understanding become standard, the distinction between a standalone transcription model and an integrated audio-language model becomes less clear. This may affect Whisper's role: it could remain the standard for offline, high-volume, or open-source transcription while multimodal models handle interactive voice applications.
- Ecosystem derivative development. The largest extensions of Whisper's practical capability have come from outside OpenAI: faster-whisper, whisper.cpp, WhisperX, and other implementations have substantially improved Whisper's throughput, latency, and feature set. Whether the open-source ecosystem continues to maintain and extend these tools as commercial alternatives improve is an ongoing question for the model's long-term relevance as a deployed baseline.
- Competitive pressure from open-weights alternatives. Meta's MMS covers more languages under an open license, and additional open-weights ASR models may emerge from other labs. The combination of open weights, broad language support, and permissive commercial licensing that made Whisper distinctive in 2022 is an increasingly contested position.
Sources
- OpenAI: Introducing Whisper. Original research blog post, September 2022.
- GitHub: openai/whisper. Open-weights model repository, MIT license, with all model variants and inference code.
- Hugging Face: OpenAI Whisper models. Model weights and Transformers integration for Whisper large-v3.
- OpenAI API: Speech to text. Hosted API documentation for /v1/audio/transcriptions and /v1/audio/translations.
- Wikipedia: Whisper (speech recognition system). Model history, version progression, and benchmark context.
- OpenAI Whisper technical report: Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision," 2022. Training methodology, dataset, and evaluation results across LibriSpeech, FLEURS, and Common Voice.
- faster-whisper on GitHub. CTranslate2-based Whisper implementation with substantially improved inference speed.