Sesame is a voice-AI research and product company building speech-to-speech conversational models, voice characters, and smart-glasses hardware on a shared speech-AI foundation. Founded in 2024 by Brendan Iribe, the co-founder and former chief executive of Oculus VR, and Ankit Kumar, it has become one of the most visible insurgent voice-AI companies, competing with ElevenLabs, OpenAI's realtime API, and Kyutai on conversational-voice quality. The company has released its Conversational Speech Model on GitHub under an Apache 2.0 licence, distinguishing itself from the closed-weight voice-AI competition through an open-research posture.
At a glance
- Founded: 2024.
- Status: Private.
- Funding: Series A round disclosed; specific amounts and lead investor not publicly itemised in primary sources. Reported backers include Andreessen Horowitz and Spark Capital based on industry coverage.
- Co-founders: Brendan Iribe (co-founder of Oculus VR), Ankit Kumar.
- Notable technical team: Johan Schalkwyk (formerly of Google's speech-recognition team), Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang.
- Open weights: Yes. Conversational Speech Model (CSM) released under Apache 2.0 at SesameAILabs/csm.
- Flagship products: Maya and Miles voice characters; the underlying Conversational Speech Model (CSM) in Tiny (1B backbone / 100M decoder), Small (3B / 250M), and Medium (8B / 300M) sizes; smart-glasses hardware in development.
Origins
Sesame was founded in 2024 by Brendan Iribe and Ankit Kumar around a thesis that conversational voice was the next major human-computer interaction modality and that no existing voice-AI product had crossed what the company called "the uncanny valley of voice." Iribe is the co-founder and former chief executive of Oculus VR, the virtual-reality company acquired by Facebook (now Meta) in 2014 for approximately $2 billion. He left Facebook in 2018 and spent the intervening years exploring early-stage technology investments before co-founding Sesame.
The technical team that joined the founders includes several speech-AI researchers with strong prior credentials. Johan Schalkwyk previously led significant portions of Google's speech-recognition work, including the line of systems behind Google Voice Search, on-device speech recognition on the Pixel line, and the Google Translate speech-to-speech pipeline. The remaining technical hires came from a mix of academic speech-and-audio labs and prior industry speech-AI roles.
Sesame's founding thesis is distinct from the other major voice-AI insurgents along two axes. First, the company has framed its mission around emotional intelligence in voice ("voice presence") rather than around the more standard text-to-speech-quality framing that ElevenLabs and the broader TTS-incumbent category emphasise. Second, the company has consistently positioned itself as building hardware alongside the model, with smart-glasses described publicly as the eventual product surface for the company's voice-AI capability.
Mission and strategy
Sesame's mission is to build AI companions with what the company calls "voice presence," interactions that feel "real, understood, and valued" to the user. The strategic premise is that the current generation of voice AI (Siri, Alexa, the consumer-product voice assistants from the 2010s) failed to achieve consumer-grade conversational engagement because the models did not produce voice output with the paralinguistic richness humans expect in conversation, and that the foundation-model generation of speech AI can now address the gap.
The company's go-to-market is staged. The Conversational Speech Model and the Maya/Miles voice characters are demo and developer-tool releases that establish the model's capability publicly and recruit the developer and research community. The eventual consumer product appears to be smart glasses with always-available voice AI as the primary interaction surface, a direction that draws on Iribe's hardware experience at Oculus and parallels the Meta Ray-Ban Smart Glasses line that his former employer is pursuing in the same market.
The open-weights strategy is unusual for a voice-AI startup at frontier scale. Sesame's CSM release under Apache 2.0 is the broadest open-research release of a conversational speech model from a well-funded insurgent company, and the framing emphasises shared progress on the underlying speech-AI capability rather than proprietary moat building. The competitive implication is that Sesame's commercial differentiation is intended to be at the hardware-product layer rather than at the model layer.
Models and products
- Conversational Speech Model (CSM). A multimodal text-and-audio transformer operating on residual-vector-quantised (RVQ) tokens via the Mimi tokenizer (a single-codebook semantic plus N-1 acoustic codebooks at 12.5 Hz). Three sizes released: Tiny (1B backbone, 100M decoder), Small (3B, 250M), Medium (8B, 300M). The two-transformer architecture is Llama-based on both the backbone and decoder sides. Compute amortisation through random 1/16-frame decoder subsampling during training keeps wall-clock training cost manageable at scale. Trained on approximately 1 million hours of English audio. Single-stage architecture (input audio context to output audio in one pass) rather than the two-stage TTS-plus-LLM pipeline that most competitors use.
- Maya and Miles voice characters. Two demos of the CSM Medium model with distinct voice personas, hosted on Sesame's research site. The two voices are the company's most-visible public surface and the closest thing to a current consumer product release.
- Smart-glasses hardware (in development, no public launch date). Public commentary from the company has framed this as the primary commercial vector for the voice-AI capability stack, with the model and the developer-tools releases positioned as the technical and recruiting foundation rather than the commercial product.
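The compute-amortisation scheme mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Sesame's implementation: the 1/16 fraction and the 12.5 Hz Mimi frame rate come from the release materials, but the function name and the exact subsampling mechanics here are invented for illustration.

```python
import random

def decoder_training_frames(num_frames: int, fraction: float = 1 / 16,
                            seed: int = 0) -> list[int]:
    """Pick the random subset of frames on which the RVQ audio decoder
    computes its training loss. The backbone transformer still processes
    every frame; only the acoustic-codebook decoder is amortised, which
    is what keeps wall-clock training cost manageable at scale."""
    rng = random.Random(seed)
    k = max(1, int(num_frames * fraction))  # at least one frame per clip
    return sorted(rng.sample(range(num_frames), k))

# A 2-minute utterance at the Mimi frame rate of 12.5 Hz is 1500 frames;
# the decoder then trains on int(1500 / 16) = 93 of them.
frames = decoder_training_frames(1500)
print(len(frames))  # 93
```

The design trade-off the sketch makes visible: the decoder's per-frame cost is paid on only ~6% of frames, while the backbone's full-sequence context is preserved for every prediction.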
Benchmarks and standing
Sesame's release materials introduced several new benchmarks rather than competing on the standard text-to-speech metrics (which the company argues do not capture the conversational-quality dimensions that matter for AI companions).
- Homograph Disambiguation. Tests the model's ability to choose between alternate pronunciations of words spelled identically ("lead" the metal versus "lead" the verb, "bass" the fish versus "bass" the instrument, "tear" the noun versus the verb, "wound" the injury versus the past tense of wind, "row" the noun versus the verb). In Sesame's published comparisons, CSM Medium leads OpenAI's realtime voice and Play.ht on this benchmark.
- Pronunciation Continuation Consistency. Tests whether the model maintains a specific pronunciation choice (when prompted with a particular pronunciation of an ambiguous word) across subsequent turns of a dialogue. Designed to surface the multi-turn-stability problem that prior voice models had with name pronunciation, code pronunciation, and accent consistency.
- Comparative Mean Opinion Score (CMOS) studies on the Expresso dataset. Without conversational context, human evaluators showed no preference between CSM Medium output and real human speech. With 90 seconds of conversational context, evaluators favoured the real human recordings, which is the gap Sesame frames as the remaining "uncanny valley" the company is working to close.
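A homograph-disambiguation benchmark of the kind described above reduces to a simple accuracy score. The sketch below is illustrative: the sentences and pronunciation keys are stand-ins, not Sesame's actual test set, and `pronounce` is a placeholder for whatever the model-plus-transcription pipeline would return.

```python
# Each case: (prompt sentence, homograph, pronunciation the context requires).
CASES = [
    ("The pipe was made of lead.",    "lead", "/lɛd/"),
    ("She will lead the expedition.", "lead", "/liːd/"),
    ("He caught a bass in the lake.", "bass", "/bæs/"),
    ("She plays bass in the band.",   "bass", "/beɪs/"),
    ("A tear rolled down his cheek.", "tear", "/tɪr/"),
    ("Do not tear the page.",         "tear", "/tɛr/"),
]

def homograph_accuracy(pronounce, cases=CASES) -> float:
    """Fraction of cases where the model's rendered pronunciation of the
    homograph matches the reading the sentence context requires."""
    hits = sum(pronounce(sentence, word) == expected
               for sentence, word, expected in cases)
    return hits / len(cases)

# A context-blind baseline that always emits each word's first-listed
# reading gets exactly half of this balanced test set right.
first_reading = {"lead": "/lɛd/", "bass": "/bæs/", "tear": "/tɪr/"}
print(homograph_accuracy(lambda sentence, word: first_reading[word]))  # 0.5
```

The point of the balanced pairing is that a model ignoring context cannot exceed chance, so any score above 0.5 measures genuine contextual disambiguation rather than pronunciation-frequency priors.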
The benchmark framing is structurally important: Sesame is not competing on raw voice quality, which it considers solved at frontier scale, but on the harder problem of multi-turn conversational coherence and paralinguistic appropriateness. The choice of measurement instrument reflects the company's view of where the meaningful capability gap currently lives.
Leadership
- Brendan Iribe (Co-founder): Former co-founder and chief executive of Oculus VR (acquired by Facebook in 2014); departed Facebook in 2018. The hardware-product experience underlies Sesame's smart-glasses strategy.
- Ankit Kumar (Co-founder): Technical co-founder with prior experience in AI research and product roles; the public-facing technical-lead voice for the company.
- Johan Schalkwyk (Speech research): Senior speech-AI researcher whose prior tenure at Google included significant contributions to Google's speech-recognition stack. The most-recognisable named hire on Sesame's technical team.
- Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang: Senior research-and-engineering staff identified in the company's published research credits.
Funding and backers
Sesame's funding history has not been comprehensively disclosed in public sources. Industry coverage suggests a seed-plus-Series-A round structure with backers including Andreessen Horowitz and Spark Capital, with both firms participating in the voice-AI and consumer-AI investment categories around 2024 and 2025. The specific round sizes and post-money valuations have not been published.
The capital-side narrative draws on Iribe's prior track record (an Oculus exit at $2 billion to Facebook) and on the team's technical credentials in voice AI. The talent-leads-capital pattern documented in the diaspora map applies here: institutional capital was available at scale before any consumer product release because the founding team's prior experience compressed the time-to-credibility.
Industry position
Sesame sits in the conversational-voice-AI category alongside ElevenLabs, Kyutai, and the realtime-voice offerings from the major frontier labs (notably GPT-Realtime-2 from OpenAI and Gemini 3.1 Flash Live from Google DeepMind). Sesame's differentiation against the category is the open-weights research posture and the explicit hardware ambition; against the broader voice-AI market it is the focus on emotional and paralinguistic quality rather than on raw text-to-speech accuracy.
The Meta Ray-Ban Smart Glasses product line is the closest parallel for Sesame's hardware ambition, and the comparison is direct in two ways. First, the Meta product is the most-established consumer-tier always-available voice-AI device in the 2024 to 2026 market. Second, Brendan Iribe's Oculus history puts him in direct adjacency to the team that built the Meta Ray-Ban glasses, which gives Sesame an unusually informed competitive frame even before the company ships its own hardware.
Competitive landscape
- ElevenLabs (text-to-speech and conversational AI): The leading commercial voice-AI provider on the open API market. Closed weights, a broad voice catalogue, and strong enterprise traction. Sesame's open-weights posture and emotional-intelligence framing differentiate it against ElevenLabs' commercial-incumbent positioning.
- Kyutai (open-weights speech-to-speech, French): The closest direct open-weights competitor. Kyutai's Moshi model is also full-duplex speech-to-speech and is also Apache-2.0 open. The two labs compete for developer-and-research-community mindshare on the open-weights voice frontier.
- OpenAI Realtime API (closed-weight realtime): The closest commercial competitor on the conversational-voice axis. A different commercial model (closed API, GPT-5-class reasoning included, narrower hardware ambition).
- Play.ht, Hume, others (smaller voice-AI insurgents): A long tail of voice-AI startups competing on TTS quality, voice cloning, or specific verticals (call-center automation, audiobook narration). Smaller scale than Sesame and not pursuing comparable hardware ambitions.
- Meta Ray-Ban Smart Glasses (the eventual hardware-side incumbent): Not a voice-AI lab per se, but the most-established consumer-tier device that competes with Sesame's eventual hardware product surface. Meta's distribution advantage is large; Sesame's differentiation will need to be at the voice-AI experience layer.
Outlook
Open questions and watchable signals over the next 6 to 18 months:
- First hardware launch. The smart-glasses hardware has been publicly framed by the company but has not had a launch date or product reveal. The first hardware shipping event will materially shift the competitive analysis from "voice-AI research lab" to "consumer-electronics startup with a voice-AI moat," which are different categories with different valuation models.
- Multilingual scaling. The current CSM is English-dominant; the company has explicitly stated multilingual scaling (to 20-plus languages) is on the development roadmap. The first multilingual release will be a meaningful capability milestone and a competitive signal against Kyutai's similar multilingual roadmap.
- Pre-trained-LLM utilisation. The current CSM does not initialise from pre-trained language-model weights, which the company has flagged as a known design choice and a future improvement direction. Adoption of LLM-warm-start training could close the remaining quality gap to the closed-weight competitors and would be a research event of independent interest.
- Series B timing. A Series B round in 2026 or 2027 would surface the company's commercial trajectory more clearly than the current seed-and-Series-A funding state and would price in the hardware-ambition optionality.
- Talent retention. Sesame's senior technical team is one of the strongest small-company voice-AI rosters in the field. The diaspora map suggests that frontier-credible technical teams can fragment quickly once individual researchers see better opportunities elsewhere. Retention through the first product release is a structural signal.
Sources
- Crossing the uncanny valley of conversational voice. Sesame's research-page introduction to the Conversational Speech Model, with architectural detail, benchmark framing, and the team-and-mission overview.
- SesameAILabs/csm on GitHub. The Apache-2.0 open-source release of the Conversational Speech Model.
- Oculus VR Wikipedia entry for Brendan Iribe's prior context.
- Companion profile: Kyutai for the closest open-weights competitor in the speech-to-speech category, and GPT-Realtime-2 for the closest closed-weight commercial competitor.