Inferact

Inferact is an American artificial intelligence inference infrastructure startup founded in 2025 by the creators and core maintainers of vLLM, the open-source large language model inference engine originally developed at the UC Berkeley Sky Computing Lab. The founding team includes Simon Mo, Woosuk Kwon, Kaichao You, and Roger Wang, with Joseph Gonzalez and Ion Stoica of UC Berkeley as advisors. As of May 2026, Inferact has raised $150 million in seed funding at an $800 million post-money valuation, in a round co-led by Andreessen Horowitz and Lightspeed Venture Partners.

At a glance

  • Founded: 2025 in the United States. Public launch and seed-round announcement in January 2026.
  • Status: Private. Seed round closed January 2026.
  • Funding: $150 million seed at $800 million post-money valuation, co-led by Andreessen Horowitz and Lightspeed Venture Partners. Sequoia Capital, Altimeter Capital, Redpoint Ventures, ZhenFund, Databricks Ventures, and the UC Berkeley Chancellor's Fund participated.
  • CEO: Simon Mo, co-founder. Lead maintainer of the vLLM open-source project.
  • Other notable leadership: Woosuk Kwon, co-founder; original creator of vLLM and the PagedAttention method while a PhD student at UC Berkeley. Kaichao You, co-founder; vLLM core maintainer and Tsinghua University Special Award winner. Roger Wang, co-founder; vLLM core maintainer.
  • Open weights: Inferact develops infrastructure for inference rather than foundation models. The vLLM project itself is hosted by the PyTorch Foundation and remains Apache 2.0 licensed.
  • Flagship products: vLLM (the open-source inference engine, hosted by the PyTorch Foundation); commercial managed-inference offerings for enterprise and cloud customers form the productization path.

Origins

vLLM emerged in 2023 from the Sky Computing Lab at UC Berkeley as part of a research line on efficient large-language-model serving. Woosuk Kwon, then a PhD student at Berkeley, was the original creator and lead author of the PagedAttention paper, the methodological core of vLLM. PagedAttention applies operating-system-style virtual memory paging to transformer attention key-value caches, increasing throughput on inference workloads without sacrificing latency.
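
To make the virtual-memory analogy concrete, here is a minimal conceptual sketch of PagedAttention-style KV-cache paging. This is illustrative only; the block size, names, and data structures are invented for this sketch and do not reflect vLLM's actual implementation:

```python
# Toy block-table allocator illustrating the PagedAttention idea:
# KV-cache memory is carved into fixed-size physical blocks, and each
# sequence maps logical token positions to blocks via a block table,
# much like virtual-memory pages map to physical frames.
BLOCK_SIZE = 16  # tokens per KV block (illustrative choice)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # physical "page frames"
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Return (physical_block, offset) for the next token's K/V entry."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:  # current block full (or first token): allocate
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated only as sequences actually grow, and reclaimed as soon as sequences finish, the cache avoids the waste of reserving contiguous maximum-length buffers per request; that is where the batch-size and throughput gains come from.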

Following its open-source release in mid-2023, the project was rapidly adopted across the AI infrastructure ecosystem. Simon Mo became the lead maintainer as the project scaled. Kaichao You contributed core compiler and execution-engine work. Roger Wang anchored the multimodal-inference and broader ecosystem work. By 2024 and 2025, vLLM had grown into one of the most active open-source AI projects, with over 2,000 contributors and adoption across major frontier AI labs and enterprise customers.

In July 2024, the University of California, Berkeley contributed vLLM to the Linux Foundation. In 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project, formalizing its status as a community-governed open-source project rather than a UC Berkeley research artifact.

The Inferact spinout was announced on January 22, 2026, alongside the $150 million seed round. The framing was an "open core" model: vLLM continues as community-owned open-source infrastructure, while Inferact ships commercial offerings (managed services, optimized cloud deployments, enterprise tooling) on top of the open core. The founder team explicitly committed to maintaining vLLM as the industry-standard open-source inference engine, with Inferact providing commercial-grade reliability and managed operations for enterprise customers.

vLLM's adoption metrics at the time of the Inferact launch were unusual for an open-source project. The project is reported to power over 400,000 GPUs concurrently across deployments at Meta AI / FAIR, Google, Character AI, and others, making it one of the most widely used components in the modern AI inference stack.

Mission and strategy

Inferact's stated mission is to "grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster." The strategic premise is that AI inference is the dominant compute cost in production AI, that open-source infrastructure components like vLLM benefit from commercial backing, and that managed inference services on top of widely-adopted open cores are a viable enterprise commercial model.

The strategy combines three threads. The first is continued investment in vLLM as community-owned open-source infrastructure, with Inferact's senior engineering staff providing principal maintainer attention. The second is the development of managed inference services for enterprise customers who require operational reliability, security guarantees, and performance optimization beyond what self-hosted vLLM deployments deliver. The third is the development of optimized backends across the diverse AI hardware landscape (NVIDIA, AMD, Intel, Google TPUs, custom silicon), positioning vLLM-and-Inferact as the cross-vendor inference layer.

The competitive premise reflects the broader open-core software thesis: if vLLM is the de facto standard inference engine, the company that anchors its development can capture commercial value through managed services and enterprise tooling without competing directly with the foundation-model developers it serves. The structure parallels Databricks-and-Spark, Hugging Face-and-Transformers, and Confluent-and-Kafka in earlier open-source eras.

Models and products

  • vLLM (open-source). High-throughput, memory-efficient inference engine for large language models and multimodal models. Hosted by the PyTorch Foundation. Apache 2.0 licensed. A minimal usage sketch follows this list.
  • Inferact managed services. Commercial managed-inference offerings on top of vLLM, targeted at enterprise customers requiring reliability, security, and operational support beyond self-hosted deployments. Specific product details and pricing have not been broadly publicly disclosed.
  • Cross-hardware optimization. Continued vLLM optimization for NVIDIA, AMD, Intel, Google TPU, and custom AI silicon, positioning the inference layer as hardware-vendor-neutral.
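
For reference, self-hosted use of the open-source engine looks like the following minimal sketch, based on vLLM's documented Python API (the model name and sampling settings are arbitrary examples, not Inferact product specifics):

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Load any supported Hugging Face-format model; this small model is just an example.
llm = LLM(model="facebook/opt-125m")

# Sampling settings are arbitrary illustrative values.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
for out in outputs:
    print(out.prompt, out.outputs[0].text)
```

Inferact's managed offerings sit above this layer; the open-source engine itself remains directly usable as shown.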

Distribution is split between the open-source community (via GitHub, Hugging Face, and PyPI for vLLM itself) and direct enterprise sales for the Inferact commercial managed-service offerings.

Benchmarks and standing

vLLM is consistently ranked among the leading open-source LLM inference engines on community benchmarks for throughput, latency, and memory efficiency. The project's benchmark leadership is most prominent on long-context and high-batch workloads, where PagedAttention's memory management produces the largest gains over alternative engines.

The 400,000-plus GPU deployment count and the use of vLLM in production by Meta, Google, Character AI, and other major operators are the clearest indicators of standing. The project's status as PyTorch Foundation-hosted infrastructure, alongside PyTorch itself, reflects its central position in the AI inference stack.

For Inferact specifically (the commercial entity), commercial-traction data has not been publicly disclosed as of May 2026. Its standing will depend on converting the open-source community position into managed-service revenue.

Leadership

As of May 2026, Inferact's senior leadership includes:

  • Simon Mo, Chief Executive Officer and co-founder. Lead maintainer of vLLM since the project's 2023 emergence.
  • Woosuk Kwon, co-founder. Original creator of vLLM and lead author of the PagedAttention paper.
  • Kaichao You, co-founder. vLLM core maintainer and senior compiler-and-execution-engine engineer.
  • Roger Wang, co-founder. vLLM core maintainer with focus on multimodal inference and ecosystem contribution.

Advisors include Joseph Gonzalez and Ion Stoica of UC Berkeley, both senior researchers in the Sky Computing Lab. Stoica is also a co-founder of Databricks and Anyscale; the lineage of large-scale open-source projects emerging from his Berkeley group includes Apache Spark and Ray.

Funding and backers

Inferact's funding history through May 2026 consists of one disclosed round:

  • Seed (January 2026): $150 million at $800 million post-money valuation, co-led by Andreessen Horowitz and Lightspeed Venture Partners. Sequoia Capital, Altimeter Capital, Redpoint Ventures, ZhenFund, Databricks Ventures, and the UC Berkeley Chancellor's Fund participated.

The seed-round size and valuation are unusually rich for a pre-revenue commercial entity but consistent with the open-core thesis: investors are pricing the commercial company against the strategic value of the underlying open-source position. The participation of Databricks Ventures is notable given Ion Stoica's overlapping involvement and the structural parallel between Databricks-and-Spark and Inferact-and-vLLM.

The investor base spans top-tier US growth investors (a16z, Lightspeed, Sequoia, Altimeter, Redpoint), corporate-strategic investors (Databricks Ventures), the UC Berkeley Chancellor's Fund, and ZhenFund (a China-focused early-stage firm). The configuration reflects the global adoption profile of vLLM.

Industry position

Inferact occupies a structurally distinctive position in the AI inference infrastructure category. The combination of vLLM's adoption depth, the founder team's status as the project's principal maintainers, and the depth of the seed-round investor lineup produces a profile that no peer commercial inference startup currently matches.

Industry coverage has characterized Inferact as the canonical open-core commercialization play in the AI inference category. The principal strategic-execution risks identified are the maintenance of vLLM as a community-governed project under PyTorch Foundation (preserving open-source neutrality while building commercial offerings), the conversion of the open-source community into managed-service customers, and the competitive pressure from Together AI, Fireworks AI, and other inference platforms that operate without an open-source-anchored model.

Competitive landscape

Inferact competes for AI inference workload across several categories:

  • Together AI, Fireworks AI. Direct AI inference platform competitors with their own optimized engines and managed services.
  • RadixArk. The closest peer commercialization of an alternative open-source inference engine (SGLang), launched May 2026 with similar open-core framing.
  • Cloud-provider managed inference. Amazon Bedrock, Google Vertex AI, and Azure OpenAI Service are the principal cloud-provider alternatives.
  • Anyscale. Sister Berkeley spinout (also from Stoica's research group) commercializing the Ray distributed-compute project; partial overlap on AI workload management.
  • Cerebras, Groq, SambaNova. Specialized AI hardware vendors with their own inference stacks.
  • NVIDIA inference software. TensorRT-LLM and Triton Inference Server are the principal NVIDIA-aligned competitors.

Outlook

Several open questions affect Inferact's trajectory in 2026 and 2027:

  • Commercial managed-service traction, with named enterprise customers and revenue trajectory among the watchable signals.
  • The structural relationship between Inferact and vLLM's PyTorch Foundation governance, particularly as commercial offerings expand.
  • The competitive dynamic with RadixArk and the broader question of whether the inference category supports multiple open-core commercial entities or whether one will dominate.
  • Continued vLLM technical leadership, particularly across new AI hardware platforms and emerging workload patterns (long context, multimodal, agentic inference).
  • Subsequent fundraising; the $800 million seed valuation implies that meaningful commercial milestones will be needed to justify a step-up at the next priced round.
  • The broader question of where AI inference value accrues across the stack between hardware, compilers, runtimes, and managed services.
