SGLang

SGLang is a high-performance open-source serving framework for large language models and multimodal models, developed by Ying Sheng and Banghua Zhu at LMSYS, the Berkeley research collective. The framework introduced RadixAttention, a method for caching and reusing the prefix portions of language model computation across requests, combined with a structured-program-execution model that compiles structured generation patterns (regex-constrained outputs, structured JSON, agentic control flow) into efficient runtime sequences. SGLang is reported to power inference at Google, Microsoft, NVIDIA, Oracle, AMD, Nebius, LinkedIn, xAI, Thinking Machines Lab, and additional operators, generating trillions of tokens per day across more than 400,000 GPUs in production deployments worldwide as of 2026. The commercial entity backing SGLang, RadixArk, launched in May 2026 with a $100 million seed round at a $400 million valuation.

At a glance

  • Lab: Open-source project; commercialized by RadixArk. Originated at LMSYS, the Berkeley research collective.
  • Released: Initial public release in 2024. Joined the PyTorch ecosystem in 2025. Commercial entity (RadixArk) launched May 5, 2026.
  • Modality: Inference and serving framework for language and multimodal models. Not a foundation model itself; runs other labs' models including DeepSeek, Llama, Qwen, Kimi, GLM, GPT-OSS, Gemma, and Mistral families.
  • Open weights: Not applicable. SGLang is open-source under the Apache 2.0 license. The framework is the inference engine; weights for served models come from their respective labs.
  • Context window: Determined by the served model rather than the framework. SGLang efficiently handles long contexts and supports prefix caching across requests.
  • Pricing: Free for self-hosted deployment under Apache 2.0. RadixArk offers commercial managed-inference and managed-training services on top of the open core.
  • Distribution channels: sgl-project/sglang on GitHub, PyTorch ecosystem distribution, and the RadixArk commercial managed-service offering.

Origins

SGLang originated at LMSYS, the Berkeley research collective known for the Chatbot Arena leaderboard (LMArena) and for releases including Vicuna. The project began as a serving-framework research effort in 2023 and 2024, with subsequent contributions from a broad community of researchers and practitioners across academic and industry labs.

The technical foundation is RadixAttention, the prefix-caching mechanism developed by Sheng, Zhu, and the LMSYS team. The original SGLang paper, "SGLang: Efficient Execution of Structured Language Model Programs," was published at NeurIPS 2024. The paper documented the architectural novelty: a unified runtime that combines automatic prefix caching across requests via a radix tree with a compiler-style approach to structured generation, compiling control-flow patterns (loops, branches, structured JSON output, regex constraints) into efficient batched execution sequences.

By 2025, SGLang had become one of the de facto standard inference engines for production-scale LLM deployment, alongside vLLM. Adoption spans deployments at Google, Microsoft, NVIDIA, Oracle, AMD, Nebius, LinkedIn, xAI, Thinking Machines Lab, and additional operators, with combined inference volume reaching trillions of tokens per day across hundreds of thousands of GPUs worldwide.

xAI's Igor Babuschkin has publicly stated that SGLang is xAI's default inference engine, and that xAI uses SGLang to serve its flagship Grok line. Microsoft Azure uses SGLang to serve DeepSeek R1 on AMD GPUs as part of the company's frontier-model offering. The DeepSeek-V4 release in April 2026 was supported by SGLang day-zero inference, an integration documented in an LMSYS blog post titled "DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles."

In May 2026 the SGLang core team spun out RadixArk, the commercial entity backing the project, with a $100 million seed round led by Accel and co-led by Spark Capital at a $400 million post-money valuation. The seed round included strategic participation from NVIDIA's NVentures, AMD, MediaTek, and angel investors including Igor Babuschkin (xAI), John Schulman (OpenAI and Thinking Machines Lab), and Soumith Chintala (PyTorch creator).

Capabilities

SGLang is built specifically for high-throughput serving of structured language model workloads. Three capabilities distinguish it from peer inference frameworks.

The first is RadixAttention. The mechanism enables automatic reuse of the key-value (KV) cache across multiple generation calls by maintaining the KV cache for all requests in a radix tree with LRU eviction. The radix tree supports efficient matching, insertion, and eviction, allowing SGLang to recognize when an incoming request shares a prefix with previously processed requests and to reuse the cached prefix computation, as the sketch below illustrates. RadixAttention is the structural reason SGLang outperforms peer engines on prefix-heavy workloads such as retrieval-augmented generation and multi-turn conversation, where prefix sharing is dense.
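
To make the mechanism concrete, the sketch below implements token-level radix-tree prefix matching in plain Python. It is illustrative only: the real RadixAttention stores KV-cache blocks at the tree nodes and evicts least-recently-used entries under memory pressure, none of which is modeled here.

    # Illustrative sketch of the radix-tree prefix matching behind
    # RadixAttention; not SGLang's implementation. Real nodes hold
    # KV-cache blocks and participate in LRU eviction.

    class Node:
        def __init__(self, tokens=()):
            self.tokens = tuple(tokens)  # edge label: a run of token IDs
            self.children = {}           # first token ID -> child Node

    def _common_prefix(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def match_prefix(root, tokens):
        """Count how many leading tokens are already cached in the tree."""
        node, matched = root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            k = _common_prefix(child.tokens, tokens[matched:])
            matched += k
            if k < len(child.tokens):  # diverged partway along an edge
                break
            node = child
        return matched

    def insert(root, tokens):
        """Insert a token sequence, splitting edges where prefixes diverge."""
        node, i = root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            k = _common_prefix(child.tokens, tokens[i:])
            if k < len(child.tokens):
                # Split the edge: keep the shared run, push the rest down.
                split = Node(child.tokens[:k])
                child.tokens = child.tokens[k:]
                split.children[child.tokens[0]] = child
                node.children[tokens[i]] = split
                child = split
            node, i = child, i + k

    root = Node()
    insert(root, [1, 2, 3, 4, 5])            # first request's prompt tokens
    print(match_prefix(root, [1, 2, 3, 9]))  # -> 3 tokens of reusable prefix

A matched prefix of length three means the engine can skip recomputing attention over those tokens and begin from the fourth, which is where the throughput gains on shared-prefix workloads come from.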

The second is the structured-generation runtime. SGLang treats structured generation programs (regex-constrained outputs, structured JSON, agentic control flow) as compilable computational graphs rather than as token-by-token sampling with rejection. The compiled execution path produces substantially higher throughput on agentic and reasoning workloads, where structured outputs and tool-use formatting are characteristic.
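
As an illustration of the programming model, here is a short program in SGLang's frontend language. The decorator and generation-call signatures follow the originally published sgl.function / sgl.gen API, which may differ in later releases, and the endpoint URL is a placeholder.

    # Sketch of an SGLang structured-generation program, assuming the
    # sgl.function / sgl.gen frontend API; signatures vary by version.
    import sglang as sgl

    @sgl.function
    def triage(s, ticket):
        s += "Ticket: " + ticket + "\n"
        # Regex constraint: the runtime masks logits during decoding so
        # the output is guaranteed to match, with no rejection sampling.
        s += "Severity: " + sgl.gen("severity", regex=r"(low|medium|high)") + "\n"
        s += "Summary: " + sgl.gen("summary", max_tokens=64, stop="\n")

    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
    state = triage.run(ticket="Checkout returns a 500 after payment.")
    print(state["severity"], "-", state["summary"])

Because the whole program is visible to the runtime, its prompts, constraints, and control flow can be batched and scheduled together rather than issued as independent completion calls.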

The third is cross-vendor hardware optimization. SGLang supports NVIDIA, AMD, Intel, MediaTek, and other AI hardware platforms, with continued investment in cross-platform performance optimization. That neutrality is reinforced by the strategic-investor base of the commercial RadixArk entity, which includes NVIDIA's NVentures, AMD, and MediaTek.

SGLang supports a broad range of model families including Llama 4, Qwen 3, DeepSeek V4, Kimi K2, GLM, GPT-OSS, Gemma, and Mistral lines. The framework is the model-serving substrate rather than a model itself, and its capability is most directly measured through aggregate throughput, latency, and cost-per-token economics relative to peer engines.
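
For self-hosted use, the framework also exposes an offline engine API alongside its HTTP server. A minimal sketch follows, assuming the sgl.Engine entry point from recent releases; the model path is only an example, and the generate() signature and output shape may differ by version.

    # Minimal offline-engine sketch; sgl.Engine and the shape of its
    # generate() output are assumed from recent releases of SGLang.
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    outputs = llm.generate(
        ["Summarize RadixAttention in one sentence."],
        {"temperature": 0.0, "max_new_tokens": 64},  # sampling parameters
    )
    print(outputs[0]["text"])
    llm.shutdown()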

Throughput and capability claims

SGLang's principal disclosed performance metrics are throughput, latency, and prefix-cache reuse efficiency on standardized inference workloads.

On general throughput benchmarks on H100 GPUs, SGLang reports 16,215 tokens per second versus 12,553 tokens per second for a fully optimized vLLM configuration on the same workload, a 29 percent advantage. On prefix-heavy workloads such as retrieval-augmented generation and multi-turn chat, SGLang's RadixAttention provides up to 6.4 times the throughput of vLLM at the same context loads.

On multi-turn conversation evaluations at standard context loads, RadixAttention provides approximately a 10 percent throughput gain over vLLM, with the advantage growing as the cache reuse rate increases.

For high-concurrency batch processing of unique-prompt workloads, vLLM's C++ routing implementation has reported higher throughput than SGLang by avoiding Python GIL contention. The two frameworks therefore have complementary advantages: SGLang leads on prefix-heavy and structured-generation workloads, while vLLM leads on high-concurrency unique-prompt workloads.

The trillions-of-tokens-per-day production volume and SGLang's role as the default engine at xAI and across major frontier-lab deployments are the clearest indicators of its standing in production. Public adoption among Google, Microsoft, NVIDIA, Oracle, AMD, Nebius, LinkedIn, Thinking Machines Lab, and additional operators reflects the project's broad cross-organization usage.

The benchmark category itself is fragmented: throughput, latency, and cost-per-token results depend on the specific GPU, context length, batch size, and workload distribution. Direct head-to-head comparisons between SGLang, vLLM, TensorRT-LLM, and LMDeploy produce different rankings depending on the benchmark configuration, and benchmark leadership rotates with each release of the major frameworks.

Adoption and ecosystem

SGLang is distributed primarily through GitHub at sgl-project/sglang under the Apache 2.0 license. The project joined the PyTorch ecosystem in 2025, with PyTorch hosting SGLang as a recommended inference engine for production deployment.

The community-governed open-source project is maintained by RadixArk's senior engineering staff alongside contributors from academic and industry labs. The mailing list, GitHub issues, and Slack channels are the principal community-engagement surfaces, with weekly community calls and contributor recognition driving sustained engagement.

The commercial RadixArk entity provides managed-inference and managed-training services on top of SGLang and the adjacent Miles reinforcement learning framework. Specific product details and per-call pricing have not been publicly disclosed as of May 2026; the commercial offering is distinguished from the open-source distribution by managed deployment, support contracts, and integration with enterprise compliance requirements.

Distribution channels also include integration with Ollama for local deployment, vLLM-compatible APIs as a migration path, and the broader Hugging Face inference ecosystem.
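
In practice, migration is also eased by the server's OpenAI-compatible HTTP surface, so existing client code can be pointed at a local SGLang endpoint. A sketch using the openai client follows; the URL and model name are placeholders for a locally launched server.

    # Calling a locally served SGLang endpoint through its
    # OpenAI-compatible API; URL and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "What is RadixAttention?"}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)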

Comparison

Direct competitors and adjacent inference frameworks:

  • vLLM (open-source). The principal open-source peer to SGLang. vLLM is commercialized by Inferact, which closed a $150 million seed round at $800 million valuation in January 2026. The two engines are widely characterized as the leading open-source inference frameworks, with vLLM's C++ routing producing higher throughput on high-concurrency unique-prompt workloads and SGLang's RadixAttention producing higher throughput on prefix-heavy and structured-generation workloads.
  • TensorRT-LLM (NVIDIA). The principal NVIDIA-aligned inference framework, with NVIDIA-specific optimization built atop the closed-source TensorRT stack and strong performance on NVIDIA hardware. SGLang's cross-vendor neutrality is the principal architectural distinction.
  • Triton Inference Server (NVIDIA). Adjacent NVIDIA inference offering for general inference workloads.
  • LMDeploy (open-source). Adjacent open-source inference framework. Reports throughput within a few percent of SGLang on H100 workloads.
  • Together AI, Fireworks AI, and other managed-inference platforms. Direct AI inference platform competitors. Each operates its own optimized engine and managed-service surface, often built on or integrating SGLang or vLLM as a substrate component.
  • Cloud-provider managed inference. Amazon Bedrock, Google Vertex AI, and Azure OpenAI Service are the principal cloud-provider alternatives for hosted inference.
  • Cerebras, Groq, SambaNova. Specialized AI hardware vendors with their own inference stacks. Different positioning around custom silicon rather than commodity GPU optimization.

SGLang's distinctive position among 2024 to 2026 inference frameworks rests on open-source Apache 2.0 distribution, the RadixAttention prefix-caching mechanism that yields structural throughput advantages on prefix-heavy and structured-generation workloads, cross-vendor hardware optimization across NVIDIA, AMD, and other platforms, and an adoption depth at xAI and across frontier-lab deployments that few peer projects match.

Outlook

Open questions for SGLang and RadixArk over the next 6 to 18 months:

  • Commercial managed-service traction. RadixArk's $400 million seed valuation implies meaningful commercial milestones before the next priced round. Named enterprise customers and the revenue trajectory of the managed-service offering are the signals to watch.
  • Continued technical leadership. Performance leads in the inference-framework category compress rapidly: vLLM, TensorRT-LLM, LMDeploy, and others all release substantial performance updates on quarterly cadences. Whether SGLang sustains its prefix-cache and structured-generation lead through 2026 and 2027 depends on continued architectural advancement.
  • Miles RL framework development. The adjacent Miles open-source reinforcement learning framework is positioned to address post-training workloads. The convergence of post-training and inference into a unified vendor offering is part of RadixArk's commercial thesis.
  • Cross-vendor neutrality. RadixArk's strategic-investor base includes NVIDIA, AMD, MediaTek, Intel, and Broadcom-affiliated angels. Maintaining genuine cross-vendor optimization while accepting strategic capital from each major hardware vendor is a structural balancing act.
  • Competitive dynamics with Inferact. Inferact is the closest direct peer (vLLM commercialization). Both companies launched within four months of each other; both target the open-core managed-service model. Whether the inference market supports two open-core commercial entities at scale is the central market-structure question.
  • Integration with frontier-lab post-training pipelines. SGLang has positioned itself as the inference component for verified-RL post-training pipelines (the DeepSeek-V4 day-zero integration is the disclosed example). Continued depth of integration with frontier-lab post-training workflows would expand the project's strategic relevance.
