RT-2

RT-2 is Google DeepMind's vision-language-action model that applies web-scale vision-language pretraining to robotic control, enabling robots to interpret natural-language instructions and generalize to objects and scenarios outside their training data.

RT-2 (Robotic Transformer 2) is a vision-language-action (VLA) model developed by Google DeepMind, announced July 28, 2023, that trains a large transformer jointly on internet-scale vision-language data and physical robot demonstration data to produce a single model capable of outputting robot control actions directly from natural-language instructions. It is available as a research artifact, not a commercial API or product. As of April 2026, RT-2 and the follow-on RT-X cross-embodiment research line represent the foundational robotics work that preceded Gemini Robotics, DeepMind's current production-grade embodied AI family.

At a glance

  • Lab: Google DeepMind
  • Released: July 28, 2023
  • Modality: Vision-language-action (image and language in, robot control actions out)
  • Open weights: No (research model; no public weights release)
  • Architecture variants: PaLM-E 12B and PaLI-X 55B
  • Pricing: Not applicable (research only; no commercial deployment)
  • Distribution channels: Research paper and project site; no public API

Origins

RT-2 builds directly on RT-1, DeepMind's first Robotic Transformer, introduced in December 2022: a compact transformer architecture trained on 130,000 robot demonstrations collected by a fleet of 13 robots in office kitchen environments over 17 months. RT-1 achieved strong task generalization within the distribution of its training scenarios but struggled with objects, environments, and instructions it had not seen during training. Its architecture was narrow by design: it processed images and task strings and output tokenized motor commands, but had no connection to the broad semantic and visual knowledge encoded in large-scale internet-trained models.

The core insight driving RT-2 was that this limitation was architectural rather than fundamental. Large vision-language models trained on web-scale data had already learned rich representations of objects, categories, spatial relationships, materials, and human-world semantics. The question was whether those representations could be adapted for robot control without training a new model from scratch.

RT-2 answered by treating robot actions as a new type of token in the existing vocabulary of a vision-language model. Rather than building a separate action-prediction head, the team expressed discrete robot actions as text strings and incorporated them directly into the training data alongside standard vision-language tasks. The model then underwent co-fine-tuning: simultaneous training on both web-scale vision-language tasks (image captioning, visual question answering) and robot trajectory demonstrations. The result was a unified model that could respond to a natural-language instruction, interpret a camera image of the robot's workspace, and output a sequence of action tokens that the robot arm executed.
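
To make the representation concrete: the paper describes discretizing each dimension of an 8-dimensional action (an episode-termination flag, the 6-DoF end-effector displacement, and the gripper extension) into 256 uniform bins, so a whole action becomes a short string of integers the language model can emit like any other text. The sketch below illustrates that encoding in Python; the value ranges, field ordering, and vocabulary mapping are illustrative assumptions, since the actual training code is not public.

```python
import numpy as np

# Sketch of RT-2-style action tokenization: each continuous action dimension
# is discretized into 256 uniform bins and the whole action is written out as
# a space-separated string of bin indices. The dimension ordering and value
# ranges here are illustrative assumptions, not the released configuration.

ACTION_BINS = 256
ACTION_RANGES = [
    (0.0, 1.0),    # terminate-episode flag
    (-0.1, 0.1),   # end-effector delta x (m)
    (-0.1, 0.1),   # end-effector delta y (m)
    (-0.1, 0.1),   # end-effector delta z (m)
    (-0.5, 0.5),   # delta roll (rad)
    (-0.5, 0.5),   # delta pitch (rad)
    (-0.5, 0.5),   # delta yaw (rad)
    (0.0, 1.0),    # gripper extension
]

def encode_action(action: np.ndarray) -> str:
    """Map a continuous action vector to a string of discrete bin indices."""
    tokens = []
    for value, (low, high) in zip(action, ACTION_RANGES):
        frac = (np.clip(value, low, high) - low) / (high - low)
        tokens.append(str(int(round(frac * (ACTION_BINS - 1)))))
    return " ".join(tokens)

def decode_action(tokens: str) -> np.ndarray:
    """Invert encode_action, recovering approximate continuous values."""
    values = []
    for token, (low, high) in zip(tokens.split(), ACTION_RANGES):
        values.append(low + int(token) / (ACTION_BINS - 1) * (high - low))
    return np.array(values)

# A robot training example then pairs the camera image and instruction with
# the action string, in the same format as any captioning or VQA target:
#   prompt: "<image> What should the robot do to pick up the apple?"
#   target: encode_action(np.array([0.0, 0.02, -0.01, 0.03, 0.1, 0.0, -0.2, 1.0]))
```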

Two backbone architectures were explored: PaLM-E at 12 billion parameters and PaLI-X at 55 billion parameters. Both were adapted as RT-2 instances through the same co-fine-tuning process. PaLI-X at 55B showed stronger performance across most evaluation axes, with PaLM-E showing a relative edge on math-reasoning tasks. The scaling trend pointed in one direction: larger backbones generalized better.

The paper, titled "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," was published to arXiv in July 2023. It reported results from over 6,000 robot evaluation trials.

Capabilities

RT-2's defining contribution is generalization to scenarios outside the robot's training distribution, achieved through the web-scale semantic knowledge embedded in the underlying vision-language model.

Three classes of emergent capability were documented:

Novel-object handling. RT-2 could pick up and manipulate objects it had never encountered in robot training data, because the vision-language backbone had seen those objects in web images and built semantic representations for them. A robot given the instruction "bring me the toy not from China" could identify the relevant object by origin and retrieve it correctly, drawing on world-knowledge learned from web text rather than from robot demonstrations.

Semantic reasoning during manipulation. RT-2 could interpret instructions that required real-world inference rather than pattern-matching to a known command. When asked to pick up something that could be used as an improvised hammer, the model identified a rock in the scene and grasped it. When asked to move an object to the number "4," the model placed it on a tile showing that numeral. These tasks had no analogues in the robot training data; the model resolved them through the conceptual reasoning carried over from the vision-language pretraining.

Chain-of-thought planning. When prompted to reason before acting, RT-2 could generate a sequence of natural-language reasoning steps followed by the keyword "Action:" and then the action tokens. This let users elicit multi-stage plans for complex manipulation scenarios and inspect the model's reasoning before execution, a behavior with no parallel in RT-1.
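
As an illustration of that output format, the sketch below splits a chain-of-thought completion into the human-readable plan and the action-token string that follows the "Action:" keyword. The sample completion is invented to match the format the paper describes, not an actual model output.

```python
import re

def split_plan_and_action(completion: str) -> tuple[str, str]:
    """Separate the natural-language plan from the trailing action tokens."""
    match = re.search(r"Action:\s*(.*)\Z", completion, flags=re.DOTALL)
    if match is None:
        raise ValueError("completion does not contain an 'Action:' segment")
    plan = completion[: match.start()].strip()
    action_tokens = match.group(1).strip()
    return plan, action_tokens

# Invented example in the paper's described format, not a real model output.
completion = (
    "Plan: the rock is the only object in the scene hard enough to drive a nail, "
    "so pick up the rock. Action: 1 129 138 122 132 132 106 127"
)
plan, action = split_plan_and_action(completion)
print(plan)    # inspectable reasoning steps
print(action)  # discretized action tokens handed to the robot's decoder
```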

On the simulation benchmark Language Table, RT-2 reached 90% success, compared to 77% for the prior state of the art. On novel-task evaluations designed to test emergent capabilities, RT-2 exceeded the performance of RT-1 and the Visual Cortex (VC-1) baseline by more than 3x in average success rate. Across broader generalization axes, the improvement over comparable baselines averaged approximately 2x.

The PaLI-X 55B variant consistently outperformed PaLM-E 12B on most evaluations, consistent with the expectation that larger vision-language backbones carry more transferable world knowledge.

Benchmarks and standing

RT-2's benchmarks are specific to robotic manipulation evaluation and are not directly comparable to the text-and-multimodal leaderboards used for language models.

The primary reported results:

  • Emergent capability evaluations (novel tasks): RT-2 achieved more than 3x the average success rate of the strongest baselines (RT-1 and VC-1) across symbol understanding, semantic reasoning, and human-recognition task categories.
  • Generalization evaluations: Approximately 2x improvement over comparable baselines on held-out scenarios not seen during robot training.
  • Language Table simulation benchmark: 90% success rate, versus 77% for the previous state of the art.
  • Scale effect: PaLI-X 55B outperformed PaLM-E 12B across most tasks; neither variant's weights have been publicly released.

In October 2023, the RT-X project (Open X-Embodiment) extended the RT-2 training approach to cross-embodiment learning. RT-2-X, trained on a dataset of over one million real robot trajectories spanning 22 robot platforms and 34 research institutions, achieved roughly 3x better performance than the original RT-2 alone on emergent skill evaluations, with gains attributed to shared representations across diverse physical configurations.

These figures are point-in-time research results from 2023. The robotics benchmark landscape has developed considerably since, with Gemini Robotics reporting that it doubles performance on a comprehensive generalization benchmark relative to other state-of-the-art VLA models as of March 2025.

Access and pricing

RT-2 is a research model with no public weights release and no commercial API.

The project site at robotics-transformer2.github.io hosts the paper, evaluation videos, and supplementary materials. The paper is available at arXiv:2307.15818. No inference endpoint is offered to external parties.

The RT-X / Open X-Embodiment project site at robotics-transformer-x.github.io provides the cross-embodiment dataset and accompanying research materials. The dataset itself has been released for research use by the broader community, but the trained RT-2-X model weights remain internal to DeepMind.

Organizations looking to build on this line of research can access the Open X-Embodiment dataset to train their own models. Practical deployment of DeepMind's robotics capabilities is available only through the Gemini Robotics family, which is distributed through Google Cloud for qualified robotics partners.
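
As a practical note, the Open X-Embodiment episodes are distributed in RLDS format and are typically read through tensorflow_datasets. A minimal sketch follows, assuming the GCS bucket path and dataset name used in the project's published example notebooks; these may change, so check robotics-transformer-x.github.io for the current listing.

```python
import tensorflow_datasets as tfds

# Read one constituent dataset of Open X-Embodiment in RLDS episode format.
# The bucket path below follows the project's published examples and may
# need updating against the current release listing.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/fractal20220817_data/0.1.0"
)
dataset = builder.as_dataset(split="train")

# Each element is one episode; its "steps" entry holds per-timestep
# observations (camera image, language instruction) and robot actions.
for episode in dataset.take(1):
    for step in episode["steps"].take(3):
        observation = step["observation"]  # image, natural-language instruction, ...
        action = step["action"]            # the robot action for this timestep
```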

Comparison

RT-2's direct competitors in the embodied VLA space at the time of its release and in the research literature since:

  • RT-1 (Google DeepMind). RT-2's immediate predecessor. RT-1 achieved strong performance within its training distribution but lacked the web-scale semantic knowledge that enables RT-2's novel-task generalization. On the emergent capability evaluations, RT-2 exceeded RT-1 by more than 3x. RT-1 remains the compact-model reference point in DeepMind's robotics work.
  • Open X-Embodiment / RT-2-X (Google DeepMind). The successor research effort that trained RT-2 on a cross-embodiment dataset spanning 22 robot types across 34 institutions. RT-2-X achieves approximately 3x better performance than the single-embodiment RT-2 on emergent skill evaluations. The Open X-Embodiment dataset underpins a broader open research effort to build models that generalize across robot platforms.
  • Gemini Robotics. DeepMind's March 2025 production-grade VLA, built on Gemini 2.0 rather than PaLM-E or PaLI-X. Gemini Robotics more than doubles performance on a comprehensive generalization benchmark compared to other VLA models at the time of its release. It supports dexterous manipulation (including origami folding) and real-time conversational language commands, and handles novel environments with greater robustness than RT-2. Gemini Robotics-ER adds an embodied reasoning mode achieving 2x-3x success rates versus Gemini 2.0 in end-to-end robot control. Gemini Robotics represents the current state of the art within DeepMind's robotics line and is what DeepMind offers to partners seeking practical deployment.
  • Other academic VLA models. Concurrent research lines including SayCan (Google/Everyday Robots), PaLM-E (from which RT-2 borrows one backbone), and subsequent models like OpenVLA (Stanford and collaborators, 2024) have explored related approaches of grounding language models in robotic action spaces. These models operate at smaller scale or with narrower task scope than RT-2, though the open-source VLA community has grown substantially since RT-2's publication.

Outlook

Open questions for the next 6 to 18 months:

  • Gemini Robotics trajectory. The production robotics line has moved from RT-2 (July 2023) to RT-X (October 2023) to Gemini Robotics (March 2025) to Gemini Robotics-ER 1.6. Whether the Gemini Robotics family absorbs the cross-embodiment goals of RT-X, or whether a separate research program continues in parallel, is the primary architectural question.
  • Weight release. Neither RT-2 nor RT-2-X weights have been publicly released. The Open X-Embodiment dataset is available, but reproducibility at this scale requires compute resources most academic groups lack. Whether DeepMind opens RT-2 weights in any form, or continues to keep them internal while releasing only the dataset, shapes the research community's ability to build on this work.
  • Cross-embodiment at commercial scale. RT-2-X demonstrated positive transfer across 22 embodiments in a research setting. Extending that principle to commercial deployment across diverse hardware platforms is unsolved. Gemini Robotics targets this by building on Gemini 2.0's general capabilities, but the specific mechanism for adapting to new robot hardware at deployment time remains an open engineering problem.
  • On-device robotics. DeepMind has released Gemini Robotics On-Device as a local-inference variant. How the compute requirements of large VLA models compress to run on embedded hardware without significant capability loss is a constraints problem that RT-2's 55B-parameter PaLI-X backbone does not address.
  • Benchmark standardization. Robotics evaluation lacks the standard leaderboards that exist for language models. Emergent-capability evaluations in RT-2's paper are specific to DeepMind's hardware and task set. Community efforts like Open X-Embodiment and subsequent standardization proposals will determine whether cross-lab comparison becomes as tractable as it is for text benchmarks.

About the author

Nextomoro

nextomoro tracks progress for AI research labs, models, and what's next.