Gemini 3 Deep Think

Gemini 3 Deep Think is Google DeepMind's enhanced reasoning mode for the Gemini 3 model family, designed to spend additional compute on extended internal chain-of-thought before surfacing a final answer to the user. It is available to Google AI Ultra subscribers through the Gemini app and to developers through the Gemini API and Vertex AI on Google Cloud. As of early 2026, Deep Think represents Google's primary offering in the extended-reasoning category alongside OpenAI's o-series and Anthropic's extended thinking mode, with notably strong benchmark results on graduate-level science and competitive mathematics.

At a glance

  • Lab: Google DeepMind
  • Released: December 4, 2025 (initial rollout to AI Ultra subscribers); February 12, 2026 (major upgrade for science and engineering)
  • Modality: Text and multimodal (inherits Gemini 3 Pro's native image, audio, and video understanding)
  • Open weights: No (closed)
  • Context window: 1,000,000 tokens (same as Gemini 3 Pro; thinking tokens count toward context)
  • Pricing: Thinking tokens billed as output tokens at $12 per million (standard context); Google AI Ultra subscription at $249.99/month includes Deep Think access in the Gemini app
  • Distribution channels: Gemini app (AI Ultra tier), Gemini API, Google AI Studio, Vertex AI on Google Cloud

Origins

Gemini 3 launched on November 18, 2025, introducing Gemini 3 Pro as the flagship variant and Gemini 3 Flash as the efficient alternative. Deep Think was announced alongside the base launch as a forthcoming enhanced reasoning mode. It began rolling out to Google AI Ultra subscribers through the Gemini app on December 4, 2025, following an internal safety evaluation period.

The concept behind Deep Think builds on a pattern Google established with the Gemini 2.x generation. Gemini 2.5 Pro, released March 2025, introduced a "thinking" capability that allowed the model to reason before responding. Deep Think represents the continuation of that approach at the Gemini 3 capability level, with an extended and more capable implementation of the same core idea: allocating additional inference-time compute so the model can evaluate multiple solution paths, check its own logic, and surface a more reliable answer.

The mechanism is conceptually similar to OpenAI's o-series approach. Both train or fine-tune models to produce extended internal reasoning traces before committing to a final answer, and both charge for the tokens generated during that internal process. The practical difference is in how the reasoning is exposed to developers: Gemini's API exposes a thinking_level parameter that controls the depth of internal reasoning, giving callers a lever to trade latency and cost against answer quality.

A major upgrade to Gemini 3 Deep Think shipped on February 12, 2026, substantially expanding performance on scientific reasoning benchmarks and introducing the mode's availability via the Gemini API in preview for researchers and enterprise teams.

Capabilities

Deep Think is not a separate model architecture. It is a reasoning mode of Gemini 3 Pro, which means it inherits the full capability profile of the base model: native multimodal understanding across text, images, audio, and video; a 1 million-token context window; tool use and function calling via the API; and integration with Google Search for grounded retrieval.

What Deep Think adds is a different response generation strategy. Rather than producing a final answer directly, the model spends additional compute on an internal deliberation phase. During this phase, it generates reasoning traces not visible to the end user by default, considering alternative approaches and checking intermediate steps. The visible output reflects the conclusion of that process.

This matters most for tasks where the correct answer cannot be reached by pattern-matching to similar training examples. Problems in competitive mathematics, graduate-level physics and chemistry, formal logic, and multi-step planning all benefit from extended deliberation in ways that standard generation does not provide. For a retrieval or summarization task, enabling Deep Think adds latency and cost without a meaningful quality improvement.

The thinking_level parameter available through the API allows developers to control how much internal compute the model dedicates to reasoning. At lower settings, the model reasons briefly before answering; at higher settings, it explores more extensively. This gives engineering teams a practical knob for managing cost-latency tradeoffs when embedding Deep Think in production systems.
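As a minimal sketch of how that knob might be wired into a request, the snippet below assembles a generate-content payload with a thinking level attached. The field names (`thinkingConfig`, `thinkingLevel`) and the model id are assumptions for illustration; consult the Gemini API reference for the current request shape and accepted values.

```python
# Illustrative sketch: selecting a reasoning depth per request.
# Field names and model id are assumptions, not the confirmed API surface.

def build_request(prompt: str, thinking_level: str = "low") -> dict:
    """Assemble a generate-content request with a thinking level.

    thinking_level trades latency and cost against answer quality:
    "low" reasons briefly; "high" deliberates more extensively.
    """
    if thinking_level not in {"low", "high"}:
        raise ValueError(f"unsupported thinking_level: {thinking_level}")
    return {
        "model": "gemini-3-pro-preview",  # hypothetical model id
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingLevel": thinking_level},
        },
    }

# A competition-math prompt warrants deep deliberation; a lookup does not.
req = build_request("Prove the sum of two even integers is even.",
                    thinking_level="high")
```

In a production router, the caller would pick the level per task class: "high" for proofs and multi-step planning, "low" (or standard mode) for retrieval and summarization.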

Multimodal reasoning benefits carry over from the base model. A user can supply an image of a diagram alongside a mathematical question, and Deep Think will reason over both modalities in its internal deliberation. This distinguishes Gemini 3 Deep Think from text-only reasoning systems that treat image inputs as summarized descriptions rather than first-class reasoning inputs.

Benchmarks and standing

Gemini 3 Deep Think's benchmark results reflect its specialization in extended reasoning tasks. All figures are from Google's launch materials and third-party evaluations current as of April 2026.

On GPQA Diamond, which tests graduate-level scientific reasoning in biology, chemistry, and physics, Deep Think scores 93.8%, compared to Gemini 3 Pro's 91.9% in standard mode. On AIME 2025, the American Invitational Mathematics Examination, Deep Think scores 95% without code execution and 100% with code execution enabled.

On Humanity's Last Exam, a benchmark designed to test expert-level knowledge across fields, Deep Think scores 41.0% without tools. The February 2026 upgrade raised this to 48.4% in a subsequent evaluation configuration. These results place Deep Think among the highest scores published on this benchmark as of early 2026.

On ARC-AGI-2, which tests abstract reasoning on novel visual patterns with no overlap with training data, Deep Think achieves 45.1% with code execution (ARC Prize Verified), compared to a 4.9% baseline from the Gemini 2.x generation. This is the category of benchmark where the reasoning mode's advantage is most pronounced.

The February 2026 upgrade additionally produced gold-medal-level results on written sections of the 2025 International Physics Olympiad and 2025 International Chemistry Olympiad, and contributed to resolving 18 research problems in mathematics, physics, and computer science, including settling a decade-old conjecture in online submodular optimization.

On Codeforces, Deep Think achieved an Elo rating of 3455 in the February 2026 evaluation, placing it among the top tier of competitive programmers.

Extended reasoning modes generally show less improvement over base models on benchmarks that favor rapid breadth over depth, such as general instruction following or MMLU-style multiple choice at standard difficulty. Deep Think's benchmark advantage concentrates on the hardest tier of each domain: Olympiad-level problems, the most difficult graduate-level science questions, and formal mathematics.

Benchmark standings change as labs publish model updates and new evaluations appear. The figures above reflect data through April 2026.

Access and pricing

Gemini 3 Deep Think is available through three channels.

In the Gemini app, Deep Think is included with the Google AI Ultra subscription tier at $249.99 per month (introductory pricing of $124.99 for the first three months was available at launch). The Ultra tier also includes extended limits on Gemini 3.1 Pro, Veo video generation, Project Mariner agentic capabilities, 30 TB of Google storage, and $100 in monthly Google Cloud credits. Switching to Deep Think mode within the app activates extended reasoning for that conversation.

Through the Gemini API and Vertex AI, Deep Think is accessed by enabling thinking via the API parameters for the Gemini 3 Pro model. Thinking tokens are billed as output tokens at the same rate as the base model's output: $12 per million tokens for the standard context window (up to 200,000 tokens per request), rising to $18 per million for requests above 200,000 tokens. Input tokens are billed at $2 per million (standard context) or $4 per million (large context). As of the February 2026 update, API access to Deep Think was in preview with interested researchers and enterprises able to apply for early access through Google.
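The tiered rates above can be turned into a back-of-envelope cost estimator. This is a sketch based solely on the figures quoted in this section; it assumes the large-context rate is triggered by the input size of a request exceeding 200,000 tokens, and that thinking tokens bill at the output rate.

```python
# Back-of-envelope estimator for the per-token rates quoted above:
# input $2/M (standard) or $4/M (large context); output and thinking
# tokens $12/M (standard) or $18/M (large context).

LARGE_CONTEXT_THRESHOLD = 200_000  # tokens per request (assumed trigger)

def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      thinking_tokens: int = 0) -> float:
    """Estimate request cost; thinking tokens bill at the output rate."""
    large = input_tokens > LARGE_CONTEXT_THRESHOLD
    input_rate = 4.0 if large else 2.0      # $ per million input tokens
    output_rate = 18.0 if large else 12.0   # $ per million output tokens
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * input_rate
            + billed_output * output_rate) / 1_000_000

# 10k-token prompt, 1k-token answer, 8k thinking tokens:
print(round(estimate_cost_usd(10_000, 1_000, 8_000), 4))  # → 0.128
```

Note how the thinking tokens dominate: of the $0.128 above, $0.096 is the thinking trace alone.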

Google AI Studio, the free developer interface at aistudio.google.com, provides experimental access to Deep Think within the Studio's rate limits, making it accessible for prototyping without a subscription.

The practical cost of using Deep Think is higher than the base Gemini 3.1 Pro pricing would suggest, because reasoning-intensive queries generate substantially more output tokens in thinking traces than standard queries. A request that produces 500 output tokens in standard mode may generate several thousand thinking tokens when Deep Think is enabled, at the same per-token rate.
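A quick worked example makes the amplification concrete. The 5,000-thinking-token figure is an assumption for illustration, consistent with the "several thousand" range described above; only the output side at the standard $12-per-million rate is considered.

```python
# Worked example of the output-side cost amplification described above.

RATE_PER_TOKEN = 12.0 / 1_000_000  # dollars per output/thinking token

standard_cost = 500 * RATE_PER_TOKEN               # 500 visible output tokens
deep_think_cost = (500 + 5_000) * RATE_PER_TOKEN   # plus ~5k thinking tokens (assumed)

print(f"{deep_think_cost / standard_cost:.0f}x")   # prints "11x"
```

Under these assumptions, the same visible answer costs eleven times as much on the output side once the thinking trace is billed.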

Comparison

Direct competitors in the extended reasoning category, as of April 2026:

  • Gemini 3.1 Pro (Google DeepMind). The standard-mode counterpart. Gemini 3.1 Pro at standard mode is faster, less expensive per query, and appropriate for most tasks: retrieval, summarization, document analysis, code generation, and general instruction following. Deep Think adds meaningful value over Gemini 3.1 Pro standard mode on competition mathematics, Olympiad-level science, complex multi-step logical reasoning, and formal proofs. For most production workloads, standard Gemini 3.1 Pro is the right default; Deep Think is the right choice when queries require extended deliberation and latency tolerance allows it.
  • o3 (OpenAI). The most direct architectural parallel. o3 is OpenAI's reasoning-specialist model trained specifically for chain-of-thought performance, rather than a reasoning mode added to a general model. On competitive mathematics and formal reasoning benchmarks, o3 and Gemini 3 Deep Think are close, with benchmark leadership varying by task and evaluation configuration. o3 is available at the API level with its own pricing structure; it does not come bundled into a consumer subscription at the same tier as Deep Think. The key practical difference is that Deep Think inherits Gemini 3's native multimodal capability, while o3 has more limited multimodal handling.
  • Claude Opus 4.7 (Anthropic) extended thinking. Claude Opus 4.7 offers an extended thinking mode that functions analogously to Deep Think: the model reasons internally before producing a final response, with thinking tokens billed separately. Claude Opus 4.7 leads across most SWE-bench Verified evaluations at 87.6%, making extended thinking on Claude a strong choice for software engineering and code-repair tasks. Gemini 3 Deep Think's advantage concentrates on formal mathematics, Olympiad-level physical science, and abstract reasoning benchmarks like ARC-AGI-2. For enterprise buyers whose workloads are primarily code-focused, Claude Opus 4.7 with extended thinking is a competitive alternative; for research-grade science and mathematics, Deep Think's benchmark results are stronger.

Outlook

Open questions for the next 6 to 18 months:

  • API general availability. Deep Think was in API preview as of early 2026. Whether Google moves it to general availability with stable pricing, or keeps it in a gated access program, determines how easily third-party products can depend on it.
  • Gemini 4 and Deep Think continuity. The Gemini generation has moved on roughly a six-to-nine-month cadence. When Gemini 4 arrives, likely in late 2026 or early 2027, whether Deep Think is carried forward as a first-class mode or replaced by a new reasoning architecture is an open question.
  • Thinking token cost. The current billing structure treats thinking tokens the same as output tokens. As reasoning models become more widely used, competitive pressure from OpenAI and Anthropic may push Google to separate thinking token pricing, as Anthropic has done with Claude's extended thinking, or to bundle more thinking capacity into subscription tiers.
  • Multimodal reasoning depth. Deep Think's advantage over text-only reasoning systems rests on its ability to reason over images, audio, and video natively. How Google develops and communicates this advantage, particularly for scientific imaging and video-based reasoning tasks, will shape its positioning in research and enterprise markets.
  • Benchmark ceiling effects. Deep Think's strong AIME and Olympiad scores are approaching the ceiling of current formal evaluation frameworks. As the community develops harder benchmarks, whether Deep Think sustains its relative position is a key indicator of the approach's scalability.

About the author
Nextomoro

nextomoro tracks progress for AI research labs, models, and what's next.

AI Research Lab Intelligence
