Operations

Adaptive Reasoning Effort: Dynamic Thinking Budgets for AI Agents

Learn how to dynamically allocate reasoning effort across agent workflow steps — cutting token costs by 30-50% without sacrificing accuracy, using routing heuristics and research-backed patterns.

Keats 13 min read

The Expensive Default — Why Agents Over-Think Every Step

Your agent reads a file. Maximum reasoning engaged. Your agent opens a URL it’s visited three times. Full chain-of-thought deployed. Your agent concatenates two strings. Deep thinking activated.

This is how most production agents operate — every step gets the same reasoning budget regardless of complexity. It’s the computational equivalent of hiring a senior architect to carry boxes.

Consider a concrete scenario: a 25-step coding agent task — researching a topic, reading source files, writing output, validating results. Assume you’re running Claude Sonnet 4 with extended thinking enabled at April 2026 pricing. The step distribution in a workflow like this typically breaks down as follows:

  • 12-15 steps are mechanical: file reads, known-good shell commands, template expansions, simple extractions
  • 5-8 steps are moderately complex: conditional API calls, text transformations, multi-option selections
  • 2-4 steps are genuinely hard: multi-step planning, error recovery, novel problem decomposition

The ratio matters more than the absolute numbers. Roughly 60% of steps in a multi-step agent task are trivial — and when every step draws the same reasoning budget, those trivial steps account for the majority of your reasoning token spend. The exact waste depends on your pricing tier, model choice, and task mix. But the shape is consistent: most of the budget goes where it isn’t needed.
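
To make the shape concrete, here is a back-of-envelope sketch with hypothetical numbers: a uniform thinking budget per step and one plausible split within the ranges above. The figures are illustrative, not measured.

# Back-of-envelope: a uniform thinking budget applied to the 25-step task above.
# Step counts are one plausible split within the ranges; the token figure is illustrative.
UNIFORM_BUDGET = 4_000  # hypothetical thinking tokens per step

steps = {"mechanical": 15, "moderate": 7, "hard": 3}  # sums to 25

total_thinking = sum(steps.values()) * UNIFORM_BUDGET
mechanical_share = steps["mechanical"] * UNIFORM_BUDGET / total_thinking
print(f"Mechanical steps consume {mechanical_share:.0%} of the thinking budget")
# -> 60%: the majority of reasoning spend goes to steps that barely need it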

This isn’t just intuition. The Ares framework (arXiv 2603.07915) tested three static reasoning strategies across WebArena, TAU-Bench, and BrowseComp-Plus benchmarks. The results were unambiguous. Always-low degraded task success rates by 15-30%. Random selection saved tokens inconsistently while introducing unpredictable failures. Always-high preserved accuracy — but at a cost that scales linearly with every task you run, because it makes no distinction between opening a URL and diagnosing a multi-step failure.

The problem isn’t that agents reason too much. It’s that they reason indiscriminately.

What Research Tells Us — And What It Doesn’t

The most directly actionable work is Ares (Adaptive Reasoning Effort Selection, arXiv 2603.07915). It introduced a lightweight router — a small classifier, trivial compared to the LLM itself — that predicts the minimum reasoning level needed for each step based on interaction history. On WebArena tasks, Ares reduced reasoning token usage by 52.7% while maintaining near-identical task success rates. The key finding: the router learned that most steps in complex workflows are simple enough for low-effort reasoning. The hard part was identifying the 20-30% of steps where full reasoning actually changes the outcome.

Convergent evidence from adjacent domains reinforces the pattern. Think Anywhere (arXiv 2603.29957) demonstrated that models can learn to invoke deep reasoning at arbitrary token positions — activating extended thinking only at high-entropy points where the next token is genuinely uncertain — achieving state-of-the-art on multiple code benchmarks. RARRL (arXiv 2603.16673) applied RL to teach embodied agents when to invoke reasoning versus acting on cached policy.

The shared abstraction across all three: classify the current context → select the reasoning level → execute → validate. When research converges from different directions on the same architecture, the architecture tends to survive.

One critical caveat: these results come from benchmarks, not production agent deployments. Ares demonstrated 52.7% savings on WebArena. Whether that translates to your agent fleet — with different task distributions, error tolerances, and step types — is an open question. The pattern is sound. The specific numbers need your own validation.

A Step Complexity Taxonomy for Agent Workflows

Before you can route reasoning, you need to classify what you’re routing. Here’s a practical three-tier taxonomy.

Tier 1: Reflexive

Steps where the outcome is deterministic or near-deterministic. The agent isn’t deciding — it’s executing.

  • Reading a file from a known path
  • Running a shell command with known-good arguments
  • Extracting a value from structured JSON
  • Opening a URL
  • Writing a log entry
  • Appending to an array with known schema

Tier 2: Deliberate

Steps with moderate ambiguity where the agent evaluates bounded options.

  • Selecting which API endpoint to call based on input conditions
  • Transforming text from one format to another with edge cases
  • Choosing between 2-4 pre-defined strategies based on current state
  • Parsing semi-structured input where the schema isn’t guaranteed
  • Composing a response that must meet specific formatting constraints

Tier 3: Strategic

High-stakes or novel decisions where errors compound and the situation is genuinely new.

  • Multi-step planning with dependency chains
  • Error diagnosis when the failure mode is unknown
  • Architectural decisions that constrain future steps
  • Interpreting ambiguous instructions
  • Recovery from a failed step where the retry strategy isn’t obvious
  • Evaluating whether an output meets a quality standard

This taxonomy maps conveniently to the three reasoning levels most providers expose (low/medium/high). That convenience is partly real — providers designed their APIs around similar intuitions about task complexity — and partly a simplification you need to watch.

⚠️ Warning: Tier assignment is context-dependent, not step-type-fixed. A file read is Tier 1 when the path is hardcoded. It’s Tier 2 when the path is dynamically constructed from user input. A text transformation is Tier 2 for structured formats, Tier 3 when the input is ambiguous natural language. The taxonomy is a starting heuristic, not a permanent classification. If you treat step types as fixed tiers, you’ll misclassify exactly the steps where it matters most.
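
One way to honor that context-dependence in code is to classify on step attributes rather than step type alone. The sketch below is illustrative; the attribute names (path_is_dynamic, input_is_freeform) are hypothetical, not taken from any framework.

# Illustrative context-aware tier assignment. Attribute names are hypothetical;
# the point is that the same step type can land in different tiers.
def classify_step(step_type: str, *, path_is_dynamic: bool = False,
                  input_is_freeform: bool = False) -> int:
    if step_type == "file_read":
        # Hardcoded path: Tier 1. Path built from user input: Tier 2.
        return 2 if path_is_dynamic else 1
    if step_type == "text_transform":
        # Structured formats: Tier 2. Ambiguous natural language: Tier 3.
        return 3 if input_is_freeform else 2
    return 2  # unknown combinations default to the middle tier

classify_step("file_read")                                # -> 1
classify_step("file_read", path_is_dynamic=True)          # -> 2
classify_step("text_transform", input_is_freeform=True)   # -> 3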

Mapping Tiers to Provider Reasoning Levels

The infrastructure for variable reasoning exists today. Every major provider exposes some form of reasoning effort control — but “low” doesn’t mean the same thing everywhere.

Claude (Anthropic): The thinking parameter accepts a budget_tokens value. At low budgets (1,024 tokens), Claude still produces structured reasoning — just shorter. The thinking is compressed, not eliminated. This makes Claude’s low-effort mode relatively safe for Tier 2 steps that benefit from some analysis.

OpenAI o-series: The reasoning_effort parameter accepts low, medium, or high. At low, reasoning is substantially curtailed — closer to a standard completion than compressed chain-of-thought. The behavioral gap between low and high is larger than Claude’s equivalent, which means aggressive Tier 1 routing works well but Tier 2 steps may need medium.

Gemini: Thinking configuration via thinkingConfig.thinkingBudget. Behaves similarly to Claude’s budget approach — a numeric cap on thinking tokens.

DeepSeek: Reasoning effort controllable via system prompt instructions and temperature. Less granular than the others.

👤 Human note: These behavioral observations reflect April 2026 APIs and will shift. The structural advice: wrap provider calls behind an abstraction layer. Your tier classification is a stable abstraction that reflects your workflow’s complexity. The provider API is an unstable interface that changes quarterly. Keep them separate.
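
A minimal version of that abstraction layer might look like the sketch below, which translates a tier level into provider-specific request parameters. The parameter names follow the provider descriptions above; treat the budget values as illustrative defaults to tune, not recommendations.

# Minimal sketch of a provider abstraction: one stable "low/medium/high" vocabulary
# on our side, translated into whatever each provider's API currently expects.
# Budget numbers are illustrative defaults, not recommendations.

ANTHROPIC_BUDGETS = {"low": 1_024, "medium": 4_096, "high": 16_384}
GEMINI_BUDGETS = {"low": 1_024, "medium": 4_096, "high": 16_384}

def reasoning_kwargs(provider: str, level: str) -> dict:
    if provider == "anthropic":
        return {"thinking": {"type": "enabled",
                             "budget_tokens": ANTHROPIC_BUDGETS[level]}}
    if provider == "openai":
        return {"reasoning_effort": level}  # o-series accepts low/medium/high
    if provider == "gemini":
        return {"thinking_budget": GEMINI_BUDGETS[level]}  # maps to thinkingConfig
    raise ValueError(f"unknown provider: {provider}")

# Call sites only ever say reasoning_kwargs("anthropic", "low"). When a provider
# changes its API, this is the single function that needs updating.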

Implementation — Most of You Should Stop at Level 1

Three implementation levels exist. Here’s the honest recommendation: ship Level 1, measure for two weeks, and only escalate to Level 2 if the data justifies it. Level 3 is premature optimization for 95% of deployments.

Level 1: Static Routing Table

A dictionary mapping step types to reasoning levels. Zero infrastructure required.

REASONING_TIERS = {
    # Tier 1 — Reflexive
    "file_read": "low",
    "file_write": "low",
    "shell_exec_known": "low",
    "json_extract": "low",
    "url_open": "low",
    "log_append": "low",
    
    # Tier 2 — Deliberate
    "api_call_conditional": "medium",
    "text_transform": "medium",
    "format_selection": "medium",
    "schema_parse_fuzzy": "medium",
    
    # Tier 3 — Strategic
    "multi_step_plan": "high",
    "error_diagnosis": "high",
    "architecture_decision": "high",
    "ambiguous_instruction": "high",
    "quality_evaluation": "high",
}

def get_reasoning_level(step_type: str) -> str:
    return REASONING_TIERS.get(step_type, "medium")  # default to medium

The default-to-medium fallback is deliberate. Unknown step types get moderate reasoning rather than minimal — conservative where classification is uncertain, cheap where it’s clear.
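
Usage is a single lookup per step, assuming the table and function above are in scope:

get_reasoning_level("file_read")        # -> "low"    (Tier 1)
get_reasoning_level("error_diagnosis")  # -> "high"   (Tier 3)
get_reasoning_level("new_step_type")    # -> "medium" (unknown: route up, not down)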

🤖 Agent note: A static routing table captures the core insight from Ares: step complexity varies, and reasoning budgets should vary with it. You don’t need ML to act on that insight. A dictionary gets you an estimated 60-70% of the benefit. Ship it, instrument it, then decide if the remaining 30-40% justifies a trained router.

Level 2: Heuristic Routing with Feedback

When your Level 1 table has been running for two weeks and you have metrics, add error-rate tracking. Steps that fail consistently at their assigned reasoning level get promoted automatically.

from collections import defaultdict

class AdaptiveRouter:
    def __init__(self, base_routes: dict, promotion_threshold: int = 3):
        self.routes = dict(base_routes)
        self.failure_counts = defaultdict(int)
        self.threshold = promotion_threshold
        self.tier_order = ["low", "medium", "high"]
    
    def get_level(self, step_type: str) -> str:
        return self.routes.get(step_type, "medium")
    
    def record_outcome(self, step_type: str, success: bool):
        if success:
            self.failure_counts[step_type] = 0
            return
        self.failure_counts[step_type] += 1
        if self.failure_counts[step_type] >= self.threshold:
            self._promote(step_type)
            self.failure_counts[step_type] = 0
    
    def _promote(self, step_type: str):
        current = self.routes.get(step_type, "medium")
        idx = self.tier_order.index(current)
        if idx < len(self.tier_order) - 1:
            self.routes[step_type] = self.tier_order[idx + 1]

This is the minimum viable feedback loop. Steps that consistently fail get promoted. The routing table self-corrects based on actual performance rather than your assumptions about complexity.
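
Wiring it to the Level 1 table takes a few lines. A sketch, assuming the REASONING_TIERS dict from earlier:

# Sketch: start from the static table, let failures adjust it over time.
router = AdaptiveRouter(base_routes=REASONING_TIERS, promotion_threshold=3)

router.get_level("text_transform")              # -> "medium"
router.record_outcome("text_transform", False)
router.record_outcome("text_transform", False)
router.record_outcome("text_transform", False)
router.get_level("text_transform")              # -> "high" (promoted after 3 failures)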

Level 3: Trained Router

Following the Ares data pipeline: collect interaction traces with step-level reasoning and outcome labels, replay steps at lower reasoning levels to find the minimum viable level, train a lightweight classifier, deploy as a pre-call router.
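
Here is a hedged sketch of the replay-and-label step, the part that produces training data for the classifier. replay_step and outputs_match are stand-ins for your own replay harness; nothing here comes from the Ares codebase.

# Sketch of minimum-viable-level labeling. replay_step() and outputs_match()
# are placeholders for your own replay harness and output comparison.
LEVELS = ["low", "medium", "high"]

def label_minimum_level(step, reference_output) -> str:
    """Replay a recorded step at ascending effort; return the cheapest level
    whose output matches the known-good reference."""
    for level in LEVELS:
        candidate = replay_step(step, reasoning_level=level)
        if outputs_match(candidate, reference_output):
            return level
    return "high"  # nothing matched: treat as genuinely hard

# Each (step features, label) pair becomes one training example for the router.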

This makes sense if you run thousands of tasks daily and the marginal savings over Level 2 justify the engineering investment. For most builders, the data collection phase alone — running steps at multiple reasoning levels to establish ground truth — costs more in the short term than the savings from a trained router in the medium term.

⚠️ Warning: Training a custom router requires labeled interaction traces where each step is tagged with the minimum reasoning level that produces a correct outcome. Building that dataset means temporarily increasing costs before reducing them. Budget for a 2-3 week data collection phase and have a clear ROI threshold before starting.

Measuring What Matters

Instrument your agent loop to capture per-step metrics. Without measurement, you’re optimizing by intuition — which is how you ended up with uniform reasoning in the first place.

import time
from dataclasses import dataclass

@dataclass
class StepMetrics:
    step_type: str
    reasoning_level: str
    reasoning_tokens: int
    total_tokens: int
    success: bool
    latency_ms: float
    task_id: str

# call_llm, validate_output, current_task_id, and log_metrics are placeholders
# for your own LLM client, output validator, task context, and metrics sink.
def instrumented_llm_call(prompt, step_type, router):
    level = router.get_level(step_type)
    start = time.monotonic()
    
    response = call_llm(prompt, reasoning_level=level)
    
    metrics = StepMetrics(
        step_type=step_type,
        reasoning_level=level,
        reasoning_tokens=response.usage.reasoning_tokens,
        total_tokens=response.usage.total_tokens,
        success=validate_output(response),
        latency_ms=(time.monotonic() - start) * 1000,
        task_id=current_task_id(),
    )
    log_metrics(metrics)
    return response

Five metrics tell the story:

  1. Reasoning tokens per step type — Are you spending proportionally to complexity?
  2. Success rate by reasoning level — Does lowering reasoning for a step type actually degrade outcomes?
  3. Cost per successful task completion — Total token cost for tasks that succeed end-to-end. The north star.
  4. Over-reasoning ratio — Steps where high reasoning was used but low would have succeeded. Your primary waste indicator.
  5. Under-reasoning ratio — Failures attributable to insufficient reasoning. Your safety signal.

🤖 Agent note: The single most revealing metric is over-reasoning ratio. If it exceeds 40%, you’re leaving significant cost savings on the table with zero accuracy risk. Track it weekly. Use it to tune routing thresholds. If it’s below 20%, your routing is already reasonably efficient — focus elsewhere.
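
Metrics 1-3 fall straight out of the logged StepMetrics records; the two ratios additionally require replay data, which the weekly validation pass described below can supply. A small aggregation sketch, assuming a list of the StepMetrics defined above:

from collections import defaultdict
from statistics import mean

def summarize(records: list[StepMetrics]) -> dict:
    # Metric 1: reasoning tokens per step type
    tokens_by_type = defaultdict(list)
    # Metric 2: success rate by reasoning level
    success_by_level = defaultdict(list)
    for r in records:
        tokens_by_type[r.step_type].append(r.reasoning_tokens)
        success_by_level[r.reasoning_level].append(r.success)

    # Metric 3: cost proxy, total tokens per task that succeeded end-to-end
    tasks = defaultdict(list)
    for r in records:
        tasks[r.task_id].append(r)
    successful = [t for t in tasks.values() if all(s.success for s in t)]
    cost_per_success = (mean(sum(s.total_tokens for s in t) for t in successful)
                        if successful else None)

    return {
        "reasoning_tokens_by_step_type": {k: mean(v) for k, v in tokens_by_type.items()},
        "success_rate_by_level": {k: mean(v) for k, v in success_by_level.items()},
        "avg_tokens_per_successful_task": cost_per_success,
    }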

The Under-Reasoning Trap

Every optimization has a failure mode. For adaptive reasoning, it’s this: routing a genuinely hard step to low-effort reasoning, getting a plausible-but-wrong result, and watching the error propagate silently through downstream steps.

The mechanism is specific and dangerous. A step classified as Tier 1 gets minimal reasoning. The LLM produces output that looks correct — syntactically valid, structurally sound — but contains a subtle error. Maybe it selected the wrong API endpoint because the conditional logic was actually Tier 2 complexity. Maybe it misread a schema because the edge case required Tier 3 analysis. The output passes superficial validation. The next step consumes it as truth. Three steps later, the task fails in a way that’s hard to trace back to the original misclassification.

This is the operational reality behind pre-mortem thinking for agent tasks: errors that don’t announce themselves are more dangerous than errors that crash loudly.

⚠️ Warning: A step misclassified as trivial won’t fail loudly — it’ll return a plausible-but-wrong result that corrupts downstream decisions. The cost of over-reasoning on a few steps is linear. The cost of under-reasoning on a critical step is multiplicative. Start conservative and relax thresholds only with data.

Three mitigations:

Conservative initial classification. When uncertain, route up. Default-to-medium in the routing table exists for this reason.

Error-triggered tier promotion. Level 2’s feedback loop handles this automatically. Steps that fail get promoted until they stabilize.

Periodic full-reasoning validation. Once a week, run a sample of tasks at maximum reasoning across all steps. Compare outcomes to your adaptive routing. Any step where results diverge is a misclassified step that needs promotion.
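
That third mitigation is cheap to sketch. In the example below, run_task is a placeholder for your own harness and must expose per-step outputs so the two runs can be compared; any step type whose outputs diverge is a promotion candidate.

import random

# Sketch of the weekly validation pass. run_task() is a placeholder for your own
# harness; it should return per-step outputs so the two runs can be compared.
def weekly_validation(router, recent_tasks, sample_size=20):
    divergent_step_types = set()
    sample = random.sample(recent_tasks, min(sample_size, len(recent_tasks)))
    for task in sample:
        adaptive_run = run_task(task, router=router)              # normal routing
        full_run = run_task(task, force_reasoning_level="high")   # everything at high
        for a_step, f_step in zip(adaptive_run.steps, full_run.steps):
            if a_step.output != f_step.output:
                # Divergence under identical inputs: likely a misclassified step.
                divergent_step_types.add(a_step.step_type)
    return divergent_step_types  # candidates for tier promotion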

Architecture Pattern — Reasoning Level as a First-Class Parameter

The durable takeaway from all of this: reasoning effort shouldn’t be a global config buried in your LLM client initialization. It should be a parameter passed explicitly through every agent decision, logged per step, and adjustable without code changes.

class ReasoningAwareAgent:
    def __init__(self, llm_client, router, metrics_logger):
        self.llm = llm_client
        self.router = router
        self.metrics = metrics_logger
    
    def execute_step(self, step_type: str, prompt: str) -> str:
        level = self.router.get_level(step_type)
        
        response = self.llm.call(
            prompt=prompt,
            reasoning_level=level,
        )
        
        success = self.validate(response)  # your own per-step output check
        self.metrics.record(step_type, level, response.usage, success)
        self.router.record_outcome(step_type, success)
        
        return response.content

The abstraction layer between your routing logic and the provider API is load-bearing. Your tier classification is a stable abstraction — it reflects your understanding of your workflow’s complexity. The provider API is an unstable interface that changes quarterly. Keep them separate.

This pattern composes with scheduling architecture. The same agent that decides when to run a task can decide how hard to think about each step within that task. Both are resource allocation decisions. Both benefit from dynamic adjustment. Both degrade when treated as static configuration.

Adjacent optimization surfaces exist — model routing (which model, not just which reasoning level), prompt optimization (reducing input tokens), multi-agent reasoning coordination — but they’re separate problems that compound with adaptive reasoning rather than replace it.

The immediate action: classify your agent’s steps into the three tiers. Build a static routing table. Instrument your agent loop to measure reasoning tokens per step type. Run it for two weeks. Most operators discover that 40-60% of their reasoning tokens are spent on steps that don’t need them. That’s the gap between indiscriminate thinking and intentional thinking. Close it.