[Google DeepMind] MiRA & SGO: Solving the Planning Bottleneck in Long-Horizon LLM Agents
Abstract

The paper introduces a subgoal-driven framework for long-horizon LLM agents, featuring Gemini-SGO for online planning and MiRA (Milestoning your Reinforcement Learning Enhanced Agent) for offline training. The system addresses planning failures in complex web navigation, achieving a 43.0% success rate on WebArena-Lite with a Gemma3-12B backbone, significantly outperforming GPT-4o and previous open-source SOTA.

TL;DR

Web navigation is the "stress test" for AI agents, requiring dozens of steps where a single mistake leads to a loop. Google DeepMind researchers have tackled this with a two-pronged strategy: Gemini-SGO for real-time introspective planning and MiRA, a Reinforcement Learning (RL) framework that replaces sparse "win/loss" rewards with dense, milestone-based signals. The result? A 12B open model (Gemma3) that crushes GPT-4o's performance on web tasks.

Problem: The "Mid-Task Stuck" Phenomenon

Why do even the most powerful LLMs fail at booking a flight or managing a GitHub repo? Analysis shows that agents don't just "fail"; they get lost. In nearly 50% of failed trajectories, models like Gemini-2.5-Pro enter non-productive action loops—a "mid-task stuck" behavior.

The root causes are twofold:

  1. Online Execution: Agents lack situational awareness; they don't track which sub-parts of a task are finished.
  2. Training (RL): Rewards are typically binary (1 for success, 0 for failure). In a 50-step sequence, the agent has no idea which of the 49 intermediate actions were actually helpful.
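To make the credit-assignment problem concrete, here is a minimal sketch contrasting a binary outcome reward with a milestone-dense one. The step counts and milestone positions are illustrative assumptions, not values from the paper:

```python
# Sketch: sparse (binary outcome) vs. milestone-dense reward signals
# over a 50-step trajectory. Milestone placement is hypothetical.

def sparse_rewards(num_steps, success):
    """Binary outcome reward: only the final step carries any signal."""
    rewards = [0.0] * num_steps
    rewards[-1] = 1.0 if success else 0.0
    return rewards

def milestone_rewards(num_steps, milestone_steps):
    """Dense reward: credit is paid at each completed subgoal."""
    return [milestone_steps.get(t, 0.0) for t in range(num_steps)]

sparse = sparse_rewards(50, success=True)
dense = milestone_rewards(50, {9: 0.25, 24: 0.25, 39: 0.25, 49: 0.25})

# Same total credit, but the dense variant tells the agent *which*
# intermediate actions moved it forward.
assert sparse[:49] == [0.0] * 49
assert sum(dense) == 1.0
```

With the sparse signal, all 49 intermediate actions look identical to the learner; with milestones, progress is localized in time.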

Methodology: Milestoning the Agent's Journey

The authors propose a unified framework that treats subgoals as first-class citizens in both the agent's "thinking" and its "learning."

1. Dynamic Milestoning (Inference-Time)

Instead of blindly predicting the next click, the SGO (Subgoal-Oriented) framework forces the agent to ask three introspective questions at every step:

  • What milestones have I achieved so far?
  • Have I finished the current subgoal?
  • What is the next future milestone?

Figure: Dynamic Milestoning framework. The real-time feedback loop where the agent's actions are monitored against a SubGoals Checklist.
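The three-question loop can be sketched as a checklist that is consulted before every action. The class and prompt wording below are assumptions for illustration, not the paper's implementation:

```python
# Hypothetical sketch of an SGO-style introspection step: the agent
# tracks a subgoal checklist and answers the three questions before
# choosing its next action.

from dataclasses import dataclass, field

@dataclass
class SubgoalChecklist:
    subgoals: list
    done: set = field(default_factory=set)

    def mark_done(self, idx):
        self.done.add(idx)

    def achieved(self):
        # Question 1: which milestones are finished?
        return [g for i, g in enumerate(self.subgoals) if i in self.done]

    def current(self):
        # Question 2: what is the subgoal in progress?
        for i, g in enumerate(self.subgoals):
            if i not in self.done:
                return g
        return None

    def next_milestone(self):
        # Question 3: what comes after the current subgoal?
        pending = [g for i, g in enumerate(self.subgoals) if i not in self.done]
        return pending[1] if len(pending) > 1 else None

def introspection_prompt(checklist, observation):
    """Prepend the three introspective answers to the action prompt."""
    return (
        f"Milestones achieved: {checklist.achieved()}\n"
        f"Current subgoal: {checklist.current()}\n"
        f"Next milestone: {checklist.next_milestone()}\n"
        f"Observation: {observation}\n"
        "Decide the next action."
    )

checklist = SubgoalChecklist(["navigate to search", "find item", "add to cart"])
checklist.mark_done(0)
prompt = introspection_prompt(checklist, "search results page")
```

Keeping explicit finished/pending state is what gives the agent the situational awareness that plain next-action prediction lacks.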

2. MiRA: Reward Shaping with a Potential Critic

To solve the credit assignment problem during RL, the authors introduce MiRA. It uses a Potential Critic ($P_\psi$) that learns to predict "progress" on a scale of 0 to 1. By interpolating between completed subgoals (e.g., "Navigated to search" -> 0.25, "Found item" -> 0.5), the reward signal becomes dense. The agent receives a "shaping reward" whenever it increases its potential—moving it closer to the next milestone even before the final goal is reached.
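In classical potential-based reward shaping, the shaping term is the discounted change in potential between consecutive states. A minimal sketch, using the interpolated milestone values from the text as example potentials (the gamma value is an assumption):

```python
# Potential-based reward shaping: r' = r + gamma * P(s') - P(s).
# The potential values (0.25, 0.5) mirror the milestone interpolation
# described above; gamma is an illustrative choice.

def shaped_reward(env_reward, potential_s, potential_s_next, gamma=0.99):
    """Add the discounted change in predicted progress to the raw reward."""
    return env_reward + gamma * potential_s_next - potential_s

# Moving from "navigated to search" (P = 0.25) toward "found item"
# (P = 0.5) yields a positive signal even though the environment
# reward is still zero.
r = shaped_reward(0.0, 0.25, 0.5)
assert r > 0
```

Because the shaping term telescopes over a trajectory, it densifies the signal without changing which policies are optimal.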

Experiments: Performance Leap

The framework was tested on WebArena-Lite, a rigorous benchmark covering Shopping, Maps, Reddit, and GitLab.

| Model | Avg. Success Rate (SR) |
| :--- | :--- |
| GPT-4o | 13.9% |
| Gemini-2.5-Pro | 23.0% |
| Gemini-2.5-Pro + SGO (Ours) | 32.1% |
| Gemma3-12B + MiRA (Ours) | 43.0% |

The 12B Gemma3 model, when trained with MiRA, achieved a staggering 43.0% SR, proving that targeted RL with dense rewards can help small models punch far above their weight class.

Figure: Failure mode distribution. MiRA drastically reduces "Stuck Midway" errors (execution loops) compared to standard SFT and RL baselines.

Deep Insight: Compiling Planning into Intuition

One of the most profound takeaways is the complementary nature of these two methods.

  • MiRA (Training) essentially "compiles" complex navigation patterns into the model's weights. The agent develops an "intuition" for the right path.
  • SGO (Inference) acts as a runtime guardrail, catching the model when its intuition fails in novel or dynamic scenarios.

The authors also noted a "behavioral phase transition" (Figure 12 in the paper). Early in training, agents stall at the first milestone. By the final phase of MiRA, completion probability follows a perfect diagonal gradient across time—indicating that the agent has learned to chain subgoals sequentially and efficiently.

Conclusion & Future Horizons

This work demonstrates that the bottleneck for LLM agents isn't just raw "intelligence" or parameter count—it's structured planning fidelity. By grounding the agent in semantically verifiable milestones, we move away from "hallucinated progress" toward robust digital coworkers.

The next frontier? Self-evolving autonomy, where the agent autonomously diagnoses its own failures to generate the very subgoals and curriculum needed for its next upgrade.
