DriveAgent-R1

Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking

Weicheng Zheng1,3*, Xiaofei Mao2*, Nanfei Ye2, Pengxiang Li2, Kun Zhan2, XianPeng Lang2, Hang Zhao1,4†
1Shanghai Qi Zhi Institute  2Li Auto  3Tongji University  4Tsinghua University
*Equal contribution  †Corresponding author
ICLR 2026

Active Perception Example

[Figure: front-view camera input and the agent's reasoning trace under M_tool]

The agent uses RoI Inspection to resolve ambiguity, uncover critical risks, and revise its plan.

Case Studies

Selected qualitative cases from the paper.

Tool Mode

Case A: Busy Intersection


At a complex intersection, the agent invokes RoI Inspection to confirm the green light and Retrieve View to assess dense pedestrian flow before deciding to stop and wait.

Tool Mode

Case B: Night Hazard Discovery


On a dark narrow road, initial perception misses a pedestrian. The agent calls 3D Object Detection, actively revealing the hidden hazard and revising its plan to decelerate.

Tool Mode

Case C: Depth-Aware Passing


Facing a barrier gate, the agent uses Depth Estimation to gauge proximity and see the clear road beyond, producing a confident two-stage plan: slow down, then accelerate through.

Tool Mode

Case D: Distant Pedestrian Risk


On a seemingly clear road, the agent proactively calls RoI Inspection to zoom in on distant pedestrians and discovers they are dangerously close to the lane, prompting a prudent deceleration.

Adaptive Mode

Adaptive: Active Tool Use


Facing an ambiguous intersection, M_adaptive autonomously switches to Tool Mode and uses RoI Inspection to detect an oncoming car that is invisible at the default resolution, enabling a safe lane change.

Text Mode

Adaptive: Avoiding Tool Noise


With a cutting-in car clearly visible, M_adaptive stays in Text Mode, avoiding unnecessary tool calls that would inject distracting information and lead to a dangerous "Keep Speed" prediction.

Overview

DriveAgent-R1 is a VLM-based driving agent for high-level behavioral planning. It aims to deliver robust and interpretable decisions in challenging traffic scenarios.

The method has two key designs: Active Perception, which proactively invokes vision tools under uncertainty, and Hybrid Thinking, which uses efficient text-only reasoning for simple scenes and tool-augmented reasoning for complex ones.
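To make the division of labor concrete, here is a minimal sketch of what one planning step under these two designs could look like. Everything in it (the choose_mode, propose_tool_call, and decide methods, the tool registry) is hypothetical scaffolding, not the paper's actual API.

# Hypothetical sketch of one hybrid-thinking planning step. All names are
# illustrative, not the paper's actual interface.

from dataclasses import dataclass, field

@dataclass
class AgentStep:
    mode: str                                    # "text" or "tool"
    tool_calls: list = field(default_factory=list)
    plan: str = ""

def run_step(vlm, frame, instruction, tools, max_tool_calls=3):
    mode = vlm.choose_mode(frame, instruction)   # text-only vs. tool-augmented
    step = AgentStep(mode=mode)
    context = [frame]
    if mode == "tool":
        for _ in range(max_tool_calls):
            call = vlm.propose_tool_call(context, instruction)
            if call is None:                     # model judges evidence sufficient
                break
            evidence = tools[call.name](frame, **call.args)  # e.g. an RoI crop
            context.append(evidence)             # feed new evidence back in
            step.tool_calls.append(call)
    step.plan = vlm.decide(context, instruction) # final behavioral plan
    return step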

Hybrid-Thinking Architecture

[Figure: DriveAgent-R1 hybrid-thinking architecture]

Dynamic switching between text-only and tool-augmented modes balances efficiency and robustness.

Three-Stage Progressive Training

[Figure: DriveAgent-R1 three-stage progressive training strategy]

The pipeline comprises three stages, DM-SFT, FCM-RL, and AMS-RL: the first two strengthen each reasoning mode on its own, and the last teaches adaptive mode switching, as outlined in the sketch below.
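A hedged outline of that stage ordering, in code form. Only the stage names and their order come from the paper; the function names, arguments, and stub bodies are assumptions made for the sketch.

# Illustrative ordering of the three training stages. Stage names follow
# the paper; dual_mode_sft, rl_finetune, and forced_mode are assumptions.

def dual_mode_sft(model, data):              # stub for supervised fine-tuning
    return model

def rl_finetune(model, env, forced_mode):    # stub for one RL stage
    return model

def train_driveagent_r1(model, sft_data, rl_env):
    # Stage 1: DM-SFT -- supervised fine-tuning so that both reasoning
    # modes (text-only and tool-augmented) emit well-formed outputs.
    model = dual_mode_sft(model, sft_data)

    # Stage 2: FCM-RL -- reinforcement learning with the mode held fixed,
    # improving each mode independently before any switching is learned.
    for mode in ("text", "tool"):
        model = rl_finetune(model, rl_env, forced_mode=mode)

    # Stage 3: AMS-RL -- the model now selects its own mode per scene,
    # learning when tool use pays off and when text-only reasoning suffices.
    model = rl_finetune(model, rl_env, forced_mode=None)
    return model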

Key Idea 1: Active Perception

Instead of passively consuming fixed inputs, the agent actively acquires evidence. In complex intersections, low-visibility night scenes, and long-range risk assessment, tool use improves perception completeness and reduces decision errors.
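As a concrete instance, RoI Inspection boils down to re-reading part of the frame at higher resolution. Below is a minimal sketch using Pillow; the function name, box coordinates, and upscale factor are illustrative, not the paper's actual tool interface.

# Minimal RoI-inspection helper: crop a region of interest from the full
# frame and upsample it so the VLM can re-examine it in detail.
# (Illustrative only; the paper's tool interface may differ.)

from PIL import Image

def roi_inspect(frame, box, upscale=4):
    """Return a magnified crop of frame; box is (left, top, right, bottom)."""
    crop = frame.crop(box)
    w, h = crop.size
    return crop.resize((w * upscale, h * upscale), Image.LANCZOS)

# Example: zoom in on a distant traffic light in a 1600x900 front-view frame.
frame = Image.new("RGB", (1600, 900))                 # stand-in for a camera image
zoomed = roi_inspect(frame, box=(760, 300, 840, 380))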

Key Idea 2: Hybrid Thinking

Text-only reasoning is used in low-complexity scenes, while tool-augmented reasoning is activated for complex scenes. This avoids unnecessary tool calls, extra latency, and irrelevant information noise.
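One common way to wire up such switching is to have the model open each response with a mode tag and let the runtime route on it, so tool execution is only paid for when requested. The tag format below is an assumption, not the paper's actual output schema.

import re

# The model declares its mode up front; the runtime parses the tag and only
# enters the tool loop when asked to. The tag spelling is assumed.
MODE_TAG = re.compile(r"^<mode>(text|tool)</mode>")

def route(response: str) -> str:
    """Extract the declared reasoning mode from a model response."""
    m = MODE_TAG.match(response.strip())
    return m.group(1) if m else "text"    # default to the cheap text path

assert route("<mode>tool</mode> The light is too small to read...") == "tool"
assert route("<mode>text</mode> Clear scene, keep lane.") == "text"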

Experimental Results

DriveAgent-R1 achieves state-of-the-art behavioral planning with only a 3B model, surpassing GPT-5 on nuScenes and leading open-loop motion planning benchmarks.

- +6.07%: tool-use accuracy gain on Drive-Internal
- 47.10%: sequence accuracy on nuScenes (surpassing GPT-5's 45.14%)
- 0.28 m: average ADE on nuScenes (best open-loop planning; the metric is sketched below)
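ADE (average displacement error) is the standard open-loop planning metric: the mean L2 distance between predicted and ground-truth waypoints over the planning horizon. A minimal NumPy sketch, with illustrative toy numbers:

import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean L2 distance over all waypoints.
    pred, gt: (T, 2) arrays of (x, y) positions over the planning horizon."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example: a 3-waypoint plan with a constant 0.28 m lateral offset.
pred = np.array([[0.0, 0.28], [2.0, 0.28], [4.0, 0.28]])
gt   = np.array([[0.0, 0.00], [2.0, 0.00], [4.0, 0.00]])
print(ade(pred, gt))   # 0.28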

High-Level Behavioral Planning

Sequence Average Joint Accuracy (%) on the Drive-Internal and nuScenes test splits. Parentheses show the absolute change from enabling visual tools.

Model                | Drive-Internal (test)       | nuScenes (test)
                     | w/o Tools | w/ Tools        | w/o Tools | w/ Tools
---------------------+-----------+-----------------+-----------+----------------
Human                | 49.29     | –               | 48.24     | –
Qwen2.5-VL-3B        | 24.98     | 22.63 (−2.35)   | 23.48     | 21.58 (−1.90)
Qwen2.5-VL-72B       | 38.80     | 39.61 (+0.81)   | 39.13     | 40.47 (+1.34)
Gemini-2.5-Flash     | 42.14     | 43.42 (+1.28)   | 42.69     | 44.07 (+1.38)
GPT-4.1              | 42.14     | 43.43 (+1.29)   | 43.63     | 44.72 (+1.09)
GPT-5                | 47.19     | 47.97 (+0.78)   | 44.85     | 45.14 (+0.29)
DriveAgent-R1 (Ours) | 43.29     | 45.42 (+2.13)   | 44.43     | 47.10 (+2.67)
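For reference, one plausible reading of this metric: a timestep counts as correct only when both predicted action dimensions (e.g. longitudinal and lateral meta-actions) match the ground truth, and accuracies are averaged per sequence and then across sequences. The sketch below encodes that assumption; consult the paper for the exact definition.

# Hedged sketch of "sequence average joint accuracy". A timestep scores 1
# only if BOTH action dimensions match the ground truth; per-sequence
# accuracies are then averaged. This reading is an assumption.

def seq_avg_joint_acc(pred_seqs, gt_seqs):
    """pred_seqs, gt_seqs: lists of sequences of (longitudinal, lateral) pairs."""
    per_seq = []
    for pred, gt in zip(pred_seqs, gt_seqs):
        hits = [p == g for p, g in zip(pred, gt)]    # joint match per timestep
        per_seq.append(sum(hits) / len(hits))
    return 100.0 * sum(per_seq) / len(per_seq)       # percentage

pred = [[("accelerate", "keep_lane"), ("keep_speed", "keep_lane")]]
gt   = [[("accelerate", "keep_lane"), ("decelerate", "keep_lane")]]
print(seq_avg_joint_acc(pred, gt))   # 50.0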

DriveBench

Comparison with VLM-based driving agents on DriveBench.

Method        | Perception | Prediction | Planning | Behavior
--------------+------------+------------+----------+---------
DriveLM       | 16.85      | 44.33      | 68.71    | 42.78
Dolphins      | 9.59       | 32.66      | 52.91    | 18.81
DriveAgent-R1 | 34.07      | 32.85      | 61.89    | 43.69

DriveAgent-R1 roughly doubles DriveLM's Perception score (34.07 vs. 16.85) and achieves the highest Behavior score overall.

Citation

@inproceedings{driveagentr1,
  title={DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking},
  author={Weicheng Zheng and Xiaofei Mao and Nanfei Ye and Pengxiang Li and Kun Zhan and XianPeng Lang and Hang Zhao},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}