DriveAgent-R1

Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking

Weicheng Zheng1,3*, Xiaofei Mao2*, Nanfei Ye2, Pengxiang Li2, Kun Zhan2, XianPeng Lang2, Hang Zhao1,4†
1Shanghai Qi Zhi Institute  2Li Auto  3Tongji University  4Tsinghua University
*Equal contribution  †Corresponding author
ICLR 2026

Active Perception Example

[Figure: front-view camera input and the agent's reasoning trace under M_tool]

The agent uses RoI Inspection to resolve ambiguity, uncover critical risks, and revise its plan.

Case Studies

Selected qualitative cases from the paper.

Tool Mode

Case A: Busy Intersection


At a complex intersection, the agent invokes RoI Inspection to confirm the green light and Retrieve View to assess dense pedestrian flow before deciding to stop and wait.

Tool Mode

Case B: Night Hazard Discovery


On a dark narrow road, initial perception misses a pedestrian. The agent calls 3D Object Detection, actively revealing the hidden hazard and revising its plan to decelerate.

Tool Mode

Case C: Depth-Aware Passing


Facing a barrier gate, the agent uses Depth Estimation to gauge proximity and see the clear road beyond, producing a confident two-stage plan: slow down, then accelerate through.

Tool Mode

Case D: Distant Pedestrian Risk


On a seemingly clear road, the agent proactively calls RoI Inspection to zoom in on distant pedestrians and discovers they are dangerously close to the lane, prompting a prudent deceleration.

Adaptive Mode

Adaptive: Active Tool Use


Facing an ambiguous intersection, M_adaptive autonomously switches to Tool Mode and uses RoI Inspection to detect an oncoming car that is invisible at the default resolution, enabling a safe lane change.

Text Mode

Adaptive: Avoiding Tool Noise


With a cutting-in car clearly visible, M_adaptive stays in Text Mode, avoiding unnecessary tool calls that would inject distracting information and lead to a dangerous "Keep Speed" prediction.

Overview

DriveAgent-R1 is a VLM-based driving agent for high-level behavioral planning. It aims to deliver robust and interpretable decisions in challenging traffic scenarios.

The method has two key designs: Active Perception, which proactively invokes vision tools under uncertainty, and Hybrid Thinking, which uses efficient text-only reasoning for simple scenes and tool-augmented reasoning for complex ones.
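To make the division of labor concrete, here is a minimal sketch of what one planning step under these two designs could look like. Everything in it (the choose_mode, propose_tool_call, and decide methods, the tool registry) is hypothetical scaffolding, not the paper's actual API.

# Hypothetical sketch of one hybrid-thinking planning step. All names are
# illustrative, not the paper's actual interface.

from dataclasses import dataclass, field

@dataclass
class AgentStep:
    mode: str                                    # "text" or "tool"
    tool_calls: list = field(default_factory=list)
    plan: str = ""

def run_step(vlm, frame, instruction, tools, max_tool_calls=3):
    mode = vlm.choose_mode(frame, instruction)   # text-only vs. tool-augmented
    step = AgentStep(mode=mode)
    context = [frame]
    if mode == "tool":
        for _ in range(max_tool_calls):
            call = vlm.propose_tool_call(context, instruction)
            if call is None:                     # model judges evidence sufficient
                break
            evidence = tools[call.name](frame, **call.args)  # e.g. an RoI crop
            context.append(evidence)             # feed new evidence back in
            step.tool_calls.append(call)
    step.plan = vlm.decide(context, instruction) # final behavioral plan
    return step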

Hybrid-Thinking Architecture

[Figure: DriveAgent-R1 hybrid-thinking architecture]

Dynamic switching between text-only and tool-augmented modes balances efficiency and robustness.

Three-Stage Progressive Training

[Figure: DriveAgent-R1 three-stage progressive training strategy]

The pipeline comprises three stages, DM-SFT, FCM-RL, and AMS-RL: the first two strengthen each reasoning mode on its own, and the last teaches adaptive mode switching, as outlined in the sketch below.
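A hedged outline of that stage ordering, in code form. Only the stage names and their order come from the paper; the function names, arguments, and stub bodies are assumptions made for the sketch.

# Illustrative ordering of the three training stages. Stage names follow
# the paper; dual_mode_sft, rl_finetune, and forced_mode are assumptions.

def dual_mode_sft(model, data):              # stub for supervised fine-tuning
    return model

def rl_finetune(model, env, forced_mode):    # stub for one RL stage
    return model

def train_driveagent_r1(model, sft_data, rl_env):
    # Stage 1: DM-SFT -- supervised fine-tuning so that both reasoning
    # modes (text-only and tool-augmented) emit well-formed outputs.
    model = dual_mode_sft(model, sft_data)

    # Stage 2: FCM-RL -- reinforcement learning with the mode held fixed,
    # improving each mode independently before any switching is learned.
    for mode in ("text", "tool"):
        model = rl_finetune(model, rl_env, forced_mode=mode)

    # Stage 3: AMS-RL -- the model now selects its own mode per scene,
    # learning when tool use pays off and when text-only reasoning suffices.
    model = rl_finetune(model, rl_env, forced_mode=None)
    return model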

Key Idea 1: Active Perception

Instead of passively consuming fixed inputs, the agent actively acquires evidence. In complex intersections, low-visibility night scenes, and long-range risk assessment, tool use improves perception completeness and reduces decision errors.
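As a concrete instance, RoI Inspection boils down to re-reading part of the frame at higher resolution. Below is a minimal sketch using Pillow; the function name, box coordinates, and upscale factor are illustrative, not the paper's actual tool interface.

# Minimal RoI-inspection helper: crop a region of interest from the full
# frame and upsample it so the VLM can re-examine it in detail.
# (Illustrative only; the paper's tool interface may differ.)

from PIL import Image

def roi_inspect(frame, box, upscale=4):
    """Return a magnified crop of frame; box is (left, top, right, bottom)."""
    crop = frame.crop(box)
    w, h = crop.size
    return crop.resize((w * upscale, h * upscale), Image.LANCZOS)

# Example: zoom in on a distant traffic light in a 1600x900 front-view frame.
frame = Image.new("RGB", (1600, 900))                 # stand-in for a camera image
zoomed = roi_inspect(frame, box=(760, 300, 840, 380))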

Key Idea 2: Hybrid Thinking

Text-only reasoning is used in low-complexity scenes, while tool-augmented reasoning is activated for complex scenes. This avoids unnecessary tool calls, extra latency, and irrelevant information noise.
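One common way to wire up such switching is to have the model open each response with a mode tag and let the runtime route on it, so tool execution is only paid for when requested. The tag format below is an assumption, not the paper's actual output schema.

import re

# The model declares its mode up front; the runtime parses the tag and only
# enters the tool loop when asked to. The tag spelling is assumed.
MODE_TAG = re.compile(r"^<mode>(text|tool)</mode>")

def route(response: str) -> str:
    """Extract the declared reasoning mode from a model response."""
    m = MODE_TAG.match(response.strip())
    return m.group(1) if m else "text"    # default to the cheap text path

assert route("<mode>tool</mode> The light is too small to read...") == "tool"
assert route("<mode>text</mode> Clear scene, keep lane.") == "text"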

Experimental Results

DriveAgent-R1 achieves state-of-the-art behavioral planning with only a 3B model, surpassing GPT-5 on nuScenes and leading open-loop motion planning benchmarks.

- +6.07%: tool-use accuracy gain on Drive-Internal
- 47.10%: sequence accuracy on nuScenes (surpassing GPT-5's 45.14%)
- 0.28 m: average ADE on nuScenes (best open-loop planning; the metric is sketched below)
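ADE (average displacement error) is the standard open-loop planning metric: the mean L2 distance between predicted and ground-truth waypoints over the planning horizon. A minimal NumPy sketch, with illustrative toy numbers:

import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean L2 distance over all waypoints.
    pred, gt: (T, 2) arrays of (x, y) positions over the planning horizon."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example: a 3-waypoint plan with a constant 0.28 m lateral offset.
pred = np.array([[0.0, 0.28], [2.0, 0.28], [4.0, 0.28]])
gt   = np.array([[0.0, 0.00], [2.0, 0.00], [4.0, 0.00]])
print(ade(pred, gt))   # 0.28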

High-Level Behavioral Planning

Sequence Average Joint Accuracy (%) on the Drive-Internal and nuScenes test splits. Parentheses show the absolute change from enabling visual tools.

Model                | Drive-Internal (test)       | nuScenes (test)
                     | w/o Tools | w/ Tools        | w/o Tools | w/ Tools
---------------------+-----------+-----------------+-----------+----------------
Human                | 49.29     | –               | 48.24     | –
Qwen2.5-VL-3B        | 24.98     | 22.63 (−2.35)   | 23.48     | 21.58 (−1.90)
Qwen2.5-VL-72B       | 38.80     | 39.61 (+0.81)   | 39.13     | 40.47 (+1.34)
Gemini-2.5-Flash     | 42.14     | 43.42 (+1.28)   | 42.69     | 44.07 (+1.38)
GPT-4.1              | 42.14     | 43.43 (+1.29)   | 43.63     | 44.72 (+1.09)
GPT-5                | 47.19     | 47.97 (+0.78)   | 44.85     | 45.14 (+0.29)
DriveAgent-R1 (Ours) | 43.29     | 45.42 (+2.13)   | 44.43     | 47.10 (+2.67)
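For reference, one plausible reading of this metric: a timestep counts as correct only when both predicted action dimensions (e.g. longitudinal and lateral meta-actions) match the ground truth, and accuracies are averaged per sequence and then across sequences. The sketch below encodes that assumption; consult the paper for the exact definition.

# Hedged sketch of "sequence average joint accuracy". A timestep scores 1
# only if BOTH action dimensions match the ground truth; per-sequence
# accuracies are then averaged. This reading is an assumption.

def seq_avg_joint_acc(pred_seqs, gt_seqs):
    """pred_seqs, gt_seqs: lists of sequences of (longitudinal, lateral) pairs."""
    per_seq = []
    for pred, gt in zip(pred_seqs, gt_seqs):
        hits = [p == g for p, g in zip(pred, gt)]    # joint match per timestep
        per_seq.append(sum(hits) / len(hits))
    return 100.0 * sum(per_seq) / len(per_seq)       # percentage

pred = [[("accelerate", "keep_lane"), ("keep_speed", "keep_lane")]]
gt   = [[("accelerate", "keep_lane"), ("decelerate", "keep_lane")]]
print(seq_avg_joint_acc(pred, gt))   # 50.0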

DriveBench

Comparison with VLM-based driving agents on DriveBench.

Method        | Perception | Prediction | Planning | Behavior
--------------+------------+------------+----------+---------
DriveLM       | 16.85      | 44.33      | 68.71    | 42.78
Dolphins      | 9.59       | 32.66      | 52.91    | 18.81
DriveAgent-R1 | 34.07      | 32.85      | 61.89    | 43.69

DriveAgent-R1 roughly doubles DriveLM's Perception score (34.07 vs. 16.85) and achieves the highest Behavior score overall.

Citation

@inproceedings{driveagentr1,
  title={DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking},
  author={Weicheng Zheng and Xiaofei Mao and Nanfei Ye and Pengxiang Li and Kun Zhan and XianPeng Lang and Hang Zhao},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}