Active Perception Example
The agent uses RoI Inspection to resolve ambiguity, discover critical risk, and revise its plan.
Selected qualitative cases from the paper.
DriveAgent-R1 is a VLM-based driving agent for high-level behavioral planning. It aims to deliver robust and interpretable decisions in challenging traffic scenarios.
The method has two key designs: Active Perception, which proactively invokes vision tools under uncertainty, and Hybrid Thinking, which uses efficient text-only reasoning for simple scenes and tool-augmented reasoning for complex ones.
Dynamic switching between text-only and tool-augmented modes balances efficiency and robustness.
The pipeline includes DM-SFT, FCM-RL, and AMS-RL: first improving each mode, then learning adaptive mode switching.
Instead of passively consuming fixed inputs, the agent actively acquires evidence. In complex intersections, low-visibility night scenes, and long-range risk assessment, tool use improves perception completeness and reduces decision errors.
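The active-perception loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`roi_inspect`, `active_perception`), the confidence threshold, and the dict-based interface are all assumptions made for the example.

```python
# Hypothetical sketch of an active-perception loop. The agent answers from
# the full view; when uncertain, it invokes an RoI-inspection tool to crop a
# region of interest and re-reasons with the extra evidence.

def roi_inspect(image, box):
    """Crop a region of interest (image as a list of pixel rows)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def active_perception(image, answer_fn, max_calls=3, min_conf=0.8):
    """Query answer_fn on the current views; acquire RoI crops while uncertain.

    answer_fn returns a dict with a 'confidence' score and, when it wants a
    closer look, an 'roi' box (x0, y0, x1, y1). Both keys are assumptions
    of this sketch, not the paper's API.
    """
    views = [image]
    result = answer_fn(views)
    for _ in range(max_calls):
        if result["confidence"] >= min_conf or result.get("roi") is None:
            break
        views.append(roi_inspect(image, result["roi"]))  # acquire new evidence
        result = answer_fn(views)
    return result
```

The loop terminates either when the model is confident enough or when it stops requesting regions, which caps tool-call latency.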
Text-only reasoning is used in low-complexity scenes, while tool-augmented reasoning is activated for complex scenes. This avoids unnecessary tool calls, extra latency, and irrelevant information noise.
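The mode-switching logic amounts to a dispatcher. The sketch below is a simplification under stated assumptions: the scalar `scene_complexity` score and the `0.5` threshold are illustrative, not values from the paper (in DriveAgent-R1 the switch itself is learned via RL).

```python
# Hypothetical hybrid-thinking dispatcher: simple scenes get fast text-only
# reasoning; complex scenes get the slower tool-augmented chain.

def hybrid_think(scene_complexity, text_reason, tool_reason, threshold=0.5):
    """Route a scene to one of two reasoning modes.

    text_reason / tool_reason are callables standing in for the two
    reasoning chains; threshold is an illustrative assumption.
    """
    if scene_complexity < threshold:
        return text_reason()   # low latency, no tool calls, no tool noise
    return tool_reason()       # tool-augmented reasoning for hard scenes
```

Keeping the tool path out of easy scenes is what avoids the extra latency and irrelevant-information noise mentioned above.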
DriveAgent-R1 achieves state-of-the-art behavioral planning with only a 3B model, surpassing GPT-5 on nuScenes and leading open-loop motion planning benchmarks.
Sequence Average Joint Accuracy (%) on the Drive-Internal and nuScenes test sets. Parentheses show the absolute gain from using visual tools. Bold: best · Underline: second best.
| Model | Drive-Internal (test) w/o Tools | Drive-Internal (test) w/ Tools | nuScenes (test) w/o Tools | nuScenes (test) w/ Tools |
|---|---|---|---|---|
| Human | 49.29 | – | 48.24 | – |
| Qwen2.5-VL-3B | 24.98 | 22.63 (−2.35) | 23.48 | 21.58 (−1.90) |
| Qwen2.5-VL-72B | 38.80 | 39.61 (+0.81) | 39.13 | 40.47 (+1.34) |
| Gemini-2.5-Flash | 42.14 | 43.42 (+1.28) | 42.69 | 44.07 (+1.38) |
| GPT-4.1 | 42.14 | 43.43 (+1.29) | 43.63 | 44.72 (+1.09) |
| GPT-5 | 47.19 | 47.97 (+0.78) | 44.85 | 45.14 (+0.29) |
| DriveAgent-R1 (Ours) | 43.29 | 45.42 (+2.13) | 44.43 | 47.10 (+2.67) |
Comparison with VLM-based driving agents. Bold: best.
| Method | Perception | Prediction | Planning | Behavior |
|---|---|---|---|---|
| DriveLM | 16.85 | 44.33 | 68.71 | 42.78 |
| Dolphins | 9.59 | 32.66 | 52.91 | 18.81 |
| DriveAgent-R1 | 34.07 | 32.85 | 61.89 | 43.69 |
The Perception score is roughly double DriveLM's (34.07 vs. 16.85), and the Behavior score is the highest overall.
@inproceedings{driveagentr1,
title={DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking},
  author={Weicheng Zheng and Xiaofei Mao and Nanfei Ye and Pengxiang Li and Kun Zhan and XianPeng Lang and Hang Zhao},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}