Abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of chain-of-thought (CoT) modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. DriveVLM-Dual achieves robust spatial understanding and real-time inference speed. Extensive experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the effectiveness of DriveVLM and the enhanced performance of DriveVLM-Dual, surpassing existing methods in complex and unpredictable driving conditions.

DriveVLM

DriveVLM accepts sequences of images as input and, through a reasoning-based Chain-of-Thought (CoT) mechanism, outputs hierarchical planning predictions. DriveVLM can optionally incorporate traditional 3D perception and trajectory planning modules to achieve spatial reasoning capability and real-time trajectory planning.
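The three-stage CoT flow described above (scene description, scene analysis, hierarchical planning) can be sketched as follows. This is a minimal illustration, not the released implementation: the stage functions are hypothetical stubs standing in for VLM calls, and the data they return is invented for demonstration.

```python
# Hypothetical sketch of DriveVLM's chain-of-thought pipeline.
# Each stage would be a VLM query in the real system; here they are stubs.

def scene_description(frames):
    # Stage 1: describe the environment and identify critical objects.
    return {"environment": "rainy night, two-lane urban road",
            "critical_objects": ["fallen tree ahead"]}

def scene_analysis(description):
    # Stage 2: reason about how each critical object affects the ego vehicle.
    return [f"{obj}: partially blocks the ego lane" 
            for obj in description["critical_objects"]]

def hierarchical_planning(analysis):
    # Stage 3: plan coarse-to-fine, from meta-action to waypoints.
    meta_action = "slow down and detour around the obstacle"
    decision = {"speed": "decelerate", "path": "shift left"}
    waypoints = [(0.0, 0.0), (5.0, 0.5), (10.0, 1.2)]  # illustrative values
    return meta_action, decision, waypoints

def drive_vlm(frames):
    desc = scene_description(frames)
    analysis = scene_analysis(desc)
    return hierarchical_planning(analysis)
```

In DriveVLM-Dual, the coarse waypoints produced by the VLM branch would then be refined by a traditional trajectory planner running at real-time rates.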

Data Annotation

Data mining and annotation pipeline for building a scene understanding dataset:

The figure below illustrates a sample scenario with detailed annotations. We employ a group of annotators to perform the scene annotation, including scene description, scene analysis, and planning, except for waypoints, which can be auto-labeled from the vehicle’s IMU recordings.
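Waypoint auto-labeling of the kind mentioned above is typically done by projecting the vehicle's future recorded poses into the ego frame at the annotation timestamp. A minimal sketch, assuming poses have already been integrated from IMU/odometry into a global (x, y, yaw) trajectory (the function name and pose format are illustrative, not from the paper):

```python
import math

def auto_label_waypoints(poses, t0_idx, horizon=6):
    """Project future ego poses into the ego frame at poses[t0_idx].

    poses: list of (x, y, yaw) in a global frame, e.g. integrated from
           IMU/odometry recordings.
    Returns a list of (dx, dy) waypoints expressed in the ego frame.
    """
    x0, y0, yaw0 = poses[t0_idx]
    cos_y, sin_y = math.cos(-yaw0), math.sin(-yaw0)
    waypoints = []
    for x, y, _ in poses[t0_idx + 1 : t0_idx + 1 + horizon]:
        dx, dy = x - x0, y - y0
        # rotate the global displacement by -yaw0 into the ego heading frame
        waypoints.append((dx * cos_y - dy * sin_y,
                          dx * sin_y + dy * cos_y))
    return waypoints
```

With this convention, driving straight ahead produces waypoints along the ego +x axis regardless of the vehicle's global heading.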

Qualitative analysis

In the figure below, DriveVLM accurately captures the traffic police officer's hand gesture signaling the ego vehicle to proceed.

In the figure below, DriveVLM precisely detects the fallen tree and its position, subsequently planning an appropriate detour trajectory.

Citing

```bibtex
@misc{DriveVLM,
  title={DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models},
  author={Xiaoyu Tian and Junru Gu and Bailin Li and Yicheng Liu and Chenxu Hu and Yang Wang and Kun Zhan and Peng Jia and Xianpeng Lang and Hang Zhao},
  year={2024},
  eprint={2402.12289},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```