MUTR3D

Tianyuan Zhang¹, Xuanyao Chen², Yue Wang³, Yilun Wang⁴, Hang Zhao⁵

¹CMU, ²Fudan University, ³MIT, ⁴Li Auto, ⁵Tsinghua University

CVPR 2022 Workshop on Autonomous Driving (formally published)

Webpage | Code | Paper

An end-to-end multi-camera 3D tracking framework that works with arbitrary camera rigs with known parameters. MUTR3D handles multi-camera 3D detection and cross-camera, cross-frame object association in an end-to-end fashion.

Abstract

Accurate and consistent 3D tracking from multiple cameras is a key component in a vision-based autonomous driving system. It involves modeling 3D dynamic objects in complex scenes across multiple cameras. This problem is inherently challenging due to depth estimation, visual occlusions, appearance ambiguity, etc. Moreover, objects are not consistently associated across time and cameras. To address that, we propose an end-to-end MUlti-camera TRacking framework called MUTR3D. In contrast to prior works, MUTR3D does not explicitly rely on the spatial and appearance similarity of objects. Instead, our method introduces a 3D track query to model a spatially and appearance-coherent track for each object that appears in multiple cameras and multiple frames. We use camera transformations to link 3D trackers with their observations in 2D images. Each tracker is further refined according to the features that are obtained from camera images. MUTR3D uses a set-to-set loss to measure the difference between the predicted tracking results and the ground truth. Therefore, it does not require any post-processing such as non-maximum suppression and/or bounding box association. MUTR3D outperforms state-of-the-art methods by 5.3 AMOTA on the nuScenes dataset.

Method

Query-based 3D tracking. Query-based tracking extends query-based detection, in which detect queries, a fixed-size set of embeddings, represent 2D object candidates. A track query extends the detect query to multiple frames, i.e., it represents a whole tracklet across frames. Specifically, we initialize a set of newborn queries at the beginning of each frame; the queries then update themselves frame by frame in an auto-regressive way. A decoder head predicts one object candidate from each track query in each frame, and boxes decoded in different frames from the same track query are directly associated. With proper query life cycle management, query-based tracking can perform joint detection and tracking in an online fashion.
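The query life cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MUTR3D implementation: the names `TrackQuery` and `step_frame`, the scalar stand-in for the embedding, the `decode` callable, and the thresholds `birth_thresh`, `keep_thresh`, and `max_misses` are all assumptions introduced here. The point it shows is that identity lives in the query itself, so boxes decoded from the same query in different frames are associated without a separate matching step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrackQuery:
    embedding: float      # stand-in for a learned embedding vector
    track_id: int = -1    # -1 marks a newborn query that has no identity yet
    misses: int = 0       # consecutive frames with a low confidence score


def step_frame(
    old: List[TrackQuery],
    num_newborn: int,
    decode: Callable[[TrackQuery], float],
    next_id: int,
    birth_thresh: float = 0.5,   # assumed threshold for starting a track
    keep_thresh: float = 0.3,    # assumed threshold for reporting a track
    max_misses: int = 2,         # assumed tolerance before a track is dropped
) -> Tuple[List[TrackQuery], List[Tuple[int, float]], int]:
    """Process one frame: returns (surviving queries, outputs, next free id)."""
    # A fresh set of newborn queries is added at the beginning of every frame.
    newborn = [TrackQuery(embedding=float(i)) for i in range(num_newborn)]

    survivors: List[TrackQuery] = []
    outputs: List[Tuple[int, float]] = []
    for query in old + newborn:
        score = decode(query)  # hypothetical decoder head: query -> confidence
        if query.track_id < 0:
            # Newborn query: becomes a track only if it is confident enough.
            if score < birth_thresh:
                continue
            query.track_id = next_id
            next_id += 1
        elif score < keep_thresh:
            # Old query with a weak detection: tolerate a few misses.
            query.misses += 1
            if query.misses > max_misses:
                continue
        else:
            query.misses = 0

        if score >= keep_thresh:
            outputs.append((query.track_id, score))
        survivors.append(query)  # carried over to the next frame
    return survivors, outputs, next_id
```

In a real tracker the surviving queries would also have their embeddings updated by the decoder before being passed to the next frame; that update is omitted here to keep the bookkeeping visible.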

There are three key ingredients in our query-based multi-camera 3D tracker, as shown in the figure above.

A query-based object tracking loss assigns different regression targets to two types of queries: newborn queries and old queries.
A multi-camera sparse attention uses 3D reference points to sample image features for each query.
A motion model estimates object dynamics and updates the query's reference point across frames.
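To make the second and third ingredients more concrete, the sketch below projects 3D reference points into every camera with known projection matrices, gathers one feature per query from the cameras that actually see the point, and then shifts the reference points with a motion update. It is a simplified illustration under stated assumptions, not the authors' code: the function names, the `world2img` matrices (intrinsics times extrinsics, padded to 4x4), nearest-neighbor sampling with plain averaging across cameras, and the constant-velocity update that ignores ego motion are all choices made here for brevity.

```python
import numpy as np


def project_points(ref_points, world2img, img_hw, eps=1e-5):
    """Project N 3D points into C cameras.

    ref_points: (N, 3) reference points in a shared 3D frame.
    world2img:  (C, 4, 4) assumed projection matrices, one per camera.
    img_hw:     (height, width) of the images or feature maps in pixels.
    Returns pixel coordinates uv (C, N, 2) and a visibility mask (C, N).
    """
    n = ref_points.shape[0]
    homo = np.concatenate([ref_points, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = np.einsum("cij,nj->cni", world2img, homo)               # (C, N, 4)
    depth = cam[..., 2]
    uv = cam[..., :2] / np.maximum(depth, eps)[..., None]
    h, w = img_hw
    # A point is visible if it lies in front of the camera and inside the image.
    valid = (
        (depth > eps)
        & (uv[..., 0] >= 0) & (uv[..., 0] < w)
        & (uv[..., 1] >= 0) & (uv[..., 1] < h)
    )
    return uv, valid


def sample_features(feats, uv, valid):
    """Gather one feature per point, averaged over the cameras that see it.

    feats: (C, H, W, D) per-camera feature maps.
    Nearest-neighbor lookup is used here as a simplification.
    """
    c = feats.shape[0]
    cols = np.clip(np.round(uv[..., 0]).astype(int), 0, feats.shape[2] - 1)
    rows = np.clip(np.round(uv[..., 1]).astype(int), 0, feats.shape[1] - 1)
    cam_idx = np.arange(c)[:, None]
    sampled = feats[cam_idx, rows, cols] * valid[..., None]       # (C, N, D)
    counts = np.maximum(valid.sum(axis=0), 1)[:, None]            # avoid 0-div
    return sampled.sum(axis=0) / counts                           # (N, D)


def propagate(ref_points, velocity, dt):
    """Constant-velocity update of reference points (ego motion ignored)."""
    return ref_points + velocity * dt
```

Because each query touches only the pixels its reference point projects to, the cost scales with the number of queries and cameras rather than with image resolution, which is the sense in which the attention is sparse. The same reference point can be visible in several cameras at once, and averaging the visible samples is one simple way to combine them.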

Related Projects on VCAD (Vision-Centric Autonomous Driving)

BEV Mapping

HDMapNet

BEV Vectorized Mapping

VectorMapNet

BEV Detection

DETR3D

BEV Fusion

FUTR3D

Reference

If you find our work useful in your research, please cite our paper:

@article{zhang2022mutr3d,
  title={MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries},
  author={Zhang, Tianyuan and Chen, Xuanyao and Wang, Yue and Wang, Yilun and Zhao, Hang},
  journal={arXiv preprint arXiv:2205.00613},
  year={2022}
}