MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries
Tianyuan Zhang1, Xuanyao Chen2, Yue Wang3, Yilun Wang4, Hang Zhao5
1CMU, 2Fudan University, 3MIT, 4Li Auto, 5Tsinghua University
CVPR 2022 Workshop on Autonomous Driving (formally published)

An end-to-end multi-camera 3D tracking framework that works with arbitrary camera rigs with known parameters. MUTR3D handles multi-camera 3D detection as well as cross-camera and cross-frame object association in an end-to-end fashion.


Accurate and consistent 3D tracking from multiple cameras is a key component in a vision-based autonomous driving system. It involves modeling 3D dynamic objects in complex scenes across multiple cameras. This problem is inherently challenging due to depth estimation, visual occlusions, appearance ambiguity, etc. Moreover, objects are not consistently associated across time and cameras. To address that, we propose an end-to-end MUlti-camera TRacking framework called MUTR3D. In contrast to prior works, MUTR3D does not explicitly rely on the spatial and appearance similarity of objects. Instead, our method introduces a 3D track query that models a spatially and appearance-coherent track for each object appearing in multiple cameras and multiple frames. We use camera transformations to link 3D trackers with their observations in 2D images. Each tracker is further refined according to the features obtained from camera images. MUTR3D uses a set-to-set loss to measure the difference between the predicted tracking results and the ground truths. Therefore, it does not require any post-processing such as non-maximum suppression and/or bounding box association. MUTR3D outperforms state-of-the-art methods by 5.3 AMOTA on the nuScenes dataset.
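To make the 3D-to-2D link concrete, here is a minimal sketch of projecting one 3D reference point into every camera of a rig and keeping only the views where it is visible. It is an illustration under stated assumptions, not the MUTR3D implementation: the function names `project_point` and `project_to_cameras`, the toy matrices, and the convention that each camera is described by a single 3x4 matrix (intrinsics times extrinsics) are all hypothetical. In a full tracker, image features would then be sampled at the returned pixel locations to refine the query.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]  # 3x4 projection matrix: intrinsics @ [R | t] (assumed convention)


def project_point(point_xyz: Tuple[float, float, float], proj: Matrix,
                  img_w: int, img_h: int, eps: float = 1e-5) -> Optional[Tuple[float, float]]:
    """Project a 3D point into one camera; return (u, v) in pixels, or None if not visible."""
    homo = (point_xyz[0], point_xyz[1], point_xyz[2], 1.0)
    # Each of the three rows yields one of (u * depth, v * depth, depth).
    ud, vd, depth = (sum(row[i] * homo[i] for i in range(4)) for row in proj[:3])
    if depth <= eps:  # point is behind (or on) the camera plane
        return None
    u, v = ud / depth, vd / depth
    if 0 <= u < img_w and 0 <= v < img_h:
        return (u, v)
    return None  # projects outside the image bounds


def project_to_cameras(point_xyz: Tuple[float, float, float], projections: Dict[str, Matrix],
                       img_w: int, img_h: int) -> Dict[str, Tuple[float, float]]:
    """Return {camera name: (u, v)} for every camera that sees the point."""
    visible = {}
    for cam, proj in projections.items():
        uv = project_point(point_xyz, proj, img_w, img_h)
        if uv is not None:
            visible[cam] = uv
    return visible


# Toy rig (hypothetical values): focal length 100 px, principal point (50, 50).
# "front" looks along +z; "back" is the same camera rotated 180 degrees about the y axis.
projections = {
    "front": [[100.0, 0.0, 50.0, 0.0], [0.0, 100.0, 50.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    "back": [[-100.0, 0.0, -50.0, 0.0], [0.0, 100.0, -50.0, 0.0], [0.0, 0.0, -1.0, 0.0]],
}
print(project_to_cameras((1.0, 0.0, 10.0), projections, img_w=100, img_h=100))
# {'front': (60.0, 50.0)}
```

Because the point is dropped from any camera where its depth is non-positive or its pixel falls outside the image, the same routine works for any rig with known parameters, which is the property the framework description relies on.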


Query-based 3D tracking. Query-based tracking extends query-based detection, in which detect queries, a fixed-size set of embeddings, represent 2D object candidates. A track query extends the concept of the detect query to multiple frames, i.e., it represents a whole tracklet across frames. Specifically, we initialize a set of newborn queries at the beginning of each frame; queries then update themselves frame by frame in an auto-regressive way. A decoder head predicts one object candidate from each track query in each frame, and boxes decoded in different frames from the same track query are directly associated. With proper query life cycle management, query-based tracking can perform joint detection and tracking in an online fashion.
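The query life cycle described above can be sketched roughly as follows. This is a simplified illustration, not the authors' code: the class names `TrackQuery` and `QueryManager`, the threshold parameters, and the idea of passing confidence scores in directly (standing in for the decoder head's output) are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count
from typing import List


@dataclass
class TrackQuery:
    track_id: int
    missed: int = 0  # consecutive frames scored below the keep threshold


class QueryManager:
    """Hypothetical life cycle manager: births, keeps, and retires track queries."""

    def __init__(self, birth_thresh: float, keep_thresh: float, max_missed: int):
        self.birth_thresh = birth_thresh  # assumed: min score for a newborn query to start a track
        self.keep_thresh = keep_thresh    # assumed: min score for an active track to be output
        self.max_missed = max_missed      # assumed: tolerated consecutive low-score frames
        self.active: List[TrackQuery] = []
        self._ids = count()

    def step(self, active_scores: List[float], newborn_scores: List[float]) -> List[int]:
        """Advance one frame; return the ids of the tracks output in this frame."""
        if len(active_scores) != len(self.active):
            raise ValueError("need exactly one score per active track query")
        survivors, outputs = [], []
        for query, score in zip(self.active, active_scores):
            if score >= self.keep_thresh:
                query.missed = 0
                outputs.append(query.track_id)  # same query, same id: association is implicit
            else:
                query.missed += 1
            if query.missed <= self.max_missed:
                survivors.append(query)  # carried to the next frame, even if not output now
        for score in newborn_scores:
            if score >= self.birth_thresh:
                query = TrackQuery(next(self._ids))  # a confident newborn query starts a track
                survivors.append(query)
                outputs.append(query.track_id)
        self.active = survivors
        return outputs


manager = QueryManager(birth_thresh=0.5, keep_thresh=0.3, max_missed=1)
frames = [
    manager.step([], [0.9, 0.1]),     # one newborn query is confident enough
    manager.step([0.8], [0.6]),       # track 0 continues, track 1 is born
    manager.step([0.1, 0.9], []),     # track 0 is missed once but kept alive
    manager.step([0.1, 0.9], []),     # track 0 exceeds max_missed and is retired
]
print(frames)
# [[0], [0, 1], [1], [1]]
```

Note that no box-to-box matching appears anywhere: boxes share an identity simply because they are decoded from the same persistent query, which is what lets detection and tracking run jointly and online.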

There are three key ingredients in our query-based multi-camera 3D tracker, as shown in the figure above.

Related Projects on VCAD (Vision-Centric Autonomous Driving)
BEV Mapping

BEV Vectorized Mapping

BEV Detection

BEV Fusion


If you find our work useful in your research, please cite our paper:

  title={MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries},
  author={Zhang, Tianyuan and Chen, Xuanyao and Wang, Yue and Wang, Yilun and Zhao, Hang},
  journal={arXiv preprint arXiv:2205.00613},