Multimodal Knowledge Expansion (MKE)

Zihui Xue1,2   Sucheng Ren1,3   Zhengqi Gao1,4   Hang Zhao5,1  

1Shanghai Qi Zhi Institute   2UT Austin  
3South China University of Technology   4MIT   5Tsinghua University  

Overview Video


The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data. Since existing datasets and well-trained models are primarily unimodal, the modality gap between a unimodal network and unlabeled multimodal data poses an interesting problem: how can a pre-trained unimodal network be transferred to perform the same task on unlabeled multimodal data? In this work, we propose multimodal knowledge expansion (MKE), a knowledge distillation-based framework that effectively utilizes multimodal data without requiring labels. In contrast to traditional knowledge distillation, where the student is designed to be lightweight and inferior to the teacher, we observe that a multimodal student model consistently corrects pseudo labels and generalizes better than its teacher. Extensive experiments on four tasks and different modalities verify this finding. Furthermore, we connect the mechanism of MKE to semi-supervised learning and offer both empirical and theoretical explanations for the expansion capability of a multimodal student.


Figure 1: The popularity of multimodal data collection devices and the Internet engenders a large amount of unlabeled multimodal data. We show two examples above: (a) after a hardware upgrade, lots of unannotated multimodal data are collected by the new sensor suite; (b) large-scale unlabeled videos can be easily obtained from the Internet.



Figure 2: Framework of MKE. In knowledge distillation, a cumbersome teacher network is considered the upper bound of a lightweight student network. In contrast, we introduce a unimodal teacher and a multimodal student. The multimodal student achieves knowledge expansion from the unimodal teacher.
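The recipe in Figure 2 has two steps: the pre-trained unimodal teacher annotates the unlabeled multimodal data with pseudo labels, and the multimodal student is then trained on both modalities against those pseudo labels. Below is a minimal sketch of that scheme; the linear teacher and student, the concatenation fusion, and all names (`teacher_predict`, `X_rgb`, `X_depth`, etc.) are illustrative assumptions, not the paper's actual architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled multimodal data: each sample has an RGB feature and a depth feature.
X_rgb = rng.normal(size=(200, 4))
X_depth = rng.normal(size=(200, 3))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical pre-trained unimodal teacher: a fixed linear classifier on RGB only.
W_teacher = rng.normal(size=(4, 2))
def teacher_predict(x_rgb):
    return softmax(x_rgb @ W_teacher)  # soft pseudo labels

# Step 1: the unimodal teacher annotates the unlabeled data with pseudo labels.
pseudo = teacher_predict(X_rgb)

# Step 2: the multimodal student consumes both modalities (here: simple concatenation)
# and is trained by minimizing cross-entropy against the teacher's pseudo labels.
X_mm = np.concatenate([X_rgb, X_depth], axis=1)
W_student = np.zeros((7, 2))
lr = 0.1
for _ in range(200):
    p = softmax(X_mm @ W_student)
    grad = X_mm.T @ (p - pseudo) / len(X_mm)  # gradient of mean cross-entropy
    W_student -= lr * grad

# After training, the student's hard predictions largely agree with the teacher;
# with richer models and real data, MKE observes the student also corrects
# noisy pseudo labels and generalizes beyond the teacher.
student_out = softmax(X_mm @ W_student)
agreement = (student_out.argmax(1) == pseudo.argmax(1)).mean()
```

Since the distillation objective is convex in this toy setting, the student nearly reproduces the teacher's decisions; the interesting behavior studied in the paper emerges when the extra modality lets the student override unreliable pseudo labels.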


To verify the effectiveness and generalizability of MKE, we perform a thorough test on various tasks: (i) binary classification on the synthetic TwoMoon dataset, (ii) emotion recognition on the RAVDESS dataset, (iii) semantic segmentation on the NYU Depth V2 dataset, and (iv) event classification on the AudioSet and VGGSound datasets.

Figure 3 below presents visualization results on NYU Depth V2. Although our MM student receives inaccurate predictions from the UM teacher, it handles details well and maintains intra-class consistency. As shown in the third and fourth rows, the MM student is robust to illumination changes, whereas the UM teacher and the NOISY student are easily confused. The depth modality helps our MM student better distinguish objects and correct the wrong predictions it receives.


Figure 3: Qualitative segmentation results on NYU Depth V2 test set.


Table 1: Results of semantic segmentation on NYU Depth V2. rgb and d denote RGB images and depth images.