On Feature Decorrelation in Self-Supervised Learning

Tianyu Hua*12   Wenxiao Wang*1   Zihui Xue23   Sucheng Ren24   Yue Wang5   Hang Zhao12  

1Tsinghua University   2Shanghai Qi Zhi Institute  
3UT Austin   4South China University of Technology   5MIT   * Equal contribution  







Figure 1: An overview of the key components of this work: (a) is a sketch of the framework used in this work; (b) and (c) are two reachable collapse patterns in self-supervised settings; (d) is an illustration of the goal of feature decorrelation.

Abstract

In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations. A potential issue of this idea is the existence of completely collapsed solutions (i.e., constant features), which are typically avoided implicitly by carefully chosen implementation details. In this work, we study a relatively concise framework containing the most common components from recent approaches. We verify the existence of complete collapse and discover another reachable collapse pattern that is usually overlooked, namely dimensional collapse. We connect dimensional collapse with strong correlations between axes and consider such connection as a strong motivation for decorrelation (i.e., standardizing the covariance matrix). The capability of correlation as an unsupervised metric and the gains from feature decorrelation are verified empirically to highlight the importance and the potential of this insight.

Different types of mode collapse when no negative samples are used


Figure 2: Direct visualization of 2-dimensional projection spaces on CIFAR-10. Different colors correspond to different classes. Figures (a), (b), and (c) are from our framework. For completeness, we visualize the 2-dimensional projection spaces of SimCLR (by setting the output dimension of the projector to 2) and a supervised baseline (by letting the penultimate layer contain 2 neurons) in Figures (d) and (e).

(a) Complete collapse

Normally, complete collapse is what people refer to when talking about collapse. In complete collapse, all features reside in a tiny region with negligible variance.
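As a rough illustration (not the evaluation protocol used in the paper), complete collapse can be spotted by checking the per-dimension standard deviation of a batch of projected features. The tensor names and shapes below are hypothetical.

```python
import torch

def per_dim_std(z: torch.Tensor) -> torch.Tensor:
    """Standard deviation of every feature dimension over a batch.

    z: (N, d) batch of projected features (hypothetical name/shape).
    Values close to zero in every dimension indicate complete collapse.
    """
    return z.std(dim=0)

# Toy example: a (nearly) completely collapsed batch.
z = torch.randn(1, 8).repeat(256, 1) + 1e-4 * torch.randn(256, 8)
print(per_dim_std(z))  # every entry is close to 0
```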

(b) Dimensional collapse

Standardizing variance is a typical solution to avoid complete collapse. After doing so, we discovered another collapse pattern: dimensional collapse. In dimensional collapse, the feature space is stretched diagonally such that it satisfies the standard-deviation requirement while maximizing feature correlation, as sketched below.
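The following sketch, assuming a batch of features z of shape (N, d), illustrates both pieces: per-dimension variance standardization and the Pearson correlation matrix whose off-diagonal entries approach ±1 under dimensional collapse. The function names are ours for illustration, not taken from the paper's code.

```python
import torch

def standardize_variance(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Zero mean and unit standard deviation per dimension (batch-norm style).
    return (z - z.mean(dim=0)) / (z.std(dim=0) + eps)

def correlation_matrix(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # (d, d) Pearson correlation matrix between feature dimensions.
    z = standardize_variance(z, eps)
    return (z.T @ z) / (z.shape[0] - 1)

# Dimensionally collapsed toy features: every sample lies on one line,
# so each dimension can still be scaled to unit variance, yet the
# off-diagonal correlations all have magnitude close to 1.
direction = torch.randn(8)
z = torch.randn(256, 1) * direction
print(correlation_matrix(z).abs())
```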

(c) Decorrelated feature space

Instead of standardizing variance, we standardize the covariance (i.e., feature decorrelation), which naturally mitigates the issue of dimensional collapse. Using feature decorrelation techniques, we achieve results comparable to representative recent methods.
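A minimal sketch of what "standardizing the covariance" means is given below: a plain ZCA-style whitening of a single batch. It is not the exact DBN / Shuffled-DBN layer used in the paper, which follows the same principle but differs in implementation details (grouped whitening, running statistics, shuffled feature ordering).

```python
import torch

def decorrelate(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Whiten a batch of features so their covariance matrix becomes ~identity.

    Simplified ZCA-style whitening sketch, not the paper's DBN implementation.
    """
    z = z - z.mean(dim=0)                        # center every dimension
    cov = (z.T @ z) / (z.shape[0] - 1)           # (d, d) covariance matrix
    eigvals, eigvecs = torch.linalg.eigh(cov)    # symmetric eigendecomposition
    whitening = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    return z @ whitening

# Correlated toy features -> whitened features with ~identity covariance.
z = torch.randn(256, 8) @ torch.randn(8, 8)
zw = decorrelate(z)
print((zw.T @ zw) / (zw.shape[0] - 1))
```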


Figure 3: Top-1 accuracies (%) of DBN and Shuffled-DBN in linear evaluation with 200-epoch pretraining. For completeness and reference, we include results of some representative methods from our reproduction. For a fair comparison, we use the same projector and augmentations as described in Section 4.1 for all methods in the reproduction.