Microsoft | Label-Efficient Representations via Contrastive Learning and Masked Image Modeling

Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations

Z Jiang, Y Chen, M Liu...
[Microsoft & Texas A&M University & University of Texas at Austin]

Layer Grafted Pre-training: label-efficient representations built on contrastive learning and masked image modeling

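Key points:

Layer Grafted Pre-training bridges contrastive learning (CL) and masked image modeling (MIM) to learn better representations;
Because the two methods suit different depths, the MIM and CL losses are grafted onto the lower and higher layers, respectively;
The sequential cascade yields better representation quality and strong label efficiency in downstream applications;
Compared with the MIM and CL baselines, Layer Grafted Pre-training clearly improves few-shot performance and linear evaluation.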
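
The layer-wise split described in the key points can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names, the split index `graft_at`, and the `lower_lr_scale` parameter (a reduced learning-rate multiplier that protects the MIM-trained lower blocks during the CL stage) are all assumptions introduced here.

```python
def _check_split(num_blocks: int, graft_at: int) -> None:
    # The split must leave at least one block on each side.
    if not 0 < graft_at < num_blocks:
        raise ValueError("graft_at must split the blocks into two non-empty groups")


def graft_assignment(num_blocks: int, graft_at: int) -> list[str]:
    """Label each encoder block with the loss grafted onto it.

    Blocks with index < graft_at (lower layers) get "MIM"; the rest get "CL".
    """
    _check_split(num_blocks, graft_at)
    return ["MIM" if i < graft_at else "CL" for i in range(num_blocks)]


def stage2_lr_scales(num_blocks: int, graft_at: int,
                     lower_lr_scale: float = 0.0) -> list[float]:
    """Per-block learning-rate multipliers for the second (CL) stage.

    Assumption (not from the summary above): the lower blocks are protected
    during CL training by a reduced multiplier; 0.0 freezes them entirely.
    """
    _check_split(num_blocks, graft_at)
    if not 0.0 <= lower_lr_scale <= 1.0:
        raise ValueError("lower_lr_scale must lie in [0, 1]")
    return [lower_lr_scale if i < graft_at else 1.0 for i in range(num_blocks)]


if __name__ == "__main__":
    # ViT-B/16 has 12 transformer blocks; the split point 6 is illustrative only.
    print(graft_assignment(12, 6))
    print(stage2_lr_scales(12, 6, lower_lr_scale=0.1))
```

In a real training loop, the multipliers would typically be applied through per-parameter-group learning rates in the optimizer; where to place the split is an empirical choice that this sketch does not settle.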

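One-sentence summary:
Layer Grafted Pre-training combines contrastive learning and masked image modeling in a sequential cascade to achieve better label efficiency and few-shot performance.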

Recently, both Contrastive Learning (CL) and Masked Image Modeling (MIM) demonstrate that self-supervision is powerful to learn good representations. However, naively combining them is far from success. In this paper, we start by making the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions - more severe as the layers go deeper. This motivates us to shift the paradigm from combining loss at the end, to choosing the proper learning method per network layer. Inspired by experimental observations, we find that MIM and CL are suitable to lower and higher layers, respectively. We hence propose to combine them in a surprisingly simple, "sequential cascade" fashion: early layers are first trained under one MIM loss, on top of which later layers continue to be trained under another CL loss. The proposed Layer Grafted Pre-training learns good visual representations that demonstrate superior label efficiency in downstream applications, in particular yielding strong few-shot performance besides linear evaluation. For instance, on ImageNet-1k, Layer Grafted Pre-training yields 65.5% Top-1 accuracy in terms of 1% few-shot learning with ViT-B/16, which improves MIM and CL baselines by 14.4% and 2.1% with no bells and whistles. The code is available at this https URL.
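
One common way to quantify "conflicting gradient directions" is the cosine similarity between the gradients that two losses produce for the same layer: values near 1 mean the losses agree, values below 0 mean they pull the weights in opposing directions. The sketch below assumes this measure; the abstract does not say which metric the authors use. The layer names and gradient vectors are made-up toy values, not measurements from the paper.

```python
import math


def cosine_similarity(g1: list[float], g2: list[float]) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    if len(g1) != len(g2):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # treat a zero gradient as neither aligned nor conflicting
    return dot / (norm1 * norm2)


def per_layer_conflict(grads_a: dict[str, list[float]],
                       grads_b: dict[str, list[float]]) -> dict[str, float]:
    """Cosine similarity per layer; both dicts map layer name -> flat gradient."""
    return {name: cosine_similarity(grads_a[name], grads_b[name])
            for name in grads_a}


if __name__ == "__main__":
    # Hypothetical gradients of a CL loss and an MIM loss on two layers.
    cl_grads = {"block_1": [1.0, 2.0], "block_12": [1.0, 0.0]}
    mim_grads = {"block_1": [2.0, 4.0], "block_12": [-1.0, 0.0]}
    for layer, cos in per_layer_conflict(cl_grads, mim_grads).items():
        print(f"{layer}: cosine = {cos:+.2f}")
```

With real models, the flat vectors would come from back-propagating each loss separately on the same batch and flattening each layer's parameter gradients.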

Paper link: https://arxiv.org/abs/2302.14138

