This post is from today's paper recommendations by 爱可可.

[CV] Improving Zero-shot Generalization and Robustness of Multi-modal Models

Y Ge, J Ren, Y Wang, A Gallagher, M Yang, L Itti, H Adam, B Lakshminarayanan, J Zhao
[Google Research]


This work targets the zero-shot generalization and robustness of multi-modal models: a post-hoc procedure that identifies uncertain predictions via consistency checks and augments prompts with the WordNet hierarchy improves top-1 accuracy. The method is effective, hyperparameter-free, and generalizes across datasets and multi-modal architectures.


Paper: https://arxiv.org/abs/2212.01758

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures.
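The abstract describes two steps: flag images whose top-1 prediction flips under different prompts and image transformations, then re-classify those uncertain images with prompts augmented by the class's WordNet parent and children. Below is a minimal, hypothetical sketch of that recipe using Hugging Face CLIP, torchvision transforms, and NLTK's WordNet; the specific templates, transforms, prompt wording, and helper names (`is_uncertain`, `wordnet_augmented_prompt`) are my own assumptions, not the authors' released code.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor
from nltk.corpus import wordnet as wn  # requires: python -m nltk.downloader wordnet

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative prompt templates and image transformations; the paper measures how
# consistent the top-1 prediction is across such variations to flag likely mistakes.
TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a low-resolution photo of a {}.",
]
TRANSFORMS = [
    transforms.Lambda(lambda im: im),        # identity view
    transforms.RandomHorizontalFlip(p=1.0),  # always flip
    transforms.CenterCrop(200),              # tighter crop
]


@torch.no_grad()
def top1(image, prompts):
    """Zero-shot top-1 index for one image over one list of class prompts."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True).to(device)
    logits = model(**inputs).logits_per_image  # shape (1, num_prompts)
    return int(logits.argmax(dim=-1))


def is_uncertain(image, class_names):
    """Flag the image when top-1 predictions disagree across templates x transforms."""
    votes = set()
    for tf in TRANSFORMS:
        view = tf(image)
        for tmpl in TEMPLATES:
            votes.add(top1(view, [tmpl.format(c) for c in class_names]))
    return len(votes) > 1  # any disagreement -> treat the top-1 prediction as unreliable


def wordnet_augmented_prompt(class_name, template="a photo of a {}."):
    """Expand a class name with its WordNet parent and a few children (illustrative wording)."""
    synsets = wn.synsets(class_name.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return template.format(class_name)
    syn = synsets[0]
    name = class_name
    if syn.hypernyms():
        name += ", a kind of " + syn.hypernyms()[0].lemma_names()[0].replace("_", " ")
    children = [s.lemma_names()[0].replace("_", " ") for s in syn.hyponyms()[:3]]
    if children:
        name += ", such as " + " or ".join(children)
    return template.format(name)


def classify(image, class_names):
    """Plain prompts for confident images; WordNet-augmented prompts for uncertain ones."""
    if not is_uncertain(image, class_names):
        return class_names[top1(image, ["a photo of a {}.".format(c) for c in class_names])]
    return class_names[top1(image, [wordnet_augmented_prompt(c) for c in class_names])]


if __name__ == "__main__":
    image = Image.open("example.jpg").convert("RGB")  # placeholder path
    print(classify(image, ["tiger", "tabby cat", "lion"]))
```

Here any disagreement across the prompt/transform grid counts as "uncertain"; the paper's actual consistency measure, selective-prediction evaluation, and prompt wording may differ, so treat this strictly as a reading aid.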

 
