LG - Machine Learning | CV - Computer Vision | CL - Computation and Language | AS - Audio and Speech | RO - Robotics

Reposted from 爱可可爱生活

 

1、[CL] HyperMixer: An MLP-based Green AI Alternative to Transformers

F Mai, A Pannatier, F Fehr, H Chen, F Marelli, F Fleuret, J Henderson

[Idiap Research Institute]

HyperMixer: an MLP-based Green AI alternative to Transformers. Transformer-based architectures are the model of choice for natural language understanding, but they are expensive, since their complexity is quadratic in the input length, and they can be difficult to fine-tune. In pursuit of Green AI, this paper investigates simple MLP-based architectures. It finds that existing architectures such as MLPMixer, which achieve token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases needed for natural language understanding. The paper proposes a simple variant, HyperMixer, which uses hypernetworks to form the token-mixing MLP dynamically, giving it inductive biases similar to a Transformer's. Experiments show the proposed model performs better than other MLP-based models and on par with Transformers. Compared with Transformers, HyperMixer achieves these results at substantially lower cost in processing time, training data, and hyperparameter tuning, marking real progress toward Green AI.

Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length and can be difficult to tune. In the pursuit of Green AI, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
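To make the token-mixing idea concrete, here is a minimal PyTorch sketch of hypernetwork-generated token mixing, written from the abstract alone; the module name, layer sizes, and activation are illustrative assumptions, not the authors' implementation. Two small hypernetworks map each token embedding to one row of the mixing MLP's weight matrices, so the mixing MLP adapts to the input and to any sequence length, with cost linear rather than quadratic in the number of tokens.

```python
# Hedged sketch: dynamic token mixing via hypernetworks (not the official HyperMixer code).
import torch
import torch.nn as nn

class HyperTokenMixer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Hypernetworks: each token embedding generates one row of the
        # token-mixing MLP's input/output weight matrices.
        self.hyper_in = nn.Linear(d_model, d_hidden)
        self.hyper_out = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, d_model)
        w_in = self.hyper_in(x)    # (batch, N, d_hidden), generated from the input
        w_out = self.hyper_out(x)  # (batch, N, d_hidden)
        # Mix across the token dimension with the dynamically generated weights.
        hidden = self.act(torch.einsum("bnh,bnd->bhd", w_in, x))
        return torch.einsum("bnh,bhd->bnd", w_out, hidden)

mixer = HyperTokenMixer(d_model=256, d_hidden=128)
y = mixer(torch.randn(2, 50, 256))   # works for any sequence length
```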

 

 

2、[LG] Correct-N-Contrast: A Contrastive Approach for Improving Robustness to Spurious Correlations

M Zhang, N S. Sohoni, H R. Zhang, C Finn, C Ré

[Stanford University & Northeastern University]

Correct-N-Contrast: a contrastive approach for improving robustness to spurious correlations. Spurious correlations pose a major challenge for robust machine learning. Models trained with empirical risk minimization (ERM) may learn to rely on correlations between class labels and spurious attributes, leading to poor performance on data groups without those correlations. The problem is especially hard to address when spurious attribute labels are unavailable. To improve worst-group performance without training attribute labels, this paper proposes Correct-n-Contrast (CnC), a two-stage contrastive approach that directly learns representations robust to spurious correlations. Because ERM models can be good spurious-attribute predictors, CnC works by (1) using a trained ERM model's outputs to identify samples with the same class but dissimilar spurious features, and (2) training a robust model with contrastive learning to learn similar representations for same-class samples. To support CnC, the paper introduces new connections between worst-group error and a representation alignment loss that CnC aims to minimize. Experiments show that worst-group error closely tracks the alignment loss, and the paper proves that the alignment loss over a class helps upper-bound the gap between that class's worst-group and average error. On popular benchmarks, CnC drastically reduces alignment loss and achieves state-of-the-art worst-group accuracy, with a 3.6% average absolute lift.

Spurious correlations pose a major challenge for robust machine learning. Models trained with empirical risk minimization (ERM) may learn to rely on correlations between class labels and spurious attributes, leading to poor performance on data groups without these correlations. This is particularly challenging to address when spurious attribute labels are unavailable. To improve worst-group performance on spuriously correlated data without training attribute labels, we propose Correct-n-Contrast (CnC), a contrastive approach to directly learn representations robust to spurious correlations. As ERM models can be good spurious attribute predictors, CnC works by (1) using a trained ERM model’s outputs to identify samples with the same class but dissimilar spurious features, and (2) training a robust model with contrastive learning to learn similar representations for same-class samples. To support CnC, we introduce new connections between worst-group error and a representation alignment loss that CnC aims to minimize. We empirically observe that worst-group error closely tracks with alignment loss, and prove that the alignment loss over a class helps upper-bound the class’s worst-group vs. average error gap. On popular benchmarks, CnC reduces alignment loss drastically, and achieves state-of-the-art worst-group accuracy by 3.6% average absolute lift. CnC is also competitive with oracle methods that require group labels.
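The two-stage recipe can be sketched briefly. The PyTorch snippet below is an illustrative, assumption-laden rendering rather than the paper's code: it treats a trained ERM model's predictions as a proxy for the spurious attribute and applies an InfoNCE-style loss that pulls same-class samples with differing ERM predictions toward the anchor while pushing away same-prediction samples from other classes; the authors' exact sampling scheme and loss may differ.

```python
# Hedged sketch of a CnC-style contrastive loss (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def cnc_contrastive_loss(z, labels, erm_preds, anchor_idx, temperature=0.1):
    """z: (N, d) embeddings; labels: (N,) true classes; erm_preds: (N,) stage-1 ERM predictions."""
    z = F.normalize(z, dim=1)
    sim = (z @ z[anchor_idx]) / temperature          # similarity of every sample to the anchor
    same_class = labels == labels[anchor_idx]
    same_pred = erm_preds == erm_preds[anchor_idx]
    positives = same_class & ~same_pred              # same class, different inferred spurious group
    negatives = ~same_class & same_pred              # different class, same inferred spurious group
    if positives.sum() == 0 or negatives.sum() == 0:
        return z.new_zeros(())                       # no valid pairs for this anchor
    pos = sim[positives]
    neg = torch.logsumexp(sim[negatives], dim=0)
    # InfoNCE-style objective: align positives with the anchor, repel the negatives.
    return -(pos - torch.logaddexp(pos, neg)).mean()

z = torch.randn(32, 64)                              # stage-2 model embeddings
labels = torch.randint(0, 2, (32,))
erm_preds = torch.randint(0, 2, (32,))               # stage-1 ERM predictions
loss = cnc_contrastive_loss(z, labels, erm_preds, anchor_idx=0)
```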

 

 

3、[CV] EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers

H Zhang, W Hu, X Wang

EdgeFormer: improving light-weight ConvNets by learning from vision Transformers. Recently, vision Transformers have begun to show impressive results, significantly outperforming large convolution-based models. However, in the regime of small models for mobile or resource-constrained devices, ConvNets still have advantages in both performance and model complexity. This paper proposes EdgeFormer, a pure ConvNet-based backbone that further strengthens these advantages by fusing the merits of vision Transformers into ConvNets. It introduces global circular convolution (GCC) with position embeddings, a light-weight convolution operation that has a global receptive field while producing location-sensitive features like local convolutions. Combining GCC and squeeze-excitation operations yields a MetaFormer-like block with a Transformer-like attention mechanism. The block can replace the corresponding blocks in ConvNets or Transformers in a plug-and-play manner. Experiments show that, on common vision tasks and datasets, EdgeFormer performs better than popular light-weight ConvNets and vision-Transformer-based models while using fewer parameters and running faster at inference. For ImageNet-1k classification, EdgeFormer reaches 78.6% top-1 accuracy with about 5.0 million parameters; compared with MobileViT it saves 11% of parameters and 13% of computational cost while gaining 0.2% higher accuracy and 23% faster inference (on an ARM-based Rockchip RK3288), and compared with DeiT it uses only 0.5x the parameters while gaining 2.7% higher accuracy. EdgeFormer also shows better performance on MS-COCO object detection and PASCAL VOC segmentation.

Recently, vision transformers started to show impressive results which outperform large convolution based models significantly. However, in the area of small models for mobile or resource constrained devices, ConvNet still has its own advantages in both performance and model complexity. We propose EdgeFormer, a pure ConvNet based backbone model that further strengthens these advantages by fusing the merits of vision transformers into ConvNets. Specifically, we propose global circular convolution (GCC) with position embeddings, a light-weight convolution op which boasts a global receptive field while producing location sensitive features as in local convolutions. We combine the GCCs and squeeze-excitation ops to form a meta-former like model block, which further has the attention mechanism like transformers. The aforementioned block can be used in a plug-and-play manner to replace relevant blocks in ConvNets or transformers. Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models in common vision tasks and datasets, while having fewer parameters and faster inference speed. For classification on ImageNet-1k, EdgeFormer achieves 78.6% top-1 accuracy with about 5.0 million parameters, saving 11% parameters and 13% computational cost but gaining 0.2% higher accuracy and 23% faster inference speed (on ARM based Rockchip RK3288) compared with MobileViT, and uses only 0.5x parameters but gains 2.7% higher accuracy compared with DeiT. On MS-COCO object detection and PASCAL VOC segmentation tasks, EdgeFormer also shows better performance.
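One way to picture the global circular convolution described above: a depthwise kernel that spans an entire spatial axis, applied with circular (wrap-around) padding so every output position aggregates the whole axis, after adding a position embedding so the result stays location-sensitive. The PyTorch sketch below follows only that verbal description; the kernel shape and the handling of the second axis are assumptions, not the paper's implementation. A symmetric version along the W axis plus squeeze-excitation would complete the block the abstract describes.

```python
# Hedged sketch of a global circular convolution along the H axis (not the paper's code).
import torch
import torch.nn.functional as F

def global_circular_conv_h(x, weight, pos_embed):
    """x: (B, C, H, W); weight: (C, 1, H, 1) depthwise kernel; pos_embed: (1, C, H, 1)."""
    x = x + pos_embed                                   # position embedding keeps features location-sensitive
    x = torch.cat([x, x[:, :, :-1, :]], dim=2)          # circular padding: wrap the H axis around
    return F.conv2d(x, weight, groups=weight.shape[0])  # every output row sees the whole H axis

B, C, H, W = 2, 8, 14, 14
x = torch.randn(B, C, H, W)
weight = torch.randn(C, 1, H, 1)                        # one H-long kernel per channel (depthwise)
pos_embed = torch.randn(1, C, H, 1)
y = global_circular_conv_h(x, weight, pos_embed)
assert y.shape == (B, C, H, W)                          # spatial size is preserved
```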

 

 

4、[CV] Transferability Estimation using Bhattacharyya Class Separability

M Pándy, A Agostinelli, J Uijlings, V Ferrari, T Mensink

[Google Research]

Transferability estimation using Bhattacharyya class separability. Transfer learning has become a popular way to leverage pre-trained models in computer vision. However, without computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can easily be adapted. This paper proposes the Gaussian Bhattacharyya Coefficient (GBC), a new method for quantifying transferability between a source model and a target dataset that measures the amount of overlap between target classes in the source feature space, where each target class is modeled as a Gaussian. All target images are embedded in the feature space defined by the source model and represented with per-class Gaussians. Pairwise class separability is then estimated with the Bhattacharyya coefficient, yielding a simple and effective measure of how well the source model transfers to the target task. GBC is evaluated on image classification in the context of dataset and architecture selection, and experiments are also run on the more complex task of semantic segmentation transferability estimation. GBC is shown to outperform state-of-the-art transferability metrics on most evaluation criteria in the semantic segmentation setting, to match the performance of top methods for dataset transferability in image classification, and to perform best on architecture selection for image classification.

Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive finetuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can be easily adapted to. In this work, we propose Gaussian Bhattacharyya Coefficient (GBC), a novel method for quantifying transferability between a source model and a target dataset. In a first step we embed all target images in the feature space defined by the source model, and represent them with per-class Gaussians. Then, we estimate their pairwise class separability using the Bhattacharyya coefficient, yielding a simple and effective measure of how well the source model transfers to the target task. We evaluate GBC on image classification tasks in the context of dataset and architecture selection. Further, we also perform experiments on the more complex semantic segmentation transferability estimation task. We demonstrate that GBC outperforms state-of-the-art transferability metrics on most evaluation criteria in the semantic segmentation settings, matches the performance of top methods for dataset transferability in image classification, and performs best on architecture selection problems for image classification.
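Since the score reduces to a closed-form computation over per-class Gaussians, a small sketch helps. The NumPy snippet below assumes diagonal covariances and a simple negative sum over class pairs; both are simplifications made for illustration, and the paper's exact estimator may differ.

```python
# Hedged sketch of a Gaussian Bhattacharyya Coefficient-style transferability score.
import numpy as np

def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """BC = exp(-D_B) between two diagonal Gaussians, where D_B is the Bhattacharyya distance."""
    var = 0.5 * (var1 + var2)
    d_b = (0.125 * np.sum((mu1 - mu2) ** 2 / var)
           + 0.5 * np.sum(np.log(var))
           - 0.25 * np.sum(np.log(var1))
           - 0.25 * np.sum(np.log(var2)))
    return np.exp(-d_b)

def gbc_score(features, labels):
    """features: (N, D) target images embedded by the source model; labels: (N,) target classes.
    More class overlap gives a more negative score, i.e. worse expected transferability."""
    classes = np.unique(labels)
    stats = {c: (features[labels == c].mean(0), features[labels == c].var(0) + 1e-6)
             for c in classes}
    score = 0.0
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            score -= bhattacharyya_coefficient(*stats[ci], *stats[cj])
    return score

feats = np.random.randn(300, 64)            # e.g. penultimate-layer features from a source model
labels = np.random.randint(0, 5, size=300)
print(gbc_score(feats, labels))
```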

 

 

5、[LG] Flat minima generalize for low-rank matrix recovery

L Ding, D Drusvyatskiy, M Fazel

[University of Washington]

Flat minima generalize for low-rank matrix recovery. Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, how the loss grows around a minimizer strongly affects its performance. Flat minima, those around which the loss grows slowly, appear to generalize well. This paper takes a step toward understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. It analyzes overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single-hidden-layer neural networks with quadratic activation functions. In all cases, it shows that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, weak recovery is established, although empirical evidence suggests that exact recovery holds there as well.

Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima—those around which the loss grows slowly—appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We complete the paper with synthetic experiments that illustrate our findings.
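The flatness measure named in the abstract is the trace of the loss Hessian at a minimizer. A generic way to estimate it (not taken from the paper) is Hutchinson's estimator with Hessian-vector products, sketched below on a toy overparameterized factorization loss; the loss function and sizes are made up for illustration.

```python
# Hedged sketch: estimating tr(Hessian) of a loss, the flatness measure named in the abstract.
import torch

def hessian_trace(loss_fn, params, n_samples=100):
    """Hutchinson estimator of tr(H) for loss_fn at the 1-D parameter tensor `params`."""
    params = params.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(params), params, create_graph=True)[0]
    total = 0.0
    for _ in range(n_samples):
        v = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)   # Rademacher probe
        hv = torch.autograd.grad(grad, params, grad_outputs=v, retain_graph=True)[0]
        total += torch.dot(v, hv).item()                                   # E[v^T H v] = tr(H)
    return total / n_samples

# Toy overparameterized loss L(U) = ||U U^T - M||_F^2 with U wider than the rank-2 target needs.
M = torch.randn(5, 2); M = M @ M.T
def loss_fn(u_flat):
    U = u_flat.view(5, 4)
    return ((U @ U.T - M) ** 2).sum()

print(hessian_trace(loss_fn, torch.randn(20)))
```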

 

 

A few more papers worth noting:

 

[CV] StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pretrained StyleGAN

F Yin, Y Zhang, X Cun, M Cao, Y Fan, X Wang, Q Bai, B Wu, J Wang, Y Yang

[Tsinghua University & Tencent AI Lab & The Chinese University of Hong Kong]

 

 

[LG] GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

M Xu, L Yu, Y Song, C Shi, S Ermon, J Tang

[Mila & Stanford University]

 

 

[LG] Singular Value Perturbation and Deep Network Optimization

R H. Riedi, R Balestriero, R G. Baraniuk

[University of Applied Sciences of Western Switzerland & Meta/Facebook AI Research & Rice University]

 

 

[LG] The Exact Class of Graph Functions Generated by Graph Neural Networks

M Fereydounian, H Hassani, J Dadashkarimi, A Karbasi

[University of Pennsylvania & Yale University]

 

 
