爱可可 AI Frontier Picks (5.31)

LG - Machine Learning; CV - Computer Vision; CL - Computation and Language; AS - Audio and Speech; RO - Robotics

Reposted from 爱可可-爱生活

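Summary: linear connectivity reveals generalization strategies; a generative image-to-text Transformer for vision and language; instruction induction, from few examples to natural language task descriptions; multimodal knowledge alignment with reinforcement learning; irregular latent grids for 3D generative modeling; protein fitness prediction with autoregressive Transformers and inference-time retrieval; policy discovery with DOMiNO; neural basis models for interpretability; simple unsupervised object-centric learning for complex and naturalistic videos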

1. [LG] Linear Connectivity Reveals Generalization Strategies

J Juneja, R Bansal, K Cho, J Sedoc, N Saphra

[Delhi Technological University & New York University]

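Linear connectivity reveals generalization strategies. The mode connectivity literature holds that two networks trained similarly on the same data are joined by a path through parameter space along which test accuracy is preserved, and that in some settings, including transfer learning from pretrained models, this path is linear. Contrary to those results, the paper finds that among text classifiers trained on MNLI, QQP, and CoLA, some pairs of finetuned models are separated by large loss barriers on the linear path between them. On each task the models fall into distinct clusters: models within a cluster are linearly connected on the test loss surface, while models in different clusters are not and occupy separate basins. Performance on specially crafted diagnostic datasets shows that the clusters correspond to different generalization strategies: under domain shift, one cluster behaves like a bag-of-words model and another relies on syntactic heuristics. The work shows how the geometry of the loss surface can steer models toward different heuristic functions.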

It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster—models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.

https://arxiv.org/abs/2205.12411
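
To make the linear-path analysis in item 1 concrete, here is a minimal sketch of measuring a loss barrier between two parameter vectors. It is not the authors' code. The two-basin `toy_loss`, the function names, and the barrier definition (the largest amount by which the loss on the path exceeds the straight-line interpolation of the endpoint losses) are all assumptions chosen for illustration; the paper evaluates finetuned text classifiers rather than a 2D function.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], alpha: float) -> List[float]:
    """Point on the straight line between two parameter vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def linear_barrier(
    loss_fn: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    steps: int = 11,
) -> float:
    """Largest excess of the path loss over the interpolated endpoint losses.

    A value near zero suggests the two solutions are linearly connected;
    a large value suggests they sit in separate basins.
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def toy_loss(theta: Sequence[float]) -> float:
    """Hypothetical 2D loss with two basins, at x = -1 and x = +1."""
    x, y = theta
    return (x * x - 1.0) ** 2 + y * y


if __name__ == "__main__":
    # Same basin: no barrier. Different basins: a barrier at the midpoint.
    print(linear_barrier(toy_loss, [1.0, 0.0], [1.0, 0.5]))
    print(linear_barrier(toy_loss, [-1.0, 0.0], [1.0, 0.0]))
```

Grouping models by pairwise barrier values like these is one plausible way to recover the kind of clusters the abstract describes.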

2. [CV] GIT: A Generative Image-to-text Transformer for Vision and Language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu, C Liu, L Wang

[Microsoft Cloud and AI]

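GIT: a generative image-to-text Transformer for vision and language. The paper designs and trains GIT, a generative image-to-text Transformer, to unify vision-language tasks such as image and video captioning and question answering. Generative models offer a consistent network architecture between pre-training and fine-tuning, but existing work typically uses complex structures (uni-modal or multi-modal encoders and decoders) and depends on external modules such as object detectors, taggers, and optical character recognition (OCR). GIT reduces the architecture to one image encoder and one text decoder trained under a single language modeling task, and scales up both the pre-training data and the model size. Without extra tricks, it sets a new state of the art on 12 challenging benchmarks by a large margin; for example, it surpasses human performance on TextCaps for the first time. The paper also presents a generation-based scheme for image classification and scene text recognition that achieves decent performance on standard benchmarks.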

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.

https://arxiv.org/abs/2205.14100

3. [CL] Instruction Induction: From Few Examples to Natural Language Task Descriptions

O Honovich, U Shaham, S R. Bowman, O Levy

[Tel Aviv University & New York University & Meta AI]

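Instruction induction: from few examples to natural language task descriptions. Large language models can perform a task by conditioning on a few input-output demonstrations, a paradigm known as in-context learning. The paper shows that a language model can also infer the underlying task explicitly when it is prompted to generate a natural language instruction that fits the examples. To study this ability, the authors introduce the instruction induction challenge, compile a dataset of 24 tasks, and define a new evaluation metric based on executing the generated instruction. The ability to generate instructions largely emerges only when the model is both large enough and aligned to follow instructions: on the execution-based metric, InstructGPT reaches 65.7% of human performance, while the original GPT-3 reaches only 9.8%. This surprising result suggests that instruction induction may be a viable learning paradigm in its own right, one that searches a natural language hypothesis space for the best description instead of fitting a set of latent continuous parameters to the data.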

Large language models are able to perform a task by conditioning on a few input-output demonstrations – a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.

https://arxiv.org/abs/2205.10782
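
The execution-based metric in item 3 can be pictured roughly as follows: a generated instruction is handed to an executor, and the score is the fraction of held-out examples the executor then gets right. This is only a sketch under stated assumptions. In the paper the executor is a language model; here `toy_execute` is a hypothetical keyword-matching stand-in, and the exact-match comparison and all function names are illustrative choices, not the authors' implementation.

```python
from typing import Callable, Sequence, Tuple


def execution_accuracy(
    instruction: str,
    execute: Callable[[str, str], str],
    held_out: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of held-out (input, output) pairs solved when following `instruction`."""
    if not held_out:
        raise ValueError("held_out must contain at least one example")
    correct = sum(
        1 for x, y in held_out if execute(instruction, x).strip() == y.strip()
    )
    return correct / len(held_out)


def toy_execute(instruction: str, x: str) -> str:
    """Hypothetical executor: a stand-in for a model that follows the instruction."""
    text = instruction.lower()
    if "reverse" in text:
        return x[::-1]
    if "uppercase" in text:
        return x.upper()
    return x  # Unrecognized instruction: echo the input unchanged.


if __name__ == "__main__":
    examples = [("abc", "cba"), ("xy", "yx")]
    # A good induced instruction scores high; a wrong one scores low.
    print(execution_accuracy("Reverse the word.", toy_execute, examples))
    print(execution_accuracy("Uppercase the word.", toy_execute, examples))
```

Scoring the instruction by what it makes the executor do, and not by its wording, means that differently phrased but equivalent instructions receive the same score.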

4. [CL] Multimodal Knowledge Alignment with Reinforcement Learning

Y Yu, J Chung, H Yun, J Hessel...

[Allen Institute for Artificial Intelligence & Seoul National University & University of Washington]

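Multimodal knowledge alignment with reinforcement learning. Large language models adapt readily to new settings, even without task-specific training data. Can this zero-shot capacity be extended to multimodal inputs? The paper proposes ESPER (ExtraSensory PErception with Reinforcement learning), which extends language-only zero-shot models to unseen multimodal tasks such as image and audio captioning. The key novelty is using reinforcement learning to align multimodal inputs with language model generations without direct supervision: in the image case, reward optimization relies only on cosine similarity derived from CLIP, so no additional explicitly paired (image, caption) data is needed. Because the language model's parameters are left unchanged, the model keeps its capacity for zero-shot generalization. Experiments show that ESPER outperforms baselines and prior work on a variety of zero-shot tasks, including a new benchmark the authors collect and release, the ESP dataset, which asks models to generate several diversely styled captions for each image.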

Large language models readily adapt to novel settings, even without task-specific training data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we propose ESPER (ExtraSensory PErception with Reinforcement learning) which extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, in the image case our reward optimization relies only on cosine similarity derived from CLIP (Radford et al., 2021), and thus requires no additional explicitly paired (image, caption) data. Because the parameters of the language model are left unchanged, the model maintains its capacity for zero-shot generalization. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks; these include a new benchmark we collect and release, ESP dataset, which tasks models with generating several diversely-styled captions for each image.

https://arxiv.org/abs/2205.12630
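
The reward signal described in item 4 is a cosine similarity between an image embedding and a caption embedding. The sketch below shows only that scoring step, under loud assumptions: the embedding vectors are made-up numbers standing in for CLIP outputs, no CLIP model is called, and `rank_captions` is a hypothetical helper. The reinforcement learning update that ESPER applies on top of the reward is not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_captions(
    image_embedding: Sequence[float],
    caption_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each candidate caption against the image, highest reward first."""
    scored = [
        (caption, cosine_similarity(image_embedding, emb))
        for caption, emb in caption_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up 2D embeddings; real CLIP embeddings have hundreds of dimensions.
    image = [1.0, 0.0]
    candidates = {"a dog": [2.0, 0.0], "a car": [0.0, 3.0]}
    for caption, reward in rank_captions(image, candidates):
        print(caption, reward)
```

Because the reward needs only an image and a generated caption, no paired caption labels enter the training signal, which is the point the abstract makes about avoiding explicitly paired data.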

5. [CV] 3DILG: Irregular Latent Grids for 3D Generative Modeling

B Zhang, M Nießner, P Wonka

[KAUST & TUM]

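3DILG: irregular latent grids for 3D generative modeling. The paper proposes a new representation that encodes 3D shapes as neural fields. It is designed to be compatible with the Transformer architecture and to benefit both shape reconstruction and shape generation. Existing neural field work uses grid-based representations whose latents are defined on a regular grid; this paper instead defines latents on irregular grids, which makes the representation sparse and adaptive. For shape reconstruction from point clouds, the irregular-grid representation improves reconstruction accuracy over grid-based methods. For shape generation, it supports high-quality generation with autoregressive probabilistic models. The authors show three applications that improve on the current state of the art: probabilistic shape reconstruction from a single higher-resolution image, a probabilistic model conditioned on very low-resolution images, and category-conditioned generation. All of the probabilistic experiments confirm that the representation produces detailed, high-quality shapes and sets a new state of the art in generative 3D shape modeling.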

We propose a new representation for encoding 3D shapes as neural fields. The representation is designed to be compatible with the transformer architecture and to benefit both shape reconstruction and shape generation. Existing works on neural fields are grid-based representations with latents defined on a regular grid. In contrast, we define latents on irregular grids, enabling our representation to be sparse and adaptive. In the context of shape reconstruction from point clouds, our shape representation built on irregular grids improves upon grid-based methods in terms of reconstruction accuracy. For shape generation, our representation promotes high-quality shape generation using auto-regressive probabilistic models. We show different applications that improve over the current state of the art. First, we show results for probabilistic shape reconstruction from a single higher resolution image. Second, we train a probabilistic model conditioned on very low resolution images. Third, we apply our model to category-conditioned generation. All probabilistic experiments confirm that we are able to generate detailed and high quality shapes to yield the new state of the art in generative 3D shape modeling.

https://arxiv.org/abs/2205.13914