LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: phoneme-aware neural completion to elicit tongue twisters automatically; adapting a pre-trained image-text model to video-language representation alignment; revisiting neural scaling laws in language and vision; humor "understanding" benchmarks from The New Yorker Caption Contest; a jointly-scaled multilingual language-image model; a comprehensive survey of diffusion model methods and applications; vec2text with round-trip translations; knowledge base question answering from a semantic parsing perspective; towards multilingual visual question answering.

 

1、[CL] PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically

S S Keh, S Y. Feng, V Gangal, M Alikhani, E Hovy
[CMU & Stanford University & University of Pittsburgh]
PANCETTA: phoneme-aware neural completion to elicit tongue twisters automatically. Tongue twisters are meaningful sentences that are difficult to pronounce, and generating them automatically is challenging because the output must satisfy two conditions at once: phonetic difficulty and semantic meaning. Moreover, phonetic difficulty is itself hard to characterize, and natural tongue twisters express it through a heterogeneous mix of phenomena such as alliteration and homophony. The paper proposes PANCETTA (Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically), which leverages phoneme representations to capture the notion of phonetic difficulty and trains language models to generate original tongue twisters under two proposed task settings. To this end, the authors curate a dataset, also called PANCETTA, consisting of existing English tongue twisters. Automatic and human evaluation, together with qualitative analysis, show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.

Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.

https://arxiv.org/abs/2209.06275
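To make the phoneme-aware setup concrete, here is a minimal, hedged sketch, not the paper's released pipeline: it converts a prompt to ARPAbet phonemes with the g2p_en package and feeds a combined phonemes-plus-text prefix to an off-the-shelf GPT-2. PANCETTA itself fine-tunes phoneme-aware models on its curated dataset, which this sketch does not reproduce; the prompt format below is an illustrative assumption.

```python
# Minimal sketch (assumption, not PANCETTA's code): expose phonemes to a causal
# LM as part of the prompt. Requires: pip install g2p_en transformers torch
from g2p_en import G2p
from transformers import AutoModelForCausalLM, AutoTokenizer

g2p = G2p()
prompt = "she sells sea shells"
# ARPAbet phoneme sequence, e.g. "SH IY1 S EH1 L Z S IY1 ..."
phonemes = " ".join(p for p in g2p(prompt) if p.strip())

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A phoneme-aware prompt: the phoneme string makes the target sounds explicit;
# a model fine-tuned on such pairs (as in the paper) could exploit it.
text = f"Phonemes: {phonemes}\nTongue twister: {prompt}"
inputs = tokenizer(text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```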

 

2、[CV] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

H Xue, Y Sun, B Liu, J Fu, R Song, H Li, J Luo
[University of Science and Technology of China & Renmin University of China & Microsoft Research Asia & ...]
CLIP-ViP: adapting a pre-trained image-text model to video-language representation alignment. Pre-trained image-text models such as CLIP have demonstrated the power of vision-language representations learned from large-scale web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model such as CLIP for video-language pre-training (post-pretraining) remains under-explored. The paper investigates two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks, and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, the authors find that data scale and the domain gap between language sources have a large impact. Motivated by this, they propose CLIP-ViP, an omnisource cross-modal learning method equipped with a video proxy mechanism built on top of CLIP. Extensive results show that the approach improves CLIP's performance on video-text retrieval by a large margin, and the model achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.

The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representation to video domain and achieve good results. However, how to utilize image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still under-explored. In this paper, we investigate two questions: 1) what are the factors hindering post-pretraining CLIP from further improving the performance on video-language tasks? and 2) how to mitigate the impact of these factors? Through a series of comparative experiments and analyses, we find that the data scale and domain gap between language sources have great impacts. Motivated by these, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

https://arxiv.org/abs/2209.06430
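For context, the naive way to reuse CLIP for video is to encode sampled frames and mean-pool them into a single video embedding. The sketch below shows that baseline, which CLIP-ViP improves on with its video proxy mechanism and omnisource post-pretraining; the Hugging Face CLIP checkpoint and dummy frames are illustrative assumptions, not the paper's setup.

```python
# Hedged baseline sketch: CLIP frame features + mean pooling for video-text
# retrieval. This is the vanilla adaptation, not CLIP-ViP itself.
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Pretend these are 8 RGB frames sampled from a video clip (random placeholders).
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]
captions = ["a dog catches a frisbee", "a person cooking pasta"]

inputs = processor(text=captions, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    frame_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])

video_feat = frame_feats.mean(dim=0, keepdim=True)   # naive temporal pooling
video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
print("video-text similarities:", (video_feat @ text_feats.T).squeeze(0))
```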

 

3、[LG] Revisiting Neural Scaling Laws in Language and Vision

I Alabdulmohsin, B Neyshabur, X Zhai
[Google Research]
Revisiting neural scaling laws in language and vision. The remarkable recent progress in deep learning has been driven largely by scale: bigger models trained on larger datasets for longer schedules. To predict the benefit of scale empirically, the paper argues for a more rigorous methodology based on extrapolation loss rather than reported best-fitting (interpolating) parameters, and presents a recipe for reliably estimating scaling-law parameters from learning curves. The recipe extrapolates more accurately than previous methods across a wide range of architecture families in several domains, including image classification, neural machine translation (NMT), and language modeling, as well as tasks from the BIG-Bench evaluation benchmark. The authors also release a benchmark dataset of 90 evaluation tasks to facilitate research in this area.

The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.

https://arxiv.org/abs/2209.06640
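The core exercise the paper advocates can be illustrated with a toy fit: estimate a saturating power law from the small-scale points of a learning curve, then measure how well it extrapolates to held-out larger scales. The sketch below uses a generic form L(n) = ε∞ + β·n^(−c) and synthetic numbers; the paper's actual estimator and functional family differ.

```python
# Toy illustration (synthetic data, generic functional form; not the paper's
# estimator): fit on small scales, evaluate extrapolation on large scales.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, eps_inf, beta, c):
    """Saturating power law: loss approaches eps_inf as data size n grows."""
    return eps_inf + beta * n ** (-c)

sizes  = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6, 1e7])        # training set sizes
losses = np.array([3.09, 2.58, 2.21, 1.98, 1.82, 1.72, 1.64])  # validation losses

fit, held_out = slice(0, 5), slice(5, None)   # fit up to 1e6, extrapolate beyond
params, _ = curve_fit(power_law, sizes[fit], losses[fit],
                      p0=(1.0, 10.0, 0.3), maxfev=20000)

pred = power_law(sizes[held_out], *params)
print("fitted (eps_inf, beta, c):", params)
print("extrapolation abs. error :", np.abs(pred - losses[held_out]))
```

The paper's point is that this extrapolation error, rather than the quality of the interpolating fit, is what should be reported.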

 

4、[CL] Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest

J Hessel, A Marasović, J D. Hwang, L Lee, J Da, R Zellers, R Mankoff, Y Choi
[The Allen Institute for AI & University of Utah & Cornell University & OpenAI & ...]
Humor "understanding" benchmarks from The New Yorker Caption Contest. The paper challenges AI models to "understand" the sophisticated multimodal humor of The New Yorker Caption Contest. It develops three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp the potentially complex and unexpected relationships between image and caption, as well as similarly complex and unexpected allusions to the wide variety of human experience; these are the hallmarks of a New Yorker-caliber cartoon. The authors study vision-and-language models that take the cartoon pixels and caption directly as input, as well as language-only models for which image processing is sidestepped by providing textual descriptions of the image. Even with rich, multifaceted annotations for the cartoon images, there remains a performance gap between high-quality machine learning models (e.g., a fine-tuned 175B-parameter language model) and humans.

We challenge AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, we develop three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide varieties of human experience; these are the hallmarks of a New Yorker-caliber cartoon. We investigate vision-and-language models that take as input the cartoon pixels and caption directly, as well as language-only models for which we circumvent image-processing by providing textual descriptions of the image. Even with the rich multifaceted annotations we provide for the cartoon images, we identify performance gaps between high-quality machine learning models (e.g., a fine-tuned, 175B parameter language model) and humans. We publicly release our corpora including annotations describing the image’s locations/entities, what’s unusual about the scene, and an explanation of the joke.

https://arxiv.org/abs/2209.06293
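As a rough illustration of the language-only route, one can rank candidate captions for a described scene by the loss a causal LM assigns to description-plus-caption. This is a hedged sketch for illustration only: the description, captions, prompt format, and small GPT-2 model are assumptions, not the paper's tasks or the far larger fine-tuned models it evaluates.

```python
# Hedged sketch: language-only caption ranking from a textual scene description.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

description = ("Scene: an ordinary office. Unusual: the desk chair has been "
               "replaced by a saddle on a live horse.")  # hypothetical example
candidates = [
    "We're transitioning to a more agile workflow.",
    "The quarterly report is due on Friday.",
]

def lm_loss(text: str) -> float:
    """Average next-token loss of the text under the LM (lower = more plausible)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

scores = {c: lm_loss(f"{description}\nCaption: {c}") for c in candidates}
print(min(scores, key=scores.get))  # caption the LM finds most plausible
```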

 

5、[CV] PaLI: A Jointly-Scaled Multilingual Language-Image Model

X Chen, X Wang, S Changpinyo...
[Google Research]
PaLI: a jointly-scaled multilingual language-image model. Effective scaling and a flexible task interface let large language models excel at many tasks; PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text from visual and textual inputs and, through this interface, performs many vision, language, and multimodal tasks in many languages. To train PaLI, the authors build on large pre-trained encoder-decoder language models and Vision Transformers (ViTs), capitalizing on their existing capabilities and the substantial cost already invested in training them. They find that jointly scaling the vision and language components is important; since existing language Transformers are much larger than their vision counterparts, they train the largest ViT to date (ViT-e) to quantify the benefit of higher-capacity vision models. PaLI is trained on a large multilingual mix of pre-training tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding) while retaining a simple, modular, and scalable design.

Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

https://arxiv.org/abs/2209.06794
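Since PaLI itself cannot be used in a short sketch, the hedged example below only illustrates the same image-plus-text-in, text-out interface using the open BLIP VQA checkpoint on Hugging Face as a stand-in; the model name, placeholder image, and question are assumptions and do not reflect PaLI's encoder-decoder plus ViT-e architecture or its training mix.

```python
# Stand-in sketch of an image+text -> text interface (BLIP VQA, not PaLI).
import numpy as np
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Placeholder image (random pixels); in practice, load a real photo.
image = Image.fromarray(np.random.randint(0, 256, (384, 384, 3), dtype=np.uint8))
question = "what color is the object in the picture?"

inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```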

 

A few more papers worth noting:

[LG] Diffusion Models: A Comprehensive Survey of Methods and Applications

A comprehensive survey of diffusion model methods and applications
L Yang, Z Zhang, S Hong...
[Peking University & University of California, Los Angeles & CMU & BUPT & Mila & University of California at Merced]
https://arxiv.org/abs/2209.00796v5

 

[CL] vec2text with Round-Trip Translations

vec2text with round-trip translations
G Cideron, S Girgin, A Raichuk, O Pietquin, O Bachem, L Hussenot
[Google Research]
https://arxiv.org/abs/2209.06792

 

[CL] Knowledge Base Question Answering: A Semantic Parsing Perspective

Knowledge base question answering from a semantic parsing perspective
Y Gu, V Pahuja, G Cheng, Y Su
[The Ohio State University & Nanjing University]
https://arxiv.org/abs/2209.04994

 

[CL] Towards Multi-Lingual Visual Question Answering

Towards multilingual visual question answering
S Changpinyo, L Xue, I Szpektor, A V. Thapliyal, J Amelot, X Chen, R Soricut
[Google Research]
https://arxiv.org/abs/2209.05401

 

 
