LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics




1、[CL] PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically

S S Keh, S Y. Feng, V Gangal, M Alikhani, E Hovy
[CMU & Stanford University & University of Pittsburgh]

Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.



2、[CV] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

H Xue, Y Sun, B Liu, J Fu, R Song, H Li, J Luo
[University of Science and Technology of China & Renmin University of China & Microsoft Research Asia & ...]
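Pre-trained image-text models such as CLIP have shown the strength of vision-language representations learned from large-scale, web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain with good results. How to use an image-language pre-trained model such as CLIP for video-language pre-training (post-pretraining), however, is still being explored. This paper studies two questions: 1) which factors keep post-pretrained CLIP from further improving performance on video-language tasks, and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, the paper finds that data scale and the domain gap between language sources have a large effect. Motivated by this, it proposes CLIP-ViP, an Omnisource Cross-modal Learning method built on CLIP and equipped with a Video Proxy mechanism. Extensive results show that the method substantially improves CLIP's performance on video-text retrieval. The model also achieves SOTA results on a range of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.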

The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representation to video domain and achieve good results. However, how to utilize image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what are the factors hindering post-pretraining CLIP to further improve the performance on video-language tasks? and 2) how to mitigate the impact of these factors? Through a series of comparative experiments and analyses, we find that the data scale and domain gap between language sources have great impacts. Motivated by these, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
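The abstract reports gains on video-text retrieval without naming the metric. Benchmarks such as MSR-VTT are commonly scored with Recall@K, so the sketch below illustrates that metric only, not the CLIP-ViP model. The function names and the toy embeddings are hypothetical, and the pairing convention (text i matches video i) is an assumption made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose paired video ranks in the top k.

    Assumes text_embs[i] is paired with video_embs[i].
    """
    hits = 0
    for i, text in enumerate(text_embs):
        sims = [cosine(text, video) for video in video_embs]
        # Rank video indices from most to least similar to this text.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy embeddings (made up): three text queries and three videos.
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
videos = [[0.9, 0.1, 0.0], [0.0, 0.3, 0.9], [0.0, 0.4, 0.8]]

r1 = recall_at_k(texts, videos, k=1)
r2 = recall_at_k(texts, videos, k=2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy setup only the first query retrieves its paired video at rank 1, while all three succeed within the top 2, which is the kind of gap between R@1 and larger K that retrieval tables typically report.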



3、[LG] Revisiting Neural Scaling Laws in Language and Vision

I Alabdulmohsin, B Neyshabur, X Zhai
[Google Research]

The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.
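The idea of judging a scaling law by its extrapolation loss rather than its interpolating fit can be sketched as follows. This is a minimal illustration, not the paper's estimator: it assumes a plain power law, loss = beta * x^(-c), fitted by least squares in log-log space, and the data are synthetic. All function names are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log(beta) - c * log x; returns (beta, c)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    beta = math.exp(mean_y - slope * mean_x)
    return beta, -slope

def predict(beta, c, x):
    """Predicted loss at scale x under the fitted power law."""
    return beta * x ** (-c)

def extrapolation_rmse(xs, ys, n_fit):
    """Fit on the n_fit smallest scales, then score on the held-out larger ones.

    The error is the RMSE between predicted and observed log-loss.
    """
    beta, c = fit_power_law(xs[:n_fit], ys[:n_fit])
    sq_errs = [
        (math.log(predict(beta, c, x)) - math.log(y)) ** 2
        for x, y in zip(xs[n_fit:], ys[n_fit:])
    ]
    return math.sqrt(sum(sq_errs) / len(sq_errs))

# Synthetic learning curve (assumed): an exact power law with beta=2.0, c=0.3.
sizes = [1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4]
losses = [2.0 * s ** -0.3 for s in sizes]

beta_hat, c_hat = fit_power_law(sizes[:4], losses[:4])
err = extrapolation_rmse(sizes, losses, n_fit=4)
print(f"beta={beta_hat:.3f}, c={c_hat:.3f}, extrapolation RMSE={err:.2e}")
```

The point of the split is that the fit never sees the two largest scales, so the reported error measures how well the curve predicts beyond the fitted range instead of how closely it matches points it was tuned on.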



4、[CL] Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest

J Hessel, A Marasović, J D. Hwang, L Lee, J Da, R Zellers, R Mankoff, Y Choi
[The Allen Institute for AI & University of Utah & Cornell University & OpenAI & ...]

We challenge AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, we develop three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide varieties of human experience; these are the hallmarks of a New Yorker-caliber cartoon. We investigate vision-and-language models that take as input the cartoon pixels and caption directly, as well as language-only models for which we circumvent image-processing by providing textual descriptions of the image. Even with the rich multifaceted annotations we provide for the cartoon images, we identify performance gaps between high-quality machine learning models (e.g., a fine-tuned, 175B parameter language model) and humans. We publicly release our corpora including annotations describing the image’s locations/entities, what’s unusual about the scene, and an explanation of the joke.



5、[CV] PaLI: A Jointly-Scaled Multilingual Language-Image Model

X Chen, X Wang, S Changpinyo...
[Google Research]

Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.




[LG] Diffusion Models: A Comprehensive Survey of Methods and Applications

L Yang, Z Zhang, S Hong...
[Peking University & University of California, Los Angeles & CMU & BUPT & Mila & University of California at Merced] https://arxiv.org/abs/2209.00796v5


[CL] vec2text with Round-Trip Translations

G Cideron, S Girgin, A Raichuk, O Pietquin, O Bachem, L Hussenot
[Google Research]


[CL] Knowledge Base Question Answering: A Semantic Parsing Perspective

Y Gu, V Pahuja, G Cheng, Y Su
[The Ohio State University & Nanjing University]


[CL] Towards Multi-Lingual Visual Question Answering

S Changpinyo, L Xue, I Szpektor, A V. Thapliyal, J Amelot, X Chen, R Soricut
[Google Research]


