LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: faster and better text-to-image generation via hierarchical transformers; a very preliminary analysis of DALL-E 2; music enhancement via image translation and vocoding; hierarchical feature alignment for vision-language model pretraining; graph anisotropic diffusion for molecules; handling and presenting harmful text; a visual language model for few-shot learning; disentangling source features for transfer learning of StyleGAN; oracle-guided image synthesis with relative queries.

 

1、[CV] CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

M Ding, W Zheng, W Hong, J Tang

[Tsinghua University]

The development of transformer-based text-to-image models has been impeded by their slow generation and their complexity for high-resolution images. This work puts forward a solution based on hierarchical transformers and local parallel autoregressive generation. A 6B-parameter transformer is pretrained with a simple and flexible self-supervised task, the Cross-Modal General Language Model (CogLM), and then finetuned for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL-E 2 and naturally supports interactive text-guided editing of images.
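As an illustration of the coarse-to-fine idea, here is a minimal sketch of a two-stage decoding loop: sequential autoregressive generation of a low-resolution token grid, followed by parallel refinement of the upsampled grid. The predictors `ar_model` and `sr_model` are hypothetical stand-ins that return token logits; this is a generic sketch of hierarchical generation with parallel refinement, not CogView2's actual CogLM or local-parallel super-resolution procedure.

```python
import numpy as np

def hierarchical_generate(ar_model, sr_model, text_ids, low=20, scale=3,
                          rounds=4, seed=0):
    """Toy coarse-to-fine generation: decode a low-resolution token grid
    one token at a time, then upsample it and re-predict random subsets
    of high-resolution tokens in a few parallel refinement passes."""
    rng = np.random.default_rng(seed)

    # Stage 1: sequential (autoregressive) decoding of the coarse grid.
    coarse = []
    for _ in range(low * low):
        logits = ar_model(text_ids, coarse)      # logits over the codebook
        coarse.append(int(np.argmax(logits)))
    coarse = np.array(coarse).reshape(low, low)

    # Stage 2: nearest-neighbour upsample, then refine many tokens at once
    # instead of one by one, which is where the speed-up comes from.
    fine = np.kron(coarse, np.ones((scale, scale), dtype=int))
    for _ in range(rounds):
        logits = sr_model(text_ids, fine)        # (H, W, codebook) logits
        mask = rng.random(fine.shape) < 1.0 / rounds
        fine = np.where(mask, logits.argmax(-1), fine)
    return fine
```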

https://arxiv.org/abs/2204.14217

 

2、[CV] A very preliminary analysis of DALL-E 2

G Marcus, E Davis, S Aaronson

[New York University & University of Texas at Austin]

The DALL-E 2 system generates original synthetic images corresponding to an input text given as a caption. This paper reports the outcome of fourteen tests of the system designed to assess its common sense, reasoning, and ability to understand complex texts. All of the prompts were intentionally much more challenging than the typical ones showcased in recent weeks. Nevertheless, for 5 of the 14 prompts, at least one of the ten generated images fully satisfied the request; on the other hand, for no prompt did all ten images do so.

https://arxiv.org/abs/2204.13807

 

3、[AS] Music Enhancement via Image Translation and Vocoding

N Kandpal, O Nieto, Z Jin

[University of North Carolina at Chapel Hill & Adobe Research]

Consumer-grade music recordings, such as those captured by mobile devices, typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhancing low-quality music recordings that combines (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms. This approach outperforms baselines that use classical methods for mel-spectrogram inversion, as well as an end-to-end approach that maps noisy waveforms directly to clean waveforms. Additionally, in evaluating the proposed method with a listening test, the authors analyze the reliability of common audio enhancement evaluation metrics when used in the music domain.
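To make the two-stage pipeline concrete, here is a minimal sketch using librosa for the mel-spectrogram step. `enhancement_model` and `vocoder` are hypothetical stand-ins for the paper's trained networks, and the Griffin-Lim helper only illustrates the kind of classical mel-spectrogram inversion the baselines rely on.

```python
import librosa

def enhance_music(path, enhancement_model, vocoder, sr=22050, n_mels=128):
    """Toy version of the pipeline described above: noisy waveform ->
    mel-spectrogram -> image-to-image enhancement -> vocoder -> waveform."""
    y, sr = librosa.load(path, sr=sr)

    # Treat the (log-)mel-spectrogram as a single-channel "image".
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel)

    # (i) image-to-image translation on the spectrogram representation.
    clean_mel_db = enhancement_model(mel_db)

    # (ii) vocoding: map the enhanced mel-spectrogram back to audio.
    return vocoder(clean_mel_db, sr)

def griffin_lim_baseline(mel_db, sr=22050):
    """Classical mel-spectrogram inversion (Griffin-Lim via librosa),
    the kind of baseline the paper compares against."""
    mel = librosa.db_to_power(mel_db)
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)
```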

https://arxiv.org/abs/2204.13289

 

4、[CV] PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Y Gao, J Liu, Z Xu, J Zhang, K Li, C Shen

[Tencent Youtu Lab & Zhejiang University]

Large-scale vision-language pretraining has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. However, in real scenarios this assumption can be difficult to hold: text descriptions, obtained by crawling the affiliated metadata of images, often suffer from semantic mismatch and mutual compatibility issues. To address these issues, this paper introduces PyramidCLIP, which constructs an input pyramid with different semantic levels and aligns visual and linguistic elements hierarchically via intra-level semantic alignment and cross-level relation alignment. Furthermore, the objective function is adjusted by softening the loss on negative (unpaired) samples so as to weaken the strict constraint during pretraining, thus mitigating the risk of the model becoming over-confident. Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval, and image object detection, verify the effectiveness of PyramidCLIP. In particular, with the same 15 million image-text pairs of pretraining data, PyramidCLIP exceeds CLIP by 19.2%/18.5%/19.6% in ImageNet zero-shot classification top-1 accuracy with ResNet-50/ViT-B32/ViT-B16 image encoders, respectively. When scaling to larger datasets, the results of PyramidCLIP trained for only 8 epochs on 128M image-text pairs are very close to those of CLIP trained for 32 epochs on 400M image-text pairs.
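The "softened negative-sample loss" can be illustrated with a generic label-smoothed, CLIP-style contrastive loss. The sketch below is an assumed stand-in rather than PyramidCLIP's actual objective, shown only to make concrete the idea of relaxing the one-hot pairing targets.

```python
import torch
import torch.nn.functional as F

def softened_clip_loss(image_emb, text_emb, temperature=0.07, smooth=0.1):
    """Symmetric contrastive loss with softened targets: instead of
    one-hot labels (only the diagonal pair counts as a match), a small
    amount of probability mass `smooth` is spread over the off-diagonal
    (unpaired) entries, weakening the constraint on negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B)

    n = logits.size(0)
    targets = torch.full_like(logits, smooth / (n - 1))
    targets.fill_diagonal_(1.0 - smooth)                     # soft targets

    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With `smooth=0` this reduces to the standard symmetric cross-entropy used in CLIP-style training.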

https://arxiv.org/abs/2204.14095

 

5、[LG] Graph Anisotropic Diffusion for Molecules

A A. A. Elhag, G Corso, H Stärk, M M. Bronstein

[African Masters of Machine Intelligence & MIT & University of Oxford]

Traditional Graph Neural Networks (GNNs) rely on message passing, which amounts to permutation-invariant local aggregation of neighbour features. Such a process is isotropic, with no notion of "direction" on the graph. This paper presents a new GNN architecture called Graph Anisotropic Diffusion. The model alternates between linear diffusion, for which a closed-form solution is available, and local anisotropic filters, to obtain efficient multi-hop anisotropic kernels. The model is tested on two common molecular property prediction benchmarks (ZINC and QM9) and shows competitive performance.
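A minimal sketch of the alternation the abstract describes, assuming dense numpy adjacency matrices and using the graph gradient of a per-node scalar `field` (for example a Laplacian eigenvector) to supply the "direction". This is a generic illustration of linear diffusion plus a direction-aware local filter, not the paper's exact operators.

```python
import numpy as np
from scipy.linalg import expm

def anisotropic_diffusion_layer(X, A, field, t=1.0):
    """One layer alternating (i) closed-form linear graph diffusion and
    (ii) a simple direction-dependent aggregation of neighbour features.
    X: (N, d) node features, A: (N, N) adjacency, field: (N,) scalar signal."""
    L = np.diag(A.sum(axis=1)) - A                 # combinatorial Laplacian

    # (i) linear diffusion: heat-kernel smoothing, available in closed form.
    X = expm(-t * L) @ X

    # (ii) anisotropic filter: weight each edge by the signed gradient of
    # `field`, so "uphill" and "downhill" neighbours are treated differently.
    grad = A * (field[None, :] - field[:, None])   # (N, N) signed directions
    uphill = np.maximum(grad, 0.0) @ X
    downhill = np.maximum(-grad, 0.0) @ X
    return np.tanh(X + uphill - downhill)
```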

https://openreview.net/forum?id=MDYOh60QN94

 

Other papers worth noting:

 

[CL] Handling and Presenting Harmful Text

L Derczynski, H R Kirk, A Birhane, B Vidgen

[IT University of Copenhagen & University of Oxford & Mozilla Foundation & The Alan Turing Institute]

https://arxiv.org/abs/2204.14256

 

[CV] Flamingo: a Visual Language Model for Few-Shot Learning

J Alayrac, J Donahue, P Luc, A Miech, I Barr, Y Hasson, K Lenc, A Mensch, K Millican...

[DeepMind]

https://arxiv.org/abs/2204.14198

 

[CV] Fix the Noise: Disentangling Source Feature for Transfer Learning of StyleGAN

D Lee, J Y Lee, D Kim, J Choi, J Kim

[KAIST]

https://arxiv.org/abs/2204.14079

 

[CV] Oracle Guided Image Synthesis with Relative Queries

A Helbling, C J Rozell, M O'Shaughnessy, K Fallah

[Georgia Institute of Technology]

https://arxiv.org/abs/2204.14189

 

If any images included in this content involve copyright issues, please contact us promptly so they can be removed.