LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics
Reposted from 爱可可爱生活
1. [CV] Efficient Geometry-aware 3D Generative Adversarial Networks
E R. Chan, C Z. Lin, M A. Chan, K Nagano, B Pan, S D Mello, O Gallo, L Guibas, J Tremblay, S Khamis, T Karras, G Wetzstein
[Stanford University & NVIDIA]
Unsupervised generation of high-quality, multi-view-consistent images and 3D shapes from collections of single-view 2D photographs alone has been a long-standing challenge. Existing 3D GANs are either compute-intensive or rely on approximations that are not 3D-consistent; the former limits the quality and resolution of the generated images, while the latter hurts multi-view consistency and shape quality. This work improves the computational efficiency and image quality of 3D GANs without overly relying on such approximations. It introduces an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes high-resolution multi-view-consistent images in real time and also produces high-quality 3D geometry. By decoupling feature generation from neural rendering, the framework can leverage state-of-the-art 2D CNN generators such as StyleGAN2 and inherit their efficiency and expressiveness. Among other experiments, it demonstrates state-of-the-art 3D-aware synthesis on FFHQ and AFHQ Cats.
Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. For this purpose, we introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.
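A minimal sketch in PyTorch of the decoupling idea described above: a 2D CNN generator (a stand-in for StyleGAN2) produces a feature map, features are sampled at 3D query points, and a tiny MLP decodes them into color and density for volume rendering. The module names, shapes, and the orthographic xy-sampling are illustrative assumptions, not the paper's actual EG3D architecture.

```python
# Sketch only: decoupled feature generation + neural rendering (NOT the paper's model).
import torch
import torch.nn as nn

class FeatureGenerator2D(nn.Module):
    """Stand-in for a StyleGAN2-style 2D CNN generator that outputs a feature map."""
    def __init__(self, z_dim=64, feat_ch=32, res=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, feat_ch * res * res // 16), nn.ReLU())
        self.up = nn.Sequential(
            nn.Unflatten(1, (feat_ch, res // 4, res // 4)),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, z):
        return self.up(self.net(z))                 # (B, feat_ch, res, res)

class TinyDecoder(nn.Module):
    """Small MLP mapping sampled features to color + density for volume rendering."""
    def __init__(self, feat_ch=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_ch, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, feats):
        out = self.mlp(feats)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])   # rgb, sigma

def render_points(gen, dec, z, pts):
    """Sample the 2D feature map at projected 3D points (orthographic xy here) and decode."""
    fmap = gen(z)                                    # (B, C, H, W)
    grid = pts[..., :2].unsqueeze(1)                 # (B, 1, N, 2), values in [-1, 1]
    feats = torch.nn.functional.grid_sample(fmap, grid, align_corners=False)  # (B, C, 1, N)
    feats = feats.squeeze(2).permute(0, 2, 1)        # (B, N, C)
    return dec(feats)

gen, dec = FeatureGenerator2D(), TinyDecoder()
z = torch.randn(2, 64)
pts = torch.rand(2, 128, 3) * 2 - 1                  # random 3D query points in [-1, 1]^3
rgb, sigma = render_points(gen, dec, z, pts)
print(rgb.shape, sigma.shape)                        # (2, 128, 3) and (2, 128, 1)
```

The point of the split is that the heavy 2D generator runs once per latent code, while only the small decoder runs per 3D sample, which is roughly where the efficiency the abstract describes comes from.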
2. [CL] Massive-scale Decoding for Text Generation using Lattices
J Xu, G Durrett
[The University of Texas at Austin]
Neural text generation models for summarization and translation produce high-quality outputs, but tend to concentrate around a single mode when what we really want is a diverse set of options. This paper proposes a search algorithm that constructs lattices encoding a massive number of generation options. First, decoding is restructured as best-first search, which explores the space differently from beam search and improves efficiency by avoiding pruning paths. The paper then revisits the idea of hypothesis recombination: pairs of similar generation candidates can be identified during search and merged as an approximation. On document summarization and machine translation, the algorithm encodes hundreds to thousands of diverse options that remain grammatical and high-quality into a single linear-sized lattice. This provides a foundation for building downstream generation applications on top of massive-scale diverse outputs.
Neural text generation models like those used for summarization and translation generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options. We present a search algorithm to construct lattices encoding a massive number of generation options. First, we restructure decoding as a best-first search, which explores the space differently than beam search and improves efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis recombination: we can identify pairs of similar generation candidates during search and merge them as an approximation. On both document summarization and machine translation, we show that our algorithm encodes hundreds to thousands of diverse options that remain grammatical and high-quality into one linear-sized lattice. This algorithm provides a foundation for building downstream generation applications on top of massive-scale diverse outputs.
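A toy sketch of the two ingredients named above: hypotheses are expanded best-first from a priority queue, and candidates that arrive at the same recent-token context are merged into one lattice node instead of being re-expanded. The toy next-token table, the last-two-token merge key, and all names are illustrative assumptions, not the paper's exact recombination criterion or data structure.

```python
# Sketch only: best-first decoding with hypothesis recombination into a lattice.
import heapq
import math

# Hypothetical toy next-token distribution: last token -> {next token: prob}
LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "slept": 0.3},
    "dog": {"sat": 0.6, "slept": 0.4},
    "sat": {"</s>": 1.0},
    "slept": {"</s>": 1.0},
}

def best_first_lattice(max_expansions=50, context=2):
    """Pop the best hypothesis first; merge hypotheses sharing the same `context`-token suffix."""
    frontier = [(-0.0, ("<s>",))]          # (accumulated negative log-prob, token sequence)
    lattice_edges = set()                  # (from_node, token, to_node), nodes keyed by suffix
    visited = set()
    finished = []
    expansions = 0
    while frontier and expansions < max_expansions:
        neg_lp, seq = heapq.heappop(frontier)
        key = seq[-context:]
        if key in visited:                 # recombination: this path merges into an existing node
            continue
        visited.add(key)
        expansions += 1
        last = seq[-1]
        if last == "</s>":
            finished.append((math.exp(-neg_lp), seq))
            continue
        for tok, p in LM[last].items():
            new_seq = seq + (tok,)
            lattice_edges.add((key, tok, new_seq[-context:]))
            heapq.heappush(frontier, (neg_lp - math.log(p), new_seq))
    return finished, lattice_edges

finished, edges = best_first_lattice()
print(len(edges), "lattice edges;", len(finished), "distinct finished contexts after recombination")
```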
3. [CL] Textless Speech-to-Speech Translation on Real Data
A Lee, H Gong, P Duquenne, H Schwenk, P Chen, C Wang, S Popuri, J Pino, J Gu, W Hsu
[Meta AI]
This paper presents a textless speech-to-speech translation (S2ST) system that translates speech from one language into another and can be built without any text data. Unlike existing work in the literature, the system tackles the challenge of modeling multi-speaker target speech and is trained on real-world S2ST data. The key to the approach is a self-supervised unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder on paired audio from multiple speakers and a single reference speaker, reducing accent-induced variation while preserving lexical content. With only 10 minutes of paired data for speech normalization, training the S2ST model on the VoxPopuli S2ST dataset yields an average gain of 3.2 BLEU over a baseline trained on un-normalized speech targets. Incorporating automatically mined S2ST data brings an additional 2.0 BLEU gain. This is the first textless S2ST technique that can be trained on real-world data and works for multiple language pairs.
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.
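Under heavy assumptions, here is a minimal sketch of what unit-based speech normalization could look like in code: a speech encoder is fine-tuned so that audio from arbitrary speakers maps to the discrete unit sequence extracted from a reference speaker's recording of the same utterance. The toy convolutional encoder, the unit vocabulary size, and the CTC objective are stand-ins, not the paper's actual model or loss.

```python
# Sketch only: fine-tune an encoder to predict a reference speaker's unit sequence.
import torch
import torch.nn as nn

NUM_UNITS = 100                                  # hypothetical discrete unit vocabulary size

class ConvEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=10, stride=5)   # waveform -> frames
        self.proj = nn.Linear(64, NUM_UNITS + 1)                  # frame -> unit logits (+ blank)

    def forward(self, wav):                                       # wav: (B, T)
        h = torch.relu(self.conv(wav.unsqueeze(1)))               # (B, 64, T')
        return self.proj(h.transpose(1, 2))                       # (B, T', NUM_UNITS + 1)

encoder = ConvEncoder()
ctc = nn.CTCLoss(blank=NUM_UNITS)

# Paired data: input waveform from some speaker, target units from the reference speaker.
wav = torch.randn(2, 16000)                                       # 1 s of fake 16 kHz audio
target_units = torch.randint(0, NUM_UNITS, (2, 40))
log_probs = encoder(wav).log_softmax(-1).transpose(0, 1)          # (T', B, C) for CTCLoss
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((2,), 40, dtype=torch.long)
loss = ctc(log_probs, target_units, input_lens, target_lens)
loss.backward()
print("normalization fine-tuning loss:", float(loss))
```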
4. [CL] Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts
D Khashabi, S Lyu, S Min, L Qin, K Richardson, S Singh, S Welleck, H Hajishirzi, T Khot, A Sabharwal, Y Choi
[Allen Institute for AI & University of Washington]
Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, this paper investigates whether a discrete (textual) interpretation of continuous prompts can be extracted that is faithful to the problem they solve. In practice, a "wayward" behavior is observed between the task solved by a continuous prompt and its nearest-neighbor discrete projection: one can find continuous prompts that solve a task while projecting onto arbitrary text (e.g., the definition of a different or even contradictory task), yet remain within a very small margin (2%) of the best continuous prompt of the same size for that task. The paper offers intuitions behind this odd behavior, along with extensive empirical analyses quantifying the effect of various parameters. For instance, waywardness increases with model size: for larger models, prompts can be found that map more closely to any arbitrary text with a smaller drop in accuracy. These findings have important implications for the difficulty of faithfully interpreting continuous prompts and for their generalization across models and tasks, providing guidance for future progress in prompting language models.
Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a “wayward” behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: We can find continuous prompts that solve a task while being projected to an arbitrary text (e.g., definition of a different or even a contradictory task), while being within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e., we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models.
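The nearest-neighbor discrete projection mentioned above can be illustrated in a few lines: each vector of a tuned soft prompt is mapped to the vocabulary token whose embedding is closest. The random embedding table and tiny vocabulary below are placeholders, not a real language model's embeddings.

```python
# Sketch only: nearest-neighbor projection of a continuous prompt onto discrete tokens.
import torch

vocab = ["the", "cat", "translate", "summarize", "not", "good", "bad", "task"]
emb = torch.randn(len(vocab), 16)          # hypothetical token embedding table (V, d)
soft_prompt = torch.randn(5, 16)           # a (hypothetical) tuned continuous prompt of 5 vectors

dists = torch.cdist(soft_prompt, emb)      # (5, V) Euclidean distances
nearest = dists.argmin(dim=1)              # index of the closest token embedding per vector
print("discrete projection:", [vocab[int(i)] for i in nearest])
```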
5. [CV] Fast Point Transformer
C Park, Y Jeong, M Cho, J Park
[POSTECH CSE & GSAI]
Recent advances in neural networks have enabled better interpretation of 3D point clouds, but processing large-scale 3D scenes remains a challenging problem. Most current approaches divide a large-scale scene into small regions and combine the local predictions. However, this scheme inevitably requires additional pre- and post-processing stages and may also degrade the final output because predictions are made from a local perspective. This paper introduces Fast Point Transformer, built around a new lightweight self-attention layer that encodes continuous 3D coordinates, while a voxel-hashing-based architecture boosts computational efficiency. The method is demonstrated on 3D semantic segmentation and 3D detection. Its accuracy is competitive with the best voxel-based methods, and the network achieves inference 136 times faster than the state-of-the-art Point Transformer, with a reasonable accuracy trade-off.
The recent success of neural networks enables a better interpretation of 3D point clouds, but processing a large-scale 3D scene remains a challenging problem. Most current approaches divide a large-scale scene into small regions and combine the local predictions together. However, this scheme inevitably involves additional stages for pre- and post-processing and may also degrade the final output due to predictions in a local perspective. This paper introduces Fast Point Transformer that consists of a new lightweight self-attention layer. Our approach encodes continuous 3D coordinates, and the voxel hashing-based architecture boosts computational efficiency. The proposed method is demonstrated with 3D semantic segmentation and 3D detection. The accuracy of our approach is competitive to the best voxel-based method, and our network achieves 136 times faster inference time than the state-of-the-art, Point Transformer, with a reasonable accuracy trade-off.
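For intuition, here is generic voxel-hashing bookkeeping of the kind a voxel-hashing-based architecture relies on: continuous 3D coordinates are quantized to integer voxel keys, and a hash table groups the points falling into each voxel (the paper's self-attention layer additionally encodes the continuous 3D coordinates themselves, which is not shown here). This is an illustrative sketch, not the paper's implementation.

```python
# Sketch only: hash-table grouping of point-cloud points by voxel key.
import numpy as np
from collections import defaultdict

def voxel_hash(points, voxel_size=0.05):
    """Map each 3D point to its voxel key and group point indices per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)   # (N, 3) integer voxel coordinates
    table = defaultdict(list)
    for idx, key in enumerate(map(tuple, keys)):
        table[key].append(idx)
    return table

points = np.random.rand(10_000, 3)                           # synthetic point cloud in a unit cube
table = voxel_hash(points)
print(len(table), "occupied voxels; avg points per voxel:",
      round(len(points) / len(table), 2))
```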
A few more papers worth noting:
[CV] SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
K Lin, L Li, C Lin, F Ahmed, Z Gan, Z Liu, Y Lu, L Wang
[Microsoft]
[RO] Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots
JS Bhatia, H Jackson, Y Tian, J Xu, W Matusik
[MIT CSAIL]
[CV] Putting People in their Place: Monocular Regression of 3D People in Depth
Y Sun, W Liu, Q Bao, Y Fu, T Mei, M J. Black
[Harbin Institute of Technology & JD AI Research & Max Planck Institute for Intelligent Systems]
[LG] Hierarchical Variational Memory for Few-shot Learning Across Domains
Y Du, X Zhen, L Shao, C G. M. Snoek
[University of Amsterdam & Inception Institute of Artificial Intelligence]