LG - Machine Learning · CV - Computer Vision · CL - Computation and Language · AS - Audio and Speech · RO - Robotics

Reposted from 爱可可爱生活.

1、[AS] Deep Learning Tools for Audacity: Helping Researchers Expand the Artist's Toolkit

H F Garcia, A Aguilar, E Manilow, D Vedenko, B Pardo

[Northwestern University & Audacity Team]

Deep learning tools for Audacity: helping researchers expand the artist's toolkit. This paper presents a software framework that integrates neural networks into the popular open-source audio editor Audacity with minimal developer effort, and showcases example use cases for both end users and neural network developers, in the hope of fostering a new level of interactivity between deep learning practitioners and end users.

We present a software framework that integrates neural networks into the popular open-source audio editing software, Audacity, with a minimal amount of developer effort. In this paper, we showcase some example use cases for both end-users and neural network developers. We hope that this work fosters a new level of interactivity between deep learning practitioners and end-users.
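To make the idea concrete, here is a minimal hypothetical sketch of the kind of uniform waveform-in/waveform-out interface an editor could expose to host arbitrary models. `NeuralEffect`, `model_fn`, and the gain stand-in are illustrative inventions for this digest, not the paper's actual API:

```python
import numpy as np

class NeuralEffect:
    """Hypothetical wrapper: the editor only needs a uniform
    waveform-in / waveform-out contract to host any model."""

    def __init__(self, model_fn, sample_rate=44100):
        self.model_fn = model_fn          # any callable: np.ndarray -> np.ndarray
        self.sample_rate = sample_rate

    def process(self, waveform: np.ndarray) -> np.ndarray:
        out = self.model_fn(waveform)
        # Keep the editor's expectations: same dtype, samples in [-1, 1].
        return np.clip(out, -1.0, 1.0).astype(waveform.dtype)

# Stand-in "model": a simple gain; a real deployment would load a trained network.
effect = NeuralEffect(lambda w: 2.0 * w)
audio = np.array([0.1, -0.4, 0.8], dtype=np.float32)
processed = effect.process(audio)
```

The point of such a contract is that model developers and editor developers only need to agree on the interface, not on each other's internals.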

https://weibo.com/1402400261/KEL1LFFjX

2、[LG] The Efficiency Misnomer

M Dehghani, A Arnab, L Beyer, A Vaswani, Y Tay

[Google Research]

The efficiency misnomer. Model efficiency is a critical aspect of developing and deploying machine learning models: inference time and latency directly affect the user experience, some applications have hard requirements, and, beyond inference costs, model training carries direct financial and environmental impacts. Although many well-established metrics for measuring model efficiency (cost indicators) exist, researchers and practitioners often assume these metrics are correlated with one another and report only a few of them. This paper thoroughly discusses common cost indicators, their advantages and disadvantages, and how they can contradict one another; it shows how incomplete reporting of cost indicators can lead to one-sided conclusions and a blurred or incomplete picture of the practical considerations of different models, and offers suggestions for improving the reporting of efficiency metrics.

Model efficiency is a critical aspect of developing and deploying machine learning models. Inference time and latency directly affect the user experience, and some applications have hard requirements. In addition to inference costs, model training also has direct financial and environmental impacts. Although there are numerous well-established metrics (cost indicators) for measuring model efficiency, researchers and practitioners often assume that these metrics are correlated with each other and report only a few of them. In this paper, we thoroughly discuss common cost indicators, their advantages and disadvantages, and how they can contradict each other. We demonstrate how incomplete reporting of cost indicators can lead to partial conclusions and a blurred or incomplete picture of the practical considerations of different models. We further present suggestions to improve reporting of efficiency metrics.
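A toy calculation illustrates how two common cost indicators can rank the same pair of models in opposite directions. The cost formulas are the standard textbook approximations (one multiply-add, i.e. two FLOPs, per weight application); the layer sizes are arbitrary:

```python
def dense_costs(d_in, d_out):
    """Parameter count and forward-pass FLOPs for a dense layer."""
    params = d_in * d_out
    flops = 2 * d_in * d_out          # one multiply-add per weight
    return params, flops

def conv1d_costs(kernel, c_in, c_out, seq_len):
    """Same for a 1-D convolution: few parameters, but each weight is
    reused at every sequence position, so FLOPs scale with seq_len."""
    params = kernel * c_in * c_out
    flops = 2 * kernel * c_in * c_out * seq_len
    return params, flops

# Model A: one large dense layer.  Model B: a small conv over a long sequence.
params_a, flops_a = dense_costs(4096, 4096)           # ~16.8M params
params_b, flops_b = conv1d_costs(3, 256, 256, 8192)   # ~0.2M params, heavy reuse
# Parameter count says A is far more expensive; FLOPs say B is.
```

Reporting only one of the two indicators would tell opposite stories about which model is "more efficient", which is exactly the kind of partial conclusion the paper warns about.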

https://weibo.com/1402400261/KEL4KmL1E

3、[LG] Spectral Bias in Practice: The Role of Function Frequency in Generalization

S Fridovich-Keil, R Gontijo-Lopes, R Roelofs

[UC Berkeley & Google Brain]

Spectral bias in practice: the role of function frequency in generalization. Despite their capacity to represent highly expressive functions, deep learning models trained with SGD seem to find simple, constrained solutions that generalize surprisingly well. Spectral bias, the tendency of neural networks to prioritize learning low-frequency functions, is one possible explanation for this phenomenon, but so far it has only been observed in theoretical models and simplified experiments. This paper proposes methodologies for measuring spectral bias in modern image-classification networks. These networks do exhibit spectral bias, and networks that generalize well strike a balance between having enough complexity (i.e. high frequencies) to fit the data and being simple enough to avoid overfitting. Experiments show that larger models learn high frequencies faster than smaller ones, but many forms of regularization, both explicit and implicit, amplify spectral bias and delay the learning of high frequencies. The paper also explores the connection between function frequency and image frequency, finding that spectral bias is sensitive to the low frequencies prevalent in natural images. This work makes it possible to measure, and ultimately control, the spectral behavior of neural networks used for image classification, a step toward understanding why deep models generalize well.

Despite their ability to represent highly expressive functions, deep learning models trained with SGD seem to find simple, constrained solutions that generalize surprisingly well. Spectral bias – the tendency of neural networks to prioritize learning low frequency functions – is one possible explanation for this phenomenon, but so far spectral bias has only been observed in theoretical models and simplified experiments. In this work, we propose methodologies for measuring spectral bias in modern image classification networks. We find that these networks indeed exhibit spectral bias, and that networks that generalize well strike a balance between having enough complexity (i.e. high frequencies) to fit the data while being simple enough to avoid overfitting. For example, we experimentally show that larger models learn high frequencies faster than smaller ones, but many forms of regularization, both explicit and implicit, amplify spectral bias and delay the learning of high frequencies. We also explore the connections between function frequency and image frequency and find that spectral bias is sensitive to the low frequencies prevalent in natural images. Our work enables measuring and ultimately controlling the spectral behavior of neural networks used for image classification, and is a step towards understanding why deep models generalize well.
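One simple way such a measurement could look (a sketch of the general idea, not the paper's exact methodology): evaluate a function on a grid, take its FFT, and ask what fraction of the spectral energy lies at or below a cutoff frequency:

```python
import numpy as np

def low_freq_energy_fraction(values, cutoff):
    """Fraction of a function's spectral energy at frequencies <= cutoff;
    a simple proxy for how 'low frequency' a learned function is."""
    spectrum = np.abs(np.fft.rfft(values)) ** 2
    return spectrum[: cutoff + 1].sum() / spectrum.sum()

x = np.linspace(0.0, 1.0, 256, endpoint=False)
smooth = np.sin(2 * np.pi * x)                # one cycle: purely low frequency
wiggly = smooth + np.sin(2 * np.pi * 40 * x)  # adds a high-frequency component
```

Tracking such a fraction for a network's input-output function over the course of training is one way to watch low frequencies being fit before high ones.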

https://weibo.com/1402400261/KEL6WB4rj

4、[CV] AugMax: Adversarial Composition of Random Augmentations for Robust Training

H Wang, C Xiao, J Kossaifi, Z Yu, A Anandkumar, Z Wang

[University of Texas at Austin & NVIDIA]

AugMax: adversarial composition of random augmentations for robust training. Data augmentation is a simple yet effective way to improve the robustness of deep neural networks (DNNs). Diversity and hardness are two complementary dimensions along which augmentation can achieve robustness: AugMix, for example, explores random compositions of a diverse set of augmentations for broader coverage, while adversarial training generates adversarially hard samples to expose weaknesses. Motivated by this, the paper proposes AugMax, a data-augmentation framework that unifies diversity and hardness: it first randomly samples multiple augmentation operators, then learns an adversarial mixture of the selected operators. As a stronger form of data augmentation, AugMax yields a significantly harder input distribution, which makes model training more challenging. To address this, the authors further design DuBIN (Dual-Batch-and-Instance Normalization), a disentangled normalization module that separates the instance-wise feature heterogeneity arising from AugMax. Experiments show that AugMax-DuBIN markedly improves out-of-distribution robustness, outperforming the state of the art by 3.03%, 3.49%, 1.82% and 0.71% on CIFAR10-C, CIFAR100-C, Tiny ImageNet-C and ImageNet-C respectively.

Data augmentation is a simple yet effective way to improve the robustness of deep neural networks (DNNs). Diversity and hardness are two complementary dimensions of data augmentation to achieve robustness. For example, AugMix explores random compositions of a diverse set of augmentations to enhance broader coverage, while adversarial training generates adversarially hard samples to spot the weakness. Motivated by this, we propose a data augmentation framework, termed AugMax, to unify the two aspects of diversity and hardness. AugMax first randomly samples multiple augmentation operators and then learns an adversarial mixture of the selected operators. Being a stronger form of data augmentation, AugMax leads to a significantly augmented input distribution which makes model training more challenging. To solve this problem, we further design a disentangled normalization module, termed DuBIN (Dual-Batch-and-Instance Normalization), that disentangles the instance-wise feature heterogeneity arising from AugMax. Experiments show that AugMax-DuBIN leads to significantly improved out-of-distribution robustness, outperforming prior art by 3.03%, 3.49%, 1.82% and 0.71% on CIFAR10-C, CIFAR100-C, Tiny ImageNet-C and ImageNet-C. Codes and pretrained models are available: https://github.com/VITA-Group/AugMax.
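A toy numpy sketch of the two-stage idea (random sampling of operators, then adversarial optimization of their mixture). Finite-difference gradients stand in for backprop through a real model, and the scalar "brightness" loss is an illustrative assumption, not the paper's objective:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def augmax_mixture(x, ops, loss_fn, steps=20, lr=0.5, seed=0):
    """Toy AugMax-style procedure: apply the sampled augmentation ops, then
    run gradient *ascent* on their mixture weights to maximize the loss
    (central differences stand in for backprop through a real model)."""
    rng = np.random.default_rng(seed)
    k = len(ops)
    logits = rng.normal(size=k)          # mixture parameters to be learned
    aug = [op(x) for op in ops]          # each sampled op applied once

    def mixed(l):
        w = softmax(l)
        return sum(wi * ai for wi, ai in zip(w, aug))

    for _ in range(steps):
        grad = np.zeros(k)
        for i in range(k):               # central-difference gradient estimate
            e = np.zeros(k)
            e[i] = 1e-4
            grad[i] = (loss_fn(mixed(logits + e)) - loss_fn(mixed(logits - e))) / 2e-4
        logits += lr * grad              # ascent: make the mixture harder
    return mixed(logits)

# Toy setup: the ops shift brightness; the stand-in "loss" grows with brightness,
# so the adversarial mixture should lean toward the +0.3 shift.
x = np.array([0.5, 0.5])
ops = [lambda v: v + 0.0, lambda v: v + 0.3]
hard = augmax_mixture(x, ops, lambda v: float(np.sum(v)))
```

In the paper's setting the mixed result would then be fed to the network being trained, with DuBIN handling the resulting feature heterogeneity.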

https://weibo.com/1402400261/KELbGFeOF

5、[CL] Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

A Maharana, M Bansal

[University of North Carolina at Chapel Hill]

Integrating visuospatial, linguistic and commonsense structure into story visualization. While much research has been done on text-to-image synthesis, little work has explored the use of the linguistic structure of the input text. Such information matters even more for story visualization, whose inputs have an explicit narrative structure that must be translated into an image sequence (or visual story); prior work in this area has shown ample room for improvement in the visual quality, consistency and relevance of generated image sequences. This paper first explores encoding constituency parse trees of the structured input with a Transformer-based recurrent architecture. Second, it augments the structured input with commonsense information and studies the impact of this external knowledge on visual story generation. Third, it incorporates visual structure via bounding boxes and dense captioning, providing feedback about the characters/objects in generated images within a dual-learning setup. Off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without fine-tuning. The model is trained end to end with an intra-story contrastive loss (between words and image sub-regions) and shows clear improvements on several metrics (and in human evaluation) across multiple datasets, along with an analysis of the linguistic and visuospatial information.

While much research has been done in text-to-image synthesis, little work has been done to explore the usage of linguistic structure of the input text. Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance. In this paper, we first explore the use of constituency parse trees using a Transformer-based recurrent architecture for encoding structured input. Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual story. Third, we also incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images within a dual learning setup. We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning. We train the model end-to-end using intra-story contrastive loss (between words and image sub-regions) and show significant improvements in several metrics (and human evaluation) for multiple datasets. Finally, we provide an analysis of the linguistic and visuo-spatial information.
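A minimal sketch of what a word/image-region contrastive loss of this general shape could look like: an InfoNCE-style objective in which row i of each embedding matrix is a matched pair and the other rows in the story serve as negatives. The exact loss used in the paper may differ; the names below are illustrative:

```python
import numpy as np

def intra_story_contrastive_loss(word_emb, region_emb, temperature=0.1):
    """InfoNCE-style loss over (n, d) word and image-region embeddings:
    row i of each matrix is assumed to be a matched word/region pair."""
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    logits = (w @ r.T) / temperature                 # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))       # cross-entropy on the diagonal

rng = np.random.default_rng(0)
words = rng.normal(size=(4, 8))
loss_matched = intra_story_contrastive_loss(words, words)         # aligned pairs
loss_shuffled = intra_story_contrastive_loss(words, words[::-1])  # misaligned pairs
```

Minimizing such a loss pushes each word's embedding toward its own image sub-region and away from the other regions in the same story.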

https://weibo.com/1402400261/KELfVkIMd

A few more papers worth noting:

 

[LG] Variational Gaussian Processes: A Functional Analysis View


V Wild, G Wynne

[University of Oxford & Imperial College London]

https://weibo.com/1402400261/KELiUijrP

[CL] s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning


H Bao, L Dong, W Wang, N Yang, F Wei

[Microsoft Research]

https://weibo.com/1402400261/KELkiAj8z

[CL] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing


S Chen, C Wang, Z Chen, Y Wu, S Liu, Z Chen, J Li, N Kanda, T Yoshioka, X Xiao, J Wu, L Zhou, S Ren, Y Qian, Y Qian, J Wu, M Zeng, F Wei

[Microsoft & Shanghai Jiao Tong University]

https://weibo.com/1402400261/KELmKtkPO

[CV] H-NeRF: Neural Radiance Fields for Rendering and Temporal Reconstruction of Humans in Motion


H Xu, T Alldieck, C Sminchisescu

[Google Research]

https://weibo.com/1402400261/KELoHkLWo

If any images in this content raise copyright concerns, please contact us promptly so we can remove them.