LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

Summary: textually guided audio generation; super-resolution-assisted object detection in multimodal remote sensing imagery; no free lunch in "Privacy for Free: How does Dataset Condensation Help Privacy"; bidirectional language models are also few-shot learners; using predictions to improve retrieval for semantic parsing; paraphrasing is all you need for novel object captioning; causal proxy models for concept-based model explanations; how to convert any pretrained metric into a document-level metric; a review of sampling constrained continuous probability distributions.

 

1. [AS] AUDIOGEN: Textually Guided Audio Generation
F Kreuk, G Synnaeve, A Polyak...

[Meta AI]
AUDIOGEN: textually guided audio generation. The paper tackles the problem of generating audio samples from descriptive text captions and proposes AUDIOGEN, an auto-regressive generative model that produces audio samples conditioned on text inputs. AUDIOGEN operates on a learned discrete audio representation. Text-to-audio generation poses multiple challenges. Because of how audio propagates through a medium, distinguishing "objects" can be difficult (e.g., separating multiple people speaking simultaneously), and real-world recording conditions (e.g., background noise, reverberation) complicate this further. Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, which leads to extremely long sequences. To mitigate these challenges, the paper proposes an augmentation technique that mixes different audio samples, driving the model to learn internally to separate multiple sources. Ten datasets containing different types of audio and text annotations were curated to handle the scarcity of text-audio data points. For faster inference, the paper explores multi-stream modeling, which allows shorter sequences while maintaining a similar bitrate and perceptual quality. Classifier-free guidance is applied to improve adherence to the text. Compared with the evaluated baselines, AUDIOGEN performs better on both objective and subjective metrics. Finally, the paper explores the ability of the proposed method to generate audio continuations, both conditionally and unconditionally.
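
As a rough, hypothetical sketch (not the authors' code), the mixing augmentation described above could look something like the following: two waveform/caption pairs are combined so the model must learn to separate sources internally. The gain range and the way captions are joined are illustrative assumptions.

```python
import numpy as np

def mix_examples(wav_a, caption_a, wav_b, caption_b, snr_db_range=(-5.0, 5.0)):
    """Mix two mono waveforms (same sample rate) and merge their captions."""
    n = min(len(wav_a), len(wav_b))
    wav_a, wav_b = wav_a[:n], wav_b[:n]
    # Random relative gain between the two sources.
    snr_db = np.random.uniform(*snr_db_range)
    gain_b = 10.0 ** (-snr_db / 20.0)
    mixed = wav_a + gain_b * wav_b
    # Rescale if the mixture clips.
    peak = np.max(np.abs(mixed))
    if peak > 1.0:
        mixed = mixed / peak
    # Join the captions so the text still describes everything in the mixture.
    caption = f"{caption_a} and {caption_b}"
    return mixed.astype(np.float32), caption
```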

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AUDIOGEN, an auto-regressive generative model that generates audio samples conditioned on text inputs. AUDIOGEN operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating “objects” can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Comparing to the evaluated baselines, AUDIOGEN outperforms over both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally. Samples: https://tinyurl.com/audiogen-text2audio.

https://felixkreuk.github.io/text2audio_arxiv_samples/paper.pdf

2. [CV] SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery

J Zhang, J Lei, W Xie, Z Fang, Y Li, Q Du
[Xidian University & Simon Fraser University & Mississippi State University]
SuperYOLO: super-resolution-assisted object detection in multimodal remote sensing imagery. Accurately detecting multiscale small objects in remote sensing imagery (RSI) while achieving real-time detection remains challenging, especially for time-sensitive tasks such as military reconnaissance and emergency rescue. To obtain precise locations and classifications for these small objects, one of the most applicable solutions is to fuse the complementary information in multimodal images to enhance detection capability. Most existing solutions design a complex deep neural network to learn strong feature representations that separate objects from the background, which often results in a heavy computational burden. This paper proposes SuperYOLO, an accurate yet fast small-object detection method for RSI that fuses multimodal data and performs high-resolution (HR) detection of multiscale objects by exploiting auxiliary super-resolution (SR) learning, while considering both detection accuracy and computational cost. First, a compact baseline is built by removing the Focus module to preserve HR features, largely overcoming the loss of small objects. Second, pixel-level multimodal fusion (MF) is used to extract information from different data sources, producing more suitable and effective features for small objects in RSI. Furthermore, a simple and flexible SR branch is designed to learn HR feature representations that can distinguish small objects from vast backgrounds given low-resolution (LR) input, further improving detection accuracy. Moreover, to avoid introducing extra computation, the SR branch is discarded at inference, and the computation of the network model is reduced thanks to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 73.61% (in terms of mAP50), more than 10% higher than SOTA large models such as YOLOv5l, YOLOv5x, and the RS-oriented YOLOrs. Meanwhile, the GFLOPs and parameter size of SuperYOLO are about 18.1x and 4.2x smaller than YOLOv5x. The proposed model shows a favorable accuracy-speed trade-off compared with state-of-the-art models.
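
The "train with an SR branch, drop it at inference" idea can be illustrated with a small, hypothetical PyTorch module; the layer sizes, loss, and the stand-in `backbone`/`det_head` modules below are assumptions for illustration, not SuperYOLO's actual architecture.

```python
import torch.nn as nn

class DetectorWithSRBranch(nn.Module):
    def __init__(self, backbone, det_head, feat_channels=256, scale=2):
        super().__init__()
        self.backbone = backbone          # shared feature extractor run on the LR input
        self.det_head = det_head          # detection head (used in training and inference)
        self.sr_branch = nn.Sequential(   # auxiliary SR head (training only)
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),       # upsample features to an HR RGB reconstruction
        )

    def forward(self, lr_image, hr_image=None):
        feats = self.backbone(lr_image)
        det_out = self.det_head(feats)
        if self.training and hr_image is not None:
            sr_out = self.sr_branch(feats)
            sr_loss = nn.functional.l1_loss(sr_out, hr_image)
            return det_out, sr_loss       # total loss = detection loss + lambda * sr_loss
        return det_out                    # SR branch is skipped entirely at inference
```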

 

Accurately detecting multiscale small objects and accomplishing real-time detection using remote sensing imagery (RSI) remain challenging, especially for time-sensitive tasks such as military reconnaissance and emergency rescue. To obtain precise locations and classifications for those small objects, one of the most applicable solutions is to fuse the complementary information in multimodal images to enhance the detection capability. Most of the existing solutions primarily design a complex deep neural network to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this paper, we propose an accurate yet fast small object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high resolution (HR) object detection on multiscale objects by utilizing the assisted super resolution (SR) learning and considering both the detection accuracy and computation cost. First, we construct a compact baseline by removing the Focus module to keep the HR features and significantly overcomes the missing error of small objects. Second, we utilize pixel-level multimodal fusion (MF) to extract information from various data to facilitate more suitable and effective features for small objects in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can discriminate small objects from vast backgrounds with low-resolution (LR) input, thus further improving the detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage and the computation of the network model is reduced due to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 73.61% (in terms of mAP50), which is more than 10% higher than the SOTA large models such as YOLOv5l, YOLOv5x and RS designed YOLOrs. Meanwhile, the GFLOPs and parameter size of SuperYOLO are about 18.1x and 4.2x less than YOLOv5x. Our proposed model shows a favorable accuracy-speed trade-off compared to the state-of-art models. The code will be open sourced at https://github.com/icey-zhang/SuperYOLO.

https://arxiv.org/abs/2209.13351

 

3. [LG] No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy"

N Carlini, V Feldman, M Nasr
[Google & Apple]
"免费隐私:数据集浓缩有助于改善数据隐私"中没有免费午餐。旨在保护数据隐私的新方法需要仔细审查。保护隐私的失败很难被发现,但当实施"保护隐私"方法的系统受到攻击时,可能导致灾难性的结果。最近的一项工作被选为ICML2022的杰出论文奖,声称数据集浓缩(DC)在训练机器学习模型时能显著改善数据隐私。这一说法得到了一个特定的数据集浓缩技术的理论分析和对一些现有成员推理攻击的抵抗力的经验评估的支持。本文检查了其工作中的主张,并描述了该方法的经验评估及其理论分析中的主要缺陷。这些缺陷意味着其工作没有提供统计上的重要证据,证明DC比天真的基线改善了训练机器学习模型的隐私性。此外,之前发表的结果显示,DP-SGD,即保护隐私的机器学习的标准方法,同时给出了更好的准确性,并实现了(可证明的)更低的成员攻击成功率。

New methods designed to preserve data privacy require careful scrutiny. Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a ``privacy-preserving'' method is attacked. A recent work selected for an Outstanding Paper Award at ICML 2022 (Dong et al., 2022) claims that dataset condensation (DC) significantly improves data privacy when training machine learning models. This claim is supported by theoretical analysis of a specific dataset condensation technique and an empirical evaluation of resistance to some existing membership inference attacks. In this note we examine the claims in the work of Dong et al. (2022) and describe major flaws in the empirical evaluation of the method and its theoretical analysis. These flaws imply that their work does not provide statistically significant evidence that DC improves the privacy of training ML models over a naive baseline. Moreover, previously published results show that DP-SGD, the standard approach to privacy preserving ML, simultaneously gives better accuracy and achieves a (provably) lower membership attack success rate.

https://arxiv.org/abs/2209.14987

 

4. [LG] Bidirectional Language Models Are Also Few-shot Learners

A Patel, B Li, M S Rasooli, N Constant, C Raffel, C Callison-Burch
[University of Pennsylvania & Microsoft & Google Research & UNC Chapel Hill]
Bidirectional language models are also few-shot learners. Large language models such as GPT-3 can perform arbitrary tasks without fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates prompting bidirectional models, but their pre-training objectives make them largely incompatible with the existing prompting paradigm. This paper presents SAP (Sequential Autoregressive Prompting), a technique that enables prompting of bidirectional models. Using machine translation as a case study, the bidirectional mT5 model is prompted with SAP, and its few-shot and zero-shot translations are shown to outperform the few-shot translations of unidirectional models such as GPT-3 and XGLM, even though mT5 has approximately 50% fewer parameters. SAP is further shown to be effective on question answering and summarization. These results demonstrate for the first time that prompt-based learning is an emergent property of a broader class of language models, not only unidirectional ones.
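
A minimal, hypothetical sketch of the SAP idea with Hugging Face transformers is shown below: the bidirectional mT5 model is repeatedly asked to fill a trailing <extra_id_0> sentinel and the result is appended, turning span infilling into step-by-step generation. The checkpoint, span length, and stopping rule are illustrative assumptions; the paper's actual setup (model size, few-shot prompt construction, decoding) differs.

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# Few-shot translation examples would normally precede this prompt.
prompt = "English: Hello, how are you? French:"
generated = ""
for _ in range(20):  # cap the number of generated spans
    # Ask the model to fill the sentinel at the end of the running text.
    inputs = tokenizer(prompt + generated + " <extra_id_0>", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8)
    span = tokenizer.decode(out[0], skip_special_tokens=True).strip()
    if not span:
        break  # stop when the model produces an empty continuation
    generated += " " + span
print(generated)
```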

Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.

https://arxiv.org/abs/2209.14500

 

5. [CL] Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing

Y Zemlyanskiy, M de Jong, J Ainslie...
[University of Southern California & Google Research]
Generate-and-Retrieve: using predictions to improve retrieval for semantic parsing. A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the ability to retrieve informative exemplars that help produce the correct parse, which is especially challenging in low-resource settings. Existing retrieval is typically based on the similarity between the query and exemplar inputs. This paper proposes GandR, a retrieval procedure that retrieves exemplars whose outputs are also similar. GandR first generates a preliminary prediction using input-based retrieval, then retrieves exemplars with outputs similar to that preliminary prediction, which are used to generate the final prediction. GandR sets the state of the art on multiple low-resource semantic parsing tasks.
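
A hypothetical sketch of the two-stage generate-and-retrieve loop is given below, using TF-IDF cosine similarity as a stand-in retriever; `parse_with_exemplars` is an assumed placeholder for the seq2seq parser, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, corpus, k=4):
    # Fit a TF-IDF vectorizer per call for simplicity; cache it in practice.
    vec = TfidfVectorizer().fit(corpus + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
    return sims.argsort()[::-1][:k]

def generate_and_retrieve(query, train_inputs, train_outputs, parse_with_exemplars, k=4):
    # Stage 1: input-based retrieval -> preliminary prediction.
    idx = retrieve(query, train_inputs, k)
    prelim = parse_with_exemplars(query, [(train_inputs[i], train_outputs[i]) for i in idx])
    # Stage 2: retrieve exemplars whose *outputs* resemble the preliminary parse.
    idx = retrieve(prelim, train_outputs, k)
    return parse_with_exemplars(query, [(train_inputs[i], train_outputs[i]) for i in idx])
```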

A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the ability to retrieve informative exemplars that help produce the correct parse, which is especially challenging in low-resource settings. Existing retrieval is commonly based on similarity of query and exemplar inputs. We propose GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar. GandR first generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars with outputs similar to the preliminary prediction which are used to generate a final prediction. GandR sets the state of the art on multiple low-resource semantic parsing tasks.

https://arxiv.org/abs/2209.14899

 

A few more papers worth noting:

 

[CV] Paraphrasing Is All You Need for Novel Object Captioning

C Yang, Y H Tsai, W Fan, R Salakhutdinov, L Morency, Y F Wang
[UCLA & CMU & National Taiwan University]
https://arxiv.org/abs/2209.12343

 

[CL] Causal Proxy Models for Concept-Based Model Explanations

Z Wu, K D'Oosterlinck, A Geiger, A Zur, C Potts
[Stanford University]
https://arxiv.org/abs/2209.14279

 

[CL] Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric

G Vernikos, B Thompson, P Mathur, M Federico
[EPFL & AWS AI Labs]
https://arxiv.org/abs/2209.13654

 

[LG] Sampling Constrained Continuous Probability Distributions: A Review

S Lan, L Kang
[Arizona State University & Illinois Institute of Technology]
https://arxiv.org/abs/2209.12403

 

 

If any images included in this content involve copyright issues, please contact us promptly to have them removed.