LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可-爱生活

Summary: analyzing and improving motion stability for talking face generation; open-set semi-supervised object detection; streaming intended query detection using E2E modeling for continued conversation; towards disentangled speech representations; turn-taking prediction for natural conversational speech; feature pyramid diffusion for complex scene image synthesis; implicit bias in deep-learning algorithms; collective phototactic robotectonics; attribute-conditional 3D-aware face generation via training and tuning generative neural radiance fields.

 

1. [CV] StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

J Ling, X Tan, L Chen...

[Shanghai Jiao Tong University & Microsoft Research Asia & Tsinghua University & Microsoft Azure Speech]

StableFace: analyzing and improving motion stability for talking face generation. While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of synthesized videos, they pay less attention to lip-motion jitter, which greatly undermines the realness of talking face videos. What causes motion jitter, and how can the problem be mitigated? This paper conducts a systematic analysis of the motion-jitter problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improves motion stability with a series of effective designs. Several issues can cause jitter in synthesized talking face video: 1) jitter in the input 3D face representations; 2) training-inference mismatch; 3) a lack of dependency modeling among video frames. Accordingly, three effective solutions are proposed: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitter in the input; 2) augmented erosions applied to the neural renderer's input data during training to simulate the distortion seen at inference and reduce the mismatch; 3) an audio-fused Transformer generator that models dependencies among video frames. In addition, since there is no off-the-shelf metric for measuring motion jitter in talking face video, an objective metric (Motion Stability Index, MSI) is devised to quantitatively measure motion jitter by computing the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of the proposed method for motion-stable talking face video generation, with better quality than previous systems.

While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of the synthesized videos, they pay less attention to lip motion jitters which greatly undermine the realness of talking face videos. What causes motion jitters, and how to mitigate the problem? In this paper, we conduct systematic analyses on the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improve the motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face video: 1) jitters from the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three effective solutions to address this issue: 1) we propose a Gaussian-based adaptive smoothing module to smooth the 3D face representations to eliminate jitters in the input; 2) we add augmented erosions on the input data of the neural renderer in training to simulate the distortion in inference to reduce mismatch; 3) we develop an audio-fused transformer generator to model dependency among video frames. Besides, considering there is no off-the-shelf metric for measuring motion jitters in talking face video, we devise an objective metric (Motion Stability Index, MSI), to quantitatively measure the motion jitters by calculating the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of our method on motion-stable face video generation, with better quality than previous systems.

https://arxiv.org/abs/2208.13717
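
The two most concrete ingredients above, temporal Gaussian smoothing of the 3D face parameters and the Motion Stability Index, can be illustrated with a short NumPy sketch. This is a minimal reading of the metric, assuming landmark trajectories of shape (T, K, 2) as input; the paper's exact per-landmark formulation, normalization, and adaptive smoothing schedule may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_face_params(seq: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Temporal Gaussian smoothing of a (T, D) 3D-face-parameter sequence.
    The paper's module adapts the smoothing strength; a fixed sigma is used here."""
    return gaussian_filter1d(seq, sigma=sigma, axis=0)

def motion_stability_index(landmarks: np.ndarray, fps: float = 25.0) -> float:
    """MSI sketch: jittery motion has high-variance acceleration, so the
    reciprocal of the acceleration variance is larger for smoother video.
    landmarks: (T, K, 2) array of K 2D facial landmarks over T frames."""
    velocity = np.diff(landmarks, axis=0) * fps        # (T-1, K, 2)
    acceleration = np.diff(velocity, axis=0) * fps     # (T-2, K, 2)
    var_acc = acceleration.var(axis=0).mean()          # variance over time, averaged over landmarks
    return float(1.0 / (var_acc + 1e-8))               # epsilon guards against division by zero
```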

2. [CV] Open-Set Semi-Supervised Object Detection

Y Liu, C Ma, X Dai, J Tian, P Vajda, Z He, Z Kira

[Georgia Tech & Meta]

Open-set semi-supervised object detection. Recent developments in semi-supervised object detection (SSOD) have shown the promise of leveraging unlabeled data to improve an object detector. However, these methods have so far assumed that the unlabeled data contains no out-of-distribution (OOD) classes, which is unrealistic for large-scale unlabeled datasets. This paper considers a more practical yet challenging problem: open-set semi-supervised object detection (OSSOD). It finds that existing SSOD methods obtain lower performance gains in open-set conditions, which is caused by semantic expansion, where distracting OOD objects are mispredicted as in-distribution pseudo-labels during semi-supervised training. To address this problem, online and offline OOD detection modules are considered and integrated with SSOD methods. Extensive studies show that an offline OOD detector based on a self-supervised vision Transformer performs favorably against online OOD detectors, thanks to its robustness to pseudo-labeling interference. In experiments, the proposed framework effectively addresses the semantic expansion issue and shows consistent improvements on many OSSOD benchmarks, including large-scale COCO-OpenImages. The framework's effectiveness is also verified under different OSSOD conditions, including varying numbers of in-distribution classes, different degrees of supervision, and different combinations of unlabeled sets.

Recent developments for Semi-Supervised Object Detection (SSOD) have shown the promise of leveraging unlabeled data to improve an object detector. However, thus far these methods have assumed that the unlabeled data does not contain out-of-distribution (OOD) classes, which is unrealistic with larger-scale unlabeled datasets. In this paper, we consider a more practical yet challenging problem, Open-Set Semi-Supervised Object Detection (OSSOD). We first find the existing SSOD method obtains a lower performance gain in open-set conditions, and this is caused by the semantic expansion, where the distracting OOD objects are mispredicted as in-distribution pseudo-labels for the semi-supervised training. To address this problem, we consider online and offline OOD detection modules, which are integrated with SSOD methods. With the extensive studies, we found that leveraging an offline OOD detector based on a self-supervised vision transformer performs favorably against online OOD detectors due to its robustness to the interference of pseudo-labeling. In the experiment, our proposed framework effectively addresses the semantic expansion issue and shows consistent improvements on many OSSOD benchmarks, including large-scale COCO-OpenImages. We also verify the effectiveness of our framework under different OSSOD conditions, including varying numbers of in-distribution classes, different degrees of supervision, and different combinations of unlabeled sets.

https://arxiv.org/abs/2208.13722
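
As a rough illustration of the offline OOD-filtering idea (a sketch, not the paper's actual implementation), the snippet below scores each teacher pseudo-label with crop embeddings from a self-supervised ViT such as DINO and keeps only boxes close to some in-distribution class prototype; the function name, thresholds, and prototype construction are all assumptions.

```python
import numpy as np

def filter_pseudo_labels(boxes, scores, feats, id_prototypes,
                         score_thresh=0.7, sim_thresh=0.5):
    """Keep pseudo-labels that look in-distribution.

    boxes: (N, 4) candidate pseudo-label boxes from the teacher detector.
    scores: (N,) detector confidence for each box.
    feats: (N, D) L2-normalized crop embeddings from a self-supervised ViT.
    id_prototypes: (C, D) L2-normalized class prototypes from labeled ID data.
    """
    kept = []
    for i in range(len(boxes)):
        if scores[i] < score_thresh:
            continue                         # usual confidence thresholding
        sims = id_prototypes @ feats[i]      # cosine similarity to each ID class
        if sims.max() >= sim_thresh:         # near an ID class: keep as pseudo-label
            kept.append(boxes[i])            # else: treated as OOD and discarded
    return kept
```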

3. [CL] Streaming Intended Query Detection using E2E Modeling for Continued Conversation

S Chang, G Prakash, Z Wu, Q Liang, T N. Sainath, B Li, A Stambler, S Upadhyay, M Faruqui, T Strohman

[Google]

Streaming intended query detection using E2E modeling for continued conversation. In voice-enabled applications, a predetermined hotword is usually used to activate a device so that it attends to the query. However, having to speak the hotword before every query imposes a cognitive burden in continued conversations. To avoid repeating the hotword, this paper proposes a streaming end-to-end (E2E) intended query detector that identifies utterances directed towards the device and filters out utterances that are not. The proposed approach incorporates the intended query detector into an E2E model that already folds different components of the speech recognition pipeline into one neural network. Performing speech decoding and intended query detection in a single E2E model also makes it possible to declare a quick intended-query decision based on early partial recognition results, which is important for reducing latency and keeping the system responsive. Compared with an independent intended query detector, the proposed E2E approach yields a 22% relative improvement in equal error rate (EER) and a 600 ms latency improvement. In experiments, the proposed model detects whether the user is talking to the device with an 8.7% EER, within a median latency of 1.4 seconds after the user starts speaking.

In voice-enabled applications, a predetermined hotword is usually used to activate a device in order to attend to the query. However, having to precede each query with a hotword introduces a cognitive burden in continued conversations. To avoid repeating a hotword, we propose a streaming end-to-end (E2E) intended query detector that identifies the utterances directed towards the device and filters out other utterances not directed towards the device. The proposed approach incorporates the intended query detector into the E2E model that already folds different components of the speech recognition pipeline into one neural network. The E2E modeling on speech decoding and intended query detection also allows us to declare a quick intended query detection based on early partial recognition results, which is important to decrease latency and make the system responsive. We demonstrate that the proposed E2E approach yields a 22% relative improvement on equal error rate (EER) for the detection accuracy and a 600 ms latency improvement compared with an independent intended query detector. In our experiment, the proposed model detects whether the user is talking to the device with an 8.7% EER within 1.4 seconds of median latency after the user starts speaking.

https://arxiv.org/abs/2208.13322
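
To picture the latency benefit of folding detection into the recognizer: the intended-query posterior can be thresholded on streaming partial results instead of waiting for the end of the utterance. The loop below is a toy sketch; `e2e_step` and its (partial_text, p_intended, state) return signature are hypothetical stand-ins for the joint E2E model.

```python
def streaming_intended_query(frames, e2e_step, accept=0.9, reject=0.1):
    """Declare device-directed / not-directed as soon as the running
    intended-query posterior is confident, not at utterance end."""
    state, partial_text, p_intended = None, "", 0.5
    for frame in frames:
        partial_text, p_intended, state = e2e_step(frame, state)
        if p_intended >= accept:
            return True, partial_text        # early accept: attend to the query
        if p_intended <= reject:
            return False, partial_text       # early reject: background speech
    return p_intended >= 0.5, partial_text   # no early decision: use final posterior
```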

4. [AS] Towards Disentangled Speech Representations

C Peyser, R Huang, A Rosenberg, T N. Sainath, M Picheny, K Cho

[New York University & Google]

Towards disentangled speech representations. The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches emphasize "disentanglement", where a representation contains only the parts of the speech signal relevant to transcription while discarding irrelevant information. This paper constructs a representation learning task based on joint modeling of ASR and TTS, and seeks to learn an audio representation that disentangles the part of the speech signal relevant to transcription from the part that is not. Empirical evidence is presented that successfully finding such a representation is tied to the randomness inherent in training, and it is observed that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Enforcing these properties during training improves WER by 24.5% relative on average for the joint modeling task. These observations motivate a novel approach to learning effective audio representations.

The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized “disentanglement”, where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.

https://arxiv.org/abs/2208.13191
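
The representation-learning setup can be pictured as a shared audio encoder trained under a combined ASR and TTS objective, so that the representation must both support transcription and carry enough information for resynthesis. Everything below (module names, the `.loss(...)` interfaces, the weight `lam`) is a hypothetical PyTorch-style skeleton; the statistical-property regularizer the paper adds on top is not shown.

```python
import torch.nn as nn

class JointASRTTS(nn.Module):
    """Hypothetical skeleton: one encoder feeds both an ASR head and a TTS
    head, so the shared representation z is shaped by both tasks."""
    def __init__(self, encoder, asr_head, tts_head, lam=1.0):
        super().__init__()
        self.encoder = encoder
        self.asr_head = asr_head
        self.tts_head = tts_head
        self.lam = lam

    def forward(self, audio, text):
        z = self.encoder(audio)                  # shared audio representation
        loss_asr = self.asr_head.loss(z, text)   # e.g. a CTC-style loss toward the transcript
        loss_tts = self.tts_head.loss(z, audio)  # reconstruction loss back toward the audio
        return loss_asr + self.lam * loss_tts    # joint objective shaping z
```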

5. [CL] Turn-Taking Prediction for Natural Conversational Speech

S Chang, B Li, T N. Sainath, C Zhang, T Strohman, Q Liang, Y He

[Google]

Turn-taking prediction for natural conversational speech. While streaming voice assistant systems are used in many applications, they typically focus on unnatural, one-shot interactions, assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies such as pausing to think, hesitations, word lengthening, filled pauses, and repeated phrases. This makes speech recognition on conversational speech, including utterances with multiple queries, a challenging task. To better model conversational interaction, it is critical to discriminate between disfluencies and the end of a query, so that the user can hold the floor through disfluencies while the system responds as quickly as possible once the user has finished speaking. This paper presents a turn-taking predictor built on top of an end-to-end (E2E) speech recognizer. The best system is obtained by jointly optimizing the ASR task and detecting whether the user has paused to think or finished speaking. The proposed approach achieves over 97% recall and 85% precision in predicting true turn-taking, with only 100 ms latency, on a test set designed with 4 types of disfluencies inserted into conversational utterances.

While a streaming voice assistant system has been used in many applications, this system typically focuses on unnatural, one-shot interactions assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases. This makes doing speech recognition with conversational speech, including one with multiple queries, a challenging task. To better model the conversational interaction, it is critical to discriminate disfluencies and end of query in order to allow the user to hold the floor for disfluencies while having the system respond as quickly as possible when the user has finished speaking. In this paper, we present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer. Our best system is obtained by jointly optimizing for the ASR task and detecting when the user has paused to think or finished speaking. The proposed approach demonstrates over 97% recall rate and 85% precision rate on predicting true turn-taking with only 100 ms latency on a test set designed with 4 types of disfluencies inserted in conversational utterances.

https://arxiv.org/abs/2208.13321
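
To make the "hold the floor vs. respond" behavior concrete: if the recognizer's streaming hypotheses are augmented with special turn-taking markers (the token names below are invented for illustration, not taken from the paper), the downstream logic reduces to a small dispatcher.

```python
def on_partial_hypothesis(tokens, respond, keep_listening):
    """Toy dispatcher over streaming ASR tokens with hypothetical markers:
    '<eou>' = end of query, '<pause>' = thinking pause / disfluency."""
    if not tokens:
        return
    if tokens[-1] == "<eou>":
        respond()           # user finished: answer as quickly as possible
    elif tokens[-1] == "<pause>":
        keep_listening()    # disfluency: let the user hold the floor
```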

 

A few more papers worth noting:

 

[CV] Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis

W Fan, Y Chen, D Chen, Y Cheng, L Yuan, Y F Wang

[National Taiwan University & Microsoft Corporation]

https://arxiv.org/abs/2208.13753

 

[LG] On the Implicit Bias in Deep-Learning Algorithms

G Vardi

[TTI Chicago]

https://arxiv.org/abs/2208.12591

 

[RO] Collective phototactic robotectonics

F Giardina, S G Prasath, L Mahadevan

[Harvard University]

https://arxiv.org/abs/2208.12373

 

[CV] Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation

J Zhang, A Siarohin, Y Liu, H Tang, N Sebe, W Wang

[University of Trento & Snap Research & ETH Zurich]

https://arxiv.org/abs/2208.12550