Reposted from arXiv每日学术速递 (arXiv Daily Academic Digest).
cs.SD (Speech): 8 papers; eess.AS (Audio Processing): 11 papers.

cs.SD Speech

【1】 Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation
Link: https://arxiv.org/abs/2212.07163

Authors: Yinhao Xu, Jian Zhou, Liang Tao, Hon Keung Kwan
Abstract: Recent studies on time-domain audio separation networks (TasNets) have made great strides in speech separation. One of the most representative TasNets is a network with a dual-path segmentation approach. However, the original model, DPRNN, used a fixed feature dimension and an unchanged segment size throughout all layers of the network. In this paper, we propose a multi-scale feature fusion transformer network (MSFFT-Net) based on the conventional dual-path structure for single-channel speech separation. Unlike the conventional dual-path structure, which has only one processing path and adopts several iterative blocks with alternating intra-chunk and inter-chunk operations to capture local and global context information, the proposed MSFFT-Net has multiple parallel processing paths between which feature information can be exchanged. Experiments show that our proposed networks based on the multi-scale feature fusion structure achieve better results than the original dual-path model on the benchmark WSJ0-2mix dataset: the SI-SNRi score of MSFFT-3P is 20.7 dB (a 1.47% improvement) and that of MSFFT-2P is 21.0 dB (a 3.45% improvement), achieving SOTA on WSJ0-2mix without any data augmentation.
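The dual-path idea underlying this family of models is easiest to see in code. Below is a minimal PyTorch sketch of one generic dual-path block (chunk the frame sequence, run a transformer within each chunk, then another across chunks); the dimensions and the absence of chunk overlap are illustrative assumptions, not the authors' MSFFT-Net configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathBlock(nn.Module):
    """One intra-chunk + inter-chunk transformer pass over chunked features (sketch)."""
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256, batch_first=True)

    def forward(self, x):                    # x: [B, S, K, N] -- S chunks of K frames, N features
        B, S, K, N = x.shape
        h = self.intra(x.reshape(B * S, K, N)).reshape(B, S, K, N)   # local context within each chunk
        h = h.transpose(1, 2).reshape(B * K, S, N)                   # regroup so chunks line up per frame index
        h = self.inter(h).reshape(B, K, S, N).transpose(1, 2)        # global context across chunks
        return h                                                     # back to [B, S, K, N]

def chunk(feats, K=100):
    """Split encoder features [B, T, N] into non-overlapping chunks [B, S, K, N]."""
    B, T, N = feats.shape
    pad = (K - T % K) % K
    feats = F.pad(feats, (0, 0, 0, pad))
    return feats.reshape(B, -1, K, N)

x = chunk(torch.randn(2, 340, 64))           # -> [2, 4, 100, 64]
print(DualPathBlock()(x).shape)
```

The multi-scale variant described in the abstract would run several such paths in parallel (with different chunk sizes or feature dimensions) and exchange features between them.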

 

【2】 CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Link: https://arxiv.org/abs/2212.07065

Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
Affiliations: Sony Group Corporation; University of California San Diego
Abstract: Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
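The conditioning mechanism can be sketched compactly: a CLIP-style query vector modulates a mask estimator over the mixture spectrogram. The FiLM-style modulation, layer sizes, and variable names below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class QueryConditionedMasker(nn.Module):
    """Query-conditioned separation sketch: a 512-d query vector (e.g. a CLIP
    embedding) scales and shifts hidden features before mask prediction."""
    def __init__(self, n_freq=513, d_query=512, d_hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, d_hidden), nn.ReLU())
        self.film = nn.Linear(d_query, 2 * d_hidden)          # per-channel scale and shift from the query
        self.dec = nn.Sequential(nn.Linear(d_hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, query):                        # mix_mag: [B, T, F], query: [B, d_query]
        h = self.enc(mix_mag)
        scale, shift = self.film(query).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return self.dec(h) * mix_mag                          # estimated magnitude of the queried source

masker = QueryConditionedMasker()
est = masker(torch.randn(2, 100, 513).abs(), torch.randn(2, 512))
```

During training the query would be a CLIP image embedding of a video frame; at test time a CLIP text embedding can be substituted zero-shot, since both live in the same joint embedding space.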

 

【3】 Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Link: https://arxiv.org/abs/2212.06972

Authors: Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, Stefan Wermter
Affiliation: University of Hamburg
Abstract: Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for unsupervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective and subjective evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models and surpass the state-of-the-art methods when combining Prosody2Vec with HuBERT representations. Some audio samples can be found on our demo website.
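A rough picture of how the three components plug together for reconstruction: discrete content units, a fixed speaker embedding, and a frame-level prosody embedding are fused and decoded to mel frames. All dimensions, the GRU decoder, and the fusion-by-addition are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ProsodyReconstructor(nn.Module):
    """Toy reconstruction decoder in the spirit of Prosody2Vec (sketch only)."""
    def __init__(self, n_units=100, d_spk=192, d_pros=128, d_model=256, n_mels=80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d_model)     # discrete content units (e.g. clustered SSL features)
        self.spk_proj = nn.Linear(d_spk, d_model)          # frozen speaker-verification embedding
        self.pros_proj = nn.Linear(d_pros, d_model)        # trainable prosody encoder output
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, units, spk_emb, pros_emb):           # units: [B, T]; spk_emb: [B, d_spk]; pros_emb: [B, T, d_pros]
        h = self.unit_emb(units) + self.spk_proj(spk_emb).unsqueeze(1) + self.pros_proj(pros_emb)
        h, _ = self.decoder(h)
        return self.out(h)                                 # predicted mel frames (train with a reconstruction loss)

model = ProsodyReconstructor()
mel = model(torch.randint(0, 100, (2, 50)), torch.randn(2, 192), torch.randn(2, 50, 128))
```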

 

【4】 Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks
Link: https://arxiv.org/abs/2212.07327

Authors: Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux
Note: Submitted to IEEE TASLP (in review); 13 pages, 6 figures
Abstract: Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX - understood to include ambient noise and natural sound events). We benchmark the performance of several deep learning-based source separation models on this task and evaluate them with respect to simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that better correlate with human perception. Furthermore, we thoroughly evaluate how source separation can influence downstream transcription tasks. First, we investigate the task of activity detection on the three sources as a way to both further improve source separation and perform transcription. We formulate the transcription tasks as speech recognition for speech and audio tagging for music and SFX. We observe that, while the use of source separation estimates improves transcription performance in comparison to the original soundtrack, performance is still sub-optimal due to artifacts introduced by the separation process. Therefore, we thoroughly investigate how remixing of the three separated source stems at various relative levels can reduce artifacts and consequently improve the transcription performance. We find that remixing music and SFX interferences at a target SNR of 17.5 dB reduces speech recognition word error rate, and similar impact from remixing is observed for tagging music and SFX content.
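The remixing experiment is easy to reproduce in principle: scale the separated interference stems so that their power sits a fixed number of dB below the speech stem before feeding the remix to the recognizer. A minimal NumPy sketch of that gain computation, assuming time-aligned stems of equal length (variable names are ours, not the paper's):

```python
import numpy as np

def remix_at_target_snr(speech, interference, target_snr_db=17.5):
    """Scale `interference` (e.g. separated music + SFX) so that the
    speech-to-interference ratio of the remix equals `target_snr_db`."""
    p_speech = np.mean(speech ** 2)
    p_interf = np.mean(interference ** 2) + 1e-12       # avoid division by zero on silent stems
    gain = np.sqrt(p_speech / (p_interf * 10 ** (target_snr_db / 10)))
    return speech + gain * interference

# e.g. remix = remix_at_target_snr(speech_est, music_est + sfx_est)
```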

 

【5】 Event-driven Spectrotemporal Feature Extraction and Classification using a Silicon Cochlea Model
Link: https://arxiv.org/abs/2212.07136

Authors: Ying Xu, Samalika Perera, Yeshwanth Bethi, Saeed Afshar, André van Schaik
Affiliation: International Centre for Neuromorphic Systems, The MARCS Institute, Western Sydney University, Kingswood, NSW, Australia
Note: 12 pages, 8 figures
Abstract: This paper presents a reconfigurable digital implementation of an event-based binaural cochlear system on a Field Programmable Gate Array (FPGA). It consists of a pair of the Cascade of Asymmetric Resonators with Fast Acting Compression (CAR FAC) cochlea models and leaky integrate-and-fire (LIF) neurons. Additionally, we propose an event-driven SpectroTemporal Receptive Field (STRF) Feature Extraction using Adaptive Selection Thresholds (FEAST). It is tested on the TIDIGITS benchmark and compared with current event-based auditory signal processing approaches and neural networks.
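For readers unfamiliar with the neuron model driven by the cochlea channels, here is a tiny NumPy sketch of a leaky integrate-and-fire (LIF) update step; the time constants are arbitrary illustrative values, and this is a generic software LIF, not the paper's FPGA implementation.

```python
import numpy as np

def lif_step(v, i_in, dt=1e-4, tau=5e-3, v_rest=0.0, v_thresh=1.0, v_reset=0.0, r_mem=1.0):
    """One Euler step of a leaky integrate-and-fire neuron population.
    v, i_in: arrays of membrane potentials and input currents (one per channel)."""
    v = v + (dt / tau) * (-(v - v_rest) + r_mem * i_in)   # leak towards rest, integrate input
    spikes = v >= v_thresh                                # threshold crossing -> spike event
    v = np.where(spikes, v_reset, v)                      # reset neurons that spiked
    return v, spikes

v = np.zeros(64)                                          # e.g. one neuron per cochlea channel
v, spikes = lif_step(v, i_in=np.random.rand(64))
```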

 

【6】 Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
Link: https://arxiv.org/abs/2212.06397

Authors: Chunyu Qiang, Peng Yang, Hao Che, Xiaorui Wang, Zhongyuan Wang
Affiliation: Kwai, Beijing, P.R. China
Note: Published at ISCSLP 2022
Abstract: Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesised speech in a target speaker's timbre. Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method which can realize style transfer from a source speaker to a target speaker without style labels. Firstly, a reference encoder structure based on a quantized variational autoencoder (Q-VAE) and a style bottleneck is designed to extract discrete style representations. Secondly, a speaker-wise batch normalization layer is proposed to reduce source-speaker leakage. In order to improve the style extraction ability of the reference encoder, a style-invariant and contrastive data augmentation method is proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
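The "speaker-wise batch normalization" can be pictured as batch normalization whose affine parameters are looked up per speaker, so speaker-dependent statistics are absorbed by the normalization rather than leaking into the style representation. Below is a PyTorch sketch of one plausible reading; the per-speaker affine lookup and tensor shapes are our assumptions, not the authors' exact layer.

```python
import torch
import torch.nn as nn

class SpeakerWiseBatchNorm1d(nn.Module):
    """Batch norm with per-speaker scale/shift parameters (illustrative sketch)."""
    def __init__(self, num_features, num_speakers):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Embedding(num_speakers, num_features)
        self.beta = nn.Embedding(num_speakers, num_features)
        nn.init.ones_(self.gamma.weight)                  # start as an identity affine transform
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, speaker_id):                     # x: [B, C, T], speaker_id: [B]
        h = self.bn(x)
        g = self.gamma(speaker_id).unsqueeze(-1)          # [B, C, 1]
        b = self.beta(speaker_id).unsqueeze(-1)
        return g * h + b

layer = SpeakerWiseBatchNorm1d(num_features=80, num_speakers=10)
out = layer(torch.randn(4, 80, 120), torch.tensor([0, 3, 3, 7]))
```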

 

【7】 Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Link: https://arxiv.org/abs/2212.06387

Authors: Hyeongju Kim, Hyeong-Seok Choi
Affiliations: Supertone, Inc.; Seoul National University
Note: 5 pages, submitted to ICASSP 2023
Abstract: Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only algorithmically, but also at the level of the evaluation metric. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg identifies phoneme boundaries with a significant margin over existing models. Furthermore, we note that the popular evaluation metric, R-value, has a limitation, and propose new evaluation metrics that prevent each boundary from contributing to the evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion suitable for evaluating phoneme boundary detection.
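The core of the metric fix, stopping one reference boundary from being credited by several nearby predictions, amounts to one-to-one matching within a tolerance window. A small sketch of that idea (the tolerance, greedy matching, and F1 aggregation are our illustrative choices, not the authors' exact metric):

```python
def boundary_scores(pred, ref, tol=0.02):
    """Precision/recall/F1 for boundary times (in seconds) with one-to-one matching:
    each reference boundary can be matched by at most one prediction."""
    ref, pred = sorted(ref), sorted(pred)
    used = [False] * len(ref)
    hits = 0
    for p in pred:
        best, best_d = None, tol
        for i, r in enumerate(ref):                # pick the closest unused reference within tol
            d = abs(p - r)
            if not used[i] and d <= best_d:
                best, best_d = i, d
        if best is not None:
            used[best] = True
            hits += 1
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(ref), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1

# Two predictions hugging one true boundary no longer both count as hits:
print(boundary_scores(pred=[0.101, 0.108, 0.35], ref=[0.10, 0.35]))
```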

 

【8】 Jointly Learning Visual and Auditory Speech Representations from Raw Data
Link: https://arxiv.org/abs/2212.06246

Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
Affiliations: Imperial College London; Meta AI
Note: 22 pages
Abstract: We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: Whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models will be made public.
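The "slowly-evolving momentum encoder" is the familiar exponential-moving-average (EMA) target used in BYOL-style objectives. A minimal PyTorch sketch of that update, with the momentum value as an illustrative assumption:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(online: nn.Module, target: nn.Module, m: float = 0.999):
    """EMA update of the target (momentum) encoder from the online encoder."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.data.mul_(m).add_(p_online.data, alpha=1.0 - m)

online_encoder = nn.Linear(128, 256)                 # stand-in for the audio/visual encoder
target_encoder = copy.deepcopy(online_encoder)       # contextualised targets come from this copy
momentum_update(online_encoder, target_encoder)      # called once per training step
```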

 

eess.AS Audio Processing

【1】 Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks
Link: https://arxiv.org/abs/2212.07327

*Same paper as cs.SD 【4】

Authors: Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux
Note: Submitted to IEEE TASLP (in review); 13 pages, 6 figures

 

【2】 Event-driven Spectrotemporal Feature Extraction and Classification using a Silicon Cochlea Model
Link: https://arxiv.org/abs/2212.07136

*Same paper as cs.SD 【5】

Authors: Ying Xu, Samalika Perera, Yeshwanth Bethi, Saeed Afshar, André van Schaik
Affiliation: International Centre for Neuromorphic Systems, The MARCS Institute, Western Sydney University, Kingswood, NSW, Australia
Note: 12 pages, 8 figures

 

【3】 Probing Deep Speaker Embeddings for Speaker-related Tasks
Link: https://arxiv.org/abs/2212.07068

Authors: Zifeng Zhao, Ding Pan, Junyi Peng, Rongzhi Gu
Affiliations: School of ECE, Peking University, China; Brno University of Technology, Czechia; Tencent AI Lab, China
Abstract: Deep speaker embeddings have shown promising results in speaker recognition, as well as in other speaker-related tasks. However, some issues remain under-explored, for instance, the information encoded in these representations and their influence on downstream tasks. Four deep speaker embeddings are studied in this paper, namely d-vector, x-vector, ResNetSE-34 and ECAPA-TDNN. Inspired by human voice mechanisms, we explore the possibly encoded information from the perspectives of identity, content and channel. Based on this, experiments were conducted on three categories of speaker-related tasks to further explore the impact of different deep embeddings, including discriminative tasks (speaker verification and diarization), guiding tasks (target speaker detection and extraction) and regulating tasks (multi-speaker text-to-speech). Results show that all deep embeddings encode channel and content information in addition to speaker identity, though to varying extents, and their performance on speaker-related tasks can differ tremendously: ECAPA-TDNN is dominant in discriminative tasks and d-vector leads the guiding tasks, while the regulating task is less sensitive to the choice of speaker representation. These findings may benefit future research utilizing speaker embeddings.
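Probing experiments of this kind reduce to training a shallow classifier on frozen embeddings and asking how well it predicts a given attribute (speaker, channel, content, ...). A scikit-learn sketch of such a probe; the data variables in the comment are placeholders, not released artifacts of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def linear_probe(embeddings, labels):
    """Fit a linear probe on frozen speaker embeddings and report held-out accuracy."""
    x_tr, x_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return accuracy_score(y_te, clf.predict(x_te))

# Comparing e.g. linear_probe(xvectors, speaker_ids) with linear_probe(xvectors, channel_ids)
# indicates how much channel information an embedding carries beyond speaker identity.
```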

 

【4】 DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect
Link: https://arxiv.org/abs/2212.07000

Authors: Jinglin Liu, Zhenhui Ye, Qian Chen, Siqi Zheng, Wen Wang, Qinglin Zhang, Zhou Zhao
Affiliations: Zhejiang University; Speech Lab, Alibaba Group
Abstract: Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual reality. Binaural audio helps us orient ourselves and establish immersion by providing the brain with interaural time differences that reflect spatial information. However, existing methods are limited in terms of phase estimation, which is crucial for spatial hearing. In this paper, we propose the DopplerBAS method to explicitly address the Doppler effect of the moving sound source. Specifically, we calculate the radial relative velocity of the moving speaker in spherical coordinates, which further guides the synthesis of binaural audio. This simple method neither introduces any additional hyper-parameters nor modifies the loss functions, and is plug-and-play: it scales well to different types of backbones. NeuralDopper distinctly improves WarpNet and BinauralGrad in the phase error metric and reaches a new state-of-the-art: 0.780 (vs. the current state-of-the-art 0.807). Experiments and ablation studies demonstrate the effectiveness of our method.
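The Doppler-related conditioning boils down to the radial component of the source velocity relative to the listener, v_r = ⟨v, r⟩ / |r|. A NumPy sketch with positions and velocities in listener-centred coordinates; this is the generic quantity, not necessarily the authors' exact conditioning feature.

```python
import numpy as np

def radial_velocity(rel_pos, rel_vel):
    """Radial (line-of-sight) velocity of a moving source relative to the listener.
    rel_pos, rel_vel: [..., 3] arrays; positive output means the source is receding."""
    r = np.linalg.norm(rel_pos, axis=-1, keepdims=True)
    r_hat = rel_pos / np.clip(r, 1e-8, None)
    return np.sum(rel_vel * r_hat, axis=-1)

positions = np.array([[1.0, 0.0, 0.0], [1.1, 0.0, 0.0]])   # source positions at two frames
velocity = np.diff(positions, axis=0) / 0.01               # finite-difference estimate (dt = 10 ms)
print(radial_velocity(positions[1:], velocity))            # -> [10.] m/s, moving away
```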

 

【5】 Speech and Natural Language Processing Technologies for Pseudo-Pilot Simulator
Link: https://arxiv.org/abs/2212.07164

Authors: Amrutha Prasad, Juan Zuluaga-Gomez, Petr Motlicek, Saeed Sarfjoo, Iuliia Nigmatulina, Karel Vesely
Affiliations: Idiap Research Institute, Martigny, Switzerland; Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland; Institute of Computational Linguistics, University of Zurich, Switzerland (†equal contribution)
Note: Presented at SESAR Innovation Days 2022. this https URL
Abstract: This paper describes a simple yet efficient repetition-based modular system for speeding up air-traffic controller (ATCo) training. For example, a human pilot is still required in EUROCONTROL's ESCAPE lite simulator (see https://www.eurocontrol.int/simulator/escape) during ATCo training. However, this need can be met instead by an automatic system that acts as a pilot. In this paper, we aim to develop and integrate a pseudo-pilot agent into the ATCo training pipeline by merging diverse artificial intelligence (AI) powered modules. The system understands the voice communications issued by the ATCo and, in turn, generates a spoken prompt that follows the pilot's phraseology in response to the initial communication. Our system relies mainly on open-source AI tools and air traffic control (ATC) databases, thus proving its simplicity and ease of replicability. The overall pipeline is composed of the following: (1) a submodule that receives and pre-processes the input stream of raw audio; (2) an automatic speech recognition (ASR) system that transforms audio into a sequence of words; (3) a high-level ATC-related entity parser, which extracts relevant information from the communication, i.e., callsigns and commands; and finally, (4) a speech synthesizer submodule that generates responses based on the high-level ATC entities previously extracted. Overall, we show that this system could pave the way toward developing a real proof-of-concept pseudo-pilot system, hence speeding up the training of ATCos while drastically reducing its overall cost.
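To make the pipeline concrete, here is a toy version of stages (3) and (4): a regex-based parser that pulls a callsign and a command out of an ASR transcript, and a template "pilot readback" that a TTS submodule would speak. The phraseology patterns are hypothetical simplifications, not the system's actual grammar.

```python
import re

NUMBER_WORDS = r"(?:zero|one|two|three|four|five|six|seven|eight|nine|decimal| )+"

def parse_atc(transcript: str):
    """Extract a (very simplified) callsign and command from a lower-cased ASR transcript."""
    callsign = re.search(rf"\b([a-z]+ {NUMBER_WORDS})", transcript)
    command = re.search(r"\b(descend|climb|turn left|turn right|maintain|contact)\b.*", transcript)
    return {
        "callsign": callsign.group(1).strip() if callsign else None,
        "command": command.group(0).strip() if command else None,
    }

def pilot_readback(entities):
    """Template-based response the speech synthesizer submodule would vocalise."""
    if entities["callsign"] and entities["command"]:
        return f"{entities['command']}, {entities['callsign']}"
    return "say again"

print(pilot_readback(parse_atc("lufthansa four five six descend flight level three two zero")))
# -> "descend flight level three two zero, lufthansa four five six"
```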

 

【6】 Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation
Link: https://arxiv.org/abs/2212.07163

*Same paper as cs.SD 【1】

Authors: Yinhao Xu, Jian Zhou, Liang Tao, Hon Keung Kwan

 

【7】 CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Link: https://arxiv.org/abs/2212.07065

*Same paper as cs.SD 【2】

Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
Affiliations: Sony Group Corporation; University of California San Diego

 

【8】 Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Link: https://arxiv.org/abs/2212.06972

*Same paper as cs.SD 【3】

Authors: Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, Stefan Wermter
Affiliation: University of Hamburg

 

【9】 Towards deep generation of guided wave representations for composite materials
Link: https://arxiv.org/abs/2212.06365

Authors: Mahindra Rautela, J. Senthilnath, Armin Huber, S. Gopalakrishnan
Affiliation: J. Senthilnath is with the Institute for Infocomm Research
Abstract: Laminated composite materials are widely used in most fields of engineering. Wave propagation analysis plays an essential role in understanding the short-duration transient response of composite structures. Forward physics-based models are utilized to map from the elastic-property space to the wave propagation behavior in a laminated composite material. Due to the high-frequency, multi-modal and dispersive nature of the guided waves, the physics-based simulations are computationally demanding, which makes property prediction, generation and material design problems more challenging. In this work, a forward physics-based simulator, the stiffness matrix method, is utilized to collect group velocities of guided waves for a set of composite materials. A variational autoencoder (VAE)-based deep generative model is proposed for the generation of new and realistic polar group-velocity representations. It is observed that the deep generator is able to reconstruct unseen representations with very low mean-square reconstruction error. Global Monte Carlo and directional equally-spaced samplers are used to sample the continuous, complete and organized low-dimensional latent space of the VAE. The sampled points are fed into the trained decoder to generate new polar representations. The network has shown exceptional generation capabilities. It is also seen that the latent space forms a conceptual space in which different directions and regions show inherent patterns related to the generated representations and their corresponding material properties.
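The generation step described here, drawing latent codes either globally (Monte Carlo) or along a fixed direction in latent space and pushing them through the trained decoder, looks roughly like the following PyTorch sketch; the latent size, decoder, and output dimensionality are placeholders, not the paper's network.

```python
import torch
import torch.nn as nn

latent_dim, n_angles = 16, 360                          # illustrative sizes only
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_angles))

z_mc = torch.randn(64, latent_dim)                      # global Monte Carlo samples from N(0, I)

direction = torch.randn(latent_dim)
direction = direction / direction.norm()
steps = torch.linspace(-3.0, 3.0, 25).unsqueeze(1)      # equally spaced points along one latent direction
z_dir = steps * direction

with torch.no_grad():
    polar_maps = decoder(torch.cat([z_mc, z_dir]))      # new polar group-velocity representations
print(polar_maps.shape)                                 # torch.Size([89, 360])
```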

 

【10】 Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
Link: https://arxiv.org/abs/2212.06397

*Same paper as cs.SD 【6】

Authors: Chunyu Qiang, Peng Yang, Hao Che, Xiaorui Wang, Zhongyuan Wang
Affiliation: Kwai, Beijing, P.R. China
Note: Published at ISCSLP 2022

 

【11】 Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Link: https://arxiv.org/abs/2212.06387

*Same paper as cs.SD 【7】

Authors: Hyeongju Kim, Hyeong-Seok Choi
Affiliations: Supertone, Inc.; Seoul National University
Note: 5 pages, submitted to ICASSP 2023

 
