A paper from Saarland University (Germany) and the Bosch AI lab.

【Abstract】

Current developments in natural language processing offer challenges and opportunities for low-resource languages and domains. Deep neural networks are known for requiring large amounts of training data, which might not be available in resource-lean scenarios. However, there is also a growing body of work on improving performance in low-resource settings. Motivated by the fundamental shift towards neural models and the currently popular pre-train and fine-tune paradigm, we give an overview of promising approaches for low-resource natural language processing. After a discussion of the definition of low-resource scenarios and the different dimensions of data availability, we examine methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data, such as data augmentation and distant supervision, as well as transfer learning settings that reduce the need for target supervision. The survey closes with a brief look at methods suggested in non-NLP machine learning communities, which might also be beneficial for NLP in low-resource scenarios.
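To make the data-augmentation idea mentioned in the abstract concrete, here is a minimal sketch, not taken from the survey itself: it applies two simple, usually label-preserving perturbations (random token dropout and adjacent-token swaps) to generate extra training sentences. The function name and probability values are illustrative assumptions.

```python
import random

def augment(tokens, p_drop=0.1, p_swap=0.1, seed=None):
    """Return an augmented copy of a token list.

    Two cheap perturbations commonly used for text classification:
    - drop each token with probability p_drop
    - swap adjacent tokens with probability p_swap
    These are illustrative; real pipelines may use synonym replacement
    or back-translation instead.
    """
    rng = random.Random(seed)
    # Token dropout: randomly remove tokens.
    out = [t for t in tokens if rng.random() >= p_drop]
    # Adjacent swaps: lightly shuffle local word order.
    i = 0
    while i < len(out) - 1:
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out if out else tokens[:1]  # never emit an empty sentence

sentence = "deep models need large amounts of labeled data".split()
# Different seeds yield different augmented variants of the same example.
augmented = [augment(sentence, seed=s) for s in range(3)]
```

In a low-resource setting, each original labeled sentence can be expanded into several such variants that inherit the original label, enlarging the training set at no annotation cost.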
