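- Introduction: Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving the model's capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address the training-inference mismatch has significantly increased computational cost. To address these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to make the use of student-generated outputs more efficient. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to a 4.3× speedup compared to recent KD methods.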
- Problem addressed: DistiLLM: A More Effective and Efficient Knowledge Distillation Framework for Auto-regressive Language Models
- Key idea: DistiLLM introduces a novel skew Kullback-Leibler divergence loss and an adaptive off-policy approach that makes the use of student-generated outputs more efficient when distilling auto-regressive language models.
- Other highlights: The experiments demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to a 4.3× speedup compared to recent KD methods. The framework was tested on instruction-following tasks, and the results show its potential for practical applications. The paper also provides an in-depth analysis of the theoretical properties of the proposed skew KL divergence loss. The code and pre-trained models are publicly available.
- Recent related studies in this field include 'TinyBERT: Distilling BERT for Natural Language Understanding' and 'Distilling Task-Specific Knowledge from BERT into Simple Neural Networks'.
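
To make the skew KL divergence mentioned above concrete, here is a minimal sketch. This summary does not state the formula, so the code assumes the usual definition of a skew divergence, in which the teacher distribution `p` is compared against a mixture `alpha * p + (1 - alpha) * q` of the teacher and the student distribution `q`. The function names, the toy distributions, and the value `alpha = 0.1` are illustrative assumptions, not taken from the paper's released code.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def skew_kl(p, q, alpha=0.1):
    """Assumed skew KL: KL(p || alpha * p + (1 - alpha) * q)."""
    mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mix)

# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
teacher = [0.7, 0.2, 0.1, 0.0]
student = [0.0, 0.3, 0.3, 0.4]  # zero mass on the teacher's most likely token

print("plain KL:", kl_divergence(teacher, student))      # infinite for this pair
print("skew KL :", skew_kl(teacher, student, alpha=0.1))  # finite
```

Because the mixture is never smaller than `alpha * p`, the skewed value stays at or below `log(1 / alpha)` even when the student assigns zero probability to a token the teacher favors, whereas the plain KL divergence diverges in that case.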