DistiLLM: Towards Streamlined Distillation for Large Language Models

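February 6, 2024
  • Abstract
    Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatch has significantly increased computational cost. To address these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to enhance the efficiency of utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3x speedup compared to recent KD methods.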
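  • Problem addressed
    Current KD methods for auto-regressive language models lack a standardized objective function, and using student-generated outputs to address training-inference mismatch significantly increases computational cost. DistiLLM is proposed as a more effective and efficient knowledge distillation framework for such models.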
  • Key idea
    DistiLLM combines two components for knowledge distillation in auto-regressive language models: a novel skew Kullback-Leibler divergence loss, and an adaptive off-policy approach that makes the use of student-generated outputs more efficient.
  • Other highlights
    The experiments demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3x speedup compared to recent KD methods. The framework was tested on instruction-following tasks and the results show its potential for practical applications. The paper also provides an in-depth analysis of the theoretical properties of the proposed skew KL divergence loss. The code and pre-trained models are publicly available.
  • Related work
    Recent related studies in this field include 'TinyBERT: Distilling BERT for Natural Language Understanding' and 'Distilling Task-Specific Knowledge from BERT into Simple Neural Networks'.
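
The summary names a skew Kullback-Leibler divergence loss but does not state its formula. As a rough illustration only, the sketch below assumes the standard α-skew form, KL(p || αp + (1-α)q), where p is the teacher distribution and q the student distribution. The function names, the toy distributions, and the value alpha=0.1 are all hypothetical choices made for this example, not details taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.
    Terms with p_i == 0 contribute nothing; q_i == 0 with p_i > 0 raises an error."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_kl(p, q, alpha=0.1):
    """Assumed alpha-skew KL: KL(p || alpha * p + (1 - alpha) * q).
    Mixing a little of p into q keeps the second argument positive wherever p is."""
    mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mix)

# Toy next-token distributions over a 3-token vocabulary (illustrative values).
teacher = [0.7, 0.2, 0.1]
student = [0.98, 0.02, 0.0]  # the student assigns zero mass to the third token

# Plain KL(teacher || student) would divide by zero here;
# the skewed version stays finite.
print(skew_kl(teacher, student, alpha=0.1))
```

With alpha set to 0 the function reduces to the ordinary KL divergence, so under this assumed definition the skew parameter acts as a knob between the unmodified objective and a smoothed one.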