DistiLLM: Towards Streamlined Distillation for Large Language Models

Abstract

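Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatch has significantly increased computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to make the use of student-generated outputs more efficient. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3x speedup compared to recent KD methods.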
Problem Addressed

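Current KD methods for auto-regressive language models lack a standardized objective function, and using student-generated outputs to address the training-inference mismatch has significantly increased computational cost. DistiLLM targets both issues with a more effective and efficient KD framework for auto-regressive language models.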
Key Idea

DistiLLM introduces a novel skew Kullback-Leibler divergence loss and an adaptive off-policy approach to enhance the efficiency of utilizing student-generated outputs for knowledge distillation in auto-regressive language models.
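To make the skew Kullback-Leibler loss more concrete, here is a minimal sketch. This page does not give the formula, so the sketch assumes the common definition of skew KL as the KL divergence between the teacher distribution p and the mixture alpha * p + (1 - alpha) * q, where q is the student distribution. The function names, the toy distributions, and the value alpha = 0.1 are illustrative choices, not taken from the paper's code.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def skew_kl(p, q, alpha=0.1):
    """Assumed definition: KL(p || alpha * p + (1 - alpha) * q)."""
    mixture = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)


# Toy next-token distributions over a 3-word vocabulary (illustrative values).
teacher = [0.7, 0.2, 0.1]
student = [0.0, 0.5, 0.5]  # the student assigns zero mass to the teacher's top token

plain = kl_divergence(teacher, student)
skewed = skew_kl(teacher, student, alpha=0.1)
print("plain KL:", plain)
print("skew KL: ", skewed)
```

Under this assumed definition, mixing a fraction of the teacher into the second argument keeps the divergence finite, and bounded by log(1 / alpha), even where the student assigns zero probability. Plain KL is infinite in the same situation.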
Other Highlights

The experiments demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3x speedup compared to recent KD methods. The framework was tested on instruction-following tasks and the results show its potential for practical applications. The paper also provides an in-depth analysis of the theoretical properties of the proposed skew KL divergence loss. The code and pre-trained models are publicly available.
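The off-policy component aims to reduce the cost of producing student-generated outputs. This page does not describe the actual schedule, so the sketch below shows only the general idea of reusing stored generations instead of sampling from the student at every step. `ReplayBuffer`, `next_training_sample`, and `regen_prob` are hypothetical names, and the fixed regeneration probability stands in for whatever adaptive rule the paper uses.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of previously generated student outputs (hypothetical helper)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first

    def add(self, sample):
        self.items.append(sample)

    def sample(self, rng):
        return rng.choice(list(self.items))


def next_training_sample(buffer, generate_fn, regen_prob, rng):
    """Return (sample, was_generated).

    With probability regen_prob, or when the buffer is empty, call the
    expensive generate_fn and store the result. Otherwise reuse a stored
    output, which is off-policy with respect to the current student.
    An adaptive scheme would vary regen_prob during training; it is
    fixed here, which is an assumption of this sketch.
    """
    if len(buffer.items) == 0 or rng.random() < regen_prob:
        sample = generate_fn()
        buffer.add(sample)
        return sample, True
    return buffer.sample(rng), False


if __name__ == "__main__":
    rng = random.Random(0)
    buffer = ReplayBuffer(capacity=8)
    generated = 0
    for step in range(100):
        # The lambda stands in for an expensive student decoding call.
        _, fresh = next_training_sample(
            buffer, lambda: "student output", regen_prob=0.25, rng=rng
        )
        generated += int(fresh)
    print("generation calls out of 100 steps:", generated)
```

In a sketch like this, the saving comes from skipping generation on the steps that reuse a buffered output. The cost is that reused outputs may lag behind the current student.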
Related Work

Recent related studies in this field include 'TinyBERT: Distilling BERT for Natural Language Understanding' and 'Distilling Task-Specific Knowledge from BERT into Simple Neural Networks'.
