Knowledge Distillation: Three Basic Classes of Algorithms You Must Know

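Knowledge distillation (KD) is a classic model compression method. Its core idea is to guide a lightweight student model to "mimic" a better-performing, structurally more complex teacher model (or an ensemble of models), improving the student's performance without changing its architecture. The response-based knowledge distillation technique proposed by Hinton's team in 2015 (the algorithm in that paper is usually called vanilla-KD [1]) set off a wave of related research, and feature-based and relation-based KD algorithms were proposed afterwards. Building on these three classes of distillation algorithms, academia has kept producing KD algorithms aimed at specific problems and specific scenarios, such as: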

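data-free KD, for the case where no training data is available;
online KD and self KD, where the teacher model's weights are also updated;
KD algorithms for tasks such as detection, segmentation, and natural language processing.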

This series of articles builds on the MMRazor algorithm library and will gradually unveil the various kinds of KD algorithms. MMRazor link:

https://github.com/open-mmlab/mmrazor

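As the first article in the KD series, this post focuses on the three basic classes of KD algorithms (response-based, feature-based, and relation-based), laying a foundation for later in-depth study and discussion.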

1 Response-based KD

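TL;DR: skip straight to the conclusions

If you do not have enough time to read through the discussion above, here are the conclusions of this section: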

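The non-target-class information in the logits is the key to why response-based KD works;
The target-class information conveys the teacher model's assessment of each sample's "difficulty"; when the data are noisy or the task is hard, this transfer of difficulty has a more pronounced effect;
Compared with one-hot labels, logits act much like label smoothing: they suppress the model's tendency toward over-confidence and thereby improve generalization;
From the perspective of quantifying information, response-based KD tends to make the model learn more knowledge, learn different pieces of knowledge at the same time, and follow a more stable optimization direction.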
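
To make the role of the non-target classes concrete, here is a minimal sketch of a vanilla-KD style loss in plain Python. It is an illustration rather than MMRazor's implementation: the function names, the example logits, and the temperature value of 4 are assumptions chosen for this example. It softens both sets of logits with a temperature, then measures the KL divergence from the teacher distribution to the student distribution, scaled by the squared temperature as in [1].

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem; class 0 is the target class.
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [3.0, 1.5, 0.5]

hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = kd_loss(student_logits, teacher_logits)
```

With the temperature raised, the probability mass on the non-target classes (`soft[1]`, `soft[2]`) grows relative to `hard`, so the student receives a stronger signal about how the teacher ranks the wrong classes. In practice this term is usually combined with the ordinary cross-entropy loss on the ground-truth labels.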

2 Feature-based KD

Summary

There is a large body of research on feature-based KD, so this article does not discuss it in depth. In brief, algorithms in this class focus on:

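Locating the knowledge (designing rules that select the more important teacher features, which is very important in distillation algorithms for detection);
How to align feature dimensions, align feature semantics, and weight features (connector design);
How to transfer knowledge efficiently (feature fusion and loss design).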
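
The dimension-alignment point can be sketched in a few lines. The example below is a simplified, FitNets-style [10] hint loss in plain Python, under stated assumptions: the connector is a single linear projection with hand-picked weights, the features are flat vectors, and the names `connect` and `hint_loss` are hypothetical. In a real setup the connector is learned jointly with the student and typically operates on feature maps (for example via a 1x1 convolution).

```python
def connect(student_feat, weight):
    """Project a student feature vector to the teacher's dimension.

    `weight` is a (teacher_dim x student_dim) matrix given as a list of rows.
    It is fixed here for illustration; in practice it would be learned.
    """
    return [sum(w * x for w, x in zip(row, student_feat)) for row in weight]


def hint_loss(student_feat, teacher_feat, weight):
    """Mean squared error between the projected student feature and the teacher feature."""
    projected = connect(student_feat, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_feat)) / len(teacher_feat)


student_feat = [1.0, 2.0]        # 2-dim student feature (assumed values)
teacher_feat = [1.0, 2.0, 3.0]   # 3-dim teacher feature (assumed values)

# A 3x2 connector that happens to map this student feature onto the teacher feature.
good_connector = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A connector that maps everything to zero, so the full teacher feature is "missed".
zero_connector = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

aligned_loss = hint_loss(student_feat, teacher_feat, good_connector)
unaligned_loss = hint_loss(student_feat, teacher_feat, zero_connector)
```

The connector is what lets a narrow student be compared with a wider teacher at all; feature selection and weighting schemes can be seen as refinements of which teacher entries enter the sum and with what weight.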

3 Relation-based KD

Summary

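In recent years, relation-based KD algorithms have kept making breakthroughs in segmentation. Within a single image, differences in the feature relations between pixels or between regions have become an effective means of distilling segmentation models. In detection, however, relation-based KD algorithms have achieved comparatively little.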

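One possible reason is that building a high-quality relation matrix requires a large number of samples. Classification and segmentation (where pixels or regions serve as samples) provide enough of them. In detection, by contrast, hardware constraints such as memory size limit the number of foreground objects in a batch, and some of those objects are of low quality (occluded, blurry, and so on), which holds back the application of inter-sample relation distillation to detection.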
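
The notion of a relation between samples can be illustrated with a distance-wise loss in the spirit of RKD [16]. This is a minimal sketch under stated assumptions, not the MMRazor implementation: embeddings are plain Python lists, the function names and example values are hypothetical, and the embeddings in a batch are assumed not to be all identical (otherwise the mean distance is zero). Pairwise distances within the batch are normalized by their mean, and the student's normalized distances are matched to the teacher's with a Huber loss.

```python
import math


def normalized_distances(embeddings):
    """Pairwise Euclidean distances within a batch, divided by their mean."""
    n = len(embeddings)
    dists = [
        math.dist(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists]


def huber(x, y, delta=1.0):
    """Huber loss between two scalars."""
    diff = abs(x - y)
    if diff <= delta:
        return 0.5 * diff ** 2
    return delta * (diff - 0.5 * delta)


def rkd_distance_loss(student_emb, teacher_emb):
    """Average Huber loss between student and teacher normalized pairwise distances."""
    s = normalized_distances(student_emb)
    t = normalized_distances(teacher_emb)
    return sum(huber(a, b) for a, b in zip(s, t)) / len(t)


# Teacher and student may use different embedding dimensions (3 vs. 2 here).
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 8.0, 0.0]]
same_structure = [[0.0, 0.0], [1.5, 2.0], [3.0, 4.0]]   # same geometry, half the scale
other_structure = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # different geometry

matched = rkd_distance_loss(same_structure, teacher_emb)
mismatched = rkd_distance_loss(other_structure, teacher_emb)
```

A batch of n samples yields n(n-1)/2 pairs, so the relation structure becomes richer as the number of samples grows. This is consistent with the argument above: a batch with only a handful of foreground objects gives few pairs to match.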

4 Conclusion

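This article has given a brief introduction to the three basic classes of knowledge distillation algorithms. Most KD algorithms of recent years are optimizations and upgrades built on these three classes, and we hope this article helps with your further research on knowledge distillation.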

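The algorithms mentioned in this article, including Vanilla-KD, DKD, AB, AT Loss, Factor Transfer, FitNets, OFD, and RKD, have all been implemented in MMRazor. We look forward to your use and your criticism. You are very welcome to: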

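Report problems you run into during use, including but not limited to bugs, suggestions for improving the framework design, and requests for new features or algorithms in MMRazor;
Reproduce an algorithm, or the pipeline for a class of algorithms, in MMRazor (an outstanding reproduction may "drop" an internship opportunity, and the drop rate really is high);
Take part in the 超级视客营 event and claim the MMRazor tasks there, gaining technical growth and generous prizes;
Help promote MMRazor and grow its user base (outstanding ambassadors may likewise "drop" an internship opportunity), and so on.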

References:

[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
[2] Gou J, Yu B, Maybank S J, et al. Knowledge distillation: A survey[J]. International Journal of Computer Vision, 2021, 129(6): 1789-1819.
[3] Zhao B, Cui Q, Song R, et al. Decoupled Knowledge Distillation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11953-11962.
[4] Furlanello T, Lipton Z, Tschannen M, et al. Born again neural networks[C]//International Conference on Machine Learning. PMLR, 2018: 1607-1616.
[5] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 2818-2826.
[6] Shen Z, Liu Z, Xu D, et al. Is label smoothing truly incompatible with knowledge distillation: An empirical study[J]. arXiv preprint arXiv:2104.00676, 2021.
[7] Müller R, Kornblith S, Hinton G E. When does label smoothing help?[J]. Advances in neural information processing systems, 2019, 32.
[8] Chandrasegaran K, Tran N T, Zhao Y, et al. Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?[C]//International Conference on Machine Learning. PMLR, 2022: 2890-2916.
[9] Zhang Q, Cheng X, Chen Y, et al. Quantifying the Knowledge in a DNN to Explain Knowledge Distillation for Classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[10] Romero A, Ballas N, Kahou S E, et al. FitNets: Hints for thin deep nets[J]. arXiv preprint arXiv:1412.6550, 2014.
[11] Kim J, Park S U, Kwak N. Paraphrasing complex network: Network compression via factor transfer[J]. Advances in neural information processing systems, 2018, 31.
[12] Heo B, Lee M, Yun S, et al. Knowledge transfer via distillation of activation boundaries formed by hidden neurons[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33(01): 3779-3787.
[13] Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer[J]. arXiv preprint arXiv:1612.03928, 2016.
[14] Heo B, Kim J, Yun S, et al. A comprehensive overhaul of feature distillation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 1921-1930.
[15] Yim J, Joo D, Bae J, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4133-4141.
[16] Park W, Kim D, Lu Y, et al. Relational knowledge distillation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3967-3976.
