- data-free KD, which requires no training data;
- online KD and self KD, where the teacher's weights are also updated;
- KD algorithms for detection, segmentation, natural language processing, and other tasks.
If you do not have time to go through the discussions above, here are the take-aways of this section:
- the non-target-class information in the logits is the key to why response-based KD works (see the sketch after this list);
- the target-class information conveys the teacher's assessment of each sample's "difficulty"; this difficulty transfer matters more when the data are noisy or the task is hard;
- compared with one-hot labels, logits act much like label smoothing, suppressing the model's tendency toward over-confidence and thereby improving generalization;
- from the perspective of knowledge quantification, response-based KD tends to make the model learn more knowledge, learn different kinds of knowledge simultaneously, and optimize in a more stable direction.
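A minimal PyTorch sketch of these two logit components, in the spirit of Decoupled Knowledge Distillation; the function name `decoupled_kd_loss` and the default hyper-parameters are illustrative assumptions, not MMRazor's implementation:

```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(student_logits, teacher_logits, target, T=4.0, alpha=1.0, beta=8.0):
    """Split response-based KD into a target-class term (sample "difficulty")
    and a non-target-class term (dark knowledge among the wrong classes)."""
    gt_mask = F.one_hot(target, num_classes=student_logits.size(1)).to(student_logits.dtype)

    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)

    # Target-class term: compare the binary distributions (p_target, 1 - p_target).
    pt_s, pt_t = p_s[gt_mask.bool()], p_t[gt_mask.bool()]
    bin_s = torch.stack([pt_s, 1.0 - pt_s], dim=1).clamp_min(1e-8)
    bin_t = torch.stack([pt_t, 1.0 - pt_t], dim=1).clamp_min(1e-8)
    tckd = F.kl_div(bin_s.log(), bin_t, reduction="batchmean") * (T * T)

    # Non-target-class term: push the target logit far down so the softmax
    # renormalizes over the non-target classes only.
    nckd = F.kl_div(
        F.log_softmax(student_logits / T - 1000.0 * gt_mask, dim=1),
        F.softmax(teacher_logits / T - 1000.0 * gt_mask, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # A larger beta emphasizes the non-target part, the component the first conclusion points to.
    return alpha * tckd + beta * nckd
```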
Summary
There is already a large body of research on feature-based KD, so this article does not discuss it in depth. To summarize briefly, the core concerns of this family of algorithms are:
- locating the knowledge (designing rules to pick out the more important teacher features, which is especially important in detection distillation);
- aligning feature dimensions, aligning feature semantics, and weighting features (connector design);
- transferring the knowledge efficiently (feature fusion, loss design); a minimal connector-plus-loss sketch follows this list.
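A minimal sketch of the connector-plus-loss pattern, assuming 4D feature maps; the `ConvConnector` module and the plain MSE loss are illustrative, FitNets-style choices rather than any specific paper's or MMRazor's implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvConnector(nn.Module):
    """Hypothetical 1x1-conv connector that maps student channels to teacher channels."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_s):
        return self.align(feat_s)

def feature_kd_loss(feat_s, feat_t, connector):
    # Dimension alignment via the connector, optional spatial alignment,
    # then a simple element-wise loss to transfer the teacher's features.
    feat_s = connector(feat_s)
    if feat_s.shape[2:] != feat_t.shape[2:]:
        feat_s = F.interpolate(feat_s, size=feat_t.shape[2:], mode="bilinear", align_corners=False)
    return F.mse_loss(feat_s, feat_t)
```

Real feature-based KD methods replace the MSE with more elaborate losses and add importance masks for knowledge localization, but the overall pipeline keeps this connector → align → loss shape.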
Summary
In recent years, relation-based KD algorithms have kept making progress on segmentation tasks. Within a single image, differences in the feature relations between pixels, or between regions, have become an effective handle for distilling segmentation models (a minimal sketch follows below). In detection, however, relation-based KD has so far produced comparatively few results.
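A minimal sketch of the pixel-relation idea, assuming the student and teacher feature maps share the same spatial size (their channel counts may differ); the function name `pairwise_affinity_loss` is an illustrative assumption rather than any particular paper's implementation:

```python
import torch
import torch.nn.functional as F

def pairwise_affinity_loss(feat_s, feat_t):
    """Distill pixel-to-pixel feature relations (affinity matrices) instead of the features themselves."""
    # Flatten the spatial dimensions and L2-normalize each pixel's feature vector.
    fs = F.normalize(feat_s.flatten(2), dim=1)  # [N, C_s, H*W]
    ft = F.normalize(feat_t.flatten(2), dim=1)  # [N, C_t, H*W]
    # Cosine-similarity affinity between every pair of pixels: [N, H*W, H*W].
    aff_s = torch.bmm(fs.transpose(1, 2), fs)
    aff_t = torch.bmm(ft.transpose(1, 2), ft)
    # Only the relation matrices are matched, so the channel counts need not agree.
    return F.mse_loss(aff_s, aff_t)
```

Region-level variants pool the features over regions before building the affinity matrix, which keeps the relation matrix small for high-resolution inputs.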
This article has given a brief introduction to the three basic families of knowledge-distillation algorithms. Most KD algorithms of recent years are optimizations and upgrades built on these three foundations, and we hope this article helps you in your further research on knowledge distillation.
- report problems you run into while using it, including but not limited to bugs, suggestions for improving the framework design, and features or algorithms you would like MMRazor to add;
- reproduce an algorithm, or the pipeline of a family of algorithms, in MMRazor (excellent reproductions may drop an internship offer, and the drop rate is really high);
- join the 超级视客营 program and claim its MMRazor tasks, gaining technical growth and generous prizes;
- help promote MMRazor and grow its user base (outstanding promoters may also drop an internship offer), and so on.