- Introduction: This paper addresses the high computational and memory cost of fine-tuning large-scale pretrained models. As one of the most popular parameter-efficient fine-tuning (PEFT) methods, LoRA offers a cost-effective alternative by fine-tuning an auxiliary low-rank model with far fewer parameters. Although LoRA significantly reduces the computational and memory requirements of each iteration, extensive empirical evidence shows that it converges considerably more slowly than full fine-tuning, which ultimately increases total compute and often hurts test performance. The paper takes a close look at LoRA's initialization and shows that careful initialization alone, without changing the architecture or the training algorithm, can substantially improve both efficiency and performance. In particular, it introduces a new initialization method, LoRA-GA (Low-Rank Adaptation with Gradient Approximation), which aligns the gradient of the low-rank matrix product with the gradient of full fine-tuning at the first step. Extensive experiments show that LoRA-GA converges at a rate comparable to full fine-tuning (and thus much faster than vanilla LoRA and various recent variants) while reaching comparable or even better performance. For example, on a subset of the GLUE benchmark with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA improves performance by 0.34, 11.52%, and 5.05% on MT-bench, GSM8K, and HumanEval, respectively. It also converges 2-4x faster than vanilla LoRA, validating its effectiveness in accelerating convergence and improving model performance. Code is available at https://github.com/Outsider565/LoRA-GA.
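To make the central claim concrete, a short derivation of the first-step gradient alignment is sketched below. The notation (W for the frozen weight, B and A for the low-rank factors, η for the LoRA scaling, λ for the learning rate, G for the full fine-tuning gradient) is ours, and the exact scaling and choice of singular directions in the paper may differ.

```latex
% Sketch of the first-step alignment argument (our notation; the paper's exact
% scaling and choice of singular directions may differ).
% Adapted layer: y = (W + \eta B A)x, with B \in \mathbb{R}^{m \times r},
% A \in \mathbb{R}^{r \times n}, and G = \partial\mathcal{L}/\partial W the
% full fine-tuning gradient estimated on a sampled batch.
\begin{align*}
\frac{\partial \mathcal{L}}{\partial B} = \eta\, G A^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial A} = \eta\, B^{\top} G,
\qquad
\Delta(\eta B A) \approx -\lambda\,\eta^{2}\bigl(G A^{\top} A + B B^{\top} G\bigr)
\end{align*}
% after one SGD step with learning rate \lambda. With the SVD G = U S V^{\top},
% initializing A_0 = V_{[:, :r]}^{\top} and B_0 = U_{[:, r:2r]} turns the
% bracketed term into the best rank-2r approximation of G, so the first LoRA
% update tracks the full fine-tuning update as closely as a rank-2r step can.
```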
- Problem: LoRA, a popular Parameter-Efficient Fine-Tuning (PEFT) method, converges more slowly than full fine-tuning, leading to higher overall compute and often worse test performance. The paper investigates LoRA's initialization method to improve its efficiency and performance.
- Key idea: The paper introduces a novel initialization method, LoRA-GA, which aligns the gradient of the low-rank matrix product with that of full fine-tuning at the first step. LoRA-GA achieves a convergence rate comparable to full fine-tuning while attaining comparable or even better performance.
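A minimal PyTorch sketch of such a gradient-approximation initialization for a single linear layer is shown below. It only illustrates the idea (estimate the full-parameter gradient on a small batch, take its SVD, seed A and B from singular directions, and fold the now non-zero product back into W so the initial output is unchanged); the helper name `lora_ga_init_sketch`, the plain `alpha / rank` scaling, and the particular split of singular directions are simplifying assumptions, so refer to the official repository above for the exact procedure.

```python
# Minimal sketch of gradient-approximation initialization for one linear layer.
# Illustrative only; the exact scaling and singular-direction assignment used
# by LoRA-GA are in the official repository.
import torch


def lora_ga_init_sketch(weight: torch.Tensor, grad: torch.Tensor,
                        rank: int, alpha: float = 16.0):
    """weight: (d_out, d_in) frozen pretrained weight W.
    grad:   (d_out, d_in) gradient of the loss w.r.t. W, estimated with the
            frozen model on a small sampled batch.
    Requires 2 * rank <= min(d_out, d_in).
    Returns (A, B, W_adj) so the adapted layer computes (W_adj + scale * B @ A) x.
    """
    scale = alpha / rank
    # SVD of the full-parameter gradient.
    U, S, Vh = torch.linalg.svd(grad.float(), full_matrices=False)
    # Top-r right singular vectors seed A, the next r left singular vectors
    # seed B, so the first-step change of B @ A approximates the best
    # rank-2r approximation of the full fine-tuning gradient.
    A = Vh[:rank, :].to(weight.dtype)            # (r, d_in)
    B = U[:, rank:2 * rank].to(weight.dtype)     # (d_out, r)
    # B @ A is no longer zero at initialization, so subtract its contribution
    # from W to keep the adapted model's initial outputs identical to the
    # pretrained model's.
    W_adj = weight - scale * (B @ A)
    return A, B, W_adj
```

In practice, `grad` would be obtained by averaging `weight.grad` over a few forward/backward passes of the frozen model on sampled training data, and the returned `A` and `B` become the trainable adapter parameters.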
- Other highlights: Extensive experiments across models and datasets show that LoRA-GA outperforms vanilla LoRA and recent improvements. Over vanilla LoRA on Llama 2-7B, it improves results by 0.34 on MT-bench, 11.52% on GSM8K, and 5.05% on HumanEval, and it converges 2-4x faster, validating its effectiveness in accelerating convergence and enhancing model performance. The code is open-sourced on GitHub.
- Related work: Recent related studies include various Parameter-Efficient Fine-Tuning (PEFT) methods such as AdaFill, PET, and SVD-Finetune.