Parameter-Efficient Fine-Tuning Methods: LoRA

As in the previous post, this article covers a parameter-efficient fine-tuning (PEFT) method: LoRA: Low-Rank Adaptation of Large Language Models. The paper, a Microsoft work published at ICLR 2022, addresses the problem of fine-tuning large models. The following walks through how LoRA works.

LoRA

title:LoRA: Low-Rank Adaptation of Large Language Models

author:Edward J. Hu, Yelong Shen, Phillip Wallis, et al.

code:https://github.com/microsoft/LoRA

There are already many approaches to fine-tuning large models, such as partial fine-tuning, adapters, and prompting. Most of them, however, suffer from the following problems:

  1. Adapters introduce extra inference latency (they add layers);
  2. Prefix-tuning is difficult to train;
  3. Model performance falls short of full-parameter fine-tuning.

Adapter Layers Introduce Inference Latency

Clearly, adding layers to the model increases inference latency:

While one can reduce the overall latency by pruning layers or exploiting multi-task settings, there is no direct ways to bypass the extra compute in adapter layers;

The paper's latency measurements show that the relative increase in inference latency is most noticeable in online serving scenarios with batch size 1 and short sequence lengths.

Directly Optimizing the Prompt is Hard

Compared with prefix-tuning, which is hard to optimize, LoRA is much easier to train:

We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters, confirming similar observations in the original paper

Performance Falls Short of Full Fine-Tuning

Reserving part of the sequence for adaptation shrinks the sequence length available for the downstream task, which can hurt model performance:

More fundamentally, reserving a part of the sequence length for adaptation necessarily reduces the sequence length available to process a downstream task, which we suspect makes tuning the prompt less performant compared to other methods.

LoRA

Let's first look at LoRA's motivation:

A neural network contains many dense layers which perform matrix multiplication. The weight matrices in these layers typically have full-rank. When adapting to a specific task, the pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace.

Although pre-trained large models have a huge number of parameters, when applied to downstream tasks they mainly rely on a low intrinsic dimension. The adaptation should therefore also be low-rank, which motivates Low-Rank Adaptation (LoRA).

As illustrated in the LoRA paper, the idea is simple: add a bypass branch next to the original PLM that performs a down-projection followed by an up-projection, modeling the so-called intrinsic rank. During training, the PLM parameters are frozen and only the down-projection matrix A and the up-projection matrix B are trained. The model's input and output dimensions stay unchanged, and at the output the contribution of BA is added to that of the PLM parameters. A is initialized from a random Gaussian distribution and B is initialized to zero, so the bypass is still the zero matrix at the start of training.

Concretely, let the pre-trained weight matrix be $W_0 \in \mathbb{R}^{d \times k}$. Its update is constrained to a low-rank decomposition:

$$W_0 + \Delta W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)$$

so for an input $x$ the forward pass becomes $h = W_0 x + BAx$. This is somewhat similar to a residual connection: the bypass update is used to approximate full fine-tuning, and full fine-tuning can be viewed as a special case of LoRA (as the rank $r$ grows to the rank of the pre-trained weight matrices).
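To make this concrete, here is a minimal PyTorch sketch of a linear layer with a LoRA bypass. The class name `LoRALinear` and the `alpha / r` scaling convention are illustrative assumptions rather than the official implementation: the pre-trained weight is frozen, A gets a Gaussian initialization, B starts at zero, and the low-rank product is added to the frozen output.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer with a trainable low-rank bypass (illustrative sketch)."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, lora_alpha: int = 16):
        super().__init__()
        # Pre-trained weight W0: stands in for the PLM weight and is frozen during adaptation.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # Low-rank factors: A (r x in) with Gaussian init, B (out x r) with zeros,
        # so BA = 0 at the start of training and the model initially behaves like the PLM.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (B A) x, with the bypass scaled by alpha / r
        frozen = x @ self.weight.T
        lora = (x @ self.lora_A.T) @ self.lora_B.T
        return frozen + self.scaling * lora

# Usage: only lora_A and lora_B receive gradients.
layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(2, 10, 768))
```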

Combining LoRA with the Transformer is also straightforward. Self-attention has four weight matrices $W_q,W_k,W_v,W_o$, plus two weight matrices in the MLP. The authors apply LoRA only to the self-attention weights and leave the MLP module untouched. They run an ablation study on which of the attention weights to adapt; under a fixed budget of trainable parameters, adapting $W_q$ and $W_v$ together gives the best results.
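As a sketch of what this looks like with loralib (the package in the linked repo), the query and value projections can be swapped for `lora.Linear` while $W_k$, $W_o$, and the MLP remain ordinary frozen layers. The module layout and dimensions below are illustrative, and the attention computation itself is omitted.

```python
import torch.nn as nn
import loralib as lora

d_model, r = 768, 8  # illustrative sizes

class AttentionProjections(nn.Module):
    """Only W_q and W_v receive a low-rank update; W_k and W_o stay frozen (sketch)."""
    def __init__(self):
        super().__init__()
        self.q_proj = lora.Linear(d_model, d_model, r=r)  # W_q + B_q A_q
        self.k_proj = nn.Linear(d_model, d_model)          # W_k unchanged
        self.v_proj = lora.Linear(d_model, d_model, r=r)  # W_v + B_v A_v
        self.o_proj = nn.Linear(d_model, d_model)          # W_o unchanged

model = AttentionProjections()
lora.mark_only_lora_as_trainable(model)   # freeze everything except the A/B factors
lora_only = lora.lora_state_dict(model)   # checkpoint just the LoRA parameters
```

Because only the A/B factors are trained and saved, the per-task checkpoint is far smaller than the full model, which is the main storage advantage of LoRA.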

When deploying to production, we can explicitly compute and store $W = W_0 + BA$ and run inference as usual, introducing virtually no extra inference latency.
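A minimal sketch of that merge step, reusing the shapes and the `alpha / r` scaling assumed in the earlier sketch:

```python
import torch

d, k, r = 768, 768, 8
alpha = 16
W0 = torch.randn(d, k)        # frozen pre-trained weight
A = torch.randn(r, k) * 0.01  # trained LoRA down-projection
B = torch.randn(d, r)         # trained LoRA up-projection

# Merge once before deployment: W = W0 + (alpha / r) * B A.
W = W0 + (alpha / r) * (B @ A)

# Inference then uses the single dense matrix W, so the forward pass has
# exactly the same shape and cost as the original pre-trained layer.
x = torch.randn(4, k)
h = x @ W.T
```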

Experiments across many datasets show that LoRA matches full fine-tuning while training only a tiny fraction of the parameters, making it a very efficient way to update a model. Compared with Adapter and BitFit, LoRA also achieves fairly stable results with fewer trainable parameters.