Very popular in the past few months, as it enables fine-tuning of large language models with extreme parameter efficiency: you can now take GPT-3 and fine-tune it at low cost.
Specifically for LLMs, 2 prominent strategies arose:
adapter layers:
2 layers per Transformer block, inserted between a sublayer (self-attention or feedforward) and its residual connection
downside: the extra layers must be processed sequentially, which adds inference latency, especially at small batch sizes (sketched after this list)
prefix tuning:
insert trainable word embeddings as special tokens among the input tokens
downsides: it reduces the usable sequence length (the new tokens take up context), the prompt is very hard to optimize, and gains do not increase monotonically with the number of new tokens (see the second sketch below).
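For concreteness, a minimal PyTorch sketch of a bottleneck adapter (the ReLU, the bottleneck size, and all names are illustrative assumptions, not taken from a specific paper):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    wrapped in its own residual connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Extra sequential computation on the critical path:
        # this is the inference-latency downside noted above.
        return x + self.up(self.act(self.down(x)))
```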
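And a simplified sketch of the prefix idea at the embedding level (the full method conditions more than just the input embeddings; all names here are illustrative):

```python
import torch
import torch.nn as nn

class PrefixEmbeddings(nn.Module):
    """Trainable 'virtual token' embeddings prepended to the real input
    embeddings; the base model itself stays frozen."""
    def __init__(self, num_prefix: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_prefix, d_model) * 0.02)

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, seq_len, d_model)
        batch = token_embs.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # The prefix eats into the model's maximum context length,
        # which is the sequence-size downside noted above.
        return torch.cat([prefix, token_embs], dim=1)
```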
Intuition: learned over-parameterized models reside on a low intrinsic dimension, so the hypothesis is that the weight updates during adaptation also have a low intrinsic rank (linked to the manifold hypothesis maybe?)
A pre-trained weight matrix $W_0\in \mathbb{R}^{d\times k}$ is frozen. We update it with a low-rank decomposition:
$W := W_0 + BA$ where $B\in \mathbb{R}^{d\times r}$, $A\in \mathbb{R}^{r\times k}$ (and $r \ll \min(d, k)$). Only $A$ and $B$ are trainable.
Initialization: $A$ is initialized with a random Gaussian, $B$ with zeros, so $BA = 0$ at the start of training.
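A minimal PyTorch sketch of such a layer, assuming the paper's $\alpha/r$ scaling of the update (class and parameter names are mine):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x (W0 + BA)^T with W0 frozen; only A and B are trained."""
    def __init__(self, d: int, k: int, r: int = 4, alpha: float = 1.0):
        super().__init__()
        # Stand-in for the pre-trained weight; loaded from a checkpoint in practice.
        self.W0 = nn.Parameter(torch.randn(d, k) * 0.02, requires_grad=False)
        self.B = nn.Parameter(torch.zeros(d, r))         # zeros: BA = 0 at start
        self.A = nn.Parameter(torch.randn(r, k) * 0.02)  # random Gaussian
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Computed as (x A^T) B^T so the d x k matrix BA is never materialized.
        return x @ self.W0.T + self.scaling * ((x @ self.A.T) @ self.B.T)
```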
Transformers contain the following weight matrices: $W_q$, $W_k$, $W_v$, $W_o$ in the self-attention module, plus two in the MLP. The paper only adapts the attention weights and freezes the MLP.
Order of magnitude: with $r=4$ and only the query and value projection matrices adapted, the checkpoint size of GPT-3's updates drops from 350GB to 35MB (roughly 10,000x).
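A back-of-the-envelope check of that number, assuming GPT-3 175B's published shape (96 layers, $d_{model} = 12288$, square attention projections) and 2 bytes per parameter:

```python
# B is d x r and A is r x d, so each adapted matrix adds r * 2d parameters.
n_layers, d_model, r = 96, 12288, 4
per_matrix = r * (d_model + d_model)
trainable = n_layers * 2 * per_matrix     # W_q and W_v in every layer
print(f"{trainable:,} params")            # 18,874,368 trainable parameters
print(f"{trainable * 2 / 1e6:.1f} MB")    # ~37.7 MB in fp16: the ~35MB ballpark
```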
We can also switch between tasks easily (only need to swap LoRA weights).
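A sketch of what that swap looks like, since $W = W_0 + BA$ (function names are mine):

```python
import torch

def merge_lora(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Fold the low-rank update into the base weight: inference then has
    zero extra latency compared to the original model."""
    return W0 + B @ A

def switch_task(W: torch.Tensor,
                B_old: torch.Tensor, A_old: torch.Tensor,
                B_new: torch.Tensor, A_new: torch.Tensor) -> torch.Tensor:
    """Move a merged weight to another task: subtract the old update, add the new."""
    return W - B_old @ A_old + B_new @ A_new
```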