Training LLMs involves stages such as instruction tuning and reinforcement learning, which are difficult to replicate during QAT
Method:
Data-free quantization-aware training (QAT): training data is produced by next-token generation from the full-precision model -> Select an appropriate fine-tuning dataset
Per-channel weight quantization and per-token activation quantization (symmetric MinMax quantization), plus per-token quantization for the KV cache -> Identify a suitable quantizer
Cross-entropy-based loss -> Knowledge distillation from the full-precision model
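The quantizer and distillation loss above can be sketched in a few lines of NumPy. This is a minimal illustration (shapes, bit-widths, and function names are my own assumptions, not the paper's implementation):

```python
import numpy as np

def fake_quantize(x, n_bits, axis):
    """Symmetric MinMax quantization with one scale per slice along `axis`
    (per-channel for weights, per-token for activations / KV cache)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero slices
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def kd_cross_entropy(student_logits, teacher_logits):
    """Distillation loss: cross-entropy between the full-precision teacher's
    soft distribution and the quantized student's log-probabilities."""
    t = np.exp(teacher_logits - teacher_logits.max(-1, keepdims=True))
    t /= t.sum(-1, keepdims=True)
    log_s = student_logits - student_logits.max(-1, keepdims=True)
    log_s -= np.log(np.exp(log_s).sum(-1, keepdims=True))
    return -np.mean(np.sum(t * log_s, axis=-1))

W = np.random.randn(8, 16)                 # [out_channels, in_features]
W_q = fake_quantize(W, n_bits=4, axis=1)   # per-channel 4-bit weights

A = np.random.randn(4, 16)                 # [tokens, hidden_dim]
A_q = fake_quantize(A, n_bits=8, axis=1)   # per-token 8-bit activations
```

Symmetric MinMax keeps zero exactly representable, which matters for padding and sparse activations; the per-token scales follow the token-wise dynamic range of activations and KV-cache entries.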
Result:
Empirical recommendations:
8-bit quantization should be preferred over smaller full-precision models, and PTQ methods are sufficient for this case
4-bit models quantized with LLM-QAT should be preferred over 8-bit models of similar size -> 4-bit LLM-QAT models offer the best efficiency-accuracy tradeoff
Partial results:
Limitation:
4-bit quantization lacks out-of-the-box hardware support -> no hardware implementation is provided
The method works well for 4-bit weights, 4-bit KV cache, and 8-bit activations -> it is insufficient for 4-bit activation quantization
Reduces the memory footprint of the parameter-efficient fine-tuning (PEFT) stage
Method:
Overall pipeline
QLoRA
4-bit NormalFloat (NF4) Quantization -> a better quantization data type for normally distributed data than 4-bit integers or 4-bit floats (see the paper for details)
Double Quantization -> combined with NF4 to reduce the memory footprint of the quantization constants, i.e. the per-block absmax scales (see the paper for details)
Paged Optimizers -> manage memory spikes by paging optimizer state between CPU and GPU
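A rough NumPy sketch of blockwise NF4 quantization. The 16 code values are the ones published with QLoRA (rounded here), and the block sizes 64/256 follow the paper; the function names and everything else are illustrative assumptions:

```python
import numpy as np

# 16 NF4 code values (rounded): quantiles of N(0, 1) rescaled to [-1, 1]
NF4 = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(w, block=64):
    """Blockwise NF4: one absmax scale per 64-weight block plus one
    4-bit code index per weight (assumes w.size is divisible by block)."""
    w = w.reshape(-1, block)
    absmax = np.max(np.abs(w), axis=1, keepdims=True)
    idx = np.argmin(np.abs(w[..., None] / absmax[..., None] - NF4), axis=-1)
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx, absmax):
    return NF4[idx] * absmax

# Double Quantization in one line of arithmetic: the fp32 absmax constants
# cost 32/64 = 0.5 bit/param; re-quantizing them to 8 bits with one fp32
# scale per 256 constants shrinks this to 8/64 + 32/(64*256) ~= 0.127 bit/param.
```

Because pretrained weights are approximately zero-centered normal, mapping them onto N(0, 1) quantiles uses the 16 available codes far more evenly than uniform 4-bit grids.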
Result:
MMLU test accuracy -> 4-bit NF4 QLoRA matches 16-bit LoRA and 16-bit full-finetuning performance
Memory footprint -> enables the finetuning of 33B-parameter models on a single consumer GPU and 65B-parameter models on a single professional GPU, and even 7B-parameter models on mobile phones (e.g. iPhone 12 Plus)
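Back-of-envelope arithmetic behind these footprints: weights only, excluding activations, LoRA adapters, and paged optimizer state. The ~0.127 bit/param constant overhead is the double-quantization figure; the GPU capacities are my own illustrative assumptions:

```python
def nf4_weight_gb(n_params, weight_bits=4, const_bits=0.127):
    """Approximate NF4 base-model weight memory in GB
    (4-bit codes plus double-quantized absmax constants)."""
    return n_params * (weight_bits + const_bits) / 8 / 1e9

for n in (7e9, 33e9, 65e9):
    print(f"{n/1e9:.0f}B -> {nf4_weight_gb(n):.1f} GB")
# 7B  -> ~3.6 GB,  33B -> ~17.0 GB (fits a 24 GB consumer GPU),
# 65B -> ~33.5 GB (fits a 48 GB professional GPU)
```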
Limitation:
Could not establish that QLoRA matches full 16-bit finetuning performance at the 33B and 65B scales (too resource-intensive to verify)
Did not evaluate other bit-precisions (e.g. 3-bit base models) or different adapter methods