Training LLMs involves stages such as instruction tuning and reinforcement learning, which are difficult to replicate during QAT
Method:
Data-free quantization-aware training (QAT): training data is produced by next-token generation from the full-precision model -> Select an appropriate fine-tuning dataset
Per-channel weight quantization and per-token activation quantization (symmetric MinMax quantization), plus per-token quantization for the KV cache -> Identify a suitable quantizer
Cross-entropy-based loss -> Knowledge distillation from the full-precision model
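The quantizer and distillation loss above can be sketched in a few lines of NumPy. This is a minimal illustration (shapes, bit-widths, and function names are my own assumptions, not the paper's implementation):

```python
import numpy as np

def fake_quantize(x, n_bits, axis):
    """Symmetric MinMax quantization with one scale per slice along `axis`
    (per-channel for weights, per-token for activations / KV cache)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero slices
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def kd_cross_entropy(student_logits, teacher_logits):
    """Distillation loss: cross-entropy between the full-precision teacher's
    soft distribution and the quantized student's log-probabilities."""
    t = np.exp(teacher_logits - teacher_logits.max(-1, keepdims=True))
    t /= t.sum(-1, keepdims=True)
    log_s = student_logits - student_logits.max(-1, keepdims=True)
    log_s -= np.log(np.exp(log_s).sum(-1, keepdims=True))
    return -np.mean(np.sum(t * log_s, axis=-1))

W = np.random.randn(8, 16)                 # [out_channels, in_features]
W_q = fake_quantize(W, n_bits=4, axis=1)   # per-channel 4-bit weights

A = np.random.randn(4, 16)                 # [tokens, hidden_dim]
A_q = fake_quantize(A, n_bits=8, axis=1)   # per-token 8-bit activations
```

Symmetric MinMax keeps zero exactly representable, which matters for padding and sparse activations; the per-token scales follow the token-wise dynamic range of activations and KV-cache entries.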
Result:
Empirical recommendations:
8-bit quantization should be preferred over smaller full-precision models, and PTQ methods are sufficient for this case
4-bit models quantized with LLM-QAT should be preferred over 8-bit models of similar size -> 4-bit LLM-QAT models offer the best efficiency-accuracy tradeoff
Partial results:
Limitation:
4-bit quantization lacks out-of-the-box hardware support -> no hardware implementation is provided
The method works well for 4-bit weights, 4-bit KV cache, and 8-bit activations -> it is insufficient for 4-bit activation quantization
Reduces the memory footprint of the parameter-efficient fine-tuning (PEFT) stage
Method:
Overall pipeline
QLoRA
4-bit NormalFloat (NF4) Quantization -> a better quantization data type for normally distributed data than 4-bit integers or 4-bit floats (see the paper for details)
Double Quantization -> combined with NF4 to reduce the memory footprint of the quantization constants, i.e. the per-block absmax scales (see the paper for details)
Paged Optimizers -> manage memory spikes by paging optimizer state between CPU and GPU
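A rough NumPy sketch of blockwise NF4 quantization. The 16 code values are the ones published with QLoRA (rounded here), and the block sizes 64/256 follow the paper; the function names and everything else are illustrative assumptions:

```python
import numpy as np

# 16 NF4 code values (rounded): quantiles of N(0, 1) rescaled to [-1, 1]
NF4 = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(w, block=64):
    """Blockwise NF4: one absmax scale per 64-weight block plus one
    4-bit code index per weight (assumes w.size is divisible by block)."""
    w = w.reshape(-1, block)
    absmax = np.max(np.abs(w), axis=1, keepdims=True)
    idx = np.argmin(np.abs(w[..., None] / absmax[..., None] - NF4), axis=-1)
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx, absmax):
    return NF4[idx] * absmax

# Double Quantization in one line of arithmetic: the fp32 absmax constants
# cost 32/64 = 0.5 bit/param; re-quantizing them to 8 bits with one fp32
# scale per 256 constants shrinks this to 8/64 + 32/(64*256) ~= 0.127 bit/param.
```

Because pretrained weights are approximately zero-centered normal, mapping them onto N(0, 1) quantiles uses the 16 available codes far more evenly than uniform 4-bit grids.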
Result:
MMLU test accuracy -> 4-bit NF4 QLoRA matches 16-bit LoRA and 16-bit full-finetuning performance
Memory footprint -> enables the finetuning of 33B-parameter models on a single consumer GPU and 65B-parameter models on a single professional GPU, and even 7B-parameter models on mobile phones (e.g. iPhone 12 Plus)
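Back-of-envelope arithmetic behind these footprints: weights only, excluding activations, LoRA adapters, and paged optimizer state. The ~0.127 bit/param constant overhead is the double-quantization figure; the GPU capacities are my own illustrative assumptions:

```python
def nf4_weight_gb(n_params, weight_bits=4, const_bits=0.127):
    """Approximate NF4 base-model weight memory in GB
    (4-bit codes plus double-quantized absmax constants)."""
    return n_params * (weight_bits + const_bits) / 8 / 1e9

for n in (7e9, 33e9, 65e9):
    print(f"{n/1e9:.0f}B -> {nf4_weight_gb(n):.1f} GB")
# 7B  -> ~3.6 GB,  33B -> ~17.0 GB (fits a 24 GB consumer GPU),
# 65B -> ~33.5 GB (fits a 48 GB professional GPU)
```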
Limitation:
Could not establish that QLoRA matches full 16-bit finetuning performance at the 33B and 65B scales (too resource-intensive to verify)
Did not evaluate other bit-precisions (e.g. 3-bit base models) or different adapter methods