QuantLLM

Quantization of LLM

LLM Quantization Survey

Awesome LLM Quantization Repository

LLM Quantization Papers

Quantization-Aware Training (QAT)

  1. LLM-QAT (from Meta)
    • Motivation:
      1. Training data for LLMs is difficult to obtain
      2. Training LLMs involves instruction tuning, reinforcement learning, etc., which are difficult to replicate during QAT
    • Method:
      1. Data-free quantization-aware training (QAT): training data is produced by next-token generation from the pre-trained model itself -> Select an appropriate fine-tuning dataset
      2. Per-channel weight quantization and per-token activation quantization (symmetric MinMax quantization), plus per-token quantization for the KV cache -> Identify a suitable quantizer
      3. Cross-entropy based loss -> Knowledge distillation from the full-precision model
    • Result:
      1. Empirical recommendations:
        • 8-bit quantization should be preferred over smaller full precision models, and PTQ methods are sufficient for this case
        • 4-bit models quantized using LLM-QAT should be preferred over 8-bit models of similar size -> 4-bit LLM-QAT models offer the best efficiency-accuracy tradeoff
      2. Partial results:
    • Limitation:
      1. 4-bit quantization lacks out-of-the-box hardware support -> no hardware implementation is provided
      2. Method works well for 4-bit weights, 4-bit KV cache and 8-bit activations -> Insufficient for 4-bit activation quantization
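The symmetric MinMax per-channel weight quantizer listed under Method can be sketched in a few lines. This is an illustrative fake-quantization sketch, not Meta's implementation; the tensor layout (rows = output channels) and the epsilon guard are assumptions:

```python
import numpy as np

def minmax_symmetric_fake_quant(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Per-channel symmetric MinMax fake quantization: each output channel
    (row) gets its own scale derived from its maximum absolute value."""
    qmax = 2 ** (n_bits - 1) - 1                          # e.g. 7 for 4-bit
    # per-channel scale; epsilon guard avoids division by zero on all-zero rows
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)     # snap to integer grid
    return q * scale                                      # dequantized weights
```

During QAT the non-differentiable rounding step is typically bypassed with a straight-through estimator so gradients flow to the underlying full-precision weights.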
  2. PEQA (from NAVER)
    • Motivation:
      1. Bridging the gap between parameter-efficient fine-tuning (PEFT, e.g. LoRA, Prefix Tuning) and quantization -> combine PEFT with quantized LLMs
    • Method:
      1. Overall pipeline
      2. Solely updating quantization scales while freezing the integer quantization values of pre-trained weights
    • Result:
      1. Memory footprint and inference latency
      2. Common-sense reasoning and in-context learning performance
      3. Massive Multitask Language Understanding (MMLU) benchmark performance
    • Limitation:
      1. Only covers low-bit weight-only quantization (linear asymmetric, per-channel) -> lacks a weight-activation quantization part
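PEQA's core idea, freezing the integer quantization values and fine-tuning only the per-channel scales, can be sketched as follows. This is a simplified illustration (the learning rate and shapes are assumptions, and the gradient step is shown for a plain squared-error objective rather than a task loss):

```python
import numpy as np

def peqa_init(w: np.ndarray, n_bits: int = 4):
    """Linear asymmetric per-channel quantization: w ~= s * (q - z).
    The integers q and zero-points z are frozen after this step;
    only the scales s are updated during fine-tuning."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2 ** n_bits - 1)
    zero = np.round(-lo / scale)
    q = np.clip(np.round(w / scale + zero), 0, 2 ** n_bits - 1)
    return q, zero, scale

def peqa_step(q, zero, scale, grad_w, lr=1e-3):
    """One SGD step on the scales only: since w_hat = s * (q - z), the
    gradient w.r.t. s is the gradient on w_hat times (q - z), summed per row."""
    grad_scale = (grad_w * (q - zero)).sum(axis=1, keepdims=True)
    return scale - lr * grad_scale
```

Because only one scale per channel is trainable, the number of updated parameters is tiny compared with LoRA-style adapters, while the weights stay in their quantized form throughout fine-tuning.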
  3. QLoRA (from the University of Washington's UW NLP group)
    • Motivation:
      1. Reduce the memory footprint of the parameter-efficient fine-tuning (PEFT) stage
    • Method:
      1. Overall pipeline
      2. QLoRA
        • 4-bit NormalFloat Quantization -> better quantization data type for normally distributed data compared with 4-bit Integers and 4-bit Floats (See the paper for details)
        • Double Quantization -> combined with NF4 to reduce the memory footprint of the quantization constants themselves, i.e. the per-block scales (See the paper for details)
      3. Paged Optimizers -> manage memory spikes by paging optimizer states between CPU and GPU
    • Result:
      1. MMLU test accuracy
      2. Memory footprint -> enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, even 7B parameter models on mobile phones (e.g. iPhone 12 Plus)
    • Limitation:
      1. Does not establish that QLoRA can match full 16-bit finetuning performance at the 33B and 65B scales
      2. Did not evaluate other bit precisions (e.g. 3-bit base models) or other adapter methods
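The NF4 idea above, placing the quantization levels at quantiles of a standard normal so that each level covers an equal probability mass of (approximately normally distributed) weights, can be sketched as follows. This is a simplified construction for illustration; the paper's actual NF4 grid is built asymmetrically so that zero is represented exactly, and the single-block handling here is an assumption:

```python
import numpy as np
from statistics import NormalDist

def nf_levels(n_bits: int = 4) -> np.ndarray:
    """2^b quantization levels at evenly spaced quantiles of N(0, 1),
    normalized to [-1, 1] (simplified version of the NF4 grid)."""
    k = 2 ** n_bits
    qs = np.linspace(0, 1, k + 2)[1:-1]           # drop 0 and 1 (infinite quantiles)
    levels = np.array([NormalDist().inv_cdf(float(q)) for q in qs])
    return levels / np.abs(levels).max()

def nf_quantize(w: np.ndarray, levels: np.ndarray):
    """Absmax-normalize a block, then snap each value to the nearest level.
    Storage is one small integer index per weight plus one scale per block."""
    absmax = np.abs(w).max()
    idx = np.abs(w[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return idx, absmax

def nf_dequantize(idx, absmax, levels):
    return levels[idx] * absmax
```

Double Quantization then applies an ordinary 8-bit quantization to the collection of per-block `absmax` constants, which is where the extra memory saving comes from.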

Post-Training Quantization (PTQ)

Weight Quantization
  1. LUT-GEMM
  2. LLM.int8()
  3. GPTQ
  4. AWQ
  5. OWQ
  6. SpQR
  7. SqueezeLLM
  8. QuIP
  9. SignRound
Weight and Activation Quantization
  1. ZeroQuant
  2. SmoothQuant
  3. RPTQ
  4. OliVe
  5. ZeroQuant-V2
  6. Outlier Suppression+
  7. MoFQ
  8. ZeroQuant-FP
  9. FPTQ
  10. QuantEase
  11. Norm Tweaking
  12. OmniQuant