Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
Abstract
Outlier-Safe Pre-Training (OSP) improves the quantization performance of large language models by preventing extreme activation outliers during training through three interventions: the Muon optimizer, Single-Scale RMSNorm, and a learnable embedding projection.
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens; it is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
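The headline statistic in the abstract is the excess kurtosis of hidden activations, which is near zero for a Gaussian and explodes in the presence of extreme outliers. The following is a minimal, hypothetical sketch (not the authors' evaluation code) of how one might measure this statistic per layer with PyTorch forward hooks; the helper names and the choice of probe point are assumptions for illustration.

```python
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    # Excess kurtosis = E[(x - mu)^4] / Var(x)^2 - 3; ~0 for Gaussian activations,
    # very large when a few channels carry extreme outliers.
    x = x.detach().float().flatten()
    mu = x.mean()
    var = x.var(unbiased=False)
    m4 = ((x - mu) ** 4).mean()
    return (m4 / (var ** 2 + 1e-12) - 3.0).item()

def register_kurtosis_hooks(model, records: dict):
    # Hypothetical usage: attach hooks to selected submodules of a causal LM
    # and record the statistic for each during a forward pass.
    def make_hook(name):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            records[name] = excess_kurtosis(hidden)
        return hook
    handles = []
    for name, module in model.named_modules():
        if name.endswith("mlp"):  # illustrative choice of probe point
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles
```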
Community
AI audio breakdown of this paper 👉 https://arxivexplained.com/papers/outlier-safe-pre-training-for-robust-4-bit-quantization-of-large-language-models
Hi there,
Interesting work on how to make LLMs better suited for quantization.
I have two questions about Table 3 in the paper, where you compare the OSP LLM with other LLMs in the quantized setting.
- The quantization setting is W4A4KV4, is that correct?
- Can you explain the quantization method and hyperparameters used for other LLMs?
Hi, thank you for your interest in our paper.
No, the quantization setting in Table 3 is W4A4KV16, since we used the vLLM implementation described in that section of the paper. Switching to W4A4KV4 would likely not change the results dramatically.
We simply use RTN (round-to-nearest) quantization for all LLMs. Specifically, you can replicate the results by running:
$ python lighteval_ptq_vllm.py \
$MODEL_PATH \
--dtype bfloat16 \
--gpu_memory_utilisation 0.4 \
--max_model_length 2048 \
--custom_tasks lighteval_tasks.py \
--tasks tasks.txt \
--output_dir results \
--save_details \
--w_bits 4 \
--a_bits 4
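For readers unfamiliar with RTN: it is plain round-to-nearest quantization with no calibration or optimization. As a rough, hypothetical illustration of per-tensor symmetric RTN (the actual script's granularity, e.g. per-channel or per-token, may differ; the function name is an assumption), the sketch below also shows why a single activation outlier hurts 4-bit precision.

```python
import torch

def rtn_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Symmetric round-to-nearest: scale by the max magnitude, round to the
    # nearest integer grid point, then dequantize back to floating point.
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 7 for signed 4-bit
    scale = x.abs().amax() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                             # "fake-quantized" tensor

# One extreme outlier inflates the scale and crushes the resolution left for
# every other value, which is why outlier-free OSP models quantize so well.
x = torch.randn(4096)
x_outlier = x.clone(); x_outlier[0] = 500.0
print((x - rtn_quantize(x)).abs().mean())                   # small error
print((x_outlier - rtn_quantize(x_outlier)).abs().mean())   # much larger error
```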