---
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B
pipeline_tag: text-generation
tags:
- smollm3
- fp8
- vllm
- conversational
- compressed-tensors
license: apache-2.0
license_name: apache-2.0
name: RedHatAI/SmolLM3-3B-FP8-dynamic
description: This model was obtained by quantizing activations and weights of SmolLM3-3B to FP8 data type.
readme: https://huggingface.co/RedHatAI/SmolLM3-3B-FP8-dynamic/main/README.md
tasks:
- text-to-text
- text-generation
provider: HuggingFaceTB
license_link: https://www.apache.org/licenses/LICENSE-2.0
---

## Model Overview
- **Model Architecture:** SmolLM3-3B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 07/28/2025
- **Version:** 1.0
- **License(s):** Apache-2.0
- **Model Developers:** RedHat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the activations and weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/SmolLM3-3B-FP8-dynamic"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
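As a minimal sketch of the OpenAI-compatible route (not from the original card), assuming the server is launched with `vllm serve RedHatAI/SmolLM3-3B-FP8-dynamic`, which by default exposes an OpenAI-compatible endpoint on `localhost:8000`, a request can be sent with the `openai` client:

```python
# Sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server was started with:
#   vllm serve RedHatAI/SmolLM3-3B-FP8-dynamic
# and listens on localhost:8000 (the vLLM default).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/SmolLM3-3B-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```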
## Creation

<details>
  <summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "HuggingFaceTB/SmolLM3-3B"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

</details>
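A quick way to sanity-check the result (a minimal sketch, not part of the original recipe) is to reload the saved configuration and confirm that a compressed-tensors `quantization_config` entry was written to `config.json`:

```python
from transformers import AutoConfig

# Reload the config written by save_pretrained() above and inspect the
# quantization metadata that llm-compressor records in compressed-tensors format.
config = AutoConfig.from_pretrained("SmolLM3-3B-FP8-dynamic")
print(getattr(config, "quantization_config", None))
```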
## Evaluation

This model was evaluated on well-known reasoning tasks: AIME24, MATH-500, and GPQA-Diamond. In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine, and evaluations were collected with the [LightEval](https://github.com/huggingface/lighteval) library.
<details>
  <summary>Evaluation details</summary>

```
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export MODEL="RedHatAI/SmolLM3-3B-FP8-dynamic"
export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
export TASK=aime24  # one of {aime24, math_500, gpqa:diamond}

lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
  --use-chat-template \
  --output-dir out_dir
```

</details>
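The exact output layout can vary between LightEval versions; as a hypothetical post-processing sketch, assuming the per-run scores land in JSON files under `out_dir`, they can be collected with a short script such as:

```python
import glob
import json

# Hypothetical post-processing sketch: locate the JSON result files that
# LightEval wrote under out_dir and print the per-task score summaries.
for path in glob.glob("out_dir/**/*.json", recursive=True):
    with open(path) as f:
        run = json.load(f)
    if "results" in run:  # per-run score summaries
        print(path)
        print(json.dumps(run["results"], indent=2))
```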
### Accuracy
| Category | Benchmark | HuggingFaceTB/SmolLM3-3B | RedHatAI/SmolLM3-3B-FP8-dynamic<br>(this model) | Recovery |
|----------|-----------|--------------------------|-------------------------------------------------|----------|
| **Reasoning** | AIME24 (pass@1:64) | 45.31 | 47.50 | 104.83% |
| | MATH-500 (pass@1:4) | 89.30 | 88.30 | 98.88% |
| | GPQA-Diamond (pass@1:8) | 41.22 | 40.91 | 99.25% |
| | **Average** | **58.61** | **58.90** | **100.5%** |
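Recovery is the quantized model's score expressed as a percentage of the baseline score. As a worked check of the table (a sketch using only the numbers above):

```python
# Recovery = quantized score / baseline score, as a percentage.
baseline = {"AIME24": 45.31, "MATH-500": 89.30, "GPQA-Diamond": 41.22}
quantized = {"AIME24": 47.50, "MATH-500": 88.30, "GPQA-Diamond": 40.91}

for task in baseline:
    recovery = 100 * quantized[task] / baseline[task]
    print(f"{task}: {recovery:.2f}%")  # 104.83%, 98.88%, 99.25%

avg_base = sum(baseline.values()) / len(baseline)     # 58.61
avg_quant = sum(quantized.values()) / len(quantized)  # 58.90
print(f"Average: {100 * avg_quant / avg_base:.1f}%")  # 100.5%
```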