馃 Model Overview

This is a quantized variant of the Mistral 7B (small) model using LLM.int8() quantization via bitsandbytes. It reduces the memory footprint while maintaining high generation quality, making it ideal for single-GPU inference, research benchmarks, and lightweight downstream applications.
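
For intuition, here is a minimal, self-contained sketch of what LLM.int8() does conceptually: vector-wise absmax INT8 quantization plus a higher-precision path for outlier feature dimensions. It is illustrative only and does not reproduce the actual bitsandbytes kernels; the threshold value matches the llm_int8_threshold used in the usage example below.

import torch

THRESHOLD = 6.0  # same value as llm_int8_threshold in the usage example

def absmax_quantize(t, dim):
    # Map the largest absolute value along `dim` to 127, then round to int8
    scale = t.abs().amax(dim=dim, keepdim=True) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_matmul_with_outliers(x, w):
    # x: activations (tokens, hidden); w: weights (hidden, out).
    # Hidden dimensions whose magnitude exceeds THRESHOLD take the
    # higher-precision path; the rest go through the INT8 path.
    outliers = x.abs().amax(dim=0) > THRESHOLD
    hi = x[:, outliers] @ w[outliers, :]                      # outlier path
    xq, xs = absmax_quantize(x[:, ~outliers], dim=1)          # per-token scales
    wq, ws = absmax_quantize(w[~outliers, :], dim=0)          # per-output scales
    lo = (xq.float() @ wq.float()) * (xs * ws)                # dequantized INT8 path
    return hi + lo

x = torch.randn(4, 64); x[:, 3] *= 10   # force one outlier dimension
w = torch.randn(64, 32)
print((int8_matmul_with_outliers(x, w) - x @ w).abs().max())  # small quantization error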

🔧 Model Specs

  • Total Parameters: ~7.24 billion
  • Precision: INT8 with FP32 CPU offload
  • Quantization Threshold: 6.0
  • Device Map: Auto (compatible with CUDA / CPU offloading; see the offload sketch after this list)
  • Tokenizer: Byte-level BPE
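
Because FP32 CPU offload is enabled, you are not limited to device_map="auto": you can pin specific modules to the CPU when GPU memory is tight. The snippet below is a sketch under assumptions, not a prescribed configuration; the "model" / "lm_head" split assumes the standard MistralForCausalLM module layout in transformers.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_enable_fp32_cpu_offload=True  # required for the CPU-resident module below
)

# Keep the transformer body on GPU 0 (INT8) and offload the output head to the CPU (FP32).
offload_device_map = {
    "model": 0,
    "lm_head": "cpu"
}

model = AutoModelForCausalLM.from_pretrained(
    "ParveshRawal/mistral-small-int8",
    device_map=offload_device_map,
    quantization_config=quant_config
)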

🚀 Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "ParveshRawal/mistral-small-int8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# LLM.int8() quantization: outlier features above the 6.0 threshold stay in higher
# precision, and any CPU-resident modules are kept in FP32
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_enable_fp32_cpu_offload=True
)
# Load the checkpoint in 8-bit and let Accelerate spread it across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config
)

inputs = tokenizer("Tell me something about IndiaAI.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
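
To check how much memory the 8-bit checkpoint actually occupies once loaded, you can query the model directly; get_memory_footprint() is a standard method on transformers models.

# Report the loaded model's memory footprint in GB
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")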