---
base_model: google/gemma-3-12b-it
---

# MISHANM/google-gemma-3-12b-it-fp8

This model is an fp8 quantized version of google/gemma-3-12b-it, intended for hardware that supports fp8. Quantizing to fp8 reduces memory use and speeds up inference compared with the original weights, while aiming to preserve the quality of the base model. It is best suited to deployments that need high throughput and low latency.

## Model Details
1. Tasks: Causal Language Modeling, Text Generation
2. Base Model: google/gemma-3-12b-it
3. Quantization Format: fp8

# Device Used

1. GPUs: 1 × AMD Instinct™ MI210 Accelerator

## Transformers library

```sh
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
```

# Inference with Transformers

```python3
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "MISHANM/google-gemma-3-12b-it-fp8"

# Load the fp8 quantized checkpoint; device_map="auto" places the weights
# on the available device(s)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

# Define chat messages for inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs for the model
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Remember the prompt length so the prompt tokens can be dropped from the output
input_len = inputs["input_ids"].shape[-1]

# Generate model output (greedy decoding, up to 100 new tokens)
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

# Decode the generated output
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```

## Citation Information
```
@misc{MISHANM/google-gemma-3-12b-it-fp8,
  author = {Mishan Maurya},
  title = {Introducing fp8 quantized version of google/gemma-3-12b-it},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
}
```
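
## Building chat messages

The inference example passes `messages` as a list of role/content dictionaries, where each content entry is a typed part (`text` or `image`). As a minimal sketch, the helper below assembles that structure for a text-only or an image-plus-text prompt. The name `build_messages` and its parameters are hypothetical; it is not part of Transformers or of this repository and assumes only the message layout shown in the inference example.

```python
from typing import Optional


def build_messages(user_text: str,
                   image_url: Optional[str] = None,
                   system_prompt: str = "You are a helpful assistant.") -> list:
    """Hypothetical helper: build chat messages in the layout used by the inference example."""
    user_content = []
    if image_url is not None:
        # The image part goes before the text part, mirroring the inference example.
        user_content.append({"type": "image", "image": image_url})
    user_content.append({"type": "text", "text": user_text})
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": user_content},
    ]


# Text-only prompt: the user turn contains a single text part.
messages = build_messages("Summarize fp8 quantization in one sentence.")
```

The returned list should be usable in place of the hand-written `messages` when calling `processor.apply_chat_template`; pass `image_url` to include an image part as in the inference example.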