Matrix committed
Upload README.md with huggingface_hub

README.md CHANGED
This model was converted to GGUF format from [`google/gemma-3-1b-it`](https://huggingface.co/google/gemma-3-1b-it) using llama.cpp via ggml.ai's [all-gguf-same-where](https://huggingface.co/spaces/matrixportal/all-gguf-same-where) space.

Refer to the [original model card](https://huggingface.co/google/gemma-3-1b-it) for more details on the model.

## ✅ Quantized Models Download List

### 🔍 Recommended Quantizations
- **✨ General CPU Use:** [`Q4_K_M`](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_k_m.gguf) (Best balance of speed/quality)
- **📱 ARM Devices:** [`Q4_0`](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_0.gguf) (Optimized for ARM CPUs)
- **🏆 Maximum Quality:** [`Q8_0`](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q8_0.gguf) (Near-original quality)

### 📦 Full Quantization Options

| 🚀 Download | 🔢 Type | 📝 Notes |
|:---------|:-----|:------|
| [Download](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q2_k.gguf) | Q2_K | Basic quantization |
| ... | ... | ... |
| [Download](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-f16.gguf) | F16 | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical
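
To see every quantization actually available in this repository (the table above is abbreviated), you can list the repository files programmatically. A minimal sketch using `huggingface_hub` (assumes `pip install huggingface_hub`):

```python
# List all GGUF files published in this repository
# (requires: pip install huggingface_hub).
from huggingface_hub import list_repo_files

files = list_repo_files("matrixportal/gemma-3-1b-it-GGUF")
for name in sorted(f for f in files if f.endswith(".gguf")):
    print(name)
```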

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference
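
As a quick concreteness check of the format, the sketch below reads the fixed GGUF header fields (the 4-byte `GGUF` magic, format version, and tensor/metadata counts) from a downloaded file using only the Python standard library; the file name is just an example, and the field layout follows the published GGUF spec.

```python
# Minimal GGUF header peek (illustrative; per the GGUF spec the header is:
# 4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count).
import struct

def peek_gguf_header(path: str) -> None:
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} does not look like a GGUF file (magic={magic!r})")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(4 + 8 + 8))
        print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata key/value pairs")

# Example (assumes a file from the table above has been downloaded):
# peek_gguf_header("gemma-3-1b-it-q4_k_m.gguf")
```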

**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference (with minor accuracy trade-offs)
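
To make the size reduction concrete, here is a back-of-the-envelope estimate for a roughly 1B-parameter model. The bits-per-weight figures are rough rules of thumb (quantized GGUF files also store scales and keep some tensors at higher precision), not exact numbers for this repository.

```python
# Rough model-size estimate: parameters * bits-per-weight / 8 bytes.
# The bits-per-weight values below are approximate rules of thumb.
PARAMS = 1_000_000_000  # ~1B parameters (gemma-3-1b-it is in this ballpark)

approx_bits_per_weight = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
}

for name, bpw in approx_bits_per_weight.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:7s} ~{size_gb:.1f} GB")
```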

## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
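# Note: recent llama.cpp releases build with CMake instead of make.
# If `make` fails on your checkout, the CMake equivalent is:
#   cmake -B build
#   cmake --build build --config Release -j4
# which places the binaries under ./build/bin/.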
```

### 2. Using Quantized Models from Hugging Face

My automated quantization script publishes models at direct-download URLs of this form:
```
https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_k_m.gguf
```
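
If you prefer Python over `wget`, the same file can be fetched with the Hub client. A minimal alternative sketch, assuming `huggingface_hub` is installed (`pip install huggingface_hub`):

```python
# Download the same quantized file via the huggingface_hub client.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="matrixportal/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-q4_k_m.gguf",
)
print(f"Model downloaded to: {local_path}")
```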

### 3. Running the Quantized Model

Basic usage:
```bash
./main -m gemma-3-1b-it-q4_k_m.gguf -p "Your prompt here" -n 128
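# Note: in newer llama.cpp builds the example binary is named llama-cli
# (e.g. ./build/bin/llama-cli) rather than ./main; the flags are the same.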
```

Example with a creative writing prompt:
```bash
./main -m gemma-3-1b-it-q4_k_m.gguf \
       -p "[INST] Write a short poem about AI quantization in the style of Shakespeare [/INST]" \
       -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m gemma-3-1b-it-q4_k_m.gguf \
       -p "Question: What is the GGUF format?
Answer:" \
       -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```

### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="gemma-3-1b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference
response = llm(
    "[INST] Explain GGUF quantization to a beginner [/INST]",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
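
Gemma models ship their own chat template and do not use the `[INST]` markers shown above, so it can be more robust to go through the chat API of `llama-cpp-python`, which applies a chat template for you (read from the GGUF metadata in recent versions, or selected via the `chat_format=` argument of `Llama`). A minimal sketch, with illustrative parameter values:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-1b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8,
)

# create_chat_completion formats the conversation with a chat template,
# so no hand-written [INST] markers are needed.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization to a beginner."}],
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```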

## Performance Tips

1. **Hardware Utilization**:
   - Set the thread count with `-t` (typically your CPU core count)
   - Compile with CUDA/OpenCL support to use the GPU

2. **Memory Optimization**:
   - Lower-bit quantizations (like q4_k_m) use less RAM
   - Adjust the context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for reproducible results (see the sketch below)
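
The same knobs exist in `llama-cpp-python`. A small sketch (parameter values are illustrative, not tuned for this model) that pins the thread count to the machine's cores, keeps the context modest, and disables sampling randomness for reproducible output:

```python
import os

from llama_cpp import Llama

# Thread count ~ CPU cores, modest context to limit RAM usage.
llm = Llama(
    model_path="gemma-3-1b-it-q4_k_m.gguf",  # lower-bit quant -> less RAM
    n_ctx=1024,                              # smaller context -> less memory
    n_threads=os.cpu_count() or 4,           # mirrors the -t flag
)

# temperature=0.0 mirrors --temp 0: greedy decoding, reproducible output.
out = llm("Q: What is quantization?\nA:", max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```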

## FAQ

**Q: What quantization levels are available?**  
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0

**Q: How much performance loss occurs with q4_k_m?**  
A: Typically a 2-5% accuracy reduction, but roughly 4x smaller in size

**Q: How do I enable GPU support?**  
A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs (on newer CMake-based llama.cpp builds, use `cmake -B build -DGGML_CUDA=ON` instead)

## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)
