Matrix committed · Commit 8802855 · verified · 1 parent: 54ee6c6

Upload README.md with huggingface_hub

Files changed (1): README.md (+130, −2)
README.md CHANGED
This model was converted to GGUF format from [`google/gemma-3-1b-it`](https://huggingface.co/google/gemma-3-1b-it) using llama.cpp via ggml.ai's [all-gguf-same-where](https://huggingface.co/spaces/matrixportal/all-gguf-same-where) space.
Refer to the [original model card](https://huggingface.co/google/gemma-3-1b-it) for more details on the model.

## ✅ Quantized Models Download List

### 🔍 Recommended Quantizations
- **✨ General CPU Use:** [`Q4_K_M`](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_k_m.gguf) (Best balance of speed/quality)
- **📱 ARM Devices:** [`Q4_0`](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_0.gguf) (Optimized for ARM CPUs)
- **🏆 Maximum Quality:** [`Q8_0`](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q8_0.gguf) (Near-original quality)

### 📦 Full Quantization Options
| 🚀 Download | 🔢 Type | 📝 Notes |
|:---------|:-----|:------|
| [Download](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q2_k.gguf) | ![Q2_K](https://img.shields.io/badge/Q2_K-1A73E8) | Basic quantization |
| [Download](https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference

**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- (with minor accuracy trade-offs; see the sketch below)
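To make the size/accuracy trade-off concrete, here is a minimal NumPy sketch of block-wise absmax quantization to 4-bit values. It only illustrates the principle; it is not llama.cpp's actual Q4_0/Q4_K kernel, and the block size and rounding scheme are simplifying assumptions.

```python
import numpy as np

# Toy block-wise 4-bit quantization (absmax style): each float32 weight
# (4 bytes) becomes a 4-bit integer plus a shared per-block scale.
def quantize_block_q4(weights):
    scale = np.abs(weights).max() / 7.0           # map the block onto [-7, 7]
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block_q4(q, scale):
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)    # one 32-weight block
q, scale = quantize_block_q4(block)
restored = dequantize_block_q4(q, scale)

print("max abs error:", float(np.abs(block - restored).max()))
print("fp32 bytes:", block.nbytes, "-> packed 4-bit bytes (approx.):", len(q) // 2 + 4)
```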

## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```

### 2. Using Quantized Models from Hugging Face

My automated quantization script publishes each model at a URL of this form:
```
https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-q4_k_m.gguf
```

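If you prefer Python, the same file can be fetched with the `huggingface_hub` client; a minimal sketch, assuming `pip install huggingface_hub` and the repo/filename shown above:

```python
from huggingface_hub import hf_hub_download

# Fetch one quantized variant from the Hub into the local HF cache.
model_path = hf_hub_download(
    repo_id="matrixportal/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-q4_k_m.gguf",
)
print("Model downloaded to:", model_path)
```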
### 3. Running the Quantized Model

Basic usage (recent llama.cpp builds name the CLI binary `llama-cli`; older builds ship it as `./main`):
```bash
./main -m gemma-3-1b-it-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt, using Gemma's chat turn format:
```bash
./main -m gemma-3-1b-it-q4_k_m.gguf -p "<start_of_turn>user
Write a short poem about AI quantization in the style of Shakespeare<end_of_turn>
<start_of_turn>model
" -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m gemma-3-1b-it-q4_k_m.gguf -p "Question: What is the GGUF format?
Answer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```

### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="gemma-3-1b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference (raw completion, so Gemma's chat turn markers are written by hand)
response = llm(
    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
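Alternatively, the chat API in `llama-cpp-python` can apply the model's chat template for you (recent versions read it from the GGUF metadata), so you don't have to write the turn markers yourself. A minimal sketch using the same local file:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-3-1b-it-q4_k_m.gguf", n_ctx=2048, n_threads=8)

# The chat API formats the conversation with the model's own template.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain GGUF quantization to a beginner"}
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```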

## Performance Tips

1. **Hardware Utilization**:
   - Set the thread count with `-t` (typically your CPU core count)
   - Compile with CUDA/OpenCL support to use the GPU

2. **Memory Optimization**:
   - Lower-bit quantizations (like q4_k_m) use less RAM
   - Adjust the context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for consistent results

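The same knobs are exposed as parameters in `llama-cpp-python`; a small sketch (the values here are illustrative defaults, not tuned recommendations):

```python
from llama_cpp import Llama

# CLI flag -> Python parameter:  -t -> n_threads,  -c -> n_ctx,
# GPU-enabled build -> n_gpu_layers (0 keeps everything on the CPU).
llm = Llama(
    model_path="gemma-3-1b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0,
)

# temperature=0 makes the output deterministic, like --temp 0 on the CLI.
out = llm("Question: What is the GGUF format?\nAnswer:", max_tokens=128, temperature=0.0)
print(out["choices"][0]["text"])
```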
## FAQ

**Q: What quantization levels are available?**
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0.

**Q: How much performance loss occurs with q4_k_m?**
A: Typically a 2-5% accuracy reduction, for a file roughly 4x smaller than FP16.

**Q: How do I enable GPU support?**
A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs (newer llama.cpp releases use the CMake flag `-DGGML_CUDA=ON` instead).

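To see exactly which quantization levels this repository ships, you can list its files with `huggingface_hub`; a small sketch:

```python
from huggingface_hub import list_repo_files

# Print every GGUF variant published in this repository.
for name in list_repo_files("matrixportal/gemma-3-1b-it-GGUF"):
    if name.endswith(".gguf"):
        print(name)
```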
## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)