# Llama-3.2-1B-Instruct Q4_0 Quantized Model
This is a Q4_0 quantized version of the `meta-llama/Llama-3.2-1B-Instruct` model, converted to GGUF format and optimized for efficient inference on resource-constrained devices. It was created using `llama.cpp` in Google Colab, following a workflow inspired by [bartowski](https://huggingface.co/bartowski).
## Model Details
- **Base Model**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- **Quantization**: Q4_0 (4-bit quantization)
- **Format**: GGUF
- **Size**: ~0.7–1.0 GB (a rough size estimate is sketched after this list)
- **Llama.cpp Version**: Recent commit (July 2025 or later)
- **License**: Inherits the license from `meta-llama/Llama-3.2-1B-Instruct` (Meta AI license, see original repository for details)
- **Hardware Optimization**: Supports online repacking for ARM and AVX CPU inference (e.g., Snapdragon, AMD Zen5, Intel AVX2)
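As a rough check on the size figure above: Q4_0 stores each block of 32 weights as sixteen bytes of 4-bit values plus one fp16 scale, i.e. 18 bytes per block (about 4.5 bits per weight). The parameter count used below (~1.24B) is a commonly cited figure for Llama-3.2-1B and is an assumption, not something stated on this card:

```bash
# Back-of-the-envelope Q4_0 size estimate.
# Assumption: ~1.24e9 parameters for Llama-3.2-1B (not taken from this card).
# Q4_0 block: 32 weights -> 16 bytes of 4-bit values + 2-byte fp16 scale = 18 bytes.
python3 -c "print(f'{1.24e9 * 18 / 32 / 1e9:.2f} GB')"   # ~0.70 GB of weight data
```

Tensors kept at higher precision (e.g., the token embeddings and output layer) push the actual file size toward the upper end of the stated range.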
## Usage
Run the model with `llama.cpp` for command-line inference:
```bash
./llama-cli -m Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "Hello, world!" --no-interactive
```
Alternatively, use [LM Studio](https://lmstudio.ai) for a user-friendly interface:
1. Download the GGUF file from this repository.
2. Load it in LM Studio.
3. Enter prompts to interact with the model.
The GGUF file is also compatible with other `llama.cpp`-based projects such as Ollama, and with GGUF-aware runtimes more generally (e.g., MLX's GGUF loader).
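For example, Ollama can import a local GGUF file through a Modelfile. The model name `llama32-1b-q4` below is arbitrary, and the commands assume Ollama is already installed:

```bash
# Import the local GGUF into Ollama (the name "llama32-1b-q4" is arbitrary)
cat > Modelfile <<'EOF'
FROM ./Llama-3.2-1B-Instruct-Q4_0.gguf
EOF
ollama create llama32-1b-q4 -f Modelfile
ollama run llama32-1b-q4 "Hello, world!"
```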
## Creation Process
This model was created in Google Colab with the following steps (a command sketch follows the list):
1. **Downloaded the Base Model**: Retrieved `meta-llama/Llama-3.2-1B-Instruct` from Hugging Face using `huggingface-cli`.
2. **Converted to GGUF**: Used `llama.cpp`'s `convert_hf_to_gguf.py` to convert the model to GGUF format (`Llama-3.2-1B-Instruct-f16.gguf`).
3. **Quantized to Q4_0**: Applied Q4_0 quantization using `llama-quantize` from `llama.cpp`.
4. **Tested**: Verified functionality with `llama-cli` using the prompt "Hello, world!" in non-interactive mode.
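A hedged sketch of the commands behind these steps, assuming `llama.cpp` has been cloned and built with CMake and that you have accepted the Meta license for the gated repository; paths and build locations may differ from the original Colab session:

```bash
# 1. Download the base model (requires a Hugging Face token with access to the gated repo)
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir Llama-3.2-1B-Instruct

# 2. Convert the Hugging Face checkpoint to a 16-bit GGUF
python llama.cpp/convert_hf_to_gguf.py Llama-3.2-1B-Instruct \
  --outfile Llama-3.2-1B-Instruct-f16.gguf --outtype f16

# 3. Quantize the 16-bit GGUF to Q4_0
./llama.cpp/build/bin/llama-quantize \
  Llama-3.2-1B-Instruct-f16.gguf Llama-3.2-1B-Instruct-Q4_0.gguf Q4_0

# 4. Smoke-test the quantized model (the original card ran this in non-interactive mode)
./llama.cpp/build/bin/llama-cli -m Llama-3.2-1B-Instruct-Q4_0.gguf \
  --prompt "Hello, world!" -n 64
```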
## Performance
- The Q4_0 quantization reduces the model size to ~0.7–1.0 GB, enabling fast inference on CPUs and low-memory devices (a quick throughput check is sketched after this list).
- Online repacking optimizes performance for ARM (e.g., mobile devices) and AVX CPUs (e.g., modern laptops, servers).
- Accuracy may be slightly reduced compared to the original bfloat16 model due to 4-bit quantization. For higher quality, consider Q5_K_M or Q8_0 quantizations.
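As a rough way to measure throughput and the effect of online repacking on your own hardware, `llama.cpp` ships a `llama-bench` tool; the path below assumes a standard CMake build:

```bash
# Measure prompt-processing and token-generation throughput on the local CPU
./llama.cpp/build/bin/llama-bench -m Llama-3.2-1B-Instruct-Q4_0.gguf
```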
## Limitations
- **Accuracy Trade-off**: Q4_0 quantization may lead to minor accuracy loss compared to higher-precision formats (e.g., Q8_0 or bfloat16).
- **Software Requirements**: Requires a recent `llama.cpp` build or compatible software such as LM Studio for inference.
- **No Importance Matrix**: This quantization does not use an importance matrix (imatrix). For improved quality, an imatrix-calibrated version can be created with a calibration dataset such as WikiText-2 (see the sketch after this list).
- **License Restrictions**: The model inherits the Meta AI license, which may require approval for use (see [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)).
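A hedged sketch of producing an imatrix-calibrated quantization, assuming the f16 GGUF from the conversion step and a plain-text calibration file (`wikitext-2.txt` is a placeholder you would supply yourself):

```bash
# Compute an importance matrix from a calibration text file
./llama.cpp/build/bin/llama-imatrix -m Llama-3.2-1B-Instruct-f16.gguf \
  -f wikitext-2.txt -o imatrix.dat

# Re-quantize using the importance matrix
./llama.cpp/build/bin/llama-quantize --imatrix imatrix.dat \
  Llama-3.2-1B-Instruct-f16.gguf Llama-3.2-1B-Instruct-Q4_0-imat.gguf Q4_0
```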
## Acknowledgments
- **Meta AI**: For the original `Llama-3.2-1B-Instruct` model.
- **Bartowski**: For inspiration and guidance on GGUF quantization workflows (e.g., [bartowski/Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF)).
- **llama.cpp**: Georgi Gerganov and contributors, for the quantization and inference tools.
## Contact
For issues or feedback, please open a discussion on this repository or contact the maintainer on [Hugging Face](https://huggingface.co) or [X](https://x.com).
---
*Created in July 2025 by tanujdev.*