# Llama-3.2-1B-Instruct Q4_0 Quantized Model
This is a Q4_0 quantized version of the `meta-llama/Llama-3.2-1B-Instruct` model, converted to GGUF format and optimized for efficient inference on resource-constrained devices. It was created using `llama.cpp` in Google Colab, following a workflow inspired by [bartowski](https://huggingface.co/bartowski).
## Model Details
- **Base Model**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- **Quantization**: Q4_0 (4-bit quantization)
- **Format**: GGUF
- **Size**: ~0.7–1.0 GB (a rough size estimate is sketched after this list)
- **Llama.cpp Version**: Recent commit (July 2025 or later)
- **License**: Inherits the license from `meta-llama/Llama-3.2-1B-Instruct` (Meta AI license, see original repository for details)
- **Hardware Optimization**: Supports online repacking for ARM and AVX CPU inference (e.g., Snapdragon, AMD Zen5, Intel AVX2)
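As a rough check on the size figure above: Q4_0 stores each block of 32 weights as sixteen bytes of 4-bit values plus one fp16 scale, i.e. 18 bytes per block (about 4.5 bits per weight). The parameter count used below (~1.24B) is a commonly cited figure for Llama-3.2-1B and is an assumption, not something stated on this card:

```bash
# Back-of-the-envelope Q4_0 size estimate.
# Assumption: ~1.24e9 parameters for Llama-3.2-1B (not taken from this card).
# Q4_0 block: 32 weights -> 16 bytes of 4-bit values + 2-byte fp16 scale = 18 bytes.
python3 -c "print(f'{1.24e9 * 18 / 32 / 1e9:.2f} GB')"   # ~0.70 GB of weight data
```

Tensors kept at higher precision (e.g., the token embeddings and output layer) push the actual file size toward the upper end of the stated range.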
## Usage
Run the model with `llama.cpp` for command-line inference:
```bash
./llama-cli -m Llama-3.2-1B-Instruct-Q4_0.gguf --prompt "Hello, world!" --no-interactive
```
Alternatively, use [LM Studio](https://lmstudio.ai) for a user-friendly interface:
1. Download the GGUF file from this repository.
2. Load it in LM Studio.
3. Enter prompts to interact with the model.
The GGUF file is also compatible with other `llama.cpp`-based projects such as Ollama, and with GGUF-aware runtimes more generally (e.g., MLX's GGUF loader).
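For example, Ollama can import a local GGUF file through a Modelfile. The model name `llama32-1b-q4` below is arbitrary, and the commands assume Ollama is already installed:

```bash
# Import the local GGUF into Ollama (the name "llama32-1b-q4" is arbitrary)
cat > Modelfile <<'EOF'
FROM ./Llama-3.2-1B-Instruct-Q4_0.gguf
EOF
ollama create llama32-1b-q4 -f Modelfile
ollama run llama32-1b-q4 "Hello, world!"
```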
## Creation Process
This model was created in Google Colab with the following steps (a command sketch follows the list):
1. **Downloaded the Base Model**: Retrieved `meta-llama/Llama-3.2-1B-Instruct` from Hugging Face using `huggingface-cli`.
2. **Converted to GGUF**: Used `llama.cpp`'s `convert_hf_to_gguf.py` to convert the model to GGUF format (`Llama-3.2-1B-Instruct-f16.gguf`).
3. **Quantized to Q4_0**: Applied Q4_0 quantization using `llama-quantize` from `llama.cpp`.
4. **Tested**: Verified functionality with `llama-cli` using the prompt "Hello, world!" in non-interactive mode.
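A hedged sketch of the commands behind these steps, assuming `llama.cpp` has been cloned and built with CMake and that you have accepted the Meta license for the gated repository; paths and build locations may differ from the original Colab session:

```bash
# 1. Download the base model (requires a Hugging Face token with access to the gated repo)
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir Llama-3.2-1B-Instruct

# 2. Convert the Hugging Face checkpoint to a 16-bit GGUF
python llama.cpp/convert_hf_to_gguf.py Llama-3.2-1B-Instruct \
  --outfile Llama-3.2-1B-Instruct-f16.gguf --outtype f16

# 3. Quantize the 16-bit GGUF to Q4_0
./llama.cpp/build/bin/llama-quantize \
  Llama-3.2-1B-Instruct-f16.gguf Llama-3.2-1B-Instruct-Q4_0.gguf Q4_0

# 4. Smoke-test the quantized model (the original card ran this in non-interactive mode)
./llama.cpp/build/bin/llama-cli -m Llama-3.2-1B-Instruct-Q4_0.gguf \
  --prompt "Hello, world!" -n 64
```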
## Performance
- The Q4_0 quantization reduces the model size to ~0.7–1.0 GB, enabling fast inference on CPUs and low-memory devices (a quick throughput check is sketched after this list).
- Online repacking optimizes performance for ARM (e.g., mobile devices) and AVX CPUs (e.g., modern laptops, servers).
- Accuracy may be slightly reduced compared to the original bfloat16 model due to 4-bit quantization. For higher quality, consider Q5_K_M or Q8_0 quantizations.
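As a rough way to measure throughput and the effect of online repacking on your own hardware, `llama.cpp` ships a `llama-bench` tool; the path below assumes a standard CMake build:

```bash
# Measure prompt-processing and token-generation throughput on the local CPU
./llama.cpp/build/bin/llama-bench -m Llama-3.2-1B-Instruct-Q4_0.gguf
```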
## Limitations
- **Accuracy Trade-off**: Q4_0 quantization may lead to minor accuracy loss compared to higher-precision formats (e.g., Q8_0 or bfloat16).
- **Software Requirements**: Requires a recent `llama.cpp` build or compatible software such as LM Studio for inference.
- **No Importance Matrix**: This quantization does not use an importance matrix (imatrix). For improved quality, an imatrix-calibrated version can be created with a calibration dataset such as WikiText-2 (see the sketch after this list).
- **License Restrictions**: The model inherits the Meta AI license, which may require approval for use (see [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)).
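A hedged sketch of producing an imatrix-calibrated quantization, assuming the f16 GGUF from the conversion step and a plain-text calibration file (`wikitext-2.txt` is a placeholder you would supply yourself):

```bash
# Compute an importance matrix from a calibration text file
./llama.cpp/build/bin/llama-imatrix -m Llama-3.2-1B-Instruct-f16.gguf \
  -f wikitext-2.txt -o imatrix.dat

# Re-quantize using the importance matrix
./llama.cpp/build/bin/llama-quantize --imatrix imatrix.dat \
  Llama-3.2-1B-Instruct-f16.gguf Llama-3.2-1B-Instruct-Q4_0-imat.gguf Q4_0
```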
## Acknowledgments
- **Meta AI**: For the original `Llama-3.2-1B-Instruct` model.
- **Bartowski**: For inspiration and guidance on GGUF quantization workflows (e.g., [bartowski/Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF)).
- **llama.cpp**: Georgi Gerganov and contributors, for the quantization and inference tools.
## Contact
For issues or feedback, please open a discussion on this repository or contact the maintainer on [Hugging Face](https://huggingface.co) or [X](https://x.com).
---
*Created in July 2025 by tanujdev.*