C4AI Command A - Quantized Models
This repository contains quantized versions of the C4AI Command A model, an open weights research release by Cohere and Cohere For AI. The original model is a 111 billion parameter language model optimized for enterprise use cases, excelling in agentic, multilingual, and retrieval-augmented generation (RAG) tasks while being deployable on minimal hardware (e.g., two GPUs). Here, we provide multiple quantized variants to further reduce memory footprint and enhance deployment flexibility across various hardware setups, including multi-GPU environments.
For details on the original model, refer to the official model card below.
Quantized Models
We have quantized the original CohereForAI/c4ai-command-a-03-2025 model using the bitsandbytes library with various configurations to balance performance, memory efficiency, and accuracy. Below are the available quantized versions:
Quantization Type | Description | Compute Dtype | Double Quantization | Notes |
---|---|---|---|---|
4bit_nf4_double | 4-bit quantization with nf4 (4-bit NormalFloat) | bfloat16 | Yes | High precision with the smallest 4-bit memory footprint |
4bit_fp4 | 4-bit quantization with fp4 (4-bit floating point) | bfloat16 | No | Lightweight, slightly less precise |
8bit_standard | Standard 8-bit quantization | bfloat16 | N/A | Balanced memory and accuracy |
8bit_mixed | 8-bit quantization with mixed precision and CPU offloading capability | float16 | N/A | Flexible for constrained environments |
4bit_nf4_no_double | 4-bit quantization with nf4, no double quantization | bfloat16 | No | Slightly larger footprint than 4bit_nf4_double |
These models are optimized for multi-GPU deployment using the accelerate library, ensuring efficient distribution across available GPUs. Each variant is hosted in its own sub-repository:
Tonic/c4ai-command-a-03-2025-4bit_nf4_double
Tonic/c4ai-command-a-03-2025-4bit_fp4
Tonic/c4ai-command-a-03-2025-8bit_standard
Tonic/c4ai-command-a-03-2025-8bit_mixed
Tonic/c4ai-command-a-03-2025-4bit_nf4_no_double
Usage
To use a quantized model, install the required dependencies and load the desired variant as shown below. Multi-GPU support is enabled via accelerate.
Installation
pip install transformers bitsandbytes accelerate torch huggingface_hub
Example: Loading and Generating Text
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Specify the quantized model ID
model_id = "Tonic/c4ai-command-a-03-2025-4bit_nf4_double"  # Replace with desired variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" relies on accelerate to shard the model across the available GPUs,
# so a separate Accelerator().prepare() call is not needed for inference
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
# Format message with chat template and move the inputs to the model's first device
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Generate text
gen_tokens = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
)
gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
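When a model is loaded with device_map="auto", transformers records the resulting placement in the model's hf_device_map attribute, a dict from module name to device. The sketch below summarizes such a mapping to check how the modules were spread out. The helper name summarize_device_map and the example mapping are illustrative assumptions, not part of this repository.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules were placed on each device.

    `device_map` maps module names to a device (GPU index, "cpu", or "disk"),
    which is the shape of `model.hf_device_map` after loading with device_map="auto".
    """
    return Counter(str(device) for device in device_map.values())


# Hypothetical placement for illustration; a real one comes from model.hf_device_map.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": "cpu",
}

for device, count in sorted(summarize_device_map(example_map).items()):
    print(f"device {device}: {count} module(s)")
```

With a loaded model, calling summarize_device_map(model.hf_device_map) gives the same kind of summary; any entries on "cpu" or "disk" indicate offloading, which usually slows generation.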
Notes
- Device Mapping: device_map="auto" distributes the model across all available GPUs.
- Compute Dtype: Adjust torch_dtype (e.g., torch.bfloat16 or torch.float16) based on your hardware and the quantization type.
- Memory: Quantized models significantly reduce VRAM requirements compared to the original 111B parameter model. Even at 4 bits, the weights alone occupy roughly 55 GB or more, so expect to need several GPUs or CPU offloading rather than a single consumer-grade card.
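As a rough guide to the memory point above, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every parameter is quantized, a block size of 64 for 4-bit quantization, and groups of 256 scales for double quantization (both assumed defaults), and it ignores activations and the KV cache.

```python
PARAMS = 111e9  # parameter count stated in the model card


def bits_per_param_4bit(double_quant, block_size=64, nested_block_size=256):
    """Approximate storage cost per weight for block-wise 4-bit quantization."""
    if double_quant:
        # 8-bit scale per block, plus one fp32 constant per group of scales.
        overhead = 8 / block_size + 32 / (block_size * nested_block_size)
    else:
        # One fp32 scale (absmax) per block of weights.
        overhead = 32 / block_size
    return 4 + overhead


def weight_gb(bits_per_param, params=PARAMS):
    """Weight storage in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


estimates = {
    "bf16 (unquantized)": weight_gb(16),
    "8-bit": weight_gb(8),
    "4-bit, no double quantization": weight_gb(bits_per_param_4bit(False)),
    "4-bit, double quantization": weight_gb(bits_per_param_4bit(True)),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Under these assumptions the estimates come out at about 222, 111, 62, and 57 GB respectively. Real usage will differ, since some layers are typically kept in higher precision and inference needs additional working memory.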
Quantization Details
The quantization process leverages bitsandbytes with the following configurations:
- 4-bit Variants: Use nf4 or fp4 quantization types, with optional double quantization, which also quantizes the quantization constants to save additional memory.
- 8-bit Variants: Offer standard or mixed precision options, with the latter supporting CPU offloading for additional flexibility.
- Multi-GPU Optimization: The accelerate library handles model sharding and distribution, allowing deployment on systems with multiple GPUs.
For the exact quantization script, see this Gist.
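For orientation, the variants in the table of quantized versions could map onto transformers' BitsAndBytesConfig roughly as follows. This is a sketch inferred from the table, not the author's script: the VARIANT_KWARGS name and the build_config helper are hypothetical, and the 8-bit compute dtypes listed in the table are assumed to be passed separately as torch_dtype when loading.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, one entry per variant.
# Inferred from the variant table; NOT taken from the original quantization script.
VARIANT_KWARGS = {
    "4bit_nf4_double": dict(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="bfloat16",
        bnb_4bit_use_double_quant=True,
    ),
    "4bit_fp4": dict(
        load_in_4bit=True,
        bnb_4bit_quant_type="fp4",
        bnb_4bit_compute_dtype="bfloat16",
        bnb_4bit_use_double_quant=False,
    ),
    "4bit_nf4_no_double": dict(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="bfloat16",
        bnb_4bit_use_double_quant=False,
    ),
    # 8-bit loading has no compute-dtype argument in BitsAndBytesConfig.
    "8bit_standard": dict(load_in_8bit=True),
    # Assumption: "mixed" means fp32 modules may be offloaded to the CPU.
    "8bit_mixed": dict(load_in_8bit=True, llm_int8_enable_fp32_cpu_offload=True),
}


def build_config(variant):
    """Build a BitsAndBytesConfig for a variant (requires transformers and torch)."""
    from transformers import BitsAndBytesConfig  # imported lazily: heavy dependency

    return BitsAndBytesConfig(**VARIANT_KWARGS[variant])
```

Such a config would be passed as quantization_config to AutoModelForCausalLM.from_pretrained when quantizing from the base model. The pre-quantized repositories already store their quantization settings, so this step should not be needed when loading them.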
Model Card for C4AI Command A
Below is the original model card for C4AI Command A, adapted for this repository.
Model Summary
C4AI Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models, Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs.
- Developed by: Cohere and Cohere For AI
- Point of Contact: Cohere For AI: cohere.for.ai
- License: CC-BY-NC, requires adhering to C4AI's Acceptable Use Policy
- Model: c4ai-command-a-03-2025
- Model Size: 111 billion parameters
- Context Length: 256K
Try C4AI Command A
You can try the original model before downloading weights in the hosted Hugging Face Space.
Model Details
- Input: Text only
- Output: Text only
- Model Architecture: Auto-regressive language model with an optimized transformer architecture, featuring sliding window attention (window size 4096) with RoPE, and a global attention layer without positional embeddings.
- Languages: Supports 23 languages including English, French, Spanish, German, Japanese, Chinese, Arabic, and more (see full list in the original model card).
- Context Length: 256K
Chat Capabilities
Command A is configured as a conversational model by default with two safety modes: contextual (default, fewer constraints) and strict (avoids sensitive topics). See Command A prompt format docs for details.
RAG Capabilities
Command A excels in Retrieval Augmented Generation (RAG) tasks. Use the apply_chat_template method with document snippets for RAG functionality. Example:
conversation = [{"role": "user", "content": "What has Man always dreamed of?"}]
documents = [
    {"heading": "The Moon", "body": "Man has always dreamed of destroying the moon..."},
    {"heading": "Love", "body": "Man's dream has always been to find love..."},
]
input_ids = tokenizer.apply_chat_template(conversation, documents=documents, tokenize=True, add_generation_prompt=True, return_tensors="pt")
Tool Use Capabilities
Command A supports conversational tool use with JSON schema-based tool descriptions. See the tool use example in the original model card for implementation details.
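To make the JSON schema-based description concrete, the sketch below defines one tool and renders it into a prompt through the tools argument of apply_chat_template. The tool itself (query_daily_sales_report and its day parameter) and the build_tool_prompt helper are hypothetical examples; the original model card remains the reference for the exact format the model expects.

```python
# Hypothetical tool described with a JSON schema for its parameters.
query_sales_tool = {
    "type": "function",
    "function": {
        "name": "query_daily_sales_report",
        "description": "Retrieve the sales report for a given day.",
        "parameters": {
            "type": "object",
            "properties": {
                "day": {
                    "type": "string",
                    "description": "Date of the report in YYYY-MM-DD format.",
                },
            },
            "required": ["day"],
        },
    },
}


def build_tool_prompt(tokenizer, conversation, tools):
    """Render a conversation plus tool descriptions into model input IDs."""
    return tokenizer.apply_chat_template(
        conversation,
        tools=tools,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )
```

With the tokenizer from the usage example, build_tool_prompt(tokenizer, [{"role": "user", "content": "..."}], [query_sales_tool]) would produce the input IDs to pass to model.generate. Parsing the model's tool call and returning the tool result are separate steps not shown here.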
Code Capabilities
The model performs well on enterprise-relevant code tasks (e.g., SQL generation, code translation). Use low temperature or greedy decoding for optimal code generation.
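The decoding advice above can be expressed as keyword arguments for model.generate. This is a minimal sketch: the code_generation_kwargs helper is hypothetical, and the default temperature of 0.1 and the 256-token limit are illustrative values rather than recommendations from the model card.

```python
def code_generation_kwargs(greedy=True, temperature=0.1, max_new_tokens=256):
    """Return generate() settings suited to code: greedy or low-temperature sampling."""
    if greedy:
        # do_sample=False selects the most likely token at each step.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    # Sampling with a low temperature keeps outputs close to the most likely tokens.
    return {"do_sample": True, "temperature": temperature, "max_new_tokens": max_new_tokens}


print(code_generation_kwargs())
print(code_generation_kwargs(greedy=False))
```

The result can be unpacked into a call such as model.generate(input_ids, **code_generation_kwargs()).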
Terms of Use
This model is released under a CC-BY-NC license for non-commercial use only, adhering to C4AI's Acceptable Use Policy. For commercial inquiries, contact Cohere’s Sales team.
Contact
For issues or questions, reach out to [email protected].