Instructions to use Tushe/AMINI-ASSISTANT-GGUF-F16 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Tushe/AMINI-ASSISTANT-GGUF-F16",
	filename="AMINI-f16.gguf",
)

response = llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "Translate 'Good morning' into Hausa, Igbo, and Yoruba."}
	]
)
print(response["choices"][0]["message"]["content"])

Local Apps

llama.cpp

How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16
# Run inference directly in the terminal:
llama-cli -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16
# Run inference directly in the terminal:
llama-cli -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16
# Run inference directly in the terminal:
./llama-cli -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16
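`llama-server` exposes an OpenAI-compatible HTTP API, by default on port 8080 (the same `http://localhost:8080/v1` base URL used in the Pi configuration further down). The sketch below calls its `/v1/chat/completions` endpoint using only the Python standard library. It assumes the server was started with one of the commands above on the default host and port; `build_chat_request` and `chat` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port (8080).
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "Tushe/AMINI-ASSISTANT-GGUF-F16:F16"


def build_chat_request(messages, max_tokens=256, temperature=0.7,
                       base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,  # id as configured for clients; a single-model server may not check it
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(messages, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat([{"role": "user", "content": "Translate 'Good morning' into Yoruba."}]))
```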

Use Docker

docker model run hf.co/Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Ollama
How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with Ollama:
```
ollama run hf.co/Tushe/AMINI-ASSISTANT-GGUF-F16:F16
```

Unsloth Studio

How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Tushe/AMINI-ASSISTANT-GGUF-F16 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Tushe/AMINI-ASSISTANT-GGUF-F16 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Tushe/AMINI-ASSISTANT-GGUF-F16 to start chatting

Pi

How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Tushe/AMINI-ASSISTANT-GGUF-F16:F16"
        }
      ]
    }
  }
}
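If `~/.pi/agent/models.json` already exists, the `llama-cpp` provider should be merged into it rather than overwriting any other providers. A minimal standard-library sketch, assuming the file layout shown above (a top-level `providers` object) and that an existing file is valid JSON; `add_llama_cpp_provider` is a hypothetical helper, not part of Pi.

```python
import json
from pathlib import Path

PROVIDER_NAME = "llama-cpp"
PROVIDER_CONFIG = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "Tushe/AMINI-ASSISTANT-GGUF-F16:F16"}],
}


def add_llama_cpp_provider(config_path=None):
    """Insert or replace the llama-cpp provider, keeping any other providers."""
    if config_path is None:
        config_path = Path.home() / ".pi" / "agent" / "models.json"
    config_path = Path(config_path)
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    # Only the llama-cpp entry is touched; other providers are left as they were.
    config.setdefault("providers", {})[PROVIDER_NAME] = PROVIDER_CONFIG
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```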

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Tushe/AMINI-ASSISTANT-GGUF-F16:F16
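Before starting Hermes (or Pi), it can help to confirm that the server is up and serving a model. `llama-server` answers `GET /v1/models` with an OpenAI-style list, and the sketch below extracts the ids from it. The id the server reports is not guaranteed to match the `model.default` value set above, so treat a mismatch as something to check rather than as an error. `parse_model_ids` and `list_served_models` are hypothetical helpers.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # same base URL given to Hermes above


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_served_models(base_url=BASE_URL, timeout=5):
    """Query the running server; raises urllib.error.URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# Usage (requires a running server):
# print(list_served_models())
```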

Run Hermes

hermes

Docker Model Runner
How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with Docker Model Runner:
```
docker model run hf.co/Tushe/AMINI-ASSISTANT-GGUF-F16:F16
```

Lemonade

How to use Tushe/AMINI-ASSISTANT-GGUF-F16 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Tushe/AMINI-ASSISTANT-GGUF-F16:F16

Run and chat with the model

lemonade run user.AMINI-ASSISTANT-GGUF-F16-F16

List all available models

lemonade list

🔩 AMINI F16 GGUF — Tushe Foundry Edge Inference Pack

Quantised & packaged by Tushe – The Foundry Research Team

🧭 What Is This?

We are Tushe – The Foundry Research Team, and we are building a bare-metal inference engine for African language AI on constrained hardware.

This repository is part of our open-source package that bakes the model into a bare-metal inference engine. This lightweight runtime is being released in three formats:

Format	Use case
🐍 Python library (`pip install tushe-bare-metal`)	Server, Raspberry Pi, edge Linux
📦 npm package (`npm install tushe-bare-metal`)	Node.js apps, Electron, React Native
⚙️ Compiled C executable	Bare-metal embedded, IoT, MCUs

Every developer can drop this into their app and run offline African language inference right away — no internet, no cloud, no GPU required.

Note: 💡 This is the F16 (unquantised 16-bit) version. It is best used for benchmarking, research, and producing further quantisations. For deployment on phones and edge devices, use the lighter Q4_K_M version (4.5 GB, runs on 8 GB RAM).

🎯 Why We Built This

Africa has some of the most resource-constrained connectivity environments in the world. Millions of people — rural doctors, farmers, teachers, students, traders, and tourists — need intelligent language tools but have no reliable internet access.

We took N-ATLaS, the Llama 3 8B model fine-tuned on Nigerian and African languages by Awarri Technologies in collaboration with NCAIR, and quantised and optimised it to run on low-resource hardware, including:

📱 Android & iOS phones
🌾 Edge IoT devices in agricultural fields
🏥 Offline clinical/medical support tools in rural clinics
🏫 Classrooms with no internet access
🧭 Portable translator devices for traders and tourists

🌍 Target Use Cases

Domain	Description
🏥 Rural & Edge Medical	Doctors and health workers in remote clinics — symptom triage, patient communication, drug info in local languages
🌾 Farmers Support	Modern and rural farmers — crop advice, weather interpretation, market prices, pest identification in Hausa, Igbo, Yoruba
🏫 Education	Teachers and students in areas without internet — explanations, tutoring, literacy support in local languages
🛒 Traders & Markets	Cross-language communication for traders and informal markets across Africa
✈️ Tourists	Real-time offline translation across African languages

🔩 About the Base Model

This GGUF is derived from NCAIR1/N-ATLaS, an open-source multilingual LLM built on Llama 3 8B and fine-tuned by Awarri Technologies in collaboration with the National Centre for Artificial Intelligence & Robotics (NCAIR) and the Federal Ministry of Communications, Innovation and Digital Economy of Nigeria.

N-ATLaS was trained on approximately 392 million multilingual tokens spanning English, Hausa, Igbo, and Yoruba.

We did not retrain or fine-tune the weights. We quantised the original model and built a highly optimised inference engine to enable edge deployment.

💾 This File

File	Quant	Size	Min RAM
`AMINI-F16.gguf`	F16	~16 GB	25-32 GB
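The Size column follows from simple arithmetic: weight storage is roughly parameter count × bits per weight ÷ 8. The sketch below reproduces the ~16 GB figure for F16 and adds a rough KV-cache estimate for a given context length. It assumes the Llama 3 8B layout (32 layers, 8 key/value heads, head dimension 128) and an F16 cache, and it ignores GGUF metadata, compute buffers, and OS overhead, all of which the Min RAM figure also has to cover.

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in GB: keys and values for every layer and token.

    Defaults assume the Llama 3 8B layout with an F16 cache.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_ctx * per_token_bytes / 1e9


print(f"F16 weights: ~{weights_gb(8e9, 16):.1f} GB")          # F16 weights: ~16.0 GB
print(f"KV cache at n_ctx=2048: ~{kv_cache_gb(2048):.2f} GB")  # KV cache at n_ctx=2048: ~0.27 GB
```

Halving `n_ctx` halves the KV cache, which is why the edge settings below suggest smaller context sizes on low-RAM devices.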

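F16 is not the recommended quant for edge deployment: the file alone is ~16 GB, far more than a phone with 8 GB RAM can hold. Use F16 for benchmarking, research, and producing further quantisations, and the lighter Q4_K_M version for phones and other edge devices.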

🚀 Inference Examples

1. Python — `llama-cpp-python`

pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id      = "Tushe/AMINI-ASSISTANT-GGUF-F16",
    filename     = "*.gguf",
    n_ctx        = 2048,
    n_gpu_layers = 0,      # 0 = CPU only (edge/offline), -1 = full GPU
    verbose      = False,
)

English — rural medical support:

output = llm(
    "A patient presents with fever, headache, and joint pain for 3 days. What are the possible diagnoses and first-line management?",
    max_tokens  = 256,
    temperature = 0.7,
    echo        = False,
)
print(output["choices"][0]["text"])

Hausa — farmer support:

# Prompt (English): "My grain farm has many pests. What can I do to protect my crops?"
output = llm(
    "Gonar hatsi na da kwari da yawa. Menene zan iya yi don kare amfanin gona na?",
    max_tokens  = 256,
    temperature = 0.7,
    echo        = False,
)
print(output["choices"][0]["text"])

Yoruba — student support:

# Prompt (English): "Explain what photosynthesis is in Yoruba for a school student."
output = llm(
    "Ṣe alaye ohun ti photosynthesis jẹ ni ede Yoruba fun ọmọ ile-iwe.",
    max_tokens  = 256,
    temperature = 0.7,
    echo        = False,
)
print(output["choices"][0]["text"])

Igbo — trader/market support:

# Prompt (English): "Tell me the current price of maize in Onitsha market."
output = llm(
    "Gwa m ọnụ ahịa nke ọka ugbu a n'ahịa Onitsha.",
    max_tokens  = 256,
    temperature = 0.7,
    echo        = False,
)
print(output["choices"][0]["text"])
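The four calls above differ only in the prompt, so they can be driven from a single list. A small sketch, assuming `llm` is the `Llama` instance loaded at the start of this section; `run_prompts` is a hypothetical helper, not part of `llama-cpp-python`.

```python
def run_prompts(llm, prompts, max_tokens=256, temperature=0.7):
    """Run each (label, prompt) pair through a llama-cpp-python completion call."""
    results = {}
    for label, prompt in prompts:
        output = llm(prompt, max_tokens=max_tokens, temperature=temperature, echo=False)
        results[label] = output["choices"][0]["text"]
    return results


PROMPTS = [
    ("English, medical", "A patient presents with fever, headache, and joint pain for 3 days. "
                         "What are the possible diagnoses and first-line management?"),
    ("Hausa, farming", "Gonar hatsi na da kwari da yawa. Menene zan iya yi don kare amfanin gona na?"),
    ("Yoruba, education", "Ṣe alaye ohun ti photosynthesis jẹ ni ede Yoruba fun ọmọ ile-iwe."),
    ("Igbo, market", "Gwa m ọnụ ahịa nke ọka ugbu a n'ahịa Onitsha."),
]

# Usage, with `llm` loaded as shown above:
# for label, text in run_prompts(llm, PROMPTS).items():
#     print(f"--- {label} ---\n{text}")
```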

2. Chat format — multilingual instruction

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id      = "Tushe/AMINI-ASSISTANT-GGUF-F16",
    filename     = "*.gguf",
    n_ctx        = 2048,
    n_gpu_layers = 0,
    verbose      = False,
)

response = llm.create_chat_completion(
    messages = [
        {
            "role": "system",
            "content": (
                "You are an offline African language assistant running on a local device. "
                "You support English, Hausa, Igbo, and Yoruba. "
                "Respond in the same language the user writes in. "
                "Be concise — this device has limited resources."
            )
        },
        {
            "role": "user",
            "content": "Translate 'The child has a high fever and needs immediate care' into Hausa and Yoruba."
        }
    ],
    max_tokens  = 256,
    temperature = 0.7,
)
print(response["choices"][0]["message"]["content"])
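`create_chat_completion` does not remember earlier calls: each call sees only the `messages` you pass. To hold a multi-turn conversation, keep the history yourself and append each reply to it. A sketch, assuming `llm` is loaded as above; `chat_turn` is a hypothetical helper.

```python
def chat_turn(llm, history, user_text, max_tokens=256, temperature=0.7):
    """Append a user message, get the model's reply, and record it in `history`."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(
        messages=history, max_tokens=max_tokens, temperature=temperature
    )
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage, with `llm` loaded as above:
# history = [{"role": "system", "content": "Respond in the same language the user writes in."}]
# print(chat_turn(llm, history, "Sannu! Yaya aiki?"))
# print(chat_turn(llm, history, "Now say the same thing in Yoruba."))
```

Because the full history is resent on every turn, long conversations will eventually exceed `n_ctx`; on constrained devices you may want to drop the oldest turns.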

3. Streaming (for responsive UIs on edge devices)

stream = llm.create_chat_completion(
    messages = [
        {"role": "user", "content": "Explain crop rotation to a farmer in Hausa."}
    ],
    max_tokens = 256,
    temperature = 0.7,
    stream     = True,
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
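In a UI you usually need the full reply as well as the incremental tokens, for example to store it in the chat history. A minimal sketch that forwards each delta as it arrives and returns the joined text. It assumes the OpenAI-style chunk shape used in the loop above, where the first chunk may carry only a `role` and the last may carry an empty `delta`; `collect_stream` is a hypothetical helper.

```python
def collect_stream(stream, on_token=None):
    """Consume streamed chat chunks, forwarding each text delta, and return the full text."""
    parts = []
    for chunk in stream:
        # Role-only and final chunks have no "content"; treat them as empty.
        delta = chunk["choices"][0]["delta"].get("content") or ""
        if delta:
            parts.append(delta)
            if on_token is not None:
                on_token(delta)
    return "".join(parts)


# Usage with a freshly created `stream` (a stream can only be consumed once):
# reply = collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
# print()
```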

4. Node.js — `node-llama-cpp`

npm install node-llama-cpp

// Save as an ES module (e.g. amini.mjs): the top-level await below requires it.
import { getLlama, LlamaChatSession } from "node-llama-cpp";
import path from "path";

const llama   = await getLlama();
const model   = await llama.loadModel({ modelPath: path.join("models", "AMINI-F16.gguf") });
const context = await model.createContext({ contextSize: 2048 });
const session = new LlamaChatSession({ contextSequence: context.getSequence() });

const response = await session.prompt(
    "A farmer asks: my tomatoes are wilting despite regular watering. What could be wrong?",
    { maxTokens: 256 }
);
console.log(response);

5. llama.cpp CLI (bare-metal / embedded)

# Download
huggingface-cli download Tushe/AMINI-ASSISTANT-GGUF-F16 \
    AMINI-F16.gguf --local-dir ./models/

# Run on CPU only (edge device)
./llama-cli -m ./models/AMINI-F16.gguf \
    --ctx-size 2048 \
    --threads 4 \
    --temp 0.7 \
    -i -r "User:" \
    -p "You are an offline assistant for African languages. Respond in the user's language.\nUser:"
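When an edge application launches `llama-cli` itself, for example from a Python supervisor process, building the argument list in code keeps the settings in one place. This sketch uses only the flags from the command above; the binary and model paths are assumptions to adjust for your device, and `build_llama_cli_args` / `run_llama_cli` are hypothetical helpers.

```python
import os
import subprocess


def build_llama_cli_args(model_path="./models/AMINI-F16.gguf", binary="./llama-cli",
                         ctx_size=2048, threads=None, temp=0.7,
                         reverse_prompt="User:",
                         prompt="You are an offline assistant for African languages. "
                                "Respond in the user's language.\nUser:"):
    """Return the argv list for an interactive, CPU-only llama-cli session."""
    if threads is None:
        threads = os.cpu_count() or 4  # fall back to 4 if the core count is unknown
    return [
        binary, "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
        "--temp", str(temp),
        "-i", "-r", reverse_prompt,
        "-p", prompt,  # a real newline here; no shell escaping is involved
    ]


def run_llama_cli(**kwargs):
    """Start llama-cli attached to this terminal and return its exit code."""
    return subprocess.run(build_llama_cli_args(**kwargs), check=False).returncode
```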

6. Ollama (local server mode)

ollama run hf.co/Tushe/AMINI-ASSISTANT-GGUF-F16:F16
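Once the model has been pulled, Ollama also serves it over a local REST API, by default at `http://localhost:11434`. A standard-library sketch of a non-streaming call to `/api/chat`. The model name is assumed to match the one passed to `ollama run`; `build_ollama_chat_request` and `ollama_chat` are hypothetical helpers.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address
MODEL = "hf.co/Tushe/AMINI-ASSISTANT-GGUF-F16:F16"  # assumed to match `ollama run`


def build_ollama_chat_request(messages, temperature=0.7, num_predict=256, model=MODEL):
    """Build a non-streaming /api/chat request."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,  # return one JSON object instead of a stream of lines
        "options": {"temperature": temperature, "num_predict": num_predict},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_ollama_chat_request(messages, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)["message"]["content"]
```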

⚙️ Recommended Settings for Edge Devices

Parameter	Value	Notes
`n_ctx`	512–1024	Reduce on very low RAM devices
`n_gpu_layers`	0	CPU-only for phones/IoT
`n_threads`	4	Match your device's core count
`temperature`	0.7	Balanced responses
`max_tokens`	128–256	Keep short for low-latency UX
`repeat_penalty`	1.1	Reduces repetitive loops in the output
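The table maps onto two groups of `llama-cpp-python` arguments: load-time options passed to `Llama(...)` (`n_ctx`, `n_gpu_layers`, `n_threads`) and per-request options passed to the completion call (`temperature`, `max_tokens`, `repeat_penalty`). A sketch that keeps both groups together; the specific values picked within the table's ranges, the model path, and the `load_edge_model` helper are illustrative assumptions.

```python
import os

# Load-time settings, passed to Llama(...); values picked from the table's ranges.
LOAD_SETTINGS = {
    "n_ctx": 1024,                        # 512-1024; lower it on very low-RAM devices
    "n_gpu_layers": 0,                    # CPU only
    "n_threads": os.cpu_count() or 4,     # match the device's core count; 4 if unknown
    "verbose": False,
}

# Per-request settings, passed to llm(...) or llm.create_chat_completion(...).
GENERATION_SETTINGS = {
    "temperature": 0.7,
    "max_tokens": 128,                    # 128-256 keeps latency low
    "repeat_penalty": 1.1,
}


def load_edge_model(model_path):
    """Load a local GGUF file with the edge settings (imports llama_cpp lazily)."""
    from llama_cpp import Llama
    return Llama(model_path=model_path, **LOAD_SETTINGS)


# Usage (path is an assumption):
# llm = load_edge_model("./models/AMINI-F16.gguf")
# out = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Sannu!"}], **GENERATION_SETTINGS)
# print(out["choices"][0]["message"]["content"])
```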

🔗 Related Repositories

Repo	Description
NCAIR1/N-ATLaS	Original model by Awarri / NCAIR
AlaminI/AMINI-ASSISTANT-GGUF-F16	F16 GGUF (full precision, for re-quantising)

⚠️ License

This GGUF quantisation is an independent contribution by Tushe – The Foundry Research Team. We encourage developers to refer to the N-ATLaS licence. Our inference engine, however, can be used for any purpose, including commercial use, with no limit on the number of users. We will also release models trained by us, to give developers fully open-source models and inference at the edge.


Format: GGUF
Model size: 8B params
Architecture: llama
Hardware compatibility: 16-bit


Model tree for Tushe/AMINI-ASSISTANT-GGUF-F16

Base model: NCAIR1/N-ATLaS
Quantized: 4 models, including this model