Instructions for using airev-ai/gemma-4-e2b-ondemand with libraries, inference providers, notebooks, and local apps.

Libraries

How to use airev-ai/gemma-4-e2b-ondemand with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

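# The messages below mix image and text content and are passed as `text=`,
# which matches the "image-text-to-text" pipeline rather than "text-generation".
pipe = pipeline("image-text-to-text", model="airev-ai/gemma-4-e2b-ondemand")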
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("airev-ai/gemma-4-e2b-ondemand")
model = AutoModelForImageTextToText.from_pretrained("airev-ai/gemma-4-e2b-ondemand")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use airev-ai/gemma-4-e2b-ondemand with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "airev-ai/gemma-4-e2b-ondemand"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "airev-ai/gemma-4-e2b-ondemand",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
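
The same endpoint can be called from Python. The sketch below is a minimal, standard-library-only example. It assumes the vLLM server started above is listening on localhost:8000 and that the server returns the reasoning trace and JSON plan together in the message content. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of any library. The sampling values mirror the Usage section of this card.

```python
import json
import urllib.request

MODEL_ID = "airev-ai/gemma-4-e2b-ondemand"


def build_chat_request(system, user_msg, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
        # Sampling values copied from the Usage section of this card.
        "max_tokens": 1024,
        "temperature": 0.1,
        "top_p": 0.9,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant message text (needs a running server)."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("You are an AI agent orchestrator.", "User Request: ...")
# response = send_chat_request(request)  # uncomment once the server is up
```

For the SGLang server, pass `base_url="http://localhost:30000"` instead.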

Use Docker

docker model run hf.co/airev-ai/gemma-4-e2b-ondemand

SGLang

How to use airev-ai/gemma-4-e2b-ondemand with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "airev-ai/gemma-4-e2b-ondemand" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "airev-ai/gemma-4-e2b-ondemand",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "airev-ai/gemma-4-e2b-ondemand" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use airev-ai/gemma-4-e2b-ondemand with Docker Model Runner:
```
docker model run hf.co/airev-ai/gemma-4-e2b-ondemand
```

Gemma-4-E2B On-Demand Plugin Orchestrator

A plugin selection and orchestration model with 2B active parameters, trained by AIREV for the On-Demand plugin platform.

Given a user request and a candidate pool of plugins, this model picks the correct subset, orders them with proper dependencies, and hydrates every API parameter — emitting a valid JSON plan that can be executed directly.

Eval results (100-sample held-out On-Demand set)

Model	Mean	JSON valid	Plugin-ID match	Count match	No-hallucinate	Hydrated	Deps chain
Gemma-4-E2B SFT-only (baseline)	0.9180	95.0%	89.0%	93.0%	95.0%	95.0%	87.0%
Gemma-4-E2B SFT + GRPO (this model)	0.9400	97.0%	91.0%	96.0%	97.0%	97.0%	89.0%

+2.2 pp mean score (0.918 → 0.940): the error (1 − mean) drops from 0.082 to 0.060, a 26.8% relative error reduction over SFT-only.

By category (GRPO wins on multi-step chains)

Category	SFT	GRPO	Δ (pp)
1_step	0.923	0.923	—
2_step	0.950	1.000	+5.0
3_step	0.936	0.976	+4.0
4_step	0.781	0.791	+1.0
multi_turn	1.000	1.000	—

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

model_id = "airev-ai/gemma-4-e2b-ondemand"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

system = ("You are an AI agent orchestrator. Given a user request and available "
          "plugins/tools, generate a precise multi-step execution plan as a valid "
          "JSON object. Each step must use available plugins with correct parameters, "
          "proper types, and valid JSON formatting.")

candidates = [
    {"pluginId": "plugin-1714851345", "name": "Nutrition BOT",
     "description": "Nutrition type stuff", "identifier": "rest_api", "method": "POST"},
    {"pluginId": "plugin-1768545918", "name": "kinetiqai-exercise-scoring",
     "description": "Analyzes workout form using PoseTracker data",
     "identifier": "rest_api", "method": "POST"},
    # ... more candidates
]

user_msg = (
    f"YOUR TASK IS TO GENERATE A JSON STRICTLY and CORRECTLY\n"
    f"{json.dumps(candidates, indent=2)}\n\n"
    f"User Request: I want to improve my fitness routine — analyze my workout "
    f"form, then get nutrition guidance."
)

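# Tokenize through the chat template in one step. This avoids tokenizing the
# templated string a second time (which can duplicate the BOS token) and
# passes the attention mask to generate() along with the input ids.
inputs = tok.apply_chat_template(
    [{"role": "system", "content": system},
     {"role": "user", "content": user_msg}],
    tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to("cuda")
out = model.generate(**inputs, max_new_tokens=1024, temperature=0.1, do_sample=True, top_p=0.9)
response = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)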

# Response contains <think>...</think> followed by JSON:
# {"plugins": [{"pluginId": "...", "api_request_parameters": {...},
#               "all_parameters_hydrated": true, "dependencies": [...]}]}
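
To act on the response, the reasoning trace has to be separated from the JSON plan. The following is a minimal sketch of one way to do that, not the platform's own parser. The helper names `split_response` and `check_plan` are illustrative, and the `exercise` parameter in the sample string is made up for the example. If the `<think>` tags are stripped during decoding, the sketch falls back to the first `{` in the text.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response):
    """Return (reasoning, plan) from a '<think>...</think>{json}' response."""
    match = THINK_RE.search(response)
    reasoning = match.group(1).strip() if match else ""
    rest = response[match.end():] if match else response
    start = rest.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    # raw_decode parses the first JSON value and ignores any trailing text.
    plan, _ = json.JSONDecoder().raw_decode(rest[start:])
    return reasoning, plan


def check_plan(plan, candidate_ids):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for step in plan.get("plugins", []):
        pid = step.get("pluginId")
        if pid not in candidate_ids:
            problems.append(f"unknown pluginId: {pid}")
        if step.get("all_parameters_hydrated") is not True:
            problems.append(f"{pid}: parameters not fully hydrated")
    return problems


# Hand-written sample in the documented shape (not real model output).
sample = (
    "<think>Form analysis first, then nutrition.</think>\n"
    '{"plugins": [{"pluginId": "plugin-1768545918", '
    '"api_request_parameters": {"exercise": "squat"}, '
    '"all_parameters_hydrated": true, "dependencies": []}]}'
)
reasoning, plan = split_response(sample)
problems = check_plan(plan, {"plugin-1714851345", "plugin-1768545918"})
```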

Output format

The model emits:

A <think>...</think> reasoning trace explaining plugin selection
A JSON object: {"plugins": [...]} where each plugin has:
- pluginId (from the candidate list — never hallucinated)
- name, description, identifier, method
- api_request_parameters — fully hydrated, no placeholders
- all_parameters_hydrated: true
- dependencies: [] — list of pluginIds that must run first
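
Because each plugin lists the pluginIds that must run first, an executor can derive a run order with a topological sort. The sketch below assumes a plan shaped like the JSON object described above and uses Python's standard `graphlib` module (Python 3.9+). The function name `execution_order` is illustrative, and the dependency between the two sample plugins is invented for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(plan):
    """Order pluginIds so every plugin runs after the plugins it depends on."""
    # Map each plugin to the set of plugins that must run before it.
    graph = {
        step["pluginId"]: set(step.get("dependencies", []))
        for step in plan["plugins"]
    }
    for pid, deps in graph.items():
        missing = deps - set(graph)
        if missing:
            raise ValueError(f"{pid} depends on plugins not in the plan: {sorted(missing)}")
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError("dependency cycle in plan") from err


# Illustrative plan: nutrition guidance waits for the workout-form analysis.
plan = {"plugins": [
    {"pluginId": "plugin-1714851345", "dependencies": ["plugin-1768545918"]},
    {"pluginId": "plugin-1768545918", "dependencies": []},
]}
order = execution_order(plan)
```

Rejecting unknown dependencies before sorting keeps a malformed plan from being partially executed.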

Training pipeline

1. Data — 64,992 cleaned samples

Source: real On-Demand production traces + synthetic plans
Cleaning pipeline: deduplicated, JSON-validated, thinking tokens enforced, parameters hydrated (no example.com, no empty values, no placeholders)
Judge: Claude Opus 4.6 via Vertex AI, 100 parallel workers

2. SFT — 194,976 steps, 3 epochs

Base: google/gemma-4-E2B (5B total, 2B active, GDN hybrid)
Optimizer: Adafactor (AdamW causes CUDA illegal memory access on Gemma 4)
Single GPU only, no scheduler, no gradient clipping
LR = 2e-5, batch_size = 1, grad_accum = 1, max_length = 1024
Final loss: 0.1496 avg
~20 hours on 1× H100 80GB
Eval score: 0.918

3. AutoResearch — 24 iterations hyperparameter search

Claude Opus 4.6 proposes config mutations; a ratchet keeps a mutation only if it beats the best config so far
Best finding: num_plugins=6 candidates per prompt (down from 8)
Everything else stayed at defaults: lr=1e-6, num_generations=4, top_p=0.9
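
The ratchet idea can be sketched as a simple keep-if-better loop. This is an illustration under stated assumptions, not the AutoResearch harness itself. In the real setup the proposals come from an LLM and the score comes from the eval set. Here `propose` is a random stand-in and `toy_evaluate` is a placeholder objective chosen only so the example runs. The starting values are the defaults named above.

```python
import random


def ratchet_search(base, propose, evaluate, iterations=24):
    """Keep a candidate config only when it beats the best score so far."""
    best, best_score = dict(base), evaluate(base)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:  # the ratchet: never move to a worse config
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


# Stand-ins so the sketch runs; neither reflects the real harness.
rng = random.Random(0)


def propose(config):
    """Hypothetical mutation: nudge the candidate-pool size by one."""
    mutated = dict(config)
    mutated["num_plugins"] = max(2, config["num_plugins"] + rng.choice([-1, 1]))
    return mutated


def toy_evaluate(config):
    """Placeholder objective, NOT the real eval score."""
    return -abs(config["num_plugins"] - 6)


base = {"num_plugins": 8, "lr": 1e-6, "num_generations": 4, "top_p": 0.9}
best, best_score, history = ratchet_search(base, propose, toy_evaluate)
```

Since rejected mutations are discarded, the best-score history never decreases, which is the property the ratchet is meant to guarantee.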

4. GRPO — 570 steps with plugin-selection reward

Reward (0.0–1.0) combines:

0.10 valid JSON
0.15 all picks in available candidate list
0.25 × (correct_picked / total_correct)
0.20 bonus for no wrong picks
−0.10 × wrong picks (capped at 3)
0.10 exact count match
0.15 × hydration ratio
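
One possible reading of these terms is sketched below. It is not the released reward function. Several details are assumptions: invalid JSON scores 0 because nothing else can be checked, an empty plan earns neither the candidate-list term nor the no-wrong-picks bonus, `hydration_ratio` is passed in as a precomputed fraction of hydrated parameters, and the total is clamped to the stated 0.0 to 1.0 range. Under this reading the positive weights sum to 0.95, so a perfect plan scores 0.95.

```python
def plugin_selection_reward(is_valid_json, picked, candidates, correct, hydration_ratio):
    """Combine the listed reward terms for one generated plan (illustrative sketch)."""
    if not is_valid_json:
        return 0.0  # assumption: nothing else can be scored without a parsed plan
    picked, candidates, correct = set(picked), set(candidates), set(correct)

    reward = 0.10  # valid JSON
    if picked and picked <= candidates:
        reward += 0.15  # all picks come from the available candidate list
    if correct:
        reward += 0.25 * len(picked & correct) / len(correct)
    wrong = len(picked - correct)
    if picked and wrong == 0:
        reward += 0.20  # bonus for no wrong picks
    reward -= 0.10 * min(wrong, 3)  # penalty per wrong pick, capped at 3
    if len(picked) == len(correct):
        reward += 0.10  # exact count match
    reward += 0.15 * hydration_ratio

    return max(0.0, min(1.0, reward))  # assumption: clamp to the stated range


candidates = {"a", "b", "x", "y"}
perfect = plugin_selection_reward(True, {"a", "b"}, candidates, {"a", "b"}, 1.0)
partial = plugin_selection_reward(True, {"a", "x"}, candidates, {"a", "b"}, 0.5)
```

In the partial case, one of the two correct plugins is picked alongside one wrong plugin, so the plan earns the JSON, candidate-list, and count terms plus half of the selection and hydration terms, and loses 0.10 for the wrong pick.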

Best checkpoint: step 500 (eval score 0.940). The training-reward avg20 peaked at 0.811 at step 473, after which the run entered a noise-induced tail.

Architecture notes

Gemma-4-E2B = 5B total params, 2B active (MoE), 128K context, GDN-style
Thinking tokens always active — the model learned to use <think> for plugin reasoning
Adafactor is mandatory for training — AdamW hits illegal memory access
Single GPU only — device_map="auto" causes crashes
No LR scheduler, no gradient clipping — these also destabilize training

Code

Full open-source training pipeline, AutoResearch harness, GRPO reward function, and eval scripts available at: github.com/mk42-ai/gemma-4-e2b-ondemand

Acknowledgements

Google DeepMind for Gemma 4
Berkeley RAIL for the BFCL benchmark methodology
HuggingFace TRL team for GRPO reference implementation
AIREV infrastructure team for 8× H100 cluster access

Citation

@misc{gemma4e2b_ondemand_2026,
  title   = {Gemma-4-E2B On-Demand: Plugin Orchestration via SFT + GRPO},
  author  = {Khalid, Muhammed and AIREV},
  year    = {2026},
  url     = {https://huggingface.co/airev-ai/gemma-4-e2b-ondemand},
}
