Gemma-4-E2B On-Demand Plugin Orchestrator
2B-active-parameter plugin selection and orchestration model, trained by AIREV for the On-Demand plugin platform.
Given a user request and a candidate pool of plugins, this model picks the correct subset, orders them with proper dependencies, and hydrates every API parameter, emitting a valid JSON plan that can be executed directly.
Eval results (100-sample held-out On-Demand set)
| Model | Mean | JSON valid | Plugin-ID match | Count match | No-hallucinate | Hydrated | Deps chain |
|---|---|---|---|---|---|---|---|
| Gemma-4-E2B SFT-only (baseline) | 0.9180 | 95.0% | 89.0% | 93.0% | 95.0% | 95.0% | 87.0% |
| Gemma-4-E2B SFT + GRPO (this model) | 0.9400 | 97.0% | 91.0% | 96.0% | 97.0% | 97.0% | 89.0% |
+2.2pp mean score = 26.8% relative error reduction over SFT-only.
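The relative error reduction quoted above can be reproduced from the two mean scores in the table. The sketch below assumes "error" means `1 - mean score`; the function name is illustrative and not part of the released code.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Fraction of the baseline's error (1 - score) removed by the new model."""
    baseline_error = 1.0 - baseline_score
    new_error = 1.0 - new_score
    return (baseline_error - new_error) / baseline_error


# Mean scores taken from the eval table (SFT-only vs. SFT + GRPO).
sft_mean, grpo_mean = 0.9180, 0.9400

gain_pp = (grpo_mean - sft_mean) * 100
reduction = relative_error_reduction(sft_mean, grpo_mean)
print(f"+{gain_pp:.1f}pp, {reduction:.1%} relative error reduction")
# prints: +2.2pp, 26.8% relative error reduction
```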
By category (GRPO wins on multi-step chains)
| Category | SFT | GRPO | Δ (pp) |
|---|---|---|---|
| 1_step | 0.923 | 0.923 | 0.0 |
| 2_step | 0.950 | 1.000 | +5.0 |
| 3_step | 0.936 | 0.976 | +4.0 |
| 4_step | 0.781 | 0.791 | +1.0 |
| multi_turn | 1.000 | 1.000 | 0.0 |
Usage
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "airev-ai/gemma-4-e2b-ondemand"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

system = ("You are an AI agent orchestrator. Given a user request and available "
          "plugins/tools, generate a precise multi-step execution plan as a valid "
          "JSON object. Each step must use available plugins with correct parameters, "
          "proper types, and valid JSON formatting.")

# Candidate pool the model selects from
candidates = [
    {"pluginId": "plugin-1714851345", "name": "Nutrition BOT",
     "description": "Nutrition type stuff", "identifier": "rest_api", "method": "POST"},
    {"pluginId": "plugin-1768545918", "name": "kinetiqai-exercise-scoring",
     "description": "Analyzes workout form using PoseTracker data",
     "identifier": "rest_api", "method": "POST"},
    # ... more candidates
]

user_msg = (
    "YOUR TASK IS TO GENERATE A JSON STRICTLY and CORRECTLY\n"
    f"{json.dumps(candidates, indent=2)}\n\n"
    "User Request: I want to improve my fitness routine: analyze my workout "
    "form, then get nutrition guidance."
)

prompt = tok.apply_chat_template(
    [{"role": "system", "content": system},
     {"role": "user", "content": user_msg}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=1024, temperature=0.1, do_sample=True, top_p=0.9)

# Decode only the newly generated tokens (skip the prompt)
response = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
# Response contains <think>...</think> followed by JSON:
# {"plugins": [{"pluginId": "...", "api_request_parameters": {...},
#   "all_parameters_hydrated": true, "dependencies": [...]}]}
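Because the response mixes a reasoning trace with the JSON plan, callers will usually want to separate the two before executing anything. The following is a minimal sketch of one way to do that; `split_response` and the example string are hypothetical and not part of the released code, and the parameter values in the example are invented for illustration.

```python
import json
import re


def split_response(response: str):
    """Return (reasoning, plan) from a '<think>...</think>{json}' string."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    remainder = response[match.end():] if match else response
    # Take the outermost {...} span so stray text around the JSON is ignored.
    start, end = remainder.find("{"), remainder.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return reasoning, json.loads(remainder[start:end + 1])


# Hypothetical response, shaped like the comment in the usage snippet.
example = (
    "<think>Form analysis first, then nutrition.</think>\n"
    '{"plugins": [{"pluginId": "plugin-1768545918", '
    '"api_request_parameters": {"exercise": "squat"}, '
    '"all_parameters_hydrated": true, "dependencies": []}]}'
)
reasoning, plan = split_response(example)
print(reasoning)
print([p["pluginId"] for p in plan["plugins"]])
```

If the `<think>` markers are absent from the decoded text, the function still works and returns an empty reasoning string.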
Output format
The model emits:
- A `<think>...</think>` reasoning trace explaining plugin selection
- A JSON object `{"plugins": [...]}` where each plugin has:
  - `pluginId` (from the candidate list, never hallucinated)
  - `name`, `description`, `identifier`, `method`
  - `api_request_parameters`: fully hydrated, no placeholders
  - `all_parameters_hydrated: true`
  - `dependencies: []`: list of pluginIds that must run first
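Before executing a plan, it can help to check it against the format described above. The sketch below is an assumption about how such a check might look, not the project's eval code: `validate_plan` is a hypothetical helper, the parameter values in the example plan are invented, and it assumes that plugins are listed in execution order so every dependency must appear earlier in the list.

```python
REQUIRED_KEYS = {
    "pluginId", "name", "description", "identifier", "method",
    "api_request_parameters", "all_parameters_hydrated", "dependencies",
}


def validate_plan(plan: dict, candidate_ids: set) -> list:
    """Return a list of problems; an empty list means the plan passed."""
    plugins = plan.get("plugins") if isinstance(plan, dict) else None
    if not isinstance(plugins, list):
        return ["top-level 'plugins' must be a list"]
    problems = []
    seen = set()  # pluginIds of steps that appear earlier in the plan
    for i, step in enumerate(plugins):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            problems.append(f"step {i}: missing keys {sorted(missing)}")
            continue
        pid = step["pluginId"]
        if pid not in candidate_ids:
            problems.append(f"step {i}: {pid} is not in the candidate list")
        if step["all_parameters_hydrated"] is not True:
            problems.append(f"step {i}: parameters not fully hydrated")
        for dep in step["dependencies"]:
            if dep not in seen:
                problems.append(f"step {i}: dependency {dep} does not run earlier")
        seen.add(pid)
    return problems


candidate_ids = {"plugin-1714851345", "plugin-1768545918"}
plan = {"plugins": [
    {"pluginId": "plugin-1768545918", "name": "kinetiqai-exercise-scoring",
     "description": "Analyzes workout form using PoseTracker data",
     "identifier": "rest_api", "method": "POST",
     "api_request_parameters": {"exercise": "squat"},  # invented value
     "all_parameters_hydrated": True, "dependencies": []},
    {"pluginId": "plugin-1714851345", "name": "Nutrition BOT",
     "description": "Nutrition type stuff",
     "identifier": "rest_api", "method": "POST",
     "api_request_parameters": {"goal": "muscle gain"},  # invented value
     "all_parameters_hydrated": True,
     "dependencies": ["plugin-1768545918"]},
]}
print(validate_plan(plan, candidate_ids))  # prints: []
```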
Training pipeline
1. Data: 64,992 cleaned samples
- Source: real On-Demand production traces + synthetic plans
- Cleaning pipeline: deduplicated, JSON-validated, thinking tokens enforced, parameters hydrated (no `example.com`, no empty values, no placeholders)
- Judge: Claude Opus 4.6 via Vertex AI, 100 parallel workers
2. SFT: 194,976 steps, 3 epochs
- Base: `google/gemma-4-E2B` (5B total, 2B active, GDN hybrid)
- Optimizer: Adafactor (AdamW causes CUDA illegal memory access on Gemma 4)
- Single GPU only, no scheduler, no gradient clipping
- LR = 2e-5, batch_size = 1, grad_accum = 1, max_length = 1024
- Final loss: 0.1496 avg
- ~20 hours on 1× H100 80GB
- Eval score: 0.918
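The step count follows directly from the dataset size and the batch settings listed above. This short sketch shows the arithmetic, assuming every sample is consumed once per epoch; `optimizer_steps` is an illustrative helper, and the throughput figure is only as accurate as the "~20 hours" estimate.

```python
import math


def optimizer_steps(num_samples, epochs, batch_size=1, grad_accum=1):
    """Optimizer updates for a run that consumes every sample once per epoch."""
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


total = optimizer_steps(num_samples=64_992, epochs=3)
print(total)  # prints: 194976

# Rough throughput implied by the reported wall-clock time of about 20 hours.
steps_per_second = total / (20 * 3600)
print(f"{steps_per_second:.1f} steps/s")  # prints: 2.7 steps/s
```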
3. AutoResearch: 24-iteration hyperparameter search
- Claude Opus 4.6 mutations with ratchet (keep best config)
- Best finding: `num_plugins=6` candidates per prompt (down from 8)
- Everything else stayed at defaults: `lr=1e-6`, `num_generations=4`, `top_p=0.9`
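The "mutate, evaluate, keep the best" ratchet described above can be sketched as a simple loop. This is an illustration only: in the real pipeline the mutations come from Claude Opus 4.6 and each evaluation is a training run, whereas here `toy_mutate` and `toy_evaluate` are hypothetical stand-ins (a random tweak and a made-up score that happens to peak at six candidates).

```python
import random


def ratchet_search(base_config, evaluate, mutate, iterations=24, seed=0):
    """Keep the best config seen so far; only accept strict improvements."""
    rng = random.Random(seed)
    best_config, best_score = dict(base_config), evaluate(base_config)
    history = [best_score]
    for _ in range(iterations):
        candidate = mutate(best_config, rng)
        score = evaluate(candidate)
        if score > best_score:  # the ratchet: never move to a worse config
            best_config, best_score = candidate, score
        history.append(best_score)
    return best_config, best_score, history


def toy_mutate(config, rng):
    """Stand-in for an LLM-proposed mutation: nudge one hyperparameter."""
    new = dict(config)
    new["num_plugins"] = max(2, config["num_plugins"] + rng.choice([-1, 1]))
    return new


def toy_evaluate(config):
    """Made-up score for demonstration; not a real eval result."""
    return 1.0 - 0.05 * abs(config["num_plugins"] - 6)


base = {"num_plugins": 8, "lr": 1e-6, "num_generations": 4, "top_p": 0.9}
best_config, best_score, history = ratchet_search(base, toy_evaluate, toy_mutate)
print(best_config, round(best_score, 3))
```

Because only strict improvements are accepted, the best score in `history` never decreases, which is the property the ratchet is meant to guarantee.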
4. GRPO: 570 steps with plugin-selection reward
Reward (0.0 to 1.0) combines:
- 0.10 valid JSON
- 0.15 all picks in available candidate list
- 0.25 × (correct_picked / total_correct)
- 0.20 bonus for no wrong picks
- −0.10 × wrong picks (capped at 3)
- 0.10 exact count match
- 0.15 × hydration ratio
Best checkpoint: step 500, with an eval score of 0.940. The 20-step moving average of the training reward peaked at 0.811 at step 473, after which training entered a noisy tail.
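The reward components listed above can be combined as in the sketch below. It is one reading of that list, not the released reward function, and it makes several assumptions: the hydration ratio is taken as the fraction of parameter values that are non-empty and free of `example.com`, the candidate-list and no-wrong-picks terms are only granted when at least one plugin is picked, and the total is clamped to the 0.0 to 1.0 range. Under this reading the positive weights sum to 0.95, so a perfect plan scores 0.95.

```python
import json


def hydration_ratio(plugins):
    """Assumed definition: share of parameter values that look filled in."""
    values = []
    for p in plugins:
        params = p.get("api_request_parameters")
        if isinstance(params, dict):
            values.extend(params.values())
    if not values:
        return 0.0
    ok = sum(1 for v in values
             if v not in ("", None) and "example.com" not in str(v))
    return ok / len(values)


def plugin_selection_reward(completion, candidate_ids, correct_ids):
    try:
        plan = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    reward = 0.10  # valid JSON
    plugins = plan.get("plugins") if isinstance(plan, dict) else None
    if not isinstance(plugins, list) or not all(isinstance(p, dict) for p in plugins):
        return reward
    picked = [p.get("pluginId") for p in plugins]
    picked_set = set(picked)
    n_correct = len(picked_set & correct_ids)
    n_wrong = len(picked_set - correct_ids)
    if picked and picked_set <= candidate_ids:
        reward += 0.15  # every pick exists in the candidate list
    if correct_ids:
        reward += 0.25 * n_correct / len(correct_ids)
    if picked and n_wrong == 0:
        reward += 0.20  # bonus for no wrong picks
    reward -= 0.10 * min(n_wrong, 3)  # penalty, capped at 3 wrong picks
    if len(picked) == len(correct_ids):
        reward += 0.10  # exact count match
    reward += 0.15 * hydration_ratio(plugins)
    return max(0.0, min(1.0, reward))


candidates = {"a", "b", "c"}
correct = {"a", "b"}
perfect = json.dumps({"plugins": [
    {"pluginId": "a", "api_request_parameters": {"q": "squat"}},
    {"pluginId": "b", "api_request_parameters": {"q": "protein"}},
]})
one_wrong = json.dumps({"plugins": [
    {"pluginId": "a", "api_request_parameters": {"q": "squat"}},
    {"pluginId": "c", "api_request_parameters": {"q": "protein"}},
]})
print(round(plugin_selection_reward(perfect, candidates, correct), 3))    # prints: 0.95
print(round(plugin_selection_reward(one_wrong, candidates, correct), 3))  # prints: 0.525
```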
Architecture notes
- Gemma-4-E2B = 5B total params, 2B active (MoE), 128K context, GDN-style
- Thinking tokens always active: the model learned to use `<think>` for plugin reasoning
- Adafactor is mandatory for training: AdamW hits illegal memory access
- Single GPU only: `device_map="auto"` causes crashes
- No LR scheduler, no gradient clipping: these also destabilize training
Code
The full open-source training pipeline, AutoResearch harness, GRPO reward function, and eval scripts are available at: github.com/mk42-ai/gemma-4-e2b-ondemand
Acknowledgements
- Google DeepMind for Gemma 4
- Berkeley RAIL for the BFCL benchmark methodology
- HuggingFace TRL team for GRPO reference implementation
- AIREV infrastructure team for 8× H100 cluster access
Citation
@misc{gemma4e2b_ondemand_2026,
title = {Gemma-4-E2B On-Demand: Plugin Orchestration via SFT + GRPO},
author = {Khalid, Muhammed and AIREV},
year = {2026},
url = {https://huggingface.co/airev-ai/gemma-4-e2b-ondemand},
}