# EgoNormia-Cosmos-Reason2-2B-v3

A multi-task SFT fine-tune of nvidia/Cosmos-Reason2-2B on the EgoNormia social-norm benchmark, trained jointly on three subtasks: action selection, justification selection, and sensibility identification.
## Training
| Parameter | Value |
|---|---|
| Base model | nvidia/Cosmos-Reason2-2B (Qwen3-VL-2B) |
| Tasks | Action + Justification + Sensibility (multi-task) |
| Train samples | 4959 (1653 per task x 3) |
| Epochs | 3 |
| Global batch | 64 (8 replicas x 8 per replica) |
| Learning rate | 1e-5 (cosine decay, 3% warmup) |
| Video input | video_prev.mp4, 8 frames |
| Hardware | 8x A100-SXM4-80GB |
| Best checkpoint | step_150 / 231 total steps |
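The "8 frames" video input above implies subsampling each `video_prev.mp4` clip down to a fixed frame count. As a minimal sketch (assuming simple uniform sampling; the actual pipeline's strategy is not documented in this card), the frame indices could be chosen like this:

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip.

    Hypothetical helper for illustration: centers each sample inside
    its segment of the clip so coverage is uniform start to end.
    """
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# e.g. a 240-frame clip yields 8 indices spread across the video
indices = sample_frame_indices(240)
```

The selected frames would then be decoded and passed to the processor as the video input.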
## Evaluation (200 verified test samples)
| Model | Action | Justification | Both | S-IoU | Parse(A/J/S) |
|---|---|---|---|---|---|
| Zero-shot | 58.5% | 81.5% | 51.0% | 0.516 | -/-/- |
| v2 best | 82.0% | 84.0% | 71.5% | 0.000* | 100/100/0% |
| v3 step_150 | 79.5% | 96.5% | 77.0% | 0.630 | 100/100/100% |
*v2 S-IoU = 0% because the model collapses on the sensibility output format.
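S-IoU here is read as the set intersection-over-union between the predicted and gold sets of "sensible" options (an assumption based on the metric name; the benchmark's exact definition may differ). A minimal sketch:

```python
def sensibility_iou(pred: set[str], gold: set[str]) -> float:
    """Intersection-over-union of predicted vs. gold sensible-option sets.

    Hypothetical implementation for illustration: two empty sets are
    treated as a perfect match (IoU = 1.0).
    """
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


# e.g. predicting {A, B} against gold {B, C} overlaps on one of
# three distinct options, giving an IoU of 1/3
score = sensibility_iou({"A", "B"}, {"B", "C"})
```

Under this reading, v2's 0.000 means its sensibility outputs never parsed into a valid option set, so every prediction scored as an empty/invalid set against a non-empty gold set.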
## Notes
- v3 is the first version that fully solves the output-format collapse from v2.
- Relative to v2, it trades a small amount of peak action accuracy for large gains in justification quality, sensibility parsing, and overall benchmark completeness.
- Later versions mainly explore how to improve v3's action / robustness tradeoff without breaking formatting.
## Usage
```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "robertzty/EgoNormia-Cosmos-Reason2-2B-v3",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("robertzty/EgoNormia-Cosmos-Reason2-2B-v3")
```