Efficient Universal Perception Encoder
Paper: arXiv:2603.22387
Multi-task perception on a frozen EUPE-ViT-T backbone. Classification, semantic segmentation, metric depth, object detection, and dense correspondence from a single 5.5-million-parameter encoder.
```
Image → EUPE-ViT-T (frozen, 5.5M) → shared features
                          │
      ┌───────────────────┼───────────────────┬────────────────────┐
      ▼                   ▼                   ▼                    ▼
Classification       Segmentation           Depth              Detection
Linear(192, 1K)    DPT (dec_dim=128)   DPT (dec_dim=128)   Split-tower (192-D)
0.19 M params       3.27 M params       3.28 M params        2.88 M params
```
Plus correspondence via dense cosine matching on patch tokens (0 params).
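Since the correspondence head has zero parameters, it reduces to nearest-neighbour search in patch-token space. A minimal sketch of dense cosine matching, assuming the 192-D patch tokens have already been extracted (the `tokens_a`/`tokens_b` names and the 14×14 patch grid are illustrative, not the ArgusEdge API):

```python
import numpy as np

def dense_match(tokens_a: np.ndarray, tokens_b: np.ndarray) -> np.ndarray:
    """For each patch token in image A, return the index of the most
    cosine-similar patch token in image B (no learned parameters)."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    sim = a @ b.T                 # (Na, Nb) cosine similarity matrix
    return sim.argmax(axis=1)     # best match in B for every patch of A

rng = np.random.default_rng(0)
ta = rng.standard_normal((196, 192)).astype(np.float32)  # 14x14 patches, 192-D
matches = dense_match(ta, ta)
# matching an image against itself yields the identity mapping
```

Matching an image against itself is a cheap sanity check: every patch should pick its own index.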
| Component | Params |
|---|---|
| EUPE-ViT-T backbone (frozen) | 5.50 M |
| Classifier head | 0.19 M |
| Segmentation head | 3.27 M |
| Depth head | 3.28 M |
| Detection head | 2.88 M |
| Total | 15.12 M |
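As a sanity check on the table, the classifier row is just the weight and bias count of a 192 → 1000 linear layer, and the total is the column sum:

```python
in_dim, n_classes = 192, 1000

# Linear(192, 1000): weight matrix plus bias vector.
cls_params = in_dim * n_classes + n_classes   # 193,000, i.e. 0.19 M

# Rows of the parameter table (in millions) sum to the overall figure.
total_m = round(5.50 + 0.19 + 3.27 + 3.28 + 2.88, 2)  # 15.12 M
```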
| Task | Dataset | Metric | Value |
|---|---|---|---|
| Classification | ImageNet-1k val (50K images) | top-1 / top-5 | 60.31 % / 85.49 % |
| Segmentation | ADE20K val (2000 images) | mIoU | 0.365 |
| Depth | NYU Depth V2 val (5058 images) | RMSE | 0.625 |
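For reference, the depth RMSE above is the usual root-mean-square error over valid pixels, in meters. A minimal sketch (the zero-means-missing masking convention is an assumption here; NYU evaluation protocols vary in their crop and mask details):

```python
import numpy as np

def depth_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """RMSE in meters over pixels with valid ground-truth depth."""
    valid = gt > 0                     # treat 0 as missing depth
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

gt = np.array([[2.0, 0.0], [3.0, 1.0]])
pred = np.array([[2.5, 9.0], [3.0, 0.5]])
rmse = depth_rmse(pred, gt)            # the masked pixel (gt=0) is ignored
```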
| Head | Recipe |
|---|---|
| Classification | Cached 192-D CLS tokens from 1.28 M ImageNet-1k train images, L2-normalized, Linear(192, 1000), AdamW lr 1e-3, 30-epoch cosine |
| Segmentation | ADE20K 20K train at 512 px, DPT decoder over 4 hooked blocks (2, 5, 8, 11), dec_dim=128, cross-entropy, AdamW lr 1e-4, cosine |
| Depth | NYU Depth V2 32K train at 416 px, DPT decoder over the same 4 hooked blocks, 256-bin log-depth head, SILog loss on expected depth, AdamW lr 1e-4, cosine, reflection-padded 3Γ3 convs |
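The classification recipe amounts to a standard linear probe over cached CLS tokens. A minimal full-batch training-loop sketch under the listed hyperparameters (the toy random dataset stands in for the 1.28 M cached ImageNet-1k features; per-epoch mini-batching is elided):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for cached, L2-normalized 192-D CLS tokens.
feats = torch.randn(1024, 192)
feats = feats / feats.norm(dim=1, keepdim=True)
labels = torch.randint(0, 1000, (1024,))

probe = nn.Linear(192, 1000)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30)
loss_fn = nn.CrossEntropyLoss()

losses = []
for epoch in range(30):                # 30-epoch cosine schedule
    opt.zero_grad()
    loss = loss_fn(probe(feats), labels)
    loss.backward()
    opt.step()
    sched.step()
    losses.append(loss.item())
```

Because the features are cached once from the frozen backbone, the probe trains in minutes even on CPU.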
The paper (arXiv:2603.22387) reports 66.3 % ImageNet-1k k-NN accuracy for EUPE-ViT-T, a rough ceiling for a linear probe on these features.
| File | Description |
|---|---|
| `cls_head.safetensors` | Linear(192, 1000) classifier |
| `seg_head.safetensors` | DPT decoder, 150-class segmentation at 512 px |
| `depth_head.safetensors` | DPT decoder, 256 log-depth bins at 416 px |
| `det_head.safetensors` | Split-tower detection head, 80-class FCOS at 768 px |
| `argus_edge.py` | `ArgusEdge` class |
| `cls_training_log.json` | Classifier training curve |
| `cls_val_eval.json` | Classifier val eval |
```python
from argus_edge import ArgusEdge

model = ArgusEdge.from_pretrained('phanerozoic/argus-edge').cuda().eval()

model.classify('image.jpg', top_k=5)
# {'label': 'tabby', 'score': 0.62, 'top5': [...]}

model.segment('street.jpg')   # (512, 512) int array of ADE20K class ids
model.depth('room.jpg')       # (416, 416) float array, metric depth in meters
model.detect('photo.jpg', score_thresh=0.3)
# [{'box': [x1, y1, x2, y2], 'score': 0.78, 'class_id': 17}, ...]
model.correspond('a.jpg', 'b.jpg')  # dense patch-token matches between two images

out = model.perceive('a.jpg', image_b='b.jpg')
# keys: classification, segmentation, depth, detection, correspondence
```
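`detect` already applies a score threshold; if you want to additionally suppress overlapping boxes, a minimal greedy NMS sketch over `[x1, y1, x2, y2]` boxes (whether ArgusEdge applies NMS internally is not stated here, so this helper is purely illustrative):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression; returns kept box indices."""
    order = scores.argsort()[::-1]     # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]  # drop boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box overlaps the first and is dropped
```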
Requires `argus.py` from `phanerozoic/argus` on `sys.path` for the `DinoVisionTransformer` and `SplitTowerHead` classes.
EUPE-ViT-T from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
Base model: `facebook/EUPE-ViT-T`