Argus-Edge

Multi-task perception on a frozen EUPE-ViT-T backbone. Classification, semantic segmentation, metric depth, object detection, and dense correspondence from a single 5.5-million-parameter encoder.

Architecture

Image β†’ EUPE-ViT-T (frozen, 5.5M) β†’ shared features
                                     β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β–Ό                      β–Ό                      β–Ό                  β–Ό
        Classification          Segmentation              Depth            Detection
        Linear(192, 1K)         DPT (dec_dim=128)     DPT (dec_dim=128)  Split-tower (192-D)
        0.19 M params            3.27 M params           3.28 M params       2.88 M params

Plus correspondence via dense cosine matching on patch tokens (0 params).
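Since the correspondence path is parameter-free, it can be sketched in a few lines of NumPy: L2-normalize the patch tokens from both images, take the cosine-similarity matrix, and match each patch in A to its argmax in B. This is an illustrative sketch, not the `ArgusEdge.correspond` implementation; token shapes and the 192-D width are taken from the card.

```python
import numpy as np

def correspond(tokens_a: np.ndarray, tokens_b: np.ndarray):
    """Match patch tokens of image A to image B by cosine similarity.

    tokens_a: (Na, D), tokens_b: (Nb, D). Returns, for each patch in A,
    the index of its best match in B and the cosine score.
    Parameter-free: nothing here is learned.
    """
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    sim = a @ b.T                              # (Na, Nb) cosine similarities
    idx = sim.argmax(axis=1)                   # best B-patch for each A-patch
    score = sim[np.arange(len(idx)), idx]
    return idx, score

# Toy check: 4 patches of the 192-D EUPE-ViT-T token width, matched to a copy
rng = np.random.default_rng(0)
ta = rng.standard_normal((4, 192))
idx, score = correspond(ta, ta.copy())
print(idx)  # each patch matches itself: [0 1 2 3]
```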

Component                       Params
EUPE-ViT-T backbone (frozen)    5.50 M
Classifier head                 0.19 M
Segmentation head               3.27 M
Depth head                      3.28 M
Detection head                  2.88 M
Total                          15.12 M
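The budget above is easy to verify in a couple of lines (values in millions, copied from the table):

```python
backbone = 5.50                        # frozen EUPE-ViT-T
heads = {
    "classifier":   0.19,              # Linear(192, 1000): 192*1000 + 1000 = 193,000 weights
    "segmentation": 3.27,
    "depth":        3.28,
    "detection":    2.88,
}
total = backbone + sum(heads.values())
print(f"total: {total:.2f} M")         # 15.12 M
```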

Results

Task            Dataset                         Metric          Value
Classification  ImageNet-1k val (50K images)    top-1 / top-5   60.31 % / 85.49 %
Segmentation    ADE20K val (2000 images)        mIoU            0.365
Depth           NYU Depth V2 val (5058 images)  RMSE            0.625

Training recipes

Head            Recipe
Classification  Cached 192-D CLS tokens from the 1.28 M ImageNet-1k train images, L2-normalized; Linear(192, 1000); AdamW, lr 1e-3, 30-epoch cosine schedule
Segmentation    ADE20K train (20K images) at 512 px; DPT decoder over 4 hooked blocks (2, 5, 8, 11), dec_dim=128; cross-entropy; AdamW, lr 1e-4, cosine schedule
Depth           NYU Depth V2 train (32K images) at 416 px; DPT decoder over the same 4 hooked blocks; 256-bin log-depth head, SILog loss on expected depth; AdamW, lr 1e-4, cosine schedule; reflection-padded 3×3 convs
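The "256-bin log-depth head, SILog loss on expected depth" combination can be sketched as follows. The bin placement (uniform in log-depth over an assumed NYU-style 0.1–10 m range) and the SILog weight lam=0.85 are common choices, not necessarily the exact settings used to train this head.

```python
import numpy as np

def expected_depth(logits, d_min=0.1, d_max=10.0):
    """Expected depth from a log-depth classification head.

    Bin centers are spaced uniformly in log-depth between d_min and d_max;
    the softmax over bins gives a distribution, and depth is the exponential
    of the expected log-depth.
    """
    n_bins = logits.shape[-1]
    log_centers = np.linspace(np.log(d_min), np.log(d_max), n_bins)
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return np.exp((p * log_centers).sum(-1))

def silog_loss(pred, target, lam=0.85):
    """Scale-invariant log loss (Eigen-style) on predicted vs. ground-truth depth."""
    d = np.log(pred) - np.log(target)
    return (d ** 2).mean() - lam * d.mean() ** 2

logits = np.zeros((4, 256))    # uniform distribution over the 256 bins
depth = expected_depth(logits)
print(depth)                   # geometric mean of the range: ~1.0 m for [0.1, 10]
```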

The paper (arXiv:2603.22387) reports 66.3 % ImageNet-1k kNN accuracy for EUPE-ViT-T, a rough ceiling for a linear probe on these features.
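The linear-probe recipe (L2-normalized cached CLS features, softmax cross-entropy on a single linear layer) can be sketched end to end on toy data. For brevity this uses 3 classes instead of 1000 and plain gradient descent as a stand-in for AdamW with a cosine schedule; the data and hyperparameters here are illustrative, not the actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for cached 192-D CLS features with 3 separable classes
n, d, k = 300, 192, 3
X = rng.standard_normal((n, d))
y = rng.integers(0, k, n)
X[np.arange(n), y] += 4.0                       # shift each class along its own axis
X /= np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize, as in the recipe

W = np.zeros((d, k))
b = np.zeros(k)
for _ in range(200):
    logits = X @ W + b
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(n), y] -= 1.0                   # dCE/dlogits for softmax
    W -= 0.5 * X.T @ p / n                      # plain GD step (AdamW stand-in)
    b -= 0.5 * p.mean(0)

acc = ((X @ W + b).argmax(1) == y).mean()
print(f"train accuracy: {acc:.2f}")
```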

Files

cls_head.safetensors          Linear(192, 1000) classifier
seg_head.safetensors          DPT decoder, 150-class segmentation at 512 px
depth_head.safetensors        DPT decoder, 256 log-depth bins at 416 px
det_head.safetensors          Split-tower detection head, 80-class FCOS at 768 px
argus_edge.py                 ArgusEdge class
cls_training_log.json         classifier training curve
cls_val_eval.json             classifier val eval

Usage

from argus_edge import ArgusEdge

model = ArgusEdge.from_pretrained('phanerozoic/argus-edge').cuda().eval()

model.classify('image.jpg', top_k=5)
# {'label': 'tabby', 'score': 0.62, 'top5': [...]}

model.segment('street.jpg')              # (512, 512) int array of ADE20K class ids
model.depth('room.jpg')                  # (416, 416) float array, metric depth in meters
model.detect('photo.jpg', score_thresh=0.3)  # [{'box': [x1,y1,x2,y2], 'score': 0.78, 'class_id': 17}, ...]
model.correspond('a.jpg', 'b.jpg')       # dense patch-token matches between two images

out = model.perceive('a.jpg', image_b='b.jpg')
# keys: classification, segmentation, depth, detection, correspondence

Requires argus.py from phanerozoic/argus on sys.path for DinoVisionTransformer and SplitTowerHead.

Source backbone

EUPE-ViT-T from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
