Efficient Universal Perception Encoder
Paper: arXiv:2603.22387
Multi-task perception on a frozen EUPE-ViT-T backbone. Classification, semantic segmentation, metric depth, object detection, and dense correspondence from a single 5.5-million-parameter encoder.
```
Image → EUPE-ViT-T (frozen, 5.5M) → shared features
                          │
      ┌───────────────────┼───────────────────┬────────────────────┐
      ▼                   ▼                   ▼                    ▼
Classification       Segmentation           Depth              Detection
Linear(192, 1K)    DPT (dec_dim=128)   DPT (dec_dim=128)   Split-tower (192-D)
0.19 M params       3.27 M params       3.28 M params        2.88 M params
```
Plus correspondence via dense cosine matching on patch tokens (0 params).
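Since the correspondence head has zero parameters, it reduces to nearest-neighbour search in patch-token space. A minimal sketch of dense cosine matching, assuming the 192-D patch tokens have already been extracted (the `tokens_a`/`tokens_b` names and the 14×14 patch grid are illustrative, not the ArgusEdge API):

```python
import numpy as np

def dense_match(tokens_a: np.ndarray, tokens_b: np.ndarray) -> np.ndarray:
    """For each patch token in image A, return the index of the most
    cosine-similar patch token in image B (no learned parameters)."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    sim = a @ b.T                 # (Na, Nb) cosine similarity matrix
    return sim.argmax(axis=1)     # best match in B for every patch of A

rng = np.random.default_rng(0)
ta = rng.standard_normal((196, 192)).astype(np.float32)  # 14x14 patches, 192-D
matches = dense_match(ta, ta)
# matching an image against itself yields the identity mapping
```

Matching an image against itself is a cheap sanity check: every patch should pick its own index.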
| Component | Params |
|---|---|
| EUPE-ViT-T backbone (frozen) | 5.50 M |
| Classifier head | 0.19 M |
| Segmentation head | 3.27 M |
| Depth head | 3.28 M |
| Detection head | 2.88 M |
| Total | 15.12 M |
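As a sanity check on the table, the classifier row is just the weight and bias count of a 192 → 1000 linear layer, and the total is the column sum:

```python
in_dim, n_classes = 192, 1000

# Linear(192, 1000): weight matrix plus bias vector.
cls_params = in_dim * n_classes + n_classes   # 193,000, i.e. 0.19 M

# Rows of the parameter table (in millions) sum to the overall figure.
total_m = round(5.50 + 0.19 + 3.27 + 3.28 + 2.88, 2)  # 15.12 M
```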
| Task | Dataset | Metric | Value |
|---|---|---|---|
| Classification | ImageNet-1k val (50K images) | top-1 / top-5 | 60.31 % / 85.49 % |
| Segmentation | ADE20K val (2000 images) | mIoU | 0.365 |
| Depth | NYU Depth V2 val (5058 images) | RMSE | 0.625 |
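For reference, the depth RMSE above is the usual root-mean-square error over valid pixels, in meters. A minimal sketch (the zero-means-missing masking convention is an assumption here; NYU evaluation protocols vary in their crop and mask details):

```python
import numpy as np

def depth_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """RMSE in meters over pixels with valid ground-truth depth."""
    valid = gt > 0                     # treat 0 as missing depth
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

gt = np.array([[2.0, 0.0], [3.0, 1.0]])
pred = np.array([[2.5, 9.0], [3.0, 0.5]])
rmse = depth_rmse(pred, gt)            # the masked pixel (gt=0) is ignored
```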
| Head | Recipe |
|---|---|
| Classification | Cached 192-D CLS tokens from 1.28 M ImageNet-1k train images, L2-normalized, Linear(192, 1000), AdamW lr 1e-3, 30-epoch cosine |
| Segmentation | ADE20K 20K train at 512 px, DPT decoder over 4 hooked blocks (2, 5, 8, 11), dec_dim=128, cross-entropy, AdamW lr 1e-4, cosine |
| Depth | NYU Depth V2 32K train at 416 px, DPT decoder over the same 4 hooked blocks, 256-bin log-depth head, SILog loss on expected depth, AdamW lr 1e-4, cosine, reflection-padded 3Γ3 convs |
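The classification recipe amounts to a standard linear probe over cached CLS tokens. A minimal full-batch training-loop sketch under the listed hyperparameters (the toy random dataset stands in for the 1.28 M cached ImageNet-1k features; per-epoch mini-batching is elided):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for cached, L2-normalized 192-D CLS tokens.
feats = torch.randn(1024, 192)
feats = feats / feats.norm(dim=1, keepdim=True)
labels = torch.randint(0, 1000, (1024,))

probe = nn.Linear(192, 1000)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30)
loss_fn = nn.CrossEntropyLoss()

losses = []
for epoch in range(30):                # 30-epoch cosine schedule
    opt.zero_grad()
    loss = loss_fn(probe(feats), labels)
    loss.backward()
    opt.step()
    sched.step()
    losses.append(loss.item())
```

Because the features are cached once from the frozen backbone, the probe trains in minutes even on CPU.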
The paper (arXiv:2603.22387) reports 66.3 % ImageNet-1k k-NN accuracy for EUPE-ViT-T, a rough ceiling for a linear probe on these features.
| File | Description |
|---|---|
| `cls_head.safetensors` | Linear(192, 1000) classifier |
| `seg_head.safetensors` | DPT decoder, 150-class segmentation at 512 px |
| `depth_head.safetensors` | DPT decoder, 256 log-depth bins at 416 px |
| `det_head.safetensors` | Split-tower detection head, 80-class FCOS at 768 px |
| `argus_edge.py` | `ArgusEdge` class |
| `cls_training_log.json` | Classifier training curve |
| `cls_val_eval.json` | Classifier val eval |
```python
from argus_edge import ArgusEdge

model = ArgusEdge.from_pretrained('phanerozoic/argus-edge').cuda().eval()

model.classify('image.jpg', top_k=5)
# {'label': 'tabby', 'score': 0.62, 'top5': [...]}

model.segment('street.jpg')   # (512, 512) int array of ADE20K class ids
model.depth('room.jpg')       # (416, 416) float array, metric depth in meters
model.detect('photo.jpg', score_thresh=0.3)
# [{'box': [x1, y1, x2, y2], 'score': 0.78, 'class_id': 17}, ...]
model.correspond('a.jpg', 'b.jpg')  # dense patch-token matches between two images

out = model.perceive('a.jpg', image_b='b.jpg')
# keys: classification, segmentation, depth, detection, correspondence
```
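`detect` already applies a score threshold; if you want to additionally suppress overlapping boxes, a minimal greedy NMS sketch over `[x1, y1, x2, y2]` boxes (whether ArgusEdge applies NMS internally is not stated here, so this helper is purely illustrative):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression; returns kept box indices."""
    order = scores.argsort()[::-1]     # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]  # drop boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box overlaps the first and is dropped
```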
Requires `argus.py` from `phanerozoic/argus` on `sys.path` for the `DinoVisionTransformer` and `SplitTowerHead` classes.
EUPE-ViT-T from Meta FAIR (arXiv:2603.22387, Zhu et al., March 2026). Three-stage distillation from PEcore-G + PElang-G + DINOv3-H+ via a 1.9 B proxy teacher. License: FAIR Research License (non-commercial).
Base model: `facebook/EUPE-ViT-T`