timm · Commit 3249a38 (verified) · Parent(s): efea53c
rwightman (HF Staff) committed: Update README.md
Files changed (1): README.md (+7 -5)

README.md CHANGED
@@ -9,8 +9,8 @@ license: apache-2.0
 This is an OpenCLIP (image + text) remaped version of the the [original](https://huggingface.co/facebook/PE-Core-S16-384)
 
 [\[📃 Tech Report\]](https://arxiv.org/abs/2504.13181)
-[\[📂 PE Github\]](https://github.com/facebookresearch/perception_models/)
-[\[📂 OpenCLIP Github\]](https://github.com/mlfoundations/open_clip)
+[\[📂 PE Github (original weights)\]](https://github.com/facebookresearch/perception_models/)
+[\[📂 OpenCLIP Github (these weights)\]](https://github.com/mlfoundations/open_clip)
 
 Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings
 are not at the output of the network](https://ai.meta.com/research/publications/perception-encoder-the-best-visual-embeddings-are-not-at-the-output-of-the-network/)".
@@ -46,9 +46,11 @@ PE core obtains extremely strong results across the board on zero-shot image cla
 
 | Model | Checkpoint | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| **B/16** 224px | [PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224) | 78.4 | 71.7 | 62.4 | 71.9 | 50.9 | 65.6 | 47.6 |
-| **L/14** 336px | [PE-Core-L14-336](https://huggingface.co/facebook/PE-Core-L14-336) | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3 |
-| **G/14** 448px | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448) | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2 |
+| **T/16** 384px | [PE-Core-T-16-384](https://huggingface.co/timm/PE-Core-T-16-384) | 62.1 | 54.7 | 21.1 | 43.9 | 33.0 | 41.5 | 28.8 |
+| **S/16** 384px | [PE-Core-S-16-384](https://huggingface.co/timm/PE-Core-S-16-384) | 72.7 | 65.0 | 49.5 | 60.0 | 42.6 | 55.0 | 39.3 |
+| **B/16** 224px | [PE-Core-B-16](https://huggingface.co/timm/PE-Core-B-16) | 78.4 | 71.7 | 62.4 | 71.9 | 50.9 | 65.6 | 47.6 |
+| **L/14** 336px | [PE-Core-L-14-336](https://huggingface.co/timm/PE-Core-L-14-336) | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3 |
+| **G/14** 448px | [PE-Core-bigG-14-448](https://huggingface.co/timm/PE-Core-bigG-14-448) | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2 |
 
 PE core performs particularly well on the _hard_ benchmarks such as ObjectNet and ImageNet-A.
 
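For reference, a minimal zero-shot classification sketch with these OpenCLIP-remapped weights. It assumes the `timm/PE-Core-S-16-384` hub id shown in the updated table and the standard `open_clip` hf-hub loading path; `example.jpg` is a placeholder image, and the class prompts are illustrative only.

```python
import torch
from PIL import Image
import open_clip

# Hub id taken from the updated table above; loaded via open_clip's hf-hub support.
HF_HUB_ID = 'hf-hub:timm/PE-Core-S-16-384'

model, preprocess = open_clip.create_model_from_pretrained(HF_HUB_ID)
tokenizer = open_clip.get_tokenizer(HF_HUB_ID)
model.eval()

# 'example.jpg' is a placeholder; preprocess resizes/normalizes to the checkpoint's input size.
image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a cat', 'a photo of a dog', 'a photo of a bird'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize both embeddings, then take scaled cosine similarities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Zero-shot probabilities over the text prompts.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```

The `preprocess` transform returned by `create_model_from_pretrained` handles resizing and normalization for the model, so no manual resolution handling is needed.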