Long-CLIP Needs Registers. And Gated MLPs. And +20M params.
Fixing Long-CLIP's modality gap via happy little accidents.
- ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗
ℹ️ Update 02/June/2025:
- You can now load the model with HF 'transformers'. ✅
- Unfortunately, AutoModel produced nonsense, and I couldn't get "trust_remote_code=True" to work properly (using that was suggested in response to my pull request on GitHub).
💡 Alas, you will need to:
- Download the 'hfmodel' folder
- Use it to manually import the correct (my custom) CLIPModel code required by the config.json
- Minimal example code:
```python
import torch
import torch.nn.functional as F
from math import radians, cos, sin
from PIL import Image, ImageDraw
from transformers import CLIPProcessor

# Custom CLIPModel from the downloaded 'hfmodel' folder (see above)
from hfmodel.modeling_clip import CLIPModel

model = CLIPModel.from_pretrained("zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# --------- GPT-4.1's idea of a pineapple. Need an input image... ---------
size = 224
im = Image.new("RGB", (size, size), (255, 255, 255))
draw = ImageDraw.Draw(im)

# Body: yellow ellipse with a brown outline
body_bbox = [size*0.28, size*0.38, size*0.72, size*0.90]
draw.ellipse(body_bbox, fill=(254, 221, 72), outline=(180, 120, 0), width=5)

# "Eyes": staggered rows of small brown ellipses on the body
eye_color = (198, 134, 66)
for row in range(4):
    for col in range(3):
        ex = size*0.36 + col*size*0.09 + (row % 2)*size*0.045
        ey = size*0.50 + row*size*0.085
        ew, eh = size*0.035, size*0.025
        draw.ellipse([ex-ew, ey-eh, ex+ew, ey+eh], fill=eye_color, outline=None)

# Crown: five green leaves fanning out from the top of the body
leaf_color = (61, 179, 70)
leaf_base_x = size/2
leaf_base_y = size*0.38
for angle, length in [(-28, 65), (-12, 70), (0, 80), (12, 70), (28, 65)]:
    a = radians(angle)
    tip_x = leaf_base_x + length*sin(a)
    tip_y = leaf_base_y - length*cos(a)
    left = (leaf_base_x + 13*cos(a+1.5), leaf_base_y + 13*sin(a+1.5))
    right = (leaf_base_x + 13*cos(a-1.5), leaf_base_y + 13*sin(a-1.5))
    draw.polygon([left, (tip_x, tip_y), right], fill=leaf_color)

im.save("pineapple.png")
# ---------

image = Image.open("pineapple.png").convert("RGB")
texts = ["pine", "apple", "pineapple", "orange", "pear", "person", "cat", "dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the (normalized) image embedding and each text embedding
image_embeds = F.normalize(outputs.image_embeds, dim=-1)
text_embeds = F.normalize(outputs.text_embeds, dim=-1)
cos_sim = (image_embeds @ text_embeds.T).squeeze(0)

for text, sim in zip(texts, cos_sim):
    print(f"Similarity with '{text}': {sim.item():.4f}")
```
I just want a new Text Encoder..
- ...for my Text-to-Image (Text-to-Video) AI! \o/
- Here you go: 👉 direct download 👈
- The model has a 248-token max input (instead of CLIP-L's 77); see the quick check after this list.
- Replace your CLIP-L with this Long-CLIP (e.g. ComfyUI natively supports Long-CLIP).
- Enjoy! (You don't need to do anything else, they're just normal CLIP Text Encoders!)
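Quick check (as referenced above): you can confirm the 248-token context by inspecting the checkpoint's positional embedding with safetensors. A minimal sketch; the filename is a placeholder for whichever TE-only / Long-CLIP .safetensors you downloaded:

```python
# Sketch: confirm the text context length of a downloaded checkpoint.
# The filename is a placeholder - point it at your TE-only / Long-CLIP .safetensors.
from safetensors.torch import load_file

state_dict = load_file("LongCLIP-TE-only.safetensors")

for key, tensor in state_dict.items():
    if "positional_embedding" in key or "position_embedding" in key:
        # A Long-CLIP text encoder should report 248 positions here (vs. 77 for CLIP-L).
        print(key, tuple(tensor.shape))
```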
⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️
- The ViT (Vision Encoder) is basically a big mutant. Alas:
- The full model .safetensors have the 'import clip' (OpenAI) structure inside.
- So, currently, it runs with 'import clip' code (I'm working on an HF implementation, though!).
- For more info, see also (not-a-long CLIP-L version, 77 tokens): zer0int/CLIP-Registers-Gated_MLP-ViT-L-14
✅ Info / Using the full model
- Models available: FULL, TE-only (Text Encoder only), LongCLIP-L.safetensors
- LongCLIP-L is the original model from BeichenZhang/LongCLIP-L.
- It's just so you don't have to download a danger-pickle. :) (If you ever need to do such a conversion yourself, see the sketch after this list.)
- git clone my repo, github.com/zer0int/CLIP-fine-tune-registers-gated
- Put the FULL model and the LongCLIP-L from this HF in a 'models' subfolder
- You're all set! I made an entire playground for the models (+ safetensors loading)! 🎉
- PS: All code for fine-tuning it yourself is also included on my Git! 🤗
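On the danger-pickle point above: if you ever need to convert a pickled checkpoint (.pt) into .safetensors yourself, the usual pattern is a plain state-dict round-trip. A minimal sketch with placeholder paths; note that torch.load on an untrusted pickle is exactly the risk safetensors avoids, so only do this with files you trust:

```python
# Sketch: convert a trusted pickled checkpoint to .safetensors (paths are placeholders).
# Unpickling can execute arbitrary code - never run this on files you don't trust.
import torch
from safetensors.torch import save_file

ckpt = torch.load("longclip-l.pt", map_location="cpu", weights_only=False)

# Checkpoints come in different flavors: a full pickled nn.Module, a dict that
# wraps a 'state_dict', or a bare state dict. Handle the common cases.
if hasattr(ckpt, "state_dict"):
    state_dict = ckpt.state_dict()
elif isinstance(ckpt, dict) and "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

# safetensors wants contiguous tensors.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "LongCLIP-L.safetensors")
```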
| Metric | LongCLIP-L Original | Long-ViT-L/14 Register-Tokens, X-GATED |
|---|---|---|
| VOC-2007 multilabel, mAP | 0.8221 | 0.8403 |
| MSCOCO Image Retrieval Recall@5 | 0.2761 | 0.3663 |
| MSCOCO Text Retrieval Recall@5 | 0.3314 | 0.5398 |
| CIFAR10 Linear Probe Acc@1 | 0.9809 | 0.9812 |
| CIFAR10 Linear Probe Acc@5 | 0.9998 | 0.9997 |
| CIFAR10 LP Mean Recall | 0.9809 | 0.9812 |
| ImageNet/ObjectNet MVT (Zero-Shot) | 0.8103 | 0.8724 |
| ILSVRC2012 LP, Top-1 | 66.95% | 66.84% |
| ILSVRC2012 LP, Top-5 | 91.87% | 91.70% |
| Modality Gap (Euclidean) | 1.0672 ⚠️ | 0.5781 ✅ |
| Img-Text Cosine (Mean) | 0.2666 | 0.4711 |
| Img-Text Cosine (Std Dev) | 0.0191 | 0.0726 |
| Txt-Text Cosine (Mean) | 0.8421 | 0.7046 |
| Txt-Text Cosine (Std Dev) | 0.0707 | 0.1498 |
| Jensen-Shannon Divergence (JSD) | 0.3847 | 0.1894 |
| Wasserstein Distance | 0.5755 | 0.2335 |
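For context on the table: the "Modality Gap (Euclidean)" row is conventionally measured as the distance between the centroids of the L2-normalized image and text embeddings over a paired evaluation set, and the Img-Text Cosine rows as statistics of the matched-pair cosines. A minimal sketch of that conventional computation (whether it matches the exact evaluation script behind these numbers is an assumption):

```python
# Sketch: centroid-distance modality gap and matched-pair cosine statistics,
# given paired image/text embeddings. This follows the common definition; it is
# an assumption that the table above was produced exactly this way.
import torch
import torch.nn.functional as F

def modality_metrics(image_embeds: torch.Tensor, text_embeds: torch.Tensor):
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    gap = (img.mean(dim=0) - txt.mean(dim=0)).norm().item()  # Euclidean centroid gap
    pair_cos = (img * txt).sum(dim=-1)                        # cosine of matched img/text pairs
    return gap, pair_cos.mean().item(), pair_cos.std().item()

# Usage with placeholder embeddings (swap in real paired model outputs):
img_e, txt_e = torch.randn(1000, 768), torch.randn(1000, 768)
gap, cos_mean, cos_std = modality_metrics(img_e, txt_e)
print(f"gap={gap:.4f}  cos_mean={cos_mean:.4f}  cos_std={cos_std:.4f}")
```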