Long-CLIP Needs Registers. And Gated MLPs. And +20M params.

Fixing Long-CLIP's modality gap via happy little accidents.

  • ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗

ℹ️ Update 02/June/2025:

  • You can now load the model with HF 'transformers'. ✅
  • Unfortunately, AutoModel produced nonsense, and I couldn't get "trust_remote_code=True" to work properly (using that was suggested in response to my pull request on GitHub).

💡 Alas, you will need to:

  • Download the 'hfmodel' folder
  • Use it to manually import the correct (my custom) CLIPModel code required by the config.json
  • Minimal example code:
import torch
import torch.nn.functional as F
from math import radians, cos, sin
from PIL import Image, ImageDraw
from transformers import CLIPProcessor
from hfmodel.modeling_clip import CLIPModel  # custom CLIPModel from the downloaded 'hfmodel' folder

model = CLIPModel.from_pretrained("zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

size = 224
im = Image.new("RGB", (size, size), (255, 255, 255))
draw = ImageDraw.Draw(im)

# --------- GPT-4.1's idea of a pineapple. Need an input image... ---------
body_bbox = [size*0.28, size*0.38, size*0.72, size*0.90]
draw.ellipse(body_bbox, fill=(254, 221, 72), outline=(180, 120, 0), width=5)
eye_color = (198, 134, 66)
for row in range(4):
    for col in range(3):
        ex = size*0.36 + col*size*0.09 + (row%2)*size*0.045
        ey = size*0.50 + row*size*0.085
        ew, eh = size*0.035, size*0.025
        draw.ellipse([ex-ew, ey-eh, ex+ew, ey+eh], fill=eye_color, outline=None)
leaf_color = (61, 179, 70)
leaf_base_x = size/2
leaf_base_y = size*0.38
for angle, length in [(-28, 65), (-12, 70), (0, 80), (12, 70), (28, 65)]:
    a = radians(angle)
    tip_x = leaf_base_x + length*sin(a)
    tip_y = leaf_base_y - length*cos(a)
    left = (leaf_base_x + 13*cos(a+1.5), leaf_base_y + 13*sin(a+1.5))
    right = (leaf_base_x + 13*cos(a-1.5), leaf_base_y + 13*sin(a-1.5))
    draw.polygon([left, (tip_x, tip_y), right], fill=leaf_color)
im.save("pineapple.png")
# ---------

image = Image.open("pineapple.png").convert("RGB")
texts = ["pine", "apple", "pineapple", "orange", "pear", "person", "cat", "dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)
    image_embeds = outputs.image_embeds
    text_embeds = outputs.text_embeds

# normalize and compute cosine similarity between the image and each text prompt
image_embeds = F.normalize(image_embeds, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)
cos_sim = (image_embeds @ text_embeds.T).squeeze(0)

for text, sim in zip(texts, cos_sim):
    print(f"Similarity with '{text}': {sim.item():.4f}")


I just want a new Text Encoder...

  • ...for my Text-to-Image (Text-to-Video) AI! \o/
  • Here you go: 👉 direct download 👈
  • The model accepts up to 248 input tokens (instead of CLIP-L's 77; see the quick sanity check below)
  • Replace your CLIP-L with this Long-CLIP (e.g. ComfyUI natively supports Long-CLIP).
  • Enjoy! (You don't need to do anything else, they're just normal CLIP Text Encoders!)
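Optional sanity check (not needed for ComfyUI): a minimal sketch, assuming 'safetensors' is installed, that peeks into the TE-only file and confirms the 248-token text positional embedding. The filename is a placeholder; point it at whatever TE-only file you downloaded.

from safetensors import safe_open

te_path = "LongCLIP-TE-only.safetensors"  # placeholder, use your downloaded TE-only file

with safe_open(te_path, framework="pt") as f:
    for key in f.keys():
        # print any positional-embedding tensors; the text one should have 248 positions
        # (a stock CLIP-L text encoder has 77)
        if "position" in key.lower() and "embed" in key.lower():
            print(key, tuple(f.get_tensor(key).shape))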

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️

  • The ViT (Vision Encoder) is basically a big mutant. Alas:
  • The full model .safetensors have the 'import clip' (OpenAI) structure inside (see the key-dump sketch after this list).
  • It currently runs with 'import clip' code (I'm working on an HF implementation, though!).
  • For more info, see also (not-a-long CLIP-L version, 77 tokens): zer0int/CLIP-Registers-Gated_MLP-ViT-L-14
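Curious what that structure looks like? A tiny sketch (the filename is a placeholder for the FULL model you downloaded) that just dumps a few keys, which follow the OpenAI 'import clip' naming rather than the HF 'vision_model.' / 'text_model.' prefixes:

from safetensors.torch import load_file

state_dict = load_file("LongCLIP-FULL.safetensors")  # placeholder filename

# expect OpenAI-style keys such as 'visual.transformer.resblocks.0.attn.in_proj_weight',
# 'token_embedding.weight', 'positional_embedding', 'logit_scale'
for key in sorted(state_dict.keys())[:15]:
    print(key, tuple(state_dict[key].shape))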

✅ Info / Using the full model

  • Models available: FULL, TE-only (Text Encoder only), LongCLIP-L.safetensors
  • LongCLIP-L is the original model from BeichenZhang/LongCLIP-L.
  • It's included just so you don't have to download a danger-pickle. :)
  • git clone my repo, github.com/zer0int/CLIP-fine-tune-registers-gated
  • Put the FULL model and the LongCLIP-L from this HF repo into a 'models' subfolder
  • You're all set! I made an entire playground for the models (+ safetensors loading)! 🎉
  • PS: All code for fine-tuning it yourself is also included on my Git! 🤗
| Metric | LongCLIP-L (original) | Long-ViT-L/14 Register-Tokens, X-GATED |
|---|---|---|
| VoC-2007 multilabel, mAP | 0.8221 | 0.8403 |
| MSCOCO Image Retrieval, Recall@5 | 0.2761 | 0.3663 |
| MSCOCO Text Retrieval, Recall@5 | 0.3314 | 0.5398 |
| CIFAR10 Linear Probe, Acc@1 | 0.9809 | 0.9812 |
| CIFAR10 Linear Probe, Acc@5 | 0.9998 | 0.9997 |
| CIFAR10 LP Mean Recall | 0.9809 | 0.9812 |
| ImageNet/ObjectNet MVT (Zero-Shot) | 0.8103 | 0.8724 |
| ILSVRC2012 LP, Top-1 | 66.95% | 66.84% |
| ILSVRC2012 LP, Top-5 | 91.87% | 91.70% |
| Modality Gap (Euclidean) | 1.0672 ⚠️ | 0.5781 ✅ |
| Img-Text Cosine (Mean) | 0.2666 | 0.4711 |
| Img-Text Cosine (Std Dev) | 0.0191 | 0.0726 |
| Txt-Text Cosine (Mean) | 0.8421 | 0.7046 |
| Txt-Text Cosine (Std Dev) | 0.0707 | 0.1498 |
| Jensen-Shannon Divergence (JSD) | 0.3847 | 0.1894 |
| Wasserstein Distance | 0.5755 | 0.2335 |
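For reference, the gap-style metrics above boil down to something like the sketch below (under the common definitions, not necessarily the exact evaluation script behind these numbers): the modality gap is the Euclidean distance between the centroids of the L2-normalized image and text embeddings, and the cosine rows are mean/std statistics over matching image-text pairs.

import torch
import torch.nn.functional as F

def modality_gap_metrics(image_embeds: torch.Tensor, text_embeds: torch.Tensor):
    # expects N matching image/caption pairs: both tensors are [N, D]
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    gap = (img.mean(dim=0) - txt.mean(dim=0)).norm(p=2)  # Euclidean distance between modality centroids
    paired_cos = (img * txt).sum(dim=-1)                 # cosine similarity of each matching pair
    return gap.item(), paired_cos.mean().item(), paired_cos.std().item()

# e.g.: gap, cos_mean, cos_std = modality_gap_metrics(image_embeds, text_embeds)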