merve PRO
- HuggingFaceTB/SmolLM3-3B
  Text Generation • 3B • Updated • 597k • 612
- moonshotai/Kimi-K2-Instruct
  Text Generation • Updated • 315k • 1.93k
- fal/Realism-Detailer-Kontext-Dev-LoRA
  Image-to-Image • Updated • 1.99k • 32
- Alibaba-NLP/WebSailor-3B
  3B • Updated • 624 • 64
- nari-labs/Dia-1.6B-0626
  Text-to-Speech • 2B • Updated • 75.1k • 62
- google/gemma-3n-E4B-it
  Image-Text-to-Text • 8B • Updated • 263k • 653
- ByteDance/XVerse
  Text-to-Image • Updated • 1.21k • 86
- nvidia/llama-nemoretriever-colembed-3b-v1
  Visual Document Retrieval • 4B • Updated • 566 • 35
- opendatalab/OmniDocBench
  Viewer • Updated • 984 • 5.72k • 28
- nanonets/Nanonets-OCR-s
  Image-Text-to-Text • 4B • Updated • 173k • 1.44k
- echo840/MonkeyOCR
  Image-Text-to-Text • Updated • 20k • 495
- OCR2
  Running on Zero • MCP • 115
  monkey ocr / nanonets ocr / smoldocling / typhoon ocr
- ByteDance-Seed/BAGEL-7B-MoT
  Any-to-Any • 15B • Updated • 1.14k • 1.09k
- mistralai/Devstral-Small-2505
  24B • Updated • 74.9k • 838
- ByteDance/Dolphin
  Image-Text-to-Text • 0.4B • Updated • 18.4k • 439
- moondream/moondream-2b-2025-04-14-4bit
  Image-Text-to-Text • 1B • Updated • 7.96k • 52
- moonshotai/Kimi-VL-A3B-Thinking
  Image-Text-to-Text • 16B • Updated • 111k • 431
- agentica-org/DeepCoder-14B-Preview
  Text Generation • 15B • Updated • 19.9k • 668
- HiDream-ai/HiDream-I1-Full
  Text-to-Image • Updated • 247k • 948
- OpenGVLab/InternVL3-78B
  Image-Text-to-Text • 78B • Updated • 123k • 208
- OpenGVLab/InternVideo2_5_Chat_8B
  Video-Text-to-Text • 8B • Updated • 23.1k • 73
- AIDC-AI/Ovis2-34B
  Image-Text-to-Text • 35B • Updated • 617 • 150
- open-r1/OpenR1-Qwen-7B
  Text Generation • 8B • Updated • 1.01k • 53
- nomic-ai/nomic-embed-text-v2-moe
  Sentence Similarity • 0.5B • Updated • 161k • 416
- allenai/Llama-3.1-Tulu-3-405B
  Text Generation • 406B • Updated • 468 • 107
- Qwen/Qwen2.5-VL-72B-Instruct
  Image-Text-to-Text • 73B • Updated • 556k • 516
- mistralai/Mistral-Small-24B-Instruct-2501
  24B • Updated • 95.8k • 931
- deepseek-ai/Janus-Pro-7B
  Any-to-Any • Updated • 128k • 3.47k
- ostris/Flex.1-alpha
  Text-to-Image • Updated • 1.44k • 467
- Qwen/Qwen2.5-Math-PRM-72B
  Text Classification • 73B • Updated • 894 • 73
- HuggingFaceTB/SmolVLM-500M-Instruct
  Image-Text-to-Text • 0.5B • Updated • 25.7k • 164
- deepseek-ai/DeepSeek-R1
  Text Generation • 685B • Updated • 975k • 12.5k
- HuggingFaceTB/SmolVLM-Instruct
  Image-Text-to-Text • 2B • Updated • 92.3k • 525
- Qwen/QwQ-32B-Preview
  Text Generation • 33B • Updated • 26.2k • 1.74k
- nvidia/Hymba-1.5B-Base
  Text Generation • 2B • Updated • 5.41k • 146
- vidore/colsmolvlm-v0.1
  Visual Document Retrieval • Updated • 1.57k • 52
- microsoft/LLM2CLIP-EVA02-L-14-336
  Zero-Shot Image Classification • Updated • 100 • 58
- microsoft/LLM2CLIP-EVA02-B-16
  Updated • 17 • 10
- PleIAs/common_corpus
  Viewer • Updated • 470M • 52.2k • 304
- Qwen/Qwen2.5-Coder-32B-Instruct
  Text Generation • 33B • Updated • 85.5k • 1.91k
- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 75
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 19
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 48
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 122
- LOTUS Normal
  Runtime error • 101
  Generate high-quality predictions from images
- LOTUS Depth
  Runtime error • 75
  Generate depth maps from images and videos
- jingheya/lotus-depth-g-v1-0
  Depth Estimation • Updated • 16k • 24
- jingheya/lotus-depth-d-v1-0
  Depth Estimation • Updated • 357 • 5
- facebook/dinov2-large
  Image Feature Extraction • 0.3B • Updated • 866k • 88
- google/flan-t5-xl
  3B • Updated • 282k • 513
- google/siglip-large-patch16-384
  Zero-Shot Image Classification • 0.7B • Updated • 19.9k • 9
- google/vit-huge-patch14-224-in21k
  Image Feature Extraction • 0.6B • Updated • 29k • 21
- facebook/deit-base-distilled-patch16-384
  Image Classification • 0.1B • Updated • 1.24k • 5
- facebook/convnextv2-base-1k-224
  Image Classification • 0.1B • Updated • 705 • 3
- facebook/deit-base-distilled-patch16-224
  Image Classification • Updated • 14.6k • 27
- google/vit-base-patch32-384
  Image Classification • 0.1B • Updated • 5.44k • 23
- facebook/maskformer-swin-large-coco
  Image Segmentation • 0.2B • Updated • 384k • 25
- nvidia/segformer-b0-finetuned-ade-512-512
  Image Segmentation • 0.0B • Updated • 320k • 162
- facebook/detr-resnet-50-dc5-panoptic
  Image Segmentation • 0.0B • Updated • 124 • 3
- nvidia/segformer-b5-finetuned-cityscapes-1024-1024
  Image Segmentation • Updated • 136k • 27
- timbrooks/instruct-pix2pix
  Image-to-Image • Updated • 38.3k • 1.13k
- TencentARC/t2i-adapter-canny-sdxl-1.0
  Image-to-Image • Updated • 4.7k • 52
- TencentARC/t2i-adapter-sketch-sdxl-1.0
  Image-to-Image • Updated • 5.25k • 76
- CrucibleAI/ControlNetMediaPipeFace
  Image-to-Image • Updated • 885 • 571
- Salesforce/blip-image-captioning-large
  Image-to-Text • 0.5B • Updated • 1.67M • 1.38k
- Salesforce/blip-image-captioning-base
  Image-to-Text • Updated • 1.92M • 761
- microsoft/trocr-base-handwritten
  Image-to-Text • 0.3B • Updated • 473k • 420
- microsoft/git-large-coco
  Image-to-Text • 0.4B • Updated • 2.47k • 104
- Grounding DINO Demo
  Running • 85
  Cutting edge open-vocabulary object detection app
- Owlv2
  Running • 88
  State-of-the-art Zero-shot Object Detection
- BLIP2 with transformers
  Runtime error • 41
  BLIP2 (cutting edge image captioning) in transformers
- IDEFICS Playground
  Runtime error • 377
- Owlv2
  Running • 88
  State-of-the-art Zero-shot Object Detection
- Owl Tracking
  Running on Zero • 64
  Powerful foundation model for zero-shot object tracking
- Search and Detect (CLIP/OWL-ViT)
  Running • 25
  Search and detect objects in images using text queries
- OWLSAM
  Running on Zero • 102
  State-of-the-art open-vocabulary image segmentation
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 38
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
  Paper • 2403.05525 • Published • 47
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 9
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
  Paper • 2404.01331 • Published • 28
- google/owlvit-base-patch32
  Zero-Shot Object Detection • 0.2B • Updated • 139k • 135
- google/owlvit-base-patch16
  Zero-Shot Object Detection • Updated • 7.77k • 12
- google/owlvit-large-patch14
  Zero-Shot Object Detection • Updated • 46.3k • 25
- google/owlv2-base-patch16
  Zero-Shot Object Detection • 0.2B • Updated • 85.3k • 27
- depth-anything/Depth-Anything-V2-Small
  Depth Estimation • Updated • 10.5k • 69
- depth-anything/Depth-Anything-V2-Large
  Depth Estimation • Updated • 94.1k • 111
- Depth Anything V2
  Running on Zero • 492
  Generate depth maps from images
- depth-anything/DA-2K
  Viewer • Updated • 1.04k • 743 • 12
- Vidore Leaderboard
  Running • 166
  Display document retrieval leaderboard data
- Open VLM Leaderboard
  Running on CPU Upgrade • 837
  VLMEvalKit Evaluation Results Collection
- Vision Arena (Testing VLMs side-by-side)
  Running • 551
  Analyze images to detect and label objects
- SEED-Bench Leaderboard
  Running • 85
- vidore/colpali-v1.2
  Visual Document Retrieval • Updated • 42.3k • 109
- Qwen/Qwen2-VL-7B-Instruct
  Image-Text-to-Text • 8B • Updated • 676k • 1.22k
- Qwen/Qwen2-VL-2B-Instruct
  Image-Text-to-Text • 2B • Updated • 723k • 433
- Qwen/Qwen2-72B-Instruct
  Text Generation • 73B • Updated • 46.1k • 715
- nvidia/OpenReasoning-Nemotron-32B
  Text Generation • 33B • Updated • 2.35k • 103
- ByteDance-Seed/Seed-X-RM-7B
  Updated • 306 • 24
- LGAI-EXAONE/EXAONE-4.0-32B
  Text Generation • 32B • Updated • 541k • 205
- vidore/colqwen-omni-v0.1
  Visual Document Retrieval • Updated • 3.35k • 80
- Qwen/WorldPM-72B
  Text Classification • 73B • Updated • 1.17k • 73
- LTX Video Fast
  Running on Zero • MCP • 1.03k
  ultra-fast video model, LTX 0.9.8 13B distilled
- BLIP3o/BLIP3o-Pretrain-Long-Caption
  Viewer • Updated • 27.2M • 21k • 41
- BLIP3o/BLIP3o-Model-8B
  14B • Updated • 1.63k • 100
- OpenGVLab/InternVL3-1B-hf
  Image-Text-to-Text • 0.9B • Updated • 42.7k • 5
- OpenGVLab/InternVL3-2B-hf
  Image-Text-to-Text • 2B • Updated • 29.3k • 2
- OpenGVLab/InternVL3-8B-hf
  Image-Text-to-Text • 8B • Updated • 41.4k • 7
- OpenGVLab/InternVL3-14B-hf
  Image-Text-to-Text • 15B • Updated • 5.37k
- deepseek-ai/DeepSeek-V3-0324
  Text Generation • 685B • Updated • 529k • 3.01k
- Qwen/Qwen2.5-Omni-7B
  Any-to-Any • 11B • Updated • 123k • 1.73k
- google/txgemma-27b-chat
  Text Generation • 27B • Updated • 1.24k • 54
- Qwen2.5 Omni 7B Demo
  Running • 332
  Generate text and speech responses from various inputs
- Qwen/Qwen2-VL-7B-Instruct
  Image-Text-to-Text • 8B • Updated • 676k • 1.22k
- Qwen/Qwen2-VL-2B-Instruct
  Image-Text-to-Text • 2B • Updated • 723k • 433
- CohereLabs/aya-vision-8b
  Image-Text-to-Text • 9B • Updated • 26.6k • 302
- CohereLabs/aya-vision-32b
  Image-Text-to-Text • 33B • Updated • 175 • 211
- Qwen2-VL-7B
  Running on Zero • 255
  Generate text by combining an image and a question
- UI-TARS
  Running • 57
  Select coordinates on an image based on instructions
- Qwen2.5-1M Demo
  Running • 87
  Upload documents and ask questions
- Qwen/Qwen2.5-14B-Instruct-1M
  Text Generation • 15B • Updated • 22k • 316
- meta-llama/Llama-3.3-70B-Instruct
  Text Generation • 71B • Updated • 408k • 2.45k
- Qwen/Qwen2-VL-72B
  Image-Text-to-Text • 73B • Updated • 847 • 79
- google/paligemma2-3b-pt-224
  Image-Text-to-Text • 3B • Updated • 181k • 154
- tencent/HunyuanVideo
  Text-to-Video • Updated • 1.11k • 2k
- mistralai/Pixtral-Large-Instruct-2411
  Updated • 79 • 417
- microsoft/orca-agentinstruct-1M-v1
  Viewer • Updated • 1.05M • 6.77k • 447
- Xkev/Llama-3.2V-11B-cot
  Image-Text-to-Text • 11B • Updated • 3.79k • 153
- jinaai/jina-clip-v2
  Feature Extraction • 0.9B • Updated • 34.3k • 266
- ibm-granite/granite-3.0-8b-instruct
  Text Generation • 8B • Updated • 22.8k • 202
- ibm-granite/granite-3.0-2b-instruct
  Text Generation • 3B • Updated • 4.18k • 46
- CohereLabs/aya-expanse-8b
  Text Generation • 8B • Updated • 19.2k • 386
- CohereLabs/aya-expanse-32b
  Text Generation • 32B • Updated • 6.66k • 264
- microsoft/resnet-50
  Image Classification • 0.0B • Updated • 151k • 431
- google/vit-base-patch16-224-in21k
  Image Feature Extraction • 0.1B • Updated • 3.29M • 356
- google/vit-base-patch32-224-in21k
  Image Feature Extraction • 0.1B • Updated • 8.07k • 19
- facebook/dinov2-large
  Image Feature Extraction • 0.3B • Updated • 866k • 88
- facebook/detr-resnet-50
  Object Detection • 0.0B • Updated • 478k • 879
- facebook/detr-resnet-101-dc5
  Object Detection • 0.1B • Updated • 5.08k • 19
- facebook/detr-resnet-50-dc5
  Object Detection • 0.0B • Updated • 1.71k • 6
- google/owlvit-base-patch32
  Zero-Shot Object Detection • 0.2B • Updated • 139k • 135
- openai/clip-vit-large-patch14
  Zero-Shot Image Classification • 0.4B • Updated • 11.3M • 1.82k
- openai/clip-vit-base-patch32
  Zero-Shot Image Classification • Updated • 18.4M • 730
- laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
  Zero-Shot Image Classification • Updated • 556k • 284
- kakaobrain/align-base
  Zero-Shot Image Classification • Updated • 22.6k • 26
- microsoft/xclip-base-patch32
  Video Classification • 0.2B • Updated • 253k • 96
- facebook/timesformer-base-finetuned-k400
  Video Classification • Updated • 25.6k • 42
- facebook/timesformer-base-finetuned-k600
  Video Classification • Updated • 11k • 12
- google/vivit-b-16x2
  Video Classification • Updated • 367 • 11
- stabilityai/stable-diffusion-xl-base-1.0
  Text-to-Image • Updated • 2.47M • 6.78k
- warp-ai/wuerstchen
  Text-to-Image • Updated • 418 • 174
- Deci/DeciDiffusion-v1-0
  Text-to-Image • Updated • 7 • 138
- stabilityai/stable-diffusion-xl-refiner-1.0
  Image-to-Image • Updated • 479k • 1.94k
- Draw To Search Art
  Running on Zero • 71
  Draw/upload image and search among WikiART using SigLIP
- Compare Clip Siglip
  Running on CPU Upgrade • 22
  Compare strong zero-shot image classification models
- Multilingual Zero Shot Image Clf
  Running on Zero • 13
  Comparing powerful multilingual zero-shot image clf models
- BAAI/bunny-phi-2-siglip-lora
  Text Generation • Updated • 24 • 48
- google/owlvit-base-patch32
  Zero-Shot Object Detection • 0.2B • Updated • 139k • 135
- google/owlvit-base-patch16
  Zero-Shot Object Detection • Updated • 7.77k • 12
- google/owlvit-large-patch14
  Zero-Shot Object Detection • Updated • 46.3k • 25
- google/owlv2-base-patch16
  Zero-Shot Object Detection • 0.2B • Updated • 85.3k • 27
- google/owlvit-base-patch32
  Zero-Shot Object Detection • 0.2B • Updated • 139k • 135
- google/owlvit-base-patch16
  Zero-Shot Object Detection • Updated • 7.77k • 12
- google/owlvit-large-patch14
  Zero-Shot Object Detection • Updated • 46.3k • 25
- google/owlv2-base-patch16
  Zero-Shot Object Detection • 0.2B • Updated • 85.3k • 27
- Video Llava
  Running • 21
  Generate descriptions by uploading images or videos
- llava-hf/LLaVA-NeXT-Video-7B-hf
  Video-Text-to-Text • 7B • Updated • 76.7k • 102
- llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
  Video-Text-to-Text • 7B • Updated • 1.8k • 9
- llava-hf/LLaVA-NeXT-Video-7B-32K-hf
  Image-Text-to-Text • 8B • Updated • 738 • 7
- NVEagle/Eagle-X5-13B
  Image-Text-to-Text • 15B • Updated • 55 • 15
- NVEagle/Eagle-X5-13B-Chat
  Image-Text-to-Text • 15B • Updated • 889 • 28
- NVEagle/Eagle-X5-7B
  Image-Text-to-Text • 9B • Updated • 1.23k • 26
- Eagle X5 13B Chat
  Running on Zero • 64
  Combine text and images to generate responses