- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections including paper arxiv:2405.18669
- OpenBuddy/openbuddy-codellama2-34b-v11.1-bf16
  Text Generation • Updated • 2.84k • 11
- mistralai/Codestral-22B-v0.1
  22B • Updated • 16.9k • 1.28k
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
  Text Generation • 16B • Updated • 465k • 455
- Magpie-Align/Magpie-Qwen2.5-Coder-Pro-300K-v0.1
  Viewer • Updated • 300k • 42 • 4
- Gemini: A Family of Highly Capable Multimodal Models
  Paper • 2312.11805 • Published • 46
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models
  Paper • 2312.14233 • Published • 17
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
  Paper • 2405.18669 • Published • 12
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
  Paper • 2506.09344 • Published • 27
- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 56
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 35
- parler-tts/parler_tts_mini_v0.1
  Text-to-Speech • 0.6B • Updated • 7.1k • 358
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
  Paper • 2405.08317 • Published • 13
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
  Paper • 2405.18669 • Published • 12
- Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
  Paper • 2406.02430 • Published • 38