Collections
Discover the best community collections!
Collections including paper arxiv:2503.19325

- Seedance 1.0: Exploring the Boundaries of Video Generation Models
  Paper • 2506.09113 • Published • 98
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
  Paper • 2506.08009 • Published • 26
- Seeing Voices: Generating A-Roll Video from Audio with Mirage
  Paper • 2506.08279 • Published • 28
- PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement
  Paper • 2506.07848 • Published • 4

- Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
  Paper • 2502.20388 • Published • 16
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction
  Paper • 2503.19325 • Published • 73
- Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing
  Paper • 2503.19385 • Published • 34
- Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
  Paper • 2504.12626 • Published • 52

- Long-Context Autoregressive Video Modeling with Next-Frame Prediction
  Paper • 2503.19325 • Published • 73
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
  Paper • 2506.09113 • Published • 98
- Discrete Diffusion in Large Language and Multimodal Models: A Survey
  Paper • 2506.13759 • Published • 42

- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
  Paper • 2401.09985 • Published • 18
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
  Paper • 2401.09962 • Published • 9
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
  Paper • 2401.10404 • Published • 11
- ActAnywhere: Subject-Aware Video Background Generation
  Paper • 2401.10822 • Published • 13

- Long-Context Autoregressive Video Modeling with Next-Frame Prediction
  Paper • 2503.19325 • Published • 73
- CoMP: Continual Multimodal Pre-training for Vision Foundation Models
  Paper • 2503.18931 • Published • 30
- One RL to See Them All: Visual Triple Unified Reinforcement Learning
  Paper • 2505.18129 • Published • 60
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
  Paper • 2507.15852 • Published • 34

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 32
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 27
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 105

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23