-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections
Discover the best community collections!
Collections including paper arxiv:2503.05236
-
OmniGen2: Exploration to Advanced Multimodal Generation
Paper • 2506.18871 • Published • 73 -
OmniGen: Unified Image Generation
Paper • 2409.11340 • Published • 116 -
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Paper • 2502.05415 • Published • 22 -
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 52
-
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Paper • 2503.10615 • Published • 17 -
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Paper • 2503.10630 • Published • 6 -
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Paper • 2503.09516 • Published • 35 -
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Paper • 2503.07536 • Published • 88
-
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Paper • 2502.14768 • Published • 48 -
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Paper • 2502.12853 • Published • 29 -
Diverse Inference and Verification for Advanced Reasoning
Paper • 2502.09955 • Published • 18 -
Distillation Scaling Laws
Paper • 2502.08606 • Published • 49
-
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 40 -
Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 47 -
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper • 2412.20993 • Published • 38 -
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Paper • 2412.17256 • Published • 48
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 17 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 56 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 35
-
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Paper • 2503.03601 • Published • 233 -
Transformers without Normalization
Paper • 2503.10622 • Published • 167 -
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper • 2503.14456 • Published • 153 -
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper • 2503.11647 • Published • 144
-
RuCCoD: Towards Automated ICD Coding in Russian
Paper • 2502.21263 • Published • 133 -
Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 124 -
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Paper • 2503.05179 • Published • 46 -
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Paper • 2503.05592 • Published • 27
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 32 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 27 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 105
-
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Paper • 2410.17637 • Published • 37 -
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 84 -
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper • 2411.18203 • Published • 40 -
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 26
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 17 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 56 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 35
-
OmniGen2: Exploration to Advanced Multimodal Generation
Paper • 2506.18871 • Published • 73 -
OmniGen: Unified Image Generation
Paper • 2409.11340 • Published • 116 -
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Paper • 2502.05415 • Published • 22 -
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 52
-
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Paper • 2503.03601 • Published • 233 -
Transformers without Normalization
Paper • 2503.10622 • Published • 167 -
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper • 2503.14456 • Published • 153 -
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper • 2503.11647 • Published • 144
-
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Paper • 2503.10615 • Published • 17 -
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Paper • 2503.10630 • Published • 6 -
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Paper • 2503.09516 • Published • 35 -
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Paper • 2503.07536 • Published • 88
-
RuCCoD: Towards Automated ICD Coding in Russian
Paper • 2502.21263 • Published • 133 -
Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 124 -
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Paper • 2503.05179 • Published • 46 -
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Paper • 2503.05592 • Published • 27
-
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Paper • 2502.14768 • Published • 48 -
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Paper • 2502.12853 • Published • 29 -
Diverse Inference and Verification for Advanced Reasoning
Paper • 2502.09955 • Published • 18 -
Distillation Scaling Laws
Paper • 2502.08606 • Published • 49
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 32 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 27 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 105
-
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 40 -
Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 47 -
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper • 2412.20993 • Published • 38 -
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Paper • 2412.17256 • Published • 48
-
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Paper • 2410.17637 • Published • 37 -
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 84 -
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper • 2411.18203 • Published • 40 -
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 26