Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2503.12329

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 26
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13 • 37

DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Paper • 2503.12797 • Published Mar 17 • 32
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 26
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 52
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 196

Vision Language Models

BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18, 2024 • 27
TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Paper • 2404.12803 • Published Apr 19, 2024 • 31
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19, 2024 • 32
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9, 2024 • 31

about 5 hours ago

Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Paper • 2401.09048 • Published Jan 17, 2024 • 10
Improving fine-grained understanding in image-text pre-training

Paper • 2401.09865 • Published Jan 18, 2024 • 18
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Paper • 2401.10891 • Published Jan 19, 2024 • 62
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

Paper • 2401.13627 • Published Jan 24, 2024 • 75

image-llm-papers

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 26

Model Evaluation

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

Paper • 2502.07445 • Published Feb 11 • 11
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Paper • 2502.04689 • Published Feb 7 • 7
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Paper • 2502.03032 • Published Feb 5 • 61
Preference Leakage: A Contamination Problem in LLM-as-a-judge

Paper • 2502.01534 • Published Feb 3 • 41

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 225
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Paper • 2311.16502 • Published Nov 27, 2023 • 35
BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18, 2024 • 27
RULER: What's the Real Context Size of Your Long-Context Language Models?

Paper • 2404.06654 • Published Apr 9, 2024 • 39

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 26
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13 • 37

image-llm-papers

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 26

DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Paper • 2503.12797 • Published Mar 17 • 32
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 26
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 52
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 196

Model Evaluation

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

Paper • 2502.07445 • Published Feb 11 • 11
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Paper • 2502.04689 • Published Feb 7 • 7
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Paper • 2502.03032 • Published Feb 5 • 61
Preference Leakage: A Contamination Problem in LLM-as-a-judge

Paper • 2502.01534 • Published Feb 3 • 41

Vision Language Models

BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18, 2024 • 27
TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Paper • 2404.12803 • Published Apr 19, 2024 • 31
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19, 2024 • 32
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9, 2024 • 31

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 225
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Paper • 2311.16502 • Published Nov 27, 2023 • 35
BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18, 2024 • 27
RULER: What's the Real Context Size of Your Long-Context Language Models?

Paper • 2404.06654 • Published Apr 9, 2024 • 39

about 5 hours ago

Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Paper • 2401.09048 • Published Jan 17, 2024 • 10
Improving fine-grained understanding in image-text pre-training

Paper • 2401.09865 • Published Jan 18, 2024 • 18
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Paper • 2401.10891 • Published Jan 19, 2024 • 62
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

Paper • 2401.13627 • Published Jan 24, 2024 • 75

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs