Collections
Discover the best community collections!
Collections including paper arxiv:2406.08451

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 45
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 24

- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
  Paper • 2310.11441 • Published • 29
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  Paper • 2501.12326 • Published • 65
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
  Paper • 2406.08451 • Published • 26
- GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
  Paper • 2406.10819 • Published • 2

- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
  Paper • 2406.11833 • Published • 64
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
  Paper • 2406.11230 • Published • 35
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
  Paper • 2406.14035 • Published • 13
- Needle In A Multimodal Haystack
  Paper • 2406.07230 • Published • 55

- End-to-End Goal-Driven Web Navigation
  Paper • 1602.02261 • Published
- Learning Language Games through Interaction
  Paper • 1606.02447 • Published
- Naturalizing a Programming Language via Interactive Learning
  Paper • 1704.06956 • Published
- Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
  Paper • 1802.08802 • Published • 1

- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
  Paper • 2412.04454 • Published • 69
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
  Paper • 2506.03143 • Published • 52
- Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
  Paper • 2505.12370 • Published
- UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
  Paper • 2505.12493 • Published

- Getting it Right: Improving Spatial Consistency in Text-to-Image Models
  Paper • 2404.01197 • Published • 32
- CosmicMan: A Text-to-Image Foundation Model for Humans
  Paper • 2404.01294 • Published • 16
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
  Paper • 2406.08707 • Published • 17
- DataComp-LM: In search of the next generation of training sets for language models
  Paper • 2406.11794 • Published • 55

- NExT-GPT: Any-to-Any Multimodal LLM
  Paper • 2309.05519 • Published • 78
- Large Language Model for Science: A Study on P vs. NP
  Paper • 2309.05689 • Published • 21
- AstroLLaMA: Towards Specialized Foundation Models in Astronomy
  Paper • 2309.06126 • Published • 18
- Large Language Models for Compiler Optimization
  Paper • 2309.07062 • Published • 24