Collections
Discover the best community collections!
Collections including paper arxiv:2506.03143

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
  Paper • 2402.15506 • Published • 17
- AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
  Paper • 2404.03648 • Published • 29
- Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
  Paper • 2405.19893 • Published • 32
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
  Paper • 2405.19888 • Published • 7

- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
  Paper • 2412.04454 • Published • 67
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
  Paper • 2506.03143 • Published • 50
- Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
  Paper • 2505.12370 • Published
- UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
  Paper • 2505.12493 • Published

- FocusedAD: Character-centric Movie Audio Description
  Paper • 2504.12157 • Published • 9
- Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
  Paper • 2504.10465 • Published • 27
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
  Paper • 2504.13180 • Published • 17
- OS-Copilot/OS-Atlas-Base-7B
  Image-Text-to-Text • 8B • Updated • 7.07k • 41

- Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
  Paper • 2502.17110 • Published • 13
- WebGames: Challenging General-Purpose Web-Browsing AI Agents
  Paper • 2502.18356 • Published • 13
- VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
  Paper • 2502.18906 • Published • 12
- AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
  Paper • 2503.02268 • Published • 11

- End-to-End Goal-Driven Web Navigation
  Paper • 1602.02261 • Published
- Learning Language Games through Interaction
  Paper • 1606.02447 • Published
- Naturalizing a Programming Language via Interactive Learning
  Paper • 1704.06956 • Published
- Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
  Paper • 1802.08802 • Published • 1

- GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies
  Paper • 2506.14477 • Published
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos
  Paper • 2406.10227 • Published • 9
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
  Paper • 2506.03143 • Published • 50
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
  Paper • 2503.21620 • Published • 63

- Reinforcement Pre-Training
  Paper • 2506.08007 • Published • 249
- Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
  Paper • 2506.06395 • Published • 128
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
  Paper • 2506.05176 • Published • 67
- Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
  Paper • 2505.24726 • Published • 264

- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 28
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 51