Collections
Discover the best community collections!
Collections including paper arxiv:2506.03143

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
  Paper • 2402.15506 • Published • 17
- AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
  Paper • 2404.03648 • Published • 29
- Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
  Paper • 2405.19893 • Published • 32
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
  Paper • 2405.19888 • Published • 7

- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
  Paper • 2412.04454 • Published • 67
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
  Paper • 2506.03143 • Published • 50
- Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
  Paper • 2505.12370 • Published
- UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
  Paper • 2505.12493 • Published

- FocusedAD: Character-centric Movie Audio Description
  Paper • 2504.12157 • Published • 9
- Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
  Paper • 2504.10465 • Published • 27
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
  Paper • 2504.13180 • Published • 17
- OS-Copilot/OS-Atlas-Base-7B
  Image-Text-to-Text • 8B • Updated • 7.07k • 41

- Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
  Paper • 2502.17110 • Published • 13
- WebGames: Challenging General-Purpose Web-Browsing AI Agents
  Paper • 2502.18356 • Published • 13
- VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
  Paper • 2502.18906 • Published • 12
- AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
  Paper • 2503.02268 • Published • 11

- End-to-End Goal-Driven Web Navigation
  Paper • 1602.02261 • Published
- Learning Language Games through Interaction
  Paper • 1606.02447 • Published
- Naturalizing a Programming Language via Interactive Learning
  Paper • 1704.06956 • Published
- Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
  Paper • 1802.08802 • Published • 1

- GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies
  Paper • 2506.14477 • Published
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos
  Paper • 2406.10227 • Published • 9
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
  Paper • 2506.03143 • Published • 50
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
  Paper • 2503.21620 • Published • 63

- Reinforcement Pre-Training
  Paper • 2506.08007 • Published • 249
- Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
  Paper • 2506.06395 • Published • 128
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
  Paper • 2506.05176 • Published • 67
- Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
  Paper • 2505.24726 • Published • 264

- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 28
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 51