Trending Papers

Submitted by

AdinaY

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

DeepPlanning benchmark addresses limitations of current LLM planning assessments by introducing complex, real-world tasks requiring both global optimization and local constraint reasoning.

Qwen · Published on Jan 26, 2026

GitHub 14.9k arXiv Page

Submitted by

taesiri

Helios: Real Real-Time Long Video Generation Model

Helios is a 14-billion-parameter autoregressive diffusion model for video generation that achieves real-time performance and high-quality long-video synthesis without conventional optimization techniques.

ByteDance · Published on Mar 4, 2026

132

GitHub 735 arXiv Page

Submitted by

xssstory

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

AReaL, a fully asynchronous reinforcement learning system, decouples generation and training to achieve higher GPU utilization and up to 2.57x training speedup for large language models on reasoning tasks.

13 authors · Published on May 30, 2025

31

GitHub 4.51k arXiv Page

Submitted by

taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors · Published on Aug 22, 2025

55

Submitted by

akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

8 authors · Published on Jul 25, 2024

37

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

ReMe is a framework for experience-driven agent evolution in LLMs, enhancing memory management through distillation, context-adaptive reuse, and refinement, outperforming larger memoryless models.

7 authors · Published on Dec 11, 2025

2

GitHub 2k arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

The PagedAttention algorithm and the vLLM system increase the throughput of large language model serving by managing memory efficiently and reducing waste in the key-value cache.

9 authors · Published on Sep 12, 2023

GitHub 72.3k arXiv Page

Moonshine: Speech Recognition for Live Transcription and Voice Commands

Moonshine, an encoder-decoder transformer architecture for speech recognition, uses Rotary Position Embedding, reducing compute requirements without decreasing accuracy.

6 authors · Published on Oct 21, 2024

GitHub 7.11k arXiv Page

Submitted by

evanking

Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

Monolingual ASR models trained on a balanced mix of high-quality, pseudo-labeled, and synthetic data outperform multilingual models for small model sizes, achieving superior error rates and enabling on-device ASR for underrepresented languages.

5 authors · Published on Sep 2, 2025

GitHub 7.12k arXiv Page

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

IBM Granite · Published on Mar 14, 2025

150

GitHub 55.1k arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Published on Apr 28, 2025

43

GitHub 48.9k arXiv Page

Submitted by

OmniLottie

OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

The OmniLottie framework generates high-quality vector animations from multi-modal instructions using a specialized Lottie tokenizer and pretrained vision-language models.

Fudan University · Published on Mar 2, 2026

134

GitHub 407 arXiv Page

AutoDev: Automated AI-Driven Development

AutoDev is an AI-driven software development framework that automates complex engineering tasks within a secure Docker environment, achieving high performance in code and test generation.

5 authors · Published on Mar 13, 2024

9

GitHub 9.19k arXiv Page

Submitted by

Gofinge

Utonia: Toward One Encoder for All Point Clouds

Utonia enables cross-domain point cloud representation learning through a unified self-supervised transformer encoder, enhancing perception and supporting embodied and multimodal reasoning tasks.

Pointcept · Published on Mar 3, 2026

140

GitHub 388 arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Published on Dec 28, 2024

GitHub 31.5k arXiv Page

Submitted by

taesiri

Fara-7B: An Efficient Agentic Model for Computer Use

FaraGen creates synthetic datasets for computer use agents, enabling the training of efficient and high-performing models like Fara-7B on diverse web tasks, outperforming larger models on benchmarks.

Microsoft · Published on Nov 24, 2025

15

GitHub 4.3k arXiv Page

Submitted by

taesiri

RealWonder: Real-Time Physical Action-Conditioned Video Generation

RealWonder enables real-time action-conditioned video generation by integrating 3D reconstruction, physics simulation, and a distilled video generator to simulate physical consequences of 3D actions.

6 authors · Published on Mar 5, 2026

GitHub 72 arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Published on Sep 26, 2025

147

GitHub 55.7k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Published on Oct 16, 2025

118

GitHub 71.7k arXiv Page

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

LMCache enables efficient KV cache management for large language models by storing caches outside GPU memory, supporting cache reuse across queries and inference engines while achieving significant throughput improvements.

11 authors · Published on Oct 8, 2025

GitHub 7.57k arXiv Page

Submitted by

taesiri

PaperBanana: Automating Academic Illustration for AI Scientists

PaperBanana is an agentic framework that automates the creation of publication-ready academic illustrations using advanced vision-language models and image generation techniques.

Google · Published on Jan 30, 2026

213

GitHub 4.81k arXiv Page

Submitted by

akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

24 authors · Published on Jul 23, 2024

76

GitHub 68.7k arXiv Page

Submitted by

taesiri

Qwen3-TTS Technical Report

The Qwen3-TTS series presents advanced multilingual text-to-speech models with voice cloning and controllable speech generation capabilities, utilizing dual-track LM architecture and specialized speech tokenizers for efficient streaming synthesis.

Qwen · Published on Jan 22, 2026

70

GitHub 9.14k arXiv Page

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors · Published on Nov 17, 2025

26

GitHub 18.6k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Published on Mar 20, 2024

180

GitHub 68k arXiv Page

Submitted by

taesiri

FireRed-Image-Edit-1.0 Techinical Report

FireRed-Image-Edit uses a diffusion transformer with optimized data curation and training methods to achieve state-of-the-art performance in instruction-based image editing, supported by a comprehensive benchmark and novel techniques for data efficiency and optimization stability.

19 authors · Published on Feb 12, 2026

GitHub 761 arXiv Page

Submitted by

taesiri

LTX-2: Efficient Joint Audio-Visual Foundation Model

LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.

29 authors · Published on Jan 6, 2026

157

GitHub 4.38k arXiv Page

Submitted by

ZHZisZZ

dLLM: Simple Diffusion Language Modeling

A unified open-source framework is presented that standardizes core components of diffusion language modeling for reproduction, customization, and accessible development of both large and small models.

UC Berkeley · Published on Feb 26, 2026

117

GitHub 2.13k arXiv Page

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Published on Oct 23, 2024

GitHub 54.2k arXiv Page

Submitted by

xhyandwyy

Mobile-Agent-v3: Foundamental Agents for GUI Automation

GUI-Owl and Mobile-Agent-v3 are open-source GUI agent models and frameworks that achieve state-of-the-art performance across various benchmarks using innovations in environment infrastructure, agent capabilities, and scalable reinforcement learning.

15 authors · Published on Aug 21, 2025

65

GitHub 8.02k arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT on the DMR benchmark and on LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, both critical for enterprise use cases.

5 authors · Published on Jan 20, 2025

GitHub 23.4k arXiv Page

Submitted by

UglyToilet

MemOS: A Memory OS for AI System

MemOS, a memory operating system for Large Language Models, addresses memory management challenges by unifying plaintext, activation-based, and parameter-level memories, enabling efficient storage, retrieval, and continual learning.

39 authors · Published on Jul 4, 2025

160

GitHub 6.24k arXiv Page

Submitted by

Ningyu

SkillNet: Create, Evaluate, and Connect AI Skills

SkillNet introduces an open infrastructure for systematically accumulating and transferring AI skills through a unified ontology, significantly improving agent performance across multiple domains.

Zhejiang University · Published on Feb 26, 2026

54

GitHub 147 arXiv Page

Submitted by

taesiri

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Step 3.5 Flash is a sparse Mixture-of-Experts model that achieves frontier-level agentic intelligence through efficient parameter utilization and optimized attention mechanisms, demonstrating strong performance across multiple benchmarks.

StepFun · Published on Feb 11, 2026

187

GitHub 1.65k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Published on Oct 8, 2024

30

GitHub 29.1k arXiv Page

Submitted by

stefan-it

GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

GLiNER2 is a unified transformer-based framework that supports multiple NLP tasks with improved efficiency and accessibility compared to large language models.

5 authors · Published on Jul 24, 2025

34

GitHub 1.06k arXiv Page

Submitted by

l-li

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

CubeComposer is a spatio-temporal autoregressive diffusion model that generates high-resolution 360° panoramic videos by decomposing them into cubemap representations and using efficient autoregressive synthesis techniques.

ARC Lab, Tencent PCG · Published on Mar 4, 2026

GitHub 48 arXiv Page

Submitted by

zhongwenxu

Single-stream Policy Optimization

Single-stream Policy Optimization (SPO) improves policy-gradient training for Large Language Models by eliminating group-based issues and providing a stable, low-variance learning signal, leading to better performance and efficiency.

Tencent · Published on Sep 16, 2025

35

GitHub 19.7k arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors · Published on Feb 7, 2025

16

GitHub 64.8k arXiv Page

Submitted by

taesiri

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new open-source video-language model family that achieves state-of-the-art performance through novel datasets and training methods, particularly excelling in video grounding tasks without relying on proprietary models.

21 authors · Published on Jan 15, 2026

GitHub 363 arXiv Page

Underwater Camouflaged Object Tracking Meets Vision-Language SAM2

A large-scale multi-modal underwater camouflaged object tracking dataset was created and evaluated, with a vision-language tracking framework showing superior performance in both underwater and open-air environments.

8 authors · Published on Sep 25, 2024

GitHub 1.07k arXiv Page

Submitted by

WENGSYX

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

FigureBench presents the first large-scale benchmark for generating scientific illustrations from long-form scientific texts, while AutoFigure introduces an agentic framework that produces publication-ready illustrations through extensive thinking, recombination, and validation processes.

Text Intelligence Lab of Westlake University · Published on Feb 3, 2026

20

GitHub 551 arXiv Page

Submitted by

Piang

Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

A feedforward model called Track4World enables efficient holistic 3D tracking of every pixel in a video by utilizing a global 3D scene representation and novel 3D correlation scheme for dense flow estimation.

ARC Lab, Tencent PCG · Published on Mar 3, 2026

GitHub 84 arXiv Page

MNN: A Universal and Efficient Inference Engine

MNN, a universal and efficient deep learning inference engine for mobile devices, addresses model compatibility, device diversity, and resource limitations through pre-inference, kernel optimization, and backend abstraction.

12 authors · Published on Feb 27, 2020

GitHub 14.4k arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Published on Feb 8, 2025

7

GitHub 19.2k arXiv Page

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Published on Jun 28, 2020

GitHub 98k arXiv Page

Submitted by

akhaliq

Qwen Technical Report

Qwen, a series of large language models that includes chat models and specialized coding and mathematics variants, exhibits superior performance across various tasks and outperforms open-source models.

48 authors · Published on Sep 28, 2023

GitHub 20.6k arXiv Page

Submitted by

linyq

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

A scalable data generation pipeline creates high-fidelity video editing training data, and a unified architecture enables improved instruction-following and reference fidelity in controllable video editing.

Show Lab · Published on Mar 2, 2026

17

GitHub 87 arXiv Page

Submitted by

taesiri

FireRed-OCR Technical Report

FireRed-OCR transforms general vision-language models into specialized OCR systems through structured data synthesis and progressive training strategies.

22 authors · Published on Mar 2, 2026

GitHub 165 arXiv Page

Kronos: A Foundation Model for the Language of Financial Markets

Kronos, a specialized pre-training framework for financial K-line data, outperforms existing models in forecasting and synthetic data generation through a unique tokenizer and autoregressive pre-training on a large dataset.

7 authors · Published on Aug 2, 2025

5

GitHub 11k arXiv Page
