SWE-Universe: Scale Real-World Verifiable Environments to Millions Paper • 2602.02361 • Published 6 days ago • 59
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking Paper • 2601.04720 • Published Jan 8 • 54
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints Paper • 2601.18137 • Published 14 days ago • 25
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience Paper • 2512.17260 • Published Dec 19, 2025 • 52
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices Paper • 2512.01374 • Published Dec 1, 2025 • 104
RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing Paper • 2507.20352 • Published Jul 27, 2025
ExpertPrompting: Instructing Large Language Models to be Distinguished Experts Paper • 2305.14688 • Published May 24, 2023
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions Paper • 2401.00690 • Published Jan 1, 2024 • 1
Building Chinese Biomedical Language Models via Multi-Level Text Discrimination Paper • 2110.07244 • Published Oct 14, 2021
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning Paper • 2311.08182 • Published Nov 14, 2023
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability Paper • 2505.24147 • Published May 30, 2025
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding Paper • 2506.03968 • Published Jun 4, 2025 • 15
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents Paper • 2506.11763 • Published Jun 13, 2025 • 74
Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking Paper • 2505.20023 • Published May 26, 2025