Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Abstract
Ego-R1 is a reinforcement learning-based framework that uses a structured, tool-augmented chain-of-thought process to reason over ultra-long egocentric videos, outperforming existing methods and extending time coverage from a few hours to a week.
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., days- and weeks-long) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking one specific tool per step to iteratively and collaboratively answer sub-questions, tackling tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model on CoTT data, followed by RL, to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
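For intuition, below is a minimal Python sketch of the tool-augmented CoTT loop described in the abstract: at each step the agent proposes a thought plus exactly one tool call, records the observation, and either continues or emits a final answer. The tool names (`temporal_retrieval`, `vlm_qa`), the `propose_step` policy stub, and the step budget are illustrative placeholders, not the paper's actual API.

```python
# Minimal sketch of a Chain-of-Tool-Thought (CoTT) loop.
# All tool and policy names are hypothetical stand-ins for illustration.
from dataclasses import dataclass


@dataclass
class Step:
    thought: str        # the agent's reasoning for this step
    tool: str           # exactly one tool is invoked per step
    tool_input: str
    observation: str = ""


def temporal_retrieval(query: str) -> str:
    """Placeholder: locate candidate time spans in the week-long video log."""
    return f"[retrieved clips for: {query}]"


def vlm_qa(query: str) -> str:
    """Placeholder: answer a visual sub-question about a retrieved clip."""
    return f"[visual answer for: {query}]"


TOOLS = {"temporal_retrieval": temporal_retrieval, "vlm_qa": vlm_qa}


def propose_step(question: str, history: list[Step]) -> Step | str:
    """Placeholder for the RL-trained agent policy (SFT + RL in the paper).

    Returns either the next tool-augmented step or a final answer string.
    """
    if not history:
        return Step("Find when the event happened.", "temporal_retrieval", question)
    if len(history) == 1:
        return Step("Inspect the retrieved clip.", "vlm_qa", question)
    return "final answer (derived from the observations above)"


def chain_of_tool_thought(question: str, max_steps: int = 8) -> str:
    history: list[Step] = []
    for _ in range(max_steps):
        step = propose_step(question, history)
        if isinstance(step, str):           # agent decided it can answer
            return step
        step.observation = TOOLS[step.tool](step.tool_input)
        history.append(step)                # observation feeds the next step
    return "no answer within the step budget"


print(chain_of_tool_thought("What did I cook on Tuesday evening?"))
```

The one-tool-per-step constraint in this sketch mirrors the modular decomposition the abstract describes, where each intermediate observation grounds the next reasoning step.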
Community
Check out our:
- Project Page: https://egolife-ai.github.io/Ego-R1/
- Code: https://github.com/egolife-ai/Ego-R1
The following similar papers were recommended by the Semantic Scholar API:
- EgoVLM: Policy Optimization for Egocentric Video Understanding (2025)
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL (2025)
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding (2025)
- Reinforced Reasoning for Embodied Planning (2025)
- Fostering Video Reasoning via Next-Event Prediction (2025)
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning (2025)
- GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking (2025)