Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Abstract
Time-to-Move (TTM) is a plug-and-play framework for motion- and appearance-controlled video generation using image-to-video (I2V) diffusion models, offering precise control over video content without requiring additional training.
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
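To make the dual-clock denoising idea concrete, here is a minimal, illustrative sketch of region-dependent re-injection in a diffusers-style sampling loop. The function and argument names (`dual_clock_denoise`, `t_strong`, `t_weak`, `image_cond`) and the backbone call signature are assumptions for illustration only, not the authors' implementation; the exact schedule and mask handling in TTM may differ, so refer to the project page and released code for the real method.

```python
# Sketch only: SDEdit-style dual-clock denoising with a frozen I2V backbone.
# Assumes a diffusers-like scheduler exposing `timesteps`, `add_noise`, and `step`.
import torch

def dual_clock_denoise(model, scheduler, crude_latent, motion_mask,
                       t_strong=0.7, t_weak=0.4, image_cond=None):
    """Denoise a crude reference animation with two region-dependent 'clocks'.

    crude_latent: latent of the user-made crude animation (e.g., cut-and-drag
                  or depth-based reprojection), shape (B, C, F, H, W)
    motion_mask:  float tensor, 1 inside motion-specified regions, 0 elsewhere
    t_strong:     fraction of the trajectory during which the noised crude
                  latent is re-injected inside the mask (strong alignment)
    t_weak:       same fraction for regions outside the mask (free dynamics)
    """
    timesteps = scheduler.timesteps                  # descending: T-1 ... 0
    n_strong = int(t_strong * len(timesteps))        # clock for masked regions
    n_weak = int(t_weak * len(timesteps))            # clock for everything else

    # SDEdit-style start: noise the crude animation up to the first timestep.
    x = scheduler.add_noise(crude_latent, torch.randn_like(crude_latent), timesteps[0])

    for i, t in enumerate(timesteps):
        # One ordinary denoising step of the frozen, image-conditioned backbone
        # (call signature is hypothetical).
        eps = model(x, t, image_cond=image_cond)
        x = scheduler.step(eps, t, x).prev_sample

        # Re-inject the noised crude latent wherever its clock is still running:
        # longer inside the motion mask, shorter outside it.
        if i < n_strong:
            ref = scheduler.add_noise(crude_latent, torch.randn_like(x), t)
            keep = motion_mask if i >= n_weak else torch.ones_like(motion_mask)
            x = keep * ref + (1 - keep) * x
    return x
```

In this sketch the only change to sampling is the masked re-injection, which matches the abstract's claim of no extra training or runtime cost: motion-specified regions stay tied to the crude animation for more of the trajectory, while the rest of the frame is released earlier to evolve naturally.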
Community
A training-free, plug-and-play motion- and appearance-controlled video generation method using dual-clock denoising on image-to-video diffusion models with crude motion cues.
The following papers, similar to this work, were recommended by the Semantic Scholar API:
- CoMo: Compositional Motion Customization for Text-to-Video Generation (2025)
- Generating Human Motion Videos using a Cascaded Text-to-Video Framework (2025)
- Real-Time Motion-Controllable Autoregressive Video Diffusion (2025)
- Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures (2025)
- MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation (2025)
- BachVid: Training-Free Video Generation with Consistent Background and Character (2025)
- VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos (2025)