RAG
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
  Paper • 2507.09477 • Published • 74
- BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering
  Paper • 2507.04127 • Published • 6
- Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
  Paper • 2507.06223 • Published • 13
Oğuzhan Ercan (oguzhanercan)
AI & ML interests: Computer Vision, Generative Vision, first trajectory bender
Recent Activity: updated the collections Robotics, Reasoning, and Image Editing (each 5 days ago)
Organizations: None yet
MultiModal Reasoning
Large Language Models
- Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
  Paper • 2506.14702 • Published • 4
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  Paper • 2506.13585 • Published • 260
- Scaling Test-time Compute for LLM Agents
  Paper • 2506.12928 • Published • 61
- A Survey on Latent Reasoning
  Paper • 2507.06203 • Published • 85
Robotics
- Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
  Paper • 2506.06205 • Published • 29
- BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
  Paper • 2506.07530 • Published • 20
- Ark: An Open-source Python-based Framework for Robot Learning
  Paper • 2506.21628 • Published • 15
- RoboBrain 2.0 Technical Report
  Paper • 2507.02029 • Published • 29
Autoregressive Image Generation
- Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
  Paper • 2505.19602 • Published • 13
- DiSA: Diffusion Step Annealing in Autoregressive Image Generation
  Paper • 2505.20297 • Published • 2
- AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
  Paper • 2506.06962 • Published • 29
- Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
  Paper • 2507.01957 • Published • 19
Vision Reasoning
- Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
  Paper • 2505.15966 • Published • 53
- GRIT: Teaching MLLMs to Think with Images
  Paper • 2505.15879 • Published • 12
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
  Paper • 2505.16854 • Published • 11
- VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
  Paper • 2505.16192 • Published • 12
Representation Learning
Training Theory
- Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
  Paper • 2502.03738 • Published • 11
- Better Embeddings with Coupled Adam
  Paper • 2502.08441 • Published • 1
- Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
  Paper • 2502.16894 • Published • 31
- SALT: Singular Value Adaptation with Low-Rank Transformation
  Paper • 2503.16055 • Published • 8
Efficient ML
Video Generation Backbone Models
- rain1011/pyramid-flow-miniflux
  Text-to-Video • Updated • 177
- TPDiff: Temporal Pyramid Video Diffusion Model
  Paper • 2503.09566 • Published • 46
- Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
  Paper • 2504.08685 • Published • 129
- Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
  Paper • 2504.12626 • Published • 52
Image-Video General Tasks
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 47
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 53
- An Empirical Study of Autoregressive Pre-training from Videos
  Paper • 2501.05453 • Published • 42
- MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
  Paper • 2501.07556 • Published • 6
Diffusion/Flow Model Optimization
- 1.58-bit FLUX
  Paper • 2412.18653 • Published • 84
- Region-Adaptive Sampling for Diffusion Transformers
  Paper • 2502.10389 • Published • 54
- One-step Diffusion Models with f-Divergence Distribution Matching
  Paper • 2502.15681 • Published • 8
- FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
  Paper • 2502.20126 • Published • 20
Datasets
Video Generation Control-Style Transfer
- StyleMaster: Stylize Your Video with Artistic Generation and Translation
  Paper • 2412.07744 • Published • 20
- Video Motion Transfer with Diffusion Transformers
  Paper • 2412.07776 • Published • 17
- ObjCtrl-2.5D: Training-free Object Control with Camera Poses
  Paper • 2412.07721 • Published • 8
- MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance
  Paper • 2412.05355 • Published • 9
Image Restoration (SR, Inpainting, etc.)
Image-Video MultiModal Understanding
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
  Paper • 2412.10360 • Published • 146
- SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
  Paper • 2501.01245 • Published • 5
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
  Paper • 2501.00599 • Published • 48
- Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
  Paper • 2501.08326 • Published • 34
Architectural Proposals
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 107
- Causal Diffusion Transformers for Generative Modeling
  Paper • 2412.12095 • Published • 23
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 89
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 57
Image Editing
- BrushEdit: All-In-One Image Inpainting and Editing
  Paper • 2412.10316 • Published • 36
- ColorFlow: Retrieval-Augmented Image Sequence Colorization
  Paper • 2412.11815 • Published • 26
- FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
  Paper • 2412.09611 • Published • 10
- FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
  Paper • 2412.07517 • Published • 11
Diffusion Model Control
Control Methods for Diffusion and Score Models
- LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
  Paper • 2412.09622 • Published • 8
- AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
  Paper • 2412.04146 • Published • 23
- Learning Flow Fields in Attention for Controllable Person Image Generation
  Paper • 2412.08486 • Published • 37
- LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
  Paper • 2412.05148 • Published • 12
Embedding Space Interpretability
Transformer Optimization / LLM & VLLM etc
Agentic Tools
- VideoDeepResearch: Long Video Understanding With Agentic Tool Using
  Paper • 2506.10821 • Published • 19
- Jan-nano Technical Report
  Paper • 2506.22760 • Published • 9
- MMSearch-R1: Incentivizing LMMs to Search
  Paper • 2506.20670 • Published • 60
- WebSailor: Navigating Super-human Reasoning for Web Agent
  Paper • 2507.02592 • Published • 99
Reasoning
- Magistral
  Paper • 2506.10910 • Published • 61
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
  Paper • 2506.15882 • Published • 2
- MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
  Paper • 2507.14683 • Published • 114
- The Invisible Leash: Why RLVR May Not Escape Its Origin
  Paper • 2507.14843 • Published • 78
Diffusion Language & MultiModal Modeling
Subject Driven Generation Control
Scene Generation
- HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation
  Paper • 2504.21650 • Published • 16
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
  Paper • 2505.02836 • Published • 7
- ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
  Paper • 2506.14315 • Published • 10
Image-Text Alignment
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
  Paper • 2502.05178 • Published • 10
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
  Paper • 2502.14846 • Published • 14
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  Paper • 2502.14786 • Published • 146
- Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
  Paper • 2504.00557 • Published • 15
Control Based Video Generation Models
Video Generation Style Models
Generation Quality Enhancement
- VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
  Paper • 2412.20800 • Published • 10
- Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
  Paper • 2501.06751 • Published • 33
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
  Paper • 2501.09732 • Published • 72
- Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
  Paper • 2501.09755 • Published • 37
Voice
- Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
  Paper • 2412.15322 • Published • 19
- Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
  Paper • 2505.02707 • Published • 84
- LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
  Paper • 2505.02625 • Published • 22
- Fast Text-to-Audio Generation with Adversarial Post-Training
  Paper • 2505.08175 • Published • 23
Mobile Generative Models
Diffusion-Score-Flow Guidance
General Theory
Face Generation-Swap-Control-Edit
- VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
  Paper • 2412.11279 • Published • 12
- MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
  Paper • 2501.02260 • Published • 5
- GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
  Paper • 2501.09978 • Published • 6
- FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation
  Paper • 2502.13995 • Published • 9
Generative Modeling Approaches
- Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
  Paper • 2412.10208 • Published • 19
- Normalizing Flows are Capable Generative Models
  Paper • 2412.06329 • Published • 9
- A Noise is Worth Diffusion Guidance
  Paper • 2412.03895 • Published • 31
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
  Paper • 2501.01423 • Published • 44
Video Generation
- DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
  Paper • 2412.11100 • Published • 7
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
  Paper • 2412.09856 • Published • 10
- DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
  Paper • 2412.09349 • Published • 8
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
  Paper • 2412.04448 • Published • 10
Image Generation
- Causal Diffusion Transformers for Generative Modeling
  Paper • 2412.12095 • Published • 23
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
  Paper • 2412.09619 • Published • 28
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  Paper • 2412.07589 • Published • 49
- Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
  Paper • 2412.15213 • Published • 29