ruibing hou
flow2023
AI & ML interests: None yet
Organizations: None yet
3D
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
  Paper • 2401.04092 • Published • 21
- AToM: Amortized Text-to-Mesh using 2D Diffusion
  Paper • 2402.00867 • Published • 11
- Advances in 3D Generation: A Survey
  Paper • 2401.17807 • Published • 19
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
  Paper • 2401.09340 • Published • 21
motion generation
- Scaling Up Dynamic Human-Scene Interaction Modeling
  Paper • 2403.08629 • Published • 15
- Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
  Paper • 2403.07487 • Published • 17
- Seamless Human Motion Composition with Blended Positional Encodings
  Paper • 2402.15509 • Published • 15
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos
  Paper • 2405.20340 • Published • 20
generation-diffusion
- High-Quality Image Restoration Following Human Instructions
  Paper • 2401.16468 • Published • 14
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
  Paper • 2401.15708 • Published • 12
- Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
  Paper • 2401.14688 • Published • 13
- TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts
  Paper • 2401.14828 • Published • 10
LLM+generate
MLLM
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
  Paper • 2312.16862 • Published • 31
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  Paper • 2312.17172 • Published • 30
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
  Paper • 2401.01974 • Published • 7
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
  Paper • 2401.01885 • Published • 28
LLM
- Efficient Tool Use with Chain-of-Abstraction Reasoning
  Paper • 2401.17464 • Published • 21
- Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation
  Paper • 2401.15688 • Published • 11
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
  Paper • 2401.15024 • Published • 74
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
  Paper • 2401.15071 • Published • 37
CLIP
- YOLO-World: Real-Time Open-Vocabulary Object Detection
  Paper • 2401.17270 • Published • 41
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 13
- Improving fine-grained understanding in image-text pre-training
  Paper • 2401.09865 • Published • 18
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 29
video mllm
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 37
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
  Paper • 2403.11481 • Published • 13
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 30
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 29
human generation
- AppAgent: Multimodal Agents as Smartphone Users
  Paper • 2312.13771 • Published • 55
- En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data
  Paper • 2401.01173 • Published • 12
- Boosting Large Language Model for Speech Synthesis: An Empirical Study
  Paper • 2401.00246 • Published • 14
- Image Sculpting: Precise Object Editing with 3D Geometry Control
  Paper • 2401.01702 • Published • 20