ruibing hou
flow2023
AI & ML interests: None yet
Organizations: None yet
3D
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
  Paper • 2401.04092 • Published • 21
- AToM: Amortized Text-to-Mesh using 2D Diffusion
  Paper • 2402.00867 • Published • 11
- Advances in 3D Generation: A Survey
  Paper • 2401.17807 • Published • 19
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
  Paper • 2401.09340 • Published • 21
motion generation
- Scaling Up Dynamic Human-Scene Interaction Modeling
  Paper • 2403.08629 • Published • 15
- Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
  Paper • 2403.07487 • Published • 17
- Seamless Human Motion Composition with Blended Positional Encodings
  Paper • 2402.15509 • Published • 15
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos
  Paper • 2405.20340 • Published • 20
generation-diffusion
- High-Quality Image Restoration Following Human Instructions
  Paper • 2401.16468 • Published • 14
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
  Paper • 2401.15708 • Published • 12
- Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
  Paper • 2401.14688 • Published • 13
- TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts
  Paper • 2401.14828 • Published • 10
LLM+generate
MLLM
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
  Paper • 2312.16862 • Published • 31
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  Paper • 2312.17172 • Published • 30
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
  Paper • 2401.01974 • Published • 7
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
  Paper • 2401.01885 • Published • 28
LLM
- Efficient Tool Use with Chain-of-Abstraction Reasoning
  Paper • 2401.17464 • Published • 21
- Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation
  Paper • 2401.15688 • Published • 11
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns
  Paper • 2401.15024 • Published • 74
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
  Paper • 2401.15071 • Published • 37
CLIP
- YOLO-World: Real-Time Open-Vocabulary Object Detection
  Paper • 2401.17270 • Published • 41
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 13
- Improving fine-grained understanding in image-text pre-training
  Paper • 2401.09865 • Published • 18
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 29
video mllm
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 37
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
  Paper • 2403.11481 • Published • 13
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 30
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 29
human generation
- AppAgent: Multimodal Agents as Smartphone Users
  Paper • 2312.13771 • Published • 55
- En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data
  Paper • 2401.01173 • Published • 12
- Boosting Large Language Model for Speech Synthesis: An Empirical Study
  Paper • 2401.00246 • Published • 14
- Image Sculpting: Precise Object Editing with 3D Geometry Control
  Paper • 2401.01702 • Published • 20