nielsr HF Staff committed on
Commit 39a9fa2 · verified · 1 Parent(s): 9269082

Improve model card with comprehensive details and metadata


This PR significantly enhances the model card for LangScene-X, making it much more informative and discoverable on the Hugging Face Hub.

Key improvements include:
- Populating the `README.md` with detailed information from the project's GitHub repository, including an introduction, news, abstract, pipeline overview, video demos, installation instructions, and usage examples.
- Adding `pipeline_tag: image-to-3d` to the metadata, ensuring the model appears in relevant searches on the Hub.
- Specifying `library_name: diffusers` to enable proper integration and display of the model with the Diffusers library (see the sketch after this list).
- Consolidating all relevant links (paper, project page, code, arXiv) for easy access.
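
With `library_name: diffusers` set, the Hub can also offer a standard loading snippet for the checkpoint. A minimal sketch of what that enables, assuming the repo follows the standard Diffusers layout (untested here; a fine-tuned CogVideoX may instead require its dedicated pipeline class):

```python
from diffusers import DiffusionPipeline
import torch

# Assumption: the repo ships a model_index.json plus component
# subfolders in the standard Diffusers layout. If the weights are a
# fine-tuned CogVideoX, load them with the matching pipeline class
# (e.g. CogVideoXPipeline) instead.
pipe = DiffusionPipeline.from_pretrained(
    "chijw/LangScene-X", torch_dtype=torch.float16
)
pipe.to("cuda")
```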

Please review and merge this PR to make this exciting research more accessible to the community!

Files changed (1)
README.md +133 −4
README.md CHANGED
@@ -1,8 +1,137 @@
  ---
  license: mit
+ pipeline_tag: image-to-3d
+ library_name: diffusers
  ---
- # LangScene-X
-
- - Repository: https://github.com/liuff19/LangScene-X/
- - Project Page: https://liuff19.github.io/LangScene-X/
- - arXiv: https://arxiv.org/abs/2507.02813
+
+ <div align="center">
+
+ # ✨LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion✨
+
+ <p align="center">
+ <a href="https://liuff19.github.io/">Fangfu Liu</a><sup>1</sup>,
+ <a href="https://lifuguan.github.io/">Hao Li</a><sup>2</sup>,
+ <a href="https://github.com/chijw">Jiawei Chi</a><sup>1</sup>,
+ <a href="https://hanyang-21.github.io/">Hanyang Wang</a><sup>1,3</sup>,
+ <a href="https://github.com/liuff19/LangScene-X">Minghui Yang</a><sup>3</sup>,
+ <a href="https://github.com/liuff19/LangScene-X">Fudong Wang</a><sup>3</sup>,
+ <a href="https://duanyueqi.github.io/">Yueqi Duan</a><sup>1</sup>
+ <br>
+ <sup>1</sup>Tsinghua University, <sup>2</sup>NTU, <sup>3</sup>Ant Group
+ </p>
+ <h3 align="center">ICCV 2025 🔥</h3>
+ <a href="https://arxiv.org/abs/2507.02813"><img src='https://img.shields.io/badge/arXiv-2507.02813-b31b1b.svg'></a> &nbsp;&nbsp;&nbsp;&nbsp;
+ <a href="https://liuff19.github.io/LangScene-X"><img src='https://img.shields.io/badge/Project-Page-Green'></a> &nbsp;&nbsp;&nbsp;&nbsp;
+ <a href="https://huggingface.co/chijw/LangScene-X"><img src='https://img.shields.io/badge/LangSceneX-huggingface-yellow'></a> &nbsp;&nbsp;&nbsp;&nbsp;
+ <a><img src='https://img.shields.io/badge/License-MIT-blue'></a> &nbsp;&nbsp;&nbsp;&nbsp;
+
+ ![Teaser Visualization](https://github.com/liuff19/LangScene-X/blob/main/assets/teaser.png?raw=true)
+ </div>
+
+ **LangScene-X:** We propose LangScene-X, a unified model that generates RGB, segmentation maps, and normal maps, enabling reconstruction of a 3D field from sparse input views.
+
+ ## 📄 Paper
+ The model was presented in the paper [LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion](https://huggingface.co/papers/2507.02813).
+
+ ## 🔗 Links
+ - Repository: [https://github.com/liuff19/LangScene-X/](https://github.com/liuff19/LangScene-X/)
+ - Project Page: [https://liuff19.github.io/LangScene-X/](https://liuff19.github.io/LangScene-X/)
+ - arXiv: [https://arxiv.org/abs/2507.02813](https://arxiv.org/abs/2507.02813)
+
+ ## 📖 Abstract
+
+ Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability.
+
+ ## 📢 News
+ - 🔥 [04/07/2025] We release "LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion". Check our [project page](https://liuff19.github.io/LangScene-X) and [arXiv paper](https://arxiv.org/abs/2507.02813).
+
+ ## 🌟 Pipeline
+
+ ![Pipeline Visualization](https://github.com/liuff19/LangScene-X/blob/main/assets/pipeline.png?raw=true)
+
+ Pipeline of LangScene-X. Our model is composed of a TriMap video diffusion model that generates RGB, segmentation-map, and normal-map videos; an autoencoder that compresses the language features; and a field constructor that reconstructs a 3DGS field from the generated videos.
+
+ ## 🎨 Video Demos from TriMap Video Diffusion
+
+ https://github.com/user-attachments/assets/55346d53-eb04-490e-bb70-64555e97e040
+
+ https://github.com/user-attachments/assets/d6eb28b9-2af8-49a7-bb8b-0d4cba7843a5
+
+ https://github.com/user-attachments/assets/396f11ef-85dc-41de-882e-e249c25b9961
+
+ ## ⚙️ Setup
+
+ ### 1. Clone Repository
+ ```bash
+ git clone https://github.com/liuff19/LangScene-X.git
+ cd LangScene-X
+ ```
+
+ ### 2. Environment Setup
+
+ 1. **Create conda environment**
+
+ ```bash
+ conda create -n langscenex python=3.10 -y
+ conda activate langscenex
+ ```
+
+ 2. **Install dependencies**
+
+ ```bash
+ conda install pytorch torchvision -c pytorch -y
+ pip install -e field_construction/submodules/simple-knn
+ pip install -e field_construction/submodules/diff-langsurf-rasterizer
+ pip install -e auto-seg/submodules/segment-anything-1
+ pip install -e auto-seg/submodules/segment-anything-2
+ pip install -r requirements.txt
+ ```
+
+ ### 3. Model Checkpoints
+ The checkpoints of SAM, SAM2, and the fine-tuned CogVideoX can be downloaded from our [Hugging Face repository](https://huggingface.co/chijw/LangScene-X), as shown below.
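+
+ For instance, with `huggingface_hub` (a minimal sketch; `checkpoints/` as the target directory is an arbitrary choice):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Fetch all checkpoint files (SAM, SAM2, fine-tuned CogVideoX) from
+ # the Hub repo; "checkpoints" is an illustrative local target path.
+ ckpt_dir = snapshot_download(repo_id="chijw/LangScene-X", local_dir="checkpoints")
+ print(f"checkpoints downloaded to {ckpt_dir}")
+ ```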
+
+ ## 💻 Running
+
+ ### Quick Start
+ You can start quickly by running the following script:
+ ```bash
+ chmod +x quick_start.sh
+ ./quick_start.sh <first_rgb_image_path> <last_rgb_image_path>
+ ```
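+
+ For example, with the first and last RGB frames of a sparse capture (these paths are placeholders, not files shipped with the repo):
+
+ ```bash
+ # Placeholder inputs: any two endpoint views of the same scene
+ ./quick_start.sh data/scene/frame_first.png data/scene/frame_last.png
+ ```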
+
+ ### Render
+ Run the following command to render from the reconstructed 3DGS field:
+ ```bash
+ python entry_point.py \
+     pipeline.rgb_video_path="does/not/matter" \
+     pipeline.normal_video_path="does/not/matter" \
+     pipeline.seg_video_path="does/not/matter" \
+     pipeline.data_path="does/not/matter" \
+     gaussian.dataset.source_path="does/not/matter" \
+     gaussian.dataset.model_path="output/path" \
+     pipeline.selection=False \
+     gaussian.opt.max_geo_iter=1500 \
+     gaussian.opt.normal_optim=True \
+     gaussian.opt.optim_pose=True \
+     pipeline.skip_video_process=True \
+     pipeline.skip_lang_feature_extraction=True \
+     pipeline.mode="render"
+ ```
+ You can also configure the pipeline by editing `configs/field_construction.yaml` (see the sketch below).
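+
+ As an illustration, such overrides would look roughly as follows in YAML form; this excerpt is hypothetical, mirroring the dotted CLI keys above, and the actual file's structure and defaults may differ:
+
+ ```yaml
+ # Hypothetical excerpt: keys mirror the CLI overrides shown above
+ pipeline:
+   mode: render
+   selection: false
+   skip_video_process: true
+   skip_lang_feature_extraction: true
+ gaussian:
+   dataset:
+     model_path: output/path
+   opt:
+     max_geo_iter: 1500
+     normal_optim: true
+     optim_pose: true
+ ```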
+
+ ## 🔗 Acknowledgement
+
+ We are grateful to the following great works that we drew on when implementing LangScene-X:
+
+ - [CogVideoX](https://github.com/THUDM/CogVideo), [CogvideX-Interpolation](https://github.com/feizc/CogvideX-Interpolation), [LangSplat](https://github.com/minghanqin/LangSplat), [LangSurf](https://github.com/lifuguan/LangSurf), [VGGT](https://github.com/facebookresearch/vggt), [3DGS](https://github.com/graphdeco-inria/gaussian-splatting), [SAM2](https://github.com/facebookresearch/sam2)
+
+ ## 📚 Citation
+
+ ```bibtex
+ @misc{liu2025langscenexreconstructgeneralizable3d,
+   title={LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion},
+   author={Fangfu Liu and Hao Li and Jiawei Chi and Hanyang Wang and Minghui Yang and Fudong Wang and Yueqi Duan},
+   year={2025},
+   eprint={2507.02813},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2507.02813},
+ }
+ ```