---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# UniPixel-3B

<div style="display: flex; gap: 5px;">
  <a href="https://arxiv.org/abs/2509.18094" target="_blank"><img src="https://img.shields.io/badge/arXiv-2509.18094-red"></a>
  <a href="https://polyu-chenlab.github.io/unipixel/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
  <a href="https://github.com/PolyU-ChenLab/UniPixel/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
  <a href="https://github.com/PolyU-ChenLab/UniPixel" target="_blank"><img src="https://img.shields.io/github/stars/PolyU-ChenLab/UniPixel"></a>
</div>

UniPixel is a unified MLLM for pixel-level vision-language understanding. It flexibly supports a variety of fine-grained tasks, including image/video segmentation, regional understanding, and a novel PixelQA task that jointly requires object-centric referring, segmentation, and question answering in videos.

<p align="center"><img width="750" src="https://raw.githubusercontent.com/PolyU-ChenLab/UniPixel/refs/heads/main/.github/method.jpg"></p>

## 🔖 Model Details

- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

## 🚀 Quick Start

### Install the environment

1. Clone the repository from GitHub.

   ```shell
   git clone https://github.com/PolyU-ChenLab/UniPixel.git
   cd UniPixel
   ```

2. Set up the virtual environment.

   ```shell
   conda create -n unipixel python=3.12 -y
   conda activate unipixel

   # change 'cu128' to match your own CUDA version
   pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128

   # other versions have not been verified
   pip install flash_attn==2.8.2 --no-build-isolation
   ```

3. Install dependencies.

   ```shell
   pip install -r requirements.txt
   ```

For NPU users, please install the CPU version of PyTorch and [`torch_npu`](https://github.com/Ascend/pytorch) instead.

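To confirm that the environment is healthy before moving on, you can run a short check like the one below. This is an optional sketch, not part of the official setup: it only verifies that `torch` and `flash_attn` import and that an accelerator is visible.

```python
# Optional sanity check for the freshly installed environment.
# It does not touch any UniPixel code or weights.
import torch

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
# (NPU setups typically check torch.npu.is_available() after importing torch_npu instead.)

try:
    import flash_attn  # noqa: F401
    print('flash_attn imported successfully')
except ImportError as err:
    print(f'flash_attn is not available: {err}')
```
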
### Quick Inference Demo

Try our [online demo](https://huggingface.co/spaces/PolyU-ChenLab/UniPixel) or the [inference script](https://github.com/PolyU-ChenLab/UniPixel/blob/main/tools/inference.py) below. Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/UniPixel) for more details.

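If you would like to fetch the checkpoint before running the script below (e.g. on a node without network access at run time), it can be pre-downloaded with `huggingface_hub`. This is an optional sketch: the repository id is taken from this model card, the local directory is an arbitrary example, and `build_model` should also be able to download the weights on its own when given a repository id.

```python
# Optional: pre-download the UniPixel-3B checkpoint to a local directory.
# `snapshot_download` comes from huggingface_hub; the local_dir path is just an example.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id='PolyU-ChenLab/UniPixel-3B',
    local_dir='checkpoints/UniPixel-3B',
)
print(f'Checkpoint files are in: {local_path}')
```

The returned path should then be usable in place of the repository id, assuming `build_model` accepts local paths like most Hugging Face-style loaders.
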
```python
import imageio.v3 as iio
import nncore

from unipixel.dataset.utils import process_vision_info
from unipixel.model.builder import build_model
from unipixel.utils.io import load_image, load_video
from unipixel.utils.transforms import get_sam2_transform
from unipixel.utils.visualizer import draw_mask

media_path = '<path-to-jpg-or-mp4-file>'
prompt = 'Please segment the...'
output_dir = 'outputs'

# Load the model and processor
model, processor = build_model('PolyU-ChenLab/UniPixel-3B')
device = next(model.parameters()).device

sam2_transform = get_sam2_transform(model.config.sam2_image_size)

# Load a single image or sample 16 frames from a video
if any(media_path.endswith(k) for k in ('jpg', 'png')):
    frames, images = load_image(media_path), [media_path]
else:
    frames, images = load_video(media_path, sample_frames=16)

messages = [{
    'role': 'user',
    'content': [{
        'type': 'video',
        'video': images,
        'min_pixels': 128 * 28 * 28,
        # allow a larger per-frame budget when fewer frames are sampled
        'max_pixels': 256 * 28 * 28 * int(16 / len(images))
    }, {
        'type': 'text',
        'text': prompt
    }]
}]

text = processor.apply_chat_template(messages, add_generation_prompt=True)

images, videos, kwargs = process_vision_info(messages, return_video_kwargs=True)

data = processor(text=[text], images=images, videos=videos, return_tensors='pt', **kwargs)

# Prepare the raw frames for the SAM 2 mask decoder
data['frames'] = [sam2_transform(frames).to(model.sam2.dtype)]
data['frame_size'] = [frames.shape[1:3]]

# Greedy decoding
output_ids = model.generate(
    **data.to(device),
    do_sample=False,
    temperature=None,
    top_k=None,
    top_p=None,
    repetition_penalty=None,
    max_new_tokens=512)

# Strip the prompt tokens and the trailing EOS token
assert data.input_ids.size(0) == output_ids.size(0) == 1
output_ids = output_ids[0, data.input_ids.size(1):]

if output_ids[-1] == processor.tokenizer.eos_token_id:
    output_ids = output_ids[:-1]

response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Response: {response}')

# Visualize the predicted masks (if any) and save them as a GIF or PNG
if len(model.seg) >= 1:
    imgs = draw_mask(frames, model.seg)

    nncore.mkdir(output_dir)

    path = nncore.join(output_dir, f"{nncore.pure_name(media_path)}.{'gif' if len(imgs) > 1 else 'png'}")
    print(f'Output Path: {path}')
    iio.imwrite(path, imgs, duration=100, loop=0)
```

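The script above saves all overlaid frames into a single GIF (or a PNG for image input). If you prefer one image per frame, the sketch below is a possible variation of the final block; it assumes `frames`, `model`, `media_path`, `output_dir`, and the imports are already in scope from the script above, so it replaces the last `if len(model.seg) >= 1:` block rather than running standalone.

```python
# Variation of the final block in the demo above (not standalone): save each
# overlaid frame as its own PNG instead of a single animated GIF.
if len(model.seg) >= 1:
    imgs = draw_mask(frames, model.seg)

    frame_dir = nncore.join(output_dir, nncore.pure_name(media_path))
    nncore.mkdir(frame_dir)

    for idx, img in enumerate(imgs):
        iio.imwrite(nncore.join(frame_dir, f'{idx:04d}.png'), img)

    print(f'Saved {len(imgs)} overlaid frame(s) to {frame_dir}')
```
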
## 📖 Citation

Please kindly cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2025unipixel,
  title={UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning},
  author={Liu, Ye and Ma, Zongyang and Pu, Junfu and Qi, Zhongang and Wu, Yang and Shan, Ying and Chen, Chang Wen},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```