---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# UniPixel-3B

<div style="display: flex; gap: 5px;">
  <a href="https://arxiv.org/abs/2509.18094" target="_blank"><img src="https://img.shields.io/badge/arXiv-2509.18094-red"></a>
  <a href="https://polyu-chenlab.github.io/unipixel/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
  <a href="https://github.com/PolyU-ChenLab/UniPixel/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
  <a href="https://github.com/PolyU-ChenLab/UniPixel" target="_blank"><img src="https://img.shields.io/github/stars/PolyU-ChenLab/UniPixel"></a>
</div>

UniPixel is a unified MLLM for pixel-level vision-language understanding. It flexibly supports a variety of fine-grained tasks, including image/video segmentation, regional understanding, and a novel PixelQA task that jointly requires object-centric referring, segmentation, and question answering in videos.

<p align="center"><img width="750" src="https://raw.githubusercontent.com/PolyU-ChenLab/UniPixel/refs/heads/main/.github/method.jpg"></p>

## 🔖 Model Details

- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

## 🚀 Quick Start

### Install the environment

1. Clone the repository from GitHub.

   ```shell
   git clone https://github.com/PolyU-ChenLab/UniPixel.git
   cd UniPixel
   ```

2. Set up the virtual environment.

   ```shell
   conda create -n unipixel python=3.12 -y
   conda activate unipixel

   # change 'cu128' to match your own CUDA version
   pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128

   # other versions have not been verified
   pip install flash_attn==2.8.2 --no-build-isolation
   ```

3. Install dependencies.

   ```shell
   pip install -r requirements.txt
   ```

For NPU users, please install the CPU version of PyTorch and [`torch_npu`](https://github.com/Ascend/pytorch) instead.

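To confirm that the environment is healthy before moving on, you can run a short check like the one below. This is an optional sketch, not part of the official setup: it only verifies that `torch` and `flash_attn` import and that an accelerator is visible.

```python
# Optional sanity check for the freshly installed environment.
# It does not touch any UniPixel code or weights.
import torch

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
# (NPU setups typically check torch.npu.is_available() after importing torch_npu instead.)

try:
    import flash_attn  # noqa: F401
    print('flash_attn imported successfully')
except ImportError as err:
    print(f'flash_attn is not available: {err}')
```
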
### Quick Inference Demo

Try our [online demo](https://huggingface.co/spaces/PolyU-ChenLab/UniPixel) or the [inference script](https://github.com/PolyU-ChenLab/UniPixel/blob/main/tools/inference.py) below. Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/UniPixel) for more details.

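If you would like to fetch the checkpoint before running the script below (e.g. on a node without network access at run time), it can be pre-downloaded with `huggingface_hub`. This is an optional sketch: the repository id is taken from this model card, the local directory is an arbitrary example, and `build_model` should also be able to download the weights on its own when given a repository id.

```python
# Optional: pre-download the UniPixel-3B checkpoint to a local directory.
# `snapshot_download` comes from huggingface_hub; the local_dir path is just an example.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id='PolyU-ChenLab/UniPixel-3B',
    local_dir='checkpoints/UniPixel-3B',
)
print(f'Checkpoint files are in: {local_path}')
```

The returned path should then be usable in place of the repository id, assuming `build_model` accepts local paths like most Hugging Face-style loaders.
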
```python
import imageio.v3 as iio
import nncore

from unipixel.dataset.utils import process_vision_info
from unipixel.model.builder import build_model
from unipixel.utils.io import load_image, load_video
from unipixel.utils.transforms import get_sam2_transform
from unipixel.utils.visualizer import draw_mask

media_path = '<path-to-jpg-or-mp4-file>'
prompt = 'Please segment the...'
output_dir = 'outputs'

# Load the model and processor
model, processor = build_model('PolyU-ChenLab/UniPixel-3B')
device = next(model.parameters()).device

sam2_transform = get_sam2_transform(model.config.sam2_image_size)

# Load a single image or sample 16 frames from a video
if any(media_path.endswith(k) for k in ('jpg', 'png')):
    frames, images = load_image(media_path), [media_path]
else:
    frames, images = load_video(media_path, sample_frames=16)

messages = [{
    'role': 'user',
    'content': [{
        'type': 'video',
        'video': images,
        'min_pixels': 128 * 28 * 28,
        # allow a larger per-frame budget when fewer frames are sampled
        'max_pixels': 256 * 28 * 28 * int(16 / len(images))
    }, {
        'type': 'text',
        'text': prompt
    }]
}]

text = processor.apply_chat_template(messages, add_generation_prompt=True)

images, videos, kwargs = process_vision_info(messages, return_video_kwargs=True)

data = processor(text=[text], images=images, videos=videos, return_tensors='pt', **kwargs)

# Prepare the raw frames for the SAM 2 mask decoder
data['frames'] = [sam2_transform(frames).to(model.sam2.dtype)]
data['frame_size'] = [frames.shape[1:3]]

# Greedy decoding
output_ids = model.generate(
    **data.to(device),
    do_sample=False,
    temperature=None,
    top_k=None,
    top_p=None,
    repetition_penalty=None,
    max_new_tokens=512)

# Strip the prompt tokens and the trailing EOS token
assert data.input_ids.size(0) == output_ids.size(0) == 1
output_ids = output_ids[0, data.input_ids.size(1):]

if output_ids[-1] == processor.tokenizer.eos_token_id:
    output_ids = output_ids[:-1]

response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Response: {response}')

# Visualize the predicted masks (if any) and save them as a GIF or PNG
if len(model.seg) >= 1:
    imgs = draw_mask(frames, model.seg)

    nncore.mkdir(output_dir)

    path = nncore.join(output_dir, f"{nncore.pure_name(media_path)}.{'gif' if len(imgs) > 1 else 'png'}")
    print(f'Output Path: {path}')
    iio.imwrite(path, imgs, duration=100, loop=0)
```

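The script above saves all overlaid frames into a single GIF (or a PNG for image input). If you prefer one image per frame, the sketch below is a possible variation of the final block; it assumes `frames`, `model`, `media_path`, `output_dir`, and the imports are already in scope from the script above, so it replaces the last `if len(model.seg) >= 1:` block rather than running standalone.

```python
# Variation of the final block in the demo above (not standalone): save each
# overlaid frame as its own PNG instead of a single animated GIF.
if len(model.seg) >= 1:
    imgs = draw_mask(frames, model.seg)

    frame_dir = nncore.join(output_dir, nncore.pure_name(media_path))
    nncore.mkdir(frame_dir)

    for idx, img in enumerate(imgs):
        iio.imwrite(nncore.join(frame_dir, f'{idx:04d}.png'), img)

    print(f'Saved {len(imgs)} overlaid frame(s) to {frame_dir}')
```
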
## 📖 Citation

Please kindly cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2025unipixel,
  title={UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning},
  author={Liu, Ye and Ma, Zongyang and Pu, Junfu and Qi, Zhongang and Wu, Yang and Shan, Ying and Chen, Chang Wen},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```