---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---
# UniPixel-3B
<div style="display: flex; gap: 5px;">
<a href="https://arxiv.org/abs/2509.18094" target="_blank"><img src="https://img.shields.io/badge/arXiv-2509.18094-red"></a>
<a href="https://polyu-chenlab.github.io/unipixel/" target="_blank"><img src="https://img.shields.io/badge/Project-Page-brightgreen"></a>
<a href="https://github.com/PolyU-ChenLab/UniPixel/blob/main/README.md" target="_blank"><img src="https://img.shields.io/badge/License-BSD--3--Clause-purple"></a>
<a href="https://github.com/PolyU-ChenLab/UniPixel" target="_blank"><img src="https://img.shields.io/github/stars/PolyU-ChenLab/UniPixel"></a>
</div>
UniPixel is a unified MLLM for pixel-level vision-language understanding. It flexibly supports a variety of fine-grained tasks, including image/video segmentation, regional understanding, and a novel PixelQA task that jointly requires object-centric referring, segmentation, and question-answering in videos.
<p align="center"><img width="750" src="https://raw.githubusercontent.com/PolyU-ChenLab/UniPixel/refs/heads/main/.github/method.jpg"></p>
## Model Details
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause
## Quick Start
### Install the environment
1. Clone the repository from GitHub.
```shell
git clone https://github.com/PolyU-ChenLab/UniPixel.git
cd UniPixel
```
2. Set up the virtual environment.
```shell
conda create -n unipixel python=3.12 -y
conda activate unipixel
# you may modify 'cu128' to your own CUDA version
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
# other versions have not been verified
pip install flash_attn==2.8.2 --no-build-isolation
```
3. Install dependencies.
```shell
pip install -r requirements.txt
```
For NPU users, please install the CPU version of PyTorch and [`torch_npu`](https://github.com/Ascend/pytorch) instead.
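For example, a minimal Ascend setup might look like the sketch below (the exact `torch_npu` build must match your PyTorch and CANN releases, so treat the versions here as assumptions rather than tested pins):
```shell
# CPU build of PyTorch (no CUDA wheels on NPU machines)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu
# Ascend adapter for PyTorch - pick the release matching your torch version
pip install torch_npu
```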
### Quick Inference Demo
Try our [online demo](https://huggingface.co/spaces/PolyU-ChenLab/UniPixel) or the [inference script](https://github.com/PolyU-ChenLab/UniPixel/blob/main/tools/inference.py) below. Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/UniPixel) for more details.
```python
import imageio.v3 as iio
import nncore
from unipixel.dataset.utils import process_vision_info
from unipixel.model.builder import build_model
from unipixel.utils.io import load_image, load_video
from unipixel.utils.transforms import get_sam2_transform
from unipixel.utils.visualizer import draw_mask
media_path = '<path-to-jpg-or-mp4-file>'
prompt = 'Please segment the...'
output_dir = 'outputs'
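
# Load the UniPixel model and its processor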
model, processor = build_model('PolyU-ChenLab/UniPixel-3B')
device = next(model.parameters()).device
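
# Transform that resizes frames to the input resolution expected by SAM 2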
sam2_transform = get_sam2_transform(model.config.sam2_image_size)
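
# Load a single image or sample 16 frames from a video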
if any(media_path.endswith(k) for k in ('jpg', 'png')):
    frames, images = load_image(media_path), [media_path]
else:
    frames, images = load_video(media_path, sample_frames=16)
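
# Build the chat messages; 'min_pixels' / 'max_pixels' bound how much visual detail the processor keeps per frame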
messages = [{
    'role': 'user',
    'content': [{
        'type': 'video',
        'video': images,
        'min_pixels': 128 * 28 * 28,
        'max_pixels': 256 * 28 * 28 * int(16 / len(images))
    }, {
        'type': 'text',
        'text': prompt
    }]
}]
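
# Apply the chat template and preprocess the text and visual inputs for the language model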
text = processor.apply_chat_template(messages, add_generation_prompt=True)
images, videos, kwargs = process_vision_info(messages, return_video_kwargs=True)
data = processor(text=[text], images=images, videos=videos, return_tensors='pt', **kwargs)
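
# Attach the SAM 2 inputs: transformed frames and their original (height, width)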
data['frames'] = [sam2_transform(frames).to(model.sam2.dtype)]
data['frame_size'] = [frames.shape[1:3]]
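
# Generate the response with greedy decoding (sampling disabled)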
output_ids = model.generate(
    **data.to(device),
    do_sample=False,
    temperature=None,
    top_k=None,
    top_p=None,
    repetition_penalty=None,
    max_new_tokens=512)
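
# Keep only the newly generated tokens and drop the trailing EOS token before decoding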
assert data.input_ids.size(0) == output_ids.size(0) == 1
output_ids = output_ids[0, data.input_ids.size(1):]
if output_ids[-1] == processor.tokenizer.eos_token_id:
    output_ids = output_ids[:-1]
response = processor.decode(output_ids, clean_up_tokenization_spaces=False)
print(f'Response: {response}')
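
# If any masks were predicted, overlay them on the frames and save as a GIF (multi-frame) or PNG (single frame)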
if len(model.seg) >= 1:
    imgs = draw_mask(frames, model.seg)
    nncore.mkdir(output_dir)
    path = nncore.join(output_dir, f"{nncore.pure_name(media_path)}.{'gif' if len(imgs) > 1 else 'png'}")
    print(f'Output Path: {path}')
    iio.imwrite(path, imgs, duration=100, loop=0)
```
## Citation
Please kindly cite our paper if you find this project helpful.
```
@inproceedings{liu2025unipixel,
  title={UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning},
  author={Liu, Ye and Ma, Zongyang and Pu, Junfu and Qi, Zhongang and Wu, Yang and Shan, Ying and Chen, Chang Wen},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```