---
license: apache-2.0
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
- BLIP3o/BLIP3o-Pretrain-Short-Caption
- BLIP3o/BLIP3o-Pretrain-JourneyDB
base_model:
- OpenGVLab/InternVL3-1B
---

This repository contains the model (**autoencoders**) presented in the paper *UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing*.

UniLIP proposes a unified, CLIP-based encoder that captures both rich semantics and fine-grained image details. Through a **two-stage training scheme with self-distillation** for reconstruction, we enable CLIP to achieve excellent reconstruction quality **without compromising its original understanding abilities**. Building on this unified representation, UniLIP excels across understanding, generation, and editing tasks.
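
To make the training idea concrete, here is a minimal, *hypothetical* PyTorch sketch of a reconstruction objective combined with a self-distillation term, in which a frozen copy of the original encoder anchors the adapted features. The tiny convolutional networks, loss weighting, and variable names are illustrative stand-ins, not the CLIP encoder, decoder, or hyperparameters used in the paper; see the GitHub repository for the official implementation.

```python
# Toy sketch of reconstruction training with self-distillation.
# The networks below are illustrative stand-ins, NOT the paper's
# actual CLIP vision encoder or pixel decoder.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in "encoder" and pixel decoder (UniLIP adapts a CLIP vision encoder).
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.GELU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
)

# Self-distillation teacher: a frozen copy of the encoder before adaptation,
# so the adapted features stay close to the original representation.
teacher = copy.deepcopy(encoder).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

images = torch.randn(4, 3, 64, 64)  # dummy batch
feats = encoder(images)             # unified features (semantics + details)
recon = decoder(feats)              # reconstructed pixels
with torch.no_grad():
    target = teacher(images)        # features of the un-adapted encoder

loss = F.mse_loss(recon, images) + F.mse_loss(feats, target)  # recon + distill
opt.zero_grad()
loss.backward()
opt.step()
```

In the two-stage setting described above, the encoder might, for example, first stay frozen while only the decoder learns to reconstruct, with the distillation constraint applied once the encoder is unfrozen; consult the paper for the exact schedule.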
For more details, please refer to the original paper and the GitHub repository:

- Paper: https://www.arxiv.org/abs/2507.23278
- GitHub: https://github.com/nnnth/UniLIP