CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction Paper • 2310.01403 • Published Oct 2, 2023
CLIM: Contrastive Language-Image Mosaic for Region Representation Paper • 2312.11376 • Published Dec 18, 2023
OMG-Seg: Is One Model Good Enough For All Segmentation? Paper • 2401.10229 • Published Jan 18, 2024
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation Paper • 2503.21979 • Published Mar 27, 2025