LLaVE is a series of large language and vision embedding models trained on a variety of multimodal embedding datasets.

- zhibinlan/LLaVE-0.5B (Image-Text-to-Text, 0.9B parameters)
- zhibinlan/LLaVE-2B (Image-Text-to-Text, 2B parameters)
- zhibinlan/LLaVE-7B (Image-Text-to-Text, 8B parameters)

Paper: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (arXiv:2503.04812)
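
Each model card documents the exact loading and encoding code for its checkpoint. As a rough orientation only, the sketch below shows how an image-text embedding model of this kind is typically queried; the `AutoModel`/`trust_remote_code` loading path and the `get_image_features`/`get_text_features` helpers are illustrative assumptions borrowed from CLIP-style models, not LLaVE's documented API.

```python
# Hypothetical sketch of image-text retrieval with an embedding model.
# The loading path (AutoModel + trust_remote_code) and the feature-extraction
# methods below are illustrative assumptions; see each LLaVE model card for
# the documented usage.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "zhibinlan/LLaVE-2B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

image = Image.open("query.jpg")  # any local image
captions = ["a diagram of a transformer", "a photo of a dog"]

with torch.no_grad():
    # Assumed CLIP-style helpers; the real entry points may differ.
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=captions, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_inputs)
    txt_emb = model.get_text_features(**txt_inputs)

# Embeddings are compared by cosine similarity: L2-normalize, then dot product.
img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
txt_emb = torch.nn.functional.normalize(txt_emb, dim=-1)
scores = img_emb @ txt_emb.T  # shape (1, num_captions)
print(scores)
```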