matlok's Collections
Papers - Multimodal

TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 21

ImageBind: One Embedding Space To Bind Them All
Paper • 2305.05665 • Published • 6

DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 189

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 4

Paper • 2104.03964 • Published • 2

MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 77

Veagle: Advancements in Multimodal Representation Learning
Paper • 2403.08773 • Published • 10

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Paper • 2304.14178 • Published • 3

Gemini: A Family of Highly Capable Multimodal Models
Paper • 2312.11805 • Published • 47

Flamingo: a Visual Language Model for Few-Shot Learning
Paper • 2204.14198 • Published • 15

Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 11

Qwen Technical Report
Paper • 2309.16609 • Published • 37

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper • 2402.12226 • Published • 45

Unifying Vision, Text, and Layout for Universal Document Processing
Paper • 2212.02623 • Published • 11

Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Paper • 2403.10301 • Published • 54

VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 37

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 17

TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
Paper • 2403.12906 • Published • 7

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Paper • 2403.13447 • Published • 19

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 53

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Paper • 2403.14438 • Published • 2

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Paper • 2403.15377 • Published • 26

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 47

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
Paper • 2305.02549 • Published • 6

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 11

LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models
Paper • 2404.03118 • Published • 26

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Paper • 2404.04125 • Published • 29

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 83

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • 2404.07973 • Published • 32

BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26

TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 24

Data curation via joint example selection further accelerates multimodal learning
Paper • 2406.17711 • Published • 3

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Paper • 2407.08583 • Published • 13

PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 72

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Paper • 2407.07053 • Published • 47

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Paper • 2407.11691 • Published • 15

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Paper • 2408.03695 • Published • 13

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Paper • 2408.04840 • Published • 34

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 100

Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 95

Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Paper • 2203.03897 • Published • 1

Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 46

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paper • 2308.11466 • Published • 1

Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147