Scaling Instructable Agents Across Many Simulated Worlds
Paper: 2404.10179
an encoder-decoder model which compresses videos to discrete embeddings (tokens) and a transformer model to translate text embeddings to video tokens.
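To make the phrase "compresses videos to discrete embeddings (tokens)" concrete, here is a minimal sketch of one common way to discretize continuous embeddings: nearest-neighbor lookup in a codebook (vector quantization). The text above does not say which discretization scheme the model uses, so the codebook, the function names `quantize` and `dequantize`, and the toy 2-D values are all illustrative assumptions rather than the described model's actual method.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(embeddings: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous embedding to the index of its nearest codebook
    vector (squared Euclidean distance). The indices act as discrete tokens."""
    tokens = []
    for vec in embeddings:
        distances = [
            sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Decoder-side lookup: turn token indices back into codebook vectors."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 3-entry codebook and three per-frame embeddings.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
frame_embeddings = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7)]

tokens = quantize(frame_embeddings, codebook)
reconstructed = dequantize(tokens, codebook)
```

In a setup like the one described, a separate transformer would then be trained to predict such token indices from text embeddings; that part is omitted here because the text gives no detail about it.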