🧩 Gemma-like Multimodal Speech-Vision-Text (from scratch)

This is a multimodal model trained from scratch on text, speech, and vision data.
It supports any-to-any pipelines (e.g., text-to-text, speech-to-text, vision-to-text).


🚀 How to Use

🔹 Text Input Example

```python
from transformers import AutoTokenizer, AutoModel

model_name = "AarambhAI/gemma-like-multimodal-speech-vision-text"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text and run a forward pass
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

print(outputs)
```