🧩 Gemma-like Multimodal Speech-Vision-Text (from scratch)

This is a multimodal model trained from scratch on text, speech, and vision data.
It supports any-to-any pipelines (e.g., text-to-text, speech-to-text, vision-to-text).


🚀 How to Use

🔹 Text Input Example

```python
from transformers import AutoTokenizer, AutoModel

model_name = "AarambhAI/gemma-like-multimodal-speech-vision-text"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text and run a forward pass
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

print(outputs)
```