# Gemma-like Multimodal Speech-Vision-Text (from scratch)
This is a multimodal model trained from scratch on text, speech, and vision data.
It supports any-to-any pipelines (e.g., text-to-text, speech-to-text, vision-to-text).
## How to Use
### Text Input Example
```python
from transformers import AutoTokenizer, AutoModel

model_name = "AarambhAI/gemma-like-multimodal-speech-vision-text"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text and run a forward pass
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs)
```
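
### Speech Input Example

The snippet below is a minimal sketch of the speech-to-text path. It assumes the repository ships a processor loadable via `AutoProcessor` that accepts raw audio; the keyword arguments (`audio`, `sampling_rate`) and the file `sample.wav` are placeholders, so check the repo's processor config for the actual interface.

```python
import soundfile as sf
import torch
from transformers import AutoModel, AutoProcessor

model_name = "AarambhAI/gemma-like-multimodal-speech-vision-text"

# Assumption: the repo provides a processor that handles audio feature extraction
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Load a mono waveform from disk (placeholder path; resample to the model's expected rate if needed)
waveform, sampling_rate = sf.read("sample.wav")

# Assumption: the processor accepts `audio` and `sampling_rate` keyword arguments
inputs = processor(audio=waveform, sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)
```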
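
### Vision Input Example

Likewise, a hedged sketch of the vision-to-text path. It assumes the processor exposes the common `images`/`text` interface used by many vision-language processors in `transformers`; the image path and prompt string are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name = "AarambhAI/gemma-like-multimodal-speech-vision-text"

# Assumption: the repo provides a processor with an image preprocessing pipeline
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

image = Image.open("example.jpg")  # placeholder path
# Assumption: the processor takes `images` and an optional `text` prompt
inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)
```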