MM-VAE Lyra 🎡

Multi-modal Variational Autoencoder for text embedding transformation using geometric fusion.

This first version is essentially CLIP-L + T5-base. It is similar in concept to the earlier shunt prototypes but diverges entirely in implementation: this variation is formatted and trained specifically as a VAE that encodes and decodes pairs of encodings together. Cantor cross-attention provides a form of high-density sparse containment; implemented correctly, it acts as a highly efficient global attention mechanism that keeps the fused representation solid. Fractal modalities make this possible: sparsity gaps in the combinatorial routes to learned encoding patterns allow a series of potentials to be matched only when needed, inside the otherwise empty Cantor-stair space. Fractal gaps filled with purpose occupy that space along fingerprint routes, enabling emergent fractal structure through which the modalities can learn the rules of each other's topology.
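
As an intuition for the "Cantor-stair space" described above, here is a purely illustrative toy: a sparse attention mask built from a discrete Cantor set, where a handful of anchor positions keep attention global while most pairwise links stay empty. The actual fusion code lives in geovocab2.train.model.vae.vae_lyra and is not reproduced here; the function names and masking rule below are assumptions for illustration only.

import torch

def cantor_indices(seq_len: int) -> torch.Tensor:
    """Positions whose base-3 expansion contains no digit 1 (a discrete
    Cantor set). These act as globally visible 'anchor' positions."""
    keep = []
    for i in range(seq_len):
        n, ok = i, True
        while n > 0:
            if n % 3 == 1:
                ok = False
                break
            n //= 3
        if ok:
            keep.append(i)
    return torch.tensor(keep, dtype=torch.long)

def cantor_attention_mask(seq_len: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask: a query may attend to a key if either
    position lies on the Cantor anchor set, or the two are immediate neighbours."""
    anchors = cantor_indices(seq_len)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[anchors, :] = True   # anchors see every position
    mask[:, anchors] = True   # every position sees the anchors
    idx = torch.arange(seq_len)
    mask |= (idx[:, None] - idx[None, :]).abs() <= 1  # keep a local band
    return mask

mask = cantor_attention_mask(77)
print(mask.float().mean())  # fraction of active links, far below a dense 1.0

Most of the grid stays empty, yet any two positions are reachable in two hops through an anchor, which is the sense in which a sparse Cantor-style pattern can still behave like global attention.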

The current implementation is trained with only a handful of token sequences, so it's essentially front-loaded. Expect short sequences to work, along with many longer sequences. Full-sequence pretraining will begin soon with a uniform vocabulary that takes both modalities' tokens and assigns a representative uniform token based on position.

This VAE is not for images - it's trained specifically to encode and decode PAIRS of encodings, each slightly twisted and warped toward the direction of intention learned in training. This is not your usual VAE, but she's most definitely trained like one.

Example prompt (sample images omitted):

A lone cybernetic deer with glimmering silver antlers stands beneath a fractured aurora sky, surrounded by glowing fungal trees, floating quartz shards, and bio-luminescent fog. In the distance, ruined monoliths pulse faint glyphs of a forgotten language, while translucent jellyfish swim through the air above a reflective obsidian lake. The atmosphere is electric with tension, color-shifting through prismatic hues. Distant thunderclouds churn violently.

She will do her job when fully trained.

Model Details

  • Fusion Strategy: cantor
  • Latent Dimension: 768
  • Training Steps: 31,899
  • Best Loss: 0.1840

Architecture

  • Modalities: CLIP-L (768d) + T5-base (768d)
  • Encoder Layers: 3
  • Decoder Layers: 3
  • Hidden Dimension: 1024

Usage

from geovocab2.train.model.vae.vae_lyra import MultiModalVAE, MultiModalVAEConfig
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="AbstractPhil/vae-lyra",
    filename="model.pt"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location="cpu")

# Create model
config = MultiModalVAEConfig(
    modality_dims={"clip": 768, "t5": 768},
    latent_dim=768,
    fusion_strategy="cantor"
)

model = MultiModalVAE(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Use model (clip_embeddings / t5_embeddings are the frozen text-encoder
# outputs; see the snippet below for one way to produce them)
inputs = {
    "clip": clip_embeddings,  # [batch, 77, 768] CLIP-L hidden states
    "t5": t5_embeddings       # [batch, 77, 768] T5-base hidden states
}

reconstructions, mu, logvar = model(inputs)
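
The clip_embeddings and t5_embeddings above are the frozen text-encoder outputs. One way to produce them, assuming the standard openai/clip-vit-large-patch14 text encoder for CLIP-L and the stock t5-base encoder, both padded to 77 tokens (the exact preprocessing used during training isn't documented on this card):

from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer
import torch

prompt = "A lone cybernetic deer with glimmering silver antlers"

# CLIP-L text encoder: 768-d hidden states over a 77-token context
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

# T5-base encoder: 768-d hidden states, padded to the same 77-token length
t5_tok = T5Tokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base").eval()

with torch.no_grad():
    clip_ids = clip_tok(prompt, padding="max_length", max_length=77,
                        truncation=True, return_tensors="pt")
    clip_embeddings = clip_enc(**clip_ids).last_hidden_state   # [1, 77, 768]

    t5_ids = t5_tok(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt")
    t5_embeddings = t5_enc(**t5_ids).last_hidden_state         # [1, 77, 768]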

Training Details

  • Trained on 10,000 diverse prompts
  • Mix of LAION flavors (85%) and synthetic prompts (15%)
  • KL Annealing: True
  • Learning Rate: 0.0001
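
KL annealing warms the weight on the KL term up from zero so the latent doesn't collapse early in training. The actual schedule and loss weighting used for Lyra aren't published on this card; a minimal sketch of the usual linear warm-up, with hypothetical warmup_steps and max_beta values, looks like:

import torch
import torch.nn.functional as F

def kl_weight(step: int, warmup_steps: int = 10_000, max_beta: float = 1.0) -> float:
    # Linearly ramp the KL weight from 0 to max_beta over warmup_steps
    return min(max_beta, max_beta * step / warmup_steps)

def vae_loss(reconstructions, targets, mu, logvar, step):
    # Reconstruction term summed over both modalities, plus the annealed KL term
    recon = sum(F.mse_loss(reconstructions[m], targets[m]) for m in targets)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight(step) * kl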

Citation

@software{vae_lyra_2025,
  author = {AbstractPhil},
  title = {VAE Lyra: Multi-Modal Variational Autoencoder},
  year = {2025},
  url = {https://huggingface.co/AbstractPhil/vae-lyra}
}