Background: USM Encoder extracted from Gemma 3n model

Gemma 3n can process audio inputs. It does so by encoding audio with a Universal Speech Model (USM, https://arxiv.org/abs/2303.01037) encoder. This encoder produces about 6.25 frames per second (one frame per 160 ms of audio), and each frame is a continuous embedding with a dimensionality of 1536.
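To get a feel for the frame rate, here is a small back-of-the-envelope sketch. It assumes roughly one output frame per 160 ms of audio and rounds up; both the 160 ms figure and the helper name `expected_frames` are assumptions for illustration, not values read from the model config. The exact count can differ by a frame depending on padding.

```python
import math


def expected_frames(duration_s: float, frame_ms: float = 160.0) -> int:
    """Estimate how many encoder frames a clip of `duration_s` seconds yields.

    frame_ms=160 is an assumed frame duration, not a value taken from the config.
    """
    return math.ceil(duration_s * 1000.0 / frame_ms)


print(expected_frames(10))  # 63
print(expected_frames(30))  # 188
```

The estimate for 10 seconds agrees with the 63 frames reported in the usage example in this README.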

This repo

To facilitate experimentation with this encoder, I've extracted the audio encoder weights from the full Gemma 3n model, so that the encoder can be used on its own. The weights come from this HF Gemma3n repo.

Some imports:

import torch
from transformers.models.gemma3n.feature_extraction_gemma3n import Gemma3nAudioFeatureExtractor
import sphn
import librosa
from transformers import Gemma3nAudioConfig, Gemma3nAudioEncoder
from huggingface_hub import hf_hub_download

Loading the model:


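# Build the encoder from the default audio config
configuration = Gemma3nAudioConfig()

repo_id = "n0mad-0/gemma3n-usm-rip"
filename = "usm.th"

# Download the extracted encoder weights from the Hub (cached locally)
model_path = hf_hub_download(repo_id=repo_id, filename=filename)

encoder = Gemma3nAudioEncoder(configuration).cuda()
encoder.load_state_dict(
    torch.load(model_path, weights_only=True, map_location='cuda')
)
encoder.eval()  # switch to inference mode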

Now we load the audio, build the feature extractor (which prepares mel spectrograms), and run the USM encoder:

feature_extractor = Gemma3nAudioFeatureExtractor() # operates on 30s chunks, expects 16_000 sampling rate

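# sphn.read returns a (channels, samples) array together with the file's sample rate
audio, sample_rate = sphn.read("bria.mp3")
audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=feature_extractor.sampling_rate)
audio = audio[:, : 10 * feature_extractor.sampling_rate]  # keep only the first 10 seconds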

features = feature_extractor(audio)

audio_mel = torch.stack(
    [torch.from_numpy(x) for x in features['input_features']]
).cuda()

audio_mel_mask = torch.stack(
    [torch.from_numpy(x) for x in features['input_features_mask']]
).cuda()

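# The extractor's mask marks valid frames with True, while the encoder appears
# to expect True for padded frames, hence the inversion.
with torch.no_grad():
    emb, mask = encoder(audio_mel, ~audio_mel_mask)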
emb.shape  # torch.Size([1, 63, 1536])
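A common next step is to collapse the per-frame embeddings into a single vector per clip. The sketch below does a masked mean over time. It runs on stand-in tensors with the shapes from the example above, and it assumes that the mask returned by the encoder uses True for padded frames; that assumption, and the helper name `masked_mean`, are mine and worth verifying against your own outputs.

```python
import torch


def masked_mean(emb: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
    """Average frame embeddings over time, ignoring padded frames.

    emb:      (batch, frames, dim) float tensor
    pad_mask: (batch, frames) bool tensor, True = padded frame (assumption)
    """
    valid = (~pad_mask).unsqueeze(-1).to(emb.dtype)  # (batch, frames, 1)
    summed = (emb * valid).sum(dim=1)                # (batch, dim)
    counts = valid.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts


# Stand-in tensors with the same shapes as the encoder output above
emb = torch.randn(1, 63, 1536)
mask = torch.zeros(1, 63, dtype=torch.bool)
mask[:, 60:] = True  # pretend the last 3 frames are padding

pooled = masked_mean(emb, mask)
print(pooled.shape)  # torch.Size([1, 1536])
```

With the real encoder output, `masked_mean(emb, mask)` would be called on the `emb` and `mask` returned by `encoder(...)`. Mean pooling is only one option; keeping the full frame sequence is preferable when temporal detail matters.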

license: gemma