---
license: mit
language:
- ru
- en
pipeline_tag: automatic-speech-recognition
---

# GigaAM-v3

GigaAM-v3 is a Conformer-based foundation model with 220–240M parameters, pretrained on diverse Russian speech data using the HuBERT-CTC objective.
It is the third generation of the GigaAM family and provides state-of-the-art performance on Russian ASR across a wide range of domains.

GigaAM-v3 includes the following model variants:
- `ssl`: self-supervised HuBERT-CTC encoder pretrained on 700,000 hours of Russian speech
- `ctc`: ASR model fine-tuned with a CTC decoder
- `rnnt`: ASR model fine-tuned with an RNN-T decoder
- `e2e_ctc`: end-to-end CTC model with punctuation and text normalization
- `e2e_rnnt`: end-to-end RNN-T model with punctuation and text normalization
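
The variant names above are also the revision names used when loading a model (see Usage). As a small sketch, the lookup table below restates the list as data. `VARIANTS` and `pick_variant` are hypothetical names introduced for illustration and are not part of the GigaAM package.

```python
from typing import Optional

# Restates the variant list above as data. VARIANTS and pick_variant are
# illustrative helpers, not part of the GigaAM package.
VARIANTS = {
    "ssl": {"decoder": None, "e2e": False},
    "ctc": {"decoder": "ctc", "e2e": False},
    "rnnt": {"decoder": "rnnt", "e2e": False},
    "e2e_ctc": {"decoder": "ctc", "e2e": True},
    "e2e_rnnt": {"decoder": "rnnt", "e2e": True},
}


def pick_variant(decoder: Optional[str], e2e: bool = False) -> str:
    """Return the variant name for a decoder type and end-to-end flag."""
    for name, traits in VARIANTS.items():
        if traits["decoder"] == decoder and traits["e2e"] == e2e:
            return name
    raise ValueError(f"no variant with decoder={decoder!r}, e2e={e2e}")


print(pick_variant("rnnt", e2e=True))  # e2e_rnnt
print(pick_variant(None))              # ssl
```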

`GigaAM-v3` training incorporates new internal datasets: callcenter conversations, speech with background music, natural speech, and speech with atypical characteristics.
The models perform on average **30%** better on these new domains while maintaining the quality of previous GigaAM generations on public benchmarks.

The table below reports the Word Error Rate (WER, %) for `GigaAM-v3` and other models across diverse domains (lower is better).

| Set Name          | V3_CTC | V3_RNNT | T-One + LM | Whisper |
|:------------------|-------:|--------:|-----------:|--------:|
| Open Datasets     |   3.0  |     2.6 |        5.7 |    12.0 |
| Golos Farfield    |   4.5  |     3.9 |       12.2 |    16.7 |
| Natural Speech    |   7.8  |     6.9 |       14.5 |    13.6 |
| Disordered Speech |  20.6  |    19.2 |       51.0 |    59.3 |
| Callcenter        |  10.3  |     9.5 |       13.5 |    23.9 |
| **Average**       | **9.2**| **8.4** |       19.4 |    25.1 |
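
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script behind the table; any text normalization applied before scoring is not shown and may differ. It also checks that the **Average** row matches the unweighted mean of the five rows above it.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Per-domain WER values copied from the table above, in row order.
TABLE = {
    "V3_CTC": [3.0, 4.5, 7.8, 20.6, 10.3],
    "V3_RNNT": [2.6, 3.9, 6.9, 19.2, 9.5],
    "T-One + LM": [5.7, 12.2, 14.5, 51.0, 13.5],
    "Whisper": [12.0, 16.7, 13.6, 59.3, 23.9],
}
averages = {model: round(sum(v) / len(v), 1) for model, v in TABLE.items()}
print(averages)

# One substitution and one deletion against a six-word reference.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```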

The end-to-end ASR models (`e2e_ctc` and `e2e_rnnt`) produce punctuated, normalized text directly.
In end-to-end ASR comparisons of `e2e_ctc` and `e2e_rnnt` against Whisper-large-v3, with Gemini 2.5 Pro acting as an LLM judge, the GigaAM-v3 models win by an average margin of **70:30**.

For detailed results, see [metrics](https://github.com/salute-developers/GigaAM/blob/main/evaluation.md).

## Usage
```python
from transformers import AutoModel

revision = "e2e_rnnt"  # can be any v3 model: ssl, ctc, rnnt, e2e_ctc, e2e_rnnt
model = AutoModel.from_pretrained(
    "ai-sage/GigaAM-v3",
    revision=revision,
    trust_remote_code=True,
)

transcription = model.transcribe("example.wav")
print(transcription)
```
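
To transcribe several recordings with one loaded model, a thin wrapper around the calls shown above may be enough. In this sketch, `load_gigaam` and `transcribe_files` are hypothetical helper names. Only `AutoModel.from_pretrained` and `model.transcribe` come from the snippet above, and it is an assumption that `transcribe` accepts a file path for every file in the batch, as it does for `example.wav`.

```python
from typing import Any, Callable, Dict, Iterable


def load_gigaam(revision: str = "e2e_rnnt"):
    """Load one GigaAM-v3 variant with the same arguments as the snippet above."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "ai-sage/GigaAM-v3",
        revision=revision,
        trust_remote_code=True,
    )


def transcribe_files(
    transcribe: Callable[[str], Any], paths: Iterable[str]
) -> Dict[str, Any]:
    """Apply a transcribe callable to each path and collect results by path."""
    return {str(path): transcribe(str(path)) for path in paths}


# Example (requires the model weights and the audio files to exist):
# model = load_gigaam("e2e_rnnt")
# for path, text in transcribe_files(model.transcribe, ["a.wav", "b.wav"]).items():
#     print(path, text)
```

Passing the `transcribe` callable in, rather than the model itself, keeps the loop easy to try out with a stub before downloading any weights.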

Recommended versions:
- `torch==2.8.0`, `torchaudio==2.8.0`
- `transformers==4.57.1`
- `pyannote-audio==4.0.0`, `torchcodec==0.7.0`
- (any) `hydra-core`, `omegaconf`, `sentencepiece`
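
One way to compare an environment against these pins is to query installed distributions with the standard-library `importlib.metadata`. The sketch below is illustrative: `RECOMMENDED` and `check_versions` are hypothetical names, and the distribution names are taken verbatim from the list above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Pinned versions copied from the recommended list above.
RECOMMENDED = {
    "torch": "2.8.0",
    "torchaudio": "2.8.0",
    "transformers": "4.57.1",
    "pyannote-audio": "4.0.0",
    "torchcodec": "0.7.0",
}


def check_versions(pins: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each package to (installed version or None, recommended version)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


for name, (installed, wanted) in check_versions(RECOMMENDED).items():
    # Local build tags such as "+cpu" are ignored when comparing.
    ok = installed is not None and installed.split("+")[0] == wanted
    status = "OK" if ok else "CHECK"
    print(f"{name}: installed={installed}, recommended={wanted}, {status}")
```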

A full usage guide is available in the [example notebook](https://github.com/salute-developers/GigaAM/blob/main/colab_example.ipynb).

**License:** MIT

**Paper:** [GigaAM: Efficient Self-Supervised Learner for Speech Recognition (InterSpeech 2025)](https://arxiv.org/abs/2506.01192)