utter-project
/

mHuBERT-147-base-1st-iter

 ---
 license: cc-by-nc-4.0
+language:
+- ab
+- af
+- am
+- ar
+- as
+- az
+- ba
+- be
+- bn
+- bo
+- bs
+- br
+- bg
+- ca
+- cs
+- cv
+- cy
+- da
+- de
+- dv
+- el
+- en
+- eo
+- et
+- eu
+- ee
+- fo
+- fa
+- tl
+- fi
+- fr
+- fy
+- ga
+- gl
+- gv
+- gn
+- gu
+- ht
+- ha
+- he
+- hi
+- hr
+- hu
+- hy
+- ig
+- ia
+- id
+- is
+- it
+- jv
+- ja
+- kn
+- ka
+- kk
+- km
+- rw
+- ky
+- ku
+- ko
+- lo
+- la
+- lv
+- ln
+- lt
+- lb
+- lg
+- ml
+- mr
+- mk
+- mg
+- mt
+- mn
+- mi
+- ms
+- my
+- ne
+- nl
+- nn
+- no
+- oc
+- or
+- pa
+- pl
+- pt
+- ps
+- ro
+- ru
+- sa
+- si
+- sl
+- sk
+- sn
+- sd
+- so
+- st
+- es
+- sq
+- sc
+- sr
+- su
+- sw
+- sv
+- ta
+- tt
+- te
+- tg
+- th
+- tn
+- tk
+- tr
+- tw
+- ug
+- uk
+- ur
+- uz
+- vi
+- xh
+- yi
+- yo
+- zh
 ---
+## mHuBERT-147 models
+mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
+This repository contains:
+* Fairseq checkpoint (original);
+* HuggingFace checkpoint;
+* Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
+# Additional Information
+**Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
+Please note that since training, there were CommonVoice removal requests. This means that some of the listed files are no longer available.
+**Fairseq fork:** https://github.com/utter-project/fairseq
+**Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts
+**Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
+# Datasets Included
+For ASR/ST/TTS datasets, only train set is used.
+* [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
+* [BibleTTS](https://www.openslr.org/129/)
+* [ClovaCall](https://github.com/clovaai/ClovaCall)
+* [CommonVoice v11](https://commonvoice.mozilla.org/en/datasets)
+* Google TTS data: [Javanese](https://www.openslr.org/41/), [Khmer](https://www.openslr.org/42/), [Nepali](https://www.openslr.org/43/), [Sundanese](https://www.openslr.org/44/), [South African Languages](https://www.openslr.org/32/), [Bengali Languages](https://www.openslr.org/37/)
+* IISc-MILE: [Tamil](https://www.openslr.org/127/), [Kannada](https://www.openslr.org/126/)
+* [Japanese Versatile Speech](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus)
+* [Kokoro](https://github.com/kaiidams/Kokoro-Speech-Dataset)
+* [Kosp2e](https://github.com/warnikchow/kosp2e)
+* Media Speech: [Turkish Only](https://www.openslr.org/108/)
+* [Multilingual LibriSpeech](https://www.openslr.org/94/)
+* [Samrómur](https://www.openslr.org/128/)
+* [THCHS-30](https://www.openslr.org/18/) and [THUYG-20](https://www.openslr.org/22/)
+* [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
+* [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)
+# Citing
+```
+@inproceedings{boito2024mhubert,
+author={Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu},
+title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
+year=2024,
+booktitle={Interspeech 2024},
+}
+```
+# Funding
+This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631.
+For more information go to https://he-utter.eu/