To help us better understand how the model is being used and by whom, we ask you to provide some basic information. This will support future improvements and help ensure the model continues to meet the needs of its user community. Please note: this model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
data2vec-HAT-1.4K-base
This repository provides access to a data2vec1-Base model for Haitian Creole (hat).
Model
Model and data description
The model was pretrained on the following data sets:
- Atlas Linguistique d'Haïti consisting of fieldwork recordings (directed by Dominique Fattier) collected between 1975 and 1985
- Corpus of Northern Haitian Creole consisting of fieldwork recordings (by Albert Valdman) collected in Cap-Haïtien
- Haiti-CMU consisting of read speech
- IARPA Babel Haitian Creole Language Pack consisting of phone-based conversational speech and read speech
- VoxLingua107 consisting of 90h of recordings in Haitian Creole scraped from YouTube
- Radio Haiti consisting of radio broadcast recordings (1950 to early 2000s) in Haitian Creole
The pre-processing scripts are available at https://gin.g-node.org/CREAM/SSL-Haitian/
The original fairseq model was converted to Hugging Face format using the code at https://github.com/LLL-Orleans/convert_data2vec_to_hf. The original fairseq model is also available, enabling continued pre-training or fine-tuning with fairseq.
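Because the converted checkpoint is in Hugging Face format, it can presumably be loaded with the `transformers` library. The snippet below is a minimal sketch and does not come from the model authors. It makes several assumptions: the repository ID `your-org/data2vec-HAT-1.4K-base` is a hypothetical placeholder, the 16 kHz sampling rate is taken from the standard data2vec audio setup rather than from this card, the repository is assumed to ship a feature-extractor configuration, and access to the gated files must already have been granted (with a token available, for example in the `HF_TOKEN` environment variable).

```python
import os

# Hypothetical placeholder: replace with the actual repository ID on the Hub.
REPO_ID = "your-org/data2vec-HAT-1.4K-base"

# Assumption: 16 kHz mono input, as in the standard data2vec audio setup.
SAMPLE_RATE = 16_000


def extract_features(waveform, repo_id=REPO_ID, token=None):
    """Return frame-level hidden states for a mono waveform.

    `waveform` is a 1-D sequence of float samples at SAMPLE_RATE.
    The heavy dependencies are imported lazily so that this module
    can be imported without torch or transformers installed.
    """
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    # The model is gated, so an access token is needed unless the user
    # has already logged in with the Hugging Face CLI.
    token = token or os.environ.get("HF_TOKEN")

    # Assumption: the repository contains a preprocessor configuration.
    extractor = AutoFeatureExtractor.from_pretrained(repo_id, token=token)
    model = AutoModel.from_pretrained(repo_id, token=token)
    model.eval()

    inputs = extractor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Tensor of shape (batch, frames, hidden_size).
    return outputs.last_hidden_state
```

Calling `extract_features(samples)` on a list or NumPy array of 16 kHz samples would return one embedding per frame, which can then be fed to a downstream classifier or fine-tuning head.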
For more details, see the paper.
Intended uses & limitations
This model is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
This is a gated model. Access is granted on a per-user basis, subject to formal approval by the CREAM principal investigator, Prof. Emmanuel Schang.
Acknowledgments
The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-20-CE38-0006 (project CREAM). Experiments were conducted using Grid'5000, developed under INRIA ALADDIN with support from CNRS, RENATER, and various universities (see https://www.grid5000.fr). Additional resources include the CaSciModOT cluster (https://cascimodot.fr/) at Centre de Calcul Scientifique en région Centre-Val de Loire and HPC resources from IDRIS provided by GENCI (allocation 2024-AD011014940).
Referencing this model
@inproceedings{havard-et-al-taln25,
author = "Havard, William N. and Govain, Renauld and Lecouteux, Benjamin and Schang, Emmanuel",
title = "Mod\`eles auto-supervis\'es de traitement de la parole pour le Cr\'eole Haitien",
booktitle = "Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes des 32\`eme Conf\'erence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux",
month = "6",
year = "2025",
address = "Marseille, France",
publisher = "Association pour le Traitement Automatique des Langues",
pages = "543-555",
note = "",
url = "https://talnarchives.atala.org/TALN/TALN-2025/98.pdf"
}