To help us better understand how the model is being used and by whom, we ask you to provide some basic information. This will support future improvements and help ensure the model continues to meet the needs of its user community. Please note: this model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

data2vec-HAT-1.4K-base

This repository provides access to a data2vec1-Base model for Haitian Creole (hat).

Model

Model and data description

The model was pretrained on the following data sets:

Atlas Linguistique d'Haïti consisting of fieldwork recordings (directed by Dominique Fattier) collected between 1975 and 1985
Corpus of Northern Haitian Creole consisting of fieldwork recordings (by Albert Valdman) collected in Cap-Haïtien
Haiti-CMU consisting of read speech
IARPA Babel Haitian Creole Language Pack consisting of phone-based conversational speech and read speech
VoxLingua107 consisting of 90h of recordings in Haitian Creole scraped from YouTube
Radio Haiti consisting of radio broadcast recordings (1950 to early 2000s) in Haitian Creole

The pre-processing scripts are available at https://gin.g-node.org/CREAM/SSL-Haitian/. The original fairseq models were converted to HuggingFace format using the code at https://github.com/LLL-Orleans/convert_data2vec_to_hf. The original fairseq model is also available, enabling continued pre-training or fine-tuning with fairseq.
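Since the checkpoint is distributed in HuggingFace format, it can presumably be loaded with the `transformers` library. The following is a minimal sketch, not the authors' documented procedure. It assumes that the checkpoint loads with `Data2VecAudioModel`, that the model expects 16 kHz mono audio normalised to zero mean and unit variance, and that a recent `transformers` release accepting the `token` argument is installed. `REPO_ID` is a placeholder (the hosting namespace is not stated here), and `normalize`, `load_model` and `embed` are hypothetical helper names.

```python
import math

# Placeholder: replace <namespace> with the account that hosts the model.
REPO_ID = "<namespace>/data2vec-HAT-1.4K-base"


def normalize(waveform):
    """Scale a list of samples to zero mean and unit variance.

    Assumption: the model was trained on normalised input, as is common
    for wav2vec 2.0 / data2vec audio checkpoints.
    """
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((x - mean) ** 2 for x in waveform) / n
    scale = math.sqrt(var + 1e-7)  # small constant avoids division by zero
    return [(x - mean) / scale for x in waveform]


def load_model(repo_id=REPO_ID, token=None):
    """Download and load the gated checkpoint (needs an approved access token)."""
    from transformers import Data2VecAudioModel  # imported lazily

    model = Data2VecAudioModel.from_pretrained(repo_id, token=token)
    model.eval()  # inference mode: disables dropout
    return model


def embed(model, waveform):
    """Return frame-level representations for one mono waveform (assumed 16 kHz)."""
    import torch  # imported lazily

    # Shape (1, num_samples): a batch containing a single utterance.
    inputs = torch.tensor(normalize(waveform), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        outputs = model(input_values=inputs)
    return outputs.last_hidden_state  # shape (1, num_frames, hidden_size)
```

A typical call would be `model = load_model(token="hf_...")` followed by `features = embed(model, samples)`, where `samples` is a list of floats read from an audio file. Because the model is gated, the token must belong to an account whose access request has been approved.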

For more details, see the paper.

Intended uses & limitations

This model is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

This is a gated model. Access is granted on a per-user basis, pending formal approval by the CREAM principal investigator, Prof. Emmanuel Schang.

Acknowledgments

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-20-CE38-0006 (project CREAM). Experiments were conducted using Grid'5000, developed under INRIA ALADDIN with support from CNRS, RENATER, and various universities (see https://www.grid5000.fr). Additional resources include the CaSciModOT cluster (https://cascimodot.fr/) at Centre de Calcul Scientifique en région Centre-Val de Loire and HPC resources from IDRIS provided by GENCI (allocation 2024-AD011014940).

Referencing this model

@inproceedings{havard-et-al-taln25,
    author = "Havard, William N. and Govain, Renauld and Lecouteux, Benjamin and Schang, Emmanuel",
    title = "Mod\`eles auto-supervis\'es de traitement de la parole pour le Cr\'eole Haitien",
    booktitle = "Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes des 32\`eme Conf\'erence sur le Traitement Automatique des Langues Naturelles (TALN),  volume 1 : articles scientifiques originaux",
    month = "6",
    year = "2025",
    address = "Marseille, France",
    publisher = "Association pour le Traitement Automatique des Langues",
    pages = "543-555",
    note = "",
    url = "https://talnarchives.atala.org/TALN/TALN-2025/98.pdf"
}

Model size: 93.2M params (Safetensors, tensor type F32)
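The parameter count and tensor type listed for this checkpoint give a rough idea of its memory footprint. The sketch below is a back-of-the-envelope estimate only: it assumes every parameter is stored as a 4-byte F32 value and ignores file metadata, activations and optimizer state. The function name `checkpoint_size_mb` is hypothetical.

```python
NUM_PARAMS = 93.2e6      # parameter count reported for this checkpoint
BYTES_PER_F32 = 4        # a 32-bit float occupies 4 bytes


def checkpoint_size_mb(num_params, bytes_per_param=BYTES_PER_F32):
    """Approximate size of the stored weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Roughly 373 MB of weights for 93.2M F32 parameters.
print(f"{checkpoint_size_mb(NUM_PARAMS):.1f} MB")
```

Loading the weights in half precision would roughly halve this figure, at the cost of reduced numerical precision.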
