basharalrfooh/Fine-Tashkeel

basharalrfooh committed on Apr 8, 2024

Commit adce6fb · verified · 1 Parent(s): eea70d0

Update README.md

Files changed (1)

README.md +22 -19


@@ -9,40 +9,43 @@ metrics:
 pipeline_tag: text2text-generation
 ---
-# Project Title
-Brief description of the project, including its purpose and use case.
 ## Table of Contents
 - [Introduction](#introduction)
-  - [Installation](#installation)
 - [Models](#models)
-  - [Model Name](#model-name)
     - [Model Description](#model-description)
-    - [Training Data](#training-data)
-- [Datasets](#datasets)
-  - [Dataset Name](#dataset-name)
-    - [Dataset Description](#dataset-description)
 - [Benchmarks](#benchmarks)
-- [Contributing](#contributing)
-- [License](#license)
 - [Citation](#citation)
 - [Contact](#contact)
 ## Introduction
-A more detailed introduction to the project. Explain the problem being addressed, the proposed solution, and the benefits of your approach.
-## Getting Started
-### Dependencies
-List any dependencies required to run your project.
-### Installation
-Provide step-by-step instructions on how to install and set up your project.
-```bash
-# Example of installation steps
-pip install your-project-name

 pipeline_tag: text2text-generation
 ---
+# Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization
 ## Table of Contents
 - [Introduction](#introduction)
 - [Models](#models)
+  - [ByT5](#model-name)
     - [Model Description](#model-description)
 - [Benchmarks](#benchmarks)
 - [Citation](#citation)
 - [Contact](#contact)
 ## Introduction
+Most previous work on Arabic diacritization relied on training models from scratch. In this paper, we investigate how to leverage pre-trained language models to learn diacritization. We finetune token-free pre-trained multilingual models (ByT5) to predict and insert missing diacritics in Arabic text, a complex task that requires understanding the sentence semantics and the morphological structure of the tokens. We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering, reducing WER by 40%. We release our finetuned models for the benefit of researchers in the community.
+## Model Description
+ByT5 is a token-free model: it operates directly on the UTF-8 bytes of raw text instead of a learned subword vocabulary, which lets it handle diverse languages and scripts without a language-specific tokenizer. It is pre-trained on the multilingual mC4 corpus and can be adapted to a range of text-to-text NLP tasks. We fine-tuned it on the Tashkeela dataset for 13,000 steps to restore diacritical marks in Arabic text.
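
As a rough illustration of what byte-level input looks like for this task, the sketch below strips the diacritics from a vocalized Arabic word (the kind of undiacritized text the model would be asked to restore) and shows how many UTF-8 bytes each form occupies. The helper names and the diacritic range U+064B to U+0652 are assumptions made for this example; they are not taken from the Fine-Tashkeel code, whose preprocessing may differ.

```python
# Arabic diacritics (tashkeel), fathatan through sukun: U+064B..U+0652.
# Assumption: this range is enough for illustration only.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Remove diacritic marks, keeping base letters and spacing."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_utf8_bytes(text):
    """Return the UTF-8 byte values a byte-level model would consume."""
    return list(text.encode("utf-8"))


# "kataba" (he wrote): kaf, teh, beh, each followed by a fatha (U+064E).
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)

print(bare)
print(len(to_utf8_bytes(vocalized)), "bytes with diacritics")
print(len(to_utf8_bytes(bare)), "bytes without diacritics")
```

Each Arabic letter and each diacritic in this range takes two bytes in UTF-8, so restoring diacritics roughly means inserting short byte sequences between the bytes of the base letters.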
+## Benchmarks
+Our model attained a Diacritic Error Rate (DER) of 0.95% and a Word Error Rate (WER) of 2.49%.
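
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed: DER as the percentage of characters whose diacritics differ from the reference, and WER as the percentage of words containing at least one such character. The function names are hypothetical, and the counting rules are assumptions (every non-diacritic character is counted, and diacritic order is ignored); the evaluation protocol behind the reported numbers may differ.

```python
# Arabic diacritics (tashkeel): U+064B..U+0652 (assumed range).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_marks(word):
    """Split a word into (base_char, marks) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, prediction):
    """Return (DER, WER) in percent for two texts with identical base letters."""
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("texts must be non-empty with equal word counts")
    chars = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref, pred = split_marks(ref_word), split_marks(pred_word)
        if [b for b, _ in ref] != [b for b, _ in pred]:
            raise ValueError("base letters differ; only diacritics may change")
        # Compare marks order-insensitively (e.g. shadda + vowel).
        errs = sum(sorted(rm) != sorted(pm) for (_, rm), (_, pm) in zip(ref, pred))
        chars += len(ref)
        char_errors += errs
        word_errors += errs > 0
    return 100 * char_errors / chars, 100 * word_errors / len(ref_words)


# Two words, eight base characters; the prediction gets the final mark wrong.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
prediction = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
der, wer = der_wer(reference, prediction)
print(f"DER = {der:.2f}%, WER = {wer:.2f}%")
```

Because a single wrong mark makes the whole word count as an error, WER is always at least as sensitive as DER, which is consistent with the reported WER being higher than the DER.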
+## Citation
+@misc{alrfooh2023finetashkeel,
+      title={Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization},
+      author={Bashar Al-Rfooh and Gheith Abandah and Rami Al-Rfou},
+      year={2023},
+      eprint={2303.14588},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+## Contact
+[email protected]