pipeline_tag: text2text-generation
---

# Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization

## Table of Contents
- [Introduction](#introduction)
- [Models](#models)
  - [ByT5](#model-name)
  - [Model Description](#model-description)
- [Benchmarks](#benchmarks)
- [Citation](#citation)
- [Contact](#contact)

## Introduction

Most previous work on learning diacritization of the Arabic language relied on training models from scratch. In this work, we investigate how to leverage pre-trained language models to learn diacritization instead. We finetune the token-free, pre-trained multilingual model ByT5 to predict and insert the missing diacritics in Arabic text, a complex task that requires understanding both the sentence semantics and the morphological structure of the tokens. We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering, reducing WER by 40%. We release our finetuned models for the greater benefit of the researchers in the community.
## Model Description

ByT5 is distinguished by its token-free architecture: it operates directly on the raw bytes of text, which lets it handle diverse languages and linguistic nuances without a fixed vocabulary. Pre-trained on the large multilingual mC4 corpus, ByT5 is effective at both understanding and generating text, making it versatile across NLP tasks. We further finetuned it on the Tashkeela dataset for 13,000 steps, significantly improving its ability to restore the diacritical marks of Arabic text.
## Benchmarks

Our model attained a Diacritic Error Rate (DER) of 0.95 and a Word Error Rate (WER) of 2.49.
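To make the two metrics concrete, here is a hedged sketch of how DER and WER are commonly computed for diacritization; the paper's exact evaluation protocol (which marks are counted, how case endings are handled) may differ. DER is the fraction of base characters whose attached diacritics are wrong, and WER is the fraction of words containing at least one such error:

```python
# Hedged sketch of DER/WER for diacritization; the exact evaluation protocol
# used in the paper may differ (e.g. treatment of case endings).

# The eight common Arabic diacritics (fathatan .. sukun) as combining marks.
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def char_diacritics(text: str) -> list[tuple[str, str]]:
    """Pair each base character with the diacritic marks that follow it."""
    pairs: list[tuple[str, str]] = []
    for ch in text:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def der(reference: str, hypothesis: str) -> float:
    """Fraction of base characters with incorrect diacritics."""
    ref, hyp = char_diacritics(reference), char_diacritics(hypothesis)
    assert [b for b, _ in ref] == [b for b, _ in hyp], "base text must match"
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)

def wer(reference: str, hypothesis: str) -> float:
    """Fraction of words with at least one diacritization error."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    assert len(ref_words) == len(hyp_words), "word counts must match"
    return sum(r != h for r, h in zip(ref_words, hyp_words)) / len(ref_words)
```

This per-character formulation assumes the undiacritized base text of reference and hypothesis is identical, which holds when the model only inserts marks.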
## Citation

```bibtex
@misc{alrfooh2023finetashkeel,
      title={Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization},
      author={Bashar Al-Rfooh and Gheith Abandah and Rami Al-Rfou},
      year={2023},
      eprint={2303.14588},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contact