---
language:
- fa
---
# 🧠 Hakim-small: A Compact and Efficient Farsi Text Embedding Model [![arXiv](https://img.shields.io/badge/arXiv-2505.08435-b31b1b.svg)](https://arxiv.org/abs/2505.08435)

**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, specifically designed for the Persian language. While the main Hakim model sets the state of the art on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same training methodologies and datasets as its larger counterpart.

Hakim-small is optimized for applications such as semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks such as classification and question answering.

---
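To make the dense-retrieval use case concrete, the sketch below ranks documents by cosine similarity between embedding vectors. The toy 3-dimensional vectors and the plain-Python `cosine`/`rank` helpers are illustrative stand-ins, not part of Hakim-small; in practice the vectors would come from the model itself.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Indices of doc_vecs sorted by similarity to query_vec, best first."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [i for i, _ in scored]

# Toy 3-dimensional vectors standing in for real Hakim-small embeddings.
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # on-topic
    [0.0, 1.0, 0.0],  # unrelated
    [1.0, 0.0, 0.9],  # closest match
]
print(rank(query, docs))  # → [2, 0, 1]
```

The same ranking loop applies unchanged once real embedding vectors are substituted for the toy ones.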

## 📌 Model Highlights

- 🔍 **Strong FaMTEB Performance**: Achieves excellent results on the FaMTEB benchmark, especially for its size.
- 🧾 **Instruction-Tuned**: Capable of handling tasks like classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat-history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.

---

## 🏗️ Training Datasets

Hakim-small benefits from the comprehensive, high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources including document–title, FAQ, QA, paper title–abstract, and machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: classification, retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).

---
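The supervised stage pairs each query with one positive and multiple negatives. A generic contrastive (InfoNCE-style) loss over such a tuple can be sketched as below; this illustrates the general technique only, not the exact Hakim training objective, and the temperature value is an arbitrary assumption.

```python
from math import exp, log

def info_nce(pos_score, neg_scores, temperature=0.05):
    """Contrastive loss for one query: a positive pair vs. multiple negatives.

    Scores are similarities (e.g., cosine) between the query embedding and
    the positive / negative document embeddings.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = log(sum(exp(x - m) for x in logits))
    # Negative log-softmax probability assigned to the positive pair.
    return -(logits[0] - m - log_denom)

# A query whose positive scores well above its negatives incurs a small loss.
print(info_nce(0.9, [0.2, 0.1, 0.05]))
```

Minimizing this loss pushes the query embedding toward its positive document and away from the negatives, which is the general mechanism behind training with "multiple negatives per query."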

## 🧪 Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| **Hakim** | **73.81** | **84.56** | **70.46** | **89.75** | 69.46 | 40.43 | 76.62 | **85.41** |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | **74.56** | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | **43.51** | **78.65** | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |

---
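As a quick sanity check on the table, the "Avg. Score" column appears close to the unweighted mean of the seven category scores. This aggregation rule is an inference from the numbers, not the official FaMTEB definition, and small rounding drift is possible (e.g., Hakim-small's categories average to 70.46 vs. the listed 70.45).

```python
# Recompute "Avg. Score" as the unweighted mean of the seven category scores
# (assumed aggregation; the table's own rounding may differ slightly).
rows = {
    "Hakim":       [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41],
    "Hakim-small": [80.19, 66.31, 87.41, 67.30, 38.05, 75.53, 78.40],
}
averages = {name: round(sum(s) / len(s), 2) for name, s in rows.items()}
print(averages)  # Hakim → 73.81, matching the table
```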

## 🔧 Usage Example

Access to the Hakim model will be available through an API. This section will be updated with usage instructions and examples once the API is ready.

## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```