---
language:
- fa
---

# 🧠 Hakim-small: A Compact and Efficient Farsi Text Embedding Model [arXiv:2505.08435](https://arxiv.org/abs/2505.08435)

**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, specifically designed for the Persian language. While the main Hakim model sets the state of the art on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same advanced training methodologies and datasets as its larger counterpart.

Hakim-small is optimized for applications such as semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks such as classification and QA.
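In embedding-based semantic search, retrieval reduces to nearest-neighbor lookup in vector space. The sketch below ranks documents by cosine similarity; the tiny stand-in vectors are placeholders for the embeddings a model like Hakim-small would produce, so only the ranking logic is shown, not the model itself.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec, doc_vecs):
    # Return document indices sorted by similarity to the query, best first.
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Stand-in 2-D embeddings; a real pipeline would use Hakim-small vectors.
query = np.array([1.0, 0.0])
docs = [np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(rank_documents(query, docs))  # doc 0 is closer to the query
```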

---

## 📌 Model Highlights

- 🔍 **Strong FaMTEB Performance**: Achieves excellent results on the FaMTEB benchmark, especially for its size.
- 🧾 **Instruction-Tuned**: Handles tasks such as classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.

---
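Instruction-tuned embedding models typically prepend a short task description to the raw text before encoding, so one encoder can serve classification, STS, retrieval, and QA. The exact prompt format used by Hakim is defined in its paper; the prefix strings below are illustrative assumptions only, not the real Hakim prompts.

```python
# Hypothetical task instructions; the real Hakim prompt strings may differ.
TASK_INSTRUCTIONS = {
    "retrieval": "Given a query, retrieve relevant passages: ",
    "classification": "Classify the topic of the following text: ",
    "sts": "Measure the semantic similarity of the following text: ",
}

def format_input(task: str, text: str) -> str:
    # Prepend the task instruction so the encoder can condition on the task.
    return TASK_INSTRUCTIONS[task] + text

print(format_input("retrieval", "بهترین رستوران تهران"))
```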

## 🏗️ Training Datasets

Hakim-small benefits from the comprehensive, high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources, including document–title, FAQ, QA, and paper title–abstract pairs, as well as machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: classification, retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).

---
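A supervised record of the shape described above (one query, one positive, several negatives) is the usual input to a contrastive, InfoNCE-style training objective. The field names below are illustrative, not the actual Pairsia-sup schema:

```python
from dataclasses import dataclass, field

@dataclass
class ContrastivePair:
    # One training record: a query, its positive passage, and mined negatives.
    query: str
    positive: str
    negatives: list = field(default_factory=list)

record = ContrastivePair(
    query="پایتخت ایران کجاست؟",
    positive="تهران پایتخت ایران است.",
    negatives=["اصفهان شهری تاریخی است.", "شیراز در جنوب ایران قرار دارد."],
)
print(len(record.negatives))  # 2 negatives for this query
```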

## 🧪 Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-------|-------|
| **Hakim** | **73.81** | **84.56** | **70.46** | **89.75** | 69.46 | 40.43 | 76.62 | **85.41** |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | **74.56** | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | **43.51** | **78.65** | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |

---
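The Avg. Score column is consistent with an unweighted mean of the seven per-task scores (an assumption, checked here against the Hakim row):

```python
def famteb_average(scores):
    # Unweighted mean over the seven FaMTEB task scores, rounded to 2 dp.
    return round(sum(scores) / len(scores), 2)

hakim = [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41]
print(famteb_average(hakim))  # 73.81, matching the table
```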

## 🔧 Usage Example

Access to the Hakim model will be available through an API. This section will be updated with usage instructions and examples once the API is ready.

## Citation

If you use Hakim-small in your work, please cite:

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```