---
language:
- fa
---

# 🧠 Hakim-small: A Compact and Efficient Farsi Text Embedding Model [arXiv:2505.08435](https://arxiv.org/abs/2505.08435)

**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, specifically designed for the Persian language. While the main Hakim model sets the state of the art on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same advanced training methodologies and datasets as its larger counterpart.

Hakim-small is optimized for applications such as semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks such as classification and QA.
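In embedding-based semantic search, retrieval reduces to nearest-neighbor lookup in vector space. The sketch below ranks documents by cosine similarity; the tiny stand-in vectors are placeholders for the embeddings a model like Hakim-small would produce, so only the ranking logic is shown, not the model itself.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec, doc_vecs):
    # Return document indices sorted by similarity to the query, best first.
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Stand-in 2-D embeddings; a real pipeline would use Hakim-small vectors.
query = np.array([1.0, 0.0])
docs = [np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(rank_documents(query, docs))  # doc 0 is closer to the query
```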

---

## 📌 Model Highlights

- 🔍 **Strong FaMTEB Performance**: Achieves excellent results on the FaMTEB benchmark, especially for its size.
- 🧾 **Instruction-Tuned**: Handles tasks such as classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.

---
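Instruction-tuned embedding models typically prepend a short task description to the raw text before encoding, so one encoder can serve classification, STS, retrieval, and QA. The exact prompt format used by Hakim is defined in its paper; the prefix strings below are illustrative assumptions only, not the real Hakim prompts.

```python
# Hypothetical task instructions; the real Hakim prompt strings may differ.
TASK_INSTRUCTIONS = {
    "retrieval": "Given a query, retrieve relevant passages: ",
    "classification": "Classify the topic of the following text: ",
    "sts": "Measure the semantic similarity of the following text: ",
}

def format_input(task: str, text: str) -> str:
    # Prepend the task instruction so the encoder can condition on the task.
    return TASK_INSTRUCTIONS[task] + text

print(format_input("retrieval", "بهترین رستوران تهران"))
```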

## 🏗️ Training Datasets

Hakim-small benefits from the comprehensive, high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources, including document–title, FAQ, QA, and paper title–abstract pairs, as well as machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: classification, retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).

---
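A supervised record of the shape described above (one query, one positive, several negatives) is the usual input to a contrastive, InfoNCE-style training objective. The field names below are illustrative, not the actual Pairsia-sup schema:

```python
from dataclasses import dataclass, field

@dataclass
class ContrastivePair:
    # One training record: a query, its positive passage, and mined negatives.
    query: str
    positive: str
    negatives: list = field(default_factory=list)

record = ContrastivePair(
    query="پایتخت ایران کجاست؟",
    positive="تهران پایتخت ایران است.",
    negatives=["اصفهان شهری تاریخی است.", "شیراز در جنوب ایران قرار دارد."],
)
print(len(record.negatives))  # 2 negatives for this query
```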

## 🧪 Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|-----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-------|-------|
| **Hakim** | **73.81** | **84.56** | **70.46** | **89.75** | 69.46 | 40.43 | 76.62 | **85.41** |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | **74.56** | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | **43.51** | **78.65** | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |

---
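The Avg. Score column is consistent with an unweighted mean of the seven per-task scores (an assumption, checked here against the Hakim row):

```python
def famteb_average(scores):
    # Unweighted mean over the seven FaMTEB task scores, rounded to 2 dp.
    return round(sum(scores) / len(scores), 2)

hakim = [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41]
print(famteb_average(hakim))  # 73.81, matching the table
```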

## 🔧 Usage Example

Access to the Hakim model will be available through an API. This section will be updated with usage instructions and examples once the API is ready.

## Citation

If you use Hakim-small in your work, please cite:

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```