
🧠 Hakim: Farsi Text Embedding Model


Hakim is a state-of-the-art text embedding model for the Persian language. On the FaMTEB benchmark it outperforms previous models by about 8.5 points in average score (73.81 versus 65.29 for BGE-m3, the strongest baseline in the results table). Hakim is optimized for semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks such as classification and question answering.


📌 Model Highlights

  • FaMTEB SOTA: Ranked #1 (as of May 31, 2025) across 63 Persian NLP datasets
  • 🧾 Instruction-Tuned: Handles tasks like classification, STS, retrieval, QA, and cross-task reasoning
  • 🗣️ Chatbot-Ready: Fine-tuned with chat history-aware data
  • ⚙️ Compact & Fast: ~124M parameters, effective for real-world inference

πŸ—οΈ Training Datasets

📚 Pretraining

  • Corpesia: 11B tokens from 46 Persian websites across 21 domains (e.g. news, health, religion, tech)
  • hmBlogs: 6.8B tokens from ~20M Persian blog posts
  • Queries: 8.5M anonymized search queries

🔄 Unsupervised Stage (Pairsia-unsup)

  • 5M high-quality Persian text pairs from:
    • Document–title, FAQ, QA, and paper title–abstract
    • Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.)

🧠 Supervised Stage (Pairsia-sup)

  • 1.3M labeled pairs with 9 negatives per query
  • Instruction-based fine-tuning across:
    • Classification, Retrieval, STS, QA, NLI
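
The card does not state the training objective used with these pairs. Purely as an illustration of how one query, one positive, and nine negatives are commonly consumed, here is a sketch of an InfoNCE-style contrastive loss. The temperature value, the toy vectors, and the function names are assumptions, not details of the Hakim training setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy over [positive] + negatives, with the positive as target.

    ASSUMPTION: temperature=0.05 is an illustrative default, not a value
    taken from the model card.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy example: one query, one positive, nine negatives (hypothetical vectors).
query = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negatives = [[0.0, 1.0, 0.1 * i] for i in range(9)]

loss = info_nce_loss(query, positive, negatives)
print(f"loss with a well-separated positive: {loss:.6f}")
```

The loss approaches zero when the positive is much closer to the query than every negative, and grows when a negative outranks the positive.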

🧪 Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Hakim | 73.81 | 84.56 | 70.46 | 89.75 | 69.46 | 40.43 | 76.62 | 85.41 |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | 74.56 | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | 43.51 | 78.65 | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |
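
As a sanity check on these numbers, the Avg. Score column appears to be the unweighted mean of the seven task-category scores. This is inferred from the figures themselves, not stated in the card. A minimal sketch using the Hakim and BGE-m3 rows, which also shows the gap between Hakim and the strongest baseline in both absolute and relative terms:

```python
# Scores copied from the FaMTEB results; order: Classification, Clustering,
# PairClass., Reranking, Retrieval, STS, Summarization.
hakim = [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41]
bge_m3 = [58.75, 57.73, 85.21, 74.56, 43.38, 76.35, 61.07]

def mean(scores):
    """Unweighted arithmetic mean (assumed aggregation, inferred from the data)."""
    return sum(scores) / len(scores)

hakim_avg = round(mean(hakim), 2)
bge_avg = round(mean(bge_m3), 2)

# Absolute gap in average score, in points.
absolute_gain = round(hakim_avg - bge_avg, 2)
# The same gap expressed relative to the baseline average, in percent.
relative_gain = round(100 * (hakim_avg - bge_avg) / bge_avg, 1)

print(f"Hakim avg: {hakim_avg}, BGE-m3 avg: {bge_avg}")
print(f"Absolute gain: {absolute_gain} points, relative gain: {relative_gain}%")
```

The recomputed means match the reported averages for these two rows, and the absolute gap is what lines up with the headline 8.5 figure.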

Model Usage

You can interact with the Hakim model through our API. Below are examples using curl and Python.

Inference with curl

Here's how to send a request to the model using a curl command in your terminal.

Important: Replace your_api_key with your actual API key.

Note: For quick testing, you can use the value mcinext as your API key. This will allow you to use the API with some limitations.

curl -X POST 'https://mcinext.ai/api/hakim' \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization: Bearer your_api_key" \
-d '{
    "model": "Hakim",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
}'

Inference with Python

import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key or "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim", 
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()  

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
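
Once embeddings come back, a typical next step for semantic search is to rank documents by cosine similarity to a query vector. The sketch below assumes the response follows an OpenAI-style layout, `{"data": [{"embedding": [...]}, ...]}`. The card does not document the response schema, so inspect a real response and adjust `extract_embeddings` accordingly. The helper names and the toy payload are illustrative only.

```python
import math

def extract_embeddings(payload):
    """Pull the vectors out of a parsed JSON response.

    ASSUMPTION: OpenAI-style shape {"data": [{"embedding": [...]}, ...]};
    this is not confirmed by the model card.
    """
    return [item["embedding"] for item in payload["data"]]

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy payload standing in for response.json(); the first vector plays the
# role of the query and the rest are documents (hypothetical 2-D values).
fake_payload = {
    "data": [
        {"embedding": [1.0, 0.0]},
        {"embedding": [0.0, 1.0]},
        {"embedding": [0.8, 0.6]},
    ]
}

vectors = extract_embeddings(fake_payload)
ranking = rank_documents(vectors[0], vectors[1:])
print(ranking)
```

In practice the query and the documents would be sent in the same `input` list (or in separate requests) and the resulting vectors passed to `rank_documents`.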

Citation

@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}