🧠 Hakim: Farsi Text Embedding Model
Hakim is a state-of-the-art text embedding model designed for the Persian language. It significantly outperforms previous models on the FaMTEB benchmark, delivering an 8.5% performance gain. Hakim is optimized for applications such as semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks such as classification and QA.
📌 Model Highlights
- 🔍 FaMTEB SOTA: Ranked #1 (as of May 31, 2025) across 63 Persian NLP datasets
- 🧾 Instruction-Tuned: Handles tasks like classification, STS, retrieval, QA, and cross-task reasoning
- 🗣️ Chatbot-Ready: Fine-tuned with chat history-aware data
- ⚙️ Compact & Fast: ~124M parameters, efficient for real-world inference
🏗️ Training Datasets
📚 Pretraining
- Corpesia: 11B tokens from 46 Persian websites across 21 domains (e.g. news, health, religion, tech)
- hmBlogs: 6.8B tokens from ~20M Persian blog posts
- Queries: 8.5M anonymized search queries
🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from:
  - Document–title, FAQ, QA, and paper title–abstract pairs
  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.)
🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with 9 negatives per query
- Instruction-based fine-tuning across:
  - Classification, Retrieval, STS, QA, and NLI
🧪 Benchmark Results (FaMTEB)
| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Hakim | 73.81 | 84.56 | 70.46 | 89.75 | 69.46 | 40.43 | 76.62 | 85.41 |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | 74.56 | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | 43.51 | 78.65 | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |
How to Use the Hakim Model
You can interact with the Hakim model through our API. This API supports three different models: Hakim, Hakim-small, and Hakim-unsup. Below are the details on how to send requests and use the models.
1. Sending Requests Using curl
To send a request to the model using the `curl` command in your terminal, use the following command. Be sure to replace `your_api_key` with your actual API key.
Note: For quick testing, you can use `mcinext` as your API key. This will allow you to access the API with some limitations.
```bash
curl -X POST 'http://mcinext.ai/api/embedding-model' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
    "model": "Hakim",
    "input": [
      "Text of the first document.",
      "Text of the second document.",
      "And so on..."
    ],
    "prompt_type": "sentiment"
  }'
```
In this example:
- `model`: The model to use for processing (e.g., "Hakim", "Hakim-small", or "Hakim-unsup").
- `input`: A list of input texts to be processed.
- `prompt_type`: The type of task you want the model to perform, e.g., "sentiment" for sentiment analysis, "classification" for text classification, and so on.
2. Sending Requests Using Python

To use the API with Python, you can use the following code:
```python
import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your API key, or use "mcinext" for testing
API_URL = "http://mcinext.ai/api/embedding-model"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}
data = {
    "model": "Hakim",
    "input": [
        "Text of the first document.",
        "Text of the second document.",
        "And so on..."
    ],
    "prompt_type": "classification"  # Model task type (e.g., classification or sentiment)
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses
    print("Request successful!")
    print("Response JSON:")
    print(response.json())
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"An error occurred: {err}")
```
3. Supported Prompt Types

The `prompt_type` field is crucial for guiding the model to perform specific tasks. If you don't provide a `prompt_type`, the input is sent to the model without any special prefixes, which is the default behavior. This is particularly useful for the Hakim-unsup model, which is designed for unsupervised tasks.
Here is a list of all supported prompt types and their uses:
| `prompt_type` | Use Case | Preprocessed Example (in Farsi) |
|---|---|---|
| `sentiment` | Sentiment analysis of text. | مسئله : دسته بندی , تحلیل احساس رضایت متن \| متن : [متن شما] |
| `classification` | General and topical text classification. | مسئله : دسته بندی , دسته بندی موضوعی متن \| متن : [متن شما] |
| `clustering` | Text clustering and topical classification. | مسئله : دسته بندی , دسته بندی موضوعی متن \| متن : [متن شما] |
| `sts.sent1` | Semantic Textual Similarity (STS) for the first sentence. | مسئله : تشخیص ارتباط , آیا متن دوم شباهت معنایی با متن اول دارد ؟ \| متن اول : [متن شما] |
| `sts.sent2` | Semantic Textual Similarity (STS) for the second sentence. | مسئله : تشخیص ارتباط , آیا متن دوم شباهت معنایی با متن اول دارد ؟ \| متن دوم : [متن شما] |
| `retrieval.query` | Information Retrieval (query text). | مسئله : تشخیص ارتباط , آیا متن دوم به متن اول مرتبط است ؟ \| متن اول : [متن شما] |
| `retrieval.passage` | Information Retrieval (document text). | مسئله : تشخیص ارتباط , آیا متن دوم به متن اول مرتبط است ؟ \| متن دوم : [متن شما] |
| `cross` | Classification with two inputs, determining semantic relationship. | مسئله : دسته بندی با دو ورودی , نوع ارتباط معنایی متن دوم با متن اول چگونه است ؟ \| متن اول : [متن ۱] \| متن دوم : [متن ۲] |
4. Handling Special Tasks

STS (Semantic Textual Similarity)

For STS tasks, you compare the similarity between two pieces of text. You can send one or more sentences per side, but each side goes in its own request:
- Send the first sentence(s) with the `sts.sent1` prompt type.
- Send the second sentence(s) with the `sts.sent2` prompt type.

Request 1: First sentence(s) (`sts.sent1`):
```json
{
  "model": "Hakim",
  "input": [
    "This is the first sentence.",
    "This is another first sentence."
  ],
  "prompt_type": "sts.sent1"
}
```
Request 2: Second sentence(s) (`sts.sent2`):

```json
{
  "model": "Hakim",
  "input": [
    "This is the second sentence.",
    "This is another second sentence."
  ],
  "prompt_type": "sts.sent2"
}
```
Both requests will return embeddings for the respective sentences. You can then compute the similarity between the two embeddings to measure their semantic similarity.
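For instance, here is a minimal cosine-similarity sketch in Python. The toy vectors below stand in for the embeddings returned by the two requests above (the exact response field to read them from depends on the API's schema):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings returned by the
# sts.sent1 and sts.sent2 requests above.
emb_sent1 = [0.12, -0.08, 0.45]
emb_sent2 = [0.10, -0.02, 0.40]
print(f"STS score: {cosine_similarity(emb_sent1, emb_sent2):.3f}")
```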
Retrieval

For retrieval tasks, you compare a query to multiple documents by sending two different types of requests:
Query Embedding (`retrieval.query`):

```json
{
  "model": "Hakim",
  "input": [
    "What is the capital of France?",
    "What is the population of France?"
  ],
  "prompt_type": "retrieval.query"
}
```
Document Embedding (`retrieval.passage`):

```json
{
  "model": "Hakim",
  "input": [
    "Paris is the capital of France.",
    "Paris has a population of over 2 million."
  ],
  "prompt_type": "retrieval.passage"
}
```
The model returns embeddings for both the queries and the documents; you can then compute the similarity between a query embedding and each document embedding to rank the documents by relevance, as sketched below.
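A minimal ranking sketch along those lines; the toy vectors stand in for the embeddings returned by the `retrieval.query` and `retrieval.passage` requests above:

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Rank passages by cosine similarity to a query embedding."""
    q = np.asarray(query_emb, dtype=float)
    p = np.asarray(passage_embs, dtype=float)
    q = q / np.linalg.norm(q)                         # normalize query
    p = p / np.linalg.norm(p, axis=1, keepdims=True)  # normalize passages
    scores = p @ q                                    # cosine scores
    return np.argsort(-scores), scores

# Toy embeddings standing in for the API responses above.
query_emb = [0.3, 0.1, -0.2]
passage_embs = [[0.29, 0.12, -0.18],   # close to the query
                [-0.50, 0.40,  0.90]]  # unrelated
order, scores = rank_passages(query_emb, passage_embs)
for i in order:
    print(f"passage {i}: score {scores[i]:.3f}")
```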
Cross Task

The cross task is used when you want to perform a binary classification or categorization based on the embeddings of two related texts. For example, given two sentences, you might want to categorize them into different categories (e.g., "similar" or "dissimilar").
For this, you provide both texts in a specific format:
```json
{
  "model": "Hakim",
  "input": [
    "[text1]: This is the first text, [text2]: This is the second text",
    "[text1]: A new sentence, [text2]: Another different sentence"
  ],
  "prompt_type": "cross"
}
```
The model computes one embedding for each pair of texts; you can then use these embeddings as features to train a downstream classifier that assigns each pair to a predefined category based on the relationship between the two texts.
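As a sketch of that downstream step (scikit-learn and the toy vectors here are illustrative assumptions, not part of the API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy "cross" embeddings standing in for real API output: each row is
# the embedding of one "[text1]: ..., [text2]: ..." pair, and each
# label marks the relationship class to predict.
X = np.array([[ 0.20, 0.10, -0.30],
              [ 0.10, 0.20, -0.20],
              [-0.40, 0.50,  0.60],
              [-0.30, 0.60,  0.50]])
y = np.array([1, 1, 0, 0])  # e.g., 1 = "similar", 0 = "dissimilar"

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.15, 0.12, -0.25]]))  # predicted class for a new pair
```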
Citation
```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```