# Hakim-unsup
Hakim-unsup represents an intermediate stage of the state-of-the-art Hakim text embedding project for the Persian language. This model is the result of pretraining on large Persian corpora followed by an extensive unsupervised contrastive learning phase on millions of text pairs.
While the fully supervised Hakim model achieves top performance on the FaMTEB benchmark, Hakim-unsup provides strong general-purpose semantic representations. It serves as a powerful foundation for further fine-tuning and is particularly useful for tasks where large labeled datasets are unavailable but understanding semantic similarity from unlabeled pairs is crucial.
## Model Highlights
- **Strong Foundational Embeddings:** Provides robust general-purpose Persian text embeddings learned from large-scale unsupervised data.
- **Trained on Diverse Unlabeled Pairs:** Benefits from the `Pairsia-unsup` dataset, capturing a wide array of semantic relationships.
- **Standard Size:** ~124M parameters, the same as the base Hakim model.
- **Basis for Supervised Models:** This is the model checkpoint before the supervised instruction-tuning phase that creates the final Hakim and Hakim-small models.
## Training Datasets
Hakim-unsup is trained in two main phases:
### Pretraining
- **Corpesia:** 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs:** 6.8B tokens from ~20M Persian blog posts.
- **Queries:** 8.5M anonymized search queries.
### Unsupervised Stage (Pairsia-unsup)
- **Pairsia-unsup:** 5M high-quality Persian text pairs from diverse sources, including:
  - Document–title, FAQ, QA, and paper title–abstract pairs.
  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).
- The model is trained with a contrastive learning objective on these pairs to learn general semantic representations (a minimal sketch follows this list).
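For intuition, here is a minimal sketch of the in-batch contrastive (InfoNCE-style) objective commonly used for this kind of unsupervised stage on text pairs. The temperature value and batch construction are illustrative assumptions, not the exact Hakim training recipe; see the Hakim paper for the actual setup.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, pair_emb, temperature=0.05):
    """InfoNCE-style loss: each anchor should match its own paired text,
    with every other pair in the batch acting as a negative.
    `temperature=0.05` is an illustrative value, not Hakim's setting."""
    anchor = F.normalize(anchor_emb, dim=-1)   # (B, D), L2-normalized
    pair = F.normalize(pair_emb, dim=-1)       # (B, D)
    logits = anchor @ pair.T / temperature     # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)    # positives lie on the diagonal

# Toy usage with random vectors standing in for encoder outputs:
B, D = 8, 768
loss = in_batch_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```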
Hakim-unsup does not undergo the subsequent supervised fine-tuning stage on the `Pairsia-sup` dataset, nor the instruction tuning applied afterward. For more detailed information on the dataset creation and curation process, please refer to the Hakim paper.
## Benchmark Results (FaMTEB)
| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Hakim | 73.81 | 84.56 | 70.46 | 89.75 | 69.46 | 40.43 | 76.62 | 85.41 |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | 74.56 | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | 43.51 | 78.65 | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |
## Model Usage
You can interact with the `Hakim_unsup` model through our API. Below are examples using `curl` and Python.
### Inference with curl

Here's how to send a request to the model using a `curl` command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

**Note:** For quick testing, you can use the value `mcinext` as your API key. This will allow you to use the API with some limitations.
```bash
curl -X POST 'https://mcinext.ai/api/hakim-unsup' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
    "model": "Hakim_unsuper",
    "input": [
      "The text of the first document.",
      "The text of the second document.",
      "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
  }'
```
### Inference with Python
```python
import requests

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-unsup"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
data = {
    "model": "Hakim_unsuper",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on...",
    ],
    "encoding_format": "float",
    "add_special_tokens": True,
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()
    print("Request successful!")
    print("Response JSON:")
    print(response.json())
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
## Citation
```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```