# Hakim-unsup
Hakim-unsup represents an intermediate stage of the state-of-the-art Hakim text embedding project for the Persian language. This model is the result of pretraining on large Persian corpora followed by an extensive unsupervised contrastive learning phase on millions of text pairs.
While the fully supervised Hakim model achieves top performance on the FaMTEB benchmark, Hakim-unsup provides strong general-purpose semantic representations. It serves as a powerful foundation for further fine-tuning and is particularly useful for tasks where large labeled datasets are unavailable but understanding semantic similarity from unlabeled pairs is crucial.
## Model Highlights
- **Strong Foundational Embeddings:** Provides robust general-purpose Persian text embeddings learned from large-scale unsupervised data.
- **Trained on Diverse Unlabeled Pairs:** Benefits from the `Pairsia-unsup` dataset, capturing a wide array of semantic relationships.
- **Standard Size:** ~124M parameters, the same as the base Hakim model.
- **Basis for Supervised Models:** This is the model checkpoint before the supervised instruction-tuning phase that creates the final Hakim and Hakim-small models.
## Training Datasets
Hakim-unsup is trained in two main phases:
### Pretraining
- **Corpesia:** 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs:** 6.8B tokens from ~20M Persian blog posts.
- **Queries:** 8.5M anonymized search queries.
### Unsupervised Stage (Pairsia-unsup)
- **Pairsia-unsup:** 5M high-quality Persian text pairs from diverse sources, including:
  - Document–title, FAQ, QA, and paper title–abstract pairs.
  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).
- The model is trained with a contrastive learning objective on these pairs to learn general semantic representations (a minimal sketch follows this list).
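For intuition, here is a minimal sketch of the in-batch contrastive (InfoNCE-style) objective commonly used for this kind of unsupervised stage on text pairs. The temperature value and batch construction are illustrative assumptions, not the exact Hakim training recipe; see the Hakim paper for the actual setup.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, pair_emb, temperature=0.05):
    """InfoNCE-style loss: each anchor should match its own paired text,
    with every other pair in the batch acting as a negative.
    `temperature=0.05` is an illustrative value, not Hakim's setting."""
    anchor = F.normalize(anchor_emb, dim=-1)   # (B, D), L2-normalized
    pair = F.normalize(pair_emb, dim=-1)       # (B, D)
    logits = anchor @ pair.T / temperature     # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)    # positives lie on the diagonal

# Toy usage with random vectors standing in for encoder outputs:
B, D = 8, 768
loss = in_batch_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```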
Hakim-unsup does not undergo the subsequent supervised fine-tuning stage on the `Pairsia-sup` dataset, nor the instruction tuning applied afterward. For more detailed information on the dataset creation and curation process, please refer to the Hakim paper.
## Benchmark Results (FaMTEB)
| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Hakim | 73.81 | 84.56 | 70.46 | 89.75 | 69.46 | 40.43 | 76.62 | 85.41 |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | 74.56 | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | 43.51 | 78.65 | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |
## Model Usage
You can interact with the `Hakim_unsup` model through our API. Below are examples using `curl` and Python.
### Inference with curl

Here's how to send a request to the model using a `curl` command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

**Note:** For quick testing, you can use the value `mcinext` as your API key. This will allow you to use the API with some limitations.
```bash
curl -X POST 'https://mcinext.ai/api/hakim-unsup' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
    "model": "Hakim_unsuper",
    "input": [
      "The text of the first document.",
      "The text of the second document.",
      "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
  }'
```
### Inference with Python
```python
import requests

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-unsup"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
data = {
    "model": "Hakim_unsuper",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on...",
    ],
    "encoding_format": "float",
    "add_special_tokens": True,
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()
    print("Request successful!")
    print("Response JSON:")
    print(response.json())
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
## Citation
```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```