
🧠 Hakim-unsup


Hakim-unsup represents an intermediate stage of the state-of-the-art Hakim text embedding project for the Persian language. This model is the result of pretraining on large Persian corpora followed by an extensive unsupervised contrastive learning phase on millions of text pairs.

While the fully supervised Hakim model achieves top performance on the FaMTEB benchmark, Hakim-unsup provides strong general-purpose semantic representations. It serves as a powerful foundation for further fine-tuning and is particularly useful for tasks where large labeled datasets are unavailable but understanding semantic similarity from unlabeled pairs is crucial.


πŸ“Œ Model Highlights

  • 🧱 Strong Foundational Embeddings: Provides robust general-purpose Persian text embeddings learned from large-scale unsupervised data.
  • πŸ”„ Trained on Diverse Unlabeled Pairs: Benefits from the Pairsia-unsup dataset, capturing a wide array of semantic relationships.
  • βš™οΈ Standard Size: ~124M parameters, same as the base Hakim model.
  • 🌱 Basis for Supervised Models: This is the model checkpoint before the supervised instruction-tuning phase that creates the final Hakim and Hakim-small models.

πŸ—οΈ Training Datasets

Hakim-unsup is trained in two main phases:

πŸ“š Pretraining

  • Corpesia: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
  • hmBlogs: 6.8B tokens from ~20M Persian blog posts.
  • Queries: 8.5M anonymized search queries.

πŸ”„ Unsupervised Stage (Pairsia-unsup)

  • Pairsia-unsup: 5M high-quality Persian text pairs from diverse sources including:
    • Document–title, FAQ, QA, and paper title–abstract pairs.
    • Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).
  • The model is trained using a contrastive learning objective on these pairs to learn general semantic representations.

Hakim-unsup does not undergo the subsequent supervised fine-tuning stage with the Pairsia-sup dataset or instruction tuning. For more detailed information on the dataset creation and curation process, please refer to the Hakim paper.
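For intuition, below is a minimal sketch of the kind of in-batch contrastive (InfoNCE-style) objective commonly used for pair-based training like this. The temperature value, pooling, and batch handling here are illustrative assumptions, not the exact Hakim training recipe; see the paper for the actual configuration.

import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch negatives: each anchor's positive is the matching row; all other rows act as negatives."""
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    logits = anchor_emb @ positive_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs (assumed 768-dim embeddings).
batch, dim = 8, 768
loss = info_nce_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())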


πŸ§ͺ Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Hakim | 73.81 | 84.56 | 70.46 | 89.75 | 69.46 | 40.43 | 76.62 | 85.41 |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | 74.56 | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | 43.51 | 78.65 | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |

Model Usage

You can interact with the Hakim-unsup model through our API. Below are examples using curl and Python.

Inference with curl

Here's how to send a request to the model using a curl command in your terminal.

Important: Replace your_api_key with your actual API key.

Note: For quick testing, you can use the value mcinext as your API key. This will allow you to use the API with some limitations.

curl -X POST 'https://mcinext.ai/api/hakim-unsup' \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization": "Bearer your_api_key" \
-d '{
    "model": "Hakim_unsuper",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
}'

Inference with Python

import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key or "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-unsup"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim_unsuper", 
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()  

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"An other error occurred: {err}")

Citation

@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}