DIMI-embedding-sts-matryoshka

State-of-the-art DIMI Sentence Embeddings for Arabic Similarity

Author: Ahmed Zaky Mouad
Email: [email protected]

This is a sentence-transformers model fine-tuned from AhmedZaky1/arabic-bert-nli-matryoshka. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: AhmedZaky1/arabic-bert-nli-matryoshka
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
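
The similarity function listed above is the cosine of the angle between two embedding vectors. A minimal, dependency-free sketch of that computation follows. The helper name `cosine` and the toy 3-dimensional vectors are illustrative assumptions, not part of the model's API; in practice `model.similarity` does this for you.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for 768-dimensional embeddings
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, so the score is close to 1
c = [-3.0, 0.0, 1.0]  # orthogonal to a, so the score is close to 0

print(cosine(a, b))
print(cosine(a, c))
```

Because the score depends only on direction, scaling a vector does not change it, which is why `a` and `b` score as near-identical.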

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
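
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of the token vectors produced by the Transformer, with padding positions excluded through the attention mask. The following is a rough sketch of that idea in plain Python, not the library's implementation. The function name `mean_pool` and the toy 2-dimensional token vectors are assumptions made for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]

# Three real tokens followed by one padding position (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padding vector is deliberately large here to show that masked positions do not influence the result.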

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Basic usage - encoding sentences
sentences = [
    'ديترويت مؤهلة للحماية من الإفلاس',
    'ديترويت مؤهلة للحماية الإفلاسية: قاضي أمريكي',
    'بورصة نيويورك ستعيد فتحها الأربعاء',
    'الطقس اليوم مشمس وجميل',
    'السماء صافية والشمس مشرقة'
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: Embeddings shape: (5, 768)

# Calculate similarity matrix
similarities = model.similarity(embeddings, embeddings)
print(f"Similarity matrix shape: {similarities.shape}")
# Output: Similarity matrix shape: (5, 5)

# Print similarity scores
for i, sentence1 in enumerate(sentences):
    for j, sentence2 in enumerate(sentences):
        if i < j:  # Only print upper triangle
            similarity = similarities[i][j].item()
            print(f"Similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")
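
The model name and its base model suggest Matryoshka-style training, in which a leading slice of each embedding can serve as a smaller embedding once it is re-normalized. This card does not list which dimensions were trained, so any target size is an assumption to validate on your own data. The sketch below shows only the truncate-and-renormalize step on a toy vector; `truncate_and_normalize` is a hypothetical helper, not a library function.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy 4-dimensional vector standing in for a 768-dimensional embedding
full = [3.0, 4.0, 1.0, 2.0]
small = truncate_and_normalize(full, 2)
print(small)  # [0.6, 0.8]
```

Sentence Transformers also accepts a `truncate_dim` argument when loading a model, which applies the truncation during encoding. Whether a given size preserves quality for this particular model is not stated here.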

Semantic Search Example

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Documents to search through
documents = [
    "الذكاء الاصطناعي يغير العالم بسرعة",
    "التكنولوجيا الحديثة تؤثر على حياتنا اليومية",
    "الطقس اليوم مشمس ودرجة الحرارة مناسبة",
    "كرة القدم هي الرياضة الأكثر شعبية في العالم",
    "الطبخ المنزلي أفضل من الطعام الجاهز",
    "البرمجة مهارة مهمة في العصر الحديث"
]

# Query
query = "التقنيات الجديدة وتأثيرها"

# Encode documents and query
doc_embeddings = model.encode(documents)
query_embedding = model.encode([query])

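# Calculate similarities (model.similarity returns a torch tensor; convert it to NumPy
# so that the reversed slice below is valid)
similarities = model.similarity(query_embedding, doc_embeddings)[0].numpy()

# Rank documents from most to least similar
top_indices = np.argsort(similarities)[::-1]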

print(f"Query: {query}\n")
print("Most similar documents:")
for i, idx in enumerate(top_indices[:3]):
    print(f"{i+1}. {documents[idx]} (similarity: {similarities[idx]:.4f})")

Text Classification Example

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Category examples
categories = {
    "رياضة": ["كرة القدم مباراة مثيرة", "السباحة رياضة ممتعة", "الجري يحسن الصحة"],
    "تكنولوجيا": ["الذكاء الاصطناعي متطور", "البرمجة مهارة مهمة", "الهواتف الذكية"],
    "طعام": ["الطبخ المنزلي لذيذ", "المطاعم الشعبية", "الحلويات العربية"]
}

# Encode category examples
category_embeddings = {}
for category, examples in categories.items():
    embeddings = model.encode(examples)
    category_embeddings[category] = np.mean(embeddings, axis=0)

# Classify new text
new_text = "الفريق فاز بالمباراة بصعوبة"
new_embedding = model.encode([new_text])

# Find most similar category
similarities = {}
for category, cat_embedding in category_embeddings.items():
    similarity = cosine_similarity([new_embedding[0]], [cat_embedding])[0][0]
    similarities[category] = similarity

# Get prediction
predicted_category = max(similarities, key=similarities.get)
print(f"Text: {new_text}")
print(f"Predicted category: {predicted_category}")
print(f"Cosine similarity to predicted category: {similarities[predicted_category]:.4f}")

Evaluation

Metrics

Semantic Similarity

  • Dataset: arabic-sts-dev
  • Evaluated with: EmbeddingSimilarityEvaluator

Metric            Value
pearson_cosine    0.9649
spearman_cosine   0.9595
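
The two metrics above are the Pearson and Spearman correlations between the cosine similarity the model assigns to each sentence pair and the gold similarity label. The sketch below shows how such correlations can be computed, using made-up scores rather than the arabic-sts-dev data. The helper names `pearson`, `average_ranks`, and `spearman` are illustrative and are not the evaluator's internals.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up gold labels and model cosine scores for five sentence pairs
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.35, 0.30, 0.90, 0.95]

print(f"pearson_cosine:  {pearson(gold, predicted):.4f}")
print(f"spearman_cosine: {spearman(gold, predicted):.4f}")
```

Spearman only looks at the ordering of the scores, so it is less sensitive than Pearson to how the cosine values are scaled.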

Training Details

Training Dataset

  • Dataset: Unnamed Dataset
  • Size: 27,788 training samples
  • Columns: sentence_0, sentence_1, and label

Approximate statistics based on the first 1000 samples:

Column      Type    Min       Mean          Max
sentence_0  string  4 tokens  27.82 tokens  143 tokens
sentence_1  string  4 tokens  27.67 tokens  148 tokens
label       float   0.0       0.53          1.0

Sample data:

  • sentence_0: A man is walking along a path through wilderness.
    sentence_1: A man is walking down a road.
    label: 0.5
  • sentence_0: China's online population rises to 618 mln
    sentence_1: China's troubled Xinjiang hit by more violence
    label: 0.08
  • sentence_0: وجد الباحثون فقط تجاويف فارغة و نسيج ندب حيث كانت الأورام
    sentence_1: لم يتم اكتشاف أي أورام، بل تم العثور على تجاويف فارغة ونسيج ندبة في مكانها.
    label: 0.8

Framework Versions

  • Python: 3.12.7
  • Sentence Transformers: 3.3.1
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
