CATIE-AQ/distilcamembert-base-embedding

Description

This is a sentence-transformers model finetuned from cmarkea/distilcamembert-base (68.1M parameters). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

Scores on the MTEB leaderboard:

Model	Average	Classification	Clustering	PairClassification	Reranking	Retrieval	STS	Summarization
CATIE-AQ/camembert-base-embedding (111M)	60.057	66.117	45.41	79.675	71.303	45.769	82.049	30.074
CATIE-AQ/distilcamembert-base-embedding (68M)	58.297	63.904	44.549	79.102	67.961	42.222	80.204	30.138
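
To make the trade-off between the two models easier to read, the per-task gap can be computed directly from the table. The following is a small sketch using only the scores listed above; the variable names are illustrative and not part of any library.

```python
# Scores copied from the MTEB table above.
tasks = ["Average", "Classification", "Clustering", "PairClassification",
         "Reranking", "Retrieval", "STS", "Summarization"]
camembert_base = [60.057, 66.117, 45.41, 79.675, 71.303, 45.769, 82.049, 30.074]
distilcamembert_base = [58.297, 63.904, 44.549, 79.102, 67.961, 42.222, 80.204, 30.138]

# Positive gap: the larger (111M) model scores higher on that task.
gaps = {task: round(big - small, 3)
        for task, big, small in zip(tasks, camembert_base, distilcamembert_base)}

for task, gap in gaps.items():
    print(f"{task:<20}{gap:+.3f}")

# Task on which the distilled model loses the most ground.
largest = max(gaps, key=gaps.get)
print("Largest gap:", largest)
```

A negative gap means the distilled model scores slightly higher on that task.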

Model Details

Model Description

Model Type: Sentence Transformer
Base model: cmarkea/distilcamembert-base
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
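
As an illustration of the similarity function listed above, here is a minimal pure-Python sketch of cosine similarity. The function name and the toy 3-dimensional vectors are hypothetical stand-ins for the model's 768-dimensional outputs; in practice the library computes this for you.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # close to 0: orthogonal
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.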

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': True, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
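
The Pooling configuration above enables only `pooling_mode_weightedmean_tokens`. The sketch below shows one way such a pooling step can work, assuming that each token's weight equals its 1-based position and that padding tokens get weight zero. That weighting scheme is an assumption about the library's behavior, not something stated in this card, and the function name is hypothetical.

```python
def weighted_mean_pool(token_embeddings, attention_mask):
    """Position-weighted mean over the non-padding tokens of one sentence.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    dim = len(token_embeddings[0])
    weighted_sum = [0.0] * dim
    weight_total = 0.0
    pairs = zip(token_embeddings, attention_mask)
    for position, (vector, keep) in enumerate(pairs, start=1):
        # Assumed scheme: weight = 1-based token position, 0 for padding.
        weight = position * keep
        weight_total += weight
        for i, value in enumerate(vector):
            weighted_sum[i] += weight * value
    if weight_total == 0:
        raise ValueError("attention mask has no real tokens")
    return [s / weight_total for s in weighted_sum]

# Three real tokens followed by one padding token (2-dimensional toy vectors).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = weighted_mean_pool(tokens, mask)
print(pooled)
```

Under this assumption, later tokens contribute more to the sentence vector than earlier ones, and the padding token is ignored entirely.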

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CATIE-AQ/distilcamembert-base-embedding")
# Run inference
sentences = [
    "Tenet est sous surveillance depuis novembre, lorsque l'ancien directeur général Jeffrey Barbakow a déclaré que la société a utilisé des prix agressifs pour déclencher des paiements plus élevés pour les patients les plus malades de l'assurance maladie.",
    "En novembre, Jeffrey Brabakow, le directeur général de l'époque, a déclaré que la société utilisait des prix agressifs pour obtenir des paiements plus élevés pour les patients les plus malades de l'assurance maladie.",
    'La femme est en route pour un rendez-vous.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
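
For semantic search, the same similarity scores can be used to rank a corpus against a query. The following is a hedged, library-free sketch: `rank_by_cosine` is a hypothetical helper, and the toy vectors stand in for embeddings that would normally come from `model.encode`.

```python
import math

def rank_by_cosine(query, corpus, top_k=2):
    """Return (index, score) pairs for the top_k corpus vectors closest to query."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in v]

    q = normalize(query)
    scores = []
    for index, vector in enumerate(corpus):
        d = normalize(vector)
        # Dot product of unit vectors equals their cosine similarity.
        scores.append((index, sum(a * b for a, b in zip(q, d))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy 3-dimensional vectors; real embeddings from this model have 768 dimensions.
corpus_vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.1]]
query_vector = [1.0, 0.0, 0.0]

results = rank_by_cosine(query_vector, corpus_vectors, top_k=2)
for index, score in results:
    print(index, round(score, 3))
```

With real data, the rows returned by `model.encode` could be passed in after converting them to lists; for large corpora a vectorized or indexed search would be a better fit than this loop.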

Citation

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Safetensors

Model size: 68.1M params
Tensor type: F32

Model tree for CATIE-AQ/distilcamembert-base-embedding

Base model: cmarkea/distilcamembert-base
Finetuned (7): this model

Collection including CATIE-AQ/distilcamembert-base-embedding

CATIE French dense embedding (collection, 2 items)

Evaluation results

All scores are self-reported on the test set.

Dataset	Metric	Value
MTEB AlloProfClusteringP2P (default)	main_score	59.597
MTEB AlloProfClusteringP2P (default)	v_measure	59.597
MTEB AlloProfClusteringP2P (default)	v_measure_std	4.011
MTEB AlloProfClusteringS2S (default)	main_score	47.061
MTEB AlloProfClusteringS2S (default)	v_measure	47.061
MTEB AlloProfClusteringS2S (default)	v_measure_std	1.523
MTEB AlloprofReranking (default)	main_score	60.043
MTEB AlloprofReranking (default)	map	60.043
MTEB AlloprofReranking (default)	mrr	61.136
MTEB AlloprofReranking (default)	nAUC_map_diff1	39.767
