🚀 Upload of the Optuna-optimized model (85.3% overall accuracy, 82.9% funny accuracy)
- README.md +80 -74
- model.safetensors +1 -1
- threshold.json +3 -0
- tokenizer_config.json +1 -4

README.md CHANGED
@@ -13,6 +13,7 @@ tags:
 - eurobert
 - lora
 - git
 datasets:
 - custom
 metrics:
@@ -20,108 +21,113 @@ metrics:
 - f1
 library_name: transformers
 pipeline_tag: text-classification
 ---
 
-#
-
-
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-
-# Load the model and the tokenizer
-tokenizer = AutoTokenizer.from_pretrained("LBerthalon/eurobert-commit-humor", trust_remote_code=True)
-model = AutoModelForSequenceClassification.from_pretrained("LBerthalon/eurobert-commit-humor", trust_remote_code=True)
-
-
-outputs = model(**inputs)
-probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
-predicted_class = torch.argmax(probabilities, dim=-1)
-confidence = probabilities.max().item()
-
-labels = ["PAS DRÔLE", "DRÔLE"]
-return labels[predicted_class.item()], confidence
-
-# Test
-message = "gcc et moi c'est compliqué"
-result, confidence = classify_commit(message)
-print(f"Message: '{message}'")
-print(f"Result: {result} (confidence: {confidence:.3f})")
-```
 
-##
 
-```
-
-→ 😄 DRÔLE (prob: 0.730)
-
-
 ```
 
-##
-
-- **Base model**: EuroBERT-210m (210M parameters)
-- **Fine-tuning**: LoRA (Low-Rank Adaptation)
-- **Dataset**: annotated commit messages (funny / not funny)
-- **Classification**: binary with adjustable threshold
-- **Supported languages**: French, English, German, Spanish, Italian
-
-## 📈 Performance
-
-
-## 🎪 Use Cases
-
-
-- **Bots**: Discord/Slack/Teams integration
-
 ```
 
-##
-
-
-##
-
 ```bibtex
-@misc{commit-humor-
-  title={EuroBERT Commit Humor Classifier},
-  author={
   year={2025},
   url={https://huggingface.co/LBerthalon/eurobert-commit-humor}
 }
 ```
 
-**Author**: AI Assistant
-**Date**: 2025
@@ -13,6 +13,7 @@ tags:
 - eurobert
 - lora
 - git
+- optuna-optimized
 datasets:
 - custom
 metrics:
@@ -20,108 +21,113 @@ metrics:
 - f1
 library_name: transformers
 pipeline_tag: text-classification
+model-index:
+- name: eurobert-commit-humor
+  results:
+  - task:
+      type: text-classification
+      name: Text Classification
+    dataset:
+      type: custom
+      name: Git Commit Humor Detection
+    metrics:
+    - type: accuracy
+      value: 85.3
+      name: Global Accuracy
+    - type: accuracy
+      value: 82.9
+      name: Funny Class Accuracy
 ---
 
+# 🎭 EuroBERT Commit Humor Classifier (Optimized)
+
+## 📋 Description
+
+This model is an optimized version of EuroBERT, fine-tuned to detect humor in Git commit messages.
+It was optimized with Optuna over several cycles of automatic dataset improvement.
+
+## 🎯 Performance
+
+- **Overall accuracy**: 85.3%
+- **"Funny" class accuracy**: 82.9%
+- **"Neutral" class accuracy**: 85.6%
+- **Optimal threshold**: 0.35 (shipped in `threshold.json`)
 
+## 🚀 Usage
+
+```python
+from transformers import pipeline
+
+# Load the model
+classifier = pipeline("text-classification",
+                      model="LBerthalon/eurobert-commit-humor",
+                      trust_remote_code=True)
+
+# Prediction
+result = classifier("fix: gcc et moi c'est compliqué")
+print(result)
+# [{"label": "funny", "score": 0.85}]
+```
 
+## 🔧 Advanced Usage
+
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+# Load the model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("LBerthalon/eurobert-commit-humor", trust_remote_code=True)
+model = AutoModelForSequenceClassification.from_pretrained("LBerthalon/eurobert-commit-humor", trust_remote_code=True)
+
+# Prepare the input
+text = "feat: ajout de la fonctionnalité qui marche pas"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+# Prediction
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+print(f"Funny: {predictions[0][1]:.3f}")
+print(f"Neutral: {predictions[0][0]:.3f}")
+```
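The tuned decision threshold (0.35, shipped in `threshold.json` in this commit) can replace the default 0.5 cut-off. A minimal sketch, assuming index 1 is the "funny" class as in the block above and that `threshold.json` has been downloaded next to the script (both assumptions):

```python
import json

# Load the tuned threshold shipped with the model
# (path assumed; point it at your local copy of threshold.json).
with open("threshold.json") as f:
    funny_threshold = json.load(f)["funny_threshold"]  # 0.35

# Flag the commit as funny when P(funny) clears the tuned threshold
# rather than the implicit 0.5 of an argmax decision.
funny_prob = predictions[0][1].item()
label = "funny" if funny_prob >= funny_threshold else "neutral"
print(f"{label} (p_funny={funny_prob:.3f}, threshold={funny_threshold})")
```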
 
+## 📊 Example Predictions
+
+| Commit Message | Prediction | Score |
+|-------------------|------------|-------|
+| "fix: correction du bug" | neutral | 0.92 |
+| "feat: ajout de la magie noire" | funny | 0.78 |
+| "docs: mise à jour README" | neutral | 0.95 |
+| "fix: ça marche sur ma machine" | funny | 0.83 |
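The rows above can be reproduced (approximately; exact scores depend on the checkpoint) by passing a list of messages to the `classifier` pipeline defined in the Usage section:

```python
# Batch prediction with the pipeline from the Usage section.
messages = [
    "fix: correction du bug",
    "feat: ajout de la magie noire",
    "docs: mise à jour README",
    "fix: ça marche sur ma machine",
]
for msg, pred in zip(messages, classifier(messages)):
    print(f"{msg!r:40} -> {pred['label']} ({pred['score']:.2f})")
```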
 
+## 🛠️ Optimization
+
+This model was optimized with (a sketch of such a study follows this list):
+- **Optuna** for Bayesian hyperparameter optimization
+- **LoRA** (Low-Rank Adaptation) for efficient fine-tuning
+- **Iterative dataset improvement**
+- **5 automatic optimization cycles**
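The actual search space is not published; the sketch below only illustrates the kind of Optuna study the list above describes. Parameter names, ranges, trial count, and the stand-in objective are all assumptions:

```python
import optuna

def train_and_evaluate(learning_rate, lora_r, lora_alpha):
    # Placeholder: substitute a real LoRA fine-tuning run that returns
    # validation accuracy; a dummy score keeps the sketch runnable.
    return 0.8 + 0.05 * (lora_r == 16) - abs(learning_rate - 1e-4)

def objective(trial):
    # Hypothetical hyperparameter space for the humor classifier.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    lora_r = trial.suggest_categorical("lora_r", [8, 16, 32])
    lora_alpha = trial.suggest_categorical("lora_alpha", [16, 32, 64])
    return train_and_evaluate(learning_rate, lora_r, lora_alpha)

study = optuna.create_study(direction="maximize")  # maximize accuracy
study.optimize(objective, n_trials=50)
print(study.best_params)
```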
 
+## 📈 Architecture
+
+- **Base model**: EuroBERT
+- **Technique**: LoRA fine-tuning (see the sketch below)
+- **Classes**: 2 (funny, neutral)
+- **Supported languages**: French (primary), English, German, Spanish, Italian
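As a sketch of how such a LoRA adapter is typically attached with the `peft` library: the base checkpoint follows the old README's mention of EuroBERT-210m, but the rank, alpha, dropout, and target modules are assumptions, not the published training configuration:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Base checkpoint (per the old README); two labels: neutral, funny.
base = AutoModelForSequenceClassification.from_pretrained(
    "EuroBERT/EuroBERT-210m", num_labels=2, trust_remote_code=True
)

# Hypothetical LoRA hyperparameters; the values actually used are not published.
lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()
```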
 
+## 🎓 Citation
+
 ```bibtex
+@misc{eurobert-commit-humor-optimized,
+  title={EuroBERT Commit Humor Classifier (Optimized)},
+  author={LBerthalon},
   year={2025},
+  publisher={Hugging Face},
   url={https://huggingface.co/LBerthalon/eurobert-commit-humor}
 }
 ```
 
+## 📄 License
+
+MIT License
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:62cf1928ab98691350cfe8fb8a1c276bd51c46bb63a19068996b8da0b96890f7
 size 849445112
threshold.json ADDED
@@ -0,0 +1,3 @@
+{
+  "funny_threshold": 0.35
+}
tokenizer_config.json CHANGED
@@ -2064,8 +2064,5 @@
   "pad_token": "<|pad|>",
   "pad_token_type_id": 0,
   "padding_side": "right",
-  "
-  "tokenizer_class": "PreTrainedTokenizerFast",
-  "truncation_side": "right",
-  "truncation_strategy": "longest_first"
+  "tokenizer_class": "PreTrainedTokenizerFast"
 }