ToxiFrench: Benchmarking and Investigating SLMs and CoT Finetuning for French Toxicity Detection

Author: Axel Delaval

Affiliations: École Polytechnique & Shanghai Jiao Tong University (SJTU)

Email: [email protected]


⚠️ Content Warning: This project and the associated dataset contain examples of text that may be considered offensive, toxic, or otherwise disturbing. The content is presented for research purposes only.


Abstract

Despite significant progress in English toxicity detection, performance drastically degrades in other languages like French, a gap stemming from disparities in training corpora and the culturally nuanced nature of toxicity. This paper addresses this critical gap with three key contributions. First, we introduce ToxiFrench, a new public benchmark dataset for French toxicity detection, comprising 53,622 entries. This dataset was constructed using a novel annotation strategy that required manual labeling for only 10% of the data, minimizing effort and error. Second, we conducted a comprehensive evaluation of toxicity detection models. Our findings reveal that while Large Language Models (LLMs) often achieve high performance, Small Language Models (SLMs) can demonstrate greater robustness to bias, better cross-language consistency, and superior generalization to novel forms of toxicity. Third, to identify optimal transfer-learning methods, we conducted a systematic comparison of In-Context Learning (ICL), Supervised Fine-tuning (SFT), and Chain-of-Thought (CoT) reasoning using Qwen3-4B and analyzed the impact of data imbalance. We propose a novel approach for CoT fine-tuning that employs a dynamic weighted loss function, significantly boosting performance by ensuring the model's reasoning is faithful to its final conclusion.


Key Contributions

  • Dataset and benchmark: Introduction of ToxiFrench, a new public benchmark dataset for French toxicity detection (53,622 entries).
  • Evaluation of state-of-the-art detectors: Extensive evaluation of LLMs (GPT-4o, DeepSeek, Gemini, Mistral, ...), SLMs (Qwen, Gemma, Mistral, ...), Transformers (CamemBERT, DistilBERT, ...), and moderation APIs (Perspective API, OpenAI moderation, Mistral moderation, ...), showing that SLMs can outperform LLMs in robustness to bias, cross-language consistency, and generalization to novel forms of toxicity.
  • Transfer learning strategies: Systematic comparison of ICL, SFT, and CoT reasoning.
  • Model development: Development of a state-of-the-art 4B SLM for French toxicity detection, based on Qwen3-4B, that outperforms several powerful LLMs.
  • CoT fine-tuning: Introduction of a novel approach for CoT fine-tuning that employs a dynamic weighted loss function, significantly boosting performance by ensuring the model's reasoning is faithful to its final conclusion.
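
The exact weighting scheme behind the dynamic weighted loss is not specified in this README. As a rough illustration only, the sketch below assumes a token-level negative log-likelihood in which tokens of the final conclusion receive a larger weight than reasoning tokens, and in which that weight grows linearly during training. The function names, the linear schedule, and the start/end values are all hypothetical and are not taken from the paper.

```python
from typing import List


def conclusion_weight_at(step: int, total_steps: int,
                         start: float = 1.0, end: float = 5.0) -> float:
    """Hypothetical linear schedule for the conclusion-token weight."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def weighted_token_loss(token_log_probs: List[float],
                        is_conclusion: List[bool],
                        conclusion_weight: float) -> float:
    """Weighted mean negative log-likelihood over one sequence.

    Reasoning tokens get weight 1.0; conclusion tokens get
    `conclusion_weight` (an assumption made for this sketch).
    """
    weights = [conclusion_weight if flag else 1.0 for flag in is_conclusion]
    total = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    return total / sum(weights)


# Toy sequence: two reasoning tokens followed by one conclusion token.
log_probs = [-1.0, -1.0, -3.0]
mask = [False, False, True]
for step in (0, 50, 100):
    w = conclusion_weight_at(step, total_steps=100)
    print(step, w, weighted_token_loss(log_probs, mask, w))
```

With this kind of weighting, an error on the conclusion token costs more as training progresses, which is one plausible way to push the final label to agree with the preceding reasoning.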

How to use?

This repository contains the ToxiFrench model, a French language model fine-tuned for toxic comment classification. It is based on the Qwen/Qwen3-4B architecture and is designed to detect and classify toxic comments in French text.

We performed a series of experiments to evaluate the model's performance under different fine-tuning configurations, focusing on the impact of data selection strategies and Chain-of-Thought (CoT) annotations.

We used QLoRA adapters. Make sure to select one of them when loading the model (in the example below, through the subfolder argument of PeftModel.from_pretrained); otherwise the base model, without any fine-tuning, will be used.

Notations

For conciseness, we use a three-letter notation to describe the different configurations of the fine-tuning experiments. Each experiment follows a naming scheme like: (r/o)(e/d)(c/b)
Where:

| Parameter       | Code | Description                                                                  |
|-----------------|------|------------------------------------------------------------------------------|
| Data Order      | [r]  | Training data is presented in a random order.                                |
|                 | [o]  | Data is ordered (Curriculum Learning).                                       |
| Class Balance   | [e]  | Training set has an equal (balanced) number of toxic and non-toxic samples.  |
|                 | [d]  | Training set uses a different (imbalanced) class distribution.               |
| Training Target | [c]  | Finetuning on the complete Chain-of-Thought annotation.                      |
|                 | [b]  | Finetuning on the final binary label only (direct classification).           |
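
The naming scheme can be decoded mechanically. The following is a minimal sketch: the letter meanings come from the table above and the list of published adapters from the usage example, while the names CONFIG_CODES, AVAILABLE_ADAPTERS, and describe_config are hypothetical helpers introduced only for illustration.

```python
from typing import List

# One dictionary per position in the three-letter name.
CONFIG_CODES = [
    {"r": "random data order", "o": "ordered data (curriculum learning)"},
    {"e": "equal (balanced) classes", "d": "different (imbalanced) class distribution"},
    {"c": "complete Chain-of-Thought target", "b": "binary label target"},
]

# The six adapter names listed in the usage example of this README.
AVAILABLE_ADAPTERS = {"odc", "oeb", "oec", "rdc", "reb", "rec"}


def describe_config(name: str) -> List[str]:
    """Translate a three-letter configuration name into its three descriptions."""
    if len(name) != len(CONFIG_CODES):
        raise ValueError(f"expected {len(CONFIG_CODES)} letters, got {name!r}")
    descriptions = []
    for letter, options in zip(name, CONFIG_CODES):
        if letter not in options:
            raise ValueError(f"unknown letter {letter!r} in {name!r}")
        descriptions.append(options[letter])
    return descriptions


for adapter in sorted(AVAILABLE_ADAPTERS):
    print(adapter, "->", "; ".join(describe_config(adapter)))
```

Note that the scheme allows eight combinations, but only the six names in AVAILABLE_ADAPTERS are listed as loadable adapters.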

e.g. rec is the model trained on an oversampled dataset for balance (e), with batches in an arbitrary order (r), and with CoT reasoning (c).
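
The oversampling procedure used for the balanced (e) configurations is not detailed here. As an assumption-laden sketch, one common approach is to duplicate randomly chosen minority-class samples until every class has as many samples as the largest one. The function name oversample_to_balance, the seed, and the toy data are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Tuple

Sample = Tuple[str, Hashable]  # (text, label)


def oversample_to_balance(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate minority-class samples at random until all classes have equal size."""
    if not samples:
        return []
    rng = random.Random(seed)
    by_label: Dict[Hashable, List[Sample]] = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    target = max(len(group) for group in by_label.values())
    balanced: List[Sample] = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)  # corresponds to the random-order (r) setting
    return balanced


toy = [("a", 0), ("b", 0), ("c", 0), ("d", 1)]
balanced = oversample_to_balance(toy)
print(len(balanced), sum(1 for _, label in balanced if label == 1))
```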

Example Usage

You can find an example in this notebook.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Choose which adapter to load
target_adapter_name = "rec"  # One of the six available configurations: "odc", "oeb", "oec", "rdc", "reb", "rec"

# Load the base model
base_model_name = "Qwen/Qwen3-4B"

# For small GPUs, use 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
                base_model_name,
                use_fast=True,
                trust_remote_code=True
            )
tokenizer.padding_side = 'left' 

# Load model
model = AutoModelForCausalLM.from_pretrained(
                base_model_name,
                quantization_config=bnb_config,
                trust_remote_code=True,
                sliding_window=None,
            )

# Resize the model's token embeddings if they do not match the tokenizer's vocabulary size
model_embedding_size = model.get_input_embeddings().weight.size(0)
tokenizer_vocab_size = len(tokenizer)
if model_embedding_size != tokenizer_vocab_size:
    model.resize_token_embeddings(tokenizer_vocab_size)

# Load the specific adapter by name from the repository
adapter_repo_id = "Naela00/ToxiFrench"
model = PeftModel.from_pretrained(
    model,
    adapter_repo_id,
    subfolder=target_adapter_name,  # Each adapter lives in a subfolder named after its configuration
)

# Inference
message_to_analyze = "Je suis vraiment déçu par ce film, c'était nul !"
prompt = f"Message:\n{message_to_analyze}\n\nAnalyse:\n"
# The last letter of the adapter name is "c" for adapters trained with CoT
if target_adapter_name.endswith("c"):
    prompt += "<think>\nExplication :\n"  # If using CoT, open the reasoning part

max_new_tokens: int = 1024
do_sample: bool = True
temperature: float = 0.7
top_p: float = 0.9
top_k: int = 50
repetition_penalty: float = 1.1

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True
).to(model.device)

default_generation_kwargs = {
    "max_new_tokens": max_new_tokens,
    "do_sample": do_sample,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
    "repetition_penalty": repetition_penalty,
    "eos_token_id": tokenizer.eos_token_id,
}

outputs = model.generate(**inputs, **default_generation_kwargs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

print(generated_text)
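
For the CoT adapters, the printed text contains the prompt, the reasoning, and the conclusion in a single string. The helper below is a minimal sketch for separating them. It assumes that the model closes its reasoning with a </think> tag matching the <think> tag opened in the prompt; this README does not document the output format, so both that assumption and the helper name split_reasoning are hypothetical, and the sample string is made up for illustration.

```python
from typing import Tuple


def split_reasoning(generated_text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> Tuple[str, str]:
    """Return (reasoning, conclusion) extracted from a decoded generation.

    If no opening tag is found, the whole text is returned as the conclusion.
    If the reasoning is never closed (for example because max_new_tokens was
    reached), the conclusion is an empty string.
    """
    start = generated_text.find(open_tag)
    if start == -1:
        return "", generated_text.strip()
    body = generated_text[start + len(open_tag):]
    reasoning, separator, rest = body.partition(close_tag)
    if not separator:
        return reasoning.strip(), ""
    return reasoning.strip(), rest.strip()


# Made-up output, only to show the expected shape of the input string.
sample = (
    "Message:\nBonjour\n\nAnalyse:\n"
    "<think>\nExplication :\nLe message est poli.\n</think>\nconclusion"
)
reasoning, conclusion = split_reasoning(sample)
print(reasoning)
print(conclusion)
```

Because the example above decodes with skip_special_tokens=False, the conclusion may still contain end-of-sequence tokens that need to be stripped separately.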

License

This project is licensed under the MIT License - see the LICENSE file for details.


Citation

If you use this project in your research, please cite it as follows:

@misc{delaval2025toxifrench,
    title={ToxiFrench: Benchmarking and Investigating SLMs and CoT Finetuning for French Toxicity Detection},
    author={Axel Delaval},
    year={2025},
}