Model Card for m2im/labse_finetuned_twitter

This model is a fine-tuned version of LaBSE (Language-agnostic BERT Sentence Embedding), specifically adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project focused on early-warning systems for conflict prediction.

Model Details

Model Description

  • Developed by: Dr. Milton Mendieta and Dr. Timothy Warren

  • Funded by: Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)

  • Shared by: Dr. Milton Mendieta and Dr. Timothy Warren

  • Model type: Transformer-based sentence encoder fine-tuned for multilabel classification

  • Language(s): Originally pre-trained on 109 languages (LaBSE), then fine-tuned on tweets from X (formerly Twitter, 2014 onward) spanning 68 language categories: 67 identified languages plus the undefined (und) language tag

  • License: MIT

  • Finetuned from model: sentence-transformers/LaBSE

Model Sources

  • Repository: https://huggingface.co/m2im/labse_finetuned_twitter

Uses

Direct Use

This model is intended to classify tweets in multiple languages into predefined categories reflecting their spatial and temporal proximity to collective violence events.

Downstream Use

The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.

Out-of-Scope Use

  • General-purpose sentiment analysis
  • Legal, health, or financial decision-making
  • Use in low-resource languages not covered by training data

Bias, Risks, and Limitations

  • Event-duration bias: The model was primarily trained on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
  • Temporal bias: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
  • Sample size sensitivity: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
  • Spatial ambiguity: Frequent misclassification between pre7geo50 and post7geo50 labels highlights the model’s challenge in distinguishing temporal contexts at broader spatial radii.
  • Language coverage limitations: While fine-tuned on 67 identified languages (plus the undefined und category), performance may vary for underrepresented or informal language variants.

Recommendations

  • Use with short-term events: For best results, apply the model to short-term events with geographically concentrated discourse, aligning with the training data distribution.
  • Avoid low-sample inference: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
  • Limit reliance on large-radius labels: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
  • Contextual validation: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
  • Consider post-processing: Incorporate ensemble methods or per-label threshold adjustments to improve label differentiation in ambiguous cases (a tuning sketch follows this list).
  • Prefer batched predictions: Avoid relying on isolated tweets; predictions aggregated over batches of tweets are more reliable.
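
Where per-label threshold adjustment is an option, a minimal sketch follows, assuming validation probabilities and binary label matrices are already available as NumPy arrays (tune_thresholds, val_probs, and val_labels are hypothetical names, not part of this repository):

import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs, val_labels, grid=np.linspace(0.1, 0.9, 17)):
    # Pick, independently for each label, the decision threshold that
    # maximizes F1 on the validation split.
    thresholds = []
    for j in range(val_labels.shape[1]):
        scores = [f1_score(val_labels[:, j], val_probs[:, j] >= t) for t in grid]
        thresholds.append(grid[int(np.argmax(scores))])
    return np.array(thresholds)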

How to Get Started with the Model

from transformers import pipeline
import html, re

def clean_tweet(example):
    # Normalize a raw tweet: unescape HTML entities and strip mentions,
    # URLs, and the retweet marker.
    tweet = example["text"]
    tweet = tweet.replace("\n", " ")
    tweet = html.unescape(tweet)
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)  # remove @mentions
    tweet = re.sub(r"http\S+", "", tweet)          # remove URLs
    tweet = re.sub(r"^RT ", "", tweet)             # remove the leading retweet marker only,
                                                   # so words ending in "RT" are left intact
    return {"text": tweet.strip()}

pipe = pipeline(
    "text-classification",
    model="m2im/labse_finetuned_twitter",
    tokenizer="m2im/labse_finetuned_twitter",
    top_k=None,  # return scores for all labels rather than only the top one
)

example = {"text": "Protesta en Quito por medidas económicas."}  # "Protest in Quito over economic measures."
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))

Training Details

Training Data

The model was fine-tuned on the multilingual Twitter collective violence dataset described under Evaluation: tweets from X (2014 onward) spanning 68 language categories, annotated with six spatio-temporal labels (pre7geo10 through post7geo50). The held-out evaluation portion alone comprises over 275,000 tweets.

Training Procedure

  • Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
  • Tokenization with LaBSE tokenizer
  • Multilabel classification head trained with BCEWithLogitsLoss (a configuration sketch follows this list)
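
The original training code is not distributed with this card. As a hedged illustration, a multilabel head of this kind can be configured in Transformers as below; setting problem_type="multi_label_classification" makes the model use BCEWithLogitsLoss internally, and the label order shown is an assumption:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]  # assumed ordering
tokenizer = AutoTokenizer.from_pretrained("setu4993/LaBSE")
model = AutoModelForSequenceClassification.from_pretrained(
    "setu4993/LaBSE",
    num_labels=len(labels),
    problem_type="multi_label_classification",  # BCEWithLogitsLoss under the hood
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Tweets are truncated/padded to the 32-token maximum used during training.
inputs = tokenizer("example tweet", truncation=True, padding="max_length", max_length=32)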

Training Hyperparameters

  • Base checkpoint: setu4993/LaBSE (a Transformers port of LaBSE)
  • Head class: AutoModelForSequenceClassification
  • Optimizer: AdamW
  • Batch size (train/validation): 1024
  • Epochs: 20
  • Learning rate: 5e-5
  • Learning rate scheduler: Cosine
  • Weight decay: 0.1
  • Max sequence length: 32
  • Precision: Mixed fp16
  • Random seed: 42
  • Saving strategy: Save the best checkpoint only when the ROC-AUC score improves on the validation set (a TrainingArguments sketch follows this list)
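
As an illustrative sketch only, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows (output_dir, the per-device split of the 1024 global batch across 16 GPUs, and the exact best-model settings are assumptions; AdamW is the Trainer default optimizer):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="labse_finetuned_twitter",  # assumed path
    num_train_epochs=20,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    per_device_train_batch_size=64,   # 64 x 16 GPUs = 1024 global batch
    per_device_eval_batch_size=64,
    fp16=True,                        # mixed precision
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the best score
    metric_for_best_model="roc_auc",  # assumes compute_metrics reports "roc_auc"
    greater_is_better=True,
)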

Evaluation

Testing Data, Factors & Metrics

  • Dataset: Held-out portion of the multilingual Twitter collective violence dataset, including over 275,000 tweets labeled across six spatio-temporal categories (pre7geo10, pre7geo30, pre7geo50, post7geo10, post7geo30, post7geo50, where pre/post marks tweets collected before or after a violence event and geo10/30/50 denotes a 10, 30, or 50 km radius around it).
  • Metrics (a computation sketch follows this list):
    • ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Evaluates the model’s ability to distinguish between classes across all thresholds.
    • Macro F1: Harmonic mean of precision and recall, averaged equally across all classes.
    • Micro F1: Harmonic mean of precision and recall, aggregated globally across all predictions.
    • Precision and Recall: Standard classification metrics to assess false positive and false negative trade-offs.
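
A minimal sketch of how these metrics can be computed with scikit-learn from raw multilabel logits; the 0.5 decision threshold and macro averaging for ROC-AUC are assumptions:

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

def compute_metrics(logits, labels):
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: independent per-label probabilities
    preds = (probs >= 0.5).astype(int)
    return {
        "roc_auc": roc_auc_score(labels, probs, average="macro"),
        "f1_macro": f1_score(labels, preds, average="macro", zero_division=0),
        "f1_micro": f1_score(labels, preds, average="micro", zero_division=0),
        "precision": precision_score(labels, preds, average="macro", zero_division=0),
        "recall": recall_score(labels, preds, average="macro", zero_division=0),
    }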

Results

  • Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on frozen LaBSE sentence embeddings. The best-performing classical model, Random Forest, achieved a macro F1 score of approximately 0.61, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task (a baseline sketch follows this list).
  • In contrast, the fine-tuned LaBSE model, trained end-to-end with a classification head, outperformed all baseline classical models by achieving a ROC-AUC score of 0.7238 on the validation set.
  • These results demonstrate the value of supervised fine-tuning over using frozen embeddings with classical classifiers, particularly in tasks involving subtle multilingual and spatio-temporal signal detection.
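
For reference, the frozen-embedding baseline can be reproduced along these lines; the two-tweet toy dataset is purely illustrative:

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

# Encode tweets with frozen LaBSE, then fit a classical multilabel classifier.
encoder = SentenceTransformer("sentence-transformers/LaBSE")
X = encoder.encode(["example tweet one", "example tweet two"])
y = [[1, 0, 0, 0, 0, 0],   # toy indicator rows over the six labels
     [0, 0, 0, 1, 0, 0]]
clf = RandomForestClassifier(random_state=42).fit(X, y)  # RF handles multilabel y natively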

Model Examination

  • Embedding analysis was conducted using a two-stage dimensionality reduction process: Principal Component Analysis (PCA) reduced the 768-dimensional LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) to project them to 2 dimensions for visualization (a sketch follows this list).
  • The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model effectively captures latent structure related to spatio-temporal patterns of collective violence.
  • Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals, especially at broader spatial radii (50 km), is weaker and more prone to noise.
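
A minimal sketch of the two-stage reduction, using scikit-learn and the umap-learn package; the random matrix stands in for actual LaBSE embeddings:

import numpy as np
from sklearn.decomposition import PCA
import umap  # umap-learn package

embeddings = np.random.rand(1000, 768)   # placeholder for 768-dim LaBSE embeddings
pca_50 = PCA(n_components=50, random_state=42).fit_transform(embeddings)
coords_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(pca_50)
# coords_2d can now be scatter-plotted, colored by label, to inspect clustering.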

Environmental Impact

  • Hardware Type: 16 NVIDIA Tesla V100 GPUs
  • Hours used: ~10 hours
  • Cloud Provider: University research computing cluster
  • Compute Region: North America
  • Carbon Emitted: Not formally calculated

Technical Specifications

Model Architecture and Objective

  • Transformer encoder (BERT-based)
  • Objective: Multilabel classification (one binary decision per spatio-temporal label) over LaBSE sentence representations

Compute Infrastructure

  • Hardware: One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
  • Software: PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management

Citation

BibTeX:

@misc{mendieta2025labseviolence,
  author       = {Mendieta, Milton and Warren, Timothy},
  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/labse_finetuned_twitter}},
  note         = {Research on multilingual NLP and conflict prediction}
}

APA:
Mendieta, M., & Warren, T. (2025). Fine-tuning multilingual language models to predict collective violence using Twitter data [Model]. Hugging Face. https://huggingface.co/m2im/labse_finetuned_twitter

Model Card Authors

Dr. Milton Mendieta and Dr. Timothy Warren

Model Card Contact

[email protected]
