Model Card for m2im/labse_finetuned_twitter

This model is a fine-tuned version of LaBSE (Language-agnostic BERT Sentence Embedding), specifically adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project focused on early-warning systems for conflict prediction.

Model Details

Model Description

  • Developed by: Dr. Milton Mendieta and Dr. Timothy Warren

  • Funded by: Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)

  • Shared by: Dr. Milton Mendieta and Dr. Timothy Warren

  • Model type: Transformer-based sentence encoder fine-tuned for multilabel classification

  • Language(s): Originally pre-trained on 109 languages (LaBSE), then fine-tuned on tweets from X (formerly Twitter, 2014 onward) spanning 68 language categories: 67 identified languages plus the undefined (und) language tag

  • License: MIT

  • Finetuned from model: sentence-transformers/LaBSE

Model Sources

  • Repository: https://huggingface.co/m2im/labse_finetuned_twitter

Uses

Direct Use

This model is intended to classify tweets in multiple languages into predefined categories reflecting their spatial and temporal proximity to collective violence events.

Downstream Use

The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.

Out-of-Scope Use

  • General-purpose sentiment analysis
  • Legal, health, or financial decision-making
  • Use in low-resource languages not covered by training data

Bias, Risks, and Limitations

  • Event-duration bias: The model was primarily trained on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
  • Temporal bias: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
  • Sample size sensitivity: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
  • Spatial ambiguity: Frequent misclassification between pre7geo50 and post7geo50 labels highlights the model’s challenge in distinguishing temporal contexts at broader spatial radii.
  • Language coverage limitations: While fine-tuned on 67 identified languages (plus the undefined und category), performance may vary for underrepresented or informal language variants.

Recommendations

  • Use with short-term events: For best results, apply the model to short-term events with geographically concentrated discourse, aligning with the training data distribution.
  • Avoid low-sample inference: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
  • Limit reliance on large-radius labels: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
  • Contextual validation: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
  • Consider post-processing: Incorporate ensemble methods or per-label threshold adjustments to improve label differentiation in ambiguous cases (a tuning sketch follows this list).
  • Prefer batched predictions: Avoid relying on isolated tweets; predictions aggregated over batches of tweets are more reliable.
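
Where per-label threshold adjustment is an option, a minimal sketch follows, assuming validation probabilities and binary label matrices are already available as NumPy arrays (tune_thresholds, val_probs, and val_labels are hypothetical names, not part of this repository):

import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs, val_labels, grid=np.linspace(0.1, 0.9, 17)):
    # Pick, independently for each label, the decision threshold that
    # maximizes F1 on the validation split.
    thresholds = []
    for j in range(val_labels.shape[1]):
        scores = [f1_score(val_labels[:, j], val_probs[:, j] >= t) for t in grid]
        thresholds.append(grid[int(np.argmax(scores))])
    return np.array(thresholds)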

How to Get Started with the Model

from transformers import pipeline
import html, re

def clean_tweet(example):
    # Normalize a raw tweet: unescape HTML entities and strip mentions,
    # URLs, and the retweet marker.
    tweet = example["text"]
    tweet = tweet.replace("\n", " ")
    tweet = html.unescape(tweet)
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)  # remove @mentions
    tweet = re.sub(r"http\S+", "", tweet)          # remove URLs
    tweet = re.sub(r"^RT ", "", tweet)             # remove the leading retweet marker only,
                                                   # so words ending in "RT" are left intact
    return {"text": tweet.strip()}

pipe = pipeline(
    "text-classification",
    model="m2im/labse_finetuned_twitter",
    tokenizer="m2im/labse_finetuned_twitter",
    top_k=None,  # return scores for all labels rather than only the top one
)

example = {"text": "Protesta en Quito por medidas económicas."}  # "Protest in Quito over economic measures."
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))

Training Details

Training Data

The model was fine-tuned on the multilingual Twitter collective violence dataset described under Evaluation: tweets from X (2014 onward) spanning 68 language categories, annotated with six spatio-temporal labels (pre7geo10 through post7geo50). The held-out evaluation portion alone comprises over 275,000 tweets.

Training Procedure

  • Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
  • Tokenization with LaBSE tokenizer
  • Multilabel classification head trained with BCEWithLogitsLoss (a configuration sketch follows this list)
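
The original training code is not distributed with this card. As a hedged illustration, a multilabel head of this kind can be configured in Transformers as below; setting problem_type="multi_label_classification" makes the model use BCEWithLogitsLoss internally, and the label order shown is an assumption:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]  # assumed ordering
tokenizer = AutoTokenizer.from_pretrained("setu4993/LaBSE")
model = AutoModelForSequenceClassification.from_pretrained(
    "setu4993/LaBSE",
    num_labels=len(labels),
    problem_type="multi_label_classification",  # BCEWithLogitsLoss under the hood
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Tweets are truncated/padded to the 32-token maximum used during training.
inputs = tokenizer("example tweet", truncation=True, padding="max_length", max_length=32)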

Training Hyperparameters

  • Base checkpoint: setu4993/LaBSE (a Transformers port of LaBSE)
  • Head class: AutoModelForSequenceClassification
  • Optimizer: AdamW
  • Batch size (train/validation): 1024
  • Epochs: 20
  • Learning rate: 5e-5
  • Learning rate scheduler: Cosine
  • Weight decay: 0.1
  • Max sequence length: 32
  • Precision: Mixed fp16
  • Random seed: 42
  • Saving strategy: Save the best checkpoint only when the ROC-AUC score improves on the validation set (a TrainingArguments sketch follows this list)
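
As an illustrative sketch only, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows (output_dir, the per-device split of the 1024 global batch across 16 GPUs, and the exact best-model settings are assumptions; AdamW is the Trainer default optimizer):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="labse_finetuned_twitter",  # assumed path
    num_train_epochs=20,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    per_device_train_batch_size=64,   # 64 x 16 GPUs = 1024 global batch
    per_device_eval_batch_size=64,
    fp16=True,                        # mixed precision
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the best score
    metric_for_best_model="roc_auc",  # assumes compute_metrics reports "roc_auc"
    greater_is_better=True,
)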

Evaluation

Testing Data, Factors & Metrics

  • Dataset: Held-out portion of the multilingual Twitter collective violence dataset, including over 275,000 tweets labeled across six spatio-temporal categories (pre7geo10, pre7geo30, pre7geo50, post7geo10, post7geo30, post7geo50, where pre/post marks tweets collected before or after a violence event and geo10/30/50 denotes a 10, 30, or 50 km radius around it).
  • Metrics (a computation sketch follows this list):
    • ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Evaluates the model’s ability to distinguish between classes across all thresholds.
    • Macro F1: Harmonic mean of precision and recall, averaged equally across all classes.
    • Micro F1: Harmonic mean of precision and recall, aggregated globally across all predictions.
    • Precision and Recall: Standard classification metrics to assess false positive and false negative trade-offs.
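
A minimal sketch of how these metrics can be computed with scikit-learn from raw multilabel logits; the 0.5 decision threshold and macro averaging for ROC-AUC are assumptions:

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

def compute_metrics(logits, labels):
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: independent per-label probabilities
    preds = (probs >= 0.5).astype(int)
    return {
        "roc_auc": roc_auc_score(labels, probs, average="macro"),
        "f1_macro": f1_score(labels, preds, average="macro", zero_division=0),
        "f1_micro": f1_score(labels, preds, average="micro", zero_division=0),
        "precision": precision_score(labels, preds, average="macro", zero_division=0),
        "recall": recall_score(labels, preds, average="macro", zero_division=0),
    }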

Results

  • Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on frozen LaBSE sentence embeddings. The best-performing classical model, Random Forest, achieved a macro F1 score of approximately 0.61, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task (a baseline sketch follows this list).
  • In contrast, the fine-tuned LaBSE model, trained end-to-end with a classification head, outperformed all baseline classical models by achieving a ROC-AUC score of 0.7238 on the validation set.
  • These results demonstrate the value of supervised fine-tuning over using frozen embeddings with classical classifiers, particularly in tasks involving subtle multilingual and spatio-temporal signal detection.
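
For reference, the frozen-embedding baseline can be reproduced along these lines; the two-tweet toy dataset is purely illustrative:

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

# Encode tweets with frozen LaBSE, then fit a classical multilabel classifier.
encoder = SentenceTransformer("sentence-transformers/LaBSE")
X = encoder.encode(["example tweet one", "example tweet two"])
y = [[1, 0, 0, 0, 0, 0],   # toy indicator rows over the six labels
     [0, 0, 0, 1, 0, 0]]
clf = RandomForestClassifier(random_state=42).fit(X, y)  # RF handles multilabel y natively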

Model Examination

  • Embedding analysis was conducted using a two-stage dimensionality reduction process: Principal Component Analysis (PCA) reduced the 768-dimensional LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) to project them to 2 dimensions for visualization (a sketch follows this list).
  • The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model effectively captures latent structure related to spatio-temporal patterns of collective violence.
  • Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals, especially at broader spatial radii (50 km), is weaker and more prone to noise.
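
A minimal sketch of the two-stage reduction, using scikit-learn and the umap-learn package; the random matrix stands in for actual LaBSE embeddings:

import numpy as np
from sklearn.decomposition import PCA
import umap  # umap-learn package

embeddings = np.random.rand(1000, 768)   # placeholder for 768-dim LaBSE embeddings
pca_50 = PCA(n_components=50, random_state=42).fit_transform(embeddings)
coords_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(pca_50)
# coords_2d can now be scatter-plotted, colored by label, to inspect clustering.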

Environmental Impact

  • Hardware Type: 16 NVIDIA Tesla V100 GPUs
  • Hours used: ~10 hours
  • Cloud Provider: University research computing cluster
  • Compute Region: North America
  • Carbon Emitted: Not formally calculated

Technical Specifications

Model Architecture and Objective

  • Transformer encoder (BERT-based)
  • Objective: Multilabel classification (one binary decision per spatio-temporal label) over LaBSE sentence representations

Compute Infrastructure

  • Hardware: One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
  • Software: PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management

Citation

BibTeX:

@misc{mendieta2025labseviolence,
  author       = {Mendieta, Milton and Warren, Timothy},
  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/labse_finetuned_twitter}},
  note         = {Research on multilingual NLP and conflict prediction}
}

APA:
Mendieta, M., & Warren, T. (2025). Fine-tuning multilingual language models to predict collective violence using Twitter data [Model]. Hugging Face. https://huggingface.co/m2im/labse_finetuned_twitter

Model Card Authors

Dr. Milton Mendieta and Dr. Timothy Warren

Model Card Contact

[email protected]
