DeBERTa-CRF-VotIE: Portuguese Voting Information Extraction

This model is a fine-tuned DeBERTa v3 Base with a Conditional Random Field (CRF) layer for extracting structured voting information from Portuguese municipal meeting minutes. It achieves state-of-the-art performance on the VotIE benchmark dataset.

Model Description

DeBERTa-CRF-VotIE combines the robust contextual representations of Microsoft's DeBERTa v3 multilingual base model with a CRF layer for structured sequence prediction. The model performs token-level classification to identify and extract voting-related entities from Portuguese administrative text.

Key Features

  • Architecture: DeBERTa v3 Base (768-dim, 12 layers) + Linear + CRF (see the sketch after this list)
  • Task: Sequence labeling with BIO tagging
  • Language: Portuguese (Portugal)
  • Domain: Municipal meeting minutes and voting records
  • Entity Types: 8 types (17 labels with BIO encoding)
  • Performance: 93.00% entity-level F1 score
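
The architecture can be pictured with the following minimal sketch. It is an illustrative approximation, not the packaged implementation: the hosted repository ships its own modelling code via trust_remote_code, and names such as DebertaCrfTagger are ours.

import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pytorch-crf package

NUM_LABELS = 17  # 8 entity types in BIO encoding, plus O

class DebertaCrfTagger(nn.Module):
    """Sketch of the stack described above: DeBERTa encoder -> Linear -> CRF."""

    def __init__(self, encoder_name="microsoft/deberta-v3-base", num_labels=NUM_LABELS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # 768-dim, 12 layers
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(self.dropout(hidden))
        mask = attention_mask.bool()
        if labels is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Inference: Viterbi decoding returns the best tag sequence per example
        return self.crf.decode(emissions, mask=mask)

For actual inference, load the published checkpoint as shown in the Usage section below; the sketch only makes the Linear + CRF head concrete.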

Intended Uses

This model is designed for:

  • Extracting voting information from Portuguese municipal documents
  • Identifying participants and their voting positions (favor, against, abstention, absent)
  • Recognizing voting subjects and counting methods
  • Structuring unstructured administrative text
  • Research in information extraction from Portuguese administrative documents

Entity Types

The model recognizes 8 entity types in BIO format (17 labels total):

Entity Type           Description                           Example
VOTER-FAVOR           Participants who voted in favor       "The Municipal Executive"
VOTER-AGAINST         Participants who voted against        "João Silva"
VOTER-ABSTENTION      Participants who abstained            "The councilor from PS"
VOTER-ABSENT          Participants who were absent          "Ana Simões"
VOTING                Voting action expressions             "deliberado", "aprovado"
SUBJECT               The subject matter being voted on     "budget changes"
COUNTING-UNANIMITY    Unanimous vote indicators             "unanimously"
COUNTING-MAJORITY     Majority vote indicators              "by majority"
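
With BIO encoding, the 8 entity types above expand to the 17 labels mentioned earlier (a B- and an I- tag per type, plus the O tag for non-entity tokens). The label set can be reproduced as follows; the ordering is illustrative and not necessarily the one stored in the model's config:

ENTITY_TYPES = [
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
    "VOTING", "SUBJECT", "COUNTING-UNANIMITY", "COUNTING-MAJORITY",
]

# One B- (begin) and one I- (inside) tag per entity type, plus the O (outside) tag
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
assert len(LABELS) == 17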

Training Details

Training Data

The model was trained on the VotIE dataset, which consists of Portuguese municipal meeting minutes annotated with voting information:

  • Training set: 1,737 examples
  • Validation set: 433 examples
  • Test set: 433 examples
  • Total tokens: ~300K tokens
  • Total entities: ~5K entities

Training Procedure

Hyperparameters:

  • Base model: microsoft/deberta-v3-base
  • Batch size: 16
  • Learning rate: 5e-5 (linear decay with warmup)
  • Warmup proportion: 10%
  • Weight decay: 0.01
  • Dropout: 0.1
  • Max sequence length: 512 tokens
  • Epochs: 10
  • Optimizer: AdamW
  • Training time: ~1.5 hours on NVIDIA L40 GPU
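
Taken together, these settings correspond roughly to the training loop skeleton below. This is a sketch under stated assumptions: model is the tagger from the architecture sketch above, and train_loader is a hypothetical DataLoader yielding input_ids, attention_mask and labels.

import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS = 10
total_steps = EPOCHS * len(train_loader)
warmup_steps = int(0.10 * total_steps)  # 10% warmup proportion

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

for epoch in range(EPOCHS):
    model.train()
    for batch in train_loader:
        loss = model(**batch)  # CRF negative log-likelihood (see the sketch above)
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()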

Training Details:

  • Class imbalance handling with weighted loss (O-tag weight: 0.01)
  • O-tag bias initialization (bias: 6.0) to prevent model collapse
  • Windowing for long documents (512 tokens with 50-token overlap; see the sketch after this list)
  • Early stopping with patience=3 epochs
  • BIO constraint validation during evaluation
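
Of these, windowing is the easiest to picture in code. Below is a minimal sketch using the tokenizer's overflow support, where the 50-token overlap maps to stride=50 (the original preprocessing pipeline may differ, and long_minutes_text is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def window_document(text, max_length=512, overlap=50):
    """Split a long document into overlapping windows of at most max_length tokens."""
    return tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        stride=overlap,                  # tokens shared between consecutive windows
        return_overflowing_tokens=True,  # emit one encoding per window
        return_offsets_mapping=True,     # character spans, useful for merging predictions
    )

windows = window_document(long_minutes_text)
print(len(windows["input_ids"]), "windows")

Predictions from overlapping windows then need to be merged back into a single tag sequence; the overlap gives the model context around window boundaries.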

Results

Entity-Level Performance (Test Set)

Metric       Score
F1 Score     93.00%
Precision    91.08%
Recall       95.01%

Per-Entity Performance

Entity Type           Precision   Recall     F1 Score   Support
COUNTING-MAJORITY     92.86%      100.00%    96.30%     52
COUNTING-UNANIMITY    94.47%      100.00%    97.16%     222
SUBJECT               84.22%      84.45%     84.34%     373
VOTER-ABSENT          95.45%      95.45%     95.45%     22
VOTER-ABSTENTION      88.46%      100.00%    93.88%     138
VOTER-AGAINST         97.44%      95.00%     96.20%     40
VOTER-FAVOR           92.19%      97.25%     94.66%     255
VOTING                94.50%      98.26%     96.34%     402
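
These are standard span-based (entity-level) scores: a prediction counts as correct only if both the entity boundaries and the type match. They can be computed with the seqeval library, as in the toy example below (an assumed choice of tooling, not necessarily the scorer used in the paper):

from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO tag sequences, one list per sentence (toy example)
y_true = [["B-VOTER-FAVOR", "I-VOTER-FAVOR", "B-VOTING", "O",
           "B-COUNTING-UNANIMITY", "I-COUNTING-UNANIMITY"]]
y_pred = [["B-VOTER-FAVOR", "I-VOTER-FAVOR", "B-VOTING", "O",
           "B-COUNTING-UNANIMITY", "I-COUNTING-UNANIMITY"]]

print(f1_score(y_true, y_pred))           # 1.0 on this toy example
print(classification_report(y_true, y_pred))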

Comparison with Other Models

This model achieves the best performance among all tested architectures on the VotIE dataset:

Model            Architecture           Entity F1   Event F1
DeBERTa-CRF      DeBERTa v3 + CRF       93.0%       90.8%
XLM-R-CRF        XLM-RoBERTa + CRF      92.6%       90.3%
BERTimbau-CRF    BERTimbau + CRF        92.4%       89.9%
DeBERTa-Linear   DeBERTa v3 + Linear    92.1%       88.7%

Full results are available in the VotIE paper.

Usage

Quick Start

The simplest way to use the model:

from transformers import AutoTokenizer, AutoModel

# Load model
model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Analyze text
text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Print results
for pred in predictions:
    print(f"{pred['word']:20} {pred['label']}")

Output:

O                    B-VOTER-FAVOR
Executivo            I-VOTER-FAVOR
deliberou            B-VOTING
aprovar              O
o                    O
projeto              O
por                  B-COUNTING-UNANIMITY
unanimidade.         I-COUNTING-UNANIMITY

Extract Entities

Get structured entities from voting documents:

from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = """A Câmara Municipal deliberou aprovar a proposta apresentada pelo
Senhor Presidente. Votaram a favor os Senhores Vereadores João Silva e
Maria Costa. Votou contra o Senhor Vereador Pedro Santos."""

inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Extract entities by type
entities = {}
current_entity = []
current_type = None

for pred in predictions:
    label = pred['label']
    word = pred['word']

    if label.startswith('B-'):
        # Save previous entity
        if current_entity:
            entity_type = current_type.replace('B-', '').replace('I-', '')
            if entity_type not in entities:
                entities[entity_type] = []
            entities[entity_type].append(' '.join(current_entity))
        # Start new entity
        current_entity = [word]
        current_type = label

    elif label.startswith('I-') and current_entity:
        current_entity.append(word)

    else:  # O tag
        if current_entity:
            entity_type = current_type.replace('B-', '').replace('I-', '')
            if entity_type not in entities:
                entities[entity_type] = []
            entities[entity_type].append(' '.join(current_entity))
        current_entity = []
        current_type = None

# Save last entity
if current_entity:
    entity_type = current_type.replace('B-', '').replace('I-', '')
    if entity_type not in entities:
        entities[entity_type] = []
    entities[entity_type].append(' '.join(current_entity))

# Print entities
for entity_type, entity_list in entities.items():
    print(f"\n{entity_type}:")
    for entity in entity_list:
        print(f"  - {entity}")

Output:

VOTER-FAVOR:
  - A Câmara Municipal
  - João Silva
  - Maria Costa

VOTING:
  - deliberou

VOTER-AGAINST:
  - Pedro Santos

With Character Offsets

Useful for highlighting entities in your UI:

from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions with character positions
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text, return_offsets=True)

# Show only entities (non-O tags)
for pred in predictions:
    if pred['label'] != 'O':
        print(f"{pred['word']:20} {pred['label']:25} [{pred['start']}:{pred['end']}]")

Output:

O                    B-VOTER-FAVOR             [0:1]
Executivo            I-VOTER-FAVOR             [1:11]
deliberou            B-VOTING                  [11:21]
por                  B-COUNTING-UNANIMITY      [39:43]
unanimidade.         I-COUNTING-UNANIMITY      [43:56]
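
The offsets make it easy to render inline highlights. Below is a small illustration that reuses text and predictions from the snippet above and wraps each predicted entity token in brackets (merge consecutive B-/I- tokens first, as in the entity-extraction example, if you want one highlight per entity):

# Insert markers from right to left so earlier offsets stay valid
highlighted = text
for pred in sorted(predictions, key=lambda p: p["start"], reverse=True):
    if pred["label"] == "O":
        continue
    start, end = pred["start"], pred["end"]
    highlighted = highlighted[:end] + "]" + highlighted[end:]
    highlighted = highlighted[:start] + f"[{pred['label']}: " + highlighted[start:]

print(highlighted)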

Limitations and Bias

Limitations

  • Domain-specific: Trained specifically on Portuguese municipal meeting minutes; may not generalize well to other document types
  • Portuguese only: Optimized for European Portuguese; may perform worse on other variants (e.g., Brazilian Portuguese)
  • Sequence length: Limited to 512 tokens per window (handles longer documents via windowing)
  • Entity types: Limited to 8 predefined voting-related entity types
  • Complex sentences: May struggle with highly complex or nested voting descriptions

Bias Considerations

  • Geographic bias: Training data predominantly from Portuguese municipalities; may not capture regional variations
  • Temporal bias: Training data from municipal minutes of specific time periods
  • Formality bias: Trained on formal administrative language; informal voting descriptions may be less accurate
  • Class imbalance: O-tag (non-entity) tokens significantly outnumber entity tokens, and some voter types are rare; mitigated with class weighting during training

Model Card Authors

  • Anonymous Authors (for blind review)

Model Card Contact

For questions or issues, please open an issue in the GitHub repository.

Additional Resources

License

This model is released under the Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) license.

  • You can: Use the model for research and commercial purposes with attribution
  • You cannot: Create derivative works or modified versions
  • You must: Provide attribution to the original authors

See LICENSE for full details.

Acknowledgments

This work builds upon:

  • DeBERTa v3: Microsoft's DeBERTa v3 multilingual base model
  • pytorch-crf: CRF implementation by kmkurn
  • Transformers: Hugging Face Transformers library

Model training was conducted on NVIDIA L40 GPU infrastructure.


Version: 1.0
Last Updated: 2025-10-17
Framework: PyTorch + Transformers + torchcrf
