DeBERTa-CRF-VotIE: Portuguese Voting Information Extraction
This model is a fine-tuned DeBERTa v3 Base with a Conditional Random Fields (CRF) layer for extracting structured voting information from Portuguese municipal meeting minutes. It achieves state-of-the-art performance on the VotIE benchmark dataset.
Model Description
DeBERTa-CRF-VotIE combines the robust contextual representations of Microsoft's DeBERTa v3 base model with a CRF layer for structured sequence prediction. The model performs token-level classification to identify and extract voting-related entities from Portuguese administrative text.
Key Features
- Architecture: DeBERTa v3 Base (768-dim, 12 layers) + Linear + CRF
- Task: Sequence Labeling with BIO tagging
- Language: Portuguese (Portugal)
- Domain: Municipal meeting minutes and voting records
- Entity Types: 8 types (17 labels with BIO encoding)
- Performance: 93.00% entity-level F1 score
Intended Uses
This model is designed for:
- Extracting voting information from Portuguese municipal documents
- Identifying participants and their voting positions (favor, against, abstention, absent)
- Recognizing voting subjects and counting methods
- Structuring unstructured administrative text
- Research in information extraction from Portuguese administrative documents
Entity Types
The model recognizes 8 entity types in BIO format (17 labels total):
| Entity Type | Description | Example |
|---|---|---|
| VOTER-FAVOR | Participants who voted in favor | "The Municipal Executive" |
| VOTER-AGAINST | Participants who voted against | "João Silva" |
| VOTER-ABSTENTION | Participants who abstained | "The councilor from PS" |
| VOTER-ABSENT | Participants who were absent | "Ana Simões" |
| VOTING | Voting action expressions | "deliberado", "aprovado" |
| SUBJECT | The subject matter being voted on | "budget changes" |
| COUNTING-UNANIMITY | Unanimous vote indicators | "unanimously" |
| COUNTING-MAJORITY | Majority vote indicators | "by majority" |
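The full 17-label tag set follows directly from the 8 entity types above: each type gets a B- and an I- tag under BIO encoding, plus the single O (outside) tag. A minimal sketch of how the label list can be constructed:

```python
# The 8 entity types from the table above; BIO encoding gives each a B- and
# an I- tag, and the O (outside) tag brings the total to 17 labels.
ENTITY_TYPES = [
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
    "VOTING", "SUBJECT", "COUNTING-UNANIMITY", "COUNTING-MAJORITY",
]

LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

print(len(LABELS))  # 17
```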
Training Details
Training Data
The model was trained on the VotIE dataset, which consists of Portuguese municipal meeting minutes annotated with voting information:
- Training set: 1,737 examples
- Validation set: 433 examples
- Test set: 433 examples
- Total tokens: ~300K tokens
- Total entities: ~5K entities
Training Procedure
Hyperparameters:
- Base model: `microsoft/deberta-v3-base`
- Batch size: 16
- Learning rate: 5e-5 (linear decay with warmup)
- Warmup proportion: 10%
- Weight decay: 0.01
- Dropout: 0.1
- Max sequence length: 512 tokens
- Epochs: 10
- Optimizer: AdamW
- Training time: ~1.5 hours on NVIDIA L40 GPU
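From the figures above, the warmup length implied by the 10% proportion can be derived; a hedged sketch (rounding steps-per-epoch up is an assumption about the data loader):

```python
import math

# Values taken from the training setup above
train_examples = 1737
batch_size = 16
epochs = 10
warmup_proportion = 0.10

steps_per_epoch = math.ceil(train_examples / batch_size)  # 109
total_steps = steps_per_epoch * epochs                    # 1090
warmup_steps = int(warmup_proportion * total_steps)       # 109
print(warmup_steps)
```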
Training Details:
- Class imbalance handling with weighted loss (O-tag weight: 0.01)
- O-tag bias initialization (bias: 6.0) to prevent model collapse
- Windowing for long documents (512 tokens with 50-token overlap)
- Early stopping with patience=3 epochs
- BIO constraint validation during evaluation
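The windowing scheme above (512-token windows with a 50-token overlap) can be sketched as follows; `window_tokens` is a hypothetical helper, and the 462-token stride is derived from those two numbers:

```python
def window_tokens(token_ids, window_size=512, overlap=50):
    """Split a long token sequence into overlapping windows.

    Consecutive windows share `overlap` tokens, so the stride is
    window_size - overlap (462 with the defaults stated above).
    """
    stride = window_size - overlap
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows

# A 1000-token document yields three windows: [0:512], [462:974], [924:1000]
chunks = window_tokens(list(range(1000)))
print([(c[0], c[-1] + 1) for c in chunks])  # [(0, 512), (462, 974), (924, 1000)]
```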
Results
Entity-Level Performance (Test Set)
| Metric | Score |
|---|---|
| F1 Score | 93.00% |
| Precision | 91.08% |
| Recall | 95.01% |
Per-Entity Performance
| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| COUNTING-MAJORITY | 92.86% | 100.00% | 96.30% | 52 |
| COUNTING-UNANIMITY | 94.47% | 100.00% | 97.16% | 222 |
| SUBJECT | 84.22% | 84.45% | 84.34% | 373 |
| VOTER-ABSENT | 95.45% | 95.45% | 95.45% | 22 |
| VOTER-ABSTENTION | 88.46% | 100.00% | 93.88% | 138 |
| VOTER-AGAINST | 97.44% | 95.00% | 96.20% | 40 |
| VOTER-FAVOR | 92.19% | 97.25% | 94.66% | 255 |
| VOTING | 94.50% | 98.26% | 96.34% | 402 |
Comparison with Other Models
This model achieves the best performance among all tested architectures on the VotIE dataset:
| Model | Architecture | Entity F1 | Event F1 |
|---|---|---|---|
| DeBERTa-CRF | DeBERTa v3 + CRF | 93.0% | 90.8% |
| XLM-R-CRF | XLM-RoBERTa + CRF | 92.6% | 90.3% |
| BERTimbau-CRF | BERTimbau + CRF | 92.4% | 89.9% |
| DeBERTa-Linear | DeBERTa v3 + Linear | 92.1% | 88.7% |
Full results are available in the VotIE paper.
Usage
Quick Start
The simplest way to use the model:
```python
from transformers import AutoTokenizer, AutoModel

# Load model
model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Analyze text
text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Print results
for pred in predictions:
    print(f"{pred['word']:20} {pred['label']}")
```
Output:
```
O                    B-VOTER-FAVOR
Executivo            I-VOTER-FAVOR
deliberou            B-VOTING
aprovar              O
o                    O
projeto              O
por                  B-COUNTING-UNANIMITY
unanimidade.         I-COUNTING-UNANIMITY
```
Extract Entities
Get structured entities from voting documents:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = """A Câmara Municipal deliberou aprovar a proposta apresentada pelo
Senhor Presidente. Votaram a favor os Senhores Vereadores João Silva e
Maria Costa. Votou contra o Senhor Vereador Pedro Santos."""

inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Group BIO-tagged tokens into entities, keyed by entity type
entities = {}
current_entity = []
current_type = None

def save_current():
    """Store the in-progress entity under its type (B-/I- prefix stripped)."""
    if current_entity:
        entity_type = current_type[2:]  # drop "B-"/"I-"
        entities.setdefault(entity_type, []).append(' '.join(current_entity))

for pred in predictions:
    label = pred['label']
    word = pred['word']
    if label.startswith('B-'):
        save_current()           # close the previous entity, if any
        current_entity = [word]  # start a new entity
        current_type = label
    elif label.startswith('I-') and current_entity:
        current_entity.append(word)
    else:  # O tag ends any open entity
        save_current()
        current_entity = []
        current_type = None

save_current()  # save the last entity

# Print entities
for entity_type, entity_list in entities.items():
    print(f"\n{entity_type}:")
    for entity in entity_list:
        print(f"  - {entity}")
```
Output:
```
VOTER-FAVOR:
  - A Câmara Municipal
  - João Silva
  - Maria Costa

VOTING:
  - deliberou

VOTER-AGAINST:
  - Pedro Santos
```
With Character Offsets
Useful for highlighting entities in your UI:
```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions with character positions
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text, return_offsets=True)

# Show only entities (non-O tags)
for pred in predictions:
    if pred['label'] != 'O':
        print(f"{pred['word']:20} {pred['label']:25} [{pred['start']}:{pred['end']}]")
```
Output:
```
O                    B-VOTER-FAVOR             [0:1]
Executivo            I-VOTER-FAVOR             [1:11]
deliberou            B-VOTING                  [11:21]
por                  B-COUNTING-UNANIMITY      [39:43]
unanimidade.         I-COUNTING-UNANIMITY      [43:56]
```
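Given predictions in the shape shown above, the character offsets can be used to mark entities directly in the source string; a minimal sketch where `highlight` is a hypothetical helper and the bracket markup is just for illustration:

```python
def highlight(text, predictions):
    """Wrap predicted entity spans in [label: ...] markers using char offsets."""
    out = []
    cursor = 0
    for pred in predictions:
        if pred["label"] == "O":
            continue
        out.append(text[cursor:pred["start"]])       # text before the entity
        out.append(f"[{pred['label']}: {text[pred['start']:pred['end']]}]")
        cursor = pred["end"]
    out.append(text[cursor:])                        # trailing text
    return "".join(out)

# Example with predictions shaped like the output above
text = "O Executivo deliberou"
preds = [
    {"label": "B-VOTER-FAVOR", "start": 0, "end": 1},
    {"label": "I-VOTER-FAVOR", "start": 1, "end": 11},
    {"label": "B-VOTING", "start": 11, "end": 21},
]
print(highlight(text, preds))
```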
Limitations and Bias
Limitations
- Domain-specific: Trained specifically on Portuguese municipal meeting minutes; may not generalize well to other document types
- Portuguese only: Optimized for European Portuguese
- Sequence length: Limited to 512 tokens per window (handles longer documents via windowing)
- Entity types: Limited to 8 predefined voting-related entity types
- Complex sentences: May struggle with highly complex or nested voting descriptions
Bias Considerations
- Geographic bias: Training data predominantly from Portuguese municipalities; may not capture regional variations
- Temporal bias: Training data from municipal minutes of specific time periods
- Formality bias: Trained on formal administrative language; informal voting descriptions may be less accurate
- Class imbalance: O-tag (non-entity) tokens significantly outnumber entity tokens, and some voter types are rare; addressed with class weighting
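The class weighting used to address this imbalance can be sketched as a per-label weight vector, down-weighting the dominant O tag; the 0.01 value comes from the training details above, while keeping all entity labels at 1.0 is an assumption:

```python
# Per-label loss weights: O tag down-weighted to 0.01 (per the training
# details above); keeping entity labels at 1.0 is an assumption.
ENTITY_TYPES = [
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
    "VOTING", "SUBJECT", "COUNTING-UNANIMITY", "COUNTING-MAJORITY",
]
labels = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
class_weights = {label: (0.01 if label == "O" else 1.0) for label in labels}
print(class_weights["O"], len(class_weights))  # 0.01 17
```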
Model Card Authors
- Anonymous Authors (for blind review)
Model Card Contact
For questions or issues, please open an issue in the GitHub repository.
Additional Resources
- GitHub Repository: https://github.com/Anonymous3445/VotIE
- Dataset: VotIE Dataset
- Paper: [Coming soon]
- Demo: VotIE Demo
License
This model is released under the Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) license.
- ✅ You can: Use the model for research and commercial purposes with attribution
- ❌ You cannot: Create derivative works or modified versions
- 📝 You must: Provide attribution to the original authors
See LICENSE for full details.
Acknowledgments
This work builds upon:
- DeBERTa v3: Microsoft's DeBERTa v3 base model
- pytorch-crf: CRF implementation by kmkurn
- Transformers: Hugging Face Transformers library
Model training was conducted on NVIDIA L40 GPU infrastructure.
Version: 1.0
Last Updated: 2025-10-17
Framework: PyTorch + Transformers + torchcrf