---
language: en
license: mit
library_name: transformers
tags:
- token-classification
- ner
- plants
- botany
- roberta
- biology
- horticulture
datasets:
- custom
widget:
- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
  example_title: "Scientific plant names"
- text: "My hibiscus and pachypodium plants need watering."
  example_title: "Common plant names"
- text: "The beautiful roses are blooming next to the oak tree."
  example_title: "Mixed plant references"
pipeline_tag: token-classification
model-index:
- name: roberta-plant-ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      type: custom
      name: Plant NER Dataset
    metrics:
    - type: f1
      value: 0.92
      name: F1 Score
    - type: precision
      value: 0.90
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
---
# RoBERTa Plant Named Entity Recognition
## Model Description

This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) for plant named entity recognition. It identifies and classifies plant names in text into two categories:
- **PLANT_COMMON**: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
- **PLANT_SCI**: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")
## Intended Uses & Limitations

### Intended Uses
- Botanical text analysis: Extract plant mentions from research papers, articles, and documentation
- Gardening applications: Identify plants mentioned in gardening guides, forums, and care instructions
- Agricultural text processing: Parse agricultural documents and reports
- Educational tools: Assist in botany and horticulture education
- Content management: Automatically tag and categorize plant-related content
### Limitations
- Trained primarily on English text
- May have lower accuracy on rare or highly specialized plant species
- Performance may vary on informal text, social media, or heavily abbreviated content
- Does not distinguish between live plants and plant products (e.g., "rose oil")
## Training Data
The model was trained on a custom dataset containing:
- Botanical literature and research papers
- Gardening guides and plant care instructions
- Agricultural documents
- Horticultural databases
- Plant identification guides
**Data Format:** CoNLL-style IOB2 tagging with whole-word tokenization

**Training Examples:** Thousands of annotated sentences containing plant references
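For illustration, a sentence in this format looks like the following (the tokens and tags here are invented, not drawn from the actual dataset):

```
I           O
have        O
a           O
Rosa        B-PLANT_SCI
damascena   I-PLANT_SCI
and         O
an          O
oak         B-PLANT_COMMON
tree        I-PLANT_COMMON
.           O
```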
## Training Procedure

### Training Hyperparameters
- Base Model: FacebookAI/roberta-base
- Training Framework: Hugging Face Transformers
- Tokenization: RoBERTa tokenizer with whole-word alignment
- Label Encoding: IOB2 (Inside-Outside-Beginning) format
- Sequence Length: 512 tokens maximum
- Batch Size: Optimized for training efficiency
- Learning Rate: Adaptive with warmup
- Training Epochs: Multiple epochs with early stopping
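The exact values are not published. As a rough orientation, a comparable fine-tuning setup with the Hugging Face `Trainer` might look like the sketch below; all numeric values, and the `train_ds`/`eval_ds` datasets, are illustrative assumptions, not the settings actually used:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# add_prefix_space=True lets the RoBERTa tokenizer handle pre-split words
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

# All values below are illustrative, not the card's actual settings
args = TrainingArguments(
    output_dir="roberta-plant-ner",
    learning_rate=2e-5,              # "adaptive with warmup"
    warmup_ratio=0.1,
    per_device_train_batch_size=16,  # assumption
    num_train_epochs=10,             # upper bound; early stopping ends training sooner
    eval_strategy="epoch",           # `evaluation_strategy` on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # hypothetical tokenized, IOB2-labeled datasets
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```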
### Label Schema

```
O               # Outside any plant entity
B-PLANT_COMMON  # Beginning of common plant name
I-PLANT_COMMON  # Inside/continuation of common plant name
B-PLANT_SCI     # Beginning of scientific plant name
I-PLANT_SCI     # Inside/continuation of scientific plant name
```
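The index-to-label mapping is stored in the model config and can be inspected at runtime (the exact ordering shown in the comment is an assumption; print it to confirm):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Dudeman523/roberta-plant-ner")
print(config.num_labels)  # 5
print(config.id2label)    # e.g. {0: 'O', 1: 'B-PLANT_COMMON', ...}
```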
### Training Features
- Whole-word tokenization: Ensures proper handling of plant names
- B-I-O validation: Automatic correction of invalid tag sequences (see the sketch after this list)
- Class balancing: Weighted sampling for entity type balance
- Data augmentation: Synthetic examples for robustness
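The exact validation logic is not published, but the most common repair rule promotes an `I-` tag that lacks a compatible predecessor to `B-`. A minimal sketch of that rule:

```python
def repair_iob2(tags):
    """Promote orphan I- tags to B- so every entity span starts with B-."""
    fixed, prev = [], "O"
    for tag in tags:
        # An I- tag is only valid right after a B-/I- tag of the same entity type
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_iob2(["O", "I-PLANT_SCI", "I-PLANT_SCI", "O"]))
# ['O', 'B-PLANT_SCI', 'I-PLANT_SCI', 'O']
```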
## Evaluation
The model achieves strong performance on plant entity recognition:
| Metric    | Overall | PLANT_COMMON | PLANT_SCI |
|-----------|---------|--------------|-----------|
| Precision | 0.90    | 0.88         | 0.92      |
| Recall    | 0.94    | 0.96         | 0.91      |
| F1-Score  | 0.92    | 0.92         | 0.91      |
### Performance Notes
- Excellent recall for common plant names (0.96)
- Strong precision for scientific names (0.92)
- Robust performance across different text types
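Entity-level precision, recall, and F1 for IOB2 tags are conventionally computed with the `seqeval` library; a minimal sketch with invented gold/predicted sequences (whether the card's numbers were produced exactly this way is an assumption):

```python
from seqeval.metrics import classification_report

# One inner list of IOB2 tags per sentence; these sequences are invented
y_true = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O", "B-PLANT_COMMON"]]
y_pred = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O", "O"]]

print(classification_report(y_true, y_pred, digits=2))
```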
## Usage

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Extract plant entities
text = "I love my Rosa damascena roses and the old oak tree in my garden."
entities = ner_pipeline(text)

for entity in entities:
    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
```
### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")

# Tokenize input
text = "The Pachypodium lamerei succulent needs minimal watering."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = torch.argmax(predictions, dim=-1)[0]

for token, label_id in zip(tokens, predicted_labels):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"Token: {token} | Label: {label}")
```
### Batch Processing

```python
# Process multiple texts efficiently (reuses ner_pipeline from Quick Start)
texts = [
    "My hibiscus is blooming beautifully this spring.",
    "Quercus alba and Acer saccharum are common in this forest.",
    "I need care instructions for my Rosa damascena plant."
]

# Batch prediction
results = ner_pipeline(texts)

for i, (text, entities) in enumerate(zip(texts, results)):
    print(f"\nText {i+1}: {text}")
    for entity in entities:
        print(f"  🌱 {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")
```
## Model Architecture
- Base Architecture: RoBERTa (Robustly Optimized BERT Pretraining Approach)
- Parameters: ~125M parameters
- Layers: 12 transformer layers
- Hidden Size: 768
- Attention Heads: 12
- Vocabulary: 50,265 tokens
- Classification Head: Linear layer for 5-class token classification
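These figures match `roberta-base` (a 12-layer, 768-hidden, 12-head encoder) plus a small token-classification head; the parameter count can be confirmed at runtime:

```python
print(model.num_parameters())  # roughly 125M for roberta-base plus the head
```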
## Ethical Considerations

### Bias and Fairness
- Model may reflect geographical and cultural biases present in training data
- Potential underrepresentation of plants from certain regions or cultures
- May perform better on commonly cultivated plants versus wild or rare species
### Environmental Impact
- Training computational cost: Moderate (fine-tuning only)
- Inference efficiency: Optimized for production use
- Carbon footprint: Minimal incremental impact over base model
## Technical Specifications
- Input: Text sequences up to 512 tokens
- Output: Token-level classifications with confidence scores
- Inference Speed: ~100-500 texts/second (depending on hardware)
- Memory Requirements: ~500MB RAM for inference
- Supported Formats: Raw text, tokenized input
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{roberta-plant-ner,
  title={RoBERTa Plant Named Entity Recognition Model},
  author={Dudeman523},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
}
```
## Contact
For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.
- **Model Version:** 1.0
- **Last Updated:** December 2024
- **Framework Compatibility:** `transformers >= 4.21.0`