---
language: en
license: mit
library_name: transformers
tags:
- token-classification
- ner
- plants
- botany
- roberta
- biology
- horticulture
datasets:
- custom
widget:
- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
  example_title: "Scientific plant names"
- text: "My hibiscus and pachypodium plants need watering."
  example_title: "Common plant names"
- text: "The beautiful roses are blooming next to the oak tree."
  example_title: "Mixed plant references"
pipeline_tag: token-classification
model-index:
- name: roberta-plant-ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      type: custom
      name: Plant NER Dataset
    metrics:
    - type: f1
      value: 0.92
      name: F1 Score
    - type: precision
      value: 0.90
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
---

RoBERTa Plant Named Entity Recognition

Model Description

This model is a fine-tuned version of FacebookAI/roberta-base for plant named entity recognition. It identifies and classifies plant names in text into two categories:

  • PLANT_COMMON: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
  • PLANT_SCI: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")

Intended Uses & Limitations

Intended Uses

  • Botanical text analysis: Extract plant mentions from research papers, articles, and documentation
  • Gardening applications: Identify plants mentioned in gardening guides, forums, and care instructions
  • Agricultural text processing: Parse agricultural documents and reports
  • Educational tools: Assist in botany and horticulture education
  • Content management: Automatically tag and categorize plant-related content

Limitations

  • Trained primarily on English text
  • May have lower accuracy on rare or highly specialized plant species
  • Performance may vary on informal text, social media, or heavily abbreviated content
  • Does not distinguish between live plants and plant products (e.g., "rose oil")

Training Data

The model was trained on a custom dataset containing:

  • Botanical literature and research papers
  • Gardening guides and plant care instructions
  • Agricultural documents
  • Horticultural databases
  • Plant identification guides

  • Data Format: CoNLL-style IOB2 tagging with whole-word tokenization (an illustrative example follows)
  • Training Examples: Thousands of annotated sentences containing plant references
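
For illustration, a single sentence annotated in this CoNLL-style IOB2 format would look roughly like the following (the sentence comes from the widget examples above; the exact column layout of the original dataset is an assumption):

I          O
have       O
a          O
Rosa       B-PLANT_SCI
damascena  I-PLANT_SCI
and        O
some       O
Quercus    B-PLANT_SCI
alba       I-PLANT_SCI
trees      O
in         O
my         O
garden     O
.          O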

Training Procedure

Training Hyperparameters

  • Base Model: FacebookAI/roberta-base
  • Training Framework: Hugging Face Transformers
  • Tokenization: RoBERTa tokenizer with whole-word alignment
  • Label Encoding: IOB2 (Inside-Outside-Beginning) format
  • Sequence Length: 512 tokens maximum
  • Batch Size: Optimized for training efficiency
  • Learning Rate: Adaptive with warmup
  • Training Epochs: Multiple epochs with early stopping

Label Schema

O              # Outside any plant entity
B-PLANT_COMMON # Beginning of common plant name
I-PLANT_COMMON # Inside/continuation of common plant name  
B-PLANT_SCI    # Beginning of scientific plant name
I-PLANT_SCI    # Inside/continuation of scientific plant name
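
As a minimal sketch of how the schema above maps onto a 5-class classification head when fine-tuning from the base checkpoint (the numeric label order and tokenizer options shown here are reasonable assumptions, not the authors' exact configuration):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# The five IOB2 labels listed above; their numeric order is an assumption
labels = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Base checkpoint with a fresh 5-class token-classification head
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/roberta-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# add_prefix_space=True lets RoBERTa's BPE tokenizer accept pre-split (whole-word) input
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)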

Training Features

  • Whole-word tokenization: Ensures proper handling of plant names (a word-to-sub-token alignment sketch follows this list)
  • B-I-O validation: Automatic correction of invalid tag sequences
  • Class balancing: Weighted sampling for entity type balance
  • Data augmentation: Synthetic examples for robustness
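
The whole-word handling referenced above comes down to aligning word-level IOB2 tags with RoBERTa's sub-word tokens. The sketch below shows one common way to do this, masking special and continuation tokens with -100 so they are ignored by the loss; this masking convention is an assumption, not necessarily the authors' exact procedure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)
label2id = {"O": 0, "B-PLANT_COMMON": 1, "I-PLANT_COMMON": 2, "B-PLANT_SCI": 3, "I-PLANT_SCI": 4}

# Word-level annotations, as in the CoNLL-style example above
words = ["Rosa", "damascena", "roses", "bloom", "early"]
tags  = ["B-PLANT_SCI", "I-PLANT_SCI", "B-PLANT_COMMON", "O", "O"]

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word = None
for word_idx in encoding.word_ids():
    if word_idx is None:              # <s>, </s>: ignored by the loss
        aligned_labels.append(-100)
    elif word_idx != previous_word:   # first sub-token of a word carries its tag
        aligned_labels.append(label2id[tags[word_idx]])
    else:                             # continuation sub-tokens are masked
        aligned_labels.append(-100)
    previous_word = word_idx

print(list(zip(encoding.tokens(), aligned_labels)))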

Evaluation

The model achieves strong performance on plant entity recognition:

Metric      Overall   PLANT_COMMON   PLANT_SCI
Precision   0.90      0.88           0.92
Recall      0.94      0.96           0.91
F1-Score    0.92      0.92           0.91
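
Span-level scores like these are conventionally computed with the seqeval library on IOB2 tag sequences; the snippet below is a generic illustration of that computation, not a claim about the exact evaluation script used (the tag sequences shown are toy data):

from seqeval.metrics import classification_report, f1_score

# Gold and predicted IOB2 tag sequences, one list per sentence (toy example)
y_true = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O"], ["B-PLANT_COMMON", "O"]]
y_pred = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O"], ["O", "O"]]

print(classification_report(y_true, y_pred))   # per-entity-type precision/recall/F1
print("Micro F1:", f1_score(y_true, y_pred))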

Performance Notes

  • Excellent recall for common plant names (0.96)
  • Strong precision for scientific names (0.92)
  • Robust performance across different text types

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
ner_pipeline = pipeline(
    "token-classification", 
    model=model, 
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Extract plant entities
text = "I love my Rosa damascena roses and the old oak tree in my garden."
entities = ner_pipeline(text)

for entity in entities:
    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")

Advanced Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")

# Tokenize input
text = "The Pachypodium lamerei succulent needs minimal watering."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = torch.argmax(predictions, dim=-1)[0]

for token, label_id in zip(tokens, predicted_labels):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"Token: {token} | Label: {label}")

Batch Processing

# Process multiple texts efficiently
texts = [
    "My hibiscus is blooming beautifully this spring.",
    "Quercus alba and Acer saccharum are common in this forest.",
    "I need care instructions for my Rosa damascena plant."
]

# Batch prediction
results = ner_pipeline(texts)

for i, (text, entities) in enumerate(zip(texts, results)):
    print(f"\nText {i+1}: {text}")
    for entity in entities:
        print(f"  ๐ŸŒฑ {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")

Model Architecture

  • Base Architecture: RoBERTa (Robustly Optimized BERT Pretraining Approach)
  • Parameters: ~125M parameters
  • Layers: 12 transformer layers
  • Hidden Size: 768
  • Attention Heads: 12
  • Vocabulary: 50,265 tokens
  • Classification Head: Linear layer for 5-class token classification
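
These figures correspond to the standard roberta-base configuration and can be checked directly from the published config (the repository name follows the Quick Start example above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Dudeman523/roberta-plant-ner")
print(config.num_hidden_layers)    # 12 transformer layers
print(config.hidden_size)          # hidden size 768
print(config.num_attention_heads)  # 12 attention heads
print(config.vocab_size)           # 50,265-token vocabulary
print(config.num_labels)           # 5 classification labels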

Ethical Considerations

Bias and Fairness

  • Model may reflect geographical and cultural biases present in training data
  • Potential underrepresentation of plants from certain regions or cultures
  • May perform better on commonly cultivated plants versus wild or rare species

Environmental Impact

  • Training computational cost: Moderate (fine-tuning only)
  • Inference efficiency: Optimized for production use
  • Carbon footprint: Minimal incremental impact over base model

Technical Specifications

  • Input: Text sequences up to 512 tokens
  • Output: Token-level classifications with confidence scores
  • Inference Speed: ~100-500 texts/second (depending on hardware)
  • Memory Requirements: ~500MB RAM for inference
  • Supported Formats: Raw text, tokenized input
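
Because input is capped at 512 tokens, longer documents should be split before inference. The sketch below uses a naive paragraph split and the ner_pipeline object from the Quick Start example; the file name is hypothetical, any sentence or paragraph splitter can be substituted, and each chunk must itself fit within the 512-token limit:

# Split a long document into paragraph-sized chunks before running NER
with open("botany_article.txt") as f:   # hypothetical input file
    long_text = f.read()

chunks = [p for p in long_text.split("\n\n") if p.strip()]

all_entities = []
for entities in ner_pipeline(chunks):   # batch inference, one result list per chunk
    all_entities.extend(entities)

print(f"Found {len(all_entities)} plant mentions across {len(chunks)} chunks")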

Citation

If you use this model in your research, please cite:

@misc{roberta-plant-ner,
  title={RoBERTa Plant Named Entity Recognition Model},
  author={Dudeman523},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
}

Contact

For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.


Model Version: 1.0
Last Updated: December 2024
Framework Compatibility: transformers >= 4.21.0
