---
language: en
license: mit
library_name: transformers
tags:
- token-classification
- ner
- plants
- botany
- roberta
- biology
- horticulture
datasets:
- custom
widget:
- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
  example_title: "Scientific plant names"
- text: "My hibiscus and pachypodium plants need watering."
  example_title: "Common plant names"
- text: "The beautiful roses are blooming next to the oak tree."
  example_title: "Mixed plant references"
pipeline_tag: token-classification
model-index:
- name: roberta-plant-ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      type: custom
      name: Plant NER Dataset
    metrics:
    - type: f1
      value: 0.92
      name: F1 Score
    - type: precision
      value: 0.90
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
---

RoBERTa Plant Named Entity Recognition

Model Description

This model is a fine-tuned version of FacebookAI/roberta-base for plant named entity recognition. It identifies and classifies plant names in text into two categories:

  • PLANT_COMMON: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
  • PLANT_SCI: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")

Intended Uses & Limitations

Intended Uses

  • Botanical text analysis: Extract plant mentions from research papers, articles, and documentation
  • Gardening applications: Identify plants mentioned in gardening guides, forums, and care instructions
  • Agricultural text processing: Parse agricultural documents and reports
  • Educational tools: Assist in botany and horticulture education
  • Content management: Automatically tag and categorize plant-related content

Limitations

  • Trained primarily on English text
  • May have lower accuracy on rare or highly specialized plant species
  • Performance may vary on informal text, social media, or heavily abbreviated content
  • Does not distinguish between live plants and plant products (e.g., "rose oil")

Training Data

The model was trained on a custom dataset containing:

  • Botanical literature and research papers
  • Gardening guides and plant care instructions
  • Agricultural documents
  • Horticultural databases
  • Plant identification guides

  • Data Format: CoNLL-style IOB2 tagging with whole-word tokenization (an illustrative example follows)
  • Training Examples: Thousands of annotated sentences containing plant references
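
For illustration, a single sentence annotated in this CoNLL-style IOB2 format would look roughly like the following (the sentence comes from the widget examples above; the exact column layout of the original dataset is an assumption):

I          O
have       O
a          O
Rosa       B-PLANT_SCI
damascena  I-PLANT_SCI
and        O
some       O
Quercus    B-PLANT_SCI
alba       I-PLANT_SCI
trees      O
in         O
my         O
garden     O
.          O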

Training Procedure

Training Hyperparameters

  • Base Model: FacebookAI/roberta-base
  • Training Framework: Hugging Face Transformers
  • Tokenization: RoBERTa tokenizer with whole-word alignment
  • Label Encoding: IOB2 (Inside-Outside-Beginning) format
  • Sequence Length: 512 tokens maximum
  • Batch Size: Optimized for training efficiency
  • Learning Rate: Adaptive with warmup
  • Training Epochs: Multiple epochs with early stopping

Label Schema

O              # Outside any plant entity
B-PLANT_COMMON # Beginning of common plant name
I-PLANT_COMMON # Inside/continuation of common plant name  
B-PLANT_SCI    # Beginning of scientific plant name
I-PLANT_SCI    # Inside/continuation of scientific plant name
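
As a minimal sketch of how the schema above maps onto a 5-class classification head when fine-tuning from the base checkpoint (the numeric label order and tokenizer options shown here are reasonable assumptions, not the authors' exact configuration):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# The five IOB2 labels listed above; their numeric order is an assumption
labels = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Base checkpoint with a fresh 5-class token-classification head
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/roberta-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# add_prefix_space=True lets RoBERTa's BPE tokenizer accept pre-split (whole-word) input
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)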

Training Features

  • Whole-word tokenization: Ensures proper handling of plant names (a word-to-sub-token alignment sketch follows this list)
  • B-I-O validation: Automatic correction of invalid tag sequences
  • Class balancing: Weighted sampling for entity type balance
  • Data augmentation: Synthetic examples for robustness
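
The whole-word handling referenced above comes down to aligning word-level IOB2 tags with RoBERTa's sub-word tokens. The sketch below shows one common way to do this, masking special and continuation tokens with -100 so they are ignored by the loss; this masking convention is an assumption, not necessarily the authors' exact procedure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)
label2id = {"O": 0, "B-PLANT_COMMON": 1, "I-PLANT_COMMON": 2, "B-PLANT_SCI": 3, "I-PLANT_SCI": 4}

# Word-level annotations, as in the CoNLL-style example above
words = ["Rosa", "damascena", "roses", "bloom", "early"]
tags  = ["B-PLANT_SCI", "I-PLANT_SCI", "B-PLANT_COMMON", "O", "O"]

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word = None
for word_idx in encoding.word_ids():
    if word_idx is None:              # <s>, </s>: ignored by the loss
        aligned_labels.append(-100)
    elif word_idx != previous_word:   # first sub-token of a word carries its tag
        aligned_labels.append(label2id[tags[word_idx]])
    else:                             # continuation sub-tokens are masked
        aligned_labels.append(-100)
    previous_word = word_idx

print(list(zip(encoding.tokens(), aligned_labels)))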

Evaluation

The model achieves strong performance on plant entity recognition:

Metric      Overall   PLANT_COMMON   PLANT_SCI
Precision   0.90      0.88           0.92
Recall      0.94      0.96           0.91
F1-Score    0.92      0.92           0.91
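
Span-level scores like these are conventionally computed with the seqeval library on IOB2 tag sequences; the snippet below is a generic illustration of that computation, not a claim about the exact evaluation script used (the tag sequences shown are toy data):

from seqeval.metrics import classification_report, f1_score

# Gold and predicted IOB2 tag sequences, one list per sentence (toy example)
y_true = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O"], ["B-PLANT_COMMON", "O"]]
y_pred = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O"], ["O", "O"]]

print(classification_report(y_true, y_pred))   # per-entity-type precision/recall/F1
print("Micro F1:", f1_score(y_true, y_pred))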

Performance Notes

  • Excellent recall for common plant names (0.96)
  • Strong precision for scientific names (0.92)
  • Robust performance across different text types

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
ner_pipeline = pipeline(
    "token-classification", 
    model=model, 
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Extract plant entities
text = "I love my Rosa damascena roses and the old oak tree in my garden."
entities = ner_pipeline(text)

for entity in entities:
    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")

Advanced Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")

# Tokenize input
text = "The Pachypodium lamerei succulent needs minimal watering."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = torch.argmax(predictions, dim=-1)[0]

for token, label_id in zip(tokens, predicted_labels):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"Token: {token} | Label: {label}")

Batch Processing

# Process multiple texts efficiently
texts = [
    "My hibiscus is blooming beautifully this spring.",
    "Quercus alba and Acer saccharum are common in this forest.",
    "I need care instructions for my Rosa damascena plant."
]

# Batch prediction
results = ner_pipeline(texts)

for i, (text, entities) in enumerate(zip(texts, results)):
    print(f"\nText {i+1}: {text}")
    for entity in entities:
        print(f"  ๐ŸŒฑ {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")

Model Architecture

  • Base Architecture: RoBERTa (Robustly Optimized BERT Pretraining Approach)
  • Parameters: ~125M parameters
  • Layers: 12 transformer layers
  • Hidden Size: 768
  • Attention Heads: 12
  • Vocabulary: 50,265 tokens
  • Classification Head: Linear layer for 5-class token classification
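
These figures correspond to the standard roberta-base configuration and can be checked directly from the published config (the repository name follows the Quick Start example above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Dudeman523/roberta-plant-ner")
print(config.num_hidden_layers)    # 12 transformer layers
print(config.hidden_size)          # hidden size 768
print(config.num_attention_heads)  # 12 attention heads
print(config.vocab_size)           # 50,265-token vocabulary
print(config.num_labels)           # 5 classification labels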

Ethical Considerations

Bias and Fairness

  • Model may reflect geographical and cultural biases present in training data
  • Potential underrepresentation of plants from certain regions or cultures
  • May perform better on commonly cultivated plants versus wild or rare species

Environmental Impact

  • Training computational cost: Moderate (fine-tuning only)
  • Inference efficiency: Optimized for production use
  • Carbon footprint: Minimal incremental impact over base model

Technical Specifications

  • Input: Text sequences up to 512 tokens
  • Output: Token-level classifications with confidence scores
  • Inference Speed: ~100-500 texts/second (depending on hardware)
  • Memory Requirements: ~500MB RAM for inference
  • Supported Formats: Raw text, tokenized input
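
Because input is capped at 512 tokens, longer documents should be split before inference. The sketch below uses a naive paragraph split and the ner_pipeline object from the Quick Start example; the file name is hypothetical, any sentence or paragraph splitter can be substituted, and each chunk must itself fit within the 512-token limit:

# Split a long document into paragraph-sized chunks before running NER
with open("botany_article.txt") as f:   # hypothetical input file
    long_text = f.read()

chunks = [p for p in long_text.split("\n\n") if p.strip()]

all_entities = []
for entities in ner_pipeline(chunks):   # batch inference, one result list per chunk
    all_entities.extend(entities)

print(f"Found {len(all_entities)} plant mentions across {len(chunks)} chunks")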

Citation

If you use this model in your research, please cite:

@misc{roberta-plant-ner,
  title={RoBERTa Plant Named Entity Recognition Model},
  author={Dudeman523},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
}

Contact

For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.


Model Version: 1.0
Last Updated: December 2024
Framework Compatibility: transformers >= 4.21.0
