Model Card: BERT NER for Plant Names (PEFT/LoRA Fine-tuned)

Model Description

This model is a fine-tuned version of google-bert/bert-base-cased specifically adapted for Named Entity Recognition (NER) of common and scientific plant names. It uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation) to adapt the query and value projections of the base model's attention layers for this task. The goal is to identify spans of text corresponding to plant names and classify them as either common (PLANT_COMMON) or scientific (PLANT_SCI) according to the IOB2 tagging scheme.

  • Developed by: [Your Name/Organization - Fill this in]
  • Model type: BERT (bert-base-cased) fine-tuned for Token Classification (NER) using PEFT/LoRA
  • Language(s): Primarily English (based on bert-base-cased and the English training data)
  • License: The base model (bert-base-cased) is released under Apache 2.0; the fine-tuned adapter weights inherit this license unless otherwise specified
  • Fine-tuned from model: google-bert/bert-base-cased

Intended Uses & Limitations

Intended Use

This model is intended for identifying and classifying mentions of plant names (common and scientific) within English text. Potential applications include:

  • Extracting plant names from botanical texts, research papers, or gardening articles
  • Structuring information about plant mentions in databases
  • Assisting in indexing or searching documents based on contained plant names
  • Preprocessing text for downstream tasks that require knowledge of plant entities

Limitations

  • Domain Specificity: The model performs best on text similar to its training data (template-generated sentences about plants). Performance may degrade on markedly different domains (e.g., highly informal text or complex biological pathway descriptions) unless similar data was included in training
  • IOB2 Scheme: The model strictly adheres to the IOB2 tagging scheme (B-TAG, I-TAG, O). It identifies the beginning (B-) and inside (I-) tokens of a named entity span
  • Specific Tags: Trained only to recognize PLANT_COMMON and PLANT_SCI. It will tag all other tokens as O (Outside). It cannot identify other entity types (e.g., locations, people, chemicals) unless explicitly trained
  • Ambiguity: May struggle with ambiguous terms where a word could be a plant name in one context but not another (e.g., "Rose" as a person's name vs. the flower)
  • Novel Names: Performance on plant names not seen during training (or very different from those seen) may be lower
  • Context Dependency: Like most NER models, its accuracy depends heavily on the surrounding context. Short, isolated mentions might be harder to classify correctly
  • Case Sensitivity: Based on bert-base-cased, the model is case-sensitive. This can help distinguish scientific names, but may hurt recognition of common names written with inconsistent capitalization

How to Use (with Transformers & PEFT)

This model requires loading the base BERT model first and then applying the trained LoRA adapter.

from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig
from peft import PeftModel
import torch

# --- Configuration ---
BASE_MODEL_NAME = "google-bert/bert-base-cased"
# --- *** Point this to the directory containing the saved adapter *** ---
# E.g., your BEST_MODEL_DIR or CHECKPOINT_DIR from training
ADAPTER_PATH = "/kaggle/working/bert_ner_peft_gpu_best_v4"
# --- ************************************************************** ---
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load Tokenizer (from adapter path or base model)
try:
    tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)
except Exception:
    print(f"Warning: Tokenizer not found in {ADAPTER_PATH}, loading from {BASE_MODEL_NAME}")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

# 2. Load Base Model (ensure config matches training)
# Need label map from training to load config correctly
label_list = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
label_map = {label: i for i, label in enumerate(label_list)}
id_to_label = {i: label for i, label in enumerate(label_list)}
num_labels = len(label_list)

config = AutoConfig.from_pretrained(
    BASE_MODEL_NAME,
    num_labels=num_labels,
    id2label=id_to_label,
    label2id=label_map
)
base_model = AutoModelForTokenClassification.from_pretrained(
    BASE_MODEL_NAME,
    config=config,
    ignore_mismatched_sizes=True  # The token-classification head differs in size from the base checkpoint
)

# Resize embeddings if necessary (if pad token was added during training)
if len(tokenizer) != base_model.get_input_embeddings().weight.shape[0]:
    print(f"Resizing model embeddings to {len(tokenizer)}")
    base_model.resize_token_embeddings(len(tokenizer))

# 3. Load PEFT Model (applies adapter)
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.to(DEVICE)
model.eval()

print("PEFT Model loaded and ready for inference.")

# --- Inference Example ---
text = "The Pineapple Guava (Feijoa sellowiana) is different from Ananas comosus."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(DEVICE)

with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=2)
predicted_token_class_ids = predictions[0].cpu().numpy()

# Map IDs back to labels, aligning with tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].cpu().numpy())
word_ids = inputs.word_ids() # Only available with fast tokenizers

# Group sub-tokens back into words via word_ids and keep the label predicted
# for the first sub-token of each word (matching the training-time alignment)
words, word_labels = [], []
previous_word_idx = None
for i, token in enumerate(tokens):
    word_idx = word_ids[i]
    if word_idx is None:
        continue  # Skip special tokens ([CLS], [SEP], [PAD])
    if word_idx != previous_word_idx:
        words.append(token)
        word_labels.append(id_to_label.get(int(predicted_token_class_ids[i]), "O"))
    else:
        # WordPiece continuation tokens are prefixed with "##"
        words[-1] += token[2:] if token.startswith("##") else token
    previous_word_idx = word_idx

print("Text:", text)
print("Predicted Labels (word level):")
for word, label in zip(words, word_labels):
    if label != "O":
        print(f"- {word}: {label}")

Using the Merged Model

If you used the merging script, you can load the full model directly:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# --- *** Point this to the directory containing the MERGED model *** ---
MERGED_MODEL_PATH = "/kaggle/working/bert_ner_peft_gpu_merged"
# --- ************************************************************** ---
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(MERGED_MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MERGED_MODEL_PATH)
model.to(DEVICE)
model.eval()

print("Merged Model loaded and ready for inference.")

# --- Inference Example (same as above) ---
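
With the merged checkpoint you can also use the transformers pipeline API, which handles tokenization and entity-span grouping for you. A sketch, reusing MERGED_MODEL_PATH and the torch import from the block above:

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model=MERGED_MODEL_PATH,
    tokenizer=MERGED_MODEL_PATH,
    aggregation_strategy="simple",  # merge B-/I- sub-tokens into entity spans
    device=0 if torch.cuda.is_available() else -1,
)
print(ner("The Pineapple Guava (Feijoa sellowiana) is different from Ananas comosus."))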

Using the ONNX Model

An exported ONNX version of the model can be run directly with onnxruntime:

import onnxruntime as ort
import numpy as np
import os
from transformers import AutoTokenizer, AutoConfig

# --- *** Point this to the directory containing the ONNX model *** ---
ONNX_MODEL_DIR = "/kaggle/working/bert_ner_onnx"
# --- ************************************************************** ---

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ONNX_MODEL_DIR)

# Load ONNX model and create session
model_path = os.path.join(ONNX_MODEL_DIR, "model.onnx")
ort_session = ort.InferenceSession(model_path, providers=['CPUExecutionProvider']) # Or ['CUDAExecutionProvider'] if available

# Load id_to_label map (needed for decoding)
# You might need to load this from the saved config.json or redefine it
# Example: Reloading config from the directory
config = AutoConfig.from_pretrained(ONNX_MODEL_DIR)
id_to_label = config.id2label

# --- Inference Example ---
text = "The Pineapple Guava (Feijoa sellowiana) is different from Ananas comosus."
inputs = tokenizer(text, return_tensors="np") # Use numpy for ONNX runtime

# Prepare inputs for the ONNX session, keeping only the inputs the exported
# graph actually declares (some exports omit token_type_ids)
onnx_input_names = {inp.name for inp in ort_session.get_inputs()}
ort_inputs = {k: v for k, v in inputs.items() if k in onnx_input_names}

# Run inference
ort_outputs = ort_session.run(None, ort_inputs)
logits = ort_outputs[0] # Usually the first output

predictions = np.argmax(logits, axis=-1)
predicted_token_class_ids = predictions[0]

# Map IDs back to labels (alignment logic is similar to PyTorch version)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Note: with a fast tokenizer, inputs.word_ids() is available here too, so the
# word-level alignment from the PyTorch example can be reused unchanged.
# For simplicity, this example just prints the raw per-token labels:
print("Text:", text)
print("Predicted Labels (per token):")
for token, label_id in zip(tokens, predicted_token_class_ids):
    if token not in (tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token):
        print(f"- {token}: {id_to_label.get(int(label_id), 'O')}")

Training Data

The model was fine-tuned on a dataset generated from templates focusing on common and scientific plant names. The data format is CoNLL style (one token and tag per line, separated by TAB, with empty lines between sentences).
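
For illustration, a tagged sentence in this format looks like the following (token and tag separated by a TAB; this example reuses the sentence from the usage section and is not taken verbatim from the training set):

The	O
Pineapple	B-PLANT_COMMON
Guava	I-PLANT_COMMON
(	O
Feijoa	B-PLANT_SCI
sellowiana	I-PLANT_SCI
)	O
is	O
different	O
from	O
Ananas	B-PLANT_SCI
comosus	I-PLANT_SCI
.	O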

Data Split: 90% Training, 10% Validation (using sklearn.model_selection.train_test_split with random_state=42).

Training Procedure

Preprocessing

  • Tokenizer: BertTokenizerFast from google-bert/bert-base-cased
  • Padding: Padded/truncated to max_length=128
  • Label Alignment: Standard IOB2 scheme. Labels are aligned to the first token of each word; special tokens and subsequent subword tokens are assigned the ignore_index (-100), as sketched after this list
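
A minimal sketch of this alignment step (illustrative function and variable names, not the exact training code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
label_list = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
label_map = {label: i for i, label in enumerate(label_list)}

def tokenize_and_align_labels(words, word_tags, max_length=128):
    # words: tokens of one sentence; word_tags: their IOB2 tags
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    labels, previous_word_idx = [], None
    for word_idx in encoding.word_ids():
        if word_idx is None:
            labels.append(-100)                            # special tokens
        elif word_idx != previous_word_idx:
            labels.append(label_map[word_tags[word_idx]])  # first sub-token of a word
        else:
            labels.append(-100)                            # later sub-tokens are ignored
        previous_word_idx = word_idx
    encoding["labels"] = labels
    return encoding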

Training

  • Framework: PyTorch with transformers and peft
  • Environment: GPU (likely Kaggle P100/T4/V100 based on setup)
  • Precision: Float32 (AMP was configured, but the final run used FP32 following earlier debugging)
  • Optimizer: AdamW
  • Learning Rate: 2e-5 with linear warmup (10% of steps) and decay
  • Batch Size: 4 (per device)
  • Epochs: Trained for up to 3 epochs with early stopping (patience=3 based on validation F1)
  • PEFT Config: LoRA (r=8, alpha=16, dropout=0.1, target_modules=["query", "value"]); see the sketch after this list
  • Gradient Clipping: Max norm = 1.0
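
The LoRA configuration above corresponds to a peft setup along the following lines (a sketch consistent with the listed hyperparameters, not the exact training script):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

base_model = AutoModelForTokenClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5
)
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,   # keeps the classification head trainable
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA actually trains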

Evaluation Results

Evaluation was performed using the seqeval library with the IOB2 scheme and strict matching. The primary metric tracked was the micro-averaged F1 score.
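
Evaluation with seqeval in strict IOB2 mode looks roughly like this (a sketch; y_true and y_pred are lists of per-sentence label sequences with the -100 positions already removed):

from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

# One list of IOB2 tags per sentence
y_true = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O"]]
y_pred = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O"]]

print("micro F1:", f1_score(y_true, y_pred, average="micro", mode="strict", scheme=IOB2))
print(classification_report(y_true, y_pred, mode="strict", scheme=IOB2))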

Environmental Impact

  • Hardware: Trained on a single NVIDIA P100 GPU
  • Compute: [Estimate training time if known, e.g., Approx. X hours on a single T4 GPU]. Carbon emissions can be estimated using tools like the Machine Learning Impact calculator if compute details are known.

Disclaimer

This model is fine-tuned from a base model and inherits its capabilities and biases. Performance depends heavily on the similarity between the target text and the training data. Always evaluate thoroughly for your specific use case.
