library_name: transformers
license: apache-2.0
tags:
- healthcare
- column-normalization
- text-classification
- distilgpt2
model-index:
- name: tsilva/clinical-field-mapper-classification
  results:
  - task:
      name: Field Classification
      type: text-classification
    dataset:
      name: tsilva/clinical-field-mappings
      type: healthcare
    metrics:
    - name: train Accuracy
      type: accuracy
      value: 0.9471
    - name: validation Accuracy
      type: accuracy
      value: 0.9144
    - name: test Accuracy
      type: accuracy
      value: 0.9156
Model Card for tsilva/clinical-field-mapper-classification
This model is a fine-tuned version of distilbert/distilgpt2 on the tsilva/clinical-field-mappings dataset.
Its purpose is to normalize healthcare database column names to a standardized set of target column names.
Task
This model is a sequence classification model that maps free-text field names to a set of standardized schema terms.
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("tsilva/clinical-field-mapper-classification")
model = AutoModelForSequenceClassification.from_pretrained("tsilva/clinical-field-mapper-classification")

def predict(input_text):
    # Tokenize a single column name and run it through the classifier
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model(**inputs)
    # Index of the highest-scoring class
    pred = outputs.logits.argmax(-1).item()
    # id2label keys are integers once the config is loaded; fall back to the raw index
    label = model.config.id2label.get(pred, pred)
    print(f"Predicted label: {label}")

predict('cardi@')
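To normalize a whole table rather than a single name, the per-name prediction can be wrapped in a small helper that builds a rename map. The sketch below is illustrative only: `build_rename_map`, `predict_label`, and `fake_predict` are hypothetical names, not part of this model or of transformers. `predict_label` stands for any function that takes a column name and returns the predicted label as a string.

```python
from typing import Callable, Dict, Iterable


def build_rename_map(
    columns: Iterable[str],
    predict_label: Callable[[str], str],
) -> Dict[str, str]:
    """Map each distinct raw column name to its predicted standardized name.

    `predict_label` is a hypothetical callable, for example a wrapper around
    the model call that returns the label instead of printing it.
    """
    rename_map: Dict[str, str] = {}
    for column in columns:
        if column not in rename_map:  # classify each distinct name only once
            rename_map[column] = predict_label(column)
    return rename_map


# Stand-in predictor for demonstration only; real predictions come from the model.
def fake_predict(name: str) -> str:
    return name.strip().lower().replace(" ", "_")


mapping = build_rename_map(["Patient Age", "Patient Age", " Sex "], fake_predict)
print(mapping)
```

Keeping the predictor pluggable makes it straightforward to swap in the real model call, and to test the surrounding pipeline without downloading weights.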
Evaluation Results
- train accuracy: 94.71%
- validation accuracy: 91.44%
- test accuracy: 91.56%
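Accuracy here is the fraction of field names whose predicted label exactly matches the reference label. A minimal sketch of that computation follows; the `accuracy` helper and the example labels are made up for illustration and are not taken from the dataset.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up labels for illustration only.
preds = ["age", "sex", "diagnosis", "age"]
refs = ["age", "sex", "diagnosis", "sex"]
print(f"{accuracy(preds, refs):.2%}")
```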
Training Details
- Seed: 42
- Epochs scheduled: 50
- Epochs completed: 34
- Early stopping triggered: Yes
- Final training loss: 1.0888
- Final evaluation loss: 0.9916
- Optimizer: adamw_bnb_8bit
- Learning rate: 0.0005
- Batch size: 1024
- Precision: fp16
- DeepSpeed enabled: True
- Gradient accumulation steps: 1
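For readers who want to set up a comparable run, the settings above map roughly onto keyword arguments of `transformers.TrainingArguments`. The dictionary below is a sketch, not the author's training script: the output directory is a placeholder, treating the batch size as per-device is an assumption, and the DeepSpeed config file and early-stopping patience are not given in this card, so they are left out.

```python
# Sketch of the listed hyperparameters as TrainingArguments keyword arguments.
# Assumptions are marked; this is not the original training script.
training_kwargs = {
    "output_dir": "./clinical-field-mapper",  # placeholder path (assumption)
    "seed": 42,
    "num_train_epochs": 50,                   # scheduled; early stopping ended the run at epoch 34
    "learning_rate": 5e-4,
    "per_device_train_batch_size": 1024,      # assumes the listed batch size is per device
    "gradient_accumulation_steps": 1,
    "fp16": True,
    "optim": "adamw_bnb_8bit",                # requires the bitsandbytes package
}

# With transformers installed, these would typically be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
# DeepSpeed would additionally need `deepspeed=<path to a config file>`, and early
# stopping is usually added through `transformers.EarlyStoppingCallback`.
for key, value in training_kwargs.items():
    print(f"{key}: {value}")
```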
License
Apache 2.0
Limitations and Bias
- The model was trained only on the tsilva/clinical-field-mappings dataset.
- Performance may degrade on out-of-distribution column names, such as naming conventions not represented in that dataset.
- Validate model outputs before relying on them in production environments.
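One common way to validate outputs is to accept a prediction only when the top softmax probability clears a threshold, and to route everything else to manual review. The sketch below works on a plain list of logits so it runs without the model. `gate_prediction`, the example labels, and the 0.8 threshold are illustrative assumptions, not values from this model card.

```python
import math
from typing import Dict, List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_prediction(
    logits: List[float],
    id2label: Dict[int, str],
    threshold: float = 0.8,  # illustrative value; tune on held-out data
) -> Tuple[Optional[str], float]:
    """Return (label, confidence); label is None when confidence < threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = id2label[best] if confidence >= threshold else None
    return label, confidence


# Hypothetical label set and logits for illustration only.
labels = {0: "age", 1: "sex", 2: "diagnosis"}
confident = gate_prediction([4.0, 0.5, 0.1], labels)
ambiguous = gate_prediction([1.0, 0.9, 0.8], labels)
print(confident)
print(ambiguous)
```

With the real model, a list of logits for a single input could be obtained from `outputs.logits[0].tolist()`. A suitable threshold depends on the cost of a wrong mapping and is best chosen on a held-out set.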