# Medical Entity Extraction with BERT

## 📌 Overview
This repository hosts a quantized version of the `bert-base-cased` model fine-tuned for medical entity extraction on the `tner/bc5cdr` dataset. The model is designed to recognize entities related to **Disease**, **Symptom**, and **Drug**. It has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

## 🏗 Model Details
- **Model Architecture**: BERT Base Cased
- **Task**: Medical Entity Extraction
- **Dataset**: Hugging Face's `tner/bc5cdr`
- **Quantization**: Float16
- **Fine-tuning Framework**: Hugging Face Transformers

---
## 🚀 Usage

### Installation
```bash
pip install transformers torch
```

### Loading the Model
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/bert-medical-entity-extraction"
model = BertForTokenClassification.from_pretrained(model_name).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_name)
```
32
+
33
+ ### Named Entity Recognition Inference
34
+ ```python
35
+ from transformers import pipeline
36
+
37
+ ner_pipeline = pipeline("ner", model=model_name, tokenizer=tokenizer)
38
+ test_sentence = "An overdose of Ibuprofen can lead to severe gastric issues."
39
+ ner_results = ner_pipeline(test_sentence)
40
+ label_map = {
41
+ "LABEL_0": "O", # Outside (not an entity)
42
+ "LABEL_1": "Drug",
43
+ "LABEL_2": "Disease",
44
+ "LABEL_3": "Symptom",
45
+ "LABEL_4": "Treatment"
46
+ }
47
+
48
+ def merge_tokens(ner_results):
49
+ merged_entities = []
50
+ current_word = ""
51
+ current_label = ""
52
+ current_score = 0
53
+ count = 0
54
+
55
+ for entity in ner_results:
56
+ word = entity["word"]
57
+ label = entity["entity"] # Model's output (e.g., LABEL_1, LABEL_2)
58
+ score = entity["score"]
59
+
60
+ # Merge subwords
61
+ if word.startswith("##"):
62
+ current_word += word[2:] # Remove '##' and append
63
+ current_score += score
64
+ count += 1
65
+ else:
66
+ if current_word: # Store the previous merged word
67
+ mapped_label = label_map.get(current_label, "Unknown")
68
+ merged_entities.append((current_word, mapped_label, current_score / count))
69
+ current_word = word
70
+ current_label = label
71
+ current_score = score
72
+ count = 1
73
+
74
+ # Add the last word
75
+ if current_word:
76
+ mapped_label = label_map.get(current_label, "Unknown")
77
+ merged_entities.append((current_word, mapped_label, current_score / count))
78
+
79
+ return merged_entities
80
+
81
+ print("\n🩺 Medical NER Predictions:")
82
+ for word, label, score in merge_tokens(ner_results):
83
+ if label != "O": # Skip non-entities
84
+ print(f"πŸ”Ή Entity: {word} | Category: {label} | Score: {score:.4f}")
85
+ ```
86
+ ### **πŸ”Ή Labeling Scheme (BIO Format)**
87
+
88
+ - **B-XYZ (Beginning)**: Indicates the beginning of an entity of type XYZ (e.g., B-PER for the beginning of a person’s name).
89
+ - **I-XYZ (Inside)**: Represents subsequent tokens inside an entity (e.g., I-PER for the second part of a person’s name).
90
+ - **O (Outside)**: Denotes tokens that are not part of any named entity.
91
+
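As an illustrative example, the test sentence from the inference snippet above could be BIO-tagged as follows (hand-labeled for illustration, not actual model output):

```python
# Illustrative, hand-labeled BIO tags (not actual model output).
tokens = ["An", "overdose", "of", "Ibuprofen", "can", "lead", "to", "severe", "gastric", "issues"]
tags   = ["O",  "O",        "O",  "B-Drug",    "O",   "O",    "O",  "O",      "B-Symptom", "I-Symptom"]

for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```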
92
+ ---
93
+ ## πŸ“Š Evaluation Results for Quantized Model
94
+
95
+ ### **πŸ”Ή Overall Performance**
96
+
97
+ - **Accuracy**: **93.27%** βœ…
98
+ - **Precision**: **92.31%**
99
+ - **Recall**: **93.27%**
100
+ - **F1 Score**: **92.31%**
101
+
102
+ ---
103
+
104
+ ### **πŸ”Ή Performance by Entity Type**
105
+
106
+ | Entity Type | Precision | Recall | F1 Score | Number of Entities |
107
+ |------------|-----------|--------|----------|--------------------|
108
+ | **Disease** | **91.46%** | **92.07%** | **91.76%** | 3,000 |
109
+ | **Drug** | **71.25%** | **72.83%** | **72.03%** | 1,266 |
110
+ | **Symptom** | **89.83%** | **93.02%** | **91.40%** | 3,524 |
111
+ | **Treatment** | **88.83%** | **92.02%** | **90.40%** | 3,124 |
112
+
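For reference, each per-entity F1 score is the harmonic mean of that entity's precision and recall. A minimal sketch of the computation, checked against the Disease row:

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Disease row: precision 91.46%, recall 92.07%
print(round(f1_score(0.9146, 0.9207), 4))  # → 0.9176
```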

---
#### ⏳ **Inference Speed Metrics**
- **Total Evaluation Time**: 15.89 sec
- **Samples Processed per Second**: 217.26
- **Steps per Second**: 27.18
- **Epochs Completed**: 3

---
## Fine-Tuning Details
### Dataset
Hugging Face's `tner/bc5cdr` dataset was used, containing sentences annotated with NER tags.
## 📊 Training Details
- **Number of epochs**: 3
- **Batch size**: 8
- **Evaluation strategy**: epoch
- **Learning Rate**: 2e-5
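The hyperparameters above would correspond roughly to the following `TrainingArguments` configuration (a sketch; `output_dir` is a placeholder, and depending on your `transformers` version the argument may be named `eval_strategy` instead of `evaluation_strategy`):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; "bert-medical-ner" is a placeholder path.
training_args = TrainingArguments(
    output_dir="bert-medical-ner",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
)
```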

### ⚡ Quantization
Post-training quantization was applied using PyTorch's built-in half-precision (Float16) support to reduce the model size and improve inference efficiency.
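A minimal sketch of the Float16 approach, shown on a toy layer rather than the full checkpoint (the same `.half()` cast applies to the fine-tuned BERT model):

```python
import torch
import torch.nn as nn

# Toy stand-in for a classification head; .half() casts parameters
# to torch.float16 in place and returns the same module.
layer = nn.Linear(768, 5)
fp16_layer = layer.half()
assert fp16_layer.weight.dtype == torch.float16
```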

---
## 📂 Repository Structure
```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
├── README.md            # Model documentation
```

---
## ⚠️ Limitations
- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

---
## 🤝 Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.