sourajeetsahoo119 committed
Commit fde5cb5 · verified · 1 Parent(s): 2dc31b7

Update README.md
Files changed (1): README.md (added, +212 -0)
---
license: llama2
language:
- en
metrics:
- accuracy
- perplexity
datasets:
- epfl-llm/guidelines
base_model: meta-llama/Llama-2-7b
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for Meditron-7B-finetuned
Meditron is a suite of open-source medical Large Language Models (LLMs).
Meditron-7B is a 7-billion-parameter model adapted to the medical domain from Llama-2-7B through continued pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, a [new dataset](https://huggingface.co/datasets/epfl-llm/guidelines) of internationally recognized medical guidelines, and general-domain data from [RedPajama-v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
Meditron-7B-finetuned is further finetuned on task-specific training data and outperforms Llama-2-7B and PMC-Llama on multiple medical reasoning tasks.

<details open>
<summary><strong>Advisory Notice</strong></summary>

<blockquote style="padding: 10px; margin: 0 0 10px; border-left: 5px solid #ddd;">
While Meditron is designed to encode medical knowledge from sources of high-quality evidence, it is not yet adapted to deliver this knowledge appropriately, safely, or within professional actionable constraints.
We recommend against deploying Meditron in medical applications without extensive use-case alignment, as well as additional testing, specifically including randomized controlled trials in real-world practice settings.
</blockquote>
</details>

## Model Details

- **Finetuned by:** [Vignesh](https://huggingface.co/Sci-fi-vy)
- **Developed by:** [EPFL LLM Team](https://huggingface.co/epfl-llm)
- **Model type:** Causal decoder-only transformer language model
- **Language(s):** English (mainly)
- **Model License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
- **Code License:** [APACHE 2.0 LICENSE](LICENSE)
- **Continue-pretrained from model:** [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b)
- **Context length:** 2K tokens
- **Input:** Text-only data
- **Output:** Model generates text only
- **Status:** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we enhance the model's performance.
- **Knowledge Cutoff:** August 2023

### Model Sources

- **Repository:** [epflLLM/meditron](https://github.com/epfLLM/meditron)
- **Trainer:** [epflLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM)
- **Reference Paper:** *[MediTron-70B: Scaling Medical Pretraining for Large Language Models](https://arxiv.org/abs/2311.16079)*

## Uses

Meditron-7B-finetuned is being made available for further testing and assessment as an AI assistant to enhance clinical decision-making and to broaden access to LLMs for healthcare use. Potential use cases may include but are not limited to:
- Medical exam question answering
- Supporting differential diagnosis
- Disease information (symptoms, cause, treatment) queries
- General health information queries
- Personalized results

### Direct Use

It is possible to use this model to generate text, which is useful for experimentation and for understanding its capabilities.
It should not be used directly for production or for work that may impact people.
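
For quick experimentation, text can be generated with the Hugging Face Transformers library as in the minimal sketch below. The repository id is a placeholder, and the generation settings are illustrative rather than the settings used in our evaluations.

```python
# Minimal text-generation sketch with Hugging Face Transformers.
# The repository id is a placeholder -- replace it with this model's actual repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/Meditron-7B-finetuned"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bf16
    device_map="auto",
)

prompt = "What are the common symptoms of iron-deficiency anemia?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```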

### Downstream Use
Meditron-70B and Meditron-7B are both foundation models without finetuning or instruction-tuning. They can be finetuned, instruction-tuned, or RLHF-tuned for specific downstream tasks and applications.
There are two ways we have used this model for downstream question-answering tasks:
1. We apply in-context learning with k demonstrations (3 or 5 in our paper) added to the prompt, as sketched below.
2. We finetuned the models for downstream question-answering tasks using specific training sets.
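
As an illustration of the first option, a k-shot prompt for a multiple-choice question can be assembled by concatenating demonstrations before the test question. The helper below is a simplified sketch, not the exact prompt template used in our paper.

```python
# Illustrative k-shot prompt construction for multiple-choice medical QA.
# The demonstration format is a simplified sketch, not the exact template from the paper.
def build_few_shot_prompt(demonstrations, question, options):
    """demonstrations: list of (question, options_dict, answer_letter) tuples."""
    parts = []
    for demo_q, demo_opts, demo_ans in demonstrations:
        choices = "\n".join(f"({letter}) {text}" for letter, text in demo_opts.items())
        parts.append(f"Question: {demo_q}\n{choices}\nAnswer: ({demo_ans})")
    choices = "\n".join(f"({letter}) {text}" for letter, text in options.items())
    parts.append(f"Question: {question}\n{choices}\nAnswer: (")
    return "\n\n".join(parts)
```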

We encourage and look forward to the adaptation of the base model for more diverse applications.

If you want a more interactive way to prompt the model, we recommend using a high-throughput and memory-efficient inference engine with a UI that supports chat and text generation.

You can check out our deployment [guide](https://github.com/epfLLM/meditron/blob/main/deployment/README.md), where we used [FastChat](https://github.com/lm-sys/FastChat) with [vLLM](https://github.com/vllm-project/vllm). We collected generations for our qualitative analysis through an interactive UI platform, [BetterChatGPT](https://github.com/ztjhz/BetterChatGPT). Here is the prompt format we used as an example:

<img width=70% src="prompt_example.png" alt="qualitative-analysis-prompt" title="Qualitative Analysis Prompt">
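
For non-interactive batch generation, vLLM can also be used directly from Python. The snippet below is a minimal sketch; the repository id is a placeholder and the sampling settings are illustrative.

```python
# Minimal offline-inference sketch with vLLM; repo id and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="<org>/Meditron-7B-finetuned")  # placeholder repo id
sampling = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["What is the first-line treatment for uncomplicated hypertension?"]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```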

### Out-of-Scope Use

We do not recommend using this model for natural language generation in a production environment, finetuned or otherwise.

## Truthfulness, Helpfulness, Risk, and Bias

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

We did an initial assessment of Meditron models' **Truthfulness** against baseline models and consumer-level medical models.
We use TruthfulQA (multiple choice) as the main evaluation benchmark.
We only focus on the categories that are relevant to the medical domain, including Health, Nutrition, Psychology, and Science.
For 7B models, we perform one-shot evaluations for consistent answer generation.
For 70B models, the evaluations are under the zero-shot setting.
Below, we report the detailed truthfulness performance of each category.

| Category | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
| --- | --- | --- | --- | --- | --- | --- |
| Health | 81.8 | 69.1 | 83.6 | 27.3 | 16.4 | 3.6 |
| Nutrition | 77.9 | 68.8 | 62.5 | 31.1 | 12.5 | 6.3 |
| Psychology | 47.4 | 36.8 | 52.6 | 21.1 | 10.5 | 0.0 |
| Science | 77.8 | 44.4 | 33.3 | 33.3 | 11.1 | 0.0 |
| Avg | 71.2 | 54.8 | 58.0 | 28.3 | 12.6 | 2.5 |

For a more detailed performance analysis, please see our paper.

Significant research is still required to fully explore potential bias, fairness, and safety issues with this language model.
Please recognize that our evaluation of Meditron-7B's helpfulness, risk, and bias is highly limited.
Thus, as we noted in the safety notice, we strongly advise against any deployment in medical applications without a further alignment process and rigorous evaluation.

### Recommendations

**IMPORTANT!**
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
While this model is capable of generating natural language text, we have only begun to explore this capability and its limitations.
Understanding these limitations is especially important in a domain like medicine.
Therefore, we strongly recommend against using this model in production for natural language generation or for professional purposes related to health and medicine.

## Training Details

### Training Data
Meditron’s domain-adaptive pre-training corpus GAP-Replay combines 48.1B tokens from four corpora:
- [**Clinical Guidelines**](https://huggingface.co/datasets/epfl-llm/guidelines): a new dataset of 46K internationally recognized clinical practice guidelines from various healthcare-related sources, including hospitals and international organizations.
- **Medical Paper Abstracts**: 16.1M abstracts extracted from closed-access PubMed and PubMed Central papers.
- **Medical Papers**: full-text articles extracted from 5M publicly available PubMed and PubMed Central papers.
- **Replay Data**: 400M tokens of general-domain pretraining data sampled from [RedPajama-v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).

<img width=75% src="gap-replay.png" alt="GAP-Replay data mixture">
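
As a rough illustration of how such a mixture can be assembled, the Hugging Face Datasets library supports interleaving corpora with per-source sampling probabilities. The toy datasets and weights below are placeholders, not the actual GAP-Replay components or proportions.

```python
# Illustrative corpus-mixing sketch with Hugging Face Datasets.
# The toy datasets and sampling probabilities are placeholders,
# not the actual GAP-Replay components or proportions.
from datasets import Dataset, interleave_datasets

guidelines = Dataset.from_dict({"text": ["guideline document ..."] * 3})
abstracts = Dataset.from_dict({"text": ["pubmed abstract ..."] * 3})
papers = Dataset.from_dict({"text": ["full-text paper ..."] * 3})
replay = Dataset.from_dict({"text": ["general-domain text ..."] * 3})

# Sample from the four corpora according to (placeholder) mixing weights.
mixed = interleave_datasets(
    [guidelines, abstracts, papers, replay],
    probabilities=[0.35, 0.35, 0.29, 0.01],
    seed=0,
    stopping_strategy="all_exhausted",
)
print(mixed[0])
```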

#### Data Preprocessing

Please see the detailed preprocessing procedure in our paper.

### Training Procedure

We used the [Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) distributed training library, a derivative of Nvidia's Megatron-LM project, to optimize training efficiency.
The hardware consists of 1 node of 8x NVIDIA A100 (80GB) SXM GPUs connected by NVLink and NVSwitch with a single Nvidia ConnectX-6 DX network card, equipped with 2 x AMD EPYC 7543 32-core processors and 512 GB of RAM.

Our three-way parallelism scheme uses:
- Data Parallelism (DP -- different GPUs process different subsets of the batches) of 2,
- Pipeline Parallelism (PP -- different GPUs process different layers) of 4,
- Tensor Parallelism (TP -- different GPUs process different subtensors for matrix multiplication) of 1.

#### Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| bf16 | true |
| lr | 3e-4 |
| eps | 1e-5 |
| betas | \[0.9, 0.95\] |
| clip_grad | 1 |
| weight decay | 0.1 |
| DP size | 16 |
| TP size | 4 |
| PP size | 1 |
| seq length | 2048 |
| lr scheduler | cosine |
| min lr | 1e-6 |
| warmup iteration | 2000 |
| micro batch size | 10 |
| global batch size | 1600 |
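
Outside of Megatron-LLM, an equivalent optimizer and learning-rate schedule can be sketched in plain PyTorch as follows. This is only an illustration of the values in the table above (the total iteration count is a placeholder), not the training code we actually ran.

```python
# Sketch of an optimizer/LR schedule matching the hyperparameter table above
# (illustrative PyTorch code, not the Megatron-LLM configuration used for training).
import math
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the real model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

warmup_iters, max_iters = 2000, 30_000  # max_iters is a placeholder
max_lr, min_lr = 3e-4, 1e-6

def lr_lambda(step):
    # Linear warmup followed by cosine decay down to min_lr.
    if step < warmup_iters:
        return step / max(1, warmup_iters)
    progress = min(1.0, (step - warmup_iters) / max(1, max_iters - warmup_iters))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (max_lr - min_lr) * cosine) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Gradient clipping (clip_grad = 1) would be applied at every step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```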

#### Sizes
The model was trained in September 2023.

The model architecture is exactly that of Llama 2, meaning:

| Parameter | Value |
| --- | --- |
| Model size | 7B |
| Hidden dimension | 4096 |
| Num. attention heads | 32 |
| Num. layers | 32 |
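
These dimensions match the stock Llama-2-7B configuration. In Transformers terms they can be expressed roughly as below, with every value not listed in the table left at its LlamaConfig default.

```python
# Rough sketch of the architecture in transformers terms; values not listed
# in the table above are left at their LlamaConfig defaults.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    max_position_embeddings=2048,  # 2K context length, as stated above
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```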

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data & Metrics

#### Testing Data
- [MedQA (USMLE)](https://huggingface.co/datasets/bigbio/med_qa)
- [MedMCQA](https://huggingface.co/datasets/medmcqa)
- [PubMedQA](https://huggingface.co/datasets/bigbio/pubmed_qa)
- [MMLU-Medical](https://huggingface.co/datasets/lukaemon/mmlu)
- [MedQA-4-Option](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)

#### Metrics
- Accuracy: suited to the evaluation of multiple-choice question-answering tasks.

### Results
We finetune meditron-7b, llama-2-7b, and pmc-llama-7b individually on the training data of each benchmark (PubMedQA, MedMCQA, MedQA).
We report the finetuned models' performance with top-token selection as the inference mode.
For MMLU-Medical, models finetuned on MedMCQA are used for inference.
For MedQA-4-Option, models finetuned on MedQA are used for inference.
For a more detailed performance analysis, please see our paper.
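
Top-token selection means restricting the model's next-token distribution at the answer position to the option letters and picking the most likely one. The snippet below is a simplified sketch; the prompt template and repository id are placeholders.

```python
# Simplified top-token-selection sketch for multiple-choice QA.
# The prompt template and repository id are placeholders, not the exact setup from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/Meditron-7B-finetuned"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = (
    "Question: Which vitamin deficiency causes scurvy?\n"
    "(A) Vitamin A\n(B) Vitamin B12\n(C) Vitamin C\n(D) Vitamin D\n"
    "Answer: ("
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the answer position

option_ids = [tokenizer.encode(letter, add_special_tokens=False)[-1] for letter in "ABCD"]
prediction = "ABCD"[int(torch.argmax(logits[option_ids]))]
print("Predicted option:", prediction)
```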

| Dataset | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* |
| --- | --- | --- | --- | --- | --- |
| MMLU-Medical | 54.2 | 53.7 | 56.4 | 63.3 | 60.0 |
| PubMedQA | 74.4 | 61.8 | 59.2 | 46.0 | 17.8 |
| MedMCQA | 59.2 | 54.4 | 57.6 | 43.0 | 40.2 |
| MedQA | 47.9 | 44.0 | 42.4 | 42.8 | 32.4 |
| MedQA-4-Option | 52.0 | 49.6 | 49.2 | 48.5 | 41.1 |
| Avg | 57.5 | 52.7 | 53.0 | 48.7 | 38.3 |

**Note**: models marked with * are already instruction-tuned, so we exclude them from further finetuning on any training data.