forestav committed on
Commit aa50a79 · verified · 1 Parent(s): 1b6edc6

Update README.md

Files changed (1):
  1. README.md +45 -14
README.md CHANGED
@@ -11,26 +11,57 @@ language:
  - en
  ---

- # Uploaded model
-
- - **Developed by:** forestav
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/llama-3.2-1b-instruct-bnb-4bit
-
- This model was a further development of a LoRA adapter model trained on unsloth/Llama-3.2-3B-Instruct with the FineTome-100k dataset.
-
- Here, we used a smaller model, with 1B parameters instead of 3B. We wanted to have faster training capabilities as well as make it easier to finetune to our specific application, which is easier when using a pretrained model with fewer parameters.
-
- We also changed the training arguments slightly. We set a smaller learning rate as fewer parameters makes the model more prone to overfitting. By using a smaller learning rate, we reduce the likelihood of catastrophic forgetting where the model loses the general knowledge it acquired during pretraining, becoming overly specialized for medical advice or forgetting basic language patterns.
- When having a large learning rate, the large weight updates can overwrite pretrained weights, causing the model to lose generalization.
-
- Our finetuning dataset ruslanmv/ai-medical-chatbot only has 257k rows, much smaller than the data used for pretraining. A smaller learning rate ensures that the model learns patterns specific to our data without deviating too far from its pretrained state.
-
- # Hyperparameters and explanations:
- **Warm up steps: 5.** The warmup steps tell us how many steps we should have a low learning rate prior to training with the normal learning rate. It is important when having the Adam optimizer as it relies on statistics of the gradients, which we do not have unless we have computed some gradients a priori.
-
- **Per device train batch size: 2.** We take 2 training samples for each GPU (we only had one GPU).
-
- This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
-
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+ # Uploaded model
+
+ - **Developed by:** forestav
+ - **License:** apache-2.0
+ - **Finetuned from model:** [unsloth/llama-3.2-1b-instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3.2-1b-instruct-bnb-4bit)
+
+ ## Model description
+ This model is a further refinement of a LoRA adapter originally trained on **unsloth/Llama-3.2-3B-Instruct** with the **FineTome-100k** dataset. It is built on a smaller base model (1B parameters instead of 3B), which speeds up training and makes the model easier to adapt to a specific application, in this case a medical chatbot.
+
+ ### Key adjustments:
+ 1. **Reduced Parameter Count:** The model was downsized to 1B parameters to improve training efficiency and ease customization.
+ 2. **Adjusted Learning Rate:** A smaller learning rate was used to prevent overfitting and mitigate catastrophic forgetting, so the model retains its general pretraining knowledge while learning the new task effectively.
+
+ The finetuning dataset, **ruslanmv/ai-medical-chatbot**, contains only 257k rows, far less data than was used for pretraining, which necessitated careful hyperparameter tuning to avoid over-specialization.
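To make this setup concrete, here is a minimal sketch of how a LoRA adapter is typically attached to the 4-bit 1B base model with Unsloth. It is not the training script behind this commit: the sequence length, LoRA rank, scaling factor, and target modules below are illustrative assumptions.

```python
# Sketch only: attaching a LoRA adapter to the 4-bit 1B base with Unsloth.
# Values marked "assumed" are illustrative, not taken from this model card.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-1b-instruct-bnb-4bit",  # base model listed above
    max_seq_length=2048,   # assumed context length
    load_in_4bit=True,     # matches the bnb-4bit base weights
)

# Only the low-rank adapter matrices are trained; the base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # assumed LoRA rank
    lora_alpha=16,         # assumed scaling factor
    lora_dropout=0.0,
    target_modules=[       # typical Llama attention/MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # reduces memory on a single GPU
    random_state=3407,
)
```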
 
+
+ ---
+
+ ## Hyperparameters and explanations
+
+ - **Learning rate:** `2e-5`
+   A smaller learning rate reduces the risk of overfitting and catastrophic forgetting, particularly when working with a model that has fewer parameters.
+
+ - **Warm-up steps:** `5`
+   The warm-up phase lets the optimizer gather gradient statistics at a low learning rate before training at the full learning rate, improving stability.
+
+ - **Per device train batch size:** `2`
+   Each GPU processes 2 training samples per step (training ran on a single GPU). This setup is suitable for resource-constrained environments.
+
+ - **Gradient accumulation steps:** `4`
+   Gradients are accumulated over 4 steps to simulate a larger batch size (effective batch size: 2 × 4 = 8) without exceeding memory limits.
+
+ - **Optimizer:** `AdamW with 8-bit Quantization`
+   - **AdamW:** Applies decoupled weight decay to reduce overfitting.
+   - **8-bit Quantization:** Reduces memory usage by storing optimizer states in 8 bits, enabling faster training on limited hardware.
+
+ - **Weight decay:** `0.01`
+   A standard weight decay value that works well across a wide range of training setups.
+
+ - **Learning rate scheduler type:** `Linear`
+   Gradually decreases the learning rate from its initial value to zero over the course of training.
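Taken together, these values map onto a standard Unsloth + TRL training configuration. The sketch below wires in the hyperparameters listed above; the dataset preparation (turning ruslanmv/ai-medical-chatbot into a single text column), the sequence length, and the epoch count are assumptions, and it uses the classic `SFTTrainer` signature from older TRL releases (newer versions move these arguments into `SFTConfig`).

```python
# Sketch of a TRL SFTTrainer run using the hyperparameters listed above.
# `model` and `tokenizer` come from the Unsloth setup sketched earlier;
# formatting the medical chat data into a "text" column is omitted here.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("ruslanmv/ai-medical-chatbot", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",          # assumes a pre-formatted text column
    max_seq_length=2048,                # assumed; must match the model setup
    args=TrainingArguments(
        per_device_train_batch_size=2,  # 2 samples per GPU per step
        gradient_accumulation_steps=4,  # effective batch size 2 x 4 = 8
        warmup_steps=5,                 # brief warm-up for optimizer statistics
        learning_rate=2e-5,             # small LR to limit catastrophic forgetting
        optim="adamw_8bit",             # AdamW with 8-bit optimizer states
        weight_decay=0.01,
        lr_scheduler_type="linear",     # linear decay toward zero
        num_train_epochs=1,             # assumed; not stated in the card
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```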
+
+ ---
+
+ ## Quantization details
+ The model is saved in **16-bit GGUF format**, which:
+ - Retains **100% of the trained model's accuracy** (no quantization loss).
+ - Trades off speed and memory for improved precision.
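For reference, this is roughly how Unsloth exports a finetuned model to 16-bit GGUF; the output directory name is illustrative.

```python
# Sketch: export the merged, finetuned weights to 16-bit GGUF with Unsloth.
# f16 keeps full 16-bit precision at the cost of file size and inference speed;
# "model_gguf" is an illustrative output directory, not the repo's actual path.
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="f16")
```

The resulting `.gguf` file can then be loaded by llama.cpp-compatible runtimes.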
+
+ ### Training optimization
+ Training was accelerated by **2x** using [Unsloth](https://github.com/unslothai/unsloth) in combination with Hugging Face's **TRL library**.
+
+ ---
+
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)