Update README.md
README.md CHANGED
@@ -17,6 +17,15 @@ This model is designed for scalable training, long-context understanding, and ef

---

+
+**Key Modifications from the Original Paper:**
+
+1) Replaced the default positional encoding with Rotary Positional Embeddings (RoPE).
+2) Altered the attention mechanism to use Grouped Query Attention.
+3) Customized the DataLoader to support sharded datasets and data parallelism.
+4) Implemented Mixed Precision Training along with Distributed Data Parallel (DDP) support.
+5) Tweaked several training and model hyperparameters for better adaptability.
+

## 🔬 Key Features

- ✅ **Grouped Query Attention (GQA)** — Groups query heads to share key/value heads, saving memory and speeding up attention
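
The grouping idea in the bullet above is easy to see in code. Below is a minimal, illustrative sketch of GQA in PyTorch, assuming a decoder-style causal model: each key/value head is repeated so that a group of query heads shares it, and ordinary scaled dot-product attention runs on the aligned tensors. The function name, shapes, and causal mask are hypothetical assumptions, not this repository's code.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq_len, head_dim)
    # k, v: (batch, n_kv_heads, seq_len, head_dim), with n_q_heads % n_kv_heads == 0
    group_size = q.shape[1] // k.shape[1]
    # Repeat each K/V head so it is shared by `group_size` query heads,
    # then run standard scaled dot-product attention on the expanded tensors.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads sharing 2 key/value heads (group size 4).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # -> (1, 8, 16, 64)
```

In this example only the 2 key/value heads need to be cached, so the KV cache is roughly 4x smaller than with full multi-head attention, which is where the memory and speed savings come from.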
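
Modification 1 in the list above swaps the default positional encoding for RoPE. A minimal sketch of the idea, under the usual formulation, follows: each (even, odd) pair of query/key features is rotated by an angle that grows with the token's position, so relative offsets appear as phase differences in the attention scores. The helper name, base, and shapes are illustrative assumptions, not this repository's API.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, n_heads, seq_len, head_dim); head_dim must be even.
    seq_len, dim = x.shape[-2], x.shape[-1]
    # One rotation frequency per feature pair, one angle per (position, pair).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]       # split features into pairs
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin  # rotate each pair by its angle
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Applied to queries and keys (not values) before the attention scores are computed.
q = torch.randn(2, 8, 16, 64)
q_rope = apply_rope(q)  # same shape: (2, 8, 16, 64)
```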