---
license: apache-2.0
datasets:
- abhinavv3/edu_fineweb10B_sharded_50shards
language:
- en
pipeline_tag: text-generation
tags:
- text-generation
- transformer
---
# 🧠 GPT with Modified Memorizing Transformer

An extended GPT-style model with 118M parameters that integrates the key ideas from **"Memorizing Transformers" (Wu et al., 2022)** with practical enhancements such as Grouped Query Attention, KNN-based memory lookup, RoPE, and XL-style memory recurrence.

This model is designed for scalable training, long-context understanding, and efficient memory usage.

---


**Key Modifications from the Original Paper:**

1) Replaced the default positional encoding with Rotary Positional Embeddings (RoPE).
2) Altered the attention mechanism to use Grouped Query Attention.
3) Customized the DataLoader to support sharded datasets and data parallelism.
4) Implemented Mixed Precision Training along with Distributed Data Parallel (DDP) support.
5) Tweaked several training and model hyperparameters for better adaptability.

## 🔬 Key Features

- ✅ **Grouped Query Attention (GQA)** — Groups query heads to share key/value heads, saving memory and speeding up attention
- ✅ **KNN Memory** — A learnable mechanism to retrieve past activations via nearest-neighbor search
- ✅ **XL-style Attention** — Adds recurrence to the attention stack, improving long-sequence learning
- ✅ **Rotary Positional Encoding (RoPE)** — Replaces standard sin-cos encoding for better extrapolation
- ✅ **Memory Lifespan & Clearing** — Custom mechanisms to manage token memory duration
- ✅ **Sharded Dataset Loader** — Efficient `.npy`-based streaming for large datasets
- ✅ **Mixed Precision + DDP Training** — Scalable multi-GPU support using `torchrun` and `torch.autocast`

---

## 📁 Project Structure

```bash
MEM_TRANSFORMER/
├── configs/
│   └── config.json                  # Model + training hyperparameters
│
├── data/
│   ├── edu_fineweb/                 # Token-sharded training data
│   │   ├── train_000001.npy
│   │   ├── train_000002.npy
│   │   └── test_000001.npy
│   ├── hellaswag/
│   │   └── hellaswag_val.jsonl
│   └── fineweb.py                   # Sharding logic with memory-aligned sequence control
│
├── model_core/
│   ├── __init__.py
│   ├── attention.py                 # Grouped Query Attention, KNN & XL attention logic, Rotary Positional Encoding implementation
│   ├── model.py                     # Transformer model with memory and RoPE support
│   ├── dataloader.py                # Memory-aware DataLoader
│   └── training.py                  # train_memgpt function
│
├── scripts/
│   ├── train.py                     # Training script (DDP-compatible)
│   ├── evaluate.py                  # Evaluation on benchmarks
│   └── generate.py                  # Text generation from trained model
│
├── evaluation/
│   ├── __init__.py
│   ├── hellaswag.py                 # HellaSwag data loader
│   └── val_hellaswag.py             # Evaluation logic with loss-based scoring
│
├── logs/
│   ├── log.txt                      # Training logs
│   └── model_*.pt                   # Checkpoints
│
├── .gitignore
├── README.md
└── requirements.txt
```

---

## ⚙️ Configuration

Edit `configs/config.json` to change model or training settings.

<details>
<summary>Example config</summary>

```json
{
  "model": {
    "block_size": 1024,
    "vocab_size": 50304,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "n_kv_head": 4,
    "max_knn_memories": 81920
  },
  "training": {
    "max_steps": 19073,
    "log_dir": "log",
    "total_batch_size": 2048,
    "B": 64,
    "T": 1024,
    "max_lr": 0.0006,
    "min_lr": 0.00006,
    "warmup_steps": 715,
    "weight_decay": 0.1,
    "learning_rate": 0.0006
  }
}
```
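
The model settings above imply a few derived quantities that are easy to sanity-check. The snippet below is a minimal sketch, not code from this repository: `derive_shapes` is a hypothetical helper, and it assumes that `max_knn_memories` is counted in tokens.

```python
import json

# Model section copied from the example config above.
CONFIG_TEXT = """
{
  "model": {
    "block_size": 1024, "vocab_size": 50304, "n_layer": 12,
    "n_head": 12, "n_embd": 768, "n_kv_head": 4,
    "max_knn_memories": 81920
  }
}
"""


def derive_shapes(model_cfg):
    """Hypothetical helper: validate head counts and derive per-head sizes."""
    n_head = model_cfg["n_head"]
    n_kv_head = model_cfg["n_kv_head"]
    n_embd = model_cfg["n_embd"]
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be divisible by n_kv_head")
    return {
        "head_dim": n_embd // n_head,
        "queries_per_kv_head": n_head // n_kv_head,
        # Assumes max_knn_memories is measured in tokens.
        "memory_in_blocks": model_cfg["max_knn_memories"] // model_cfg["block_size"],
    }


shapes = derive_shapes(json.loads(CONFIG_TEXT)["model"])
print(shapes)
# {'head_dim': 64, 'queries_per_kv_head': 3, 'memory_in_blocks': 80}
```

With these values each key/value head serves three query heads, and the memory can hold 80 context windows of 1024 tokens.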
</details>

## 🚀 Training

- ▶️ Single GPU: `python scripts/train.py`
- 🔁 Multi-GPU DDP: `torchrun --nproc_per_node=NUM_GPUS scripts/train.py`
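
`torchrun` starts one process per GPU and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each of them, while a plain `python` launch sets none of these. A minimal sketch of how a script could tell the two launch modes apart follows. `ddp_context` is a hypothetical helper, and `scripts/train.py` may handle this differently.

```python
import os


def ddp_context(env=None):
    """Hypothetical helper: read the process-group variables exported by torchrun."""
    env = os.environ if env is None else env
    # torchrun sets RANK and WORLD_SIZE; a plain `python` launch does not.
    is_ddp = "RANK" in env and "WORLD_SIZE" in env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "ddp": is_ddp,
        "rank": rank,                 # global index of this process
        "local_rank": local_rank,     # GPU index on this node
        "world_size": world_size,     # total number of processes
        "is_master": rank == 0,       # the process that would log and save checkpoints
    }


single = ddp_context({})
multi = ddp_context({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(single)
print(multi)
```

Passing a dictionary instead of reading `os.environ` keeps the helper easy to test without launching multiple processes.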


## 📊 Evaluation

Evaluate on the HellaSwag benchmark:
```bash
python scripts/evaluate.py
```

Requires:

- `data/hellaswag/hellaswag_val.jsonl`
- Model checkpoint(s) in `logs/`

Scoring is based on masked token loss across multiple-choice completions.
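
To make the scoring rule concrete, here is a minimal sketch of loss-based multiple-choice scoring. It assumes the candidate with the lowest average loss over its completion tokens is taken as the prediction, which is a common convention but is not confirmed by `evaluation/val_hellaswag.py`. The function names and the toy numbers are illustrative only.

```python
def masked_mean_loss(token_losses, mask):
    """Average the per-token losses at positions where mask == 1."""
    total = sum(loss * m for loss, m in zip(token_losses, mask))
    return total / sum(mask)


def pick_completion(losses_per_choice, masks_per_choice):
    """Return the index of the lowest-loss candidate and all candidate scores."""
    scores = [
        masked_mean_loss(losses, mask)
        for losses, mask in zip(losses_per_choice, masks_per_choice)
    ]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy per-token losses for one context followed by four candidate endings.
losses = [
    [2.0, 1.0, 3.0, 3.0],
    [2.0, 1.0, 1.0, 2.0],
    [2.0, 1.0, 4.0, 0.0],
    [2.0, 1.0, 2.5, 2.5],
]
# 0 marks shared context (or padding), 1 marks completion tokens.
masks = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],  # shorter ending: the last position is padding
    [0, 0, 1, 1],
]
best, scores = pick_completion(losses, masks)
print(best, scores)  # 1 [3.0, 1.5, 4.0, 2.5]
```

The mask keeps the shared context and any padding out of the average, so endings of different lengths are compared on their own tokens only.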

## 🧠 Attention Mechanism Deep Dive

<details>
<summary>Grouped Query Attention (GQA)</summary>

- `n_head` = total query heads
- `n_kv_head` = shared key/value heads
- Reduces compute overhead for large models by grouping query heads to reuse K/V
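
The grouping can be illustrated with a short NumPy sketch. It is not the code in `model_core/attention.py`: the function name and tensor layout are assumptions, and the head counts are taken from the example config (12 query heads, 4 key/value heads).

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """Causal attention in which several query heads share one K/V head.

    q: (n_head, T, head_dim); k and v: (n_kv_head, T, head_dim).
    """
    n_head, T, head_dim = q.shape
    n_kv_head = k.shape[0]
    assert n_head % n_kv_head == 0, "n_head must be a multiple of n_kv_head"
    group = n_head // n_kv_head
    # Repeat each K/V head so that query head h reads K/V head h // group.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (n_head, T, T)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)               # hide future positions
    scores -= scores.max(axis=-1, keepdims=True)             # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                        # (n_head, T, head_dim)


rng = np.random.default_rng(0)
q = rng.standard_normal((12, 8, 64))
k = rng.standard_normal((4, 8, 64))
v = rng.standard_normal((4, 8, 64))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (12, 8, 64)
```

Only 4 K/V heads are stored here, yet the output still has 12 heads, which is where the memory saving comes from.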

</details>

<details>
<summary>KNN Memory Retrieval</summary>

- Maintains a memory of past key vectors (max: 81920 tokens)
- Fast KNN lookup with grouped projections
- Integrated into the attention flow in `model_core/attention.py`
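
As a rough illustration of the retrieval step, the sketch below stores key/value pairs in a fixed-capacity buffer and attends over the top-k matches for each query. It uses exact brute-force search for clarity, while a real system would more likely use an approximate nearest-neighbor index. `KNNMemory` and `memory_attention` are hypothetical names, not the repository's API.

```python
import numpy as np


class KNNMemory:
    """Hypothetical fixed-capacity store of (key, value) pairs with exact top-k lookup."""

    def __init__(self, dim, max_memories):
        self.dim = dim
        self.max_memories = max_memories
        self.clear()

    def clear(self):
        self.keys = np.empty((0, self.dim))
        self.values = np.empty((0, self.dim))

    def add(self, keys, values):
        # Append the new pairs, then keep only the newest max_memories entries.
        self.keys = np.concatenate([self.keys, keys])[-self.max_memories:]
        self.values = np.concatenate([self.values, values])[-self.max_memories:]

    def search(self, queries, top_k):
        # Brute-force dot-product similarity against every stored key.
        sims = queries @ self.keys.T                   # (Q, M)
        top_k = min(top_k, len(self.keys))
        idx = np.argsort(-sims, axis=1)[:, :top_k]     # (Q, top_k)
        return self.keys[idx], self.values[idx]        # each (Q, top_k, dim)


def memory_attention(queries, memory, top_k):
    """Softmax attention restricted to the top_k retrieved memories per query."""
    mem_k, mem_v = memory.search(queries, top_k)
    scores = np.einsum("qd,qkd->qk", queries, mem_k) / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("qk,qkd->qd", weights, mem_v)


memory = KNNMemory(dim=4, max_memories=6)
memory.add(np.eye(4), np.arange(16.0).reshape(4, 4))
memory.add(np.eye(4), np.arange(16.0).reshape(4, 4))   # capacity caps the store at 6
query = np.array([[0.0, 0.0, 10.0, 0.0]])
retrieved_keys, _ = memory.search(query, top_k=1)
out = memory_attention(query, memory, top_k=2)
print(len(memory.keys), retrieved_keys[0, 0], out.shape)
```

In the Memorizing Transformers paper the memory result is combined with local attention through a learned gate, a step this sketch leaves out.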

</details>

<details>
<summary>XL-style Recurrence</summary>

- Recurrence between attention blocks
- Memory cache updated at each step
- Custom clearing logic helps avoid stale activations
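
The cache behaviour described above can be sketched as follows. This is an assumption-laden illustration in the style of Transformer-XL, not the repository's implementation: `XLCache` is a hypothetical class, and clearing at document boundaries is only one plausible clearing policy.

```python
import numpy as np


class XLCache:
    """Hypothetical per-layer cache of past keys/values, Transformer-XL style."""

    def __init__(self, max_len):
        self.max_len = max_len
        self.clear()

    def clear(self):
        # Dropping the cache, e.g. at a document boundary, avoids attending
        # to activations that belong to unrelated text.
        self.k = None
        self.v = None

    def extend(self, k, v):
        """Prepend cached K/V to the current segment, then refresh the cache."""
        if self.k is not None:
            k = np.concatenate([self.k, k], axis=0)
            v = np.concatenate([self.v, v], axis=0)
        # Keep only the newest max_len positions. In a training framework the
        # cached tensors would also be detached so no gradient flows into them.
        self.k = k[-self.max_len:].copy()
        self.v = v[-self.max_len:].copy()
        return k, v


cache = XLCache(max_len=4)
lengths = []
for step in range(3):
    seg_k = np.full((4, 2), float(step))   # toy keys for a 4-token segment
    seg_v = np.full((4, 2), float(step))
    full_k, full_v = cache.extend(seg_k, seg_v)
    lengths.append(len(full_k))
cache.clear()
after_clear, _ = cache.extend(np.zeros((4, 2)), np.zeros((4, 2)))
print(lengths, len(after_clear))  # [4, 8, 8] 4
```

From the second segment onward, each segment attends over its own 4 positions plus the 4 cached ones, and a cleared cache starts again from the current segment alone.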

</details>

<details>
<summary>Rotary Positional Encoding (RoPE)</summary>

- Replaces standard sinusoidal encoding
- Better generalization on long contexts
- Found in `model_core/attention.py`
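
For readers new to RoPE, a minimal NumPy sketch is given below. It uses the "rotate-half" layout and a base of 10000, both of which are assumptions here, since the layout used in `model_core/attention.py` is not documented. The final check shows the property that matters: attention scores depend only on the relative offset between positions.

```python
import numpy as np


def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (T, head_dim), head_dim even."""
    T, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)          # one frequency per pair
    angles = np.arange(T)[:, None] * inv_freq[None, :]    # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1[i], x2[i]) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


rng = np.random.default_rng(0)
q_vec, k_vec = rng.standard_normal(8), rng.standard_normal(8)
# Place the same query and key vector at six consecutive positions.
q = rope(np.tile(q_vec, (6, 1)))
k = rope(np.tile(k_vec, (6, 1)))
scores = q @ k.T
# Positions (0, 2) and (3, 5) share the same offset, so their scores match.
print(np.allclose(scores[0, 2], scores[3, 5]))  # True
```

Because the rotation is applied to queries and keys rather than added to the token embeddings, no separate positional embedding table is needed.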

</details>

## 🧩 Data Handling

- Training data is stored as sharded `.npy` files
- Stride logic is matched to the memory length
- DDP-compatible DataLoader
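
A common way to make a sharded token loader DDP-compatible is to give each rank its own offset and advance all ranks by `B * T * world_size` tokens per step. The sketch below illustrates that pattern with in-memory arrays standing in for shards loaded via `np.load`. `ShardedLoader` is a hypothetical class, and the actual `model_core/dataloader.py` may differ, in particular in how it aligns strides with the memory length.

```python
import numpy as np


class ShardedLoader:
    """Hypothetical loader: next-token batches from token shards, strided across ranks."""

    def __init__(self, shards, B, T, rank=0, world_size=1):
        self.shards = shards              # list of 1-D token arrays, one per shard
        self.B, self.T = B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self.pos = B * T * rank           # each rank starts at its own offset

    def next_batch(self):
        B, T = self.B, self.T
        tokens = self.shards[self.shard_idx]
        buf = tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].reshape(B, T)        # inputs
        y = buf[1:].reshape(B, T)         # targets, shifted by one token
        # Skip over the chunks consumed by the other ranks.
        self.pos += B * T * self.world_size
        # Move to the next shard when a full stride no longer fits in this one.
        if self.pos + B * T * self.world_size + 1 > len(tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.pos = B * T * self.rank
        return x, y


# Two toy shards of 41 tokens each stand in for the .npy files.
shards = [np.arange(0, 41), np.arange(100, 141)]
rank0 = ShardedLoader(shards, B=2, T=4, rank=0, world_size=2)
rank1 = ShardedLoader(shards, B=2, T=4, rank=1, world_size=2)
x0, y0 = rank0.next_batch()
x1, y1 = rank1.next_batch()
print(x0.tolist(), x1.tolist())
# [[0, 1, 2, 3], [4, 5, 6, 7]] [[8, 9, 10, 11], [12, 13, 14, 15]]
```

The two ranks read disjoint chunks of the same shard, so no token is processed twice within a step.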

## 📦 Install Dependencies

```bash
pip install -r requirements.txt
```

Ensure that your PyTorch build matches the CUDA version available on your GPU machine.

## 🔗 Reference

Wu et al., "Memorizing Transformers", ICLR 2022.
[Paper link](https://arxiv.org/abs/2203.08913)

## 💡 Future Work

- Add LoRA support
- Integrate with the Hugging Face `transformers` API
- Add benchmarking on other datasets (e.g. LAMBADA, PIQA)

Built with ❤️ by abhinavv3