English | 简体中文
This model is based on Qwen2-0.5B, with the original tokenizer replaced by BilingualTokenizer-8K to reduce the parameter count. The total number of parameters drops from 0.5B to 365M.

To recover some of the lost performance and to make fine-tuning for downstream tasks easier, I froze the backbone parameters after replacing the tokenizer and trained only the embedding layer, as sketched below. Training ran for 40,000 steps on wikipedia-zh and cosmopedia-100k.
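A minimal sketch of the embedding-only setup (not the exact training script): load Qwen2-0.5B, swap in the smaller tokenizer, resize the embedding table, and freeze everything except `model.embed_tokens`. The tokenizer repo path below is a placeholder, not a confirmed ID.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
# Placeholder path -- substitute the actual BilingualTokenizer-8K location.
tokenizer = AutoTokenizer.from_pretrained("path/to/BilingualTokenizer-8K")

# Shrink the embedding matrix from Qwen2's original vocab down to the 8K bilingual vocab.
model.resize_token_embeddings(len(tokenizer))

# Freeze the backbone; leave only the input embeddings trainable.
for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable / 1e6:.1f}M")  # stays under 10M with an 8K vocab
```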
| Setting | Value |
|---|---|
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | model.embed_tokens |
| Training Steps | 40,000 |
| Training Dataset | wikipedia-zh, cosmopedia-100k |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |
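For reference, a `Trainer` configuration mirroring the table above might look like the sketch below. The output directory is a placeholder, and `train_dataset` is assumed to already hold wikipedia-zh and cosmopedia-100k tokenized and packed into 4096-token sequences; this is not the author's exact script.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="nanolm-365m-embed",   # placeholder
    max_steps=40_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    warmup_ratio=0.03,
    bf16=True,
)

# `model` has the frozen backbone from the earlier sketch; only embed_tokens updates.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```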