Lunaris-0.6B-base
A high-performance 600M parameter language model trained on 20 billion tokens from the Ultra-FineWeb dataset, featuring modern architectural innovations and efficient training design.
Model Overview
Lunaris-0.6B-base is a transformer-based language model designed for efficient inference and high-quality text generation. Despite its compact size of ~600M parameters, it delivers strong performance through careful architectural choices and training on high-quality web text data.
Key Features
- 600M parameters with efficient Grouped-Query Attention (GQA)
- Trained on 20B tokens from Ultra-FineWeb high-quality corpus
- Modern architecture with RoPE, SwiGLU, and RMSNorm
- KV Caching for efficient inference
- Tied embeddings for parameter efficiency
- Mixed precision training with gradient accumulation
Model Specifications

| Attribute | Value |
|---|---|
| Parameters | ~600M |
| Architecture | Transformer (Llama-style) |
| Context Length | 4,096 tokens |
| Vocabulary Size | 65,536 (BPE) |
| Layers | 48 |
| Hidden Size | 1,024 |
| Attention Heads | 16 (4 KV heads with GQA) |
| FFN Hidden Multiplier | 4.0x |
| Training Tokens | 20B |
| Precision | Mixed (FP16/BF16) |
Architecture Details
Core Components
- Tokenizer: Custom BPE with 65,536 vocabulary trained on Ultra-FineWeb
- Embeddings: Tied input/output embeddings for parameter efficiency
- Attention: Grouped-Query Attention (GQA) with 16 query heads and 4 KV heads
- Positional Encoding: Rotary Position Embeddings (RoPE) with θ=10,000 (see the code sketch after this list)
- Normalization: RMSNorm (Root Mean Square Layer Normalization)
- Activation: SwiGLU in feed-forward networks
- Regularization: Dropout (disabled during training for stability)
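As a rough illustration of the RoPE item above, the sketch below precomputes the θ=10,000 frequency tables and rotates channel pairs of a query/key tensor. The function names and tensor layout are assumptions for illustration, not the model's actual API.

```python
import torch

def precompute_rope(head_dim: int, max_seq_len: int, theta: float = 10_000.0):
    """Precompute cos/sin tables for rotary position embeddings."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)          # (max_seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x, assumed shaped (batch, heads, seq_len, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[: x.shape[-2]], sin[: x.shape[-2]]  # trim tables to the sequence length
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```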
Architectural Innovations
- Grouped-Query Attention (GQA): Reduces memory usage and improves inference speed while maintaining performance (sketched after this list)
- Pre-normalization: Applies normalization before attention and FFN layers for training stability
- RoPE: Enables length extrapolation and better positional understanding
- SwiGLU: Provides better activation properties compared to standard ReLU/GELU
- KV Caching: Optimized for efficient autoregressive generation
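To make the GQA item concrete: with 16 query heads and 4 KV heads, each group of 4 query heads shares one key/value head. A minimal sketch, assuming a standard (batch, heads, seq, head_dim) layout, is shown below; it is not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads: int = 16, n_kv_heads: int = 4):
    """q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    group = n_heads // n_kv_heads                 # 4 query heads per KV head here
    k = k.repeat_interleave(group, dim=1)         # expand KV heads to (B, n_heads, T, hd)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Shapes for this model: head_dim = 1024 // 16 = 64
q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 4, 8, 64)
v = torch.randn(1, 4, 8, 64)
out = grouped_query_attention(q, k, v)            # (1, 16, 8, 64)
```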
Training Data
The model was trained on the Ultra-FineWeb 20B Tokenized Dataset, a high-quality subset of web text specifically prepared for language model training.
Dataset Characteristics
- Source: Ultra-FineWeb English corpus
- Size: 20 billion tokens across 20 shards
- Quality: Filtered and verified web text with quality controls
- Format: Pre-tokenized NumPy arrays for efficient loading
- Tokenizer: Custom BPE trained on Ultra-FineWeb samples
Data Processing Pipeline
- Streaming: Efficient streaming from Ultra-FineWeb dataset
- Tokenization: Parallel processing with custom BPE tokenizer
- Sharding: Split into 1B token shards for distributed training
- Optimization: Stored in uint32 format for memory efficiency (loading is sketched below)
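To illustrate the pre-tokenized uint32 format, the snippet below memory-maps one shard and samples next-token training windows from it. The shard filename is hypothetical; the dataset's actual file naming may differ.

```python
import numpy as np
import torch

# Hypothetical shard name; the real files in the dataset may be named differently.
shard = np.load("ultrafineweb_tokens_shard_00.npy", mmap_mode="r")  # uint32 token ids

def get_batch(data, batch_size: int, seq_len: int):
    """Sample random contiguous windows and build next-token (input, target) pairs."""
    starts = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[s : s + seq_len].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1 : s + 1 + seq_len].astype(np.int64)) for s in starts])
    return x, y  # each (batch_size, seq_len)
```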
Training Configuration
Optimization Setup

```
Learning Rate: 3e-4
Weight Decay: 0.1
Optimizer: AdamW (β₁=0.9, β₂=0.95)
Batch Size: 8 per device
Gradient Accumulation: 16 steps
Global Batch Size: 524,288 tokens
Max Steps: 38,500
Warmup Steps: 2,000
Gradient Clipping: 1.0
```
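These hyperparameters map onto a standard PyTorch setup roughly as sketched below; the model's forward signature and loss computation are placeholder assumptions, not the project's training script.

```python
import torch
import torch.nn.functional as F

def configure_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # Values taken from the configuration listed above.
    return torch.optim.AdamW(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.95), weight_decay=0.1)

def accumulated_update(model, optimizer, micro_batches, accum_steps: int = 16, clip: float = 1.0):
    """One optimizer update accumulated over `accum_steps` (input, target) micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    for x, y in micro_batches:                                   # each x, y: (B, T)
        logits = model(x)                                        # assumed to return (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum_steps).backward()                          # scale loss for accumulation
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)     # gradient clipping at 1.0
    optimizer.step()
```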
Training Dynamics
- Total Training Tokens: 20,000,000,000
- Effective Batch Size: 524,288 tokens
- Training Steps: ~38,500
- Warmup: Linear warmup for the first 2,000 steps
- Learning Rate Schedule: Cosine decay after warmup (see the sketch after this list)
- Compilation: Model compilation enabled for training efficiency
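The warmup and cosine schedule can be written as a simple per-step function like the sketch below. The minimum learning rate is an assumed value, since it is not stated in the configuration above.

```python
import math

def lr_at(step: int, max_lr: float = 3e-4, warmup_steps: int = 2_000,
          max_steps: int = 38_500, min_lr: float = 3e-5) -> float:
    """Linear warmup to max_lr, then cosine decay toward min_lr (min_lr is an assumption)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```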
Performance Characteristics
Computational Efficiency
- Memory Usage: Optimized for single-GPU training
- Inference Speed: Fast generation with KV caching
- Training Stability: Pre-normalization and gradient clipping
- Scalability: Efficient attention mechanisms for longer sequences
Model Capabilities
- Text Generation: High-quality autoregressive text generation
- Context Understanding: Strong performance on 4K context windows
- Efficiency: Compact model size with competitive performance
- Versatility: Suitable for various downstream tasks
Usage
Loading the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meryyllebr543/Lunaris-0.6B-base")
model = AutoModelForCausalLM.from_pretrained("meryyllebr543/Lunaris-0.6B-base")

# Generate text (do_sample=True so that temperature actually takes effect)
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
Direct Model Usage
```python
import torch
from model import LunarisCodex, LunarisCodexConfig
from tokenizers import Tokenizer

# Load configuration and model
config = LunarisCodexConfig(
    vocab_size=65536,
    d_model=1024,
    n_layers=48,
    n_heads=16,
    n_kv_heads=4,
    max_seq_len=4096,
)
model = LunarisCodex(config)
# NOTE: this builds a randomly initialized model; load the pretrained
# checkpoint weights into `model` before generating.
tokenizer = Tokenizer.from_file("lunaris-tokenizer.json")

# Generate text
prompt = "The future of AI is"
tokens = tokenizer.encode(prompt).ids
input_ids = torch.tensor([tokens])

with torch.no_grad():
    generated = model.generate(input_ids, max_new_tokens=100, temperature=0.7)

output = tokenizer.decode(generated[0].tolist())
print(output)
```
Training Results
Training Metrics
- Loss Convergence: Smooth convergence with stable training
- Perplexity: Competitive perplexity on validation sets
- Efficiency: High token throughput during training
- Stability: No gradient explosion or vanishing issues
Optimization
The model benefits from:
- Gradient Accumulation: Simulates larger batch sizes
- Mixed Precision: Faster training with maintained stability (sketched together with compilation after this list)
- Model Compilation: Improved training throughput
- Efficient Data Loading: Pre-tokenized data reduces overhead
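A rough sketch of how the mixed-precision and compilation points are typically wired up in PyTorch is shown below (bf16 autocast; an fp16 run would additionally use a GradScaler). This is illustrative, not the project's actual training loop.

```python
import torch
import torch.nn.functional as F

def amp_forward(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Forward pass and loss under bf16 autocast; master weights stay in fp32."""
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                                        # assumed (B, T, vocab) output
        return F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))

# torch.compile is applied once, before the training loop:
# model = torch.compile(model)
```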
Technical Implementation
Model Architecture
The model implements a modern transformer architecture with several key innovations:
- Grouped-Query Attention: Reduces KV cache size by 4x while maintaining performance
- RoPE Integration: Seamless positional encoding without learned parameters
- SwiGLU FFN: Improved activation function for better modeling capacity
- Efficient Inference: KV caching and optimized attention patterns (a generation loop is sketched below)
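The generation pattern that KV caching enables looks roughly like the loop below: the prompt is processed once, and each subsequent step feeds only the newest token while reusing cached keys and values. The `past_key_values` interface is an assumed signature, not necessarily the repository's actual API.

```python
import torch

@torch.no_grad()
def generate_with_cache(model, input_ids, max_new_tokens: int = 100, temperature: float = 0.7):
    """Sampling loop that reuses a KV cache (interface is illustrative)."""
    past = None                                        # per-layer cached keys/values
    tokens = input_ids
    for _ in range(max_new_tokens):
        inp = tokens if past is None else tokens[:, -1:]          # only the newest token after step 1
        logits, past = model(inp, past_key_values=past)           # assumed forward signature
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)      # (B, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```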
Training Infrastructure
- Framework: PyTorch with compilation support
- Precision: Mixed precision training (FP16/BF16)
- Distributed: Multi-GPU support with gradient accumulation
- Monitoring: Weights & Biases integration for experiment tracking
Related Resources
Datasets
- Training Data: Ultra-FineWeb 20B Tokenized
- Original Dataset: Ultra-FineWeb
- Ultra-FineWeb Paper: arXiv:2505.05427
Code & Implementation
- Model Code: Custom implementation based on Llama architecture
- Training Scripts: Optimized training pipeline with modern techniques
- Tokenizer: Custom BPE tokenizer trained on Ultra-FineWeb
License
This model is released under the MIT License. The training data inherits the license of the original Ultra-FineWeb dataset; please refer to the Ultra-FineWeb license for detailed terms.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{lunaris2025,
  title={Lunaris-0.6B-base: A Compact Language Model with Modern Architecture},
  author={meryyllebr543},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/meryyllebr543/Lunaris-0.6B-base}
}
```
Additionally, please cite the Ultra-FineWeb dataset:
```bibtex
@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Acknowledgments
- Ultra-FineWeb Team for providing high-quality training data
- Hugging Face for hosting and infrastructure support
- OpenBMB for the original Ultra-FineWeb dataset and tools
- Meta AI for the foundational Llama architecture insights
Model created by: meryyllebr543
Last updated: July 2025
Model size: ~600M parameters
Training tokens: 20B tokens