Lunaris-0.6B-base
A high-performance 600M parameter language model trained on 20 billion tokens from the Ultra-FineWeb dataset, featuring modern architectural innovations and efficient training design.
Model Overview
Lunaris-0.6B-base is a transformer-based language model designed for efficient inference and high-quality text generation. Despite its compact size of ~600M parameters, it delivers strong performance through careful architectural choices and training on high-quality web text data.
Key Features
- 600M parameters with efficient Grouped-Query Attention (GQA)
- Trained on 20B tokens from Ultra-FineWeb high-quality corpus
- Modern architecture with RoPE, SwiGLU, and RMSNorm
- KV Caching for efficient inference
- Tied embeddings for parameter efficiency
- Mixed precision training with gradient accumulation
Model Specifications

| Attribute | Value |
|---|---|
| Parameters | ~600M |
| Architecture | Transformer (Llama-style) |
| Context Length | 4,096 tokens |
| Vocabulary Size | 65,536 (BPE) |
| Layers | 48 |
| Hidden Size | 1,024 |
| Attention Heads | 16 (4 KV heads with GQA) |
| FFN Hidden Multiplier | 4.0x |
| Training Tokens | 20B |
| Precision | Mixed (FP16/BF16) |
Architecture Details
Core Components
- Tokenizer: Custom BPE with 65,536 vocabulary trained on Ultra-FineWeb
- Embeddings: Tied input/output embeddings for parameter efficiency
- Attention: Grouped-Query Attention (GQA) with 16 query heads and 4 KV heads
- Positional Encoding: Rotary Position Embeddings (RoPE) with θ=10,000 (see the code sketch after this list)
- Normalization: RMSNorm (Root Mean Square Layer Normalization)
- Activation: SwiGLU in feed-forward networks
- Regularization: Dropout (disabled during training for stability)
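As a rough illustration of the RoPE item above, the sketch below precomputes the θ=10,000 frequency tables and rotates channel pairs of a query/key tensor. The function names and tensor layout are assumptions for illustration, not the model's actual API.

```python
import torch

def precompute_rope(head_dim: int, max_seq_len: int, theta: float = 10_000.0):
    """Precompute cos/sin tables for rotary position embeddings."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)          # (max_seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x, assumed shaped (batch, heads, seq_len, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[: x.shape[-2]], sin[: x.shape[-2]]  # trim tables to the sequence length
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```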
Architectural Innovations
- Grouped-Query Attention (GQA): Reduces memory usage and improves inference speed while maintaining performance (sketched after this list)
- Pre-normalization: Applies normalization before attention and FFN layers for training stability
- RoPE: Enables length extrapolation and better positional understanding
- SwiGLU: Provides better activation properties compared to standard ReLU/GELU
- KV Caching: Optimized for efficient autoregressive generation
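To make the GQA item concrete: with 16 query heads and 4 KV heads, each group of 4 query heads shares one key/value head. A minimal sketch, assuming a standard (batch, heads, seq, head_dim) layout, is shown below; it is not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads: int = 16, n_kv_heads: int = 4):
    """q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    group = n_heads // n_kv_heads                 # 4 query heads per KV head here
    k = k.repeat_interleave(group, dim=1)         # expand KV heads to (B, n_heads, T, hd)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Shapes for this model: head_dim = 1024 // 16 = 64
q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 4, 8, 64)
v = torch.randn(1, 4, 8, 64)
out = grouped_query_attention(q, k, v)            # (1, 16, 8, 64)
```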
Training Data
The model was trained on the Ultra-FineWeb 20B Tokenized Dataset, a high-quality subset of web text specifically prepared for language model training.
Dataset Characteristics
- Source: Ultra-FineWeb English corpus
- Size: 20 billion tokens across 20 shards
- Quality: Filtered and verified web text with quality controls
- Format: Pre-tokenized NumPy arrays for efficient loading
- Tokenizer: Custom BPE trained on Ultra-FineWeb samples
Data Processing Pipeline
- Streaming: Efficient streaming from Ultra-FineWeb dataset
- Tokenization: Parallel processing with custom BPE tokenizer
- Sharding: Split into 1B token shards for distributed training
- Optimization: Stored in uint32 format for memory efficiency (loading is sketched below)
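To illustrate the pre-tokenized uint32 format, the snippet below memory-maps one shard and samples next-token training windows from it. The shard filename is hypothetical; the dataset's actual file naming may differ.

```python
import numpy as np
import torch

# Hypothetical shard name; the real files in the dataset may be named differently.
shard = np.load("ultrafineweb_tokens_shard_00.npy", mmap_mode="r")  # uint32 token ids

def get_batch(data, batch_size: int, seq_len: int):
    """Sample random contiguous windows and build next-token (input, target) pairs."""
    starts = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[s : s + seq_len].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1 : s + 1 + seq_len].astype(np.int64)) for s in starts])
    return x, y  # each (batch_size, seq_len)
```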
Training Configuration
Optimization Setup

```
Learning Rate: 3e-4
Weight Decay: 0.1
Optimizer: AdamW (β₁=0.9, β₂=0.95)
Batch Size: 8 per device
Gradient Accumulation: 16 steps
Global Batch Size: 524,288 tokens
Max Steps: 38,500
Warmup Steps: 2,000
Gradient Clipping: 1.0
```
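These hyperparameters map onto a standard PyTorch setup roughly as sketched below; the model's forward signature and loss computation are placeholder assumptions, not the project's training script.

```python
import torch
import torch.nn.functional as F

def configure_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # Values taken from the configuration listed above.
    return torch.optim.AdamW(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.95), weight_decay=0.1)

def accumulated_update(model, optimizer, micro_batches, accum_steps: int = 16, clip: float = 1.0):
    """One optimizer update accumulated over `accum_steps` (input, target) micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    for x, y in micro_batches:                                   # each x, y: (B, T)
        logits = model(x)                                        # assumed to return (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum_steps).backward()                          # scale loss for accumulation
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)     # gradient clipping at 1.0
    optimizer.step()
```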
Training Dynamics
- Total Training Tokens: 20,000,000,000
- Effective Batch Size: 524,288 tokens
- Training Steps: ~38,500
- Warmup: Linear warmup for the first 2,000 steps
- Learning Rate Schedule: Cosine decay after warmup (see the sketch after this list)
- Compilation: Model compilation enabled for training efficiency
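The warmup and cosine schedule can be written as a simple per-step function like the sketch below. The minimum learning rate is an assumed value, since it is not stated in the configuration above.

```python
import math

def lr_at(step: int, max_lr: float = 3e-4, warmup_steps: int = 2_000,
          max_steps: int = 38_500, min_lr: float = 3e-5) -> float:
    """Linear warmup to max_lr, then cosine decay toward min_lr (min_lr is an assumption)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```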
Performance Characteristics
Computational Efficiency
- Memory Usage: Optimized for single-GPU training
- Inference Speed: Fast generation with KV caching
- Training Stability: Pre-normalization and gradient clipping
- Scalability: Efficient attention mechanisms for longer sequences
Model Capabilities
- Text Generation: High-quality autoregressive text generation
- Context Understanding: Strong performance on 4K context windows
- Efficiency: Compact model size with competitive performance
- Versatility: Suitable for various downstream tasks
Usage
Loading the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meryyllebr543/Lunaris-0.6B-base")
model = AutoModelForCausalLM.from_pretrained("meryyllebr543/Lunaris-0.6B-base")

# Generate text (do_sample=True so that temperature actually takes effect)
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
Direct Model Usage
```python
import torch
from model import LunarisCodex, LunarisCodexConfig
from tokenizers import Tokenizer

# Load configuration and model
config = LunarisCodexConfig(
    vocab_size=65536,
    d_model=1024,
    n_layers=48,
    n_heads=16,
    n_kv_heads=4,
    max_seq_len=4096,
)
model = LunarisCodex(config)
# NOTE: this builds a randomly initialized model; load the pretrained
# checkpoint weights into `model` before generating.
tokenizer = Tokenizer.from_file("lunaris-tokenizer.json")

# Generate text
prompt = "The future of AI is"
tokens = tokenizer.encode(prompt).ids
input_ids = torch.tensor([tokens])

with torch.no_grad():
    generated = model.generate(input_ids, max_new_tokens=100, temperature=0.7)

output = tokenizer.decode(generated[0].tolist())
print(output)
```
Training Results
Training Metrics
- Loss Convergence: Smooth convergence with stable training
- Perplexity: Competitive perplexity on validation sets
- Efficiency: High token throughput during training
- Stability: No gradient explosion or vanishing issues
Optimization
The model benefits from:
- Gradient Accumulation: Simulates larger batch sizes
- Mixed Precision: Faster training with maintained stability (sketched together with compilation after this list)
- Model Compilation: Improved training throughput
- Efficient Data Loading: Pre-tokenized data reduces overhead
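A rough sketch of how the mixed-precision and compilation points are typically wired up in PyTorch is shown below (bf16 autocast; an fp16 run would additionally use a GradScaler). This is illustrative, not the project's actual training loop.

```python
import torch
import torch.nn.functional as F

def amp_forward(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Forward pass and loss under bf16 autocast; master weights stay in fp32."""
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                                        # assumed (B, T, vocab) output
        return F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))

# torch.compile is applied once, before the training loop:
# model = torch.compile(model)
```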
Technical Implementation
Model Architecture
The model implements a modern transformer architecture with several key innovations:
- Grouped-Query Attention: Reduces KV cache size by 4x while maintaining performance
- RoPE Integration: Seamless positional encoding without learned parameters
- SwiGLU FFN: Improved activation function for better modeling capacity
- Efficient Inference: KV caching and optimized attention patterns (a generation loop is sketched below)
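The generation pattern that KV caching enables looks roughly like the loop below: the prompt is processed once, and each subsequent step feeds only the newest token while reusing cached keys and values. The `past_key_values` interface is an assumed signature, not necessarily the repository's actual API.

```python
import torch

@torch.no_grad()
def generate_with_cache(model, input_ids, max_new_tokens: int = 100, temperature: float = 0.7):
    """Sampling loop that reuses a KV cache (interface is illustrative)."""
    past = None                                        # per-layer cached keys/values
    tokens = input_ids
    for _ in range(max_new_tokens):
        inp = tokens if past is None else tokens[:, -1:]          # only the newest token after step 1
        logits, past = model(inp, past_key_values=past)           # assumed forward signature
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)      # (B, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```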
Training Infrastructure
- Framework: PyTorch with compilation support
- Precision: Mixed precision training (FP16/BF16)
- Distributed: Multi-GPU support with gradient accumulation
- Monitoring: Weights & Biases integration for experiment tracking
Related Resources
Datasets
- Training Data: Ultra-FineWeb 20B Tokenized
- Original Dataset: Ultra-FineWeb
- Ultra-FineWeb Paper: arXiv:2505.05427
Code & Implementation
- Model Code: Custom implementation based on Llama architecture
- Training Scripts: Optimized training pipeline with modern techniques
- Tokenizer: Custom BPE tokenizer trained on Ultra-FineWeb
License
This model is released under the MIT License. The training data inherits the license of the original Ultra-FineWeb dataset; please refer to the Ultra-FineWeb license for detailed terms.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{lunaris2025,
  title={Lunaris-0.6B-base: A Compact Language Model with Modern Architecture},
  author={meryyllebr543},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/meryyllebr543/Lunaris-0.6B-base}
}
```
Additionally, please cite the Ultra-FineWeb dataset:
```bibtex
@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Acknowledgments
- Ultra-FineWeb Team for providing high-quality training data
- Hugging Face for hosting and infrastructure support
- OpenBMB for the original Ultra-FineWeb dataset and tools
- Meta AI for the foundational Llama architecture insights
Model created by: meryyllebr543
Last updated: July 2025
Model size: ~600M parameters
Training tokens: 20B tokens