
Lunaris-0.6B-base

A compact 600M-parameter language model trained on 20 billion tokens from the Ultra-FineWeb dataset, built on a modern Llama-style architecture (GQA, RoPE, SwiGLU, RMSNorm) with an efficient training setup.

🚀 Model Overview

Lunaris-0.6B-base is a transformer-based language model designed for efficient inference and high-quality text generation. Despite its compact size of ~600M parameters, it delivers strong performance through careful architectural choices and training on high-quality web text data.

Key Features

  • 600M parameters with efficient Grouped-Query Attention (GQA)
  • Trained on 20B tokens from Ultra-FineWeb high-quality corpus
  • Modern architecture with RoPE, SwiGLU, and RMSNorm
  • KV Caching for efficient inference
  • Tied embeddings for parameter efficiency
  • Mixed precision training with gradient accumulation

📊 Model Specifications

Attribute               Value
---------               -----
Parameters              ~600M
Architecture            Transformer (Llama-style)
Context Length          4,096 tokens
Vocabulary Size         65,536 (BPE)
Layers                  48
Hidden Size             1,024
Attention Heads         16 (4 KV heads with GQA)
FFN Hidden Multiplier   4.0x
Training Tokens         20B
Precision               Mixed (FP16/BF16)
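
For intuition, the specifications above imply a much smaller KV cache than standard multi-head attention would need. The back-of-the-envelope estimate below is illustrative only (it assumes an fp16 cache and a head dimension of 1,024 / 16 = 64); it is not a figure reported on this card.

# Rough per-sequence KV-cache footprint at the full 4,096-token context.
n_layers   = 48
n_kv_heads = 4                 # GQA: only the 4 KV heads are cached
head_dim   = 1024 // 16        # assumed: hidden size / number of query heads
seq_len    = 4096
bytes_fp16 = 2

gqa_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16   # K and V
mha_cache = gqa_cache * (16 // n_kv_heads)                                # if all 16 heads were cached

print(f"GQA cache:      {gqa_cache / 2**20:.0f} MiB per sequence")        # ~192 MiB
print(f"Full-MHA cache: {mha_cache / 2**20:.0f} MiB per sequence")        # ~768 MiB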

πŸ—οΈ Architecture Details

Core Components

  • Tokenizer: Custom BPE with 65,536 vocabulary trained on Ultra-FineWeb
  • Embeddings: Tied input/output embeddings for parameter efficiency
  • Attention: Grouped-Query Attention (GQA) with 16 query heads and 4 KV heads
  • Positional Encoding: Rotary Position Embeddings (RoPE) with θ=10,000 (sketched after this list)
  • Normalization: RMSNorm (Root Mean Square Layer Normalization)
  • Activation: SwiGLU in feed-forward networks
  • Regularization: Dropout (disabled during training for stability)
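
The rotary positional encoding above can be pictured with a minimal sketch like the following. The function names are illustrative and do not come from the model's repository; only θ = 10,000 and the head dimension (1,024 / 16 = 64) are taken from this card.

import torch

def rope_frequencies(head_dim: int, seq_len: int, theta: float = 10_000.0) -> torch.Tensor:
    # Precompute complex rotation factors, one per (position, frequency) pair.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, head_dim / 2)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); channel pairs are rotated together.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = x_complex * freqs[None, :, None, :]                   # broadcast over batch and heads
    return torch.view_as_real(rotated).flatten(-2).type_as(x)

freqs = rope_frequencies(head_dim=64, seq_len=4096)
q = torch.randn(1, 4096, 16, 64)
q_rotated = apply_rope(q, freqs)                                    # same shape as q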

Architectural Innovations

  1. Grouped-Query Attention (GQA): Reduces memory usage and improves inference speed while maintaining performance (see the sketch after this list)
  2. Pre-normalization: Applies normalization before attention and FFN layers for training stability
  3. RoPE: Enables length extrapolation and better positional understanding
  4. SwiGLU: Provides better activation properties compared to standard ReLU/GELU
  5. KV Caching: Optimized for efficient autoregressive generation
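
A minimal sketch of grouped-query attention with the head counts above (16 query heads sharing 4 KV heads). This illustrates the general technique, not the model's actual attention code.

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads=16, n_kv_heads=4):
    # q: (batch, n_heads, seq_len, head_dim); k, v: (batch, n_kv_heads, seq_len, head_dim)
    groups = n_heads // n_kv_heads                 # 4 query heads per KV head
    k = k.repeat_interleave(groups, dim=1)         # expand the shared KV heads
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 16, 128, 64)                    # head_dim = 1024 / 16 = 64
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)
out = grouped_query_attention(q, k, v)             # (1, 16, 128, 64)

Because only the 4 KV heads are ever cached, the KV cache is roughly 4x smaller than with full multi-head attention, which is where the memory and inference-speed gains come from.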

📚 Training Data

The model was trained on the Ultra-FineWeb 20B Tokenized Dataset, a high-quality subset of web text specifically prepared for language model training.

Dataset Characteristics

  • Source: Ultra-FineWeb English corpus
  • Size: 20 billion tokens across 20 shards
  • Quality: Filtered and verified web text with quality controls
  • Format: Pre-tokenized NumPy arrays for efficient loading
  • Tokenizer: Custom BPE trained on Ultra-FineWeb samples

Data Processing Pipeline

  1. Streaming: Efficient streaming from Ultra-FineWeb dataset
  2. Tokenization: Parallel processing with custom BPE tokenizer
  3. Sharding: Split into 1B token shards for distributed training
  4. Optimization: Stored in uint32 format for memory efficiency (loading sketched below)
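
A minimal sketch of how pre-tokenized uint32 shards of this kind are typically consumed during training. The shard filename and the random-window batching below are assumptions for illustration, not details from the actual data loader.

import numpy as np
import torch

# Assumed filename; each shard is a flat array of uint32 token IDs (~1B tokens).
shard = np.load("ultrafineweb_shard_00.npy", mmap_mode="r")

def get_batch(data, batch_size=8, seq_len=4096):
    # Sample random contiguous windows and build next-token-prediction pairs.
    starts = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[s : s + seq_len].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1 : s + 1 + seq_len].astype(np.int64)) for s in starts])
    return x, y           # inputs and shifted targets, both (batch_size, seq_len)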

🔧 Training Configuration

Optimization Setup

Learning Rate: 3e-4
Weight Decay: 0.1
Optimizer: AdamW (β₁=0.9, β₂=0.95)
Batch Size: 8 per device
Gradient Accumulation: 16 steps
Global Batch Size: 524,288 tokens
Max Steps: 38,500
Warmup Steps: 2,000
Gradient Clipping: 1.0
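
The listing above maps onto a fairly standard PyTorch setup. The sketch below is an illustrative reconstruction, not the project's training script; in particular, the minimum learning rate and the tiny stand-in model are assumptions made so the snippet runs.

import math
import torch

model = torch.nn.Linear(8, 8)                        # stand-in model so the sketch runs
max_steps, warmup_steps = 38_500, 2_000
max_lr, min_lr = 3e-4, 3e-5                          # min_lr is assumed, not stated above

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_at(step: int) -> float:
    # Linear warmup for the first 2,000 steps, then cosine decay toward min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in range(3):                                # a few dummy steps to show the pattern
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    loss = model(torch.randn(8, 8)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()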

Training Dynamics

  • Total Training Tokens: 20,000,000,000
  • Effective Batch Size: 524,288 tokens
  • Training Steps: ~38,500 steps
  • Warmup: Linear warmup for first 2,000 steps
  • Learning Rate Schedule: Cosine decay after warmup
  • Compilation: Model compilation enabled for training efficiency
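
As a quick sanity check, the numbers above are mutually consistent:

tokens_per_step = 8 * 16 * 4096            # micro-batch x grad accumulation x context length
total_tokens    = tokens_per_step * 38_500
print(tokens_per_step)                     # 524288 tokens per optimizer step
print(total_tokens)                        # 20185088000, i.e. roughly 20B training tokens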

🎯 Performance Characteristics

Computational Efficiency

  • Memory Usage: Optimized for single-GPU training
  • Inference Speed: Fast generation with KV caching
  • Training Stability: Pre-normalization and gradient clipping
  • Scalability: Efficient attention mechanisms for longer sequences

Model Capabilities

  • Text Generation: High-quality autoregressive text generation
  • Context Understanding: Strong performance on 4K context windows
  • Efficiency: Compact model size with competitive performance
  • Versatility: Suitable for various downstream tasks

🚀 Usage

Loading the Model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meryyllebr543/Lunaris-0.6B-base")
model = AutoModelForCausalLM.from_pretrained("meryyllebr543/Lunaris-0.6B-base")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Direct Model Usage

import torch
from model import LunarisCodex, LunarisCodexConfig
from tokenizers import Tokenizer

# Load configuration and model
config = LunarisCodexConfig(
    vocab_size=65536,
    d_model=1024,
    n_layers=48,
    n_heads=16,
    n_kv_heads=4,
    max_seq_len=4096
)

model = LunarisCodex(config)
tokenizer = Tokenizer.from_file("lunaris-tokenizer.json")

# Generate text
prompt = "The future of AI is"
tokens = tokenizer.encode(prompt).ids
input_ids = torch.tensor([tokens])

with torch.no_grad():
    generated = model.generate(input_ids, max_new_tokens=100, temperature=0.7)
    output = tokenizer.decode(generated[0].tolist())
    print(output)

📈 Training Results

Training Metrics

  • Loss Convergence: Smooth convergence with stable training
  • Perplexity: Competitive perplexity on validation sets
  • Efficiency: High token throughput during training
  • Stability: No gradient explosion or vanishing issues

Optimization

The model benefits from:

  • Gradient Accumulation: Simulates larger batch sizes
  • Mixed Precision: Faster training with maintained stability
  • Model Compilation: Improved training throughput
  • Efficient Data Loading: Pre-tokenized data reduces overhead
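
A minimal sketch of how these pieces (mixed precision, compilation, and gradient accumulation) typically fit together in a PyTorch training step. The actual training script is not shown on this card, so the names and the bf16 choice below are illustrative assumptions.

import torch

model = torch.compile(torch.nn.Linear(1024, 1024).cuda())           # model compilation
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 16                                                     # gradient accumulation

for micro_step in range(accum_steps):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):   # mixed-precision forward pass
        loss = model(x).pow(2).mean() / accum_steps                  # scale loss for accumulation
    loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()                                                     # one optimizer step per 16 micro-batches
optimizer.zero_grad(set_to_none=True)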

🛠️ Technical Implementation

Model Architecture

The model implements a modern transformer architecture with several key innovations:

  1. Grouped-Query Attention: Reduces KV cache size by 4x while maintaining performance
  2. RoPE Integration: Seamless positional encoding without learned parameters
  3. SwiGLU FFN: Improved activation function for better modeling capacity (sketched after this list)
  4. Efficient Inference: KV caching and optimized attention patterns
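
As a concrete reference for item 3, a SwiGLU feed-forward block can be sketched as below. The layer names and the way the 4.0x multiplier is applied are assumptions; the actual implementation may size the hidden layer differently.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int = 1024, multiplier: float = 4.0):
        super().__init__()
        hidden = int(d_model * multiplier)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)   # gate projection
        self.w_up   = nn.Linear(d_model, hidden, bias=False)   # value projection
        self.w_down = nn.Linear(hidden, d_model, bias=False)   # back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit instead of a plain ReLU/GELU MLP
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN()
out = ffn(torch.randn(1, 16, 1024))        # (1, 16, 1024)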

Training Infrastructure

  • Framework: PyTorch with compilation support
  • Precision: Mixed precision training (FP16/BF16)
  • Distributed: Multi-GPU support with gradient accumulation
  • Monitoring: Weights & Biases integration for experiment tracking

🔗 Related Resources

Datasets

  • Ultra-FineWeb 20B Tokenized Dataset: the pre-tokenized training corpus described in the Training Data section above

Code & Implementation

  • Model Code: Custom implementation based on Llama architecture
  • Training Scripts: Optimized training pipeline with modern techniques
  • Tokenizer: Custom BPE tokenizer trained on Ultra-FineWeb

📄 License

This model is released under the MIT License. The training data inherits the licenses of the original Ultra-FineWeb dataset; please refer to the Ultra-FineWeb license for detailed terms.

🤝 Citation

If you use this model in your research, please cite:

@misc{lunaris2025,
  title={Lunaris-0.6B-base: A Compact Language Model with Modern Architecture},
  author={meryyllebr543},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/meryyllebr543/Lunaris-0.6B-base}
}

Additionally, please cite the Ultra-FineWeb dataset:

@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

🙏 Acknowledgments

  • Ultra-FineWeb Team for providing high-quality training data
  • Hugging Face for hosting and infrastructure support
  • OpenBMB for the original Ultra-FineWeb dataset and tools
  • Meta AI for the foundational Llama architecture insights

Model created by: meryyllebr543
Last updated: July 2025
Model size: ~600M parameters
Training tokens: 20B tokens
