Growing Transformers β€” Frozen UNICODE Baseline (Monolithic, 247M)

This repository contains growing-transformers-model-frozen-unicode-baseline-monolyth-247m, a monolithic baseline model from the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained monolithically end-to-end (all Transformer layers trained simultaneously from scratch), without constructive / layer-wise growth. The key constraint is that the token embedding layer is frozen and is predefined by a visual UNICODE rendering procedure (as described in the papers): tokens are deterministically mapped to d_model = 1024 vectors, and the resulting embedding matrix is kept fixed throughout training.

This repository is intended as a clean baseline to isolate the effect of monolithic training under the frozen visual-UNICODE embedding substrate.


Primary comparison (why this repo exists)

1) Comparison to the 16-bit constructive model (same Transformer stack, different embedding substrate)

This model is explicitly meant to be compared to:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m
    (constructive / layer-wise growth; frozen 16-bit embeddings)

What is identical

  • Same controlled-study Transformer stack: 9 layers, d_model=1024, n_head=32
  • Same tokenizer family / vocabulary size: vocab_size = 65,536
  • Same training setting (controlled study protocol)

What differs

  • Embedding substrate:
    • This repo: frozen visual UNICODE embeddings with a full embedding matrix of shape (65,536 Γ— 1,024)
    • 16-bit model: frozen 16-dimensional binary embeddings expanded to d_model
  • Training procedure:
    • This repo: monolithic end-to-end training
    • 16-bit model: constructive growth (trained in stages)

Important note on parameter count:
This model has more parameters than the 16-bit models because a full (V Γ— d_model) embedding matrix is stored (and frozen). With V=65,536 and d_model=1,024, the frozen embedding matrix alone is ~67.1M parameters.
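The embedding-matrix figure above can be checked with plain arithmetic (no framework needed):

```python
# Frozen visual-UNICODE embedding matrix: one d_model-dim vector per vocab entry.
V = 65_536        # vocab_size
D_MODEL = 1_024   # hidden size

emb_params = V * D_MODEL
print(f"{emb_params:,} parameters (~{emb_params / 1e6:.1f}M)")
# 67,108,864 parameters (~67.1M)
```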

2) Comparison within the UNICODE substrate (monolithic vs constructive growth)

To isolate the training regime while keeping the UNICODE substrate fixed, compare to:

  • Bochkov/growing-transformers-model-unicode-1-9-247m
    (constructive growth on the same frozen visual UNICODE embeddings)

Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Vocabulary size: 65,536
  • Context length used in training: 1024
  • Embedding: frozen visual UNICODE embedding matrix (precomputed; not trainable)
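The frozen-embedding constraint can be sketched in a few lines of PyTorch. This is a minimal illustration, not this repo's training code; the random `precomputed` tensor stands in for the real matrix, which is rendered deterministically from Unicode glyphs as described in the papers:

```python
import torch
import torch.nn as nn

V, D_MODEL = 65_536, 1_024  # vocab size and hidden size from the table above

# Stand-in for the precomputed visual-UNICODE matrix (random here; the real
# tensor is produced by the deterministic glyph-rendering procedure).
precomputed = torch.randn(V, D_MODEL)

emb = nn.Embedding(V, D_MODEL)
with torch.no_grad():
    emb.weight.copy_(precomputed)
emb.weight.requires_grad_(False)  # frozen: excluded from gradient updates

# An optimizer built over the full model would simply skip this matrix:
trainable = [p for p in emb.parameters() if p.requires_grad]
print(len(trainable))  # 0
```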

Parameter count (paper-controlled-study accounting)

  • Total: β‰ˆ247.6M
  • Frozen: β‰ˆ67.1M (embedding matrix)
  • Trainable: β‰ˆ180.5M

(Counts follow the paper’s controlled-study table.)


Tokenizer and embedding artifacts

Canonical tokenizer repository (same vocab family):

Reproducibility note (important):
This model repo may include embedding-related artifacts (including the frozen UNICODE embedding tensor) that are tied to the exact embedding construction used for this run. Even with the same tokenizer, the embedding tensor content is part of the experimental condition. For strict reproducibility, prefer loading the tokenizer (and any embedding artifacts) from this model repository.


Intended use

Research / analysis of:

  • emergent semantics with frozen visual token embeddings
  • monolithic vs constructive (layer-wise) training regimes
  • controlled comparisons across embedding substrates (UNICODE vs 16-bit vs trainable)

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-frozen-unicode-baseline-monolyth-247m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/growing-transformers-model-frozen-unicode-baseline-monolyth-247m",
    trust_remote_code=True,
).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Write a short poem about the ocean.
# The poem is a poem in the collection of the National Museum of Wales and the National Museum of

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of India?
# Answer:Mumbai
#     </s><
```

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}