Growing Transformers β€” Frozen UNICODE Baseline (Monolithic, 247M)

This repository contains growing-transformers-model-frozen-unicode-baseline-monolyth-247m, a monolithic baseline model from the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained monolithically end-to-end (all Transformer layers trained simultaneously from scratch), without constructive / layer-wise growth. The key constraint is that the token embedding layer is frozen and is predefined by a visual UNICODE rendering procedure (as described in the papers): tokens are deterministically mapped to d_model = 1024 vectors, and the resulting embedding matrix is kept fixed throughout training.

This repository is intended as a clean baseline to isolate the effect of monolithic training under the frozen visual-UNICODE embedding substrate.


Primary comparison (why this repo exists)

1) Comparison to the 16-bit constructive model (same Transformer stack, different embedding substrate)

This model is explicitly meant to be compared to:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m
    (constructive / layer-wise growth; frozen 16-bit embeddings)

What is identical

  • Same controlled-study Transformer stack: 9 layers, d_model=1024, n_head=32
  • Same tokenizer family / vocabulary size: vocab_size = 65,536
  • Same training setting (controlled study protocol)

What differs

  • Embedding substrate:
    • This repo: frozen visual UNICODE embeddings with a full embedding matrix of shape (65,536 Γ— 1,024)
    • 16-bit model: frozen 16-dimensional binary embeddings expanded to d_model
  • Training procedure:
    • This repo: monolithic end-to-end training
    • 16-bit model: constructive growth (trained in stages)

Important note on parameter count:
This model has more parameters than the 16-bit models because a full (V Γ— d_model) embedding matrix is stored (and frozen). With V=65,536 and d_model=1,024, the frozen embedding matrix alone is ~67.1M parameters.
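The embedding-matrix figure above can be checked with plain arithmetic (no framework needed):

```python
# Frozen visual-UNICODE embedding matrix: one d_model-dim vector per vocab entry.
V = 65_536        # vocab_size
D_MODEL = 1_024   # hidden size

emb_params = V * D_MODEL
print(f"{emb_params:,} parameters (~{emb_params / 1e6:.1f}M)")
# 67,108,864 parameters (~67.1M)
```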

2) Comparison within the UNICODE substrate (monolithic vs constructive growth)

To isolate the training regime while keeping the UNICODE substrate fixed, compare to:

  • Bochkov/growing-transformers-model-unicode-1-9-247m
    (constructive growth on the same frozen visual UNICODE embeddings)

Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Vocabulary size: 65,536
  • Context length used in training: 1024
  • Embedding: frozen visual UNICODE embedding matrix (precomputed; not trainable)
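The frozen-embedding constraint can be sketched in a few lines of PyTorch. This is a minimal illustration, not this repo's training code; the random `precomputed` tensor stands in for the real matrix, which is rendered deterministically from Unicode glyphs as described in the papers:

```python
import torch
import torch.nn as nn

V, D_MODEL = 65_536, 1_024  # vocab size and hidden size from the table above

# Stand-in for the precomputed visual-UNICODE matrix (random here; the real
# tensor is produced by the deterministic glyph-rendering procedure).
precomputed = torch.randn(V, D_MODEL)

emb = nn.Embedding(V, D_MODEL)
with torch.no_grad():
    emb.weight.copy_(precomputed)
emb.weight.requires_grad_(False)  # frozen: excluded from gradient updates

# An optimizer built over the full model would simply skip this matrix:
trainable = [p for p in emb.parameters() if p.requires_grad]
print(len(trainable))  # 0
```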

Parameter count (paper-controlled-study accounting)

  • Total: β‰ˆ247.6M
  • Frozen: β‰ˆ67.1M (embedding matrix)
  • Trainable: β‰ˆ180.5M

(Counts follow the paper’s controlled-study table.)


Tokenizer and embedding artifacts

Canonical tokenizer repository (same vocab family):

Reproducibility note (important):
This model repo may include embedding-related artifacts (including the frozen UNICODE embedding tensor) that are tied to the exact embedding construction used for this run. Even with the same tokenizer, the embedding tensor content is part of the experimental condition. For strict reproducibility, prefer loading the tokenizer (and any embedding artifacts) from this model repository.


Intended use

Research / analysis of:

  • emergent semantics with frozen visual token embeddings
  • monolithic vs constructive (layer-wise) training regimes
  • controlled comparisons across embedding substrates (UNICODE vs 16-bit vs trainable)

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-frozen-unicode-baseline-monolyth-247m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/growing-transformers-model-frozen-unicode-baseline-monolyth-247m",
    trust_remote_code=True,
).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Write a short poem about the ocean.
# The poem is a poem in the collection of the National Museum of Wales and the National Museum of

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of India?
# Answer:Mumbai
#     </s><
```

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}