🧬 CancerTranscriptome-Mini-48M

A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq

CancerTranscriptome-Mini-48M is a lightweight derivative of BulkFormer, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
It combines GCN-based gene-graph propagation, Rotary Expression Embeddings (REE), local bin-wise Performer attention, and global Performer attention in a single encoder.

This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.


πŸ”¬ Origin & References

Primary Reference (BulkFormer)

Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
β€œA large-scale foundation model for bulk transcriptomes.”
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222

This Model (CancerTranscriptome-Mini-48M)

A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source Code: https://github.com/alwalt/BioFM


πŸ“Š Data Source

All training samples originate from the ARCHS4 Human RNA-seq v2.5 public repository:

ARCHS4 Reference:
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
β€œMassive mining of publicly available RNA-seq data from human and mouse.”
Nature Communications 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/

Filtering Procedure

  • Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
  • Selected samples matching:
    cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma
  • Removed samples lacking clear disease annotations
  • Used ARCHS4 log-TPM matrices (gene Γ— sample)
  • Final dataset: ~76k cancer samples, 19,357 genes

No private, clinical, controlled-access, or proprietary data were used.
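
A minimal sketch of the keyword filter; the file name and the metadata key meta/samples/source_name_ch1 are assumptions about the ARCHS4 v2 HDF5 layout, so check your download for the exact paths:

import re

import h5py
import numpy as np

# Case-insensitive match against the cancer keywords listed above
CANCER_RE = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)

with h5py.File("human_gene_v2.5.h5", "r") as f:   # assumed file name
    annotations = [s.decode() for s in f["meta/samples/source_name_ch1"][:]]

keep = np.array([bool(CANCER_RE.search(a)) for a in annotations])
print(f"kept {keep.sum()} of {keep.size} samples")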


🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

1. Gene Identity Embeddings

  • Precomputed ESM2 embeddings for each protein-coding gene
  • Projected into model dimension (320)
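
As a rough illustration, this step amounts to one linear layer. A minimal sketch, where the 1280-d input assumes the esm2_t33_650M checkpoint and random tensors stand in for the precomputed embeddings:

import torch
import torch.nn as nn

esm2_emb = torch.randn(19357, 1280)   # stand-in for precomputed ESM2 vectors
proj = nn.Linear(1280, 320)           # project into the model dimension
gene_id_emb = proj(esm2_emb)          # (19357, 320): one identity vector per gene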

2. Rotary Expression Embeddings (REE)

  • Deterministic sinusoidal embedding of continuous expression values
  • Masked positions zeroed (mask token = -10)
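
A minimal sketch of such an embedding, using the standard transformer frequency schedule (an assumption; the schedule in this repo may differ):

import math

import torch

def sinusoidal_value_embedding(x: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Embed continuous values x of shape (..., genes) into (..., genes, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x.unsqueeze(-1) * freqs                 # (..., genes, half)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return emb * (x != -10.0).unsqueeze(-1)          # zero out masked positions

# e.g. sinusoidal_value_embedding(torch.tensor([[0.5, 3.2, -10.0]]))
# -> shape (1, 3, 320), with the masked (-10) position mapped to zeros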

3. Graph Neural Network Layer

  • GCNConv (Kipf & Welling) applied on a curated gene-gene graph
  • Injects biological prior knowledge
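
In code this is a single GCNConv call from torch_geometric; the sketch below assumes the same edge_index.pt format as the usage example further down (a (2, num_edges) tensor of gene-index pairs):

import torch
from torch_geometric.nn import GCNConv

gcn = GCNConv(320, 320)                   # keep the model dimension
edge_index = torch.load("edge_index.pt")  # curated gene-gene graph
h = torch.randn(19357, 320)               # per-gene hidden states
h = gcn(h, edge_index)                    # one round of propagation over the graph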

4. Expression Binning

  • Learnable importance scores sort genes
  • Genes divided into 10 bins
  • Each bin receives its own local Performer attention
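
A simplified sketch of the sort-and-bin routing, with SelfAttention from the performer-pytorch package standing in for the repo's own Performer blocks (an assumption). Note that a hard argsort is not differentiable, so the trained model's routing may be more involved:

import torch
import torch.nn as nn
from performer_pytorch import SelfAttention

num_genes, dim, bins = 19357, 320, 10
bin_size = -(-num_genes // bins)                      # ceil division

importance = nn.Parameter(torch.randn(num_genes))     # learnable per-gene scores
local_attn = nn.ModuleList(
    SelfAttention(dim=dim, heads=8, causal=False) for _ in range(bins)
)

h = torch.randn(1, num_genes, dim)
order = importance.argsort(descending=True)           # sort genes by importance
h = h[:, order]
h = torch.cat(                                        # attention within each bin
    [local_attn[b](h[:, b * bin_size:(b + 1) * bin_size]) for b in range(bins)],
    dim=1,
)
h = h[:, order.argsort()]                             # restore original gene order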

5. Global Performer Attention

  • 2 stacked Performer layers across all genes
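
Sketched with the same performer-pytorch stand-in; the residual connections are an assumption about the block structure:

import torch
import torch.nn as nn
from performer_pytorch import SelfAttention

global_attn = nn.ModuleList(
    SelfAttention(dim=320, heads=4, causal=False) for _ in range(2)
)
h = torch.randn(1, 19357, 320)
for layer in global_attn:
    h = h + layer(h)    # linear-complexity attention over all 19,357 genes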

6. Prediction Head

  • MLP β†’ scalar value per gene
  • Used for masked-expression reconstruction
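
A minimal sketch (the hidden width and GELU activation are assumptions):

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(320, 320),
    nn.GELU(),
    nn.Linear(320, 1),
)
h = torch.randn(1, 19357, 320)    # encoder output
pred = head(h).squeeze(-1)        # (1, 19357): one reconstructed value per gene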

Total parameters: 48,336,162 (~48M)


🎯 Intended Use

This model produces context-aware gene embeddings for downstream cancer transcriptomic tasks:

  • Tumor subtype prediction
  • Drug response modeling
  • Immune infiltration scoring
  • Survival / risk modeling
  • Gene expression imputation
  • Dimensionality reduction
  • Transfer learning to TCGA, CCLE, DepMap, and GEO tumor datasets

πŸš€ How to Use

Download & run:

import torch
from model import BulkFormer   # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),   # provide your graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2
)

state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: a 19,357-gene log-TPM vector
# (random values here stand in for real data)
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]: one value per gene
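
For the transfer-learning uses listed under Intended Use, you typically want per-gene hidden states rather than the reconstruction output. A generic way to grab them, continuing from the example above, is a forward hook; "global_attn" is a hypothetical module name, so inspect print(model) for the real layer names in this repo:

feats = {}

def grab(module, inputs, output):
    feats["h"] = output.detach()

# Hook the (hypothetical) last encoder layer
layer = dict(model.named_modules())["global_attn"]
handle = layer.register_forward_hook(grab)
with torch.no_grad():
    model(x)
handle.remove()

sample_emb = feats["h"].mean(dim=1)   # mean-pool over genes -> (1, 320)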