🧬 CancerTranscriptome-Mini-48M

A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq

CancerTranscriptome-Mini-48M is a lightweight derivative of BulkFormer, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
It combines GCN-based gene-graph propagation, Rotary Expression Embeddings (REE), local bin-wise Performer attention, and global Performer attention in a single encoder.

This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.


πŸ”¬ Origin & References

Primary Reference (BulkFormer)

Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
β€œA large-scale foundation model for bulk transcriptomes.”
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222

This Model (CancerTranscriptome-Mini-48M)

A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source Code: https://github.com/alwalt/BioFM


πŸ“Š Data Source

All training samples originate from the ARCHS4 Human RNA-seq v2.5 public repository:

ARCHS4 Reference:
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
β€œMassive mining of publicly available RNA-seq data from human and mouse.”
Nature Communications 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/

Filtering Procedure

  • Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
  • Selected samples matching:
    cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma
  • Removed samples lacking clear disease annotations
  • Used ARCHS4 log-TPM matrices (gene Γ— sample)
  • Final dataset: ~76k cancer samples, 19,357 genes

No private, clinical, controlled-access, or proprietary data were used.
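
A minimal sketch of the keyword filter; the file name and the metadata key meta/samples/source_name_ch1 are assumptions about the ARCHS4 v2 HDF5 layout, so check your download for the exact paths:

import re

import h5py
import numpy as np

# Case-insensitive match against the cancer keywords listed above
CANCER_RE = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)

with h5py.File("human_gene_v2.5.h5", "r") as f:   # assumed file name
    annotations = [s.decode() for s in f["meta/samples/source_name_ch1"][:]]

keep = np.array([bool(CANCER_RE.search(a)) for a in annotations])
print(f"kept {keep.sum()} of {keep.size} samples")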


🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

1. Gene Identity Embeddings

  • Precomputed ESM2 embeddings for each protein-coding gene
  • Projected into model dimension (320)
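
As a rough illustration, this step amounts to one linear layer. A minimal sketch, where the 1280-d input assumes the esm2_t33_650M checkpoint and random tensors stand in for the precomputed embeddings:

import torch
import torch.nn as nn

esm2_emb = torch.randn(19357, 1280)   # stand-in for precomputed ESM2 vectors
proj = nn.Linear(1280, 320)           # project into the model dimension
gene_id_emb = proj(esm2_emb)          # (19357, 320): one identity vector per gene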

2. Rotary Expression Embeddings (REE)

  • Deterministic sinusoidal embedding of continuous expression values
  • Masked positions zeroed (mask token = -10)
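
A minimal sketch of such an embedding, using the standard transformer frequency schedule (an assumption; the schedule in this repo may differ):

import math

import torch

def sinusoidal_value_embedding(x: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Embed continuous values x of shape (..., genes) into (..., genes, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x.unsqueeze(-1) * freqs                 # (..., genes, half)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return emb * (x != -10.0).unsqueeze(-1)          # zero out masked positions

# e.g. sinusoidal_value_embedding(torch.tensor([[0.5, 3.2, -10.0]]))
# -> shape (1, 3, 320), with the masked (-10) position mapped to zeros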

3. Graph Neural Network Layer

  • GCNConv (Kipf & Welling) applied on a curated gene-gene graph
  • Injects biological prior knowledge
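
In code this is a single GCNConv call from torch_geometric; the sketch below assumes the same edge_index.pt format as the usage example further down (a (2, num_edges) tensor of gene-index pairs):

import torch
from torch_geometric.nn import GCNConv

gcn = GCNConv(320, 320)                   # keep the model dimension
edge_index = torch.load("edge_index.pt")  # curated gene-gene graph
h = torch.randn(19357, 320)               # per-gene hidden states
h = gcn(h, edge_index)                    # one round of propagation over the graph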

4. Expression Binning

  • Learnable importance scores sort genes
  • Genes divided into 10 bins
  • Each bin receives its own local Performer attention
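
A simplified sketch of the sort-and-bin routing, with SelfAttention from the performer-pytorch package standing in for the repo's own Performer blocks (an assumption). Note that a hard argsort is not differentiable, so the trained model's routing may be more involved:

import torch
import torch.nn as nn
from performer_pytorch import SelfAttention

num_genes, dim, bins = 19357, 320, 10
bin_size = -(-num_genes // bins)                      # ceil division

importance = nn.Parameter(torch.randn(num_genes))     # learnable per-gene scores
local_attn = nn.ModuleList(
    SelfAttention(dim=dim, heads=8, causal=False) for _ in range(bins)
)

h = torch.randn(1, num_genes, dim)
order = importance.argsort(descending=True)           # sort genes by importance
h = h[:, order]
h = torch.cat(                                        # attention within each bin
    [local_attn[b](h[:, b * bin_size:(b + 1) * bin_size]) for b in range(bins)],
    dim=1,
)
h = h[:, order.argsort()]                             # restore original gene order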

5. Global Performer Attention

  • 2 stacked Performer layers across all genes
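
Sketched with the same performer-pytorch stand-in; the residual connections are an assumption about the block structure:

import torch
import torch.nn as nn
from performer_pytorch import SelfAttention

global_attn = nn.ModuleList(
    SelfAttention(dim=320, heads=4, causal=False) for _ in range(2)
)
h = torch.randn(1, 19357, 320)
for layer in global_attn:
    h = h + layer(h)    # linear-complexity attention over all 19,357 genes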

6. Prediction Head

  • MLP β†’ scalar value per gene
  • Used for masked-expression reconstruction
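
A minimal sketch (the hidden width and GELU activation are assumptions):

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(320, 320),
    nn.GELU(),
    nn.Linear(320, 1),
)
h = torch.randn(1, 19357, 320)    # encoder output
pred = head(h).squeeze(-1)        # (1, 19357): one reconstructed value per gene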

Total parameters: 48,336,162 (~48M)


🎯 Intended Use

This model produces context-aware gene embeddings for downstream cancer transcriptomic tasks:

  • Tumor subtype prediction
  • Drug response modeling
  • Immune infiltration scoring
  • Survival / risk modeling
  • Gene expression imputation
  • Dimensionality reduction
  • Transfer learning to TCGA, CCLE, DepMap, and GEO tumor datasets

πŸš€ How to Use

Download & run:

import torch
from model import BulkFormer   # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),   # provide your graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2
)

state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: a 19,357-gene log-TPM vector
# (random values here stand in for real data)
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]: one value per gene
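
For the transfer-learning uses listed under Intended Use, you typically want per-gene hidden states rather than the reconstruction output. A generic way to grab them, continuing from the example above, is a forward hook; "global_attn" is a hypothetical module name, so inspect print(model) for the real layer names in this repo:

feats = {}

def grab(module, inputs, output):
    feats["h"] = output.detach()

# Hook the (hypothetical) last encoder layer
layer = dict(model.named_modules())["global_attn"]
handle = layer.register_forward_hook(grab)
with torch.no_grad():
    model(x)
handle.remove()

sample_emb = feats["h"].mean(dim=1)   # mean-pool over genes -> (1, 320)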