# 🧬 CancerTranscriptome-Mini-48M
A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq
CancerTranscriptome-Mini-48M is a lightweight derivative of BulkFormer, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
It combines GCN-based gene graph propagation, Rotary Expression Embeddings (REE), local bin-wise Performer attention, and global Performer attention into a single unified encoder.
This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
## 🔬 Origin & References
### Primary Reference (BulkFormer)
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
"A large-scale foundation model for bulk transcriptomes."
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222
### This Model (CancerTranscriptome-Mini-48M)
A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source Code: https://github.com/alwalt/BioFM
## 📊 Data Source
All training samples originate from the ARCHS4 Human RNA-seq v2.5 public repository:
ARCHS4 Reference:
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
"Massive mining of publicly available RNA-seq data from human and mouse."
Nature Communications 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/
### Filtering Procedure
- Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
- Selected samples whose annotations match:
  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used ARCHS4 log-TPM matrices (gene × sample)
- Final dataset: ~76k cancer samples, 19,357 genes (a sketch of the keyword filter appears below)
No private, clinical, controlled-access, or proprietary data were used.
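The keyword filter above can be reproduced in a few lines. The snippet below is a minimal sketch, not the exact pipeline used: the HDF5 filename and the metadata fields queried (`meta/samples/title`, `meta/samples/source_name_ch1`) are assumptions about the ARCHS4 v2.5 layout and should be verified against the actual file.

```python
# Illustrative sketch of the keyword filter, not the exact pipeline used.
# The filename and metadata field names below are assumptions about the
# ARCHS4 v2.5 HDF5 layout; inspect the file with h5py to confirm them.
import re
import h5py
import numpy as np

KEYWORDS = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma", re.IGNORECASE
)

with h5py.File("human_gene_v2.5.h5", "r") as f:  # hypothetical filename
    titles = [t.decode() for t in f["meta/samples/title"][:]]
    sources = [s.decode() for s in f["meta/samples/source_name_ch1"][:]]

keep = np.array([
    bool(KEYWORDS.search(t)) or bool(KEYWORDS.search(s))
    for t, s in zip(titles, sources)
])
print(f"Selected {keep.sum()} of {keep.size} samples")
```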
## 🧠 Model Architecture (Summary)
CancerTranscriptome-Mini-48M combines six components (a combined code sketch follows the list):
1. Gene Identity Embeddings
- Precomputed ESM2 embeddings for each protein-coding gene
- Projected into model dimension (320)
2. Rotary Expression Embeddings (REE)
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = -10)
3. Graph Neural Network Layer
- GCNConv (Kipf & Welling) applied on a curated gene-gene graph
- Injects biological prior knowledge
4. Expression Binning
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own local Performer attention
5. Global Performer Attention
- 2 stacked Performer layers across all genes
6. Prediction Head
- MLP → scalar value per gene
- Used for masked-expression reconstruction
Total parameters: 48,336,162 (~48M)
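To make the six components concrete, here is a minimal, self-contained sketch of how they can be wired together. It is illustrative, not the repo's implementation: the class and method names (`MiniEncoderSketch`, `ree`) are hypothetical, and it uses `torch_geometric`'s `GCNConv` and `performer-pytorch`'s `SelfAttention` as stand-ins for the actual layers in `model.py`.

```python
import math

import torch
import torch.nn as nn
from performer_pytorch import SelfAttention
from torch_geometric.nn import GCNConv


class MiniEncoderSketch(nn.Module):
    """Illustrative wiring of components 1-6; not the repo's model.py."""

    def __init__(self, dim=320, n_genes=19357, bins=10,
                 bin_heads=8, full_heads=4, esm2_dim=1280):
        super().__init__()
        self.dim, self.bins = dim, bins
        # 1. Gene identity: project precomputed ESM2 embeddings to model dim
        self.gene_proj = nn.Linear(esm2_dim, dim)
        # 3. GCN propagation over a curated gene-gene graph
        self.gcn = GCNConv(dim, dim)
        # 4. Learnable importance scores that sort genes into bins
        self.importance = nn.Parameter(torch.zeros(n_genes))
        self.local_attn = SelfAttention(dim=dim, heads=bin_heads, causal=False)
        # 5. Two stacked global Performer layers across all genes
        self.global_attn = nn.ModuleList(
            SelfAttention(dim=dim, heads=full_heads, causal=False)
            for _ in range(2)
        )
        # 6. Prediction head: MLP -> one scalar per gene
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1))

    def ree(self, x):
        # 2. Deterministic sinusoidal embedding of the continuous expression
        #    value; masked positions (mask token = -10) are zeroed out.
        exponents = torch.arange(0, self.dim, 2, device=x.device) / self.dim
        freqs = torch.exp(-math.log(10000.0) * exponents)  # [dim/2]
        ang = x.unsqueeze(-1) * freqs                       # [B, G, dim/2]
        emb = torch.cat([ang.sin(), ang.cos()], dim=-1)     # [B, G, dim]
        return emb * (x != -10).float().unsqueeze(-1)

    def forward(self, x, gene_emb, edge_index):
        # x: [B, n_genes] log-TPM; gene_emb: [n_genes, esm2_dim]
        h = self.gene_proj(gene_emb).unsqueeze(0) + self.ree(x)
        h = torch.stack([self.gcn(s, edge_index) for s in h])  # per-sample GCN
        # Sort genes by learned importance, attend locally within each bin
        order = self.importance.argsort(descending=True)
        h = torch.cat([self.local_attn(c)
                       for c in h[:, order].chunk(self.bins, dim=1)], dim=1)
        h = h[:, order.argsort()]         # restore original gene order
        for layer in self.global_attn:    # global Performer attention
            h = h + layer(h)
        return self.head(h).squeeze(-1)   # [B, n_genes] reconstruction


if __name__ == "__main__":
    enc = MiniEncoderSketch()
    x = torch.randn(2, 19357)                        # log-TPM profiles
    gene_emb = torch.randn(19357, 1280)              # stand-in ESM2 embeddings
    edge_index = torch.randint(0, 19357, (2, 5000))  # toy random gene graph
    print(enc(x, gene_emb, edge_index).shape)        # torch.Size([2, 19357])
```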
## 🎯 Intended Use
This model produces context-aware gene embeddings for downstream cancer transcriptomic tasks (a transfer-learning sketch follows the list):
- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets
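For the classification-style tasks above, a common pattern is to pool the encoder's per-gene hidden states into one vector per sample and train a small head on top. The sketch below is hypothetical: the `return_hidden=True` keyword is an assumed interface for reading the per-gene hidden states (check `model.py` for the real access point).

```python
# Hypothetical transfer-learning head for tumor subtype prediction. The
# `return_hidden=True` keyword is an assumption, standing in for whatever
# hook exposes the per-gene hidden states of shape [B, 19357, 320].
import torch
import torch.nn as nn

n_subtypes = 5                    # example: five tumor subtypes
clf_head = nn.Linear(320, n_subtypes)

def subtype_logits(model, x):
    hidden = model(x, return_hidden=True)  # assumed: [B, 19357, 320]
    pooled = hidden.mean(dim=1)            # mean-pool genes -> [B, 320]
    return clf_head(pooled)                # [B, n_subtypes]
```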
## 🚀 How to Use
Download & run:
```python
import torch
import safetensors.torch as st

from model import BulkFormer  # from this repo

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),        # provide your gene-gene graph
    gene_emb=torch.load("esm2_gene_emb.pt"),  # precomputed ESM2 gene embeddings
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2,
)
state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)
with torch.no_grad():
    out = model(x)

print(out.shape)  # torch.Size([1, 19357])
```
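Because the prediction head is trained for masked-expression reconstruction, one direct application is gene expression imputation. A minimal sketch continuing from the snippet above, assuming inference handles the -10 mask token the same way pretraining does:

```python
# Impute unmeasured genes: set them to the mask token (-10) and read the
# model's reconstruction at those positions. Assumes inference treats the
# mask token as in pretraining.
x = torch.randn(1, 19357)                        # observed log-TPM profile
missing = torch.zeros(1, 19357, dtype=torch.bool)
missing[:, :100] = True                          # pretend 100 genes are unmeasured
x_masked = x.masked_fill(missing, -10.0)

with torch.no_grad():
    recon = model(x_masked)                      # [1, 19357]

imputed = torch.where(missing, recon, x)         # keep observed values as-is
```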