|
--- |
|
license: cc-by-nc-4.0 |
|
tags: |
|
- m42 |
|
- genomics |
|
- biology |
|
- GFM |
|
- BioFM |
|
- BioToken |
|
--- |
|
|
|
# BioFM: A Biologically-Informed Genomic Foundation Model |
|
BioFM is a genomic foundation model that addresses key limitations of existing genomic sequence models. It introduces BioToken, a novel tokenization framework that encodes genomic variants and structural annotations directly into the input sequence, providing the biological context needed for more nuanced and accurate representation learning.
|
|
|
 |
|
|
|
## Model Highlights |
|
- With the introduction of BioToken, we achieved competitive genomic prediction results using only 265 million parameters, significantly reducing computational requirements and training costs. |
|
- Demonstrated superior performance to specialized supervised models on their own tasks: Enformer on expression prediction and SpliceTransformer on sQTL prediction.
|
- BioFM excels at various genomic tasks (e.g., expression prediction, coding/non-coding pathogenicity prediction, and sQTL prediction) that require long-range genomic contexts, outperforming existing GFMs. |
|
|
|
## Model Details |
|
- **Model developers:** M42 Health AI Team |
|
- **Base architecture:** [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM) |
|
- **Context length:**
  - **Training:** 6k tokens
  - **Inference:** 12k tokens
|
- **Training data:** 1000 Genomes |
|
- **Input format:** Annotated DNA sequences using BioToken |
|
- **Output options:**
  - DNA sequences only
  - Embeddings
|
- **License:** CC BY-NC 4.0 |
|
- **Publication:** [bioRxiv preprint](https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711)
|
|
|
## Model Inference |
|
We developed the BioFM-Eval Python package for inference and embedding extraction from genomic sequences. Refer to the [BioFM-Eval](https://github.com/m42-health/biofm-eval/) repository for setup and installation instructions.
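If the package follows standard Python packaging, installing directly from the repository with `pip install git+https://github.com/m42-health/biofm-eval.git` should work; treat the repository README as the authoritative source for installation.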
|
|
|
### Creating Variant Embeddings with BioFM |
|
|
|
This guide will help you quickly generate BioFM embeddings for the variants in your VCF file. These embeddings are created using the method described in our publication. |
|
|
|
```python |
|
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Embedder, VCFConverter |
|
import torch |
|
|
|
# Define paths to the pre-trained BioFM model and tokenizer |
|
MODEL_PATH = "m42-health/BioFM-265M" |
|
TOKENIZER_PATH = "m42-health/BioFM-265M" |
|
|
|
# Load the pre-trained BioFM model and BioToken tokenizer |
|
model = AnnotatedModel.from_pretrained( |
|
MODEL_PATH, |
|
torch_dtype=torch.bfloat16, |
|
) |
|
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH) |
|
|
|
# Initialize the embedder using the model and tokenizer |
|
embedder = Embedder(model, tokenizer) |
|
|
|
# Set up the VCF converter with paths to gene annotations and reference genome |
|
vcf_converter = VCFConverter( |
|
gene_annotation_path="./gencode.v38.annotation.gff3", |
|
reference_genome_path="./GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna" |
|
) |
|
|
|
# Convert a VCF file into an annotated dataset using BioTokens |
|
annotated_dataset = vcf_converter.vcf_to_annotated_dataset(
    vcf_path='./HG01779_b.vcf.gz',
    max_variants=200,  # Set to None to process all variants in the VCF file
)
|
|
|
# Extract BioFM embeddings for all annotated variants |
|
embeddings = embedder.get_dataset_embeddings(annotated_dataset) |
|
print(embeddings) |
|
|
|
# Example output (dict): |
|
# { |
|
# 'embeddings': array of shape (num_variants, 2*embedding_dim), # Numeric embeddings for each variant |
|
# 'labels': array of shape (num_variants,) # Present only during supervised embedding extraction |
|
# } |
|
|
|
``` |
|
- Sample reference genome FASTA file: [download link](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/)

- Gene annotation file: [download link](https://www.gencodegenes.org/human/release_38.html)

- Sample VCF file from 1000 Genomes data: [download link](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/)
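Once extracted, the embeddings can feed a downstream classifier. The sketch below is illustrative only: it assumes the `embeddings` dict from the snippet above includes a `labels` array (i.e., supervised embedding extraction), and the scikit-learn logistic regression is our example choice, not necessarily the setup used in the publication.

```python
# Hedged sketch: train a simple classifier on BioFM variant embeddings.
# Assumes `embeddings` comes from embedder.get_dataset_embeddings() above
# and contains a 'labels' array alongside the embedding matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.asarray(embeddings["embeddings"])  # shape: (num_variants, 2*embedding_dim)
y = np.asarray(embeddings["labels"])      # shape: (num_variants,)

# Hold out 20% of variants for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```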
|
|
|
|
|
### Generation with BioFM |
|
BioFM can generate genomic sequences based on input DNA prompts. |
|
|
|
```python |
|
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Generator |
|
import torch |
|
|
|
# Define paths to the pre-trained BioFM model and tokenizer |
|
MODEL_PATH = "m42-health/BioFM-265M" |
|
TOKENIZER_PATH = "m42-health/BioFM-265M" |
|
|
|
# Load the pre-trained BioFM model and BioToken tokenizer |
|
model = AnnotatedModel.from_pretrained( |
|
MODEL_PATH, |
|
torch_dtype=torch.bfloat16, |
|
) |
|
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH) |
|
|
|
# Initialize the generator using the model and tokenizer
|
seq_generator = Generator(model, tokenizer) |
|
|
|
# Generate DNA sequences |
|
input_sequences = ['AGCT', 'GACTGCA'] |
|
output = seq_generator.generate(
    input_sequences,
    max_new_tokens=10,
    temperature=1.0,
    do_sample=True,
    top_k=4,
)
|
|
|
print(output) |
|
|
|
# Example output: List[str] = ['AGCTACTCCCCTCC', 'GACTGCACCACTGTACT'] |
|
|
|
``` |
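A note on the sampling parameters above: since the outputs are plain nucleotide strings, `top_k=4` plausibly restricts sampling to the four base tokens (A, C, G, T), while `temperature` controls how deterministic the continuation is. These values are illustrative rather than recommended settings.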
|
|
|
## Training Setup |
|
|
|
Training was conducted on an NVIDIA DGX cluster with H100 GPUs, using PyTorch's Fully Sharded Data Parallel (FSDP) framework.
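For readers who want a starting point, the sketch below shows a minimal FSDP wrap in PyTorch. It is an illustration under assumptions, not the team's actual training recipe: the launcher command, learning rate, and the use of `AnnotatedModel` here are placeholders.

```python
# Minimal FSDP sketch (illustrative; not the exact BioFM training recipe).
# Assumed launch: torchrun --nproc_per_node=8 train_fsdp.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from biofm_eval import AnnotatedModel

# One process per GPU; torchrun sets the env vars init_process_group needs
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = AnnotatedModel.from_pretrained(
    "m42-health/BioFM-265M",
    torch_dtype=torch.bfloat16,
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks
fsdp_model = FSDP(model)
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)  # placeholder LR
```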
|
|
|
## Evaluation Results |
|
|
|
To demonstrate the effectiveness of BioToken, we evaluated BioFM against strong supervised baselines: Enformer for gene expression prediction and SpliceTransformer for sQTL prediction.
|
|
|
- *Gene Expression Prediction:* BioFM matches Enformer's performance when both models use a 12K context, making it the first-ever GFM to achieve this. Notably, Enformer fails to reach this performance level even with a 98K context. |
|
- *sQTL Prediction:* BioFM significantly outperforms SpliceTransformer across all tissues, highlighting its robustness and generalizability.
|
|
|
| sQTL prediction | Expression prediction | |
|
|---------|---------| |
|
|  |  | |
|
|
|
We further evaluated BioFM on our curated Variant Benchmark and on the Genomics Long-Range Benchmark.
|
|
|
- *Variant Benchmark:* Across a broad spectrum of [variant prediction tasks](https://huggingface.co/datasets/m42-health/variant-benchmark), BioFM outperforms other GFMs, showcasing its superior predictive capabilities. |
|
- *Long-Range Genomic Dependencies:* On the Genomics Long-Range Benchmark, BioFM sets new performance standards, surpassing previous GFMs that required extensive fine-tuning and longer genomic contexts. This highlights BioFM’s ability to effectively capture and utilize long-range genomic dependencies. |
|
|
|
| Variant benchmark | Genomics long-range benchmark | |
|
|---------|---------| |
|
|  |  | |
|
|
|
Please refer to the [paper](https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711) for more results and ablations.
|
|
|
## Citation |
|
``` |
|
@article{Medvedev2025.03.27.645711,
|
author = {Medvedev, Aleksandr and Viswanathan, Karthik and Kanithi, Praveenkumar and Vishniakov, Kirill and Munjal, Prateek and Christophe, Clement and Pimentel, Marco AF and Rajan, Ronnie and Khan, Shadab}, |
|
title = {BioToken and BioFM - Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models}, |
|
elocation-id = {2025.03.27.645711}, |
|
year = {2025}, |
|
doi = {10.1101/2025.03.27.645711}, |
|
publisher = {Cold Spring Harbor Laboratory}, |
|
URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711}, |
|
eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711.full.pdf}, |
|
journal = {bioRxiv} |
|
} |
|
``` |
|
|