---
license: cc-by-nc-sa-4.0
library_name: pytorch
tags:
- proteomics
- mass-spectrometry
- peptide-sequencing
- de-novo-sequencing
- diffusion
- multinomial-diffusion
- biology
- computational-biology
pipeline_tag: text-generation
datasets:
- InstaDeepAI/ms_ninespecies_benchmark
- InstaDeepAI/ms_proteometools
---

# InstaNovoPlus: Diffusion-Powered De novo Peptide Sequencing Model

## Model Description

InstaNovoPlus is a diffusion-based model for de novo peptide sequencing from mass spectrometry data. It uses multinomial diffusion for accurate, database-free peptide identification in large-scale proteomics experiments.
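
For readers new to multinomial diffusion, the forward (noising) process on a token sequence can be sketched as below. This is a generic illustration of the technique, not the InstaNovoPlus implementation: the toy vocabulary `VOCAB`, the helper names `noise_step` and `forward_diffusion`, and the noise schedule are all assumptions made for the example. At each step a token is kept with probability `1 - beta` and otherwise resampled uniformly from the vocabulary.

```python
import random

# Hypothetical toy vocabulary of the 20 standard amino acids;
# the real model uses its own residue set, including modifications.
VOCAB = list("ACDEFGHIKLMNPQRSTVWY")


def noise_step(tokens, beta, rng):
    """One forward step: keep each token with probability 1 - beta,
    otherwise resample it uniformly from the vocabulary."""
    return [tok if rng.random() >= beta else rng.choice(VOCAB) for tok in tokens]


def forward_diffusion(tokens, betas, rng):
    """Apply the noising step once per entry in the schedule `betas`."""
    trajectory = [list(tokens)]
    for beta in betas:
        trajectory.append(noise_step(trajectory[-1], beta, rng))
    return trajectory


rng = random.Random(0)
peptide = list("PEPTIDE")
# Illustrative schedule only; the final beta of 1.0 yields pure uniform noise.
trajectory = forward_diffusion(peptide, betas=[0.1, 0.3, 0.6, 1.0], rng=rng)
for t, seq in enumerate(trajectory):
    print(t, "".join(seq))
```

A diffusion model of this kind is trained to undo such corruption step by step, which is why decoding starts from an initial sequence and refines it.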

## Usage

```python
import torch
import numpy as np
import pandas as pd
from instanovo.diffusion.multinomial_diffusion import InstaNovoPlus
from instanovo.utils import SpectrumDataFrame
from instanovo.transformer.dataset import SpectrumDataset, collate_batch
from torch.utils.data import DataLoader
from instanovo.inference import ScoredSequence
from instanovo.inference.diffusion import DiffusionDecoder
from instanovo.utils.metrics import Metrics
from tqdm.notebook import tqdm

# Load the model from the Hugging Face Hub
model, config = InstaNovoPlus.from_pretrained("InstaDeepAI/instanovoplus-v1.1.0")

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Update the residue set with custom modifications
model.residue_set.update_remapping(
    {
        "M(ox)": "M[UNIMOD:35]",  # Oxidation
        "M(+15.99)": "M[UNIMOD:35]",
        "S(p)": "S[UNIMOD:21]",  # Phosphorylation
        "T(p)": "T[UNIMOD:21]",
        "Y(p)": "Y[UNIMOD:21]",
        "S(+79.97)": "S[UNIMOD:21]",
        "T(+79.97)": "T[UNIMOD:21]",
        "Y(+79.97)": "Y[UNIMOD:21]",
        "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
        "N(+0.98)": "N[UNIMOD:7]",
        "Q(+.98)": "Q[UNIMOD:7]",
        "N(+.98)": "N[UNIMOD:7]",
        "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
        "(+42.01)": "[UNIMOD:1]",  # Acetylation
        "(+43.01)": "[UNIMOD:5]",  # Carbamylation
        "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
    }
)

# Load the test data
sdf = SpectrumDataFrame.from_huggingface(
    "InstaDeepAI/ms_ninespecies_benchmark",
    is_annotated=True,
    shuffle=False,
    split="test[:10%]",  # Let's only use a subset of the test data for faster inference
)

# Create the dataset
ds = SpectrumDataset(
    sdf,
    model.residue_set,
    config.get("n_peaks", 200),
    return_str=False,
    annotated=True,
    peptide_pad_length=model.config.get("max_length", 30),
    reverse_peptide=False,  # we do not reverse peptide for diffusion
    add_eos=False,
    tokenize_peptide=True,
)

# Create the data loader
dl = DataLoader(
    ds,
    batch_size=64,
    num_workers=0,  # sdf requirement, handled internally
    shuffle=False,  # sdf requirement, handled internally
    collate_fn=collate_batch,
)

# Create the decoder
diffusion_decoder = DiffusionDecoder(model=model)

predictions = []
log_probs = []
targets = []

# Iterate over the data loader
for batch in tqdm(dl, total=len(dl)):
    spectra, precursors, spectra_padding_mask, peptides, _ = batch
    spectra = spectra.to(device)
    precursors = precursors.to(device)
    spectra_padding_mask = spectra_padding_mask.to(device)
    peptides = peptides.to(device)

    # Perform inference
    with torch.no_grad():
        batch_predictions, batch_log_probs = diffusion_decoder.decode(
            spectra=spectra,
            spectra_padding_mask=spectra_padding_mask,
            precursors=precursors,
            initial_sequence=peptides,
        )
    predictions.extend(batch_predictions)
    log_probs.extend(batch_log_probs)
    # Keep the ground-truth peptides for evaluation.
    # Assumption: residue_set.decode maps a tensor of token ids back to residue strings.
    targets.extend(model.residue_set.decode(p, reverse=False) for p in peptides)

# Initialize metrics
metrics = Metrics(model.residue_set, config["isotope_error_range"])

# Compute precision and recall
aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
    targets, predictions
)

# Compute amino acid error rate and AUC
aa_error_rate = metrics.compute_aa_er(targets, predictions)
auc = metrics.calc_auc(targets, predictions, np.exp(pd.Series(log_probs)))

print(f"amino acid error rate:    {aa_error_rate:.5f}")
print(f"amino acid precision:     {aa_precision:.5f}")
print(f"amino acid recall:        {aa_recall:.5f}")
print(f"peptide precision:        {peptide_precision:.5f}")
print(f"peptide recall:           {peptide_recall:.5f}")
print(f"area under the PR curve:  {auc:.5f}")
```
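
The decoder returns one log-probability per prediction, and the example exponentiates these values before computing the AUC. As a minimal sketch of how the same scores could be used to keep only confident predictions, consider the helper below. The function `filter_confident` and the 0.5 cut-off are hypothetical choices for illustration and are not part of the `instanovo` API; the inputs are toy values, not real model output.

```python
import math


def filter_confident(predictions, log_probs, min_prob=0.5):
    """Return (prediction, probability) pairs with probability >= min_prob.

    `predictions` and `log_probs` are parallel lists, as collected in an
    inference loop. `min_prob` is an arbitrary illustrative cut-off.
    """
    if len(predictions) != len(log_probs):
        raise ValueError("predictions and log_probs must have the same length")
    kept = []
    for pred, log_p in zip(predictions, log_probs):
        prob = math.exp(log_p)  # convert log-probability to probability
        if prob >= min_prob:
            kept.append((pred, prob))
    return kept


# Toy values for illustration only.
example_predictions = ["PEPTIDE", "SEQVENCE", "ANALYSIS"]
example_log_probs = [-0.05, -2.30, -0.60]
confident = filter_confident(example_predictions, example_log_probs)
for pred, prob in confident:
    print(f"{pred}\t{prob:.3f}")
```

A suitable threshold depends on the dataset and on the precision required downstream, so it is best chosen from a precision-recall curve rather than fixed in advance.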

For more explanation, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository.

## Citation

If you use InstaNovoPlus in your research, please cite:

```bibtex
@article{eloff_kalogeropoulos_2025_instanovo,
  title   = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
             proteomics experiments},
  author  = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
             Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
             Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
             and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
             Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
             Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
  year    = {2025},
  month   = {Mar},
  day     = {31},
  journal = {Nature Machine Intelligence},
  doi     = {10.1038/s42256-025-01019-5},
  issn    = {2522-5839},
  url     = {https://doi.org/10.1038/s42256-025-01019-5}
}
```

## Resources

- **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo)
- **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/)
- **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5)

## License

- **Code**: Licensed under Apache License 2.0
- **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)

## Installation

```bash
pip install instanovo
```

For GPU support, install with CUDA dependencies:

```bash
pip install "instanovo[cu126]"
```

## Requirements

- Python >= 3.10, < 3.13
- PyTorch >= 1.13.0
- CUDA (optional, for GPU acceleration)
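
One way to check an environment against the Python constraint listed above is sketched below. The helper name `python_version_supported` is hypothetical, and the bounds are copied from the list; the check uses only the standard library.

```python
import sys

# Bounds taken from the requirements list: Python >= 3.10, < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_supported(version=None):
    """Return True if `version` (default: the running interpreter) is in range."""
    major_minor = tuple(version if version is not None else sys.version_info)[:2]
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


if not python_version_supported():
    print(f"Unsupported Python version: {sys.version.split()[0]}")
```

GPU availability can be checked separately with `torch.cuda.is_available()`, as in the usage example.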

## Support

For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/).