InstaDeepAI/instanovo-v1.0.0

@@ -1,10 +1,195 @@
 ---
 tags:
-- model_hub_mixin
-- pytorch_model_hub_mixin
 ---
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Code: [More Information Needed]
-- Paper: [More Information Needed]
-- Docs: [More Information Needed]

 ---
+license: cc-by-nc-sa-4.0
+library_name: pytorch
 tags:
+- proteomics
+- mass-spectrometry
+- peptide-sequencing
+- de-novo-sequencing
+- transformer
+- biology
+- computational-biology
+pipeline_tag: text-generation
+datasets:
+- InstaDeepAI/ms_ninespecies_benchmark
+- InstaDeepAI/ms_proteometools
 ---
+# InstaNovo: De novo Peptide Sequencing Model
+## Model Description
+InstaNovo is a state-of-the-art transformer-based model for de novo peptide sequencing from tandem mass spectrometry (MS/MS) data. It predicts peptide sequences directly from MS/MS spectra without requiring a protein database, which enables accurate, database-free peptide identification in large-scale proteomics experiments. This makes it particularly valuable for discovering novel peptides, identifying post-translational modifications, and sequencing peptides from organisms with incomplete genomic databases.
+## Usage
+```python
+import torch
+import numpy as np
+import pandas as pd
+from instanovo.transformer.model import InstaNovo
+from instanovo.utils import SpectrumDataFrame
+from instanovo.transformer.dataset import SpectrumDataset, collate_batch
+from torch.utils.data import DataLoader
+from instanovo.inference import ScoredSequence
+from instanovo.inference import BeamSearchDecoder
+from instanovo.utils.metrics import Metrics
+from tqdm.notebook import tqdm
+# Load the model from the Hugging Face Hub
+model, config = InstaNovo.from_pretrained("InstaDeepAI/instanovo-v1.0.0")
+# Move the model to the GPU if available
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device).eval()
+# Map alternative modification notations onto the UNIMOD tokens in the residue set
+model.residue_set.update_remapping(
+    {
+        "M(ox)": "M[UNIMOD:35]",  # Oxidation
+        "M(+15.99)": "M[UNIMOD:35]",
+        "S(p)": "S[UNIMOD:21]",  # Phosphorylation
+        "T(p)": "T[UNIMOD:21]",
+        "Y(p)": "Y[UNIMOD:21]",
+        "S(+79.97)": "S[UNIMOD:21]",
+        "T(+79.97)": "T[UNIMOD:21]",
+        "Y(+79.97)": "Y[UNIMOD:21]",
+        "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
+        "N(+0.98)": "N[UNIMOD:7]",
+        "Q(+.98)": "Q[UNIMOD:7]",
+        "N(+.98)": "N[UNIMOD:7]",
+        "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
+        "(+42.01)": "[UNIMOD:1]",  # Acetylation
+        "(+43.01)": "[UNIMOD:5]",  # Carbamylation
+        "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
+    }
+)
+# Load the test data
+sdf = SpectrumDataFrame.from_huggingface(
+    "InstaDeepAI/ms_ninespecies_benchmark",
+    is_annotated=True,
+    shuffle=False,
+    split="test[:10%]",  # Let's only use a subset of the test data for faster inference
+)
+# Create the dataset
+ds = SpectrumDataset(
+    sdf,
+    model.residue_set,
+    config.get("n_peaks", 200),
+    return_str=True,
+    annotated=True,
+)
+# Create the data loader
+dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch)
+# Create the decoder
+decoder = BeamSearchDecoder(model=model)
+# Initialize lists to store predictions and targets
+preds = []
+targs = []
+probs = []
+# Iterate over the data loader
+for batch in tqdm(dl, total=len(dl)):
+    spectra, precursors, _, peptides, _ = batch
+    spectra = spectra.to(device)
+    precursors = precursors.to(device)
+    # Perform inference
+    with torch.no_grad():
+        p = decoder.decode(
+            spectra=spectra,
+            precursors=precursors,
+            beam_size=config["n_beams"],
+            max_length=config["max_length"],
+        )
+    preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]
+    probs += [
+        x.sequence_log_probability if isinstance(x, ScoredSequence) else -float("inf") for x in p
+    ]
+    targs += list(peptides)
+# Initialize metrics
+metrics = Metrics(model.residue_set, config["isotope_error_range"])
+# Compute precision and recall
+aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
+    targs, preds
+)
+# Compute amino acid error rate and AUC
+aa_error_rate = metrics.compute_aa_er(targs, preds)
+auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))
+print(f"amino acid error rate:    {aa_error_rate:.5f}")
+print(f"amino acid precision:     {aa_precision:.5f}")
+print(f"amino acid recall:        {aa_recall:.5f}")
+print(f"peptide precision:        {peptide_precision:.5f}")
+print(f"peptide recall:           {peptide_recall:.5f}")
+print(f"area under the PR curve:  {auc:.5f}")
+```
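The loop above stores a log probability for each prediction and uses `-float("inf")` when decoding returns no sequence. As a minimal sketch, those values can be turned into confidences and used to keep only confident predictions. The code is plain Python and independent of the InstaNovo API; `filter_by_confidence`, `min_confidence`, and the toy inputs are hypothetical names and values chosen for illustration.

```python
import math


def filter_by_confidence(sequences, log_probs, min_confidence=0.5):
    """Keep (sequence, confidence) pairs whose exp(log-probability) meets the threshold."""
    kept = []
    for seq, log_p in zip(sequences, log_probs):
        # math.exp(-inf) is 0.0, so failed decodes never pass the threshold
        confidence = math.exp(log_p)
        if seq and confidence >= min_confidence:
            kept.append((seq, confidence))
    return kept


# Toy values standing in for `preds` and `probs` from the loop above (assumed shapes)
example_preds = [["P", "E", "P"], [], ["T", "I", "D", "E"]]
example_log_probs = [-0.1, -float("inf"), -2.3]

confident = filter_by_confidence(example_preds, example_log_probs, min_confidence=0.5)
print(confident)
```

The threshold is a trade-off: raising `min_confidence` typically improves precision at the cost of recall, which is the relationship the precision-recall AUC in the usage example summarises.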
+For more explanation, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository.
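The mapping passed to `update_remapping` in the usage example pairs alternative modification notations with UNIMOD tokens. The sketch below illustrates that idea with ordinary string replacement on a peptide string. It is not InstaNovo's implementation; `to_unimod`, `REMAPPING`, and the example peptide are hypothetical and only reuse three entries from the mapping above.

```python
# A small subset of the remapping shown in the usage example
REMAPPING = {
    "M(ox)": "M[UNIMOD:35]",
    "C(+57.02)": "C[UNIMOD:4]",
    "(+42.01)": "[UNIMOD:1]",
}


def to_unimod(peptide, remapping=REMAPPING):
    """Rewrite legacy modification notation in a peptide string as UNIMOD tokens."""
    # Replace longer keys first so a short key cannot rewrite part of a longer one
    for legacy in sorted(remapping, key=len, reverse=True):
        peptide = peptide.replace(legacy, remapping[legacy])
    return peptide


converted = to_unimod("(+42.01)PEPTM(ox)C(+57.02)K")
print(converted)
```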
+## Citation
+If you use InstaNovo in your research, please cite:
+```bibtex
+@article{eloff_kalogeropoulos_2025_instanovo,
+        title        = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
+                        proteomics experiments},
+        author       = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
+                        Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
+                        Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
+                        and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
+                        Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
+                        Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
+        year         = {2025},
+        month        = {Mar},
+        day          = {31},
+        journal      = {Nature Machine Intelligence},
+        doi          = {10.1038/s42256-025-01019-5},
+        issn         = {2522-5839},
+        url          = {https://doi.org/10.1038/s42256-025-01019-5}
+}
+```
+## Resources
+- **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo)
+- **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/)
+- **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5)
+## License
+- **Code**: Licensed under Apache License 2.0
+- **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
+## Installation
+```bash
+pip install instanovo
+```
+For GPU support, install with CUDA dependencies:
+```bash
+pip install "instanovo[cu126]"
+```
+## Requirements
+- Python >= 3.10, < 3.13
+- PyTorch >= 1.13.0
+- CUDA (optional, for GPU acceleration)
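The requirements above can be checked before installing or running the model. The following is a minimal sketch using only the standard library plus an optional `torch` import; `python_version_supported` and `describe_environment` are hypothetical helper names, and the script does not verify the minimum PyTorch version, it only reports what is installed.

```python
import importlib.util
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True if the interpreter is within the stated range (>= 3.10, < 3.13)."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


def describe_environment():
    """Collect a small report on the Python version, PyTorch, and CUDA availability."""
    report = {
        "python_ok": python_version_supported(),
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


print(describe_environment())
```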
+## Support
+For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/).