Commit 7c9e0b5 (verified) by BioGeek · 1 parent: f6d4426

Add README for instanovo-v1.0.0

Files changed (1):
  1. README.md +191 -6
README.md CHANGED
@@ -1,10 +1,195 @@
  ---
+ license: cc-by-nc-sa-4.0
+ library_name: pytorch
  tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
+ - proteomics
+ - mass-spectrometry
+ - peptide-sequencing
+ - de-novo-sequencing
+ - transformer
+ - biology
+ - computational-biology
+ pipeline_tag: text-generation
+ datasets:
+ - InstaDeepAI/ms_ninespecies_benchmark
+ - InstaDeepAI/ms_proteometools
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Code: [More Information Needed]
- - Paper: [More Information Needed]
- - Docs: [More Information Needed]
+ # InstaNovo: De novo Peptide Sequencing Model
+ ## Model Description
+
+ InstaNovo is a transformer-based model for de novo peptide sequencing from tandem mass spectrometry (MS/MS) data. It predicts peptide sequences directly from MS/MS spectra without requiring a protein database, which enables database-free peptide identification in large-scale proteomics experiments. This makes it particularly valuable for discovering novel peptides, post-translational modifications, and sequences from organisms with incomplete genomic databases.
+
+ ## Usage
+
+ ```python
+ import torch
+ import numpy as np
+ import pandas as pd
+ from instanovo.transformer.model import InstaNovo
+ from instanovo.utils import SpectrumDataFrame
+ from instanovo.transformer.dataset import SpectrumDataset, collate_batch
+ from torch.utils.data import DataLoader
+ from instanovo.inference import ScoredSequence
+ from instanovo.inference import BeamSearchDecoder
+ from instanovo.utils.metrics import Metrics
+ from tqdm.notebook import tqdm
+
+ # Load the model from the Hugging Face Hub
+ model, config = InstaNovo.from_pretrained("InstaDeepAI/instanovo-v1.0.0")
+
+ # Move the model to the GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device).eval()
+
+ # Update the residue set with custom modifications
+ model.residue_set.update_remapping(
+     {
+         "M(ox)": "M[UNIMOD:35]",  # Oxidation
+         "M(+15.99)": "M[UNIMOD:35]",
+         "S(p)": "S[UNIMOD:21]",  # Phosphorylation
+         "T(p)": "T[UNIMOD:21]",
+         "Y(p)": "Y[UNIMOD:21]",
+         "S(+79.97)": "S[UNIMOD:21]",
+         "T(+79.97)": "T[UNIMOD:21]",
+         "Y(+79.97)": "Y[UNIMOD:21]",
+         "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
+         "N(+0.98)": "N[UNIMOD:7]",
+         "Q(+.98)": "Q[UNIMOD:7]",
+         "N(+.98)": "N[UNIMOD:7]",
+         "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
+         "(+42.01)": "[UNIMOD:1]",  # Acetylation
+         "(+43.01)": "[UNIMOD:5]",  # Carbamylation
+         "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
+     }
+ )
+
+ # Load the test data
+ sdf = SpectrumDataFrame.from_huggingface(
+     "InstaDeepAI/ms_ninespecies_benchmark",
+     is_annotated=True,
+     shuffle=False,
+     split="test[:10%]",  # Use only a subset of the test data for faster inference
+ )
+
+ # Create the dataset
+ ds = SpectrumDataset(
+     sdf,
+     model.residue_set,
+     config.get("n_peaks", 200),
+     return_str=True,
+     annotated=True,
+ )
+
+ # Create the data loader
+ dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch)
+
+ # Create the decoder
+ decoder = BeamSearchDecoder(model=model)
+
+ # Initialize lists to store predictions, targets, and log-probabilities
+ preds = []
+ targs = []
+ probs = []
+
+ # Iterate over the data loader
+ for _, batch in tqdm(enumerate(dl), total=len(dl)):
+     spectra, precursors, _, peptides, _ = batch
+     spectra = spectra.to(device)
+     precursors = precursors.to(device)
+
+     # Perform inference
+     with torch.no_grad():
+         p = decoder.decode(
+             spectra=spectra,
+             precursors=precursors,
+             beam_size=config["n_beams"],
+             max_length=config["max_length"],
+         )
+
+     # Failed decodes get an empty sequence and a log-probability of -inf
+     preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]
+     probs += [
+         x.sequence_log_probability if isinstance(x, ScoredSequence) else -float("inf") for x in p
+     ]
+     targs += list(peptides)
+
+ # Initialize metrics
+ metrics = Metrics(model.residue_set, config["isotope_error_range"])
+
+ # Compute precision and recall over all targets, not just the last batch
+ aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
+     targs, preds
+ )
+
+ # Compute amino acid error rate and AUC
+ aa_error_rate = metrics.compute_aa_er(targs, preds)
+ auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))
+
+ print(f"amino acid error rate: {aa_error_rate:.5f}")
+ print(f"amino acid precision: {aa_precision:.5f}")
+ print(f"amino acid recall: {aa_recall:.5f}")
+ print(f"peptide precision: {peptide_precision:.5f}")
+ print(f"peptide recall: {peptide_recall:.5f}")
+ print(f"area under the PR curve: {auc:.5f}")
+ ```
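The `update_remapping` call in the usage example maps legacy modification notations (such as `M(ox)`) onto UNIMOD identifiers. The sketch below illustrates the idea with plain string replacement. It is not InstaNovo's internal implementation: `REMAPPING` is a small subset of the table above, and `remap_peptide` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical subset of the remapping table from the usage example.
REMAPPING = {
    "M(ox)": "M[UNIMOD:35]",
    "M(+15.99)": "M[UNIMOD:35]",
    "C(+57.02)": "C[UNIMOD:4]",
    "S(p)": "S[UNIMOD:21]",
}


def remap_peptide(peptide: str, remapping: dict[str, str]) -> str:
    """Rewrite legacy modification notation into UNIMOD notation.

    Longer keys are replaced first so that a short key can never
    clobber part of a longer one.
    """
    for legacy in sorted(remapping, key=len, reverse=True):
        peptide = peptide.replace(legacy, remapping[legacy])
    return peptide


example = remap_peptide("AC(+57.02)M(ox)K", REMAPPING)
print(example)
```

Unmodified peptides pass through unchanged, so a helper like this can be applied to a whole annotation column before comparing targets with predictions.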
+
+ For a fuller walkthrough, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository.
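The usage example converts sequence log-probabilities to confidences with `np.exp` and reports peptide-level precision and recall. The following is a minimal, dependency-free sketch of what those two numbers mean at a fixed confidence threshold. It assumes exact string equality between target and prediction and a hypothetical function name `peptide_precision_recall`; the library's `Metrics` class may match sequences differently.

```python
import math


def peptide_precision_recall(targets, predictions, log_probs, threshold=0.5):
    """Peptide-level precision and recall at a confidence threshold.

    A prediction counts as confident when exp(log_prob) >= threshold,
    and as correct when it equals the target exactly (an assumption).
    """
    confident = 0
    correct = 0
    for target, pred, log_prob in zip(targets, predictions, log_probs):
        if math.exp(log_prob) < threshold:
            continue  # below the confidence cut-off, treated as no prediction
        confident += 1
        if pred == target:
            correct += 1
    precision = correct / confident if confident else 0.0
    recall = correct / len(targets) if targets else 0.0
    return precision, recall


# Toy data: one confident hit, one confident miss, one low-confidence hit.
targets = ["PEPTIDE", "ACDK", "LLGK"]
predictions = ["PEPTIDE", "ACDR", "LLGK"]
log_probs = [math.log(0.9), math.log(0.8), math.log(0.2)]

precision, recall = peptide_precision_recall(targets, predictions, log_probs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A log-probability of `-inf`, which the usage example assigns to failed decodes, becomes a confidence of 0.0 and is therefore excluded at any positive threshold.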
+
+ ## Citation
+
+ If you use InstaNovo in your research, please cite:
+
+ ```bibtex
+ @article{eloff_kalogeropoulos_2025_instanovo,
+   title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
+            proteomics experiments},
+   author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
+             Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
+             Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
+             and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
+             Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
+             Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
+   year = {2025},
+   month = {Mar},
+   day = {31},
+   journal = {Nature Machine Intelligence},
+   doi = {10.1038/s42256-025-01019-5},
+   issn = {2522-5839},
+   url = {https://doi.org/10.1038/s42256-025-01019-5}
+ }
+ ```
+
+ ## Resources
+
+ - **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo)
+ - **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/)
+ - **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5)
+
+ ## License
+
+ - **Code**: Licensed under the Apache License 2.0
+ - **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
+
+ ## Installation
+
+ ```bash
+ pip install instanovo
+ ```
+
+ For GPU support, install with CUDA dependencies (the quotes keep shells such as zsh from expanding the brackets):
+ ```bash
+ pip install "instanovo[cu126]"
+ ```
+
+ ## Requirements
+
+ - Python >= 3.10, < 3.13
+ - PyTorch >= 1.13.0
+ - CUDA (optional, for GPU acceleration)
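Before installing, it can help to confirm that the local interpreter falls inside the supported range listed above. The snippet below is a small standard-library sketch; the helper names `python_version_supported` and `torch_installed` are hypothetical and not part of InstaNovo.

```python
import importlib.util
import sys


def python_version_supported(version=None):
    """Return True when the (major, minor) version is >= 3.10 and < 3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 13)


def torch_installed():
    """Return True when a `torch` package can be found, without importing it."""
    return importlib.util.find_spec("torch") is not None


print(f"Python version supported: {python_version_supported()}")
print(f"PyTorch installed: {torch_installed()}")
```

This only checks that PyTorch is present, not that it meets the `>= 1.13.0` bound; inspecting `torch.__version__` after import would be needed for that.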
+
+ ## Support
+
+ For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/).