JemmaDaniel committed on
Commit
a4744ac
·
verified ·
1 Parent(s): 472d57d

Create README.md

  1. README.md +113 -3
README.md CHANGED
@@ -1,3 +1,113 @@
1
- ---
2
- license: cc0-1.0
3
- ---
1
+ ---
2
+ license: cc0-1.0
3
+ ---
4
+
5
+ ## Winnow HeLa Single Shot Probability Calibrator
6
+
7
+ **Winnow** recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows.
8
+ This repository contains the calibrator trained on HeLa Single Shot data as referenced in our paper: TODO.
9
+
10
+ - Intended inputs: spectrum data and the corresponding MS/MS PSM results produced by InstaNovo.
11
+ - Outputs: calibrated per-PSM probabilities in `calibrated_confidence`.
12
+
13
+ ### What’s inside
14
+ - `calibrator.pkl`: trained classifier
15
+ - `scaler.pkl`: feature standardiser
16
+ - `irt_predictor.pkl`: Prosit iRT regressor used by RT features
17
+
18
+ ---
19
+
20
+ ## How to use
21
+
22
+ ### Python
23
+ ```python
+ from pathlib import Path
+ from huggingface_hub import snapshot_download
+ from winnow.calibration.calibrator import ProbabilityCalibrator
+ from winnow.datasets.data_loaders import InstaNovoDatasetLoader
+ from winnow.scripts.main import filter_dataset
+ from winnow.fdr.nonparametric import NonParametricFDRControl
+
+ # 1) Download model files into a local folder (any writable path works)
+ helaqc_model = Path("helaqc_model")
+ snapshot_download(
+     repo_id="InstaDeepAI/winnow-helaqc-model",
+     allow_patterns=["*.pkl"],
+     repo_type="model",
+     local_dir=helaqc_model,
+ )
+
+ # 2) Load calibrator
+ calibrator = ProbabilityCalibrator.load(helaqc_model)
+
+ # 3) Load your dataset (InstaNovo-style config)
+ dataset = InstaNovoDatasetLoader().load(
+     "path_to_spectrum_data.parquet",
+     "path_to_instanovo_predictions.csv",
+ )
+ dataset = filter_dataset(dataset)  # standard Winnow filtering
+
+ # 4) Predict calibrated confidences
+ calibrator.predict(dataset)  # adds dataset.metadata["calibrated_confidence"]
+
+ # 5) Optional: FDR control on calibrated confidence
+ fdr = NonParametricFDRControl()
+ fdr.fit(dataset.metadata["calibrated_confidence"])
+ cutoff = fdr.get_confidence_cutoff(0.05)  # 5% FDR cutoff
+ dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff
+ ```
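Step 5 above turns calibrated probabilities into an FDR cutoff. As a rough illustration of the idea only (this is not Winnow's `NonParametricFDRControl` implementation, whose internals are not described here): if each calibrated confidence is read as the probability that the PSM is correct, the expected FDR of an accepted set is the mean of `1 - p` over that set. The helper name `confidence_cutoff` is hypothetical.

```python
from typing import Optional, Sequence


def confidence_cutoff(confidences: Sequence[float], fdr_threshold: float) -> Optional[float]:
    """Return the lowest confidence whose accepted set meets the FDR threshold.

    Assumption: each confidence is a calibrated probability that the PSM is
    correct, so the expected number of errors among accepted PSMs is the
    sum of (1 - p). Returns None if no cutoff satisfies the threshold.
    """
    ranked = sorted(confidences, reverse=True)  # most confident first
    cutoff = None
    expected_errors = 0.0
    for i, p in enumerate(ranked):
        expected_errors += 1.0 - p
        # Only consider a cutoff once every PSM tied at this confidence is included,
        # because filtering with `>= cutoff` accepts all of them together.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1] != p
        if last_of_tie and expected_errors / (i + 1) <= fdr_threshold:
            cutoff = p
    return cutoff
```

With confidences `[0.99, 0.98, 0.95, 0.60, 0.30]` and a 5% threshold, the top three PSMs have an estimated FDR of about 2.7%, while adding the fourth raises it to 12%, so the sketch returns `0.95`.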
58
+
59
+ ### CLI
60
+ ```bash
+ # After `pip install winnow`
+ winnow predict \
+     --data-source instanovo \
+     --dataset-config-path config_with_dataset_paths.yaml \
+     --model-folder general_model_folder \
+     --method winnow \
+     --fdr-threshold 0.05 \
+     --confidence-column calibrated_confidence \
+     --output-path outputs/winnow_predictions.csv
+ ```
71
+
72
+ ---
73
+
74
+ ## Inputs and outputs
75
+ **Required columns for calibration:**
+ - Spectrum data (`*.parquet`)
+   - `spectrum_id` (string): unique spectrum identifier
+   - `sequence` (string): ground truth peptide sequence from database search (optional)
+   - `retention_time` (float): retention time (seconds)
+   - `precursor_mass` (float): mass of the precursor ion (from MS1)
+   - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum
+   - `intensity_array` (list[float]): intensity values of the MS2 spectrum
+   - `precursor_charge` (int): charge of the precursor (from MS1)
+
+ - Beam predictions (`*_beams.csv`)
+   - `spectrum_id` (string): unique spectrum identifier
+   - `sequence` (string): ground truth peptide sequence from database search (optional)
+   - `preds` (string): top prediction, untokenised sequence
+   - `preds_tokenised` (string): comma-separated tokens for the top prediction
+   - `log_probs` (float): top prediction log probability
+   - `preds_beam_k` (string): untokenised sequence for beam k (k≥0)
+   - `log_probs_beam_k` (float): log probability for beam k
+   - `token_log_probs_k` (string/list-encoded): per-token log probabilities for beam k
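Before calibration it can help to confirm that both input tables expose the columns listed above. The sketch below uses only the standard library. The helper names `missing_columns` and `raw_confidence` are hypothetical and not part of Winnow or InstaNovo. `sequence` is left out of the required sets because it is marked optional, only the non-beam-indexed prediction columns are checked, and `log_probs` is assumed to hold natural-log values.

```python
import math

# Column names taken from the lists above; `sequence` is optional and omitted.
SPECTRUM_COLUMNS = {
    "spectrum_id", "retention_time", "precursor_mass",
    "mz_array", "intensity_array", "precursor_charge",
}
PREDICTION_COLUMNS = {"spectrum_id", "preds", "preds_tokenised", "log_probs"}


def missing_columns(available, required):
    """Return the required column names that are absent from `available`, sorted."""
    return sorted(set(required) - set(available))


def raw_confidence(log_prob):
    """Convert a log probability to a probability (assumes natural log)."""
    return math.exp(log_prob)
```

For a CSV file, the available names can be read from the header row, for example via `csv.DictReader(f).fieldnames`, and passed to `missing_columns` together with `PREDICTION_COLUMNS`.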
94
+
95
+ **Output columns (added by Winnow's calibrator on `predict`):**
96
+ - `calibrated_confidence`: calibrated probability
97
+ - Optional (if requested): `psm_pep`, `psm_fdr`, `psm_qvalue`
98
+ - All input columns are retained in-place
99
+
100
+ ---
101
+
102
+ ## Training data
103
+
104
+ - This calibrator was trained on the HeLa single-shot dataset (PXD044934).
105
+ - All default features were enabled for the training of this model.
106
+ - Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams.
107
+
108
+ ---
109
+
110
+ ## Citation
111
+
112
+ If you use Winnow or this model, please cite:
113
+ TODO