Update README.md
README.md
---
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/AmiGO
metrics:
- precision
- recall
pipeline_tag: text-classification
---
# ESM2 Protein Function Caller

An Evolutionary-scale Model (ESM) for protein function prediction from amino acid sequences using the Gene Ontology (GO). Based on the ESM2 Transformer architecture, pre-trained on [UniRef50](https://www.uniprot.org/help/uniref), and fine-tuned on the [AmiGO](https://huggingface.co/datasets/andrewdalpino/AmiGO) dataset, this model predicts the GO subgraph for a particular protein sequence, giving you insight into the molecular function, biological process, and location of the activity inside the cell.

**Note**: This version only models the `cellular component` subgraph of the gene ontology.
## What are GO terms?

> "The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function."

> "GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represent a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP) and where in the cell it is located (CC)."

From [CAFA 5 Protein Function Prediction](https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data)
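Since the model's labels are nodes in this graph, it can help to see the structure in code. Below is a minimal sketch, using made-up terms and `is_a` edges rather than real GO IDs, of how a subontology fragment can be represented and how a term's ancestors are collected:

```python
# A toy fragment of the GO graph: each term maps to its parents via is_a edges.
# The terms below are illustrative placeholders, not real GO annotations.
go_parents = {
    "protein binding": ["binding"],
    "binding": ["molecular_function"],
    "molecular_function": [],  # root of the Molecular Function (MF) subontology
}

def ancestors(term: str) -> set[str]:
    """Walk is_a edges upward and collect every reachable term."""
    found = set()
    stack = [term]
    while stack:
        for parent in go_parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

# Annotating a protein with "protein binding" implies all of its ancestors too.
print(ancestors("protein binding"))  # {'binding', 'molecular_function'}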
## Code Repository

https://github.com/andrewdalpino/esm2-function-classifier

- **Embedding Dimensions**: 480
- **Attention Heads**: 20
- **Encoder Layers**: 12
- **Context Length**: 1026
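These numbers map onto standard `transformers` config fields, so you can verify them against the published checkpoint (assuming the usual `EsmConfig` attribute names):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andrewdalpino/ESM2-35M-Protein-Biological-Process")

print(config.hidden_size)              # embedding dimensions: 480
print(config.num_attention_heads)      # attention heads: 20
print(config.num_hidden_layers)        # encoder layers: 12
print(config.max_position_embeddings)  # context length: 1026
```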
## Basic Example

For a basic demonstration we can rank the GO terms for a particular sequence. For a more advanced example, see the [predict-subgraph.py](https://github.com/andrewdalpino/esm2-function-classifier/blob/master/predict-subgraph.py) source file.
```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-35M-Protein-Biological-Process"

tokenizer = EsmTokenizer.from_pretrained(model_name)

model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMM"

top_k = 10

out = tokenizer(sequence)

input_ids = out["input_ids"]
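# The remainder of the example is cut off in this revision of the card. The
# steps below are a hedged reconstruction of typical multi-label inference,
# ending at the ranking loop; they may differ from the repository's exact code.

input_ids = torch.tensor(input_ids, dtype=torch.int64).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    logits = model(input_ids).logits.squeeze(0)

# GO terms are independent binary labels, so score them with a sigmoid
# rather than a softmax over the logits.
probabilities = torch.sigmoid(logits)

probabilities, indices = torch.topk(probabilities, top_k)

terms = [model.config.id2label[index] for index in indices.tolist()]

for term, probability in zip(terms, probabilities):
    print(f"{term}: {probability.item():.4f}")
```

Sequences longer than the model's 1026-token context length can be clipped at tokenization time, e.g. `tokenizer(sequence, max_length=1026, truncation=True)`.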
>- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
>- Z. Lin, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model, 2022.
>- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
>- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.