sarahalamdari committed
Commit 332656d · verified · 1 Parent(s): d9f0045

Add files using upload-large-folder tool

README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ {}
+ ---
+
+ # Model Card for Dayhoff
+
+ In this work, we combined genome-derived protein sequences, metagenomic sequences, structure-based synthetic sequences, and MSAs to create the Dayhoff Atlas of protein data and language models.
+ We first created a large-scale natural protein dataset, GigaRef, by combining and reclustering sequences from metagenomic databases with UniRef100. With 3.3B sequences in 1.7B clusters, GigaRef is the largest open dataset of natural proteins to date.
+
+ To infuse the benefits of protein structure information into sequence space, we generated the first large-scale structure-based synthetic dataset, called BackboneRef, by sampling 240,830 backbone structures from a structure-based generative model and then using them to design a total of 46M synthetic sequences.
+ Using UniRef, GigaRef, BackboneRef, and 16M MSAs from OpenProteinSet, we then trained the Dayhoff series of PLMs, which use a hybrid state-space-model (SSM) and transformer architecture along with a mixture-of-experts (MoE) mechanism to enable the long context lengths needed to combine single sequences and MSAs at scale.
+ Dayhoff models make accurate zero-shot predictions of mutation effects, generate sequences conditioned on aligned or unaligned homologs, and generate shorter Cas9s that preserve the functional domain architecture.
+
+ Larger models, metagenomic sequences, and structure-based augmentation all increased the expression rates of unconditional generations in E. coli.
+ Finally, we generated, characterized, and released 16M synthetic sequences as DayhoffRef.
+
+ Dayhoff is described in this [preprint](preprint); if you use the code from this repository or the results, please cite the preprint.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** Kevin K. Yang, Sarah Alamdari, Alex J. Lee, Kaeli Kaymak-Loveless, Samir Char, Garyk Brixi, Carles Domingo-Enrich, Chentong Wang, Suyue Lyu, Nicolo Fusi, Neil Tenenholtz, Ava P. Amini
+ - **Model type:** Hybrid state-space-model (SSM) and transformer architecture with mixture-of-experts
+ - **License:** MIT
+
+ ### Model Sources
+
+ - **Repository:** https://github.com/microsoft/dayhoff
+
+ ## Uses
+
+ ### Downstream Use
+
+ * Protein Language Model Training: Training protein language models to generate new protein sequences, predict mutation effects, and design functional proteins.
+ * Zero-shot Prediction: Predicting the functional impact of mutations (a hedged scoring sketch follows this list).
+ * Sequence Generation: Generating new protein sequences unconditionally, or conditioned on homologs, to design proteins with desired properties.
+ * Synthetic Sequence Generation: Exploring novel protein structures and functions using synthetic sequences.
+
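+ As a minimal sketch of the zero-shot use above: one common recipe with autoregressive protein language models is to compare the model's log-likelihood of a mutant sequence against the wild type. The snippet below illustrates this with a hypothetical sequence and substitution (the exact scoring protocol used in the Dayhoff preprint may differ).
+
+ ```py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ torch.set_default_device("cuda")
+
+ model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-170m-UR90')
+ tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-170m-UR90', trust_remote_code=True)
+ model.eval()
+
+ def log_likelihood(seq: str) -> float:
+     # Score the full sequence autoregressively, framed by the BOS and EOS tokens.
+     ids = tokenizer(tokenizer.bos_token + seq + tokenizer.eos_token,
+                     return_tensors="pt", return_token_type_ids=False)["input_ids"]
+     with torch.no_grad():
+         logits = model(ids).logits
+     log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
+     return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()
+
+ wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical wild-type sequence
+ mutant = wild_type[:5] + "A" + wild_type[6:]     # hypothetical substitution at residue 6
+
+ # Higher (less negative) scores indicate sequences the model finds more likely; the
+ # difference between mutant and wild type serves as a zero-shot mutation-effect score.
+ print(log_likelihood(mutant) - log_likelihood(wild_type))
+ ```
+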
+ ## Bias, Risks, and Limitations
+
+ The [software/model] described in this repository is provided for research and development use only. The [software/model] is not intended for use in clinical decision-making or for any other clinical use, and its performance for clinical use has not been established. You bear sole responsibility for any use of this [software/model], including incorporation into any product intended for clinical use.
+
+ ## How to Get Started with the Model
+
+ Sample protein generation code:
+
+ ```py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
+
+ set_seed(0)
+ torch.set_default_device("cuda")
+
+ # Load the model weights and the character-level protein tokenizer.
+ model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-170m-UR90')
+ tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-170m-UR90', trust_remote_code=True)
+
+ # Start generation from the BOS token and sample a new sequence.
+ inputs = tokenizer(tokenizer.bos_token, return_tensors="pt", return_token_type_ids=False)
+
+ outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True)
+ sequence = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ print(sequence)
+ ```
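+
+ The same `generate` call can also be prompted with an existing fragment so that the model proposes a continuation; a minimal sketch, reusing the model and tokenizer loaded above with a hypothetical seed fragment:
+
+ ```py
+ # Hypothetical N-terminal fragment to extend; `model` and `tokenizer` come from the block above.
+ seed = "MKTAYIAKQR"
+
+ prompt = tokenizer(tokenizer.bos_token + seed, return_tensors="pt", return_token_type_ids=False)
+ outputs = model.generate(prompt['input_ids'], max_length=100, do_sample=True)
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+ ```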
+
+ For detailed instructions on package usage, please refer to the README in the model repo.
+
+ ## Evaluation
+
+ ### Results
+
+ Dayhoff models make accurate zero-shot predictions of mutation effects, generate sequences conditioned on aligned or unaligned homologs, and generate shorter Cas9s that preserve the functional domain architecture. Larger models, metagenomic sequences, and structure-based augmentation all increased the expression rates of unconditional generations in E. coli.
+
+ ## Technical Specifications
+
+ ### Compute Infrastructure
+
+ * 170M-parameter models: trained on 8 NVIDIA A100 or 8 NVIDIA H100 GPUs using Distributed Data Parallel.
+ * 3B-parameter models: trained on 176 NVIDIA H100 GPUs using Fully Sharded Data Parallel in hybrid-shard mode.
+
+ ## Citation
+
+ **BibTeX:**
+ If you use this model in your work, please cite it as follows:
+
+ <ADD INFO>
+
+ ## Model Card Authors
+
+ Samir Char, Sarah A. Alamdari
+
__init__.py ADDED
File without changes
config.json ADDED
@@ -0,0 +1,48 @@
+ {
+ "_name_or_path": "ai21labs/Jamba-v0.1",
+ "architectures": [
+ "JambaForCausalLM"
+ ],
+ "attention_dropout": 0.0,
+ "attn_layer_offset": 4,
+ "attn_layer_period": 8,
+ "auto_map": {
+ "AutoConfig": "ai21labs/Jamba-v0.1--configuration_jamba.JambaConfig",
+ "AutoModel": "ai21labs/Jamba-v0.1--modeling_jamba.JambaModel",
+ "AutoModelForCausalLM": "ai21labs/Jamba-v0.1--modeling_jamba.JambaForCausalLM",
+ "AutoModelForSequenceClassification": "ai21labs/Jamba-v0.1--model.JambaForSequenceClassification"
+ },
+ "bos_token_id": 29,
+ "eos_token_id": 27,
+ "expert_layer_offset": 1,
+ "expert_layer_period": 2,
+ "hidden_act": "silu",
+ "hidden_size": 256,
+ "initializer_range": 0.02,
+ "intermediate_size": 1024,
+ "mamba_conv_bias": true,
+ "mamba_d_conv": 4,
+ "mamba_d_state": 16,
+ "mamba_dt_rank": 16,
+ "mamba_expand": 2,
+ "mamba_proj_bias": false,
+ "max_position_embeddings": 262144,
+ "model_type": "jamba",
+ "num_attention_heads": 16,
+ "num_experts": 16,
+ "num_experts_per_tok": 2,
+ "num_hidden_layers": 24,
+ "num_key_value_heads": 8,
+ "num_logits_to_keep": 1,
+ "output_router_logits": true,
+ "pad_token_id": 30,
+ "rms_norm_eps": 1e-06,
+ "router_aux_loss_coef": 0.001,
+ "sliding_window": null,
+ "tie_word_embeddings": false,
+ "torch_dtype": "float32",
+ "transformers_version": "4.42.4",
+ "use_cache": false,
+ "use_mamba_kernels": true,
+ "vocab_size": 40
+ }
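
The configuration above follows the Jamba layout: attention layers appear every `attn_layer_period` decoder layers starting at `attn_layer_offset`, the remaining layers are Mamba (SSM) blocks, and mixture-of-experts layers follow `expert_layer_period`/`expert_layer_offset` the same way. A minimal sketch of inspecting that layout with `AutoConfig` (the repo id is the one used in the README example above; pointing at a local copy of these files works the same way):

```py
from transformers import AutoConfig

# Load the Jamba-style configuration shipped with this checkpoint.
config = AutoConfig.from_pretrained("microsoft/Dayhoff-170m-UR90")

for layer in range(config.num_hidden_layers):
    # Jamba convention: a layer is attention if it matches the attention period/offset,
    # otherwise it is a Mamba block; MoE layers follow the expert period/offset.
    is_attn = layer % config.attn_layer_period == config.attn_layer_offset
    is_moe = layer % config.expert_layer_period == config.expert_layer_offset
    print(layer, "attention" if is_attn else "mamba", "moe" if is_moe else "dense")
```
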
generation_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 29,
+ "eos_token_id": 27,
+ "pad_token_id": 30,
+ "transformers_version": "4.42.4",
+ "use_cache": false
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eec6720e94710b5229b21e1d902e0de47a323dd32b3c9f6627828f29f1b7d119
+ size 681305728
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+ "bos_token": {
+ "content": "@",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "*",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "#",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "!",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "/",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+ "added_tokens_decoder": {
+ "27": {
+ "content": "*",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "28": {
+ "content": "#",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "29": {
+ "content": "@",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "30": {
+ "content": "!",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "31": {
+ "content": "/",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "auto_map": {
+ "AutoTokenizer": [
+ "tokenizers.ProteinTokenizer",
+ null
+ ]
+ },
+ "bos_token": "@",
+ "clean_up_tokenization_spaces": true,
+ "eos_token": "*",
+ "mask_token": "#",
+ "model_max_length": 2048,
+ "pad_token": "!",
+ "sep_token": "/",
+ "tokenizer_class": "ProteinTokenizer"
+ }
tokenizers.py ADDED
@@ -0,0 +1,126 @@
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
+ from typing import List, Optional, Union
+ import os
+
+ MASK = "#"
+ MSA_PAD = "!"
+ UL_ALPHABET_PLUS = "ACDEFGHIKLMNPQRSTVWYBZXJOU-*#@!/[]{}"
+ MSA_AAS = "ACDEFGHIKLMNPQRSTVWYBZXJOU-"
+ GAP = "-"
+ START = "@"
+ STOP = "*"
+ SEP = "/"
+ END_AL = "]"
+ END_UL = "}"
+ START_AL = "["
+ START_UL = "{"
+
+ class ProteinTokenizer(PreTrainedTokenizer):
+
+     def __init__(
+         self,
+         protein_alphabet: str = UL_ALPHABET_PLUS,
+         model_max_length: int = 2048,
+         pad_token=MSA_PAD,
+         mask_token=MASK,
+         all_aas=MSA_AAS,
+         gap_token=GAP,
+         bos_token=START,
+         eos_token=STOP,
+         sep_token=SEP,
+         **kwargs
+     ):
+         """Character tokenizer for Hugging Face transformers.
+
+         model_max_length (int): Model maximum sequence length.
+         """
+         self.alphabet = list("".join(protein_alphabet))
+         self.all_aas = list("".join(all_aas))
+         self.a_to_i = {u: i for i, u in enumerate(self.alphabet)}
+         self.i_to_a = {i: u for i, u in enumerate(self.alphabet)}
+         self.gap_token = gap_token
+
+         bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+         eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+         sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
+         mask_token = AddedToken(mask_token, lstrip=False, rstrip=False) if isinstance(mask_token, str) else mask_token
+         pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+         gap_token = AddedToken(gap_token, lstrip=False, rstrip=False) if isinstance(gap_token, str) else gap_token
+
+         super().__init__(
+             pad_token=pad_token,
+             mask_token=mask_token,
+             eos_token=eos_token,
+             bos_token=bos_token,
+             sep_token=sep_token,
+             model_max_length=model_max_length,
+             **kwargs
+         )
+
+     @property
+     def vocab_size(self):
+         return len(self.alphabet)
+
+     @property
+     def gap_token_id(self):
+         return self.convert_tokens_to_ids(self.gap_token)
+
+     def get_vocab(self):
+         return self.a_to_i
+
+     def _tokenize(self, text: str) -> List[str]:
+         return list(text)
+
+     def _convert_token_to_id(self, token) -> int:
+         return self.a_to_i[token]
+
+     def _convert_id_to_token(self, index) -> str:
+         return self.i_to_a[index]
+
+     def convert_tokens_to_string(self, tokens):
+         return "".join(tokens)
+
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         result = token_ids_0
+         if token_ids_1 is not None:
+             raise NotImplementedError("This tokenizer does not support two sequences")
+         return result
+
+     def get_special_tokens_mask(
+         self,
+         token_ids_0: List[int],
+         token_ids_1: Optional[List[int]] = None,
+         already_has_special_tokens: bool = False,
+     ) -> List[int]:
+         if already_has_special_tokens:
+             return super().get_special_tokens_mask(
+                 token_ids_0=token_ids_0,
+                 token_ids_1=token_ids_1,
+                 already_has_special_tokens=True,
+             )
+
+         result = [0] * len(token_ids_0)
+         if token_ids_1 is not None:
+             raise NotImplementedError("This tokenizer does not support two sequences")
+
+         return result
+
+     def create_token_type_ids_from_sequences(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+     ) -> List[int]:
+         """
+         Identifies the type of token. 0 for the first sentence, 1 for the second sentence if it exists
+         """
+
+         result = len(token_ids_0) * [0]
+
+         if token_ids_1 is not None:
+             raise NotImplementedError("This tokenizer does not support two sequences")
+         return result
+
+     def save_pretrained(self, save_directory: Union[str, os.PathLike], **kwargs):
+         super().save_pretrained(save_directory, **kwargs)
+
+     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None):
+         return ()
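
A minimal sketch of exercising the `ProteinTokenizer` above on its own, loading it through `AutoTokenizer` as in the README example (the repo id is an assumption carried over from that example; the checkpoint resolves to this class via `auto_map`):

```py
from transformers import AutoTokenizer

# Load the character-level protein tokenizer defined above.
tok = AutoTokenizer.from_pretrained("microsoft/Dayhoff-170m-UR90", trust_remote_code=True)

# Character-level round trip with explicit BOS/EOS framing.
encoded = tok(tok.bos_token + "MKTAYIAKQR" + tok.eos_token)
print(encoded["input_ids"])  # one integer id per character

print(tok.decode(encoded["input_ids"], skip_special_tokens=True))  # "MKTAYIAKQR"
```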