ashercn97 committed on
Commit 727b04f · verified · 1 Parent(s): 47a822f

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,363 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:130899
+ - loss:MultipleNegativesRankingLoss
+ base_model: chandar-lab/NeoBERT
+ widget:
+ - source_sentence: Also, Lou Reed is tough.
+   sentences:
+   - Lou Reed is tough.
+   - The snow was so deep in the field that if you fell, you wouldn't feel it.
+   - Some organizations don't like change.
+ - source_sentence: Justice, said the scout.
+   sentences:
+   - 'At any moment, there are over 100 people guarding the president. '
+   - Our three kids did a lot of camping with us.
+   - The scout called for justice.
+ - source_sentence: More importantly, I looked accurate.
+   sentences:
+   - 'Kal was not the only one whose eyes went out of focus. '
+   - The Commission interpreted it.
+   - I looked right.
+ - source_sentence: no no they're not real hard
+   sentences:
+   - I waited eight seconds.
+   - G.M. has demonstrated it is capable of producing a first-class commercial with
+     it's Saturn line.
+   - Hard isn't a word I would use to describe them.
+ - source_sentence: but there the majority really haven't done anything with their
+     yards this neighborhood is is four years old
+   sentences:
+   - Perret inhabited Saint-Pierre during the 1930s
+   - Most of them haven't done anything, this neighborhood is four years old.
+   - Clinton stood to the side and was not in the middle of the attacks.
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+
+ # SentenceTransformer based on chandar-lab/NeoBERT
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [chandar-lab/NeoBERT](https://huggingface.co/chandar-lab/NeoBERT). It maps sentences & paragraphs to a 1536-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [chandar-lab/NeoBERT](https://huggingface.co/chandar-lab/NeoBERT) <!-- at revision 2e41a1bd984aa78d10daa96e4745303541957410 -->
+ - **Maximum Sequence Length:** None (`max_seq_length` is not set; the base model supports up to 4096 tokens)
+ - **Output Dimensionality:** 1536 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': None, 'do_lower_case': False}) with Transformer model: NeoBERT
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
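+
+ The 1536-dimensional output comes from the Pooling module concatenating the `[CLS]` token embedding with the mean of all token embeddings (768 dimensions each), since both `pooling_mode_cls_token` and `pooling_mode_mean_tokens` are enabled. A minimal sketch of this pooling step on dummy tensors, assuming token embeddings and an attention mask as produced by the Transformer module:
+
+ ```python
+ import torch
+
+ # Dummy Transformer outputs: [batch, seq_len, 768] embeddings and a [batch, seq_len] mask
+ token_embeddings = torch.randn(2, 16, 768)
+ attention_mask = torch.ones(2, 16)
+
+ # CLS pooling: the embedding of the first token
+ cls_emb = token_embeddings[:, 0]  # [batch, 768]
+
+ # Mean pooling: average over non-padding tokens
+ mask = attention_mask.unsqueeze(-1)  # [batch, seq_len, 1]
+ mean_emb = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # [batch, 768]
+
+ # Enabled pooling modes are concatenated: 768 + 768 = 1536 dimensions
+ sentence_emb = torch.cat([cls_emb, mean_emb], dim=-1)
+ print(sentence_emb.shape)  # torch.Size([2, 1536])
+ ```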
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("ashercn97/neobert-multi-nli")
+ # Run inference
+ sentences = [
+     "but there the majority really haven't done anything with their yards this neighborhood is is four years old",
+     "Most of them haven't done anything, this neighborhood is four years old.",
+     'Perret inhabited Saint-Pierre during the 1930s',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1536]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
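+
+ The same embeddings also support semantic search: encode a query and a corpus separately, then rank corpus entries by similarity. A small sketch reusing sentences from the widget examples above:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("ashercn97/neobert-multi-nli")
+
+ corpus = [
+     "The scout called for justice.",
+     "Our three kids did a lot of camping with us.",
+     "Some organizations don't like change.",
+ ]
+ query = "Justice, said the scout."
+
+ # Rank corpus entries by cosine similarity to the query
+ query_emb = model.encode([query])
+ corpus_emb = model.encode(corpus)
+ scores = model.similarity(query_emb, corpus_emb)[0]  # shape: [len(corpus)]
+ best = int(scores.argmax())
+ print(corpus[best], float(scores[best]))
+ ```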
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 130,899 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0                                                                          | sentence_1                                                                         |
+   |:--------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
+   | type    | string                                                                              | string                                                                             |
+   | details | <ul><li>min: 3 tokens</li><li>mean: 26.92 tokens</li><li>max: 197 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 14.43 tokens</li><li>max: 40 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:-----------|:-----------|
+   | <code>I can tell you, Hastings, it's making life jolly difficult for us. </code> | <code>This is making life a lot harder for us. </code> |
+   | <code>The striking thing about workers' comments after the vote was how many of them mentioned the possibility of the company shutting down its operations.</code> | <code>The striking thing about workers' comments after the vote was how many of them mentioned the possibility of their company shutting down its operations.</code> |
+   | <code>Stephen Hargarten said that screening for alcohol applies not only to the potential for interventions, but also to the patient's overall quality of care, including safety from injury due to alcohol impairment or from alcohol withdrawal during the acute phase of treatment for medical or surgical conditions.</code> | <code>Stephen Hargarten said that screening for alcohol applies to patient's overall quality of care.</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
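+
+ With this loss, each `sentence_0` is pulled toward its paired `sentence_1` while the other `sentence_1` entries in the same batch act as in-batch negatives; the `scale` of 20.0 multiplies the cosine similarities before the softmax cross-entropy. A minimal finetuning sketch under these settings (the two-row dataset is an illustrative stand-in for the real 130,899 pairs, not the actual training script):
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+ from sentence_transformers.losses import MultipleNegativesRankingLoss
+
+ # Illustrative premise/paraphrase pairs standing in for the full dataset
+ train_dataset = Dataset.from_dict({
+     "sentence_0": ["Also, Lou Reed is tough.", "Justice, said the scout."],
+     "sentence_1": ["Lou Reed is tough.", "The scout called for justice."],
+ })
+
+ # NeoBERT ships custom modeling code, hence trust_remote_code
+ model = SentenceTransformer("ashercn97/neobert-multi-nli", trust_remote_code=True)
+ loss = MultipleNegativesRankingLoss(model, scale=20.0)  # cos_sim is the default similarity
+
+ trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
+ trainer.train()
+ ```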
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 128
+ - `per_device_eval_batch_size`: 128
+ - `fp16`: True
+ - `multi_dataset_batch_sampler`: round_robin
+
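+ These non-default values map directly onto `SentenceTransformerTrainingArguments`; a sketch of the equivalent configuration (the output directory name is illustrative):
+
+ ```python
+ from sentence_transformers import SentenceTransformerTrainingArguments
+ from sentence_transformers.training_args import MultiDatasetBatchSamplers
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="neobert-multi-nli",  # hypothetical output path
+     per_device_train_batch_size=128,
+     per_device_eval_batch_size=128,
+     fp16=True,  # mixed-precision training
+     multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
+     num_train_epochs=3,  # from the full hyperparameter list below
+ )
+ ```
+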
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 128
+ - `per_device_eval_batch_size`: 128
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `tp_size`: 0
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 0.4888 | 500  | 0.385         |
+ | 0.9775 | 1000 | 0.0597        |
+ | 1.4663 | 1500 | 0.0209        |
+ | 1.9550 | 2000 | 0.0176        |
+ | 2.4438 | 2500 | 0.0089        |
+ | 2.9326 | 3000 | 0.0072        |
+
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.50.0
+ - PyTorch: 2.5.1+cu124
+ - Accelerate: 1.5.2
+ - Datasets: 3.4.1
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "architectures": [
+     "NeoBERT"
+   ],
+   "auto_map": {
+     "AutoConfig": "chandar-lab/NeoBERT--model.NeoBERTConfig",
+     "AutoModel": "chandar-lab/NeoBERT--model.NeoBERT",
+     "AutoModelForMaskedLM": "chandar-lab/NeoBERT--model.NeoBERTLMHead",
+     "AutoModelForSequenceClassification": "chandar-lab/NeoBERT--model.NeoBERTForSequenceClassification"
+   },
+   "classifier_init_range": 0.02,
+   "decoder_init_range": 0.02,
+   "dim_head": 64,
+   "embedding_init_range": 0.02,
+   "hidden_size": 768,
+   "intermediate_size": 3072,
+   "kwargs": {
+     "_commit_hash": "2e41a1bd984aa78d10daa96e4745303541957410",
+     "architectures": [
+       "NeoBERTLMHead"
+     ],
+     "attn_implementation": null,
+     "auto_map": {
+       "AutoConfig": "chandar-lab/NeoBERT--model.NeoBERTConfig",
+       "AutoModel": "chandar-lab/NeoBERT--model.NeoBERT",
+       "AutoModelForMaskedLM": "chandar-lab/NeoBERT--model.NeoBERTLMHead",
+       "AutoModelForSequenceClassification": "chandar-lab/NeoBERT--model.NeoBERTForSequenceClassification"
+     },
+     "classifier_init_range": 0.02,
+     "dim_head": 64,
+     "kwargs": {
+       "classifier_init_range": 0.02,
+       "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
+       "trust_remote_code": true
+     },
+     "model_type": "neobert",
+     "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
+     "torch_dtype": "float32",
+     "transformers_version": "4.48.2",
+     "trust_remote_code": true
+   },
+   "max_length": 4096,
+   "model_type": "neobert",
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 28,
+   "pad_token_id": 0,
+   "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
+   "torch_dtype": "float32",
+   "transformers_version": "4.50.0",
+   "trust_remote_code": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.50.0",
+     "pytorch": "2.5.1+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d238df8ffd0acdf98a707d888f84a8af47ee7e4b888910bb81d85f8f605e1acd
+ size 886680744
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": null,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 4096,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]",
+   "vocab_size": 30522
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff